Events with Dynamic Handlers, Simplified

Elgg’s event systems are based on registering callables to handle certain event names. As the codebase is still heavily procedural, most event handlers in the core and 3rd party plugins are global functions. Since I’m firmly in the static-code-is-evil camp, I’d much prefer handlers be dynamic method calls, but this introduces some complexity:

  1. Nice plugins allow their event handlers to be unregistered, but dynamic callbacks make this a bit awkward; any plugin that wanted to unregister another plugin’s handler would have to first get a reference to the other plugin’s object. Kind of messy.
  2. This requires proactively creating lots of objects—and loading lots of code—you don’t need in order to create the callbacks.

One way to work around the second issue would be to register callbacks on a proxy objects that lazy-load the real objects on the first method call. This adds complexity, could lead to typehint failures if the proxy gets accessed directly, and doesn’t solve the first issue at all. So I think the event system needs to be extended in a simple way:

We could allow callback strings of the form "service->method" where service is the name of a pre-registered (and lazy-loading) service on Elgg’s DI container. Essentially the event trigger would call $dic->service->method().

This would allow trivially (un)registering these handlers, as well as efficiently lazy-loading the objects without the complexity of proxies.

I’d also suggest never actually binding to the service. This way you could replace a whole set of handlers by replacing the service on the DI container.

Dependency Injection: Ask For What You Need

Despite the scary name, the concept is simple: Your class or function should ask for dependencies upfront. It should not reach out for them. This gives you a few powerful advantages:

1. In unit tests you can pass in mock dependencies to verify/fake behavior.

Imagine your class depends on a database. You really don’t want to require a live database to run your tests, but also you want to be able to test a variety of responses from the database without running a bunch of setup queries. What if your logic depends on the time being around 3AM? You can’t reliably test this if your object pulls time from the environment.

2. Your API doesn’t lie to users about what’s really required to use it.

Imagine if you started cooking a recipe and step 7 read, “Go buy 10 mangos now.” Imagine a Skydive class whose jump() method waited 10 seconds then executed Backpack.getInstance().getParachute.deploy(). “Umm, I’m in the air and the Skydive class didn’t tell me I’d need a backpack…”. Skydive should have required a parachute in the constructor.

3. Your API behaves more predictably.

Imagine a function getNumDays(month). If I pass in February, I usually get 28, but sometimes 29! This is because there’s a hidden dependency on the current year. This function would be useless for processing old data.

Baking the dependencies in

Note that the same principles apply to functions (fine for this exercise, but you should avoid global functions/static methods for all sorts of reasons). Consider this function for baking peanut butter cookies:

function makePBCookies(AlbertsonsPB $pb) {
  // ... 20 lines of setup code
  $egg = new LargePublixEgg();
  $sugar = new DixieSugar('1 cup');
  $chef = Yelp::find('chef');
  // create an oven
  // mix and bake for 10 min
  return $cookies;
}

From the argument list, it isn’t clear you’ll need a large Publix egg on hand. What if you don’t have a Publix in your area? What if you want less sugary cookies? Let’s refactor:

function makePBCookies(Egg $egg, Sugar $sugar, PB $pb) {
  // ... 20 lines of setup code
  $chef = Yelp::find('chef');
  // create an oven
  // mix and bake for 10 min
  return $cookies;
}

Now it’s immediately clear at the top what ingredients you’ll need; you can use any brands you want; and you can even change amounts/sizes to yield all kinds of flavors. This is really flexible, but we can make it even better:

function makePBCookies(Egg $egg, Sugar $sugar, PB $pb, Oven $oven, Chef $chef) {
  $mix = $chef->mix(array($egg, $sugar, $pb));
  return $chef->bake($oven, $mix);
}

In case it wasn’t obvious, it’s now clear we’ll need a chef and oven, and it makes good sense to outsource the oven design because this logic isn’t really specific to peanut butter cookies. A sign that our refactoring is going well is that the function body is starting to read more like a narrative.

Let’s build the ultimate cookie-making function:

function makeCookies(BakeList $list, Chef $chef, Oven $oven, Decorations $decs = null) {
  // BakeList is a composite of recipe & ingredients
  $mix = $chef->mix($list->getIngredients(), $list->getRecipe());
  $cookies = $chef->bake($oven, $mix);
  if ($decs) {
    $cookies = $chef->decorate($cookies, $decs);
  }
  return $cookies;
}

We’ve refactored into several easy-to-test components, each with a specific task. This function is also easy to test because we just need to verify how the dependencies are passed and what methods are called on them. If we wanted to go even further we might notice that cookies come out differently in Denver vs. NYC (there’s a hidden dependency on altitude).

But that’s all dependency injection is: asking for what you need and not relying on sniffing dependencies from the environment. It makes code more flexible, more reusable, and more testable. It’s awesome.

I highly recommend Miško Hevery’s 2008 talks about dependency injection and testability, which really solidified the concepts and their importance for me. [This is an improved version of an article I wrote around that time hosted elsewhere.]

Unpacking the Access Control Systems in Drupal and Elgg

Elgg’s access control system, which determines what content a user can view, is somewhat limited and very opinionated, with several use cases—access control lists, friends—baked into the core system. In hopes of making this cleaner and more powerful, I’ve been studying Drupal’s access system. (Caveat: My knowledge in this area of Drupal comes mainly from reading code, schema, docs, and two great overviews by Mike Potter and Larry Garfield, so please chime in if I run off the rails.) 

Drupal

Drupal’s system also influences update and delete permissions, but here I’m only interested in the “view” permission. Also, although Drupal has hook_node_access()—a procedural calculation of permissions for a node (like an Elgg entity) already in memory—I’m focusing on the systems that craft SQL conditions to fetch only nodes visible to the user. This is critical to get right in the SQL, because if your access control relies on code, you can never predict the number of queries required to generate a list for browsing. In this area, Drupal’s realms/grants API (hook_node_grants()is extremely powerful.

Realms and Grants in a Nutshell

At a particular time, a user exists in zero or more “realms”; more or less arbitrary labels which may be based on user attributes, roles, associations, the current system state, time…anything. Each realm has been granted (via DB rows) the permission to view individual nodes. So to query, we build up a user’s list of realms, this is baked into the query, and the DB returns nodes matching at least one realm.

E.g., at 2:30 PM today, an anonymous visitor might be in the realms (public, time_afternoon, season_winter*), whereas Mary, who logged in, might exist in the realms (public, logged_in, user_123, friendedby_345, role_developer, team_A, is_over_30, time_afternoon, season_winter). So Mary will likely see more nodes because her queries provide more opportunity to match grant rows. *Note these realms are made up examples.

Clearly this is very expressive, but Drupal (maybe for better) doesn’t provide many features out-of-the-box, so (maybe for worse) doesn’t build in many realms; the API is mostly a framework for implementing an access control system on top of added features. Contrib modules appear up to the task of providing realms based on all kinds of things (groups, taxonomies, associations with particular nodes), but it’s hard to collaboratively build an access control system, so these modules apparently don’t work well with each other and non-access modules must be careful to tap into the appropriate systems to keep nodes protected.

(Implementation oddities: The grants are done in the node_access table, which probably should’ve been called “node_grants”, especially because this table is only somewhat related to the hook called “node_access“. Less seriously—depending on your system size—each realm name (VARCHAR) is duplicated for every node/realm combination, so there’s some opportunity for normalization.)

Elgg

If you squint, Elgg’s system is a bit similar. Each entity has an “access level” (a realm), with values like “public”, “logged in”, “private”, “friends”, or values representing access control lists (a group or a subset of your friends like a Google+ circle).

That an entity can have only one realm is of course the biggest (and most painful) difference, but also the implementation is significantly complicated by some realms needing to map to different tables. E.g. Elgg has to ensure “friends” maps to rows in an entities relationship table based upon the owner of the entity, while also mapping to the ACL table.

I imagine a lot of these differences come from Drupal being old as time with a lot bigger API reboots, and because Elgg’s access system was targeted to meet the needs of features like friends and user groups, which were built-in from the beginning. It’s hard to predict which schema results in faster queries, and will depend on the use case, but Drupal queries I would guess are easier to generate and safer to alter.

Conclusions

I think in the long run Elgg would be wise to adopt a realms/grants schema, though I would probably suggest normalizing with a separate “realms” table to hold the name and other useful bits. Elgg group ACLs and friend collections would map directly into realms, but friend relationships would need to be duplicated into realms just like groups have an ACL distinct from the membership relationship. Really I think a grants table could completely replace Elgg’s “entity_relationships” table, since both tables just map one entity to others with a name.

As for Drupal, I think the docs could more clearly describe realms and grants (unless I’ve totally got this wrong). I’m less sure of the quality of the API that populates/maintains the tables; it looks like the hooks are pretty low-level ways of asking “would you like to dump some rows into node_access?” and it’s not clear how much of the table must be rebuilt or how often this happens.

GitX for Cleaner Commits

A great way to make your pull requests easier to review is to reduce each commit to a particular purpose/functional change. In practice this often means not combining multiple feature additions in a single commit, or not including whitespace changes in a commit that makes some code change.

The command-line git add is just not great for anything but very trivial changes, so I use GitX (the active fork) when I’m on OSX.

Getting to the UI is a quick gitx in terminal and command+2 to switch to the Stage view. Here you can move changes in/out of staging via drag/drop files or by clicking on individual lines in diff views. This lets you run wild and loose during development then, with a scalpel, carefully craft your changes into a logical set of cleaner commits. E.g. if you’ve altering code, unstage all your unrelated whitespace changes and put them in the last commit, or just revert them or git stash to move them to another branch.

Git: Finish features before merging them.

There’s no perfect way to develop software and use source control because projects, teams, and work environments can vary so much; what works for a small office of employees might not for a loose group of part-time contributors spread across many timezones, as many open source projects are.

Jade Rubick is not a fan of long-running feature branches in git and—if I’m reading this right—argues for merging into master frequently, not waiting for an entire feature to be implemented. This is supposed to force the team to be aware of all code changes occurring.

While lack of communication about features in development can certainly be problematic, I think this is a sledgehammer of a solution. Taking this approach to its extreme, it might make sense to have all developers work huddled together so they can say what they’re working on in real-time, or all work on one workstation. My point is that using a workflow with high costs to address a communication deficit is not so great an idea.

My big fear of this workflow is that it eases the flow of incorrect/unwise code into production. Who reviews this code? What if it takes the whole codebase in a bad direction, but no one at the moment has time to realize that? I think the benefits of feature branches/pull requests are just huge:

  1. A branch frees the developer to experiment big and take chances without forcing the rest of the team down their path. The value in some big ideas will not be apparent looking at them piecemeal. Some of this work will lead to great things, some will be tossed away, all of it will be good learning.
  2. Likewise, the PR process can catch incorrect/unwise solutions before they’re merged into the product. This is huge. Some ideas sound great but you only realize 80% into the work that they’re unwise. If that work is sitting in a PR, you just close it and it can live on as a reminder to future devs who get the same idea. If not, you now have the job of shoehorning that code out. On the codebases I work on so many features have been improved/overhauled/abandoned by the review/feedback loop that it seems absolutely crazy to bypass this process. In an async distributed team, I think the PR is basically the perfect code review tool.
  3. PRs provide a great historical and educational record of what changes are involved in providing a certain feature, what all files are involved, etc. I’ve found that reading pull requests and merge diffs to be just as illustrative as reading source code. If a feature required changes in dozens of files over three weeks, how will I ever piece together the 6 commits out of 100 that were important?
  4. Feature branches make it a lot easier to revert a feature or apply it to another branch. For life on the edge I’ve built upon versions of frameworks with experimental branches merged in. If I regret this I can always generate a revert commit to sync back up with a stable branch.

All workflows have costs/benefits, I just think that the benefits of not merging feature branches until they’re really ready are huge compared with the costs Jade described.

My hunch is there must be better ways to keep a team aware of other work being done on feature branches. E.g. Make pull requests as soon as the feature branch is created and push to it as you work. That way team members can set aside time to check in on pull requests in progress and provide feedback.

I agree with Jade that feature flags can be a great idea, but that’s mostly orthogonal to source control workflow.

 

JSMin’s classic delimma: division or RegExp literal

JSMin uses a pretty crude tokenizer-like algorithm based on assumptions of what kind of characters can precede/follow others. In the PHP port I maintain I’ve made numerous fixes to handle most syntax in the wild, but I see no fix for this:

Here’s a very boiled down version of a statement created by another minifier:

if(a)/b/.test(c)?d():e();

Since (a) follows if, we know it’s a condition, and, if truthy, the statement /b/.test(c)?d():e() will be executed.

However, JSMin has no parser and therefore no concept that (a) is an if condition; it just sees an expression. Since lots of scripts in the wild have expressions like (expression) / divisor, we’ve told JSMin to see that as division. So when JSMin arrives at )/, it incorrectly assumes / is the division operator, then incorrectly treats the following / as the beginning of a RegExp literal.

Lessons: Don’t run minified code through JSMin*, and don’t use it at all if you have access to a JavaScript/Java runtime.

*Minify already knows to skip JSMin for files ending in -min.js or .min.js.

Using BINARY in a MySQL IN Expression

With the default collations in MySQL (e.g. utf8_general_ci), strings will be matched case-insensitively:

WHERE col IN ('IT', 'Ruby') will match “it” and “ruby”.

For case-sensitive matches, each string must be preceded by the keyword BINARY:

WHERE col IN (BINARY 'IT', BINARY 'Ruby')

Caveat: I would guess that BINARY affects the matching of strings with combining characters (same character, different byte representation).

When Reasonable, Return Early

When you have a function or piece of code that must handle several different cases, I find it much better to eliminate special cases using return or throw at the beginning. (“returning early”).

if (!input.isValid()) {
    throw new InvalidInputException();
}

// handle input (20 lines)

Several advantages here:

  1. It frees our mind from worrying about those cases as we read the rest of the function. Programmers, as humans, have limited mental stacks, and if we’re reading a function that must handle several cases, each of those cases must be held in our memory until we see the code that handles them.
  2. It simplifies the remaining code, since it’s handling a smaller set of cases.
  3. It compartmentalizes the code into: “handle special cases, then handle the common case.”
  4. It keeps the code that handles special cases close to the checks for those cases.
  5. The primary piece of code stays at the root indent level.
  6. Diff views stay simple when adding/removing special conditions (it does not introduce big indenting changes across many lines).

In fact there’s (what I consider to be) an anti-pattern that involves checking for special cases early, but instead placing the meat of the function in the if block, followed by an else block to handle the special case:

if (input.isValid()) {

    // handle input (20 lines)

} else {
    throw new InvalidInputException();
}

Notice how the handling of the error can now become separated from the cause of the error, and the meat of the code now must be indented. This anti-pattern gets obviously worse the more conditions you must test for:

if (input1.isValid()) {
    if (input2.isValid()) {
        if (input3.isValid()) {
            if (input4.isValid()) {

                // handle input (20 lines)

            } else {
                throw new InvalidInputException('input4 is bad');
            }
        } else {
            throw new InvalidInputException('input3 is bad');
        }
    } else {
        throw new InvalidInputException('input2 is bad');
    }
} else {
    throw new InvalidInputException('input1 is bad');
}

This code is hard to read, hard to edit, and harder to review when changes occur to the conditions:

  1. The invalid handling code appears in the reverse order that the inputs are checked.
  2. The invalid handling code is very far from the conditions (you have to lean on the IDE to see where the braces line up).
  3. The meat of the function starts off 4 indents over.
  4. Adding/removing validity checks is a huge pain that involves re-indenting a bunch of code.
  5. Change diffs can look really messy as indent levels are changed.

Code with a lot of special cases/conditions probably needs to be refactored to separate validation from the processing of the input.

There are certainly situations where returning early is not an obvious win (hence “when reasonable”). This often happens when you have a return value that contains complex state, or some statements must be executed before every return/throw statement.

Apache mod_cache and mod_rewrite: Danger

I solved a very strange bug today where mod_cache kept returning a particular cached file for seemingly every URL on a site.

After using LogLevel debug, I finally realized that mod_cache does not store content by the request URL, but rather by the final URL after RewriteRules have been processed.

This PHP application was using a common RewriteRule to map most requests to a single script, but the full request URL was not being copied into the rewritten URL (the script was just relying on $_SERVER['REQUEST_URI'] to route the request). The result is mod_cache considered all URLs the same, and it was only by some magic of session cache-busting headers that the app managed to function at all. Internally mod_cache was just churning that single cache file over and over.

The simple fix was to make sure the full URL path ended up in the internal URL:

RewriteRule ^(.*)$  index.php/$1 [L,QSA]

Character Encoding Bug of the Day

Today I had one of those bugs that starts out looking simple and keeps going deeper and deeper. Video service Kaltura has a plugin for Moodle, that just stopped working one day (no changes on the server).

  • It’s throwing an exception because an expected element isn’t in the page.
  • Oh, the element’s supposed to be delivered by XHR from the plugin.
  • But the plugin’s code is generating correct markup…
  • Why is Moodle’s function to serialize an array into a JS function call returning null for that markup?
  • json_encode is converting the markup string to null?
  • Because json_encode is choking on invalid UTF-8.
  • Because the markup has a right single quotation encoded in Windows-1252 :(
  • And that string is coming from the Kaltura API.

So over 2 years ago someone named a video player Jim’s Test Player and over the weekend Kaltura’s API started returning that single quote in Windows-1252. We removed the quote from the name and the problem disappeared.