Minify update

Minify 2.1.4 is approaching release and will have several long-awaited features and hopefully easier configuration.

Looking towards version 2.2/3(?), I recently committed the beginnings of a complete refactoring of the Minify API. The goal is to have a more flexible and extensible design that can include plugins like LESS and maybe handle @imports on the server-side—I’m still mulling over how radically a plugin should be able to alter the overall processing of the request. A key to doing this is splitting plugins into at least two stages and allowing the first to influence the cache id, and possibly add more sources/processors. By asserting that the request’s cache id is stable after the first stage, you can serve requests from the cache most of the time without running the second stage (JSMin, calls to Closure Compiler API, etc.).

Another goal was to eliminate static/global state from all components and have each component independently testable, meaning refactoring out access to global state like request/$_SERVER vars, file access from controllers, etc. A bunch of new classes and interfaces (read: files) could mean a performance hit, but we will also add an internal URL cache so that most requests will not load the system at all. At most once every N seconds the full stack will be loaded in order to make sure the cache is up-to-date with file mtimes, etc., but most requests will hit a very minimal front controller that simply includes a pre-generated PHP script (sending the necessary headers and output) depending on query string and Accept-Encoding. In theory this should be absurdly fast (for PHP output anyway), but of course sticking a reverse proxy in front is always a better idea if you can.

There’s a lot of work to be done. I thought of the URL cache a little late so some of my earlier design decisions were based on performance, and they need to be revisited.

Minify 2.1.3 released

This is how I spend my time off? I squashed a few more bugs and got Minify 2.1.3 finally out the door. It fixes a few HTTP bugs, including working around a Safari/webkit bug that was preventing efficient cache usage. There’s also code to make it far easier to diagnose users’ URI rewriting problems when they come up (caused a lot of past headaches).

I updated mrclay.org to use it, so you can check out Minify’s HTTP handling using Mark Nottingham’s new Resource Expert Droid:

RED is a robot that checks HTTP resources to see how they’ll behave, pointing out common problems and suggesting improvements. Although it is not a HTTP conformance tester, it can find a number of HTTP-related issues.

Minify getting out there

Interest in Minify seems to be picking up:

Any ideas on how to make it better?

Minifying Javascript and CSS on mrclay.org

Update: Please read the new version of this article. It covers Minify 2.1, which is much easier to use.

Minify v2 is coming along, but it’s time to start getting some real-world testing, so last night I started serving this site’s Javascript and CSS (at least the 6 files in my WordPress templates) via a recent Minify snapshot.

As you can see below, I was serving 67K over 6 requests and was using some packed Javascript, which has a client-side decompression overhead.

fiddler1.png

Using Minify, this is down to 2 requests, 28K (58% reduction), and I’m no longer using any packed Javascript:

fiddler2.png

Getting it working

  1. Exported Minify from svn (only the /lib tree is really needed).
  2. Placed the contents of /lib in my PHP include path.
  3. Determined where I wanted to store cache files (server-side caching is a must.)
  4. Gathered a list of the JS/CSS files I wanted to serve.
  5. Created “min.php” in the doc root:
    // load Minify
    require_once 'Minify.php';
    
    // setup caching
    Minify::useServerCache(realpath("{$_SERVER['DOCUMENT_ROOT']}/../tmp"));
    
    // controller options
    $options = array(
    	'groups' => array(
    		'js' => array(
    			'//wp-content/chili/jquery-1.2.3.min.js'
    			,'//wp-content/chili/chili-1.8b.js'
    			,'//wp-content/chili/recipes.js'
    			,'//js/email.js'
    		)
    		,'css' => array(
    			'//wp-content/chili/recipes.css'
    			,'//wp-content/themes/orangesky/style.css'
    		)
    	)
    );
    
    // serve it!
    Minify::serve('Groups', $options);

    (note: The double solidi at the beginning of the filenames are shortcuts for $_SERVER['DOCUMENT_ROOT'].)

  6. In HTML, replaced the 4 script elements with one:
    <script type="text/javascript" src="/min/js"></script>

    (note: Why not “min.php/js”? Since I use MultiViews, I can request min.php by just “min”.)

  7. and replaced the 2 stylesheet links with one:
    <link rel="stylesheet" href="/min/css" type="text/css" media="screen" />

At this point Minify was doing its job, but there was a big problem: My theme’s CSS uses relative URIs to reference images. Thankfully Minify’s CSS minifier can rewrite these, but I needed to specify that option just for style.css.

I did that by giving a Minify_Source object in place of the filename:

// load Minify_Source
require_once 'Minify/Source.php';

// new controller options
$options = array(
	'groups' => array(
		'js' => array(
			'//wp-content/chili/jquery-1.2.3.min.js'
			,'//wp-content/chili/chili-1.8b.js'
			,'//wp-content/chili/recipes.js'
			,'//js/email.js'
		)
		,'css' => array(
			'//wp-content/chili/recipes.css'

			// style.css has some relative URIs we'll need to fix since
			// it will be served from a different URL
			,new Minify_Source(array(
				'filepath' => '//wp-content/themes/orangesky/style.css'
				,'minifyOptions' => array(
					'prependRelativePath' => '../wp-content/themes/orangesky/'
				)
			))
		)
	)
);

Now, during the minification of style.css, Minify prepends all relative URIs with ../wp-content/themes/orangesky/, which fixes all the image links.

What’s next

This is fine for now, but there’s one more step we can do: send far off Expires headers with our JS/CSS. This is tricky because whenever a change is made to a source file, the URL used to call it must change in order to force the browser to download the new version. As of this morning, Minify has an elegant way to handle this, but I’ll tackle this in a later post.

Update: Please read the new version of this article. It covers Minify 2.1, which is much easier to use.

New Minify version in the works

Back in August I came across Ryan Grove‘s great Minify project, which is essentially a PHP client-optimizing file server for Javascript and CSS. I was in the design stage of a project with similar goals so I decided to try merging the two projects. Ryan thought the result had potential to be the base of the next major release, and since he hadn’t had much time lately to devote to Minify, he made me co-owner of the project.

Last week I finally got around to committing the code, and since then I’ve been fleshing it out, adding more features, docs, and tests, and updating the project pages a bit. The major difference is that the server is now built from separate utility classes, each lazy-loaded as needed. Some new features:

  • separate front controller makes it easy to change how HTTP requests are handled
  • file caching uses the very mature Cache_Lite
  • supports conditional GET and far-future Expires models of client caching
  • serve content from any source, not just files
  • new “minifiers” (content compressors) for CSS and HTML with unit tests

At this point to see the new Minify in action you’ll need a Subversion client, like the Windows shell extension TortoiseSVN or the Eclipse plugin Subclipse. You can checkout a read-only working copy at http://minify.googlecode.com/svn/trunk/ and the README has easy setup instructions.

JSMin’s classic delimma: division or RegExp literal

JSMin uses a pretty crude tokenizer-like algorithm based on assumptions of what kind of characters can precede/follow others. In the PHP port I maintain I’ve made numerous fixes to handle most syntax in the wild, but I see no fix for this:

Here’s a very boiled down version of a statement created by another minifier:

if(a)/b/.test(c)?d():e();

Since (a) follows if, we know it’s a condition, and, if truthy, the statement /b/.test(c)?d():e() will be executed.

However, JSMin has no parser and therefore no concept that (a) is an if condition; it just sees an expression. Since lots of scripts in the wild have expressions like (expression) / divisor, we’ve told JSMin to see that as division. So when JSMin arrives at )/, it incorrectly assumes / is the division operator, then incorrectly treats the following / as the beginning of a RegExp literal.

Lessons: Don’t run minified code through JSMin*, and don’t use it at all if you have access to a JavaScript/Java runtime.

*Minify already knows to skip JSMin for files ending in -min.js or .min.js.

In Support of Bloated, Heavyweight IDEs

I’ve done plenty of programming in bare-bones text editors of all kinds over the years. Free/open editors were once pretty bad and a lot of capable commercial ones have been expensive. Today it’s still handy to pop a change into Github’s web editor or nano. Frankly, though, I’m unconvinced by arguments suggesting I use text editors that don’t really understand the codeI believe that, independent of your skill level, you’ll produce better code and faster by using the most powerful IDE you can get your hands on.

To convince you of this, I’ll try to show how each of the following list of features, in isolation, is good for your productivity. Then it should follow that each one you work without will be lowering it. Also it’s important to note that leaving in place or producing bugs that must be fixed later, or that create external costs, reduces your real productivity, and “you” in the list below could also mean you six months from now or another developer. The list is not in any particular order, just numbered for reference.

  1. Syntax highlighting saves you from finding and fixing typos after compilation failures. In a language where a script file may be conditionally executed, like PHP, you may leave a bug that will have to be dug up by someone else after costing end users a lot of time. In rarer cases the code may compile but not do what you expected, costing even more time. SH also makes the many contexts available in code (strings, functions, vars, comments, etc.) significantly easier to see when scanning/scrolling.
  2. Having a background task scan the code can help catch errors that simple syntax highlighting cannot, since most highlighters are designed to expect valid syntax and may not show problems.
  3. Highlighting matching braces/parenthesis eases the writing and reading of code and expressions.
  4. IDEs can show the opening/closing lines of blocks that appear offscreen without you needing to scroll. Although long blocks/function bodies can be a signal to refactor, this can aid you in working on existing code like this.
  5. Highlighting the use of unknown variables/functions/methods can show false positives for problems, but more often signals a bug that’s hidden from sight: E.g. a variable declared above has been removed; the type of a variable is not what is expected; a library upgrade has removed a method, or a piece of code has been transplanted from another context without its dependencies. Missing these problems has a big future cost as these may not always cause compile or even runtime errors.
  6. Highlighting an unused variable warns you that it isn’t being used how it was probably intended. It may uncover a logic bug or mean you can safely remove its declaration, preventing you from later having to wonder what it’s for.
  7. Highlighting the violation of type hints saves you from having to find those problems at compile or run-time.
  8. Auto-completing file paths and highlighting unresolvable ones saves you from time-consumingly debugging 404 errors.
  9. Background scanning other files for problems (applying all the above features to unopened project files) allows you to quickly see and fix bugs that you/others left. Simply opening an existing codebase in a more capable IDE can reveal thousands of certain/potential code problems. If you’re responsible for that code, you’ve potentially saved an enormous amount of time: End users hitting bugs, reporting them, you reading, investigating, fixing, typing summaries, etc. etc. etc. This feature is like having a whole team of programmers scouring your codebase for you; a big productivity boost.
  10. Understanding multiple language contexts can help a great deal when you’re forced to work in files with different contexts embedded within each other.
  11. Parameter/documentation tooltips eliminate the need to look up function purpose, signatures, and return types. While you should memorize commonly used functions, a significant amount of programming involves using new/unfamiliar libraries. Leaving your context to look up docs imposes costs in time and concentration. Sometimes that cost yields later benefits, but often you’ve just forgotten the order of a few parameters.
  12. Jumping to the declaration of a function/variable saves you from having to search for it.
  13. Find usages in an IDE that comprehends code allows you to quickly understand where and how a variable/function is used (or mentioned in comments!) in a codebase with very little error.
  14. Rename refactoring can carefully change an identifier in your code (and optionally filenames and comments) across an entire project of files. This can also apply to CSS; when renaming a class/id, the IDE may offer to replace its usages elsewhere in CSS and HTML markup. The obvious benefits are time savings and reduction in the errors you might make using more simple string/regular expression replacements, but there are other gains: When the cost of changing a name reduces to almost nothing, you will be more inclined to improve names when needed. Better names can reduce the time needed to understand the code and how it should be used, and to recognize when it’s not being used well.
  15. Comprehension of variable/expression type allows the IDE to offer intelligent autocompletion options, reducing your time spent typing, fixing typing errors, and looking up property/method names on classes. But more than saving time, when an expected autocomplete option doesn’t appear, it can let you know that your variable/expression is not of the type that you think it is.
  16. IDEs can automatically suggest variables for function arguments based on type, so if you’re calling a function that needs a Foo and a Bar, the IDE can suggest your local vars of those types. This eliminates the need to remember parameter order or the exact names of your vars. Your role briefly becomes to check the work of the IDE, which is almost always correct. In strongly typed languages like Java, this can be a great boost; I found Java development in NetBeans to be an eye-opening experience to how helpful a good IDE can be.
  17. IDEs can grok Javadoc-style comments, auto-generate them based on code, and highlight discrepancies between the comments and the code. This reduces the time you spend documenting, improves your accuracy when documenting, and can highlight problems where someone has changed the signature/return type of a function, or has documented it incorrectly. IDEs can add support for various libraries so that that code can be understood by the IDE without being in the project.
  18. IDEs can maintain local histories of project files (like committing each change in git) so you can easily revert recent changes (or bring back files accidentally deleted!) or better understand the overall impact of your recent changes.
  19. IDEs can integrate with source control so you can see changes made that aren’t committed. E.g. a color might appear in the scroll bar next to uncommitted changes. You could click to jump to that change, mouse over to get a tooltip of the original code and choose to revert if needed. This could give you a good idea of changes before you switch focus to your source control tools to commit them. Of course being able to perform source control operations inside the IDE saves time, too.
  20. IDEs can maintain a local cache of code on a remote server, making it snappier to work on, reducing the time you’d spend switching to a separate SFTP app, and allowing you to adjust the code outside the IDE. The IDE could monitor local and remote changes and allow merging between the versions as necessary.
  21. IDEs can help you maintain code style standards in different projects, and allow you to instantly restyle a selection or file(s) according to those standards. When contributing to open source projects, this can save you from having to go back and restyle code after having your change rejected.
  22. Integrated debugging offers a huge productivity win. Inline debugging statements are no replacement for the ability to carefully step though execution with access to all variable contents/types and the full call stack. In some cases bugs that are practically impossible to find without debugging can be found in a few minutes with one.
  23. Integrated unit test running also makes for much less context switching when using test-driven development.

This list is obviously not exhaustive, and is geared towards the IDEs that I’m most familiar with, but the kernel is that environments that truly understand the language and your codebase as a whole can give you some powerful advantages over those that don’t. A fair argument is that lightweight editors with few bells and whistles “stay out of your way” and can be more responsive on large codebases–which is true–but you can’t ignore that there’s a large real productivity cost incurred in doing without these features.

For PHP users, PhpStorm* includes almost everything in the list, with Netbeans coming a close second. With my limited experience Eclipse PDT was great for local projects, but I’ve only seen basic syntax highlighting working in the Remote System Explorer. All three also fairly well understand Javascript, CSS, HTML, and to some extent basic SQL and XML DTDs.

*Full disclosure: PhpStorm granted me a copy for my work on Minify, but I requested it, and my ravings about it and other IDEs are all unsolicited.

You Probably Don’t Need ETag

(updated 3/4 to include the “Serving from clusters” case)

As I see more server scripts implementing conditional GET (a good thing), I also see the tendency to use a hash of the content for the ETag header value. While this doesn’t break anything, this often needlessly reduces performance of the system.

ETag is often misunderstood to function as a cache key. I.e., if two URLs give the same ETag, the browser could use the same cache entry for both. This is not the case. ETag is a cache key for a given URL. Think of the cache key as (URL + ETag). Both must match for the client to be able to create conditional GET requests.

What follows is that, if you have a unique URL and can send a Last-Modified header (e.g. based on mtime), you don’t need ETag at all. The older HTTP/1.0 Last-Modified/If-Modified-Since mechanism works just fine for implementing conditional GETs and will save you a bit of bandwidth. Opening and hashing content to create or validate an ETag is just a waste of resources and bandwidth.

When you actually need ETag

There are only a few situations where Last-Modified won’t suffice.

Multiple versions of a single URL

Let’s say a page outputs different content for logged in users, and you want to allow conditional GETs for each version. In this case, ETag needs to change with auth status, and, in fact, you should assume different users might share a browser, so you’d want to embed something user-specific in the ETag as well. E.g., ETag = mtime + userId.

In the case above, make sure to mark private pages with “private” in the Cache-Control header, so any user-specific content will not be kept in shared proxy caches.

No modification time available

If there’s no way to get (or guess) a Last-Modified time, you’ll have use ETag if you want to allow conditional GETs at all. You can generate it by hashing the content (or using any function that changes when the content changes).

Serving from clusters

If you serve files from multiple servers, it’s possible that file timestamps could differ, causing Last-Modified dates sent out to shift and needless 200 responses when a client hits a different server. Basically, if you can’t trust your mtime to stay synched (I don’t know how often this is an issue), it may be better to place a hash of the content in an ETag.

In any case using ETag, when handling a conditional GET request (which may contain multiple ETag values in the If-None-Match header), it’s not sufficient to return the 304 status code; you must include the particular ETag for the content you want used. Most software I’ve seen at least gets this right.

I got this wrong, too.

While writing this article I realized my own PHP conditional GET class used in Minify, has no way to disable unnecessary ETags (when the last modified time is known).

Where’s the code?

Google’s free open source project hosting has been awesome for Minify, so when I was looking around for Subversion hosting for my personal code, I figured why not host it there? So here’s a bunch of my PHP and Javascript code. Hopefully some of it will be useful to people. A few PHP highlights:

  • HashUtils implements password hashing with a random salt, preventing the use of rainbow table cracks. It can also sign and verify the signature of string content.
  • StringDebug makes debugging strings with whitespace/non-printable/UTF-8 characters much less painful.
  • CookieStorage saves/fetches tamper-proof and optionally encrypted strings in cookies.
  • TimeZone simplifies the handling of date/times between timezones using an API you already know: strtotime() and date().
  • Utf8String is an immutable UTF-8 string class that aims to simplify the API of the phputf8 library and make behind-the-scene optimizations like using native functions whenever possible. Mostly a proof-of-concept, but it works.

I’ll get the examples of these online at some point, but if you export /trunk/php, all the lowercase-named files are demos/tests. There’s also an “oldies” branch (not necessarily goodies).

Hopefully this makes the political jibba jabba more forgivable.