All Links Must Be To Web Pages

July 10th, 2008

Jeff Atwood posted a rant about the iTunes Music Store requiring iTunes to be installed in order to access it. In particular, he didn’t like how a link to something on the iTMS resulted in an error page if you didn’t have iTunes installed.

Is it so unreasonable to expect links in your browser to resolve to, oh, I don't know, web pages containing information about the thing you just clicked on? Is there anything more anti-web than demanding users install custom software to display information that could have just as easily been delivered through the browser?

It’s an exceptionally compelling argument but there’s one small flaw. If you’re not sure what it is, just drop me an email.

Creating Clean URLs With IBM WCM

July 8th, 2008

One of the challenges with many content management systems, and IBM’s is no exception, is creating short, clean URLs. As part of structuring and managing your content, the URL segments tend to build up into very long URLs. While most systems have a way to provide shorter aliases, they need to be created manually and tend to just redirect to the real URL rather than being the one canonical URL for the content.

For example, while redeveloping LiveWorks! to be served from an IBM WCM server instead of a WordPress blog (there’s really only so far you can push WordPress before it breaks), I wound up with URLs like: http://liveworks.ephox.com:10038/wps/wcm/connect/LiveWorks/lw/home/mailing-list/ instead of the desired http://liveworks.ephox.com/mailing-list/

Fortunately, Apache 2 has a few modules that can help out here, as well as allowing the non-IBM content (mailing list management, download files etc) to be served from the same domain easily. The basic idea is that clients connect to the Apache server which translates the URL for WCM and proxies it through. When the content is returned, the Apache server modifies the URLs so that the links go to our nice URLs instead of the ugly long ones.

To get started we need to load the modules that we’re going to use:

LoadModule proxy_module         /usr/lib/apache2-prefork/mod_proxy.so
LoadModule proxy_http_module    /usr/lib/apache2-prefork/mod_proxy_http.so
LoadModule ext_filter_module    /usr/lib/apache2-prefork/mod_ext_filter.so

Since we’re using the mod_proxy module, the next thing we need to do is make sure we don’t have an open proxy server as that has some rather bad security consequences for our server and the internet as a whole:

ProxyRequests Off

Now we enable a specific pass through to the WCM server, adding in all that extra URL cruft that it requires:

ProxyPass / http://localhost:10038/wps/wcm/connect/LiveWorks/lw/

So now the URL “http://liveworks.ephox.com/hints-tips/article-name/” will go to “http://liveworks.ephox.com:10038/wps/wcm/connect/LiveWorks/lw/hints-tips/article-name/” and display the right content. We’re not done yet though. Our mailing-list page from the original example is meant to be at the root level, but all content in WCM has to be in a site area, so we’ve had to add a “home” site area to hold it. That means our pretty URL is currently http://liveworks.ephox.com/home/mailing-list/

We’ll add a specific ProxyPass directive for the mailing-list page so we get:

ProxyPass /mailing-list http://localhost:10038/wps/wcm/connect/LiveWorks/lw/home/mailing-list
ProxyPass / http://localhost:10038/wps/wcm/connect/LiveWorks/lw/

Now our mailing list URL is right, but as soon as we click a link we get the ugly URLs back! This is where we need Apache to rewrite the URLs for us:

ExtFilterDefine change-urls mode=output intype=text/html \
        cmd="/usr/bin/sed -e s:/wps/wcm/connect/LiveWorks/lw::g"
ExtFilterDefine change-home-urls mode=output intype=text/html \
        cmd="/usr/bin/sed -e s:/wps/wcm/connect/LiveWorks/lw/home::g"
<Location />
        SetOutputFilter change-home-urls;change-urls
</Location>

Wow that’s ugly - both in the way the configuration looks and the way it works. Hopefully some Apache gurus will be able to suggest a better way of going about this. It works by adding two output filters, both of which run ‘sed’ over the content before it’s returned to the client. The change-home-urls filter changes all the URLs to documents in our home site area so they appear as if they weren’t in a site area and the second one changes any other links to get rid of the usual URL cruft at the start. Now we can happily click links and everything works nicely.
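The order of the two filters matters, and a small Java sketch shows why. It only mimics the two sed substitutions with plain string replacement; the class and method names are made up for illustration and nothing here is part of Apache or WCM.

```java
class UrlFilterSketch {
    // Prefixes copied from the two sed commands in the configuration above.
    static final String HOME_PREFIX = "/wps/wcm/connect/LiveWorks/lw/home";
    static final String SITE_PREFIX = "/wps/wcm/connect/LiveWorks/lw";

    // Same order as the config: change-home-urls first, then change-urls.
    static String rewrite(String html) {
        return html.replace(HOME_PREFIX, "").replace(SITE_PREFIX, "");
    }

    // The same two substitutions in the opposite order, for comparison.
    static String rewriteWrongOrder(String html) {
        return html.replace(SITE_PREFIX, "").replace(HOME_PREFIX, "");
    }

    public static void main(String[] args) {
        String link = "<a href=\"/wps/wcm/connect/LiveWorks/lw/home/mailing-list/\">Mailing list</a>";
        System.out.println(rewrite(link));           // href becomes /mailing-list/
        System.out.println(rewriteWrongOrder(link)); // href becomes /home/mailing-list/
    }
}
```

If the shorter prefix is stripped first, the longer one never matches and the “home” segment leaks back into the link, which is why change-home-urls has to come first in SetOutputFilter.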

One slight oddity I’ve found in this setup is that it actually works correctly even if the URL rewriting doesn’t pick up every URL. For example, the URLs to components like stylesheets and images don’t include the site name (the lw part of the URL), so they don’t get rewritten. Somehow they still work, though, and it would be reasonably simple to devise a filter if they ever start causing problems, so we’ll just ignore them for now.

Finally, what about those sections of the site that we want Apache to serve directly? The ProxyPass config has an option specifically to prevent URL patterns from being proxied:

ProxyPass /downloads !

Put that before the other ProxyPass configuration items and any URLs that start with /downloads will be served directly by Apache.
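The reason the order matters is that ProxyPass rules are checked in configuration order and the first match wins. The following Java sketch is a deliberately simplified model of that behaviour (plain prefix matching, nothing more), using the three rules from this post; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ProxyPassSketch {
    // Rules in configuration order; "!" marks a path Apache serves itself.
    static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put("/downloads", "!");
        RULES.put("/mailing-list", "http://localhost:10038/wps/wcm/connect/LiveWorks/lw/home/mailing-list");
        RULES.put("/", "http://localhost:10038/wps/wcm/connect/LiveWorks/lw/");
    }

    // Returns the backend URL, or null when the request is not proxied.
    static String resolve(String path) {
        for (Map.Entry<String, String> rule : RULES.entrySet()) {
            if (path.startsWith(rule.getKey())) {
                if (rule.getValue().equals("!")) {
                    return null; // excluded from proxying
                }
                // Swap the matched prefix for the backend prefix.
                return rule.getValue() + path.substring(rule.getKey().length());
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(resolve("/downloads/file.zip"));
        System.out.println(resolve("/mailing-list/"));
        System.out.println(resolve("/hints-tips/article-name/"));
    }
}
```

If the catch-all / rule came first in this model, it would match every request and the two more specific rules would never be reached.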

Some caveats:

  1. This only handles unauthenticated users accessing the site. It breaks if you try to access the site through Apache from a browser that’s logged in to the Portal server, because IBM Portal tries to redirect you to /wps/wcm/myconnect/ and our filters don’t handle that at all. You could switch it so that only authenticated users can access the site by changing /connect/ to /myconnect/, but there isn’t a clean way to make it available to both at once.
  2. By adding two output filters that run sed, every time a request is made for an HTML page, two instances of sed are run to process the content. For small sites that probably won’t matter, but if your server is under load it’s very likely to be a major bottleneck.
  3. Since the output filter doesn’t parse the HTML, it will rewrite the URLs anywhere it sees them, including in body text. If you write /wps/wcm/connect/LiveWorks/lw/ in your content it will be converted to just /, although any variant like /wps/wcm/connect/LibraryName/SiteName/ is left alone and displays just fine.

For the Apache gurus out there, I’d love to hear some options on how to avoid the use of separate sed processes all the time. I tried the mod_proxy_html module but had trouble getting it to compile on SUSE 10. Plus I rather like the simplicity of the regex instead of attempting to actually parse the HTML given how unique the URL strings are.

CMS and Mac

July 1st, 2008

Some time ago now, James Robertson blogged about the poor state of Mac support in CMS products. Quite rightly he identified the WYSIWYG editor as the most common problem area which of course got my attention. It’s over six years ago now that Ephox switched over to Java from ActiveX to get support for Mac and it’s probably the smartest thing we’ve ever done. Not because we have vast numbers of Mac users, but because it only takes one Mac user to sink a deal.

It’s taken me so long to post because just talking about your Mac support has no credibility, so I wanted to show copy and paste on Mac - the precise task that James found so many problems with. So I present for your entertainment, copy and paste from Word on a Mac, the 30 second demo, complete with cheesy music. Naturally in QuickTime with iPhone optimized versions built in.

I had wanted to go over the top and do it all in the style of an old silent movie but there’s only so much time I can justify on this…

Variable Declarations

June 20th, 2008

Jeff Atwood has discovered the implicit variable type declaration in C#:

BufferedReader br = new BufferedReader (new FileReader(name));

Who came up with this stuff?

Is there really any doubt what type of the variable br is? Does it help anyone, ever, to have another BufferedReader on the front of that line? This has bothered me for years, but it was an itch I just couldn't scratch. Until now.

Actually, there is a question about the type of br - it could be a BufferedReader or it could be any supertype of BufferedReader (a superclass or an interface it implements), and sometimes that distinction is important. The most common example of this is in the collections API:

Map data = new HashMap();

Why not just call it a Map? Well let’s say you later want the map to be sorted, then it simply becomes:

Map data = new TreeMap();

And none of your other code needs to change. You’re guaranteed to not be using any HashMap specific methods because the compiler won’t let you and it’s quite clear that the intent is to work with a Map instead of specifically a HashMap. Saving a few extra characters really doesn’t give any benefit for either reading or writing.
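As a self-contained sketch of that point (the class, method and key names are invented for illustration), here is a method written against Map that works unchanged with either implementation:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class MapDeclarationSketch {
    // Depends only on the Map interface, so it accepts either implementation.
    static String describe(Map<String, Integer> data) {
        data.put("pears", 3);
        data.put("apples", 5);
        return data.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> unsorted = new HashMap<>();
        Map<String, Integer> sorted = new TreeMap<>(); // the only line that changes

        System.out.println(describe(unsorted)); // iteration order is unspecified
        System.out.println(describe(sorted));   // prints {apples=5, pears=3}
    }
}
```

Only the right-hand side of the declaration knows which implementation is in use; everything else sees a Map.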

What I find most fascinating about this, is that the example touches on the most verbose, confusing and frustrating (yet powerful) part of writing in Java - the IO library, and then complains about the type system. Here’s the real problem in that line of code:

new BufferedReader(new FileReader(name));

Do you have any idea how many people forget to wrap the FileReader in a BufferedReader and destroy the performance of IO (and then blame Java)? Why isn’t that something simple like:

Reader reader = new File(name).read();

Call the method whatever you want, and there’s a bunch of variants on this, but just provide a single method that does what the programmer almost always wants to do. If unbuffered, serial access to a file is ever desirable (when???) then provide an additional method that takes a boolean to let you turn off buffering, but just make it simple.
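For what it’s worth, a helper along these lines is easy to sketch. The version below is an illustration only: the ReadSketch class and its open and firstLine methods are hypothetical names, and it leans on Files.newBufferedReader, which was added to the standard library in Java 7, well after this post was written.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class ReadSketch {
    // Hypothetical helper: one call that always hands back a buffered reader.
    static BufferedReader open(String name) throws IOException {
        return Files.newBufferedReader(Paths.get(name), StandardCharsets.UTF_8);
    }

    // Example use: read the first line and close the file afterwards.
    static String firstLine(String name) throws IOException {
        try (BufferedReader reader = open(name)) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass a file name on the command line to try it out.
        if (args.length > 0) {
            System.out.println(firstLine(args[0]));
        }
    }
}
```

The caller never sees a FileReader at all, so there is nothing to forget to wrap.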

Why people are obsessed with type systems as the harbinger of productivity is just beyond me - it’s the design of the libraries, stupid! That’s why PHP feels so messy to write. That’s why Java feels so verbose to write. That’s why Ruby feels so easy to write - it provides so many of these really handy shortcuts. No amount of syntactic sugar or changes to the type system is going to change that, because most of your code is dealing with the libraries, not the syntax of the language.

Now That’s Fast

June 18th, 2008

I just got home from a very entertaining evening with some folk from the Web Content 2008 conference watching, or rather largely ignoring, an overall boring game of basketball between two teams I didn’t know from a bar of soap (for the record, the Celtics won and were premiers or something). Anyway, I found in my email an entire conversation within Ephox around this article on CMS Wire about the talk I gave today. It’s actually a very good summary of what I said and I hadn’t realized there was anyone from CMS Wire even at the conference (Rachelle, please do say hello tomorrow if you get this, I don’t know what you look like).

It’s simply amazing how fast news travels these days. Within 12 hours, a speech I’d given in Chicago had been written about by CMS Wire, noticed by Google, which sent an alert to our engineering manager in Australia, and then drawn replies from a couple of people within our US office, all while I was out at the pub drinking, developing strategic business relationships (Seth, you’re going to write lots more good stuff about Ephox now, right?). That’s really very cool.