Integrating The Editor

July 31st, 2007

The Ephox Weblog pointed me to James Robertson's comments on Seth Gottlieb's article, "Homebrew CMS", all of which is good reading. The key part for me is:

Editing environment. If the authors can't easily and efficiently get their words onto the site, you're toast. There's a huge amount that goes into a good editing tool, including table support, CSS, images, spell checking, and clean cut-and-pasting from Word. Even if you chose to use one of the commercial editing tools (a good idea!), it still needs to be tightly integrated into the CMS.

Integrating the editor well into the CMS is key to making things easy for users. The vast majority of the time, users will be creating content in the editor, so your choice of editor is really important. However, at fairly regular intervals during content creation users will want to interact with the system - to insert an image from the repository (or upload a new one), link to existing content or apply a standard page element. You need to make sure that those actions are available from within the editor instead of requiring the user to go over to some other part of the interface.

You should also take some time to integrate the styles for your site with the editor. Each editor has a different way of exposing the styles to the user and you need to make sure that applying the right style is simple and intuitive for users. You also want to think about what menu and toolbar items are appropriate for the content your users are creating - the interface may need to be changed from the default or even customized on a per user or role basis. Taking the time to properly integrate the editor can be the difference between a successful deployment and an expensive system that goes unused.

There are a lot of tips on ways to improve the editing experience over at LiveWorks! It's centered around EditLive! but the ideas apply to most editors.

APP For Scalability

July 27th, 2007

One of the common first steps for scaling up an application is to move the database off to a dedicated server - often followed by having multiple application server instances to handle requests. With a standard SQL database that's pretty straightforward. With data stored in Amazon S3 it's not always as simple.

S3 obviously provides a network API, but it doesn't necessarily provide all the functionality you need from your data layer. For instance, if you need to update search indexes, you need a central server to track the changes and update the indexes. You may also need synchronization beyond what S3 provides. Whatever the reason, you need to provide a server to handle those data layer tasks and then pass the storage off to S3.
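As a rough sketch of what such a data layer server might do, here is a small Python example. The `DataLayer` and `InMemoryBlobStore` names are made up for illustration, and the in-memory store is only a stand-in for S3, not a real S3 client. The point is just that every write goes through one place, which keeps a (very naive) word index in step with the stored documents.

```python
class InMemoryBlobStore:
    """Hypothetical stand-in for S3: stores and retrieves objects by key."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects.get(key)


class DataLayer:
    """Central point for writes: updates the search index, then stores the data."""

    def __init__(self, blob_store):
        self.blob_store = blob_store
        self.index = {}  # word -> set of keys whose text contains that word

    def put(self, key, text):
        # Remove index entries for the previous version of this document.
        old_text = self.blob_store.get(key)
        if old_text is not None:
            for word in set(old_text.lower().split()):
                self.index.get(word, set()).discard(key)
        # Hand the storage itself off to the blob store.
        self.blob_store.put(key, text)
        # Index the new text.
        for word in set(text.lower().split()):
            self.index.setdefault(word, set()).add(key)

    def get(self, key):
        return self.blob_store.get(key)

    def search(self, word):
        return sorted(self.index.get(word.lower(), set()))
```

A real version would need locking or some other form of synchronization around `put`, which is exactly the kind of thing the central server exists to provide.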

So where does APP (Atom Publishing Protocol) fit in? It occurs to me that for a large number of applications it's probably a very good interface to use. Firstly, there are some libraries being created that make it easy to get up and running. More importantly, because it's a standard, other applications can also connect to your data store.

Of course it all depends on exactly what data your application deals with, and I still haven't worked out the best way to deal with versions, but it shows some interesting potential.

Versioned Resources In REST APIs

July 27th, 2007

I like the idea of resources being addressable by a simple URL, but I'm having some difficulty reconciling that with resources that are versioned. Getting at and working with the latest version of a document with REST APIs is all pretty straightforward, but how do you retrieve the document history or a specific version of the document? I'm sure this is something that people have already worked out, but all my searching for discussions of it leads to people talking about versioning the API (so that things don't break when you change what operations are available or the data format returned) rather than versioning the resources themselves.

It seems to me that the version information should be available at basically the same place as the main document - it is after all essentially the same document, just older. So something like …/resource.html?version=2 makes sense for retrieving a specific version. However, that requires that you know which versions are available - even in a system that doesn't have branching, versions may go missing because they were deleted to remove spam or to deal with copyright infringement. There's also no reason that versions should be simple numbers - they might be 1.1 or a, b, c or just the ETag for that version.

Speaking of ETags, it would be interesting to see how they could be used - there's an If-Match and If-None-Match header for ETags and I've never quite grasped why both exist. It would be nice for one of them to turn out to mean, "I want this exact version" but I doubt it's the case and I really doubt anything out there supports it.
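For reference, here is a simplified Python sketch of how a server evaluates the two headers, based on my reading of the HTTP specification. The function name and arguments are made up for illustration, and it ignores the distinction between weak and strong ETag comparison. Both headers only compare against the ETag of the current representation: If-None-Match is the one used for cache validation on GET, and If-Match guards updates against overwriting someone else's changes. Neither one selects an older version.

```python
def check_preconditions(method, current_etag, if_match=None, if_none_match=None):
    """Return the status code to send, or None to process the request normally.

    current_etag is None when the resource does not currently exist.
    if_match and if_none_match are lists of ETag strings, or ["*"].
    """
    if if_match is not None:
        matched = current_etag is not None and (
            "*" in if_match or current_etag in if_match
        )
        if not matched:
            return 412  # Precondition Failed: the resource changed or is missing

    if if_none_match is not None:
        matched = current_etag is not None and (
            "*" in if_none_match or current_etag in if_none_match
        )
        if matched:
            # For reads the client's cached copy is still good; otherwise fail.
            return 304 if method in ("GET", "HEAD") else 412

    return None
```

So a PUT with If-Match set to the ETag you last fetched means "only update if nobody else has changed it", and a PUT with If-None-Match: * means "only create, never overwrite".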

Getting a list of available versions is also an issue - you could consider the list of versions a separate resource (say …/versions/resource.html or …/resource.html/versions) which returns a list of available versions and their metadata but doesn't allow PUT or POST operations. It seems odd though that versions are a separate resource. The other option is something like …/resource.html?history=all, which is okay but needs some way of limiting how much is returned, paging and so on.
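To make the query string option concrete, here is a small Python sketch. Everything in it is an assumption for illustration: the `VersionedResource` class, the `handle_get` function and the `offset` and `limit` parameter names are not from any existing API. Version identifiers are treated as opaque strings, so they could just as easily be 1.1, a, b, c or an ETag.

```python
from urllib.parse import parse_qs


class VersionedResource:
    """Keeps every version of one document, keyed by an opaque version id."""

    def __init__(self):
        self._versions = {}  # version id -> content
        self._order = []     # version ids, oldest first

    def add_version(self, version_id, content):
        self._versions[version_id] = content
        self._order.append(version_id)

    def delete_version(self, version_id):
        # Versions can go missing, e.g. when spam is removed.
        del self._versions[version_id]
        self._order.remove(version_id)

    def get(self, version_id=None):
        if not self._order:
            raise KeyError("resource has no versions")
        if version_id is None:
            version_id = self._order[-1]  # default to the latest version
        return self._versions[version_id]  # KeyError if that version is gone

    def history(self, offset=0, limit=10):
        return self._order[offset:offset + limit]


def handle_get(resource, query_string):
    """Dispatch a GET based on ?version=... or ?history=all&offset=..&limit=.."""
    params = parse_qs(query_string)
    if "history" in params:
        offset = int(params.get("offset", ["0"])[0])
        limit = int(params.get("limit", ["10"])[0])
        return resource.history(offset, limit)
    version_id = params.get("version", [None])[0]
    return resource.get(version_id)
```

A request with no query string returns the latest version, ?version=1 returns that specific version (or fails if it has been deleted), and ?history=all returns a page of version ids that the client can then ask for one at a time.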

How are other people handling versioned resources with REST style APIs?

Solr Search Index Backups?

July 25th, 2007

If you have a massive set of documents that you're using Solr to search (let's say a few million HTML pages) how much should you worry about losing the search index?

It is of course always possible to reindex the original documents, but that would take a fair while, so should you keep a backup of the search index? If you restored the backup, how would you identify which documents needed updating?

Solr seems to support replication - should you just use that as a constant backup that you can swap over to using if something goes wrong?

So many questions…

And it looks like there are answers out there somewhere - Solr has a bunch of tools related to backups and the like. It looks promising: the documentation doesn't really say what the best practices are, but at least there are tools that look like they help you manage the search indexes.

Amazon EC2 As A Webhost

July 24th, 2007

We need to move our company wiki and JIRA instance to a server with more RAM and CPU to spare as they're pretty slow on the current overloaded virtual server, so we've been looking at a few different options. One that came up was using Amazon's EC2 and S3 services. We knew straight off that we didn't need the scalability they offered but getting some experience using them could be beneficial and we really didn't know anything about what they actually offered so it was worth a quick look.

Those familiar with EC2 won't be surprised to hear that we won't be going with the service for three reasons:

  1. It's at least as expensive as the dedicated server we'd need.
  2. The filesystem gets reset every time the server reboots (S3 provides a REST API to store and retrieve data, not a filesystem).
  3. The server gets a new IP address every time it reboots.

The cost only applies to us because we don't need scalability - our needs are really quite consistent so we're not avoiding purchasing large amounts of redundant hardware. We also have the ability to just pay a hosting company to set up one dedicated server for us instead of setting up our own server farm. If you were offering software as a service however, Amazon's offering is likely to save you a lot of money.

The filesystem resetting is a challenge for deploying most existing web applications, but not for software designed to run with S3. For instance, it's pretty easy to imagine a wiki implementation that uses S3 as its "database" for storing data directly (probably with some local caching etc). Wikis are somewhat ideal for this because search is about the only query you perform on the data - otherwise you just retrieve pages by name, which S3 is perfectly suited for. The fact that so many wikis use flat files instead of databases is an indication that it'd work pretty well. There would be a few hurdles but nothing insurmountable.
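A minimal Python sketch of that idea follows. The `S3Wiki` and `DictStore` names are hypothetical, and `DictStore` is only an in-memory stand-in for an S3 bucket, not a real S3 client. Pages are stored and fetched by name, with a local cache in front so repeated reads don't go back to the store.

```python
class DictStore:
    """Hypothetical stand-in for an S3 bucket: objects stored by key."""

    def __init__(self):
        self._objects = {}
        self.get_count = 0  # how many reads actually reached the store

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        self.get_count += 1
        return self._objects.get(key)


class S3Wiki:
    """Wiki pages keyed by name, with a simple local read cache."""

    def __init__(self, store):
        self.store = store
        self.cache = {}  # page name -> page text

    def save_page(self, name, text):
        self.store.put(name, text)
        self.cache[name] = text

    def load_page(self, name):
        if name not in self.cache:
            text = self.store.get(name)
            if text is None:
                return None  # no such page
            self.cache[name] = text
        return self.cache[name]
```

One of the hurdles shows up straight away: with more than one wiki instance, each local cache can go stale when another instance saves a page, so a real implementation would need some form of invalidation or expiry.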

The dynamic IP, however, is a real pain. There are examples of using dynamic DNS to work around it, but the lag in DNS updates seems like a problem to me. The better solution would of course be to have a load balancer in front of your EC2 instances - the load balancer would have a static IP address and the EC2 instances would just register with it when they start up. Unfortunately this means you have to have a server outside of EC2 to do the load balancing, which means another hosting provider to work with, and it just seems odd to have the load balancing server in a different data center to the rest of the servers. If Amazon added an option to build an EC2 machine that could only ever have one instance but had a guaranteed IP address, it would be the perfect solution.
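The registration idea can be sketched in a few lines of Python. This is only an illustration with made-up names (`LoadBalancer`, `register`, `deregister`, `pick`) and it does plain round-robin selection without any health checks, so treat it as a sketch rather than anything you could deploy.

```python
class LoadBalancer:
    """Sits on a static IP and tracks backends that register themselves."""

    def __init__(self):
        self.backends = []  # current backend addresses
        self._next = 0      # counter used for round-robin selection

    def register(self, address):
        # Called by an instance when it starts up with whatever IP it was given.
        if address not in self.backends:
            self.backends.append(address)

    def deregister(self, address):
        # Called on shutdown, or when a backend stops responding.
        if address in self.backends:
            self.backends.remove(address)

    def pick(self):
        # Choose the backend for the next request, cycling through the list.
        if not self.backends:
            raise RuntimeError("no backends registered")
        backend = self.backends[self._next % len(self.backends)]
        self._next += 1
        return backend
```

When an instance reboots and comes back with a new IP address, it simply registers again and the old address is dropped, so clients only ever need to know the load balancer's address.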

It's certainly something interesting to play with - I'll have to chase up a corporate credit card and see if I can get access to do some experimentation some time.