Aim Higher

June 24th, 2008

Someone came up with a cool idea to add a universal edit button to make editing wikis easier. It adds a button to the browser, like the one for RSS feeds, that takes you to the page’s edit form. It’s clever but it aims way too low. If you can get browsers to add an editing button, you have an opportunity to point either to an online form or to a standalone application that could also edit the page. In other words, Atom Publishing Protocol auto-detection.

It would be a great shame if such a spec were limited to just redirecting to an HTML page, or if it were thought of as a useful feature for wikis instead of a useful feature for the internet everywhere. Let’s face it, nearly every page on the internet can be edited by someone and I’m sure they’d appreciate it being easier to get there. Just look at the number of content management systems that provide some form of inline editing (heck, RedDot renamed their company after the red dots that you clicked on inline to edit a section).
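As a rough sketch of what that detection could look like, a browser could scan the page head for link elements and offer whichever edit targets it finds. The rel and type values below are assumptions for illustration (an "application/x-wiki" alternate link for the edit form, and a "service" link using the Atom service document media type), and find_edit_targets is a made-up name, not part of any spec.

```python
from html.parser import HTMLParser


class EditLinkFinder(HTMLParser):
    """Collects edit-related <link> elements from a page.

    Assumed conventions (illustrative, not taken from a spec):
      rel="alternate" type="application/x-wiki"      -> HTML edit form
      rel="service"   type="application/atomsvc+xml" -> Atom Publishing
                                                        Protocol service document
    """

    def __init__(self):
        super().__init__()
        self.form_url = None
        self.service_url = None

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rels = (attributes.get("rel") or "").lower().split()
        mime = (attributes.get("type") or "").lower()
        href = attributes.get("href")
        if not href:
            return
        if "service" in rels and mime == "application/atomsvc+xml":
            self.service_url = href
        elif "alternate" in rels and mime == "application/x-wiki":
            self.form_url = href


def find_edit_targets(html):
    """Return the edit form URL and the service document URL, if present."""
    finder = EditLinkFinder()
    finder.feed(html)
    return {"form": finder.form_url, "service": finder.service_url}
```

With both links present, the browser could send you to the form or hand the service URL to a desktop editor. With only the form link, it falls back to the plain redirect.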

Automatic Spelling Dictionary Selection

May 24th, 2008

David Welton described a frustration he had with Firefox’s spell checker, which piqued my interest:

I write most of my email (in gmail) and submit most web site content in English, however, a significant portion is also done in Italian. I leave the spell checker in English, because Italian is, in general, quite easy to spell, so that even as a native speaker, a helping hand is occasionally welcome. However, it isn't as if I write Italian perfectly either, so the help there would be nice as well. I find it quite annoying to go change the language in the spell checker option each time, especially when, as an example, I'm responding to email and do 2 in English, one in Italian, another in English, and so on.

On the face of it, identifying what language an author is using looks a lot like a natural language processing problem and thus requires a PhD and many years of research to tackle. Looking a little closer though, you begin to realize that the problem domain is dramatically reduced in this case:

  1. We’re dealing with a preselected list of languages that the user can actually speak. For the majority of people who’d use this feature, that would be 2 languages, and it would be very rare to reach double digits.
  2. We don’t really want to know the language that’s supposed to be in use, we just want to find the spelling dictionary that most accurately checks the language that’s being used.
  3. We obviously have a spelling dictionary for each of the candidate languages around or it’s pointless detecting the language in the first place.

Given the first restriction of a limited number of languages, we can use a pretty brute force algorithm since we don’t need to check every language on earth. Given the third restriction, we have a big list of valid words in each language. Finally, given the second constraint, we just want to find the spelling dictionary that marks the fewest words as errors and we’re very likely to have the right language.

So all we do is check the document with each dictionary, count the errors for each and pick the dictionary that gives the fewest errors. I implemented this and it works amazingly well for such a stupid algorithm, but there’s a problem…
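Here’s a minimal sketch of that idea in Python. It isn’t the code I wrote (that depends on a third-party checker); the tiny word sets are stand-ins for real dictionaries, and the names count_errors and pick_dictionary are made up for the example.

```python
import re

# Tiny stand-in word lists. A real checker would load full dictionaries
# (or lists of the most common words) for each candidate language.
DICTIONARIES = {
    "en": {"the", "is", "and", "to", "of", "a", "in", "this", "i", "my",
           "for", "you", "it", "that", "very"},
    "it": {"il", "la", "è", "e", "di", "che", "in", "un", "una", "per",
           "non", "questo", "molto", "sono", "mi"},
}

# Runs of letters, including accented ones; digits and underscores excluded.
WORD = re.compile(r"[^\W\d_]+")


def count_errors(text, words):
    """Count the words in text that are not in the given word set."""
    return sum(1 for word in WORD.findall(text.lower()) if word not in words)


def pick_dictionary(text, dictionaries=DICTIONARIES):
    """Return the language whose dictionary flags the fewest words."""
    return min(dictionaries,
               key=lambda lang: count_errors(text, dictionaries[lang]))
```

Ties go to whichever language comes first in the dictionary, which is probably where you’d put the user’s default language anyway.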

Dictionaries are really big. Downloading a dictionary for each possible language isn’t a lot of fun if you happen to be writing a browser-based editor. Worse, loading 5 dictionaries into RAM takes an awfully long time and consumes way too much RAM to cache.

Fortunately, the spell checker I happen to have handy uses a simple optimization to avoid having to search through the entire word list every time - it includes a separate file of the most common words in the language. For the spelling checker, that allows it to find a match for most words from a small hash table. For our purposes though, the most common words are exactly what we need and they reduce the size of the dictionary to something extremely manageable.

In fact, the most common words from the 12 languages EditLive! supports amount to about 40k in total. That’s easily small enough to download them all, and they load fast enough that there’s no need to cache them in RAM at all.

I was going to post the code I wrote but it depends on a proprietary, third-party spelling checker and is frankly so basic that it’s really not worth sharing. I was going to put it up as a servlet for people to play around with but Google has one that’s way better anyway. Still, it was fun to do up and nice to know that something that seems so complex is actually ridiculously simple.

Sun Wiki Publisher

May 22nd, 2008

Kevin Gamble pointed me towards the Sun Wiki Publisher for publishing documents to MediaWiki servers straight from OpenOffice/StarOffice. The key problem with these types of integrations is that wiki markup simply can’t handle anywhere near the same level of expressiveness as even HTML, let alone a word processor document. Hence the description mentions:

All important text attributes such as headings, hyperlinks, lists and simple tables are supported. Even images are supported as long as they have already been uploaded to the wiki site. An automatic upload of images is currently not supported.

The lack of image upload is just due to the early stages of development, but the loss of formatting is going to be permanent. Generally wiki markup can’t handle things like nested tables, and there’s a big difference between tables and “simple tables”. Those users who don’t use heading styles (and there are, unfortunately, a lot of them) will probably get something akin to a plain text dump.

Of course, determining which formatting you want to keep and which you don’t when moving content between systems is incredibly difficult, but it’s always nice to actually have a choice, which wiki markup just doesn’t provide. For example, EditLive! has three modes for cleaning pasted content from Microsoft Word (and from any other source really):

  • Clean - structural information only, like headings, tables, lists etc.
  • Inline Formatting - preserve the formatting as best as possible with HTML by using inline styles.
  • Embedded Formatting - preserve the formatting as best as possible by adding CSS styles to the head of the document and using classes.
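To give a feel for what the “Clean” mode boils down to, here’s a rough sketch in Python. It is not how EditLive! does it; the list of structural tags and the name clean_paste are assumptions made up for the example.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical list of "structural" tags for this sketch; the real
# product's list isn't given here.
STRUCTURAL = {"h1", "h2", "h3", "h4", "h5", "h6", "p",
              "ul", "ol", "li", "table", "tr", "td", "th"}


class CleanPaste(HTMLParser):
    """Keeps structural tags and text, drops everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in STRUCTURAL:
            # Attributes such as style and class are deliberately dropped.
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in STRUCTURAL:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is kept, re-escaped so it stays valid HTML.
        self.out.append(escape(data, quote=False))


def clean_paste(html):
    """Return html reduced to structural tags and plain text."""
    parser = CleanPaste()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Note this sketch keeps the text inside any style or script elements, so a real implementation would need to skip those, and the other two modes would have to carry the styles across rather than throw them away.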

There are also some options for pasting plain text and for prompting the user when they paste, but they’re so unpopular we may as well just ignore them.

Towards the end of many EditLive! releases someone goes through the default configuration to make sure it’s got all the new settings and is the best set of defaults we can find. Inevitably, this leads to a discussion about which of the above three options is best, and after 6 years or so of having this discussion there still isn’t a completely clear answer. Lately the trend among our clients has been towards pasting clean, but there are still plenty that want pixel perfect rendering accuracy. In large part, it depends on how well structured the original documents are. If they don’t use heading styles, they simply won’t import well with “clean”. If they use heading styles well but are full of gratuitous formatting that you want to get rid of, then “clean” works well for you.

Bottom line, having a choice about these things makes importing existing content so much easier - even if you hide all the inline formatting functions in the editor so users are still encouraged to just use headings and CSS etc.

Is HTML a Humane Markup Language?

May 18th, 2008

Jeff Atwood chimes in on the age-old question of HTML vs Markdown/Textile/Custom Markup/etc. Unfortunately he rules out using a WYSIWYG editor with the one-line statement:

Nothing's decided at this point, but we definitely won't be giving users one of those friendly-but-irritating HTML GUI browser layout controls.

Well sure, you wouldn’t give them a friendly-but-irritating HTML GUI browser control, but why not give them a good one? These days it takes a fair bit of effort to find an HTML editor that doesn’t handle the very basics fairly well, and Jeff doesn’t seem to be looking at anything more complex than bold, italic and some hyperlinks. I think a lot of people get stuck in a real geek ego thing, or remember the really early days of HTML editors, and don’t actually evaluate modern editors properly. It’s a real shame.

All these simplified markup languages just don’t make sense to me. Why make people learn something new just for some really basic formatting? Is it really that important to have the formatting capability at all?

My view is that you either need the good, full-featured formatting that a high quality HTML editor would provide, or you really want plain text. Hyperlinks can be added automatically to URLs easily enough, which is the main thing you need, and if focussing on the content is really what matters you don’t need any formatting markup at all.
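To show how little is involved in the plain text route, here’s a minimal sketch of automatic linking in Python. The URL pattern is deliberately simple and only an assumption about what counts as a URL, and linkify is a made-up name.

```python
import re
from html import escape

# Deliberately simple pattern: http(s) URLs up to the next whitespace,
# angle bracket or double quote. Real-world URL matching is messier.
URL = re.compile(r"https?://[^\s<>\"]+")


def linkify(text):
    """Escape plain text for HTML and wrap any URLs in anchor tags."""
    out = []
    pos = 0
    for match in URL.finditer(text):
        # Trailing punctuation usually belongs to the sentence, not the URL.
        url = match.group(0).rstrip(".,;:!?)")
        end = match.start() + len(url)
        out.append(escape(text[pos:match.start()]))
        out.append(f'<a href="{escape(url)}">{escape(url)}</a>')
        pos = end
    out.append(escape(text[pos:]))
    return "".join(out)
```

Everything that isn’t a URL is simply escaped, so users type plain text and never see markup of any kind.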

Of course, I’ve seen things this way for ages (1, 2).

The Value Of Criticism

May 17th, 2008

CMSWire: Vendor Criticism of CMS Watch

In an industry whereby most of the "independent analysts" are heavily dependent on revenues from the very firms they claim to be "independent" of, it's unusual to see truly critical research get published. So it becomes a surprise to both buyers and sellers when they read such criticism. In our reports we widely distribute the compliments and brickbats — if something is truly terrible we will tell you.

The way I see it, criticism is one of the most important things a product company can receive - particularly in an aggregated, general form like an analyst report tends to give you. It lets you identify areas you need to improve in that are actually affecting clients rather than the ones that seem important to you.

The first thing I do when I get a new vendor report is to search through it for any mentions of Ephox (it’s amazing how often we turn up in reports about web content management systems) and see what they didn’t like so we can work out how to fix it. Then I go through and find the general trends in the report to establish the wider industry direction and market opportunities.