Conversion for the Web
Andrew Shebanow in Open Government and PDF:
The issue at hand is not whether governments should pick HTML or PDF. The issue at hand is whether governments are capable of publishing information at all. Show me an HTML creation tool that creates high quality, standards conformant markup from a Word document or any of the zillions of editing tools that government employees use. Now add in all the tools used by people who submit documents to the government. And all the versions of those tools released in the last 20 years. Now make sure that the HTML/XML works correctly even when the user doesn’t have the right browser or the right fonts installed.
I’ve actually worked with a number of government departments that were looking to move more content online, and content conversion is definitely a time-consuming and challenging part of the problem. That’s precisely why I wind up getting involved, since EditLive! lets you easily copy and paste content from Word documents and produce clean, compliant XHTML. It can even (optionally) strip out inline formatting and leave just the structure like headings, tables and lists.
Furthermore, EditLive! is actually quite good at making sure the HTML works correctly even when the user doesn’t have the right browser or the right fonts installed, especially when it’s been configured to suit the particular content needs. Even with non-technical business authors this can work very well and is doing so for a significant number of government departments.
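To make the idea of stripping inline formatting while keeping structure concrete, here is a minimal sketch using Python’s standard `html.parser`. It is not how EditLive! works internally; the `STRUCTURAL_TAGS` allow-list, the `StructureOnly` class and the sample markup are all assumptions made up for illustration.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allow-list: tags treated as structure. Anything else
# (span, font, b, i, ...) is dropped, but its text content is kept.
STRUCTURAL_TAGS = {
    "h1", "h2", "h3", "h4", "h5", "h6", "p",
    "ul", "ol", "li",
    "table", "thead", "tbody", "tr", "th", "td",
}


class StructureOnly(HTMLParser):
    """Re-emit only structural tags, without attributes, plus all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in STRUCTURAL_TAGS:
            # Attributes such as style= and class= are deliberately discarded.
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in STRUCTURAL_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Re-escape text so the output stays well-formed.
        self.out.append(escape(data, quote=False))


def strip_inline_formatting(markup):
    parser = StructureOnly()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return "".join(parser.out)


# Made-up sample resembling markup pasted from a word processor.
pasted = (
    '<h1 style="color:red">Notice</h1>'
    '<p class="MsoNormal"><span style="font-family:Arial">'
    '<b>Fees</b> rise in <font size="2">July</font>.</span></p>'
)
cleaned = strip_inline_formatting(pasted)
print(cleaned)  # <h1>Notice</h1><p>Fees rise in July.</p>
```

A real converter has far more to deal with (nested lists, images, embedded styles, malformed input), which is part of why conversion is the time-consuming step.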
That’s not to say it’s the whole solution. There are systems out there where it’s hard to convert the content to HTML and where HTML may not be the best format anyway. Some of those cases may work better with PDF, but certainly not all of them. To suggest that PDF is a complete and simple solution to publishing information on the web misses quite a lot of the picture. For example:
- How do web site visitors navigate around and get to that PDF data? How do they search and find it? As much time is spent working out navigation structures as it is converting content.
- How do you expose information from databases with regularly changing information? Wouldn’t an HTML representation be easier to generate than PDF in most of these cases?
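As a rough illustration of the database point, the sketch below renders query results as an HTML table using only Python’s standard library. The `permits` table, its columns and the sample rows are hypothetical; a real site would sit behind whatever database and templating system the department already uses.

```python
import sqlite3
from html import escape


def rows_to_html_table(headers, rows):
    """Render headers and rows as a minimal HTML table, escaping every cell."""
    head = "".join(f"<th>{escape(str(h))}</th>" for h in headers)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(cell))}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table><thead><tr>{head}</tr></thead><tbody>{body}</tbody></table>"


def render_permits(conn):
    # Hypothetical table and column names, used only for this example.
    cur = conn.execute("SELECT permit_id, holder FROM permits ORDER BY permit_id")
    headers = [col[0] for col in cur.description]
    return rows_to_html_table(headers, cur.fetchall())


# In-memory database with made-up rows so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE permits (permit_id INTEGER, holder TEXT)")
conn.executemany(
    "INSERT INTO permits VALUES (?, ?)",
    [(1, "Smith & Sons"), (2, "<Acme>")],
)
page = render_permits(conn)
print(page)
```

Because the page is regenerated from the query each time, it stays current as the data changes, with no separate export step.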
Putting information on the web is not simple and no single technology is going to make it simple. PDF definitely has its place on the web, but so do HTML and a number of other formats. PDF doesn’t alleviate compatibility concerns: not all users have a recent enough PDF reader, not all PDFs embed all the fonts (and when they do, the download becomes very large), and not all PDFs are standards compliant. Putting non-web stuff on the web is always a big, challenging project, so review the available technologies carefully and pick the ones that best achieve your goals. Very few companies have success with just dumping a whole heap of PDFs on a web server.

November 4th, 2009 at 3:49 pm
My point isn’t that PDF is perfect or even great, my point is that it is a practical and cost effective solution today, while the HTML/XML advocates are all proposing to force every state, local, and federal agency to do things in a new way. This requires significant investment in IT, training of employees, and additional time to process each document. And that doesn’t even address the problems that occur when non-governmental employees want to submit documents to the government, which will cause yet another set of IT, training, and time costs. The economic downside is huge for everyone except the publishers of software like EditLive! :-)
November 4th, 2009 at 3:54 pm
I guess my point is that even when you use PDF you’re going to incur all those costs. Moving stuff to the web is expensive and difficult – maybe some government departments could just dump a bunch of PDFs in a directory but that’s not really what anyone would want to get out of the process. As soon as you go beyond that you also get a whole heap of expensive problems to solve.
November 4th, 2009 at 4:03 pm
Most of those costs have already been incurred for PDF. Pretty much everyone who has a PC already has the ability to generate PDFs using the Print command. Most government employees already know how to Print, and it works well with the tools they already have. That is precisely why so many government documents are put out in PDF today.
Of course Adobe would like everyone in government to buy and upgrade their Acrobat Pro licenses on a regular basis. But the fact is that most modern OSes, office suites and such have PDF output built in, so they don’t have to buy it unless they want to get into automated PDF workflows (around form data, approvals, etc.).
Again, I’m not saying things wouldn’t be better if governments put out their information in additional formats like XML or HTML that were more amenable to certain kinds of processing. But demanding that they dump PDF and move to something entirely new en masse is foolish and impractical.
November 4th, 2009 at 4:22 pm
But generating the PDF is only a very tiny part of the problem. It’s as simple to paste from Word into EditLive! as it is to print; users already know how to copy and paste. The problem is they don’t know how to use the publishing system to take the content in whatever form, apply all the required metadata, select the right section of the site and finally publish it.