Diffing HTML

I think this is the final episode in my series of responses [1] to Alastair's "Responding to Adrian". What's the aim of diffing HTML, how hard is it, and how do you go about it?

The aim is really important to identify. The most common and most useful aim that I see for diffing HTML is to show users what changed between two versions of a document. Since the management of most content is centralized [2], this equates to showing the combined effect of every change made between the original version and the final version being compared. If you've ever wanted to see what's changed on a wiki page, you've wanted this type of diff. If you're sending Word documents back and forth between people, you probably want this type of diff too.

Another common and useful type of diff is where two parties have made changes to the same base version and you want to compare them so they can be merged into a single document with both changes. This is useful when you don't have a centralized repository controlling things, or if that repository allows concurrent changes.

The third use of diff is to store only changes to a document in order to save space, be that on the file system or network bandwidth. This is the one form which has absolutely no need for human readability.
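As a rough illustration of this third use case, here is a minimal sketch of storing only the changes, using Python's standard difflib module. The function names `make_delta` and `apply_delta` and the tuple format are my own assumptions for illustration, not any particular tool's format. The delta is only meant to be read by a machine: it records which ranges of the old text to copy and which new text to insert.

```python
import difflib

def make_delta(old, new):
    """Build a compact list of operations that rebuilds `new` from `old`."""
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged text is stored as a reference into the old version.
            delta.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            # Only genuinely new text is stored in full.
            delta.append(("insert", new[j1:j2]))
        # "delete" needs no entry: the old range is simply never copied.
    return delta

def apply_delta(old, delta):
    """Rebuild the new version from the old version plus the delta."""
    parts = []
    for op in delta:
        if op[0] == "copy":
            parts.append(old[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)

old = "<p>Diffing HTML is hard.</p>"
new = "<p>Diffing HTML is quite hard.</p><p>Really.</p>"
delta = make_delta(old, new)
restored = apply_delta(old, delta)
```

Nothing in the delta tells a human what the author meant to change, which is fine here because only `apply_delta` ever reads it.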

It's also worth noting that diffing HTML is quite different to diffing generic XML. HTML is a document format and the type of content it generally contains is very natural language intensive. There are many XML formats that have similar attributes [3], but also a lot of XML formats that don't [4]. For content that isn't natural language intensive, diffing in any of these use cases comes down to the classic computer science techniques - adding, removing and moving elements, adding, removing and changing attributes, changing element content, etc. However, for natural language based content, the changes to the XML structure are far less important than the changes to the actual textual content.
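To make the structural side concrete, here is a minimal sketch of that kind of diff for data-style XML, using Python's standard xml.etree.ElementTree. The function name `diff_elements` and the report format are assumptions for illustration. It matches child elements purely by position, so it reports added, removed and changed elements and attributes but makes no attempt to detect moves, and it ignores text that follows a child element.

```python
import xml.etree.ElementTree as ET

def diff_elements(old, new, path=""):
    """Return a list of human-readable structural changes between two elements."""
    if old.tag != new.tag:
        return [f"{path}/{old.tag}: replaced by <{new.tag}>"]
    here = f"{path}/{new.tag}"
    changes = []
    # Attributes: added, removed or changed.
    for name in sorted(old.attrib.keys() | new.attrib.keys()):
        before, after = old.get(name), new.get(name)
        if before is None:
            changes.append(f"{here}: attribute {name} added ({after!r})")
        elif after is None:
            changes.append(f"{here}: attribute {name} removed")
        elif before != after:
            changes.append(f"{here}: attribute {name} changed {before!r} -> {after!r}")
    # Element content: only records that it changed, not how.
    if (old.text or "").strip() != (new.text or "").strip():
        changes.append(f"{here}: text changed")
    # Children are paired by position (a simplifying assumption).
    old_kids, new_kids = list(old), list(new)
    for o, n in zip(old_kids, new_kids):
        changes.extend(diff_elements(o, n, here))
    for extra in old_kids[len(new_kids):]:
        changes.append(f"{here}/{extra.tag}: element removed")
    for extra in new_kids[len(old_kids):]:
        changes.append(f"{here}/{extra.tag}: element added")
    return changes

old_doc = ET.fromstring('<form id="1"><name>Ann</name><age>30</age></form>')
new_doc = ET.fromstring('<form id="2"><name>Ann</name><age>31</age><city>Oslo</city></form>')
report = diff_elements(old_doc, new_doc)
```

For form data, "text changed" on an `age` element is all you need to know. For a paragraph of prose, the same report would be close to useless, which is the distinction I'm drawing.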

It may well be that I've missed something, and if so please let me know, but despite a reasonable amount of searching I've never seen a diff tool for any format that can handle natural language well. For the aim of showing changes between document versions, the most important thing is that the diff output clearly shows the intent of changes, not the effect of them. Line-based diff obviously doesn't cut it here, as a single character spelling correction shouldn't cause the entire line to be marked as different. This kind of change is where Word's track changes shines - it can mark that one character change as a single character change while still marking a change from "the" to "tear" as a word change instead of a removed "h" and an inserted "a" and "r". It is complexities like these that make natural language diffing so hard.
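To show the flavour of the problem, here is a minimal sketch of a word-aware diff built on Python's standard difflib. It is not how Word works and not a real solution: the names `tokenize`, `mark` and `word_diff`, the `[-removed-]{+added+}` notation and the 0.8 similarity threshold are all assumptions of mine. It diffs word by word, and only drops down to character level when a single word has been replaced by a very similar one, which is a crude guess at a spelling correction.

```python
import difflib
import re

SIMILARITY_THRESHOLD = 0.8  # assumed cut-off; would need tuning on real content

def tokenize(text):
    """Split text into words, runs of whitespace and single punctuation marks."""
    return re.findall(r"\w+|\s+|[^\w\s]", text)

def mark(seq_a, seq_b, refine=False):
    """Render a diff of two sequences with [-removed-] and {+added+} markers."""
    matcher = difflib.SequenceMatcher(None, seq_a, seq_b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        old, new = seq_a[i1:i2], seq_b[j1:j2]
        if tag == "equal":
            out.append("".join(old))
            continue
        if refine and tag == "replace" and len(old) == 1 and len(new) == 1:
            similarity = difflib.SequenceMatcher(None, old[0], new[0]).ratio()
            if similarity >= SIMILARITY_THRESHOLD:
                # Probably a small correction: show it character by character.
                out.append(mark(old[0], new[0]))
                continue
        if old:
            out.append("[-" + "".join(old) + "-]")
        if new:
            out.append("{+" + "".join(new) + "+}")
    return "".join(out)

def word_diff(old_text, new_text):
    """Diff two strings word by word, refining near-identical words."""
    return mark(tokenize(old_text), tokenize(new_text), refine=True)

word_change = word_diff("fix the page", "fix tear page")
typo_fix = word_diff("I recieve mail", "I receive mail")
```

With these inputs, `word_change` is `fix [-the-]{+tear+} page`, while the spelling fix in `typo_fix` is marked inside the word rather than replacing it wholesale. A fixed threshold like this guesses wrong often enough in practice that it mostly demonstrates why inferring intent after the fact is hard.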

XML structural diffing, on the other hand, is a fairly well solved problem as long as you just need to know that the content changed, rather than exactly what was changed. Docucomp, for instance, has very good diffing tools for HTML, but it is largely ineffective at showing the intent of changes to natural language.

In general, I agree with Alastair's comments on diffing, but he seems to be looking mostly at the latter two use cases [5], whereas I mostly focus on the first since that's about the only use case I have for diff. I also agree that Word has some significant limitations and bugs in its track changes implementation, but as a technology concept it does show the potential of tracking changes as they happen, instead of diffing after the fact, as a way of showing humans what changed between two versions of a document. A complete implementation of track changes [6] would enable the other two use cases, but with more effort than a post-edit diff.

In the end, it really comes down to what your use case is.

1 - Previous responses: On The Importance Of Rendering Fidelity, The Invisible Formatting Tag Problem and the original The Challenge Of Intuitive WYSIWYG HTML

2 - at least in the area I work in where CMSs and similar systems are everywhere

3 - Docbook and DITA come to mind

4 - e.g. the data from a form that is stored as XML

5 - which explains why three way diff is a hard requirement

6 - Word can't track structural changes to tables

2 Responses to “Diffing HTML”

  1. Alastair Says:

    I agree there are many possible use cases for diffing, but there should be no disagreement about what diffing *is*. The basic definition of diffing comes down to determining the minimum edit distance for a given set of possible operations. And at this abstract level anyway there is no difference between diffing HTML, XML, text or even images.

    You seem to be arguing that the most human-comprehensible output may not necessarily be the absolute minimum edit distance between two documents, and that a non-optimal set of edit operations may more accurately reflect the underlying intent of the changes made. I see that this is a worthwhile problem but it is not diffing, as it is subject to the whims of human comprehension and not just mathematics.


  2. Adrian Sutton Says:

    If you wish to concentrate on a purely theoretical computer science problem, then by all means let the strict definition of diffing get in the way of finding a good solution. Personally, I prefer to analyze use cases and determine the solution that best matches. In the case I’ve been discussing, the whims of human comprehension are exactly what we’re dealing with and mathematics is merely an implementation detail.

    This is not a case where we are particularly disagreeing, as I noted above. We clearly approach these issues from different points of view and with different goals in mind, which is why we come to different conclusions and identify different solutions. It has certainly been interesting to consider these areas, even if they are not the same areas and issues you originally intended to discuss. In short, thank you for your comments; they’ve prompted me to identify a number of assumptions and influences that make up my viewpoint and to consider in more detail a number of issues that I work with every day.

