HTML Diff Tools?
Anyone know of good HTML diff tools that actually work well? For that matter, does anyone know of any diff tools that work well with plain text that isn't line based, e.g. the kind of text you'd find in a blog entry or a book? I'm guessing a combination of word-based and line-based diffing might work okay for that type of thing, but I haven't come across much that actually tries to deal with the problem. At least, not in a way that aims to provide a meaningful result for humans rather than just a form of compression for updates.
I could probably wrangle the HTML side of it well enough if I had an existing tool that could take the plain text and provide meaningful differences.

October 11th, 2005 at 1:45 am
If you’re looking for differences in the words rather than differences in the HTML structure or the like, you could try this:
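rm -f file1.txt file2.txt
# Dump each page as text and write it out one word per line
for word in $(lynx -dump "$URL1"); do echo "$word" >>file1.txt; done
for word in $(lynx -dump "$URL2"); do echo "$word" >>file2.txt; done
# -d searches harder for a minimal diff; -c prints context-format output
diff -dc file1.txt file2.txt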
October 11th, 2005 at 6:24 am
Yeah, that’s not a bad way of doing it, but it would give weird results when changes match up across paragraphs. Doing that on a per-paragraph basis might work, so that you see the changes in each paragraph separately. I guess I’ll have to play around and see what works best.
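For what it's worth, the per-paragraph idea can be sketched in a few lines of Python using the standard library's difflib. This is only a rough sketch under some assumptions: paragraphs are separated by blank lines, the `[-...-]` and `{+...+}` markers are an arbitrary choice, and the function names are made up for illustration.

```python
import difflib


def split_paragraphs(text):
    """Split plain text into paragraphs on blank lines (an assumption)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def word_diff(old, new):
    """Return a word-level diff of two paragraphs as one marked-up string."""
    a, b = old.split(), new.split()
    out = []
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
            continue
        if i1 < i2:  # words that were removed or replaced
            out.append("[-" + " ".join(a[i1:i2]) + "-]")
        if j1 < j2:  # words that were added or are the replacement
            out.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(out)


def paragraph_diff(old_text, new_text):
    """Align whole paragraphs first, then diff words only within matched pairs."""
    old_paras = split_paragraphs(old_text)
    new_paras = split_paragraphs(new_text)
    result = []
    matcher = difflib.SequenceMatcher(None, old_paras, new_paras, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            result.extend(old_paras[i1:i2])
        elif tag == "replace" and (i2 - i1) == (j2 - j1):
            # Same number of paragraphs on both sides: pair them up in order.
            for old_p, new_p in zip(old_paras[i1:i2], new_paras[j1:j2]):
                result.append(word_diff(old_p, new_p))
        else:
            # Otherwise report whole paragraphs as removed and/or added.
            result.extend("[-" + p + "-]" for p in old_paras[i1:i2])
            result.extend("{+" + p + "+}" for p in new_paras[j1:j2])
    return result
```

Because words are only compared inside paragraphs that were paired up, a change in one paragraph can't get matched against similar words in another one. The pairing itself is naive (exact paragraph matches as anchors, everything between them paired in order), so a heavily reshuffled document would need something smarter.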
October 11th, 2005 at 8:53 am
http://www.logilab.org/projects/xmldiff
It has an HTML mode, but probably uses libxml2 under the hood, and libxml2’s HTML parser is extremely unforgiving. Passing HTML through tidy with the -asxml flag usually produces usable results for me, even with the most badly mangled of HTML.
I haven’t used xmldiff, but I have been pondering differencing on XML files as a general case.
October 11th, 2005 at 9:01 am
I should have noted, the HTML in this case is coming out of EditLive! for Java so it’s perfectly well-formed XHTML and I can depend on that. The real problem with xmldiff, though, is that it doesn’t give anything close to human-readable output; it’s effectively designed for diff/patch style things, not for highlighting to users what changes occurred. Looks quite useful though.
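Since the input can be relied on to be well-formed XHTML, one possible way to get output aimed at humans is to pull the text out of each block element with an XML parser and mark word changes with `<ins>` and `<del>`. The following is a minimal sketch, not a finished tool. It assumes Python's standard library, a hand-picked set of block tags, and no named entities such as `&nbsp;` (the stock parser rejects those). It also drops inline markup like bold or links, and nested blocks would be reported twice. All function names are hypothetical.

```python
import difflib
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Assumed set of elements to treat as separate blocks of text.
BLOCK_TAGS = {"p", "li", "h1", "h2", "h3", "h4", "h5", "h6"}


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://www.w3.org/1999/xhtml}p'."""
    return tag.rsplit("}", 1)[-1]


def block_texts(xhtml):
    """Return the whitespace-normalised text of each block element, in document order."""
    root = ET.fromstring(xhtml)
    texts = []
    for element in root.iter():
        if local_name(element.tag) in BLOCK_TAGS:
            texts.append(" ".join("".join(element.itertext()).split()))
    return texts


def highlight(old, new):
    """Return an HTML fragment with removed words in <del> and added words in <ins>."""
    a, b = old.split(), new.split()
    parts = []
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            parts.append(escape(" ".join(a[i1:i2])))
            continue
        if i1 < i2:  # words only in the old text
            parts.append("<del>" + escape(" ".join(a[i1:i2])) + "</del>")
        if j1 < j2:  # words only in the new text
            parts.append("<ins>" + escape(" ".join(b[j1:j2])) + "</ins>")
    return " ".join(parts)
```

Calling `block_texts` on both versions and passing matching blocks to `highlight` gives fragments that can be dropped into a page and styled with CSS, which is closer to showing users what changed than a patch-oriented format.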
October 11th, 2005 at 12:49 pm
The ‘wdiff’ program (in the Debian package ‘wdiff’) is a front-end to GNU ‘diff’ that compares word sequences, not lines. That may be closer to something you can use.
December 30th, 2006 at 12:26 pm
Found a good tool at http://www.jamesdom.com . It diffs by element and even renders the source.
October 14th, 2007 at 12:26 am
I’ll add to the blog spam and point you to my HTML differ at http://code.google.com/p/daisydiff/ .
It should work better than other existing solutions and it’s completely free!
January 8th, 2009 at 10:05 am
Ironically, this came up as number 3 when searching for HTML diff tools in 2009 :( (Ironic because I’m looking for an HTML diff tool for Java, and I currently work at Ephox :))