HTML Diff Tools?

Anyone know of good HTML diff tools that actually work well? For that matter, does anyone know of any diff tools that work well with plain text that isn't line-based, e.g. the kind of text you'd find in a blog entry or a book? I'm guessing a combination of word-based and line-based diffing might work okay for that type of thing, but I haven't come across much that actually tries to deal with the problem. At least, not in a way that aims to provide a meaningful result for humans rather than just a form of compression for updates.

I could probably wrangle the HTML side of it well enough if I had an existing tool that could take the plain text and provide meaningful differences.
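
To make the "combination" guess a bit more concrete, here is a rough sketch rather than an existing tool: align whole paragraphs first, then diff word by word inside the paragraphs that changed. It assumes plain text with blank lines between paragraphs and uses Python's standard difflib. The function names and the [-removed-] / {+added+} markers are made up for illustration, and pairing changed paragraphs by position is a simplification that may not suit heavily rearranged text.

```python
import difflib
import re


def split_paragraphs(text):
    """Split plain text into paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def word_diff(old, new):
    """Mark word-level changes between two paragraphs."""
    a, b = old.split(), new.split()
    out = []
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
            continue
        if i2 > i1:  # words present only in the old paragraph
            out.append("[-" + " ".join(a[i1:i2]) + "-]")
        if j2 > j1:  # words present only in the new paragraph
            out.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(out)


def text_diff(old_text, new_text):
    """Align paragraphs first, then diff words inside changed paragraphs."""
    old = split_paragraphs(old_text)
    new = split_paragraphs(new_text)
    result = []
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            result.extend(old[i1:i2])
        elif tag == "delete":
            result.extend("[-" + p + "-]" for p in old[i1:i2])
        elif tag == "insert":
            result.extend("{+" + p + "+}" for p in new[j1:j2])
        else:
            # "replace": pair paragraphs by position (a simplification);
            # any leftovers are treated as wholly removed or added.
            n = min(i2 - i1, j2 - j1)
            for k in range(n):
                result.append(word_diff(old[i1 + k], new[j1 + k]))
            result.extend("[-" + p + "-]" for p in old[i1 + n:i2])
            result.extend("{+" + p + "+}" for p in new[j1 + n:j2])
    return "\n\n".join(result)
```

For example, text_diff("The quick brown fox.", "The slow brown fox.") gives "The [-quick-] {+slow+} brown fox.", and an unchanged paragraph comes back untouched, so a change in one paragraph never gets matched against words in another.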

8 Responses to “HTML Diff Tools?”

  1. Leon Brooks Says:

    If you’re looking for differences in the words rather than differences in the HTML structure or the like, you could try this:

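    rm -f file1.txt file2.txt
    # Write each page's rendered text out one word per line
    for word in $(lynx -dump "$URL1"); do printf '%s\n' "$word" >>file1.txt; done
    for word in $(lynx -dump "$URL2"); do printf '%s\n' "$word" >>file2.txt; done
    # -c gives a context diff; -d tries harder to find a smaller set of changes
    diff -dc file1.txt file2.txt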


  2. Adrian Sutton Says:

    Yeah that’s not a bad way of doing it but would give weird results when changes match up across paragraphs. Possibly doing that on a per-paragraph basis would work so that you see the changes in each paragraph separately. I guess I’ll have to play around and see what works best.


  3. Byron Ellacott Says:

    http://www.logilab.org/projects/xmldiff

    It has an HTML mode, but probably uses libxml2 under the hood, and libxml2’s HTML parser is extremely unforgiving. Passing HTML through tidy with the -asxml flag usually produces usable results for me, even with the most badly mangled of HTML.

    I haven’t used xmldiff, but I have been pondering differencing on XML files as a general case.


  4. Adrian Sutton Says:

    I should have noted, the HTML in this case is coming out of EditLive! for Java so it’s perfectly well-formed XHTML and I can depend on that. The real problem with xmldiff though is that it doesn’t give anything close to human-readable output; it’s effectively designed for diff/patch-style use, not for highlighting to users what changes occurred. Looks quite useful though.


  5. Ben Finney Says:

    The ‘wdiff’ program (in the Debian package ‘wdiff’) is a front-end to GNU ‘diff’ that compares word sequences, not lines. That may be closer to something you can use.


  6. james Says:

    Found a good tool at http://www.jamesdom.com . It diffs by element and even renders the source.


  7. Guy Van den Broeck Says:

    I’ll add to the blog spam and point you to my HTML differ at http://code.google.com/p/daisydiff/ .
    It should work better than other existing solutions and it’s completely free!


  8. Rob Dawson Says:

    Ironically this came up as number 3 when doing a search for HTML diff tools in 2009 :(…. (Ironic because I’m looking for an HTML diff tool for Java, and I work at Ephox currently :))

