Pain vs Pay-Off
Doug’s discovered a way to improve the effectiveness of simian to avoid adding more duplication to a code base:
The solution is very simple. The simian-report is a an xml file, so I wrote a SAX2 DefaultHandler that was able to parse the number of duplications at the different threshold levels. Putting this into a trivial ant task then gave us a task to help make things no worse even at levels below what the simian-check was doing! Within the first week, the new legacy-check was breaking the build (where the simian-check would never have) and focusing the teams attention on how to make things better.
The solution is simple and really cool. I have no doubt that it will have a big impact on the amount of duplication and the overall quality of our code. The trouble is, Simian’s normal reporting is really, really lousy and Doug’s extra check highlights just how useless Simian is at telling you where the duplication is instead of just detecting that there is duplication.
I discovered this new check when the build broke after a simple commit I made that fixed an annoying bug and removed some duplication. Unfortunately, it happened to change the distribution of duplication so that it triggered a simian-legacy failure. I can, after the use of a few choice words, handle failures like that because it shows I could have and probably should have done more to improve the code around the change I made. Someone’s got to go in a clean things up – it might as well be me.
That’s when I encountered Simian’s report. It’s one great big long HTML page listing pretty much every bit of duplication it found broken down into sections of numbers of lines duplicated and it has a pretty useless index at the top. Over VPN it took 15 minutes to download and render. It hung Safari and FireFox for about 5 minutes trying to render it locally. Once it loads you have to sort through the thousands of duplication matches it’s found (matching as little as 2 lines duplicated and with quite liberal settings on what constitutes a match – again, a good thing) to try and find the one new duplication that caused the build failure.
Eventually I discovered the easiest way was divide and conquer. Simian can run even if the code doesn’t compile, so I rolled back each file individually until Simian passed again. Then I patched my changes back in, searched the report for that file name and eliminated as much of the duplication involving that file as I could and reran Simian.
I’m still in two minds as to whether or not the new check is worth the pain it causes. I really like the idea of it, I just really wish there was an Eclipse plugin for Simian that highlighted duplication right there in the code, or at least a HTML report that showed the actual code marked up with duplication. Then again, if you just have to guess at what duplication to remove the code base will get better faster…

May 28th, 2008 at 7:56 pm
A simple solution would be to make sure you run the legacy-check before you commit, in the same way you run tests. Though I do agree that the reporting would be a far more effective solution. It might be possible for the check to keep better track of the changes and use this information to highlight new changes, but I’ll need some more slack time to pull that one off. :)
May 28th, 2008 at 8:32 pm
Running it before you check in doesn’t help – you still have to try and guess where the duplication is.
May 29th, 2008 at 6:39 am
smaller commits would help however ;)
May 29th, 2008 at 10:35 am
My solution is to download the simian XML for the failing build and the last passing build (they’re only about 1mb), then rather than viewing in a browser do a text diff between them. The problem with that is you have to diff between XML files generated on the same machine or the file locations don’t match.
It’s incredibly annoying, and it doesn’t solve trying to fix duplication before you check in – that’s still a crapshoot.
May 29th, 2008 at 4:53 pm
Smaller commits doesn’t help all that much actually – even once you narrow it down to one file it’s still a crap shoot as to what to fix and some things just need changes to multiple files. :)
Yeah I did the diff thing yesterday with pretty limited success though it was an edge case because a lot of files had very small changes so line numbers changed in existing duplication all over the place.
It’s a cool idea though and it is definitely working to improve the code base – just got to get a decent report out of it somehow.