We all know by now that continuous integration is part of good software development – check in regularly and have a suite of automated tests run to confirm that everything is working as expected. If a test fails, jump on it quickly and get the build back to green. Simple right?
But what happens when something goes wrong in a way that can’t be fixed quickly? For example, the build server has a hardware fault or runs out of disk space. You can’t just rollback the faulty change, its going to take time to get the build back to green. As your CI system grows, it may take time just to understand what went wrong. If your team is small they may all be taken up fixing the problem, but if the team is larger a pair focusses on fixing the build as quickly as possible and the other developers carry on working. Now you have two problems.
You still have the build problem, but now you also have a process problem because you’re no longer doing continuous integration. When things are working well in continuous integration, you have a continuous stream of commits proceeding through the build pipeline. If a bug is introduced the build quickly picks it up and you can identify the problem change easily because it can only be one of a few commits.
On the other hand, if developers keep working while the build is broken, they build up a large backlog of commits which makes it more difficult to identify which revision broke the build. It also makes it significantly harder to resolve the build problem because the code keeps changing and you can easily wind up with multiple build breakages starting to overlap and interact.
To avoid this problem, many companies put up an embargo on commits or close the source tree to prevent any further changes from being committed. This controls change in the build environment and makes it easier to resolve the problem, but it doesn’t prevent the build-up of changes. The result is that when the embargo is lifted, there is a huge swarm of incoming changes all at once, introducing merging problems and making it difficult to identify the culprit if any of them introduce another problem. There could well be multiple problems introduced by that batch of changes with their effects overlapping and interacting making it even harder. Essentially, the longer an embargo is up the greater the chance that it will need to be put back up because of problems in the batch of changes developed during the embargo.
So what’s the answer? Simple – stop working. The team as a whole will go faster if developers simply stop writing code once they reach the point where they would normally commit but can’t because there’s an embargo. For short embargos, most developers won’t be affected at all, but as the embargo lasts longer more and more developers will have to stop work. This feels really bad, but it ensures we keep doing continuous integration and overall benefits the team’s productivity. For build problems that are hard to understand, it also means that gradually more and more developers are available to spitball ideas about what’s wrong and to pick up lines of investigation to help get the build working again.
Also, not coding doesn’t mean that developers can’t do anything at all, maybe now is a good time to do those higher level design sessions and ensure everyone is pushing in the same direction, maybe read up on technology that is either in use but not fully understood, or that could be of benefit if it was introduced. If there are spikes to be played, they can usually still be picked up and worked on, write a blog post (like say, this one). Or even just take an early lunch.
The bottom line is that build breakages are always hugely expensive – pretending that everything is normal and you can continue work when the build system is broken doesn’t make them any less expensive, it just makes you look busier while creating the next problem.