The Google Testing Blog has an article, “Just Say No to More End-to-End Tests”, which winds up being a rather depressing evaluation of the testing capabilities, culture and infrastructure at Google. For example:
Let’s assume the team already has some fantastic test infrastructure in place. Every night:
- The latest version of the service is built.
- This version is then deployed to the team’s testing environment.
- All end-to-end tests then run against this testing environment.
- An email report summarizing the test results is sent to the team.
If your idea of fantastic test infrastructure starts with the words “every night” and ends with an email being sent, you’re doomed. Bryan Pendleton does a good job of analysing and correcting the details, so I won’t cover that ground again. Instead, let me provide a view of what reasonable test infrastructure looks like.
At LMAX we’ve recently reached the milestone of 10,000 end-to-end acceptance tests. We’ve obviously invested a lot of time in building up all those tests, but they’re invaluable in the way they free us to try daring things and make sweeping changes, confident that if anything is broken it will be caught. Because of those end-to-end tests, we’re happy to radically restructure components in ways that require lots of changes to unit tests.
We also have huge numbers of unit tests, integration tests, performance tests, static analysis and various other forms of tests, but the end-to-end tests are far more than a sanity check: they’re a primary form of quality control.
Those end-to-end tests, or acceptance tests as we call them:
- run constantly through the day
- complete in around 50 minutes, including deploying and starting the servers, running all the tests and shutting down again at the end
- are all required to pass before we consider a version releasable
- are included in our information radiators to ensure the team has constant visibility into the test results
- are owned by the whole team – testers, developers and business analysts together
That’s pretty much entry-level for doing end-to-end testing (or frankly any testing). We’ve also got a few extra niceties that I’ve written about before:
- Results are stored in a database to make it easy to query and search
- Tests are isolated so we can run them in parallel to speed things up
- Tests are dynamically distributed across a pool of servers
- We run a subset of them in production to supplement our monitoring
- We can control and test time
Plus the test results are displayed in real time, so we don’t even have to wait for the end of the test run to see any failures. Tests that failed on the previous run are run first to give us quick feedback on whether they’ve been fixed or not.
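To make the scheduling idea concrete, here is a minimal sketch of running previously failed tests first while a pool of workers pulls tests dynamically. It is an illustration under assumptions, not a description of the LMAX tooling: the names `order_for_feedback`, `run_distributed` and `run_test` are hypothetical, and threads stand in for the pool of servers.

```python
import queue
import threading


def order_for_feedback(tests, previously_failed):
    """Put tests that failed on the previous run first; keep the rest in order."""
    # sorted() is stable, so tests keep their relative order within each group.
    # The key is False for previously failed tests, and False sorts before True.
    return sorted(tests, key=lambda name: name not in previously_failed)


def run_distributed(tests, previously_failed, run_test, worker_count=4):
    """Run tests on a pool of workers that each pull the next test when free.

    run_test is a hypothetical callable: it takes a test name and returns
    True on a pass. Threads stand in for servers in this sketch.
    """
    pending = queue.Queue()
    for name in order_for_feedback(tests, previously_failed):
        pending.put(name)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                name = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to pull, so this worker is done
            try:
                passed = bool(run_test(name))
            except Exception:
                passed = False  # treat an error in the test as a failure
            with lock:
                results[name] = passed

    workers = [threading.Thread(target=worker) for _ in range(worker_count)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results


if __name__ == "__main__":
    tests = ["login", "place_order", "cancel_order", "withdraw"]
    print(order_for_feedback(tests, {"cancel_order"}))
    print(run_distributed(tests, {"cancel_order"}, lambda name: True))
```

Because each worker takes the next test only when it is free, slow tests don't hold up a fixed batch, and the previously failed tests are the first ones handed out.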
There’s lots of great stuff in there, but we have more work to do. We have an intermittency problem. When we started out we didn’t believe that intermittency could be avoided and accepted a certain level of breakage on each build, much like the Google post talks about expecting 90% of tests to pass. That attitude is a death knell for test reliability. If you don’t have all the tests passing consistently, then gradually and inevitably more and more tests become intermittent.
We’ve been fighting back hard against intermittency and making excellent progress. We’ve recently added the requirement that releases have no failures, green builds are the norm, and when there are intermittent failures it’s usually only one or two per run. Currently we’re seeing an intermittent failure rate of around 0.00006% of tests run, which actually sounds pretty good, but with 10,000 tests that’s far too many runs with failures that should have been green.
But improvements come in waves, with new intermittency creeping in because it can hide amongst the existing noise. It has taken, and will take, a lot of dedication and commitment to dig ourselves out of the intermittency hole we’re in, but it’s absolutely possible and we will get there.
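To relate a per-test failure rate to whole runs, a rough sketch of the arithmetic follows. It assumes every test fails independently with the same probability, which is a simplification (intermittent failures often cluster), and it reads the quoted figure literally as a percentage. The function names are hypothetical.

```python
import math


def chance_of_red_run(per_test_failure_rate, tests_per_run):
    """Probability that at least one test fails in a run.

    Assumes independent tests with the same failure probability p, so the
    result is 1 - (1 - p)^n. log1p/expm1 keep precision when p is tiny.
    """
    return -math.expm1(tests_per_run * math.log1p(-per_test_failure_rate))


def expected_failures_per_run(per_test_failure_rate, tests_per_run):
    """Average number of failing tests per run under the same assumption."""
    return per_test_failure_rate * tests_per_run


if __name__ == "__main__":
    rate = 0.00006 / 100  # the quoted percentage converted to a fraction
    tests = 10_000
    print(f"chance of a red run: {chance_of_red_run(rate, tests):.4%}")
    print(f"expected failures per run: {expected_failures_per_run(rate, tests):.4f}")
```

The same helper shows how quickly the chance of a red run grows as either the suite or the per-test rate gets bigger, which is why a rate that sounds negligible for one test is not negligible for the build.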
So next time you hear someone try to tell you that end-to-end tests aren’t worth the effort, point them to LMAX. We do end-to-end testing big time and it is massively, indisputably worth it. And we only expect it to become more worth it as we reduce the intermittency and continue improving our tools over time.