Living in a state of accord.

Making End-to-End Tests Work

The Google Testing Blog has an article “Just Say No to More End-to-End Tests” which winds up being a rather depressing evaluation of the testing capabilities, culture and infrastructure at Google. For example:

Let’s assume the team already has some fantastic test infrastructure in place. Every night:

  1. The latest version of the service is built. 
  2. This version is then deployed to the team’s testing environment. 
  3. All end-to-end tests then run against this testing environment. 
  4. An email report summarizing the test results is sent to the team.

If your idea of fantastic test infrastructure starts with the words “every night” and ends with an email being sent you’re doomed. Bryan Pendleton does a good job of analysing and correcting the details so I won’t recover that ground. Instead, let me provide a view of what reasonable test infrastructure looks like.

At LMAX we’ve recently reached the milestone of 10,000 end to end acceptance tests. We’ve obviously invested a lot of time in building up all those tests but they’re invaluable in the way they free us to try daring things and make sweeping changes, confident that if anything is broken it will be caught. We’re happy to radically restructure components in ways that require lots of changes to unit tests because of those end-to-end tests.

We also have huge numbers of unit tests, integration tests, performance tests, static analysis and various other forms of tests, but the end-to-end tests are far more than a sanity-check, they’re a primary form of quality control.

Those end-to-end tests, or acceptance tests as we call them:

  • run constantly through the day
  • complete in around 50 minutes, including deploying and starting the servers, running all the tests and shutting down again at the end
  • are all required to pass before we consider a version releasable
  • are included in our information radiators to ensure the team has constant visibility into the test results
  • are owned by the whole team – testers, developers and business analysts together

That’s pretty much entry-level for doing end-to-end testing (or frankly any testing). We’ve also got a few extra niceties that I’ve written about before:

Plus the test results are displayed in real-time, so we don’t even have to wait for the end of the test run to see any failures. Tests that failed on the previous run are run first to give us quick feedback on whether they’ve been fixed or not.

There’s lots of great stuff in there, but we have more work to do. We have an intermittency problem. When we started out we didn’t believe that intermittency could be avoided and accepted a certain level of breakage on each build – much like the Google post talks about expecting 90% of tests to pass. That attitude is a death-knell for test reliability. If you don’t have all the tests passing consistently, gradually and inevitably more and more tests become intermittent over time.

We’ve been fighting back hard against intermittency and making excellent progress – we’ve recently added the requirement that releases have no failures and green builds are the norm and if there are intermittent failures it’s usually only one or two per run. Currently we’re seeing an intermittent failure rate of around 0.00006% of tests run (which actually sounds pretty good but with 10,000 tests that’s far too many runs with failures that should have been green).

But improvements come in waves with new intermittency creeping in because it can hide in amongst the existing noise. It has taken and will take a lot of dedication and commitment to dig ourselves out of the intermittency hole we’re in but it’s absolutely possible and we will get there.

So next time you hear someone try to tell you that end-to-end tests aren’t worth the effort, point them to LMAX. We do end-to-end testing big time and it is massively, indisputably worth it. And we only expect it to become more worth it as we reduce the intermittency and continue improving our tools over time.

Testing@LMAX – Testing in Live

Previously in the Testing@LMAX series I’ve mentioned the way we’ve provided isolation between tests, allowing us to run them in parallel. That isolation extends all the way up to supporting a multi-tenancy module called venues which allows us to essentially run multiple, functionally separate exchanges on a single deployment of the LMAX Exchange.

We use the isolation of venues to reduce the amount of hardware we need to run our three separate liquidity pools (LMAX Professional, LMAX Institutional and LMAX Interbank), but that’s not all. We actually use the isolation venues provide to extend our testing all the way into production.

We have a subset of our acceptance tests which, using venues, are run against the exchange as it is deployed in production, using the same APIs our clients, MTF members and internal staff would use to test that the exchange is fully functional. We have an additional venue on each deployment of the exchange that is used to run these tests. The tests connect to the exchange via the same gateways as our clients (FIX, web, etc) and place real trades that match using the exact same systems and code paths as in the “real” venues. Code-wise there’s nothing special about the test venue, it just so happens that the only external parties that ever connect to it are our testing framework.

We don’t run our full suite of acceptance tests against the live exchange due to the time that would take and to ensure that we don’t affect the performance or latency of the exchange. Plus, we already know the code works correctly because it’s already run through continuous integration. Testing in live is focussed on verifying that the various components of the exchange are hooked up correctly and that the deployment process worked correctly. As such we’ve selected a subset of our tests that exercise the key functions of each of the services that make up the exchange. This includes things like testing that an MTF member can connect and provide prices, that clients can connect via either FIX or web and place orders that match against those prices and that the activity in the exchange is reported out correctly via trade reporting and market data feeds.

We run testing in live as an automated step at the start of our release, prior to making any changes, and again at the end of the release to ensure the release worked properly. If testing in live fails we roll back the release. We also run it automatically throughout the day as one part of our monitoring system, and it is also run manually whenever manual work is done or whenever there is any concern for how the exchange is functioning.

While we have quite a lot of other monitoring systems, the ability to run active monitoring like this against the production exchange, and go as far as actions that change state gives us a significant boost in confidence that everything is working as it should, and helps isolate problems more quickly when things aren’t.

Testing@LMAX – Test Isolation

One of the most common reasons people avoid writing end-to-end acceptance tests is how difficult it is to make them run fast. Primary amongst this is the time required to start up the entire service and shut it down again. At LMAX with the full exchange consisting of a large number of different services, multiple databases and other components start up is far too slow to be done for each test so our acceptance tests are designed to run against the same server instance and not interfere with each other.

Most functions in the exchange can be isolated by simply creating one or more accounts and instruments that are unique to the particular test. The test simply starts by using the usual administration APIs to create the users and instruments it needs. Just this basic level of isolation allows us to test a huge amount of the exchange’s functionality – all of the matching behaviour for example. With the tests completely isolated from each other we can run them in parallel against the same server and dramatically reduce the amount of hardware required to run all the acceptance tests in a reasonable time.

But there are many functions in an exchange that cut across instruments and accounts – for example the exchange rates used to convert the profit or loss a user earns in one currency back to their account base currency. Initially tests that needed to control exchange rates could only be run sequentially, each one taking over the entire exchange while they ran and significantly increasing the time required for the test run. More recently however we’ve made the concept of currency completely generic – tests now simply create a unique currencies they can use and are able to set the exchange rates between those currencies without affecting any other tests. This makes our acceptance tests run significantly faster, but also means new currencies can be supported in the exchange without any code changes – just use the administration UI to create the desired currency.

We’ve applied the same approach of creating a completely generic solution even when there is a known set of values in a range of other areas, giving us better test isolation and often making it easier to respond to unexpected future requirements. Sometimes this adds complexity to the code or administration options which could have been avoided but the increased testability is well worth it.

The ultimate level of test isolation however is our support for multiple venues running in a single instance of the exchange. This essentially moves the exchange to a multi-tenancy model, a venue encapsulates all aspects of an exchange, allowing us to test back office reports that track how money moves around the exchange, reconciliation reports that cover all trades and many other functions that report on the state of the exchange as a whole.

With the LMAX Exchange now essentially running in three forms (Professional, Institutional and Interbank) this support for venues is more than just an optimisation for tests – we can co-host different instances of the exchange on the same production hardware, reducing not only the upfront investment required but also the ongoing maintenance costs.

Overall we’ve seen that making features more easily testable (using end-to-end acceptance tests) surprisingly often delivers business benefit making the investment well worth it.

Testing@LMAX – Distributed Builds with Romero

LMAX has invested quite a lot of time into building a suite of automated tests to verify the behaviour of our exchange. While the majority of those tests are unit or integration tests that run extremely fast, in order for us to have confidence that the whole package fits together in the right way we have a lot of end-to-end acceptance tests as well.

These tests deliver a huge amount of confidence and are thus highly valuable to us, but they come at a significant cost because end-to-end tests are relatively time consuming to run. To minimise the feedback cycle we want to run these tests in parallel as much as possible.

We started out by simply creating separate groupings of tests, each of which would run in a different Jenkins job and thus run in parallel. However as the set of tests changed over time we kept having to rebalance the groups to ensure we got fast feedback. With jobs finishing at different times they would generally also pick different revisions to run against so we generally weren’t getting any revision that all the tests had run against, reducing confidence in the specific build we picked to release to production each iteration.

To solve this we’ve created custom software to run our acceptance tests which we call Romero. Romero has three parts:

  • The Romero server itself which coordinates everything
  • Servers which run a full copy of the exchange 
  • Agents which are allocated a test to run against a specific server

At the start of a test run, the revision to test is deployed to all servers, then Romero loads all the tests for that revision and begins allocating one test to each agent and assigning that agent a server to run against. When an agent finishes running a test it reports the results back to the server and is allocated another test to run. Romero also records information about how long a test takes to run and then uses that to ensure that the longest running tests are allocated first, to prevent them “overhanging” at the end of the run while all the other agents sit around idle.

To make things run even faster, most of our acceptance test suites are able to be run in parallel, running multiple test cases at once and also sharing a single server with multiple Romero agents. Some tests however are testing functions which affect the global state of the server and can’t share servers. Romero is able to identify these types of tests and use that information when allocating agents to servers.  Servers are designated as either parallel, supporting multiple agents, or sequential, supporting only a single agent. At the start of the run Romero calculates the optimal way to allocate servers between the two groups, again using historical information about how long each test takes.

All together this gives us an acceptance test environment which is self-balancing – if we add a lot of parallel tests one iteration servers are automatically moved from sequential to parallel to minimise the run time required.

Romero also has one further trick up its sleeve to reduce feedback time – it reports failures as they happen instead of waiting until the end of the run. Often a problematic commit can be reverted before the end of the run, which is a huge reduction in feedback time – normally the next run would already have started with the bad commit still in before anyone noticed the problem, effectively doubling the time required to fix.

The final advantage of Romero is that it seamlessly handles agents dying, even in the middle of running a test and reallocates that test to another agent, giving us better resiliency and keeping the feedback cycle going even in the case of minor problems in CI. Unfortunately we haven’t yet extended this resiliency to the test servers but it’s something we would like to do.

Background Logging with the Disruptor

Peter Lawrey posted an example of using the Exchanger class from core Java to implement a background logging implementation. He briefly compared it to the LMAX disruptor and since someone requested it, I thought it might be interesting to show a similar implementation using the disruptor.

Firstly, let’s revisit the very high level differences between the exchanger and the disruptor. Peter notes:

This approach has similar principles to the Disruptor. No GC using recycled, pre-allocated buffers and lock free operations (The Exchanger not completely lock free and doesn't busy wait, but it could)

Two keys difference are:

  • there is only one producer/consumer in this case, the disruptor supports multiple consumers.
  • this approach re-uses a much smaller buffer efficiently. If you are using ByteBuffer (as I have in the past) an optimal size might be 32 KB. The disruptor library was designed to exploit large amounts of memory on the assumption it is relative cheap and can use medium sized (MBs) to very large buffers (GBs). e.g. it was design for servers with 144 GB. I am sure it works well on much smaller servers. ;)

Actually, there’s nothing about the Disruptor that requires large amounts of memory. If you know that your producers and consumers are going to keep pace with each other well and you don’t have a requirement to replay old events, you can use quite a small ring buffer with the Disruptor. There are a lot of advantages to having a large ring buffer, but it’s by no means a requirement.

It’s also worth noting that the Disruptor does not require consumers to busy-spin, you can choose to use a blocking wait strategy, or strategies that combine busy-spin and blocking to handle both spikes and lulls in event rates efficiently.

There is also an important advantage to the Disruptor that wasn’t mentioned: it will process events immediately if the consumer is keeping up. If the consumer falls behind however, it can process events in a batch to catch up. This significantly reduces latency while still handling spikes in load efficiently.

The Code

First let’s start with the LogEntry class. This is a simple value object that is used as our entries on the ring buffer and passed from the producer thread over to the consumer thread.

Peter’s Exchanger based implementation – the use of StringBuilder in the LogEntry class is actually a race condition and not thread safe. Both the publishing side and the consumer side are attempting to modify it and depending on how long it takes the publishing side to write the log message to the StringBuilder, it will potentially be processed and then reset by the consumer side before the publisher is complete. In this implementation I’m instead using a simple String to avoid that problem.

The one Disruptor-specific addition is that we create an EventFactory instance which the Disruptor uses to pre-populate the ring buffer entries.

Next, let’s look at the BackgroundLogger class that sets up the process and acts as the producer.

In the constructor we create an ExecutorService which the Disruptor will use to execute the consumer threads (a single thread in this case), then the disruptor itself. We pass in the LogEntry.FACTORY instance for it to use to create the entries and a size for the ring buffer.

The log method is our producer method. Note the use of two-phase commit. First claim a slot with the method, then copy our values into that slot’s entry and finally publish the slot, ready for the consumer to process. We could have also used the Disruptor.publish method which can make this simpler for many use cases by rolling the two phase commit into call.

The producer doesn’t need to do any batching as the Disruptor will do that automatically if the consumer is falling behind, though there are also APIs that allow batching the producer which can improve the performance if it fits into your design (here it’s probably better to publish each log entry as it comes in).

The stop method uses the new shutdown method on the Disruptor which takes care of waiting until all consumers have processed all available entries for you, though the code for doing it yourself is quite straight-forward. Finally we shut down the executor.

Note that we don’t need a flush method since the Disruptor is always consuming log events as quickly as the consumer can.

Last of all, the consumer which is almost entirely implementation logic:

The consumer’s onEvent method is called for each LogEntry put into the Disruptor. The endOfBatch flag can be used as a signal to flush written content to disk, allowing very large buffer sizes to be used causing writes to disk to be batched when the consumer is running behind, yet also ensure that our valuable log messages get to disk as quickly as possible.

The full code is available as a Gist.