Symphonious

Living in a state of accord.

Testing@LMAX – Test Isolation

One of the most common reasons people avoid writing end-to-end acceptance tests is how difficult it is to make them run fast. Chief amongst the causes is the time required to start up the entire service and shut it down again. At LMAX, with the full exchange consisting of a large number of different services, multiple databases and other components, start up is far too slow to be done for each test, so our acceptance tests are designed to run against the same server instance without interfering with each other.

Most functions in the exchange can be isolated by simply creating one or more accounts and instruments that are unique to the particular test. The test starts by using the usual administration APIs to create the users and instruments it needs. Just this basic level of isolation allows us to test a huge amount of the exchange’s functionality – all of the matching behaviour, for example. With the tests completely isolated from each other we can run them in parallel against the same server and dramatically reduce the amount of hardware required to run all the acceptance tests in a reasonable time.
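As a rough sketch of the pattern (the API and class names here are illustrative only, not our actual DSL or administration API), the heart of it is just a per-test unique suffix applied to everything the test creates:

```java
import java.util.UUID;

// Hypothetical, much-simplified view of the administration API a test drives;
// the real API is far richer. Shown only to illustrate the isolation pattern.
interface ExchangeAdminApi {
    void createAccount(String name);
    void createInstrument(String name);
}

class IsolatedTestFixture {
    // Every name gets a per-test unique suffix, so tests running in parallel
    // against the same server never see each other's accounts or instruments.
    private final String suffix = UUID.randomUUID().toString().substring(0, 8);

    void createFixtures(ExchangeAdminApi admin) {
        admin.createAccount("buyer-" + suffix);
        admin.createAccount("seller-" + suffix);
        admin.createInstrument("instrument-" + suffix);
    }
}
```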

But there are many functions in an exchange that cut across instruments and accounts – for example the exchange rates used to convert the profit or loss a user earns in one currency back to their account’s base currency. Initially, tests that needed to control exchange rates could only be run sequentially, each one taking over the entire exchange while it ran and significantly increasing the time required for the test run. More recently, however, we’ve made the concept of currency completely generic – tests now simply create unique currencies to use and can set the exchange rates between those currencies without affecting any other tests. This not only makes our acceptance tests run significantly faster, it also means new currencies can be supported in the exchange without any code changes – just use the administration UI to create the desired currency.

We’ve applied the same approach of creating a completely generic solution, even when there is a known set of values, in a range of other areas, giving us better test isolation and often making it easier to respond to unexpected future requirements. Sometimes this adds complexity to the code or administration options which could have been avoided, but the increased testability is well worth it.

The ultimate level of test isolation, however, is our support for multiple venues running in a single instance of the exchange. This essentially moves the exchange to a multi-tenancy model: a venue encapsulates all aspects of an exchange, allowing us to test back office reports that track how money moves around the exchange, reconciliation reports that cover all trades, and many other functions that report on the state of the exchange as a whole.

With the LMAX Exchange now essentially running in three forms (Professional, Institutional and Interbank) this support for venues is more than just an optimisation for tests – we can co-host different instances of the exchange on the same production hardware, reducing not only the upfront investment required but also the ongoing maintenance costs.

Overall we’ve seen that making features more easily testable (using end-to-end acceptance tests) surprisingly often delivers business benefit, making the investment well worth it.

Testing@LMAX – Time Travel and the TARDIS

Testing time related functions is always a challenge – generally it involves adding some form of abstraction over the system clock which can then be stubbed, mocked or otherwise controlled by unit tests in order to test the functionality. At LMAX we like the confidence that end-to-end acceptance tests give us but, like most financial systems, a significant amount of our functionality is highly time dependent so we need the same kind of control over time but in a way that works even when the system is running as a whole (which means it’s running in multiple different JVMs or possibly even on different servers).

We’ve achieved that by building on the same abstracted clock as is used in unit tests but exposing it in a system-wide, distributed way. To stay as close as possible to real-world conditions we accept some reduced control in acceptance tests; in particular, time always progresses forward – there’s no pause button. However, we do have the ability to travel forward in time so that we can quickly test scenarios that span multiple days, weeks or even months. When running acceptance tests, the system clock uses a “time travel” implementation. Initially this clock simply returns the current system time, but it also listens for special time messages on the system’s messaging bus. When one of these time messages is received, the clock calculates the difference between the time specified in the message and the current system clock time and records it. From then on, whenever it’s asked for the time, the clock adds that difference to the current system time. As a result, when a time message is received time immediately jumps forward to the specified time and then continues advancing at the same rate as the system clock.
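A minimal sketch of such a clock (an assumed shape, not our actual implementation) is just an offset applied on top of the system clock:

```java
import java.util.concurrent.atomic.AtomicLong;

// Normally mirrors the system clock; when a time message arrives it records the
// offset and applies it from then on, so time jumps forward once and then keeps
// advancing at the normal rate.
public class TimeTravelClock {
    private final AtomicLong offsetMillis = new AtomicLong(0);

    public long currentTimeMillis() {
        return System.currentTimeMillis() + offsetMillis.get();
    }

    // Called when a time message is received from the messaging bus.
    public void onTimeMessage(long requestedTimeMillis) {
        offsetMillis.set(requestedTimeMillis - System.currentTimeMillis());
    }
}
```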

Like all good schedulers, ours are written in a way that ensures events fire in the correct order even if time suddenly jumps forward past the point at which they should have triggered. So receiving a time message not only jumps time forward, it also triggers all the events that should have fired during the period we skipped, allowing us to test that they did their job correctly.
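In outline (again a simplified sketch rather than the real scheduler), that just means keeping tasks ordered by due time and draining everything that falls on or before the new time whenever the clock advances:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Tasks are kept ordered by due time; advancing the clock (including a jump
// triggered by a time message) fires every task that has become due, in order.
public class JumpTolerantScheduler {
    private record ScheduledTask(long dueTimeMillis, Runnable task) {}

    private final PriorityQueue<ScheduledTask> tasks =
            new PriorityQueue<>(Comparator.comparingLong(ScheduledTask::dueTimeMillis));

    public void schedule(long dueTimeMillis, Runnable task) {
        tasks.add(new ScheduledTask(dueTimeMillis, task));
    }

    public void advanceTo(long nowMillis) {
        while (!tasks.isEmpty() && tasks.peek().dueTimeMillis() <= nowMillis) {
            tasks.poll().task().run();
        }
    }
}
```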

The time messages are published by a time travel service which is only run in our acceptance test environment – it exposes a JMX method which our acceptance tests use to set the current system time. Each service that uses time also exposes its current time, and the time its schedulers have reached, via JMX, so when a test time travels we can wait until the message has been received by each service and all the scheduled events have finished running.
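The shape of that JMX surface, and the wait the tests perform, looks roughly like this (the bean and method names are assumptions for illustration, not our actual beans):

```java
import java.util.List;

interface TimeTravelServiceMBean {
    void travelTo(long targetTimeMillis);   // publishes a time message to the whole system
}

interface TimeAwareServiceMBean {
    long getCurrentTimeMillis();            // the service's current (possibly travelled) time
    long getSchedulerTimeMillis();          // how far the service's schedulers have caught up
}

class TimeTravelDriver {
    // Travel, then wait until every service has seen the new time and finished
    // running all the scheduled events the jump skipped over.
    static void travelAndWait(TimeTravelServiceMBean timeTravel,
                              List<TimeAwareServiceMBean> services,
                              long targetTimeMillis) throws InterruptedException {
        timeTravel.travelTo(targetTimeMillis);
        for (TimeAwareServiceMBean service : services) {
            while (service.getCurrentTimeMillis() < targetTimeMillis
                    || service.getSchedulerTimeMillis() < targetTimeMillis) {
                Thread.sleep(100);
            }
        }
    }
}
```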

The TARDIS

The trouble with controlling time like this is that it affects the entire system, so we can’t run multiple tests at the same time without them interfering with each other. Having to run tests sequentially significantly increases the feedback cycle. To solve this we added the TARDIS to the DSL that runs our acceptance tests. The TARDIS provides a central point of control for multiple test cases running in parallel, coordinating time travel so that the tests all move forward together, without the actual test code needing to care about any of the details or synchronisation.

The TARDIS hooks into the DSL at two points – when a test asks to time travel and when a test finishes (by either passing or failing). When a test asks to time travel, the TARDIS records the destination time being requested and blocks the test until all tests are either ready to time travel or have completed. It then time travels to the earliest requested time and wakes up any tests that requested that time point so they can continue running. Tests that requested a time point further in the future remain paused, waiting for the next time travel.
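The coordination logic boils down to something like the following (a deliberately simplified sketch rather than the real thing – the actual TARDIS publishes a time message and waits for every service to catch up rather than just updating a field):

```java
import java.util.HashMap;
import java.util.Map;

public class Tardis {
    private final Map<String, Long> requestedTimes = new HashMap<>();
    private int runningTests = 0;
    private long currentTime;

    public Tardis(long startTime) {
        this.currentTime = startTime;
    }

    public synchronized void testStarted(String testName) {
        runningTests++;
    }

    public synchronized void testFinished(String testName) {
        runningTests--;
        requestedTimes.remove(testName);
        travelIfEveryoneIsReady();
    }

    // Blocks the calling test until the system has travelled to (or past) its requested time.
    public synchronized void timeTravelTo(String testName, long targetTime) throws InterruptedException {
        requestedTimes.put(testName, targetTime);
        travelIfEveryoneIsReady();
        while (currentTime < targetTime) {
            wait();
        }
    }

    private void travelIfEveryoneIsReady() {
        // Only travel once every running test is either waiting to travel or has finished.
        if (!requestedTimes.isEmpty() && requestedTimes.size() == runningTests) {
            currentTime = requestedTimes.values().stream()
                    .mapToLong(Long::longValue).min().getAsLong();
            requestedTimes.values().removeIf(time -> time <= currentTime);
            notifyAll();   // wake the tests whose requested time has now been reached
        }
    }
}
```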

Since we had a lot of time travel tests already written before we invented the TARDIS, this approach allowed us to start running them in parallel without having to rewrite them – the TARDIS is simply integrated into the DSL framework we use for all tests.

Currently the TARDIS only works for tests running in the same JVM, so essentially it allows test cases to run in parallel with other cases from the same test suite, but it doesn’t allow multiple test suites on separate Romero agents to run in parallel. The next step in its evolution will be to move the TARDIS out of the test’s DSL and provide it as an API from the time travel service on the server. At that point we can run multiple test suites in parallel against the same server. However, we haven’t yet done the research to determine what, if any, benefit we’d get from that change as different test suites may have very different time travel patterns and thus spend most of their time at interim time points waiting for other tests. Also the load on servers during time travel is quite high due to the number of scheduled jobs that can fire so running multiple test suites at once may not be viable.

* Time being in sync is actually a more complex concept than it first appears. The overall architecture of our system meant this approach to time actually did provide very accurate time sources relative to our main “source of all truth”, the exchange venue itself, which is what we really cared about. Even so, anything that had to be strictly “in sync” generated a timestamp in the service that triggered it and included that timestamp in the outgoing event, which is the only sane way to do such things.

Testing@LMAX – Distributed Builds with Romero

LMAX has invested quite a lot of time into building a suite of automated tests to verify the behaviour of our exchange. While the majority of those tests are unit or integration tests that run extremely fast, in order for us to have confidence that the whole package fits together in the right way we have a lot of end-to-end acceptance tests as well.

These tests deliver a huge amount of confidence and are thus highly valuable to us, but they come at a significant cost because end-to-end tests are relatively time consuming to run. To minimise the feedback cycle we want to run these tests in parallel as much as possible.

We started out by simply creating separate groupings of tests, each of which would run in a different Jenkins job and thus run in parallel. However, as the set of tests changed over time we kept having to rebalance the groups to keep feedback fast. With jobs finishing at different times they would also tend to pick different revisions to run against, so we rarely had a single revision that all the tests had run against, reducing confidence in the specific build we picked to release to production each iteration.

To solve this we’ve created custom software, which we call Romero, to run our acceptance tests. Romero has three parts:

  • The Romero server itself which coordinates everything
  • Servers which run a full copy of the exchange 
  • Agents which are allocated a test to run against a specific server

At the start of a test run, the revision to test is deployed to all servers, then Romero loads all the tests for that revision and begins allocating one test to each agent, assigning that agent a server to run against. When an agent finishes running a test it reports the results back to the server and is allocated another test to run. Romero also records how long each test takes to run and uses that information to ensure that the longest-running tests are allocated first, to prevent them “overhanging” at the end of the run while all the other agents sit around idle.
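The allocation itself can be as simple as a queue ordered by historical run time, handed out as agents become free – a sketch of the idea, not Romero’s actual code:

```java
import java.util.ArrayDeque;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Queue;

public class TestAllocator {
    private final Queue<String> pending;

    public TestAllocator(List<String> tests, Map<String, Long> historicalRunTimeMillis) {
        // Longest-running tests first; tests with no history are treated as
        // potentially long so they also run early rather than overhanging the end.
        this.pending = new ArrayDeque<>(tests.stream()
                .sorted(Comparator.comparingLong(
                        test -> -historicalRunTimeMillis.getOrDefault(test, Long.MAX_VALUE)))
                .toList());
    }

    // Called by an agent whenever it is idle; returns null once the run is complete.
    public synchronized String nextTest() {
        return pending.poll();
    }
}
```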

To make things run even faster, most of our acceptance test suites can be run in parallel, running multiple test cases at once and also sharing a single server between multiple Romero agents. Some tests, however, exercise functions which affect the global state of the server and can’t share it. Romero is able to identify these types of tests and use that information when allocating agents to servers. Servers are designated as either parallel, supporting multiple agents, or sequential, supporting only a single agent. At the start of the run Romero calculates the optimal way to allocate servers between the two groups, again using historical information about how long each test takes.

Altogether, this gives us an acceptance test environment which is self-balancing – if we add a lot of parallel tests one iteration, servers are automatically moved from the sequential group to the parallel group to minimise the overall run time.

Romero also has one further trick up its sleeve to reduce feedback time – it reports failures as they happen instead of waiting until the end of the run. Often a problematic commit can be reverted before the end of the run, which is a huge reduction in feedback time – normally the next run would already have started with the bad commit still in before anyone noticed the problem, effectively doubling the time required to fix it.

The final advantage of Romero is that it seamlessly handles agents dying, even in the middle of running a test, by reallocating that test to another agent, giving us better resiliency and keeping the feedback cycle going despite minor problems in CI. Unfortunately we haven’t yet extended this resiliency to the test servers, but it’s something we would like to do.

Testing@LMAX – Test Results Database

One of the things we tend to take for granted a bit at LMAX is that we store the results of our acceptance test runs in a database to make them easy to analyse later. We not only store whether each test passed or failed for each revision, but also the failure message if it failed, when it ran, how long it took, what server it ran on and a bunch of other information.

Having this info somewhere that’s easy to query lets us perform some fairly advanced analysis on our test data. For example we can find tests that fail when run after 5pm New York (an important cutoff time in the world of finance) or around midnight (in various timezones). It has also allowed us to identify subtly faulty hardware based on the increased rate of test failures.
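For example, finding the tests that tend to fail after the New York cut-off is a single query. The schema and connection details below are assumptions for illustration (a test_results table with test_name, passed and run_at columns), not our actual database:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class LateNewYorkFailures {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection string and schema, shown only to illustrate the kind of analysis.
        try (Connection db = DriverManager.getConnection("jdbc:postgresql://ci-db/testresults");
             PreparedStatement query = db.prepareStatement(
                     "SELECT test_name, COUNT(*) AS failures " +
                     "FROM test_results " +
                     "WHERE passed = false " +
                     "  AND EXTRACT(HOUR FROM run_at AT TIME ZONE 'America/New_York') >= 17 " +
                     "GROUP BY test_name " +
                     "ORDER BY failures DESC");
             ResultSet rows = query.executeQuery()) {
            while (rows.next()) {
                System.out.println(rows.getString("test_name") + ": " + rows.getLong("failures"));
            }
        }
    }
}
```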

In our case we have custom software that distributes our acceptance tests across the available hardware, so it records the results directly to the database; however, we have also parsed JUnit XML reports and imported the results into the database that way.

However you get the data, having a historical record of test results in a form that’s easy to query is a surprisingly powerful tool and worth the relatively small investment to set up.

Revert First, Ask Questions Later

The key to making continuous integration work well is to keep the build green – ensuring that the team always knows that if something doesn’t work, it was their change that broke it. However, in any environment with a build pipeline beyond a simple commit build – for example integration, acceptance or deployment tests – sometimes things will break.

When that happens, there is always a natural tendency to commit an additional change that will fix it. That’s almost always a mistake.

The correct approach is to revert first and ask questions later. It doesn’t matter how small the fix might be: there’s a risk that it will fail or introduce some other problem and extend the period that the tests are broken. However, since we know the last build worked correctly, reverting the change is guaranteed to return things to working order. The developers working on the problematic change can then take their time to develop and test the fix, then recommit everything together.

Reverting a commit isn’t a slight against its developers and doesn’t even suggest the changes being made are bad, merely that some detail hasn’t yet been completed and so it’s not yet ready to ship. Having a culture where that’s understood and accepted is an important step in delivering quality software.