Testing@LMAX – Distributed Builds with Romero

Adrian Sutton

March 30, 2014

LMAX has invested quite a lot of time into building a suite of automated tests to verify the behaviour of our exchange. While the majority of those tests are unit or integration tests that run extremely fast, in order for us to have confidence that the whole package fits together in the right way we have a lot of end-to-end acceptance tests as well.

These tests deliver a huge amount of confidence and are thus highly valuable to us, but they come at a significant cost because end-to-end tests are relatively time consuming to run. To minimise the feedback cycle we want to run these tests in parallel as much as possible.

We started out by simply creating separate groupings of tests, each of which would run in a different Jenkins job and thus run in parallel. However as the set of tests changed over time we kept having to rebalance the groups to ensure we got fast feedback. With jobs finishing at different times they would generally also pick different revisions to run against so we generally weren’t getting any revision that all the tests had run against, reducing confidence in the specific build we picked to release to production each iteration.

To solve this we’ve created custom software to run our acceptance tests which we call Romero. Romero has three parts:

The Romero server itself which coordinates everything
Servers which run a full copy of the exchange
Agents which are allocated a test to run against a specific server

At the start of a test run, the revision to test is deployed to all servers, then Romero loads all the tests for that revision and begins allocating one test to each agent and assigning that agent a server to run against. When an agent finishes running a test it reports the results back to the server and is allocated another test to run. Romero also records information about how long a test takes to run and then uses that to ensure that the longest running tests are allocated first, to prevent them “overhanging” at the end of the run while all the other agents sit around idle.

To make things run even faster, most of our acceptance test suites are able to be run in parallel, running multiple test cases at once and also sharing a single server with multiple Romero agents. Some tests however are testing functions which affect the global state of the server and can’t share servers. Romero is able to identify these types of tests and use that information when allocating agents to servers. Servers are designated as either parallel, supporting multiple agents, or sequential, supporting only a single agent. At the start of the run Romero calculates the optimal way to allocate servers between the two groups, again using historical information about how long each test takes.

All together this gives us an acceptance test environment which is self-balancing – if we add a lot of parallel tests one iteration servers are automatically moved from sequential to parallel to minimise the run time required.

Romero also has one further trick up its sleeve to reduce feedback time – it reports failures as they happen instead of waiting until the end of the run. Often a problematic commit can be reverted before the end of the run, which is a huge reduction in feedback time – normally the next run would already have started with the bad commit still in before anyone noticed the problem, effectively doubling the time required to fix.

The final advantage of Romero is that it seamlessly handles agents dying, even in the middle of running a test and reallocates that test to another agent, giving us better resiliency and keeping the feedback cycle going even in the case of minor problems in CI. Unfortunately we haven’t yet extended this resiliency to the test servers but it’s something we would like to do.