Previously in the Testing@LMAX series I’ve mentioned the way we’ve provided isolation between tests, allowing us to run them in parallel. That isolation extends all the way up to supporting a multi-tenancy module called venues which allows us to essentially run multiple, functionally separate exchanges on a single deployment of the LMAX Exchange.
We use the isolation of venues to reduce the amount of hardware we need to run our three separate liquidity pools (LMAX Professional, LMAX Institutional and LMAX Interbank), but that’s not all. We actually use the isolation venues provide to extend our testing all the way into production.
We have a subset of our acceptance tests which, using venues, are run against the exchange as it is deployed in production, using the same APIs our clients, MTF members and internal staff would use to test that the exchange is fully functional. We have an additional venue on each deployment of the exchange that is used to run these tests. The tests connect to the exchange via the same gateways as our clients (FIX, web, etc) and place real trades that match using the exact same systems and code paths as in the “real” venues. Code-wise there’s nothing special about the test venue, it just so happens that the only external parties that ever connect to it are our testing framework.
We don’t run our full suite of acceptance tests against the live exchange due to the time that would take and to ensure that we don’t affect the performance or latency of the exchange. Plus, we already know the code works correctly because it’s already run through continuous integration. Testing in live is focussed on verifying that the various components of the exchange are hooked up correctly and that the deployment process worked correctly. As such we’ve selected a subset of our tests that exercise the key functions of each of the services that make up the exchange. This includes things like testing that an MTF member can connect and provide prices, that clients can connect via either FIX or web and place orders that match against those prices and that the activity in the exchange is reported out correctly via trade reporting and market data feeds.
We run testing in live as an automated step at the start of our release, prior to making any changes, and again at the end of the release to ensure the release worked properly. If testing in live fails we roll back the release. We also run it automatically throughout the day as one part of our monitoring system, and it is also run manually whenever manual work is done or whenever there is any concern for how the exchange is functioning.
While we have quite a lot of other monitoring systems, the ability to run active monitoring like this against the production exchange, and go as far as actions that change state gives us a significant boost in confidence that everything is working as it should, and helps isolate problems more quickly when things aren’t.
One of the most common reasons people avoid writing end-to-end acceptance tests is how difficult it is to make them run fast. Primary amongst this is the time required to start up the entire service and shut it down again. At LMAX with the full exchange consisting of a large number of different services, multiple databases and other components start up is far too slow to be done for each test so our acceptance tests are designed to run against the same server instance and not interfere with each other.
Most functions in the exchange can be isolated by simply creating one or more accounts and instruments that are unique to the particular test. The test simply starts by using the usual administration APIs to create the users and instruments it needs. Just this basic level of isolation allows us to test a huge amount of the exchange’s functionality – all of the matching behaviour for example. With the tests completely isolated from each other we can run them in parallel against the same server and dramatically reduce the amount of hardware required to run all the acceptance tests in a reasonable time.
But there are many functions in an exchange that cut across instruments and accounts – for example the exchange rates used to convert the profit or loss a user earns in one currency back to their account base currency. Initially tests that needed to control exchange rates could only be run sequentially, each one taking over the entire exchange while they ran and significantly increasing the time required for the test run. More recently however we’ve made the concept of currency completely generic – tests now simply create a unique currencies they can use and are able to set the exchange rates between those currencies without affecting any other tests. This makes our acceptance tests run significantly faster, but also means new currencies can be supported in the exchange without any code changes – just use the administration UI to create the desired currency.
We’ve applied the same approach of creating a completely generic solution even when there is a known set of values in a range of other areas, giving us better test isolation and often making it easier to respond to unexpected future requirements. Sometimes this adds complexity to the code or administration options which could have been avoided but the increased testability is well worth it.
The ultimate level of test isolation however is our support for multiple venues running in a single instance of the exchange. This essentially moves the exchange to a multi-tenancy model, a venue encapsulates all aspects of an exchange, allowing us to test back office reports that track how money moves around the exchange, reconciliation reports that cover all trades and many other functions that report on the state of the exchange as a whole.
With the LMAX Exchange now essentially running in three forms (Professional, Institutional and Interbank) this support for venues is more than just an optimisation for tests – we can co-host different instances of the exchange on the same production hardware, reducing not only the upfront investment required but also the ongoing maintenance costs.
Overall we’ve seen that making features more easily testable (using end-to-end acceptance tests) surprisingly often delivers business benefit making the investment well worth it.
LMAX has invested quite a lot of time into building a suite of automated tests to verify the behaviour of our exchange. While the majority of those tests are unit or integration tests that run extremely fast, in order for us to have confidence that the whole package fits together in the right way we have a lot of end-to-end acceptance tests as well.
These tests deliver a huge amount of confidence and are thus highly valuable to us, but they come at a significant cost because end-to-end tests are relatively time consuming to run. To minimise the feedback cycle we want to run these tests in parallel as much as possible.
We started out by simply creating separate groupings of tests, each of which would run in a different Jenkins job and thus run in parallel. However as the set of tests changed over time we kept having to rebalance the groups to ensure we got fast feedback. With jobs finishing at different times they would generally also pick different revisions to run against so we generally weren’t getting any revision that all the tests had run against, reducing confidence in the specific build we picked to release to production each iteration.
To solve this we’ve created custom software to run our acceptance tests which we call Romero. Romero has three parts:
The Romero server itself which coordinates everything
Servers which run a full copy of the exchange
Agents which are allocated a test to run against a specific server
At the start of a test run, the revision to test is deployed to all servers, then Romero loads all the tests for that revision and begins allocating one test to each agent and assigning that agent a server to run against. When an agent finishes running a test it reports the results back to the server and is allocated another test to run. Romero also records information about how long a test takes to run and then uses that to ensure that the longest running tests are allocated first, to prevent them “overhanging” at the end of the run while all the other agents sit around idle.
To make things run even faster, most of our acceptance test suites are able to be run in parallel, running multiple test cases at once and also sharing a single server with multiple Romero agents. Some tests however are testing functions which affect the global state of the server and can’t share servers. Romero is able to identify these types of tests and use that information when allocating agents to servers. Servers are designated as either parallel, supporting multiple agents, or sequential, supporting only a single agent. At the start of the run Romero calculates the optimal way to allocate servers between the two groups, again using historical information about how long each test takes.
All together this gives us an acceptance test environment which is self-balancing – if we add a lot of parallel tests one iteration servers are automatically moved from sequential to parallel to minimise the run time required.
Romero also has one further trick up its sleeve to reduce feedback time – it reports failures as they happen instead of waiting until the end of the run. Often a problematic commit can be reverted before the end of the run, which is a huge reduction in feedback time – normally the next run would already have started with the bad commit still in before anyone noticed the problem, effectively doubling the time required to fix.
The final advantage of Romero is that it seamlessly handles agents dying, even in the middle of running a test and reallocates that test to another agent, giving us better resiliency and keeping the feedback cycle going even in the case of minor problems in CI. Unfortunately we haven’t yet extended this resiliency to the test servers but it’s something we would like to do.
Firstly, let’s revisit the very high level differences between the exchanger and the disruptor. Peter notes:
This approach has similar principles to the Disruptor. No GC using recycled, pre-allocated buffers and lock free operations (The Exchanger not completely lock free and doesn't busy wait, but it could)
Two keys difference are:
there is only one producer/consumer in this case, the disruptor supports multiple consumers.
this approach re-uses a much smaller buffer efficiently. If you are using ByteBuffer (as I have in the past) an optimal size might be 32 KB. The disruptor library was designed to exploit large amounts of memory on the assumption it is relative cheap and can use medium sized (MBs) to very large buffers (GBs). e.g. it was design for servers with 144 GB. I am sure it works well on much smaller servers. ;)
Actually, there’s nothing about the Disruptor that requires large amounts of memory. If you know that your producers and consumers are going to keep pace with each other well and you don’t have a requirement to replay old events, you can use quite a small ring buffer with the Disruptor. There are a lot of advantages to having a large ring buffer, but it’s by no means a requirement.
It’s also worth noting that the Disruptor does not require consumers to busy-spin, you can choose to use a blocking wait strategy, or strategies that combine busy-spin and blocking to handle both spikes and lulls in event rates efficiently.
There is also an important advantage to the Disruptor that wasn’t mentioned: it will process events immediately if the consumer is keeping up. If the consumer falls behind however, it can process events in a batch to catch up. This significantly reduces latency while still handling spikes in load efficiently.
First let’s start with the LogEntry class. This is a simple value object that is used as our entries on the ring buffer and passed from the producer thread over to the consumer thread.
Peter’s Exchanger based implementation – the use of StringBuilder in the LogEntry class is actually a race condition and not thread safe. Both the publishing side and the consumer side are attempting to modify it and depending on how long it takes the publishing side to write the log message to the StringBuilder, it will potentially be processed and then reset by the consumer side before the publisher is complete. In this implementation I’m instead using a simple String to avoid that problem.
The one Disruptor-specific addition is that we create an EventFactory instance which the Disruptor uses to pre-populate the ring buffer entries.
Next, let’s look at the BackgroundLogger class that sets up the process and acts as the producer.
In the constructor we create an ExecutorService which the Disruptor will use to execute the consumer threads (a single thread in this case), then the disruptor itself. We pass in the LogEntry.FACTORY instance for it to use to create the entries and a size for the ring buffer.
The log method is our producer method. Note the use of two-phase commit. First claim a slot with the ringBuffer.next() method, then copy our values into that slot’s entry and finally publish the slot, ready for the consumer to process. We could have also used the Disruptor.publish method which can make this simpler for many use cases by rolling the two phase commit into call.
The producer doesn’t need to do any batching as the Disruptor will do that automatically if the consumer is falling behind, though there are also APIs that allow batching the producer which can improve the performance if it fits into your design (here it’s probably better to publish each log entry as it comes in).
Note that we don’t need a flush method since the Disruptor is always consuming log events as quickly as the consumer can.
Last of all, the consumer which is almost entirely implementation logic:
The consumer’s onEvent method is called for each LogEntry put into the Disruptor. The endOfBatch flag can be used as a signal to flush written content to disk, allowing very large buffer sizes to be used causing writes to disk to be batched when the consumer is running behind, yet also ensure that our valuable log messages get to disk as quickly as possible.
As of this morning, the Disruptor Wizard has been merged into the core LMAX Disruptor source tree. The .NET port had included the wizard style syntax for quite some time and it seemed to be generally popular, so why make people grab two jars instead of one?
I also updated it to reflect the change in terminology within the Disruptor. Instead of Consumers, there are now EventProcessors and EventHandlers. That better reflects the fact that consumers can actually add additional values to the events. Additionally, the ProducerBarrier has been merged into the ring buffer itself and the ring buffer entries are now called events. Again, that better reflects the fact that the programming model around the disruptor is most often event based.
It doesn’t make much difference for the wizard API, except that:
The consumeWith method has been changed to handleEventsWith
The getProducerBarrier method has been replaced with a start method which returns the ring buffer. This clears up the confusion that the getProducerBarrier function was also used as the trigger to start the event handler threads. Now the method name is explicit about the fact that it will have side-effects.