Symphonious

Living in a state of accord.

Testing@LMAX – Time Travel and the TARDIS

Testing time-related functions is always a challenge – generally it involves adding some form of abstraction over the system clock which can then be stubbed, mocked or otherwise controlled by unit tests. At LMAX we like the confidence that end-to-end acceptance tests give us but, like most financial systems, a significant amount of our functionality is highly time-dependent, so we need the same kind of control over time in a way that works even when the system is running as a whole (which means it’s running in multiple different JVMs, possibly even on different servers).
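For illustration, the clock abstraction can be as small as the sketch below – the Clock and SystemClock names are mine, not LMAX’s actual API:

```java
// Illustrative only: a minimal clock abstraction that unit tests can stub
// and that, as described below, acceptance tests can manipulate system-wide.
interface Clock
{
    long currentTimeMillis();
}

// Production implementation: simply delegates to the real system clock.
class SystemClock implements Clock
{
    @Override
    public long currentTimeMillis()
    {
        return System.currentTimeMillis();
    }
}
```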

We’ve achieved that by building on the same abstracted clock as is used in unit tests, but exposing it in a system-wide, distributed way. To stay as close as possible to real-world conditions we accept some reduced control in acceptance tests; in particular, time always progresses forward – there’s no pause button. However, we do have the ability to travel forward in time, so we can quickly test scenarios that span multiple days, weeks or even months. When running acceptance tests, the system clock uses a “time travel” implementation. Initially this clock simply returns the current system time, but it also listens for special time messages on the system’s messaging bus. When one of these time messages is received, the clock calculates the difference between the time specified in the message and the current system clock time and records it as an offset. From then on, whenever it’s asked for the time, the clock adds that offset to the current system time. As a result, when a time message is received time immediately jumps forward to the specified time and then continues advancing at the same rate as the system clock.
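A minimal sketch of that behaviour, building on the Clock interface above (the wiring that subscribes to the messaging bus is elided, and the names are again illustrative):

```java
// Hypothetical "time travel" clock: it normally reports real system time,
// but when a time message arrives it records the offset between the
// requested time and the system clock and applies that offset from then
// on, so time jumps forward and keeps advancing at the normal rate.
class TimeTravellingClock implements Clock
{
    private volatile long offsetMillis = 0;

    @Override
    public long currentTimeMillis()
    {
        return System.currentTimeMillis() + offsetMillis;
    }

    // Called when a time message is received from the messaging bus.
    public void onTimeMessage(final long requestedTimeMillis)
    {
        offsetMillis = requestedTimeMillis - System.currentTimeMillis();
    }
}
```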

Like all good schedulers, ours are written so that events fire in the correct order even if time suddenly jumps past the point at which an event should have triggered. So receiving a time message not only jumps time forward, it also triggers all the events that should have fired during the skipped period, allowing us to test that they did their job correctly.
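A simplified, single-threaded sketch of a scheduler with that property – on each tick it drains, in due order, every event whose time has passed, so a sudden jump forward fires everything from the skipped period:

```java
// Illustrative only: events are kept in due-time order, and every tick
// fires all events whose due time has passed according to the (possibly
// time-travelled) clock, in the order they should have fired.
import java.util.Comparator;
import java.util.PriorityQueue;

class CatchUpScheduler
{
    private final Clock clock;
    private final PriorityQueue<ScheduledEvent> events =
            new PriorityQueue<>(Comparator.comparingLong((ScheduledEvent e) -> e.dueAtMillis));

    CatchUpScheduler(final Clock clock)
    {
        this.clock = clock;
    }

    void schedule(final long dueAtMillis, final Runnable task)
    {
        events.add(new ScheduledEvent(dueAtMillis, task));
    }

    // Called on every tick; after a time jump this drains, in order,
    // everything that should have fired during the skipped period.
    void tick()
    {
        final long now = clock.currentTimeMillis();
        while (!events.isEmpty() && events.peek().dueAtMillis <= now)
        {
            events.poll().task.run();
        }
    }

    private static final class ScheduledEvent
    {
        final long dueAtMillis;
        final Runnable task;

        ScheduledEvent(final long dueAtMillis, final Runnable task)
        {
            this.dueAtMillis = dueAtMillis;
            this.task = task;
        }
    }
}
```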

The time messages are published by a time travel service which is only run in our acceptance test environment – it exposes a JMX method which our acceptance tests use to set the current system time. Each service that uses time also exposes its current time and the time its schedulers have reached via JMX, so when a test time travels we can wait until the message has been received by each service and all the scheduled events have finished running.
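The JMX surface might look something like the following sketch; the method names are invented and the real interfaces at LMAX will differ:

```java
// Hypothetical JMX interface for the acceptance-test-only time travel
// service: tests call this to request a jump to a new system time.
public interface TimeTravelServiceMBean
{
    void setCurrentTime(long epochMillis);
}

// Hypothetical per-service JMX interface: exposes the service's own
// (possibly time-travelled) clock and how far its schedulers have
// processed, so a test can wait until every service has caught up.
public interface ServiceTimeMBean
{
    long getCurrentTimeMillis();

    long getSchedulerTimeMillis();
}
```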

The TARDIS

The trouble with controlling time like this is that it affects the entire system, so we can’t run multiple tests at the same time or they would interfere with each other. Having to run tests sequentially significantly lengthens the feedback cycle. To solve this we added the TARDIS to the DSL that runs our acceptance tests. The TARDIS provides a central point of control for multiple test cases running in parallel, coordinating time travel so that the tests all move forward together, without the actual test code needing to care about any of the details or synchronisation.

The TARDIS hooks into the DSL at two points – when a test asks to time travel and when a test finishes (by either passing or failing). When a test asks to time travel, the TARDIS records the requested destination time and blocks the test until all tests are either ready to time travel or have completed. It then time travels to the earliest requested time and wakes up any tests that requested that time so they can continue running. Tests that requested a time further in the future remain paused, waiting for the next time travel.
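A sketch of that coordination, assuming the tests run as threads within a single JVM (which, as described below, is currently the case). The class and method names are mine, and the actual travel – publishing the time message and waiting for every service to catch up – is hidden behind a callback:

```java
// Hypothetical sketch of the TARDIS's coordination logic. Each test
// registers, asks to travel to a destination time and blocks; once every
// running test is either waiting to travel or has finished, the system
// travels to the earliest requested time and the tests that asked for it
// are released. Later destinations stay parked for the next hop.
import java.util.TreeMap;
import java.util.function.LongConsumer;

class Tardis
{
    private final TreeMap<Long, Integer> requestedTimes = new TreeMap<>();
    private final LongConsumer timeTraveller; // e.g. the JMX call described earlier
    private int runningTests;
    private long currentTime;

    Tardis(final long startTime, final LongConsumer timeTraveller)
    {
        this.currentTime = startTime;
        this.timeTraveller = timeTraveller;
    }

    public synchronized void testStarted()
    {
        runningTests++;
    }

    public synchronized void testFinished()
    {
        runningTests--;
        travelIfAllTestsReady();
    }

    public synchronized void requestTimeTravel(final long destination) throws InterruptedException
    {
        requestedTimes.merge(destination, 1, Integer::sum);
        travelIfAllTestsReady();
        while (currentTime < destination)
        {
            wait(); // parked until a hop reaches our destination
        }
        removeRequest(destination);
    }

    // Hop to the earliest requested time once every running test is either
    // blocked waiting to travel or has completed.
    private void travelIfAllTestsReady()
    {
        final int waiting = requestedTimes.values().stream().mapToInt(Integer::intValue).sum();
        if (waiting > 0 && waiting == runningTests)
        {
            currentTime = requestedTimes.firstKey();
            timeTraveller.accept(currentTime); // send the time message, wait for all services
            notifyAll();
        }
    }

    private void removeRequest(final long destination)
    {
        if (requestedTimes.merge(destination, -1, Integer::sum) == 0)
        {
            requestedTimes.remove(destination);
        }
    }
}
```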

Since we had a lot of time travel tests already written before we invented the TARDIS, this approach allowed us to start running them in parallel without having to rewrite them – the TARDIS is simply integrated into the DSL framework we use for all tests.

Currently the TARDIS only works for tests running in the same JVM, so it essentially allows test cases to run in parallel with other cases from the same test suite, but it doesn’t allow multiple test suites on separate Romero agents to run in parallel. The next step in its evolution will be to move the TARDIS out of the test DSL and provide it as an API from the time travel service on the server. At that point we could run multiple test suites in parallel against the same server. However, we haven’t yet done the research to determine what benefit, if any, we’d get from that change, as different test suites may have very different time travel patterns and thus spend most of their time paused at interim time points waiting for other tests. Also, the load on servers during time travel is quite high due to the number of scheduled jobs that can fire, so running multiple test suites at once may not be viable.

* Time being in sync is actually a more complex concept than it first appears. The overall architecture of our system meant this approach to time did in fact provide very accurate time sources relative to our main “source of all truth”, the exchange venue itself, which is what we really cared about. Even so, anything that had to be strictly in sync generated a timestamp in the service that triggered it and included that timestamp in the outgoing event, which is the only sane way to do such things.

Testing@LMAX – Test Results Database

One of the things we tend to take for granted a bit at LMAX is that we store the results of our acceptance test runs in a database to make them easy to analyse later. We store not only whether each test passed for each revision, but also the failure message if it failed, when it ran, how long it took, which server it ran on and a bunch of other information.

Having this info somewhere that’s easy to query lets us perform some fairly advanced analysis on our test data. For example, we can find tests that fail when run after 5pm New York time (an important cutoff in the world of finance) or around midnight (in various timezones). It has also allowed us to identify subtly faulty hardware based on an increased rate of test failures on particular machines.
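As a purely hypothetical illustration of the sort of query this enables – the test_results table, its columns and the connection URL are all invented for the example, not LMAX’s actual schema:

```java
// Finds tests that fail when run after 5pm New York time, using plain
// JDBC against an assumed PostgreSQL results database. The schema
// (test_results with test_name, passed and ran_at columns) is invented.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class AfterHoursFailureReport
{
    public static void main(final String[] args) throws Exception
    {
        final String sql =
                "SELECT test_name, COUNT(*) AS failures " +
                "FROM test_results " +
                "WHERE passed = FALSE " +
                "  AND EXTRACT(HOUR FROM ran_at AT TIME ZONE 'America/New_York') >= 17 " +
                "GROUP BY test_name " +
                "ORDER BY failures DESC";
        try (Connection connection =
                     DriverManager.getConnection("jdbc:postgresql://localhost/testresults");
             PreparedStatement statement = connection.prepareStatement(sql);
             ResultSet results = statement.executeQuery())
        {
            while (results.next())
            {
                System.out.printf("%s: %d failures%n",
                        results.getString("test_name"), results.getLong("failures"));
            }
        }
    }
}
```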

In our case we have custom software that distributes our acceptance tests across the available hardware, so it records the results directly to the database; however, we have also parsed JUnit XML reports and imported them into the database that way.
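For the JUnit route, a rough sketch of an importer, assuming the standard report layout of testsuite/testcase elements with an optional nested failure element (the actual database insert is left as a comment):

```java
// Parses a JUnit XML report and extracts, per test case, the fields a
// results database would want: name, pass/fail, duration and message.
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class JUnitReportImporter
{
    public static void main(final String[] args) throws Exception
    {
        final Document report = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File(args[0]));
        final NodeList testCases = report.getElementsByTagName("testcase");
        for (int i = 0; i < testCases.getLength(); i++)
        {
            final Element testCase = (Element) testCases.item(i);
            final NodeList failures = testCase.getElementsByTagName("failure");
            final boolean passed = failures.getLength() == 0;
            final String failureMessage =
                    passed ? null : ((Element) failures.item(0)).getAttribute("message");
            // Insert a row into the results database here; printing stands in.
            System.out.printf("%s.%s passed=%s time=%ss failure=%s%n",
                    testCase.getAttribute("classname"),
                    testCase.getAttribute("name"),
                    passed,
                    testCase.getAttribute("time"),
                    failureMessage);
        }
    }
}
```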

However you get the data, having a historical record of test results in a form that’s easy to query is a surprisingly powerful tool and worth the relatively small investment to set up.

Revert First, Ask Questions Later

The key to making continuous integration work well is keeping the build green, so that the team always knows that if something doesn’t work it was their change that broke it. However, in any environment with a build pipeline beyond a simple commit build – for example integration, acceptance or deployment tests – sometimes things will break.

When that happens, there is always a natural tendency to commit an additional change that will fix it. That’s almost always a mistake.

The correct approach is to revert first and ask questions later. No matter how small the fix might be, there’s a risk that it will fail or introduce some other problem and extend the period that the tests are broken. Since we know the last build worked correctly, reverting the change is guaranteed to return things to working order. The developers working on the problematic change can then take their time to develop and test the fix, then recommit everything together.

Reverting a commit isn’t a slight against its developers and doesn’t even suggest the changes being made are bad, merely that some detail hasn’t yet been completed and so it’s not yet ready to ship. Having a culture where that’s understood and accepted is an important step in delivering quality software.

Wifi Under Fedora Linux on a MacBook Pro 15″ Retina

Out of the box, Fedora 19 doesn’t have support for the Broadcom wifi chip in the MacBook Pro 15″ Retina. There are quite a few complex sets of instructions for adjusting firmware, compiling bits and bobs etc, but the easiest way to get it up and running on Fedora is using RPM Fusion.

You can do it by downloading a bunch of RPMs and stuffing around with USB drives, but it’s way easier if you set up network access first, via either a Thunderbolt ethernet adapter (make sure it’s plugged in before starting up, as hotplugging Thunderbolt doesn’t work under Linux) or via bluetooth. The bluetooth connection can either be to a mobile phone sharing its data connection or, if you have another Mac around, it can share its wifi network over bluetooth (turn on Internet Sharing in the Sharing settings panel).

Once you have network access, run a yum update so you have the latest packages from Fedora – the driver didn’t work for me with the plain Fedora 19 install.

Then go to rpmfusion.org and install first the “RPM Fusion free for Fedora 19” RPM, then the “RPM Fusion nonfree for Fedora 19” one.

Finally, run ‘sudo yum install broadcom-wl’.  After a reboot Linux should come back up with wifi working.

UPDATE (2014-01-13): I’ve found that each time a kernel upgrade comes through I’ve had to uninstall and then reinstall broadcom-wl, which is kind of annoying. This time round I’m experimenting with installing ‘kmod-wl’, which brings in broadcom-wl as a dependency but which I hope does better at tracking kernel updates.

Also worth noting that this approach continues to work with Fedora 20.

Media Release or Bug Report?

Ah, Apple Maps, ever the source of a good sensationalist headline. This time the Victorian police have warned people not to use Apple Maps to get to Mildura. This is definitely a bug in Apple Maps – no question it should be, and has been, fixed. What’s interesting though is that the Victorian police thought it would be easier to notify every iOS 6 user about the problem via the media and get them to use an alternative mapping application than it would be to call Apple and get them to fix the source data.

Users complaining publicly instead of being constructive and reporting the problem to the manufacturer certainly isn’t new, but in the age of web services it’s more self-defeating than ever. Apple can fix the problem in one place and it’s fixed for every user – surely the Victorian police have a better way of getting in touch with Apple than through the media?

And let’s not even start asking why anyone would blindly follow GPS directions into the Australian bush despite the fact that they are looking at a satellite photo that clearly shows there isn’t a town there.

The power of user feedback and cloud-hosted data was very clear in my recent experience with Apple Maps – on the way to my destination were some roadworks which diverted the road and replaced a roundabout with a set of traffic lights. Despite the fact that the roadworks were still in progress, Apple Maps gave exactly correct instructions for the new layout. Meanwhile, my TomTom GPS will still think there’s a roundabout there for up to another year and will then require me to pay nearly $100 to get the updated maps. It wasn’t too long ago that I’d have been looking up a book of maps to get my outdated driving instructions (and then having to memorise them or stop regularly to have another look).

Makes you realise what an incredible time we live in.