Living in a state of accord.

Testing@LMAX – Introducing ElementSpecification

Today LMAX Exchange has released ElementSpecification, a very small library we built to make working with selectors in selenium/WebDriver tests easier. It has three main aims:

  • Make it easier to understand selectors by using a very English-like syntax
  • Avoid common pitfalls when writing selectors that lead to either brittle or intermittent tests
  • Strongly discourage writing overly complicated selectors

Essentially, we use ElementSpecification anywhere that we would have written CSS or XPath selectors by hand. ElementSpecification will automatically select the most appropriate format to use – choosing between a simple ID selector, CSS or XPath.

Making selectors easier to understand doesn’t mean making locators shorter – CSS is already a very terse language. We actually want to use more characters to express our intent so that future developers can read the specification without having to decode CSS. For example, the CSS:

#data-table tr[data-id='78'] .name

would be expressed in ElementSpecification as something like:

anElementWithId("data-table")
    .thatContainsA("tr").withAttributeValue("data-id", "78")
    .thatContainsAnElementWithClass("name")

Much longer, but if you were to read the CSS selector to yourself, it would come out a lot like the ElementSpecification syntax. That allows you to stay focussed on what the test is doing instead of pausing to decode the CSS. It also reduces the likelihood of misreading a vital character and misunderstanding the selector.
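
As a toy sketch of that adapter idea – the method names echo the example above, but this is not the real ElementSpecification implementation – a chained builder can emit the terse CSS while keeping the readable form:

```java
// Toy sketch of a chained selector builder that emits CSS.
// Method names are illustrative, not ElementSpecification's actual API.
public class SelectorBuilder {
    private final StringBuilder css = new StringBuilder();

    // Entry point: start a selector from an element ID.
    public static SelectorBuilder anElementWithId(String id) {
        SelectorBuilder builder = new SelectorBuilder();
        builder.css.append("#").append(id);
        return builder;
    }

    // Descend to a child element by tag name.
    public SelectorBuilder thatContainsA(String tag) {
        css.append(" ").append(tag);
        return this;
    }

    // Constrain the current element by an attribute value.
    public SelectorBuilder withAttributeValue(String name, String value) {
        css.append("[").append(name).append("='").append(value).append("']");
        return this;
    }

    // Descend to a child element by class name.
    public SelectorBuilder thatContainsAnElementWithClass(String className) {
        css.append(" .").append(className);
        return this;
    }

    public String asCss() {
        return css.toString();
    }
}
```

Each call appends one step of the descent, so reading the chain aloud matches reading the CSS aloud, without the reader needing to decode CSS punctuation.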

With ElementSpecification essentially acting as an adapter layer between the test author and the actual CSS, it’s also able to avoid some common intermittency pitfalls. In fact, the reason ElementSpecification was first built was because really smart people kept attempting to locate an element with a classname using:

//*[contains(@class, 'valid')]

which looks ok, but incorrectly also matches an element with the class ‘invalid’. Requiring the class attribute to exactly match ‘valid’ is too brittle because it will fail if an additional class is added to the element. Instead, ElementSpecification would generate:

contains(concat(' ', @class, ' '), ' valid ')

which is decidedly unpleasant to have to write by hand.
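
The difference between the two approaches can be sketched as a pair of tiny XPath-building helpers (the class and method names here are illustrative, not ElementSpecification's API):

```java
// Sketch of generating a whitespace-safe XPath class predicate.
// Names are illustrative; ElementSpecification's real API differs.
public class SafeClassSelector {

    // Naive version: matches any element whose class attribute merely
    // contains the text, so 'valid' also matches class="invalid".
    public static String naiveXPathForClass(String className) {
        return "//*[contains(@class, '" + className + "')]";
    }

    // Safe version: pad the class attribute with spaces so only a
    // whole, space-delimited class name can match.
    public static String safeXPathForClass(String className) {
        return "//*[contains(concat(' ', @class, ' '), ' " + className + " ')]";
    }
}
```

The padding trick works because once the attribute is wrapped in spaces, every class name in it is guaranteed to be surrounded by spaces, so searching for `' valid '` cannot accidentally match inside `invalid`.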

The biggest benefit we’ve seen from ElementSpecification though is the fact that it has a deliberately limited set of abilities. You can only descend down the DOM tree, never back up and never across to siblings. That makes selectors far easier to understand and avoids a lot of unintended coupling between the tests and incidental properties of the DOM. Sometimes it means augmenting the DOM to make it more semantic – things like adding a “data-id” attribute to rows as in the example above. It’s surprising how rarely we need to do that, and how useful those extra semantics wind up being for a whole variety of other reasons anyway.

Testing@LMAX – Replacements in DSL

Given our DSL makes heavy use of aliases, we often have to provide a way to include the real name or ID as part of some string. For example, an audit record for a new account might be:

Created account 127322 with username someUser123.

But in our acceptance test we’d create the user with something like:

registrationAPI.createUser("someUser");
someUser is just an alias, the DSL creates a unique username to use and the system assigns a unique account ID that the DSL keeps track of for us. So to write a test that the new account audit record is written correctly, we need a way to build the expected audit string.

Our initial attempt was straightforward enough – something like:

"expression: Created account <accountId> with username <username>.",
"account: someUser"

The DSL substitutes <accountId> and <username> with the real values for us. Simple, neat, and it worked well for quite a while. Except that over time we kept finding more and more things that needed to be substituted, and encountered situations where an audit log would reference two different accounts or multiple instruments, leading us to have <accountId2> and <accountId3> replacements.

So a little while back some of my clever colleagues changed things around to a slightly more complex but much more flexible syntax:

"expression: Created account <accountId:someUser> with username <username:someUser>."

Essentially, the replacement now contains both the type of value to insert as well as the alias to look up. This has been a minor revolution for our DSL – it’s now much easier to handle all sorts of complex cases and it’s much clearer which alias is being used for each replacement.

The biggest win, though, is that because all the required information to perform the expansions is inside the one string – not requiring any extra DSL parameters – the replacement algorithm can be easily shared across all the different places in the DSL that need to support replacements. New types of things to be replaced can easily be added in one place and are then available everywhere as well.
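
A sketch of how such a shared replacement pass might work – the regex, token names and lookup structure are assumptions, not the real LMAX implementation:

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of a shared replacement algorithm for tokens like
// <accountId:someUser>. The lookup structure is illustrative,
// not the real LMAX DSL internals.
public class ReplacementExpander {

    // Matches <type:alias> tokens, capturing the type and the alias.
    private static final Pattern TOKEN = Pattern.compile("<(\\w+):(\\w+)>");

    private final Map<String, Map<String, String>> valuesByTypeAndAlias;

    public ReplacementExpander(Map<String, Map<String, String>> valuesByTypeAndAlias) {
        this.valuesByTypeAndAlias = valuesByTypeAndAlias;
    }

    // Expand every <type:alias> token in a single pass over the string.
    public String expand(String input) {
        Matcher matcher = TOKEN.matcher(input);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String type = matcher.group(1);
            String alias = matcher.group(2);
            String value = valuesByTypeAndAlias
                    .getOrDefault(type, Map.of())
                    .get(alias);
            if (value == null) {
                throw new IllegalArgumentException("Unknown replacement: " + matcher.group());
            }
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

Because every token carries both the type and the alias, a single `expand` call can be reused anywhere the DSL needs substitution, with no extra parameters threaded through.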

It’s surprising how much benefit you can get from such a simple little change.

Testing@LMAX – Abstraction by DSL

At LMAX, all our acceptance tests are written using a high level DSL. This gives us two key advantages:

  • Tests can focus on what they’re testing with the details and mechanics hidden behind the DSL
  • When things change we can update the DSL implementation to match, avoiding the need to change each test that happens to touch on the area.

The DSL that LMAX uses is probably not what most people think of when hearing the term DSL – it doesn’t attempt to read like plain English, just simplifies things down significantly. We’ve actually open sourced the simple little library that is the entrance-way to what we think of as the DSL – creatively named simple-dsl. It’s essentially the glue between what we write in an acceptance test and the plain-java implementation behind that.

As a simple example, here’s a test that creates a user and an instrument, then places an order:

registrationAPI.createUser("user");
adminAPI.marketOperations.createInstrument("instrument");
tradingAPI.placeOrder("instrument", "quantity: 5", "type: market",
"expectedStatus: UNMATCHED");

Overall, the aim is to have the acceptance tests written at a very high level – focussing on what should happen but leaving the how to the DSL implementation. The tradingAPI.placeOrder call is a good example of this: it’s testing that when the user places an order on an instrument with no liquidity, it won’t be matched. In the DSL that’s actually a two-step process: first place the order and receive a synchronous OK response to say the order was received, then, when the order reaches the matching engine, an asynchronous event will be emitted to say the order was not matched. We could have made that two separate calls in the acceptance test but that would have exposed too much detail about how the system works, when what we really care about is that the order is unfilled; how that’s reported is an implementation detail.
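
Under the hood, each of those string arguments is just a `name: value` pair. A minimal sketch of that parsing style – an illustration, not simple-dsl's actual API – might look like:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of parsing simple-dsl style "name: value" arguments.
// This illustrates the style only; it is not simple-dsl's real API.
public class DslArgs {

    private final Map<String, String> values = new HashMap<>();

    public DslArgs(String... args) {
        for (String arg : args) {
            int colon = arg.indexOf(':');
            if (colon < 0) {
                // A bare argument has no name; here it is simply stored
                // under its own text so it can still be looked up.
                values.put(arg, arg);
            } else {
                values.put(arg.substring(0, colon).trim(),
                           arg.substring(colon + 1).trim());
            }
        }
    }

    // Look up a named argument's value, or null if absent.
    public String value(String name) {
        return values.get(name);
    }
}
```

The appeal of this style is that tests can pass only the parameters they care about, in any order, and the DSL implementation can fill in sensible defaults for the rest.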

However, that does mean that the implementation of the DSL is an important part of the specification of the system. The acceptance tests express the user requirements and the DSL expresses the technical details of those requirements.

Model the Types of Users and Interactions

All our acceptance tests extend a base class, DslTestCase, that exposes a number of public variables that act as the entry points to the system (registrationAPI, adminAPI and tradingAPI in the example above). Each of these roughly represents a way that certain types of users interact with the system. So registrationAPI works with the API exposed by our registration gateway – the same APIs that our sign-up process on the website talks to. adminAPI uses the same APIs our admin console talks to, and tradingAPI is the API that both our UI uses and that many of our clients interact with directly.

We also have UI variants like adminUI and tradingUI that use selenium to open a browser and test the UI as well.

Our system tends to have a strong correlation between the type of user and the entry point to the system they use so our DSL mostly maps to the gateways into our system, but in other systems it may be more appropriate to focus more on the type of user regardless of what their entry point into the system is. Again the focus should be on what happens more than how. The way you categorise functions in the DSL should aid you in thinking that way.

That said, our top level DSL concepts aren’t entirely restricted to just the system entry point they model. For example the registrationAPI.createUser call in the example will initially talk to the system’s registration API, but since a new account isn’t very useful until it deposits funds, it then talks to the admin console to approve the registration and credit some funds into the users account. There’s a large dose of pragmatism involved in the DSL implementation with the goal being to make it easy to read and write the acceptance tests themselves and we’re willing to sacrifice a little design purity to get that (but only a little).

Top level concepts often further categorise the functionality they provide, for example our admin console that adminAPI drives has a lot of functionality and is used by a range of types of users, so it sub-categorises into things like marketOperations, customerServices, risk, etc.

Add Reusable Components to the DSL

One of the signs that people don’t understand the design of our DSL is when they extract repetitive pieces of tests into a private method within the test itself. On the surface this seems like a reasonable idea, allowing that sequence of actions to be reused by multiple tests in the file. But if the sequence is useful in many test cases within one file and significant enough to be worth the indirection of extracting a method, it’s almost inevitably useful across many files.

Instead of extracting a private method, put reusable pieces into the DSL itself. Then they’ll be available to all your tests. More importantly though, you can make that method fit into the DSL style properly – in our case, using simple-dsl to pass parameters instead of a fixed set of method parameters.

One of our top level concepts in the DSL is ‘workflows’. It bundles together broader sequences of actions that cut across the boundaries of any one entrance point. It’s a handy home for many of the reusable functions we split out. The downside is that it’s currently a real grab bag of random stuff and could do with some meaningful sub-categorisation. Naming is hard…

Design to Avoid Intermittency

The way the DSL is designed is a key weapon in the fight against intermittency. The first rule is to design each function to appear synchronous as much as possible. The LMAX Exchange is a highly asynchronous system design but our DSL hides that as much as possible.

The most useful pattern for this is that whenever you provide a setter-type function it should automatically wait and verify that the effect has been fully applied by checking the equivalent getter-type API. So the end of the DSL implementation for registrationAPI.createUser is a waiter that polls our broker service waiting for the account to actually show up there with the initial balance we credited. That way the test can carry on and place an order immediately without intermittently being rejected for a lack of funds.
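
That waiter pattern can be sketched as a generic poll-until-true helper (the timeout handling, poll interval and names here are assumptions, not LMAX's implementation):

```java
import java.util.function.BooleanSupplier;

// Sketch of a generic waiter: poll a condition until it holds or a
// timeout elapses. Timeout values and names are illustrative.
public class Waiter {

    public static void waitUntil(BooleanSupplier condition, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() > deadline) {
                // Failing loudly here turns a would-be intermittent test
                // into a clear timeout failure with a message.
                throw new AssertionError("Condition not met within " + timeoutMillis + "ms");
            }
            try {
                Thread.sleep(50); // poll interval; illustrative
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new AssertionError("Interrupted while waiting", e);
            }
        }
    }
}
```

The tail of a setter-style DSL call then becomes something like `Waiter.waitUntil(() -> broker.hasAccount(accountId), 30_000);`, where `broker.hasAccount` stands in for whatever getter-style check the real DSL polls.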

The second key pattern applies when verifying values. We produce a lot of reports as CSV files, so originally we had DSL along these lines (the method names here are illustrative):

reports.generateCashflowReport("date: today", "rememberAs: myCsvFile");
reports.verifyCashflowValue("csvFile: myCsvFile", "amount: 10.00");

Apart from being pretty horrible to read, this leads to a lot of intermittency because our system doesn’t guarantee that cash flows will be recorded to the database immediately – it’s done asynchronously and is only guaranteed to happen within a reasonable time. Instead it’s much better to write:

reports.verifyCashflowValue("date: today", "amount: 10.00");

Then inside the DSL you can use a waiter to poll the cashflow CSV until it does contain the expected value or whatever you define as a reasonable time elapses and the test times out and fails. Again, having the test focus on what and the DSL dealing with how allows us to write better tests.

Don’t Get Too Fancy with the DSL

The first thought most people have when they see our DSL is that it could be so much better if we used static types and chained method calls to get the compiler to validate more and to make refactoring tools work well. It sounds like a great idea – our simple string-based DSL seems far too primitive to work in practice – but we’ve actually tried it the other way as well, and it’s not as great as it sounds.

Inevitably, when you try to make the DSL too much like English or get the compiler more involved, you add quite a lot of complexity to the DSL implementation, which makes it a lot harder to maintain – so the cost of your acceptance tests goes up, exactly the opposite of what you were intending.

The trade-offs will vary considerably depending on which language you’re using for your tests, and so will the best style of DSL to create. I strongly suspect, though, that regardless of language the best DSL is a radically simple one – just that different things are radically simple in different languages.

DSLs Matter

This was meant to be a quick article before getting on to what I really wanted to talk about but suddenly I’m 1500 words in and still haven’t discussed anything about the implementation side of the DSL.

It turns out that while our DSL might be simple and something we take for granted, it’s a huge part of what makes our acceptance tests easily maintainable instead of gradually becoming a huge time sink that prevents any change to the system. My intuition is that those people who have tried acceptance tests and found them too expensive to maintain have failed to find the right style of abstraction in the DSL they use, leaving their tests too focused on how instead of what.

Making End-to-End Tests Work

The Google Testing Blog has an article “Just Say No to More End-to-End Tests” which winds up being a rather depressing evaluation of the testing capabilities, culture and infrastructure at Google. For example:

Let’s assume the team already has some fantastic test infrastructure in place. Every night:

  1. The latest version of the service is built. 
  2. This version is then deployed to the team’s testing environment. 
  3. All end-to-end tests then run against this testing environment. 
  4. An email report summarizing the test results is sent to the team.

If your idea of fantastic test infrastructure starts with the words “every night” and ends with an email being sent, you’re doomed. Bryan Pendleton does a good job of analysing and correcting the details so I won’t cover that ground again. Instead, let me provide a view of what reasonable test infrastructure looks like.

At LMAX we’ve recently reached the milestone of 10,000 end-to-end acceptance tests. We’ve obviously invested a lot of time in building up all those tests, but they’re invaluable in the way they free us to try daring things and make sweeping changes, confident that if anything is broken it will be caught. We’re happy to radically restructure components in ways that require lots of changes to unit tests because of those end-to-end tests.

We also have huge numbers of unit tests, integration tests, performance tests, static analysis and various other forms of tests, but the end-to-end tests are far more than a sanity-check, they’re a primary form of quality control.

Those end-to-end tests, or acceptance tests as we call them:

  • run constantly through the day
  • complete in around 50 minutes, including deploying and starting the servers, running all the tests and shutting down again at the end
  • are all required to pass before we consider a version releasable
  • are included in our information radiators to ensure the team has constant visibility into the test results
  • are owned by the whole team – testers, developers and business analysts together

That’s pretty much entry-level for doing end-to-end testing (or frankly any testing). We’ve also got a few extra niceties that I’ve written about before.

The test results are displayed in real-time, so we don’t even have to wait for the end of the test run to see any failures. Tests that failed on the previous run are run first to give us quick feedback on whether they’ve been fixed or not.

There’s lots of great stuff in there, but we have more work to do. We have an intermittency problem. When we started out we didn’t believe that intermittency could be avoided and accepted a certain level of breakage on each build – much like the Google post talks about expecting 90% of tests to pass. That attitude is a death-knell for test reliability. If you don’t have all the tests passing consistently, gradually and inevitably more and more tests become intermittent over time.

We’ve been fighting back hard against intermittency and making excellent progress – we’ve recently added the requirement that releases have no failures, green builds are now the norm, and when there are intermittent failures it’s usually only one or two per run. Currently we’re seeing an intermittent failure rate of around 0.00006% of tests run (which actually sounds pretty good, but with 10,000 tests that’s still far too many runs with failures that should have been green).

But improvements come in waves with new intermittency creeping in because it can hide in amongst the existing noise. It has taken and will take a lot of dedication and commitment to dig ourselves out of the intermittency hole we’re in but it’s absolutely possible and we will get there.

So next time you hear someone try to tell you that end-to-end tests aren’t worth the effort, point them to LMAX. We do end-to-end testing big time and it is massively, indisputably worth it. And we only expect it to become more worth it as we reduce the intermittency and continue improving our tools over time.

Testing@LMAX – Testing in Live

Previously in the Testing@LMAX series I’ve mentioned the way we’ve provided isolation between tests, allowing us to run them in parallel. That isolation extends all the way up to supporting a multi-tenancy module called venues which allows us to essentially run multiple, functionally separate exchanges on a single deployment of the LMAX Exchange.

We use the isolation of venues to reduce the amount of hardware we need to run our three separate liquidity pools (LMAX Professional, LMAX Institutional and LMAX Interbank), but that’s not all. We actually use the isolation venues provide to extend our testing all the way into production.

We have a subset of our acceptance tests which, using venues, are run against the exchange as it is deployed in production, using the same APIs our clients, MTF members and internal staff would use, to test that the exchange is fully functional. We have an additional venue on each deployment of the exchange that is used to run these tests. The tests connect to the exchange via the same gateways as our clients (FIX, web, etc) and place real trades that match using the exact same systems and code paths as in the “real” venues. Code-wise there’s nothing special about the test venue; it just so happens that the only external party that ever connects to it is our testing framework.

We don’t run our full suite of acceptance tests against the live exchange due to the time that would take and to ensure that we don’t affect the performance or latency of the exchange. Plus, we already know the code works correctly because it’s already run through continuous integration. Testing in live is focussed on verifying that the various components of the exchange are hooked up correctly and that the deployment process worked correctly. As such we’ve selected a subset of our tests that exercise the key functions of each of the services that make up the exchange. This includes things like testing that an MTF member can connect and provide prices, that clients can connect via either FIX or web and place orders that match against those prices and that the activity in the exchange is reported out correctly via trade reporting and market data feeds.

We run testing in live as an automated step at the start of our release, prior to making any changes, and again at the end of the release to ensure the release worked properly. If testing in live fails we roll back the release. We also run it automatically throughout the day as one part of our monitoring system, and it is also run manually whenever manual work is done or whenever there is any concern for how the exchange is functioning.

While we have quite a lot of other monitoring systems, the ability to run active monitoring like this against the production exchange – going as far as actions that change state – gives us a significant boost in confidence that everything is working as it should, and helps isolate problems more quickly when things aren’t.