Symphonious

Living in a state of accord.

Modernising Our JavaScript – Why Angular 2 Didn’t Work

At LMAX we value simplicity very highly and since most of the UI we need for the exchange is fairly straight forward settings and reports we’ve historically kept away from JavaScript frameworks. Instead we’ve stuck with jQuery and bootstrap along with some very simple common utilities and patterns we’ve built ourselves. Mostly this has worked very well for us.

Sometimes though we have more complex UIs where things dynamically change or state is more complex. In those cases things start to breakdown and get very messy. The simplicity of our libraries winds up causing complexity in our code instead of avoiding it. We needed something better.

Some side projects had used Angular and a few people were familiar with it so we started out trialling Angular 2.0. While it was much better for those complex cases the framework itself introduced so much complexity and cost it was unpleasant to work with.  Predominately we had two main issues:

  1. Build times were too slow
  2. It wasn’t well suited for dropping an Angular 2 component into an existing system rather than having everything live in Angular 2 world

Build Times

This was the most surprising problem – Angular 2 build times were painfully slow. We found we could build all of the java parts of the exchange before npm could even finish installing the dependencies for an Angular 2 project – even with all the dependencies in a local cache and using npm 5’s –offline option. We use buck for our build system and it does an excellent job of only building what needs to be changed and caching results so most of the time we could avoid the long npm install step, but it still needs to run often enough that it was a significant drain on the team’s productivity.

We did evaluate yarn and pnpm but neither were workable in our particular situation. They were both faster at installing dependencies but still far too slow.

The lingering question here is whether the npm install was so slow because of the sheer number of dependencies or because something about those dependencies was slow. Anecdotally it seemed like rxjs took forever to install but other issues led us away from angular before we fully understood this.

Even when the npm install could be avoided, the actual compile step was still slow enough to be a drain on the team. The projects we were using angular on were quite new with a fairly small amount of code. Running through the development server was fast, but a production mode build was slow.

Existing System Integration

The initial projects we used angular 2 on were completely from scratch so could do everything the angular 2 way. On those projects productivity was excellent and angular 2 was generally a joy to use. When we tried to build onto our existing systems using angular 2 things were much less pleasant.

Technically it was possible to build a single component on a page using angular 2 with other parts of the page using our older approach, but doing so felt fairly unnatural. The angular 2 way is significantly different to how we had been working and since angular 2 provides a full-suite of functionality it often felt like we were working against the framework rather than with it. Re-using our existing code within an angular 2 component felt wrong so we were being pushed towards duplicating code that worked perfectly well and we were happy with just to make it fit “the angular 2 way”.

If we intended to rewrite all our existing code using angular 2 that would be fine, but we’re not doing that. We have a heap of functionality that’s already built, working great and will be unlikely to need changes for quite some time. It would be a huge waste of time for us to go back and rewrite everything just to use the shiny new tool.

Angular 2 is Still Great

None of this means that angular 2 has irretrievable faults, it’s actually a pretty great tool to develop with. It just happens to shine most if you’re all-in with angular 2 and that’s never going to be our situation. I strongly suspect that even the build time issues would disappear if we could approach the build differently, but changing large parts of our build system and the development practices that work with it just doesn’t make sense when we have other options.

I can’t see any reason why a project built with angular 2 would need or want to migrate away. Nor would I rule out angular 2 for a new project. It’s a pretty great library, provides a ton of functionality that you can just run with and has excellent tooling. Just work out how your build works and if it’s going to be too slow early on.

For us though, Angular 2 didn’t turn out to be the wonderful new world we hoped it would be.

Finding What Buck Actually Built

Buck is a weird but very fast build tool that happens to be rather opaque about where it actually puts the things you build with it. They wind up somewhere under the buck-out folder but there’s no guarantee where and everything under there is considered buck’s private little scratch pad.

So how do you get build results out so they can be used? For things with built-in support like Java libraries you can use ‘buck publish’ to push them out to a repo but that doesn’t work for things you’ve built with a custom genrule. In those cases you could use an additional genrule build target to actually publish but it would only run when one of it’s dependencies have changed. Sometimes that’s an excellent feature but it’s not always what you want.

Similarly, you might want to actually run something you’ve built. You can almost always use the ‘buck run’ command to do that but it will tie up the buck daemon while it’s running so you can’t run two things at once.

For ultimate flexibility you really want to just find out where the built file is which thankfully is possible using ‘buck targets –show-full-output’. However it outputs both the target and it’s output:

$ buck targets --show-full-output //support/bigfeedback:bigfeedback
//bigfeedback:bigfeedback /code/buck-out/gen/bigfeedback/bigfeedback.jar

To get just the target file we need to pipe it through:

cut -d ' ' -f 2-

Or as a handy reusable bash function:

function findOutput() {
    $BUCK targets --show-full-output ${1} | cut -d ' ' -f 2-
}

 

Testing@LMAX – Isolate UI Tests with vncserver

One reason that automated UI tests can be unreliable is that they tend to be sensitive to what else is on screen at the time and even things like the current screen size. Developers running the tests locally also find it annoying to have windows opening and closing on their machine while the test runs and are unable to do anything else because their clicking might interfere with the test.

At LMAX we solve that by isolating tests in their own X session, created using vncserver. We simply start vncserver with:

vncserver :20 -geometry 1600x1200

Then set DISPLAY=:20 as an environment variable when starting WebDriver’s Firefox instance:

FirefoxBinary firefoxBinary = new FirefoxBinary();
firefoxBinary.setEnvironmentProperty("DISPLAY", ":20");

The Firefox window then pops up in it’s own isolated X session. We can still use a vnc client to watch as the test runs but we can also let it run in the background and continue using the machine for other things. In CI it allows us to run UI tests on a headless server.

Since we run a number of tests in parallel, in CI we start a number of vncserver instances and allocate a different one to each running test to ensure they’re completely isolated.

Simple, but incredibly effective.

Testing@LMAX – Screenshots with Selenium/WebDriver

When an automated UI test fails, it can be hard to tell exactly what went wrong just from the failure message. The failure message typically just says that some element the test was looking for wasn’t found, but it doesn’t tell you what was there.  Was there an error message displayed instead? Was the operation still executing? Did something completely unexpected happen instead?

To answer those questions our DSL automatically captures a screenshot when any UI operation fails and we include a link to it in the failure message. That way when someone reviews the test result they can see exactly what was on screen which typically makes it straight forward to identify what went wrong and fix it.

Until recently we’d been using the convenient and helpful looking TakesScreenshot.getScreenshotAs method that WebDriver provides.  For example:

((TakesScreenshot)webDriver).getScreenshotAs(
new SaveScreenshotOutputType(pngFilename));

As expected, this creates a PNG image in the specified location that looks for all the world like a screenshot of the browser content. Unfortunately, it’s lying.

WebDriver actually does something very clever and gets the browser to render the page content into a canvas element and then saves that as the PNG file. This is an extremely close approximation of what the page looks like with two important exceptions:

  1. It doesn’t respect the viewport size so body content is never scrolled off-screen.
  2. Any browser chrome or random other windows that have popped up aren’t shown.

Both of these things can be an issue – the scrolled-off-screen one being the most problematic.  Modern WebDriver quite accurately simulates a user clicking and typing keys so if somethings not on screen it can’t be clicked. When your test fails because an element was “present but not visible” and the screenshot shows it as very clearly visible, hilarity ensues. Very frustrating hilarity.

To fix this we’ve started taking honest-to-goodness screenshots. Since all our tests get their own X session (courtesy of vncserver) their windows are completely isolated from each other and a dump of the entire screen will capture precisely what a real user would see, browser chrome and scrolling included. Linux provides an entertaining array of options for capturing screenshots from the command line but the one that happened to be already installed was import, part of the ImageMagick suite. We simply execute:

import -display :20 -window root screenshot.png

where :20 is the X display this particular test has been allocated and screenshot.png is where we want the screenshot to wind up.

Since the WebDriver screenshot can be useful as well – for example finding out an error message is displayed at the top of the screen  we continue to grab that too.

Finally, for completeness we grab a dump of the DOM to a HTML file so we can later inspect what IDs, classes, attributes etc are present, including any hidden elements. webDriver.getPageSource() makes that easy and we append an extra HTML comment that includes webDriver.getCurrentUrl() for good measure.

Testing@LMAX – Compatibility Tests

Once an application goes live, it is absolutely essential that any future changes are able to be work with the existing data in production, typically by migrating it as changes are required. That existing data and the migrations applied to it are often the riskiest and least tested functions in the system. Mistakes in a migration will at best cause a multi-hour outage while backups are restored and more likely will subtly corrupt data, producing incorrect results that may go unnoticed for long periods making it impossible to roll back. At LMAX we reduce that risk by running compatibility tests.

Sanitised Data

The first requirement for testing data migrations is testing data that as production-like as possible. It’s almost never a good idea to bring actual production data into development and it’s definitely never going to happen for a finance company like LMAX. Instead, we have a sanitiser that runs in production and generates a very heavily sanitised version of production data that still has the same “shape”. For example, it has the same number of accounts, the same distribution of open positions across instruments, null values in the same places there’s null values in production and so on. However it replaces anything that’s even vaguely personally identifiable or sensitive information. If in doubt, sanitise it out.

Despite being heavily sanitised, we still treat it as data that should be secured but it can be brought into our development, testing and staging environments.

We have multiple production environments so we have sanitised data that matches the shape of each of those environments.

Can Migrations Run?

Once we have sanitised data the most basic check is to confirm the migration process will actually complete successfully. This ensures we can release successfully but doesn’t give us any real confidence that the migration is successful. For such a primitive check it’s surprisingly effective as it picks up the common errors of changing columns to NOT NULL when the production data actually does have null values, or adding a unique constraint to tables when the content isn’t actually unique.

Did Migrations Work?

The obvious next step is to write tests to confirm that migrations actually worked. Our compatibility test jobs are setup so that we can easily write a JUnit test that will be run after migrations complete so we can verify the state. 

The most direct form of test is an example based test. We select some examples from the production dataset and write assertions to check that specific bit of the data migrated in the way we expect. The down side is that we’re dealing with live production data which is regularly updated so it’s possible that our examples will change after we’ve written the test and then, correctly, migrate to something different to what we expect. Still, these are often useful to put through for a single run as a sanity test when developing the migration, then delete them.

Slightly more generic, we can write a test that assert constraints that must be true immediately after the migration completes. For example, when we made permissions more fine-grained we needed to assign a new type of payment role to every account that used real money, but not to any demo accounts (which use pretend money). We can write a test to verify that migration worked correctly quite easily, however once the migration goes out admin users may add or remove the role to different accounts and the constraint would no longer hold. For cases like that we simply delete the test once the migration has gone live at which point it’s done its job anyway.  We also mark the test with a @ValidUntil annotation that makes it clear that the test has a limited life time in case we forget to delete it.

Finally, we can often identify constraints that should always be valid and write permanent tests for them. These are extremely powerful, testing not just that our current migration works correctly but that no future migration breaks that expectation.

Did Something Unexpected Happen?

The compatibility tests that should always hold true have an additional benefit – they give us early feedback that the production data has diverged from expectations for some reason, typically because a bug slipped through. We can then investigate, fix the bug to prevent any more data issues and work out how to clean up the problem.

Obviously, finding issues only after production data has gone wrong is not something we ever want to do but it’s still a useful safety net if something slips through all our pre-release testing and the database schema constraints we use. Typically when we do find issues they are minor inconsistencies that don’t cause any issues now, but are like little time bombs just waiting for a future release to assume it can’t possibly happen. So even getting feedback that late in the process often allows us to avoid any user-noticeable effects.

Making Them Easy

We have a base class we extend our compatibility tests from which makes it easy to get a connection to the database and has a few handy utilities for asserting things. By far the most useful however is the assertNoRowsMatch method. It does exactly what it says – takes an SQL query and asserts that no rows match. If any do, it prints them out so you get really useful debug information to start investigating the problem. For example:

@Test
public void shouldHaveAPrincipalForEveryAccount() {
  assertNoRowsMatch(
    "SELECT a.account_id, a.name " +
    "FROM account a " +
    "LEFT JOIN principal p ON a.principal_id = p.principal_id " +
    "WHERE p.principal_id IS NULL");
}

If we’ve somehow wound up with an account with no principal that could log into it the test will print the account ID and name so we can investigate what happened and clean up the data.