Symphonious

Living in a state of accord.

Using WebPack with Buck

I’ve been gradually tidying up the build process for UI stuff at LMAX. We had been using a mix of requires and browserify – both pretty sub-optimally configured. Obviously, when you have too many ways of doing things, the answer is to introduce another new way, so I’ve converted everything over to webpack.

Obligatory xkcd: Situation: there are 14 competing standards. “We need to develop one universal standard that covers everyone’s use cases.” Soon: Situation: there are 15 competing standards…

Webpack is most often used as part of a node or gulp/grunt build process but our overall build process is controlled by Buck so I’ve had to work it into that setup. I was also keen to minimise the amount of code that had to be changed to support the new build process.

The final key requirement, which had almost entirely been missed by our previous UI build attempts, was the ability to easily create reusable UI modules that are shared by some, but not all, projects. Buck shuns the use of artefact repositories in favour of a single source tree with everything in it, so an internal npm repo wasn’t going to fly.

While the exact details are probably pretty specific to our setup, the overall shape of the build is likely to be useful elsewhere. We have separate Buck targets (using genrule) for a few key stages:

Build node_modules

We use npm to install third party dependencies to build the node_modules directory we’ll need. We do this in an offline way by checking in the npm cache, as CI doesn’t have internet access, but it’s pretty unsatisfactory. Checking in the node_modules directory itself was tried previously, but both svn and git have massive problems with the huge numbers of files it winds up containing.
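As a rough sketch (the cache path, flags and output handling here are illustrative rather than our exact setup), the genrule for this step boils down to something like:

# Sketch only: point npm at the checked-in cache and keep it off the network.
# CHECKED_IN_NPM_CACHE is a placeholder; OUT is the genrule's declared output.
npm install \
    --cache "${CHECKED_IN_NPM_CACHE}" \
    --cache-min 9999999 \
    --no-optional
cp -r node_modules "${OUT}"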

yarn has much better offline support and other benefits as well, but its offline support requires a cache, and that cache has every package already expanded so it winds up with hundreds of thousands of files to check in and deal with. Further investigation is required here…

For our projects that use Angular 2, this is actually the slowest part of our entire build (UI and server side). rxjs seems to be the main source of issues as it takes forever to install. Fortunately we don’t change our third party modules often and we can utilise the shared cache of artefacts so developers don’t wind up building this step locally too often.

Setup a Workspace and Generate webpack.config.js

We don’t want to have to repeat the same configuration for webpack, TypeScript, Karma etc. for every UI thing we build, so our build process generates the config files for us, tweaking things as needed to account for the small differences between projects. It also grabs the node_modules from the previous step and installs any of our own shared components (with npm install <local/path/to/component>).
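The details vary per project, but the workspace assembly amounts to something like this (the paths and the shared component name are made up for illustration – the real build templates the config files rather than copying them verbatim):

# Illustrative only.
WORKSPACE=$1
mkdir -p "${WORKSPACE}"

# Generated config files, tweaked for this project.
cp generated/webpack.config.js generated/tsconfig.json generated/karma.conf.js "${WORKSPACE}/"

# node_modules built by the previous step.
cp -r prebuilt/node_modules "${WORKSPACE}/node_modules"

# Our own shared components, installed straight from the source tree.
(cd "${WORKSPACE}" && npm install ../components/shared-grid)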

Build UI Stuff

Now we’ve got something that looks just like most stand-alone JavaScript projects would have – config files at the root, source ready to combine/minify etc. At this point we can just run things out of node_modules, so we have a target to build with ./node_modules/.bin/webpack, run tests with ./node_modules/.bin/karma, or start the webpack dev server.
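Those targets are little more than thin wrappers around the usual CLI entry points – roughly (exact flags vary by project):

# Run from the generated workspace.
./node_modules/.bin/webpack --config webpack.config.js                # build the bundle
./node_modules/.bin/karma start karma.conf.js --single-run            # run the tests once
./node_modules/.bin/webpack-dev-server --config webpack.config.js     # local dev server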

Buck can then pick up those results and integrate them where they’re needed in the final outputs ready for deployment.

Finding What Buck Actually Built

Buck is a weird but very fast build tool that happens to be rather opaque about where it actually puts the things you build with it. They wind up somewhere under the buck-out folder, but there’s no guarantee where, and everything under there is considered Buck’s private little scratch pad.

So how do you get build results out so they can be used? For things with built-in support like Java libraries you can use ‘buck publish’ to push them out to a repo, but that doesn’t work for things you’ve built with a custom genrule. In those cases you could use an additional genrule build target to actually publish, but it would only run when one of its dependencies has changed. Sometimes that’s an excellent feature, but it’s not always what you want.

Similarly, you might want to actually run something you’ve built. You can almost always use the ‘buck run’ command to do that but it will tie up the buck daemon while it’s running so you can’t run two things at once.

For ultimate flexibility you really want to just find out where the built file is, which thankfully is possible using ‘buck targets --show-full-output’. However it outputs both the target and its output:

$ buck targets --show-full-output //support/bigfeedback:bigfeedback
//support/bigfeedback:bigfeedback /code/buck-out/gen/support/bigfeedback/bigfeedback.jar

To get just the output file we need to pipe it through:

cut -d ' ' -f 2-

Or as a handy reusable bash function:

function findOutput() {
    $BUCK targets --show-full-output ${1} | cut -d ' ' -f 2-
}
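
Then pulling a built artefact out of buck-out is a one-liner (assuming $BUCK is set to the buck launcher elsewhere in the script; the destination is just for illustration):

JAR=$(findOutput //support/bigfeedback:bigfeedback)
cp "${JAR}" dist/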


Replacing Symlinks with Hardlinks

Symlinks have been causing me grief lately. Our build tool, Buck, loves creating symlinks but publishes corrupt cache artefacts for any build rule that includes a symlink amongst its output.

We also wind up calling out to npm to manage JavaScript dependencies, and it has an annoying (for us) habit of resolving symlinks when processing files and then failing to find required libraries because the node_modules folder was back where the symlink was, not with the original file. Mostly this problem is caused by Buck creating so many symlinks.

So it’s useful to be able to get rid of symlinks, which can be done with the handy -L or --dereference option to cp. Then instead of copying the symlink you copy the file it points to. That avoids all the problems with Buck and npm, but wastes lots of disk space and means that changes to the original file are no longer reflected in the new copy (so watching files doesn’t work).

Assuming our checkout is on a single file system (which seems reasonable) we can get the best of both worlds by using hard links. cp has a handy option for that too: -l or --link. But since Buck gave us a symlink to start with, that just gives us a hard link to the symlink that points to the original file.

So combining the two options, cp -Ll, should be exactly what we want. And if you’re using coreutils 8.25 or above, it is: cp will dereference the symlink and create a hard link to the original file. If you’re using coreutils prior to 8.25, cp will just copy the symlink. Hitting a bug in coreutils is pretty much the definition of the world being out to get you.
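On a new enough coreutils the whole thing is just a normal recursive copy with both flags, along these lines (paths illustrative):

# Needs coreutils 8.25+ for -L and -l to combine properly.
cp -rLl buck-out/gen/ui/node_modules workspace/node_modules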

Fortunately, on older coreutils we can work around the issue with a bit of find magic:

find ${DIR} -type l -exec bash -c 'ln -f "$(readlink -m "$0")" "$0"' {} \;

‘find -type l’ will find all symlinks. For each of those we execute a bit of bash which, reading from the inside out, dereferences the symlink with readlink -m and then uses ln to create a hard link, with the -f option forcing it to overwrite the existing symlink.

Good times…

Travis CI

Probably the best thing I’ve discovered with my recent playing is Travis CI. I’ve known about it for quite some time, and even played with it for simple projects, but never with anything of any real complexity. Given this project uses Rails, which wants a database for pretty much every test, and that database has to be Postgres because I’m using its jsonb support, plus Capybara and PhantomJS for good measure, this certainly isn’t a simple project to test.

Despite that, getting things running on Travis CI was quick and easy. The container based builds give you a well controlled environment out of the box and they already have a heap of software you’re likely to need ready to go. Other things can be easily provided using a simple .travis.yml file.
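For the database side of things that mostly comes down to a couple of lines in the before_script section – roughly the following, with the database name and exact commands illustrative of a Rails/Postgres setup rather than copied from my config:

# Roughly what before_script ends up running.
psql -c 'create database myapp_test;' -U postgres
bundle exec rake db:schema:load RAILS_ENV=test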

While it’s common to have a build server these days, it’s quite uncommon for them to be so well controlled and able to be rebuilt quickly and easily – most wind up being special little snowflakes in some way or other. So building on Travis CI is immediately a massive leap forward in testing practices.

The two things we do at LMAX that I can’t see how to do with Travis CI (yet) are creating a database of historic test results that can be easily queried (at the individual test level, not the build result level), and running a build in parallel with the tests automatically load balanced across a number of hosts. I wouldn’t expect either to be provided as part of the free service anyway, but they seem like the next major areas of value it could offer (with significant effort required, I imagine).

Still, it’s been a real luxury to have a full CI setup for hacking on a little side project. So thanks for the generosity, Travis CI.

Testing@LMAX – Distributed Builds with Romero

LMAX has invested quite a lot of time into building a suite of automated tests to verify the behaviour of our exchange. While the majority of those tests are unit or integration tests that run extremely fast, in order for us to have confidence that the whole package fits together in the right way we have a lot of end-to-end acceptance tests as well.

These tests deliver a huge amount of confidence and are thus highly valuable to us, but they come at a significant cost because end-to-end tests are relatively time consuming to run. To minimise the feedback cycle we want to run these tests in parallel as much as possible.

We started out by simply creating separate groupings of tests, each of which would run in a different Jenkins job and thus in parallel. However, as the set of tests changed over time we kept having to rebalance the groups to ensure we got fast feedback. With jobs finishing at different times they would also tend to pick different revisions to run against, so we often had no single revision that all the tests had run against, reducing confidence in the specific build we picked to release to production each iteration.

To solve this we’ve created custom software to run our acceptance tests which we call Romero. Romero has three parts:

  • The Romero server itself which coordinates everything
  • Servers which run a full copy of the exchange 
  • Agents which are allocated a test to run against a specific server

At the start of a test run, the revision to test is deployed to all servers, then Romero loads all the tests for that revision and begins allocating one test to each agent and assigning that agent a server to run against. When an agent finishes running a test it reports the results back to the server and is allocated another test to run. Romero also records information about how long a test takes to run and then uses that to ensure that the longest running tests are allocated first, to prevent them “overhanging” at the end of the run while all the other agents sit around idle.

To make things run even faster, most of our acceptance test suites are able to run in parallel, running multiple test cases at once and also sharing a single server between multiple Romero agents. Some tests, however, exercise functions which affect the global state of the server and can’t share servers. Romero is able to identify these types of tests and use that information when allocating agents to servers. Servers are designated as either parallel, supporting multiple agents, or sequential, supporting only a single agent. At the start of the run Romero calculates the optimal way to allocate servers between the two groups, again using historical information about how long each test takes.

All together this gives us an acceptance test environment which is self-balancing – if we add a lot of parallel tests one iteration, servers are automatically moved from sequential to parallel to minimise the run time required.

Romero also has one further trick up its sleeve to reduce feedback time – it reports failures as they happen instead of waiting until the end of the run. Often a problematic commit can be reverted before the end of the run, which is a huge reduction in feedback time – normally the next run would already have started with the bad commit still in before anyone noticed the problem, effectively doubling the time required to fix.

The final advantage of Romero is that it seamlessly handles agents dying, even in the middle of running a test, reallocating that test to another agent, giving us better resiliency and keeping the feedback cycle going even in the case of minor problems in CI. Unfortunately we haven’t yet extended this resiliency to the test servers, but it’s something we would like to do.