Continuous integration (CI) has long been an interest of mine, both in work and out of it. In the NetSurf project we have our own CI infrastructure which cross-builds for all our targets and allows us to manage our input toolchains etc. But it consumes a non-trivial amount of hardware, which in turn costs a non-trivial amount in electricity, hosting, and so on. And that's just for a small niche web-browser project. Fortunately we have a wonderful CI manager in the form of Vince Sanders, who looks after that infrastructure for us, and we rely on donations of hardware and hosting to keep it going.

Another interest of mine, which has become particularly important to me over the past few years, is the concept of elastic compute power. Amazon really pushed this with their EC2 platform, but others offer it too, with OpenStack-based solutions as well as plenty of other platforms. The particular feature of elastic compute which interests me from the point of view of CI is that of ensuring that I only consume CPU/RAM/IO resources when there is work to do and not otherwise. This is hard to do if you have a dedicated system which runs your CI controller (such as Jenkins, which we use in the NetSurf project), perhaps with a database behind it to store all the jobs and their historical data. Since your controller machine is therefore probably reasonably powerful already, the temptation is to give it a little more oomph and make it one of your workers for building your code. This means it has even more storage always online (for doing the builds), and even more CPU, RAM and IO capacity which is usually idle but still costing you money to run.

I imagine there are plenty of solutions out there to this combination of technologies and perceived issues, but I've not found one I like. As such I have been thinking fairly carefully about the properties I want in an elastic CI system. (Please note, I'm using Amazon EC2 instance sizes simply as an example; I'm very much not wedded to EC2, and indeed OpenStack would probably be preferable to me since it offers the chance of more heterogeneous compute availability in a single control zone.)

Lightweight controller

First, I want the controller to be as light as possible. Specifically I'd like it to serve almost entirely static content, allowing it to consume only as much CPU and IO bandwidth as is necessary to serve a not-often-examined static website, and only as much storage as there is job history I wish to retain.

Second, I want the controller to be a lightweight reactive endpoint whose API exists explicitly to receive events: from external entities, such as git repositories, to trigger operations without needing heavy processing on the controller; and from internal entities, such as backends, to enumerate available elastic resources and to acquire access thereto. In addition, I'd expect that API to provide resources back to the frontend to be stored in the static output space.

Finally, I want the controller to be capable of spooling up additional elastic computation resource on demand in order to process incoming changes, or potentially to answer deeper, harder queries which do not have pre-prepared static results.

I am imagining that such a controller would be runnable on an Amazon t2.micro instance.
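
To make that a little more concrete, here is a minimal sketch of the sort of reactive endpoint I mean, in Python using only the standard library. The /event path and the spool_backend() function are hypothetical placeholders for whatever elastic API (EC2, OpenStack, ...) would actually provision an instance; plain GET requests just fall through to serving the static job-history pages.

    # Minimal sketch of a reactive CI controller: static files for GET,
    # a single /event endpoint for POSTed triggers (e.g. git push webhooks).
    # spool_backend() is a placeholder for whatever elastic provider API
    # (EC2, OpenStack, ...) actually provisions instances.
    import json
    from http.server import HTTPServer, SimpleHTTPRequestHandler


    def spool_backend(trigger):
        """Placeholder: ask the elastic provider for a coordination backend
        and hand it the trigger.  In reality this would be a 'start instance'
        call plus a message to the newly started instance."""
        print("would spool a coordination backend for", trigger)


    class ControllerHandler(SimpleHTTPRequestHandler):
        # GET requests fall through to SimpleHTTPRequestHandler and serve
        # the static job-history pages from the current directory.

        def do_POST(self):
            if self.path != "/event":
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            trigger = json.loads(self.rfile.read(length) or b"{}")
            spool_backend(trigger)      # fire-and-forget; no heavy work here
            self.send_response(202)     # "accepted", nothing more to say
            self.end_headers()


    if __name__ == "__main__":
        HTTPServer(("", 8080), ControllerHandler).serve_forever()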

Coordination backend

I am imagining that behind the controller, diallable on demand, we will have at least one coordination backend. Such a backend would be responsible for taking trigger information from the lightweight controller, deciding what jobs need to happen, and then scheduling that work. In general you would likely have only one coordination backend, but it ought to be possible to have more than one, perhaps if a controller is shared between multiple projects.

The coordination backend would have to think harder than the controller, so perhaps it would need to be a slightly larger instance, and perhaps it would also contain enough resource to be a worker if necessary. Persistent storage between builds would have to be provided by some sort of storage backend such as EBS, but for now I'm leaving that for a later decision.
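
As a sketch of what "deciding what jobs need to happen" might look like, assuming a simple table mapping repositories to the jobs a change to them implies; the repository names, job list and dispatch() below are illustrative placeholders, not NetSurf's real pipeline:

    # Sketch of a coordination backend's core decision: map an incoming
    # trigger to the set of jobs it implies, then dispatch each job to a
    # compute backend.  The job table and dispatch() are illustrative only.
    from dataclasses import dataclass


    @dataclass
    class Job:
        name: str
        cpus: int           # how big a compute backend it wants
        est_minutes: int    # rough CPU-minute estimate, for scheduling


    # Hypothetical job table: which jobs a change to each repository implies.
    JOBS_BY_REPO = {
        "libcss":  [Job("build-libcss", 1, 2), Job("build-netsurf-all", 2, 20)],
        "netsurf": [Job("build-netsurf-all", 2, 20)],
    }


    def dispatch(job):
        """Placeholder: acquire (or reuse) a compute backend of job.cpus CPUs
        and run the job there, releasing the instance once the queue drains."""
        print(f"dispatching {job.name} ({job.est_minutes} CPU minutes)")


    def handle_trigger(trigger):
        repo = trigger.get("repository", "")
        for job in JOBS_BY_REPO.get(repo, []):
            dispatch(job)


    handle_trigger({"repository": "netsurf", "ref": "refs/heads/master"})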

Compute backend

Between zero and arbitrarily many backend systems will be needed to actually scale up and run the builds, tests, etc. necessary to satisfy the job sequence triggered by the incoming events. As an example, NetSurf consumes around 2.5 minutes per change to the browser code, and that is spread among nine different builds across ten hosts covering five host operating systems. Some of those hosts have multiple CPUs, but some are VMs on the same physical hardware. Averaging that out therefore suggests that a change consumes around 20 CPU minutes to pass through our CI pipeline (without tests).
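
The averaging there is a little compressed, so here is the back-of-the-envelope version. The assumption that each of the nine builds occupies roughly the full 2.5 minutes is mine, as is the downward rounding to allow for builds sharing physical hardware:

    # Rough check of the "~20 CPU minutes per change" figure.  The assumption
    # that each build occupies roughly the full 2.5 minutes is mine; rounding
    # down allows for builds sharing the same physical hardware.
    builds_per_change = 9
    minutes_per_build = 2.5
    print(builds_per_change * minutes_per_build)   # 22.5 -> call it ~20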

Costing out NetSurf's CI

Using Amazon EC2 as our model, with a t2.micro instance for the controller and m3.medium instances for the coordination and compute backends, we can estimate the total cost of running NetSurf's CI load. To do that we need to estimate the number of builds (and the complexity thereof) for NetSurf.

When we do a toolchain build, we need between 20m and 2h depending on the platform, totalling around 6h30m of time on an average of 2 CPUs. Some toolchains are rebuilt more often than others, but we seem to do about two toolchain builds per month.

When we do a software build, let's assume on average that we consume 20 CPU minutes, since some builds will be easier than others given persistent storage of intermediate objects. That's 20 CPU minutes per build at an average change rate of around four per day consistently across the year (more when we're hack-festing, fewer when we're all on holiday or at another conference).

Totalling all the above reaches around 37,000 CPU minutes per year of compute (ca. 620 hours). If we assume we have a t2.micro on all the time, we can use the pre-pay schemes Amazon has in place. On the assumption that CI is a long-term investment, the controller will consume 151 USD for 3 years of runtime. 620 hours per year of m3.medium will cost around 43 USD per year, making a total cost for 3 years of CI of around 280 USD. This is pretty low considering that simply renting a system capable of running our CI would set us back around 2000 USD for a similar period. However, as the load on your CI system increases there will naturally be a point at which it is much less cost-effective to use an elastic solution than dedicated hardware.

Also, the above doesn't take into account storage costs. A rough guess of a hundred gigabytes of storage across the three years adds another 400 USD or so.
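
For anyone who wants to poke at the estimate, here is the arithmetic in one place. The ~325 CPU minutes per toolchain build is back-derived from the 37,000-minute total rather than measured, and the 0.07 USD/hour and 0.11 USD/GB-month rates are approximations consistent with the 43 USD and 400 USD figures above, not quotes from Amazon's price list:

    # Reproducing the cost estimate above.  The per-hour and per-GB-month
    # rates are back-of-envelope figures consistent with the article's
    # totals, not quotes from Amazon's current price list.
    software_min = 20 * 4 * 365               # 20 CPU min/change, 4 changes/day
    toolchain_min = 2 * 12 * 325              # ~2 toolchain builds/month, ~325 min each (assumed split)
    total_min = software_min + toolchain_min  # ~37,000 minutes
    total_hours = total_min / 60              # ~620 hours/year

    controller_3yr = 151                      # t2.micro, 3-year pre-pay
    m3_medium_rate = 0.07                     # USD/hour, approximate
    compute_3yr = 3 * total_hours * m3_medium_rate   # ~130 USD
    storage_3yr = 100 * 36 * 0.11             # ~100 GB for 36 months, ~400 USD

    print(round(total_min), round(total_hours))      # 37000 minutes, ~620 hours
    print(round(controller_3yr + compute_3yr))       # ~280 USD for 3 years of CI
    print(round(storage_3yr))                        # ~400 USD of storage on top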

Costing out a smaller project

Gitano is one of my personal projects and, like NetSurf, consists of a number of libraries and then a main program. Since it's just me, and I am a lazy programmer, total changes per week on Gitano probably average around five over its lifetime. A full build, running all the tests for Gitano and its libraries, consumes around 2 CPU minutes. With the base cost of 151 USD for the controller for 3 years, and ca. 520 minutes (9-ish hours) per year of build time, that leads to a total of less than 155 USD for the three years.
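
The same sum for Gitano, using the same assumed m3.medium rate as above:

    # Gitano on the same basis; the 0.07 USD/hour rate is the same assumed
    # m3.medium figure used for the NetSurf estimate.
    build_min = 2 * 5 * 52                  # 2 CPU min/build, ~5 changes/week
    build_hours = build_min / 60            # ~8.7 hours/year
    compute_3yr = 3 * build_hours * 0.07    # under 2 USD of build time
    print(round(151 + compute_3yr))         # ~153 USD for the 3 years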

Conclusions

Elastic compute is pretty cheap, but minimising the cost of the persistent controller will be the critical point. This is why I suggested that the controller might want to be shared between multiple projects. If Gitano and NetSurf could share one controller, Gitano'd halve its cost per year, and NetSurf would be saving 70 USD. The more projects which can share the controller the more that saving can be effected. With pre-provisioned elastic compute to be shared, the cost could come down further. Sadly very few small projects can afford 50 to 100 USD per year for their CI, but perhaps enough interested parties could sponsor it for such projects.
