In this post, I’ll share how we significantly improved our continuous integration (CI) – by a double-digit percentage – thereby reducing delivery times, cutting cloud costs, and making developers’ lives better. This post is intended primarily as a guide for readers looking to make meaningful improvements to their CI without derailing their quarterly plans. While implementation details will vary across organizations, I hope you’ll find ideas here that you can apply in your own environment.
In the financial domain, reliability and accuracy are especially critical. To ensure both, we test our code at multiple levels: unit tests, integration tests, and several types of end-to-end tests. End-to-end tests are particularly important – not only because of the broad coverage they provide, but also because they allow us to demonstrate the system’s outputs to clients across versions and environments. This builds confidence in our system and proves that no regressions have been introduced. To minimize the risk of costly regressions, we decided early on to run most of our tests on most of our pull requests (PRs).
The challenge with end-to-end tests and a strict testing policy, however, is the overhead they add to our CI. So we set out to improve it, focusing on speed (for better delivery times and developer experience) and cost-efficiency (for reduced expenses), without compromising accuracy or reliability.
Mapping the Issues
Naturally, we focused first on the most time-consuming bottlenecks and cost inefficiencies, and started with the low-hanging fruit. Once we had a clear understanding of our main bottlenecks, in terms of both cost and CI duration, we brainstormed and devised plans that evolved as we implemented solutions.
Ultimately, we tackled the issue using a three-pronged approach: reducing CI time, improving CI stability, and minimizing requeues after service deployment.
Some of the methods below are simple; others, such as the smart test selection initiative and the "latest stable" concept, are more complex. Together, they constitute a CI improvement playbook that made a huge difference for our team at FundGuard.
1) Reducing CI Time
We greatly reduced our CI runtime using the following techniques:
1A) Parallelizing Testing
We’ve had a test parallelization framework in place for a long time, but we were still leaving a lot on the table. We improved this by:
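A common way to get more out of test parallelization is to balance shards by historical test duration rather than by test count, so no single shard becomes the long pole. The following minimal Python sketch illustrates that idea only; the shard count, test names, and timing data are assumptions for illustration, not our actual configuration.

```python
# Illustrative sketch: split tests into parallel shards of roughly equal runtime,
# using historical durations and a greedy longest-test-first assignment.
import heapq


def split_into_shards(test_durations: dict[str, float], num_shards: int) -> list[list[str]]:
    """Assign tests to shards so the total runtime per shard is roughly equal."""
    # Min-heap of (accumulated_duration, shard_index) so we always pick the lightest shard.
    heap = [(0.0, i) for i in range(num_shards)]
    heapq.heapify(heap)
    shards: list[list[str]] = [[] for _ in range(num_shards)]

    # Place the longest tests first; each goes to the currently lightest shard.
    for test, duration in sorted(test_durations.items(), key=lambda kv: kv[1], reverse=True):
        total, idx = heapq.heappop(heap)
        shards[idx].append(test)
        heapq.heappush(heap, (total + duration, idx))
    return shards


if __name__ == "__main__":
    # Hypothetical historical durations in seconds.
    durations = {"test_pricing": 420.0, "test_reports": 310.0, "test_ledger": 95.0, "test_api": 60.0}
    for i, shard in enumerate(split_into_shards(durations, num_shards=2)):
        print(f"shard {i}: {shard}")
```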
1B) Reducing Test / Pipeline Duration
We shortened our CI duration through:
1C) Testing Smarter
We aimed to run fewer tests without compromising coverage, approaching the problem in two directions:
To evaluate potential test reductions, we asked these questions for each optimization idea:
We began with simple heuristics, such as:
These heuristics saved significant time – but we wanted to do more. We realized that for some of our heaviest tests, we could retrospectively determine whether they were truly necessary by analyzing completed PRs and their respective pipelines. To capitalize on this insight, we trained a simple ML classifier using past PRs as a dataset, where the changed files serve as the input (features for the model) and whether the tests were ultimately necessary serves as the label. We then translated the model’s insights into easy-to-understand heuristics used in our pipelines:
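For illustration, here is a minimal sketch of what such a classifier could look like, assuming scikit-learn, a toy dataset of past PRs, and the simplification that changed file paths are treated as bag-of-words features. The file paths and the single "heavy suite needed" label are hypothetical and stand in for whatever labels the pipeline analysis produces.

```python
# Illustrative sketch: learn from past PRs whether a heavy test suite was actually needed,
# based only on which files the PR changed.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each sample: a PR's changed file paths joined into one string,
# labeled 1 if the heavy end-to-end suite turned out to be necessary, else 0.
past_prs = [
    ("core/pricing/engine.java core/pricing/curves.java", 1),
    ("docs/README.md docs/architecture.md", 0),
    ("ui/dashboard/chart.ts ui/dashboard/styles.css", 0),
    ("core/ledger/postings.java tests/e2e/ledger_test.java", 1),
]
texts, labels = zip(*past_prs)

model = make_pipeline(
    CountVectorizer(token_pattern=r"[^\s]+"),  # treat each whole path as a token
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

# At CI time: predict whether a new PR's changed files warrant the heavy suite.
new_pr_files = "core/pricing/engine.java"
needs_heavy_suite = model.predict([new_pr_files])[0]
print("run heavy e2e suite" if needs_heavy_suite else "skip heavy e2e suite")
```

In practice, the value of such a model is less in running it online and more in inspecting its learned weights to derive simple, reviewable rules for the pipeline.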
1D) Preventing the Introduction of New CI Bottlenecks
Improvements won't last without enforcement. Developers under pressure to ship features can't always prioritize CI efficiency on their own, so we introduced guardrails, such as limiting the runtime of specific tests and capping overall pipeline duration, and we continuously monitor our CI duration to detect regressions.
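As a rough sketch of what a duration guardrail can look like, the script below fails a job when the overall pipeline exceeds a time budget. The budget value and the PIPELINE_STARTED_AT environment variable are assumptions; per-test limits would typically be enforced through the test runner's own timeout settings.

```python
# Illustrative guardrail: fail the job if the pipeline has run longer than its budget.
import os
import sys
import time

PIPELINE_BUDGET_SECONDS = 45 * 60  # assumed overall pipeline budget


def check_pipeline_duration() -> int:
    # Assumes the pipeline exports its start time (epoch seconds) when it begins.
    started_at = float(os.environ["PIPELINE_STARTED_AT"])
    elapsed = time.time() - started_at
    if elapsed > PIPELINE_BUDGET_SECONDS:
        print(f"Pipeline exceeded its budget: {elapsed:.0f}s > {PIPELINE_BUDGET_SECONDS}s")
        return 1
    print(f"Pipeline within budget: {elapsed:.0f}s elapsed")
    return 0


if __name__ == "__main__":
    sys.exit(check_pipeline_duration())
```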
1E) Incremental Builds
When build and unit test times are low, rebuilding the entire project for each pipeline run is manageable – and the simplicity of doing so has its advantages. But as the system grows larger and more complex and build times increase, rebuilding everything for every PR becomes inefficient – especially since most PRs don’t change most modules. To address this, we implemented incremental builds: caching build artifacts and reusing them instead of rebuilding and retesting when no changes have been made to a given module (with help from open source tools).
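The sketch below illustrates the caching idea in its simplest form, assuming a content hash of a module's sources as the cache key. The paths, file patterns, and build_fn hook are hypothetical, and a real setup would lean on a dedicated open source build-cache tool rather than hand-rolled code.

```python
# Illustrative sketch: skip rebuilding a module whose sources haven't changed,
# by keying cached artifacts on a hash of the module's source files.
import hashlib
import shutil
from pathlib import Path

CACHE_DIR = Path(".build-cache")


def module_fingerprint(module_dir: Path) -> str:
    """Hash all source files in a module so any change produces a new cache key."""
    digest = hashlib.sha256()
    for path in sorted(module_dir.rglob("*.java")):  # assumed source file pattern
        digest.update(path.as_posix().encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()


def build_module(module_dir: Path, build_fn) -> Path:
    """Return a cached artifact if the module is unchanged, otherwise build and cache it."""
    key = module_fingerprint(module_dir)
    cached = CACHE_DIR / f"{module_dir.name}-{key}.jar"
    if cached.exists():
        print(f"cache hit for {module_dir.name}, skipping rebuild")
        return cached
    artifact = build_fn(module_dir)  # hypothetical hook: expensive compile + unit tests
    CACHE_DIR.mkdir(exist_ok=True)
    shutil.copy(artifact, cached)
    return cached
```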
1F) Reusing and Sharing Testing Infrastructure
Sharing infrastructure (like databases) or even microservices across pipelines can save time and cost – but adds complexity and may affect stability, as different PRs can interfere with each other. We haven’t implemented this yet, but it’s on our roadmap. When we do, we’ll proceed carefully.
2) Improving CI Stability
We define a stability issue as a false negative in the pipeline: a PR that should have passed but failed because of an issue unrelated to the PR itself. Almost every stability issue that fails a pipeline forces a requeue (rerunning the pipeline), leading to slower deliveries, higher costs, and developer frustration.
After analyzing many stability issues, we realized that they mainly stem from:
To address these issues, we introduced the concept of the “latest stable” branch. The core idea is simple: we wanted to create a small buffer between the latest code (which may be unstable for the reasons mentioned above) and the developers, ensuring that the code they work with is stable.
Here's how it works: the "latest stable" branch always trails the main development branch slightly. Developers create new feature branches from it and rebase on top of it, ensuring they work with a stable version of the code. Feature branches are eventually merged into the main development branch, not into the latest stable branch (this works because latest stable is always behind the main development branch, so the two never diverge). To promote latest stable to a newer version, we run multiple instances of our main pipeline on the latest commit of the main development branch. If all the pipelines pass, we mark that commit as stable and advance the latest-stable branch to it; if not, the on-call shift is alerted to address the stability issue.
There is, of course, a trade-off when determining what qualifies as a stable commit. Stability is relative. For our purposes, we only need sufficient stability to ensure CI pipelines pass – clearly, we conduct more extensive testing before rolling out the system to clients.
On one hand, we want to provide developers with stable code. On the other hand, testing stability incurs time and cost – and requiring a higher level of stability increases the time and cost of verification. After some trial and error, we agreed to run three parallel CI pipelines every three hours to determine stability. This strategy offers a good balance: it’s stable enough for developers, provides relatively recent code, and is financially feasible.
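The promotion job itself can stay quite small. The following sketch assumes a scheduled job with hypothetical run_pipeline and alert_on_call hooks; it checks the newest commit on the main branch and fast-forwards the latest-stable branch only when every pipeline run passes. In practice the pipeline runs are launched in parallel; they are shown sequentially here for brevity, and the branch names are assumptions.

```python
# Illustrative sketch of a scheduled "latest stable" promotion job.
import subprocess

MAIN_BRANCH = "main"
STABLE_BRANCH = "latest-stable"
PIPELINES_PER_CANDIDATE = 3  # e.g. three runs per candidate commit


def latest_commit(branch: str) -> str:
    """Resolve the newest commit on a remote branch."""
    return subprocess.run(
        ["git", "rev-parse", f"origin/{branch}"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()


def promote_if_stable(run_pipeline, alert_on_call) -> None:
    candidate = latest_commit(MAIN_BRANCH)
    # run_pipeline(commit) -> bool is a hypothetical hook into the CI system.
    results = [run_pipeline(candidate) for _ in range(PIPELINES_PER_CANDIDATE)]
    if all(results):
        # Fast-forward the stable branch to the verified commit.
        subprocess.run(
            ["git", "push", "origin", f"{candidate}:refs/heads/{STABLE_BRANCH}"],
            check=True,
        )
    else:
        alert_on_call(f"Stability pipelines failed for {candidate}")
```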
This system comes with obvious pros and cons. The main drawback, working on slightly older code, is mitigated by the fact that the solution is backwards compatible: developers can still rebase from the main branch if they must work with the latest code (in which case they accept the risk of hitting the latest stability issues).
The latest-stable concept is also invaluable for detecting and investigating regressions. When a regression is introduced into the main branch, the three stability pipelines begin to fail, making it easier to pinpoint when the issue was introduced. This, in turn, helps quickly identify and revert the problematic change.
In addition to the latest-stable concept, we also introduced test burn-in: running new tests, especially the most error-prone ones, multiple times before integrating them into the pipeline. This significantly reduced test flakiness. We also ensure that our test infrastructure is designed for stability. For example, we have tools in place that enforce best practices, such as using predicates to await results when polling in our end-to-end tests, which helps prevent stability issues caused by eventual consistency.
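For example, the awaiting pattern can be as simple as a generic polling helper like the sketch below: instead of sleeping a fixed amount and asserting, the test polls until the result satisfies a condition or a timeout expires. The function and parameter names are illustrative, not our actual test API.

```python
# Illustrative sketch: poll for a result until a predicate holds, guarding against
# flakiness caused by eventual consistency.
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def await_result(fetch: Callable[[], T],
                 predicate: Callable[[T], bool],
                 timeout_seconds: float = 60.0,
                 poll_interval_seconds: float = 1.0) -> T:
    """Repeatedly call `fetch` until `predicate` accepts the result or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        result = fetch()
        if predicate(result):
            return result
        if time.monotonic() > deadline:
            raise TimeoutError(f"Condition not met within {timeout_seconds}s; last result: {result!r}")
        time.sleep(poll_interval_seconds)


# Hypothetical usage in an end-to-end test:
# report = await_result(lambda: client.get_report(fund_id), lambda r: r.status == "READY")
```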
3) Minimizing “Requeues After Service Deployment”
Once the unit and integration tests pass in a certain pipeline, we deploy a dedicated environment for end-to-end testing of the PR. This is when the CI can start to get expensive.
A requeue after service deployment – re-running the pipeline after deploying and running end-to-end tests – can be challenging. It means duplicated efforts, slower feedback, and higher costs. While we can’t eliminate these entirely, we reduced them by:
3A) Improving CI Stability
As discussed above – more stability means fewer requeues.
3B) Failing Fast
We moved some coverage earlier in the pipeline by converting key end-to-end tests into smarter, more focused unit tests (often referred to as "shifting left"). This gives developers faster feedback, makes tests easier to run locally, and helps us detect bugs before deploying costly services. While end-to-end tests are invaluable, well-designed unit tests often provide sufficient coverage at a fraction of the cost. By improving our unit test infrastructure and educating developers and code reviewers, we've shifted more coverage to these faster tests, leaving fewer end-to-end tests to run. We're also exploring AI tools to further enhance our unit and integration test coverage.
3C) Reducing Merge Conflicts
Requeues often follow merge conflicts, so we worked to reduce them by:
By applying the methods above, we significantly improved our CI, cutting both runtime and costs by tens of percent and boosting stability. But continuous integration requires continuous effort: there's always room to improve, and many more gains to discover.
Got more ideas for CI improvements? I’d love to hear them: michael.shachar@fundguard.com
Want in on the action? Check out our open positions: fundguard.com/careers