In 2018, our team spent a lot of time working with feature flags and test-driven development (TDD).
Our project was to effect an architectural change to our system, changing the source of truth of some data, moving it out of the database owned by a legacy monolith into a new database controlled by a new microservice. However, much of the code requiring the data would remain in the monolith.*
Some examples of the types of things we feature-flagged are:
- whether to go down a refactored code path or not;
- whether to publish messages to a message queue when a certain event occurred;
- how to publish those messages (we tried multiple variations of batching and transaction boundaries to achieve acceptable performance);
- whether to just delete messages at the receiving end or actually handle them; and
- whether to use a local source of data or remote one.
We were working on a pretty important piece of code: the kind of business function where, if we stuffed it up, someone would probably have to spend several days doing remedial fixups or making phone calls to chase up millions of dollars. Hence, most of the feature flags we put in were there to ensure we could test new code paths in production and roll back safely and quickly. Some of them we even deliberately switched on for 10 minutes, let the new code run, shut them off again and then looked at the results offline. From memory, all of these flags were simple on/off switches, although we do have the capability to do per-client, per-customer or per-category flags when we need to. These tips should work for either type.
Because we used so many different feature flags in a relatively short period of time, we got to experiment with and see the effect of a number of different patterns, both good and bad, and developed the following set of guidelines.
For context, in this particular app, our oldest and biggest monolith, most classes have a 1:1 unit test. That’s no longer our preferred practice, but it’s a legacy we live with when working in this app. It also has a large collection of integration tests.
Goals for Feature Flags with TDD
There were a few goals that we were aiming for as we did this work:
- We want to make sure all currently tested paths that the flag affects are tested with the flag on and off.
- We don’t want our test suites to become too large and/or hard to understand because of feature flags.
- We want a git history that’s easy to understand despite the temporary complexity of a feature flag.
- Ideally, deleting a feature flag will be a quick task that requires no in-depth reading of the code or tests.
Feature Flags and TDD: Our Five Guidelines
1. Make a copy of every affected test
The first step to ensuring tests don’t get unwieldy is to not try and test all the feature flag ON and OFF cases in one test. So: don’t open up the existing test file and start adding new “Feature X ON” test cases. Instead, make TWO test files: one for flag on, one for flag off. Yes, we’ll probably have a bunch of duplicated test cases that are exactly the same in both files, maybe even most of them, but this is only temporary.
There are three benefits to this approach. First, we’ve avoided creating a confusing monstrosity of a test with flag on, flag off, and flag-agnostic tests all mingled in together. Second, once we’ve decided to keep the flag on permanently, we can just delete the whole “flag off” test file. “Simples!” Third, we can do the setup for the flag once at the top of each test file, rather than having to set it at the start of each test case and hoping people notice that detail when reading the test.
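As a rough sketch of that layout, assuming a Python codebase and a made-up in-memory flag store: `FLAGS`, `feature_x` and `describe_payment` are all hypothetical names for illustration, and the two classes would live in separate files in a real project.

```python
import unittest

# Hypothetical in-memory flag store standing in for a real flag service.
FLAGS = {"feature_x": False}


def describe_payment(amount):
    """Hypothetical code under test; the flag selects the new code path."""
    if FLAGS["feature_x"]:
        return f"payment:{amount:.2f}"  # new path: always two decimals
    return f"payment:{amount}"          # legacy path: raw value


# --- describe_payment_test.py: the enduring suite, flag ON ---
class DescribePaymentTest(unittest.TestCase):
    def setUp(self):
        # The flag is set once here, not inside every test case.
        previous = FLAGS["feature_x"]
        FLAGS["feature_x"] = True
        self.addCleanup(FLAGS.__setitem__, "feature_x", previous)

    def test_has_payment_prefix(self):  # flag-agnostic, duplicated in both files
        self.assertTrue(describe_payment(5).startswith("payment:"))

    def test_formats_two_decimals(self):
        self.assertEqual(describe_payment(5), "payment:5.00")


# --- describe_payment_pre_feature_x_test.py: temporary copy, flag OFF ---
class DescribePaymentPreFeatureXTest(unittest.TestCase):
    def setUp(self):
        previous = FLAGS["feature_x"]
        FLAGS["feature_x"] = False
        self.addCleanup(FLAGS.__setitem__, "feature_x", previous)

    def test_has_payment_prefix(self):  # same test as in the flag ON file
        self.assertTrue(describe_payment(5).startswith("payment:"))

    def test_keeps_raw_amount(self):
        self.assertEqual(describe_payment(5), "payment:5")
```

When the flag goes away, the second class (file) is simply deleted, along with the `setUp` in the first.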
2. Make the copied file the test for the “Old” functionality, and the original file the test for the “New”
When we make a copy of the existing test to a new file, we want to make the new file the tests for the old functionality – with the feature flag off (e.g. cp FooTest FooPreFeatureXTest) – and change the original test file to have the test cases for when the flag is on. This helps to keep a contiguous SCM history for the enduring test suite showing how the functionality progressed.
The alternative is to put the “feature flag on” test cases in the new file and then, when we get rid of the feature flag, delete the original test and rename the “new” one to the “normal” one (e.g. mv FooWithNewFeatureXTest FooTest). We found problems with this latter approach when we looked at our git history: it would often show the whole test being deleted then re-created, which makes it much harder to inspect what changes were made to the tests when the new change was introduced.
3. At the end of any test that enables a feature flag, always set the flag back to what it was before the test started
This is important because, if our flags are stateful in a way that isn’t automatically reset between tests, a flag left on by one test can leak into later tests. Tests that don’t explicitly turn the flag on or off will then run with the flag on or off depending on the order in which the tests are executed. For most tests this won’t make a difference, but for some it will, and we could get build failures that don’t make sense, because nothing in the failing test’s own code shows that another test left the flag on before it ran.
Note also the careful wording of this rule: we don’t always disable the flag at the end of the test; we set it back to what it was before the test. Being disciplined in resetting the flag is crucial to making the next tip work.
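One way to make “set it back to what it was” hard to forget is a small helper that remembers the previous value. This is only a sketch, assuming Python and a hypothetical in-memory `FLAGS` dictionary; `flag_set_to` is a name invented for illustration, not part of any library.

```python
from contextlib import contextmanager

# Hypothetical in-memory flag store; a real system would call its flag service.
FLAGS = {"feature_x": False}


@contextmanager
def flag_set_to(name, value):
    """Temporarily set a flag, then restore whatever value it had before."""
    previous = FLAGS[name]
    FLAGS[name] = value
    try:
        yield
    finally:
        # Restore the earlier value (not simply "off"), even if the test failed.
        FLAGS[name] = previous
```

A test body wrapped in `with flag_set_to("feature_x", False):` leaves the flag exactly as it found it, which matters when the whole build is being run with the flag defaulted to on.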
4. Run the whole build with the feature flag ON
Once we think we’ve created pre-feature and post-feature versions of all the test suites that we believe are affected by the feature flag, we need to run the whole build with the flag ON. This will flush out tests that are affected by the flag where we haven’t yet created a divergent suite. If we’ve been sufficiently comprehensive in our test modifications, nothing should fail, because all tests of paths that care about the flag should already be explicitly setting the flag in the test.
In a simple microservice, we’re unlikely to find anything with this step, although it should be quick to run so it’s still worth doing. On the other hand, in a complex monolith with many components, it’s quite possible that there are parts of the system that we’ve forgotten rely on the behaviour that we’ve just changed, and their integration tests may well fail as a result of the changes. We really want to find these transitive breakages before switching on the feature flag in a production system. If we don’t, we actually end up running an untested version of that dependent component in prod.
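How the default gets flipped for a whole build depends on the flag system in use. As one possible sketch, assuming Python and a flag default that can be overridden by an environment variable (the `FLAG_<NAME>` naming and the `flag_default` helper are assumptions made up for this example):

```python
import os

TRUTHY = {"1", "on", "true", "yes"}


def flag_default(name, environ=None):
    """Return a flag's default, overridable via a FLAG_<NAME> environment variable."""
    env = os.environ if environ is None else environ
    return env.get(f"FLAG_{name.upper()}", "off").strip().lower() in TRUTHY


# Hypothetical flag store: tests that care about the flag set it explicitly;
# every other test inherits this default.
FLAGS = {"feature_x": flag_default("feature_x")}
```

With something like this in place, running the test suite with `FLAG_FEATURE_X=on` set in the environment exercises every test that doesn’t pin the flag itself, which is where the forgotten dependencies show up.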
5. Switch on and delete feature flags ASAP
Feature flags are a device for assisting in controlled, reversible migration in production systems, but they also complicate the code. We want our code to be simple, so when there are feature flags in place, we make it a high priority to get them switched on in production. If we don’t prioritise switching them on, we can’t prioritise deleting them. This becomes especially taxing if we find the need to put other feature flags in the same area and start getting feature flag interplay.
The instant that a migration is completed in prod and we’re happy that the functionality should remain, the feature flags become technical debt. We want to remove them as soon as possible afterwards so that we revert our code to being as simple as it can be.
Yo, what about the Strategy pattern?
When I shared some of the above tips on Twitter, a few people responded that they like to use the Strategy pattern when they’re doing feature flagging.
While I’m a big fan of the Strategy pattern, I would generally avoid it when feature flagging. My reasoning is that feature flagging is ideally a temporary measure, whereas the Strategy pattern is a design construct for supporting multiple implementations. While feature flagging often results in multiple implementations for a short time, I don’t think it’s worth introducing a design construct for this short period, only to delete it again shortly after, once the flag is removed. So we generally just used an if/else in all places that rely on the flag, knowing that the ‘else’ branch will be removed shortly. We would probably make an exception to this if the feature flag pertained to a large or complex piece of code that was being almost entirely re-written, and it was clear that readability would be greatly improved by having the two implementations in separate classes.
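To illustrate the plain if/else approach, here is a minimal sketch in Python. The flag name, the `LOCAL_ACCOUNTS` data and the `fetch_remote_account` stand-in are all hypothetical; the point is only how little there is to delete afterwards.

```python
# Hypothetical in-memory flag store and data sources, for illustration only.
FLAGS = {"use_remote_source": False}
LOCAL_ACCOUNTS = {"acc-1": "local record"}


def fetch_remote_account(account_id):
    """Stand-in for a call to the new microservice."""
    return "remote record"


def load_account(account_id):
    if FLAGS["use_remote_source"]:
        return fetch_remote_account(account_id)
    else:
        # Legacy path: this branch and the flag check get deleted
        # once the migration is complete.
        return LOCAL_ACCOUNTS[account_id]
```

Removing the flag later means deleting the condition and the `else` branch, with no interface or extra classes to unwind.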
How do YOU do Feature Flags and TDD?
These are the main guidelines that our team developed through almost a year of feature-flagging our way through a complex piece of migration work. If you’ve got other tips for how to work with feature flags, particularly in a TDD environment, I’d love to hear about them in the comments below.
* I just want to note that this is not a good design, nor one that we were keen on implementing. It was a compromise forced by a data migration effort with a deadline and some very old code that couldn’t be extracted from a monolith easily.