So obviously testing is important. That’s not exactly deeply insightful; we all know this. Even the slightest, most unassuming change can have weird effects. We sometimes try to forget this and say things like, “That’s such a minor section, it shouldn’t break anything else.” Talk like that is just asking the universe to remind you.
My latest example of this was a minor patch release we were putting out a few months ago. I’m sure you all remember Heartbleed. We had the patched version of OpenSSL quickly available for customers, but after the dust settled we decided it would be best to roll it into a point release of the product and thus have it in all new downloads and installs by default.
This release was quite simple: update the version of OpenSSL included in the Apache bundled with our installer, and change our version number from “a.b.c” to “a.b.d”. As this was just a repackaging of the patch, the updated OpenSSL was even already in use by most of our customers! What could go wrong?
We made the changes and passed the software over to QA. Everything was fine until one of the final tests involving a somewhat uncommon, but certainly not unused, configuration.
The software always checks its version number against the version number stored in the database to see if any db updates need to be run for the latest version (and will then run any that apply). However, under this particular configuration, which allows for a read-only clone of the application, the software compares the two versions and simply won’t start unless they are equal.
This seems fine, except that it turns out that upgrading the software only changes the version number in the database when database upgrades are applied, and there weren’t any for this patch release, thus preventing this configuration from running at all after the upgrade.
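To make the clash between those two behaviours concrete, here is a minimal sketch in Python. It is not our actual code: the `Database` class, `upgrade`, `can_start`, and the shape of the `upgraders` mapping are all hypothetical names made up for illustration. The only things taken from the story are the two rules themselves: the stored version only moves when an upgrader runs, and the read-only configuration insists on an exact match.

```python
from dataclasses import dataclass, field


@dataclass
class Database:
    """Hypothetical stand-in for the application database."""
    version: str
    applied: list = field(default_factory=list)


def upgrade(db, software_version, upgraders):
    """Run the upgraders registered for this release (assumed layout:
    a dict mapping version string -> list of callables).

    The stored version is only bumped when at least one upgrader ran,
    which mirrors the behaviour described above.
    """
    pending = upgraders.get(software_version, [])
    for step in pending:
        step(db)
        db.applied.append(step.__name__)
    if pending:
        db.version = software_version


def can_start(db, software_version, read_only):
    """Read-only clones demand an exact version match; the normal
    configuration is assumed to start and upgrade as needed."""
    if read_only:
        return db.version == software_version
    return True


# A patch release with no database upgraders registered at all.
db = Database(version="a.b.c")
upgrade(db, "a.b.d", upgraders={})

normal_ok = can_start(db, "a.b.d", read_only=False)
read_only_ok = can_start(db, "a.b.d", read_only=True)
```

In this sketch `normal_ok` ends up `True` and `read_only_ok` ends up `False`, because the database still reports “a.b.c” after the upgrade. Each rule looks reasonable in isolation; it is only the combination, on a release with nothing to upgrade, that locks the read-only clone out.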
So we had to dig in and figure out these conflicting behaviours. This was actually the first point release since the db upgraders had been rewritten (which wasn’t that recent, but we don’t do these sorts of patches often), so this interaction hadn’t been caught before. We made some changes and added a patch release (“a.b.d”) database upgrader that had no actual executions, figuring that would fix the problem. Strangely, it did not. It turns out that there is a race condition in the upgraders (by some crazy design) between the actual updates and the version update, causing the version update to not run unless the updates take some actual execution time, which, in our test case, they didn’t. Lovely.
So what was supposed to be the simplest, non-code-changing release, now required either:
- A rewrite of part of the db upgrades on what was supposed to be a patch release with little dev time required.
- Disabling the read-only configuration version check (not desirable)
- Or adding in a db upgrader that did nothing other than pause for a small period of time (a rather hacky solution that could itself be error prone)
Suffice it to say, the release took a little longer than anticipated.
As I said, obviously testing is important. Had we not run a full regression test cycle, including uncommon configurations, despite everyone thinking that the changes were incredibly minor and that the OpenSSL patch was already being used by customers “in the wild”, we would not have caught this. But then again, as I also said, we all know this. It’s just useful to see the reminder sometimes.