Keep your libraries updated and nobody gets hurt
Don't make people practise software archaeology - keep your system updated.
Updates?
One of the benefits of the open-source world is the unmatched ability to issue updates rapidly to correct bugs and security vulnerabilities.
Bug fixes keep streaming in: from recently introduced regressions to long-standing bugs that went undetected for years, issues are corrected by a wide range of people, from professionals paid to enhance the software to individual contributors who spot a problem.
Security updates eliminating threats have been known to roll out in a matter of hours - a turnaround rarely matched in the proprietary software world.
By not keeping our system up-to-date, we are effectively negating this benefit of utilising open-source software.
False sense of stability
It is easy to fall into the trap of not updating software to later versions, on the false assumption that a tested and monitored system is stable.
The reality is that the system is tested and monitored by us, the DevOps team, according to our own insights and to the best of our ability - not objectively.
We have seen, countless times, latent bugs discovered and unexpected behaviours triggered in software previously considered to be extremely stable. However, the fear of bugs, instead of prompting us to upgrade, sometimes leads to an aversion to updating software.
This fear of upgrading may arise because of:
- No zero-downtime procedure defined, or reluctance to schedule maintenance windows.
- Too much faith in our own tests, and not enough in others'.
- A cumbersome deployment procedure that discourages us from deploying often (i.e. whenever we need to upgrade something).
- Lack of a QA system for testing updates frequently.
- The absurd notion that "if we don't touch it, it can't break", stemming from our uncertainty about our own system.
If we have taken reasonable steps to eliminate the above factors, upgrading will be less of a hassle and can become a welcome procedure.
Real world failures
Sometimes skipping an update or two (or four years' worth of updates) for a library triggers unexpected breakages in other software that are very hard to troubleshoot. The reason is often twofold:
- The bug is not part of the software that it manifests in, but comes from an underlying library.
- The behaviour is impossible to reproduce, unless the outdated library is taken into account.
Let's look at two real world cases where this happened in the context of PostgreSQL:
Scenario 1
Database appears to function normally. However, when it is backed up with pg_dump and restored elsewhere, the restored database exhibits troubling behaviour. When we create a function in the restored DB, running it results in:
ERROR: cache lookup failed for type 5150
The investigation starts with querying pg_type to find the culprit. However, that proves fruitless, as the same database copy, restored elsewhere, exhibits failures for a different type ID. Trying barman restore instead of pg_dump does not fix the problem either.
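As a rough sketch, the lookups at this stage amount to something like the following (the OID is just the one from the error message; it changes with every restored copy):
SELECT oid, typname, typtype
FROM pg_type
WHERE oid = 5150;
-- typically returns no rows: the type OID the function references is simply missing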
Some are starting to suspect a deep bug in PostgreSQL that corrupts its pg_catalog, unlikely as that seems.
The debuggers come out: people go hunting for the bug in Postgres' code with gdb, but a surprising discovery is made - the error is coming not from within Postgres, but from the PGAudit extension.
A quick search confirms that this was triggered by a known bug that was fixed in PGAudit more than a year previously. An equally quick check reveals that on the restored database systems, the PGAudit extension hadn't been updated for two years.
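Once spotted, the fix is straightforward. As a hedged sketch (extension names and versions will vary per installation), the version drift can be checked from within the database and, after the newer OS package is installed, the extension brought up to date in place:
-- compare the version installed in this database with the one shipped on disk
SELECT name, installed_version, default_version
FROM pg_available_extensions
WHERE name = 'pgaudit';
-- after installing the newer package, update the extension's objects in the database
ALTER EXTENSION pgaudit UPDATE;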
Scenario 2
Database used for PostGIS exhibits an unexplained performance slowdown that is restricted to specific geospatial data values.
SELECT ST_DistanceSphere('POINT(-150 33)',
'POINT(-120.120120 42.488888)');
is more than fifty times slower than
SELECT ST_DistanceSphere('POINT(-150 33)',
'POINT(-120.120120 42.4888881)');
The behaviour is not reproducible elsewhere - not even with the exact same PostgreSQL and PostGIS version combination.
Finally, someone manages to trigger this on a test system - but only for this exact data value.
We drag out the profilers. Only this particular number seems to cause the slowdown and the culprit according to perf seems to be... a multiplication?!
+ 62.78% 61.29% postgres libm-2.23.so [.] __mul
A quick check confirms that this does not happen on systems with different glibc versions.
It turns out that slow paths in the sine/cosine calculations of the GNU libc mathematical functions had been found and eliminated almost two years previously.
The system in question used a glibc that was four years old.
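Before chasing a bug like this, it helps to know which versions are actually in play. The glibc version has to be checked at the OS level (the perf output above already hints at it: libm-2.23.so means glibc 2.23), but as a quick sketch the database side can be inspected directly:
-- PostgreSQL server version
SELECT version();
-- PostGIS version, including the libraries it was built against
SELECT postgis_full_version();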
Give your system some love - update.
We can see that skipping library updates can trigger breakages in the real world. There are clear benefits to upgrading early and upgrading often. Even though our system appears stable, a use case will come along that will trigger a bug. Even worse, it may trigger the bug further down the dependency line, where it's harder to troubleshoot. Also, by not updating, we are exposing our users to security risks that may appear down the road (we have seen that nobody is immune to this).
Locking down our dependencies for fear of breakages does not pay off: a few months or years down the line, we will have no idea how to upgrade our system. The more time that passes, the more we'll be locked into the old version, and the harder it will be to upgrade because the world around us will have become incompatible with it.
Any potential issues caused by the upgrade are worth the risk, because they will be easier to fix by comparison. Skipping minor version updates in particular - where there is usually no functionality change and no breaking changes - is unforgivable in a modern DevOps environment. Having scheduled downtimes for updates will prolong your system's life and, in the long term, increase its uptime.
To summarise, eliminate the barriers and make it easier to update your system's dependencies often. The benefits are worth it.