SRE's review of Democracy
Day One
We've been handed this old legacy system called "Democracy". It's an emergency. The old maintainers are saying it has been misbehaving lately but they have no idea how to fix it. We've had a meeting with them to find out as much as possible about the system, but it turns out that all the original team members left the company long time ago. The current team doesn't have much understanding of the system beyond some basic operational knowledge.
We've conducted a cursory code review, focusing not so much on business logic but rather on the stuff that could possibly help us to tame it: Monitoring, reliability characteristics, feedback loops, automation already in place. Our first impression: Oh, God, is this thing complex! Second impression: The system is vaguely modular. Each module is strongly coupled with every other module though. It's an organically grown legacy system at its worst.
That being said, we've found a clue as to why the system may have worked fine for so long. There's a redundancy system called "Separation of Powers". It reminds me of the Tandem computers back from the 70s.
Day Two
We were wrong. "Separation of Powers" is not a system for redundancy. Each part of the system ("branch") has different business logic. However, each also acts as a watchdog process for the other branches. When it detects misbehavior it tries to apply corrective measures using its own business logic. Gasp!
Things are not looking good. We're still searching for monitoring.
Day Three
Hooray! We've found the monitoring! It turns out that "Election" is conducted once every four years. Each component reports its health (1 bit) to the central location. The data flow is so low that we have overlooked it until now. We are considering shortening the reporting period, but the subsystem is so deeply coupled with other subsystems that doing so could easily lead to a cascading failure.
In other news, there seems to be some redundancy after all. We've found a full-blown backup control system ("Shadow Cabinet") that is inactive at the moment, but might be able to take over in case of a major failure. We're investigating further.
Day Four
Today, we've found yet another monitoring system called "FreePress." As the name suggests it was open-sourced some time ago, but the corporate version have evolved quite a bit since then, so the documentation isn't very helpful. The bad news is that it's badly intertwined with the production system. The metrics look more or less okay as long as everything is working smoothly. However, it's unclear what will happen if things go south. It may distort the metrics or even fail entirely, leaving us with no data whatsoever at the moment of crisis.
By the way, the "Election" process may not be a monitoring system after all. I suspect it might actually be a feedback loop that triggers corrective measures in case of problems.
Day Five
The most important metric seems to be this big graph labeled "GDP". As far as we understand, it's supposed to indicate the overall health of the system. However, drilling into the code suggests that it's actually a throughput metric. If throughput goes down there's certainly a problem, but it's not clear why increasing throughput should be considered the primary health factor...
More news on the "Election" subsystem: We've found a floppy disk with the design doc, and it turns out that it's not a feedback loop after all. It's a distributed consensus algorithm (think Paxos)! The historical context is that they've used to run several control systems in parallel (for redundancy reasons maybe?) which resulted in numerous race conditions and outages. "Election" was put in place to ensure that only one control system acts as a master at any given time. The consensus algorithm is based on PTP (Peaceful Transfer of Power) protocol. The gist is that when most components are reporting being unhealthy, it is treated as a failure of the control system and the backup ("Shadow Cabinet") is automatically activated. The main control system then becomes the backup. It's unclear how it is supposed to be fixed while in backup mode, though.
Day Six
I met guys from Theocracy Inc. in a bar last night and complained about the GDP metric. They suggested using GNH ("Gross National Happiness") metric instead. I'll tell Ethan to add such a console first thing on Monday.
We've also dug into the operational practices for "Democracy". It turns out there was no postmortem culture. Outages were followed by cover-ups and blame-shifting. The most damaging consequence is that we have no clear understanding of the system's failure modes.
We now have a better understanding of the "Judiciary" branch, one of the "Separation of Powers" branches. It evaluates whether components are behaving according to pre-defined set of rules. If they are not, they are removed from production and put into suspended state ("BSD jails"). It's unclear how are they supposed to be fixed while suspended. (We've seen a similar problem with the backup control system so we might be missing something essential here.)
It's Sunday tomorrow. I am taking a day off to think about how to solve the mess. Hopefully, nothing will blow up while I'm away.
Author's note: Originally posted in 2017. Still funny. Reposting.