Let’s just say that I’m glad I wasn’t flying yesterday, or crossing a land border, for that matter. Or managing production CrowdStrike software.
All that hoo-hah makes me think about four things…
1. Config files
Software systems often come in two parts: (a) the code that does the heavy lifting, and (b) config files with settings about particulars. It’s my understanding that the recent CrowdStrike update issue was that a new config file was pushed out, and that it caused a pre-existing bug to manifest itself for the first time.
In a previous life, I fought a losing battle to include config files as part of the software certification process. After all, the prevailing view went, data files are “just data,” not code. Why should the data be subject to the software certification process? That just slows us down needlessly, and we have work to do.
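For a sense of what "certifying the data" could look like, here is a minimal sketch of a check that a config file must pass before release. Everything in it is an assumption made for illustration: the JSON format, the `REQUIRED_FIELDS` schema, and the `validate_config` name are hypothetical and are not taken from any real product.

```python
import json

# Hypothetical schema: each rule needs these fields with these types.
REQUIRED_FIELDS = {"name": str, "pattern": str, "enabled": bool}


def validate_config(text: str) -> list[str]:
    """Return a list of problems; an empty list means the file passes."""
    try:
        rules = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(rules, list):
        return ["top level must be a list of rules"]
    problems = []
    for i, rule in enumerate(rules):
        if not isinstance(rule, dict):
            problems.append(f"rule {i}: not an object")
            continue
        # Check every required field is present and has the right type.
        for field, kind in REQUIRED_FIELDS.items():
            if not isinstance(rule.get(field), kind):
                problems.append(
                    f"rule {i}: '{field}' missing or not {kind.__name__}"
                )
    return problems
```

A release process could refuse to ship any file for which this returns a non-empty list. A check like this only catches malformed data; it does not replace running the real code against the new file.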
My suspicion is that something like this was involved in the CrowdStrike failure. They had recently detected some new circumstances that they wanted their software to flag. The update to their config files evidently defined the criteria for identifying those circumstances. My conjecture (based only on what I’ve read online) is that this was a manifestation of the argument that I lost. Is that what happened?
No one would ever publicly admit something like this. So I will never know.
2. Testing
It appears that the failed CrowdStrike update was instantly deadly. If my understanding is approximately correct, once applied, the new config files triggered a bug that led to a Windows reboot from which the computer could not recover: a blue-screen-of-death. If this was as instantly fatal as it seems to have been, and if this blue-screen-of-death reboot failure occurred on every machine once the update was applied, how was this not caught in testing?
Either (a) the testing was skipped, or (b) the testing environment did not realistically mimic the production environment. I strongly doubt the testing was skipped entirely. I suspect the latter.
No one would ever publicly admit something like this. So I will never know.
3. Incremental Rollout
When you have massive infrastructure running the same codebase on a single hardware base, you don’t update it all at the same time. This is kindergarten stuff. You roll the updates out slowly and see if things are ok. That way, if something goes wrong, it doesn’t crater your entire enterprise.
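The idea of rolling out slowly can be sketched in a few lines. This is only an illustration under stated assumptions: `apply_update` and `is_healthy` are hypothetical callbacks supplied by the caller, and the wave sizes (1%, 10%, 50%, 100%) are arbitrary numbers picked for the example.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    stage_fractions: Sequence[float] = (0.01, 0.10, 0.50, 1.0),
) -> list[str]:
    """Update hosts in growing waves; halt after the first unhealthy wave.

    Returns the hosts that were updated before the rollout finished
    or was halted.
    """
    updated: list[str] = []
    for fraction in stage_fractions:
        # Cumulative target for this wave; always at least one host.
        target = max(1, int(len(hosts) * fraction))
        wave = hosts[len(updated):target]
        for host in wave:
            apply_update(host)
            updated.append(host)
        # Check the wave before widening the blast radius.
        if not all(is_healthy(h) for h in wave):
            return updated  # halted early
    return updated
```

With 100 machines and an update that breaks every one of them, this sketch stops after the first wave, so one machine is down instead of all 100.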
I’ve never been the guy responsible for this kind of update process. I fully understand that my perspective on shoulda and coulda necessarily doesn’t include the full story. But I can’t imagine staring into the abyss every time I mash the “update” button without some kind of reassurance that if I screw up, I will be able to stop the process before things spiral out of control.
In the case of CrowdStrike, things definitely spiraled out of control. So did they really apply an across-the-board update to all of the production Windows machines of all of their customers in all geographies at the same time?
No one would ever publicly admit something like this. So I will never know.
4. Rollback Plans
Any mature software organization writes explicit plans that describe all the steps and all the contingencies involved in making changes to their production software, and these include “rollback plans” on how to un-update the changes if things go wrong. This sounds easier than it is, but it’s a thing. Thinking thru worst case scenarios really is part of the IT job. It’s not a luxury, because … well, because of what happened yesterday.
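As a rough illustration, the simplest automated form of a rollback for a single file might look like the sketch below. It is an assumption-laden toy, not a description of any vendor's process: `update_with_rollback` and the `is_healthy` callback are hypothetical names, and the `.bak` naming is just a convention chosen here.

```python
import shutil
from pathlib import Path
from typing import Callable


def update_with_rollback(
    live: Path, candidate: Path, is_healthy: Callable[[], bool]
) -> bool:
    """Install candidate over live; restore the old file if the check fails.

    Returns True if the update was kept, False if it was rolled back.
    """
    backup = live.with_name(live.name + ".bak")
    shutil.copy2(live, backup)      # keep the last known-good copy
    shutil.copy2(candidate, live)   # apply the update
    if is_healthy():
        return True
    shutil.copy2(backup, live)      # roll back to the known-good copy
    return False
```

Note the catch, which is why this sounds easier than it is: the rollback step only runs if the machine is still alive enough to run it. If the update takes the machine down before the health check can execute, a plan like this never gets the chance to fire, and the plan has to cover that case some other way.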
My suspicion is that there was no rollback plan for these config file updates. Or if there was a rollback plan, no one thought it thru sufficiently well to realize that it would involve an admin physically logging into each affected box. Did they write a rollback plan? If so, was it triggered? If so, why did it not work?
No one would ever publicly admit something like this. So I will never know.
5. I Will Never Know
I will never know the answers to these root cause problems. The best I’ll get is hand-waving, imprecise language, and maybe some credible-sounding proximate causes.
But that’s ok. I am just a math teacher teaching functions and equations. And I don’t need to know. I will sleep well, anyway.