The Microsoft/CrowdStrike outages: time for a rethink
Were you affected by the recent cyber outage?
There’s mayhem you can predict and plan for, and there is mayhem that nobody sees coming.
Without a great deal of thought, care and testing of computer systems, they can go down very easily, expensively and sometimes spectacularly. People who programmed in Basic in the early days will know how easy it is for a program to get stuck in a loop. An early academic program designed to model weight gain in anorexics failed to put an upper limit on how much somebody could weigh. A single missing full stop (period) could render a COBOL program a piece of junk. One program I know of failed in its fourth year because it contained a table with the number of days since the beginning of each month, and there was an error in the table for one month in the part of the table for leap years.
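A table like that last one is easy to check against values computed from the calendar rather than trusting it as typed. Here is a minimal sketch in Python, assuming the table held the cumulative days before the first of each month. The table contents, the names and the deliberate slip in the leap-year row are all hypothetical, not taken from the original program.

```python
import calendar

# Hand-typed table: days elapsed before the 1st of each month.
# Hypothetical data. The leap-year row (True) has a deliberate slip
# for September: 243 where it should be 244.
DAYS_BEFORE_MONTH = {
    False: [0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334],
    True:  [0, 31, 60, 91, 121, 152, 182, 213, 243, 274, 305, 335],
}


def expected_days_before_month(leap):
    """Build the same row from the calendar instead of by hand."""
    year = 2024 if leap else 2023  # any leap / non-leap year would do
    totals, running = [], 0
    for month in range(1, 13):
        totals.append(running)
        running += calendar.monthrange(year, month)[1]  # days in that month
    return totals


def find_table_errors(table):
    """Return (is_leap_row, month, typed_value, correct_value) for each bad entry."""
    errors = []
    for leap, row in table.items():
        expected = expected_days_before_month(leap)
        for month, (got, want) in enumerate(zip(row, expected), start=1):
            if got != want:
                errors.append((leap, month, got, want))
    return errors


print(find_table_errors(DAYS_BEFORE_MONTH))
```

A check like this exercises the leap-year row on day one, instead of leaving it to be discovered the first time a leap year comes round.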
Y2K was predicted and a great deal of work went into preventing any system crashing as a result. There were also manufacturer-specific limits to be fixed at the same time (the size of the field holding the number of days since the year dot varied between manufacturers, as did the definition of when the year dot was). I worked in IT then, so I know just how much work was involved (and I got a lot of overtime out of it, thank you very much). Don’t let anyone tell you it was a false alarm.
I spent 25 years in IT, and I know that computer systems have become a lot more complex since then. There have been three trends which make us more vulnerable when they fail:
- The first is the growth of oligopolies and monopolies, such as Microsoft’s domination of the market, as in the latest outage.
- The second is the increasing use of third-party software. Every interface between two different systems is a weak point, as NASA found out to its $125 million cost back in the 1990s. I was responsible for testing the merger of the various computer systems used by Rover small cars, large cars, MG and Jaguar, and it was a nightmare to untangle the differences and get them all talking to the same system. And when you have a lot of different systems talking to each other, you have an increased likelihood of one of them introducing an error or malware.
- The third is the continuous updating of systems. Some updates are for good reasons, some are necessary for security against hackers or to fix serious errors (better design and testing would reduce this), and some are so that companies can demand more money from customers. A system that works is seldom left alone for long, and each new version introduces new vulnerabilities and the necessity for other systems to change too.
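To illustrate the second point: one way to harden an interface is to make every value carry its unit across the boundary, so that a mismatch is converted or rejected rather than silently misread. NASA's loss is widely attributed to one system supplying pound-force seconds where another expected newton-seconds. The sketch below is only an illustration in Python; the `Impulse` type, the message layout and the function names are hypothetical.

```python
from dataclasses import dataclass

# Standard conversion: 1 pound-force second = 4.4482216152605 newton-seconds.
LBF_S_TO_N_S = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """A number that cannot be separated from its unit (hypothetical type)."""
    value: float
    unit: str  # "N*s" or "lbf*s"

    def to_newton_seconds(self):
        if self.unit == "N*s":
            return self.value
        if self.unit == "lbf*s":
            return self.value * LBF_S_TO_N_S
        raise ValueError(f"unknown impulse unit: {self.unit!r}")


def receive_impulse(message):
    """Receiving side of the interface: refuse numbers that arrive without a unit."""
    if "unit" not in message:
        raise ValueError("impulse sent without a unit")
    return Impulse(float(message["value"]), message["unit"]).to_newton_seconds()


# The sender states its unit; the receiver converts instead of assuming.
print(receive_impulse({"value": 1.0, "unit": "lbf*s"}))
```

The design choice is that a bare number never crosses the interface, so a disagreement between two teams shows up as an error in testing rather than as a wrong answer in production.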
CrowdStrike IT outage
The CrowdStrike IT outage disabled around 8.5 million Microsoft Windows devices. No doubt CrowdStrike were pleased with how many customers they had. But was anyone noticing how many systems their software was involved in and worrying about how vulnerable that made us? What other third-party software is out there that is so widely spread? How many Microsofts are there dominating different sectors?
In-house software may seem expensive in the short run, but an outage caused by an outside software provider can be very expensive, and do a lot of damage. There can also be a lot of work keeping systems in sync. This outage has affected many sectors. And since most computing work has been outsourced, there are not so many people able to apply fixes. In this case every single Windows computer affected has to be manually fixed by somebody who knows what they are doing. No wonder it is going to take a long time to fix.
Does it make sense that one piece of software can cause such carnage? The outage has hit:
- flights
- health services
- banking systems
- access to Microsoft services
- trains
- businesses
- emergency services, including 911 (999 and 111 were apparently not directly affected, but there was a big surge in calls)
- many online applications, such as tickets or payments
- hotels, hospitals, manufacturing, TV channels, stock markets, broadcasting, governmental services and more
It has also created increased opportunities for spammers, and maybe hackers too.
Only hours before the CrowdStrike IT outage, Microsoft had its own major outage with its Azure system.
Ironically, back in May CrowdStrike asked the question, “CrowdStrike vs. Microsoft: Microsoft’s security products can’t even protect Microsoft. How can they protect you?” (archive).
A certain amount of redundancy is required to give systems resiliency. How much depends on how critical the system is.
Unchecked capitalism leads to intense pressure to gain market share and maximise profits, and careful thought and testing tend to fall by the wayside.
Another industry which is too oligopolistic is the food industry, as can be seen by the number of brands cited in some food recalls.
It’s time to reconsider
Some countries and states have developed their own electric grids, for example, but with links and treaties with others. They have (or should have) protections in place to prevent an outage in one cascading to outages in others as happened in the Northeast blackout of 2003. Being isolated brings its own problems, as Texas found in 2021.
I’m not calling for everything to be done in house and small scale. That would be hideously expensive and inefficient. But clearly the pendulum has moved too far in the opposite direction, and this outage shows just how fragile the systems we rely on are, which will only get worse as patch is applied to update is applied to upgrade. Software vendors don’t seem to have heard of ‘if it ain’t broke, don’t fix it’. The most successful companies might hate it, but scaling back would increase our resiliency and national security. It would also make more jobs available, which would add to the growth that economists and politicians crave. I’d much rather more money went to workers who will spend it in the economy than to the rich who may hoard it. It would reduce the cost of unemployment benefits too.