Microsoft outages: The implications of downtime on the delivery of critical public services


It quickly became clear the problem was not an issue with Microsoft’s Azure service, as it first appeared, but an issue with a single software provider – named CrowdStrike – who released a faulty update to their software, which was then distributed rapidly around the world via the Azure global networks.

As reported by Computer Weekly, that “bad patch” was available online for 78 minutes, and in that time was distributed to 8.5 million Microsoft machines that got locked into a boot cycle and became unusable.

Once it became clear the source of the problems was not an organised cyber-attack from persons unknown, things settled into resolution mode.

The impact on affected businesses and the general public was in some cases major, but – when it comes to hyperscaler outages – the world has a short memory, and things quickly fell back into “business as usual” mode.

Not another outage

Except, on 30 July 2024, Microsoft’s cloud services suffered another outage, affecting businesses globally and – again – without any warning.

This outage, however, was nothing like the CrowdStrike debacle in terms of cause, impact, or even implication.

What this latest outage demonstrates is that we have one single problem: our level of reliance on cloud services which might not be all that reliable.

But first we need to dig a bit deeper into why these two outages were not the same.

IT security folks try to determine and manage risks to data and IT systems and in doing so tend to consider three key characteristics: confidentiality, integrity and availability.

Maintaining these characteristics and keeping them within defined and acceptable ranges is what cyber-security is all about.

It is impractical in nearly every case to maintain perfect equilibrium of confidentiality, integrity and availability. And, in any event, different organisations need different blends of these three things to function optimally.

It is common for IT security folks to focus on confidentiality as the biggest concern, and indeed the UK Government Security Classification Scheme is principally about assigning classifications to data confidentiality. But, in some cases, confidentiality is the least important factor, whilst integrity and availability are of very high importance.

Think of the fire brigade, as an example. When a fire is reported, the fire’s location needs to be as accurate as possible, and the firefighters on the ground need to communicate as accurately as possible to ensure they get the resources needed to fight the fire.

In this example, integrity and availability are high priorities, but keeping the fire a secret is unlikely to be.

What we do need, if IT security is to be achieved, is all of those three things in some form. And when the balance is not right, that’s a problem.

Outage versus breach

The media use two different words to describe these problems, depending on the characteristic that is compromised. A loss of confidentiality is usually referred to as a breach, while a loss of integrity or availability is often called an outage.

These describe the visible effects of the compromise, but not always the cause of the problem. And that’s why the two reports of Microsoft outages in a little over a week need to be taken separately.

They might look the same to the public’s eye and might be referred to in the same way in the press – but they’re different things and understanding that is both important and necessary for lessons to be learned from each.

The CrowdStrike incident was a loss of integrity of a single file in its software, which resulted in a loss of overall service availability.

The 30 July incident does not appear to be the same at all. And whilst it was shorter-lived, at just a couple of hours, after which most services came back online largely unscathed, it might actually be a lot more serious in nature.

The latest ‘outage’ was a general and widespread loss of availability of Microsoft networking services for its global Azure service, reportedly caused by a “usage spike”, which could be a Microsoft euphemism for a denial-of-service (DoS) attack by an unknown bad actor.

A DoS attack occurs when a (usually malicious) user consumes all of the available service resources and leaves nothing for anyone else.

For as long as the attacker retains those resources, the service will remain unavailable to its legitimate users. And during that time the affected business or user will typically be unable to operate or function.
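The mechanics of resource exhaustion can be shown with a toy model. What follows is a minimal sketch, not a description of how Azure or any real service works: the `ConnectionPool` class, its capacity of five slots, and the client names are all hypothetical and exist only to illustrate the principle.

```python
from dataclasses import dataclass, field


@dataclass
class ConnectionPool:
    """A toy service with a fixed number of connection slots (hypothetical)."""

    capacity: int
    holders: list = field(default_factory=list)

    def acquire(self, client: str) -> bool:
        """Grant a slot if one is free; otherwise refuse the request."""
        if len(self.holders) >= self.capacity:
            return False
        self.holders.append(client)
        return True

    def release_all(self, client: str) -> None:
        """Free every slot held by the given client."""
        self.holders = [h for h in self.holders if h != client]


def simulate() -> dict:
    pool = ConnectionPool(capacity=5)
    # The attacker asks for ten slots and keeps every one it is granted.
    attacker_slots = sum(pool.acquire("attacker") for _ in range(10))
    # A legitimate user is refused while the attacker holds the slots.
    during_attack = pool.acquire("control-room")
    # Only once the attacker's slots are freed does service resume.
    pool.release_all("attacker")
    after_attack = pool.acquire("control-room")
    return {
        "attacker_slots": attacker_slots,
        "during_attack": during_attack,
        "after_attack": after_attack,
    }


if __name__ == "__main__":
    print(simulate())
```

In this model the legitimate caller is locked out for exactly as long as the attacker holds the slots. Real countermeasures, such as rate limiting and traffic filtering, aim to stop any single source from holding that much capacity in the first place.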

Denial-of-service attacks are major threats that can result in serious financial and threat-to-life situations, and a lot of money and resource is put into preventing them, which, to be fair, Microsoft is usually pretty good at.

This time, however, it looks like something went wrong, and that might be a failure of the security countermeasure to stop these attacks.

Or it might simply be that the bad guys found a way to throw more resources into the attack.

Timing is everything

The attack’s timing could not have been worse for Microsoft, coming as it did on the day the company reported its earnings to investors.

That lends further credibility to the suggestions that this was a directed attack, not an accidental error or poor admin practice.

Microsoft had a bad day, but will no doubt put it behind them quickly enough and revert to business as usual. Most likely many of its users will too.

The issue of course is that IT systems do fail, and they fail more than many of us like to admit. For blue light responders, such failures literally are a matter of the public’s life and death, and a lot of thought has gone into the creation of resilient IT systems across those groups and organisations we rely upon for our safety.

For about 20 years that was my day job – I worked on architecting, building and assuring these services so that when everything around them fell over during a time of crisis, they still functioned.

Up to a couple of years ago this was handled through investments in national systems and dedicated police and other 999 service networks which operated under special commercial terms from a specific pool of approved UK suppliers experienced in the provision of ‘never fail’ IT.

In addition, individual forces and services operated under a mechanism of mutual aid – whereby each police force, ambulance trust, or fire service had relationships with their neighbouring opposite numbers to ensure that if their own systems went down someone else would pick up the slack immediately and with little or no service degradation at all.

This also worked in cases where the local incident was so serious that a local responder had to commit all of its resources to handling that incident and needed to send calls for help elsewhere. There was even a series of systems that managed these circumstances, the National Mutual Aid Telephony (NMAT) and the Casualty Bureau (CasWeb) being two examples.

Those systems were designed with failure in mind, and to ensure that when systems failed, someone would still pick up the phone and be in a viable position to respond to the emergency.
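The principle behind mutual aid can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a representation of how NMAT or any real control room system routes calls: the `route_call` function, the `NEIGHBOURS` pairings, and the force names are all hypothetical.

```python
from typing import Optional


def route_call(
    home: str,
    neighbours: dict[str, list[str]],
    available: dict[str, bool],
) -> Optional[str]:
    """Pick who answers a call for `home`: the home service if it is up,
    otherwise the first available neighbour in priority order, else None."""
    if available.get(home, False):
        return home
    for partner in neighbours.get(home, []):
        if available.get(partner, False):
            return partner
    return None


# Hypothetical services and pairings, for illustration only.
NEIGHBOURS = {"Force A": ["Force B", "Force C"]}

# Home service is up, so it answers its own calls.
print(route_call("Force A", NEIGHBOURS, {"Force A": True, "Force B": True}))
# Home and first neighbour are down, so the second neighbour picks up.
print(route_call("Force A", NEIGHBOURS, {"Force A": False, "Force B": False, "Force C": True}))
# Everyone is down at once, so nobody answers.
print(route_call("Force A", NEIGHBOURS, {"Force A": False}))
```

The fallback only helps if the neighbours do not all fail at the same moment. If every service in the list depends on the same underlying platform, the last case stops being a remote possibility.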

At this point I am not saying that our national capability to do this has been fully degraded – and those responsible for it today will certainly argue that it has not.

What we cannot escape is the fact that over the past five years policing (and fire and ambulance, along with other critical sectors) have been shovelling services into the hyperscale clouds of Amazon Web Services (AWS) and Microsoft with little obvious regard for the delivery of critical responder capability if those services go down.

Rather than consider the possibility of those systems failing, the decision makers have chosen to assume they will stay available under all circumstances, even though they are commodity products consumed by the general public and have no special terms or prioritisation.

This has inevitably introduced risks into our national resilience that we have never faced before.

The use of Microsoft cloud for hosting critical and public safety services is mainly down to our blue light and critical national infrastructure IT leaders not reading the fine print of Microsoft’s Universal Licence Terms for its online services, and its acceptable use policy.

Those very clearly state that Microsoft online services, of which Azure and M365 are part, are not designed for ‘high-risk use’ and should not be used for it.

“Neither customer, nor those that access an online service through customer, may use an online service in any application or situation where failure of the online service could lead to the death or serious bodily injury of any person, or to severe physical or environmental damage, except in accordance with the high-risk use section below,” its terms state.

The referred to high-risk use section goes on to state: “The online services are not designed or intended to support any use in which a service interruption, defect, error, or other failure of an online service could result in the death or serious bodily injury of any person or in physical or environmental damage.”

The senior leaders who chose to use these services either failed to do their due diligence or chose to accept risks that their predecessors never would and which might even fail to meet their obligations under legislation.

This work was sanctioned at the highest level, being funded largely by the Home Office and facilitated by its programmes and the Police Digital Service, with the support of the National Police Chiefs’ Council and the Police and Crime Commissioner.

The adoption of new public cloud services brought much-needed commodity-based capabilities for the streamlining and modernisation of police data handling.

However, in addition to the legal issues previously covered in depth by Computer Weekly, they might also have exposed the UK to critical public safety risks that were not properly taken into account.

Microsoft does not fully escape accountability here – even with its responsibility-limiting acceptable use policy (AUP) clauses.

Given the company’s direct relationships with the Police Digital Service and key forces, it is clear the company knows its AUP is being breached, and may have played a part in police users doing so.

We often talk about eggs and baskets as a metaphor for exposing ourselves to critical safety risks, but there is growing evidence that in the UK we might have already done that – or at least stand on the cusp of doing so.

Two forces (Met Police, and North Wales Police) have announced in recent years that they plan to move their control room services onto Azure Public Cloud, and I’ve examined the wisdom or otherwise of that in the past.

What is clear is that whoever is now responsible for initiatives like these within our new government – and indeed for the wider general adoption of public cloud by UK Critical National Services – needs to take full notice of the problems Microsoft’s systems had on 30 July 2024.

In all key respects, if core UK services did not get hit yesterday, then that means another bullet dodged.

This time around, however, there are some indications that this one might have been fired by a malicious actor, and if so – for the first time – it needs to be considered that Microsoft’s previously assumed ‘always-up’ cloud service might be just as vulnerable to availability outages as it has previously shown itself to be to integrity and confidentiality compromises.

The bullet dodged this time may well have come from an attacker that has just found a DoS machine gun they can let loose at Azure whenever they like.

I am certain that in the US senior Microsoft leaders will be brought into US government committees over the coming days to explain the circumstances of this global incident.

I’m equally sure that under the previous administration the UK would not have done likewise.

I hope this new government are wiser than that and realise that just like the unfolding prison overcrowding and financial status issues they claim to have uncovered on taking office, we face another possible crisis in public cloud for critical services.

Microsoft ought to be brought into a UK parliamentary or other public oversight committee as soon as practicable to explain all the things covered in the US to the new government and to the UK public.

This does not have to be a bloodletting or public-shaming exercise – it’s a lessons learned opportunity, from which we might choose to pick a different pathway for our CNI service providers.

If afterwards the UK government do not do so, then that’s ok because it will be a risk-informed decision for which the new government will have taken on the mantle of responsibility.

Today they face the greater political risk of being left holding the parcel when the music stops, and then being accountable for the failures of the previous government that they simply chose not to examine or fix, which might be worse.

Either way the loser in such a situation is the UK public, who rely on services that must not fail, but which increasingly sit on platforms unsuitable for critical service delivery.


