By Michael Gibbs

Modern cloud computing is the next evolution of the network and the datacenter, providing a faster and more agile platform for digital transformation. The cloud is really just renting someone else’s network and datacenter. So, what makes the cloud so special? It’s the agility of the cloud: the ability to purchase what’s needed, when it’s needed, and to adapt in real time. Cloud computing gives organizations better, faster, and cheaper technology platforms, which facilitates digital transformation.

There are many excellent cloud providers, although most of the cloud market share is held by AWS, Azure, and Google, which are run by some of the best network and datacenter operators in the world. After designing high-performance networks, datacenters, and clouds for two decades, I have learned one lesson: all systems will fail. Failure is simply inevitable. What determines how the business operates during a failure of a network, datacenter, or even cloud provider is how you plan for it. Will the business operate normally, or will it remain at a standstill until the problem is resolved?

Failures provide an opportunity to fix your systems so the same problems won’t happen again. Doing so takes honest, transparent information, so you know how to architect around a problem such as an inevitable cloud provider outage. In some cases, the cloud providers were clear about what happened. In other cases, the explanations were not so transparent, and to most experts in the field they sound extremely unlikely.

1. Power Failures 

On December 22, 2021, AWS had an outage that it attributed to a power failure. We have seen power outages at small businesses that have more of a server closet than a datacenter, but they are extremely uncommon in properly designed datacenters.

Most well-designed data centers have multiple power lines coming in with redundant transformers and generators. Some even have batteries backing up their generators, so power failures are very rare in properly designed datacenters. With that said, anything can happen, which is why you should plan for known and unknowable failures when designing high-performance and high-availability systems.
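A rough back-of-the-envelope calculation shows why redundancy makes power failures so rare. The sketch below assumes every power source fails independently of the others, which real facilities only approximate, and the availability figures are hypothetical placeholders, not measurements from any provider.

```python
from math import prod


def combined_availability(availabilities):
    """Availability of redundant (parallel) components.

    Assumes failures are independent: the system is down only when
    every component is down at the same time.
    """
    return 1.0 - prod(1.0 - a for a in availabilities)


# Hypothetical figures, for illustration only.
utility_feed = 0.999
generator = 0.99

# Two utility feeds plus one generator, all treated as independent.
total = combined_availability([utility_feed, utility_feed, generator])
print(f"{total:.8f}")
```

The independence assumption is the weak point: anything the redundant components share can take all of them down together.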

Separate systems manage the power, cooling, and other critical datacenter functions. If a network problem, application problem, or hacking event disrupted those management systems, that could also cause a datacenter power failure.

2. Network Problems

The network is the foundation of all cloud-based services, as the network enables communication between systems. If there are significant disruptions in the network, these problems can be catastrophic. In fact, if the network outage is significant enough, the network can cause the entire cloud to fail.

We saw multiple network outages in 2021. For instance, on December 7, AWS reported a network outage that affected systems all over the world. The outage was not confined to a single data center (i.e., availability zone) or region. Likewise, on November 16, Google Cloud had a substantial outage, which Google attributed to a network misconfiguration. Misconfiguration is a common cause of network outages, because it can affect an organization’s ability to send traffic between systems.

Common causes of network outages include:

  • Routing problems (e.g., Open Shortest Path First, Border Gateway Protocol, Resource Reservation Protocol, Multiprotocol Label Switching)
  • Switching problems (e.g., spanning tree, rapid spanning tree, virtual LAN mismatch)
  • Overly restrictive firewalls, access lists, or security policy
  • IP addressing problems/conflicts
  • Routing software bugs
  • Misconfiguration
  • Faulty hardware (routers, switches, cables, wide area network connections)
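Some of these causes can be caught before they reach production. As one example, IP addressing conflicts between networks can be detected with a simple check of the address plan. The following is a minimal sketch using Python's standard `ipaddress` module; the function name `find_overlaps` and the address ranges are hypothetical, chosen only for illustration.

```python
import ipaddress
from itertools import combinations


def find_overlaps(cidrs):
    """Return pairs of CIDR blocks whose address ranges overlap."""
    networks = [ipaddress.ip_network(c) for c in cidrs]
    return [
        (str(a), str(b))
        for a, b in combinations(networks, 2)
        if a.overlaps(b)
    ]


# Hypothetical address plan: one large range and two smaller networks.
plan = ["10.0.0.0/16", "10.0.128.0/20", "192.168.1.0/24"]

for a, b in find_overlaps(plan):
    print(f"conflict: {a} overlaps {b}")
```

A check like this could run whenever a new network is proposed, so that overlapping ranges are rejected before any traffic depends on them.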

3. Hacking Events 

Security events occur every day in our world. We consider the cloud to be slightly less secure than the traditional network and datacenter. All the same attack vectors can be used against both, but the cloud adds one new risk: It is the highest-value target for malicious hackers in the world. Successfully hacking the cloud can grant access to the information of virtually every customer on the cloud, making it a jackpot of digitally stored private data.

Azure’s cloud had two widely publicized hacking incidents in 2021. On April 1, Azure experienced a targeted attack on its DNS servers, and a code defect limited Azure’s ability to handle the increased volume of DNS requests. Additionally, in the last week of August 2021, Azure successfully thwarted one of the biggest distributed denial-of-service (DDoS) attacks in history, an attack that peaked at 2.4 terabits per second. Thwarting an attack of that scale is beyond impressive, but in cybersecurity, we know that another malicious event is bound to happen. What we don’t know is which cloud provider will be affected next, when, or to what extent.

4. Application Failures/Control Plane Failures

When we look at the cloud, it’s little more than datacenters interconnected over a high-speed network. What makes the cloud different is its control plane, which manages the environment: the virtual machines, containers, storage, networking functions, management panels, and the many other components that orchestrate critical cloud functions. Disruptions to those functions can be minor, or they can take down the entire cloud and all of the cloud provider’s customers simultaneously.

Control plane problems are not common, but they do happen. And when they do, they can cause widespread system outages. For example, on October 4, 2021, Facebook made a BGP configuration mistake. BGP is essentially the control plane for all networking between internet service providers and their customers. Facebook’s entire global network—one created and managed by some of the best network engineers and architects in the world—was affected by one simple misconfiguration. 

Technology is not infallible. It is bound to fail eventually. As this case shows us, even the best organizations have failures, which means we must be able to plan around them. So, what can we do about cloud failures?

The cloud providers are some of the best network and datacenter operators in the world. However, all technology is bound to break at some point, and cloud architects need to maneuver around those eventual cloud failures and outages.  

Thankfully, the solution is quite simple: diversify your systems (in the same way an investor might diversify their portfolio) by using multiple cloud providers and/or a hybrid cloud. There is truth to the old adage, “Don’t put all your eggs in one basket.” Organizations, and many human lives around the world, depend on the resilience of cloud technology every day. Using a diversified multi-cloud or hybrid cloud is the best (and currently, the only) high-availability solution to persistent cloud outages and failures.
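In practice, diversification only helps if traffic can move to a healthy environment when one fails. The following is a minimal sketch of that selection logic, not a production failover system. The environment names and probe functions are hypothetical, and each health check is assumed to be a simple function that returns True when its environment is usable.

```python
from typing import Callable, Optional, Sequence, Tuple

# A health check is any zero-argument callable that returns True when
# the environment is usable. In practice it might wrap an HTTP probe.
HealthCheck = Callable[[], bool]


def pick_environment(
    candidates: Sequence[Tuple[str, HealthCheck]],
) -> Optional[str]:
    """Return the first healthy environment in priority order, else None."""
    for name, is_healthy in candidates:
        try:
            if is_healthy():
                return name
        except Exception:
            # A probe that raises is treated the same as an unhealthy one.
            continue
    return None


def failing_probe() -> bool:
    """Stand-in for a probe against an environment that is down."""
    raise TimeoutError("probe timed out")


# Hypothetical setup: the primary cloud is down, the others are up.
environments = [
    ("primary-cloud", failing_probe),
    ("secondary-cloud", lambda: True),
    ("on-premises", lambda: True),
]

print(pick_environment(environments))
```

Keeping the health checks as plain callables means the same selection logic can sit in front of any mix of cloud providers and on-premises systems.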

Michael Gibbs is CEO and founder of Go Cloud Architects. Questions and comments can be directed to 24×7 Magazine chief editor Keri Forsythe-Stephens.