You may be hearing promises of 100% server uptime, but what is the reality?
Here at ColossusCloud, we like to educate our clients on the subject of server downtime so that they can be as well-prepared as possible to handle it. Because server downtime is not a question of if, but when.
Despite what some server solution providers promise, server downtime is an inevitable part of operating online. Someday, even for a few minutes, your server will go down, and you must be ready to mitigate the impact on you and your business.
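To put "a few minutes" in perspective, it can help to convert an advertised uptime percentage into the amount of downtime it still allows per year. The snippet below is a minimal sketch of that arithmetic, not a description of how any provider measures uptime. The function name `downtime_minutes_per_year` and the sample percentages are our own illustrative choices, and the calculation assumes a 365-day year.

```python
# Illustrative sketch: convert an uptime percentage into allowed downtime.
# Assumes a 365-day year; the function name and sample figures are hypothetical.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(uptime_percent: float) -> float:
    """Return the minutes of downtime per year implied by an uptime percentage."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (100 - uptime_percent) / 100


if __name__ == "__main__":
    # Print the yearly downtime budget for a few commonly quoted uptime levels.
    for pct in (99.0, 99.9, 99.99, 100.0):
        minutes = downtime_minutes_per_year(pct)
        print(f"{pct}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

Only the 100% figure leaves no room for downtime at all, which is why it is the one promise that a single cut cable or tripped breaker will break.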
What causes server downtime? Well, servers operate as part of a large physical infrastructure, which, like anything in the real world, is exposed to the risk of accidents and malfunctions. By taking the right preventative measures, the risk of anything going wrong can be minimized, but never removed. Are cloud service providers promising you 100% uptime? Don’t believe them; it’s not possible. To illustrate the point, here are some examples of major events that resulted in minutes or hours of server downtime and were very difficult or impossible to prevent.
Let’s say someone is out on the street digging a hole for utility work, and suddenly they hit a fiber optic line and slice it in half. Although data centers have multiple lines running in different directions, it will take at least a few minutes to reroute the data from the broken line to a different one, and during this time, some servers will experience downtime.
In the US, on average, there are 17,000 accidental fiber optic line cuts every year, and many more that are intentional acts of vandalism.
Natural disasters pose a risk to infrastructure located in less stable regions of the world. Hurricane Florence, which struck Virginia in 2018, knocked out power at a facility that housed Amazon’s AWS Direct Connect fiber optic lines. All the data centers in the region suddenly lost access to Amazon’s compute nodes, and the result was thousands of major websites going offline.
In another event, Hurricane Sandy flooded New York City streets, putting telecom basements under water and cutting off internet access to the data centers in the area. One data center even lost power completely because its power system was located below ground and was totally flooded.
Human error is another risk that cannot be eliminated. In February 2017, one such case caused Amazon’s S3 service in the eastern US (the US-EAST-1 region in Northern Virginia) to go offline for several hours, affecting thousands of major websites.
In another event, Microsoft’s Azure cloud service went offline for several hours when technicians performing maintenance on fire suppression systems accidentally triggered them, causing a power shutdown.
And in April 2016, all of Google’s data centers around the world went offline for 21 minutes, due to human error that affected all their network routing equipment on a global scale.
And that’s not all: acts of terrorism pose a significant threat too. In the tragic September 11, 2001 attacks on the World Trade Center buildings, multiple data centers were destroyed. Many clients using these data centers were keeping backups across the street in another building, which was also destroyed, and all their data was lost.
What does this tell us? It tells us that we must be prepared for servers going offline, even if only for a few seconds or a minute while preventative maintenance is carried out on server equipment, something that we have comprehensive plans and procedures for at ColossusCloud. It might also happen ten years after you deploy your server operating system, when the vendor no longer supports it and you need to replace it.
The good news is that when you know that server downtime happens to everyone, you can make preparations that put your business in the best position to get through it all with the minimum impact on you and your clients.
The important question is, what will you do today to prepare your business for server downtime?