Can a Delta Outage Happen to You?

By Robin Schumacher

I almost always fly Delta because it offers the best travel options from where I live. I landed in San Jose late on Sunday for some training our team is running, getting in under the wire just before Delta suffered a power outage at its Atlanta headquarters that brought its global business to a halt.

The downtime that originated in their Atlanta location affected every site in which Delta does business around the world. Such an event puts a very real and sobering face on the term “single point of failure”.

[Image: Airport terminal on the day of the Delta outage]

In my first two years at DataStax, a question that I used to routinely field was something along the lines of “Who really needs a database like yours? I get that Google, Facebook, and Amazon need that kind of capability, but who else?”

How about a 92-year-old airline like Delta?

When we talk about cloud applications here at DataStax, we don’t only refer to apps that run on Amazon, Azure, etc., or just run on your mobile device. We’re talking about applications that need to run everywhere, 24x7x365, and deliver intelligent interactions with consistent response times.

In short, it’s about delivering your data to you, anywhere, anytime, all the time for the best customer experience.

While some may think that unplanned downtime is inevitable, I can tell you that we have customers running our database that haven’t experienced a single outage in the five years I’ve been at DataStax. Stop for a moment and let that fact settle in. Five (or more) years – ZERO downtime.

This is why words like “indestructible” have been used to describe Apache Cassandra™ and DataStax Enterprise. The platform’s masterless architecture and gold-standard replication draw on work done at Facebook, Google, and Amazon, and the capabilities DataStax continues to add give this NoSQL database platform redundancy in both data and compute resources, so that unplanned downtime never occurs.
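To make the masterless idea concrete, here is a minimal toy sketch in Python. It is not Cassandra or DataStax Enterprise code, and the `Site` and `MultiHomedStore` classes, the site names, and the keys are all hypothetical, chosen only for illustration. It assumes every site holds a full copy of the data and that any live site can answer a request, so losing one site does not stop reads or writes.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """One hypothetical data center holding a full copy of the data."""
    name: str
    up: bool = True
    data: dict = field(default_factory=dict)


class MultiHomedStore:
    """Toy masterless store: every live site accepts reads and writes."""

    def __init__(self, sites):
        self.sites = sites

    def write(self, key, value):
        # Replicate the write to every site that is currently reachable.
        live = [s for s in self.sites if s.up]
        if not live:
            raise RuntimeError("no site available")
        for site in live:
            site.data[key] = value
        return len(live)  # number of sites that accepted the write

    def read(self, key):
        # Any live site holding the key can answer; there is no primary
        # whose loss would force a failover.
        for site in self.sites:
            if site.up and key in site.data:
                return site.data[key]
        raise KeyError(key)


# Hypothetical site names, for illustration only.
sites = [Site("atlanta"), Site("san-jose"), Site("london")]
store = MultiHomedStore(sites)

store.write("flight-123", "on time")
sites[0].up = False                       # simulate a power loss at one site
status = store.read("flight-123")         # still served by a surviving site
acks = store.write("flight-456", "boarding")  # accepted by the two live sites
```

The sketch deliberately leaves out what a production system must also handle, such as bringing a recovered site back up to date and choosing how many replicas must acknowledge each request.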

To be sure, there’s more to having a system exhibit constant uptime than ensuring the data layer is always available, but it is a critical part of the equation. In a 2015 tech paper, Google calls such a design a “multi-homed” system and carefully distinguishes it from legacy failover-based approaches:

Failover-based approaches, however, do not truly achieve high availability, and can have excessive cost due to the deployment of standby resources.

Our teams have had several bad experiences dealing with failover-based systems in the past. Since unplanned outages are rare, failover procedures were often added as an afterthought, not automated and not well tested. On multiple occasions, teams spent days recovering from an outage, bringing systems back online component by component. . .tuning the system as it tried to catch up processing the backlog starting from the initial outage. These situations not only cause extended unavailability, but are also extremely stressful for the teams running complex mission-critical systems.

Does this sound familiar? I’ve lost count of the number of customers I’ve talked to who have scars from trying to make failover-style systems deliver what they need, and who have therefore moved on to a multi-homed, always-on design that is in line with the digital age and the high customer experience expectations it sets.

What about you, Mr. CXO? Where is your business right now with respect to being able to withstand a serious and unplanned outage, and what technology solutions are you investing in to ensure your business never goes down?

For Delta and other airlines (e.g., Southwest) that have experienced unplanned downtime, the costs in both dollars and reputation are enormous – something called out by the Wall Street Journal: “The technical problems likely will cost Delta millions of dollars in lost revenue and damage its hard-won reputation as the most reliable of the major U.S.-based international carriers. . . .[Southwest’s] problem, caused by a single computer router that malfunctioned at its data center in Dallas, forced the airline to shut the entire system and reboot it, a 12-hour process. The cost was $5 million to $10 million, Southwest said.”

A Fox Business article stated that Delta’s biggest spend per year, outside of aircraft, was technology. It will be interesting to see what new tech investments and steps Delta takes to stave off a repeat of what happened yesterday.

