On Nov. 25, Amazon Web Services (AWS), the world’s largest cloud service provider, experienced a major outage in its US-EAST-1 region due to a “relatively small addition of capacity” to the Amazon Kinesis real-time data processing service. Just over two weeks later, Google Cloud Platform (GCP) suffered a major failure in its quota management system, severely reducing the capacity of its authentication system. The AWS outage caused services from major organizations such as Adobe, Autodesk, Fidelity, New York City’s Metropolitan Transportation Authority and the Washington Post to go down without warning. The GCP failure prevented users from logging in to their Gmail and Google Cloud applications, leading to interruptions for organizations using Google for their core office utilities.
These failures should serve as warnings for organizations to be vigilant about cloud computing’s pitfalls but, more urgently, should act as a wake-up call for the U.S. government to reevaluate how it manages the risk from cloud services and to consider cloud interoperability as a way to bolster national and economic security. Organizations are increasingly adopting cloud computing for its convenience and the revolutionary technologies it unlocks. Critical infrastructure operators, from energy to health care, are making the cloud a key next step in their future development. And because cloud computing is an industry with extremely high barriers to entry, operators are likely to default to AWS, Microsoft Azure, Google Cloud, IBM or Oracle in the future, making any disruption in their services a potential threat to national and economic security.
The recent AWS and GCP outages follow a string of major cloud computing failures from the three largest providers. In April 2011, AWS’ Elastic Block Store, a widely used storage service, went down due to a similar routine capacity upgrade, leading to cascading disruptions in the US-EAST-1 region. In February 2013, Microsoft’s Azure cloud service experienced a global outage after certificates securing customer data expired. In August 2015, Google Cloud’s data center in Belgium suffered minor data losses after lightning strikes on power grids knocked out its primary power supply. And in February 2017, AWS’ Simple Storage Service (S3), which hosts entire websites and applications, experienced a four-hour disruption in the US-EAST-1 region while its team was debugging slower-than-usual performance.
These three internet giants’ positions as industry leaders do not exempt them from failure; if anything, they are pioneers in the current era of complex distributed computing systems, making failures due to internal mistakes or acts of God to some degree unavoidable. In spite of cloud vendors’ promises of cost savings and commercial pressures to adopt the cloud, these recent failures should remind organizations to assess whether they can tolerate these “growing pains,” especially with respect to their critical functions.
Each cloud failure has impacted the economy more harshly than the one preceding it. The earlier disruptions, such as the 2013 Azure outage, caused only relatively minor problems, the most apparent of which occurred in the Xbox Music and Video platforms. A more serious impact occurred in 2016, when a power disruption at a Verizon data center containing JetBlue Airways databases caused JetBlue flights to be grounded for hours. The 2017 S3 outage had consequences across different industries, affecting popular services like GitHub, Quora, Expedia and many mobile apps. And in the latest AWS outage, services from critical sectors such as finance (Fidelity, Coinbase) and transportation (NYC MTA) were meaningfully impacted.
These realities raise important issues for regulators. Steps such as requiring organizations to disperse their computing infrastructure across several cloud providers and requiring providers to increase interoperability aren’t just prudent; they’re necessary. Interoperability could be reflected in common architectures or a software middleware layer that helps organizations relocate their workloads between cloud vendors easily and greatly reduces the probability that all of an organization’s systems could fail at once. Instituting some of these changes would keep problems “localized” and prevent industry-wide failures due to overreliance on any one provider.
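To make the idea of a middleware layer concrete, here is a minimal sketch in Python of how an application might call one provider-neutral interface and fall back to a second cloud when the first is unavailable. Everything in it is hypothetical and for illustration only: the names MultiCloudStore, provider_a and provider_b are assumptions, the two "providers" are stand-in functions rather than real vendor APIs, and the sketch assumes the same data has already been replicated to both.

```python
from typing import Callable, Dict, List


class ProviderUnavailable(Exception):
    """Raised by a backend when its provider cannot serve the request."""


class MultiCloudStore:
    """Hypothetical middleware: one interface, several interchangeable backends."""

    def __init__(self, backends: Dict[str, Callable[[str], str]]):
        # Dict insertion order doubles as the failover order.
        self.backends = backends

    def fetch(self, key: str) -> str:
        failures: List[str] = []
        for name, backend in self.backends.items():
            try:
                return backend(key)
            except ProviderUnavailable:
                # Record the failed provider and try the next one.
                failures.append(name)
        raise ProviderUnavailable(f"all providers failed: {failures}")


def provider_a(key: str) -> str:
    # Stand-in for a provider in the middle of an outage (assumption).
    raise ProviderUnavailable("provider_a is down")


def provider_b(key: str) -> str:
    # Stand-in for a healthy second provider holding the same data (assumption).
    return f"{key} served by provider_b"


store = MultiCloudStore({"provider_a": provider_a, "provider_b": provider_b})
result = store.fetch("ridership-report")
print(result)
```

In this toy setup the outage at the first provider stays "localized": the caller still gets an answer, and a complete failure occurs only if every configured provider is down at the same time. Real workloads would also need data replication and compatible services on each cloud, which is where common architectures and interoperability requirements would come in.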