What Happens When "Half the Internet" Goes Down? Lessons from the Critical AWS US-EAST-1 Outage

Andrés Ujpán

October 20, 2025, will go down in the history of digital infrastructure. A service outage in Amazon Web Services (AWS) that originated in the provider’s most critical region, N. Virginia (US-EAST-1), caused a massive disruption of services, affecting everything from banking to e-commerce on a global scale.

Experts described the event as “network amnesia”, a moment when “half the internet” seemed to disconnect simultaneously, exposing the inherent risk of excessive centralization in digital infrastructure.

The Root Cause: Digital Amnesia of the Control Plane

The incident began in the late hours of October 19 with elevated error rates and latencies. The root cause, identified at 12:26 AM PDT on October 20, was a DNS resolution issue affecting the regional DynamoDB service endpoints.

Why was a DNS failure so severe? DynamoDB is not just a customer database; it is a fundamental component that the AWS control plane uses to manage the state, configuration, and authentication of nearly all other services. When DNS resolution for DynamoDB failed, the dependent systems behaved as if they had temporarily lost their memory, unable to locate the critical data they needed to operate.

As one expert described it, Amazon still had the data, but for hours, “no one could find it,” temporarily separating applications from their data.
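To make the failure mode concrete, here is a minimal sketch of a DNS health probe of the kind an operations team might run against the endpoints its application depends on. It is an illustration, not a description of AWS's internal tooling. The function name `check_dns` and the injectable `resolver` parameter are assumptions introduced for this example; `socket.getaddrinfo` is the standard-library resolver call.

```python
import socket
from typing import Callable, Dict, Iterable


def check_dns(
    hostnames: Iterable[str],
    resolver: Callable[..., object] = socket.getaddrinfo,
) -> Dict[str, bool]:
    """Return {hostname: True/False} depending on whether each name resolves.

    `resolver` is injectable (an assumption of this sketch) so the probe can
    be exercised without real network access.
    """
    results: Dict[str, bool] = {}
    for host in hostnames:
        try:
            # Port 443 because these endpoints are reached over HTTPS.
            resolver(host, 443)
            results[host] = True
        except socket.gaierror:
            # The name could not be resolved: the data may still exist,
            # but clients have no address at which to reach it.
            results[host] = False
    return results
```

Calling `check_dns(["dynamodb.us-east-1.amazonaws.com"])` would probe the public regional DynamoDB hostname. A result of `False` corresponds to the situation described above: the service's data is intact, but nothing can find it.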

The Domino Effect: 16 Hours of Instability

Although the initial DNS issue was resolved at 2:24 AM PDT, the system did not recover immediately. The failure in the state service triggered a complex chain of cascading effects:

  • EC2 Failure: An internal subsystem of EC2 (the service that launches virtual machine instances) degraded because it depends on DynamoDB.
  • Network Issues: Health checks for Network Load Balancers (NLBs) were also affected. The resulting connectivity failures impacted essential services such as Lambda, DynamoDB, and CloudWatch.
  • Staggered Recovery: To prevent a massive overload while systems came back, AWS applied a key control measure: temporary throttling. This restricted high-demand operations, such as launching new EC2 instances and asynchronous Lambda invocations.
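The throttling idea in the last point can be sketched with a token bucket, a common rate-limiting technique. The article does not say which mechanism AWS uses internally, so treat this purely as an illustration of the concept. The class name `TokenBucket` and its parameters are hypothetical.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow at most `rate` operations per second, with bursts up to `capacity`."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum tokens the bucket can hold
        self.tokens = capacity      # start with a full bucket
        self.clock = clock          # injectable clock (assumption, eases testing)
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend `cost` tokens if the operation may proceed."""
        now = self.clock()
        # Refill in proportion to the time elapsed, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # throttled: the caller should retry later or back off
```

During a recovery, an operator could lower `rate` for expensive operations (for example, instance launches) so that a backlog of queued requests drains gradually instead of overwhelming a system that has only just stabilized.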

The full recovery of all AWS operations was not declared until 3:01 PM PDT on October 20, resulting in nearly 16 hours of disruption or instability for the global digital ecosystem.
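The cascade described in this section can be modeled as a walk over a dependency graph. The sketch below is a deliberately simplified illustration: the service names are labels chosen for this example, and the edge from the EC2 subsystem to the NLB health checks is an assumption made so the chain is connected, since the article does not state what the health checks depend on.

```python
from collections import deque
from typing import Dict, List

# Illustrative map: component -> components that depend on it.
# Simplified and partly assumed; not an authoritative AWS architecture.
DEPENDENTS: Dict[str, List[str]] = {
    "dynamodb-dns": ["dynamodb"],
    "dynamodb": ["ec2-launch-subsystem"],
    "ec2-launch-subsystem": ["nlb-health-checks"],  # assumed edge
    "nlb-health-checks": ["lambda", "cloudwatch"],
}


def blast_radius(failed: str, dependents: Dict[str, List[str]]) -> List[str]:
    """Return every component transitively affected by `failed`, in BFS order."""
    seen = {failed}
    affected: List[str] = []
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, []):
            if dependent not in seen:  # guards against cycles in the graph
                seen.add(dependent)
                affected.append(dependent)
                queue.append(dependent)
    return affected
```

Running `blast_radius("dynamodb-dns", DEPENDENTS)` on this toy graph lists every downstream component, which is a quick way to see why fixing the root DNS issue did not end the incident: each dependent layer had to recover in turn.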

Impact on Daily Life: Banks and E-commerce

The global reliance on US-EAST-1 translated into tangible failures for users and businesses:

  • Financial Sector: In Colombia, the major banks Bancolombia and Davivienda experienced intermittent issues or failures on their websites and mobile apps. In total, at least 16 financial institutions issued statements regarding problems with their channels. The disruptions on these banking platforms persisted until approximately 6:00 PM local time, hours after AWS announced the recovery.
  • Commerce and Essential Services: High-volume e-commerce sites like Mercado Libre reported issues with the storage and performance of their services. Even essential public services, such as the Electrificadora de Santander (ESSA), had their customer service channels (phone lines and WhatsApp) disrupted by the outage.

The Strategic Lesson: Beyond Multi-AZ

The event demonstrated that an architecture designed with simple Multi-AZ resilience (within a single region) is not enough to mitigate a failure in the regional control plane, as was the case with the DNS/DynamoDB outage.

The main takeaway for companies is that the most critical workloads must evolve toward Multi-Region models. This incident forces a re-evaluation of risk planning: the failure of a low-level component in a critical region can paralyze global operations, demanding strategies that guarantee continuity even when the cloud’s most important region is destabilized.
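As a rough illustration of what a Multi-Region model means at the application level, the sketch below tries an operation against an ordered list of regions and falls back to the next one when a region fails. The names `call_with_failover` and `AllRegionsFailed` are hypothetical, and `operation` stands in for whatever regional client call an application makes. The sketch also assumes the data is already replicated to the secondary region, which is the hard part in practice.

```python
from typing import Callable, Dict, Sequence, TypeVar

T = TypeVar("T")


class AllRegionsFailed(Exception):
    """Raised when the operation failed in every configured region."""

    def __init__(self, errors: Dict[str, Exception]) -> None:
        super().__init__("all regions failed: " + ", ".join(errors))
        self.errors = errors  # region name -> exception raised there


def call_with_failover(regions: Sequence[str], operation: Callable[[str], T]) -> T:
    """Run `operation(region)` in order of preference; return the first success."""
    errors: Dict[str, Exception] = {}
    for region in regions:
        try:
            return operation(region)
        except Exception as exc:  # broad on purpose here; narrow it in real code
            errors[region] = exc
    raise AllRegionsFailed(errors)
```

With a preference list such as `["us-east-1", "us-west-2"]`, a regional failure like the one described here would route requests to the second region instead of failing outright. Real deployments also need health-based routing, replication lag handling, and regular failover drills, none of which this sketch covers.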
