01

What is Cloudflare and Why Does It Matter?

Before we dive into what went wrong, let's understand what Cloudflare actually does. If you're building web applications, this is essential knowledge. Trust me, after reading this, you'll never look at the internet the same way! 🤯

🌐 The Problem Cloudflare Solves

Imagine you build an amazing e-commerce website. It's hosted on a server in Mumbai. Now, when someone from New York tries to access your site, their request has to travel thousands of kilometers across the ocean. That's slow. And if suddenly 10,000 people try to access your site at once? Your single server might crash. Not a great experience, right? 😣

🔧 Technical Analogy

Think of Cloudflare as a security guard + receptionist sitting between your users and your server. Every request goes through them first. They check if the visitor is legitimate, block the bad ones, and for common requests (like your homepage), they keep a copy ready so they can respond instantly without bothering your server. Now imagine having this guard in 330+ cities worldwide — so users always talk to someone nearby instead of waiting for a response from far away.

What Cloudflare Actually Does

By the numbers: 20%+ of the internet, 330+ cities worldwide, millions of websites, ~75 trillion requests per day.

Cloudflare provides several critical services:

  • CDN (Content Delivery Network) — Caches and serves your content from servers close to users
  • DDoS Protection — Blocks attacks where hackers flood your site with fake traffic
  • DNS Services — Translates domain names (like google.com) to IP addresses
  • Bot Management — Identifies and blocks malicious bots (this is the culprit in our story! 🎯)
  • SSL/TLS — Handles encryption for secure connections
  • Web Application Firewall — Protects against common attacks like SQL injection
⚠️ Why This Matters

Because Cloudflare sits between users and websites for such a large portion of the internet, when Cloudflare fails, it doesn't matter if your actual servers are running perfectly — users simply can't reach them. Your application is fine, but nobody can access it.

Services Affected by This Outage

This wasn't a minor hiccup. These major services went down or became partially unavailable — and it was chaos! 😱

  • X (Twitter) — ~700 million users 🐦
  • ChatGPT — couldn't log in 🤖
  • Spotify, Discord, Canva, Figma 🎵💬🎨
  • Claude AI, 1Password, Trello, Medium, Postman
  • League of Legends, Valorant (couldn't connect to servers) 🎮
  • Even DownDetector (the site people use to check if sites are down!) was down 😂
02

How Cloudflare's Architecture Works

To understand what broke, you need to understand how a request flows through Cloudflare's system. Don't worry, I'll make this simple! 🤓

🔄 The Request Journey

When you type twitter.com in your browser, here's what happens behind the scenes:

Request flow through Cloudflare: Your Browser → DNS Resolution → TLS/HTTP Layer → FL Proxy 💥 → Pingora → Origin Server

Let me explain each layer (this is where it gets interesting! 🎯):

🌍 1. DNS Resolution
Your browser asks "What's the IP address of twitter.com?" Cloudflare's DNS servers respond with the IP of the nearest Cloudflare edge server (not Twitter's actual server).
🔒 2. TLS/HTTP Layer
Your encrypted HTTPS connection terminates here. Cloudflare decrypts it, inspects the request, and re-encrypts if needed.
⚡ 3. FL Proxy (Core Proxy) — THE FAILURE POINT
This is the brain of the operation. Called "FL" for "Frontline", this is where all the magic happens:
  • WAF rules are applied (blocking SQL injection, XSS, etc.)
  • DDoS protection kicks in
  • Bot Management runs here — generating bot scores for every request
  • Customer-specific configurations are applied
  • Traffic is routed to the appropriate service

This is where things broke. The FL Proxy crashed when loading a corrupted configuration file.

📦 4. Pingora
Handles caching and fetches content from the origin server if not cached. Written in Rust for performance.
🏠 5. Origin Server
The actual server running Twitter/X's code. During this outage, origin servers were perfectly healthy — users just couldn't reach them.
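
Quick aside on layer 1 before we move on: DNS is something you can poke at yourself from any machine. Here's a tiny illustrative Rust snippet (not Cloudflare code) that resolves a hostname the way your browser's resolver does. For a Cloudflare-proxied domain, the addresses that come back belong to a nearby Cloudflare edge, not the origin server.

🦀 Rust — resolving a proxied hostname (illustrative sketch)
use std::net::ToSocketAddrs;

fn main() -> std::io::Result<()> {
    // Ask the system resolver for the hostname on port 443.
    // For a Cloudflare-proxied site, these IPs point at Cloudflare's
    // edge network, not the customer's origin server.
    for addr in "twitter.com:443".to_socket_addrs()? {
        println!("resolved edge address: {addr}");
    }
    Ok(())
}
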
Key Insight: Single Point of Failure

The FL Proxy processes every single request that goes through Cloudflare. There's no way to bypass it. When it fails, everything fails. This is why understanding the architecture matters — one broken component in the critical path can take down everything downstream.

FL vs FL2: The Two Proxy Versions

Here's where it gets tricky! 🤔 Cloudflare was in the process of migrating customers from their old proxy (FL) to a new, improved one (FL2). During this outage, both versions were affected — but differently:

Aspect | FL2 (New Proxy) | FL (Old Proxy)
Written in | Rust | Older codebase
What happened | Completely crashed with HTTP 5xx errors | Continued running but returned incorrect bot scores (always 0)
User experience | Error pages, couldn't access sites | Could access sites, but bot rules misfired (false positives)

The FL2 proxy was stricter about input validation (a good thing normally!) and crashed when it received invalid data. The older FL proxy was more lenient but produced incorrect results. Ironic, right? 🤷‍♂️
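
To make that contrast concrete, here's a minimal hypothetical sketch (not Cloudflare's actual code) of the two behaviors: a strict loader that rejects bad input with an error, and a lenient one that quietly falls back to a meaningless score of 0.

🦀 Rust — strict vs lenient handling of a bad config (hypothetical sketch)
const MAX_FEATURES: usize = 200;

// FL2-style: strict. Invalid input is rejected with an error. Good in
// principle, catastrophic if the caller just calls .unwrap() on it.
fn bot_score_strict(feature_count: usize) -> Result<u8, String> {
    if feature_count > MAX_FEATURES {
        return Err(format!("too many features: {feature_count}"));
    }
    Ok(42) // pretend the ML model produced a real score
}

// FL-style: lenient. Never errors, but quietly returns a meaningless
// score of 0, so customers' bot rules misfire.
fn bot_score_lenient(feature_count: usize) -> u8 {
    if feature_count > MAX_FEATURES {
        return 0; // wrong answer, but the proxy stays up
    }
    42
}

fn main() {
    let bad_feature_count = 240; // the corrupted file
    println!("FL2 (strict):  {:?}", bot_score_strict(bad_feature_count)); // Err(...)
    println!("FL  (lenient): {:?}", bot_score_lenient(bad_feature_count)); // 0
}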

03

The Bot Management System

The root cause of this outage was in Cloudflare's Bot Management system. Now this is where it gets really interesting! 🎯 Let's understand what it does and why a simple "feature file" brought down the internet.

🤖 What is Bot Management?

Ever wondered how websites know if you're a real person or an automated bot? 🤔

Not all traffic to websites is from real humans. A significant portion comes from bots — automated programs that access websites. Some bots are good (like Google's crawler that indexes your site for search), and some are bad (like scrapers stealing your content or attackers trying to brute-force passwords).

Bot Management uses machine learning to analyze every request and assign a "bot score" — a number that indicates how likely the request is from a bot vs. a human.

🔧 Technical Analogy

It's like a spam filter for websites. Just like Gmail looks at email patterns to decide "spam or not spam", Bot Management looks at request patterns to decide "bot or human". It checks things like: How fast are requests coming? Does this browser fingerprint look real? Is this IP known for suspicious activity? All these signals get combined into a single score.

How Bot Scores Work

Bot score generation: HTTP Request (input) + Feature File 💥 (config) → ML Model (processing) → Bot Score of 1-99 (output)
  • Score 1-29: Likely a bot
  • Score 30-70: Uncertain, might be either
  • Score 71-99: Likely a human

Website owners can then create rules like: "If bot score < 30, show a CAPTCHA" or "If bot score < 10, block the request."
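
As a rough illustration of how such rules might consume the score (the thresholds and actions below are made up for the example, not Cloudflare's actual API):

🦀 Rust — acting on a bot score (illustrative thresholds)
enum Action {
    Block,
    Challenge,
    Allow,
}

// Hypothetical rule set mirroring the thresholds above: very low scores
// get blocked, low scores get a CAPTCHA, everything else passes.
fn decide(bot_score: u8) -> Action {
    match bot_score {
        0..=9 => Action::Block,       // almost certainly a bot
        10..=29 => Action::Challenge, // likely a bot: show a CAPTCHA
        _ => Action::Allow,           // uncertain or likely human
    }
}

fn main() {
    for score in [3u8, 25, 80] {
        let action = match decide(score) {
            Action::Block => "block",
            Action::Challenge => "challenge (CAPTCHA)",
            Action::Allow => "allow",
        };
        println!("score {score} -> {action}");
    }
}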

📄 What is a Feature File?

Now we get to the critical piece — pay attention here! 👀 The machine learning model needs to know which characteristics (features) to look at when analyzing a request. These are defined in a feature configuration file.

🔧 What's in a Feature File?

A feature file contains a list of "features" — individual traits the ML model uses. For example:

  • user_agent_entropy — How random/unique is the User-Agent string?
  • request_rate — How many requests per second from this IP?
  • header_order — In what order are HTTP headers sent?
  • tls_fingerprint — What does the TLS handshake look like?
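
Cloudflare hasn't published the file's exact format, so treat this as a conceptual sketch only: you can picture each entry as a named, typed definition that the model reads at startup.

🦀 Rust — a conceptual feature-file entry (format is an assumption)
#[derive(Debug)]
struct FeatureDef {
    name: &'static str, // e.g. "user_agent_entropy"
    kind: &'static str, // the value type the ML model expects
}

fn main() {
    // A handful of the ~60 features, mirroring the examples above.
    let features = [
        FeatureDef { name: "user_agent_entropy", kind: "Float64" },
        FeatureDef { name: "request_rate", kind: "Float64" },
        FeatureDef { name: "header_order", kind: "String" },
        FeatureDef { name: "tls_fingerprint", kind: "String" },
    ];
    println!("{} features defined, e.g. {:?}", features.len(), features[0]);
}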

Why Feature Files Need Frequent Updates

Bad actors constantly evolve. It's like a cat and mouse game! 🐱🐭 When attackers figure out that Cloudflare is looking at Feature X, they'll modify their bots to appear normal on Feature X. So Cloudflare needs to constantly update the features — adding new ones, tweaking existing ones, removing obsolete ones.

This is why the feature file is regenerated every few minutes and pushed to every server globally. Sounds harmless, right? 🤷‍♂️

⚠️ The Design Decision That Caused Problems

For performance reasons, Cloudflare's proxy pre-allocates memory for the feature file. They set a limit of 200 features — well above their actual use of ~60 features. When the corrupted file contained over 200 features due to duplicates, it exceeded this limit and crashed the system.

The Code That Crashed

Here's the actual Rust code that caused the crash (simplified):

🦀 Rust — FL2 Proxy Bot Management Module
// This code checks if the number of features is within the limit
// MAX_FEATURES is set to 200

fn load_feature_config(features: Vec<Feature>) -> Result<Config, Error> {
    if features.len() > MAX_FEATURES {
        return Err(Error::TooManyFeatures);
    }
    
    // Pre-allocate memory for exactly this many features
    let config = Config::with_capacity(features.len());
    
    // ... rest of loading logic

    Ok(config)
}

// Somewhere in the calling code:
let config = load_feature_config(features).unwrap();  // 💥 CRASH HERE!

The problem was that .unwrap(). In Rust, .unwrap() says "I expect this to succeed, and if it doesn't, crash the program." When the feature count exceeded 200, the function returned an error, .unwrap() was called on that error, and the entire proxy crashed.

Key Lesson: Never Trust Your Own Data

Cloudflare's engineers assumed their internally-generated feature file would always be valid. They didn't apply the same defensive programming they would for user-provided input. This is a common mistake — we often trust "internal" data more than we should.

04

Understanding ClickHouse Distributed Databases

The bug originated in Cloudflare's ClickHouse database. If you're getting into large-scale systems, understanding distributed databases is essential. Let's break it down.

📊 What is ClickHouse?

ClickHouse is an open-source column-oriented database designed for analytics at massive scale. It's used by companies like Uber, Cloudflare, eBay, and Yandex to analyze billions of rows of data in real-time.

📚 Row vs Column Storage

Row-oriented databases (MySQL, PostgreSQL): Store data like a book — one complete row after another. Great for looking up a specific user's full profile.

Column-oriented databases (ClickHouse): Store data like a spreadsheet where each column is a separate file. Great for analytics queries like "What's the average of column X across 1 billion rows?" because you only read that one column.
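
The classic way to picture the difference in code is "array of structs" (row-oriented) versus "struct of arrays" (column-oriented). Here's a toy sketch: averaging one field only touches one array in the columnar layout.

🦀 Rust — row-oriented vs column-oriented layouts (toy example)
// Row-oriented: each record is stored together (like MySQL/Postgres).
struct RequestRow {
    ip: String,
    status: u16,
    duration_ms: f64,
}

// Column-oriented: each field is its own contiguous array (like
// ClickHouse). An analytics query over one column reads only that array.
struct RequestColumns {
    ips: Vec<String>,
    statuses: Vec<u16>,
    durations_ms: Vec<f64>,
}

fn main() {
    let rows = vec![
        RequestRow { ip: "1.2.3.4".into(), status: 200, duration_ms: 12.0 },
        RequestRow { ip: "5.6.7.8".into(), status: 503, duration_ms: 80.0 },
    ];

    let cols = RequestColumns {
        ips: rows.iter().map(|r| r.ip.clone()).collect(),
        statuses: rows.iter().map(|r| r.status).collect(),
        durations_ms: rows.iter().map(|r| r.duration_ms).collect(),
    };

    // "SELECT avg(duration_ms)" only needs the durations column.
    let avg = cols.durations_ms.iter().sum::<f64>() / cols.durations_ms.len() as f64;
    let errors = cols.statuses.iter().filter(|&&s| s >= 500).count();
    println!("avg duration: {avg} ms, {errors} server errors (ips never touched)");
}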

How Distributed Databases Work

When you have billions of rows of data, one server isn't enough. You shard the data across many servers:

ClickHouse distributed architecture: Your Query (SELECT * FROM logs) → Coordinator (Distributed Table) → Shard 1 (rows 1-1M) + Shard 2 (rows 1M-2M) + Shard 3 (rows 2M-3M)

Here's the key concept that caused the bug:

The Two-Database Structure

Cloudflare's ClickHouse has two logical databases:

Database | Purpose | What it contains
default | Query entry point | Distributed tables — virtual tables that fan out queries to all shards
r0 | Actual storage | Underlying tables — where the actual data lives on each shard

When you query default.http_requests_features, the Distributed engine automatically queries r0.http_requests_features on every shard and combines the results.
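
Here's a very rough sketch of that fan-out-and-merge idea, with threads standing in for shards (this is the concept only, not ClickHouse's actual implementation):

🦀 Rust — fan out a query to shards and merge the results (rough sketch)
use std::thread;

// Pretend each shard holds part of the data and answers a count query.
fn query_shard(shard_id: usize, rows_on_shard: u64) -> u64 {
    println!("shard {shard_id}: counted {rows_on_shard} rows");
    rows_on_shard
}

fn main() {
    let shards = [(1, 1_000_000u64), (2, 1_000_000), (3, 1_000_000)];

    // The "distributed table" sends the same query to every shard in parallel...
    let handles: Vec<_> = shards
        .iter()
        .map(|&(id, rows)| thread::spawn(move || query_shard(id, rows)))
        .collect();

    // ...then combines the partial results into a single answer.
    let total: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
    println!("SELECT count() FROM default.logs -> {total}");
}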

🔐 The Permission Change That Started It All

Here's where things went wrong. Cloudflare was improving their database security by making permissions more explicit.

Before the Change (11:04 UTC)

When users queried metadata (like "what columns exist in this table?"), they could only see the default database:

SQL — Metadata Query (Before)
SELECT name, type FROM system.columns
WHERE table = 'http_requests_features';

-- Result:
-- name          | type
-- --------------+--------
-- user_agent    | String
-- ip_address    | String
-- request_rate  | Float64
-- ... (~60 features)

After the Change (11:05 UTC)

The permission change made the r0 database visible too. Now the same query returned duplicates:

SQL — Metadata Query (After)
SELECT name, type FROM system.columns 
WHERE table = 'http_requests_features';

-- Result (PROBLEM! The database column is shown here for clarity; the query didn't select it):
-- name          | type     | database
-- --------------+----------+----------
-- user_agent    | String   | default    ← Original
-- ip_address    | String   | default
-- request_rate  | Float64  | default
-- user_agent    | String   | r0         ← DUPLICATE!
-- ip_address    | String   | r0         ← DUPLICATE!
-- request_rate  | Float64  | r0         ← DUPLICATE!
-- ... (now ~120+ rows!)
💥 The Bug

The query that generates the feature file didn't filter by database name. It assumed all results would be from the default database. When the r0 tables became visible, duplicate rows flooded the results, swelling the feature count from the usual ~60 past the 200-feature limit!

The Problematic Query

SQL — The Query That Broke Everything
-- This query was used to generate the feature file
SELECT name, type FROM system.columns
WHERE table = 'http_requests_features'
ORDER BY name;

-- ❌ MISSING: AND database = 'default'

One missing database filter in the WHERE clause. That's all it took. Just one line of code. 🤯

Why The Outage Was Intermittent (At First)

Here's where it gets even more confusing! 😵 The outage didn't hit all at once. It fluctuated. Why?

Cloudflare was gradually rolling out the permission change to their ClickHouse cluster. The feature file is regenerated every 5 minutes, and each regeneration randomly picks a node in the cluster to run the query on.

  • Query hits updated node: Bad feature file generated → Outage
  • Query hits non-updated node: Good feature file generated → Recovery

This made debugging incredibly confusing because the system would recover on its own, then fail again minutes later. Imagine trying to fix something that keeps "fixing itself" and breaking again! 😤
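
You can reproduce the flapping pattern with a tiny simulation: some runs of the 5-minute job land on a node that already has the new permissions (bad file), others land on one that doesn't (good file). The counts and node choices below are purely illustrative.

🦀 Rust — simulating the good-file/bad-file flapping (toy model)
const MAX_FEATURES: usize = 200;

// Toy model: if the node that runs the metadata query already has the
// new permissions, duplicate rows inflate the feature count past the
// limit (the exact counts here are illustrative, not Cloudflare's).
fn generate_feature_file(node_has_new_permissions: bool) -> usize {
    if node_has_new_permissions { 240 } else { 60 }
}

fn main() {
    // Which node the 5-minute job happens to land on, run after run,
    // while the permission change is still rolling out.
    let picked_node_updated = [true, false, true, false, true, true];

    for (run, &updated) in picked_node_updated.iter().enumerate() {
        let minutes = 11 * 60 + 20 + run * 5; // 11:20, 11:25, ...
        let count = generate_feature_file(updated);
        let status = if count > MAX_FEATURES { "💥 DOWN" } else { "✅ UP" };
        println!("{:02}:{:02}  {count} features -> {status}", minutes / 60, minutes % 60);
    }
}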

Why The Outage Kept Coming Back
  • 11:20: Query → Updated Node → Bad File → 💥 DOWN
  • 11:25: Query → Old Node → Good File → ✅ UP
  • 11:30: Query → Updated Node → Bad File → 💥 DOWN
  • 11:35: Query → Old Node → Good File → ✅ UP
  • ... Eventually all nodes updated → Permanent outage until the fix
05

What Actually Happened (Step by Step)

Now that you understand all the components, let's trace exactly what happened, minute by minute.

11:05 UTC
The Innocent Change
A database engineer deploys a permission change to the ClickHouse cluster. The goal? Improve security by making permissions more explicit. Users would now see metadata for tables they have access to in both default and r0 databases. This seemed like a good, routine change.
11:05 - 11:20 UTC
The Time Bomb Ticks
The permission change starts rolling out gradually across the ClickHouse cluster. The feature file regeneration job runs every 5 minutes. Depending on which node handles the query, it might or might not see the new permissions yet.
11:20 UTC
💥 Impact Begins
The feature file regeneration job runs on an updated node. The query returns duplicate rows, and the feature file balloons past the 200-feature limit instead of holding the usual ~60 features. This file is propagated to all servers globally within seconds.
11:20 UTC (continued)
The Cascade Failure
Servers running FL2 (the new proxy) try to load the feature file. The code checks: "Do we have more than 200 features?" Yes. It throws an error. .unwrap() is called. PANIC. The proxy crashes. HTTP 5xx errors flood the network.
11:25 UTC (approximately)
Brief Recovery (The Confusion)
The next feature file regeneration runs on a node that hasn't been updated yet. A good feature file is generated and propagated. Servers recover. For 5 minutes, everything looks fine.
11:30 UTC
Down Again
Next regeneration hits an updated node. Bad file. Crash. This cycle continues — recovery, crash, recovery, crash — making the problem incredibly confusing to diagnose.
11:31 UTC
Automated Alert Fires
Cloudflare's automated testing detects the issue. An alert goes out.
11:32 UTC
Manual Investigation Starts
Engineers start looking at the problem. Initial symptoms point to Workers KV (a key-value store) having issues.
11:35 UTC
Incident Call Created
A formal incident is declared. Engineers from multiple teams join. The hunt begins.
11:35 - 13:00 UTC
Wild Goose Chase
The team goes down several wrong paths:
  • Initial suspicion: A DDoS attack (Cloudflare had recently defended against massive attacks)
  • The status page going down (unrelated coincidence!) reinforced the attack theory
  • The intermittent nature made it seem like attackers were probing
  • Focus shifts to Workers KV, then Access, then other services
13:05 UTC
First Mitigation
Engineers implement a bypass for Workers KV and Access, routing them through the old FL proxy instead of FL2. This reduces impact, though core issues remain.
13:37 UTC
Root Cause Identified
Finally! An engineer traces the crashes back to the Bot Management feature file. They examine it and see the duplicate features. The ClickHouse permission change is identified as the trigger.
14:24 UTC
Stop the Bleeding
Two actions taken simultaneously:
  1. Stop automatic regeneration of new feature files
  2. Retrieve the last known good feature file from before 11:20
14:30 UTC
Recovery Begins
The good feature file is pushed globally. Servers start recovering. Core traffic begins flowing again.
14:30 - 17:06 UTC
Cleanup and Recovery
Various services that entered bad states need manual intervention. A backlog of login attempts overwhelms the dashboard. Teams work to scale services and restart affected components.
17:06 UTC
✅ Full Recovery
All services restored. 5xx error rates return to baseline. The internet breathes again.
  • Total impact duration: ~6h
  • Peak outage: ~3h
  • Time to find the root cause: ~2h
  • Time to deploy the fix: ~50min
06

How They Detected the Problem

This section is crucial for anyone building systems at scale. Detection and observability are your lifeline when things go wrong! 🚨

🚨 The Monitoring That Worked

Cloudflare has extensive monitoring. Here's what fired:

  • Automated Health Checks: Synthetic tests that continuously make requests to Cloudflare services detected issues at 11:31 UTC — just 11 minutes after impact started. Pretty fast! ⚡
  • 5xx Error Rate Monitors: Dashboards immediately showed the spike in error responses.
  • Latency Metrics: Response times spiked because the proxy was spending CPU cycles on error handling and debugging.
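
A bare-bones version of that kind of monitor is easy to sketch: fire a synthetic probe on a schedule, track the recent 5xx rate, and alert when it crosses a threshold. The probe below is faked so the example stays self-contained; a real one would issue actual HTTP requests.

🦀 Rust — a toy synthetic check with an error-rate alert (sketch)
use std::collections::VecDeque;

// Stand-in for a real HTTP probe; returns a status code.
fn synthetic_probe(tick: u32) -> u16 {
    if tick >= 5 { 500 } else { 200 } // pretend the outage starts at tick 5
}

fn main() {
    let window = 10;           // look at the last 10 probes
    let alert_threshold = 0.3; // alert if more than 30% of them failed
    let mut recent: VecDeque<bool> = VecDeque::new();

    for tick in 0..15 {
        recent.push_back(synthetic_probe(tick) >= 500);
        if recent.len() > window {
            recent.pop_front();
        }
        let error_rate =
            recent.iter().filter(|&&failed| failed).count() as f64 / recent.len() as f64;
        if error_rate > alert_threshold {
            println!("tick {tick}: 5xx rate {:.0}% -> PAGE THE ON-CALL", error_rate * 100.0);
        } else {
            println!("tick {tick}: 5xx rate {:.0}% -> ok", error_rate * 100.0);
        }
    }
}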

🔍 The Investigation Challenges

Even with good monitoring, finding the root cause was hard. Here's why (this is where it gets interesting!):

1. Symptoms Didn't Point to the Cause

The visible symptoms were:

  • Workers KV returning errors
  • Access authentication failing
  • Dashboard login broken
  • General HTTP 5xx errors

None of these immediately screamed "Bot Management feature file!" The actual cause was several layers below the symptoms. Talk about a needle in a haystack! 🔍

2. The Intermittent Nature

The system would recover, then fail again. This pattern matched what you'd expect from:

  • A sophisticated DDoS attack (probing before the main assault)
  • A race condition in code
  • Network issues that come and go

It didn't match what you'd expect from a configuration problem (which usually causes persistent failures).

3. The Status Page Coincidence

Cloudflare's status page (hosted completely separately, not on Cloudflare) went down at the same time. This was a complete coincidence, but it made the team think an attacker was targeting both their infrastructure AND their communication channel. Can you imagine the panic? 😱

⚠️ Lesson: Coincidences Happen Under Pressure

When you're in incident response mode, your brain looks for patterns. Unrelated events can seem connected. Always verify assumptions. In this case, the status page issue was completely unrelated but wasted valuable investigation time.

4. Wrong Initial Hypothesis

The team initially suspected a DDoS attack because:

  • Cloudflare had recently defended against record-breaking attacks
  • The intermittent nature matched attack patterns
  • The status page going down reinforced this theory

How They Finally Found It

Around 13:37 UTC, an engineer looking at the FL2 proxy logs noticed the panic message:

🔴 Error Log
thread fl2_worker_thread panicked: called Result::unwrap() 
on an Err value: TooManyFeatures

This was the key. "TooManyFeatures" pointed directly to the Bot Management module. From there:

  1. They examined the current feature file
  2. They saw the duplicate entries
  3. They traced the feature file generation to the ClickHouse query
  4. They found the query didn't filter by database
  5. They checked recent ClickHouse changes — found the permission change at 11:05
Key Debugging Principle

The error message contained the answer: TooManyFeatures. Good error messages are invaluable. When writing code, invest in descriptive, specific error messages. Future you (or your on-call colleague at 3 AM) will thank you.
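
In Rust, that usually means a dedicated error type whose Display output says exactly what went wrong and with which values. A minimal sketch:

🦀 Rust — descriptive errors pay off at 3 AM (sketch)
use std::fmt;

#[derive(Debug)]
enum ConfigError {
    // Carry the numbers, not just the fact that something failed.
    TooManyFeatures { got: usize, limit: usize },
    DuplicateFeature { name: String },
}

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConfigError::TooManyFeatures { got, limit } => {
                write!(f, "feature file has {got} features, limit is {limit}")
            }
            ConfigError::DuplicateFeature { name } => {
                write!(f, "feature '{name}' appears more than once")
            }
        }
    }
}

fn main() {
    for err in [
        ConfigError::TooManyFeatures { got: 240, limit: 200 },
        ConfigError::DuplicateFeature { name: "request_rate".to_string() },
    ] {
        // The log line alone tells the on-call engineer where to look.
        eprintln!("failed to load bot-management config: {err}");
    }
}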

07

How They Fixed It

Once the root cause was identified, the fix was conceptually simple but operationally challenging. Here's the playbook they followed.

🛑 Step 1: Stop the Bleeding (14:24 UTC)

First priority: stop making things worse.

Immediate Actions
# 1. Stop automatic feature file generation
# This prevents new bad files from being created

$ kill-feature-file-job

# 2. Block propagation of feature files
# Even if a file is generated, don't push it

$ block-feature-file-distribution

This stabilized the situation — no new bad files would be created or distributed.

📦 Step 2: Restore Known Good State (14:24-14:30 UTC)

With the bleeding stopped, they needed to restore service:

  1. Find the last good file: Look at feature file history, find the last one generated before 11:20 UTC (before the permission change took effect)
  2. Validate the file: Check it has ~60 features, not 120+
  3. Manually inject into distribution: Push this file into the distribution queue
  4. Force restart: Restart the FL2 proxy across all servers to pick up the new file
💻 Shell — Conceptual Fix Process
# Find last good feature file
$ ls -la /var/cloudflare/feature-files/
feature-file-2025-11-18-11-15.json  # ← Before the change, should be good
feature-file-2025-11-18-11-20.json  # ← Bad, has duplicates
feature-file-2025-11-18-11-25.json  # ← Might be good (hit old node)
feature-file-2025-11-18-11-30.json  # ← Bad again

# Verify the good file
$ cat feature-file-2025-11-18-11-15.json | jq '.features | length'
62  # Good! Under 200 ✅

# Inject into distribution
$ inject-feature-file feature-file-2025-11-18-11-15.json --force --global

# Force proxy restart globally
$ restart-fl2-proxy --all-regions

🔧 Step 3: Handle the Cascade (14:30 - 17:06 UTC)

Restoring the core proxy wasn't enough. Other services had entered bad states:

The Dashboard Login Storm

While the system was down, millions of users kept trying to log in. When services recovered, all those retry attempts hit at once — a "thundering herd" problem. This is a classic distributed systems nightmare! 😣

🔧 Technical Analogy: Thundering Herd

Imagine a popular website's cache expires, and suddenly 10,000 users who were waiting all hit "refresh" at the exact same moment. Your database gets slammed with 10,000 identical requests instead of just one. That's the thundering herd problem! The fix? Add random delays to retries (so not everyone retries at once), or use a "circuit breaker" that temporarily rejects requests to prevent overload.
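
A common client-side defense is retrying with exponential backoff plus jitter, so the herd spreads itself out instead of stampeding in lockstep. Here's a dependency-free sketch (the jitter source is a crude hasher trick just to avoid pulling in a random-number crate):

🦀 Rust — retry with exponential backoff and jitter (sketch)
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};
use std::thread::sleep;
use std::time::Duration;

// Crude stand-in for a random number source, to keep the example
// dependency-free. A real client would use a proper RNG.
fn jitter_ms(attempt: u32, max_ms: u64) -> u64 {
    let mut h = RandomState::new().build_hasher();
    h.write_u32(attempt);
    h.finish() % max_ms
}

// Pretend login call that keeps failing while the backend recovers.
fn try_login(attempt: u32) -> Result<(), &'static str> {
    if attempt < 4 { Err("503 service unavailable") } else { Ok(()) }
}

fn main() {
    for attempt in 0..6 {
        match try_login(attempt) {
            Ok(()) => {
                println!("attempt {attempt}: logged in");
                break;
            }
            Err(e) => {
                // Exponential backoff (100ms, 200ms, 400ms, ...) plus random
                // jitter so thousands of clients don't retry in sync.
                let backoff = 100u64 * (1 << attempt);
                let delay = backoff + jitter_ms(attempt, backoff);
                println!("attempt {attempt}: {e}; retrying in {delay} ms");
                sleep(Duration::from_millis(delay));
            }
        }
    }
}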

Solution: Scale up dashboard and login services, add rate limiting, gradually let traffic through.

Turnstile (CAPTCHA) Recovery

Cloudflare Turnstile (their CAPTCHA alternative) was down, which meant new logins to the dashboard were impossible. Even after the proxy recovered, Turnstile needed separate attention.

CPU Exhaustion from Error Logging

Here's an interesting side effect: Cloudflare's debugging systems automatically enhance errors with extra context. With millions of errors happening, this consumed massive CPU resources, further slowing recovery. Ironic, right? The very thing meant to help debug was making things worse! 🤦‍♂️

💡 Lesson: Your Debugging Can Make Things Worse

Heavy error logging, stack trace collection, and debugging information are great for diagnosing issues. But during a major outage, they can consume resources you desperately need for recovery. Consider having "emergency mode" logging that's more minimal.

🔒 Step 4: Permanent Fix

After immediate recovery, they needed to fix the underlying bugs:

Fix 1: The SQL Query

SQL — Before (Buggy)
SELECT name, type FROM system.columns
WHERE table = 'http_requests_features'
ORDER BY name;
SQL — After (Fixed) ✅
SELECT name, type FROM system.columns
WHERE table = 'http_requests_features'
  AND database = 'default'  -- ← Added this filter! That's it! 🎉
ORDER BY name;

One line of code. That's the difference between the internet working and the internet breaking. 🤯

Fix 2: Better Error Handling in FL2

Rust — Before (Crashy) ❌
let config = load_feature_config(features).unwrap();  // Crash on error
Rust — After (Resilient) ✅
let config = match load_feature_config(features) {
    Ok(c) => c,
    Err(e) => {
        // Log the error
        error!("Failed to load feature config: {:?}", e);
        // Fall back to previous good config
        get_previous_config()
    }
};
08

The Full Impact

Let's look at the full scope of what was affected during this incident. Spoiler: it was massive! 💥

Services Directly Impacted

Service | Impact
Core CDN & Security | HTTP 5xx errors for customer sites
Turnstile (CAPTCHA) | Failed to load entirely
Workers KV | Elevated 5xx errors
Dashboard | Users couldn't log in
Cloudflare Access | Authentication failures
Email Security | Reduced spam detection, some Auto Move failures

Major Websites/Apps Affected

  • X (Twitter): ~700M users affected
  • ChatGPT: login issues
  • Spotify: service down
  • Discord: connectivity issues

Also affected: Canva, Figma, Claude AI, 1Password, Trello, Medium, Postman, League of Legends, Valorant, various crypto platforms, and ironically... DownDetector (the site people use to check if other sites are down).

Financial Impact

  • Cloudflare Stock (NET): Dropped 3.5% in pre-market trading
  • Customer Revenue Loss: Potentially millions across all affected sites (e-commerce transactions failed, ads didn't load, subscriptions couldn't be processed)
  • Crypto Markets: Multiple exchanges and DeFi platforms went offline, potentially affecting trades

Why Some Services Were Fine

Interestingly, not everything went down:

  • OpenAI API: Continued working (different infrastructure path than ChatGPT login)
  • Many mobile apps: Native mobile apps often bypass the CDN layer entirely
  • Sites with multi-CDN: Companies using multiple CDN providers could fail over to alternatives
Architectural Insight

This outage showed why multi-CDN architecture is increasingly important. Companies that had a backup CDN configured (like Fastly, Akamai, or AWS CloudFront) could switch traffic and minimize impact. Single points of failure are dangerous at internet scale.

09

Lessons for Building Systems at Scale

This incident is a goldmine of lessons for anyone building or operating large-scale systems. Let's extract the wisdom.

🔐 Lesson 1: Never Trust Internal Data

Cloudflare's code trusted its own configuration files completely. The assumption: "We generate this file ourselves, so it will always be valid." Makes sense, right? 🤔

Reality: Internal systems can produce invalid data due to bugs, race conditions, database issues, or (as in this case) unexpected side effects of other changes. Never assume!

✅ Best Practice

Validate ALL inputs, even those from internal systems. Apply the same defensive programming to internal data that you would to user input.
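
Applied to this incident, that might look like the sketch below (hypothetical, not Cloudflare's code): check the count, reject duplicates, and make the caller handle the failure instead of unwrapping it.

🦀 Rust — validating an internally generated config (hypothetical sketch)
use std::collections::HashSet;

const MAX_FEATURES: usize = 200;

fn validate_features(features: &[String]) -> Result<(), String> {
    if features.len() > MAX_FEATURES {
        return Err(format!("feature count {} exceeds limit of {MAX_FEATURES}", features.len()));
    }
    let mut seen = HashSet::new();
    for name in features {
        if !seen.insert(name) {
            return Err(format!("duplicate feature '{name}'")); // the 2025-11-18 failure mode
        }
    }
    Ok(())
}

fn main() {
    // Simulate the bad file: every feature appears twice.
    let mut features: Vec<String> = (0..60).map(|i| format!("feature_{i}")).collect();
    features.extend((0..60).map(|i| format!("feature_{i}")));

    match validate_features(&features) {
        Ok(()) => println!("config accepted"),
        Err(e) => println!("config rejected, keeping the previous good config: {e}"),
    }
}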

📉 Lesson 2: Graceful Degradation Over Hard Crashes

When the feature file was invalid, the FL2 proxy crashed with a panic. A better approach:

  • Log the error for investigation
  • Fall back to the previous known-good configuration
  • Alert operators while continuing to serve traffic
  • Rate-limit the fallback to prevent cascading issues

Serving traffic with a slightly stale configuration is almost always better than not serving traffic at all.

🔄 Lesson 3: Database Changes Are Infrastructure Changes

The permission change seemed like a small, safe improvement. But it had unexpected downstream effects. Database changes can affect:

  • Query results (as seen here)
  • Query performance (indexes, query plans)
  • Application behavior (assumptions about data format)
⚠️ Treat Database Changes Like Code Deploys

Use the same rigor: staged rollouts, feature flags, monitoring, and the ability to quickly roll back. Test in production-like environments with realistic queries.

🎭 Lesson 4: Intermittent Failures Are The Hardest

If the system had stayed down, they might have found the cause faster. The intermittent nature led the team to wrong conclusions (DDoS attack, race condition, network issues). This is the worst kind of bug to debug! 😵

Strategy for intermittent issues:

  • Focus on what's DIFFERENT between success and failure cases
  • Check for gradual rollouts of any kind (feature flags, database changes, code deploys)
  • Look at timing — does it correlate with scheduled jobs?
  • Check if different servers/regions behave differently

🚨 Lesson 5: Design Kill Switches

Cloudflare is now implementing more "global kill switches" — ways to instantly disable features that might be causing problems. When building systems:

  • Every feature should be independently disableable
  • Kill switches should work even when the main system is failing
  • Practice using them (chaos engineering)
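
The simplest form of a kill switch is just a flag the hot path consults and operators can flip out-of-band. Here's a toy sketch with a process-global atomic (a real system would flip it via a config push or an admin API):

🦀 Rust — a minimal feature kill switch (toy sketch)
use std::sync::atomic::{AtomicBool, Ordering};

// Global switch for one feature. Defaults to enabled.
static BOT_SCORING_ENABLED: AtomicBool = AtomicBool::new(true);

fn handle_request(path: &str) {
    if BOT_SCORING_ENABLED.load(Ordering::Relaxed) {
        println!("{path}: running bot scoring");
    } else {
        // Fail open: skip the feature instead of failing the request.
        println!("{path}: bot scoring disabled by kill switch, serving anyway");
    }
}

fn main() {
    handle_request("/index.html");

    // An operator flips the switch when the module starts misbehaving.
    BOT_SCORING_ENABLED.store(false, Ordering::Relaxed);
    handle_request("/checkout");
}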

📊 Lesson 6: Your Observability Can Hurt You

During the incident, Cloudflare's debugging systems (collecting stack traces, enhancing errors with context) consumed significant CPU. In a crisis, resources are precious.

Consider:

  • Emergency logging modes with reduced verbosity
  • Sampling during high-error-rate periods
  • Async logging that doesn't block the main process
  • Resource limits on debugging/tracing systems
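
One cheap version of an "emergency mode" is sampling: log the first few errors in full, then only every Nth occurrence, while still counting everything. A sketch:

🦀 Rust — sampling error logs under load (sketch)
use std::sync::atomic::{AtomicU64, Ordering};

static ERROR_COUNT: AtomicU64 = AtomicU64::new(0);
const SAMPLE_EVERY: u64 = 100; // during an error storm, log 1 in 100

fn log_error(msg: &str) {
    let n = ERROR_COUNT.fetch_add(1, Ordering::Relaxed) + 1;
    // Log the first few errors in full, then switch to sampling so the
    // logging pipeline doesn't eat the CPU you need for recovery.
    if n <= 10 || n % SAMPLE_EVERY == 0 {
        eprintln!("[error #{n}] {msg}");
    }
}

fn main() {
    for _ in 0..500 {
        log_error("failed to load feature config");
    }
    println!("total errors: {}", ERROR_COUNT.load(Ordering::Relaxed));
}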

🔗 Lesson 7: Understand Your Dependencies

Many developers debugging their apps during the outage wasted time because they didn't immediately recognize it as a Cloudflare issue. Understanding your dependency chain is crucial:

Know your stack: Your Code → Your Infra (AWS/GCP) → CDN (Cloudflare) → DNS → User's ISP

When something breaks, know which layer to investigate first.

🌐 Lesson 8: The Internet Is More Fragile Than It Looks

The internet appears decentralized, but in reality, a few key providers handle massive portions of traffic:

  • CDNs: Cloudflare, Akamai, Fastly, AWS CloudFront
  • Cloud Providers: AWS, Azure, GCP
  • DNS: Cloudflare, AWS Route 53, Google

A failure in any of these can cascade globally. As a system designer, always consider: what if this dependency fails?

10

What Cloudflare Is Doing to Prevent This

To their credit, Cloudflare has been transparent about this incident and committed to specific improvements. Here's their action plan:

Immediate Actions

  1. Hardening Configuration File Validation: Treating internally-generated files with the same validation rigor as user input
  2. More Global Kill Switches: Ability to instantly disable any feature that might be causing issues
  3. Resource Limits on Error Handling: Preventing debugging systems from consuming excessive resources during incidents
  4. Review of All Error Paths: Auditing every module in the core proxy for similar issues

Longer-Term Improvements

  • Graceful Degradation by Default: Modules should fail open (continue working with reduced functionality) rather than fail closed (crash)
  • Better Testing of Permission Changes: Simulating downstream effects of database changes before production rollout
  • Enhanced Canary Deployments: Testing changes on a small subset of traffic before global rollout
  • Improved Incident Detection: Faster correlation between symptoms and root causes
The Meta-Lesson

Cloudflare's worst outage since 2019 was caused by a one-line bug in a SQL query that had been working correctly for years. The conditions for failure were created by an unrelated change (the permission improvement). Complex systems fail in complex ways. You can't prevent all failures, but you can build systems that fail gracefully, detect issues quickly, and recover fast.

What You Can Do In Your Systems

Regardless of your scale, apply these principles:

  1. Validate everything: Don't trust any input, even from your own systems
  2. Fail gracefully: When something goes wrong, degrade rather than crash
  3. Build kill switches: Have the ability to disable any feature instantly
  4. Monitor dependencies: Know when external services you depend on are having issues
  5. Test failure modes: Don't just test the happy path; test what happens when things break
  6. Have a runbook: Document how to diagnose and recover from common failures
  7. Practice incident response: Run game days where you simulate failures
  8. Consider multi-provider: For critical dependencies, have a backup

Final Thoughts

So let's recap what happened on November 18, 2025: A routine database permission change exposed a missing database filter in a SQL query. That query generated a file that was too large. That file crashed a proxy. That proxy was handling 20% of the internet's traffic. Wild, right? 🤯

The chain reaction took about 15 minutes to start and nearly 6 hours to fully resolve. Billions of users were affected. Stock prices dropped. Engineers around the world wasted hours debugging their own systems thinking they were at fault. 😤

And yet, in some ways, the system worked! Automated monitoring detected the issue within 11 minutes. Engineers responded quickly. The post-mortem was thorough and public. Improvements are being made.

As you build and operate systems, remember: complexity is the enemy of reliability. Every dependency is a potential failure point. Every assumption is a potential bug. The goal isn't to never fail — it's to fail well, recover fast, and learn continuously.

Welcome to the world of systems at scale. It's messy, it's humbling, and it's endlessly fascinating. 🔥

🎓 Keep Learning

This incident is a case study in distributed systems, database management, incident response, and engineering culture. Study other post-mortems (Google, AWS, GitHub all publish them). Build things. Break things. Learn from every failure. That's how we all get better! 💪

11

References & Sources

This deep dive was compiled from the following sources:

🔴 Official Cloudflare Communications

  • Cloudflare Blog — official post-mortem of the November 18, 2025 outage
    blog.cloudflare.com
  • Cloudflare Status — incident updates during the outage
    cloudflarestatus.com

📰 News Coverage

  • Ars Technica — "Cloudflare outage takes down Discord, ChatGPT, Notion, and many more"
    arstechnica.com
  • TechCrunch — "Major Cloudflare outage impacts Discord, Notion, and ChatGPT"
    techcrunch.com
  • BleepingComputer — "Cloudflare outage causes major Internet disruption"
    bleepingcomputer.com
  • Forbes — "Cloudflare Outage Takes Out Major Websites"
    forbes.com/sites/technology
  • The Verge — "Cloudflare outage takes down Spotify, Discord, and more"
    theverge.com

📊 Status & Monitoring

  • DownDetector — Real-time outage reports
    downdetector.com
  • Discord Status — Service status page
    discordstatus.com
  • OpenAI Status — ChatGPT service status
    status.openai.com

📈 Market Impact

  • Yahoo Finance — Cloudflare (NET) stock movement
    finance.yahoo.com/quote/NET
  • MarketWatch — Pre-market trading data November 18, 2025
    marketwatch.com
