01

What is Cloudflare and Why Does It Matter?

Before we dive into what went wrong, let's understand what Cloudflare actually does. If you're building web applications, this is essential knowledge. Trust me, after reading this, you'll never look at the internet the same way! 🤯

🌐 The Problem Cloudflare Solves

Imagine you build an amazing e-commerce website. It's hosted on a server in Mumbai. Now, when someone from New York tries to access your site, their request has to travel thousands of kilometers across the ocean. That's slow. And if suddenly 10,000 people try to access your site at once? Your single server might crash. Not a great experience, right? 😣

🔧 Technical Analogy

Think of Cloudflare as a security guard + receptionist sitting between your users and your server. Every request goes through them first. They check if the visitor is legitimate, block the bad ones, and for common requests (like your homepage), they keep a copy ready so they can respond instantly without bothering your server. Now imagine having this guard in 330+ cities worldwide — so users always talk to someone nearby instead of waiting for a response from far away.

What Cloudflare Actually Does

By the numbers: 20%+ of the internet, 330+ cities worldwide, millions of websites, ~75 trillion requests per day.

Cloudflare provides several critical services:

  • CDN (Content Delivery Network) — Caches and serves your content from servers close to users
  • DDoS Protection — Blocks attacks where hackers flood your site with fake traffic
  • DNS Services — Translates domain names (like google.com) to IP addresses
  • Bot Management — Identifies and blocks malicious bots (this is the culprit in our story! 🎯)
  • SSL/TLS — Handles encryption for secure connections
  • Web Application Firewall — Protects against common attacks like SQL injection
⚠️ Why This Matters

Because Cloudflare sits between users and websites for such a large portion of the internet, when Cloudflare fails, it doesn't matter if your actual servers are running perfectly — users simply can't reach them. Your application is fine, but nobody can access it.

Services Affected by This Outage

This wasn't a minor hiccup. These major services went down or became partially unavailable — and it was chaos! 😱

  • X (Twitter) — ~700 million users 🐦
  • ChatGPT — couldn't log in 🤖
  • Spotify, Discord, Canva, Figma 🎵💬🎨
  • Claude AI, 1Password, Trello, Medium, Postman
  • League of Legends, Valorant (couldn't connect to servers) 🎮
  • Even DownDetector (the site people use to check if sites are down!) was down 😂
02

How Cloudflare's Architecture Works

To understand what broke, you need to understand how a request flows through Cloudflare's system. Don't worry, I'll make this simple! 🤓

🔄 The Request Journey

When you type twitter.com in your browser, here's what happens behind the scenes:

Request flow through Cloudflare: Your Browser → DNS Resolution → TLS/HTTP Layer → FL Proxy 💥 → Pingora → Origin Server

Let me explain each layer (this is where it gets interesting! 🎯):

🌍 1. DNS Resolution
Your browser asks "What's the IP address of twitter.com?" Cloudflare's DNS servers respond with the IP of the nearest Cloudflare edge server (not Twitter's actual server).
🔒 2. TLS/HTTP Layer
Your encrypted HTTPS connection terminates here. Cloudflare decrypts it, inspects the request, and re-encrypts if needed.
⚡ 3. FL Proxy (Core Proxy) — THE FAILURE POINT
This is the brain of the operation. Called "FL" for "Frontline", this is where all the magic happens:
  • WAF rules are applied (blocking SQL injection, XSS, etc.)
  • DDoS protection kicks in
  • Bot Management runs here — generating bot scores for every request
  • Customer-specific configurations are applied
  • Traffic is routed to the appropriate service

This is where things broke. The FL Proxy crashed when loading a corrupted configuration file.

📦 4. Pingora
Handles caching and fetches content from the origin server if not cached. Written in Rust for performance.
🏠 5. Origin Server
The actual server running Twitter/X's code. During this outage, origin servers were perfectly healthy — users just couldn't reach them.
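
Quick aside on layer 1 before we move on: DNS is something you can poke at yourself from any machine. Here's a tiny illustrative Rust snippet (not Cloudflare code) that resolves a hostname the way your browser's resolver does. For a Cloudflare-proxied domain, the addresses that come back belong to a nearby Cloudflare edge, not the origin server.

🦀 Rust — resolving a proxied hostname (illustrative sketch)
use std::net::ToSocketAddrs;

fn main() -> std::io::Result<()> {
    // Ask the system resolver for the hostname on port 443.
    // For a Cloudflare-proxied site, these IPs point at Cloudflare's
    // edge network, not the customer's origin server.
    for addr in "twitter.com:443".to_socket_addrs()? {
        println!("resolved edge address: {addr}");
    }
    Ok(())
}
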
Key Insight: Single Point of Failure

The FL Proxy processes every single request that goes through Cloudflare. There's no way to bypass it. When it fails, everything fails. This is why understanding the architecture matters — one broken component in the critical path can take down everything downstream.

FL vs FL2: The Two Proxy Versions

Here's where it gets tricky! 🤔 Cloudflare was in the process of migrating customers from their old proxy (FL) to a new, improved one (FL2). During this outage, both versions were affected — but differently:

Aspect | FL2 (New Proxy) | FL (Old Proxy)
Written in | Rust | Older codebase
What happened | Completely crashed with HTTP 5xx errors | Continued running but returned incorrect bot scores (always 0)
User experience | Error pages, couldn't access sites | Could access sites, but bot rules misfired (false positives)

The FL2 proxy was stricter about input validation (a good thing normally!) and crashed when it received invalid data. The older FL proxy was more lenient but produced incorrect results. Ironic, right? 🤷‍♂️
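
To make that contrast concrete, here's a minimal hypothetical sketch (not Cloudflare's actual code) of the two behaviors: a strict loader that rejects bad input with an error, and a lenient one that quietly falls back to a meaningless score of 0.

🦀 Rust — strict vs lenient handling of a bad config (hypothetical sketch)
const MAX_FEATURES: usize = 200;

// FL2-style: strict. Invalid input is rejected with an error. Good in
// principle, catastrophic if the caller just calls .unwrap() on it.
fn bot_score_strict(feature_count: usize) -> Result<u8, String> {
    if feature_count > MAX_FEATURES {
        return Err(format!("too many features: {feature_count}"));
    }
    Ok(42) // pretend the ML model produced a real score
}

// FL-style: lenient. Never errors, but quietly returns a meaningless
// score of 0, so customers' bot rules misfire.
fn bot_score_lenient(feature_count: usize) -> u8 {
    if feature_count > MAX_FEATURES {
        return 0; // wrong answer, but the proxy stays up
    }
    42
}

fn main() {
    let bad_feature_count = 240; // the corrupted file
    println!("FL2 (strict):  {:?}", bot_score_strict(bad_feature_count)); // Err(...)
    println!("FL  (lenient): {:?}", bot_score_lenient(bad_feature_count)); // 0
}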

03

The Bot Management System

The root cause of this outage was in Cloudflare's Bot Management system. Now this is where it gets really interesting! 🎯 Let's understand what it does and why a simple "feature file" brought down the internet.

🤖 What is Bot Management?

Ever wondered how websites know if you're a real person or an automated bot? 🤔

Not all traffic to websites is from real humans. A significant portion comes from bots — automated programs that access websites. Some bots are good (like Google's crawler that indexes your site for search), and some are bad (like scrapers stealing your content or attackers trying to brute-force passwords).

Bot Management uses machine learning to analyze every request and assign a "bot score" — a number that indicates how likely the request is from a bot vs. a human.

🔧 Technical Analogy

It's like a spam filter for websites. Just like Gmail looks at email patterns to decide "spam or not spam", Bot Management looks at request patterns to decide "bot or human". It checks things like: How fast are requests coming? Does this browser fingerprint look real? Is this IP known for suspicious activity? All these signals get combined into a single score.

How Bot Scores Work

Bot score generation: HTTP Request (input) + Feature File 💥 (config) → ML Model (processing) → Bot Score of 1-99 (output)
  • Score 1-29: Likely a bot
  • Score 30-70: Uncertain, might be either
  • Score 71-99: Likely a human

Website owners can then create rules like: "If bot score < 30, show a CAPTCHA" or "If bot score < 10, block the request."
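
As a rough illustration of how such rules might consume the score (the thresholds and actions below are made up for the example, not Cloudflare's actual API):

🦀 Rust — acting on a bot score (illustrative thresholds)
enum Action {
    Block,
    Challenge,
    Allow,
}

// Hypothetical rule set mirroring the thresholds above: very low scores
// get blocked, low scores get a CAPTCHA, everything else passes.
fn decide(bot_score: u8) -> Action {
    match bot_score {
        0..=9 => Action::Block,       // almost certainly a bot
        10..=29 => Action::Challenge, // likely a bot: show a CAPTCHA
        _ => Action::Allow,           // uncertain or likely human
    }
}

fn main() {
    for score in [3u8, 25, 80] {
        let action = match decide(score) {
            Action::Block => "block",
            Action::Challenge => "challenge (CAPTCHA)",
            Action::Allow => "allow",
        };
        println!("score {score} -> {action}");
    }
}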

📄 What is a Feature File?

Now we get to the critical piece — pay attention here! 👀 The machine learning model needs to know which characteristics (features) to look at when analyzing a request. These are defined in a feature configuration file.

🔧 What's in a Feature File?

A feature file contains a list of "features" — individual traits the ML model uses. For example:

  • user_agent_entropy — How random/unique is the User-Agent string?
  • request_rate — How many requests per second from this IP?
  • header_order — In what order are HTTP headers sent?
  • tls_fingerprint — What does the TLS handshake look like?
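
Cloudflare hasn't published the file's exact format, so treat this as a conceptual sketch only: you can picture each entry as a named, typed definition that the model reads at startup.

🦀 Rust — a conceptual feature-file entry (format is an assumption)
#[derive(Debug)]
struct FeatureDef {
    name: &'static str, // e.g. "user_agent_entropy"
    kind: &'static str, // the value type the ML model expects
}

fn main() {
    // A handful of the ~60 features, mirroring the examples above.
    let features = [
        FeatureDef { name: "user_agent_entropy", kind: "Float64" },
        FeatureDef { name: "request_rate", kind: "Float64" },
        FeatureDef { name: "header_order", kind: "String" },
        FeatureDef { name: "tls_fingerprint", kind: "String" },
    ];
    println!("{} features defined, e.g. {:?}", features.len(), features[0]);
}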

Why Feature Files Need Frequent Updates

Bad actors constantly evolve. It's like a cat and mouse game! 🐱🐭 When attackers figure out that Cloudflare is looking at Feature X, they'll modify their bots to appear normal on Feature X. So Cloudflare needs to constantly update the features — adding new ones, tweaking existing ones, removing obsolete ones.

This is why the feature file is regenerated every few minutes and pushed to every server globally. Sounds harmless, right? 🤷‍♂️

⚠️ The Design Decision That Caused Problems

For performance reasons, Cloudflare's proxy pre-allocates memory for the feature file. They set a limit of 200 features — well above their actual use of ~60 features. When the corrupted file contained over 200 features due to duplicates, it exceeded this limit and crashed the system.

The Code That Crashed

Here's the actual Rust code that caused the crash (simplified):

🦀 Rust — FL2 Proxy Bot Management Module
// This code checks if the number of features is within the limit
// MAX_FEATURES is set to 200

fn load_feature_config(features: Vec<Feature>) -> Result<Config, Error> {
    if features.len() > MAX_FEATURES {
        return Err(Error::TooManyFeatures);
    }
    
    // Pre-allocate memory for exactly this many features
    let config = Config::with_capacity(features.len());
    
    // ... rest of loading logic

    Ok(config)
}

// Somewhere in the calling code:
let config = load_feature_config(features).unwrap();  // 💥 CRASH HERE!

The problem was that .unwrap(). In Rust, .unwrap() says "I expect this to succeed, and if it doesn't, crash the program." When the feature count exceeded 200, the function returned an error, .unwrap() was called on that error, and the entire proxy crashed.

Key Lesson: Never Trust Your Own Data

Cloudflare's engineers assumed their internally-generated feature file would always be valid. They didn't apply the same defensive programming they would for user-provided input. This is a common mistake — we often trust "internal" data more than we should.

04

Understanding ClickHouse Distributed Databases

The bug originated in Cloudflare's ClickHouse database. If you're getting into large-scale systems, understanding distributed databases is essential. Let's break it down.

📊 What is ClickHouse?

ClickHouse is an open-source column-oriented database designed for analytics at massive scale. It's used by companies like Uber, Cloudflare, eBay, and Yandex to analyze billions of rows of data in real-time.

📚 Row vs Column Storage

Row-oriented databases (MySQL, PostgreSQL): Store data like a book — one complete row after another. Great for looking up a specific user's full profile.

Column-oriented databases (ClickHouse): Store data like a spreadsheet where each column is a separate file. Great for analytics queries like "What's the average of column X across 1 billion rows?" because you only read that one column.
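
The classic way to picture the difference in code is "array of structs" (row-oriented) versus "struct of arrays" (column-oriented). Here's a toy sketch: averaging one field only touches one array in the columnar layout.

🦀 Rust — row-oriented vs column-oriented layouts (toy example)
// Row-oriented: each record is stored together (like MySQL/Postgres).
struct RequestRow {
    ip: String,
    status: u16,
    duration_ms: f64,
}

// Column-oriented: each field is its own contiguous array (like
// ClickHouse). An analytics query over one column reads only that array.
struct RequestColumns {
    ips: Vec<String>,
    statuses: Vec<u16>,
    durations_ms: Vec<f64>,
}

fn main() {
    let rows = vec![
        RequestRow { ip: "1.2.3.4".into(), status: 200, duration_ms: 12.0 },
        RequestRow { ip: "5.6.7.8".into(), status: 503, duration_ms: 80.0 },
    ];

    let cols = RequestColumns {
        ips: rows.iter().map(|r| r.ip.clone()).collect(),
        statuses: rows.iter().map(|r| r.status).collect(),
        durations_ms: rows.iter().map(|r| r.duration_ms).collect(),
    };

    // "SELECT avg(duration_ms)" only needs the durations column.
    let avg = cols.durations_ms.iter().sum::<f64>() / cols.durations_ms.len() as f64;
    let errors = cols.statuses.iter().filter(|&&s| s >= 500).count();
    println!("avg duration: {avg} ms, {errors} server errors (ips never touched)");
}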

How Distributed Databases Work

When you have billions of rows of data, one server isn't enough. You shard the data across many servers:

ClickHouse distributed architecture: Your Query (SELECT * FROM logs) → Coordinator (Distributed Table) → Shard 1 (rows 1-1M) + Shard 2 (rows 1M-2M) + Shard 3 (rows 2M-3M)

Here's the key concept that caused the bug:

The Two-Database Structure

Cloudflare's ClickHouse has two logical databases:

Database | Purpose | What it contains
default | Query entry point | Distributed tables — virtual tables that fan out queries to all shards
r0 | Actual storage | Underlying tables — where the actual data lives on each shard

When you query default.http_requests_features, the Distributed engine automatically queries r0.http_requests_features on every shard and combines the results.
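
Here's a very rough sketch of that fan-out-and-merge idea, with threads standing in for shards (this is the concept only, not ClickHouse's actual implementation):

🦀 Rust — fan out a query to shards and merge the results (rough sketch)
use std::thread;

// Pretend each shard holds part of the data and answers a count query.
fn query_shard(shard_id: usize, rows_on_shard: u64) -> u64 {
    println!("shard {shard_id}: counted {rows_on_shard} rows");
    rows_on_shard
}

fn main() {
    let shards = [(1, 1_000_000u64), (2, 1_000_000), (3, 1_000_000)];

    // The "distributed table" sends the same query to every shard in parallel...
    let handles: Vec<_> = shards
        .iter()
        .map(|&(id, rows)| thread::spawn(move || query_shard(id, rows)))
        .collect();

    // ...then combines the partial results into a single answer.
    let total: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
    println!("SELECT count() FROM default.logs -> {total}");
}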

🔐 The Permission Change That Started It All

Here's where things went wrong. Cloudflare was improving their database security by making permissions more explicit.

Before the Change (11:04 UTC)

When users queried metadata (like "what columns exist in this table?"), they could only see the default database:

SQL — Metadata Query (Before)
SELECT name, type FROM system.columns
WHERE table = 'http_requests_features';

-- Result:
-- name          | type
-- --------------+--------
-- user_agent    | String
-- ip_address    | String
-- request_rate  | Float64
-- ... (~60 features)

After the Change (11:05 UTC)

The permission change made the r0 database visible too. Now the same query returned duplicates:

SQL — Metadata Query (After)
SELECT name, type FROM system.columns 
WHERE table = 'http_requests_features';

-- Result (PROBLEM! The database column is shown here for clarity; the query didn't select it):
-- name          | type     | database
-- --------------+----------+----------
-- user_agent    | String   | default    ← Original
-- ip_address    | String   | default
-- request_rate  | Float64  | default
-- user_agent    | String   | r0         ← DUPLICATE!
-- ip_address    | String   | r0         ← DUPLICATE!
-- request_rate  | Float64  | r0         ← DUPLICATE!
-- ... (now ~120+ rows!)
💥 The Bug

The query that generates the feature file didn't filter by database name. It assumed all results would be from the default database. When the r0 tables became visible, duplicate rows flooded the results, swelling the feature count from the usual ~60 past the 200-feature limit!

The Problematic Query

SQL — The Query That Broke Everything
-- This query was used to generate the feature file
SELECT name, type FROM system.columns
WHERE table = 'http_requests_features'
ORDER BY name;

-- ❌ MISSING: AND database = 'default'

One missing database filter in the WHERE clause. That's all it took. Just one line of code. 🤯

Why The Outage Was Intermittent (At First)

Here's where it gets even more confusing! 😵 The outage didn't hit all at once. It fluctuated. Why?

Cloudflare was gradually rolling out the permission change to their ClickHouse cluster. The feature file is regenerated every 5 minutes, and each regeneration randomly picks a node in the cluster to run the query on.

  • Query hits updated node: Bad feature file generated → Outage
  • Query hits non-updated node: Good feature file generated → Recovery

This made debugging incredibly confusing because the system would recover on its own, then fail again minutes later. Imagine trying to fix something that keeps "fixing itself" and breaking again! 😤
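
You can reproduce the flapping pattern with a tiny simulation: some runs of the 5-minute job land on a node that already has the new permissions (bad file), others land on one that doesn't (good file). The counts and node choices below are purely illustrative.

🦀 Rust — simulating the good-file/bad-file flapping (toy model)
const MAX_FEATURES: usize = 200;

// Toy model: if the node that runs the metadata query already has the
// new permissions, duplicate rows inflate the feature count past the
// limit (the exact counts here are illustrative, not Cloudflare's).
fn generate_feature_file(node_has_new_permissions: bool) -> usize {
    if node_has_new_permissions { 240 } else { 60 }
}

fn main() {
    // Which node the 5-minute job happens to land on, run after run,
    // while the permission change is still rolling out.
    let picked_node_updated = [true, false, true, false, true, true];

    for (run, &updated) in picked_node_updated.iter().enumerate() {
        let minutes = 11 * 60 + 20 + run * 5; // 11:20, 11:25, ...
        let count = generate_feature_file(updated);
        let status = if count > MAX_FEATURES { "💥 DOWN" } else { "✅ UP" };
        println!("{:02}:{:02}  {count} features -> {status}", minutes / 60, minutes % 60);
    }
}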

Why The Outage Kept Coming Back
  • 11:20: Query → Updated Node → Bad File → 💥 DOWN
  • 11:25: Query → Old Node → Good File → ✅ UP
  • 11:30: Query → Updated Node → Bad File → 💥 DOWN
  • 11:35: Query → Old Node → Good File → ✅ UP
  • ... Eventually all nodes updated → Permanent outage until the fix
05

What Actually Happened (Step by Step)

Now that you understand all the components, let's trace exactly what happened, minute by minute.

11:05 UTC
The Innocent Change
A database engineer deploys a permission change to the ClickHouse cluster. The goal? Improve security by making permissions more explicit. Users would now see metadata for tables they have access to in both default and r0 databases. This seemed like a good, routine change.
11:05 - 11:20 UTC
The Time Bomb Ticks
The permission change starts rolling out gradually across the ClickHouse cluster. The feature file regeneration job runs every 5 minutes. Depending on which node handles the query, it might or might not see the new permissions yet.
11:20 UTC
💥 Impact Begins
The feature file regeneration job runs on an updated node. The query returns duplicate rows, and the feature file balloons past the 200-feature limit instead of holding the usual ~60 features. This file is propagated to all servers globally within seconds.
11:20 UTC (continued)
The Cascade Failure
Servers running FL2 (the new proxy) try to load the feature file. The code checks: "Do we have more than 200 features?" Yes. It throws an error. .unwrap() is called. PANIC. The proxy crashes. HTTP 5xx errors flood the network.
11:25 UTC (approximately)
Brief Recovery (The Confusion)
The next feature file regeneration runs on a node that hasn't been updated yet. A good feature file is generated and propagated. Servers recover. For 5 minutes, everything looks fine.
11:30 UTC
Down Again
Next regeneration hits an updated node. Bad file. Crash. This cycle continues — recovery, crash, recovery, crash — making the problem incredibly confusing to diagnose.
11:31 UTC
Automated Alert Fires
Cloudflare's automated testing detects the issue. An alert goes out.
11:32 UTC
Manual Investigation Starts
Engineers start looking at the problem. Initial symptoms point to Workers KV (a key-value store) having issues.
11:35 UTC
Incident Call Created
A formal incident is declared. Engineers from multiple teams join. The hunt begins.
11:35 - 13:00 UTC
Wild Goose Chase
The team goes down several wrong paths:
  • Initial suspicion: A DDoS attack (Cloudflare had recently defended against massive attacks)
  • The status page going down (unrelated coincidence!) reinforced the attack theory
  • The intermittent nature made it seem like attackers were probing
  • Focus shifts to Workers KV, then Access, then other services
13:05 UTC
First Mitigation
Engineers implement a bypass for Workers KV and Access, routing them through the old FL proxy instead of FL2. This reduces impact, though core issues remain.
13:37 UTC
Root Cause Identified
Finally! An engineer traces the crashes back to the Bot Management feature file. They examine it and see the duplicate features. The ClickHouse permission change is identified as the trigger.
14:24 UTC
Stop the Bleeding
Two actions taken simultaneously:
  1. Stop automatic regeneration of new feature files
  2. Retrieve the last known good feature file from before 11:20
14:30 UTC
Recovery Begins
The good feature file is pushed globally. Servers start recovering. Core traffic begins flowing again.
14:30 - 17:06 UTC
Cleanup and Recovery
Various services that entered bad states need manual intervention. A backlog of login attempts overwhelms the dashboard. Teams work to scale services and restart affected components.
17:06 UTC
✅ Full Recovery
All services restored. 5xx error rates return to baseline. The internet breathes again.
  • Total impact duration: ~6h
  • Peak outage: ~3h
  • Time to find the root cause: ~2h
  • Time to deploy the fix: ~50min
06

How They Detected the Problem

This section is crucial for anyone building systems at scale. Detection and observability are your lifeline when things go wrong! 🚨

🚨 The Monitoring That Worked

Cloudflare has extensive monitoring. Here's what fired:

  • Automated Health Checks: Synthetic tests that continuously make requests to Cloudflare services detected issues at 11:31 UTC — just 11 minutes after impact started. Pretty fast! ⚡
  • 5xx Error Rate Monitors: Dashboards immediately showed the spike in error responses.
  • Latency Metrics: Response times spiked because the proxy was spending CPU cycles on error handling and debugging.
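
A bare-bones version of that kind of monitor is easy to sketch: fire a synthetic probe on a schedule, track the recent 5xx rate, and alert when it crosses a threshold. The probe below is faked so the example stays self-contained; a real one would issue actual HTTP requests.

🦀 Rust — a toy synthetic check with an error-rate alert (sketch)
use std::collections::VecDeque;

// Stand-in for a real HTTP probe; returns a status code.
fn synthetic_probe(tick: u32) -> u16 {
    if tick >= 5 { 500 } else { 200 } // pretend the outage starts at tick 5
}

fn main() {
    let window = 10;           // look at the last 10 probes
    let alert_threshold = 0.3; // alert if more than 30% of them failed
    let mut recent: VecDeque<bool> = VecDeque::new();

    for tick in 0..15 {
        recent.push_back(synthetic_probe(tick) >= 500);
        if recent.len() > window {
            recent.pop_front();
        }
        let error_rate =
            recent.iter().filter(|&&failed| failed).count() as f64 / recent.len() as f64;
        if error_rate > alert_threshold {
            println!("tick {tick}: 5xx rate {:.0}% -> PAGE THE ON-CALL", error_rate * 100.0);
        } else {
            println!("tick {tick}: 5xx rate {:.0}% -> ok", error_rate * 100.0);
        }
    }
}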

🔍 The Investigation Challenges

Even with good monitoring, finding the root cause was hard. Here's why (this is where it gets interesting!):

1. Symptoms Didn't Point to the Cause

The visible symptoms were:

  • Workers KV returning errors
  • Access authentication failing
  • Dashboard login broken
  • General HTTP 5xx errors

None of these immediately screamed "Bot Management feature file!" The actual cause was several layers below the symptoms. Talk about a needle in a haystack! 🔍

2. The Intermittent Nature

The system would recover, then fail again. This pattern matched what you'd expect from:

  • A sophisticated DDoS attack (probing before the main assault)
  • A race condition in code
  • Network issues that come and go

It didn't match what you'd expect from a configuration problem (which usually causes persistent failures).

3. The Status Page Coincidence

Cloudflare's status page (hosted completely separately, not on Cloudflare) went down at the same time. This was a complete coincidence, but it made the team think an attacker was targeting both their infrastructure AND their communication channel. Can you imagine the panic? 😱

⚠️ Lesson: Coincidences Happen Under Pressure

When you're in incident response mode, your brain looks for patterns. Unrelated events can seem connected. Always verify assumptions. In this case, the status page issue was completely unrelated but wasted valuable investigation time.

4. Wrong Initial Hypothesis

The team initially suspected a DDoS attack because:

  • Cloudflare had recently defended against record-breaking attacks
  • The intermittent nature matched attack patterns
  • The status page going down reinforced this theory

How They Finally Found It

Around 13:37 UTC, an engineer looking at the FL2 proxy logs noticed the panic message:

🔴 Error Log
thread fl2_worker_thread panicked: called Result::unwrap() 
on an Err value: TooManyFeatures

This was the key. "TooManyFeatures" pointed directly to the Bot Management module. From there:

  1. They examined the current feature file
  2. They saw the duplicate entries
  3. They traced the feature file generation to the ClickHouse query
  4. They found the query didn't filter by database
  5. They checked recent ClickHouse changes — found the permission change at 11:05
Key Debugging Principle

The error message contained the answer: TooManyFeatures. Good error messages are invaluable. When writing code, invest in descriptive, specific error messages. Future you (or your on-call colleague at 3 AM) will thank you.
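
In Rust, that usually means a dedicated error type whose Display output says exactly what went wrong and with which values. A minimal sketch:

🦀 Rust — descriptive errors pay off at 3 AM (sketch)
use std::fmt;

#[derive(Debug)]
enum ConfigError {
    // Carry the numbers, not just the fact that something failed.
    TooManyFeatures { got: usize, limit: usize },
    DuplicateFeature { name: String },
}

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConfigError::TooManyFeatures { got, limit } => {
                write!(f, "feature file has {got} features, limit is {limit}")
            }
            ConfigError::DuplicateFeature { name } => {
                write!(f, "feature '{name}' appears more than once")
            }
        }
    }
}

fn main() {
    for err in [
        ConfigError::TooManyFeatures { got: 240, limit: 200 },
        ConfigError::DuplicateFeature { name: "request_rate".to_string() },
    ] {
        // The log line alone tells the on-call engineer where to look.
        eprintln!("failed to load bot-management config: {err}");
    }
}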

07

How They Fixed It

Once the root cause was identified, the fix was conceptually simple but operationally challenging. Here's the playbook they followed.

🛑 Step 1: Stop the Bleeding (14:24 UTC)

First priority: stop making things worse.

Immediate Actions
# 1. Stop automatic feature file generation
# This prevents new bad files from being created

$ kill-feature-file-job

# 2. Block propagation of feature files
# Even if a file is generated, don't push it

$ block-feature-file-distribution

This stabilized the situation — no new bad files would be created or distributed.

📦 Step 2: Restore Known Good State (14:24-14:30 UTC)

With the bleeding stopped, they needed to restore service:

  1. Find the last good file: Look at feature file history, find the last one generated before 11:20 UTC (before the permission change took effect)
  2. Validate the file: Check it has ~60 features, not 120+
  3. Manually inject into distribution: Push this file into the distribution queue
  4. Force restart: Restart the FL2 proxy across all servers to pick up the new file
💻 Shell — Conceptual Fix Process
# Find last good feature file
$ ls -la /var/cloudflare/feature-files/
feature-file-2025-11-18-11-15.json  # ← Before the change, should be good
feature-file-2025-11-18-11-20.json  # ← Bad, has duplicates
feature-file-2025-11-18-11-25.json  # ← Might be good (hit old node)
feature-file-2025-11-18-11-30.json  # ← Bad again

# Verify the good file
$ cat feature-file-2025-11-18-11-15.json | jq '.features | length'
62  # Good! Under 200 ✅

# Inject into distribution
$ inject-feature-file feature-file-2025-11-18-11-15.json --force --global

# Force proxy restart globally
$ restart-fl2-proxy --all-regions

🔧 Step 3: Handle the Cascade (14:30 - 17:06 UTC)

Restoring the core proxy wasn't enough. Other services had entered bad states:

The Dashboard Login Storm

While the system was down, millions of users kept trying to log in. When services recovered, all those retry attempts hit at once — a "thundering herd" problem. This is a classic distributed systems nightmare! 😣

🔧 Technical Analogy: Thundering Herd

Imagine a popular website's cache expires, and suddenly 10,000 users who were waiting all hit "refresh" at the exact same moment. Your database gets slammed with 10,000 identical requests instead of just one. That's the thundering herd problem! The fix? Add random delays to retries (so not everyone retries at once), or use a "circuit breaker" that temporarily rejects requests to prevent overload.
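
A common client-side defense is retrying with exponential backoff plus jitter, so the herd spreads itself out instead of stampeding in lockstep. Here's a dependency-free sketch (the jitter source is a crude hasher trick just to avoid pulling in a random-number crate):

🦀 Rust — retry with exponential backoff and jitter (sketch)
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};
use std::thread::sleep;
use std::time::Duration;

// Crude stand-in for a random number source, to keep the example
// dependency-free. A real client would use a proper RNG.
fn jitter_ms(attempt: u32, max_ms: u64) -> u64 {
    let mut h = RandomState::new().build_hasher();
    h.write_u32(attempt);
    h.finish() % max_ms
}

// Pretend login call that keeps failing while the backend recovers.
fn try_login(attempt: u32) -> Result<(), &'static str> {
    if attempt < 4 { Err("503 service unavailable") } else { Ok(()) }
}

fn main() {
    for attempt in 0..6 {
        match try_login(attempt) {
            Ok(()) => {
                println!("attempt {attempt}: logged in");
                break;
            }
            Err(e) => {
                // Exponential backoff (100ms, 200ms, 400ms, ...) plus random
                // jitter so thousands of clients don't retry in sync.
                let backoff = 100u64 * (1 << attempt);
                let delay = backoff + jitter_ms(attempt, backoff);
                println!("attempt {attempt}: {e}; retrying in {delay} ms");
                sleep(Duration::from_millis(delay));
            }
        }
    }
}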

Solution: Scale up dashboard and login services, add rate limiting, gradually let traffic through.

Turnstile (CAPTCHA) Recovery

Cloudflare Turnstile (their CAPTCHA alternative) was down, which meant new logins to the dashboard were impossible. Even after the proxy recovered, Turnstile needed separate attention.

CPU Exhaustion from Error Logging

Here's an interesting side effect: Cloudflare's debugging systems automatically enhance errors with extra context. With millions of errors happening, this consumed massive CPU resources, further slowing recovery. Ironic, right? The very thing meant to help debug was making things worse! 🤦‍♂️

💡 Lesson: Your Debugging Can Make Things Worse

Heavy error logging, stack trace collection, and debugging information are great for diagnosing issues. But during a major outage, they can consume resources you desperately need for recovery. Consider having "emergency mode" logging that's more minimal.

🔒 Step 4: Permanent Fix

After immediate recovery, they needed to fix the underlying bugs:

Fix 1: The SQL Query

SQL — Before (Buggy)
SELECT name, type FROM system.columns
WHERE table = 'http_requests_features'
ORDER BY name;
SQL — After (Fixed) ✅
SELECT name, type FROM system.columns
WHERE table = 'http_requests_features'
  AND database = 'default'  -- ← Added this filter! That's it! 🎉
ORDER BY name;

One line of code. That's the difference between the internet working and the internet breaking. 🤯

Fix 2: Better Error Handling in FL2

Rust — Before (Crashy) ❌
let config = load_feature_config(features).unwrap();  // Crash on error
Rust — After (Resilient) ✅
let config = match load_feature_config(features) {
    Ok(c) => c,
    Err(e) => {
        // Log the error
        error!("Failed to load feature config: {:?}", e);
        // Fall back to previous good config
        get_previous_config()
    }
};
08

The Full Impact

Let's look at the full scope of what was affected during this incident. Spoiler: it was massive! 💥

Services Directly Impacted

Service | Impact
Core CDN & Security | HTTP 5xx errors for customer sites
Turnstile (CAPTCHA) | Failed to load entirely
Workers KV | Elevated 5xx errors
Dashboard | Users couldn't log in
Cloudflare Access | Authentication failures
Email Security | Reduced spam detection, some Auto Move failures

Major Websites/Apps Affected

  • X (Twitter): ~700M users affected
  • ChatGPT: login issues
  • Spotify: service down
  • Discord: connectivity issues

Also affected: Canva, Figma, Claude AI, 1Password, Trello, Medium, Postman, League of Legends, Valorant, various crypto platforms, and ironically... DownDetector (the site people use to check if other sites are down).

Financial Impact

  • Cloudflare Stock (NET): Dropped 3.5% in pre-market trading
  • Customer Revenue Loss: Potentially millions across all affected sites (e-commerce transactions failed, ads didn't load, subscriptions couldn't be processed)
  • Crypto Markets: Multiple exchanges and DeFi platforms went offline, potentially affecting trades

Why Some Services Were Fine

Interestingly, not everything went down:

  • OpenAI API: Continued working (different infrastructure path than ChatGPT login)
  • Many mobile apps: Native mobile apps often bypass the CDN layer entirely
  • Sites with multi-CDN: Companies using multiple CDN providers could fail over to alternatives
Architectural Insight

This outage showed why multi-CDN architecture is increasingly important. Companies that had a backup CDN configured (like Fastly, Akamai, or AWS CloudFront) could switch traffic and minimize impact. Single points of failure are dangerous at internet scale.

09

Lessons for Building Systems at Scale

This incident is a goldmine of lessons for anyone building or operating large-scale systems. Let's extract the wisdom.

🔐 Lesson 1: Never Trust Internal Data

Cloudflare's code trusted its own configuration files completely. The assumption: "We generate this file ourselves, so it will always be valid." Makes sense, right? 🤔

Reality: Internal systems can produce invalid data due to bugs, race conditions, database issues, or (as in this case) unexpected side effects of other changes. Never assume!

✅ Best Practice

Validate ALL inputs, even those from internal systems. Apply the same defensive programming to internal data that you would to user input.
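
Applied to this incident, that might look like the sketch below (hypothetical, not Cloudflare's code): check the count, reject duplicates, and make the caller handle the failure instead of unwrapping it.

🦀 Rust — validating an internally generated config (hypothetical sketch)
use std::collections::HashSet;

const MAX_FEATURES: usize = 200;

fn validate_features(features: &[String]) -> Result<(), String> {
    if features.len() > MAX_FEATURES {
        return Err(format!("feature count {} exceeds limit of {MAX_FEATURES}", features.len()));
    }
    let mut seen = HashSet::new();
    for name in features {
        if !seen.insert(name) {
            return Err(format!("duplicate feature '{name}'")); // the 2025-11-18 failure mode
        }
    }
    Ok(())
}

fn main() {
    // Simulate the bad file: every feature appears twice.
    let mut features: Vec<String> = (0..60).map(|i| format!("feature_{i}")).collect();
    features.extend((0..60).map(|i| format!("feature_{i}")));

    match validate_features(&features) {
        Ok(()) => println!("config accepted"),
        Err(e) => println!("config rejected, keeping the previous good config: {e}"),
    }
}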

📉 Lesson 2: Graceful Degradation Over Hard Crashes

When the feature file was invalid, the FL2 proxy crashed with a panic. A better approach:

  • Log the error for investigation
  • Fall back to the previous known-good configuration
  • Alert operators while continuing to serve traffic
  • Rate-limit the fallback to prevent cascading issues

Serving traffic with a slightly stale configuration is almost always better than not serving traffic at all.

🔄 Lesson 3: Database Changes Are Infrastructure Changes

The permission change seemed like a small, safe improvement. But it had unexpected downstream effects. Database changes can affect:

  • Query results (as seen here)
  • Query performance (indexes, query plans)
  • Application behavior (assumptions about data format)
⚠️ Treat Database Changes Like Code Deploys

Use the same rigor: staged rollouts, feature flags, monitoring, and the ability to quickly roll back. Test in production-like environments with realistic queries.

🎭 Lesson 4: Intermittent Failures Are The Hardest

If the system had stayed down, they might have found the cause faster. The intermittent nature led the team to wrong conclusions (DDoS attack, race condition, network issues). This is the worst kind of bug to debug! 😵

Strategy for intermittent issues:

  • Focus on what's DIFFERENT between success and failure cases
  • Check for gradual rollouts of any kind (feature flags, database changes, code deploys)
  • Look at timing — does it correlate with scheduled jobs?
  • Check if different servers/regions behave differently

🚨 Lesson 5: Design Kill Switches

Cloudflare is now implementing more "global kill switches" — ways to instantly disable features that might be causing problems. When building systems:

  • Every feature should be independently disableable
  • Kill switches should work even when the main system is failing
  • Practice using them (chaos engineering)
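
The simplest form of a kill switch is just a flag the hot path consults and operators can flip out-of-band. Here's a toy sketch with a process-global atomic (a real system would flip it via a config push or an admin API):

🦀 Rust — a minimal feature kill switch (toy sketch)
use std::sync::atomic::{AtomicBool, Ordering};

// Global switch for one feature. Defaults to enabled.
static BOT_SCORING_ENABLED: AtomicBool = AtomicBool::new(true);

fn handle_request(path: &str) {
    if BOT_SCORING_ENABLED.load(Ordering::Relaxed) {
        println!("{path}: running bot scoring");
    } else {
        // Fail open: skip the feature instead of failing the request.
        println!("{path}: bot scoring disabled by kill switch, serving anyway");
    }
}

fn main() {
    handle_request("/index.html");

    // An operator flips the switch when the module starts misbehaving.
    BOT_SCORING_ENABLED.store(false, Ordering::Relaxed);
    handle_request("/checkout");
}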

📊 Lesson 6: Your Observability Can Hurt You

During the incident, Cloudflare's debugging systems (collecting stack traces, enhancing errors with context) consumed significant CPU. In a crisis, resources are precious.

Consider:

  • Emergency logging modes with reduced verbosity
  • Sampling during high-error-rate periods
  • Async logging that doesn't block the main process
  • Resource limits on debugging/tracing systems
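
One cheap version of an "emergency mode" is sampling: log the first few errors in full, then only every Nth occurrence, while still counting everything. A sketch:

🦀 Rust — sampling error logs under load (sketch)
use std::sync::atomic::{AtomicU64, Ordering};

static ERROR_COUNT: AtomicU64 = AtomicU64::new(0);
const SAMPLE_EVERY: u64 = 100; // during an error storm, log 1 in 100

fn log_error(msg: &str) {
    let n = ERROR_COUNT.fetch_add(1, Ordering::Relaxed) + 1;
    // Log the first few errors in full, then switch to sampling so the
    // logging pipeline doesn't eat the CPU you need for recovery.
    if n <= 10 || n % SAMPLE_EVERY == 0 {
        eprintln!("[error #{n}] {msg}");
    }
}

fn main() {
    for _ in 0..500 {
        log_error("failed to load feature config");
    }
    println!("total errors: {}", ERROR_COUNT.load(Ordering::Relaxed));
}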

🔗 Lesson 7: Understand Your Dependencies

Many developers debugging their apps during the outage wasted time because they didn't immediately recognize it as a Cloudflare issue. Understanding your dependency chain is crucial:

Know your stack: Your Code → Your Infra (AWS/GCP) → CDN (Cloudflare) → DNS → User's ISP

When something breaks, know which layer to investigate first.

🌐 Lesson 8: The Internet Is More Fragile Than It Looks

The internet appears decentralized, but in reality, a few key providers handle massive portions of traffic:

  • CDNs: Cloudflare, Akamai, Fastly, AWS CloudFront
  • Cloud Providers: AWS, Azure, GCP
  • DNS: Cloudflare, AWS Route 53, Google

A failure in any of these can cascade globally. As a system designer, always consider: what if this dependency fails?

10

What Cloudflare Is Doing to Prevent This

To their credit, Cloudflare has been transparent about this incident and committed to specific improvements. Here's their action plan:

Immediate Actions

  1. Hardening Configuration File Validation: Treating internally-generated files with the same validation rigor as user input
  2. More Global Kill Switches: Ability to instantly disable any feature that might be causing issues
  3. Resource Limits on Error Handling: Preventing debugging systems from consuming excessive resources during incidents
  4. Review of All Error Paths: Auditing every module in the core proxy for similar issues

Longer-Term Improvements

  • Graceful Degradation by Default: Modules should fail open (continue working with reduced functionality) rather than fail closed (crash)
  • Better Testing of Permission Changes: Simulating downstream effects of database changes before production rollout
  • Enhanced Canary Deployments: Testing changes on a small subset of traffic before global rollout
  • Improved Incident Detection: Faster correlation between symptoms and root causes
The Meta-Lesson

Cloudflare's worst outage since 2019 was caused by a one-line bug in a SQL query that had been working correctly for years. The conditions for failure were created by an unrelated change (the permission improvement). Complex systems fail in complex ways. You can't prevent all failures, but you can build systems that fail gracefully, detect issues quickly, and recover fast.

What You Can Do In Your Systems

Regardless of your scale, apply these principles:

  1. Validate everything: Don't trust any input, even from your own systems
  2. Fail gracefully: When something goes wrong, degrade rather than crash
  3. Build kill switches: Have the ability to disable any feature instantly
  4. Monitor dependencies: Know when external services you depend on are having issues
  5. Test failure modes: Don't just test the happy path; test what happens when things break
  6. Have a runbook: Document how to diagnose and recover from common failures
  7. Practice incident response: Run game days where you simulate failures
  8. Consider multi-provider: For critical dependencies, have a backup

Final Thoughts

So let's recap what happened on November 18, 2025: A routine database permission change exposed a missing database filter in a SQL query. That query generated a file that was too large. That file crashed a proxy. That proxy was handling 20% of the internet's traffic. Wild, right? 🤯

The chain reaction took about 15 minutes to start and nearly 6 hours to fully resolve. Billions of users were affected. Stock prices dropped. Engineers around the world wasted hours debugging their own systems thinking they were at fault. 😤

And yet, in some ways, the system worked! Automated monitoring detected the issue within 11 minutes. Engineers responded quickly. The post-mortem was thorough and public. Improvements are being made.

As you build and operate systems, remember: complexity is the enemy of reliability. Every dependency is a potential failure point. Every assumption is a potential bug. The goal isn't to never fail — it's to fail well, recover fast, and learn continuously.

Welcome to the world of systems at scale. It's messy, it's humbling, and it's endlessly fascinating. 🔥

🎓 Keep Learning

This incident is a case study in distributed systems, database management, incident response, and engineering culture. Study other post-mortems (Google, AWS, GitHub all publish them). Build things. Break things. Learn from every failure. That's how we all get better! 💪

11

References & Sources

This deep dive was compiled from the following sources:

🔴 Official Cloudflare Communications

  • Cloudflare Blog — official post-mortem of the November 18, 2025 outage
    blog.cloudflare.com
  • Cloudflare Status — incident updates during the outage
    cloudflarestatus.com

📰 News Coverage

  • Ars Technica — "Cloudflare outage takes down Discord, ChatGPT, Notion, and many more"
    arstechnica.com
  • TechCrunch — "Major Cloudflare outage impacts Discord, Notion, and ChatGPT"
    techcrunch.com
  • BleepingComputer — "Cloudflare outage causes major Internet disruption"
    bleepingcomputer.com
  • Forbes — "Cloudflare Outage Takes Out Major Websites"
    forbes.com/sites/technology
  • The Verge — "Cloudflare outage takes down Spotify, Discord, and more"
    theverge.com

📊 Status & Monitoring

  • DownDetector — Real-time outage reports
    downdetector.com
  • Discord Status — Service status page
    discordstatus.com
  • OpenAI Status — ChatGPT service status
    status.openai.com

📈 Market Impact

  • Yahoo Finance — Cloudflare (NET) stock movement
    finance.yahoo.com/quote/NET
  • MarketWatch — Pre-market trading data November 18, 2025
    marketwatch.com
