Google’s Massive Cloud Outage Traced to API Management Glitch

On June 12, 2025, Google Cloud and several Google services were down for up to seven hours. The root cause: a malfunction in Google’s Service Control system, which handles API authorization and quota policies across Google’s infrastructure.

Takeaways

  • A bug in Service Control triggered by a policy update with blank fields caused the system to crash globally.
  • The failure led to a cascading outage across multiple Google Cloud and Workspace products.
  • Google disabled the problematic feature, scaled back changes, and is rearchitecting Service Control to “fail open” in future incidents.

What Happened

  • Google had added a feature for more granular quota validation. However, the new code lacked proper error handling and wasn’t behind a feature flag.
  • A policy change with unintended blank metadata fields was inserted into regional databases and replicated globally.
  • When Service Control tried to process that policy, it hit a null pointer exception, crashing the binary in every region (a simplified sketch of the kind of defensive check that was missing follows this list).
  • The resulting crash loops caused widespread disruption to API serving.
  • In the most affected region (us-central1), restarting Service Control overloaded the underlying Spanner database through a “herd effect”: many tasks restarted at once without backoff.
  • Recovery took longer in that region; Google throttled restarts and rerouted traffic to multi-regional databases to reduce load.
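
Google’s Service Control code is not public, so the Go sketch below is purely illustrative: QuotaPolicy, featureGranularQuota, and checkQuota are hypothetical names standing in for the real internals. It shows the general pattern the incident report implies was missing, validating a replicated policy record and keeping the new code path behind a feature flag, so a record with blank fields is rejected with an error instead of crashing the binary on a null dereference.

```go
package main

import (
	"errors"
	"fmt"
)

// QuotaPolicy is a hypothetical stand-in for the replicated policy record;
// in the incident, some of these fields arrived blank or unset.
type QuotaPolicy struct {
	ProjectID string
	Limits    map[string]int64 // nil when the policy row had blank metadata
}

// featureGranularQuota is a hypothetical feature flag; the report notes the
// real code path was not flag-guarded when it rolled out.
var featureGranularQuota = false

var errInvalidPolicy = errors.New("quota policy rejected: missing required fields")

// checkQuota validates the policy before using it instead of dereferencing
// potentially-missing data, so a bad record fails one request, not the process.
func checkQuota(p *QuotaPolicy, metric string) (int64, error) {
	if !featureGranularQuota {
		return 0, nil // new path disabled: fall back to legacy behavior
	}
	if p == nil || p.ProjectID == "" || p.Limits == nil {
		return 0, errInvalidPolicy
	}
	limit, ok := p.Limits[metric]
	if !ok {
		return 0, fmt.Errorf("no limit configured for metric %q", metric)
	}
	return limit, nil
}

func main() {
	featureGranularQuota = true
	// A policy replicated with blank fields is rejected gracefully.
	if _, err := checkQuota(&QuotaPolicy{}, "requests_per_minute"); err != nil {
		fmt.Println("handled bad policy:", err)
	}
}
```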

Impact

  • Disruption spanned Google Cloud Platform, Workspace, and numerous dependent services (Compute Engine, BigQuery, Cloud Storage, and more).
  • Third-party platforms relying on Google infrastructure were also hit (Spotify, Discord, Snapchat, etc.).
  • The outage led to widespread 503 errors and degraded access across many regions.
  • Regions outside us-central1 were largely restored within a couple of hours; us-central1 took nearly 2h 40m to fully recover.

Mitigations

  • Google immediately froze changes to the Service Control stack and halted manual policy pushes.
  • They disabled the offending quota checks with a “red-button” kill switch.
  • They’re redesigning Service Control so that if an internal check fails, the system “fails open” rather than blocking all API traffic (a fail-open sketch follows this list).
  • Planned improvements include better error handling, stricter feature flags, modular architecture, and avoiding global replication of unvalidated metadata.
  • They also intend to audit systems that consume globally replicated data and to implement randomized backoff to avoid database overloads during recovery (a backoff sketch also follows this list).
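
Neither of these mechanisms is described in code in Google’s report; the Go sketch below is a hypothetical illustration of the two ideas as stated: a quota check that “fails open” when the policy subsystem itself errors, and restarts spaced with randomized exponential backoff so recovering tasks don’t stampede the backing database. All names and signatures here are invented for the example.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// failOpenQuotaCheck illustrates the "fail open" posture: if the quota
// subsystem itself fails, admit the request rather than rejecting all
// traffic, so a control-plane bug doesn't become a global outage.
func failOpenQuotaCheck(check func() (bool, error)) bool {
	allowed, err := check()
	if err != nil {
		fmt.Println("quota check failed internally, failing open:", err)
		return true
	}
	return allowed
}

// restartWithBackoff illustrates randomized exponential backoff: each retry
// waits roughly twice as long as the last, plus jitter, so thousands of
// tasks restarting at once don't hit the database in lockstep.
func restartWithBackoff(start func() error, maxAttempts int) error {
	delay := 100 * time.Millisecond
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err := start(); err == nil {
			return nil
		}
		jitter := time.Duration(rand.Int63n(int64(delay)))
		time.Sleep(delay + jitter)
		delay *= 2
	}
	return errors.New("service did not come up after retries")
}

func main() {
	// A broken quota lookup no longer blocks the request.
	ok := failOpenQuotaCheck(func() (bool, error) {
		return false, errors.New("policy lookup crashed")
	})
	fmt.Println("request allowed:", ok)

	// A flaky startup succeeds after a few jittered retries.
	attempts := 0
	_ = restartWithBackoff(func() error {
		attempts++
		if attempts < 3 {
			return errors.New("dependency still overloaded")
		}
		return nil
	}, 5)
	fmt.Println("recovered after", attempts, "attempts")
}
```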
