On June 12, 2025, Google Cloud and several Google services were down for up to seven hours. The root cause: a malfunction in Google’s Service Control system, which handles API authorization and quota policies across Google’s infrastructure.

Takeaways
- A bug in Service Control triggered by a policy update with blank fields caused the system to crash globally.
- The failure led to a cascading outage across multiple Google Cloud and Workspace products.
- Google disabled the problematic feature, froze further changes to the stack, and is rearchitecting Service Control to “fail open” in future incidents.
What Happened
- Google had added a feature for more granular quota validation; however, the new code lacked proper error handling and wasn’t behind a feature flag (see the first sketch after this list).
- A policy change with unintended blank metadata fields was inserted into regional databases and replicated globally.
- When Service Control tried to process that policy, it encountered a null pointer exception, causing the binary to crash across all regions.
- The resulting binary crash loops caused a widespread disruption of API serving.
- In the most affected region (us-central1), restarting Service Control overloaded the underlying Spanner database through a “herd effect”: many tasks restarted at once without backoff (see the second sketch after this list).
- Recovery took longer in that region; Google throttled restarts and rerouted traffic to multi-regional databases to reduce load.
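
The first two failure ingredients above (a new quota-validation code path with no error handling and no feature flag, fed a policy whose fields were blank) map onto a standard defensive-coding pattern. The sketch below is illustrative only and uses hypothetical names (`QuotaPolicy`, `featureEnabled`, `checkQuota`), not Google’s actual interfaces; it shows how a flag gate plus explicit validation turns a would-be null dereference into a handled error.

```go
package main

import (
	"errors"
	"fmt"
)

// QuotaPolicy is a hypothetical stand-in for a replicated policy record.
// In the incident, a policy with blank (unpopulated) fields reached the
// new quota-checking code path and triggered a null pointer exception.
type QuotaPolicy struct {
	Project string
	Limits  map[string]int64 // nil when the metadata fields were left blank
}

var errInvalidPolicy = errors.New("quota policy is missing required fields")

// featureEnabled is a stand-in for a feature flag; the offending code path
// reportedly shipped without one.
func featureEnabled(name string) bool {
	return false // new behavior stays off until explicitly enabled
}

// checkQuota applies the new per-policy validation only when the feature
// flag is on, and validates the record before dereferencing its fields.
func checkQuota(p *QuotaPolicy, metric string, amount int64) (bool, error) {
	if !featureEnabled("granular-quota-checks") {
		return true, nil // legacy behavior: no additional validation
	}
	if p == nil || p.Limits == nil {
		// A blank policy is rejected (and can be logged) instead of crashing the task.
		return false, errInvalidPolicy
	}
	limit, ok := p.Limits[metric]
	if !ok {
		return true, nil // no limit configured for this metric
	}
	return amount <= limit, nil
}

func main() {
	// A policy replicated with blank fields: Limits is nil.
	blank := &QuotaPolicy{Project: "example-project"}
	ok, err := checkQuota(blank, "api_calls", 100)
	fmt.Println(ok, err) // true <nil> while the flag is off; a handled error once enabled
}
```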
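
The herd effect is the other key mechanism: when every Service Control task in us-central1 restarted and hit Spanner at the same moment, the database had no room to recover. A common remedy, and the mitigation named later in this post, is randomized exponential backoff. The sketch below (a hypothetical `retryWithJitter` helper, not Google’s code) shows “full jitter” backoff, where each retry sleeps a random amount inside a growing window so a fleet of restarting tasks spreads its load instead of retrying in lockstep.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retryWithJitter retries op with randomized exponential backoff ("full jitter"):
// each attempt sleeps a random duration in [0, base*2^attempt), capped at maxDelay,
// so many tasks restarting at once do not hammer a recovering dependency
// (e.g. a database) simultaneously.
func retryWithJitter(op func() error, maxAttempts int, base, maxDelay time.Duration) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		backoff := base << attempt // base * 2^attempt
		if backoff > maxDelay {
			backoff = maxDelay
		}
		time.Sleep(time.Duration(rand.Int63n(int64(backoff))))
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	// A stand-in for "read startup state from the database": fails twice, then succeeds.
	loadPolicies := func() error {
		calls++
		if calls < 3 {
			return errors.New("database overloaded")
		}
		return nil
	}
	if err := retryWithJitter(loadPolicies, 5, 100*time.Millisecond, 2*time.Second); err != nil {
		fmt.Println("startup failed:", err)
		return
	}
	fmt.Printf("startup succeeded after %d attempts\n", calls)
}
```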
Impact
- Disruption spanned Google Cloud Platform, Workspace, and numerous dependent services (Compute Engine, BigQuery, Cloud Storage, and more).
- Third-party platforms relying on Google infrastructure were also hit (Spotify, Discord, Snapchat, etc.).
- The outage led to widespread 503 errors and degraded access across many regions.
- Regions outside us-central1 were largely restored within a couple of hours; us-central1 took nearly 2h 40m to fully recover.
Mitigations
- Google immediately froze changes to the Service Control stack and halted manual policy pushes.
- They disabled the offending quota checks with a “red-button” kill switch.
- They’re redesigning Service Control so that if an internal check fails, the system “fails open” rather than blocking all API traffic (see the sketch after this list).
- Planned improvements include better error handling, consistent use of feature flags for new code paths, a more modular architecture, and avoiding global replication of unvalidated metadata.
- They also intend to audit systems consuming globally replicated data and implement randomized backoff to avoid database overloads during recovery.
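
To make the fail-open idea concrete: if the policy-checking machinery itself breaks (corrupt metadata, an internal bug), the request is allowed through instead of every API call being rejected. The sketch below is a minimal illustration with hypothetical types (`Decision`, `policyCheck`, `evaluate`), not Service Control’s real design; the trade-off is that a broken check temporarily weakens quota and authorization enforcement in exchange for keeping the API surface available.

```go
package main

import (
	"errors"
	"fmt"
)

// Decision is the outcome of an authorization/quota check.
type Decision int

const (
	Allow Decision = iota
	Deny
)

// policyCheck is a stand-in for the internal quota/authorization evaluation.
// It can fail for reasons unrelated to the caller (corrupt metadata, bugs).
type policyCheck func(request string) (Decision, error)

// evaluate wraps a policy check with fail-open semantics: a definitive Deny is
// honored, but an internal error in the checker itself results in Allow, so a
// broken control plane degrades enforcement rather than blocking all traffic.
func evaluate(check policyCheck, request string) Decision {
	decision, err := check(request)
	if err != nil {
		// Fail open: record the internal failure and let the request through.
		fmt.Printf("policy check failed (%v); failing open for %q\n", err, request)
		return Allow
	}
	return decision
}

func main() {
	brokenCheck := func(request string) (Decision, error) {
		return Deny, errors.New("policy record has blank required fields")
	}
	fmt.Println(evaluate(brokenCheck, "GET /v1/objects") == Allow) // true: failed open
}
```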