Google’s Massive Cloud Outage Traced to API Management Glitch

On June 12, 2025, Google Cloud and several Google services were down for up to seven hours. The root cause: a malfunction in Google’s Service Control system, which handles API authorization and quota policies across Google’s infrastructure.

Takeaways

  • A bug in Service Control triggered by a policy update with blank fields caused the system to crash globally.
  • The failure led to a cascading outage across multiple Google Cloud and Workspace products.
  • Google disabled the problematic feature, scaled back changes, and is rearchitecting Service Control to “fail open” in future incidents.

What Happened

  • Google had added a feature for more granular quota validation. However, the new code lacked proper error handling and wasn’t behind a feature flag.
  • A policy change with unintended blank metadata fields was inserted into regional databases and replicated globally.
  • When Service Control tried to process that policy, it hit a null pointer exception, crashing the binary in every region (a simplified sketch of the kind of defensive check that was missing follows this list).
  • The resulting crash loops caused widespread disruption to API serving.
  • In the most affected region (us-central1), restarting Service Control overloaded the underlying Spanner database through a “herd effect”: many tasks restarted at once without backoff.
  • Recovery took longer in that region; Google throttled restarts and rerouted traffic to multi-regional databases to reduce load.
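
Google’s Service Control code is not public, so the Go sketch below is purely illustrative: QuotaPolicy, featureGranularQuota, and checkQuota are hypothetical names standing in for the real internals. It shows the general pattern the incident report implies was missing, validating a replicated policy record and keeping the new code path behind a feature flag, so a record with blank fields is rejected with an error instead of crashing the binary on a null dereference.

```go
package main

import (
	"errors"
	"fmt"
)

// QuotaPolicy is a hypothetical stand-in for the replicated policy record;
// in the incident, some of these fields arrived blank or unset.
type QuotaPolicy struct {
	ProjectID string
	Limits    map[string]int64 // nil when the policy row had blank metadata
}

// featureGranularQuota is a hypothetical feature flag; the report notes the
// real code path was not flag-guarded when it rolled out.
var featureGranularQuota = false

var errInvalidPolicy = errors.New("quota policy rejected: missing required fields")

// checkQuota validates the policy before using it instead of dereferencing
// potentially-missing data, so a bad record fails one request, not the process.
func checkQuota(p *QuotaPolicy, metric string) (int64, error) {
	if !featureGranularQuota {
		return 0, nil // new path disabled: fall back to legacy behavior
	}
	if p == nil || p.ProjectID == "" || p.Limits == nil {
		return 0, errInvalidPolicy
	}
	limit, ok := p.Limits[metric]
	if !ok {
		return 0, fmt.Errorf("no limit configured for metric %q", metric)
	}
	return limit, nil
}

func main() {
	featureGranularQuota = true
	// A policy replicated with blank fields is rejected gracefully.
	if _, err := checkQuota(&QuotaPolicy{}, "requests_per_minute"); err != nil {
		fmt.Println("handled bad policy:", err)
	}
}
```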

Impact

  • Disruption spanned Google Cloud Platform, Workspace, and numerous dependent services (Compute Engine, BigQuery, Cloud Storage, and more).
  • Third-party platforms relying on Google infrastructure were also hit (Spotify, Discord, Snapchat, etc.).
  • The outage led to widespread 503 errors and degraded access across many regions.
  • Regions outside us-central1 were largely restored within a couple of hours; us-central1 took nearly 2h 40m to fully recover.

Mitigations

  • Google immediately froze changes to the Service Control stack and halted manual policy pushes.
  • They disabled the offending quota checks with a “red-button” kill switch.
  • They’re redesigning Service Control so that if an internal check fails, the system “fails open” rather than blocking all API traffic (a fail-open sketch follows this list).
  • Planned improvements include better error handling, stricter feature flags, modular architecture, and avoiding global replication of unvalidated metadata.
  • They also intend to audit systems that consume globally replicated data and to implement randomized backoff to avoid database overloads during recovery (a backoff sketch also follows this list).
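
Neither of these mechanisms is described in code in Google’s report; the Go sketch below is a hypothetical illustration of the two ideas as stated: a quota check that “fails open” when the policy subsystem itself errors, and restarts spaced with randomized exponential backoff so recovering tasks don’t stampede the backing database. All names and signatures here are invented for the example.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// failOpenQuotaCheck illustrates the "fail open" posture: if the quota
// subsystem itself fails, admit the request rather than rejecting all
// traffic, so a control-plane bug doesn't become a global outage.
func failOpenQuotaCheck(check func() (bool, error)) bool {
	allowed, err := check()
	if err != nil {
		fmt.Println("quota check failed internally, failing open:", err)
		return true
	}
	return allowed
}

// restartWithBackoff illustrates randomized exponential backoff: each retry
// waits roughly twice as long as the last, plus jitter, so thousands of
// tasks restarting at once don't hit the database in lockstep.
func restartWithBackoff(start func() error, maxAttempts int) error {
	delay := 100 * time.Millisecond
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err := start(); err == nil {
			return nil
		}
		jitter := time.Duration(rand.Int63n(int64(delay)))
		time.Sleep(delay + jitter)
		delay *= 2
	}
	return errors.New("service did not come up after retries")
}

func main() {
	// A broken quota lookup no longer blocks the request.
	ok := failOpenQuotaCheck(func() (bool, error) {
		return false, errors.New("policy lookup crashed")
	})
	fmt.Println("request allowed:", ok)

	// A flaky startup succeeds after a few jittered retries.
	attempts := 0
	_ = restartWithBackoff(func() error {
		attempts++
		if attempts < 3 {
			return errors.New("dependency still overloaded")
		}
		return nil
	}, 5)
	fmt.Println("recovered after", attempts, "attempts")
}
```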
