Google has revealed that a simple ‘zero’ value was behind the failure of its global authentication system that blocked access to YouTube, Gmail, and Google Cloud Platform services.
A day after the incident on Monday 14 December, Google said in a preliminary analysis that the root cause was a fault in its automated storage quota management system. The fault reduced the capacity of its central identity management system, which in turn locked users out of the many Google services that require a login.
The outage lasted only about 50 minutes but blocked access to Gmail and YouTube for billions of users worldwide. The incident also affected companies that rely on Google Cloud Platform for computing resources.
Google’s full incident report describes a short-lived but major event that came down to a ‘zero’ value reported by the legacy storage quota system, which Google uses to automatically provision storage for its authentication system.
“As part of an ongoing migration of the User ID Service to a new quota system, a change was made in October to register the User ID Service with the new quota system, but parts of the previous quota system were left in place which incorrectly reported the usage for the User ID Service as 0,” the report said.
“As a result, the quota for the account database was reduced, which prevented the Paxos leader from writing. Shortly after, the majority of read operations became outdated which resulted in errors on authentication lookups.”
Google says that the outage stemmed from changes it made to the Google User ID Service in October as part of a migration to the new quota system.
At the heart of the outage was the Google User ID Service, which has a unique identifier for every account and handles authentication credentials for OAuth tokens and cookies. OAuth tokens are used to log people in to a service without requiring the user to enter or re-enter a password.
Google stores this account data in a distributed cloud database, which uses the Paxos consensus protocol to coordinate updates: replicas must agree on a value before it is committed.
“For security reasons, this service will reject requests when it detects outdated data,” Google explains.
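The behaviour Google describes, refusing to serve a lookup once the local data is too old, can be illustrated with a minimal Python sketch. It is not Google's implementation: the `Replica` class, the `StaleDataError` exception, and the 60-second `MAX_STALENESS_SECONDS` threshold are all hypothetical names and values chosen for illustration.

```python
import time
from dataclasses import dataclass

# Hypothetical threshold; the report does not say how old is "outdated".
MAX_STALENESS_SECONDS = 60.0


class StaleDataError(Exception):
    """Raised when a replica's data is too old to be trusted."""


@dataclass
class Replica:
    """Illustrative read replica holding account records."""
    data: dict
    last_synced_at: float  # Unix timestamp of the last successful sync

    def lookup(self, account_id, now=None):
        # Allow the caller to inject a clock value so the check is testable.
        now = time.time() if now is None else now
        age = now - self.last_synced_at
        if age > MAX_STALENESS_SECONDS:
            # Fail closed: rejecting the request is safer than
            # authenticating against possibly revoked credentials.
            raise StaleDataError(f"replica data is {age:.0f}s old")
        return self.data[account_id]
```

Under these assumptions, once writes stop reaching the replica its `last_synced_at` stops advancing, so every lookup eventually starts failing, which matches the pattern of errors on authentication lookups described in the report.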
“An existing grace period on enforcing quota restrictions delayed the impact, which eventually expired, triggering automated quota systems to decrease the quota allowed for the User ID service and triggering this incident. Existing safety checks exist to prevent many unintended quota changes, but at the time they did not cover the scenario of zero reported load for a single service.”
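The kind of safety check the report says was missing can be sketched as follows. This is an illustrative Python example, not Google's quota system: the `propose_quota` function, the `QuotaDecision` type, and the `headroom` and `max_cut_fraction` parameters are assumptions made for the sake of the example.

```python
from dataclasses import dataclass


@dataclass
class QuotaDecision:
    new_quota: int
    applied: bool   # False means the change was held back for review
    reason: str


def propose_quota(current_quota, reported_usage,
                  headroom=1.5, max_cut_fraction=0.5):
    """Size a quota from reported usage, with two hypothetical guards."""
    # Guard 1: zero reported usage for a service that holds quota is far
    # more likely to be a reporting fault than a real drop in load.
    if reported_usage <= 0 and current_quota > 0:
        return QuotaDecision(current_quota, False,
                             "zero reported usage; needs review")

    target = int(reported_usage * headroom)

    # Guard 2: never cut more than a fixed fraction in a single step.
    floor = int(current_quota * (1 - max_cut_fraction))
    if target < floor:
        return QuotaDecision(floor, True, "reduction capped")

    return QuotaDecision(target, True, "ok")
```

With these assumed defaults, `propose_quota(1000, 0)` leaves the quota at 1000 and flags the change for review rather than shrinking it, while `propose_quota(1000, 100)` limits the reduction to 500 in one step.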
Google also detailed the extent of the impact on users across Google Cloud Storage, Google Cloud Network, the Google Kubernetes Engine (GKE), Google Workspace (formerly G Suite), and Google Cloud Support.
“On Monday 14 December, 2020 from 03:46 to 04:33 US/Pacific, credential issuance and account metadata lookups for all Google user accounts failed. As a result, we could not verify that user requests were authenticated and served 5xx errors on virtually all authenticated traffic,” Google says in its report on Google Cloud Infrastructure Components incident 20013.
Google confirmed that “all authenticated Google Workspace apps were down for the duration of the incident” and that around “4% of requests to the GKE control plane API failed, and nearly all Google-managed and customer workloads could not report metrics to Cloud Monitoring.”
The majority of Google’s authenticated services experienced “elevated error rates across all Google Cloud Platform and Google Workspace APIs and Consoles.”
While most services quickly recovered automatically, some services had a “unique or lingering impact”, Google said.
Google noted in a correction published on Tuesday to its root cause analysis that “all services that require sign-in via a Google Account were affected with varying impact.”