Cloudflare reveals it’s automated empathy to avoid fixing flaky hardware too often

Cloudflare has revealed a little about how it maintains the millions of boxes it operates around the world – including the concept of an “error budget” that enacts “empathy embedded in automation.”

In a Tuesday post titled “Autonomous hardware diagnostics and recovery at scale,” the internet-taming biz explains that it built fault-tolerant infrastructure that can continue operating with “little to no impact” on its services. But as explained by infrastructure engineering tech lead Jet Marsical and systems engineers Aakash Shah and Yilin Xiong, when servers did break, the Data Center Operations team relied on manual processes to identify dead boxes. And those processes could take “hours for a single server alone, and [could] easily consume an engineer’s entire day.”

Which does not work at hyperscale.

Worse, dead servers would sometimes remain powered on, costing Cloudflare money without producing anything of value.

Enter Phoenix – a tool Cloudflare created to detect broken servers and automatically initiate workflows to get them fixed.

Phoenix makes a “discovery run” every thirty minutes, during which it probes up to two datacenters known to house broken boxen. That pace of discovery means Phoenix can find dead machines across Cloudflare’s network in no more than three days. If it spots machines already listed for repairs, it “takes care of ensuring that the Recovery phase is executed immediately.”
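Cloudflare hasn’t published Phoenix’s code, but the cadence described above amounts to a simple rotation: every half hour, pick a couple of datacenters, probe them, and send them to the back of the queue. A rough sketch of that idea, with every name here (probe, discovery_loop) our own illustration rather than anything from the post:

# Illustrative sketch only; Phoenix's real scheduler is not public.
import time
from collections import deque

DISCOVERY_INTERVAL_S = 30 * 60   # one discovery run every thirty minutes
DCS_PER_RUN = 2                  # probe at most two datacenters per run

def discovery_loop(datacenters, probe):
    """Rotate through the fleet a couple of datacenters at a time, so every
    location gets swept within a bounded number of days."""
    queue = deque(datacenters)
    while queue:
        batch = [queue.popleft() for _ in range(min(DCS_PER_RUN, len(queue)))]
        for dc in batch:
            probe(dc)         # flag dead or broken servers in this datacenter
            queue.append(dc)  # back of the queue for the next sweep
        time.sleep(DISCOVERY_INTERVAL_S)

At that pace the arithmetic roughly checks out: two datacenters every thirty minutes is 96 a day, so a network of a few hundred locations gets a full sweep in about three days.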

When it spots a broken box, Phoenix uses the Intelligent Platform Management Interface to figure out what’s wrong. If a machine passes that test, it is subjected to a “Node Acceptance Test” that works like this:

Phoenix will send relevant system instructions to have it boot into a custom Linux boot image, internally called INAT-image. Built into this image are the various tests that need to run when the server boots up, publishing the results to an internal resource in both human-readable (HTML) and machine-readable (JSON) formats, with the latter consumed and interpreted by Phoenix. Upon completion of the boot diagnostics, the server is powered off again to ensure it is not wasting energy.
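Cloudflare doesn’t show the schema of those JSON results, but the flow it describes (boot diagnostics publishing machine-readable output that Phoenix then interprets) boils down to something like the following sketch, with the field names invented for illustration:

# Hypothetical consumer of Node Acceptance Test results; the real schema is not public.
import json

def failed_tests(raw_json: str) -> list[str]:
    """Pull the names of failed tests out of a boot-diagnostics report.

    Assumes a report shaped like:
      {"server": "1234abcd", "tests": [{"name": "memtest", "passed": true}, ...]}
    """
    report = json.loads(raw_json)
    return [t["name"] for t in report.get("tests", []) if not t.get("passed", False)]

sample = '{"server": "1234abcd", "tests": [{"name": "memtest", "passed": true}, {"name": "disk_smart", "passed": false}]}'
print(failed_tests(sample))  # ['disk_smart'], which would feed the repair to-do list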

The results of that testing automatically produce a repair to-do list, and the system is smart enough not to repeatedly add a device to that list if the part it needs to resume operations has yet to arrive.
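That de-duplication is easy to picture as a guard in front of the list. A minimal sketch, assuming a simple in-memory set and a record of which servers are still waiting on parts (both our invention, not Cloudflare’s):

# Hypothetical guard for the repair to-do list; names are illustrative only.
def queue_for_repair(repair_list: set, awaiting_parts: dict, server_id: str) -> bool:
    """Add a server to the to-do list unless it is already queued or still
    waiting on a replacement part. Returns True if it was added."""
    if server_id in repair_list or awaiting_parts.get(server_id, False):
        return False
    repair_list.add(server_id)
    return True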

Phoenix also operates against an “error budget” that assesses if a box that has gone down more than once is worth saving.

“The error budget is the amount of error that automation can accumulate over a certain period of time before our site reliability engineers start being unhappy due to the excessive server failures or unreliability of the system,” explained Marsical, Shah, and Xiong. “It is empathy embedded in automation.”

And it means Phoenix will, of its own accord, stop trying to recover a machine if it fails a certain number of times within a certain time window.
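The post doesn’t give the exact numbers, but the behaviour it describes, giving up on automated recovery once a machine has failed too many times within a window, is essentially a sliding-window counter. A minimal sketch, with the threshold and window as made-up placeholders:

# Illustrative error-budget check; Cloudflare's real thresholds aren't published.
import time

class ErrorBudget:
    def __init__(self, max_failures: int = 3, window_s: int = 7 * 24 * 3600):
        self.max_failures = max_failures   # placeholder threshold
        self.window_s = window_s           # placeholder window (a week, here)
        self.failures = {}                 # server id -> list of failure timestamps

    def record_failure(self, server_id: str, now: float | None = None) -> None:
        now = now if now is not None else time.time()
        recent = [t for t in self.failures.get(server_id, []) if now - t < self.window_s]
        recent.append(now)
        self.failures[server_id] = recent

    def within_budget(self, server_id: str, now: float | None = None) -> bool:
        """False once the budget is blown: stop automated recovery attempts."""
        now = now if now is not None else time.time()
        recent = [t for t in self.failures.get(server_id, []) if now - t < self.window_s]
        return len(recent) < self.max_failures

In a setup like this, Phoenix would call record_failure each time a recovery attempt fails and consult within_budget before trying again; once it returns False, the box is left for a human to assess.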

“The error budget has helped us define and manage our tolerance for hardware failures without causing significant harm to the system or too much noise for SREs, and gave us opportunities to improve our diagnostics system,” Cloudflare’s trio wrote. “It provides a common incentive that allows both the Infrastructure Engineering and SRE teams to focus on finding the right balance between innovation and reliability.”

The post concludes with a paean to the power of automation – to let techies spend their time on higher-value activities. ®
