By Bill Toulas

Cloudflare has disclosed that its R2 object storage service and the products that depend on it suffered an outage lasting 1 hour and 7 minutes, during which 100% of writes and 35% of reads failed globally.

Cloudflare R2 is a scalable, S3-compatible object storage service offering free data retrieval, multi-region replication, and tight integration with the rest of Cloudflare's platform.

The incident, which lasted from 21:38 UTC to 22:45 UTC, was reportedly caused by a credential rotation that left the R2 Gateway (the service's API frontend) unable to authenticate to the backend storage.

Specifically, the new credentials were mistakenly deployed to a development environment instead of production, so when the old credentials were deleted, the production service was left with no valid credentials.

The mistake came down to a single omitted command-line flag, '--env production.' Without it, the new credentials were deployed to the Worker's default (development) environment rather than to the production R2 Gateway Worker (a sketch of this behavior follows at the end of this article).

Due to the way Cloudflare's services propagate such changes, the misconfiguration was not immediately obvious, which further delayed remediation.

"The decline in R2 availability metrics was gradual and not immediately obvious because there was a delay in the propagation of the previous credential deletion to storage infrastructure," explained Cloudflare in its incident report.

Cloudflare attributes the outage to the absence of safeguards and validation checks for high-impact actions, and is planning additional measures covering improved account provisioning, stricter access control, and two-party approval processes for high-risk actions.

To prevent similar incidents from recurring, Cloudflare has improved credential logging and verification and now mandates the use of automated deployment tooling to avoid human error. The company is also updating its standard operating procedures (SOPs) to require dual validation for high-impact actions like credential rotation, and plans to enhance health checks for faster root cause detection.

Cloudflare's R2 service suffered another roughly hour-long outage in February, which was also caused by a human error.
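For readers unfamiliar with how Wrangler scopes Worker environments, the sketch below illustrates the class of mistake the report describes. It is an illustration only: Cloudflare has not published the exact command it ran, and the secret name used here is hypothetical. What is shown is real Wrangler behavior, where `wrangler secret put` with `--env` writes a secret to the named environment, and omitting the flag targets the Worker's default environment instead.

```sh
# Hedged illustration only: the exact command Cloudflare ran is not public,
# and the secret name below is hypothetical.

# Intended action: push the rotated credential to the *production*
# environment of the R2 Gateway Worker.
wrangler secret put R2_GATEWAY_BACKEND_TOKEN --env production

# The error class described in the report: omitting the flag writes the
# secret to the Worker's default (non-production) environment, so the
# production Worker keeps using the old credential until it is revoked.
wrangler secret put R2_GATEWAY_BACKEND_TOKEN
```

Deleting the old credential only after confirming that the production environment can actually authenticate with the new one, for example via a test request against the S3-compatible endpoint, is one way to catch this class of error before it takes production down; Cloudflare's remediation list points in this direction with improved credential verification and dual validation for rotations.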
Source: www.bleepingcomputer.com. Published: Tue, 25 Mar 2025 19:50:23 +0000