Lights Out, Systems On: Validating Instant Power Loss Readiness
Meta's engineering team developed Instantaneous PowerLoss Storm, a testing paradigm for validating that data centers are ready for instant power loss. To build that readiness, the team integrated power loss tolerance into the data center (DC) stack using defense-in-depth strategies such as in-memory data persistence and asynchronous signaling mechanisms. Validation combined controlled power-supply fault injection with remedial actions to ensure seamless region de-energization. To balance reliability against velocity of growth, the team made key tradeoffs, including prioritizing impacts on critical infrastructure and tolerating transient service errors. The validation exercises, which included de-energizing production regions, successfully trained both the infrastructure and the engineers to handle region-level failures.