Imagine a warehouse-sized room somewhere in Meta’s global data center network: racks running from floor to ceiling, fiber cables bundled like industrial rope, and cooling systems humming at a frequency you feel rather than hear. Everything appears to be in order. The numbers on the monitors look correct. The training run is underway. And somewhere among those thousands of GPUs, one chip is silently producing wrong results, leaving no trace and raising no alarm. The model simply keeps running, ingesting erroneous values and drifting off course.
Meta’s hardware engineers have been working on this problem for years, but it receives far too little attention outside the technical community. Silent data corruption, or SDC as the company calls it, occurs when silicon makes a calculation error without reporting it: no crash, no log entry, no clear indication that something went wrong. These errors are especially pernicious because neural networks can absorb minor computational errors and continue producing plausible results. By the time someone notices that the model is behaving strangely, pinpointing the one faulty device buried in a cluster of thousands takes forensic engineering that most organizations are simply not equipped to perform.

Meta is. Since 2018, the company has been methodically cataloguing failure modes across its entire hardware stack, including disks, CPUs, memory, switches, GPUs, ASICs, and network components. This process frequently reveals issues that the industry as a whole has not yet formally documented. The resulting body of knowledge has produced something unusual: a reliability discipline designed specifically for AI workloads, at a scale at which no one else is operating, as Jensen Huang of NVIDIA once noted. Whatever it implies for the rest of the field, it is difficult not to find that both impressive and somewhat sobering.
The challenge became tangible while training the Llama 3 family of models. Hardware failures (malfunctioning SRAM, high-bandwidth memory chips, processing grids, and network switching hardware) that stopped synchronous training runs involving thousands of accelerators at once were directly responsible for over 66% of training interruptions. The synchronous part is crucial: in that setting, everything stops when one component fails. While engineers diagnose the issue, identify the malfunctioning device, and restart from the most recent valid checkpoint, every connected accelerator sits idle. At the scale at which Meta operates, even a brief disruption is costly.
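To make that cost concrete, here is a rough back-of-the-envelope sketch in Python. Every number in it is a made-up assumption for illustration, not a figure reported by Meta, and the `Interruption` type and `lost_accelerator_hours` function are hypothetical names. The point is only that in a synchronous job, each minute of wall-clock downtime is multiplied by the number of accelerators waiting.

```python
from dataclasses import dataclass


@dataclass
class Interruption:
    """One hardware-caused stop of a synchronous training job (hypothetical values)."""
    accelerators: int                # devices in the job; all of them idle during the stop
    diagnose_minutes: float          # time to find and swap out the faulty device
    restart_minutes: float           # time to reload the checkpoint and resume
    minutes_since_checkpoint: float  # progress after the last checkpoint, which is redone


def lost_accelerator_hours(event: Interruption) -> float:
    """Accelerator-hours lost: every device pays for the downtime plus the redone work."""
    wall_minutes = (
        event.diagnose_minutes
        + event.restart_minutes
        + event.minutes_since_checkpoint
    )
    return event.accelerators * wall_minutes / 60.0


if __name__ == "__main__":
    # Assumed scenario: 10,000 accelerators, one hour of total wall-clock loss.
    event = Interruption(
        accelerators=10_000,
        diagnose_minutes=20,
        restart_minutes=10,
        minutes_since_checkpoint=30,
    )
    print(f"{lost_accelerator_hours(event):,.0f} accelerator-hours lost")
```

Under these assumed numbers, a single one-hour disruption costs 10,000 accelerator-hours, which is why faster diagnosis and more frequent checkpoints both pay off quickly at this scale.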
Meta divides hardware faults into three categories, and the differences reveal a lot about the engineering problem. Static errors, meaning devices that simply won’t power on, are the simplest to handle. Health checks detect them quickly, and although they become more common as clusters grow, they are relatively easy to diagnose and fix. Transient errors are harder. These are load-dependent faults that appear under particular computational or thermal conditions and sometimes disappear before they can be isolated.
Meta’s approach is to recreate those conditions deliberately with synthetic workloads in non-production environments, essentially making the fault repeat on demand so engineers can capture and examine it. At Meta’s scale, pattern matching across thousands of devices makes what would otherwise be a rare event detectable. It’s an uncommon technique.
Then there are the errors that go unnoticed: silent data corruptions. These are the most dangerous and demand the largest engineering investment. Detecting SDCs takes a lot of time and sophisticated telemetry, because corrupted outputs must be traced back through layers of computation to the originating device. Meta has built that capability methodically, and engineers from NVIDIA, AMD, Google, Intel, and ARM have contributed research to the effort, which suggests the industry views this as a shared problem rather than any one company’s advantage.
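One generic way to surface a device that computes wrong answers without complaining is to run the same deterministic workload everywhere and compare fingerprints of the results. The Python sketch below illustrates that idea with a simulated fleet. It is not a description of Meta’s tooling: the device names, the `reference_workload` arithmetic, and the majority-vote rule in `find_suspects` are all assumptions chosen to keep the example small.

```python
import hashlib
from collections import Counter
from typing import Callable, Dict, List


def reference_workload(seed: int, steps: int = 1000) -> int:
    """Deterministic integer workload: a healthy device always returns the same value."""
    state = seed
    for _ in range(steps):
        state = (state * 6364136223846793005 + 1442695040888963407) % (1 << 64)
    return state


def flaky_workload(seed: int) -> int:
    """Simulated faulty device: silently flips one bit of the correct answer."""
    return reference_workload(seed) ^ (1 << 17)


def digest(value: int) -> str:
    """Short fingerprint of a result, cheap to send to a central collector."""
    return hashlib.sha256(value.to_bytes(8, "little")).hexdigest()[:16]


def find_suspects(fingerprints: Dict[str, str]) -> List[str]:
    """Return the devices whose fingerprint differs from the most common one."""
    majority, _ = Counter(fingerprints.values()).most_common(1)[0]
    return sorted(dev for dev, fp in fingerprints.items() if fp != majority)


if __name__ == "__main__":
    # Hypothetical fleet of eight devices, one of which corrupts its output.
    fleet: Dict[str, Callable[..., int]] = {
        f"gpu-{i:03d}": reference_workload for i in range(8)
    }
    fleet["gpu-005"] = flaky_workload

    fingerprints = {name: digest(run(seed=42)) for name, run in fleet.items()}
    print(find_suspects(fingerprints))  # prints ['gpu-005']
```

The sketch assumes that most devices are healthy and that the workload is bit-for-bit reproducible. Real accelerator workloads often involve floating-point math whose results can legitimately vary with execution order, so a production check would likely need tolerances or carefully controlled kernels rather than exact hash equality.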
All of this is taking place against a backdrop of infrastructure that is, frankly, expanding faster than conventional construction schedules allow. Meta made headlines when reports surfaced that it was installing GPU clusters inside weatherproof tent structures (hurricane-proof, Zuckerberg told The Information) because permanent data centers simply take too long to build.
The strategy has precedent: in 2008, Microsoft operated servers in a tent for seven months without a single malfunction, suggesting that computers are more resilient than the industry believed. Whether canvas walls are the best option in the long run is genuinely unclear, and some structural engineers have voiced serious concerns. However, Meta’s history of unorthodox infrastructure choices, such as its 2015 cold storage facilities built without backup generators, indicates that the company has a particular tolerance for trading polish for speed, and so far that instinct has largely worked to its advantage.
Piece by piece, what is being built here is less a system than a methodology, one intended to preserve the integrity of AI training when the hardware that powers it is subjected to stresses it was never designed for.
