Google | 15:05 | Feature Updates | Official Blog
Google DeepMind Unveils Decoupled DiLoCo for Resilient Training
Boosts reliability of global distributed training.
Key Points
- Fault tolerance over low-bandwidth links.
- Multi-region training of a 12B model.
- Mixed hardware with self-healing.
- Builds on Pathways and DiLoCo.
Decoupled DiLoCo enables fault-tolerant training across data centers. Training continues when failures occur, and a 12B Gemma model was trained over low-bandwidth links. The approach also mixes hardware and self-heals, supporting scalable infrastructure.
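
The summary does not describe how Decoupled DiLoCo works internally, so the following is only a toy sketch of the general idea behind low-communication training that keeps going when a worker drops out. It is not DeepMind's implementation. It assumes a scalar parameter, a made-up quadratic loss per worker, and a plain averaged outer step rather than whatever outer optimizer the real system uses. The names `inner_steps`, `train`, and `failed` are hypothetical.

```python
def inner_steps(w, target, steps, lr):
    """Run `steps` local gradient-descent steps on the toy loss (w - target)^2."""
    for _ in range(steps):
        grad = 2.0 * (w - target)
        w = w - lr * grad
    return w


def train(targets, rounds, inner, failed=frozenset(), inner_lr=0.1, outer_lr=1.0):
    """Toy low-communication loop: workers exchange data only once per round.

    `targets` gives each worker's (hypothetical) local optimum.
    `failed` holds hypothetical (round, worker) pairs that drop out.
    """
    w_global = 0.0
    for r in range(rounds):
        deltas = []
        for k, target in enumerate(targets):
            if (r, k) in failed:
                continue  # skip the failed worker; the round still proceeds
            w_local = inner_steps(w_global, target, inner, inner_lr)
            # Only this difference is communicated, once per round.
            deltas.append(w_global - w_local)
        if not deltas:
            continue  # no survivors this round: keep the current parameters
        avg = sum(deltas) / len(deltas)
        w_global = w_global - outer_lr * avg
    return w_global


if __name__ == "__main__":
    targets = [1.0, 2.0, 3.0]
    print(train(targets, rounds=20, inner=10))
    print(train(targets, rounds=20, inner=10, failed={(3, 0), (4, 0), (7, 2)}))
```

In this sketch each worker communicates one value per round instead of one per gradient step, which is where the bandwidth saving comes from, and a missing worker only changes how many updates are averaged in that round.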