Hierarchical SparseLoCo protocols can enable efficient geo-distributed Mixture of Experts (MoE) training by localizing expert routing within each region and minimizing traffic over the bandwidth-constrained links between regions.
Motivation
Current MoE models require high-bandwidth All-to-All communication for token routing, which restricts training to a single monolithic cluster. If the low-bandwidth techniques from the SparseLoCo paper could be applied to expert routing, massive models could be trained across geographically separated datacenters, overcoming the power and space constraints of any single site.
Proposed Method
Partition the MoE model so that frequently co-activated experts reside in the same physical region. Modify the SparseLoCo algorithm to prioritize intra-region expert routing and apply aggressive quantization/sparsification only to inter-region token dispatch. Evaluate perplexity and throughput on a simulated multi-region cluster (e.g., AWS East vs. West) against a standard single-cluster MoE baseline; a minimal sketch of the region-biased routing and cross-region compression appears below.
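The sketch below illustrates one way the routing modification could look, not the SparseLoCo algorithm itself: a top-k gate subtracts a penalty from the logits of experts hosted in remote regions, and tokens that must still cross a region boundary are compressed with top-k feature sparsification plus int8 quantization. The class and function names (RegionBiasedGate, compress_cross_region), the region_penalty and keep_frac hyperparameters, and the expert-to-region map are illustrative assumptions, not part of the original proposal.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RegionBiasedGate(nn.Module):
    """Top-k MoE gate that discourages routing tokens to experts in remote regions.

    Assumption (not from the proposal): expert_region maps each expert id to a
    region id, and region_penalty is a logit offset applied to experts that live
    outside the token's local region.
    """

    def __init__(self, d_model, n_experts, expert_region, local_region,
                 top_k=2, region_penalty=2.0):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k
        self.region_penalty = region_penalty
        # 1.0 for experts hosted outside the local region, 0.0 otherwise.
        remote = torch.tensor([float(r != local_region) for r in expert_region])
        self.register_buffer("remote_mask", remote)

    def forward(self, x):
        logits = self.router(x)                               # [tokens, n_experts]
        logits = logits - self.region_penalty * self.remote_mask
        weights, expert_ids = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        return weights, expert_ids


def compress_cross_region(tokens, keep_frac=0.25):
    """Sparsify and int8-quantize activations before a cross-region dispatch.

    Keeps only the largest-magnitude fraction of each token's features and
    quantizes them with a per-token scale; both are assumptions about what
    aggressive compression could look like, not the SparseLoCo scheme.
    """
    d = tokens.shape[-1]
    k = max(1, int(keep_frac * d))
    _, indices = tokens.abs().topk(k, dim=-1)
    kept = torch.gather(tokens, -1, indices)
    scale = kept.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / 127.0
    q = torch.clamp((kept / scale).round(), -127, 127).to(torch.int8)
    return q, scale, indices                                  # payload for the slow link


def decompress_cross_region(q, scale, indices, d):
    """Reconstruct a dense (lossy) activation tensor in the receiving region."""
    out = torch.zeros(*q.shape[:-1], d, dtype=scale.dtype, device=q.device)
    out.scatter_(-1, indices, q.to(scale.dtype) * scale)
    return out


if __name__ == "__main__":
    torch.manual_seed(0)
    gate = RegionBiasedGate(d_model=64, n_experts=8,
                            expert_region=[0, 0, 0, 0, 1, 1, 1, 1],
                            local_region=0)
    x = torch.randn(16, 64)
    weights, expert_ids = gate(x)
    # Tokens routed to any remote expert (region 1) get compressed before dispatch.
    remote = gate.remote_mask[expert_ids].bool().any(dim=-1)
    if remote.any():
        q, scale, idx = compress_cross_region(x[remote])
        x_hat = decompress_cross_region(q, scale, idx, d=64)
        err = float((x[remote] - x_hat).norm() / x[remote].norm())
        print(f"remote tokens: {int(remote.sum())}, reconstruction error: {err:.3f}")
    else:
        print("no tokens routed to remote experts in this batch")
```

In this sketch the region penalty plays the role of an intra-region routing prior: raising it trades load-balance and model quality for lower inter-region traffic, which is exactly the trade-off the proposed perplexity/throughput evaluation would quantify.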
Expected Contribution
A framework for 'Planetary Scale' MoE training that decouples model size from single-datacenter capacity limits.
Required Resources
Access to a multi-region cloud environment (e.g., AWS or GCP), a PyTorch/Megatron-LM codebase, and a significant GPU compute budget for pre-training experiments.