
Hierarchical SparseLoCo protocols can enable efficient Geo-Distributed Mixture of Experts (MoE) training by localizing expert routing within bandwidth-constrained regions.

Feasibility: 7 Novelty: 9

Motivation

Current MoE models require high-bandwidth All-to-All communication for token routing, which restricts them to single monolithic clusters. If the paper's low-bandwidth techniques could be extended to expert routing, they would allow massive models to be trained across geographically separated datacenters, overcoming the power and space constraints of any single site.

Proposed Method

Partition an MoE model so that frequently co-activated experts reside in the same physical region. Modify the SparseLoCo algorithm to prioritize intra-region expert routing and apply aggressive quantization/sparsification only to inter-region token dispatch. Evaluate perplexity and throughput on a simulated multi-region cluster (e.g., AWS East vs. West) against a standard single-region All-to-All MoE baseline. A minimal sketch of the routing idea follows.
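Below is a minimal PyTorch sketch of the two core mechanisms: a top-k router biased toward experts hosted in the local region, and int8 compression applied only to tokens dispatched across regions. All names here (RegionAwareRouter, intra_region_bias, int8_quantize, the single-token-batch dispatch) are illustrative assumptions, not APIs from SparseLoCo or Megatron-LM; a real implementation would hook into the framework's All-to-All dispatch path rather than the toy split shown here.

```python
# Illustrative sketch only: region-biased MoE routing with compressed inter-region dispatch.
import torch
import torch.nn as nn
import torch.nn.functional as F


def int8_quantize(x: torch.Tensor):
    """Symmetric per-tensor int8 quantization for inter-region payloads (assumed scheme)."""
    if x.numel() == 0:
        return x.to(torch.int8), torch.tensor(1.0)
    scale = x.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
    return q, scale


def int8_dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Receiving-side reconstruction of a compressed inter-region payload."""
    return q.float() * scale


class RegionAwareRouter(nn.Module):
    """Top-k router whose logits are biased toward experts in the local region."""

    def __init__(self, d_model: int, n_experts: int, expert_region: torch.Tensor,
                 local_region: int, top_k: int = 2, intra_region_bias: float = 1.0):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        # expert_region[e] = id of the region hosting expert e
        self.register_buffer("is_local", (expert_region == local_region).float())
        self.top_k = top_k
        self.intra_region_bias = intra_region_bias

    def forward(self, x: torch.Tensor):
        # x: [tokens, d_model]
        logits = self.gate(x)
        # Remote experts must beat local ones by a clear margin to attract a token.
        biased = logits + self.intra_region_bias * self.is_local
        weights, expert_idx = torch.topk(F.softmax(biased, dim=-1), self.top_k, dim=-1)
        return weights, expert_idx


def dispatch(x, expert_idx, is_local):
    """Simplification: a token stays local (uncompressed) if any of its top-k experts is
    local; otherwise it is int8-compressed before the inter-region transfer."""
    token_local = is_local[expert_idx].bool().any(dim=-1)
    local_tokens = x[token_local]
    remote_q, remote_scale = int8_quantize(x[~token_local])
    return local_tokens, (remote_q, remote_scale)


if __name__ == "__main__":
    d_model, n_experts = 64, 8
    # Experts 0-3 live in region 0 (local), experts 4-7 in region 1 (remote).
    expert_region = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
    router = RegionAwareRouter(d_model, n_experts, expert_region, local_region=0)
    tokens = torch.randn(16, d_model)
    weights, idx = router(tokens)
    local, (remote_q, scale) = dispatch(tokens, idx, router.is_local)
    print(f"local tokens: {local.shape[0]}, remote (int8) tokens: {remote_q.shape[0]}")
```

The intra_region_bias term is the tunable knob of interest: raising it keeps more traffic inside the region at the cost of routing quality, which is exactly the perplexity-versus-throughput trade-off the proposed evaluation would measure.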

Expected Contribution

A framework for 'planetary-scale' MoE training that decouples model size from the capacity limits of any single datacenter.

Required Resources

Access to a distributed cloud environment (e.g., AWS/GCP multi-region setup), PyTorch/Megatron-LM codebase, significant GPU compute budget for pre-training experiments.

Source Paper

Heterogeneous Low-Bandwidth Pre-Training of LLMs
