Asynchronous 'Generational' Pipeline Parallelism can put legacy GPUs (e.g., V100s) to work alongside modern GPUs (e.g., H100s) without bottlenecking the faster hardware, even when the two partitions are connected only via low-bandwidth links.
Motivation
Heterogeneous training usually suffers because the fastest GPU waits for the slowest. While the paper addresses bandwidth, it assumes synchronous pipelines. By decoupling the backward pass on legacy hardware and shipping its gradients over the low-bandwidth protocol as delayed updates, old hardware could contribute to training without stalling the main H100 fleet.
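One concrete reading of 'delayed gradient updates' is staleness-damped application: the fast partition folds in a late-arriving gradient scaled down by its age. A minimal sketch in PyTorch; the damping rule, STALENESS_LIMIT, and apply_delayed_gradient are illustrative assumptions, not taken from the paper.

    import torch

    STALENESS_LIMIT = 8  # assumed tolerance, measured in fast-partition steps

    def apply_delayed_gradient(param: torch.Tensor,
                               stale_grad: torch.Tensor,
                               staleness: int,
                               lr: float = 1e-4) -> None:
        # Fold in a gradient the slow partition computed `staleness` steps ago.
        if staleness > STALENESS_LIMIT:
            return  # too old: drop it rather than corrupt the fast replicas
        scale = 1.0 / (1.0 + staleness)  # damp older gradients toward zero
        param.data.add_(stale_grad, alpha=-lr * scale)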
Proposed Method
Set up a heterogeneous cluster with a fast partition (H100s) and a slow partition (V100s) connected via standard Ethernet. Implement an asynchronous variant of the paper's pipeline in which the slow partition processes micro-batches under a staleness tolerance and compresses its updates via SparseLoCo. Measure the 'Effective FLOPS' contribution of the slow partition against the communication overhead it incurs.
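SparseLoCo's internals are not reproduced here; as a stand-in, the sketch below shows a generic top-k sparsifier with error feedback, which captures the bandwidth saving such a compressor provides over Ethernet. The class and method names are hypothetical.

    import torch

    class TopKCompressor:
        # Generic top-k sparsifier with error feedback (stand-in for SparseLoCo).
        def __init__(self, ratio: float = 0.01):
            self.ratio = ratio   # fraction of gradient entries sent per step
            self.residual = {}   # unsent mass, re-added on the next round

        def compress(self, name: str, grad: torch.Tensor):
            flat = grad.flatten() + self.residual.get(
                name, torch.zeros_like(grad).flatten())
            k = max(1, int(flat.numel() * self.ratio))
            _, idx = torch.topk(flat.abs(), k)
            values = flat[idx]            # signed values that cross the wire
            leftover = flat.clone()
            leftover[idx] = 0.0           # remember what was not sent
            self.residual[name] = leftover
            return idx, values, grad.shape

        @staticmethod
        def decompress(idx, values, shape):
            out = torch.zeros(shape).flatten()
            out[idx] = values
            return out.view(shape)

With ratio=0.01 roughly 1% of gradient entries cross the link per step, and the error-feedback buffer ensures the dropped mass is eventually transmitted rather than lost.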
Expected Contribution
A method to extend the economic lifespan of older hardware by integrating it productively into modern training loops despite bandwidth and compute disparities.
Required Resources
Access to a mixed-generation GPU cluster and modification of the distributed optimizer to handle stale gradients.
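A hypothetical sketch of that optimizer modification, combining version tracking with the damping rule above; StaleAwareOptimizer and the id(param)-keyed gradient dict are illustrative choices, not an existing API.

    import torch

    class StaleAwareOptimizer:
        # Wraps a standard optimizer and tags each step with a version number
        # so late gradients from the slow partition can be damped or dropped.
        def __init__(self, opt: torch.optim.Optimizer, max_staleness: int = 8):
            self.opt = opt
            self.max_staleness = max_staleness
            self.version = 0  # incremented on every fast-partition step

        def step_fast(self) -> None:
            # Normal synchronous step for the H100 partition.
            self.opt.step()
            self.opt.zero_grad()
            self.version += 1

        def step_slow(self, grads: dict, computed_at: int) -> bool:
            # Fold in V100 gradients tagged with the version they were
            # computed at, damped by how many versions they lag behind.
            age = self.version - computed_at
            if age > self.max_staleness:
                return False  # beyond the staleness tolerance: reject
            scale = 1.0 / (1.0 + age)
            for group in self.opt.param_groups:
                for p in group["params"]:
                    g = grads.get(id(p))  # assumed keying: id(param) -> grad
                    if g is not None:
                        p.data.add_(g, alpha=-group["lr"] * scale)
            return True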