Discretization errors in the ODE approximation of SGD are the primary cause of training instability at large batch sizes, implying that optimizers built on higher-order numerical integration should admit larger stable learning rates.
Motivation
The paper models SGD as a continuous-time ODE, but real training is discrete. If the 'instability' seen in large-batch training is simply the Euler method (the first-order update underlying standard SGD and Adam) failing to track the true ODE trajectory, then better integrators could stabilize training.
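For concreteness, under the standard gradient-flow framing (notation here is illustrative, not taken from the paper), SGD with learning rate $\eta$ is the forward-Euler discretization of gradient flow:

$$
\dot{\theta}(t) = -\nabla L(\theta(t)), \qquad \theta_{k+1} = \theta_k - \eta\,\nabla L(\theta_k),
$$

where each Euler step incurs a local truncation error of order $\mathcal{O}(\eta^2)$. That per-step error, which grows with the step size, is exactly what a higher-order integrator is designed to shrink.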
Proposed Method
1. Re-interpret the paper's update rules as a first-order Euler discretization of the underlying ODE.
2. Implement a custom optimizer based on 4th-order Runge-Kutta (RK4) or a symplectic integrator designed to preserve the energy/Hamiltonian of the system (a minimal RK4 sketch follows after this list).
3. Compare the maximum stable learning rate and convergence speed per wall-clock second against AdamW on a GPT-2 scale model.
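A minimal sketch of step 2, assuming a PyTorch training setup; the class name RK4Optimizer and the closure contract are illustrative choices, not the paper's method. The key point it demonstrates is that RK4 needs gradient evaluations at intermediate parameter values, so each optimizer step costs four forward/backward passes.

```python
import torch


class RK4Optimizer(torch.optim.Optimizer):
    """Classical RK4 step for the gradient-flow ODE d(theta)/dt = -grad L(theta)."""

    def __init__(self, params, lr=1e-3):
        super().__init__(params, dict(lr=lr))

    @torch.no_grad()
    def step(self, closure):
        # closure() must zero existing gradients, recompute the loss at the
        # parameters' current values, and call backward() (same contract as LBFGS).
        params = [p for g in self.param_groups for p in g["params"]]
        lr = self.param_groups[0]["lr"]
        theta0 = [p.detach().clone() for p in params]

        def grads():
            with torch.enable_grad():
                closure()
            return [p.grad.detach().clone() for p in params]

        def load(scale, ks):
            # Move to the intermediate point theta0 - scale * lr * k.
            for p, t0, k in zip(params, theta0, ks):
                p.copy_(t0 - scale * lr * k)

        k1 = grads()
        load(0.5, k1); k2 = grads()
        load(0.5, k2); k3 = grads()
        load(1.0, k3); k4 = grads()

        # Combine the four slopes with the classical RK4 weights.
        for p, t0, a, b, c, d in zip(params, theta0, k1, k2, k3, k4):
            p.copy_(t0 - lr * (a + 2 * b + 2 * c + d) / 6.0)
```

A training loop would call opt.step(closure), where closure zeroes gradients, recomputes the loss on the current minibatch, and calls backward(). Because each step costs four gradient evaluations, the wall-clock comparison in step 3 must charge RK4 for this overhead, not just count optimizer steps.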
Expected Contribution
A new class of 'physics-informed' optimizers that allow for significantly larger batch sizes and learning rates, accelerating the training of foundation models.
Required Resources
Standard GPU compute (a single node is sufficient for a proof of concept) and deep knowledge of numerical analysis.
Source Paper
Unifying Learning Dynamics and Generalization in Transformers Scaling Law