Discretization error in the ODE approximation of SGD is the primary cause of training instability at large batch sizes, implying that optimizers built on higher-order numerical integrators would enable larger stable learning rates.

Feasibility: 8 · Novelty: 8

Motivation

The paper models SGD as a continuous ODE (gradient flow), but real training is discrete: plain SGD is a forward-Euler discretization of that ODE, and Adam can be viewed as a preconditioned Euler-style step. If the 'instability' seen in large-batch training is simply the Euler step failing to track the true ODE trajectory at large step sizes, higher-order integrators could stabilize training.
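To make that premise explicit, the identification the argument relies on can be written out as follows (a standard correspondence, not spelled out in the summary above; L is the training loss, theta the parameters, eta the learning rate):

```latex
% Gradient flow, the continuous model of the training dynamics:
\[
  \dot{\theta}(t) = -\nabla L\bigl(\theta(t)\bigr)
\]
% Forward-Euler discretization with step size h = \eta recovers plain SGD:
\[
  \theta_{k+1} = \theta_k - \eta \, \nabla L(\theta_k)
\]
% A single Euler step incurs O(\eta^2) local truncation error, versus O(\eta^5)
% for a classical 4th-order Runge-Kutta step, which is the basis for expecting
% a larger stable step size from higher-order integrators.
```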

Proposed Method

1. Re-interpret the paper's update rules as a first-order (forward-Euler) discretization of the underlying gradient-flow ODE.
2. Implement a custom optimizer based on 4th-order Runge-Kutta (RK4), or on symplectic integrators designed to preserve the energy/Hamiltonian of the system (see the sketch after this list).
3. Compare the maximum stable learning rate and convergence speed per wall-clock second against AdamW on a GPT-2 scale model.
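For step 2, a minimal sketch of what such an optimizer could look like is given below, written as a PyTorch optimizer that integrates the gradient-flow ODE with classical RK4. This is an illustration under stated assumptions, not an established implementation: the class name RK4 is hypothetical, a single parameter group is assumed, the closure interface (modeled on torch.optim.LBFGS) must re-evaluate the loss on the same minibatch, the learning rate is treated as the ODE step size h, and momentum, weight decay, and mixed precision are ignored.

```python
import torch


class RK4(torch.optim.Optimizer):
    """Classical 4th-order Runge-Kutta step on the gradient-flow ODE
    d(theta)/dt = -grad L(theta), with the learning rate as the step size h.

    Hypothetical sketch: requires a closure that zeroes grads, recomputes the
    loss on the *same* minibatch, calls backward(), and returns the loss.
    Each step therefore costs four forward/backward passes.
    """

    def __init__(self, params, lr=1e-3):
        super().__init__(params, dict(lr=lr))

    @torch.no_grad()
    def step(self, closure):
        params = [p for group in self.param_groups for p in group["params"]
                  if p.requires_grad]
        h = self.param_groups[0]["lr"]          # assumes a single param group
        theta0 = [p.detach().clone() for p in params]

        def eval_grad(scale=None, k=None):
            # Optionally move params to theta0 + scale * k, then return the
            # loss there along with the negative gradient (the ODE slope).
            if k is not None:
                for p, t0, ki in zip(params, theta0, k):
                    p.copy_(t0 + scale * ki)
            with torch.enable_grad():
                loss = closure()
            return loss, [
                -p.grad.detach().clone() if p.grad is not None
                else torch.zeros_like(p)
                for p in params
            ]

        loss, k1 = eval_grad()            # slope at theta0
        _, k2 = eval_grad(h / 2, k1)      # slope at midpoint, using k1
        _, k3 = eval_grad(h / 2, k2)      # slope at midpoint, using k2
        _, k4 = eval_grad(h, k3)          # slope at the full step, using k3

        # Combine the four slopes and take the RK4 step from theta0.
        for p, t0, a, b, c, d in zip(params, theta0, k1, k2, k3, k4):
            p.copy_(t0 + (h / 6) * (a + 2 * b + 2 * c + d))
        return loss
```

A hypothetical usage in a training loop (`model`, `batch`, and `loss_fn` are placeholders):

```python
# One optimizer step = four forward/backward passes on the same minibatch.
opt = RK4(model.parameters(), lr=3e-4)

def closure():
    opt.zero_grad()
    loss = loss_fn(model(batch.inputs), batch.targets)
    loss.backward()
    return loss

loss = opt.step(closure)
```

Because each RK4 step costs roughly four times as many gradient evaluations as an AdamW step, the step-3 comparison should indeed be made per wall-clock second rather than per step; a simple protocol for the "maximum stable learning rate" is to increase the learning rate geometrically for each optimizer until the loss diverges.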

Expected Contribution

A new class of 'physics-informed' optimizers that allow for significantly larger batch sizes and learning rates, accelerating the training of foundation models.

Required Resources

Standard GPU compute (a single node is sufficient for a proof of concept) and deep knowledge of numerical analysis.

Source Paper

Unifying Learning Dynamics and Generalization in Transformers Scaling Law
