The ODE-based learning-dynamics framework can be recast as an optimal control problem, allowing the training curriculum (the data sequence) that minimizes convergence time to be derived mathematically rather than chosen heuristically.
Motivation
Current scaling laws mostly assume a static data distribution or a heuristic curriculum schedule. If learning dynamics are truly governed by specific ODEs, then the data distribution at time t can be treated as a control variable, and the rate of error reduction can be maximized analytically.
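To make the control-theoretic framing concrete, one hedged formalization follows. The notation is ours, not the source paper's: the paper's learning-dynamics ODE is abstracted as f, its state as \theta, and the generalization error as \mathcal{E}.

\min_{u(\cdot)} \int_0^T \mathcal{E}(\theta(t))\, dt
\quad \text{subject to} \quad
\dot{\theta}(t) = f(\theta(t), u(t)), \qquad u(t) \in [0, 1], \qquad \theta(0) = \theta_0,

where u(t) is the data-mixing ratio at training time t (e.g., the fraction of code or high-complexity text) and T is the training horizon. Pontryagin's maximum principle then says the optimal schedule u*(t) pointwise maximizes the Hamiltonian H = -\mathcal{E}(\theta) + \lambda^\top f(\theta, u) along the optimal trajectory, which is what steps 1 and 2 of the method below exploit.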
Proposed Method
1. Modify the paper's ODE formulation to include a time-variant control parameter representing data-mixing ratios (e.g., code vs. text complexity).
2. Use Pontryagin's maximum principle or numerical optimal-control methods to solve for the schedule that minimizes the integral of generalization error (a minimal numerical sketch follows this list).
3. Train three 1B-parameter models: one with the derived schedule, one with static mixing, and one with a standard easy-to-hard heuristic, and compare their loss curves.
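As a minimal sketch of the numerical route in step 2, the Python script below solves a toy version of the problem by direct single shooting: a scalar generalization error e(t) evolves under an assumed mixing-dependent rate, the control u(t) (fraction of hard data) is piecewise constant over N bins, and the schedule is optimized under box constraints. Everything here (the dynamics, the constants K_EASY and K_HARD, and the error model) is an illustrative assumption standing in for the source paper's fitted ODE, not its actual formulation.

```python
"""Toy optimal-control sketch via direct single shooting.

Assumed model (NOT the source paper's): scalar generalization error
e(t) in (0, 1) with de/dt = -r(e, u) * e, where u in [0, 1] is the
fraction of 'hard' data. Easy data is effective when error is high,
hard data when error is low, so a time-varying schedule should beat
a static mix.
"""
import numpy as np
from scipy.integrate import solve_ivp, trapezoid
from scipy.optimize import minimize

T, N = 10.0, 40            # training horizon and number of control bins
K_EASY, K_HARD = 1.0, 3.0  # assumed effectiveness constants (illustrative)
E0 = 0.9                   # initial generalization error

t_edges = np.linspace(0.0, T, N + 1)  # bin boundaries for piecewise-constant u

def dynamics(t, e, u_bins):
    """de/dt = -r(e, u) * e with a mixing- and state-dependent rate."""
    idx = min(np.searchsorted(t_edges, t, side="right") - 1, N - 1)
    u = u_bins[idx]
    # Hard data contributes K_HARD * (1 - e): effective at low error.
    # Easy data contributes K_EASY * e: effective at high error.
    rate = u * K_HARD * (1.0 - e) + (1.0 - u) * K_EASY * e
    return -rate * e

def cost(u_bins):
    """Integral of e(t) over [0, T], approximated on a dense grid."""
    t_eval = np.linspace(0.0, T, 400)
    sol = solve_ivp(dynamics, (0.0, T), [E0], args=(u_bins,),
                    t_eval=t_eval, rtol=1e-8, atol=1e-10)
    return trapezoid(sol.y[0], sol.t)

# Optimize the schedule with box constraints 0 <= u <= 1.
res = minimize(cost, x0=np.full(N, 0.5), method="L-BFGS-B",
               bounds=[(0.0, 1.0)] * N)

print(f"static 50/50 mix cost:   {cost(np.full(N, 0.5)):.4f}")
print(f"optimized schedule cost: {res.fun:.4f}")
print("schedule (fraction of hard data per bin):")
print(np.round(res.x, 2))
```

In this toy, the optimizer should recover an approximately bang-bang, easy-to-hard schedule that switches where K_HARD * (1 - e) = K_EASY * e, i.e., at e = 0.75 for the constants above. For the actual proposal, the scalar e would be replaced by the state of the paper's fitted learning-dynamics ODE, with the mixing ratio entering the fitted dynamics as in step 1.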
Expected Contribution
A mathematically rigorous method for curriculum generation that replaces heuristics with control theory, potentially reducing compute requirements for training LLMs.
Required Resources
Access to a cluster capable of training 1B-parameter models; expertise in control theory and differential equations.
Source Paper
Unifying Learning Dynamics and Generalization in Transformers Scaling Law