
The spectral decay of the theoretical kernel limit at initialization predicts the power-law exponent of the scaling law for hybrid architectures (e.g., Attention-SSM hybrids), without requiring full-scale training.

Feasibility: 6 Novelty: 9

Motivation

Scaling laws are usually established empirically by training many models. If the paper's link between kernel behavior and scaling is robust, we should be able to predict the slope (power-law exponent) of the scaling law for new architectures (e.g., Mamba or RWKV) purely from their kernel properties at initialization.

Proposed Method

1. Derive the kernel limit formulation for a hybrid SSM-Transformer architecture using the paper's framework.
2. Calculate the eigenspectrum decay of this kernel.
3. Predict the scaling exponent from the theory.
4. Verify empirically by training a suite of small-to-medium models (100M to 1B parameters), fitting the observed scaling law, and checking alignment with the prediction (a minimal sketch of steps 2-4 follows this list).
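
A minimal sketch of steps 2-4, assuming the initialization kernel's Gram matrix on a probe batch is available. The `gram_matrix` function below is a plain RBF stand-in, not the derived hybrid kernel, and the mapping from the eigenvalue decay exponent a to a predicted scaling exponent uses the classical kernel-regression learning-curve relation error ∝ n^(-(a-1)/a) only as a placeholder for the relation supplied by the source paper's framework; the model sizes and losses at the end are hypothetical.

```python
import numpy as np

def gram_matrix(x, lengthscale=1.0):
    """Stand-in RBF Gram matrix; replace with the derived kernel limit
    of the hybrid architecture or an empirical NTK at initialization."""
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    return np.exp(-d2 / (2.0 * lengthscale ** 2))

def spectral_decay_exponent(K, tail_frac=0.5):
    """Step 2: fit lambda_k ~ k^(-a) on the tail of the eigenspectrum."""
    lam = np.linalg.eigvalsh(K)[::-1]            # eigenvalues, descending
    lam = lam[lam > 1e-12]                       # drop numerical zeros
    k = np.arange(1, len(lam) + 1)
    start = int(len(lam) * (1.0 - tail_frac))
    slope, _ = np.polyfit(np.log(k[start:]), np.log(lam[start:]), 1)
    return -slope                                # decay exponent a

def predicted_scaling_exponent(a):
    """Step 3, placeholder theory link: kernel-regression learning curves
    give test error ~ n^(-(a-1)/a) for eigenvalue decay k^(-a); substitute
    the exponent relation from the source paper's framework here."""
    return (a - 1.0) / a

def empirical_scaling_exponent(sizes, losses):
    """Step 4: fit L(N) ~ c * N^(-b) to the verification suite by
    log-log regression and return b."""
    slope, _ = np.polyfit(np.log(sizes), np.log(losses), 1)
    return -slope

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    probe = rng.normal(size=(512, 32))           # proxy input batch
    a = spectral_decay_exponent(gram_matrix(probe))
    print(f"eigenvalue decay exponent a  = {a:.3f}")
    print(f"predicted scaling exponent   = {predicted_scaling_exponent(a):.3f}")

    # hypothetical (param count, loss) pairs from the 100M-1B verification runs
    sizes = np.array([1e8, 2e8, 5e8, 1e9])
    losses = np.array([3.10, 2.90, 2.70, 2.55])
    print(f"empirical scaling exponent b = {empirical_scaling_exponent(sizes, losses):.3f}")
```

In practice the fit in step 4 may need a saturating form L(N) = c * N^(-b) + L_inf rather than the plain log-log regression used here.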

Expected Contribution

A rapid, low-compute Neural Architecture Search (NAS) proxy that allows researchers to screen new architectures for scalability before investing in large-scale training.
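
As a sketch of how such a proxy could be used for screening, the snippet below (reusing `gram_matrix`, `spectral_decay_exponent`, and `predicted_scaling_exponent` from the sketch above) ranks candidate architectures by their predicted exponent; the candidate names and kernel functions are illustrative stand-ins, not real implementations of those architectures.

```python
import numpy as np

def screen_architectures(candidates, probe_inputs):
    """Rank candidate architectures by predicted scaling exponent, using
    only their initialization-kernel Gram matrices (no training)."""
    scores = {}
    for name, kernel_fn in candidates.items():
        K = kernel_fn(probe_inputs)
        a = spectral_decay_exponent(K)           # helper from the sketch above
        scores[name] = predicted_scaling_exponent(a)
    return dict(sorted(scores.items(), key=lambda kv: -kv[1]))

# Illustrative stand-ins: in practice each entry would build the derived
# kernel limit (or empirical NTK) of that candidate architecture.
candidates = {
    "attention_ssm_hybrid": lambda x: gram_matrix(x, lengthscale=1.0),
    "pure_ssm":             lambda x: gram_matrix(x, lengthscale=2.0),
}
rng = np.random.default_rng(1)
probe = rng.normal(size=(256, 32))
print(screen_architectures(candidates, probe))
```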

Required Resources

Mathematical expertise to derive kernels for new architectures, and moderate compute for the verification runs.

Source Paper

Unifying Learning Dynamics and Generalization in Transformers Scaling Law
