Scaling Laws Aren't Scaling Laws When You're Out of Data
Chinchilla assumes you have enough data to be in the resolution-limited regime. We didn't. So we wrote a scaling law that doesn't.
In language modeling, the loss curve as you grow the model is monotone. Train a 1B-parameter model on enough tokens, and it's better than a 100M one. Train a 100B on the same data, and it's better still. Kaplan et al. and Chinchilla (Hoffmann et al.) formalize this as a power law: the reducible part of the loss falls as N^(-α), smooth and decreasing in N.
We tried to fit this curve on emg2pose, the largest publicly available surface-EMG hand-pose dataset (370 session-hours, 193 users). It refused.
What we saw instead was a U-shape: validation loss decreased as N grew, hit a minimum somewhere between 4M and 17M parameters, and then went back up. This is overfitting, of course — but it's overfitting that standard scaling laws don't predict, because they describe the resolution-limited regime, where the model is the bottleneck. We were in the opposite regime: data was the bottleneck. Around the interpolation threshold the model was just barely big enough to fit the training set exactly, with no capacity left over for generalization.
This is the classical side of double descent. Hastie et al., Mei–Montanari, and Bahri et al. characterize it in linear, kernel, and random-feature settings: at N ≈ D, the validation loss explodes. The peak then shrinks and smooths out as D grows.
What we wanted was a single closed-form fit for the curve at every data fraction — one that recovers Chinchilla when data is abundant and captures the U-shape when it isn't.
the patch
We added one term to the Chinchilla form:
L(N, D) = ε∞ + a · N^(-α) + b · D^(-β) + c · (N/D)^γ
The first three terms are exactly Hoffmann et al.: irreducible loss, resolution bias from finite N, resolution bias from finite D. The fourth term, c · (N/D)^γ, is new. It dominates when N is large relative to D, captures the overfitting upturn, and vanishes in the data-rich limit. The exponent γ controls how aggressively overfitting bites.
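To make the shape concrete, here is a minimal sketch of the four-term form in Python. The coefficients are hypothetical, picked only so the U is visible; they are not the fitted emg2pose values, and measuring D in session-hours is an assumption of the sketch. At fixed D, setting dL/dN = 0 gives a closed-form minimum, N* = (a·α·D^γ / (c·γ))^(1/(α+γ)), which the code checks against a brute-force scan.

```python
import math

# Hypothetical coefficients, chosen only to make the shape visible.
# They are NOT the fitted emg2pose values.
EPS, A, ALPHA, B, BETA, C, GAMMA = 10.0, 2000.0, 0.5, 20.0, 0.5, 1e-5, 1.0


def loss(n, d):
    """L(N, D) = eps + a*N^-alpha + b*D^-beta + c*(N/D)^gamma."""
    return EPS + A * n ** -ALPHA + B * d ** -BETA + C * (n / d) ** GAMMA


def n_star(d):
    """Loss-minimizing N at fixed D, from dL/dN = 0:
    a*alpha*N^(-alpha-1) = c*gamma*N^(gamma-1) / D^gamma."""
    return (A * ALPHA * d ** GAMMA / (C * GAMMA)) ** (1.0 / (ALPHA + GAMMA))


def scan(d, n_lo=1e6, n_hi=1e8, steps=200):
    """Evaluate the loss on a log-spaced grid of model sizes."""
    ns = [n_lo * (n_hi / n_lo) ** (i / (steps - 1)) for i in range(steps)]
    return ns, [loss(n, d) for n in ns]


if __name__ == "__main__":
    for d in (74.0, 370.0):  # assumed units: session-hours
        ns, ls = scan(d)
        i = min(range(len(ls)), key=ls.__getitem__)
        print(f"D={d:6.0f}  grid argmin N={ns[i]:.3g}  closed-form N*={n_star(d):.3g}")
```

Because N* grows as D^(γ/(α+γ)), shrinking the dataset pulls the bottom of the U toward smaller models, and growing it pushes the upturn out of view.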
Fit on a 5×8 grid spanning 1.49M–85M parameters and 20%–100% of emg2pose, the form explains 98.8% of the variance in validation loss (R² = 0.988) with an RMS residual of 0.13 mm. A held-out cell (85M params, 100% data) is predicted within 0.21 mm of the empirical value. The vanilla Chinchilla form (set c = 0) reaches only R² = 0.906 and systematically misses every U.
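For readers who want to try this kind of comparison, here is a rough, stdlib-only sketch of its mechanics. It is not the procedure behind the numbers above. The grid layout, the units (N in millions of parameters, D as a dataset fraction), the coefficients, and the noise level are all made up. To stay dependency-free it also assumes the exponents are held fixed, which makes the model linear in ε∞, a, b, c and solvable by ordinary least squares; a real fit would wrap this in a search over α, β, γ or use a nonlinear optimizer.

```python
import random


def solve(m, y):
    """Solve the square system m x = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    aug = [row[:] + [y[i]] for i, row in enumerate(m)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= f * aug[col][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit(grid, ys, exps, overfit_term):
    """Least-squares coefficients at fixed exponents; overfit_term=False means c = 0."""
    alpha, beta, gamma = exps
    rows = []
    for n, d in grid:
        row = [1.0, n ** -alpha, d ** -beta]
        if overfit_term:
            row.append((n / d) ** gamma)
        rows.append(row)
    p = len(rows[0])
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    w = solve(xtx, xty)
    preds = [sum(wi * xi for wi, xi in zip(w, r)) for r in rows]
    return w, preds


def r_squared(ys, preds):
    mean = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def rms_residual(ys, preds):
    return (sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)) ** 0.5


# Synthetic stand-in for the grid: hypothetical sizes, fractions, and coefficients.
EXPS = (0.5, 0.5, 1.0)           # alpha, beta, gamma (assumed known here)
TRUE_W = (8.0, 3.0, 1.0, 0.005)  # eps, a, b, c (made up)
sizes = [1.5 * (85 / 1.5) ** (i / 7) for i in range(8)]  # millions of parameters
fracs = [0.2, 0.4, 0.6, 0.8, 1.0]                        # fraction of the dataset
grid = [(n, d) for n in sizes for d in fracs]
rng = random.Random(0)
ys = [
    TRUE_W[0] + TRUE_W[1] * n ** -EXPS[0] + TRUE_W[2] * d ** -EXPS[1]
    + TRUE_W[3] * (n / d) ** EXPS[2] + rng.gauss(0.0, 0.05)
    for n, d in grid
]

w_full, p_full = fit(grid, ys, EXPS, overfit_term=True)
w_chin, p_chin = fit(grid, ys, EXPS, overfit_term=False)
print("4-term: R^2 =", round(r_squared(ys, p_full), 4), "RMS =", round(rms_residual(ys, p_full), 4))
print("c = 0 : R^2 =", round(r_squared(ys, p_chin), 4), "RMS =", round(rms_residual(ys, p_chin), 4))
```

Since the c = 0 form is nested inside the four-term form, it can never fit the same grid better. What the comparison shows is how much worse it does once the data has an upturn in it.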
what it buys you
Once you have the fit, you can read off the data-collection budget that hits a target accuracy. To match HaMeR's 75th-percentile error on monocular vision (9.15 mm), our law projects you need roughly 34× more session-hours than emg2pose currently has, at a compute-optimal model of ~2.2B parameters.
That's a concrete number you can argue for or against, instead of "more data is good." It's also a number that goes away as the field grows the dataset and the variance-limited regime recedes. The point isn't that this form is permanent — it's that while you're stuck on the wrong side of the interpolation threshold, you need to admit it in your loss function.
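The budget read-off described above can be sketched in a few lines. This is an illustration under stated assumptions, not a reproduction of the 34× or ~2.2B figures. The coefficients are hypothetical, D is assumed to be in session-hours, and the "best" model at each D is taken to be the loss-minimizing N* at that D, which may differ from the compute-optimal choice used in the projection. Because the best achievable loss falls monotonically with D, a bisection in log space finds the dataset size that reaches a target.

```python
import math

# Hypothetical coefficients for illustration only (not the fitted values).
EPS, A, ALPHA, B, BETA, C, GAMMA = 8.0, 2000.0, 0.5, 20.0, 0.5, 1e-5, 1.0


def loss(n, d):
    """L(N, D) = eps + a*N^-alpha + b*D^-beta + c*(N/D)^gamma."""
    return EPS + A * n ** -ALPHA + B * d ** -BETA + C * (n / d) ** GAMMA


def n_star(d):
    """Loss-minimizing model size at fixed D (solution of dL/dN = 0)."""
    return (A * ALPHA * d ** GAMMA / (C * GAMMA)) ** (1.0 / (ALPHA + GAMMA))


def best_loss(d):
    """Lowest loss the form allows at dataset size D."""
    return loss(n_star(d), d)


def data_needed(target, d_lo=1.0, d_hi=1e12, iters=200):
    """Smallest D in [d_lo, d_hi] whose best loss reaches `target` (log-space bisection)."""
    if target <= EPS:
        raise ValueError("target is at or below the irreducible loss")
    if best_loss(d_lo) <= target:
        return d_lo
    if best_loss(d_hi) > target:
        raise ValueError("target not reachable within the search range")
    for _ in range(iters):
        mid = math.sqrt(d_lo * d_hi)  # geometric midpoint
        if best_loss(mid) > target:
            d_lo = mid
        else:
            d_hi = mid
    return d_hi


if __name__ == "__main__":
    current_hours = 370.0  # emg2pose size quoted in the text
    target_mm = 9.15       # target error quoted in the text
    d = data_needed(target_mm)
    print(f"needs {d / current_hours:.1f}x the current data, best N ~ {n_star(d):.3g}")
```

The guard on ε∞ matters: no amount of data gets the fitted curve below its irreducible term, so a target under that floor signals that the model family, not the dataset, has to change.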