LLM Pre-training and Scaling Laws
Every LLM lab has a fixed pre-training budget. How do you allocate it between model size and data?
TL;DR: With $10M to spend, your compute budget is fixed, so picking your model size locks in how much data you can feed it. DeepMind's Chinchilla paper found that about 20 tokens per parameter is the sweet spot. The industry now goes 200:1 or further. Why? Inference.
Imagine we've raised $10M and we want to pre-train an LLM from scratch with that money.
Today H200s cost $2.32/hour, so that buys us about 4,310,000 GPU-hours.
4,310,000 h × 3,600 s/h × 3.95×10¹⁴ FLOPs/s ≈ 6×10²⁴ FLOPs
Our compute budget will be 6×10²⁴ total FLOPs.
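The same arithmetic as a small Python sketch. The dollar, price, and throughput figures are the ones quoted in the text. Treating 3.95×10¹⁴ FLOPs/s as a sustained per-GPU rate is an assumption, and the constant names are purely illustrative.

```python
# Figures quoted in the text; names are illustrative only.
BUDGET_USD = 10_000_000
PRICE_PER_GPU_HOUR = 2.32        # USD per H200-hour
FLOPS_PER_GPU_SECOND = 3.95e14   # assumed to be a sustained per-GPU rate
SECONDS_PER_HOUR = 3600

gpu_hours = BUDGET_USD / PRICE_PER_GPU_HOUR
total_flops = gpu_hours * SECONDS_PER_HOUR * FLOPS_PER_GPU_SECOND

print(f"GPU-hours:   {gpu_hours:,.0f}")
print(f"Total FLOPs: {total_flops:.2e}")
```

The unrounded result is a little over 6.1×10²⁴ FLOPs, which rounds to the 6×10²⁴ used from here on.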
Now we need to decide how big a model to build, and how much data to train it on.
In practice, once the compute budget is fixed and we choose the transformer's size, we've automatically picked how much data we can feed it.
There's a rule out there that says:
Data = Compute / (6 × Parameters)
So let's go for a 40B model, an arbitrary choice. We jump on Common Crawl, scrape a mountain of text, and start training. We'll burn the whole $10M budget processing 25 trillion tokens. Roughly everything usable on the open web.
What if we picked a bigger model? These are the numbers.
- 40B → 25T tokens.
- 70B → 14.3T tokens.
- 224B → 4.5T tokens.
- 500B → 2T tokens.
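The numbers in the list above follow directly from Data = Compute / (6 × Parameters). The factor of 6 is the usual rough approximation for dense transformers: about 2 FLOPs per parameter per token for the forward pass and about 4 for the backward pass. A minimal sketch, with `tokens_for` as a hypothetical helper name:

```python
COMPUTE_BUDGET = 6e24  # total training FLOPs from the $10M budget

def tokens_for(params: float, compute: float = COMPUTE_BUDGET) -> float:
    """Training tokens affordable under compute ~= 6 * params * tokens."""
    return compute / (6 * params)

for params in (40e9, 70e9, 224e9, 500e9):
    tokens = tokens_for(params)
    ratio = tokens / params  # tokens per parameter
    print(f"{params / 1e9:>4.0f}B params -> {tokens / 1e12:5.1f}T tokens "
          f"({ratio:,.0f} tokens per parameter)")
```

The last column is the ratio the rest of the article cares about. It runs from hundreds of tokens per parameter at 40B down to single digits at 500B.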
Which one ends up with the best model?
Figure: training loss vs. parameter count at a fixed $10M compute budget.
More params = less data = the model runs out of training before it has learned enough. Fewer params = tons of data = a model too small to make use of it.
Where is the sweet spot? Can we predict which one will be the best for us?
This is the question a team at DeepMind set out to answer in early 2022, in a paper that would reshape how the entire industry thought about training large models. They called the resulting model Chinchilla.
They trained over 400 models, varying both the number of parameters and the number of training tokens. They found that roughly 20 tokens per parameter is the sweet spot.
We have a fixed budget X? Then take 20 tokens per model parameter.
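Combining the 20:1 rule with Compute = 6 × Parameters × Data pins down both numbers for any budget: substituting Data = 20 × Parameters gives Parameters = sqrt(Compute / 120). The sketch below assumes exactly that simplification; it is not the paper's fitting procedure, and `chinchilla_split` is a hypothetical name.

```python
import math

def chinchilla_split(compute: float, tokens_per_param: float = 20.0):
    """Solve compute = 6 * N * D under the constraint D = tokens_per_param * N."""
    params = math.sqrt(compute / (6 * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens

params, tokens = chinchilla_split(6e24)
print(f"{params / 1e9:.0f}B params, {tokens / 1e12:.1f}T tokens")
```

For our 6×10²⁴ FLOPs this lands on about 224B parameters and 4.5T tokens, which is the 224B option from the list. Calling `chinchilla_split(6e24, tokens_per_param=200)` instead gives about 71B parameters and 14.1T tokens.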
So is this the industry standard in 2026?
Not really...
20:1 nails it if all you want is the best model per training dollar. Great for research. But the industry doesn't just train models, it serves them. And that's where ✨inference✨ comes in.
A bigger model → more cost in inference. Forever. Every token, every user, every day.
A lot of things matter here, not only the size: the architecture, MoE, speculative decoding…
And revenue? Maybe a bigger, smarter model brings in more paying users. The equation isn't fully closed.
So we need smaller models trained on more tokens than Chinchilla would suggest. The industry tends to go 200:1 or even further.
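A rough way to see the pull toward smaller models is to add serving cost to training cost. The sketch below assumes a dense model costs about 2 × Parameters FLOPs per generated token at inference, and it ignores everything else the article lists (architecture, MoE, speculative decoding, revenue). The served-token volume is a hypothetical number chosen only for illustration, and `lifetime_flops` is a hypothetical helper.

```python
TRAIN_FLOPS = 6e24  # same training budget for every candidate model

def lifetime_flops(params: float, served_tokens: float,
                   train_flops: float = TRAIN_FLOPS) -> float:
    """Training FLOPs plus ~2 * params FLOPs per token served (dense model)."""
    return train_flops + 2 * params * served_tokens

SERVED_TOKENS = 1e13  # hypothetical lifetime inference volume, not from the article

for params in (70e9, 224e9):
    total = lifetime_flops(params, SERVED_TOKENS)
    inference_share = 1 - TRAIN_FLOPS / total
    print(f"{params / 1e9:.0f}B: {total:.2e} lifetime FLOPs, "
          f"{inference_share:.0%} spent on inference")
```

Both models cost the same to train, but the inference term grows linearly with parameter count, so the gap between them widens with every token served.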
Will it plateau?
Back in 2024 there was no sign of it. More compute + more parameters + more data → loss kept going down.
Now in 2026, a lot of people say the scaling law is plateauing. It really isn't.
Here is the trick. On the typical scaling law plot the Y axis is linear, but the X axis is log10. Plot the same data with a linear X and you see a brutal cliff that flattens out almost immediately. Plot it on log10 and it is still a clean line going down.
Same scaling law. Two axes.
Linear X looks plateaued. Log10 X is still going down.
So no, the law itself isn't breaking. The catch is what each step on that log axis costs us. The scaling law won't stall, but the log will break us: every equal step along that axis needs 10× the compute of the one before.
And 10× compute is neither cheap nor fast. Most estimates put us there around 2028.
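To make the cost of each log step concrete, here is a toy power law, loss = A × compute^(−B). The constants A and B are hypothetical, picked only to show the shape and not fitted to any real model. Published fits such as Chinchilla's also include an irreducible loss term, which is left out here.

```python
A, B = 10.0, 0.05  # hypothetical constants, for illustration only

def loss(compute: float) -> float:
    """Toy power-law scaling curve: loss = A * compute ** -B."""
    return A * compute ** -B

base = 6e24  # the $10M budget from earlier
for step in range(4):
    compute = base * 10 ** step
    gain = 1 - loss(compute * 10) / loss(compute)  # relative gain from the next 10x
    print(f"{compute:.0e} FLOPs: loss {loss(compute):.3f}, "
          f"next 10x buys {gain:.1%}")
```

Under this toy curve every 10× step buys the same relative improvement, while the compute bill for that step is ten times larger than the previous one.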
References
- Hoffmann et al. (2022) — Training Compute-Optimal Large Language Models (the Chinchilla paper).
- Sardana, Frankle et al. (2024) — Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws.
- Jingyi Qi — Scaling Law won't stall, but scaling law's log will break us.