2026-05-22 · scaling · code

Strong Teacher Not Needed? On Distillation in LLM Pretraining

Taiming Lu, Zhuang Liu

Key claim

Weaker teachers can still improve larger models.

The study finds that even a weaker teacher can improve a larger student model when the losses are mixed properly. It also finds that a stronger teacher does not always yield better results: a teacher with too many parameters or too much training can reduce the benefit of distillation. Notably, distillation improves generalization more than it improves in-domain fit.
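
To make "a mix of losses" concrete, here is a minimal sketch of one common way to combine a next-token cross-entropy term with a teacher-matching KL term for a single token position. It is an illustration under stated assumptions, not the paper's formulation: the function names, the weight `alpha`, and the `temperature` parameter are hypothetical, and the summary does not specify which mix the authors use.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_distillation_loss(student_logits, teacher_logits, target_index,
                            alpha=0.5, temperature=1.0):
    """Hypothetical mixed loss for one token position.

    loss = (1 - alpha) * CE(student, target) + alpha * KL(teacher || student)

    alpha and temperature are assumed knobs, not values from the paper.
    """
    # Hard-label term: cross-entropy of the student against the true token.
    student_probs = softmax(student_logits)
    cross_entropy = -math.log(student_probs[target_index])

    # Soft-label term: KL divergence from the teacher's distribution
    # to the student's, both softened by the same temperature.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )

    return (1.0 - alpha) * cross_entropy + alpha * kl


# Example: a student, a differently calibrated teacher, true token at index 0.
student = [2.0, 0.5, -1.0]
teacher = [1.0, 1.0, 0.0]
loss = mixed_distillation_loss(student, teacher, target_index=0, alpha=0.5)
```

With `alpha=0.0` the sketch reduces to ordinary cross-entropy training, and with `alpha=1.0` the student only matches the teacher. A mix between the two lets the ground-truth signal limit how far a weak teacher can pull the student.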

Novelty

7.5/10

The paper challenges the traditional strong-to-weak assumption in knowledge distillation.

Reliability

7.0/10

The methodology includes varying architectures and training budgets, providing a solid evaluation framework.

Deep reliability assessment

The methodology supports the finding that weaker teachers can improve stronger students under certain conditions, but it may overclaim the generalizability of these results across all configurations and tasks.

Reproducibility

Yes, the code is available here.

Discussion questions

What assumptions about teacher strength in knowledge distillation can be challenged based on these findings?
How can builders apply these insights to optimize their model training processes?
What specific conditions or configurations would lead to a failure of the results presented in this paper?

Key figure

Figure 1 illustrates that effective distillation in LLM pretraining depends on the compatibility between teacher and student models rather than solely on the strength of the teacher.

Codelink

Our code is available here. (Official)

Read on arXiv →