LLMSurgeon: Diagnosing Data Mixture of Large Language Models
Yaxin Luo, Jiacheng Cui, Xiaohan Zhao, Xinyi Shang, Jiacheng Liu, Xinyue Bi, Zhaoyi Li, Zhiqiang Shen
Key claim
LLMSurgeon enables effective auditing of LLM pretraining data.
The paper presents LLMSurgeon, a framework that estimates the pretraining data mixture of a large language model from text the model generates. The authors frame this as auditing a foundation model's 'digital DNA': recovering domain-level mixture proportions without direct access to the training data. The key reported result is that LLMSurgeon recovers domain mixtures with high fidelity under fixed protocols.
In plain English
The introduction of Data Mixture Surgery and LLMSurgeon represents a significant advancement in auditing LLMs' pretraining data.
The evaluation suite LLMScan provides a solid basis for assessing the method's effectiveness, though more extensive baselines could strengthen the claims.
Deep reliability assessment
The methodology supports post-hoc estimation of domain-level mixture proportions only under a predefined taxonomy, fixed prompting protocol, and a strong label-shift assumption validated mainly on open models with known mixtures. It overclaims when framed as recovering an LLM's true pretraining "digital DNA", because generated text is affected by prompting, alignment, decoding, memorization, domain classifier bias, and taxonomy choice, especially for closed or instruction-tuned models.
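The label-shift assumption mentioned above can be written out explicitly. The following is the standard confusion-matrix formulation of label shift, given here as a sketch; the symbols are assumed for illustration and are not taken from the paper, whose exact estimator may differ.

```latex
% Assumed notation (illustrative, not the paper's):
%   \pi_j  : proportion of domain j in the pretraining mixture (unknown)
%   C_{ij} : P(classifier predicts i | true domain j), measured on held-out reference data
%   q_i    : P(classifier predicts i) on text generated by the target LLM
%   K      : number of domains in the predefined taxonomy
\begin{align}
  q_i &= \sum_{j=1}^{K} C_{ij}\,\pi_j
  \quad\Longleftrightarrow\quad q = C\pi, \\
  \hat{\pi} &= C^{-1}\hat{q}
  \quad \text{(when $C$ is invertible).}
\end{align}
```

The first identity holds only if the confusion matrix measured on reference data also describes how the classifier behaves on generated text. Prompting, alignment, and decoding can each break that invariance, which is why the estimate is conditional on the protocol rather than a direct readout of the training mixture.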
Reproducibility
Partial. The paper states "Code & Data: LLMSurgeon" and introduces LLMScan as a recipe-verifiable evaluation suite built from open-source LLMs with transparent pretraining mixtures, but no exact repository URL is visible in the provided text.
Discussion questions
1. How plausible is the label-shift assumption that domain-specific language patterns remain invariant between reference corpora, pretraining data, and generated samples from an aligned LLM?
2. For builders, could this be used as a practical compliance or data-audit tool, or would prompt choice, decoding settings, and taxonomy design make the estimates too easy to manipulate?
3. What experiment would falsify LLMSurgeon: for example, if models with fully known mixtures repeatedly produce unrecoverable or prompt-sensitive estimates despite the same taxonomy and sampling protocol?
Key figure
Figure 2 shows LLMSurgeon as a three-stage pipeline: train a domain classifier on labeled reference data, sample neutral text from the target LLM, then use a calibrated confusion matrix to invert biased classifier predictions into estimated training-domain proportions.
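The third stage of the pipeline described above, inverting classifier predictions through a confusion matrix, can be sketched in a few lines. This is a minimal illustration assuming NumPy and a generic least-squares correction followed by clipping and renormalization; the function names (estimate_confusion, predicted_histogram, estimate_mixture) are hypothetical, and the paper's actual calibration and inversion procedure may differ.

```python
import numpy as np


def estimate_confusion(true_labels, pred_labels, num_domains):
    """Column-normalized confusion matrix from held-out reference data.

    C[i, j] approximates P(classifier predicts i | true domain j).
    Assumes every domain appears at least once in true_labels.
    """
    counts = np.zeros((num_domains, num_domains))
    for t, p in zip(true_labels, pred_labels):
        counts[p, t] += 1
    return counts / counts.sum(axis=0, keepdims=True)


def predicted_histogram(pred_labels, num_domains):
    """Fraction of generated samples assigned to each domain by the classifier."""
    hist = np.bincount(np.asarray(pred_labels), minlength=num_domains)
    return hist / hist.sum()


def estimate_mixture(confusion, pred_hist):
    """Solve confusion @ pi = pred_hist for the mixture pi.

    Least squares is used instead of an explicit inverse so that a
    poorly conditioned matrix does not raise. Negative entries are
    clipped and the result renormalized, which is a simplification
    (not a true projection onto the probability simplex).
    """
    pi_raw, *_ = np.linalg.lstsq(confusion, pred_hist, rcond=None)
    pi = np.clip(pi_raw, 0.0, None)
    return pi / pi.sum()
```

In use, estimate_confusion would be run once on labeled reference text, predicted_histogram on classifier outputs for text sampled from the target LLM, and estimate_mixture on the two results. When the classifier confuses domains heavily, the confusion matrix becomes close to singular and the recovered proportions become sensitive to small sampling noise in the histogram.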
