Lost in Sampling: Assessing Lexical Reachability in LLMs via the Word Coverage Score (WCS)
Samer Awad, Javier Conde, Carlos Arriaga, Tairan Fu, Javier Coronado-Blázquez, Pedro Reviriego
Key claim
Standard sampling methods suppress linguistic diversity in LLMs.
In plain English
This paper investigates how standard sampling methods in LLMs limit linguistic diversity. It introduces the Word Coverage Score (WCS) to measure the impact of these sampling filters on the use of low-frequency, high-information words. The key finding is that common sampling defaults can unintentionally censor diverse language, leading to more homogeneous text outputs.
The Word Coverage Score (WCS) is a new metric for evaluating how much lexical diversity survives decoding in LLM outputs.
The study uses empirical audits on human-authored texts, supporting its claims with quantitative evidence.
Deep reliability assessment
The methodology supports measuring whether specific human-authored target word tokens would survive particular decoding filters under forced-prefix evaluation, which is a useful diagnostic of lexical reachability. It overclaims when framing pruning as “censorship” or as a general explanation for LLM homogeneity without showing downstream generation quality/diversity trade-offs or separating decoder effects from model probability calibration, tokenization, domain mismatch, and alignment.
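The per-token survival check described above can be illustrated with a minimal sketch. It assumes a next-token probability distribution has already been obtained by feeding the human-written prefix to a model; the function name `survives_filters` and its parameters are hypothetical and are not taken from the paper. As a simplification, each filter is tested independently against the original distribution, whereas real decoding stacks typically chain filters and renormalize between them.

```python
from typing import Optional, Sequence


def survives_filters(
    probs: Sequence[float],
    target_id: int,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    min_p: Optional[float] = None,
) -> bool:
    """Return True if target_id could still be sampled after the given filters.

    Hypothetical helper: each filter is checked on the unfiltered distribution.
    """
    # Rank token ids from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    rank = order.index(target_id)  # 0-based rank of the target token

    # Top-k keeps only the k most probable tokens.
    if top_k is not None and rank >= top_k:
        return False

    # Top-p (nucleus) keeps the shortest ranked prefix whose cumulative mass
    # reaches top_p, so the target survives only if the mass ranked strictly
    # above it is still below top_p.
    if top_p is not None:
        mass_before = sum(probs[i] for i in order[:rank])
        if mass_before >= top_p:
            return False

    # Min-p keeps tokens with probability at least min_p times the maximum.
    if min_p is not None and probs[target_id] < min_p * probs[order[0]]:
        return False

    return True


# Toy distribution over four token ids; id 2 plays the rare target word.
toy_probs = [0.5, 0.3, 0.15, 0.05]
print(survives_filters(toy_probs, 2, top_k=2))    # pruned by top-k
print(survives_filters(toy_probs, 2, top_p=0.9))  # kept by top-p
```

A token that fails any active filter has zero probability of being sampled at that step, regardless of temperature, which is the sense in which a word is "unreachable" under forced-prefix evaluation.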
Reproducibility
No code repository is mentioned in the provided paper text. The datasets and lexical sources are named (Google Web Trillion Word Corpus frequency ranks, Moby Word Lists for dictionary validation, and PG-19 test contexts), but implementation details and the exact sampled word and context lists are not provided in the excerpt.
Discussion questions
1. Does forced-path survival of preselected rare words actually measure desirable lexical diversity, or does it mainly measure whether the model assigns high probability to a particular historical author’s word choice?
2. For builders, how should decoding defaults be tuned if increasing WCS also increases incoherence, hallucination, latency, or brand/style inconsistency in production applications?
3. What evidence would falsify the paper’s claim that sampling filters drive homogenization: for example, if high-WCS decoding produced outputs with similar lexical diversity to standard decoding, or if models still showed the same rare-word suppression under unfiltered sampling?
Key figure
Figure 1 depicts a four-stage WCS pipeline: choose frequency-bounded middle-long-tail words, find human-written contexts containing them, audit whether each target token survives decoding filters, and aggregate survival into a word coverage score.
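The four stages in the figure can be sketched end to end on toy data. This is an illustration under stated assumptions, not the authors' implementation: the helper names, the whitespace tokenization, the rank bounds, and the aggregation rule (the fraction of audited word-context pairs whose target survives) are all hypothetical, since the summary does not give the paper's exact WCS formula. The `survives` callable stands in for a real model-based audit.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Set, Tuple


def select_targets(
    freq_rank: Dict[str, int], dictionary: Set[str], lo: int, hi: int
) -> List[str]:
    """Stage 1: keep dictionary words whose frequency rank lies in [lo, hi]."""
    return sorted(
        w for w, r in freq_rank.items() if lo <= r <= hi and w in dictionary
    )


def find_contexts(
    targets: Sequence[str], passages: Iterable[str]
) -> List[Tuple[str, str]]:
    """Stage 2: pair each target with the text before its first occurrence."""
    pairs = []
    for passage in passages:
        words = passage.split()  # assumption: naive whitespace tokenization
        for target in targets:
            if target in words:
                idx = words.index(target)
                pairs.append((target, " ".join(words[:idx])))
    return pairs


def word_coverage_score(
    pairs: Sequence[Tuple[str, str]], survives: Callable[[str, str], bool]
) -> float:
    """Stages 3-4: audit each (word, prefix) pair and average the outcomes."""
    if not pairs:
        return 0.0
    hits = sum(1 for word, prefix in pairs if survives(prefix, word))
    return hits / len(pairs)


# Toy inputs; real runs would use corpus frequency ranks and book passages.
ranks = {"the": 1, "lantern": 12000, "quixotic": 45000, "zzz": 30000}
vocab = {"the", "lantern", "quixotic"}
targets = select_targets(ranks, vocab, lo=10000, hi=50000)
pairs = find_contexts(targets, ["she lit the lantern slowly", "a quixotic plan"])

# Stub auditor: pretends only "lantern" survives the decoding filters.
score = word_coverage_score(pairs, lambda prefix, word: word == "lantern")
print(targets, pairs, score)
```

Keeping the auditor as a pluggable callable separates the word and context selection from the model-specific question of whether a token survives a given filter configuration.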