2026-05-27 · agents · reasoning · scaling · rlhf

Extrapolative Weight Averaging Reveals Correctness-Efficiency Frontiers in Code RL

Kunhao Zheng, Pierre Chambon, Juliette Decugis, Jonas Gehring, Taco Cohen, Benjamin Negrevergne, Gabriel Synnaeve

Key claim

Extrapolative weight averaging of code RL checkpoints raises pass@250 on LiveCodeBench hard by 3.3% over the best single checkpoint at a matched sample budget.

This study explores how extrapolative weight averaging can improve reinforcement learning for code by navigating a correctness-efficiency frontier. The key result is a solve rate on challenging problems that is 3.3% higher than that of the best single checkpoint, which makes the method relevant to builders working on code-related RL tasks.

Novelty

8.0/10

The paper introduces a novel approach to extending the correctness-efficiency frontier in reinforcement learning for competitive programming.

Reliability

7.5/10

The claims are supported by experiments across multiple settings and model scales, though some limitations in baseline comparisons exist.

Deep reliability assessment

The methodology supports a controlled within-domain claim: for shared-initialization code RL checkpoints trained under nested verifier strictness, interpolation and moderate extrapolation navigate a correctness–efficiency frontier across the tested model sizes and inference modes. Broader claims about other domains, true cost-matched inference scaling, or moving the frontier itself are less well supported, because the experiments are limited to competitive programming and the authors release no code or checkpoints.

Reproducibility

No open-source training code, model checkpoints, or new datasets are released. Base models and datasets are public: Qwen 2.5 7B, CWM-SFT 32B, CodeContestsPlus, OpenCodeReasoning-2, OpenMathReasoning, and LiveCodeBench; the paper reports objectives, verifier thresholds, evaluation protocol, hyperparameters, compute, and statistical procedures.

Discussion questions

1. Does nested unit-test coverage really isolate an ordered correctness–efficiency axis, or does it also change reward sparsity, exploration behavior, and learned coding style in ways that make the weight direction harder to interpret?
2. For builders, is extrapolative weight averaging a cheaper and safer way to get policy diversity than training more RL checkpoints, especially when deployment cost should be matched by tokens or wall-clock rather than sample count?
3. What result would falsify the frontier interpretation: failure to reproduce extrapolation on a new benchmark, extrapolated checkpoints not solving complementary problems, or disappearance of the effect when using token/wall-clock-matched evaluation?

Key figure

Figure 1 shows that linear interpolation between low- and high-coverage RL checkpoints traces a correctness-versus-optimization-failure frontier on LCB/hard, while extrapolation continues the same frontier until extreme weights cause format errors and token exhaustion.
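
The interpolation and extrapolation described for Figure 1 can be illustrated with a minimal sketch. It assumes the common parameterization theta(alpha) = (1 - alpha) * theta_low + alpha * theta_high, which this summary does not confirm as the paper's exact formula. The name `combine_checkpoints` and the list-of-floats "state dicts" are hypothetical stand-ins for real model tensors.

```python
from typing import Dict, List

# Hypothetical stand-in for a model state dict: parameter name -> flat weights.
StateDict = Dict[str, List[float]]


def combine_checkpoints(theta_low: StateDict, theta_high: StateDict, alpha: float) -> StateDict:
    """Return (1 - alpha) * theta_low + alpha * theta_high, parameter by parameter.

    0 <= alpha <= 1 interpolates between the two checkpoints;
    alpha < 0 or alpha > 1 extrapolates beyond one of them.
    (Assumed parameterization, not taken from the paper.)
    """
    if theta_low.keys() != theta_high.keys():
        raise ValueError("checkpoints must share the same parameter names")
    merged: StateDict = {}
    for name, low in theta_low.items():
        high = theta_high[name]
        if len(low) != len(high):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [(1.0 - alpha) * a + alpha * b for a, b in zip(low, high)]
    return merged


# Toy checkpoints standing in for low- and high-coverage RL runs.
low = {"w": [0.0, 1.0], "b": [2.0]}
high = {"w": [1.0, 3.0], "b": [2.0]}

interpolated = combine_checkpoints(low, high, 0.5)  # midpoint of the two
extrapolated = combine_checkpoints(low, high, 1.5)  # past the high-coverage checkpoint
```

In this sketch, sweeping `alpha` over a grid and evaluating each merged model is what would trace a curve like the one in Figure 1. The summary reports that extreme weights eventually cause format errors and token exhaustion, so the usable range of `alpha` would have to be found empirically.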

Benchmark results

LiveCodeBench hard, pass@250 improvement over best single checkpoint at matched sample budget: +3.3%
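
For readers unfamiliar with the pass@k metric above, the sketch below shows the widely used unbiased estimator, 1 - C(n - c, k) / C(n, k), for a problem with n samples of which c are correct. Whether the paper computes pass@250 this way is an assumption; the summary states only the metric and the matched sample budget. The function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total samples drawn, c: samples that passed, k: evaluation budget.
    Equals the probability that at least one of k samples, drawn without
    replacement from the n, is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Toy example: 4 samples, 1 correct, budget of 2.
example = pass_at_k(n=4, c=1, k=2)
```

A benchmark-level score would average this quantity over problems. Comparing a merged checkpoint with the best single checkpoint at the same n and k is one way to read "matched sample budget".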