2026-05-18 · agents · reasoning · rlhf

Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models

Junyao Yang, Chen Qian, Kun Wang, Linfeng Zhang, Quanshi Zhang, Yong Liu, Dongrui Liu

Key claim

CorR-PO improves reasoning performance in LRMs.

This paper proposes improving reasoning in Large Reasoning Models (LRMs) by exploiting a correlation between token entropy and logit gradients, a phenomenon the authors call Entropy-Gradient Inversion. The proposed method, CorR-PO, consistently outperforms existing techniques such as GRPO on the reported benchmarks, indicating that stronger entropy-gradient inversion leads to better reasoning performance.
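To make the quantities in this claim concrete, the following is a minimal sketch of how a per-token entropy and a per-token logit-gradient magnitude could be measured and correlated. It rests on assumptions that are not taken from the paper: the gradient is taken to be that of the standard cross-entropy loss with respect to the logits (which equals the softmax probabilities minus the one-hot target), the magnitude is an L2 norm, and the correlation is Pearson's r. The function names are hypothetical, and the paper's own definition of Entropy-Gradient Inversion may differ.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(probs):
    """Shannon entropy (in nats) of one token's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def logit_grad_norm(probs, target):
    """L2 norm of d(-log p[target]) / d(logits), which is p - onehot(target).

    Assumption: cross-entropy loss; the paper may use a different objective.
    """
    return math.sqrt(sum(
        (p - (1.0 if i == target else 0.0)) ** 2 for i, p in enumerate(probs)
    ))

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0.0 or vy == 0.0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def entropy_gradient_correlation(logits_per_token, targets):
    """Correlate token entropy with logit-gradient norm across a sequence."""
    entropies, grad_norms = [], []
    for logits, target in zip(logits_per_token, targets):
        probs = softmax(logits)
        entropies.append(token_entropy(probs))
        grad_norms.append(logit_grad_norm(probs, target))
    return pearson(entropies, grad_norms)
```

Calling `entropy_gradient_correlation` on the logits and sampled tokens of a generated sequence returns a value in [-1, 1]; its sign gives the direction of the entropy-gradient relationship under these assumptions.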

Novelty

8.0/10

The introduction of Entropy-Gradient Inversion and its application in CorR-PO represents a meaningful extension to existing reinforcement learning methods for reasoning.

Reliability

7.5/10

The methodology is solid and supported by experiments on AIME24, MATH500, and GSM8k, though the authors' institution is not specified.

Deep reliability assessment

The methodology supports the claim that embedding Entropy-Gradient Inversion into reinforcement learning improves reasoning performance, but the generalizability to other model architectures and tasks beyond those tested is not fully established.

Reproducibility

Yes. The paper uses publicly available models and datasets and provides detailed experimental settings and hyperparameters.

Discussion questions

How does the Entropy-Gradient Inversion phenomenon generalize to other types of reasoning tasks beyond those tested?
What are the practical implications of using CorR-PO for real-world applications that require reasoning capabilities?
What specific conditions or experiments could falsify the claim that Entropy-Gradient Inversion is a definitive fingerprint of reasoning capability?

Key figure

Figure 1 illustrates the Entropy-Gradient Inversion phenomenon, showing the correlation between token entropy and logit gradients in reasoning models.

Benchmark results

AIME24 Pass@16: 56.7 (vs GRPO: +6.7%, SOTA)

MATH500 Pass@1: 72.6 (vs GRPO: +1.2%, SOTA)

GSM8k Major@16: 91.9 (vs GRPO: +1.3%, SOTA)
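For readers unfamiliar with the metric names, here is a minimal sketch of how Pass@k and majority-vote accuracy are commonly computed per problem. It assumes the widely used unbiased Pass@k estimator (n samples drawn, c of them correct) and reads "Major@16" as majority voting over 16 sampled answers. The summary does not state which estimator or voting rule the paper uses, and the function names are hypothetical.

```python
from collections import Counter
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of Pass@k from n samples, c of which are correct.

    Equals 1 - C(n - c, k) / C(n, k): the probability that a random
    size-k subset of the n samples contains at least one correct answer.
    """
    if n - c < k:
        # Fewer than k incorrect samples, so every size-k subset has a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def majority_vote_correct(answers, reference):
    """True if the most frequent sampled answer matches the reference.

    Ties are broken in favor of the answer that appeared first.
    """
    top_answer, _ = Counter(answers).most_common(1)[0]
    return top_answer == reference
```

Averaging `pass_at_k(...)` or `majority_vote_correct(...)` over all problems in a benchmark gives a score on the same scale as the numbers above, once multiplied by 100.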
