2026-05-27 | agents, reasoning

Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor

Guoxin Ma, Yibing Liu, Chengzhengxu Li, Yu Liang, Yan Wang, Yueyang Zhang, Kecheng Chen, Zhaohan Zhang, Zhiyuan Sun, Daiting Shi



Key claim

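At 8x compression, TaC averages 65.42 F1 and 51.79 Exact Match across four long-context QA benchmarks, reported as +23.4% F1 and +21.7% EM over the strongest existing compression baseline.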

This paper introduces Thinking as Compression (TaC), an approach in which a reasoning LLM compresses a long context by generating a thinking trace that then serves as a query-conditioned stand-in for the original context. The method outperforms existing compression techniques on F1 and Exact Match at high compression ratios.

In plain English

Novelty

8.0/10

The introduction of Thinking as Compression (TaC) presents a significant new method for context compression that leverages the intrinsic capabilities of LLMs.

Reliability

7.5/10

The experiments across multiple benchmarks provide solid evidence supporting the claims, although more extensive baselines could enhance reliability.

Deep reliability assessment

The experiments support that reasoning traces can serve as effective query-conditioned compressed contexts for long-context QA, and that reward tuning improves budget control and discourages answer-like shortcuts. The broader claim that reasoning models are generally intrinsic context compressors is overextended without stronger evidence on end-to-end latency/cost, non-QA tasks, tool/agent histories, code, and extremely long contexts.

Reproducibility

No code repository is mentioned in the provided abstract, introduction, conclusion, or visible references. Datasets are identifiable only at a high level as four long-context QA benchmarks, with some cited benchmarks such as LongBench, Natural Questions, and multi-hop QA datasets, but the exact experimental setup is not fully recoverable from the excerpt.

Discussion questions

1. Is a thinking trace truly a reusable compressed context, or is it closer to an intermediate answer/rationale that may leak task-specific shortcuts and reduce verifiability?
2. For production RAG systems, does TaC-C reduce total cost and latency once the extra thinking/compression pass is included, or is it mainly useful when the compressed trace is reused across multiple downstream generations?
3. What result would falsify the core claim: failure on adversarial multi-hop QA, poor transfer to non-QA tasks like code or tool-use logs, or lower end-to-end efficiency than simpler retrieval/pruning baselines?
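
The cost trade-off raised in question 2 can be made concrete with a back-of-the-envelope token count. The sketch below is not from the paper: `CostModel`, its fields, and the example numbers are hypothetical, and it assumes cost scales linearly with tokens processed while ignoring question tokens, answer generation, and latency. It only illustrates why reuse of the compressed trace matters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostModel:
    """Hypothetical token-cost model for one long context (illustrative only)."""

    context_tokens: int          # length of the original long context
    ratio: float                 # compression ratio, e.g. 8.0 for "8x"
    output_weight: float = 1.0   # assumed cost of a generated token relative to an input token

    def __post_init__(self) -> None:
        if self.ratio <= 1.0:
            raise ValueError("ratio must be > 1 for compression to save tokens")

    @property
    def trace_tokens(self) -> float:
        # Length of the compressed trace implied by the ratio.
        return self.context_tokens / self.ratio

    def baseline_cost(self, reuses: int) -> float:
        # Every downstream generation reads the full context.
        return self.context_tokens * reuses

    def compressed_cost(self, reuses: int) -> float:
        # One pass reads the full context and generates the trace,
        # then every downstream generation reads only the trace.
        compress_pass = self.context_tokens + self.output_weight * self.trace_tokens
        return compress_pass + self.trace_tokens * reuses

    def break_even_reuses(self) -> int:
        # Smallest number of downstream generations for which compression is cheaper.
        reuses = 1
        while self.compressed_cost(reuses) >= self.baseline_cost(reuses):
            reuses += 1
        return reuses


model = CostModel(context_tokens=8000, ratio=8.0)
print(model.compressed_cost(1), model.baseline_cost(1))
print(model.break_even_reuses())
```

Under these simplified assumptions a single downstream generation costs more with compression than without, because the full context is still read once to produce the trace. Whether that holds for TaC in practice depends on measurements the excerpt does not report.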

Key figure

Figure 1 contrasts task-agnostic global compression, task-aware relevance-based segment selection, and TaC, where a thinking model dynamically focuses, skips, revisits, links, and organizes evidence into a compact trace.

Benchmark results

Four long-context QA benchmarks (averaged), F1 at 8x compression: 65.42; +23.4% average F1 vs. the strongest existing compression baseline (not named in excerpt); SOTA

Four long-context QA benchmarks (averaged), Exact Match at 8x compression: 51.79; +21.7% average EM vs. the strongest existing compression baseline (not named in excerpt); SOTA
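
For readers less familiar with the two metrics reported above, the sketch below shows the common SQuAD-style convention for Exact Match and token-level F1 on a single prediction. The excerpt does not state which normalization or scoring script the paper uses, so treat this as an assumption about the usual definition rather than the authors' evaluator; the function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation and English articles, then split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall after normalization."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # If either side is empty, score 1.0 only when both are empty.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(round(token_f1("Paris, France", "Paris"), 3))
```

Benchmark-level scores such as those listed above are typically averages of these per-example values, often taking the maximum over multiple reference answers where a dataset provides them.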