2026-05-28 | reasoning | scaling

Unlocking the Working Memory of Large Language Models for Latent Reasoning

Lukas Aichberger, Sepp Hochreiter

Key claim

RiM enables efficient latent reasoning using working memory.

In plain English

The paper presents Reasoning in Memory (RiM), a novel approach that enhances the reasoning capabilities of large language models by using fixed memory blocks instead of autoregressive generation. This method allows for compute-efficient reasoning and shows promising results on reasoning benchmarks, matching or exceeding existing methods. The key takeaway is that RiM enables large language models to utilize working memory effectively for reasoning tasks.

Novelty

8.0/10

The introduction of memory blocks for latent reasoning represents a significant advancement in how reasoning is approached in large language models.

Reliability

7.5/10

The experiments demonstrate strong performance across various models, supporting the claims made about the effectiveness of the proposed method.

Deep reliability assessment

The methodology supports the claim that fixed special-token memory blocks can learn input-dependent representations and can improve or maintain reasoning accuracy relative to some latent-reasoning baselines under the authors' GSM8K-style evaluation protocol. The paper overclaims if it is read as evidence of human-like working memory or of broadly general latent reasoning, since the provided evidence is mostly benchmark accuracy and representation analysis rather than causal proof of internal computation.

Reproducibility

Code: no repository is mentioned in the provided abstract, introduction, results, limitations, or conclusion excerpts. Data: partially reproducible because GSM8K is public and the paper describes models, stages, checkpoint selection, and evaluation protocol, but full reproduction would require implementation details and hyperparameters not present in the excerpt.

Discussion questions

1. Does replacing generated chain-of-thought with fixed memory tokens actually create a qualitatively different reasoning mechanism, or is it just a learned prompt/adapter-like computation budget?
2. For builders, is the latency gain from parallel memory-block processing large enough to justify fine-tuning and maintaining a separate RiM-trained model instead of using shorter CoT, distillation, or verifier-guided decoding?
3. What experiment would falsify RiM: failure to transfer to harder out-of-distribution reasoning tasks, no accuracy gain when controlling for total FLOPs, or causal interventions showing memory-block activations do not affect final answers?

Key figure

Figure 1 shows RiM’s two-stage curriculum: Stage 1 places fixed memory blocks before supervised reasoning-step targets, then Stage 2 removes step supervision and trains the model to refine the final answer after each memory block.
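
The two-stage layout described for Figure 1 can be illustrated with a small Python sketch. This is one possible reading of the caption, not the authors' implementation: the `<mem>` token name, the helper functions, the block size, and the choice to append the final answer at the end of the Stage 1 sequence are all assumptions made for illustration. Each function returns the token sequence together with a mask marking which positions would receive a supervised loss.

```python
from typing import List, Tuple

MEM = "<mem>"  # hypothetical special token; the real token name is not given in the summary


def memory_block(size: int) -> List[str]:
    """Return a fixed block of identical memory tokens (assumed form)."""
    return [MEM] * size


def stage1_sequence(
    question: List[str],
    steps: List[List[str]],
    answer: List[str],
    block_size: int,
) -> Tuple[List[str], List[bool]]:
    """Stage 1 (assumed layout): a memory block precedes each supervised reasoning step."""
    tokens = list(question)
    supervised = [False] * len(question)
    for step in steps:
        # Memory tokens are inputs only; no loss is placed on them.
        tokens += memory_block(block_size)
        supervised += [False] * block_size
        # The reasoning step that follows the block is a supervised target.
        tokens += step
        supervised += [True] * len(step)
    # Assumption: the final answer closes the sequence and is also supervised.
    tokens += answer
    supervised += [True] * len(answer)
    return tokens, supervised


def stage2_sequence(
    question: List[str],
    num_blocks: int,
    answer: List[str],
    block_size: int,
) -> Tuple[List[str], List[bool]]:
    """Stage 2 (assumed layout): no step targets; the answer is the target after every block."""
    tokens = list(question)
    supervised = [False] * len(question)
    for _ in range(num_blocks):
        tokens += memory_block(block_size)
        supervised += [False] * block_size
        # The same final answer is the target after each block, so later
        # blocks can only refine it.
        tokens += answer
        supervised += [True] * len(answer)
    return tokens, supervised
```

Under these assumptions, the only difference between the stages is what follows each memory block: an intermediate step in Stage 1, the final answer in Stage 2. Whether the actual method places answer tokens in the sequence or reads the answer out some other way is not specified in this summary.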