2026-05-27 | agents, reasoning, infra, code

MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems

Xinle Deng, Ruobin Zhong, Hujin Peng, Xiaoben Lu, Yanzhe Wu, Guang Li, Buqiang Xu, Yunzhi Yao, Jizhan Fang, Haoliang Cao, Junjie Guo, Yuan Yuan, Ziqing Ma, Yuanqiang Yu, Rui Hu, Baohua Dong, Hangcheng Zhu, Ningyu Zhang


Key claim

Memory tracing improves LLM end-task performance by up to 7.62%.

This paper introduces a framework for tracing errors in the memory systems of large language models, which helps identify and correct systematic memory failures. The key result is that using the approach can improve end-task performance by up to 7.62%. The work targets the reliability of memory in LLMs.

In plain English

Novelty

8.0/10

The proposed framework for error tracing in LLM memory systems represents a significant advancement in understanding and improving memory reliability.

Reliability

7.5/10

The study includes a benchmark and an automatic attribution method, providing solid experimental validation for its claims.

Deep reliability assessment

The methodology supports graph-based, operation-level debugging of selected non-parametric LLM memory systems and shows that structured execution traces can help attribute many observed failures better than flat logs. The broader claim of general automatic root-cause diagnosis is overstated because the benchmark is small, limited to singleton decisive errors, and evaluated on a narrow set of memory architectures and on synthetic or public trajectories.

Reproducibility

Code: yes, the paper says it will be released at https://github.com/zjunlp/MemTrace. Dataset: yes, MemTraceBench is described as a benchmark of 160 human-annotated failure cases from Long-Context, RAG, Mem0, and EverMemOS over three public datasets, but the excerpt does not confirm an already-available dataset download link.

Discussion questions

1. Does representing memory-system execution as an operation-variable graph actually capture causal responsibility, or does it mostly provide a structured heuristic over correlations in logged dependencies?
2. For builders of production agents, is the added instrumentation and storage cost of million-token execution traces justified compared with simpler observability, retrieval evaluation, and regression tests?
3. What result would falsify MemTrace's core value proposition: poor attribution agreement with humans on unseen memory systems, failure on multi-error cases, or lack of downstream performance gains after using its diagnoses?

Key figure

Figure 1 shows a pipeline where a memory system is executed to build an execution graph, and MemTrace then performs step-by-step graph tracing over failed cases to identify faulty memory operations across memory construction, interaction history, and evaluation.
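
To make the idea of tracing over an execution graph concrete, here is a minimal sketch in Python. It is not the MemTrace implementation: the `Operation` class, the `trace_back` function, the stage labels, and the `is_faulty` check are all hypothetical names chosen for illustration. The sketch assumes a single decisive error per failed case (the singleton setting noted in the reliability assessment) and assumes some external judge can say whether a given variable's value is wrong.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Operation:
    """One logged memory operation (hypothetical schema, not from the paper)."""
    name: str            # e.g. "extract", "store", "retrieve", "answer"
    inputs: list[str]    # names of variables the operation read
    outputs: list[str]   # names of variables the operation wrote
    stage: str           # illustrative stage label


def trace_back(ops: list[Operation],
               failed_var: str,
               is_faulty: Callable[[str], bool]) -> Optional[Operation]:
    """Walk backward from a failed variable through its producers.

    Returns the first operation whose output is faulty while all of its
    inputs are judged fine, or None if the fault reaches a raw input
    that no logged operation produced.
    """
    # Map each variable to the operation that produced it.
    producers = {var: op for op in ops for var in op.outputs}
    current, visited = failed_var, set()
    while current in producers and current not in visited:
        visited.add(current)
        op = producers[current]
        bad_inputs = [v for v in op.inputs if is_faulty(v)]
        if not bad_inputs:
            return op            # inputs fine, output wrong: blame this op
        current = bad_inputs[0]  # singleton assumption: follow one bad input
    return None


# Toy execution trace for one failed question.
ops = [
    Operation("extract", ["dialogue"], ["facts"], "memory construction"),
    Operation("store", ["facts"], ["memory"], "memory construction"),
    Operation("retrieve", ["memory", "question"], ["context"], "interaction"),
    Operation("answer", ["context", "question"], ["prediction"], "evaluation"),
]
# Assumed judgments: the extracted facts were fine, everything after is wrong.
faulty = {"memory", "context", "prediction"}

root = trace_back(ops, "prediction", lambda v: v in faulty)
print(root.name if root else "fault in raw input")
```

In this toy trace the walk goes from `prediction` to `context` to `memory` and stops at `store`, because the facts it received were judged correct. A flat log would list the same four operations but would not encode which variable fed which operation, which is the dependency information the backward walk relies on.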

GitHub: 1 repo

zjunlp/MemTrace (Official)