2026-05-22 | reasoning, vision, multimodal, code

ETCHR: Editing To Clarify and Harness Reasoning

Beichen Zhang, Yuhong Liu, Jinsong Li, Yuhang Zang, Jiaqi Wang, Dahua Lin

Key claim

ETCHR improves visual reasoning in multimodal models.

The paper presents ETCHR, an image editing model designed to enhance visual reasoning in multimodal large language models. The authors report improved reasoning accuracy across multiple task families when ETCHR is paired with different understanding models.

Novelty

8.0/10

The introduction of ETCHR represents a meaningful extension in multimodal reasoning by decoupling image editing from understanding models.

Reliability

7.0/10

The methodology includes a two-stage training process and demonstrates improvements across multiple task families, though it lacks extensive evaluation on diverse datasets.

Deep reliability assessment

The methodology supports the claim that ETCHR improves reasoning by using a dedicated image editor to produce intermediate visual evidence. The extent of improvement across all task families may be overstated, however, given potential limitations in reasoning depth and task-specific challenges.

Reproducibility

Yes, the code is available at https://github.com/InternLM/ETCHR, but the datasets used for training and evaluation are not explicitly mentioned as open source.

Discussion questions

How does ETCHR handle tasks where the visual transformation required is not immediately clear from the question?
What are the practical implications of using ETCHR for real-time applications where time overhead is a concern?
What specific scenarios or task types would demonstrate the limitations or failure of ETCHR's approach?

Key figure

Figure 1 illustrates the ETCHR architecture, highlighting its decoupled image editing and understanding model, and compares it with tool-based and unified model approaches.
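
The decoupled design described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`answer_with_editor`, `toy_edit`, `toy_understand`), the string stand-in for images, and the choice to pass both the original and the edited image to the understanding model are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in: an "image" is just a label string in this sketch.
Image = str


@dataclass
class ReasoningResult:
    answer: str
    evidence: List[Image]  # images handed to the understanding model


def answer_with_editor(
    image: Image,
    question: str,
    edit: Callable[[Image, str], Image],
    understand: Callable[[List[Image], str], str],
) -> ReasoningResult:
    """Decoupled pipeline: a separate editor produces intermediate visual
    evidence, and an independent understanding model produces the answer."""
    edited = edit(image, question)   # step 1: editor, conditioned on the question
    evidence = [image, edited]       # assumption: original and edit are both kept
    answer = understand(evidence, question)  # step 2: understanding model
    return ReasoningResult(answer=answer, evidence=evidence)


# Toy stubs so the sketch runs without any model; both are hypothetical.
def toy_edit(image: Image, question: str) -> Image:
    return f"{image}+highlight"


def toy_understand(images: List[Image], question: str) -> str:
    return f"answer from {len(images)} images"


result = answer_with_editor(
    "chart.png", "Which bar is tallest?", toy_edit, toy_understand
)
print(result.answer)
```

Because the editor and the understanding model are passed in as separate callables, either one can be swapped without touching the other, which is the property the summary attributes to the decoupled approach.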

GitHub: 1 repo

InternLM/ETCHR (Official)
