2026-05-27 | agents, reasoning, vision, multimodal

Agent Explorative Policy Optimization for Multimodal Agentic Reasoning

Minki Kang, Shizhe Diao, Ryo Hachiuma, Sung Ju Hwang, Pavlo Molchanov, Yu-Chiang Frank Wang, Byung-Kwan Lee

Key claim

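AXPO improves tool use in vision-language models: on Qwen3-VL-8B-Thinking, SFT + AXPO beats SFT + GRPO by 1.8 pp on both average Pass@1 and average Pass@4 across nine multimodal benchmarks.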

This paper introduces AXPO, an approach for improving tool use in vision-language models by addressing the Thinking-Acting Gap. The key result is that SFT + AXPO outperforms SFT + GRPO on average across nine multimodal benchmarks, and that the 8B AXPO model exceeds the 32B base model on Pass@4, so the smaller model does better with fewer parameters. This could make vision-language models more effective in real-world applications.

Novelty

8.0/10

The introduction of AXPO presents a significant new method for addressing the Thinking-Acting Gap in vision-language models.

Reliability

7.5/10

The claims are supported by multiple benchmarks and a clear comparison to existing methods, though further validation could strengthen the findings.

Deep reliability assessment

The methodology supports the claim that targeted resampling at failed tool-use points improves RL fine-tuning over standard GRPO for Qwen3-VL-Thinking models on the authors' nine multimodal benchmarks. The broader claim that AXPO generally solves the Thinking-Acting Gap overreaches: it would need evidence from larger-scale models, more tool types, non-verifiable tasks, and external replication.
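For context on the baseline, standard GRPO scores each sampled rollout relative to the other rollouts drawn for the same prompt. The following is a minimal sketch of that group-relative advantage, assuming binary verifiable rewards and a population standard deviation (both are assumptions for illustration, not details taken from the paper). AXPO's resampling step is not described in enough detail in this summary to sketch, so it is not shown.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt.

    rewards: one scalar reward per rollout (e.g. 1.0 = correct, 0.0 = wrong).
    eps: small constant (assumed value) that avoids division by zero.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one rollout")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group where every rollout fails carries no learning signal:
# all advantages are zero, so the policy gradient for that prompt vanishes.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# A group with one success gives that rollout a positive advantage
# and the failed rollouts negative ones.
one_success = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
```

The all-failed case is one way to see why the number and placement of sampled trajectories matters when comparing any resampling scheme against plain GRPO.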

Reproducibility

No clear open-source code or dataset release is provided in the supplied text; the paper mentions a project page only as 'link'. Evaluation benchmarks and hyperparameters are described, but training data, exact implementation, and repository URL are not available from the excerpt.

Discussion questions

1. Does the Thinking-Acting Gap require a new RL algorithm, or could the same gains come from better tool-use SFT data, reward shaping, or higher exploration temperature?
2. For builders, is AXPO practical when tool calls are expensive, slow, or involve non-deterministic APIs such as web search, OCR, or enterprise databases?
3. What result would falsify AXPO: no gain when controlling for total sampled trajectories, failure on unseen tools, or gains disappearing when evaluated with stricter tool-call correctness metrics?

Key figure

Figure 1 plots average Pass@1 and Pass@4 over nine multimodal benchmarks across Qwen3-VL-Thinking model sizes, showing SFT + AXPO outperforming SFT + GRPO and the 8B AXPO model exceeding the 32B base model on Pass@4.
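Pass@1 and Pass@4 measure whether at least one of k sampled answers is correct. The summary does not say how the paper computes them; the sketch below assumes the widely used unbiased estimator, which draws n samples per problem, counts the c correct ones, and estimates pass@k as 1 - C(n-c, k) / C(n, k). The function names are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def average_pass_at_k(results, k: int) -> float:
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With n = k = 4, pass@4 for a problem is simply 1 if any of the four samples is correct and 0 otherwise, while pass@1 is the fraction of correct samples.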

Benchmark results

~ Average over nine multimodal benchmarks | Pass@4: 75.8 | vs Qwen3-VL-32B-Thinking Base | +6.2 pp

~ Average over nine multimodal benchmarks | Pass@4: 75.8 | vs SFT + GRPO on Qwen3-VL-8B-Thinking | +1.8 pp

~ Average over nine multimodal benchmarks | Pass@1: 62.3 | vs SFT + GRPO on Qwen3-VL-8B-Thinking | +1.8 pp
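
The comparison scores are not listed directly, but they can be backed out from the reported gaps. This small sketch assumes each "pp" figure is a plain difference (reported score minus comparison score); the derived values are implied by that assumption, not quoted from the paper.

```python
def implied_baseline(score: float, delta_pp: float) -> float:
    """Comparison score implied by a reported score and its gap in percentage points."""
    return round(score - delta_pp, 1)


# (metric, reported score, comparison system, reported gap in pp)
rows = [
    ("Pass@4", 75.8, "Qwen3-VL-32B-Thinking Base", 6.2),
    ("Pass@4", 75.8, "SFT + GRPO on Qwen3-VL-8B-Thinking", 1.8),
    ("Pass@1", 62.3, "SFT + GRPO on Qwen3-VL-8B-Thinking", 1.8),
]

for metric, score, system, delta in rows:
    print(f"{metric}: {system} is implied to score {implied_baseline(score, delta)}")
```

Under that assumption the implied comparison scores are 69.6, 74.0, and 60.5 respectively.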