2026-05-27 · agents, reasoning, vision, multimodal

Agent Explorative Policy Optimization for Multimodal Agentic Reasoning

Minki Kang, Shizhe Diao, Ryo Hachiuma, Sung Ju Hwang, Pavlo Molchanov, Yu-Chiang Frank Wang, Byung-Kwan Lee

Key claim

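AXPO improves tool use in vision-language models: on Qwen3-VL-8B-Thinking, SFT + AXPO beats SFT + GRPO by 1.8 pp on both Pass@1 and Pass@4, averaged over nine multimodal benchmarks.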

In plain English

This paper introduces AXPO, a reinforcement-learning approach that improves tool use in vision-language models by addressing the Thinking-Acting Gap. SFT + AXPO outperforms SFT + GRPO on the average of nine multimodal benchmarks, and the 8B AXPO model exceeds the 32B base model on Pass@4. If the gains hold up, smaller vision-language models could handle tool-using tasks that currently call for much larger ones.

Novelty
8.0/10

AXPO is a new method aimed specifically at the Thinking-Acting Gap in vision-language models.

Reliability
7.5/10

The claims are supported by multiple benchmarks and a clear comparison to existing methods, though further validation could strengthen the findings.

Deep reliability assessment

The methodology supports the claim that targeted resampling at failed tool-use points improves RL fine-tuning over standard GRPO for Qwen3-VL-Thinking models on the authors' nine multimodal benchmarks. The broader claim that AXPO generally solves the Thinking-Acting Gap overreaches: supporting it would take larger-scale models, more tool types, non-verifiable tasks, and external replication.
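For context on the GRPO baseline, here is a minimal sketch of the group-relative advantage as GRPO is commonly described: each rollout's reward is standardized against the other rollouts sampled for the same prompt. The function name, the binary rewards, and the `eps` value are illustrative assumptions, and this is not the authors' implementation. It does not implement AXPO's resampling, whose details are not given in this summary. It only shows one plausible reason extra samples at failure points could help: a group in which every rollout fails yields all-zero advantages, so it carries no learning signal.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's group of rollouts (GRPO-style).

    `eps` is an assumed small constant that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed outcomes: successes get positive advantages, failures negative ones.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Every rollout failed: all advantages are zero, so this prompt
# contributes nothing to the policy gradient.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(all_failed)
```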

Reproducibility

No clear open-source code or dataset release is provided in the supplied text; the paper mentions a project page only as 'link'. Evaluation benchmarks and hyperparameters are described, but training data, exact implementation, and repository URL are not available from the excerpt.

Discussion questions

  1. Does the Thinking-Acting Gap require a new RL algorithm, or could the same gains come from better tool-use SFT data, reward shaping, or higher exploration temperature?
  2. For builders, is AXPO practical when tool calls are expensive, slow, or involve non-deterministic APIs such as web search, OCR, or enterprise databases?
  3. What result would falsify AXPO: no gain when controlling for total sampled trajectories, failure on unseen tools, or gains disappearing when evaluated with stricter tool-call correctness metrics?
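The "controlling for total sampled trajectories" check in the last question can be made concrete with a rough budget calculation. This is a hypothetical sketch, not the paper's protocol: it assumes a resampling method draws a fixed group of rollouts per prompt and then a fixed number of extra rollouts only for prompts flagged as tool-use failures. All names and numbers below are made up for illustration.

```python
def expected_rollouts_per_prompt(group_size, extra_per_failure, failure_rate):
    """Average rollouts per prompt under the assumed resampling scheme.

    group_size:        rollouts always drawn per prompt
    extra_per_failure: extra rollouts drawn when a prompt is flagged as failed
    failure_rate:      fraction of prompts flagged (between 0 and 1)
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be in [0, 1]")
    return group_size + extra_per_failure * failure_rate


# Illustrative numbers only: 8 base rollouts, 4 extra on failure,
# half of the prompts flagged as failures.
budget = expected_rollouts_per_prompt(8, 4, 0.5)
print(budget)  # 10.0
```

Under these assumptions, a sample-matched GRPO baseline would need about 10 rollouts per prompt rather than 8 before any remaining gap could be attributed to where the samples are spent rather than to how many are drawn.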

Key figure

Figure 1 plots average Pass@1 and Pass@4 over nine multimodal benchmarks across Qwen3-VL-Thinking model sizes, showing SFT + AXPO outperforming SFT + GRPO and the 8B AXPO model exceeding the 32B base model on Pass@4.
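For readers unfamiliar with the metric, Pass@k is the probability that at least one of k sampled answers to a problem is correct. The sketch below uses the widely used unbiased estimator computed from n samples with c correct. Whether the paper uses this estimator or simply draws k samples per problem is not stated in this summary, so treat the function as an assumption about the evaluation rather than a description of it.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from n samples, c of which are correct."""
    if k > n:
        raise ValueError("k must not exceed n")
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# One problem with 8 sampled answers, 2 of them correct (made-up numbers).
p1 = pass_at_k(8, 2, 1)
p4 = pass_at_k(8, 2, 4)
print(p1, p4)
```

A benchmark score is then the mean of this value over all problems, which is why Pass@4 is always at least as high as Pass@1 for the same model.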

Benchmark results

~Average over nine multimodal benchmarks · Pass@4: 75.8 · vs Qwen3-VL-32B-Thinking Base: +6.2 pp
~Average over nine multimodal benchmarks · Pass@4: 75.8 · vs SFT + GRPO on Qwen3-VL-8B-Thinking: +1.8 pp
~Average over nine multimodal benchmarks · Pass@1: 62.3 · vs SFT + GRPO on Qwen3-VL-8B-Thinking: +1.8 pp
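The baseline scores behind the three rows above are not listed, but they can be backed out from the reported score and delta. The sketch assumes each delta is simply the reported score minus the baseline score, in percentage points. The resulting baselines are derived values, not figures quoted from the paper.

```python
# (metric, reported score, baseline name, delta in percentage points)
rows = [
    ("Pass@4", 75.8, "Qwen3-VL-32B-Thinking Base", 6.2),
    ("Pass@4", 75.8, "SFT + GRPO on Qwen3-VL-8B-Thinking", 1.8),
    ("Pass@1", 62.3, "SFT + GRPO on Qwen3-VL-8B-Thinking", 1.8),
]

# Assumption: delta = reported score - baseline score.
implied = [
    (metric, baseline, round(score - delta, 1))
    for metric, score, baseline, delta in rows
]

for metric, baseline, value in implied:
    print(f"{metric}: {baseline} ~ {value}")
```

Under that assumption the implied baselines are roughly 69.6 (32B base, Pass@4), 74.0 (SFT + GRPO, Pass@4), and 60.5 (SFT + GRPO, Pass@1).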