2026-05-27 · reasoning · vision · multimodal

The Abstraction Gap in Vision-Language Causal Reasoning

Chinh Hoang, Mohammad Rashedul Hasan

Key claim

One model achieves near-zero Abstraction Gap in causal reasoning.

This paper presents a new methodology for evaluating vision-language models (VLMs) that separates linguistic plausibility from causal reasoning. The key finding is that many models score well on linguistic quality but struggle to generate explicit causal chains. One model, however, reaches a near-zero Abstraction Gap, which suggests that stronger causal reasoning in VLMs is attainable.

Novelty

8.0/10

The introduction of a dual-probe methodology and the CAGE benchmark significantly advances the evaluation of VLMs.

Reliability

7.5/10

The study evaluates multiple models with a large dataset, providing solid evidence for its claims.

Deep reliability assessment

The methodology supports a useful diagnostic: many VLMs can produce plausible causal answers but perform much worse when asked to explicitly generate simple linear causal chains first. It overclaims if interpreted as proving absence of causal reasoning or internal unfaithfulness, because failures may reflect output-format brittleness, instruction following, or LLM-judge bias rather than a pure causal-reasoning deficit.

Reproducibility

Dataset: yes, the paper states that the CAGE benchmark is released, but no dataset URL is provided in the supplied text. Code: no public code repository is mentioned in the supplied abstract, introduction, discussion, conclusion, or footnotes.

Discussion questions

1. Does requiring a model to emit an explicit linear causal chain actually test causal understanding, or does it mainly test compliance with a particular symbolic output format?
2. For builders deploying VLMs in medicine, robotics, or inspections, should models be rejected if they have a high Abstraction Gap even when their final answers are useful and empirically accurate?
3. What experiment would falsify the paper's conclusion: for example, would strong performance across multiple causal representations such as graphs, natural-language rationales, and interventions eliminate the claimed gap?

Key figure

Figure 1 uses a beach-scene example to show Pearl Level 1 association questions answered directly, while Level 2 intervention and Level 3 counterfactual questions require an explicit causal chain before the final textual answer.
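The routing described for Figure 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `PearlLevel`, `requires_chain`, and `build_prompt`, the `A -> B -> C` chain notation, and the prompt wording are all hypothetical and are not taken from the paper or the CAGE benchmark.

```python
from enum import IntEnum


class PearlLevel(IntEnum):
    """Pearl's ladder of causation, as referenced in Figure 1."""
    ASSOCIATION = 1      # answered directly
    INTERVENTION = 2     # needs an explicit causal chain first
    COUNTERFACTUAL = 3   # needs an explicit causal chain first


def requires_chain(level: PearlLevel) -> bool:
    """Levels 2 and 3 ask for a causal chain before the final answer."""
    return level >= PearlLevel.INTERVENTION


def build_prompt(level: PearlLevel, question: str) -> str:
    """Build a probe prompt. The wording is a hypothetical example."""
    if requires_chain(level):
        return (
            "First write the causal chain as 'A -> B -> C', "
            "then give the final answer.\n"
            f"Question: {question}"
        )
    return f"Question: {question}"


if __name__ == "__main__":
    print(build_prompt(PearlLevel.ASSOCIATION, "What is on the beach?"))
    print(build_prompt(PearlLevel.COUNTERFACTUAL,
                       "What if the tide had been higher?"))
```

Keeping the chain requirement in a single predicate makes it easy to vary the output format, which matters for the concern raised under the reliability assessment that failures may reflect format brittleness rather than a reasoning deficit.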

Benchmark results

CAGE validation set: GPT-4o-judged average Level 2 score (0-10): 8.5, vs LLaVA-NeXT 13B baseline: +0.21

CAGE validation set: GPT-4o-judged Level 3 chain score (0-10): 6.96, vs LLaVA-NeXT 13B baseline: -0.46

CAGE validation set: GPT-4o-judged Level 3 chain score (0-10): 0.18, vs MiniGPT-4 13B baseline: -1.34

MMHal-Bench: average score (higher is better): 3.02, vs LLaVA-NeXT baseline: baseline result

MMHal-Bench: hallucination rate (lower is better): 0.43, vs Qwen-VL-Chat baseline: -0.05
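
The deltas listed above can be turned back into implied baseline values with simple arithmetic. The sketch below assumes each delta means "reported score minus baseline score", which this summary does not state explicitly; the helper name `implied_baseline` is hypothetical. The MMHal-Bench average-score row is left out because it carries no numeric delta.

```python
def implied_baseline(score: float, delta: float, ndigits: int = 2) -> float:
    """Recover the baseline, ASSUMING delta = score - baseline."""
    return round(score - delta, ndigits)


# (metric, baseline model, reported score, reported delta), copied from the rows above
rows = [
    ("CAGE Level 2 average", "LLaVA-NeXT 13B", 8.5, 0.21),
    ("CAGE Level 3 chain", "LLaVA-NeXT 13B", 6.96, -0.46),
    ("CAGE Level 3 chain", "MiniGPT-4 13B", 0.18, -1.34),
    ("MMHal-Bench hallucination rate", "Qwen-VL-Chat", 0.43, -0.05),
]

if __name__ == "__main__":
    for metric, baseline_model, score, delta in rows:
        value = implied_baseline(score, delta)
        print(f"{metric}: implied {baseline_model} baseline = {value}")
```

Under that assumption, a negative delta on a higher-is-better score means the reported model trails the baseline, while a negative delta on the hallucination rate means it improves on it.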