2026-05-26 | reasoning, vision, multimodal

Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models

Yifan Jiang, Dae Yon Hwang, Jesse C. Cresswell, Freda Shi

Key claim

Counterfactual charts expose VLMs' reasoning failures.

The paper presents Chartographer, a framework that generates counterfactual charts to evaluate visual reasoning in chart question answering. It finds that vision-language models often fail to generalize when a chart is updated in a way that requires a new reasoning pathway to reach the answer. Evaluation on fixed chart-question-answer triples does not surface these failures.

In plain English

Novelty

8.0/10

The introduction of counterfactual charts represents a significant advancement in evaluating visual reasoning in QA models.

Reliability

7.5/10

The framework is applied to existing datasets and evaluates multiple models, providing solid evidence for claims.

Deep reliability assessment

The methodology supports the claim that, for reconstructable chart-QA examples, counterfactual data perturbations reveal failures that fixed chart-question-answer evaluation misses. It overclaims if generalized to all chart reasoning, because the benchmark filters for charts that can be reverse-engineered into executable code and does not test changes in chart type, visual design, or question formulation.

Reproducibility

No public code repository or downloadable generated dataset is mentioned in the provided abstract, introduction, results, limitations, or conclusion. The paper describes the CHARTOGRAPHER pipeline, use of existing datasets, human-in-the-loop validation, seed-controlled variants, and LLM-judge evaluation, but the artifact availability is unclear.

Discussion questions

1. Does preserving the same chart-question task while changing only the data truly isolate visual reasoning, or does it mostly test robustness to distribution shifts introduced by the reconstruction pipeline?
2. For builders of dashboard, BI, or document-understanding agents, should counterfactual chart tests become part of eval suites before deployment, especially when models already score well on static chart-QA benchmarks?
3. What result would falsify the paper's central claim? For example, if models trained or prompted only to extract chart data tables maintained high CVA across counterfactual variants, would that show the failures are interface or prompting failures rather than fundamental VLM reasoning failures?

Key figure

Figure 1 shows CHARTOGRAPHER converting a source chart-QA example into reconstructed chart code, generating seed-controlled counterfactual variants, recomputing answers with executable QA logic, and evaluating original, reconstruction, variant, sensitivity, and generalizability metrics.
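
The variant-generation and answer-recomputation steps described for Figure 1 can be illustrated with a minimal Python sketch. It is not the paper's implementation: the chart is reduced to a plain dictionary of category values, rendering is omitted, and every name here (`make_variant`, `answer_max_category`, `memorizing_model`) and every data value is a hypothetical stand-in chosen for illustration.

```python
import random


def make_variant(data, seed, low=1, high=100):
    """Return a counterfactual copy of `data`: same categories, new values."""
    rng = random.Random(seed)  # seed-controlled, so each variant is reproducible
    return {label: rng.randint(low, high) for label in data}


def answer_max_category(data):
    """Executable QA logic for 'Which category has the highest value?'."""
    return max(data, key=data.get)


# Hypothetical source example; the values are invented for illustration.
source = {"A": 30, "B": 55, "C": 42}
original_answer = answer_max_category(source)

# Generate variants and recompute the gold answer for each one from its data.
variants = [make_variant(source, seed) for seed in range(5)]
gold_answers = [answer_max_category(v) for v in variants]


def memorizing_model(data):
    """Toy stand-in for a model that repeats the original answer for any chart."""
    return original_answer


# Accuracy on variants: share of variants where the prediction matches the gold answer.
variant_accuracy = sum(
    memorizing_model(v) == gold for v, gold in zip(variants, gold_answers)
) / len(variants)
print(f"original answer: {original_answer}, variant accuracy: {variant_accuracy:.2f}")
```

Because the gold answer is recomputed from each variant's data rather than copied from the source example, a model that only reproduces the original answer is right on the source chart but is scored correct on a variant only when that variant's answer happens to coincide with it.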

Benchmark results

ChartQA | variant accuracy: 0.917 | vs GPT-5.4: +0.001

CharXiv | variant accuracy: 0.805 | vs Gemini 2.5 Pro: +0.066

ChartMuseum | variant accuracy: 0.554 | vs Claude Sonnet 4.6: +0.013