2026-05-26 multimodal vision alignment

Real Images, Worse Judgments: Evaluating Vision-Language Models on Concreteness and Imagery

Yifan Jiang, Ruoxi Ning, Sheng Yao, Freda Shi

Key claim

Visual context can degrade lexical judgment accuracy.

This study investigates whether visual inputs improve language understanding in multimodal models. It finds that real-image contexts can sometimes degrade performance, especially when the visual evidence is less relevant to the word being judged. The key result is that instructing the model to focus on the textual content can mitigate these issues.

Novelty

7.0/10

The paper provides a meaningful extension by challenging the assumption that visual inputs always enhance language understanding in VLMs.

Reliability

8.0/10

The findings are supported by human ratings and thorough analysis methods, indicating solid experimental validation.

Deep reliability assessment

The methodology supports a targeted diagnostic claim: in English single-word concreteness/imagery rating tasks, retrieved real images can perturb instruction-tuned VLM judgments and internal representations relative to no-image or uninformative-image controls. It would be overclaiming to conclude that visual input generally worsens VLM language understanding, since the setup is narrow, lexical, English-only, and uses incidental retrieved images rather than task-relevant visual evidence.
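The comparison described above can be sketched as a per-condition error against human ratings, measured relative to a no-image baseline. This is a minimal illustration, not the paper's metric or pipeline: the function names, the condition labels (`no_image`, `blank_image`, `real_image`), and all numbers are made up for the example.

```python
from statistics import mean


def mean_abs_error(model_ratings, human_ratings):
    """Mean absolute gap between model and human ratings over shared words."""
    shared = sorted(set(model_ratings) & set(human_ratings))
    if not shared:
        raise ValueError("no overlapping words between model and human ratings")
    return mean(abs(model_ratings[w] - human_ratings[w]) for w in shared)


def condition_shift(ratings_by_condition, human_ratings, baseline="no_image"):
    """Error of each condition minus the error of the baseline condition."""
    errors = {
        cond: mean_abs_error(ratings, human_ratings)
        for cond, ratings in ratings_by_condition.items()
    }
    base = errors[baseline]
    return {cond: err - base for cond, err in errors.items()}


# Hypothetical ratings on an arbitrary scale (NOT data from the paper).
human = {"nature": 3.0, "apple": 5.0}
ratings = {
    "no_image": {"nature": 3.0, "apple": 5.0},
    "blank_image": {"nature": 3.5, "apple": 5.0},
    "real_image": {"nature": 4.5, "apple": 5.0},
}

shifts = condition_shift(ratings, human)
for cond, delta in shifts.items():
    print(f"{cond}: error shift vs. baseline = {delta:+.2f}")
```

A positive shift for the real-image condition but not for the uninformative-image control would be the kind of pattern the diagnostic claim refers to. A shift that appears equally in both conditions would point to image presence in general rather than image content.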

Reproducibility

Code: no repository or project URL is mentioned in the provided paper text. Dataset: yes in principle, since the study uses named public lexical norm datasets MT40k and CP2004B, plus retrieved natural images from sources such as ImageNet and Wikimedia, but the exact retrieval pipeline/splits are not recoverable from the provided excerpt alone.

Discussion questions

1. Does using concreteness and imagery ratings actually test visual grounding, or does it mainly test whether VLMs can ignore irrelevant context under instruction-following pressure?
2. For builders deploying multimodal agents, should systems add an explicit relevance gate before passing images into a VLM for text-centric tasks, or is prompt-level mitigation enough?
3. What result would falsify the paper's claim: for example, would a VLM whose lexical ratings are unchanged by incidental images across abstract and concrete words, while still using images when genuinely relevant, be sufficient?
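For question 2, one way to picture a relevance gate is a thin wrapper that drops the image when a scorer judges it unrelated to the text. This is only a sketch under stated assumptions: `gate_image`, the `relevance` callable, and the 0.5 threshold are hypothetical and do not come from the paper. In practice the scorer might be an image-text similarity model, and the threshold would need tuning.

```python
from typing import Callable, Optional


def gate_image(
    word: str,
    image: Optional[bytes],
    relevance: Callable[[str, bytes], float],
    threshold: float = 0.5,  # arbitrary illustrative cutoff
) -> Optional[bytes]:
    """Return the image only if its relevance score reaches the threshold.

    Returning None means the downstream model should be called text-only.
    """
    if image is None:
        return None
    return image if relevance(word, image) >= threshold else None


# Stub scorer for illustration only: pretends every image is weakly relevant.
def low_relevance(word: str, image: bytes) -> float:
    return 0.2


# Stub scorer for illustration only: pretends every image is highly relevant.
def high_relevance(word: str, image: bytes) -> float:
    return 0.9


dropped = gate_image("nature", b"fake-image-bytes", low_relevance)
kept = gate_image("nature", b"fake-image-bytes", high_relevance)
print(dropped is None, kept is not None)
```

Whether such a gate is preferable to a prompt-level instruction depends on how reliable the scorer is. A gate that wrongly drops relevant images trades one failure mode for another.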

Key figure

Figure 1 illustrates a failure case where the word "nature" is rated accurately with an uninformative image but is overestimated when paired with a real image, and shows that an instruction to rely on the word reduces the error.