2026-05-25 | reasoning, vision, multimodal

DRScaffold: Boosting Dense-Scene Reasoning in Lightweight Vision Language Models

Xinrui Shi, Kai Liu, Ziqing Zhang, Jianze Li, Anqi Li, Yulun Zhang

Key claim

Structured supervision improves dense-scene reasoning in lightweight vision-language models.

This paper presents DRBench, a new benchmark for evaluating dense-scene reasoning in vision-language models, and DRScaffold, a fine-tuning framework that enhances grounded reasoning. The key result shows that a smaller model trained with structured supervision can outperform a larger frozen model, highlighting the effectiveness of the proposed approach.

Novelty
8.0/10

The introduction of DRBench and DRScaffold represents a significant advance in grounded reasoning for vision-language models.

Reliability
8.0/10

The experiments demonstrate substantial gains against solid baselines, and the release of code and models supports reproducibility.