2026-05-26 | vision, data, code

Self-Ensembling Vision-Language Models for Chart Data Extraction

Thomas Berkane, Qianyi Wang, Maimuna S. Majumder

Key claim

Self-ensembling improves chart-to-table extraction accuracy significantly.

This paper introduces a self-ensembling method for extracting tabular data from charts: the same vision-language model is sampled several times on one chart image, and the sampled outputs are aggregated into a single table. The authors report accuracy improvements of up to 23% on a new benchmark, WB-ChartExtract. The aim is to make data previously locked in chart images easier to reuse and analyze.

Novelty
7.5/10

The proposed self-ensembling method significantly enhances chart-to-table extraction, extending the capabilities of existing vision-language models.

Reliability
8.0/10

The paper presents strong experimental results across multiple datasets, including a new benchmark, supporting its claims effectively.

Deep reliability assessment

The methodology supports the claim that repeated stochastic sampling from the same VLM plus cell-level median aggregation can improve chart-to-table extraction on the evaluated benchmarks, especially for dense synthetic World Bank charts. It does not fully establish robustness on messy real-world charts, non-numeric tables, or cases where all samples share the same systematic visual misreading.
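The limitation about shared misreadings can be illustrated with a toy calculation. The numbers below are invented for illustration and are not taken from the paper: a median over samples removes an isolated outlier, but it cannot correct an error that every sample makes in the same direction.

```python
from statistics import median

# Hypothetical ground-truth value for a single table cell.
TRUE_VALUE = 50.0

# Independent errors: samples scatter around the truth, with one outlier.
independent = [48.0, 50.0, 53.0, 50.0, 61.0]

# Shared misreading: every sample misreads the axis in the same way.
shared = [60.0, 61.0, 60.0, 59.0, 60.0]

# Absolute error of the median-aggregated value in each scenario.
independent_error = abs(median(independent) - TRUE_VALUE)
shared_error = abs(median(shared) - TRUE_VALUE)

print(independent_error, shared_error)
```

In the first case the aggregated value lands on the truth; in the second, the samples agree closely with each other, so a convergence signal alone would look reassuring while the error persists.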

Reproducibility

Yes. The paper says it releases the full WB-ChartExtract benchmark, with chart images and ground-truth tables, under CC BY 4.0, along with code for chart generation, extraction, ensembling, and evaluation, plus configuration files. Code: https://github.com/tberkane/vlm-ensemble-chart. Data: https://huggingface.co/datasets/tberkane/WB-ChartExtract.

Discussion questions

  1. Does self-ensembling genuinely recover independent evidence from the image, or is it mostly smoothing correlated hallucinations from the same VLM?
  2. For builders, when is the accuracy gain worth the extra latency and inference cost versus using a stronger single-pass frontier VLM or a deterministic chart-specific parser?
  3. What result would falsify the paper’s core claim: e.g., showing that gains disappear under deterministic decoding, on real-world charts with noisy labels, or when evaluated against charts with systematic axis/legend ambiguity?

Key figure

The key architecture repeatedly queries the same VLM on one chart image, parses the sampled table outputs, aligns corresponding cells, aggregates numeric cells with medians, and reports both a consensus table and uncertainty/convergence signals.
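The aggregation step described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the sampled outputs have already been parsed into dictionaries keyed by (row label, column label), uses those labels as the alignment rule, and uses the value range and the share of samples containing a cell as stand-ins for the uncertainty and convergence signals. The function name `aggregate_tables` and the sample data are hypothetical.

```python
from statistics import median


def aggregate_tables(samples):
    """Cell-wise median over already-parsed sampled tables.

    `samples` is a list of tables. Each table maps a
    (row_label, column_label) key to a numeric value; a sample that
    failed to produce a cell simply omits that key.
    """
    # Align cells across samples by their (row, column) key.
    cells = {}
    for table in samples:
        for key, value in table.items():
            cells.setdefault(key, []).append(value)

    consensus, spread, support = {}, {}, {}
    for key, values in cells.items():
        consensus[key] = median(values)            # aggregated cell value
        spread[key] = max(values) - min(values)    # crude uncertainty signal
        support[key] = len(values) / len(samples)  # share of samples with this cell
    return consensus, spread, support


# Three hypothetical sampled extractions of the same chart.
samples = [
    {("2019", "GDP"): 10.0, ("2020", "GDP"): 12.0},
    {("2019", "GDP"): 10.5, ("2020", "GDP"): 12.0},
    {("2019", "GDP"): 30.0, ("2020", "GDP"): 11.5},
]
consensus, spread, support = aggregate_tables(samples)
print(consensus)
print(spread)
```

Here the third sample's outlying reading of 30.0 does not move the consensus for that cell, but it does show up as a large spread, which is the kind of cell a reviewer might want to check by hand.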

Benchmark results

~ WB-ChartExtract | RMSF1: 87.83 | vs other evaluated single-pass general-purpose VLMs | not specified
~ ChartQA | RMSF1: 95.2 | vs other evaluated models | not specified
~ ChartQA | RMSF1: 91.43 | vs other evaluated general-purpose VLMs | not specified
GitHub: 1 repo
tberkane/vlm-ensemble-chart (Official)