2026-05-25 | data, multimodal, code

WhoSaidIt: Human-LLM Collaborative Annotation for Text-Based Multilingual Speaker-Attribute Classification

Lingyu Gao, Will Monroe, David Smith, Meghan Jemison, Jackie Lee

Key claim

Cross-lingual differences significantly affect speaker-attribute annotations.

This paper presents a framework for re-annotating multilingual speaker attributes through human-LLM collaboration. Its key finding is that speaker-attribute annotations differ significantly across languages, which highlights both the potential and the limitations of LLMs for this task.

Novelty

7.5/10

The proposed framework for collaborative re-annotation introduces a novel approach to handling multilingual speaker attributes.

Reliability

8.0/10

The study includes a comprehensive analysis of annotation divergence and benchmarks against recent LLMs, supporting its claims.

Deep reliability assessment

The methodology supports the use of LLMs to refine annotation guidelines and improve labeling consistency, but it may overclaim the extent to which LLMs can independently resolve ambiguities in subjective tasks without human oversight.

Reproducibility

Yes. The dataset is available at https://github.com/duolingo/whosaidit

Discussion questions

How do we ensure that the LLM's biases do not influence the final annotations?
What are the implications of using LLMs for annotation in terms of labor costs and quality control?
What would happen if the LLMs were trained on a dataset with different cultural contexts?

Key figure

Figure 1 illustrates the dataset construction pipeline, highlighting the iterative process of LLM summarization and expert review for refining annotation guidelines.

Benchmark results

WHOSAIDIT F1: 94.5 vs DeepSeek V3 (+8.3%, SOTA)

WHOSAIDIT F1: 96.5 vs Gemini 2.5 Flash (+8.2%, SOTA)

WHOSAIDIT F1: 94.1 vs GPT 4.1 (+8.1%, SOTA)

WHOSAIDIT F1: 96.5 vs Claude 3.7 Sonnet (+8.0%, SOTA)
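
For readers who want to sanity-check F1 figures like those above, here is a minimal sketch of a macro-averaged F1 computation. This summary does not state which averaging the paper uses, so macro averaging is an assumption, and the function name `macro_f1` and the label values in the usage note are hypothetical.

```python
def macro_f1(gold, pred):
    """Macro-averaged F1 over all labels seen in gold or pred.

    Assumption: macro averaging; the paper may report a different variant.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    labels = sorted(set(gold) | set(pred))
    per_label_f1 = []
    for label in labels:
        # Count true positives, false positives, and false negatives for this label.
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)

        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        # F1 is the harmonic mean of precision and recall.
        if precision + recall:
            per_label_f1.append(2 * precision * recall / (precision + recall))
        else:
            per_label_f1.append(0.0)

    return sum(per_label_f1) / len(per_label_f1)
```

Calling `macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"])` averages a per-label F1 of 2/3 for `a` and 0.8 for `b`; multiply by 100 to compare against scores on the scale used above.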

GitHub: 1 repo

duolingo/whosaidit (Official)
