2026-05-26 data

Temporal Simultaneity Predicts Annotation Quality in Sentiment Corpora

Idris Abdulmumin, Mokgadi Penelope Matloga, Tadesse Destaw Belay, Botshelo Kondowe, Letlhogonolo Mohleleng, Hareaipha Nkopo Letsoalo, Shamsuddeen Hassan Muhammad, Vukosi Marivate

Key claim

Temporal simultaneity significantly affects annotation quality.

This paper presents a new sentiment dataset for Setswana and analyzes how inter-annotator agreement declines over time. A key finding is that tweets whose annotations were made within one minute of each other achieve much higher agreement than those labeled further apart, highlighting the importance of temporal factors in annotation quality.

Novelty

7.0/10

The paper introduces a new dataset and insights into annotation quality over time, which is significant for the field of sentiment analysis in African languages.

Reliability

8.0/10

The study provides strong empirical evidence through multiple analyses and benchmarks against established models, supporting its claims.

Deep reliability assessment

The timestamped, per-annotation analysis supports a strong association between shorter inter-annotator time gaps and higher agreement on this Setswana Twitter sentiment dataset. Causal claims should be treated cautiously because temporal simultaneity may proxy for batch difficulty, shared context, annotator availability, or other unmeasured campaign effects, and the annotator pool is only three people from one institution.
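The association described above can be checked on any timestamped annotation log by splitting doubly-annotated items on the time gap between the two labels and computing agreement per bucket. The following is a minimal sketch, not the paper's analysis code: it assumes pairwise Cohen's kappa (the summary does not say which kappa variant the authors use), a 60-second threshold taken from the one-minute finding, and a hypothetical record layout of `(label_a, time_a, label_b, time_b)`. The toy records are invented for illustration only.

```python
from collections import Counter
from datetime import datetime, timedelta


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("need two non-empty sequences of equal length")
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def kappa_by_gap(records, threshold_seconds=60):
    """Compute kappa separately for items labeled within / beyond the gap.

    records: iterable of (label_a, time_a, label_b, time_b) tuples
    (hypothetical layout, not the released dataset's schema).
    """
    buckets = {"within": ([], []), "beyond": ([], [])}
    for label_a, time_a, label_b, time_b in records:
        gap = abs((time_a - time_b).total_seconds())
        key = "within" if gap <= threshold_seconds else "beyond"
        buckets[key][0].append(label_a)
        buckets[key][1].append(label_b)
    # Skip empty buckets, for which kappa is undefined.
    return {k: cohen_kappa(a, b) for k, (a, b) in buckets.items() if a}


# Invented toy data: four pairs labeled seconds apart, four labeled hours apart.
t0 = datetime(2024, 1, 1, 12, 0, 0)
toy_records = [
    ("pos", t0, "pos", t0 + timedelta(seconds=10)),
    ("neg", t0, "neg", t0 + timedelta(seconds=20)),
    ("neu", t0, "neu", t0 + timedelta(seconds=30)),
    ("pos", t0, "pos", t0 + timedelta(seconds=40)),
    ("pos", t0, "neg", t0 + timedelta(hours=1)),
    ("neg", t0, "neg", t0 + timedelta(hours=2)),
    ("neu", t0, "pos", t0 + timedelta(hours=3)),
    ("pos", t0, "pos", t0 + timedelta(hours=4)),
]
result = kappa_by_gap(toy_records)
print(result)
```

A gap-bucketed kappa like this only shows association. Separating simultaneity from batch difficulty or annotator identity would need the additional stratification that the assessment above calls for.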

Reproducibility

Yes, partially: the paper states that the dataset, per-annotation timestamps, and analysis code are released publicly, but the supplied text does not include a repository or dataset URL. The dataset is under a controlled-access NOODL license, so reproducibility may require approval rather than direct download.

Discussion questions

1. Does temporal simultaneity genuinely improve annotation quality, or does it merely correlate with easier tweets, better annotator focus, or periods when annotators were more aligned in interpretation?
2. For builders running low-resource annotation projects, is it better to enforce synchronized annotation windows, add calibration meetings, increase annotator diversity, or spend budget on adjudication and gold checks?
3. What result would falsify the paper's main claim? For example, a new multi-language annotation campaign that controls for item difficulty and annotator identity but finds no κ improvement from synchronized labeling.

Key figure

No Figure 1 or architectural diagram is provided in the supplied excerpt; the key setup is a timestamped three-annotator sentiment-labeling campaign over multiple tweet batches.

Benchmark results

Setswana Twitter sentiment test set | macro-F1 (%): 50.6 | vs mBERT pre-trained probe: +30.0 macro-F1

Setswana Twitter sentiment test set | macro-F1 (%): 49 | vs AfriBERTa pre-trained probe: +40.3 macro-F1

Setswana Twitter sentiment test set | macro-F1 (%): 53.6 | vs AfroXLMR-base pre-trained probe: +43.0 macro-F1

Setswana Twitter sentiment test set | macro-F1 (%): 56.8 | vs zero-shot proprietary LLM baseline: N/A

Setswana Twitter sentiment test set | macro-F1 (%): 57.1 | vs zero-shot proprietary LLM baseline: N/A

Setswana Twitter sentiment test set | macro-F1 (%): 62.2 | vs AfroXLMR-base fine-tuned: +8.6 macro-F1

Setswana Twitter sentiment test set | macro-F1 (%): 57.2 | vs Gemini zero-shot: +0.1 macro-F1
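
The rows above give a score and a delta but not the comparison system's own score. As a rough sanity check, the sketch below assumes each delta is the listed score minus the comparison score (an interpretation, not something the listing states) and looks for implied comparison scores that also appear as listed scores. The implied values for the pre-trained probe rows are not confirmed anywhere in the listing and should be treated as unverified.

```python
# (listed macro-F1, comparison system, delta in macro-F1 points or None for N/A)
rows = [
    (50.6, "mBERT pre-trained probe", 30.0),
    (49.0, "AfriBERTa pre-trained probe", 40.3),
    (53.6, "AfroXLMR-base pre-trained probe", 43.0),
    (56.8, "zero-shot proprietary LLM baseline", None),
    (57.1, "zero-shot proprietary LLM baseline", None),
    (62.2, "AfroXLMR-base fine-tuned", 8.6),
    (57.2, "Gemini zero-shot", 0.1),
]

listed_scores = {score for score, _, _ in rows}


def implied_comparison(score, delta):
    """Comparison score implied by score - delta (assumed interpretation)."""
    return round(score - delta, 1)


# Implied comparison score for every row that reports a delta.
implied = [
    (name, implied_comparison(score, delta))
    for score, name, delta in rows
    if delta is not None
]

# Rows whose implied comparison score matches another listed score.
cross_checked = [(name, value) for name, value in implied if value in listed_scores]

for name, value in implied:
    status = "matches a listed score" if value in listed_scores else "unverified"
    print(f"{name}: implied {value} ({status})")
```

Under that assumption, the fine-tuned AfroXLMR-base and Gemini zero-shot comparisons line up with the 53.6 and 57.1 rows, which supports reading the deltas as simple differences for those two rows.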