2026-05-27 | data | code

Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

Irune Zubiaga, Aitor Soroa, Rodrigo Agerri


Key claim

Smaller models fine-tuned on in-domain data can match proprietary models as evaluators.

This paper explores strategies for building multilingual LLM judges for text evaluation, focusing on English, Spanish, and Basque. A key finding is that smaller models fine-tuned on in-domain data can match proprietary models, while larger models excel in zero-shot evaluation. The results offer practical guidance for building multilingual evaluation pipelines.

Novelty
7.0/10

The paper presents a meaningful extension of LLM-as-a-judge evaluation to multilingual settings, particularly for low-resource languages.

Reliability
8.0/10

The study includes systematic analysis and extends existing datasets, providing solid evidence for its claims.

Deep reliability assessment

The methodology supports comparative guidance about multilingual LLM-as-judge strategies across English, Spanish, and Basque under in-domain versus out-of-domain conditions. Claims about multilingual evaluation more broadly are somewhat overextended because the benchmarks and training/evaluation data are machine-translated and limited to three languages, with Basque resource scarcity constraining generalization.
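
To make the per-language comparison concrete, here is a minimal sketch of how a judge's agreement with gold labels could be tallied by language. It assumes pairwise preference labels ("A" or "B") and plain accuracy as the agreement measure; the record fields, the function name `agreement_by_language`, and the toy data are all hypothetical and are not taken from the paper or from the mJudge repository, whose metrics may differ.

```python
from collections import defaultdict


def agreement_by_language(records):
    """Return {language: fraction of items where the judge matches the gold label}.

    Each record is a dict with hypothetical keys:
      "lang"  - language code of the evaluated item
      "gold"  - reference preference label
      "judge" - preference label produced by the LLM judge
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        lang = rec["lang"]
        totals[lang] += 1
        if rec["judge"] == rec["gold"]:
            hits[lang] += 1
    return {lang: hits[lang] / totals[lang] for lang in totals}


# Toy records for illustration only; not data from the paper.
records = [
    {"lang": "en", "gold": "A", "judge": "A"},
    {"lang": "en", "gold": "B", "judge": "B"},
    {"lang": "es", "gold": "A", "judge": "A"},
    {"lang": "es", "gold": "B", "judge": "A"},
    {"lang": "eu", "gold": "A", "judge": "B"},
    {"lang": "eu", "gold": "B", "judge": "B"},
]

scores = agreement_by_language(records)
for lang, score in sorted(scores.items()):
    print(f"{lang}: {score:.2f}")
```

Reporting the score separately for each language, rather than pooled, is what lets a gap between English and a lower-resource language such as Basque show up at all. Running the same tally on in-domain and out-of-domain test sets would give the kind of comparison the assessment describes.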

Reproducibility

Yes. The paper states that data and code are publicly available at hitz-zentroa/mJudge, and it extends two meta-evaluation datasets to Basque and Spanish.

Discussion questions

  1. Is LLM-as-a-judge alignment with translated benchmark labels actually measuring multilingual evaluation ability, or mostly measuring robustness to translation artifacts and English-centric rubrics?
  2. For builders in SEA deploying multilingual evaluators, when is it cheaper and safer to fine-tune a small local model versus using a larger proprietary or open-weight model zero-shot?
  3. Would the main conclusions fail if evaluated on native, human-authored low-resource language data rather than machine-translated versions of English-origin benchmarks?

Key figure

No Figure 1 or architectural diagram is included in the provided excerpt; the key setup compares multilingual LLM-as-judge training/evaluation strategies across English, Spanish, and Basque with and without in-domain fine-tuning data.

Code link
hitz-zentroa/mJudge (Official)