Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German
Sajjad Abdoli, Ghassan Al-Sumaidaee, Clayton W. Taylor, Ahmad ElShiekh, Ahmed Rashad
Key claim
ElevenLabs Scribe v2 achieves the best performance among the five commercial ASR systems benchmarked.
This study benchmarks five commercial ASR systems on code-switching speech in Arabic, Persian, and German. The key finding is that ElevenLabs Scribe v2 outperforms the other systems, with the lowest WER and the highest BERTScore, and that quality differs significantly across systems.
The paper provides a new benchmark for evaluating ASR in code-switching contexts.
The methodology includes a rigorous evaluation of multiple ASR systems with clear metrics.
Deep reliability assessment
The methodology supports the claim that BERTScore is a more reliable metric than WER for evaluating ASR systems on code-switching speech, particularly for Arabic and Persian. However, the assertion that WER systematically overstates performance differences may not generalize beyond the language pairs tested.
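To make the WER side of this comparison concrete, the following is a minimal sketch of word error rate computed as word-level edit distance divided by reference length. It is not the paper's evaluation code: the function name `word_error_rate`, the whitespace tokenization, and the absence of any text normalization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words.

    Illustrative only: tokenizes on whitespace and applies no normalization
    (assumptions, not the paper's pipeline).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("a b c d", "a x c d"))
```

Because every mismatched token counts as a full error under this definition, a transcript that differs from the reference only in spelling or script is penalized as heavily as one that changes the meaning. An embedding-based metric such as BERTScore is intended to be less sensitive to that kind of surface difference.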
Reproducibility
Yes. The dataset is publicly available at https://huggingface.co/datasets/Perle-ai/ASR_Code_Switch
Discussion questions
- What assumptions are made about the generalizability of the benchmark results across different dialects and languages?
- How can builders leverage these findings to improve ASR systems for multilingual environments?
- What specific conditions or datasets would contradict the findings regarding the superiority of BERTScore over WER?
Key figure
Figure 1 shows the distribution of semantic topics across the 300 benchmark samples for each language pair, classified by GPT-4o using an inductively derived taxonomy.