Can Large Language Models Handle Discourse Particles? A Case Study of Colloquial Malay
Mariah Al Giptiah Binte Yusoff, Jakin Tan, Bocheng Chen, Guangliang Liu, Xi Chen
Key claim
Current LLMs struggle with Malay discourse particles.
This paper introduces MalayPrag, a benchmark for assessing LLMs' handling of discourse particles in colloquial Malay. The findings indicate that current LLMs struggle with these particles, but the proposed attributes significantly enhance their performance. This highlights the importance of structured approaches to improve LLMs' pragmatic understanding.
In plain English
The authors built MalayPrag, a test set that checks whether LLMs understand discourse particles (short conversational words such as kan and ke) in colloquial Malay. Current models do poorly on it when prompted without examples, but telling the model a few explicit linguistic attributes of each utterance makes its predictions of the particle's pragmatic function noticeably better.
The introduction of a benchmark for evaluating LLMs on discourse particles in Malay represents a meaningful extension of existing research.
The study provides experimental results and a structured framework, though it may lack extensive baselines.
Deep reliability assessment
The methodology supports a narrow benchmark-style claim: in zero-shot closed-label prompting on 187 annotated colloquial Malay utterances, current LLMs struggle with Malay discourse-particle pragmatics, and explicit linguistic attributes improve pragmatic-function prediction. It overclaims if read as a broad statement about LLM pragmatic competence, because the dataset is small, text-only, and limited mainly to kan and ke, and because it does not test fine-tuning, multimodal/prosodic inputs, or real dialogue use.
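To make the evaluation setup concrete, here is a minimal Python sketch of zero-shot closed-label prompting with and without attribute hints. It is not the paper's code: the label names, the prompt wording, and the helper functions build_prompt, parse_label, and accuracy are all assumptions made for illustration, and the paper's own templates are in its appendix.

```python
from typing import Dict, List, Optional

# Hypothetical closed label set; the paper's actual function labels are not
# given in this summary.
FUNCTION_LABELS = ["seek_confirmation", "express_doubt", "assert_shared_knowledge"]


def build_prompt(
    utterance: str,
    labels: List[str],
    attributes: Optional[Dict[str, str]] = None,
) -> str:
    """Build a zero-shot prompt; attributes are optional linguistic hints."""
    lines = [
        "Classify the pragmatic function of the discourse particle "
        "in this colloquial Malay utterance.",
        f"Utterance: {utterance}",
    ]
    if attributes:
        lines.append("Linguistic attributes:")
        for name, value in attributes.items():
            lines.append(f"- {name}: {value}")
    lines.append("Answer with exactly one of: " + ", ".join(labels))
    return "\n".join(lines)


def parse_label(response: str, labels: List[str]) -> Optional[str]:
    """Return the label if the response names exactly one of them, else None."""
    text = response.strip().lower()
    matches = [label for label in labels if label in text]
    return matches[0] if len(matches) == 1 else None


def accuracy(predictions: List[Optional[str]], gold: List[str]) -> float:
    """Fraction of items whose parsed prediction equals the gold label."""
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Illustrative utterance and attribute values (not taken from the dataset).
hints = {"Epistemic Stance": "uncertain", "Particle Position": "final"}
plain_prompt = build_prompt("Awak datang esok, kan?", FUNCTION_LABELS)
hinted_prompt = build_prompt("Awak datang esok, kan?", FUNCTION_LABELS, hints)
```

Comparing accuracy over the same items for plain_prompt-style and hinted_prompt-style inputs gives the with/without-attributes contrast described above. Responses that name no label or several labels are scored as wrong, which is one possible convention rather than the paper's stated rule.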
Reproducibility
Partial. The paper says MALAYPRAG is accessible via a link and provides prompt templates in an appendix, but no repository URL or exact dataset URL is visible in the supplied text; no open-source code release is mentioned.
Discussion questions
- 1. Does decomposing pragmatic function into five discrete attributes actually model human pragmatic understanding, or does it mainly make the classification task easier by exposing labels that are close to the answer?
- 2. For SEA builders, should Malay/Singlish/Indonesian chatbots use explicit pragmatic scaffolds like these at inference time, or is the more practical path to collect conversational fine-tuning data with particles, prosody, and speaker context?
- 3. What result would falsify the paper's conclusion: for example, would strong performance from a Malay-specialized model on unseen particles, dialects, and audio-rich dialogue without attribute hints show that the observed failure is benchmark-specific rather than a general pragmatic gap?
Key figure
Figure 1 presents a five-dimensional annotation schema for Malay discourse-particle utterances, labeling each utterance by Epistemic Stance, Listener Agreement, Emotion, Question Type, and Particle Position.
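As a rough illustration of what one record under this schema could look like, the Python sketch below models the five dimensions as fields of a dataclass. Only the five dimension names come from the figure description; the class name ParticleAnnotation, the ALLOWED value sets, and the example utterance are hypothetical and are not drawn from the paper or its dataset.

```python
from dataclasses import asdict, dataclass

# Hypothetical value sets for each dimension; the paper's real label
# inventories are not listed in this summary.
ALLOWED = {
    "epistemic_stance": {"certain", "uncertain"},
    "listener_agreement": {"expected", "not_expected"},
    "emotion": {"neutral", "positive", "negative"},
    "question_type": {"none", "yes_no", "tag"},
    "particle_position": {"initial", "medial", "final"},
}


@dataclass(frozen=True)
class ParticleAnnotation:
    """One annotated utterance with the five schema dimensions."""

    utterance: str
    particle: str
    epistemic_stance: str
    listener_agreement: str
    emotion: str
    question_type: str
    particle_position: str

    def __post_init__(self) -> None:
        # Reject values outside the (assumed) closed set for each dimension.
        for field_name, allowed in ALLOWED.items():
            value = getattr(self, field_name)
            if value not in allowed:
                raise ValueError(f"{field_name}={value!r} not in {sorted(allowed)}")


# Illustrative record, written for this sketch rather than taken from the data.
example = ParticleAnnotation(
    utterance="Awak datang esok, kan?",
    particle="kan",
    epistemic_stance="uncertain",
    listener_agreement="expected",
    emotion="neutral",
    question_type="tag",
    particle_position="final",
)
record = asdict(example)
```

Keeping each dimension as a closed set makes annotation errors fail loudly at construction time, and asdict yields a flat record that can be serialized or turned into prompt hints.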
