Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
5-2025
Abstract
This study introduces two novel benchmarks, SeaExam and SeaBench, designed to evaluate the capabilities of Large Language Models (LLMs) in Southeast Asian (SEA) application scenarios. Unlike existing multilingual datasets primarily derived from English translations, these benchmarks are constructed based on real-world scenarios from SEA regions. SeaExam draws from regional educational exams to form a comprehensive dataset that encompasses subjects such as local history and literature. In contrast, SeaBench is crafted around multi-turn, open-ended tasks that reflect daily interactions within SEA communities. Our evaluations demonstrate that SeaExam and SeaBench more effectively discern LLM performance on SEA language tasks compared to their translated benchmarks. This highlights the importance of using real-world queries to assess the multilingual capabilities of LLMs.
Discipline
Artificial Intelligence and Robotics | Databases and Information Systems
Publication
Findings of the Association for Computational Linguistics: NAACL 2025: Albuquerque, April 29 - May 4
First Page
6134
Last Page
6151
ISBN
9798891761957
Identifier
10.18653/v1/2025.findings-naacl.341
Publisher
Association for Computational Linguistics (ACL)
City or Country
Albuquerque
Citation
LIU, Chaoqun; ZHANG, Wenxuan; YING, Jiahao; Aljunied, Mahani; LUU, Anh Tuan; and BING, Lidong.
SeaExam and SeaBench: Benchmarking LLMs with local multilingual questions in Southeast Asia. (2025). Findings of the Association for Computational Linguistics: NAACL 2025: Albuquerque, April 29 - May 4. 6134-6151.
Available at: https://ink.library.smu.edu.sg/sis_research/11104
Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
https://doi.org/10.18653/v1/2025.findings-naacl.341