Publication Type

Conference Proceeding Article

Version

publishedVersion

Publication Date

5-2025

Abstract

This study introduces two novel benchmarks, SeaExam and SeaBench, designed to evaluate the capabilities of Large Language Models (LLMs) in Southeast Asian (SEA) application scenarios. Unlike existing multilingual datasets primarily derived from English translations, these benchmarks are constructed based on real-world scenarios from SEA regions. SeaExam draws from regional educational exams to form a comprehensive dataset that encompasses subjects such as local history and literature. In contrast, SeaBench is crafted around multi-turn, open-ended tasks that reflect daily interactions within SEA communities. Our evaluations demonstrate that SeaExam and SeaBench more effectively discern LLM performance on SEA language tasks compared to their translated benchmarks. This highlights the importance of using real-world queries to assess the multilingual capabilities of LLMs.

Discipline

Artificial Intelligence and Robotics | Databases and Information Systems

Publication

Findings of the Association for Computational Linguistics: NAACL 2025: Albuquerque, April 29 - May 4

First Page

6134

Last Page

6151

ISBN

9798891761957

Identifier

10.18653/v1/2025.findings-naacl.341

Publisher

Association for Computational Linguistics (ACL)

City or Country

Albuquerque

Additional URL

https://doi.org/10.18653/v1/2025.findings-naacl.341
