Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
1-2025
Abstract
Byte-pair encoding (BPE) is pivotal for processing text into subword tokens, particularly in Large Language Models (LLMs). From a topic modeling perspective, evaluating and interpreting these tokens for coherence is challenging because they may be mere parts of valid words. Most, if not all, coherence evaluation measures are incompatible, as they benchmark using valid words. We propose to interpret the recovery of valid words from these tokens as a ranking problem and present a model-agnostic, training-free recovery approach that maps the topic-token distribution onto a selected vocabulary space, after which existing evaluation measures can be applied. Results show that topic sets recovered from the BPE vocabulary space are coherent.
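To illustrate the general idea of recovering valid words from a topic's distribution over BPE tokens, the following is a minimal sketch, not the paper's actual method: it assumes a hypothetical mean-probability aggregation, where each candidate word is scored by averaging the topic's probability over the word's BPE tokens, then words are ranked and the top-k returned.

```python
from typing import Dict, List, Callable

def recover_topic_words(
    topic_token_probs: Dict[str, float],   # one topic's distribution over BPE tokens
    vocabulary: List[str],                  # selected vocabulary space of valid words
    tokenize: Callable[[str], List[str]],   # any BPE tokenizer's word -> tokens function
    top_k: int = 10,
) -> List[str]:
    """Rank valid words by the topic's probability mass over their BPE tokens.

    Hypothetical aggregation (mean token probability); the paper may score differently.
    """
    scores: Dict[str, float] = {}
    for word in vocabulary:
        tokens = tokenize(word)
        if not tokens:
            continue
        # Average probability of the word's constituent tokens under the topic.
        scores[word] = sum(topic_token_probs.get(t, 0.0) for t in tokens) / len(tokens)
    # Highest-scoring words form the recovered topic set for evaluation.
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Because the recovery operates only on the topic-token distribution and a chosen vocabulary, it requires no retraining and is agnostic to the underlying topic model.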
Discipline
Databases and Information Systems
Research Areas
Data Science and Engineering
Areas of Excellence
Digital transformation
Publication
Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, UAE, 2025 January 19-24
First Page
10810
Last Page
10838
Publisher
ACL
City or Country
Abu Dhabi, UAE
Citation
LIM, Jia Peng and LAUW, Hady Wirawan.
Interpreting topic models in byte-pair encoding space. (2025). Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, UAE, 2025 January 19-24. 10810-10838.
Available at: https://ink.library.smu.edu.sg/sis_research/10143
Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Additional URL
https://aclanthology.org/2025.coling-main.720/