Publication Type

Conference Proceeding Article

Version

publishedVersion

Publication Date

1-2025

Abstract

Byte-pair encoding (BPE) is pivotal for processing text into chunk-size tokens, particularly in Large Language Models (LLMs). From a topic modeling perspective, as these chunk-size tokens may be mere parts of valid words, evaluating and interpreting them for coherence is challenging. Most, if not all, coherence evaluation measures are incompatible, as they benchmark using valid words. We propose to interpret the recovery of valid words from these tokens as a ranking problem and present a model-agnostic, training-free recovery approach that maps the topic-token distribution onto a selected vocabulary space, after which existing evaluation measures can be applied. Results show that topic sets recovered from the BPE vocabulary space are coherent.
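The abstract does not specify the scoring function; a minimal sketch of one plausible scheme, assuming each vocabulary word is scored by the mean topic probability of its BPE tokens (the tokenizer and all probabilities below are toy assumptions, not the paper's actual method):

```python
# Hypothetical sketch: recover a ranking over valid words from a
# topic's BPE-token distribution by aggregating token probabilities.

def recover_word_ranking(topic_token_probs, vocabulary, tokenize):
    """Rank vocabulary words by the mean probability of their BPE tokens."""
    scores = {}
    for word in vocabulary:
        tokens = tokenize(word)
        # Mean token probability; tokens absent from the topic contribute 0.
        scores[word] = sum(topic_token_probs.get(t, 0.0) for t in tokens) / len(tokens)
    # Higher aggregate score -> word ranks higher for this topic.
    return sorted(vocabulary, key=lambda w: scores[w], reverse=True)

# Toy topic-token distribution and a made-up tokenizer lookup.
toy_probs = {"mach": 0.30, "ine": 0.25, "learn": 0.20, "ing": 0.15, "cat": 0.01}
toy_vocab = ["machine", "learning", "cat"]
toy_tokenize = {"machine": ["mach", "ine"],
                "learning": ["learn", "ing"],
                "cat": ["cat"]}.__getitem__

ranking = recover_word_ranking(toy_probs, toy_vocab, toy_tokenize)
print(ranking)  # ['machine', 'learning', 'cat']
```

Once words are ranked this way, the top-N words per topic form a valid-word topic set on which standard coherence measures can be computed.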

Discipline

Databases and Information Systems

Research Areas

Data Science and Engineering

Areas of Excellence

Digital transformation

Publication

Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, UAE, 2025 January 19-24

First Page

10810

Last Page

10838

Publisher

ACL

City or Country

Abu Dhabi, UAE

Additional URL

https://aclanthology.org/2025.coling-main.720/
