Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
12-2025
Abstract
Tokenization is the process of encoding strings into tokens from a fixed-size vocabulary, and is widely used in Natural Language Processing applications. The leading tokenization algorithm today is Byte Pair Encoding (BPE), which formulates the tokenization problem as a compression problem and tackles it by performing sequences of merges. In this work, we formulate tokenization as an optimization objective, show that it is NP-hard via a simple reduction from vertex cover, and propose a polynomial-time greedy algorithm, GreedTok. Our formulation naturally relaxes to the well-studied weighted maximum coverage problem, which has a simple (1 - 1/e)-approximation algorithm, GreedWMC. Through empirical evaluations on real-world corpora, we show that GreedTok outperforms BPE and Unigram on compression and achieves a covering score comparable to GreedWMC. Finally, our extensive pre-training of two transformer-based language models with 1 billion parameters, comparing BPE and GreedTok as the tokenizer, shows that GreedTok achieves lower bits per byte even when we control for either the total dataset proportion or the total number of training tokens.
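The merge-based procedure the abstract attributes to BPE can be illustrated with a toy trainer: repeatedly count adjacent symbol pairs across the corpus and merge the most frequent pair into a single token. This is a minimal sketch over whitespace-split words for illustration only; it is neither the paper's GreedTok nor a production BPE implementation, and the function name `bpe_train` is our own.

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Toy BPE trainer: returns the list of learned merges."""
    # Represent each word as a tuple of symbols (initially characters),
    # weighted by how often the word occurs in the corpus.
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Apply the merge everywhere: replace each occurrence of the
        # pair with the concatenated symbol.
        merged = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    return merges
```

On a corpus like `["low", "low", "lower"]` with two merges, the trainer first merges `('l', 'o')` and then `('lo', 'w')`, yielding the token `low` greedily, which is the sequential-merge behavior the paper's partition-cover formulation replaces with a direct coverage objective.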
Discipline
Databases and Information Systems | Programming Languages and Compilers
Research Areas
Data Science and Engineering
Publication
Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025), San Diego, CA, December 2-7
First Page
1
Last Page
33
City or Country
San Diego, CA
Citation
LIM, Jia Peng; TAN, Shawn; CHOO, Davin; and LAUW, Hady Wirawan.
A partition cover approach to tokenization. (2025). Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025), San Diego, CA, December 2-7. 1-33.
Available at: https://ink.library.smu.edu.sg/sis_research/10740
Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Additional URL
https://openreview.net/forum?id=prGyR9id7X