Publication Type

Conference Proceeding Article

Version

publishedVersion

Publication Date

12-2025

Abstract

Tokenization is the process of encoding strings into tokens drawn from a vocabulary of fixed size, and is widely used in Natural Language Processing applications. The leading tokenization algorithm today is Byte Pair Encoding (BPE), which formulates tokenization as a compression problem and tackles it by performing sequences of merges. In this work, we formulate tokenization as an optimization objective, show that it is NP-hard via a simple reduction from vertex cover, and propose a polynomial-time greedy algorithm GreedTok. Our formulation naturally relaxes to the well-studied weighted maximum coverage problem, which admits a simple greedy (1 - 1/e)-approximation algorithm GreedWMC. Through empirical evaluations on real-world corpora, we show that GreedTok outperforms BPE and Unigram on compression and achieves a covering score comparable to GreedWMC. Finally, we pre-train two transformer-based language models with 1 billion parameters, comparing BPE and GreedTok as the tokenizer, and show that GreedTok achieves lower bits per byte even when we control for either the total dataset proportion or the total number of training tokens.
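
The weighted maximum coverage relaxation mentioned above admits the textbook greedy (1 - 1/e)-approximation: repeatedly pick the set whose still-uncovered elements carry the most weight. The sketch below is a minimal illustration of that generic greedy procedure on a toy instance; the function name, inputs, and data are hypothetical and it is not the paper's GreedWMC or GreedTok implementation.

```python
# Minimal sketch of the classic greedy (1 - 1/e)-approximation for
# weighted maximum coverage. Illustrative only; names and data are
# hypothetical, not taken from the paper.

def greedy_weighted_max_coverage(sets, weights, k):
    """Pick at most k sets maximizing the total weight of covered elements.

    sets:    list of collections of hashable elements
    weights: dict mapping element -> non-negative weight
    k:       budget on the number of sets chosen
    """
    covered = set()
    chosen = []
    for _ in range(k):
        # Marginal gain of a set = weight of its still-uncovered elements.
        best_idx, best_gain = None, 0.0
        for i, s in enumerate(sets):
            gain = sum(weights[e] for e in s if e not in covered)
            if gain > best_gain:
                best_idx, best_gain = i, gain
        if best_idx is None:  # no remaining set adds new weight; stop early
            break
        chosen.append(best_idx)
        covered.update(sets[best_idx])
    return chosen, sum(weights[e] for e in covered)


if __name__ == "__main__":
    # Toy instance; elements stand in for units of text a token could cover.
    sets = [{"a", "b"}, {"b", "c", "d"}, {"d", "e"}]
    weights = {"a": 3, "b": 1, "c": 2, "d": 2, "e": 1}
    print(greedy_weighted_max_coverage(sets, weights, k=2))
```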

Discipline

Databases and Information Systems | Programming Languages and Compilers

Research Areas

Data Science and Engineering

Publication

Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025), San Diego, CA, December 2-7

First Page

1

Last Page

33

City or Country

San Diego, CA

Additional URL

https://openreview.net/forum?id=prGyR9id7X
