Publication Type
Conference Proceeding Article
Version
acceptedVersion
Publication Date
4-2023
Abstract
We propose SubText, a compression mechanism for word embeddings via vocabulary reduction. The crux is to judiciously select a subset of word embeddings that supports the reconstruction of the remaining word embeddings based on their form alone. The proposed algorithm accounts for both the preservation of the original embeddings and each word’s relationship to other words that are morphologically or semantically similar. Comprehensive evaluation of the compressed vocabulary shows SubText’s efficacy over traditional vocabulary reduction techniques on diverse tasks, as validated on English as well as a collection of inflected languages.
Keywords
Word embeddings, compression, vocabulary reduction
Discipline
Databases and Information Systems | Numerical Analysis and Scientific Computing | Theory and Algorithms
Research Areas
Data Science and Engineering
Publication
WI-IAT '22: Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology 2022, Niagara Falls, Canada, November 17-20
First Page
56
Last Page
63
ISBN
9781665494021
Identifier
10.1109/WI-IAT55865.2022.00018
Publisher
ACM
City or Country
New York
Citation
CHIA, Chong Cher; TKACHENKO, Maksim; and LAUW, Hady Wirawan.
Morphologically-aware vocabulary reduction of word embeddings. (2023). WI-IAT '22: Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology 2022, Niagara Falls, Canada, November 17-20. 56-63.
Available at: https://ink.library.smu.edu.sg/sis_research/7608
Copyright Owner and License
Authors
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Additional URL
https://doi.org/10.1109/WI-IAT55865.2022.00018