Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
12-2019
Abstract
We investigate punctuation prediction for the Vietnamese language. This problem is important because a solution can add suitable punctuation marks to machine-transcribed speech, which usually lacks them. Following previous work on English and Chinese, we formulate the task as a sequence labeling problem. We then apply a conditional random field model and propose a set of features that are useful for prediction. Moreover, we build two corpora from Vietnamese online news and movie subtitles and perform extensive experiments on these data. Finally, we ask four volunteers to insert punctuation marks into a small sample of our dataset. The experimental results show that this problem is challenging, even for a human, and that our model achieves performance close to that of a human.
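The sequence labeling formulation mentioned in the abstract can be illustrated with a minimal sketch. It assumes each word is tagged with the punctuation mark that follows it, or "O" when none does. The label names, the lowercasing step, and the window features in `word_features` are illustrative assumptions, not the label set or feature set used in the paper; feature dictionaries of this kind are what a conditional random field toolkit would typically consume.

```python
import re

# Assumed label set: the mark that follows a word, or "O" for no mark.
PUNCT = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}

def to_sequence_labels(text):
    """Turn punctuated text into (word, label) pairs for sequence labeling."""
    pairs = []
    for tok in re.findall(r"[^\s,.?]+|[,.?]", text):
        if tok in PUNCT:
            if pairs:  # attach the mark to the preceding word
                pairs[-1] = (pairs[-1][0], PUNCT[tok])
        else:
            # Lowercase to mimic transcribed speech, which carries no casing.
            pairs.append((tok.lower(), "O"))
    return pairs

def word_features(words, i):
    """Hypothetical window features for position i (not the paper's features)."""
    return {
        "word": words[i],
        "prev": words[i - 1] if i > 0 else "<s>",
        "next": words[i + 1] if i < len(words) - 1 else "</s>",
    }

pairs = to_sequence_labels("Xin chào, bạn khỏe không?")
words = [w for w, _ in pairs]
features = [word_features(words, i) for i in range(len(words))]
```

Here `pairs` holds the training targets and `features` the per-word inputs; at prediction time only the unpunctuated words are available, and the model predicts one label per word.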
Keywords
Conditional random field, Punctuation prediction, Sequence labeling, Vietnamese language
Discipline
Numerical Analysis and Scientific Computing | South and Southeast Asian Languages and Societies
Publication
SoICT '19: Proceedings of the 10th International Symposium on Information and Communication Technology, Hanoi, December 4-6
First Page
322
Last Page
327
ISBN
9781450372459
Identifier
10.1145/3368926.3369716
Publisher
ACM
City or Country
New York
Citation
PHAM, Hong Quang; NGUYEN, Binh T.; and CUONG, Nguyen Viet.
Punctuation prediction for Vietnamese texts using conditional random fields. (2019). SoICT '19: Proceedings of the 10th International Symposium on Information and Communication Technology, Hanoi, December 4-6. 322-327.
Available at: https://ink.library.smu.edu.sg/sis_research/7816
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Additional URL
https://doi.org/10.1145/3368926.3369716
Included in
Numerical Analysis and Scientific Computing Commons, South and Southeast Asian Languages and Societies Commons