Research Collection School Of Computing and Information Systems

Punctuation prediction for Vietnamese texts using conditional random fields

Publication Type

Conference Proceeding Article

Version

publishedVersion

Publication Date

12-2019

Abstract

We investigate punctuation prediction for the Vietnamese language. This problem is important because its solution can be used to add suitable punctuation marks to machine-transcribed speech, which usually lacks such information. Following previous work on English and Chinese, we formulate the task as a sequence labeling problem. We then apply a conditional random field model and propose a set of features that are useful for prediction. Moreover, we build two corpora from Vietnamese online news and movie subtitles and perform extensive experiments on these data. Finally, we ask four volunteers to insert punctuation marks into a small sample of our dataset. The experimental results show that this problem is challenging, even for humans, and that our model achieves performance close to that of a human.

Keywords

Conditional random field, Punctuation prediction, Sequence labeling, Vietnamese language

Discipline

Numerical Analysis and Scientific Computing | South and Southeast Asian Languages and Societies

Publication

SoICT '19: Proceedings of the 10th International Symposium on Information and Communication Technology, Hanoi, December 4-6

First Page

322

Last Page

327

ISBN

9781450372459

Identifier

10.1145/3368926.3369716

Publisher

ACM

City or Country

New York

Citation

PHAM, Hong Quang; NGUYEN, Binh T.; and CUONG, Nguyen Viet. Punctuation prediction for Vietnamese texts using conditional random fields. (2019). SoICT '19: Proceedings of the 10th International Symposium on Information and Communication Technology, Hanoi, December 4-6. 322-327.
Available at: https://ink.library.smu.edu.sg/sis_research/7816

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Comments

scispg

Additional URL

https://doi.org/10.1145/3368926.3369716

Included in

Numerical Analysis and Scientific Computing Commons, South and Southeast Asian Languages and Societies Commons
