Publication Type

Conference Proceeding Article

Version

acceptedVersion

Publication Date

1-2020

Abstract

Adding appropriate punctuation marks to text is an essential step in speech-to-text, where such information is usually not available. While this has been extensively studied for English, there is no large-scale dataset or comprehensive study of the punctuation prediction problem for the Vietnamese language. In this paper, we collect two massive datasets and conduct a benchmark with both traditional methods and deep neural networks. We aim to publish both our data and all implementation code to facilitate further research, not only in Vietnamese punctuation prediction but also in other related fields. Our project, including datasets and implementation details, is publicly available at https://github.com/BinhMisfit/vietnamese-punctuation-prediction.

Keywords

Attention model, BiLSTM, Conditional random field, Punctuation prediction

Discipline

Numerical Analysis and Scientific Computing | South and Southeast Asian Languages and Societies

Publication

SOFSEM 2020: Theory and Practice of Computer Science: Limassol: January 20-24: Proceedings

Volume

12011

First Page

388

Last Page

400

ISBN

9783030389185

Identifier

10.1007/978-3-030-38919-2_32

Publisher

Springer

City or Country

Cham

Additional URL

https://doi.org/10.1007/978-3-030-38919-2_32
