Publication Type
Conference Proceeding Article
Version
acceptedVersion
Publication Date
10-2023
Abstract
Chinese Grammatical Error Correction (CGEC) has recently attracted growing attention from researchers. Although multiple CGEC datasets have been developed to support this research, they do not provide a deep linguistic topology of grammar errors, which is critical for interpreting and diagnosing CGEC approaches. To address this limitation, we introduce FlaCGEC, a new CGEC dataset with fine-grained linguistic annotation. Specifically, we collect a raw corpus from the linguistic schema defined by Chinese language experts, edit the sentences via rules, and manually refine the generated samples, resulting in 10k sentences with 78 instantiated grammar points and 3 types of edits. We evaluate various cutting-edge CGEC methods on FlaCGEC, and their unremarkable results indicate that the dataset is challenging, covering a wide range of grammatical errors. In addition, we treat FlaCGEC as a diagnostic dataset for testing generalization skills and conduct a thorough evaluation of existing CGEC models.
Keywords
Chinese Grammatical Error Correction, Deep Learning, Fine-grained Linguistic Annotation
Discipline
Asian Studies | Databases and Information Systems | East Asian Languages and Societies
Publication
CIKM '23: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, Birmingham, October 21-25
First Page
5321
Last Page
5325
ISBN
9798400701245
Identifier
10.1145/3583780.3615119
Publisher
ACM
City or Country
New York
Citation
DU, Hanyue; ZHAO, Yike; TIAN, Qingyuan; WANG, Jiani; WANG, Lei; LAN, Yunshi; and LU, Xuesong.
FlaCGEC: A Chinese grammatical error correction dataset with fine-grained linguistic annotation. (2023). CIKM '23: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, Birmingham, October 21-25. 5321-5325.
Available at: https://ink.library.smu.edu.sg/sis_research/8463
Copyright Owner and License
Authors
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Additional URL
https://doi.org/10.1145/3583780.3615119
Included in
Asian Studies Commons, Databases and Information Systems Commons, East Asian Languages and Societies Commons