Publication Type
Journal Article
Version
acceptedVersion
Publication Date
11-2021
Abstract
In Chinese, Chengyu are fixed phrases consisting of four characters. As a type of idiom, their meanings usually cannot be derived from their component characters. In this paper, we study the task of recommending a Chengyu given a textual context. Observing some limitations of existing work, we propose a two-stage model. In the first stage, we retrain a Chinese BERT model by masking out Chengyu in a large Chinese corpus with wide coverage of Chengyu. In the second stage, we fine-tune the retrained, Chengyu-oriented BERT on a specific Chengyu recommendation dataset. We evaluate this method on the ChID and CCT datasets and find that it achieves state-of-the-art results on both. Ablation studies show that both stages of training are critical for the performance gain.
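The first-stage idea of masking out Chengyu in raw text can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `mask_chengyu`, the tiny vocabulary, the example sentence, and the choice of one `[MASK]` token per character are all assumptions made for illustration only.

```python
from typing import Iterable, List, Tuple

MASK = "[MASK]"  # assumed mask token, following common BERT vocabularies


def mask_chengyu(
    text: str, vocab: Iterable[str], mask_token: str = MASK
) -> List[Tuple[str, str]]:
    """Return one (masked_text, chengyu) pair per Chengyu occurrence in text.

    Hypothetical helper: each character of the matched Chengyu is replaced
    by one mask token, and the Chengyu itself is kept as the training label.
    """
    examples = []
    for idiom in vocab:
        start = text.find(idiom)
        while start != -1:
            end = start + len(idiom)
            masked = text[:start] + mask_token * len(idiom) + text[end:]
            examples.append((masked, idiom))
            # Continue searching for further occurrences of the same Chengyu.
            start = text.find(idiom, start + 1)
    return examples


if __name__ == "__main__":
    sentence = "他做事总是一丝不苟，从不马虎。"
    for masked, label in mask_chengyu(sentence, ["一丝不苟", "画蛇添足"]):
        print(masked, "->", label)
```

Pairs produced this way could serve as masked-prediction examples for the retraining stage, while the second stage would fine-tune on a recommendation dataset such as ChID, where the model chooses among candidate Chengyu.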
Keywords
natural language processing, chengyu recommendation, idiom understanding, question answering
Discipline
Databases and Information Systems | East Asian Languages and Societies | Numerical Analysis and Scientific Computing
Research Areas
Data Science and Engineering
Publication
ACM Transactions on Asian and Low-Resource Language Information Processing
Volume
20
Issue
6
First Page
1
Last Page
18
ISSN
2375-4699
Identifier
10.1145/3453185
Publisher
ACM
Embargo Period
3-11-2021
Citation
TAN, Minghuan; JIANG, Jing; and DAI, Bingtian.
A BERT-based two-stage model for Chinese Chengyu recommendation. (2021). ACM Transactions on Asian and Low-Resource Language Information Processing. 20, (6), 1-18.
Available at: https://ink.library.smu.edu.sg/sis_research/5821
Copyright Owner and License
Authors
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Additional URL
https://doi.org/10.1145/3453185
Included in
Databases and Information Systems Commons, East Asian Languages and Societies Commons, Numerical Analysis and Scientific Computing Commons