Publication Type

Journal Article

Version

acceptedVersion

Publication Date

11-2021

Abstract

In Chinese, Chengyu are fixed phrases consisting of four characters. As a type of idiom, their meanings usually cannot be derived from their component characters. In this paper, we study the task of recommending a Chengyu given a textual context. Observing some limitations of existing work, we propose a two-stage model: in the first stage, we retrain a Chinese BERT model by masking out Chengyu from a large Chinese corpus with wide coverage of Chengyu; in the second stage, we fine-tune the retrained, Chengyu-oriented BERT on a specific Chengyu recommendation dataset. We evaluate this method on the ChID and CCT datasets and find that it achieves state-of-the-art results on both. Ablation studies show that both stages of training are critical for the performance gain.
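
The abstract describes the first stage as masked-language-model retraining in which whole Chengyu spans are masked out of a large corpus. The snippet below is a minimal sketch of that idea only, not the authors' code: it uses the Hugging Face transformers library with the public bert-base-chinese checkpoint, and the example sentence and idiom are illustrative assumptions.

```python
# Minimal sketch (assumption, not the paper's implementation) of stage one:
# mask the whole four-character Chengyu span and compute the MLM loss on it.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")

sentence = "他做事一丝不苟，从不敷衍了事。"   # illustrative context containing a Chengyu
chengyu = "一丝不苟"                          # the four-character idiom to mask out

inputs = tokenizer(sentence, return_tensors="pt")
input_ids = inputs["input_ids"][0]

# Locate the Chengyu's character tokens and replace them with [MASK];
# positions labeled -100 are ignored by the loss.
chengyu_ids = tokenizer.convert_tokens_to_ids(list(chengyu))
labels = torch.full_like(input_ids, -100)
for start in range(len(input_ids) - len(chengyu_ids) + 1):
    if input_ids[start:start + len(chengyu_ids)].tolist() == chengyu_ids:
        labels[start:start + len(chengyu_ids)] = input_ids[start:start + len(chengyu_ids)]
        input_ids[start:start + len(chengyu_ids)] = tokenizer.mask_token_id
        break

outputs = model(input_ids=input_ids.unsqueeze(0),
                attention_mask=inputs["attention_mask"],
                labels=labels.unsqueeze(0))
loss = outputs.loss  # would be backpropagated over a large Chengyu-rich corpus
```

In the second stage described in the abstract, the retrained model would then be fine-tuned on a Chengyu recommendation dataset such as ChID, where the task is to choose the correct idiom for a blanked context from a set of candidates.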

Keywords

natural language processing, chengyu recommendation, idiom understanding, question answering

Discipline

Databases and Information Systems | East Asian Languages and Societies | Numerical Analysis and Scientific Computing

Research Areas

Data Science and Engineering

Publication

ACM Transactions on Asian and Low-Resource Language Information Processing

Volume

20

Issue

6

First Page

1

Last Page

18

ISSN

2375-4699

Identifier

10.1145/3453185

Publisher

ACM

Embargo Period

3-11-2021

Copyright Owner and License

Authors

Additional URL

https://doi.org/10.1145/3453185
