Publication Type

PhD Dissertation

Version

publishedVersion

Publication Date

5-2022

Abstract

In this dissertation, I study the understanding of Chinese idioms using transformer-based pretrained language models. By "understanding", I confine the topics to word embedding learning, contextualized word representation learning, multiple-choice cloze-test reading comprehension, and conditional text generation. Chinese idioms are fixed phrases with special meanings, usually derived from an ancient story. The meanings of these idioms are often not directly related to their component characters, which makes them harder to model than standard phrases whose meanings are compositional.
We begin by studying idiom representations derived from pretrained language models, in particular BERT. We adopt probing-based methods to investigate to what extent BERT can encode an idiom's meaning. We design two probing tasks to test whether idiom encodings from pretrained language models can be used to (1) classify the usage of a potentially idiomatic expression as either idiomatic or literal and (2) identify idiom paraphrases. Then we propose a BERT-based method to better learn Chinese idiom embeddings and evaluate the embeddings using our newly constructed dataset of Chinese idiom synonyms and antonyms.
We further study Chinese idiom prediction based on a context. We first propose a BERT-based dual embedding model for the Chinese idiom prediction task, where, given a context with a missing Chinese idiom and a set of candidate idioms, the model needs to find the correct idiom to fill in the blank. Our method is based on the observation that part of an idiom's meaning comes from a long-range context that contains topical information, and part of its meaning comes from a local context that encodes more of its syntactic usage. We use BERT to process the contextual words and to match the embedding of each candidate idiom with both the hidden representation corresponding to the blank in the context and the hidden representations of all the tokens in the context through context pooling. We also propose to use two separate idiom embeddings for the two kinds of matching. Experiments on ChID, a recently released Chinese idiom cloze-test dataset, show that our proposed method performs better than the existing state of the art. Ablation experiments also show that both context pooling and dual embedding contribute to the performance improvement.
Observing some of the limitations of existing work, we further propose a two-stage model. During the first stage, we retrain a Chinese BERT model by masking out idioms from a large Chinese corpus with a wide coverage of idioms. During the second stage, we fine-tune the retrained, idiom-oriented BERT on a specific idiom recommendation dataset. We evaluate this method on the ChID dataset and find that it achieves state-of-the-art performance. Ablation studies show that both stages of training are critical for the performance gain.
We also propose a new task called Chengyu-oriented text polishing. This task is based on the hypothesis that the proper use of Chengyu can usually enhance the elegance and conciseness of the Chinese language. We formulate the task as a context-dependent text generation problem and construct a dataset with 1.5 million automatically generated instances for training and 4K human-annotated examples for evaluation. The study offers solid baselines built with the latest pretrained encoder-decoder transformer models. Finally, we conclude the thesis by summarizing its contributions and pointing out potential future directions related to Chinese idiom understanding, namely sentiment analysis with idioms and explaining Chinese Chengyu recommendation models.
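The dual embedding matching described in the abstract can be illustrated with a small sketch. This is not the dissertation's implementation: the toy vectors stand in for BERT hidden states, max pooling is an assumed choice for "context pooling", summing the two match scores is an assumed way of combining them, and the names `score_candidates`, `max_pool`, `idiom_a`, and `idiom_b` are hypothetical.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def max_pool(token_vectors):
    """Element-wise max over all context token vectors (assumed pooling choice)."""
    return [max(dims) for dims in zip(*token_vectors)]

def score_candidates(h_blank, context_hidden, candidates):
    """Return a probability for each candidate idiom.

    Each candidate carries two embeddings: one matched against the hidden
    state at the blank (local usage), one matched against the pooled
    context (topical information). Summing the two scores is an assumption.
    """
    pooled = max_pool(context_hidden)
    logits = {
        idiom: dot(e_local, h_blank) + dot(e_global, pooled)
        for idiom, (e_local, e_global) in candidates.items()
    }
    # Numerically stable softmax over the candidate set.
    top = max(logits.values())
    exps = {idiom: math.exp(value - top) for idiom, value in logits.items()}
    total = sum(exps.values())
    return {idiom: value / total for idiom, value in exps.items()}

# Toy stand-ins for BERT outputs (hypothetical values).
h_blank = [0.9, 0.1, 0.0]
context_hidden = [[0.2, 0.8, 0.1], [0.9, 0.1, 0.0], [0.1, 0.3, 0.7]]
candidates = {
    "idiom_a": ([1.0, 0.0, 0.0], [0.5, 0.5, 0.5]),
    "idiom_b": ([0.0, 1.0, 0.0], [0.1, 0.1, 0.1]),
}

probs = score_candidates(h_blank, context_hidden, candidates)
best = max(probs, key=probs.get)
print(best, probs)
```

In a real system the hidden states would come from a BERT encoder and both idiom embedding tables would be learned during fine-tuning; the sketch only shows how the two kinds of matching can be kept separate per candidate.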

Keywords

natural language processing, multiword expressions, Chinese idioms

Degree Awarded

PhD in Information Systems

Discipline

Databases and Information Systems | East Asian Languages and Societies

Supervisor(s)

JIANG, Jing

Publisher

Singapore Management University

City or Country

Singapore

Citation

TAN, Minghuan. Chinese idiom understanding with transformer-based pretrained language models. (2022).
Available at: https://ink.library.smu.edu.sg/etd_coll/410

Copyright Owner and License

Author

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Included in

Databases and Information Systems Commons, East Asian Languages and Societies Commons

Dissertations and Theses Collection (Open Access)

Chinese idiom understanding with transformer-based pretrained language models
