Publication Type

Journal Article

Version

acceptedVersion

Publication Date

2-2024

Abstract

Automatic speech recognition (ASR) is a typical pattern recognition technology that converts human speech into text. With the aid of advanced deep learning models, the performance of speech recognition has improved significantly. In particular, emerging Audio–Visual Speech Recognition (AVSR) methods achieve satisfactory performance by combining audio-modal and visual-modal information. However, complex environments, especially noise, limit the effectiveness of existing methods. To address the noise problem, in this paper we propose a novel cross-modal audio–visual speech recognition model, named CATNet. First, we devise a cross-modal bidirectional fusion model to analyze the close relationship between the audio and visual modalities. Second, we propose an audio–visual dual-modal network that preprocesses audio and visual information, extracts significant features and filters out redundant noise. The experimental results demonstrate the effectiveness of CATNet, which achieves excellent WER, CER and convergence speed, outperforms other benchmark models and overcomes the challenge posed by noisy environments.
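
As a rough illustration of the cross-modal bidirectional fusion idea sketched in the abstract, the snippet below shows, in PyTorch, one common way such fusion is built: each modality attends to the other (audio queries over visual keys/values, and vice versa) and the attended features are added back residually. This is a minimal sketch assuming standard multi-head attention; all module and parameter names are hypothetical and do not reflect the authors' actual CATNet implementation.

import torch
import torch.nn as nn

class BidirectionalCrossModalFusion(nn.Module):
    """Hypothetical bidirectional attention fusion between two modalities."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Audio queries attend to visual keys/values, and vice versa.
        self.audio_to_visual = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.visual_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_audio = nn.LayerNorm(dim)
        self.norm_visual = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio: (batch, T_audio, dim); visual: (batch, T_visual, dim)
        a2v, _ = self.audio_to_visual(audio, visual, visual)  # audio enriched by visual cues
        v2a, _ = self.visual_to_audio(visual, audio, audio)   # visual enriched by acoustics
        # Residual connections preserve each stream's original information.
        return self.norm_audio(audio + a2v), self.norm_visual(visual + v2a)

if __name__ == "__main__":
    fusion = BidirectionalCrossModalFusion()
    audio_feats = torch.randn(2, 100, 256)   # e.g. 100 acoustic frames per clip
    visual_feats = torch.randn(2, 25, 256)   # e.g. 25 lip-region frames per clip
    a, v = fusion(audio_feats, visual_feats)
    print(a.shape, v.shape)  # torch.Size([2, 100, 256]) torch.Size([2, 25, 256])

The two attention directions let noisy acoustic frames borrow information from the (noise-free) lip movements while the visual stream is grounded in the acoustics, which is the general motivation for bidirectional fusion in noisy AVSR.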

Keywords

Attention mechanism, Audio-visual speech recognition, Cross-modal fusion, Deep learning

Discipline

Graphics and Human Computer Interfaces | Numerical Analysis and Scientific Computing

Publication

Pattern Recognition Letters

Volume

178

First Page

216

Last Page

222

ISSN

0167-8655

Identifier

10.1016/j.patrec.2024.01.002

Publisher

Elsevier

Copyright Owner and License

Authors

Additional URL

https://doi.org/10.1016/j.patrec.2024.01.002
