Publication Type
Journal Article
Version
acceptedVersion
Publication Date
2-2024
Abstract
Automatic speech recognition (ASR) is a typical pattern recognition technology that converts human speech into text. With the aid of advanced deep learning models, the performance of speech recognition has improved significantly. In particular, emerging Audio–Visual Speech Recognition (AVSR) methods achieve satisfactory performance by combining audio-modal and visual-modal information. However, various complex environments, especially noise, limit the effectiveness of existing methods. To address the noise problem, in this paper we propose a novel cross-modal audio–visual speech recognition model, named CATNet. First, we devise a cross-modal bidirectional fusion model to analyze the close relationship between the audio and visual modalities. Second, we propose an audio–visual dual-modal network to preprocess audio and visual information, extract significant features and filter out redundant noise. The experimental results demonstrate the effectiveness of CATNet, which achieves excellent WER, CER and convergence speed, outperforms other benchmark models and overcomes the challenge posed by noisy environments.
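As a rough illustration of the cross-modal bidirectional fusion idea described in the abstract (not the authors' published CATNet implementation), the following PyTorch-style sketch shows how an audio feature sequence and a visual feature sequence might be fused with bidirectional cross-attention so that each modality can compensate for noise in the other. All module choices, dimensions and hyperparameters here are illustrative assumptions.

```python
# Hypothetical sketch of bidirectional cross-modal attention fusion.
# This is NOT the published CATNet code; dimensions and module choices
# are assumptions made purely for illustration.
import torch
import torch.nn as nn

class BidirectionalCrossModalFusion(nn.Module):
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        # Audio queries attend to visual features, and vice versa.
        self.audio_to_visual = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.visual_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (batch, T_audio, dim); visual_feats: (batch, T_visual, dim)
        # Each modality queries the other, so a noisy audio stream can be
        # supported by lip-movement (visual) cues and vice versa.
        a_enh, _ = self.audio_to_visual(audio_feats, visual_feats, visual_feats)
        v_enh, _ = self.visual_to_audio(visual_feats, audio_feats, audio_feats)
        # Residual connections plus layer normalization on each stream.
        audio_out = self.norm_a(audio_feats + a_enh)
        visual_out = self.norm_v(visual_feats + v_enh)
        return audio_out, visual_out

# Example usage with dummy feature sequences of different lengths.
fusion = BidirectionalCrossModalFusion()
audio = torch.randn(2, 100, 256)          # e.g. 100 audio frames
visual = torch.randn(2, 25, 256)          # e.g. 25 video frames
a_out, v_out = fusion(audio, visual)      # (2, 100, 256), (2, 25, 256)
```

The two enhanced streams would then feed a downstream recognizer trained with a WER/CER-oriented objective; the specific backbone and decoder used by CATNet are described in the full paper rather than in this sketch.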
Keywords
Attention mechanism, Audio-visual speech recognition, Cross-modal fusion, Deep learning
Discipline
Graphics and Human Computer Interfaces | Numerical Analysis and Scientific Computing
Publication
Pattern Recognition Letters
Volume
178
First Page
216
Last Page
222
ISSN
0167-8655
Identifier
10.1016/j.patrec.2024.01.002
Publisher
Elsevier
Citation
WANG, Xingmei; MI, Jianchen; LI, Boquan; ZHAO, Yixu; and MENG, Jiaxiang.
CATNet: Cross-modal fusion for audio-visual speech recognition. (2024). Pattern Recognition Letters. 178, 216-222.
Available at: https://ink.library.smu.edu.sg/sis_research/8645
Copyright Owner and License
Authors
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
https://doi.org/10.1016/j.patrec.2024.01.002
Included in
Graphics and Human Computer Interfaces Commons, Numerical Analysis and Scientific Computing Commons