Publication Type

Journal Article

Version

acceptedVersion

Publication Date

2-2024

Abstract

Automatic speech recognition (ASR) is a typical pattern recognition technology that converts human speech into text. With the aid of advanced deep learning models, the performance of speech recognition has improved significantly. In particular, emerging Audio–Visual Speech Recognition (AVSR) methods achieve satisfactory performance by combining audio-modal and visual-modal information. However, complex environments, especially noise, limit the effectiveness of existing methods. To address the noise problem, in this paper we propose a novel cross-modal audio–visual speech recognition model, named CATNet. First, we devise a cross-modal bidirectional fusion model to analyze the close relationship between the audio and visual modalities. Second, we propose an audio–visual dual-modal network that preprocesses audio and visual information, extracts significant features and filters out redundant noise. The experimental results demonstrate the effectiveness of CATNet, which achieves excellent WER, CER and convergence speed, outperforms other benchmark models and overcomes the challenge posed by noisy environments.
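
As a rough illustration of the cross-modal bidirectional fusion idea sketched in the abstract, the snippet below shows, in PyTorch, one common way such fusion is built: each modality attends to the other (audio queries over visual keys/values, and vice versa) and the attended features are added back residually. This is a minimal sketch assuming standard multi-head attention; all module and parameter names are hypothetical and do not reflect the authors' actual CATNet implementation.

import torch
import torch.nn as nn

class BidirectionalCrossModalFusion(nn.Module):
    """Hypothetical bidirectional attention fusion between two modalities."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Audio queries attend to visual keys/values, and vice versa.
        self.audio_to_visual = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.visual_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_audio = nn.LayerNorm(dim)
        self.norm_visual = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio: (batch, T_audio, dim); visual: (batch, T_visual, dim)
        a2v, _ = self.audio_to_visual(audio, visual, visual)  # audio enriched by visual cues
        v2a, _ = self.visual_to_audio(visual, audio, audio)   # visual enriched by acoustics
        # Residual connections preserve each stream's original information.
        return self.norm_audio(audio + a2v), self.norm_visual(visual + v2a)

if __name__ == "__main__":
    fusion = BidirectionalCrossModalFusion()
    audio_feats = torch.randn(2, 100, 256)   # e.g. 100 acoustic frames per clip
    visual_feats = torch.randn(2, 25, 256)   # e.g. 25 lip-region frames per clip
    a, v = fusion(audio_feats, visual_feats)
    print(a.shape, v.shape)  # torch.Size([2, 100, 256]) torch.Size([2, 25, 256])

The two attention directions let noisy acoustic frames borrow information from the (noise-free) lip movements while the visual stream is grounded in the acoustics, which is the general motivation for bidirectional fusion in noisy AVSR.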

Keywords

Attention mechanism, Audio-visual speech recognition, Cross-modal fusion, Deep learning

Discipline

Graphics and Human Computer Interfaces | Numerical Analysis and Scientific Computing

Publication

Pattern Recognition Letters

Volume

178

First Page

216

Last Page

222

ISSN

0167-8655

Identifier

10.1016/j.patrec.2024.01.002

Publisher

Elsevier

Copyright Owner and License

Authors

Additional URL

https://doi.org/10.1016/j.patrec.2024.01.002
