Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
6-2021
Abstract
Lip reading aims to predict spoken sentences from silent lip videos. Because such a vision task usually performs worse than its speech-recognition counterpart, one potential scheme is to distill knowledge from a teacher pretrained on audio signals. However, the latent domain gap between the cross-modal data can lead to learning ambiguity and thus limit the performance of lip reading. In this paper, we propose a novel collaborative framework for lip reading that considers two issues: 1) the teacher should understand bi-modal knowledge to bridge the inherent cross-modal gap; 2) the teacher should adjust the teaching contents adaptively as the student evolves. To these ends, instead of a pretrained teacher, we introduce a trainable “master” network which ingests both audio signals and silent lip videos. The master produces logits from three modalities of features: the audio modality, the video modality, and their combination. To further provide an interactive strategy that fuses this knowledge organically, we regularize the master with task-specific feedback from the student, in which the student's requirements are implicitly embedded. Meanwhile, we incorporate a pair of “tutor” networks into our system as guidance for flexibly emphasizing the fruitful knowledge. In addition, we adopt a curriculum learning design to ensure better convergence. Extensive experiments demonstrate that the proposed network outperforms state-of-the-art methods on several benchmarks, in both word-level and sentence-level scenarios.
Keywords
Curricula, Modal analysis, Speech recognition, Audio signal, Collaborative framework, Cross-modal, Interactive strategy, Learning designs, Lip reading, Modal data, Performance, Teachers, Teaching contents, Students
Discipline
Databases and Information Systems | Graphics and Human Computer Interfaces
Research Areas
Information Systems and Management
Publication
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
ISBN
9781665445092
Identifier
10.1109/CVPR46437.2021.01312
City or Country
USA
Citation
REN, Sucheng; DU, Yong; LV, Jianming; HAN, Guoqiang; and HE, Shengfeng.
Learning from the master: Distilling cross-modal advanced knowledge for lip reading. (2021). Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Available at: https://ink.library.smu.edu.sg/sis_research/8442
Copyright Owner and License
Authors
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Included in
Databases and Information Systems Commons, Graphics and Human Computer Interfaces Commons