Research Collection School Of Computing and Information Systems

Boosting video representation learning with multi-faceted integration

Publication Type

Conference Proceeding Article

Version

publishedVersion

Publication Date

6-2021

Abstract

Video content is multifaceted, consisting of objects, scenes, interactions or actions. Existing datasets mostly label only one of these facets for model training, so the learnt video representation is biased toward a single facet that depends on the training dataset. There is no study yet on how to learn a video representation from multifaceted labels, or on whether multifaceted information is helpful for video representation learning. In this paper, we propose a new learning framework, MUlti-Faceted Integration (MUFI), to aggregate facets from different datasets for learning a representation that could reflect the full spectrum of video content. Technically, MUFI formulates the problem as visual-semantic embedding learning, which explicitly maps the video representation into a rich semantic embedding space, and jointly optimizes the video representation from two perspectives. The first capitalizes on the intra-facet supervision between each video and its own label descriptions, and the second predicts the "semantic representation" of each video from the facets of other datasets as the inter-facet supervision. Extensive experiments demonstrate that learning a 3D CNN via our MUFI framework on a union of four large-scale video datasets plus two image datasets leads to superior capability of video representation. The 3D CNN pre-learnt with MUFI also shows clear improvements over other approaches on several downstream video applications. More remarkably, MUFI achieves 98.1%/80.9% on UCF101/HMDB51 for action recognition and 101.5% in terms of CIDEr-D score on MSVD for video captioning.
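
The two kinds of supervision named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation. The linear projection `W`, the cosine-similarity softmax with a `temperature`, the probability-weighted target embedding, and all function names are assumptions chosen only to make the idea concrete, and the data are random placeholders.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def embed_videos(features, W):
    # Hypothetical linear map from video features into the semantic space.
    return l2_normalize(features @ W)

def intra_facet_loss(video_emb, label_emb, labels, temperature=0.1):
    # Assumed form: cross-entropy over cosine similarities between each
    # video and the label embeddings of its own dataset (facet).
    logits = video_emb @ l2_normalize(label_emb).T / temperature
    probs = softmax(logits)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked + 1e-12).mean())

def inter_facet_loss(video_emb, other_label_emb, other_facet_probs):
    # Assumed form: build a target "semantic representation" as the
    # probability-weighted mix of another facet's label embeddings, then
    # penalize the cosine distance between the video and that target.
    target = l2_normalize(other_facet_probs @ l2_normalize(other_label_emb))
    return float((1.0 - (video_emb * target).sum(axis=1)).mean())

# Toy data: 4 videos, 16-d features, 8-d semantic space (all random).
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 16))
W = rng.normal(size=(16, 8))
action_emb = rng.normal(size=(5, 8))   # label embeddings of the videos' own facet
scene_emb = rng.normal(size=(3, 8))    # label embeddings of another facet
labels = np.array([0, 2, 4, 1])        # ground-truth labels in the own facet
scene_probs = softmax(rng.normal(size=(4, 3)))  # stand-in predictions for the other facet

v = embed_videos(features, W)
intra = intra_facet_loss(v, action_emb, labels)
inter = inter_facet_loss(v, scene_emb, scene_probs)
total = intra + inter
```

In this sketch the two terms are simply summed; how the terms are actually weighted and how the other-facet predictions are obtained are not stated in the abstract.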

Discipline

Databases and Information Systems

Research Areas

Data Science and Engineering

Publication

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 20-25

Identifier

10.1109/CVPR46437.2021.01381

Publisher

IEEE

City or Country

New York

Citation

QIU, Zhaofan; YAO, Ting; NGO, Chong-wah; ZHANG, Xiao-Ping; WU, Dong; and MEI, Tao. Boosting video representation learning with multi-faceted integration. (2021). Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 20-25.
Available at: https://ink.library.smu.edu.sg/sis_research/6808

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Included in

Databases and Information Systems Commons
