Research Collection School Of Computing and Information Systems

Self-supervised video representation learning by uncovering spatio-temporal statistics

Publication Type

Journal Article

Version

publishedVersion

Publication Date

7-2022

Abstract

This paper proposes a novel pretext task for self-supervised video representation learning. Specifically, given an unlabeled video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, and the spatial location and dominant color of the largest color diversity along the temporal axis. A neural network is then built and trained to predict these statistical summaries from the video frames. To alleviate the learning difficulty, we employ several spatial partitioning patterns to encode rough spatial locations instead of exact Cartesian coordinates. Our approach is inspired by the observation that the human visual system is sensitive to rapidly changing content in the visual field, and needs only impressions of rough spatial locations to understand that content. To validate the effectiveness of the proposed approach, we conduct extensive experiments with four 3D backbone networks, i.e., C3D, 3D-ResNet, R(2+1)D and S3D-G. The results show that our approach outperforms existing approaches across these backbone networks on four downstream video analysis tasks: action recognition, video retrieval, dynamic scene recognition, and action similarity labeling. The source code is publicly available at: https://github.com/laura-wang/video_repres_sts.
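
As a rough illustration of the kind of label described in the abstract, the sketch below derives a coarse location and a dominant direction of the largest motion from a precomputed optical-flow array. It is not the authors' implementation (see the repository linked above for that). The function name `motion_statistics`, the uniform grid partition, the 8-bin direction quantization, and the choice to sum flow magnitude over time are all assumptions made for this example.

```python
import numpy as np

def motion_statistics(flow, grid=4, n_bins=8):
    """Hypothetical helper: coarse motion labels from an optical-flow array.

    flow: array of shape (T, H, W, 2) holding per-pixel (dx, dy) displacements.
    Returns (block_index, direction_bin), both plain ints.
    """
    # Per-frame motion magnitude, then accumulate over the temporal axis.
    mag = np.linalg.norm(flow, axis=-1)      # shape (T, H, W)
    total = mag.sum(axis=0)                  # shape (H, W)

    # Pixel with the largest accumulated motion.
    y, x = np.unravel_index(np.argmax(total), total.shape)

    # Encode the rough location as a block index in a grid x grid partition
    # (assumed pattern) instead of exact coordinates.
    h, w = total.shape
    row = min(int(y) * grid // h, grid - 1)
    col = min(int(x) * grid // w, grid - 1)
    block_index = row * grid + col

    # Dominant direction at that pixel: magnitude-weighted histogram of
    # flow angles over time, quantized into n_bins equal sectors.
    angles = np.arctan2(flow[:, y, x, 1], flow[:, y, x, 0]) % (2 * np.pi)
    bins = np.floor(angles / (2 * np.pi / n_bins)).astype(int) % n_bins
    hist = np.bincount(bins, weights=mag[:, y, x], minlength=n_bins)
    direction_bin = int(np.argmax(hist))

    return block_index, direction_bin
```

In a pretext task of this style, the two integers would serve as classification targets for a network that sees only the raw frames. Predicting a block index rather than a pixel coordinate reflects the abstract's point that rough spatial locations are easier to learn than exact ones.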

Keywords

Task analysis, Three-dimensional displays, Neural networks, Image color analysis, Visualization, Training, Feature extraction, Self-supervised learning, representation learning, video understanding, 3D CNN

Discipline

Information Security

Research Areas

Information Systems and Management

Publication

IEEE Transactions on Pattern Analysis and Machine Intelligence

Volume

44

Issue

7

First Page

3791

Last Page

3806

ISSN

0162-8828

Identifier

10.1109/TPAMI.2021.3057833

Publisher

Institute of Electrical and Electronics Engineers

Citation

WANG, Jiangliu; JIAO, Jianbo; BAO, Linchao; HE, Shengfeng; LIU, Wei; and LIU, Yun-hui. Self-supervised video representation learning by uncovering spatio-temporal statistics. (2022). IEEE Transactions on Pattern Analysis and Machine Intelligence. 44, (7), 3791-3806.
Available at: https://ink.library.smu.edu.sg/sis_research/7839

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Additional URL

https://doi.org/10.1109/TPAMI.2021.3057833

Included in

Information Security Commons
