Publication Type

Conference Proceeding Article

Version

publishedVersion

Publication Date

11-2016

Abstract

Multimedia events such as “birthday party” usually involve complex interactions between humans and objects. Unlike actions and sports, these events rarely contain distinctive motion patterns that can be exploited for recognition. To encode the rich set of objects in an event, a common practice is to tag each video frame with object labels, represented as a vector of object appearance probabilities. These vectors are then pooled across frames to obtain a video-level representation. Current practice suffers from two deficiencies that stem from the direct use of deep convolutional neural networks (DCNNs) and standard feature pooling techniques. First, the max-pooling and softmax layers in a DCNN overemphasize the primary object or scene in a frame, producing a sparse vector that overlooks secondary or small objects. Second, pooling sparse vectors with a max or average operator makes the video-level feature unpredictable in modeling the object composition of an event. To address these problems, this paper proposes a new video representation, named Object-VLAD, which treats each object equally and encodes the objects into a vector for multimedia event detection. Furthermore, the vector can be flexibly decoded to identify evidence, such as key objects, that recounts why a video is retrieved for an event of interest. Experiments on the MED13 and MED14 datasets verify the merit of Object-VLAD, which consistently outperforms several state-of-the-art methods in both event detection and recounting.
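
The pooling step described in the abstract can be illustrated with a small sketch. The abstract does not give the details of Object-VLAD, so the code below is not the authors' method: it shows max and average pooling over toy per-frame probability vectors, followed by the generic VLAD recipe (sum of residuals to the nearest codebook center, then L2 normalization) as an assumed stand-in. All function names, the toy frame scores, and the two-center codebook are hypothetical.

```python
from math import sqrt

def max_pool(frames):
    """Element-wise maximum over per-frame probability vectors."""
    return [max(col) for col in zip(*frames)]

def avg_pool(frames):
    """Element-wise mean over per-frame probability vectors."""
    return [sum(col) / len(frames) for col in zip(*frames)]

def vlad_encode(descriptors, centers):
    """Generic VLAD (an assumption, not the paper's Object-VLAD):
    accumulate each descriptor's residual to its nearest center,
    concatenate the per-center sums, then L2-normalize."""
    dim = len(centers[0])
    residuals = [[0.0] * dim for _ in centers]
    for d in descriptors:
        # Index of the nearest center by squared Euclidean distance.
        k = min(
            range(len(centers)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(d, centers[i])),
        )
        for j in range(dim):
            residuals[k][j] += d[j] - centers[k][j]
    flat = [v for r in residuals for v in r]
    norm = sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat

# Toy data: three frames scored over three hypothetical object classes.
frames = [
    [0.9, 0.1, 0.0],
    [0.8, 0.0, 0.2],
    [0.1, 0.9, 0.0],
]
video_max = max_pool(frames)
video_avg = avg_pool(frames)

# Toy 2-D descriptors and a hypothetical two-center codebook.
descriptors = [[0.1, 0.0], [0.9, 1.0], [1.0, 0.8]]
centers = [[0.0, 0.0], [1.0, 1.0]]
video_vlad = vlad_encode(descriptors, centers)
```

In this toy case max pooling keeps only the strongest score per class and average pooling dilutes classes that appear in few frames, while the VLAD-style vector has one block per codebook center, so its length is the number of centers times the descriptor dimension.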

Discipline

Software Engineering

Research Areas

Software and Cyber-Physical Systems

Publication

Proceedings of TRECVID 2016: Gaithersburg, November 14-16

First Page

1

Last Page

15

Publisher

National Institute of Standards and Technology

City or Country

Gaithersburg

Citation

ZHANG, Hao; LU, Yi-Jie; and NGO, Chong-wah. VIREO @ TRECVID 2016: Multimedia event detection, ad-hoc video search, video-to-text description. (2016). Proceedings of TRECVID 2016: Gaithersburg, November 14-16. 1-15.
Available at: https://ink.library.smu.edu.sg/sis_research/6578

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Additional URL

https://www-nlpir.nist.gov/projects/tvpubs/tv16.papers/vireo.pdf

Included in

Software Engineering Commons

Research Collection School Of Computing and Information Systems

VIREO @ TRECVID 2016: Multimedia event detection, ad-hoc video search, video-to-text description