Publication Type

Conference Proceeding Article

Version

publishedVersion

Publication Date

10-2019

Abstract

The challenge of Ad-Hoc Video Search (AVS) originates from free-form (i.e., no pre-defined vocabulary) and freestyle (i.e., natural language) query descriptions. Bridging the semantic gap between AVS queries and videos is highly difficult, as evidenced by the low retrieval accuracy of AVS benchmarking in TRECVID. In this paper, we study a new method to fuse multimodal embeddings that have been derived from completely disjoint datasets. This method is tested on two datasets for two distinct tasks: on MSR-VTT for unique video retrieval and on V3C1 for multiple-video retrieval.

Keywords

Deep learning, Multimedia, Multimodal embeddings, Multimodal fusion, Video search

Discipline

Databases and Information Systems | Graphics and Human Computer Interfaces

Research Areas

Intelligent Systems and Optimization

Publication

Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Korea, October 27-28

First Page

1868

Last Page

1872

ISBN

9781728150239

Identifier

10.1109/ICCVW.2019.00233

Publisher

Institute of Electrical and Electronics Engineers Inc.

City or Country

Seoul
