Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
10-2019
Abstract
The challenge of Ad-Hoc Video Search (AVS) stems from its free-form (i.e., no pre-defined vocabulary) and freestyle (i.e., natural language) query descriptions. Bridging the semantic gap between AVS queries and videos is highly difficult, as evidenced by the low retrieval accuracy reported in the TRECVID AVS benchmark. In this paper, we study a new method for fusing multimodal embeddings that have been derived from completely disjoint datasets. The method is tested on two datasets for two distinct tasks: MSR-VTT for retrieval of a unique video, and V3C1 for retrieval of multiple videos.
Keywords
Deep learning, Multimedia, Multimodal embeddings, Multimodal fusion, Video search
Discipline
Databases and Information Systems | Graphics and Human Computer Interfaces
Research Areas
Intelligent Systems and Optimization
Publication
Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Korea, October 27-28
First Page
1868
Last Page
1872
ISBN
9781728150239
Identifier
10.1109/ICCVW.2019.00233
Publisher
Institute of Electrical and Electronics Engineers Inc.
City or Country
Seoul
Citation
FRANCIS, Danny; NGUYEN, Phuong Anh; HUET, Benoit; and NGO, Chong-wah.
Fusion of multimodal embeddings for ad-hoc video search. (2019). Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Korea, October 27-28. 1868-1872.
Available at: https://ink.library.smu.edu.sg/sis_research/6462
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Included in
Databases and Information Systems Commons, Graphics and Human Computer Interfaces Commons