VrdONE : One-stage video visual relation detection
Publication Type
Conference Proceeding Article
Publication Date
10-2024
Abstract
Video Visual Relation Detection (VidVRD) focuses on understanding how entities interact over time and space in videos, a key step for gaining deeper insights into video scenes beyond basic visual tasks. Traditional methods for VidVRD, challenged by its complexity, typically split the task into two parts: one for identifying what relation categories are present and another for determining their temporal boundaries. This split overlooks the inherent connection between these elements. Addressing the need to recognize entity pairs' spatiotemporal interactions across a range of durations, we propose VrdONE, a streamlined yet efficacious one-stage model. VrdONE combines the features of subjects and objects, turning predicate detection into 1D instance segmentation on their combined representations. This setup allows for both relation category identification and binary mask generation in one go, eliminating the need for extra steps like proposal generation or post-processing. VrdONE facilitates the interaction of features across various frames, adeptly capturing both short-lived and enduring relations. Additionally, we introduce the Subject-Object Synergy (SOS) module, enhancing how subjects and objects perceive each other before combining. VrdONE achieves state-of-the-art performance on the VidOR benchmark and ImageNet-VidVRD, showcasing its superior capability in discerning relations across different temporal scales.
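The idea of treating predicate detection as 1D instance segmentation can be illustrated with a small sketch. This is not the authors' implementation: the function names `threshold` and `mask_to_segments`, the 0.5 cutoff, and the example scores are all hypothetical, chosen only to show how a per-frame binary mask for one relation instance already encodes its temporal boundaries.

```python
from typing import List, Tuple


def threshold(scores: List[float], tau: float = 0.5) -> List[int]:
    """Binarize per-frame scores (hypothetical cutoff tau, not from the paper)."""
    return [1 if s >= tau else 0 for s in scores]


def mask_to_segments(mask: List[int]) -> List[Tuple[int, int]]:
    """Return half-open [start, end) frame intervals where the mask is 1."""
    segments: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(mask):
        if value and start is None:
            start = i                      # a run of active frames begins
        elif not value and start is not None:
            segments.append((start, i))    # the run ended at the previous frame
            start = None
    if start is not None:
        segments.append((start, len(mask)))  # run extends to the last frame
    return segments


# Hypothetical per-frame scores for one (subject, predicate, object) instance.
scores = [0.1, 0.8, 0.9, 0.7, 0.2, 0.1, 0.6, 0.9]
mask = threshold(scores)
segments = mask_to_segments(mask)
print(mask, segments)
```

In this toy setting, a single predicted instance carries both a relation category and a frame-level mask, so the intervals in which the relation holds can be read directly from the mask rather than produced by a separate boundary-localization stage.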
Keywords
Scene understanding, Video relation detection, Video understanding, One-stage, Set prediction, Spatiotemporally synergism
Discipline
Artificial Intelligence and Robotics | Graphics and Human Computer Interfaces
Research Areas
Intelligent Systems and Optimization
Publication
Proceedings of 32nd ACM International Conference on Multimedia (ACM MM 2024) : Melbourne, Australia, October 28 - November 1
First Page
1437
Last Page
1446
Identifier
10.1145/3664647.3680833
Publisher
Association for Computing Machinery
City or Country
Australia
Citation
JIANG, Xinjie; ZHENG, Chenxi; XU, Xuemiao; LIU, Bangzhen; ZHENG, Weiying; ZHANG, Huaidong; and HE, Shengfeng.
VrdONE : One-stage video visual relation detection. (2024). Proceedings of 32nd ACM International Conference on Multimedia (ACM MM 2024) : Melbourne, Australia, October 28 - November 1. 1437-1446.
Available at: https://ink.library.smu.edu.sg/sis_research/9802
Additional URL
https://doi.org/10.1145/3664647.3680833