Research Collection School Of Computing and Information Systems

VrdONE : One-stage video visual relation detection

Xinjie JIANG
Chenxi ZHENG
Xuemiao XU
Bangzhen LIU
Weiying ZHENG
Huaidong ZHANG
Shengfeng HE, Singapore Management University

Publication Type

Conference Proceeding Article

Publication Date

10-2024

Abstract

Video Visual Relation Detection (VidVRD) focuses on understanding how entities interact over time and space in videos, a key step for gaining deeper insights into video scenes beyond basic visual tasks. Traditional methods for VidVRD, challenged by its complexity, typically split the task into two parts: one for identifying what relation categories are present and another for determining their temporal boundaries. This split overlooks the inherent connection between these elements. Addressing the need to recognize entity pairs' spatiotemporal interactions across a range of durations, we propose VrdONE, a streamlined yet efficacious one-stage model. VrdONE combines the features of subjects and objects, turning predicate detection into 1D instance segmentation on their combined representations. This setup allows for both relation category identification and binary mask generation in one go, eliminating the need for extra steps like proposal generation or post-processing. VrdONE facilitates the interaction of features across various frames, adeptly capturing both short-lived and enduring relations. Additionally, we introduce the Subject-Object Synergy (SOS) module, enhancing how subjects and objects perceive each other before combining. VrdONE achieves state-of-the-art performance on the VidOR and ImageNet-VidVRD benchmarks, showcasing its superior capability in discerning relations across different temporal scales.

Keywords

Scene understanding, Video relation detection, Video understanding, One-stage, Set prediction, Spatiotemporally synergism

Discipline

Artificial Intelligence and Robotics | Graphics and Human Computer Interfaces

Research Areas

Intelligent Systems and Optimization

Publication

Proceedings of 32nd ACM International Conference on Multimedia (ACM MM 2024) : Melbourne, Australia, October 28 - November 1

First Page

1437

Last Page

1446

Identifier

10.1145/3664647.3680833

Publisher

Association for Computing Machinery

City or Country

Australia

Citation

JIANG, Xinjie; ZHENG, Chenxi; XU, Xuemiao; LIU, Bangzhen; ZHENG, Weiying; ZHANG, Huaidong; and HE, Shengfeng. VrdONE : One-stage video visual relation detection. (2024). Proceedings of 32nd ACM International Conference on Multimedia (ACM MM 2024) : Melbourne, Australia, October 28 - November 1. 1437-1446.
Available at: https://ink.library.smu.edu.sg/sis_research/9802

Additional URL

https://doi.org/10.1145/3664647.3680833

This document is currently not available here.
