Publication Type
Conference Proceeding Article
Version
acceptedVersion
Publication Date
3-2025
Abstract
Spatio-Temporal Video Grounding (STVG), the task of identifying the target object in the field of view that a language instruction refers to, is a fundamental vision-language task. Current STVG approaches typically use feeds from an RGB camera that is assumed to be always on, and process the video frames with complex neural network pipelines. As a result, they often impose prohibitive system overheads (energy, latency) on pervasive devices. To address this, we propose NeuroViG with two key innovations: (a) leveraging event streams from a low-power neuromorphic event camera sensor to selectively trigger the more energy-hungry RGB camera for STVG, and (b) augmenting the STVG model with a lightweight Adaptive Frame Selector (AFS) that bypasses complex transformer-based operations for a majority of video frames, thereby enabling its execution on a pervasive Jetson AGX device. We have also introduced modifications to the neural network processing pipeline so that the system can offer tunable tradeoffs between accuracy and energy/latency. Our proposed NeuroViG system reduces the STVG energy overhead and latency by 4x and 3.8x, respectively, with less than 1% loss in accuracy.
Keywords
event processing, multi-modal processing, spatio-temporal video grounding, vision-language, visual grounding
Discipline
Graphics and Human Computer Interfaces | Software Engineering
Research Areas
Software and Cyber-Physical Systems
Publication
2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV): Tucson, AZ, February 26 - March 4: Proceedings
First Page
5781
Last Page
5790
ISBN
9798331510831
Identifier
10.1109/WACV61041.2025.00564
Publisher
IEEE
City or Country
Piscataway
Citation
WEERAKOON, Dulanga; SUBBARAJU, Vigneshwaran; LIM, Joo Hwee; and MISRA, Archan.
NeuroViG: Integrating event cameras for resource-efficient video grounding. (2025). 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV): Tucson, AZ, February 26 - March 4: Proceedings. 5781-5790.
Available at: https://ink.library.smu.edu.sg/sis_research/10162
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
https://doi.org/10.1109/WACV61041.2025.00564