Publication Type

Conference Proceeding Article

Version

acceptedVersion

Publication Date

3-2025

Abstract

Spatio-Temporal Video Grounding (STVG), the task of identifying the target object in the field of view that a language instruction refers to, is a fundamental vision-language task. Current STVG approaches typically utilize feeds from an RGB camera that is assumed to be always on and process the video frames using complex neural network pipelines. As a result, they often impose prohibitive system overheads (energy, latency) on pervasive devices. To address this, we propose NeuroViG, with two key innovations: (a) leveraging event streams from a low-power neuromorphic event camera sensor to selectively trigger the more energy-hungry RGB camera for STVG, and (b) augmenting the STVG model with a lightweight Adaptive Frame Selector (AFS) that bypasses complex transformer-based operations for a majority of video frames, thereby enabling its execution on a pervasive Jetson AGX device. We also introduce modifications to the neural network processing pipeline so that the system offers tunable trade-offs between accuracy and energy/latency. Our proposed NeuroViG system reduces the STVG energy overhead and latency by 4x and 3.8x, respectively, with less than 1% loss in accuracy.
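
For readers unfamiliar with event-based gating, the following minimal Python sketch illustrates the general idea of using event-stream activity to selectively wake an RGB pipeline, as the abstract describes. It is not the paper's implementation: the windowing scheme, threshold value, and all function and object names are assumptions made for illustration.

```python
# Hypothetical sketch (not the NeuroViG implementation): gate an
# energy-hungry RGB + STVG pipeline on event-camera activity.
# Window size, threshold, and all names below are assumptions.

EVENT_THRESHOLD = 5000  # assumed: events per window needed to wake the RGB path


def should_trigger_rgb(event_window, threshold=EVENT_THRESHOLD):
    """Return True when the event camera reports enough activity in the
    current time window to justify capturing and processing an RGB frame."""
    return len(event_window) >= threshold


def run(event_stream, rgb_camera, stvg_model):
    """Consume fixed-length windows of events; run the full STVG model
    only on windows with sufficient scene activity."""
    for event_window in event_stream:
        if should_trigger_rgb(event_window):
            frame = rgb_camera.capture()   # wake the energy-hungry sensor
            stvg_model.process(frame)      # full STVG inference on this frame
        # otherwise: both the RGB camera and the STVG model stay idle,
        # which is where the energy/latency savings come from
```

Raising or lowering the threshold in a scheme like this is one way to expose the kind of tunable accuracy-versus-energy/latency trade-off the abstract mentions.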

Keywords

event processing, multi-modal processing, spatio-temporal video grounding, vision-language, visual grounding

Discipline

Graphics and Human Computer Interfaces | Software Engineering

Research Areas

Software and Cyber-Physical Systems

Publication

2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV): Tucson, AZ, February 26 - March 4: Proceedings

First Page

5781

Last Page

5790

ISBN

9798331510831

Identifier

10.1109/WACV61041.2025.00564

Publisher

IEEE

City or Country

Piscataway

Additional URL

https://doi.org/10.1109/WACV61041.2025.00564
