Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
6-2024
Abstract
As pervasive devices are increasingly used for complex collaborative tasks such as cognitive assistants and interactive AR/VR companions, they are equipped with a myriad of sensors that facilitate natural interactions, such as voice commands. Spatio-Temporal Video Grounding (STVG), the task of identifying the target object in the field of view referred to by a language instruction, is a key capability needed for such systems. However, current STVG models tend to be resource-intensive, relying on multiple cross-attentional transformers applied to each video frame, so their runtime complexity increases linearly with video length. Furthermore, deploying these models on mobile devices while maintaining low latency poses additional challenges. Hence, this paper explores the latency and energy requirements of implementing STVG models on a pervasive device.
Keywords
Human-AI Collaboration, Spatio-Temporal Video Grounding
Discipline
Computer Engineering
Research Areas
Intelligent Systems and Optimization; Software and Cyber-Physical Systems
Areas of Excellence
Digital transformation
Publication
MOBISYS '24: Proceedings of the 22nd Annual International Conference on Mobile Systems, Minato-ku, Tokyo, Japan, June 3-7, 2024
First Page
648
Last Page
649
ISBN
9798400705816
Identifier
https://doi.org/10.1145/3643832.3661402
Publisher
ACM
City or Country
New York
Citation
WEERAKOON MUDIYANSELAGE, Dulanga Kaveesha; SUBBARAJU, Vigneshwaran; LIM, Joo Hwee; and MISRA, Archan.
Poster: Towards efficient spatio-temporal video grounding in pervasive mobile devices. (2024). MOBISYS '24: Proceedings of the 22nd Annual International Conference on Mobile Systems, Minato-ku, Tokyo, Japan, June 3-7, 2024. 648-649.
Available at: https://ink.library.smu.edu.sg/sis_research/9219
Copyright Owner and License
Authors
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.
Additional URL
https://doi.org/10.1145/3643832.3661402