Publication Type

Conference Proceeding Article

Version

publishedVersion

Publication Date

6-2024

Abstract

As pervasive devices expand into complex collaborative tasks such as cognitive assistants and interactive AR/VR companions, they are equipped with a myriad of sensors that facilitate natural interactions, such as voice commands. Spatio-Temporal Video Grounding (STVG), the task of identifying the target object in the field of view referred to by a language instruction, is a key capability for such systems. However, current STVG models tend to be resource-intensive, relying on multiple cross-attentional transformers applied to each video frame. This results in runtime complexity that increases linearly with video length. Furthermore, deploying these models on mobile devices while maintaining low latency poses additional challenges. Hence, this paper explores the latency and energy requirements of implementing STVG models on a pervasive device.

Keywords

Human-AI Collaboration, Spatio-Temporal Video Grounding

Discipline

Computer Engineering

Research Areas

Intelligent Systems and Optimization; Software and Cyber-Physical Systems

Areas of Excellence

Digital transformation

Publication

MOBISYS '24: Proceedings of the 22nd Annual International Conference on Mobile Systems, Applications and Services, Minato-ku, Tokyo, Japan, June 3-7, 2024

First Page

648

Last Page

649

ISBN

9798400705816

Identifier

https://doi.org/10.1145/3643832.3661402

Publisher

ACM

City or Country

New York

Copyright Owner and License

Authors

Creative Commons License

Creative Commons Attribution 4.0 International License
This work is licensed under a Creative Commons Attribution 4.0 International License.

Additional URL

https://doi.org/10.1145/3643832.3661402
