Publication Type
Journal Article
Version
acceptedVersion
Publication Date
7-2022
Abstract
Supporting real-time, on-device execution of multi-modal referring instruction comprehension models is an important challenge to be tackled in embodied Human-Robot Interaction. However, state-of-the-art deep learning models are resource-intensive and unsuitable for real-time execution on embedded devices. While model compression can achieve a reduction in computational resources up to a certain point, further optimizations result in a severe drop in accuracy. To minimize this loss in accuracy, we propose the COSM2IC framework, with a lightweight Task Complexity Predictor, that uses multiple sensor inputs to assess the instructional complexity and thereby dynamically switch between a set of models of varying computational intensity such that computationally less demanding models are invoked whenever possible. To demonstrate the benefits of COSM2IC, we utilize a representative human-robot collaborative “table-top target acquisition” task to curate a new multi-modal instruction dataset where a human issues instructions in a natural manner using a combination of visual, verbal, and gestural (pointing) cues. We show that COSM2IC achieves a 3-fold reduction in comprehension latency when compared to a baseline DNN model while suffering an accuracy loss of only ∼5%. When compared to state-of-the-art model compression methods, COSM2IC is able to achieve a further 30% reduction in latency and energy consumption for a comparable performance.
Keywords
Deep Learning for Visual Perception, Data Sets for Robotic Vision, Embedded Systems for Robotic and Automation, Human-Robot Collaboration, RGB-D Perception
Discipline
Artificial Intelligence and Robotics | Databases and Information Systems
Research Areas
Data Science and Engineering
Publication
IEEE Robotics and Automation Letters
Volume
7
Issue
4
First Page
10697
Last Page
10704
ISSN
2377-3766
Identifier
10.1109/LRA.2022.3194683
Publisher
Institute of Electrical and Electronics Engineers
Citation
WEERAKOON MUDIYANSELAGE DULANGA KAVEESHA WEERAKOON; SUBBARAJU, Vigneshwaran; TRAN, Minh Anh Tuan; and MISRA, Archan.
COSM2IC: Optimizing real-time multi-modal instruction comprehension. (2022). IEEE Robotics and Automation Letters. 7, (4), 10697-10704.
Available at: https://ink.library.smu.edu.sg/sis_research/7618
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
https://doi.org/10.1109/LRA.2022.3194683