Publication Type

PhD Dissertation

Version

publishedVersion

Publication Date

11-2025

Abstract

Artificial Intelligence of Things (AIoT) technologies have ushered in exciting new advances in intelligent sensing, perception, and actuation for many real-world cyber-physical systems (CPS) applications. These technologies have had a transformative impact in domains such as large-scale video surveillance, autonomous transportation and robotics, precision healthcare, and industrial automation. In these applications, sensors and actuators are often collocated with processing nodes, and such nodes are typically interconnected via wireless networks. Vision-based machine intelligence, exemplified by tasks such as object detection, object tracking, and activity analysis, is a common enabler of such CPS applications. Efficient execution of Deep Neural Network (DNN) model inference for such machine vision perception tasks on resource-constrained edge devices is a critical need, especially as wirelessly transmitting high-bandwidth data streams from large numbers of geographically co-located sensors to a centralized cloud infrastructure is often impractical. Enhancing the computational efficiency of such edge-based DNN inference is especially challenging, given that DNNs have very high computational complexity and smaller, lightweight DNN models typically sacrifice perception accuracy to achieve adequate processing throughput.
This thesis explores a paradigm of “collaborative DNN inference” among peer IoT nodes (with relatively modest computational capabilities) that are collectively engaged in vision perception tasks, as a means of overcoming this accuracy-vs.-throughput tradeoff. Under this paradigm, multiple peer IoT nodes are collectively responsible for monitoring a sensing field and exhibit potentially time-varying overlaps in their fields-of-view (FoV). The thesis explores multiple collaborative DNN inference approaches that exploit this phenomenon of FoV overlap, which occurs either serendipitously, due to the placement of the IoT nodes, or intentionally, due to sensor steering actions performed by such nodes. Notably, such FoV overlap can be observed in both unimodal (RGB vision) and multimodal (RGB vision and LiDAR) sensor deployments, with some nodes also potentially being mobile and thus continually changing their own FoV relative to the global sensing field coordinates. The core idea of such collaborative inference is to have one or more IoT nodes share the intermediate DNN states (or suitably compact summaries of such states) produced during inference execution. These DNN states or summaries then serve as hints that help one or more peer nodes with overlapping FoVs enhance the accuracy of their own DNN-based inference, or that control the collective behavior of these IoT nodes. I call this approach deep fusion, in contrast to early fusion, which combines multiple raw sensor feeds, and late fusion, which combines the inferred outputs of such DNN models. I demonstrate how peer sharing can help mitigate the accuracy drawbacks of lightweight, edge-compatible DNN models, while also minimizing the system-level communication and energy overheads.
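
To make the distinction among the three fusion points concrete, the following Python sketch contrasts them in a toy two-node pipeline; the network, tensor shapes, and the additive injection operator are all illustrative assumptions, not the mechanisms defined in the thesis.

    import torch
    import torch.nn as nn

    # A toy detector split into a feature backbone and a prediction head.
    backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
    head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 4))

    def early_fusion(frame_a, frame_b):
        # Early fusion: combine the raw sensor feeds, then run one inference.
        fused_raw = torch.maximum(frame_a, frame_b)   # illustrative combiner
        return head(backbone(fused_raw))

    def late_fusion(frame_a, frame_b):
        # Late fusion: run full inference per node, then merge the outputs.
        out_a, out_b = head(backbone(frame_a)), head(backbone(frame_b))
        return (out_a + out_b) / 2                    # e.g., score averaging

    def deep_fusion(frame_a, peer_summary):
        # Deep fusion: a peer shares a compact summary of its *intermediate*
        # DNN state; the local node injects it mid-pipeline as a hint.
        # peer_summary is assumed pre-aligned to the local feature shape.
        feat_a = backbone(frame_a)
        return head(feat_a + peer_summary)            # illustrative injection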

In the first thrust of this work, I focus on developing the core mechanisms of such intermediate DNN state sharing, applied to a deployment scenario consisting of stationary, RGB camera-equipped IoT nodes with fixed FoVs that collectively monitor a shared physical space (the sensing field). The proposed approach, called ComAI, introduces techniques to extract salient summaries from the intermediate feature maps of a Convolutional Neural Network (CNN)-based DNN instance. These summaries are then shared with peers to improve the fidelity of each peer’s inference, without requiring any deployment-specific training of AI models. The thesis demonstrates how ComAI can help smaller, edge-friendly DNNs achieve perception accuracy, for a canonical object detection task, that is competitive with much larger DNN models, while achieving significantly higher processing throughput. Moreover, the network overheads associated with such a state-sharing paradigm can be further reduced by carefully curating both the summaries to be shared and the subset of peers that receive them.
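
As a rough illustration of this kind of summarization, the sketch below compresses a CNN feature map into a handful of salient-cell coordinates and re-injects them at a peer. The top-k scheme, the boost parameter, and the assumption that the two feature maps are spatially pre-aligned are all simplifications, not ComAI’s actual design.

    import torch

    def extract_summary(feature_map, k=8):
        # Compress a CNN feature map (C, H, W) into a compact, peer-shareable
        # summary: the k spatially most salient cells plus their activation
        # energy. (An illustrative scheme, not ComAI's exact summarization.)
        c, h, w = feature_map.shape
        saliency = feature_map.abs().sum(dim=0)          # (H, W) energy map
        vals, idx = torch.topk(saliency.flatten(), k)
        cells = torch.stack((idx // w, idx % w), dim=1)  # (k, 2) row/col
        return cells, vals                               # tens of bytes per frame

    def apply_summary(local_map, cells, vals, boost=0.5):
        # Peer hint: amplify the local feature map wherever a peer with an
        # overlapping FoV saw strong activations, lifting weak detections.
        # Assumes the two maps are spatially pre-aligned (a simplification).
        for (r, c), v in zip(cells.tolist(), vals.tolist()):
            local_map[:, r, c] += boost * v
        return local_map

Because only k coordinate/value pairs cross the network per frame, the communication cost of such a summary is orders of magnitude below shipping raw frames or full feature maps, which is the system-level point the abstract makes.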

The subsequent thrust of this thesis demonstrates how the benefit of such collaborative DNN inference can be further enhanced by combining deep fusion with active adjustment of the sensor nodes’ FoVs. To illustrate this benefit, I consider a deployment scenario of RGB camera-equipped IoT nodes that collectively monitor a shared physical space, but with cameras that can now be steered. The proposed approach, called SteerCam, utilizes a novel Reinforcement Learning (RL)-based technique to jointly adapt both the FoVs and the collaborative inference processing pipelines of multiple networked cameras. I demonstrate that such joint RL-based optimization of {steering, collaboration} achieves superior performance compared to both (a) collaborative inferencing over static cameras and (b) non-collaborative inferencing using steerable cameras. Furthermore, the maximum accuracy gain is achieved, somewhat counterintuitively, when the cameras steer to intelligently create partial FoV overlaps rather than partitioning the sensing field.
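
One way to picture this joint optimization is as RL over a combined discrete action space, as in the tabular Q-learning toy below; the pan discretization, the peer-set actions, and the reward structure are illustrative assumptions, and SteerCam’s actual formulation is defined in the thesis.

    import random
    from collections import defaultdict
    from itertools import product

    # Illustrative joint action space: at each step, a camera picks a pan
    # adjustment AND the set of peers to request feature summaries from.
    PAN_STEPS = (-10, 0, +10)             # degrees (assumed discretization)
    PEER_SETS = ((), (1,), (2,), (1, 2))  # assumed collaboration choices
    ACTIONS = list(product(PAN_STEPS, PEER_SETS))

    q_table = defaultdict(float)          # (state, action) -> value
    ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1

    def select_action(state):
        # Epsilon-greedy over the *joint* {steering, collaboration} action.
        if random.random() < EPS:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: q_table[(state, a)])

    def update(state, action, reward, next_state):
        # One tabular Q-learning step; the reward would trade detection
        # accuracy off against the bandwidth cost of the chosen peer set.
        best_next = max(q_table[(next_state, a)] for a in ACTIONS)
        q_table[(state, action)] += ALPHA * (
            reward + GAMMA * best_next - q_table[(state, action)])

Coupling the two decisions in one action space is what lets the learner discover the partial-overlap behavior noted above: an overlap is only worth creating when a collaboration action can exploit it.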

In the final thrust of this thesis, I expand the notion of such collaborative inferencing to include sensors of multiple modalities (specifically, RGB and LiDAR sensors). In addition, I consider deployments where some of the nodes are mobile (for example, surveillance robots fitted with LiDAR sensors), resulting in continual changes to their FoVs. To help develop and test the enhanced inferencing mechanisms for such multimodal scenarios, I first develop MultiSense-RL-Arena, an integrated simulation framework that allows the programmable generation of synthetic sensor data corresponding to such multimodal sensor deployments and the training of RL-based collaborative inferencing techniques on such data. The subsequently proposed multimodal collaborative sensing approach, called FusionBridge, while still under development, utilizes 3D features extracted by edge-compatible DNNs processing LiDAR data to enhance the perception accuracy of 2D object detection models.
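
This bridging step can be pictured as projecting LiDAR-derived 3D hints onto the 2D image plane and boosting overlapping detections, as in the sketch below; the plain pinhole projection, the hypothetical boost_boxes rule, and its weight w are assumptions rather than FusionBridge’s finalized design, which the abstract notes is still under development.

    import numpy as np

    def project_lidar_hints(points_xyz, scores, K):
        # Project 3D feature locations (N, 3), given in the camera frame,
        # onto the image plane via intrinsics K (a minimal pinhole model).
        in_front = points_xyz[:, 2] > 0        # keep points ahead of camera
        pts, s = points_xyz[in_front], scores[in_front]
        uv = (K @ pts.T).T                     # homogeneous pixel coords
        return uv[:, :2] / uv[:, 2:3], s       # (u, v) hints + confidence

    def boost_boxes(boxes, confs, uv, s, w=0.2):
        # Hypothetical fusion rule: raise the confidence of any 2D detection
        # whose box contains projected 3D hints from the LiDAR node.
        for i, (x1, y1, x2, y2) in enumerate(boxes):
            inside = ((uv[:, 0] >= x1) & (uv[:, 0] <= x2) &
                      (uv[:, 1] >= y1) & (uv[:, 1] <= y2))
            if inside.any():
                confs[i] = min(1.0, confs[i] + w * float(s[inside].max()))
        return confs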

These works collectively establish how the performance of lightweight, edge-based visual DNN perception can be significantly enhanced, with minimal system overhead, by deep fusion, i.e., by judiciously exploiting the correlations in the latent DNN states derived from sensors with partially overlapping FoVs. ComAI boosts recall by 20–50% across heterogeneous camera deployments compared to non-collaborative object detectors, while incurring negligible communication and computation costs. SteerCam combines steering with such selective deep fusion among lightweight SSD models to achieve F1-scores within 5–6% of the heavyweight YOLOv8L, while attaining 3–5× higher throughput and 50% lower network overhead than prior greedy-collaboration methods. SteerCam also achieves a 9% F1-score improvement over state-of-the-art steering-only methods, and illustrates how operators can utilize the PTZ (pan-tilt-zoom) capabilities of cameras for smarter surveillance (e.g., of indoor or outdoor events) that adapts to dynamically changing environmental conditions and object movement behavior. FusionBridge (a multimodal deep fusion method) achieves up to a 57% F1-score improvement over single-modality (lightweight 2D object detection) baselines, with only 15% higher latency and a very modest 0.4 KB per-frame network overhead. This evidence shows that distributed edge nodes can collectively achieve near state-of-the-art vision accuracy (typically improving accuracy by ≈10–15% overall, and by up to ≈57% in challenging conditions), without relying on cloud offloading or heavy retraining. It also enables a new form of visual surveillance in public spaces, such as shopping malls, that harnesses the heterogeneous sensing capabilities of static RGB cameras and emerging mobile service robots.

Degree Awarded

PhD in Computer Science

Discipline

Artificial Intelligence and Robotics

Supervisor(s)

MISRA, Archan

First Page

1

Last Page

212

Publisher

Singapore Management University

City or Country

Singapore

Copyright Owner and License

Author
