Research Collection School Of Computing and Information Systems

Stitching weight-shared deep neural networks for efficient multitask inference on GPU

Zeyu WANG
Xiaoxi HE
Zimu ZHOU, Singapore Management University
Xu WANG
Qiang MA
Xin MIAO
Zhuo LIU
Lothar THIELE
Zheng YANG

Publication Type

Conference Proceeding Article

Version

acceptedVersion

Publication Date

10-2022

Abstract

Intelligent personal and home applications demand multiple deep neural networks (DNNs) running on resource-constrained platforms for compound inference tasks, known as multitask inference. To fit multiple DNNs into low-resource devices, emerging techniques resort to weight sharing among DNNs to reduce their storage. However, such reduction in storage fails to translate into efficient execution on common accelerators such as GPUs. Most DNN graph rewriters are blind to multi-DNN optimization, while GPU vendors provide inefficient APIs for parallel multi-DNN execution at runtime. A few prior graph rewriters suggest cross-model graph fusion for low-latency multi-DNN execution. Yet they require duplication of the shared weights, erasing the memory saving of weight-shared DNNs. In this paper, we propose MTS, a novel graph rewriter for efficient multitask inference with weight-shared DNNs. MTS adopts a model stitching algorithm which outputs a single computational graph for weight-shared DNNs without duplicating any shared weight. MTS also utilizes a model grouping strategy to avoid overwhelming the GPU when co-running tens of DNNs. Extensive experiments show that MTS accelerates multitask inference by up to 6.0× compared to sequentially executing multiple weight-shared DNNs. MTS also yields up to 2.5× lower latency and 3.7× less memory usage compared with NETFUSE, a state-of-the-art multi-DNN graph rewriter.

Keywords

Deep Neural Networks, Multitask Inference, Model Acceleration

Discipline

OS and Networks | Software Engineering

Research Areas

Software and Cyber-Physical Systems

Publication

Proceedings of the 2022 19th Annual IEEE International Conference on Sensing, Communication, and Networking

First Page

145

Last Page

153

ISBN

9781665486446

Identifier

10.1109/SECON55815.2022.9918563

Publisher

IEEE

City or Country

Stockholm, Sweden

Citation

WANG, Zeyu; HE, Xiaoxi; ZHOU, Zimu; WANG, Xu; MA, Qiang; MIAO, Xin; LIU, Zhuo; THIELE, Lothar; and YANG, Zheng. Stitching weight-shared deep neural networks for efficient multitask inference on GPU. (2022). Proceedings of the 2022 19th Annual IEEE International Conference on Sensing, Communication, and Networking. 145-153.
Available at: https://ink.library.smu.edu.sg/sis_research/7486

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Additional URL

http://doi.org/10.1109/SECON55815.2022.9918563

Included in

OS and Networks Commons, Software Engineering Commons
