Research Collection School Of Computing and Information Systems

Timed dataflow: Reducing communication overhead for distributed machine learning systems

Peng SUN
Yonggang WEN
Nguyen Binh Duong TA, Singapore Management University
Shengen YAN

Publication Type

Conference Proceeding Article

Publication Date

12-2016

Abstract

Many distributed machine learning (ML) systems exhibit high communication overhead when dealing with big data sets. Our investigations showed that popular distributed ML systems could spend about an order of magnitude more time on network communication than on computation when training ML models containing millions of parameters. This overhead is mainly caused by two operations: pulling parameters and pushing gradients. In this paper, we propose an approach called Timed Dataflow (TDF) that addresses this problem by reducing network traffic using three techniques: a timed parameter storage system, a hybrid parameter filter, and a hybrid gradient filter. In particular, the timed parameter storage system and the hybrid parameter filter enable servers to discard unchanged parameters during the pull operation, and the hybrid gradient filter allows servers to drop gradients selectively during the push operation. As a result, TDF can reduce network traffic and communication time significantly. Extensive performance evaluations on a real testbed showed that TDF could reduce network traffic by up to 77% for the pull operation and up to 79% for the push operation. Consequently, TDF could speed up model training by a factor of up to 4 without sacrificing much accuracy for some popular ML models, compared to systems not using TDF.
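The abstract does not spell out how unchanged parameters are identified during a pull. As a rough illustration only, the sketch below assumes a server that stamps each parameter with a logical clock and answers a pull with just the entries changed since the caller's last pull. The class `TimedParameterStore`, its methods, and the logical-clock scheme are hypothetical names and choices made for this example, not the paper's design.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class TimedParameterStore:
    """Toy server-side store (hypothetical): each parameter carries the
    logical time of its last change."""
    clock: int = 0
    values: Dict[str, float] = field(default_factory=dict)
    updated_at: Dict[str, int] = field(default_factory=dict)

    def update(self, new_values: Dict[str, float]) -> None:
        """Advance the clock and stamp only parameters whose value changed."""
        self.clock += 1
        for key, value in new_values.items():
            if self.values.get(key) != value:
                self.values[key] = value
                self.updated_at[key] = self.clock

    def pull(self, since: int) -> Tuple[Dict[str, float], int]:
        """Return parameters changed after logical time `since`, plus the
        current clock so the caller can pass it to its next pull."""
        changed = {
            key: value
            for key, value in self.values.items()
            if self.updated_at[key] > since
        }
        return changed, self.clock


store = TimedParameterStore()
store.update({"w1": 0.5, "w2": -0.2})
full, t1 = store.pull(since=0)           # first pull: every parameter is sent
store.update({"w1": 0.5, "w2": -0.3})    # only w2 actually changes
delta, t2 = store.pull(since=t1)         # second pull: w1 is skipped
```

In this toy run the second pull transfers one parameter instead of two, which is the kind of saving the abstract attributes to discarding unchanged parameters.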

Discipline

Computer and Systems Architecture | Software Engineering

Research Areas

Software and Cyber-Physical Systems

Publication

Proceedings of the 22nd International Conference on Parallel and Distributed Systems (ICPADS): 2016 IEEE, Wuhan, China, December 13-16

ISSN

1521-9097

Identifier

10.1109/ICPADS.2016.0146

Publisher

IEEE

City or Country

Wuhan, China

Citation

SUN, Peng; WEN, Yonggang; TA, Nguyen Binh Duong; and YAN, Shengen. Timed dataflow: Reducing communication overhead for distributed machine learning systems. (2016). Proceedings of the 22nd International Conference on Parallel and Distributed Systems (ICPADS): 2016 IEEE, Wuhan, China, December 13-16.
Available at: https://ink.library.smu.edu.sg/sis_research/4834

Additional URL

https://doi.org/10.1109/ICPADS.2016.0146

