Publication Type

Conference Proceeding Article

Version

acceptedVersion

Publication Date

5-2017

Abstract

Many cluster management systems (CMSs) have been proposed to share a single cluster with multiple distributed computing systems. However, none of the existing approaches can handle distributed machine learning (ML) workloads given the following criteria: high resource utilization, fair resource allocation and low sharing overhead. To solve this problem, we propose a new CMS named Dorm, incorporating a dynamically-partitioned cluster management mechanism and a utilization-fairness optimizer. Specifically, Dorm uses the container-based virtualization technique to partition a cluster, runs one application per partition, and can dynamically resize each partition at application runtime for resource efficiency and fairness. Each application directly launches its tasks on the assigned partition without petitioning for resources frequently, so Dorm imposes flat sharing overhead. Extensive performance evaluations showed that Dorm could simultaneously increase the resource utilization by a factor of up to 2.32, reduce the fairness loss by a factor of up to 1.52, and speed up popular distributed ML applications by a factor of up to 2.72, compared to existing approaches. Dorm’s sharing overhead is less than 5% in most cases.

Index Terms: Cluster Resource Management, Distributed Machine Learning, Fairness

Discipline

Artificial Intelligence and Robotics | Software Engineering

Research Areas

Software and Cyber-Physical Systems

Publication

Proceedings of the 2017 IEEE International Conference on Smart Computing (SMARTCOMP), May 29-31

First Page

1

Last Page

6

Identifier

10.1109/SMARTCOMP.2017.7947053

Publisher

IEEE

City or Country

Hong Kong

Additional URL

https://doi.org/10.1109/SMARTCOMP.2017.7947053
