Publication Type
Journal Article
Version
publishedVersion
Publication Date
2-2019
Abstract
Training large, complex machine learning models such as deep neural networks with big data requires powerful computing clusters, which are costly to acquire, use and maintain. As a result, many machine learning researchers turn to cloud computing services for on-demand and elastic resource provisioning. Two issues have arisen from this trend: (1) if not configured properly, training models on cloud-based clusters can incur significant cost and time, and (2) many machine learning researchers focus on model and algorithm development, so they may not have the time or skills to deal with system setup, resource selection and configuration. In this work, we propose and implement FC2: a system for fast, convenient and cost-effective distributed machine learning over public cloud resources. Central to the effectiveness of FC2 is its ability to recommend an appropriate resource configuration, in terms of cost and execution time, for a given model training task. Our approach differs from previous work in that it does not require manual analysis of the code and dataset of the training task in advance. The recommended resource configuration can then be deployed and managed automatically by FC2 until the training task is completed. We have conducted extensive experiments with an implementation of FC2, using real-world deep neural network models and datasets. The results demonstrate the effectiveness of our approach, which can produce cost savings of up to 80% while maintaining training performance similar to that of much more expensive resource configurations.
Keywords
Distributed machine learning, Cloud-based clusters, Resource recommendation, Cluster deployment
Discipline
Computer Engineering | Software Engineering
Research Areas
Software and Cyber-Physical Systems
Publication
Cluster Computing
Volume
22
Issue
4
First Page
1299
Last Page
1315
ISSN
1386-7857
Identifier
10.1007/s10586-019-02912-6
Publisher
Springer (part of Springer Nature): Springer Open Choice Hybrid Journals
Citation
TA, Nguyen Binh Duong. FC2: Cloud-based cluster provisioning for distributed machine learning. (2019). Cluster Computing. 22(4), 1299-1315.
Available at: https://ink.library.smu.edu.sg/sis_research/4763
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
https://doi.org/10.1007%2Fs10586-019-02912-6