HeterPS: Distributed deep learning with reinforcement learning based scheduling in heterogeneous environments
[Abstract] Deep neural networks (DNNs) exploit many layers and a large number of parameters to achieve excellent performance. Training a DNN model generally involves large-scale input data with many sparse features, which incurs high Input/Output (IO) cost, while some layers are compute-intensive. Training therefore generally exploits distributed computing resources to reduce training time. Although heterogeneous computing resources, e.g., CPUs and multiple types of GPUs, are available for distributed training, scheduling the layers onto diverse computing resources remains critical to the training process. To efficiently train a DNN model using heterogeneous computing resources, we propose a distributed framework, the Heterogeneous Parameter Server (HeterPS), composed of a distributed architecture and a Reinforcement Learning (RL)-based scheduling method. HeterPS has three advantages over existing frameworks. First, HeterPS enables efficient training of diverse workloads on heterogeneous computing resources. Second, HeterPS exploits an RL-based method to efficiently schedule the workload of each layer to appropriate computing resources, minimizing the cost while satisfying throughput constraints. Third, HeterPS manages data storage and data communication among distributed computing resources. We carry out extensive experiments to show that HeterPS significantly outperforms state-of-the-art approaches in terms of throughput (14.5 times higher) and monetary cost (312.3% smaller). © 2023 Elsevier B.V. All rights reserved.
[Publication date] 2023-11-01