An implementation of the parameter server (PS) framework [1] based on Remote Procedure Call (RPC) in PyTorch [2].

The figure below (from [3]) shows the PS-based architecture. The architecture consists of two logical entities: one or more PSs and multiple workers. The dataset is partitioned among the workers, and the PS maintains the model parameters. During training, each worker pulls the model parameters from the PS, computes gradients on a mini-batch drawn from its data partition, and pushes the gradients to the PS. The PS updates the model parameters with the workers' gradients according to a synchronization strategy and sends the updated parameters back to the workers. The pseudocode of this architecture is given in [1].

The code is based on torch.distributed.rpc [4]. It trains ResNet50 [5] on the Imagenette dataset [6], a subset of ImageNet [7], with one PS (rank=0) and 4 workers (rank=1,2,3,4).
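The snippet below is a minimal sketch of this push/pull loop on top of torch.distributed.rpc, not the repository's public-asgd.py: the tiny linear model, the random mini-batches, the default MASTER_ADDR/MASTER_PORT values, and the helper names (ParameterServer, pull_params, push_grads) are illustrative stand-ins for ResNet50, Imagenette, and the actual script.

```python
import argparse
import os

import torch
import torch.nn as nn
import torch.distributed.rpc as rpc

WORLD_SIZE = 5  # 1 PS (rank 0) + 4 workers (ranks 1-4)


class ParameterServer:
    """Holds the global model and applies gradients pushed by workers."""

    def __init__(self):
        self.model = nn.Linear(10, 2)  # stand-in for ResNet50
        self.optimizer = torch.optim.SGD(self.model.parameters(), lr=0.05)

    def pull(self):
        # Called (via RPC) by workers to fetch the current parameters.
        return [p.detach() for p in self.model.parameters()]

    def push(self, grads):
        # Called (via RPC) by workers; applies one update as soon as
        # any worker's gradients arrive (asynchronous strategy).
        for p, g in zip(self.model.parameters(), grads):
            p.grad = g
        self.optimizer.step()
        self.optimizer.zero_grad()


_ps = None  # lives only in the PS process (rank 0)


def _get_ps():
    global _ps
    if _ps is None:
        _ps = ParameterServer()
    return _ps


def pull_params():
    return _get_ps().pull()


def push_grads(grads):
    _get_ps().push(grads)


def run_worker(num_steps=10):
    model = nn.Linear(10, 2)  # same architecture as on the PS
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(num_steps):
        # 1. Pull the latest parameters from the PS.
        for p, new_p in zip(model.parameters(), rpc.rpc_sync("ps", pull_params)):
            with torch.no_grad():
                p.copy_(new_p)
        # 2. Compute gradients on a local mini-batch (random data here).
        x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
        loss = loss_fn(model(x), y)
        model.zero_grad()
        loss.backward()
        grads = [p.grad.detach() for p in model.parameters()]
        # 3. Push the gradients to the PS.
        rpc.rpc_sync("ps", push_grads, args=(grads,))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--rank", type=int, required=True)
    rank = parser.parse_args().rank
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    rpc.init_rpc("ps" if rank == 0 else f"worker{rank}",
                 rank=rank, world_size=WORLD_SIZE)
    if rank != 0:
        run_worker()
    rpc.shutdown()  # the PS keeps serving pull/push requests until all workers finish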
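```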
The code was developed and tested under the following configuration.
- Server: a g3.16xlarge instance with 4 NVIDIA Tesla M60 GPUs on AWS EC2
- System: Ubuntu 18.04
- Software: python==3.6.9, torch==1.9.0, torchvision==0.10.0
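A quick way to confirm the versions and GPU count above (a convenience check only, not part of the repository):

```python
# Sanity-check the environment described above.
import torch
import torchvision

print(torch.__version__)          # expected: 1.9.0
print(torchvision.__version__)    # expected: 0.10.0
print(torch.cuda.device_count())  # expected: 4 on a g3.16xlarge (Tesla M60s)
```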
git clone https://github.com/xbfu/PyTorch-ParameterServer.git
cd PyTorch-ParameterServer
sudo sh install-dependencies.sh
cd PyTorch-ParameterServer
wget https://s3.amazonaws.com/fast-ai-imageclas/imagenette2.tgz
tar -zxf imagenette2.tgz
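The extracted imagenette2/ directory contains train/ and val/ splits in the standard ImageFolder layout, so it can be loaded as sketched below; the actual transforms and data partitioning used by public-asgd.py may differ.

```python
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
train_set = datasets.ImageFolder("imagenette2/train", transform=transform)
val_set = datasets.ImageFolder("imagenette2/val", transform=transform)
print(len(train_set), "train /", len(val_set), "val images,",
      len(train_set.classes), "classes")
```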
For the PS:
python public-asgd.py --rank=0
For the workers:
python public-asgd.py --rank=r
where r=1,2,3,4 is the rank of each worker.
| Sync Mode | Training Time (seconds) |
| --- | --- |
| Single | 858 |
| Sync | 533 |
| Async | 268 |
On one machine with multiple GPUs
For the PS:
python public-asgd.py --rank=0
For the workers:
python public-asgd.py --rank=r
where r=1,2,3,4 is the rank of each worker.
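One common way to give each worker its own GPU on a single multi-GPU machine is to derive the device from the worker's rank, as in the illustrative helper below; public-asgd.py may assign devices differently.

```python
import torch

def device_for_worker(rank):
    # Workers have ranks 1-4; map them to cuda:0-cuda:3 on a 4-GPU machine.
    return torch.device(f"cuda:{(rank - 1) % torch.cuda.device_count()}")

# e.g. the worker with rank 3 would train on cuda:2
model = torch.nn.Linear(10, 2).to(device_for_worker(3))
```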
On multiple machines
For the PS:
python public-asgd.py --rank=0 --master_addr=12.34.56.78
For the workers:
python public-asgd.py --rank=r --master_addr=12.34.56.78
where r=1,2,3,4 is the rank of each worker and 12.34.56.78 is the IP address of the PS.
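For reference, --master_addr typically feeds the MASTER_ADDR/MASTER_PORT environment variables that torch.distributed.rpc's default rendezvous reads on every node. The helper name and port below are illustrative, not necessarily what public-asgd.py uses.

```python
import os
import torch.distributed.rpc as rpc

def init_rpc_node(rank, world_size=5, master_addr="12.34.56.78", master_port="29500"):
    # Every node (PS and workers) must point at the same address/port.
    os.environ["MASTER_ADDR"] = master_addr  # IP address of the PS (rank 0)
    os.environ["MASTER_PORT"] = master_port
    name = "ps" if rank == 0 else f"worker{rank}"
    rpc.init_rpc(name, rank=rank, world_size=world_size)
```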
[1]. Li M, Andersen D G, Park J W, et al. Scaling distributed machine learning with the parameter server. In: 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). 2014: 583-598.
[2]. PyTorch. https://pytorch.org/.
[3]. Sergeev A, Del Balso M. Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799, 2018.
[4]. Distributed RPC Framework. https://pytorch.org/docs/1.9.0/rpc.html.
[5]. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016: 770-778.
[6]. Imagenette. https://github.com/fastai/imagenette.
[7]. ImageNet. https://image-net.org/.