Goku Coding Experience: How to run distributed data parallel (DDP) training using PyTorch?

Saturday, 13 February 2021

How to run distributed data parallel (DDP) training using PyTorch?

step 1: generate an SSH key pair on the main machine

ssh-keygen -t rsa

step 2: copy id_rsa and id_rsa.pub from ~/.ssh on the main machine to ~/.ssh on each child (remote) machine

step 3: set up the SSH connection from the main machine to each child machine, and from each child machine back to the main machine

example:

(on the main machine):

cat ~/.ssh/id_rsa.pub | ssh USER@CHILD1_IP "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
cat ~/.ssh/id_rsa.pub | ssh USER@CHILD2_IP "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"

(on each child machine):

cat ~/.ssh/id_rsa.pub | ssh USER@MAIN_IP "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"

step 4: test the SSH connection: open a terminal and type

ssh 192.168.x.x (it should connect to that IP address without asking for a password)

otherwise, use "ssh -vvv ip_address" to see where the connection fails
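
With more than one child machine, the manual test in step 4 can be scripted. The following is only a sketch, assuming the OpenSSH client is installed; the host names USER@CHILD1_IP and USER@CHILD2_IP are the placeholders from step 3, and the helper names are made up for illustration. BatchMode=yes makes ssh fail instead of prompting for a password, so a missing key shows up as a failure rather than a hang.

```python
import subprocess
from typing import List


def ssh_check_command(host: str, timeout_s: int = 5) -> List[str]:
    """Build an ssh command that fails instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",               # never ask for a password
        "-o", f"ConnectTimeout={timeout_s}",  # give up quickly on dead hosts
        host,
        "true",                               # trivial remote command
    ]


def can_connect(host: str) -> bool:
    """Return True if key-based login to `host` works without any prompt."""
    try:
        result = subprocess.run(
            ssh_check_command(host), capture_output=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        # ssh is not installed, or the connection attempt hung
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Placeholder hosts from step 3; replace with the real user and IPs.
    for host in ["USER@CHILD1_IP", "USER@CHILD2_IP"]:
        status = "ok" if can_connect(host) else "FAILED (try: ssh -vvv)"
        print(host, status)
```

Note that the first connection to a new host normally asks to confirm the host key; in batch mode that also counts as a failure, so connect once by hand first.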

step 5: install the same versions of CUDA, cuDNN and NCCL, and the same conda environment, on all machines
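
One way to confirm that the environments in step 5 really match is to print the versions PyTorch sees on each machine and compare them. This is a rough sketch, not part of the original recipe: the function names are invented for illustration, and it reports only what the installed PyTorch build was compiled against, not every library on the system.

```python
import json
import platform


def collect_versions() -> dict:
    """Collect the library versions visible to PyTorch on this machine."""
    info = {"python": platform.python_version()}
    try:
        import torch
        import torch.cuda.nccl
        import torch.distributed as dist
    except ImportError:
        info["torch"] = None  # PyTorch is not installed in this environment
        return info
    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda               # None on CPU-only builds
    info["cudnn"] = torch.backends.cudnn.version()  # None if cuDNN is missing
    if (torch.cuda.is_available() and dist.is_available()
            and dist.is_nccl_available()):
        info["nccl"] = str(torch.cuda.nccl.version())
    else:
        info["nccl"] = None
    return info


def mismatched_keys(reports: dict) -> list:
    """Given {machine_name: versions}, return the keys whose values differ."""
    keys = set().union(*(r.keys() for r in reports.values()))
    return sorted(
        k for k in keys
        if len({str(r.get(k)) for r in reports.values()}) > 1
    )


if __name__ == "__main__":
    # Run this on every machine and compare the printed JSON lines.
    print(json.dumps(collect_versions(), sort_keys=True))
```

Running it inside the conda environment on each machine and comparing the output by eye is usually enough; mismatched_keys is there for the case where the reports are gathered in one place.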

step 6: run the code on the main machine, then run the code on each child machine. If everything is OK, you should see GPU usage change in nvidia-smi on every machine
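
The post does not show "the code" itself, so here is a minimal sketch of what a DDP script for step 6 might look like. It makes several assumptions that are not from the original: one process and one GPU per machine, environment-variable rendezvous (MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK), port 29500, a toy linear model trained on random data, and a fallback to the gloo backend on machines without CUDA.

```python
"""Toy DDP script (hypothetical file name: ddp_sketch.py).

Assumed launch, one process per machine:
  main machine:  MASTER_ADDR=MAIN_IP MASTER_PORT=29500 WORLD_SIZE=2 RANK=0 python ddp_sketch.py
  child machine: MASTER_ADDR=MAIN_IP MASTER_PORT=29500 WORLD_SIZE=2 RANK=1 python ddp_sketch.py
"""
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def train(steps: int = 10) -> float:
    """Run a few DDP steps on random data and return the last loss value."""
    # Defaults are assumptions so the script also runs alone on one machine.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))

    use_cuda = torch.cuda.is_available()
    backend = "nccl" if use_cuda else "gloo"
    # Blocks until all WORLD_SIZE processes have connected to MASTER_ADDR.
    dist.init_process_group(
        backend=backend, init_method="env://", rank=rank, world_size=world_size
    )

    device = torch.device("cuda", 0) if use_cuda else torch.device("cpu")
    model = nn.Linear(16, 1).to(device)
    ddp_model = DDP(model, device_ids=[0] if use_cuda else None)
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    loss = torch.tensor(0.0)
    for _ in range(steps):
        inputs = torch.randn(32, 16, device=device)
        targets = torch.randn(32, 1, device=device)
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(inputs), targets)
        loss.backward()  # gradients are averaged across all processes here
        optimizer.step()

    dist.destroy_process_group()
    return loss.item()


if __name__ == "__main__":
    print("final loss:", train())
```

The process with RANK=0 hosts the rendezvous, which matches the order in step 6: start the main machine first, then the children. A real training script would also use a DistributedSampler so that each machine sees a different shard of the dataset.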

reference: https://cv.gluon.ai/build/examples_torch_action_recognition/ddp_pytorch.html
