Saturday, 13 February 2021

How to run distributed data parallel (DDP) training with PyTorch?

step 1: generate an SSH key pair on the main machine

ssh-keygen -t rsa


step 2: copy the id_rsa and id_rsa.pub files from ~/.ssh on the main machine to ~/.ssh on each child (remote) machine, so every machine shares the same key pair
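This copy can be done with scp. A minimal sketch, assuming the same USER on every machine and that ~/.ssh already exists on the children (the IP names are placeholders):

```shell
# Copy the key pair from the main machine to each child (placeholder IPs).
# Sharing one key pair lets every machine authenticate to every other.
scp ~/.ssh/id_rsa ~/.ssh/id_rsa.pub USER@CHILD1_IP:~/.ssh/
scp ~/.ssh/id_rsa ~/.ssh/id_rsa.pub USER@CHILD2_IP:~/.ssh/

# On each child, give the files the permissions sshd expects:
ssh USER@CHILD1_IP "chmod 700 ~/.ssh && chmod 600 ~/.ssh/id_rsa"
ssh USER@CHILD2_IP "chmod 700 ~/.ssh && chmod 600 ~/.ssh/id_rsa"
```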


step 3: set up passwordless SSH from the main machine to each child machine, and from each child machine back to the main machine

for example,

(on the main machine): 

cat ~/.ssh/id_rsa.pub | ssh USER@CHILD1_IP "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
cat ~/.ssh/id_rsa.pub | ssh USER@CHILD2_IP "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"



(on each child machine): 

cat ~/.ssh/id_rsa.pub | ssh USER@MAIN_IP "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"


step 4: test the SSH connection: open a terminal and type

ssh USER@192.168.x.x (it should log in without prompting for a password)

otherwise, use "ssh -vvv USER@192.168.x.x" to debug the connection
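If a machine uses a non-default user or key, an entry in ~/.ssh/config on each machine saves typing. A sketch (the host alias, IP, and user are placeholders):

```
# ~/.ssh/config
Host child1
    HostName 192.168.1.11
    User USER
    IdentityFile ~/.ssh/id_rsa
```

After this, "ssh child1" is enough.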


step 5: install the same CUDA, cuDNN, and NCCL versions, and the same conda environment, on all machines
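To compare versions across machines, you can print what PyTorch was built with on every node (a sketch, assuming PyTorch is installed in the active conda environment):

```shell
# Run on every node inside the conda environment and compare the output.
python - <<'EOF'
import torch
print("torch :", torch.__version__)
print("cuda  :", torch.version.cuda)              # None on CPU-only builds
print("cudnn :", torch.backends.cudnn.version())  # None on CPU-only builds
EOF
```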


step 6: run the training code on the main machine, then on each child machine. If everything is working, you should see GPU memory and utilization change in nvidia-smi on all machines
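One common way to start the processes is PyTorch's torch.distributed.launch helper. A sketch: the stub script below is a stand-in for a real DDP training script, and the GPU count, port, MAIN_IP, and train.py name in the two-node commands are placeholders:

```shell
# Tiny stand-in training script, just to smoke-test the launcher locally
# (replace with your real DDP script).
cat > /tmp/train_stub.py <<'EOF'
import torch.distributed as dist
dist.init_process_group(backend="gloo")   # use "nccl" on real GPU nodes
print("rank", dist.get_rank(), "of", dist.get_world_size())
dist.destroy_process_group()
EOF

# Single-node smoke test with 1 process:
python -m torch.distributed.launch --nproc_per_node=1 /tmp/train_stub.py

# Two-node run: the same command on both machines, only --node_rank differs.
# On the main machine:
#   python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 \
#       --node_rank=0 --master_addr=MAIN_IP --master_port=29500 train.py
# On the child machine, change --node_rank=0 to --node_rank=1.
```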

reference: https://cv.gluon.ai/build/examples_torch_action_recognition/ddp_pytorch.html
