step 1: generate an SSH key pair on the main machine
ssh-keygen -t rsa
step 2: copy id_rsa and id_rsa.pub from ~/.ssh on the main machine to ~/.ssh on each child (remote) machine, so that every machine uses the same key pair
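Copied key files often end up with permissions that are too open, and OpenSSH then refuses to use the private key. A minimal sketch of tightening them, assuming the default ~/.ssh layout; the helper name fix_ssh_perms is made up for this example:

```shell
# Hypothetical helper: restrict an .ssh directory to its owner.
# Usage: fix_ssh_perms [DIR]   (DIR defaults to ~/.ssh)
fix_ssh_perms() {
    dir=${1:-"$HOME/.ssh"}
    chmod 700 "$dir"
    # Private key and authorized_keys: owner read/write only.
    for f in "$dir/id_rsa" "$dir/authorized_keys"; do
        if [ -f "$f" ]; then
            chmod 600 "$f"
        fi
    done
    # The public key may stay world-readable.
    if [ -f "$dir/id_rsa.pub" ]; then
        chmod 644 "$dir/id_rsa.pub"
    fi
}
```

Run fix_ssh_perms once on every machine after copying the keys.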
step 3: set up the ssh connection from the main machine to each child machine, and from each child machine back to the main machine
e.g.
(on the main machine):
cat ~/.ssh/id_rsa.pub | ssh USER@CHILD1_IP "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
cat ~/.ssh/id_rsa.pub | ssh USER@CHILD2_IP "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
(on each child machine):
cat ~/.ssh/id_rsa.pub | ssh USER@MAIN_IP "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
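With more than a couple of child machines, the commands above can be wrapped in a loop. A minimal sketch: the helper name authorize_key and the DRY_RUN switch are made up for this example, and USER@CHILD1_IP stays a placeholder as above:

```shell
# Hypothetical helper: append a public key to authorized_keys on each host.
# Usage: authorize_key PUBKEY_FILE USER@HOST [USER@HOST ...]
# With DRY_RUN=1 it only prints the targets and contacts nothing.
authorize_key() {
    pubkey=$1
    shift
    for host in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "would append $pubkey to $host:~/.ssh/authorized_keys"
        else
            # Same command as in the manual steps, one host at a time.
            ssh "$host" "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys" < "$pubkey"
        fi
    done
}
```

For example: authorize_key ~/.ssh/id_rsa.pub USER@CHILD1_IP USER@CHILD2_IP. Where it is installed, OpenSSH's ssh-copy-id does the same job for a single host.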
step 4: test the ssh connection: open a terminal and type
ssh 192.168.x.x (it should connect to that IP address without asking for a password)
if it does not, run "ssh -vvv ip_address" to see where the connection fails
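The same test can be run against several machines at once. A minimal sketch; the helper name check_ssh is made up for this example, and the 5 second timeout is an arbitrary choice:

```shell
# Hypothetical helper: report whether key-based login to each host works.
# BatchMode=yes makes ssh fail instead of prompting for a password,
# so a working password login is not mistaken for a working key.
check_ssh() {
    failed=0
    for host in "$@"; do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
            echo "OK   $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done
    return "$failed"
}
```

For example: check_ssh USER@CHILD1_IP USER@CHILD2_IP. Any host reported as FAIL is a candidate for the "ssh -vvv" check.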
step 5: install the same versions of CUDA, cuDNN and NCCL, and the same conda environment, on all machines
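One way to compare the installed versions is to print them on every machine and check that the lines are identical. A minimal sketch, assuming PyTorch is installed in the conda environment (as in the referenced tutorial); the helper names stack_versions and all_same are made up for this example, and the versions shown are the ones PyTorch reports, which may differ from other copies installed system-wide:

```shell
# Hypothetical helper: print one line describing the GPU software stack
# as seen by PyTorch. Assumes a CUDA-enabled PyTorch in the active environment.
stack_versions() {
    python -c 'import torch
print("torch", torch.__version__,
      "cuda", torch.version.cuda,
      "cudnn", torch.backends.cudnn.version(),
      "nccl", torch.cuda.nccl.version())'
}

# Hypothetical helper: succeed only if every line on stdin is identical.
all_same() {
    [ "$(sort -u | wc -l | tr -d ' ')" = "1" ]
}
```

Run stack_versions on each machine, collect the output lines in one file (say versions.txt), and then: all_same < versions.txt && echo "versions match".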
step 6: run the code on the main machine, then run the code on each child machine. If everything is OK, nvidia-smi should show GPU utilization and memory usage changing on every machine
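The post does not say how the code is launched. If PyTorch's torchrun is used (an assumption), every machine runs the same command and only the node rank differs. A minimal sketch; the helper name launch_cmd, the script name train.py and port 29500 are placeholders, not taken from the post:

```shell
# Hypothetical helper: print the torchrun command for one machine.
# Usage: launch_cmd NODE_RANK NNODES GPUS_PER_NODE MAIN_IP
# train.py and port 29500 are placeholders for illustration only.
launch_cmd() {
    echo "torchrun --nnodes=$2 --nproc_per_node=$3 --node_rank=$1" \
         "--master_addr=$4 --master_port=29500 train.py"
}
```

With two machines and 4 GPUs each, the main machine would run the output of "launch_cmd 0 2 4 MAIN_IP" and the child machine the output of "launch_cmd 1 2 4 MAIN_IP". Running "watch -n 1 nvidia-smi" in a second terminal on each machine shows the GPUs becoming busy.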
reference: https://cv.gluon.ai/build/examples_torch_action_recognition/ddp_pytorch.html