ref: https://github.com/markwdalton/lambdalabs/blob/main/documentation/cheatsheets.txt
Lambda Cheat Sheets
This is a conceptual starter - it provides the basic ideas rather than an exhaustive reference.
Lambda Ubuntu Linux Command Line Cheat Sheet (http://lambdalabs.com/)
Working with Files:
Basics:
* pwd – Show the ‘present working directory’
* ls - see files in your current directory
* cd <name> - to change to a new directory
* find . -name 'example*' - Find files that start with the name 'example'
ls - List files
* ls -alrt - List all files in the directory in long format, sorted by modification time (newest last)
* ls -a - list hidden files
* ls -CF - List in columns and classify files
* ls -lR - Long format recursive
Show sizes of files:
* du -s <filename> - in KBs
* du -sh <filename> - human readable
* du -s * | sort -n - Easy way to find the largest files/directories in a directory
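For example, to show only the five largest entries in the current directory (a small variation using standard sort options):
$ du -s * | sort -n | tail -5
$ du -sh * | sort -h | tail -5    # same idea with human-readable sizes (GNU sort -h)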
Moving Files:
* mv <filename> <new filename>
* mv <name> <location/name>
Move a file to new location/name:
* mv foo /tmp/user/foo.txt
Move a directory to new name:
* mv data save/data.bak
Copying files:
* cp <file> <new_file>
* cp <file> <dir>/<new_file>
Copy a directory to new name/location:
* cp -a <dir> <new_dir>
Remote copy:
* sftp
* rsync
* scp file remote:./file
* scp -rq directory remote:.
* sshfs user@remote-host:directory ./mount
Example:
$ mkdir myhome
$ sshfs 192.168.1.122:/home ./myhome
$ df -h ./myhome
Filesystem Size Used Avail Use% Mounted on
192.168.1.122:/home 480G 73G 384G 16% /home/user/myhome
$ umount ./myhome
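A minimal rsync example for the remote copy tools above (hypothetical paths and host name; -a preserves attributes, -v is verbose, -P shows progress and allows resuming):
$ rsync -avP ./results/ user@remote:/home/user/results/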
* Tunnel a port on a remote host's local interface to your machine:
- Use case: securely access, over ssh, a jupyter-notebook on a remote host that is not exposed to the internet.
$ ssh -N -L 8888:localhost:8888 <gpuserver>
* Substitute the port if needed, as jupyter notebook may assign one other than 8888
* Tunnel a port for a remote host to access from your local machine through a jump host:
$ ssh mdalton@50.211.197.34 -L 8080:10.1.10.69:80
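Once the tunnel above is up, the remote service is reachable on the local port (using the example's port numbers):
$ curl http://localhost:8080/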
Checking space on a file system:
Disk space (-h is 'human readable': sizes shown as K, M, G based on powers of 1024; df -H uses powers of 1000):
* df -h . - Check your current location
* df -h - Check all filesystems mounted. You should be concerned over 90%.
Check Inodes (number of files):
* df -i - Number of inodes, used, available. Be concerned over 90%.
Show CPU Utilization:
* top
* htop
* ps
ps -elf
ps aux
ps -flu <user>
Show Memory Utilization on Linux:
* free
* free -h
* top
Disk commands:
* lsblk - list drives seen
* df - show sizes of mounted
* mount - show mounts/options
* fdisk -l - show drive partitions
* smartctl - check drives for information or errors.
Example:
smartctl -x /dev/nvme0n1
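A quick overall health check is also available (assuming smartmontools is installed; the device name will vary):
$ sudo smartctl -H /dev/nvme0n1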
Show GPU Utilization:
* nvtop
* nvidia-smi
* nvidia-smi -q - this provides additional information.
* nvidia-smi pmon
NVlink information:
* nvidia-smi topo -m - To see the NVLink connection topology
* nvidia-smi nvlink -s - To see rates per link
GPU Debugging:
* 'nvidia-smi' fails with 'Failed to initialize NVML: Driver/library version mismatch'
* This can normally be resolved with a reboot.
* It occurs when the nvidia-smi tool and libraries are newer than the loaded nvidia kernel module (for example, after a driver update without a reboot).
* I do not see any GPUs:
* This can happen when an old CUDA version that does not support current GPUs is in use, e.g. CUDA 10 (nvidia-cuda-toolkit) with Ampere GPUs (30## series or A-series GPUs).
* If you are using Anaconda:
- It may not have loaded or installed the correct CUDA version.
- LD_LIBRARY_PATH may not be set to the cuDNN version Anaconda installed.
* See logged errors (a journalctl alternative is shown after the Xid list below):
* grep "kernel: NVRM: Xid" /var/log/kern.log
* The main Xid errors:
All GPUs Xid: 79
A100 Xid: 64, 94
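On systems that use the systemd journal, the same 'NVRM: Xid' kernel messages can be searched there (a sketch; -k limits output to kernel messages):
$ sudo journalctl -k | grep -i "NVRM: Xid"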
Xid Error References:
* GPU Debug guide: https://docs.nvidia.com/deploy/gpu-debug-guidelines/index.html
* GPU Error Definitions: https://docs.nvidia.com/deploy/xid-errors/index.html
* A100 Xids: https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html
NVIDIA Fabric Manager/NVSwitch:
* Fabric manager guide - https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf
PCI devices:
* lspci - lists devices on the PCI bus
* lspci -vvv - provides more verbose output
USB:
* lsusb - list seen USB devices
* sudo dmesg - will also show when they are discovered
Linux/Ubuntu/NVIDIA tools for monitoring utilization
* For GPUs it is important to associate each GPU's PCI address with its UUID (the index is relative and can change)
* nvidia-smi --query-gpu=index,pci.bus_id,uuid --format=csv
* top - Show the top Linux processes by CPU, memory (RSS), and virtual memory
* htop
View the processes on GPUs
* nvidia-smi pmon
* nvidia-smi dmon -s pc
Show GPU view power, temp, memory on GPUs over time
* nvidia-smi dmon
Show GPU stats and environment over time in CSV format
* nvidia-smi --query-gpu=index,pci.bus_id,uuid,fan.speed,utilization.gpu,utilization.memory,temperature.gpu,power.draw --format=csv -l
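To log these stats to a file in the background (a sketch; the 5-second interval and file name are arbitrary):
$ nvidia-smi --query-gpu=index,pci.bus_id,uuid,utilization.gpu,utilization.memory,temperature.gpu,power.draw --format=csv -l 5 >> gpu-stats.csv &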
Find various options for --query-gpu:
* nvidia-smi --help-query-gpu
On commercial (data-center) GPUs like the A100 there are some additional options, such as:
* GPU Memory Temperature, Memory errors, Memory remapping
* You can see these all through:
nvidia-smi -q
* You can also monitor the memory temperature:
nvidia-smi --query-gpu=index,pci.bus_id,uuid,pstate,fan.speed,utilization.gpu,utilization.memory,temperature.gpu,temperature.memory,power.draw --format=csv -l
* Watch for remapped memory (requires a reboot/reset of the GPU):
nvidia-smi --query-remapped-rows=gpu_bus_id,gpu_uuid,remapped_rows.correctable,remapped_rows.uncorrectable,remapped_rows.pending,remapped_rows.failure --format=csv
* Up to 8 rows in a bank can be remapped, but a reboot is required between each remap.
* After all 8 rows in a bank are remapped, the GPU or chassis (SXM) needs to be reworked.
* If remapped_rows.failure == yes: disable the GPU; the machine needs an RMA for repair.
* If remapped_rows.pending == yes: the GPU needs to be reset (commonly after a high number of aggregate errors).
* Watch for Volatile (current boot session; more accurate) and Aggregate (lifetime of the GPU; in theory everything, but it can miss some) memory errors:
To see the various memory errors to track:
nvidia-smi --help-query-gpu | grep "ecc.err"
For example:
All volatile memory errors (this boot session or since a GPU reset):
nvidia-smi --query-gpu=index,pci.bus_id,uuid,ecc.errors.corrected.volatile.dram,ecc.errors.corrected.volatile.sram --format=csv
All volatile uncorrected memory errors:
nvidia-smi --query-gpu=index,pci.bus_id,uuid,ecc.errors.uncorrected.volatile.dram,ecc.errors.uncorrected.volatile.sram --format=csv
All Aggregate corrected memory errors:
nvidia-smi --query-gpu=index,pci.bus_id,uuid,ecc.errors.corrected.aggregate.dram,ecc.errors.corrected.aggregate.sram --format=csv
All Aggregate uncorrected memory errors:
nvidia-smi --query-gpu=index,pci.bus_id,uuid,ecc.errors.uncorrected.aggregate.dram,ecc.errors.uncorrected.aggregate.sram --format=csv
Linux/Ubuntu commands for Lambda
System monitoring
* top
* htop
* nvtop
* nvidia-smi pmon
* ps -elf (ps aux) to see running processes
* free – see the amount of memory and swap; used and available
Navigating and finding files
* pwd – Present working directory
* ls - see files in your current directory
* cd <name> - to change to a new directory
* find . -name 'example*' - Find files that start with the name 'example'
Disk and file systems
* df -h – Show how much space is in all file systems
* df -ih – show how many inodes (number of files) on each file system
* du -s * | sort -n - Show the largest files/directories in the current directory
* du -sh example.tar.gz – show how large the file 'example.tar.gz' is
* duf - a little more friendly format (sudo apt install duf)
* Graphical view:
$ sudo apt install xdiskusage
$ sudo xdiskusage
Networking:
* ip address show - Long list of information about interfaces
* ip -br address show - a brief version of the command above
* ip addr show dev <dev> - show address for one interface
* example: ip addr show dev eth0
* ip link - show links
* ip -br link – brief view of the links
* ip route – show routes on your system
* ip tunnel
* ip n - replaces arp; shows MAC and IP addresses of neighbors on the network
* ping -c 3 10.0.0.1 - Ping the IP address 10.0.0.1 three times
* traceroute 10.0.0.1 - Check the route and performance to IP address 10.0.0.1
* /etc/netplan – Location of the network interface configurations.
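A typical workflow after editing a configuration file under /etc/netplan (netplan try rolls the change back automatically if you lose connectivity and do not confirm):
$ ls /etc/netplan/
$ sudo netplan try
$ sudo netplan apply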
Managing users:
* groups <username> - Check whether the 'user' exists and which groups they are in
* sudo adduser <username> - Add a new user
* sudo deluser <username> - Delete an existing user
* sudo adduser <username> <group> - Add a 'user' to a 'group' (both must already exist)
* sudo deluser <username> <group> - Delete/remove a ‘user’ from a ‘group’
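For example, creating a new user and then adding them to an existing group (hypothetical user name; the 'docker' group exists once Docker is installed):
$ sudo adduser alice
$ sudo adduser alice docker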
Firewall:
* sudo iptables -L - List iptables rules
Linux is switching to 'nftables':
* sudo nft -a list ruleset
* sudo ufw status - Show the status of the ufw
* Example adding ssh to UFW firewall:
* sudo ufw allow ssh
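UFW rules can also be restricted to a source network (hypothetical subnet; ssh is TCP port 22):
$ sudo ufw allow from 10.0.0.0/24 to any port 22 proto tcp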
Linux and Lambda Stack upgrades and packaging:
* sudo apt-get update - Update the list of packages from repository (sync up)
* apt list --upgradeable - list upgradable packages (after the update)
* sudo apt-get upgrade - Upgrade packages
* sudo apt-get dist-upgrade - more aggressive upgrade - can remove packages
* sudo apt full-upgrade - more aggressive upgrade - can remove packages
* dpkg -L <installed package> - List the contents of a given package
* dpkg -S <full path to file> - Show the package a file came from
* dpkg --list - Show the list of installed packages
* apt list --installed
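For example, to find which package owns a file and then list everything in that package (the package name returned depends on the installed driver version):
$ dpkg -S /usr/bin/nvidia-smi
$ dpkg -L <package name from the previous command>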
Linux/Ubuntu security managing user access:
* iptables -L - List firewall rules
* /etc/sudoers - Contains a list of sudo rules
* visudo - to edit sudoers to change rules
* sudo adduser <username> sudo - Add a user to the sudo group, which gives them full root access via sudo; use caution.
Example: (Add the user 'john' to the 'sudo' group)
$ sudo adduser john sudo
Linux/Ubuntu NVIDIA GPU
* nvtop - watch the GPUs utilization and memory utilization
* nvidia-smi - see the driver version (supported CUDA and usage)
* note the persistence mode
* nvidia-smi -q - gives more detailed information for each GPU
NVlink information:
* nvidia-smi topo -m - To see the NVLink connection topology
* nvidia-smi nvlink -s - To see the rates per link
See logged errors:
* grep "kernel: NVRM: Xid" /var/log/kern.log
Boot modes for linux:
Find the current setting for boot level:
$ systemctl get-default
Set to boot to Multi-user (non-graphical):
$ sudo systemctl set-default multi-user.target
Set to boot to Graphical mode:
$ sudo systemctl set-default graphical.target
Change now (temporarily) to multi-user:
$ sudo systemctl isolate multi-user.target
Change now (temporarily) to Graphical:
$ sudo systemctl isolate graphical.target
Containers and Virtual Environments:
See examples:
https://github.com/markwdalton/lambdalabs/tree/main/documentation/software/examples/virtual-environments
* Docker/Singularity - Make use of NVIDIA's Container Catalog: https://catalog.ngc.nvidia.com/
* Python venv - part of the Python standard library, so this is recommended - supports isolated environments or using the system site packages.
* virtualenv - independently developed (originally for Python 2), still around, but moving to python venv is recommended.
* Anaconda - the license terms have changed in recent years - companies should review the license.
Docker
See the Lambda Docker PyTorch Tutorial:
https://lambdalabs.com/blog/nvidia-ngc-tutorial-run-pytorch-docker-container-using-nvidia-container-toolkit-on-ubuntu
Install docker (with Lambda Stack installed):
* sudo apt-get install -y docker.io nvidia-container-toolkit
* sudo systemctl daemon-reload
* sudo systemctl restart docker
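To confirm the Docker daemon itself is working before pulling large NGC images (standard hello-world test image):
$ sudo docker run --rm hello-world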
Finding many docker images for Deep Learning
* https://catalog.ngc.nvidia.com/
Pull a docker image
* sudo docker pull <image name>
* sudo docker pull nvcr.io/nvidia/tensorflow:22.05-tf1-py3
* sudo docker pull nvcr.io/nvidia/pytorch:22.05-py3
Run Docker (it will pull the image if not found)
* sudo docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:22.05-py3
List running docker containers
* docker ps
List docker images
* docker images
Mount a directory in a docker image on start up
* sudo docker run --gpus all -it --rm -v `pwd`/data:/data/ nvcr.io/nvidia/pytorch:22.05-py3
You can add a command to run at the end of the line: ls, a python script, etc.
This mounts the 'data' directory from the current directory into the container as /data.
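For example, running a script from the mounted directory (hypothetical script name):
$ sudo docker run --gpus all -it --rm -v `pwd`/data:/data/ nvcr.io/nvidia/pytorch:22.05-py3 python /data/train.py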
Copy a file from the host to the container
* docker cp input.txt container_id:/input.txt
Copy a file from the container to the local file system
* docker cp container_id:/output.txt output.txt
Copy a group of files in the ‘data’ directory to the container
* docker cp data/. container_id:/target
Copy a group of files in the container ‘output’ directory to local host
* docker cp container_id:/output/. target
* docker create or docker run options:
  * -a, --attach # attach stdout/err
  * -i, --interactive # attach stdin (interactive)
  * -t, --tty # pseudo-tty
  * --name NAME # name your container
  * -p, --publish 5000:5000 # port map
  * --expose 5432 # expose a port to linked containers
  * -P, --publish-all # publish all ports
  * --link container:alias # linking
  * -v, --volume `pwd`:/app # mount (absolute paths needed)
  * -e, --env NAME=hello # env vars
* For 'docker run' only:
  * --rm true|false # automatically remove the container when it exits; the default is false
Example to run on ALL GPUs, interactive, with a tty, and remove the running container on exit.
$ docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:22.05-py3 nvidia-smi
Example of listing the mapped ports:
Look for running containers:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
e7fa01d97208 dalle-playground_dalle-interface "docker-entrypoint.s…" 3 months ago Up 8 hours 0.0.0.0:3000->3000/tcp, :::3000->3000/tcp dalle-interface
Look at the port mapping for that running container:
$ docker port e7fa01d97208
3000/tcp -> 0.0.0.0:3000
3000/tcp -> :::3000
Kubernetes
List all pods in the current namespace:
kubectl get pods
List all pods in all namespaces:
kubectl get pods --all-namespaces
List all services in the current namespace:
kubectl get services
List all deployments in the current namespace:
kubectl get deployments
List all nodes in the cluster:
kubectl get nodes
Describe a pod:
kubectl describe pod <pod-name>
Describe a service:
kubectl describe service <service-name>
Describe a deployment:
kubectl describe deployment <deployment-name>
Create a new deployment:
kubectl create deployment <deployment-name> --image=<image-name>
Update a deployment:
kubectl set image deployment <deployment-name> <container-name>=<new-image-name>
Scale a deployment:
kubectl scale deployment <deployment-name> --replicas=<replica-count>
Delete a deployment:
kubectl delete deployment <deployment-name>
Delete a pod:
kubectl delete pod <pod-name>
Delete a service:
kubectl delete service <service-name>
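Two related commands that often come up alongside the list above (standard kubectl; names are placeholders):
Get the logs from a pod:
kubectl logs <pod-name>
Open a shell inside a running pod:
kubectl exec -it <pod-name> -- /bin/bash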
For servers with IPMI:
Install ipmitool:
$ sudo apt-get install ipmitool
List the users
$ sudo ipmitool user list
Change the password for User ID 2 (from previous ‘user list’)
$ sudo ipmitool user set password 2
Then, enter the new password twice.
Cold reset the BMC - normally only needed when the BMC is not getting updates:
$ sudo ipmitool mc reset cold
Print the BMC network information:
$ sudo ipmitool lan print
Print the BMC Event log
$ sudo ipmitool sel elist
Print Sensor information:
$ sudo ipmitool sdr
$ sudo ipmitool sensor
Print Information about the system:
$ sudo ipmitool fru
Power Status:
$ sudo ipmitool power status
Power control server:
$ sudo ipmitool power [status|on|off|cycle|reset|diag|soft]
Power on server:
$ sudo ipmitool power on
Power off server:
$ sudo ipmitool power off
Power cycle server:
$ sudo ipmitool power cycle
Power reset the server:
$ sudo ipmitool power reset
Check or Set the BMC time:
$ sudo ipmitool sel time get
$ sudo ipmitool sel time set "$(date '+%m/%d/%Y %H:%M:%S')"
$ ipmitool sel time get
Or
$ sudo ipmitool sel time set now
$ sudo hwclock --systohc
IPMI Example setting up a static IP address:
If you were given:
IPMI/BMC IP address: 10.100.1.132
Netmask: 255.255.255.0
Gateway: 10.100.1.1
Then the configuration would be:
* Confirm current settings:
$ sudo ipmitool lan print 1
* Set the IPMI Interface to Static (default is dhcp)
$ sudo ipmitool lan set 1 ipsrc static
* Set the IP Address:
$ sudo ipmitool lan set 1 ipaddr 10.100.1.132
* Set the netmask for this network:
$ sudo ipmitool lan set 1 netmask 255.255.255.0
* Set the Default Gateway
$ sudo ipmitool lan set 1 defgw ipaddr 10.100.1.1
* Set/confirm that LAN (network) access is enabled:
$ sudo ipmitool lan set 1 access on
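Then confirm the new settings took effect:
$ sudo ipmitool lan print 1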
A common request is getting the sensor and event log output:
1. On the node from linux:
$ sudo apt install ipmitool
$ sudo ipmitool sdr >& ipmi-sdr.txt
$ sudo ipmitool sel elist >& ipmi-sel.txt
Or from a remote linux machine:
$ ipmitool -I lanplus -H IP_ADDRESS -U ADMIN -P "PASSWORD" sel elist >& ipmi-sel.txt
$ ipmitool -I lanplus -H IP_ADDRESS -U ADMIN -P "PASSWORD" sdr >& ipmi-sdr.txt
** Where 'PASSWORD' is your IPMI password and IP_ADDRESS is your machine's BMC/IPMI IP address.
Alternatively, the BMC Web GUI can save the event log as a CSV:
BMC/IPMI -> Logs and Reports -> Event Log -> Save to excel (CSV).
Networking -> Infiniband
* lsmod | egrep "mlx|ib"
* ibstat
* ibstatus
* ibv_devinfo
* ibswitches
* ibhosts
* lspci | grep Mellanox
* lspci | egrep -i "mellanox|mlnx|mlx[0-9]_core|mlnx[0-9]_ib"
* dmesg | egrep -i "mellanox|mlnx|mlx[0-9]_core|mlnx[0-9]_ib"
* Check for errors like insufficient power
* lsmod | grep rdma
* mst start
* mst status -v
* A subnet manager (opensm) needs to be running either on the switch or on at least one of the nodes
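For a basic point-to-point bandwidth test between two nodes (assumes the perftest package is installed; host name is a placeholder):
On the server node:
$ ib_write_bw
On the client node:
$ ib_write_bw <server-hostname>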