Goku Coding Experience: March 2024

ref: https://github.com/markwdalton/lambdalabs/blob/main/documentation/cheatsheets.txt
Lambda Cheat Sheets
This is the conceptual starter - provides the idea


Lambda Ubuntu Linux Command Line Cheat Sheet (http://lambdalabs.com/)
Working with Files:
  Basics:
    * pwd – Show the ‘present working directory’
    * ls     - see files in your current directory
    * cd <name> - to change to a new directory
    * find . -name ‘example*’   - Find files that start with the name ‘example’
  ls - List files 
    * ls -alrt - List all files in directory long format 
    * ls -a  - list hidden files
    * ls -CF - List in columns and classify files
    * ls -lR - Long format recursive
  Show sizes of files:
    * du -s <filename> - in KBs
    * du -sh <filename> - human readable
    * du -s * | sort -n  - Easy way to find the largest files/directories in a directory
  Moving Files:
    * mv <filename> <new filename>
    * mv <name> <location/name>
  Move a file to new location/name:
    * mv foo /tmp/user/foo.txt
         Move a directory to new name:
    * mv data save/data.bak
  Copying files:
    * cp <file> <new_file>
    * cp <file> <dir>/<new_file>
      Copy a directory to new name/location:
    * cp -a <dir> <new_dir>
  Remote copy:
    * sftp
    * rsync
    * scp file remote:./file
    * scp -rq directory remote:.
    * sshfs user@remote-host:directory ./mount
      Example:
        $ mkdir myhome
        $ sshfs 192.168.1.122:/home ./myhome
        $ df -h ./myhome
          Filesystem           Size  Used Avail Use% Mounted on
          192.168.1.122:/home  480G   73G  384G  16% /home/user/myhome
        $ umount ./myhome
    * Tunnel a port to a remote hosts local interface to your machine:
        - Use case to securely access over ssh jupyter-notebook on a remote hosts that is not exposed to the internet.
        $ ssh -N -L 8888:localhost:8888 <gpuserver>
          * Substitute the 8888 port as jupyter notebook may assign others
         
    * Tunnel a port for a remote host to access from your local machine through a jump host:
        $ ssh mdalton@50.211.197.34 -L 8080:10.1.10.69:80
       
Checking space on a file system:
  Disk space: (-h is ‘human readable’ or translated to M, G based on 1024 or 1000 for inodes).
   * df -h .  - Check your current location
   * df -h  - Check all filesystems mounted.   You should be concerned over 90%.
        Check Inodes (number of files):
   * df -i  - Number of inodes, used, available. Be concerned over 90%.

Show CPU Utilization:
   * top
   * htop
   * ps
     ps -elf
     ps aux
     ps -flu <user>

* Show Memory Utilization on Linux:
   * free
   * free -h
   * top

Disk commands:
   * lsblk - list drives seen
   * df - show sizes of mounted
   * mount - show mounts/options
   * fdisk -l - show drive partitions
   * smartctl - check drives for information or errors.
     Example:
      smartctl -x /dev/nvme0n0

Show GPU Utilization:
      * nvtop
      * nvidia-smi 
      * nvidia-smi -q - this provides additional information.
      * nvidia-smi pmon

NVlink information:
      * nvidia-smi topo -m  - To see the NVLink connection topology
      * nvidia-smi nvlink -s - To see rates per link

GPU Debugging:
      * ‘nvidia-smi Failed to initialize NVML: Driver/library version mismatch’
      * This can normally be resolved with a reboot.
      * This occurs when the command is newer than the nvidia kernel module.
      * I do not see any GPUs:
      * If it is using  an old CUDA version that does not support current GPUs.  Like CUDA 10 (nvidia-cuda-toolkit) with Ampere GPUs (30## series or A## series GPUs).
      * If you are using Anaconda:
       - If it did not load or  have the correct CUDA installed.
       - If you did not set the LD_LIBRARY_PATH to the cuDNN version Anaconda installed. 
         * See logged errors:
         * grep "kernel: NVRM: Xid" /var/log/kern.log
         * The main Xid errors:
            All GPUs Xid: 79
            A100 Xid: 64, 94

Xid Error References:
      * GPU Debug guide: https://docs.nvidia.com/deploy/gpu-debug-guidelines/index.html
      * GPU Error Definitions: https://docs.nvidia.com/deploy/xid-errors/index.html
      * A100 Xids: https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html

    NVidia Fabric manager/NVSwitch:
      * Fabric manager guide - https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf

PCI devices:
      * lspci - lists devices on the PCI bus 
      * lspci -vvv - provides more verbose output

USB:
      * lsusb - list seen USB devices
      * sudo dmesg - will also show when they are discovered

Linux/Ubuntu/NVIDIA tools for monitoring utilization
      * For GPUs it is important to associate the GPUs PCI Address with the GPU UUID (index is relative)
      * nvidia-smi --query-gpu=index,pci.bus_id,uuid --format=csv
      * top  - Show the linux top process based on CPU, memory (rss) virtual memory
      * htop 

View the processes on GPUs
      * nvidia-smi pmon
      * nvidia-smi dmon -s pc

Show GPU view power, temp, memory on GPUs over time
      * nvidia-smi dmon

Show GPU stats and environment over time in CSV format
      * nvidia-smi --query-gpu=index,pci.bus_id,uuid,fan.speed,utilization.gpu,utilization.memory,temperature.gpu,power.draw --format=csv -l
Find various options for –query-gpu:
      * nvidia-smi --help-query-gpu
On Commercial GPUs like A100’s there are some special options like:
      * GPU Memory Temperature, Memory errors, Memory remapping
      * You can see these all through:
           nvidia-smi -q
      * You can monitor for memory temperature also:
           nvidia-smi --query-gpu=index,pci.bus_id,uuid,pstate,fan.speed,utilization.gpu,utilization.memory,temperature.gpu,temperature.memory,power.draw --format=csv -l
   
      * Watch for remapped memory (requires a reboot/reset of the GPU):
        nvidia-smi --query-remapped-rows=gpu_bus_id,gpu_uuid,remapped_rows.correctable,remapped_rows.uncorrectable,remapped_rows.pending,remapped_rows.failure --format=csv
            * 8 banks in a row can be remapped, but requires a reboot between each remap.
            * After 8 banks in a row are remapped the GPU or chassis (SXM) needs to be reworked.
            * If remapped_rows.failure == yes  ; Disable GPU ; Machine needs a RMA to repair
            * If remapped_rows.pending == yes ; then GPU needs to be reset (commonly high number of aggregate errors).

      * Watch for Volatile (current boot session - more accurate) and Aggregate (life time of GPU - in theory all but misses some) memory errors:
          To see the various memory errors to track:
            nvidia-smi --help-query-gpu | grep "ecc.err"
            For example:
             All volatile memory errors (this boot session or since a GPU reset):
              nvidia-smi --query-gpu=index,pci.bus_id,uuid,ecc.errors.corrected.volatile.dram,ecc.errors.corrected.volatile.sram  --format=csv
              All volatile uncorrected memory errors:
                nvidia-smi --query-gpu=index,pci.bus_id,uuid,ecc.errors.corrected.volatile.dram,ecc.errors.corrected.volatile.sram  --format=csv
             All Aggregate corrected memory errors:
               nvidia-smi --query-gpu=index,pci.bus_id,uuid,ecc.errors.uncorrected.aggregate.dram,ecc.errors.uncorrected.aggregate.sram --format=csv
            All Aggregate uncorrected memory errors:
              nvidia-smi --query-gpu=index,pci.bus_id,uuid,ecc.errors.uncorrected.aggregate.dram,ecc.errors.uncorrected.aggregate.sram --format=csv
	

Linux/Ubuntu commands for Lambda
  system monitoring
      * top
      * htop
      * nvtop
      * nvidia-smi pmon 
      * ps -elf (ps aux) to see running processes
      * free – see the amount of memory and swap;  used and available

  Navigating and finding files
      * pwd – Present working directory
      * ls     - see files in your current directory
      * cd <name> - to change to a new directory
      * find . -name ‘example*’   - Find files that start with the name ‘example’

  Disk and file systems
      * df -h – Show how much space is in all file systems
      * df -ih – show how many inodes (number of files) on each file system
      * du -s * | sort -n    - Show the largest files/directories in the current directory
      * du -sh example.tar.gz – show how large a file ‘example.tar.gz’
      * duf - a little more friendly format (sudo apt install duf)
      * Graphical view: 
         $ sudo apt install xdiskusage
         $ sudo xdiskusage


  Networking:
      * ip address show - Long list of information about interfaces
      * ip -br address show -  more brief version of the commands
      * ip addr show dev <dev> - show address for one interface
      * example: ip addr show dev eth0
      * ip link  - show links
      * ip -br link – brief view of the links
      * ip route – show routes on your system
      * ip tunnel
      * ip n - replace arp find MAC and IP addresses on the network
      * ping -c 3 10.0.0.1    - Ping the IP address 10.0.0.1  three times
      * traceroute 10.0.0.1  - Check the route and performance to IP address 10.0.0.1
      * /etc/netplan – Location of the network interface configurations.


  Managing users:
      * group <username> - Check if there is a ‘user’ and which groups they are in
      * sudo adduser <username> - Adding a new user
      * sudo deluser <username>  - Deleting a existing username
      * sudo adduser <username> <group> - Add a ‘user’ to a ‘group’ both that existing
      * sudo deluser <username> <group>  - Delete/remove a ‘user’ from a ‘group’


  Firewall:
      * sudo iptables -L  - List iptables rules
         This is switching to ‘nftables’
      * sudo nft -a list ruleset
      * sudo ufw status  - Show the status of the ufw
      * Example adding ssh to UFW firewall:
      * sudo ufw allow ssh


  Linux and Lambda Stack upgrades and packaging:
      * sudo apt-get update          - Update the list of packages from repository (sync up)
      * apt list --upgradeable       - list upgradable packages (after the update)
      * sudo apt-get upgrade         - Upgrade packages
      * sudo apt-get dist-upgrade    - more aggressive upgrade - can remove packages
      * sudo apt full-upgrade        - more aggressive upgrade - can remove packages
      * dpkg -L <installed package>  - List the contents of a given packages
      * dpkg -S <full path to file>  - show the package a file came from
      * dpkg –list                   - Show the list of packages
      * apt list --installed

Linux/Ubuntu security managing user access:
      * iptables -L                  - List firewall rules
      * /etc/sudoers                 - Contains a list of sudo rules
      * visudo                       - to edit sudoers to change rules
      * sudo adduser <use> sudo      - Add a user to the sudo group, which gives them full root access via sudo, use caution.
        Example: (Add the user 'john' to the 'sudo' group
           $ sudo adduser john sudo

Linux/Ubuntu NVIDIA GPU
      * nvtop - watch the GPUs utilization and memory utilization
      * nvidia-smi - see the driver version (supported CUDA and usage)
      * note the persistence mode
      * nvidia-smi -q - gives more detailed information for each GPU
   NVlink information:
      * nvidia-smi topo -m  - To see the NVLink connection topology
      * nvidia-smi nvlink -s - To see the rates per link
   See logged errors:
      * grep "kernel: NVRM: Xid" /var/log/kern.log

Boot modes for linux:
  Find the current setting for boot level:
      $ systemctl get-default
  Set to boot to Multi-user (non-graphical):
      $ sudo systemctl set-default multi-user
  Set to boot to Graphical mode:
      $ sudo systemctl set-default graphical

  Change now (temporarily) to multi-user:
      $ sudo systemctl isolate multi-user
  Change now (temporarily) to Graphical:
      $ sudo systemctl isolate graphical


Containers and Virtual Environments:
      See examples:
         https://github.com/markwdalton/lambdalabs/tree/main/documentation/software/examples/virtual-environments
      * Docker/Singularity - Make use of NVIDIA's Container Catalog: https://catalog.ngc.nvidia.com/
      * Python venv - Developed by Python - so this is recommended - supports isolated or using system site default packages.
      * viritualenv - a group independently developed for Python 2.0, still around but recommend move to python venv.
      * Anaconda - Recent years license changes - companies should be aware of the license.

    Docker
      See the Lambda Docker PyTorch Tutorial:
        https://lambdalabs.com/blog/nvidia-ngc-tutorial-run-pytorch-docker-container-using-nvidia-container-toolkit-on-ubuntu
      Install docker (with Lambda Stack installed):
        * sudo apt-get install -y docker.io nvidia-container-toolkit
        * sudo systemctl daemon-reload
        * sudo systemctl restart docker
      Finding many docker images for Deep Learning
        * https://catalog.ngc.nvidia.com/
      Pull a docker image
        * sudo docker pull <image name>
        * sudo docker pull nvcr.io/nvidia/tensorflow:22.05-tf1-py3
        * sudo docker pull nvcr.io/nvidia/pytorch:22.05-py3
      Run Docker (it will pull the image if not found)
        * sudo docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:22.05-py3
      List running docker containers
        * docker ps
      List docker images
        * docker images
      Mount a directory in a docker image on start up
        * sudo docker run --gpus all -it --rm -v `pwd`/data:/data/ nvcr.io/nvidia/pytorch:22.05-py3 
         You can add a command to the end of the line: ls, python code
         Mounts the ‘data’ directory from the current directory into the container as /data.
      Copy a file from the host to the container
        * docker cp input.txt container_id:/input.txt
      Copy a file from the container to the local file system
        * docker cp container_id:/output.txt output.txt
      Copy a group of files in the ‘data’ directory to the container
        * docker cp data/. container_id:/target
      Copy a group of files in the container ‘output’ directory to local host
        * docker cp container_id:/output/. target
        * mark@lambda-dual:~/lambda/tickets/8021$ cat ~/lambda/docker.txt
        * docker create or docker run:
        *                   *   -a, --attach               # attach stdout/err
        *   -i, --interactive          # attach stdin (interactive)
        *   -t, --tty                  # pseudo-tty
        *       --name NAME            # name your image
        *   -p, --publish 5000:5000    # port map
        *       --expose 5432          # expose a port to linked containers
        *   -P, --publish-all          # publish all ports
        *       --link container:alias # linking
        *   -v, --volume `pwd`:/app    # mount (absolute paths needed)
        *   -e, --env NAME=hello       # env vars
        *                   * for 'docker run':
        *   --rm true|false
        *        Automatically remove the container when it exits. The default is false.

      Example to run on ALL GPUs, interactive, with a tty, and remove the running container on exit.
          $ docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:22.05-py3 nvidia-smi
      List the ports mapped example:
        Look for running images:
          $ docker ps
            CONTAINER ID   IMAGE                              COMMAND                  CREATED        STATUS       PORTS                                       NAMES
            e7fa01d97208   dalle-playground_dalle-interface   "docker-entrypoint.s…"   3 months ago   Up 8 hours   0.0.0.0:3000->3000/tcp, :::3000->3000/tcp   dalle-interface

        Look at the port mapping for that running image:
          $ docker port e7fa01d97208
            3000/tcp -> 0.0.0.0:3000
            3000/tcp -> :::3000

Kubernetes
   List all pods in the current namespace:
      kubectl get pods
   List all pods in all namespaces:
      kubectl get pods --all-namespaces
   List all services in the current namespace:
   kubectl get services
   List all deployments in the current namespace:
      kubectl get deployments
   List all nodes in the cluster:
      kubectl get nodes
   Describe a pod:
      kubectl describe pod <pod-name>
   Describe a service:
      kubectl describe service <service-name>
   Describe a deployment:
      kubectl describe deployment <deployment-name>
   Create a new deployment:
      kubectl create deployment <deployment-name> --image=<image-name>
   Update a deployment:
      kubectl set image deployment <deployment-name> <container-name>=<new-image-name>
   Scale a deployment:
      kubectl scale deployment <deployment-name> --replicas=<replica-count>
   Delete a deployment:
      kubectl delete deployment <deployment-name>
   Delete a pod:
      kubectl delete pod <pod-name>
   Delete a service:
      kubectl delete service <service-name>


For servers with IPMI:
Install ipmitool:
  $ sudo apt-get install ipmitool
List the users
  $ sudo ipmitool user list
Change the password for User ID 2 (from previous ‘user list’)
  $ sudo ipmitool user set password 2
        Then, enter the new password twice.
Cold reset the BMC - normally only needed when the BMC is not getting updates:
  $ sudo ipmitool mc reset cold
Print the BMC network information:
  $ sudo ipmitool lan print
Print the BMC Event log
  $ sudo ipmitool sel elist
Print Sensor information:
  $ sudo ipmitool sdr
  $ sudo ipmitool sensor
Print Information about the system:
  $ sudo ipmitool fru
Power Status:
     $ sudo ipmitool power status
Power control server:
     $ sudo ipmitool power [status|on|off|cycle|reset|diag|soft]
Power on server:
     $ sudo ipmitool power on
Power off server:
     $ sudo ipmitool power off
Power cycle server:
     $ sudo ipmitool power cycle
Power reset the server:
     $ sudo ipmitool power reset
Check or Set the BMC time:
     $ sudo ipmitool sel time get
     $ sudo ipmitool sel time set "$(date '+%m/%d/%Y %H:%M:%S')"
     $ ipmitool sel time get
          Or
     $ sudo ipmitool sel time set now
     $ sudo hwclock --systohc


IPMI Example setting up a static IP address:
If you were given:
   IPMI/BMC IP address: 10.100.1.132
    Netmask: 255.255.255.0
    Gateway: 10.100.1.1


Then the configuration would be:
                  * Confirm current settings:
           $ sudo ipmitool lan print 1
                  * Set the IPMI Interface to Static (default is dhcp)
           $ sudo ipmitool lan set 1 ipsrc static
                  * Set the IP Address:
           $ sudo ipmitool lan set 1 ipaddr 129.105.61.139
                  * Set the netmask for this network:
           $ sudo ipmitool lan set 1 netmask 255.255.255.0
                  * Set the Default Gateway
           $ sudo ipmitool lan set 1 defgw ipaddr 129.105.61.1
                  * Set/confirm that LAN (network) access is enabled:
           $ sudo ipmitool lan set 1 access on


Common request is getting the sensor and event log output..
1. On the node from linux:
        $ sudo apt install ipmitool
        $ sudo ipmitool sdr >& ipmi-sdr.txt
        $ sudo ipmitool sel elist >& ipmi-sel.txt
        
      Or from a remote linux machine:
         $ ipmitool -I lanplus -H IP_ADDRESS -U ADMIN -P "PASSWORD" sel elist >& ipmi-sel.txt
         $ ipmitool -I lanplus -H IP_ADDRESS -U ADMIN -P "PASSWORD" sdr >& ipmi-sdr.txt
            ** Where 'PASSWORD' would be your IPMI password, and IP_ADDRESS is your  
               machines BMC/IPMI ip address.
 
Or at least the Web GUI can save the cvs of the eventlog.
     BMC/IPMI -> Logs and Reports -> Event Log -> Save to excel (CSV).


Networking -> Infiniband
     * lsmod | egrep “mlx|ib”
     * ibstat
     * ibstatus
     * ibv_devinfo
     * ibswitch
     * ibhosts
     * lspci | grep Mellanox
     * lspci | egrep -i "mellanox|mlnx|mlx[0-9]_core|mlnx[0-9]_ib"
     * dmesg | egrep -i "mellanox|mlnx|mlx[0-9]_core|mlnx[0-9]_ib"
     * Check for errors like insufficient power
     * lsmod | grep rdma
     * mst start
     * mst status -v
* opensm needs to be running either on the switch or at least one of the nodes
Goku Coding Experience

Monday, 11 March 2024

Ubuntu Cheatsheet