Node deployment

Prepare your node to use GPU platform services.

These instructions show how to prepare a node so Kommander can launch GPU services on the node.

You can prepare a node in one of two ways: provision it with Konvoy 2 on AWS, or install the drivers manually on an existing node.

Before you begin

This procedure requires the following:

  • Nodes must provide an Nvidia GPU.

  • For AWS, select a GPU instance type from the Accelerated Computing section of the AWS instance types.

  • Nodes must run CentOS 7.
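The prerequisites above can be checked from a shell on the node before you start. The following is a minimal sketch, not part of Konvoy or Kommander: the function names is_centos7 and has_nvidia_gpu are hypothetical, and the sketch assumes the node has a standard /etc/os-release file and that lspci (from the pciutils package) is available.

```shell
# Hypothetical helpers for checking the prerequisites; not part of Konvoy.

# Succeeds when the given os-release file (default /etc/os-release)
# describes CentOS 7.
is_centos7() {
  os_release="${1:-/etc/os-release}"
  # Source the file in a subshell so ID and VERSION_ID do not leak out.
  (
    . "$os_release" || exit 1
    [ "$ID" = "centos" ] && [ "${VERSION_ID%%.*}" = "7" ]
  )
}

# Succeeds when the PCI device listing on stdin mentions an NVIDIA device.
has_nvidia_gpu() {
  grep -qi 'nvidia'
}
```

On a node, running is_centos7 && lspci | has_nvidia_gpu returns success only when both prerequisites are met. Reading the device listing from stdin keeps the check usable with saved output from another machine.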

Use Konvoy 2 on AWS

To provision GPU nodes using Konvoy 2 on AWS:

  1. Use konvoy-image-builder to create an Amazon AMI with the GPU override.

    konvoy-image build images/ami/centos-7.yaml --overrides overrides/nvidia.yaml
    
  2. Follow the Konvoy Installation procedure up to and including the konvoy create cluster aws command.

  3. Edit the ${CLUSTER_NAME}.yaml file:

    • Update the instanceType of the worker nodepool to an instance type that provides Nvidia GPUs, e.g. p2.xlarge.
    • Add an ami.id to reference the image generated by konvoy-image-builder. This simplified example uses AMI ID ami-0d931a15fdf46f14f; substitute the ID from your own konvoy-image-builder output.
    [...]
    ---
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
    kind: AWSMachineTemplate
    metadata:
      [...]
      name: <CLUSTER_NAME>-md-0
      [...]
    spec:
      template:
        spec:
          [...]
          instanceType: p2.xlarge
          [...]
          ami:
            id: ami-0d931a15fdf46f14f
    
  4. Continue the Konvoy Installation.
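Before continuing the installation, it can help to confirm that the edits in step 3 were saved. The following is a minimal sketch under stated assumptions: check_gpu_nodepool is a hypothetical helper, not a Konvoy command, and it only searches the file for the two expected strings with the same single-space formatting as the example above. It does not parse the YAML or confirm that the values sit in the worker nodepool.

```shell
# Hypothetical helper: succeeds when the cluster file contains both the
# expected instance type and the expected AMI ID.
# Usage: check_gpu_nodepool <cluster-file> <instance-type> <ami-id>
check_gpu_nodepool() {
  cluster_file="$1"
  instance_type="$2"
  ami_id="$3"

  # -F treats the pattern as a fixed string, so the dot in p2.xlarge
  # is matched literally.
  grep -qF "instanceType: ${instance_type}" "$cluster_file" || {
    echo "instanceType ${instance_type} not found in ${cluster_file}" >&2
    return 1
  }
  grep -qF "id: ${ami_id}" "$cluster_file" || {
    echo "AMI ID ${ami_id} not found in ${cluster_file}" >&2
    return 1
  }
}
```

For the example values in step 3, the call would be check_gpu_nodepool "${CLUSTER_NAME}.yaml" p2.xlarge ami-0d931a15fdf46f14f.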

Manual Deployment

For clusters not covered in the previous procedure, run the following commands on each GPU node to configure the drivers:

CentOS 7

sudo yum update -y
sudo yum -y group install "Development Tools"
sudo yum -y install kernel-devel epel-release
sudo yum -y install dkms
sudo sed -i '/^GRUB_CMDLINE_LINUX=/s/"$/ module_name.blacklist=1 rd.driver.blacklist=nouveau modprobe.blacklist=nouveau"/' /etc/default/grub
sudo dracut --omit-drivers nouveau -f
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot

# After the node reboots, reconnect and confirm that nouveau is not loaded (expect no output).
lsmod | grep -i nouveau
sudo yum install -y tar bzip2 make automake gcc gcc-c++ pciutils elfutils-libelf-devel libglvnd-devel iptables firewalld vim bind-utils wget
distribution=rhel7
ARCH=$( /bin/arch )
sudo yum-config-manager --add-repo http://developer.download.nvidia.com/compute/cuda/repos/$distribution/${ARCH}/cuda-$distribution.repo
sudo yum install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)
sudo yum clean expire-cache
sudo yum install -y nvidia-driver-latest-dkms-3:460.73.01-1.el7.x86_64
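The lsmod check after the reboot relies on reading the output by eye. When the steps run from a script, a check that returns an exit status is easier to act on. The following is a minimal sketch: nouveau_loaded is a hypothetical helper name, and it assumes the usual lsmod layout with the module name in the first column.

```shell
# Hypothetical helper: succeeds when the lsmod output on stdin lists the
# nouveau module. Matching the first column exactly avoids false hits on
# other modules whose lines merely mention nouveau.
nouveau_loaded() {
  awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}
```

A script could stop before the driver installation with a line such as: if lsmod | nouveau_loaded; then echo "nouveau is still loaded" >&2; exit 1; fi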

Verification

Verify that the Nvidia driver is working by running:

nvidia-smi

When the drivers are installed successfully, the output looks similar to the following:

Fri Jun 11 09:05:31 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P0    73W / 149W |      0MiB / 11441MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
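To confirm that the installed driver matches the version requested in the yum install command, the version can be read from the banner line of the output above. The following is a minimal sketch: smi_driver_version and driver_matches are hypothetical helper names, and the parsing assumes the banner format shown in this example, which may differ in other driver releases.

```shell
# Hypothetical helpers for checking the driver version in nvidia-smi output.

# Prints the value that follows "Driver Version:" in the text on stdin.
smi_driver_version() {
  sed -n 's/.*Driver Version: \([0-9.]*\).*/\1/p'
}

# Succeeds when the text on stdin reports the expected driver version.
# Usage: nvidia-smi | driver_matches 460.73.01
driver_matches() {
  [ "$(smi_driver_version)" = "$1" ]
}
```

With the sample output above, nvidia-smi | smi_driver_version prints 460.73.01.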