Deploying Nvidia GPU enabled Tanzu Kubernetes Clusters

In this blog post I'm going to detail how deploy and configure a Nvidia GPU enabled Tanzu Kubernetes Grid cluster in AWS. The method will be similar for Azure, for vSphere there are a number of additional steps to prepare the system. I'm going to essentially follow the official documentation, then run some of the Nvidia tests. Like always, it's good to get a visual reference and such for these kinds of deployments.

Pre-Reqs

Nvidia today only support Ubuntu deployed images in relation to a TKG deployment
For this blog I've already deployed my TKG Management cluster in AWS

Deploy a GPU enabled workload cluster

It's simple, just deploy a workload cluster that for the compute plane nodes (workers) that uses a GPU enabled instance.

You can create a new cluster YAML file from scratch, or clone one of your existing located in:

~/.config/tanzu/tkg/clusterconfigs

Below are the four main values you will need to change. As mentioned above, you need a GPU enabled instance, and for the OS to be Ubuntu. The OS version will default if not set to 20.04.

CONTROL_PLANE_MACHINE_TYPE: t3.large
NODE_MACHINE_TYPE: g4dn.xlarge
OS_ARCH: amd64
OS_NAME: ubuntu
OS_VERSION: "20.04

The rest of the file you configure as you would for any workload cluster deployment.

Create the cluster.

tanzu cluster create {name} -f {cluster.yaml}

You can retrieve the kubeadmin file to login by running.

tanzu cluster kubeconfig get {cluster_name} --admin

Deploying the Nvidia Kubernetes Operator

Change the kubectl context to your newly deployed cluster.

Deploying the Nvidia operator couldn't be easier, you can either download the files from the Cluster API for AWS github repo, or directly install them.

kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/cluster-api-provider-aws/de4fd54e6f988ca7fd3f94bce46867ba0523e23b/test/e2e/data/infrastructure-aws/gpu/clusterpolicy-crd.yaml

kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/cluster-api-provider-aws/de4fd54e6f988ca7fd3f94bce46867ba0523e23b/test/e2e/data/infrastructure-aws/gpu/gpu-operator-components.yaml

Validate the installation

Validate the operator pods in the default namespace, and then "nvidia" pods in the namespace "gpu-operator-resources".

kubectl get pods

kubectl get pods -n gpu-operator-resources

If you scale out your cluster with additional nodes, the Nvidia operator will ensure the additional pods run on the new nodes.

Running the Sample Applications

From here to further validate, I am running the sample applications from the Nvidia documentation.

So rather than copy the exact configs here, I'm just showing the outputs.

CUDA VectorAdd

CUDA load generator

If you want to look at further examples, Nvidia have some fantastic Deep Learning examples in this repository.

https://github.com/NVIDIA/DeepLearningExamples

Wrap-up and Resources

Hopefully you can see that to use the GPU support with a Tanzu Kubernetes Grid cluster is quick and simple to setup and consume.

Blog - VMware Tanzu Kubernetes Grid Now Supports GPUs Across Clouds
Documentation
- Deploy Tanzu Kubernetes Clusters to Amazon EC2
- Deploy a GPU-Enabled Cluster

Regards

Dean Lewis on LinkedIn

Dean Lewis