|
| 1 | +--- |
| 2 | +title: "Garden Linux: Enabling AI on Kubernetes with NVIDIA GPUs" |
| 3 | +authors: |
| 4 | + - pavel-pavlov |
| 5 | + - darren-hague |
| 6 | +tags: |
| 7 | + - kubernetes |
| 8 | + - gardener |
| 9 | + - gardenlinux |
| 10 | +--- |
| 11 | + |
| 12 | +## AI and Kubernetes: Unlocking Business Innovation |
| 13 | + |
| 14 | +Artificial Intelligence (AI) has become essential for business innovation, enabling companies to unlock new revenue streams, automate processes, and make data-driven decisions at scale. |
| 15 | + |
| 16 | +There is industry-wide agreement that Kubernetes provides an ideal platform for running AI workloads (see [Cloud Native AI Whitepaper](https://www.cncf.io/reports/cloud-native-artificial-intelligence-whitepaper/)). Furthermore, the CNCF community is in the process of defining infrastructure-level [AI Conformance](https://github.com/cncf/ai-conformance), which will make Kubernetes ubiquitous for AI workloads. |
| 17 | + |
| 18 | +But for Kubernetes to support GPUs, you need the worker nodes' operating systems enabled with the right GPU drivers and associated access frameworks. |
| 19 | + |
| 20 | +<!-- truncate --> |
| 21 | + |
| 22 | +It may seem like an obvious, pragmatic, and necessary requirement at the infrastructure level. But embedded in the fully open-source Apeiro Reference Architecture, which is governed and supported by (industry) members of the [NeoNephos Foundation](https://neonephos.org), its impact is substantial: **Apeiro freely empowers any organization or consortium seeking to build sovereign, modern datacenters for leveraging AI**. |
| 23 | + |
| 24 | +Participation and contributions are not only welcome, but directly connect to the broader joint AI imperative of business. |
| 25 | + |
| 26 | +## Simplifying NVIDIA GPU Support in Gardener |
| 27 | + |
| 28 | +This is easier said than done. There is significant operational complexity to consider: multi-cloud and hybrid environments, different hardware, diverse operating systems, complex driver management, and varying cloud provider configurations. |
| 29 | + |
| 30 | +In Apeiro, we offer [Gardener](https://gardener.cloud) and [Garden Linux](https://github.com/gardenlinux) to tackle such operational complexity. With the [NVIDIA GPU Operator](https://github.com/NVIDIA/gpu-operator), we can provide a unified AI-conformant Kubernetes platform that works across any infrastructure with [NVIDIA Data Center GPUs](https://www.nvidia.com/en-us/data-center/data-center-gpus/). |
| 31 | + |
| 32 | +## Understanding the NVIDIA GPU Operator |
| 33 | + |
| 34 | +The NVIDIA GPU Operator automates GPU support in Kubernetes by deploying all the required software components (drivers, CUDA, device plugins, etc.) in the right [ABI-compatible](https://en.wikipedia.org/wiki/Application_binary_interface) versions. It eliminates any manual GPU driver installation and configuration, and enables GPUs as native Kubernetes resources. The GPU Operator is a Kubernetes-native operator with custom resource definitions. Furthermore, it ensures consistent GPU functionality across different hardware nodes and configurations, while enabling automatic updates, scaling, and troubleshooting through standard Kubernetes APIs. |
| 35 | + |
| 36 | +<ApeiroFigure src="/img/blog/2025-08-25-nvidia-gpu-enablement-gpu-operator.png" |
| 37 | + caption="NVIDIA GPU Operator visualization in layers" |
| 38 | + source="docs.nvidia.com" |
| 39 | + sourceLink="https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/overview.html" |
| 40 | + width="100%"/> |
| 41 | + |
| 42 | +## Enabling Garden Linux for the GPU Operator |
| 43 | + |
| 44 | +The NVIDIA GPU Operator is architected in a modular way so anyone who wants to build GPU Driver containers can make the GPU Operator work with their operating system. |
| 45 | +This is what we have done and we are making it publicly available. We used the public NVIDIA GPU Driver Dockerfile to create functional Garden Linux GPU Driver images. Please feel free to use them and collaborate by sharing feedback within the Garden Linux |
| 46 | +[gardenlinux-nvidia-installer](https://github.com/gardenlinux/gardenlinux-nvidia-installer/) repository. |
| 47 | + |
| 48 | +Garden Linux builds containers for the three latest active NVIDIA driver branches on all |
| 49 | +Garden Linux versions that are in maintenance. |
| 50 | + |
| 51 | +As of August 2025, this means containerized GPU drivers for the following combinations of major releases are available: |
| 52 | + |
| 53 | +| Garden Linux | NVIDIA Driver | |
| 54 | +| - | - | |
| 55 | +| [1592](https://github.com/gardenlinux/gardenlinux/issues/2161) | 570, 565, 550 | |
| 56 | +| [1877](https://github.com/gardenlinux/gardenlinux/issues/2358) | 570, 565, 550 | |
| 57 | + |
| 58 | +We automated the support directly in our build pipelines. |
| 59 | + |
| 60 | +### Automating the Build |
| 61 | + |
| 62 | +With guidance from NVIDIA[^thanks], Garden Linux's build and release process was adjusted to automatically publish the ABI-compatible container images required by the GPU Operator. |
| 63 | + |
| 64 | +[^thanks]: Thanks to [Jathavan Sriram](https://www.linkedin.com/in/jathavansriram) from NVIDIA for the productive discussions. |
| 65 | + |
| 66 | +<ApeiroFigure src="/img/blog/2025-08-25-nvidia-gpu-enablement-gpu-operator-publishing.svg" |
| 67 | + caption="Publishing workflow" /> |
| 68 | + |
| 69 | +An automated [workflow](https://github.com/gardenlinux/gardenlinux-nvidia-installer/blob/main/.github/workflows/update_version.yml) immediately creates a pull request for new driver versions. Hence, Garden Linux provides you with the latest GPU driver updates with zero effort! The results are published in Garden Linux's GitHub container registry [`ghcr.io/gardenlinux/gardenlinux-nvidia-installer`](https://github.com/gardenlinux/gardenlinux-nvidia-installer/pkgs/container/gardenlinux-nvidia-installer) with the [release](https://github.com/gardenlinux/gardenlinux-nvidia-installer/blob/main/.github/workflows/release.yml) workflow. |
| 70 | + |
| 71 | +### Under the Hood |
| 72 | +Orchestrating the publishing of the drivers, wrapped in the correct container format needed by the NVIDIA GPU Operator, requires two major steps: |
| 73 | + |
| 74 | +1. The new driver is [compiled](https://github.com/gardenlinux/gardenlinux-nvidia-installer/blob/main/.github/workflows/build_driver.yml) against the specific container-based environment and the exact [Linux Kernel](https://kernel.org/) version used in Garden Linux. |
| 75 | + |
| 76 | +2. After Step 1 completes successfully, the new driver is [packaged](https://github.com/gardenlinux/gardenlinux-nvidia-installer/blob/main/.github/workflows/build_image.yml) as an OCI container in the format the NVIDIA GPU Operator expects, so the operator can easily pick it up at runtime (cf. "nvidia-driver" entry point). |
| 77 | + |
| 78 | +### Example Helm Chart Configuration |
| 79 | + |
| 80 | +The GPU Operator is installed using a [Helm Chart](https://helm.sh/docs/topics/charts/) provided in the NVIDIA Helm repository. Running the NVIDIA GPU Operator on Garden Linux requires a specific set of configuration values in [gpu-operator-values.yaml](https://github.com/gardenlinux/gardenlinux-nvidia-installer/blob/main/helm/gpu-operator-values.yaml). |
| 81 | + |
| 82 | +For sovereign (and air-gapped) environments, you need to maintain your own repository and point the `driver.repository` value of the Helm chart at it. |
| 83 | + |
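For such an environment, the override could look like the following sketch. It assumes the driver images have already been mirrored into an internal registry, that the `nvidia/gpu-operator` chart is reachable from your Helm client, and that a copy of `gpu-operator-values.yaml` is available locally. The registry host `registry.example.internal`, the local file path, and the function name are hypothetical placeholders, not names defined by Garden Linux or NVIDIA.

```shell
set -eu

# Hypothetical internal mirror of ghcr.io/gardenlinux/gardenlinux-nvidia-installer.
MIRROR_REPO="registry.example.internal/gardenlinux/gardenlinux-nvidia-installer"

# Assumed local copy of helm/gpu-operator-values.yaml from the installer repository.
VALUES_FILE="./gpu-operator-values.yaml"

# Installs (or upgrades) the GPU Operator with the Garden Linux values,
# then overrides driver.repository so driver images come from the mirror.
install_gpu_operator_from_mirror() {
  helm upgrade --install gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator --create-namespace \
    --values "$VALUES_FILE" \
    --set "driver.repository=$MIRROR_REPO"
}
```

Calling `install_gpu_operator_from_mirror` runs the installation. Helm applies `--set` after `--values`, so only the repository changes and the rest of the Garden Linux configuration stays as published.
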
| 84 | +## Connecting the Dots |
| 85 | + |
| 86 | +### Prerequisites |
| 87 | + |
| 88 | +The example below assumes you have: |
| 89 | + |
| 90 | +1. Access to a [Gardener Project](https://gardener.cloud/docs/getting-started/project/) with sufficient permissions to create a Kubernetes cluster on your preferred platform. |
| 91 | +2. Sufficient quota and permissions to create worker pools with data center-grade NVIDIA GPUs. |
| 92 | +3. An understanding of how to use Gardener and a command-line terminal. |
| 93 | + |
| 94 | +### Installation Steps |
| 95 | + |
| 96 | +1. Create a Kubernetes cluster. |
| 97 | + |
| 98 | +   You can use any worker nodes with NVIDIA GPUs, including a mix of different node types. |
| 99 | + |
| 100 | +2. Install Helm |
| 101 | + |
| 102 | + Follow the [NVIDIA GPU Driver Getting Started Operator Installation Guide](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#operator-install-guide) to prepare Helm. |
| 103 | + |
| 104 | +   It is important to add the NVIDIA Helm repository before proceeding to the next step. |
| 105 | + |
| 106 | + |
| 107 | +3. Install the NVIDIA GPU Operator |
| 108 | + |
| 109 | + You can further follow the guide from Step 2 or use the example from the [Garden Linux NVIDIA Installer](https://github.com/gardenlinux/gardenlinux-nvidia-installer). It is important to: |
| 110 | + |
| 111 | +   - make sure the `gpu-operator` namespace exists before installation, or pass the Helm flag `--create-namespace` as the command below does. |
| 112 | + |
| 113 | + - use Helm flag `--values` with value `https://raw.githubusercontent.com/gardenlinux/gardenlinux-nvidia-installer/refs/heads/main/helm/gpu-operator-values.yaml` as demonstrated below. |
| 114 | + |
| 115 | + ```bash |
| 116 | + helm upgrade --install -n gpu-operator --create-namespace gpu-operator nvidia/gpu-operator --values \ |
| 117 | + https://raw.githubusercontent.com/gardenlinux/gardenlinux-nvidia-installer/refs/heads/main/helm/gpu-operator-values.yaml |
| 118 | + ``` |
| 119 | + |
| 120 | +   - By default, the values file above selects the latest supported version. If you really need a different one, you can change the `driver.version` property to any version available in the [Garden Linux NVIDIA Driver Package Repository](https://github.com/gardenlinux/gardenlinux-nvidia-installer/pkgs/container/gardenlinux-nvidia-installer). |
| 121 | + |
| 122 | +4. Test GPU availability (optional) |
| 123 | + |
| 124 | +   You can verify that the GPU Operator has worked correctly using a sample job from the NVIDIA [k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin) repository. Deploy the following GPU pod manifest: |
| 125 | + |
| 126 | + ```yaml |
| 127 | + apiVersion: v1 |
| 128 | + kind: Pod |
| 129 | + metadata: |
| 130 | + name: gpu-pod |
| 131 | + spec: |
| 132 | + restartPolicy: Never |
| 133 | + containers: |
| 134 | + - name: cuda-container |
| 135 | + image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0 |
| 136 | + resources: |
| 137 | + limits: |
| 138 | + nvidia.com/gpu: 1 # requesting 1 GPU |
| 139 | + tolerations: |
| 140 | + - key: nvidia.com/gpu |
| 141 | + operator: Exists |
| 142 | + effect: NoSchedule |
| 143 | + ``` |
| 144 | + |
| 145 | +   If everything is working correctly, the container log should include the message `Test PASSED`: |
| 146 | + |
| 147 | + <ApeiroFigure src="/img/blog/2025-08-25-nvidia-gpu-enablement-container-done.png" |
| 148 | + alt="Example of container logs" |
| 149 | + caption="Example of container logs" /> |
| 150 | + |
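Steps 2 to 4 above can be sketched as a small set of shell functions. This is a minimal outline rather than an official script: the repository URL is the one given in NVIDIA's installation guide, the manifest file name `gpu-pod.yaml` and the function names are assumptions made for illustration, and `kubectl wait --for=jsonpath=...` requires a reasonably recent kubectl (1.23 or later).

```shell
set -eu

# Step 2: register the NVIDIA Helm repository and refresh the local index.
add_nvidia_repo() {
  helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
  helm repo update
}

# Step 3: install the GPU Operator with the Garden Linux values file.
VALUES_URL="https://raw.githubusercontent.com/gardenlinux/gardenlinux-nvidia-installer/refs/heads/main/helm/gpu-operator-values.yaml"
install_gpu_operator() {
  helm upgrade --install gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator --create-namespace \
    --values "$VALUES_URL"
}

# Step 4: run the sample pod and look for the success marker in its log.
# Assumes the manifest from step 4 was saved as gpu-pod.yaml (hypothetical name).
verify_gpu() {
  kubectl apply -f gpu-pod.yaml
  kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/gpu-pod --timeout=300s
  kubectl logs gpu-pod | grep "Test PASSED"
}
```

Run `add_nvidia_repo`, `install_gpu_operator`, and `verify_gpu` in that order. The last function exits with a non-zero status if the pod does not finish within five minutes or if its log lacks `Test PASSED`, which makes it usable as a simple smoke test.
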
| 151 | +## Gardener Integration |
| 152 | + |
| 153 | +With the NVIDIA GPU Operator working out of the box, we are planning to offer a complete end-to-end experience as a service: the end user orders a Kubernetes cluster via Gardener with everything preset. We will work with the community and propose a Gardener Enhancement Proposal (GEP), with the goal of presenting the integrated experience as an extension like the one shown below. |
| 154 | + |
| 155 | +```yaml |
| 156 | +kind: Shoot |
| 157 | +... |
| 158 | +spec: |
| 159 | + extensions: |
| 160 | + - type: nvidia-gpu-extension |
| 161 | + providerConfig: |
| 162 | + cdi: |
| 163 | + enabled: true |
| 164 | + default: true |
| 165 | + toolkit: |
| 166 | + installDir: /opt/nvidia |
| 167 | + driver: |
| 168 | + imagePullPolicy: Always |
| 169 | + usePrecompiled: true |
| 170 | + repository: ghcr.io/gardenlinux/gardenlinux-nvidia-installer |
| 171 | +... |
| 172 | +``` |
| 173 | + |
| 174 | +## Demo Video |
| 175 | + |
| 176 | +Watch our 5-minute demo and see how it works end-to-end! |
| 177 | + |
| 178 | +<ApeiroFigure src="/img/blog/2025-08-25-nvidia-gpu-enablement-youtube-cover.png" |
| 179 | + caption="5-minute demo video on YouTube" |
| 180 | + href="https://youtu.be/7_e7mTvQFsU" /> |
| 181 | + |
| 182 | +## Outlook and Support |
| 183 | + |
| 184 | +Our Apeiro community encourages you to share feedback or report any issues you encounter while using the NVIDIA GPU Operator on Garden Linux. Please open an issue in the [gardenlinux-nvidia-installer](https://github.com/gardenlinux/gardenlinux-nvidia-installer/issues) repository. |
| 185 | + |
| 186 | +The team values your contributions and is eager to hear about your experience. |