Describe the bug
Our "Build Docker Images" GHA workflow includes logic to select different base images depending on the target architecture. CPU builds use a standard Ubuntu base image, while GPU builds use a CUDA-enabled base image.
See these relevant lines in the workflow:
nebari-docker-images/.github/workflows/build-push-docker.yaml, lines 26 to 27 in 0e120d5:

```yaml
GPU_BASE_IMAGE: nvidia/cuda:12.8.1-base-ubuntu24.04
GPU_IMAGE_SUFFIX: gpu
```
nebari-docker-images/.github/workflows/build-push-docker.yaml, lines 84 to 89 in 0e120d5:

```yaml
- name: "Set BASE_IMAGE and Image Suffix 📷"
  if: ${{ matrix.platform == 'gpu' }}
  run: |
    echo "GPU Platform Matrix"
    echo "BASE_IMAGE=$GPU_BASE_IMAGE" >> $GITHUB_ENV
    echo "IMAGE_SUFFIX=-$GPU_IMAGE_SUFFIX" >> $GITHUB_ENV
```

and the build step that forwards the value:

```yaml
build-args: BASE_IMAGE=${{ env.BASE_IMAGE }}
```
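For reference, `GITHUB_ENV` is just a plain file of `KEY=VALUE` lines that the runner loads into the environment of subsequent steps, so the step above can be simulated locally. This is only a sketch; a temp file stands in for the runner-provided path:

```shell
# Simulate the workflow step: GITHUB_ENV is a plain KEY=VALUE file that
# GitHub Actions sources into the environment of later steps.
GITHUB_ENV=$(mktemp)
GPU_BASE_IMAGE="nvidia/cuda:12.8.1-base-ubuntu24.04"
GPU_IMAGE_SUFFIX="gpu"
echo "BASE_IMAGE=$GPU_BASE_IMAGE" >> "$GITHUB_ENV"
echo "IMAGE_SUFFIX=-$GPU_IMAGE_SUFFIX" >> "$GITHUB_ENV"
cat "$GITHUB_ENV"
```

So by the time the build step runs, `env.BASE_IMAGE` should resolve to the CUDA image for GPU matrix entries; the bug below is that the Dockerfile no longer consumes it.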
It used to be the case that our Dockerfiles accepted an `ARG` for the base image. For example:

```dockerfile
ARG BASE_IMAGE=ubuntu:20.04
```
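The mechanism here is standard Docker build-arg behavior: the `ARG` default is used when no `--build-arg` is passed, and the workflow's `build-args: BASE_IMAGE=...` overrides it. A minimal sketch (the file name `Dockerfile.example` and the `ubuntu:24.04` default are just for illustration):

```shell
# Write a minimal Dockerfile whose base image is controlled by a build arg.
cat > Dockerfile.example <<'EOF'
ARG BASE_IMAGE=ubuntu:24.04
FROM ${BASE_IMAGE}
EOF

# CPU build (falls back to the ARG default):
#   docker build -f Dockerfile.example .
# GPU build (overrides the default, which is what the workflow's
# build-args line is meant to do):
#   docker build --build-arg BASE_IMAGE=nvidia/cuda:12.8.1-base-ubuntu24.04 \
#     -f Dockerfile.example .
cat Dockerfile.example
```

Without the `ARG`/`FROM ${BASE_IMAGE}` pair in the Dockerfile, the `--build-arg` passed by the workflow is silently ignored.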
However, #211 appears to have changed that behavior unintentionally. I don’t think we noticed the issue until now, while reviewing #229. Looking at some of our recent image builds, I can confirm the CPU and GPU images are indeed using the same base image.
Taken from the jupyterlab-cpu build GHA logs for the 2025.10.1rc1 tag:

```
#7 [linux/amd64 builder 1/4] FROM docker.io/library/ubuntu:24.04@sha256:66460d557b25769b102175144d538d88219c077c678a49af4afca6fbfc1b5252
```
Taken from the jupyterlab-gpu build GHA logs for the 2025.10.1rc1 tag:

```
#9 [linux/amd64 builder 1/4] FROM docker.io/library/ubuntu:24.04@sha256:66460d557b25769b102175144d538d88219c077c678a49af4afca6fbfc1b5252
```
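Both log lines resolve the base image to the same pinned digest; a quick check (with the references copied from the logs above) makes the overlap explicit:

```shell
# Base image references copied from the CPU and GPU build logs above.
cpu_base="ubuntu:24.04@sha256:66460d557b25769b102175144d538d88219c077c678a49af4afca6fbfc1b5252"
gpu_base="ubuntu:24.04@sha256:66460d557b25769b102175144d538d88219c077c678a49af4afca6fbfc1b5252"
if [ "$cpu_base" = "$gpu_base" ]; then
  echo "CPU and GPU builds share the same base image"
fi
```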
Looking at the image digests, they are exactly the same, which is not the expected behavior. We should fix this to make sure GPU images are built on top of the CUDA base image. I know @dcmcand has some thoughts on how to implement this without relying on an `ARG` in our Dockerfiles.

We've been building images like this for some months now and haven't noticed any issues when running GPU workloads on Nebari. That makes me wonder why things still work as before if we're no longer relying on the CUDA images. I guess it might have to do with the fact that we're still relying on the NVIDIA device plugin DaemonSet.
Expected behavior
GPU images should build on top of a CUDA base image.
How to Reproduce the problem?
Take a look at the "Build Docker Images" GHA workflow
Command output
No response
Versions and dependencies used.
No response
Anything else?
No response