Description
Summary
On NodeSet esc8000a we set both Kubernetes GPU limits (nvidia.com/gpu: 1) and Slurm GRES (gpu:a40:1). However, when running an srun on esc8000a-*, the container sees all 8 A40 GPUs via nvidia-smi. Slurm also reports Gres=gpu:a40:1 for those nodes, so Slurm’s view (1 GPU) and the container’s view (8 GPUs) are inconsistent.
We’d like guidance on the expected behavior and how to ensure that jobs (and/or the slurmd container) only see the number of GPUs granted by K8s + GRES.
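For reference, the mismatch is reproducible with the following two commands (node name taken from our cluster; a minimal sketch, output excerpted under Diagnostics below):

# Slurm's view of the node (reports gpu:a40:1):
scontrol show node esc8000a-0 | grep -i gres

# Device view from inside a job step on the same node (lists all 8 A40s):
srun --nodelist=esc8000a-0 nvidia-smi -L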
Expected behavior
- A container scheduled with resources.limits.nvidia.com/gpu: 1 on NodeSet esc8000a should expose only one GPU to user processes (e.g., nvidia-smi lists a single device).
- srun on esc8000a-* should inherit that isolation (a quick check is sketched below).
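One way to separate what Slurm thinks it allocated from what the container exposes (assuming Slurm exports the usual GRES environment variables when a GPU is granted; a sketch):

# Compare Slurm's step-level GPU binding with the device count the runtime exposes
srun --nodelist=esc8000a-0 --gres=gpu:a40:1 bash -c \
  'echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"; nvidia-smi -L | wc -l'

If CUDA_VISIBLE_DEVICES comes back unset while nvidia-smi counts 8 devices, that would suggest Slurm's GRES binding is not being applied to the step at all, pointing at gres.conf/cgroup configuration rather than (or in addition to) the K8s limit.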
Actual behavior
- nvidia-smi inside the job shows 8 GPUs on esc8000a-*.
- Slurm node state shows Gres=gpu:a40:1 for esc8000a-[0-1] (so Slurm thinks 1 GPU), but the container can access all 8.
Config (values excerpt)
# Base: values-cluster-v0.3.1.yaml
compute:
  nodesets:
    - name: esc8000a
      enabled: true
      replicas: 2
      nodeSelector:
        kubernetes.io/os: linux
        nvidia.com/gpu.present: "true"
        nvidia.com/gpu.machine: ESC8000-E11
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
      resources:
        limits:
          cpu: 12
          memory: 48Gi
          nvidia.com/gpu: 1
        requests:
          cpu: 8
          memory: 32Gi
          nvidia.com/gpu: 1
      useResourceLimits: true
      extraVolumeMounts:
        - name: cache
          mountPath: /dev/shm
        - name: demosite-storage
          mountPath: /mnt/data
      extraVolumes:
        - name: cache
          emptyDir:
            medium: Memory
            sizeLimit: 12Gi
        - name: demosite-storage
          persistentVolumeClaim:
            claimName: slurm-workspace-pvc
      partition:
        enabled: false
        config:
          State: UP
          MaxTime: UNLIMITED
      nodeConfig:
        Features: ["a40"]
        Gres: ["gpu:a40:1"]
        CPUs: 12
        RealMemory: 245760
Diagnostics
sinfo:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
tn* up infinite 3 idle esc8000a-[0-1],esc8000b-0
scontrol show nodes (esc8000a):
NodeName=esc8000a-0 ...
Gres=gpu:a40:1(S:0,2,4,6,8,10)
Partitions=tn
CfgTRES=cpu=11,mem=240G,billing=11,gres/gpu=1
...
NodeName=esc8000a-1 ...
Gres=gpu:a40:1(S:0,2,4,6,8,10)
Partitions=tn
CfgTRES=cpu=11,mem=240G,billing=11,gres/gpu=1
scontrol show partitions (excerpt):
PartitionName=tn
NodeSets=esc8000a,esc8000b
Nodes=esc8000a-[0-1],esc8000b-0
TRES=cpu=102,mem=720G,node=3,billing=102,gres/gpu=10
nvidia-smi from srun --nodelist=esc8000a-0 bash -c "hostname; nvidia-smi" (excerpt):
esc8000a-0
Driver Version: 550.54.14, CUDA 12.4
... 8x NVIDIA A40 listed ...
Environment
- Chart/base values: values-cluster-v0.3.1.yaml
- Slurm version (from nodes): 25.05.2
- GPU: NVIDIA A40 (8 per host on esc8000a)
- NVIDIA driver: 550.54.14
- CUDA shown: 12.4
- Kernel (esc8000a): 5.15.0-142-generic
- Partition: tn
- NodeSets: esc8000a (expected 1 GPU per pod), esc8000b (has 8 GPUs)
Questions / help requested
- Should the slurmd container (and job steps launched via srun) only see the GPUs granted by resources.limits.nvidia.com/gpu on the NodeSet?
- Does the chart expect/use a runtimeClassName: nvidia or similar to ensure the NVIDIA device plugin injects NVIDIA_VISIBLE_DEVICES for the slurmd and job containers? (We can run the checks sketched below this list on our side.)
- Is there a known gap where the slurmd container runs with broader /dev/nvidia* access than the requested limit, causing jobs launched inside it to see all devices?
- Any guidance on aligning K8s GPU limits ↔ Slurm GRES so that container device visibility matches Slurm allocation?
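If it helps, this is roughly how we would check the runtime class and NVIDIA injection on the compute pod (the exact pod instance name and the slurmd container name below are guesses on our side):

# Runtime class and GPU limits actually applied to the pod
kubectl -n slurm get pod slurm-compute-esc8000a-0 -o jsonpath='{.spec.runtimeClassName}{"\n"}'
kubectl -n slurm get pod slurm-compute-esc8000a-0 -o jsonpath='{.spec.containers[*].resources.limits}{"\n"}'

# Environment and device nodes visible inside the slurmd container
kubectl -n slurm exec slurm-compute-esc8000a-0 -c slurmd -- sh -c 'env | grep -i nvidia; ls -l /dev/nvidia*'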
What we can provide to assist (tell us which would help)
- kubectl -n slurm get pod <slurm-compute-esc8000a-*> -o yaml (resources, runtimeClass, env)
- Env from inside the slurmd container: env | grep NVIDIA, ls -l /dev/nvidia*
- Slurm config fragments resolved at runtime: slurm.conf, gres.conf as rendered in the pod
- Versions of NVIDIA GPU Operator / device plugin / container toolkit in the cluster (a collection sketch follows this list)
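We could bundle most of the above into something like the following (the slurmd container name, the config path inside the pod, and the GPU Operator namespace are assumptions on our part):

# Pod spec (resources, runtimeClass, env) for one compute pod
POD=$(kubectl -n slurm get pods -o name | grep esc8000a-0)
kubectl -n slurm get "$POD" -o yaml > pod-esc8000a-0.yaml

# NVIDIA env, device nodes, and rendered Slurm config from inside the slurmd container
kubectl -n slurm exec "$POD" -c slurmd -- sh -c \
  'env | grep NVIDIA; ls -l /dev/nvidia*; cat /etc/slurm/slurm.conf /etc/slurm/gres.conf 2>/dev/null' \
  > slurmd-esc8000a-0.txt

# Image tags for the GPU Operator / device plugin / container toolkit (namespace is a guess)
kubectl -n gpu-operator get daemonsets -o wide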
Hypotheses (for triage)
- runtimeClassName (or NVIDIA injection) not applied to slurmd / job containers → container sees all host GPUs.
- The chart’s useResourceLimits: true may not propagate nvidia.com/gpu in a way that the NVIDIA device plugin constrains device visibility.
- A mismatch between GRES and the K8s device plugin (e.g., job steps executed inside the slurmd container rather than a per-job pod with GPU limits); if so, the Slurm-side device constraint sketched below may be the relevant knob.
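In case it helps the triage: on bare-metal Slurm we would normally rely on cgroup device constraints to keep job steps inside their GRES allocation. We don't know whether or how the chart exposes the equivalent, but this is the kind of configuration we mean (a sketch, not taken from the chart's rendered config):

# cgroup.conf — confine each job step to the devices Slurm allocated to it
ConstrainDevices=yes

# gres.conf — let slurmd enumerate the GPUs it manages (or list File=/dev/nvidia0 explicitly)
AutoDetect=nvml

Even with this in place, the slurmd container itself would presumably still see whatever the K8s device plugin exposes to it, which is why the runtimeClassName / NVIDIA_VISIBLE_DEVICES question above matters as well.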
Ask
Please confirm the intended behavior and suggest the correct chart values or configuration to ensure that only 1 GPU is exposed on esc8000a-* when we set nvidia.com/gpu: 1 and Gres: ["gpu:a40:1"]. If there’s a recommended example values.yaml for 1-GPU NodeSets, that would be greatly appreciated.