
[v0.3.1] GPU isolation ignored on NodeSet esc8000a: container sees 8 GPUs even with nvidia.com/gpu: 1 and GRES=gpu:a40:1 #45

@tsungjung411

Description

Summary

On NodeSet esc8000a we set both the Kubernetes GPU limit (nvidia.com/gpu: 1) and the Slurm GRES (gpu:a40:1). However, when running a job via srun on esc8000a-*, the container sees all 8 A40 GPUs in nvidia-smi. Slurm itself reports Gres=gpu:a40:1 for those nodes, so Slurm's view (1 GPU) and the container's view (8 GPUs) are inconsistent.

We’d like guidance on the expected behavior and how to ensure that jobs (and/or the slurmd container) only see the number of GPUs granted by K8s + GRES.


Expected behavior

  • A container scheduled with resources.limits.nvidia.com/gpu: 1 on NodeSet esc8000a should expose only one GPU to user processes (e.g., nvidia-smi lists a single device).
  • srun on esc8000a-* should inherit that isolation.

Actual behavior

  • nvidia-smi inside the job shows 8 GPUs on esc8000a-*.
  • Slurm node state shows Gres=gpu:a40:1 for esc8000a-[0-1] (so Slurm thinks 1 GPU), but the container can access all 8.

Config (values excerpt)

# Base: values-cluster-v0.3.1.yaml

compute:
  nodesets:
    - name: esc8000a
      enabled: true
      replicas: 2
      nodeSelector:
        kubernetes.io/os: linux
        nvidia.com/gpu.present: "true"
        nvidia.com/gpu.machine: ESC8000-E11
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
      resources:
        limits:
          cpu: 12
          memory: 48Gi
          nvidia.com/gpu: 1
        requests:
          cpu: 8
          memory: 32Gi
          nvidia.com/gpu: 1
      useResourceLimits: true
      extraVolumeMounts:
        - name: cache
          mountPath: /dev/shm
        - name: demosite-storage
          mountPath: /mnt/data
      extraVolumes:
        - name: cache
          emptyDir:
            medium: Memory
            sizeLimit: 12Gi
        - name: demosite-storage
          persistentVolumeClaim:
            claimName: slurm-workspace-pvc
      partition:
        enabled: false
        config:
          State: UP
          MaxTime: UNLIMITED
      nodeConfig:
        Features: ["a40"]
        Gres: ["gpu:a40:1"]
        CPUs: 12
        RealMemory: 245760
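
To rule out the GPU limit simply not being rendered onto the slurmd container, we can also template the chart locally and grep the generated manifests; the release name and chart reference below are placeholders for our actual invocation:

# Render the NodeSet manifests and look for the GPU limit on the slurmd container
helm template slurm <chart-ref> -f values-cluster-v0.3.1.yaml \
  | grep -n -B6 -A2 'nvidia.com/gpu'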

Diagnostics

sinfo:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
tn*          up   infinite      3   idle esc8000a-[0-1],esc8000b-0

scontrol show nodes (esc8000a):

NodeName=esc8000a-0 ... 
   Gres=gpu:a40:1(S:0,2,4,6,8,10)
   Partitions=tn
   CfgTRES=cpu=11,mem=240G,billing=11,gres/gpu=1
...
NodeName=esc8000a-1 ...
   Gres=gpu:a40:1(S:0,2,4,6,8,10)
   Partitions=tn
   CfgTRES=cpu=11,mem=240G,billing=11,gres/gpu=1

scontrol show partitions (excerpt):

PartitionName=tn
   NodeSets=esc8000a,esc8000b
   Nodes=esc8000a-[0-1],esc8000b-0
   TRES=cpu=102,mem=720G,node=3,billing=102,gres/gpu=10

Output of srun --nodelist=esc8000a-0 bash -c "hostname; nvidia-smi" (excerpt):

esc8000a-0
Driver Version: 550.54.14, CUDA 12.4
... 8x NVIDIA A40 listed ...
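
We have not yet captured the environment seen by the job step itself; if it helps triage, we would run something along these lines (the grep pattern is just a convenience):

srun --nodelist=esc8000a-0 bash -c 'hostname; printenv | grep -iE "nvidia|cuda"; ls -l /dev/nvidia*'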

Environment

  • Chart/base values: values-cluster-v0.3.1.yaml
  • Slurm version (from nodes): 25.05.2
  • GPU: NVIDIA A40 (8 per host on esc8000a)
  • NVIDIA driver: 550.54.14
  • CUDA shown: 12.4
  • Kernel (esc8000a): 5.15.0-142-generic
  • Partition: tn
  • NodeSets: esc8000a (expected 1 GPU per pod), esc8000b (has 8 GPUs)

Questions / help requested

  1. Should the slurmd container (and job steps launched via srun) only see the GPUs granted by resources.limits.nvidia.com/gpu on the NodeSet?
  2. Does the chart expect/use a runtimeClassName: nvidia or similar to ensure the NVIDIA device plugin injects NVIDIA_VISIBLE_DEVICES for the slurmd and job containers?
  3. Is there a known gap where the slurmd container runs with broader /dev/nvidia access than the requested limit, causing jobs launched inside it to see all devices?
  4. Any guidance on aligning K8s GPU limits ↔ Slurm GRES so that container device visibility matches Slurm allocation? (A bare-metal reference sketch follows this list for comparison.)
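
For comparison (questions 3 and 4), the bare-metal Slurm mechanism we are used to combines cgroup device confinement with an explicit GPU-to-device-file mapping; a minimal sketch of that reference setup is below. We don't know whether or how the chart renders these inside the slurmd container, or whether cgroup device constraints are effective there at all.

# cgroup.conf (bare-metal reference: confine job steps to their allocated devices)
ConstrainDevices=yes

# gres.conf (bare-metal reference: either autodetect GPUs via NVML...)
AutoDetect=nvml
# ...or map the GRES entry to an explicit device file, e.g.:
# NodeName=esc8000a-[0-1] Name=gpu Type=a40 File=/dev/nvidia0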

What we can provide to assist (tell us which would help; the exact commands we'd run are sketched after this list)

  • kubectl -n slurm get pod <slurm-compute-esc8000a-*> -o yaml (resources, runtimeClass, env)
  • Env from inside slurmd container: env | grep NVIDIA, ls -l /dev/nvidia*
  • Slurm config fragments resolved at runtime: slurm.conf, gres.conf as rendered in the pod
  • Versions of NVIDIA GPU Operator / device plugin / container toolkit in the cluster
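
The commands we would run to collect the above (pod and container names, config paths, and the grep pattern for operator components are our guesses):

# Pod spec: resources, runtimeClassName, env
kubectl -n slurm get pod <slurm-compute-esc8000a-0-pod> -o yaml

# NVIDIA-related environment and device nodes inside the slurmd container
kubectl -n slurm exec <slurm-compute-esc8000a-0-pod> -c slurmd -- sh -c 'printenv | grep -i nvidia; ls -l /dev/nvidia*'

# Slurm config as rendered in the pod (paths assumed)
kubectl -n slurm exec <slurm-compute-esc8000a-0-pod> -c slurmd -- sh -c 'cat /etc/slurm/slurm.conf /etc/slurm/gres.conf'

# GPU Operator / device plugin / container toolkit components and versions
kubectl get pods -A -o wide | grep -iE 'gpu-operator|nvidia'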

Hypotheses (for triage)

  • runtimeClassName (or NVIDIA runtime injection) not applied to the slurmd / job containers → container sees all host GPUs (see the pod-spec sketch after this list).
  • The chart’s useResourceLimits: true may not propagate nvidia.com/gpu in a way that the NVIDIA device plugin constrains device visibility.
  • A structural mismatch between GRES and the K8s device plugin: if job steps execute inside the long-lived slurmd container rather than in a per-job pod with its own GPU limit, the K8s limit never constrains what an individual job can see.
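
To make the first hypothesis concrete, this is roughly what we would expect to find on the slurmd container in the rendered pod spec if the NVIDIA device-plugin path were in effect. The field values are our assumptions, not what the chart actually renders, and NVIDIA_VISIBLE_DEVICES itself is injected at container start, so it would only show up via env inside the container, not in the spec:

# Hypothetical slurmd pod spec fragment (not taken from our cluster)
spec:
  runtimeClassName: nvidia          # or a default runtime already configured for NVIDIA
  containers:
    - name: slurmd
      resources:
        limits:
          nvidia.com/gpu: 1         # should translate into NVIDIA_VISIBLE_DEVICES listing one GPU
      # If the container instead runs privileged or bind-mounts /dev from the host
      # (we have not verified either), all 8 /dev/nvidia* nodes would be visible
      # regardless of this limit.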

Ask
Please confirm the intended behavior and suggest the correct chart values or configuration to ensure that only 1 GPU is exposed on esc8000a-* when we set nvidia.com/gpu: 1 and Gres: ["gpu:a40:1"]. If there’s a recommended example values.yaml for 1-GPU NodeSets, that would be greatly appreciated.
