Description
Summary
On NodeSet esc8000a we set both Kubernetes GPU limits (nvidia.com/gpu: 1) and Slurm GRES (gpu:a40:1). However, when running an srun on esc8000a-*, the container sees all 8 A40 GPUs via nvidia-smi. Slurm also reports Gres=gpu:a40:1 for those nodes, so Slurm’s view (1 GPU) and the container’s view (8 GPUs) are inconsistent.
We’d like guidance on the expected behavior and how to ensure that jobs (and/or the slurmd container) only see the number of GPUs granted by K8s + GRES.
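For reference, the mismatch is reproducible with the following two commands (node name taken from our cluster; a minimal sketch, output excerpted under Diagnostics below):

# Slurm's view of the node (reports gpu:a40:1):
scontrol show node esc8000a-0 | grep -i gres

# Device view from inside a job step on the same node (lists all 8 A40s):
srun --nodelist=esc8000a-0 nvidia-smi -L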
Expected behavior
- A container scheduled with resources.limits.nvidia.com/gpu: 1 on NodeSet esc8000a should expose only one GPU to user processes (e.g., nvidia-smi lists a single device).
- srun on esc8000a-* should inherit that isolation (a quick check is sketched below).
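One way to separate what Slurm thinks it allocated from what the container exposes (assuming Slurm exports the usual GRES environment variables when a GPU is granted; a sketch):

# Compare Slurm's step-level GPU binding with the device count the runtime exposes
srun --nodelist=esc8000a-0 --gres=gpu:a40:1 bash -c \
  'echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"; nvidia-smi -L | wc -l'

If CUDA_VISIBLE_DEVICES comes back unset while nvidia-smi counts 8 devices, that would suggest Slurm's GRES binding is not being applied to the step at all, pointing at gres.conf/cgroup configuration rather than (or in addition to) the K8s limit.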
Actual behavior
- nvidia-smi inside the job shows 8 GPUs on esc8000a-*.
- Slurm node state shows Gres=gpu:a40:1 for esc8000a-[0-1] (so Slurm thinks 1 GPU), but the container can access all 8.
Config (values excerpt)
# Base: values-cluster-v0.3.1.yaml
compute:
  nodesets:
    - name: esc8000a
      enabled: true
      replicas: 2
      nodeSelector:
        kubernetes.io/os: linux
        nvidia.com/gpu.present: "true"
        nvidia.com/gpu.machine: ESC8000-E11
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
      resources:
        limits:
          cpu: 12
          memory: 48Gi
          nvidia.com/gpu: 1
        requests:
          cpu: 8
          memory: 32Gi
          nvidia.com/gpu: 1
      useResourceLimits: true
      extraVolumeMounts:
        - name: cache
          mountPath: /dev/shm
        - name: demosite-storage
          mountPath: /mnt/data
      extraVolumes:
        - name: cache
          emptyDir:
            medium: Memory
            sizeLimit: 12Gi
        - name: demosite-storage
          persistentVolumeClaim:
            claimName: slurm-workspace-pvc
      partition:
        enabled: false
        config:
          State: UP
          MaxTime: UNLIMITED
      nodeConfig:
        Features: ["a40"]
        Gres: ["gpu:a40:1"]
        CPUs: 12
        RealMemory: 245760
Diagnostics
sinfo:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
tn* up infinite 3 idle esc8000a-[0-1],esc8000b-0
scontrol show nodes (esc8000a):
NodeName=esc8000a-0 ...
Gres=gpu:a40:1(S:0,2,4,6,8,10)
Partitions=tn
CfgTRES=cpu=11,mem=240G,billing=11,gres/gpu=1
...
NodeName=esc8000a-1 ...
Gres=gpu:a40:1(S:0,2,4,6,8,10)
Partitions=tn
CfgTRES=cpu=11,mem=240G,billing=11,gres/gpu=1
scontrol show partitions (excerpt):
PartitionName=tn
NodeSets=esc8000a,esc8000b
Nodes=esc8000a-[0-1],esc8000b-0
TRES=cpu=102,mem=720G,node=3,billing=102,gres/gpu=10
nvidia-smi from srun --nodelist=esc8000a-0 bash -c "hostname; nvidia-smi" (excerpt):
esc8000a-0
Driver Version: 550.54.14, CUDA 12.4
... 8x NVIDIA A40 listed ...
Environment
- Chart/base values: values-cluster-v0.3.1.yaml
- Slurm version (from nodes): 25.05.2
- GPU: NVIDIA A40 (8 per host on esc8000a)
- NVIDIA driver: 550.54.14
- CUDA shown: 12.4
- Kernel (esc8000a): 5.15.0-142-generic
- Partition: tn
- NodeSets: esc8000a (expected 1 GPU per pod), esc8000b (has 8 GPUs)
Questions / help requested
- Should the slurmd container (and job steps launched via srun) only see the GPUs granted by resources.limits.nvidia.com/gpu on the NodeSet?
- Does the chart expect/use a runtimeClassName: nvidia or similar to ensure the NVIDIA device plugin injects NVIDIA_VISIBLE_DEVICES for the slurmd and job containers? (We can run the checks sketched below this list on our side.)
- Is there a known gap where the slurmd container runs with broader /dev/nvidia* access than the requested limit, causing jobs launched inside it to see all devices?
- Any guidance on aligning K8s GPU limits ↔ Slurm GRES so that container device visibility matches Slurm allocation?
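If it helps, this is roughly how we would check the runtime class and NVIDIA injection on the compute pod (the exact pod instance name and the slurmd container name below are guesses on our side):

# Runtime class and GPU limits actually applied to the pod
kubectl -n slurm get pod slurm-compute-esc8000a-0 -o jsonpath='{.spec.runtimeClassName}{"\n"}'
kubectl -n slurm get pod slurm-compute-esc8000a-0 -o jsonpath='{.spec.containers[*].resources.limits}{"\n"}'

# Environment and device nodes visible inside the slurmd container
kubectl -n slurm exec slurm-compute-esc8000a-0 -c slurmd -- sh -c 'env | grep -i nvidia; ls -l /dev/nvidia*'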
What we can provide to assist (tell us which would help)
- kubectl -n slurm get pod <slurm-compute-esc8000a-*> -o yaml (resources, runtimeClass, env)
- Env from inside the slurmd container: env | grep NVIDIA, ls -l /dev/nvidia*
- Slurm config fragments resolved at runtime: slurm.conf, gres.conf as rendered in the pod
- Versions of NVIDIA GPU Operator / device plugin / container toolkit in the cluster (a collection sketch follows this list)
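We could bundle most of the above into something like the following (the slurmd container name, the config path inside the pod, and the GPU Operator namespace are assumptions on our part):

# Pod spec (resources, runtimeClass, env) for one compute pod
POD=$(kubectl -n slurm get pods -o name | grep esc8000a-0)
kubectl -n slurm get "$POD" -o yaml > pod-esc8000a-0.yaml

# NVIDIA env, device nodes, and rendered Slurm config from inside the slurmd container
kubectl -n slurm exec "$POD" -c slurmd -- sh -c \
  'env | grep NVIDIA; ls -l /dev/nvidia*; cat /etc/slurm/slurm.conf /etc/slurm/gres.conf 2>/dev/null' \
  > slurmd-esc8000a-0.txt

# Image tags for the GPU Operator / device plugin / container toolkit (namespace is a guess)
kubectl -n gpu-operator get daemonsets -o wide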
Hypotheses (for triage)
- runtimeClassName (or NVIDIA injection) not applied to slurmd / job containers → container sees all host GPUs.
- The chart’s useResourceLimits: true may not propagate nvidia.com/gpu in a way that the NVIDIA device plugin constrains device visibility.
- A mismatch between GRES and the K8s device plugin (e.g., job steps executed inside the slurmd container rather than a per-job pod with GPU limits); if so, the Slurm-side device constraint sketched below may be the relevant knob.
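In case it helps the triage: on bare-metal Slurm we would normally rely on cgroup device constraints to keep job steps inside their GRES allocation. We don't know whether or how the chart exposes the equivalent, but this is the kind of configuration we mean (a sketch, not taken from the chart's rendered config):

# cgroup.conf — confine each job step to the devices Slurm allocated to it
ConstrainDevices=yes

# gres.conf — let slurmd enumerate the GPUs it manages (or list File=/dev/nvidia0 explicitly)
AutoDetect=nvml

Even with this in place, the slurmd container itself would presumably still see whatever the K8s device plugin exposes to it, which is why the runtimeClassName / NVIDIA_VISIBLE_DEVICES question above matters as well.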
Ask
Please confirm the intended behavior and suggest the correct chart values or configuration to ensure that only 1 GPU is exposed on esc8000a-* when we set nvidia.com/gpu: 1 and Gres: ["gpu:a40:1"]. If there’s a recommended example values.yaml for 1-GPU NodeSets, that would be greatly appreciated.