Description
In some NVIDIA driver versions, e.g., Driver Version 560.35.03, calling nvmlDeviceGetComputeRunningProcesses and nvmlDeviceGetProcessUtilization may return PIDs from different PID namespaces.
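For context, the two calls look roughly like this side by side (a minimal sketch, not the reproduction gist mentioned below; device index, buffer sizes, and error handling are simplified):

```c
/* Sketch: print the PIDs reported by both NVML calls for device 0,
 * so a namespace mismatch becomes visible. Fixed-size buffers and
 * reduced error handling are used only to keep the example short. */
#include <stdio.h>
#include <nvml.h>

int main(void) {
    nvmlDevice_t dev;
    if (nvmlInit_v2() != NVML_SUCCESS) return 1;
    if (nvmlDeviceGetHandleByIndex_v2(0, &dev) != NVML_SUCCESS) return 1;

    /* PIDs of compute processes currently running on the device */
    unsigned int nproc = 64;
    nvmlProcessInfo_t procs[64];
    if (nvmlDeviceGetComputeRunningProcesses(dev, &nproc, procs) == NVML_SUCCESS)
        for (unsigned int i = 0; i < nproc; i++)
            printf("running process pid: %u\n", procs[i].pid);

    /* PIDs of the utilization samples accumulated since timestamp 0 */
    unsigned int nsamp = 64;
    nvmlProcessUtilizationSample_t samp[64];
    nvmlReturn_t r = nvmlDeviceGetProcessUtilization(dev, samp, &nsamp, 0);
    if (r == NVML_SUCCESS)
        for (unsigned int i = 0; i < nsamp; i++)
            printf("utilization sample pid: %u smUtil: %u\n", samp[i].pid, samp[i].smUtil);
    else
        printf("nvmlDeviceGetProcessUtilization failed: %s\n", nvmlErrorString(r));

    nvmlShutdown();
    return 0;
}
```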
I created a gist to reproduce the issue and tested it on a server with the following configuration:
- Driver Version: 560.35.03
- CUDA Version: 12.6
Run a Docker container with:
```
docker run --rm -it --name shouren-hami-core-tester1 \
  --entrypoint /bin/bash \
  --gpus '"device=3"' \
  -v /root/shouren/Project-HAMi/hami-core:/opt/hami-core \
  nvcr.io/nvidia/pytorch:24.06-py3
```

and get the output of the test:

```
root@23713ecf3450:/opt/hami-core/tmp# ./pid_check
main: pid: 323
main: compute thread started successfully
Found 1 NVIDIA GPU device(s).
Device 0: NVIDIA GeForce RTX 2080 Ti, GPU-ea5a7f29-c714-06ba-23ef-cedbc44248da
Num of running process : 1
process id: 324 mem: 0
Failed to get processes utils by device with Not Found
Found 1 NVIDIA GPU device(s).
Device 0: NVIDIA GeForce RTX 2080 Ti, GPU-ea5a7f29-c714-06ba-23ef-cedbc44248da
Num of running process : 1
process id: 324 mem: 0
Num of process sampled: 1
process id: 645400 smUtil: 13
Found 1 NVIDIA GPU device(s).
...
```

As it shows, the PID returned by nvmlDeviceGetComputeRunningProcesses is in the container's PID namespace, while the PID returned by nvmlDeviceGetProcessUtilization is from the host PID namespace.
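On the host, the mapping can be confirmed through the NSpid field of /proc/&lt;pid&gt;/status, whose last entry is the PID in the innermost namespace (a sketch run on the host, assuming the host PID 645400 from the log above):

```c
/* Sketch: print the NSpid line of /proc/<pid>/status for a host PID.
 * For a containerized process it reads like "NSpid: 645400 324",
 * i.e. host PID first, container PID last. */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv) {
    const char *pid = (argc > 1) ? argv[1] : "645400"; /* host PID from the log above */
    char path[64], line[256];
    snprintf(path, sizeof(path), "/proc/%s/status", pid);
    FILE *f = fopen(path, "r");
    if (!f) { perror("fopen"); return 1; }
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "NSpid:", 6) == 0) {
            fputs(line, stdout);
            break;
        }
    }
    fclose(f);
    return 0;
}
```

In my case this would confirm that 645400 on the host and 324 in the container are the same process.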
I think this may be the root cause of issues like #104 and some other issues in the HAMi repo.
Solution
I would like to know whether we should handle this inconsistent behavior.
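If we do want to handle it, one possible direction (only an assumption on my part, not an agreed design) is to detect PIDs that are not visible in the caller's namespace and treat them specially:

```c
/* Sketch of a detection heuristic (hypothetical helper, not existing
 * HAMi-core code): a PID returned by NVML that has no /proc entry in the
 * caller's namespace is very likely from another PID namespace (e.g. the
 * host's), so the caller could flag or skip it instead of mis-attributing it. */
#include <stdio.h>
#include <sys/stat.h>

/* Returns 1 if `pid` exists in the current PID namespace, 0 otherwise. */
static int pid_visible_here(unsigned int pid) {
    char path[64];
    struct stat st;
    snprintf(path, sizeof(path), "/proc/%u", pid);
    return stat(path, &st) == 0;
}

int main(void) {
    unsigned int pid = 645400; /* host PID reported by nvmlDeviceGetProcessUtilization above */
    printf("pid %u visible in this namespace: %s\n",
           pid, pid_visible_here(pid) ? "yes" : "no");
    return 0;
}
```

This is only a heuristic (a host PID could coincide with an unrelated PID inside the container), so some other matching criterion might still be needed; I would appreciate the maintainers' opinions.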