
Bug(multiprocess): handle inconsistent PID namespace with NVML API #112

@Shouren

Description

In some versions of the NVIDIA driver, e.g. Driver Version 560.35.03, calling nvmlDeviceGetComputeRunningProcesses and nvmlDeviceGetProcessUtilization may return PIDs from different PID namespaces.

I created a gist to reproduce the issue and tested it on a server with the following configuration:

  • Driver Version: 560.35.03
  • CUDA Version: 12.6

Run a Docker container with:

docker run --rm -it --name shouren-hami-core-tester1 \
    --entrypoint /bin/bash \
    --gpus '"device=3"' \
    -v /root/shouren/Project-HAMi/hami-core:/opt/hami-core \
    nvcr.io/nvidia/pytorch:24.06-py3

The output of the test is:

root@23713ecf3450:/opt/hami-core/tmp# ./pid_check
main: pid: 323
main: compute thread started successfully
Found 1 NVIDIA GPU device(s).

Device 0: NVIDIA GeForce RTX 2080 Ti, GPU-ea5a7f29-c714-06ba-23ef-cedbc44248da
Num of running process : 1
    process id: 324	mem: 0
Failed to get processes utils by device with Not Found
Found 1 NVIDIA GPU device(s).

Device 0: NVIDIA GeForce RTX 2080 Ti, GPU-ea5a7f29-c714-06ba-23ef-cedbc44248da
Num of running process : 1
    process id: 324	mem: 0
Num of process sampled: 1
    process id: 645400	smUtil: 13
Found 1 NVIDIA GPU device(s).
...

As the output shows, the PID returned by nvmlDeviceGetComputeRunningProcesses is in the container's PID namespace, whereas the PID returned by nvmlDeviceGetProcessUtilization is from the host PID namespace.
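
For reference, below is a minimal sketch of the kind of check the gist performs (an approximation, not the gist itself; the device index and buffer sizes are placeholders). It prints the PIDs reported by both NVML calls side by side, which is where the namespace mismatch shows up.

/* pid_check sketch: compare PIDs reported by the two NVML calls.
 * Build with something like: gcc pid_check.c -o pid_check -lnvidia-ml */
#include <stdio.h>
#include <nvml.h>

int main(void) {
    nvmlReturn_t ret = nvmlInit();
    if (ret != NVML_SUCCESS) {
        fprintf(stderr, "nvmlInit failed: %s\n", nvmlErrorString(ret));
        return 1;
    }

    nvmlDevice_t dev;
    ret = nvmlDeviceGetHandleByIndex(0, &dev);
    if (ret != NVML_SUCCESS) {
        fprintf(stderr, "nvmlDeviceGetHandleByIndex failed: %s\n", nvmlErrorString(ret));
        nvmlShutdown();
        return 1;
    }

    /* PIDs from the running-process list (container namespace in the run above) */
    unsigned int nproc = 64;
    nvmlProcessInfo_t procs[64];
    if (nvmlDeviceGetComputeRunningProcesses(dev, &nproc, procs) == NVML_SUCCESS) {
        for (unsigned int i = 0; i < nproc; i++)
            printf("running process pid: %u\n", procs[i].pid);
    }

    /* PIDs from the utilization samples (host namespace in the run above) */
    unsigned int nsamples = 64;
    nvmlProcessUtilizationSample_t samples[64];
    ret = nvmlDeviceGetProcessUtilization(dev, samples, &nsamples, 0);
    if (ret == NVML_SUCCESS) {
        for (unsigned int i = 0; i < nsamples; i++)
            printf("sampled process pid: %u  smUtil: %u\n", samples[i].pid, samples[i].smUtil);
    } else {
        fprintf(stderr, "nvmlDeviceGetProcessUtilization failed with %s\n", nvmlErrorString(ret));
    }

    nvmlShutdown();
    return 0;
}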

I think this may be the root cause of issues like #104 and some other issues in the HAMi repo.

Solution

I would like to know whether we should handle this inconsistent behavior.
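
One possible direction, sketched below under the assumption that the caller can read the host's /proc (e.g., it runs on the host, or in a container that shares the host PID namespace): translate the host PID returned by nvmlDeviceGetProcessUtilization into the namespace-local PID via the NSpid: line of /proc/<pid>/status, whose last field is the PID in the innermost namespace. The helper name host_pid_to_ns_pid below is hypothetical, not existing HAMi-core code.

/* Sketch only: map a host PID to its PID in the innermost (container)
 * namespace by parsing the "NSpid:" line of /proc/<pid>/status.
 * Requires visibility of the host's /proc; from inside a container
 * without the host PID namespace this lookup is not possible. */
#include <stdio.h>
#include <string.h>

/* Returns the namespace-local PID on success, -1 on failure. */
static int host_pid_to_ns_pid(unsigned int host_pid) {
    char path[64];
    snprintf(path, sizeof(path), "/proc/%u/status", host_pid);

    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    char line[256];
    int ns_pid = -1;
    while (fgets(line, sizeof(line), fp) != NULL) {
        if (strncmp(line, "NSpid:", 6) == 0) {
            /* NSpid lists the PID in each nested namespace, outermost
             * first; keep the last field (innermost namespace). */
            char *tok = strtok(line + 6, " \t\n");
            while (tok != NULL) {
                sscanf(tok, "%d", &ns_pid);
                tok = strtok(NULL, " \t\n");
            }
            break;
        }
    }
    fclose(fp);
    return ns_pid;
}

int main(void) {
    /* 645400 is the host PID from the sampled output above. */
    printf("namespace-local pid: %d\n", host_pid_to_ns_pid(645400));
    return 0;
}

The reverse direction (container PID to host PID) is not available from inside the container without host access, which is what makes the two NVML results hard to reconcile in the general case.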
