
Bug(multiprocess): handle inconsistent PID namespace with NVML API #112

@Shouren

Description

In some versions of the NVIDIA driver, e.g. Driver Version 560.35.03, calling nvmlDeviceGetComputeRunningProcesses and nvmlDeviceGetProcessUtilization may return PIDs from different PID namespaces.

I created a gist to reproduce the issue and tested it on a server with the following configuration:

  • Driver Version: 560.35.03
  • CUDA Version: 12.6

Run a Docker container with:

docker run --rm -it --name shouren-hami-core-tester1 \
    --entrypoint /bin/bash \
    --gpus '"device=3"' \
    -v /root/shouren/Project-HAMi/hami-core:/opt/hami-core \
    nvcr.io/nvidia/pytorch:24.06-py3

The output of the test is:

root@23713ecf3450:/opt/hami-core/tmp# ./pid_check
main: pid: 323
main: compute thread started successfully
Found 1 NVIDIA GPU device(s).

Device 0: NVIDIA GeForce RTX 2080 Ti, GPU-ea5a7f29-c714-06ba-23ef-cedbc44248da
Num of running process : 1
    process id: 324	mem: 0
Failed to get processes utils by device with Not Found
Found 1 NVIDIA GPU device(s).

Device 0: NVIDIA GeForce RTX 2080 Ti, GPU-ea5a7f29-c714-06ba-23ef-cedbc44248da
Num of running process : 1
    process id: 324	mem: 0
Num of process sampled: 1
    process id: 645400	smUtil: 13
Found 1 NVIDIA GPU device(s).
...

As the output shows, the PID returned by nvmlDeviceGetComputeRunningProcesses is in the container's PID namespace, whereas the PID returned by nvmlDeviceGetProcessUtilization is from the host PID namespace.
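
For reference, below is a minimal sketch of the kind of check the gist performs (an approximation, not the gist itself; the device index and buffer sizes are placeholders). It prints the PIDs reported by both NVML calls side by side, which is where the namespace mismatch shows up.

/* pid_check sketch: compare PIDs reported by the two NVML calls.
 * Build with something like: gcc pid_check.c -o pid_check -lnvidia-ml */
#include <stdio.h>
#include <nvml.h>

int main(void) {
    nvmlReturn_t ret = nvmlInit();
    if (ret != NVML_SUCCESS) {
        fprintf(stderr, "nvmlInit failed: %s\n", nvmlErrorString(ret));
        return 1;
    }

    nvmlDevice_t dev;
    ret = nvmlDeviceGetHandleByIndex(0, &dev);
    if (ret != NVML_SUCCESS) {
        fprintf(stderr, "nvmlDeviceGetHandleByIndex failed: %s\n", nvmlErrorString(ret));
        nvmlShutdown();
        return 1;
    }

    /* PIDs from the running-process list (container namespace in the run above) */
    unsigned int nproc = 64;
    nvmlProcessInfo_t procs[64];
    if (nvmlDeviceGetComputeRunningProcesses(dev, &nproc, procs) == NVML_SUCCESS) {
        for (unsigned int i = 0; i < nproc; i++)
            printf("running process pid: %u\n", procs[i].pid);
    }

    /* PIDs from the utilization samples (host namespace in the run above) */
    unsigned int nsamples = 64;
    nvmlProcessUtilizationSample_t samples[64];
    ret = nvmlDeviceGetProcessUtilization(dev, samples, &nsamples, 0);
    if (ret == NVML_SUCCESS) {
        for (unsigned int i = 0; i < nsamples; i++)
            printf("sampled process pid: %u  smUtil: %u\n", samples[i].pid, samples[i].smUtil);
    } else {
        fprintf(stderr, "nvmlDeviceGetProcessUtilization failed with %s\n", nvmlErrorString(ret));
    }

    nvmlShutdown();
    return 0;
}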

I think this may be the root cause of issues like #104 and some other issues in the HAMi repo.

Solution

I would like to know whether we should handle this inconsistent behavior.
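
One possible direction, sketched below under the assumption that the caller can read the host's /proc (e.g., it runs on the host, or in a container that shares the host PID namespace): translate the host PID returned by nvmlDeviceGetProcessUtilization into the namespace-local PID via the NSpid: line of /proc/<pid>/status, whose last field is the PID in the innermost namespace. The helper name host_pid_to_ns_pid below is hypothetical, not existing HAMi-core code.

/* Sketch only: map a host PID to its PID in the innermost (container)
 * namespace by parsing the "NSpid:" line of /proc/<pid>/status.
 * Requires visibility of the host's /proc; from inside a container
 * without the host PID namespace this lookup is not possible. */
#include <stdio.h>
#include <string.h>

/* Returns the namespace-local PID on success, -1 on failure. */
static int host_pid_to_ns_pid(unsigned int host_pid) {
    char path[64];
    snprintf(path, sizeof(path), "/proc/%u/status", host_pid);

    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    char line[256];
    int ns_pid = -1;
    while (fgets(line, sizeof(line), fp) != NULL) {
        if (strncmp(line, "NSpid:", 6) == 0) {
            /* NSpid lists the PID in each nested namespace, outermost
             * first; keep the last field (innermost namespace). */
            char *tok = strtok(line + 6, " \t\n");
            while (tok != NULL) {
                sscanf(tok, "%d", &ns_pid);
                tok = strtok(NULL, " \t\n");
            }
            break;
        }
    }
    fclose(fp);
    return ns_pid;
}

int main(void) {
    /* 645400 is the host PID from the sampled output above. */
    printf("namespace-local pid: %d\n", host_pid_to_ns_pid(645400));
    return 0;
}

The reverse direction (container PID to host PID) is not available from inside the container without host access, which is what makes the two NVML results hard to reconcile in the general case.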
