The OpenCL extension for querying the locality of AMD GPUs (CL_DEVICE_TOPOLOGY_TYPE_PCIE_AMD) reports only the PCI bus, device and function, not the PCI domain. Systems with multiple PCI domains used to be rare, but at least systems with multiple MI300A CPU+GPUs (like Adastra and maybe El Capitan) now use one domain per CPU.
The issue is confirmed at ROCm/clr#106, but it won't be fixed upstream, likely because the OpenCL runtime isn't a priority anymore.
Possible solutions:
- remove AMD OpenCL locality queries entirely, since they apparently don't matter much anymore
- if there are multiple PCI domains in the machine, ignore AMD OpenCL locality (just attach to root)
- if multiple AMD GPUs have the same PCI BDF and differ only by PCI domain, attach all of them to root
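
The last option could look roughly like the sketch below. It is only an illustration, not existing code: `struct amd_ocl_bdf` and `mark_ambiguous_bdfs()` are hypothetical names, and the sketch assumes the bus/device/function values returned by the OpenCL query have already been collected into an array. Two OpenCL devices reporting the same BDF can only be distinct GPUs in different domains, so both are flagged and would be attached to root instead of to a PCI object.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record holding what CL_DEVICE_TOPOLOGY_TYPE_PCIE_AMD
 * gives us for one OpenCL device: bus/device/function, no PCI domain. */
struct amd_ocl_bdf {
    unsigned bus;
    unsigned device;
    unsigned function;
};

static bool same_bdf(const struct amd_ocl_bdf *a, const struct amd_ocl_bdf *b)
{
    return a->bus == b->bus
        && a->device == b->device
        && a->function == b->function;
}

/* Hypothetical helper: set ambiguous[i] to true for every device whose
 * BDF is also reported by another device. Such devices can only differ
 * by PCI domain, which the extension cannot tell us, so the caller
 * should attach them to root. Returns the number of ambiguous devices. */
static size_t mark_ambiguous_bdfs(const struct amd_ocl_bdf *devs, size_t n,
                                  bool *ambiguous)
{
    size_t i, j, count = 0;

    for (i = 0; i < n; i++)
        ambiguous[i] = false;

    /* Quadratic pairwise comparison; fine for a handful of GPUs. */
    for (i = 0; i < n; i++) {
        for (j = i + 1; j < n; j++) {
            if (same_bdf(&devs[i], &devs[j])) {
                ambiguous[i] = true;
                ambiguous[j] = true;
            }
        }
    }

    for (i = 0; i < n; i++)
        if (ambiguous[i])
            count++;

    return count;
}
```

Compared with ignoring OpenCL locality whenever the machine has several PCI domains, this keeps locality for GPUs whose BDF is unique among the OpenCL devices. It cannot catch the case where a unique BDF also exists in another domain that hosts no AMD OpenCL device, so it may still need to be combined with a domain count check.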