How to run quiver on server with complex GPU topology?

Hi, I want to run quiver's p2p_clique_replicate cache policy on a single server with 4 A100 GPUs. The GPU topology is as follows:
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      NV12    PXB     PXB     0-25,52-77      0
GPU1    NV12     X      PXB     PXB     0-25,52-77      0
GPU2    PXB     PXB      X      NV12    0-25,52-77      0
GPU3    PXB     PXB     NV12     X      0-25,52-77      0
GPUs 0 and 1 are connected by NVLink, and so are GPUs 2 and 3. Every other pair communicates over PCIe (PXB).

According to the documentation, there are two cliques (GPUs 0,1 and GPUs 2,3), so the cache should be replicated across the two cliques. However, the cache appears to be distributed over all 4 GPUs instead.
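To make the expected grouping explicit, here is a minimal sketch that derives the NVLink groups from the topology table above. It is not quiver's own detection logic; `nvlink_cliques` is a hypothetical helper, and it treats any cell starting with "NV" as an NVLink connection and groups GPUs by connected component (which coincides with the cliques for this topology).

```python
# Connectivity part of the topology table above (affinity columns omitted).
TOPO = """
        GPU0    GPU1    GPU2    GPU3
GPU0     X      NV12    PXB     PXB
GPU1    NV12     X      PXB     PXB
GPU2    PXB     PXB      X      NV12
GPU3    PXB     PXB     NV12     X
"""


def nvlink_cliques(topo_text):
    """Group GPU indices that are connected to each other through NVLink.

    Hypothetical helper for illustration only: it computes connected
    components over "NV*" links, which equals the clique structure here.
    """
    rows = [line.split() for line in topo_text.strip().splitlines()]
    num_gpus = len(rows[0])  # header row lists one name per GPU

    # Build an adjacency map that contains NVLink edges only.
    linked = {i: set() for i in range(num_gpus)}
    for i, row in enumerate(rows[1:]):
        for j, cell in enumerate(row[1:num_gpus + 1]):
            if cell.startswith("NV"):
                linked[i].add(j)

    # Depth-first search to collect each connected component.
    seen, groups = set(), []
    for start in range(num_gpus):
        if start in seen:
            continue
        stack, group = [start], []
        while stack:
            gpu = stack.pop()
            if gpu in seen:
                continue
            seen.add(gpu)
            group.append(gpu)
            stack.extend(linked[gpu] - seen)
        groups.append(sorted(group))
    return groups


cliques = nvlink_cliques(TOPO)
print(cliques)  # [[0, 1], [2, 3]]
```

This is the grouping I expected quiver to report, rather than a single group containing all four GPUs.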
Here is my code (dist_sampling_ogb_reddit_quiver.py, Reddit dataset, feature size about 500 MB):
quiver.init_p2p(device_list=list(range(world_size)))
quiver_feature = quiver.Feature(rank=0, device_list=list(range(world_size)), device_cache_size="0.1G", cache_policy="p2p_clique_replicate", csr_topo=csr_topo)
This is the output I got:
[0, 1, 2, 3]
LOG>>> P2P Access Initilization
Enable P2P Access Between 0 <---> 1 
Enable P2P Access Between 0 <---> 2 
Enable P2P Access Between 0 <---> 3 
Enable P2P Access Between 1 <---> 2 
Enable P2P Access Between 1 <---> 3 
Enable P2P Access Between 2 <---> 3 
WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
LOG>>> 76% data cached
LOG>>> GPU [0, 1, 2, 3] belong to the same NUMA Domain
LOG >>> Memory Budge On 0 is 102 MB
LOG >>> Memory Budge On 1 is 102 MB
LOG >>> Memory Budge On 2 is 102 MB
LOG >>> Memory Budge On 3 is 102 MB
Let's use 4 GPUs!
WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
Epoch: 019, Epoch Time: 0.5197241902351379
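As a rough sanity check on the "76% data cached" line, the sketch below estimates the cached fraction under both layouts. The numbers are assumptions on my side: the Reddit feature matrix is taken to be 232,965 x 602 float32 values (the shape of the commonly used Reddit dataset), and "0.1G" is taken to mean 0.1 GiB per GPU, which matches the 102 MB budget in the log. `cached_fraction` is a hypothetical helper, not a quiver function.

```python
MIB = 1024 ** 2

# Assumed Reddit feature matrix: 232,965 nodes x 602 float32 features.
num_nodes, feat_dim, bytes_per_value = 232_965, 602, 4
feature_bytes = num_nodes * feat_dim * bytes_per_value

# Assumed meaning of device_cache_size="0.1G": 0.1 GiB (about 102 MiB) per GPU.
budget_bytes = 0.1 * 1024 * MIB


def cached_fraction(num_distinct_shards):
    """Fraction of the features cached when this many GPUs hold distinct data."""
    return min(1.0, num_distinct_shards * budget_bytes / feature_bytes)


# All 4 GPUs hold different slices (one clique of 4).
partitioned_over_4 = cached_fraction(4)

# Two cliques of 2 GPUs, each clique holding the same 2 slices.
replicated_over_2_cliques = cached_fraction(2)

print(f"one clique of 4 GPUs:  {partitioned_over_4:.1%}")
print(f"two cliques of 2 GPUs: {replicated_over_2_cliques:.1%}")
```

Under these assumptions, four distinct slices cover roughly 76% of the features, which matches the log, while two replicated cliques would cover only about 38%. This is why I believe the cache is being partitioned across all 4 GPUs.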

Is there a way to make p2p_clique_replicate replicate the cache across the two cliques on my 4-GPU server?
Thanks~

How to run quiver on server with complex GPU topology? #135
