How to run quiver on server with complex GPU topology? #135

@JIESUN233

Hi, I want to run Quiver's p2p_clique_replicate cache policy on a single server with 4 A100 GPUs. The GPU topology is as follows:
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0    X       NV12    PXB     PXB     0-25,52-77      0
GPU1    NV12    X       PXB     PXB     0-25,52-77      0
GPU2    PXB     PXB     X       NV12    0-25,52-77      0
GPU3    PXB     PXB     NV12    X       0-25,52-77      0
There are NVLinks between GPUs 0 and 1 and between GPUs 2 and 3.

According to the documentation, there should be two cliques (GPUs 0,1 and GPUs 2,3), and the cache should be replicated across the two cliques. However, the cache appears to be distributed across all 4 GPUs.
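To make the expectation concrete, here is a minimal sketch of how the GPUs would be grouped from the topology matrix above. It is not Quiver's actual clique detection; `TOPO`, `group_gpus`, `is_nvlink`, and `is_p2p_capable` are hypothetical names, and the grouping uses connected components, which coincide with cliques for this particular topology. Treating PXB as P2P-capable is an assumption based only on the "Enable P2P Access" lines in my log.

```python
from typing import Callable, Dict, List

# Link types copied from the topology matrix (row GPU -> column GPU).
TOPO: Dict[int, Dict[int, str]] = {
    0: {1: "NV12", 2: "PXB", 3: "PXB"},
    1: {0: "NV12", 2: "PXB", 3: "PXB"},
    2: {0: "PXB", 1: "PXB", 3: "NV12"},
    3: {0: "PXB", 1: "PXB", 2: "NV12"},
}


def group_gpus(topo: Dict[int, Dict[int, str]],
               connected: Callable[[str], bool]) -> List[List[int]]:
    """Group GPUs into connected components under the `connected` predicate."""
    seen = set()
    groups = []
    for start in sorted(topo):
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:
            gpu = stack.pop()
            group.append(gpu)
            for peer, link in topo[gpu].items():
                if peer not in seen and connected(link):
                    seen.add(peer)
                    stack.append(peer)
        groups.append(sorted(group))
    return groups


def is_nvlink(link: str) -> bool:
    # Only NVLink connections (NV1, NV12, ...) join two GPUs.
    return link.startswith("NV")


def is_p2p_capable(link: str) -> bool:
    # Assumption: PXB pairs also count, since the log enables P2P for every pair.
    return link.startswith("NV") or link == "PXB"


nvlink_groups = group_gpus(TOPO, is_nvlink)     # the grouping I expected
p2p_groups = group_gpus(TOPO, is_p2p_capable)   # the grouping I seem to get
print(nvlink_groups, p2p_groups)
```

Grouping by NVLink alone yields the two cliques I expected, while grouping by any P2P-capable link yields a single group of all 4 GPUs, which would be consistent with the behavior I observe.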
Here is my code (dist_sampling_ogb_reddit_quiver.py, Reddit dataset, 500 MB of features):
quiver.init_p2p(device_list=list(range(world_size)))
quiver_feature = quiver.Feature(rank=0, device_list=list(range(world_size)), device_cache_size="0.1G", cache_policy="p2p_clique_replicate", csr_topo=csr_topo)
This is what I got:
[0, 1, 2, 3]
LOG>>> P2P Access Initilization
Enable P2P Access Between 0 <---> 1
Enable P2P Access Between 0 <---> 2
Enable P2P Access Between 0 <---> 3
Enable P2P Access Between 1 <---> 2
Enable P2P Access Between 1 <---> 3
Enable P2P Access Between 2 <---> 3
WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
LOG>>> 76% data cached
LOG>>> GPU [0, 1, 2, 3] belong to the same NUMA Domain
LOG >>> Memory Budge On 0 is 102 MB
LOG >>> Memory Budge On 1 is 102 MB
LOG >>> Memory Budge On 2 is 102 MB
LOG >>> Memory Budge On 3 is 102 MB
Let's use 4 GPUs!
WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
Epoch: 019, Epoch Time: 0.5197241902351379
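A rough sanity check on the "76% data cached" line, as a sketch under stated assumptions: the GPUs inside one clique hold disjoint partitions of the cache, the per-GPU budget is the 102 MB from the log, and the feature size is the approximate 500 MB figure I quoted above. `cached_fraction` is a hypothetical helper, not part of Quiver.

```python
def cached_fraction(feature_mb: float, per_gpu_budget_mb: float,
                    gpus_per_clique: int) -> float:
    """Fraction of the feature data that one clique can cache.

    Assumes the GPUs in a clique store disjoint partitions, so the
    clique's capacity is the sum of its members' budgets.
    """
    capacity_mb = per_gpu_budget_mb * gpus_per_clique
    return min(1.0, capacity_mb / feature_mb)


FEATURE_MB = 500.0  # approximate feature size from the post
BUDGET_MB = 102.0   # from the "Memory Budge On N is 102 MB" log lines

one_clique_of_four = cached_fraction(FEATURE_MB, BUDGET_MB, 4)   # about 0.82
two_cliques_of_two = cached_fraction(FEATURE_MB, BUDGET_MB, 2)   # about 0.41

print(f"one 4-GPU clique:   {one_clique_of_four:.0%}")
print(f"two 2-GPU cliques:  {two_cliques_of_two:.0%}")
```

With two replicated 2-GPU cliques I would expect roughly 41% of the data to be cached. The reported 76% is much closer to the single 4-GPU clique estimate, which is why I believe the cache is being spread over all 4 GPUs.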

So I wonder if there is a way to enable p2p_clique_replicate on my 4-GPU server.
Thanks~
