Commit a807b62

Merge pull request #89 from windsonsea/metfor
Clean up userguide/Metax-device/ pages
2 parents 401d7b4 + ec7aa67 commit a807b62

10 files changed: +46 -62 lines changed


docs/userguide/Metax-device/Metax-GPU/enable-metax-gpu-schedule.md

Lines changed: 17 additions & 15 deletions
@@ -2,28 +2,30 @@
title: Enable Metax GPU topology-aware scheduling
---

-## Introduction
+**HAMi now supports metax.com/gpu by implementing topo-awareness among metax GPUs.**

-**we now support metax.com/gpu by implementing topo-awareness among metax GPUs**
-
-When multiple GPUs are configured on a single server, the GPU cards are connected to the same PCIe Switch or MetaXLink depending on whether they are connected
-, there is a near-far relationship. This forms a topology among all the cards on the server, as shown in the following figure:
+When multiple GPUs are configured on a single server, the GPU cards are connected to the same PCIe Switch or MetaXLink.
+Depending on the connection type, a near-far relationship is formed among the GPUs.
+Together, these connections define the topology of the GPU cards on the server, as shown below:

![img](https://github.com/Project-HAMi/HAMi/raw/master/imgs/metax_topo.png)

-A user job requests a certain number of metax-tech.com/gpu resources, Kubernetes schedule pods to the appropriate node. gpu-device further processes the logic of allocating the remaining resources on the resource node following criterias below:
-1. MetaXLink takes precedence over PCIe Switch in two way:
-– A connection is considered a MetaXLink connection when there is a MetaXLink connection and a PCIe Switch connection between the two cards.
-– When both the MetaXLink and the PCIe Switch can meet the job request
-Equipped with MetaXLink interconnected resources.
+When a user job requests a specific number of `metax-tech.com/gpu` resources,
+Kubernetes schedules the pod to a suitable node. On that node,
+the GPU device plugin (gpu-device) handles fine-grained allocation based on the following criteria:
+
+1. MetaXLink takes precedence over PCIe Switch in two ways:
+
+- A connection is considered a MetaXLink connection when there is both a MetaXLink connection and a PCIe Switch connection between the two cards.
+- When both the MetaXLink and the PCIe Switch can meet the job request, MetaXLink-interconnected resources are allocated first.

-2. When using `node-scheduler-policy=spread` , Allocate Metax resources to be under the same Metaxlink or Paiswich as much as possible, as the following figure shows:
+2. When using `node-scheduler-policy=spread`, allocate Metax resources under the same MetaXLink or PCIe Switch as much as possible, as shown below:

-![img](https://github.com/Project-HAMi/HAMi/raw/master/imgs/metax_spread.png)
+![img](https://github.com/Project-HAMi/HAMi/raw/master/imgs/metax_spread.png)

-3. When using `node-scheduler-policy=binpack`, Assign GPU resources, so minimize the damage to MetaxXLink topology, as the following figure shows:
+3. When using `node-scheduler-policy=binpack`, assign GPU resources so as to minimize damage to the MetaXLink topology, as shown below:

-![img](https://github.com/Project-HAMi/HAMi/raw/master/imgs/metax_binpack.png)
+![img](https://github.com/Project-HAMi/HAMi/raw/master/imgs/metax_binpack.png)

## Important Notes

@@ -45,7 +47,7 @@ Equipped with MetaXLink interconnected resources.
## Running Metax jobs

Metax GPUs can now be requested by a container
-using the `metax-tech.com/gpu` resource type:
+using the `metax-tech.com/gpu` resource type:

```yaml
apiVersion: v1
docs/userguide/Metax-device/Metax-GPU/examples/allocate-binpack.md

Lines changed: 3 additions & 5 deletions
@@ -2,11 +2,9 @@
title: Binpack schedule policy
---

-## Allocate metax device using binpack schedule policy
+To allocate a Metax device with minimum damage to topology, you only need to assign `metax-tech.com/gpu` with the annotation `hami.io/node-scheduler-policy: "binpack"`.

-To allocate metax device with mininum damage to topology, you need to only assign `metax-tech.com/gpu` with annotations `hami.io/node-scheduler-policy`=`binpack`
-
-```
+```yaml
apiVersion: v1
kind: Pod
metadata:
@@ -22,4 +20,4 @@ spec:
resources:
limits:
metax-tech.com/gpu: 1 # requesting 1 metax GPU
-```
+```
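
The hunks above omit the middle of the example manifest. A sketch of how the binpack annotation and the GPU request might fit together in a single pod spec (pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: binpack-gpu-pod                        # illustrative name
  annotations:
    hami.io/node-scheduler-policy: "binpack"   # scheduler minimizes topology loss
spec:
  containers:
    - name: ubuntu-container
      image: ubuntu:22.04                      # illustrative image
      resources:
        limits:
          metax-tech.com/gpu: 1                # requesting 1 metax GPU
```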

docs/userguide/Metax-device/Metax-GPU/examples/allocate-spread.md

Lines changed: 3 additions & 5 deletions
@@ -2,11 +2,9 @@
title: Spread schedule policy
---

-## Allocate metax device using spread schedule policy
+To allocate Metax devices with the best performance, you only need to assign `metax-tech.com/gpu` with the annotation `hami.io/node-scheduler-policy: "spread"`.

-To allocate metax device with best performance, you need to only assign `metax-tech.com/gpu` with annotations `hami.io/node-scheduler-policy`=`spread`
-
-```
+```yaml
apiVersion: v1
kind: Pod
metadata:
@@ -22,4 +20,4 @@ spec:
resources:
limits:
metax-tech.com/gpu: 4 # requesting 4 metax GPUs
-```
+```
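
Likewise, a sketch combining the spread annotation with the 4-GPU request shown in this diff (pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spread-gpu-pod                        # illustrative name
  annotations:
    hami.io/node-scheduler-policy: "spread"   # scheduler looks for the best topology
spec:
  containers:
    - name: ubuntu-container
      image: ubuntu:22.04                     # illustrative image
      resources:
        limits:
          metax-tech.com/gpu: 4               # requesting 4 metax GPUs
```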

docs/userguide/Metax-device/Metax-GPU/examples/default-use.md

Lines changed: 2 additions & 4 deletions
@@ -2,11 +2,9 @@
title: Allocate metax device
---

-## Allocate metax device
-
To allocate metax device, you need to only assign `metax-tech.com/gpu` without other fields.

-```
+```yaml
apiVersion: v1
kind: Pod
metadata:
@@ -20,4 +18,4 @@ spec:
resources:
limits:
metax-tech.com/gpu: 1 # requesting 1 metax GPU
-```
+```

docs/userguide/Metax-device/Metax-GPU/specify-binpack-task.md

Lines changed: 2 additions & 4 deletions
@@ -2,11 +2,9 @@
title: Binpack schedule policy
---

-## Set schedule policy to binpack
+To allocate a Metax device with minimum damage to topology, you only need to assign `metax-tech.com/gpu` with the annotation `hami.io/node-scheduler-policy: "binpack"`.

-To allocate metax device with mininum damage to topology, you need to only assign `metax-tech.com/gpu` with annotations `hami.io/node-scheduler-policy`=`binpack`
-
-```
+```yaml
metadata:
annotations:
hami.io/node-scheduler-policy: "binpack" # when this parameter is set to binpack, the scheduler will try to minimize the topology loss.

docs/userguide/Metax-device/Metax-GPU/specify-spread-task.md

Lines changed: 2 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -2,11 +2,9 @@
22
title: Spread schedule policy
33
---
44

5-
## Set schedule policy to spread
5+
To allocate metax device with best performance, you need to only assign `metax-tech.com/gpu` with annotations `hami.io/node-scheduler-policy: "spread"`.
66

7-
To allocate metax device with best performance, you need to only assign `metax-tech.com/gpu` with annotations `hami.io/node-scheduler-policy`=`spread`
8-
9-
```
7+
```yaml
108
metadata:
119
annotations:
1210
hami.io/node-scheduler-policy: "spread" # when this parameter is set to spread, the scheduler will try to find the best topology for this task.

docs/userguide/Metax-device/Metax-sGPU/enable-metax-gpu-sharing.md

Lines changed: 9 additions & 11 deletions
@@ -3,32 +3,30 @@ title: Enable Metax GPU sharing
translated: true
---

-## Introduction
+**HAMi now supports metax.com/gpu by implementing most of the device-sharing features available for NVIDIA GPUs**, including the following:

-**we now support metax.com/gpu by implementing most device-sharing features as nvidia-GPU**, device-sharing features include the following:
+- **GPU Sharing**: Tasks can request a fraction of a GPU rather than the entire GPU card, allowing multiple tasks to share the same GPU.

-***GPU sharing***: Each task can allocate a portion of GPU instead of a whole GPU card, thus GPU can be shared among multiple tasks.
+- **Device Memory Control**: Tasks can be allocated a specific amount of GPU memory, with strict enforcement to ensure usage does not exceed the assigned limit.

-***Device Memory Control***: GPUs can be allocated with certain device memory size and have made it that it does not exceed the boundary.
+- **Compute Core Limiting**: Tasks can be allocated a specific percentage of GPU compute cores (e.g., `60` means the container can use 60% of the GPU’s compute cores).

-***Device compute core limitation***: GPUs can be allocated with certain percentage of device core(60 indicate this container uses 60% compute cores of this device)
-
-### Prerequisites
+## Prerequisites

* Metax Driver >= 2.32.0
* Metax GPU Operator >= 0.10.2
* Kubernetes >= 1.23

-### Enabling GPU-sharing Support
+## Enabling GPU-sharing support

-* Deploy Metax GPU Operator on metax nodes (Please consult your device provider to aquire its package and document)
+* Deploy Metax GPU Operator on metax nodes (Please consult your device provider to obtain the installation package and documentation)

* Deploy HAMi according to README.md

-### Running Metax jobs
+## Running Metax jobs

Metax GPUs can now be requested by a container
-using the `metax-tech.com/sgpu` resource type:
+using the `metax-tech.com/sgpu` resource type:

```yaml
apiVersion: v1

docs/userguide/Metax-device/Metax-sGPU/examples/allocate-exclusive.md

Lines changed: 0 additions & 2 deletions
@@ -3,8 +3,6 @@ title: Allocate exclusive device
translated: true
---

-## Allocate exclusive device
-
To allocate a whole Metax GPU device, you need to only assign `metax-tech.com/sgpu` without other fields.

```yaml

docs/userguide/Metax-device/Metax-sGPU/examples/allocate-qos-policy.md

Lines changed: 7 additions & 9 deletions
@@ -1,17 +1,15 @@
---
-title: Allocate specific Qos policy devices
+title: Allocate specific QoS policy devices
translated: true
---

-## Allocate specific Qos policy devices
+Users can configure the QoS policy for tasks using the `metax-tech.com/sgpu-qos-policy` annotation to specify the scheduling policy used by the shared GPU (sGPU). The available sGPU scheduling policies are described in the table below:

-Users can configure the Qos Policy parameter for tasks via `metax-tech.com/sgpu-qos-policy` to specify the scheduling policy used by the sGPU. The specific sGPU scheduling policy description can be found in the following table.
-
-| scheduling policy | description |
-| --- | --- |
-| `best-effort` | sGPU is no limit on computing power |
-| `fixed-share` | sGPU is a fixed computing power quota, and it cannot be used beyond the fixed quota |
-| `burst-share` | sGPU is a fixed computing power quota. If the GPU card still has idle computing power, it can be used by the sGPU |
+| Scheduling Policy | Description |
+|-------------------|-------------|
+| `best-effort` | The sGPU has no restriction on compute usage. |
+| `fixed-share` | The sGPU is assigned a fixed compute quota and cannot exceed this limit. |
+| `burst-share` | The sGPU is assigned a fixed compute quota, but may utilize additional GPU compute resources when they are idle. |

```yaml
apiVersion: v1
docs/userguide/Metax-device/Metax-sGPU/examples/default-use.md

Lines changed: 1 addition & 3 deletions
@@ -3,9 +3,7 @@ title: Allocate device core and memory resource
translated: true
---

-## Allocate device core and memory to container
-
-To allocate a certain part of device core resource, you need only to assign the `metax-tech.com/vcore` and `metax-tech.com/vmemory` along with the number of Metax GPUs you requested in the container using `metax-tech.com/sgpu`
+To allocate a certain part of device core resource, you need only to assign the `metax-tech.com/vcore` and `metax-tech.com/vmemory` along with the number of Metax GPUs you requested in the container using `metax-tech.com/sgpu`.

```yaml
apiVersion: v1
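
This manifest is likewise cut off at `apiVersion: v1`. A sketch of a pod requesting a slice of a shared GPU with explicit core and memory limits, per the sentence above (values are illustrative; the `vmemory` unit is whatever the Metax device plugin defines):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sgpu-slice-pod                 # illustrative name
spec:
  containers:
    - name: ubuntu-container
      image: ubuntu:22.04              # illustrative image
      resources:
        limits:
          metax-tech.com/sgpu: 1       # 1 shared GPU
          metax-tech.com/vcore: 60     # 60% of the GPU's compute cores
          metax-tech.com/vmemory: 4    # device memory request (unit per Metax device plugin)
```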
