
Commit 3e260f6

Merge pull request #95 from windsonsea/nginx
i18n: Remove nginx-example and update deploy-with-helm
2 parents 2ba97bc + 9405d9a commit 3e260f6

File tree

3 files changed (+152, -323 lines)


docs/get-started/deploy-with-helm.md

Lines changed: 2 additions & 2 deletions
@@ -160,15 +160,15 @@ spec:
          nvidia.com/gpumem: 10240 # Each vGPU contains 10240m device memory (Optional,Integer)
```

-#### Verify in-container resource control {#verify-in-container-resouce-control}
+#### 2. Verify in-container resource control {#verify-in-container-resouce-control}

Execute the following query command:

```bash
kubectl exec -it gpu-pod nvidia-smi
```

-The result should be
+The result should be:

```text
[HAMI-core Msg(28:140561996502848:libvgpu.c:836)]: Initializing.....
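One note on the query command in this hunk: newer kubectl releases deprecate the bare `kubectl exec POD COMMAND` form, so the variant with the `--` separator is the safer equivalent (an editorial aside, not part of the commit):

```bash
kubectl exec -it gpu-pod -- nvidia-smi
```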
Lines changed: 150 additions & 148 deletions
@@ -1,186 +1,188 @@
---
-title: Deploy HAMi with helm
+title: Deploy HAMi with Helm
---

## Table of Contents {#toc}

- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Demo](#demo)

This guide covers:

-- Configure the nvidia container runtime for each GPU node
-- Install HAMi with helm
+- Configure the NVIDIA container runtime for each GPU node
+- Install HAMi with Helm
- Launch a vGPU task
- Verify that device resources are limited inside the container

## Prerequisites {#prerequisites}

- [Helm](https://helm.sh/zh/docs/) v3+
- [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/) v1.16+
- [CUDA](https://developer.nvidia.com/cuda-toolkit) v10.2+
- [NVIDIA driver](https://www.nvidia.cn/drivers/unix/) v440+

## Installation {#installation}

### 1. Configure nvidia-container-toolkit {#configure-nvidia-container-toolkit}

<summary> Configure nvidia-container-toolkit </summary>

Perform the following steps on all GPU nodes.

This document assumes that the NVIDIA driver and `nvidia-container-toolkit` are already installed, and that `nvidia-container-runtime` has been configured as the default low-level runtime.

See: [nvidia-container-toolkit installation guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)

#### Example for Debian-based systems with `Docker` and `containerd` {#example-for-debian-based-systems-with-docker-and-containerd}

##### Install `nvidia-container-toolkit` {#install-the-nvidia-container-toolkit}

```bash
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sudo tee /etc/apt/sources.list.d/libnvidia-container.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
```
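As a quick host-level sanity check after the install (an editorial sketch, not part of the commit), confirm that the driver and the toolkit CLI both respond:

```bash
# Driver check: should print the GPU table
nvidia-smi

# Toolkit check: lists the GPUs visible to libnvidia-container
nvidia-container-cli info
```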

##### Configure `Docker` {#configure-docker}

When running `Kubernetes` with `Docker`, edit the configuration file (usually located at `/etc/docker/daemon.json`) to set `nvidia-container-runtime` as the default low-level runtime:

```json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```

Then restart `Docker`:

```bash
sudo systemctl daemon-reload && systemctl restart docker
```
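To confirm that `nvidia` is now Docker's default runtime (a sketch; the exact output format varies by Docker version):

```bash
docker info | grep -i 'default runtime'
# Expected output similar to:
#  Default Runtime: nvidia
```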

##### Configure `containerd` {#configure-containerd}

When running `Kubernetes` with `containerd`, edit the configuration file (usually located at `/etc/containerd/config.toml`) to set `nvidia-container-runtime` as the default low-level runtime:

```toml
version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
```

Then restart `containerd`:

```bash
sudo systemctl daemon-reload && systemctl restart containerd
```

99-
#### 2. 标记节点 {#label-your-nodes}
101+
#### 2. 标记节点 {#label-your-nodes}
100102

101-
通过添加"gpu=on"标签将GPU节点标记为可调度HAMi任务。未标记的节点将无法被调度器管理。
103+
通过添加 "gpu=on" 标签将 GPU 节点标记为可调度 HAMi 任务。未标记的节点将无法被调度器管理。
102104

103-
```bash
104-
kubectl label nodes {节点ID} gpu=on
105-
```
105+
```bash
106+
kubectl label nodes {节点ID} gpu=on
107+
```
106108

107-
#### 3. 使用helm部署HAMi {#deploy-hami-using-helm}
109+
#### 3. 使用 Helm 部署 HAMi {#deploy-hami-using-helm}
108110

109-
首先通过以下命令确认Kubernetes版本:
111+
首先通过以下命令确认 Kubernetes 版本:
110112

111-
```bash
112-
kubectl version
113-
```
113+
```bash
114+
kubectl version
115+
```
114116

115-
然后添加helm仓库:
117+
然后添加 Helm 仓库:
116118

117-
```bash
118-
helm repo add hami-charts https://project-hami.github.io/HAMi/
119-
```
119+
```bash
120+
helm repo add hami-charts https://project-hami.github.io/HAMi/
121+
```
120122

121-
安装时需设置Kubernetes调度器镜像版本与集群版本匹配。例如集群版本为1.16.8时,使用以下命令部署:
123+
安装时需设置 Kubernetes 调度器镜像版本与集群版本匹配。例如集群版本为 1.16.8 时,使用以下命令部署:
122124

123-
```bash
124-
helm install hami hami-charts/hami \
125-
--set scheduler.kubeScheduler.imageTag=v1.16.8 \
126-
-n kube-system
127-
```
125+
```bash
126+
helm install hami hami-charts/hami \
127+
--set scheduler.kubeScheduler.imageTag=v1.16.8 \
128+
-n kube-system
129+
```
128130

129-
若一切正常,可见vgpu-device-plugin和vgpu-scheduler的Pod均处于Running状态
130-
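One way to check the Pod status, assuming the release was installed into `kube-system` as shown above (pod name prefixes may differ across HAMi versions):

```bash
kubectl get pods -n kube-system | grep -E 'vgpu-(device-plugin|scheduler)'
```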

### Demo {#demo}

#### 1. Submit a demo task {#submit-demo-task}

Containers can now request NVIDIA vGPUs through the `nvidia.com/gpu` resource type:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 1 # Request 1 vGPU
          nvidia.com/gpumem: 10240 # Each vGPU contains 10240m device memory (Optional, Integer)
```
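To run the demo, save the manifest above to a file (the name `gpu-pod.yaml` is just an example) and submit it:

```bash
kubectl apply -f gpu-pod.yaml
kubectl get pod gpu-pod
```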

-#### Verify in-container resource limits {#verify-in-container-resouce-control}
+#### 2. Verify in-container resource limits {#verify-in-container-resouce-control}

Execute the query command:

```bash
kubectl exec -it gpu-pod nvidia-smi
```

Expected output:

```text
-[HAMI-core Msg(28:140561996502848:libvgpu.c:836)]: 初始化中.....
-2024年4月10日 星期三 09:28:58
-+-----------------------------------------------------------------------------------------+
-| NVIDIA-SMI 550.54.15    驱动版本: 550.54.15    CUDA版本: 12.4 |
-|-----------------------------------------+------------------------+----------------------+
-| GPU 名称 持久化-M | 总线ID 显存.A | 易失性ECC错误 |
-| 风扇 温度 性能 功耗:使用/上限 | 显存使用率 | GPU利用率 计算模式 |
-| | | MIG模式 |
-|=========================================+========================+======================|
-| 0 Tesla V100-PCIE-32GB 启用 | 00000000:3E:00.0 关闭 | 0 |
-| N/A 29C P0 24W/250W | 0MiB/10240MiB | 0% 默认模式 |
-| | | N/A |
-+-----------------------------------------+------------------------+----------------------+
-
-+-----------------------------------------------------------------------------------------+
-| 进程: |
-| GPU GI CI 进程ID 类型 进程名称 显存使用量 |
-| ID ID |
-|=========================================================================================|
-| 未找到运行中的进程 |
-+-----------------------------------------------------------------------------------------+
-[HAMI-core Msg(28:140561996502848:multiprocess_memory_limit.c:434)]: 调用退出处理程序28
+[HAMI-core Msg(28:140561996502848:libvgpu.c:836)]: Initializing.....
+Wed Apr 10 09:28:58 2024
++-----------------------------------------------------------------------------------------+
+| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
+|-----------------------------------------+------------------------+----------------------+
+| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
+| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
+|                                         |                        |               MIG M. |
+|=========================================+========================+======================|
+|   0  Tesla V100-PCIE-32GB           On  |   00000000:3E:00.0 Off |                    0 |
+| N/A   29C    P0              24W / 250W |        0MiB / 10240MiB |      0%      Default |
+|                                         |                        |                  N/A |
++-----------------------------------------+------------------------+----------------------+
+
++-----------------------------------------------------------------------------------------+
+| Processes:                                                                              |
+|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
+|        ID   ID                                                               Usage      |
+|=========================================================================================|
+|  No running processes found                                                             |
++-----------------------------------------------------------------------------------------+
+[HAMI-core Msg(28:140561996502848:multiprocess_memory_limit.c:434)]: Calling exit handler 28
```
