|
1 | | ---- |
2 | | -title: 使用helm部署HAMi |
3 | | ---- |
| 1 | +--- |
| 2 | +title: 使用 Helm 部署 HAMi |
| 3 | +--- |
4 | 4 |
|
5 | | -## 目录 {#toc} |
| 5 | +## 目录 {#toc} |
6 | 6 |
|
7 | | -- [先决条件](#prerequisites) |
8 | | -- [安装步骤](#installation) |
9 | | -- [演示](#demo) |
| 7 | +- [先决条件](#prerequisites) |
| 8 | +- [安装步骤](#installation) |
| 9 | +- [演示](#demo) |
10 | 10 |
|
11 | | -本指南将涵盖: |
| 11 | +本指南将涵盖: |
12 | 12 |
|
13 | | -- 为每个GPU节点配置nvidia容器运行时 |
14 | | -- 使用helm安装HAMi |
15 | | -- 启动vGPU任务 |
16 | | -- 验证容器内设备资源是否受限 |
| 13 | +- 为每个 GPU 节点配置 NVIDIA 容器运行时 |
| 14 | +- 使用 Helm 安装 HAMi |
| 15 | +- 启动 vGPU 任务 |
| 16 | +- 验证容器内设备资源是否受限 |
17 | 17 |
|
18 | | -## 先决条件 {#prerequisites} |
| 18 | +## 先决条件 {#prerequisites} |
19 | 19 |
|
20 | | -- [Helm](https://helm.sh/zh/docs/) v3+版本 |
21 | | -- [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/) v1.16+版本 |
22 | | -- [CUDA](https://developer.nvidia.com/cuda-toolkit) v10.2+版本 |
23 | | -- [NVIDIA驱动](https://www.nvidia.cn/drivers/unix/) v440+版本 |
| 20 | +- [Helm](https://helm.sh/zh/docs/) v3+ |
| 21 | +- [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/) v1.16+ |
| 22 | +- [CUDA](https://developer.nvidia.com/cuda-toolkit) v10.2+ |
| 23 | +- [NVIDIA 驱动](https://www.nvidia.cn/drivers/unix/) v440+ |
24 | 24 |
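| | +可用类似下面的命令快速确认本机是否满足上述版本要求(仅为示意,请按实际环境调整): |
| | +
| | +```bash |
| | +# 查看 Helm 与 kubectl 客户端版本 |
| | +helm version --short |
| | +kubectl version --client |
| | +# 查看 NVIDIA 驱动版本(需已安装驱动) |
| | +nvidia-smi --query-gpu=driver_version --format=csv,noheader |
| | +``` |
| | +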
|
25 | | -## 安装步骤 {#installation} |
| 25 | +## 安装步骤 {#installation} |
26 | 26 |
|
27 | | -### 1. 配置nvidia-container-toolkit {#configure-nvidia-container-toolkit} |
| 27 | +### 1. 配置 nvidia-container-toolkit {#configure-nvidia-container-toolkit} |
28 | 28 |
|
29 | | -<summary> 配置nvidia-container-toolkit </summary> |
| 29 | +<summary> 配置 nvidia-container-toolkit </summary> |
30 | 30 |
|
31 | | -在所有GPU节点执行以下操作。 |
| 31 | +在所有 GPU 节点执行以下操作。 |
32 | 32 |
|
33 | | -本文档假设已预装NVIDIA驱动和`nvidia-container-toolkit`,并已将`nvidia-container-runtime`配置为默认底层运行时。 |
| 33 | +本文档假设已预装 NVIDIA 驱动和 `nvidia-container-toolkit`,并已将 `nvidia-container-runtime` 配置为默认底层运行时。 |
34 | 34 |
|
35 | | -参考:[nvidia-container-toolkit安装指南](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) |
| 35 | +参考:[nvidia-container-toolkit 安装指南](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) |
36 | 36 |
|
37 | | -#### 基于Debian系统(使用`Docker`和`containerd`)示例 {#example-for-debian-based-systems-with-docker-and-containerd} |
| 37 | +#### 基于 Debian 系统(使用 `Docker` 和 `containerd`)示例 {#example-for-debian-based-systems-with-docker-and-containerd} |
38 | 38 |
|
39 | | -##### 安装`nvidia-container-toolkit` {#install-the-nvidia-container-toolkit} |
| 39 | +##### 安装 `nvidia-container-toolkit` {#install-the-nvidia-container-toolkit} |
40 | 40 |
|
41 | | -```bash |
42 | | -distribution=$(. /etc/os-release;echo $ID$VERSION_ID) |
43 | | -curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add - |
44 | | -curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \ |
45 | | - sudo tee /etc/apt/sources.list.d/libnvidia-container.list |
| 41 | +```bash |
| 42 | +distribution=$(. /etc/os-release;echo $ID$VERSION_ID) |
| 43 | +curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add - |
| 44 | +curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \ |
| 45 | + sudo tee /etc/apt/sources.list.d/libnvidia-container.list |
46 | 46 |
|
47 | | -sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit |
48 | | -``` |
| 47 | +sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit |
| 48 | +``` |
49 | 49 |
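| | +安装完成后,可用以下命令确认相关组件已就位(示意命令,假设 nvidia-ctk 已随 nvidia-container-toolkit 一同安装): |
| | +
| | +```bash |
| | +# 确认工具链与运行时二进制可用 |
| | +nvidia-ctk --version |
| | +which nvidia-container-runtime |
| | +``` |
| | +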
|
50 | | -##### 配置`Docker` {#configure-docker} |
| 50 | +##### 配置 `Docker` {#configure-docker} |
51 | 51 |
|
52 | | -当使用`Docker`运行`Kubernetes`时,编辑配置文件(通常位于`/etc/docker/daemon.json`),将`nvidia-container-runtime`设为默认底层运行时: |
| 52 | +当使用 `Docker` 运行 `Kubernetes` 时,编辑配置文件(通常位于 `/etc/docker/daemon.json`),将 |
| 53 | +`nvidia-container-runtime` 设为默认底层运行时: |
53 | 54 |
|
54 | | -```json |
55 | | -{ |
56 | | - "default-runtime": "nvidia", |
57 | | - "runtimes": { |
58 | | - "nvidia": { |
59 | | - "path": "/usr/bin/nvidia-container-runtime", |
60 | | - "runtimeArgs": [] |
61 | | - } |
62 | | - } |
63 | | -} |
64 | | -``` |
| 55 | +```json |
| 56 | +{ |
| 57 | + "default-runtime": "nvidia", |
| 58 | + "runtimes": { |
| 59 | + "nvidia": { |
| 60 | + "path": "/usr/bin/nvidia-container-runtime", |
| 61 | + "runtimeArgs": [] |
| 62 | + } |
| 63 | + } |
| 64 | +} |
| 65 | +``` |
65 | 66 |
|
66 | | -然后重启`Docker`: |
| 67 | +然后重启 `Docker`: |
67 | 68 |
|
68 | | -```bash |
69 | | -sudo systemctl daemon-reload && systemctl restart docker |
70 | | -``` |
| 69 | +```bash |
| 70 | +sudo systemctl daemon-reload && systemctl restart docker |
| 71 | +``` |
71 | 72 |
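| | +重启后,可检查 Docker 的默认运行时是否已切换(示意命令,预期输出包含 `Default Runtime: nvidia`): |
| | +
| | +```bash |
| | +# 查看 Docker 当前的默认运行时 |
| | +docker info | grep -i 'default runtime' |
| | +``` |
| | +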
|
72 | | -##### 配置`containerd` {#configure-containerd} |
| 73 | +##### 配置 `containerd` {#configure-containerd} |
73 | 74 |
|
74 | | -当使用`containerd`运行`Kubernetes`时,修改配置文件(通常位于`/etc/containerd/config.toml`),将`nvidia-container-runtime`设为默认底层运行时: |
| 75 | +当使用 `containerd` 运行 `Kubernetes` 时,修改配置文件(通常位于 `/etc/containerd/config.toml`),将 |
| 76 | +`nvidia-container-runtime` 设为默认底层运行时: |
75 | 77 |
|
76 | | -```toml |
77 | | -version = 2 |
78 | | -[plugins] |
79 | | - [plugins."io.containerd.grpc.v1.cri"] |
80 | | - [plugins."io.containerd.grpc.v1.cri".containerd] |
81 | | - default_runtime_name = "nvidia" |
| 78 | +```toml |
| 79 | +version = 2 |
| 80 | +[plugins] |
| 81 | + [plugins."io.containerd.grpc.v1.cri"] |
| 82 | + [plugins."io.containerd.grpc.v1.cri".containerd] |
| 83 | + default_runtime_name = "nvidia" |
82 | 84 |
|
83 | | - [plugins."io.containerd.grpc.v1.cri".containerd.runtimes] |
84 | | - [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia] |
85 | | - privileged_without_host_devices = false |
86 | | - runtime_engine = "" |
87 | | - runtime_root = "" |
88 | | - runtime_type = "io.containerd.runc.v2" |
89 | | - [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options] |
90 | | - BinaryName = "/usr/bin/nvidia-container-runtime" |
91 | | -``` |
| 85 | + [plugins."io.containerd.grpc.v1.cri".containerd.runtimes] |
| 86 | + [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia] |
| 87 | + privileged_without_host_devices = false |
| 88 | + runtime_engine = "" |
| 89 | + runtime_root = "" |
| 90 | + runtime_type = "io.containerd.runc.v2" |
| 91 | + [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options] |
| 92 | + BinaryName = "/usr/bin/nvidia-container-runtime" |
| 93 | +``` |
92 | 94 |
|
93 | | -然后重启`containerd`: |
| 95 | +然后重启 `containerd`: |
94 | 96 |
|
95 | | -```bash |
96 | | -sudo systemctl daemon-reload && systemctl restart containerd |
97 | | -``` |
| 97 | +```bash |
| 98 | +sudo systemctl daemon-reload && systemctl restart containerd |
| 99 | +``` |
98 | 100 |
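| | +同样,可确认 containerd 的默认运行时配置已生效(示意命令): |
| | +
| | +```bash |
| | +# 预期输出 default_runtime_name = "nvidia" |
| | +sudo containerd config dump | grep default_runtime_name |
| | +``` |
| | +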
|
99 | | -#### 2. 标记节点 {#label-your-nodes} |
| 101 | +### 2. 标记节点 {#label-your-nodes} |
100 | 102 |
|
101 | | -通过添加"gpu=on"标签将GPU节点标记为可调度HAMi任务。未标记的节点将无法被调度器管理。 |
| 103 | +为需要运行 HAMi 任务的 GPU 节点添加 `gpu=on` 标签,将其标记为可被 HAMi 调度。未打此标签的节点将不会被 HAMi 调度器管理。 |
102 | 104 |
|
103 | | -```bash |
104 | | -kubectl label nodes {节点ID} gpu=on |
105 | | -``` |
| 105 | +```bash |
| 106 | +kubectl label nodes {节点ID} gpu=on |
| 107 | +``` |
106 | 108 |
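| | +打标签后,可确认节点已带有该标签(示意命令): |
| | +
| | +```bash |
| | +# 列出所有带 gpu=on 标签的节点 |
| | +kubectl get nodes -l gpu=on |
| | +``` |
| | +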
|
107 | | -#### 3. 使用helm部署HAMi {#deploy-hami-using-helm} |
| 109 | +### 3. 使用 Helm 部署 HAMi {#deploy-hami-using-helm} |
108 | 110 |
|
109 | | -首先通过以下命令确认Kubernetes版本: |
| 111 | +首先通过以下命令确认 Kubernetes 版本: |
110 | 112 |
|
111 | | -```bash |
112 | | -kubectl version |
113 | | -``` |
| 113 | +```bash |
| 114 | +kubectl version |
| 115 | +``` |
114 | 116 |
|
115 | | -然后添加helm仓库: |
| 117 | +然后添加 Helm 仓库: |
116 | 118 |
|
117 | | -```bash |
118 | | -helm repo add hami-charts https://project-hami.github.io/HAMi/ |
119 | | -``` |
| 119 | +```bash |
| 120 | +helm repo add hami-charts https://project-hami.github.io/HAMi/ |
| 121 | +``` |
120 | 122 |
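| | +添加仓库后,可更新本地索引并确认 Chart 可以被搜索到(示意命令): |
| | +
| | +```bash |
| | +helm repo update |
| | +helm search repo hami-charts |
| | +``` |
| | +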
|
121 | | -安装时需设置Kubernetes调度器镜像版本与集群版本匹配。例如集群版本为1.16.8时,使用以下命令部署: |
| 123 | +安装时需设置 Kubernetes 调度器镜像版本与集群版本匹配。例如集群版本为 1.16.8 时,使用以下命令部署: |
122 | 124 |
|
123 | | -```bash |
124 | | -helm install hami hami-charts/hami \ |
125 | | - --set scheduler.kubeScheduler.imageTag=v1.16.8 \ |
126 | | - -n kube-system |
127 | | -``` |
| 125 | +```bash |
| 126 | +helm install hami hami-charts/hami \ |
| 127 | + --set scheduler.kubeScheduler.imageTag=v1.16.8 \ |
| 128 | + -n kube-system |
| 129 | +``` |
128 | 130 |
|
129 | | -若一切正常,可见vgpu-device-plugin和vgpu-scheduler的Pod均处于Running状态 |
130 | | - |
131 | | -### 演示 {#demo} |
132 | | - |
133 | | -#### 1. 提交演示任务 {#submit-demo-task} |
134 | | - |
135 | | -容器现在可通过`nvidia.com/gpu`资源类型申请NVIDIA vGPU: |
136 | | - |
137 | | -```yaml |
138 | | -apiVersion: v1 |
139 | | -kind: Pod |
140 | | -metadata: |
141 | | - name: gpu-pod |
142 | | -spec: |
143 | | - containers: |
144 | | - - name: ubuntu-container |
145 | | - image: ubuntu:18.04 |
146 | | - command: ["bash", "-c", "sleep 86400"] |
147 | | - resources: |
148 | | - limits: |
149 | | - nvidia.com/gpu: 1 # 申请1个vGPU |
150 | | - nvidia.com/gpumem: 10240 # 每个vGPU包含10240m设备内存(可选,整型) |
151 | | -``` |
152 | | -
|
153 | | -#### 验证容器内资源限制 {#verify-in-container-resouce-control} |
154 | | -
|
155 | | -执行查询命令: |
156 | | -
|
157 | | -```bash |
158 | | -kubectl exec -it gpu-pod nvidia-smi |
159 | | -``` |
160 | | - |
161 | | -预期输出: |
162 | | - |
163 | | -```text |
164 | | -[HAMI-core Msg(28:140561996502848:libvgpu.c:836)]: 初始化中..... |
165 | | -2024年4月10日 星期三 09:28:58 |
166 | | -+-----------------------------------------------------------------------------------------+ |
167 | | -| NVIDIA-SMI 550.54.15 驱动版本: 550.54.15 CUDA版本: 12.4 | |
168 | | -|-----------------------------------------+------------------------+----------------------+ |
169 | | -| GPU 名称 持久化-M | 总线ID 显存.A | 易失性ECC错误 | |
170 | | -| 风扇 温度 性能 功耗:使用/上限 | 显存使用率 | GPU利用率 计算模式 | |
171 | | -| | | MIG模式 | |
172 | | -|=========================================+========================+======================| |
173 | | -| 0 Tesla V100-PCIE-32GB 启用 | 00000000:3E:00.0 关闭 | 0 | |
174 | | -| N/A 29C P0 24W/250W | 0MiB/10240MiB | 0% 默认模式 | |
175 | | -| | | N/A | |
176 | | -+-----------------------------------------+------------------------+----------------------+ |
177 | | -
|
178 | | -+-----------------------------------------------------------------------------------------+ |
179 | | -| 进程: | |
180 | | -| GPU GI CI 进程ID 类型 进程名称 显存使用量 | |
181 | | -| ID ID | |
182 | | -|=========================================================================================| |
183 | | -| 未找到运行中的进程 | |
184 | | -+-----------------------------------------------------------------------------------------+ |
185 | | -[HAMI-core Msg(28:140561996502848:multiprocess_memory_limit.c:434)]: 调用退出处理程序28 |
186 | | -``` |
| 131 | +若一切正常,可见 vgpu-device-plugin 和 vgpu-scheduler 的 Pod 均处于 Running 状态。 |
| 132 | + |
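| | +可用类似下面的命令查看相关组件的状态(示意命令,Pod 名称以实际 Chart 版本为准): |
| | +
| | +```bash |
| | +# 过滤出 HAMi 相关组件的 Pod |
| | +kubectl get pods -n kube-system | grep -E 'vgpu|hami' |
| | +``` |
| | +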
| 133 | +## 演示 {#demo} |
| 134 | + |
| 135 | +### 1. 提交演示任务 {#submit-demo-task} |
| 136 | + |
| 137 | +容器现在可通过 `nvidia.com/gpu` 资源类型申请 NVIDIA vGPU: |
| 138 | + |
| 139 | +```yaml |
| 140 | +apiVersion: v1 |
| 141 | +kind: Pod |
| 142 | +metadata: |
| 143 | + name: gpu-pod |
| 144 | +spec: |
| 145 | + containers: |
| 146 | + - name: ubuntu-container |
| 147 | + image: ubuntu:18.04 |
| 148 | + command: ["bash", "-c", "sleep 86400"] |
| 149 | + resources: |
| 150 | + limits: |
| 151 | + nvidia.com/gpu: 1 # 申请 1 个 vGPU |
| 152 | +        nvidia.com/gpumem: 10240 # 每个 vGPU 包含 10240MB 设备显存(可选,整型) |
| 153 | +``` |
| 154 | +
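| | +可将上述清单保存为 gpu-pod.yaml(文件名仅为示例)并提交到集群: |
| | +
| | +```bash |
| | +kubectl apply -f gpu-pod.yaml |
| | +# 等待 Pod 进入 Running 状态 |
| | +kubectl get pod gpu-pod |
| | +``` |
| | +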
|
| 155 | +### 2. 验证容器内资源限制 {#verify-in-container-resouce-control} |
| 156 | +
|
| 157 | +执行查询命令: |
| 158 | +
|
| 159 | +```bash |
| 160 | +kubectl exec -it gpu-pod -- nvidia-smi |
| 161 | +``` |
| 162 | + |
| 163 | +预期输出: |
| 164 | + |
| 165 | +```text |
| 166 | +[HAMI-core Msg(28:140561996502848:libvgpu.c:836)]: Initializing..... |
| 167 | +Wed Apr 10 09:28:58 2024 |
| 168 | ++-----------------------------------------------------------------------------------------+ |
| 169 | +| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 | |
| 170 | +|-----------------------------------------+------------------------+----------------------+ |
| 171 | +| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | |
| 172 | +| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | |
| 173 | +| | | MIG M. | |
| 174 | +|=========================================+========================+======================| |
| 175 | +| 0 Tesla V100-PCIE-32GB On | 00000000:3E:00.0 Off | 0 | |
| 176 | +| N/A 29C P0 24W / 250W | 0MiB / 10240MiB | 0% Default | |
| 177 | +| | | N/A | |
| 178 | ++-----------------------------------------+------------------------+----------------------+ |
| 179 | +
|
| 180 | ++-----------------------------------------------------------------------------------------+ |
| 181 | +| Processes: | |
| 182 | +| GPU GI CI PID Type Process name GPU Memory | |
| 183 | +| ID ID Usage | |
| 184 | +|=========================================================================================| |
| 185 | +| No running processes found | |
| 186 | ++-----------------------------------------------------------------------------------------+ |
| 187 | +[HAMI-core Msg(28:140561996502848:multiprocess_memory_limit.c:434)]: Calling exit handler 28 |
| 188 | +``` |
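| | +
| | +注意:上述输出中显存总量显示为 10240MiB,而非物理 V100 的 32GB,说明 Pod 申请的 `nvidia.com/gpumem` 限制已在容器内生效。 |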