---
title: Deploying HAMi with helm
---

## Table of Contents {#toc}

- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Demo](#demo)

This guide covers:

- Configuring the nvidia container runtime on every GPU node
- Installing HAMi with helm
- Launching a vGPU task
- Verifying that device resources are limited inside the container

## Prerequisites {#prerequisites}

- [Helm](https://helm.sh/zh/docs/) v3+
- [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/) v1.16+
- [CUDA](https://developer.nvidia.com/cuda-toolkit) v10.2+
- [NVIDIA driver](https://www.nvidia.cn/drivers/unix/) v440+
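
Optionally, verify these prerequisites on each GPU node before proceeding. This is a minimal sanity check; the versions your machines report will differ:

```bash
# Driver version (should be >= 440) and GPU model.
nvidia-smi --query-gpu=driver_version,name --format=csv

# Client tooling versions (Helm v3+, kubectl v1.16+).
helm version
kubectl version --client
```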

## Installation {#installation}

### 1. Configure nvidia-container-toolkit {#configure-nvidia-container-toolkit}

Perform the following steps on all GPU nodes.

This document assumes that the NVIDIA driver and `nvidia-container-toolkit` are already installed, and that `nvidia-container-runtime` has been configured as the default low-level runtime.

See: [nvidia-container-toolkit installation guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)

#### Example for Debian-based systems with `Docker` and `containerd` {#example-for-debian-based-systems-with-docker-and-containerd}

##### Install the `nvidia-container-toolkit` {#install-the-nvidia-container-toolkit}

```bash
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sudo tee /etc/apt/sources.list.d/libnvidia-container.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
```
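
A quick way to confirm the toolkit installed is to print its CLI version; this assumes the package placed `nvidia-container-cli` on your PATH:

```bash
# Should print the libnvidia-container CLI version and build info.
nvidia-container-cli --version
```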

##### Configure `Docker` {#configure-docker}

When running `Kubernetes` with `Docker`, edit the configuration file (usually at `/etc/docker/daemon.json`) to set `nvidia-container-runtime` as the default low-level runtime:

```json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```

Then restart `Docker`:

```bash
sudo systemctl daemon-reload && sudo systemctl restart docker
```
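
You can verify that `Docker` picked up the new default runtime from its info output, e.g.:

```bash
# Expect both a "nvidia" entry under Runtimes and "Default Runtime: nvidia".
docker info | grep -i runtime
```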

##### Configure `containerd` {#configure-containerd}

When running `Kubernetes` with `containerd`, edit the configuration file (usually at `/etc/containerd/config.toml`) to set `nvidia-container-runtime` as the default low-level runtime:

```toml
version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
```

Then restart `containerd`:

```bash
sudo systemctl daemon-reload && sudo systemctl restart containerd
```
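
To verify that `containerd` loaded the change, you can dump the merged configuration and look for the nvidia entry, e.g.:

```bash
# The merged config should contain default_runtime_name = "nvidia".
containerd config dump | grep default_runtime_name
```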

### 2. Label your nodes {#label-your-nodes}

Mark your GPU nodes as schedulable for HAMi tasks by adding the "gpu=on" label. Nodes without this label will not be managed by the scheduler.

```bash
kubectl label nodes {node-id} gpu=on
```
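
To confirm which nodes the scheduler will manage, list the nodes carrying the label:

```bash
# Only nodes labeled gpu=on should appear in this list.
kubectl get nodes -l gpu=on
```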

### 3. Deploy HAMi using helm {#deploy-hami-using-helm}

First, confirm your Kubernetes version with the following command:

```bash
kubectl version
```

Then add the helm repository:

```bash
helm repo add hami-charts https://project-hami.github.io/HAMi/
```
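
Optionally, refresh the local index and list the chart versions available before installing; a quick sketch:

```bash
# Refresh the local chart index.
helm repo update

# List HAMi chart versions available in the repository.
helm search repo hami-charts/hami --versions
```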

During installation, set the Kubernetes scheduler image version to match your cluster version. For example, if the cluster version is 1.16.8, deploy with:

```bash
helm install hami hami-charts/hami \
  --set scheduler.kubeScheduler.imageTag=v1.16.8 \
  -n kube-system
```

If everything works, you will see both the vgpu-device-plugin and vgpu-scheduler pods in the Running state.
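
A check along these lines should confirm it (exact pod names depend on the chart version, so treat the grep pattern as an assumption):

```bash
# Both pods should report STATUS=Running once images are pulled.
kubectl get pods -n kube-system | grep -E 'vgpu|hami'
```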

## Demo {#demo}

### 1. Submit a demo task {#submit-demo-task}

Containers can now request NVIDIA vGPUs through the `nvidia.com/gpu` resource type:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 1 # request 1 vGPU
          nvidia.com/gpumem: 10240 # each vGPU gets 10240 MB of device memory (optional, integer)
```
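
To run the demo, save the manifest to a file (the name `gpu-pod.yaml` is arbitrary), apply it, and wait for the pod to become ready:

```bash
# Create the pod and block until it is Ready (or the timeout expires).
kubectl apply -f gpu-pod.yaml
kubectl wait --for=condition=Ready pod/gpu-pod --timeout=120s
```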

### 2. Verify in-container resource control {#verify-in-container-resource-control}

Run the following query inside the pod:

```bash
kubectl exec -it gpu-pod -- nvidia-smi
```

Expected output:

```text
[HAMI-core Msg(28:140561996502848:libvgpu.c:836)]: Initializing.....
Wed Apr 10 09:28:58 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-PCIE-32GB           On  |   00000000:3E:00.0 Off |                    0 |
| N/A   29C    P0             24W /  250W |       0MiB /  10240MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
[HAMI-core Msg(28:140561996502848:multiprocess_memory_limit.c:434)]: Calling exit handler 28
```