Deploying and Verifying a GPU Node in a Kubernetes Cluster

Preface

To add a GPU node to a Kubernetes cluster from an existing VM with a passthrough GPU, three steps are required beyond the usual node setup:

  • Install the NVIDIA driver
  • Install nvidia-docker2
  • Deploy the NVIDIA device plugin

Installing the NVIDIA driver

Download the NVIDIA driver

The drivers are free. Pick the driver that matches your GPU model from the official driver download page.

(Figure: NVIDIA driver download page)

Disable the nouveau driver

Edit the modprobe blacklist:

vi /etc/modprobe.d/blacklist.conf

Append the following two lines at the end:

blacklist nouveau
options nouveau modeset=0

Regenerate the kernel initramfs:

sudo update-initramfs -u

Reboot the node VM:

reboot

Verify after the reboot; no output means nouveau is disabled:

lsmod | grep nouveau

(Figure: nouveau successfully disabled)
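The disable-nouveau steps above can be wrapped in a small idempotent helper so re-running the setup does not duplicate the blacklist entries. A sketch (the function name is illustrative, not from the original post):

```shell
# Sketch: append the nouveau blacklist entries only if they are not
# already present, so the script can be re-run safely.
disable_nouveau() {
    conf="$1"   # e.g. /etc/modprobe.d/blacklist.conf
    if ! grep -qs '^blacklist nouveau$' "$conf"; then
        printf 'blacklist nouveau\noptions nouveau modeset=0\n' >> "$conf"
    fi
}

# On the real node (requires root), then regenerate initramfs and reboot:
# disable_nouveau /etc/modprobe.d/blacklist.conf
# update-initramfs -u && reboot
```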

Install the driver

In this example the VM runs Ubuntu 18.04 amd64, the GPU is a Tesla V100, and driver version 440.33.01 is used.

Online installation:

apt install nvidia-driver-440 nvidia-utils-440 nvidia-settings

Offline installation (run the downloaded installer in silent mode):

./NVIDIA-Linux-x86_64-{{ gpu_version }}.run -s

Verify the driver installation:

nvidia-smi

With the driver correctly installed, the output looks like:

(Figure: nvidia-smi output)

This completes the NVIDIA driver installation.

Installing nvidia-docker2

Docker 18.06 does not support GPU containers, so nvidia-docker2 must be installed to let containers use NVIDIA GPUs.
Note: Docker and the NVIDIA driver must be installed first; CUDA is not required.

Online installation:

curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add -

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)

curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | tee /etc/apt/sources.list.d/nvidia-docker.list

apt-get update

apt-get install -y nvidia-docker2

systemctl restart docker

Offline installation:
On a machine with Internet access, run:

$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)

$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -

$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

$ sudo apt-get update

Download the five packages:

apt download libnvidia-container1

apt download libnvidia-container-tools

apt download nvidia-container-toolkit

apt download nvidia-container-runtime

apt download nvidia-docker2

Copy the downloaded packages to the target node VM and install them in order:

dpkg -i libnvidia-container1_1.0.7-1_amd64.deb && dpkg -i libnvidia-container-tools_1.0.7-1_amd64.deb && dpkg -i nvidia-container-toolkit_1.0.5-1_amd64.deb && dpkg -i nvidia-container-runtime_3.1.4-1_amd64.deb && dpkg -i nvidia-docker2_2.2.2-1_all.deb

Set the Docker default runtime on the GPU node to nvidia-container-runtime:

vi /etc/docker/daemon.json

Add the following to the configuration file:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

Restart Docker:

systemctl restart docker
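The edit-and-restart step can also be scripted. A sketch that writes the daemon.json shown above (note: it overwrites the target file, so merge by hand if your daemon.json already has other settings):

```shell
# Sketch: write a daemon.json that makes nvidia the default runtime.
# Overwrites the target file; merge manually if other options exist.
write_daemon_json() {
    cat > "$1" <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
}

# On the GPU node:
# write_daemon_json /etc/docker/daemon.json && systemctl restart docker
```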

Verify the installation:

docker info

(Figure: docker info output showing the nvidia runtime)

This completes the nvidia-docker2 installation.

Deploying the nvidia-device-plugin DaemonSet

Note: this example uses version 1.0.0-beta6; see the NVIDIA k8s-device-plugin project on GitHub for all available versions.

Online installation:

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta6/nvidia-device-plugin.yml

The official NVIDIA manifest is:

# Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
 
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      # This toleration is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvidia/k8s-device-plugin:1.0.0-beta6
        name: nvidia-device-plugin-ctr
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins

Note: you can label the GPU nodes and add a nodeSelector or nodeAffinity to nvidia-device-plugin-daemonset.yaml so the plugin runs only on those nodes.

After the steps above, verify the GPU node:

kubectl get node {nodeName} -o yaml

(Figure: node details showing nvidia.com/gpu)

The node details now report a value for nvidia.com/gpu, which means the configuration took effect: GPU resources are exposed to the Kubernetes cluster as whole cards. Alternatively, GPUs can be exposed by memory size to enable shared GPU scheduling.
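To confirm scheduling works end to end, you can submit a pod that requests one GPU and check that nvidia-smi runs inside it. A minimal sketch (the pod name, image tag, and command are illustrative; pick a CUDA image compatible with your driver version):

```shell
# Sketch: print a minimal pod manifest that requests one NVIDIA GPU.
gpu_test_pod() {
    cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda-smoke
    image: nvidia/cuda:10.2-base
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
}

# On the cluster:
# gpu_test_pod | kubectl apply -f -
# kubectl logs gpu-test    # should print the nvidia-smi table
```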
This completes the deployment and verification of a GPU node in a Kubernetes cluster.
