Introduction
Kubeflow is a toolkit for developing, training, tuning, deploying, and managing machine learning workloads on top of Kubernetes. It integrates many open-source projects from the ML ecosystem, such as Jupyter, TF Serving, Katib, Fairing, and Argo, and covers the different stages of a machine learning workflow: data preprocessing, model training, model serving, and service management.
I. Basic Environment Preparation
k8s version: v1.20.5
docker version: v19.03.15
kfctl version: v1.2.0-0-gbc038f9
kustomize version: v4.1.3
I am not certain that Kubeflow 1.2.0 is fully compatible with Kubernetes v1.20.5; this is only a test setup for now.
For version compatibility, see: https://www.kubeflow.org/docs/distributions/kfctl/overview#minimum-system-requirements
1. Install kfctl
kfctl is the control-plane tool for deploying and managing Kubeflow. The main deployment pattern is to use kfctl as a CLI, configuring a KFDef manifest for your flavor of Kubernetes and using it to deploy and manage Kubeflow.
wget https://github.com/kubeflow/kfctl/releases/download/v1.2.0/kfctl_v1.2.0-0-gbc038f9_linux.tar.gz
tar -xvf kfctl_v1.2.0-0-gbc038f9_linux.tar.gz
chmod 755 kfctl
cp kfctl /usr/bin
kfctl version
2. Install kustomize
Kustomize is a configuration-management tool based on layering: it keeps an application's base manifests untouched and applies declarative YAML overlays (called patches) that selectively override the defaults without modifying the original files.
Download: https://github.com/kubernetes-sigs/kustomize/releases
wget https://github.com/kubernetes-sigs/kustomize/releases/download/kustomize%2Fv4.1.3/kustomize_v4.1.3_linux_amd64.tar.gz
tar -xzvf kustomize_v4.1.3_linux_amd64.tar.gz
chmod 755 kustomize
mv kustomize /usr/bin/
kustomize version
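As a small, self-contained illustration of that layering model (all names and paths here are made up for the example), the overlay below swaps the image of a base Deployment without editing the base file:
mkdir -p demo/base demo/overlay
cat <<'EOF' > demo/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo
spec:
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
      - name: demo
        image: gcr.io/ml-pipeline/frontend:1.0.4
EOF
cat <<'EOF' > demo/base/kustomization.yaml
resources:
- deployment.yaml
EOF
cat <<'EOF' > demo/overlay/kustomization.yaml
resources:
- ../base
images:
- name: gcr.io/ml-pipeline/frontend
  newName: my-registry.example.com/ml-pipeline/frontend
  newTag: 1.0.4
EOF
kustomize build demo/overlay   # prints the Deployment with the image rewritten, base file untouched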
II. Internet-Based Deployment
If your servers can reach the public internet, you can run the installation directly.
This test deployment used a machine in Alibaba Cloud US West 1 (Silicon Valley).
1. Create a working directory for Kubeflow
mkdir /apps/kubeflow
cd /apps/kubeflow
2. Configure a StorageClass
# cat storageclass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: alicloud-nas
mountOptions:
- nolock,tcp,noresvport
- vers=3
parameters:
  volumeAs: subpath
  server: "*********.us-west-1.nas.aliyuncs.com:/nasroot1/"   # Alibaba Cloud NAS storage is used here
  archiveOnDelete: "false"
provisioner: nasplugin.csi.alibabacloud.com
reclaimPolicy: Retain
3. Set it as the default StorageClass
# kubectl get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
alicloud-nas nasplugin.csi.alibabacloud.com Retain Immediate false 24h
# setting the annotation value to "false" removes the default designation
# kubectl patch storageclass alicloud-nas -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
# kubectl get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
alicloud-nas (default) nasplugin.csi.alibabacloud.com Retain Immediate false 24h
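If another StorageClass in the cluster was already marked as default, clear its annotation as well so that there is only one default (the class name here is a placeholder):
kubectl patch storageclass <old-default-sc> -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'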
4. Install and deploy
wget https://raw.githubusercontent.com/kubeflow/manifests/v1.2-branch/kfdef/kfctl_k8s_istio.v1.2.0.yaml
kfctl apply -V -f kfctl_k8s_istio.v1.2.0.yaml
After all the pods have been created, check each namespace.
Make sure every one of the following pods is in the Running state.
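A quick way to spot anything that is not Running across all namespaces (Succeeded covers the Completed job pods, which are fine):
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded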
# kubectl get pods -n cert-manager
NAME READY STATUS RESTARTS AGE
cert-manager-7c75b559c4-c2hhj 1/1 Running 0 23h
cert-manager-cainjector-7f964fd7b5-mxbjl 1/1 Running 0 23h
cert-manager-webhook-566dd99d6-6vvzv 1/1 Running 2 23h
# kubectl get pods -n istio-system
NAME READY STATUS RESTARTS AGE
cluster-local-gateway-5898bc5c74-822c9 1/1 Running 0 23h
cluster-local-gateway-5898bc5c74-b5tmr 1/1 Running 0 23h
cluster-local-gateway-5898bc5c74-fpswf 1/1 Running 0 23h
istio-citadel-6dffd79d7-4scx7 1/1 Running 0 23h
istio-galley-77cb9b44dc-6l4lm 1/1 Running 0 23h
istio-ingressgateway-7bb77f89b8-psqcm 1/1 Running 0 23h
istio-nodeagent-5qsmg 1/1 Running 0 23h
istio-nodeagent-ccc8j 1/1 Running 0 23h
istio-nodeagent-gqrsl 1/1 Running 0 23h
istio-pilot-67d94fc954-vl2sx 2/2 Running 0 23h
istio-policy-546596d4b4-6ct59 2/2 Running 1 23h
istio-security-post-install-release-1.3-latest-daily-qbrf6 0/1 Completed 0 23h
istio-sidecar-injector-796b6454d9-lv8dg 1/1 Running 0 23h
istio-telemetry-58f9cd4bf5-8cjj5 2/2 Running 1 23h
prometheus-7c6d764c48-s29kn 1/1 Running 0 23h
# kubectl get pods -n knative-serving
NAME READY STATUS RESTARTS AGE
activator-6c87fcbbb6-f4cs2 1/1 Running 0 23h
autoscaler-847b9f89dc-5jvml 1/1 Running 0 23h
controller-55f67c9ddb-67vvc 1/1 Running 0 23h
istio-webhook-db664df87-jn72n 1/1 Running 0 23h
networking-istio-76f8cc7796-9jr2j 1/1 Running 0 23h
webhook-6bff77594b-2r2gx 1/1 Running 0 23h
# kubectl get pods -n kubeflow
NAME READY STATUS RESTARTS AGE
admission-webhook-bootstrap-stateful-set-0 1/1 Running 4 23h
admission-webhook-deployment-5cd7dc96f5-fw7d4 1/1 Running 2 23h
application-controller-stateful-set-0 1/1 Running 0 23h
argo-ui-65df8c7c84-qwtc8 1/1 Running 0 23h
cache-deployer-deployment-5f4979f45-2xqbf 2/2 Running 2 23h
cache-server-7859fd67f5-hplhm 2/2 Running 0 23h
centraldashboard-67767584dc-j9ffz 1/1 Running 0 23h
jupyter-web-app-deployment-8486d5ffff-hmbz4 1/1 Running 0 23h
katib-controller-7fcc95676b-rn98v 1/1 Running 1 23h
katib-db-manager-85db457c64-jx97j 1/1 Running 0 23h
katib-mysql-6c7f7fb869-bt87c 1/1 Running 0 23h
katib-ui-65dc4cf6f5-nhmsg 1/1 Running 0 23h
kfserving-controller-manager-0 2/2 Running 0 23h
kubeflow-pipelines-profile-controller-797fb44db9-rqzmg 1/1 Running 0 23h
metacontroller-0 1/1 Running 0 23h
metadata-db-6dd978c5b-zzntn 1/1 Running 0 23h
metadata-envoy-deployment-67bd5954c-zvpf4 1/1 Running 0 23h
metadata-grpc-deployment-577c67c96f-zjt7w 1/1 Running 3 23h
metadata-writer-756dbdd478-dm4j4 2/2 Running 0 23h
minio-54d995c97b-4rm2d 1/1 Running 0 23h
ml-pipeline-7c56db5db9-fprrw 2/2 Running 1 23h
ml-pipeline-persistenceagent-d984c9585-vrd4g 2/2 Running 0 23h
ml-pipeline-scheduledworkflow-5ccf4c9fcc-9qkrq 2/2 Running 0 23h
ml-pipeline-ui-7ddcd74489-95dvl 2/2 Running 0 23h
ml-pipeline-viewer-crd-56c68f6c85-tgxc2 2/2 Running 1 23h
ml-pipeline-visualizationserver-5b9bd8f6bf-4zvwt 2/2 Running 0 23h
mpi-operator-d5bfb8489-gkp5w 1/1 Running 0 23h
mxnet-operator-7576d697d6-qx7rg 1/1 Running 0 23h
mysql-74f8f99bc8-f42zn 2/2 Running 0 23h
notebook-controller-deployment-5bb6bdbd6d-rclvr 1/1 Running 0 23h
profiles-deployment-56bc5d7dcb-2nqxj 2/2 Running 0 23h
pytorch-operator-847c8d55d8-z89wh 1/1 Running 0 23h
seldon-controller-manager-6bf8b45656-b7p7g 1/1 Running 0 23h
spark-operatorsparkoperator-fdfbfd99-9k46b 1/1 Running 0 23h
spartakus-volunteer-558f8bfd47-hskwf 1/1 Running 0 23h
tf-job-operator-58477797f8-wzdcr 1/1 Running 0 23h
workflow-controller-64fd7cffc5-zs6wx 1/1 Running 0 23h
5. Access the Kubeflow UI
kubectl get svc/istio-ingressgateway -n istio-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
istio-ingressgateway NodePort 12.80.127.69 <none> 15020:32661/TCP,80:31380/TCP,443:31390/TCP,31400:31400/TCP,15029:30345/TCP,15030:32221/TCP,15031:31392/TCP,15032:31191/TCP,15443:32136/TCP 5h14m
Port-forward to your local machine for testing:
kubectl port-forward svc/istio-ingressgateway 80 -n istio-system
Then open http://localhost in a local browser.
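Binding local port 80 usually requires root; mapping a higher local port onto the gateway's port 80 works just as well, for example:
kubectl port-forward -n istio-system svc/istio-ingressgateway 8080:80
# then open http://localhost:8080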
III. Offline Deployment of Kubeflow
If your machines cannot reach the public internet, things will not go as smoothly as above.
You will see something like this instead: ImagePullBackOff.....
You would then have to inspect each pod to find out which image it uses and download that image on a machine that does have internet access.
The overall workflow is:
prepare the required images -> pull them and push them to an internal registry -> update the image addresses in the manifests-1.2.0 project -> repackage manifests-1.2.0 as v1.2.0.tar.gz -> deploy
1. Prepare the required images
quay.io/jetstack/cert-manager-cainjector:v0.11.0
quay.io/jetstack/cert-manager-webhook:v0.11.0
gcr.io/istio-release/citadel:release-1.3-latest-daily
gcr.io/istio-release/proxyv2:release-1.3-latest-daily
gcr.io/istio-release/node-agent-k8s:release-1.3-latest-daily
gcr.io/istio-release/pilot:release-1.3-latest-daily
gcr.io/istio-release/mixer:release-1.3-latest-daily
gcr.io/istio-release/kubectl:release-1.3-latest-daily
quay.io/jetstack/cert-manager-controller:v0.11.0
gcr.io/istio-release/galley:release-1.3-latest-daily
gcr.io/istio-release/sidecar_injector:release-1.3-latest-daily
gcr.io/istio-release/proxy_init:release-1.3-latest-daily
gcr.io/kubebuilder/kube-rbac-proxy:v0.4.0
gcr.io/kfserving/kfserving-controller:v0.4.1
python:3.7
metacontroller/metacontroller:v0.3.0
gcr.io/ml-pipeline/envoy:metadata-grpc
gcr.io/tfx-oss-public/ml_metadata_store_server:v0.21.1
gcr.io/ml-pipeline/persistenceagent:1.0.4
gcr.io/ml-pipeline/scheduledworkflow:1.0.4
gcr.io/ml-pipeline/frontend:1.0.4
mpioperator/mpi-operator:latest
kubeflow/mxnet-operator:v1.0.0-20200625
gcr.io/ml-pipeline/metadata-writer:1.0.4
gcr.io/ml-pipeline/visualization-server:1.0.4
gcr.io/kubeflow-images-public/notebook-controller:vmaster-g6eb007d0
gcr.io/kubeflow-images-public/pytorch-operator:vmaster-g518f9c76
docker.io/seldonio/seldon-core-operator:1.4.0
gcr.io/kubeflow-images-public/tf_operator:vmaster-gda226016
gcr.io/kubeflow-images-public/admission-webhook:v20190520-v0-139-gcee39dbc-dirty-0d8f4c
gcr.io/ml-pipeline/cache-server:1.0.4
mysql:8.0.3
gcr.io/ml-pipeline/minio:RELEASE.2019-08-14T20-37-41Z-license-compliance
gcr.io/ml-pipeline/mysql:5.6
gcr.io/kubeflow-images-public/metadata:v0.1.11
gcr.io/kubeflow-images-public/profile-controller:vmaster-ga49f658f
gcr.io/kubeflow-images-public/kfam:vmaster-g9f3bfd00
gcr.io/google_containers/spartakus-amd64:v1.1.0
argoproj/workflow-controller:v2.3.0
gcr.io/kubeflow-images-public/tensorflow-1.15.2-notebook-cpu:1.0.0
gcr.io/kubeflow-images-public/tensorflow-1.15.2-notebook-gpu:1.0.0
gcr.io/kubeflow-images-public/tensorflow-2.1.0-notebook-cpu:1.0.0
gcr.io/kubeflow-images-public/tensorflow-2.1.0-notebook-gpu:1.0.0
--- The images below come down without a tag when pulled (they are referenced by digest), so you have to tag them yourself; it is best to pull them manually, one by one.
gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:ffa3d72ee6c2eeb2357999248191a643405288061b7080381f22875cb703e929
gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler@sha256:f89fd23889c3e0ca3d8e42c9b189dc2f93aa5b3a91c64e8aab75e952a210eeb3
gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:b86ac8ecc6b2688a0e0b9cb68298220a752125d0a048b8edf2cf42403224393c
gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook@sha256:e6b142c0f82e0e0b8cb670c11eb4eef6ded827f98761bbf4bea7bdb777b80092
gcr.io/knative-releases/knative.dev/net-istio/cmd/controller@sha256:75c7918ca887622e7242ec1965f87036db1dc462464810b72735a8e64111f6f7
gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:7e6df0fda229a13219bbc90ff72a10434a0c64cd7fe13dc534b914247d1087f4
Feeling overwhelmed by this many images? Don't worry, you are not expected to download them one by one by hand.
A shell script takes care of these repetitive steps. It is best to run it on a fairly clean machine, ideally with all local images removed first, because the save step below filters on the output of docker images: if the machine is not clean, its existing local images will be saved along with everything else.
It is also recommended to run this on a machine with at least 100 GB of free disk space.
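If you do start from a clean slate, the following removes every local image first; it is destructive, so only run it on a throwaway build machine:
docker rmi -f $(docker images -aq)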
2. Create pull_images.sh
# vim pull_images.sh
#!/bin/bash
G=`tput setaf 2`
C=`tput setaf 6`
Y=`tput setaf 3`
Q=`tput sgr0`
echo -e "${C}\n\nImage download script:${Q}"
echo -e "${C}pull_images.sh reads the images listed in images.txt, pulls them, and saves them into images.tar.gz\n\n${Q}"
# Clean up existing local images (optional)
# echo "${C}start: cleaning images${Q}"
# for rm_image in $(cat images.txt)
# do
#     docker rmi $aliNexus$rm_image
# done
# echo -e "${C}end: cleanup finished\n\n${Q}"
# Create the output directory
mkdir images
# pull
echo "${C}start: pulling images...${Q}"
for pull_image in $(cat images.txt)
do
    echo "${Y} pulling $pull_image...${Q}"
    fileName=${pull_image//:/_}
    docker pull $pull_image
done
echo "${C}end: image pull finished...${Q}"
# Save the images
IMAGES_LIST=($(docker images | sed '1d' | awk '{print $1}'))
IMAGES_NM_LIST=($(docker images | sed '1d' | awk '{print $1"-"$2}'| awk -F/ '{print $NF}'))
IMAGES_NUM=${#IMAGES_LIST[*]}
echo "Image list....."
docker images
# docker images | sed '1d' | awk '{print $1}'
for((i=0;i<$IMAGES_NUM;i++))
do
    echo "saving image ${IMAGES_LIST[$i]}..."
    docker save "${IMAGES_LIST[$i]}" -o ./images/"${IMAGES_NM_LIST[$i]}".tar.gz
done
ls images
echo -e "${C}end: save finished\n\n${Q}"
# Bundle the saved images
#tag_date=$(date "+%Y%m%d%H%M")
echo "${C}start: packaging images into images.tar.gz${Q}"
tar -czvf images.tar.gz images
echo -e "${C}end: packaging finished\n\n${Q}"
# Upload the bundle to OSS; if you do not use OSS, swap in any repository reachable from your internal network
# echo "${C}start: uploading images.tar.gz to OSS${Q}"
# ossutil64 cp images.tar.gz oss://aicloud-deploy/kubeflow-images/
# echo -e "${C}end: upload finished\n\n${Q}"
# Clean up local images
read -p "${C}Clean up local images (Y/N, default N)?:${Q}" is_clean
if [ -z "${is_clean}" ];then
    is_clean="N"
fi
if [ "${is_clean}" == "Y" ];then
    rm -rf images/*
    rm -rf images.tar.gz
    for clean_image in $(cat images.txt)
    do
        docker rmi $clean_image
    done
    echo -e "${C}cleanup finished~\n\n${Q}"
fi
echo -e "${C}done~\n\n${Q}"
3. Edit images.txt, the list of images to download
# vim images.txt
quay.io/jetstack/cert-manager-cainjector:v0.11.0
quay.io/jetstack/cert-manager-webhook:v0.11.0
gcr.io/istio-release/citadel:release-1.3-latest-daily
gcr.io/istio-release/proxyv2:release-1.3-latest-daily
gcr.io/istio-release/node-agent-k8s:release-1.3-latest-daily
gcr.io/istio-release/pilot:release-1.3-latest-daily
gcr.io/istio-release/mixer:release-1.3-latest-daily
gcr.io/istio-release/kubectl:release-1.3-latest-daily
quay.io/jetstack/cert-manager-controller:v0.11.0
gcr.io/istio-release/galley:release-1.3-latest-daily
gcr.io/istio-release/sidecar_injector:release-1.3-latest-daily
gcr.io/istio-release/proxy_init:release-1.3-latest-daily
gcr.io/kubebuilder/kube-rbac-proxy:v0.4.0
gcr.io/kfserving/kfserving-controller:v0.4.1
python:3.7
metacontroller/metacontroller:v0.3.0
gcr.io/ml-pipeline/envoy:metadata-grpc
gcr.io/tfx-oss-public/ml_metadata_store_server:v0.21.1
gcr.io/ml-pipeline/persistenceagent:1.0.4
gcr.io/ml-pipeline/scheduledworkflow:1.0.4
gcr.io/ml-pipeline/frontend:1.0.4
mpioperator/mpi-operator:latest
kubeflow/mxnet-operator:v1.0.0-20200625
gcr.io/ml-pipeline/metadata-writer:1.0.4
gcr.io/ml-pipeline/visualization-server:1.0.4
gcr.io/kubeflow-images-public/notebook-controller:vmaster-g6eb007d0
gcr.io/kubeflow-images-public/pytorch-operator:vmaster-g518f9c76
docker.io/seldonio/seldon-core-operator:1.4.0
gcr.io/kubeflow-images-public/tf_operator:vmaster-gda226016
gcr.io/kubeflow-images-public/admission-webhook:v20190520-v0-139-gcee39dbc-dirty-0d8f4c
gcr.io/ml-pipeline/cache-server:1.0.4
mysql:8.0.3
gcr.io/ml-pipeline/minio:RELEASE.2019-08-14T20-37-41Z-license-compliance
gcr.io/ml-pipeline/mysql:5.6
gcr.io/kubeflow-images-public/metadata:v0.1.11
gcr.io/kubeflow-images-public/profile-controller:vmaster-ga49f658f
gcr.io/kubeflow-images-public/kfam:vmaster-g9f3bfd00
gcr.io/google_containers/spartakus-amd64:v1.1.0
argoproj/workflow-controller:v2.3.0
gcr.io/kubeflow-images-public/tensorflow-1.15.2-notebook-cpu:1.0.0
gcr.io/kubeflow-images-public/tensorflow-1.15.2-notebook-gpu:1.0.0
gcr.io/kubeflow-images-public/tensorflow-2.1.0-notebook-cpu:1.0.0
gcr.io/kubeflow-images-public/tensorflow-2.1.0-notebook-gpu:1.0.0
4. Run the script
sh pull_images.sh
The following images should be pulled manually by digest and tagged afterwards; alternatively, you can strip the script above down to just the pull section and run it on them.
docker pull gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:ffa3d72ee6c2eeb2357999248191a643405288061b7080381f22875cb703e929
docker pull gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler@sha256:f89fd23889c3e0ca3d8e42c9b189dc2f93aa5b3a91c64e8aab75e952a210eeb3
docker pull gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:b86ac8ecc6b2688a0e0b9cb68298220a752125d0a048b8edf2cf42403224393c
docker pull gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook@sha256:e6b142c0f82e0e0b8cb670c11eb4eef6ded827f98761bbf4bea7bdb777b80092
docker pull gcr.io/knative-releases/knative.dev/net-istio/cmd/controller@sha256:75c7918ca887622e7242ec1965f87036db1dc462464810b72735a8e64111f6f7
docker pull gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:7e6df0fda229a13219bbc90ff72a10434a0c64cd7fe13dc534b914247d1087f4
5. Tag the images
docker tag <image ID> <internal registry reference>
docker tag 3208baba46fc aicloud-harbor.com/library/serving/cmd/activator:v1.2.0
docker tag 4578f31842ab aicloud-harbor.com/library/serving/cmd/autoscaler:v1.2.0
docker tag d1b481df9ac3 aicloud-harbor.com/library/serving/cmd/webhook:v1.2.0
docker tag 9f8e41e19efb aicloud-harbor.com/library/serving/cmd/controller:v1.2.0
docker tag 6749b4c87ac8 aicloud-harbor.com/library/net-istio/cmd/webhook:v1.2.0
docker tag ba7fa40d9f88 aicloud-harbor.com/library/net-istio/cmd/controller:v1.2.0
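The image IDs used in the tag commands above can be found by listing the digest-pulled images; they show up without a tag, so matching on the repository path is easiest:
docker images --digests --format '{{.ID}}  {{.Repository}}@{{.Digest}}' | grep knative-releases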
6. Save the images
docker save <image reference> -o <archive name>
docker save aicloud-harbor.com/library/serving/cmd/activator:v1.2.0 -o activator-v1.2.0.tar.gz
docker save aicloud-harbor.com/library/serving/cmd/autoscaler:v1.2.0 -o autoscaler-v1.2.0.tar.gz
....
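The "...." stands for the remaining archives. As a sketch, a loop over the six references retagged in step 5 saves them all; the archive names here simply flatten the path so that the serving and net-istio webhook/controller archives do not overwrite each other:
for ref in serving/cmd/activator serving/cmd/autoscaler serving/cmd/webhook serving/cmd/controller \
           net-istio/cmd/webhook net-istio/cmd/controller; do
    docker save aicloud-harbor.com/library/${ref}:v1.2.0 -o $(echo ${ref} | tr '/' '-')-v1.2.0.tar.gz
done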
Then push the images to an internal registry such as a self-hosted Harbor, which is what I do here. You could also copy the image archives straight onto the deployment servers, but then they have to be loaded on every node, otherwise a pod that gets rescheduled to another node will fail to pull its image again...
If the machine you pulled the images on cannot reach the internal registry, you additionally have to download the bundle locally and then upload it to the internal Harbor server. If your registry is on Alibaba Cloud, this is even more convenient.
The push script is below; it loads the archives from images.tar.gz and pushes everything in the local docker image list to the internal registry.
7. Create push_images.sh
vim push_images.sh
#!/bin/bash
G=`tput setaf 2`
C=`tput setaf 6`
Y=`tput setaf 3`
Q=`tput sgr0`
echo -e "${C}\n\nImage upload script:${Q}"
echo -e "${C}push_images.sh loads the images saved in images.tar.gz and pushes them to the internal registry\n\n${Q}"
# Ask for the internal registry address
read -p "${C}Internal registry address (default aicloud-harbor.com/library):${Q}" nexusAddr
if [ -z "${nexusAddr}" ];then
    nexusAddr="aicloud-harbor.com/library"
fi
# Make sure the address ends with a trailing slash
if [[ ${nexusAddr} =~ /$ ]];
then echo
else nexusAddr="${nexusAddr}/"
fi
tar -xzf images.tar.gz
cd images
# Load
echo "${C}start: loading images${Q}"
for image_name in $(ls ./)
do
    echo -e "${Y} loading $image_name...${Q}"
    docker load < ${image_name}
done
echo -e "${C}end: load finished...\n\n${Q}"
# Push the images
echo "${C}start: pushing images to harbor...${Q}"
IMAGES_LIST=($(docker images | sed '1d' | awk '{print $1":"$2}'))
for push_image in $(docker images | sed '1d' | awk '{print $1":"$2}')
do
    echo -e "${Y} pushing $push_image...${Q}"
    # nexusAddr already ends with a slash, so no extra "/" is added here
    docker tag $push_image ${nexusAddr}${push_image}
    docker push ${nexusAddr}${push_image}
    echo "image ${nexusAddr}${push_image} pushed..."
done
echo -e "${C}end: all images pushed\n\n${Q}"
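Before running the push script, log the machine in to the registry first, then execute it:
docker login aicloud-harbor.com
sh push_images.sh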
8. Update the image references in the Kubeflow manifests
First pull the whole manifests project locally.
Project releases: https://github.com/kubeflow/manifests/releases
Download the v1.2.0 archive; the image references inside it need to be changed, since most of them point to overseas registries that the internal network cannot reach.
https://github.com/kubeflow/manifests/archive/v1.2.0.tar.gz
I open the project with IntelliJ IDEA, which makes global search, replace, and editing convenient.
Unpack the archive to get the manifests-1.2.0 directory, then open it as a project in IDEA.
Use the global replace shortcut (Alt+Shift+R in my keymap) or the corresponding menu action,
and replace the original image addresses with the images you retagged and pushed.
Every image downloaded above needs to be replaced, which is a fair amount of work; a scripted alternative is sketched below.
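If the IDE-based replace feels too manual, a scripted pass with grep and sed does the same job. This is only a sketch: it assumes you pushed with the script above, which keeps the original repository path under the aicloud-harbor.com/library prefix; adjust the prefix to whatever you actually used, repeat for quay.io, docker.io and the other registries, and note that the digest-pinned knative images were retagged to different paths in step 5 above, so fix those references by hand.
cd manifests-1.2.0
# rewrite gcr.io references in place; run again with quay.io, docker.io, etc.
grep -rl 'gcr.io/' --include='*.yaml' --include='*.env' . \
    | xargs sed -i 's#gcr.io/#aicloud-harbor.com/library/gcr.io/#g'
# double-check that nothing still points at the public registry
grep -rn 'image: gcr.io' --include='*.yaml' . | head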
9. Package the project
Package and compress the modified project, then upload it to a repository that the deployment server can reach with wget; I use Nexus here.
To keep the same packaging layout as the upstream release, I first zip the project on my local Windows machine, upload the zip to a Linux server, unpack it there, and repackage it as a tarball.
rz manifests-1.2.0.zip   # upload the zip to the server
mkdir manifests-1.2.0
mv manifests-1.2.0.zip manifests-1.2.0/
cd manifests-1.2.0/
unzip manifests-1.2.0.zip
rm -rf manifests-1.2.0.zip
cd ..
tar -czvf v1.2.0.tar.gz manifests-1.2.0/
curl -u test:*********** --upload-file ./v1.2.0.tar.gz http://nexus.example.com/repository/public-ftp/kubernetes/package/kubeflow/manifests-1.2.0/
10. Create the Kubeflow working directory
mkdir /apps/kubeflow
cd /apps/kubeflow
11. Create a StorageClass
# cat StorageClass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-client
provisioner: nfs-client-provisioner   # name of the NFS provisioner deployed in our internal network; an Alibaba Cloud NAS provisioner also works, as long as the cluster can reach it
reclaimPolicy: Retain
parameters:
  archiveOnDelete: "true"
Then set it as the default StorageClass:
# kubectl get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
nfs-client nfs-client-provisioner Retain Immediate false 21h
# setting the annotation value to "false" removes the default designation
# kubectl patch storageclass nfs-client -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
# kubectl get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
nfs-client (default) nfs-client-provisioner Retain Immediate false 21h
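The StorageClass only works if the provisioner it references is actually running in the cluster, so it is worth confirming that before moving on (here I assume it was deployed as a Deployment named nfs-client-provisioner):
kubectl get deploy -A | grep nfs-client-provisioner
kubectl get pods -A | grep nfs-client-provisioner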
12. Edit the KfDef configuration file
vim kfctl_k8s_istio.v1.2.0.yaml
apiVersion: kfdef.apps.kubeflow.org/v1
kind: KfDef
metadata:
  namespace: kubeflow
spec:
  applications:
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: namespaces/base
    name: namespaces
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: application/v3
    name: application
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: stacks/kubernetes/application/istio-1-3-1-stack
    name: istio-stack
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: stacks/kubernetes/application/cluster-local-gateway-1-3-1
    name: cluster-local-gateway
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: istio/istio/base
    name: istio
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: stacks/kubernetes/application/cert-manager-crds
    name: cert-manager-crds
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: stacks/kubernetes/application/cert-manager-kube-system-resources
    name: cert-manager-kube-system-resources
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: stacks/kubernetes/application/cert-manager
    name: cert-manager
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: stacks/kubernetes/application/add-anonymous-user-filter
    name: add-anonymous-user-filter
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: metacontroller/base
    name: metacontroller
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: admission-webhook/bootstrap/overlays/application
    name: bootstrap
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: stacks/kubernetes/application/spark-operator
    name: spark-operator
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: stacks/kubernetes
    name: kubeflow-apps
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: knative/installs/generic
    name: knative
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: kfserving/installs/generic
    name: kfserving
  # Spartakus is a separate application so that kfctl can remove it
  # to disable usage reporting
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: stacks/kubernetes/application/spartakus
    name: spartakus
  repos:
  - name: manifests
    # Note: point this at the repackaged archive that already contains the replaced image addresses
    # uri: https://github.com/kubeflow/manifests/archive/v1.2.0.tar.gz
    uri: http://aicloud-nexus.midea.com/repository/public-ftp/kubernetes/package/kubeflow/manifests-1.2.0/v1.2.0.tar.gz
  version: v1.2-branch
Log in to the deployment server and download the package you just uploaded:
wget http://aicloud-nexus.midea.com/repository/public-ftp/kubernetes/package/kubeflow/manifests-1.2.0/v1.2.0.tar.gz
tar -xzvf v1.2.0.tar.gz
cp kfctl_k8s_istio.v1.2.0.yaml ./manifests-1.2.0
cd manifests-1.2.0
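Optionally, before applying, render the manifests and confirm that the image replacement really took effect; kfctl build writes the generated manifests into a kustomize/ directory next to the config file:
kfctl build -V -f kfctl_k8s_istio.v1.2.0.yaml
grep -rn 'image: gcr.io' kustomize/ | head   # should return nothing if the replacement was complete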
13. Deploy
kfctl apply -V -f kfctl_k8s_istio.v1.2.0.yaml
Check the pods in every namespace:
# kubectl get pods -n cert-manager
# kubectl get pods -n istio-system
# kubectl get pods -n knative-serving
# kubectl get pods -n kubeflow
14. Access from a browser
Access the Kubeflow UI:
kubectl get svc/istio-ingressgateway -n istio-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
istio-ingressgateway NodePort 12.80.127.69 <none> 15020:32661/TCP,80:31380/TCP,443:31390/TCP,31400:31400/TCP,15029:30345/TCP,15030:32221/TCP,15031:31392/TCP,15032:31191/TCP,15443:32136/TCP 5h14m
Because the Service is of type NodePort, we can open the UI directly in a browser at:
http://<node IP>:31380
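The node IPs can be listed with:
kubectl get nodes -o wide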
15. Test
Create a Notebook Server from the Kubeflow UI.
Then check the pods again: the following three pods will appear, and if they are all Running everything is working.
# kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
anonymous ml-pipeline-ui-artifact-ccf49557c-s5jk9 2/2 Running 0 4m48s
anonymous ml-pipeline-visualizationserver-866f48bf7b-pfr4l 2/2 Running 0 4m48s
anonymous test-0 2/2 Running 0 2m13s
IV. Deleting Kubeflow
kfctl delete -V -f kfctl_k8s_istio.v1.2.0.yaml
V. Troubleshooting
1. The deployment keeps hanging at cert-manager
application.app.k8s.io/cert-manager configured
WARN[0161] Encountered error applying application cert-manager: (kubeflow.error): Code 500 with message: Apply.Run : error when creating "/tmp/kout044650944": Internal error occurred: failed calling webhook "webhook.cert-manager.io": the server is currently unable to handle the request filename="kustomize/kustomize.go:284"
WARN[0161] Will retry in 26 seconds. filename="kustomize/kustomize.go:285"
Solution
First check the pods:
# kubectl get pods -n cert-manager
NAME READY STATUS RESTARTS AGE
cert-manager-7c75b559c4-xmsp6 1/1 Running 0 3m46s
cert-manager-cainjector-7f964fd7b5-fnsg7 1/1 Running 0 3m46s
cert-manager-webhook-566dd99d6-fnchp 0/1 ImagePullBackOff 0 3m46s
# kubectl describe pod cert-manager-webhook-566dd99d6-fnchp -n cert-manager
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m57s default-scheduler Successfully assigned cert-manager/cert-manager-webhook-566dd99d6-fnchp to node6
Warning FailedMount 4m26s (x7 over 4m58s) kubelet MountVolume.SetUp failed for volume "certs" : secret "cert-manager-webhook-tls" not found
Warning Failed 3m53s kubelet Failed to pull image "quay.io/jetstack/cert-manager-webhook:v0.11.0": rpc error: code = Unknown desc = Error response from daemon: Get https://quay.io/v2/: dial tcp 54.197.99.84:443: connect: connection refused
Warning Failed 3m37s kubelet Failed to pull image "quay.io/jetstack/cert-manager-webhook:v0.11.0": rpc error: code = Unknown desc = Error response from daemon: Get https://quay.io/v2/: dial tcp 54.156.10.58:443: connect: connection refused
Normal Pulling 3m9s (x3 over 3m53s) kubelet Pulling image "quay.io/jetstack/cert-manager-webhook:v0.11.0"
Warning Failed 3m9s (x3 over 3m53s) kubelet Error: ErrImagePull
Warning Failed 3m9s kubelet Failed to pull image "quay.io/jetstack/cert-manager-webhook:v0.11.0": rpc error: code = Unknown desc = Error response from daemon: Get https://quay.io/v2/: dial tcp 52.4.104.248:443: connect: connection refused
Normal BackOff 2m58s (x4 over 3m52s) kubelet Back-off pulling image "quay.io/jetstack/cert-manager-webhook:v0.11.0"
Warning Failed 2m46s (x5 over 3m52s) kubelet Error: ImagePullBackOff
The events show that the pod is failing because its image cannot be pulled.
If the image address is correct, just delete the pod so it gets recreated:
kubectl delete pod cert-manager-webhook-566dd99d6-fnchp -n cert-manager
2. After deleting Kubeflow, the PVCs are stuck in Terminating
# kubectl get pvc -n kubeflow
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
metadata-mysql Terminating pvc-4fe5c5f2-a187-4200-95c3-33de0c01f781 10Gi RWO nfs-client 23h
minio-pvc Terminating pvc-cd2dc964-a448-4c68-b0bb-5bc2183e5203 20Gi RWO nfs-client 23h
mysql-pv-claim Terminating pvc-514407db-00bd-4767-8043-a31b1a70e47f 20Gi RWO nfs-client 23h
Solution
# kubectl patch pvc metadata-mysql -p '{"metadata":{"finalizers":null}}' -n kubeflow
persistentvolumeclaim/metadata-mysql patched
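If several PVCs are stuck, the same patch can be applied to all of them in one pass:
for pvc in $(kubectl get pvc -n kubeflow -o name); do
    kubectl patch $pvc -n kubeflow -p '{"metadata":{"finalizers":null}}'
done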
Also, after Kubeflow is deleted the PVs are not removed automatically:
# kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
jenkins-home 60Gi RWO Retain Bound infrastructure/jenkins jenkins-home 9d
pvc-0860e679-dd0b-48fc-8326-8a4c993410e6 20Gi RWO Retain Released kubeflow/minio-pvc nfs-client 16m
pvc-13e06aac-f688-4d89-a467-93e5c6d6ecf6 20Gi RWO Retain Released kubeflow/mysql-pv-claim nfs-client 16m
pvc-3e495907-53c4-468e-9aad-426c2f3e0851 10Gi RWO Retain Released kubeflow/katib-mysql nfs-client 16m
pvc-3f59b851-0429-4e75-929b-33c05f8af66f 20Gi RWO Retain Released kubeflow/mysql-pv-claim nfs-client 7h42m
pvc-5da0ac9b-c1c4-4aa1-b9ff-128174fe152c 10Gi RWO Retain Released kubeflow/metadata-mysql nfs-client 7h42m
pvc-749f2098-8ba2-469c-8d78-f5889e24a9d4 5Gi RWO Retain Released anonymous/workspace-test nfs-client 7h35m
pvc-94e61c9f-0b9c-4589-9e33-efb885c84233 20Gi RWO Retain Released kubeflow/minio-pvc nfs-client 7h42m
pvc-a291c901-f2be-4994-b0d4-d83341879c3b 10Gi RWO Retain Released kubeflow/metadata-mysql nfs-client 16m
pvc-a657f4c5-abce-47b4-8474-4ee4e60826b9 10Gi RWO Retain Released kubeflow/katib-mysql nfs-client 7h42m
They have to be deleted manually:
# kubectl delete pv pvc-0860e679-dd0b-48fc-8326-8a4c993410e6
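To clean them up in one go rather than one by one, you can delete every PV in the Released state; this assumes the default kubectl get pv column layout (STATUS is the fifth column) and that no Released volume outside Kubeflow needs to be kept:
kubectl get pv --no-headers | awk '$5=="Released" {print $1}' | xargs kubectl delete pv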
If pods such as katib-db, katib-mysql, or metadata-grpc-deployment end up Pending or fail during initialization, the most likely cause is that their persistent volume was not provisioned or mounted; use kubectl describe on the pod to see the exact error.
Check whether the PVs and PVCs are bound:
# kubectl get pvc -A
# kubectl get pv