1. Problem Description
A Harbor registry that had been running for more than two years suddenly stopped working properly. The symptom was odd: a docker push would sometimes succeed and sometimes fail.
Deployment: Harbor is installed into the Kubernetes cluster via Helm, with Alibaba Cloud OSS as the persistent storage backend.
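For context, this is roughly how such a setup is wired in the goharbor/harbor-helm values: the registry component is pointed at an existing PVC. This is a sketch only; the actual values file of this installation is not in the original notes, and the claim name is taken from the PVC shown below.

# values.yaml (sketch): hand an existing PVC to the registry component
persistence:
  enabled: true
  persistentVolumeClaim:
    registry:
      existingClaim: prod-harbor-image-pvc-02   # the PVC described in the next section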
2. How OSS Is Used
PV definition
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/bound-by-controller: "yes"
  creationTimestamp: 2020-09-27T07:17:08Z
  finalizers:
  - kubernetes.io/pv-protection
  labels:
    alicloud-pvname: prod-harbor-image-02
  name: prod-harbor-image-02
  resourceVersion: "8397165"
  selfLink: /api/v1/persistentvolumes/prod-harbor-image-02
  uid: 73ef98aa-0091-11eb-9587-00163e010494
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 500Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: prod-harbor-image-pvc-02
    namespace: harbor
    resourceVersion: "8397162"
    uid: 846e489c-0091-11eb-9587-00163e010494
  flexVolume:
    driver: alicloud/oss
    options:
      akId: XXXX
      akSecret: XXXXXX
      bucket: prod-harbor-image-02
      otherOpts: -o max_stat_cache_size=0 -o allow_other
      url: XXXX-a.zbops.ciasyun.local
  persistentVolumeReclaimPolicy: Retain
  storageClassName: oss
status:
  phase: Bound
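Under the hood, the alicloud/oss FlexVolume driver mounts the bucket on the node with ossfs, so the options above translate roughly into a mount like the following. This is a sketch; the exact mount path and the driver's internal behaviour may differ.

# ossfs credentials file: bucket:akId:akSecret, must be mode 0600
echo "prod-harbor-image-02:XXXX:XXXXXX" > /etc/passwd-ossfs
chmod 600 /etc/passwd-ossfs
# mount the bucket with the same otherOpts as in the PV ( <pod-uid> is a placeholder )
ossfs prod-harbor-image-02 /var/lib/kubelet/pods/<pod-uid>/volumes/alicloud~oss/prod-harbor-image-02 \
  -ourl=XXXX-a.zbops.ciasyun.local \
  -o max_stat_cache_size=0 -o allow_other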
PVC definition
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
  creationTimestamp: 2020-09-27T07:17:35Z
  finalizers:
  - kubernetes.io/pvc-protection
  name: prod-harbor-image-pvc-02
  namespace: harbor
  resourceVersion: "8397167"
  selfLink: /api/v1/namespaces/harbor/persistentvolumeclaims/prod-harbor-image-pvc-02
  uid: 846e489c-0091-11eb-9587-00163e010494
spec:
  accessModes:
  - ReadWriteMany
  dataSource: null
  resources:
    requests:
      storage: 500Gi
  selector:
    matchLabels:
      alicloud-pvname: prod-harbor-image-02
  storageClassName: oss
  volumeName: prod-harbor-image-02
status:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 500Gi
  phase: Bound
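Both objects report Bound with a 500Gi cap, which can be confirmed quickly (sketch):

kubectl get pv prod-harbor-image-02
kubectl get pvc -n harbor prod-harbor-image-pvc-02   # expect STATUS=Bound, CAPACITY=500Gi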
volumeMounts in the registry container
volumeMounts:
- mountPath: /storage
  name: registry-data
- mountPath: /etc/registry/root.crt
  name: registry-root-certificate
  subPath: tls.crt
- mountPath: /etc/registry/passwd
  name: registry-htpasswd
  subPath: passwd
- mountPath: /etc/registry/config.yml
  name: registry-config
  subPath: config.yml
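The registry-data mount above is backed by a pod volume that references the PVC; a sketch of the corresponding volumes entry, assuming the chart wires it to the PVC shown earlier:

volumes:
- name: registry-data
  persistentVolumeClaim:
    claimName: prod-harbor-image-pvc-02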
3. Troubleshooting
3.1 Pushing an image from a node
The docker push hangs and never completes; screenshot below:
(screenshot: docker push hanging on the node)
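For reference, the push was issued roughly like this (registry address and image name are placeholders):

docker tag nginx:1.19 harbor.example.local/library/nginx:1.19
docker push harbor.example.local/library/nginx:1.19   # hangs here and never completes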
3.2 Checking the Harbor logs
(screenshot: Harbor registry logs)
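The logs were taken from the registry pod; a sketch of how to pull them, assuming the deployment name harbor-registry used later in the scale command and the chart's component=registry label:

kubectl -n harbor get pods -l component=registry
kubectl -n harbor logs deployment/harbor-registry -c registry --tail=200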
3.3 Checking the OSS monitoring
(screenshot: OSS monitoring dashboard)
3.4 Checking the request logs on the OSS side
The pattern was very odd: every request reaching the OSS backend was an HTTP HEAD, with no GET or PUT methods at all.
(screenshot: OSS backend request log showing only HEAD requests)
3.5 Alibaba's backend support spent a long time on it, but the final conclusion was somewhat laughable: since Harbor is not their product, they are not responsible for it, and they would only say that the incoming requests looked abnormal.
(screenshot: reply from Alibaba support)
4. Conclusion
After spending more time on it myself, and being led down the wrong path for quite a while, I happened to look at the PVC and PV again and had a flash of inspiration: both objects cap the storage at 500Gi. Could the volume simply be full, so that nothing more could be pushed? Inside the Harbor container the OSS mount showed about 260T of space, but what the container reports is not necessarily accurate or actually usable. So the first step was to expand the PV and PVC. Since this was written up after the fact, only the commands are recorded and there are no proper screenshots.
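The 260T figure came from looking at the mount inside the registry container, roughly like this (sketch):

kubectl -n harbor exec deployment/harbor-registry -c registry -- df -h /storage
# a FUSE mount like ossfs typically reports a synthetic capacity rather than the real
# bucket/PV quota, which is why this number is not necessarily what is actually usable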
Stop the Harbor registry service:
kubectl scale --replicas=0 deployment/harbor-registry -n harbor
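Then confirm no registry pods are still running before touching the PVC (sketch):

kubectl -n harbor get pods -l component=registry   # should return nothing once the scale-down completes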
Back up and delete the PVC:
kubectl get pvc -n harbor prod-harbor-image-pvc-02 -o yaml > harbor-reg.yaml
kubectl get pvc -n harbor
kubectl delete pvc -n harbor prod-harbor-image-pvc-02
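Because the PV's reclaim policy is Retain, deleting the PVC leaves the PV behind in the Released state, still carrying the old claimRef; that is exactly why the next step removes it, otherwise a new PVC cannot bind. Roughly what you would see (sketch):

kubectl get pv prod-harbor-image-02
# NAME                   CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                             ...
# prod-harbor-image-02   500Gi      RWX            Retain           Released   harbor/prod-harbor-image-pvc-02   ...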
Edit harbor-reg.yaml, and remove from the PV the information that binds it to the old PVC; the binding is the content under claimRef, as in the screenshot below:
(screenshot: claimRef block removed from the PV)
Also increase the storage size from the previous 500Gi to 1500Gi.
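A sketch of the two fragments that change (the original notes only say that the claimRef block goes away and the size goes up); when re-creating the PVC from harbor-reg.yaml, also strip status, metadata.uid, resourceVersion and the pv.kubernetes.io annotations:

# PV: drop the whole claimRef block and bump the capacity
spec:
  capacity:
    storage: 1500Gi
  # claimRef: <removed>

# PVC (harbor-reg.yaml): request the new size
spec:
  resources:
    requests:
      storage: 1500Gi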
Re-create the PV and the PVC (from harbor-reg.yaml), and Harbor resumed normal operation. The follow-up test was simple, i.e. docker push worked again.
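Spelled out, the restore and verification steps look roughly like this (file name for the edited PV manifest is hypothetical; registry address and image are placeholders):

kubectl apply -f prod-harbor-image-02-pv.yaml                      # edited PV: 1500Gi, no claimRef
kubectl apply -f harbor-reg.yaml                                   # edited PVC backup: 1500Gi request
kubectl -n harbor get pvc prod-harbor-image-pvc-02                 # expect STATUS=Bound, CAPACITY=1500Gi
kubectl scale --replicas=1 deployment/harbor-registry -n harbor    # bring the registry back
docker push harbor.example.local/library/nginx:1.19                # now completes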