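kube-scheduler is the Kubernetes component responsible for scheduling Pods, that is, for assigning each Pod to a specific Node. It watches the Pod list through the interface exposed by the API Server and picks up the Pods that are waiting to be scheduled. It then filters the Nodes with a set of predicates (pre-selection policies), scores the remaining Nodes with a set of priorities (preference policies), and binds the Pod to the Node with the highest score. The binding is written through the API Server and persisted in etcd.
The kubelet on each node watches kube-apiserver, picks up the binding produced by kube-scheduler, fetches the Pod manifest, pulls the images, and starts the containers.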
Scheduling policies
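Kubernetes scheduling policies are divided into Predicates (pre-selection policies) and Priorities (preference policies). Scheduling a Pod takes two steps:
Predicates: these are mandatory rules. The scheduler iterates over all Nodes and keeps only those that satisfy every configured predicate. If no Node passes, the Pod stays pending until some Node can satisfy the rules.
Priorities: the Nodes that passed the first step are scored and ranked by the priority policies, and the best one is chosen.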
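The two steps above can be sketched as a filter followed by a scoring pass. This is a minimal illustration, not the kube-scheduler implementation: the `Node` and `Pod` types, the `fitsCPU`, `matchesSelector`, and `spareCPU` functions, and the example data are all hypothetical simplifications.

```go
package main

import (
	"fmt"
	"sort"
)

// Node and Pod are simplified, hypothetical stand-ins for the real API objects.
type Node struct {
	Name    string
	FreeCPU int64 // unreserved CPU in millicores
	Labels  map[string]string
}

type Pod struct {
	Name         string
	CPURequest   int64 // requested CPU in millicores
	NodeSelector map[string]string
}

// Predicate is a mandatory rule: a node either passes or is dropped.
type Predicate func(pod Pod, node Node) bool

// Priority scores a node that passed every predicate; higher is better.
type Priority func(pod Pod, node Node) int

func fitsCPU(pod Pod, node Node) bool { return pod.CPURequest <= node.FreeCPU }

func matchesSelector(pod Pod, node Node) bool {
	for k, v := range pod.NodeSelector {
		if node.Labels[k] != v {
			return false
		}
	}
	return true
}

// spareCPU prefers nodes with more CPU left after placing the pod.
func spareCPU(pod Pod, node Node) int { return int((node.FreeCPU - pod.CPURequest) / 100) }

var (
	exampleNodes = []Node{
		{Name: "a", FreeCPU: 500, Labels: map[string]string{"disk": "ssd"}},
		{Name: "b", FreeCPU: 2000, Labels: map[string]string{"disk": "ssd"}},
		{Name: "c", FreeCPU: 4000, Labels: map[string]string{"disk": "hdd"}},
		{Name: "d", FreeCPU: 3000, Labels: map[string]string{"disk": "ssd"}},
	}
	examplePreds = []Predicate{fitsCPU, matchesSelector}
	examplePrios = []Priority{spareCPU}
)

// schedule returns the best node name, or false when no node passes the
// predicates (the pod then stays pending).
func schedule(pod Pod, nodes []Node, preds []Predicate, prios []Priority) (string, bool) {
	type scored struct {
		name  string
		score int
	}
	var candidates []scored
	for _, n := range nodes {
		fits := true
		for _, p := range preds {
			if !p(pod, n) {
				fits = false
				break
			}
		}
		if !fits {
			continue
		}
		total := 0
		for _, pr := range prios {
			total += pr(pod, n)
		}
		candidates = append(candidates, scored{n.Name, total})
	}
	if len(candidates) == 0 {
		return "", false
	}
	// Highest score first; ties keep their input order.
	sort.SliceStable(candidates, func(i, j int) bool { return candidates[i].score > candidates[j].score })
	return candidates[0].name, true
}

func main() {
	pod := Pod{Name: "web", CPURequest: 1000, NodeSelector: map[string]string{"disk": "ssd"}}
	if name, ok := schedule(pod, exampleNodes, examplePreds, examplePrios); ok {
		fmt.Println("bind", pod.Name, "to", name)
	} else {
		fmt.Println(pod.Name, "stays pending")
	}
}
```

In this example node a fails the CPU check and node c fails the selector check, so only b and d are scored.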
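- Source locations:
The predicates package contains all predicates supported by k8s
The priorities package contains all priorities supported by k8s
The defaults package under the algorithmprovider package defines the default predicates and priorities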
Predicates (pre-selection policies)
Kubernetes v1.7 supports 15 predicates:
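- MatchNodeSelector: checks whether the Node's labels satisfy the Pod's spec.nodeSelector
- PodFitsResources: checks whether the host has enough resources (CPU and memory) for the Pod. The check is based on the amount already requested (Request) by the Pods on the host, not on actual usage
- PodFitsHostPorts: checks whether any HostPort required by a container in the Pod is already taken by another container on the host. If a required HostPort is unavailable, the Pod cannot be scheduled to this host
- HostName: checks whether the host's name is the NodeName specified by the Pod
- NoDiskConflict: checks pod.spec.volumes for volume conflicts on this host. If the host has already mounted a volume, other Pods that use the same volume cannot be scheduled to it. The exact rules differ by storage backend
- NoVolumeZoneConflict: checks, given the zone restrictions of the volumes, whether deploying the Pod on this host would cause a volume conflict
- PodToleratesNodeTaints: ensures that the tolerations defined on the Pod tolerate the taints defined on the Node
- CheckNodeMemoryPressure: checks whether the Pod may be scheduled to a node that is reporting memory pressure
- CheckNodeDiskPressure: checks whether the Pod may be scheduled to a node that is reporting disk pressure
- MaxEBSVolumeCount: ensures the number of attached EBS volumes does not exceed the configured maximum (default 39)
- MaxGCEPDVolumeCount: ensures the number of attached GCE PD volumes does not exceed the configured maximum (default 16)
- MaxAzureDiskVolumeCount: ensures the number of attached Azure Disk volumes does not exceed the configured maximum (default 16)
- MatchInterPodAffinity: checks whether the Pod satisfies the affinity and anti-affinity rules with respect to other Pods
- GeneralPredicates: bundles the basic fit checks between a Pod and a host (PodFitsResources, PodFitsHostPorts, HostName, and MatchNodeSelector)
- NoVolumeNodeConflict: checks, given the Node restrictions of the volumes, whether deploying the Pod on this host would cause a volume conflict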
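To make two of the predicates above concrete, here is a minimal sketch of a request-based resource check and a host port check. It assumes simplified, hypothetical `Container`, `Pod`, and `Node` types with memory expressed in MiB; the real predicates work on the Kubernetes API objects and handle more cases.

```go
package main

import "fmt"

// Hypothetical, simplified types for illustration only.
type Container struct {
	CPURequestMilli int64
	MemRequestMiB   int64
	HostPorts       []int
}

type Pod struct {
	Name       string
	Containers []Container
}

type Node struct {
	AllocatableCPUMilli int64
	AllocatableMemMiB   int64
	Pods                []Pod // pods already bound to this node
}

// requests sums the CPU and memory requests of all containers in the pods.
func requests(pods ...Pod) (cpu, mem int64) {
	for _, p := range pods {
		for _, c := range p.Containers {
			cpu += c.CPURequestMilli
			mem += c.MemRequestMiB
		}
	}
	return cpu, mem
}

// fitsResources mirrors the idea of PodFitsResources: compare the sum of
// requests (not actual usage) against what the node can allocate.
func fitsResources(pod Pod, node Node) bool {
	usedCPU, usedMem := requests(node.Pods...)
	wantCPU, wantMem := requests(pod)
	return usedCPU+wantCPU <= node.AllocatableCPUMilli &&
		usedMem+wantMem <= node.AllocatableMemMiB
}

// fitsHostPorts mirrors the idea of PodFitsHostPorts: a host port already
// used by a pod on the node cannot be claimed again. Port 0 means "unset".
func fitsHostPorts(pod Pod, node Node) bool {
	inUse := map[int]bool{}
	for _, p := range node.Pods {
		for _, c := range p.Containers {
			for _, port := range c.HostPorts {
				inUse[port] = true
			}
		}
	}
	for _, c := range pod.Containers {
		for _, port := range c.HostPorts {
			if port != 0 && inUse[port] {
				return false
			}
		}
	}
	return true
}

var exampleNode = Node{
	AllocatableCPUMilli: 4000,
	AllocatableMemMiB:   8192,
	Pods: []Pod{{
		Name:       "existing",
		Containers: []Container{{CPURequestMilli: 3000, MemRequestMiB: 4096, HostPorts: []int{8080}}},
	}},
}

func main() {
	small := Pod{Name: "small", Containers: []Container{{CPURequestMilli: 1000, MemRequestMiB: 1024, HostPorts: []int{9090}}}}
	clash := Pod{Name: "clash", Containers: []Container{{CPURequestMilli: 500, MemRequestMiB: 512, HostPorts: []int{8080}}}}
	fmt.Println(small.Name, fitsResources(small, exampleNode), fitsHostPorts(small, exampleNode))
	fmt.Println(clash.Name, fitsResources(clash, exampleNode), fitsHostPorts(clash, exampleNode))
}
```

Because predicates are mandatory, a Pod is only a candidate for a node when every check returns true.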
Priorities (preference policies)
The Priorities available in Kubernetes (v1.7) are:
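- EqualPriority: gives every node the same priority
- ImageLocalityPriority: scores a host by whether it already holds the images the Pod needs. If none of the required images are present the score is 0; if images are present, the larger they are, the higher the score
- LeastRequestedPriority: computes the percentage of the node's CPU and memory that Pods request; the node with the smallest percentage is the best. Score formula:
(cpu((capacity - sum(requested)) * 10 / capacity) + memory((capacity - sum(requested)) * 10 / capacity)) / 2
- BalancedResourceAllocation: prefers the node whose resource utilization (CPU and memory) is most balanced. Score formula:
10 - abs(totalCpu/cpuNodeCapacity - totalMemory/memoryNodeCapacity) * 10
- SelectorSpreadPriority: counts, for each Node, the Pods that belong to the same Service or ReplicaSet as the Pod being scheduled. The fewer such Pods, the higher the score
- NodePreferAvoidPodsPriority: evaluates the node's scheduler.alpha.kubernetes.io/preferAvoidPods annotation. Its weight is set to 10000 so that it overrides the other priorities
- NodeAffinityPriority: the node affinity policy. Node affinity supports two kinds of selectors: requiredDuringSchedulingIgnoredDuringExecution (the chosen host must satisfy all of the Pod's rules about hosts) and preferredDuringSchedulingIgnoredDuringExecution (the scheduler tries to satisfy all the NodeSelector requirements but does not guarantee it)
- TaintTolerationPriority: the scoring counterpart of the PodToleratesNodeTaints predicate. It prefers nodes with fewer taints that the Pod does not tolerate
- InterPodAffinityPriority: the Pod affinity policy. Like NodeAffinityPriority, it supports two kinds of rules, expressed relative to other Pods: requiredDuringSchedulingIgnoredDuringExecution (the chosen host must satisfy all of the rules) and preferredDuringSchedulingIgnoredDuringExecution (the scheduler tries to satisfy the rules but does not guarantee it)
- MostRequestedPriority: suited to dynamically scaled clusters. It prefers the most heavily utilized nodes, so that when the cluster scales down, idle machines are freed up and can be shut down.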
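The score formulas for LeastRequestedPriority and BalancedResourceAllocation, plus the opposite MostRequestedPriority, can be worked through in a few lines. This is a sketch under assumptions: the function names are hypothetical, scores are on the 0 to 10 scale from the formulas above, integer division is used for the per-resource scores, and over-committed nodes are assumed to score 0.

```go
package main

import (
	"fmt"
	"math"
)

// unusedScore is ((capacity - requested) * 10) / capacity with integer division.
// A node with no capacity or with more requested than capacity scores 0 (assumption).
func unusedScore(requested, capacity int64) int64 {
	if capacity == 0 || requested > capacity {
		return 0
	}
	return ((capacity - requested) * 10) / capacity
}

// usedScore is (requested * 10) / capacity, the mirror image of unusedScore.
func usedScore(requested, capacity int64) int64 {
	if capacity == 0 || requested > capacity {
		return 0
	}
	return (requested * 10) / capacity
}

// leastRequested averages the CPU and memory "unused" scores.
func leastRequested(cpuReq, cpuCap, memReq, memCap int64) int64 {
	return (unusedScore(cpuReq, cpuCap) + unusedScore(memReq, memCap)) / 2
}

// mostRequested averages the CPU and memory "used" scores.
func mostRequested(cpuReq, cpuCap, memReq, memCap int64) int64 {
	return (usedScore(cpuReq, cpuCap) + usedScore(memReq, memCap)) / 2
}

// balancedAllocation is 10 - abs(cpuFraction - memFraction) * 10, truncated.
// A node with no capacity, or one that is fully requested on either resource,
// scores 0 (assumption).
func balancedAllocation(cpuReq, cpuCap, memReq, memCap int64) int {
	if cpuCap == 0 || memCap == 0 {
		return 0
	}
	cpuFraction := float64(cpuReq) / float64(cpuCap)
	memFraction := float64(memReq) / float64(memCap)
	if cpuFraction >= 1 || memFraction >= 1 {
		return 0
	}
	return int(10 - math.Abs(cpuFraction-memFraction)*10)
}

func main() {
	// 2000m of 4000m CPU and 2048 MiB of 8192 MiB memory requested.
	fmt.Println("LeastRequested:", leastRequested(2000, 4000, 2048, 8192))
	fmt.Println("MostRequested:", mostRequested(2000, 4000, 2048, 8192))
	fmt.Println("BalancedResourceAllocation:", balancedAllocation(2000, 4000, 2048, 8192))
}
```

For the values in `main`, the CPU score is 5 and the memory score is 7, so LeastRequested gives 6, while the 0.25 gap between the CPU and memory fractions gives a balance score of 7.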
Default policies
Default predicates
func defaultPredicates() sets.String {
	predSet := sets.NewString(
		factory.RegisterFitPredicateFactory(
			"NoVolumeZoneConflict",
			func(args factory.PluginFactoryArgs) algorithm.FitPredicate {
				return predicates.NewVolumeZonePredicate(args.PVInfo, args.PVCInfo)
			},
		),
		factory.RegisterFitPredicateFactory(
			"MaxEBSVolumeCount",
			func(args factory.PluginFactoryArgs) algorithm.FitPredicate {
				// TODO: allow for generically parameterized scheduler predicates, because this is a bit ugly
				maxVols := getMaxVols(aws.DefaultMaxEBSVolumes)
				return predicates.NewMaxPDVolumeCountPredicate(predicates.EBSVolumeFilter, maxVols, args.PVInfo, args.PVCInfo)
			},
		),
		factory.RegisterFitPredicateFactory(
			"MaxGCEPDVolumeCount",
			func(args factory.PluginFactoryArgs) algorithm.FitPredicate {
				// TODO: allow for generically parameterized scheduler predicates, because this is a bit ugly
				maxVols := getMaxVols(DefaultMaxGCEPDVolumes)
				return predicates.NewMaxPDVolumeCountPredicate(predicates.GCEPDVolumeFilter, maxVols, args.PVInfo, args.PVCInfo)
			},
		),
		factory.RegisterFitPredicateFactory(
			"MaxAzureDiskVolumeCount",
			func(args factory.PluginFactoryArgs) algorithm.FitPredicate {
				// TODO: allow for generically parameterized scheduler predicates, because this is a bit ugly
				maxVols := getMaxVols(DefaultMaxAzureDiskVolumes)
				return predicates.NewMaxPDVolumeCountPredicate(predicates.AzureDiskVolumeFilter, maxVols, args.PVInfo, args.PVCInfo)
			},
		),
		factory.RegisterFitPredicateFactory(
			predicates.MatchInterPodAffinity,
			func(args factory.PluginFactoryArgs) algorithm.FitPredicate {
				return predicates.NewPodAffinityPredicate(args.NodeInfo, args.PodLister)
			},
		),
		factory.RegisterFitPredicate("NoDiskConflict", predicates.NoDiskConflict),
		factory.RegisterFitPredicate("GeneralPredicates", predicates.GeneralPredicates),
		factory.RegisterFitPredicate("CheckNodeMemoryPressure", predicates.CheckNodeMemoryPressurePredicate),
		factory.RegisterFitPredicate("CheckNodeDiskPressure", predicates.CheckNodeDiskPressurePredicate),
		factory.RegisterFitPredicateFactory(
			"NoVolumeNodeConflict",
			func(args factory.PluginFactoryArgs) algorithm.FitPredicate {
				return predicates.NewVolumeNodePredicate(args.PVInfo, args.PVCInfo, nil)
			},
		),
	)
	if utilfeature.DefaultFeatureGate.Enabled(features.TaintNodesByCondition) {
		predSet.Insert(factory.RegisterMandatoryFitPredicate("PodToleratesNodeTaints", predicates.PodToleratesNodeTaints))
		glog.Warningf("TaintNodesByCondition is enabled, PodToleratesNodeTaints predicate is mandatory")
	} else {
		predSet.Insert(factory.RegisterMandatoryFitPredicate("CheckNodeCondition", predicates.CheckNodeConditionPredicate))
		predSet.Insert(factory.RegisterFitPredicate("PodToleratesNodeTaints", predicates.PodToleratesNodeTaints))
	}
	return predSet
}
Default priorities
func defaultPriorities() sets.String {
	return sets.NewString(
		factory.RegisterPriorityConfigFactory(
			"SelectorSpreadPriority",
			factory.PriorityConfigFactory{
				Function: func(args factory.PluginFactoryArgs) algorithm.PriorityFunction {
					return priorities.NewSelectorSpreadPriority(args.ServiceLister, args.ControllerLister, args.ReplicaSetLister, args.StatefulSetLister)
				},
				Weight: 1,
			},
		),
		factory.RegisterPriorityConfigFactory(
			"InterPodAffinityPriority",
			factory.PriorityConfigFactory{
				Function: func(args factory.PluginFactoryArgs) algorithm.PriorityFunction {
					return priorities.NewInterPodAffinityPriority(args.NodeInfo, args.NodeLister, args.PodLister, args.HardPodAffinitySymmetricWeight)
				},
				Weight: 1,
			},
		),
		factory.RegisterPriorityFunction2("LeastRequestedPriority", priorities.LeastRequestedPriorityMap, nil, 1),
		factory.RegisterPriorityFunction2("BalancedResourceAllocation", priorities.BalancedResourceAllocationMap, nil, 1),
		factory.RegisterPriorityFunction2("NodePreferAvoidPodsPriority", priorities.CalculateNodePreferAvoidPodsPriorityMap, nil, 10000),
		factory.RegisterPriorityFunction2("NodeAffinityPriority", priorities.CalculateNodeAffinityPriorityMap, priorities.CalculateNodeAffinityPriorityReduce, 1),
		factory.RegisterPriorityFunction2("TaintTolerationPriority", priorities.ComputeTaintTolerationPriorityMap, priorities.ComputeTaintTolerationPriorityReduce, 1),
	)
}
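The last argument of each registration above is the priority's weight. A node's final score is the weighted sum of the individual priority scores, which is why a weight of 10000 lets NodePreferAvoidPodsPriority dominate the others. The sketch below illustrates that arithmetic; the `weightedScore` type, the `totalScore` function, and the example scores are hypothetical.

```go
package main

import "fmt"

// weightedScore is a hypothetical record of one priority's result for a node.
type weightedScore struct {
	Priority string
	Weight   int
	Score    int // 0 to 10
}

// totalScore returns the weighted sum of all priority scores for a node.
func totalScore(scores []weightedScore) int {
	total := 0
	for _, s := range scores {
		total += s.Weight * s.Score
	}
	return total
}

// Node A looks better on resources but is marked as one to avoid (score 0).
var nodeA = []weightedScore{
	{"LeastRequestedPriority", 1, 9},
	{"BalancedResourceAllocation", 1, 8},
	{"NodePreferAvoidPodsPriority", 10000, 0},
}

// Node B is busier but is not marked as one to avoid (score 10).
var nodeB = []weightedScore{
	{"LeastRequestedPriority", 1, 2},
	{"BalancedResourceAllocation", 1, 3},
	{"NodePreferAvoidPodsPriority", 10000, 10},
}

func main() {
	fmt.Println("node A:", totalScore(nodeA))
	fmt.Println("node B:", totalScore(nodeB))
}
```

Node B wins here despite its lower resource scores, because no combination of weight-1 priorities can outweigh a single point of a weight-10000 priority.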
Policies registered but not enabled by default
Predicates
// Registers predicates and priorities that are not enabled by default, but user can pick when creating his
// own set of priorities/predicates.
factory.RegisterFitPredicate("PodFitsPorts", predicates.PodFitsHostPorts)
factory.RegisterFitPredicate("PodFitsHostPorts", predicates.PodFitsHostPorts)
factory.RegisterFitPredicate("PodFitsResources", predicates.PodFitsResources)
factory.RegisterFitPredicate("HostName", predicates.PodFitsHost)
factory.RegisterFitPredicate("MatchNodeSelector", predicates.PodMatchNodeSelector)
Priorities
factory.RegisterPriorityFunction2("EqualPriority", core.EqualPriorityMap, nil, 1)
factory.RegisterPriorityFunction2("ImageLocalityPriority", priorities.ImageLocalityPriorityMap, nil, 1)
factory.RegisterPriorityFunction2("MostRequestedPriority", priorities.MostRequestedPriorityMap, nil, 1)
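The difference between "registered" and "enabled by default" can be pictured as a name-to-function registry with a separate set of default names. The following is a rough sketch of that pattern only; the `registry` type and its methods are hypothetical and are not the scheduler's factory API.

```go
package main

import (
	"fmt"
	"sort"
)

// FitPredicate is a hypothetical, simplified predicate signature.
type FitPredicate func(podName, nodeName string) bool

// registry keeps every registered predicate by name and tracks which
// names belong to the default set.
type registry struct {
	predicates map[string]FitPredicate
	defaults   map[string]bool
}

func newRegistry() *registry {
	return &registry{predicates: map[string]FitPredicate{}, defaults: map[string]bool{}}
}

// register makes a predicate available by name; only those flagged
// enabledByDefault end up in the default set.
func (r *registry) register(name string, p FitPredicate, enabledByDefault bool) {
	r.predicates[name] = p
	if enabledByDefault {
		r.defaults[name] = true
	}
}

// defaultNames lists the default predicate names in sorted order.
func (r *registry) defaultNames() []string {
	names := make([]string, 0, len(r.defaults))
	for n := range r.defaults {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

// pick resolves a user-chosen list of names, whether default or not.
func (r *registry) pick(names ...string) ([]FitPredicate, error) {
	picked := make([]FitPredicate, 0, len(names))
	for _, n := range names {
		p, ok := r.predicates[n]
		if !ok {
			return nil, fmt.Errorf("predicate %q is not registered", n)
		}
		picked = append(picked, p)
	}
	return picked, nil
}

func alwaysFits(podName, nodeName string) bool { return true }

// exampleRegistry registers one default and one optional predicate,
// using placeholder functions instead of real checks.
func exampleRegistry() *registry {
	r := newRegistry()
	r.register("NoDiskConflict", alwaysFits, true)
	r.register("HostName", alwaysFits, false)
	return r
}

func main() {
	r := exampleRegistry()
	fmt.Println("defaults:", r.defaultNames())
	custom, err := r.pick("NoDiskConflict", "HostName")
	fmt.Println("custom set size:", len(custom), "error:", err)
}
```

A user who builds a custom policy can therefore name any registered predicate or priority, including the ones listed above that the default provider leaves out.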