k8s Scheduling Algorithms

  1. Predicates: filter the candidate nodes
  2. Priorities: score the remaining nodes

ps. scheduler_algorithm

Predicates

Method signature

func Predicates(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {}
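For orientation, here is a minimal sketch of a predicate with this shape. It uses simplified stand-in types (SimplePod and SimpleNode are illustrative, not the real v1.Pod and schedulercache.NodeInfo) and mimics the PodFitsHost check:

package main

import "fmt"

// Simplified stand-ins for v1.Pod and schedulercache.NodeInfo.
type SimplePod struct {
    Name     string
    NodeName string // pod.Spec.NodeName in the real API
}

type SimpleNode struct {
    Name string
}

// podFitsHost mirrors the predicate shape: it returns whether the pod
// fits, a list of failure reasons, and an error for internal problems.
func podFitsHost(pod *SimplePod, node *SimpleNode) (bool, []string, error) {
    if pod.NodeName == "" || pod.NodeName == node.Name {
        return true, nil, nil
    }
    return false, []string{"PodFitsHost: pod.Spec.NodeName does not match this node"}, nil
}

func main() {
    pod := &SimplePod{Name: "web-0", NodeName: "node-a"}
    fit, reasons, _ := podFitsHost(pod, &SimpleNode{Name: "node-b"})
    fmt.Println(fit, reasons) // false [PodFitsHost: ...]
}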

There are 20 predicates in total.

ps. Rather than translating them clumsily, the descriptions below quote the code comments directly. Source: kubernetes-master\pkg\scheduler\algorithm\predicates\predicates.go

volume

  • NoDiskConflict (important): evaluates if a pod can fit due to the volumes it requests, and those that are already mounted. If there is already a volume mounted on that node, another pod that uses the same volume can't be scheduled there (see the sketch after this list).
  • NewMaxPDVolumeCountPredicate (important): creates a predicate which evaluates whether a pod can fit based on the number of volumes which match a filter that it requests, and those that are already present.
    The predicate looks for both volumes used directly, as well as PVC volumes that are backed by relevant volume types, counts the number of unique volumes, and rejects the new pod if it would place the total count over the maximum.
  • NewVolumeZonePredicate (important): evaluates if a pod can fit due to the volumes it requests, given that some volumes may have zone scheduling constraints. The requirement is that any volume zone-labels must match the equivalent zone-labels on the node. It is OK for the node to have more zone-label constraints (for example, a hypothetical replicated volume might allow region-wide access)
    Currently this is only supported with PersistentVolumeClaims, and looks to the labels only on the bound PersistentVolume.
    Working with volumes declared inline in the pod specification (i.e. not using a PersistentVolume) is likely to be harder, as it would require determining the zone of a volume during scheduling, and that is likely to require calling out to the cloud provider. It seems that we are moving away from inline volume declarations anyway.
  • NewVolumeBindingPredicate: evaluates if a pod can fit due to the volumes it requests, for both bound and unbound PVCs.
    For PVCs that are bound, then it checks that the corresponding PV's node affinity is satisfied by the given node.
    For PVCs that are unbound, it tries to find available PVs that can satisfy the PVC requirements and that the PV node affinity is satisfied by the given node.
    The predicate returns true if all bound PVCs have compatible PVs with the node, and if all unbound PVCs can be matched with an available and node-compatible PV.
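A minimal sketch of the NoDiskConflict idea, assuming volumes can be compared by a plain string ID; the real predicate compares GCE PD / AWS EBS / RBD / ISCSI volume sources field by field, and the helper name here is hypothetical:

package main

import "fmt"

// noDiskConflict reports whether any volume the pod requests is already
// in use by a pod on the node. Volumes are identified by plain string
// IDs here for illustration.
func noDiskConflict(podVolumes []string, nodeVolumes map[string]bool) (bool, string) {
    for _, v := range podVolumes {
        if nodeVolumes[v] {
            return false, fmt.Sprintf("volume %q is already mounted on the node", v)
        }
    }
    return true, ""
}

func main() {
    mounted := map[string]bool{"pd-disk-1": true}
    fit, reason := noDiskConflict([]string{"pd-disk-1"}, mounted)
    fmt.Println(fit, reason) // false volume "pd-disk-1" is already mounted on the node
}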

pod

  • PodFitsResources (important): checks if a node has sufficient resources, such as cpu, memory, gpu, opaque int resources etc., to run a pod.
  • PodMatchNodeSelector (important): checks if a pod's node selector matches the node's labels.
  • PodFitsHost (important): checks if the pod spec's node name matches the current node.
  • CheckNodeLabelPresence: checks whether all of the specified labels exist on a node, regardless of their value.
    If "presence" is false, it returns false if any of the requested labels matches any of the node's labels, otherwise it returns true.
    If "presence" is true, it returns false if any of the requested labels does not match any of the node's labels, otherwise it returns true.
    Consider the case where nodes are placed in regions/zones/racks identified by labels. In some cases it is required that only nodes belonging to ANY of the defined regions/zones/racks be selected.
    Alternately, eliminating nodes that have a certain label, regardless of its value, is also useful. A node may have a label with "retiring" as the key and the date as the value, and it may be desirable to avoid scheduling new pods on it.
  • checkServiceAffinity: a predicate which matches nodes in such a way as to force that ServiceAffinity.labels are homogeneous for pods that are scheduled to a node (i.e. it returns true IFF this pod can be added to this node such that all other pods in the same service are running on nodes with the exact same ServiceAffinity.label values).
    For example: if the first pod of a service was scheduled to a node with the label "region=foo", all subsequent pods belonging to the same service will be scheduled onto nodes with the same "region=foo" label.
  • PodFitsHostPorts (important): checks if a node has free ports for the requested pod ports (see the sketch after this list).
  • GeneralPredicates: checks whether noncriticalPredicates and EssentialPredicates pass. noncriticalPredicates are the predicates that only non-critical pods need; in practice this is just PodFitsResources.
  • EssentialPredicates: the predicates that all pods, including critical pods, need, namely PodFitsHost, PodFitsHostPorts and PodMatchNodeSelector.
  • InterPodAffinityMatches: checks if a pod can be scheduled on the specified node with pod affinity/anti-affinity configuration.
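A minimal sketch of the PodFitsHostPorts check, assuming host ports are plain ints and that usedPorts has already been collected from the pods on the node (the real check also considers protocol and host IP; the helper name is hypothetical):

package main

import "fmt"

// podFitsHostPorts reports whether every host port the pod requests is
// still free on the node. Port 0 means "no host port requested".
func podFitsHostPorts(wantPorts []int, usedPorts map[int]bool) bool {
    for _, p := range wantPorts {
        if p != 0 && usedPorts[p] {
            return false
        }
    }
    return true
}

func main() {
    used := map[int]bool{80: true, 443: true}
    fmt.Println(podFitsHostPorts([]int{8080}, used)) // true
    fmt.Println(podFitsHostPorts([]int{80}, used))   // false
}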

node

ps. Inspect a node's status with kubectl describe no {node-name}:

  • CheckNodeUnschedulablePredicate: checks if a pod can be scheduled on a node whose spec is marked Unschedulable.
  • PodToleratesNodeTaints: checks if a pod's tolerations can tolerate the node's taints (the taint mechanism; see the sketch after this list).
  • PodToleratesNodeNoExecuteTaints: checks if a pod's tolerations can tolerate the node's NoExecute taints.
  • CheckNodeMemoryPressurePredicate (important): checks if a pod can be scheduled on a node reporting memory pressure condition.
  • CheckNodeDiskPressurePredicate (important): checks if a pod can be scheduled on a node reporting disk pressure condition.
  • CheckNodePIDPressurePredicate: checks if a pod can be scheduled on a node reporting pid pressure condition.
  • CheckNodeConditionPredicate: checks if a pod can be scheduled on a node reporting out of disk, network unavailable and not ready condition. Only node conditions are accounted in this predicate.
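A minimal sketch of the taint-toleration matching behind PodToleratesNodeTaints, with simplified Taint/Toleration types (the real API also carries Value, Operator and TolerationSeconds):

package main

import "fmt"

// Simplified taint/toleration types.
type Taint struct{ Key, Effect string }
type Toleration struct{ Key, Effect string }

// tolerates reports whether a toleration matches a taint: the key must
// match, and an empty toleration effect matches any effect.
func tolerates(tol Toleration, taint Taint) bool {
    return tol.Key == taint.Key && (tol.Effect == "" || tol.Effect == taint.Effect)
}

// podToleratesNodeTaints checks that every NoSchedule/NoExecute taint on
// the node is tolerated by at least one of the pod's tolerations.
func podToleratesNodeTaints(tols []Toleration, taints []Taint) bool {
    for _, taint := range taints {
        if taint.Effect != "NoSchedule" && taint.Effect != "NoExecute" {
            continue // PreferNoSchedule is handled by the taint_toleration priority
        }
        tolerated := false
        for _, tol := range tols {
            if tolerates(tol, taint) {
                tolerated = true
                break
            }
        }
        if !tolerated {
            return false
        }
    }
    return true
}

func main() {
    taints := []Taint{{Key: "dedicated", Effect: "NoSchedule"}}
    fmt.Println(podToleratesNodeTaints(nil, taints))                              // false
    fmt.Println(podToleratesNodeTaints([]Toleration{{Key: "dedicated"}}, taints)) // true
}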

Priorities

ps. Source: kubernetes-master\pkg\scheduler\algorithm\priorities

ResourceAllocationPriority

// ResourceAllocationPriority contains information to calculate resource allocation priority.
type ResourceAllocationPriority struct {
    Name   string
    scorer func(requested, allocable *schedulercache.Resource, includeVolumes bool, requestedVolumes int, allocatableVolumes int) int64
}

// PriorityMap prioritizes nodes according to the resource allocations on the node.
// It uses the `scorer` function to calculate the score.
func (r *ResourceAllocationPriority) PriorityMap(
    pod *v1.Pod,
    meta interface{},
    nodeInfo *schedulercache.NodeInfo) (schedulerapi.HostPriority, error)

  • balancedResourceScorer (important): favors nodes with balanced resource usage rate. It should NOT be used alone, and MUST be used together with LeastRequestedPriority. It calculates the difference between the cpu and memory fraction of capacity, and prioritizes the host based on how close the two metrics are to each other.
    Formula: 10 - variance(cpuFraction, memoryFraction, volumeFraction) * 10
    Picks the node whose usage is most balanced across resources (see the sketch after this list).
  • leastResourceScorer (important): favors nodes with fewer requested resources. It calculates the percentage of memory and CPU requested by pods scheduled on the node, and prioritizes based on the minimum of the average of the fraction of requested to capacity.
    Formula: (cpu((capacity - sum(requested)) * 10 / capacity) + memory((capacity - sum(requested)) * 10 / capacity)) / 2
    Picks the most idle node.
  • mostResourceScorer: favors nodes with most requested resources. It calculates the percentage of memory and CPU requested by pods scheduled on the node, and prioritizes based on the maximum of the average of the fraction of requested to capacity.
    Formula: (cpu(10 * sum(requested) / capacity) + memory(10 * sum(requested) / capacity)) / 2
    Tries to fill up one node's resources before moving on to the next.
  • requested_to_capacity_ratio: assigns 1.0 to resource when all capacity is available and 0.0 when requested amount is equal to capacity.
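A minimal sketch of the least-requested and balanced formulas above, assuming integer resource quantities. The two-resource balanced form shown here uses the absolute difference of the fractions; the three-resource form (with volumes) uses variance, as in the formula above:

package main

import (
    "fmt"
    "math"
)

// leastRequestedScore implements the per-resource part of the
// leastResourceScorer formula: (capacity - requested) * 10 / capacity.
func leastRequestedScore(requested, capacity int64) int64 {
    if capacity == 0 || requested > capacity {
        return 0
    }
    return (capacity - requested) * 10 / capacity
}

// balancedScore implements the two-resource form of the
// balancedResourceScorer formula: 10 - |cpuFraction - memFraction| * 10.
func balancedScore(cpuReq, cpuCap, memReq, memCap int64) int64 {
    cpuFraction := float64(cpuReq) / float64(cpuCap)
    memFraction := float64(memReq) / float64(memCap)
    diff := math.Abs(cpuFraction - memFraction)
    return int64((1 - diff) * 10)
}

func main() {
    // Node with 4000m CPU / 8192Mi memory; pods request 1000m CPU / 2048Mi.
    cpuScore := leastRequestedScore(1000, 4000)
    memScore := leastRequestedScore(2048, 8192)
    fmt.Println("least requested:", (cpuScore+memScore)/2)       // 7
    fmt.Println("balanced:", balancedScore(1000, 4000, 2048, 8192)) // 10
}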

image_locality (important)

favors nodes that already have the requested pod's container images.
It detects whether the requested images are present on a node, and then calculates a score ranging from 0 to 10
based on the total size of those images (a sketch follows the list below).

  • If none of the images are present, this node will be given the lowest priority.
  • If some of the images are present on a node, the larger their sizes' sum, the higher the node's priority.
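A minimal sketch of the size-to-score scaling, assuming min/max image-size thresholds similar to those in the source (the exact constants may differ across versions):

package main

import "fmt"

const (
    minImgSize int64 = 23 * 1024 * 1024   // below this, locality is ignored
    maxImgSize int64 = 1000 * 1024 * 1024 // at or above this, full score
)

// imageLocalityScore scales the total size of the pod's images that are
// already present on the node into a 0..10 score.
func imageLocalityScore(sumSize int64) int64 {
    switch {
    case sumSize < minImgSize:
        return 0 // no (or only tiny) images present: lowest priority
    case sumSize > maxImgSize:
        return 10
    default:
        return 10 * (sumSize - minImgSize) / (maxImgSize - minImgSize)
    }
}

func main() {
    fmt.Println(imageLocalityScore(0))                 // 0
    fmt.Println(imageLocalityScore(500 * 1024 * 1024)) // 4
}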

interpod_affinity (important)

computes a sum by iterating through the elements of weightedPodAffinityTerm and adding "weight" to the sum if the corresponding PodAffinityTerm is satisfied for that node; the node(s) with the highest sum are the most preferred.
Symmetry needs to be considered for preferredDuringSchedulingIgnoredDuringExecution from podAffinity & podAntiAffinity, and symmetry also needs to be considered for hard requirements from podAffinity.

node_affinity (important)

scheduling preferences are indicated in PreferredDuringSchedulingIgnoredDuringExecution. Each time a node matches a preferredSchedulingTerm, it gets an addition of preferredSchedulingTerm.Weight. Thus, the more preferredSchedulingTerms a node satisfies, and the higher the weights of the satisfied terms, the higher the node's score.
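A minimal sketch of this weighted sum, with the node-selector match reduced to a bool; interpod_affinity computes an analogous sum over its weighted terms, and in the real scheduler the raw sums are then normalized to the 0..10 range:

package main

import "fmt"

// A preferred scheduling term, reduced to its weight and whether the
// node matches its selector (the real term holds a node selector).
type preferredTerm struct {
    weight  int32
    matches bool
}

// nodeAffinityScore sums the weights of all preferredSchedulingTerms
// that the node satisfies.
func nodeAffinityScore(terms []preferredTerm) int32 {
    var sum int32
    for _, t := range terms {
        if t.matches {
            sum += t.weight
        }
    }
    return sum
}

func main() {
    fmt.Println(nodeAffinityScore([]preferredTerm{
        {weight: 100, matches: true},
        {weight: 50, matches: false},
    })) // 100
}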

node_label

checks whether a particular label exists on a node or not, regardless of its value.
If presence is true, prioritizes nodes that have the specified label, regardless of value.
If presence is false, prioritizes nodes that do not have the specified label.

node_prefer_avoid_pods

prioritizes nodes according to the node annotation "scheduler.alpha.kubernetes.io/preferAvoidPods".

selector_spreading (important)

  • SelectorSpreadPriority: spreads pods across hosts, considering pods belonging to the same service, RC, RS or StatefulSet.
    When a pod is scheduled, it looks for services, RCs, RSs and StatefulSets that match the pod, then finds existing pods that match those selectors.
    It favors nodes that have fewer existing matching pods,
    i.e. it pushes the scheduler towards a node with the smallest number of pods matching the same service, RC, RS or StatefulSet selectors as the pod being scheduled (see the sketch after this list).
  • ServiceAntiAffinityPriority: spreads pods by minimizing the number of pods belonging to the same service on a given machine.
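A minimal sketch of the spreading calculation, assuming the per-node counts of matching pods have already been gathered: the node with the fewest matching pods gets the full score of 10:

package main

import "fmt"

// selectorSpreadScores converts per-node counts of pods that match the
// same service/RC/RS/StatefulSet selectors into 0..10 scores.
func selectorSpreadScores(counts map[string]int) map[string]int {
    maxCount := 0
    for _, c := range counts {
        if c > maxCount {
            maxCount = c
        }
    }
    scores := make(map[string]int, len(counts))
    for node, c := range counts {
        if maxCount == 0 {
            scores[node] = 10 // no matching pods anywhere: all nodes score equally
            continue
        }
        scores[node] = 10 * (maxCount - c) / maxCount
    }
    return scores
}

func main() {
    fmt.Println(selectorSpreadScores(map[string]int{"node-a": 3, "node-b": 1, "node-c": 0}))
    // map[node-a:0 node-b:6 node-c:10]
}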

taint_toleration

prepares the priority list for all the nodes, based on the number of intolerable taints on the node. See taint-and-toleration for details.
