Eviction: How the Pod to Kill Is Selected

The eviction mechanism reclaims resources when resource pressure on a node reaches the thresholds you configure.

Two of the more important steps in the EvictionManager workflow are reclaimNodeLevelResources (reclaiming node-level resources) and killing a pod (killPod).

Killing a pod always involves a priority ordering that selects which pod to kill; the rest of this article walks through that ranking process.

pkg/kubelet/kubelet.go
func (kl *Kubelet) initializeRuntimeDependentModules() {
    if err := kl.cadvisor.Start(); err != nil {
        // Fail kubelet and rely on the babysitter to retry starting kubelet.
        // TODO(random-liu): Add backoff logic in the babysitter
        glog.Fatalf("Failed to start cAdvisor %v", err)
    }
    // eviction manager must start after cadvisor because it needs to know if the container runtime has a dedicated imagefs
    kl.evictionManager.Start(kl.cadvisor, kl.GetActivePods, kl.podResourcesAreReclaimed, kl.containerManager, evictionMonitoringPeriod)
}

This is code from the kubelet. The last line calls the evictionManager's Start function, which starts the eviction manager so it begins doing its work. The final argument, evictionMonitoringPeriod, is a constant of 10 seconds; it makes the eviction loop run once every 10 seconds.
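
For reference, evictionMonitoringPeriod is a constant declared in pkg/kubelet/kubelet.go; in the kubelet version this walkthrough is based on, its declaration looks roughly like this (paraphrased, the exact comment may differ):

const (
    // evictionMonitoringPeriod is the interval at which the eviction manager
    // re-runs synchronize when nothing was evicted in the previous pass.
    evictionMonitoringPeriod = time.Duration(10) * time.Second
)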

pkg/kubelet/eviction/eviction_manager.go
func (m *managerImpl) Start(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, podCleanedUpFunc PodCleanedUpFunc, capacityProvider CapacityProvider, monitoringInterval time.Duration) {
    // start the eviction manager monitoring
    go func() {
        for {
            if evictedPods := m.synchronize(diskInfoProvider, podFunc, capacityProvider); evictedPods != nil {
                glog.Infof("eviction manager: pods %s evicted, waiting for pod to be cleaned up", format.Pods(evictedPods))
                m.waitForPodsCleanup(podCleanedUpFunc, evictedPods)
            } else {
                time.Sleep(monitoringInterval)
            }
        }
    }()
}

This Start is the function the kubelet calls above. It in turn calls the synchronize function, which carries out the eviction workflow.

func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, capacityProvider CapacityProvider) []*v1.Pod {
    // if we have nothing to do, just return
    thresholds := m.config.Thresholds
    if len(thresholds) == 0 {
        return nil
    }

    glog.V(3).Infof("eviction manager: synchronize housekeeping")
    // build the ranking functions (if not yet known)
    // TODO: have a function in cadvisor that lets us know if global housekeeping has completed
    if m.dedicatedImageFs == nil {
        hasImageFs, ok := diskInfoProvider.HasDedicatedImageFs()
        if ok != nil {
            return nil
        }
        m.dedicatedImageFs = &hasImageFs
        m.resourceToRankFunc = buildResourceToRankFunc(hasImageFs)
        m.resourceToNodeReclaimFuncs = buildResourceToNodeReclaimFuncs(m.imageGC, m.containerGC, hasImageFs)
    }

    activePods := podFunc()
    // make observations and get a function to derive pod usage stats relative to those observations.
    observations, statsFunc, err := makeSignalObservations(m.summaryProvider, capacityProvider, activePods, *m.dedicatedImageFs)
    if err != nil {
        glog.Errorf("eviction manager: unexpected err: %v", err)
        return nil
    }
    debugLogObservations("observations", observations)

    // attempt to create a threshold notifier to improve eviction response time
    if m.config.KernelMemcgNotification && !m.notifiersInitialized {
        glog.Infof("eviction manager attempting to integrate with kernel memcg notification api")
        m.notifiersInitialized = true
        // start soft memory notification
        err = startMemoryThresholdNotifier(m.config.Thresholds, observations, false, func(desc string) {
            glog.Infof("soft memory eviction threshold crossed at %s", desc)
            // TODO wait grace period for soft memory limit
            m.synchronize(diskInfoProvider, podFunc, capacityProvider)
        })
        if err != nil {
            glog.Warningf("eviction manager: failed to create hard memory threshold notifier: %v", err)
        }
        // start hard memory notification
        err = startMemoryThresholdNotifier(m.config.Thresholds, observations, true, func(desc string) {
            glog.Infof("hard memory eviction threshold crossed at %s", desc)
            m.synchronize(diskInfoProvider, podFunc, capacityProvider)
        })
        if err != nil {
            glog.Warningf("eviction manager: failed to create soft memory threshold notifier: %v", err)
        }
    }

    // determine the set of thresholds met independent of grace period
    thresholds = thresholdsMet(thresholds, observations, false)
    debugLogThresholdsWithObservation("thresholds - ignoring grace period", thresholds, observations)

    // determine the set of thresholds previously met that have not yet satisfied the associated min-reclaim
    if len(m.thresholdsMet) > 0 {
        thresholdsNotYetResolved := thresholdsMet(m.thresholdsMet, observations, true)
        thresholds = mergeThresholds(thresholds, thresholdsNotYetResolved)
    }
    debugLogThresholdsWithObservation("thresholds - reclaim not satisfied", thresholds, observations)

    // determine the set of thresholds whose stats have been updated since the last sync
    thresholds = thresholdsUpdatedStats(thresholds, observations, m.lastObservations)
    debugLogThresholdsWithObservation("thresholds - updated stats", thresholds, observations)

    // track when a threshold was first observed
    now := m.clock.Now()
    thresholdsFirstObservedAt := thresholdsFirstObservedAt(thresholds, m.thresholdsFirstObservedAt, now)

    // the set of node conditions that are triggered by currently observed thresholds
    nodeConditions := nodeConditions(thresholds)
    if len(nodeConditions) > 0 {
        glog.V(3).Infof("eviction manager: node conditions - observed: %v", nodeConditions)
    }

    // track when a node condition was last observed
    nodeConditionsLastObservedAt := nodeConditionsLastObservedAt(nodeConditions, m.nodeConditionsLastObservedAt, now)

    // node conditions report true if it has been observed within the transition period window
    nodeConditions = nodeConditionsObservedSince(nodeConditionsLastObservedAt, m.config.PressureTransitionPeriod, now)
    if len(nodeConditions) > 0 {
        glog.V(3).Infof("eviction manager: node conditions - transition period not met: %v", nodeConditions)
    }

    // determine the set of thresholds we need to drive eviction behavior (i.e. all grace periods are met)
    thresholds = thresholdsMetGracePeriod(thresholdsFirstObservedAt, now)
    debugLogThresholdsWithObservation("thresholds - grace periods satisified", thresholds, observations)

    // update internal state
    m.Lock()
    m.nodeConditions = nodeConditions
    m.thresholdsFirstObservedAt = thresholdsFirstObservedAt
    m.nodeConditionsLastObservedAt = nodeConditionsLastObservedAt
    m.thresholdsMet = thresholds
    m.lastObservations = observations
    m.Unlock()

    // evict pods if there is a resource usage violation from local volume temporary storage
    // If eviction happens in localVolumeEviction function, skip the rest of eviction action
    if utilfeature.DefaultFeatureGate.Enabled(features.LocalStorageCapacityIsolation) {
        if evictedPods := m.localStorageEviction(activePods); len(evictedPods) > 0 {
            return evictedPods
        }
    }

    // determine the set of resources under starvation
    starvedResources := getStarvedResources(thresholds)
    if len(starvedResources) == 0 {
        glog.V(3).Infof("eviction manager: no resources are starved")
        return nil
    }

    // rank the resources to reclaim by eviction priority
    sort.Sort(byEvictionPriority(starvedResources))
    resourceToReclaim := starvedResources[0]
    glog.Warningf("eviction manager: attempting to reclaim %v", resourceToReclaim)

    // determine if this is a soft or hard eviction associated with the resource
    softEviction := isSoftEvictionThresholds(thresholds, resourceToReclaim)

    // record an event about the resources we are now attempting to reclaim via eviction
    m.recorder.Eventf(m.nodeRef, v1.EventTypeWarning, "EvictionThresholdMet", "Attempting to reclaim %s", resourceToReclaim)

    // check if there are node-level resources we can reclaim to reduce pressure before evicting end-user pods.
    if m.reclaimNodeLevelResources(resourceToReclaim, observations) {
        glog.Infof("eviction manager: able to reduce %v pressure without evicting pods.", resourceToReclaim)
        return nil
    }

    glog.Infof("eviction manager: must evict pod(s) to reclaim %v", resourceToReclaim)

    // rank the pods for eviction
    rank, ok := m.resourceToRankFunc[resourceToReclaim]
    if !ok {
        glog.Errorf("eviction manager: no ranking function for resource %s", resourceToReclaim)
        return nil
    }

    // the only candidates viable for eviction are those pods that had anything running.
    if len(activePods) == 0 {
        glog.Errorf("eviction manager: eviction thresholds have been met, but no pods are active to evict")
        return nil
    }

    // rank the running pods for eviction for the specified resource
    rank(activePods, statsFunc)

    glog.Infof("eviction manager: pods ranked for eviction: %s", format.Pods(activePods))

    //record age of metrics for met thresholds that we are using for evictions.
    for _, t := range thresholds {
        timeObserved := observations[t.Signal].time
        if !timeObserved.IsZero() {
            metrics.EvictionStatsAge.WithLabelValues(string(t.Signal)).Observe(metrics.SinceInMicroseconds(timeObserved.Time))
        }
    }

    // we kill at most a single pod during each eviction interval
    for i := range activePods {
        pod := activePods[i]
        // If the pod is marked as critical and static, and support for critical pod annotations is enabled,
        // do not evict such pods. Static pods are not re-admitted after evictions.
        // https://github.com/kubernetes/kubernetes/issues/40573 has more details.
        if utilfeature.DefaultFeatureGate.Enabled(features.ExperimentalCriticalPodAnnotation) &&
            kubelettypes.IsCriticalPod(pod) && kubepod.IsStaticPod(pod) {
            continue
        }
        status := v1.PodStatus{
            Phase:   v1.PodFailed,
            Message: fmt.Sprintf(message, resourceToReclaim),
            Reason:  reason,
        }
        // record that we are evicting the pod
        m.recorder.Eventf(pod, v1.EventTypeWarning, reason, fmt.Sprintf(message, resourceToReclaim))
        gracePeriodOverride := int64(0)
        if softEviction {
            gracePeriodOverride = m.config.MaxPodGracePeriodSeconds
        }
        // this is a blocking call and should only return when the pod and its containers are killed.
        err := m.killPodFunc(pod, status, &gracePeriodOverride)
        if err != nil {
            glog.Warningf("eviction manager: error while evicting pod %s: %v", format.Pod(pod), err)
        }
        return []*v1.Pod{pod}
    }
    glog.Infof("eviction manager: unable to evict any pods from the node")
    return nil
}

activePods := podFunc() fetches the active pods on the node, and rank(activePods, statsFunc) then sorts those active pods. The sorting logic that rank ultimately uses lives in pkg/kubelet/eviction/helpers.go.
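
For context, the multiSorter type and the orderedBy helper that drive this sort are defined in the same file but are not quoted below; roughly (paraphrased from the same kubelet version), they look like this, and the Swap and Less methods quoted next complete the sort.Interface implementation:

// cmpFunc compares two pods and returns a negative, zero, or positive value
// depending on which pod should come first in the eviction order.
type cmpFunc func(p1, p2 *v1.Pod) int

// multiSorter sorts pods by chaining several cmpFuncs together.
type multiSorter struct {
    pods []*v1.Pod
    cmp  []cmpFunc
}

// Sort sorts the argument slice according to the cmp functions passed to orderedBy.
func (ms *multiSorter) Sort(pods []*v1.Pod) {
    ms.pods = pods
    sort.Sort(ms)
}

// orderedBy returns a Sorter that sorts using the cmp functions, in order.
// Call its Sort method to sort the data.
func orderedBy(cmp ...cmpFunc) *multiSorter {
    return &multiSorter{
        cmp: cmp,
    }
}

// Len is part of sort.Interface.
func (ms *multiSorter) Len() int {
    return len(ms.pods)
}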

// Swap is part of sort.Interface.
func (ms *multiSorter) Swap(i, j int) {
    ms.pods[i], ms.pods[j] = ms.pods[j], ms.pods[i]
}

// Less is part of sort.Interface.
func (ms *multiSorter) Less(i, j int) bool {
    p1, p2 := ms.pods[i], ms.pods[j]
    var k int
    for k = 0; k < len(ms.cmp)-1; k++ {
        cmpResult := ms.cmp[k](p1, p2)
        // p1 is less than p2
        if cmpResult < 0 {
            return true
        }
        // p1 is greater than p2
        if cmpResult > 0 {
            return false
        }
        // we don't know yet
    }
    // the last cmp func is the final decider
    return ms.cmp[k](p1, p2) < 0
}

func qosComparator(p1, p2 *v1.Pod) int {
    qosP1 := v1qos.GetPodQOS(p1)
    qosP2 := v1qos.GetPodQOS(p2)
    // its a tie
    if qosP1 == qosP2 {
        return 0
    }
    // if p1 is best effort, we know p2 is burstable or guaranteed
    if qosP1 == v1.PodQOSBestEffort {
        return -1
    }
    // we know p1 and p2 are not besteffort, so if p1 is burstable, p2 must be guaranteed
    if qosP1 == v1.PodQOSBurstable {
        if qosP2 == v1.PodQOSGuaranteed {
            return -1
        }
        return 1
    }
    // ok, p1 must be guaranteed.
    return 1
}

// memory compares pods by largest consumer of memory relative to request.
func memory(stats statsFunc) cmpFunc {
    return func(p1, p2 *v1.Pod) int {
        p1Stats, found := stats(p1)
        // if we have no usage stats for p1, we want p2 first
        if !found {
            return -1
        }
        // if we have no usage stats for p2, but p1 has usage, we want p1 first.
        p2Stats, found := stats(p2)
        if !found {
            return 1
        }
        // if we cant get usage for p1 measured, we want p2 first
        p1Usage, err := podMemoryUsage(p1Stats)
        if err != nil {
            return -1
        }
        // if we cant get usage for p2 measured, we want p1 first
        p2Usage, err := podMemoryUsage(p2Stats)
        if err != nil {
            return 1
        }

        // adjust p1, p2 usage relative to the request (if any)
        p1Memory := p1Usage[v1.ResourceMemory]
        p1Spec, err := core.PodUsageFunc(p1)
        if err != nil {
            return -1
        }
        p1Request := p1Spec[api.ResourceRequestsMemory]
        p1Memory.Sub(p1Request)

        p2Memory := p2Usage[v1.ResourceMemory]
        p2Spec, err := core.PodUsageFunc(p2)
        if err != nil {
            return 1
        }
        p2Request := p2Spec[api.ResourceRequestsMemory]
        p2Memory.Sub(p2Request)

        // if p2 is using more than p1, we want p2 first
        return p2Memory.Cmp(p1Memory)
    }
}

// disk compares pods by largest consumer of disk relative to request for the specified disk resource.
func disk(stats statsFunc, fsStatsToMeasure []fsStatsType, diskResource v1.ResourceName) cmpFunc {
    return func(p1, p2 *v1.Pod) int {
        p1Stats, found := stats(p1)
        // if we have no usage stats for p1, we want p2 first
        if !found {
            return -1
        }
        // if we have no usage stats for p2, but p1 has usage, we want p1 first.
        p2Stats, found := stats(p2)
        if !found {
            return 1
        }
        // if we cant get usage for p1 measured, we want p2 first
        p1Usage, err := podDiskUsage(p1Stats, p1, fsStatsToMeasure)
        if err != nil {
            return -1
        }
        // if we cant get usage for p2 measured, we want p1 first
        p2Usage, err := podDiskUsage(p2Stats, p2, fsStatsToMeasure)
        if err != nil {
            return 1
        }

        // disk is best effort, so we don't measure relative to a request.
        // TODO: add disk as a guaranteed resource
        p1Disk := p1Usage[diskResource]
        p2Disk := p2Usage[diskResource]
        // if p2 is using more than p1, we want p2 first
        return p2Disk.Cmp(p1Disk)
    }
}

// rankMemoryPressure orders the input pods for eviction in response to memory pressure.
func rankMemoryPressure(pods []*v1.Pod, stats statsFunc) {
    orderedBy(qosComparator, memory(stats)).Sort(pods)
}

// rankDiskPressureFunc returns a rankFunc that measures the specified fs stats.
func rankDiskPressureFunc(fsStatsToMeasure []fsStatsType, diskResource v1.ResourceName) rankFunc {
    return func(pods []*v1.Pod, stats statsFunc) {
        orderedBy(qosComparator, disk(stats, fsStatsToMeasure, diskResource)).Sort(pods)
    }
}

rankMemoryPressure orders pods under memory pressure, and rankDiskPressureFunc orders pods under disk pressure. qosComparator, memory, and disk are the concrete comparison strategies.

The sort first applies qosComparator: pods whose QoS class is PodQOSBestEffort are ranked first and killed first. If two pods have the same QoS class, they are then ordered by the memory or disk rule according to usage (memory usage is measured relative to the pod's request), and the pod with the higher usage is killed first. Each eviction pass kills at most a single pod; the next pass runs after the evicted pod is cleaned up, or after the 10-second monitoring interval if nothing was evicted.
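
To make the ordering concrete, here is a small self-contained sketch. The podInfo type, the qosClass values, and the usageOverRequest field are illustrative stand-ins rather than kubelet types; the sketch only reproduces the same two-level rule used by rankMemoryPressure: QoS class first, then usage relative to request.

package main

import (
    "fmt"
    "sort"
)

// qosClass and podInfo are simplified stand-ins for illustration only.
type qosClass int

const (
    bestEffort qosClass = iota // evicted first
    burstable
    guaranteed // evicted last
)

type podInfo struct {
    name             string
    qos              qosClass
    usageOverRequest int64 // memory usage minus request, in bytes
}

func main() {
    pods := []podInfo{
        {"guaranteed-low", guaranteed, -100},
        {"besteffort-big", bestEffort, 500},
        {"burstable-over", burstable, 300},
        {"besteffort-small", bestEffort, 50},
    }
    // Order the same way rankMemoryPressure does: BestEffort before Burstable
    // before Guaranteed, and within the same class the larger consumer first.
    sort.Slice(pods, func(i, j int) bool {
        if pods[i].qos != pods[j].qos {
            return pods[i].qos < pods[j].qos
        }
        return pods[i].usageOverRequest > pods[j].usageOverRequest
    })
    for _, p := range pods {
        fmt.Println(p.name)
    }
    // Prints: besteffort-big, besteffort-small, burstable-over, guaranteed-low
}

The pod printed first is the one the eviction loop would attempt to kill first in that pass.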
