Netflix Performance Analysis Model: In 60 Seconds

Translated from the Netflix Tech Blog; original author: Brendan Gregg

Linux Performance Analysis in 60,000 Milliseconds

You log in to a Linux server to troubleshoot a performance issue: what should you check in the first minute?
At Netflix we have a massive EC2 Linux cloud, and numerous performance analysis tools for monitoring and investigating its performance. These include Atlas for cloud-wide monitoring and Vector for on-demand instance analysis.

While those tools help us solve most issues, we sometimes need to log in to an instance and run some standard Linux performance tools.

In this post, the Netflix Performance Engineering team will show you how to run an optimized performance investigation in the first 60 seconds, using standard Linux command-line tools.

The Golden 60 Seconds: Summary

By running the following ten commands, you can get a high-level idea of system resource usage and running processes within 60 seconds. Look for errors and saturation metrics first, as they are both easy to interpret, then move on to resource utilization. Saturation means a resource has more load than it can handle; it can show up either as the length of a request queue or as time spent waiting.

uptime
dmesg | tail
vmstat 1
mpstat -P ALL 1
pidstat 1
iostat -xz 1
free -m
sar -n DEV 1
sar -n TCP,ETCP 1
top
(Translator's figure: perf check path)

Some of these commands require the sysstat package to be installed.
The metrics these commands expose will help you work through the USE Method: a methodology for locating performance bottlenecks. This means checking the utilization, saturation, and error metrics of all resources (CPUs, memory, disks, and so on). Also note which resources you check and exonerate along the way: by process of elimination, this narrows down the targets to study and directs any follow-on investigation.
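
If you want to capture all ten outputs in one pass for later comparison, a minimal wrapper script along the lines below works (a sketch only: the name check60.sh and the five-sample counts are arbitrary choices, and sysstat must be installed). The sampled commands take about five seconds each, so the whole pass stays well within the one-minute budget.

#!/bin/bash
# check60.sh - run the ten checks once, logging everything to a timestamped file
exec > perf-$(date +%Y%m%d-%H%M%S).log 2>&1
uptime
dmesg | tail
vmstat 1 5
mpstat -P ALL 1 5
pidstat 1 5
iostat -xz 1 5
free -m
sar -n DEV 1 5
sar -n TCP,ETCP 1 5
top -b -n 1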

The following sections walk through these commands with examples from a production system. For more information about these tools, see their man pages.

1. uptime

$ uptime
23:51:26 up 21:31, 1 user, load average: 30.02, 26.43, 19.02

This is a quick way to view the load averages, which indicate the number of tasks (processes) wanting to run.
On Linux systems, these numbers include processes wanting to run on a CPU, as well as processes blocked in uninterruptible I/O (usually disk I/O).

This gives a high-level idea of resource load (or demand), but it can't be properly understood without other tools. It is worth a quick look only.

The three numbers are exponentially damped moving averages with 1-minute, 5-minute, and 15-minute time constants. Together they give some idea of how load is changing over time.

For example, if you've been asked to check a problem server and the 1-minute value is much lower than the 15-minute value, you may have logged in too late and missed the issue.

In the example above, the load averages show a recent increase, hitting 30 for the 1-minute value, compared to 19 for the 15-minute value. Numbers this large mean a lot of something: probably CPU demand; vmstat or mpstat, the third and fourth commands in this sequence, will confirm.
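
The same 1-minute versus 15-minute comparison can be read straight from /proc/loadavg; a one-line sketch using plain awk:

$ awk '{ msg = ($1 > $3) ? "rising" : "falling or flat"; print "1-min", $1, "vs 15-min", $3, "-> load", msg }' /proc/loadavg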

2. dmesg | tail

$ dmesg | tail
[1880957.563150] perl invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[...]
[1880957.563400] Out of memory: Kill process 18694 (perl) score 246 or sacrifice child
[1880957.563408] Killed process 18694 (perl) total-vm:1972392kB, anon-rss:1953348kB, file-rss:0kB
[2320864.954447] TCP: Possible SYN flooding on port 7001. Dropping request. Check SNMP counters.

This views the last 10 system messages, if there are any.
Look for errors that can cause performance issues. The example above includes the oom-killer, and TCP dropping a request.

PS (translator): this step is really easy to skip, and I have been bitten by it. Besides error-level messages, also keep an eye on info-level ones; they may hold hidden clues.

[Translator's note: oom-killer]
A protection mechanism that keeps Linux out of serious trouble when memory runs out: it kills off less important processes, a bit like cutting off an arm to save the body.
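
When the last ten lines are not enough, grepping for likely trouble keywords is a quick next step; a sketch (the keyword list is only an assumption, extend it for your environment). On newer util-linux versions, dmesg --level=err,warn filters by severity directly.

$ dmesg | grep -iE 'oom|out of memory|error|fail|drop' | tail -20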

3. vmstat 1

$ vmstat 1
procs ---------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
34 0 0 200889792 73708 591828 0 0 0 5 6 10 96 1 3 0 0
32 0 0 200889920 73708 591860 0 0 0 592 13284 4282 98 1 1 0 0
32 0 0 200890112 73708 591860 0 0 0 0 9501 2154 99 1 0 0 0
32 0 0 200889568 73712 591856 0 0 0 48 11900 2459 99 0 0 0 0
32 0 0 200890208 73712 591860 0 0 0 0 15898 4840 98 1 1 0 0

vmstat is short for virtual memory statistics; it is a commonly available tool (first created for BSD decades ago). It prints a summary of key server statistics on each line.
vmstat was run with an argument of 1, to print one-second summaries.
The first line of output shows averages since boot, instead of the previous second.

For now, skip the first line; let's learn and remember what each column means.

r: the number of processes running on a CPU or waiting for a turn.
This provides a better signal than the load averages for determining CPU saturation, as it does not include I/O. To interpret: an "r" value greater than the CPU count means saturation.

free: free memory in kilobytes.
If there are too many digits to count, you have plenty of free memory.
The "free -m" command, included as command 7, explains the state of free memory in more detail.

si, so: swap-ins and swap-outs.
If these are non-zero, you're out of memory.

us, sy, id, wa, st:
These are breakdowns of CPU time, averaged across all CPUs.
They are user time, system time (kernel), idle, wait I/O, and stolen time (by other guests, or, with Xen, the guest's own isolated driver domain). The CPU time breakdown confirms whether the CPUs are busy, by adding up the user and system time.

A constant degree of wait I/O points to a disk bottleneck; the CPUs are idle then, because tasks are blocked waiting for pending disk I/O. You can treat wait I/O as another form of CPU idle, one that gives a clue as to why the CPUs are idle.

System time is necessary for I/O processing. A high average system time, over 20%, is worth exploring further: perhaps the kernel is processing I/O inefficiently.

In the example above, CPU time is almost entirely at the user level, pointing to application-level usage instead. The CPUs are also well over 90% utilized on average. This isn't necessarily a problem; check the degree of saturation using the "r" column.
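
To turn the "r"-versus-CPU-count rule into a quick check, the run-queue column can be compared against nproc; a sketch that samples for five seconds and prints only the saturated intervals:

$ vmstat 1 5 | awk -v n=$(nproc) 'NR > 2 && $1+0 > n { print "run queue", $1, "exceeds", n, "CPUs" }'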

4. mpstat -P ALL 1

$ mpstat -P ALL 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)
07:38:49 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
07:38:50 PM all 98.47 0.00 0.75 0.00 0.00 0.00 0.00 0.00 0.00 0.78
07:38:50 PM 0 96.04 0.00 2.97 0.00 0.00 0.00 0.00 0.00 0.00 0.99
07:38:50 PM 1 97.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 2.00
07:38:50 PM 2 98.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
07:38:50 PM 3 96.97 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.03
[...]

This command prints the CPU time breakdown per CPU, which can be used to check for imbalance.
A single hot CPU can be evidence of a single-threaded application.
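
A hot CPU can also be picked out of a long per-CPU list automatically by filtering the %idle column (the last field); a sketch, assuming the 12-hour timestamp format shown above, where the CPU number is the third field (adjust the field numbers if your output has no AM/PM column):

$ mpstat -P ALL 1 1 | awk '$3 ~ /^[0-9]+$/ && $NF+0 < 10 { print "CPU", $3, "is only", $NF"% idle" }'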

5. pidstat 1

$ pidstat 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)
07:41:02 PM UID PID %usr %system %guest %CPU CPU Command
07:41:03 PM 0 9 0.00 0.94 0.00 0.94 1 rcuos/0
07:41:03 PM 0 4214 5.66 5.66 0.00 11.32 15 mesos-slave
07:41:03 PM 0 4354 0.94 0.94 0.00 1.89 8 java
07:41:03 PM 0 6521 1596.23 1.89 0.00 1598.11 27 java
07:41:03 PM 0 6564 1571.70 7.55 0.00 1579.25 28 java
07:41:03 PM 60004 60154 0.94 4.72 0.00 5.66 9 pidstat
07:41:03 PM UID PID %usr %system %guest %CPU CPU Command
07:41:04 PM 0 4214 6.00 2.00 0.00 8.00 15 mesos-slave
07:41:04 PM 0 6521 1590.00 1.00 0.00 1591.00 27 java
07:41:04 PM 0 6564 1573.00 10.00 0.00 1583.00 28 java
07:41:04 PM 108 6718 1.00 0.00 0.00 1.00 0 snmp-pass
07:41:04 PM 60004 60154 1.00 4.00 0.00 5.00 9 pidstat
^C

pidstat is a little like top's per-process summary, but it prints a rolling summary instead of clearing the screen.
This is useful for watching patterns over time, and also for recording what you saw (copy and paste) into your investigation notes.
The example above identifies two java processes as responsible for consuming the CPUs.
The "%CPU" column is the total across all CPUs; 1591% shows that java process is consuming almost 16 CPUs.
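
The arithmetic behind that statement is simply %CPU divided by 100: 1591 / 100 ≈ 15.9, so a single process is keeping roughly 16 of the 32 CPUs busy. To keep the rolling output as a record while still watching it, tee can be used (the log file name is only an example):

$ pidstat 1 60 | tee pidstat-$(hostname).log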

6. iostat -xz 1

$ iostat -xz 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
73.96 0.00 3.73 0.03 0.06 22.21
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvda 0.00 0.23 0.21 0.18 4.52 2.08 34.37 0.00 9.98 13.80 5.42 2.44 0.09
xvdb 0.01 0.00 1.02 8.94 127.97 598.53 145.79 0.00 0.43 1.78 0.28 0.25 0.25
xvdc 0.01 0.00 1.02 8.86 127.79 595.94 146.50 0.00 0.45 1.82 0.30 0.27 0.26
dm-0 0.00 0.00 0.69 2.32 10.47 31.69 28.01 0.01 3.23 0.71 3.98 0.13 0.04
dm-1 0.00 0.00 0.00 0.94 0.01 3.78 8.00 0.33 345.84 0.04 346.81 0.01 0.00
dm-2 0.00 0.00 0.09 0.07 1.35 0.36 22.50 0.00 2.55 0.23 5.62 1.78 0.03
[...]

This is a great tool for understanding block devices (disks), both the workload applied and the resulting performance. Look for:

r/s, w/s, rkB/s, wkB/s: the reads, writes, read Kbytes, and write Kbytes delivered to the device per second. Use these for workload characterization. A performance problem may simply be due to an excessive load having been applied.

await: the average time for the I/O, in milliseconds.
This is the time the application suffers, as it includes both time spent queued and time being serviced.
Averages much larger than expected can be an indicator of device saturation, or of device problems.

avgqu-sz: the average number of requests issued to the device.
Values greater than 1 can be evidence of saturation (although devices can typically operate on requests in parallel, especially virtual devices fronting multiple back-end disks).

%util: device utilization.
This is really a busy percentage, showing the time each second that the device was doing work.
Values greater than 60% typically lead to poor performance (which should show up in await), although it depends on the device.
Values close to 100% usually indicate saturation.

If the storage device is a logical disk fronting many back-end disks, then 100% utilization may just mean that some I/O is being processed 100% of the time; the back-end disks may be far from saturated and able to handle much more work.

Bear in mind that poorly performing disk I/O isn't necessarily an application issue. Many techniques are typically used to perform I/O asynchronously, so that the application doesn't block and suffer the latency directly (e.g., read-ahead for reads, and buffering for writes).
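
To pick the busy devices out of a long list, the %util column (the last field) can be filtered against the 60% rule of thumb above; a sketch (the device-name prefixes are assumptions, adjust them for your hardware):

$ iostat -xz 1 5 | awk '$1 ~ /^(xvd|sd|nvme|dm-)/ && $NF+0 > 60 { print $1, "is", $NF"% utilized" }'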

7. free -m

$ free -m
total used free shared buffers cached
Mem: 245998 24545 221453 83 59 541
-/+ buffers/cache: 23944 222053
Swap: 0 0 0

buffers: the buffer cache, used for block device I/O.
cached: the page cache, used by file systems.

We just want to check that these aren't near zero in size, which can lead to higher disk I/O and worse performance (confirm using iostat).
The example above looks fine, with many Mbytes in each.

"-/+ buffers/cache": provides less confusing values for used and free memory.

Linux uses free memory for the caches, but can reclaim it quickly if applications need it.
So, in a way, the cached memory should be counted as part of free memory, which is what this line does.
There is even a website, linuxatemyram, about this confusion.
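
The adjusted line is simple arithmetic over the Mem row: real used = used − buffers − cached, and real free = free + buffers + cached. For the example above, 24545 − 59 − 541 ≈ 23944 MB actually used, and 221453 + 59 + 541 = 222053 MB actually available, which is what the "-/+ buffers/cache" row reports (the 1 MB difference on the used side is rounding).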

It can be additionally confusing if ZFS on Linux is used, as we do for some services, because ZFS has its own file system cache that is not reflected properly in the free -m columns.

The system can appear to be low on free memory when that memory is in fact available, reclaimable from the ZFS cache as needed.

8. sar -n DEV 1

$ sar -n DEV 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)
12:16:48 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
12:16:49 AM eth0 18763.00 5032.00 20686.42 478.30 0.00 0.00 0.00 0.00
12:16:49 AM lo 14.00 14.00 1.36 1.36 0.00 0.00 0.00 0.00
12:16:49 AM docker0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
12:16:49 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
12:16:50 AM eth0 19763.00 5101.00 21999.10 482.56 0.00 0.00 0.00 0.00
12:16:50 AM lo 20.00 20.00 3.25 3.25 0.00 0.00 0.00 0.00
12:16:50 AM docker0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
^C

Use this tool to check network interface throughput:
rxkB/s and txkB/s, as a measure of workload, and also to check whether any limit has been reached.

In the example above, eth0 receive is reaching 22 Mbytes/s, which is 176 Mbits/sec (well under, say, a 1 Gbit/sec limit).
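
The conversion is simply bytes to bits: 22 Mbytes/s × 8 = 176 Mbits/s, or roughly 18% of a 1 Gbit/s interface, so there is plenty of headroom in this example.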

This version also reports %ifutil for device utilization (the maximum of both directions for full duplex), which is something we also use Brendan's nicstat tool to measure.
As with nicstat, this value is hard to get right, and it appears not to be working in this example (0.00).

9. sar -n TCP,ETCP 1

$ sar -n TCP,ETCP 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)
12:17:19 AM active/s passive/s iseg/s oseg/s
12:17:20 AM 1.00 0.00 10233.00 18846.00
12:17:19 AM atmptf/s estres/s retrans/s isegerr/s orsts/s
12:17:20 AM 0.00 0.00 0.00 0.00 0.00
12:17:20 AM active/s passive/s iseg/s oseg/s
12:17:21 AM 1.00 0.00 8359.00 6039.00
12:17:20 AM atmptf/s estres/s retrans/s isegerr/s orsts/s
12:17:21 AM 0.00 0.00 0.00 0.00 0.00
^C

This is a summarized view of some key TCP metrics. They include:

active/s: the number of locally-initiated TCP connections per second (e.g., via connect()).
passive/s: the number of remotely-initiated TCP connections per second (e.g., via accept()).
retrans/s: the number of TCP retransmits per second.

The active and passive counts are often useful as a rough measure of server load: the number of newly accepted connections (passive), and the number of downstream connections (active).

It can help to think of active as outbound and passive as inbound, but this isn't strictly true (e.g., consider a localhost-to-localhost connection).

Retransmits are a sign of a network or server issue; the cause may be an unreliable network (e.g., the public Internet), or a server that is overloaded and dropping packets.

The example above shows just one new TCP connection per second.
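
A useful derived figure is the retransmit rate as a fraction of output segments, retrans/s ÷ oseg/s: in the first interval above that is 0.00 ÷ 18846.00 = 0%, so there is nothing to chase here. A sustained non-zero percentage would point back at the network path or at an overloaded server, as described above.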

10. top

$ top
top - 00:15:40 up 21:56, 1 user, load average: 31.09, 29.87, 29.92
Tasks: 871 total, 1 running, 868 sleeping, 0 stopped, 2 zombie
%Cpu(s): 96.8 us, 0.4 sy, 0.0 ni, 2.7 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 25190241+total, 24921688 used, 22698073+free, 60448 buffers
KiB Swap: 0 total, 0 used, 0 free. 554208 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20248 root 20 0 0.227t 0.012t 18748 S 3090 5.2 29812:58 java
4213 root 20 0 2722544 64640 44232 S 23.5 0.0 233:35.37 mesos-slave
66128 titancl+ 20 0 24344 2332 1172 R 1.0 0.0 0:00.07 top
5235 root 20 0 38.227g 547004 49996 S 0.7 0.2 2:02.74 java
4299 root 20 0 20.015g 2.682g 16836 S 0.3 1.1 33:14.42 java
1 root 20 0 33620 2920 1496 S 0.0 0.0 0:03.82 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.02 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:05.35 ksoftirqd/0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
6 root 20 0 0 0 0 S 0.0 0.0 0:06.94 kworker/u256:0
8 root 20 0 0 0 0 S 0.0 0.0 2:38.05 rcu_sched

The top command includes many of the metrics we checked earlier.

It can be handy to run it to see whether anything looks wildly different from the results of the earlier commands, which would indicate that the load is variable.

A downside of top is that it is harder to see patterns over time; these may be clearer in tools such as vmstat and pidstat, which provide rolling output.

Evidence of intermittent issues can also be lost if you don't pause the output quickly enough (Ctrl-S to pause, Ctrl-Q to continue) before the screen refreshes.

Follow-on Analysis

There are many more commands and methodologies you can apply to drill deeper.

See Brendan's Linux Performance Tools tutorial from Velocity 2015, which works through over 40 commands covering observability, benchmarking, tuning, static performance tuning, profiling, and tracing.

Tackling system reliability and performance problems at web scale is one of our passions.

If you would like to join us in tackling these kinds of challenges, we are hiring!

Brendan Gregg

The original post follows:

Linux Performance Analysis in 60,000 Milliseconds

You login to a Linux server with a performance issue:
what do you check in the first minute?

At Netflix we have a massive EC2 Linux cloud, and numerous performance analysis tools to monitor and investigate its performance.

These include Atlas for cloud-wide monitoring, and Vector for on-demand instance analysis.

While those tools help us solve most issues, we sometimes need to login to an instance and run some standard Linux performance tools.

In this post, the Netflix Performance Engineering team will show you the first 60 seconds of an optimized performance investigation at the command line, using standard Linux tools you should have available.

First 60 Seconds: Summary

In 60 seconds you can get a high level idea of system resource usage and running processes by running the following ten commands.

Look for errors and saturation metrics, as they are both easy to interpret, and then resource utilization.

Saturation is where a resource has more load than it can handle, and can be exposed either as the length of a request queue, or time spent waiting.

uptime
dmesg | tail
vmstat 1
mpstat -P ALL 1
pidstat 1
iostat -xz 1
free -m
sar -n DEV 1
sar -n TCP,ETCP 1
top

Some of these commands require the sysstat package installed.
The metrics these commands expose will help you complete some of the USE Method: a methodology for locating performance bottlenecks.

This involves checking utilization, saturation, and error metrics for all resources (CPUs, memory, disks, e.t.c.).

Also pay attention to when you have checked and exonerated a resource, as by process of elimination this narrows the targets to study, and directs any follow on investigation.

The following sections summarize these commands, with examples from a production system. For more information about these tools, see their man pages.

1. uptime

$ uptime  
23:51:26 up 21:31, 1 user, load average: 30.02, 26.43, 19.02

This is a quick way to view the load averages, which indicate the number of tasks (processes) wanting to run. On Linux systems, these numbers include processes wanting to run on CPU, as well as processes blocked in uninterruptible I/O (usually disk I/O).

This gives a high level idea of resource load (or demand), but can’t be properly understood without other tools.

Worth a quick look only.

The three numbers are exponentially damped moving sum averages with a 1 minute, 5 minute, and 15 minute constant.

The three numbers give us some idea of how load is changing over time.

For example, if you’ve been asked to check a problem server, and the 1 minute value is much lower than the 15 minute value, then you might have logged in too late and missed the issue.

In the example above, the load averages show a recent increase, hitting 30 for the 1 minute value, compared to 19 for the 15 minute value.

That the numbers are this large means a lot of something: probably CPU demand; vmstat or mpstat will confirm, which are commands 3 and 4 in this sequence.

2. dmesg | tail

$ dmesg | tail
[1880957.563150] perl invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[...]
[1880957.563400] Out of memory: Kill process 18694 (perl) score 246 or sacrifice child
[1880957.563408] Killed process 18694 (perl) total-vm:1972392kB, anon-rss:1953348kB, file-rss:0kB
[2320864.954447] TCP: Possible SYN flooding on port 7001. Dropping request. Check SNMP counters.

This views the last 10 system messages, if there are any.

Look for errors that can cause performance issues.

The example above includes the oom-killer, and TCP dropping a request.
Don’t miss this step! dmesg is always worth checking.

3. vmstat 1

$ vmstat 1
procs ---------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
34 0 0 200889792 73708 591828 0 0 0 5 6 10 96 1 3 0 0
32 0 0 200889920 73708 591860 0 0 0 592 13284 4282 98 1 1 0 0
32 0 0 200890112 73708 591860 0 0 0 0 9501 2154 99 1 0 0 0
32 0 0 200889568 73712 591856 0 0 0 48 11900 2459 99 0 0 0 0
32 0 0 200890208 73712 591860 0 0 0 0 15898 4840 98 1 1 0 0

Short for virtual memory stat, vmstat(8) is a commonly available tool (first created for BSD decades ago).

It prints a summary of key server statistics on each line.

vmstat was run with an argument of 1, to print one second summaries.

The first line of output (in this version of vmstat) has some columns that show the average since boot, instead of the previous second.

For now, skip the first line, unless you want to learn and remember which column is which. Columns to check:

r:
Number of processes running on CPU and waiting for a turn.
This provides a better signal than load averages for determining CPU saturation, as it does not include I/O.

To interpret: an “r” value greater than the CPU count is saturation.

free: Free memory in kilobytes.

If there are too many digits to count, you have enough free memory.

The “free -m” command, included as command 7, better explains the state of free memory.

si, so:
Swap-ins and swap-outs. If these are non-zero, you’re out of memory.

us, sy, id, wa, st:
These are breakdowns of CPU time, on average across all CPUs.
They are user time, system time (kernel), idle, wait I/O,
and stolen time (by other guests, or with Xen, the guest's own isolated driver domain).
The CPU time breakdowns will confirm if the CPUs are busy, by adding user + system time.
A constant degree of wait I/O points to a disk bottleneck;
this is where the CPUs are idle, because tasks are blocked waiting for pending disk I/O. You can treat wait I/O as another form of CPU idle, one that gives a clue as to why they are idle.

System time is necessary for I/O processing.

A high system time average, over 20%, can be interesting to explore further: perhaps the kernel is processing the I/O inefficiently.

In the above example, CPU time is almost entirely in user-level, pointing to application level usage instead.

The CPUs are also well over 90% utilized on average.
This isn’t necessarily a problem; check for the degree of saturation using the “r” column.

4. mpstat -P ALL 1

$ mpstat -P ALL 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)
07:38:49 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
07:38:50 PM all 98.47 0.00 0.75 0.00 0.00 0.00 0.00 0.00 0.00 0.78
07:38:50 PM 0 96.04 0.00 2.97 0.00 0.00 0.00 0.00 0.00 0.00 0.99
07:38:50 PM 1 97.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 2.00
07:38:50 PM 2 98.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
07:38:50 PM 3 96.97 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.03
[...]

This command prints CPU time breakdowns per CPU, which can be used to check for an imbalance.

A single hot CPU can be evidence of a single-threaded application.

5. pidstat 1

$ pidstat 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)
07:41:02 PM UID PID %usr %system %guest %CPU CPU Command
07:41:03 PM 0 9 0.00 0.94 0.00 0.94 1 rcuos/0
07:41:03 PM 0 4214 5.66 5.66 0.00 11.32 15 mesos-slave
07:41:03 PM 0 4354 0.94 0.94 0.00 1.89 8 java
07:41:03 PM 0 6521 1596.23 1.89 0.00 1598.11 27 java
07:41:03 PM 0 6564 1571.70 7.55 0.00 1579.25 28 java
07:41:03 PM 60004 60154 0.94 4.72 0.00 5.66 9 pidstat
07:41:03 PM UID PID %usr %system %guest %CPU CPU Command
07:41:04 PM 0 4214 6.00 2.00 0.00 8.00 15 mesos-slave
07:41:04 PM 0 6521 1590.00 1.00 0.00 1591.00 27 java
07:41:04 PM 0 6564 1573.00 10.00 0.00 1583.00 28 java
07:41:04 PM 108 6718 1.00 0.00 0.00 1.00 0 snmp-pass
07:41:04 PM 60004 60154 1.00 4.00 0.00 5.00 9 pidstat
^C

Pidstat is a little like top’s per-process summary, but prints a rolling summary instead of clearing the screen.

This can be useful for watching patterns over time, and also recording what you saw (copy-n-paste) into a record of your investigation.

The above example identifies two java processes as responsible for consuming CPU.

The %CPU column is the total across all CPUs; 1591% shows that that java process is consuming almost 16 CPUs.

6. iostat -xz 1

$ iostat -xz 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
73.96 0.00 3.73 0.03 0.06 22.21
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvda 0.00 0.23 0.21 0.18 4.52 2.08 34.37 0.00 9.98 13.80 5.42 2.44 0.09
xvdb 0.01 0.00 1.02 8.94 127.97 598.53 145.79 0.00 0.43 1.78 0.28 0.25 0.25
xvdc 0.01 0.00 1.02 8.86 127.79 595.94 146.50 0.00 0.45 1.82 0.30 0.27 0.26
dm-0 0.00 0.00 0.69 2.32 10.47 31.69 28.01 0.01 3.23 0.71 3.98 0.13 0.04
dm-1 0.00 0.00 0.00 0.94 0.01 3.78 8.00 0.33 345.84 0.04 346.81 0.01 0.00
dm-2 0.00 0.00 0.09 0.07 1.35 0.36 22.50 0.00 2.55 0.23 5.62 1.78 0.03
[...]

This is a great tool for understanding block devices (disks), both the workload applied and the resulting performance.
Look for:
r/s, w/s, rkB/s, wkB/s:
These are the delivered reads, writes, read Kbytes, and write Kbytes per second to the device.
Use these for workload characterization.
A performance problem may simply be due to an excessive load applied.

await:
The average time for the I/O in milliseconds.
This is the time that the application suffers, as it includes both time queued and time being serviced. Larger than expected average times can be an indicator of device saturation, or device problems.

avgqu-sz:
The average number of requests issued to the device. Values greater than 1 can be evidence of saturation (although devices can typically operate on requests in parallel, especially virtual devices which front multiple back-end disks.)

%util:
Device utilization.
This is really a busy percent, showing the time each second that the device was doing work. Values greater than 60% typically lead to poor performance (which should be seen in await), although it depends on the device.
Values close to 100% usually indicate saturation.

If the storage device is a logical disk device fronting many back-end disks, then 100% utilization may just mean that some I/O is being processed 100% of the time, however, the back-end disks may be far from saturated, and may be able to handle much more work.

Bear in mind that poor performing disk I/O isn’t necessarily an application issue.

Many techniques are typically used to perform I/O asynchronously, so that the application doesn’t block and suffer the latency directly (e.g., read-ahead for reads, and buffering for writes).

7. free -m

$ free -m
total used free shared buffers cached
Mem: 245998 24545 221453 83 59 541
-/+ buffers/cache: 23944 222053
Swap: 0 0 0

The right two columns show:
buffers: For the buffer cache, used for block device I/O.
cached: For the page cache, used by file systems.

We just want to check that these aren’t near-zero in size, which can lead to higher disk I/O (confirm using iostat), and worse performance. The above example looks fine, with many Mbytes in each.

The “-/+ buffers/cache” provides less confusing values for used and free memory. Linux uses free memory for the caches, but can reclaim it quickly if applications need it. So in a way the cached memory should be included in the free memory column, which this line does. There’s even a website, linuxatemyram, about this confusion.

It can be additionally confusing if ZFS on Linux is used, as we do for some services, as ZFS has its own file system cache that isn’t reflected properly by the free -m columns. It can appear that the system is low on free memory, when that memory is in fact available for use from the ZFS cache as needed.

8. sar -n DEV 1

$ sar -n DEV 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)
12:16:48 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
12:16:49 AM eth0 18763.00 5032.00 20686.42 478.30 0.00 0.00 0.00 0.00
12:16:49 AM lo 14.00 14.00 1.36 1.36 0.00 0.00 0.00 0.00
12:16:49 AM docker0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
12:16:49 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
12:16:50 AM eth0 19763.00 5101.00 21999.10 482.56 0.00 0.00 0.00 0.00
12:16:50 AM lo 20.00 20.00 3.25 3.25 0.00 0.00 0.00 0.00
12:16:50 AM docker0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
^C

Use this tool to check network interface throughput:
rxkB/s and txkB/s, as a measure of workload, and also to check if any limit has been reached.

In the above example, eth0 receive is reaching 22 Mbytes/s, which is 176 Mbits/sec (well under, say, a 1 Gbit/sec limit).

This version also has %ifutil for device utilization (max of both directions for full duplex), which is something we also use Brendan’s nicstat tool to measure. And like with nicstat, this is hard to get right, and seems to not be working in this example (0.00).

9. sar -n TCP,ETCP 1

$ sar -n TCP,ETCP 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)
12:17:19 AM active/s passive/s iseg/s oseg/s
12:17:20 AM 1.00 0.00 10233.00 18846.00
12:17:19 AM atmptf/s estres/s retrans/s isegerr/s orsts/s
12:17:20 AM 0.00 0.00 0.00 0.00 0.00
12:17:20 AM active/s passive/s iseg/s oseg/s
12:17:21 AM 1.00 0.00 8359.00 6039.00
12:17:20 AM atmptf/s estres/s retrans/s isegerr/s orsts/s
12:17:21 AM 0.00 0.00 0.00 0.00 0.00
^C

active/s: Number of locally-initiated TCP connections per second (e.g., via connect()).

passive/s: Number of remotely-initiated TCP connections per second (e.g., via accept()).

retrans/s: Number of TCP retransmits per second.

The active and passive counts are often useful as a rough measure of server load: number of new accepted connections (passive), and number of downstream connections (active).

It might help to think of active as outbound, and passive as inbound, but this isn’t strictly true (e.g., consider a localhost to localhost connection).

Retransmits are a sign of a network or server issue; it may be an unreliable network (e.g., the public Internet), or it may be due to a server being overloaded and dropping packets. The example above shows just one new TCP connection per-second.

10. top

$ top
top - 00:15:40 up 21:56, 1 user, load average: 31.09, 29.87, 29.92
Tasks: 871 total, 1 running, 868 sleeping, 0 stopped, 2 zombie
%Cpu(s): 96.8 us, 0.4 sy, 0.0 ni, 2.7 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 25190241+total, 24921688 used, 22698073+free, 60448 buffers
KiB Swap: 0 total, 0 used, 0 free. 554208 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20248 root 20 0 0.227t 0.012t 18748 S 3090 5.2 29812:58 java
4213 root 20 0 2722544 64640 44232 S 23.5 0.0 233:35.37 mesos-slave
66128 titancl+ 20 0 24344 2332 1172 R 1.0 0.0 0:00.07 top
5235 root 20 0 38.227g 547004 49996 S 0.7 0.2 2:02.74 java
4299 root 20 0 20.015g 2.682g 16836 S 0.3 1.1 33:14.42 java
1 root 20 0 33620 2920 1496 S 0.0 0.0 0:03.82 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.02 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:05.35 ksoftirqd/0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
6 root 20 0 0 0 0 S 0.0 0.0 0:06.94 kworker/u256:0
8 root 20 0 0 0 0 S 0.0 0.0 2:38.05 rcu_sched

The top command includes many of the metrics we checked earlier.
It can be handy to run it to see if anything looks wildly different from the earlier commands, which would indicate that load is variable.

A downside to top is that it is harder to see patterns over time, which may be more clear in tools like vmstat and pidstat, which provide rolling output.

Evidence of intermittent issues can also be lost if you don’t pause the output quick enough (Ctrl-S to pause, Ctrl-Q to continue), and the screen clears.

Follow-on Analysis

There are many more commands and methodologies you can apply to drill deeper.
See Brendan’s Linux Performance Tools tutorial from Velocity 2015, which works through over 40 commands, covering observability, benchmarking, tuning, static performance tuning, profiling, and tracing.

Tackling system reliability and performance problems at web scale is one of our passions.

If you would like to join us in tackling these kinds of challenges we are hiring!
Posted by Brendan Gregg
