Every time an ANR occurs, Android writes a traces.txt file under /data/anr/. The file records virtual machine information and the thread stack traces of the offending process. From it we can work out what each thread was doing at the time and, from there, the cause of the ANR. The generation of this file is closely tied to the Signal Catcher thread: every child process forked from zygote has one. Running "ps -t <pid>" in a shell lists all threads of the process with the given pid, as shown below:
USER PID PPID VSIZE RSS WCHAN PC NAME
system 2953 2646 2784184 223904 SyS_epoll_ 7f92d20520 S system_server
system 2958 2953 2784184 223904 do_sigtime 7f92d20700 S Signal Catcher
system 2960 2953 2784184 223904 futex_wait 7f92cd3f20 S ReferenceQueueD
system 2961 2953 2784184 223904 futex_wait 7f92cd3f20 S FinalizerDaemon
system 2962 2953 2784184 223904 futex_wait 7f92cd3f20 S FinalizerWatchd
system 2963 2953 2784184 223904 futex_wait 7f92cd3f20 S HeapTaskDaemon
system 2970 2953 2784184 223904 binder_thr 7f92d20610 S Binder_1
system 2972 2953 2784184 223904 binder_thr 7f92d20610 S Binder_2
system 2985 2953 2784184 223904 SyS_epoll_ 7f92d20520 S android.bg
system 2986 2953 2784184 223904 SyS_epoll_ 7f92d20520 S ActivityManager
system 2987 2953 2784184 223904 SyS_epoll_ 7f92d20520 S android.ui
system 2988 2953 2784184 223904 SyS_epoll_ 7f92d20520 S android.fg
system 2989 2953 2784184 223904 inotify_re 7f92d20fe8 S FileObserver
system 2990 2953 2784184 223904 SyS_epoll_ 7f92d20520 S android.io
system 2991 2953 2784184 223904 SyS_epoll_ 7f92d20520 S android.display
system 2992 2953 2784184 223904 futex_wait 7f92cd3f20 S CpuTracker
system 2993 2953 2784184 223904 SyS_epoll_ 7f92d20520 S PowerManagerSer
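The listing above shows the threads of system_server; thread 2958 is the "Signal Catcher" thread. A signal here is a notification that the kernel delivers to a process when something happens to it, and the Signal Catcher thread is what handles such signals in user space.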
A Linux software interrupt signal (signal for short) is how the system notifies a process that an asynchronous event has occurred. It is a software-level emulation of the interrupt mechanism: in principle, a process receiving a signal is much like a processor receiving an interrupt request. Signals are the only asynchronous mechanism among the inter-process communication mechanisms. A process does not have to do anything to wait for a signal, and in fact it does not know when one will arrive. Processes can send signals to each other through the kill system call, and the kernel can also send a signal to a process because of an internal event. Besides basic notification, a signal can carry additional information. In short, signals are a means of inter-process communication on Linux.

Linux gives every signal a default action. If you do not care about a signal, the default behavior is fine. If you do care about one, for example the segmentation fault signal SIGSEGV (typically sent by the system to the offending process on a null pointer dereference or an out-of-bounds memory access), you have to write your own signal handler to override the default. This is an important debugging tool: a segmentation fault is unpredictable and can happen anywhere, so application code cannot handle it as an ordinary exception, and locating the problem has to rely on signals. The application does not know when a segmentation fault happens, but the kernel does. When the kernel finds that the application has accessed an illegal address, it sends SIGSEGV to the process. When the process returns from kernel space to user space it checks for pending signals, and if a user-defined handler is installed, that handler is called. The handler can then do a lot of useful work, such as dumping the stack of the current process or collecting global system information (memory, I/O, CPU), all of which is very valuable for analysis.
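As a rough illustration of overriding a default action, here is a minimal sketch that installs a handler with sigaction. It uses SIGUSR1 instead of SIGSEGV so that it can run safely, and the names RecordSignal, InstallRecordingHandler and LastSignal are made up for this example; none of this is ART code.

```cpp
#include <cassert>
#include <signal.h>

// Written from the handler. volatile sig_atomic_t is a type that may be
// safely written from an asynchronous signal handler.
static volatile sig_atomic_t g_last_signal = 0;

// Hypothetical handler: it only records which signal arrived.
extern "C" void RecordSignal(int signal_number) {
  g_last_signal = signal_number;
}

// Replaces the default action of signal_number with RecordSignal.
// Returns true if the handler was installed.
bool InstallRecordingHandler(int signal_number) {
  struct sigaction action = {};
  action.sa_handler = RecordSignal;
  sigemptyset(&action.sa_mask);  // block no extra signals while handling
  action.sa_flags = 0;
  return sigaction(signal_number, &action, nullptr) == 0;
}

// Returns the number of the last signal seen by RecordSignal, or 0.
int LastSignal() { return g_last_signal; }
```

After InstallRecordingHandler(SIGUSR1), a raise(SIGUSR1) runs the handler instead of terminating the process. A real crash handler would do its dumping at this point, restricted to async-signal-safe calls.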
Back to the topic. The Signal Catcher thread is created by the Android Runtime. When a new application process is started, system_server talks to zygote over a socket, and zygote creates the child process. On Linux a process is normally created with fork, and Android is no exception. Once the child of zygote is up, it has a main thread by default. That main thread calls DidForkFromZygote@Runtime.cc, which in turn calls StartSignalCatcher@Runtime.cc. That function creates a new SignalCatcher object, and this is where the Signal Catcher thread comes from.
void Runtime::StartSignalCatcher() {
if (!is_zygote_) {
signal_catcher_ = new SignalCatcher(stack_trace_file_);
}
}
The SignalCatcher constructor calls pthread_create to create a conventional Linux thread. Android is, after all, a Linux-based system, and ART reuses the Linux notion of a thread directly: the Linux threading mechanism has been developed for a long time and is mature, so ART has no need to reinvent the wheel and implement its own threading in user space. pthread_create is the thread creation function on Unix-like operating systems (Unix, Linux, Mac OS X and so on). Its prototype is:
int pthread_create(pthread_t *tidp, const pthread_attr_t *attr, void *(*start_rtn)(void *), void *arg);
tidp points to the location that receives the thread identifier, attr sets the thread attributes, start_rtn is the start address of the function the thread runs, and arg is the argument passed to start_rtn. The SignalCatcher constructor calls it with this statement:
CHECK_PTHREAD_CALL(pthread_create, (&pthread_, nullptr,&Run, this), "signal catcher thread");
CHECK_PTHREAD_CALL is a macro that ends up calling pthread_create to start a new Linux thread. Judging from the arguments, the new thread executes Run@SignalCatcher.cc, and the this pointer, that is, the SignalCatcher object being constructed, is passed to Run as its argument. Here is the implementation of Run:
void* SignalCatcher::Run(void* arg) {
......
Runtime* runtime = Runtime::Current();
// Attach the Linux thread to the runtime so that it can call JNI functions.
CHECK(runtime->AttachCurrentThread("Signal Catcher", true, runtime->GetSystemThreadGroup(),
......));
Thread* self = Thread::Current();
......
// Set up mask with signals we want to handle.
SignalSet signals;
signals.Add(SIGQUIT); // listen for SIGQUIT
signals.Add(SIGUSR1);
while (true) {
int signal_number = signal_catcher->WaitForSignal(self, signals); // wait for a signal to be delivered to the process
if (signal_catcher->ShouldHalt()) {
runtime->DetachCurrentThread();
return nullptr;
}
switch (signal_number) {
case SIGQUIT:
signal_catcher->HandleSigQuit(); // handle SIGQUIT
break;
......
default:
LOG(ERROR) << "Unexpected signal %d" << signal_number;
break;
}
}
}
This function first calls runtime->AttachCurrentThread to attach the current thread, then sets up the set of signals to handle, and finally enters an infinite loop that waits for a signal to arrive. When a signal is delivered to the virtual machine process, the matching handling routine runs. This article only looks at the handling of SIGQUIT. The sections below go through these steps one by one.
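Before going through the steps, the thread-creation pattern used by the constructor can be sketched in isolation: a static start routine is handed to pthread_create and receives the owning object through the void* argument. The Worker class and its members are hypothetical stand-ins written for this example, not the SignalCatcher implementation.

```cpp
#include <cassert>
#include <pthread.h>

// Hypothetical stand-in for a class that owns a worker thread.
class Worker {
 public:
  Worker() : ran_(false) {
    // Same shape as the call in the article: default attributes, a static
    // start routine, and `this` as the routine's argument.
    int rc = pthread_create(&pthread_, nullptr, &Worker::Run, this);
    assert(rc == 0);
    (void)rc;
  }

  // Waits for the thread to finish and reports whether Run() executed.
  bool JoinAndCheck() {
    pthread_join(pthread_, nullptr);
    return ran_;
  }

 private:
  // Start routine: recovers the object from the void* argument.
  static void* Run(void* arg) {
    Worker* self = static_cast<Worker*>(arg);
    self->ran_ = true;
    return nullptr;
  }

  pthread_t pthread_;
  bool ran_;
};
```

Constructing a Worker starts the thread, and JoinAndCheck() returns true once Run has executed on it. The start routine has to be static (or a free function) because pthread_create only accepts a plain function pointer, which is why the object travels through the argument.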
- AttachCurrentThread
This step is done by calling Runtime's AttachCurrentThread function, and Runtime simply calls the Attach function of the Thread class. The Thread class may look as if it creates a thread, but it does not: on Android a thread can only be created with pthread_create, and Thread is just a class inside the Android Runtime. Once a Thread object is created it is stored in the thread's TLS area, so each attached Linux thread corresponds to one Thread object, which can be obtained through Thread::Current(). The Thread object gives access to important information such as the Java thread state of the current thread, its Java stack frames and the table of JNI function pointers. We say "Java thread state" and "Java stack frames" because the Android runtime has no threading mechanism of its own. Every Java thread is a Linux thread underneath, but a Linux thread has no states such as Waiting or Blocked and no Java stack, so the Java thread state and the Java stack frames have to be stored somewhere or they would be lost. The Thread object is the ideal "locker" for them. This is covered again below when the creation of the Thread object is described.
bool Runtime::AttachCurrentThread(const char* thread_name, bool as_daemon, jobject thread_group, bool create_peer) {
return Thread::Attach(thread_name, as_daemon, thread_group, create_peer) != nullptr;
}
Thread* Thread::Attach(const char* thread_name, bool as_daemon, jobject thread_group,bool create_peer) {
Runtime* runtime = Runtime::Current();
......
Thread* self;
{
MutexLock mu(nullptr, *Locks::runtime_shutdown_lock_);
if (runtime->IsShuttingDownLocked()) {
......
} else {
Runtime::Current()->StartThreadBirth();
self = new Thread(as_daemon); // create a new Thread object
bool init_success = self->Init(runtime->GetThreadList(), runtime->GetJavaVM()); // run Init
Runtime::Current()->EndThreadBirth();
if (!init_success) {
delete self;
return nullptr;
}
}
}
......
self->InitStringEntryPoints();
CHECK_NE(self->GetState(), kRunnable);
self->SetState(kNative);
......
return self;
}
Thread::Attach first creates a new Thread object, then runs the object's Init step, and finally calls self->SetState(kNative) to set the Java thread state to kNative. Let us look at Thread::SetState first, since it is the simpler one; it sets the Java thread state.
inline ThreadState Thread::SetState(ThreadState new_state) {
// Cannot use this code to change into Runnable as changing to Runnable should fail if
// old_state_and_flags.suspend_request is true.
DCHECK_NE(new_state, kRunnable);
if (kIsDebugBuild && this != Thread::Current()) {
std::string name;
GetThreadName(name);
LOG(FATAL) << "Thread \"" << name << "\"(" << this << " != Thread::Current()="
<< Thread::Current() << ") changing state to " << new_state;
}
union StateAndFlags old_state_and_flags;
old_state_and_flags.as_int = tls32_.state_and_flags.as_int;
tls32_.state_and_flags.as_struct.state = new_state;
return static_cast<ThreadState>(old_state_and_flags.as_struct.state);
}
The Java thread state is stored in the Thread object, more precisely in its tls32_ structure, and the current state is set by modifying that structure. The Java thread states that ART currently supports are listed below; the comment after each state gives a rough idea of when a transition into it happens.
enum ThreadState {
// Thread.State JDWP state
kTerminated = 66, // TERMINATED TS_ZOMBIE Thread.run has returned, but Thread* still around
kRunnable, // RUNNABLE TS_RUNNING runnable
kTimedWaiting, // TIMED_WAITING TS_WAIT in Object.wait() with a timeout
kSleeping, // TIMED_WAITING TS_SLEEPING in Thread.sleep()
kBlocked, // BLOCKED TS_MONITOR blocked on a monitor
kWaiting, // WAITING TS_WAIT in Object.wait()
kWaitingForGcToComplete, // WAITING TS_WAIT blocked waiting for GC
kWaitingForCheckPointsToRun, // WAITING TS_WAIT GC waiting for checkpoints to run
kWaitingPerformingGc, // WAITING TS_WAIT performing GC
kWaitingForDebuggerSend, // WAITING TS_WAIT blocked waiting for events to be sent
kWaitingForDebuggerToAttach, // WAITING TS_WAIT blocked waiting for debugger to attach
kWaitingInMainDebuggerLoop, // WAITING TS_WAIT blocking/reading/processing debugger events
kWaitingForDebuggerSuspension, // WAITING TS_WAIT waiting for debugger suspend all
kWaitingForJniOnLoad, // WAITING TS_WAIT waiting for execution of dlopen and JNI on load code
kWaitingForSignalCatcherOutput, // WAITING TS_WAIT waiting for signal catcher IO to complete
kWaitingInMainSignalCatcherLoop, // WAITING TS_WAIT blocking/reading/processing signals
kWaitingForDeoptimization, // WAITING TS_WAIT waiting for deoptimization suspend all
kWaitingForMethodTracingStart, // WAITING TS_WAIT waiting for method tracing to start
kWaitingForVisitObjects, // WAITING TS_WAIT waiting for visiting objects
kWaitingForGetObjectsAllocated, // WAITING TS_WAIT waiting for getting the number of allocated objects
kStarting, // NEW TS_WAIT native thread started, not yet ready to run managed code
kNative, // RUNNABLE TS_RUNNING running in a JNI native method
kSuspended, // RUNNABLE TS_RUNNING suspended by GC or debugger
};
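SetState replaces only the state half of a combined state-and-flags word and returns the previous state. The sketch below shows that idea with plain bit operations. The layout (flags in the low 16 bits, state in the high 16 bits) and all names and enum values are assumptions made for illustration; they are not taken from ART.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative values only; the real states are the ones listed above.
enum SketchState : uint16_t { kSketchRunnable = 1, kSketchNative = 2 };
enum SketchFlag : uint16_t { kSketchSuspendRequest = 1, kSketchCheckpointRequest = 2 };

// Assumed layout: flags in the low 16 bits, state in the high 16 bits.
struct PackedStateAndFlags {
  uint32_t word = 0;

  uint16_t flags() const { return static_cast<uint16_t>(word & 0xFFFFu); }
  SketchState state() const { return static_cast<SketchState>(word >> 16); }

  // Replaces the state bits, leaves the flag bits untouched, and returns
  // the previous state.
  SketchState SetState(SketchState new_state) {
    SketchState old_state = state();
    word = (static_cast<uint32_t>(new_state) << 16) | (word & 0xFFFFu);
    return old_state;
  }

  // Sets one flag bit without disturbing the state bits.
  void SetFlag(SketchFlag flag) { word |= flag; }
};
```

Keeping both halves in one word is what allows a single read to observe a consistent pair of state and pending requests, which is what the suspend checks described later depend on.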
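In Attach, the part to focus on is Init. Before analysing Init in detail, we need a rough idea of how ART executes code. An important change in ART relative to Dalvik is that it no longer executes bytecode directly but first translates it into native machine code. This is done by running the dex2oat process when an application is installed, which produces an oat file, normally stored under /data/app/<application name>/oat/. The oat file contains the compiled machine code. Compilation here just means translating the methods of the Java classes in the dex file into native machine code. At run time, instead of interpreting the bytecode, the runtime finds the corresponding machine code and executes it directly, which is more efficient.

This machine code cannot stand alone. Some operations need the ART runtime, for example allocating an object on the heap or invoking a JNI method, so the compiled code references functions in the ART runtime, much as an so library references external functions when we build it. An oat file, like an so file, is in fact an ELF file; it just has a few more sections than a standard ELF file. Loading an oat file therefore requires these external functions to be resolved. A standard so file is usually opened with dlopen, which automatically loads any so libraries that are not yet loaded and relocates the external functions. An oat file is opened differently: to load oat files quickly, ART keeps a set of function pointers in the thread's TLS area, and the compiled machine code reaches the ART runtime by calling through these pointers. They are initialised during Thread's Init step.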
void Thread::InitTlsEntryPoints() {
// Insert a placeholder so we can easily tell if we call an unimplemented entry point.
uintptr_t* begin = reinterpret_cast<uintptr_t*>(&tlsPtr_.interpreter_entrypoints);
uintptr_t* end = reinterpret_cast<uintptr_t*>(reinterpret_cast<uint8_t*>(&tlsPtr_.quick_entrypoints) +
sizeof(tlsPtr_.quick_entrypoints));
for (uintptr_t* it = begin; it != end; ++it) {
*it = reinterpret_cast<uintptr_t>(UnimplementedEntryPoint);
}
InitEntryPoints(&tlsPtr_.interpreter_entrypoints, &tlsPtr_.jni_entrypoints,
&tlsPtr_.quick_entrypoints);
}
These function pointers are stored in the Thread object, and the Thread object is stored in the thread's TLS area, so native machine code can reach the TLS area and fetch the pointers. Only after Attach has run is a Linux thread really associated with the virtual machine runtime: it turns into a Java thread and gets its own Java thread state, Java stack frames and other data structures. A purely native thread cannot execute Java code. This is why, when the stacks of a process are dumped later, some threads have no Java stack and show only native and kernel stacks.
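The relationship between an OS thread, its per-thread runtime object and the function pointers stored in it can be sketched with C++ thread_local storage. Everything here (SketchThread, SketchAttach, SketchCurrent, the single allocate_entrypoint slot) is hypothetical and far simpler than ART; it only shows that an unattached thread has no such object while an attached one can reach its entry points through TLS.

```cpp
#include <cassert>
#include <thread>

// Hypothetical per-thread runtime object holding a table of entry points.
struct SketchThread {
  int (*allocate_entrypoint)(int size);  // stands in for one runtime helper
  int id;
};

// One pointer per OS thread; null until the thread attaches.
thread_local SketchThread* g_current_thread = nullptr;

// Placeholder runtime helper reached through the entry point table.
int SketchAllocate(int size) { return size * 2; }

// Creates the per-thread object and publishes it through thread-local storage.
SketchThread* SketchAttach(int id) {
  g_current_thread = new SketchThread{&SketchAllocate, id};
  return g_current_thread;
}

// Destroys the per-thread object of the calling thread.
void SketchDetach() {
  delete g_current_thread;
  g_current_thread = nullptr;
}

// Returns the calling thread's object, or null if it never attached.
SketchThread* SketchCurrent() { return g_current_thread; }

// Starts a fresh OS thread and reports whether it began life unattached.
bool NewThreadStartsUnattached() {
  bool unattached = false;
  std::thread t([&unattached] { unattached = (SketchCurrent() == nullptr); });
  t.join();
  return unattached;
}
```

A thread that has called SketchAttach can call SketchCurrent()->allocate_entrypoint(...) without any symbol lookup, while a thread that never attached sees a null pointer, which corresponds to the purely native threads mentioned above.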
- Installing the signal handling
As discussed above, a process that wants to handle a signal itself has to add signal handling to its code. ART wraps this in a SignalSet class, which underneath still uses the standard Linux interfaces sigaddset, sigemptyset and sigwait. Calling signals.Add(SIGQUIT); signals.Add(SIGUSR1); puts SIGQUIT and SIGUSR1 into the set that this thread handles itself. Strictly speaking no asynchronous handler function is registered here; the set is what the thread later waits on. After the set is prepared comes an infinite loop that calls sigwait to wait for a signal.
while (true) {
int signal_number = signal_catcher->WaitForSignal(self, signals);
if (signal_catcher->ShouldHalt()) {
runtime->DetachCurrentThread();
return nullptr;
}
switch (signal_number) {
case SIGQUIT:
signal_catcher->HandleSigQuit();
break;
case SIGUSR1:
signal_catcher->HandleSigUsr1();
break;
default:
LOG(ERROR) << "Unexpected signal %d" << signal_number;
break;
}
}
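The same pattern can be reproduced with the plain POSIX calls that SignalSet is said to wrap. In this minimal sketch the signal is blocked first so that it stays pending instead of triggering its default action, and a dedicated thread then collects it with sigwait. WaiterState, WaitForOneSignal and CatchSigusr1OnDedicatedThread are names invented for the example, and the thread waits for a single signal rather than looping.

```cpp
#include <cassert>
#include <pthread.h>
#include <signal.h>

// Hypothetical state shared with the waiting thread.
struct WaiterState {
  sigset_t signals;
  int received = 0;
};

// Thread body: blocks in sigwait until one signal from the set is pending.
static void* WaitForOneSignal(void* arg) {
  WaiterState* state = static_cast<WaiterState*>(arg);
  int signal_number = 0;
  if (sigwait(&state->signals, &signal_number) == 0) {
    state->received = signal_number;
  }
  return nullptr;
}

// Blocks SIGUSR1, starts a waiter thread (which inherits the mask), sends
// SIGUSR1 to that thread and returns the signal number the waiter saw.
int CatchSigusr1OnDedicatedThread() {
  WaiterState state;
  sigemptyset(&state.signals);
  sigaddset(&state.signals, SIGUSR1);

  // The signal must be blocked, otherwise its default action would run
  // instead of the signal being handed to sigwait.
  sigset_t old_mask;
  pthread_sigmask(SIG_BLOCK, &state.signals, &old_mask);

  pthread_t waiter;
  if (pthread_create(&waiter, nullptr, &WaitForOneSignal, &state) != 0) {
    pthread_sigmask(SIG_SETMASK, &old_mask, nullptr);
    return -1;
  }
  pthread_kill(waiter, SIGUSR1);  // stays pending until sigwait picks it up
  pthread_join(waiter, nullptr);
  pthread_sigmask(SIG_SETMASK, &old_mask, nullptr);
  return state.received;
}
```

Because the handling happens on an ordinary thread rather than inside an asynchronous handler, the code that runs after sigwait returns is free to take locks, allocate memory and write files, which a stack dump needs.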
- Handling SIGQUIT
When an ANR occurs, system_server runs dumpStackTraces, which sends SIGQUIT to the processes concerned in order to collect their runtime information. This information ends up in /data/anr/traces.txt.
public static File dumpStackTraces(boolean clearTraces, ArrayList<Integer> firstPids,
ProcessCpuTracker processCpuTracker, SparseArray<Boolean> lastPids, String[] nativeProcs) {
String tracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null);
if (tracesPath == null || tracesPath.length() == 0) {
return null;
}
File tracesFile = new File(tracesPath);
try {
File tracesDir = tracesFile.getParentFile();
if (!tracesDir.exists()) {
tracesDir.mkdirs();
if (!SELinux.restorecon(tracesDir)) {
return null;
}
}
FileUtils.setPermissions(tracesDir.getPath(), 0775, -1, -1); // drwxrwxr-x
if (clearTraces && tracesFile.exists()) tracesFile.delete();
tracesFile.createNewFile();
FileUtils.setPermissions(tracesFile.getPath(), 0666, -1, -1); // -rw-rw-rw-
} catch (IOException e) {
Slog.w(TAG, "Unable to prepare ANR traces file: " + tracesPath, e);
return null;
}
dumpStackTraces(tracesPath, firstPids, processCpuTracker, lastPids, nativeProcs);
return tracesFile;
}
When a process receives SIGQUIT, the call signal_catcher->WaitForSignal(self, signals); in its Signal Catcher thread returns, and HandleSigQuit@SignalCatcher.cc is then called to handle the signal.
void SignalCatcher::HandleSigQuit() {
Runtime* runtime = Runtime::Current();
std::ostringstream os;
......
DumpCmdLine(os);
......
runtime->DumpForSigQuit(os);
......
}
......
Output(os.str());
}
The job of the Signal Catcher thread here is to print the stacks of the current process (Java, native, kernel) together with some state of the virtual machine; this is the content we see in traces.txt. HandleSigQuit first creates a string output stream (std::ostringstream) and writes all the information into it, in other words keeps it in memory. When the dump is finished, it calls Output to save the contents of the stream to the file.
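The buffer-then-write structure described here can be sketched as follows. The two dump steps and their output strings are placeholders invented for the example, and the sink is any std::ostream so that the sketch does not depend on a particular file path.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical dump steps; each appends one section to the stream it is given.
void DumpCommandLineSketch(std::ostream& os) { os << "Cmd line: sketch\n"; }
void DumpThreadsSketch(std::ostream& os) { os << "\"main\" prio=5\n"; }

// Collects every section in memory first, then hands the finished text to
// the sink in a single write, so the sink never sees a half-built report.
std::string HandleDumpRequest(std::ostream& sink) {
  std::ostringstream os;
  DumpCommandLineSketch(os);
  DumpThreadsSketch(os);
  const std::string report = os.str();
  sink << report;
  sink.flush();
  return report;
}
```

Passing an std::ofstream as the sink gives the file-writing variant. Building the report in memory first keeps the slow file I/O out of the part of the dump that needs the other threads to cooperate.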
void Runtime::DumpForSigQuit(std::ostream& os) {
GetClassLinker()->DumpForSigQuit(os); // loaded and initialized classes, methods, etc.
GetInternTable()->DumpForSigQuit(os);
GetJavaVM()->DumpForSigQuit(os);
GetHeap()->DumpForSigQuit(os); // GC information
TrackedAllocators::Dump(os); // object allocation information
os << "\n";
thread_list_->DumpForSigQuit(os); // thread stacks
BaseMutex::DumpAll(os);
}
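Runtime::DumpForSigQuit shows roughly which runtime information is dumped. Which information is read is not the important part. What matters is when it is read, that is, under which conditions the dump is guaranteed to capture what we actually need. GC information, the number of allocated objects, thread stacks and so on generally require all threads of the process to be suspended, so the suspend procedure is what we analyse next. SuspendAll is implemented in Thread_list.cc. It suspends all other threads in the current process, and it is typically used during GC, DumpForSigQuit and similar operations.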
void ThreadList::SuspendAll(const char* cause, bool long_suspend) {
Thread* self = Thread::Current();
......
++suspend_all_count_;
// Increment everybody's suspend count (except our own).
for (const auto& thread : list_) {
if (thread == self) {
continue;
}
VLOG(threads) << "requesting thread suspend: " << *thread;
thread->ModifySuspendCount(self, +1, false);
......
}
}
The implementation of SuspendAll is actually very simple. The key statement is thread->ModifySuspendCount(self, +1, false);, which changes the suspend count of the corresponding Thread object. The core code is:
void Thread::ModifySuspendCount(Thread* self, int delta, bool for_debugger) {
......
tls32_.suspend_count += delta;
......
if (tls32_.suspend_count == 0) {
AtomicClearFlag(kSuspendRequest);
} else {
AtomicSetFlag(kSuspendRequest);
TriggerSuspend();
}
}
Since the delta passed in is +1, execution takes the else branch. It first sets the kSuspendRequest flag with an atomic operation, which marks this Thread object as having a pending suspend request. So when does a thread get to check this flag? The check is done by CheckSuspend, which is called from several places in the runtime. Let us look at two of them:
static void GoToRunnable(Thread* self) NO_THREAD_SAFETY_ANALYSIS {
ArtMethod* native_method = *self->GetManagedStack()->GetTopQuickFrame();
bool is_fast = native_method->IsFastNative();
if (!is_fast) {
self->TransitionFromSuspendedToRunnable();
} else if (UNLIKELY(self->TestAllFlags())) {
// In fast JNI mode we never transitioned out of runnable. Perform a suspend check if there
// is a flag raised.
DCHECK(Locks::mutator_lock_->IsSharedHeld(self));
self->CheckSuspend();
}
}
extern "C" void artTestSuspendFromCode(Thread* self) SHARED_LOCKS_REQUIRED(Locks::mutator_lock_) {
// Called when suspend count check value is 0 and thread->suspend_count_ != 0
ScopedQuickEntrypointChecks sqec(self);
self->CheckSuspend();
}
GoToRunnable is called when a thread switches to the Runnable state, and artTestSuspendFromCode, as mentioned earlier, is provided for compiled native code to call. Both call Thread::CheckSuspend, so once kSuspendRequest is set on a thread's Thread object, that thread can almost always be brought to a stop. The exception is a thread that is blocked for some reason while it happens to hold the Locks::mutator_lock_ reader-writer lock: the thread calling SuspendAll then blocks on that lock and the suspend eventually times out, as the following code from SuspendAll shows:
void ThreadList::SuspendAll(const char* cause, bool long_suspend) {
......
#if HAVE_TIMED_RWLOCK
while (true) {
if (Locks::mutator_lock_->ExclusiveLockWithTimeout(self, kThreadSuspendTimeoutMs, 0)) {
break;
} else if (!long_suspend_) {
......
UnsafeLogFatalForThreadSuspendAllTimeout();
}
}
#else
Locks::mutator_lock_->ExclusiveLock(self);
#endif
......
}
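The timeout case can be reproduced in miniature with a standard reader-writer lock. In this sketch one thread keeps holding the shared side, standing in for a blocked thread that owns the mutator lock, while the caller asks for exclusive ownership with a timeout. std::shared_timed_mutex and both function names are choices made for this example and are not what ART uses.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <shared_mutex>
#include <thread>

// Tries to take the lock exclusively, giving up after timeout_ms.
// On success the lock is released again before returning.
bool TryExclusiveWithTimeout(std::shared_timed_mutex& lock, int timeout_ms) {
  if (!lock.try_lock_for(std::chrono::milliseconds(timeout_ms))) {
    return false;  // someone kept the lock for the whole timeout
  }
  lock.unlock();
  return true;
}

// Returns true if a stuck reader makes the exclusive request time out.
bool ExclusiveTimesOutWhileReaderHolds(int timeout_ms) {
  std::shared_timed_mutex lock;
  std::promise<void> reader_has_lock;
  std::promise<void> release_reader;
  std::future<void> release_signal = release_reader.get_future();

  // Stand-in for a thread that is blocked while holding the shared side.
  std::thread reader([&] {
    lock.lock_shared();
    reader_has_lock.set_value();
    release_signal.wait();
    lock.unlock_shared();
  });

  reader_has_lock.get_future().wait();
  bool timed_out = !TryExclusiveWithTimeout(lock, timeout_ms);
  release_reader.set_value();  // let the reader finish
  reader.join();
  return timed_out;
}
```

The exclusive request only succeeds once every shared holder has let go, so a single thread that blocks while holding the shared side is enough to make the whole suspend time out.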
Next we focus on Thread::CheckSuspend, which is where the current thread is actually suspended.
inline void Thread::CheckSuspend() {
DCHECK_EQ(Thread::Current(), this);
for (;;) {
if (ReadFlag(kCheckpointRequest)) {
RunCheckpointFunction();
} else if (ReadFlag(kSuspendRequest)) {
FullSuspendCheck();
} else {
break;
}
}
}
If the kCheckpointRequest flag is set, RunCheckpointFunction is executed; if the kSuspendRequest flag is set, FullSuspendCheck is executed. kCheckpointRequest is used for dumping thread stacks, and we will come back to it after SuspendAll has been covered. For now we continue with FullSuspendCheck:
void Thread::FullSuspendCheck() {
VLOG(threads) << this << " self-suspending";
ATRACE_BEGIN("Full suspend check");
// Make thread appear suspended to other threads, release mutator_lock_.
tls32_.suspended_at_suspend_check = true;
TransitionFromRunnableToSuspended(kSuspended);
// Transition back to runnable noting requests to suspend, re-acquire share on mutator_lock_.
TransitionFromSuspendedToRunnable();
tls32_.suspended_at_suspend_check = false;
ATRACE_END();
VLOG(threads) << this << " self-reviving";
}
After TransitionFromRunnableToSuspended is called, the current Java thread is in the kSuspended state. When it then calls TransitionFromSuspendedToRunnable to go from suspended back to Runnable, it blocks on a condition variable. These threads stay blocked until the thread that called SuspendAll goes on to call ResumeAll.
void ThreadList::ResumeAll() {
Thread* self = Thread::Current();
......
Locks::mutator_lock_->ExclusiveUnlock(self);
{
......
--suspend_all_count_;
// Decrement the suspend counts for all threads.
for (const auto& thread : list_) {
if (thread == self) {
continue;
}
thread->ModifySuspendCount(self, -1, false); // decrement the thread's suspend count
}
......
Thread::resume_cond_->Broadcast(self); // wake the threads waiting on this condition variable
}
......
}
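The whole cycle (raise a count, set a request flag, let the target park itself at a safe point, wake it on resume) can be sketched for a single worker with a mutex and a condition variable. SuspendSketch and CounterStaysPutWhileSuspended are hypothetical names, and the sketch leaves out the state transitions and the mutator lock that the real code also handles.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical miniature of the suspend protocol for a single worker.
class SuspendSketch {
 public:
  // Called by the worker at safe points (the role CheckSuspend plays).
  void CheckSuspend() {
    if (!suspend_request_.load(std::memory_order_acquire)) return;
    std::unique_lock<std::mutex> guard(mutex_);
    suspended_ = true;
    cond_.notify_all();  // tell the requester that we are parked
    cond_.wait(guard, [this] { return suspend_count_ == 0; });
    suspended_ = false;
  }

  // Called by the requester: raise the count, set the flag, wait for parking.
  void Suspend() {
    std::unique_lock<std::mutex> guard(mutex_);
    ++suspend_count_;
    suspend_request_.store(true, std::memory_order_release);
    cond_.wait(guard, [this] { return suspended_; });
  }

  // Called by the requester: drop the count and wake the parked worker.
  void Resume() {
    std::lock_guard<std::mutex> guard(mutex_);
    if (--suspend_count_ == 0) {
      suspend_request_.store(false, std::memory_order_release);
    }
    cond_.notify_all();
  }

 private:
  std::mutex mutex_;
  std::condition_variable cond_;
  std::atomic<bool> suspend_request_{false};
  int suspend_count_ = 0;
  bool suspended_ = false;
};

// Suspends a counting worker once and returns true if its counter did not
// move while it was parked.
bool CounterStaysPutWhileSuspended() {
  SuspendSketch sketch;
  std::atomic<long> counter{0};
  std::atomic<bool> stop{false};
  std::thread worker([&] {
    while (!stop.load()) {
      sketch.CheckSuspend();
      counter.fetch_add(1);
    }
  });
  sketch.Suspend();
  long before = counter.load();
  std::this_thread::sleep_for(std::chrono::milliseconds(20));
  long after = counter.load();
  stop.store(true);
  sketch.Resume();
  worker.join();
  return before == after;
}
```

The worker is never stopped from outside. It only parks when it reaches a CheckSuspend call, which is why a thread that does not reach such a point in time causes the suspend to time out.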
That completes the analysis of SuspendAll. As mentioned above, dumping a thread's stack is not triggered by the kSuspendRequest flag; the relevant flag is a different one, kCheckpointRequest. Next we look at the Dump function of Thread_list, which is called from Thread_list's DumpForSigQuit, that is, while the Signal Catcher thread is handling SIGQUIT.
void ThreadList::Dump(std::ostream& os) {
......
DumpCheckpoint checkpoint(&os);
size_t threads_running_checkpoint = RunCheckpoint(&checkpoint);
if (threads_running_checkpoint != 0) {
checkpoint.WaitForThreadsToRunThroughCheckpoint(threads_running_checkpoint);
}
}
This function first creates a DumpCheckpoint object, checkpoint, and passes it to RunCheckpoint, which returns the number of threads currently in the Runnable state. It then calls DumpCheckpoint's WaitForThreadsToRunThroughCheckpoint to wait until all of these Runnable threads have executed DumpCheckpoint's Run function. If the wait times out, an error is logged (FATAL in a debug build that is not already aborting, ERROR otherwise), as shown below:
void WaitForThreadsToRunThroughCheckpoint(size_t threads_running_checkpoint) {
Thread* self = Thread::Current();
ScopedThreadStateChange tsc(self, kWaitingForCheckPointsToRun);
bool timed_out = barrier_.Increment(self, threads_running_checkpoint, kDumpWaitTimeout);
if (timed_out) {
// Avoid a recursive abort.
LOG((kIsDebugBuild && (gAborting == 0)) ? FATAL : ERROR) << "Unexpected time out during dump checkpoint.";
}
}
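Waiting for a known number of threads with a timeout can be sketched with a counter guarded by a mutex and a condition variable. CountdownBarrierSketch is a simplified, hypothetical stand-in: it counts arrivals upward, which is not necessarily how the barrier_ object in the listing above keeps its count.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical count-down barrier: the waiter expects a number of Pass()
// calls and gives up after a timeout.
class CountdownBarrierSketch {
 public:
  // Called by each thread once it has finished its checkpoint work.
  void Pass() {
    std::lock_guard<std::mutex> guard(mutex_);
    ++passed_;
    cond_.notify_all();
  }

  // Waits until `expected` threads have passed. Returns true on timeout.
  bool WaitFor(int expected, std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> guard(mutex_);
    bool done = cond_.wait_for(guard, timeout,
                               [&] { return passed_ >= expected; });
    return !done;
  }

 private:
  std::mutex mutex_;
  std::condition_variable cond_;
  int passed_ = 0;
};
```

With two threads calling Pass(), WaitFor(2, ...) returns false (no timeout), while asking for a third arrival that never comes returns true once the timeout expires.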
Next we analyse RunCheckpoint. The function is fairly long, so we go through it in two parts.
size_t ThreadList::RunCheckpoint(Closure* checkpoint_function) {
......
for (const auto& thread : list_) {
if (thread != self) {
while (true) {
if (thread->RequestCheckpoint(checkpoint_function)) {
count++;
break;
} else {
if (thread->GetState() == kRunnable) {
continue;
}
thread->ModifySuspendCount(self, +1, false);
suspended_count_modified_threads.push_back(thread);
break;
}
}
}
......
return count;
}
For a thread in the Runnable state, RequestCheckpoint returns true; for a thread in any other state it returns false. A thread of the latter kind gets the kSuspendRequest flag set, just as in SuspendAll, so that if it later becomes Runnable it checks the flag first and enters the suspended state. RunCheckpoint also records these threads in the vector suspended_count_modified_threads, and for the threads in this vector the Signal Catcher thread triggers the stack dump itself. We will see that in the second part of RunCheckpoint; first we look at Thread::RequestCheckpoint.
bool Thread::RequestCheckpoint(Closure* function) {
......
if (old_state_and_flags.as_struct.state != kRunnable) { // return false right away if the thread is not Runnable
return false; // Fail, thread is suspended and so can't run a checkpoint.
}
uint32_t available_checkpoint = kMaxCheckpoints;
for (uint32_t i = 0 ; i < kMaxCheckpoints; ++i) {
if (tlsPtr_.checkpoint_functions[i] == nullptr) { // look for a free slot in the array
available_checkpoint = i;
break;
}
}
......
tlsPtr_.checkpoint_functions[available_checkpoint] = function; // store the closure in the slot
// Checkpoint function installed now install flag bit.
// We must be runnable to request a checkpoint.
DCHECK_EQ(old_state_and_flags.as_struct.state, kRunnable);
union StateAndFlags new_state_and_flags;
new_state_and_flags.as_int = old_state_and_flags.as_int;
new_state_and_flags.as_struct.flags |= kCheckpointRequest; // set the kCheckpointRequest flag
......
}
As seen in Thread::CheckSuspend, a thread with the kCheckpointRequest flag set executes RunCheckpointFunction. RunCheckpointFunction walks the checkpoint_functions array and calls the Run function of every non-null element.
void Thread::RunCheckpointFunction() {
......
for (uint32_t i = 0; i < kMaxCheckpoints; ++i) {
if (checkpoints[i] != nullptr) {
checkpoints[i]->Run(this);
found_checkpoint = true;
}
}
......
}
This is simply DumpCheckpoint's Run function, because the function argument of RequestCheckpoint(Closure* function) is a DumpCheckpoint object passed down from Thread_list's Dump function. Here is the implementation of DumpCheckpoint's Run:
void Run(Thread* thread) OVERRIDE {
Thread* self = Thread::Current();
std::ostringstream local_os;
{
ScopedObjectAccess soa(self);
thread->Dump(local_os); // call Thread's Dump function
}
......
}
After this long detour, what is finally called is Thread's Dump function. We will not go further into it; it is where the Java, native and kernel stacks of a thread are printed, and interested readers can analyse it on their own. As described above, threads in the Runnable state are asked through RequestCheckpoint and then dump their own stacks, whereas threads that are not Runnable are added to the vector suspended_count_modified_threads. We now turn to the second part of RunCheckpoint:
size_t ThreadList::RunCheckpoint(Closure* checkpoint_function) {
Thread* self = Thread::Current();
......
checkpoint_function->Run(self); // run DumpCheckpoint's Run directly, with the Signal Catcher thread's own Thread object as the argument
// Run the checkpoint on the suspended threads.
for (const auto& thread : suspended_count_modified_threads) {
.......
checkpoint_function->Run(thread); // run DumpCheckpoint's Run on behalf of the suspended thread
{
MutexLock mu2(self, *Locks::thread_suspend_count_lock_);
thread->ModifySuspendCount(self, -1, false); // decrement the suspend count again
}
}
......
}
Threads that are not in the Runnable state may never call Run on their own, so the Signal Catcher thread has to do the dump for them. DumpCheckpoint's Run does the same thing for them as for Runnable threads: it prints the thread's stack.
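The two paths can be summarised in a single-threaded simulation: runnable threads accept the closure and run it themselves later, while for every other thread the requester runs the closure on its behalf. ThreadSketch and RunCheckpointSketch are hypothetical and leave out locking, suspend counts and the retry loop of the real function.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical thread record: a name, a runnable bit and one closure slot.
struct ThreadSketch {
  std::string name;
  bool runnable;
  std::function<void(ThreadSketch&)> pending;  // empty when nothing is requested

  // Only a runnable thread accepts the request.
  bool RequestCheckpoint(const std::function<void(ThreadSketch&)>& fn) {
    if (!runnable) return false;
    pending = fn;
    return true;
  }

  // What a runnable thread does at its next suspend check.
  void RunPendingCheckpoint() {
    if (pending) {
      auto fn = pending;
      pending = nullptr;
      fn(*this);
    }
  }
};

// Returns the number of threads that will run the closure themselves; for
// every other thread the closure is run here, on its behalf.
std::size_t RunCheckpointSketch(std::vector<ThreadSketch>& threads,
                                const std::function<void(ThreadSketch&)>& fn) {
  std::size_t count = 0;
  std::vector<ThreadSketch*> not_runnable;
  for (ThreadSketch& thread : threads) {
    if (thread.RequestCheckpoint(fn)) {
      ++count;
    } else {
      not_runnable.push_back(&thread);
    }
  }
  for (ThreadSketch* thread : not_runnable) {
    fn(*thread);  // the requester dumps for the thread that cannot do it
  }
  return count;
}
```

The returned count is what the requester would then wait on, since only the runnable threads still have to pass through the checkpoint by themselves.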