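Background: why native crashes are hard to locate
For an Android app, a crash in the native layer is harder to capture and locate than one in the Java layer, because the code inside an .so is usually not visible. Crashes in third-party or system .so libraries are harder still, since the stack information is very sparse. The native crash example below shows this.
Even with full logging enabled, the output is not convenient to work with and does not lead directly to the cause: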
09-14 10:14:36.590 1361 1361 I /system/bin/tombstoned: received crash request for pid 5908
09-14 10:14:36.591 5944 5944 I crash_dump64: performing dump of process 5687 (target tid = 5908)
09-14 10:14:36.607 5944 5944 F DEBUG : *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
09-14 10:14:36.608 5944 5944 F DEBUG : Build fingerprint: 'Xiaomi/vangogh/vangogh:10/QKQ1.191222.002/V12.0.6.0.QJVCNXM:user/release-keys'
09-14 10:14:36.608 5944 5944 F DEBUG : Revision: '0'
09-14 10:14:36.608 5944 5944 F DEBUG : ABI: 'arm64'
09-14 10:14:36.608 5944 5944 F DEBUG : Timestamp: 2021-09-14 10:14:36+0800
09-14 10:14:36.608 5944 5944 F DEBUG : pid: 5687, tid: 5908, name: nioEventLoopGro >>> com.netxx.xaxxxn <<<
09-14 10:14:36.608 5944 5944 F DEBUG : uid: 10312
09-14 10:14:36.608 5944 5944 F DEBUG : signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x4
09-14 10:14:36.608 5944 5944 F DEBUG : Cause: null pointer dereference
09-14 10:14:36.608 5944 5944 F DEBUG : x0 0000000000000000 x1 0000000014d85fb0 x2 0000000015100bf8 x3 0000000000000000
09-14 10:14:36.608 5944 5944 F DEBUG : x4 0000000015100c18 x5 000000000000005a x6 0000000015100c30 x7 0000000000000018
09-14 10:14:36.608 5944 5944 F DEBUG : x8 0000000000000000 x9 20454cc47a8eade3 x10 00000000005c0000 x11 000000000000004b
09-14 10:14:36.608 5944 5944 F DEBUG : x12 000000000000001f x13 0000000000000000 x14 00000000a2018668 x15 0000000000000010
09-14 10:14:36.608 5944 5944 F DEBUG : x16 0000000000000000 x17 0000000000054402 x18 00000077328bc000 x19 00000077616e0c00
09-14 10:14:36.608 5944 5944 F DEBUG : x20 0000000000000001 x21 00000000151004a0 x22 0000000014d85fb0 x23 00000000a1f03180
09-14 10:14:36.608 5944 5944 F DEBUG : x24 0000000000000001 x25 0000000000000000 x26 0000000000000003 x27 00000000151000b8
09-14 10:14:36.608 5944 5944 F DEBUG : x28 0000000000000000 x29 00000000151009b0
09-14 10:14:36.608 5944 5944 F DEBUG : sp 000000773536e4f0 lr 000000779431b80c pc 0000007794240260
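As shown above, the log does report Cause: null pointer dereference, but nothing in it says clearly which code triggered it. A Java-layer crash comes with a clear stack trace; a native crash does not, which is what makes it so painful to locate.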
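The few useful facts in a log like the one above (crashing thread, signal, fault address, registers) can be pulled out mechanically. The following is a minimal sketch, not part of the original investigation: the function name summarize_crash_header, the regular expressions, and the 4 KiB page size used for the "near null" check are all assumptions made for illustration.

```python
import re

# Three lines copied from the log excerpt above.
SAMPLE_LOG = """\
09-14 10:14:36.608 5944 5944 F DEBUG : pid: 5687, tid: 5908, name: nioEventLoopGro >>> com.netxx.xaxxxn <<<
09-14 10:14:36.608 5944 5944 F DEBUG : signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x4
09-14 10:14:36.608 5944 5944 F DEBUG : x0 0000000000000000 x1 0000000014d85fb0 x2 0000000015100bf8 x3 0000000000000000
"""

THREAD_RE = re.compile(r"pid: (\d+), tid: (\d+), name: (.+?)\s+>>> (.+?) <<<")
SIGNAL_RE = re.compile(
    r"signal (\d+) \((\w+)\), code (-?\d+) \((\w+)\), fault addr (0x[0-9a-fA-F]+)"
)
REGISTER_RE = re.compile(r"\b(x\d+|sp|lr|pc)\s+([0-9a-f]{16})\b")

# Assumption: a 4 KiB page. A fault address inside the first page is treated
# as "a small offset from a null base pointer".
ASSUMED_PAGE_SIZE = 4096


def summarize_crash_header(log_text):
    """Collect thread, signal and register details from crash dump log lines."""
    summary = {"registers": {}}
    for line in log_text.splitlines():
        thread = THREAD_RE.search(line)
        if thread:
            summary["pid"] = int(thread.group(1))
            summary["tid"] = int(thread.group(2))
            summary["thread_name"] = thread.group(3)
            summary["process"] = thread.group(4)
        signal = SIGNAL_RE.search(line)
        if signal:
            summary["signal_name"] = signal.group(2)
            summary["code_name"] = signal.group(4)
            summary["fault_addr"] = int(signal.group(5), 16)
        # Register lines hold several "name value" pairs each.
        for name, value in REGISTER_RE.findall(line):
            summary["registers"][name] = int(value, 16)
    fault = summary.get("fault_addr")
    summary["near_null"] = fault is not None and fault < ASSUMED_PAGE_SIZE
    return summary


if __name__ == "__main__":
    info = summarize_crash_header(SAMPLE_LOG)
    print(info["thread_name"], info["signal_name"], hex(info["fault_addr"]))
    zero_regs = sorted(r for r, v in info["registers"].items() if v == 0)
    print("near-null fault:", info["near_null"], "zero registers:", zero_regs)
```

A small fault address together with a register holding zero is consistent with the reported null pointer dereference, but it still says nothing about which source line is responsible.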
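How to locate a native crash
For a crash, locating it precisely is half the battle. So how do we locate a native crash with tools? If the .so is one you implemented yourself, it will generally print relevant logs. This article focuses on the special cases, especially an .so for which no source code is available. In that scenario the most important step is still reproduction: match the device model and environment, and keep retrying until the production issue reproduces. Once the crash has occurred, there are traces to follow. This article uses an occasional production crash seen after an ARM64 upgrade as the example and walks through the process. After summarizing the reports, retrying, and reproducing the scenario, you can go looking for the relevant logs. A handy tool at this point is the bugreport command: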
$ adb bugreport ~/
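When an app crashes, the system saves a tombstone file to the /data/tombstones directory. The bugreport command exports recent crash-related information, including these files. The export is a zip archive, which you unzip to get at the tombstones.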
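Instead of unzipping by hand, the tombstones can be read straight out of the archive. This is a small sketch under stated assumptions: the helper names list_tombstones and read_tombstone are hypothetical, and the filter simply assumes that tombstone entries have "tombstone" somewhere in their path inside the zip, which may differ between devices and Android versions.

```python
import sys
import zipfile


def list_tombstones(zip_file):
    """Return archive member names that look like tombstone files.

    Assumption: tombstone entries contain "tombstone" in their path.
    """
    with zipfile.ZipFile(zip_file) as archive:
        return sorted(
            name
            for name in archive.namelist()
            if "tombstone" in name.lower() and not name.endswith("/")
        )


def read_tombstone(zip_file, member):
    """Return the text of one tombstone entry, tolerating odd bytes."""
    with zipfile.ZipFile(zip_file) as archive:
        return archive.read(member).decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Usage: python tombstones.py path/to/bugreport.zip
    path = sys.argv[1]
    for member in list_tombstones(path):
        text = read_tombstone(path, member)
        # Print the entry name and its first few lines for a quick look.
        print(member)
        print("\n".join(text.splitlines()[:5]))
```

Both helpers accept either a file path or an open binary file object, since zipfile.ZipFile supports both.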
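For each tombstone from a native crash, opening it shows a log like the DEBUG excerpt above, followed by a backtrace.
The lines at the top are the most important: they identify the thread in which the crash occurred and the call frame where it happened. This already goes a long way toward locating the problem, so the approach is bugreport plus tombstone.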
Problem analysis
The production crash after the ARM64 upgrade only occurred on Android 10. For this particular bug, the crash ultimately lands in:
arm64/base.odex (BakerReadBarrierThunkAcquire_r15_r0_2)
Cause: null pointer dereference
But this seems to have nothing to do with the frame below:
arm64/base.odex (com.netease.mail.profiler.handler.BaseHandler.stopTrace+360)
How does Java-layer code suddenly end up in arm64/base.odex (BakerReadBarrierThunk...)? Let's look at the complete backtrace:
backtrace:
#00 pc 00000000008ee260 /data/app/com.netease.yanxuan-YLeR3gwwgd3DyIUBNJZ8cA==/oat/arm64/base.odex (BakerReadBarrierThunkAcquire_r15_r0_2)
#01 pc 00000000009c9808 /data/app/com.netease.yanxuan-YLeR3gwwgd3DyIUBNJZ8cA==/oat/arm64/base.odex (com.netease.mail.profiler.handler.BaseHandler.stopTrace+360)
#02 pc 00000000009b3cc4 /data/app/com.netease.yanxuan-YLeR3gwwgd3DyIUBNJZ8cA==/oat/arm64/base.odex (com.netease.mail.profiler.handler.TailHandler$1.operationComplete+212)
#03 pc 00000000009b3b8c /data/app/com.netease.yanxuan-YLeR3gwwgd3DyIUBNJZ8cA==/oat/arm64/base.odex (com.netease.mail.android.wzp.util.Util$1.operationComplete [DEDUPED]+108)
#04 pc 0000000000b93180 /data/app/com.netease.yanxuan-YLeR3gwwgd3DyIUBNJZ8cA==/oat/arm64/base.odex (io.netty.util.concurrent.DefaultPromise.notifyListener0+80)
#05 pc 0000000000b9370c /data/app/com.netease.yanxuan-YLeR3gwwgd3DyIUBNJZ8cA==/oat/arm64/base.odex (io.netty.util.concurrent.DefaultPromise.notifyListeners+988)
#06 pc 0000000000b94e3c /data/app/com.netease.yanxuan-YLeR3gwwgd3DyIUBNJZ8cA==/oat/arm64/base.odex (io.netty.util.concurrent.DefaultPromise.trySuccess+92)
#07 pc 0000000000ba499c /data/app/com.netease.yanxuan-YLeR3gwwgd3DyIUBNJZ8cA==/oat/arm64/base.odex (io.netty.channel.DefaultChannelPromise.trySuccess+44)
#08 pc 0000000000b90ef4 /data/app/com.netease.yanxuan-YLeR3gwwgd3DyIUBNJZ8cA==/oat/arm64/base.odex (io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise+84)
#09 pc 0000000000b91850 /data/app/com.netease.yanxuan-YLeR3gwwgd3DyIUBNJZ8cA==/oat/arm64/base.odex (io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect+192)
#10 pc 0000000000bb390c /data/app/com.netease.yanxuan-YLeR3gwwgd3DyIUBNJZ8cA==/oat/arm64/base.odex (io.netty.channel.nio.NioEventLoop.processSelectedKey+444)
#11 pc 0000000000bb3bf8 /data/app/com.netease.yanxuan-YLeR3gwwgd3DyIUBNJZ8cA==/oat/arm64/base.odex (io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized+312)
#12 pc 0000000000bb55b8 /data/app/com.netease.yanxuan-YLeR3gwwgd3DyIUBNJZ8cA==/oat/arm64/base.odex (io.netty.channel.nio.NioEventLoop.run+824)
#13 pc 0000000000ae1580 /data/app/com.netease.yanxuan-YLeR3gwwgd3DyIUBNJZ8cA==/oat/arm64/base.odex (io.netty.util.concurrent.SingleThreadEventExecutor$2.run+128)
#14 pc 0000000000adf068 /data/app/com.netease.yanxuan-YLeR3gwwgd3DyIUBNJZ8cA==/oat/arm64/base.odex (io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run+72)
#15 pc 00000000004afbb8 /system/framework/arm64/boot.oat (java.lang.Thread.run+72) (BuildId: 65cd48ea51183eb3b4cdfeb64ca2b90a9de89ffe)
#16 pc 0000000000137334 /apex/com.android.runtime/lib64/libart.so (art_quick_invoke_stub+548) (BuildId: fc24b8afa1bd5f1872cc1a38bcfa1cdc)
#17 pc 0000000000145fec /apex/com.android.runtime/lib64/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+244) (BuildId: fc24b8afa1bd5f1872cc1a38bcfa1cdc)
#18 pc 00000000004b0d98 /apex/com.android.runtime/lib64/libart.so (art::(anonymous namespace)::InvokeWithArgArray(art::ScopedObjectAccessAlreadyRunnable const&, art::ArtMethod*, art::(anonymous namespace)::ArgArray*, art::JValue*, char const*)+104) (BuildId: fc24b8afa1bd5f1872cc1a38bcfa1cdc)
#19 pc 00000000004b1eac /apex/com.android.runtime/lib64/libart.so (art::InvokeVirtualOrInterfaceWithJValues(art::ScopedObjectAccessAlreadyRunnable const&, _jobject*, _jmethodID*, jvalue const*)+416) (BuildId: fc24b8afa1bd5f1872cc1a38bcfa1cdc)
#20 pc 00000000004f2868 /apex/com.android.runtime/lib64/libart.so (art::Thread::CreateCallback(void*)+1176) (BuildId: fc24b8afa1bd5f1872cc1a38bcfa1cdc)
#21 pc 00000000000e69e0 /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+36) (BuildId: 1eb18e444251dc07dff5ebd93fce105c)
#22 pc 0000000000084b6c /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) (BuildId: 1eb18e444251dc07dff5ebd93fce105c)
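A backtrace like the one above is easier to reason about once it is split into frame number, pc, module, and symbol. Here is a minimal sketch; the function names are hypothetical, and the rule used to skip the thunk (ignore symbols starting with "BakerReadBarrierThunk" and take the first remaining frame from an .odex or .oat file) is an assumption that fits this trace rather than a general rule.

```python
import re

# Four frames copied from the backtrace above.
SAMPLE_BACKTRACE = """\
#00 pc 00000000008ee260 /data/app/com.netease.yanxuan-YLeR3gwwgd3DyIUBNJZ8cA==/oat/arm64/base.odex (BakerReadBarrierThunkAcquire_r15_r0_2)
#01 pc 00000000009c9808 /data/app/com.netease.yanxuan-YLeR3gwwgd3DyIUBNJZ8cA==/oat/arm64/base.odex (com.netease.mail.profiler.handler.BaseHandler.stopTrace+360)
#15 pc 00000000004afbb8 /system/framework/arm64/boot.oat (java.lang.Thread.run+72) (BuildId: 65cd48ea51183eb3b4cdfeb64ca2b90a9de89ffe)
#21 pc 00000000000e69e0 /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+36) (BuildId: 1eb18e444251dc07dff5ebd93fce105c)
"""

FRAME_RE = re.compile(r"#(\d+)\s+pc\s+([0-9a-f]+)\s+(\S+)\s*(.*)$")
BUILD_ID_RE = re.compile(r"\s*\(BuildId: [0-9a-f]+\)\s*$")


def parse_frame(line):
    """Split one backtrace line into index, pc, module and symbol."""
    match = FRAME_RE.search(line)
    if not match:
        return None
    index, pc, module, rest = match.groups()
    # Drop the trailing "(BuildId: ...)" so only the symbol part remains.
    rest = BUILD_ID_RE.sub("", rest).strip()
    symbol = rest[1:-1] if rest.startswith("(") and rest.endswith(")") else ""
    return {"index": int(index), "pc": int(pc, 16), "module": module, "symbol": symbol}


def parse_backtrace(text):
    """Parse every line that looks like a frame; skip everything else."""
    frames = (parse_frame(line) for line in text.splitlines())
    return [frame for frame in frames if frame is not None]


def first_java_frame(frames, thunk_prefix="BakerReadBarrierThunk"):
    """Return the first .odex/.oat frame that is not a read barrier thunk.

    Assumption: thunk symbols start with `thunk_prefix`, as in this trace.
    """
    for frame in frames:
        if not frame["module"].endswith((".odex", ".oat")):
            continue
        if frame["symbol"].startswith(thunk_prefix):
            continue
        return frame
    return None


if __name__ == "__main__":
    frames = parse_backtrace(SAMPLE_BACKTRACE)
    culprit = first_java_frame(frames)
    print("top frame:", frames[0]["symbol"])
    print("first Java frame:", culprit["symbol"] if culprit else None)
```

Separating the thunk frame from the first Java frame makes it clear which application method was running when the crash occurred.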
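Reading upward from frame #22, this is a Java thread being run by ART: libc starts the thread, libart invokes Thread.run, and the remaining frames are code in boot.oat and base.odex. Almost every thread stack on Android has this shape. So we can conclude that the null pointer dereference was raised while executing BaseHandler.stopTrace. Why would that happen? The reports have something in common: the crash only occurs on ARM64 devices running Android 10. That gives good reason to suspect that Android 10's handling around BakerReadBarrierThunkAcquire_r15_r0_2 is flawed. Searching the source for the string BakerReadBarrierThunkAcquire_r15_r0_2 shows that this name is produced by the CompileBakerReadBarrierThunk function in code_generator_arm64.cc.
Comparing the Android 10 and Android 11 sources reveals one clear difference: Android 11 adds a null-check case before the field load.
The ART code itself is very hard to follow, so here is the commit message for that change, titled "Fix null checks on volatile reference field loads on ARM64.":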
Fix null checks on volatile reference field loads on ARM64.
ART's compiler adds a null check HIR instruction before each field
load HIR instruction created in the instruction builder phase. When
implicit null checks are allowed, the compiler elides the null check
if it can be turned into an implicit one (i.e. if the offset is within
a system page range).
On ARM64, the Baker read barrier thunk built for field reference loads
needs to check the lock word of the holder of the field, and thus
includes an explicit null check if no null check has been done before.
However, this was not done for volatile loads (implemented with a
load-acquire instruction on ARM64). This change adds this missing null
check.
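In other words: for a variable declared volatile (which maps to a load-acquire instruction), a null check is added to avoid a null pointer at runtime. Android 10 did not perform this null check, and the commit fixes that bug. Back in our business code, there was indeed a place that used multiple threads together with volatile, with some probability of a null pointer occurring. Fixing that logic is enough.
Summary
The key is to combine bugreport with the tombstone files to locate the problem. Once it is located, solving it becomes much easier.
References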
https://wufengxue.github.io/2020/06/22/wechat-voice-codec-SEGV_MAPERR.html (a useful reference on analysis tools)
https://developer.android.com/ndk/guides/ndk-stack
Author: 看書的小蝸牛
Original article: Android Native Crash troubleshooting approach