A Strange ANR
Today I ran into a very interesting ANR. The application reported an ANR:
08-29 14:12:59.564898 7904 8341 I WindowManager: ANR in Window{3b0709 u0 me.linjw.demo.anr}. Reason:3b0709 me.linjw.demo.anr (server) is not responding. Waited 5001ms for FocusEvent(hasFocus=false)
08-29 14:13:11.713363 7904 27946 E ActivityManager: ANR in me.linjw.demo.anr
Yet the trace file contained no stack traces at all:
Subject: Input dispatching timed out (3b0709 me.linjw.demo.anr (server) is not responding. Waited 5001ms for FocusEvent(hasFocus=false))
--- CriticalEventLog ---
capacity: 20
timestamp_ms: 1693311179660
window_ms: 300000
libdebuggerd_client: failed to read status response from tombstoned: timeout reached?
----- Waiting Channels: pid 26859 at 2023-08-29 14:12:59.664895544+0200 -----
Cmd line: me.linjw.demo.anr
sysTid=26859 do_freezer_trap
sysTid=26864 do_freezer_trap
sysTid=26865 do_freezer_trap
sysTid=26866 do_freezer_trap
sysTid=26867 do_freezer_trap
sysTid=26868 do_freezer_trap
sysTid=26869 do_freezer_trap
sysTid=26870 do_freezer_trap
sysTid=26871 do_freezer_trap
sysTid=26872 do_freezer_trap
sysTid=26873 do_freezer_trap
sysTid=26874 do_freezer_trap
sysTid=26875 do_freezer_trap
sysTid=26877 do_freezer_trap
sysTid=26879 do_freezer_trap
sysTid=26880 do_freezer_trap
sysTid=26882 do_freezer_trap
sysTid=26883 do_freezer_trap
sysTid=26887 do_freezer_trap
sysTid=26912 do_freezer_trap
sysTid=26918 do_freezer_trap
sysTid=26919 do_freezer_trap
sysTid=26922 do_freezer_trap
sysTid=26923 do_freezer_trap
sysTid=26938 do_freezer_trap
sysTid=27772 do_freezer_trap
sysTid=27815 do_freezer_trap
sysTid=27826 do_freezer_trap
sysTid=27827 do_freezer_trap
----- end 26859 -----
libdebuggerd_client: failed to read status response from tombstoned: Try again
----- Waiting Channels: pid 26859 at 2023-08-29 14:13:09.677383215+0200 -----
Cmd line: me.linjw.demo.anr
sysTid=26859 do_freezer_trap
sysTid=26864 do_freezer_trap
sysTid=26865 do_freezer_trap
sysTid=26866 do_freezer_trap
sysTid=26867 do_freezer_trap
sysTid=26868 do_freezer_trap
sysTid=26869 do_freezer_trap
sysTid=26870 do_freezer_trap
sysTid=26871 do_freezer_trap
sysTid=26872 do_freezer_trap
sysTid=26873 do_freezer_trap
sysTid=26874 do_freezer_trap
sysTid=26875 do_freezer_trap
sysTid=26877 do_freezer_trap
sysTid=26879 do_freezer_trap
sysTid=26880 do_freezer_trap
sysTid=26882 do_freezer_trap
sysTid=26883 do_freezer_trap
sysTid=26887 do_freezer_trap
sysTid=26912 do_freezer_trap
sysTid=26918 do_freezer_trap
sysTid=26919 do_freezer_trap
sysTid=26922 do_freezer_trap
sysTid=26923 do_freezer_trap
sysTid=26938 do_freezer_trap
sysTid=27772 do_freezer_trap
sysTid=27815 do_freezer_trap
sysTid=27826 do_freezer_trap
sysTid=27827 do_freezer_trap
----- end 26859 -----
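Filtering the log by the process pid shows that the process was running its task normally and was frozen (am_freeze) before the task had finished:
08-29 14:11:45.807967 26859 27815 V MessageEncoder: ... // normal output while the task is running
08-29 14:11:45.809835 26859 26859 D FloatView: ... // normal output while the task is running; the task had not finished, so more output should follow, but none does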
08-29 14:11:45.884625 7904 8331 D ActivityManager: freezing 26859 me.linjw.demo.anr
08-29 14:11:45.885503 7904 8331 I am_freeze: [26859,me.linjw.demo.anr]
08-29 14:12:59.660658 7904 27946 I am_anr : [0,26859,me.linjw.demo.anr,545832517,Input dispatching timed out (3b0709 me.linjw.demo.anr (server) is not responding. Waited 5001ms for FocusEvent(hasFocus=false))]
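Because the process was frozen, it could not handle the input event, hence the ANR. For the same reason, the request asking the process to dump its stacks at ANR time was never handled either.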
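A frozen process can be recognized in a trace file by checking whether every thread in a "Waiting Channels" section is parked in do_freezer_trap. The following is a minimal sketch of that check, assuming the dump format shown above; the class and method names are hypothetical and not part of any Android tool.

```java
import java.util.List;

public class FreezerTraceCheck {
    /**
     * Returns true if the dump contains at least one "sysTid=" line and
     * every such line reports the do_freezer_trap wait channel.
     */
    public static boolean allThreadsFrozen(List<String> lines) {
        int threads = 0;
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.startsWith("sysTid=")) {
                continue; // header, footer, or unrelated line
            }
            threads++;
            if (!trimmed.endsWith("do_freezer_trap")) {
                return false; // at least one thread is waiting somewhere else
            }
        }
        return threads > 0;
    }

    public static void main(String[] args) {
        List<String> dump = List.of(
                "----- Waiting Channels: pid 26859 at 2023-08-29 14:12:59.664895544+0200 -----",
                "Cmd line: me.linjw.demo.anr",
                "sysTid=26859 do_freezer_trap",
                "sysTid=26864 do_freezer_trap",
                "----- end 26859 -----");
        System.out.println(allThreadsFrozen(dump)); // prints "true"
    }
}
```

If this returns true for an ANR trace that has no Java stacks, the freezer is a likely suspect and the am_freeze event log is worth checking next.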
Freeze
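Many processes keep occupying memory and CPU in the background long after they leave the foreground, which hurts the user experience. When memory runs low, lmk (the low memory killer) is triggered to reclaim memory, but when memory is plentiful, background processes are not killed so that switching back to an app stays fast. To stop apps from quietly consuming CPU in the background, newer Android versions implement a process freezing mechanism, supported from Android 11 onwards.
The feature can be toggled through the "Suspend execution for cached apps" entry in Developer options, or with a command: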
adb shell settings put global cached_apps_freezer <enabled|disabled|default>
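- enabled: turn the freezer on
- disabled: turn the freezer off
- default: let the system decide whether to turn it on
A process's OOM_ADJ (Out of Memory Adjustment) value does more than decide whether the process is reclaimed when the system runs low on memory; the freezing policy is computed from it as well. The scenarios below trigger a recomputation of a process's oom adj, roughly: switching Activities, starting broadcasts, binding services, visibility changes, and so on: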
// https://cs.android.com/android/platform/superproject/+/android-13.0.0_r74:frameworks/base/services/core/java/com/android/server/am/OomAdjuster.java
public class OomAdjuster {
    static final String TAG = "OomAdjuster";
    static final String OOM_ADJ_REASON_METHOD = "updateOomAdj";
    static final String OOM_ADJ_REASON_NONE = OOM_ADJ_REASON_METHOD + "_meh";
    static final String OOM_ADJ_REASON_ACTIVITY = OOM_ADJ_REASON_METHOD + "_activityChange";
    static final String OOM_ADJ_REASON_FINISH_RECEIVER = OOM_ADJ_REASON_METHOD + "_finishReceiver";
    static final String OOM_ADJ_REASON_START_RECEIVER = OOM_ADJ_REASON_METHOD + "_startReceiver";
    static final String OOM_ADJ_REASON_BIND_SERVICE = OOM_ADJ_REASON_METHOD + "_bindService";
    static final String OOM_ADJ_REASON_UNBIND_SERVICE = OOM_ADJ_REASON_METHOD + "_unbindService";
    static final String OOM_ADJ_REASON_START_SERVICE = OOM_ADJ_REASON_METHOD + "_startService";
    static final String OOM_ADJ_REASON_GET_PROVIDER = OOM_ADJ_REASON_METHOD + "_getProvider";
    static final String OOM_ADJ_REASON_REMOVE_PROVIDER = OOM_ADJ_REASON_METHOD + "_removeProvider";
    static final String OOM_ADJ_REASON_UI_VISIBILITY = OOM_ADJ_REASON_METHOD + "_uiVisibility";
    static final String OOM_ADJ_REASON_ALLOWLIST = OOM_ADJ_REASON_METHOD + "_allowlistChange";
    static final String OOM_ADJ_REASON_PROCESS_BEGIN = OOM_ADJ_REASON_METHOD + "_processBegin";
    static final String OOM_ADJ_REASON_PROCESS_END = OOM_ADJ_REASON_METHOD + "_processEnd";
    ...
}
Freeze flow
For example, when an Activity is destroyed, ActivityRecord.setState updates the process state, and updating the process state in turn updates the oom adj:
// https://cs.android.com/android/platform/superproject/+/android-13.0.0_r74:frameworks/base/services/core/java/com/android/server/wm/ActivityRecord.java
WindowProcessController app; // if non-null, hosting application
void setState(State state, String reason) {
    ...
    switch (state) {
        ...
        case DESTROYING:
            if (app != null && !app.hasActivities()) {
                // Update any services we are bound to that might care about whether
                // their client may have activities.
                // No longer have activities, so update LRU list and oom adj.
                app.updateProcessInfo(true /* updateServiceConnectionActivities */,
                        false /* activityChange */, true /* updateOomAdj */,
                        false /* addPendingTopUid */);
            }
            break;
        ...
    }
    ...
}
// https://cs.android.com/android/platform/superproject/+/android-13.0.0_r74:frameworks/base/services/core/java/com/android/server/wm/WindowProcessController.java
void updateProcessInfo(boolean updateServiceConnectionActivities, boolean activityChange,
        boolean updateOomAdj, boolean addPendingTopUid) {
    if (addPendingTopUid) {
        addToPendingTop();
    }
    if (updateOomAdj) {
        prepareOomAdjustment();
    }
    // Posting on handler so WM lock isn't held when we call into AM.
    // This defers the call to WindowProcessListener::updateProcessInfo on mListener;
    // mListener is actually the ProcessRecord, which implements WindowProcessListener.
    final Message m = PooledLambda.obtainMessage(WindowProcessListener::updateProcessInfo,
            mListener, updateServiceConnectionActivities, activityChange, updateOomAdj);
    mAtm.mH.sendMessage(m);
}
// https://cs.android.com/android/platform/superproject/+/android-13.0.0_r74:frameworks/base/services/core/java/com/android/server/am/ProcessRecord.java
class ProcessRecord implements WindowProcessListener {
    ...
    @Override
    public void updateProcessInfo(boolean updateServiceConnectionActivities, boolean activityChange,
            boolean updateOomAdj) {
        ...
        if (updateOomAdj) {
            mService.updateOomAdjLocked(this, OomAdjuster.OOM_ADJ_REASON_ACTIVITY);
        }
        ...
    }
    ...
}
The oom adj recomputation eventually reaches OomAdjuster.applyOomAdjLSP, which calls updateAppFreezeStateLSP to update the freeze state of the process:
// https://cs.android.com/android/platform/superproject/+/android-13.0.0_r74:frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java
// The two-argument overload is the one ProcessRecord.updateProcessInfo calls.
final boolean updateOomAdjLocked(ProcessRecord app, String oomAdjReason) {
    return mOomAdjuster.updateOomAdjLocked(app, oomAdjReason);
}
// https://cs.android.com/android/platform/superproject/+/android-13.0.0_r74:frameworks/base/services/core/java/com/android/server/am/OomAdjuster.java
boolean updateOomAdjLocked(ProcessRecord app, String oomAdjReason) {
    synchronized (mProcLock) {
        return updateOomAdjLSP(app, oomAdjReason);
    }
}
// updateOomAdjLSP(app, oomAdjReason) goes on to call performUpdateOomAdjLSP.
private boolean performUpdateOomAdjLSP(ProcessRecord app, String oomAdjReason) {
    ...
    applyOomAdjLSP(app, false, SystemClock.uptimeMillis(),
            SystemClock.elapsedRealtime(), oomAdjReason);
    ...
}
private boolean applyOomAdjLSP(ProcessRecord app, boolean doingAll, long now,
        long nowElapsed, String oomAdjReson) {
    ...
    updateAppFreezeStateLSP(app);
    ...
}
In updateAppFreezeStateLSP, when adj >= CACHED_APP_MIN_ADJ (900), freezeAppAsyncLSP is called. An adj between 900 and 999 means the process only hosts invisible activities and can be killed at any time, so freezing it does no harm either:
// https://cs.android.com/android/platform/superproject/+/android-13.0.0_r74:frameworks/base/services/core/java/com/android/server/am/OomAdjuster.java
private void updateAppFreezeStateLSP(ProcessRecord app) {
    ...
    final ProcessStateRecord state = app.mState;
    // Use current adjustment when freezing, set adjustment when unfreezing.
    if (state.getCurAdj() >= ProcessList.CACHED_APP_MIN_ADJ && !opt.isFrozen()
            && !opt.shouldNotFreeze()) {
        mCachedAppOptimizer.freezeAppAsyncLSP(app);
    } else if (state.getSetAdj() < ProcessList.CACHED_APP_MIN_ADJ) {
        mCachedAppOptimizer.unfreezeAppLSP(app, oomAdjReason);
    }
}
// https://cs.android.com/android/platform/superproject/+/android-13.0.0_r74:frameworks/base/services/core/java/com/android/server/am/ProcessList.java
// This is a process only hosting activities that are not visible,
// so it can be killed without any disruption.
public static final int CACHED_APP_MAX_ADJ = 999;
public static final int CACHED_APP_MIN_ADJ = 900;
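The branch in updateAppFreezeStateLSP can be restated as a small pure function, which makes it easier to see which inputs lead to which outcome. This is only a sketch: the class, the enum, and the method are hypothetical, and the real method has additional early exits that are elided in the excerpt above.

```java
public class FreezeDecision {
    // Threshold quoted from ProcessList above.
    public static final int CACHED_APP_MIN_ADJ = 900;

    public enum Action { SCHEDULE_FREEZE, UNFREEZE, NONE }

    /**
     * Mirrors the if/else-if shown above: the current adj is used to decide
     * on freezing, the previously applied ("set") adj to decide on unfreezing.
     */
    public static Action decide(int curAdj, int setAdj, boolean frozen, boolean shouldNotFreeze) {
        if (curAdj >= CACHED_APP_MIN_ADJ && !frozen && !shouldNotFreeze) {
            return Action.SCHEDULE_FREEZE;
        } else if (setAdj < CACHED_APP_MIN_ADJ) {
            return Action.UNFREEZE;
        }
        return Action.NONE;
    }

    public static void main(String[] args) {
        System.out.println(decide(905, 905, false, false)); // SCHEDULE_FREEZE
        System.out.println(decide(0, 0, true, false));      // UNFREEZE
        System.out.println(decide(905, 905, true, false));  // NONE (already frozen)
    }
}
```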
freezeAppAsyncLSP posts a message with a 10-minute delay and freezes the process when the delay elapses (that is, Process.setProcessFrozen is called 10 minutes later):
// https://cs.android.com/android/platform/superproject/+/android-13.0.0_r74:frameworks/base/services/core/java/com/android/server/am/CachedAppOptimizer.java
@VisibleForTesting static final long DEFAULT_FREEZER_DEBOUNCE_TIMEOUT = 600_000L;
@VisibleForTesting volatile long mFreezerDebounceTimeout = DEFAULT_FREEZER_DEBOUNCE_TIMEOUT;
void freezeAppAsyncLSP(ProcessRecord app) {
    final ProcessCachedOptimizerRecord opt = app.mOptRecord;
    if (opt.isPendingFreeze()) {
        // Skip redundant DO_FREEZE message
        return;
    }
    mFreezeHandler.sendMessageDelayed(
            mFreezeHandler.obtainMessage(
                SET_FROZEN_PROCESS_MSG, DO_FREEZE, 0, app),
            mFreezerDebounceTimeout);
    ...
}
public void handleMessage(Message msg) {
    switch (msg.what) {
        case SET_FROZEN_PROCESS_MSG:
            synchronized (mAm) {
                freezeProcess((ProcessRecord) msg.obj);
            }
            break;
        ...
    }
}
private void freezeProcess(final ProcessRecord proc) {
    ...
    Process.setProcessFrozen(pid, proc.uid, true);
    ...
}
// https://cs.android.com/android/platform/superproject/+/android-13.0.0_r74:frameworks/base/core/java/android/os/Process.java
/**
 * Freeze or unfreeze the specified process.
 *
 * @param pid Identifier of the process to freeze or unfreeze.
 * @param uid Identifier of the user the process is running under.
 * @param frozen Specify whether to free (true) or unfreeze (false).
 *
 * @hide
 */
public static final native void setProcessFrozen(int pid, int uid, boolean frozen);
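To sum up: when a process's oom adj is greater than or equal to CACHED_APP_MIN_ADJ, a 10-minute timer is started; if the oom adj has not dropped back below CACHED_APP_MIN_ADJ within those 10 minutes, the process is frozen.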
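The debounce behaviour summarized above can be modelled with a simulated clock instead of a real Handler. The sketch below is a deliberately simplified model: the class and its methods are hypothetical, it uses a single adj value rather than the separate current and set values, and it ignores shouldNotFreeze.

```java
public class FreezeDebounceModel {
    static final int CACHED_APP_MIN_ADJ = 900;
    static final long DEBOUNCE_MS = 600_000L; // DEFAULT_FREEZER_DEBOUNCE_TIMEOUT

    private long freezeDeadlineMs = -1; // -1 means no freeze is pending
    private boolean frozen;

    /** Called whenever the adj is recomputed at simulated time nowMs. */
    public void onAdjUpdated(int adj, long nowMs) {
        advanceTo(nowMs);
        if (adj >= CACHED_APP_MIN_ADJ) {
            if (!frozen && freezeDeadlineMs < 0) {
                freezeDeadlineMs = nowMs + DEBOUNCE_MS; // start the timer only once
            }
        } else {
            freezeDeadlineMs = -1; // cancel a pending freeze
            frozen = false;        // or thaw a process that is already frozen
        }
    }

    /** Moves the simulated clock forward and fires the timer if it is due. */
    public void advanceTo(long nowMs) {
        if (freezeDeadlineMs >= 0 && nowMs >= freezeDeadlineMs) {
            frozen = true;
            freezeDeadlineMs = -1;
        }
    }

    public boolean isFrozen() { return frozen; }

    public boolean isPendingFreeze() { return freezeDeadlineMs >= 0; }

    public static void main(String[] args) {
        FreezeDebounceModel p = new FreezeDebounceModel();
        p.onAdjUpdated(905, 0);
        p.advanceTo(599_999);
        System.out.println(p.isFrozen()); // false: still inside the debounce window
        p.advanceTo(600_000);
        System.out.println(p.isFrozen()); // true: the timer fired
        p.onAdjUpdated(0, 700_000);
        System.out.println(p.isFrozen()); // false: adj dropped below 900, so thawed
    }
}
```

A process whose adj never leaves the cached range after the first update is frozen exactly one debounce interval later, which is the pattern to look for in the logs.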
Unfreeze flow
Likewise, when an Activity is started, ActivityRecord.setState calls WindowProcessController.updateProcessInfo to update the process state, which in turn updates the oom adj:
// https://cs.android.com/android/platform/superproject/+/android-13.0.0_r74:frameworks/base/services/core/java/com/android/server/wm/ActivityRecord.java
WindowProcessController app; // if non-null, hosting application
void setState(State state, String reason) {
    ...
    switch (state) {
        ...
        case STARTED:
            ...
            app.updateProcessInfo(false /* updateServiceConnectionActivities */,
                    true /* activityChange */, true /* updateOomAdj */,
                    true /* addPendingTopUid */);
            ...
        ...
    }
    ...
}
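This also ends up in OomAdjuster.updateAppFreezeStateLSP. The call chain was traced in the freeze flow above, so it is omitted here. If the adj is below CACHED_APP_MIN_ADJ, CachedAppOptimizer.unfreezeAppLSP is called to unfreeze the process: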
// https://cs.android.com/android/platform/superproject/+/android-13.0.0_r74:frameworks/base/services/core/java/com/android/server/am/OomAdjuster.java
private void updateAppFreezeStateLSP(ProcessRecord app) {
    ...
    final ProcessStateRecord state = app.mState;
    // Use current adjustment when freezing, set adjustment when unfreezing.
    if (state.getCurAdj() >= ProcessList.CACHED_APP_MIN_ADJ && !opt.isFrozen()
            && !opt.shouldNotFreeze()) {
        mCachedAppOptimizer.freezeAppAsyncLSP(app);
    } else if (state.getSetAdj() < ProcessList.CACHED_APP_MIN_ADJ) {
        mCachedAppOptimizer.unfreezeAppLSP(app, oomAdjReason);
    }
}
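This finally reaches CachedAppOptimizer.unfreezeAppInternalLSP. If the process is still inside the 10-minute grace period, the timer is simply cancelled with removeMessages; if the process is already frozen, Process.setProcessFrozen is called with frozen set to false to unfreeze it: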
// https://cs.android.com/android/platform/superproject/+/android-13.0.0_r74:frameworks/base/services/core/java/com/android/server/am/CachedAppOptimizer.java
void unfreezeAppLSP(ProcessRecord app, String reason) {
    synchronized (mFreezerLock) {
        unfreezeAppInternalLSP(app, reason);
    }
}
void unfreezeAppInternalLSP(ProcessRecord app, String reason) {
    final int pid = app.getPid();
    final ProcessCachedOptimizerRecord opt = app.mOptRecord;
    if (opt.isPendingFreeze()) {
        // Remove pending DO_FREEZE message
        mFreezeHandler.removeMessages(SET_FROZEN_PROCESS_MSG, app);
        opt.setPendingFreeze(false);
        ...
    }
    opt.setFreezerOverride(false);
    if (pid == 0 || !opt.isFrozen()) {
        return;
    }
    ...
    Process.setProcessFrozen(pid, app.uid, false);
    ...
}
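In the example above, the whole flow, from freezing the process when the Activity is left to unfreezing it when an Activity is entered again, is as follows:
ActivityRecord.setState -> WindowProcessController.updateProcessInfo -> ProcessRecord.updateProcessInfo -> ActivityManagerService.updateOomAdjLocked -> OomAdjuster.applyOomAdjLSP -> OomAdjuster.updateAppFreezeStateLSP -> CachedAppOptimizer.freezeAppAsyncLSP (freeze after the 10-minute delay) or CachedAppOptimizer.unfreezeAppLSP (unfreeze) -> Process.setProcessFrozen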
Locating and working around the problem
According to the log, the adj of the process was 905 when it was killed:
08-29 14:13:11.716499 7904 27946 I ActivityManager: Killing 26859:me.linjw.demo.anr/1000 (adj 905): bg anr
Its start time and freeze time are also almost exactly 10 minutes apart:
08-29 14:01:45.124651 7904 8283 I ActivityManager: Start proc 26859:me.linjw.demo.anr/1000 for service {me.linjw.demo.anr/me.linjw.demo.anr.RemoteService}
08-29 14:11:45.885503 7904 8331 I am_freeze: [26859,me.linjw.demo.anr]
In other words, the adj was already 905 when the process started, and the 10-minute freeze timer was set right then.
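The gap between the two log lines can be checked with a few lines of arithmetic. This is a small sketch with hypothetical class and method names; it assumes both timestamps fall on the same day, as they do in the log above.

```java
import java.time.Duration;
import java.time.LocalTime;

public class FreezeDelay {
    /** Parses two logcat times of day such as "14:01:45.124651" and returns end - start. */
    public static Duration between(String start, String end) {
        return Duration.between(LocalTime.parse(start), LocalTime.parse(end));
    }

    public static void main(String[] args) {
        // "Start proc" timestamp versus am_freeze timestamp from the log above.
        Duration d = between("14:01:45.124651", "14:11:45.885503");
        System.out.println(d.toMillis() + " ms"); // prints "600760 ms"
    }
}
```

The result is the 600000 ms of DEFAULT_FREEZER_DEBOUNCE_TIMEOUT plus about 0.76 s, which is consistent with the timer having been armed shortly after the process started.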
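The catch is that our app really does have only a Service. It never starts an Activity; instead it adds a global floating window through WindowManager.addView.
The addView source is large and I could not find any logic in it that updates the oom adj. When I tried to reproduce the problem, however, reading the oom adj with cat /proc/{pid}/oom_adj did not give a value above 900, and the freeze after 10 minutes could not be reproduced either. (Note that /proc/{pid}/oom_adj is the legacy interface with a -17 to 15 range; the 0 to 1000 scale that the framework uses is exposed through /proc/{pid}/oom_score_adj, so that is the file to compare against 900.)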
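The adj values discussed here (for example 905) are on the scale that the kernel exposes through /proc/<pid>/oom_score_adj. The sketch below reads that file and compares the value against the cached threshold. The class and method names are hypothetical, and it assumes the code runs on the device itself, for instance inside the app reading /proc/self/oom_score_adj.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class OomScoreAdjReader {
    static final int CACHED_APP_MIN_ADJ = 900; // value quoted from ProcessList

    /** Parses the single integer that oom_score_adj contains, e.g. "905\n". */
    public static int parseAdj(String content) {
        return Integer.parseInt(content.trim());
    }

    /** Reads a file such as /proc/<pid>/oom_score_adj. */
    public static int readAdj(Path file) {
        try {
            return parseAdj(Files.readString(file));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** True if the adj is in the range where the freezer timer gets armed. */
    public static boolean isCached(int adj) {
        return adj >= CACHED_APP_MIN_ADJ;
    }

    public static void main(String[] args) {
        // args[0] may be a pid; without it the current process is inspected.
        String pid = args.length > 0 ? args[0] : "self";
        int adj = readAdj(Path.of("/proc", pid, "oom_score_adj"));
        System.out.println("adj=" + adj + " cached=" + isCached(adj));
    }
}
```

Logging this value periodically from the affected service would show whether the process ever sits at 900 or above while the floating window is on screen.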
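So either the update really does not exist, or it failed under some condition. There were no errors in the log, and handing the problem to the system team would probably not get it solved, so the only option was a workaround on the app side.
The workaround is simple: make the service a foreground service, which actively triggers an oom adj recomputation of type OOM_ADJ_REASON_UI_VISIBILITY: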
// https://cs.android.com/android/platform/superproject/+/android-13.0.0_r74:frameworks/base/services/core/java/com/android/server/am/ActiveServices.java
private void updateServiceForegroundLocked(ProcessServiceRecord psr, boolean oomAdj) {
    ...
    mAm.updateProcessForegroundLocked(psr.mApp, anyForeground, fgServiceTypes, oomAdj);
    ...
}
// https://cs.android.com/android/platform/superproject/+/android-13.0.0_r74:frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java
final void updateProcessForegroundLocked(ProcessRecord proc, boolean isForeground,
        int fgServiceTypes, boolean oomAdj) {
    ...
    if (oomAdj) {
        updateOomAdjLocked(proc, OomAdjuster.OOM_ADJ_REASON_UI_VISIBILITY);
    }
}
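On the app side, promoting the service to the foreground could look roughly like the sketch below. It is not the code used in the original app: the channel id, the notification text, and the icon are hypothetical placeholders. It assumes API level 26 or higher (notification channels) and that the manifest declares the android.permission.FOREGROUND_SERVICE permission, which is required from API level 28.

```java
import android.app.Notification;
import android.app.NotificationChannel;
import android.app.NotificationManager;
import android.app.Service;
import android.content.Intent;
import android.os.IBinder;

public class RemoteService extends Service {
    private static final String CHANNEL_ID = "float_view"; // hypothetical channel id
    private static final int NOTIFICATION_ID = 1;          // must not be 0

    @Override
    public void onCreate() {
        super.onCreate();
        // A foreground service needs a visible notification, which needs a channel.
        NotificationChannel channel = new NotificationChannel(
                CHANNEL_ID, "Float view", NotificationManager.IMPORTANCE_LOW);
        getSystemService(NotificationManager.class).createNotificationChannel(channel);

        Notification notification = new Notification.Builder(this, CHANNEL_ID)
                .setContentTitle("Float view is running")
                .setSmallIcon(android.R.drawable.ic_dialog_info)
                .build();

        // Promote the service to the foreground so the process leaves the cached adj range.
        startForeground(NOTIFICATION_ID, notification);
    }

    @Override
    public IBinder onBind(Intent intent) {
        return null; // binding is outside the scope of this sketch
    }
}
```

With the service in the foreground, the process should no longer be treated as cached, so the freeze timer is not expected to be armed for it.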