0. Summary
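1. Symptoms
2. Analysis
. 2.1 Checking the SGA parameters
. 2.2 Checking the large pool size and automatic resizing
. 2.3 Checking the parallel parameters
. 2.4 Detailed alert log analysis
. 2.5 Checking the shared pool size
3. Recommendations
1. Symptoms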
#### alert log ####
Sat Feb 04 02:08:41 2017
Memory Notification: Library Cache Object loaded into SGA
Heap size 51201K exceeds notification threshold (51200K)
Details in trace file /app/oracle/diag/rdbms/noap/noap/trace/noap_j000_4031.trc
KGL object name :SZ1X.IN_JS_CDR_HW_AC_TI
Memory Notification: Library Cache Object loaded into SGA
Heap size 331660K exceeds notification threshold (51200K)
Details in trace file /app/oracle/diag/rdbms/noap/noap/trace/noap_j000_4031.trc
KGL object name :alter table MOD_JS_CDR_HW drop partition SYS_P1404186
Sat Feb 04 02:09:20 2017
TABLE SZ1X.MOD_CDR_HW: ADDED INTERVAL PARTITION SYS_P1404388 (47883) VALUES LESS THAN (TO_DATE(' 2017-02-04 03:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN'))
Sat Feb 04 02:14:22 2017
Thread 1 advanced to log sequence 603901 (LGWR switch)
Current log# 1 seq# 603901 mem# 0: /app/oracle/oradata/noap/redo01.log
Sat Feb 04 02:32:34 2017
Errors in file /app/oracle/diag/rdbms/noap/noap/trace/noap_p085_5172.trc (incident=109601):
ORA-04031: unable to allocate 2048024 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")
Incident details in: /app/oracle/diag/rdbms/noap/noap/incident/incdir_109601/noap_p085_5172_i109601.trc
Sat Feb 04 02:32:34 2017
Errors in file /app/oracle/diag/rdbms/noap/noap/trace/noap_p034_5070.trc (incident=101439):
ORA-04031: unable to allocate 2048024 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")
Incident details in: /app/oracle/diag/rdbms/noap/noap/incident/incdir_101439/noap_p034_5070_i101439.trc
Sat Feb 04 02:32:34 2017
Errors in file /app/oracle/diag/rdbms/noap/noap/trace/noap_p092_5186.trc (incident=109832):
ORA-04031: unable to allocate 2048024 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")
Incident details in: /app/oracle/diag/rdbms/noap/noap/incident/incdir_109832/noap_p092_5186_i109832.trc
Sat Feb 04 02:32:34 2017
Errors in file /app/oracle/diag/rdbms/noap/noap/trace/noap_p051_5104.trc (incident=107372):
ORA-04031: unable to allocate 2048024 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")
Incident details in: /app/oracle/diag/rdbms/noap/noap/incident/incdir_107372/noap_p051_5104_i107372.trc
......
#### noap_p085_5172_i109601.trc ####
Dump continued from file: /app/oracle/diag/rdbms/noap/noap/trace/noap_p085_5172.trc
ORA-04031: unable to allocate 2048024 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")
========= Dump for incident 109601 (ORA 4031) ========
*** 2017-02-04 02:32:34.673
dbkedDefDump(): Starting incident default dumps (flags=0x2, level=3, mask=0x0)
----- Current SQL Statement for this session (sql_id=385pbhfh4g7rn) -----
insert /*+append*/ into c_cdr_railway_huning
select /*+full(t) parallel(64)*/
RELEASE_CAUSE AS [alias unreadable in source],
ACCESS_TIME AS [alias unreadable in source]
......
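The alert log shows ORA-04031 errors. Judging from the error text, the direct cause is that parallel execution could not get enough memory from the large pool ("PX msg pool").
2. Analysis
2.1 Checking the SGA parameters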
SQL> show parameter sga
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
lock_sga boolean FALSE
pre_page_sga boolean FALSE
sga_max_size big integer 32G
sga_target big integer 32G
SQL> show parameter db_cache
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
db_cache_advice string ON
db_cache_size big integer 22G
SQL> show parameter shared_pool
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
shared_pool_reserved_size big integer 510027366
shared_pool_size big integer 8G
The SGA of this database is managed automatically by ASMM (sga_target is set to 32G).
2.2 Checking the large pool size and automatic resizing
SQL> show parameter large
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
large_pool_size big integer 0
use_large_pages string TRUE
SQL> select t.*
2 from (select name,
3 bytes / (1024 * 1024) "MB",
4 round(bytes / (select value
5 from v$parameter t
6 where t.name = 'shared_pool_size') * 100,
7 2) || '%' "USED%"
8 from v$sgastat
9 where pool = 'large pool'
10 order by 2 desc) t
11 where rownum < 20;
NAME MB USED%
-------------------------- ---------- -----------------------------------------
free memory 119.8125 1.46%
PX msg pool 7.8125 .1%
ASM map operations hashta .375 0%
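The SGA is managed by ASMM and no minimum is set for the large pool (large_pool_size is 0); its current usage looks normal. Note that the USED% column in this query is computed against shared_pool_size (8G), not against the large pool itself. Because the memory is managed automatically, the large pool can also be squeezed whenever ASMM resizes the components.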
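The USED% column in the query above is computed against shared_pool_size, so it understates how full the large pool is. As a minimal sketch (same v$sgastat view; the aliases mb and pct_of_pool are illustrative only), each component can instead be reported as a share of the large pool itself:

```sql
-- Share of each large pool component relative to the pool total.
-- ratio_to_report() divides each row's bytes by the sum over all returned rows.
select name,
       round(bytes / 1024 / 1024, 2)                  as mb,
       round(100 * ratio_to_report(bytes) over (), 2) as pct_of_pool
  from v$sgastat
 where pool = 'large pool'
 order by bytes desc;
```

With the three rows shown above (128 MB in total), free memory would come out at roughly 93.6% of the pool.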
SQL> select start_time,
2 component,
3 oper_type,
4 oper_mode,
5 initial_size / 1024 / 1024 "INITIAL",
6 final_size / 1024 / 1024 "FINAL",
7 end_time
8 from v$sga_resize_ops
9 where component in ('large pool')
10 order by start_time, component;
START_TIME COMPONENT OPER_TYPE OPER_MODE INITIAL FINAL END_TIME
------------------- ------------------------- ------------- --------- -------------------- -------------------- -------------------
30/01/2017 03:02:01 large pool GROW IMMEDIATE 192 256 30/01/2017 03:02:03
30/01/2017 03:02:01 large pool GROW IMMEDIATE 192 256 30/01/2017 03:02:03
30/01/2017 03:02:01 large pool GROW IMMEDIATE 192 256 30/01/2017 03:02:02
30/01/2017 03:02:01 large pool GROW IMMEDIATE 192 256 30/01/2017 03:02:02
30/01/2017 03:02:01 large pool GROW IMMEDIATE 192 256 30/01/2017 03:02:02
30/01/2017 03:02:01 large pool GROW IMMEDIATE 192 256 30/01/2017 03:02:02
30/01/2017 03:02:01 large pool GROW IMMEDIATE 192 256 30/01/2017 03:02:02
30/01/2017 03:02:01 large pool GROW IMMEDIATE 192 256 30/01/2017 03:02:02
30/01/2017 03:02:01 large pool GROW IMMEDIATE 192 256 30/01/2017 03:02:02
......
04/02/2017 02:32:32 large pool GROW IMMEDIATE 320 384 04/02/2017 02:32:33
04/02/2017 02:32:32 large pool GROW IMMEDIATE 320 384 04/02/2017 02:32:33
04/02/2017 02:32:32 large pool GROW IMMEDIATE 320 384 04/02/2017 02:32:33
04/02/2017 02:32:32 large pool GROW IMMEDIATE 320 384 04/02/2017 02:32:33
04/02/2017 02:35:47 large pool SHRINK DEFERRED 384 128 04/02/2017 02:35:47
......
04/02/2017 03:01:56 large pool GROW IMMEDIATE 320 384 04/02/2017 03:01:57
04/02/2017 03:01:56 large pool GROW IMMEDIATE 320 384 04/02/2017 03:01:57
04/02/2017 03:01:56 large pool GROW IMMEDIATE 320 384 04/02/2017 03:01:57
04/02/2017 03:01:56 large pool GROW IMMEDIATE 320 384 04/02/2017 03:01:57
04/02/2017 03:01:56 large pool GROW IMMEDIATE 320 384 04/02/2017 03:01:57
04/02/2017 03:01:56 large pool GROW IMMEDIATE 320 384 04/02/2017 03:01:57
04/02/2017 03:04:24 large pool SHRINK DEFERRED 384 128 04/02/2017 03:04:24
The output shows that the large pool grows and shrinks fairly frequently.
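To judge how often the resizing happens without scanning the raw rows, the operations can be summarised per hour. This is only a sketch over the same v$sga_resize_ops view; the hourly bucket and the column aliases are arbitrary choices, and the view keeps only a limited history of recent operations.

```sql
-- Count large pool resize operations per hour and operation type,
-- with the smallest and largest resulting pool size in MB.
select trunc(start_time, 'HH24')      as hour_start,
       oper_type,
       oper_mode,
       count(*)                       as ops,
       min(final_size) / 1024 / 1024  as min_final_mb,
       max(final_size) / 1024 / 1024  as max_final_mb
  from v$sga_resize_ops
 where component = 'large pool'
 group by trunc(start_time, 'HH24'), oper_type, oper_mode
 order by hour_start, oper_type;
```

Hours that show both IMMEDIATE GROW and DEFERRED SHRINK rows are the ones worth comparing against the ORA-04031 timestamps in the alert log.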
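2.3 Checking the parallel parameters
The trace file of the failing session shows that the SQL runs with a fairly high degree of parallelism (64). Check the parallel-related parameters: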
SQL> show parameter cpu_count
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
cpu_count integer 16
SQL> show parameter parallel_max
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
parallel_max_servers integer 640
A degree of 64 is high for a host whose cpu_count is only 16; lowering the degree of parallelism is recommended.
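Besides the two parameters shown above, a few other settings influence how much PX message memory parallel statements request. The sketch below lists them together with the PX server counters; the selection of parameters is a suggestion, not part of the original investigation.

```sql
-- Parallel execution settings relevant to PX message buffers.
select name, value
  from v$parameter
 where name in ('cpu_count',
                'parallel_max_servers',
                'parallel_threads_per_cpu',
                'parallel_degree_policy',
                'parallel_execution_message_size')
 order by name;

-- PX server usage since instance startup (in use, available, high-water mark).
-- LIKE is used because the STATISTIC values may be blank-padded.
select statistic, value
  from v$px_process_sysstat
 where statistic like 'Servers%';
```

A high-water mark far above cpu_count would support the recommendation to lower the hinted degree of parallelism.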
2.4 Detailed alert log analysis
Memory Notification: Library Cache Object loaded into SGA
Heap size 51201K exceeds notification threshold (51200K)
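This message means that an object being loaded into memory needed more heap space than the notification threshold, which is controlled by _kgl_large_heap_warning_threshold. The notification was introduced in 10gR2. On its own it does not indicate a problem; what matters is whether ORA-04031 errors follow.
Reference:
Memory Notification: Library Cache Object loaded into SGA / ORA-600 [KGL-heap-size-exceeded] (Doc ID 330239.1)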
#### noap_j000_4031.trc ####
Memory Notification: Library Cache Object loaded into SGA
Heap size 73935K exceeds notification threshold (51200K)
LibraryHandle: Address=0x855ade650 Hash=70548654 LockMode=N PinMode=0 LoadLockMode=0 Status=VALD
ObjectName: Name=alter table MOD_CDR_HW drop partition SYS_P1399962
FullHashValue=3aa1433897dd4d6fc458246c70548654 Namespace=SQL AREA(00) Type=CURSOR(00) Identifier=1884587604 OwnerIdn=83
Statistics: InvalidationCount=0 ExecutionCount=0 LoadCount=2 ActiveLocks=1 TotalLockCount=1 TotalPinCount=1
Counters: BrokenCount=1 RevocablePointer=1 KeepDependency=1 BucketInUse=0 HandleInUse=0 HandleReferenceCount=0
Concurrency: DependencyMutex=0x855ade700(0, 1, 0, 0) Mutex=0x855ade780(1011, 21, 0, 6)
Flags=RON/PIN/TIM/PN0/DBN/[10012841]
WaitersLists:
Lock=0x855ade6e0[0x855ade6e0,0x855ade6e0]
Pin=0x855ade6c0[0x855ade6c0,0x855ade6c0]
Timestamp: Current=02-04-2017 02:00:34
HandleReference: Address=0x855ade820 Handle=(nil) Flags=[00]
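The trace file behind this notification records the statement involved, which corresponds to the statement printed after the notification in the alert log:
KGL object name :alter table MOD_JS_CDR_HW drop partition SYS_P1404186
Next, look at the large pool errors: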
Sat Feb 04 02:32:34 2017
Errors in file /app/oracle/diag/rdbms/noap/noap/trace/noap_p085_5172.trc (incident=109601):
ORA-04031: unable to allocate 2048024 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")
Incident details in: /app/oracle/diag/rdbms/noap/noap/incident/incdir_109601/noap_p085_5172_i109601.trc
Sat Feb 04 02:32:34 2017
Errors in file /app/oracle/diag/rdbms/noap/noap/trace/noap_p034_5070.trc (incident=101439):
ORA-04031: unable to allocate 2048024 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")
Incident details in: /app/oracle/diag/rdbms/noap/noap/incident/incdir_101439/noap_p034_5070_i101439.trc
Sat Feb 04 02:32:34 2017
Errors in file /app/oracle/diag/rdbms/noap/noap/trace/noap_p092_5186.trc (incident=109832):
ORA-04031: unable to allocate 2048024 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")
Incident details in: /app/oracle/diag/rdbms/noap/noap/incident/incdir_109832/noap_p092_5186_i109832.trc
Sat Feb 04 02:32:34 2017
Errors in file /app/oracle/diag/rdbms/noap/noap/trace/noap_p051_5104.trc (incident=107372):
ORA-04031: unable to allocate 2048024 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")
Incident details in: /app/oracle/diag/rdbms/noap/noap/incident/incdir_107372/noap_p051_5104_i107372.trc
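These large pool errors line up with the resize activity found by the earlier SQL query: an immediate GROW at 02:32:32, just before the errors at 02:32:34, followed by a deferred SHRINK at 02:35:47.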
04/02/2017 02:32:32 large pool GROW IMMEDIATE 320 384 04/02/2017 02:32:33
04/02/2017 02:35:47 large pool SHRINK DEFERRED 384 128 04/02/2017 02:35:47
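Reference:
Multiple ORA-4031 Errors Of Reducing Sizes For "PX msg pool" In The Large Pool (Doc ID 1515877.1)
This is related to Bug:13072654 - ORA-4031 CANT ALLOC 14MB IN LARGE POOL, PX MSG POOL. A one-off patch for this bug exists on 11.2.0.2 and could be applied; the alternatives are to set a minimum size for the large pool or to switch the SGA to manual memory management.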
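2.5 Checking the shared pool size
Because the alert log contains library cache object (LCO) notifications, check how the shared pool is currently being used: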
SQL> select t.*
2 from (select name,
3 bytes / (1024 * 1024) "MB",
4 round(bytes / (select value
5 from v$parameter t
6 where t.name = 'shared_pool_size') * 100,
7 2) || '%' "USED%"
8 from v$sgastat
9 where pool = 'shared pool'
10 order by 2 desc) t
11 where rownum < 20;
NAME MB USED%
-------------------------- -------------------- -----------------------------------------
free memory 2971.667167663574219 36.28%
PRTMV 2858.977592468261719 34.9%
SQLA 1862.341529846191406 22.73%
PRTDS 614.0089035034179688 7.5%
KQR M PO 315.4219131469726563 3.85%
KGLH0 199.7758560180664063 2.44%
dbktb: trace buffer 81.90625 1%
FileOpenBlock 60.796417236328125 .74%
ASM extent pointer array 52.86400604248046875 .65%
db_block_hash_buckets 44.50390625 .54%
dbwriter coalesce buffer 32.03125 .39%
ASH buffers 32 .39%
KGLHD 29.06992340087890625 .35%
kglsim object batch 19.32781219482421875 .24%
private strands 17.5341796875 .21%
Checkpoint queue 15.6328125 .19%
event statistics per sess 15.33984375 .19%
write state object 14.6377716064453125 .18%
ksunfy : SSO free list 14.32470703125 .17%
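The PRTMV component stands out here: it is unfamiliar and occupies about 2.8G. For comparison, the same query on another database returns: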
NAME MB USED%
-------------------------- ---------- -----------------------------------------
free memory 6021.38947 58.8%
SQLA 1472.31024 14.38%
KGLH0 1264.46631 12.35%
PRTMV 219.83268 2.15%
KGLHD 199.816628 1.95%
db_block_hash_buckets 178.003906 1.74%
dbktb: trace buffer 102.390625 1%
ASH buffers 96 .94%
dbwriter coalesce buffer 80.078125 .78%
FileOpenBlock 71.1162643 .69%
KGLDA 65.826004 .64%
Checkpoint queue 46.8984375 .46%
KKSSP 40.4567947 .4%
private strands 25.9765625 .25%
dirty object counts array 24 .23%
event statistics per sess 22.8779297 .22%
ksunfy : SSO free list 21.7646484 .21%
parameter table block 19.9453812 .19%
KGLS 19.5513763 .19%
The comparison suggests that the PRTMV value may be abnormal. A search on MOS confirms that related bugs exist:
Bug 19461270 - high PRTMV allocations in shared pool executing concurrent DML and DDLs on interval partitioned tables (Doc ID 19461270.8)
Description
Concurrent DDLs and DMLs happening on interval partitioned table that was created with deferred segment creation clause may do high PRTMV allocations.
Workaround
Do not run DDLs concurrently.
This bug can be triggered when interval partitioning is in use, which matches the current symptoms fairly well.
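To see which tables could be exposed to this bug, the interval-partitioned tables and their partitions without a created segment can be listed from the dictionary. This is a sketch only: it assumes the INTERVAL column of dba_part_tables and the SEGMENT_CREATED column of dba_tab_partitions are available in this release, which should be checked on the 11.2.0.2 instance before relying on it.

```sql
-- Interval-partitioned tables (INTERVAL is null for non-interval tables).
select owner, table_name, partitioning_type, interval
  from dba_part_tables
 where interval is not null
 order by owner, table_name;

-- Partitions of those tables whose segment has not been created yet
-- (deferred segment creation).
select p.table_owner, p.table_name, count(*) as partitions_without_segment
  from dba_tab_partitions p
  join dba_part_tables t
    on t.owner = p.table_owner
   and t.table_name = p.table_name
 where t.interval is not null
   and p.segment_created = 'NO'
 group by p.table_owner, p.table_name
 order by partitions_without_segment desc;
```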
Bug 17037130 - Excess shared pool "PRTMV" memory use / ORA-4031 with partitioned tables (Doc ID 17037130.8)
Description
This bug is only relevant when using Partitioned Tables
SQL on a partitioned table may cause excess shared pool usage and
ultimately fail with ORA-4031.
Rediscovery Notes:
ORA-4031 with child cursor(s) having dependency table entries
referencing obsolete (OBS) multi-versioned objects.
Workaround
Flushing the shared_pool and avoiding DDLs during high load time
can help to avoid this issue.
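3. Recommendations
Based on the analysis above, the ORA-04031 errors in the large pool are very likely related to the large pool being shrunk. The shared pool has a problem of its own as well.
For the large pool bug: this database is on 11.2.0.2 with no PSU applied. A one-off patch exists for 11.2.0.2; if the patch is not applied, the following workarounds can be considered:
- Set a minimum size for the large pool to avoid frequent shrinking. The database uses ASMM, and db_cache (22G) and shared_pool (8G) already have minimum values set; setting large_pool_size to 200M is recommended.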
alter system set large_pool_size=200M scope=spfile sid='*';
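If parallel jobs are still affected frequently, apply the one-off patch or switch to manual memory management.
- The degree of parallelism of 64 used by the parallel job is high for a host whose cpu_count is only 16; lowering the degree of parallelism is recommended.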
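Because the large_pool_size change is made with scope=spfile, it only takes effect after the instance is restarted. A minimal sketch for confirming afterwards that the floor is in place, using v$sga_dynamic_components (the aliases are illustrative):

```sql
-- Current, minimum and user-specified size of the large pool in MB.
-- After the restart, user_specified_mb should show the configured 200.
select component,
       current_size / 1024 / 1024        as current_mb,
       min_size / 1024 / 1024            as min_mb,
       user_specified_size / 1024 / 1024 as user_specified_mb,
       oper_count,
       last_oper_type,
       last_oper_time
  from v$sga_dynamic_components
 where component = 'large pool';
```

If current_mb never drops below the user-specified value while oper_count keeps rising, ASMM is still resizing the pool but is respecting the minimum.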
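For the shared pool problem: the database is on base release 11.2.0.2 with no PSU applied, and neither of the two bugs has a one-off patch for 11.2.0.2 on Linux. If an immediate upgrade to 11.2.0.3 or later is not possible, the recommendations are:
- According to the description of bug 19461270, the bug involves not only interval partitioning but also deferred segment creation, a new feature in 11g. Disabling this feature is recommended.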
alter system set deferred_segment_creation=false scope=spfile sid='*';
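The other bug, 17037130, is by its description unrelated to deferred segment creation. Apply the setting above and keep monitoring; the temporary workaround is to flush the shared pool or to avoid DDL during high-load periods.
To release the memory that PRTMV already holds, flush the shared pool manually during a quiet business period.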
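A manual flush can be bracketed with a before-and-after measurement so that its effect on PRTMV is visible. The following is a sketch under the assumption that the session has the ALTER SYSTEM privilege; the component list is simply the partition-related entries seen in the output above plus free memory.

```sql
-- Before: size of the partition-related components and free memory.
select name, round(bytes / 1024 / 1024, 2) as mb
  from v$sgastat
 where pool = 'shared pool'
   and name in ('PRTMV', 'PRTDS', 'free memory')
 order by bytes desc;

-- Flush the shared pool (run only in a quiet period: cached cursors are
-- discarded and subsequent statements are hard parsed again).
alter system flush shared_pool;

-- After: repeat the measurement and compare.
select name, round(bytes / 1024 / 1024, 2) as mb
  from v$sgastat
 where pool = 'shared pool'
   and name in ('PRTMV', 'PRTDS', 'free memory')
 order by bytes desc;
```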