[ORA-4031] Diagnosing large pool "PX msg pool" errors


0. Summary

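1. Symptoms
2. Analysis
.   2.1 Check the SGA parameters
.   2.2 Check the large pool size and its automatic resizing
.   2.3 Check the parallel execution parameters
.   2.4 Detailed analysis of the alert log
.   2.5 Check the shared pool size
3. Recommendations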

1. Symptoms

#### alert log ####

Sat Feb 04 02:08:41 2017
Memory Notification: Library Cache Object loaded into SGA
Heap size 51201K exceeds notification threshold (51200K)
Details in trace file /app/oracle/diag/rdbms/noap/noap/trace/noap_j000_4031.trc
KGL object name :SZ1X.IN_JS_CDR_HW_AC_TI 
Memory Notification: Library Cache Object loaded into SGA
Heap size 331660K exceeds notification threshold (51200K)
Details in trace file /app/oracle/diag/rdbms/noap/noap/trace/noap_j000_4031.trc
KGL object name :alter table MOD_JS_CDR_HW drop partition SYS_P1404186
Sat Feb 04 02:09:20 2017
TABLE SZ1X.MOD_CDR_HW: ADDED INTERVAL PARTITION SYS_P1404388 (47883) VALUES LESS THAN (TO_DATE(' 2017-02-04 03:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN'))
Sat Feb 04 02:14:22 2017 
Thread 1 advanced to log sequence 603901 (LGWR switch)
  Current log# 1 seq# 603901 mem# 0: /app/oracle/oradata/noap/redo01.log
Sat Feb 04 02:32:34 2017
Errors in file /app/oracle/diag/rdbms/noap/noap/trace/noap_p085_5172.trc  (incident=109601):
ORA-04031: unable to allocate 2048024 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")
Incident details in: /app/oracle/diag/rdbms/noap/noap/incident/incdir_109601/noap_p085_5172_i109601.trc
Sat Feb 04 02:32:34 2017
Errors in file /app/oracle/diag/rdbms/noap/noap/trace/noap_p034_5070.trc  (incident=101439):
ORA-04031: unable to allocate 2048024 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")
Incident details in: /app/oracle/diag/rdbms/noap/noap/incident/incdir_101439/noap_p034_5070_i101439.trc
Sat Feb 04 02:32:34 2017
Errors in file /app/oracle/diag/rdbms/noap/noap/trace/noap_p092_5186.trc  (incident=109832):
ORA-04031: unable to allocate 2048024 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")
Incident details in: /app/oracle/diag/rdbms/noap/noap/incident/incdir_109832/noap_p092_5186_i109832.trc
Sat Feb 04 02:32:34 2017
Errors in file /app/oracle/diag/rdbms/noap/noap/trace/noap_p051_5104.trc  (incident=107372):
ORA-04031: unable to allocate 2048024 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")
Incident details in: /app/oracle/diag/rdbms/noap/noap/incident/incdir_107372/noap_p051_5104_i107372.trc
......

#### noap_p085_5172_i109601.trc ####

Dump continued from file: /app/oracle/diag/rdbms/noap/noap/trace/noap_p085_5172.trc
ORA-04031: unable to allocate 2048024 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")
  
========= Dump for incident 109601 (ORA 4031) ========
  
*** 2017-02-04 02:32:34.673
dbkedDefDump(): Starting incident default dumps (flags=0x2, level=3, mask=0x0)
----- Current SQL Statement for this session (sql_id=385pbhfh4g7rn) -----
 insert /*+append*/ into c_cdr_railway_huning 
  select /*+full(t) parallel(64)*/
  RELEASE_CAUSE AS o??Dêí·??-ò
  ACCESS_TIME AS ?óè?ê±?ì
......

The alert log reports ORA-04031. Judging from the error text, the direct cause is that parallel execution ran the large pool out of memory.

2. Analysis

2.1 Check the SGA parameters

SQL> show parameter sga

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
lock_sga                             boolean     FALSE
pre_page_sga                         boolean     FALSE
sga_max_size                         big integer 32G
sga_target                           big integer 32G
SQL> show parameter db_cache

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
db_cache_advice                      string      ON
db_cache_size                        big integer 22G
SQL> show parameter shared_pool

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
shared_pool_reserved_size            big integer 510027366
shared_pool_size                     big integer 8G

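The SGA of this database is currently managed automatically by ASMM (Automatic Shared Memory Management), with sga_target set to 32G.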
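To see how ASMM has actually distributed the 32G, the per-component view can be queried as well. The following is a minimal sketch, not part of the original diagnosis; it assumes a user with access to the V$ views. Under ASMM, a component size that was set explicitly (such as db_cache_size and shared_pool_size here) should appear in USER_SPECIFIED_SIZE and act as a lower bound.

```sql
-- Sizes are stored in bytes; convert to MB for readability.
select component,
       current_size / 1024 / 1024        as current_mb,
       min_size / 1024 / 1024            as min_mb,
       max_size / 1024 / 1024            as max_mb,
       user_specified_size / 1024 / 1024 as user_specified_mb,
       oper_count,
       last_oper_type,
       last_oper_time
  from v$sga_dynamic_components
 where current_size > 0
 order by current_size desc;
```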

2.2 Check the large pool size and its automatic resizing

SQL> show parameter large

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
large_pool_size                      big integer 0
use_large_pages                      string      TRUE

SQL> select t.*
  2    from (select name,
  3                 bytes / (1024 * 1024) "MB",
  4                 round(bytes / (select value
  5                                  from v$parameter t
  6                                 where t.name = 'shared_pool_size') * 100,
  7                       2) || '%' "USED%"
  8            from v$sgastat
  9           where pool = 'large pool'
 10           order by 2 desc) t
 11   where rownum < 20;

NAME                               MB USED%
-------------------------- ---------- -----------------------------------------
free memory                  119.8125 1.46%
PX msg pool                    7.8125 .1%
ASM map operations hashta        .375 0%

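The SGA is under ASMM and no minimum is set for the large pool (large_pool_size is 0). Usage looks normal at the moment. Because sizing is automatic, however, resizing of other components can squeeze the memory available to the large pool. (Note that the USED% column above is computed against shared_pool_size, so it shows each item as a fraction of the 8G shared pool, not of the large pool itself.)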
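Because the query above divides by shared_pool_size, a variant that reports each item as a share of the large pool itself may be easier to read. This is a sketch using the RATIO_TO_REPORT analytic function; the column aliases are illustrative.

```sql
-- Each large pool item as a percentage of the large pool total.
select name,
       round(bytes / 1024 / 1024, 2)                  as mb,
       round(100 * ratio_to_report(bytes) over (), 2) as pct_of_large_pool
  from v$sgastat
 where pool = 'large pool'
 order by bytes desc;
```

With the figures shown above, the three items add up to 128 MB, of which free memory is roughly 94%.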

SQL> select start_time,
  2         component,
  3         oper_type,
  4         oper_mode,
  5         initial_size / 1024 / 1024 "INITIAL",
  6         final_size / 1024 / 1024 "FINAL",
  7         end_time
  8    from v$sga_resize_ops
  9   where component in ('large pool')
 10   order by start_time, component;

START_TIME          COMPONENT                 OPER_TYPE     OPER_MODE              INITIAL                FINAL END_TIME
------------------- ------------------------- ------------- --------- -------------------- -------------------- -------------------
30/01/2017 03:02:01 large pool                GROW          IMMEDIATE                  192                  256 30/01/2017 03:02:03
30/01/2017 03:02:01 large pool                GROW          IMMEDIATE                  192                  256 30/01/2017 03:02:03
30/01/2017 03:02:01 large pool                GROW          IMMEDIATE                  192                  256 30/01/2017 03:02:02
30/01/2017 03:02:01 large pool                GROW          IMMEDIATE                  192                  256 30/01/2017 03:02:02
30/01/2017 03:02:01 large pool                GROW          IMMEDIATE                  192                  256 30/01/2017 03:02:02
30/01/2017 03:02:01 large pool                GROW          IMMEDIATE                  192                  256 30/01/2017 03:02:02
30/01/2017 03:02:01 large pool                GROW          IMMEDIATE                  192                  256 30/01/2017 03:02:02
30/01/2017 03:02:01 large pool                GROW          IMMEDIATE                  192                  256 30/01/2017 03:02:02
30/01/2017 03:02:01 large pool                GROW          IMMEDIATE                  192                  256 30/01/2017 03:02:02
......
04/02/2017 02:32:32 large pool                GROW          IMMEDIATE                  320                  384 04/02/2017 02:32:33
04/02/2017 02:32:32 large pool                GROW          IMMEDIATE                  320                  384 04/02/2017 02:32:33
04/02/2017 02:32:32 large pool                GROW          IMMEDIATE                  320                  384 04/02/2017 02:32:33
04/02/2017 02:32:32 large pool                GROW          IMMEDIATE                  320                  384 04/02/2017 02:32:33
04/02/2017 02:35:47 large pool                SHRINK        DEFERRED                   384                  128 04/02/2017 02:35:47
......
04/02/2017 03:01:56 large pool                GROW          IMMEDIATE                  320                  384 04/02/2017 03:01:57
04/02/2017 03:01:56 large pool                GROW          IMMEDIATE                  320                  384 04/02/2017 03:01:57
04/02/2017 03:01:56 large pool                GROW          IMMEDIATE                  320                  384 04/02/2017 03:01:57
04/02/2017 03:01:56 large pool                GROW          IMMEDIATE                  320                  384 04/02/2017 03:01:57
04/02/2017 03:01:56 large pool                GROW          IMMEDIATE                  320                  384 04/02/2017 03:01:57
04/02/2017 03:01:56 large pool                GROW          IMMEDIATE                  320                  384 04/02/2017 03:01:57
04/02/2017 03:04:24 large pool                SHRINK        DEFERRED                   384                  128 04/02/2017 03:04:24

The output shows that the large pool grows and shrinks fairly frequently.
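The raw listing contains many repeated rows, so a summary per hour can make the grow/shrink pattern easier to judge. A possible sketch, assuming the same V$SGA_RESIZE_OPS view used above (the aliases are illustrative):

```sql
-- Count large pool resize operations per hour, by type and mode.
select trunc(start_time, 'HH24')       as hour_start,
       oper_type,
       oper_mode,
       count(*)                        as ops,
       min(initial_size) / 1024 / 1024 as min_initial_mb,
       max(final_size) / 1024 / 1024   as max_final_mb
  from v$sga_resize_ops
 where component = 'large pool'
 group by trunc(start_time, 'HH24'), oper_type, oper_mode
 order by hour_start, oper_type;
```

Hours in which GROW and SHRINK both appear are the ones worth comparing against the ORA-04031 timestamps in the alert log.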

2.3 Check the parallel execution parameters

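The trace file for the error shows that the failing SQL uses a fairly high degree of parallelism (64). Check the parallel-related parameters: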

SQL> show parameter cpu_count

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
cpu_count                            integer     16
SQL> show parameter parallel_max

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
parallel_max_servers                 integer     640

A DOP of 64 is high for a host whose cpu_count is only 16; consider lowering the degree of parallelism.
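Beyond cpu_count and parallel_max_servers, a few more settings and counters help gauge how much PX message buffer memory parallel jobs demand. The sketch below is an addition to the original checks. It assumes the statistic names in V$PX_PROCESS_SYSSTAT start with "Servers" and "Buffers" as on 11.2, and uses LIKE so that any trailing padding in the names does not matter.

```sql
-- Parallel execution settings that influence PX message buffer demand.
select name, value
  from v$parameter
 where name in ('cpu_count',
                'parallel_max_servers',
                'parallel_threads_per_cpu',
                'parallel_execution_message_size',
                'parallel_degree_policy');

-- Instance-wide PX server and message buffer counters since startup.
select statistic, value
  from v$px_process_sysstat
 where statistic like 'Servers%'
    or statistic like 'Buffers%';
```

A high server or buffer high-water mark relative to the current large pool size would support the view that the DOP is too aggressive for this host.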

2.4 Detailed analysis of the alert log

Memory Notification: Library Cache Object loaded into SGA
Heap size 51201K exceeds notification threshold (51200K)

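This message means that the memory required by an object loaded into the library cache exceeded a threshold, which is controlled by the hidden parameter _kgl_large_heap_warning_threshold. The notification was introduced in 10gR2. On its own it does not indicate a problem; what matters is whether ORA-04031 errors follow.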

Reference:

Memory Notification: Library Cache Object loaded into SGA / ORA-600 [KGL-heap-size-exceeded] (Doc ID 330239.1)
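The current value of the hidden threshold can be read from the X$ parameter tables. This is a commonly used sketch rather than something taken from the original analysis. It has to be run as SYS, and the value is assumed to be in bytes, in which case the 51200K shown in the alert log would correspond to 52428800.

```sql
-- Run as SYS: X$ tables are not visible to ordinary users.
select i.ksppinm  as name,
       v.ksppstvl as value,
       i.ksppdesc as description
  from x$ksppi  i,
       x$ksppcv v
 where i.indx = v.indx
   and i.ksppinm = '_kgl_large_heap_warning_threshold';
```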

#### noap_j000_4031.trc ####

Memory Notification: Library Cache Object loaded into SGA
Heap size 73935K exceeds notification threshold (51200K)
            
LibraryHandle:  Address=0x855ade650 Hash=70548654 LockMode=N PinMode=0 LoadLockMode=0 Status=VALD
  ObjectName:  Name=alter table MOD_CDR_HW drop partition SYS_P1399962 
    FullHashValue=3aa1433897dd4d6fc458246c70548654 Namespace=SQL AREA(00) Type=CURSOR(00) Identifier=1884587604 OwnerIdn=83
  Statistics:  InvalidationCount=0 ExecutionCount=0 LoadCount=2 ActiveLocks=1 TotalLockCount=1 TotalPinCount=1
  Counters:  BrokenCount=1 RevocablePointer=1 KeepDependency=1 BucketInUse=0 HandleInUse=0 HandleReferenceCount=0
  Concurrency:  DependencyMutex=0x855ade700(0, 1, 0, 0) Mutex=0x855ade780(1011, 21, 0, 6)
  Flags=RON/PIN/TIM/PN0/DBN/[10012841] 
  WaitersLists:  
    Lock=0x855ade6e0[0x855ade6e0,0x855ade6e0] 
    Pin=0x855ade6c0[0x855ade6c0,0x855ade6c0] 
  Timestamp:  Current=02-04-2017 02:00:34 
  HandleReference:  Address=0x855ade820 Handle=(nil) Flags=[00]

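The trace file behind this notification records the statement involved, which is the statement the alert log prints after the message: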

KGL object name :alter table MOD_JS_CDR_HW drop partition SYS_P1404186

Next, look at the large pool errors:

Sat Feb 04 02:32:34 2017
Errors in file /app/oracle/diag/rdbms/noap/noap/trace/noap_p085_5172.trc  (incident=109601):
ORA-04031: unable to allocate 2048024 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")
Incident details in: /app/oracle/diag/rdbms/noap/noap/incident/incdir_109601/noap_p085_5172_i109601.trc
Sat Feb 04 02:32:34 2017
Errors in file /app/oracle/diag/rdbms/noap/noap/trace/noap_p034_5070.trc  (incident=101439):
ORA-04031: unable to allocate 2048024 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")
Incident details in: /app/oracle/diag/rdbms/noap/noap/incident/incdir_101439/noap_p034_5070_i101439.trc
Sat Feb 04 02:32:34 2017
Errors in file /app/oracle/diag/rdbms/noap/noap/trace/noap_p092_5186.trc  (incident=109832):
ORA-04031: unable to allocate 2048024 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")
Incident details in: /app/oracle/diag/rdbms/noap/noap/incident/incdir_109832/noap_p092_5186_i109832.trc
Sat Feb 04 02:32:34 2017
Errors in file /app/oracle/diag/rdbms/noap/noap/trace/noap_p051_5104.trc  (incident=107372):
ORA-04031: unable to allocate 2048024 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")
Incident details in: /app/oracle/diag/rdbms/noap/noap/incident/incdir_107372/noap_p051_5104_i107372.trc

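According to the earlier v$sga_resize_ops query, these large pool errors fall in the middle of a resize cycle on the large pool: the ORA-04031 errors at 02:32:34 come right after an immediate GROW to 384M and shortly before a deferred SHRINK back to 128M.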

04/02/2017 02:32:32 large pool                GROW          IMMEDIATE                  320                  384 04/02/2017 02:32:33
04/02/2017 02:35:47 large pool                SHRINK        DEFERRED                   384                  128 04/02/2017 02:35:47

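Reference:

Multiple ORA-4031 Errors Of Reducing Sizes For "PX msg pool" In The Large Pool (Doc ID 1515877.1)

This is related to Bug 13072654 - ORA-4031 CANT ALLOC 14MB IN LARGE POOL, PX MSG POOL. A one-off patch for this bug exists on 11.2.0.2 and can be applied. Alternatively, set a minimum size for the large pool, or switch the SGA to manual memory management.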

2.5 Check the shared pool size

Because the alert log also contains library cache object (LCO) notifications, check the current shared pool usage:

SQL> select t.*
  2    from (select name,
  3                 bytes / (1024 * 1024) "MB",
  4                 round(bytes / (select value
  5                                  from v$parameter t
  6                                 where t.name = 'shared_pool_size') * 100,
  7                       2) || '%' "USED%"
  8            from v$sgastat
  9           where pool = 'shared pool'
 10           order by 2 desc) t
 11   where rownum < 20;

NAME                                         MB USED%
-------------------------- -------------------- -----------------------------------------
free memory                2971.667167663574219 36.28%
PRTMV                      2858.977592468261719 34.9%
SQLA                       1862.341529846191406 22.73%
PRTDS                      614.0089035034179688 7.5%
KQR M PO                   315.4219131469726563 3.85%
KGLH0                      199.7758560180664063 2.44%
dbktb: trace buffer                    81.90625 1%
FileOpenBlock                60.796417236328125 .74%
ASM extent pointer array   52.86400604248046875 .65%
db_block_hash_buckets               44.50390625 .54%
dbwriter coalesce buffer               32.03125 .39%
ASH buffers                                  32 .39%
KGLHD                      29.06992340087890625 .35%
kglsim object batch        19.32781219482421875 .24%
private strands                   17.5341796875 .21%
Checkpoint queue                     15.6328125 .19%
event statistics per sess           15.33984375 .19%
write state object          14.6377716064453125 .18%
ksunfy : SSO free list           14.32470703125 .17%

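The PRTMV component stands out here: it is unfamiliar and occupies about 2.8G. For comparison, the same query on another database returns: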

NAME                               MB USED%
-------------------------- ---------- -----------------------------------------
free memory                6021.38947 58.8%
SQLA                       1472.31024 14.38%
KGLH0                      1264.46631 12.35%
PRTMV                       219.83268 2.15%
KGLHD                      199.816628 1.95%
db_block_hash_buckets      178.003906 1.74%
dbktb: trace buffer        102.390625 1%
ASH buffers                        96 .94%
dbwriter coalesce buffer    80.078125 .78%
FileOpenBlock              71.1162643 .69%
KGLDA                       65.826004 .64%
Checkpoint queue           46.8984375 .46%
KKSSP                      40.4567947 .4%
private strands            25.9765625 .25%
dirty object counts array          24 .23%
event statistics per sess  22.8779297 .22%
ksunfy : SSO free list     21.7646484 .21%
parameter table block      19.9453812 .19%
KGLS                       19.5513763 .19%

The comparison suggests that this value may be abnormal. A search of MOS confirms that related bugs exist:

Bug 19461270 - high PRTMV allocations in shared pool executing concurrent DML and DDLs on interval partitioned tables (Doc ID 19461270.8)

Description

Concurrent DDLs and DMLs happening on interval partitioned table that was created with deferred segment creation clause may do high PRTMV allocations.

Workaround

Do not run DDLs concurrently.

This bug can be triggered when interval partitioning is in use, which matches the current symptoms fairly well.
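Whether the bug's preconditions apply can be checked from the data dictionary. The sketch below is an addition, not part of the original analysis. It assumes the SZ1X schema seen in the alert log is the one of interest, and that the SEGMENT_CREATED column of DBA_TAB_PARTITIONS is available on this release.

```sql
-- Interval-partitioned tables in the schema seen in the alert log.
select t.owner, t.table_name, t.partitioning_type, t.interval
  from dba_part_tables t
 where t.owner = 'SZ1X'
   and t.interval is not null;

-- Partitions of those tables whose segments have not been created yet.
select p.table_owner, p.table_name, count(*) as partitions_without_segment
  from dba_tab_partitions p
  join dba_part_tables t
    on t.owner = p.table_owner
   and t.table_name = p.table_name
 where t.owner = 'SZ1X'
   and t.interval is not null
   and p.segment_created = 'NO'
 group by p.table_owner, p.table_name;
```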

Bug 17037130 - Excess shared pool "PRTMV" memory use / ORA-4031 with partitioned tables (Doc ID 17037130.8)

Description

This bug is only relevant when using Partitioned Tables
SQL on a partitioned table may cause excess shared pool usage and
ultimately fail with ORA-4031.

Rediscovery Notes:
ORA-4031 with child cursor(s) having dependency table entries
referencing obsolete (OBS) multi-versioned objects.

Workaround
Flushing the shared_pool and avoiding DDLs during high load time
can help to avoid this issue.
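To tell whether PRTMV keeps growing or spikes around the partition maintenance jobs, its size can be tracked across AWR snapshots. This is a sketch only; it assumes AWR is in use and the Diagnostics Pack is licensed, since it reads the DBA_HIST views.

```sql
-- PRTMV size in the shared pool at each AWR snapshot.
select s.begin_interval_time,
       round(h.bytes / 1024 / 1024) as prtmv_mb
  from dba_hist_sgastat  h
  join dba_hist_snapshot s
    on s.snap_id = h.snap_id
   and s.dbid = h.dbid
   and s.instance_number = h.instance_number
 where h.pool = 'shared pool'
   and h.name = 'PRTMV'
 order by s.begin_interval_time;
```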

3. Recommendations

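Based on the analysis above, the ORA-04031 errors in the large pool are very likely related to the large pool being shrunk. The shared pool also has a problem of its own.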

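For the large pool bug: this database is on 11.2.0.2 with no PSU applied, and a one-off patch for the bug exists on 11.2.0.2. If the patch is not applied, the following workarounds can be considered: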

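  1. Set a minimum size for the large pool to avoid frequent shrinking. The database is under ASMM, and db_cache (22G) and shared_pool (8G) already have minimum values set; a minimum of 200M is suggested for large_pool.
alter system set large_pool_size=200M scope=spfile sid='*';

If the errors frequently affect parallel jobs, apply the one-off patch or switch to manual memory management.

  2. The DOP of 64 used by the parallel job is high for a host whose cpu_count is only 16; consider lowering the degree of parallelism.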
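With scope=spfile the new minimum only takes effect after the next restart. Since large_pool_size can also be changed online, one option is scope=both, followed by a check that no later shrink goes below the floor. The following is a sketch under that assumption; the 200M figure is the value suggested above, and the change should be tested outside production first.

```sql
-- Assumption: applied online with scope=both instead of scope=spfile.
alter system set large_pool_size = 200M scope = both sid = '*';

-- Any SHRINK that ends below 200M after the change deserves a closer look.
select start_time,
       oper_type,
       oper_mode,
       initial_size / 1024 / 1024 as initial_mb,
       final_size / 1024 / 1024   as final_mb
  from v$sga_resize_ops
 where component = 'large pool'
   and oper_type = 'SHRINK'
   and final_size < 200 * 1024 * 1024
 order by start_time;
```

Rows dated before the change are expected; only rows after it would suggest the floor is not being honoured.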

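For the shared pool problem: the database is on base 11.2.0.2 with no PSU, and neither of the two bugs has a one-off patch for 11.2.0.2 on Linux. If an immediate upgrade to 11.2.0.3 or later is not possible, the suggestions are: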

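  1. According to its description, bug 19461270 involves not only interval partitioning but also the 11g new feature deferred segment creation. Disabling that feature is suggested.
alter system set deferred_segment_creation=false scope=spfile sid='*';
  2. The other bug, 17037130, is by its description unrelated to deferred segment creation. Keep monitoring after applying step 1. The temporary workaround is to flush the shared pool or to avoid DDL during high-load periods.

  3. To release the memory that PRTMV already holds, flush the shared pool manually during a quiet business period.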
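The effect of these steps can be checked with a before/after comparison. This sketch is not from the original post. Flushing the shared pool forces hard parses afterwards, so it belongs in a quiet window, and disabling deferred_segment_creation is assumed to affect only segments created after the change.

```sql
-- Confirm the deferred segment creation setting currently in effect.
select name, value, isdefault
  from v$parameter
 where name = 'deferred_segment_creation';

-- PRTMV and free memory before the flush.
select name, round(bytes / 1024 / 1024) as mb
  from v$sgastat
 where pool = 'shared pool'
   and name in ('PRTMV', 'free memory');

alter system flush shared_pool;

-- Same query after the flush, to see how much PRTMV was released.
select name, round(bytes / 1024 / 1024) as mb
  from v$sgastat
 where pool = 'shared pool'
   and name in ('PRTMV', 'free memory');
```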
