Notes on an OGG Migration

1. Background

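The OGG Extract process runs on one server of a four-node Oracle RAC. The database's data files are managed by ASM, so the Extract has to read ASM files. The RAC is now being migrated as a whole, from the current four-node cluster to a three-node cluster in another data center. As an aside, OGG is actually very stable: for one-way replication of individual tables it had been running for more than two years without a single problem, right up until the colleagues responsible for implementation and operations had all left and this migration came along.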

2. Migration

After the RAC migration was complete, we packaged the OGG Extract-side binaries and configuration and copied them over. First we started the OGG Manager process, which came up normally. Then we started the Extract ext1, which failed and went to ABENDING. The log file showed the following:

2016-01-15 19:19:48  ERROR   OGG-00446  Oracle GoldenGate Capture for Oracle, ext1.prm:  , error retrieving redo file name for sequence 300499, archived = 0, use_alternate = 0 [the same clause is repeated many more times in the log entry] Not able to establish initial position for sequence 300499, rba 5170704.
2016-01-15 19:19:48  ERROR   OGG-01668  Oracle GoldenGate Capture for Oracle, ext1.prm:  PROCESS ABENDING.

(It had been a long time since I last looked at OGG. This was going to be trouble.)

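Let's review the OGG configuration and environment. The only change is that the RAC has one node fewer. OGG could not find the fourth instance because the new environment has no fourth node; in other words, one group of redo log files was missing from the Extract's source, hence the error. I googled for a long time and found no documentation covering the removal of a node. In the end I found a note on MetaLink that describes how to correct this error.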

How to Configure GoldenGate Extract When Adding or Removing Redo Log Threads in an Oracle RAC ?, OGG-00446 (Doc ID 1267901.1)

Disabling redo log threads

The purpose is to remove an existing RAC thread from goldengate extract so that extract will not capture from that thread.

1) Edit the extract parameter file to either remove these parameters or specify what is required depending on which threads you wish to enable. See THREADOPTIONS PROCESSTHREADS description below.

2) Disable the redo log threads.

3) The extract will abend because these threads are not available. Simply restart the extract as you now have added the THREADOPTIONS PROCESSTHREADS in the extract parameter file.

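I skimmed it and thought: simple, just a configuration change. So I went to dinner, planning to carry on when I got back. I had no idea this was where the suffering would begin.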

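Without further ado: when I got back I followed the steps in the note. I disabled thread 4 in the RAC and dropped thread 4's redo log files. Then I set THREADOPTIONS PROCESSTHREADS EXCEPT 4 in the Extract parameter file and restarted the Extract. While it was starting I thought, surely it will be fine now. But the Extract died again immediately. What had gone wrong this time? I read further down the note: disaster. OGG's thread numbers do not match the RAC's thread numbers. The ordering returned by 'select distinct thread# from v$log;' in the database is not the same as the ordering inside OGG.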

RAC THREAD#     OGG thread
------------    ----------
1          -    1
2          -    2
4          -    3
3          -    4
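The pitfall is easy to express in a few lines of Python. This is only a sketch: it takes the pairing in the table above as given, and the helper names are hypothetical, not part of OGG. The point is that the number written after EXCEPT should be the OGG thread, not the RAC THREAD#.

```python
# Pairing copied from the table above: RAC THREAD# -> OGG thread.
RAC_TO_OGG = {1: 1, 2: 2, 4: 3, 3: 4}


def ogg_thread_for(rac_thread):
    """Look up the OGG thread number for a RAC THREAD# (hypothetical helper)."""
    return RAC_TO_OGG[rac_thread]


def except_clause(removed_rac_thread):
    """Build the parameter-file line that excludes the removed thread."""
    return "THREADOPTIONS PROCESSTHREADS EXCEPT %d" % ogg_thread_for(removed_rac_thread)


# RAC thread 4 was removed, so the parameter must name OGG thread 3.
print(except_clause(4))
```

With this mapping, removing RAC thread 4 yields EXCEPT 3 rather than EXCEPT 4.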

So I had disabled the wrong instance. From OGG's point of view, it was the redo log files of instance 3 that had become inaccessible. I hurried to fix it: I set THREADOPTIONS PROCESSTHREADS EXCEPT 3 in the Extract parameter file and restarted the Extract. It still failed. I wondered whether ext1 was failing because of my earlier wrong change, so I switched to another Extract, kh_ext, and got the same error.

2016-01-15 20:33:04  INFO    OGG-01643  Oracle GoldenGate Capture for Oracle, kh_ext.prm:  BOUNDED RECOVERY: CANCELED: for object pool 3: p14422_Redo Thread 4.
2016-01-15 20:33:04  INFO    OGG-01579  Oracle GoldenGate Capture for Oracle, kh_ext.prm:  BOUNDED RECOVERY: VALID BCP: CP.KH_EXT.000001577.
2016-01-15 20:33:04  INFO    OGG-01629  Oracle GoldenGate Capture for Oracle, kh_ext.prm:  BOUNDED RECOVERY: PERSISTED OBJECTS RECOVERED: <>.
2016-01-15 20:33:05  INFO    OGG-00546  Oracle GoldenGate Capture for Oracle, kh_ext.prm:  Default thread stack size: 10485760.
2016-01-15 20:33:05  ERROR   OGG-00446  Oracle GoldenGate Capture for Oracle, kh_ext.prm:  The number of Oracle redo threads (3) is not the same as the number of checkpoint threads (4). EXTRACT groups on RAC systems should be created with the THREADS parameter (e.g., ADD EXT, TRANLOG, THREADS 3, BEGIN...).
2016-01-15 20:33:05  ERROR   OGG-01668  Oracle GoldenGate Capture for Oracle, kh_ext.prm:  PROCESS ABENDING.

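The error message tells me the database now has 3 redo threads while the Extract checkpoint still has 4. Well, how could they possibly match?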

What now? A simple EXCEPT was not going to cut it.

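Rebuild everything, then? With that many tables and that much data, could it all really be loaded into the other database again? Hmm, I could try deleting and recreating just the Extract. For this I found another document on how to recreate an Extract.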

First, run info ext ext1,showch and save the Extract's key checkpoint information.

GGSCI (webrac2) 27> info ext ext1,showch

EXTRACT    EXT1      Last Started 2015-12-24 10:02   Status ABENDED
Checkpoint Lag       00:00:00 (updated 03:24:58 ago)
Log Read Checkpoint  Oracle Redo Logs
                     2016-01-15 16:06:23  Thread 1, Seqno 378433, RBA 110834248
                     SCN 1508.2302597186 (6479113279554)
Log Read Checkpoint  Oracle Redo Logs
                     2016-01-15 16:06:26  Thread 2, Seqno 319453, RBA 1128976
                     SCN 1508.2302597714 (6479113280082)
Log Read Checkpoint  Oracle Redo Logs
                     2016-01-15 16:06:30  Thread 4, Seqno 300524, RBA 704000
                     SCN 1508.2302597781 (6479113280149)
Log Read Checkpoint  Oracle Redo Logs
                     2016-01-15 16:06:30  Thread 3, Seqno 338237, RBA 813568
                     SCN 1508.2302597925 (6479113280293)
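To keep these values in a form that is easier to reuse, the checkpoint lines can be pulled apart with a small Python sketch. The function names are hypothetical, and the reading of the SCN as wrap.base (decimal = wrap * 2^32 + base) is an assumption, although the four values above are consistent with it.

```python
import re

# The four checkpoint entries, copied from the showch output above.
SHOWCH = """
2016-01-15 16:06:23  Thread 1, Seqno 378433, RBA 110834248
SCN 1508.2302597186 (6479113279554)
2016-01-15 16:06:26  Thread 2, Seqno 319453, RBA 1128976
SCN 1508.2302597714 (6479113280082)
2016-01-15 16:06:30  Thread 4, Seqno 300524, RBA 704000
SCN 1508.2302597781 (6479113280149)
2016-01-15 16:06:30  Thread 3, Seqno 338237, RBA 813568
SCN 1508.2302597925 (6479113280293)
"""

PATTERN = re.compile(
    r"Thread (\d+), Seqno (\d+), RBA (\d+)\s+SCN (\d+)\.(\d+) \((\d+)\)"
)


def scn_to_decimal(wrap, base):
    """Assumed layout: the part before the dot is the wrap, the part after is the base."""
    return wrap * 2**32 + base


def parse_checkpoints(text):
    """Return one dict per thread, in the order the threads appear in the output."""
    rows = []
    for m in PATTERN.finditer(text):
        thread, seqno, rba, wrap, base, scn = (int(g) for g in m.groups())
        rows.append({"thread": thread, "seqno": seqno, "rba": rba,
                     "wrap": wrap, "base": base, "scn": scn})
    return rows


checkpoints = parse_checkpoints(SHOWCH)
for cp in checkpoints:
    # Compare the computed decimal SCN with the one printed in parentheses.
    print(cp["thread"], cp["seqno"], cp["rba"],
          scn_to_decimal(cp["wrap"], cp["base"]) == cp["scn"])
```

Note that the threads appear in the order 1, 2, 4, 3, the same order as in the RAC/OGG thread table earlier.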

Then delete and recreate it.

delete ext ext1

add ext ext1 ,begin now,tranlog,threads 3

alter extract ext1,tranlog,extseqno 378433,extrba 110834248,thread 1

alter extract ext1,tranlog,extseqno 319453,extrba 1128976,thread 2

alter extract ext1,tranlog,extseqno 338237,extrba 813568,thread 4

alter extract ext1,tranlog,extseqno 300524, extrba 704000, thread 3

alter extract ext1,ioextseqno 378433,ioextrba 110826000,thread 1

alter extract ext1,ioextseqno 319453,ioextrba 1047056,thread 2

alter extract ext1,ioextseqno 338237,ioextrba 812560,thread 4

alter extract ext1,ioextseqno 300499, ioextrba 5170704,thread 3

add exttrail /u01/app/ogg/ggstrail/et ,seqno 36154,rba 56591320,extract ext1

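There is a catch here: should thread 4 be created or not? Without it the data becomes inconsistent and some transactions will not line up. With it, where am I supposed to find an instance 4?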

I tried both ways and neither worked. At one point I had to kill the OGG operating-system processes by hand and restart OGG before I could continue.

By this time it was late at night.

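We stopped and thought it over. Redo everything? With that much data? We dug out the document from the original OGG setup and leafed through it. On the OGG target side, the Replicat had been started with start kh_rep, aftercsn xyz. In other words, it replicates only what comes after a given SCN. So couldn't we simply skip the SCN range covered by the migration? We could be certain that there had been no business activity during that window.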

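So: recreate the Extract on the source side, configured to begin extracting from now, with the thread count set to 3. This amounts to a rebuild in disguise, yet the data on the target does not have to be deleted and reloaded.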

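Right: find the current SCN of the new three-node database and recreate the Extract. This time it succeeded quickly. Fortunately, the RAC migration had been carried out with a Data Guard (DG) role switch.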
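A minimal sketch of what this recreation could look like, reusing the command forms shown earlier. The helper name and the idea of generating the GGSCI lines from Python are assumptions for illustration, not the exact commands that were run; the SQL string is simply the usual way to read the current SCN from v$database.

```python
from typing import List

# Query to run in the new three-node database to note the current SCN.
CURRENT_SCN_SQL = "select current_scn from v$database;"


def recreate_extract_commands(group: str, threads: int, trail: str) -> List[str]:
    """Build GGSCI lines that drop an Extract and re-add it starting from now.

    Hypothetical helper: group, threads and trail are supplied by the caller.
    """
    return [
        "delete ext %s" % group,
        "add ext %s, tranlog, threads %d, begin now" % (group, threads),
        "add exttrail %s, extract %s" % (trail, group),
    ]


print(CURRENT_SCN_SQL)
for line in recreate_extract_commands("ext1", 3, "/u01/app/ogg/ggstrail/et"):
    print(line)
```

Because the Extract is added with begin now, no per-thread extseqno or extrba values from the old four-thread checkpoint are needed.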

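Finally, we modified a row in one of the replicated tables on the source database and saw the change arrive on the target shortly afterwards, confirming that OGG replication was working normally.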

3. Summary

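For one-way replication of individual tables, OGG is quite stable in normal use. For a database that is being migrated to a different cluster like this one, you can recreate the Extract before starting it, without changing any other configuration.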
