APPLIES TO:
Oracle GoldenGate - Version 6.0.0 and laterInformation in this document applies to any platform.
Issue:
Extract ERROR 180 encountered commit SCN that is not greater than the highest SCN already processed
Solution Overview:
This error occurs shortly after a transaction is written to the idle node but does not yet appear in the redo log when the current transaction is processed. This transaction will have a higher SCN than the previous transaction
Solution Details:
In an Oracle RAC environment, Extract has a coordinator thread. This thread assembles the operations that are generated on each of the RAC nodes into SCN order. Before starting to assemble the transactions, Extract waits for a short time after Oracle writes the transaction to the redo log. If a node has been idle with no transactions, Extract waits the number of milliseconds specified in the Extract parameter THREADOPTIONS MAXCOMMITPROPAGATIONDELAY (default 3 seconds). Waiting ensures that there are no new transactions on the idle node before writing the current transactions to the trail.
Possible Reasons for error:
-- One of the threads is slower than the other.
-- The redo logs are not flushed on time due to latency on Log writes.
-- Network issues between Extract and one of the RAC nodes, if Extract is running on a system separate from RAC nodes.
-- Long log write times due to a standby configuration, if any.
-- Log file I/Os are taking unusually long times to complete.
-- Time imperfections between the cluster nodes. All nodes in the RAC cluster must have synchronized system clocks. If Extract is running on a system other than any of the RAC nodes, that system clock must be in sync, too, because we compare the local system's time to the commit timestamp to make critical decisions. For information about synchronizing system clocks, consult www.ntp.org or your systems administrator
-- The DB timezone setting is different than the OS timezone setting
Case example:
The following shows a time sequence and explains what happens at each step in time:
TIME WHAT HAPPENED
------- ---------
2008-03-31 14:08:21.964 Starting from this timestamp, thread 2 is reporting that it is at the end of the file (EOF).
...... Thread 2 keeps reporting EOF.
2008-03-31 14:25:33.021 Thread 1 reports it is at EOF
2008-03-31 14:25:33.110 Thread 2 reports it is at EOF
2008-03-31 14:25:34.040 Thread 1 gets the commit record, with timestamp of 2008-03-31 14:10:52. It commits the transaction with an SCN of 1.2943495430.
no timestamp... The coordinator thread processes the committed transaction from thread 1, because thread 2 kept reporting EOF for the last 17 minutes.
2008-03-31 14:25:34.188 Thread 2 got an update record and a commit record with timestamp of 2008-03-31 14:10:52. It commits transaction with an SCN of 1.2943495406
2008-03-31 14:25:34.xxx The main thread processes the committed transaction from thread 2, and detects an out of order SCN. Extract abends with this error: 2008-03-31 14:25:34 GGS ERROR 180 encountered commit SCN 1.2943495406 (0001.af722cee) that is not greater than the highest SCN already processed 1.2943495430 (0001.af722d06) Redo Thread 2 (2) xid 29.27.131432 (0x001d.01b.00020168), starting seq.rba 3725.98984464, scn 1.2943495403 (0001.af722ceb), commit seq.rba 3725.99011828 commit timestamp 2008-03-31 14:10:52.000000.
In this case, the out of order SCN was caused by the huge gap in the time between when the record was generated to the log buffer and when it was captured by Extract.
Troubleshooting steps:
-- Check to see if the clocks on different nodes in the same RAC are in sync
-- Check to see if there are any unusual time gaps between log data on threads during the SCN error.
-- Check how the 'remote' log files, if any, are shared with the system on which Extract is running (e.g., NFS, SAN, shared SCSI, etc.) and see if there are any noticeable problems there.
-- Check if there are messages in the Oracle alert logs, lgwr tracefiles or net logs during the problematic time, to see if they provide any clues to point to one of the 'Possible Reasons' discussed above.
-- Check to see if any analysis been done on the I/O performance on the drives that contain the log files and whether they are "real Disk";, NFS, SAN, etc.
-- Check to see if the DB timezone is setup correctly. Issue 'SELECT dbtimezone FROM DUAL', the timezone should match the OS. For example is the result is "+00:00", that means DB is using GMT/UTC timezone.
Recommendation:-
The MAXCOMMITPROPAGATIONDELAY parameter can be used to set the delay time well above the max_commit_propogation_delay setting in the database, plus the default extra padding that Extract adds (2000 milliseconds).
In Oracle RAC, the max_commit_propogation_delay specifies the maximum amount of time allowed before the system change number (SCN) held in the SGA of an instance is refreshed by the log writer process (LGWR). Units are in hundredths of seconds.
To check Oracle's value:
Connect as a user with dba privileges and issue:
SQL> show parameter max_commit NAME TYPE VALUE
------------------------------------ ----------- ------------------------------ max_commit_propagation_delay integer 0
To set MAXCOMMITPROPAGATIONDELAY : The value of MAXCOMMITPROPAGATIONDELAY must be greater than zero and less than 90000 milliseconds.
This is how the line should look in the Extract parameter file:
THREADOPTIONS MAXCOMMITPROPAGATIONDELAY 2700
Starting from GGS Version 9.x and above, an additional parameter 'IOLATENCY' can be used if Extract abends with 'encountered SCN XXXXX' too often. IOLATENCY adjusts the delta between the database-configured max commit propogation delay and the internal value that Extract uses. By default IOLATENCY is set to 1.5 seconds
Final Fix of this issue in V11.1.1+THREADOPTIONS MAXCOMMITPROPAGATIONDELAY 2700
Starting from GGS Version 9.x and above, an additional parameter 'IOLATENCY' can be used if Extract abends with 'encountered SCN XXXXX' too often. IOLATENCY adjusts the delta between the database-configured max commit propogation delay and the internal value that Extract uses. By default IOLATENCY is set to 1.5 seconds
Note: Valid values for IOLATENCY are between 0 and 180000 milliseconds (3 minutes).
The combined parameters should look like this:
THREADOPTIONS MAXCOMMITPROPAGATIONDELAY IOLATENCY
The combination of MAXCOMMITPROPAGATIONDELAY and IOLATENCY can be used to ensure that:
1) the Oracle threads have written their most recent SCN data to the logs
2) the I/O processes have had time to complete, considering the various factors that increase I/O latency, such as hardware contention, file locking, long seek and queue times, etc.
The key is in adjusting IOLATENCY and MAXCOMMITPROPAGATIONDELAY parameters to an optimum value based on the frequency of hitting the error.
Hint: If the problem happens too often, you can start with high values for IOLATENCY and MAXCOMMITPROPAGATIONDELAY parameters. Once the error stops happening, you can gradually decrease the values to see the SCN number where this error starts appearing again. This would give you an idea of the boundary values specific to that environment.
Furthermore, the reason Extract successfully restarts after this error is that, on restart, it re-reads the operations, and this time they are all on disk and can be processed in correct SCN order. The side effect is that Extract rewrites operations into the trail. This may cause the data pump or Replicat to abend with incompatible record errors.
The combined parameters should look like this:
THREADOPTIONS MAXCOMMITPROPAGATIONDELAY IOLATENCY
The combination of MAXCOMMITPROPAGATIONDELAY and IOLATENCY can be used to ensure that:
1) the Oracle threads have written their most recent SCN data to the logs
2) the I/O processes have had time to complete, considering the various factors that increase I/O latency, such as hardware contention, file locking, long seek and queue times, etc.
The key is in adjusting IOLATENCY and MAXCOMMITPROPAGATIONDELAY parameters to an optimum value based on the frequency of hitting the error.
Hint: If the problem happens too often, you can start with high values for IOLATENCY and MAXCOMMITPROPAGATIONDELAY parameters. Once the error stops happening, you can gradually decrease the values to see the SCN number where this error starts appearing again. This would give you an idea of the boundary values specific to that environment.
Furthermore, the reason Extract successfully restarts after this error is that, on restart, it re-reads the operations, and this time they are all on disk and can be processed in correct SCN order. The side effect is that Extract rewrites operations into the trail. This may cause the data pump or Replicat to abend with incompatible record errors.
Note: From Oracle 10.2,
parameter max_commit_propagation_delay has been deprecated. So "SQL>
show parameter max_commit NAME TYPE VALUE " is no longer viable for
Oracle 10.2 upward.
After you adjusted IOLATENCY
and MAXCOMMITPROPAGATIONDELAY parameters to really big values, if the
Extract still abends and the RAC nodes clocks are in sync, then you
might hit bug 13559209. This bug was introduced in early patch versions
of V11.1.1.0.29.
Also, we found bug 13408324
when customer has more than 15 nodes in RAC, it will corrupt some of
Extract's internal data structure and cause the flag of THREADOPTIONS
IGNORESCNSEQUENCE in-correctly turned on, which will cause Extract keep
running while printing out the warning message in the report file. This
could cause downstream data integrity issues as Extract is writing out
of order transactions in the trail file.
Oracle DB HEARTBEAT is utilized to coordinate the read from multiple nodes to totally eliminate this issue. This is being tracked in BugDB 10356426, which is implemented for V11.1.1.1, also backported back to V11.1.1.0.x code line. If you are running V11.1.1.0.x GoldenGate, please request a build that has this fix. The redo compatibility of the db should be 10.2 or higher for the fix to work.
Additional causes and Versions:
This error is also known to occur in 11.2 code due to reported bug: 13955146.
This bug is fixed in v12.1 and also included in OGG versions 11.2.1.0.3+ and 11.1.1.1.5+
=======================================================================
How Can The Checkpoints In The Extract Checkpoint File Be Changed? (Doc ID 964684.1)
Recovery Checkpoint (position of oldest unprocessed transaction in the data source):
Sequence #: 34381
RBA: 8615440
Timestamp: 2008-08-27 18:51:27.000000
Redo File: /rdbms/oracle/ora1012i/64/oradata/ora1012a/redo01.log
Current Checkpoint (position of last record read in the data source):
Sequence #: 34381
RBA: 8615936
Timestamp: 2008-08-27 18:51:27.000000
Redo File: /rdbms/oracle/ora1012i/64/oradata/ora1012a/redo01.log
Write Checkpoint #1
GGS Log Trail
Current Checkpoint (current write position):
Sequence #: 0
RBA: 2970
Timestamp: 2008-08-27 18:52:24.196674
Extract Trail: ./dirdat/pj
3) In the ./dirchk directory, there will be a file with a .cpe extension that is the same name as the Extract group. Rename this file to some other name. This is essentially the same as deleting the Extract.
4) Add the Extract with the same name.
GGSCI > ADD EXTRACT , TRANLOG, EXTSEQNO , EXTRBA
GGSCI > INFO EXTRACT
Note that both the Current and Recovery checkpoint data will be the same
5) Alter the Recovery Checkpoint.
GGSCI > ALTER EXTRACT , IOEXTSEQNO , IOEXTRBA
GGSCI > INFO EXTRACT
Note that the Current and Recovery checkpoint data should be changed as specified.
6) Add the remote trail (RMTTRAIL) or local trail (EXTTRAIL).
GGSCI > ADD RMTTRAIL./dirdat/, METGABYTES xx, SEQNO , RBA , EXTRACT
It is not necessary to follow every step exactly as shown above; just choose the checkpoint record to "copy" and follow those steps. For example, in step 6, just add the RMTTRAIL as normally if the new Extract is to write to a different location and start from SEQNO 1 and RBA 0.
Caveats
It is very dangerous to change the production Extract checkpoint file. Any mistake could cause permanent data loss or break data integrity.
Please consult the GoldenGate Support team before you try this procedure on any of your production Extracts.
Altering recovery checkpoints currently does not work for DB2.
========================================================================
How to Configure GoldenGate Extract When Adding or Removing Redo Log Threads in an Oracle RAC ?, OGG-00446 (Doc ID 1267901.1)
Adding new redo log threads
Configuring GoldenGate extract when adding or removing redo
If you have a RAC with a certain number of threads and an extract that
does not explicitly exclude threads, this is the procedure to alter
extract to capture the new threads when you add new threads to the RAC
- ALTER DATABASE ADD LOGFILE and ENABLE the new threads.
- Do not perform application updates to these new threads.
- Stop GoldenGate extract.
- GGSCI > INFO EXT , SHOWCH and print or save this information.
- DELETE EXT .
- ADD EXT , BEGIN NOW, TRANLOG, THREADS
- Update the checkpoint information saved in step 3 . Follow the note 964684.1
to use the SHOWCH output to reset the inputs and outputs of the new extract to match the old one. e.g. Alter each previously existing threads EXTSEQNO and EXTRBA (for the current checkpoint),Alter each previously existing threads the recovery checkpoint IOEXTSEQNO and IOEXTRBA, Alter the Output checkpoint (EXTTRAIL or RMTTRAIL) SEQNO and RBA - START EXTRACT.
- Allow application updates to these new threads. Extract will now capture and checkpoint all ENABLE threads.
Activating disabled or inactive redo log threads
1) If a thread is inactive and you wish to enable it, you need to add
the parameter PROCESSTHREADS SELECT. However if there is an existing
thread which you want to disable, you need to add PROCESSTHREADS EXCEPT
for such thread/s
Your current extract will have this parameter
PROCESSTHREADS SELECT
or
PROCESSTHREADS EXCEPT
PROCESSTHREADS SELECT
or
PROCESSTHREADS EXCEPT
Add the following parameter if you are running V11.1.1 and later,
BR BROFF
@BR BROFF is needed due to bug 12859529
To stop extract attempting to capture from these inactive threads:
Just before you enable the threads,
1) edit the extract parameter file to either removing these parameters
or specify what is required depending on which threads you wish to
enable.
2) stop the extract
3) enable the thread
4) start extract as soon as possible and before** doing transactions on
these newly enabled threads. Extract will now capture and checkpoint all
ENABLE threads.
**Note: If you have transactions in these threads before you start
extract you will lose data into these threads. You can recover by
configuring another extract reading only these threads.
Disabling redo log threads
1) The purpose is to remove an existing RAC thread from goldengate
extract so that extract will not capture from that thread.
1) Edit the extract parameter file to either remove these parameters or
specify what is required depending on which threads you wish to enable.
See THREADOPTIONS PROCESSTHREADS description below.
2) Disable the redo log threads.
3) The extract will abend because these threads are not available.
Simply restart the extract as you now have added the THREADOPTIONS
PROCESSTHREADS in the extract parameter file.
e.g.
i) Edit the extract parameter file to add SELECT or add EXCEPT processing from the disabled nodes.
For a three node RAC system where node # 2 fails you would add
THREADOPTIONS PROCESSTHREADS EXCEPT 2
Alternately, you could also have
THREADOPTIONS PROCESSTHREADS SELECT 1
THREADOPTIONS PROCESSTHREADS SELECT 3
Add the following parameter if you are running V11.1.1 and up
BR BROFF
@BR BROFF is needed due to bug 12859529
ii) Restart your extract
NOTE : If a node is EXCLUDED or NOT SELECTED no data that is updated via that node will be extracted, now and in the future.
NOTE : If a node is EXCLUDED or NOT SELECTED no data that is updated via that node will be extracted, now and in the future.
Correlating DataBase Threads to OGG Threads;
When the new RAC thread is added , the OGG thread mapping to RAC thread may change. We need to verify the
RAC thread order from v$log and identify the correct GG thread mapping.
It is then necessary to alter the extract according to the corresponding archive sequece as per the new
identified mapping.
e.g:
Here is the example showing the same. The RAC thread# sequence should be taken in the order it is shown
in 'select distinct thread# from v$log;' The OGG thread# sequence is always be from 1 to n<no.of distinct
rac threads>.
As the new RAC node is added, the newly added extract will change thread mapping accordingly.
Before node addition GG thread 2 is mapped to RAC thread 1. But after node addition , GG thread
2 is pointing to RAC thread 5. Before node addition: ==================== RAC THREAD# OGG thread ------------ -------------- 5 - 1 1 - 2 2 - 3 3 - 3 4 - 5 After node addition: =================== The corresponding thread mapping for Golden Gate is identified as given below: RAC THREAD# OGG thread ----------- ---------- 6 - 1 5 - 2 1 - 3 2 - 4 3 - 5 4 - 6 So when the new extract is added, Extract should be altered according to the new mapping. N.B. removing instance 2 frm rac will cause error OGG-00446 as OGG-00446 Oracle GoldenGate Capture for Oracle, esmeoe.prm: The number of Oracle redo threads (1) is not the same as the number of checkpoint threads (2). EXTRACT groups on RAC systems should be created with the THREADS parameter (e.g., ADD EXT , TRANLOG, THREADS 1, BEGIN...).
======================================================================
The Importance of Bounded Recovery
Bounded Recovery is a component of Oracle GoldenGate’s Extract process checkpointing facility.
It guarantees an efficient recovery after Extract stops for any reason,
planned or unplanned, no matter how many open (uncommitted)
transactions there were at the time that Extract stopped, nor how old
they were. Bounded Recovery sets an upper boundary for the maximum
amount of time that it would take for Extract to recover to the point
where it stopped and then resume normal processing.
Extract performs this recovery as follows:
· If
there were no open transactions when Extract stopped, the recovery
begins at the current Extract read checkpoint. This is a normal
recovery.
· If there were open transactions whose start points in the log were very close in time to the time when
Extract stopped, Extract begins recovery by re-reading the logs from
the beginning of the oldest open transaction. This requires Extract to
do redundant work for transactions that were already written to the
trail or discarded before Extract stopped, but that work is an
acceptable cost given the relatively small amount of data to process.
This also is considered a normal recovery.
· If there were one or more transactions that Extract qualified as long-running open transactions, Extract begins its recovery with a Bounded Recovery.
Bounded Recovery is new feature in OGG 11.1, this is how it works:
A
transaction qualifies as long-running if it has been open longer than
one Bounded Recovery interval, which is specified with the BRINTERVAL
option of the BR parameter.
For
example, if the Bounded Recovery interval is four hours, a long-running
open transaction is any transaction that started more than four hours
ago.
At
each Bounded Recovery interval, Extract makes a Bounded Recovery
checkpoint, which persists the current state and data of Extract to
disk, including the state and data (if any) of long-running
transactions. If Extract stops after a Bounded Recovery checkpoint, it
will recover from a position within the previous Bounded Recovery
interval or at the last Bounded Recovery checkpoint, instead of
processing from the log position where the oldest open long-running
transaction first appeared, which could be several trail files ago.
Bounded
Recovery is enabled by default for Extract processes and has a 4 hour
BR interval. To adjust the BR interval to say 24 hours, use the
following syntax in your Extract parameter file:
BR BRINTERVAL 24, BRDIR BR
The default location for BR checkpoint files is the GoldenGate home directory. This can be altered by including a full path:
BR BRINTERVAL 24, BRDIR /ggsdata/brcheckpoint
Case Study
The Problem
In a recent case, Bounded Recovery was disabled through the following Extract parameter:
BR BROFF
Consequently the following behavior prevented the Extract process from recovering and starting.
1. Firstly, GoldenGate had fallen behind due to a batch job and subsequently the Extract process was reading the archived redologs and not the online redologs. Also at this time an archived redolog was deleted by RMAN during a scheduled backup, that caused the Extract process to abend with OGG-00446 (caused by ORA-15173)
Error in ggserr.log
2012-07-04 11:03:03 ERROR OGG-00446 Oracle GoldenGate Capture for Oracle, euktds01.prm: Getting attributes for ASM file +FRA/2_86717_716466928.dbf, SQL dbms_diskgroup.getfileattr('+FRA/2_86717_716466928.dbf', :filetype, :filesize, :lblksize);
END;>: (15056) ORA-15056: additional error message ORA-15173: entry
'2_86717_716466928.dbf' does not exist in directory '/' ORA-06512: at
"SYS.X$DBMS_DISKGROUP", line 304 ORA-06512: at line 1Not able to
establish initial position for sequence 86717, rba 122140688.
2012-07-04 11:03:03 ERROR OGG-01668 Oracle GoldenGate Capture for Oracle, euktds01.prm: PROCESS ABENDING.
2. Some hours later, the deleted archived redolog file
was restored and the Extract process restarted. However, despite the
process running, the RBA# and Sequence# were not incrementing. The
Extract process was stuck!
The INFO GGSCI command with DETAIL option revealed the source redo was not available.
GGSCI (dbserver09a) 2> info EUKMDS01, detail
Extract Source Begin End
Not Available 2012-07-04 23:30 2012-07-04 23:30
Not Available 2012-07-04 23:28 2012-07-04 23:30
Not Available 2012-07-01 05:35 2012-07-04 23:28
+DATA/ukhub/onlinelog/group_4.282.716467031 2012-06-24 05:28 2012-07-01 05:35
+DATA/ukhub/onlinelog/group_3.280.716467027 2012-06-23 21:06 2012-06-24 05:28
3. The ggserr.log also revealed a long running transaction detected.
2012-07-04 23:31:47 WARNING OGG-01027 Oracle GoldenGate Capture
for Oracle, euko1els.prm: Long Running Transaction: XID 197.8.3521317,
Items 0, Extract EUKO1ELS, Redo Thread 2, SCN 51.3925309013
(222968641109), Redo Seq #86717, Redo RBA 122140688.
The Solution
The
Extract process was stuck in recovery mode, but could not find the
starting RBA. In order to get the process up and running, the following
steps were executed on the source system.
1. First of all, the Extract process was stopped with the force option.
GGSCI (dbserver09a) 4> send extract EUKMDS01, forcestop
2. The start position of the Extract process was altered to the beginning of the long running transaction.
GGSCI (dbserver09a) 5> alter extract EUKMDS01, begin 2012-07-04 23:31:47
3. The extract process was started.
GGSCI (dbserver09a) 4> start extract EUKMDS01
4. Sure enough, the Extract process was reinitialized and continued to process the backlog.
GGSCI (uklpdptoy09a) 2> info EUKMDS01, detail
Extract Source Begin End
+DATA/ukhub/onlinelog/group_4.282.716467031 2012-07-04 23:31 2012-07-05 02:58
Not Available * Initialized * 2012-07-04 23:31
Not Available 2012-07-04 23:30 2012-07-04 23:30
Conclusion
Never
disable Bounded Recovery else Extract processes may fail to recover
automatically. Furthermore, to prevent RMAN from deleting archived log
files that are still required. If you register the extract with
LOGRETENTION then the GoldenGate will retain the archive logs that Extract needs for recovery.
To register Extract do the following:
1. Stop
the Extract ( Ensure that all the archive log files starting from
recovery checkpoint till current checkpoint is available on all nodes )
2. Execute the following GGSCI commands
GGSCI> dblogin userid , password
GGSCI> register extract , LOGRETENTION
You can confirm whether Extract is registered or not using the query “select * from dba_capture”. (This sounds like Streams!) This should have an entry for Extract.
3. Start the Extract
GGSCI>start extract