Search

Wednesday, August 3, 2016

goldengate problem and solution

OGG Encountered SCN That Is Not Greater Than The Highest SCN Already Processed (Doc ID 957112.1)

APPLIES TO:

Oracle GoldenGate - Version 6.0.0 and later
Information in this document applies to any platform.


Issue:
Extract ERROR 180 encountered commit SCN that is not greater than the highest SCN already processed

Solution Overview:
This error occurs shortly after a transaction is written to the idle node but does not yet appear in the redo log when the current transaction is processed. This transaction will have a higher SCN than the previous transaction

Solution Details:
In an Oracle RAC environment, Extract has a coordinator thread. This thread assembles the operations that are generated on each of the RAC nodes into SCN order. Before starting to assemble the transactions, Extract waits for a short time after Oracle writes the transaction to the redo log. If a node has been idle with no transactions, Extract waits the number of milliseconds specified in the Extract parameter THREADOPTIONS MAXCOMMITPROPAGATIONDELAY (default 3 seconds). Waiting ensures that there are no new transactions on the idle node before writing the current transactions to the trail.

Possible Reasons for error:
-- One of the threads is slower than the other.
-- The redo logs are not flushed on time due to latency on Log writes.
-- Network issues between Extract and one of the RAC nodes, if Extract is running on a system separate from RAC nodes.
-- Long log write times due to a standby configuration, if any.
-- Log file I/Os are taking unusually long times to complete.
-- Time imperfections between the cluster nodes. All nodes in the RAC cluster must have synchronized system clocks. If Extract is running on a system other than any of the RAC nodes, that system clock must be in sync, too, because we compare the local system's time to the commit timestamp to make critical decisions. For information about synchronizing system clocks, consult www.ntp.org or your systems administrator
-- The DB timezone setting is different than the OS timezone setting

Case example:
The following shows a time sequence and explains what happens at each step in time:
TIME WHAT HAPPENED

------- ---------

2008-03-31 14:08:21.964 Starting from this timestamp, thread 2 is reporting that it is at the end of the file (EOF).

...... Thread 2 keeps reporting EOF.

2008-03-31 14:25:33.021 Thread 1 reports it is at EOF

2008-03-31 14:25:33.110 Thread 2 reports it is at EOF

2008-03-31 14:25:34.040 Thread 1 gets the commit record, with timestamp of 2008-03-31 14:10:52. It commits the transaction with an SCN of 1.2943495430.

no timestamp... The coordinator thread processes the committed transaction from thread 1, because thread 2 kept reporting EOF for the last 17 minutes.

2008-03-31 14:25:34.188 Thread 2 got an update record and a commit record with timestamp of 2008-03-31 14:10:52. It commits transaction with an SCN of 1.2943495406

2008-03-31 14:25:34.xxx The main thread processes the committed transaction from thread 2, and detects an out of order SCN. Extract abends with this error: 2008-03-31 14:25:34 GGS ERROR 180 encountered commit SCN 1.2943495406 (0001.af722cee) that is not greater than the highest SCN already processed 1.2943495430 (0001.af722d06) Redo Thread 2 (2) xid 29.27.131432 (0x001d.01b.00020168), starting seq.rba 3725.98984464, scn 1.2943495403 (0001.af722ceb), commit seq.rba 3725.99011828 commit timestamp 2008-03-31 14:10:52.000000.

In this case, the out of order SCN was caused by the huge gap in the time between when the record was generated to the log buffer and when it was captured by Extract.

Troubleshooting steps:

-- Check to see if the clocks on different nodes in the same RAC are in sync

-- Check to see if there are any unusual time gaps between log data on threads during the SCN error.

-- Check how the 'remote'  log files, if any, are shared with the system on which Extract is running (e.g., NFS, SAN, shared SCSI, etc.) and see if there are any noticeable problems there.

-- Check if there are messages in the Oracle alert logs, lgwr tracefiles or net logs during the problematic time, to see if they provide any clues to point to one of the 'Possible Reasons' discussed above.
-- Check to see if any analysis been done on the I/O performance on the drives that contain the log files and whether they are "real Disk";, NFS, SAN, etc.
-- Check to see if the DB timezone is setup correctly.  Issue 'SELECT dbtimezone FROM DUAL', the timezone should match the OS.  For example is the result is "+00:00", that means DB is using GMT/UTC timezone.

Recommendation:-
The MAXCOMMITPROPAGATIONDELAY parameter can be used to set the delay time well above the max_commit_propogation_delay setting in the database, plus the default extra padding that Extract adds (2000 milliseconds).

In Oracle RAC, the max_commit_propogation_delay specifies the maximum amount of time allowed before the system change number (SCN) held in the SGA of an instance is refreshed by the log writer process (LGWR). Units are in hundredths of seconds.

To check Oracle's value:

Connect as a user with dba privileges and issue:
SQL> show parameter max_commit NAME TYPE VALUE

------------------------------------ ----------- ------------------------------ max_commit_propagation_delay integer 0

To set MAXCOMMITPROPAGATIONDELAY : The value of MAXCOMMITPROPAGATIONDELAY must be greater than zero and less than 90000 milliseconds.

This is how the line should look in the Extract parameter file:

THREADOPTIONS MAXCOMMITPROPAGATIONDELAY 2700

Starting from GGS Version 9.x and above, an additional parameter 'IOLATENCY' can be used if Extract abends with 'encountered SCN XXXXX' too often. IOLATENCY adjusts the delta between the database-configured max commit propogation delay and the internal value that Extract uses. By default IOLATENCY is set to 1.5 seconds 
Note: Valid values for IOLATENCY are between 0 and 180000 milliseconds (3 minutes).

The combined parameters should look like this:

THREADOPTIONS MAXCOMMITPROPAGATIONDELAY IOLATENCY

The combination of MAXCOMMITPROPAGATIONDELAY and IOLATENCY can be used to ensure that:

1) the Oracle threads have written their most recent SCN data to the logs
2) the I/O processes have had time to complete, considering the various factors that increase I/O latency, such as hardware contention, file locking, long seek and queue times, etc.

The key is in adjusting IOLATENCY and MAXCOMMITPROPAGATIONDELAY parameters to an optimum value based on the frequency of hitting the error.

Hint: If the problem happens too often, you can start with high values for IOLATENCY and MAXCOMMITPROPAGATIONDELAY parameters. Once the error stops happening, you can gradually decrease the values to see the SCN number where this error starts appearing again. This would give you an idea of the boundary values specific to that environment.

Furthermore, the reason Extract successfully restarts after this error is that, on restart, it re-reads the operations, and this time they are all on disk and can be processed in correct SCN order. The side effect is that Extract rewrites operations into the trail. This may cause the data pump or Replicat to abend with incompatible record errors.
Note: From Oracle 10.2, parameter max_commit_propagation_delay has been deprecated. So "SQL> show parameter max_commit NAME TYPE VALUE " is no longer viable for Oracle 10.2 upward.
After you adjusted IOLATENCY and MAXCOMMITPROPAGATIONDELAY parameters to really big values, if the Extract still abends and the RAC nodes clocks are in sync, then you might hit bug 13559209.  This bug was introduced in early patch versions of V11.1.1.0.29.  
Also, we found bug 13408324 when customer has more than 15 nodes in RAC, it will corrupt some of Extract's internal data structure and cause the flag of THREADOPTIONS IGNORESCNSEQUENCE in-correctly turned on, which will cause Extract keep running while printing out the warning message in the report file.  This could cause downstream data integrity issues as Extract is writing out of order transactions in the trail file.
Final Fix of this issue in V11.1.1+
Oracle DB HEARTBEAT is utilized to coordinate the read from multiple nodes to totally eliminate this issue.  This is being tracked in BugDB 10356426, which is implemented for V11.1.1.1, also backported back to V11.1.1.0.x code line.  If you are running V11.1.1.0.x GoldenGate, please request a build that has this fix. The redo compatibility of the db should be 10.2 or higher for the fix to work.

Additional causes and Versions:
This error is also known to occur in 11.2 code due to reported bug: 13955146.
This bug is fixed in v12.1 and also included in OGG versions 11.2.1.0.3+ and 11.1.1.1.5+ 


  =======================================================================


How Can The Checkpoints In The Extract Checkpoint File Be Changed? (Doc ID 964684.1)



Recovery Checkpoint (position of oldest unprocessed transaction in the data source):

Sequence #: 34381

RBA: 8615440

Timestamp: 2008-08-27 18:51:27.000000

Redo File: /rdbms/oracle/ora1012i/64/oradata/ora1012a/redo01.log


Current Checkpoint (position of last record read in the data source):

Sequence #: 34381

RBA: 8615936

Timestamp: 2008-08-27 18:51:27.000000

Redo File: /rdbms/oracle/ora1012i/64/oradata/ora1012a/redo01.log


Write Checkpoint #1


GGS Log Trail


Current Checkpoint (current write position):

Sequence #: 0

RBA: 2970

Timestamp: 2008-08-27 18:52:24.196674

Extract Trail: ./dirdat/pj


3) In the ./dirchk directory, there will be a file with a .cpe extension that is the same name as the Extract group. Rename this file to some other name. This is essentially the same as deleting the Extract. 


4) Add the Extract with the same name.

GGSCI > ADD EXTRACT , TRANLOG, EXTSEQNO , EXTRBA  

GGSCI > INFO EXTRACT

Note that both the Current and Recovery checkpoint data will be the same


5) Alter the Recovery Checkpoint.

GGSCI > ALTER EXTRACT , IOEXTSEQNO , IOEXTRBA

GGSCI > INFO EXTRACT  

Note that the Current and Recovery checkpoint data should be changed as specified.


6) Add the remote trail (RMTTRAIL) or local trail (EXTTRAIL).

GGSCI > ADD RMTTRAIL./dirdat/, METGABYTES xx, SEQNO , RBA , EXTRACT  

It is not necessary to follow every step exactly as shown above; just choose the checkpoint record to "copy" and follow those steps. For example, in step 6, just add the RMTTRAIL as normally if the new Extract is to write to a different location and start from SEQNO 1 and RBA 0.


Caveats
It is very dangerous to change the production Extract checkpoint file. Any mistake could cause permanent data loss or break data integrity.
Please consult the GoldenGate Support team before you try this procedure on any of your production Extracts.
Altering recovery checkpoints currently does not work for DB2.


========================================================================

How to Configure GoldenGate Extract When Adding or Removing Redo Log Threads in an Oracle RAC ?, OGG-00446 (Doc ID 1267901.1)


Adding new redo log threads

Configuring GoldenGate extract when adding or removing redo
If you have a RAC with a certain number of threads and an extract that does not explicitly exclude threads, this is the procedure to alter extract to capture the new threads when you add new threads to the RAC
  1. ALTER DATABASE ADD LOGFILE and ENABLE the new threads.
  2. Do not perform application updates to these new threads.
  3. Stop GoldenGate extract.
  4. GGSCI > INFO EXT , SHOWCH and print or save this information.
  5. DELETE EXT .
  6. ADD EXT , BEGIN NOW, TRANLOG, THREADS
  7. Update the checkpoint information saved in step 3 . Follow the note 964684.1
    to use the SHOWCH output to reset the inputs and outputs of the new extract to match the old one. e.g. Alter each previously existing threads EXTSEQNO and EXTRBA (for the current checkpoint),Alter each previously existing threads the recovery checkpoint IOEXTSEQNO and IOEXTRBA, Alter the Output checkpoint (EXTTRAIL or RMTTRAIL) SEQNO and RBA
  8. START EXTRACT.
  9. Allow application updates to these new threads. Extract will now capture and checkpoint all ENABLE threads.

Activating disabled or inactive redo log threads

1)    If a thread is inactive and you wish to enable it, you need to add the parameter PROCESSTHREADS SELECT. However if there is an existing thread which you want to disable, you need to add PROCESSTHREADS EXCEPT for such thread/s
Your current extract will have this parameter
PROCESSTHREADS SELECT
or
PROCESSTHREADS EXCEPT
Add the following parameter if you are running V11.1.1 and later,
BR BROFF
@BR BROFF is needed due to bug 12859529
To stop extract attempting to capture from these inactive threads:
Just before you enable the threads,
1) edit the extract parameter file to either removing these parameters or specify what is required depending on which threads you wish to enable.
2) stop the extract
3) enable the thread
4) start extract as soon as possible and before** doing transactions on these newly enabled threads. Extract will now capture and checkpoint all ENABLE threads.
**Note: If you have transactions in these threads before you start extract you will lose data into these threads. You can recover by configuring another extract reading only these threads.

Disabling redo log threads

1)     The  purpose is to remove an existing RAC thread from goldengate extract so that extract will not capture from that thread.
1) Edit the extract parameter file to either remove these parameters or specify what is required depending on which threads you wish to enable. See THREADOPTIONS PROCESSTHREADS description below.
2) Disable the redo log threads.
3) The extract will abend because these threads are not available. Simply restart the extract as you now have added the THREADOPTIONS PROCESSTHREADS in the extract parameter file.

e.g.
i) Edit the extract parameter file to add SELECT or add EXCEPT processing from the disabled nodes.

For a three node RAC system where node # 2 fails you would add

THREADOPTIONS PROCESSTHREADS EXCEPT 2

Alternately, you could also have

THREADOPTIONS PROCESSTHREADS SELECT 1
THREADOPTIONS PROCESSTHREADS SELECT 3
Add the following parameter if you are running V11.1.1 and up
BR BROFF
@BR BROFF is needed due to bug 12859529
ii) Restart your extract

NOTE : If a node is EXCLUDED or NOT SELECTED no data that is updated via that node will be extracted, now and in the future.

Correlating DataBase Threads to OGG Threads;

 
 
When the new RAC thread is added , the OGG thread mapping to RAC thread may change. We need to verify the 
RAC thread order from v$log and identify the correct GG thread mapping.

It is then necessary to alter the extract according to the corresponding archive sequece as per the new 
identified mapping.

e.g:
Here is the example showing the same. The RAC thread# sequence should be taken in the order it is shown
 in 'select distinct thread# from v$log;' The OGG thread# sequence is always be from 1 to n<no.of distinct
 rac threads>.

As the new RAC node is added, the newly added extract will change thread mapping accordingly. 
Before node addition GG thread 2 is mapped to RAC thread 1. But after node addition , GG thread 
2 is pointing to RAC thread 5.

Before node addition:
====================
RAC THREAD#    OGG thread
------------  --------------
      5     -         1
      1     -         2
      2     -         3
      3     -         3
      4     -         5

After node addition:
===================
The corresponding thread mapping for Golden Gate is identified as given below:

RAC THREAD#    OGG thread
-----------    ----------
      6     -         1
      5     -         2
      1     -         3
      2     -         4
      3     -         5
      4     -         6 

So when the new extract is added, Extract should be altered according to the new mapping.


N.B.
removing instance 2 frm rac will cause  error OGG-00446 as

OGG-00446  Oracle GoldenGate Capture for Oracle, esmeoe.prm:  
The number of Oracle redo threads (1) is not the same as the number of checkpoint threads (2). 
EXTRACT groups on RAC systems should be created with the THREADS parameter 
(e.g., ADD EXT , TRANLOG, THREADS 1, BEGIN...).
======================================================================

The Importance of Bounded Recovery


Bounded Recovery is a component of Oracle GoldenGate’s Extract process checkpointing facility. It guarantees an efficient recovery after Extract stops for any reason, planned or unplanned, no matter how many open (uncommitted) transactions there were at the time that Extract stopped, nor how old they were. Bounded Recovery sets an upper boundary for the maximum amount of time that it would take for Extract to recover to the point where it stopped and then resume normal processing.
Extract performs this recovery as follows:
·         If there were no open transactions when Extract stopped, the recovery begins at the current Extract read checkpoint. This is a normal recovery.
·         If there were open transactions whose start points in the log were very close in time to the time when Extract stopped, Extract begins recovery by re-reading the logs from the beginning of the oldest open transaction. This requires Extract to do redundant work for transactions that were already written to the trail or discarded before Extract stopped, but that work is an acceptable cost given the relatively small amount of data to process. This also is considered a normal recovery.
·         If there were one or more transactions that Extract qualified as long-running open transactions, Extract begins its recovery with a Bounded Recovery.

Bounded Recovery is new feature in OGG 11.1, this is how it works:
A transaction qualifies as long-running if it has been open longer than one Bounded Recovery interval, which is specified with the BRINTERVAL option of the BR parameter.
For example, if the Bounded Recovery interval is four hours, a long-running open transaction is any transaction that started more than four hours ago.
At each Bounded Recovery interval, Extract makes a Bounded Recovery checkpoint, which persists the current state and data of Extract to disk, including the state and data (if any) of long-running transactions. If Extract stops after a Bounded Recovery checkpoint, it will recover from a position within the previous Bounded Recovery interval or at the last Bounded Recovery checkpoint, instead of processing from the log position where the oldest open long-running transaction first appeared, which could be several trail files ago.

Bounded Recovery is enabled by default for Extract processes and has a 4 hour BR interval. To adjust the BR interval to say 24 hours, use the following syntax in your Extract parameter file:

BR BRINTERVAL 24, BRDIR BR

The default location for BR checkpoint files is the GoldenGate home directory. This can be altered by including a full path:

BR BRINTERVAL 24, BRDIR /ggsdata/brcheckpoint

Case Study

The Problem

In a recent case, Bounded Recovery was disabled through the following Extract parameter:
BR BROFF
Consequently the following behavior prevented the Extract process from recovering and starting.
1.       Firstly, GoldenGate had fallen behind due to a batch job and subsequently the Extract process was reading the archived redologs and not the online redologs. Also at this time an archived redolog was deleted by RMAN during a scheduled backup, that caused the Extract process to abend with OGG-00446 (caused by ORA-15173)
Error in ggserr.log
2012-07-04 11:03:03  ERROR OGG-00446  Oracle GoldenGate Capture for Oracle, euktds01.prm:  Getting attributes for ASM file +FRA/2_86717_716466928.dbf, SQL dbms_diskgroup.getfileattr('+FRA/2_86717_716466928.dbf', :filetype, :filesize, :lblksize); END;>: (15056) ORA-15056: additional error message ORA-15173: entry '2_86717_716466928.dbf' does not exist in directory '/' ORA-06512: at "SYS.X$DBMS_DISKGROUP", line 304 ORA-06512: at line 1Not able to establish initial position for sequence 86717, rba 122140688.
2012-07-04 11:03:03  ERROR OGG-01668  Oracle GoldenGate Capture for Oracle, euktds01.prm:  PROCESS ABENDING.

2.       Some hours later, the deleted archived redolog file was restored and the Extract process restarted. However, despite the process running, the RBA# and Sequence# were not incrementing. The Extract process was stuck!
The INFO GGSCI command with DETAIL option revealed the source redo was not available.
GGSCI (dbserver09a) 2> info EUKMDS01, detail

Extract Source Begin             End

Not Available 2012-07-04 23:30 2012-07-04 23:30
Not Available 2012-07-04 23:28 2012-07-04 23:30
Not Available 2012-07-01 05:35 2012-07-04 23:28
+DATA/ukhub/onlinelog/group_4.282.716467031  2012-06-24 05:28  2012-07-01 05:35
+DATA/ukhub/onlinelog/group_3.280.716467027  2012-06-23 21:06  2012-06-24 05:28

3.       The ggserr.log also revealed a long running transaction detected.
2012-07-04 23:31:47  WARNING OGG-01027  Oracle GoldenGate Capture for Oracle, euko1els.prm:  Long Running Transaction: XID 197.8.3521317, Items 0, Extract EUKO1ELS, Redo Thread 2, SCN 51.3925309013 (222968641109), Redo Seq #86717, Redo RBA 122140688.

The Solution

The Extract process was stuck in recovery mode, but could not find the starting RBA. In order to get the process up and running, the following steps were executed on the source system.
1.       First of all, the Extract process was stopped with the force option.
GGSCI (dbserver09a) 4> send extract EUKMDS01, forcestop
2.       The start position of the Extract process was altered to the beginning of the long running transaction.
GGSCI (dbserver09a) 5> alter extract EUKMDS01, begin 2012-07-04 23:31:47
3.       The extract process was started.
GGSCI (dbserver09a) 4> start extract EUKMDS01
4.       Sure enough, the Extract process was reinitialized and continued to process the backlog.
GGSCI (uklpdptoy09a) 2> info EUKMDS01, detail

Extract Source Begin             End

+DATA/ukhub/onlinelog/group_4.282.716467031  2012-07-04 23:31  2012-07-05 02:58
Not Available * Initialized *   2012-07-04 23:31
Not Available 2012-07-04 23:30 2012-07-04 23:30

Conclusion

Never disable Bounded Recovery else Extract processes may fail to recover automatically. Furthermore, to prevent RMAN from deleting archived log files that are still required. If you register the extract with LOGRETENTION then the GoldenGate will retain the archive logs that Extract needs for recovery.
To register Extract do the following:

1.        Stop the Extract ( Ensure that all the archive log files starting from recovery checkpoint till current checkpoint is available on all nodes )
2.       Execute the following GGSCI commands
GGSCI> dblogin userid , password
GGSCI> register extract , LOGRETENTION

You can confirm whether Extract is registered or not using the query “select * from dba_capture”. (This sounds like Streams!) This should have an entry for Extract.

3.       Start the Extract
GGSCI>start extract