Search

Sunday, November 18, 2012

11gR2 RAC Wait Events

Wait-class “Cluster” indicates cluster-related waits related to Cache Fusion:

   
 RAC WaitEvent Naming Conventions

    - Wait Events that start with “GCS%” and “gc%” are Cache Fusion-related waits.
 In other word, they have to do with waits experienced in shipping current or consistent-read versions of blocks across instances in a RAC cluster

    - Events that start with “ges%’ are related to Global Enqueue Services gc current/cr multiblock request  is only a placeholder , until the block is obtained Block provided Immediate from global Cache

    - “gc current block 2-way” - Indicates that a current block was sent from either the resource master   requiring 1 message and 1 transfer

 - “gc current block 3-way” – Indicates that a current block was forwarded to a 3rd node from which it was sent, requiring 2 messages and one block transfer.

    - “gc cr block 2-way” - Indicates that a cr block was sent from either the resource master requiring 1 message and 1 transfer

    - “gc cr block 3-way” -Indicates that a current block was forwarded to a 3rd node from which it was sent, requiring 2 messages and one block transfer



 High Load wait Events

    - “gc current block congested”
    - “gc cr block congested”

    The load wait events indicate that a delay in processing has occurred in the GCS, which is usually caused by high load, CPU saturation,paging and would have to be solved by additional CPUs, load-balancing, off loading processing to different times or a new cluster node.

   
AWR Reports

    Check the values for:
    Average global cache cr block recieve time (ms)
    Average global cache current block recieve time (ms)


 RAC Specific Wait Event

A) Block-Oriented Waits
-----------------------
1. gc current block 2-way
2. gc current block 3-way
3. gc cr block 2-way
4. gc cr block 3-way


B) Message-Oriented Wais
-------------------------
1. gc current grant 2-way
2. gc current grant 3-way
3. gc cr grant 2-way
4. gc cr grant 3-way (Normally this wait event is not possible, but
"_cr_grant_local_role" --> turn 3-way CR grants off, make it automatic, or turn it on)


C) Contention-Oriented Waits
----------------------------
1. gc current block busy (cluster cache contention)
2. gc cr block busy
3. gc current buffer busy (local cache contention)


D) Load-Oriented Waits
----------------------
1. gc current block congested
2. gc cr block congested
3. gc current grant congested
4. gc cr grant congested



A.1. gc current block 2-way (write/write with 2 nodes)

Definition
- Requesting instance request any data block for dml(current) from Master.
- If master is the holder of that data block and also has already modified that block.
- Then master will retain PI block for itself.
- Master will also flush respective redo log to log file before sending CURRENT block to requesting instance.
- Meanwhile requesting instance will wait in "GC CURRENT BLOCK 2-WAY"


Reason
Before sending the current data block to the requesting instance master instance first flush respective redo log of the log file then it will prepare PI block and then  send CURRENT block to the requesting instance over the interconnect.

- This wait event appears in "TOP-5" timed events section of AWR Report.
- Analyze the contention using AWR REPORT.
- In AWR REPORT, analyze "Current Block Received" and identify the top contentious objects.
- You are getting contention at segment level because of bad database design, DB object layout and Space Management.
- LGWR is not efficient to write in redo log file and that's why requesting instance is waiting in "gc current block 2-way".
- Interconnect is having Network latency.
- Application Segregation is also a reason. (bad explain, 


Troubleshooting
 - Best Solution is to apply application Segregation means try to locate all select query on one node and all DML on another node.
- Tune LGWR
- Tune Interconnect


A.2. gc current block 3-way (write/write with 3 nodes)


Definition
- Requesting instance request any data block in CURRENT MODE for dml(current) from Master.
- If master is not holder of that data block and that data block is globally available on another instance.
_ Master will send a message to the current holding instance to relinquish ownership (Downgrade lock).
- The holding instance retain the PI of that data block and then serve to the requesting instance.
- Holding instance will also flush respective redo log to log file before sending CURRENT block to requesting instance.
- Meanwhile requesting instance will wait in "GC CURRENT BLOCK 3-WAY"


Reason
As you know before sending the current data block to the requesting instance master instance first flush respective redo log of the log file then it will prepare PI block and then CR block to send to the requesting instance over the interconnect.

- This wait event appears in "TOP-5" timed events section of AWR Report.
- Analyze the contention using AWR REPORT.
- In AWR REPORT, analyze "Current Block Recieved" and identify the top contentious objects.
- You are getting contention at segment level because of bad database design, DB object layout and Space Management.
- LGWR is not efficient to write in redo log file and thats why requesting instance is waiting in "gc current block 3-way".
- Interconnect is having N/W latency.
- Application Segregation is also a reason.


Troubleshooting
- Best Solution is to apply application Segregation means try to locate all select query on one node and all DML on another node.
- Tune LGWR
- Tune Interconnect


A.3. gc cr block 2-way (read/read or write/read with 2 nodes)

Definition

Case-1: WRITE/READ

- Requesting instance request any CR data block for select from Master.
- If master is the holder of that data block and also has already modified that block.
- Then master will prepare CR copy of that data block (using undo).
- Finally Master instance serve CR block to the requesting instance.
- Meanwhile requesting instance will wait in "gc cr block 2-way"

Case-2: READ/READ
- Requesting instance request any CR data block for select from Master.
- If master is the holder of that data block and has not already modified that block.
- Master instance serve CR block to the requesting instance immediately.
- Meanwhile requesting instance will wait in "gc cr block 2-way"

Reason
- In both the cases you will encounter this wait event in "TOP-5" section of AWR Report.
- Plan of action would be similer like "gc current block 2-way"
-
No locks are maintained by GCS for CR blocks
-Long-pending transaction on highly accessed objects can lead to CR storm ( Run this sort of transactions at a less busy timeframe )

A.4. gc cr block 3-way (read/read or write/read with 3 nodes or more than 3 nodes)

This wait event is exactly same as "gc cr block 2-way", only the difference is that here 3 or more than 3 instances are involved.




B.1. gc current grant 2-way

Definition
- An instance request any data block in CURRENT MODE for dml(current) from Master.
- If the data block is not cached on any instance even on master too, then master instance will send a message to the requesting instance granting the EXCLUSIVE lock.
- Meanwhile requesting instance will wait in "GC CURRENT BLOCK 2-WAY"
- Requesting instace will read data block from disk and do the physical I/O.
- This wait event does not indicate any contention.

Reason
- This wait event appears in "TOP-5" timed events section of AWR Report.
- This wait event represent that requesting instance is spending a significant amount of time in obtaining the locks.
- Interconnect is having N/W latency (rare).

Troubleshooting
 - Best Solution is to tune your SQL Application so that it request less amout of data blocks.
- Because if requesting instance ask for more data blocks then master has to locking and holding instance related information in GRD before granting EXCLUSIVE LOCK, which will be high.
- Tune Interconnect (very rare).


B.2. gc cr grant 2-way

Definition
- An instance request any data block in CR MODE for select from Master.
- If the data block is not cached on any instance even on master too, then master instance will send a message to the requesting instance granting the SHARED lock.
- Meanwhile requesting instance will wait in "GC CR BLOCK 2-WAY"
- Requesting instance will read data block from disk and do the physical I/O.
- This wait event does not indicate any contention.

B.3. gc current grant 3-way

Definition
- An instacne request for any data block in current mode.
- Mater is not the holder of that data block, then master forward a message to provide the data block to the requesting instance.
- Current holding instance don't have the block because of aging out mechanism.
- Current holding instance grant exclusive lock to the requesting instance for physical IO.
- Meanwhile requesting instance will wait in "gc current grant 3-way".
- there is a hidden parameter to control this "_cr_grant_local_role"



C.1. gc current block busy

Definition
- An instance request for any data block in current mode, it send a request to the master.
- If master is the holder of that data block and also has already modified that block.
- Then master will retain PI block for itself.
- Master will also flush respective redo log to log file before sending CURRENT block to requesting instance.
- Now the block transfer delayed on requsting instance.
- Meanwhile requesting instance will wait in "gc current block busy"
-
Busy events indicate that LMS needs to perform add. work before sending a block ( like apply Redo to build CR block )

Reason
- The block was being used by a session on another instance.
- The block transfer was delayed because the holding instance could not write the corresponding redo record to the online redo log file.

Troubleshooting
 - Tune LGWR
- Appropriate Application Partitioning
- Tune N/w


C.2. gc current buffer busy   /       C.3. gc cr block busy
This wait event appears mostly in SINGLE INSTANCE when more than one sessions are trying to access same data block.

Troubleshooting
 - Application tuning



D.1.  gc current block congested  /  D. 2.  gc cr block congested

Definition
- Whenever any instance (HOST-03) request for any data block in any mode, this request will be served by MASTER NODE (HOST-01) of that data block.
- LMS process running on master node will provide data block to the requesting instance LMS process.
- Now LMS process running on both the nodes are highly loaded so there would be wait event ""gc current block congested
- if LMS process did not process a request within 1ms than LMS marks the response to that block with the congestion wait event

Reason
- Highly loaded LMS process or CPU.
- LMS is suffering CPU scheduling, LMS is suffering resources like memory ( paging )

Troubleshooting
- Increase number of LMS Processes by setting "GCS_SERVER_PROCESS" in 9i RAC onward.
- Optionally you can also set "_LM_LSM" till OPS.
- As LMS processes are RT processes OS scheduling delays should be minimal
- Tune OS
- Add CPUs


D.3. gc current grant congested

Definition
- Whenever any instance (HOST-03) request for any data block in any mode, this request will be served by MASTER NODE of that data block.
- Meanwhile requesting instance (HOST-03) is waiting for approval from master instance to perform physical IO to read data block from DISK.
- This wait event is "gc current grant congested"

Some more ......

GC current request/GC CR request
  • These are placeholder requests which should which should be mapped to one of the above wait events once the LMS responds
  • If there is no response in 0.5s from LMS ( 6s on windows ) the acoounted time is added to GC Lost Block wait event

GC Log Flush Sync
  • for details please  read following link..
  • in a healthy database 90 % of GCS Log flush Sync requet should be finished in 2ms or less
  • Always check related instances for LOG FILE SYNC waiut event as this event also reduces the available Redo I/O bandwith 
  • Monitor Wait distribution/histogramms  with following script : @event_histogram_from_awr.sql


GC Buffer Busy Acquire/Release
  • For RHS index growth you see GC Buffer Busy Acquire/Release and GC Buffer Busy events and gc current request for the same block
  • If the block is on instance 1 all other sessions from instance 1 are waiting on gc buffer busy aquire whereas sessions on the other instances are waiting on gc buffer busy release
  • Always monitor GV$session P1,P2,P3 to get detail information about what blocks are used


Scripts to Measure RAC Wait events

Measuring RAC Waits from V$SESSION_WAIT
set linesize 200
set pages 55
col sid format 99999
col name format a36
col p1 format 999999999 Head ‘P1’
col program format a25
col p2 format 999999999 Head ‘P2’
col p3 format 999999999 Head ‘P3’
col pgm format a15 head ‘What’
col state format a15
col wt format 9999999 head ‘Wait|Time’
col WaitEvent format a38 head ‘Wait Event’
col lc format 99999999999.99 head ‘last call’
select A.sid,
decode(A.event,’null event’,’CPU Exec’,A.event) WaitEvent,
A.p1,A.p2,A.p3,
decode(A.state,’WAITING’,’WTG’,
‘WAITED UNKNOWN TIME’,’UNK’,
‘WAITED SHORT TIME’,’WST’,
‘WAITED KNOWN TIME’,’WKT’) wait_type,
decode(A.state,’WAITING’,A.seconds_in_wait,
‘WAITED UNKNOWN TIME’,-999,
‘WAITED SHORT TIME’,A.wait_time,
‘WAITED KNOWN TIME’,A.WAIT_TIME) wt,
round((last_call_et/60),2) lc,
substr(nvl(b.module,b.program),1,15) pgm
from v$session_wait A,
v$session B
where 1=1
and (A.event like ‘gc%’
or A.event like ‘GC%’
or A.event like ‘ge%’)
and A.event not like ‘%remote message’
and A.event not like ‘%sleep’
and A.sid=B.sid
and B.status=’ACTIVE’
order by 1
/
 Sample output:
930


Measuring RAC Waits from  V$ACTIVE_SESSION_HISTORY
set lines 200
col samplestart format a30 head ‘Begin Time’
col sampleend format a30 head ‘End Time’
col aas format 999.99 head ‘AAS’
col aas_event format 999.99 head ‘AAS per |Event’
col dbt format 999999 head ‘DB Time’
col event format a30
col time_waited format 999999 head ‘Time (ms)’
col wait_pct format 999.99 head ‘% of DB Time’
COLUMN bt NEW_VALUE _bt NOPRINT
COLUMN et NEW_VALUE _et NOPRINT
select min(cast(sample_time as date)) bt,
sysdate et
from v$active_session_history;
with xtimes (xdate) as
(select to_date(‘&_bt’) xdate
from dual
union all
select xdate+(&&interval_mins/1440)
from xtimes
where xdate+(&&interval_mins/1440) < sysdate)
select to_char(s1.xdate,’DD-MON-RR HH24:MI:SS’) samplestart,
to_char(s1.xdate+(&&interval_mins/1440),’DD-MON-RR HH24:MI:SS’) sampleend,
s2.event,.001*((sum(s2.time_waited))) time_waited,
(count(s2.sample_id)/(60*&&interval_mins)) aas_event,
(count(s3.sample_id)) dbt,
round(100*(sum(s2.time_waited)/1000000/count(s3.sample_id)),2) as wait_pct
from xtimes s1,
     v$active_session_history s2,
     v$active_session_history s3
where s2.sample_time between s1.xdate and s1.xdate+(&&interval_mins/1440)
and s3.sample_time between s1.xdate and s1.xdate+(&&interval_mins/1440)
and s2.sample_id=s3.sample_id
and (s2.event  like ‘gc%’
or s2.event like ‘GC%’
or s2.event like ‘ge%’)
and s2.event not like ‘%remote message’
and s2.event not like ‘%sleep’
group by s1.xdate,s2.event
order by s1.xdate
/
undefine interval_mins
Sample output:

931


Measuring RAC Waits from DBA_HIST_ACTIVE_SESS_HISTORY
set lines 200
col samplestart format a30 head ‘Begin Time’
col sampleend format a30 head ‘End Time’
col aas format 999.99 head ‘AAS’
col aas_event format 999.99 head ‘AAS per |Event’
col event format a30
col time_waited format 999999 head ‘Time (ms)’
col wait_pct format 999.99 head ‘Wait %’
COLUMN bt NEW_VALUE _bt NOPRINT
COLUMN et NEW_VALUE _et NOPRINT
select min(cast(sample_time as date)) bt,
sysdate et
from v$active_session_history;
with xtimes (xdate) as
(select to_date(‘&_bt’) xdate
from dual
union all
select xdate+(&&interval_mins/1440)
from xtimes
where xdate+(&&interval_mins/1440) < sysdate)
select to_char(s1.xdate,’DD-MON-RR HH24:MI:SS’) samplestart,
to_char(s1.xdate+(&&interval_mins/1440),’DD-MON-RR HH24:MI:SS’) sampleend,
s2.event,
.001*((sum(s2.time_waited))) time_waited,
10*(count(s2.sample_id)/(60*&&interval_mins)) aas_event,
10*(count(s3.sample_id)) dbt,
round(100*(sum(s2.time_waited)/1000000/(10*(count(s3.sample_id)))),2) as wait_pct
from xtimes s1,
     dba_hist_active_sess_history s2,
     dba_hist_active_sess_history s3
where s2.sample_time between s1.xdate and s1.xdate+(&&interval_mins/1440)
and s3.sample_time between s1.xdate and s1.xdate+(&&interval_mins/1440)
and s2.sample_id=s3.sample_id
and (s2.event  like ‘gc%’
or s2.event like ‘GC%’
or s2.event like ‘ge%’)
and s2.event not like ‘%remote message’
and s2.event not like ‘%sleep’
group by s1.xdate,s2.event
order by s1.xdate
/
undefine interval_mins
Sample output:
932