Sunday, August 14, 2016

Voting Disks in Oracle 11gR2 RAC Demystified

Voting disks are an important component of Oracle Clusterware. Clusterware uses the voting disks to determine which nodes are members of the cluster. Since ASM was introduced as a place to store these files, they are also called VOTING FILES.
The primary function of the voting disks is to manage node membership and prevent the split-brain syndrome, in which 2 or more instances attempt to control the RAC database.
- These files can be stored either in ASM or on shared storage.
- If they are stored in ASM, there is no need to configure them manually: the files are created according to the redundancy of the ASM disk group.
- On shared storage, we need to configure these files manually, with redundancy, for high availability.
- We must have an odd number of voting disks.
- Oracle recommends a minimum of 3 and a maximum of 5. In 10g, Clusterware supports up to 32 voting disks, but 11gR2 supports up to 15.
- A node must be able to access more than half of the voting disks at any time. For example, if you have 5 voting disks, a node must be able to access at least 3 of them. If it cannot access that minimum, it is evicted (removed) from the cluster.
- All nodes in the RAC cluster register their heartbeat information in the voting disks/files. The RAC heartbeat is the polling mechanism sent over the cluster interconnect to ensure that all RAC nodes are available.
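The "more than half" rule above can be expressed as simple arithmetic. The following is a minimal sketch in Python, not anything Clusterware exposes; the function names are hypothetical and chosen only for illustration.

```python
def votes_required(total_disks: int) -> int:
    """Smallest number of voting disks that is strictly more than half."""
    if total_disks < 1:
        raise ValueError("a cluster needs at least one voting disk")
    return total_disks // 2 + 1


def stays_in_cluster(accessible_disks: int, total_disks: int) -> bool:
    """True if the node can still reach a strict majority of the voting disks."""
    return accessible_disks >= votes_required(total_disks)


def tolerated_disk_failures(total_disks: int) -> int:
    """How many voting disks can be lost before the majority is lost."""
    return total_disks - votes_required(total_disks)


if __name__ == "__main__":
    # Print: total disks, disks needed, disk failures tolerated.
    for total in (1, 2, 3, 4, 5):
        print(total, votes_required(total), tolerated_disk_failures(total))
```

Running it for 3 and 4 disks shows that the fourth disk does not raise the number of tolerated failures, which is one way to see why an odd count is preferred.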
How Voting Happens
The CKPT process updates the control file every 3 seconds in an operation known as the heartbeat.
CKPT writes to a single block that is unique to each instance, so inter-instance coordination is not required. This block is called the checkpoint progress record.
All members of the cluster attempt to obtain a lock on a control file record for updating.
The instance that obtains the lock tallies the votes from all members. The group membership must then conform to the decided (voted) membership before GCS/GES is allowed to proceed with reconfiguration. The control file record is then stored in the same block as the heartbeat, in the control file checkpoint progress record.
What are the NETWORK and DISK HEARTBEATS, and how are they registered in the VOTING DISKS/FILES
1. All nodes in the RAC cluster register their heartbeat information in the voting disks/files. The RAC heartbeat is the polling mechanism sent over the cluster interconnect to ensure that all RAC nodes are available. The voting disks/files are like an attendance register in which the nodes mark their attendance (heartbeats).
2. The CSSD process on every node makes entries in the voting disk to ascertain the membership of the node. While marking its own presence, each node also registers in the voting disk information about its ability to communicate with the other nodes. This is called the NETWORK HEARTBEAT.
3. The CSSD process on each RAC node maintains its heartbeat in a block one OS block in size, in the hot block of the voting disk, at a specific offset. The written block has a header area containing the node name. The heartbeat counter is incremented every second, on every write call. The heartbeats of the various nodes are thus recorded at different offsets in the voting disk. This process is called the DISK HEARTBEAT.
4. In addition to maintaining its own disk block, each CSSD process also monitors the disk blocks maintained by the CSSD processes of the other nodes in the cluster. Healthy nodes exchange continuous network and disk heartbeats. A break in the heartbeats indicates a possible error scenario.
5. If the disk block is not updated within a short timeout period, the node is considered unhealthy and may be rebooted to protect the database. In this case, a message to this effect is written in the KILL BLOCK of the node. Each node reads its KILL BLOCK once per second; if the kill block has been overwritten, the node commits suicide.
6. During reconfiguration (a node leaving or joining), CSSD monitors the heartbeat information of all nodes and determines whether each node has a disk heartbeat, including those with no network heartbeat. If no disk heartbeat is detected, the node is considered dead.
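The disk heartbeat and kill block steps above can be pictured with a toy model. This is only a sketch in Python under stated assumptions: the class names, the in-memory dictionaries, and the `timeout_polls` value are all hypothetical, and none of this reflects Oracle's actual on-disk format or CSSD code.

```python
from dataclasses import dataclass, field


@dataclass
class VotingDiskModel:
    """Toy in-memory stand-in for one voting disk (not Oracle's on-disk format)."""
    heartbeat: dict = field(default_factory=dict)   # node name -> heartbeat counter
    kill_block: dict = field(default_factory=dict)  # node name -> True once a kill message is written

    def write_heartbeat(self, node: str) -> None:
        # Each node only touches its own slot, like a block at its own offset.
        self.heartbeat[node] = self.heartbeat.get(node, 0) + 1

    def write_kill(self, node: str) -> None:
        self.kill_block[node] = True

    def must_shut_down(self, node: str) -> bool:
        # A node polls its own kill block and stops if a message was written.
        return self.kill_block.get(node, False)


class HeartbeatMonitor:
    """Counts how many polls have passed since each node's counter last moved."""

    def __init__(self, disk: VotingDiskModel, timeout_polls: int):
        self.disk = disk
        self.timeout_polls = timeout_polls  # hypothetical timeout, in poll intervals
        self.last_seen = {}
        self.stale_polls = {}

    def poll(self) -> list:
        """Return the nodes whose counter has not advanced for timeout_polls polls."""
        dead = []
        for node, counter in self.disk.heartbeat.items():
            if self.last_seen.get(node) == counter:
                self.stale_polls[node] = self.stale_polls.get(node, 0) + 1
            else:
                self.stale_polls[node] = 0
            self.last_seen[node] = counter
            if self.stale_polls[node] >= self.timeout_polls:
                dead.append(node)
        return dead


if __name__ == "__main__":
    disk = VotingDiskModel()
    monitor = HeartbeatMonitor(disk, timeout_polls=3)
    for tick in range(5):
        disk.write_heartbeat("node1")
        if tick < 2:                      # node2 stops writing after two ticks
            disk.write_heartbeat("node2")
        for node in monitor.poll():
            disk.write_kill(node)
    print(disk.must_shut_down("node1"), disk.must_shut_down("node2"))
```

The monitor never reads anything except the counters, which mirrors the idea that a node's health is judged purely by whether its block keeps changing.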
What Information is stored in VOTING DISK/FILE
It contains 2 types of data:
Static data: information about the nodes in the cluster
Dynamic data: disk heartbeat logging
It contains important details of cluster node membership, such as:
a. Which nodes are part of the cluster,
b. Which node is leaving the cluster, and
c. Which node is joining the cluster.
Purpose of the Voting Disk, or Why the Voting Disk Is Needed
Voting disks are used by Clusterware for health checks.
- Used by CSS to determine which nodes are currently members of the cluster.
- Used in concert with other cluster components such as CRS to shut down, fence, or reboot one or more nodes whenever network communication is lost between any nodes within the cluster. This prevents the split-brain condition, in which 2 or more instances attempt to control the RAC database, and thus protects the database.
- Used by CSS to arbitrate (make an authoritative decision) with peers that it cannot see over the private interconnect in the event of an outage, allowing it to salvage (rescue from loss) the largest fully connected sub-cluster for further operation. During this operation, node membership (NM) makes an entry in the voting disk to record its vote on availability. The other instances in the cluster do the same. The 3 configured voting disks also provide a method to determine which node in the cluster should survive.
Example: if eviction of one of the nodes is necessitated by unresponsiveness, the node that holds 2 voting disks will start evicting the other node. NM alternates its action between the heartbeat and the voting disk to determine the availability of the other nodes in the cluster.
Possible scenarios in Voting disks
As we now know, the voting disk is used by CSSD. It contains both the network and disk heartbeats from all nodes, and any break in the heartbeat results in eviction of the node from the cluster. The possible scenarios with missing heartbeats are:
1. The network heartbeat is successful, but the disk heartbeat is missed.
2. The disk heartbeat is successful, but the network heartbeat is missed.
3. Both heartbeats fail.
When the cluster has many nodes, a few more scenarios are possible:
1. The nodes split into N sets, with nodes communicating within their own set but not with members of the other sets.
2. Just one node becomes unhealthy. The nodes with quorum (the minimum number of nodes needed to make the cluster valid) maintain active membership of the cluster, and the other node(s) are fenced/rebooted.
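The split scenario can be sketched as a small selection problem: given the sets of nodes that can still talk to each other, keep the largest one and fence the rest. The Python below is an illustration only. The function names are hypothetical, and the tie-break (the set containing the lowest node number wins) is an assumption of this sketch, not something stated above.

```python
def surviving_subcluster(subclusters):
    """Pick the sub-cluster that keeps running after the interconnect splits.

    subclusters: list of non-empty, disjoint sets of node numbers. Nodes in
    one set can talk to each other but not to nodes in the other sets.
    Rule sketched here: the largest set wins; ties go to the set containing
    the lowest node number (the tie-break is an assumption of this sketch).
    """
    if not subclusters:
        raise ValueError("no sub-clusters given")
    return max(subclusters, key=lambda s: (len(s), -min(s)))


def nodes_to_fence(subclusters):
    """Return, in sorted order, every node outside the surviving sub-cluster."""
    survivor = surviving_subcluster(subclusters)
    return sorted(n for s in subclusters if s is not survivor for n in s)


if __name__ == "__main__":
    split = [{1, 2}, {3}]
    print(surviving_subcluster(split), nodes_to_fence(split))
```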
Why should we have an ODD number of voting disks?
A node must be able to access more than half of the voting disks at any time.
Example.
a. Consider a 2-node cluster with an even number of voting disks, say 2.
b. Suppose node 1 can access only voting disk 1.
c. Node 2 can access only voting disk 2.
d. In this situation there is no common file in which Clusterware can check the heartbeat of both nodes.
e. If we have 3 voting disks and both nodes can access more than half of them, i.e. 2 voting disks, there will be at least one disk that both nodes can access. Clusterware can use this disk to check the heartbeat of the nodes.
f. A node that cannot do so is evicted from the cluster by another node that can access more than half of the voting disks, to maintain the integrity of the cluster.
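The argument in steps a to f rests on one property: any two nodes that each see more than half of the disks must share at least one disk. A brute-force check of that property might look like the following Python sketch; the helper names are hypothetical and the disks are simply numbered from 1.

```python
from itertools import combinations


def majority_sets(total_disks):
    """All sets of disks a node could see while still holding a strict majority."""
    disks = range(1, total_disks + 1)
    need = total_disks // 2 + 1
    return [set(c) for k in range(need, total_disks + 1)
            for c in combinations(disks, k)]


def always_share_a_disk(views):
    """True if every pair of access views has at least one disk in common."""
    return all(a & b for a, b in combinations(views, 2))


if __name__ == "__main__":
    # Steps b and c: each node sees a different single disk out of 2.
    print(always_share_a_disk([{1}, {2}]))
    # Step e: with 3 disks, every pair of majority views overlaps.
    print(always_share_a_disk(majority_sets(3)))
```

With 2 disks the only majority view is both disks, so losing either one costs the majority. That is the same fault tolerance as a single disk, which is why the even count buys nothing.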
Where voting disks are stored
They can be stored on:
a. Raw devices
b. A cluster file system supported by Oracle RAC, such as OCFS, Sun Cluster, or Veritas Cluster File System
c. ASM disks (in 11gR2)
When the voting disk is stored in ASM, the question arises of how the voting file on ASM can be accessed when we want to add a new node to the cluster.
The answer:
Oracle ASM reserves several blocks at a fixed location on every Oracle ASM disk used for storing the voting files. As a result, Oracle Clusterware can access the voting disks in ASM even if the ASM instance is down, and CSS can continue to maintain the cluster even if ASM has failed. The physical location of the voting files on the ASM disks is fixed, i.e. the cluster stack does not rely on a running ASM instance to access the files.
If the voting disk is stored in ASM, the multiplexing of the voting disk is determined by the redundancy of the diskgroup.
Redundancy of the diskgroup    # of copies of voting disk    Minimum # of disks in the diskgroup
External                       1                             1
Normal                         3                             3
High                           5                             5
Commands to check the voting disk
crsctl query css votedisk    - for checking the file location
When to take a voting disk backup
1. After a fresh installation
2. After adding or deleting a node
Voting disk backup  (In 10g)
dd if=<voting-disk-path> of=<backup/path>
Voting disk restore (In 10g)
dd if=<backup/path> of=<voting-disk-path>
In 11gR2, the voting files are backed up automatically as part of the OCR. Oracle recommends NOT using the dd command to back up or restore them, as this can lead to loss of the voting disk.
Add/delete voting disk
crsctl add css votedisk <path>    - adds a new voting disk
crsctl delete css votedisk <path>    - deletes the voting disk