Voting Disks in Oracle 11gR2 RAC
Voting disks are an important component of
Oracle Clusterware. Clusterware uses the voting disks to determine which nodes are
members of the cluster. Since ASM was
introduced as a storage option for these files, they are also called VOTING FILES.
The primary function of voting disks is to
manage node membership and prevent the SPLIT-BRAIN syndrome, in which 2 or more
instances attempt to control the RAC database.
Ø These files can be stored either in ASM or on shared storage.
Ø If they are stored in ASM, no manual configuration is needed; the files
are created according to the redundancy of the ASM diskgroup.
Ø On a shared storage system, we need to configure these files manually,
with a redundant setup for high availability.
Ø We must have an odd number of disks.
Ø Oracle recommends a minimum of 3
and a maximum of 5. Clusterware
supports up to 32 voting disks in 10g, but only 15 in 11gR2.
Ø A node must be able to access more than half of the voting disks at
any time. For example, if you have 5 voting
disks, a node must be able to access at least 3 of them. If it cannot
access that minimum number of voting disks, it is evicted/removed from the
cluster.
Ø All nodes in the RAC cluster register their heartbeat information in
the voting disks/files. The RAC heartbeat is
the polling mechanism sent over the cluster interconnect to ensure all RAC
nodes are available.
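The "more than half" rule above can be sketched as a simple majority check. This is an illustrative sketch of the arithmetic only, not Oracle code:

```python
# Minimal sketch (not Oracle code): the strict-majority rule for voting disks.
def can_stay_in_cluster(total_disks: int, accessible_disks: int) -> bool:
    """A node survives only if it can access more than half of the voting disks."""
    return accessible_disks > total_disks / 2

# With 5 voting disks, a node needs at least 3 of them.
assert can_stay_in_cluster(5, 3)
assert not can_stay_in_cluster(5, 2)   # the node would be evicted
# With an even count (2 disks), one disk each is never a majority,
# which is one reason an odd number of disks is required.
assert not can_stay_in_cluster(2, 1)
```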
How Voting Happens
The CKPT process updates the control
file every 3 seconds in an operation known as the heartbeat.
CKPT writes to a single block that is
local to the node/instance, so intra-instance coordination is not
required. This block is called the checkpoint
progress record.
All members of the cluster attempt to
lock the control file record for updating.
The instance that obtains the lock
tallies the votes from all members.
The group membership must then conform to the decided (voted) membership before GCS/GES is allowed to proceed with
reconfiguration. The control file record is then stored in
the same block as the heartbeat - the control file
checkpoint progress record.
What are the NETWORK and DISK HEARTBEATS, and how are they registered in the VOTING
DISKS/FILES?
1. All nodes in the RAC cluster
register their heartbeat information in the voting disks/files. The RAC heartbeat is the polling mechanism that
is sent over the cluster interconnect to ensure all RAC
nodes are available. The voting disks/files are just
like an attendance register, where the nodes mark their attendance
(heartbeats).
2. The CSSD process on every node
makes entries in the voting disk to ascertain the membership of the node. While marking their own presence, all the
nodes also register information about their communicability with the other
nodes in the voting disk. This is called the
NETWORK HEARTBEAT.
3. The CSSD process on each node maintains its heartbeat
in a block of one OS-block size,
at a specific offset in the voting disk. The written block has a header area
with the node name. The heartbeat
counter increments on every one-second write call. Thus the heartbeats of the various nodes are recorded
at different offsets in the voting disk. This is called the DISK HEARTBEAT.
4. In addition to maintaining its
own disk block, each CSSD process also monitors the disk blocks maintained by the
CSSD processes of the other nodes in the cluster. Healthy nodes will have continuous
network & disk heartbeats exchanged between the nodes. A break in heartbeats indicates a possible error
scenario.
5. If the disk is not updated within a
short timeout period, the node is considered unhealthy and may be rebooted to
protect the database. In this case, a
message to this effect is written into the KILL
BLOCK of the node. Each node reads its KILL BLOCK once per second; if the kill
block is not overwritten, the node commits suicide (reboots itself).
6. During a reconfiguration (a node leaving or
joining), CSSD monitors all nodes' heartbeat information and determines which
nodes have a disk heartbeat, including those with no network heartbeat. If no disk heartbeat is detected, the node
is considered dead.
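Points 3 and 5 above can be sketched as follows. This is an illustrative simulation only: the block size, the 32-byte header, and the kill-block representation are assumptions, not Oracle's actual on-disk format.

```python
# Illustrative sketch of the disk heartbeat (point 3) and kill block (point 5).
# Block layout, header size, and kill-block encoding are assumptions.
BLOCK_SIZE = 512  # assume the OS block size is 512 bytes

def heartbeat_block(node: str, counter: int) -> bytes:
    """One node's heartbeat record: a node-name header plus a counter
    that CSSD increments on every one-second write."""
    header = node.encode().ljust(32, b"\x00")
    return (header + counter.to_bytes(8, "big")).ljust(BLOCK_SIZE, b"\x00")

# Each node writes at its own fixed offset in the voting disk.
offsets = {"node1": 0 * BLOCK_SIZE, "node2": 1 * BLOCK_SIZE}

def poll_kill_block(kill_blocks: dict, node: str) -> str:
    """A node reads its own kill block once per second; if an eviction
    message has been written there, it reboots itself."""
    return "reboot" if kill_blocks.get(node) else "keep running"

blk = heartbeat_block("node1", 42)
assert len(blk) == BLOCK_SIZE and blk.startswith(b"node1")

kill_blocks = {}
assert poll_kill_block(kill_blocks, "node2") == "keep running"
kill_blocks["node2"] = "evict: missed disk heartbeat"
assert poll_kill_block(kill_blocks, "node2") == "reboot"
```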
What Information is stored in the VOTING DISK/FILE
It contains 2 types of data.
Static data: information
about the nodes in the cluster
Dynamic data: disk
heartbeat logging
It contains the important details of
cluster node membership, such as:
a. which nodes are part of the
cluster,
b. which node is leaving the
cluster, and
c. which node is joining the
cluster.
Purpose of the Voting disk, or Why
the Voting disk is needed
Voting disks are used by the Clusterware for
health checks.
Ø Used by CSS to determine which nodes are currently members of the
cluster.
Ø Used, in concert with other cluster components such as CRS, to shut down, fence,
or reboot single or multiple nodes whenever network communication is
lost between any nodes within the cluster, preventing the split-brain condition
in which 2 or more instances attempt to
control the RAC database, and thus protecting the database.
Ø Used by CSS to arbitrate (take an authorized decision)
with peers that it is not able to see over the private interconnect in the
event of an outage, allowing it to salvage (rescue from loss) the largest
fully connected sub-cluster for further
operation. During this operation, node
membership (NM) makes an entry in the voting disk to record its vote on
availability. The other instances in the cluster do the same. The 3 voting disks configured also provide a
method to determine who in the cluster should survive.
Example: if
eviction of one of the nodes is necessitated by unresponsiveness, then the
node that holds 2 of the voting disks will start evicting the other node. NM alternates
its action between the heartbeat and the voting disk to determine the
availability of the other nodes in the cluster.
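The "largest fully connected sub-cluster" outcome described above can be sketched with a brute-force search. This is a hedged illustration of the arbitration result, not Oracle's actual algorithm:

```python
from itertools import combinations

# Hedged sketch (not Oracle's real algorithm): given each node's view of
# which nodes it can see, the largest fully connected sub-cluster survives
# and the remaining nodes are evicted.
def surviving_subcluster(votes: dict) -> set:
    """votes maps each node to the set of nodes it can see (itself included)."""
    nodes = list(votes)
    for size in range(len(nodes), 0, -1):      # try the largest groups first
        for group in combinations(nodes, size):
            if all(set(group) <= votes[n] for n in group):
                return set(group)
    return set()

votes = {
    "node1": {"node1", "node2", "node3"},
    "node2": {"node1", "node2", "node3"},
    "node3": {"node3"},                        # node3 lost the interconnect
}
assert surviving_subcluster(votes) == {"node1", "node2"}
```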
Possible scenarios with Voting disks
As we now know, the voting disk is used
by CSSD. It holds both the network & disk heartbeats from all nodes, and
any break in a heartbeat results in eviction of the node from the cluster. The
possible scenarios with missing heartbeats are:
1. The network heartbeat is
successful, but the disk heartbeat is missed.
2. The disk heartbeat is successful,
but the network heartbeat is missed.
3. Both heartbeats fail.
When a cluster involves many
nodes, a few more scenarios are possible:
1. The nodes split into N sets,
communicating within each set, but not with the members of the other
sets.
2. Just one node goes
unhealthy. The nodes with quorum (the minimum number of nodes required to make the
cluster valid) maintain active membership of the cluster, and the other node(s)
are fenced/rebooted.
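The three missed-heartbeat scenarios above can be mapped to outcomes as follows. The policy shown is a simplification for illustration; the actual CSSD behavior depends on timeouts and cluster-wide state:

```python
# Simplified sketch mapping the three missed-heartbeat scenarios to outcomes.
# This is an illustration, not the exact CSSD decision logic.
def classify(network_hb_ok: bool, disk_hb_ok: bool) -> str:
    if network_hb_ok and disk_hb_ok:
        return "healthy"
    if network_hb_ok and not disk_hb_ok:
        return "scenario 1: disk heartbeat missed"
    if disk_hb_ok and not network_hb_ok:
        return "scenario 2: network heartbeat missed (split-brain risk)"
    return "scenario 3: both heartbeats missed - node presumed dead"

assert classify(True, True) == "healthy"
assert classify(True, False).startswith("scenario 1")
assert classify(False, True).startswith("scenario 2")
assert classify(False, False).startswith("scenario 3")
```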
Why should we have an ODD number of voting disks?
A node must be able to access more than
half of the voting disks at any time.
Example:
a. Consider a 2-node cluster
with an even number of voting disks, say 2.
b. Node 1 is able to access
voting disk 1.
c. Node 2 is able to access voting
disk 2.
d. From the above steps, we see
that there is no common file where the Clusterware can check the heartbeat of both
nodes.
e. If we have 3 voting disks and
both nodes are able to access more than half, i.e. 2 voting disks, there
will be at least one disk accessible by both nodes. The
Clusterware can use this disk to check the heartbeat of the nodes.
f. A node not able to do so will
be evicted from the cluster by another node that has more than half the voting
disks, to maintain the integrity of the cluster.
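The pigeonhole argument in the example above can be checked mechanically: with an odd disk count, any two majorities must share at least one disk, so two surviving nodes always have a common file for heartbeat checks. A small sketch:

```python
from itertools import combinations

# Sketch of the odd-disk argument: enumerate every possible strict majority
# of the voting disks and verify that any two majorities overlap.
def majorities(disks: set) -> list:
    """All subsets of the disks that form a strict majority."""
    need = len(disks) // 2 + 1
    return [set(c) for k in range(need, len(disks) + 1)
            for c in combinations(disks, k)]

three = majorities({1, 2, 3})
# Every pair of majorities shares at least one disk - the common heartbeat file.
assert all(a & b for a, b in combinations(three, 2))
# With 2 disks, a single disk is never a majority, so two nodes holding one
# disk each have no common file at all.
assert {1} not in majorities({1, 2})
```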
Where voting disks are stored
They can be stored on:
a. Raw devices
b. A cluster file system supported
by Oracle RAC, such as OCFS, Sun Cluster, or Veritas Cluster Filesystem
c. ASM disks (in 11gR2)
When the voting disk is stored in ASM, a
question arises: how can the voting file on ASM be accessed when we want to
add a new node to the cluster?
The answer:
Oracle ASM reserves several blocks at a
fixed location on every Oracle ASM disk used for storing the voting files.
As a result, Oracle Clusterware can access the voting disks present in ASM even
if the ASM instance is down, and CSS can continue to maintain the Oracle cluster
even if ASM has failed. The physical
location of the voting files on ASM disks is fixed, i.e. the cluster stack
does not rely on a running ASM instance to access the files.
If the voting disk is stored in ASM,
the multiplexing of the voting disk is decided by the redundancy of the diskgroup.
Redundancy of the diskgroup | # of copies of voting disk | Minimum # of disks in the diskgroup
External | 1 | 1
Normal | 3 | 3
High | 5 | 5
Command to check the voting disk
crsctl query css votedisk    -- lists the voting disk file locations
When to take a voting disk backup
1. After a fresh installation
2. After adding/deleting a node
Voting disk backup (in 10g)
dd if=<voting-disk-path> of=<backup/path>
Voting disk restore (in 10g)
dd if=<backup/path> of=<voting disk path>
In 11gR2, the voting files are backed up
automatically as part of the OCR. Oracle
recommends NOT using the dd command to back up or restore, as this can lead to loss of
the voting disk.
Add/delete voting disk
crsctl add css votedisk <path>       -- adds a new voting disk
crsctl delete css votedisk <path>    -- deletes a voting disk