Majority-based fencing may cause an unexpected panic during a network partition in a 2-node cluster

Description

Error Message

From the /var/log/messages file on the leader node (nodeA):

Jun 24 09:34:20 nodeA kernel: LLT INFO V-14-1-10205 link 1 (ens161) node 1 in trouble
Jun 24 09:34:22 nodeA kernel: LLT INFO V-14-1-10205 link 0 (ens256) node 1 in trouble
Jun 24 09:34:27 nodeA kernel: GAB INFO V-15-1-20036 Port b[VxFen (refcount 2)] gen fee60c membership 0
Jun 24 09:34:27 nodeA kernel: GAB INFO V-15-1-20036 Port a[GAB_Control (refcount 1)] gen fee609 membership 0
Jun 24 09:34:27 nodeA kernel: GAB INFO V-15-1-20036 Port d[GAB_LEGACY_CLIENT (refcount 0)] gen fee608 membership 0
Jun 24 09:34:27 nodeA kernel: VXFEN INFO V-11-1-80 RACER Node is: 0
Jun 24 09:34:27 nodeA kernel: VXFEN INFO V-11-1-100 Current LBOLT: 4306245438
Jun 24 09:34:27 nodeA kernel: VXFEN INFO V-11-1-87 Initiating VxFen Race
Jun 24 09:34:27 nodeA kernel: VXFEN INFO V-11-1-111 VxFen Pre-Race Delay: 0
Jun 24 09:34:27 nodeA kernel: VXFEN INFO V-11-1-119 LEADER Node : 0 is in current sub-cluster
Jun 24 09:34:27 nodeA kernel: VXFEN INFO V-11-1-88 RACER Node won the VxFen race
Jun 24 09:34:27 nodeA kernel: VXFEN INFO V-11-1-112 VxFen Post-Race Delay: 0
Jun 24 09:34:27 nodeA kernel: VXFEN INFO V-11-1-90 Sending WON_RACE
Jun 24 09:34:27 nodeA kernel: VXFEN INFO V-11-1-67 call to VM ioctl VOL_CLEAR_PR returned non-zero
Jun 24 09:34:27 nodeA kernel: VXFEN INFO V-11-1-84 Completed Fencing Operation.
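
A quick way to check how the race ended on a node is to search its syslog for the VXFEN win message shown above. The following is a minimal sketch, not a Veritas tool: the function name race_outcome is hypothetical, and it assumes that the message text matches the excerpt and that the log is written to /var/log/messages.

```shell
# race_outcome [LOGFILE]
# Hypothetical helper: reports whether the given syslog file contains the
# "RACER Node won the VxFen race" message seen in the excerpt above.
race_outcome() {
    logfile=${1:-/var/log/messages}
    if grep -q 'VXFEN.*RACER Node won the VxFen race' "$logfile"; then
        echo "won"
    else
        echo "no win recorded"
    fi
}
```

On the leader node in this example, running race_outcome with no argument would be expected to report "won", since the excerpt contains message V-11-1-88.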

Cause

How majority-based I/O fencing works

When a network partition happens, one node in each sub-cluster is elected as the racer node, while the other nodes are designated as spectator nodes.

Because majority-based fencing does not use coordination points, sub-clusters do not engage in an actual race to decide the winner after a split-brain scenario.

The sub-cluster that holds the majority of nodes survives, while the nodes in the remaining sub-clusters are taken offline.

The following algorithm is used to decide the winner sub-cluster:

The node with the lowest node ID in the current GAB membership, before the network partition, is designated as the leader node in the fencing race.
When a network partition occurs, the racer node in each sub-cluster counts the nodes in its own partition and compares that count with the number of nodes in the leaving partition.
If a racer finds that its partition does not have majority, it sends a LOST_RACE message to all the nodes in its partition, including itself, and all of those nodes panic.

On the other hand, if the racer finds that its partition does have majority, it sends a WON_RACE message to all the nodes in its partition. Thus, the partition with the majority of nodes survives.

Deciding cluster majority for majority-based I/O fencing mechanism

Considerations to decide cluster majority in the event of a network partition:

Odd number of cluster nodes in the current membership: One sub-cluster gets majority upon a network split.
Even number of cluster nodes in the current membership:

In case of an even network split, both sub-clusters have an equal number of nodes. The partition that contains the leader node is treated as the majority, and that partition survives.

In case of an uneven network split, where one sub-cluster has more nodes than the others, the larger sub-cluster has majority and survives.
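
The rules above can be sketched as a small shell function. This is only an illustration of the decision as described in this article, not the VxFEN implementation: the function name decide_race and its arguments are hypothetical, and it assumes a split into exactly two partitions (the racer's own and the leaving one).

```shell
# decide_race MY_COUNT OTHER_COUNT LEADER_HERE
#   MY_COUNT     nodes in the racer's own partition
#   OTHER_COUNT  nodes in the leaving partition
#   LEADER_HERE  "yes" if the leader node is in the racer's partition
# Prints the message the racer would send: WON_RACE or LOST_RACE.
decide_race() {
    my_count=$1
    other_count=$2
    leader_here=$3
    if [ "$my_count" -gt "$other_count" ]; then
        echo WON_RACE     # uneven split: the larger partition has majority
    elif [ "$my_count" -eq "$other_count" ] && [ "$leader_here" = yes ]; then
        echo WON_RACE     # even split: the leader's partition is treated as majority
    else
        echo LOST_RACE    # smaller partition, or even split without the leader
    fi
}

decide_race 1 1 yes   # 2-node cluster, leader's side: prints WON_RACE
decide_race 1 1 no    # 2-node cluster, other side:    prints LOST_RACE
decide_race 2 1 no    # 3-node cluster, 2-node side:   prints WON_RACE
```

The first two calls correspond to a 2-node cluster: every split is an even split, so the node that is not the leader always ends up on the losing side.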

Resolution

The following configuration and steps reproduce this by-design behavior in a 2-node cluster that uses majority-based I/O fencing:

The cluster consists of two nodes.

[root@nodeA ~]# hastatus -sum

-- SYSTEM STATE
-- System               State                Frozen

A  nodea                RUNNING              0
A  nodeb                RUNNING              0

-- GROUP STATE
-- Group           System               Probed     AutoDisabled    State

B  SG1             nodea                Y          N               ONLINE
B  SG1             nodeb                Y          N               OFFLINE
B  cvm             nodea                Y          N               ONLINE
B  cvm             nodeb                Y          N               ONLINE

The leader node in this instance is "nodea" because it has the lowest node ID:

[root@nodeA ~]# cat /etc/llthosts
0 nodea
1 nodeb
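
The leader can be read from /etc/llthosts by picking the entry with the smallest node ID. A minimal sketch follows. It assumes the two-column "ID name" layout shown above and that every listed node is in the current membership. The function name leader_from_llthosts is hypothetical.

```shell
# leader_from_llthosts [FILE]
# Hypothetical helper: prints the name of the node with the lowest node ID
# in an llthosts-style file, where each line is "<node-id> <node-name>".
leader_from_llthosts() {
    file=${1:-/etc/llthosts}
    awk 'NF >= 2 && (!seen || $1 + 0 < min) { min = $1 + 0; name = $2; seen = 1 }
         END { if (seen) print name }' "$file"
}
```

With the file shown above, the function would print nodea.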

Sample main.cf:

[root@nodeA ~]# cat /etc/VRTSvcs/conf/config/main.cf
include "OracleASMTypes.cf"
include "types.cf"
include "CFSTypes.cf"
include "CRSResource.cf"
include "CSSD.cf"
include "CVMTypes.cf"
include "Db2udbTypes.cf"
include "MultiPrivNIC.cf"
include "OracleTypes.cf"
include "PrivNIC.cf"
include "SybaseTypes.cf"

cluster ia74clus (
    UserNames = { admin = gNOgNInKOjOOmWOiNL }
    Administrators = { admin }
    UseFence = SCSI3
    HacliUserLevel = COMMANDROOT
    )

system nodea (
    )

system nodeb (
    )

group SG1 (
    SystemList = { nodea = 0, nodeb = 1 }
    AutoStartList = { nodea, nodeb }
    )

    Application app (
        StartProgram = "/usr/bin/perl /root/application/start.pl"
        StopProgram = "/usr/bin/perl /root/application/stop.pl"
        MonitorProgram = "/usr/bin/perl /root/application/monitor.pl"
        )

    // resource dependency tree
    //
    //    group SG1
    //    {
    //    Application app
    //    }

group cvm (
    SystemList = { nodea = 0, nodeb = 1 }
    AutoFailOver = 0
    Parallel = 1
    AutoStartList = { nodea, nodeb }
    )

    CFSfsckd vxfsckd (
        )

    CVMCluster cvm_clus (
        CVMClustName = ia74clus
        CVMNodeId = { nodea = 0, nodeb = 1 }
        CVMTransport = gab
        CVMTimeout = 200
        )

    CVMVxconfigd cvm_vxconfigd (
        Critical = 0
        CVMVxconfigdArgs = { syslog }
        )

    ProcessOnOnly vxattachd (
        Critical = 0
        PathName = "/bin/sh"
        Arguments = "- /usr/lib/vxvm/bin/vxattachd root"
        RestartLimit = 3
        )

    cvm_clus requires cvm_vxconfigd
    vxfsckd requires cvm_clus

    // resource dependency tree
    //
    //    group cvm
    //    {
    //    ProcessOnOnly vxattachd
    //    CFSfsckd vxfsckd
    //        {
    //        CVMCluster cvm_clus
    //            {
    //            CVMVxconfigd cvm_vxconfigd
    //            }
    //        }
    //    }

Application Perl Scripts:

A 100-second sleep delay has been added to the Perl script that stops the application resource.

# ls /root/application/*.pl
/root/application/monitor.pl /root/application/start.pl /root/application/stop.pl

# cat /root/application/stop.pl
#!/usr/bin/perl

# Simulate a slow application stop by waiting 100 seconds.
sleep(100);
# Remove the flag file that monitor.pl checks; add any further steps here, if required.
$str = "rm -rf /tmp/sampleapp";
system($str);

# cat /root/application/monitor.pl
#!/bin/sh
APPLICATION_IS_ONLINE=110
APPLICATION_IS_OFFLINE=100
if [ -f /tmp/sampleapp ] ; then # add any steps, if required
    exit $APPLICATION_IS_ONLINE
else
    exit $APPLICATION_IS_OFFLINE
fi

# cat /root/application/start.pl
#!/usr/bin/perl
system("touch /tmp/sampleapp");

Steps:

1. Stop the cluster on the leader node

[root@nodeA ~]# hastop -local

2. Disconnect the LLT links on the leader node

[root@nodeA ~]# lltstat -nvv active
LLT node information:
    Node         State    Link    Status  Address
   * 0 nodea     OPEN
                          ens256  UP      00:50:56:05:DF:BC
                          ens161  UP      00:50:56:05:DF:BD
     1 nodeb     OPEN
                          ens256  UP      00:50:56:05:E0:41
                          ens161  UP      00:50:56:05:A3:1D

[root@nodeA ~]# lltconfig -u ens161 ; lltconfig -u ens256

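With both LLT links down, nodea, as the leader node, wins the fencing race (see the log excerpt under Error Message), and nodeb, the node with the highest node ID, panics by design.

NOTE: It is recommended that majority-based I/O fencing be implemented in clusters with an odd number of nodes.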

Issue/Introduction

When majority-based I/O fencing is configured in a 2-node Veritas cluster, the node with the highest node ID may panic by design when a network partition occurs.

About majority-based fencing

Majority-based fencing is a fencing mechanism that does not need coordination points. It does not require shared disks or free nodes to serve as coordination points or servers.

The mechanism provides a reliable arbitration method that requires no extra hardware setup, such as CP Servers or shared SCSI3 disks.

Figure 1.0

In a split-brain scenario, arbitration is based on which sub-cluster holds the majority of nodes.

Majority-based fencing:

If N is the total number of nodes in the cluster (current membership), then majority is N/2 + 1, where N/2 is rounded down to a whole number.
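
As a worked example, the formula can be evaluated for a few cluster sizes. The sketch below assumes that N/2 is an integer division, which is the reading under which a 3-node cluster has a majority of 2. The function name majority is hypothetical.

```shell
# majority N
# Prints the number of nodes needed for majority: N/2 + 1 (integer division).
majority() {
    echo $(( $1 / 2 + 1 ))
}

for n in 2 3 4 5; do
    echo "N=$n majority=$(majority "$n")"
done
# Prints:
# N=2 majority=2
# N=3 majority=2
# N=4 majority=3
# N=5 majority=3
```

For N=2 the majority is 2, so after a split neither single-node partition can reach it on node count alone. This is why every partition of a 2-node cluster is decided by the leader-node tie rule.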

Leader node:

The node with the lowest node ID in the cluster (before the split brain happened) is called the leader node.

The leader node matters in case of a tie: the sub-cluster that contains it survives.
