Node unable to join existing cluster due to (CPS) VXFEN vxfenconfig ERROR V-11-2-1043 Detected a preexisting split brain. Unable to join cluster.

Description

Error Message

/var/VRTSvcs/log/engine_A.log
-------------------------------------------------
2021/07/10 13:39:55 VCS CRITICAL V-16-1-10037 VxFEN driver not configured. Retrying...
2021/07/10 13:40:00 VCS CRITICAL V-16-1-10037 VxFEN driver not configured. Retrying...
2021/07/10 13:40:05 VCS NOTICE V-16-1-10322 System fred1002 (Node '0') changed state from LOCAL_BUILD to EXITED
2021/07/10 13:40:06 VCS NOTICE V-16-1-11059 GAB registration monitoring action set to log system message
2021/07/10 13:40:06 VCS NOTICE V-16-1-11057 GAB registration monitoring timeout set to 0 ms

/var/VRTSvcs/log/vxfen_A.log:
-----------------------------------------------
VXFEN vxfenconfig NOTICE Driver will use customized fencing - mechanism cps
Sat Jul 10 13:44:53 AEST 2021 cluster fencing. will retry. sleeping for 28
Sat Jul 10 13:45:21 AEST 2021 calling regular vxfenconfig
Sat Jul 10 13:45:46 AEST 2021 return value from above operation is 10
Sat Jul 10 13:45:46 AEST 2021 output was VXFEN vxfenconfig ERROR V-11-2-1043 Detected a preexisting split brain. Unable to join cluster.
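
A quick way to confirm that a joining node is hitting this condition is to search the fencing log for the message ID. The sketch below is a minimal, assumed helper (the function name last_split_brain_error is hypothetical, not a VCS command). It reads log text on standard input, prints the most recent line carrying V-11-2-1043, and returns non-zero if no such line exists.

```shell
# Hypothetical helper, not part of VCS: print the most recent log line that
# carries the pre-existing split-brain message ID, reading log text on stdin.
# Exits non-zero when the message ID is absent.
last_split_brain_error() {
    grep 'V-11-2-1043' | tail -n 1 | grep .
}
```

On the affected node it could be fed the fencing log quoted above, for example: last_split_brain_error < /var/VRTSvcs/log/vxfen_A.log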

Cause

Preexisting split-brain scenarios with coordinator disks are normally resolved by the administrator running the vxfenclearpre command. A similar solution is required in server-based fencing (CPS) using the cpsadm command.

Run the cpsadm command to clear a registration on a CP server:

# cpsadm -s cp_server -a unreg_node -u uuid -n nodeid

NOTES:

cp_server   The CP server's virtual IP address or virtual hostname
uuid        The UUID (Universally Unique ID) of the VCS cluster
nodeid      The node ID of the VCS cluster node

Ensure that fencing is not running on a node before clearing its registration on the CP server.

After removing all stale registrations, the joining node will be able to join the existing VCS cluster members.

Veritas Cluster Server (VCS) 5.1 introduced the cluster UUID (Universally Unique ID).

If the cluster UUID is missing on a system node, or does not match that of the other system nodes in the VCS cluster, VCS services will not be able to start on that system node.

VCS provides the uuidconfig.pl perl script to validate the cluster UUID across the system nodes in the VCS cluster:

# /opt/VRTSvcs/bin/uuidconfig.pl -clus -display -use_llthost

The cluster UUID (Universally Unique ID) can also be verified manually on each system node in the VCS cluster, by displaying the contents of the /etc/vx/.uuids/clusuuid file.
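
The manual check amounts to confirming that every node reports the same string. As a rough sketch, assuming the per-node values have already been collected into "hostname uuid" pairs, one per line, the comparison could look like the following. The collection method and the function name uuids_consistent are assumptions for illustration, not VCS tooling.

```shell
# Hypothetical helper: read "hostname uuid" pairs on stdin, one per line,
# and succeed only if exactly one distinct UUID value is seen.
uuids_consistent() {
    awk 'NF >= 2 { seen[$2] = 1 }
         END { n = 0; for (u in seen) n++; exit (n == 1 ? 0 : 1) }'
}
```

A node whose clusuuid content differs from the rest makes the function return non-zero, which is the situation in which VCS would refuse to start on that node.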

To correlate the VCS node name with the "Node ID" number for a specific cluster, type the following command:

# cpsadm -s ##.##.##.## -a list_nodes -c barney

Sample output:

ClusterName   UUID                                     Hostname(Node ID)   Registered
===========   ======================================   =================   ==========
barney        {ce46ea48-1dd1-11b2-906e-00144ffda96c}   fred1004(1)         1
barney        {ce46ea48-1dd1-11b2-906e-00144ffda96c}   fred1001(2)         0
barney        {ce46ea48-1dd1-11b2-906e-00144ffda96c}   fred1003(3)         1
barney        {ce46ea48-1dd1-11b2-906e-00144ffda96c}   fred1000(4)         1
barney        {ce46ea48-1dd1-11b2-906e-00144ffda96c}   fred1002(0)         1
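
Reading the node ID off this listing by eye is error-prone when several nodes are involved. The sketch below shows one possible way to look up the ID for a hostname, assuming the list_nodes output keeps the hostname(id) form in its third column as in the sample. The function name node_id_for_host is hypothetical.

```shell
# Hypothetical helper: print the node ID for hostname $1, reading
# "cpsadm ... -a list_nodes" output on stdin. Exits non-zero if not found.
node_id_for_host() {
    awk -v host="$1" '
        # Data rows look like: cluster {uuid} hostname(id) registered
        $3 ~ /^[^()]+\([0-9]+\)$/ {
            split($3, parts, /[()]/)
            if (parts[1] == host) { print parts[2]; found = 1 }
        }
        END { exit (found ? 0 : 1) }
    '
}
```

It could be used as: cpsadm -s cp_server -a list_nodes -c cluster_name | node_id_for_host fred1002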

Figure 3.0

[Illustration: a single CPS connecting to multiple client clusters]

Listing the membership of nodes in the VCS cluster

To list the membership of nodes in the VCS cluster, type the following command:

# cpsadm -s cp_server -a list_membership -c cluster_name

Registering and unregistering a node

To register a node, type the following command:

# cpsadm -s cp_server -a reg_node -u uuid -n nodeid

To unregister a node, type the following command:

# cpsadm -s cp_server -a unreg_node -u uuid -n nodeid

Figure 4.0

VCS configuration files for CP server and CP clients

Sample commands

Even after unregistering and re-registering VCS node 0 (fred1002), fencing could not be started.

# cpsadm -s ##.##.##.## -a unreg_node -c barney -u {ce46ea48-1dd1-11b2-906e-00144ffda96c} -h fred1002 -n 0
CPS ERROR V-97-1400-403 Node 0 (fred1002) is already unregistered.

# cpsadm -s ##.##.##.## -a reg_node -c barney -u {ce46ea48-1dd1-11b2-906e-00144ffda96c} -h fred1002 -n 0
CPS INFO V-97-1400-396 Node 0 (fred1002) successfully registered

# cpsadm -s ##.##.##.## -a list_membership -c barney
CPS INFO V-97-1400-418 List of registered nodes: 0 1 2 3 4

Once fencing is started on node 0 (fred1002), node 0 is dropped from the list_membership output:

# cpsadm -s ##.##.##.## -a list_membership -c barney
CPS INFO V-97-1400-418 List of registered nodes: 1 2 3 4
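
To check for a particular node without reading the list by eye, the membership line can be split into one ID per line and matched exactly. This is a minimal sketch that assumes the "List of registered nodes:" wording shown in the sample output. The function name node_in_membership is hypothetical.

```shell
# Hypothetical helper: succeed if node ID $1 appears in the
# "List of registered nodes:" line of list_membership output read on stdin.
node_in_membership() {
    sed -n 's/.*List of registered nodes:[[:space:]]*//p' |
        tr -s ' ' '\n' |
        grep -qx -- "$1"
}
```

For example: cpsadm -s cp_server -a list_membership -c cluster_name | node_in_membership 0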

Complex fencing related issues
---------------------------------------------------

To resolve the most complex fencing-related cases, all the nodes in the cluster need to be stopped using "hastop -all".

It is critical that fencing is also stopped on all the VCS servers, so that stale CP registrations can be cleared.

All the nodes (CP clients) also need to be unregistered from the CP server before attempting to start fencing.

Cluster attribute "UseFence"

Where VCS reads the fencing setting from depends on the type of diskgroup:

* For local failover diskgroups, VCS uses the 'UseFence' cluster attribute in the /etc/VRTSvcs/conf/config/main.cf file

* For shared diskgroups, VCS uses the 'vxfen_mode' attribute in the /etc/vxfenmode file

Resolution

To resolve the issue in this instance, the complete VCS cluster and, most importantly, fencing had to be stopped on all nodes.

If the servers have been up and running for extended periods of time, take the opportunity to reboot the nodes.

Example:

NOTE: To minimize overall downtime, a single node, fred1001, is rebooted first, while the 3 remaining nodes stay in the cluster.

Once fred1001 has restarted, we can continue with the process.

1.] # hastop -all

2.] Unregister the CP clients (VCS nodes)

# cpsadm -s ##.##.##.## -a unreg_node -c barney -u {ce46ea48-1dd1-11b2-906e-00144ffda96c} -h fred1001 -n 2

# cpsadm -s ##.##.##.## -a unreg_node -c barney -u {ce46ea48-1dd1-11b2-906e-00144ffda96c} -h fred1000 -n 4

# cpsadm -s ##.##.##.## -a unreg_node -c barney -u {ce46ea48-1dd1-11b2-906e-00144ffda96c} -h fred1002 -n 0

# cpsadm -s ##.##.##.## -a unreg_node -c barney -u {ce46ea48-1dd1-11b2-906e-00144ffda96c} -h fred1003 -n 3

# cpsadm -s ##.##.##.## -a unreg_node -c barney -u {ce46ea48-1dd1-11b2-906e-00144ffda96c} -h fred1004 -n 1
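
Typing one unreg_node command per node invites mismatched hostname and node ID pairs. A possible safeguard, sketched below under the assumption that list_nodes output has the column layout shown earlier in this article, is to generate the commands from that output and review them before running anything. The function name print_unreg_commands is hypothetical, and the function only prints text; it does not contact the CP server.

```shell
# Hypothetical helper: read "cpsadm ... -a list_nodes" output on stdin and
# print one unreg_node command per node for CP server $1. Nothing is executed.
print_unreg_commands() {
    awk -v server="$1" '
        # Data rows look like: cluster {uuid} hostname(id) registered
        $3 ~ /^[^()]+\([0-9]+\)$/ {
            split($3, parts, /[()]/)
            printf "cpsadm -s %s -a unreg_node -c %s -u %s -h %s -n %s\n",
                   server, $1, $2, parts[1], parts[2]
        }
    '
}
```

Because the hostname and node ID come from the same row, each printed command carries a matching pair.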

Confirm the "Registered" state reflects "0" for all nodes (unregistered):

# cpsadm -s ##.##.##.## -a list_nodes -c barney

ClusterName   UUID                                     Hostname(Node ID)   Registered
===========   ======================================   =================   ==========
barney        {ce46ea48-1dd1-11b2-906e-00144ffda96c}   fred1004(1)         0
barney        {ce46ea48-1dd1-11b2-906e-00144ffda96c}   fred1001(2)         0
barney        {ce46ea48-1dd1-11b2-906e-00144ffda96c}   fred1003(3)         0
barney        {ce46ea48-1dd1-11b2-906e-00144ffda96c}   fred1000(4)         0
barney        {ce46ea48-1dd1-11b2-906e-00144ffda96c}   fred1002(0)         0
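
The same confirmation can be scripted. The sketch below assumes the Registered flag is the fourth whitespace-separated column of each data row, as in the listing above, and prints any node that is still registered. The function name still_registered is hypothetical.

```shell
# Hypothetical helper: read "cpsadm ... -a list_nodes" output on stdin and
# print the hostname(id) of every node whose Registered column is not 0.
still_registered() {
    awk '$3 ~ /^[^()]+\([0-9]+\)$/ && $4 != 0 { print $3 }'
}
```

Empty output means every node is unregistered and it is safe to move on to the next step.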

3.] Once all the CP clients are unregistered from the CP server, proceed to start LLT manually

fred1001 $ gabconfig -a
GAB Port Memberships
===============================================================

fred1001 $ gabconfig -c -x
GAB gabconfig ERROR V-15-2-25015 LLT not configured

fred1001 $ cd /etc
fred1001 $ ls -l llt*
-rw-r--r-- 1 root root 55 Jul 28 2019 llthosts
-rw-r--r-- 1 root root 455 Jul 28 2019 llttab.save

Rename the /etc/llttab.save file back to /etc/llttab

fred1001 $ mv /etc/llttab.save /etc/llttab

The following are Solaris-related commands (svcs and svcadm):

fred1001 $ svcs -a |grep llt
disabled 21:40:34 svc:/system/llt:default

fred1001 $ svcadm enable svc:/system/llt:default

fred1001 $ svcs -a |grep llt
offline* 23:12:31 svc:/system/llt:default

fred1001 $ svcs -a |grep llt
online 23:12:41 svc:/system/llt:default

Then start GAB manually:

fred1001 $ gabconfig -c -x
Started gablogd
gablogd: Keeping 20 log files of 8388608 bytes each in |/var/adm/gab_ffdc| directory. Daemon log size limit 8388608 bytes

4.] Start fencing

fred1001 $ vxfenconfig -c
Log Buffer: 0x7073e090

VXFEN vxfenconfig NOTICE Driver will use customized fencing - mechanism cps

5.] Start VCS

fred1001 $ hastart
fred1001 $ hastatus -sum

-- SYSTEM STATE
-- System State Frozen

A fred1002             UNKNOWN              0
A fred1000             UNKNOWN              0
A fred1001             RUNNING              0
A fred1003             UNKNOWN              0
A fred1004             UNKNOWN              0

6.] Repeat steps 3-5 on the next node

Eventually the other node should be seen joining the cluster, with the various VCS ports seeding:

fred1001 $ gabconfig -a
GAB Port Memberships
===============================================================
Port a gen 281aa03 membership 0 2
Port b gen 281aa08 membership 0 2

fred1001 $ gabconfig -a
GAB Port Memberships
===============================================================
Port a gen 281aa03 membership 0 2
Port b gen 281aa05 membership ; 2
Port b gen 281aa05 visible 0
Port h gen 281aa01 membership ; 2
Port h gen 281aa01 visible 0

The cluster is now running with 2 nodes:

fred1001 $ hastatus -sum

-- SYSTEM STATE
-- System State Frozen

A fred1002 RUNNING 0
A fred1000 UNKNOWN 0
A fred1001 RUNNING 0
A fred1003 UNKNOWN 0
A fred1004 UNKNOWN 0

7.] Repeat steps 3-5 on the remaining nodes until all nodes have joined the cluster

fred1001 $ hastatus -sum

-- SYSTEM STATE
-- System State Frozen

A fred1002 RUNNING 0
A fred1000 RUNNING 0
A fred1001 RUNNING 0
A fred1003 RUNNING 0
A fred1004 RUNNING 0
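
While repeating the steps node by node, it can help to list which systems have not yet reached RUNNING. A minimal sketch, assuming the "A <system> <state> <frozen>" row layout of the hastatus -sum output shown above (the function name systems_not_running is hypothetical):

```shell
# Hypothetical helper: read "hastatus -sum" output on stdin and print the
# name of every system whose state is anything other than RUNNING.
systems_not_running() {
    awk '$1 == "A" && $3 != "RUNNING" { print $2 }'
}
```

For example: hastatus -sum | systems_not_running. Empty output corresponds to the final state above, with all five nodes RUNNING.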

Process complete.

Issue/Introduction

A single VCS node is unable to join the existing VCS cluster due to (CPS) VXFEN vxfenconfig ERROR V-11-2-1043 Detected a preexisting split brain. Unable to join cluster.

The Coordination Point Server (CPS) is a software solution running on a remote system or cluster that provides arbitration functionality to client cluster nodes.

Figure 1.0

CPS functions as another fencing mechanism (vehicle) that integrates within the existing VCS I/O fencing module.

In this example, the VCS cluster consists of 5 nodes:

fred1001 $ hastatus -sum

-- SYSTEM STATE
-- System State Frozen

A fred1002 FAULTED 0   <<< this node cannot join the cluster
A fred1000 RUNNING 0
A fred1001 RUNNING 0
A fred1003 RUNNING 0
A fred1004 RUNNING 0

The CPS process vxcpserv interacts with the customized fencing framework on the client cluster.

Figure 2.0

The CP server details are recorded in the /etc/vxfentab on each CP client (VCS node):

fred1004 $ cat /etc/vxfentab
#
# /etc/vxfentab:
# DO NOT MODIFY this file as it is generated by the
# VXFEN rc script from the file /etc/vxfenmode.
#
security=0
single_cp=1
[##.###.###.###]:14250

NOTE: TCP/IP sockets on the default port 14250 are used for communication. ##.###.###.### masks the actual IP address of the CP server.
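
To pull the CP server address and port out of this file without editing it, the bracketed entry can be matched directly. The sketch below assumes each coordination point appears on its own line in the [address]:port form shown above. The function name cp_servers_from_vxfentab is hypothetical.

```shell
# Hypothetical helper: read /etc/vxfentab content on stdin and print
# "address port" for every line of the form [address]:port.
cp_servers_from_vxfentab() {
    sed -n 's/^\[\(.*\)]:\([0-9][0-9]*\)$/\1 \2/p'
}
```

For example: cp_servers_from_vxfentab < /etc/vxfentab. Comment lines and settings such as security=0 are ignored because they do not match the pattern.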
