Node unable to join existing cluster due to (CPS) VXFEN vxfenconfig ERROR V-11-2-1043 Detected a preexisting split brain. Unable to join cluster.


Article ID: 100050843


Description

Error Message

 

/var/VRTSvcs/log/engine_A.log
-------------------------------------------------
2021/07/10 13:39:55 VCS CRITICAL V-16-1-10037 VxFEN driver not configured. Retrying...
2021/07/10 13:40:00 VCS CRITICAL V-16-1-10037 VxFEN driver not configured. Retrying...
2021/07/10 13:40:05 VCS NOTICE V-16-1-10322 System fred1002 (Node '0') changed state from LOCAL_BUILD to EXITED
2021/07/10 13:40:06 VCS NOTICE V-16-1-11059 GAB registration monitoring action set to log system message
2021/07/10 13:40:06 VCS NOTICE V-16-1-11057 GAB registration monitoring timeout set to 0 ms

 

/var/VRTSvcs/log/vxfen_A.log:
-----------------------------------------------
VXFEN vxfenconfig NOTICE Driver will use customized fencing - mechanism cps
Sat Jul 10 13:44:53 AEST 2021 cluster fencing. will retry. sleeping for 28
Sat Jul 10 13:45:21 AEST 2021 calling regular vxfenconfig
Sat Jul 10 13:45:46 AEST 2021 return value from above operation is 10
Sat Jul 10 13:45:46 AEST 2021 output was VXFEN vxfenconfig ERROR V-11-2-1043 Detected a preexisting split brain. Unable to join cluster.
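
A quick way to confirm that a node is hitting this specific condition is to search the fencing log for the message ID. The following is a minimal sketch only: the helper name has_preexisting_split_brain and the VXFEN_LOG variable are made up for this example, and the default path is simply the log file quoted above.

```shell
#!/bin/sh
# has_preexisting_split_brain (hypothetical helper): succeed if the given
# log file contains the V-11-2-1043 message quoted above.
has_preexisting_split_brain() {
    grep -q 'V-11-2-1043' "$1"
}

# Assumed log location, taken from the excerpt above; override if it differs.
VXFEN_LOG=${VXFEN_LOG:-/var/VRTSvcs/log/vxfen_A.log}

if [ -r "$VXFEN_LOG" ] && has_preexisting_split_brain "$VXFEN_LOG"; then
    echo "preexisting split brain reported in $VXFEN_LOG"
else
    echo "V-11-2-1043 not found in $VXFEN_LOG"
fi
```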

 

Cause

 

Preexisting split-brain scenarios with coordinator disks are normally resolved by the administrator running the vxfenclearpre command. A similar solution is required in server-based fencing (CPS) using the cpsadm command.


Run the cpsadm command to clear a registration on a CP server:
 

# cpsadm -s cp_server -a unreg_node -u uuid -n nodeid
 

NOTES:

cp_server            The CP server's virtual IP address or virtual hostname

uuid                    The UUID (Universally Unique ID) of VCS cluster

nodeid                 The node ID of the VCS cluster node

 

Ensure that fencing is not running on a node before clearing its registration on the CP server.

After removing all stale registrations, the joiner node will now be able to join the existing VCS cluster members.


Veritas Cluster Server (VCS) 5.1 introduced the cluster UUID (Universally Unique ID).

If the cluster UUID is not present on a system node, or does not match that of the other system nodes in the VCS cluster, VCS services will not be able to start on that node.


VCS provides the uuidconfig.pl perl script to validate the cluster UUID across the system nodes in the VCS cluster:

# /opt/VRTSvcs/bin/uuidconfig.pl -clus -display -use_llthost


The cluster UUID (Universally Unique ID) can also be verified manually on each system node in the VCS cluster by displaying the contents of the /etc/vx/.uuids/clusuuid file.
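
The manual comparison can be sketched as a small script. This is an illustration under stated assumptions, not a replacement for uuidconfig.pl: the helper name uuids_consistent is hypothetical, the input is assumed to be one "hostname uuid" pair per line, and how the pairs are collected from each node's /etc/vx/.uuids/clusuuid file is left to the administrator.

```shell
#!/bin/sh
# uuids_consistent (hypothetical helper): read "hostname uuid" pairs on stdin
# and succeed only if every host reports the same UUID.
uuids_consistent() {
    distinct=$(awk 'NF >= 2 { print $2 }' | sort -u | wc -l | tr -d ' ')
    [ "$distinct" -eq 1 ]
}

# Illustrative input only; on a live cluster each UUID would be read from
# /etc/vx/.uuids/clusuuid on the corresponding node.
if printf '%s\n' \
    'fred1001 {ce46ea48-1dd1-11b2-906e-00144ffda96c}' \
    'fred1002 {ce46ea48-1dd1-11b2-906e-00144ffda96c}' | uuids_consistent
then
    echo "cluster UUID matches on all listed nodes"
else
    echo "cluster UUID mismatch or no UUIDs supplied"
fi
```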

 

To correlate the VCS node name to the "Node ID" number for a specific cluster, type the following command:


Sample output
 

# cpsadm -s ##.##.##.## -a list_nodes -c barney

ClusterName      UUID                               Hostname(Node ID) Registered
===========   ===================================   ================  ===========
barney   {ce46ea48-1dd1-11b2-906e-00144ffda96c}  fred1004(1)       1
barney   {ce46ea48-1dd1-11b2-906e-00144ffda96c}  fred1001(2)       0
barney   {ce46ea48-1dd1-11b2-906e-00144ffda96c}  fred1003(3)       1
barney   {ce46ea48-1dd1-11b2-906e-00144ffda96c}  fred1000(4)       1
barney   {ce46ea48-1dd1-11b2-906e-00144ffda96c}  fred1002(0)       1
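
When scripting against this output, the hostname to node ID mapping can be extracted with awk. The sketch below works on a copy of the sample output above; the helper name node_id_for_host is hypothetical, and the column layout is assumed to match the sample exactly.

```shell
#!/bin/sh
# Sample text copied from the list_nodes output above.
LIST_NODES_SAMPLE='ClusterName      UUID                               Hostname(Node ID) Registered
===========   ===================================   ================  ===========
barney   {ce46ea48-1dd1-11b2-906e-00144ffda96c}  fred1004(1)       1
barney   {ce46ea48-1dd1-11b2-906e-00144ffda96c}  fred1001(2)       0
barney   {ce46ea48-1dd1-11b2-906e-00144ffda96c}  fred1003(3)       1
barney   {ce46ea48-1dd1-11b2-906e-00144ffda96c}  fred1000(4)       1
barney   {ce46ea48-1dd1-11b2-906e-00144ffda96c}  fred1002(0)       1'

# node_id_for_host (hypothetical helper): read list_nodes text on stdin and
# print the node ID that belongs to hostname $1.
node_id_for_host() {
    awk -v host="$1" '
        NF == 4 && $3 ~ /^[^(]+\([0-9]+\)$/ {
            split($3, part, /[()]/)   # "fred1002(0)" -> part[1]="fred1002", part[2]="0"
            if (part[1] == host) print part[2]
        }'
}

printf '%s\n' "$LIST_NODES_SAMPLE" | node_id_for_host fred1002   # prints 0
```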

 

Figure 3.0

The following illustration shows a single CPS connecting to multiple client clusters:

 

Listing the membership of nodes in the VCS cluster
 

To list the membership of nodes in the VCS cluster, type the following command:

# cpsadm -s cp_server -a list_membership -c cluster_name

 

 
Registering and unregistering a node


To register a node, type the following command:

# cpsadm -s cp_server -a reg_node -u uuid -n nodeid

To unregister a node, type the following command:

# cpsadm -s cp_server -a unreg_node -u uuid -n nodeid

 

Figure 4.0

VCS configuration files for CP server and CP clients


 

Sample commands


Even after unregistering and re-registering VCS node "0", fencing could not be started.

# cpsadm -s ##.##.##.## -a unreg_node -c barney -u {ce46ea48-1dd1-11b2-906e-00144ffda96c} -h fred1002 -n 0
CPS ERROR V-97-1400-403 Node 0 (fred1002) is already unregistered.

# cpsadm -s ##.##.##.## -a reg_node -c barney -u {ce46ea48-1dd1-11b2-906e-00144ffda96c} -h fred1002 -n 0
CPS INFO V-97-1400-396 Node 0 (fred1002) successfully registered

# cpsadm -s ##.##.##.## -a list_membership -c barney
CPS INFO V-97-1400-418 List of registered nodes: 0 1 2 3 4

Once fencing is started on node 0 ("fred1002"), node 0 is removed from the list_membership output:

# cpsadm -s ##.##.##.## -a list_membership -c barney
CPS INFO V-97-1400-418 List of registered nodes: 1 2 3 4
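
To test for a single node ID in a script, the list_membership line can be split into one ID per line and matched exactly. This is a sketch that assumes the message format shown above; is_registered is a hypothetical helper name.

```shell
#!/bin/sh
# is_registered (hypothetical helper): succeed if node ID $1 appears in a
# "List of registered nodes:" line read from stdin.
is_registered() {
    sed -n 's/^.*List of registered nodes: *//p' | tr ' ' '\n' | grep -qxF "$1"
}

# Line copied from the second list_membership output above.
MEMBERSHIP='CPS INFO V-97-1400-418 List of registered nodes: 1 2 3 4'

if printf '%s\n' "$MEMBERSHIP" | is_registered 0; then
    echo "node 0 is registered"
else
    echo "node 0 is not registered"
fi
```

Matching whole lines with grep -x avoids a false hit when one node ID is a substring of another (for example 1 and 11).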

 

Complex fencing related issues
---------------------------------------------------

To resolve the most complex fencing-related cases, all the nodes in the cluster need to be stopped using "hastop -all".

It is critical that fencing also be stopped across all the VCS servers, to clear stale CP registrations.

All the nodes (CP clients) will also need to be unregistered from the CP server, before attempting to start fencing.
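
Unregistering every CP client can be prepared as a dry run that only prints the commands for review. The sketch below builds them from list_nodes output, using the same cpsadm options that appear in the examples in this article; print_unreg_commands is a hypothetical helper, and cp_server is a placeholder for the real CP server address.

```shell
#!/bin/sh
# print_unreg_commands (hypothetical helper): read list_nodes output on stdin
# and print, without executing, one unreg_node command per node whose
# Registered column is 1. $1 is the CP server address or hostname.
print_unreg_commands() {
    awk -v cps="$1" '
        NF == 4 && $3 ~ /^[^(]+\([0-9]+\)$/ && $4 == 1 {
            split($3, part, /[()]/)   # "fred1002(0)" -> part[1]="fred1002", part[2]="0"
            printf "cpsadm -s %s -a unreg_node -c %s -u %s -h %s -n %s\n", cps, $1, $2, part[1], part[2]
        }'
}

# Two rows in the list_nodes layout used in this article; only the
# registered node (fred1002) produces a command.
printf '%s\n' \
    'barney   {ce46ea48-1dd1-11b2-906e-00144ffda96c}  fred1001(2)       0' \
    'barney   {ce46ea48-1dd1-11b2-906e-00144ffda96c}  fred1002(0)       1' |
    print_unreg_commands cp_server
```

Printing rather than executing keeps the administrator in control: fencing must be confirmed as stopped on a node before its registration is cleared.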



Cluster attribute "UseFence"
 

Where VCS looks up the fencing configuration depends on the type of diskgroup:


* For local failover diskgroups, VCS uses the 'UseFence' cluster attribute in the /etc/VRTSvcs/conf/config/main.cf file

* For shared diskgroups, VCS uses the 'vxfen_mode' attribute in the /etc/vxfenmode file
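
Both settings can be read with a simple key lookup. The sketch below assumes that each file assigns the attribute on its own line as key=value or key = value; config_value, VXFENMODE and MAIN_CF are hypothetical names introduced for this example.

```shell
#!/bin/sh
# config_value (hypothetical helper): print the value assigned to key $1 in
# file $2, accepting both "key=value" and "key = value" layouts.
config_value() {
    sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*//p" "$2" | head -n 1
}

# File locations as named in the text above; override for testing.
VXFENMODE=${VXFENMODE:-/etc/vxfenmode}
MAIN_CF=${MAIN_CF:-/etc/VRTSvcs/conf/config/main.cf}

if [ -r "$VXFENMODE" ]; then
    echo "vxfen_mode: $(config_value vxfen_mode "$VXFENMODE")"
fi
if [ -r "$MAIN_CF" ]; then
    echo "UseFence: $(config_value UseFence "$MAIN_CF")"
fi
```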

 

Resolution


To resolve the issue in this instance, the entire VCS cluster and, most importantly, fencing had to be stopped on all nodes.

If the servers have been up and running for extended periods of time, take the opportunity to reboot nodes.

Example:


NOTE: To minimize overall downtime, single Node fred1001 is rebooted first, whilst the 3 remaining nodes remain part of the cluster.


Once fred1001 has restarted, we can continue with the process.

 

1.] # hastop -all

2.] Unregister the CP clients (VCS nodes)
 

# cpsadm -s ##.##.##.## -a unreg_node -c barney -u {ce46ea48-1dd1-11b2-906e-00144ffda96c} -h fred1001 -n 2

# cpsadm -s ##.##.##.## -a unreg_node -c barney -u {ce46ea48-1dd1-11b2-906e-00144ffda96c} -h fred1000 -n 4

# cpsadm -s ##.##.##.## -a unreg_node -c barney -u {ce46ea48-1dd1-11b2-906e-00144ffda96c} -h fred1002 -n 0

# cpsadm -s ##.##.##.## -a unreg_node -c barney -u {ce46ea48-1dd1-11b2-906e-00144ffda96c} -h fred1003 -n 3

# cpsadm -s ##.##.##.## -a unreg_node -c barney -u {ce46ea48-1dd1-11b2-906e-00144ffda96c} -h fred1004 -n 1

 

Confirm the "Registered" state reflects "0" for all nodes (unregistered)
 

# cpsadm -s ##.##.##.## -a list_nodes -c barney

ClusterName      UUID                               Hostname(Node ID) Registered
===========   ===================================   ================  ===========
barney       {ce46ea48-1dd1-11b2-906e-00144ffda96c}  fred1004(1)       0
barney       {ce46ea48-1dd1-11b2-906e-00144ffda96c}  fred1001(2)       0
barney       {ce46ea48-1dd1-11b2-906e-00144ffda96c}  fred1003(3)       0
barney       {ce46ea48-1dd1-11b2-906e-00144ffda96c}  fred1000(4)       0
barney       {ce46ea48-1dd1-11b2-906e-00144ffda96c}  fred1002(0)       0
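
This confirmation can be automated before moving on to the next step. The following sketch assumes the list_nodes layout shown above; all_unregistered is a hypothetical helper name.

```shell
#!/bin/sh
# all_unregistered (hypothetical helper): read list_nodes output on stdin and
# succeed only if at least one node row exists and every Registered value is 0.
all_unregistered() {
    awk '
        NF == 4 && $3 ~ /^[^(]+\([0-9]+\)$/ { rows++; if ($4 != 0) bad++ }
        END { exit ((rows > 0 && bad == 0) ? 0 : 1) }'
}

# Rows copied from the list_nodes output above.
UNREG_SAMPLE='barney       {ce46ea48-1dd1-11b2-906e-00144ffda96c}  fred1004(1)       0
barney       {ce46ea48-1dd1-11b2-906e-00144ffda96c}  fred1001(2)       0
barney       {ce46ea48-1dd1-11b2-906e-00144ffda96c}  fred1003(3)       0
barney       {ce46ea48-1dd1-11b2-906e-00144ffda96c}  fred1000(4)       0
barney       {ce46ea48-1dd1-11b2-906e-00144ffda96c}  fred1002(0)       0'

if printf '%s\n' "$UNREG_SAMPLE" | all_unregistered; then
    echo "all CP clients are unregistered"
else
    echo "stale registrations remain (or no node rows were found)"
fi
```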

 

3.] Once all the CP clients are unregistered from the CP server, proceed to start LLT manually

 

fred1001 $ gabconfig -a
GAB Port Memberships
===============================================================

fred1001 $ gabconfig -c -x
GAB gabconfig ERROR V-15-2-25015 LLT not configured


fred1001 $ cd /etc
fred1001 $ ls -l llt*
-rw-r--r--   1 root     root          55 Jul 28  2019 llthosts
-rw-r--r--   1 root     root         455 Jul 28  2019 llttab.save

 

Rename the /etc/llttab.save file back to /etc/llttab


fred1001 $ mv /etc/llttab.save /etc/llttab

 

The following are Solaris-specific commands (svcs and svcadm)


fred1001 $ svcs -a |grep llt
disabled       21:40:34 svc:/system/llt:default

fred1001 $ svcadm enable svc:/system/llt:default

fred1001 $ svcs -a |grep llt
offline*       23:12:31 svc:/system/llt:default

fred1001 $ svcs -a |grep llt
online         23:12:41 svc:/system/llt:default

 

4.] Start GAB manually

fred1001 $ gabconfig -c -x
Started gablogd
gablogd: Keeping 20 log files of 8388608 bytes each in |/var/adm/gab_ffdc| directory. Daemon log size limit 8388608 bytes

 

4b.] Start fencing
 

fred1001 $ vxfenconfig -c
Log Buffer: 0x7073e090

VXFEN vxfenconfig NOTICE Driver will use customized fencing - mechanism cps

 

5.] Start VCS

fred1001 $ hastart
fred1001 $ hastatus -sum

-- SYSTEM STATE
-- System               State                Frozen

A  fred1002             UNKNOWN              0
A  fred1000             UNKNOWN              0
A  fred1001             RUNNING              0
A  fred1003             UNKNOWN              0
A  fred1004             UNKNOWN              0

 

6.] Repeat steps 3-5 on the next node


The next node then joins the cluster, and the GAB ports seed in turn (port a for GAB, port b for fencing, port h for VCS):


fred1001 $ gabconfig -a
GAB Port Memberships
===============================================================
Port a gen  281aa03 membership 0 2
Port b gen  281aa08 membership 0 2

fred1001 $ gabconfig -a
GAB Port Memberships
===============================================================
Port a gen 281aa03 membership 0 2
Port b gen 281aa05 membership ; 2
Port b gen 281aa05 visible 0
Port h gen 281aa01 membership ; 2
Port h gen 281aa01 visible 0
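
The port check can be scripted against captured gabconfig -a output. This is a minimal sketch that assumes the line layout shown above; has_gab_port is a hypothetical helper name.

```shell
#!/bin/sh
# has_gab_port (hypothetical helper): succeed if "gabconfig -a" style text on
# stdin contains a membership line for port letter $1.
has_gab_port() {
    awk -v port="$1" '
        $1 == "Port" && $2 == port && $5 == "membership" { found = 1 }
        END { exit (found ? 0 : 1) }'
}

# Text copied from the second gabconfig -a output above.
GAB_SAMPLE='GAB Port Memberships
===============================================================
Port a gen 281aa03 membership 0 2
Port b gen 281aa05 membership ; 2
Port b gen 281aa05 visible 0
Port h gen 281aa01 membership ; 2
Port h gen 281aa01 visible 0'

for port in a b h; do
    if printf '%s\n' "$GAB_SAMPLE" | has_gab_port "$port"; then
        echo "port $port: membership present"
    else
        echo "port $port: no membership yet"
    fi
done
```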

 

The cluster is now running with 2 nodes:
 

fred1001 $ hastatus -sum

-- SYSTEM STATE
-- System State Frozen

A fred1002 RUNNING 0
A fred1000 UNKNOWN 0
A fred1001 RUNNING 0
A fred1003 UNKNOWN 0
A fred1004 UNKNOWN 0

 

7.] Repeat steps 3-5 on each remaining node until all nodes have joined the cluster

 

fred1001 $ hastatus -sum

-- SYSTEM STATE
-- System State Frozen

A fred1002 RUNNING 0
A fred1000 RUNNING 0
A fred1001 RUNNING 0
A fred1003 RUNNING 0
A fred1004 RUNNING 0
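
While repeating the steps, the nodes that still have to join can be listed from the hastatus -sum output. The sketch below assumes the system lines start with "A" as in the output above; systems_not_running is a hypothetical helper name, and the sample is the intermediate two-node output from step 6.

```shell
#!/bin/sh
# systems_not_running (hypothetical helper): read "hastatus -sum" style text
# on stdin and print each system whose state is not RUNNING.
systems_not_running() {
    awk '$1 == "A" && NF >= 4 && $3 != "RUNNING" { print $2 }'
}

# Text copied from the two-node hastatus -sum output earlier in this article.
HASTATUS_SAMPLE='-- SYSTEM STATE
-- System State Frozen

A fred1002 RUNNING 0
A fred1000 UNKNOWN 0
A fred1001 RUNNING 0
A fred1003 UNKNOWN 0
A fred1004 UNKNOWN 0'

pending=$(printf '%s\n' "$HASTATUS_SAMPLE" | systems_not_running)
if [ -z "$pending" ]; then
    echo "all systems RUNNING"
else
    # Unquoted on purpose so the names are joined on one line.
    echo "still waiting for:" $pending
fi
```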

 

Process complete.

 

 

Issue/Introduction

A single VCS node is unable to join an existing VCS cluster due to (CPS) VXFEN vxfenconfig ERROR V-11-2-1043 Detected a preexisting split brain. Unable to join cluster.
The Coordination Point Server (CPS) is a software solution running on a remote system or cluster that provides arbitration functionality to client cluster nodes.
Figure 1.0

CPS functions as another fencing mechanism (vehicle) that integrates within the existing VCS I/O fencing module.


In this example, the VCS cluster consists of 5 nodes:
fred1001 $ hastatus -sum

-- SYSTEM STATE
-- System State Frozen

A fred1002 FAULTED 0 <<< this node cannot join the cluster
A fred1000 RUNNING 0
A fred1001 RUNNING 0
A fred1003 RUNNING 0
A fred1004 RUNNING 0

The CPS process vxcpserv interacts with the customized fencing framework on the client cluster.

Figure 2.0

The CP server details are recorded in the /etc/vxfentab on each CP client (VCS node):

fred1004 $ cat /etc/vxfentab
#
# /etc/vxfentab:
# DO NOT MODIFY this file as it is generated by the
# VXFEN rc script from the file /etc/vxfenmode.
#
security=0
single_cp=1
[##.###.###.###]:14250
NOTE: TCP/IP sockets using default port 14250 are used for communication. ##.###.###.### is used to hide the actual IP address for the CP server.
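
The CP server address and port can be pulled out of this file when scripting cpsadm calls. The sketch below assumes the [address]:port layout shown above; cp_server_endpoints is a hypothetical helper name, and 192.0.2.10 is a documentation placeholder address, not the real CP server.

```shell
#!/bin/sh
# cp_server_endpoints (hypothetical helper): read vxfentab-style text on stdin
# and print "address port" for every "[address]:port" line.
cp_server_endpoints() {
    sed -n 's/^\[\(.*\)]:\([0-9][0-9]*\)$/\1 \2/p'
}

# Same layout as the file above, with a placeholder address.
VXFENTAB_SAMPLE='#
# /etc/vxfentab:
#
security=0
single_cp=1
[192.0.2.10]:14250'

printf '%s\n' "$VXFENTAB_SAMPLE" | cp_server_endpoints   # prints 192.0.2.10 14250
```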