To troubleshoot the issue, we reviewed the log files under /var/VRTSvcs/log/vxfen.
From vxfen.log:
Tue Nov 12 04:38:21 +03 2019 vxfen [30659] vxfen_script_timeout=25 does not contain a coordination point
VXFEN vxfenconfig NOTICE Driver will use customized fencing - mechanism cps
VXFEN vxfenconfig ERROR V-11-2-1090 Unable to register with a majority of the coordination points.
Tue Nov 12 04:39:41 +03 2019 vxfen-startup [34394] vxfenconfig failed. Will retry. Sleeping for 10 seconds
Tue Nov 12 04:39:51 +03 2019 vxfen-startup [34394] calling regular vxfenconfig
Tue Nov 12 04:41:11 +03 2019 vxfen-startup [34394] return value from above operation is 1
Tue Nov 12 04:41:11 +03 2019 vxfen-startup [34394] output was Log Buffer: 0xffffffffc1849f20
The vxfend_A.log contains additional logging.
VCS was unable to start automatically because fencing failed to start within the CP discovery and registration timeout window.
From the vxfend log, we can see that the join_local_node.sh script was terminated because of the low vxfen_script_timeout value.
The join_local_node.sh script is responsible for registering the node with the CP servers.
The vxfen_script_timeout setting determines how long vxfend waits for customized fencing scripts to finish before it forcibly terminates them.
From vxfen.log
=======================================================================
Tue Nov 12 04:38:21 +03 2019 vxfen [30659] vxfen_script_timeout=25
========================================================================
When a CPS is not available, the script waits for the defined (default) time before declaring that the CPS is not available.
Because vxfen_script_timeout is set to 25 by default in virtual environments, VCS kills the script before it has completed, which causes registration with the CP servers to fail, and hence fencing does not start.
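To confirm which value a node actually started with, the timeout can be pulled out of vxfen.log. The following is a minimal sketch, assuming the value is always logged in the vxfen_script_timeout=<number> form seen in the vxfen.log excerpt above. The function name logged_script_timeout is hypothetical and is not part of the product.

```shell
#!/bin/sh
# Hypothetical helper (not a Veritas tool): print the most recent
# vxfen_script_timeout value recorded in a vxfen log file.
# Assumes entries of the form "vxfen_script_timeout=<number>".
logged_script_timeout() {
    sed -n 's/.*vxfen_script_timeout=\([0-9][0-9]*\).*/\1/p' "$1" | tail -n 1
}

# Example (log directory taken from this article):
#   logged_script_timeout /var/VRTSvcs/log/vxfen/vxfen.log
```

If several startup attempts are logged, only the last recorded value is printed.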
Logs -
/var/VRTSvcs/log/vxfen/vxfend_A.log
=========================================================================
The output below shows that the script was cut off before it finished: it reported neither success nor failure for the third CP server.
2019/11/12 04:39:21 VXFEN vxfend INFO V-11-2-5109 ----Begin: output from join_local_node.sh
Begin: join_local_node.sh
Arguments passed: 0 0 1
my node id: 0
my cluster: 0 1
check alien nodes:
Coord point 1 membership: 1
Coord point 2 membership: 1
Unable to get membership from cp3
Attempting to register 0 with CP server 10.1.1.10
CPS INFO V-97-1400-396 Node 0 (VCS) successfully registered
Successfully registered with cp [10.1.1.10]:443
checking membership for [10.1.1.10]:443: 0 1 and my_cluster: 0 1
Attempting to register 0 with CP server 10.1.1.20
CPS INFO V-97-1400-396 Node 0 (VCS) successfully registered
Successfully registered with cp [10.1.1.20]:443
checking membership for [10.1.1.20]:443: 0 1 and my_cluster: 0 1
Attempting to register 0 with CP server 10.1.1.30
2019/11/12 04:39:21 VXFEN vxfend INFO V-11-2-5110 ---End: output from join_local_node.sh
===========================================================================
2019/11/12 04:39:21 VXFEN vxfend ERROR V-11-2-4034 script (/opt/VRTSvcs/vxfen/bin/customized/cps/join_local_node.sh) terminated due to signal (1)
==============================================================================
The above indicates that vxfend terminated this script.
When one or more CP servers are not reachable, the script waits before declaring a registration failure while it still tries to determine the status of the missing CP server(s). Once the default 25-second timeout expires, the script is terminated and registration fails, which points to vxfen_script_timeout being set too low.
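To find every occurrence of this condition in a vxfend log, the error message can be searched for by its message ID. This is a minimal sketch, assuming the V-11-2-4034 message always has the same layout as the line above. The function name killed_fencing_scripts is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list the customized fencing scripts that vxfend
# reported as terminated by a signal (message ID V-11-2-4034).
# Prints one script path per matching log line.
killed_fencing_scripts() {
    sed -n 's/.*V-11-2-4034 script (\(.*\)) terminated due to signal.*/\1/p' "$1"
}

# Example (log file named in this article):
#   killed_fencing_scripts /var/VRTSvcs/log/vxfen/vxfend_A.log
```

An empty result means no script termination of this kind was logged in that file.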
When vxfend kills the join_local_node.sh script, it then runs another script (unjoin_local_node.sh) to unregister the node.
After that, the node is no longer registered with any CP server, as shown below, which produces the error that registration with a majority of the coordination points failed. Hence, fencing fails to start:
=====================================================================
2019/11/12 04:39:41 VXFEN vxfend INFO V-11-2-5121 ----Begin: output from unjoin_local_node.sh
Begin: unjoin_local_node.sh
my node id: 0
Attempting to unregister 0 with CP server 10.1.1.10
CPS INFO V-97-1400-401 Node 0 (VCS) successfully unregistered
Attempting to unregister 0 with CP server 10.1.1.20
CPS INFO V-97-1400-401 Node 0 (VCS) successfully unregistered
Attempting to unregister 0 with CP server 10.1.1.30
Failed to connect to host 10.1.1.30 on port 443: Timeout: connect timed out: 10.1.1.30:443
Unable to unregister with coordination point [10.1.1.30]:443
End: unjoin_local_node.sh returning SUCCESS (110)
2019/11/12 04:39:41 VXFEN vxfend INFO V-11-2-5122 ---End: output from unjoin_local_node.sh
==========================================================================
The cpsadm command timeout is 15 seconds, and it applies to the query request sent to each CPS. The vxfen_script_timeout value should therefore be sized as a function of the number of CP servers configured and the cpsadm command timeout.
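The sizing relationship can be illustrated with simple arithmetic. This is a rough sketch only, assuming the CP servers are contacted one after another and that, in the worst case, each cpsadm call runs to its full timeout. The function name min_script_timeout is hypothetical, and the result is a lower bound rather than a recommended setting.

```shell
#!/bin/sh
# Hypothetical helper: rough lower bound for vxfen_script_timeout, in seconds.
# Assumption: CP servers are queried sequentially and every cpsadm call
# may take its full timeout before returning.
min_script_timeout() {
    num_cps=$1          # number of CP servers configured
    cpsadm_timeout=$2   # cpsadm command timeout in seconds
    echo $((num_cps * cpsadm_timeout))
}

# Example for the cluster in this article (3 CP servers, 15-second timeout):
#   min_script_timeout 3 15
```

For three CP servers and a 15-second cpsadm timeout this gives 45 seconds, which is already above the default of 25. Because the script may issue more than one request per CP server, the real requirement can be higher.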
Workaround:
To prevent the above issue from occurring again, manually edit the respective files as shown below and restart the updated servers.
On each node, set the value of the LLT sendhbcap timer parameter value as follows:
Run the following command:
# lltconfig -T sendhbcap:4000
Add the following line to the /etc/llttab file so that the change remains persistent after any reboot:
set-timer sendhbcap:4000
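Because a misspelled timer name in /etc/llttab is easy to miss, the persistent entry can be checked on each node after editing. This is a minimal sketch, assuming the entry sits on its own line in the set-timer name:value form. The function name llttab_has_timer is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) if an llttab-style file
# contains a "set-timer <name>:<value>" line, ignoring surrounding whitespace.
llttab_has_timer() {
    file=$1
    timer=$2
    value=$3
    grep -Eq "^[[:space:]]*set-timer[[:space:]]+${timer}:${value}[[:space:]]*\$" "$file"
}

# Example:
#   llttab_has_timer /etc/llttab sendhbcap 4000 && echo "entry present"
```

The check only confirms that the line exists in the file. It does not show the value currently loaded by LLT.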
NOTE: Veritas Engineering are working on a long-term solution to enhance the existing fencing algorithm and design, making it more efficient going forward.