Powering off a minority of Coordination Point (CP) Servers results in both nodes suffering a vxfen-induced panic
Article ID: 100031809
Description
Error Message
The entries below, taken from vxfen_A.log, show when the race for the coordination points started and when it was aborted. The preempt and abort cpsadm commands did not complete in that interval; they timed out at the default of 120 seconds:
2015/12/17 11:55:11 VXFEN vxfend INFO V-11-2-5006 vxfend processing: 1, operation: RACE_CPOINT → Race started
2015/12/17 11:57:16 VXFEN vxfend INFO V-11-2-5150 Begin vxfen_kill_process() → Could not be concluded within 2 minutes and hence aborted
Cause
The unexpected fencing panic occurred because the preempt and abort operation timed out at 120 seconds. It timed out because the cpsadm commands did not complete and instead waited for their own TCP-based timeout of 60 seconds each. Because these cpsadm commands run sequentially, their combined wait reached the 120-second timeout for fencing script operations.
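The arithmetic behind this can be sketched as follows. This is a minimal illustration, not the fencing implementation: the function name is hypothetical, and the assumption that exactly two cpsadm calls each wait for the full timeout is chosen to match the two powered-off CP servers. The 60-second and 120-second values are the ones stated above.

```python
def sequential_wait(per_command_timeout_s, timed_out_commands):
    """Total wait when commands run one after another and each one times out."""
    return per_command_timeout_s * timed_out_commands


CPSADM_TIMEOUT_S = 60   # per-command TCP-based timeout (from the article)
SCRIPT_TIMEOUT_S = 120  # default fencing script timeout (from the article)

# Assumption: one cpsadm call per unreachable CP server, two servers down.
total = sequential_wait(CPSADM_TIMEOUT_S, 2)

# The fencing script is aborted once the total wait reaches its timeout.
script_aborted = total >= SCRIPT_TIMEOUT_S
print(total, script_aborted)
```

Under these assumptions the sequential waits alone use up the whole script timeout, leaving no time for the race to be decided.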
Resolution
A two-fold approach was used to reduce the time taken for IO Fencing preemption and abort to complete:
net.ipv4.tcp_syn_retries=2 (using the /proc or sysctl interface)
vxfen_script_timeout=300 (added to /etc/vxfenmode)
This enabled TCP to time out faster and gave the fencing scripts more time (if needed) to complete.
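To illustrate why lowering net.ipv4.tcp_syn_retries shortens the wait, the sketch below estimates how long a connection attempt to an unreachable host blocks before giving up. It assumes the usual Linux behaviour of a 1-second initial retransmission timeout that doubles after each unanswered SYN. The function name is hypothetical, and real timings depend on the kernel and on any application-level timeout inside cpsadm.

```python
def syn_connect_timeout_s(syn_retries, initial_rto_s=1):
    """Approximate seconds until a connect attempt fails when no SYN is answered.

    The initial SYN and each of the `syn_retries` retransmissions wait one
    retransmission timeout (RTO), and the RTO is assumed to double each time.
    """
    return sum(initial_rto_s * 2 ** i for i in range(syn_retries + 1))


# With net.ipv4.tcp_syn_retries=2 the waits are 1 + 2 + 4 seconds.
print(syn_connect_timeout_s(2))
```

With two retries the estimate is 7 seconds per unreachable server, well under both the 60-second cpsadm timeout and the 300-second vxfen_script_timeout.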
Issue/Introduction
The customer installed Cluster Server 7.0.1 (running on Red Hat Enterprise Linux 7.1) but found that, with 2 of 5 CP servers powered off, a network partition was not handled as expected: both cluster nodes panicked.