Powering off a minority of Coordination Point (CP) Servers results in both nodes suffering a vxfen-induced panic

Article ID: 100031809

Description

Error Message

The entries below from vxfen_A.log cover the preempt and abort cpsadm commands. The race started at 11:55:11 and vxfen_kill_process() began at 11:57:16, which shows that the commands did not complete but timed out at the default 120 seconds:

2015/12/17 11:55:11 VXFEN vxfend INFO V-11-2-5006 vxfend processing: 1, operation: RACE_CPOINT -> Race started
2015/12/17 11:57:16 VXFEN vxfend INFO V-11-2-5150 Begin vxfen_kill_process() -> Could not be concluded within 2 minutes and hence aborted
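
The gap between the two entries can be checked with a short calculation. The following is a minimal sketch that uses only the two timestamps shown above; the helper name elapsed_seconds is hypothetical and is not part of any Veritas tool.

```python
from datetime import datetime

# Timestamps copied from the two vxfen_A.log entries above.
RACE_STARTED = "2015/12/17 11:55:11"
KILL_BEGAN = "2015/12/17 11:57:16"
LOG_TIME_FORMAT = "%Y/%m/%d %H:%M:%S"


def elapsed_seconds(start, end, fmt=LOG_TIME_FORMAT):
    """Return the whole seconds between two log timestamps (hypothetical helper)."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


# 2 minutes 5 seconds: just past the default 120-second script timeout.
print(elapsed_seconds(RACE_STARTED, KILL_BEGAN))  # 125
```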

Cause

The unexpected fencing panic occurred because the preempt and abort operation timed out at 120 seconds. The cpsadm commands sent to the powered-off CP servers did not complete; each one waited for its own TCP-based timeout of 60 seconds. Because these cpsadm commands run sequentially, their timeouts add up and reach the 120-second timeout for fencing script operations.
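
The arithmetic can be sketched as follows. This is an illustration only: it assumes one timed-out cpsadm command per unreachable CP server and ignores the time spent on reachable servers. The function names are hypothetical, and the 300-second value in the last line is simply a larger script timeout used for comparison.

```python
CPSADM_TIMEOUT_S = 60   # per-command TCP-based timeout stated in this article
SCRIPT_TIMEOUT_S = 120  # default timeout for fencing script operations


def worst_case_wait(unreachable_servers, per_command_timeout=CPSADM_TIMEOUT_S):
    """Total wait when each unreachable server costs one full timeout (assumption)."""
    return unreachable_servers * per_command_timeout


def script_times_out(unreachable_servers, script_timeout=SCRIPT_TIMEOUT_S):
    """True when the sequential waits alone reach the script timeout."""
    return worst_case_wait(unreachable_servers) >= script_timeout


print(worst_case_wait(2))        # 120 seconds with 2 of 5 CP servers powered off
print(script_times_out(2))       # True: the 120-second limit is reached
print(script_times_out(2, 300))  # False: a 300-second limit leaves headroom
```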

Resolution

A two-fold approach was used so that I/O fencing preemption and abort can complete in time: the TCP connection timeout was shortened and the fencing script timeout was lengthened:

net.ipv4.tcp_syn_retries=2 (using the /proc or sysctl interface)
vxfen_script_timeout=300 (added to /etc/vxfenmode)

This enabled TCP to time out faster and gave the fencing scripts more time (if needed) to complete.
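
The effect of the two settings can be estimated with a rough sketch. It assumes the usual Linux behavior of a 1-second initial SYN retransmission timeout that doubles on each retry, and it assumes the connection attempt dominates each cpsadm command; actual cpsadm timing may differ. The function name syn_connect_timeout is hypothetical.

```python
def syn_connect_timeout(syn_retries, initial_rto=1.0):
    """Seconds until connect() gives up, assuming the wait doubles after each SYN."""
    # One wait after the initial SYN plus one wait after each retransmission.
    return sum(initial_rto * 2 ** attempt for attempt in range(syn_retries + 1))


TCP_SYN_RETRIES = 2          # net.ipv4.tcp_syn_retries from the Resolution
VXFEN_SCRIPT_TIMEOUT = 300   # vxfen_script_timeout from /etc/vxfenmode
UNREACHABLE_CP_SERVERS = 2   # 2 of 5 CP servers powered off

per_server = syn_connect_timeout(TCP_SYN_RETRIES)  # 1 + 2 + 4 seconds
total = UNREACHABLE_CP_SERVERS * per_server

print(per_server)                    # 7.0
print(total)                         # 14.0
print(total < VXFEN_SCRIPT_TIMEOUT)  # True
```

Under these assumptions, the sequential connection attempts to the two powered-off servers take about 14 seconds in total, well inside the 300-second script timeout.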

Issue/Introduction

The customer installed Cluster Server 7.0.1 (on RedHat 7.1) and found that, with 2 of the 5 CP servers powered off, the cluster did not handle a network partition as expected: both cluster nodes panicked.
