GAB IOFENCE panic encountered on InfoScale/RHEL cluster running on VMware


Article ID: 100049539



Description

Error Message

 

Cause


If the operating system does not schedule the GAB and LLT timer functions on time, heartbeating is disrupted and node eviction is ultimately triggered. This is the expected behaviour from the LLT/GAB perspective.

Resolution

In virtualized environments, many external factors affect the stability of the cluster and need to be considered.

Such factors include:

1. Provisioning ratio: CPU and memory provisioning ratios affect the stability of the Veritas cluster. For maximum stability, the ratio should be kept as low as possible. For critical solutions that require maximum resiliency, the ratio should be 1:1 for both memory and CPU.

2. CPU load on ESX: Even if the provisioning ratio is low, CPU load on ESX can still affect cluster stability. If the load on the ESX server is very high, it can delay the scheduling of vCPUs on the guest VMs, because from the ESX server's point of view vCPUs are simply processes.

3. CPU requirement of the actual workload on guests: If the total CPU requirement of the workloads exceeds the available physical CPU capacity, node evictions will still occur due to heartbeat timeouts.

4. External events: External events such as vMotion and VMDK backups are known to add CPU load on the ESX servers. Any stun that these events cause in a cluster environment should be monitored, and the peerinact tunable increased if needed. If peerinact cannot be increased, these types of operations should be avoided.

5. Hypervisor best practices should be followed. The following link describes how to debug virtual machine performance issues: https://kb.vmware.com/s/article/2001003
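As a rough illustration of the provisioning ratio in item 1, the sketch below divides the number of vCPUs provisioned on a host by its physical core count. The function name `provisioning_ratio` and the sample numbers are hypothetical; the script does not query ESX, so both inputs must be supplied by hand.

```sh
#!/bin/sh
# Hypothetical helper: ratio of provisioned vCPUs to physical cores on one host.
# Both arguments are values you supply; nothing here talks to the hypervisor.
provisioning_ratio() {
    vcpus=$1
    pcores=$2
    if [ "$pcores" -le 0 ]; then
        echo "physical core count must be greater than zero" >&2
        return 1
    fi
    # awk handles the non-integer division and rounds to two decimals.
    awk -v v="$vcpus" -v p="$pcores" 'BEGIN { printf "%.2f\n", v / p }'
}

# Made-up example: 48 vCPUs provisioned across guests on a 24-core host.
provisioning_ratio 48 24
```

A result of 1.00 corresponds to the 1:1 ratio recommended above for critical solutions; higher values indicate CPU overcommitment.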


Veritas recommends changing the value of peerinact from the default of 16 seconds to a minimum of 32 seconds.

The following command can be used on each cluster node to set it dynamically:
 
# lltconfig -T peerinact:3200
 
To confirm that the new value is in place, run:
 
# lltconfig -T query
 
 
To make this setting persistent across reboots, the following line needs to be added to the /etc/llttab file:
 
set-timer peerinact:3200
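The value 3200 for a 32-second timeout implies that peerinact is expressed in hundredths of a second. Assuming that unit, the sketch below converts a timeout in whole seconds and prints the two settings shown above. It only prints them; it does not run lltconfig. The function name `peerinact_from_seconds` is hypothetical.

```sh
#!/bin/sh
# Convert a timeout in whole seconds to a peerinact value, assuming the
# unit is 1/100 s (inferred from 32 seconds <-> 3200 in this article).
peerinact_from_seconds() {
    echo $(( $1 * 100 ))
}

secs=32
value=$(peerinact_from_seconds "$secs")

# Print, rather than execute, the dynamic and persistent settings.
echo "lltconfig -T peerinact:$value"
echo "set-timer peerinact:$value"
```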
    
 

Issue/Introduction


GAB IOFENCE panic encountered on InfoScale/RHEL cluster running on VMware. Approximately 1 minute before the IOFENCE panic occurred, the system started becoming unstable and a soft lockup was observed. The operating system did not run the LLT and GAB timers for around 30 seconds, which is a very long delay given that these timers are expected to fire once every 50 ms.

[1561488.025084] <82>GAB INFO V-15-1-20124 timer not called for 31 seconds
[1561488.024960] <82>LLT INFO V-14-1-10035 timer not called for 30398 ticks

This led to LLT/GAB instability, which in turn led to a voluntary panic by GAB.

[1561560.837149] <82>GAB INFO V-15-1-20033 Port h[GAB_USER_CLIENT (refcount 0)] nid 0 [3:0:0:0] [3:0:0:0] [3:0:0:0] [1:0:0:0] [0:0:0:0] 22 d47005 20 10 << node 0 sees membership as 11 (3), i.e. both nodes are part of the cluster
[1561560.837159] <82>GAB INFO V-15-1-20033 Port h[GAB_USER_CLIENT (refcount 0)] nid 1 [3:0:0:0] [3:0:0:0] [2:0:0:0] [3:0:0:0] [0:0:0:0] 22 d47004 20 9 << node 1 sees membership as 10 (2), i.e. only the local node is part of the cluster

Due to the above discrepancy in port membership, GAB initiated IOFENCE and fenced Node 1 out of the cluster.
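The membership values in the annotations above (3 and 2) can be read as bitmasks. The sketch below decodes such a value into node IDs, assuming that bit N set means node N is a member, which matches the two annotated cases. The function name `membership_nodes` is hypothetical, and the sketch does not parse the GAB log line itself.

```sh
#!/bin/sh
# Decode a membership bitmask into a space-separated list of node IDs,
# assuming bit N set means node N is a member (3 -> 0 1, 2 -> 1).
membership_nodes() {
    mask=$1
    node=0
    nodes=""
    while [ "$mask" -gt 0 ]; do
        if [ $(( mask & 1 )) -eq 1 ]; then
            # Append the node ID, adding a separator only when needed.
            nodes="${nodes:+$nodes }$node"
        fi
        mask=$(( mask >> 1 ))
        node=$(( node + 1 ))
    done
    echo "$nodes"
}

membership_nodes 3   # view reported by nid 0
membership_nodes 2   # view reported by nid 1
```

Under that assumption, the first call lists nodes 0 and 1 and the second lists only node 1, which is the discrepancy described above.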

[1561560.837165] <82>GAB INFO V-15-1-20239 Initiating FFDC data collection
[1561560.837166] <82>GAB INFO V-15-1-20041 Port h: network failure: killing process << killed HAD, as it is a user-level component
[1561560.837236] <82>GAB INFO V-15-1-20033 Port m[GAB_LEGACY_CLIENT (refcount 0)] nid 0 [3:0:0:0] [3:0:0:0] [3:0:0:0] [1:0:0:0] [0:0:0:0] 22 d47008 20 10
[1561560.837244] <82>GAB INFO V-15-1-20033 Port m[GAB_LEGACY_CLIENT (refcount 0)] nid 1 [3:0:0:0] [3:0:0:0] [2:0:0:0] [3:0:0:0] [0:0:0:0] 22 d47007 20 8
[1561560.837251] <6>Kernel panic - not syncing: GAB: Port m halting system due to network failure at [14:2252] << panicked the node, because port m, which is used by VxVM, is a kernel component
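The GAB "timer not called" message shown earlier is an early indicator of this failure. As a minimal sketch, the filter below extracts the stall length from such messages in a saved kernel log and reports stalls at or above a threshold. The message format is taken from the excerpt in this article and may differ in other releases. The function name `gab_timer_stalls` and the `threshold` variable are hypothetical.

```sh
#!/bin/sh
# Print the stall length, in seconds, from each GAB "timer not called"
# message read on standard input. Lines that do not match are ignored.
gab_timer_stalls() {
    sed -n 's/.*GAB INFO V-15-1-20124 timer not called for \([0-9][0-9]*\) seconds.*/\1/p'
}

# Hypothetical threshold in seconds; 16 is the default peerinact cited above.
threshold=16

# Sample input copied from the excerpt in this article.
sample='[1561488.025084] <82>GAB INFO V-15-1-20124 timer not called for 31 seconds
[1561488.024960] <82>LLT INFO V-14-1-10035 timer not called for 30398 ticks'

printf '%s\n' "$sample" | gab_timer_stalls | while read -r secs; do
    if [ "$secs" -ge "$threshold" ]; then
        echo "GAB timer stalled for ${secs}s (threshold ${threshold}s)"
    fi
done
```

To scan a real log, pipe a saved copy of the kernel messages into `gab_timer_stalls` instead of the sample text.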

Additional Information

JIRA: STESC-5343