Service groups active on a cluster node do not fail over when the node stops responding


Article ID: 100029238



Resolution

Symptom
When a node fails, the service groups that were active on the failed node enter the AutoDisabled state and no failover occurs.

Cause
When a cluster node stops responding or is restarted, two events are used to determine the next course of action for its running service groups.

1. The first event is when the Veritas Cluster Server (VCS) engine process (had) stops.

2. The second event is when the Low Latency Transport (LLT) cluster heartbeats from that node stop, as seen from the other cluster nodes.

If the interval between these two events is greater than 120 seconds (the default value), the service groups that were active on the failed/restarted node remain in the AutoDisabled state and are not failed over to any active node.

If the interval is less than 120 seconds, the service groups have their AutoDisabled status cleared and fail over to one of the remaining active nodes.
 
The VCS system attribute ShutdownTimeout specifies the maximum interval allowed between the two events. The default value for this attribute is 120 (seconds).
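The rule described above can be sketched as a small shell function. This is only an illustration of the comparison, not VCS code: the function name failover_outcome and its arguments are hypothetical, and how an interval exactly equal to ShutdownTimeout is handled is an assumption, since the article covers only the "greater than" and "less than" cases.

```shell
#!/bin/sh
# Illustrative sketch of the ShutdownTimeout rule; not part of VCS.
# $1 = seconds between the had process stopping and LLT heartbeats stopping
# $2 = ShutdownTimeout in seconds (defaults to 120, the documented default)
failover_outcome() {
    interval=$1
    timeout=${2:-120}
    if [ "$interval" -gt "$timeout" ]; then
        # Interval exceeded the timeout: groups stay AutoDisabled.
        echo "AutoDisabled"
    else
        # Assumption: an interval exactly equal to the timeout is treated
        # like the "less than" case; the article does not specify this.
        echo "failover"
    fi
}
```

For example, failover_outcome 150 prints AutoDisabled, while failover_outcome 30 prints failover. The value currently in effect on a node can normally be inspected with hasys -display node-name -attribute ShutdownTimeout, where node-name is a placeholder for the cluster node name.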

Details
If the service groups that were active on the failed/restarted node stay in the AutoDisabled state, that status must be cleared and the service groups brought online manually.
 
Verify that the failed/restarted cluster node is not active.
 
Clear the AutoDisabled flag from all affected service groups.
To determine which service groups are affected, use the hastatus command and examine the AutoDisabled column under the Group State section.
If the AutoDisabled column has a "Y" for a service group, then it is affected.
 
# hastatus -sum
 
# hagrp -autoenable service-group-name -sys node-name
where service-group-name is the name of the service group and node-name is the cluster node name of the failed/restarted node.
Repeat this command for each AutoDisabled service group.
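When many service groups are affected, the per-group commands can be generated from the hastatus output. The following is a minimal sketch, assuming that group rows in the Group State section start with "B" and carry the columns Group, System, Probed, AutoDisabled and State in that order. The function name autoenable_cmds is hypothetical, and the function only prints the commands so they can be reviewed before being run.

```shell
#!/bin/sh
# Reads `hastatus -sum` output on stdin and prints (does not run) one
# hagrp -autoenable command per AutoDisabled group on the given node.
# Assumed row layout in the Group State section:
#   B  Group  System  Probed  AutoDisabled  State
# $1 = name of the failed/restarted cluster node
autoenable_cmds() {
    node=$1
    awk -v node="$node" '
        # Keep only group rows for the given node whose AutoDisabled column is "Y".
        $1 == "B" && $3 == node && $5 == "Y" {
            printf "hagrp -autoenable %s -sys %s\n", $2, $3
        }'
}
```

A possible invocation is hastatus -sum | autoenable_cmds node-name, run on an active cluster node after confirming that the failed/restarted node is not active.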
 
Bring these service groups online on the desired active cluster node.
 
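# hagrp -online service-group-name -sys node-name
where node-name is the active cluster node on which the service group is to be brought online.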


