VCS is unable to fail over the service group after a system crash or power-off


Article ID: 100006714



Cause

The problem is a VCS defect that is fixed in Etrack 1937834, which is a child of parent Etrack 2081720. Please refer to the Additional Information section of this article for details.


 

Resolution

The fix for this issue is RP2, which can be downloaded here:

https://sort.Veritas.com/patch/detail/5510
 


Issue/Introduction

The /var/VRTSvcs/log/engine_A.log on the surviving node in the cluster shows that the server crashed or was powered off abruptly:

2011/10/06 18:03:35 VCS INFO V-16-1-10077 Received new cluster membership
2011/10/06 18:03:35 VCS NOTICE V-16-1-10112 System (server01) - Membership: 0x2, DDNA: 0x0
2011/10/06 18:03:35 VCS ERROR V-16-1-10079 System server01 (Node '0') is in Down State - Membership: 0x2
2011/10/06 18:03:35 VCS ERROR V-16-1-10322 System server01 (Node '0') changed state from RUNNING to FAULTED
2011/10/06 18:03:35 VCS NOTICE V-16-1-10446 Group AGILESTAGE is offline on system server01

The service group AGILESTAGE is only reported offline; no failover occurs.

The /var/adm/messages file shows the following:

Oct  6 18:03:17 server kernel: LLT INFO V-14-1-10205 link 0 (eth2) node 0 in trouble
Oct  6 18:03:17 server kernel: LLT INFO V-14-1-10205 link 1 (eth3) node 0 in trouble
Oct  6 18:03:19 server kernel: LLT INFO V-14-1-10205 link 2 (bond0) node 0 in trouble
Oct  6 18:03:30 server kernel: LLT INFO V-14-1-10509 link 0 (eth2) node 0 expired
Oct  6 18:03:30 server kernel: LLT INFO V-14-1-10509 link 1 (eth3) node 0 expired
Oct  6 18:03:35 server kernel: GAB INFO V-15-1-20036 Port a gen    dc712 membership ;1
Oct  6 18:03:35 server kernel: GAB INFO V-15-1-20036 Port h gen    dc715 membership ;1
Oct  6 18:03:35 server Had[14619]: VCS INFO V-16-1-10077 Received new cluster membership
Oct  6 18:03:35 server Had[14619]: VCS ERROR V-16-1-10079 System server (Node '0') is in Down State - Membership: 0x2
Oct  6 18:03:35 server Had[14619]: VCS ERROR V-16-1-10322 System server (Node '0') changed state from RUNNING to FAULTED

After the server is powered back up, a second test is run. This time "had" and "hashadow" are killed with "kill -15" (SIGTERM) and, as expected, the group is autodisabled:

2011/10/06 18:41:48 VCS NOTICE V-16-1-10112 System (server02) - Membership: 0x2, DDNA: 0x1
2011/10/06 18:41:48 VCS ERROR V-16-1-10113 System server01 (Node '0') is in DDNA Membership - Membership: 0x2, Visible: 0x0
2011/10/06 18:41:48 VCS ERROR V-16-1-10322 System server01 (Node '0') changed state from RUNNING to FAULTED
2011/10/06 18:41:48 VCS NOTICE V-16-1-10449 Group AGILESTAGE autodisabled on node server01 until it is probed
2011/10/06 18:41:48 VCS NOTICE V-16-1-10449 Group VCShmg autodisabled on node server01 until it is probed
2011/10/06 18:41:48 VCS NOTICE V-16-1-10446 Group AGILESTAGE is offline on system server01

The server is then powered off, and the service group fails over to the other node as designed:

2011/10/06 18:42:53 VCS WARNING V-16-1-11141 LLT heartbeat link status changed. Previous status = eth2 UP eth3 UP bond0 UP; Current status = eth2 DOWN eth3 DOWN bond0 DOWN.
2011/10/06 18:42:54 VCS INFO V-16-1-10077 Received new cluster membership
2011/10/06 18:42:54 VCS NOTICE V-16-1-10112 System (server02) - Membership: 0x2, DDNA: 0x0
2011/10/06 18:42:54 VCS ERROR V-16-1-10079 System server01 (Node '0') is in Down State - Membership: 0x2
2011/10/06 18:42:54 VCS NOTICE V-16-1-10451 Cleared attribute-'autodisabled' for Group AGILESTAGE on node server01
2011/10/06 18:42:54 VCS NOTICE V-16-1-10451 Cleared attribute-'autodisabled' for Group VCShmg on node server01
2011/10/06 18:42:54 VCS ERROR V-16-1-10205 Group AGILESTAGE is faulted on system server01
2011/10/06 18:42:54 VCS NOTICE V-16-1-10446 Group AGILESTAGE is offline on system server01
2011/10/06 18:42:54 VCS INFO V-16-1-10493 Evaluating server02 as potential target node for group AGILESTAGE
2011/10/06 18:42:54 VCS INFO V-16-1-10493 Evaluating server01 as potential target node for group AGILESTAGE
2011/10/06 18:42:54 VCS INFO V-16-1-10494 System server01 not in RUNNING state
2011/10/06 18:42:54 VCS NOTICE V-16-1-10301 Initiating Online of Resource AGILESTAGE_DISK (Owner: Unspecified, Group: AGILESTAGE) on System server02

The second test shows that the service group configuration is correct and that the problem is caused by a VCS bug.
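
The symptom described above (a system goes FAULTED, its group is reported offline, and no target node is ever evaluated) can be checked for mechanically in an engine_A.log excerpt. The following Python sketch is illustrative only and is not a Veritas tool. The function names are hypothetical, the message IDs and wording are taken solely from the log lines quoted in this article, and two things are assumptions inferred from those lines: that bit N of a membership mask stands for node N, and that a non-zero DDNA mask means the missing failover is the expected autodisable case from the second test.

```python
import re

# Message IDs as they appear in the engine_A.log excerpts in this article.
MSG_MEMBERSHIP = "V-16-1-10112"     # "... Membership: 0x2, DDNA: 0x0"
MSG_FAULTED = "V-16-1-10322"        # "System X ... changed state from RUNNING to FAULTED"
MSG_GROUP_OFFLINE = "V-16-1-10446"  # "Group G is offline on system X"
MSG_EVALUATING = "V-16-1-10493"     # "Evaluating X as potential target node for group G"

LINE_RE = re.compile(r"^\S+ \S+ VCS \w+ (V-\d+-\d+-\d+) (.*)$")
DDNA_RE = re.compile(r"DDNA: (0x[0-9a-fA-F]+)")
FAULTED_RE = re.compile(r"System (\S+) .*changed state from RUNNING to FAULTED")
OFFLINE_RE = re.compile(r"Group (\S+) is offline on system (\S+)")
EVALUATING_RE = re.compile(r"potential target node for group (\S+)")


def mask_to_nodes(mask):
    """Return node IDs whose bits are set (assumption: bit N stands for node N)."""
    value = int(mask, 16)
    return [n for n in range(value.bit_length()) if (value >> n) & 1]


def groups_without_failover(lines):
    """Return groups that went offline on a FAULTED system with no target evaluation logged."""
    faulted = set()     # systems that changed state to FAULTED
    ddna = 0            # most recent DDNA mask seen in a membership message
    candidates = set()  # groups reported offline on a faulted system while DDNA was 0x0
    evaluated = set()   # groups for which a failover target was evaluated
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        msg_id, text = match.groups()
        if msg_id == MSG_MEMBERSHIP:
            found = DDNA_RE.search(text)
            if found:
                ddna = int(found.group(1), 16)
        elif msg_id == MSG_FAULTED:
            found = FAULTED_RE.search(text)
            if found:
                faulted.add(found.group(1))
        elif msg_id == MSG_GROUP_OFFLINE:
            found = OFFLINE_RE.search(text)
            # Skip the DDNA case: there the group is autodisabled by design.
            if found and found.group(2) in faulted and ddna == 0:
                candidates.add(found.group(1))
        elif msg_id == MSG_EVALUATING:
            found = EVALUATING_RE.search(text)
            if found:
                evaluated.add(found.group(1))
    return sorted(candidates - evaluated)


if __name__ == "__main__":
    with open("/var/VRTSvcs/log/engine_A.log") as log:
        print(groups_without_failover(log))
```

Run against the first excerpt in this article, the check would report AGILESTAGE. Run against the excerpts from the second test, it would report nothing, because the group is either autodisabled while the node is in DDNA membership or a target node is evaluated after the power-off.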

Additional Information

ETrack: 1599129
ETrack: 2081720
ETrack: 1937834