engine_A.log:
2011/06/24 23:52:28 VCS NOTICE V-16-1-10446 Group relay is offline on system A
2011/06/24 23:53:43 VCS INFO V-16-1-50135 User root fired command: hagrp -online relay A from localhost
The customer expected the message "VCS INFO V-16-1-10493 Evaluating B as potential target node for group relay" to be logged and a failover to the standby node to be initiated. However, no failover was initiated before the service group was started manually.
Analysis of engine_A.log:
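The symptom above (a group going offline with no "Evaluating ... as potential target node" message afterwards) can be looked for mechanically in an engine log. The following is a minimal sketch, not a supported diagnostic: the function names `parse_line` and `offline_without_evaluation` are hypothetical, and the line layout and message IDs are assumed only from the excerpts quoted in this note. A planned `hagrp -offline` would also be flagged by this heuristic, so each hit still needs to be reviewed by hand.

```python
import re
from datetime import datetime

# Assumed layout, based on the excerpts in this note:
# "YYYY/MM/DD HH:MM:SS VCS <SEVERITY> <MESSAGE-ID> <text>"
LINE_RE = re.compile(
    r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) VCS (\w+) (V-\d+-\d+-\d+) (.*)$"
)

OFFLINE_ID = "V-16-1-10446"   # "Group <name> is offline on system <sys>"
EVALUATE_ID = "V-16-1-10493"  # "Evaluating <sys> as potential target node for group <name>"


def parse_line(line):
    """Return (timestamp, severity, message_id, text), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    stamp = datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S")
    return stamp, match.group(2), match.group(3), match.group(4)


def offline_without_evaluation(lines, group):
    """Return timestamps of offline events for `group` that were not followed by
    a target-node evaluation before the next offline event or the end of input."""
    offline_prefix = f"Group {group} is offline on system"
    evaluate_suffix = f"as potential target node for group {group}"
    flagged = []
    pending = None  # offline event still waiting for an evaluation message
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        stamp, _severity, message_id, text = parsed
        if message_id == OFFLINE_ID and text.startswith(offline_prefix):
            if pending is not None:
                flagged.append(pending)
            pending = stamp
        elif message_id == EVALUATE_ID and text.endswith(evaluate_suffix):
            pending = None
    if pending is not None:
        flagged.append(pending)
    return flagged
```

Passing the lines of engine_A.log and the group name `relay` returns the times at which the group went offline without any target evaluation being logged.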
1) Group relay was online on system csms-sb.
2011/06/24 18:53:01 VCS NOTICE V-16-1-10447 Group relay is online on system csms-sb
2) Critical resource 'smsdg' of group 'relay' faulted on system csms-sb.
2011/06/24 21:18:43 VCS ERROR V-16-2-13067 (csms-sb) Agent is calling clean for resource(smsdg) because the resource became OFFLINE unexpectedly, on its own.
3) Engine started failover process for group relay.
4) Group relay was taken offline on system csms-sb.
2011/06/24 21:49:49 VCS ERROR V-16-1-10205 Group relay is faulted on system csms-sb
2011/06/24 21:49:49 VCS NOTICE V-16-1-10446 Group relay is offline on system csms-sb
5) Systems csms1 and csms-sb were evaluated as potential target nodes.
2011/06/24 21:49:49 VCS INFO V-16-1-10493 Evaluating csms1 as potential target node for group relay
2011/06/24 21:49:49 VCS INFO V-16-1-10493 Evaluating csms-sb as potential target node for group relay
2011/06/24 21:49:49 VCS INFO V-16-1-50010 Group relay is online or faulted on system csms-sb
6) System csms1 was selected as the target node, and the online of group relay was initiated on csms1.
2011/06/24 21:49:49 VCS NOTICE V-16-1-10301 Initiating Online of Resource smsdg (Owner: Unspecified, Group: relay) on System csms1
2011/06/24 21:49:49 VCS NOTICE V-16-1-10301 Initiating Online of Resource vip_172 (Owner: Unspecified, Group: relay) on System csms1
2011/06/24 21:49:49 VCS NOTICE V-16-1-10301 Initiating Online of Resource vip_203 (Owner: Unspecified, Group: relay) on System csms1
7) Before group relay could come fully online on csms1, the user flushed group relay on system csms1.
2011/06/24 21:51:32 VCS INFO V-16-1-50135 User root fired command: hagrp -flush relay csms1 0 from localhost
8) User froze group relay.
2011/06/24 21:56:51 VCS INFO V-16-1-50135 User root fired command: hagrp -freeze relay from localhost
9) Resource smsdg of group relay came online on system csms-sb outside VCS control. This created a concurrency violation for group relay.
2011/06/24 22:49:03 VCS INFO V-16-1-10299 Resource smsdg (Owner: Unspecified, Group: relay) is online on csms-sb (Not initiated by VCS)
2011/06/24 22:49:03 VCS ERROR V-16-1-10214 Concurrency Violation:CurrentCount increased above 1 for failover group relay
10) However, the concurrency violation script could not take group relay offline on system csms-sb because the group was frozen.
2011/06/24 22:49:03 VCS ERROR V-16-6-15036 (csms-sb) violation:hagrp -offline command failed to offline group relay on system csms-sb
2011/06/24 22:49:03 VCS INFO V-16-6-15077 (csms-sb) violation:The output of hagrp -offline is: VCS WARNING V-16-1-10154 Group relay is frozen
11) A concurrency violation was detected once again on system csms-sb.
2011/06/24 23:36:06 VCS INFO V-16-1-10299 Resource smsdg (Owner: Unspecified, Group: relay) is online on csms-sb (Not initiated by VCS)
2011/06/24 23:36:06 VCS ERROR V-16-1-10214 Concurrency Violation:CurrentCount increased above 1 for failover group relay
2011/06/24 23:36:06 VCS ERROR V-16-6-15036 (csms-sb) violation:hagrp -offline command failed to offline group relay on system csms-sb
2011/06/24 23:36:06 VCS INFO V-16-6-15077 (csms-sb) violation:The output of hagrp -offline is: VCS WARNING V-16-1-10154 Group relay is frozen
12) User cleared group relay.
2011/06/24 23:42:45 VCS INFO V-16-1-50135 User root fired command: hagrp -clear relay from localhost
13) User unfroze group relay. Concurrency violation was removed successfully.
2011/06/24 23:42:52 VCS INFO V-16-1-50135 User root fired command: hagrp -unfreeze relay from localhost
Because of this concurrency violation, Group::TargetCount was incorrectly set to 0.
14) User tried to bring group relay online manually.
2011/06/24 23:43:19 VCS INFO V-16-1-50135 User root fired command: hagrp -online relay csms1 from localhost
However, this did not correct the value of Group::TargetCount. Group relay was already active on system csms1 (from before the freeze/unfreeze and the concurrency violations), so the value of Group::TargetCount remained 0. As a result, when group relay later faulted on csms1, the group completed its offline on csms1 but the engine did not evaluate a target node on which to bring it online.
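The sequence in steps 9 to 13 (a concurrency violation, the violation script blocked by a freeze, then an unfreeze) leaves a recognizable trail in the log. The sketch below counts those events for one group. It is an illustration only: the function name `summarize_violation_sequence` and the returned keys are hypothetical, the message IDs are taken solely from the excerpts in this note, and a positive result only suggests that the Group::TargetCount problem described here may apply.

```python
import re

# Message IDs as they appear in the log excerpts quoted in this note.
CONCURRENCY_ID = "V-16-1-10214"     # "Concurrency Violation:CurrentCount increased above 1 ..."
TRIGGER_OUTPUT_ID = "V-16-6-15077"  # violation script reporting the hagrp -offline output
FROZEN_ID = "V-16-1-10154"          # "Group <name> is frozen"
USER_COMMAND_ID = "V-16-1-50135"    # "User root fired command: ..."


def summarize_violation_sequence(lines, group):
    """Count concurrency violations for `group`, how often the violation script's
    offline attempt was blocked by a freeze, and whether an unfreeze followed."""
    violation_suffix = f"for failover group {group}"
    frozen_text = f"Group {group} is frozen"
    unfreeze_re = re.compile(rf"hagrp -unfreeze {re.escape(group)}(?:\s|$)")

    violations = 0
    blocked_by_freeze = 0
    unfrozen_after_block = False
    for raw in lines:
        line = raw.rstrip()
        if CONCURRENCY_ID in line and line.endswith(violation_suffix):
            violations += 1
        elif TRIGGER_OUTPUT_ID in line and FROZEN_ID in line and frozen_text in line:
            blocked_by_freeze += 1
        elif USER_COMMAND_ID in line and blocked_by_freeze > 0 and unfreeze_re.search(line):
            unfrozen_after_block = True
    return {
        "violations": violations,
        "blocked_by_freeze": blocked_by_freeze,
        "unfrozen_after_block": unfrozen_after_block,
    }
```

If `blocked_by_freeze` is greater than zero and `unfrozen_after_block` is true, the log matches the pattern analyzed above, and later faults of the group should be checked for a missing target-node evaluation.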
The issue of Group::TargetCount being incorrectly set to 0 after a concurrency violation is tracked under incident # 2409038 (Abstract: VCS:ENGINE:issue related to hagrp -switch failed and online command hung). The current plan is to address it in the 6.0 release.
Applies To
Storage Foundation High Availability (SF HA) 5.1 SP1RP1
Red Hat Linux 5.1, x86