Remote group resource monitor times out when LDOM restarts and does not detect correct status

Article ID: 100019424

Resolution

Observed messages:

2008/08/16 14:15:42 VCS DBG_1 V-16-50-0 RemoteGroup:rdgtldom1grp:offline:Remote cluster handle is retrieved from the previously stored cookie.
RemoteGroup.C:get_remote_cluster_handle[585]
2008/08/16 14:19:26 VCS ERROR V-16-10061-8 RemoteGroup:rdgtldom1grp:offline:Error in Offlining the Remote Service Group. Error Id 264 and Error Message: Failed to failover socket connection to remaining VCS systems for Cluster Handle 52348. Unable to provide PERSISTENT connection. Please verify at least one VCS cluster system is in RUNNING state


The agent then attempts another monitor, which also fails: the connection to the remote cluster hangs until the monitor times out 60 seconds later (14:19:27 to 14:20:27):

2008/08/16 14:19:27 VCS DBG_1 V-16-50-0 RemoteGroup:rdgtldom1grp:monitor:Remote cluster handle is retrieved from the previously stored cookie.
RemoteGroup.C:get_remote_cluster_handle[585]
2008/08/16 14:19:27 VCS DBG_1 V-16-50-0 RemoteGroup:rdgtldom1grp:monitor:Monitoring the Globle State of Remote Service Group.
RemoteGroup.C:remotegroup_monitor[1406]
2008/08/16 14:19:27 VCS DBG_1 V-16-50-0 RemoteGroup:rdgtldom1grp:monitor:Successfully connected to the local cluster.
RemoteGroup.C:get_local_cluster_handle[609]
2008/08/16 14:19:27 VCS DBG_1 V-16-50-0 RemoteGroup:rdgtldom1grp:monitor:Successfully retrieved Local Service Group Name
RemoteGroup.C:get_lsg_name[770]
2008/08/16 14:19:27 VCS DBG_1 V-16-50-0 RemoteGroup:rdgtldom1grp:monitor:Successfully retrieved Local System Name
RemoteGroup.C:get_local_system_name[792]
2008/08/16 14:19:27 VCS DBG_1 V-16-50-0 RemoteGroup:rdgtldom1grp:monitor:Successfully retrieved the Local Service Group (rdgtldom1) State
RemoteGroup.C:get_lsg_state[845]
2008/08/16 14:19:27 VCS DBG_1 V-16-50-0 RemoteGroup:rdgtldom1grp:monitor:State of the Local Serive Group on Local System is 10
RemoteGroup.C:get_lsg_state[860]
2008/08/16 14:19:27 VCS DBG_1 V-16-50-0 RemoteGroup:rdgtldom1grp:monitor:Successfully closed connection to the cluster.
RemoteGroup.C:close_lsg_cluster_handle[629]
2008/08/16 14:19:27 VCS DBG_1 V-16-50-0 RemoteGroup:rdgtldom1grp:monitor:Remote cluster handle is retrieved from the previously stored cookie.
RemoteGroup.C:get_remote_cluster_handle[585]
2008/08/16 14:20:27 VCS WARNING V-16-2-13139 Thread(2) Canceling thread (3)
2008/08/16 14:20:27 VCS ERROR V-16-2-13027 Thread(4) Resource(rdgtldom1grp) - monitor procedure did not complete within the expected time.


The agent then calls clean because the resource is still up after the offline completed. Clean deletes the stored cookies, so the agent tries to reconnect to the remote cluster using VCSApiConnectionInfo::Connect. This call also hangs until clean times out 60 seconds later (14:20:27 to 14:21:27):

2008/08/16 14:20:27 VCS ERROR V-16-2-13064 Thread(4) Agent is calling clean for resource(rdgtldom1grp) because the resource is up even after offline completed.
2008/08/16 14:20:27 VCS DBG_1 V-16-50-0 RemoteGroup:rdgtldom1grp:clean:All the required attributes are set.
RemoteGroup.C:remotegroup_clean[1679]
2008/08/16 14:20:27 VCS DBG_1 V-16-50-0 RemoteGroup:rdgtldom1grp:clean:Deleted All Cookies.
RemoteGroup.C:remotegroup_clean[1694]
2008/08/16 14:20:27 VCS DBG_1 V-16-50-0 RemoteGroup:rdgtldom1grp:clean:Remote cluster handle is retrieved from the previously stored cookie.
RemoteGroup.C:get_remote_cluster_handle[585]
2008/08/16 14:21:27 VCS WARNING V-16-2-13139 Thread(2) Canceling thread (4)
2008/08/16 14:21:27 VCS ERROR V-16-2-13006 Thread(5) Resource(rdgtldom1grp): clean procedure did not complete within the expected time.
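To check whether a system shows the same pattern, the two timeout messages above can be pulled out of the agent log. The following is a minimal sketch: the function name `list_timeouts` is hypothetical, the message IDs are the ones in the excerpts above, and the log location (typically something like /var/VRTSvcs/log/RemoteGroup_A.log) is an assumption that should be confirmed on the affected system.

```shell
# Hypothetical helper: list monitor/clean timeout messages for one resource.
# V-16-2-13027 = monitor timeout, V-16-2-13006 = clean timeout
# (IDs copied from the log excerpts in this article).
# Usage: list_timeouts <logfile> <resource>
list_timeouts() {
    # First keep only the two timeout message IDs, then only lines that
    # name the resource in the "Resource(<name>)" form used by the agent.
    grep -E 'V-16-2-(13027|13006)' "$1" | grep -F "Resource($2)"
}
```

For example, `list_timeouts /var/VRTSvcs/log/RemoteGroup_A.log rdgtldom1grp` would print only the timeout lines for the resource named in this article, assuming the log lives at that path.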


Solution:


Configuration changes suggested by Engineering.

The underlying issue is that the RemoteGroup resource goes into the UNKNOWN state when the LDom is down, and this prevents the LDom service group from being brought online. It can be resolved by moving the RemoteGroup resource to a separate service group and making that service group the parent of the LDom service group with an "online global firm" dependency.

1. Move the RemoteGroup resource to a separate service group so that the LDom service group can be fully probed and brought online even if the RemoteGroup resource is in the UNKNOWN state.
The RemoteGroup resource remains in the UNKNOWN state until the LDom service group is online and the VCS engine is running inside the LDom. Once the LDom and its VCS engine are up, the RemoteGroup agent running in the control domain connects to the VCS engine running in the LDom and the UNKNOWN state clears. The resource can then be brought online, if it is not already online.

2. Link the RemoteGroup's service group and the LDom service group with an "online global firm" dependency, so that the VCS engine can move or fail over the faulted child LDom service group independently of the RemoteGroup's service group.
If the two service groups are linked with "online local firm", VCS will not fail over the LDom service group to another node until the RemoteGroup's service group goes offline. After the LDom faults, the RemoteGroup resource goes into the UNKNOWN state because it cannot connect to the remote cluster, so the RemoteGroup's service group cannot go offline (it stays in |ONLINE|STOPPING|). The result is a deadlock.

However, if the two service groups are linked with "online global firm", VCS can move the faulted LDom service group to another node. When the LDom goes online, the RemoteGroup agent can connect to the VCS engine running inside the LDom, take the application service group offline, and then bring it online again.

Also set OfflineWaitLimit and ToleranceLimit to 1 for the RemoteGroup resource.
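The dependency link and the two attribute settings described above could be applied from the VCS command line roughly as follows. This is a sketch under stated assumptions, not a procedure supplied by Engineering: the parent group name `remotegroup_sg` is a hypothetical placeholder, `rdgtldom1` is the local service group name seen in the logs and is assumed here to be the LDom service group, and the attributes are assumed to be set per resource with `hares -override` followed by `hares -modify`. The script prints the commands by default (`DRY_RUN=1`) so they can be reviewed before anything is changed.

```shell
# Sketch only. Names marked "hypothetical" or "assumed" are not from the article.
PARENT_SG="${PARENT_SG:-remotegroup_sg}"  # hypothetical: new group holding the RemoteGroup resource
CHILD_SG="${CHILD_SG:-rdgtldom1}"         # assumed: LDom service group (local group name in the logs)
RES="${RES:-rdgtldom1grp}"                # RemoteGroup resource name from the logs
DRY_RUN="${DRY_RUN:-1}"                   # 1 = only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

apply_changes() {
    run haconf -makerw                                   # open the configuration for writing
    # Parent group first, then child group, then the dependency type.
    run hagrp -link "$PARENT_SG" "$CHILD_SG" online global firm
    for attr in OfflineWaitLimit ToleranceLimit; do
        run hares -override "$RES" "$attr"               # allow a per-resource value
        run hares -modify "$RES" "$attr" 1
    done
    run haconf -dump -makero                             # save and close the configuration
}
```

Calling `apply_changes` with the defaults prints seven commands for review. Setting `DRY_RUN=0` on a cluster node would execute them, which assumes the RemoteGroup resource has already been moved to the parent group.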



The customer should also upgrade to the following level for the fix:

Veritas Cluster Server 5.0 Maintenance Pack 3 Hot Fix 1 for Solaris (SPARC)
https://www.veritas.com/support/en_US/article.000035618