Global Cluster Option (GCO) - Global Service Group cannot be onlined with error message "Group is in the middle of a group operation in cluster <cluster name>"
Article ID: 100021690
Description
Error Message
VCS WARNING V-16-1-51042 Cannot online group SG_Global3. Group is in the middle of a group operation in cluster clus_x4200ab
Resolution
Symptoms
-----------------
A GCO configuration has two clusters (a local cluster and a remote cluster). The local cluster has Parallel Service Groups with 3 or more levels of Service Group Dependency. The Parallel Service Groups are configured to run on only some of the nodes in the local cluster, not all of them. When all the systems configured to run the Parallel Service Groups are faulted, the Parallel Service Groups that have more than 2 levels of Service Group Dependency cannot be onlined on the remote cluster, and the following error message is displayed.
VCS WARNING V-16-1-51042 Cannot online group SG_Global3. Group is in the middle of a group operation in cluster
This is because the faulted node is not cleared from the Migration Queue of the Parallel Service Groups with more than 2 levels of Service Group Dependency.
Summary
---------------
When the node running a service group faults, VCS evacuates the service groups in the online-local-firm dependency tree on the faulted node. VCS then tries to fail over the groups to another node in the cluster. If a service group cannot be failed over, the MigrateQ for that service group should be cleared. However, when the group dependency has more than 2 levels, the MigrateQs of the groups with more than 2 levels of dependency are not cleared. Because the MigrateQ is not cleared for these service groups on the first cluster, where the node faulted, they cannot go online on the remote cluster.
Fixing this issue requires a change to the core logic of VCS: the MigrateQs of all the service groups that could not fail over need to be cleared. As a workaround, flush each service group whose MigrateQ has not been cleared (using hagrp -flush group -sys system [-clus cluster]) before onlining the group on the other cluster, as shown in steps 3 and 4 of the workaround below.
Explanation of the issue
--------------------------------------
Cluster 1 with 3 Service Groups in a 3 level dependency
-------------------------------------------------------------------------------
# hagrp -dep
#Parent Child Relationship
SG_Global2 SG_Global1 online local firm
SG_Global3 SG_Global2 online local firm
SG_Global3
|
| Online Local Firm
|
SG_Global2
|
| Online Local Firm
|
SG_Global1
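To check whether a configuration has the 3 or more dependency levels that trigger this issue, the dependency listing can be summarised per group. The following is only a minimal sketch: dep_levels is a hypothetical helper name, and it assumes the `hagrp -dep` output has the layout shown above (a `#` header line, then the parent in column 1 and the child in column 2).

```shell
#!/bin/sh
# dep_levels (hypothetical helper): read `hagrp -dep` output on stdin and
# print "<level> <group>" for every group, lowest level first.
# Assumption: column 1 is the parent, column 2 is the child, as in this article.
dep_levels() {
    awk '
        /^#/ || NF < 2 { next }                  # skip header and blank lines
        { parent[++n] = $1; child[n] = $2
          level[$1] += 0; level[$2] += 0 }       # register both group names
        END {
            for (g in level) level[g] = 1        # a group with no child is level 1
            do {                                 # raise each parent above its child
                changed = 0
                for (i = 1; i <= n; i++)
                    if (level[parent[i]] < level[child[i]] + 1) {
                        level[parent[i]] = level[child[i]] + 1
                        changed = 1
                    }
            } while (changed && ++pass <= n)     # pass limit guards against loops
            for (g in level) print level[g], g
        }' | sort -k1,1n -k2,2
}
```

Running `hagrp -dep | dep_levels` on a cluster node would then list each group with its level; any group reported at level 3 or higher is a candidate for the stale MigrateQ described in this article.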
There are two nodes in Cluster 1: sys1 and sys2. The parallel groups SG_Global3, SG_Global2 and SG_Global1 are configured to run on sys2 only.
Now sys2 is faulted.
#System Attribute Value
localclus:sys1 SysState RUNNING
localclus:sys2 SysState FAULTED <<< sys2 is faulted
C2:sys3 SysState RUNNING
C2:sys4 SysState RUNNING
Cluster 2
---------------
The user issues the online command for SG_Global3 on the remote cluster and gets the following error.
# hagrp -online SG_Global3 -sys sys3
VCS WARNING V-16-1-51042 Cannot online group SG_Global3. Group is in the middle of a group operation in cluster C1
The warning is displayed because the MigrateQ for SG_Global3 still contains the faulted system. (Note: the Service Group attribute MigrateQ is displayed only with the option "-all".)
# hagrp -display -all | grep -i migrateq
ClusterService MigrateQ localclus
OracleGrp MigrateQ localclus
SG_Global1 MigrateQ localclus
SG_Global2 MigrateQ localclus
SG_Global3 MigrateQ localclus sys2 <<< MigrateQ is not cleared
Analysis of the issue
-----------------------------------
When the node running the service group SG_Global3 faults, VCS evacuates the service groups on the faulted node. Then VCS tries to failover the group to some other node in the cluster. The MigrateQ indicates the system from which the group is migrating. The faulted system is set in the MigrateQ, thus the faulted node will be set in the MigrateQ for all the service groups. Once the MigrateQ is set, VCS evaluates the nodes for failing over the service group. If no suitable target is found for failover, then the MigrateQ should be cleared for all the groups.
The MigrateQ is cleared for SG_Global2, which is in the parent links of SG_Global1. But the MigrateQ of SG_Global3, which is the parent of SG_Global2, is not cleared, since the parent links of SG_Global1 contain only SG_Global2. The engine log contains the following warning message, which shows that the MigrateQ is cleared for SG_Global2.
2009/08/07 21:38:25 VCS WARNING V-16-1-50047 Clearing parent group SG_Global2's migrateq, since child can not go online anywhere
But the corresponding message for SG_Global3 is not there:
Clearing parent group SG_Global3's migrateq, since child can not go online anywhere <<<<<<< This should have been called, but it was not.
This happens because VCS only clears the MigrateQ for the parent links of SG_Global1. This essentially means that if the group dependency is configured with more than 2 levels, we would see this problem.
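The gap can be illustrated by comparing the direct parents of a group with all of its ancestors in the dependency listing. This is only an illustrative sketch, not the VCS engine logic: direct_parents and ancestors are hypothetical helper names, and both assume the `hagrp -dep` layout shown earlier (parent in column 1, child in column 2).

```shell
#!/bin/sh
# direct_parents GROUP (hypothetical): groups that list GROUP as their child.
# This corresponds to the set whose MigrateQ the engine log shows being cleared.
direct_parents() {
    awk -v g="$1" '!/^#/ && $2 == g { print $1 }'
}

# ancestors GROUP (hypothetical): every group above GROUP in the dependency
# tree, i.e. the set whose MigrateQ would need clearing.
ancestors() {
    awk -v start="$1" '
        /^#/ || NF < 2 { next }                  # skip header and blank lines
        { parent[++n] = $1; child[n] = $2 }
        END {
            seen[start] = 1
            do {                                 # repeat until no new group is found
                changed = 0
                for (i = 1; i <= n; i++)
                    if ((child[i] in seen) && !(parent[i] in seen)) {
                        seen[parent[i]] = 1
                        print parent[i]
                        changed = 1
                    }
            } while (changed)
        }'
}
```

With the example configuration, `hagrp -dep | direct_parents SG_Global1` yields only SG_Global2, while `hagrp -dep | ancestors SG_Global1` also yields SG_Global3, the group whose MigrateQ is left uncleared.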
Solution
--------------
The problem will be fixed by clearing the MigrateQ for all the service groups which could not failover. VCS also needs to keep track of the systems which faulted and the dependencies that are configured. VCS needs to track the SystemList of the groups and clear the faulted systems from the MigrateQ accordingly.
Workaround
--------------------
The MigrateQ's of the service groups need to be cleared in order to online the service groups in the remote cluster.
Steps
1. The node running the service group in the cluster is faulted. VCS takes the system into faulted state.
2. Before initiating the online command for the service group, check whether the MigrateQ is set for the service groups by running this command on any surviving node in the first cluster that is in the running state. In the example, sys2 is the node in the faulted state.
# hagrp -display -all | grep -i migrateq <<< option "-all" is required
SG_Global1 MigrateQ localclus
SG_Global2 MigrateQ localclus
SG_Global3 MigrateQ localclus sys2 <<< sys2 is in the MigrateQ of service group SG_Global3
3. Before onlining the service group on the remote cluster, flush the group whose MigrateQ was not empty in step 2:
# hagrp -flush SG_Global3 -sys sys2 -clus C1
With this the MigrateQ will be cleared for SG_Global3
# hagrp -display -all | grep -i migrateq
SG_Global1 MigrateQ localclus
SG_Global2 MigrateQ localclus
SG_Global3 MigrateQ localclus <<< MigrateQ is now cleared
4. Now online the group on the remote cluster with the following command.
# hagrp -online SG_Global3 -sys sys3
The group should be able to go online on the remote cluster. Run the following command in the remote cluster to confirm.
# hagrp -display
.....
#Group Attribute System Value
SG_Global3 State C1:sys2 |OFFLINE|
SG_Global3 State localclus:sys3 |ONLINE| <<< SG_Global3 is now online on sys3 in the remote cluster
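Steps 2 and 3 can be combined into a dry run that turns the MigrateQ listing into the flush commands to review. This is a minimal sketch under stated assumptions: print_flush_cmds is a hypothetical helper name, the input is assumed to have the four-column layout shown in step 2 (Group, Attribute, System, Value), and a MigrateQ value naming several systems is assumed to be whitespace-separated. It only prints commands and does not run them.

```shell
#!/bin/sh
# print_flush_cmds [CLUSTER] (hypothetical helper): read `hagrp -display -all`
# output on stdin and print one `hagrp -flush` command for each system still
# listed in a group's MigrateQ. Nothing is executed.
print_flush_cmds() {
    awk -v clus="$1" '$2 == "MigrateQ" {
        for (i = 4; i <= NF; i++) {              # column 4 onward holds the value
            if ($i ~ /^<<</) break               # ignore annotations like "<<< ..."
            cmd = "hagrp -flush " $1 " -sys " $i
            if (clus != "") cmd = cmd " -clus " clus   # -clus is optional
            print cmd
        }
    }'
}
```

For example, `hagrp -display -all | print_flush_cmds C1` would print `hagrp -flush SG_Global3 -sys sys2 -clus C1` for the listing in step 2. Review the printed commands before running them, then continue with step 4.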