Global service group does not fail over in local cluster

Article ID: 100063121


Description

Error Message

The following is repeatedly reported in /var/VRTSvcs/log/wac_A.log on the primary cluster:

2024/02/06 12:20:18 VCS WARNING V-16-1-10519 IpmHandle::send peer closed
 

The following is repeatedly reported in /var/VRTSvcs/log/wac_A.log on the secondary (DR) cluster:

VCS ERROR V-16-3-18491 Unable to connect to remote cluster xxxxx securely
2024/02/06 12:20:41 VCS INFO V-16-3-18306 Initiating connection to cluster prodclus at xxx.xxx.xx.xxx

Here xxxxx is the remote cluster name and xxx.xxx.xx.xxx is the remote cluster IP address. Normal switch operations succeed until a node crashes.
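To gauge how often the secure-connection error recurs, the message ID can be counted in the WAC log. The following is a minimal sketch; the function name `count_secure_connect_errors` and the `WAC_LOG` variable are hypothetical and not part of VCS.

```shell
# Hypothetical helper: count log lines that carry message ID V-16-3-18491
# ("Unable to connect to remote cluster ... securely").
# Reads log text on stdin and prints the number of matching lines.
count_secure_connect_errors() {
    # grep -c prints 0 but exits non-zero when nothing matches;
    # "|| true" keeps the function's exit status at 0 in that case.
    grep -c 'V-16-3-18491' || true
}

# Usage on a DR cluster node (WAC_LOG is assumed to hold the path to wac_A.log):
#   count_secure_connect_errors < "$WAC_LOG"
```

A count that keeps rising between checks suggests that WAC is still retrying the secure connection.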

The remote cluster will be in an INIT state.

hastatus -sum

-- SYSTEM STATE
-- System               State                Frozen

A  server101            RUNNING              0
A  server102            RUNNING              0

-- GROUP STATE
-- Group           System               Probed     AutoDisabled    State

....

-- WAN HEARTBEAT STATE
-- Heartbeat       To                   State

M  Icmp            drclus               ALIVE

-- REMOTE CLUSTER STATE
-- Cluster         State

N  drclus          INIT

...

P  globalgroup     drclus:drserver201   Y          N               OFFLINE
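The remote cluster state can also be picked out of the summary with a short filter. This is a sketch that assumes rows beginning with `N` are REMOTE CLUSTER STATE rows, as in the sample output above; the function name `remote_clusters_not_running` is hypothetical.

```shell
# Hypothetical helper: list remote clusters whose state is not RUNNING.
# Reads `hastatus -sum` output on stdin and prints "<cluster> <state>"
# for every row whose first field is "N" (assumed to be remote cluster rows).
remote_clusters_not_running() {
    awk '$1 == "N" && NF >= 3 && $3 != "RUNNING" { print $2 " " $3 }'
}

# Usage on a cluster node:
#   hastatus -sum | remote_clusters_not_running
```

For the sample output above, the filter would report `drclus INIT`.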

The following command may show the service group queued for migration (MigrateQ) from the node that crashed:

hagrp -display -all|grep -i migrateq
ClusterService                    MigrateQ                         localclus
globalgroup                       MigrateQ              localclus     Server101
globalgroup                       MigrateQ              localclus     Server101

The following command may show a non-zero IntentOnline value for the service group:

hagrp -display -all|grep -i intentonline

globalgroup                       IntentOnline              localclus     1
globalgroup                       IntentOnline              localclus     1
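The IntentOnline check can be narrowed to the affected groups only. The sketch below assumes the column layout shown above (group, attribute, system, value); the function name `groups_with_intent_online` is hypothetical.

```shell
# Hypothetical helper: print each group that has a non-zero IntentOnline value.
# Reads `hagrp -display -all` output on stdin; the value is assumed to be
# the last field on the line, as in the sample output above.
groups_with_intent_online() {
    awk '$2 == "IntentOnline" && $NF != "0" { print $1 }' | sort -u
}

# Usage on a cluster node:
#   hagrp -display -all | groups_with_intent_online
```

`sort -u` collapses the per-system rows so that each group is listed once.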

 

Cause

The WAC process of the ClusterService group has been configured to run in secure mode:

 ps -ef | grep wac
root       32467       1 12 09:16 ?        00:30:44 /opt/VRTSvcs/bin/wac -secure

Trust has not been correctly configured between the DR site and the node in question. As a result, WAC cannot enter a running state and remains stuck in the INIT state.
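Whether WAC is running in secure mode can be checked from the process list without reading it by eye. This is a minimal sketch that assumes the binary path shown above; the function name `wac_is_secure` is hypothetical.

```shell
# Hypothetical helper: succeed (exit 0) if the process listing on stdin
# contains /opt/VRTSvcs/bin/wac followed by the -secure argument.
wac_is_secure() {
    awk '{
        seen = 0
        for (i = 1; i <= NF; i++) {
            if ($i == "/opt/VRTSvcs/bin/wac") seen = 1      # found the WAC binary
            else if (seen && $i == "-secure") found = 1      # followed by -secure
        }
    }
    END { exit(found ? 0 : 1) }'
}

# Usage on a cluster node:
#   ps -ef | wac_is_secure && echo "WAC is running in secure mode"
```

Matching on whole fields avoids a false positive from the `grep wac` process itself.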

Resolution

1. The immediate workaround is to flush the service group on both nodes and then bring it online manually. For example:

hagrp -flush globalgroup -sys server101

hagrp -flush globalgroup -sys server102

hagrp -online globalgroup -sys server102

 

2. The permanent solution is to do one of the following:

a. Modify the WAC start program and monitored process so that WAC runs in non-secure mode.

haconf -makerw

hares -modify wac StartProgram "/opt/VRTSvcs/bin/wacstart"
hares -modify wac MonitorProcesses "/opt/VRTSvcs/bin/wac"

haconf -dump -makero

b. Establish trust between the problem node and the DR cluster. Run the following on the node where the service group will not come online, and on all nodes at the DR site.

export EAT_DATA_DIR=/var/VRTSvcs/vcsauth/data/WAC
/opt/VRTSvcs/bin/vcsat setuptrust -b xxx.xxx.xx.xxx:14149 -s high

Where xxx.xxx.xx.xxx is an IP address on the remote node. 

Example, assuming the following:

server102 is the node where the service group does not come online, and it has an IP address of 192.168.10.102.

DRServer201 is a node on the DR site with an IP address of 192.168.10.201.

DRServer202 is the second node on the DR site with an IP address of 192.168.10.202.

The following is executed on server102:

export EAT_DATA_DIR=/var/VRTSvcs/vcsauth/data/WAC
/opt/VRTSvcs/bin/vcsat setuptrust -b 192.168.10.201:14149 -s high

export EAT_DATA_DIR=/var/VRTSvcs/vcsauth/data/WAC
/opt/VRTSvcs/bin/vcsat setuptrust -b 192.168.10.202:14149 -s high

The following is executed on DRServer201 and DRServer202:

export EAT_DATA_DIR=/var/VRTSvcs/vcsauth/data/WAC
/opt/VRTSvcs/bin/vcsat setuptrust -b 192.168.10.102:14149 -s high
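When several remote nodes are involved, the trust commands can be generated from a list of IP addresses and reviewed before anything is run. The following sketch only prints the commands; the function name `print_setuptrust_cmds` is hypothetical, and port 14149 and the `high` security level are taken from the example above.

```shell
# Hypothetical helper: print (do not execute) the trust setup commands
# for each remote broker IP address given as an argument.
print_setuptrust_cmds() {
    echo 'export EAT_DATA_DIR=/var/VRTSvcs/vcsauth/data/WAC'
    for ip in "$@"; do
        echo "/opt/VRTSvcs/bin/vcsat setuptrust -b ${ip}:14149 -s high"
    done
}

# Usage on server102, listing both DR nodes from the example:
#   print_setuptrust_cmds 192.168.10.201 192.168.10.202
```

Printing rather than executing leaves the decision to accept each broker's certificate to the administrator running the commands.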

Issue/Introduction

A service group does not fail over within the local cluster in a Global Cluster Option (GCO) configuration. That is, in the event of a crash, a service group successfully fails over from node A to node B, but does not attempt to fail over from node B back to node A.