Global service group does not fail over in the local cluster

Description

Error Message

The following is repeatedly reported in the /etc/VRTSvcs/log/wac_A.log on the primary cluster.

2024/02/06 12:20:18 VCS WARNING V-16-1-10519 IpmHandle::send peer closed

The following is repeatedly reported in the /etc/VRTSvcs/log/wac_A.log on the secondary (DR) cluster.

VCS ERROR V-16-3-18491 Unable to connect to remote cluster xxxxx securely
2024/02/06 12:20:41 VCS INFO V-16-3-18306 Initiating connection to cluster prodclus at xxx.xxx.xx.xxx

Where xxxxx is the remote cluster name and xxx.xxx.xx.xxx is the cluster IP address. Normal switch activity succeeds up until the node crashes.
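
To gauge how often each message recurs, the log lines can be counted by message ID. The following is a minimal sketch only: the function name count_wac_msg is hypothetical, and it assumes nothing beyond the log path and message IDs quoted above.

```shell
# Hypothetical helper: count the lines in a WAC log that contain a message ID.
# Usage: count_wac_msg <message-id> [logfile]
count_wac_msg() {
    msg_id=$1
    # Default to the log path quoted in this article.
    log_file=${2:-/etc/VRTSvcs/log/wac_A.log}
    # -c prints the number of matching lines; -F matches the ID as a fixed string.
    grep -cF -- "$msg_id" "$log_file"
}

# Example (primary cluster):   count_wac_msg V-16-1-10519
# Example (secondary cluster): count_wac_msg V-16-3-18491
```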

The remote cluster will be in an INIT state.

hastatus -sum

-- SYSTEM STATE
-- System               State                Frozen

A  server101            RUNNING              0
A  server102            RUNNING              0

-- GROUP STATE
-- Group           System               Probed     AutoDisabled    State

....

-- WAN HEARTBEAT STATE
-- Heartbeat       To                   State

M  Icmp            drclus               ALIVE

-- REMOTE CLUSTER STATE
-- Cluster         State

N  drclus          INIT

...

P globalgroup drclus:drserver201 Y N OFFLINE
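
The INIT check can also be scripted against the summary output. This is a sketch under stated assumptions: the function name remote_clusters_in_init is hypothetical, and it assumes that remote cluster rows begin with "N" followed by the cluster name and state, as in the listing above.

```shell
# Hypothetical helper: read `hastatus -sum` output on stdin and print the name
# of every remote cluster whose state is INIT.
# Assumption: remote cluster rows look like "N <cluster> <state>".
remote_clusters_in_init() {
    awk '$1 == "N" && $3 == "INIT" { print $2 }'
}

# Example: hastatus -sum | remote_clusters_in_init
```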

The following command may show the service group in a migrate state from the node that crashed.

hagrp -display -all|grep -i migrateq
ClusterService MigrateQ localclus
globalgroup MigrateQ localclus Server101
globalgroup MigrateQ localclus Server101

The following command may show a non-zero value for the service group.

hagrp -display -all|grep -i intentonline

globalgroup IntentOnline localclus 1
globalgroup IntentOnline localclus 1
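
The same check can be done without reading the output by eye. The sketch below is illustrative only: the function name groups_with_intent_online is hypothetical, and it assumes four whitespace-separated columns with the value in the fourth, as in the output above.

```shell
# Hypothetical helper: read `hagrp -display -all` output on stdin and print
# "<group> <value>" for every IntentOnline row whose value is non-zero.
# Assumption: the attribute name is in column 2 and its value in column 4.
groups_with_intent_online() {
    awk '$2 == "IntentOnline" && $4 != "" && $4 != 0 { print $1, $4 }'
}

# Example: hagrp -display -all | groups_with_intent_online
```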

Cause

The WAC process in the ClusterService group has been configured to run in secure mode:

ps -ef | grep wac
root 32467 1 12 09:16 ? 00:30:44 /opt/VRTSvcs/bin/wac -secure
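
To avoid misreading the process list (the grep command itself also matches "wac"), the mode can be derived from the ps output. This is a minimal sketch: the function name wac_mode is hypothetical, and it assumes the binary path shown above.

```shell
# Hypothetical helper: read `ps -ef` output on stdin and report whether the
# wac process is running, and whether it was started with -secure.
# Assumption: the binary is /opt/VRTSvcs/bin/wac, as shown in this article.
wac_mode() {
    awk '
        # Match the wac binary itself, not wacstart and not "grep wac".
        /\/opt\/VRTSvcs\/bin\/wac( |$)/ {
            found = 1
            if ($0 ~ /\/wac -secure( |$)/) secure = 1
        }
        END {
            if (!found)      print "not running"
            else if (secure) print "secure"
            else             print "insecure"
        }'
}

# Example: ps -ef | wac_mode
```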

Trust has not been configured, or has not been configured correctly, between the DR site and the node in question. As a result, WAC cannot enter a RUNNING state and remains stuck in an INIT state.

Resolution

1. The immediate workaround is to flush the service group on both nodes and then manually bring the service group online. For example:

hagrp -flush globalgroup -sys server101

hagrp -flush globalgroup -sys server102

hagrp -online globalgroup -sys server102

2. The permanent solution is to either:

a. Modify the WAC start and monitor processes to run in insecure mode.

haconf -makerw

hares -modify wac StartProgram "/opt/VRTSvcs/bin/wacstart"
hares -modify wac MonitorProcesses "/opt/VRTSvcs/bin/wac"

haconf -dump -makero

b. Establish trust between the problem node and the DR cluster. The following must be run on the node where the service group will not come online, and on all nodes at the DR site.

export EAT_DATA_DIR=/var/VRTSvcs/vcsauth/data/WAC
/opt/VRTSvcs/bin/vcsat setuptrust -b xxx.xxx.xx.xxx:14149 -s high

Where xxx.xxx.xx.xxx is an IP address on the remote node.

Example assuming the following:

server102 is the node where the service group does not come online and has an IP address of 192.168.10.102.

DRServer201 is a node on the DR site with an IP address of 192.168.10.201.

DRServer202 is the second node on the DR site with an IP address of 192.168.10.202.

The following is executed on server102:

export EAT_DATA_DIR=/var/VRTSvcs/vcsauth/data/WAC
/opt/VRTSvcs/bin/vcsat setuptrust -b 192.168.10.201:14149 -s high

export EAT_DATA_DIR=/var/VRTSvcs/vcsauth/data/WAC
/opt/VRTSvcs/bin/vcsat setuptrust -b 192.168.10.202:14149 -s high

The following is executed on DRServer201 and DRServer202:

export EAT_DATA_DIR=/var/VRTSvcs/vcsauth/data/WAC
/opt/VRTSvcs/bin/vcsat setuptrust -b 192.168.10.102:14149 -s high
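
When several peers are involved, the per-node commands can be generated rather than typed by hand. The sketch below only prints the commands for review and does not run them. The function name print_setuptrust_cmds is hypothetical; the vcsat path, port 14149, and the "-s high" option are taken from the steps above.

```shell
# Hypothetical helper: print (do not execute) the setuptrust commands for each
# peer IP address given as an argument.
# Assumption: path, port and options are exactly those used in this article.
print_setuptrust_cmds() {
    echo 'export EAT_DATA_DIR=/var/VRTSvcs/vcsauth/data/WAC'
    for peer_ip in "$@"; do
        echo "/opt/VRTSvcs/bin/vcsat setuptrust -b ${peer_ip}:14149 -s high"
    done
}

# Example, on server102:   print_setuptrust_cmds 192.168.10.201 192.168.10.202
# Example, on each DR node: print_setuptrust_cmds 192.168.10.102
```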

Issue/Introduction

A service group does not fail over within the local cluster in a GCO configuration. That is, in the event of a crash, a service group will successfully fail over from node A to node B, but will not attempt to fail over from node B to node A.
