Remote cluster reported in 'BUILD' state in InfoScale 7.4.2/rhel7 GCO environment

Description

Error Message

2021/12/13 10:52:08 VCS INFO V-16-3-18306 Initiating connection to cluster dr11 at #.#.#.#
2021/12/13 11:07:51 VCS NOTICE V-16-3-18322 Lost connection to cluster dr11 while in BUILD; resetting state to INIT
2021/12/13 11:07:57 VCS INFO V-16-3-18306 Initiating connection to cluster dr11 at #.#.#.#
2021/12/13 11:23:40 VCS NOTICE V-16-3-18322 Lost connection to cluster dr11 while in BUILD; resetting state to INIT
2021/12/13 11:23:45 VCS INFO V-16-3-18306 Initiating connection to cluster dr11 at #.#.#.#
2021/12/13 11:30:26 VCS ERROR V-16-3-18211 Cluster prod11 lost heartbeat PocHB to cluster dr11

In the messages above, '#.#.#.#' represents the corresponding IP address.

Cause

While troubleshooting a NetBackup (NBU) certificate issue on these clusters, it was observed that Jumbo Frames (MTU 9000) were in use on the network interfaces of both clusters.

prodsys1 - ifconfig -a
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
eth0:0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
eth0:1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000

drsys1 - ifconfig -a
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
eth0:0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
eth0:1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
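
As a quick way to repeat this check on other nodes, the following is a minimal sketch that lists interfaces whose MTU is above the standard 1500 bytes. The function name jumbo_interfaces is hypothetical, and the sketch assumes the iproute2 `ip` command is available (as it normally is on RHEL 7); it is not part of InfoScale.

```shell
# Print "<interface> <mtu>" for every non-loopback interface whose MTU
# exceeds 1500 bytes. Expects `ip -o link show` output on stdin.
jumbo_interfaces() {
    awk '{
        name = $2                 # second field is the interface name, e.g. "eth0:"
        sub(/:$/, "", name)       # drop the trailing colon
        if (name == "lo") next    # loopback normally has a large MTU; ignore it
        for (i = 1; i < NF; i++)
            if ($i == "mtu" && $(i + 1) + 0 > 1500)
                print name, $(i + 1)
    }'
}

# Usage on a live host (run on each cluster node):
#   ip -o link show | jumbo_interfaces
```

Reading the input from stdin keeps the parsing separate from the live system, so saved output from several nodes can be checked the same way.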

Resolution

After the MTU was set to 1500 on the network interfaces in order to resolve the NBU certificate issue, the remote cluster was no longer reported in the 'BUILD' state.
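
The article does not record the exact commands used to change the MTU. As an illustration only, the sketch below shows one common way to do it on RHEL 7 with the iproute2 `ip` command. The helper name mtu_command and the interface name eth0 are assumptions; substitute the interface that carries the GCO traffic.

```shell
# Build the runtime command that sets an interface MTU. Printing it first
# allows the change to be reviewed before it is applied.
mtu_command() {
    # $1 = interface name, $2 = MTU in bytes
    printf 'ip link set dev %s mtu %s\n' "$1" "$2"
}

# Review, then apply as root on each node:
#   mtu_command eth0 1500
#   eval "$(mtu_command eth0 1500)"
#   ip link show dev eth0        # confirm the new MTU
#
# To persist the value across reboots on RHEL 7 network-scripts, set
#   MTU=1500
# in /etc/sysconfig/network-scripts/ifcfg-eth0 (assuming that file manages eth0).
```

A change made with `ip link set` lasts only until the next reboot or network restart, which is why the persistent setting is noted separately.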

Configuring Jumbo Frames in a GCO environment is not a problem in itself. However, it is not enough to configure jumbo frames only on the source and destination hosts. Jumbo Frames must be configured throughout the network infrastructure between them (switches, routers, etc.). Typically, when two clusters reside at different sites or in different geographies, the intermediate components do not have jumbo frames configured. So, although jumbo frames may be configured on the hosts in both GCO clusters, packets larger than 1500 bytes are dropped somewhere in the network.

If jumbo frames are used, it is important to ensure that every hop on the path to the target supports the larger MTU. Otherwise, issues such as the remote cluster being reported in BUILD state, or Icmp/PocHB being reported in UNKNOWN state, may be observed because the larger packets are dropped in the network.

GCO IP will work fine without Jumbo Frames, but if Jumbo Frames are configured, verify that the entire network between the hosts in both GCO clusters is configured to support them.
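
One way to verify the path, sketched here under stated assumptions, is to send pings that cannot be fragmented and that exactly fill a packet of the MTU being tested. The function names are hypothetical, REMOTE_IP is a placeholder for the remote cluster address, and the options shown are those of the Linux iputils `ping`. The payload size assumes IPv4 without header options (20-byte IP header plus 8-byte ICMP header).

```shell
# ICMP echo payload that exactly fills one IPv4 packet of the given MTU:
# MTU minus 20 bytes of IPv4 header minus 8 bytes of ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Send non-fragmentable pings sized for the given MTU.
# -M do sets the Don't Fragment bit, so a hop with a smaller MTU makes the
# ping fail instead of silently fragmenting the packet.
probe_path_mtu() {
    # $1 = remote address, $2 = MTU to test
    ping -M do -c 3 -s "$(icmp_payload_for_mtu "$2")" "$1"
}

# Usage (REMOTE_IP is a placeholder):
#   probe_path_mtu "$REMOTE_IP" 9000 || echo "path does not carry 9000-byte packets"
#   probe_path_mtu "$REMOTE_IP" 1500
```

If the 1500-byte probe succeeds while the 9000-byte probe fails, that is consistent with an intermediate device that does not support jumbo frames, although a device that filters ICMP can produce the same result.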

Issue/Introduction

Remote cluster reported in 'BUILD' state in InfoScale 7.4.2/rhel7 GCO environment

On Prod site

# hastatus -sum

-- SYSTEM STATE
-- System               State                Frozen

A  prodsys1             RUNNING              0

-- GROUP STATE
-- Group           System               Probed     AutoDisabled    State

B  ClusterService  prodsys1             Y          N               ONLINE
B  nbu_group       prodsys1             Y          N               OFFLINE

-- WAN HEARTBEAT STATE
-- Heartbeat       To                   State

M  PocHB           dr11                 ALIVE

-- REMOTE CLUSTER STATE
-- Cluster         State

N  dr11            BUILD

On DR site

# hastatus -sum

-- SYSTEM STATE
-- System               State                Frozen

A  drsys1               RUNNING              0

-- GROUP STATE
-- Group           System               Probed     AutoDisabled    State

B  ClusterService  drsys1               Y          N               ONLINE
B  nbu_group       drsys1               Y          N               OFFLINE

-- WAN HEARTBEAT STATE
-- Heartbeat       To                   State

M  PocHB           prod11               ALIVE

-- REMOTE CLUSTER STATE
-- Cluster         State

N  prod11          BUILD

Commands such as ping and nc confirmed that the two clusters could communicate with each other.

Even after increasing the AYATimeout and AYAInterval for the heartbeat on both clusters:

# /opt/VRTSvcs/bin/hahb -modify PocHB AYATimeout 100

# /opt/VRTSvcs/bin/hahb -modify PocHB AYAInterval 120

and the wac connect timeout on both clusters:

# haclus -modify ConnectTimeout 30000 -clus <cluster>

the remote cluster was still showing in BUILD state.

Additional Information

JIRA: STESC-6588
