2021/12/13 10:52:08 VCS INFO V-16-3-18306 Initiating connection to cluster dr11 at #.#.#.#
2021/12/13 11:07:51 VCS NOTICE V-16-3-18322 Lost connection to cluster dr11 while in BUILD; resetting state to INIT
2021/12/13 11:07:57 VCS INFO V-16-3-18306 Initiating connection to cluster dr11 at #.#.#.#
2021/12/13 11:23:40 VCS NOTICE V-16-3-18322 Lost connection to cluster dr11 while in BUILD; resetting state to INIT
2021/12/13 11:23:45 VCS INFO V-16-3-18306 Initiating connection to cluster dr11 at #.#.#.#
2021/12/13 11:30:26 VCS ERROR V-16-3-18211 Cluster prod11 lost heartbeat PocHB to cluster dr11
#.#.#.# represents the corresponding IP address.
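In the excerpt above, each connection attempt is followed about 15 minutes 43 seconds later by a loss of connection while in BUILD. The following is a minimal sketch for pulling those events out of a log by message ID, assuming the log text is supplied on standard input. The function name build_resets is hypothetical, and the log file location is not stated in this article.

```shell
# Print the date, time and cluster name for every
# "Lost connection ... while in BUILD" notice (message ID V-16-3-18322)
# read from standard input.
build_resets() {
    awk '/V-16-3-18322/ && /while in BUILD/ {
        # The cluster name is the word that follows "cluster".
        for (i = 1; i < NF; i++)
            if ($i == "cluster") { print $1, $2, $(i + 1); break }
    }'
}
```

It could be run as build_resets < <logfile>, where <logfile> is a placeholder for the log being examined.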
Whilst troubleshooting an NBU certificate issue on these clusters, it was observed that Jumbo Frames were in use.
prodsys1 - ifconfig -a
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
eth0:0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
eth0:1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
drsys1 - ifconfig -a
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
eth0:0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
eth0:1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
After setting the MTU to 1500 on the network interfaces to resolve the NBU certificate issue, the 'BUILD' state for the remote cluster was also resolved.
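To find which interfaces on a host are using a jumbo MTU, the values can be read from sysfs. This is a minimal sketch, assuming a Linux host that exposes /sys/class/net. The function name jumbo_interfaces and its optional directory argument are hypothetical and are not part of InfoScale.

```shell
# List interfaces whose MTU exceeds the standard 1500 bytes.
# $1 is the sysfs network directory (defaults to /sys/class/net).
jumbo_interfaces() {
    netdir="${1:-/sys/class/net}"
    for f in "$netdir"/*/mtu; do
        [ -r "$f" ] || continue
        iface=$(basename "$(dirname "$f")")
        # The loopback interface normally has a very large MTU; skip it.
        if [ "$iface" = "lo" ]; then
            continue
        fi
        mtu=$(cat "$f")
        if [ "$mtu" -gt 1500 ]; then
            echo "$iface $mtu"
        fi
    done
}
```

This article does not record how the MTU was changed. One common way to change it at runtime on Linux is ip link set dev eth0 mtu 1500. On RHEL 7 the setting is usually made persistent with MTU=1500 in the interface's ifcfg file.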
Configuring Jumbo Frames in a GCO environment is not in itself a problem. However, it is not enough to configure jumbo frames only on the source and destination hosts: they must be configured throughout the network infrastructure (switches, routers, etc.). Typically, when two clusters reside in different sites or geographies, the intermediate components do not have jumbo frames configured. So, although jumbo frames may be configured on the hosts in both GCO clusters, packets larger than 1500 bytes are dropped somewhere in the network.
If using jumbo frames, ensure that every hop on the path to the target supports the larger frame size. Otherwise, larger packets are dropped in the network, and symptoms such as the remote cluster being reported in BUILD state, or the Icmp/PocHB heartbeat being reported in UNKNOWN state, may be observed.
The GCO IP works fine without Jumbo Frames. If Jumbo Frames are configured, verify that the entire network between the hosts in both GCO clusters supports them.
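One way to check whether the whole path carries jumbo-sized packets is to ping the remote host with the Don't Fragment bit set and a payload that fills the MTU. This is a minimal sketch, assuming the Linux iputils ping (options -M do, -c and -s) and IPv4. The function names icmp_payload_for_mtu and probe_path_mtu are hypothetical.

```shell
# ICMP payload size that exactly fills an IPv4 packet of the given MTU:
# MTU minus 20 bytes of IPv4 header minus 8 bytes of ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Send three pings with the Don't Fragment bit set (iputils "ping -M do").
# Replies come back only if every hop on the path forwards packets of
# the given size without fragmenting them.
# $1 = remote host or IP address, $2 = MTU to test (default 9000).
probe_path_mtu() {
    remote="$1"
    mtu="${2:-9000}"
    ping -M do -c 3 -s "$(icmp_payload_for_mtu "$mtu")" "$remote"
}
```

For example, probe_path_mtu <remote-ip> 9000 sends 8972-byte payloads. If that fails while probe_path_mtu <remote-ip> 1500 succeeds, some component between the sites is not passing jumbo frames. A plain ping sends small packets by default, so it can succeed even when jumbo-sized packets are being dropped.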
Remote cluster reported in 'BUILD' state in an InfoScale 7.4.2/RHEL 7 GCO environment
On Prod site
# hastatus -sum

-- SYSTEM STATE
-- System               State                Frozen
A  prodsys1             RUNNING              0

-- GROUP STATE
-- Group           System               Probed     AutoDisabled    State
B  ClusterService  prodsys1             Y          N               ONLINE
B  nbu_group       prodsys1             Y          N               OFFLINE

-- WAN HEARTBEAT STATE
-- Heartbeat       To                   State
M  PocHB           dr11                 ALIVE

-- REMOTE CLUSTER STATE
-- Cluster         State
N  dr11            BUILD
On DR site

-- SYSTEM STATE
-- System               State                Frozen
A  drsys1               RUNNING              0

-- GROUP STATE
-- Group           System               Probed     AutoDisabled    State
B  ClusterService  drsys1               Y          N               ONLINE
B  nbu_group       drsys1               Y          N               OFFLINE

-- WAN HEARTBEAT STATE
-- Heartbeat       To                   State
M  PocHB           prod11               ALIVE

-- REMOTE CLUSTER STATE
-- Cluster         State
N  prod11          BUILD
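When checking several clusters, the remote cluster state can be extracted from the summary output instead of being read by eye. This is a minimal sketch, assuming hastatus -sum prints each remote cluster on its own line beginning with the letter N, as in the output in this article. The function name remote_cluster_states is hypothetical.

```shell
# Read "hastatus -sum" output on standard input and print
# "<cluster> <state>" for each REMOTE CLUSTER STATE row (rows tagged N).
remote_cluster_states() {
    awk '$1 == "N" { print $2, $3 }'
}
```

Run as hastatus -sum | remote_cluster_states. On the Prod site in this case it would report dr11 BUILD.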
Commands such as ping and nc confirmed that the two clusters could communicate with each other.
Even after increasing the AYATimeout and AYAInterval for the heartbeat on both clusters:
# /opt/VRTSvcs/bin/hahb -modify PocHB AYATimeout 100
# /opt/VRTSvcs/bin/hahb -modify PocHB AYAInterval 120
and the wac connect timeout on both clusters:
# haclus -modify ConnectTimeout 30000 -clus <cluster>
the remote cluster was still showing in BUILD state.
JIRA: STESC-6588