For a Veritas Cluster of multiple nodes, one or more nodes get hung at REMOTE_BUILD.

Article ID: 100002829

Description

Error Message

VCS engine log contains:

Had[229]: [ID 702911 daemon.alert] ASSERTION FAILED: file Gab.C, line 1440, expression (_cur_rtype != 0)

Had[229]: [ID 702911 daemon.alert] VCS WARNING V-16-1-53032 HAD Signal SIGSEGV received

genunix: [ID 603404 kern.notice] NOTICE: core_log: had[229] core dumped: /var/core/core_node02_had_0_0_1282301984_229

gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port h closed

Running pstack on the had core file shows the following:

9268: /opt/VRTSvcs/bin/had -restart
----------------- lwp# 1 / thread# 1 --------------------
 ff1465e8 waitid (0, 2438, ffbf86b0, 3)
 ff138ec8 waitpid (2438, ffbf8804, 0, 0, ffbf885c, fedb0240) + 60
 ff12c188 system (ffbf9164, ff177adc, 20000, 1, ff16e2ec, ffbf885c) + 2ec
 00294298 __1cMVCSDumpStack6F_v_ (37135b, 0, 0, 0, 0, 0) + 120
 002947b8 VCSSegvHandler (b, 0, ffbf9730, 1, 0, 0) + 50
 ff1451fc __sighndlr (b, 0, ffbf9730, 294768, 0, 1) + c
 ff139cfc call_user_handler (b, 0, 4, 0, ff1e2a00, ffbf9730) + 3b8
 0028fbf4 __1cJVCSAssert6Fpc0I_v_ (361a2e, 361a4a, 1d7, 3b9aca00, 6e347500, 1) + dc
 0025541c __1cHget_hdr6FpnFVList_ppnFVElem__pnGMsgHdr__ (48d2a8, 0, 0, 0, c8, 2b1b) + e4
 00086d9c __1cEMAIN6FLppc_v_ (2b1b, 8000, 1, 297b, 2b2b, 0) + 5b8c
 00088804 main (2, ffbffaac, ffbffab8, 3a8000, ff1e0200, 0) + 4c
 0006cee0 _start (0, 0, 0, 0, 0, 0) + 108

Cause

The assertion failures are caused by corruption of cluster messaging data, introduced by one or more faulty Network Interface Cards (NICs) used for the LLT heartbeat links.

Resolution

1) Enable LLT checksums on all nodes to check whether packets are being corrupted on the network, using the following command:
# lltconfig -K 10

2) Regularly monitor the "lltstat" output and record the "Rcv bad checksum" count. A count that keeps rising between samples indicates that packets are still being corrupted.
For example:
# lltstat | grep "Rcv bad checksum"
482 Rcv bad checksum

# lltstat | grep "Rcv bad checksum"
497 Rcv bad checksum

3) After collecting enough samples of the "Rcv bad checksum" count, identify the faulty NIC using the following command. In this example, link ce1 reports a non-zero badcksum count:
# lltstat -l
LLT link information:
link 0 ce1 on etherfp hipri
      mtu 1500, sap 0xcafe, broadcast FF:FF:FF:FF:FF:FF, addrlen 6
      txpkts 283176 txbytes 16992316
      rxpkts 84 rxbytes 17048223
      latehb 0 badcksum 963 errors 0
[...]

4) On the problem node, replace the faulty NIC identified by the command above.

5) After replacing the faulty NIC, monitor the "lltstat" output again to confirm that no further "Rcv bad checksum" errors are reported.
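
The monitoring in steps 2, 3 and 5 can be scripted. The following is a minimal sketch rather than a supported tool: the function names, the three-sample loop and the 60-second interval are illustrative assumptions, and the parsing assumes that "lltstat" and "lltstat -l" print their counters in the layout shown in the examples above.

```shell
#!/bin/sh
# Sketch only: helper names and the sampling interval are hypothetical,
# and the parsing assumes the lltstat output layout shown in this article.

# Read `lltstat` output on stdin and print the "Rcv bad checksum" counter
# (prints 0 if the line is not present).
bad_checksum_count() {
    awk '/Rcv bad checksum/ { print $1; found = 1; exit }
         END { if (!found) print 0 }'
}

# Read `lltstat -l` output on stdin and print "<link name> <badcksum>"
# for every link whose badcksum counter is non-zero.
links_with_bad_checksums() {
    awk '
        $1 == "link" { name = $3 }
        {
            for (i = 1; i < NF; i++)
                if ($i == "badcksum" && $(i + 1) > 0)
                    print name, $(i + 1)
        }
    '
}

# Take a few samples and report the change between them.
# This part only runs on a node where lltstat is installed.
if command -v lltstat >/dev/null 2>&1; then
    prev=$(lltstat | bad_checksum_count)
    for n in 1 2 3; do
        sleep 60
        cur=$(lltstat | bad_checksum_count)
        echo "Rcv bad checksum: $cur (change since last sample: $((cur - prev)))"
        prev=$cur
    done
    # List the links that are reporting checksum errors.
    lltstat -l | links_with_bad_checksums
fi
```

Run the script on each node. A count that changes between samples, together with a link listed by the second helper, points at the NIC to replace in step 4.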

Issue/Introduction

For a Veritas Cluster of multiple nodes, one or more nodes get hung at REMOTE_BUILD. This prevents the affected node(s) from joining the cluster and causes HAD to generate core dumps.
