For a Veritas Cluster of multiple nodes, one or more nodes get hung at REMOTE_BUILD.

Article ID: 100002829

Description

Error Message

The VCS engine log contains:

Had[229]: [ID 702911 daemon.alert] ASSERTION FAILED: file Gab.C, line 1440, expression (_cur_rtype != 0) 


Had[229]: [ID 702911 daemon.alert] VCS WARNING V-16-1-53032 HAD Signal SIGSEGV received 


genunix: [ID 603404 kern.notice] NOTICE: core_log: had[229] core dumped: /var/core/core_node02_had_0_0_1282301984_229 


gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port h closed 
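
As a rough aid for spotting this signature, the sketch below filters log text for the three messages shown above. The function name had_crash_signatures is hypothetical, it assumes a POSIX grep that accepts -F with multiple -e patterns, and the log file to feed it is left to the reader.

```shell
#!/bin/sh
# Hypothetical helper, not part of VCS: reads log text on stdin and prints
# only the lines that contain one of the messages quoted in this article.
had_crash_signatures() {
    # -F treats each pattern as a fixed string, so the dot in "Gab.C"
    # is matched literally.
    grep -F \
        -e 'ASSERTION FAILED: file Gab.C' \
        -e 'HAD Signal SIGSEGV received' \
        -e 'GAB INFO V-15-1-20032 Port h closed'
}

# Example use (LOGFILE is a placeholder for whichever log is being examined):
#   had_crash_signatures < "$LOGFILE"
```

The function prints nothing and returns a non-zero status when none of the messages are present.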

 

Running pstack on the had core file shows the following stack:

9268: /opt/VRTSvcs/bin/had -restart 
-----------------  lwp# 1 / thread# 1  -------------------- 
ff1465e8 waitid   (0, 2438, ffbf86b0, 3) 
ff138ec8 waitpid  (2438, ffbf8804, 0, 0, ffbf885c, fedb0240) + 60 
ff12c188 system   (ffbf9164, ff177adc, 20000, 1, ff16e2ec, ffbf885c) + 2ec 
00294298 __1cMVCSDumpStack6F_v_ (37135b, 0, 0, 0, 0, 0) + 120 
002947b8 VCSSegvHandler (b, 0, ffbf9730, 1, 0, 0) + 50 
ff1451fc __sighndlr (b, 0, ffbf9730, 294768, 0, 1) + c 
ff139cfc call_user_handler (b, 0, 4, 0, ff1e2a00, ffbf9730) + 3b8 
0028fbf4 __1cJVCSAssert6Fpc0I_v_ (361a2e, 361a4a, 1d7, 3b9aca00, 6e347500, 1) + dc 
0025541c __1cHget_hdr6FpnFVList_ppnFVElem__pnGMsgHdr__ (48d2a8, 0, 0, 0, c8, 2b1b) + e4 
00086d9c __1cEMAIN6FLppc_v_ (2b1b, 8000, 1, 297b, 2b2b, 0) + 5b8c 
00088804 main     (2, ffbffaac, ffbffab8, 3a8000, ff1e0200, 0) + 4c 
0006cee0 _start   (0, 0, 0, 0, 0, 0) + 108 

Cause

The assertion failures are caused by corruption of cluster messages, introduced by one or more faulty Network Interface Cards (NICs) used for the LLT heartbeat links.

 

Resolution

1) Enable LLT checksums on all nodes to check whether packets are being corrupted on the network, using the following command:
# lltconfig -K 10 

 

2) Run "lltstat" at regular intervals and record the "Rcv bad checksum" count. A count that keeps increasing between samples indicates that packets are being corrupted.
For example:
# lltstat | grep "Rcv bad checksum" 
   482        Rcv bad checksum

# lltstat | grep "Rcv bad checksum" 
   497        Rcv bad checksum


3) After collecting enough samples of the "Rcv bad checksum" count, identify the faulty NIC from the per-link "badcksum" counter in the output of the following command:
# lltstat -l  
LLT link information: 
link 0  ce1 on etherfp hipri 
      mtu 1500, sap 0xcafe, broadcast FF:FF:FF:FF:FF:FF, addrlen 6 
      txpkts 283176  txbytes 16992316 
      rxpkts 84  rxbytes 17048223 
      latehb 0  badcksum 963  errors 0 
[...]


4) On the affected node, replace the faulty NIC identified in the previous step.

5) After replacing the faulty NIC, continue to monitor the "lltstat" output to confirm that the "Rcv bad checksum" count no longer increases.
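
The monitoring in steps 2, 3 and 5 can be scripted. The following is a minimal sketch, assuming a POSIX shell and awk and assuming the "lltstat" and "lltstat -l" output has the same layout as the samples in this article. The function names bad_cksum_count, cksum_delta and links_with_bad_cksum are hypothetical helpers, not LLT commands.

```shell
#!/bin/sh
# Hypothetical helpers; they only parse text in the layout shown in this article.

# Read `lltstat` output on stdin and print the "Rcv bad checksum" count
# (prints 0 if the line is not present).
bad_cksum_count() {
    awk '/Rcv bad checksum/ { print $1; found = 1; exit }
         END { if (!found) print 0 }'
}

# cksum_delta OLD NEW: print how much the count grew between two samples.
cksum_delta() {
    echo $(( $2 - $1 ))
}

# Read `lltstat -l` output on stdin and print "<device> <badcksum>" for
# every link whose badcksum counter is greater than zero.
links_with_bad_cksum() {
    awk '
        # Remember the device name from a line such as: link 0  ce1 on etherfp hipri
        $1 == "link" { dev = $3 }
        {
            # Look for the field "badcksum" and test the value that follows it.
            for (i = 1; i < NF; i++)
                if ($i == "badcksum" && $(i + 1) + 0 > 0)
                    print dev, $(i + 1)
        }
    '
}

# Example use on a cluster node (sampling interval is up to the administrator):
#   before=$(lltstat | bad_cksum_count)
#   sleep 60
#   after=$(lltstat | bad_cksum_count)
#   cksum_delta "$before" "$after"
#   lltstat -l | links_with_bad_cksum
```

With the sample values in this article, the delta between 482 and 497 is 15, and the "lltstat -l" excerpt reports ce1 with a badcksum of 963. After the NIC is replaced, a delta that stays at 0 across samples is the expected result.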

 

 

Issue/Introduction

For a Veritas Cluster of multiple nodes, one or more nodes get hung at REMOTE_BUILD. This prevents the node(s) from joining the cluster and causes HAD to generate core dumps.