VCS engine log contains:
Running pstack on the _had core file shows the following:
9268: /opt/VRTSvcs/bin/had -restart
----------------- lwp# 1 / thread# 1 --------------------
ff1465e8 waitid (0, 2438, ffbf86b0, 3)
ff138ec8 waitpid (2438, ffbf8804, 0, 0, ffbf885c, fedb0240) + 60
ff12c188 system (ffbf9164, ff177adc, 20000, 1, ff16e2ec, ffbf885c) + 2ec
00294298 __1cMVCSDumpStack6F_v_ (37135b, 0, 0, 0, 0, 0) + 120
002947b8 VCSSegvHandler (b, 0, ffbf9730, 1, 0, 0) + 50
ff1451fc __sighndlr (b, 0, ffbf9730, 294768, 0, 1) + c
ff139cfc call_user_handler (b, 0, 4, 0, ff1e2a00, ffbf9730) + 3b8
0028fbf4 __1cJVCSAssert6Fpc0I_v_ (361a2e, 361a4a, 1d7, 3b9aca00, 6e347500, 1) + dc
0025541c __1cHget_hdr6FpnFVList_ppnFVElem__pnGMsgHdr__ (48d2a8, 0, 0, 0, c8, 2b1b) + e4
00086d9c __1cEMAIN6FLppc_v_ (2b1b, 8000, 1, 297b, 2b2b, 0) + 5b8c
00088804 main (2, ffbffaac, ffbffab8, 3a8000, ff1e0200, 0) + 4c
0006cee0 _start (0, 0, 0, 0, 0, 0) + 108
The assertions are caused by data corruption in messaging, which is in turn caused by one or more faulty Network Interface Cards (NICs) used for the LLT heartbeat links.
1) Enable LLT checksums on all nodes to check whether packets are being corrupted in the network, using the following command:
# lltconfig -K 10
2) Regularly monitor the "lltstat" output and record the "Rcv bad checksum" count. A count that keeps rising between samples indicates that corrupted packets are still being received.
For example:
# lltstat | grep "Rcv bad checksum"
482 Rcv bad checksum
# lltstat | grep "Rcv bad checksum"
497 Rcv bad checksum
3) Once enough samples show the "Rcv bad checksum" count increasing, identify the faulty NIC using the following command:
# lltstat -l
LLT link information:
link 0 ce1 on etherfp hipri
mtu 1500, sap 0xcafe, broadcast FF:FF:FF:FF:FF:FF, addrlen 6
txpkts 283176 txbytes 16992316
rxpkts 84 rxbytes 17048223
latehb 0 badcksum 963 errors 0
[...]
4) On the problem node, replace the faulty NIC identified by the above command, that is, the link reporting a non-zero "badcksum" count (ce1 in the example output).
5) After replacing the faulty NIC, continue to monitor the "lltstat" output regularly to confirm that the "Rcv bad checksum" count no longer increases.
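
The checks in steps 2, 3 and 5 can be scripted. The following is a minimal sketch, not a supported tool: the function names bad_cksum_links and rcv_bad_checksum are hypothetical, and the parsing assumes that the "lltstat" and "lltstat -l" output has the same layout as the samples shown above. Both functions read text on standard input, so they can also be tried against saved output.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, and the parsing assumes the
# output layout shown in the samples in this article.

# Print "<device> <count>" for every link whose badcksum counter is non-zero.
# Expects "lltstat -l" style text on stdin.
bad_cksum_links() {
    awk '
        # A line such as "link 0 ce1 on etherfp hipri": remember the device.
        $1 == "link" { dev = $3 }
        # A line such as "latehb 0 badcksum 963 errors 0": read the counter.
        {
            for (i = 1; i < NF; i++) {
                if ($i == "badcksum" && $(i + 1) + 0 > 0) {
                    print dev, $(i + 1)
                }
            }
        }
    '
}

# Print the numeric counter from a line such as "482 Rcv bad checksum".
# Expects "lltstat" style text on stdin.
rcv_bad_checksum() {
    awk '/Rcv bad checksum/ { print $1; exit }'
}
```

For example, "lltstat -l | bad_cksum_links" lists only the suspect links, and running "lltstat | rcv_bad_checksum" at intervals gives plain numbers that can be compared to see whether the count is still rising.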