HAD process is reported dead and produces a core file

Description

Error Message

# Start of the heartbeats delayed log entries #

May 30 21:33:40 s1g5 gab: [ID 272231 kern.notice] GAB WARNING V-15-1-20057 Port h process 2953 inactive 13 sec

May 30 21:33:41 s1g5 gab: [ID 272231 kern.notice] GAB WARNING V-15-1-20057 Port h process 2953 inactive 14 sec

# Start of GAB killing HAD #

May 30 21:33:42 s1g5 gab: [ID 191522 kern.notice] GAB WARNING V-15-1-20058 Port h process 2953: heartbeat failed, killing process

# Information on delays and system usage at the time HAD is killed (note the "free memory" values) #

May 30 21:33:42 s1g5 gab: [ID 975177 kern.notice] GAB INFO V-15-1-20059 Port h heartbeat interval 15000 msec. Statistics:

May 30 21:33:42 s1g5 gab: [ID 217350 kern.notice] GAB INFO V-15-1-20129 Port h: heartbeats in 0 ~ 3000 msec: 291385408

May 30 21:33:42 s1g5 gab: [ID 217350 kern.notice] GAB INFO V-15-1-20129 Port h: heartbeats in 3000 ~ 6000 msec: 29

May 30 21:33:42 s1g5 gab: [ID 217350 kern.notice] GAB INFO V-15-1-20129 Port h: heartbeats in 6000 ~ 9000 msec: 0

May 30 21:33:42 s1g5 gab: [ID 217350 kern.notice] GAB INFO V-15-1-20129 Port h: heartbeats in 9000 ~ 12000 msec: 0

May 30 21:33:42 s1g5 gab: [ID 217350 kern.notice] GAB INFO V-15-1-20129 Port h: heartbeats in 12000 ~ 15000 msec: 0

May 30 21:33:42 s1g5 gab: [ID 767912 kern.notice] GAB INFO V-15-1-20088 System information:

May 30 21:33:42 s1g5 gab: [ID 746037 kern.notice] GAB INFO V-15-1-20089 number of cpu: 32

May 30 21:33:42 s1g5 gab: [ID 343786 kern.notice] GAB INFO V-15-1-20090 physical memory: 33435008 K

May 30 21:33:42 s1g5 gab: [ID 974088 kern.notice] GAB INFO V-15-1-20091 free memory: 256992 K

May 30 21:33:42 s1g5 gab: [ID 663063 kern.notice] GAB INFO V-15-1-20092 average free memory in 5 sec: 256344 K

May 30 21:33:42 s1g5 gab: [ID 665429 kern.notice] GAB INFO V-15-1-20093 average free memory in 30 sec: 261208 K

May 30 21:33:42 s1g5 gab: [ID 259915 kern.notice] GAB INFO V-15-1-20094 number of processes: 231

May 30 21:33:42 s1g5 gab: [ID 631272 kern.notice] GAB INFO V-15-1-20095 load average in 1 min: 9.37

May 30 21:33:42 s1g5 gab: [ID 587815 kern.notice] GAB INFO V-15-1-20096 load average in 5 min: 7.92

May 30 21:33:42 s1g5 gab: [ID 980060 kern.notice] GAB INFO V-15-1-20097 load average in 15 min: 6.19

May 30 21:33:42 s1g5 gab: [ID 559196 kern.notice] GAB INFO V-15-1-20098 pagein rate: 14

May 30 21:33:42 s1g5 gab: [ID 582491 kern.notice] GAB INFO V-15-1-20099 pageout rate: 14

# Final steps in killing HAD and generating core #

May 30 21:33:42 s1g5 gab: [ID 940236 kern.notice] GAB INFO V-15-1-20041 Port h: client process failure: killing process

May 30 21:33:46 s1g5 Had[2953]: [ID 702911 daemon.alert] VCS WARNING V-16-1-53034 HAD Signal SIGABRT received

May 30 21:33:46 s1g5 Had[2953]: [ID 702911 daemon.alert] VCS WARNING V-16-1-51047 HAD Self Check: Excessive delay in the HAD heartbeat to GAB (10 seconds)

May 30 21:33:52 s1g5 Had[2953]: [ID 702911 daemon.alert] VCS NOTICE V-16-1-53038 Beginning execution of the diagnostics script

May 30 21:33:56 s1g5 Had[2953]: [ID 702911 daemon.alert] VCS NOTICE V-16-1-53039 Completed execution of the diagnostics script

May 30 21:33:57 s1g5 gab: [ID 424555 kern.notice] GAB WARNING V-15-1-20035 Port h attempting to kill process due to client process failure

May 30 21:33:58 s1g5 genunix: [ID 603404 kern.notice] NOTICE: core_log: had[2953] core dumped: /var/core/core_s1gcs5_had_0_0_1243715636_2953

# HAD is down #

May 30 21:33:58 s1g5 gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port h closed
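The memory figures in the system information lines show how little free memory the node had when HAD was killed. As a minimal sketch (not a Veritas tool), the Python below pulls the physical and free memory values out of the GAB INFO lines and works out the free percentage. The function names and the regular expression are illustrative assumptions; the message IDs and values are copied from the excerpt above.

```python
import re

# Two lines copied from the log excerpt (spacing normalized).
SAMPLE = """\
May 30 21:33:42 s1g5 gab: [ID 343786 kern.notice] GAB INFO V-15-1-20090 physical memory: 33435008 K
May 30 21:33:42 s1g5 gab: [ID 974088 kern.notice] GAB INFO V-15-1-20091 free memory: 256992 K
"""

# Message IDs as they appear in the excerpt.
PHYSICAL_ID = "V-15-1-20090"
FREE_ID = "V-15-1-20091"

# Assumed line shape: "GAB INFO <message id> <label>: <number> K"
LINE_RE = re.compile(r"GAB INFO (V-15-1-\d+) .*?:\s*(\d+) K\s*$")


def memory_from_log(text):
    """Return (physical_kb, free_kb); a value is None if its line is absent."""
    values = {}
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            values[match.group(1)] = int(match.group(2))
    return values.get(PHYSICAL_ID), values.get(FREE_ID)


def free_percent(physical_kb, free_kb):
    """Free memory as a percentage of physical memory."""
    return 100.0 * free_kb / physical_kb


if __name__ == "__main__":
    physical, free = memory_from_log(SAMPLE)
    print(f"free memory: {free_percent(physical, free):.2f}% of physical")
```

For the values in this excerpt, free memory works out to under 1% of physical memory, which is consistent with a node under memory pressure.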

Cause

GAB kills the HAD process if its heartbeats are delayed for too long.

Communication between the nodes in the cluster is handled by two kernel modules, LLT and GAB.

Veritas Cluster Server (VCS) has a user-land process called HAD, which is the VCS engine. HAD needs to communicate with the other cluster nodes (via GAB) to report its current status, so GAB and HAD communicate once every 0.5 seconds.

GAB always tries to communicate with HAD, but HAD, being a user-land process, is subject to normal OS scheduling. As a user-land process, it can also have its memory overwritten by a rogue program. GAB therefore tries to "talk" to HAD, and if it cannot do so for 15 seconds, GAB sends a kill signal to HAD.

A node that cannot heartbeat is not viable in the cluster and must be removed.
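The timeout behavior described above can be pictured with a toy model. This is only an illustrative sketch, not GAB's actual implementation: the class name, its methods, and the use of explicit timestamps are all assumptions, and only the 15-second timeout comes from the text and the log.

```python
class HeartbeatWatchdog:
    """Toy model of a heartbeat timeout (hypothetical, not GAB code)."""

    def __init__(self, timeout_sec=15.0):
        # 15 seconds matches the "heartbeat interval 15000 msec" log line.
        self.timeout_sec = timeout_sec
        self.last_beat = 0.0

    def beat(self, now):
        """Record that the monitored process checked in at time `now`."""
        self.last_beat = now

    def inactive_for(self, now):
        """Seconds elapsed since the last recorded heartbeat."""
        return now - self.last_beat

    def check(self, now):
        """Return 'kill' once the silence reaches the timeout, else 'ok'."""
        return "kill" if self.inactive_for(now) >= self.timeout_sec else "ok"


if __name__ == "__main__":
    wd = HeartbeatWatchdog()
    wd.beat(100.0)                      # last heartbeat seen at t = 100 s
    for t in (113.0, 114.0, 115.0):     # silence of 13, 14, 15 seconds
        print(t, wd.inactive_for(t), wd.check(t))
```

Timestamps are passed in explicitly so the behavior is easy to reason about: with a last heartbeat at t = 100 s, the model still reports "ok" at 13 and 14 seconds of silence and "kill" at 15, mirroring the sequence of warnings in the log.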

Resolution

The best solution is to increase system resources and to set up alerts so that system load can be addressed before GAB sends a kill signal to HAD. Various OS and third-party tools are available for this kind of monitoring; contact the relevant OS or third-party vendor for guidance based on your requirements.
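As a rough illustration of the kind of alerting meant here, the sketch below flags high load per CPU and low free memory. It does not replace a proper monitoring tool. The function name and both thresholds (0.7 load per CPU, 5% free memory) are arbitrary assumptions chosen for the example; the input values are taken from the log in this article.

```python
def resource_alerts(load_1min, cpu_count, free_kb, physical_kb,
                    max_load_per_cpu=0.7, min_free_pct=5.0):
    """Return a list of alert strings; thresholds are illustrative assumptions."""
    alerts = []

    load_per_cpu = load_1min / cpu_count
    if load_per_cpu > max_load_per_cpu:
        alerts.append(f"high load: {load_per_cpu:.2f} per CPU")

    free_pct = 100.0 * free_kb / physical_kb
    if free_pct < min_free_pct:
        alerts.append(f"low free memory: {free_pct:.2f}% of physical")

    return alerts


if __name__ == "__main__":
    # Values reported by GAB at the time HAD was killed.
    for alert in resource_alerts(load_1min=9.37, cpu_count=32,
                                 free_kb=256992, physical_kb=33435008):
        print(alert)
```

With the figures from this incident, the load per CPU is well below the assumed threshold, while free memory falls far below it, so only the memory alert fires. In practice the inputs would come from the monitoring tool in use, and the thresholds would need tuning for the workload.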

Issue/Introduction

There are instances when the Veritas High Availability Daemon (HAD) process dies and produces a core file. This can happen when the Global Atomic Broadcast (GAB) module kills HAD.
