First Failure Data Capture (FFDC) in Veritas Cluster System (VCS) version 5.1 and onward

book

Article ID: 100006615

calendar_today

Updated On:

Description

Error Message

Sep  3 02:09:35 vcshost1 gab: [ID 191522 kern.notice] GAB WARNING V-15-1-20058 Port h process 2305: heartbeat failed, killing process
Sep  3 02:09:35 vcshost1 gab: [ID 975177 kern.notice] GAB INFO V-15-1-20059 Port h heartbeat interval 30000 msec. Statistics:
Sep  3 02:09:35 vcshost1 gab: [ID 217350 kern.notice] GAB INFO V-15-1-20129 Port h: heartbeats in 0 ~ 6000 msec: 332527539
Sep  3 02:09:35 vcshost1 gab: [ID 217350 kern.notice] GAB INFO V-15-1-20129 Port h: heartbeats in 6000 ~ 12000 msec: 2
Sep  3 02:09:35 vcshost1 gab: [ID 217350 kern.notice] GAB INFO V-15-1-20129 Port h: heartbeats in 12000 ~ 18000 msec: 0
Sep  3 02:09:35 vcshost1 gab: [ID 217350 kern.notice] GAB INFO V-15-1-20129 Port h: heartbeats in 18000 ~ 24000 msec: 0
Sep  3 02:09:35 vcshost1 gab: [ID 217350 kern.notice] GAB INFO V-15-1-20129 Port h: heartbeats in 24000 ~ 30000 msec: 0
Sep  3 02:09:35 vcshost1 gab: [ID 767912 kern.notice] GAB INFO V-15-1-20088 System information:
Sep  3 02:09:35 vcshost1 gab: [ID 746037 kern.notice] GAB INFO V-15-1-20089      number of cpu:                 128
Sep  3 02:09:35 vcshost1 gab: [ID 343786 kern.notice] GAB INFO V-15-1-20090      physical memory:               32718848K
Sep  3 02:09:35 vcshost1 gab: [ID 974088 kern.notice] GAB INFO V-15-1-20091      free memory:                   2300576 K
Sep  3 02:09:35 vcshost1 gab: [ID 663063 kern.notice] GAB INFO V-15-1-20092      average free memory in 5 sec:  2307056 K
Sep  3 02:09:35 vcshost1 gab: [ID 665429 kern.notice] GAB INFO V-15-1-20093      average free memory in 30 sec: 2301720 K
Sep  3 02:09:35 vcshost1 gab: [ID 259915 kern.notice] GAB INFO V-15-1-20094      number of processes:           143
Sep  3 02:09:35 vcshost1 gab: [ID 631272 kern.notice] GAB INFO V-15-1-20095      load average in 1 min:         29.98
Sep  3 02:09:35 vcshost1 gab: [ID 587815 kern.notice] GAB INFO V-15-1-20096      load average in 5 min:         28.92
Sep  3 02:09:35 vcshost1 gab: [ID 980060 kern.notice] GAB INFO V-15-1-20097      load average in 15 min:        32.77
Sep  3 02:09:35 vcshost1 gab: [ID 559196 kern.notice] GAB INFO V-15-1-20098      pagein rate:                   0
Sep  3 02:09:35 vcshost1 gab: [ID 582491 kern.notice] GAB INFO V-15-1-20099      pageout rate:                  0
Sep  3 02:09:35 vcshost1 gab: [ID 940236 kern.notice] GAB INFO V-15-1-20041 Port h: client process failure: killing process
Sep  3 02:09:35 vcshost1 Had[2305]: [ID 702911 daemon.alert] VCS WARNING V-16-1-53034 had Signal SIGABRT received
Sep  3 02:09:48 vcshost1 Had[2305]: [ID 702911 daemon.alert] VCS NOTICE V-16-1-53038 Beginning execution of the diagnostics script
Sep  3 02:09:50 vcshost1 gab: [ID 424555 kern.notice] GAB WARNING V-15-1-20035 Port h attempting to kill process due to client process failure
 

Cause

The FFDC log is a circular buffer within the _had process and is being continously updated by the _had process.    This buffered data will be dumped when the _had process is killed by GAB.  When the VCS engine process (_had) does not send heartbeats to GAB for more than the timout value sepecified in the environment variable VCS_GAB_TIMEOUT (default is 30 seconds, different value can be specified in the /opt/VRTSvcs/bin/vcsenv file), GAB will kill the _had process by sending a SIGABRT signal.  As part of the SIGABRT signal handler, the _had process will dump the FFDC log to an external file /var/VRTSvcs/diag/had/FFDC_VCS_"pid"_had.log.  

The FFDC log dump file will contain data like the followings.

Example 1:

#
# Log File Name:     ./FFDC_VCS_2305_had.log
# System:       vcshost3
# SysInfo:      Solaris:vcshost3,Generic_142900-02,5.10,sun4v
# Created:      2011-09-03 02:09:35                           <<< created 02:09:35
#
2011-09-03 02:09:22  [FFDC] [1] VCS DBG_IPM sent 82 bytes     <<< the earliest entry in the circular buffer is from 02:09:22
        Ipm.C:send[2657]
2011-09-03 02:09:22  [FFDC] [1] VCS DBG_IPM sent 83 bytes
        Ipm.C:send[2657]
2011-09-03 02:09:22  [FFDC] [1] VCS DBG_IPM sent 90 bytes
        Ipm.C:send[2657]
......

Example 2:

#
# Log File Name:     ./FFDC_VCS_2629_had.log
# System:       vcshost2
# SysInfo:      Solaris:vcshost2,Generic_141444-09,5.10,sun4u
# Created:      2011-01-11 23:17:23
#
2011-01-11 23:15:38  [FFDC] [1] VCS DBG_AGENT Agent: Inside process_agent
        Agent.C:process_agent[231]
2011-01-11 23:15:38  [FFDC] [1] VCS DBG_AGENT Agent name is WebLogic9
        Agent.C:process_agent[241]
2011-01-11 23:15:38  [FFDC] [1] VCS DBG_AGENT minor code is 5379
        Agent.C:process_agent[264]
2011-01-11 23:15:38  [FFDC] [1] VCS DBG_AGENT MSG_AGENT_ALIVE received
        Agent.C:process_agent[351]
2011-01-11 23:15:38  [FFDC] [1] VCS DBG_AGENT Entering alive method for agent WebLogic9
        Agent.C:alive[387]
.....

Because the whole FFDC circular buffer is dumped, the external FFDC log file will contain entries before the time the _had was killed.

 

 

Resolution

Depends on the nature of the _had process problem, the FFDC may help in the root cause analysis.


Applies To

Veritas Cluster Server version 5.1 and onward

Issue/Introduction

VCS  engine process (_had)  was killed by Group Automic Broadcase (GAB)  and First Failure Data Capture (FFDC)  log is dumped through the _had SIGABRT signal handler to faciliate the root cause analysis of the _had problem.

Additional Information

ETrack: 2577585