HAD does not start or gets restarted in Veritas Cluster Server versions 5.0.1 and above

Article ID: 100025192

Description

Error Message

From VCS engine_A.log: 

2011/07/08 01:01:53 VCS WARNING V-16-1-10485 Excessive delay between successive calls to GAB heartbeat (11 seconds)  
2011/07/08 01:03:51 VCS WARNING V-16-1-10485 Excessive delay between successive calls to GAB heartbeat (117 seconds)

From syslog.log:

Jul 11 15:30:24 node1 Had[4694]: VCS WARNING V-16-1-53034 HAD Signal SIGABRT received  
Jul 11 15:32:46 node1 Had[5481]: VCS WARNING V-16-1-53034 HAD Signal SIGABRT received  
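
To gauge how severe the heartbeat delays are on a given node, the V-16-1-10485 warnings can be scanned for the largest reported delay. The following is a minimal sketch, not a Veritas-supplied tool: the function name max_gab_delay is hypothetical, and the log path in the usage line assumes the default VCS log location.

```shell
# Hypothetical helper: print the largest GAB heartbeat delay (in seconds)
# reported by V-16-1-10485 warnings in the log file given as $1.
# Prints nothing if the file contains no such warnings.
max_gab_delay() {
    # Extract the number inside "(N seconds)" from each matching line,
    # then sort numerically and keep the largest value.
    sed -n 's/.*V-16-1-10485 .*(\([0-9][0-9]*\) seconds).*/\1/p' "$1" |
        sort -n | tail -1
}

# Usage (path is an assumption; adjust to where engine_A.log lives):
#   max_gab_delay /var/VRTSvcs/log/engine_A.log
```

Run against the two engine_A.log lines shown above, this prints 117.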

Cause

The stack traces from the _had core files point to a hang in the select() system call. A sample stack from a _had core is shown below:

(0)  0x000000000479fdc0  _Z12VCSDumpStackv + 0x3b0 at Platform.C:1830 [/opt/VRTSvcs/bin/had]  
(1)  0x00000000047a1000  VCSAbrtHandler + 0x60 at Platform.C:1990 [/opt/VRTSvcs/bin/had]  
(2)  0xe0000001205c7420  ---- Signal 6 (SIGABRT) delivered ----  
(3)  0x60000000c0948830  _select_sys + 0x30 [/usr/lib/hpux32/libc.so.1]  
(4)  0x60000000c095ed40  _select + 0xe0 at ../../../../../core/libs/libc/shared_em_32_perf/../core/syscalls/t_select.c:21 [/usr/lib/hpux32/libc.so.1]  
(5)  0x00000000046c4f30  _ZN9IpmHandle6eventsEP5DListPS1_S1_S2_i + 0xb30 at Ipm.C:502 [/opt/VRTSvcs/bin/had]  
(6)  0x00000000046d0100  _ZN9IpmHandle4sendEP5VListi + 0x1300 at Ipm.C:2230 [/opt/VRTSvcs/bin/had]  
(7)  0x000000000464e560  _ZN6System12process_dumpEPvP6MsgHdr + 0x920 at System.C:4871 [/opt/VRTSvcs/bin/had]  
(8)  0x00000000041e2330  _Z15process_messagePvP5VListi + 0xda0 at had.C:461 [/opt/VRTSvcs/bin/had]  
(9)  0x00000000041f5a50  _Z4MAINmPPc + 0x8d50 at had.C:3076 [/opt/VRTSvcs/bin/had]  
(10) 0x0000000004206270  main + 0x40 at had.C:3576 [/opt/VRTSvcs/bin/had]  
(11) 0x60000000c00427c0  main_opd_entry + 0x50 [/usr/lib/hpux32/dld.so]  

This points to an HP-UX operating system issue. Further analysis by HP attributed the problem to a regression caused by the OS patch PHKL_41700.

Resolution

To work around this problem, HP recommends setting the "hires_timeout_enable" kernel parameter to 1 before starting the cluster. Run the following command to set the parameter:

# kctune hires_timeout_enable=1 

Another possible solution is to install the following kernel patch:

PHKL_41967

Note that the above patch was the most current when this document was last edited (July 2011). Check with HP to see whether a newer release of the patch is available.

Applies To

This issue is specific to:

HP-UX 11.31

VCS or SFRAC 5.0.1 and subsequent patches

HP-UX kernel patch PHKL_41700 installed

Issue/Introduction

The VCS High Availability Daemon (had) is repeatedly killed by GAB.

Additional Information

ETrack: 1724831