GAB panicked a VCS system with the panic string "GAB: Port h halting system due to client process failure"; the HAD process was stuck in page_get_mnode_freelist().

Article ID: 100006924

Description

Error Message

Panic string:

GAB: Port h halting system due to client process failure

Panic stack:
==== panic thread: 0x2a10015fca0 ==== CPU: 22 ====
==== panic kernel thread: 0x2a10015fca0 PID: 0 on CPU: 22 affinity CPU: 22 ====
cmd: sched
[...]
unix:panicsys+0x48(0x2a10015f734, 0x2a10015f710, 0x18bab80, 0x1, , , 0x4480001606, , , , , , , , 0x2a10015f734, 0x2a10015f710)
unix:vpanic_common+0x78(0x2a10015f734, 0x2a10015f710, 0x7d5059d5, 0xa, 0xa, 0x1833000)
genunix:cmn_err+0x98(0x3, 0x2a10015f734, 0x4, 0x2a10015f76d, 0x2a10015f76e, 0x0)
gab:gab_halt+0xb0(0x36d70507, 0x36d70507, 0x600a28da180, 0x0, 0x0, 0x0)
gab:gab_kill_process+0xc8(0x10507)
gab:gab_timerscan+0x400(0x0)
genunix:callout_execute+0xb8(0x6009a044000, 0x6009a0131b8, 0x10b945c)
genunix:taskq_thread+0x1a4(0x6009a00d5f8, 0x0)
unix:thread_start+0x4()
-- end of kernel thread's stack --

Cause

GAB could not communicate with the VCS engine (had), so it attempted to kill had so that it could be restarted. Because had was stuck in the kernel and could not be killed, GAB panicked the system. GAB logged the following messages before the panic:

Sat Oct 22 07:05:15 2011| GAB INFO V-15-1-20041 Port h: client process failure: killing process
Sat Oct 22 07:05:30 2011| GAB WARNING V-15-1-20035 Port h attempting to kill process due to client process failure
Sat Oct 22 07:05:45 2011| GAB WARNING V-15-1-20035 Port h attempting to kill process due to client process failure
Sat Oct 22 07:06:00 2011| GAB WARNING V-15-1-20035 Port h attempting to kill process due to client process failure
Sat Oct 22 07:06:30 2011| GAB WARNING V-15-1-20138 Port h isolated due to client process failure
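
In this log, GAB reports the client process failure at 07:05:15, retries the kill at 15-second intervals, and isolates port h 75 seconds after the first message. The short Python sketch below only recomputes those intervals from the quoted timestamps, so the same escalation pattern can be checked in other logs. It does not reflect GAB's configured timer values, which this article does not document.

```python
from datetime import datetime

# The five GAB messages quoted in the log excerpt above, verbatim.
LOG = """\
Sat Oct 22 07:05:15 2011| GAB INFO V-15-1-20041 Port h: client process failure: killing process
Sat Oct 22 07:05:30 2011| GAB WARNING V-15-1-20035 Port h attempting to kill process due to client process failure
Sat Oct 22 07:05:45 2011| GAB WARNING V-15-1-20035 Port h attempting to kill process due to client process failure
Sat Oct 22 07:06:00 2011| GAB WARNING V-15-1-20035 Port h attempting to kill process due to client process failure
Sat Oct 22 07:06:30 2011| GAB WARNING V-15-1-20138 Port h isolated due to client process failure
"""

STAMP_FORMAT = "%a %b %d %H:%M:%S %Y"  # e.g. "Sat Oct 22 07:05:15 2011"


def parse_events(log: str) -> list[tuple[datetime, str]]:
    """Return (timestamp, message ID) pairs, one per log line."""
    events = []
    for line in log.strip().splitlines():
        stamp, message = line.split("| ", 1)
        msg_id = next(tok for tok in message.split() if tok.startswith("V-"))
        events.append((datetime.strptime(stamp, STAMP_FORMAT), msg_id))
    return events


events = parse_events(LOG)
# Seconds between consecutive messages, and from the first message to isolation.
gaps = [int((b[0] - a[0]).total_seconds()) for a, b in zip(events, events[1:])]
total = int((events[-1][0] - events[0][0]).total_seconds())
print(gaps, total)  # [15, 15, 15, 30] 75
```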

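The core dump shows that had was stuck in OS code, waiting for a mutex inside page_get_mnode_freelist(), which was called from the swap_getpage() page-allocation path: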

genunix:turnstile_block+0x600(0x600a2325e08, 0x0, 0x19cc178, 0x1832ce0, 0x0, 0x0)
unix:mutex_vector_enter+0x428()
unix:mutex_enter(0x19cc178) - frame recycled
unix:page_get_mnode_freelist+0x150(0xf, 0x1, 0x0, 0x0, 0xb, 0xf)
unix:page_get_freelist+0x430(0x3003ac65d00, 0x30065780000, 0x600df8aa288, 0xff1b4000, 0x2000, 0xb)
unix:page_create_va+0x32c(0x3003ac65d00, 0x30065780000, 0x2000, 0x3, 0x600df8aa288, 0xff1b4000)
genunix:swap_getapage+0x168(0x3003ac65d00, 0x30065780000, 0x2000, 0x0, 0x2a107ac3490, 0x2000, 0x600df8aa288, 0xff1b4000, 0x4, 0x3003cc65760)
genunix:swap_getpage+0x4c(0x3003ac65d00, 0x30065780000, 0x2000, 0x0, 0x2a107ac3490, , , 0xff1b4000, 0x4, 0x3003cc65760)


In addition to had, 75 other threads were stuck in page_get_mnode_freelist(), which suggests that had was a victim of an operating system (OS) defect rather than its cause.
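
A count like this comes from checking every kernel thread stack in the dump for the same function. The Python sketch below shows that check under loudly stated assumptions: the thread stacks have already been exported as text, each thread starts with a header line beginning with `====`, and frames are printed as `module:function+offset(...)`. The export format, the names `split_threads` and `count_threads_in`, and the sample stacks are all hypothetical; real crash-dump tools print thread lists in their own formats.

```python
import re

# Hypothetical export format: each thread starts with a header line beginning
# with "====", followed by one stack frame per line ("module:function+offset(...)").
THREAD_HEADER = "===="


def split_threads(dump: str) -> list[list[str]]:
    """Group the lines of a text stack dump into one list per thread."""
    threads: list[list[str]] = []
    for line in dump.splitlines():
        if line.startswith(THREAD_HEADER):
            threads.append([line])
        elif threads and line.strip():
            threads[-1].append(line.strip())
    return threads


def count_threads_in(dump: str, function: str) -> int:
    """Count threads with a frame for `function` anywhere in their stack."""
    # Match ":function+" or ":function(" so similar names are not counted.
    frame = re.compile(rf":{re.escape(function)}[+(]")
    return sum(
        1 for thread in split_threads(dump)
        if any(frame.search(line) for line in thread[1:])
    )


# Hypothetical sample with three threads; two are blocked in the page allocator.
SAMPLE = """\
==== kernel thread: 0x1 PID: 101 ====
cmd: had
genunix:turnstile_block+0x600()
unix:mutex_vector_enter+0x428()
unix:page_get_mnode_freelist+0x150()
genunix:swap_getpage+0x4c()
==== kernel thread: 0x2 PID: 102 ====
cmd: app
unix:mutex_vector_enter+0x428()
unix:page_get_mnode_freelist+0x150()
==== kernel thread: 0x3 PID: 0 ====
cmd: sched
genunix:taskq_thread+0x1a4()
"""

print(count_threads_in(SAMPLE, "page_get_mnode_freelist"))  # 2
```

Matching on the function name followed by `+` or `(` keeps a search for `mutex_enter` from also counting `mutex_vector_enter` frames.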

Further analysis by Oracle, the OS support vendor, confirmed that the issue is related to known Solaris bug 6778289.

Resolution

Oracle advised installing kernel patch 144500-19, which addresses known Solaris bug 6778289.
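
Solaris patch revisions are cumulative, so revision 19 or any later revision of patch 144500 should include this fix. The Python sketch below encodes only that comparison. It assumes you already have the installed kernel patch ID as a string such as `Generic_144500-19` (a hypothetical input), and the function names are illustrative. It returns `None` for a different base patch ID, because whether another kernel patch line also contains the fix is a question for Oracle support.

```python
from __future__ import annotations

import re

# Fix named in this article: revision 19 of Solaris kernel patch 144500.
REQUIRED_PATCH = "144500-19"

# Solaris patch IDs are a six-digit base ID plus a two-digit revision.
PATCH_ID = re.compile(r"(\d{6})-(\d{2})")


def parse_patch(text: str) -> tuple[int, int]:
    """Extract (base ID, revision) from text such as 'Generic_144500-19'."""
    match = PATCH_ID.search(text)
    if match is None:
        raise ValueError(f"no patch ID found in {text!r}")
    return int(match.group(1)), int(match.group(2))


def has_required_revision(installed: str, required: str = REQUIRED_PATCH) -> bool | None:
    """True/False when the base IDs match; None when they differ (ask the vendor)."""
    base, revision = parse_patch(installed)
    required_base, required_revision = parse_patch(required)
    if base != required_base:
        return None
    # Revisions are cumulative, so any later revision also contains the fix.
    return revision >= required_revision


print(has_required_revision("Generic_144500-19"))  # True
print(has_required_revision("Generic_144500-07"))  # False
```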

Customers are encouraged to contact Oracle support for further details on the OS defect.


Applies To

This issue is applicable to:

- Systems running Solaris operating system

- VCS (any version)

Issue/Introduction

A system running Veritas Cluster Server (VCS) was panicked by Group Membership Services/Atomic Broadcast (GAB) with the panic string "GAB: Port h halting system due to client process failure". The HAD process was stuck in page_get_mnode_freelist().