GAB: Port h halting system due to client process failure

Article ID: 100004052

Description

Error Message

 savecore: [ID 570001 auth.error] reboot after panic: GAB: Port h halting system due to client process failure

Cause

The issue was related to UFS filesystem logging tuning.

Resolution

Crash analysis showed that 'had' appeared to be stuck in kernel context waiting on UFS. We sent the analysis to the customer and recommended that they contact Sun/Oracle, share our analysis, and ask what could cause this behavior. Sun confirmed that the issue was on their side and provided UFS tuning.

Numerous hardware errors also appear in the syslog and the message buffer, and iostat reports many hard and transport errors.
In the crash dump, one thread of the devfsadmd command is in biowait, and 79 threads are waiting for a mutex.
We need to determine whether all 79 threads are waiting for the same mutex and, if so, which mutex it is.

SolarisCAT(vmcore.0/10U)> tlist biowait
thread: 0x300085c1840  state: slp
  PID:   396  cmd: devfsadmd
  idle: 5 ticks (0.05 seconds)
buf @ 0x300dd2ae740
  b_edev:   328(vxio),0 //platform/sun4u-us3/lib/libc_psr.so.1/platform/sun4u-us3/lib/sparcv9/libc_psr.so.1     b_blkno:   0x548dec
  b_addr:   0x0 b_bufsize: 0x400
  b_bcount: 1024
  b_vp:     0x6005886de00  v_op: *specfs(bss):spec_vnodeops
  b_flags:  0x80053 (BUSY|DONE|PAGEIO|READ|NOCACHE)


   1 thread in biowait() found.

threads in biowait() by device:
count   device (thread: max idle time)
    1   328(vxio),0 (0x300085c1840: 0.05 seconds) //platform/sun4u-us3/lib/libc_psr.so.1/platform/sun4u-us3/lib/sparcv9/libc_psr.so.1


Note that this thread has been idle for only 5 ticks (0.05 seconds), so its presence in biowait may be misleading rather than a sign of a hung I/O.

Threads of the VCS 'had' and 'hashadow' processes are also sleeping on synchronization objects (a mutex and condition variables).

SolarisCAT(vmcore.0/10U)> proc -l 11514
    addr       PID    PPID   RUID/UID     size      RSS     swresv   time  command
============= ====== ====== ========== ========== ======== ======== ====== =========
0x6008425b958  11514      1          0   19685376  8568832  8667136 1328853 /opt/VRTSvcs/bin/had
        thread: 0x300148b63c0  state: slp   wchan: 0x6008425ba1e  sobj: condition var (from genunix:exitlwps+0x11c)    <<<<
        thread: 0x300149029c0  state: slp   wchan: 0x60057957760  sobj: mutex

SolarisCAT(vmcore.0/10U)> proc -l 11577
    addr       PID    PPID   RUID/UID     size      RSS     swresv   time  command
============= ====== ====== ========== ========== ======== ======== ====== =========
0x600844dad40  11577      1          0    3670016     8192   540672      7 /opt/VRTSvcs/bin/hashadow
        thread: 0x300085b9180  state: slp   wchan: 0x6008532cb94  sobj: condition var (from genunix:wait_for_lock+0x34)    <<<<
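
To check whether many sleeping threads share one wait channel, the thread lines can be grouped by their wchan address. The following is a minimal sketch in Python, not part of SolarisCAT. It assumes the thread listing has been saved as text and that each thread line has the same shape as the "proc -l" output above. The function name group_by_wchan is hypothetical.

```python
import re
from collections import Counter

# Assumed line shape, taken from the "proc -l" listings quoted above:
#   thread: <addr>  state: <state>  wchan: <addr>  sobj: <type> [(from ...)]
THREAD_RE = re.compile(
    r"thread:\s+(0x[0-9a-f]+)\s+state:\s+(\S+)\s+"
    r"wchan:\s+(0x[0-9a-f]+)\s+sobj:\s+(.+?)(?:\s+\(from .*)?$"
)


def group_by_wchan(text):
    """Count threads per (synchronization object type, wchan address)."""
    counts = Counter()
    for line in text.splitlines():
        match = THREAD_RE.search(line)
        if match:
            _thread, _state, wchan, sobj = match.groups()
            counts[(sobj.strip(), wchan)] += 1
    return counts


# Sample input copied from the listings above (trailing markers removed).
SAMPLE = """\
        thread: 0x300148b63c0  state: slp   wchan: 0x6008425ba1e  sobj: condition var (from genunix:exitlwps+0x11c)
        thread: 0x300149029c0  state: slp   wchan: 0x60057957760  sobj: mutex
        thread: 0x300085b9180  state: slp   wchan: 0x6008532cb94  sobj: condition var (from genunix:wait_for_lock+0x34)
"""

if __name__ == "__main__":
    # Print the most heavily contended wait channels first.
    for (sobj, wchan), count in group_by_wchan(SAMPLE).most_common():
        print(f"{count:4d}  {sobj:<14}  {wchan}")
```

If all 79 mutex waiters report the same wchan, they are blocked on a single lock, and the next step is to identify that lock's owner in the crash dump.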

Issue/Introduction

A root cause analysis is needed for a node crash. The panic string shows that it was a GAB-initiated iofence.