GAB panic with GAB: Port h halting system due to client process failure following unsuccessful upgrade of VRTSvxfen

Article ID: 100003223

Description

Error Message

Panic String:
GAB: Port h halting system due to client process failure

Panic Stack:
pvthread+806300
[0003B238].panic_trap+000000 ()
[045E80E4]gab_halt+00007C (??)
[045E80E4]gab_halt+00007C (??)
[045E2664]gab_kill_process+0000B0 (??)
[045D9374]gab_timerscan+0003A4 (??)
[045D4930]gab_timeout_daemon+00008C (??)
[0016F460]procentry+000010 (??, ??, ??, ??)


In the installer log on the affected node, GAB does not show port 'b' as registered, but LLT still shows port 'b' (port ID 1) in the lltstat -p output.

01:30:10 2 gab is running on node2, must stop
01:30:10 exec /usr/bin/rsh node2 "LANG=C /sbin/gabconfig -U 2> /dev/null" 2>/var/tmp/installrp-aZyaBF/do_local2 1>&2
01:30:10 Unconfigured
01:30:10 exit=1123

[...]

01:30:38 2 Output of lltconfig on  node2 is LLT is running
01:30:38 2 LLT running on node2
01:30:38 2 llt is running on node2, must stop
01:30:38 2 Displaying lltstat -p output on node2
01:30:38 exec /usr/bin/rsh node2 "LANG=C /sbin/lltstat -p 2> /dev/null" 2>/var/tmp/installrp-aZyaBF/do_local2 1>&2
01:30:39 LLT port information:
   Port    Usage        Cookie
     0     gab          0x0
         opens:     0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
         connects:  1
     1     gab          0x1
         opens:     0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
         connects:  1
    31     gab          0x1F
         opens:     0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
         connects:  1
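
The check described above (is LLT port 1 still open after GAB was unconfigured?) can be automated. The following is a minimal sketch, not part of the product tooling: the function name parse_lltstat_ports and the abbreviated SAMPLE text are illustrative, and the parser assumes the column layout shown in the log above.

```python
import re

def parse_lltstat_ports(output):
    """Return {port_number: usage} parsed from `lltstat -p` style text.

    Assumes each port row looks like "<port> <usage> <hex cookie>", as in
    the installer log; the "opens:" and "connects:" rows are ignored.
    """
    ports = {}
    for line in output.splitlines():
        m = re.match(r"^\s*(\d+)\s+(\S+)\s+(0x[0-9A-Fa-f]+)\s*$", line)
        if m:
            ports[int(m.group(1))] = m.group(2)
    return ports

# Abbreviated copy of the output shown in the log (opens lists shortened).
SAMPLE = """LLT port information:
   Port    Usage        Cookie
     0     gab          0x0
         opens:     0 1 2 3
         connects:  1
     1     gab          0x1
         opens:     0 1 2 3
         connects:  1
    31     gab          0x1F
         opens:     0 1 2 3
         connects:  1
"""

ports = parse_lltstat_ports(SAMPLE)
# Port ID 1 corresponds to port 'b' according to the article.
fencing_port_still_open = 1 in ports
```

If fencing_port_still_open is true after GAB reports itself unconfigured, the node is in the inconsistent state described here.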

 

Cause

 

Analysis of the system snap dump showed that _had was blocked in the kernel inside vxfen code (vxfen_get_nodestats_gen2 -> vrfsm_print_states). That thread was waiting for a lock held by another vxfen thread, the rfsm receive thread, which had entered a near-infinite loop while traversing the list of coordinator disks. The loop was caused by incorrect values for the number of paths to each coordinator disk (npaths). The npaths value is computed in user space and passed to the kernel. The data corruption in the kernel driver showed a specific pattern, indicating that the user-space and kernel components had not both been upgraded properly.

The output of cksum /sbin/vxfenconfig on the problem node confirmed that the binary had not been upgraded to RP4, because it was in use when installp ran. The problem can be reproduced by installing the VRTSvxfen RP4 patch while a vxfenconfig process is running: the patch install (installp) command completes without errors, but /sbin/vxfenconfig remains at RP1.
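
Because installp can report success without replacing the file, comparing a checksum taken before the install with one taken afterwards exposes the silent failure. The sketch below illustrates the idea only: it uses SHA-256 from the Python standard library rather than the cksum CRC, and the helper names file_digest and was_replaced are hypothetical. A temporary file stands in for /sbin/vxfenconfig.

```python
import hashlib
import os
import tempfile

def file_digest(path):
    """Return the SHA-256 hex digest of the file at path."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def was_replaced(path, digest_before):
    """True if the file content differs from the pre-install digest."""
    return file_digest(path) != digest_before

# Demonstration on a temporary file (stand-in for the real binary).
fd, demo_path = tempfile.mkstemp()
os.close(fd)
with open(demo_path, "wb") as f:
    f.write(b"old build")
before = file_digest(demo_path)

# Case 1: the "install" left the file untouched (the failure in this article).
unchanged = was_replaced(demo_path, before)

# Case 2: the file content was really replaced.
with open(demo_path, "wb") as f:
    f.write(b"new build")
changed = was_replaced(demo_path, before)

os.remove(demo_path)
```

A patch procedure could record the digest before installation and treat an unchanged digest afterwards as a failed upgrade, regardless of the installer's exit status.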

The root cause is that the AIX installp command does not replace files that are in use during the upgrade, yet it still returns success. For the ROOT part of the files only, installp calls the inucp command to copy the files onto the filesystem. The USER part is handled differently: it uses the inurest command, which correctly unlinks each file before copying the new one. This is the standard installp behavior, and IBM has acknowledged it as a bug. IBM's current proposal is for installp to fail the installation when a new binary cannot overwrite one that is currently executing, rather than return success.
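
The reason unlinking before copying works can be shown with ordinary POSIX file semantics. This is a general illustration under that assumption, not a description of how inurest is implemented; the function name unlink_then_copy, the file names, and the "RP1"/"RP4" contents are placeholders.

```python
import os
import shutil
import tempfile

def unlink_then_copy(src, dst):
    """Remove dst's directory entry first, then create a fresh file from src.

    On POSIX systems, anything that already has dst open keeps using the old
    inode, while the path now refers to a brand-new file.
    """
    if os.path.exists(dst):
        os.unlink(dst)
    shutil.copyfile(src, dst)

workdir = tempfile.mkdtemp()
old_path = os.path.join(workdir, "tool")
new_path = os.path.join(workdir, "tool.new")
with open(old_path, "wb") as f:
    f.write(b"RP1")
with open(new_path, "wb") as f:
    f.write(b"RP4")

# An open handle stands in for a process that is still using the old file.
in_use = open(old_path, "rb")
unlink_then_copy(new_path, old_path)

old_view = in_use.read()      # contents seen through the handle opened earlier
in_use.close()
with open(old_path, "rb") as f:
    new_view = f.read()       # contents seen by anything opening the path now

shutil.rmtree(workdir)
```

The existing user of the file is undisturbed, and every later open of the path gets the new version, which is the outcome the in-place copy fails to deliver for a busy binary.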

 

Resolution

Stop the SFCFS cluster components as explained in the installation guide.
Install the patch by following the patch installation instructions.

IBM is working on a fix so that installp returns failure instead of success when a file is in use and cannot be copied. That fix would prevent the silent installp failure and the resulting system crash.

Veritas Engineering is enhancing the CPI install scripts to exit the patch install if the SFCFS components cannot be stopped. This enhancement request is tracked by e2169121 and is scheduled to be fixed in the 6.0 release.
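
Until such a safeguard is available, a similar pre-install check can be done by hand. The sketch below is not the CPI implementation: the function find_blockers is hypothetical, and the GUARDED list is illustrative only (vxfenconfig and had are the processes named in this article; a real check would need the full list of SFCFS components). It expects a process listing with one command name per line, for example the output of ps -e -o comm=.

```python
import os

# Illustrative list only; extend it to cover every component being patched.
GUARDED = ("vxfenconfig", "had")

def find_blockers(ps_output, guarded=GUARDED):
    """Return the guarded process names that appear in a process listing.

    ps_output is text with one command per line; full paths such as
    /sbin/vxfenconfig are reduced to their base name before comparison.
    """
    running = set()
    for line in ps_output.splitlines():
        fields = line.split()
        if fields:
            running.add(os.path.basename(fields[0]))
    return sorted(name for name in guarded if name in running)

busy_listing = "init\n/sbin/vxfenconfig\nsshd\n"
idle_listing = "init\nsshd\n"

blockers_when_busy = find_blockers(busy_listing)
blockers_when_idle = find_blockers(idle_listing)
```

A wrapper script could refuse to run installp whenever find_blockers returns a non-empty list.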

 

Applies To

This issue applies to SFCFS clusters running any version of the VRTSvxfen (Veritas I/O Fencing) package.

Issue/Introduction

A Storage Foundation Cluster File System (SFCFS) node is panicked by GAB with "GAB: Port h halting system due to client process failure" after the VCS cluster is upgraded. The system panic follows an upgrade of the SFCFS cluster from 5.0MP3RP1 to 5.0MP3RP4.

The _had stack from the system snap dump shows:

[000547EC]e_block_thread+000278 ()
[0422C1C8]pse_block_thread+0002D8 ()
[0422C4BC]pse_sleep_thread+000098 (??, ??, ??)
[048A21A0]vrfsm_print_states+000178 ()
[048B0E40].vxfen_get_nodestats_gen2+000468 ()
[048AEDFC].vxfen_ioc_get_nodestats_gen2+000234 ()
[048B4498].vxfen_ioctl+000148 ()
[048919F0].vxfenaixioctl+000144 ()
[003F1558]rdevioctl+0000C8 (??, ??, ??, ??, ??, ??)
[004C8264]spec_ioctl+000078 (??, ??, ??, ??, ??, ??)
[00406BE8]vnop_ioctl+000068 (??, ??, ??, ??, ??, ??)
[00448A78]vno_ioctl+000084 (??, ??, ??, ??, ??)
[0045FDF0]common_ioctl+0000C0 (??, ??, ??, ??)
[00003810].svc_instr+000110 ()
[kdb_get_memory] no real storage @ 2FF15D50

Additional Information

ETrack: 2169121