Hang is seen in the cluster when one of the nodes in the cluster leaves or rebooted. CFS recovery was hung for the missing large packets on delegation recovery.

Description

Error Message

msgcnt 265 Phase 0 - /dev/vx/dsk/bilappdg/billdata4lv - Blocking buffer reads for recovery. gencnt 1 primary 4 leavers: 0x10 0x0
msgcnt 281 Phase 1 - /dev/vx/dsk/bilappdg/billdata4lv - All clear for log replay.
NOTICE: msgcnt 286 - /dev/vx/dsk/bilappdg/billdata4lv - PNOLT bitmap in vrt: 0x6f 0x0 0x0 0x0, Log replay for per node logs: 0x2
NOTICE: msgcnt 297 - /dev/vx/dsk/bilappdg/billdata4lv - Log replay finished with error 0
msgcnt 298 Phase 2 - /dev/vx/dsk/bilappdg/billdata4lv - Buffer reads allowed. <<< the CFS recovery was stopped here for the problematic filesystem!

Cause

This issue is tracked via Symantec internal incident e2403126.

Whenever a node in the cluster leaves, reconfiguration happens and all the resources that are held by the leaving nodes are consolidated. This is done on one node of the cluster called primary node. Each node sends a message to the primary node about the resources it is currently holding. During this reconfiguration, in a corner case, VxFS is incorrectly calculating the message length which is larger than what GAB(Veritas Group Membership and Atomic Broadcast) layer can handle. As a result the message is getting lost. The sender thinks that the message is sent and waits for acknowledgement. The message is actually dropped at sender and never sent. The master node which is waiting for this message will wait forever and the reconfiguration never completes leading to hang.

On the node initiated the cfs recovery

(0)> pr vx_cfs_fsext_t 0xF100011024457800
struct vx_cfs_fsext_t {
    vx_u32_t cfse_cstate        = 0x00000006; <= VX_FSEXT_ASSUMING
VX_FSEXT_SECONDARY
    vx_u32_t cfse_cflags        = 0x00000088; <= VX_FSEFLG_RECOVERING
VX_FSEFLG_FCLRECOVERING
    struct vx_vrt_t *cfse_vrt   = 0xF100011024457000;
    int cfse_primary    = 0xFFFFFFFF; <= -1

the target filesystem was recovering.

(0)> f 1199 <= it's the thread playing the recovery, and was hung for getting the msg F1000110DFC68600.

pvthread+04AF00 STACK:
[000547EC]e_block_thread+000278 ()
[000550F8]e_sleep_thread+00005C (??, ??, ??)
[04566D30].vx_event_wait+0000CC ()
r31 : F1000110DFC68600 r30 : 0000000000000000
[0458F22C].vx_async_waitmsg+00001C ()
[0458EFD4].vx_msg_broadcast+000118 ()
[045C8CEC].vx_dele_rebuild+0000F8 ()
[045CF85C].vx_dele_assumption+000158 ()
[045CF5C8].vx_recover_dele+000024 ()
[04594508].vx_recover+000378 ()

(0)> pr vx_msgblk_t F1000110DFC68600
struct vx_msgblk_t {
...
    struct gab_mbr_t {
        unsigned int mbr[0]     = 0x000000EC; (1110 1100) <= it was sent to nid 2,3,5,6,7
        unsigned int mbr[1]     = 0x00000000;
        unsigned int mbr[2]     = 0x00000000;
        unsigned int mbr[3]     = 0x00000000;
    } mb_sentto;
    struct gab_mbr_t {
        unsigned int mbr[0]     = 0x00000080;(1000 0000)   <= it was waiting for the msg from nid 7, the node responding the recovery.
        unsigned int mbr[1]     = 0x00000000;
        unsigned int mbr[2]     = 0x00000000;
        unsigned int mbr[3]     = 0x00000000;
    } mb_pendrsp;
...
}

on the node responding the recovery, we saw the following thread sending back to node previous node.

(0) f 33332
pvthread+823400 STACK:
[000547EC]e_block_thread+000278 ()
[000550F8]e_sleep_thread+00005C (??, ??, ??)
[045D6D30].vx_event_wait+0000CC ()
r31 : F1000113A3C7EB00 r30 : 0000000000000000
[045FF22C].vx_async_waitmsg+00001C ()
[045FF1C8].vx_msg_send+0001A8 ()
[0463901C].vx_send_rbdele_resp+0000F8 ()
[046397BC].vx_recv_rbdele+00029C ()
[04639954].vx_recvdele+000100 ()
[045FB968].vx_msg_recvreq+000158 ()
[045FDDAC].vx_msg_process_thread+0001AC ()
[045A4CD0].vx_thread_base+00002C ()
[0016FC74]threadentry+000014 (??, ??, ??, ??)

So, it has already sent its response to former node and waiting for acknowledgement.
Lets double check it with msg structure:

(0)> pr vx_msgblk_t F1000113A3C7EB00
struct vx_msgblk_t {
    struct vx_dll_t {
...
    struct gab_mbr_t {
        unsigned int mbr[0]     = 0x00000020; (0010 0000)
        unsigned int mbr[1]     = 0x00000000;
        unsigned int mbr[2]     = 0x00000000;
        unsigned int mbr[3]     = 0x00000000;
    } mb_sentto;
    struct gab_mbr_t {
        unsigned int mbr[0]     = 0x00000020; (0010 0000)
        unsigned int mbr[1]     = 0x00000000;
        unsigned int mbr[2]     = 0x00000000;
        unsigned int mbr[3]     = 0x00000000;
    } mb_pendrsp;
...

Why the message was not delivered by the gab?

In fact, it's a defect from VxFS that, during reconfig, the delegation message sent to the assuming primary exceeds 64K
due to incorrect calculation of how many delegations can be carried in a message

Although gab did response VxFS that the message was more than 64KB, but VxFS didn't capture the error code returned appropriately, the return code was dropped off. That's why VxFS recovery was hung.

Resolution

Veritas engineering has made necessary code changes so that the message length calculation is done properly now and GAB can handle the messages.
The fix is available in the following patch releases:

- AIX 5.0MP3RP5HF3

- AIX 5.1SP1RP2P1

- HP-UX 5.1SP1RP1P1

- Linux 5.1SP1RP2P1

- Solaris 5.1SP1RP2

Applies To

Applies to all platforms: AIX, HP-UX, Linux and Solaris

Products: SFCFS 5.0MP3+ and 5.1+

Issue/Introduction

First of all, this is a corner case, can be seen only when there is a large delegation message exchanging between the nodes. That message is more than 64KB (which is the max size of the gab packets), and was lost by the gab, so the recovery of the delegation was hung, and caused the CFS recovery couldn't finish in the end.

Additional Information

ETrack: 2403126

Welcome to "KB Articles"