GAB: Port b halting system due to network failure

Article ID: 100023433

Description

Error Message

Slightly earlier in the messages file on the affected node (host4), the following entries appear:

Jan  9 12:26:30 host4 llt: [ID 678236 kern.notice] LLT INFO V-14-1-10035 timer not called for 1149 ticks

..........................

Jan  9 12:28:33 host4 Had[13119]: [ID 702911 daemon.alert] VCS WARNING V-16-1-51047 HAD Self Check: Excessive delay in the HAD heartbeat to GAB (10 seconds)

..........................

Jan  9 12:28:42 host4 gab: [ID 854858 kern.notice] GAB INFO V-15-1-20124 timer not called for 28 seconds

Jan  9 12:28:42 host4 llt: [ID 678236 kern.notice] LLT INFO V-14-1-10035 timer not called for 2714 ticks

Jan  9 12:28:42 host4 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 0 (nxge2) node 0 inactive 12 sec (142455498)

Jan  9 12:28:42 host4 llt: [ID 140958 kern.notice] LLT INFO V-14-1-10205 link 0 (nxge2) node 0 in trouble

Jan  9 12:28:42 host4 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 1 (nxge10) node 0 inactive 13 sec (107520707)

Jan  9 12:28:42 host4 llt: [ID 140958 kern.notice] LLT INFO V-14-1-10205 link 1 (nxge10) node 0 in trouble

Jan  9 12:28:42 host4 llt: [ID 140958 kern.notice] LLT INFO V-14-1-10205 link 2 (lowpri) node 0 in trouble

Jan  9 12:28:42 host4 llt: [ID 592107 kern.notice] LLT INFO V-14-1-10510 sent hbreq (NULL) on link 2 (lowpri) node 0. 4 more to go.

Jan  9 12:28:42 host4 llt: [ID 140958 kern.notice] LLT INFO V-14-1-10205 link 0 (nxge2) node 1 in trouble

Jan  9 12:28:42 host4 llt: [ID 140958 kern.notice] LLT INFO V-14-1-10205 link 1 (nxge10) node 1 in trouble

Jan  9 12:28:42 host4 llt: [ID 140958 kern.notice] LLT INFO V-14-1-10205 link 2 (lowpri) node 1 in trouble

Jan  9 12:28:42 host4 llt: [ID 592107 kern.notice] LLT INFO V-14-1-10510 sent hbreq (NULL) on link 1 (nxge10) node 1. 4 more to go.

Jan  9 12:28:42 host4 llt: [ID 592107 kern.notice] LLT INFO V-14-1-10510 sent hbreq (NULL) on link 2 (lowpri) node 1. 4 more to go.

Jan  9 12:28:42 host4 llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 0 (nxge2) node 0 active

Jan  9 12:28:42 host4 llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 0 (nxge2) node 1 active

Jan  9 12:28:42 host4 llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 0 (nxge2) node 2 active

Jan  9 12:28:42 host4 llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 1 (nxge10) node 0 active

On their own, these messages are not very informative. They do, however, give a clue that the affected node had a problem on the LLT network or link side, because its log reports trouble on the LLT links to all of the other nodes.
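To see at a glance which links and peer nodes are reported, the LLT link-state lines can be grouped per (link, node) pair. The following is a minimal sketch, not a Veritas tool: the regular expression is derived only from the message layout shown in the excerpt above, and the function name `summarize_llt_links` is made up for illustration.

```python
import re
from collections import defaultdict

# Layout assumed from the log excerpt in this article:
#   LLT INFO V-14-1-NNNNN link <n> (<device>) node <n> <state>
LLT_LINK_RE = re.compile(
    r"LLT INFO (?P<msgid>V-14-1-\d+) link (?P<link>\d+) \((?P<dev>[^)]+)\) "
    r"node (?P<node>\d+) (?P<state>in trouble|active|expired|inactive \d+ sec)"
)

def summarize_llt_links(lines):
    """Return {(link, node): [state, ...]} in log order, dropping immediate repeats."""
    events = defaultdict(list)
    for line in lines:
        m = LLT_LINK_RE.search(line)
        if not m:
            continue  # not a link-state line (for example, "sent hbreq")
        key = (int(m["link"]), int(m["node"]))
        state = m["state"]
        if not events[key] or events[key][-1] != state:
            events[key].append(state)
    return dict(events)

# A few lines copied from the host4 excerpt.
sample = [
    "Jan  9 12:28:42 host4 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 0 (nxge2) node 0 inactive 12 sec (142455498)",
    "Jan  9 12:28:42 host4 llt: [ID 140958 kern.notice] LLT INFO V-14-1-10205 link 0 (nxge2) node 0 in trouble",
    "Jan  9 12:28:42 host4 llt: [ID 592107 kern.notice] LLT INFO V-14-1-10510 sent hbreq (NULL) on link 2 (lowpri) node 0. 4 more to go.",
    "Jan  9 12:28:42 host4 llt: [ID 140958 kern.notice] LLT INFO V-14-1-10205 link 2 (lowpri) node 1 in trouble",
    "Jan  9 12:28:42 host4 llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 0 (nxge2) node 0 active",
]

summary = summarize_llt_links(sample)
for (link, node), states in sorted(summary.items()):
    print(f"link {link} node {node}: {' -> '.join(states)}")
```

If every peer node shows up as "in trouble" on every link at the same moment, the problem is more likely local to the reporting node than to any single peer.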

The logs of one of the other nodes (host2) show that all of its links to host4 (node 4) became inactive for about 15 seconds and were then declared expired:
 

>>>> link2 (lowpri)

Jan  9 12:28:29 host2 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 2 (lowpri) node 4 inactive 14 sec (23919368)

....

Jan  9 12:28:30 host2 llt: [ID 205468 kern.notice] LLT INFO V-14-1-10509 link 2 (lowpri) node 4 expired

>>>> link1 (nxge10)

Jan  9 12:28:30 host2 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 1 (nxge10) node 4 inactive 14 sec (106635010)

Jan  9 12:28:30 host2 llt: [ID 205468 kern.notice] LLT INFO V-14-1-10509 link 1 (nxge10) node 4 expired

>>>> link0 (nxge2)

Jan  9 12:28:36 host2 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 0 (nxge2) node 4 inactive 14 sec (141415768)

Jan  9 12:28:37 host2 llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 0 (nxge2) node 4 inactive 15 sec (141415790)

Jan  9 12:28:37 host2 llt: [ID 205468 kern.notice] LLT INFO V-14-1-10509 link 0 (nxge2) node 4 expired

Cause

The logs above show that the affected node experienced a network partition: all of its LLT links became inactive and expired after about 15 seconds, which caused the system panic.
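The sequence of events can be checked with a little timestamp arithmetic. The sketch below is illustrative only: it extracts the "expired" lines that host2 logged for node 4 and compares the last expiry with the panic time that host4 logged (12:28:48). It assumes that all events fall on the same day and that the clocks of the two hosts are reasonably in sync; the helper names are made up for this example.

```python
import re

# Matches the syslog time and the LLT "expired" message layout shown in this article.
EXPIRED_RE = re.compile(
    r"(?P<h>\d{2}):(?P<m>\d{2}):(?P<s>\d{2}) \S+ llt: .*"
    r"link (?P<link>\d+) \((?P<dev>[^)]+)\) node (?P<node>\d+) expired"
)

def seconds_of_day(h, m, s):
    """Convert a clock time to seconds since midnight."""
    return int(h) * 3600 + int(m) * 60 + int(s)

def expiry_times(lines, node):
    """Map link number -> second of day at which `node` first expired on that link."""
    result = {}
    for line in lines:
        m = EXPIRED_RE.search(line)
        if m and int(m["node"]) == node:
            result.setdefault(int(m["link"]), seconds_of_day(m["h"], m["m"], m["s"]))
    return result

# "expired" lines copied from the host2 excerpt.
host2_log = [
    "Jan  9 12:28:30 host2 llt: [ID 205468 kern.notice] LLT INFO V-14-1-10509 link 2 (lowpri) node 4 expired",
    "Jan  9 12:28:30 host2 llt: [ID 205468 kern.notice] LLT INFO V-14-1-10509 link 1 (nxge10) node 4 expired",
    "Jan  9 12:28:37 host2 llt: [ID 205468 kern.notice] LLT INFO V-14-1-10509 link 0 (nxge2) node 4 expired",
]

panic_at = seconds_of_day(12, 28, 48)  # panic time logged on host4

times = expiry_times(host2_log, node=4)
last_expiry = max(times.values())
gap = panic_at - last_expiry
print(f"links expired: {sorted(times)}; panic logged {gap} s after the last expiry")
```

With these inputs, all three links to node 4 had expired on host2 before host4 logged the panic.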

Resolution

Check whether any link-down or similar messages were printed in the OS log file.
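A search along the following lines can help with that check. It is a rough sketch under stated assumptions: the exact wording of link-state messages depends on the NIC driver and OS release, so the pattern here is a deliberately broad guess, and the first sample line is invented purely to exercise the function. The interface names come from the LLT messages in this article.

```python
import re

# Assumption: a broad, case-insensitive pattern. Adjust it to the wording
# that the NIC driver on the affected system actually uses.
LINK_DOWN_RE = re.compile(r"\blink\b.*\bdown\b", re.IGNORECASE)

def find_link_down(lines, interfaces=()):
    """Return lines that look like link-down reports.

    If `interfaces` is given, keep only lines that mention one of them.
    """
    hits = []
    for line in lines:
        if not LINK_DOWN_RE.search(line):
            continue
        if interfaces and not any(
            re.search(rf"\b{re.escape(name)}\b", line) for name in interfaces
        ):
            continue
        hits.append(line.rstrip("\n"))
    return hits

sample = [
    "HYPOTHETICAL: nxge2: link down",  # invented line, not from a real driver
    "Jan  9 12:28:42 host4 llt: [ID 140958 kern.notice] LLT INFO V-14-1-10205 link 0 (nxge2) node 0 in trouble",
]

hits = find_link_down(sample, interfaces=("nxge2", "nxge10"))
for hit in hits:
    print(hit)
```

On a live system, the same function can be fed an open file object for /var/adm/messages instead of the sample list.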

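The same panic can occur if the system hangs, even for a couple of minutes, which makes LLT unresponsive. The LLT and GAB "timer not called" messages logged on host4 before the panic point in this direction.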
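To judge how long the node was stalled, the "timer not called" messages from LLT and GAB can be converted to a common unit. This is a sketch, not a supported utility. It assumes that one LLT tick is 1/100 of a second; that assumption is consistent with the excerpt above, where LLT reports 2714 ticks in the same second that GAB reports 28 seconds.

```python
import re

LLT_TICKS_PER_SECOND = 100  # assumption: LLT timer ticks are 1/100 s

TIMER_RE = re.compile(
    r"(?P<mod>LLT|GAB) INFO \S+ timer not called for (?P<n>\d+) (?P<unit>ticks|seconds)"
)

def timer_stalls(lines):
    """Return [(module, stall_in_seconds), ...] for each 'timer not called' line."""
    stalls = []
    for line in lines:
        m = TIMER_RE.search(line)
        if not m:
            continue
        n = int(m["n"])
        secs = n / LLT_TICKS_PER_SECOND if m["unit"] == "ticks" else float(n)
        stalls.append((m["mod"], secs))
    return stalls

# Lines copied from the host4 excerpt.
sample = [
    "Jan  9 12:26:30 host4 llt: [ID 678236 kern.notice] LLT INFO V-14-1-10035 timer not called for 1149 ticks",
    "Jan  9 12:28:42 host4 gab: [ID 854858 kern.notice] GAB INFO V-15-1-20124 timer not called for 28 seconds",
    "Jan  9 12:28:42 host4 llt: [ID 678236 kern.notice] LLT INFO V-14-1-10035 timer not called for 2714 ticks",
]

stalls = timer_stalls(sample)
for module, secs in stalls:
    print(f"{module}: timer stalled for about {secs:.1f} s")
```

Stalls of this length are longer than the roughly 15 seconds after which the peers declared the links expired, so a stalled node can be dropped by its peers even when the physical links are healthy.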

 

Applies To

Five-node 5.0 MP3RP1 cluster on the Solaris 10 platform.

Issue/Introduction

The system panics with the following panic string, as seen in /var/adm/messages:

Jan  9 12:28:48 host4 panic[cpu37]/thread=2a100d31ca0:

Jan  9 12:28:48 host4 unix: [ID 676400 kern.notice] GAB: Port b halting system due to network failure

Jan  9 12:28:48 host4 unix: [ID 100000 kern.notice]

Jan  9 12:28:48 host4 genunix: [ID 723222 kern.notice] 000002a100d31460 gab:gab_halt+b0 (6d470201, 2a100d316a8, 2a100d31680, 3a98, 3c0, 70af6c10)

Jan  9 12:28:48 host4 genunix: [ID 179002 kern.notice] %l0-3: 0000000000000002 0000000000000000 0000000000000000 0000000000000000

Jan  9 12:28:48 host4 %l4-7: 000000007afeec00 000000007afeec00 0000000000000000 0000000000200000

Jan  9 12:28:48 host4 genunix: [ID 723222 kern.notice] 000002a100d31550 gab:gab_receive+2bdc (1, 7afe11c8, 0, 0, 8, 1)

Jan  9 12:28:48 host4 genunix: [ID 179002 kern.notice] %l0-3: 0000000070af28f0 0000000070adefe4 0000000000000000 00000000702bfcb0

Jan  9 12:28:48 host4 %l4-7: 0000030098ab2a80 0000000000000000 0000000000000001 0000000000000000

Jan  9 12:28:48 host4 genunix: [ID 723222 kern.notice] 000002a100d316b0 gab:gab_receive_port_que+400 (1, 30098ab2a80, 70af1a40, 70af19f0, 70af1a00, 300c11ce938)

Jan  9 12:28:48 host4 genunix: [ID 179002 kern.notice] %l0-3: 00000300c11ce938 0000000000000001 0000000000000001 0000000000000000

Jan  9 12:28:48 host4 %l4-7: 0000000000000001 0000000070af1a48 00000000702bfce8 0000000000000008

Jan  9 12:28:48 host4 genunix: [ID 723222 kern.notice] 000002a100d31790 gab:gab_receive_que+210 (300c11ce938, 2a100d31ca0, 0, 0, 702bfcb0, 0)

Jan  9 12:28:48 host4 genunix: [ID 179002 kern.notice] %l0-3: 0000030098ab2a80 0000000000000001 0000000070af1bb0 00000300b7985ac0

Jan  9 12:28:48 host4 %l4-7: 0000000000000000 0000000000000000 0000000070adefe4 000000000105e234

Jan  9 12:28:48 host4 genunix: [ID 723222 kern.notice] 000002a100d31840 gab:gab_lrecv+580 (0, 300c11ce938, 0, 1, 70adefe4, 702bfc00)

Jan  9 12:28:48 host4 genunix: [ID 179002 kern.notice] %l0-3: 0000000000000001 0000000000000000 00000000702bfcb0 0000000000000000

Jan  9 12:28:48 host4 %l4-7: 0000000000000000 0000000070aee0c8 000000000000000b 0000000000000000

Jan  9 12:28:48 host4 genunix: [ID 723222 kern.notice] 000002a100d318f0 llt:llt_lrsrv_port+5d8 (1, 6005ef722a0, 0, 702b9000, 702bbbf0, 1)

Jan  9 12:28:48 host4 genunix: [ID 179002 kern.notice] %l0-3: 0000000000000000 0000000000000001 0000000000000000 00000000702b8628

Jan  9 12:28:48 host4 %l4-7: 00000300c65e4b00 00000000702b7628 0000000000000000 000000007afd9a94

Jan  9 12:28:48 host4 genunix: [ID 723222 kern.notice] 000002a100d319d0 llt:llt_deliver+1b0 (1d, 1e, 1e, 2, 0, 5)

Jan  9 12:28:48 host4 genunix: [ID 179002 kern.notice] %l0-3: 00000000702b77b0 00000000702b7730 00000000702b7560 0000000000000004

Jan  9 12:28:48 host4 %l4-7: 000000000000001e 00000000702b75a0 00000000702b8120 00000000702b9a88

Jan  9 12:28:48 host4 unix: [ID 100000 kern.notice]