What is logged during a GAB-initiated IOFENCE split-brain?

Article ID: 100028891

Description

Error Message

 

Mar 7 09:04:12 elliot gab: [ID 415384 kern.notice] GAB INFO V-15-1-20033 Port a nid 0 [3:0:0:0] [3:0:0:0] [3:0:0:0] [1:0:0:0] [0:0:0:0] 22 23ce0a 20 40

Mar 7 09:04:12 elliot gab: [ID 415384 kern.notice] GAB INFO V-15-1-20033 Port a nid 1 [3:0:0:0] [3:0:0:0] [3:0:0:0] [2:0:0:0] [0:0:0:0] 22 23ce0a 20 14

Mar 7 09:04:12 elliot gab: [ID 754703 kern.notice] GAB INFO V-15-1-20034 Port a iofence set [1:0:0:0] dst [2:0:0:0] bdst [0:0:0:0] reason network failure

Cause

A VCS split-brain condition occurs when all LLT heartbeat links between the nodes of a cluster drop at the same time. This can happen when the LLT NIC-to-NIC links were not engineered to high-availability specifications, so that a single point of failure (e.g. a common Ethernet switch/hub or a common power source) takes down every link.

Resolution

========== Note: the two nodes' syslog timestamps are out of sync by about 7 minutes (chico's clock is ahead of elliot's)

elliot # cat /etc/llthosts

0 elliot <<<<<<<<<<< lowest node id in cluster

1 chico

============== All LLT links drop out

 

Mar 7 08:46:55 elliot llt: [ID 205468 kern.notice] LLT INFO V-14-1-10509 link 0 (ce0) node 1 expired

Mar 7 08:46:55 elliot llt: [ID 205468 kern.notice] LLT INFO V-14-1-10509 link 1 (ce1) node 1 expired

Mar 7 08:53:48 chico llt: [ID 205468 kern.notice] LLT INFO V-14-1-10509 link 0 (ce0) node 0 expired

Mar 7 08:53:48 chico llt: [ID 205468 kern.notice] LLT INFO V-14-1-10509 link 1 (ce1) node 0 expired
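Because the two nodes' clocks disagree, it helps to put chico's timestamps on elliot's clock before comparing events. The following is a minimal sketch of that arithmetic, using the link-expiry timestamps from the log lines above. It assumes both nodes logged the expiry at roughly the same real instant; the helper names (parse_ts, chico_to_elliot) and the placeholder year are illustrative only and not part of any Veritas tool.

```python
from datetime import datetime

# Syslog lines carry no year, so an arbitrary placeholder year is prepended
# purely to make the timestamps parseable; only differences are used.
PLACEHOLDER_YEAR = "2000"

def parse_ts(stamp):
    """Parse a syslog timestamp such as 'Mar 7 08:46:55'."""
    return datetime.strptime(f"{PLACEHOLDER_YEAR} {stamp}", "%Y %b %d %H:%M:%S")

# "link 0 (ce0) ... expired" as logged by each node.
# Assumption: both nodes logged this at roughly the same real instant.
elliot_expired = parse_ts("Mar 7 08:46:55")
chico_expired = parse_ts("Mar 7 08:53:48")

# How far chico's clock runs ahead of elliot's.
skew = chico_expired - elliot_expired

def chico_to_elliot(stamp):
    """Re-express a chico timestamp on elliot's clock (HH:MM:SS)."""
    return (parse_ts(stamp) - skew).strftime("%H:%M:%S")

print(skew)                               # 0:06:53
print(chico_to_elliot("Mar 7 08:59:33"))  # 08:52:40
```

With this offset, the chico panic logged at 08:59:33 corresponds to about 08:52:40 on elliot's clock.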

 

============== "Split Brain" condition on both nodes (each node reports the other as down and forms its own mini-cluster)

Mar 7 08:47:00 elliot Had[3233]: [ID 702911 daemon.notice] VCS ERROR V-16-1-10079 System chico (Node '1') is in Down State - Membership: 0x1

Mar 7 08:47:00 elliot Had[3233]: [ID 702911 daemon.notice] VCS ERROR V-16-1-10322 System chico (Node '1') changed state from RUNNING to FAULTED

Mar 7 08:53:52 chico Had[15233]: [ID 702911 daemon.notice] VCS ERROR V-16-1-10079 System elliot (Node '0') is in Down State - Membership: 0x2

Mar 7 08:53:52 chico Had[15233]: [ID 702911 daemon.notice] VCS ERROR V-16-1-10322 System elliot (Node '0') changed state from RUNNING to FAULTED

============== Both nodes see LLT links return about 5.5 minutes after they expired (the IOFENCE timeout is 15000 ms, i.e. 15 seconds; see # gabconfig -l | egrep IOFENCE )

Mar 7 08:52:31 elliot llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 1 (ce1) node 1 active

Mar 7 08:52:33 elliot llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 0 (ce0) node 1 active

Mar 7 08:59:25 chico llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 1 (ce1) node 0 active

Mar 7 08:59:28 chico llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 0 (ce0) node 0 active

============== Node 0 (the lowest node number in the cluster) issues the network iofence command to the other nodes via the LLT links

Mar 7 09:04:12 elliot gab: [ID 415384 kern.notice] GAB INFO V-15-1-20033 Port a nid 0 [3:0:0:0] [3:0:0:0] [3:0:0:0] [1:0:0:0] [0:0:0:0] 22 23ce0a 20 40

Mar 7 09:04:12 elliot gab: [ID 415384 kern.notice] GAB INFO V-15-1-20033 Port a nid 1 [3:0:0:0] [3:0:0:0] [3:0:0:0] [2:0:0:0] [0:0:0:0] 22 23ce0a 20 14

Mar 7 09:04:12 elliot gab: [ID 754703 kern.notice] GAB INFO V-15-1-20034 Port a iofence set [1:0:0:0] dst [2:0:0:0] bdst [0:0:0:0] reason network failure
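The V-15-1-20034 line above can be picked apart programmatically. The sketch below is a hedged illustration, not a documented format description: it assumes the first colon-separated field inside each bracket is a hexadecimal bitmask of node IDs (bit 0 = node 0, bit 1 = node 1), which is consistent with the "Membership: 0x1" and "Membership: 0x2" values in the HAD messages but is not confirmed by this article. The remaining fields are ignored, and the names IOFENCE_RE and mask_to_nodes are hypothetical.

```python
import re

# Pattern for the GAB iofence message exactly as it appears in this article.
IOFENCE_RE = re.compile(
    r"GAB INFO V-15-1-20034 Port (?P<port>\w+) iofence "
    r"set \[(?P<set>[0-9a-f:]+)\] dst \[(?P<dst>[0-9a-f:]+)\] "
    r"bdst \[(?P<bdst>[0-9a-f:]+)\] reason (?P<reason>.+)$"
)

def mask_to_nodes(group):
    """Decode the first field of e.g. '2:0:0:0' into a list of node IDs.

    Assumption: the field is a hex bitmask with one bit per node ID.
    """
    word = int(group.split(":")[0], 16)
    return [bit for bit in range(word.bit_length()) if (word >> bit) & 1]

line = (
    "Mar 7 09:04:12 elliot gab: [ID 754703 kern.notice] GAB INFO V-15-1-20034 "
    "Port a iofence set [1:0:0:0] dst [2:0:0:0] bdst [0:0:0:0] reason network failure"
)

match = IOFENCE_RE.search(line)
info = {
    "port": match.group("port"),
    "set": mask_to_nodes(match.group("set")),
    "dst": mask_to_nodes(match.group("dst")),
    "bdst": mask_to_nodes(match.group("bdst")),
    "reason": match.group("reason"),
}
print(info)
```

Under that assumption, the line reads as an iofence on port a with node 0 in the "set" group and node 1 in the "dst" group, which matches node 0 (elliot) issuing the iofence and node 1 (chico) being the node that panics.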

Mar 7 08:59:33 chico ^Mpanic[cpu0]/thread=2a1011d1ca0:

Mar 7 08:59:33 chico unix: [ID 517378 kern.notice] GAB: Port b halting system due to network failure at [14:2027]

Mar 7 08:59:33 chico unix: [ID 100000 kern.notice]

Mar 7 08:59:33 chico genunix: [ID 723222 kern.notice] 000002a1011d1190 gab:gab_halt+5c (7eb70201, e, 7eb, fd6e, 7eb70262, 2)

Mar 7 08:59:34 chico genunix: [ID 179002 kern.notice] %l0-3: 0000000070215e38 00000600109da7bb 0000000000000000 00000600194c6980

Mar 7 08:59:34 chico %l4-7: 00000000000007eb 000000007af6e000 000000000007af6e 000000000007ac00

Mar 7 08:59:34 chico genunix: [ID 723222 kern.notice] 000002a1011d12c0 gab:gab_recv_iofence+7a0 (7eb70000, 2a1011d15e8, 10624c00, 1080000, 7af6d970, 201)

Mar 7 08:59:34 chico genunix: [ID 179002 kern.notice] %l0-3: 0000000000000002 0000000000000e00 0000000000000001 0000000000070000

Mar 7 08:59:34 chico %l4-7: 0000000000000040 000000007eb70201 000000007af62000 0000000000200000

Mar 7 08:59:34 chico genunix: [ID 723222 kern.notice] 000002a1011d1430 gab:gab_receive+5a8 (60012953880, 20, 1, 7af56000, a, 11)

Mar 7 08:59:34 chico genunix: [ID 179002 kern.notice] %l0-3: 000000000000000b 000000000000000b 0000060012953880 0000000000000000

Mar 7 08:59:34 chico %l4-7: 0000060017379ac0 0000000000000001 000000007af6c928 0000000000000000

Mar 7 08:59:34 chico genunix: [ID 723222 kern.notice] 000002a1011d15f0 gab:gab_receive_port_que+418 (70000, 7af6c928, 7023afe4, 8, 7af6c9a0, 1)

Mar 7 08:59:34 chico genunix: [ID 179002 kern.notice] %l0-3: ffffffffffffffff 0000060012953880 0000000000000001 0000000000000001

Mar 7 08:59:34 chico %l4-7: 0000060017379e90 000000007af6c998 000000007af6c990 0000060012953880

Mar 7 08:59:34 chico genunix: [ID 723222 kern.notice] 000002a1011d16c0 gab:gab_receive_que+194 (60012953880, 1, 1, 70215d10, 70215000, 7023afe4)

Mar 7 08:59:35 chico genunix: [ID 179002 kern.notice] %l0-3: 0000000000070000 0000000000000020 000000007023a000 000000000007023a

Mar 7 08:59:35 chico %l4-7: 0000000000070000 0000060017379ac0 0000060012d39d80 0000000000000008

Mar 7 08:59:35 chico genunix: [ID 723222 kern.notice] 000002a1011d1770 gab:gab_lrecv+7dc (1, 1, 60010be6840, 20, 70215ce0, 60012953880)

Mar 7 08:59:35 chico genunix: [ID 179002 kern.notice] %l0-3: 0000000000000000 0000060018db0000 0000000000000001 0000000000000000

Mar 7 08:59:35 chico %l4-7: 000000005292e535 0000000000000000 0000000000000001 0000000000000000

Mar 7 08:59:35 chico genunix: [ID 723222 kern.notice] 000002a1011d1830 llt:llt_lrsrv_port2+368 (60014b79180, 60011867108, 0, 1, ffff, 1)

Mar 7 08:59:35 chico genunix: [ID 179002 kern.notice] %l0-3: 00000600118671a8 0000000000000000 0000060011867198 00000600118671e8

Mar 7 08:59:35 chico %l4-7: 0000060010be6840 0000000000000000 0000000000000000 000002a1011d18f8

Mar 7 08:59:35 chico genunix: [ID 723222 kern.notice] 000002a1011d1900 llt:llt_lrsrv_port+1dc (2, 60011867108, 1, 6001a04e880, 0, 1)

Mar 7 08:59:35 chico genunix: [ID 179002 kern.notice] %l0-3: 00000600118671d8 0000000000000002 0000000000000000 0000000000000001

Mar 7 08:59:35 chico %l4-7: 000000007020f5e8 000000007020ae4c 0000000000000000 0000000000000002

Mar 7 08:59:35 chico genunix: [ID 723222 kern.notice] 000002a1011d19e0 llt:llt_deliver+350 (7020a, 70000, 6, 2, 1, 60011867108)

Mar 7 08:59:36 chico genunix: [ID 179002 kern.notice] %l0-3: 0000000000000021 0000000000000001 0000000000000020 000000000000001f

Mar 7 08:59:36 chico %l4-7: 000000007020a000 000000000007020a 0000000000070000 000000000007020c

=============== Node 0 sees Node 1 leave the cluster via the LLT links because Node 1 panicked and rebooted

Mar 7 08:52:40 elliot llt: [ID 140958 kern.notice] LLT INFO V-14-1-10205 link 0 (ce0) node 1 in trouble

Mar 7 08:52:40 elliot llt: [ID 140958 kern.notice] LLT INFO V-14-1-10205 link 1 (ce1) node 1 in trouble

 

 

============== Node 1 (not the lowest node number in the cluster) receives the "iofence" panic command from node 0 via the restored LLT links; this is the chico panic at 08:59:33 shown above

Also refer to this article: What is a GAB initiated IOFENCE?

https://www.veritas.com/docs/000027152

Captured logs for Split-Brain GAB initiated IOFENCE


Issue/Introduction

Which log entries indicate that GAB initiated the IOFENCE procedure to panic other nodes of the cluster?
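As a rough aid for spotting this sequence in a syslog file, the following sketch tags lines by the message IDs and panic string that appear in this article. The labels are informal descriptions taken from the message text shown here, not official definitions, and MARKERS and classify are hypothetical names used only for illustration.

```python
# Message IDs and strings seen in this article, with informal labels.
MARKERS = {
    "V-14-1-10509": "LLT link expired",
    "V-14-1-10205": "LLT link in trouble",
    "V-14-1-10024": "LLT link active",
    "V-16-1-10079": "HAD: peer system in Down State",
    "V-16-1-10322": "HAD: peer changed state from RUNNING to FAULTED",
    "V-15-1-20033": "GAB: Port a nid line logged with the iofence",
    "V-15-1-20034": "GAB: iofence issued",
    "halting system due to network failure": "GAB: panic on the fenced node",
}

def classify(line):
    """Return the label of the first marker found in the line, else None."""
    for marker, label in MARKERS.items():
        if marker in line:
            return label
    return None

sample = [
    "Mar 7 08:46:55 elliot llt: [ID 205468 kern.notice] LLT INFO V-14-1-10509 link 0 (ce0) node 1 expired",
    "Mar 7 09:04:12 elliot gab: [ID 754703 kern.notice] GAB INFO V-15-1-20034 Port a iofence set [1:0:0:0] dst [2:0:0:0] bdst [0:0:0:0] reason network failure",
    "Mar 7 08:59:33 chico unix: [ID 517378 kern.notice] GAB: Port b halting system due to network failure at [14:2027]",
    "Mar 7 08:59:33 chico unix: [ID 100000 kern.notice]",
]

for entry in sample:
    label = classify(entry)
    if label is not None:
        print(f"{label}: {entry}")
```

Lines that match none of the markers (such as the blank kern.notice line in the sample) are skipped.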