A VCS split-brain condition occurs when all LLT heartbeat links between the nodes of a cluster are lost. This can happen when the LLT NIC-to-NIC links are not engineered to high-availability specifications, so that a single point of failure (e.g. a common Ethernet switch/hub or a common power source) takes down every link at once.
========== Note: the two nodes' syslog timestamps are out of sync by about 7 minutes
elliot # cat /etc/llthosts
0 elliot <<<<<<<<<<< lowest node id in cluster
1 chico
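Because the node with the lowest ID is the one that issues the iofence later in this capture, it can be useful to read that ID directly from /etc/llthosts. The following is a minimal Python sketch, assuming the two-column "node-id hostname" layout shown above; the function name parse_llthosts is hypothetical and not part of any VCS tooling.

```python
def parse_llthosts(text):
    """Return {node_id: hostname} from /etc/llthosts-style text."""
    nodes = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and anything that does not start with a numeric node ID.
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        nodes[int(fields[0])] = fields[1]
    return nodes


sample = "0 elliot\n1 chico\n"
nodes = parse_llthosts(sample)
lowest = min(nodes)
print(lowest, nodes[lowest])  # prints: 0 elliot
```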
============== All LLT links drop out
Mar 7 08:46:55 elliot llt: [ID 205468 kern.notice] LLT INFO V-14-1-10509 link 0 (ce0) node 1 expired
Mar 7 08:46:55 elliot llt: [ID 205468 kern.notice] LLT INFO V-14-1-10509 link 1 (ce1) node 1 expired
Mar 7 08:53:48 chico llt: [ID 205468 kern.notice] LLT INFO V-14-1-10509 link 0 (ce0) node 0 expired
Mar 7 08:53:48 chico llt: [ID 205468 kern.notice] LLT INFO V-14-1-10509 link 1 (ce1) node 0 expired
============== "Split Brain" condition for both nodes (mini-cluster status)
Mar 7 08:47:00 elliot Had[3233]: [ID 702911 daemon.notice] VCS ERROR V-16-1-10079 System chico (Node '1') is in Down State - Membership: 0x1
Mar 7 08:47:00 elliot Had[3233]: [ID 702911 daemon.notice] VCS ERROR V-16-1-10322 System chico (Node '1') changed state from RUNNING to FAULTED
Mar 7 08:53:52 chico Had[15233]: [ID 702911 daemon.notice] VCS ERROR V-16-1-10079 System elliot (Node '0') is in Down State - Membership: 0x2
Mar 7 08:53:52 chico Had[15233]: [ID 702911 daemon.notice] VCS ERROR V-16-1-10322 System elliot (Node '0') changed state from RUNNING to FAULTED
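The Membership values in the Had messages above can be read as a bitmask of node IDs. The sketch below assumes that bit n corresponds to node ID n, which is consistent with these logs (elliot, node 0, reports 0x1; chico, node 1, reports 0x2) but is an interpretation rather than something the logs state. The helper name members is hypothetical.

```python
def members(mask):
    """Return the node IDs whose bit is set in a membership mask.

    Assumption: bit n of the mask stands for node ID n.
    """
    return [nid for nid in range(mask.bit_length()) if (mask >> nid) & 1]


print(members(0x1))  # mask logged by elliot -> [0]
print(members(0x2))  # mask logged by chico  -> [1]
print(members(0x3))  # both nodes present    -> [0, 1]
```

Under that reading, each node sees a membership containing only itself, which is the mini-cluster state named in the heading.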
============== Both nodes see the LLT links return (GAB IOFENCE timeout is 15000 ms, i.e. 15 s; see # gabconfig -l | egrep IOFENCE )
Mar 7 08:52:31 elliot llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 1 (ce1) node 1 active
Mar 7 08:52:33 elliot llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 0 (ce0) node 1 active
Mar 7 08:59:25 chico llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 1 (ce1) node 0 active
Mar 7 08:59:28 chico llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 0 (ce0) node 0 active
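To see how long each link was down on each node, the "expired" and "active" messages can be paired per host and link. This is a minimal Python sketch that works only on the message text shown in this capture; the regular expression, the function name link_outages, and the fixed year are assumptions made for illustration (syslog lines carry no year).

```python
import re
from datetime import datetime

LINK_RE = re.compile(
    r"^(\w{3}\s+\d+\s+[\d:]{8})\s+(\S+)\s+llt:.*"
    r"link (\d+) \((\w+)\) node \d+ (expired|active)$"
)
ASSUMED_YEAR = "2000"  # syslog omits the year; any fixed year works for same-day gaps


def link_outages(lines):
    """Return {(host, link): seconds between 'expired' and the next 'active'}."""
    down, outages = {}, {}
    for line in lines:
        m = LINK_RE.match(line.strip())
        if not m:
            continue
        stamp, host, link, _nic, state = m.groups()
        when = datetime.strptime(f"{ASSUMED_YEAR} {stamp}", "%Y %b %d %H:%M:%S")
        key = (host, int(link))
        if state == "expired":
            down[key] = when
        elif key in down:
            outages[key] = (when - down.pop(key)).total_seconds()
    return outages


log = """\
Mar 7 08:46:55 elliot llt: [ID 205468 kern.notice] LLT INFO V-14-1-10509 link 0 (ce0) node 1 expired
Mar 7 08:46:55 elliot llt: [ID 205468 kern.notice] LLT INFO V-14-1-10509 link 1 (ce1) node 1 expired
Mar 7 08:53:48 chico llt: [ID 205468 kern.notice] LLT INFO V-14-1-10509 link 0 (ce0) node 0 expired
Mar 7 08:53:48 chico llt: [ID 205468 kern.notice] LLT INFO V-14-1-10509 link 1 (ce1) node 0 expired
Mar 7 08:52:31 elliot llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 1 (ce1) node 1 active
Mar 7 08:52:33 elliot llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 0 (ce0) node 1 active
Mar 7 08:59:25 chico llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 1 (ce1) node 0 active
Mar 7 08:59:28 chico llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 0 (ce0) node 0 active
""".splitlines()

for key, secs in sorted(link_outages(log).items()):
    print(key, secs)
```

Each node's outage is measured against its own clock, so the 7-minute skew between the two syslogs does not affect the result. On this capture every link was down for a little over five and a half minutes.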
============== Node 0 (lowest node number in the cluster) issues network iofence command via LLT links to other nodes
Mar 7 09:04:12 elliot gab: [ID 415384 kern.notice] GAB INFO V-15-1-20033 Port a nid 0 [3:0:0:0] [3:0:0:0] [3:0:0:0] [1:0:0:0] [0:0:0:0] 22 23ce0a 20 40
Mar 7 09:04:12 elliot gab: [ID 415384 kern.notice] GAB INFO V-15-1-20033 Port a nid 1 [3:0:0:0] [3:0:0:0] [3:0:0:0] [2:0:0:0] [0:0:0:0] 22 23ce0a 20 14
Mar 7 09:04:12 elliot gab: [ID 754703 kern.notice] GAB INFO V-15-1-20034 Port a iofence set [1:0:0:0] dst [2:0:0:0] bdst [0:0:0:0] reason network failure
Mar 7 08:59:33 chico ^Mpanic[cpu0]/thread=2a1011d1ca0:
Mar 7 08:59:33 chico unix: [ID 517378 kern.notice] GAB: Port b halting system due to network failure at [14:2027]
Mar 7 08:59:33 chico unix: [ID 100000 kern.notice]
Mar 7 08:59:33 chico genunix: [ID 723222 kern.notice] 000002a1011d1190 gab:gab_halt+5c (7eb70201, e, 7eb, fd6e, 7eb70262, 2)
Mar 7 08:59:34 chico genunix: [ID 179002 kern.notice] %l0-3: 0000000070215e38 00000600109da7bb 0000000000000000 00000600194c6980
Mar 7 08:59:34 chico %l4-7: 00000000000007eb 000000007af6e000 000000000007af6e 000000000007ac00
Mar 7 08:59:34 chico genunix: [ID 723222 kern.notice] 000002a1011d12c0 gab:gab_recv_iofence+7a0 (7eb70000, 2a1011d15e8, 10624c00, 1080000, 7af6d970, 201)
Mar 7 08:59:34 chico genunix: [ID 179002 kern.notice] %l0-3: 0000000000000002 0000000000000e00 0000000000000001 0000000000070000
Mar 7 08:59:34 chico %l4-7: 0000000000000040 000000007eb70201 000000007af62000 0000000000200000
Mar 7 08:59:34 chico genunix: [ID 723222 kern.notice] 000002a1011d1430 gab:gab_receive+5a8 (60012953880, 20, 1, 7af56000, a, 11)
Mar 7 08:59:34 chico genunix: [ID 179002 kern.notice] %l0-3: 000000000000000b 000000000000000b 0000060012953880 0000000000000000
Mar 7 08:59:34 chico %l4-7: 0000060017379ac0 0000000000000001 000000007af6c928 0000000000000000
Mar 7 08:59:34 chico genunix: [ID 723222 kern.notice] 000002a1011d15f0 gab:gab_receive_port_que+418 (70000, 7af6c928, 7023afe4, 8, 7af6c9a0, 1)
Mar 7 08:59:34 chico genunix: [ID 179002 kern.notice] %l0-3: ffffffffffffffff 0000060012953880 0000000000000001 0000000000000001
Mar 7 08:59:34 chico %l4-7: 0000060017379e90 000000007af6c998 000000007af6c990 0000060012953880
Mar 7 08:59:34 chico genunix: [ID 723222 kern.notice] 000002a1011d16c0 gab:gab_receive_que+194 (60012953880, 1, 1, 70215d10, 70215000, 7023afe4)
Mar 7 08:59:35 chico genunix: [ID 179002 kern.notice] %l0-3: 0000000000070000 0000000000000020 000000007023a000 000000000007023a
Mar 7 08:59:35 chico %l4-7: 0000000000070000 0000060017379ac0 0000060012d39d80 0000000000000008
Mar 7 08:59:35 chico genunix: [ID 723222 kern.notice] 000002a1011d1770 gab:gab_lrecv+7dc (1, 1, 60010be6840, 20, 70215ce0, 60012953880)
Mar 7 08:59:35 chico genunix: [ID 179002 kern.notice] %l0-3: 0000000000000000 0000060018db0000 0000000000000001 0000000000000000
Mar 7 08:59:35 chico %l4-7: 000000005292e535 0000000000000000 0000000000000001 0000000000000000
Mar 7 08:59:35 chico genunix: [ID 723222 kern.notice] 000002a1011d1830 llt:llt_lrsrv_port2+368 (60014b79180, 60011867108, 0, 1, ffff, 1)
Mar 7 08:59:35 chico genunix: [ID 179002 kern.notice] %l0-3: 00000600118671a8 0000000000000000 0000060011867198 00000600118671e8
Mar 7 08:59:35 chico %l4-7: 0000060010be6840 0000000000000000 0000000000000000 000002a1011d18f8
Mar 7 08:59:35 chico genunix: [ID 723222 kern.notice] 000002a1011d1900 llt:llt_lrsrv_port+1dc (2, 60011867108, 1, 6001a04e880, 0, 1)
Mar 7 08:59:35 chico genunix: [ID 179002 kern.notice] %l0-3: 00000600118671d8 0000000000000002 0000000000000000 0000000000000001
Mar 7 08:59:35 chico %l4-7: 000000007020f5e8 000000007020ae4c 0000000000000000 0000000000000002
Mar 7 08:59:35 chico genunix: [ID 723222 kern.notice] 000002a1011d19e0 llt:llt_deliver+350 (7020a, 70000, 6, 2, 1, 60011867108)
Mar 7 08:59:36 chico genunix: [ID 179002 kern.notice] %l0-3: 0000000000000021 0000000000000001 0000000000000020 000000000000001f
Mar 7 08:59:36 chico %l4-7: 000000007020a000 000000000007020a 0000000000070000 000000000007020c
=============== Node 0 sees Node 1 leave the cluster via the LLT links because Node 1 panicked and rebooted
Mar 7 08:52:40 elliot llt: [ID 140958 kern.notice] LLT INFO V-14-1-10205 link 0 (ce0) node 1 in trouble
Mar 7 08:52:40 elliot llt: [ID 140958 kern.notice] LLT INFO V-14-1-10205 link 1 (ce1) node 1 in trouble
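The sequence captured above (links expire, Had marks the peer FAULTED, links return, GAB sends the iofence, the peer panics) can be picked out of a syslog by message ID. The following Python sketch uses only the message IDs and strings that appear in this capture; the signature labels and the function names classify and timeline are hypothetical, and other VCS versions may word these messages differently.

```python
import re

# Message IDs and text taken from the captured logs in this article.
SIGNATURES = [
    ("llt_link_expired", re.compile(r"V-14-1-10509 .* expired")),
    ("llt_link_active", re.compile(r"V-14-1-10024 .* active")),
    ("had_node_faulted", re.compile(r"V-16-1-10322 .* RUNNING to FAULTED")),
    ("gab_iofence_sent", re.compile(r"V-15-1-20034 Port a iofence .* reason network failure")),
    ("gab_iofence_panic", re.compile(r"GAB: Port b halting system due to network failure")),
]


def classify(line):
    """Return the label of the first signature found in the line, or None."""
    for name, pattern in SIGNATURES:
        if pattern.search(line):
            return name
    return None


def timeline(lines):
    """Return the signature labels in the order they appear, skipping other lines."""
    return [name for name in map(classify, lines) if name]


sample = [
    "Mar 7 08:46:55 elliot llt: [ID 205468 kern.notice] LLT INFO V-14-1-10509 link 0 (ce0) node 1 expired",
    "Mar 7 08:47:00 elliot Had[3233]: [ID 702911 daemon.notice] VCS ERROR V-16-1-10322 System chico (Node '1') changed state from RUNNING to FAULTED",
    "Mar 7 08:52:31 elliot llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 1 (ce1) node 1 active",
    "Mar 7 09:04:12 elliot gab: [ID 754703 kern.notice] GAB INFO V-15-1-20034 Port a iofence set [1:0:0:0] dst [2:0:0:0] bdst [0:0:0:0] reason network failure",
    "Mar 7 08:59:33 chico unix: [ID 517378 kern.notice] GAB: Port b halting system due to network failure at [14:2027]",
]
print(timeline(sample))
```

Seeing gab_iofence_sent on the lowest-numbered node and gab_iofence_panic on a peer shortly after the links return is the pattern this article describes.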
Also refer to this article: What is a GAB initiated IOFENCE?
============== Node 1 (not the lowest node number in the cluster) receives the "iofence" panic command from node 0 via the restored LLT links (see the chico panic trace above)
Captured logs for Split-Brain GAB initiated IOFENCE
What log entries indicate a GAB initiated IOFENCE procedure to panic other nodes of the cluster?