How to read the GAB Node ID (nid) port messages


Article ID: 100002280


Resolution

When a cluster node panics because of an IOFENCE on one or more GAB ports, the log messages that refer to the node IDs can be confusing.

Here is an example from VRTS explorers where both nodes panicked due to an IOFENCE on a GAB port:

Jan 7 20:36:33 cinohdbp01sec unix: [ID 836849 kern.notice]
Jan 7 20:36:33 cinohdbp01sec ^Mpanic[cpu100]/thread=2a103053cc0:
Jan 7 20:36:33 cinohdbp01sec unix: [ID 676452 kern.notice]
 GAB: Port o halting system due to network failure
Jan 7 20:36:28 cinohdbp02sec unix: [ID 836849 kern.notice]
Jan 7 20:36:28 cinohdbp02sec ^Mpanic[cpu64]/thread=2a1031e5cc0:
Jan 7 20:36:28 cinohdbp02sec unix: [ID 676400 kern.notice]
 GAB: Port b halting system due to network failure

This message essentially indicates that something has gone wrong on the LLT private interconnect between the two machines. The log from node 01 contains GAB messages immediately prior to the panic. Because this panic was caused by port o, we can look at the port o messages specifically:

Jan 7 20:36:32 cinohdbp01sec gab: [ID 125212 kern.notice]
 GAB INFO V-15-1-20033 Port o nid 0 3 3 1 3 0 916614 608
Jan 7 20:36:32 cinohdbp01sec gab: [ID 125212 kern.notice]
 GAB INFO V-15-1-20033 Port o nid 1 3 3 3 2 0 916615 60 20
Jan 7 20:36:32 cinohdbp01sec gab: [ID 112206 kern.notice]
 GAB INFO V-15-1-20034 Port o iofence set 2 dst1 bdst 0

The columns from nid onward show:

Node_id Connected Reliable Valid Committed Checksum Sequence Flags

Where 3 = both nodes, 1 = node 0, and 2 = node 1
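
As an illustration of the column layout above, here is a minimal Python sketch that splits a nid line into the named fields and expands a value such as 3, 1, or 2 into node IDs. It assumes the values are bitmasks in which bit n stands for node n (this matches the legend but is not confirmed by the article). The names NidRecord, parse_nid_line, and members are hypothetical helpers, not part of any Veritas tool. Anything after the Sequence column is kept verbatim as flags, because its width differs between the two sample lines.

```python
import re
from typing import NamedTuple


class NidRecord(NamedTuple):
    """One 'Port <x> nid ...' line, using the column names from the article."""
    port: str
    node_id: int
    connected: int
    reliable: int
    valid: int
    committed: int
    checksum: int
    sequence: int
    flags: str  # kept as raw text; not interpreted here


# Seven numeric columns after "nid", then whatever remains on the line.
NID_RE = re.compile(r"Port (\w) nid ((?:\d+\s+){7})(.*)$")


def parse_nid_line(line: str) -> NidRecord:
    """Split a GAB nid log line into named fields."""
    match = NID_RE.search(line)
    if match is None:
        raise ValueError("not a GAB nid line")
    numbers = [int(tok) for tok in match.group(2).split()]
    return NidRecord(match.group(1), *numbers, match.group(3).strip())


def members(mask: int) -> list[int]:
    """Node IDs whose bit is set, ASSUMING bit n stands for node n."""
    return [n for n in range(mask.bit_length()) if (mask >> n) & 1]


rec = parse_nid_line("GAB INFO V-15-1-20033 Port o nid 0 3 3 1 3 0 916614 608")
print(rec.port, rec.node_id, members(rec.connected), members(rec.valid))
```

Under that assumption, members(3) gives both nodes, members(1) gives node 0 only, and members(2) gives node 1 only, which is the same reading as the legend.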

We would expect to see 3s everywhere in the first few columns. Instead, it looks like node 0 (1 = node 0 = cinohdbp01sec) believes it is the only valid port o member, and the nid 1 line shows a 2 (node 1 only) in the Committed column.
Node 0's sequence number (916614) is also lower than that of node 1 (916615). Under these circumstances, as there has clearly been some kind of loss of coherency, the node is designed to panic to protect data integrity. A similar thing looks likely to have happened on node 1 on port b.
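
The check described above (every membership column should read 3, and the sequence numbers should agree) can be sketched in Python as follows. This is only an illustration of the reasoning, not how GAB itself decides to panic. The function name find_discrepancies and the dictionary layout are hypothetical, and the default full_mask=3 assumes a two-node cluster as in this example.

```python
# Columns the article expects to read 3 (both nodes) in a healthy two-node cluster.
MASK_COLUMNS = ("connected", "reliable", "valid", "committed")


def find_discrepancies(records, full_mask=3):
    """Return human-readable notes for values that deviate from expectation."""
    findings = []
    for rec in records:
        for col in MASK_COLUMNS:
            if rec[col] != full_mask:
                findings.append(
                    f"nid {rec['node_id']}: {col}={rec[col]} (expected {full_mask})"
                )
    # Differing sequence numbers between the nid lines are also worth flagging.
    sequences = sorted({rec["sequence"] for rec in records})
    if len(sequences) > 1:
        findings.append(
            "sequence numbers differ: " + ", ".join(str(s) for s in sequences)
        )
    return findings


# Values copied from the two port o nid lines in the example above.
records = [
    {"node_id": 0, "connected": 3, "reliable": 3, "valid": 1,
     "committed": 3, "sequence": 916614},
    {"node_id": 1, "connected": 3, "reliable": 3, "valid": 3,
     "committed": 2, "sequence": 916615},
]

for finding in find_discrepancies(records):
    print(finding)
```

For the sample lines this flags the Valid value on the nid 0 line, the Committed value on the nid 1 line, and the differing sequence numbers.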

Typical reasons for this kind of panic would be memory/CPU starvation or issues with the LLT private interconnect itself.
 
 

 
