CVM Cluster join cannot be established unless the nodes are started in a specific order and sequence

Article ID: 100017152

Description

Error Message

The following sequence of events is logged in /var/adm/messages when this situation occurs. GAB ports v and w establish membership (i.e., membership 01), but vxconfigd then reports a "Connection timed out" error, which causes the CVM join to fail.

Aug 15 03:21:16 H22RRMOBDB01 gab: [ID 316943 kern.notice] GAB INFO V-15-1-20036 Port v gen  143b11b membership 01
Aug 15 03:21:27 H22RRMOBDB01 gab: [ID 316943 kern.notice] GAB INFO V-15-1-20036 Port w gen  143b11d membership 01
Aug 15 03:21:27 H22RRMOBDB01 vxvm:vxconfigd: [ID 511694 daemon.error] V-5-1-8756 allow join for node 1 failed: Connection timed out
Aug 15 03:21:27 H22RRMOBDB01 vxvm:vxconfigd: [ID 448643 daemon.notice] V-5-1-3765 master: cluster join complete for node 1
Aug 15 03:21:27 H22RRMOBDB01 vxvm:vxconfigd: [ID 699813 daemon.notice] V-5-1-7899 CVM_VOLD_CHANGE command received
Aug 15 03:21:27 H22RRMOBDB01 vxvm:vxconfigd: [ID 322665 daemon.notice] V-5-1-7961 establishing cluster
Aug 15 03:21:27 H22RRMOBDB01 vxvm:vxconfigd: [ID 277465 daemon.notice] V-5-1-8062 master: not a cluster startup
Aug 15 03:24:42 H22RRMOBDB01 gab: [ID 316943 kern.notice] GAB INFO V-15-1-20036 Port w gen  143b11e membership 0
Aug 15 03:24:42 H22RRMOBDB01 gab: [ID 674723 kern.notice] GAB INFO V-15-1-20038 Port w gen  143b11e k_jeopardy ;1
Aug 15 03:24:42 H22RRMOBDB01 gab: [ID 513393 kern.notice] GAB INFO V-15-1-20040 Port w gen  143b11e    visible ;1
Aug 15 03:24:42 H22RRMOBDB01 gab: [ID 316943 kern.notice] GAB INFO V-15-1-20036 Port v gen  143b11c membership 0
Aug 15 03:24:42 H22RRMOBDB01 gab: [ID 674723 kern.notice] GAB INFO V-15-1-20038 Port v gen  143b11c k_jeopardy ;1
Aug 15 03:24:42 H22RRMOBDB01 gab: [ID 513393 kern.notice] GAB INFO V-15-1-20040 Port v gen  143b11c    visible ;1
Aug 15 03:24:42 H22RRMOBDB01 vxvm:vxconfigd: [ID 699813 daemon.notice] V-5-1-7899 CVM_VOLD_CHANGE command received
Aug 15 03:24:42 H22RRMOBDB01 vxvm:vxconfigd: [ID 778436 daemon.error] V-5-1-4109 -1 returned from volcvm_establish
Aug 15 03:24:42 H22RRMOBDB01 vxvm:vxconfigd: [ID 886039 daemon.error] V-5-1-4852 cluster_establish: timed out
Aug 15 03:24:42 H22RRMOBDB01 vxvm:vxconfigd: [ID 391371 daemon.error] V-5-1-11111 kernel_fail_join() : master_takeover is -1
Aug 15 03:24:42 H22RRMOBDB01 vxvm:vxconfigd: [ID 565473 daemon.notice] V-5-1-9543 Timeout is not reset: another reconfig in progress
Aug 15 03:24:42 H22RRMOBDB01 vxvm:vxconfigd: [ID 322665 daemon.notice] V-5-1-7961 establishing cluster
Aug 15 03:24:42 H22RRMOBDB01 vxvm:vxconfigd: [ID 277465 daemon.notice] V-5-1-8062 master: not a cluster startup
Aug 15 03:24:42 H22RRMOBDB01 vxvm:vxconfigd: [ID 451250 daemon.notice] V-5-1-8061 master: no joiners
Aug 15 03:24:42 H22RRMOBDB01 vxvm:vxconfigd: [ID 738708 daemon.notice] V-5-1-4123 cluster established successfully
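To check whether a system log shows the same pattern, the two timeout errors from the excerpt above can be filtered out of the log file. This is a minimal sketch rather than a supported tool: the function name cvm_join_errors is hypothetical, and the only message IDs it searches for (V-5-1-8756 and V-5-1-4852) are the ones that appear in this article.

```shell
# Hypothetical helper: print the vxconfigd join-timeout lines from a syslog file.
# The two message IDs are taken from the log excerpt in this article only;
# other releases may log different IDs.
cvm_join_errors() {
    logfile="${1:-/var/adm/messages}"
    # The trailing space after each ID avoids matching longer IDs with the same prefix.
    awk '/V-5-1-8756 / || /V-5-1-4852 /' "$logfile"
}
```

Running cvm_join_errors with no argument reads /var/adm/messages; a saved copy of the log can be passed as the first argument instead.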
 

 

 

Resolution

To address this issue, make sure that all nodes see the same number of paths to every disk/LUN in the shared disk groups.
 

Issue/Introduction

A CVM cluster join cannot be established unless the nodes are started in a specific order. This scenario occurs when the nodes within a cluster do not see the same number of paths to the shared disks.
For example, node A sees 2 paths to every disk in the data disk groups, whereas node B sees only 1 path to one of those disks. In the output below, the first listing (1 path) corresponds to node B and the second (2 paths) to node A.
 
# /usr/sbin/vxdmpadm getsubpaths dmpnodename=c2t0d6s2
NAME STATE PATH-TYPE[M] CTLR-NAME ENCLR-TYPE ENCLR-NAME ATTRS
================================================================================
c2t0d6s2 ENABLED - c2 HDS9960 HDS99600
 
# /usr/sbin/vxdmpadm getsubpaths dmpnodename=c2t0d6s2
NAME STATE PATH-TYPE[M] CTLR-NAME ENCLR-TYPE ENCLR-NAME ATTRS
================================================================================
c2t0d6s2 ENABLED - c2 HDS9960 HDS99600 -
c3t1d6s2 ENABLED - c3 HDS9960 HDS99600 -
 
In this example, this is the only LUN in the configuration whose path count differs between the two nodes.
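The comparison above can be automated once the command output from each node has been saved to a file. The following is a minimal sketch under stated assumptions: the helper names count_subpaths and compare_subpaths are hypothetical, and the parsing assumes the output layout shown in this article (a NAME header line, a line of '=' characters, then one line per path).

```shell
# Hypothetical helper: count the path rows in saved `vxdmpadm getsubpaths` output.
# Skips blank lines, the NAME header, the '=' separator, and '#' prompt lines.
count_subpaths() {
    awk 'NF > 0 && $1 != "NAME" && $1 !~ /^[=#]/ { n++ } END { print n + 0 }' "$1"
}

# Hypothetical helper: compare the path counts captured on two nodes.
# Returns 0 when the counts match and 1 when they differ.
compare_subpaths() {
    count_a=$(count_subpaths "$1")
    count_b=$(count_subpaths "$2")
    if [ "$count_a" -eq "$count_b" ]; then
        echo "OK: both nodes see $count_a path(s)"
    else
        echo "MISMATCH: $1 has $count_a path(s), $2 has $count_b path(s)"
        return 1
    fi
}
```

For example, after redirecting the output of /usr/sbin/vxdmpadm getsubpaths dmpnodename=c2t0d6s2 to a file on each node (the file names are placeholders), compare_subpaths nodeB.txt nodeA.txt would report a mismatch for the LUN shown above.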
 
This requirement is not explicitly documented, but it is implied in CVM / SFORAC environments. All nodes should have an equal number of HBAs (2 in this case) for proper CVM/DMP operation (i.e., failover and failback).
 
The reasoning is as follows. When node A starts first, it sees 2 paths to the disk and becomes the master. When node B then tries to join, it is expected to see 2 paths to the disk as well. Because it does not, the cluster join fails: fewer paths are visible to the joining slave node than to the master.
 
However, if node B starts first, it sees only 1 path to the disk and becomes the master. When node A then tries to join, it sees every path that node B sees (which is fine), plus the extra path that only node A can see. The cluster can therefore be formed, because the master (node B) knows nothing about the additional path.
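The start-order dependency described above can be illustrated with a toy model. This sketch only mirrors the explanation in this article and is not the actual vxconfigd join logic: the function can_join is hypothetical, and it assumes that path names are identical on both nodes, as in the example output.

```shell
# Toy model (hypothetical): a join succeeds only if the joining node sees
# every path that the master sees. Extra paths on the joiner are ignored.
# Arguments are space-separated lists of path names.
can_join() {
    master_paths="$1"
    joiner_paths="$2"
    for p in $master_paths; do
        case " $joiner_paths " in
            *" $p "*) ;;   # the joiner also sees this path
            *) echo "join fails: joiner is missing path $p"; return 1 ;;
        esac
    done
    echo "join succeeds"
}

# Node A (2 paths) is master, node B (1 path) joins.
can_join "c2t0d6s2 c3t1d6s2" "c2t0d6s2"
# Node B (1 path) is master, node A (2 paths) joins.
can_join "c2t0d6s2" "c2t0d6s2 c3t1d6s2"
```

The first call reports a missing path and the second succeeds, which matches the two start orders described above. The lasting fix remains the one given under Resolution: present the same paths to every node.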