In an InfoScale FSS environment where LLT links are configured over RDMA, the CVM slave node panics whilst joining the cluster with the CVM master.

Article ID: 100074207

Description

Error Message

The panic can occur on any running thread, but the system typically crashes with "BUG: unable to handle kernel paging request" or "general protection fault: 0000 [#1] SMP". The faulting stack may involve the kmem_cache family of functions.

Panic stacks such as the following may be observed in different drivers, suggesting that memory/slab corruption has occurred.

For example:
PID: 21534    TASK: ff1e379a42020000  CPU: 6    COMMAND: "sh"
 #0 [ff4ed83bf6503980] machine_kexec at ffffffffa8e6c1f3
 #1 [ff4ed83bf65039d8] __crash_kexec at ffffffffa8fb59aa
 #2 [ff4ed83bf6503a98] crash_kexec at ffffffffa8fb68e1
 #3 [ff4ed83bf6503ab0] oops_end at ffffffffa8e2a9c1
 #4 [ff4ed83bf6503ad0] do_general_protection at ffffffffa8e274a5
 #5 [ff4ed83bf6503b60] general_protection at ffffffffa9a0113e
    [exception RIP: kmem_cache_alloc+218]
    RIP: ffffffffa9129eba  RSP: ff4ed83bf6503c18  RFLAGS: 00010286
    RAX: 967b3762439c35ea  RBX: 00000000006000c0  RCX: 967b3762439c365a
    RDX: 000000000002072b  RSI: 00000000006000c0  RDI: 0000000000039bf0
    RBP: ff1e375980b964c0   R8: ff1e3798002f9bf0   R9: 0000000000000000
    R10: ff1e3762431500e8  R11: 0000000000000000  R12: 00000000006000c0
    R13: ffffffffa8ef366a  R14: ff1e379a3cd065a0  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #6 [ff4ed83bf6503c58] vm_area_dup at ffffffffa8ef366a
 #7 [ff4ed83bf6503c68] __split_vma at ffffffffa90e5e19
 #8 [ff4ed83bf6503c98] __do_munmap at ffffffffa90e609f
 #9 [ff4ed83bf6503cf0] __vm_munmap at ffffffffa90e64e8


Another example:
PID: 20577    TASK: ff44bde40e838000  CPU: 4    COMMAND: "mh_driver.pl"
 #0 [ff66a1a7b710b860] machine_kexec at ffffffff83c6c1f3
 #1 [ff66a1a7b710b8b8] __crash_kexec at ffffffff83db59aa
 #2 [ff66a1a7b710b978] crash_kexec at ffffffff83db68e1
 #3 [ff66a1a7b710b990] oops_end at ffffffff83c2a9c1
 #4 [ff66a1a7b710b9b0] no_context at ffffffff83c7e913
 #5 [ff66a1a7b710ba08] __bad_area_nosemaphore at ffffffff83c7ec8c
 #6 [ff66a1a7b710ba50] do_page_fault at ffffffff83c7f8a7
 #7 [ff66a1a7b710ba80] page_fault at ffffffff8480116e
    [exception RIP: unmap_page_range+2246]
    RIP: ffffffff83edba86  RSP: ff66a1a7b710bb30  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: ffbaa2a05dddfbc0  RCX: 0000000000000000
    RDX: ff44be2711102000  RSI: 0000000005000000  RDI: ff44be1b34e61e00
    RBP: ff44be2711102000   R8: ffa33a6676766f88   R9: ff44be563ffd2000
    R10: 0000000000000000  R11: ffffffffffffffff  R12: 0000000005000000
    R13: ff66a1a7b710bc78  R14: 0000000000000000  R15: 0000000005001000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ff66a1a7b710bc10] unmap_vmas at ffffffff83edc420
 #9 [ff66a1a7b710bc70] exit_mmap at ffffffff83ee665d

 

Cause

During the installation of an InfoScale update patch, the /etc/sysconfig/llt file was restored on the first node (CVM master) but not on the second node (CVM slave). This left the LLT/RDMA tunable settings inconsistent between the nodes.
 

Resolution

Compare the contents of the /etc/sysconfig/llt file on the nodes and make sure the same tunable settings are in place.

For example, if the following two lines are present in the /etc/sysconfig/llt file on the CVM master node but missing on the CVM slave node, add them to the file on the CVM slave node, either by editing the file or by copying it from the CVM master node:

LLT_MAXADVBUFS=4000
LLT_ADVBUF_SIZE=8192

Once this inconsistency is corrected, the CVM slave node should join the cluster successfully.
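
The comparison described above can be sketched in bash as follows. This is a minimal illustration, not an InfoScale utility: the function names active_settings and compare_settings are hypothetical, and it assumes the CVM slave node's file has already been copied to the CVM master node under an arbitrary path such as /tmp/llt.slave. It compares only the active KEY=VALUE lines, so differences in comments or line order are ignored.

```shell
#!/bin/bash
# Print the active KEY=VALUE settings of a sysconfig-style file, sorted,
# with comment lines and blank lines dropped.
active_settings() {
    grep -E '^[[:space:]]*[A-Za-z_][A-Za-z0-9_]*=' "$1" \
        | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
        | sort
}

# Compare the active settings in two copies of the same file.
# Returns 0 if they match; otherwise prints a unified diff and returns 1.
compare_settings() {
    diff -u <(active_settings "$1") <(active_settings "$2")
}
```

For instance, `compare_settings /etc/sysconfig/llt /tmp/llt.slave` run on the CVM master node would list a tunable such as LLT_MAXADVBUFS with a leading "-" if it is set on the master but absent from the slave's copy.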

Please also check that the files below are consistent between the CVM master and CVM slave nodes:

/etc/sysconfig/gab
/etc/sysconfig/vxfen
/etc/sysconfig/vcs
/etc/sysconfig/amf
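
One quick way to spot differences across several files is to print a checksum for each file on every node and compare the output. The sketch below assumes sha256sum is available on the nodes; the function name checksum_files is hypothetical. A mismatched checksum only flags a file for closer inspection, since it also reacts to differences in comments or whitespace.

```shell
#!/bin/bash
# Print a SHA-256 checksum line for each readable file given as an argument,
# and a "MISSING" line for any file that does not exist or cannot be read.
checksum_files() {
    local f
    for f in "$@"; do
        if [ -r "$f" ]; then
            sha256sum "$f"
        else
            echo "MISSING  $f"
        fi
    done
}

# Run on each node, then compare the output between nodes.
checksum_files /etc/sysconfig/llt /etc/sysconfig/gab /etc/sysconfig/vxfen \
               /etc/sysconfig/vcs /etc/sysconfig/amf
```

Any file whose checksum differs between the CVM master and CVM slave nodes, or that is reported as missing on one of them, can then be compared line by line.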

Arctera is working to address this issue.

Additional Information

JIRA: STESC-9526