VVR: VxVM 6.2.1 VVR Primary node may panic when sending data (nmcom_throttle_send) to the Secondary node as a result of accessing already freed memory

Article ID: 100045756

Description

Error Message


Node panic due to "unable to handle kernel paging request at ffff88332e8b552c"


      KERNEL: vmlinux.2.6.32-696.18.7.el6.x86_64
    DUMPFILE: vmcore-190328131922.  [PARTIAL DUMP]
        CPUS: 12
        DATE: Tue Mar 26 18:24:35 2019
      UPTIME: 82 days, 00:59:04
LOAD AVERAGE: 0.25, 0.38, 0.36
       TASKS: 2129
    NODENAME: ###########
     RELEASE: 2.6.32-696.18.7.el6.x86_64
     VERSION: #1 SMP Thu Dec 28 20:15:47 EST 2017
     MACHINE: x86_64  (3491 Mhz)
      MEMORY: 64 GB
       PANIC: "BUG: unable to handle kernel paging request at ffff88332e8b552c"
         PID: 57635
     COMMAND: "nmcom-sender"
        TASK: ffff8803a50dcab0  [THREAD_INFO: ffff88096e178000]
         CPU: 3
       STATE: TASK_RUNNING (PANIC)
 

The backtrace (bt) output for task "ffff8803a50dcab0" shows that the "nmcom-sender" kernel thread was executing nmcom_throttle_send when the page fault occurred.
        
crash> bt
PID: 57635  TASK: ffff8803a50dcab0  CPU: 3   COMMAND: "nmcom-sender"
 #0 [ffff88096e17ba00] machine_kexec at ffffffff8103eb3b
 #1 [ffff88096e17ba60] crash_kexec at ffffffff810d2772
 #2 [ffff88096e17bb30] oops_end at ffffffff81550570
 #3 [ffff88096e17bb60] no_context at ffffffff810515eb
 #4 [ffff88096e17bbb0] __bad_area_nosemaphore at ffffffff81051875
 #5 [ffff88096e17bc00] bad_area_nosemaphore at ffffffff81051943
 #6 [ffff88096e17bc10] __do_page_fault at ffffffff81052100
 #7 [ffff88096e17bd30] do_page_fault at ffffffff815524fe
 #8 [ffff88096e17bd60] page_fault at ffffffff8154f365
    [exception RIP: nmcom_throttle_send+494]
    RIP: ffffffffa09fbe5e  RSP: ffff88096e17be10  RFLAGS: 00010086
    RAX: ffff88332e8b5400  RBX: ffff88096e7a3710  RCX: 000000000000cc83
    RDX: 0000000200000001  RSI: 0000000000000246  RDI: ffff88096e7a3710
    RBP: ffff88096e17be50   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000001
    R13: ffff88096e7a3400  R14: ffff880bba33d000  R15: ffff880b7a2cd000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff88096e17be58] nmcom_sender at ffffffffa09fc2e2 [vxio]
#10 [ffff88096e17bee8] kthread at ffffffff810a6d0e
#11 [ffff88096e17bf48] kernel_thread at ffffffff81557afa


crash> ps 57635
   PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
> 57635      2   3  ffff8803a50dcab0  RU   0.0       0      0  [nmcom-sender]
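
As a reading aid for the register dump, the short Python sketch below computes how far the faulting address in the PANIC string lies from the value held in RAX. Both values are copied from the crash output; the helper name field_offset is hypothetical. That the faulting instruction used RAX as its base pointer is an assumption, not something confirmed by disassembly.

```python
# Values copied from the crash output in this article.
FAULT_ADDR = 0xFFFF88332E8B552C   # address reported in the PANIC string
RAX = 0xFFFF88332E8B5400          # RAX at the exception RIP

def field_offset(fault_addr, base):
    """Return the byte distance of the faulting address from a candidate base pointer."""
    return fault_addr - base

# Assumption: RAX held the base pointer of the structure being read.
offset = field_offset(FAULT_ADDR, RAX)
print(hex(offset))  # 0x12c
```

An offset of 0x12c (300 bytes) would be consistent with reading a field inside a structure addressed through RAX, which matches the description of a panic while accessing internal VVR structures.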
 

Cause


After sending data from the VVR (Veritas Volume Replicator) Primary server to the Secondary server, the sender code accessed fields in memory that had already been released (freed), because the data ACK for that send had already been processed.

This is a rare race condition: the panic occurs only when the ACK is processed, and the memory released, before the sender has finished accessing it.
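
The general pattern behind this kind of race can be illustrated with a minimal Python sketch. It is not the VVR source code and not the Veritas fix; every name in it (Message, send, unsafe_sender, safe_sender) is hypothetical. The pretend transport processes the ACK before send() returns, which stands in for the unlucky timing of the race, and a Python exception stands in for the kernel page fault.

```python
class UseAfterFree(Exception):
    """Raised when a released message is touched (stands in for the page fault)."""


class Message:
    """Hypothetical stand-in for a per-message replication structure."""

    def __init__(self, size):
        self._size = size
        self._freed = False

    @property
    def size(self):
        if self._freed:
            raise UseAfterFree("message accessed after release")
        return self._size

    def release(self):
        # Called when the ACK for this message has been processed.
        self._freed = True


def send(msg, on_ack):
    """Pretend transport: the ACK arrives and is processed before send() returns."""
    on_ack(msg)


def unsafe_sender(msg, stats):
    send(msg, on_ack=Message.release)
    stats["bytes_sent"] += msg.size   # reads the message after it may be released


def safe_sender(msg, stats):
    size = msg.size                   # copy what is needed before the hand-off
    send(msg, on_ack=Message.release)
    stats["bytes_sent"] += size       # no access to msg after send()
```

In this sketch, safe_sender(Message(512), stats) updates the counter normally, while unsafe_sender(Message(512), stats) raises UseAfterFree. Copying the required fields before the hand-off is only one common way to avoid such an access; holding a reference until the sender is done is another.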

 

Resolution


Veritas engineering has identified the source code responsible for the invalid memory access.

Code changes have been made to prevent access to the memory after it has been released.


Please contact Veritas Technical Support to obtain the private hot-fix VRTSvxvm-6.2.1.8202-RHEL6.x86_64.

As the issue was identified only recently (June 2019), other product versions do not contain a fix at this time.
 

Reference Escalation: STESC-2900

Issue/Introduction

The VVR (Veritas Volume Replicator) Primary node may panic when sending data (nmcom_throttle_send) to the Secondary node as a result of accessing already freed memory. The panic occurs while accessing internal VVR structures.