System panic at vx_naio_do_work on Linux platform running VxFS

Article ID: 100028007

Cause

The problem is caused by the etrack incident listed in the Additional Information section of this article. A description of the problem follows.

SYMPTOM:
The system panics when Control-C is pressed while aio-stress is running.

DESCRIPTION:
The test program uses POSIX threads, which share the same mm_struct in the kernel. The function exit_aio() is (correctly) called only when the last pthread exits, from mmput() in ./kernel/fork.c:
if (atomic_dec_and_test(&mm->mm_users)) {
exit_aio(mm);
...

Therefore, other pthreads that have submitted AIO can exit without waiting for their IO to complete. As there is no synchronisation between the exiting pthreads and VxFS, vx_naio_do_work() can dereference the task structure of an exited thread:

fsizelim = VX_GETU_RLIMIT_FSIZE_TASK(nwip->nwi_tsk);

which causes the panic. If the race were a little slower, we'd panic further down, in VxFS's uiomove code.

The correct way to handle this is for the IO to take a hold on the mm_struct (incrementing mm->mm_users), but GPL export restrictions mean we could not drop the hold, because mmput() is exported with EXPORT_SYMBOL_GPL.
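
The reference-counting idea described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not kernel or VxFS code: the names model_mm, model_mm_init, model_mm_hold and model_mm_put are hypothetical, and a boolean flag stands in for the work done by exit_aio().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the kernel's mm_struct. */
struct model_mm {
    atomic_int users;    /* plays the role of mm->mm_users */
    bool aio_torn_down;  /* set when the modelled exit_aio() runs */
};

/* Start with one user per thread sharing the address space. */
static void model_mm_init(struct model_mm *mm, int nthreads)
{
    atomic_init(&mm->users, nthreads);
    mm->aio_torn_down = false;
}

/* Take a hold, as an in-flight IO would under this scheme. */
static void model_mm_hold(struct model_mm *mm)
{
    atomic_fetch_add(&mm->users, 1);
}

/* Drop a hold; only the last user triggers the AIO teardown. */
static void model_mm_put(struct model_mm *mm)
{
    int prev = atomic_fetch_sub(&mm->users, 1);

    assert(prev > 0);
    if (prev == 1)
        mm->aio_torn_down = true;
}
```

In this model, if an in-flight IO holds a reference, every thread can call model_mm_put() and the teardown flag stays false until the IO drops its own hold. That is the ordering the panic violates.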


RESOLUTION:
The fix uses two fields in the task structure: an exit hook (->tux_exit) that is called regardless of any pthreads (that is, threads cloned with CLONE_VM), and a counter (->tux_info) of the number of outstanding IOs against a thread.
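
The combination of an exit hook and an outstanding-IO counter can be sketched in user space with pthreads. This is a hedged illustration only: the article does not say what the hook does, so the sketch assumes it blocks until the counter reaches zero. All names (model_task, model_io_submit, model_io_done, model_task_exit_hook, model_io_worker) are hypothetical and do not come from VxFS.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for the per-task state used by the fix. */
struct model_task {
    pthread_mutex_t lock;
    pthread_cond_t drained;
    int outstanding_io;      /* plays the role of the ->tux_info counter */
};

static void model_task_init(struct model_task *t)
{
    pthread_mutex_init(&t->lock, NULL);
    pthread_cond_init(&t->drained, NULL);
    t->outstanding_io = 0;
}

/* Called when the thread submits an IO. */
static void model_io_submit(struct model_task *t)
{
    pthread_mutex_lock(&t->lock);
    t->outstanding_io++;
    pthread_mutex_unlock(&t->lock);
}

/* Called by the IO worker when an IO against this thread completes. */
static void model_io_done(struct model_task *t)
{
    pthread_mutex_lock(&t->lock);
    assert(t->outstanding_io > 0);
    if (--t->outstanding_io == 0)
        pthread_cond_broadcast(&t->drained);
    pthread_mutex_unlock(&t->lock);
}

/* Plays the role of the ->tux_exit hook (assumed behaviour):
 * block the exiting thread until all of its IOs have completed. */
static void model_task_exit_hook(struct model_task *t)
{
    pthread_mutex_lock(&t->lock);
    while (t->outstanding_io > 0)
        pthread_cond_wait(&t->drained, &t->lock);
    pthread_mutex_unlock(&t->lock);
}

/* Worker that completes a single outstanding IO for the given task. */
static void *model_io_worker(void *arg)
{
    model_io_done(arg);
    return NULL;
}
```

Because the hook runs for every exiting thread rather than only the last one, a thread in this model cannot finish exiting while a worker may still touch its state.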

 

Resolution

Upgrade to Veritas Storage Foundation 5.1SP1 to fix the problem.

The required patch can be downloaded from the Veritas Services and Operations Readiness Tools (SORT) website:

https://sort.Veritas.com/patch/matrix


Applies To

The problem affects only VxFS running on the Linux platform. It does not affect other platforms.

Issue/Introduction

System panic at vx_naio_do_work on Linux platform running VxFS with the following stack:

PID: 11509  TASK: ffff810037ebf040  CPU: 53  COMMAND: "vx_naio_worker"
 #0 [ffff81003831fb60] crash_kexec at ffffffff800ada85
 #1 [ffff81003831fc20] __die at ffffffff80065157
 #2 [ffff81003831fc60] do_page_fault at ffffffff80066dd7
 #3 [ffff81003831fd50] error_exit at ffffffff8005dde9
    [exception RIP: vx_naio_do_work+162]          << vx_naio_do_work
 #4 [ffff81003831fec8] vx_naio_worker at ffffffff8850b608
 #5 [ffff81003831fee8] vx_kthread_init at ffffffff88519990
 #6 [ffff81003831ff48] kernel_thread at ffffffff8005dfb1

Additional Information

ETrack: 2080276