One possible scenario:
- The environment was upgraded to VxFS 6.1.1.100, and the system froze when running the "ls -l" command.
- The ls command had been hung for over 1 minute when a crash dump was taken.
- Review of the crash dump shows that "ls -l" is waiting for an inode rw lock in shared mode. The lock is currently held by another thread, which is spinning.

Sample stack trace:
PID: 26248 TASK: ffff8107d80fa040 CPU: 13 COMMAND: "ls"
#0 [ffff8107d9c71ac8] schedule at ffffffff80062fa0
#1 [ffff8107d9c71ba0] vx_svar_sleep_unlock at ffffffff886f1a53 [vxfs]
#2 [ffff8107d9c71bf0] vx_rwsleep_rec_lock at ffffffff886dad8b [vxfs]
#3 [ffff8107d9c71c10] vx_recsmp_rangelock at ffffffff886a6acb [vxfs]
#4 [ffff8107d9c71c20] vx_irwlock at ffffffff886cc508 [vxfs]
#5 [ffff8107d9c71c50] vx_linux_getxattr at ffffffff8872c2e1 [vxfs]
#6 [ffff8107d9c71d50] vfs_getxattr at ffffffff800f7567
#7 [ffff8107d9c71d90] getxattr at ffffffff800f7651
#8 [ffff8107d9c71ec0] sys_getxattr at ffffffff800f7751
#9 [ffff8107d9c71f80] tracesys at ffffffff8005d29e (via system_call)
RIP: 00002b09b16569c9 RSP: 00007fffcb6e0c38 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: ffffffff8005d29e RCX: ffffffffffffffff
RDX: 0000000000000000 RSI: 00002b09b116ad17 RDI: 00007fffcb6e0c60
RBP: 000000001b8268e8 R8: 0000000000000000 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000001b8268e0
R13: 00007fffcb6e1030 R14: 00007fffcb6e0c60 R15: 0000000000000000
ORIG_RAX: 00000000000000bf CS: 0033 SS: 002b
On later versions of the Linux kernel, a thread must not generate page-faults while holding page locks, because this can cause a deadlock.
Because of this, VxFS pre-faults the source pages for a write system call which is using buffered-IO.
For a POSIX thread, which shares its address-space with other threads, the pre-faulting in VxFS uses the kernel’s get_user_pages().
The behaviour of get_user_pages() on RHEL5 causes the later copy of data into the file’s page-cache to enter an infinite loop in VxFS when the source pages are anonymous memory buffers.
Anonymous buffers may come from malloc(), or the application creating an anonymous mapping via mmap().
Although the restriction on page-faults while holding page-locks does not apply to RHEL5, the VxFS version 6.1.1.100 build for RHEL5 incorrectly disallowed page-faults.
The incorrect build, combined with RHEL5’s get_user_pages() behaviour, is the cause of the hang.
NOTE: A thread usually populates the contents of a buffer before passing it to the write system call. When it does, the issue is not hit.
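The access pattern described above can be illustrated with a minimal sketch in C. It is not a confirmed reproducer: the function name run_write_job, the write_job structure and the choice of mmap() over malloc() are assumptions made for illustration only. A POSIX thread, sharing its address-space with the main thread, issues a buffered write() from an anonymous buffer. The populate flag selects whether the buffer is filled before the write, which is the case the note above says does not hit the issue.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical job description, used only by this sketch. */
struct write_job {
    const char *path;  /* target file; on a test system this would be on a VxFS mount */
    size_t len;        /* size of the anonymous buffer in bytes */
    int populate;      /* non-zero: fill the buffer before calling write() */
    ssize_t written;   /* result: bytes written, or -1 on error */
};

static void *writer_thread(void *arg)
{
    struct write_job *job = arg;
    job->written = -1;

    /* Anonymous mapping: its pages are not faulted in until first touched. */
    char *buf = mmap(NULL, job->len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;

    /* Populating the buffer touches every page before the write. */
    if (job->populate)
        memset(buf, 'x', job->len);

    /* No O_DIRECT flag, so this is a buffered write. */
    int fd = open(job->path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd >= 0) {
        job->written = write(fd, buf, job->len);
        close(fd);
    }

    munmap(buf, job->len);
    return NULL;
}

/* Runs the write in a second thread so that the writer shares its
 * address-space with another thread, as in the scenario described. */
ssize_t run_write_job(const char *path, size_t len, int populate)
{
    assert(len > 0);

    struct write_job job = { path, len, populate, -1 };
    pthread_t tid;

    if (pthread_create(&tid, NULL, writer_thread, &job) != 0)
        return -1;
    pthread_join(tid, NULL);
    return job.written;
}
```

Calling run_write_job(path, len, 0) corresponds to the unpopulated anonymous buffer case, and run_write_job(path, len, 1) to the populated case. On an unaffected file system both calls are expected to return normally.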
There is no workaround other than to update the VxFS package. The fix is available in the VxFS Private hot-fix 6.1.1.103 for RHEL 5.x.
Please contact Veritas support if you require this hot-fix.
Applies To
The issue is specific to RHEL5 and VxFS version 6.1.1.100.