VxFS Deadlock in page_lock_es in Solaris

Article ID: 100003116

Description

Error Message

A typical thread stack trace will show a VxFS thread hung in the Solaris kernel page_lock_es routine:

==== user (LWP_SYS) thread: 0x3000d7dfc60 PID: 6750 ====
cmd: /san/mail/ms/messaging64/lib/imapd
t_wchan: 0x70035c440c4 sobj: condition var (from unix:page_lock_es+0x1ec)
t_procp: 0x60083a205e8
  p_as: 0x6006279b6b0 size: 7702937600 RSS: 263528448
  hat: 0x3000f5e2b80 cnum: CPU0:224/1557
  cpusran:0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63
  zone: nk11r10mm-mail015a
t_stk: 0x2a1054b9ae0 sp: 0x2a1054b7921 t_stkbase: 0x2a1054b4000
t_pri: 60(TS) t_tid: 14809 pctcpu: 2.440211
t_lwp: 0x300550d0e58 machpcb: 0x2a1054b9ae0
  mstate: LMS_SLEEP ms_prev: LMS_KFAULT
  ms_state_start: 42 minutes 12.683740933 seconds earlier
  ms_start: 46 minutes 59.387839151 seconds earlier
psrset: 0 last CPU: 28
idle: 253077 ticks (42 minutes 10.77 seconds) <<<<<===== IDLE 42 MINUTES
start: Mon Aug 9 19:10:07 2010
age: 2817 seconds (46 minutes 57 seconds)
syscall: #4 write(, 0xffffffff25711981) (sysent: genunix:write+0x0)
tstate: TS_SLEEP - awaiting an event
tflg: T_DFLTSTK - stack is default size
tpflg: TP_TWAIT - wait to be freed by lwp_wait
  TP_MSACCT - collect micro-state accounting information
tsched: TS_LOAD - thread is in memory
  TS_DONT_SWAP - thread/LWP should not be swapped
pflag: SKILLED - SIGKILL has been posted to the process
  SJCTL - SIGCLD sent when children stop/continue
  SNOWAIT - children never become zombies
  SMSACCT - process is keeping micro-state accounting
  SMSFORK - child inherits micro-state accounting
pc: genunix:cv_wait+0x38: call unix:swtch
genunix:cv_wait+0x38(, 0x70049eec7c0, 0xbffffc00, 0x0, 0x1)
unix:page_lock_es+0x1ec(, 0x1, 0x188da98, 0x0)
unix:page_lookup_create+0x134(, , , 0x0, , 0x0)
vxfs:vx_page_lookup(0x600a82ed7c0, 0x1a000) - frame recycled
vxfs:vx_page_alloc+0x190(0x600a785b020, 0x1a000, 0x2a1054b8ad0, 0x2a1054b877c, 0x1, 0x0, , , , 0x6008cc5baa8, 0xffffffff3151a000, 0x0)
vxfs:vx_do_getpage+0x940(0x600a82ed7c0, 0x1a000, 0x2000, 0x2a1054b8abc, 0x2a1054b8ad0, 0x2000, 0x6008cc5baa8, 0xffffffff3151a000, , , 0x2a1054b877c, 0x60081d2d130, , 0x0)
vxfs:vx_getpage1+0x57c(0x600a82ed7c0, 0x1a000, , , , , , 0xffffffff3151a000, 0x1, 0x60081d2d130, 0x0)
vxfs:vx_getpage+0x4c(0x600a82ed7c0, 0x1a000, 0x60081d2d130, , 0xffffffff3151a000, 0x6008cc5baa8, , 0xffffffff3151a000, 0x1, 0x60081d2d130)
genunix:fop_getpage+0x44(, 0x1a000, 0x2000, 0x2a1054b8abc, 0x2a1054b8ad0, 0x2000, 0x6008cc5baa8)
genunix:segvn_fault+0xb00(0x3000f5e2b80, 0x6008cc5baa8, 0xffffffff3151a000, 0x2000, 0x0, 0x1)
genunix:as_fault+0x4c8(0x3000f5e2b80, 0x6006279b6b0, 0xffffffff3151a000, 0x1, 0x0, 0x1)
unix:pagefault+0x68(0xffffffff3151a000, 0x0, 0x1, 0x0)
unix:trap+0x914(, 0xffffffff3151a000)
unix:sfmmu_tsbmiss_exception(0x2a1054b8f00) - frame recycled
unix:ktl0+0x64()
-- trap data type: 0x31 (data access MMU miss) rp: 0x2a1054b8f00 --
pc: 0x1280bc0 SUNW,UltraSPARC-T2:ci_fqtr+0x1c: ldda [%l0] ASI_BLK_AIUS, %f16
npc: 0x1280bc4 SUNW,UltraSPARC-T2:ci_fqtr+0x20: faligndata %f14, %f16, %f48
  global: %g1 0x2a1054b8961 %g2 0xffffffff3151999c %g3 0xffff810d791024a0 %g4 0xb04 %g5 0x1281110 %g6 0x1 %g7 0x3000d7dfc60
  out: %o0 0xffff810d79102000 %o1 0x1 %o2 0x3a353020 %o3 0x300056fe700 %o4 0x640b3b %o5 0x700 %sp 0x2a1054b87a1 %o7 0x1c
  loc: %l0 0xffffffff3151a000 %l1 0x62a3337 %l2 0x7 %l3 0x18bf598 %l4 0x9664 %l5 0x70035c44080 %l6 0x18611b0 %l7 0x1
  in: %i0 0xffff810d79102b00 %i1 0xffffffff3151a47c %i2 0x24 %i3 0x480 %i4 0x127fd24 %i5 0xffff810d791024a0 %fp 0x2a1054b8961 %i7 0x1140b9c
SUNW,UltraSPARC-T2:ci_fqtr+0x1c()
SUNW,UltraSPARC-T2:xcopyin(0xffffffff3151999c, 0xffff810d791024a0, 0xb04) - frame recycled
genunix:uiomove+0xa8(, , 0x1, 0x2a1054b9a98)
vxfs:vx_uiomove(, 0x4a0, 0xb04, 0x1, 0x2a1054b9a98) - frame recycled
vxfs:vx_write_default+0x670(0x600a785b020, 0x2a1054b9a98, 0x1a4a0, 0x1afa4, 0x2, 0x0, , , 0x0)
vxfs:vx_write1+0xe8c(0x600a82ed7c0, 0x2a1054b9a98, 0x0, 0x60081d2d130, 0x1, 0x6006e7fc540)
vxfs:vx_write_common_slow+0x710(0x600a82ed7c0, 0x2a1054b9a98, 0x0, 0x0, 0x60081d2d130, 0x6007efc1980)
vxfs:vx_write_common+0x508(0x600a82ed7c0, 0x2a1054b9a98, 0x0, 0x0, 0x0, 0x0, 0x60081d2d130)
vxfs:vx_write+0x28(0x600a82ed7c0, 0x2a1054b9a98, 0x0, 0x60081d2d130, 0x0, 0x8000)
genunix:fop_write+0x20(0x600a82ed7c0, 0x2a1054b9a98, 0x0, 0x60081d2d130, 0x0)
genunix:write+0x268(0xf4a)
unix:syscall_trap+0xac()
-- switch to user thread's user stack --

Cause

The locking code in vx_page_alloc exists because of the interaction of two
features: ILASTPAGEZEROED and VMODSORT.

When the end of a file doesn't land on a page boundary, the portion past the
end of file needs to be zeroed when the page is mapped. The simplest way to
do this is to always zero past end of file when the file is loaded. However,
if the page is never mapped, the page can never be accessed past end of file,
so the zeroing is not needed.
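
The arithmetic behind "the portion past the end of file" can be sketched as follows. This is an illustrative helper only, not a VxFS routine; the function name and the choice to pass the page size as a parameter are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of VxFS: given a file size and a page
 * size, return how many bytes at the tail of the last page lie past end
 * of file and would need zeroing before that page is mapped. */
size_t tail_bytes_to_zero(uint64_t file_size, size_t page_size)
{
    /* Bytes of real file data held in the last page. */
    size_t used = (size_t)(file_size % page_size);

    if (used == 0)
        return 0;            /* EOF is on a page boundary: nothing to zero */

    return page_size - used; /* the rest of the page is past EOF */
}
```

For a 10000-byte file and an 8192-byte page, the last page holds 1808 bytes of data, so 6384 bytes would have to be zeroed.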

ILASTPAGEZEROED is a VxFS optimization that delays zeroing past end of file
for as long as possible. VxFS does the zeroing in a few places, but the
relevant part is that VxFS will zero the page on the first page fault of
the last page through a user mapping (e.g. segvn).

VMODSORT is an optimization Sun added to Solaris where a vnode's page list is
sorted so that all of the dirty pages are on one end of the list. Because of the
way this is implemented, if we're going to be writing into a page during a page
fault, we need to lock the page exclusively.

In simple terms, the code in vx_page_alloc says that if VxFS found the page in
the page cache already, it's the last page of the file, end of file is in the
middle of the page, the inode is mapped, and we're faulting through a user
mapping, then VxFS needs to lock the page exclusively because it will be zeroed
later in getpage.

The bug occurs when writing to a file that is also mmapped. If an application
mmaps the last page of the file (with end of file in the middle of that page)
and uses that mapping as the source buffer for a write to the end of the same
file, then the source and destination pages are the same. The application code
would look something like:

addr = mmap(..., fd, )
lseek(fd, 0, SEEK_END)
write(fd, addr, 1)

During the write, VxFS will first fault in the destination page. Although it's
the last page of the file, we're not faulting through the user mapping at this
point, so getpage will not zero the page and VxFS will only lock it shared.
Later, uiomove will fault the source page. This time, VxFS is faulting through
the user mapping, so VxFS will need to lock the page exclusively. Since the
source and destination pages are the same, and VxFS has already locked the
destination page in shared mode, VxFS can't get the exclusive lock on the
page, thus the deadlock.
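
A fuller user-level sketch of the access pattern described above might look like this. It is an assumption-laden illustration, not the original reproducer: the function name, the PROT_READ and MAP_SHARED flags, the one-page mapping length, and the offset calculation are all choices made for the example, since the original snippet elides the mmap arguments. The mapping is deliberately not touched before the write, so the source page is first faulted from inside the write call.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: append one byte to `path`, using a mapping of the
 * file's own last page as the source buffer. Returns the result of
 * write(), or -1 on any error. On an affected VxFS release the write()
 * call may never return. */
ssize_t append_from_last_page(const char *path)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;

    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0 || st.st_size % page == 0) {
        close(fd);               /* the pattern needs EOF in mid-page */
        return -1;
    }

    /* File offset at which the last page starts. */
    off_t last = (st.st_size - 1) / page * page;

    /* Flags and length are assumptions; the original elides them. */
    char *addr = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fd, last);
    if (addr == MAP_FAILED) {
        close(fd);
        return -1;
    }

    ssize_t n = -1;
    if (lseek(fd, 0, SEEK_END) != (off_t)-1)
        n = write(fd, addr, 1);  /* source and destination share a page */

    munmap(addr, (size_t)page);
    close(fd);
    return n;
}
```

On a file system without this bug the call simply returns 1 and the file grows by one byte.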

The fix removes the requirement that the page has to be faulted through the user
mapping to zero it. With that change, the page will end up getting zeroed when
the write faults in the destination page. After it zeroes the end of the page,
it downgrades the lock to shared. When uiomove faults in the source page, it's
already been zeroed, so it doesn't need to lock the page exclusively, avoiding
the deadlock.
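
The lock sequence before and after the fix can be modelled with a toy shared/exclusive lock. Everything in this sketch is hypothetical: the structure, the function names, and the non-blocking "try" semantics stand in for the real Solaris page lock and VxFS getpage logic, which are not shown in this article. A lock attempt returns false where the real kernel thread would sleep in page_lock_es.

```c
#include <stdbool.h>

/* Toy model of one page's lock state (illustrative names only). */
struct page_model {
    int  shared;       /* number of shared holders */
    bool excl;         /* true while held exclusively */
    bool tail_zeroed;  /* true once the bytes past EOF are zeroed */
};

/* Non-blocking attempts: false means the real code would block. */
bool try_shared(struct page_model *p)
{
    if (p->excl)
        return false;
    p->shared++;
    return true;
}

bool try_excl(struct page_model *p)
{
    if (p->excl || p->shared > 0)
        return false;
    p->excl = true;
    return true;
}

void downgrade(struct page_model *p)
{
    p->excl = false;
    p->shared = 1;
}

/* One fault on the last page. `user_fault` is true when the fault comes
 * through a user mapping; `fixed` selects the post-fix rule, which drops
 * the user-mapping requirement for zeroing. */
bool fault_last_page(struct page_model *p, bool user_fault, bool fixed)
{
    bool must_zero = !p->tail_zeroed && (fixed || user_fault);

    if (!must_zero)
        return try_shared(p);
    if (!try_excl(p))
        return false;          /* would wait forever on its own shared lock */
    p->tail_zeroed = true;
    downgrade(p);
    return true;
}

/* The write path: destination fault first, then the uiomove source fault
 * on the same page. Returns false if the second lock cannot be granted. */
bool write_from_own_mapping(bool fixed)
{
    struct page_model p = {0, false, false};

    if (!fault_last_page(&p, false, fixed))   /* destination page */
        return false;
    return fault_last_page(&p, true, fixed);  /* source page, user mapping */
}
```

In this model write_from_own_mapping(false) returns false, because the exclusive request collides with the shared lock the same thread already holds, while write_from_own_mapping(true) returns true, because the page was zeroed and downgraded during the destination fault.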
 

Resolution

A fix has been developed in VxFS 5.1RP1HF6. The first generally available release to include this fix is 5.1SP1. The SYMC Etrack for this issue is 2120692.

 

Applies To

Solaris environments with VxFS and VMODSORT support.

Issue/Introduction

Access to a file in a VxFS file system can hang due to a deadlock bug in VxFS.