vxconfigd stuck in biowait state for a long time on an InfoScale 7.4.1/Solaris 11.4 cluster with EMC Symmetrix 5977 and SunFire F80 storage, where MPxIO is used for multipathing.


Article ID: 100049629



Description

Error Message


Jan 19 17:45:06 systemA      /scsi_vhci/disk@g5002361004752310 (sd0): Command Timeout on path mpt_sas23/disk@w5002361004752310,0: 198311c4af2b8c19
Jan 19 17:45:13 systemA      /scsi_vhci/disk@g5002361004770660 (sd9): Command Timeout on path mpt_sas26/disk@w5002361004770660,0: 177040bd156f5935
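
When many such messages are logged, it can help to tally them by device instance and path. The following is a minimal sketch that assumes log lines have exactly the shape of the two samples above; real syslog output may carry extra fields, in which case the pattern would need adjusting. The names `TIMEOUT_RE`, `parse_timeout` and `count_timeouts_by_path` are illustrative and not part of any Veritas or Oracle tool.

```python
import re

# Pattern modelled only on the two sample messages in this article (assumption).
TIMEOUT_RE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+"
    r"(?P<device>/scsi_vhci/\S+)\s+"
    r"\((?P<instance>\w+)\):\s+"
    r"Command Timeout on path\s+(?P<path>[^:]+):"
)


def parse_timeout(line):
    """Return a dict of fields for a Command Timeout line, or None if no match."""
    match = TIMEOUT_RE.match(line.strip())
    return match.groupdict() if match else None


def count_timeouts_by_path(lines):
    """Count Command Timeout messages per (device instance, path) pair."""
    counts = {}
    for line in lines:
        record = parse_timeout(line)
        if record:
            key = (record["instance"], record["path"])
            counts[key] = counts.get(key, 0) + 1
    return counts


if __name__ == "__main__":
    sample = [
        "Jan 19 17:45:06 systemA      /scsi_vhci/disk@g5002361004752310 (sd0): "
        "Command Timeout on path mpt_sas23/disk@w5002361004752310,0: 198311c4af2b8c19",
        "Jan 19 17:45:13 systemA      /scsi_vhci/disk@g5002361004770660 (sd9): "
        "Command Timeout on path mpt_sas26/disk@w5002361004770660,0: 177040bd156f5935",
    ]
    for key, count in count_timeouts_by_path(sample).items():
        print(key, count)
```

Repeated timeouts against the same instance and path would show up as a count greater than one, which may help when passing details to Oracle.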
 

Cause

Changes introduced in Solaris 11.4 appear to alter the handling of MHIOCSTATUS (cmd 4d04): a zero-length SCSI WRITE is now issued to the array to check access.

The vxconfigd stack shows that Veritas successfully passed the ldi_handle and cmd 4d04, along with the other arguments, and that the Oracle (Solaris) code then initiated a SCSI WRITE (0x2a) to handle cmd 4d04.

When dealing with a reserved device (for storage devices/arrays that are SPC-3 compliant), a TUR (Test Unit Ready) is issued first and is expected to succeed. A WRITE(10) with zero LBA is then issued to check the access rights, and this command is also expected to succeed. The command is retried until the storage returns the expected response (i.e. it succeeds), so the thread can remain stuck in biowait for quite some time.

 

 

Resolution

Oracle confirmed that the NULL write I/Os were occurring on the SunFire F80 storage and that the issue was related to Oracle Bug 30748237, which was introduced in Solaris 11.4.16.4.0 and is fixed in Solaris 11.4.21.69.0 or later. The bug is documented in the following alert:

 

Solaris 11.4 System I/O Failure Due To Solaris I/O Multipathing (MPxIO/scsi_vhci) Retry Counter Exhaustion For Asymmetric Storage (Doc ID 2652657.1)

 

If such a hang is encountered, Veritas recommends engaging Oracle to confirm whether it has the same root cause.
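
As a quick first check before engaging Oracle, the release range quoted in this article for Oracle Bug 30748237 can be compared against a host's Solaris version. The sketch below assumes the version is available as a dotted string in the same form used in this article (for example 11.4.21.69.0); the function names and the sample versions in the loop are illustrative only. A result of True means the version falls in the quoted range, not that the hang has this root cause.

```python
def parse_version(text):
    """Turn a dotted version such as '11.4.21.69.0' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


# Bounds taken from this article for Oracle Bug 30748237.
INTRODUCED_IN = parse_version("11.4.16.4.0")
FIXED_IN = parse_version("11.4.21.69.0")


def in_affected_range(version_text):
    """True if the version is at or after the release that introduced the bug
    and earlier than the release that fixed it."""
    version = parse_version(version_text)
    return INTRODUCED_IN <= version < FIXED_IN


if __name__ == "__main__":
    # Sample strings for illustration; only the two bounds come from the article.
    for candidate in ("11.4.15.0.0", "11.4.16.4.0", "11.4.20.0.0", "11.4.21.69.0"):
        print(candidate, in_affected_range(candidate))
```

Comparing tuples of integers rather than raw strings avoids ordering mistakes such as treating "11.4.9" as later than "11.4.16".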

Issue/Introduction

vxconfigd was stuck in the biowait state for a long time on an InfoScale 7.4.1/Solaris 11.4 cluster with EMC Symmetrix 5977 and SunFire F80 storage, where MPxIO is used for multipathing. This caused VX commands to hang and VCS monitor timeouts for DiskGroup and VVR-related resources. The only way to break the hang was to reboot or panic the system.

Additional Information

JIRA: STESC-5572