EMC Clariion Array Storage Processor reboot results in unrecoverable SCSI offline device state when Emulex driver version 2.00a3 is used.

Article ID: 100007831

Description

Error Message

xx01 kernel: sd 1:0:0:3: Done: SUCCESS
xx01 kernel:         2 sd 1:0:0:3:
xx01 kernel: sd 0:0:0:3: Done: SUCCESS
xx01 kernel:         2 sd 0:0:0:3:

xx01 kernel:         command: Write(10): 2a 00 00 0c af b0 00 00 08 00
xx01 kernel: sdd: Current: sense key: Unit Attention
xx01 kernel:     Add. Sense: Asymmetric access state changed

xx01 kernel: sd 0:0:0:3: Done: SUCCESS
xx01 kernel:        2 sd 0:0:0:3:
xx01 kernel:         command: Write(10): 2a 00 00 12 e5 e0 00 00 06 00
xx01 kernel: sdd: Current: sense key: Not Ready
xx01 kernel:     Add. Sense: Logical unit not accessible, target port in unavailable state

xx01 kernel: VxVM vxdmp V-5-0-0 i/o error analysis done (status = 1) on path 8/0x30 belonging to dmpnode 201/0x10<5>
xx01 kernel: VxVM vxdmp V-5-0-112 disabled path 8/0x30 belonging to the dmpnode 201/0x10 due to path failure

xx01 kernel:        command: Write(10): 2a 00 00 0a e4 80 00 02 80 00
xx01 kernel: sdn: Current: sense key: Unit Attention
xx01 kernel:     Add. Sense: Asymmetric access state changed
Dec 16 13:50:39 xx01 kernel: sd 1:0:0:3: Done: SUCCESS
xx01 kernel:         2 sd 1:0:0:3:
xx01 kernel:        command: Write(10): 2a 00 00 0c 09 20 00 00 10 00
xx01 kernel: sdn: Current: sense key: Not Ready
xx01 kernel:     Add. Sense: Logical unit not accessible, target port in unavailable state

xx01 kernel: VxVM vxdmp V-5-0-0 i/o error analysis done (status = 1) on path 8/0xd0 belonging to dmpnode 201/0x10<5>
xx01 kernel: VxVM vxdmp V-5-0-112 disabled path 8/0xd0 belonging to the dmpnode 201/0x10 due to path failure

xx01 kernel: VxVM vxdmp V-5-0-0 failover initiated for 201/0x10
xx01 kernel: VxVM vxdmp V-5-0-0 curpri set to secondary for 201/0x10

xx01 kernel: sd 0:0:0:3: Done: TIMEOUT
xx01 kernel:       0 sd 0:0:0:3:
xx01 kernel:         command: Inquiry: 12 00 00 00 08 00
xx01 kernel: sd 0:0:0:0: Done: TIMEOUT
xx01 kernel:        0 sd 0:0:0:0:
xx01 kernel:       command: Inquiry: 12 00 00 00 08 00
xx01 kernel: sd 0:0:0:2: Done: TIMEOUT
xx01 kernel:         0 sd 0:0:0:2:
xx01 kernel:         command: Inquiry: 12 00 00 00 08 00
xx01 kernel: sd 0:0:0:4: Done: TIMEOUT
xx01 kernel:         0 sd 0:0:0:4:
xx01 kernel:        command: Inquiry: 12 00 00 00 08 00
xx01 kernel: sd 0:0:0:1: Done: TIMEOUT
xx01 kernel:         0 sd 0:0:0:1:
xx01 kernel:        command: Inquiry: 12 00 00 00 08 00
xx01 kernel: sd 1:0:0:3: Done: TIMEOUT
xx01 kernel:         0 sd 1:0:0:3:
xx01 kernel:         command: Inquiry: 12 00 00 00 08 00
xx01 kernel: sd 1:0:0:0: Done: TIMEOUT
xx01 kernel:       0 sd 1:0:0:0:
xx01 kernel:        command: Inquiry: 12 00 00 00 08 00
xx01 kernel: sd 1:0:0:2: Done: TIMEOUT
xx01 kernel:         0 sd 1:0:0:2:
xx01 kernel:         command: Inquiry: 12 00 00 00 08 00
xx01 kernel: sd 1:0:0:4: Done: TIMEOUT
xx01 kernel:         0 sd 1:0:0:4:
xx01 kernel:        command: Inquiry: 12 00 00 00 08 00
xx01 kernel: sd 1:0:0:1: Done: TIMEOUT
xx01 kernel:       0 sd 1:0:0:1:
xx01 kernel:    command: Inquiry: 12 00 00 00 08 00
xx01 kernel: Error handler scsi_eh_0 waking up
xx01 kernel: Total of 5 commands on 5 devices require eh work
xx01 kernel: Error handler scsi_eh_1 waking up
xx01 kernel: Total of 5 commands on 5 devices require eh work

xx01 kernel: sd 0:0:0:3: Done: SUCCESS
xx01 kernel:     d0000 sd 0:0:0:3:
xx01 kernel:         command: Test Unit Ready: 00 00 00 00 00 00
xx01 kernel: sd 1:0:0:3: Done: SUCCESS
xx01 kernel:     d0000 sd 1:0:0:3:
xx01 kernel:       command: Test Unit Ready: 00 00 00 00 00 00
xx01 kernel: sd 0:0:0:0: Done: SUCCESS
xx01 kernel:   d0000 sd 0:0:0:0:
xx01 kernel:         command: Test Unit Ready: 00 00 00 00 00 00
xx01 kernel: sd 1:0:0:0: Done: SUCCESS
xx01 kernel:     d0000 sd 1:0:0:0:
xx01 kernel:        command: Test Unit Ready: 00 00 00 00 00 00
xx01 kernel: sd 0:0:0:2: Done: SUCCESS
xx01 kernel:    d0000 sd 0:0:0:2:
xx01 kernel:         command: Test Unit Ready: 00 00 00 00 00 00
xx01 kernel: sd 1:0:0:2: Done: SUCCESS
xx01 kernel:     d0000 sd 1:0:0:2:
xx01 kernel:        command: Test Unit Ready: 00 00 00 00 00 00
xx01 kernel: sd 0:0:0:4: Done: SUCCESS
xx01 kernel:     d0000 sd 0:0:0:4:
xx01 kernel:       command: Test Unit Ready: 00 00 00 00 00 00
xx01 kernel: sd 1:0:0:4: Done: SUCCESS
xx01 kernel:     d0000 sd 1:0:0:4:
xx01 kernel:         command: Test Unit Ready: 00 00 00 00 00 00
xx01 kernel: sd 0:0:0:1: Done: SUCCESS
xx01 kernel:     d0000 sd 0:0:0:1:
xx01 kernel:         command: Test Unit Ready: 00 00 00 00 00 00
xx01 kernel: sd 1:0:0:1: Done: SUCCESS
xx01 kernel:     d0000 sd 1:0:0:1:
xx01 kernel:         command: Test Unit Ready: 00 00 00 00 00 00

xx01 kernel: lpfc 0000:0f:00.0: 0:(0):0713 SCSI layer issued Device Reset (0, 0) return x2002
xx01 kernel: lpfc 0000:0c:00.0: 1:(0):0713 SCSI layer issued Device Reset (0, 0) return x2002

xx01 kernel: sd 0:0:0:3: scsi: Device offlined - not ready after error recovery
xx01 kernel: sd 0:0:0:0: scsi: Device offlined - not ready after error recovery
xx01 kernel: sd 0:0:0:2: scsi: Device offlined - not ready after error recovery
xx01 kernel: sd 0:0:0:4: scsi: Device offlined - not ready after error recovery
xx01 kernel: sd 0:0:0:1: scsi: Device offlined - not ready after error recovery
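
When triaging a longer kernel log, it can help to pull out only the devices that ended up offlined. The following is a minimal sketch, assuming log lines in the format shown above. The function name `offlined_devices` is hypothetical and is not part of any Veritas or Emulex tool.

```python
import re

# Matches lines such as:
#   xx01 kernel: sd 0:0:0:3: scsi: Device offlined - not ready after error recovery
# The optional whitespace after "kernel:" tolerates lines where the space is missing.
OFFLINED_RE = re.compile(
    r"kernel:\s*sd (\d+):(\d+):(\d+):(\d+): scsi: Device offlined"
)


def offlined_devices(log_lines):
    """Return sorted, de-duplicated (host, channel, target, lun) tuples.

    Hypothetical helper: it only inspects the text of each line and does not
    query the SCSI layer or the HBA.
    """
    found = set()
    for line in log_lines:
        match = OFFLINED_RE.search(line)
        if match:
            found.add(tuple(int(group) for group in match.groups()))
    return sorted(found)


if __name__ == "__main__":
    sample = [
        "xx01 kernel: sd 0:0:0:3: scsi: Device offlined - not ready after error recovery",
        "xx01 kernel:sd 0:0:0:2: scsi: Device offlined - not ready after error recovery",
        "xx01 kernel: sd 0:0:0:3: Done: SUCCESS",
    ]
    for host, channel, target, lun in offlined_devices(sample):
        print(f"{host}:{channel}:{target}:{lun}")
```

Grouping the resulting tuples by their first element (the SCSI host number) shows which HBA port the offlined devices sit behind.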

Cause

A bug in the Emulex driver version 2.00a3 is the cause of the issue.

Resolution

Upgrading to Emulex driver version 2.00a4 resolves the issue.

The following fixes are included in Emulex driver version 2.00a4.

1. OS driver received timeout after reset. Setting INIT_PORT may have caused the port to fail to recover. Enhanced the DMA quiesce routine when the port is reset.
2. Enhanced the Fibre Channel link bring-up routine when the free buffer count is incorrect.
3. Corrected the PCI default L0s Exit Latency setting in firmware.
4. Corrected RSCN handling during T10 PI BlockGuard processing.
5. Properly updates the Host Buffer Queue pointers in host memory when mismatched with the driver.
6. Corrected an adapter trap with BlockGuard enabled.
7. Firmware was improperly disabling L0s control. Corrected functionality to disable L0s.

Please consult the HBA (Host Bus Adapter) vendor for details on downloads and installation instructions.

Issue/Introduction

A reboot of an EMC Clariion SP (Storage Processor), either for maintenance or as part of an NDU (Non-Disruptive Firmware Upgrade) process, results in an unrecoverable offlined device state when Emulex driver version 2.00a3 is used. Any attempt to recover the offlined SCSI devices results in SCSI-layer-initiated device/bus resets and a subsequent offline of the devices, unless the server is rebooted. This condition causes Veritas Cluster Server (VCS) Volume resources to time out and fault during their monitoring procedures, because of the I/O stall caused by the device/bus resets.

For VERITAS DMP (Dynamic Multi-Pathing), an NDU process requires both EMC Clariion array SPs to be rebooted one after the other. Because the SCSI devices cannot be recovered after the first SP is back online, the reboot of the second SP results in all DMP devices becoming inaccessible.

Sequence of events:
- The SP is rebooted.
- VERITAS DMP fails over from the primary paths to the secondary paths (at this point all I/O is redirected to the secondary paths).
- A couple of minutes later, Inquiry TIMEOUTs are seen for the devices associated with the SP.
- Next, TUR (Test Unit Ready) commands succeed.
- The SCSI layer then initiates LPFC driver/bus resets as it tries to recover from this condition (that is, SCSI Inquiry failing while TUR succeeds). This causes an I/O stall, so VCS Volume resources time out and fault.
- Because the device/bus resets cannot resolve the condition, the SCSI layer offlines all SCSI devices associated with the SP.
- Once the devices are offlined, VERITAS DMP cannot restore the paths because the devices do not respond to Inquiry. (An Inquiry is sent to a device to get its status, and should succeed once the SP is back online.) A reboot of the second SP would therefore result in total path failure.
Note: Each attempt to recover the path repeats the sequence.
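
The sequence above can be checked against a kernel log mechanically. The following is a rough sketch, assuming the message formats shown in the Error Message section. The stage names and the function `matches_signature` are hypothetical labels chosen for illustration. As a simplification, any `Done: TIMEOUT` line counts as an Inquiry timeout and any `Test Unit Ready` command line counts as a TUR success, which holds for the excerpt in this article but may not hold for other logs.

```python
import re

# Ordered stages of the failure signature described in the sequence of events.
# Patterns are taken from the log excerpt in this article; stage names are
# illustrative only.
STAGES = [
    ("inquiry_timeout", re.compile(r"Done: TIMEOUT")),
    ("tur_success", re.compile(r"command: Test Unit Ready")),
    ("device_reset", re.compile(r"lpfc .*SCSI layer issued Device Reset")),
    ("device_offlined", re.compile(r"Device offlined - not ready after error recovery")),
]


def stages_reached(log_lines):
    """Return the names of the stages that were seen, in the required order."""
    reached = []
    for line in log_lines:
        if len(reached) == len(STAGES):
            break
        name, pattern = STAGES[len(reached)]
        if pattern.search(line):
            reached.append(name)
    return reached


def matches_signature(log_lines):
    """True if every stage appears in order somewhere in log_lines."""
    return len(stages_reached(log_lines)) == len(STAGES)


if __name__ == "__main__":
    sample = [
        "xx01 kernel: sd 0:0:0:3: Done: TIMEOUT",
        "xx01 kernel:         command: Inquiry: 12 00 00 00 08 00",
        "xx01 kernel: sd 0:0:0:3: Done: SUCCESS",
        "xx01 kernel:         command: Test Unit Ready: 00 00 00 00 00 00",
        "xx01 kernel: lpfc 0000:0f:00.0: 0:(0):0713 SCSI layer issued Device Reset (0, 0) return x2002",
        "xx01 kernel: sd 0:0:0:3: scsi: Device offlined - not ready after error recovery",
    ]
    print(stages_reached(sample))
    print(matches_signature(sample))
```

A log that stops partway through the list (for example, timeouts and TUR successes but no reset) returns only the stages reached, which indicates how far the recovery attempt progressed.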