How to recover from DMP failures on EMC arrays with Veritas Volume Manager (VxVM)


Article ID: 100022874


Description

How to recover from DMP failures on EMC arrays with Veritas Volume Manager (VxVM)

Resolution

Failures can occur on either an Active/Active (A/A) array, such as EMC Symmetrix, or an Active/Passive (A/P) array, such as EMC CLARIION. The examples in this article use a CLARIION array with VERITAS Volume Manager.


1. Volume layout

A simple 2-column striped volume, using two LUNs belonging to a CLARIION array:

  v  testvol      -            ENABLED  ACTIVE   8382464  SELECT    testvol-01        fsgen
  pl testvol-01   testvol      ENABLED  ACTIVE   8382464  STRIPE    2/128             RW
  sd d2-01        testvol-01   d2       0        4191232  0/0       EMC_CLARiiON0_9   ENA
  sd d1-01        testvol-01   d1       0        4191232  1/0       EMC_CLARiiON0_18  ENA


2. File system

A VERITAS File System (VxFS) file system resides on this volume:

  # df
  /testvol           (/dev/vx/dsk/clariiondg/testvol): 3738854 blocks   467355 files


3. Multipathing

Each LUN has two paths. One LUN has its primary path on controller c2; the other LUN has its primary path on controller c3:

  # vxdisk list EMC_CLARiiON0_18
  Device:    EMC_CLARiiON0_18
  devicetag: EMC_CLARiiON0_18
  ...
  Multipathing information:
  numpaths:  2
  c2t50060169102041E8d1s2         state=enabled  type=secondary
  c3t50060160102041E8d1s2         state=enabled  type=primary

  # vxdisk list EMC_CLARiiON0_9
  Device:    EMC_CLARiiON0_9
  devicetag: EMC_CLARiiON0_9
  ...
  Multipathing information:
  numpaths:  2
  c2t50060169102041E8d10s2        state=enabled  type=primary
  c3t50060160102041E8d10s2        state=enabled  type=secondary
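
To compare the path layout of several LUNs quickly, the multipathing section of the output above can be reduced to one line per path. The following is a minimal sketch, not a VERITAS-supplied tool: the function name list_paths is hypothetical, and it assumes the path lines follow the numpaths line in the format shown above.

```shell
#!/bin/sh
# list_paths (hypothetical helper): read `vxdisk list <device>` output on
# stdin and print "path state type" for each path in the multipathing section.
list_paths() {
    awk '
        $1 == "numpaths:" { in_paths = 1; next }   # path lines follow numpaths
        in_paths && /state=/ {
            state = ""; type = ""
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^state=/) state = substr($i, 7)
                if ($i ~ /^type=/)  type  = substr($i, 6)
            }
            print $1, state, type
        }
    '
}
```

On a host with VxVM installed, it could be used as: vxdisk list EMC_CLARiiON0_18 | list_paths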


Examples of Failures

Two examples of path failures on the CLARIION array are illustrated here.

1. Path failure due to a transient failure, such as a Service Processor (SP) going temporarily offline (for example, because the SP reboots):

An excerpt of the SCSI errors recorded in syslog:

scsi: [ID xxxx kern.warning] Warning: /pci@9,600000/pci@2/SUNW,qlc@5/fp@0,0/ssd@w50060160102041e8,1 (ssd460): Error for Command: write(10)               Error Level: Fatal
scsi: [ID xxxx kern.notice]      Requested Block: 367872    Error Block: 367872
scsi: [ID xxxx kern.notice]      Vendor: DGC   Serial Number: 0100002F9CCL
scsi: [ID xxxx kern.notice]      Sense Key: Not Ready
scsi: [ID xxxx kern.notice]      ASC: 0x4 (), ASCQ: 0x3, FRU: 0x0

A little later, vxio errors appear:

vxio: [ID xxxx kern.warning] Warning: vxvm:vxio: Subdisk d1-01 block 365568: Uncorrectable write error
vxio: [ID xxxx kern.warning] Warning: vxvm:vxio: Subdisk d1-01 block 365696: Uncorrectable write error

The file system is then disabled, and VERITAS File System errors similar to the following appear:

vxfs: [ID 702911 kern.warning] Warning: msgcnt 2 vxfs: mesg 037: vx_metaioerr - vx_logbuf_write - /dev/vx/dsk/clariiondg/testvol file system meta data write error in block 2573
vxfs: [ID 702911 kern.warning] Warning: msgcnt 3 vxfs: mesg 031: vx_disable - /dev/vx/dsk/clariiondg/testvol file system disabled
vxfs: [ID 702911 kern.warning] Warning: msgcnt 4 vxfs: mesg 017: vx_delbuf_flush - /testvol file system inode 4 marked bad incore
vxfs: [ID 885974 kern.info] vxfs msgcnt 4 offset 0x00000000    41ed        4        0        1
vxfs: [ID 214594 kern.info] vxfs msgcnt 4 offset 0x000000b0        0        0
vxfs: [ID 702911 kern.warning] Warning: msgcnt 5 vxfs: mesg 017: vx_delbuf_flush - /testvol file system inode 2880 marked bad incore

2. Path failure due to a permanent failure, such as a SAN cable failure:

An excerpt of the SCSI errors recorded in syslog:

scsi: [ID 243001 kern.info] /pci@9,600000/pci@2/SUNW,qlc@5/fp@0,0 (fcp2):
        offlining lun=13 (trace=0), target=601200 (trace=2800004)
scsi: [ID 243001 kern.info] /pci@9,600000/pci@2/SUNW,qlc@5/fp@0,0 (fcp2):
        offlining lun=12 (trace=0), target=601200 (trace=2800004)

A little later, vxio errors appear:

vxio: [ID 663439 kern.warning] Warning: vxvm:vxio: Subdisk d1-01 block 4040960: Uncorrectable write error
vxio: [ID 663439 kern.warning] Warning: vxvm:vxio: Subdisk d1-01 block 4041984: Uncorrectable write error

The file system is then disabled, and VERITAS File System errors similar to the following appear:

vxfs: [ID 702911 kern.warning] Warning: msgcnt 1309 vxfs: mesg 037: vx_metaioerr - vx_logbuf_write - /dev/vx/dsk/clariiondg/testvol file system meta data write error in block 8852
vxfs: [ID 702911 kern.warning] Warning: msgcnt 1310 vxfs: mesg 031: vx_disable - /dev/vx/dsk/clariiondg/testvol file system disabled
vxfs: [ID 702911 kern.warning] Warning: msgcnt 1337 vxfs: mesg 037: vx_metaioerr - vx_inode_iodone - /dev/vx/dsk/clariiondg/testvol file system meta data write error in block 941184
...
vxfs: [ID 702911 kern.warning] Warning: msgcnt 1349 vxfs: mesg 017: vx_ilock - /testvol file system inode 782 marked bad incore
vxfs: [ID 885974 kern.info] vxfs msgcnt 1347 offset 0x00000090        0        0        0        0
...
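
Both failure types leave the same signature in syslog: vxio "Uncorrectable write error" messages followed by a VxFS vx_disable message. The sketch below summarizes that signature from log text supplied on stdin. The function name summarize_failures is hypothetical, and the parsing assumes the message layout shown in the excerpts above; other releases may format these messages differently.

```shell
#!/bin/sh
# summarize_failures (hypothetical helper): read syslog text on stdin and
# report subdisks with uncorrectable write errors and file systems that
# VxFS has disabled.
summarize_failures() {
    awk '
        /vxio/ && /Uncorrectable write error/ {
            # the subdisk name is the field after the one ending in "Subdisk"
            for (i = 1; i < NF; i++)
                if ($i ~ /Subdisk$/) { errors[$(i + 1)]++; break }
        }
        /vx_disable/ {
            # layout assumed: "vx_disable - <device> file system disabled"
            for (i = 1; i < NF - 1; i++)
                if ($i == "vx_disable") { disabled[$(i + 2)] = 1; break }
        }
        END {
            for (sd in errors)
                printf "subdisk %s: %d write error(s)\n", sd, errors[sd]
            for (fs in disabled)
                printf "file system disabled: %s\n", fs
        }
    '
}
```

The log file location is site-specific; on Solaris the input would typically come from /var/adm/messages.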



System Status after the Failure

In either of the above cases, the end result is that the volume is in a DISABLED state. If a file system resided on that volume, that file system is no longer accessible. The procedure to recover from such a situation is the same in both cases above.

1. Assume that path c3 fails (either a transient or a permanent failure). After the single-path failure and the subsequent sequence of errors shown in the two cases above, the volume goes into a DISABLED state:

v  testvol      -            DISABLED ACTIVE   8382464  SELECT    -                 fsgen
pl testvol-01   testvol      DISABLED NODEVICE 8382464  STRIPE    2/128             RW
sd d2-01        testvol-01   d2       0        4191232  0/0       EMC_CLARiiON0_9   ENA
sd d1-01        testvol-01   d1       0        4191232  1/0       -                 NDEV

2. The vxdisk list command shows that one LUN (EMC_CLARiiON0_18, which had c3 as its primary path) is in a FAILED state:

DEVICE            TYPE      DISK      GROUP        STATUS
EMC_CLARiiON0_9   sliced    d2        clariiondg   online
-                 -         d1        clariiondg   failed was:EMC_CLARiiON0_18

3. The df command will show that the file system is in an I/O error state:
 
# df -k /testvol
Filesystem            kbytes    used  avail capacity  Mounted on
df: cannot statvfs /testvol: I/O error
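
The failed disk entries can be picked out of the vxdisk list output mechanically. This is a minimal sketch under the assumption that a failed disk appears exactly as in step 2, with "failed" and "was:<device>" as the last two fields; the function name failed_disks is hypothetical.

```shell
#!/bin/sh
# failed_disks (hypothetical helper): read `vxdisk list` output on stdin and
# print "disk group former-device" for every disk in the failed state.
failed_disks() {
    awk 'NF >= 6 && $(NF - 1) == "failed" && $NF ~ /^was:/ {
        print $3, $4, substr($NF, 5)     # strip the leading "was:"
    }'
}
```

For the output in step 2 this prints the disk media name d1, the disk group clariiondg, and the device the disk was previously on, EMC_CLARiiON0_18.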


Recovery Procedure

1. First, unmount the file system:

# /usr/sbin/umount /testvol

  (If this fails, use the "-f" option to force the unmount of the file system. Because the volume is in a DISABLED state, it is safe to use this option at this point.)

2. Next, run the following command to force Volume Manager to rescan all paths:
 
# /usr/sbin/vxdctl enable

a. If the path error was transient and the path is fully functional again, Volume Manager rediscovers the path and re-enables it. All functional paths will be in an ENABLED state.

# vxdmpadm getsubpaths dmpnodename=EMC_CLARiiON0_18
NAME                     STATE        PATH-TYPE  CTLR-NAME  ENCLR-TYPE    ENCLR-NAME
====================================================================================
c2t50060169102041E8d1s2  ENABLED      SECONDARY  c2         EMC_CLARiiON  EMC_CLARiiON0
c3t50060160102041E8d1s2  ENABLED      PRIMARY    c3         EMC_CLARiiON  EMC_CLARiiON0

b. If the path failure is permanent, Volume Manager detects the loss of the path and marks all non-functional paths as DISABLED.

# vxdmpadm getsubpaths dmpnodename=EMC_CLARiiON0_18
NAME                     STATE        PATH-TYPE  CTLR-NAME  ENCLR-TYPE    ENCLR-NAME
====================================================================================
c2t50060169102041E8d1s2  ENABLED      SECONDARY  c2         EMC_CLARiiON  EMC_CLARiiON0
c3t50060160102041E8d1s2  DISABLED     PRIMARY    c3         EMC_CLARiiON  EMC_CLARiiON0
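
To tell cases a and b apart without reading the table by eye, the getsubpaths output can be summarized. The following is a sketch only: path_summary is a hypothetical name, and the column order is assumed to match the output above. Some VxVM releases append a marker such as "(A)" to the state column, so the sketch strips any parenthesized suffix before comparing.

```shell
#!/bin/sh
# path_summary (hypothetical helper): read `vxdmpadm getsubpaths` output on
# stdin, list every DISABLED path, and print a count of path states.
path_summary() {
    awk '
        {
            state = $2
            sub(/\(.*$/, "", state)      # drop a suffix such as "(A)" if present
        }
        state == "ENABLED" || state == "DISABLED" {
            count[state]++
            if (state == "DISABLED")
                printf "disabled path: %s (%s, controller %s)\n", $1, $3, $4
        }
        END {
            printf "enabled=%d disabled=%d\n", count["ENABLED"], count["DISABLED"]
        }
    '
}
```

A result of disabled=0 corresponds to case a (transient failure, path recovered); any disabled path corresponds to case b and points at the controller whose cabling or SAN connection should be checked.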

3. Use the following command to reattach the failed LUN:
 
# /etc/vx/bin/vxreattach

  Once this command completes, the disk should be in an ONLINE state:

DEVICE            TYPE      DISK      GROUP        STATUS
EMC_CLARiiON0_9   sliced    d2        clariiondg   online
EMC_CLARiiON0_18  sliced    d1        clariiondg   online

  However, the volume is now in a DISABLED/RECOVER state:

v  testvol      -            DISABLED ACTIVE   8382464  SELECT    -                 fsgen
pl testvol-01   testvol      DISABLED RECOVER  8382464  STRIPE    2/128             RW
sd d2-01        testvol-01   d2       0        4191232  0/0       EMC_CLARiiON0_9   ENA
sd d1-01        testvol-01   d1       0        4191232  1/0       EMC_CLARiiON0_18  ENA

4. Start the volume. Use the "-f" option to force start this volume:
 
# /usr/sbin/vxvol -f start testvol

  The volume is now in an ENABLED/ACTIVE state:

v  testvol      -            ENABLED  ACTIVE   8382464  SELECT    testvol-01        fsgen
pl testvol-01   testvol      ENABLED  ACTIVE   8382464  STRIPE    2/128             RW
sd d2-01        testvol-01   d2       0        4191232  0/0       EMC_CLARiiON0_9   ENA
sd d1-01        testvol-01   d1       0        4191232  1/0       EMC_CLARiiON0_18  ENA
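
The volume and plex states shown in steps 3 and 4 can be extracted from the record listings for use in a check before continuing. This is a minimal sketch that assumes the column layout of the listings above (record type, name, association, kernel state, state); the function name volume_state is hypothetical.

```shell
#!/bin/sh
# volume_state (hypothetical helper): read a vxprint-style record listing on
# stdin and print KSTATE/STATE for the named volume and each of its plexes.
volume_state() {
    awk -v vol="$1" '
        $1 == "v"  && $2 == vol { printf "volume %s %s/%s\n", $2, $4, $5 }
        $1 == "pl" && $3 == vol { printf "plex %s %s/%s\n", $2, $4, $5 }
    '
}
```

Before step 4 the plex line is expected to read DISABLED/RECOVER; after the forced start both lines should read ENABLED/ACTIVE.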

5. If this was a raw volume, use the appropriate application utilities to check the consistency of the data in that volume.

  If a file system resided on this volume, first check whether the file system consistency check reports any errors (use the "-n" option to check for errors without committing any changes):

# fsck -F vxfs -n /dev/vx/rdsk/clariiondg/testvol

  If no errors are reported, proceed to step 7.

6. If the fsck utility reports errors, do the following:

  - First, capture a metasave of the file system.

  - After the metasave completes, run the fsck utility with the "-o full" option on this file system:

# fsck -F vxfs -o full /dev/vx/rdsk/clariiondg/testvol

   If this fails with an error, contact VERITAS Technical Support for further assistance.

7. If the above fsck checks completed successfully, you can now mount the file system:

# /usr/sbin/mount -F vxfs /dev/vx/dsk/clariiondg/testvol /testvol
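
The steps above can be sketched as a single script. This is an illustration under stated assumptions, not a supported procedure: the names run and recover_volume and the DRY_RUN variable are hypothetical, the script assumes that fsck with "-n" exits non-zero when it finds errors, and it deliberately stops before step 6 because capturing a metasave and running a full fsck call for operator judgment. With DRY_RUN=1 (the default here) it only prints the commands.

```shell
#!/bin/sh
# Sketch of the recovery sequence. DRY_RUN=1 only prints each command;
# set DRY_RUN=0 to execute the commands on a real host.
DRY_RUN=${DRY_RUN:-1}

# run: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# recover_volume <diskgroup> <volume> <mountpoint>
recover_volume() {
    dg=$1 vol=$2 mnt=$3
    raw=/dev/vx/rdsk/$dg/$vol
    blk=/dev/vx/dsk/$dg/$vol

    # Step 1: unmount, falling back to a forced unmount.
    run /usr/sbin/umount "$mnt" || run /usr/sbin/umount -f "$mnt" || return 1
    # Step 2: rescan all paths.
    run /usr/sbin/vxdctl enable || return 1
    # Step 3: reattach the failed LUN.
    run /etc/vx/bin/vxreattach || return 1
    # Step 4: force-start the volume.
    run /usr/sbin/vxvol -f start "$vol" || return 1
    # Step 5: read-only consistency check (assumed to fail on errors).
    if ! run fsck -F vxfs -n "$raw"; then
        echo "fsck -n reported errors: capture a metasave, then run fsck -o full" >&2
        return 1
    fi
    # Step 7: mount the file system.
    run /usr/sbin/mount -F vxfs "$blk" "$mnt"
}
```

For the example in this article the call would be: recover_volume clariiondg testvol /testvol. Reviewing the dry-run output before setting DRY_RUN=0 is a reasonable precaution, since step 4 forces the volume to start.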
 
 
 
 

 

Additional Information

ETrack: 152674 ETrack: 157802