Issues can occur on either an Active/Active (A/A) array, such as EMC Symmetrix, or an Active/Passive (A/P) array, such as EMC CLARiiON. A CLARiiON array is used here with VERITAS Volume Manager.
1. Volume layout
A simple 2-column striped volume, using two LUNs belonging to a CLARiiON array:
v testvol - ENABLED ACTIVE 8382464 SELECT testvol-01 fsgen
pl testvol-01 testvol ENABLED ACTIVE 8382464 STRIPE 2/128 RW
sd d2-01 testvol-01 d2 0 4191232 0/0 EMC_CLARiiON0_9 ENA
sd d1-01 testvol-01 d1 0 4191232 1/0 EMC_CLARiiON0_18 ENA
2. File system
A VERITAS File System (tm) file system resides on this volume:
# df
/testvol (/dev/vx/dsk/clariiondg/testvol):3738854 blocks 467355 files
3. Multipathing
Each LUN has two paths. One LUN has its primary path on controller c2; the other LUN has its primary path on controller c3:
# vxdisk list EMC_CLARiiON0_18
Device: EMC_CLARiiON0_18
devicetag: EMC_CLARiiON0_18
...
Multipathing information:
numpaths: 2
c2t50060169102041E8d1s2 state=enabled type=secondary
c3t50060160102041E8d1s2 state=enabled type=primary
# vxdisk list EMC_CLARiiON0_9
Device: EMC_CLARiiON0_9
devicetag: EMC_CLARiiON0_9
...
Multipathing information:
numpaths: 2
c2t50060169102041E8d10s2 state=enabled type=primary
c3t50060160102041E8d10s2 state=enabled type=secondary
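The primary path of a LUN can be picked out of the listing above mechanically. The following is a minimal sketch, assuming path lines have the form "<path> state=<state> type=<primary|secondary>" as shown, and that the controller is the part of the path name before the first "t" (Solaris cNtNdNsN naming). The function name primary_paths is hypothetical, not part of Volume Manager.

```shell
#!/bin/sh
# primary_paths: read `vxdisk list <device>` output on stdin and print
# "<controller> <path>" for every path marked type=primary.
# Assumption: path lines follow the "Multipathing information:" header and
# look like "<path> state=<state> type=<primary|secondary>".
primary_paths() {
    awk '
        /^Multipathing information:/ { in_paths = 1; next }
        in_paths && /type=primary/ {
            ctlr = $1
            sub(/t.*/, "", ctlr)   # keep only the cN prefix of cNtNdNsN
            print ctlr, $1
        }
    '
}
```

For example, "vxdisk list EMC_CLARiiON0_18 | primary_paths" would be expected to report controller c3 for the LUN shown above.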
Examples of Failures
Two examples of path failures on the CLARiiON array are illustrated here.
1. Path failure due to a transient failure, such as a Service Processor (SP) going temporarily offline (for example, because the SP reboots):
An excerpt of the SCSI errors recorded in syslog:
scsi: [ID xxxx kern.warning] Warning: /pci@9,600000/pci@2/SUNW,qlc@5/fp@0,0/ssd@w50060160102041e8,1 (ssd460): Error for Command: write(10) Error Level: Fatal
scsi: [ID xxxx kern.notice] Requested Block: 367872 Error Block: 367872
scsi: [ID xxxx kern.notice] Vendor: DGC Serial Number: 0100002F9CCL
scsi: [ID xxxx kern.notice] Sense Key: Not Ready
scsi: [ID xxxx kern.notice] ASC: 0x4 (), ASCQ: 0x3, FRU: 0x0
A little later, you will see vxio errors:
vxio: [ID xxxx kern.warning] Warning: vxvm:vxio:Subdisk d1-01 block 365568: Uncorrectable write error
vxio: [ID xxxx kern.warning] Warning: vxvm:vxio:Subdisk d1-01 block 365696: Uncorrectable write error
After this, the file system is disabled, and VERITAS File System errors similar to the following appear:
vxfs: [ID 702911 kern.warning] Warning: msgcnt 2 vxfs: mesg 037: vx_metaioerr - vx_logbuf_write - /dev/vx/dsk/clariiondg/testvol file system meta data write error in block 2573
vxfs: [ID 702911 kern.warning] Warning: msgcnt 3 vxfs: mesg 031: vx_disable - /dev/vx/dsk/clariiondg/testvol file system disabled
vxfs: [ID 702911 kern.warning] Warning: msgcnt 4 vxfs: mesg 017: vx_delbuf_flush - /testvol file system inode 4 marked bad incore
vxfs: [ID 885974 kern.info] vxfs msgcnt 4 offset 0x00000000 41ed 4 0 1
vxfs: [ID 214594 kern.info] vxfs msgcnt 4 offset 0x000000b0 0 0
vxfs: [ID 702911 kern.warning] Warning: msgcnt 5 vxfs: mesg 017: vx_delbuf_flush - /testvol file system inode 2880 marked bad incore
2. Path failure due to a permanent failure (for example, a SAN cable failure):
An excerpt of the SCSI errors recorded in syslog:
scsi: [ID 243001 kern.info] /pci@9,600000/pci@2/SUNW,qlc@5/fp@0,0 (fcp2):
offlining lun=13 (trace=0), target=601200 (trace=2800004)
scsi: [ID 243001 kern.info] /pci@9,600000/pci@2/SUNW,qlc@5/fp@0,0 (fcp2):
offlining lun=12 (trace=0), target=601200 (trace=2800004)
A little later, you will see vxio errors:
vxio: [ID 663439 kern.warning] Warning: vxvm:vxio:Subdisk d1-01 block 4040960: Uncorrectable write error
vxio: [ID 663439 kern.warning] Warning: vxvm:vxio:Subdisk d1-01 block 4041984: Uncorrectable write error
After this, the file system is disabled, and VERITAS File System errors similar to the following appear:
vxfs: [ID 702911 kern.warning] Warning: msgcnt 1309 vxfs: mesg 037: vx_metaioerr - vx_logbuf_write - /dev/vx/dsk/clariiondg/testvol file system meta data write error in block 8852
vxfs: [ID 702911 kern.warning] Warning: msgcnt 1310 vxfs: mesg 031: vx_disable - /dev/vx/dsk/clariiondg/testvol file system disabled
vxfs: [ID 702911 kern.warning] Warning: msgcnt 1337 vxfs: mesg 037: vx_metaioerr - vx_inode_iodone - /dev/vx/dsk/clariiondg/testvol file system meta data write error in block 941184
...
vxfs: [ID 702911 kern.warning] Warning: msgcnt 1349 vxfs: mesg 017: vx_ilock - /testvol file system inode 782 marked bad incore
vxfs: [ID 885974 kern.info] vxfs msgcnt 1347 offset 0x00000090 0 0 0 0
...
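Both failure cases leave the same two signatures in syslog: vxio "Uncorrectable write error" lines, followed by a VxFS message 031 (vx_disable). The following is a rough sketch for finding them in a saved log excerpt. It assumes the message wording shown above; the function names are hypothetical, and the location of the log file is left to the caller.

```shell
#!/bin/sh
# disabled_filesystems: read syslog text on stdin and print each device that
# VxFS reported as disabled (mesg 031, vx_disable), one per line, de-duplicated.
disabled_filesystems() {
    sed -n 's/.*mesg 031: vx_disable - \([^ ]*\).*/\1/p' | sort -u
}

# uncorrectable_writes: read syslog text on stdin and print
# "<subdisk> <count>" for vxio "Uncorrectable write error" lines.
# The subdisk name is taken as the word that follows "Subdisk".
uncorrectable_writes() {
    awk '
        /vxio:/ && /Uncorrectable write error/ {
            for (i = 1; i < NF; i++)
                if ($i ~ /Subdisk$/) count[$(i + 1)]++
        }
        END { for (sd in count) print sd, count[sd] }
    ' | sort
}
```

A subdisk with a nonzero count alongside a disabled device for the same volume points to the situation described in this section, rather than to a file-system-only problem.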
System Status after the Failure
In either of the above cases, the end result is that the volume is in a DISABLED state. If a file system resided on that volume, that file system is no longer accessible. The procedure to recover from such a situation is the same in both cases above.
1. Assume that path c3 fails (either a transient or a permanent failure). After the single-path failure and the subsequent sequence of errors shown in the two cases above, the volume goes into a DISABLED state:
v testvol - DISABLED ACTIVE 8382464 SELECT - fsgen
pl testvol-01 testvol DISABLED NODEVICE 8382464 STRIPE 2/128 RW
sd d2-01 testvol-01 d2 0 4191232 0/0 EMC_CLARiiON0_9 ENA
sd d1-01 testvol-01 d1 0 4191232 1/0 - NDEV
2. The vxdisk list command shows one LUN (EMC_CLARiiON0_18, which had c3 as its PRIMARY path) is in a FAILED state:
DEVICE TYPE DISK GROUP STATUS
EMC_CLARiiON0_9 sliced d2 clariiondg online
- - d1 clariiondg failed was:EMC_CLARiiON0_18
3. The df command will show that the file system is in an I/O error state:
# df -k /testvol
Filesystem kbytes used avail capacity Mounted on
df: cannot statvfs /testvol: I/O error
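The three symptoms above can be checked from saved command output. The following is a minimal sketch, assuming the volume listing was produced by something like "vxprint -ht" (the source does not say which options were used) and that the disk listing comes from "vxdisk list" with the columns shown. The function names are hypothetical.

```shell
#!/bin/sh
# disabled_volumes: print the names of volumes whose kernel state is DISABLED.
# Input: volume listing on stdin, with "v <name> <rvg> <kstate> ..." records.
disabled_volumes() {
    awk '$1 == "v" && $4 == "DISABLED" { print $2 }'
}

# missing_subdisks: print subdisks whose device is gone (last column NDEV).
missing_subdisks() {
    awk '$1 == "sd" && $NF == "NDEV" { print $2 }'
}

# failed_disks: print "<disk> <former device>" for disks in the failed state.
# Input: `vxdisk list` output on stdin (DEVICE TYPE DISK GROUP STATUS).
failed_disks() {
    awk '$5 == "failed" { dev = $6; sub(/^was:/, "", dev); print $3, dev }'
}
```

For the state shown above, these would be expected to report testvol, d1-01, and "d1 EMC_CLARiiON0_18" respectively.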
Recovery Procedure
1. First, unmount the file system:
# /usr/sbin/umount /testvol
(If this fails, use the "-f" option to force the unmount. Because the volume is in a DISABLED state, it is safe to use this option at this point.)
2. Next, run the following command to force Volume Manager to rescan all paths:
# /usr/sbin/vxdctl enable
a. If a transient path error occurred and the path is now fully functional, Volume Manager rediscovers the path and re-enables it. All functional paths will be in an ENABLED state.
# vxdmpadm getsubpaths dmpnodename=EMC_CLARiiON0_18
NAME STATE PATH-TYPE CTLR-NAME ENCLR-TYPE ENCLR-NAME
====================================================================
c2t50060169102041E8d1s2 ENABLED SECONDARY c2 EMC_CLARiiON EMC_CLARiiON0
c3t50060160102041E8d1s2 ENABLED PRIMARY c3 EMC_CLARiiON EMC_CLARiiON0
b. If a permanent path failure occurred, Volume Manager detects the loss of the path and marks all non-functional paths as DISABLED.
# vxdmpadm getsubpaths dmpnodename=EMC_CLARiiON0_18
NAME STATE PATH-TYPE CTLR-NAME ENCLR-TYPE ENCLR-NAME
====================================================================
c2t50060169102041E8d1s2 ENABLED SECONDARY c2 EMC_CLARiiON EMC_CLARiiON0
c3t50060160102041E8d1s2 DISABLED PRIMARY c3 EMC_CLARiiON EMC_CLARiiON0
3. Use the following command to reattach the failed LUN:
# /etc/vx/bin/vxreattach
Once this command completes, the disk should now be in an ONLINE state:
DEVICE TYPE DISK GROUP STATUS
EMC_CLARiiON0_9 sliced d2 clariiondg online
EMC_CLARiiON0_18 sliced d1 clariiondg online
However, the volume is now in a DISABLED/RECOVER state:
v testvol - DISABLED ACTIVE 8382464 SELECT - fsgen
pl testvol-01 testvol DISABLED RECOVER 8382464 STRIPE 2/128 RW
sd d2-01 testvol-01 d2 0 4191232 0/0 EMC_CLARiiON0_9 ENA
sd d1-01 testvol-01 d1 0 4191232 1/0 EMC_CLARiiON0_18 ENA
4. Start the volume. Use the "-f" option to force start this volume:
# /usr/sbin/vxvol -f start testvol
The volume is now in an ENABLED/ACTIVE state:
v testvol - ENABLED ACTIVE 8382464 SELECT testvol-01 fsgen
pl testvol-01 testvol ENABLED ACTIVE 8382464 STRIPE 2/128 RW
sd d2-01 testvol-01 d2 0 4191232 0/0 EMC_CLARiiON0_9 ENA
sd d1-01 testvol-01 d1 0 4191232 1/0 EMC_CLARiiON0_18 ENA
5. If this was a raw volume, use the appropriate application utilities to check the consistency of the data in that volume.
If a file system resided on this volume, first check whether the file system consistency check reports any errors (use the "-n" option to check for errors without committing any changes):
# fsck -F vxfs -n /dev/vx/rdsk/clariiondg/testvol
If no errors are reported, proceed to step 7.
6. If the fsck utility reports errors, do the following:
- First, capture a metasave of the file system.
- After the metasave operation completes, run the fsck utility (with the "-o full" option) on this file system:
# fsck -F vxfs -o full /dev/vx/rdsk/clariiondg/testvol
If this fails with an error, contact VERITAS Technical Support for further assistance.
7. If the above fsck checks completed successfully, you can now mount the file system:
# /usr/sbin/mount -F vxfs /dev/vx/dsk/clariiondg/testvol /testvol
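The recovery steps above can be sketched as a single shell function. This is an illustration only, not a supported tool. The names run and recover_volume and the DRY_RUN variable are hypothetical. The sketch also assumes that fsck exits with a nonzero status when the "-n" check reports errors, which should be confirmed against the fsck documentation for the installed release. It defaults to dry-run mode, printing each command instead of executing it, and it stops before the metasave and full fsck in step 6, which need operator judgment.

```shell
#!/bin/sh
# DRY_RUN=1 (the default in this sketch) prints commands instead of running them.
DRY_RUN=${DRY_RUN:-1}

# run: echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# recover_volume <diskgroup> <volume> <mountpoint>
# Follows steps 1-5 and 7 of the procedure for a volume holding a VxFS file system.
recover_volume() {
    dg=$1 vol=$2 mnt=$3

    # Step 1: unmount, falling back to a forced unmount.
    run /usr/sbin/umount "$mnt" || run /usr/sbin/umount -f "$mnt" || return 1
    # Step 2: make Volume Manager rescan all paths.
    run /usr/sbin/vxdctl enable || return 1
    # Step 3: reattach the failed LUN.
    run /etc/vx/bin/vxreattach || return 1
    # Step 4: force-start the volume (invoked as in the text, without a disk group option).
    run /usr/sbin/vxvol -f start "$vol" || return 1
    # Step 5: read-only consistency check; assumed to exit nonzero on errors.
    if ! run fsck -F vxfs -n "/dev/vx/rdsk/$dg/$vol"; then
        echo "fsck reported errors: capture a metasave, then run a full fsck (step 6)" >&2
        return 2
    fi
    # Step 7: mount the file system again.
    run /usr/sbin/mount -F vxfs "/dev/vx/dsk/$dg/$vol" "$mnt"
}
```

Calling "recover_volume clariiondg testvol /testvol" with DRY_RUN left at 1 prints the command sequence for review. Setting DRY_RUN=0 would execute it, which should only be done after the failed path has been investigated.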