How to detect and correct inode corruption associated with transient fiber link failures

Article ID: 100038941

Description

This article contains the procedure to detect and correct inode corruption associated with transient fiber link failures.

Warning: Read this entire article before making any changes. Failure to follow the proper procedures could lead to data loss. Contact Veritas Support when using this article. If possible, take a complete backup of the data before correcting inode corruption to help prevent data loss.

Procedure

This procedure helps detect and correct Veritas File System (VxFS) inode corruption resulting from transient fiber link failures. Under certain conditions, in-core inodes can be marked bad if all paths to a device have been disabled. If the paths are re-enabled while the file system is still enabled, these in-core inodes can be flushed to disk and the superblock marked as needing a full fsck. A subsequent full fsck will clear these inodes, deleting the files. This procedure is most relevant to inodes marked bad due to read failures: the inode was not being updated at the time, so it has the highest probability of being recovered successfully.

After a link failure has been detected, analyze the /var/adm/messages file for possible inode failures. VxFS prints "vxfs:" messages to /var/adm/messages, and these usually indicate which inodes have been marked bad.

A typical message looks like this:
 
Mar 15 17:26:21 ioccrmprep1 unix: Warning: msgcnt 31 vxfs: mesg 017: 
vx_ilock - /opt/data/ora16/preprod file system inode 10 marked bad

This indicates that inode 10 on file system /opt/data/ora16/preprod was marked bad. This is expected behavior. However, because the paths to the device were reset while the file system was still enabled, inode 10 was flushed to disk. Because this was a read failure, it is very unlikely that inode 10 is actually corrupt.
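When many messages have been logged, the affected file systems and inodes can be extracted with a small filter. The following is a minimal sketch, assuming the message text has the same wording as the sample above; the function name list_bad_inodes is illustrative and is not part of any Veritas tool.

```shell
# Hypothetical helper: reads syslog text on stdin and prints one
# "<file system> <inode>" pair per VxFS "marked bad" message.
# Assumes the message wording shown in the sample above.
list_bad_inodes() {
    sed -n 's/.* - \(.*\) file system inode \([0-9][0-9]*\) marked bad.*/\1 \2/p' |
        sort -u    # drop repeated reports of the same inode
}
```

It could be run as `list_bad_inodes < /var/adm/messages`. The pattern only looks at the part of the message that names the file system and inode, so it does not depend on whether the message is logged on one line or wrapped onto two.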

Note: Veritas Support should be contacted if file system corruption is suspected.

Procedure to Detect and Correct

1. Unmount the file system before attempting repairs on corrupted inodes.  To verify that the superblock of the failing file system has been marked as needing a full fsck, examine it with the following command:
 
% echo "8192B.p S" | fsdb -F vxfs /dev/vx/rdsk/rootdg/meta

The actual device can be obtained from the vfstab file. 
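As a sketch of that lookup, the raw device can be read from the second column of the vfstab entry for the mount point. This assumes the standard Solaris vfstab column order (device to mount, device to fsck, mount point, and so on); the function name raw_device_for is hypothetical.

```shell
# Hypothetical helper: print the "device to fsck" (raw device) that
# vfstab lists for a given mount point.
# $1 = mount point, $2 = vfstab path (defaults to /etc/vfstab)
raw_device_for() {
    mount_point=$1
    vfstab=${2:-/etc/vfstab}
    # Skip comment lines; column 3 is the mount point, column 2 the raw device.
    awk -v mp="$mount_point" '$1 !~ /^#/ && $3 == mp { print $2 }' "$vfstab"
}
```

For example, `raw_device_for /meta` would print the raw device to pass to fsdb and fsck, provided /meta has an entry in /etc/vfstab.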

The output will look something like this:
 
super-block at 00000002.0000
magic a501fcf5  version 4
ctime 983738769 811577  (Sun Mar  4 12:46:09 2001 PDT)
log_version 9 logstart 0  logend 0
bsize  4096 size  6043904 dsize  6043904  ninode 0  nau 0
defiextsize 0  oilbsize 0  immedlen 96  ndaddr 10
aufirst 0  emap 0  imap 0  iextop 0  istart 0
bstart 0  femap 0  fimap 0  fiextop 0  fistart 0  fbstart 0
nindir 2048  aulen 32768  auimlen 0  auemlen 2
auilen 0  aupad 0  aublocks 32768  maxtier 15
inopb 16  inopau 0  ndiripau 0  iaddrlen 2   bshift 12
inoshift 4  bmask fffff000  boffmask fff  checksum e06a935f
free 2459213  ifree 0
efree  1 2 2 2 1 1 0 0 0 1 1 0 0 0 0 1 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 
flags 301 mod 0 clean 3c
time 984695526 21111  (Thu Mar 15 14:32:06 2001 PDT)
oltext[0] 15  oltext[1] 774  oltsize 1
iauimlen 1  iausize 4  dinosize 256
checksum2 41d
checksum3 0

The key field is "flags". In this case it is "301" (hexadecimal), which is 0x0001 | 0x0100 | 0x0200, that is, VX_FULLFSCK | VX_METAIOERR | VX_DATAIOERR per the following defines:
 
VX_FULLFSCK       0x0001    full fsck required
VX_LOGBAD         0x0002    log is invalid, do not do replay
VX_NOLOG          0x0004    no logging, do not do replay
VX_RESIZE         0x0008    resize in progress
VX_LOGRESET       0x0010    log reset desired
VX_UPGRADING      0x0020    upgrade in progress
VX_UQUOTACHECK    0x0040    V2 only, moved to CUT in V3
VX_GQUOTACHECK    0x0080    V2 only, moved to CUT in V3
VX_METAIOERR      0x0100    file system meta-data i/o error
VX_DATAIOERR      0x0200    file data i/o error
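
The breakdown of the flags value can also be done mechanically. The following is a minimal sketch that tests each bit from the table above against the hexadecimal value printed by fsdb; the function name decode_vxfs_flags is hypothetical, and only the flags listed in this article are known to it.

```shell
# Hypothetical helper: decode the superblock "flags" value (hex, without
# the 0x prefix, as fsdb prints it) into the flag names from the table above.
decode_vxfs_flags() {
    value=$((0x$1))
    names=""
    while read -r name mask; do
        # Keep the name if its bit is set in the value.
        if [ $((value & $mask)) -ne 0 ]; then
            names="${names:+$names | }$name"
        fi
    done <<'EOF'
VX_FULLFSCK 0x0001
VX_LOGBAD 0x0002
VX_NOLOG 0x0004
VX_RESIZE 0x0008
VX_LOGRESET 0x0010
VX_UPGRADING 0x0020
VX_UQUOTACHECK 0x0040
VX_GQUOTACHECK 0x0080
VX_METAIOERR 0x0100
VX_DATAIOERR 0x0200
EOF
    printf '%s\n' "$names"
}
```

For the sample superblock, `decode_vxfs_flags 301` prints `VX_FULLFSCK | VX_METAIOERR | VX_DATAIOERR`. Bits that are not in the table are silently ignored.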


2. Now that the file system is known to be flagged as corrupt, perform a full backup of your data.  It is also recommended to dump the metadata with the "metasave" utility, in case problems arise with fsdb later on.

3.  Run a full fsck with the -n option to see which inodes are marked bad:
 
% fsck -F vxfs -n /dev/vx/rdsk/rootdg/meta | grep "marked bad"

vxfs fsck: file system had I/O error(s) on meta-data.
vxfs fsck: file system had I/O error(s) on user data.
fileset 999 primary-ilist inode 2 marked bad, allocation flags (0x0001)
fileset 999 primary-ilist inode 3 marked bad, allocation flags (0x0001)
fileset 999 primary-ilist inode 10 marked bad, allocation flags (0x0001)


This indicates that inodes 2, 3, and 10 are marked bad.  
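
On a file system with many affected inodes, the fileset and inode numbers can be pulled out of the fsck report rather than read by eye. This is a minimal sketch, assuming the report lines have the same layout as the sample above; the function name bad_inodes_from_fsck is hypothetical.

```shell
# Hypothetical helper: reads "fsck -F vxfs -n" output on stdin and prints
# one "<fileset> <inode>" pair per "marked bad" line.
# Assumes the layout: fileset N primary-ilist inode M marked bad, ...
bad_inodes_from_fsck() {
    awk '$1 == "fileset" && /marked bad/ { print $2, $5 }'
}
```

For the sample output it prints the pairs `999 2`, `999 3`, and `999 10`, one per line.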

4.  Set the "aflag" field to 0x0 using fsdb.  This step must be done very carefully since it involves writing to the file system structure itself. Incorrect use of fsdb can destroy the file system.

Clear the aflag field for inodes 2, 3, and 10 in fileset 999:
 
% echo "999fset.2i.af=0x0" | fsdb -F vxfs /dev/vx/rdsk/rootdg/meta
0000028a.0230: 0
% echo "999fset.3i.af=0x0" | fsdb -F vxfs /dev/vx/rdsk/rootdg/meta
0000028a.0330: 0
% echo "999fset.10i.af=0x0" | fsdb -F vxfs /dev/vx/rdsk/rootdg/meta
0000028a.0a30: 0


Again, the device needs to be the raw device for the file system.
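
Because a mistyped fsdb expression can damage the file system, it may help to generate the commands and review them before running anything. The sketch below only prints the commands, in the same form as those shown above, and never invokes fsdb itself; the function name print_aflag_commands is hypothetical, and its input is assumed to be one "fileset inode" pair per line.

```shell
# Hypothetical helper: reads "<fileset> <inode>" pairs on stdin and PRINTS
# (does not execute) the fsdb command that would clear each inode's aflag.
# $1 = raw device of the unmounted file system
print_aflag_commands() {
    device=$1
    while read -r fset ino; do
        printf 'echo "%sfset.%si.af=0x0" | fsdb -F vxfs %s\n' "$fset" "$ino" "$device"
    done
}
```

For example, `echo "999 10" | print_aflag_commands /dev/vx/rdsk/rootdg/meta` prints the third command shown above. Each printed line should be checked against the fsck report before it is run by hand.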

5.  The aflag has now been cleared for the three inodes.  Verify with fsck that no "marked bad" lines remain:
 
% fsck -F vxfs -n /dev/vx/rdsk/rootdg/meta | grep "marked bad"

vxfs fsck: file system had I/O error(s) on meta-data.
vxfs fsck: file system had I/O error(s) on user data.
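
The absence of "marked bad" lines can also be checked by exit status, which is convenient when the check is scripted. This is a minimal sketch; the function name aflags_cleared is hypothetical.

```shell
# Hypothetical helper: reads "fsck -F vxfs -n" output on stdin and succeeds
# (exit status 0) only if no inode is still reported as "marked bad".
aflags_cleared() {
    ! grep -q 'marked bad'
}
```

It could be used as `fsck -F vxfs -n /dev/vx/rdsk/rootdg/meta | aflags_cleared && echo "no bad inodes reported"`.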
 

6.  Now it should be safe to run a full fsck with the -y option:
 
% fsck -F vxfs -y /dev/vx/rdsk/rootdg/meta

vxfs fsck: file system had I/O error(s) on meta-data.
vxfs fsck: file system had I/O error(s) on user data.
log replay in progress
file system is not clean, full fsck required
pass0 - checking structural files
pass1 - checking inode sanity and blocks
pass2 - checking directory linkage
pass3 - checking reference counts
pass4 - checking resource maps
OK to clear log? (ynq)y
set state to CLEAN? (ynq)y


7.  Mount the file system and check the inodes:
 
% mount -F vxfs /dev/vx/dsk/rootdg/meta /meta

% ls -li /meta

total 28672224
4 -rw-r-----   1 vray     101      2097160192 Mar  8 12:35 data1_01a.dbf
5 -rw-r-----   1 vray     101      2097160192 Mar  8 12:35 data1_01b.dbf
6 -rw-r-----   1 vray     101      2097160192 Mar  8 12:35 data1_01c.dbf
7 -rw-r-----   1 vray     101      2097160192 Mar  8 12:35 data1_01d.dbf
8 -rw-r-----   1 vray     101      2097160192 Mar  8 12:35 data1_01e.dbf
9 -rw-r-----   1 vray     101      2097160192 Mar  8 12:35 data1_01f.dbf
10 -rw-r----- 1 root     other   2097160192 Mar 15 13:45 file1.dbf
3 drwxr-xr-x   2 vray   101      96 Mar  4 12:46 lost+found/
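
To confirm that a specific inode survived, the first column of the "ls -li" listing can be searched for its number. This is a minimal sketch; the function name inode_listed is hypothetical. Note that the listing of /meta shows only the entries inside the directory, so the inode of the mount point itself would have to be checked separately (for example with `ls -lid /meta`).

```shell
# Hypothetical helper: reads "ls -li" output on stdin and succeeds
# (exit status 0) if the given inode number appears in the first column.
# $1 = inode number to look for
inode_listed() {
    awk -v ino="$1" '$1 == ino { found = 1 } END { exit !found }'
}
```

For example, `ls -li /meta | inode_listed 10 && echo "inode 10 present"` reports whether inode 10 is still present after the repair.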
