VCS Fails to Clean ZFS Mount Resource

Article ID: 100028237

Description

Error Message

Under 5.0MP3, the Mount agent runs 'umount' without any force option. If this fails, the agent attempts to clear the filesystem with fuser before unmounting again. If the conditions listed under Issue/Introduction are met, fuser returns no output and kills no processes. The logs show the umount attempts failing repeatedly with "Device busy":

2012/10/23 12:33:31 VCS INFO V-16-2-13001 (node1) Resource(zfs_mount): Output of the completed operation (clean) 
/mount/zfsmountpoint: 
cannot unmount '/mount/zfsmountpoint': Device busy 
/mount/zfsmountpoint: 
cannot unmount '/mount/zfsmountpoint': Device busy 
/mount/zfsmountpoint: 
cannot unmount '/mount/zfsmountpoint': Device busy 
/mount/zfsmountpoint: 
cannot unmount '/mount/zfsmountpoint': Device busy 
/mount/zfsmountpoint: 
cannot unmount '/mount/zfsmountpoint': Device busy 
2012/10/23 12:33:31 VCS ERROR V-16-2-13069 (node1) Resource(zfs_mount) - clean failed. 
2012/10/23 12:34:32 VCS ERROR V-16-2-13077 (node1) Agent is unable to offline resource(zfs_mount). Administrative intervention may be required. 

The mount resource ends up faulting, which can prevent a successful failover of the service group (SG).

Under 5.1SP1 and later, the Mount agent first runs a regular unmount against the filesystem. If the filesystem is busy, the agent runs fuser (which may or may not work) followed by a force unmount. The force unmount usually succeeds even if the filesystem is busy. However, because of the same fuser bug, processes that are using the mount point are not killed. The cluster can fail over successfully, but this is still not the expected behavior for the Mount agent.
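The 5.1SP1 sequence described above can be pictured roughly as follows. This is an illustrative sketch only, not the Mount agent's actual code: the function name offline_sequence and the UMOUNT and FUSER override variables are hypothetical, and the exact commands the agent issues may differ.

```shell
#!/bin/sh
# Illustrative sketch only; this is NOT the Mount agent's actual code.
# UMOUNT and FUSER are hypothetical override hooks so the flow can be
# exercised with stand-in commands.
UMOUNT="${UMOUNT:-umount}"
FUSER="${FUSER:-fuser}"

# Walks through the described sequence for one mount point ($1).
offline_sequence() {
    mnt="$1"

    # Step 1: regular unmount; done if it succeeds.
    if "$UMOUNT" "$mnt"; then
        return 0
    fi

    # Step 2: the filesystem is busy, so try to kill the processes using it.
    # With the fuser bug this may print nothing and kill nothing.
    # Continue regardless of fuser's result.
    "$FUSER" -ck "$mnt" || true

    # Step 3: force unmount, which usually succeeds even when busy.
    "$UMOUNT" -f "$mnt"
}
```

Because step 3 does not depend on step 2 having killed anything, the unmount can succeed while the processes that held the filesystem open keep running, which matches the behavior described above.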

Cause

Oracle has bug ID 15866492 for this issue. The fix does not appear to be publicly available at this time.

Resolution

Contact Oracle for more information about obtaining a patch. As a workaround, create a dummy Application resource that runs fuser BEFORE any unmounts are attempted, so that fuser runs while it still works:

1. Create an Application resource, and make it depend on your ZFS mount resources, so that it is taken offline before they are unmounted.
2. For the Online attribute (StartProgram), add a small script that creates a lockfile, such as "touch lockfile".
3. For the Offline attribute (StopProgram), add another small script that removes the lockfile and also runs 'fuser -ck' against your ZFS filesystems.
4. For the Monitor attribute (MonitorProgram), add a script that checks for the lockfile. As long as the file exists, the Application resource stays online. Here is an example of a simple monitor script that should suit your needs:

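# Monitor script for the dummy Application resource.
# Exit code 110 reports the resource as online, 100 as offline.
$LockFile = "/process/lockfile.lck";

if (-f "$LockFile") {
    exit(110);
} else {
    exit(100);
}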
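The Online and Offline scripts from steps 2 and 3 might look something like the sketch below. It is a minimal example under assumptions: the function names app_online and app_offline, the ZFS_MOUNTS list, and the FUSER override variable are hypothetical, and the lockfile path is simply the one used by the monitor script.

```shell
#!/bin/sh
# Sketch of the Online/Offline logic for the dummy Application resource.
# LOCKFILE matches the monitor script; ZFS_MOUNTS is a hypothetical
# space-separated list of ZFS mount points. Adjust both for your site.
LOCKFILE="${LOCKFILE:-/process/lockfile.lck}"
ZFS_MOUNTS="${ZFS_MOUNTS:-/mount/zfsmountpoint}"
FUSER="${FUSER:-fuser}"

# Online: create the lockfile that the monitor script checks for.
app_online() {
    touch "$LOCKFILE"
}

# Offline: remove the lockfile, then kill the processes holding each ZFS
# filesystem open BEFORE the Mount agent attempts any unmount.
app_offline() {
    rm -f "$LOCKFILE"
    for fs in $ZFS_MOUNTS; do
        # The exit status of fuser is ignored so the offline still completes.
        "$FUSER" -ck "$fs" || true
    done
}

# Dispatch on the first argument so one file can serve both attributes.
case "${1:-}" in
    online)  app_online ;;
    offline) app_offline ;;
esac
```

Assuming the script is saved as /opt/scripts/zfs_fuser.sh (a hypothetical path), StartProgram could be set to "/opt/scripts/zfs_fuser.sh online" and StopProgram to "/opt/scripts/zfs_fuser.sh offline".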


Applies To

Internal testing has shown that this issue exists on multiple versions of Solaris, on both x86 and SPARC. It has been seen on Solaris 10 Updates 8, 9, and 10.

Issue/Introduction

ZFS filesystems may fail to clean successfully if they are held open during the offline attempt. This can happen due to a Solaris OS bug related to the fuser command.

The fuser binary stops working when all of the following are true:

1. The filesystem is ZFS.
2. A subdirectory underneath the mount point is held open.
3. The mount point itself is NOT held open by anything.
4. An unmount attempt has resulted in a busy filesystem warning.
5. fuser -c or fuser -ck is then run against the ZFS filesystem.

Essentially, fuser works perfectly until umount is run against a busy ZFS filesystem. From that point forward, the fuser command returns nothing and fails to kill any remaining processes. If any process holds the mount point itself open (not just a subdirectory), fuser works normally.
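To check whether a test system is affected, the conditions above could be walked through with a sketch like the following. The MNT and SUBDIR paths, the run helper, and the DRY_RUN switch are all hypothetical. By default the script only prints the commands it would run, and it should not be pointed at a production filesystem.

```shell
#!/bin/sh
# Reproduction sketch for a TEST system only. MNT and SUBDIR are
# hypothetical paths. DRY_RUN=1 (the default) prints each command
# instead of executing it.
MNT="${MNT:-/mount/zfsmountpoint}"
SUBDIR="${SUBDIR:-$MNT/subdir}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reproduce() {
    # Conditions 2 and 3: a background process holds only the subdirectory open.
    run sh -c "cd '$SUBDIR' && exec sleep 600 &"
    # Give the background holder a moment to start.
    run sleep 1
    # Before any unmount attempt, fuser is expected to report the holder.
    run fuser -c "$MNT"
    # Condition 4: the unmount attempt is expected to fail with a busy warning.
    run umount "$MNT"
    # Condition 5: on an affected system, this call now reports nothing.
    run fuser -c "$MNT"
}

# Run the sequence only when explicitly asked to.
if [ "${1:-}" = "run" ]; then
    reproduce
fi
```

Comparing the output of the two fuser calls shows whether the bug is present: on an affected system the first lists the holding process and the second lists nothing, even though the sleep process is still running.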