NFS resource keeps faulting, going offline unexpectedly

Article ID: 100020099

Description

Error Message


Sample Error:

2008/12/30 16:29:20 VCS ERROR V-16-2-13067 (fred) Agent is calling clean for resource(NFS) because the resource became OFFLINE unexpectedly, on its own.
 

When debugging is enabled for the VCS NFS Agent, the log may show RPC calls to localhost failing with status=5 (timed out):
 
2008/12/30 16:29:20 VCS DBG_4 V-16-50-0 NFS:NFS:monitor:RPC call failed with status=5 for program=100003, protocol=udp, version=2 NFS.C:NFS_monitor[443]
 

Resolution

DIAGNOSTIC STEPS:

To configure Debug logging for the VCS NFS Agent (logging will be recorded to /var/VRTSvcs/log/NFS_A.log):

# haconf -makerw
# hatype -modify NFS LogDbg DBG_4 DBG_AGDEBUG DBG_AGINFO
# haconf -dump -makero
 
To disable VCS NFS Agent debugging:
 
# haconf -makerw
# hatype -modify NFS LogDbg -delete -keys
# haconf -dump -makero
 
What is status 5?
 
From /usr/include/rpc/clnt.h:
RPC_TIMEDOUT=5, /* call timed out */
status=5 means the RPC call timed out.
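
Once debug logging is on, it can help to see how often these failures occur and with which status. The following is a minimal sketch, assuming the message format shown in the sample error above. The function name summarize_rpc_failures is illustrative and not part of VCS.

```shell
# Summarize "RPC call failed" debug messages by RPC status code.
# Reads agent log lines on stdin and prints "<count> status=<code>" per code.
# summarize_rpc_failures is an illustrative helper, not a VCS command.
summarize_rpc_failures() {
    sed -n 's/.*RPC call failed with status=\([0-9][0-9]*\).*/status=\1/p' |
        sort | uniq -c | awk '{ print $1, $2 }'
}
```

It could be run as: summarize_rpc_failures < /var/VRTSvcs/log/NFS_A.log. A line ending in status=5 then gives the number of timed-out calls recorded in the log.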
 
From the VCS perspective:
 
- The agent opens a client handle to the local host.
- It calls NULLPROC, the null procedure.
- It checks the return status of that RPC call.
- If the return status is not successful and the NFS agent debug level is 4, it prints the following debug message to NFS_A.log:
 
VCSAG_LOGDBG_MSG(VCS_DBG4, VCS_DEFAULT_FLAGS, "RPC call failed with status=%d for program=%d, protocol=%s, version=%d", status, program, protocol, version);
 
The null procedure does no processing; it exists for diagnostic purposes.

The time-out of the RPC call to NULLPROC suggests the RPC server is occasionally too busy to service the request, resulting in seemingly-random resource faults.
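
To check outside of VCS whether the local RPC server intermittently fails to answer, the same null-procedure call can be repeated from the shell. The sketch below uses rpcinfo -u, which calls procedure 0 of the given program over UDP. The program number, protocol and version are taken from the debug message above. The function names probe_nfs_null and count_probe_failures are illustrative, and rpcinfo applies its own time-out, which may differ from the agent's.

```shell
# Repeatedly call the null procedure of the local NFS service and count failures.
# probe_nfs_null and count_probe_failures are illustrative names, not VCS commands.

# rpcinfo -u <host> <program> <version> calls procedure 0 (the null procedure)
# of the given program over UDP and exits non-zero if the call fails.
probe_nfs_null() {
    rpcinfo -u localhost 100003 2 >/dev/null 2>&1
}

# Usage: count_probe_failures <attempts> [probe-command]
# Prints the number of attempts for which the probe command failed.
count_probe_failures() {
    attempts=$1
    probe=${2:-probe_nfs_null}
    failures=0
    i=0
    while [ "$i" -lt "$attempts" ]; do
        "$probe" || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "$failures"
}
```

For example, count_probe_failures 100 run while the server is under load prints how many of 100 probes failed. A non-zero count is consistent with the RPC server being too busy to respond at times, but should be treated as indicative only.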

 
WORKAROUND: 
 
Increase the ToleranceLimit for the NFS resource type to 2.

This allows the monitor entrypoint to return OFFLINE two times before the resource is declared FAULTED.
 
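NOTE: Increasing ToleranceLimit is a workaround on the VCS side only; it does not address the underlying RPC time-outs. As with the LogDbg changes above, the configuration must be writable (haconf -makerw) before running hatype -modify.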
 
# hatype -modify NFS ToleranceLimit 2
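
The effect of ToleranceLimit on a sequence of monitor results can be sketched with a small shell model. This illustrates the behaviour described above and is not VCS code. It assumes that the count of OFFLINE results resets whenever the monitor reports ONLINE again; the function name first_fault_cycle is hypothetical.

```shell
# Model of how ToleranceLimit delays a fault (illustrative, not VCS code).
# Assumption: the count of OFFLINE results resets when monitor reports ONLINE.
# Usage: first_fault_cycle <tolerance-limit> <result>...
# Prints the 1-based monitor cycle at which the resource would be declared
# FAULTED, or "none" if the limit is never exceeded.
first_fault_cycle() {
    limit=$1
    shift
    offline=0
    cycle=0
    for result in "$@"; do
        cycle=$((cycle + 1))
        if [ "$result" = "OFFLINE" ]; then
            offline=$((offline + 1))
            if [ "$offline" -gt "$limit" ]; then
                echo "$cycle"
                return 0
            fi
        else
            offline=0
        fi
    done
    echo "none"
}
```

Under this model, a limit of 0 faults the resource on the first OFFLINE result, while a limit of 2 absorbs two consecutive OFFLINE results and faults only on the third. This is why isolated RPC time-outs no longer fault the resource, at the cost of slower detection of a genuine failure.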
 
 

 
