SRDF resource does not online on all the nodes in a parallel Service Group

Description

Error Message

=== The engine log contains the "resource is not up even after online completed" message ===

2013/05/14 11:09:46 VCS NOTICE V-16-1-10301 Initiating Online of Resource srdf_mxprd (Owner: Unspecified, Group: srdf_sg) on System node01
2013/05/14 11:09:46 VCS NOTICE V-16-1-10301 Initiating Online of Resource srdf_mxprd (Owner: Unspecified, Group: srdf_sg) on System node02
2013/05/14 11:09:46 VCS NOTICE V-16-1-10301 Initiating Online of Resource srdf_mxprd (Owner: Unspecified, Group: srdf_sg) on System node03

2013/05/14 11:20:48 VCS ERROR V-16-2-13066 (node02) Agent is calling clean for resource(srdf_mxprd) because the resource is not up even after online completed.
2013/05/14 11:20:52 VCS INFO V-16-2-13068 (node02) Resource(srdf_mxprd) - clean completed successfully.
2013/05/14 11:20:52 VCS INFO V-16-2-13071 (node02) Resource(srdf_mxprd): reached OnlineRetryLimit(0).
2013/05/14 11:20:53 VCS ERROR V-16-1-54031 Resource srdf_mxprd (Owner: Unspecified, Group: srdf_sg) is FAULTED on sys node02

2013/05/14 11:20:52 VCS ERROR V-16-2-13066 (node01) Agent is calling clean for resource(srdf_mxprd) because the resource is not up even after online completed.
2013/05/14 11:20:56 VCS INFO V-16-2-13068 (node01) Resource(srdf_mxprd) - clean completed successfully.
2013/05/14 11:20:56 VCS INFO V-16-2-13071 (node01) Resource(srdf_mxprd): reached OnlineRetryLimit(0).
2013/05/14 11:20:57 VCS ERROR V-16-1-54031 Resource srdf_mxprd (Owner: Unspecified, Group: srdf_sg) is FAULTED on sys node01
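To see quickly which nodes hit this condition, the engine log can be filtered on message ID V-16-2-13066. The following is a minimal sketch, not a supported tool: the function name faulted_online_nodes is hypothetical, and the default log path /var/VRTSvcs/log/engine_A.log is an assumption that may differ on your platform.

```shell
# Print each node on which the agent called clean after a failed online
# (message ID V-16-2-13066). The function name is hypothetical and the
# default log path is an assumption; pass another path as the first argument.
faulted_online_nodes() {
    log_file="${1:-/var/VRTSvcs/log/engine_A.log}"
    # The node name is the parenthesised token right after the message ID.
    grep 'V-16-2-13066' "$log_file" \
        | sed -n 's/.*V-16-2-13066 (\([^)]*\)).*/\1/p' \
        | sort -u
}
```

Run against the excerpt above, faulted_online_nodes would list node01 and node02, the two nodes that did not acquire the lock.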

When debugging is enabled on the SRDF agent, the following messages are logged in the engine log:

2013/05/14 12:09:58 VCS DBG_1 V-16-20017-0 (node03) SRDF:srdf_mxprd:online:Some other node [node02] has locked this resource. Hence exiting for [300] seconds.
2013/05/14 12:09:58 VCS DBG_1 V-16-20017-0 (node01) SRDF:srdf_mxprd:online:Some other node [node02] has locked this resource. Hence exiting for [300] seconds.
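The debug messages name both the node that gave up and the node holding the lock. A small sketch such as the following can summarise them; the function name srdf_lock_waiters and the output format are hypothetical, and the default log path is an assumption.

```shell
# Summarise "has locked this resource" debug messages as
# "waiter=<node> holder=<node>" pairs. The function name and output format
# are hypothetical; the default log path is an assumption.
srdf_lock_waiters() {
    log_file="${1:-/var/VRTSvcs/log/engine_A.log}"
    # First group: node that logged the message; second group: lock holder.
    sed -n 's/.*(\([^)]*\)) SRDF:[^:]*:online:Some other node \[\([^]]*\)] has locked this resource.*/waiter=\1 holder=\2/p' "$log_file" \
        | sort -u
}
```

For the two debug lines above, this would report node01 and node03 as waiters, each with node02 as the holder.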

Cause

The first node in the SFCFS cluster that onlines the SRDF resource locks it to perform the role swap. When the other nodes detect that the resource is locked, they exit the online entry point and do not retry the online, which causes the resource to fault on those nodes.

Resolution

This issue is tracked within Veritas as incident e3203165 and is fixed in SRDF agent version 5.0.16.0 and later.

The agent is available as part of the 2Q-2013 agent pack and can be downloaded from the link below.

https://docs.infoscale.com

Workaround:

Until the fixed agent can be installed, increase the OnlineRetryLimit of the SRDF resource type as a workaround.

Use the following procedure to configure it.

1. Open the VCS configuration for writing

# haconf -makerw

2. Increase the OnlineRetryLimit of the SRDF agent to 1, so that the online is retried after the first attempt fails

# hatype -modify SRDF OnlineRetryLimit 1

3. Reduce the OnlineWaitLimit of the SRDF agent from 2 to 1, to shorten the time VCS takes to detect that the online has failed

# hatype -modify SRDF OnlineWaitLimit 1

4. Save and close the VCS configuration

# haconf -dump -makero
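The four steps above can be collected into one script. This is a sketch under stated assumptions, not a supported procedure: the function names run and apply_srdf_workaround and the DRY_RUN switch are hypothetical additions for illustration; only the haconf and hatype commands come from the steps above.

```shell
#!/bin/sh
# Sketch of the workaround steps as a single function.
# DRY_RUN is a hypothetical switch: when set to 1, each command is printed
# instead of executed, so the sequence can be reviewed first.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

apply_srdf_workaround() {
    # Steps 1 to 3: open the configuration, then change both type-level limits.
    run haconf -makerw &&
    run hatype -modify SRDF OnlineRetryLimit 1 &&
    run hatype -modify SRDF OnlineWaitLimit 1
    rc=$?
    # Step 4: always attempt to save and close the configuration,
    # even if one of the earlier steps failed.
    run haconf -dump -makero || rc=$?
    return "$rc"
}
```

Setting DRY_RUN=1 before calling apply_srdf_workaround prints the four commands without touching the cluster; calling it without DRY_RUN executes them as root on one cluster node. Because hatype changes the attribute for the SRDF type, the new values apply to every SRDF resource that does not override them.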

Applies To

This issue applies to all Veritas Cluster Server (VCS) versions and platforms where an SRDF resource is configured in a parallel service group.

Issue/Introduction

In a Storage Foundation Cluster File System (SFCFS) environment, an SRDF resource is configured in a parallel service group. In this configuration, the resource does not come online on all nodes in the cluster.

Additional Information

ETrack: 3203165
