Veritas Cluster Server (VCS) EMC SRDF Agent 7.0.9.0 fails to online under VCS control within OnlineTimeout when performing srdf failover


Article ID: 100050788


Description

Error Message

 

The VCS "/var/VRTSvcs/log/SRDF_A.log" file contains the following key messages:
 

“VCS WARNING V-16-20017-27 SRDF:srdf_PROD:online:not enough time to issue failover command; administrative intervention or automatic restart required”

and
 

2021/05/08 18:37:56 VCS DBG_3 SRDF:srdf_PROD:monitor:run_cmd: Issuing cmd: [/opt/VRTSvcs/bin/hasys -nodeid]
2021/05/08 18:37:56 VCS DBG_3 SRDF:srdf_PROD:monitor:run_cmd: Issuing cmd: [/opt/VRTSvcs/bin/hares -state srdf_RS_PROD ]
2021/05/08 18:37:56 VCS DBG_2 SRDF:srdf_PROD:monitor:FindMode:Resource is not online in any of the cluster. Cannot compute RPO.
2021/05/08 18:37:56 VCS DBG_2 SRDF:srdf_PROD:monitor:MainLoop: Could not find the mode. Not doing anything in this iteration.
2021/05/08 18:37:56 VCS DBG_4 SRDF:srdf_PROD:monitor:[monitor]: ExtendMonitor set to null or is not defined. Nothing to perform
2021/05/08 18:37:57 VCS DBG_AGDEBUG script (/opt/VRTSvcs/bin/SRDF/monitor) exited with status (100)

Cause

 

In some circumstances, you may need to set the OnlineTimeout attribute for the VCS EMC SRDF resources to a higher value than the default of "300" seconds, so that the resource entry points do not time out.

Setting the OnlineTimeout attribute for the VCS EMC SRDF managed resources

To set the OnlineTimeout attribute for a single EMC device group (typically the case for SRDF), multiply the number of devices in the EMC device group by the time taken to fail over a single device, then add 10 seconds for launching the symrdf command.

For example:

 

If you have a single EMC device group containing 5 devices, and each device takes 50 seconds to fail over, the failover is expected to take no more than 260 seconds.

 

A single device is 50 seconds

Number of devices is 5

5 x 50 = 250

10 seconds are added to cover the launch of the symrdf command.

Therefore, set the OnlineTimeout attribute to [(5 x 50) + 10] = 260 seconds.
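
The arithmetic above can be sketched as a small shell function. This is an illustration only: the function name estimate_online_timeout is hypothetical and not part of VCS, and the 50 seconds per device and the 10-second symrdf launch overhead are the figures from this example, not measured values.

```shell
#!/bin/sh
# Hypothetical helper (not part of VCS): estimate OnlineTimeout for one
# device group as (devices x seconds per device) + symrdf launch overhead.
estimate_online_timeout() {
    devices=$1          # number of devices in the device group
    per_device=$2       # seconds needed to fail over one device
    overhead=${3:-10}   # seconds to launch symrdf (10 in the example)
    echo $(( devices * per_device + overhead ))
}

# Example from the text: 5 devices at 50 seconds each.
estimate_online_timeout 5 50    # prints 260
```

Substitute the per-device time observed in your own environment; the result is a starting point to be confirmed by a test failover.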

 

To set the OnlineTimeout attribute for multiple device groups (currently not supported by SRDF), calculate the OnlineTimeout attribute for all device groups and set the OnlineTimeout attribute to at least the amount of time the largest device group takes to fail over.
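
As a rough sketch of the multiple-group rule, the following picks the largest of several per-group timeouts. The function name largest_timeout is hypothetical, and the three input values are made up for illustration; each would first be calculated for its own device group.

```shell
#!/bin/sh
# Hypothetical helper: print the largest of the per-group timeouts passed
# as arguments (each value already calculated for its device group).
largest_timeout() {
    max=0
    for t in "$@"; do
        [ "$t" -gt "$max" ] && max=$t
    done
    echo "$max"
}

# Illustrative values only, not taken from a real array.
largest_timeout 260 410 130    # prints 410
```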

 

Resolution

 

Veritas suggests increasing the OnlineTimeout value above the default of 300 seconds (for example, to 400 or higher) when the symrdf failover does not complete before the OnlineTimeout value is reached.

Once changed, attempt the EMC "symrdf failover" again and confirm whether OnlineTimeout needs to be increased further to accommodate the longer symrdf failover times.

 

Steps:

 

To set the "OnlineTimeout" to 400 seconds for the SRDF resource type:
 

# haconf -makerw

# hatype -modify SRDF OnlineTimeout 400

# haconf -dump -makero

 

If the above approach does not help, try changing the "DevFOTime" attribute value to "4" as follows, replacing resname with the name of the SRDF resource:

 

# haconf -makerw

# hares -modify resname DevFOTime 4

# haconf -dump -makero
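
After either change, the values in effect can be read back with the VCS query commands hatype -value and hares -value. The wrapper below is a minimal sketch; the function name show_srdf_timeouts is hypothetical, and it assumes DevFOTime is set globally on the resource, as in the sample values shown later in this article.

```shell
#!/bin/sh
# Hypothetical helper: print the current OnlineTimeout of the SRDF type
# and the DevFOTime of one SRDF resource.
show_srdf_timeouts() {
    res=$1    # SRDF resource name (the name used in place of "resname")
    echo "OnlineTimeout=$(hatype -value SRDF OnlineTimeout)"
    echo "DevFOTime=$(hares -value "$res" DevFOTime)"
}
```

On a cluster node this could be called as show_srdf_timeouts srdf_PROD, where srdf_PROD is the resource name seen in the log excerpt above.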

 

NOTE: DevFOTime is the average time in seconds that is required for each device or composite group to fail over. This value helps the agent to determine whether it has adequate time for the online operation after waiting for other device or composite groups to fail over. If the online operation cannot be completed in the remaining time, the failover does not proceed. The default is 2.

 

Veritas provides the sigma script to get recommendations for VCS attribute values.

 

# /opt/VRTSvcs/bin/SRDF/sigma

 

Run the script on a node where VCS is running and has the SRDF devices and agent configured.

 

The sigma calculator adds 10 seconds to the value for each device group to compensate for the overhead of launching the EMC symrdf command.

 

Specify a revised value with the sigma script if launching the symrdf command takes less or more time than this.

 

The script assumes that VCS manages all devices in the array. Other operations outside of VCS that hold the array lock might delay the online operation unexpectedly.

 

Sample output

 

# /opt/VRTSvcs/bin/SRDF/sigma

                                      SRDF Type Sigma Calculator

RDF1 Group         Devs    DevFOTime  Total Time  OnlineTimeout  Subtotal
srdf-prod          145          2         300            400       300


-----------------  ----    ---------  ----------  -------------  --------

 

Recommendation:

 

Increase the VCS SRDF resource OnlineTimeout to "400", then confirm that this is sufficient to complete the failover of the SRDF device group(s).

The original values were:

SRDF               OnlineTimeout          300

srdf_PROD        DevFOTime              global     2

In this case, increasing the SRDF resource OnlineTimeout value from 300 to 400 allowed the failover to complete. The value required depends on many external factors and is environment specific.
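
The numbers in the sample output can be checked by hand. The sketch below assumes that Total Time is calculated as Devs x DevFOTime plus the 10-second symrdf overhead, which is consistent with the sample row (145 x 2 + 10 = 300) but is inferred rather than taken from the sigma script itself. The function name timeout_headroom is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: seconds left over when a device group's estimated
# failover time is subtracted from OnlineTimeout.
timeout_headroom() {
    devs=$1             # number of devices in the device group
    dev_fo_time=$2      # DevFOTime, seconds per device
    online_timeout=$3   # OnlineTimeout, seconds
    # Assumed formula: devices x DevFOTime + 10 s symrdf launch overhead.
    total=$(( devs * dev_fo_time + 10 ))
    echo $(( online_timeout - total ))
}

timeout_headroom 145 2 300    # prints 0   (original OnlineTimeout)
timeout_headroom 145 2 400    # prints 100 (recommended OnlineTimeout)
```

Under this assumption the original setting left no spare time at all, while 400 leaves 100 seconds of margin.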

 

KEY POINTS:


The VCS SRDF resource expects the SYMCLI operations to complete within OnlineTimeout (300 seconds by default).
 

Adding SYMDEVs to the SRDF groups can increase the SYMCLI command run times, so the default OnlineTimeout value may need to be increased in some cases. Whether this is required is environment dependent.
 

To debug the issue further, Dell EMC may be able to assist with understanding why the EMC SYMCLI commands are taking more time than expected.
 

Generally, Veritas recommends running the EMC "symcfg discover" command to update the SYMCLI database when adding LUNs to or removing LUNs from the device group, or when performing other similar EMC reconfiguration operations.
 

Veritas also recommends upgrading to the latest version of the VCS EMC SRDF agent available on sort.veritas.com, as it provides support for the latest Solutions Enabler (SYMCLI) versions, i.e. 9.0 and 9.1.

 

Issue/Introduction


Additional EMC SYMDEVs are added to the EMC SRDF device group containing the srdf-r1 and srdf-r2 devices.

When the failover functionality of EMC SRDF managed resources is tested under Veritas Cluster Server (VCS) control, the VCS EMC SRDF managed resource fails to come online within the default VCS resource OnlineTimeout. This has occurred since the new EMC SYMDEVs were added to the EMC device group.

Upon further troubleshooting, the EMC SRDF failover operation is tested outside of VCS, and the symrdf commands complete successfully when performed manually. However, the symrdf failover commands seem to take a considerable amount of time.

The VCS SRDF managed service group is then manually onlined using VCS after the EMC symrdf failover has completed.
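
When the symrdf failover is tested manually as described above, recording how long it takes gives a concrete figure to compare against OnlineTimeout. The wrapper below is a minimal sketch: the function name time_command is hypothetical, and the device group name in the commented usage line is taken from the sample sigma output and should be replaced with your own.

```shell
#!/bin/sh
# Hypothetical helper: run any command and report its wall-clock duration
# in whole seconds together with its exit status.
time_command() {
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    echo "elapsed_seconds=$(( end - start )) exit_status=$status"
    return $status
}

# Usage sketch (not executed here, because a failover is disruptive):
#   time_command symrdf -g srdf-prod failover -noprompt
```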