VCS Hitachi Universal Replicator agent fails to take over when the remote site HORCM daemon is not running

Article ID: 100012060

Updated On:

Description

Error Message


2014/02/26 13:50:07 VCS NOTICE V-16-1-10301 Initiating Online of Resource GRP_DB_TUR_htc_grp (Owner: Unspecified, Group: Hitachi_HUR_grp) on System drhost08
2014/02/26 13:50:08 VCS NOTICE V-16-20038-6809 (drhost08) HTC:GRP_DB_TUR_htc_grp:online:state of GRP_DB_TUR is PAIR; role is S-VOL
2014/02/26 13:50:08 VCS NOTICE V-16-20038-6810 (drhost08) HTC:GRP_DB_TUR_htc_grp:online:checking for replication link failure
2014/02/26 13:53:32 VCS NOTICE V-16-20038-6815 (drhost08) HTC:GRP_DB_TUR_htc_grp:online:issuing horctakover of device group GRP_DB_TUR with command: "/HORCM/usr/bin"/horctakeover -t 270 -g GRP_DB_TUR 2>&1
2014/02/26 13:55:09 VCS INFO V-16-2-13845 (drhost08) Resource(GRP_DB_TUR_htc_grp): Output of the timed out operation (online)
==============================================
pairdisplay: [EX_ENORMT] No remote host alive for remote commands or Remote Raid Manager might be blocked (sleeping) on an existing I/O.
Refer to the command log(/HORCM/log33/horcc_drhost08.log) for details.

==============================================

2014/02/26 13:55:09 VCS WARNING V-16-2-13012 (drhost08) Resource(GRP_DB_TUR_htc_grp): online procedure did not complete within the expected time.
2014/02/26 13:55:09 VCS ERROR V-16-2-13065 (drhost08) Agent is calling clean for resource(GRP_DB_TUR_htc_grp) because online did not complete within the expected time.
2014/02/26 13:55:09 VCS NOTICE V-16-20038-6809 (drhost08) HTC:GRP_DB_TUR_htc_grp:clean:state of GRP_DB_TUR is PAIR; role is S-VOL
2014/02/26 13:55:09 VCS NOTICE V-16-20038-6810 (drhost08) HTC:GRP_DB_TUR_htc_grp:clean:checking for replication link failure
2014/02/26 13:56:10 VCS ERROR V-16-2-13006 (drhost08) Resource(GRP_DB_TUR_htc_grp): clean procedure did not complete within the expected time.

 

Cause

The HTC agent depends on the default poll and timeout values in the /etc/horcm.conf file. For the HORCM daemon, the default poll interval is 10 seconds and the default timeout is 30 seconds, as shown below (both columns are expressed in units of 10 ms):

HORCM_MON
#ip_address     service         poll(10ms)     timeout(10ms)
192.168.1.2     horcm33         1000           3000
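
Because the poll and timeout columns are expressed in units of 10 ms, it is easy to misread them. The following is a minimal sketch, not part of the HTC agent or of RAID Manager, that reads the first HORCM_MON entry from the text of a horcm.conf file and converts both values to seconds. The function name parse_horcm_mon is hypothetical, and the sketch assumes the four-column layout shown above.

```python
def parse_horcm_mon(conf_text):
    """Return (ip_address, service, poll_seconds, timeout_seconds)
    for the first entry in the HORCM_MON section.

    Assumes the four-column layout shown in this article, with poll
    and timeout given in units of 10 ms. A poll value of -1 (paired
    volumes not monitored) is returned as None.
    """
    in_section = False
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        if line.startswith("HORCM_"):
            # A new section keyword starts; track whether it is HORCM_MON.
            in_section = (line == "HORCM_MON")
            continue
        if in_section:
            ip, service, poll, timeout = line.split()[:4]
            poll_s = None if int(poll) == -1 else int(poll) / 100
            return ip, service, poll_s, int(timeout) / 100
    raise ValueError("no HORCM_MON entry found")


sample = """HORCM_MON
#ip_address     service         poll(10ms)     timeout(10ms)
192.168.1.2     horcm33         1000           3000
"""
print(parse_horcm_mon(sample))  # poll 10.0 s, timeout 30.0 s
```

Dividing by 100 converts 10 ms units to seconds, so the default entry of 1000 and 3000 reads as 10 and 30 seconds.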

The current version of the VCS HTC agent fails to bring the resource online when these default values in the horcm.conf file are modified. An example of modified HORCM RAID Manager values is shown below:

HORCM_MON
#ip_address        service         poll(10ms)     timeout(10ms)
192.168.1.2        horcm33         3000           10200

Poll: The interval for monitoring paired volumes. If set to -1, the paired volumes are not monitored. The value of -1 is specified when two or more CCI instances run on a single machine.
Timeout: The time-out period of communication with the remote server.
The default timeout is 30 seconds, which appears in the horcm.conf file as 3000 (in units of 10 ms).

To reduce the load on the HORCM daemon, HDS recommends lengthening this poll interval. Doing so, however, causes a timeout conflict with the VCS HTC agent, which can leave the agent unable to manage the HTC resource properly.

HORCM_MON
#ip_address     service         poll(10ms)     timeout(10ms)
Local IP         horcm0          1000           3000

Resolution

The poll interval and timeout values were reset to their defaults in the horcm.conf file. The VCS HTC agent normally assumes the default timeout of the HORCM daemon, which is 30 seconds. The HTC agent documentation is expected to include recommended values for the HTC agent attributes when the RAID Manager timeout is modified. Until then, either use the default horcm.conf values or, if the defaults are changed in the horcm.conf file, explicitly update the HTC agent OnlineTimeout, MonitorTimeout and ActionTimeout values:
o OnlineTimeout should be more than four times the remote RAID Manager timeout, plus some additional buffer time (~10 sec)
o MonitorTimeout should be more than twice the remote RAID Manager timeout, plus some additional buffer time (~10 sec)
o ActionTimeout should be more than twice the remote RAID Manager timeout
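
The three rules above amount to simple arithmetic on the remote RAID Manager timeout. The following is a minimal sketch of that calculation; the function name and the dictionary layout are hypothetical, and the 10-second buffer is taken from the approximate figure given above. The results are lower bounds, so the attribute values actually configured should be larger than the numbers returned.

```python
def htc_timeout_floors(horcm_timeout_10ms, buffer_s=10):
    """Return lower bounds, in seconds, for the HTC agent timeouts.

    horcm_timeout_10ms is the timeout column from HORCM_MON on the
    remote site, in units of 10 ms (for example 3000 for 30 seconds).
    Each attribute should be set to MORE than the value returned here.
    """
    timeout_s = horcm_timeout_10ms / 100  # 10 ms units -> seconds
    return {
        "OnlineTimeout": 4 * timeout_s + buffer_s,
        "MonitorTimeout": 2 * timeout_s + buffer_s,
        "ActionTimeout": 2 * timeout_s,
    }


# Default horcm.conf timeout (3000) and the modified example (10200).
for value in (3000, 10200):
    print(value, htc_timeout_floors(value))
```

For the modified example in this article, a timeout of 10200 (102 seconds) gives floors of 418 seconds for OnlineTimeout, 214 seconds for MonitorTimeout and 204 seconds for ActionTimeout. With the default of 3000 (30 seconds) the floors are 130, 70 and 60 seconds.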
 

 


Applies To

RHEL 6.6 x86_64
SFCFSHA 6.1 for Linux
VCS HTC Agent version : VRTSvcstc-5.0.11.0-Linux_GENERIC.noarch
 

HORCM version Details :
Model : RAID-Manager/Linux
Ver&Rev: 01-28-03/05
 

What is HORCM?

Hitachi Command Control Interface software (CCI) has two components, one residing on the storage system and another residing on the server. The HORCM operational environment runs as a daemon process on the host server. The HORCM instance communicates with the storage system and remote servers. Hitachi Universal Replicator software requires two HORCM instances: one to manage the P-VOLs and the other to manage the S-VOLs. The HORCM configuration file (/etc/horcm.conf) defines the communication path and the logical units (LUs) to be controlled. Each HORCM instance has its own HORCM configuration file. The horcm0.conf and horcm1.conf files are used for Instance 0 and Instance 1, respectively.

Also, modify the /etc/services file to register the port name and port number for each HORCM instance on each server. The HORCM entries in the /etc/services file must be the same on all servers.

 

Issue/Introduction

As part of a disaster failover test, the HORCM service was manually shut down on the production site (/etc/init.d/horcm.sh stop) after the HTC service group was taken offline. A manual online of the HTC resource at the remote Disaster Recovery (DR) site then times out and the resource fails to come online.