Best practice for tracing why the heartbeat Symm agent went DOWN

Article ID: 100004853

Updated On:

Description

Error Message

## Current Status of Symm on the node, antec
-- WAN HEARTBEAT STATE
-- Heartbeat       To                   State

L  Icmp               SRDF-CLUSTER            ALIVE
L  Symm            SRDF-CLUSTER            DOWN

[ ERROR  MESSAGES ]
/var/VRTSvcs/log/engine_A.log
----------------------------------------------------------
2011/01/29 17:39:02 VCS INFO V-16-1-50133 User admin has logged in from 10.92.248.161

2011/01/29 17:40:42 VCS NOTICE V-16-1-50983 Resource PRD-SYMC-SYMCZ1-ZONE is offline on system antec in cluster Remo-Clust           <<<<<<<<  #1 All resources were detected as "offline" at nearly the same time.
2011/01/29 17:40:54 VCS NOTICE V-16-1-50983 Resource PRD-SYMC-SRDF is offline on system antec in cluster Remo-Clust
2011/01/29 17:41:00 VCS NOTICE V-16-1-50983 Resource PRD-SYMC-SYMCZ1-SYS-FS is offline on system antec in cluster Remo-Clust
2011/01/29 17:41:01 VCS NOTICE V-16-1-50983 Resource PRD-SYMC-SYMCZ1-ORACLIENT-FS is offline on system antec in cluster Remo-Clust
2011/01/29 17:41:04 VCS NOTICE V-16-1-50983 Resource PRD-SYMC-SYMCZ1-HOME-FS is offline on system antec in cluster Remo-Clust
2011/01/29 17:41:07 VCS NOTICE V-16-1-50983 Resource PRD-SYMC-SYMCZ1-ESSAGENT-FS is offline on system antec in cluster Remo-Clust
2011/01/29 17:41:09 VCS NOTICE V-16-1-50983 Resource PRD-SYMC-SYMCZ1-DAZEL-FS is offline on system antec in cluster Remo-Clust
2011/01/29 17:41:12 VCS NOTICE V-16-1-50983 Resource PRD-SYMC-SYMCZ1-CGNDM is offline on system antec in cluster Remo-Clust
2011/01/29 17:41:15 VCS NOTICE V-16-1-50983 Resource PRD-SYMC-SYMCZ1-AFTP_PROD is offline on system antec in cluster Remo-Clust
2011/01/29 17:41:17 VCS NOTICE V-16-1-50983 Resource PRD-SYMC-SYMC-CGNDM is offline on system antec in cluster Remo-Clust
2011/01/29 17:41:22 VCS NOTICE V-16-1-50983 Resource PRD-SYMC-SYMC-Z1-IP is offline on system antec in cluster Remo-Clust
2011/01/29 17:41:31 VCS NOTICE V-16-1-50983 Resource PRD-SYMC-DGAFTPPRD is offline on system antec in cluster Remo-Clust
2011/01/29 17:41:37 VCS NOTICE V-16-1-50983 Resource PRD-SYMC-Z1-ZONEID is offline on system antec in cluster Remo-Clust
2011/01/29 17:41:39 VCS NOTICE V-16-1-50983 Resource PRD-SYMC-NDM is offline on system antec in cluster Remo-Clust
2011/01/29 18:23:58 VCS ERROR V-16-3-18211 (symc) Cluster SRDF-CLUSTER lost heartbeat Symm to cluster Remo-Clust                                           <<<<<<<<  #2 Then the Symm heartbeat was lost.
..

..
2011/01/29 19:26:57 VCS INFO V-16-1-10298 Resource PRD-SYMC-DR-PREP (Owner: unknown, Group: DR-SYMC-SG) is online on symc (VCS initiated)         <<<<<<<<  #3 After that, PRD-SYMC-DR-PREP went online on node symc.
2011/01/29 19:26:57 VCS NOTICE V-16-1-10447 Group DR-SYMC-SG is online on system symc          
2011/01/29 19:39:54 VCS INFO V-16-1-50859 Attempting to switch group PRD-SYMC-SG from system symc to system antec
2011/01/29 19:39:54 VCS INFO V-16-1-50135 User admin fired command: hagrp -switch PRD-SYMC-SG  antec  Remo-Clust  from 10.92.248.161
2011/01/29 19:39:54 VCS WARNING V-16-1-50871 Unable to online group PRD-SYMC-SG on system antec.  Child group dependency would be violated
2011/01/29 20:06:57 VCS INFO V-16-3-18210 (symc) Cluster SRDF-CLUSTER heartbeat Symm to cluster Remo-Clust is alive                                        <<<<<<<<  #4 Then the Symm heartbeat came back ALIVE.

Cause

- Devices were removed and then restored at the storage level.

Resolution

[ FINDINGS AND SUGGESTION ]

0. The configuration of VCS.

## main.cf
include "types.cf"
include "SRDFTypes.cf"

cluster SRDF-CLUSTER (
UserNames = { admin = aHHhIGdEIdIEgOEdED,
z_PRD-SYMC-SYMCZ1-ZONE = IJIeGGiOHgJQiHHfGF,
rijw = cJJhKNfHKhGFjGJdKM,
z_PRD-SYMC-SYMCZ1-ZONE_symc = aHIhFCfIIfEKdKGgHP }
ClusterAddress = "192.168.1.2"
Administrators = { admin, rijw }
)

remotecluster Remo-Clust (
ClusterAddress = "192.168.1.5"
)

heartbeat Icmp (
ClusterList = { Remo-Clust }
Arguments @Remo-Clust = { "192.168.1.5" }
)

heartbeat Symm (
ClusterList = { Remo-Clust }
StopTimeout @Remo-Clust = 60
AYARetryLimit @Remo-Clust = 2
Arguments @Remo-Clust = { 0267 }
)

[ Comment ] According to main.cf, VCS uses two methods to determine whether the remote site is alive: ICMP pings and the EMC SYMCLI command line.
That is why two heartbeat agents, "Icmp" and "Symm", are configured in the VCS configuration.
For more details, see the excerpt below from the SRDF agent administration guide.

[ About cluster heartbeats ]
In a replicated data cluster, robust heartbeating is accomplished through dual, dedicated networks over which the Low Latency Transport (LLT) runs.
Additionally, you can configure a low-priority heartbeat across public networks.

In a global cluster, Cluster Server sends ICMP pings over the public network between the two sites for network heartbeating.
1)  To minimize the risk of split-brain, VCS sends ICMP pings to highly available IP addresses.
VCS global clusters also notify the administrators when the sites cannot communicate.

2)  In global clusters, the VCS Heartbeat agent sends heartbeats directly between the Symmetrix arrays if the Symmetrix ID of each array is known.
This heartbeat offers the following advantages:
- The Symmetrix heartbeat shows that the arrays are alive even if the ICMP heartbeats over the public network are lost. So, VCS does not mistakenly interpret this loss of heartbeats as a site failure.
- Heartbeat loss may occur due to the failure of all hosts in the primary cluster. In such a scenario, a failover may be required even if the array is alive.
 In any case, a host-only crash and a complete site failure must be distinguished. In a host-only crash, only the ICMP heartbeat signals a failure by an SNMP trap.
 No cluster failure notification occurs because a surviving heartbeat exists. This trap is the only notification to fail over an application.
- The heartbeat is then managed completely by VCS. VCS reports that the site is down

Therefore, judging from the error messages in this case, the VCS Heartbeat agent that sends heartbeats directly between the Symmetrix arrays appears to have failed temporarily.
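
The reasoning in the excerpt can be condensed into a small decision table. The sketch below is illustrative only: the function name classify_heartbeats and the wording of each verdict are assumptions made for this article and are not part of VCS. It takes the states of the Icmp and Symm heartbeats as shown under WAN HEARTBEAT STATE.

```shell
#!/bin/sh
# Hypothetical helper (not part of VCS): map the states of the two
# heartbeats to a likely interpretation, following the guide excerpt above.
# $1 = state of heartbeat Icmp, $2 = state of heartbeat Symm (ALIVE or DOWN)
classify_heartbeats() {
    case "$1/$2" in
        ALIVE/ALIVE) echo "both heartbeats alive: remote site reachable" ;;
        DOWN/ALIVE)  echo "arrays alive: network loss or host-only crash, not a site failure" ;;
        ALIVE/DOWN)  echo "hosts reachable: check the array or the SAN path used by symrdf" ;;
        DOWN/DOWN)   echo "no surviving heartbeat: possible site failure" ;;
        *)           echo "unknown state: $1/$2"; return 1 ;;
    esac
}

# State observed in this case: Icmp ALIVE, Symm DOWN
classify_heartbeats ALIVE DOWN
```

In this case only the Symm heartbeat was DOWN, which points the investigation at the storage path rather than at the public network.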


1. /var/adm/messages
----------------------------------------------------------
Jan 29 18:14:56 symc   transport rejected fatal error
Jan 29 18:21:55 symc fctl: [ID 517869 kern.warning] Warning: fp(5)::66600 NS failure pkt state=dreason=9, expln=1, NSCMD=0112, NSRSP=0000
Jan 29 18:21:55 symc fctl: [ID 517869 kern.warning] Warning: fp(5)::GPN_ID for D_ID=62400 failed
Jan 29 18:21:55 symc fctl: [ID 517869 kern.warning] Warning: fp(5)::N_x Port with D_ID=62400, PWWN=5006048accab115d disappeared from fabric
Jan 29 18:21:55 symc fctl: [ID 517869 kern.warning] Warning: fp(5)::66600 NS failure pkt state=dreason=9, expln=1, NSCMD=0112, NSRSP=0000
Jan 29 18:22:00 symc -- MARK --
Jan 29 18:22:15 symc scsi: [ID 243001 kern.info] /pci@0/pci@0/pci@9/SUNW,emlxs@0,1/fp@0,0 (fcp5):
Jan 29 18:22:15 symc   offlining lun=f3 (trace=0), target=62400 (trace=2800004)
Jan 29 18:22:15 symc scsi: [ID 243001 kern.info] /pci@0/pci@0/pci@9/SUNW,emlxs@0,1/fp@0,0 (fcp5):
Jan 29 18:22:15 symc   offlining lun=f2 (trace=0), target=62400 (trace=2800004)
Jan 29 18:22:15 symc scsi: [ID 243001 kern.info] /pci@0/pci@0/pci@9/SUNW,emlxs@0,1/fp@0,0 (fcp5):
Jan 29 18:22:15 symc   offlining lun=f1 (trace=0), target=62400 (trace=2800004)
Jan 29 18:22:15 symc scsi: [ID 107833 kern.warning] Warning: /pci@0/pci@0/pci@9/SUNW,emlxs@0,1/fp@0,0/ssd@w5006048accab115d,f0 (ssd11):
Jan 29 18:22:15 symc   Command failed to complete...Device is gone
Jan 29 18:22:15 symc scsi: [ID 243001 kern.info] /pci@0/pci@0/pci@9/SUNW,emlxs@0,1/fp@0,0 (fcp5):
Jan 29 18:22:15 symc   offlining lun=f0 (trace=0), target=62400 (trace=2800004)
Jan 29 18:22:15 symc scsi: [ID 107833 kern.warning] Warning: /pci@0/pci@0/pci@9/SUNW,emlxs@0,1/fp@0,0/ssd@w5006048accab115d,f0 (ssd11):
Jan 29 18:22:15 symc   transport rejected fatal error
Jan 29 18:22:15 symc genunix: [ID 408114 kern.info] /pci@0/pci@0/pci@9/SUNW,emlxs@0,1/fp@0,0/ssd@w5006048accab115d,f3 (ssd8) offline
Jan 29 18:22:15 symc genunix: [ID 408114 kern.info] /pci@0/pci@0/pci@9/SUNW,emlxs@0,1/fp@0,0/ssd@w5006048accab115d,f2 (ssd9) offline
Jan 29 18:22:15 symc genunix: [ID 408114 kern.info] /pci@0/pci@0/pci@9/SUNW,emlxs@0,1/fp@0,0/ssd@w5006048accab115d,f1 (ssd10) offline
Jan 29 18:23:58 symc Wac[5008]: [ID 702911 daemon.notice] VCS ERROR V-16-1-18211 Cluster SRDF-CLUSTER lost heartbeat Symm to cluster Remo-Clust                      <<<<<<<<< There was something wrong with heartbeat Symm.
Jan 29 18:23:58 symc Had[4699]: [ID 702911 daemon.notice] VCS ERROR V-16-1-18211 (symc) Cluster SRDF-CLUSTER lost heartbeat Symm to cluster Remo-Clust
.

.
[ Comment ] A closer look at /var/adm/messages (the system log) shows that changes occurred at the hardware level, and warning messages were logged, before VCS issued the error "VCS ERROR V-16-1-18211 Cluster SRDF-CLUSTER lost heartbeat Symm to cluster Remo-Clust". It is therefore reasonable to conclude that a fault or change at the hardware level caused this error.
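
The same correlation can be repeated on another system with a short filter. This is a minimal sketch: the function name events_before_heartbeat_loss is hypothetical, and the search patterns are simply the strings seen in the /var/adm/messages lines quoted above, so they may need adjusting for other HBA drivers.

```shell
#!/bin/sh
# Hypothetical helper: print the fabric/SCSI events that were logged
# before the first "lost heartbeat Symm" message in a syslog-style file.
# The patterns are taken from the /var/adm/messages lines quoted above.
events_before_heartbeat_loss() {
    awk '
        /lost heartbeat Symm/ { exit }      # stop at the first VCS error
        /disappeared from fabric|offlining lun|Device is gone|transport rejected/ { print }
    ' "$1"
}

# Usage:
#   events_before_heartbeat_loss /var/adm/messages
```

If the function prints nothing, no matching hardware event preceded the heartbeat loss and another cause should be considered.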


2. /var/VRTSvcs/log/engine_A.log
----------------------------------------------------------
..

..
2011/01/29 17:39:02 VCS INFO V-16-1-50133 User admin has logged in from 192.168.1.12

2011/01/29 17:40:42 VCS NOTICE V-16-1-50983 Resource PRD-SYMC-SYMCZ1-ZONE is offline on system antec in cluster Remo-Clust                        <<<<<<<<  #1 All resources were detected as "offline" at nearly the same time.
2011/01/29 17:40:54 VCS NOTICE V-16-1-50983 Resource PRD-SYMC-SRDF is offline on system antec in cluster Remo-Clust
2011/01/29 17:41:00 VCS NOTICE V-16-1-50983 Resource PRD-SYMC-SYMCZ1-SYS-FS is offline on system antec in cluster Remo-Clust
2011/01/29 17:41:01 VCS NOTICE V-16-1-50983 Resource PRD-SYMC-SYMCZ1-ORACLIENT-FS is offline on system antec in cluster Remo-Clust
2011/01/29 17:41:04 VCS NOTICE V-16-1-50983 Resource PRD-SYMC-SYMCZ1-HOME-FS is offline on system antec in cluster Remo-Clust
2011/01/29 17:41:07 VCS NOTICE V-16-1-50983 Resource PRD-SYMC-SYMCZ1-ESSAGENT-FS is offline on system antec in cluster Remo-Clust
2011/01/29 17:41:09 VCS NOTICE V-16-1-50983 Resource PRD-SYMC-SYMCZ1-DAZEL-FS is offline on system antec in cluster Remo-Clust
2011/01/29 17:41:12 VCS NOTICE V-16-1-50983 Resource PRD-SYMC-SYMCZ1-CGNDM is offline on system antec in cluster Remo-Clust
2011/01/29 17:41:15 VCS NOTICE V-16-1-50983 Resource PRD-SYMC-SYMCZ1-AFTP_PROD is offline on system antec in cluster Remo-Clust
2011/01/29 17:41:17 VCS NOTICE V-16-1-50983 Resource PRD-SYMC-SYMC-CGNDM is offline on system antec in cluster Remo-Clust
2011/01/29 17:41:22 VCS NOTICE V-16-1-50983 Resource PRD-SYMC-SYMC-Z1-IP is offline on system antec in cluster Remo-Clust
2011/01/29 17:41:31 VCS NOTICE V-16-1-50983 Resource PRD-SYMC-DGAFTPPRD is offline on system antec in cluster Remo-Clust
2011/01/29 17:41:37 VCS NOTICE V-16-1-50983 Resource PRD-SYMC-Z1-ZONEID is offline on system antec in cluster Remo-Clust
2011/01/29 17:41:39 VCS NOTICE V-16-1-50983 Resource PRD-SYMC-NDM is offline on system antec in cluster Remo-Clust
2011/01/29 18:23:58 VCS ERROR V-16-3-18211 (symc) Cluster SRDF-CLUSTER lost heartbeat Symm to cluster Remo-Clust                                           <<<<<<<<  #2 Then the Symm heartbeat was lost.
..

..
2011/01/29 19:26:57 VCS INFO V-16-1-10298 Resource PRD-SYMC-DR-PREP (Owner: unknown, Group: DR-SYMC-SG) is online on symc (VCS initiated)         <<<<<<<<  #3 After that, PRD-SYMC-DR-PREP went online on node symc.
2011/01/29 19:26:57 VCS NOTICE V-16-1-10447 Group DR-SYMC-SG is online on system symc          
2011/01/29 19:39:54 VCS INFO V-16-1-50859 Attempting to switch group PRD-SYMC-SG from system symc to system antec
2011/01/29 19:39:54 VCS INFO V-16-1-50135 User admin fired command: hagrp -switch PRD-SYMC-SG  antec  Remo-Clust  from 10.92.248.161
2011/01/29 19:39:54 VCS WARNING V-16-1-50871 Unable to online group PRD-SYMC-SG on system antec.  Child group dependency would be violated
2011/01/29 20:06:57 VCS INFO V-16-3-18210 (symc) Cluster SRDF-CLUSTER heartbeat Symm to cluster Remo-Clust is alive                                         <<<<<<<<  #4 Then the Symm heartbeat came back ALIVE.


[ Comment ] In addition, there appears to have been manual intervention by someone.
1)  Someone logged in to the system.
2011/01/29 17:39:02 VCS INFO V-16-1-50133 User admin has logged in from 192.168.1.12

2) Then all resources in PRD-SYMC-SG went offline outside of VCS control, and VCS detected this.
2011/01/29 17:40:42 VCS NOTICE V-16-1-50983 Resource PRD-SYMC-SYMCZ1-ZONE is offline on system antec in cluster Remo-Clust  

3) Thereafter, the Symm heartbeat was lost.
2011/01/29 18:23:58 VCS ERROR V-16-3-18211 (symc) Cluster SRDF-CLUSTER lost heartbeat Symm to cluster Remo-Clust                                          

4) After that, the resource PRD-SYMC-DR-PREP went online on node symc.
2011/01/29 19:26:57 VCS INFO V-16-1-10298 Resource PRD-SYMC-DR-PREP (Owner: unknown, Group: DR-SYMC-SG) is online on symc (VCS initiated)  

5) Then the Symm heartbeat came back ALIVE.
2011/01/29 20:06:57 VCS INFO V-16-3-18210 (symc) Cluster SRDF-CLUSTER heartbeat Symm to cluster Remo-Clust is alive
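
The timeline above can be rebuilt from an engine log by selecting lines on their message IDs. This is a rough sketch: the function name vcs_timeline is hypothetical, and the ID list contains only the messages that appear in this case. It assumes an awk that supports grouped alternation in regular expressions (on Solaris 10, nawk or /usr/xpg4/bin/awk is a safer choice than /usr/bin/awk).

```shell
#!/bin/sh
# Hypothetical helper: reduce an engine log to the events used in the
# timeline above, selected by their VCS message IDs.
#   V-16-1-50133  user login          V-16-1-50983  resource offline (remote)
#   V-16-3-18211  heartbeat lost      V-16-1-10298  resource online
#   V-16-3-18210  heartbeat alive
vcs_timeline() {
    # Field 5 of each engine log line is the message ID;
    # print date, time and message ID for the selected events.
    awk '$5 ~ /^V-16-(1-50133|1-50983|3-18211|1-10298|3-18210)$/ { print $1, $2, $5 }' "$1"
}

# Usage:
#   vcs_timeline /var/VRTSvcs/log/engine_A.log
```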



3. /var/VRTSvcs/log/wac_A.log
----------------------------------------------------------
..

..
2011/01/29 18:23:58 VCS ERROR V-16-3-18211 Cluster SRDF-CLUSTER lost heartbeat Symm to cluster Remo-Clust
2011/01/29 20:06:57 VCS INFO V-16-3-18210 Cluster SRDF-CLUSTER heartbeat Symm to cluster Remo-Clust is alive

[ Comment ] wac_A.log also shows that the Symm heartbeat was lost and later came back alive.
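
To quantify how long the heartbeat was lost, the two timestamps can be subtracted. The following is a minimal sketch under stated assumptions: the function name heartbeat_outage_seconds is hypothetical, both events are assumed to fall on the same calendar day, and a POSIX awk is required (on Solaris 10, nawk or /usr/xpg4/bin/awk).

```shell
#!/bin/sh
# Hypothetical helper: seconds between the first "lost heartbeat <name>"
# message and the next "heartbeat <name> ... is alive" message.
# $1 = log file (wac_A.log or engine_A.log), $2 = heartbeat name
# Assumes both events fall on the same calendar day.
heartbeat_outage_seconds() {
    awk -v hb="$2" '
        # Convert HH:MM:SS to seconds since midnight.
        function secs(t,    p) { split(t, p, ":"); return p[1] * 3600 + p[2] * 60 + p[3] }
        !seen && index($0, "lost heartbeat " hb " ") { seen = 1; down = secs($2); next }
        seen && index($0, "heartbeat " hb " ") && /is alive/ { print secs($2) - down; exit }
    ' "$1"
}

# Usage:
#   heartbeat_outage_seconds /var/VRTSvcs/log/wac_A.log Symm
```

For the two wac_A.log lines quoted above (18:23:58 and 20:06:57) this gives 6179 seconds, that is, about 1 hour 43 minutes.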


4. What role does the Symm agent play?

$ pwd
~ /vcs/agents/hb/Symm

$ ls
aya        SymmAgent

#cat aya
--------------------------------------------------------------------------------------------------------------------
###
if [ -z "$EMC_HOME" ];
then
       EMC_HOME="/usr/symcli"
fi
if [ -z "$VCS_HOME" ];
then
       VCS_HOME="/opt/VRTSvcs"
fi
HEAD=/usr/bin/head
AWK=/usr/bin/awk
${EMC_HOME}/bin/symrdf -sid $2 ping > /dev/null 2>&1
ret=$?
if [ $ret -eq 0 ];      # ping successful
then
       exit 102;
fi
if [ $ret -eq 1 ];      # can't find symrdf command
then
       res=`${VCS_HOME}/bin/hares -list Type=SRDF | $HEAD -1 | $AWK '{print $1}'`
       symhome=`${VCS_HOME}/bin/hares -value $res SymHome 2>/dev/null`
       ${symhome}/bin/symrdf -sid $2 ping > /dev/null 2>&1
       if [ $? -eq 0 ];
       then
               exit 102;
       fi
fi
exit 103;       # ping unsuccessful
--------------------------------------------------------------------------------------------------------------------

[ Comment ] As the script above shows, the "symrdf" command is used to verify the state of the Symm heartbeat.
Hence, the following command determines whether the Symm heartbeat is alive:
# /usr/symcli/bin/symrdf -sid 0267 ping
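
When running the check by hand, it helps to translate the result into the exit codes that aya uses. The wrapper below is only a sketch: the function name symm_heartbeat_state and the SYMRDF variable are hypothetical, the default path is the one taken from the script above, and the fallback to the SymHome attribute that aya performs is deliberately left out. Mapping exit code 102 to ALIVE and 103 to DOWN is inferred from the script's own comments.

```shell
#!/bin/sh
# Hypothetical wrapper: run the same ping that the aya script performs and
# report the result in words. 102 and 103 are the exit codes aya uses.
SYMRDF="${SYMRDF:-/usr/symcli/bin/symrdf}"   # default path assumed from aya

# $1 = Symmetrix ID (the value passed to aya as its second argument)
symm_heartbeat_state() {
    if "$SYMRDF" -sid "$1" ping > /dev/null 2>&1; then
        echo "ALIVE (aya would exit 102)"
    else
        echo "DOWN (aya would exit 103)"
    fi
}

# Usage:
#   symm_heartbeat_state 0267
```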



## main.cf ##
..
heartbeat Symm (
ClusterList = { Remo-Clust }
StopTimeout @Remo-Clust = 60
AYARetryLimit @Remo-Clust = 2
Arguments @Remo-Clust = { 0267 }
)
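
The Symmetrix ID used with "symrdf -sid" appears to come from the Arguments attribute of heartbeat Symm. As a convenience, the value can be read from main.cf as sketched below. The function name heartbeat_arguments is hypothetical, the parser only handles the simple one-line layout shown above, and a POSIX awk is assumed (on Solaris 10, nawk or /usr/xpg4/bin/awk).

```shell
#!/bin/sh
# Hypothetical helper: print the Arguments value configured for a
# heartbeat in main.cf, e.g. the Symmetrix ID used for "symrdf -sid".
# $1 = path to main.cf, $2 = heartbeat name
heartbeat_arguments() {
    awk -v hb="$2" '
        $1 == "heartbeat" && $2 == hb { inside = 1; next }
        inside && $1 == ")"           { exit }
        inside && $1 == "Arguments"   {
            sub(/^[^{]*[{][ \t]*/, "")   # drop everything up to the opening brace
            sub(/[ \t]*[}].*$/, "")      # drop the closing brace and anything after
            gsub(/"/, "")                # strip quotes, if any
            print
        }
    ' "$1"
}

# Usage:
#   heartbeat_arguments /etc/VRTSvcs/conf/config/main.cf Symm
```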


Applies To

[ VERSION  OS  OS/PACKAGE ]
- Solaris 10
- SFHA5.0MP3RP4 ( WAN Cluster )

Issue/Introduction

The heartbeat Symm agent went DOWN.