2-node VCS; testing high availability with all heartbeat links disconnected; service group failover recovery fails

Article ID: 100005671

Description

Error Message

2011/03/05 15:53:57 VCS NOTICE V-16-1-10301 Initiating Online of Resource node04_ip (Owner: unknown, Group: sg_db2) on System node05
-------- but
2011/03/05 15:53:58 VCS WARNING V-16-10031-4604 (nodeP05) IP:node04_ip:online:Address 111.222.33.44 already exists: Res node04_ip will not go online.
-------- therefore
2011/03/05 15:56:00 VCS ERROR V-16-2-13066 (nodeP05) Agent is calling clean for resource(res04_ip) because the resource is not up even after online completed.
2011/03/05 15:56:01 VCS INFO V-16-2-13068 (node05) Resource(res04_ip) - clean completed successfully.
2011/03/05 15:56:01 VCS INFO V-16-2-13071 (node05) Resource(res04_ip): reached OnlineRetryLimit(0). 
2011/03/05 15:56:01 VCS ERROR V-16-1-10303 Resource node04_ip (Owner: unknown, Group: sg_db2) is FAULTED (timed out) on sys nodeP05

 

Cause

------ After a 2-node cluster is subjected to a fault test that drops all LLT links, the fencing key race causes one node to panic and reboot. The rebooting node holds on to the virtual IP address (VIP) too long while the surviving node tries to bring the service group online, including the IP resource for that VIP. The IP agent discovers that the VIP already exists, so it faults the IP resource and the service group is taken offline.

----Examine the "limbo period", that is, the time the panicked node takes to complete its reboot; the VIP on this node may not have been cleared soon enough.

 

Resolution


----Increase the type-level attribute OnlineRetryLimit (default is 0) to 2 and OnlineMonitorTimeout to 600 for the IP agent. The agent then waits longer, and retries the online operation, before declaring the resource faulted when its pre-existence "ping ip" check finds the VIP still up on the panicked node. Review the OnlineRetryLimit and OnlineTimeout defaults before making the change.
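----The attribute change could be applied from the command line roughly as follows. This is a minimal sketch, not a verified procedure: it assumes the agent type is named IP, and it assumes the timeout meant above is the standard OnlineTimeout attribute, because the text names both OnlineMonitorTimeout and OnlineTimeout. The run and tune_ip_agent helpers and the DRY_RUN variable are hypothetical names used only for illustration. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch only: raise the IP agent's online retry and timeout values.
# Defaults to a dry run that prints the commands instead of running them.
# Set DRY_RUN=0 on a cluster node to execute them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: the four steps of the type-level change.
tune_ip_agent() {
    # Open the cluster configuration for writing.
    run haconf -makerw
    # Retry the online operation twice before faulting (default is 0).
    run hatype -modify IP OnlineRetryLimit 2
    # Assumption: OnlineTimeout is the attribute intended by the article.
    run hatype -modify IP OnlineTimeout 600
    # Save the configuration and return it to read-only.
    run haconf -dump -makero
}

tune_ip_agent
```

A type-level change applies to every resource of the IP type in the cluster, so review the printed commands before running them with DRY_RUN=0.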


Applies To

---2-node VCS cluster with active-passive failover service groups
---fault test that removes all LLT links
---VCS fencing is configured

 

Issue/Introduction

While testing LLT dropout and VCS fencing on a 2-node VCS cluster, the node that wins the fencing key race is unable to bring the service group online. The IP agent resource discovers that the virtual IP address already exists.