Cluster Server (VCS) resource cssd fails to online

Article ID: 100002422

Description

Error Message

# /oracle_crs/bin/crsctl check cssd
Failure 1 contacting CSS daemon

[ CSSD]2010-06-09 14:32:37.550 [20] >TRACE: clssnmRcfgMgrThread: Local Join
[ CSSD]2010-06-09 14:32:37.550 [20] >WARNING: clssnmLocalJoinEvent: takeover aborted due to ALIVE node on Disk
[ CSSD]2010-06-09 14:32:38.4

Resolution

A review of main.cf showed that the three main entities required by Clusterware (the Voting device, the PrivNIC device and the OCR repository) were all configured and online, so there was no obvious cause for the cssd resource failing to online. The ocssd.log file in $CRSHOME/log on the affected node revealed the following messages:

...
[CSSD]2010-06-09 14:32:37.550 [20] >TRACE: clssnmRcfgMgrThread: Local Join
[CSSD]2010-06-09 14:32:37.550 [20] >Warning: clssnmLocalJoinEvent: takeover aborted due to ALIVE node on Disk
[CSSD]2010-06-09 14:32:38.415 [8] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(2) wrtcnt(527667) LATS(179894445
...
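
As a rough aid for spotting the same symptom on another node, the sketch below counts occurrences of the takeover warning in a log file. The function name `count_takeover_aborts` is hypothetical, and the exact path of ocssd.log under $CRSHOME/log is site-specific, so it is passed in as an argument rather than assumed.

```shell
#!/bin/sh
# Count "takeover aborted" warnings in an ocssd.log file.
# The match is case-insensitive to tolerate differences in how the
# severity keyword is capitalized.
count_takeover_aborts() {
    # $1: path to the ocssd.log file to inspect (caller supplies it)
    # grep -c prints the number of matching lines; it exits non-zero
    # when the count is 0, so "|| true" keeps the function's status clean.
    grep -ci 'clssnmLocalJoinEvent: takeover aborted due to ALIVE node on Disk' "$1" || true
}
```

A non-zero count only indicates that the message is present; as this case shows, the underlying cause may lie with the private interconnect rather than with the Voting device.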

However, a check on the Voting device confirmed that heartbeating was fine, so a check was then made on the PrivNIC device. This represents an IP address on each node that Clusterware's OCSSD processes use to communicate with their peer processes on other nodes. From each node, a ping of the other node's PrivNIC IP address failed. Once the networking issue had been resolved and the PrivNIC IP addresses could be pinged successfully, the cssd resource was able to online.
A sample PrivNIC resource is below:

PrivNIC ora_priv (
Critical = 0
Device@nodeA = { nxge3 = 0, nxge15 = 1 }
Device@nodeB = { nxge3 = 0, nxge15 = 1 }
Address@nodeA = "10.0.0.5"
Address@nodeB = "10.0.0.6"
NetMask = "255.255.255.240"
)
)
...
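
The ping check described above can be sketched as a small script that reads the per-node `Address@...` entries from main.cf and pings each one. This is only an illustration under stated assumptions: the function names and the `PING_CMD` variable are hypothetical, the default `ping -c 1` uses Linux-style flags (other platforms such as Solaris use a different ping syntax, so set `PING_CMD` accordingly), and the pattern matches every `Address@<node>` line in the file, which assumes the PrivNIC resource is the only one localizing that attribute.

```shell
#!/bin/sh
# Print the quoted IP addresses from Address@<node> lines of a
# main.cf-style file, one address per line.
list_privnic_addresses() {
    # $1: path to main.cf (or any file containing the PrivNIC resource)
    sed -n 's/^[[:space:]]*Address@[^=]*=[[:space:]]*"\([^"]*\)".*$/\1/p' "$1"
}

# Ping each address once and report whether it answered.
# PING_CMD is a hypothetical override for the ping command and flags;
# it is left unquoted on purpose so that "ping -c 1" splits into words.
check_privnic_addresses() {
    list_privnic_addresses "$1" | while read -r addr; do
        if ${PING_CMD:-ping -c 1} "$addr" </dev/null >/dev/null 2>&1; then
            echo "$addr reachable"
        else
            echo "$addr UNREACHABLE"
        fi
    done
}
```

Running `check_privnic_addresses` against the cluster's main.cf on each node gives a quick view of which peer PrivNIC addresses do not respond; any address reported as unreachable points to the kind of private network problem that prevented cssd from coming online here.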


Issue/Introduction

Cluster Server (VCS) resource cssd fails to online