server101 AgentFramework[42399]: VCS CRITICAL V-16-10061-638 CoordPoint:coordpoint:monitor:Administrator intervention is required because registration key VF01F401 is missing from the coordinator disk /dev/vx/rdmp/sdj.If split brain happens in this condition, all nodes in the cluster may panic. Ensure that the local node can access the coordination disks and the keys are registered. Refer to the vxfenadm (1m) man page for more information.
The coordpoint resource is marked as faulted
# hares -state coordpoint
#Resource    Attribute    System       Value
coordpoint   State        server101    FAULTED <<
Cause
In a Linux environment, when access to storage is lost, udev removes the affected devices from the OS, and a subsequent device scan within VxVM removes the corresponding DMP nodes. When storage access is restored, the OS and VxVM rebuild the device tree and may assign different minor numbers to the devices. If the minor numbers of the fencing disks change, the vxfen configuration no longer matches the devices.
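To see whether a device was re-minored, its major and minor numbers can be recorded before and after the storage outage and compared. The following is a minimal sketch, not part of the documented procedure: the helper name dev_numbers is hypothetical, and it assumes a stat that supports the -c format option with %t and %T (as GNU coreutils does).

```shell
#!/bin/sh
# dev_numbers (hypothetical helper): print "major:minor" in decimal
# for a block or character device node.
dev_numbers() {
    path=$1
    # %t and %T print the major and minor device numbers in hexadecimal.
    hex=$(stat -c '%t %T' "$path") || return 1
    # Split the two hex fields into $1 and $2.
    set -- $hex
    # Convert hex to decimal with shell arithmetic.
    echo "$((0x$1)):$((0x$2))"
}
```

For example, running dev_numbers /dev/vx/rdmp/sdj on each node before and after the event shows whether the coordinator disk kept its minor number.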
Solution
There are multiple possible solutions to address this scenario.
Option 1: Use vxfenswap to correct the vxfen mismatch after a CVM node reconnects to storage. Refer to the VCS Admin guide for detailed instructions under "Replacing I/O Fencing coordinator disks when the cluster is online". The key tasks are summarized below for reference.
- Make sure system-to-system communication is functioning properly (passwordless ssh).
- Determine the value of the FaultTolerance attribute.
# hares -display coordpoint -attribute FaultTolerance -localclus
- Set the value of the FaultTolerance attribute to 0.
- Check the existing value of the LevelTwoMonitorFreq attribute.
# hares -display coordpoint -attribute LevelTwoMonitorFreq -localclus
- Disable level two monitoring of CoordPoint agent.
# haconf -makerw
# hares -modify coordpoint LevelTwoMonitorFreq 0
# haconf -dump -makero
- Make sure that the cluster is online.
# vxfenadm -d
- Validate that the fencing disks are accessible on all running nodes of the cluster.
- On any running node, run the following command to start the vxfenswap utility:
# vxfenswap -g fendgname
- Confirm the updates.
# vxfenconfig -l
- Re-enable the LevelTwoMonitorFreq and FaultTolerance attributes of the CoordPoint agent. You may want to use the values that were set before the attributes were changed.
# hares -modify coordpoint LevelTwoMonitorFreq Frequencyvalue
# hares -modify coordpoint FaultTolerance FaultTolerancevalue
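The save-and-restore part of the steps above can be sketched as two small shell functions, so the original attribute values are not lost between disabling and re-enabling them. This is an illustrative sketch only: the function and variable names are hypothetical, it assumes hares -value returns the current value of a resource attribute on your VCS release, and it assumes the configuration is read-only when the restore runs (hence the haconf calls around the modifications).

```shell
#!/bin/sh
# Hypothetical helpers; names are illustrative, not part of VCS.

# Record the current attribute values before they are changed.
save_coordpoint_attrs() {
    res=$1
    # Assumption: "hares -value <resource> <attribute>" prints the value.
    SAVED_L2FREQ=$(hares -value "$res" LevelTwoMonitorFreq) || return 1
    SAVED_FT=$(hares -value "$res" FaultTolerance) || return 1
}

# Put the recorded values back once vxfenswap has completed.
restore_coordpoint_attrs() {
    res=$1
    # Open the configuration for writing, as in the disable step.
    haconf -makerw || return 1
    hares -modify "$res" LevelTwoMonitorFreq "$SAVED_L2FREQ" || return 1
    hares -modify "$res" FaultTolerance "$SAVED_FT" || return 1
    # Save the configuration and make it read-only again.
    haconf -dump -makero
}
```

A typical use would be to call save_coordpoint_attrs coordpoint before setting the attributes to 0, and restore_coordpoint_attrs coordpoint after vxfenconfig -l confirms the update.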
Option 2: Implement CP server-based fencing. Refer to the VCS guides for details.
Option 3: Applicable only to Fibre Channel based storage access. Increase dev_loss_tmo to avoid re-minoring of DMP nodes in the event of a temporary storage loss. Applying the modified udev rule below at system boot raises the device loss timeout to 24 hours (86400 seconds) from the default dev_loss_tmo of 30 or 45 seconds, depending on the HBA setting.
Edit the file /etc/udev/rules.d/40-rport.rules to change the dev_loss_tmo value.
$ cat /etc/udev/rules.d/40-rport.rules
KERNEL=="rport-*", SUBSYSTEM=="fc_remote_ports", ACTION=="add", RUN+="/bin/sh -c 'echo 86400 > /sys/class/fc_remote_ports/%k/dev_loss_tmo'"
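After the rule has been applied (for example, after a reboot), the effective timeout can be checked by reading dev_loss_tmo for each remote port under sysfs. The following is a minimal sketch; the function name show_dev_loss_tmo and its optional directory argument are hypothetical and exist only to make the check easy to try out.

```shell
#!/bin/sh
# show_dev_loss_tmo (hypothetical helper): print "<rport> <dev_loss_tmo>"
# for every FC remote port found under the given sysfs directory.
show_dev_loss_tmo() {
    root=${1:-/sys/class/fc_remote_ports}
    for f in "$root"/rport-*/dev_loss_tmo; do
        # Skip the unexpanded glob when no remote ports exist.
        [ -r "$f" ] || continue
        printf '%s %s\n' "$(basename "$(dirname "$f")")" "$(cat "$f")"
    done
}
```

Running show_dev_loss_tmo with no argument lists every remote port on the host; each line should report 86400 if the rule took effect.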