Please consider taking the following proactive measures.
1. Check whether /usr/bin/who has changed, as there is a known issue in which a malfunctioning who binary returns a wrong value.
2. Correlate the system performance data collected at the time with the issue.
3. Involve the system engineer and the storage engineer in this incident so that they can check their embedded logs.
4. Tune the monitoring timeout of CVMVolDg.
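For item 1, the following is a minimal sketch of one way to record and compare the state of the who binary. It assumes a known-good copy of the binary (for example from an identical node or from install media) is available for comparison; fingerprint_file and same_checksum are hypothetical helper names, not VCS or OS commands.

```shell
# Print the long listing (size, owner, modification time) and the checksum
# of a file so it can be recorded and compared later.
# fingerprint_file is a hypothetical helper name.
fingerprint_file() {
    f="$1"
    if [ ! -f "$f" ]; then
        echo "missing: $f"
        return 1
    fi
    ls -l "$f"
    cksum "$f"
}

# Succeed only when both files have the same CRC and size according to cksum.
# same_checksum is a hypothetical helper name.
same_checksum() {
    a=$(cksum < "$1") || return 1
    b=$(cksum < "$2") || return 1
    [ "$a" = "$b" ]
}
```

A possible use would be fingerprint_file /usr/bin/who on each node, followed by same_checksum /usr/bin/who against the known-good copy. The package verification tool of the operating system, where one exists, is likely a more thorough check.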
[ Preventive measures for CVMVolDg ]
1. Increase the MonitorInterval/MonitorTimeout
[ Current default values ]
#hatype -display CVMVolDg
-------------------------------------------------------------------------------
hatype_display:CVMVolDg FaultOnMonitorTimeouts 4 << Already set to 4. This can be raised up to 6.
..
hatype_display:CVMVolDg MonitorInterval 60 << This can be tuned within 60~300 (seconds).
..
hatype_display:CVMVolDg MonitorTimeout 60 << This can be tuned within 60~180 (seconds).
..
hatype_display:CVMVolDg RestartLimit 0 << This can be raised to 1.
-------------------------------------------------------------------------------
For example, you can check the current CVMVolDg parameters with the following command:
#hatype -display CVMVolDg
To modify these tunable values, run the following:
#haconf -makerw
#hatype -modify CVMVolDg MonitorInterval 90
#hatype -modify CVMVolDg MonitorTimeout 75
#hatype -modify CVMVolDg FaultOnMonitorTimeouts 6
#hatype -modify CVMVolDg RestartLimit 1
#haconf -dump -makero
[ NOTE ] Please be aware of the following:
* MonitorInterval must be greater than or equal to MonitorTimeout.
* FaultOnMonitorTimeouts is the number of monitor timeouts before a fault is declared. Zero disables it.
* The appropriate values for CVMVolDg depend on the system load and on how the system performs while monitoring the resources configured in VCS.
Hence, the customer should gather performance data for some time and base the decision on it before changing any of these defaults. The longer CVMVolDg takes to detect a problem with its resources, the later VCS will react to the fault.
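The constraints above can be checked before any change is made. The following is a minimal sketch that validates a proposed set of values against the ranges quoted in this note and then prints, without running, the command sequence to apply them. plan_cvmvoldg_tuning is a hypothetical helper name, and the ranges are only the ones suggested here, not limits enforced by VCS.

```shell
# Validate proposed CVMVolDg type-level values, then print (not run) the
# commands that would apply them.
# plan_cvmvoldg_tuning is a hypothetical helper name.
plan_cvmvoldg_tuning() {
    interval="$1"; timeout="$2"; faults="$3"; restart="$4"

    # Ranges suggested in this note (assumptions, not VCS-enforced limits).
    [ "$interval" -ge 60 ] && [ "$interval" -le 300 ] ||
        { echo "MonitorInterval must be within 60-300" >&2; return 1; }
    [ "$timeout" -ge 60 ] && [ "$timeout" -le 180 ] ||
        { echo "MonitorTimeout must be within 60-180" >&2; return 1; }
    [ "$faults" -ge 0 ] && [ "$faults" -le 6 ] ||
        { echo "FaultOnMonitorTimeouts must be within 0-6" >&2; return 1; }
    [ "$restart" -ge 0 ] && [ "$restart" -le 1 ] ||
        { echo "RestartLimit must be 0 or 1" >&2; return 1; }

    # MonitorInterval must be greater than or equal to MonitorTimeout.
    [ "$interval" -ge "$timeout" ] ||
        { echo "MonitorInterval must be >= MonitorTimeout" >&2; return 1; }

    echo "haconf -makerw"
    echo "hatype -modify CVMVolDg MonitorInterval $interval"
    echo "hatype -modify CVMVolDg MonitorTimeout $timeout"
    echo "hatype -modify CVMVolDg FaultOnMonitorTimeouts $faults"
    echo "hatype -modify CVMVolDg RestartLimit $restart"
    echo "haconf -dump -makero"
}
```

For example, plan_cvmvoldg_tuning 90 75 6 1 prints the six commands for the values used in this note, while plan_cvmvoldg_tuning 60 90 4 0 is rejected because the interval is shorter than the timeout. Printing instead of executing leaves the final review and the actual change to the administrator.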
2. Enable debug for CVMVolDg type resources to collect additional debug information.
To get more debug information as to why the CVMVolDg resources are timing out, it is recommended to turn on debug mode for the agent. This is also a good suggestion to pass on to the customer, as it helps verify how the timeouts happen.
To do this, you will need to edit cvmvoldg.lib and uncomment one line.
from this:
# Un comment the following to start debugging
# DEBUG="DEBUG"
to this:
# Un comment the following to start debugging
DEBUG="DEBUG"
It is not necessary to restart the agent, since this lib file is read every time the monitor is run.
With debug enabled, the agent places more information in the engine log, which may give a better idea as to why the resources are timing out.
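If the edit needs to be repeated on several nodes, it can be scripted. The following is a minimal sketch, assuming the line in cvmvoldg.lib looks exactly like the one shown above; the location of the file is not given in this note, so it is passed in as an argument. enable_agent_debug and disable_agent_debug are hypothetical helper names.

```shell
# Uncomment the DEBUG="DEBUG" line in the given lib file.
# A copy of the original is kept next to it with a .bak suffix.
# enable_agent_debug is a hypothetical helper name.
enable_agent_debug() {
    lib="$1"
    [ -f "$lib" ] || { echo "not found: $lib" >&2; return 1; }
    cp -p "$lib" "$lib.bak" || return 1
    # Only a commented-out DEBUG assignment is touched; the explanatory
    # comment line above it does not match and is left alone.
    sed 's/^#[[:space:]]*DEBUG="DEBUG"/DEBUG="DEBUG"/' "$lib.bak" > "$lib"
}

# Comment the DEBUG="DEBUG" line out again once enough data is collected.
# disable_agent_debug is a hypothetical helper name.
disable_agent_debug() {
    lib="$1"
    [ -f "$lib" ] || { echo "not found: $lib" >&2; return 1; }
    cp -p "$lib" "$lib.bak" || return 1
    sed 's/^DEBUG="DEBUG"/# DEBUG="DEBUG"/' "$lib.bak" > "$lib"
}
```

Writing the result back into the existing file, rather than replacing it, keeps its ownership and permissions unchanged. Turning debug off again after the data has been collected avoids filling the engine log unnecessarily.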