What to do when a CVMVolDg resource goes offline with "VCS ERROR V-16-20007-1017" and "monitor:check_notify_status: can't stabilise vxstat"

Article ID: 100001685

Description

Error Message

- engine_A.log
2010/04/20 19:29:26 VCS ERROR V-16-2-13027 (ustu-lvsdbspdex01) Resource(cvmvoldg1) - monitor procedure did not complete within the expected time.
2010/04/20 19:31:19 VCS ERROR V-16-20007-1017 (ustu-lvsdbspdex01) CVMVolDg:cvmvoldg1:monitor:check_notify_status: can't stabilise vxstat
2010/04/20 19:32:09 VCS ERROR V-16-20007-1017 (ustu-lvsdbspdex02) CVMVolDg:cvmvoldg1:monitor:check_notify_status: can't stabilise vxstat
2010/04/20 19:33:10 VCS ERROR V-16-20007-1017 (ustu-lvsdbspdex02) CVMVolDg:cvmvoldg1:monitor:check_notify_status: can't stabilise vxstat
2010/04/20 19:33:20 VCS ERROR V-16-20007-1017 (ustu-lvsdbspdex01) CVMVolDg:cvmvoldg1:monitor:check_notify_status: can't stabilise vxstat
2010/04/20 19:35:09 VCS ERROR V-16-20007-1017 (ustu-lvsdbspdex02) CVMVolDg:cvmvoldg1:monitor:check_notify_status: can't stabilise vxstat
2010/04/20 19:35:20 VCS ERROR V-16-20007-1017 (ustu-lvsdbspdex01) CVMVolDg:cvmvoldg1:monitor:check_notify_status: can't stabilise vxstat
2010/04/20 19:35:27 VCS ERROR V-16-2-13210 (ustu-lvsdbspdex01) Agent is calling clean for resource(cvmvoldg1) because 4 successive invocations of the monitor procedure did not complete within the expected time.
..
..
2010/04/20 19:35:28 VCS INFO V-16-2-13068 (ustu-lvsdbspdex01) Resource(cvmvoldg1) - clean completed successfully.
2010/04/20 19:35:29 VCS INFO V-16-2-13026 (ustu-lvsdbspdex01) Resource(cvmvoldg1) - monitor procedure finished successfully after failing to complete within the expected time for (4) consecutive times.
2010/04/20 19:35:29 VCS INFO V-16-1-10307 Resource cvmvoldg1 (Owner: unknown, Group: cfs) is offline on ustu-lvsdbspdex01 (Not initiated by VCS)

Cause

These errors appear to result either from a high I/O load caused by array errors, or from the checksum of /var/VRTSvcs/lock/${CVMVOLDG_RENAME.EN_US}_${CVMVOLDG_DG.EN_US}_vxnotify not matching the checksum from the previous monitor run. As a result, the CVMVolDg monitor scripts for the shared disk groups time out. Once these monitor timeouts occur, VCS cannot determine the correct state of the resources and attempts to clean them or take them offline.
 
The function cvmvoldg_check_notify_status, at lines 1412 to 1464 of /opt/VRTSvcs/bin/CVMVolDg/cvmvoldg.lib, checks whether the checksum of the notify file has changed. It does not check the health of the vxnotify process, because the caller already does that. If the notify file changes right after a new checksum is generated, the change would not be caught until the next monitor iteration. To shorten that window, the function uses a while loop. When the loop counter reaches VXNOTIFY_LOOP_MAX, it logs error 1017, as the excerpt below shows.
 
---------------------------------------------------------
 
cvmvoldg_check_notify_status() {
..
               if [ $_ccns_counter -ge $VXNOTIFY_LOOP_MAX ] ; then
                       VCSAG_LOG_MSG "E" "check_notify_status: can't stabilise vxstat" 1017
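
The stabilisation loop described above can be illustrated with a minimal sketch. This is not the code shipped in cvmvoldg.lib; the function name wait_for_stable_sum, its arguments, and its return codes are hypothetical and exist only to show the idea of re-reading a checksum until two consecutive reads agree, with an upper bound on the number of attempts.

```shell
#!/bin/sh
# Illustrative sketch only; NOT the code shipped in cvmvoldg.lib.
# wait_for_stable_sum FILE MAX_LOOPS   (hypothetical helper)
# Re-reads the checksum of FILE until two consecutive reads match.
# Prints the stable checksum and returns 0.
# Returns 1 if the checksum is still changing after MAX_LOOPS re-reads,
# and 2 if FILE cannot be read.
wait_for_stable_sum() {
    _file=$1
    _max=$2
    _count=0
    _prev=$(cksum < "$_file") || return 2
    while [ "$_count" -lt "$_max" ]; do
        _cur=$(cksum < "$_file") || return 2
        if [ "$_cur" = "$_prev" ]; then
            # Two consecutive reads agree: the file is stable.
            echo "$_cur"
            return 0
        fi
        _prev=$_cur
        _count=$((_count + 1))
    done
    # Comparable in spirit to the "can't stabilise" branch in the excerpt.
    echo "can't stabilise checksum of $_file" >&2
    return 1
}
```

If the file keeps changing faster than it can be re-read, for example under heavy I/O, the loop exhausts its attempts, which corresponds to the situation in which the agent logs error 1017.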

Resolution

Take the following proactive measures:
1. Check whether /usr/bin/who has changed. There is a known issue in which a malfunctioning who binary returns a wrong value.
2. Correlate the system performance data from the time of the incident with the errors.
3. Involve the system engineer and the storage engineer so that they can check their own logs for the same period.
4. Tune the monitor timeout settings of CVMVolDg.
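
To correlate the errors with performance data (step 2), it helps to know how many times each node logged V-16-20007-1017 and in which time window. The following is a minimal sketch that assumes the engine log format shown in the Error Message section; the function name summarise_1017 is hypothetical.

```shell
#!/bin/sh
# summarise_1017 LOGFILE   (hypothetical helper)
# For each node that logged V-16-20007-1017, prints:
#   <host> <count> <first date> <first time> <last date> <last time>
# Assumes the field layout of engine_A.log shown in this article:
#   date time VCS severity message-id (host) text...
summarise_1017() {
    awk '$5 == "V-16-20007-1017" {
            host = $6
            gsub(/[()]/, "", host)          # strip the parentheses around the host
            count[host]++
            if (!(host in first)) first[host] = $1 " " $2
            last[host] = $1 " " $2
         }
         END { for (h in count) print h, count[h], first[h], last[h] }' "$1" | sort
}
```

For example, summarise_1017 /var/VRTSvcs/log/engine_A.log run against the log excerpt above would report three occurrences on each node, between 19:31:19 and 19:35:20 on ustu-lvsdbspdex01 and between 19:32:09 and 19:35:09 on ustu-lvsdbspdex02. That window is the one to compare against the performance and storage data.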
 
[ Preventive measures for CVMVolDg ]
1. Increase MonitorInterval and MonitorTimeout.
[ Current default values ]
#hatype -display CVMVolDg
-------------------------------------------------------------------------------
hatype_display:CVMVolDg FaultOnMonitorTimeouts 4    << Default is 4; can be raised up to 6.
.. 
hatype_display:CVMVolDg MonitorInterval 60                 << Can be tuned within 60~300
..
hatype_display:CVMVolDg MonitorTimeout 60               << Can be tuned within 60~180
..
hatype_display:CVMVolDg RestartLimit 0                      << Can be raised to 1
-------------------------------------------------------------------------------
You can check the current CVMVolDg parameters from the command line:
#hatype -display CVMVolDg
To modify these tunable values, use the commands below.
#haconf -makerw
#hatype -modify CVMVolDg MonitorInterval 90
#hatype -modify CVMVolDg MonitorTimeout 75
#hatype -modify CVMVolDg FaultOnMonitorTimeouts 6
#hatype -modify CVMVolDg RestartLimit 1
#haconf -dump -makero
 
[ NOTE ] Please be aware of the following:
* MonitorInterval must be greater than or equal to MonitorTimeout.
* FaultOnMonitorTimeouts is the number of consecutive monitor timeouts before a fault is declared. Setting it to zero disables this behavior.
* The right values for CVMVolDg depend on the system load and on how well the system performs while monitoring the resources configured in VCS.
Therefore, have the customer gather performance data for some time and base the decision on that data before changing any default value. The longer CVMVolDg takes to detect a problem with its resources, the later VCS reacts to the fault.
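
Before running the hatype -modify commands, the proposed values can be checked against the constraints in this article. The following is a minimal sketch; the function name check_cvmvoldg_tunables is hypothetical, and the ranges it enforces are only the ones suggested above (60~300, 60~180, up to 6), not limits enforced by VCS.

```shell
#!/bin/sh
# check_cvmvoldg_tunables INTERVAL TIMEOUT FAULT_ON_TIMEOUTS   (hypothetical helper)
# Returns 0 if the proposed values fit the guidance in this article,
# 1 otherwise. It only inspects its arguments and changes nothing in VCS.
check_cvmvoldg_tunables() {
    _interval=$1
    _timeout=$2
    _fomt=$3
    _rc=0
    if [ "$_interval" -lt "$_timeout" ]; then
        echo "MonitorInterval ($_interval) must be >= MonitorTimeout ($_timeout)"
        _rc=1
    fi
    if [ "$_interval" -lt 60 ] || [ "$_interval" -gt 300 ]; then
        echo "MonitorInterval ($_interval) is outside the suggested 60~300 range"
        _rc=1
    fi
    if [ "$_timeout" -lt 60 ] || [ "$_timeout" -gt 180 ]; then
        echo "MonitorTimeout ($_timeout) is outside the suggested 60~180 range"
        _rc=1
    fi
    if [ "$_fomt" -eq 0 ]; then
        # Allowed, but worth flagging: zero disables faulting on timeouts.
        echo "note: FaultOnMonitorTimeouts=0 disables faulting on monitor timeouts"
    elif [ "$_fomt" -gt 6 ]; then
        echo "FaultOnMonitorTimeouts ($_fomt) is above the suggested maximum of 6"
        _rc=1
    fi
    return $_rc
}
```

With the example values used above, check_cvmvoldg_tunables 90 75 6 returns 0. A call such as check_cvmvoldg_tunables 60 75 6 returns 1, because the interval is shorter than the timeout.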
 
2. Enable debug for CVMVolDg type resources to collect additional debug information.
To find out why the CVMVolDg resources are timing out, turn on debug mode for the agent. This also helps the customer verify how the problem occurs.
To do this, edit cvmvoldg.lib and uncomment one line,
from this:
# Un comment the following to start debugging
# DEBUG="DEBUG"
to this:
# Un comment the following to start debugging
DEBUG="DEBUG"
 
The agent does not need to be restarted, because this lib file is read every time the monitor runs.
With debug enabled, the agent writes more information to the engine log, which may give a better idea of why the resources are timing out.
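
The edit can also be scripted. The following is a minimal sketch, assuming the DEBUG line appears in the file exactly as shown above; the function name enable_cvmvoldg_debug and the .orig and .tmp suffixes are hypothetical. It keeps a backup of the original file so that debug mode can be turned off again by restoring it.

```shell
#!/bin/sh
# enable_cvmvoldg_debug LIBFILE   (hypothetical helper)
# Uncomments the DEBUG="DEBUG" line in LIBFILE.
# A one-time backup is kept as LIBFILE.orig.
# Returns 0 only if the uncommented line is present afterwards.
enable_cvmvoldg_debug() {
    _lib=$1
    [ -f "$_lib" ] || { echo "no such file: $_lib" >&2; return 1; }
    # Keep the first backup only, so a second run cannot overwrite it.
    [ -f "$_lib.orig" ] || cp -p "$_lib" "$_lib.orig" || return 1
    # Pre-create the temp file with the same mode as the original.
    cp -p "$_lib" "$_lib.tmp" || return 1
    sed 's/^#[[:space:]]*DEBUG="DEBUG"/DEBUG="DEBUG"/' "$_lib" > "$_lib.tmp" || return 1
    # Rename into place so the monitor never reads a half-written file.
    mv "$_lib.tmp" "$_lib" || return 1
    grep -q '^DEBUG="DEBUG"' "$_lib"
}
```

A typical call would be enable_cvmvoldg_debug /opt/VRTSvcs/bin/CVMVolDg/cvmvoldg.lib, run on each node where the resource times out. Trying the function on a copy of the file first is a sensible precaution.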
 
