On Solaris 10, VCS 5.1 hastop sometimes fails under the Service Management Facility (SMF)

Article ID: 100025323

Resolution

 

 
With the timeout raised to 120 seconds (double the previous value), no errors were observed and port h closed, indicating a graceful shutdown of all the services.
The logs were:
 
Jul 15 17:45:12 thor248 gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port o closed
Jul 15 17:45:13 thor248 gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port d closed
Jul 15 17:45:41 thor248 gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port f closed
Jul 15 17:45:44 thor248 gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port v closed
Jul 15 17:45:44 thor248 vxvm:vxconfigd: [ID 702911 daemon.notice] V-5-1-7901 CVM_VOLD_STOP command received
Jul 15 17:45:44 thor248 gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port w closed
Jul 15 17:50:22 thor248 gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port h closed
Jul 15 17:50:23 thor248 gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port b closed
Jul 15 17:50:23 thor248 vxfen: [ID 469227 kern.notice] NOTICE: VXFEN INFO V-11-1-VxFEN unconfigured
Jul 15 17:50:23 thor248 gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Jul 15 17:50:23 thor248 gab: [ID 226886 kern.notice] GAB INFO V-15-1-20166 Exiting from gablogd. GAB driver got unconfigured
Jul 15 17:50:23 thor248 syslogd: going down on signal 15
 
One possible solution to the GAB error is to give HAD enough time to offline all of its resources, especially when CVM/CFS/RAC is stacked on top of it, so that all the products can go offline gracefully.
 
This parameter can also be tuned for each product in its respective XML file. It is recommended that the customer increase this value and then perform a graceful reboot.
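
As an illustration of the tuning step, the sketch below prints the SMF commands that would raise a service's stop timeout in the repository. It is a minimal sketch, not the vendor's documented procedure: the helper name stop_timeout_cmds is hypothetical, and it assumes the stop method's property group is settable on the FMRI you pass (it may live on the service rather than the instance, depending on the manifest). Nothing is executed; the commands are only printed for review.

```shell
# Hypothetical helper: print (do not run) the commands that would change
# the stop timeout for an SMF service. Review the output, then run it as root.
stop_timeout_cmds() {
    fmri=$1      # assumed example: svc:/system/vcs:default
    seconds=$2   # assumed example: 120
    # Set the stop method timeout in the SMF repository.
    echo "svccfg -s $fmri setprop stop/timeout_seconds = count: $seconds"
    # Make the running instance pick up the changed property.
    echo "svcadm refresh $fmri"
}
```

For example, stop_timeout_cmds svc:/system/vcs:default 120 prints the two commands for the value used in the test above. If the XML file is edited instead, the manifest typically has to be re-imported with svccfg import before the change takes effect.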

Issue/Introduction

 
Background:
 
Solaris uses the SMF to run the offline (stop) scripts during a graceful shutdown such as init 6. Our products, such as FS, VM, VCS, VXFEN, GAB and LLT, run as services monitored by the SMF, and their start and stop scripts are shipped with the products. Dependencies between the services determine the order in which they are brought up (boot) and brought down (graceful reboot, init 6 or shutdown).
 
Dependencies can be seen by typing:
# svcs -D vcs
STATE          STIME    FMRI
online         17:56:44 svc:/system/vxodm:default
 
# svcs -d vcs
STATE          STIME    FMRI
online         17:55:56 svc:/system/filesystem/local:default
online         17:56:08 svc:/system/vxatd:default
online         17:56:11 svc:/system/gab:default
online         17:56:35 svc:/system/vxfen:default
 
svcs -d lists the services that vcs depends on; svcs -D lists the services that depend on vcs.
 
Observation:
 
Services are taken offline and brought online by the master restarter daemon, /lib/svc/bin/svc.startd.
When the user issues init 6, svc.startd offlines the services in the order it calculated from the configured dependencies. When it runs the vcs offline script, /lib/svc/method/vcs stop, it applies a stop timeout of 60 seconds (the interval is evident from the logs). If the stop script does not return success within that time, svc.startd runs the same script two more times with the same timeout, killing the previous instance of the script completely each time. If the offline script fails on the third attempt, svc.startd moves on to the next service in line, which is fencing. Because the graceful offline of fencing depends on the graceful offline of vcs, fencing fails. GAB, which still has its ports open for fencing and vcs, then reports the error.
 
The GAB error can occur even if only one of the dependent services fails to offline gracefully. In the logs from the occasions when this error occurred, the timeout was too short for vcs to complete hastop -sysoffline with many resources configured and running. If one of the CVM or CFS resources takes a while to offline, as in the failed mount case (logs in earlier mails) where HAD had to do a forceful mount, the svc.startd timer for vcs stop expires and another vcs stop is started. This is undesirable and can lead to the GAB unconfigure failed error.
 
Test environment:
 
This was tested on a 2-node Solaris cluster with CVM/CFS/RAC installed, running 5.1SP1. A rogue FileOnoff agent was created that introduces a delay of 300 seconds in its offline script, and a resource of that type was configured on the cluster. After issuing an init 6 command, these errors were observed:
 
Jul 15 16:08:57/504 ERROR: svc:/system/vcs:default: Method "/lib/svc/method/vcs stop" failed due to signal KILL.
Jul 15 16:25:30/4: svc:/system/vcs:default: Method or service exit timed out.  Killing contract 155.
Jul 15 16:27:33/504 ERROR: svc:/system/vxfen:default: Method "/lib/svc/method/vxfen stop" failed with exit status 1.
Jul 15 16:27:33/508 ERROR: svc:/system/gab:default: Method "/lib/svc/method/gab stop" failed with exit status 1.
Jul 15 16:27:33/523 ERROR: svc:/system/llt:default: Method "/lib/svc/method/llt stop" failed with exit status 1.
 
And the corresponding errors in /var/adm/messages were:
 
Jul 15 16:25:30 thor248 svc.startd[7]: [ID 122153 daemon.warning] svc:/system/vcs:default: Method or service exit timed out.  Killing contract 155.
Jul 15 16:25:30 thor248 svc.startd[7]: [ID 636263 daemon.warning] svc:/system/vcs:default: Method "/lib/svc/method/vcs stop" failed due to signal KILL.
Jul 15 16:27:32 thor248 svc.startd[7]: [ID 652011 daemon.warning] svc:/system/vxfen:default: Method "/lib/svc/method/vxfen stop" failed with exit status 1.
Jul 15 16:27:33 thor248 gab: [ID 719437 kern.notice] GAB ERROR V-15-1-20015 unconfigure failed: clients still registered
Jul 15 16:27:33 thor248 svc.startd[7]: [ID 652011 daemon.warning] svc:/system/gab:default: Method "/lib/svc/method/gab stop" failed with exit status 1.
Jul 15 16:27:33 thor248 svc.startd[7]: [ID 748625 daemon.error] system/gab:default failed: transitioned to maintenance (see 'svcs -xv' for details)
Jul 15 16:27:33 thor248 svc.startd[7]: [ID 652011 daemon.warning] svc:/system/llt:default: Method "/lib/svc/method/llt stop" failed with exit status 1.
 
This happened because HAD could not be stopped within the specified time, as it was waiting for the resource to go offline.
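
To check whether a node hit this condition on a previous shutdown, the relevant entries can be pulled out of /var/adm/messages. The following is a minimal sketch: the function name smf_stop_failures is hypothetical, and the patterns are taken only from the messages quoted above, so other message formats may be missed. It assumes a grep that supports -E; on Solaris 10 that may mean /usr/xpg4/bin/grep or egrep rather than /usr/bin/grep.

```shell
# Hypothetical helper: show svc.startd method failures/timeouts and GAB errors.
# Reads the files given as arguments, or standard input if none are given.
smf_stop_failures() {
    grep -E 'svc\.startd.*(timed out|failed)|GAB ERROR' "$@"
}
```

A typical call would be smf_stop_failures /var/adm/messages. A vcs "timed out" entry followed by a GAB "unconfigure failed" entry matches the sequence described in this article.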
 
This timeout is a tunable parameter that is shipped to the customer in an XML file located at:
/var/svc/manifest/system/*.xml
 
For vcs:
/var/svc/manifest/system/vcs.xml
 
The file contents:
------
<exec_method
        type='method'
        name='stop'
        exec='/lib/svc/method/vcs stop'
        timeout_seconds='60'>
        <method_context>
                <method_credential user='root' group='root' />
        </method_context>
</exec_method>
-------
The contents of this file also define how the dependencies are set for our products.
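
The stop timeout currently in effect can also be read from the SMF repository instead of the XML file. The sketch below is an assumption-laden illustration: the function name list_stop_timeouts is hypothetical, and it assumes each service exposes its stop method as the stop/timeout_seconds property, which is the usual SMF layout but is not confirmed here for every product.

```shell
# Hypothetical helper: print "<service> <stop timeout>" for each service given.
# Uses svcprop to read the stop method timeout from the SMF repository.
list_stop_timeouts() {
    for svc in "$@"; do
        printf '%s %s\n' "$svc" "$(svcprop -p stop/timeout_seconds "$svc")"
    done
}
```

For the stack discussed in this article, a call might look like list_stop_timeouts vcs vxfen gab llt. The value reported can differ from the XML file if the manifest was edited but not re-imported.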