Background:
Solaris uses the Service Management Facility (SMF) to run each service's offline (stop) script during a graceful shutdown such as init 6. All of our products (FS, VM, VCS, VXFEN, GAB, LLT, etc.) run as services monitored by SMF, and their stop and start scripts are shipped with the products. Dependencies between the services decide the order in which they are brought up (boot) and brought down (graceful reboot, init 6 or shutdown).
Dependencies can be seen by typing:
# svcs -D vcs
STATE STIME FMRI
online 17:56:44 svc:/system/vxodm:default
# svcs -d vcs
STATE STIME FMRI
online 17:55:56 svc:/system/filesystem/local:default
online 17:56:08 svc:/system/vxatd:default
online 17:56:11 svc:/system/gab:default
online 17:56:35 svc:/system/vxfen:default
svcs -d lists the services that vcs depends on; svcs -D lists the services that depend on vcs.
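To compare dependency lists between nodes it can help to reduce the svcs output to the FMRI column only. The following is a minimal sketch in POSIX shell, assuming the three-column STATE STIME FMRI layout shown above; fmri_column and sample_deps are names introduced here for illustration only.

```shell
# Print only the FMRI column from `svcs -d` / `svcs -D` style output on stdin.
# Assumption: three whitespace-separated columns with a STATE header line.
fmri_column() {
    awk '$1 != "STATE" && NF >= 3 { print $3 }'
}

# Captured output from this note; on a live host one would pipe
# `svcs -d vcs` into fmri_column instead.
sample_deps='STATE STIME FMRI
online 17:55:56 svc:/system/filesystem/local:default
online 17:56:08 svc:/system/vxatd:default
online 17:56:11 svc:/system/gab:default
online 17:56:35 svc:/system/vxfen:default'

printf '%s\n' "$sample_deps" | fmri_column
```

With the sample above this prints the four FMRIs that vcs depends on, one per line.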
Observation:
Services are brought online and offline by the SMF master restarter daemon, /lib/svc/bin/svc.startd.
When the user issues init 6, svc.startd takes the services offline in the order it calculated from the configured dependencies. When it runs the vcs offline script, /lib/svc/method/vcs stop, it applies a stop timeout of 60 seconds (the value is evident from the logs). If the stop script does not return success within that time, svc.startd kills the running instance of the script completely and runs it again, up to two more times, each with the same timeout. If the offline script also fails on the third attempt, svc.startd moves on to the next service in line, which is fencing. Because the graceful offline of fencing depends on the graceful offline of vcs, fencing fails as well. GAB, which still has its ports open for fencing and vcs, then reports the error.
The GAB error can occur if even one of the dependent services does not go offline gracefully. In the logs from the occasions when we saw this error, the timeout was too short for vcs to complete hastop -sysoffline with many resources configured and running. If one of the CVM or CFS resources takes a while to go offline, as in the mount failed case (logs in earlier mails) where HAD had to do a forceful mount, the svc.startd timer for the vcs stop expires and svc.startd fires another vcs stop. This is not desirable, and it can lead to the GAB unconfigure failed error.
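The timing argument can be checked with simple arithmetic. The sketch below is only a rough estimate in POSIX shell: it assumes the 60-second timeout and three attempts described above, and takes the slowest expected offline time as an input (300 seconds here is an assumed figure). The function names stop_window and check_stop_timeout are hypothetical helpers, not part of any shipped script.

```shell
# Total time svc.startd spends on the stop method across all attempts.
# $1 = timeout_seconds, $2 = number of attempts (1 initial + retries)
stop_window() {
    echo $(( $1 * $2 ))
}

# Compare the slowest expected offline time against the timeout.
# $1 = timeout_seconds, $2 = attempts, $3 = worst-case offline time in seconds
check_stop_timeout() {
    timeout=$1; attempts=$2; worst=$3
    window=$(stop_window "$timeout" "$attempts")
    if [ "$worst" -le "$timeout" ]; then
        echo "ok: offline (${worst}s) fits in one attempt (${timeout}s)"
    elif [ "$worst" -le "$window" ]; then
        echo "risky: offline (${worst}s) exceeds one attempt, total window ${window}s"
    else
        echo "fail: offline (${worst}s) exceeds total window ${window}s"
    fi
}

# 60 s timeout, 3 attempts, assumed 300 s offline.
# prints: fail: offline (300s) exceeds total window 180s
check_stop_timeout 60 3 300
```

Because each retry kills the previous instance of the stop script, the "risky" case should not be read as safe; only an offline time that fits within a single attempt is clearly within budget.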
Test environment:
This was tested on a 2-node Solaris cluster with CVM/CFS/RAC installed, running 5.1SP1. A rogue FileOnoff agent was created whose offline script introduces a delay of 300 seconds, and a resource of that type was configured on the cluster. After an init 6 command was issued, these errors were observed:
Jul 15 16:08:57/504 ERROR: svc:/system/vcs:default: Method "/lib/svc/method/vcs stop" failed due to signal KILL.
Jul 15 16:25:30/4: svc:/system/vcs:default: Method or service exit timed out. Killing contract 155.
Jul 15 16:27:33/504 ERROR: svc:/system/vxfen:default: Method "/lib/svc/method/vxfen stop" failed with exit status 1.
Jul 15 16:27:33/508 ERROR: svc:/system/gab:default: Method "/lib/svc/method/gab stop" failed with exit status 1.
Jul 15 16:27:33/523 ERROR: svc:/system/llt:default: Method "/lib/svc/method/llt stop" failed with exit status 1.
And the corresponding errors in /var/adm/messages were:
Jul 15 16:25:30 thor248 svc.startd[7]: [ID 122153 daemon.warning] svc:/system/vcs:default: Method or service exit timed out. Killing contract 155.
Jul 15 16:25:30 thor248 svc.startd[7]: [ID 636263 daemon.warning] svc:/system/vcs:default: Method "/lib/svc/method/vcs stop" failed due to signal KILL.
Jul 15 16:27:32 thor248 svc.startd[7]: [ID 652011 daemon.warning] svc:/system/vxfen:default: Method "/lib/svc/method/vxfen stop" failed with exit status 1.
Jul 15 16:27:33 thor248 gab: [ID 719437 kern.notice] GAB ERROR V-15-1-20015 unconfigure failed: clients still registered
Jul 15 16:27:33 thor248 svc.startd[7]: [ID 652011 daemon.warning] svc:/system/gab:default: Method "/lib/svc/method/gab stop" failed with exit status 1.
Jul 15 16:27:33 thor248 svc.startd[7]: [ID 748625 daemon.error] system/gab:default failed: transitioned to maintenance (see 'svcs -xv' for details)
Jul 15 16:27:33 thor248 svc.startd[7]: [ID 652011 daemon.warning] svc:/system/llt:default: Method "/lib/svc/method/llt stop" failed with exit status 1.
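To see the failure cascade at a glance, the svc.startd messages can be reduced to one line per failed stop method. This is a minimal sketch in POSIX shell, assuming each log entry is on a single unwrapped line in the /var/adm/messages format shown above; failed_stop_methods and sample_messages are illustrative names, and the sample lines are copied from the log excerpt in this note.

```shell
# From svc.startd log lines on stdin, print "<FMRI> <failure reason>" for
# every stop method that failed, in log order. Other lines are ignored.
failed_stop_methods() {
    sed -n 's/.*\(svc:[^ ]*\): Method "[^"]* stop" failed \(.*\)$/\1 \2/p'
}

# Sample input taken from the excerpt above (entries joined onto one line each).
sample_messages() {
    cat <<'EOF'
Jul 15 16:25:30 thor248 svc.startd[7]: [ID 636263 daemon.warning] svc:/system/vcs:default: Method "/lib/svc/method/vcs stop" failed due to signal KILL.
Jul 15 16:27:32 thor248 svc.startd[7]: [ID 652011 daemon.warning] svc:/system/vxfen:default: Method "/lib/svc/method/vxfen stop" failed with exit status 1.
Jul 15 16:27:33 thor248 gab: [ID 719437 kern.notice] GAB ERROR V-15-1-20015 unconfigure failed: clients still registered
Jul 15 16:27:33 thor248 svc.startd[7]: [ID 652011 daemon.warning] svc:/system/gab:default: Method "/lib/svc/method/gab stop" failed with exit status 1.
EOF
}

sample_messages | failed_stop_methods
```

On a live host the same filter could be fed from /var/adm/messages. The first service listed is the one whose stop timed out; the ones after it are the dependent failures.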
This happened because HAD could not shut down within the stop timeout; it was still waiting for the resource to go offline.
This timeout is a tunable parameter that is shipped to the customer in an XML manifest file located at:
/var/svc/manifest/system/*.xml
For vcs:
/var/svc/manifest/system/vcs.xml
The file contents:
------
<exec_method
type='method'
name='stop'
exec='/lib/svc/method/vcs stop'
timeout_seconds='60'>
<method_context>
<method_credential user='root' group='root' />
</method_context>
</exec_method>
-------
The contents of this file also define how the dependencies are set for our products.
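To audit the shipped stop timeout without opening each manifest by hand, the value can be pulled out with a small filter. The sketch below is in POSIX shell with awk and assumes single-quoted attributes inside an exec_method element, as in the vcs.xml excerpt above. It is not a real XML parser; stop_timeout and sample_manifest are names made up for this example.

```shell
# Print timeout_seconds of the exec_method named 'stop' from a manifest on stdin.
# Assumption: attributes are written as key='value', as in the excerpt above.
stop_timeout() {
    awk -v q="'" '
        # Return the value of key=<quote>value<quote> inside text, or "" if absent.
        function attr(text, key,    i, rest) {
            i = index(text, key "=" q)
            if (i == 0) return ""
            rest = substr(text, i + length(key) + 2)
            return substr(rest, 1, index(rest, q) - 1)
        }
        /<exec_method/ { buf = ""; collecting = 1 }
        collecting {
            buf = buf " " $0            # gather the whole opening tag
            if (index($0, ">")) {       # opening tag is complete
                if (attr(buf, "name") == "stop") print attr(buf, "timeout_seconds")
                collecting = 0
            }
        }
    '
}

# The excerpt from vcs.xml shown above.
sample_manifest() {
    cat <<'EOF'
<exec_method
type='method'
name='stop'
exec='/lib/svc/method/vcs stop'
timeout_seconds='60'>
<method_context>
<method_credential user='root' group='root' />
</method_context>
</exec_method>
EOF
}

sample_manifest | stop_timeout    # prints: 60
```

On a cluster node the function would be run as stop_timeout < /var/svc/manifest/system/vcs.xml. The value in the running SMF repository can differ from the manifest; as far as we know it can be read with svcprop -p stop/timeout_seconds for the vcs service, though that was not checked as part of this test.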