How to determine whether SFRAC node panicked due to CRS timeout

Article ID: 100002410

Resolution

Obtain the crash dump from the customer system and verify the panic string and the panic thread

SolarisCAT(vmcore.7/10U)> panic
panic on cpu1
panic string:   forced crash dump initiated at user request
====panic user (LWP_SYS) thread: 0x300056dc340  PID: 16038  on CPU:1 ==== --------<<< Note the PID
cmd: /sbin/uadmin 51    --------<<< Note the command
t_procp:0x30003dc5120
 p_as: 0x300059873f8  size: 2621440  rss:1474560
 hat: 0x30008299880  cnum: 0x0  cpusran:1
 zone: global
t_stk: 0x2a100bdbae0  sp:0x2a100bdb0b1  t_stkbase: 0x2a100bd6000
t_pri: 59(TS)  pctcpu:0.037107
t_lwp: 0x60012438098  machpcb: 0x2a100bdbae0
 mstate:LMS_SYSTEM  ms_prev: LMS_USER
 ms_state_start: 0.0000116 seconds earlier
 ms_start: 0.2235608 seconds earlier
psrset: 0  lastCPU: 1
idle: 0 ticks (0 seconds)
start: Wed Jun 16 07:06:51 2010
age: 0 seconds (0 seconds)
syscall: #55 uadmin(, 0xffbffce8) (sysent:genunix:uadmin+0x0)
tstate: TS_ONPROC - thread is being run on a processor
tflg:   T_PANIC - thread initiated a system panic
        T_DFLTSTK - stack is default size
tpflg:  TP_TWAIT - wait to be freed by lwp_wait
        TP_MSACCT - collect micro-state accounting information
tsched: TS_LOAD - thread is in memory
        TS_DONT_SWAP - thread/LWP should not be swapped
pflag:  SMSACCT - process is keeping micro-state accounting
        SMSFORK - child inherits micro-state accounting

pc:      0x106b2f4      unix:panic+0x1c:  call unix:vpanic

unix:panic+0x1c(0x1269e48, 0x1, 0x1815000, 0x1815000, 0x2b, 0x0)
genunix:kadmin+0x4ac(, 0x1, 0x0, 0x60010803d98)
genunix:uadmin+0x11c(, 0x1)
unix:syscall_trap32+0xcc()
-- switch to user thread's user stack --
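
If the panic output has been saved to a file, the PID and command noted above can be extracted mechanically. The following is a minimal sketch, not a SolarisCAT feature: the function name panic_pid_cmd is hypothetical, and it assumes the output layout shown above and a POSIX awk (on Solaris 10, nawk or /usr/xpg4/bin/awk rather than /usr/bin/awk).

```shell
# Hypothetical helper (not a SolarisCAT command): reads saved "panic"
# output on stdin and prints the panic thread's PID, then its command.
panic_pid_cmd() {
    awk '
        # The "====panic ... thread:" banner carries "PID: <n>".
        /panic .*thread:/ {
            for (i = 1; i < NF; i++)
                if ($i == "PID:") pid = $(i + 1)
        }
        # The "cmd:" line holds the command the thread was running.
        /^cmd:/ {
            sub(/^cmd:[ \t]*/, "")
            sub(/[ \t]*-+<<<.*$/, "")   # drop any trailing annotation
            cmd = $0
        }
        END { print pid; print cmd }
    '
}
```

Run against a saved copy of the output (for example, panic_pid_cmd < panic.out, where panic.out is an assumed file name), the first line printed is the PID to pass to proc tree in the next step.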

Print the process tree of the panic PID

SolarisCAT(vmcore.7/10U)> proc tree 16038
4059  /bin/sh /etc/init.d/init.cssdfatal
 6855  /bin/sh /etc/init.d/init.cssd daemon ---------------<<< This shows that the Oracle CRS daemon issued the uadmin command, which resulted in the system panic
   16038 /sbin/uadmin 51
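
The same check can be scripted against saved proc tree output. This is a rough sketch under stated assumptions: the function name crs_initiated_panic is hypothetical, the match relies on the init.cssd script path shown above, and a POSIX awk is assumed.

```shell
# Hypothetical helper: reads saved "proc tree <pid>" output on stdin.
# Returns 0 if /sbin/uadmin appears after an init.cssd ancestor line,
# 1 otherwise. proc tree lists ancestors first, so order matters.
crs_initiated_panic() {
    awk '
        /init\.cssd/     { seen_cssd = 1 }
        /\/sbin\/uadmin/ { if (seen_cssd) found = 1 }
        END { exit !found }
    '
}
```

A zero exit status only indicates that the tree matches the pattern described in this article; the crash dump still needs to be reviewed as a whole.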

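This type of panic can have several causes, each of which can keep CRS from completing its heartbeat or voting disk I/O within the configured timeouts:
- The system is too busy
- Slow SAN response
- The file system is not responding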

Verify whether the customer has configured the OCR and voting disk on a CFS file system

# export PATH=$PATH:/apps/crshome/bin
# ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          2
         Total space (kbytes)     :     262144
         Used space (kbytes)      :       3264
         Available space (kbytes) :     258880
         ID                       : 1962738043
         Device/File Name         : /ocrvote/ocrdisk
                                    Device/File integrity check succeeded

                                    Device/File not configured

         Cluster registry integrity check succeeded

# crsctl query css votedisk
0.    0    /ocrvote/votedisk

located 1 votedisk(s).

# mount -v | grep /ocrvote
/dev/vx/dsk/ocrvotedg/ocrvotevol on /ocrvote type vxfs read/write/setuid/devices/mincache=direct/delaylog/largefiles/qio/cluster/ioerror=mdisable/crw/mntlock=VCS/dev=4f0dea8 on Wed Jun 16 14:19:25 2010
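
In the mount output above, the vxfs type together with the cluster option is what identifies a CFS mount. A minimal sketch of that check follows; the function name is_cfs_mount is hypothetical, the field positions assume the Solaris mount -v layout shown above, and awk -v requires a POSIX awk (nawk or /usr/xpg4/bin/awk on Solaris 10).

```shell
# Hypothetical helper: $1 = mount point, stdin = "mount -v" output.
# Returns 0 if the mount point is vxfs and mounted with "cluster".
is_cfs_mount() {
    awk -v mp="$1" '
        # Expected layout: <device> on <mountpoint> type <fstype> <options> ...
        $2 == "on" && $3 == mp && $4 == "type" && $5 == "vxfs" {
            n = split($6, opts, "/")
            for (i = 1; i <= n; i++)
                if (opts[i] == "cluster") found = 1
        }
        END { exit !found }
    '
}
```

For example, mount -v | is_cfs_mount /ocrvote && echo "CFS mount" would print the message for the file system shown above.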


Check the current timeout values for CRS

# /apps/crshome/bin/crsctl get css disktimeout
# /apps/crshome/bin/crsctl get css misscount
# /apps/crshome/bin/crsctl get css reboottime
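
To record all three values together for the case notes, the queries above can be wrapped in a small loop. This is only a convenience sketch: the function name show_css_timeouts is hypothetical, and the default crsctl path is the one used in this article and may differ on other systems.

```shell
# Hypothetical wrapper around the three crsctl queries shown above.
# $1 = path to crsctl (defaults to the path used in this article).
show_css_timeouts() {
    crsctl_bin=${1:-/apps/crshome/bin/crsctl}
    for param in disktimeout misscount reboottime; do
        # Label each value so the output is readable when saved to a file.
        printf '%s: ' "$param"
        "$crsctl_bin" get css "$param"
    done
}
```

Running show_css_timeouts on each RAC node makes it easy to confirm that the nodes are configured consistently.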


Based on the outcome of the crash dump analysis, advise the customer to increase the above timeout values on all RAC nodes to prevent similar panics

# /apps/crshome/bin/crsctl set css misscount 300
# /apps/crshome/bin/crsctl set css disktimeout 300
 

 

Issue/Introduction

How to determine whether SFRAC node panicked due to CRS timeout