How to collect a complete set of kernel cores for CFS issue RCA

Article ID: 100029707

Description

Keep the following points in mind when collecting kernel cores for root cause analysis (RCA) of SFCFS-related problems.

1. Due to the distributed nature of the SFCFS environment, we need to collect cores from all CFS/CVM nodes.


Note: If the kernel cores are not collected at exactly the same time, the subsequent root cause analysis is jeopardized, because vital evidence about the problem may be erased from the kernel cores during cluster reconfiguration.


2.a. Collect the following CVM-related information on every cluster node before collecting the kernel cores. Run the following commands three times at 30-second intervals.

/etc/vx/diag.d/kmsgdump 2000 > /var/tmp/kmsgdump.out.1  2>&1
/usr/sbin/vxdctl dumpmsg > /var/tmp/dumpmsg.out.1  2>&1


Please use a different file name each time, e.g. /var/tmp/kmsgdump.out.2, /var/tmp/dumpmsg.out.2
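
The three collection rounds can be scripted so that the file suffix and the 30-second gap are handled consistently. The following is a minimal sketch, not a supported tool: the function name collect_cvm_rounds and the OUTDIR, ROUNDS, INTERVAL, KMSGDUMP and VXDCTL variables are illustrative assumptions. Only the two commands and the file naming come from step 2.a.

```shell
#!/bin/sh
# Hypothetical helper: run the two CVM diagnostic commands from step 2.a
# several times, writing each round to a separately numbered file.
# Variable names and defaults are assumptions made for this sketch.
OUTDIR=${OUTDIR:-/var/tmp}
ROUNDS=${ROUNDS:-3}
INTERVAL=${INTERVAL:-30}
KMSGDUMP=${KMSGDUMP:-/etc/vx/diag.d/kmsgdump}
VXDCTL=${VXDCTL:-/usr/sbin/vxdctl}

collect_cvm_rounds() {
    i=1
    while [ "$i" -le "$ROUNDS" ]; do
        "$KMSGDUMP" 2000 > "$OUTDIR/kmsgdump.out.$i" 2>&1
        "$VXDCTL" dumpmsg > "$OUTDIR/dumpmsg.out.$i" 2>&1
        # Sleep between rounds, but not after the last one.
        if [ "$i" -lt "$ROUNDS" ]; then
            sleep "$INTERVAL"
        fi
        i=$((i + 1))
    done
}
```

After loading these definitions, calling collect_cvm_rounds as root on each cluster node would produce kmsgdump.out.1 to kmsgdump.out.3 and dumpmsg.out.1 to dumpmsg.out.3 under /var/tmp.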
 
2.b. Collect the following GLM-related information. glmdump attempts to collect information from all nodes in the cluster via ssh; use the -r option to use rsh instead.
/opt/VRTS/bin/glmdump > /var/tmp/glmdump.out.1 2>&1
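
Because the nodes are crashed right after this data is gathered, it can be worth confirming that the output files were actually written. The sketch below is an illustrative pre-flight check. The function name check_diag_outputs is hypothetical, and the check is deliberately simple: a non-empty file does not prove that the command succeeded, since stderr is redirected into the same file.

```shell
#!/bin/sh
# Hypothetical pre-flight check: succeed only if every file passed as an
# argument exists and is non-empty; report the ones that fail on stderr.
check_diag_outputs() {
    rc=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

A typical call, using the file names from steps 2.a and 2.b, would be: check_diag_outputs /var/tmp/kmsgdump.out.1 /var/tmp/dumpmsg.out.1 /var/tmp/glmdump.out.1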


3. When a node leaves the cluster, SFCFS initiates an automatic reconfiguration, during which vital information about the problem may be lost on the remaining nodes. The following procedure delays the GAB membership reconfiguration and is expected, though not guaranteed, to delay the reconfiguration of other parts of SFCFS when a node leaves the cluster. Perform it on all cluster nodes before starting to collect the kernel cores.

Set the Peer Inactive Timer to 180,000 system clock ticks (10 ms per tick), which equals 30 minutes. (The default value is 1600, which is 16 seconds.)

# /sbin/lltconfig -T peerinact:180000

Confirm that the Peer Inactive Timer is set correctly.

# /sbin/lltconfig -T query          
Current LLT timer values (.01 sec units):
 heartbeat   = 50
 heartbeatlo = 100
 peertrouble = 200
 peerinact   = 180000           <<<<<
 oos         = 10
 retrans     = 10
 service     = 100
 arp         = 30000
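
When the check has to be repeated on many nodes, the peerinact line can be extracted from the query output instead of being read by eye. This is a minimal sketch that assumes the "name = value" layout shown in the listing above; the function name get_peerinact is hypothetical. The "<<<<<" marker in the listing is an annotation, and the sketch reads only the third field, so trailing text on the line does not affect it.

```shell
#!/bin/sh
# Hypothetical helper: read `lltconfig -T query` output on stdin and print
# the numeric peerinact value. Assumes the "name = value" layout shown in
# the sample output.
get_peerinact() {
    awk '$1 == "peerinact" && $2 == "=" { print $3; exit }'
}
```

It would be used as: /sbin/lltconfig -T query | get_peerinact, with the printed value then compared against 180000.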

 
 
Do not set the peerinact timer higher than 214748. Because of Etrack incident 3304583, any peerinact value above 214748 causes an internal kernel variable to overflow and may cause the LLT links to disconnect.
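
A small guard can catch an out-of-range value before it is passed to lltconfig. The sketch below is illustrative only: the names PEERINACT_MAX, ticks_to_seconds and valid_peerinact are hypothetical, while the limit of 214748 and the 10 ms tick come from this article. As an arithmetic observation, 214748 x 10,000 = 2,147,480,000, which is just under the signed 32-bit maximum of 2,147,483,647. Whether that is the actual overflow mechanism is an inference, not something this article states.

```shell
#!/bin/sh
# Upper bound stated in this article for the peerinact timer.
PEERINACT_MAX=214748

# Print the duration in whole seconds for a tick count (one tick = 10 ms).
ticks_to_seconds() {
    echo $(( $1 / 100 ))
}

# Succeed only if the argument is a positive integer no larger than
# PEERINACT_MAX.
valid_peerinact() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -ge 1 ] && [ "$1" -le "$PEERINACT_MAX" ]
}
```

For example, ticks_to_seconds 180000 prints 1800 (30 minutes), and valid_peerinact 180000 succeeds while valid_peerinact 214749 fails.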

Note that the peerinact timer should remain at its default value of 1600 during normal operation; do not leave the systems running with any other value. The timer only needs to be set to 180,000 just before the kernel cores are collected. After the systems reboot, the timer is set to the default value again.

If the customer cannot take kernel cores from all the nodes in the cluster, the customer should still set the peerinact timer to 180,000 on all the nodes before taking the required kernel cores. Remind the customer to change the peerinact timer on the surviving nodes back to the default value immediately after kernel core dumping is initiated on the required nodes. Note that without a complete set of kernel cores from the cluster, we do not have an accurate snapshot of the cluster, and this may reduce the chance of a successful root cause analysis.
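
Restoring the default on several surviving nodes can be prepared ahead of time so that it is not forgotten once the dumps have started. The following is a sketch under stated assumptions: root ssh access between nodes, the function name restore_peerinact, the DRY_RUN switch and the host names in the usage line are all hypothetical. Only the lltconfig syntax and the default of 1600 come from this article.

```shell
#!/bin/sh
# Hypothetical helper: print, or run over ssh, the command that restores
# the default peerinact value on each host named as an argument.
# ssh access as root and the DRY_RUN switch are assumptions of this sketch.
DEFAULT_PEERINACT=1600
DRY_RUN=${DRY_RUN:-1}

restore_peerinact() {
    for host in "$@"; do
        cmd="/sbin/lltconfig -T peerinact:$DEFAULT_PEERINACT"
        if [ "$DRY_RUN" = "1" ]; then
            # Dry run: show what would be executed without touching the node.
            echo "ssh $host $cmd"
        else
            ssh "$host" "$cmd"
        fi
    done
}
```

With DRY_RUN left at 1, restore_peerinact node1 node2 only prints the two ssh commands for review; setting DRY_RUN=0 would execute them.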

Note also that the timer does not need to be changed when a live kernel core (savecore -L) is collected, because collecting a live kernel core does not affect the membership or configuration of the cluster.

4. Even though the peerinact timer is set to 180000, there is still a chance that some other parts of SFRAC will be reconfigured when a node leaves. Therefore all the kernel cores should be collected at the same time, or within seconds of each other. You can station a person at each console terminal to force the core dump, or keep multiple console windows ready so that the dumps can be forced within a couple of seconds of each other.
 
5. Collect a crash dump from all nodes using your normal O/S commands. Verify with the O/S vendor that the system is set up to capture crash dumps.

Note: Taking all the kernel cores at exactly the same time is the most important part of the whole process. Kernel cores taken even a few seconds apart may be useless for the RCA because vital information may have been lost.

Note: When collecting the kernel cores, the systems should not go through the normal shutdown procedure (e.g. /usr/sbin/shutdown, /usr/sbin/reboot, /usr/sbin/halt, etc.). This is because a normal shutdown also causes reconfiguration of the cluster, and vital evidence of the problem may be erased from the kernel core. It is recommended that the databases are shut down and the file systems are unmounted (if the problem does not cause them to hang), and that the systems are then crashed using hardware-specific commands (e.g. break to the OK prompt and run sync on Sun Microsystems SPARC machines, TOC on HP-UX machines).

Please refer to the specific platform documents on how to crash the system and collect the kernel cores.

6. Finally, please collect VRTSexplorer from all the nodes in the cluster after the systems reboot.

 
 

 

Issue/Introduction

How to collect a complete set of kernel cores for CFS issue RCA

Additional Information

ETrack: 3304583