Problem :
Triaging a cluster-wide hang and identifying the culprit nodes (a subset of the cluster nodes)
Environment :
All UNIX flavours (Linux, Solaris, AIX) on InfoScale 8.0 and above
Description :
When a file system hangs, the applications accessing it may become unresponsive and, depending on the underlying issue, the whole system may hang. In most cases, the operating system blocks the process threads that use the file system until the I/O has been sent to the disk subsystem and the underlying block device has acknowledged it to the file system.
The hang or slowness in I/O operations is not necessarily caused by the file system itself. The problem could lie anywhere in the I/O stack: the storage subsystem, SAN, HBA, SCSI layer, device multipathing software, block device, file system, or even the application.
To identify the actual problem, each component of the I/O stack should be investigated while the issue is occurring. This article only provides guidelines for troubleshooting and investigating the Veritas File System (VxFS) component of the I/O stack. Other components may also require troubleshooting, but they are beyond the scope of this document.
In a large cluster (for example, 16 nodes), crashing or rebooting every node to recover from a cluster-wide hang is not an acceptable workaround, because it undermines high availability and increases application downtime.
Troubleshooting Steps :
To isolate the I/O hang and to identify if the issue is related to the Cluster File System (VxFS/CFS), the following steps are recommended:
1) Identify if the file system is accessible or is completely hung for all commands/processes
# ls -l <problem file system mount point>
# cd <problem file system mount point>
If the commands hang, leave the processes running so that they can be examined in the following steps.
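One way to run the check in step 1 without tying up the terminal is to start the command in the background and test after a short wait whether it is still running. The following is a minimal sketch, not a Veritas utility: the function name probe_mount, the 10-second default wait, and the output strings are all illustrative assumptions.

```shell
#!/bin/sh
# probe_mount: hypothetical helper (not shipped with InfoScale).
# Starts "ls -l" on the mount point in the background, waits a few seconds,
# and reports whether the command is still running. A blocked command is
# deliberately left running so that its stack can be collected in step 2.
probe_mount() {
    mnt=$1
    wait_secs=${2:-10}              # assumed default; tune for your environment

    ls -l "$mnt" >/dev/null 2>&1 &
    probe_pid=$!

    sleep "$wait_secs"

    if kill -0 "$probe_pid" 2>/dev/null; then
        echo "HUNG $probe_pid"      # still blocked: note the PID for step 2
    else
        echo "RESPONDED"            # ls returned (successfully or with an error)
    fi
}
```

For example, "probe_mount /mnt/data 15" (a placeholder mount point) prints either RESPONDED or HUNG followed by the PID of the blocked ls process. A slow but healthy file system can also exceed the wait, so the result is only a hint.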
2) Identify the processes currently using the file system
# fuser -cu <problem file system mount point>
Collect the stack trace of each process that is using the file system three times, at 60-second intervals.
Solaris:
# pstack <pid of process/(es) from “fuser -cu”>
Linux:
# gdb -p <pid of process/(es) from “fuser -cu”>
At the (gdb) prompt, run the "bt" command to capture the stack of the process, and then run
"quit" to exit from gdb.
AIX:
# procstack <pid of process/(es) from “fuser -cu”>
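The repeated stack collection in step 2 can be scripted. The sketch below picks the stack tool from the "uname -s" output and takes three samples 60 seconds apart. The function names, the STACK_CMD override, and the output file naming are assumptions made for illustration. On Linux it uses gdb in batch mode ("-batch -ex bt"), which prints the backtrace and exits without an interactive prompt.

```shell
#!/bin/sh
# stack_cmd_for_os: prints the stack command named in step 2 for this OS.
stack_cmd_for_os() {
    case "$(uname -s)" in
        SunOS) echo "pstack" ;;
        AIX)   echo "procstack" ;;
        Linux) echo "gdb -batch -ex bt -p" ;;   # non-interactive "bt"
        *)     echo "unsupported OS: $(uname -s)" >&2; return 1 ;;
    esac
}

# collect_stacks: hypothetical helper.
# Usage: collect_stacks <pid> [count] [interval-seconds] [output-dir]
collect_stacks() {
    pid=$1
    count=${2:-3}                   # three samples, as in step 2
    interval=${3:-60}               # seconds between samples
    outdir=${4:-/var/tmp}
    cmd=${STACK_CMD:-$(stack_cmd_for_os)}   # STACK_CMD is an assumed override
    [ -n "$cmd" ] || return 1

    i=1
    while [ "$i" -le "$count" ]; do
        # $cmd is intentionally unquoted so that its options are word-split
        $cmd "$pid" > "$outdir/stack.$pid.$i.out" 2>&1
        [ "$i" -lt "$count" ] && sleep "$interval"
        i=$((i + 1))
    done
}
```

Run it once for each PID reported by fuser, for example "collect_stacks 12345" (a placeholder PID), and keep the resulting stack.<pid>.<n>.out files with the other diagnostic data.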
3) On all the nodes, set LLT debug messages 152, 141, 142.
# lltconfig -S 152
# lltconfig -S 141
# lltconfig -S 142
This writes a large number of LLT debug messages to the system log files.
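Steps that must be run on every node, such as setting the debug flags above, can be driven from a single node. The sketch below is one possible approach and rests on several assumptions: that node names can be read from an llthosts-style file with "<node-id> <node-name>" on each line, that those names resolve, and that passwordless root ssh works between the nodes. The function name run_on_all_nodes is hypothetical. During a hang a node may not answer, so a connection timeout is set.

```shell
#!/bin/sh
# run_on_all_nodes: hypothetical helper (assumes passwordless ssh).
# Usage: run_on_all_nodes <llthosts-style file> <command> [args...]
run_on_all_nodes() {
    hosts_file=$1
    shift
    # The second column of each line is taken as the node name
    for node in $(awk '{ print $2 }' "$hosts_file"); do
        echo "== $node =="
        # -n stops ssh from reading stdin; ConnectTimeout avoids waiting
        # indefinitely on a node that does not answer
        ssh -n -o ConnectTimeout=10 "$node" "$@" ||
            echo "WARNING: command failed on $node" >&2
    done
}

# Example (not executed here): enable the three LLT debug flags everywhere
# run_on_all_nodes /etc/llthosts 'for f in 152 141 142; do lltconfig -S $f; done'
```

If ssh is not configured between the nodes, run the commands locally on each node as described in the steps.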
4) Take the comms log on all the nodes
# /opt/VRTSgab/getcomms -local
5) Increase the peerinact time on all the nodes to maximum:
# lltconfig -T query|grep -w peerinact
peerinact = 1600
# lltconfig -T peerinact:9999999
# /sbin/lltconfig -T query
Verify that the output shows peerinact = 9999999.
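Because the original peerinact value has to be restored later, it can help to record it before raising it. The sketch below parses the "peerinact = <value>" line in the format shown in the query output above. The function names and the file path /var/tmp/peerinact.orig are assumptions, and the parsing may need adjusting if the output format differs on your release.

```shell
#!/bin/sh
# parse_peerinact: hypothetical helper. Reads "lltconfig -T query" output on
# stdin and prints the peerinact value, assuming a "peerinact = <value>" line.
parse_peerinact() {
    awk '$1 == "peerinact" && $2 == "=" { print $3; exit }'
}

# save_peerinact: stores the current value in a file (path is an assumption).
# Run it BEFORE "lltconfig -T peerinact:9999999".
save_peerinact() {
    save_file=${1:-/var/tmp/peerinact.orig}
    lltconfig -T query | parse_peerinact > "$save_file"
}
```

The saved file then holds a single number, for example 1600, that can be passed back to "lltconfig -T peerinact:<value>" once the hang is resolved.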
6) Collect the following CVM-related information on every cluster node before collecting the kernel cores. Run the commands at least three times at 30-second intervals; the loop below takes four samples.
for i in 1 2 3 4
do
/etc/vx/diag.d/kmsgdump 2000 > /var/tmp/kmsgdump.`hostname`.`date +"%d%m%y%H%M%S"`.out 2>&1
/usr/sbin/vxdctl dumpmsg > /var/tmp/dumpmsg.`hostname`.`date +"%d%m%y%H%M%S"`.out 2>&1
sleep 30    # interval between samples: 15 or 30 seconds
done
7) Collect the following GLM related information from one of the nodes of the cluster:
# /opt/VRTS/bin/glmdump > /var/tmp/glmdump.out.1 2>&1
By default, this command communicates with the other nodes over “ssh”. Use “-r” for rsh or “-h” for hacli.
8) Run the msgdump command to find the culprit nodes that should be evicted from the cluster configuration. This command can be run from any node of the cluster.
# /opt/VRTS/bin/msgdump > /var/tmp/msgdump.out.1 2>&1
By default, this command communicates with the other nodes over “ssh”. Use “-r” for rsh or “-h” for hacli.
The command identifies the nodes that are not responding to a directed or broadcast message, as well as the nodes that sent those messages. These are usually a subset of the cluster nodes. Evicting those nodes from the cluster configuration lets the remaining nodes resume access to the affected file system. When evicting the nodes, collect crash dumps from them so that the root cause analysis (RCA) can be completed.
9) Once the affected nodes have been crashed or rebooted, reset the LLT peerinact value to its default or original value and reset the LLT debug messages on the remaining nodes.
# lltconfig -T peerinact:1600
# lltconfig -R 152
# lltconfig -R 141
# lltconfig -R 142
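The reset in step 9 can be wrapped in a small function so that it runs the same way on every remaining node. This is a sketch under stated assumptions: the function name restore_llt is hypothetical, and the file /var/tmp/peerinact.orig exists only if the original peerinact value was recorded there before it was raised. Without a usable saved value, the sketch falls back to 1600, the value used in this article.

```shell
#!/bin/sh
# restore_llt: hypothetical helper for step 9.
# Restores peerinact and clears LLT debug flags 152, 141 and 142.
restore_llt() {
    save_file=${1:-/var/tmp/peerinact.orig}   # assumed path
    value=1600                                # fallback used in this article

    if [ -s "$save_file" ]; then
        value=$(cat "$save_file")
    fi
    # Ignore anything that is not a plain number
    case "$value" in
        ''|*[!0-9]*) value=1600 ;;
    esac

    lltconfig -T "peerinact:$value"
    for flag in 152 141 142; do
        lltconfig -R "$flag"
    done
}
```

After running it, "lltconfig -T query" can be used to confirm the peerinact value.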
Solution Benefit:
“/opt/VRTS/bin/msgdump” can identify the culprit nodes, which removes the need to crash or reboot every node of the cluster. This reduces application downtime and increases the probability of completing the RCA.
This approach works on a best-effort basis and helps in most cases, but it is not guaranteed to identify all the culprit nodes in every cluster-wide hang.