SG is NOT failed over due to system hang.

Article ID: 100001060

Resolution


====================================================================
What is a system hang or crash?
====================================================================

# System Hang

A system hang is a freeze or lock-up of the computer. When a single
program crashes, it can normally report a diagnostic message or error.
When the whole operating system hangs, no message is displayed and the
mouse and keyboard become unresponsive. Often the computer cannot be
restarted without powering it off completely.



# System Crash

When a program raises an exception or crashes, it sometimes causes the
system to panic.

System hangs and crashes are caused by the following:
- H/W issues : CPU, memory, M/B, PCI boards such as HBA or SCSI, storage
(internal or external), and so on.
- S/W issues : OS (kernel), 3rd party drivers, 3rd party S/W deadlocks,
performance problems, and so on.

Generally, CPUs fetch and execute instructions and read and write data
in memory, while control and data I/O is transferred to and from files
on the OS partitions.

When the CPU executes a wrong instruction or memory is accessed at a
wrong address, the system may panic. When the root disk (OS partition)
crashes, or some threads wait indefinitely for I/O completion, the
system may hang. When the root disk crashes, only some I/O may hang at
first, but the hang soon spreads to the whole OS.



====================================================================
What symptoms occur when the root disk crashes?
====================================================================
At first, applications that access the OS partition, such as _HAD,
hang because their I/O to the OS partition never completes. Then the OS
cannot swap the contents of physical memory out to virtual memory on
the swap partition, and kernel drivers cannot handle I/O. As a result,
the OS itself hangs.

Users therefore see first that services on that system stop responding,
and then that the system itself cannot be accessed.



====================================================================
How to detect a system crash or hang.
====================================================================
Whether the system hangs or crashes, the standby system detects that
_HAD on the active node cannot be reached and that the LLT links are
disconnected.

However, the service group (SG) is failed over when the system crashes,
and is auto-disabled when the system hangs.



1. Why the service group is auto-disabled when the system hangs.

When VCS does not know the status of a service group on a particular
system, it auto-disables the service group on that system to protect
services and data. For example, suppose a disk group is imported and a
file system (FS) is mounted on the active node, and that node hangs. If
the user then tries to import the disk group and mount the FS on the
standby system, the FS or the disks in the disk group may be corrupted,
or the attempt may simply fail. For the same reason, VCS auto-disables
any service group whose status it does not know.

Auto-disabling occurs under the following conditions:
> When the VCS engine, _HAD, is not running on the system.
> When all resources within the service group are not probed on the system.
> When a particular system is visible through disk heartbeat only.

Under these conditions, all service groups that include the system in
their SystemList attribute are auto-disabled.
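
As an illustration only, the rule above can be sketched in Python. The
function name, the group names, and the dictionary layout are hypothetical
and are not part of VCS; the sketch only shows which groups would be
affected for a given system.

```python
def groups_to_autodisable(system, system_lists):
    """Return the service groups whose SystemList includes `system`.

    `system_lists` maps a group name to the list of systems in its
    SystemList attribute (a hypothetical in-memory representation).
    """
    return sorted(
        group for group, systems in system_lists.items() if system in systems
    )


# Hypothetical cluster layout, used only for this example.
example_system_lists = {
    "db_sg": ["krdb001a", "krdb001b"],
    "app_sg": ["krdb001b"],
}

affected = groups_to_autodisable("krdb001a", example_system_lists)
print(affected)
```

Here only "db_sg" lists krdb001a in its SystemList, so it is the only
group returned.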



2. How VCS distinguishes between a system hang and a crash.

As noted above, whether a system hangs or crashes, the remaining systems
detect that _HAD is down and that LLT is disconnected. There is, however,
a small difference between a system hang and a crash.

If the system crashes, the remaining systems detect _HAD down and LLT
disconnection together, within the time specified by the 'ShutdownTimeout'
attribute. If the system hangs, the remaining systems detect _HAD down
first and detect LLT disconnection only later.

So VCS checks these two events, _HAD down and LLT disconnection:
> If VCS detects _HAD down on the active node, it auto-disables all
service groups on the active node.
> If VCS then detects LLT disconnection within the specified time, it
regards the active node as crashed and re-enables the auto-disabled
service groups.
> If not, VCS regards the active node as hung and waits until the user
handles the issue.
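
The decision described above can be sketched as follows. This is a minimal
illustration, not VCS code: the function name is hypothetical, the timeout
is supplied by the caller, and treating a gap exactly equal to the timeout
as a crash is an assumption.

```python
from datetime import datetime, timedelta
from typing import Optional


def classify_peer_failure(
    had_down_at: datetime,
    llt_down_at: Optional[datetime],
    shutdown_timeout: timedelta,
) -> str:
    """Classify a peer failure as 'crash' or 'hang'.

    had_down_at      -- when _HAD on the peer was seen down
    llt_down_at      -- when LLT disconnection was seen, or None if not yet seen
    shutdown_timeout -- the window taken from the ShutdownTimeout attribute
    """
    if llt_down_at is None:
        # LLT is still connected: the node may be hung, so stay auto-disabled.
        return "hang"
    if llt_down_at - had_down_at <= shutdown_timeout:
        # Both events arrived within the window: treat as a crash.
        return "crash"
    # LLT disconnection arrived too late: treat as a hang.
    return "hang"


timeout = timedelta(seconds=180)
t0 = datetime(2010, 1, 19, 10, 9, 51)
print(classify_peer_failure(t0, t0 + timedelta(seconds=5), timeout))
print(classify_peer_failure(t0, None, timeout))
```

The first call models both events arriving 5 seconds apart, and the second
models _HAD down with no LLT disconnection seen yet.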



====================================================================
Additional explanation of the customer's case.
====================================================================
According to the customer, the root disk on the active node (krdb001a)
crashed, and the node eventually hung. The standby node (krdb001b)
detected that _HAD on the active node was down at 10:09:51 on
January 19, 2010.


---------------------
Jan 19 10:09:51 krdb001b gab: [ID 316943 kern.notice] GAB INFO V-15-1-20036 Port h gen 6ab318 membership ;1
Jan 19 10:09:51 krdb001b gab: [ID 674723 kern.notice] GAB INFO V-15-1-20038 Port h gen 6ab318 k_jeopardy 0
Jan 19 10:09:51 krdb001b gab: [ID 513393 kern.notice] GAB INFO V-15-1-20040 Port h gen 6ab318 visible 0
Jan 19 10:09:51 krdb001b Had[321]: [ID 702911 daemon.notice] VCS INFO V-16-1-10077 Received new cluster membership
Jan 19 10:09:51 krdb001b Had[321]: [ID 702911 daemon.notice] VCS ERROR V-16-1-10113 System krdb001a (Node '0') is in DDNA Membership - Membership: 0x2, Visible: 0x0
Jan 19 10:09:51 krdb001b Had[321]: [ID 702911 daemon.notice] VCS ERROR V-16-1-10322 System krdb001a (Node '0') changed state from RUNNING to FAULTED
---------------------

Later, the standby node (krdb001b) detected that the LLT links were
disconnected at 10:32:26 and decided that the active node was down at
10:32:45 on January 19, 2010.

---------------------
Jan 19 10:32:26 krdb001b llt: [ID 140958 kern.notice] LLT INFO V-14-1-10205 link 0 (ce5) node 0 in trouble
Jan 19 10:32:26 krdb001b llt: [ID 140958 kern.notice] LLT INFO V-14-1-10205 link 1 (ce2) node 0 in trouble
Jan 19 10:32:26 krdb001b llt: [ID 140958 kern.notice] LLT INFO V-14-1-10205 link 2 (ce1) node 0 in trouble
......
Jan 19 10:32:40 krdb001b llt: [ID 911753 kern.notice] LLT INFO V-14-1-10033 link 0 (ce5) node 0 expired
Jan 19 10:32:40 krdb001b llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 1 (ce2) node 0 inactive 16 sec (900963828)
Jan 19 10:32:40 krdb001b llt: [ID 911753 kern.notice] LLT INFO V-14-1-10033 link 1 (ce2) node 0 expired
Jan 19 10:32:40 krdb001b llt: [ID 487101 kern.notice] LLT INFO V-14-1-10032 link 2 (ce1) node 0 inactive 16 sec (409964509)
Jan 19 10:32:40 krdb001b llt: [ID 911753 kern.notice] LLT INFO V-14-1-10033 link 2 (ce1) node 0 expired
Jan 19 10:32:44 krdb001b gab: [ID 316943 kern.notice] GAB INFO V-15-1-20036 Port h gen 6ab318 membership ;1
Jan 19 10:32:44 krdb001b Had[321]: [ID 702911 daemon.notice] VCS INFO V-16-1-10077 Received new cluster membership
Jan 19 10:32:44 krdb001b Had[321]: [ID 702911 daemon.notice] VCS ERROR V-16-1-10079 System krdb001a (Node '0') is in Down State - Membership: 0x2
Jan 19 10:32:45 krdb001b gab: [ID 316943 kern.notice] GAB INFO V-15-1-20036 Port a gen 6ab30b membership ;1
---------------------

The time difference between _HAD down and _NODE down is greater than
'ShutdownTimeout' (180 seconds). So VCS regarded the active node as
hung, and the service group was auto-disabled.
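
As a worked check of that comparison, the gap can be computed from two of
the log lines quoted above. This is a minimal sketch: the helper name is
hypothetical, the year is supplied by hand because syslog timestamps carry
no year, and the 180-second value is the one stated in this article.

```python
from datetime import datetime


def parse_syslog_time(line: str, year: int = 2010) -> datetime:
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a syslog line."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


# Log lines copied from the excerpts in this article.
had_down_line = (
    "Jan 19 10:09:51 krdb001b Had[321]: [ID 702911 daemon.notice] VCS ERROR "
    "V-16-1-10113 System krdb001a (Node '0') is in DDNA Membership - "
    "Membership: 0x2, Visible: 0x0"
)
node_down_line = (
    "Jan 19 10:32:44 krdb001b Had[321]: [ID 702911 daemon.notice] VCS ERROR "
    "V-16-1-10079 System krdb001a (Node '0') is in Down State - "
    "Membership: 0x2"
)

SHUTDOWN_TIMEOUT_SECONDS = 180  # value stated in this article

gap = parse_syslog_time(node_down_line) - parse_syslog_time(had_down_line)
gap_seconds = int(gap.total_seconds())

print(gap_seconds)                             # 1373 (22 min 53 s)
print(gap_seconds > SHUTDOWN_TIMEOUT_SECONDS)  # True
```

The gap of 1373 seconds is far larger than 180 seconds, which matches the
conclusion that the node was treated as hung rather than crashed.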



 

 

Issue/Introduction

SG is NOT failed over due to system hang.