VCS engine "had" process consumes over 99% CPU time. Multiple ha commands return "Cannot connect to VCS engine" and hang in pollsys()

Article ID: 100025044

Description

Error Message

Output of ha commands such as hastatus, halog, hares and hagrp:
 


# /opt/VRTS/bin/hastatus -sum  
VCS ERROR V-16-1-10600 Cannot connect to VCS engine
 


Sample stack trace of one of the hares commands (from core dump):
 

==== user (LWP_SYS) thread: 0x30001932d00  PID: 6401 ====  
cmd: hares -state IP_wMPortal3 -sys tlxkswmbrkr1  
t_wchan: 0x3001e89109a  sobj: condition var (from genunix:poll_common+0x4e8)  
[...]
idle: 54934992 ticks (6 days 8 hours 35 minutes 49.92 seconds)  
[...]
genunix:cv_wait_sig_swap_core+0x130(, , 0x0)  
genunix:cv_waituntil_sig(0x3001e89109a, 0x3001e891060, 0x0) - frame recycled  
genunix:poll_common+0x4e8(0xffbf9360?, 0x1, 0x0, 0x0, , 0x60017cf5310)  
genunix:pollsys+0xf8(, 0x1)  
unix:syscall_trap32+0xcc()  
-- switch to user thread's user stack --

Cause

This issue is tracked via Symantec internal incident e2416758.

This issue occurs only when the operating system runs out of file descriptors. Use the ulimit command to display the maximum number of open file descriptors allowed per process in the current shell:

# ulimit -n
1024
#
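
The same limit can be read programmatically and compared with FD_SETSIZE. The following is a minimal sketch, assuming a POSIX system; the function names are hypothetical and are not part of VCS. It only illustrates the relationship between the two limits: when the soft limit on open files is larger than FD_SETSIZE, a process can be handed descriptors that do not fit in an fd_set.

```c
#include <stdbool.h>
#include <sys/resource.h>
#include <sys/select.h>

/* Hypothetical helper: read the soft RLIMIT_NOFILE value, the one
 * reported by "ulimit -n". Returns 0 on success, -1 on failure. */
int get_nofile_soft_limit(rlim_t *out)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    *out = rl.rlim_cur;
    return 0;
}

/* Hypothetical helper: true if a process running with this soft limit
 * could receive a descriptor numbered FD_SETSIZE or higher. */
bool limit_allows_fd_beyond_fd_set(rlim_t soft)
{
    return soft == RLIM_INFINITY || soft > (rlim_t)FD_SETSIZE;
}
```

Calling get_nofile_soft_limit() and passing the result to limit_allows_fd_beyond_fd_set() indicates whether the process that runs the check is exposed to this condition. The limits of a running "had" process may differ from those of the shell.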

 

The VCS engine process, "had", can accept only a limited number of connections from clients. This limit (FD_SETSIZE) is determined by the operating system. However, the accept system call can return a file descriptor greater than the limit. In that case, "had" cannot process the file descriptor using the select system call. As a result, "had" goes into an unrecoverable loop, which consumes CPU time and prevents it from responding to ha commands.
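
The condition described above can be sketched in C. This is an illustration only, not the actual "had" source code; the function names are hypothetical. It assumes the POSIX rule that a descriptor can be placed in an fd_set only if its value is below FD_SETSIZE, and it shows one way to guard against the problem: check each descriptor returned by accept and close any that select cannot track.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

/* True if fd can be stored in an fd_set, that is 0 <= fd < FD_SETSIZE.
 * Passing any other value to FD_SET() is undefined behavior. */
bool fd_fits_in_fd_set(int fd)
{
    return fd >= 0 && fd < FD_SETSIZE;
}

/* Hypothetical helper: accept one client connection, but refuse
 * descriptors that select() cannot monitor. Returns a usable
 * descriptor, or -1 if accept failed or the descriptor was closed. */
int accept_for_select(int listen_fd)
{
    int fd = accept(listen_fd, NULL, NULL);

    if (fd < 0)
        return -1;              /* accept itself failed */

    if (!fd_fits_in_fd_set(fd)) {
        close(fd);              /* drop the client instead of looping on it */
        return -1;
    }
    return fd;
}
```

With a guard of this kind, a client that arrives when all low-numbered descriptors are in use is disconnected, while the server loop itself continues to run.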

 

Resolution

Veritas has fixed the code to ensure that "had" closes any file descriptor that is greater than FD_SETSIZE. This prevents the "had" process from going into an unrecoverable loop. The fix is available in the following patch releases.

Solaris (SPARC and x86 platforms), AIX and Linux

VRTSvcs 5.1SP1RP2 and 6.0

HP-UX 

VRTSvcs 5.1SP1RP1 and 6.0

 

Visit the SORT website to obtain the latest patch.

https://docs.infoscale.com/

 

 

Applies To

This issue is applicable to all versions of VCS and OS (Operating System) platforms.

Issue/Introduction

The CPU consumption of the VCS (Veritas Cluster Server) engine process, "had", is very high. The "had" process does not respond to any HA command. HA commands such as hastatus, halog, hares and hagrp return "Cannot connect to VCS engine", and their threads hang in the pollsys() system call.

Additional Information

ETrack: 2416758

ETrack: 2416842