VCS NICAgent keeps creating defunct processes


Article ID: 100000696


Resolution

[ Problem ]
- The VCS NICAgent keeps creating defunct processes.

[ Exact pattern of defunct processes of NICAgent ]
4 S root 2813 1 0 75 0 - 32211 stext Jan20 ? 00:39:53 /opt/VRTSvcs/bin/NIC/NICAgent -type NIC <<<<<< Memory usage: 32211
4 Z root 28755 2813 2 78 0 - 0 exit 15:35 ? 00:00:00 [monitor]
4 Z root 28756 2813 2 78 0 - 0 exit 15:35 ? 00:00:00 [monitor]
4 Z root 28757 2813 2 78 0 - 0 exit 15:35 ? 00:00:00 [monitor]
0 S root 28762 18950 0 78 0 - 16330 pipe_w 15:35 pts/1 00:00:00 grep 2813
------------------------------------------------------------------------------------------------------------------------
4 S root 2813 1 0 75 0 - 32211 stext Jan20 ? 00:39:53 /opt/VRTSvcs/bin/NIC/NICAgent -type NIC <<<<<< Memory usage: 32211
4 Z root 29599 2813 2 78 0 - 0 exit 15:36 ? 00:00:00 [monitor]
4 Z root 29600 2813 2 78 0 - 0 exit 15:36 ? 00:00:00 [monitor]
4 Z root 29601 2813 2 78 0 - 0 exit 15:36 ? 00:00:00 [monitor]
0 S root 29608 18950 0 78 0 - 16330 pipe_w 15:36 pts/1 00:00:00 grep 2813
------------------------------------------------------------------------------------------------------------------------
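The listings above show children of the NICAgent (PID 2813) in state Z. As a minimal sketch of how the same check could be scripted, assuming a Linux system with /proc mounted, the following Python reads the state and parent PID of every process and returns the defunct children of a given parent. The function names are hypothetical and are not part of VCS.

```python
import os


def proc_state_and_ppid(pid):
    """Return (state, ppid) parsed from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The command name is wrapped in parentheses and may contain spaces,
    # so split after the last ')' before reading the remaining fields.
    fields = data[data.rindex(")") + 2:].split()
    return fields[0], int(fields[1])


def defunct_children(parent_pid):
    """List PIDs in state 'Z' (defunct) whose parent is parent_pid."""
    result = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            state, ppid = proc_state_and_ppid(int(entry))
        except (OSError, ValueError, IndexError):
            continue  # process vanished or its stat file is unreadable
        if state == "Z" and ppid == parent_pid:
            result.append(int(entry))
    return sorted(result)
```

Calling `defunct_children(2813)` repeatedly on the system shown above would be expected to return a changing set of PIDs, matching the short-lived [monitor] entries in the two listings.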

[ Tracing the NICAgent with strace on Linux ]
The strace output for the NICAgent on a Linux system:
  • Parent PID:15182
  • Its child processes: 15227 --> 15228
---------------------------------------------------------------------
Process 15227 attached (waiting for parent)
Process 15227 resumed (parent 15182 ready)
[pid 15182] 0.000055 <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2aadceab9570) = 15227
......
[pid 15227] 0.000032 execve("./miiagent", ["./miiagent", "eth0.10"], [/* 38 vars */]
......
[pid 15227] 0.000024 <... execve resumed> ) = 0
......
[pid 15227] 0.000021 ioctl(3, SIOCGMIIPHY, 0xfff20ef4) = 0
[pid 15227] 0.000072 ioctl(3, SIOCGMIIREG, 0xfff20ef4) = 0
[pid 15227] 0.000058 ioctl(3, SIOCGMIIREG, 0xfff20ef4) = 0
[pid 15227] 0.000058 ioctl(3, SIOCGMIIREG, 0xfff20ef4) = 0
[pid 15227] 0.000082 fstat64(1, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
[pid 15227] 0.000044 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0xfff20060) = 0xfffffffff7f22000
[pid 15227] 0.000067 write(1, "110", 3) = 3
......
[pid 15227] 0.000026 exit_group(110) = ? <<<<<
......
[pid 15182] 0.000040 wait4(15227, [{WIFEXITED(s) & WEXITSTATUS(s) == 110}], 0, NULL) = 15227
---------------------------------------------------------------------
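The trace above shows the parent creating a child, the child writing "110" to a pipe on its standard output and exiting with status 110, and the parent collecting that status with wait4(). A minimal Python sketch of the same sequence follows. The function name is hypothetical, and the child here is only a stand-in for the exec'd ./miiagent helper; it does not perform the MII ioctl calls.

```python
import os


def run_child_and_collect(exit_code, text):
    """Fork a child that writes text to a pipe on fd 1 and exits with exit_code.

    Returns (output, exited_normally, exit_status) as seen by the parent.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: stand-in for the helper program in the trace.
        os.close(read_fd)
        os.dup2(write_fd, 1)      # make the pipe the child's stdout
        os.write(1, text)
        os._exit(exit_code)       # comparable to exit_group(110) in the trace

    # Parent: read the child's output until the pipe is closed.
    os.close(write_fd)
    output = b""
    while True:
        chunk = os.read(read_fd, 4096)
        if not chunk:
            break
        output += chunk
    os.close(read_fd)

    # Blocking wait, comparable to the wait4() call in the trace.
    _, status = os.waitpid(pid, 0)
    exited = os.WIFEXITED(status)
    code = os.WEXITSTATUS(status) if exited else None
    return output, exited, code
```

With `run_child_and_collect(110, b"110")` the parent sees the same combination as in the trace: the text "110" on the pipe and a normal exit with status 110.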

[ Resolution ]
Explanation of the defunct processes:
The agent framework calls waitpid() with the WNOHANG option, which returns control to the caller immediately if the child process is still running.
Each agent service thread calls waitpid() on its child process, yields the CPU to another agent thread, and some time later is scheduled again and calls waitpid() on its child process once more.
If the child process exits during this short interval, it remains a defunct process until the parent thread is scheduled again and calls waitpid() for that child process.
Engineering verified, by writing two small test programs, that this is what causes the entry point processes to show up as defunct before they disappear. The first test program creates a child process and waits on it without the WNOHANG option.
In this case, no defunct process is observed. The second test program creates a child process and waits on it with the WNOHANG option. In this case, the child process becomes defunct when it exits and stays defunct until the parent waits on it again.
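The two test programs themselves are not included in this article. The following Python sketch only illustrates the same comparison, assuming a Linux system with /proc mounted; all function names, lifetimes, and poll intervals are hypothetical.

```python
import os
import time


def process_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if the PID is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except OSError:
        return None
    return data[data.rindex(")") + 2]


def spawn_child(lifetime):
    """Fork a child that sleeps for lifetime seconds and then exits."""
    pid = os.fork()
    if pid == 0:
        time.sleep(lifetime)
        os._exit(0)
    return pid


def blocking_wait_demo(lifetime=0.1):
    """Wait without WNOHANG: the child is reaped as soon as it exits."""
    pid = spawn_child(lifetime)
    os.waitpid(pid, 0)
    return process_state(pid)  # None, because the child has been reaped


def polling_wait_demo(lifetime=0.1, poll_interval=0.5):
    """Wait with WNOHANG: the child is defunct between two polls."""
    pid = spawn_child(lifetime)
    first = os.waitpid(pid, os.WNOHANG)   # (0, 0): child is still running
    time.sleep(poll_interval)             # parent is busy; child exits meanwhile
    between = process_state(pid)          # 'Z': exited but not yet reaped
    reaped_pid, _ = os.waitpid(pid, os.WNOHANG)  # second poll reaps the child
    after = process_state(pid)            # None: the defunct entry is gone
    return first, between, reaped_pid == pid, after
```

The sketch relies on the poll interval being longer than the child's lifetime, which mirrors the situation described above: the child exits while the parent thread is not waiting on it.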
The way an agent's threads wait on the entry point processes has not changed for a long time. The same defunct processes were reproduced even with VCS 2.2 on Linux. The defunct processes would disappear only if the agent threads waited without the WNOHANG option. Without WNOHANG, however, a parent's execution is suspended when it waits on its child process(es) until they exit. The agent framework cannot wait this way, because the entire agent process must not be suspended while one service thread waits on its child process. Since the agent service threads eventually wait on their child entry point processes, the defunct processes are guaranteed to go away.
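To illustrate why polling with WNOHANG still cleans up every child, here is a minimal Python sketch of a non-blocking reap loop over several children. It is not the agent framework's code; the function name, the child lifetimes, and the sleep that stands in for other work are all assumptions.

```python
import os
import time


def poll_until_reaped(child_lifetimes, poll_interval=0.05, timeout=5.0):
    """Start one child per lifetime, then poll them all with WNOHANG.

    Returns (exit_codes, rounds): a dict mapping each reaped child PID to its
    exit status, and the number of poll rounds that were needed.
    """
    pending = set()
    for index, lifetime in enumerate(child_lifetimes):
        pid = os.fork()
        if pid == 0:
            time.sleep(lifetime)
            os._exit(index)      # each child exits with its own index
        pending.add(pid)

    exit_codes = {}
    rounds = 0
    deadline = time.monotonic() + timeout
    while pending and time.monotonic() < deadline:
        rounds += 1
        for pid in list(pending):
            # Never blocks: returns (0, 0) while the child is still running.
            reaped, status = os.waitpid(pid, os.WNOHANG)
            if reaped == pid:
                exit_codes[pid] = os.WEXITSTATUS(status)
                pending.discard(pid)
        time.sleep(poll_interval)  # stand-in for the parent's other work
    return exit_codes, rounds
```

A child that exits between two rounds is defunct only until the next round reaches it, so a slow child never delays the reaping of the others.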
The other platforms were checked as well, and agent entry point processes also become defunct before they disappear on AIX and Solaris. This behavior has not been confirmed on HP-UX, but it is expected to be the same as on the other UNIX platforms.
Therefore, the problem described in this incident is classified as expected behavior.



Additional Information

ETrack: 1984679 ETrack: 252281