[ Problem ]
- The VCS NICAgent keeps creating defunct processes.
[ Exact pattern of defunct processes of NICAgent ]
4 S root 2813 1 0 75 0 - 32211 stext Jan20 ? 00:39:53 /opt/VRTSvcs/bin/NIC/NICAgent -type NIC <<<<<< Memory usage: 32211
4 Z root 28755 2813 2 78 0 - 0 exit 15:35 ? 00:00:00 [monitor]
4 Z root 28756 2813 2 78 0 - 0 exit 15:35 ? 00:00:00 [monitor]
4 Z root 28757 2813 2 78 0 - 0 exit 15:35 ? 00:00:00 [monitor]
0 S root 28762 18950 0 78 0 - 16330 pipe_w 15:35 pts/1 00:00:00 grep 2813
------------------------------------------------------------------------------------------------------------------------
4 S root 2813 1 0 75 0 - 32211 stext Jan20 ? 00:39:53 /opt/VRTSvcs/bin/NIC/NICAgent -type NIC <<<<<< Memory usage: 32211
4 Z root 29599 2813 2 78 0 - 0 exit 15:36 ? 00:00:00 [monitor]
4 Z root 29600 2813 2 78 0 - 0 exit 15:36 ? 00:00:00 [monitor]
4 Z root 29601 2813 2 78 0 - 0 exit 15:36 ? 00:00:00 [monitor]
0 S root 29608 18950 0 78 0 - 16330 pipe_w 15:36 pts/1 00:00:00 grep 2813
------------------------------------------------------------------------------------------------------------------------
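The same check can be made without ps. The following is a minimal sketch, assuming a Linux system with /proc mounted; the helper name zombie_children is hypothetical and is not part of VCS. It scans /proc/<pid>/stat for processes in state Z whose parent is the given PID.

```python
import os


def zombie_children(parent_pid):
    """Return the PIDs of defunct (state Z) processes whose parent is parent_pid."""
    zombies = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(f"/proc/{entry}/stat") as f:
                data = f.read()
        except OSError:
            continue  # process vanished between listdir() and open()
        # The command name is parenthesised and may contain spaces, so split
        # on the last ')'. The fields that follow are: state, ppid, ...
        fields = data.rpartition(")")[2].split()
        if len(fields) < 2:
            continue
        if fields[0] == "Z" and int(fields[1]) == parent_pid:
            zombies.append(int(entry))
    return sorted(zombies)
```

For the listings above, zombie_children(2813) would be expected to return the PIDs of whichever [monitor] entries are defunct at that instant.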
[ Tracking down the threads by strace on Linux ]
The strace output for NICAgent on the Linux system:
- Parent PID: 15182
- Its child processes: 15227 --> 15228
---------------------------------------------------------------------
Process 15227 attached (waiting for parent)
Process 15227 resumed (parent 15182 ready)
[pid 15182] 0.000055 <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2aadceab9570) = 15227
......
[pid 15227] 0.000032 execve("./miiagent", ["./miiagent", "eth0.10"], [/* 38 vars */]
......
[pid 15227] 0.000024 <... execve resumed> ) = 0
......
[pid 15227] 0.000021 ioctl(3, SIOCGMIIPHY, 0xfff20ef4) = 0
[pid 15227] 0.000072 ioctl(3, SIOCGMIIREG, 0xfff20ef4) = 0
[pid 15227] 0.000058 ioctl(3, SIOCGMIIREG, 0xfff20ef4) = 0
[pid 15227] 0.000058 ioctl(3, SIOCGMIIREG, 0xfff20ef4) = 0
[pid 15227] 0.000082 fstat64(1, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
[pid 15227] 0.000044 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0xfff20060) = 0xfffffffff7f22000
[pid 15227] 0.000067 write(1, "110", 3) = 3
......
[pid 15227] 0.000026 exit_group(110) = ? <<<<<
......
[pid 15182] 0.000040 wait4(15227, [{WIFEXITED(s) & WEXITSTATUS(s) == 110}], 0, NULL) = 15227
---------------------------------------------------------------------
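The final wait4 line in the trace shows the parent collecting the child's exit status (110). As a minimal sketch of that last step, assuming a Unix system and Python 3, the hypothetical helper run_and_decode below forks a child that exits with a given code, reaps it with wait4, and decodes the status with the same WIFEXITED / WEXITSTATUS macros that strace prints. It does not reproduce miiagent itself.

```python
import os


def run_and_decode(exit_code):
    """Fork a child that exits with exit_code, reap it, and decode the status."""
    pid = os.fork()
    if pid == 0:
        # Child: terminate immediately with the requested exit code.
        os._exit(exit_code)
    # Parent: blocking wait4 on that specific child (options=0).
    reaped_pid, status, _rusage = os.wait4(pid, 0)
    return reaped_pid == pid, os.WIFEXITED(status), os.WEXITSTATUS(status)
```

Calling run_and_decode(110) mirrors the exit_group(110) / wait4 pair in the trace: the child is reaped and its exit code is recovered by the parent.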
[ Resolution ]
Explanation of the defunct processes.
The problem arises because the agent framework calls waitpid with the WNOHANG
option, which returns control to the parent immediately if the child process is
still running.
Each agent service thread calls waitpid on its child process and then yields to
another agent thread. After a while it competes for the CPU again and calls
waitpid on its child process once more.
If the child process exits during this short interval, it remains a defunct
process until the parent thread gets CPU time and calls waitpid for that child
process again.
Engineering verified that this is what causes the entry point processes to show
up as defunct processes before they disappear, by writing two small test
programs. The first test program creates a child process and waits on it without
the WNOHANG option. In this case, no defunct process is observed.
The second test program creates a child process and waits on it with the WNOHANG
option. In this case, the child process becomes a defunct process when it exits
and stays defunct until the parent waits on it again.
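The two test programs themselves are not included in this note. The following is a minimal sketch of the same comparison, assuming Linux with /proc mounted; all function names and the sleep intervals are hypothetical choices for illustration. The sleep in the parent stands in for the time an agent thread spends yielded to other threads.

```python
import os
import time


def proc_state(pid):
    """Return the one-letter state of pid from /proc, or None if it is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # The state is the first field after the parenthesised command name.
    return data.rpartition(")")[2].split()[0]


def spawn_child(lifetime):
    """Fork a child that sleeps for `lifetime` seconds and then exits."""
    pid = os.fork()
    if pid == 0:
        time.sleep(lifetime)
        os._exit(0)
    return pid


def blocking_wait_demo():
    """Wait without WNOHANG: the child is reaped as soon as it exits."""
    pid = spawn_child(0.3)
    os.waitpid(pid, 0)          # parent is suspended until the child exits
    return proc_state(pid)      # no process entry remains


def wnohang_wait_demo():
    """Wait with WNOHANG: the child is defunct between exit and the next wait."""
    pid = spawn_child(0.3)
    first = os.waitpid(pid, os.WNOHANG)    # child still running: returns (0, 0)
    time.sleep(1.0)                        # parent busy elsewhere; child exits
    state_between = proc_state(pid)        # child has exited but is not reaped
    second = os.waitpid(pid, os.WNOHANG)   # this call reaps the child
    state_after = proc_state(pid)          # entry is gone after reaping
    return first, state_between, second[0] == pid, state_after
```

In the WNOHANG case the state observed between the two waitpid calls is Z (defunct), and the entry disappears once the second call reaps the child, which matches the behavior described above.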
The method by which an agent's threads wait on the entry point processes has not
changed for some time. The same defunct-process behavior was reproduced even in
VCS 2.2 on Linux. It would go away only if the agent threads waited without the
WNOHANG option. However, without WNOHANG, a parent process's execution is
suspended when it waits on its child process(es) until they exit. This cannot be
used in the agent framework, because the entire agent process must not be
suspended while one service thread is waiting on its child process.
Since the agent service threads will eventually wait on the child entry point
processes, the defunct processes are guaranteed to go away.
The other platforms were checked to see whether the agent entry point processes
become defunct processes before they disappear, and this is indeed the case on
AIX and Solaris as well. The behavior has not been confirmed on HP-UX, but it is
expected to be the same as on the other UNIX platforms.
Therefore, the problem described in this incident can be classified as expected
behavior.