Wednesday, June 13, 2007

Zombies due to pipes in system() function call

Today I solved an interesting problem. One of my fellow developers used the system() function in his code to run a command. The code looks like this:

    while (condition) {
        if (system (...) == 0)
            dosomething (...);
        sleep (...);
    }
When we ran the application, I observed that the system was crawling. I checked the IO utilization and found it was normal. I checked the CPU utilization using top, and that too was normal. When I ran ps, I found a large number of defunct (zombie) processes in the system.

I grabbed a cup of coffee and dug into what could have caused so many defunct processes. There was only one place that I suspected could have caused them: the piece of code given above. So I wondered what was wrong with the argument to the system () call. It goes something like this:
system ("head -1 input.txt | grep pattern")
I rewrote the command above the way system () would execute it (through /bin/sh -c), and ran it through truss to find out whether all the forked processes are reaped using wait () or waitid () calls. The following is the truss output (note the -f argument to truss, which is very important because it makes truss follow the forked children):
$ truss -f /bin/sh -c "head -1 input.txt | grep pattern" 2>&1 | egrep  "fork|exec|wait"
80: execve("/usr/bin/sh", 0xFFBFFB6C, 0xFFBFFB7C) argc = 3
80: fork() = 81
81: fork() (returning as child ...) = 80
81: fork() = 83
83: fork() (returning as child ...) = 81
81: execve("/usr/bin/grep", 0x0003A498, 0x0003A588) argc = 2
83: execve("/usr/bin/head", 0x0003A4B4, 0x0003A5A8) argc = 3
80: waitid(P_PID, 81, 0xFFBFF8D0, WEXITED|WTRAPPED|WNOWAIT) = 0
80: waitid(P_PID, 81, 0xFFBFF8D0, WEXITED|WTRAPPED) = 0
There are three processes: the shell (pid 80), the grep process (pid 81) and the head process (pid 83). But only one process is ever waited for: both waitid () calls target pid 81, the grep process. The first call, with WNOWAIT, only looks at its status, and the second one reaps it. The head process was forked by pid 81 before that process exec'd grep, so it is not a child of the shell, and grep never waits for it. Once head exits, it is left as a zombie. The moral of the story is:
If you have a long pipe of processes, remember that, at least with this shell, only the last process of the pipe will be reaped by the shell using waitid (). The rest of the processes will become defunct and will be reaped by the init process soon afterwards.
But if the rate at which the defuncts are created is high (as in the while loop given above), then your system is bound to experience a slowdown.
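
The two waitid () calls on pid 81 in the trace differ only in the WNOWAIT flag. The following is a minimal sketch of what that flag does, assuming a POSIX system that provides waitid (). The function name peek_then_reap is hypothetical and is not part of the original application. It forks a child that exits at once, looks at it with WNOWAIT (which leaves it defunct), reaps it with a second call, and then checks that the child is really gone.

```c
#define _XOPEN_SOURCE 700   /* ask for waitid() and siginfo_t under strict compilers */
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo helper (not from the original code).
 * Returns 1 if the child was still waitable after a WNOWAIT call and
 * gone after the real reap, 0 if it was unexpectedly still there,
 * and -1 if fork() or waitid() failed.
 */
int peek_then_reap(void)
{
    siginfo_t info;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(0);            /* child: exit at once and become a zombie */

    /* Blocks until the child has exited, but leaves it as a zombie. */
    if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT) != 0)
        return -1;

    /* The zombie is still there, so this call succeeds and reaps it. */
    if (waitid(P_PID, (id_t)pid, &info, WEXITED) != 0)
        return -1;

    /* Nothing is left to wait for: waitid() now fails with ECHILD. */
    if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOHANG) == -1 &&
        errno == ECHILD)
        return 1;

    return 0;
}
```

A process whose parent never makes the second kind of call stays defunct until that parent exits, which is the situation the head process is in above.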

(The actual code is not as trivial as using a head and grep alone!)
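
For a pipeline that really is this simple, one way to avoid the extra processes altogether is to do the check inside the application instead of calling system (). The following is only a sketch under stated assumptions: the function name first_line_contains is hypothetical, the pattern is treated as a fixed substring (grep would treat it as a regular expression), and a first line longer than the buffer is only checked up to the buffer size.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not from the original code): a rough stand-in
 * for "head -1 FILE | grep PATTERN" with a fixed-string pattern.
 * Returns 1 if the first line of the file contains the pattern,
 * 0 if it does not (or the file is empty), and -1 if the file
 * cannot be opened.
 */
int first_line_contains(const char *path, const char *pattern)
{
    char line[4096];
    char *got;
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return -1;

    /* fgets() stops after the first newline or when the buffer is full. */
    got = fgets(line, sizeof line, fp);
    fclose(fp);

    if (got == NULL)
        return 0;            /* empty file: nothing to match */

    return strstr(line, pattern) != NULL;
}
```

Note that the return convention is the reverse of the system () check in the loop: system () yields 0 when grep finds a match, while this helper returns 1.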
