Wednesday, October 01, 2008

Strace and defunct processes

We have a web application running in Tomcat. Today it started throwing errors during start-up, and I wanted to figure out what the application reads at that point. So I started the whole Tomcat under strace and analyzed the strace output. Once I had figured out what was happening, I tried to stop strace with Ctrl-C, but it wouldn't stop, so I killed it with "kill -9". That killed strace, but the Java process that was being traced went into a defunct state. This is what I saw when I ran "ps aux":
9929 1932 0.0 0.0 0 0 pts/0 Z 16:53 0:03 [java <defunct>]
Hmm ... Not good! I tried to kill this process with "kill -9", but that didn't work. When I tried to start another instance of Tomcat, it complained that port 8989, where it was supposed to listen, was already in use. But "netstat -pan" didn't show any process listening on that port. A "telnet localhost 8989" could still connect, yet "/usr/sbin/lsof -n -P | grep 8989" showed no process listening on that port either. Weird!

I Googled for defunct processes, but didn't find anything that worked.

Upon further investigation, the issue became clear. When I ran "ps auxm", I realized that there were other threads under this PID, and all of them were in the "Traced" state. See below (I have modified the output to fit the screen):
9929 1937 0.0 2.3 594924 48764 pts/0 T 16:53 0:00 java ...
... (10 such threads) ...
9929 1932 0.0 0.0 0 0 pts/0 Z 16:53 0:03 [java <defunct>]
The most important column is the state column, which shows "T" for those threads. When strace was killed, it left the traced process and all of its threads in the "Traced" state. I observed that a process that is defunct and still being traced cannot be killed with "kill -9". The only way I found to bring the process out of this state was to send it a SIGCONT signal. So I tried "kill -CONT". Voila! All those threads were reaped by the init process within seconds.

So next time you see a defunct process, check whether it has any threads in the "T" state. If it does, try sending a SIGCONT signal to one of those threads. That should bring the process out of that state, and init will reap it immediately.
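
The check described above could be sketched as a small shell helper. This is only a sketch under a few assumptions: a procps-style ps that supports the -L option and the lwp and stat output fields, and a kernel where "kill" accepts a thread ID. The function names stopped_threads and resume_stopped are made up for illustration, and older kernels such as 2.4.21 may behave differently.

```shell
#!/bin/sh
# Sketch only: stopped_threads and resume_stopped are hypothetical helper names.

# Read "<thread-id> <state>" lines on stdin and print the IDs whose state
# starts with T (stopped or traced) or t (tracing stop on newer kernels).
stopped_threads() {
    awk '$2 ~ /^[Tt]/ { print $1 }'
}

# List the threads of process $1 and send SIGCONT to each stopped one.
# Assumes a ps that supports -L and the lwp/stat output fields.
resume_stopped() {
    ps -L -p "$1" -o lwp= -o stat= | stopped_threads |
    while read -r tid; do
        kill -CONT "$tid"
    done
}
```

Once these functions are loaded into a shell, running "resume_stopped 1234" would send SIGCONT to every stopped thread of the process, where 1234 stands in for the PID of the defunct process. The filter is kept separate from the "kill" step so the list of thread IDs can be inspected before anything is signalled.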

This happened on kernel version 2.4.21. I don't know whether it has been addressed in later kernel versions.