Sunday, June 24, 2007

Comparison of PTHREAD_PROCESS_SHARED in Soaris and FreeBSD

Let us begin with the sample code below (headers omitted for brevity):
int main(int argc, const char* argv[]) {
void* mmap_ptr = mmap (NULL, sizeof (pthread_mutex_t),
if (mmap_ptr == MAP_FAILED) {
perror ("mmap failed");
return -1;

fprintf (stderr, "mmaped at: %x\n", mmap_ptr);

pthread_mutex_t* mutp = (pthread_mutex_t*)mmap_ptr;

// initialize the attribute
pthread_mutexattr_t attr;
pthread_mutexattr_init (&attr);
pthread_mutexattr_setpshared (&attr, PTHREAD_PROCESS_SHARED); // this is what we're testing

// initialize the mutex
pthread_mutex_init (mutp, &attr);
pthread_mutexattr_destroy (&attr);

// acquire the lock before fork
pthread_mutex_lock (mutp);

pid_t chld = fork ();
if (chld != 0) { // parent
fprintf (stderr, "parent: going to sleep...\n");
sleep (30);
fprintf (stderr, "parent: unlocking.\n");
pthread_mutex_unlock (mutp);
} else { // child
fprintf (stderr, "child: going to acquire the lock ...\n");
pthread_mutex_lock (mutp);
fprintf (stderr, "child: acquired the lock.\n");
sleep (30);
pthread_mutex_unlock (mutp);
return 0;

The above code makes use of mutexes to synchronize between two processes. It first acquires an anonymous memory mapped region, which is marked for sharing between processes. Then it initializes a mutex in that region with the attribute PHREAD_PROCESS_SHARED. Before forking, the mutex is locked by the parent. After fork, the child process too attempts to lock the mutex.

The expected result is that the child should block until the parent unlocks the mutex.

I tried this piece of code in two Unix flavours: Solaris 9 and FreeBSD 6.2. Only the Solaris implementation seemed to give the expected result.

In FreeBSD implementation, the types pthread_mutex_t, pthread_cond_t and pthread_rwlock_t are all typedefed as pointers. Hence whenever a variable of one of these types is declared, only a pointer is declared. When the respective *_init function is called, the actual memory is allocated. The thread/process calling init functions does not have any control over from where the memory is acquired (a memory mapped region, shared memory or heap). Also, it might lead to memory leaks (but how many times in a program a thread, attribute, mutex, etc are initialized :-).

Contrarily, Soaris has these types as structures. Hence it is easy to share a mutex between two processes by having the mutex in a mmaped region or shared memory.

To compile the above program in Solaris, give "CC -mt ". In FreeBSD, give "CC -pthread -lpthread".

Saturday, June 16, 2007

Infnite loop while doing ++map.begin() on empty map

Look at this seemingly simple program:
int main(int argc, const char* argv[]) {
map mymap;
map::iterator it = mymap.begin();
return 0;
What do you think would happen when I compile and run this program? Infinite loop! I tried this program in two different versions of compilers: Sun Forte 6 suite on Solaris and gcc 3.4.6 on FreeBSD.

In most of the library distributions, maps and sets are implemented using red-black trees. The iterator above seem to have three links: parent, left and right. For some strange reason, when the map is empty, the iterator returned by begin() (and end() too) has parent as NULL and left and right to be pointing to itself!

(gdb) print it
$1 = {_M_node = 0xbfbfecd4}
(gdb) print *it._M_node
$2 = {_M_color = std::_S_red, _M_parent = 0x0, _M_left = 0xbfbfecd4, _M_right = 0xbfbfecd4}
You can have a look at the _M_increment function to know why this results in an infinite loop.

Now the history. One of our programs running in test region behaved very weird. What should have been processed in few mins wasn't processed even after an hour. So when I attached a debugger to the program and analyzed, I figured this was the issue. I accept that the program had a logical error that it didn't check for empty container. I think getting into an infinite loop is a big punishment. A core dump at least would have given a hint that something went wrong.

In his book "Writing Solid Code," Steve Maguire tells that code should be written such a way that every bug is forced to surface. I guess that's what is missing in this piece of library function.

Wednesday, June 13, 2007

Zombies due to pipes in system() function call

Today I solved an interesting problem. One of my fellow developers used system() function in his code to run some command. The code looks like:

while (condition) {
if(system (...) == 0)
dosomething (...);
sleep (...);
When we ran the application, I observed that the system was crawling. I verified the IO utilization and found it was normal. I checked the CPU utilization using top and that too was normal. When I did a ps, I found that there were too many defunct processes in the system.

I grabbed a cup of coffee and dug what could have caused so many defunct processes. There was only one place, which I suspected, could have caused the defuncts. That piece of code is given above. So I thought what was wrong with the argument to the system () command. It goes something like this:
system ("head -1 input.txt | grep pattern")
I modified the command above as it would be executed in system (), and run it through truss to find out if all the forked processes are reaped using wait () or waitid () calls. The following is the truss output (note the -f argument to truss, which is very important):
$ truss -f /bin/sh -c "head -1 input.txt | grep pattern" 2>&1 | egrep  "fork|exec|wait"
80: execve("/usr/bin/sh", 0xFFBFFB6C, 0xFFBFFB7C) argc = 3
80: fork() = 81
81: fork() (returning as child ...) = 80
81: fork() = 83
83: fork() (returning as child ...) = 81
81: execve("/usr/bin/grep", 0x0003A498, 0x0003A588) argc = 2
83: execve("/usr/bin/head", 0x0003A4B4, 0x0003A5A8) argc = 3
80: waitid(P_PID, 81, 0xFFBFF8D0, WEXITED|WTRAPPED) = 0
There are three processes: the shell (pid 80), the grep process (pid 81) and the head process (pid 83). But we find only one waitid () call, which is for grep process. The head process, being first in the pipe, is left to become a zombie. The moral of the story is:
If you have a long pipe of processes, remember that only the last process of the pipe will be reaped using waitid () by the shell. Rest of the processes will become defuncts and reaped by the init process soon.
But if the rate at which the defuncts are created is high (like the while loop given above), then your system is bound to experience a slow down.

(The actual code is not as trivial as using a head and grep alone!)

Monday, June 11, 2007

How to free memory held by a container?

I have a test program like this:
int main() {
string large_str;
for (int i = 1; i <= 1000; ++i) {
string slice(100*i, 'X');
large_str += slice;
large_str.clear ();
printf ("size: %-5d, capacity: %-5d\n", large_str.size(), large_str.capacity());
The last line of the output is:
 size: 0, capacity: 131043  
The question is:
It is very obvious that the string container still holds memory that it allocated for the longest string it contained. How to deallocate this memory, without destructing the object?
Thanks to James Kanze who posted an answer in this usenet thread, here is an elegant solution for this problem.

template <typename Container>
void reset( Container& c ) {
Container().swap( c ) ;

So when you have to free the memory, just call reset(large_str).

Saturday, June 09, 2007

A script to monitor IO activities

This follows my discussion posted earlier regarding iostat. The script given below might be helpful in monitoring a device that has the given directory in it.


# This script will print IO activity for the partition given in the argument.
# If no partition is given, it will print IO activity for the partition
# that contains the current directory.

if [ -z "$1" ] ; then

while [ "$DIR" != "/" -a "$DIR" != "." ] ; do
MDEV=`mount -p | grep $DIR | nawk '{print $1;}'`
if [ ! -z "$MDEV" ] ; then
MDEV=`basename $MDEV`
DIR=`dirname $DIR`

if [ -z "$MDEV" ] ; then
echo "Unable to find out the mounted device for $ORIG_DIR."
exit 1

echo "Mounted device for $ORIG_DIR is $MDEV"

iostat -x -p -n -z 5 | egrep "$MDEV|device"

This script was tested in SunOS 5.9 running ksh.

Friday, June 08, 2007

Using iostat for monitoring disk activities

There could possibly be a lot of reasons for application slow down. Identifying the cause for the slow down could be a bit tricky. iostat is a tool that helps in monitoring I/O activities in the system, which might have been caused your application slowdown.

iostat helps to monitor I/O activity on a per-disk and per-partition basis. There are a number of options that might suite your particular need. But I find the ones below to be good enough for my needs:
iostat -x -n -p -z 5 10

-x : Show me extended statistics for each disk.
-n : Don't show cryptic names of devices, if possible show readable names.
-p : Show per device statistics and per partition statistics.
-z : Don't show me the rows that have all zeros in them.

Let us take a sample output and explore.

extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.8 0.0 10.8 0.0 0.0 0.0 0.6 0 0 c2t7d3s6

The following is what the 'man iostat' has to say regarding the columns above:

name of the disk

r/s reads per second

w/s writes per second

Kr/s kilobytes read per second

Kw/s kilobytes written per second

wait average number of transactions waiting for service
(queue length)

actv average number of transactions actively being serviced
(removed from the queue but not yet completed)

svc_t average service time, in milliseconds

%w percent of time there are transactions waiting for
service (queue non-empty)

%b percent of time the disk is busy (transactions in pro-

wsvc_t is the average time spent in the wait queue and asvc_t is the average time spent being serviced.

There are a couple of things that are important to us:
  • If your application is performing too many random reads/writes, you will find that the first four columns will have high values. (What is high is dependent on your system! There is no universal number.)
  • As a result, you will find that the wsvc_t and asvc_t to be high too.
Here comes the tricky part: how will you know if these numbers go high, it is due to your application? To a reasonable extent, you could find out.

First, make sure that you are looking at the right device/partition where your application is doing reads/writes. You could use mount, and find out the device which is having the directory you are interested in.

Second, as much as possible you should try to isolate the numbers on a per partition basis, rather than on a per deice basis. Per device statistics are aggregations over all the partitions under them. For e.g. monitor c2t7d3s6 instead of c2t7d3, as you will get a slightly more accurate picture.

The following are some sample outputs of iostat that would help you to do a comparison.

extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
27.8 181.6 497.8 4748.0 124.0 211.6 592.0 1010.5 72 100 c2t7d3s6

extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
216.2 167.0 5534.2 5363.0 0.3 88.7 0.8 231.4 2 100 c2t7d3s6

extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.2 0.0 0.2 0.0 0.0 0.0 11.1 0 0 c2t7d3s6

extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 2.8 0.0 1.5 0.0 0.0 0.0 8.6 0 0 c2t7d3s6

The first two snap shots were taken when the system was having heavy IO activity, and the latter two were taken when there was not much of an IO activity. Compare the wait times and bytes read/written. It is interesting to ask the following questions:
  1. Why is the waiting time in the queue is much lesser than the time being serviced? And what would the reverse of this case mean?
  2. Let us assume that the device can serve 2N fixed sized IO requests in the given time interval. Consider two applications making N requests each. How different this would be from one application making 2N requests? Using which parameter you could distinguish these two cases?
Okay. Once you have identified that your application is choking IO, what is the next step?

One strategy that you could follow to make your application better is to keep your design such a way that the read/writes are sequential instead of random. This might not be possible always. At least, try to reduce the spread of random read/writes by using strategies like always using the lower addressed blocks. I have encountered applications that use linked list to keep track of free memory blocks on the disk, where I changed the the linked list to a min-heap that proved to be improving performance. But using a min-heap too has its downsides :-)

Thursday, June 07, 2007

An interesting question on Fibonacci series

I was thinking about an interesting question regarding the Fibonacci series. Here it is.
Assume that we draw the Fibonacci series of n as an inverted tree, where each node has its two additive terms as child nodes. For e.g. root node F(n) has F(n-1) and F(n-2) as children, and so on. Give an expression for the number of times node i occurs in the entire tree, where 1 <= i <= n.
Try to solve this problem. There is an interesting pattern to observe here.