Wednesday, March 11, 2015

Bit twiddling in the JDK to generate random numbers

Some time back a friend asked me an algorithm question:
Given an integer random number generator randomN() that can generate random numbers in the range [0, N), how will you generate random numbers in the range [0, M), where M < N?
I used modulo arithmetic to generate the desired random numbers in the range [0, M). And I reasoned out that if N = q*M + r, every number in the range [0, r) has an occurrence probability of (q+1)/N, but the numbers in the range [r, M) have an occurrence probability of only q/N. For example, with N = 10 and M = 3 (so q = 3, r = 1), the value 0 is produced by 4 of the 10 inputs (0, 3, 6, 9), while 1 and 2 are each produced by only 3.

It is easy to visualize this. Picture a line of N units divided into segments of M units each. When N is not exactly divisible by M, the last segment is only r units long.

So if we choose a random integer along that line and then take it modulo M, every value in the range [0, r) will occur q+1 times, once per segment including the short last one. But the values in the range [r, M) will occur only q times, because the last segment is too short to reach them.

So how do we generate uniform random numbers? The solution is very simple: when the generated random number falls in that last partial segment, just discard it and generate another. My solution looked like this:

maxAllowed = N - (N mod M)
do {
    randValue = randomN()
} while (randValue >= maxAllowed)
return randValue mod M
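The same idea can be made runnable in Java. Here randomN() is just a stand-in for the given generator, simulated with java.util.Random:

```java
import java.util.Random;

class BoundedRandom {
    private static final Random RNG = new Random();

    // Stand-in for the given generator: uniform over [0, N).
    static int randomN(int n) {
        return RNG.nextInt(n);
    }

    // Uniform over [0, M), built on top of randomN() over [0, N).
    static int randomM(int n, int m) {
        int maxAllowed = n - (n % m);       // q*M: end of the last full segment
        int randValue;
        do {
            randValue = randomN(n);
        } while (randValue >= maxAllowed);  // discard the partial segment
        return randValue % m;
    }
}
```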

Recently I was reading the Java API documentation for Random.nextInt(M), and I stumbled upon a clever trick that does what I did using some bit twiddling. The next(numBits) method in the Java code generates a random number that is numBits long. Since a Java int is a 32-bit signed value with 31 bits of magnitude, nextInt(M) internally calls next(31). This returns values in the range [0, 2^31), that is, [0, Integer.MAX_VALUE + 1). Using this, the code generates a random number in the range [0, M). What is the clever trick here? Look at how integer overflow is used to detect the boundary condition.
do {
    bits = next(31);
    val = bits % n;
} while (bits - val + (n - 1) < 0);

This boundary condition is equivalent to the "randValue >= maxAllowed" condition in my code. Since bits - val is the largest multiple of n not exceeding bits, adding n - 1 to it overflows a 32-bit int exactly when bits falls in the final, partial segment.
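A quick sanity check of that equivalence, with helper names of my own (not from the JDK):

```java
class BoundaryCheck {
    private static final long N = 1L << 31;   // next(31) yields values in [0, 2^31)

    // The JDK's overflow-based rejection test.
    static boolean jdkRejects(int bits, int n) {
        int val = bits % n;
        return bits - val + (n - 1) < 0;      // negative only on int overflow
    }

    // The explicit comparison from my pseudocode, done in 64 bits.
    static boolean simpleRejects(int bits, int n) {
        long maxAllowed = N - (N % n);
        return bits >= maxAllowed;
    }
}
```

Both methods reject exactly the same values, which is easiest to see by probing around the maxAllowed boundary.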

Personally I feel that my code is easier to understand and (I guess) performs better. The reason I guess mine performs better is that it does modulo arithmetic exactly two times, while the JDK's code performs modulo arithmetic on every iteration of the loop. Also, the JDK's check involves several arithmetic operations, while my boundary condition is a straightforward comparison.

Yet it was interesting to see this piece of code and spend some time to understand what it is actually doing.

Monday, September 08, 2014

Is accepting Optional as argument a code smell?

I was reviewing some code and came across a function that was taking an Optional as an argument. In the same class, another function was taking a couple of Optionals as arguments.

When I thought about it, I felt that taking an Optional should be avoided. You can return an Optional from a function, but avoid taking Optional arguments. Before calling a function, you should check for the presence or absence of the values that you are passing. Hence a function taking one or more Optionals is a code smell.

One argument that can possibly be presented in favor of taking Optional arguments is that it saves every caller from checking for the presence of the values. Consider the code example below:
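A sketch of such a function, with product() and the other names here being illustrative:

```java
import java.util.Optional;

class Calculator {
    // A function that accepts Optional arguments -- the pattern in question.
    // If either value is absent, there is nothing sensible to return but an
    // empty Optional (or an exception).
    static Optional<Integer> product(Optional<Integer> first, Optional<Integer> second) {
        if (first.isPresent() && second.isPresent()) {
            return Optional.of(first.get() * second.get());
        }
        return Optional.empty();
    }
}
```

When the caller gets back an empty Optional, it cannot tell whether first or second was the missing one without re-checking both arguments after the call.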

If you pay attention, the error handling from the caller's point of view is ugly if the caller wants to report the correct error. Much worse, the error check is done after using the values.

A side effect of product() taking Optionals is that it must return an Optional, or else throw an exception. Both can be avoided if it doesn't take Optional arguments.

A simplified version will be:
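Along the same lines (again with illustrative names), the function simply takes the values and leaves the presence checks to the caller:

```java
class SimpleCalculator {
    // No Optional in the signature: the caller verifies presence before
    // calling, so this function can neither receive a missing value nor
    // need to return an Optional or throw.
    static int product(int first, int second) {
        return first * second;
    }
}
```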

Tuesday, July 29, 2014

A minor annoyance with two argument logging in SLF4J and Scala

I am using Scala for one of my recent projects. For logging, I am using SLF4J/Logback. There is one minor annoyance when you try to log a two-argument message like:

logger.info("Some log with arg1 [{}] and arg2 [{}].", arg1, arg2)
While you compile with sbt you will get the following error:
[error] both method info in trait Logger of type (x$1: String, x$2: <repeated...>[Object])Unit
[error] and  method info in trait Logger of type (x$1: String, x$2: Any, x$3: Any)Unit
[error] match argument types (String,String,String)
If you are getting this error, a small trick that I did to avoid casting to AnyRef or Object was to just add a null argument at the end, which forces the Scala compiler to use the vararg version. Logback just ignores extraneous arguments. Like this:

logger.info("Some log with arg1 [{}] and arg2 [{}].", arg1, arg2, null)

Disclaimer: I am a Scala rookie, hence take my advice with a pinch of salt!

Monday, July 21, 2014

The most important keyboard shortcut you should know in IntelliJ IDEA

TLDR; If you are new to IntelliJ (like me), use "Cmd+Shift+A" to find your way around.

Recently I started developing an application in Scala. I am an avid Eclipse user, but unfortunately I felt that the support for Scala is a bit clunky as of today. It was driving me nuts to see some Java classes highlighted in red as unknown classes, even though restarting Eclipse or refreshing the project would remove those errors.

Based on my research, I felt that IntelliJ IDEA has better support for Scala, so I wanted to give it a spin. So far it feels good enough that I have decided to stick with it.

I prefer to get things done using keyboard shortcuts, touching the mouse only when absolutely essential. The most important (and in my opinion the most awesome) keyboard shortcut you should know in IntelliJ is "Cmd+Shift+A". It shows an input box where you can type a partial string, and you are presented with a list of actions and options relevant to the search term. For example, you can enter "override" and be presented with actions relevant to overriding.

For Emacs users, this is very similar to "apropos" command. For Eclipse users, this is similar to "Cmd+Shift+L", but only better.

Saturday, May 24, 2014

An absolute essential thing you should know about Hibernate cache

You might have used Hibernate for your ORM needs. The most important thing that you should know about the Hibernate cache is that it is implemented using a Map, and a reference to every entity that you have read from the database is kept in that Map. The key of the Map is the entity ID (which could be just a Long, or some form of composite primary key you have defined, wrapped in an EntityKey). The value stored is the entity itself (proxied).

If you read 1000 entities (or rows) from the database, a reference to each one of them will be kept in the Map. If those entities have relationships (one-to-many or many-to-many) with other entities, and those relationships are to be loaded eagerly, you are asking Hibernate to load a lot more than 1000 entities.

If you venture into processing a large set of rows (say 100,000), a reference to every one of those rows is going to be kept in the Map. The amount of memory needed can be so large that your process might run out of memory (as mine did).

So what is the life cycle of this Map? The Map is part of the "persistence context" that is referenced by the Hibernate session. When a session is created, the Map contains no elements. As you keep reading or inserting entities, the Map accumulates references to them. When you close the session, the Map is cleared.

If you would like to deal with a large volume of entities, given below is a strategy that you can follow:
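The strategy, in outline: read and process the rows in fixed-size batches, and flush and clear the session after each batch so the persistence-context Map never holds more than one batch of references. A plain-Java sketch of the loop; the loader stands in for a Hibernate query using setFirstResult()/setMaxResults(), and the commented line marks where session.flush() and session.clear() would go:

```java
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Consumer;

class BatchProcessor {
    // loader.apply(offset, batchSize) stands in for a Hibernate query with
    // setFirstResult(offset) and setMaxResults(batchSize).
    static <T> int processInBatches(BiFunction<Integer, Integer, List<T>> loader,
                                    int batchSize, Consumer<T> work) {
        int offset = 0;
        int processed = 0;
        while (true) {
            List<T> batch = loader.apply(offset, batchSize);
            if (batch.isEmpty()) {
                break;                      // no more rows
            }
            for (T entity : batch) {
                work.accept(entity);
                processed++;
            }
            // session.flush(); session.clear();  // drop this batch's references
            offset += batch.size();
        }
        return processed;
    }
}
```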

The disadvantage of this method is that if one or more entities are referred to across batches, they will be read again once per batch. But the main advantage is that you keep your memory utilization low.

Tuesday, April 08, 2014

Heartbleed bug in OpenSSL

There was a serious vulnerability reported in the OpenSSL library that can let an attacker dump memory contents from the server. The attacker can then perform offline analysis of those memory contents and identify sensitive information like the server's private key, key material for SSL sessions, decrypted data that is in memory, and so on. This affects any server that uses OpenSSL to implement HTTPS.

I thought I would share, in one place, some material that will help people understand the problem better.

  • Description of the problem can be found here.
  • A simple Python script to test your servers can be found here or you can use this site.
  • NVD entry for this issue can be found here.
  • How some of the companies are responding: Heroku, AWS, Lastpass.
Hope that helps.

Saturday, April 05, 2014

Two important traits of building reliable distributed systems

Designing a distributed system is hard enough; designing one that is reliable is even harder. There are many best practices you can follow to build a reliable distributed system. Based on an issue that I recently troubleshot, there are a couple that I think are critical:

  • Enabling TCP keep-alive between the processes if you are using TCP
  • Performing all IO operations with a time out
My advice is based on my experience with Linux, but I think it should be applicable to other operating systems as well. When a host goes down with a kernel panic, none of the established connections are closed by sending a FIN or RESET packet. This causes trouble because the peer process doesn't know that the other end of the communication is gone. When you enable TCP keep-alive, the kernel periodically sends a zero-length probe packet as per the configuration. If the peer has died due to a kernel panic, the probe will not be ACKed, and thus the peer's death is detected. If the peer has died and rebooted, it responds to the probe with a RESET packet. So the best way to detect peer death is to enable keep-alive.

For Linux, to enable keep-alive, please refer to this page. The most important thing is after you open the socket, you should use setsockopt() to set the SO_KEEPALIVE option.
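In Java, that setsockopt() step is exposed as Socket.setKeepAlive(); pairing it with setSoTimeout() also covers the second point above, bounding blocking reads. A minimal sketch, where host, port, and the 30-second timeout are placeholders for your own values:

```java
import java.io.IOException;
import java.net.Socket;

class ReliablePeerSocket {
    static Socket open(String host, int port) throws IOException {
        Socket socket = new Socket(host, port);
        socket.setKeepAlive(true);    // SO_KEEPALIVE: detect a dead or rebooted peer
        socket.setSoTimeout(30_000);  // every blocking read times out after 30 s
        return socket;
    }
}
```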

Using timeouts is one way to detect that service quality is degrading. For example, when you try to fetch data from an external system, it might be choking under too many requests and not be as responsive as it should be. But if you build retry mechanisms on top of timeouts, you have to be extra careful, because sending more requests to an already struggling system can make the situation worse.

Monday, July 15, 2013

A Python script to execute Python code like "perl -ne"

I wrote a small utility script that can be used to run a small snippet of Python code like "perl -ne". I find it very useful for my needs. Hope it helps you too. Any suggestions welcome.
You can find the script as a gist here.

Setting the terminal window title - version 2

I had earlier written a small snippet about setting the window title from the command line. Based on my experience, I felt that I could make it a lot simpler. Here is the new version of the function.

function wtitle {
    if [ -z "$ORIG_PS1" ] ; then
        export ORIG_PS1="$PS1"
    fi
    export PS1="\[\033]0;$1 - \u@\h:\w\007\]$ORIG_PS1"
}

Wednesday, July 03, 2013

A rarely used but very useful option in grep

I usually do a lot of log analysis. I search for patterns in multiple files in multiple hosts. Then I collate the result and do processing on the result lines. A sample output looks like:
/tmp/input1.txt: line containing pattern
/tmp/input2.txt: another line containing pattern
As it turns out, I don't need the file names in the grep result. I always used to remove the file-name prefixes using a Perl one-liner. Silly me!

I was pretty sure that this was a common problem and it must have been solved already. When I referred to the man page of grep, I came across this gem: -h. If you specify this option, grep will not prefix each line with the file name.

There is also the -H option, which prefixes each line with the file name even if you are searching only one file.

Monday, June 03, 2013

Returning to Python

After a couple of years or so, I started using Python as my main programming language for one of my projects. This time I used virtualenv to install Python with various modules, then just tarred it up and copied it to different machines. virtualenv saved me tons of time.

Tuesday, April 23, 2013

Zig-zag search

This is an interesting problem that I ran into. The problem definition goes like this.
You are given an array of integers. The array is considerably large (say 5 million elements). You are given two inputs: an index i in the array and an integer value v. Start searching the array for value v from the index i, and expand your search towards the two edges of the array. Return the index where the value v occurs closest to index i. If the value v occurs on both sides of index i at an equal distance, return the lower of the two indices. If the value v does not occur at all, return -1.
It was very interesting to solve this problem. Give it a shot, you might also like it.
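If you want to check your answer afterwards, here is my take on it in Java (zigZagSearch is my name for it): walk outward from the starting index one step at a time, checking the left side first so that ties go to the lower index.

```java
class ZigZag {
    // Search outward from index "start" for value "v"; on a distance tie,
    // the lower index wins. Returns -1 if v is absent.
    static int zigZagSearch(int[] a, int start, int v) {
        if (a[start] == v) {
            return start;
        }
        for (int d = 1; d < a.length; d++) {
            int left = start - d;
            int right = start + d;
            if (left >= 0 && a[left] == v) {
                return left;               // checked first: ties go to the lower index
            }
            if (right < a.length && a[right] == v) {
                return right;
            }
            if (left < 0 && right >= a.length) {
                break;                     // both edges passed; v is not present
            }
        }
        return -1;
    }
}
```

Each step costs O(1), so the whole search is O(N) worst case with O(1) extra space, which matters when the array has 5 million elements.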