Tuesday, April 08, 2014

Heartbleed bug in OpenSSL

There was a serious vulnerability reported in the OpenSSL library that can let the attacker to dump memory contents from the server. Thus the attacker can perform offline analysis of the memory contents and identify sensitive information like private key of the server, key material for SSL sessions, decrypted data that is in memory, etc. This affects any server that uses OpenSSL to implement HTTPS.

I thought I will share some material in one place that will be helpful for people to understand the problem better.

  • Description of the problem can be found here.
  • A simple Python script to test your servers can be found here or you can use this site.
  • NVD entry for this issue can be found here.
  • How some of the companies are responding: Heroku, AWS, Lastpass.
Hope that helps.

Saturday, April 05, 2014

Two important traits of building reliable distributed systems

Designing a distributed system is hard enough. Even harder to design a distributed system that is reliable. There are many best practices that you can follow to make a reliable distributed system. Based on an issue that I recently troublehooted, there are a couple of them that I think are critical:

  • Enabling TCP keep-alive between the processes if you are using TCP
  • Performing all IO operations with a time out
My advise is based on my experience in Linux. But I think it should be applicable to other operating systems as well. When a host goes down with a kernel panic, none of the established connections are closed by sending a FIN or RESET packet. This cauases trouble that the peer process doesn't know about the other end of the communication being gone. When you enable TCP keep-alive, the kernel sends a zero length packet as per the configuration. Hence if the peer has died due to kernel panic, the zero length packet will not be ACKed. Thus, the peer death is detected. If the peer has died and rebooted, when the zero length packet is sent, the peer responds with RESET packet. Thus the best way to detect peer death is enabling keep-alive.

For Linux, to enable keep-alive, please refer to this page. The most important thing is after you open the socket, you should use setsockopt() to set the SO_KEEPALIVE option.

Using timeouts is one way to detect that the service quality is degrading. For e.g. when you try to fetch data from an external system, the external system might be choking due to too many requests and may not be as responsive as it should be. But if you build retry mechanisms based on timeouts, you have to be extra careful. Because sending more requests to an external system might make the situation worse than it should be.

SublimeText 3/Anaconda error

When I installed Anaconda manually by downloading and untarring the file (as given in the manual installation instructions here ), I got th...