Two important traits of building reliable distributed systems
Designing a distributed system is hard enough. Even harder to design a distributed system that is reliable. There are many best practices that you can follow to make a reliable distributed system. Based on an issue that I recently troublehooted, there are a couple of them that I think are critical:
- Enabling TCP keep-alive between the processes if you are using TCP
- Performing all IO operations with a time out
My advise is based on my experience in Linux. But I think it should be applicable to other operating systems as well. When a host goes down with a kernel panic, none of the established connections are closed by sending a FIN or RESET packet. This cauases trouble that the peer process doesn't know about the other end of the communication being gone. When you enable TCP keep-alive, the kernel sends a zero length packet as per the configuration. Hence if the peer has died due to kernel panic, the zero length packet will not be ACKed. Thus, the peer death is detected. If the peer has died and rebooted, when the zero length packet is sent, the peer responds with RESET packet. Thus the best way to detect peer death is enabling keep-alive.
For Linux, to enable keep-alive, please refer to this page. The most important thing is after you open the socket, you should use setsockopt() to set the SO_KEEPALIVE option.
Using timeouts is one way to detect that the service quality is degrading. For e.g. when you try to fetch data from an external system, the external system might be choking due to too many requests and may not be as responsive as it should be. But if you build retry mechanisms based on timeouts, you have to be extra careful. Because sending more requests to an external system might make the situation worse than it should be.
Comments