Service health checks - the right way to build them
Service health checks are ubiquitous. If you have built software that relies on an upstream service, you have almost certainly used some form of health check. Your software could be a standalone program or a proxy. You may also have built and exposed a service that other software uses as its upstream.
The worst way to perform a health check is to fetch a resource (like GET /hc.html, if your service is exposed over HTTP) or to perform a TCP connect check on the same port where data is served. I call these in-band health checks. Please don't do these.
In my experience, the best way to expose health checks is through an out-of-band mechanism: you expose a separate port just for health checks. The client can perform an HTTP check or a TCP check against the health check port. As a convenience, if the health check port speaks HTTP, you can consolidate the checks for multiple data ports behind it, like: GET /healthcheck/port1, GET /healthcheck/port2, etc.
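To make the consolidated HTTP variant concrete, here is a minimal sketch using only the Python standard library. The HEALTH table, FAIL_STATUS, start_health_server and check are names I made up for illustration; the port names come from the example paths above. It assumes the data ports are served elsewhere and only shows the separate health check listener plus a tiny client.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical registry: data port name -> should it receive traffic?
HEALTH = {"port1": True, "port2": True}

# Status returned for a failing or unknown port. The text suggests a 4xx;
# 503 is another common choice. This value is an assumption.
FAIL_STATUS = 404


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET /healthcheck/<name> on the dedicated health check port."""

    def do_GET(self):
        prefix = "/healthcheck/"
        name = self.path[len(prefix):] if self.path.startswith(prefix) else None
        healthy = HEALTH.get(name, False)
        status = 200 if healthy else FAIL_STATUS
        body = b"ok\n" if healthy else b"down\n"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet; a real service would log here


def start_health_server(host="127.0.0.1", port=0):
    """Start the health listener in a background thread (port 0 = any free port)."""
    server = ThreadingHTTPServer((host, port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def check(host, port, path, timeout=1.0):
    """Client side: return the HTTP status of a health check, or None if unreachable."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    except OSError:
        return None
    finally:
        conn.close()
```

A client would call something like check("127.0.0.1", health_port, "/healthcheck/port1") and treat anything other than 200 as down. Because the health listener is its own socket, flipping an entry in HEALTH never touches the data ports.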
One convention is to give every data port a control port at a fixed offset. For instance, 80's control port is 81 and 443's control port is 444. Or 80's control port is 8080 and 443's control port is 8443. When you need to decide whether the service is up or down, you check the control port.
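The fixed offset convention is easy to encode. The sketch below is only an illustration: control_port and is_up are hypothetical helper names, the default offset of 1 matches the 80/81 example, and an offset of 8000 gives the 8080/8443 variant. The TCP variant of the check simply tries to connect to the control port.

```python
import socket

# Assumed default: control port = data port + 1 (80 -> 81, 443 -> 444).
CONTROL_PORT_OFFSET = 1


def control_port(data_port, offset=CONTROL_PORT_OFFSET):
    """Map a data port to its control port using a fixed offset."""
    port = data_port + offset
    if not 0 < port <= 65535:
        raise ValueError("control port %d is out of range" % port)
    return port


def is_up(host, data_port, offset=CONTROL_PORT_OFFSET, timeout=1.0):
    """TCP check: the service counts as up if its control port accepts a connection."""
    try:
        with socket.create_connection((host, control_port(data_port, offset)), timeout=timeout):
            return True
    except OSError:
        return False
```

With this in place, control_port(443) gives 444 and control_port(443, offset=8000) gives 8443, so the checker only needs to know the data port and the agreed offset.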
Now let me explain the benefit of out-of-band health checks. If you would like to take the service down gracefully for maintenance, all you have to do is fail the health checks for that port (GET /healthcheck/port1 returns a 4xx) or stop listening on the health check port. Clients stop sending new requests once they see the failure, while requests already in progress on the data port are not affected. You can then close connections gracefully after serving the requests or after a period of idle time. This lets you perform scheduled maintenance like a champ without impacting any client traffic at all. Zero impact on client traffic should be your number one goal when performing scheduled maintenance.
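The drain sequence described above can be sketched roughly as follows. This is not a complete server, just the bookkeeping, and everything in it is an assumption for illustration: the DrainState class, its method names, and the grace and timeout values. The idea is that the health check handler reports healthy(), the data port handler brackets each request with request_started() and request_finished(), and maintenance begins only after drain() returns True.

```python
import threading
import time


class DrainState:
    """Tracks health status and in-flight requests for one data port."""

    def __init__(self):
        self._cond = threading.Condition()
        self._healthy = True
        self._in_flight = 0

    def healthy(self):
        """What the out-of-band health check should report right now."""
        with self._cond:
            return self._healthy

    def request_started(self):
        with self._cond:
            self._in_flight += 1

    def request_finished(self):
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def drain(self, grace_seconds, timeout_seconds):
        """Fail the health check, then wait for in-flight requests to finish.

        Returns True if the port drained fully, False if the timeout hit first.
        """
        # Step 1: start failing the health check; the data port keeps serving.
        with self._cond:
            self._healthy = False
        # Step 2: give health checkers time to notice and stop sending new traffic.
        time.sleep(grace_seconds)
        # Step 3: wait until the requests already in progress have completed.
        with self._cond:
            return self._cond.wait_for(
                lambda: self._in_flight == 0, timeout=timeout_seconds
            )
```

The grace period should be at least as long as the checkers' polling interval times the number of failures they need before marking the service down; otherwise new requests can still arrive while you are waiting for the count to reach zero.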