53. Health Checks for Services, Containers and Daemons
Jon Christensen and Chris Hickman of Kelsus discuss health checks for services, containers, and daemons. They use them to keep Kelsus’s distributed systems and services functioning.
Some of the highlights of the show include:
- Health Checks: A first line of defense when running any software in production from an operational standpoint to detect errors and identify when a service needs to be recreated
- Health checks involve something hitting an endpoint that executes application code, confirming not only that the service responds on its port but that the code behind it is actually running
- Two main types of health checks:
- Shallow: Use service code to create new endpoint; goes through frontend, routes to your code, executes code, and returns a response that signifies success
- Deep: Less common and has several pros and cons; tests not only that the service is running but also that its dependencies (databases, other microservices) are reachable
- Do a deep health check if your service can’t function without its dependencies
- Design your system to degrade gracefully; if a deep health check hits your database, you’re only verifying that it’s up and that you can connect to it
- Considerations for deep health checks:
- Expensive
- Startup latency issue
- Domino effect
- Rotation: Once an instance passes its health checks, the load balancer in front of it starts routing requests to it
- Parameters to remember when configuring a health check include:
- Set an interval because health checks run periodically, not in real time; incorporate a 30- to 60-second delay to give the system time to start up
- Determine how often you want to send health checks (e.g., every second, half a second, 20 seconds, five minutes); depends on the type of service
- Identify how long you want to give your health check to succeed or fail; does it fail after 100 milliseconds or after 30 seconds, depending on your latency requirements
- Implementation Styles: Where are you using the health checks?
- ELB level
- Background jobs or daemons
- Synthetics: Health checks from the caller’s standpoint that verify your service is up and running and that what the end user sees is correct
Links and Resources
Amazon Elastic Container Service (ECS)
Amazon Elastic Compute Cloud (EC2)
Rich: In episode 53 of Mobycast, Jon and Chris discuss health checks for services, containers, and daemons. Welcome to Mobycast, a weekly conversation about cloud-native development, AWS, and building distributed systems. Let’s jump right in.
Jon: Welcome Chris, it’s another episode of Mobycast. This time, you’re sitting right next to me.
Chris: This is a first.
Jon: Yes.
Chris: Good to be back Jon.
Jon: I think the only thing that will change with us sitting right next to each other here in beautiful Florianopolis, Brazil is maybe there’ll be a little less interrupting each other. But I expect otherwise, the episode will be just like any others. This week, we’re going to talk about health checks, how we use those just on our team, and how we may want to change how we use them for keeping our distributed systems and services up and running. Before we get started, since this is sort of a special week, what have you been up to this week, Chris?
Chris: I have been basically living on a plane, it feels like. I traveled for about 32 hours from snowy cold Seattle, halfway across the world to beautiful, sunny Brazil. We’re here at our company retreat with the entire team and it’s amazing to see just how big this team has grown. There’s a lot of new faces since the last company retreat, pretty interesting.
Jon: Yeah, it’s really fun. It’s big enough that I feel like I’m not getting quality time with every single person on the team this time around, everybody started to form their own groups and social circles, and it’s really interesting to see that grow. It’s so fun and Chris, this is your second time here in Brazil. I’ve been here six times and I just love this place. It’s a place that I was just thinking earlier today that it’s like a place that doesn’t change. You come back and back, and you can kind of expect things to be the way they were. There’s something about that after living in fast-paced America that I can really get behind.
Chris: Absolutely.
Jon: Under health checks, maybe we should start like we do with many episodes, and you can give us a definition.
Chris: Sure. Health checks are kind of a cornerstone of running any kind of software in production from an operational standpoint. It’s the basic check of just making sure that your code, your service, is up and running and able to service requests. This is one of the first lines of defense to detect errors when things have gone wrong. Typically, these are used to identify when a service needs to be recreated. In the past, that meant rebooting the machine; in the cloud, you basically shoot it in the head and spin a new one up. Health checks give us that core capability of identifying when things go wrong and basically just restarting.
Jon: Thinking back in my career, I think the first time that I got exposed to doing health checks was when we were configuring load balancers for clusters of Java web application servers. I think at the time, the health checks that we did weren’t really able to tell much about what was going on inside the application because the load balancer itself wasn’t really able to tell. It wasn’t able to load balance at that application level. It could only tell whether the computer and network were responding, not whether the application was happy. But it was still sending pings or network requests to each of the machines in the cluster all the time.
Whenever it saw one wasn’t available, it would just stop routing traffic in that direction. I think things have come a long way since then. The health checks are way more sophisticated, but at the end of the day, it’s still the same basic premise. If you see a system or a service that’s not available, stop sending traffic to it.
Chris: Right. I think what you’re describing there is basically the port-level, network-level, health checks. Your service is running on port 80 or 443, so ping that port, do you get a response back? That’s a basic-level check. Now, normally a health check involves something hitting an endpoint per se. You’re actually executing application code to determine not only is it responding on that port, but is the code actually running to respond back to it as well.
Jon: I think that’s a nice way to enter into the idea of the types of health checks that you might run. If you’re going to hit an endpoint and you’re expecting a response from that endpoint, what might you do in order to see if the system that you’re looking at is healthy?
Chris: Yeah. Maybe this is a good time to explain broadly the two main types of health checks here, shallow versus deep. Shallow is definitely the most common. This is typically what you do. We have our microservice, or some API service, whatever it may be, exposing an endpoint, and usually it’s very simple. We’ll get into this a little bit more about how shallow versus deep health checks are different and why you want to take this consideration into account. Again, the basic thing is, with these shallow health checks, keep it tight. It’s something that’s very quick. It’s exercising your actual service code. It’s an endpoint in there that is responding back, so it’s going through all the frontend, the routing to your code, executing that code, and returning back a response that signifies success.
Jon: A shallow health check with that, let’s just try to think in terms of examples. Maybe we have a blog service. You can do crud on posts. You can create a post, update a post, get posts. With a shallow health check, will that actually get a particular post or would you try to find something more shallow than that?
Chris: Yes. For a shallow check, you definitely want something more shallow than that. Typically, your shallow health checks are something that are going to be executed quite frequently.
Jon: Would you do a HEAD on getting a post, or just, “Tell me that this endpoint exists and I can send stuff to it,” like HEAD instead of GET? Do you see what I’m saying? What if all you can do is create, update, read, and delete posts? That’s a very small, tiny microservice and you want to do a shallow health check on it, and you don’t want to hit the database. That’s what I’m hearing with the shallow health check. Can you do just some lighter-weight HTTP request like HEAD or something like that to keep it shallow?
Chris: Yeah. I just recommend you would create a new endpoint for your shallow health check. Call it /status. You create a new endpoint /status and basically all it does is just echoes back something like, “Just return a 200.”
Jon: And that tells you that your service is alive, because if it wasn’t answering, then the whole service is dead.
Chris: Exactly. It’s very lightweight. It’s very quick. You’re not testing anything else. You’re not testing upstream dependencies. You’re not taxing your service with any load. It’s just verifying, basically, that my process is up and running and that, at a top level, everything is working. The requests are coming in, they’re getting routed, code is being executed, and responses are coming back.
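A minimal sketch of the kind of /status endpoint being described, assuming a plain Python service; the port, route name, and response body are illustrative, not prescribed in the episode.

```python
# Sketch of a shallow health check endpoint (Python standard library only).
from http.server import BaseHTTPRequestHandler, HTTPServer

class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/status":
            # No database call, no upstream dependency: just prove the process
            # is up, the request was routed, and code executed.
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(b'{"status": "ok"}')
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), StatusHandler).serve_forever()
```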
Jon: Yeah and you can learn quite a bit from that, because if your process is thrashing, it doesn’t have enough memory, it’s got problems, then even that shallow health check will have a problem probably, or in many cases it will.
Chris: Absolutely. Shallow health checks, despite the name, are not a bad thing. This is what you want to do. They’re very useful and they will give you that indication that something is wrong here, we’re not able to satisfy a request.
Jon: Yeah. Because if your shallow health check doesn’t pass, then you definitely want to stop routing traffic immediately to that particular instance of your service.
Chris: Yeah. In the words of our friend, Chrome, “It’s dead, Jim. Oh, snap.”
Jon: Right. Then there’s the other kind of check, which obviously is deep. Let’s talk about those a little bit, when you might use them, what they are.
Chris: Deep health checks are kind of interesting. Definitely much less common, lots of pros and cons, and lots of just considerations around them. We talked about the shallow health check. It’s very quick. It’s not testing any dependencies, very lightweight. It just basically says, “Hey, this service is up and running and it’s responding to your request.”
Deep health checks. So I have this microservice architecture. My service is a consumer of other services. Maybe my service is completely unusable if some of its dependencies are not up and running. A deep health check would be something that’s a bit more advanced and exhaustive in its checking. You’re not just checking that your service is running, you’re checking that your dependencies are running as well. Your dependencies again could be other microservices that you make calls on, that you really depend upon. It could be your database. Your example before of, “Hey, do we hit a database with this call?” would definitely be something to consider with a deep health check.
Jon: Okay, great. It makes sense what they are. We talked about how it’s important and good to use shallow health checks. If our main idea is that we just want to stop routing traffic to a process that’s thrashing or dead, then shallow makes sense. When does a deep one make sense? You said maybe when other services that you depend on need to be running, but can you just give an example? Have we ever used a deep health check? When does it really make sense? When would you do it?
Chris: I think where this really makes sense is if your service just can’t function without its dependencies. In that case, you have to have a deep health check. Let’s just say you have a database that your microservice talks to and there’s just no way your service is going to run without being able to talk to that database. You may very well change your health check to test that. Again, if the requirement you have is that your service just can’t operate without that dependency being up, then you want to look at the deep health check.
Of course, it’s also a really good thing to design your systems so that they can gracefully degrade. Something like a database, you’re probably not going to be able to degrade too gracefully from, but you may very well decide, “Rather than falling over, I’m going to display an error message or I’m going to switch over to a system like…”
Jon: AOL?
Chris: Yeah, exactly. AOL. That may be your strategy there while you have the alerting going on for the dependent service that, “Hey, this thing needs to be fixed.”
Jon: Earlier you had said, to keep your shallow health check really lightweight and to keep it out of the way of other processing that’s more important, you would create your own endpoint just for that. Would you do something similar? Say your deep health check needed to just make sure the database was there. Would you maybe make a status table that just has one row in it, and then you just go get that one row, and then that’s an easy way of making sure your database is alive, and it’s not doing anything to any other tables, and it’s super easy on the database?
Chris: That’s a great point. If you are doing a deep health check and it’s going to hit your database, you’re making sure that the database is up and running and that you can connect to it. If you have a table with millions of records, don’t go select on that table as part of your health check. That’s not what you’re testing with this. You’re just testing basic connectivity. Go hit a table with a single record in it. Even though this is a deep health check, you still want to keep it light.
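As a sketch of that idea, here is what a deep check that only proves it can reach the database might look like. The single-row status table is hypothetical, and the sqlite3 driver is a stand-in for whatever database driver the service actually uses.

```python
# Sketch of a deep health check that verifies database connectivity only.
import sqlite3  # stand-in for the service's real database driver

def deep_health_check(db_path="service.db"):
    """Return True if the database is reachable; keep the query trivially cheap."""
    try:
        conn = sqlite3.connect(db_path, timeout=2)  # tight timeout on purpose
        # Hypothetical one-row table: proves connectivity, not data volume.
        conn.execute("SELECT 1 FROM status LIMIT 1").fetchone()
        conn.close()
        return True
    except Exception:
        return False
```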
Jon: Right on. Where do we go from here? We know what shallow health checks are, deep health checks are. Maybe we can get into some implementation styles?
Chris: Yeah. Maybe something to talk about more with the deep health checks is some of those considerations. Again, they’re expensive, so you need to take that into account. You also have this issue of startup latency. Typically, when you’re spinning up a new instance of the service, one of the things that is going on is a health check being performed. Once that is successfully passed, then the system knows that this new service has spun up correctly and it can now be put into rotation, effectively.
Jon: Do you mind if I say that in just a little bit of a different way? I mean that was absolutely correct, but it just felt a little complex to me when I heard it. I just want to say, you’ve got different containers if you’re using ECS or different machines if you’re using EC2, who knows what if you’re using another service. They’re all available. You’ve got a cluster. You’ve got a lot of things running. You want each of them to be able to service your request. When you said put it into the rotation, that’s what you mean. Your load balancer that’s in front of all these things is going to be able to start routing requests to that thing. Now it’s in the rotation. I just had to clarify that one term.
Chris: Yeah. Absolutely. You have that startup latency to consider. A lot of systems, like the ELBs or whatever cluster mechanism you’re using, are going to have a certain amount of time before they fail a check. They’re going to hit a health check and if it doesn’t respond within five seconds, then it failed the health check. It’s not going to sit there forever. There’s going to be some time associated with it. If you have a deep health check that’s very expensive and can’t respond in that time, then you’re going to have a big problem, because you’re never going to pass your health check.
Even though everything is perhaps okay with your system, it’s never going to pass the test. You need to take that into account. It would be a good idea to think about having an initial health check that’s pretty deep and expensive, and then after that, you switch over to shallow. A lot of times, when you start up, you want to make sure that everything is up and running. Once that’s done, then you can switch over to a shallow check. That’s definitely a bit more complicated and advanced to do, but definitely something to take into account.
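A rough sketch of that deep-on-startup, shallow-afterward idea, assuming a single Python process; the check_dependencies function is a hypothetical placeholder for whatever dependency checks the service actually needs.

```python
# Sketch: the first successful check exercises dependencies; later checks stay cheap.
_startup_verified = False  # module-level flag; the first passing check flips it

def check_dependencies():
    # Hypothetical placeholder: e.g., open a database connection or ping an
    # upstream microservice, then return True/False.
    return True

def health_check():
    """Deep on startup, shallow afterward."""
    global _startup_verified
    if not _startup_verified:
        if not check_dependencies():
            return False          # keep failing until dependencies are reachable
        _startup_verified = True  # dependencies verified once; go shallow from here
    return True
```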
The other problem or consideration to take into account with the deep health checks is this concept of a domino effect. Your deep health check is hitting multiple services. Imagine the health check on your main service takes some amount of time and then it hits a dependent service for a health check. Maybe it’s actually another microservice, not something like a relational database. Well, what if that microservice’s health check goes and hits another one? You start chaining together these requests and you have to take into account all of that time. Then also, what happens if it’s one of those in the middle that’s failing, or the one at the tail end? It gets much more complicated with the deep health check. This is typically why, almost all the time, you’re going to stay with the shallow health checks and rely on your monitoring and your alarms independently for your upstream dependencies.
Jon: Yeah, that makes sense. I was just trying to think about how that domino effect would work, because when your micro service that you call calls the micro service that it depends on, if it just happens to hit the one that’s down out of 100 that are available, then it thinks the whole thing is down and it’s going to say, “I’ve got to be down too. I’m not working either,” when really it was just unlucky. That potential for unluckiness keeps getting multiplied as you go deeper and deeper in your list of dependencies of microservices. Yeah, it seems like micro service, deep health checks, really think hard before you put those in place.
Chris: Yeah, and in that particular case, hopefully everything else is working, so your upstream services should have their own health checks and they’re in a cluster. If there is one bad node out of however many nodes in that cluster, it should have failed its health check with its cluster and been pulled. Hopefully, that has happened before your service even tries to connect to it, so it’s not even in the routing. But it could be that it’s failing, because health checks run periodically. They’re not running in real time. You have to use some interval for these health checks, so there will be windows where failures slip through, and you have to design around that.
Jon: You just talked about the interval, because you had mentioned a few parameters that you need to think about when you’re setting up a health check. I think we might as well just make that concrete in terms of AWS. I’m just going to go out on a limb. I can’t remember for sure, but I think there are basically three parameters that you have to keep in mind. One is the delay before you start doing health checks: I want to wait 30 seconds, 45 seconds, a minute before I even send my first health check to this system to give it time to start up. Another one is how often I want to send health checks: every second, every half a second, every 20 seconds, every five minutes, and that really depends on the type of service. There’s no best practice there, it just absolutely depends on the type of service and how flaky or finicky it is. I think the third one is how long you want to give your health check to succeed or fail. Does it fail after 100 milliseconds, because you have a super low latency requirement, or does it fail after 1 second, 5 seconds, or 30 seconds? I think those are the three parameters, but please correct me if I’m wrong.
Chris: In general, I think those are the categories. For AWS in particular, for ELB health checks, some of the parameters are things like: how many times does it have to pass a health check before it’s determined healthy? That’s a parameter. You can set it to one; I think the default is two. It’s got to successfully pass two health checks before it will get put into rotation. You also have how many health checks it needs to fail before it’s considered unhealthy. And you have your health check interval: how often are you running these health checks.
Jon: That one I got.
Chris: Yes. Once something has been marked unhealthy, how many successful health checks does it need to pass before it gets put back into rotation. The delay before starting that initial health check, that’s not on the ELB, that’s usually on some other service. ECS has this: ECS will spin up a task, and you can specify a delay, a grace period, before health checks start counting against it once it’s registered with the target group.
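To make those knobs concrete, here is a hedged sketch using boto3. The parameter names are real ELBv2/ECS API fields, but the specific values, resource names, and IDs are placeholders for illustration, not recommendations from the episode.

```python
# Sketch: the ELB-side health check knobs live on the target group;
# the startup grace period lives on the ECS service.
import boto3

elbv2 = boto3.client("elbv2")
ecs = boto3.client("ecs")

tg = elbv2.create_target_group(
    Name="my-service-tg",                # placeholder name
    Protocol="HTTP",
    Port=8080,
    VpcId="vpc-12345678",                # placeholder VPC
    HealthCheckPath="/status",           # the shallow endpoint
    HealthCheckIntervalSeconds=30,       # how often to run the check
    HealthCheckTimeoutSeconds=5,         # how long a single check may take
    HealthyThresholdCount=2,             # passes needed to go (back) into rotation
    UnhealthyThresholdCount=3,           # failures needed to be marked unhealthy
)

ecs.create_service(
    cluster="my-cluster",
    serviceName="my-service",
    taskDefinition="my-service:1",
    desiredCount=2,
    loadBalancers=[{
        "targetGroupArn": tg["TargetGroups"][0]["TargetGroupArn"],
        "containerName": "my-service",
        "containerPort": 8080,
    }],
    healthCheckGracePeriodSeconds=60,    # ignore failing checks while the task starts
)
```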
Jon: That’s a little complicated. Wouldn’t it be nice if you could just configure this all in one place? But ECS is the one telling the load balancer, “Now I’m running for you,” as opposed to the load balancer saying, “I see you there. I’m going to give you time to be ready,” and letting me configure all my health check stuff in one place. But it makes sense. ECS is the one that’s aware of the fact that a task takes time to start up; the load balancer has no idea. The load balancer is like, “Just let me route traffic somewhere.”
Chris: And the load balancer is very generic. There are tons of things that you could put behind it.
Jon: Shall we move on into implementation styles? I jumped the gun on that one before.
Chris: Yeah. That actually gets into implementations: where are you using these health checks? The most common, without a doubt, as far as services and containers go, is doing health checks at the ELB level. This is just built into the load balancer. It’s just part of it. Load balancers keep a membership set of all the different hosts, computers, targets, whatever it is that it’s managing as a set. This is the set of nodes in my cluster that can answer requests. That’s the membership set. Those health checks are built right into load balancers.
The load balancer knows, “Hey, for every single one of the nodes in the membership set, go periodically and hit its health check, and if it fails, take it out of the membership set and mark it as unhealthy. Continue hitting it, and if it comes back and is healthy again, then I’ll put it back into the membership set.” All of that is done at the ELB. You get it for free. You don’t have to do anything other than configure that health check, have an appropriate health check implemented by your service, and away you go. This is the simple, common routine, if you will. Any microservice that has inbound traffic and is fronted by an ELB, this is a great pattern for.
It gets more complicated when you have services that aren’t fronted by an ELB, because now you have to ask yourself what’s going to do the health check. Common examples of this are background jobs or daemons. Basically, you can think of it as push versus pull. The ELB-fronted services have their requests pushed to them, coming in the front, versus these daemon services that are typically pulling. They’re the ones that are going out and polling, looking for work to do periodically. They don’t have inbound requests coming to them. Instead, they’re doing work, they’re basically a client, and they’re probably hitting something else that’s fronted by an ELB. For those, you’re going to need a custom implementation to do these health checks. It gets a bit more complicated, but it’s also very important to do. There are tons of different ways that you can do this.
Jon: I don’t think I’ve ever worked for a software company that had some daemon processes running where one didn’t go down and nobody knew about it. That always happens when you have a startup company. You’re building systems, you’re going fast, you build the daemon, and you’re like, “Wait, how come we haven’t seen any PDFs generated in a while? Ah, the daemon process isn’t running.”
Chris: Typically in those cases, you’ll notice minutes, hours, days later and you’re like, “Uh-oh, this thing was down the whole weekend,” and you never knew. That’s a big bad. Health checks for these kind of things are super important. But again, it’s more complicated. You got to figure out how you implement it.
Jon: Yeah, you definitely want to find out about it on Sunday mornings or Saturday afternoons, not like Monday morning at nine. You show up and, “I can fix it, that’s my job,” never that convenient.
Chris: Absolutely. Again, various different plans of attack there. You can do something really simple and maybe even a little bit silly, but go ahead and put an ELB in front of those daemon jobs and just have them have one inbound route: your status check. Have a private-facing ELB that basically has one job and one job only, and that is to check whether these things are up and running.
Jon: That’s great. Is it really easy then to just configure some sort of alert, a CloudWatch alarm, so that you can be notified when these things are not available?
Chris: In this particular case, you do not have to do anything because the ELB…
Jon: Oh, it will start it over.
Chris: Yeah. You have your simple health check. It’s kind of weird because you’re only putting an ELB in front of it for the health check. Thinking about it, it’s like, “I don’t know, why not?” It’s a minimal amount of code and you can leverage what you’re used to.
Jon: Just to make sure I understand this: ECS is supposed to keep the service running, at least one instance of it, and the ELB is what tells ECS, “This thing is down. Restart it.” Is that why you would put an ELB in front of it?
Chris: The reason I’d put an ELB in front of it is because you need something to do the health check. This is actually how a normal service running on ECS, like a microservice, works. It’s not that ECS detects it’s down; it’s the load balancer doing the health check that figures out it’s down and marks it as unhealthy. ECS subscribes to that event, sees it, and then kills the task and spins up a new one. That comes back up and gets inserted into the membership set. The ELB then takes over and performs its health checks. If it passes, it goes back into the membership set. It’s this dance back and forth between them.
You can do the same thing with these background daemon jobs. Really, the only extra work you have to do is update that daemon to accept some HTTP traffic. You have to have a little micro HTTP server just listening and able to satisfy requests that come in over HTTP or HTTPS, whatever it may be.
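A hedged sketch of that trick, assuming a Python daemon; the work loop and port are placeholders. The point is only that a tiny /status listener rides along with the daemon’s real job so an ELB has something to hit.

```python
# Sketch: a pull-based daemon with a tiny HTTP status listener on the side.
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Answer the load balancer's health check; everything else is a 404.
        self.send_response(200 if self.path == "/status" else 404)
        self.end_headers()

def serve_status(port=8080):
    HTTPServer(("0.0.0.0", port), StatusHandler).serve_forever()

def do_work():
    # Placeholder for the daemon's real job (polling a queue, generating PDFs, ...).
    time.sleep(5)

if __name__ == "__main__":
    # The status listener runs in a background thread alongside the work loop.
    threading.Thread(target=serve_status, daemon=True).start()
    while True:
        do_work()
```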
Jon: I love that hack. It’s small and it just takes advantage of all this heavy lifting that AWS already knows how to do.
Chris: Yeah. It feels like cheating to me and it doesn’t feel right. But again, it’s simple and fast. You can do other things that are perhaps a bit more sophisticated. There’s a bunch of different techniques that you can use, but basically, you just have to have something that is running at regular intervals that can go, reach out, talk to these things, and figure out whether or not they’re running.
Whether it’s a CloudWatch alarm that triggers a Lambda to go figure out if something’s running or not, you can then deal with removing these things, or failing and marking them, and letting ECS kill and restart them. That’s a more sophisticated approach and there are some pros to it, but again, a lot more heavy lifting.
Jon: Right. Great. I think we’re on our last bullet point here of the day. Synthetics, what’s this?
Chris: Yeah. The last thing I want to talk about is this concept of synthetics. Health checks verify that your service is up and running and can respond to requests, but that doesn’t necessarily mean things are going swimmingly and that what users are seeing is actually correct. A really great example of this would be: you have a website, and maybe there’s a login page, and it has to go and talk to a database or maybe some other dependent microservice, and something goes wrong, and there’s a bug in the code where it’s just not rendering the login box, the username and password fields on the login screen. It’s a broken web page.
Your health checks aren’t going to catch this. The service responds when you hit port 80 or port 443, so it passes the health check, but really your site doesn’t work. Synthetics are basically health checks from the caller’s standpoint, from the end user’s standpoint. In that particular example, you would have a synthetic that goes out and doesn’t just test whether it gets back a 200. It actually examines the response and verifies that the response is correct. You can really think of this as a test case.
Jon: I guess a production integration test.
Chris: Yeah. We had a great talk today with the team about doing UI testing for mobile apps and how you actually create those tests. That’s what you can use a synthetic for, which would be a useful thing. It would go and fetch the HTML, then load that into the DOM and check that the elements it’s expecting are actually there and have the right text labels on them, or whatnot. Again, it’s another one of those things that is really useful to have. You’re verifying not only that your service is up, but also that what the end user is expecting is indeed correct.
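A simple sketch of such a synthetic, assuming a hypothetical login page URL and form field names; a real synthetic might drive a headless browser instead, but even a body check like this one catches the broken-login-box case described above.

```python
# Sketch of a synthetic check: inspect the response from the end user's point of view.
import urllib.request

def synthetic_login_page_check(url="https://example.com/login"):
    """Fetch the page and verify that what a user actually needs is in the response."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        status = resp.status
        body = resp.read().decode("utf-8", errors="replace")
    checks = {
        "status is 200": status == 200,
        "username field rendered": 'name="username"' in body,
        "password field rendered": 'name="password"' in body,
    }
    return all(checks.values()), checks

if __name__ == "__main__":
    ok, details = synthetic_login_page_check()
    print("PASS" if ok else "FAIL", details)
```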
Jon: The thing that makes me think of is, as we were talking about health checks, shallow and deep, I was thinking that at some level, we also want our code to just react when things aren’t going well. We want our code to catch errors and start to behave in a way that is appropriate for the types of errors it’s seeing. If things aren’t going the way they should with a database or another service that we depend on, the code should be able to start shutting things down a little bit, but that’s really difficult.
A lot of times when you write systems, you don’t have the time to make your code go that deep. This is a way of saying the code may return some bad stuff, and the whole system might get into some bad states, but if we can write these synthetics to go in and test the world as looking right from a user point of view, then we can at least have a fighting chance of finding out if things in the world are not the way they’re supposed to be before users do or very early on in that process. Even if not all of our error-handling and thinking about every single micro service and what it’s supposed to do in a case of an error is totally vetted, baked, and perfect, because it never is.
Chris: Yeah. Absolutely. You can think of it this way: the combination of shallow health checks plus synthetics gives you the advantage of those deep health checks. That’s really what they’re doing. If your service is not responding the way your end user expects, or your caller, if it’s an API, isn’t getting the kind of response it expected, whether there’s a bug in the code or one of the upstream dependencies is down, your synthetic will catch that. It gives you the best of both worlds.
Don’t use the deep health checks as your standard health check that’s hitting the service at a frequent pace; instead, use synthetics, and you can judge what interval you want to use for those as well. They’re something that can be very useful to have in concert with the shallow health checks.
Jon: Very cool. Well, this has been fascinating for me. I haven’t done a lot of work in this area recently, so I learned a lot. Thank you.
Chris: Awesome. Go check your status.
Jon: Right on.
Chris: Alright.
Jon: Talk to you next week.
Chris: All right. Thanks, Jon. See you.
Rich: Well dear listener, you made it to the end. We appreciate your time and invite you to continue the conversation with us online. This episode, along with the show notes and other valuable resources is available at mobycast.fm/53. If you have any questions or additional insights, we encourage you to leave us a comment there. Thank you and we’ll see you again next week.