31. Logging A-to-Z for Containerized Microservices (Part 2)
Chris Hickman and Jon Christensen of Kelsus and Rich Staats from Secret Stache continue their discussion on how to handle application-level logging for microservices running on elastic container service (ECS). Specifically, they talk about best practices and how to forward logs from containers to other services.
Some of the highlights of the show include:
- Garbage In, Garbage Out: Logs are only as useful as telemetry injected into the system
- When running multiple concurrent services for scalability or availability in the Cloud, there’s multiple requests; logs aren’t sequential, so need to be tied together
- View logs as atomic units of information; besides standard log levels, adopt a custom inner level to emit a request level log event for every API call into backend microservices
- Log all requests; putting request logs into the info log level means you can’t put anything else in the info log level because you want to keep it clean
- Info-level and request-level are the same thing – requests that become an alias for info; also allows you to separate it out
- APIs are usually always stateless; log statements should be, too
- Unexpected errors happen in code, so log all the context you need around them; narrow down and single out the error to know exactly what’s going on
- Don’t log so much or become numb to it; ask yourself – “Is that the right context here?” and “Is this the right thing to do, and is this at the right level?”
- Where to put the logs:
- Logging can involve sending to standard out, standard error; it either goes to the terminal screen or is redirected somewhere else
- Log directly to a file or via API to an actual log service or Daemon/SaaS product
- Docker has log driver plugins to identify how you want your logs treated; there’s also plugins for CloudWatch and other options
- Kelsus’ application logs go to a file being written to a known location on the host with its name spaced out per service
- Consider space; if you have something verbose with its logging, it can fill up your file system, so you may need to implement log rotation or purge files
- Many tools and services exist to get logs off a host and analyze them; Kelsus installs a collector agent on each ECS host to identify where to look for files
- If the log goes to a machine and the machine crashes, there’s only a window of less than a minute where you would lose log information
- Frequency of getting written to disc is up to Docker, your OS, and buffering; you can configure the rate
- Log events are JSON objects and your atomic unit of what’s being emitted
- Most of Kelsus’ code is Node.js, and Winston is used to support file system transports to the console
Links and Resources
Rich: In Episode 31 of Mobycast, we dive into Part 2 of our series on application logging. In particular, we discuss best practices as well as how to forward logs from your containers to other services. Welcome to Mobycast, a weekly conversation about containerization, Docker and modern software deployment. Let’s jump right in.
Jon: All right and welcome, Chris, Rich, in another episode of Mobycast.
Rich: Hey.
Jon: How are you, Rich?
Rich: I’m doing good. How are you?
Jon: I’m good. What have you been up to this week.
Rich: Let’s see. This week, we just surpassed another record-breaking month for revenue at Secret Stache so that’s really, really exciting and a pattern that we’ve seen all year. The downside to that it’s exposed a bunch of inefficiencies that we’re working through and so this week has been mostly about writing out a lot of those lower-level processes that we do so that, hopefully, I can hand those off to an executive assistant or a virtual assistant, something similar to that.
Jon: Sounds good. Time to scale. We missed you last week, by the way.
Rich: Yeah, I can’t wait to hear that episode.
Jon: Chris, how about you? What have you been up to?
Chris: Rich, did you just say that you have an executive assistant?
Rich: Not yet.
Chris: Okay. I was like, “Wow, man, this dude is–” You’ve made it. I’m so jelly.
Rich: My executive assistant will probably be a virtual assistant somewhere in the Philippines though it’s not going to be someone that’s going to be hanging out in my home office.
Chris: Okay, it’s not your dog then.
Rich: He’s my confidant and therapist.
Chris: Cool. I was fortunate enough to make a trip out to Viejo recently and hanging out John and got to get some mountain-biking in which was a very first ever for me. I’m a roadie so to actually get out on fat tires and go down a mountain was a new experience and fun so that was great. Then, I unfortunately picked up a bug somewhere along the way and had been sick for the last week but on the mend, so I’m very happy for that.
Jon: I’m glad to hear you’re feeling better. You’ve went on a trip. After you visited, I went to San Diego and did my usual thing out there that involves a little bit of surfing and a little bit of meeting people for Kelsus so I had a good time there. Yeah, we’re going to continue this week with talking about logging. Last week, we’ve just talked more generally at a high level of what we should log and log levels, some things like that, so this week we’re going to move forward and get…
The ultimate direction we’re going is we understand logging, we know what we need to un-log, we know what it’s good for and then what changes about how we do logging when we’re doing it in ECS with Docker. That’s the direction we’re going, and we’ll push that way today so I think we’re going to start with best practices for log events. Take it away, Chris.
Chris: I thought those would be interesting for us to talk about just because it’s an important part of the whole logging process. Basically, it’s the old axiom, “Garbage in, garbage out.” Your logs are only going to be as useful to you as the telemetry that you’re actually injecting into the system. There’s lots of–some of this experience, but there are general rules that you can follow here. I think we probably touched on this a little bit in the last portion of this.
When you’re running in the cloud multiple concurrent services for scalability or availability, you have multiple requests going on the system. Logs are no longer sequential and you need other ways of tying these things together. One of the big things here is that when you’re logging events in your code, react accordingly. Don’t log sequentially because they’re absolutely not going to read sequentially unless you do something more sophisticated to thread these things together.
It’s just not going to happen so you really need to think of your logs as much as possible to think of them as atomic units of information. That can be pretty helpful. One of the things that we’ve done that’s worked really well for us, in addition to the standard log levels that just about everyone uses, the debug, info, warning, error, what we’ve adopted as a best practice is to have a custom inner level. We call it request, and we basically emit one request level log event for every API call that we get into our backend micro-services.
This ends up being really, really useful because, now, you can start doing things like, “How many requests a second is my micro-service getting?” or, “For particular a API call, how many requests is it getting?” or, “How long are these requests taking? How many of these API calls are returning back user errors versus service errors?” or whatnot, and it also gives you that–it becomes easier to troubleshoot, debug and to look for what you need in your logs by having those semantics of, “Hey, we’re going to emit a log for every request call,” because that’s usually what your users are doing.
Someone will say, “Maybe I had a problem uploading an image,” so it’s like, “We know which one of our API primitives endpoints handles that. Let’s go search for the request-level logs for that particular API event for that particular user,” and we can now narrow it down much, much quicker.
Jon: I’ve seen that. Of course, you want to log all your requests. I guess I’ve seen that in other companies and other projects, people putting those request logs into the info log level, but then that means you either can’t put anything else in the info log level because you want to keep that just crisp and clean and only requests or you end up with info log levels that are a little bit of a mishmash of too many things and you have to write scripts together just like at your requests.
Chris: Yeah, absolutely. At the end of the day when we do this, info-level and request-level actually end up being the same thing. They’re just requests that become an alias, essentially, for info, but it also allows you to separate it out and to do it. You could do other things. You could still use info-level but then probably have another metadata tagged on that that says, “These are requests,” or something to have a further filter.
Again, maybe the important point here is think of your logs as these atomic units of information that stand–they need to be standalone on their own. Unless you do the work or have the tools that can thread a user request through multiple log events, unless you have that sophistication built into it, you’re just going to be looking at a stream of events. For them to be able to stand alone and have all the context you need to make sense of them is very, very useful.
Jon: What’s occurring to me is that, since our APIs are pretty much always stateless now, so should our log statements be.
Chris: Absolutely. That’s a great way to put it.
Jon: Cool. Was there anything else you wanted to say about what you meant beyond logging every request and keeping those essentially stateless or standalone? Anything else that is typical that we do or is the rest of it more left to your own tastes?
Chris: I would say knowing–obviously, unexpected errors that happen in the code where it’s like, “Hey, this was not expected and the code really doesn’t handle all that well.” Definitely log all the context you need around that, and that’s going to promptly go out at an error level. Again, whatever is in that log event, make sure that there’s all the context there so that, once someone looks at it, as quickly as possible, they can narrow it down and know exactly what’s going on so, again, keeping with that stateless, atomic concept of your log event.
Warning level, this is something that you would use it for like, “Hey, the code has handled it.” Maybe it was a client error so the client called and sent a bad request or maybe it’s a bug in the client or something like that. When you receive that on the backend, maybe you want to emit that at a warning level so it allows it to stand out from the other events but you don’t have to treat it as critical or something that needs immediate attention.
You can go and then do something like, “Okay, show me all the warning events that happened in the last eight hours,” and you can periodically go do that and you might see like, “You know what? We have a client out there that’s got a bug in it because it’s sending the wrong post-body,” or something like that or, “It’s malformed so maybe it’s just a bug and maybe we wrote to that client or maybe someone else did,” but just being able to single out those things.
Jon: I think a good one–because I think warnings have a tendency to get a little overused and then, once they’re overused, they no longer pass your filter of, “I should do something about this.” “Yeah, it’s full of warnings. Yeah, they’re fine,” that kind of attitude. A really good question is: If I were to see this in the log, would I actually look at this and do something about it or would I just know already for sure that the code is fine and I’m not going to do anything? If it’s the latter, then don’t put a warning log, I would think.
Chris: Absolutely. It’s the whole–you don’t want to cry wolf and you don’t want to become immune to your log events because, then, they absolutely lose their power. If you’re logging so much or if you just become numb to that stream, then it’s worthless to you so, definitely, you want to–it’s just so important to log, be very deep about how you think about your logging whenever you do–”Is that the right context here?” and, also, “Is this the right thing to do and is this at the right level?”
These are 10,000 events that are really worthless or missing a very critical event that only happens in production and now it sets you back a couple of days because you just didn’t have the context around that. To really think these things through just from the aspect of, “Okay, what information do I or someone else need in order to take an action to fix this, and is it something that someone needs to know about or is it not?”
Going along here with the–”Your logs are no longer meant for human eyes,” necessarily or, “They don’t read serially,” having events at the info level to say, “Entering this function,” or, “Leaving this function,” those have no place in log events anymore. Instead, you can use your request-level events to trigger–you have all the information there, you know what’s happening and, if an error happened during that request event, then you have your info. Your request-level event, you should log your return status on that so it’s going to show, “Hey, this was a 500 error,” and then you also have a corresponding proper error event that goes along with that.
Jon: Exactly. Moving on from what and when to log, let’s talk about where to put the logs. I guess this is the first time we’re going to start to enter into a little bit of, “Things are different now because we’re in containers.”
Chris: Yeah, lots of options here. Logging can be as simple as sending to standard out, standard error and either it’s just going to the terminal screen or you can redirect that to somewhere else. You can log directly to a file or you can probably log directly via API to an actual log service or Daemon or SaaS product or something like that. There’s multiple different options there. They all have pluses and advantages and, when you are running your applications and containers, you come into some even more options and nuances.
We’ve talked about this many times before in various episodes of this podcast. A container’s basically a hermetically-sealed bubble unless you punch holes in it to share information with a host so if you’re out putting your logs to standard out, standard error, there’s some additional considerations for how do you actually get that information out of that container and somewhere else so you can actually see the output.
Docker actually has–it’s a very well-known typical use case so Docker has actually log driver plugins and so you can tell, basically, Docker how you want your logs treated. If you are emitting to standard out, standard error, you can tell Docker you want that stuff to be captured and output to a file, and it will do that for you, and then there’s ways to access that. You can use it at the logs command for Docker so Docker Logs will show you the information that’s being captured there.
You can also have plugins for things like–if you’re using AWS, you can have CloudWatch. You can send your logs to CloudWatch, and there’s many other options. I think we could probably have a whole other episode just on all the pros and cons and how all of this works so I don’t want to go into that too deeply but maybe just talk a little bit about what we’ve standardized.
Jon: Yeah, that’s exactly where I was going. I was going to say, “Okay, there’s lots of choices here. Some of them are probably bad, and what we do is maybe not the best in the world but it’s probably something that works and something that is efficient so what do we do?”
Chris: Right. What we end up doing is all of our application logs are going to a file, and that file is being written to a known location on the host and its name spaced out per service. We use volume mounts to create that bubble to punch that hole in that hermetically-sealed bubble so that these log files actually get written out on the host and then, from there, we can then do something else to get those log files collected and get into our log service.
Overall, that works pretty well for us but there are some issues with that. You have to think about things like the space, what happens especially if you have something that’s very verbose with its logging and lots of requests. That can be something that fills up your file system so it’s something to think about. Then, you start getting into things like, “Do I implement log rotation?” “How do I purge files?” and there’s various ways to do that.
I think we’ve reached a nice little steady state where we have a big enough file system partition for all these log files to go to. We’ve configured the log level for our apps to be at the info level when running in prod. We end up recycling our host machines frequently enough.
Jon: Once a month at least, you were saying?
Chris: About once a month. It’s usually a good thing to–by then, there’s usually an updated EMI with software updates and security updates and whatnot. It just makes sense and it’s so easy to rotate these things out. It ends up being a pretty nice and easy solution for us.
Jon: It sounds like we don’t have to add complexity of log rotation and coming up with another container or script to do that and whether to deploy it and make sure it’s running and doing its thing. We just make sure there’s plenty of space and kill the machines once a month.
Chris: Exactly.
Jon: Cool. I guess one other thing–this is going to be my question: You’re going to talk about–okay, the log’s going to the machine and then something else gets them off the machine or some service sends them from the machine to some other place where they can be kept more permanently. I guess one of the things that came to my mind was, “Uh-oh, what if the machine just totally crashes? How much logging do we lose?” Before you answer that, tell us how we get the logs off of the host.
Chris: Again, there are many different options once you’ve got these log files and all your services are running. How do you know analyze? You certainly don’t want to SSH into every one of these machines and bring it up in VI or Talum or something like that.
Jon: Especially because you’re going to be killing the hosts once a month so then you would never have logs older than a month.
Chris: Yep, exactly. There are various tools and services out there for accurately getting your logs. There are literally so many services. At DockerCon, there was probably at least 10 vendors out of the hundreds there that were log services companies. It’s crazy. It’s like, “There’s got to be an acronym for yet another log service.” We use Sumo Logic. It’s been around for quite some time. This is the only thing that they do, is logging and the analysis of it and all of the tools around that and the features surrounding that.
It works really well for what it is that we need it to do. The task then becomes, “How do you get your log events that are in these files sitting in these ECS hosts over to Sumo Logic?” which is basically a SaaS service. There’s many different ways to do that. The way we do it is we end up installing a collector agent on each one of our ECS hosts. What that does is, basically, you tell how to look for files like where to look for them so you give it, basically, a regex and a path, and it will go and basically hook file system notification changes.
Whenever there’s new data being written to a file, it could then go and read that and then it marshals that up and ships it over to the Sumo Logic service and so it’s constantly doing this and it’s running and whatnot. Latency ends up being definitely less than a minute for all that to happen. From when an application from a service is emitting a log event, it’s going to a file, it’s being shared through that volume mount, written on the host. The agent receives the file system change notification, reads the data, patches it up, makes it API call back to the Sumo Logic service to get it there, and so that’s just constantly running. What that means is there’s some window of less than a minute where, if that machine just got vaporized, that would be the log information that you would lose so it’s a pretty small window.
Jon: That’s super-interesting. It’s almost streaming the logs. I actually don’t know how file change notifications work. Do they trigger every line that gets at it or every character? Do you know?
Chris: Whenever it’s written to disc, whenever those blocks on the data change, there will be a file system notification and then it’s up to whoever’s subscribing to that to do something about it.
Jon: Right, but then how it often gets written to disc is up to Docker because it’s Docker that is where we’ve said – make a file from standard out and so it’s the one that’s got to be like, “Okay, save. Save. Save. Save.” Yeah?
Chris: Yeah. It’s that and it’s also your OS and buffering and all that other side–there’s also the agent itself. You can configure it to say how often it should actually look at that stuff and marshal it up and ship it over. There’s a whole bunch of ways. There’s just various different players here and different ways to tweak it.
Jon: Sure, I’m sorry that you got super-deep on that but I was just curious because it’s using a pretty sophisticated little feature of UNIX to be able to essentially stream the log files off of the host, and I was like, “Huh, I wonder if it knows about every single line. I wonder if it knows about every 10th line or every character,” and then I was like, “If it’s every character, what happens if it ships a partial lane?” Anyway, all those questions occurred to me and that was more the programmer in me than the software architect in me.
Chris: Right. I think it’s safe to say that you’re–we talked about these before. Our log events are basically JSON objects and so, at the very least, you have that as your atomic unit of definitely what’s being emitted. In the application code, it’s definitely logging at least one event at a time and then we do have the ability to either–we’re either outputting to standard out, standard error and redirecting that or sometimes we log straight to a file I use in our login system.
Most of our code is Node.js and so we use things like Winston, and Winston supports file system transports in addition to the console transport. In that case, it’s not Docker doing it; it’s this whole–how Winston is doing its buffering and how often it’s writing the disc and the volume mount and what the OS is doing and whatnot.
Jon: That’s super-interesting. I think that that’s a pretty good place to stop because we’ve got the logs over to Rollbar and then, next week, we go into–
Chris: Logic.
Jon: I’m sorry, Sumo Logic. Rollbar is another logging service for you. It’s for the frontend for those listening. Great. We’ll continue where we left off next week.
Chris: All right. Thanks, guys.
Jon: Yeah. Thanks, Rich. Thanks, Chris.
Rich: See you.
Jon: Yeah.
Rich (Program): Well, dear listener, you made it to the end. We appreciate your time and invite you to continue the conversation with us online. This episode, along with show notes and other valuable resources, is available at Mobycast. If you have any questions or additional insights, we encourage you to leave us a comment there. Thank you and we’ll see you again next week.