October 3, 2018

30. Logging A-to-Z for Containerized Microservices

Show Notes
Transcription
Discussion

In this episode, Chris Hickman and Jon Christensen discuss how Kelsus handles application-level logging for microservices running on elastic container service (ECS). Logging is an important part of your toolset to run production workloads and view how services are running and performing. Logs offer a treasure trove of data!

Some of the highlights of the show include:

Cattle vs. Pets: Differences in what is done now with logging and functions compared to what was done in the past; some companies are driving machine learning with logging
Why logging? Different logs are pieces of the puzzle that can offer meaningful metrics
Application-level logs give fundamental insight into how actual applications work; i.e. how many upload image requests are failing or how long do they take to execute
Various log sources are available – System, ECS, etc.
Application-level logs offer lots of ways to set up log structure, including stream based or defined schemas through JSON
Filtering and picking log levels; typical log levels are debug, info, warning, and error
Experience plays a role in the art and science of when you log an event and what you omit with it
SQL statements could be included as part of info log, especially with difficult situations

Links and Resources

Rich: In episode 30 of Mobycast, Jon and Chris talk about how Kelsus handles application-level logging for the microservices running on ECS. Welcome to Mobycast, a weekly conversation about containerization, Docker, and modern software deployment. Let’s jump right in.

Jon: All right. Hey, Chris. Welcome to another Mobycast.

Chris: Hey, Jon. How’s it going?

Jon: Good except I’m a little nervous about whether we’ll be able to pull this off because I’m first time we were missing Rich. Rich is our producer and he helps us put this together and makes sure that everything flows. Also just having him listen helps keep me grounded to make sure that as I talk here and learn from you that we try keep things understandable. Without Rich, I’m feeling a little out there hanging in the wind but we’ll try this without him and see how it goes.

Chris: Yeah. Time to push ourselves out the nest.

Jon: He should have back next week, though. We miss him. We’re going to talk about logging. Logging essentially helps you with everything from just troubleshooting and seeing what your apps are doing to running entire businesses, Google, for example. A lot of the data that they move around and use for everything comes through their logs as opposed to relational databases. I guess we’re going to talk about specifically how Kelsus has been doing application logging with ECS, right?

Chris: Yup, you bet, to find a pretty important part of the tool set when running production workloads, having that view into how you services are running, how they’re performing, are there problems, maybe you can anticipate problems, and logs are a great source of this. But it’s also kind of comprehensive to get it all set up for A to Z. Thought today we talk about how Kelsus does this for its containerized microservices running on ECS.

Jon: Definitely quite a bit different using ECS and using containerized services than maybe back in the day if you’re using Log4j for your Java projects, not that Log4j Kenneth would be involved in this setup, but there are certain things you have to do that maybe you didn’t used to have to do when your servers were your pets.

Chris: Absolutely and we’ll totally get into that. There’s still a bit of that culture, if you will, of like, “printf(‘Entering function’); printf(‘Exiting function’),” that thing. That’s kind of again, the way that it worked when you had pets. But when you have cattle, that stuff is completely worthless to you. That’s what we are going to talk about today is compare and contrast some of that stuff as well.

Jon: I used to use HIJC for my long prefix to make sure that I can see my log lines. I knew how tired I was and how much sleep I had gotten, depending if any HIJCs got into the production branch.

Chris: At first I thought HIJC like, “Is that like a sports drink or is that a fruit juice?,” and it’s like, “Oh, no. It’s like, ‘Hello JC,’” your initials. It’s your own personal salutation, like, “Hey, this is my log.” It’s a signature.

Jon: Exactly.

Chris: The artist signing his work.

Jon: All right, let’s jump right in. It looks like we’re going to start with talking about log sources and why we’re logging, just getting into that a bit more deeply. Go ahead, Chris.

Chris: We’re touching this a little bit, but logging, why do it? There’s many types of logs that you can have out there and they’re all just pieces of the puzzle that can be useful if done well and if you can actually make meaning of them. There’s other ways of getting input and getting the visibility into your services, understanding how they’re operating, various metrics, and looking at other parameters in your services, but logging forms are core that. You also have complete control over it and it also—in particular application-level logs—give you the fundamental insight into your actual application and how it works, and that’s not something you can really get in a general purpose way.

You can’t say it for like, “How’s my memory consumption work?,” or it’s like, “What is my CPU load look like?” Those are general-purpose and you can get those for free, really not to do anything, but like, “What does it mean to know how many of my upload image request are failing or how long do they take to execute?” That kind of stuff only your application knows about that. It’s application-specific and application-level log is one of the things that you can do to light up visibility into that. Some really great benefits to be able to just instrument your service at the application level. You can derive metrics from it, you can derive performance calculations and its associated metrics, and you can also understand just the overall health of the system, where request are failing, context around those failures, and all other sorts of whatever kind of information context you want up in the log. Super important, super useful, and it’s just definitely considered to be one of those core things in the tool belt when building production-like services in the cloud.

Jon: Agreed. One of things that’s happening a lot more regularly now, especially with tools like Splunk and just watching what other big companies are doing, is even driving machine learning with logging.

Chris: Absolute yeah. That’s I’m capable getting to is a little bit more of it. This is one of the consequences of going from pets to cattle and just scalability and having multiples of things. In the past, logging was for developer eyes or DevOps eyes. You would actually, when something’s going wrong, you need to analyze it, you would probably be tailing a log file or you would be opening it, downloading it, uploading it, looking it in an editor, searching through it, just looking it with your eyeballs to see what’s going on. It would read sequentially. You probably want to have one of these things so you could read that log file sequentially. It was the entire picture and that work. Having things like log messages like, “I’m entering this function now,” and, “Now I’m exiting it,” it worked, it was useful. But once you go from one to many, this doesn’t work anymore, because now it’s sort of….

Jon: From synchronous size, synchronous too.

Chris: Absolutely. You have multiple request going on in the same service, those that logging for each one of those concurrent request is being interleaved with each other so it no longer reads sequentially. And not only that, you have just a slice of the overall deal into the system, especially if you don’t have server-based affinity for your request coming in off your load balancer. Its single user making a series of calls, maybe bounced across multiple servers.

You can’t see the entire flow of what they did looking at a single log file. It’s definitely much more complicated now that we’re multiple machines, we have multiple concurrent request. It’s really no longer something that you can really read by, by looking at these log files with your eyeballs and say you need help.

Jon: If I had a log file of my entire life, I can narrow how much of it I’ve spent tailing log files. Let me tell you, that number is non-trivial.

Chris: I think that’s true for anyone that has more than five years of experience in this space. Actually, the truth is, I suspect there’s actually a lot of folks out there that still do that, that still live in that world. We’ve all been there and absolutely I as well.

Back in the day when I was building back-end services for Microsoft and trying to deploy a service into our data center, we didn’t have really great test environments to simulate usage patterns and what not. There’s always some kind of unexpected surprises. We all suffered the beginning edge of that. It was like, how you even build web applications and back-end services.

Spent lots of times in a data center. Also because I couldn’t remote in, that didn’t exist. Looking at log files in this ice-cold data center, standing on a rack, that definitely paid my dues doing that. I’m so glad I don’t have to do that anymore. It may have been a great point with things like ML and AI. What a treasure trove of data for these things to be pointed at because there’s patterns and can be automated and sleuth out to even do more of the forensic analysis as well as the predictive analysis, I think. Lots of interesting possibilities.

Jon: Let’s talk a little bit about log sources.

Chris: Sure. We’re really going to focus on application-level logging but there’s all sorts of sources of logs that may or may not be useful to you. There’s definitely logs produced by your operating system environment, system logs, there are logs produced by relevant dependencies in your system. For example, if you’re running as we are our containerized applications on ECS, there’s ECS logs to tell us what happens like jobs being scheduled to run and how does initiation go versus when they get terminated. how does that happens, there’s logs output by that. Lots of different sources of logs out there and again, they all have their usefulness for whatever it is that their output go. We’re going to focus on the application-level logs for the rest of this.

Jon: Sounds great. Let’s get right into it, talk about log structure.

Chris: With application-level logging, you have lots and lots of choices. There’s lots of ways to set this up, lots of design decisions, questions and answers. We’re going to talk about maybe some of those inflection points and just best practices that we employ at Kelsus. We need to start off with log structures. When your logging your application-level logs, you’re in your code somewhere, you decided that some event is meaningful to be persistent as a log event. You now have a choice of what does that look like.

In the past a lot of logs are just string-based series, as we talked about, printf or console log. Error happened, this was my input, and here is the error code, something like that. There’s just one string that just gets emitted. You can do that and there are libraries out there that you can use regular expressions and you can use parsing libraries to pull out the semantics of that string, but it’s pretty difficult.

Another way of handling your logging is to actually define an actual schema to your logs. Have some very general-purpose structure where these things look similar. Think of it more as like an object as opposed to a string and one of the great tools for formats for doing this is JSON. This is what we do at Kelsus with our application-level logs. Each log event is simply a JSON object and that JSON object, every one of those has some core schema that’s there for every type of event, regardless of what it is. Things like, “What’s my time stamp? When was this emitted? What is the host name from where this was emitted? What other metadata about the actual context of the log event that is common across any type of event,” and then you can extend that.

You can add additional pieces to it for events that need more specialized information with it. But give us overall core standards-based structure and now, parts of it become super easy because you can just go and if you want to get all the logs from a particular host, it becomes really easy. Everything’s a JSON object, you just do a JSON search and you can say, “I know that 320 log events that has a host name properties, where hostname property equals the host I’m looking for, pull back all those log events,” and now you can go search it.

Jon: Something that’s funny that I just wanted to point to is that you said, “Kelsus does this JSON format now,” and that’s what we always do. But what’s funny is that the very first time we ever did that is when you were working for one of Kelsus’ former client. You weren’t the VP of Technology of Kelsus at that time, Chris, and we decided to use JSON log formatting for a mobile app that we were doing for that client. That was new because despite the fact that server logging has been modern and up-to-date in terms of how formatting gets done, et cetera for many years, for a few years in the iOS and around the world, this is just like NSLog.

Then finally we got logging tools, one called Lumberjack that I think we still may use. Just the idea of shipping those logs off of the device and getting them out of them so that you can look at them because they have a tendency to get trapped on devices, all of that was new and I think that was bout 2012. Ever since then, it’s just been a no-brainer. Use JSON for your logs.

Chris: Yup, absolutely. Another thing that you can do to your logs to structure them is to define what, in general, what the log event represents, what level of severity or context it has. This brings into the idea of the concept of log levels. This has been an idea that has been out there for quite some time, mentioned earlier Log4j, it had the idea of log levels but definitely, to do this, you need that structured format. These log levels again inform just what level of severity or detail you’re logging at.

Some typical log levels would be debug. These are like I’m just trying to figure out something, maybe locally on my development environment or really just in test mode type thing, it’s pretty verbose and I wouldn’t want something like that emitted during production. It’s really just for troubleshooting type thing.

You got some other events would be like info and that would be at a higher level of context where this is something that’s more meaningful and more relevant than normal production usage. It might help inform the overall context of what’s going on in the system. It’s just for purely informational purposes.

Another more would be warning. This is a log level you’d use to say something unexpected happened but it’s not really an error, the code has handled it, it can still operate normally. It doesn’t really require any intervention or remediation. However, probably something you want to look at be aware of. A lot get the warning level.

And then error. The error level represents something unexpected happens, really this represents a bug, it’s something wrong with the host that it’s running on, something that really needs someone to look at it. It needs attention paid to it.

Having these various log levels as you’re application emits these log events, picking the right level becomes pretty important to you because you can filter on this stuff as well. If we have these structured log events we’ve already talked about, now we have these log levels that gives you another filter. You can say, “Go show me all log events where their level is warning,” and now you can see that and you can go across all your host in one swoop and you can further filter that based upon what kind of schema are you set up for your JSON logs to further refine that so it becomes really useful.

Jon: I have some thoughts about what level to set your production logs at that are a bit nuanced. I think maybe a while back people wanted to try to figure out a way to only set their production log levels at warn which is horrible because then you lose all that glorious information that you have in the info level that can be used for metrics, analytics, and other things. It’s just clear that you want to be always saving and storing info and above, because info is going to be where you get a lot of really valuable data. For sure you want to save that data in production.

That’s one thing, but then another thing occurs to me. I could be wrong on this. I’m just not sure what’s going on with the speed of machines and just capabilities these days. In an ideal world, you would even keep debug level logs. You just leave them on if you could. Or very low-lying troubleshooting-level logs, if you could, if it actually doesn’t slow down your application because if there’s valuable information in there that is going to help you troubleshoot a problem, then a problem happens and you’re not actually saving that information, then you’ve got to go make the problem happen again in order to do a troubleshooting.

It seems like in a perfect world, you’d be able to do something like say, “Okay, store these but just keep them around for 24 hours or 48 hours and then they don’t need to go into that permanent record but it would be nice if you can just keep them.” I think the only reasons not to do that would be if (a) there’s some sensitive information in the debug logs that you’re just not allowed to ever store, or (b) there’s so much debug logging that you’re absolutely sure you’re that you’re slowing down your system and costing significantly more for your deployment and you just need to save that money, save that processing cost, which would be sad. Do you see what I’m saying? You’re bummed when you have to make the problem happen again in order to troubleshoot it.

Chris: Yeah. Part of this, it comes down to the art and science of just how when you log and event and what you emit with it, part of this just comes with experience. Thinking back for the projects that I’ve worked on, I think it’s been a long time for me where I don’t have the log information I need in production because it wasn’t just being emitted. Part of it is being bitten by that a few times. You can understand, there’s a very specific use for debug verbose output and it really is more like this is the stuff that you need to get the code complete. But then after that, it’s probably not too terribly useful type thing.

Jon: I think here’s the most common example of this. When you’re doing your debug level logging, a lot of times you log entire SQL statements and then you try not to do that in production, at least. That’s why it’s used to be 2008 through 2010 or 2012 Rails applications. But now, how many times are you trying to troubleshoot something like, “I wish I could see the SQL so that I could see if I’m troubleshooting a performance problem or something, trying to find a deadlock in the database or something. I sure wish I could look at that SQL and crap, I can’t because we don’t log it because that’s the best practice not to do that. Now, I have to go make it happen again with SQL logging turned on.” What are your thoughts on that.

Chris: I think it boils down to the specific situation. They may very well be some situations where, for whatever reason, you’re building a complicated SQL statement at runtime and it’s just super complicated, involving local tables and joins and everything else. All that said, for me personally for those more complicated things, I might actually include that as part of the info level logging because I might like, “This is really hairy and once in a while there’s a problem with it,” and I can always rebuild that SQL statement by just running locally on my machine and see what it is if I want, but it’s a bit of a hassle. Maybe I’ll output it as part of my info level log, just to have that additional context make it easier. Now I can actually see what that complicated SQL query look like after being built up with the runtime inputs. Obviously, you don’t need that for everything. There’s a bunch of SQL logging that you never want to see because they’re just so terribly boring.

Jon: Yeah, just a thought. It just occurred to me because there’s certainly been times when I’ve been troubleshooting a problem and it’s like, “Better turn on SQL logging so I can see what’s going on with the database. It’s got to be somewhere in these SQL statements if there’s a problem.”

I think we’ve got a good start. We’ve been talking about logging, we’ve been talking about formats, we’ve been talking about log levels, and we have another line here in our outline about what and when to emit. I think that’s what we were just talking about, unless there’s some more you wanted to add to that before we wrap up?

Chris: That’s actually a pretty interesting point and I think it makes sense to spend a bit more time diving into that one because it’s definitely one of the nuances and one of the areas that I see developers making a lot of mistakes, so I think we’ll save that for next time and use that to pick up.

Jon: Alright, sounds good. It’s great talking to you about this today. Thanks.

Chris: Yeah, awesome. Thanks, Jon. See you later.

Rich: Well dear listener, you made it to the end. We appreciate your time and invite you to continue the conversation with us online. This episode, along with show notes and other valuable resources is available at mobycast.fm/30. If you have any questions or additional insights, we encourage you to leave us a comment there. Thank you and we’ll see you again next week.

The Docker Transition Checklist

19 steps to better prepare you & your engineering team for migration to containers

30. Logging A-to-Z for Containerized Microservices