32. Logging A-to-Z for Containerized Microservices (Part 3)
Chris Hickman and Jon Christensen of Kelsus and Rich Staats from Secret Stache conclude their series on application logging. In this episode, they discuss how Kelsus leverages Sumo Logic to manage project logs.
Some of the highlights of the show include:
- Kelsus’ application logs are dumped to files; volume mounts move these files out of the container to the host, where they are ingested into a tool or platform to be analyzed
- Kelsus uses Sumo Logic, but there are several log services on the market
- Logs start to pile up because each containerized microservice outputs them, so they’re moved to Sumo Logic to review, report, and generate events or alerts
- Pros and cons of API calls that ship log events directly to the service, bypassing the log files
- Sumo Logic provides a collector, a piece of code that runs on a host; you tell it what part of the file system to look at and what kind of files to watch
- The collector registers for file system change events; when files are updated, new lines are pulled out and shipped
- With logs and events, consider namespacing to separate each microservice; what is the file name? What’s the location?
- Search symptoms and metadata associated with each log event to narrow down issues
- Namespacing has nothing to do with the physical architecture of the system, such as what machines and containers are running
- Kelsus usually has three, separate environments running in the Cloud: 1) Development, 2) Staging, and 3) Production; same microservice can run in all three environments
- Logs can be found via namespace, microservice name, metadata properties, and host
- Kelsus uses the JSON format; each log event is similar across each service and has a defined structure with same high-level properties and sub-objects
- Kelsus has different log levels, including debug, info, warning, error, and critical; Sumo Logic has a robust query language with a tail feature to find and search logs quickly
- Database vs. Log Querying: Rare for Kelsus to have to go to its database to see what’s happening; usually, its logs give all the footprints needed to trace the issue’s journey
Links and Resources
Rich: In Episode 32 of Mobycast, we finished our series on application logging. In particular, we discuss how Kelsus leverages Sumo Logic to manage the logs of their projects. Welcome to Mobycast, a weekly conversation about containerization, Docker and modern software deployment. Let’s jump right in.
Rich: Hey, guys. How’s it going?
Jon: Good. As usual, I always want to know what you’ve been up to. What have you been up to this week, Rich?
Rich: We’re bringing on a new contractor for a trial period to do some more administrative assistant-type stuff so I’ve been working with him a little bit to get him on board on the entire system and building out processes so fun stuff. It’s pretty boring but it’s necessary because that means that, hopefully, I’ll be doing a few less recurring things week to week so I’m excited to see that happen.
Jon: That does sound pretty good. Congrats on the growth.
Rich: Yeah, thank you.
Jon: How about you, Chris?
Chris: I have been busy getting over a nasty bug and so I’m happy to be back to normal health and then, also, it’s been pretty exciting. We’ve hired a new project manager here at Kelsus so I’ve been spending a lot of time with him and ramping him up and just getting him up-to-speed on all the projects that we’ve got going on here.
Jon: Great. Yeah. As for myself, also busy. I’ve been trying to compress a week of work into three days because I’m heading out of town tomorrow and Friday to do a little surfing in Baja. It’s a little vacation that’s exciting for me. Two weeks now, we’ve been talking about logging. There’s just so much to cover and we’ve tried to be pretty in-depth and discussed not just best practices but how we do things at Kelsus. That comes from not just decisions we made at Kelsus but from our entire career so writing software and working with logs.
I think we just wrapped that up this week. Last week, we went down in the weeds talking about exactly how our tool, Sumo Logic, gets logs off of containers and over to itself. Maybe you could just recap that at a high level and then we’ll spend the rest of the episode just talking more about Sumo Logic and how we use it. I think even if you don’t use Sumo Logic, it’ll be just a good discussion of what you should expect out of your log aggregation tool, and if it doesn’t have some of these features, then you’ll know you haven’t chosen a good one. If you have already chosen one and you’re not using some of these features, then maybe this will help you dig in and know more about your own log collection software.
Chris, go ahead. Remind us how we get the logs off the containers and into Sumo Logic.
Chris: Great. Where we last left off, we have these application logs being dumped to files. We’ve done volume mounts to create a way for these files to leave the container and go to the host. Now, they’re sitting on the host file system, and we have to do something with them: get them ingested into an actual tool or platform that will allow us to analyze and search those logs, since we definitely can’t do it manually by SSHing into the machine and hand-reading a file, if you will.
Again, we use Sumo Logic. There are tons of log services out there to use, things like Loggly or Splunk. There’s Logstash. There might literally be a hundred different solutions out there, and a lot of them do a pretty good job, especially the more popular ones, but we use Sumo Logic. The goal then becomes: you have all these logs and they’re piling up. Each one of our containerized microservices is outputting these files. How do we now get those over to Sumo Logic in a sane, namespaced way so that we can look through those logs, understand them, and do things like reporting and generating events or alerts when we have to?
There are many different options for folks using something like Sumo Logic to do that. In your application itself, almost all these services have API calls where you can directly ship those log events to the service. You can even just bypass the log files if you want. There are some pros to that, but I think there are a lot of cons: you take on a dependency, what happens if your connection goes bad, and what’s the latency of doing that?
You can do things like fire and forget, which basically means you don’t wait for the call to be successful; you just continue on, but I definitely would not recommend that technique. Let’s maybe just get into what we do. Sumo Logic provides an agent, a collector. Really, all it is is just a piece of code that runs on a host. You tell it what part of the file system to look at and what kind of files to watch, so you give it a regex: “These are the names of the log files, or the patterns that will match.”
The collector registers for file system change events, so whenever those files are updated, it reads the new lines that have been added to the file, pulls them out, marshals them up, ships them over the wire to the Sumo Logic service, and then goes back into listening mode. We set up one of these collector agents on each one of the hosts inside of our ECS cluster, so we don’t do one per container or one per microservice. It’s not a sidecar. We actually have it set up so there’s just one per EC2 host, because that’s really all you need.
Jon: Right. The logs are making their way over. We trust that it’s working, that those log collectors are running and not forgetting any logs or we’re getting all of our logs into Sumo Logic. Once they’re in there, I guess what do we do with them? How do we look at them? Why do we look at them? When do we look at them?
Chris: One of the first considerations that comes into play is namespacing. You may have a handful, you may have tens, you may have hundreds of microservices running, and they’re all outputting their logs. You’re collecting all of these with your agent and sending them over, and then there’s just a stream of thousands and hundreds of thousands and millions of log events. You need to make sense of that. One of the very first things you’d want to do is some namespacing, to at least separate each microservice so that you can see the events for Microservice A or Microservice B. That’s one thing you want to make sure you do.
Jon: That’s sort of the equivalent of, in the old days, going to log onto the machine that has that microservice running, going into the directory where the logs are getting created, and looking at them.
Chris: Exactly. Again, it’s going to depend on your log service, like how you do that namespacing. With Sumo Logic, we can actually leverage exactly that: what is the file name, and what is the location it’s been written to on the host. That ends up making it pretty easy to namespace on a per-application basis.
Jon: Quick question on the namespacing along those same lines: Do you namespace all the way down to, for example, a service that’s running, so that I can tell the difference between Service A and Service B that might be running either on the same instance or on different instances? So that I can see different containers and their outputs and tell them apart. Do you have that namespace level, I guess, is my question, or do you retrieve that somewhere else, because we do need it?
Chris: Are you referring to this: I may have three instances of my microservice, an ecommerce microservice, stuff like that, and, for availability and redundancy, I have three separate containers running the same code, load balanced.
Jon: That’s my job on this podcast, making it more understandable. Thank you.
Chris: You’re asking, “Okay, can I actually go–I want to look at Container A versus Container B versus Container C.”
Jon: Yeah, exactly, because maybe only one of your container instances is having some sort of problem and you want to see that one.
Chris: There are a few things. Usually, you don’t need to do that; you just want to look at it on an application basis. I would say, go look at all the logs for my ecommerce microservice, and then search on the symptoms, whatever it is that’s wrong. Maybe the performance is really slow, so go look at all requests that took longer than a second, more than a thousand milliseconds, or go look at all the ones that are returning a 500 error or emitting warning-level log events, something like that.
You can use that to definitely narrow things down. The other thing that’s going on, too, is that you have metadata associated with each log event that’s being ingested, things like host ID, and that’s basically the container ID it comes up as. If you wanted to, it’s actually very easy to say, “Go get me all the log events for this container,” so if you wanted to look at that, you could. I’ve never had a use case where that was something I needed to do.
Jon: At the end of the day, the answer is no. The namespacing may go down to the microservice level or some other logical level, but it has nothing to do with the physical architecture of the system, right? “Physical architecture of the system” being my shorthand for what machines and what containers are actually running. You leave that out of your namespacing.
Chris: Right. You can think of this as telemetry that has a bunch of different properties associated with it, and then you can slice it and dice it every way. I’m using the term ‘name-spacing’ here a little loosely because the truth is, drilling down to the events for a particular service, you can really think of this as, again, more metadata on the log events themselves. True name-spacing, we do use that and we use that to separate out events between different environments.
Here at Kelsus, we typically have three separate environments running in the cloud: one for development, where very unstable code lives; this is really just for the developers to do their integration testing. Then we have a staging environment, which is really a pre-prod environment. This is where stable code should be, where we can show off and demo new features before they’re ready to go to prod and get the final blessing. Then, of course, we have the prod environment, and that’s where the real production code lives and breathes.
You can have, obviously, the same microservice running in all three environments, and it’s pretty important to be able to say, “I only want to look at the events in development,” or, “I only want to alarm on the events in prod; I wouldn’t want to alarm on the events in dev.” That would probably be way too chatty and not terribly useful. There is a namespacing feature in Sumo Logic where we can basically just attach this namespace to these events.
The environment, for us, is usually specified via an environment variable on the host, and so we’ll configure the agent as it starts up, as part of its configuration, to say, “Hey, you’re actually a collector for staging, so you’ll prefix the staging environment’s namespace onto every one of the log events that you’re ingesting, regardless of what service it is.” That way, we have this further partitioning of our log events.
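That environment-prefix step can be illustrated with a small sketch. In practice this happens in the collector’s own configuration rather than in application code, and the variable and field names here are made up for the example:

```python
import os

def tag_with_namespace(event, env_var="DEPLOY_ENV", default="development"):
    """Prefix a log event's source category with the environment the
    collector was started in, e.g. 'staging/ecommerce'."""
    environment = os.environ.get(env_var, default)
    tagged = dict(event)  # copy so the original event is left untouched
    tagged["sourceCategory"] = f"{environment}/{event['service']}"
    return tagged
```

Every event the collector ships would then carry its environment, so queries and alerts can target prod without being flooded by dev noise.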
Jon: Got it. It makes total sense. Cool. Now, we can find the logs that we’re looking for. We can find them by all kinds of different things, by the namespace, by the micro-service name, by the host, et cetera. Tell me about looking at them within Sumo Logic.
Chris: Like I said, there are tons of metadata properties in here, a lot of things you get for free: timestamps, the hostname, the IP address where these things came from. You have all those properties that you can go and search on. If you wanted to say, “Show me all the events or logs from my ecommerce microservice in the last 15 minutes,” it’s really pretty easy to do. If I wanted to see a specific IP or a specific host, I could do that.
Beyond that, it now becomes, “Okay, how do you structure your log events? What is the schema associated with them?” Again, we talked about this in one of the earlier episodes. We definitely like the JSON format for our logs and have a minimal schema associated with it. Each one of these log events is really similar across each service: it has a defined structure and, at the high level, it’s got the same high-level properties and sub-objects in it.
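A minimal event of that shape, a shared top-level structure that every service emits plus a sub-object for service-specific fields, might look like this (the field names are illustrative, not Kelsus’ actual schema):

```python
import json
from datetime import datetime, timezone

def make_log_event(service, level, message, **extra):
    """Build a structured JSON log event: a shared top-level schema that
    every service uses, with service-specific fields under 'context'."""
    return json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "service": service,
        "level": level,
        "message": message,
        "context": extra,  # application-specific additions per service
    })
```

Because every service shares the same top-level keys, queries like “all warning-level events” work uniformly, while each service is still free to add its own fields.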
Then, each one of those services can amend that to do application-specific stuff that might be more interesting for each one of them. Again, we have different levels that we log at; we talked about those: debug, info, warning, error, critical, that kind of thing. So if I wanted to go see all the warning-level events in the logs, that becomes really easy to search on. There’s a very robust query language that all these log services have, and Sumo Logic is no different.
It’s really, really powerful. It’s very similar to SQL but even easier, or, if anyone has used the jq tool, it’s really pretty easy to build these. You’re basically just doing pipeline filtering; you can chain these things together and pipe the results of one stage to the next. You can say, “Go filter for all warning events,” and then, “I want only those warning events that took longer than two seconds,” and then, “I only want those warning events for this particular user,” and you can drill down this way.
Then you can say, “I want to see the top five events that took the longest to run,” or something like that. It’s very, very powerful, and there are so many different ways you can slice and dice it to really narrow down what it is that you want to find, but it’s really going to depend on the information that you’re logging. If it’s not structured, if it’s just a bunch of printf output, essentially, then it’s going to be really difficult to make sense of it.
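The drill-down Chris describes, filter for warnings, then those over two seconds, then rank by duration, can be mimicked in a few lines over structured events. This is a sketch of the pipeline idea only, not Sumo Logic’s actual query syntax, and the sample events are invented:

```python
# Sample structured events, as they might arrive from the collector.
events = [
    {"level": "warning", "user": "alice", "durationMs": 2500},
    {"level": "info",    "user": "bob",   "durationMs": 120},
    {"level": "warning", "user": "carol", "durationMs": 900},
    {"level": "warning", "user": "alice", "durationMs": 4100},
]

def query(stream, *stages):
    """Pipe a stream of events through successive filter/transform stages,
    like chaining clauses in a log query language."""
    for stage in stages:
        stream = stage(stream)
    return list(stream)

# "Filter for all warning events, then only those that took longer than
#  two seconds, then rank them by how long they took."
slow_warnings = query(
    events,
    lambda evs: (e for e in evs if e["level"] == "warning"),
    lambda evs: (e for e in evs if e["durationMs"] > 2000),
    lambda evs: sorted(evs, key=lambda e: e["durationMs"], reverse=True),
)
```

Each lambda plays the role of one piped clause; as Chris notes, none of this works if the events are unstructured text rather than fields you can filter on.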
Jon: Right. One thing that occurred to me while you were describing how powerful the query language is, and how well you can filter what you’re looking at, was, “Can you do that live?” Essentially the same as tailing a log. Can you say, “All right, I want to see these as they come in, but only show me things that have this user,” or, “Only show me things that come from this particular container instance”? Can you do that and have it be essentially like a tail?
Chris: Sumo Logic definitely has a tail feature.
Jon: Is it filterable with the same query language?
Chris: I don’t know. It’s never been something that I needed to do. In the past, they definitely did not have that functionality. I wouldn’t be surprised if they do now. It may very well be possible, but, at the very least, you can absolutely use the tail feature where, basically, you’re on their website looking at the stream and it’s just auto-updating for you. As the log events get collected and shipped over to Sumo Logic, they’ll be refreshed and popped up on your screen. Actually, I totally will amend that: that tail is based upon a query that you build, so you can say, “Here’s the query; I’m going to go look for all the events with this user,” and then you can say, “Now, I want to do a live tail,” and it will filter on that.
Jon: Yeah, that’s great. Hopefully, you’re not in a situation where you have to be monitoring logs as they happen, but it’s great that you can. Cool. Another thing I’ve noticed that we do, which is a little bit different from my own past experience working on enterprise apps as more of a developer, is that we look to the logs rather than the database for a lot of the information that we want. In my past experience, if there was some sort of to-do list and you were looking to see if somebody checked something off, you would immediately go to the database to see if and when they checked it off, because that would be in the database, but I’ve noticed that we often go to our logs for that kind of information. Can you talk about searching the logs and what we use them for, which may have changed a little bit since the advent of these kinds of services?
Chris: Help me understand the situation a little bit more. You’re looking in the database to see whether a particular thing happened?
Jon: No, whether a user actually did something. The cases I’ve seen have been like, “Hey, so-and-so filled out this audit and had a hard time submitting it,” and, in my past, that would often involve some database querying. That’s the kind of request we see when we troubleshoot for one of our big clients. What is the Sumo Logic method?
Chris: Absolutely. I do not want any of my developers touching the production database. There’s so much that can go wrong with that. A big strategy here is like, “Yeah, your log events are super, super powerful, super critical for troubleshooting, debugging, understanding what your system’s doing so, as you’re writing your code and as you choose what it is that you want to log and what information goes with that log event, be very careful about that, be very thoughtful about that and include the information that you need so that it’s all right there for you.”
That’s exactly what we do. It’s very rare where we actually have to go and take a peek at the actual database to see what’s going on. Usually, our logs will give us all the footprints that we need to go trace the trail back to what happened.
Jon: Cool. I think that’s a real transition. I think that even as recently as 2006/7/8, that was not how people were doing things.
Chris: No, absolutely not. At that point, log files were mostly just print statements, like “entering a function,” “leaving a function,” that type of thing. JSON wasn’t even widely adopted at that point.
Jon: We could even say the same for 2010/11/12; people were still not really doing this. Do you think it was Splunk that made this the reality of how software gets developed and how troubleshooting happens?
Chris: Yeah, I don’t know if it was Splunk. I think a lot of this is the natural consequence of how things are being architected and the complexity of them. You have things like cloud computing, and now you have N instances of the same thing running, so you really can’t SSH in; looking at logs on one machine doesn’t give you the whole picture. And there are also just too many log events. You have so much load going on that you just need something else to deal with it.
There have been a lot of tools out there in the open-source world and various ways of doing it, and then other companies saw the opportunity: people are doing this stuff, they’re reinventing the wheel, they’re doing it all in-house, it’s hard to get right, so this should be a service, this should be a company. And that’s what happened. The whole concept of understanding what your code is doing when it’s running in production, instrumenting it, monitoring it, and getting real-time insights into performance, that’s a whole big segment of the market. Folks like Splunk, New Relic, and Datadog are in it; there are just tons of companies in that space, and logging is definitely one part of that.
Jon: Cool. Excellent, I think that that is a great summary. That’s where we’ve landed and the future is more and more dependent on logging and taking advantage of the information in those logs. It’ll be interesting to see what happens.
Jon: All right. Thank you, Rich, and thank you, Chris.
Chris: Alright, have a good one. See you. Bye.
Rich: Dear listener, you made it to the end. We appreciate your time and we invite you to continue the conversation with us online. This episode, along with show notes and other valuable resources, is available at mobycast.fm/32. If you have any questions or additional insights, we encourage you to leave us a comment there. Thank you and we’ll see you again next week.