August 15, 2018

23. Case Study: Troubleshooting Container Disk Space

Show Notes
Transcription
Discussion

Alarms were blaring, things were not working like they should, and developers were scratching their heads over what was going on. A reporting application started throwing lots of errors and prevented people from using the tool. So, when in doubt, do the typical tech support move: Turn it off and then back on, which seemed to fix it. Chris Hickman and Jon Christensen of Kelsus address this real-life production outage case study that involves troubleshooting container disk space. Why did the errors occur? What needs to be done? How do you make sure it doesn’t happen again?

Some of the highlights of the show include:

Figure out how and why it returned to a good state; or it will come back to bite you again
Error Messages: People tend to not read, register, or realize how critical they are
Good error messages save you hours of troubleshooting time
EROFS was actual error code in this case, which indicates a read-only file system issue
Process of Elimination: Permissions weren’t changed, containers have read-only file systems, and writing is not done to read-only files
Came to the conclusion that it was a disk space problem; application was running under Docker in a Docker container on Amazon ECS with an optimized AMI
Looked at machine to determine if out of disk space by performing df command
Application was writing to local file system as backing storage for cache entries
Common “gotcha” when running containers on Amazon ECS is to know what default ECS optimized AMI is doing; attaches two EBS volumes to the host
The disk space it ran out of is basically disc space that the container was trying to write to in it’s own local file system – where is that being backed by?
Check Docker daemon logs and messages on host; also try Docker info command
Something pushed application over its limit and could do no more writes; container could not grow any bigger because of no disk space being available
Treat containers as immutable and stateless to reduce mystery of errors
Running out of disk space can be related to initiation, image size, volume mounts, or containers growing; clean up temp lifes and bound cache
There are no out-of-the-box Cloud watch alarms for disk space

Links and Resources:

Rich: In episode 23 of Mobycast, Jon and Chris will dive into another Kelsus case study. This time, we chat about troubleshooting container disk space. Welcome to Mobycast, a weekly conversation about containerization, Docker, and modern software deployment. Let’s jump right in.

Jon: Here we go, another Mobycast. Welcome producer, Rich and welcome, Chris.

Rich: Hey.

Chris: Hi Jon. Hi Rich.

Jon: How’s it going today, Rich?

Rich: It’s going pretty good. Been fighting some Facebook ads and trying to figure out why some of the stuff that we’ve been working on is becoming unpredictable. I’ve just been knee-deep in that.

Jon: I think you need more nukes in the cauldron. And how about you, Chris? What are you up to?

Chris: Not much, just normal week, heads down working on some client projects and again, the usual of some big releases, new features and what not, getting those tested, and deployed to production.

Jon: Cool and as for me this week, I was supposed to be in Michigan this week, but I had a back injury that made me feel it wasn’t going to be a good idea to spend all day in a plane and in cars, so stayed home. If you hear pain in my voice, that’s where that’s from.

The other thing that’s fun that’s going on this week—I’ll just put this out there into the world—is that we’re trying to get a spot to talk at re:Invent. By putting it out there into the world, I hope that you are as hopeful as we are that we get that opportunity to tell one of our stories at that big conference. I’ve been working on that with our AWS partner a little bit this week.

This week we’re going to be discussing—these are our favorite kinds of episodes—we’re going to be discussing a real-life production outage type of situation. I can’t remember if it was an outage but it was at least alarms blaring and things not working the way we wanted them to type of situation.

Chris, if you would just set the stage, tell us how the weather was, what was happening, and how you found out about this? Tell us how we got into it.

Chris: Sure, yeah. It was a Tuesday morning, I think. There was a light drizzle of rain. Really what happened was, it’s what typically happens in these kinds of cases where thing were working fine and then all of a sudden they’re not. Come in in the morning and see that one of the applications that were used for doing some reporting on some of our applications just started throwing lots of errors.

We saw that in our alerting systems and obviously something that we need to go into and figure out why. It was caught pretty early in the morning by some of our developers and they were scratching their heads over what was going on. The short term fix was really turn off and turn it back on again. Kill the existing containers that were hosting that application and restart them. When they restarted them, the errors went away.

Jon: I think that also for stories like this, it’s just kind of to know, so was this something that was invisible to our real users or real users having problems and we were kind of like, “Uh-oh, we’ve got to help with the users out there that are seeing errors or outages or something.”

Rich: Yeah, I know. This was a reporting application for internal users and they definitely were seeing this as application failures that was preventing them from actually using the tool. Not only were our alert systems complaining, we are also getting real complaints from actual users that were trying to use it as just part of their normal day-to-day business.

Jon: It’s totally unsurprising with the reporting systems. That’s the first thing people do when they show up for work, when their job is to keep track of things like the reports.

Rich: Yup, absolutely.

Jon: They did this typical tech support move, let’s turn it off and turn it back on. What happened from there?

Rich: That fixed things, errors stopped, everyone was happy again, and the developers filed this away as, “Okay, I need to figure out what happened, but it’s now working again.” That’s about the time that I came online, saw the errors, saw what happened, and started asking some questions. That’s where we’re at now.

It’s like, “Okay, let’s figure out why did these errors start happening? What do we need to do to make sure it doesn’t happen again,” because I’m definitely a huge believer in you do not want to have any mysteries whatsoever in your software. If something happens, just getting it back into a good state, if you’re not sure exactly how you got it back into a good state or why it’s back in a good state, then you can’t be happy with that. It will come back to bite you again. So it’s really important to chase these things down, find out exactly what’s going on so that you understand it and make sure that it doesn’t happen again.

Jon: I see you always make sure you understand what errors are all about and raise to you error messages themselves are. It’s just been my experience that people have a tendency to not really read them, or not really register them, or not realize how critical they are. It’s surprising because as a developer, you often put error messages into your code and you then therefore know that the error message that you just put into your code can only be shown if something very, very specific happens in your code, like that specific line of code has to be executed.

I would stand to reason that as developers when we troubleshoot, if we see an error message, that’s the thing. Get to know the error messages, make sure you understand them. It’s common that an error message may be a bit of a rare red herring, but it’s very uncommon that it has nothing to do with what’s going on. That’s myself when I talk about error messages.

Chris: Error messages are pretty wonderful. They help you. Just good error message is just a software development practice, it will really save you so many hours of troubleshooting time, to be able to narrow down where exactly is this failing and why, so absolutely.

Jon: Did we have an error message?

Chris: We were seeing error messages. Maybe to mop up a bit, again it’s a reporting application, it’s actually a Ruby on Rails app. It’s a mixture of both serving up the front-end client, as well as the backend API calls and making connections to databases and what not. As part of that Ruby application, it also had some caching features going on in it in order to take advantage of frequent lookups of data and for performance reasons.

Jon: Is it the one I have here? EROFS?

Chris: Yes, EROFS. It was the error code. We kind of like see, “Oh, it’s got this error code EROFS which indicates that it’s a read-only file system.” Error read-only file system is what that code stands for and then it had a path for the file that it was trying to access.

Jon: I said right because that’s how I think about error messages. I think, “Oh, read-only file system. We’re having an error about it. Somebody must be trying to do something other than read from it, trying to write through it, or edit it, or delete it.”

Chris: Yup and then we see a path. It’s like /apps/temp/cache/ and then a bunch of other codes and digits and whatnot. That was the error that was popping up, so obviously not necessarily an error code generated by us, by the developer, but definitely an error code from an API call, from a system call that keeps bubbling up.

This gives us a pretty big hint on what’s going on just by reading that error. As you said, we have this error code that it says, “Read-only file system.” We have this word cache in this file path. We also noticed that this file path, it’s definitely local to the container and so we can surmise that this is probably a disk space issue. It’s either a disk space issue or for some reason, some other reason the file system is now read-only.

Jon: Yeah. I guess that’s where my mind is like. It wasn’t happening, it wasn’t happening, it wasn’t happening. What suddenly changed to cause something to start writing to something that’s claiming that it’s read-only?

Chris: Maybe it could be a permissions thing. You could imagine if you went in, just took away write permissions to that particular directory or something like that, then you’d probably see something like this as well. But that’s probably unlikely to in this case. That’s one of the things we know. We know that our machines for the most part, they’re like cattle not pets where we’re not going in there showing in the machines and doing the administrative commands on and actually doing anything with app maintenance on a machine.

To have someone go in or something change permissions on a file or directory, that seems a little unlikely. The more likely path was, “Okay, maybe this is a disk space issue, so let’s look into that.”

At that point, that’s where it gets a little different, a little weird. This Rails application is running under Docker in a Docker container on Amazon ECS. It’s running on a machine that’s using the Amazon ECS-optimized AMI. It’s running in that kind of environment and that has a particular setup as per the way that Amazon has created that AMI and optimized it.

One of the first things you do is you can go like, “Let’s go look at that machine. Let’s go see the disk space.” You could do something like you could shell into that machine and then do a df command.

Jon: Just to re-explain what you just said to make sure I understand, we’ve got containers running, they’re on an ECS instance, meaning we have an EC2 machine in our ECS cluster and that EC2 machine, the kind of machine it is, is called this ECS-optimized AMI, which is a mouthful, but it is a special AMI, special image just for ECS machine. You’re going to go in there and you want to take a look at the containers running on that machine or just take a look at the machine itself?

Chris: Yeah. The first thing will just be like, “Hey, let’s go look and see. Are we out of disk space? Let’s get on that machine.” We can see what cluster host this particular app was running on, was assigned to. Show me that machine and we can do a df.

Jon: I just have to say that logically, there is a bit of a jump for me in my mind. I’m sorry to sidetrack the conversation, but you going from a read-only file system to disk space issue, to me belies some previous experience because my mind doesn’t go there.

I think if it were me just in my limited experience, I would have been digging around and trying to figure out what writes to this directory? What is that thing that writes to this directory? How does it write to it? When does it write to it? What is it doing? I probably would have to spend some time just learning what it is writing to this cache directory and figure out why that could possibly be a problem. It seems you were able to skip right to thinking about disk space stuff.

Chris: Definitely a part of that might be experience and just knowing that that error just by itself, if it had been maybe in a different environment, then absolutely it would be a bit more of a wildcard but given that we’re running containers and the file path here was local to the container. These containers are meant to be immutable, so they just come and go as they please. They really shouldn’t have much in the way of state, if any.

If it was a permissions thing, the read-only file system is internal to the container. It’s not really on the host itself. The fact that these things spin up and spin down, and come and go, that again leads more towards there’s something else about the environment on which this thing is running and is causing this. Then again it’s probably not a permissions thing or something like that.

The error is a little weird. Obviously, it would be much clearer if it just said, “Out of disk space.” But this is how it manifested.

Jon: If I were to follow your logic—I’m really trying to connect the dots here for anybody that’s listening—what I’m hearing is basically, in your mind you’re like, “Okay, I know that nothing is changing permissions anywhere. I know that these containers do have read-only file systems on them. I also know that we generally don’t try to write anything to those read-only file systems, so maybe something is happening that’s causing the container to have no other place to write but still try to write to that read-only file system.” Maybe that’s what’s going on your mind.

Chris: Yup, absolutely and maybe just another nuance to keep in mind a bit. The file systems themselves that these containers write to, they’re not necessarily read-only. This error popped-up read-only once it ran out of disk space.

This particular app, it was writing to its local file system and because its Rails code feature for implementing a cache, it’s essentially using a file as a backing storage for the cache entries. As it caches items, it’s now basically writing to a file and that file happens to live within a container at that /apps/temp/cache space. Think of it as some scratch space that it’s writing to. The error that eventually happened was, it can no longer write to that and why did that happen?

Again, looking at that based upon previous experience and issues and whatnot, let’s look at disk space, again if you look at the free space on the host on which that container is running, you’ll notice that actually there’s plenty of disk space, I think on that particular machine, there’s probably only 30% disk space that’s being utilized, so if you do a df, you’ll see there’s quite a bit of disk space here, like, “What’s going on?”

That’s where it gets the easiest optimized AMI, the way it is configured comes into play. This is a common “got you” stuff for anyone’s that’s running containers on Amazon ECS, it’s really important to know what that default optimized AMI is doing. It attaches to EBS volumes to your host.

Jon: It’s been a bit. EBS is what again?

Chris: Elastic Block Service.

Jon: Okay, great.

Chris: For all intents and purposes, you can think about just a way of attaching a disk to a virtual machine type of thing. There’s two of those volumes that are attached to these machines. One is the normal one that we’re all used to and that’s an 8 gig volume and that’s the volume used for the OS, basically for the virtual machine itself. All the code for the OS and normal stuff. that’s what you’ll see actually if you do the normal df command. you’re seeing that volume. When you see that, “Hey, there’s 30%,” it’s only using 30%, it’s 30% of that 8 gig volume.

There’s a second volume and that volume is at least 22 GB in size. You can change it and make it bigger if you want, but that is the volume that is used for Docker, for all of Docker’s image and metadata storage. That one is configured as an LVM device—LVM is Logical Volume Management—technology and it’s done at a different level. They’re not mounted, so you’re not going to see them with the normal commands like df. You’re going to have to look using some other tools.

Before getting too much deeper into that, the reason why this is important is because of this particular nuance here that the Rails application, the disk space that ran out of is basically disk space that the container was trying to write to in its own local file system. We have to ask ourselves, where is that being backed by? Because it’s a Docker container, this is managed by Docker itself. It’s where Docker is using its image metadata storage. It turns out that again on this ECS-optimized AMI, that’s on the LVM thing.

That’s why just doing a df and seeing that you have disk space available, well you may not have disk space actually because this is a Docker container that ran out of disk space. You need to look at the Docker storage and see if they use other techniques to see that.

Jon: That makes sense. Then I assume that’s what you did?

Rich: Yes, and there’s various levels of detail that you can go into here. Some quick things to do like just in general when things go wrong with your containers running. You can look at the Docker daemon logs as one way of just doing the first pass. Those are typically at /var/log/docker and you can go and tail that log and see if there’s errors in there.

In this particular case there were errors being thrown in there indicating that it has a problem with disk space. That’s a big clue.

You can also do things on the host. You can use things like dmesg to see critical system messages. In this case, that was actually throwing a lot of errors as well, indicating problems with one of the volumes.

Another very quick easy thing to do is the docker info command. The docker info command will show you total disk space and free space before the docker storage and the metadata. That’s another very easy-to-use tool and you don’t have to get too deep in some of the other stuff that’s LVM-specific. Just a combination of that stuff would’ve definitely indicated that we are out of Docker storage.

There’s a minimum thin pool size setting in Docker and I think the default on these machines is something around 2 gigs. Again, that volume is 22 gig in size so once we ate up more than 20 gigs of storage space, that’s when things started getting shut down. There’s just no more space here so you can’t write anymore. That’s what happen with this particular app. Something pushed it over that limit and at that point, basically, you couldn’t do anymore writes. The app was trying to write to its local file system for these cache updates and those were being rejected because the container cannot grow any bigger because there’s no Docker disk space.

Jon: If anyone is listening that is from the Rails world and has used Heroku, this story reminds me of a truth of easy Heroku which is that, it from the start in your Rails application that never lets you write locally and it will just kick you out. It will just basically throw a big old errors at you if you try to write to local disk. You have to set up other things if you want to have file attachments and other stuff that you’re writing. It’s an interesting thing where that very handheld Platform as a Service would have caught this and here, we’ve got to be more aware of ourselves of not putting state into our containers.

Chris: Yup, absolutely. In general, you want to treat your containers as immutable and stateless. Part of this is to reduce that mystery of things just randomly starting to have errors. As long as you have a machine that you’re configuring and you’re using, you have to have some amount of disk space and you’re always going to come to some point where you may run out.

There’s two ways to run out of disk space here. One is just upon initiation. There’s a certain amount of disk space that Docker’s going to need to run a container. There’s the size of the image itself that factors into that, whether or not it has volumes, volume amounts, and if there’s not enough disk to accommodate that, then it’s going to fail on startup. That’s a better failure. We’re going to notice that and we can take some action to it versus in this particular case.

What’s happening is, you can imagine maybe you have six containers running on a machine. You start up the seventh one. It starts up just fine. You know it’s running for a few days, then it starts failing. Nothing else change on that machine other than one of those containers was just continuing to grow. It’s making these file system writes and that just expands that size of the container on the Docker storage allocation.

Jon: It could have been one of the seven because we’re not controlling container disk space by container. They all have excess of about 20 gigs.

Chris: Exactly, and now it can be any one of the seven. It also affects all of the seven. If any of those other ones write into /tmp or something like that, they are going to be affected by this as well because it’s going to basically make it so none of those can do any of these writes.

So it’s important to keep that in mind. Obviously, you don’t have to be super draconian like there is useful scenarios for having some scratch base, but definitely clean up after yourself. Just don’t let it grow unbounded. If you’re using temp files, make sure you clean them up, or if it’s a cache, make sure it’s bounded. Don’t let it grow unbounded type thing.

Jon: I’ve got to say, it does feel like there’s a bit of a missing alert config, though. It seems like perhaps we would have an alert tell us if we would have run close at the end of the 8 gigs on the EC2 instance. Maybe it’s because there’s that 2 gigs that is sort of held aside for the Docker work and that felt like open disk space to the thing that’s monitoring disk space on that instance.

But here, what I’m saying if we’re running out of disk space, something else should probably be telling us that we’re getting close to running out of disk space, too.

Chris: Yeah. That’s a good point because out of the box there are no cloud watch alarms for disk space. So even for that 8 gig root volume that you can see with just df that’s mounted, that’s a custom cloud watch you have to make.

Jon: You tried yet Amazon?

Chris: It’s even harder for the Docker store because this is an unmounted LVM.

Jon: If AWS doesn’t provide it, maybe at some point maybe we would get around finding a custom watcher for that.

Chris: All the ATM companies out there, I think most of them now do support this. In the past, companies like New Relic and Datadog, they didn’t provide report, but I’m almost positive that they do now, that they now updated their agents be able to go, see into this, and to do that. So, it’s actually possible.

Jon: It’s a complex problem too because people attach just in different ways and they may even have things set up that grow the amount of storage that they have as needed. It’s not as simple check the box, do we look at disk space, but yeah, interesting.

Chris: Yeah and then based on other parts of troubleshooting this was then, once you figure out, “Okay, yeah, we can see that Docker is out of disk space,” then the next thing is to see, “Okay, who’s the culprit? Who’s actually doing this?” One is knowing where on the file system Docker is storing these files for all its space for the containers, images, and whatnot. By default, that’s /var/lib/docker.

If you go to /var/lib/docker, there’ll be directories for the various things like images and containers. If you go into the containers directory of that, you’ll then see container IDs for all the currently running containers and then that is the actual backing storage for that container. If you go in there and now do something like a du command—du shows the disk space usage on a per directory basis—you can now find out which one of those containers is using more storage than it seems reasonable.

In this particular case, we did that, looked that and it turns out, the one that was using the most storage space was actually the container that was hosting the ECS agent. The reason for that was the ECS agent has its own logging that going to stdout and stderr. When you log to stdout and stderr inside a Docker container, it’s got to go somewhere. It ends up getting sent to disk. That container grows.

The ECS agent is part of the host startup. Whenever the host is born essentially, that ECS agent is started and it just runs the entire time that that machine is up and running. It’s the longest lived running container on those machines always, and I think this particular machine had been up for about five weeks.

Over the span of five weeks, the verbose level of the ECS agent had caused its logs to end up being, like I think it was around 5 GB or 6 GB in size. It took up a significant chunk of the Docker disk space. That was useful to understand that which gave us an action items to go look at the ECS agent configuration and making sure that it’s not doing that, it’s not as verbose, or the very least, it’s not sending to stdout, stderr and growing the container.

Jon: Pretty cool. I’m glad that (a) the Microsoft technique worked of just restarting, and (b) that we figured out what to do long term and pretty interesting to listen to this, and it is super specific to ECS, but we love ECS, so it’s good to talk about it in detail. I think we should wind it up there, unless you have anything else you wanted to add, Rich or Chris?

Chris: Nope. Not at all.

Rich: Nope.

Jon: Great. Thank you so much. Another good week. We’ll talk to you next week.

Chris: See you guys. Take care.

Rich: Well dear listener, you made it to the end. We appreciate your time and invite you to continue the conversation with us online. This episode, along with the show notes and other valuable resources is available at mobycast.fm/23. If you have any question or additional insights, we encourage you to leave us a comment there. Thank you and we’ll see you again next week.

The Docker Transition Checklist

19 steps to better prepare you & your engineering team for migration to containers

23. Case Study: Troubleshooting Container Disk Space