16. Troubleshooting Distributed Systems: A Case Study
Chris Hickman and Jon Christensen of Kelsus recently troubleshot a client’s distributed system with a containerized infrastructure. They discuss the thought process they went through and how they peeled back the layers of that infrastructure. Was it as simple as getting an alert about the problem and how to fix it? Maybe in an ideal world, but not a realistic one. The symptom appeared while deploying a new microservice to support a mobile application. During deployment and testing, they received alerts about failures. So, they knew when something was going wrong, but what was going on?
The highlights of the show include:
- Mobile app, proxy service, and backend service: Mobile app makes calls to the proxy service, and proxy service is making calls to the backend service
- Make sure everything is running before deployment to customers
- Troubleshoot: Understanding parts, what’s communicating with what, and all failure paths
- Teams may not see eye-to-eye regarding where the problem lies
- Deployment of proxy service, which serves as middleman, generates high error rates
- Possibility of timeouts occurring because of not enough computing power
- What are the dependencies and communication paths?
- Services are fronted by an elastic load balancer (ELB) with timeout logic
- ELB Types: Application load balancer, classic load balancer, and network load balancer
- Isolating problem using a curl against the backend service
- Enable verbose mode for curl to identify specific steps
- The ELB’s DNS record resolves to three IP addresses; the client picks one to connect to
- Problem is with the ELB, possibly because of networking, security, or permissions
- ELB had a subnet that did not have an Internet gateway
- Simple Fix: Correctly define ELB to use subnets that have Internet gateways to communicate with Internet traffic
- Mobile app did not have same issue because of architecture decision where there were two ELBs fronting the backend service
Rich: In episode 16 of Mobycast, Jon and Chris discuss how they recently troubleshot a client’s distributed system. Welcome to Mobycast, a weekly conversation about containerization, Docker, and modern software deployment. Let’s jump right in.
Jon: Welcome, Chris and Rich. It’s that time of the week for another Mobycast. I’m excited to talk today. How are you?
Rich: How’s it going?
Chris: Hey, Jon. Hey, Rich.
Jon: Rich, what have you been up to this week?
Jon: That sounds fun. What about you, Chris? What have you been up to?
Chris: I am busy packing. I am getting on a plane tomorrow, heading out to San Francisco for DockerCon 2018. It’s going to be fun. I’m going to spend three or four days out there—all things Docker and containers.
Jon: I’m picturing a graph of how much you’ll learn each day and it’s a step function as you run out of learning capacity and get more tired as the days go on.
Chris: Yeah. I need to stop by the store and get a few more RAM chips or something.
Jon: Yeah. As for me, this past weekend was the GoPro games in Vail. It used to be the Teva Games. It’s just amazing to see this event that went from like, “Oh, let’s have some weird sports,” several years ago to now, 100,000 people and booths everywhere (GoPro booths), climbing, slacklining. Just wall to wall entertainment for people. It’s craziness. It’s so interesting to see how big that’s gotten.
It made me have this realization that as we do these trainings, where we’re teaching people containerization with Docker and running on AWS, maybe we should start offering some of them up here. There are just so many things to do up here. I bet we could entice some people to come up to Colorado for some of these trainings.
Rich: That’s right. I love the GoPro games. That’s probably the piece that I miss the most about Vail.
Jon: They are so fun.
Today, we’re going to talk about troubleshooting. At Kelsus recently, we’ve had a couple of things come up that involve our containerized infrastructure. It wasn’t as simple as stepping through code or looking at a stack trace and saying, “Oh, this is what’s obviously wrong.” It involved some peeling back of layers and thinking about the infrastructure.
We thought, we’d talk about the thought process that we went through. Maybe we can just start by describing—tell us, what happened? When did it happen? What were people saying, Chris, when we realized we needed to do some troubleshooting?
My hope before you even answer that question, my hope was that we got a nice alert that told us exactly what the problem was and how to fix it. Is that what happened?
Jon: In the ideal world that’s what would’ve happened.
Chris: Sure. The symptom was deploying a new microservice to support a mobile application. Upon starting to test it, we were seeing failures in calls to a dependency. Those failures did show up as alerts coming through the whole system, through the logging system, and then being raised as alerts using Sumo Logic. We knew immediately when these things were happening, indeed. The big question was like, “Okay, what’s going on here?”
Jon: We’re getting some of the alerts and the service that we put in is just not––it’s timing out, did you say?
Chris: Yes. In the first version of the mobile app, it was going straight to the backend. There was no backing microservice for it. Instead, it was going straight to this other backend dependency controlled by another group, by another team. For various reasons, to add additional functionality for this mobile app, better traceability, debugging, troubleshooting, whatnot, there was a need for us to create a specific microservice for this mobile app to communicate with, through which some of the core calls that the existing app was making would be proxied.
Where we ended up having problems was with these calls being proxied through the new microservice to the backend dependency, which were the exact same calls that the existing mobile app was making.
Jon: We should probably give some things some names here. Otherwise, it’s going to be hard to talk about the dependency and the microservice and which one we are talking about. Let’s give everything a name.
We’ve got a mobile app, what should we call that? That one’s easy. Let’s just call it the mobile app because there isn’t another one. Then, we’ve got a microservice that’s sort of the proxy. Should we call it the proxy service? Then, backend dependency, let’s just call it the backend service. We have a mobile app, a proxy service, and a backend service.
You’ve said the mobile app is making calls to the proxy service. The proxy service, in turn, is making calls to the backend service? Is that right?
Chris: The mobile app is making these calls directly to the backend service.
Jon: Okay. Before, there was no proxy service, and now, there is. Aside from these alerts that are saying, “Hey, there’s some timeout,” is there anything that’s happening to users? Are users noticing anything because of these timeouts?
Chris: This is a brand new service, it’s not yet rolled out to customers. It’s just being tested. The only ones affected are just us doing our own test to make sure that everything is running before we go and deploy this and roll it out to all our customers.
Jon: Okay, got it. As we test, what are we seeing? Are we seeing the mobile app sitting there spinning, all the time, every time, sometimes? What’s happening to us at every test?
Chris: I don’t even think we’ve gotten that far. I think we’re just using some test tools, maybe scripts or curl, making these API calls against the proxy service to do the various bits of functionality that the mobile app needs to do. The end result is the same. These timeouts are happening in the backend app.
Jon: You just mentioned curl and you reminded me of a conversation that we had when this was going on. I think it’s worth repeating what you told me about how curl ended up being the proof that we needed.
Chris: Yeah. That kind of gets into a little bit of like, “Okay, how do you go and troubleshoot this?”
Chris: I don’t know if you want to. Maybe we can lead up to that and get to that and talk about just like, “Okay, how do you just go about?” These are the symptoms, you can talk a little bit about what happened? The path that people went down.
Stepping back, peeling back, what I really wanted to highlight about this is that this is a super common issue that I see time and time again. Distributed systems in the cloud, microservices, containers, networking issues: these are very complicated systems. You can’t Google your way to an answer, you can’t Stack Overflow your way to an answer.
You really have to have a good understanding of all the different parts in your system, what’s communicating with what, all the various paths where failures can happen to be able to troubleshoot this.
Time and time again, I see people just kind of getting stuck immediately out of the gate and kind of just making guesses about what’s wrong and then going and then just throwing things against the wall and seeing what sticks.
Jon: Right. I just wanted to add in, I guess the reason that I was going down the road of why we used curl was because the thing that I want to point out, the thing that makes this even more classic, and even more, as you say, a common type of problem that people have to deal with, is that there are two teams involved.
Whenever there’s two teams involved, you do get a little bit of human troubleshooting as well like pointing fingers, “No, it’s on your side.” That was kind of what I was getting to.
Now, that we’ve revealed that there are teams that maybe don’t see eye to eye about where the problem might be. Now, we can continue down the road of, “What do we do? What happened? Who did what?”
Chris: Right. Where we’re at is we know that we have this existing client that’s making these backend calls. We’re not seeing anywhere near that kind of error rate with it; things are kind of working. This has been up in live production for many months. It’s working very well. We’re not seeing these kinds of error rates.
Then we deploy this new proxy service to be a middleman between the mobile app and the backend service, and right out of the gate, in very limited testing, we are seeing high error rates, on the order of, it kind of feels like 50%, though it’s probably not that high. But definitely a very high percentage of calls are now timing out, and the head scratcher here is that it’s making the same calls that the existing mobile app is making. The existing mobile app is not having issues with them, but our proxy service is.
That’s the initial state of the problem, where we started off. What happened was, we reported the timeouts to the other team. The consensus among everyone there was like, “Oh, timeouts? That means you don’t have enough computing power.”
There is a backing database for that backend service. People were thinking, “Oh, we just need to go increase the size of that database instance. Then we won’t have these timeouts.” That’s kind of the route that people started going down. Quite a few people looked into that, trying to look at database performance and seeing if there were queries that could be optimized, or indexes to add, or how to move to a bigger instance and what the repercussions of that would be.
I was kind of watching from the sidelines and saw that. That’s when I started asking some questions because even though sometimes, timeouts indicate that, “Hey, something is overwhelmed.” That’s not the only reason why you would get a timeout. The thing that stuck out to me was you have this existing mobile app that is making the call straight to the backend service and it’s not seeing these timeouts. We’re only seeing the timeouts with the proxy service.
Jon: Which is such a weird thing like, with this data, it must be a problem in the proxy service.
Jon: That’s the thing that changed. It must be their fault.
Chris: Right. You start asking yourself, okay. What are the dependencies here? What are the communication paths? What’s talking to what? There are a lot of places, actually, where this could be going wrong. It could be something as silly as the code in the proxy service, the way that it’s making HTTP calls to the backend service. Who knows? Maybe it’s configured incorrectly and it’s set to a 2-second timeout or something like that.
It could be timing out prematurely. I don’t think at that point that we even knew exactly how long these timeouts were. Was it 60 seconds before it timed out? Or was it an immediate timeout? Was it somewhere in between? I’m not sure we actually had that data. But that’s one place where it could be going wrong.
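To make the "how long did it take before it timed out?" question concrete, here is a minimal Python sketch of a timing wrapper. It is an illustration only, not what the team used; the function name `timed_call` and the example URL are hypothetical.

```python
import socket
import time


def timed_call(call, clock=time.monotonic):
    """Run call() and return (outcome, elapsed_seconds).

    outcome is "ok", "timeout", or "error".
    """
    start = clock()
    try:
        call()
        outcome = "ok"
    except (TimeoutError, socket.timeout):
        outcome = "timeout"
    except OSError as exc:
        # urllib wraps connect timeouts in URLError, which keeps the
        # original exception in its .reason attribute.
        reason = getattr(exc, "reason", None)
        is_timeout = isinstance(reason, (TimeoutError, socket.timeout))
        outcome = "timeout" if is_timeout else "error"
    return outcome, clock() - start


# Example use (hypothetical URL, needs network access):
# import urllib.request
# url = "https://backend.example.com/health"
# print(timed_call(lambda: urllib.request.urlopen(url, timeout=5)))
```

A timeout that fires after 2 seconds points at a client-side setting, while one that fires at around 60 seconds points somewhere else in the path.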
Jon: Let me just pause for a second here. Both Chris and I are familiar with this problem and what happened, but I just want to make sure that we’re explaining it in an understandable way. Rich, are you following this? Can you picture the architecture in your mind? A mobile app calling a proxy service that calls the backend service?
Rich: Yeah, I’m following right along.
Jon: Okay, great. Okay, sorry to interrupt, Chris. Let’s continue.
Chris: One place where it could be is actually in that code for the proxy service, whether it’s a misconfiguration in how it’s making its HTTP calls to make these API requests to the backend service. Who knows? It could be a permission problem, a security issue.
It could also be, and this gets more into cloud-specific architecture and networking, but each one of these services is always fronted by an ELB.
ELBs, the way you can think of them, are basically computers themselves. I’m like 99.999% sure that ELBs in Amazon are software load balancers, not hardware devices.
Jon: By the way, ELB is Elastic Load Balancer, right?
Chris: Yes. It’s another hop and it’s another thing with software that’s making decisions and whatnot. ELBs in Amazon all have timeout logic of their own. These are settings on the ELBs themselves.
By default, they are set at 60-second timeouts for incoming requests. ELBs, Elastic Load Balancers, really what they are is a way of turning a collection of stateless services into a cluster that you can basically round-robin through. It gives you both scalability and availability.
You might have an ELB with three instances of your service behind it. That way, you have three times the throughput, and if one of those instances dies for some reason, you still have two that are up and running and can still answer requests.
The ELB takes these incoming requests that say, “I want to access one of those things. I need this capability, this service call.” The ELB will then route each one to one of those instances. Then, when the instance responds, the ELB proxies the response back to the original caller.
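The round-robin idea described here can be sketched in a few lines of Python. This is a toy model for illustration, not how an ELB is implemented; the class name `RoundRobinBalancer` and the instance names are hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Toy round-robin balancer that skips instances marked as down."""

    def __init__(self, instances):
        self._instances = list(instances)
        self._healthy = set(self._instances)
        self._ring = cycle(self._instances)

    def mark_down(self, instance):
        self._healthy.discard(instance)

    def route(self):
        """Return the next healthy instance in rotation."""
        if not self._healthy:
            raise RuntimeError("no healthy instances")
        # The ring is infinite, but at least one instance is healthy,
        # so this loop always returns.
        for instance in self._ring:
            if instance in self._healthy:
                return instance


balancer = RoundRobinBalancer(["svc-1", "svc-2", "svc-3"])
print([balancer.route() for _ in range(4)])  # rotates through all three
balancer.mark_down("svc-2")
print([balancer.route() for _ in range(3)])  # svc-2 is skipped from now on
```

With three instances, throughput is spread three ways, and losing one instance still leaves two to answer requests.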
All that said, if the request takes longer than the timeout period, the ELB will actually terminate the connection and return a 504 error code to the caller. Which is kind of interesting because it could be that the backend, whatever service is behind that load balancer, may still be crunching away and working on it.
Maybe it takes 70 seconds for it to complete. As far as it is concerned, everything is just fine. It doesn’t realize that the ELB gave up on it and terminated the connection. When that service behind it finishes and sends the response, it’s going out to basically /dev/null because the client got disconnected. That’s something else to consider in this whole puzzle. Is it an ELB timeout? Or is it a timeout actually in the service itself?
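The situation where the load balancer gives up while the backend keeps working can be simulated with a thread and a timeout. The sketch below is a scaled-down model (fractions of a second instead of 60 and 70 seconds); `call_through_balancer` and `slow_backend` are hypothetical names, and nothing here talks to a real ELB.

```python
import concurrent.futures
import time


def call_through_balancer(backend, idle_timeout):
    """Simulate a load balancer that gives up after idle_timeout seconds.

    Returns (status, result, future). The future lets the caller observe
    that the backend keeps running after the balancer has given up.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(backend)
    try:
        return "ok", future.result(timeout=idle_timeout), future
    except concurrent.futures.TimeoutError:
        return "gateway_timeout", None, future
    finally:
        # Do not wait here: the backend thread finishes on its own.
        pool.shutdown(wait=False)


def slow_backend():
    time.sleep(0.3)  # stands in for the 70-second request
    return "done"


status, result, future = call_through_balancer(slow_backend, idle_timeout=0.05)
print(status, result)            # the caller only sees the timeout
print(future.result(timeout=2))  # the backend still completed its work
```

From the backend's point of view the request succeeded, which is why backend logs alone may not show the failure the caller experienced.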
Jon: One thing I just want to make sure we’re using the right terminology, on ECS, is it ELB? Or it’s the Application Load Balancer, ALBs?
Chris: An ALB is a type of ELB. ELBs come in three flavors. You have your Application Load Balancers, which do path-based routing. They work at layer seven; basically, they only do HTTP and HTTPS. They’re working at the application level, that’s why they are called Application Load Balancers.
You have the classic load balancers; those work at layer four. Those can work across all the different protocols, whether you want to do TLS, SMTP, POP3, or whatever internet protocol you want. It’s working at that networking level.
They have another one called a Network Load Balancer and that one is at an even lower level. It’s basically more like the pure metal version of it.
Three different flavors of load balancers. With ECS, some of our services may very well use classic. We prefer the ALB because usually there are advantages to the ALB. You can actually front multiple microservices with a single ALB if you design your routes correctly.
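As a rough picture of how one load balancer can front several microservices, here is a Python sketch of path-based routing using longest-prefix matching. This is a simplification chosen for illustration and is not the ALB's actual rule evaluation; the function name, paths, and service names are all hypothetical.

```python
def route_by_path(path, rules, default=None):
    """Return the target whose path prefix is the longest match for path."""
    best_len, best_target = -1, default
    for prefix, target in rules.items():
        base = prefix.rstrip("/")
        # Match the prefix itself or anything below it, but not
        # unrelated paths that merely share leading characters.
        matches = path == base or path.startswith(base + "/")
        if matches and len(base) > best_len:
            best_len, best_target = len(base), target
    return best_target


rules = {
    "/users": "user-service",
    "/users/admin": "admin-service",
    "/orders": "order-service",
}
print(route_by_path("/users/42", rules))
print(route_by_path("/users/admin/settings", rules))
print(route_by_path("/health", rules))
```

Designing the routes so that prefixes do not collide is what lets a single balancer serve multiple services.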
Jon: I was thinking about that exact thing which is why I thought, “Oh, wait. It’s ALBs.”
Anyways, you just discussed here’s what load balancers are doing and to kind of go back to the problem, we just need to get some proof around what’s happening. The service is working unless you call it through the proxy. What happens next then?
Chris: The suggestion is like, “Okay, let’s see if we can isolate part of this.” What if we just do a curl against the actual backend service itself? Can we actually see these timeouts happening? Are they happening that way? That way you’re removing the proxy service from it. You’re removing the ELB of the proxy service from the equation. Now, you’re going straight to the backend service.
As a first step, let’s go see what that gives us. In doing that test, what happened was that one of the people on the team saw that, sure enough, one of those calls timed out after about 60 seconds. But then they did it again and it came back with a successful response in 3 seconds. Radically different.
This keeps happening. Sometimes it takes really long and it times out, and other times, it’s normal. It just responds. That’s kind of what we had next. The team was kind of like, “See, it’s proof. It’s the other team’s problem. They are timing out.”
Jon: It’s not the proxy service’s fault because curl going directly to the backend service still shows the timeout.
Chris: Indeed. Let’s dig a little deeper here and see what’s going on because the thing that sticks out there is this wide discrepancy. Sometimes, it takes 60 seconds and sometimes, it takes 3 seconds. It’s not like the 60-second calls are all happening in a one-hour window during peak hours, these are back to back calls.
In a way, it was super good news because it’s like, “Okay, this smells like we have some sick server—some degraded server that is in that cluster behind the ELB that’s not working correctly.”
What’s happening is for those requests that timed out, they get routed to the sick thing. It’s really slow, it doesn’t have enough memory. Maybe it’s having problems with disk space. It’s swapping. Something’s going on. That kind of makes sense. That matches up with the behavior that we’re seeing.
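The reasoning above, that slow and fast calls alternating back to back suggests one sick target rather than general overload, can be expressed as a small heuristic. The Python sketch below is illustrative only; the function name, the threshold, and the sample numbers are assumptions, not measurements from the incident.

```python
def summarize_latencies(samples, slow_threshold):
    """Summarize a sequence of request durations, in seconds.

    slow_fraction: share of calls at or above slow_threshold.
    flips: how often consecutive calls switch between slow and fast.
    """
    if not samples:
        raise ValueError("need at least one sample")
    slow = [seconds >= slow_threshold for seconds in samples]
    flips = sum(1 for a, b in zip(slow, slow[1:]) if a != b)
    return {"slow_fraction": sum(slow) / len(slow), "flips": flips}


# Interleaved slow and fast calls, as in back-to-back tests.
print(summarize_latencies([60, 3, 3, 60, 3, 3], slow_threshold=30))
# One sustained slow period followed by recovery, as under peak load.
print(summarize_latencies([60, 60, 60, 3, 3, 3], slow_threshold=30))
```

A slow fraction near one in N with many flips is consistent with one of N targets misbehaving, though it is only a hint and not proof.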
The next step is to, you can just enable verbose mode on curl to see exactly the steps that are happening when curl is trying to make that request. That’s when things got even more interesting.
When we turned on verbose mode for curl and made this request, what was happening on the ones that were timing out was that, when it made its initial attempt, it was looking up the DNS record.
The DNS record for this is actually round-robin DNS because this is a load balancer, and the load balancer was defined to be in three AZs. That means there are three separate actual load balancers. They all sit behind the same DNS name, so just call it something like elb.amazon.com as the DNS name. In that DNS record, there are three IP addresses.
What curl does, if you want a connection to elb.amazon.com, is look up the DNS record and see that there are three addresses. It’s up to the client what it wants to do. Really, the client should probably do something like round-robin, but it doesn’t have to.
They can pick one of those three addresses and go ahead and make connection to that IP address. What we’re seeing is that, out of these three IP addresses, one of them, when the client first picked that one, that was the one that was not responding.
In curl, what was happening was, I think curl had a timeout for about 15 seconds. We’d see this wait, it’s like, you can see it in verbose mode, you can see the DNS resolution that resolved to the IP, it tries to make a connection to the IP, and it’s just hanging there for 15 seconds. It then times out, gives up, and then it goes back to the DNS record and says, “I’m going to try another one.” It grabs another one and when it grabbed the other one, then it went through, it connected just fine.
This happened multiple times. Every time that it failed, it was that one IP address out of the three that it was trying to connect to that just wasn’t responding. This was really interesting and good information to have because now we knew that the problem is a problem with the ELB. It’s not with the backend service. It’s the actual ELB that’s having a problem.
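The same per-address check that verbose curl output revealed can be done deliberately: resolve the name, then try each IP address on its own. The Python sketch below uses only the standard `socket` module; the function names and the example hostname are hypothetical, and the `connect` parameter exists only so the logic can be exercised without a network.

```python
import socket


def resolve_ipv4(hostname, port):
    """Return the sorted, de-duplicated IPv4 addresses for hostname."""
    infos = socket.getaddrinfo(
        hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return sorted({info[4][0] for info in infos})


def probe_addresses(addresses, port, timeout=5.0,
                    connect=socket.create_connection):
    """Try a TCP connection to each address and record what happened."""
    results = {}
    for address in addresses:
        try:
            conn = connect((address, port), timeout=timeout)
        except (TimeoutError, socket.timeout):
            results[address] = "timeout"   # nothing answered at all
        except ConnectionRefusedError:
            results[address] = "refused"   # host reachable, port closed
        except OSError as exc:
            results[address] = f"error: {exc}"
        else:
            conn.close()
            results[address] = "ok"
    return results


# Example use (hypothetical hostname, needs network access):
# ips = resolve_ipv4("my-load-balancer.example.com", 443)
# print(probe_addresses(ips, 443, timeout=5.0))
```

If the same single address reports "timeout" on every run while the others report "ok", the problem follows that address rather than the service behind it.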
When we figured that out, the light bulb went off. Why would an ELB go bad? Why would it have problems? You can’t reboot an ELB. It’s a managed service. Amazon controls that. The likelihood that it’s overwhelmed, that’s probably not the cause. Instead, it’s probably something related to networking, security, permissions, and whatnot.
With that, we could go into the Amazon console, look up that ELB, and see how it’s configured. When we did that, we saw that it was correctly using three availability zones. But it turned out that, in the subnet definitions it was using, the one ELB that was not responding to requests was on a subnet that did not have an internet gateway.
Because of that, it was basically hidden from the internet, from internet requests. That’s why it couldn’t service those requests: that particular ELB was configured to be on a private subnet. It’s a private ELB that’s not supposed to connect to the internet.
The only way to really connect to that would have been if you were actually inside the AWS data center or through a VPN or on another machine that’s actually inside there and using the private IP.
Jon: That’s something that I was keeping in mind: when networks are unreachable through the internet, when you cannot route to a network through the internet, the behavior is typically a timeout as opposed to a refusal or something like that. You usually get that try and try and try and try, okay, I give up. You know that the network is there and it should be available, but when it times out, that’s when you start thinking, “Hmm, is this thing accessible to the internet? Can it be reached?”
Chris: Absolutely, yes. Definitely for sure. That’s like a big smoking gun. It’s either a routing issue or it’s a permission issue that’s preventing that from happening.
The fix is really simple. It was just like, “Hey, we need to correctly define this ELB to make sure all the subnets that it’s using actually have an internet gateway so they can talk to internet traffic.”
Once the incorrect subnet for that AZ was swapped out for one that was correct, one that actually had an internet gateway on it, then lo and behold, everything worked just fine.
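A check like this can be automated. The Python sketch below takes a list of subnet IDs and a list of route tables and reports which subnets have no default route to an internet gateway. It assumes the route tables are plain dictionaries shaped like the `RouteTables` entries in the EC2 DescribeRouteTables response (`Routes`, `Associations`, `GatewayId`, and so on); the function name and all IDs are hypothetical, and it only looks at the IPv4 default route.

```python
def subnets_without_internet_gateway(subnet_ids, route_tables):
    """Return the subnet IDs whose route table lacks a 0.0.0.0/0 -> igw route."""

    def has_igw_default_route(table):
        return any(
            route.get("DestinationCidrBlock") == "0.0.0.0/0"
            and (route.get("GatewayId") or "").startswith("igw-")
            for route in table.get("Routes", [])
        )

    explicit = {}      # subnet ID -> explicitly associated route table
    main_table = None  # used by subnets with no explicit association
    for table in route_tables:
        for assoc in table.get("Associations", []):
            if assoc.get("Main"):
                main_table = table
            if "SubnetId" in assoc:
                explicit[assoc["SubnetId"]] = table

    missing = []
    for subnet_id in subnet_ids:
        table = explicit.get(subnet_id, main_table)
        if table is None or not has_igw_default_route(table):
            missing.append(subnet_id)
    return missing


# Hypothetical data: two public subnets and one private subnet.
tables = [
    {"Routes": [{"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
                {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-123"}],
     "Associations": [{"SubnetId": "subnet-a"}, {"SubnetId": "subnet-b"}]},
    {"Routes": [{"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"}],
     "Associations": [{"Main": True}, {"SubnetId": "subnet-c"}]},
]
print(subnets_without_internet_gateway(["subnet-a", "subnet-b", "subnet-c"], tables))
```

Running a check like this against the subnets attached to a public-facing load balancer would flag the kind of misconfiguration described here before traffic hits it.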
Jon: The thing I’m still confused about is why was it that the mobile app before the proxy didn’t have the same problem? Isn’t it hitting the same ELB DNS name?
Chris: That’s a great question. It’s because of an architecture decision where there are actually two ELBs fronting that backend service. There’s one that fronts it for public calls and one that’s for internal only. If a caller was going against the internal-only one, it would’ve been fine, because whatever was connecting to it was not on the open internet and could make that connection, versus going through this other…
Part of it was the proxy service was trying to go through the public interface. That was causing some issues. I think there was also some affinity for how the client was connecting to the public facing ELB on what ELB it was connecting to as well.
Jon: Interesting. Well, I’m glad we didn’t just make a better database.
Chris: Amazon’s bummed out.
Rich: How long did this take for you to figure out?
Chris: It was about 10 minutes or 15 minutes, maybe?
Jon: Oh, I thought it was a lot longer than that.
Chris: Yeah, from the time that the problem originally surfaced to when it was fixed was probably like three days and involved quite a few people. When I got involved and kind of had the opportunity to look at it with fresh eyes, and come at it from a different angle, that’s the part that went rather quick.
Jon: Okay, got it. Thanks for taking us through that really interesting troubleshooting lesson. We’ll get back to each other in a week and talk some more.
Jon: Alright, see you.
Chris: Great, thanks guys. See you.
Rich: Bye. Well, dear listener, you made it to the end. We appreciate your time and invite you to continue the conversation with us online. This episode, along with show notes and other valuable resources, is available at mobycast.fm/16. If you have any questions or additional insights, we encourage you to leave us a comment there. Thank you and we’ll see you again next week.