28. Setting up Virtual Private Clouds on AWS (Part 2)
Chris Hickman and Jon Christensen of Kelsus and Rich Staats from Secret Stache continue their micro-series on setting up virtual private clouds (VPCs) for ECS on Amazon Web Services (AWS). In this episode, they focus on availability and security considerations and best practices.
Some of the highlights of the show include:
- Network Address Translation (NAT) Gateway: NAT is essentially a firewall where you want what’s behind it to be able to make outbound connections but not inbound ones
- NAT translates a private address to a public one so a resource can talk to the outside world; for example, a database server that should not be directly hittable from the Internet can still reach out to download updates and patches
- NAT gateway now makes the process a lot easier to set up; previously, you had to manually deal with a NAT instance and worry about throughput, network load, etc.
- NAT gateways are per availability zones (AZs) on a subnet; there’s some additional complexity to make sure things are going to work when an availability zone goes down
- A NAT gateway is assigned to a subnet in a single AZ; if every subnet routes through that one gateway, it becomes a single point of failure when its AZ fails; define a NAT gateway and a separate routing table for every AZ
- Every network request has an inbound request going from Computer A to Computer B, and Computer B receives and responds to it
- Network ACLs: Define stateless rules for the request and response; rules can have different settings for the response vs. the request
- Rules for security groups are stateful because one rule applies to both the request and response; security groups are more granular, so you can slice and dice for added control
- Network ACLs are good, if you just want to have VPC-wide rules to allow or disallow certain types of traffic patterns
- Minimally, highly available VPC needs to be in one region, have at least two AZs, use an Internet gateway to talk to the outside world, and have at least two NAT gateways
- Public subnet is where a routing table allows inbound Internet access, so it’s directly accessible from the Internet and resources are addressable from the open Internet
- Private subnet includes resources that should not be directly accessible from the Internet and may or may not need to have access to the Internet through outbound requests
- AWS console does not include “create public subnet” or “create private subnet” buttons; it’s a networking architecture consideration to set privileges and access permissions
Rich: In Episode 28 of Mobycast, we continue our conversation on VPC setup on AWS. In particular, we discussed availability and security considerations. Welcome to Mobycast, a weekly conversation about containerization, Docker and modern software deployment. Let’s jump right in.
Jon: Welcome, Chris and Rich. It’s another Mobycast.
Chris: Hi, Rich.
Jon: Hey. This week, we have a lot to talk about because, last week, we started getting into what it takes to set up a VPC for ECS on AWS. That’s three three-letter acronyms in one sentence. It turns out that there’s a lot to talk about in there. As interested as I am in what you did this past week, I think we should jump right in. Where we left off last week is that we were talking about attaching an internet gateway to your VPC and then we started talking about routing tables a little bit. Then, that got us talking about just the importance of understanding TCP/IP networking in general and how it always comes up.
It comes up in your daily life at home, dealing with your computers, to dealing with AWS, to dealing with stuff in the office and your programming and everything. It just always comes up. The more you know about it, the more valuable of a software engineer you’ll be. I think, as we continue, we want to be careful to make sure that we define everything and talk about things that make sense but we do assume that, if you’re listening, you have some familiarity with TCP/IP networking so we’ll try to keep our definitions and conversations to AWS-specific stuff instead of kind of doing a networking fundamentals podcast.
With that in mind, I think the next thing to talk about is, “Is it network ACLs or are we talking about NAT gateways?”
Chris: Yeah, I think we were kind of doing a broad survey of just all the various players that go into a VPC and its creation and management. We talked about subnets, routing tables. We talked about what an internet gateway is. Another piece of that is a NAT gateway. NAT stands for network address translation and, in addition to NAT gateways, some of the major players in this would be some of the security considerations like, “What are network ACLs versus security groups?” and I think we’ll round out the primary components of what makes up a VPC. Then, after that, we can start diving into more about how you start applying these things and what are some of those security considerations and availability considerations as well as best practices for how you ideally set this up. We’ll focus on when you’re running your workload on ECS, what the best practice for that is.
Jon: Let me interrupt real quick because I want to set it up just exactly as the way you had set it up before, which is we had set up our subnets and you were saying there’s got to be a way to talk from the subnet out to the network and there’s got to be a way for the internet to talk. If you want the subnet to be available from the outside world, there’s got to be a way to talk from the outside world to the computers inside the subnet. The way out was the internet gateway. That’s how the computers on the inside of the subnet can talk out to the worldwide web and the internet at large. Then, the NAT gateway is how they can talk in. What more can you tell us about NAT gateways and setting them up?
Chris: Basically, NAT is essentially a firewall. It’s where you want the stuff that’s behind the firewall to be able to make outbound connections but you don’t want to allow any inbound connections. Network address translation is really what it implies, is that you are translating an address from a private one to a public one, essentially, in order to allow it to talk to the outside world. A NAT gateway, again, becomes important as part of your VPC when you have resources that you don’t want to be directly accessible by other machines on the open internet yet it still needs to go access the internet itself.
You could think of it like maybe a database server in that you don’t want your database server to be directly hittable by the public internet but maybe it needs to have network access to download updates because you’re running Postgres and you want to get the latest build of Postgres or something like that or you need to get patches or whatnot so it needs internet access. A NAT gateway would provide that functionality that you need and that security. Think of it as a firewall, as a proxy. It’s basically doing some of that proxying, doing some translation, saying, “Hey, I know you have this private IP which is not addressable from the internet. I’m going to basically keep a mapping of your private IP to a public IP and I will do the communication on your behalf.”
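The mapping Chris describes can be sketched as a toy translation table. This is only an illustration of the idea, not how AWS implements a NAT gateway; the class name, the starting port, and the IP addresses below are all made up for the example.

```python
from itertools import count


class NatTable:
    """Toy source-NAT table: private hosts share one public IP (hypothetical model)."""

    def __init__(self, public_ip, first_port=1024):
        self.public_ip = public_ip
        self._ports = count(first_port)   # next free public port
        self._outbound = {}               # (private_ip, private_port) -> public_port
        self._inbound = {}                # public_port -> (private_ip, private_port)

    def translate_outbound(self, private_ip, private_port):
        """Record (or reuse) a mapping when an inside host connects out."""
        key = (private_ip, private_port)
        if key not in self._outbound:
            public_port = next(self._ports)
            self._outbound[key] = public_port
            self._inbound[public_port] = key
        return self.public_ip, self._outbound[key]

    def translate_inbound(self, public_port):
        """Return the inside host for a reply, or None for unsolicited traffic."""
        return self._inbound.get(public_port)


nat = NatTable(public_ip="203.0.113.10")
print(nat.translate_outbound("10.0.1.25", 5432))  # the database server reaches out
print(nat.translate_inbound(1024))                # the reply finds its way back
print(nat.translate_inbound(9999))                # nothing inside asked for this
```

The last call returns `None`: with no mapping on record, traffic that starts on the outside has nowhere to go, which is the firewall-like behavior described above.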
Jon: With the internet gateway on AWS, specifically, that was a managed service that AWS provides so you don’t even have to think of it as a computer that you’re maintaining. What about with NAT gateways?
Chris: There’s two flavors of this. One is NAT instance and then there’s the idea of a NAT gateway. NAT gateway is something that’s relatively new. We’re getting value-added service added by AWS to make life easier for you. It is something that’s out of the box. It makes this process a lot easier to set up. In the past, you had to deal with basically a NAT instance, in which case it was much more manual where you would basically build your own NAT functionality by spinning up an EC2 and enabling port forwarding and just manage that all yourself so you had to worry about things like throughput and did you have a big enough EC2 that can handle the network load and whatnot. That’s a NAT instance.
Luckily, we really don’t have to do that. There’s no reason to do it anymore. You can use a NAT gateway, but the issue with a NAT gateway is NAT gateways are per availability zone in that you have to put them onto a subnet. What that means is it does mean there’s some additional complexity there to make sure that things are going to work when an availability zone goes down. This is actually one of those things that I think a lot of folks don’t really think through and realize what the implications are.
They’ll build a setup, a single NAT gateway, assign that to a routing table that becomes essentially a private routing table and then everything works and they’re fine. They may very well have multiple subnets across multiple AZs but they’re still not fully available because, if that availability zone in which the NAT gateway is assigned goes down, then that means, now, all of their NAT functionality is broken; it’s not going to work.
Jon: I was listening to you but I missed one part. Did you say that NAT gateways only work per availability zone or that you had to have one per subnet?
Chris: It needs to be assigned to a subnet. By virtue of that, it has to be assigned to an AZ.
Jon: Again, you can have multiple subnets inside an AZ. I don’t know necessarily that you would typically do that but, if you did have multiple subnets inside one single AZ, you would need to have a NAT gateway per subnet that you wanted a NAT gateway on, right?
Chris: No, you need to have a NAT gateway. It needs to live somewhere. In AWS’s case, you need to assign it to a subnet so it’s going to live in a specific AZ. You can now have a subnet in some other AZ talk to that NAT gateway. You don’t have to define it. Each subnet doesn’t have to have its own NAT gateway so you can have 16 subnets and a single NAT gateway and that’s perfectly fine. Again, this is the common scenario; this is what most folks do, but the problem is that, if you’re in a region with, say, three AZs and you have six subnets, there’s two subnets in each one of the AZs and you’ve created a single NAT gateway that sits in AZ1, now you have the subnets spread across AZ1, 2 and 3 but they’re all using that NAT gateway that’s in AZ1.
If AZ1 goes down, you did the design so the other two AZs are still up and working but the problem is, now, when they go try to access the internet, they’re not going to be able to because the NAT gateway was in AZ1 and it went down. Even though you did some design to make yourself available when AZs have failures, you had a single point of failure, basically, with your NAT gateway. The way around this is you basically define a NAT gateway for every AZ so you end up having separate routing tables for every AZ.
Jon: That makes sense.
Chris: If you’re spread across three AZs, you’ll have three NAT gateways and you’ll have three private routing tables and then, now, you’re as available as you can be. That way, if there’s a failure in AZ1, that just means whatever’s in AZ1 will be affected but AZ2 and 3 will still perform without a hitch.
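The difference between the two designs can be shown with a small simulation. It is a minimal sketch using the six-subnet, three-AZ layout from the conversation; the subnet and gateway names are invented, and the model only tracks which AZ each piece lives in.

```python
def subnets_with_outbound_access(subnet_az, subnet_nat, nat_az, failed_az):
    """Return subnets that can still reach the internet after failed_az goes down."""
    survivors = []
    for subnet, az in subnet_az.items():
        if az == failed_az:
            continue  # the subnet itself is in the failed AZ
        if nat_az[subnet_nat[subnet]] != failed_az:
            survivors.append(subnet)  # its NAT gateway is still up
    return sorted(survivors)


# Six private subnets, two in each of three AZs (names are illustrative).
SUBNET_AZ = {
    "private-1a": "az1", "private-1b": "az1",
    "private-2a": "az2", "private-2b": "az2",
    "private-3a": "az3", "private-3b": "az3",
}

# Design 1: every subnet routes through one NAT gateway that lives in az1.
single_nat = {subnet: "nat-1" for subnet in SUBNET_AZ}
single_nat_az = {"nat-1": "az1"}

# Design 2: one NAT gateway per AZ, each subnet routes to the one in its own AZ.
per_az_nat = {subnet: f"nat-{az}" for subnet, az in SUBNET_AZ.items()}
per_az_nat_az = {"nat-az1": "az1", "nat-az2": "az2", "nat-az3": "az3"}

print(subnets_with_outbound_access(SUBNET_AZ, single_nat, single_nat_az, "az1"))
print(subnets_with_outbound_access(SUBNET_AZ, per_az_nat, per_az_nat_az, "az1"))
```

With the single gateway, losing az1 leaves no subnet with outbound access. With one gateway per AZ, the four subnets in az2 and az3 keep working.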
Jon: Right on. Yeah, that makes sense.
Chris: That’s a bit about NAT gateways. Rounding out the components of a subnet that are important would be network ACLs and security groups. I think we’ll get more into this a little bit later as we dive into some of the security considerations for how to set it up, but this might be a good time to talk about, just at a broad level, network ACLs. They’re being applied across the VPC and across every subnet. These rules are stateless so every network request has two hops to it. It’s the inbound request so you’re going from Computer A talking to Computer B–that’s the first hop–Computer B receives that request and then responds back to it. It’s the request and the response.
With network ACLs, you define the rules for both the request as well as response and so it’s called a stateless rule because of that, because it does not remember the settings of what it was. You can have different settings on the response versus a request. With security groups, they’re stateful. You have the one rule that basically applies to both the request and the response. That’s one of the big differences between those two. The other thing is that security groups are much more granular and so you can slice and dice and have much more fine-grained control. Network ACLs are really good if you just want to have VPC-wide rules to allow or disallow certain types of traffic patterns. You can block traffic from a certain IP range or you can say, “You know what? We’re not going to allow any SSH on this VPC so Port 22 is blocked for the entire VPC.” That’s when network ACLs come into play.
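The stateless versus stateful distinction is easier to see in a toy model. The sketch below is an assumption-laden simplification: rules are reduced to sets of allowed ports, and the class and method names are invented for illustration rather than taken from any AWS API.

```python
class NetworkAcl:
    """Stateless: the request and the response are checked against separate rules."""

    def __init__(self, inbound_ports, outbound_ports):
        self.inbound_ports = set(inbound_ports)
        self.outbound_ports = set(outbound_ports)

    def allows_request(self, dest_port):
        return dest_port in self.inbound_ports

    def allows_response(self, client_port):
        # No memory of the request: the response needs its own outbound rule.
        return client_port in self.outbound_ports


class SecurityGroup:
    """Stateful: an allowed request is remembered, so its response is allowed too."""

    def __init__(self, inbound_ports):
        self.inbound_ports = set(inbound_ports)
        self._tracked = set()  # (client_ip, client_port) of allowed requests

    def allows_request(self, client_ip, client_port, dest_port):
        if dest_port in self.inbound_ports:
            self._tracked.add((client_ip, client_port))
            return True
        return False

    def allows_response(self, client_ip, client_port):
        return (client_ip, client_port) in self._tracked


# A client at 198.51.100.7 connects from port 50000 to HTTPS on port 443.
acl = NetworkAcl(inbound_ports=[443], outbound_ports=[])
print(acl.allows_request(443), acl.allows_response(50000))

acl_fixed = NetworkAcl(inbound_ports=[443], outbound_ports=range(1024, 65536))
print(acl_fixed.allows_request(443), acl_fixed.allows_response(50000))

sg = SecurityGroup(inbound_ports=[443])
print(sg.allows_request("198.51.100.7", 50000, 443),
      sg.allows_response("198.51.100.7", 50000))
```

The first ACL lets the request in but drops the response, because nothing on the outbound side covers the client's port. The security group needs only the one inbound rule.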
Jon: Okay, that makes sense. You’ll use network ACLs and security groups to secure your VPC. We’ve talked about all the parts and pieces. We have a few more minutes. Maybe we can start putting it together a little bit.
Chris: Yeah, and we’ve touched a little bit on availability considerations. You do want to be spread across multiple AZs. We talked about internet gateways, how that’s a fully-managed service and don’t need to worry about that. NAT gateways, you do. Those are basically one per AZ and so you just need to do a little bit more work to make sure that that is as available as you can be. We’ve talked a little bit about security.
Jon: That does seem like a good way to wrap it up a little bit. Maybe we can say for availability considerations. We talked about all of those three things, multiple AZs, the internet gateway and the NAT gateway, but maybe we could just quickly define a minimally, highly available VPC. It’s got to be in one region, it’s got to have at least two availability zones, it’s got to use an internet gateway if it needs to talk to the outside world and then it’s got to have at least two NAT gateways because it has two availability zones. Is that minimally, highly available?
Chris: Yeah, I think that covers it.
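Jon's checklist for a minimally, highly available VPC can be written down as a small check. This is a sketch over a made-up dictionary layout, not a real AWS data structure; the keys `subnets`, `internet_gateway`, and `nat_gateway_azs` are assumptions chosen for the example.

```python
def availability_gaps(vpc):
    """List the ways a VPC description falls short of the minimal checklist."""
    gaps = []
    azs = {subnet["az"] for subnet in vpc["subnets"]}
    if len(azs) < 2:
        gaps.append("subnets span fewer than two AZs")
    if not vpc["internet_gateway"]:
        gaps.append("no internet gateway attached")
    # Every AZ that holds subnets should have its own NAT gateway.
    for az in sorted(azs - set(vpc["nat_gateway_azs"])):
        gaps.append(f"no NAT gateway in {az}")
    return gaps


vpc = {
    "subnets": [{"name": "private-a", "az": "az1"},
                {"name": "private-b", "az": "az2"}],
    "internet_gateway": True,
    "nat_gateway_azs": ["az1"],
}
print(availability_gaps(vpc))
```

Here the VPC spans two AZs and has an internet gateway, but the check flags az2 because only az1 has a NAT gateway.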
Jon: We actually still have time so keep going. You were going to talk about security considerations?
Chris: We’ve talked a little bit about network ACLs and security groups. We could dive really deep and spend a long time on that–and I think we’ll save that maybe for some other day–but one of the important considerations that come out of what we’ve talked about so far is–and you’ll hear this a lot–the idea of a public versus private subnet.
Jon: Yes. That’s so tricky for people, I think.
Chris: Yeah, and it’s so key. It’s so fundamental. Basically, in general, a public subnet would be a subnet where the routing table allows internet access coming inbound from the internet so it’s directly accessible from the internet. That’s why it’s called public; it’s publicly accessible. If you have resources that need to be directly addressable from the open internet, then you’re going to be making sure those resources are on a public subnet. A private subnet would be those resources that should not be directly accessible from the internet, and they may or may not need to have access to the internet themselves through outbound requests.
This is definitely one of those fundamental design considerations that you’re going to want to separate your resources into, public versus private. You’ll want to set up public versus private subnets, and the same rules about availability apply here as well. If you’re going to have both public and private subnets as part of your architecture, then you’re going to want to spread those across availability zones. If you’re spreading across three AZs, you may very well end up having six subnets at a minimum because you can have three public subnets on the three different AZs and then three private subnets on each one of these AZs.
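The six-subnet layout can be planned with Python's standard `ipaddress` module. This is one possible way to carve up the address space, assuming a 10.0.0.0/16 VPC and /24 subnets; neither size comes from the episode, and the AZ names are placeholders.

```python
import ipaddress


def plan_subnets(vpc_cidr, azs, new_prefix=24):
    """Assign one public and one private CIDR block per AZ, in order."""
    blocks = ipaddress.ip_network(vpc_cidr).subnets(new_prefix=new_prefix)
    plan = []
    for tier in ("public", "private"):
        for az in azs:
            plan.append((f"{tier}-{az}", str(next(blocks))))
    return plan


for name, cidr in plan_subnets("10.0.0.0/16", ["az1", "az2", "az3"]):
    print(f"{name:12} {cidr}")
```

Three AZs yield six subnets: three public and three private, each in its own non-overlapping block.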
Jon: I now remember having a conversation with you a while back, Chris, about this public versus private subnet thing because one of the things that happens is you’ll make a mistake setting up your VPC or your subnets and you’ll end up in a situation where maybe you’re trying to get to a database or you’re trying to get to a machine and just can’t seem to get to it. You’re like, “I’m on the AWS console and I’m right-clicking and I’m saying connect to this computer and it just doesn’t work,” and it’s like, “What is going on? Why is AWS broken?” and the answer might be that there’s no route from the internet to that private subnet.
AWS isn’t telling you that because it doesn’t know that; all it knows is that there’s a computer there and, typically, if there was a route to it, this is how you would connect to it. I remember saying, “Why can’t AWS make this easier? Why can’t it put a big, red ‘Hey, this is not publicly available. This is on a private subnet’ thing on there?” Remember that conversation?
Chris: I do now.
Jon: It basically came down to, “It’s too tricky.” The fact is that it’s a routing table that needs to have a route into the subnet and not a property of the subnet itself.
Chris: Correct. Yeah, there’s no button in the AWS console that says ‘create public subnet’ or ‘create private subnet’. This is just the convention of what public and private subnets mean so it’s very much a networking architecture consideration type of thing and it’s just a way for you, a best practice, to set things up so that you are giving the least amount of privilege and access that’s necessary in order for your system to work and then also to make sure that you’re as secure as possible. The other thing that complicates what you talked about, Computer A trying to talk to Computer B and it can’t get there because of a route, is that it’s not just the route that could be in play. It could be the network ACLs are preventing that access. It could be a security group that’s preventing that access.
Jon: Yeah, and that’s what I was going to get at, is when you’re creating an EC2 instance and you say, “Well, this is the security group that I’m going to use for this EC2,” one of the things that can trick you into thinking that the thing should be publicly available is that you might crank that security group wide open so anybody can talk to this computer. It is available but, still, if there’s no internet, if there’s no networking route from the internet to that EC2, it can be wide open and you still can’t talk to it. I think that catches more people than just about anything else.
Chris: Yeah, absolutely. There’s a lot of layers here, and any one of those layers, if it’s not configured the way that you need it to be, it’s not going to work. You can think of it like switches. There might be three switches and they all have to be in the closed position in order for you to have connectivity. Then, if any one of those switches is open, it’s just not going to happen, no connection.
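The switches analogy translates directly into a check where every layer has to say yes. The function and layer names below are illustrative only; a real diagnosis would have to inspect the actual route tables, ACLs, and security groups.

```python
def blocking_layers(has_route, nacl_allows, security_group_allows):
    """Return the layers that are blocking; an empty list means connectivity."""
    layers = {
        "route table": has_route,
        "network ACL": nacl_allows,
        "security group": security_group_allows,
    }
    return [name for name, allowed in layers.items() if not allowed]


# The security group is wide open, but there is no route from the internet.
print(blocking_layers(has_route=False, nacl_allows=True, security_group_allows=True))

# All three switches closed: the connection goes through.
print(blocking_layers(has_route=True, nacl_allows=True, security_group_allows=True))
```

The first call reports the route table as the only blocker, which is the case Jon describes: opening the security group changes nothing when no route exists.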
Jon: Some of them are sort of surfaced by the AWS console and others are things that you need to be aware of when you’re creating your VPC.
Chris: Absolutely. That’s, in general, public versus private. By default, when you create a new subnet, it’s going to be a private subnet because the routing table won’t have the path to an internet gateway defined. If you want something to be public, like I said, you’d set up an internet gateway and you’d add that to the routing table used by that particular subnet so that now you have both inbound and outbound internet access. For your private subnets, if they don’t need any internet access whatsoever, then there’s nothing you need to do. You just leave them alone and they basically can only talk to other machines that are there within the VPC.
If those machines need to make outbound requests to the internet, again, like to go download patches or to make API calls to other services or whatnot, what you’d need to do is to use a NAT gateway and so you would update the routing tables such that the machines inside that subnet have a route to the NAT gateway. The NAT gateway needs to be on a public subnet so that it can now go talk to the outside world. It’s one of those things where, at the end of the day, ends up being pretty straightforward and simple, like what the difference is between a private and a public subnet, and how to create them and what it means, just the practicalities of setting that up. Then, it’s also one of those things that maybe a lot of people don’t think through and understand how it works so it’s good to talk about.
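Since public versus private comes down to what the routing table says, a subnet can be classified by looking up where internet-bound traffic would go. The sketch below models a routing table as a plain dictionary and picks the most specific matching route; the gateway IDs are invented, and the labels returned are this example's own wording.

```python
import ipaddress


def next_hop(route_table, destination_ip):
    """Pick the target of the most specific route that matches the destination."""
    destination = ipaddress.ip_address(destination_ip)
    matches = []
    for cidr, target in route_table.items():
        network = ipaddress.ip_network(cidr)
        if destination in network:
            matches.append((network.prefixlen, target))
    return max(matches)[1] if matches else None


def classify_subnet(route_table, probe_ip="198.51.100.1"):
    """Label a subnet by where traffic to an address outside the VPC would be sent."""
    target = next_hop(route_table, probe_ip)
    if target is None or target == "local":
        return "private (no internet access)"
    if target.startswith("igw-"):
        return "public"
    if target.startswith("nat-"):
        return "private (outbound via NAT)"
    return "unknown"


public_rt = {"10.0.0.0/16": "local", "0.0.0.0/0": "igw-example"}
nat_rt = {"10.0.0.0/16": "local", "0.0.0.0/0": "nat-example"}
default_rt = {"10.0.0.0/16": "local"}

for name, table in [("public", public_rt), ("nat", nat_rt), ("default", default_rt)]:
    print(name, "->", classify_subnet(table))
```

The table with only the local route is the default case described above: the subnet can talk to other machines in the VPC and nothing else.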
Jon: Great. I think that can wind up our conversation for today and hope that people were able to follow us without whiteboards and diagrams because this is tricky stuff.
Chris: It sets us up, I think, for the next time to start bringing this all together and how does this relate to–actually, now that we’ve set up the infrastructure, now that we’ve built the condominium building and we’ve laid in the utilities and we have electricity and plumbing, how do we move in our folks and actually start running our ECS workloads in here, the practical day-to-day use, like, “How do I connect to those machines that I have to do debugging?” and, “What does that mean if you’re on a private subnet?” We can get into all that.
Jon: Thank you, Chris. Thank you for educating us. Thank you, Rich, for producing this episode. Talk to you next week.
Chris: Thanks, guys.
Rich: Thanks, guys.
Rich: Bye. Well, dear listener, you’ve made it to the end. We appreciate your time and invite you to continue the conversation with us online. This episode, along with show notes and other valuable resources, is available at Mobycast. If you have any questions or additional insights, we encourage you to leave us a comment there. Thank you and we’ll see you again next week.