92. VPC Ninja – Part 3 – Moving an ECS Application to Private Subnets
Summary
In the first two episodes of this series, we learned how to build a VPC with public and private subnets. We did a deep dive on NAT, or network address translation, and then set up a software-only VPN for secure access to the private subnets.
Now, it’s time to put everything together and earn our cloud networking black belt.
This week on Mobycast, Jon and Chris conclude their three-part series on how to incorporate private subnets for your cloud network. We finish by explaining step-by-step how to move an existing ECS application onto our new private subnets. Now… go build, ninja!
Show Details
In this episode, we cover the following topics:
- We describe the existing application, which is a typical two-tier web application, with a web service fronted by an Application Load Balancer (ALB) and database hosted on MySQL using RDS.
- The current application is containerized and running under ECS.
- Everything (the load balancer, ECS cluster, RDS instance) is running on public subnets.
- The goal is to leave only the ALB public-facing, with all other resources protected on private subnets.
- There are two phases to moving the application to private subnets. First, we need to move the ECS cluster to private subnets. Then, we can move the RDS instance to private subnets.
- We detail step-by-step two deployment approaches for moving our ECS cluster to private subnets, both of which involve zero downtime.
- Rolling deployment, which updates the existing cluster in-place.
- Blue/green deployment, which creates a new cluster to replace the existing one.
- We discuss the steps on moving the database instance to private subnets, including application downtime considerations.
- As a bonus, we explain how to add encryption-at-rest to the RDS instance during the migration.
Links
- VPC with Public and Private Subnets (NAT)
- Changing the Launch Configuration for an Auto Scaling Group
- Add Encryption to an Unencrypted RDS DB Instance
- Amazon ECS-optimized AMIs
End Song
The Runner (Lost Lake Remix) by Fax
More Info
We’d love to hear from you! You can reach us at:
- Web: https://mobycast.fm
- Voicemail: 844-818-0993
- Email: ask@mobycast.fm
- Twitter: https://twitter.com/hashtag/mobycast
- Reddit: https://reddit.com/r/mobycast
Stevie Rose: In the first two episodes of this series, we learned how to build a VPC with public and private subnets. We did a deep dive on NAT, or Network Address Translation, and then set up a software-only VPN for secure access to the private subnets.
Now, it’s time to put everything together and earn our cloud networking black belt. This week on Mobycast, Jon and Chris conclude their three-part series on how to incorporate private subnets for your cloud network. We finish by explaining, step-by-step, how to move an existing ECS application onto our new private subnets. Now, go build, ninja.
Welcome to Mobycast, a show about the techniques and technologies used by the best cloud native software teams. Each week, your hosts Jon Christensen and Chris Hickman pick a software concept and dive deep to figure it out.
Jon Christensen: Welcome, Chris. It’s another episode of Mobycast.
Chris Hickman: Hey Jon, it’s good to be back.
Jon Christensen: Good to have you back. Here we are for part three of our VPC ninja series. I don’t know about you, Chris, but I feel like going through this VPC stuff with you, I’ve learned quite a bit.
Chris Hickman: Yes. Maybe not a black belt yet, but hopefully feeling like a ninja.
Jon Christensen: Right on. Yeah. I think, again… well, probably since this is part three of a series and this is… hopefully people have been binge listening, we’ll skip the pleasantries, we’ll just move right into sort of… maybe if you could give a recap, just for those people that haven’t been binge listening. And then you can press the Skip Intro button on your podcast app, if you don’t want to hear the recap. Oh, they don’t have that yet. Go ahead and give us the recap.
Chris Hickman: This is not Netflix. Not yet. Yeah, so recap. So previous two episodes, we covered public versus private subnets. So we talked about what are private subnets versus public and why we should care about them. We talked about NAT, Network Address Translation. We had a deep dive on that, and just really understanding how does that work. Which NAT gives rise to the ability to have these private resources, on their own private dedicated networks.
And then we talked about how you go about connecting to these private resources. And there’s various different techniques, but we kind of settled on, “Hey, VPN is the way to go here, because it’s the most flexible and most powerful at the least amount of cost and complexity of setup.”
So we talked about how to go set up a third party software-only VPN for very little expense. And then we updated our VPC to have these private subnets. So after those two episodes, we now have a VPC that has public and private subnets, and now we’re set up so that we can have only our public-facing resources on those public subnets. Everything else is going to be going on those private subnets.
Again, as part of the… just best practice, reducing the surface area that we have on our resources that are in our cloud. And then we have the secure access to these private subnets over VPN.
So now we’re ready to go ahead and move an existing application that’s on public subnets onto the private subnets. So we can take advantage of this.
Jon Christensen: Great. Yeah, just thinking about our listeners, and one of the things that I think is fun about this episode is that it’s really a how-to. And so it’s just kind of… while you’re driving or while you’re walking, you can kind of just imagine with us the steps that we’re going to go through to do this work, to move an application into a private subnet. But it’s also… if you do this this way, you can feel pretty confident that you’re doing things in a way that is going to be okay for your software.
So you are going to have software in production, and you’re following some best practices around keeping your data in places where bad actors can’t get to it.
It’s just good stuff to know and it’s also just probably pleasant to listen to, going through the how-to of this, and listening to us talk about each step, especially some of the whys on each step, things that I think, if you’re reading a quick tutorial or you’re reading a quick AWS blog entry, “Oh, here’s how to put something on a private subnet,” you might miss out on some of this back and forth that you and I will have on why we’re doing each part.
I think that’s why we’re doing it this way.
Chris Hickman: Yeah. Hopefully this is actually a pretty common use-case, right? Folks, when they first start using the cloud, AWS, or Azure, or Google, or whatnot, you’re probably going to be like, “Okay, everything’s running on public subnets, right?” You’re not going to necessarily spin it up with completely best practices and have the sophisticated deployment.
So hopefully this is, again, a pretty common use case, where like, “Hey, I’m using the cloud, I’ve got my stuff set up. I don’t have private subnets, I don’t have a VPN, but I kind of would like to have that, and understand how to do it.”
So hopefully that makes this that much more relevant and hopefully useful for folks out there.
Jon Christensen: Great. Yes, agreed.
Chris Hickman: Yeah.
Jon Christensen: Okay, let’s dive in.
Chris Hickman: Yeah, so let’s just talk about what… this current state that we have, right? And so this is kind of like all based upon me moving my… hosting my own personal blog inside my personal AWS account.
And so the current state of that was… So this blog is basically a simple two-tier web app, right? So it’s a web service that’s running the blog software, which is Ghost. So it’s an open source blog application that was inspired by WordPress, and I think some of the core WordPress developers left and did this Kickstarter to go build the next generation blogging engine.
So Ghost is that web app, a Node.js web app. It’s fronted by an ALB, and the actual app is containerized, and it’s running under ECS. And for persistence, we’re using MySQL on RDS.
So single RDS MySQL instance for this state, and then we can have multiple container instances of the Ghost service that’s fronted by the ALB for handling these requests.
And then with this… again, with the current state, everything is on public subnets. So the load balancer, the ECS cluster, and the RDS instance, they’re all on public subnets. And so we now went through this process of updating our VPC, so we now do have private subnets and a way to connect securely to those. Now, we want to change this state to now move everything onto the private subnets.
And so only the ALB will be public-facing, and everything else, we want to now move to the private subnets. And so that’s what we’ll talk about for the remainder of this. It’s like, “Okay, what are the practical steps? Like how do we do that? What are our choices? What are the options here? And just what exactly do we need to do, to achieve this?”
Jon Christensen: Great. And are we going to… I guess maybe those are some of the questions that we’ll be asking ourselves, but I’m curious, are we going to keep the blog running throughout this process, or are you going to take it down and then we’ll bring it back up when it’s on private subnets?
Chris Hickman: So, there is, from a web service standpoint, zero downtime to do this. We can do it in a way that there’s no downtime there.
Where it’s going to get a little tricky is the database.
Jon Christensen: Yeah.
Chris Hickman: So we’re going to have… So we could come up with a solution that had no downtime with database, but we’re not going to do that today. So we’re going to have a small amount of downtime, as we move to a new RDS instance.
Jon Christensen: Let’s do it. Here we go.
Chris Hickman: Yeah, so basically like what this boils down to, there’s two phases to do this migration. So one is, we need to move our ECS cluster to private subnets. So that’s the first phase. And then the second phase is going to be moving that RDS instance to private subnets.
And then also, just because we can, we’re going to enable encryption at rest for our RDS instance, as well, when we do that.
So those are the two phases. Like I said, the first phase, zero downtime. The second phase, we’re going to have a little bit of downtime.
Jon Christensen: I’m laughing, because this is starting to taste like security broccoli.
Chris Hickman: But it’s so yummy with the cheddar cheese sauce I’m going to put on it. Right? So… yeah.
Jon Christensen: So how step-by-step are we going to get, Chris? Like are we opening up our web browser and typing in aws.amazon.com?
Chris Hickman: Yeah, it’ll be a higher [inaudible 00:08:55], right? We don’t want people to fall asleep, but also want to kind of give enough detail so that hopefully after listening to this, you would be able to figure it out yourself, and actually be able to do it.
So we’ll try to keep it at that level.
So yeah, so let’s talk about this first phase. So the first phase is… Okay, we want to move our ECS cluster to private subnets. So what are our options there?
And I think, broadly, we have two choices. We can either treat this as a rolling update, or we can treat it as a blue-green deployment update. And so the difference between these two, with a rolling deployment, you are updating the instances in your application, in your architecture, one at a time, in place. So you have a mixture as you go through this deployment period. So you can imagine, like if you have five instances, one at a time, you’re changing them and say, “Okay, this instance is now being changed. It’s no longer on the public subnet. It’s on the private subnet. So now we have four on the public subnet and one on the private.”
And we keep doing that lather, rinse, repeat, until we’re done. So the important thing to keep in mind here with this rolling deployment is that you have a mixture as you’re going through the deployment process.
The blue-green is basically, “I’m going to spin up an entirely separate set of resources with the new version, and once I have that new set of resources up and running, then I’m going to now switch my application to use that. And then once that has been verified and everything’s good, then I can go ahead and tear down the old one.”
So we’re going to talk about both approaches here, and how you would do it. Pretty similar, but as we go through and talk about it, we can talk about the pros and cons of what you should do, and just give you an overall feeling for what level of work is involved.
Jon Christensen: Chris, do you happen to know off the top of your head why blue-green is called blue-green?
Chris Hickman: You know, I don’t. I mean-
Jon Christensen: I don’t either.
Chris Hickman: … someone chose it, right? Like… I mean, obviously… There’s a reason why it’s not called red-green.
So I think it was-
Jon Christensen: I mean, that’s the picture that kind of comes to my mind, is like you’re trying to make your way towards green, because green is good. But I don’t know why it’s not yellow-green, or what blue really means. I don’t know where that came from. Maybe we’ll talk about it at some point.
Chris Hickman: I think both colors kind of indicate they’re safe, they’re good. And that’s what a blue-green deployment is. Like so you have your existing current state, and that is good, solid, and it should have a color that’s associated with that kind of stableness.
And then as you do a deployment, you want to have your new system up and running that is also in a good state. So green means go. Like we know that.
So I mean, I think… you know, like what color would we choose, other than blue? Definitely we want green, but like… so brown-green? And we know that red, yellow, orange, those are definitely not colors that we’d want to associate with something being good and stable, right?
So there’s probably not a lot of choices here for that. Like aqua-green? Cyan? [Ecru 00:12:24]? So we can surmise.
Jon Christensen: Yes, yes, that’ll work for now.
Chris Hickman: It would be interesting to kind of do the research to find out like when did this first enter the vernacular.
Jon Christensen: Yeah, yeah, exactly. I mean, it’s been in there, and I’ve heard it, and I just have never bothered to look it up. And it’s not quite fully intuitive, but it is kind of intuitive enough for me to not go look it up.
Chris Hickman: Yeah.
Cool. So let’s walk through that rolling deployment, then, first.
So again, so what we’re going to be doing is we’re going to be updating our existing ECS cluster in place. And again, what we want to do is we want to move these host EC2 instances from a public subnet to a private subnet. So how do we do that?
So remember, with ECS, our EC2 host, the cluster is backed by a launch configuration and an auto scale group. So those are the things that we have at our disposal, to change.
So what we’ll do is we’ll… The first step is going to go in and we’re going to create a new launch configuration, based upon the existing one. And on this new launch configuration, we just want to make sure that we’re now no longer assigning public IPs to instances that are launched by this.
So that’s the first step. So create this new launch configuration by copying the existing one, and then making this change to say, “Hey, no longer assign public IPs.” So that’s the first step.
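For reference, a minimal boto3 sketch of that first step, copying an existing launch configuration while turning off public IP assignment. All names here are placeholders, and in practice you would also carry over things like user data, key pair, and block device settings:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Look up the existing launch configuration (name is a placeholder)
old = autoscaling.describe_launch_configurations(
    LaunchConfigurationNames=["ghost-ecs-public"]
)["LaunchConfigurations"][0]

# Create a copy whose only intentional change is "do not assign public IPs"
autoscaling.create_launch_configuration(
    LaunchConfigurationName="ghost-ecs-private",
    ImageId=old["ImageId"],                        # same ECS-optimized AMI
    InstanceType=old["InstanceType"],
    SecurityGroups=old["SecurityGroups"],
    IamInstanceProfile=old["IamInstanceProfile"],  # ECS hosts need their instance profile
    AssociatePublicIpAddress=False,                # the one setting we are changing
)
```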
Then the next one would be, we now need to go update our auto scale group. So when we update our auto scale group, we’re going to do two things. One, we’re going to… Your auto scale group specifies the networking, right? So you’re going to specify what subnets to use for this auto scale group.
So the existing one will be using… specifying public subnets. So we want to delete all those, and this is where we’re going to add our private subnets. So we’ll select our three private subnets that we created previously, and then we’ll also update the auto scaling group to use the new launch configuration, so that now we’re not assigning public IPs.
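And the matching auto scaling group update, again as a sketch with placeholder group, launch configuration, and subnet names: swap the public subnets for the private ones and point at the new launch configuration.

```python
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="ghost-ecs-asg",          # placeholder name
    LaunchConfigurationName="ghost-ecs-private",   # the new launch configuration
    # Comma-separated subnet IDs: the three private subnets replace the public ones
    VPCZoneIdentifier="subnet-priv1a,subnet-priv1b,subnet-priv1c",
)
```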
Jon Christensen: I mean, we could still assign public IPs, but they would just be useless.
Chris Hickman: Yes.
Jon Christensen: Just wasting our IP addresses, and therefore, our money.
Chris Hickman: Well, and not only that, but also really confusing, too, right?
Jon Christensen: Yes.
Chris Hickman: So like someone looking at this configuration would be like, “Wait a minute, I thought these are private. But why do they have public IPs?” Right?
It just… So you could do it, but it would be like-
Jon Christensen: That’s my favorite kind of Amazon Web Services troubleshooting.
Chris Hickman: Yeah. Yeah, exactly. Yeah.
And so now we have an updated… we’ve updated our auto scale group. Now the next thing is basically just to do our rolling update.
So to do that, all we need to do is just terminate the existing host EC2s, one at a time. So again, if we had five EC2s in our cluster, we can go pick one of those, and just terminate it, whatever your favorite method is for terminating an EC2 instance. So that will shut down and go away, and the auto scale group will see that, “Hey, my desired state was five machines, and now I only have four. So I’m going to go spin up a new one.”
And when it spins up a new one, it’s going to be using that new launch configuration.
Jon Christensen: Right.
Chris Hickman: And the auto scale group, again, we’ve updated it now to say, “Put things on the private subnets.”
So the new machine that comes up, it won’t have the public IP address assigned to it, and it will come up onto the private subnets. And so now it’s just lather, rinse, repeat. We just do this for the remaining EC2s that are in there. So you just-
Jon Christensen: And this is not going to mess with our load balancer? Isn’t our load balancer going to be like, “Oh, where’d they go? Where’d my machines go?” Or it’s like, “It’s fine,” it doesn’t care, it just knows how to… wherever these services are, it can get to them?
Chris Hickman: Right. So, you know, when you terminate the machine, it’s going to now [inaudible 00:16:21] the membership set for the ALB, because it’s no longer healthy, right?
So instead of having five instances in our membership set for the ALB, there’ll be four. And then once a new one spins back up, once the auto scale group spins up the new one, because it’s no longer at this desired size, and so it says, “Okay, I need to go create a new one.”
Once that one comes back up, it’s now going to pass the health check. And it will now be added back to the… sorry.
So when we get into ECS services, we will be talking more about the ALB and the health check. So actually, there’s nothing to do with ALBs at this point, right? Because this is just our cluster EC2s. So this is not the actual ECS services themselves.
So what happens here… So let’s talk about what happens with the existing ECS services.
So the existing ECS services… any services that were run… any containers that were running on that host EC2 that we terminated, they now go away, and ECS, the scheduler, will see that and it will now create new tasks on the other existing EC2s, and bring those up, and then they will get inserted back into the membership set.
So the ALB interaction is with the ECS service. It’s not directly with the EC2 host, because this is… remember, this is just our cluster of machines.
Jon Christensen: Right, right. But I was just… So the ECS service doesn’t care… when a new machine becomes available for it to put containers on, it doesn’t care that the new machine is in a private subnet? There was nothing that we had to do to the ECS service to say, “You can use this private subnet, too”?
Chris Hickman: No, no. So, I mean-
Jon Christensen: Oh, it’s interesting.
Chris Hickman: So the ECS is just backed by this auto scale group and launch configuration, right? So that is the connection that it has to what machines are involved with this particular cluster. So they’re just now… And remember, there’s not a lot of real difference between public and private subnets, other than their routing, right? So just like we’re using multiple availability zones to begin with, like these are just other subnets and other… likewise in different availability zones, and they just have… their networking’s a little bit different, right? Their routing’s different.
So instead of their route table having a direct route to an internet gateway, instead it doesn’t have that, and it has a route to a NAT gateway.
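The routing difference described here is a single default route: a private subnet’s route table sends 0.0.0.0/0 to a NAT gateway rather than an internet gateway. A sketch with placeholder IDs:

```python
import boto3

ec2 = boto3.client("ec2")

# Private subnet route table: the default route goes to a NAT gateway,
# not an internet gateway, so instances can reach out but can't be reached.
ec2.create_route(
    RouteTableId="rtb-private-1a",          # placeholder ID
    DestinationCidrBlock="0.0.0.0/0",
    NatGatewayId="nat-0123456789abcdef0",   # placeholder ID
)
```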
Jon Christensen: Right, right. Yeah, that’s interesting. I mean, it’s probably a little unintuitive, right? It’s probably a little bit like, “Oh my goodness, I can just move these… I can change my auto scale group so that it starts computers in a different subnet, and ECS doesn’t care. ECS is just going to continue to be able to orchestrate containers into those machines on that different subnet.”
Chris Hickman: Mm-hmm (affirmative).
Jon Christensen: That’s kind of interesting.
Chris Hickman: Yeah, at the end of the day, the only thing ECS cares about is that it can just talk to them, right? So some base networking has to be there, right? Like your ECS… If you spin yourself up into a new subnet, remember there is an ECS agent that’s on these EC2 machines, that’s communicating to the global ECS service.
So they need to have internet access. They need to be able to talk to the internet, to go talk to the ECS service. So-
Jon Christensen: Ah, not internet access, but VPC networking access, right?
Chris Hickman: No. Internet access, because ECS service, the global ECS service is not running inside your VPC.
Jon Christensen: Oh.
Chris Hickman: Right? So… And this is where you start… you get the things like VPC endpoints, and… it’s like S3, right? Or it’s like Lambda. Like the global… So the S3 itself, like if you want to make an S3 call, like it’s not inside your VPC. You’re actually going out over the public internet to talk to it, unless you do an S3 end point, and then that keeps all the traffic inside the VPC.
Jon Christensen: Right, right. And last week we talked about, when we were setting up those private subnets, giving them each their own NAT gateway, so that they could talk out to the internet.
Chris Hickman: Yes.
Jon Christensen: Not so that they’re reachable from the internet, but so they can talk out to the internet.
Chris Hickman: Yes, right. Which is what we need here, for… like what ECS needs.
So that the ECS agent code that’s run on each one of those EC2s, it needs to be able to have internet access so it can go talk to the global ECS service.
Jon Christensen: Mm-hmm (affirmative). Okay, cool.
Chris Hickman: Yeah.
Jon Christensen: Makes sense.
Chris Hickman: So hopefully that kind of shines some light on what the process is.
And another thing to just keep in mind: as long as we have at least two tasks running for each one of our ECS services, when we do all this, there’s going to be no downtime. So again, the underlying EC2 host will be terminated, so all tasks running on that EC2 host will go away.
But as long as we’re running more than one task for each one of those services, that means that there’s still a good running one on one of the other EC2 machines. And then again, the ECS scheduler will notice that, “Oh, you wanted two tasks for your blog service, but now there’s only one. So I need to spin up a new one.”
So it’ll spin up a new task, so that you are running two again, now on the remaining four machines.
Jon Christensen: And it’s responsible for figuring out how to basically balance the tasks across the machines. And I guess one thing that you would probably want to keep an eye on is if you are knocking out your EC2 instances and you see other ones spin up in the private subnet, but you weren’t seeing any tasks migrate over to those instances on the private subnet, and all your tasks are kind of clustering up on the one that’s still in the public subnet, you wouldn’t want to kill that one, and you’d want to figure out why.
I don’t know why that might happen, but you’d want to make sure not to… if all of your tasks end up on the last machine that you’re going to kill, make sure not to kill it, and figure out what’s going wrong. You made a mistake somewhere.
Chris Hickman: Yeah-
Jon Christensen: Because ECS should be smart enough to send those tasks over to the other instances, unless it can’t.
Chris Hickman: Yeah. And I mean, the… So the ECS scheduler… very complex, very complicated, lots of different strategies and options here. It’s not a trivial problem of saying, “Hey, I have this set of resources on which I can run tasks. And given that I need to run a task, like where does it go?”
And so there’s placement strategies. You have control over this. You can tell ECS what placement strategies to use. So do you want to do [crosstalk 00:23:23] bin packing or do you want to do spread?
So it’s like… by default, it’s going to do the reasonable thing, and it is going to do a spread. So it’s going to do its best to schedule these things on different machines and on different subnets. We could spend a whole episode on this.
Jon Christensen: Right, but I just thought of an obvious reason it could happen, why all your tasks are sort of bunching up on one machine, would be… your auto scale group is sending them off to the private subnets. So you changed your auto scale group to say, “New machines go on these private subnets,” and you forgot to put in that NAT gateway on any of the private subnets.
ECS is going to realize that it can’t put… ECS is not going to be able to put tasks on those machines, because the agent can’t say, “Hey, I’m here. I’m ready. I’m good to go.”
Chris Hickman: Yeah, in fact, you won’t even see the EC2 in your ECS cluster. It won’t even show up-
Jon Christensen: Oh, right. Right.
Chris Hickman: Right? Because the agent can’t connect to the global ECS service, to register. So you wouldn’t even see that.
So you would see-
Jon Christensen: You would just see your number of machines going down, down, down?
Chris Hickman: Yeah. So you would see, in ECS, your ECS cluster now only has four machines, instead of five. In the EC2 console, you would see you’d have five EC2 instances now spun up.
But again, there’d only be four in your cluster. And so that is-
Jon Christensen: That’s interesting. That’ll tip you off.
Chris Hickman: … that’s because… Yeah, that’s because the ECS agent couldn’t register with ECS, because it doesn’t have internet access.
Jon Christensen: Mm-hmm (affirmative). And that error won’t be in your ECS console, because ECS won’t know about it.
Chris Hickman: It didn’t even get to that point.
Jon Christensen: Yeah, exactly.
Chris Hickman: It can’t even talk to it. Yeah, so-
Jon Christensen: So that would be a really tricky thing to troubleshoot, right? Like, “I don’t see an error in ECS, so ECS must be fine,” but it’s actually your subnet that’s misconfigured. That’s tricky. That’s tricky, for sure.
Chris Hickman: Yeah, and you know, I think I’ve seen this in the wild multiple times, from different teams-
Jon Christensen: I know.
Chris Hickman: … making this mistake. So it’s-
Jon Christensen: I think we even have a Mobycast episode about this.
Chris Hickman: It’s kind of a common one. So, you know, just keep it in the back of your mind. Like if something like that happens, this is probably why.
Jon Christensen: Cool.
Chris Hickman: Yeah. Check your networking.
Cool. So that is the rolling deployment, right? So we… That is really all we need to do to move from public to private, if we want to do this rolling deployment where we have this mix. And it kind of requires a bit more, maybe, hand-holding to kind of… I mean, you could script all this stuff too, as well, if you wanted to get fancier. But that’s the general flow.
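If you did want to script the rolling part, a simple sketch of one way to do it, with a placeholder group name and a naive pause between hosts (in practice you’d poll the ECS cluster or ALB target health before moving on): terminate the hosts one at a time and let the auto scaling group replace them on the private subnets.

```python
import time
import boto3

autoscaling = boto3.client("autoscaling")

asg = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=["ghost-ecs-asg"]     # placeholder name
)["AutoScalingGroups"][0]

for instance in asg["Instances"]:
    # Terminate one host; the group drops below its desired capacity,
    # so it launches a replacement using the new launch configuration.
    autoscaling.terminate_instance_in_auto_scaling_group(
        InstanceId=instance["InstanceId"],
        ShouldDecrementDesiredCapacity=False,
    )
    # Naive pause so we never lose more than one host at a time; better:
    # wait until the replacement registers with the ECS cluster and tasks are healthy.
    time.sleep(300)
```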
Jon Christensen: We cover a lot of information here on Mobycast. And if you’ve ever wanted to go back and remind yourself of something we talked about in a previous episode, it can be hard to search through our website and transcripts to find exactly what you’re looking for.
Well now, it’s a lot easier. All you have to do is go to Mobycast.fm/show-notes, and sign up. We’ll send you our weekly, super-detailed outline that we use to actually record the show. A lot of times, this outline contains more information than we get to during our hour on the air.
So sign up, and get weekly Mobycast cheat sheets to all of our episodes, delivered right to your inbox.
Chris Hickman: So let’s talk a little bit, then, about the blue-green deployment and how that works.
So we said, with blue-green, we want to duplicate our solution with the new design. And once we’ve verified that that’s up and running, then we can just switch it over, and then tear down the existing one.
So what we’re going to do is we’re going to create a new ECS cluster, that will eventually replace the existing one.
And so this is the same process that we’ve talked about, actually, in the Fargate episode series and also in the original ECS one, right? So we’re just going to go into ECS, create a new cluster, we’re going to choose the EC2 Linux and networking cluster template. We can go and specify all the normal stuff that we do, like instance type, and options, and our networking.
We’ll create a security group for this, and we’re going to… I mean, just going into a little bit of the details there on that security group, because these are the EC2 machines, they’ll be accepting traffic from the ALB. Because it is an ALB, we’ll use dynamic routing, dynamic ports.
So we just need to make sure that we allow the ephemeral range of ports for our custom inbound TCP rule, and that it’s only coming from the ALB. So for this, you will just specify the security group of the ALB to be the source, and use that ephemeral port range as the valid ports.
And then we can have a single other rule on there, to allow inbound SSH, but only coming from within the VPC itself.
So that’s kind of like a minimal security group, that still allows us to SSH into those machines, if we want to.
Jon Christensen: Can you… Yeah, say that again. Like you just kind of jumped up a level, to sort of say what you just said, in a more kind of digestible way. Because you kind of walked through, “This is what… We’re going to do these things with the ephemeral ports if we’re going to do these things with allowing SSH in a certain place.”
But what’s the kind of user requirement that this is meeting, of this security group?
Chris Hickman: Yeah, so this would just be an example of minimal security group on the ECS host EC2s. There’s really only two things that we want doing inbound connections to them.
One is connections from the ALB that are forwarding these requests that are coming in. And then the other one would be SSH, if we wanted to SSH into these machines.
So we’ll lock down that first rule for requests coming in from the ALB. They’re going to be coming in on the ephemeral port range, and we’ll lock down the source to only allow from the ALB security group. So that’s pretty tight and controlled.
And then we’ll also have the rule for SSH, that will only allow sources coming from inside the VPC. So that means that we would have to be on a VPN connection before we could-
Jon Christensen: Got it, got it.
Chris Hickman: … in order to pass that rule.
So pretty locked-down.
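A boto3 sketch of that minimal security group, assuming the default dynamic host port range on the ECS-optimized AMI and placeholder VPC, CIDR, and security group IDs:

```python
import boto3

ec2 = boto3.client("ec2")

sg_id = ec2.create_security_group(
    GroupName="ghost-ecs-hosts",
    Description="ECS container instances for the Ghost cluster",
    VpcId="vpc-0123example",                      # placeholder ID
)["GroupId"]

ec2.authorize_security_group_ingress(
    GroupId=sg_id,
    IpPermissions=[
        {   # dynamic host ports used by ALB dynamic port mapping,
            # allowed only from the ALB's security group
            "IpProtocol": "tcp",
            "FromPort": 32768,                    # default ephemeral range on the ECS-optimized AMI
            "ToPort": 65535,
            "UserIdGroupPairs": [{"GroupId": "sg-alb0123example"}],  # placeholder ID
        },
        {   # SSH, allowed only from inside the VPC (i.e. over the VPN)
            "IpProtocol": "tcp",
            "FromPort": 22,
            "ToPort": 22,
            "IpRanges": [{"CidrIp": "10.0.0.0/16"}],                 # placeholder VPC CIDR
        },
    ],
)
```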
Jon Christensen: Yep. Got it. Okay.
Chris Hickman: Yeah.
And then as we create our cluster too, we’re going to notice a new option here for Container Insights. And so this is much more interesting CloudWatch monitoring of ECS services and tasks, and just gives you a lot more insight into what’s going on with your system. It’s going to cost you a little bit more, in the CloudWatch metrics. So it’s just something to keep in mind, there. It’s not free, but it’s also not terribly expensive, either. But this is where you could enable it.
So just something to be aware of. If you want ECS container insights, you can’t enable it for an existing cluster. You can only do it when you create the cluster.
Jon Christensen: Oh, interesting. And terrible.
Chris Hickman: Yeah.
Jon Christensen: And probably already not true anymore, right? Like probably there’s a AWS tweet or something, as we speak, that says, “Now, you can enable CloudWatch container insights on your running clusters.”
Chris Hickman: Yeah, who knows. We are in announcement season right now, as we lead up to… We just did a… We’re breaking the time-space continuum.
Jon Christensen: Yes, yes, we did.
Chris Hickman: But yes, who knows, by the time you hear this, it could be different.
Jon Christensen: Cool.
Chris Hickman: So we have our new ECS cluster, and after that… that will spin up. So these machines are going to come spin up. We specified our private subnets in the networking, when we created that cluster. So that’s good.
But we don’t have the ability there to disable public IP address assignment. So it’s going to follow whatever the default is for your VPC.
So if we want to, we can go in and, like we did before, we could create a new launch configuration where we could change that and basically say, “Don’t assign the public IP addresses.” And then we could update the auto scaling group and… do what we did before.
So you could do that, or you could also, like I said, just make sure that you set the rule on the VPC side to make that the default, as well.
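In the API, that default is actually a per-subnet attribute (MapPublicIpOnLaunch); a sketch with placeholder subnet IDs:

```python
import boto3

ec2 = boto3.client("ec2")

# Turn off auto-assign public IP on each private subnet, so instances
# launched there do not get a public IP by default.
for subnet_id in ["subnet-priv1a", "subnet-priv1b", "subnet-priv1c"]:  # placeholder IDs
    ec2.modify_subnet_attribute(
        SubnetId=subnet_id,
        MapPublicIpOnLaunch={"Value": False},
    )
```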
Jon Christensen: So that’s a VPC-level rule where you could decide new EC2s get a public IP address by default, or not?
Chris Hickman: Mm-hmm (affirmative).
Jon Christensen: Okay. Is it kind of like, “Hey everyone, you should probably set the default to not what it normally is”?
But it’s… like, “A good security best practice is to go change the default”? Is that what I’m hearing? Kind of?
Chris Hickman: It’s complicated, because it just depends on what your use cases are, what… how you’re creating things in your VPC, and whether or not that’s… do you want to change that default. So it’s-
Jon Christensen: And it’s kind of interesting. I mean I guess I get it-
Chris Hickman: People have to be on the same page, right? Like it’s just got to be something that folks are aware of. Otherwise, it leads to confusion.
Jon Christensen: Right, right. And it’s… It makes sense that the default would be to make public machines, because AWS would have a lot of people complaining, that are first-time users, if all of a sudden they were making networks and they couldn’t get to any… “I’m going to make an EC2, and I’m going to check out AWS. Wow, how do you even get to it?”
Chris Hickman: Right.
Jon Christensen: “Oh, well let me tell you about NAT gateways… or internet gateways.”
Like, “I only wanted to make a machine. I just wanted to test-drive AWS.”
Chris Hickman: Right. Yeah, and this is why the default [NACLs 00:33:21] allow all traffic, instead of denying all traffic.
Jon Christensen: Mm-hmm (affirmative). Interesting.
Chris Hickman: Yeah. Unless you create a new NACL, and then the default is that it blocks everything.
Jon Christensen: Right.
Chris Hickman: Yeah.
So we have our new cluster now, and those EC2s are set up the way we want. And then the next step would be to create a new ECS service on that new cluster. And we can use our existing task definition, for the existing one.
Jon Christensen: Too bad you can’t just clone the service, right? Like make a new service based on this other service.
Chris Hickman: Yeah. I mean that’s not… We’re on a whole different cluster, too. So it’s… yeah.
But you could probably… I mean, you could… Who knows. Maybe there’s an open source project out there, some scripting that will kind of do that for you. It’s actually not too terribly difficult to do, to go get the service description and use that to go create a new service. It’s probably pretty easy to do yourself, if you wanted to.
If you were doing this a lot, you might decide it was worth the effort to do that.
Jon Christensen: Well and I just also think that it’s sort of like kind of a typical user experience thing. As software creators, whenever you have a UI where people have to do a lot of work, and you then have a storage place where that work gets saved, if people want to go create a new thing that’s going to take work, it’s a really nice feature to say, “Hey, do you want to base your new work on something you’ve already done work on?” It’s just like a classic user experience good thing to do. Okay, AWS?
Chris Hickman: Well they did it for launch configurations-
Jon Christensen: Right, right.
Chris Hickman: … so we can-
Jon Christensen: Someone knows about this classic thing, yes-
Chris Hickman: … be grateful for that, yeah.
Cool. So once we have our new service, it’s going to then start running our tasks. Once we can… So we verify that our tasks are up and running and healthy, and at that point, we can now update our load balancer to now forward traffic to the new target group, for the ECS service.
So this is now… Now we’re going from the blue to green switch. And it really is just updating the ALB to say, “Instead of routing to this target group, go to this other target group.”
So quick and easy change on the ALB, and at that point, we now are free to delete our original ECS cluster. And so to do that, we can just go in and update the service, say, “We want zero tasks,” we can then kill any running task if we want to, to speed that up. And at that point, we can delete the cluster.
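The blue-to-green cut-over itself is one listener update on the ALB; a sketch with placeholder ARNs:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Repoint the listener's default action at the new service's target group.
elbv2.modify_listener(
    ListenerArn="arn:aws:elasticloadbalancing:us-west-2:111111111111:listener/app/ghost-alb/EXAMPLE/EXAMPLE",        # placeholder
    DefaultActions=[{
        "Type": "forward",
        "TargetGroupArn": "arn:aws:elasticloadbalancing:us-west-2:111111111111:targetgroup/ghost-green/EXAMPLE",     # placeholder
    }],
)
```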
So a gotcha that you might run into here is that [CloudFormation 00:36:11]… So ECS, when it creates a cluster, it’s actually using CloudFormation behind the scenes… a CloudFormation stack to build out all the resources that it needs, because it’s building out things like a launch configuration and an auto scale group, and it’s doing some security group creations and whatnot.
So when you go and delete the cluster, it’s now basically doing a CloudFormation stack delete, and this is where, typically, I know me personally, I always run into issues where the stack delete’s going to fail. And it’s usually because there’s circular dependencies between security groups.
So when you lock down your security groups, where you’re saying, “Oh, the source has to be only from the security group used by the ALB.” And then we have a… allowing… maybe the database instance security group only allows traffic from the ECS security group, and so you start getting these circular dependencies. And CloudFormation gets confused, and it just doesn’t know what to do.
So if that happens to you, the solution is you just manually remove those rules in the security groups that are referring to those circular dependencies. And once that happens, then you can go and delete the cluster, and then CloudFormation will be able to proceed and tear down the stack for you.
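Removing the circular reference is just a revoke on whichever security group still points at the other; a sketch with placeholder IDs, using the same ephemeral-port rule from earlier:

```python
import boto3

ec2 = boto3.client("ec2")

# Drop the rule that references the ALB security group, so the
# CloudFormation stack delete for the old cluster can proceed.
ec2.revoke_security_group_ingress(
    GroupId="sg-old-ecs-hosts",                                   # placeholder ID
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 32768,
        "ToPort": 65535,
        "UserIdGroupPairs": [{"GroupId": "sg-alb0123example"}],   # placeholder ID
    }],
)
```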
Jon Christensen: Makes sense.
Chris Hickman: Yeah.
So now we have our… We’ve maneuvered our ECS host from public subnets to private subnets. We went through the two broad approaches there, the rolling and the blue-green style.
And so now the last thing that remains is we need to move that RDS instance that’s on public subnets over to private subnets.
Jon Christensen: And, you know, secretly we also just totally taught you how to do a migration, even though it was under the guise of, “I’m going to move stuff from public subnets to private subnets,” it’s also like, “Hey, you can use this rolling or blue-green deployment strategy for any other kind of reason that you might want to move off of a cluster.”
Maybe you want to change all your instance sizes. Maybe you want to… tons of other reasons you might want to do this.
Chris Hickman: Yeah. See, it’s not just security broccoli that we sneak in there. There’s other vegetables.
Jon Christensen: Yes.
Chris Hickman: All right. So we need to get this… So this RDS instance is currently on public subnet. Let’s move it to a private one. So what do we need to do?
So the first thing is we are going to create a new RDS subnet group. And so RDS uses subnet groups as ways of informing it where these databases should be placed. And so we’re going to create this new subnet group, and that subnet group will contain the three private subnets that we have for our VPC.
Jon Christensen: Got it, got it.
Chris Hickman: So pretty easy, pretty straight-forward.
And then we’ll also want to create a new security group for the instance. And so, for me, I wanted to just lock this down to basically only allow… this is MySQL, so only allow the MySQL port traffic, so 3306. And the source for that will be the security group from my ECS machines.
So basically, I’ve locked it down so that the only thing that can connect to that RDS instance, it has to originate from an ECS host machine, and that’s it.
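A sketch of both preparation steps together, the private DB subnet group and the locked-down database security group, with placeholder IDs:

```python
import boto3

rds = boto3.client("rds")
ec2 = boto3.client("ec2")

# Subnet group that tells RDS to place the instance on the private subnets
rds.create_db_subnet_group(
    DBSubnetGroupName="ghost-private",
    DBSubnetGroupDescription="Private subnets for the Ghost database",
    SubnetIds=["subnet-priv1a", "subnet-priv1b", "subnet-priv1c"],  # placeholder IDs
)

# Security group allowing MySQL (3306) only from the ECS host security group
db_sg = ec2.create_security_group(
    GroupName="ghost-rds",
    Description="MySQL access from ECS hosts only",
    VpcId="vpc-0123example",                                        # placeholder ID
)["GroupId"]

ec2.authorize_security_group_ingress(
    GroupId=db_sg,
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 3306,
        "ToPort": 3306,
        "UserIdGroupPairs": [{"GroupId": "sg-ghost-ecs-hosts"}],    # placeholder ID
    }],
)
```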
Jon Christensen: Cool.
Chris Hickman: I could actually add another rule, if I wanted to. If I wanted to actually connect to my database directly from my favorite database tool, whether it be…
Jon Christensen: Don’t do it.
Chris Hickman: … Navicat. Well, I mean, Navicat or whatnot.
So you can say like, “Oh, I want this over…” I can require it be over VPN, and then I would open it up and say, “Okay, any VPC traffic could connect to it over port 3306.”
Jon Christensen: Right. And then you can just fix production issues right on the database. Just get in there and do it.
Chris Hickman: Exactly. Exactly.
So-
Jon Christensen: That’s a bad idea, by the way. That was sarcasm. Like, don’t. Don’t.
Chris Hickman: Yeah. Well we were joking, like for the [re:Invent 00:40:33] this year, they made the keynotes… you had to reserve to get into them. This is a new thing. Like they’ve never had reservations for keynotes before.
Jon Christensen: Oh yeah.
Chris Hickman: And this year they were saying like, “Hey, not only can you reserve, but you have to reserve. And if you don’t reserve for the keynotes, then you’re not going to be able to go to those sessions.”
And they said, “We’re going to do this in waves. At this time, on this date, is when we’re going to open it up. We’ll change the catalog so that now you can go reserve these keynotes, and then they’ll be available for a certain amount of time, and then they won’t be available. And then we’ll open up another window at some later point in time.”
And so this whole… It just felt like the whole re:Invent session catalog was not geared up for this kind of functionality-
Jon Christensen: No, it certainly wasn’t.
Chris Hickman: … and they were kind of like hacking it into it.
So it was supposed to… starting at 8:00 AM Pacific Time was when they were going to allow you to reserve for the keynotes. And so of course everyone gets on a few minutes before, and hitting refresh, and the keynotes aren’t showing up. And 8:00 comes by, still no keynotes. Keep refreshing, refreshing. 8:02, 8:03, 8:04 AM, still nothing. Refresh.
You can just imagine. Like there’s tons of people refreshing, like just hammering the servers.
And then you can also imagine, over in AWS-land, like there’s some person sitting at a keyboard, with a SQL connection up to the database, and typing-
Jon Christensen: It’s saying there’s a syntax error in MySQL.
Chris Hickman: … “Insert…” Yeah.
But they’d be running a manual command, to insert into catalog, keynote…
So I wouldn’t be surprised. So you’ve got people standing around the person, watching it happen. And then-
Jon Christensen: It’s 8:07-
Chris Hickman: … so it took a few extra minutes. Yeah.
And it’s like, “Phew, okay, it worked,” you know? And then you see something like, “Okay, seven rows affected. Yay.”
Jon Christensen: I mean, that was a little [inaudible 00:42:34] of like… I’ve certainly been on a production database, trying to fix live data that’s broken, before. And it’s an ugly thing to have to do. It’s always better to try to write at least a well-tested script, or if you’re actually typing SQL into a console on production, just really… you’ve just got to ask yourself what you’re doing. It’s not a good idea.
Chris Hickman: Mm-hmm (affirmative). Indeed. There’s… The risk is so, so high. And it’s one of those things, too, it’s like everyone gets bit at some point in their career.
Jon Christensen: Yep.
Chris Hickman: And I remember once, 15 years ago, where I was doing some update to the database, and hit F5 to commit it, and that horrible sinking feeling when I saw the message, “200,000 rows affected.”
And so I was doing an update statement, and I forgot the “where” clause.
Jon Christensen: Oh my god.
Chris Hickman: So basically… And it was like my order tracking system. And so it’s like all order tracking just now got blown away.
And so it was like, okay, spend the next six hours testing my backup restore process, and whether or not that was functioning. And it’s just not a fun time. So you learn these things the hard way, for sure.
Jon Christensen: I forgot where we even were, because how did this even come up?
Chris Hickman: So we were talking about moving our RDS instance to private subnets. We’re in the preparation phase, if you will. So we’ve created our RDS subnet group, that is using the private subnets, and then we also have created a security group for our RDS instance. And so we’ve locked that down to have a single inbound rule, on port 3306, that can only come from our ECS instances, and not from anywhere else, so that we can’t connect to it like we just talked about, and issue that update without the “where” clause.
Jon Christensen: Yes, there we are.
Chris Hickman: Yes. All right.
So we’ve done the preparation. Now what we’ll do is we’ll create a snapshot of the existing database. So that’s something really easy to do, inside the RDS console or through the CLI.
Once we have the snapshot, we’re going to go ahead and… let’s enable encryption at rest for our RDS instance, just because we can and it’s available.
So to do that… And by the way, this is the same process you would use if you had an existing RDS instance and you wanted to apply encryption at rest to it.
So basically, the formula is, create a snapshot of the existing database, then you create a copy of the snapshot. And when you create the copy of the snapshot, you’ll then have the option to enable encryption.
And so you can use KMS for the encryption, so just specify which KMS master key to use, or create a KMS master key if you don’t already have one. Use that for the encryption.
And then now you can restore the… Now you have an encrypted snapshot, and now you restore that snapshot to a new DB instance.
And so we’ll do that. So in the snapshots area of RDS, we can go and choose to restore the snapshot, and when we do that, it’s going to ask us what kind of instance we want. So we can choose our instance type. We want to make sure that the public accessibility option is set to “No,” because we don’t want this to be public. And then just make sure we’re specifying the correct VPC, the subnet group, and the security group that we set up previously.
And so once we’ve done that, we now have a new database instance that’s up and running, that is now on the private subnets, and now has encryption at rest.
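The snapshot / encrypted-copy / restore formula, as a boto3 sketch with placeholder identifiers and a placeholder KMS key alias:

```python
import boto3

rds = boto3.client("rds")

# 1. Snapshot the existing (unencrypted, public-subnet) instance
rds.create_db_snapshot(
    DBInstanceIdentifier="ghost-db",
    DBSnapshotIdentifier="ghost-db-premigration",
)
rds.get_waiter("db_snapshot_available").wait(DBSnapshotIdentifier="ghost-db-premigration")

# 2. Copy the snapshot, enabling encryption with a KMS key
rds.copy_db_snapshot(
    SourceDBSnapshotIdentifier="ghost-db-premigration",
    TargetDBSnapshotIdentifier="ghost-db-premigration-encrypted",
    KmsKeyId="alias/ghost-db",                       # placeholder key alias
)
rds.get_waiter("db_snapshot_available").wait(
    DBSnapshotIdentifier="ghost-db-premigration-encrypted"
)

# 3. Restore the encrypted snapshot as a new instance on the private subnets
rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier="ghost-db-private",
    DBSnapshotIdentifier="ghost-db-premigration-encrypted",
    DBInstanceClass="db.t3.micro",                   # placeholder instance class
    DBSubnetGroupName="ghost-private",
    VpcSecurityGroupIds=["sg-ghost-rds"],            # placeholder ID
    PubliclyAccessible=False,
)
```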
So maybe… You know, something to point out, maybe everyone has kind of realized, is this is kind of assuming this perfect world where no new updates are coming in to the existing database, right?
So luckily… I mean, for me, it’s a blog. So I don’t have this issue, because I know I’m not going to go publish any new blog entries while I’m doing this.
So obviously it gets… you have to do some other… it gets more complicated if you actually did have a system that is up and running, that is doing updates. What do you do? Do you kind of go into maintenance mode, so that you disable all writes while you make the snapshot and copy and then build up a new database? Or do you have to do some sort of… more of a zero-downtime deployment? How do you take care of that? That’s even more complicated.
But we’ll leave that as an exercise to the reader, or for a future episode of Mobycast.
Jon Christensen: Okay.
Chris Hickman: So we have this new database instance up and running now, and so the last thing that we need to do is we just need to update our application to use that. And so… And this whole process here, this is where there’s like… from creating the snapshot to updating the application to use the new database, that’s all time where… your application… there’s going to be some downtime, or there’s going to be some disruption to it, because you can’t have writes going onto your database while you’re with this particular… simpler approach to it.
So for me, to point my blog application to the new database instance, that’s just… so it’s a new database endpoint, and so I just need to change the configuration for that particular service to use the new endpoint, and the way that the configuration works for Ghost is… then it just needs to be restarted.
So ECS has a nice option called… when you go and update an ECS service, you can do a Force New Deployment, and it’s just a checkbox there. So this means that I don’t need to do a new task definition, I don’t really need to do anything. I can just go and update ECS service, check the Force New Deployment, and what that will do is that will just basically terminate each existing container for that service, and then spin up a new one.
And so when it spins up, it’s going to get the new configuration file, read the configuration change, and now it’s going to connect to the new RDS instance.
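That force-new-deployment step is one service update call; a sketch with placeholder cluster and service names:

```python
import boto3

ecs = boto3.client("ecs")

# Same task definition, same service -- just recycle the tasks so they
# come back up reading the new database endpoint from configuration.
ecs.update_service(
    cluster="ghost-cluster",       # placeholder name
    service="ghost-blog",          # placeholder name
    forceNewDeployment=True,
)
```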
And at that point, I’m now free to delete the original RDS instance, and we’re now done. Now everything is now on private subnets, except for the public-facing ALB. And everything else is protected, and we can now use VPN to connect to the machines, if we have to. And we are pretty locked-down.
Jon Christensen: Cool. Very cool. You know, this whole switching out of the database reminded me of another episode we did where we talked about how Aurora Serverless works. And it just reminded me of the really cool engineering that they did to be able to swap out a database without having to update your ECS configuration, or any kind of configuration that talks to that database, by putting a kind of a proxy between the database and the world, and then hot-swapping out databases underneath.
Just reminded me of that episode. If you’re new to Mobycast, you can go listen to that and check it out, and it’ll remind you of this episode.
Chris Hickman: Yeah.
Jon Christensen: Yeah. So very cool, we did it. We moved it over.
Chris Hickman: Yes. Yeah, so hopefully now you feel like a… very much a VPC ninja, understanding public versus private subnets, and just some of the new complications that arise from that, and with VPNs and then just the practicality of how do you actually migrate these things that currently exist.
Jon Christensen: Yep. I definitely do. There’s so much… Like we were talking about in the last episode, there’s so much about networking that is kind of new for software developers to have to contend with in our day-to-day jobs, that we used to sort of leave to another team, or used to be just kind of static. “Oh, that’s going to get set up in the data center.” Once it’s set up, you never think about it again.
And now all of a sudden, it’s very much a dynamic part of your day-to-day life. Where are machines? Are they coming and going? What are their IP addresses? How are they getting assigned? What kind of subnets are set up? Are they available through the internet or not? How are routing tables… how do they work, and do I need to adjust them to be able to talk to the internet?
All that cool stuff is stuff we didn’t… you know, I didn’t really have to spend a lot of time on that, between when I started doing software stuff in ’99, and really all the way up to probably around 2010 or so. I didn’t think about it too much.
2012 and on is really when AWS has picked up, more than ever. So yeah.
Chris Hickman: Yeah. Yep, with the advent of the cloud, networking becomes something that we need to just be more involved with and cognizant of.
So in addition to security broccoli, we will also be talking about networking cauliflower, I think.
Jon Christensen: Yes.
Chris Hickman: And this will be just an ongoing thing.
Jon Christensen: Very cool. Well thanks for the detailed explanations and the education, Chris. I appreciate it.
Chris Hickman: You bet. Thanks, Jon.
Jon Christensen: Talk to you next week.
Chris Hickman: Bye.
Stevie Rose: Thanks for being aboard with us on this week’s episode of Mobycast. Also, thanks to our producer, Roy England. And I’m our announcer, Stevie Rose.
Come talk to us on mobycast.fm, or on Reddit at r/mobycast.