The Docker Transition Checklist

19 steps to better prepare you & your engineering team for migration to containers

03. An Introduction to Elastic Container Service (ECS)

Jon Christensen and Chris Hickman of Kelsus discuss Amazon’s Elastic Container Service (ECS) – how it works and why they use it. The term “container” comes from stacking containers on ships, and containers offer several advantages. But how does containerization relate to scaling on a large application?

Some of the highlights of the show include:

  • We use the analogy of a Docker container being like a cargo container that is stacked on a container ship. The ship is like the host computer where containers run. Containers are uniform and can be stacked on any ship.
  • The lifecycle of using a container includes choosing where to run it, loading a container image on the host machine, allocating host computer resources to the container, starting the container and making sure it stays running, and eventually updating it with a new image when your software changes.
  • It is possible for one person to manage one or a few containers. But what happens when there are 50 containers? Elastic Container Service (ECS) builds on top of the container technology as an orchestration system, serving as the “conductor of the orchestra.” It dictates who does what, when, and how.
  • Using the ship analogy, an orchestration system is like the head foreman at the dock who tells the cranes how to load containers onto the ships in an optimal manner.
  • ECS automates these orchestration tasks. It allows you to scale applications and eliminates manual tasks.
  • What is your infrastructure philosophy when delivering software? Microservices is an industry trend that advocates breaking a single application into many separate components or services where each service does one thing and does it well – a single concern.
  • To achieve reliability and high availability, you should have multiple instances of each service running in case one goes down. If you are running five services, then you’ll need 10 separate containers at minimum.
  • Pets vs. Cattle analogy: When you manage your own servers, you tend to treat them like pets: You give them a name, you bond with them, and you take care of them for the long-term. With the cloud, you treat your computing resources more like cattle: they have an ‘ear tag’ with a number, and we don’t need to care about any particular server – each server can come and go.
  • There are benefits to short-lived machines compared to long-lived ones. The longer the machine is used, the more issues you have to deal with: entropy, fragmentation, memory leaks, hardware failures. Short-lived machines provide only the resources you need at a specific time. If it has problems, put it ‘out to pasture’ and get a new machine.
  • There are various orchestration systems available on the market, including Docker Swarm, Kubernetes, and Amazon ECS. Jon and Chris choose to use ECS because it is well integrated with all the other Amazon services – so much bang for your buck!
  • An ECS ‘Cluster’ is a set of EC2 (Elastic Compute Cloud) machines – like a fleet of ships ready for containers. EC2s are machines (virtual or real) that serve as the ships, and your software products are loaded into the containers. A cluster is the set of machines (EC2s) onto which the orchestration tool (ECS) can deploy containers.
  • An ECS ‘Service’ defines a collection of containers that you want to run together as a long-running application. Services could be handling inbound requests, or background processes.
  • The Service defines how many containers to run, how to start the software, and the security credentials (identity) that each process will assume at run time.
  • ECS manages your services for you and keeps them running. It also manages load balancers in front of your cluster, if those are needed. The ability to seamlessly manage load balancers is a good example of an advantage of ECS over third-party orchestration tools when you’re running inside AWS. Third-party orchestration tools require many more pieces of your infrastructure to be deployed and managed outside the orchestration tool.
  • AWS Cloud Formation enables infrastructure as code. It’s a tool for automating the creation, update, and deletion of infrastructure components, including computers, load balancers, security groups, etc. It can replace the manual process of configuring your infrastructure through the AWS console.
  • Besides Amazon Cloud Formation, other infrastructure as code options include Terraform, which works across different cloud providers. Terraform scripts have more powerful coding capabilities, whereas Cloud Formation files are more like static data. Cloud Formation also has a visual designer to graphically define your infrastructure.
  • Both Cloud Formation and Terraform files define all the parameters needed to make AWS API calls for managing pieces of infrastructure.
  • Launch Configuration: Describes the configuration of each virtual machine inside of your cluster. It defines what type of VM to use, what operating system it has, networking parameters, memory and CPU power, etc.
  • Auto-Scale Group: Describes the configuration of your cluster. It defines what network the machines will be on, and how many machines should be running. You can set up scaling policies to define when a scale group should add or remove nodes in the cluster to scale up or down.
  • An auto-scale group uses the Launch Configuration to define each node within the cluster.
  • When using ECS you are actually utilizing Cloud Formation, launch configurations, and auto-scale groups behind the scenes. ECS hides that complexity from you, but you will still see it reflected in your monthly AWS bill, which includes separate charges for all those services you’re using behind-the-scenes.
  • An ECS ‘Task’ defines what Docker image to run within the Service and the resource quotas needed to run it (e.g., memory, CPU, disk volumes). The ECS Service invokes your Task to start your container.
  • Deploying an application to production on ECS involves several steps. First, build your Docker image from your source code. Then, publish the image to a repository (e.g., Docker Hub or AWS ECR). Next, create or update the task definition to use this new Docker image. Finally, tell the ECS Service to use the Task you defined.
  • This may seem like a lot of steps, but it is very easy to automate.
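The deployment steps above are straightforward to script. Here is a minimal sketch in Python that assembles the commands an automated deploy might run. Every name used here (image, tag, cluster, service, task family) is a hypothetical placeholder, and a real pipeline would execute these commands from a CI tool such as CircleCI or Travis CI rather than just print them.

```python
# Sketch of the four ECS deployment steps as the shell commands a deploy
# script might run. All names are hypothetical placeholders.

def deploy_commands(image, tag, cluster, service, task_family):
    """Return the ordered CLI commands for one ECS deployment."""
    full_image = f"{image}:{tag}"
    return [
        # 1. Build the Docker image from source code.
        f"docker build -t {full_image} .",
        # 2. Publish the image to a repository (Docker Hub or AWS ECR).
        f"docker push {full_image}",
        # 3. Register a new task definition revision pointing at the image.
        f"aws ecs register-task-definition --family {task_family} "
        f"--cli-input-json file://taskdef.json",
        # 4. Tell the ECS service to use the new task definition.
        f"aws ecs update-service --cluster {cluster} --service {service} "
        f"--task-definition {task_family}",
    ]

for cmd in deploy_commands("myrepo/api", "v42", "prod-cluster",
                           "api-service", "api-task"):
    print(cmd)
```

The point of the sketch is that the whole flow is a fixed, repeatable sequence, which is exactly why it automates so well.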

Links and Resources:

Amazon’s ECS and other services; EC2 machine; CloudFormation; ECR

Docker, Swarm, and Docker Hub

Jenga

United Airlines and Peacock

Kubernetes

Google Compute Cloud

Terraform

JSON

Microsoft Azure

GitHub

CircleCI

Travis CI

Kelsus

Secret Stache Media

Jon:

Today, there were several things that we talked about last week that I felt like we could get into a little deeper. But rather than doing an extension of last week’s episode, where we talked about what a company or an organization goes through when they begin to take on containerization, Docker, and putting stuff in the cloud with Docker, we’ll do something a little bit more technical today: talk about something that we use a lot, which is Amazon’s Elastic Container Service, or ECS. That will give us something a little less theoretical and a little more concrete to talk about. We can explain how it works, talk about why we use it, cover alternatives, and so on. We can stay a little bit more grounded and a little less meta.

Today, we talk about ECS. To start with, everybody probably already knows about containers. The name comes from container ships and stacking containers on a ship. That’s the core analogy of what containers are all about, but maybe you can talk about that a little bit more, Chris, just in terms of how that analogy relates to scaling and having a large application.

Chris:

Sure. As you talked about, we’ve discussed containers and virtual machines, the differences between them, and why containers offer some great advantages. We’ve talked a little bit about how you might create those containers and run them, but that’s been a pretty manual process. You could imagine, as you adopt this technology, it’s one thing to do it for a single application.

One person can manually go through the steps of figuring out where you’re gonna run this container – on what virtual machine, or what piece of bare metal – and then actually making sure that the image is there on the machine, making sure that there are enough resources on that particular machine, starting the container, and making sure that the container stays up and running. Then what happens when you wanna go and update your container with a new image because you now wanna deploy a new version of your software?

There are lots of parts of the container life cycle that need to be addressed. Docker in and of itself, or any container technology at its core, doesn’t address that.

Jon:

Just to stick with the analogy, what you just described – running a container, choosing what machine it goes on, figuring out how much memory it has, and all that – is sort of like having a courier take your package from one place to another in the United States. You choose your courier, you tell them what plane to take, they get on the plane, and they go with just your package to where it’s supposed to go.

Chris:

Yeah. Or perhaps even more so with like, you are the courier.

Jon:

Yeah.

Chris:

It’s not someone else. It’s you. You’re the one that’s doing that. Obviously, there’s a lot of overhead with that. There’s only so much that you can do. Again, if it’s just one container that you’re doing, then okay, maybe it’s fine. If it’s a couple, okay, it’s a little bit more work and now maybe you’re spending an hour of your day doing it. But what happens when you have 50 containers? What happens then?

That’s where these other tools and services that build on top of the container technologies come into play, and they’re properly called orchestration systems because they’re kind of like the conductor of the orchestra, basically dictating who does what, when they do it, how things are sequenced and scheduled, and whatnot. In our shipping analogy, they’re the head foreman on the dock directing all the cranes on how to take those containers and put them on the ship.

Jon:

There we go.

Chris:

[inaudible 00:04:41] don’t have to do it.

Jon:

Sticking with our ship analogy. They’re the ones deciding which color of containers go where because all we can see of a container is its color. All the red ones over here and all the blue ones over there. Then, once the ship is filled up, what do I do?

Chris:

Exactly. Absolutely. It’s their job to make sure that they’re stacking these things in a way that makes sense. You don’t have a whole bunch of holes and gaps. You don’t lay one container horizontally and then stack another one vertically. It’s not Jenga. You want something intelligent. That’s actually a pretty difficult problem, and that’s the job of systems like ECS. It’s a job that’s ripe for automation – something software can do instead of people. That’s what these systems do, and it allows you to scale and not waste your time doing these things manually.

Rich:

Chris, can I ask you a quick question? You’d mentioned that if you’re only managing one container, then sure, you could probably do this yourself, but what happens when you reach about 50 containers? In my head, I think of, let’s just say, a web application running on a single container. How does that application end up managing or running something like 50 containers? What’s actually happening that would require so many different containers for a single application?

Chris:

Right. This comes down to: what is your overall philosophy for the infrastructure that delivers your software? A lot of teams and companies definitely have more than one application, or what they’ve done is – you can think of it logically as one application, but it’s actually broken up into many components; it’s a modular architecture.

This is one of those buzzwords you may have heard of: microservices architecture. It’s really just breaking things up so each piece does one thing and does that one thing well – a single concern. You may have a single application that’s actually broken up into five separate services. To your end users, they’re just using the timesheet tracking application, but on the backend, the DevOps team is actually having to keep five services up and running.

Then you wanna have duplicates of those things as well for availability. What happens if one of those services dies because there’s a bug in the code, or some failure on the machine, or whatnot? You wanna have duplicates of it. This is just a simple case: one application, maybe broken up into five microservices, with duplicates of each – so, at bare minimum, you’re actually running 10 separate containers for that one application.

Now, multiply that by the many teams and companies that have 10 applications, or 50 – it depends on the size of your organization. We should also talk about that number 50. Fifty’s definitely not a magic number; that was just a number that I threw out there. I personally would not want to handcraft the placement and running of 50 containers. I wouldn’t wanna do it with even one, personally. Why waste 15 minutes of each day dealing with these things?

It’s one thing to run containers locally on your development machine – we talked about that, I think, in a previous podcast. For that, you definitely would be doing it by hand, and you are the orchestrator. But once you go to a deployment environment, whether it be your staging or production environment, that’s when letting something else take care of the automation becomes very important.

Jon:

There’s another analogy, Rich, that I wanna talk about here, one that we’ll probably refer to again and again as shorthand, which is the pets versus cattle analogy. I’m gonna ask Chris to explain what that is, but I think it’s timely because we just talked about transportation, and we just saw – I think it was United Airlines – that wouldn’t let an emotional support peacock onboard?

Rich:

Yeah. It was United.

Jon:

If we’re moving stuff – moving containers from one place to another in our analogy – but treating them like pets versus treating them like cattle, please discuss, Chris.

Chris:

Sure. Again, this is kind of a wonderful evolution of the technology, the tools, and automation, driven by the needs of scaling systems. In the past, before we had a lot of these great automated tools, if you wanted a server, you went and bought a server, you racked it, you installed the memory chips, maybe you upgraded the hard drives, you installed your application software on it, and you gave it a name – you called it something like Mr. Smith 1.

Maybe you named another one after a cartoon character, or something like that, but you probably gave it a name. You probably even printed out a label on a label maker and put it on the front of the machine. You had a collection of these – long-running things that you individually referenced and had to deal with. It was kind of like this finite set of resources that you kept around. Those are like pets.

Jon:

Yes. So much of your emotional state was invested in keeping those things up and running that they probably weren’t that different from emotional support animals.

Rich:

I get that. Yeah. That makes sense. Yup.

Chris:

Absolutely. Yeah. Just like a pet: the pet’s around for, hopefully, a long time, you give it a name, you bond with it, and it’s something that you’re gonna hold on to. That’s really how things used to be. When what we call the cloud came along, that changed everything, in the sense that you were no longer responsible for creating these machines – someone else was.

At first, I think, people treated them the same way: they would bring one of these things up, they would name it, it would be long-lived, and they would treat it as a pet. But as time went on and the tools got better, folks started realizing that there’s really no reason to take that approach. We can view this as a very fungible set of resources.

This is one of the reasons why, with Amazon, so many of their services are prefixed with the term elastic. Elastic really gets to the core of the pets versus cattle analogy. With pets, we talked about how these are long-lived things: you name them, you bond with them, you have a one-to-one relationship with them.

Cattle have ear tags; they have ID numbers, they don’t have names, and they come and go. That’s how we’re now moving towards treating our machines in the cloud. It’s a varying set of resources: we have a need, at some particular point in time, to run some software on a machine, and we don’t really care what machine that is. We have some requirements on the specifications for that machine, but other than that, we don’t care if it comes from server A, server C, or server Z. It really doesn’t matter. We just need some server to run that software, and servers can come and go.

There are also lots of benefits to having short-lived machines as opposed to long-lived machines. The longer a machine is up and running, the more entropy comes into play on that machine: things like fragmentation of storage systems, memory leaks in long-running applications, and issues with hardware – hardware fails, chips go bad, things burn out.

This idea of saying, “Hey, just bring up some resource, a machine that I can use for a short period of time, and then, when I’m done, let’s get rid of it,” also balances out well with only having the capacity that you need at that point in time. Again, going back to elastic: you can think of the amount of resource that you have as something that stretches and contracts based upon what your current demands are.

In the past, if you treated your machines as pets and they were long-lived, then you kind of had to have enough machines around for your peak usage. But most of the time, you’re not at your peak, so you have all this capacity out there that just isn’t being used. That’s because you’re treating these as pets – a long-lived, finite set of resources.

With cattle, where you have a set of resources that can come and go, you can expand and contract accordingly. It’s a lot more efficient from a business standpoint and a cost standpoint.

Jon:

It’s also useful for monitoring and troubleshooting to think about things as cattle instead of pets, because engineers have a history – especially with troubleshooting – of getting onto a machine and using Linux or terminal commands to find out what’s running, how old processes are, look at logs on that machine, and do lots of greps and other commands just to figure out what’s happening. It’s like talking to your pet to find out what’s wrong. In the new containerized world, a more likely thing to do would be to just let the orchestrator take that particular container out to pasture. Let’s just be done with that one. If it’s acting up, shut it down and get a new one.

Let’s get into ECS specifically. I guess we kind of need to start by defining what it is. We’ve talked about how it’s an orchestrator. We’ve talked about what orchestration means in terms of our analogies. What is ECS as an orchestrator? I don’t know if there’s much we can add to it, but I’ll give you a couple minutes here to just do your best, Chris.

Chris:

Sure. There are definitely some important, high-level concepts we can talk about that will help frame that discussion, and we can figure out where we wanna go from there. I think it will be super helpful for Rich to pipe in and let me know – and you too, Jon – when we start using terms or terminology that we take for granted, in terms of what it means and how it works.

Jon:

That sounds good.

Chris:

Yeah. I should mention, there are many orchestration systems out there for containers and Docker. One of the most popular is Kubernetes – a very, very popular piece of open source software. It was developed at Google for them to run their loads. It’s open source, it’s portable, it runs in just about any kind of environment, and it’s also one of the [inaudible 00:17:05] that has been around the longest. That’s one of the reasons why it’s so popular.

Docker itself has built its own orchestration technology called Swarm, which is actually just part of the Docker engine. You can use that as your orchestration system. We’re talking about ECS now, which is Amazon’s version. One of the reasons why we use ECS as opposed to Kubernetes is that, if you’re running your cloud workloads inside Amazon, there are just a lot of benefits to using the Amazon orchestrator, because it is so well integrated with all the other various Amazon services, whether they be load balancers, Auto Scale groups, CloudWatch monitoring, logging, and whatnot. There are just so many benefits – so much bang for the buck there – and that’s one of the reasons why we really prefer ECS. Basically, Amazon is our preferred cloud provider.

With that, one of the first things an orchestrator needs is resources. It needs actual machines on which it can schedule, run, and monitor these containers. In the ECS world, that’s referred to as a cluster. What’s a cluster? A cluster is basically a set of EC2 machines – you have 1 to N of these EC2 machines.

Jon:

In our analogy, that cluster is a group of ships ready for those containers.

Chris:

Exactly, exactly. EC2 stands for Elastic Compute Cloud. It’s Amazon’s technology for spinning up some sort of computing device or server, whether it be a virtual machine – they actually have support for bare metal instances as well. They have multi-tenant versions versus dedicated ones. There are lots of different types of machines, but that’s the core technology for saying, “Hey, I need a machine – a computer, essentially – running in the cloud.” That’s an EC2 machine. For my cluster, I’m gonna have 1 to N of these machines as part of that cluster.

Rich:

Chris, if I can stop you there, is each one of these EC2 instances its own container? Or could there be many containers on each EC2 instance?

Chris:

We had the discussion about virtual machines versus containers. Think of these EC2s as virtual machines. The containers then get scheduled on them. Going back to the analogy that Jon was pointing out, the EC2s are the ships. Our software is in the containers that then get loaded onto these ships and packed into them accordingly. The more ships we have, obviously, the more containers we can actually run in our workload. Does that answer your question?

Rich:

Yeah, yes. I think what you’re saying is that there are many containers as part of an EC2 instance. Then, the EC2 instance – the analogy is that it’s the ship – and so if you had multiple EC2 instances, it would be like a fleet of ships.

Chris:

Exactly, right. The EC2s are your host resources. You have to run a container on some computer; those are the EC2s, and together they form this cluster. Each one of these EC2s has a certain amount of processing power, a certain amount of memory, maybe storage – there are resource constraints that each one of these things has. That comes into play when the orchestrator wants to schedule one of your containers to actually run. If you don’t have resources to run your containers on, you’re not gonna go anywhere. That’s one of the core foundational ideas behind the orchestrator.
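The scheduling problem Chris describes – fitting containers onto hosts with limited CPU and memory – can be sketched as a simple first-fit placement. This is a toy illustration only, with made-up host and container values; ECS’s real scheduler uses far more sophisticated placement strategies.

```python
# Toy first-fit scheduler: place each container on the first host (EC2)
# with enough remaining CPU and memory.

def schedule(hosts, containers):
    """Mutate each host's free resources; return {container: host id}."""
    placements = {}
    for name, cpu, mem in containers:
        for host in hosts:
            if host["cpu"] >= cpu and host["mem"] >= mem:
                host["cpu"] -= cpu   # reserve the host's resources
                host["mem"] -= mem
                placements[name] = host["id"]
                break
        else:
            placements[name] = None  # no ship has room for this container
    return placements

hosts = [{"id": "ec2-a", "cpu": 1024, "mem": 2048},
         {"id": "ec2-b", "cpu": 2048, "mem": 4096}]
containers = [("api", 512, 1024), ("worker", 1024, 2048), ("cache", 256, 512)]
print(schedule(hosts, containers))
```

Here the "worker" container doesn’t fit on the first ship once "api" is placed, so it goes to the second – the kind of decision you’d rather not make by hand for 50 containers.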

Jon:

By the way, shame on Amazon for naming their container orchestration service ECS. I mean, come on, it’s not memorable. I’ve had 15 conversations at least where people accidentally call ECS, EC2 or vice versa, and their competition is using cool words like Kubernetes and Swarm, and here we are using ECS.

Chris:

Hey, in Amazon’s defense, they’re learning, they now have Fargate.

Rich:

That is a problem, though. When I was doing a little bit of research, I watched a video on ECS that never once said ECS – it was EC2. I was like, “Wait a second. Are these the same thing?” ECS is Elastic Container Service, but then what’s EC2 – Elastic Container-something 2?

Jon:

Yeah. EC2 is Elastic Compute Cloud.

Rich:

Okay. Cloud [inaudible 00:24:21] That’s the problem. I don’t know.

Chris:

What’s really ironic is that you would actually expect EC2 to be called ECS, because just about everything Amazon does is a service: there’s Elastic Block Store, SQS, SES (Simple Email Service), SNS (Simple Notification Service). You’d think that EC2 would be the Elastic Compute Service, but instead they called it the Elastic Compute Cloud – the two Cs end up getting abbreviated as C2. This was one of the very first services they offered way back in 2006. It’s almost like they were reserving ECS for something else. It’s complete happenstance, but it’s really kind of ironic that it worked out that way.

We talked about the cluster, and that’s the set of host resources on which the scheduler can schedule things. Then we have to define the applications themselves – the application definition, if you will. In ECS parlance, you define your services. A service is a definition that dictates the collection of containers that you want to run as a unit as part of that service.

Jon:

Can I just say that I have a way of thinking of services that may be a little bit more understandable? That was a very good technical definition, but I just think of a service as a long-running thing. It’s something that can do something, and it just runs forever.

Then, what it’s running, how many containers it has, and what the containers are doing is what you need to define. But a service is something that’s long-running. It just goes and goes and goes. You have to define one of those, or more.

Chris:

Yeah. That’s a great way to think about it. That’s exactly what it is. There are some nuances: some services have inbound traffic – your software will be accepting requests coming into these containers – and you’ll front those with load balancers to direct that traffic into them. Then you can also have services that don’t have inbound traffic at all. Instead, they’re daemon processes – background processes that are, again, long-running. They’re doing work in the background. They need to be running as a service, but they don’t have any inbound requests coming into them. Those can be defined and run with ECS as well.

Some of the other criteria you have with services: you’ll tell ECS how many of these things you want running. You could have a single instance of your service running, or, for redundancy, scalability, or performance reasons, you may wanna have two, three, or four. You can tell ECS that, and ECS will deal with it accordingly as it manages that service for you.

Let’s see, some of the other things you’ll do in services: you’ll give it identity information, like what type of identity the application should assume when it’s running, and that will dictate things like security and access to various resources. I think the important takeaway here is that a service is long-running software in container format, plus the parameters that go along with how that service should be managed and spun up, and how many instances of it there should be.
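To pull the pieces of a service together, here is a rough sketch of the parameters discussed above, expressed as data. The field names loosely mirror ECS concepts (count, task, identity, load balancing), but the actual ECS API differs in detail, and every value here is made up.

```python
# Illustrative sketch of an ECS-style service definition.
# All names and values are hypothetical.

service_definition = {
    "serviceName": "timesheet-api",          # the long-running application
    "cluster": "prod-cluster",               # which set of EC2 hosts to use
    "taskDefinition": "timesheet-api-task",  # what containers/images to start
    "desiredCount": 3,                       # redundancy: run three copies
    "role": "ecsServiceRole",                # identity the process assumes
    "loadBalancers": [                       # fronts inbound request traffic
        {"containerName": "api", "containerPort": 8080}
    ],
}

# A background (daemon-style) service simply has no load balancer:
worker_definition = {
    **service_definition,
    "serviceName": "report-worker",
    "loadBalancers": [],
}
```

The two variants capture the distinction Chris draws: request-serving services get fronted by a load balancer, while background workers run with the same lifecycle management but no inbound traffic.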

Jon:

I wanna take a moment to restate something you’ve said – to summarize it and bring it back to the point you made earlier. You just talked about: you need computers to run these services, and you need to define the services. Underneath the services is the containerized software. Then, if a service is listening for people calling it, it’s gonna have a load balancer in front of it.

Those are three things that ECS is taking care of for you. It’s letting you define the service, letting you say how many computers are in the cluster, and putting load balancers in front of your cluster.

The thing I wanna bring that back to is that bringing up and making available computers and load balancers and such – because everything is inside AWS – can all happen seamlessly and automatically. I think, at least before the very recent advent of managed Kubernetes on AWS, if you were going to use another one of the orchestrators like Kubernetes or Swarm, some of that stuff you had to deal with a lot more manually. I’m not entirely sure, because I don’t have experience with those other ones, but I think that is the case.

Chris:

Yeah, absolutely. Getting back to earlier: why are we talking about ECS and not Kubernetes, when Kubernetes is kind of the de facto leader here? If we were running our workloads on premise – on our own machines, using open source software for our own cloud computing stack – then we’d be using Kubernetes. If we were running inside Google Compute Cloud, we’d probably be using Kubernetes.

But given that we’re inside Amazon, a lot of the stuff that you just talked about, you would have to do manually. How do you define a cluster? You have to set that up manually with Kubernetes, whereas with ECS you just tell ECS, “Hey, create a cluster for me.” You give it some parameters, like what type of virtual machine the hosts should be and how big the cluster should be, and then ECS, under the covers, is actually using things like Cloud Formation to go build launch configuration definitions, which dictate what these host machines look like. It creates an Auto Scale group for you, which basically says, “Here’s how to go create a cluster that can be scaled up and scaled down using that launch configuration.” It’s wiring up load balancers to talk to your services. Yeah, it’s doing a lot of stuff under the covers that you don’t have to do. That saves you a lot of headache and a lot of trouble.
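To make those "under the covers" pieces concrete, here is a sketch of roughly what a launch configuration and an auto-scale group specify. The field names echo the AWS concepts, but every value here is illustrative, and the AMI ID, subnet, and names are placeholders.

```python
# Sketch of what ECS wires up behind the scenes when creating a cluster:
# a launch configuration (what each host looks like) and an auto-scale
# group (how many hosts, and on what network). Values are illustrative.

launch_configuration = {
    "LaunchConfigurationName": "my-cluster-lc",
    "InstanceType": "t3.medium",        # VM size: CPU and memory
    "ImageId": "ami-0123456789",        # OS image (placeholder ID)
    "SecurityGroups": ["sg-cluster"],   # networking parameters
    # User data script that joins each host to the ECS cluster at boot:
    "UserData": "#!/bin/bash\necho ECS_CLUSTER=my-cluster >> /etc/ecs/ecs.config",
}

auto_scale_group = {
    "LaunchConfigurationName": "my-cluster-lc",  # each node uses the config above
    "VPCZoneIdentifier": ["subnet-private-a"],   # which network the machines join
    "MinSize": 2,          # scaling bounds: never fewer than this...
    "MaxSize": 10,         # ...and never more than this
    "DesiredCapacity": 2,  # how many hosts to run right now
}
```

The split mirrors the podcast’s definitions: the launch configuration describes one node, and the auto-scale group describes how many such nodes the cluster should have at any moment.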

Jon:

Maybe it also removes a little bit of flexibility, but for probably more than 80% of the workloads you might have, it does the trick perfectly well.

Chris:

Yeah. The great thing is that you have access to all those variables. If there is anything you do wanna tweak, by all means, go ahead and do it. You can change the launch configuration to have your own custom user data scripts that run at startup on each of your host EC2s – download some software, or run some kind of process that you need to have running on your machines. You can change your Auto Scale group to say, “Hey, I want these things to run on private subnets and to not have public IP addresses.” That’s super easy to do. You don’t have to go deeper and get more advanced with that stuff, but the hooks are there, and you can do so if you’re so inclined, if you need them.

Jon:

There are three things that we’ve been talking about in the last five minutes that we haven’t really defined, and they’re not simple. But you can’t really have a conversation about ECS without using these terms, so maybe we can talk about what each of them is. You talked about Cloud Formation, Auto Scale groups, and launch configurations. Let’s go one at a time. What’s Cloud Formation?

Chris:

CloudFormation, broadly, falls into a category typically called infrastructure as code. What that means is it’s a mechanism for automating the creation, the update, and the deletion of infrastructure components, whether they be computers, or load balancers, or security groups, that kind of thing. Without infrastructure as code, you might do all that stuff manually via an administrator console. But there’s very little insight into what’s going on, and it’s very much a manual process.

You go in, you sign into an account, you go to some screen, you click a button to create something, you type in a bunch of parameters, you go to the next screen, type some more parameters, click save, and now you’ve created that one piece of it. Then you go and do this for 12 other pieces. Very manual, but also, it’s mysterious. No one really knows, “How did you do that?” You have to go show someone how to do it, or you have to tell people what you did.

Infrastructure as code takes all those actions and puts them into a written set of instructions, whether it be in JSON format, or sometimes it could even be actual software code, but it’s something that’s very declarative, it’s something that can be checked into your source control system, it can be versioned, and then, even more importantly, you can now automate it. You write that once, and now you run it through a system that reads it, interprets it, and goes and does those actions on your behalf. Say you need to create your staging environment.

There are lots of different infrastructure resources you would have to create, whether it be networking and subnets, and routing tables, and virtual machines and EC2s, and load balancers, roles, that kind of stuff. That might take you hours, even days, to do. But if instead you do it once using one of these tools like CloudFormation, this infrastructure as code, then spinning up one of these environments takes minutes, because you just feed that set of instructions into the piece of software that does it.

Think of it as a recipe for declaring what resources to create. Likewise, it can use that same recipe, when you don’t need it anymore, to go and delete all that stuff for you. There are lots of infrastructure as code tools and technologies out there for delivering this; Amazon’s version is called CloudFormation.
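As a rough illustration of that recipe idea, here’s a minimal CloudFormation-style template sketched as a Python dictionary. A real template would live in a JSON or YAML file, and the logical ID and property values here are just placeholders:

```python
import json

# A minimal CloudFormation-style "recipe": declare the resources you want,
# and let the engine figure out the API calls. All values are placeholders.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "WebServer": {                      # a logical ID we chose ourselves
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "InstanceType": "t2.micro",
                "ImageId": "ami-12345678",  # placeholder AMI
            },
        }
    },
}

# Because the recipe is plain data, it can be checked into source control,
# versioned, and diffed like any other file.
print(json.dumps(template, indent=2))
```

The same declaration can drive creation, update, and deletion, which is exactly the lifecycle Chris describes.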

Jon:

I wanna take an even one more layer into the real physical world, just because I think it’s fun to think about things this way. There’s data centers out there and they have these real machines that are just sitting on racks in these data centers, like thousands, and thousands, and thousands, and they’re always there. They don’t actually go away, they’re always sitting there.

When you use the AWS console, you’re on amazonwebservices.com, and you’re clicking around, you’re like, “I want another EC2 instance.” You click all the buttons on the web page to give you another EC2 instance. What’s happening behind the scenes is that all of those machines on all of those racks are running at least one little program that controls the machine. You say you want a new EC2 instance, and Amazon has computers that route your request and go find a free machine that’s not being used, and say, “Let’s associate this one with,” in our case, Kelsus. “Let’s give this one to Kelsus, let’s turn on the meter like a taxi cab. Okay, Kelsus, it’s yours. Do whatever you want with it.” That’s sort of the manual way that you already described, Chris.

Then you described infrastructure as code. It’s the same exact thing, except now we’re running a program to say, “Go and give Kelsus this instance, this EC2 machine,” instead of clicking around in the website to get that machine.

I just wanted to take it all the way down to that level: these machines are literally there, sitting on racks, waiting for people to claim them. The code is just a program to essentially automate what you might do if you were clicking around in the console.

Rich:

I think the piece is that you have the option to do either. Is that right?

Jon:

Oh, yeah. Absolutely. Yup. Now we know what CloudFormation is. Actually, I wanna let Chris say what a couple of the other options out there for infrastructure as code are, but before you say that, Chris, I just wanted to say why we have to talk about CloudFormation with ECS: under the covers, when ECS is making a new cluster for you, or adding a new machine because you’re scaling up, or taking away a machine because you’re scaling down, or putting a load balancer in front of your cluster, it is itself using CloudFormation to do all of that for you. You never have to touch CloudFormation, and in fact, when I first started messing around with it, I wasn’t even aware that that’s what was happening, but that is what’s happening. ECS is using CloudFormation to do all of that. Okay, what are some other infrastructure as code options, Chris?

Chris:

One of the more popular ones out there is Terraform, from HashiCorp. It’s a nice set of tools with support for various cloud providers as well, it’s open source software, and it’s perhaps even a bit more developer-friendly.

CloudFormation is completely specified in JSON, with various different keywords that you need to know about and whatnot. Amazon does have some visual designers to help create these JSON documents, this code, the infrastructure as code, the recipes if you will. Terraform looks a bit more like actual code and less like pure data.

Jon:

Actually, I think you made me realize that there is yet another layer down. CloudFormation is something that you can code to, and it has its own domain-specific way of defining your infrastructure. But when it runs, it’s really using some underlying APIs that Amazon has defined that, when you call them, will turn on a machine, or turn off a machine, or assign an IP address to a machine, or whatever. Correct me if I’m wrong, but I believe that Terraform is going straight at those APIs. Or do you think it’s going through CloudFormation to do what it does?

Chris:

No, it’s absolutely going straight through those APIs, just like CloudFormation is as well. CloudFormation is JSON data describing basically all the information that the CloudFormation engine needs to know in order to make the appropriate API calls to the various subsystems. If it needs to spin up an EC2, it’s gonna make an EC2 API call; if it needs to create an Auto Scaling group, it’s gonna make that API call.

Likewise, something like Terraform is doing the exact same thing. They support multiple cloud providers, whether it be Google, or Microsoft Azure, or AWS. They have an abstraction that they’ve built on top of that. They have a concept like, “I need to create a VM,” and then they have providers for each one of those various different services that make the specific API calls to do that. It’s a different API call on AWS than it is on Google to spin up a virtual machine, but Terraform abstracts that away from you.

That’s one of the advantages of something like Terraform: it’s cross-platform. If that’s something that is interesting to you, it’s something to look at. If you’re all in on AWS, then CloudFormation is something to take a really close look at.
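A toy sketch of the provider abstraction Chris describes: one generic “create a VM” concept, dispatched to per-cloud providers. The call strings below are illustrative only, not the real SDK method names:

```python
# Toy version of Terraform's provider model: one abstract resource request,
# with per-cloud providers that know the concrete API call to make.
# The API call names here are illustrative, not real SDK signatures.
PROVIDERS = {
    "aws": lambda spec: f"ec2.RunInstances(type={spec['size']})",
    "google": lambda spec: f"compute.instances.insert(machineType={spec['size']})",
}

def create_vm(cloud, spec):
    """Dispatch the generic 'create a VM' request to the right provider."""
    return PROVIDERS[cloud](spec)

print(create_vm("aws", {"size": "small"}))
print(create_vm("google", {"size": "small"}))
```

The user declares the same abstract resource either way; only the provider changes, which is what makes the tool cross-platform.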

Jon:

The first one was CloudFormation. I think the next in this series of things we need to define is a launch configuration. What’s that?

Chris:

A launch configuration is actually pretty straightforward and simple. It’s literally just a description of what you want the virtual machines inside of your cluster to look like. You’re specifying things like what type of instance your virtual machine should be. Should it be a more inexpensive machine that has a relatively small amount of processing power and perhaps less than a gigabyte of RAM, or is it a beefier machine that has eight gigs of RAM and two virtual CPUs to give you some more processing power?

You’re telling it what type of virtual machine you wanna spin up. You’re also saying what you want the preloaded software on there to be, like what operating system you want. Is it Linux? Is it Windows? What flavor of Linux is it? What packages of software should be on there?

You’re also defining things like some of the networking parameters around it. It’s basically whatever criteria you need to specify to say, “This is what my virtual machines should look like. This is how you actually create one of these things that’s gonna go inside the cluster.” That’s the launch configuration.
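The fields Chris lists can be sketched roughly like this. The field names loosely follow the EC2 launch-configuration format, but all of the values are placeholders:

```python
# A launch configuration sketched as a dict: what each host VM in the
# cluster should look like. Values are placeholders, not real resources.
launch_configuration = {
    "InstanceType": "t2.small",         # size: CPU and RAM of each host
    "ImageId": "ami-12345678",          # OS and preloaded software (placeholder AMI)
    "SecurityGroups": ["sg-web"],       # networking parameters
    "AssociatePublicIpAddress": False,  # e.g. keep hosts on private subnets
    "UserData": "#!/bin/bash\n# custom startup script runs here\n",
}

print(launch_configuration["InstanceType"])
```

The `UserData` field is where the custom startup scripts Chris mentioned earlier would go.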

Jon:

Right. ECS is running along happily. It’s hosting a website maybe, and then the website ends up on national news and ECS is like, “Woah. I need to scale this thing up because I’m starting to get a lot of traffic here.” ECS says, “Alright, I need another machine,” and it says, “Oh, let me look at my launch configuration and see what kind of machine I need.” It looks at its launch configuration and it says, “Okay. I see what machine I need. Now, Cloud Formation go get me one of these.”

Chris:

Yeah. Logically, that’s how it works. There’s the other piece that we haven’t talked about yet, which is the Auto Scaling group. That is the piece that bridges these two things. Auto Scaling groups are fed a launch configuration. Just as a launch configuration is a template for what an individual machine looks like, an Auto Scaling group is a template for what your cluster of those machines should look like. You give an Auto Scaling group a launch configuration, saying this is what each one of the nodes inside that cluster should be, this is how they should be provisioned.

In addition to that, you’re also telling the Auto Scaling group what networks it should be putting these machines on, and how many of these machines should be running. What’s the minimum number that should be running? What’s the maximum? What’s the desired count? That’s the Auto Scaling group. It’s the cluster definition of these resources, if you will.

Then you can hook up scaling policies to that Auto Scaling group. You can go in and define when that group should change its parameters, when it should spin up another machine. In the case where load goes up and we don’t have enough resources to really comfortably handle the load, we need to spin up some more EC2s. You can create a scaling policy associated with that group that tells it, when this event happens, go ahead and go from four machines to five machines. Likewise, you can have scale-down policies. You can say, “Hey, if this event happens, our utilization goes under 10% for X amount of time, then take one away, kill a machine, terminate the machine.” It can do that for you as well.
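The min/max/desired structure plus a scaling policy can be simulated in a few lines. This is an illustrative model of the clamping behavior, not any real AWS API:

```python
# An Auto Scaling group with min/max limits, plus a scaling-policy step
# that is always clamped to those limits. Illustrative only.
group = {"min": 2, "desired": 4, "max": 6}

def apply_scaling_event(group, delta):
    """Scale out (+delta) or in (-delta), clamped to the group's limits."""
    target = group["desired"] + delta
    group["desired"] = max(group["min"], min(group["max"], target))
    return group["desired"]

apply_scaling_event(group, +1)   # load went up: 4 -> 5
apply_scaling_event(group, +5)   # clamped at max: 6
apply_scaling_event(group, -10)  # clamped at min: 2
print(group["desired"])
```

This also shows why Chris says the min and max are limits rather than policies themselves: policies propose changes, and the limits bound what the group will actually do.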

Jon:

When you said the minimum number of machines and the maximum number of machines, are that minimum and maximum scaling policies, or are they just inferred scaling policies?

Chris:

No, they are limits that any kind of scaling policy would have to adhere to.

Jon:

Okay. Yeah. The reason I asked is because one of the things that seems kind of magic is, say you have everything set at two. You want your minimum to be two, your desired count to be two, and your maximum to be two. That means that ECS is gonna use that Auto Scaling group and just keep your number of machines at two at all times. If you then go behind ECS’s back, right into the EC2 management console, and you kill one of those machines, ECS is gonna say, “Woah, wait a minute, I’m supposed to have two of these.” It will automatically make another one right then.
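Jon’s scenario can be modeled in a couple of lines. This is an illustrative simulation of desired-count reconciliation, not a real AWS call:

```python
# min = desired = max = 2. If a machine is killed behind the scenes, the
# Auto Scaling group notices the shortfall and launches a replacement.

def reconcile(running, desired):
    """Return how many instances to launch (+) or terminate (-)."""
    return desired - running

running = 2
running -= 1                      # someone kills one in the EC2 console
running += reconcile(running, 2)  # the group launches a replacement
print(running)                    # back to 2
```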

Chris:

Yeah. The cool thing here is that it’s not actually ECS that’s doing that. ECS doesn’t even really care. It’s the Auto Scaling group that’s doing that. There’s this really cool thing where Amazon is eating their own dogfood. They’re building these services on top of other core services, whether they be CloudFormation, or Auto Scaling groups, or load balancers. They’re leveraging the power and the capabilities of each one of those things, and building these value-added services on top of that. It’s one of the reasons why the number of services that Amazon is offering is expanding so quickly. The foundational stuff that was done years ago, that was some of the hardest work.

Now, when adding these new services, whether it’s a voice transcription service or an image recognition service, they’re building on so many of those foundational services that the developers of those services can focus on just what’s different. They don’t have to do the other table-stakes type things. Auto Scaling groups are one of those things that ECS just kind of gets for free.

Jon:

Yeah. That’s super interesting. As excited as I am about what AWS has done there, I also wanna point out that it’s a bit annoying that they don’t make that clear. When you use ECS, especially if you haven’t really gone in and figured out what it’s all about, it seems like a fairly superficial thing that you’re doing: you’re setting up an ECS cluster and then you’re defining tasks and services for that cluster.

Then, when you go look at your Amazon bill, you’re like, “Woah. Look, I’ve got load balancers that I’m paying for.” It’s just not obvious when you set this all up that you’re creating launch configurations, you’re creating Auto Scaling groups, you’re turning on load balancers. They don’t make it clear, without reading the manual (which I guess everybody should do), that this is what you’re really signing up for: all these different services.

Chris:

Right. This is probably gonna be just the way that things work, and it’s gonna accelerate. It’s gonna become more of just something that you have to deal with as we use these services; things are so complicated. There are so many core services and ways to stitch these things together. You may think, again logically, you’re creating this one thing, but really it’s like 13 things that have to happen. If they made you do all those things yourself, it would be such a big hurdle that very few people would be able to do this, so they have to automate it and make it as easy, as one-click simple, as they can.

But then the other thing, I think, is that if they were to try to tell you what they’re doing, that would be pretty complicated too, and daunting. If you saw all these messages about, “Oh, I just did these 13 things,” you’d be like, “Wait a minute, what’s an Auto Scaling group? Do I have to know that?” I think, on purpose, they’re abstracting all of that stuff away, they’re hiding that stuff because they don’t want you to have to know about it in general.

Jon:

Right. You do see it in your bills, though. I think that’s a place where AWS has some room for improvement. If you’re using a higher-level service, figure out how to charge for the higher-level service instead of exposing all of the underlying services that you’re charging for, so that you can think about it as paying for your higher-level service and your usage of that.

Chris:

Yeah. It would be wonderful, instead of my bill having these seven different categories of what I spend on networking, and EC2, and RDS, and load balancers and whatnot, to instead just say, “This is what my ECS service costs me to run the application. That’s $173 a month.” Instead of it all being broken into various different pieces where I’m like, “What do I do with that?”

Jon:

Yeah. We’re running out of time here but I think we do have time to go through one more kind of mental exercise to kind of bring it home with the ECS and talk about in a concrete way. I’ll set it up like this. People that have come from Platforms as a Service, they know that when they are going to deploy their application to production, they will have written some code, tested it on their local machine, maybe tested it on some cloud machine somewhere. But eventually, they get to the point where their code is in the code repository like GitHub, and then they can tell Git to push their code to the Platform as a Service, and then they know that the Platform as a Service is gonna suck in that code, put it on a machine, or multiple machines and run it and make it available.

The whole process when you put something on ECS is a lot different and it’s a lot more complex. I was wondering if we can walk through it at the most understandable, highest level possible, without just sort of skipping over all of it, and try to make that real. Say you see that code in your IDE and you want it on ECS. What is the very highest-level process of what that looks like?

Chris:

The one piece that we didn’t really talk about with ECS yet is tasks. I think we’ve mentioned the word a couple of times, but we haven’t really talked about it, and that’s the last piece to discuss. A service describes this long-running thing and how many instances of it there should be, but it doesn’t really describe what it is that you’re instantiating. That’s the task definition.

The task definition is the fundamental unit that says, “This is the Docker image to run. These are the resource quotas that I need to run. This is how much memory I need. This is how much CPU I need.” It has other components, like if it needs volumes available, those would be on there, but that’s the fundamental unit. That task definition says, “This is the Docker image to run.”
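The essentials Chris lists can be sketched like this. The field names loosely follow the ECS task-definition format, and all of the values are placeholders:

```python
# The core of a task definition: which Docker image to run, and the
# resources it needs. Image name and numbers are placeholders.
task_definition = {
    "family": "web-api",
    "containerDefinitions": [{
        "name": "web-api",
        "image": "myregistry/web-api:1.4.2",  # the Docker image to run
        "memory": 512,                        # how much memory it needs (MiB)
        "cpu": 256,                           # how much CPU it needs (CPU units)
    }],
    "volumes": [],                            # declared here if needed
}

print(task_definition["containerDefinitions"][0]["image"])
```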

When you define a service, you tell the service what task definition to use. That tells the service basically what Docker image it needs to go pull and how it should create that on one of the host EC2s in your cluster. To actually deploy your service: you have your code, you wrote your code, you updated it, it’s in your IDE. The first thing that has to happen is you actually have to build your new Docker image from your code, and you can do that either locally on your machine, or you can have a build server, or you can use a CI service, something like CircleCI or Travis CI, that’s looking at your code repository and building your Docker image based upon that.

Once the Docker image is built, it then needs to be put somewhere where ECS can see it. Those are called repositories, artifact repositories; Docker itself runs one. Definitely the most [inaudible 00:52:05] is something called Docker Hub. That’s their Docker repository where all these images can be stored. You can think of it like a library, maybe, where you have all these images that are available there; they’re catalogued, indexed, named. You’re able to uniquely identify each one, and then read them when you need to, and you can put new ones there; that’s publishing an image. We wrote our software, we’ve built the image, we’ve then published it to the repository. Amazon has its own one called ECR, which is Elastic Container Registry.

Jon:

Correct.

Chris:

Yeah. That’s where your Docker images will go. Once you’ve built that image and published it to the repository, the next thing you’ll do is create a new task definition for your service. It will be an iteration of the previous one. But basically, you’re probably just changing one thing, and that is you’re saying, instead of using this older image, use this new one. You’re giving it the new unique ID for the new image that you just built and published to your repository. You create that new task definition with those parameters.

The last thing that you do is tell the ECS service, “Instead of using that old task definition, use this new one.” That’s when the orchestrator kicks in and does its thing. Your service definition said, “I want two of these things running at any one time.” I’ve now given it the new task definition. It looks at what’s running out there and it sees, “Oh, I have two of these things running, but they’re the old version, and that’s not what I want. We want the new version.” There are parameters on the service that tell it how it can do these updates, by giving it the minimum number and the maximum number of instances that you want running.

Depending on those parameters, it might terminate one of them to take you down to one running instance. You now have one running instance of the old version. It then creates a new instance using the new version. Now you have these two things running side by side: one instance is the old version, one instance is the new version.

Once that new version comes up, is healthy, and passes the health checks, ECS will then go ahead and terminate the remaining old one and spin up another new one. It handles this whole rolling deployment of your new service by virtue of you just needing to create a new task definition and register that with your ECS service.
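The rolling flow Chris just walked through can be simulated in a few lines. This is an illustrative model, not ECS’s actual scheduler; it assumes the minimum-healthy bound rounds up and the maximum bound rounds down:

```python
# Simulate a rolling update: desired count 2, minimum healthy 50%,
# maximum 100%, so the scheduler drops to one old task, starts one new
# task, then replaces the remaining old one. Illustrative only.

def rolling_update(desired, min_healthy_pct, max_pct):
    lower = -(-desired * min_healthy_pct // 100)   # ceiling division
    upper = desired * max_pct // 100               # floor division
    old, new, steps = desired, 0, []
    while old > 0:
        if old + new > lower:                      # room to stop an old task
            old -= 1
            steps.append(f"stop old ({old} old, {new} new)")
        if old + new < upper and new < desired:    # room to start a new one
            new += 1
            steps.append(f"start new ({old} old, {new} new)")
    return steps

for step in rolling_update(2, 50, 100):
    print(step)
```

With 50%/100% the service dips to one task during the update; with 100%/150% or 100%/200% it would burst above the desired count instead, which is the tradeoff discussed next.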

Jon:

Something that you just hit on reminded me of a question that we’d chatted about earlier, which is how many containers? Why would there be more than one? You talked about how there can be five different services and each one might need to be redundant. That right there can be 10.

When you just described the rolling updates, it made me feel like three may be kind of a magic number, because if you have three of each service, then that gives you the ability to take one down and still have redundancy while one is down and comes back up. At no point do you have less than two running as you’re doing a rolling deployment. It feels like a minimum of at least three might be something that a lot of companies reach for when they’re doing this.

Chris:

Yeah. You can totally specify that with your service definition, saying what your minimum threshold is and your maximum. You tell ECS, your service, “This is the number that I desire to have running.” Let’s just say it’s two; we want two instances of our task running. You then tell it what the minimum percent to be running at any one time is, and the maximum.

If we wanted the case where there are never less than two of these things running, we would set our minimum threshold to 100%. That tells ECS to never go below two. If we do that, we have to change the maximum to be above 100%, otherwise it can’t create any new instances, and we’re never gonna be able to deploy.

If we set it at something like 200%, or maybe 150%, that means ECS is allowed to go above the desired count in order to do its update. Once the update has happened, it can then terminate instances to come back down to the desired level. You can control exactly how you want this to roll out based upon those minimum and maximum thresholds. The downside is that you then have to have the capacity on your cluster to support bursting to those higher numbers.
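Those thresholds translate into concrete task-count limits. A small sketch of the arithmetic, assuming (as the ECS deployment rules describe) that the minimum bound rounds up and the maximum bound rounds down:

```python
import math

# Turn the deployment thresholds into task-count limits for a service.
# Assumes min rounds up and max rounds down, per the ECS deployment rules.

def deployment_bounds(desired, minimum_healthy_pct, maximum_pct):
    lower = math.ceil(desired * minimum_healthy_pct / 100)
    upper = math.floor(desired * maximum_pct / 100)
    return lower, upper

print(deployment_bounds(2, 100, 200))  # never below 2, may burst to 4
print(deployment_bounds(2, 100, 150))  # never below 2, may burst to 3
```

The second case is the 150% scenario Chris mentions: the cluster must have room for one extra task per service during a deploy.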

It kind of boils down to how often you’re deploying and how many different services you have. If you say, “I never wanna go below two; I’m always gonna burst above it,” you can end up in a state where you have to have a few extra host EC2s to handle just the additional resources that you need to deploy, so that you don’t violate those rules. There are some tradeoffs there.

Jon:

That’s interesting. This whole process of deploying your code does sound like a few more steps, but one of the things that is awesome is that once that image is built and put into your Docker repository, that thing can be used anywhere. You can use that image on staging. You can use it on a test or a demo environment, or production. It never gets built again, which was not true in the past.

Chris:

Absolutely. It makes things like rollbacks just so painless; it’s so easy to do, for sure. The other thing, too, is it’s really easy to automate this stuff. It may sound like there are a lot of steps, but so much of this is super easy to automate. The way that we set up our systems at Kelsus, a developer commits code to GitHub, and once it merges to a designated branch, it just kicks off some scripts that make API calls to ECS to create the new task definition and to update the service, and the deployment just happens automatically. They don’t have to do anything. They just commit their code, and a few minutes later it’s running in the staging environment.
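The pipeline Chris describes can be sketched with stubbed-out steps. Every function name here is hypothetical; in a real pipeline each step would shell out to docker, or call the ECS API:

```python
# Stubbed deploy pipeline: build -> push -> register task def -> update
# service. All function names and values are hypothetical placeholders.

def build_image(commit):       return f"registry/app:{commit}"
def push_image(image):         return image     # to ECR or Docker Hub
def register_task_def(image):  return {"image": image, "revision": 2}
def update_service(task_def):  return f"deploying revision {task_def['revision']}"

def deploy_on_merge(commit, branch):
    """Run the pipeline only for the designated branch."""
    if branch != "staging":
        return None
    image = push_image(build_image(commit))
    return update_service(register_task_def(image))

print(deploy_on_merge("abc123", "staging"))
```

The value of the shape is that the developer never runs any of these steps by hand; the merge to the designated branch is the only trigger.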

Jon:

We’re just back to where we were when we started.

Chris:

Exactly.

Jon:

Cool. Alright, I think that can do it for this week, unless, did I leave anything out, Rich or Chris?

Rich:      

In my notes, I have that AWS defines the four key components of a container service: tasks, containers, clusters, and container instances. I think we tackled the first three of those, not in that order, but did we talk at all about what a container instance is, it being the fourth? There are also containers: are they separating the idea of one component as a container and one component as a container instance? What’s the difference there?

Chris:  

I think in that terminology, AWS is referring to as a container what we’re calling an image. That’s the container definition. It’s that self-describing unit of “this is all my code and operating system and everything it needs in order to run.” The instance is actually taking that image and running it on one of the host EC2s. I think that’s the terminology they’re using there: what they’re calling a container is what we’ve referred to as an image.

Rich:

That makes sense.

Jon:

Alright. Well, thanks, Rich and Chris. This is another fun one.
