83. Virtual Machines vs. Containers Revisited – Part 3

Summary

Containers are just lightweight virtual machines, right? No, not really. There’s much more to the story than that, so we decided to do a four-part series on virtual machines versus containers.

In parts 1 and 2, we discussed virtual machines in detail and how they work. Now, in parts 3 and 4, we turn our attention to containers. Turns out, containers are not very complicated. They are just normal Linux processes with some isolation superpowers.

In today’s episode of Mobycast, Jon and Chris go into depth on containers, their history and the underlying operating system technologies that make them possible. If you ever wondered why you can’t run Windows containers on a Linux host, this episode will clear up the mystery.

Show Details

In this episode, we cover the following topics:
  • Operating-system-level virtualization = containers
    • Allows the resources of a computer to be partitioned via the kernel
      • All containers share single kernel with each other AND the host system
    • Depend on their host OS to do all the communication and interaction with the physical machine
      • Containers don’t need a hypervisor; they run directly within the host machine’s kernel
    • Containers use the underlying operating system’s resources and drivers
      • This is why you cannot run different OSes on the same host system
        • i.e. Windows containers can run on Windows only, and Linux Containers can run on Linux only
      • What we think of as different OSes (RHEL, CentOS, SUSE, Debian, Ubuntu) are not really different…
        • They are all the same core OS (Linux); they just differ in apps/files
    • Based on the virtualization, isolation, and resource management mechanisms provided by the Linux kernel
      • namespaces
      • cgroups
  • Container history
    • FreeBSD Jails (2000)
      • BSD userland software that runs on top of the chroot(2) system call
        • chroot is used to change the root directory of a set of processes
      • Processes created in the chrooted environment cannot access files or resources outside of it
      • Jails virtualize access to the file system, the set of users, and the networking subsystem
      • A jail is characterized by four elements:
        • Directory subtree: the starting point from which a jail is entered
          • Once inside the jail, a process is not permitted to escape outside of this subtree
        • Hostname
        • IP address
        • Command: the path name of an executable to run inside the jail
      • Configured via jail.conf file
    • LXC containers (2008)
      • Userspace interface for the Linux kernel features to contain processes, including:
        • Kernel namespaces (ipc, uts, mount, pid, network and user)
        • Apparmor and SELinux profiles
        • Seccomp policies
        • Chroots (using pivot_root)
        • Kernel capabilities
        • CGroups (control groups)
    • Docker containers (2014)
      • Early versions of Docker used LXC as the container runtime
      • LXC was made optional in v0.9 (March 2014)
        • Replaced by libcontainer
        • libcontainer became the core of runC
      • LXC was dropped in v1.10 (February 2016)
  • Container technology
    • Containers are just processes. So what makes them special?
    • Namespaces
      • Restrict what you can SEE
      • Virtualize system resources, like the file system or networking
        • Makes it appear to processes within the namespace that they have their own isolated instance of the resource
        • Changes to the resource are only visible to processes that are members of the namespace
      • Processes inherit from parent
      • Linux provides the following namespaces:
        • IPC (interprocess communications)
          • CLONE_NEWIPC: Isolates System V IPC, POSIX message queues
        • Network
          • CLONE_NEWNET: Isolates network devices, stacks, ports, etc
        • Mount
          • CLONE_NEWNS: Isolates mount points
        • PID
          • CLONE_NEWPID: Isolates process IDs
        • User
          • CLONE_NEWUSER: Isolates user and group IDs
        • UTS (Unix Timesharing System)
          • CLONE_NEWUTS: Isolates hostname and NIS domain name
        • Cgroup
          • CLONE_NEWCGROUP: Isolates cgroup root directory
      • Syscall interface
        • System call is the fundamental interface between an app and the Linux kernel
          • i.e. Linux kernel calls to create/enter namespaces for processes
    • Control groups (cgroups)
      • Restrict what you can DO
      • Limits an application (container) to a specific set of resources like CPU and memory
      • Allow containers to share available hardware resources and optionally enforce limits and constraints
      • Creating, modifying, using cgroups is done through the cgroup virtual filesystem
      • Processes inherit from parent
      • Can be reassigned to different cgroups
      • Resources that can be controlled:
        • Memory
        • CPU / CPU cores
        • Devices
        • I/O
        • Processes
      • Using cgroups
        • To see mounted cgroups:
          • mount | grep cgroup
        • To create a new cgroup:
          • mkdir /sys/fs/cgroup/cpu/chris
        • To set “cpu.shares” to 512:
          • echo 512 > /sys/fs/cgroup/cpu/chris/cpu.shares
        • Now add a process to this cgroup:
          • echo <get_pid> > /sys/fs/cgroup/cpu/chris/cgroup.procs
  • Pseudo code: Creating a container
    • Steps:
      • Create root filesystem for container
        • Spin up busybox in Docker container, and then export filesystem
      • Run “launcher” process that sets up “child” namespace
      • Launcher process forks new child process (now under new namespaces)
        • Child process then forks new process for container
          • chroot (to our root filesystem)
          • mount any other FS
          • set cgroups (e.g. apply CPU constraints)
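
The steps above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not how Docker or runC actually does it: it assumes Linux, root privileges, a cgroup v1 cpu hierarchy mounted at /sys/fs/cgroup/cpu (as in the commands above), and a root filesystem that has already been exported to a directory. The function names, the "demo" cgroup name and the single fork (the notes describe two) are simplifications chosen for the example.

```python
import ctypes
import os

# Flag values from <linux/sched.h>, one per namespace type.
CLONE_FLAGS = {
    "mount": 0x00020000,   # CLONE_NEWNS
    "cgroup": 0x02000000,  # CLONE_NEWCGROUP
    "uts": 0x04000000,     # CLONE_NEWUTS
    "ipc": 0x08000000,     # CLONE_NEWIPC
    "user": 0x10000000,    # CLONE_NEWUSER
    "pid": 0x20000000,     # CLONE_NEWPID
    "net": 0x40000000,     # CLONE_NEWNET
}


def namespace_flags(names):
    """OR together the CLONE_NEW* flags for the requested namespaces."""
    flags = 0
    for name in names:
        flags |= CLONE_FLAGS[name]
    return flags


def join_cgroup(name, cpu_shares, pid):
    """Create a cgroup v1 cpu group, set cpu.shares, and add a process to it."""
    path = os.path.join("/sys/fs/cgroup/cpu", name)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "cpu.shares"), "w") as f:
        f.write(str(cpu_shares))
    with open(os.path.join(path, "cgroup.procs"), "w") as f:
        f.write(str(pid))


def run_container(rootfs, argv, namespaces=("mount", "uts", "ipc", "pid")):
    """Launcher: unshare namespaces, fork, then chroot and exec in the child."""
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(namespace_flags(namespaces)) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    pid = os.fork()  # the first child created after CLONE_NEWPID is PID 1 in the new namespace
    if pid == 0:
        try:
            os.chroot(rootfs)  # the exported root filesystem becomes "/"
            os.chdir("/")
            # Mount /proc so tools inside see the new PID namespace.
            libc.mount(b"proc", b"/proc", b"proc", ctypes.c_ulong(0), None)
            os.execv(argv[0], argv)
        finally:
            os._exit(127)  # only reached if setup or exec failed
    join_cgroup("demo", 512, pid)  # parent applies the CPU constraint to the child
    _, status = os.waitpid(pid, 0)
    return status
```

Run as root, a call such as run_container("/tmp/busybox-rootfs", ["/bin/sh"]) (a hypothetical path) would drop into a shell that sees only the exported files. Real runtimes do considerably more, for example making mounts private and using pivot_root rather than chroot.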

End Song
Bettie Black & Sophia – Something Beautiful

https://makemistakes.bandcamp.com/album/the-golden-years

Voiceover: Containers are just lightweight virtual machines, right? No, not really. There’s much more to the story than that, so we decided to do a four-part series on virtual machines versus containers. In parts one and two, we discussed virtual machines in detail and how they work. Now, in parts three and four, we turn our attention to containers. It turns out, containers are not very complicated. They are just normal Linux processes with some isolation superpowers. In today’s episode of MobyCast, Jon and Chris go into depth on containers, their history and the underlying operating system technologies that make them possible. If you ever wondered why you can’t run Windows containers on a Linux host, this episode will clear up the mystery.

Voiceover: Welcome to MobyCast, a show about the techniques and technologies used by the best cloud native software teams. Each week, your hosts Jon Christensen and Chris Hickman pick a software concept and dive deep to figure it out.

Jon: Welcome Chris, it’s another episode of MobyCast.

Chris: Hey, Jon. It’s good to be back.

Jon: Yeah. Good to have you back. Chris, what have you been up to lately, this week?

Chris: Yeah, this is a big week. We are sending my oldest son off to college, so it’s a time I think that’s come, but it’s going to be a huge adjustment for us. We’ve just been preparing and getting him all packed up, checking off stuff from the supplies list and trying to think of what he might need, and then we’ll be off and making the trek to take him to the college.

Jon: Right on, when do you think you’re going to see him again, next weekend?

Chris: Yeah, let’s see. The good news here is his college is only about an hour and a half away from us so it’s a nice drive.

Jon: Right, right, yeah, that’s what I was thinking.

Chris: I think parents’ weekend is about a month away so that will definitely be no later than that and well, it’ll really be up to him, I think but then I’ll always see him before then.

Jon: I can’t imagine, I mean, I have about another 10 years or so before I have to face that and it’ll be … yeah, I just don’t even know what your life looks like here but best of luck and yeah, next phase with one kid at home and one kid out.

Chris: Yeah, life. Life is a sequence of transitions and journeys and all that kind of stuff.

Jon: Yup, yup. Yeah, I guess, here, I don’t have such a big week happening. I went … I think last week, I mentioned that I was going to go surf in Baja and that was going to be fun and I did and now, I’m back and refreshed and I imagine Chris, you can hear it in my voice.

Chris: It’s a little bit salty. Yeah, absolutely. Yeah, that surfer vibe of just like chill.

Jon: For me, surfing for whatever reason is … you think of it as like this free flowing natural, out there, one with the ocean but really, it’s just like the hardest thing I’ve ever tried to learn how to do. It is just so difficult and like every little piece of progress is hard fought and a big part of the reason it’s hard fought is because you only get a few seconds on a wave and then you maybe only get 10, 15 waves in an hour or two of surfing. It’s like you get a total of maybe 45 seconds to a minute of actual practice, every time you go. That’s not a lot compared to other sports.

Chris: That’s not an efficient ratio.

Jon: No, no. Yeah. I’m kind of like a bulldog: when I decide to learn something, I just get on it and get on it and get on it until I’ve figured it out. Here we are, like 15 years after I started surfing, or maybe even more than that, and there are still some fundamental things that I just wish I could do better. Finally I had such a breakthrough this past weekend, and I got one move just dialed in better than I ever had before, and it was just there for me whenever I wanted it, and it was like, I think I should just quit. I think I just got as good as I’ll ever get.

Chris: That’s it, right there. Mission accomplished.

Jon: Right. Software is a little bit the same way. We have to learn and learn and learn and learn, and you never kind of get to that point where you’re like, okay, I’m done learning and I’m as good as I’m ever going to get. There’s always more, I can always get better.

Chris: Indeed, absolutely.

Jon: Let’s do that right now. Let’s get better. Yeah, so part three of virtual machines versus containers revisited.

Chris: Yeah.

Jon: What do you think Chris, what are we going to do in part three?

Chris: Right, yeah, so in the first two episodes, we kind of went deep into virtual machines, really understanding exactly what they are and how they work and what a hypervisor is, how it’s really instrumental in allowing virtual machines to actually exist and how that works and so we’ve got a really good understanding now of what a virtual machine is.

Jon: Yes.

Chris: It’s virtualizing the entire machine. It’s running a full copy of your operating system, along with the virtual copy of the hardware, and that VM is simulating enough hardware to allow whatever guest OS you’re running on there to run unmodified and to be run in isolation, right?

Jon: Unless you’re doing paravirtualization, see I’ve learned something.

Chris: Right, but even with paravirtualization, it’s still isolated. It’s just now, it’s making those calls to access the kernel via software as opposed to via hardware. Great discussions there. I think hopefully everyone now has a really good understanding of what a virtual machine is and so, obviously, this series is virtual machines versus containers. Now, it’s time to talk about containers and we can dive deep into those and understand exactly, okay, what is a container, and then follow that up with, okay, now how do these compare, right? By the end of this, we should have a really good fundamental understanding and be able to talk about virtual machines, containers and when to use either one of those. The advantages of them, maybe disadvantages of them and whatnot.

Jon: Yeah, this should be useful for your actual work but it should also be useful for when you get in an argument with your software developer and you get an argument with the architect on your team and you can just … you can just be like, I know what I’m talking about here. I’ll win this one.

Chris: Right.

Jon: Are you still … I’m just thinking about myself because I used to love doing that. I used to love practicing my arguing over software details and technology skills against the people that were better and more experienced on a team, and coming away with a win in one of those arguments was the best, best feeling.

Chris: Were you on the debate team in college?

Jon: No, in fact, I think I studied debate, or I tried to learn about debate, in junior high. I studied and studied and studied and then we had our debate, which was in front of people with a camera, and it was the one and only time in my life that it was my turn to stand up and I had five minutes to say something and I didn’t say anything. My face just turned red and I just stood there, and it was the most embarrassing thing that has ever happened. I think it’s driven me to be more of a public speaker since then, so now you’ve got a little personal story.

Chris: Okay, there you go.

Jon: Yeah. Anyway, yeah, I also love debating with the more experienced people on the team and now, I can debate with you.

Chris: Absolutely so, after this series, I should definitely be able to go deep into the debate on when someone says that containers are just lighter weight VMs.

Jon: Yes.

Chris: What do you mean by that? What does lightweight mean, right? Be able to go as deep as you want on that. With that why don’t we just talk about … again, there’s two broad types of virtualization. The full virtualization, that’s what … again, you’re virtualizing everything including the hardware, that’s what we’re calling virtual machines. Then you have operating system level virtualization and that’s really containers and so that’s kind of the first thing to take away here is what’s the surface area of virtualization and so with containers, it’s happening at the operating system level.
Perhaps, one of the most important things to keep in mind here is that all the resources of that computer are now being partitioned via the host kernel and all containers running on that host, they’re sharing the same single kernel with each other and the host. This is different with virtual machines. Virtual machines, there are multiple OSs and they each have their own kernel, right?

Jon: Yes.

Chris: With containers, they don’t have their own kernels. They’re getting that from the host machine so there’s a single kernel on that host machine and everyone is sharing that.

Jon: I think the part that was a mouthful is when you said that they are sharing the host kernel with themselves and the host. It was like, wait, what is the host? Yeah, it’s an operating system, it’s got to do itself, right, so it’s doing itself and it’s doing all the containers.

Chris: Right, yeah, so again, you can run containers on bare metal. You can also run them inside of a VM, right? Whatever it is that you’re running on, if it’s bare metal and you just installed Linux on a bare metal machine and now your containers are running on it, you’d have that single Linux kernel for the bare metal machine, and any containers you run on that machine are going to be leveraging that same kernel. The same applies if you’re running inside of a virtual machine as well: the kernel inside the virtual machine is being shared amongst the host as well as the containers that are running inside that VM.
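
One way to check the shared kernel for yourself, sketched in Python: ask the operating system which kernel is running. Run the same script on a Linux host and inside a container on that host and both should report the same kernel release, because there is only one kernel. The function name is just for illustration.

```python
import os


def kernel_identity():
    """Return (system name, kernel release) as reported by the running kernel."""
    info = os.uname()
    return info.sysname, info.release


if __name__ == "__main__":
    sysname, release = kernel_identity()
    print(f"{sysname} kernel {release}")
```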

Jon: And maybe some Mac users might be saying, wait, that doesn’t make sense to me. I had to install a VM in order to be able to run my containers when I initially did them, and that would have been because, until a couple of years ago, there was no native support for Docker on Mac. I think there is now, where you don’t have to run a separate VM on a Mac in order to run containers, you can just run them …

Chris: You’re still running a VM, it’s just now, it’s provided by Docker, so it’s using Alpine.

Jon: There still isn’t native support for containers in macOS, you have to have a VM running.

Chris: This is really a good point, a good subtle point, in that, because your containers are sharing the same kernel, right, it means that …

Jon: I think I know where you’re going.

Chris: Right, it means … it’s all the same operating system, right?

Jon: Yeah, it’s got to be the same.

Chris: It means that if you’re going to be running a Linux container, it’s got to be running on top of a Linux kernel.

Jon: Right.

Chris: macOS is not a Linux kernel.

Jon: Right.

Chris: Right?

Jon: Right.

Chris: Just by virtue of that fact alone, that means that you now have to have some way of providing the Linux kernel, and really the only way is a VM, right? Under the covers, that’s what’s happening: Docker is running inside of a VM, right, and this is the reason why you can’t run Windows containers on a Linux machine.

Jon: Right.

Chris: Right, because again, Windows containers are going to need a Windows kernel, and there’s no Windows kernel there on that Linux machine.

Jon: Yes.

Chris: The same thing goes vice versa, right, so Linux containers can only run on Linux. Because Linux containers are so ubiquitous, the various platforms have accommodated them, right, with these virtual machines. We talked about it on Mac: with Docker for Mac, it sets up that VM for you. On Windows with Hyper-V, they have support for spinning up a Linux VM so that you can run your Linux containers on the Windows platform and it feels like it just works, right, but again, underneath the covers, they’re running inside of a virtual machine.

Jon: Yes, so have you ever said something that’s so obvious but not until after you say it, is it like, “Oh my God, that is obvious.” It is, we’re talking about it, we’re getting to the details of what these things are that containers require a kernel to be of the same flavor of OS that they are. They use it so they need it to be same so yeah.

Chris: Yeah.

Jon: Of course, of course you can’t run Linux containers on a Windows machine without a VM.

Chris: Yeah.

Jon: Obviously.

Chris: Yeah.

Jon: Yes.

Chris: It is. It’s one of those things that it’s … you say it, it’s totally obvious but for whatever reason it’s not really intuitive and I think this kind of goes to the reason why we’re even talking about this in this series of what are VMs and containers and how do they compare.

Jon: Yes.

Chris: Do you really understand what a container is and like, this is a fundamental thing of a container but it’s not really called out, explicitly, like when you learn about it, right?

Jon: I know. It should be like, on the Docker home page, it should be like, a container is a program that runs on your operating system. That should be the very first thing it says, instead of like, whatever it does say something about … I don’t know, the last time I was on there, it was at least 18 months ago and I don’t really remember it being super helpful and it’s just like kind of, here’s a bunch of links to go download a bunch of programs more or less and that Docker was taking over the world, I think it also said that.

Chris: For sure, yeah. Part of this might be the fact that, at the end of the day, containers are not too terribly magical and they’re not really too terribly interesting.

Jon: Yes, yes.

Chris: If you actually describe them accurately for what it is that they are, then it takes away the magic and mystery of them and perhaps even the marketability …

Jon: Of cachet.

Chris: Yeah, absolutely.

Jon: Yeah, marketability. Yeah.

Chris: That may be one of the reasons for doing this, right, but we’re going to break it down today. This is Containers 101, just the fundamentals: this is what a container is, and we’ll pull back some of that mystery. We’re going to find out that there’s not a lot going on with a container. Definitely with a VM, there’s a lot more going on. With containers, it’s much more minimal.

Jon: Okay, so what’s next?

Chris: At the end of the day, a container is a process that is just being run on a host. Of course, I mean, these things are virtual, right? We’ve talked about how they have isolation. We’ve talked before about how these containers are like these hermetically sealed bubbles that ideally nothing else can see inside, unless you want to poke holes in that bubble. There’s that isolation, so how is this achieved? It’s really based upon some of the virtualization, isolation and resource management mechanisms that are provided by the operating system kernel.
Particularly, we’ll talk about the Linux world, and there are just two main technologies there that really have enabled containers: namespaces and cgroups. We will dive into those a bit more and understand how they work, but maybe as an example to drive home the point of what a container is: again, these are processes. They are using the same kernel that’s on the host. They’re sharing that OS kernel. One of the things that I think we think about with containers is like, “Oh, I’m running like this entire virtual …” I mean, it feels like a virtual machine, right? It feels like it’s got an operating system and it’s got applications in it, right?

Jon: Yeah, you can log in to it, SSH into it. It feels like a full computer.

Chris: Right, but again, it’s not. Really, all it is, is this process where you can copy files into it, and this process starts and it mounts that file system and that’s all it can see. So again, whatever kernel is already on that host, that’s what it’s using, and now these are the files that it can see. With a Dockerfile, what are we doing, right? We’re just adding files. We have the ADD instruction and we have COPY. We’re basically just adding files, and so you’re building up that file system that that process can see.

Jon: Okay.

Chris: We kind of think along the lines of, we have all these various Docker images out there and we can have all these different OSs like Red Hat or CentOS or SUSE or Debian or Ubuntu, right? We think, “Oh, these are like operating systems,” but if you think about it, they’re really not. They’re all the same core OS, right? They’re all Linux, they all use the Linux kernel, and really they just differ in the apps and files that are included in those distributions.

Jon: Okay.

Chris: Again, just think of these containers as … really, they’re just processes. They have a file system, a collection of files that they can see, and that’s really what you’re doing when you’re creating your Dockerfiles, when you’re building up these images: you’re just giving them, here are the files that are going to be available to you, type thing.
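
To illustrate the point that a distribution’s identity is just files, here is a small Python sketch that reads the distribution name from /etc/os-release, an ordinary text file shipped in most Linux images. Run it inside an Ubuntu container and a CentOS container on the same host and the names differ even though the kernel underneath is the same. The function names are illustrative, and the parser is deliberately simplified (it only strips double quotes).

```python
def parse_os_release(text):
    """Parse the KEY=value lines of an os-release file into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments and malformed lines
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"')
    return fields


def distro_name(path="/etc/os-release"):
    """Return the distribution's self-description, read from a plain file."""
    with open(path) as f:
        return parse_os_release(f.read()).get("PRETTY_NAME", "unknown")
```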

Jon: Well, that kind of makes sense from a Linux perspective because that really is what Linux is anyway. Everything is a file already. Processes are files. Your whole configuration of your operating system is files inside /etc. It’s just files, files, files, all the way down. It makes sense then that you could say, “Oh here’s a collection of files that is going to behave like Ubuntu.” I’m going to zip it up and call it an image and hand it over to my operating system, and it’s going to mount those somewhere and then do stuff in them. Use them as the basis of executing a process. It kind of makes intuitive sense to me.

Chris: Yeah, I mean, there’s the kernel initialization. The kernel has to be loaded, right, there’s device drivers and all that kind of stuff has to happen. When you flip on a machine that’s running Linux or any other operating system, there’s a bunch of stuff that happens to begin with, and then once it initializes enough, now it’s in its normal state, if you will, and it’s reading from the file system. With containers, the kernel has been loaded, all that stuff has already happened, and now it’s just going against the file system.
Just keep that in mind as we go through and talk about some of these particular technologies, namespaces and cgroups, and hopefully, it’ll really make this crystal clear.

Jon: Cool. Cool.

Chris: Before we do that though, maybe just talk quickly about container history, because as I was going through this, it’s really kind of interesting to understand how containers came to be and how long they’ve been around, and it helps understand, again, what is a container and how we got here. Going back to the year 2000, FreeBSD had a feature called Jails. This was a way of isolating processes running on FreeBSD, and a core part of it was based on the chroot operation. chroot basically is “change root,” so what it says is, for this particular process, your root, so slash, is going to be some other place on the file system, right?
It’s a form of namespacing, of isolation, so if you want this process to think that /users/jon is the root, then you just use this chroot system call to say, okay, that’s my new root.

Jon: Right, and if I do that, then inside /users/jon there better be a /etc, /bin, /var, /tmp, like all that stuff has to be there before anything can work.

Chris: Right. If you’re going to be a container, if you’re going to be a container, right? You don’t have … I mean, when this first came out, Jails, I mean, it wasn’t necessarily like, “Oh, we’re going to be running an entire copy of an operating system inside of it type thing.” It could be just like we just want … we just want the isolation, right? We don’t want it to see any other files outside of it. It’s kind of more of a security thing, right?

Jon: Yeah. It’s like don’t let me go dot, dot, all the way back into the real stuff.

Chris: Right.

Jon: cd / or cd .. all the way back into other people’s stuff.

Chris: Right. Right.
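
The chroot idea discussed here can be sketched in Python. os.chroot wraps the chroot system call and needs root privileges, so the example also includes a harmless helper that checks whether a candidate root has the directories Jon mentions. The function names and the list of required directories are assumptions made for illustration.

```python
import os
import tempfile


def missing_dirs(new_root, required=("bin", "etc", "var", "tmp")):
    """List the expected top-level directories that new_root lacks."""
    return [d for d in required if not os.path.isdir(os.path.join(new_root, d))]


def enter_chroot(new_root):
    """Make new_root this process's "/" (requires root privileges)."""
    os.chroot(new_root)
    os.chdir("/")  # otherwise the working directory stays outside the new root


if __name__ == "__main__":
    # Safe demo: build an empty candidate root and report what it lacks.
    candidate = tempfile.mkdtemp()
    os.mkdir(os.path.join(candidate, "bin"))
    print("missing:", missing_dirs(candidate))
```

After enter_chroot, a path like ".." from "/" resolves back to the new root, which is the "don’t let me go dot, dot" behavior described above.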

Jon: Cool, cool, and I know for sure that, it seems like Apple made big use of this for iOS and even on the Mac, just to kind of separate processes and make them safer from each other.

Chris: Yeah. I mean, it’s a pretty natural concept, right, this idea of isolating something, especially from a security standpoint, right? That was kind of like the first containers, if you will. So FreeBSD, this is again in the year 2000, so almost 20 years ago, and again, it’s based on that core chroot operation. It added additional capabilities on top of that so that it allowed for virtualization of not only the file system but also the set of users and the networking subsystem. Then you also have a configuration file for this, which looks really similar to a lot of other configuration files that we work with now.
When I was looking at it, it looked a lot like an NGINX configuration file, right, because it has things like what do you want the hostname to be, and its IP address, and what command is going to be run, what is the path of the executable that’s going to be running inside that jail.
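
To make the four elements of a jail concrete, here is a small Python sketch that models them and renders a jail.conf-style block. The parameter names path, host.hostname, ip4.addr and exec.start are real jail(8) parameters, but mapping the four elements onto them this way, along with the class name and sample values, is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Jail:
    name: str
    path: str      # directory subtree the jail is rooted in
    hostname: str
    ip: str
    command: str   # executable to run inside the jail

    def to_conf(self):
        """Render the jail as a jail.conf-style block."""
        return (
            f"{self.name} {{\n"
            f'    path = "{self.path}";\n'
            f'    host.hostname = "{self.hostname}";\n'
            f"    ip4.addr = {self.ip};\n"
            f'    exec.start = "{self.command}";\n'
            f"}}\n"
        )


if __name__ == "__main__":
    print(Jail("www", "/usr/jail/www", "www.example.org",
               "192.0.2.10", "/bin/sh /etc/rc").to_conf())
```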

Jon: Right, I’m also thinking about some famous celeb devs like Kris Nova and Jessie Frazelle. I’ve seen them make inside jokes like “containers are just chroot” or “why don’t we just go back to using chroot” when they’re dealing with Kubernetes headaches. Now, we can all get those jokes.

Chris: Yeah, there you go. That’s FreeBSD Jails. It is FreeBSD though, right, so that kind of limited its appeal and its adoption, and that’s one of the reasons why we’re talking about containers today instead of Jails. In 2008, there were improvements to the Linux kernel to provide isolation, virtualization and resource management capabilities, and this gave rise to LXC containers. This is eight years after FreeBSD Jails, and still about 11 years ago now, that LXC containers came on. These are actually full-blown containers. They support things like namespacing, and security capabilities like AppArmor and SELinux profiles and seccomp policies.
It takes advantage of chroot for file system virtualization and introduced cgroups, which we’re going to talk about as well. I think, again, the real eye opener here is that this is really the start of containers and it was built into Linux, so this is 2008.

Jon: You wouldn’t happen to know anybody that was making use of those do you?

Chris: I do. When Docker first launched, they were using LXC as the runtime for the containers. The early versions of Docker, up through version 0.9, which was March of 2014, were using LXC. It was only in version 0.9 that they then came up with an alternative to LXC.

Jon: Interesting, I could have asked a better question. This show is scripted.

Chris: Like I said, it’s very interesting to see like where it came from and the thing that’s … it was really interesting to me is that just LXC containers again have been around for a long time. They’re still around today, it’s still a healthy robust system.

Jon: I’m just imagining like super Linux computer programmer and geek nerds around 2008, going, “I’m going to make this cool thing,” making it and kind of putting it out there and being like, “Nobody cares about this cool thing I made.”

Chris: Yeah.

Jon: Then the Docker people kind of picking it up and figuring out how to market it.

Chris: Yeah, a little bit of that. Something else maybe to keep in mind is that LXC is all C code, so it’s very, very tight and fast, and so there are other environments, especially embedded environments and just more constrained environments, where LXC containers are really the only option and Docker is not a great option perhaps, or it’s getting better. That’s one of the big differences between these two: LXC is in C, whereas Docker is now written in Go.

Jon: Interesting.

Chris: Yeah, but this is one of the things I was looking at as I was doing this preparation. I was like, well, LXC containers are basically everything that we think of as Docker containers, so what’s really the difference here? At the end of the day, they’re using really the same kernel features to build these things. They’re just implemented in two different ways. Yeah.

Jon: Very cool.

Chris: That’s LXC containers. Again, they rolled out in 2008, that’s when they landed in the Linux kernel mainline, and they continue today as a robust project. Docker containers: up until 2014, they were using LXC as their container technology, then they switched over to their own implementation called libcontainer, which eventually became the core of runC, which we’ll talk about next time with container runtimes. Then Docker dropped LXC support in version 1.10, which was in February of 2016.

Jon: Interesting, okay.

Speaker 1: Just a sec, there’s something important you need to do. You must have noticed that MobyCast is ad free but Chris and Jon need your help to make this work for everyone. Please help the MobyCast team by giving us five stars on iTunes, writing positive reviews and telling your colleagues, friends, neighbors, children and pets about the show. Go ahead and do it now. Great. I promise not to ask you to do that again.

Chris: Maybe now it’s a good time just to dive in to the container technology and so we …

Jon: Yes, what we’ve all been waiting for.

Chris: Yeah, so containers are just processes, right? You’re literally just creating a process on your operating system, so what makes them special? What makes it a container? Really, it comes down to the isolation, virtualization, and resource management capabilities, and those come from two main things: namespaces and control groups, or cgroups.

Jon: To me, namespaces is just a word that computer people throw around to seem like they’re really hard core.

Chris: Yeah.

Jon: Everything has namespaces. It’s like, if you can’t think of a better name for your way of chopping things up, let’s call it a namespace. But let’s find out … okay, so we’ve talked about namespaces in a lot of other contexts. I’m curious to see what namespaces are in terms of containers.

Chris: Right, and so namespace … This is actually the term and it’s the actual …

Jon: The canonical form.

Chris: It’s the kernel feature, right? It’s actually called Linux namespaces; this is what the Linux folks called it. It’s not an arbitrary term. This is the feature in the kernel. What Linux namespaces do is restrict what you can see, okay? They virtualize system resources like the file system or networking, and they make it appear to the processes within that namespace that they have their own isolated instance of that resource. This is all about isolation and restricting what you can see, which is a big part of providing that hermetically sealed bubble, if you will, for containers.

Jon: Does it let you restrict anything and everything? You’ve mentioned networking and the file system, but does it also enable restrictions on memory, restrictions on devices, restrictions on all the other things?

Chris: The answer to that is yes and no. Yes, there are other things that you can provide isolation on via Linux namespaces. In particular, there are seven things that Linux namespaces provide for. Let’s just go through them quickly. One is IPC, so the IPC namespace. That’s inter-process communication, the more low-level IPC mechanisms, and it’s providing namespace isolation for that. That’s one thing you can isolate. Another one is the network, so all your network devices and stacks and ports and whatnot can be isolated in a namespace. Mounts, so all mount points, those can be isolated.

Jon: Makes sense.

Chris: Process IDs, those can be isolated in a namespace. Users and groups, those can be isolated in a namespace.

Jon: Typical.

Chris: Then UTS, which is UNIX Time-sharing System, and which for all practical purposes is really about isolating the host name and the NIS domain name. We can also isolate Cgroups. We’ll talk about Cgroups a little bit later, but Cgroups at the end of the day get exposed as files. As you alluded to earlier, everything in Linux is a file, right? That’s how Cgroups are implemented or seen, and there’s namespace isolation for that as well. Those are the standard namespace categories that Linux gives us. Those are all available.
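
The seven namespace types Chris lists can be seen on any Linux host. Here is a minimal Python sketch, assuming a Linux machine with /proc mounted. The dictionary name and the function name are ours, not from the episode; the flag values are the CLONE_NEW* constants from the kernel headers.

```python
import os

# The seven namespace types named in the episode, mapped to the
# clone(2)/unshare(2) flag that requests each one (<linux/sched.h>).
NAMESPACE_FLAGS = {
    "ipc":    0x08000000,  # CLONE_NEWIPC
    "net":    0x40000000,  # CLONE_NEWNET
    "mnt":    0x00020000,  # CLONE_NEWNS
    "pid":    0x20000000,  # CLONE_NEWPID
    "user":   0x10000000,  # CLONE_NEWUSER
    "uts":    0x04000000,  # CLONE_NEWUTS
    "cgroup": 0x02000000,  # CLONE_NEWCGROUP
}


def current_namespaces(pid="self"):
    """Return {entry: identifier} read from /proc/<pid>/ns.

    Each entry is a symlink whose target looks like 'uts:[4026531838]'.
    Two processes share a namespace when the targets match.
    Returns an empty dict where /proc/<pid>/ns is unavailable.
    """
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        names = os.listdir(ns_dir)
    except OSError:
        return result
    for name in names:
        try:
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            pass  # some entries may not be readable by this user
    return result


if __name__ == "__main__":
    for name, ident in sorted(current_namespaces().items()):
        print(name, ident)
```

Running it inside a container and again on the host should show different identifiers for the namespaces the runtime set up.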

Jon: Cool.

Chris: Another thing to consider is the process itself. When you’re launching a process, you can set up these namespaces for it, and any processes that are then instantiated by that process, its children, are going to inherit them from the parent. Everything that process does is going to be kept in the same isolation containment.

Jon: Just to make sure I understand, when you’re making a namespace, you say, of these seven things, here are the sub parts that this namespace is allowed access to. Sometimes maybe all of it, sometimes maybe sub parts of it. Then you give that namespace a name. You can call it Jon, and then when you start a process, you’re going to say, hey, this process, I want you to go inside namespace Jon, yeah?

Chris: Yeah, absolutely. You can pick and choose, right? It’s not all or nothing, and it’s not that if you don’t set up, say, the PID namespace, you no longer have access to it. If all you want to do is virtualize the mount points, then you can just set up a namespace for that and that’s what it will be. Everything else will be like …

Jon: Default access, it would be the default access for any process?

Chris: Well, I mean, you can think about it this way: everything has a namespace. You have the root namespace and it all flows down from there. You can think of it as a tree, so you’re always inheriting from your parent, unless your parent is overriding that.

Jon: Easy peasy.

Chris: Yeah.

Jon: Yes, I like it.

Chris: Yeah. Something else to keep in mind is that namespaces are a syscall interface, right? That’s the fundamental interface between an app and the Linux kernel, so it’s just making these Linux kernel calls to create, set up, and enter these namespaces for processes.
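
To make the "pick and choose" and "it’s a syscall interface" points concrete, here is a small Python sketch. The kernel calls involved are clone(2), unshare(2) and setns(2); this sketch only uses unshare, reached through ctypes. The helper names flags_for and unshare are ours, and the call itself normally needs root on Linux, so it is wrapped in a function rather than run.

```python
import ctypes
import os

# Flag values from <linux/sched.h>; one bit per namespace type.
CLONE_FLAGS = {
    "mnt": 0x00020000,
    "uts": 0x04000000,
    "ipc": 0x08000000,
    "pid": 0x20000000,
    "net": 0x40000000,
}


def flags_for(*kinds):
    """Combine only the namespace types you ask for into one bitmask."""
    flags = 0
    for kind in kinds:
        flags |= CLONE_FLAGS[kind]
    return flags


def unshare(flags):
    """Call unshare(2) through libc. Typically requires root (CAP_SYS_ADMIN)."""
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(flags) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

For example, unshare(flags_for("mnt")) would give the calling process a new mount namespace and leave every other namespace inherited from its parent.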

Jon: I think that we’ve never really talked about syscall on Mobycast before and we have talked about it with some of the members of our team and it’s a really straightforward thing but we don’t have time to get into it today so maybe we can come back and talk about that at some other point because it’s pretty interesting.

Chris: Sure, so that is namespaces. Again, it’s one of the key things that enable containers: that isolation of these resources. When you’re in a container, it’s using these namespaces so that all it sees is the network, the mount points, and the processes that are relevant to it, as set up by the runtime that’s hosting that container.

Jon: Restrict what you can see, underline see.

Chris: Yes, exactly. The other part is control groups, or more typically you’ll hear Cgroups. Cgroups restrict what you can do as a process, right? You asked before about memory. This is how you limit memory or CPU resources for a particular container or process. Cgroups allow containers to share the available hardware resources and optionally enforce limits and constraints.

Jon: Okay.

Chris: We talked a little bit about this: creating, modifying, and using Cgroups is all done through a virtual file system, and just like with namespaces, processes inherit from their parent.

Jon: Yup.

Chris: If a particular process belongs to a particular Cgroup, then any children it spawns will be part of that Cgroup as well.
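
Cgroup membership is visible through the same "everything is a file" idea. A minimal Python sketch, assuming Linux: each line of /proc/<pid>/cgroup has the form hierarchy-ID:controller-list:path. The function names are ours.

```python
def parse_cgroup_line(line):
    """Split one '/proc/<pid>/cgroup' line into its three fields."""
    hierarchy_id, controllers, path = line.rstrip("\n").split(":", 2)
    return {
        "hierarchy": int(hierarchy_id),
        # cgroup v2 lines have an empty controller list ("0::/some/path")
        "controllers": [c for c in controllers.split(",") if c],
        "path": path,
    }


def cgroups_of(pid="self"):
    """Return the parsed cgroup memberships of a process, or [] if unavailable."""
    try:
        with open(f"/proc/{pid}/cgroup") as fh:
            return [parse_cgroup_line(line) for line in fh if line.strip()]
    except OSError:
        return []


if __name__ == "__main__":
    for entry in cgroups_of():
        print(entry)
```

Comparing cgroups_of() for a parent and for a child it has just spawned is one way to observe the inheritance Chris describes.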

Jon: Yup.

Chris: Right, and so some of the common Cgroup categories would be things like memory, CPU and CPU cores, devices, I/O and processes.

Jon: I’m kind of curious about this. I’m just imagining, you making a Cgroup and you want to say that this process can have access … maybe the machine has a terabyte of memory and you want to give the process access to four gigs, would you have to say this process can have access to this address space to this address space or would it be like any four gigs you can find, set aside for this process and let the operating system figure out the address space.

Chris: I mean, the OS is going to figure out the addresses and where that memory belongs. The Cgroup is just saying, hey, this is how much memory I need, this is how much you should allocate to me, and that’s what I want, right? When that process starts up, this is just part of the operating system itself instantiating and starting the process: okay, how much memory does it get, and how much CPU is it going to get? Those are all just characteristics of that process that the kernel has to manage.

Jon: Right, and then of these categories, devices makes sense: there’s a list of devices, here are some you can access, here are some you can’t. I/O makes sense: you can get I/O from the keyboard or you can send I/O out to the screen or whatever, that makes sense to me. Processes also makes sense. There’s a list of processes and some of them have children, and you can have access to these ones but not these ones, that makes sense to me. But the one that does not make sense to me is CPU, and I feel like we kind of guessed around this in previous episodes around virtual machines and CPU scheduling, like the operating system has to say when a process gets to run its instructions and when it has to stop running instructions.
On that we decided that interrupts make a lot of sense: okay, the interrupt comes along and the CPU is like, all right, you’re out, you don’t get to run any more instructions. We weren’t sure about the magic around scheduling. Is it round robin, where everyone gets a little bit of time and then it’s the next process’s turn? Does this have anything to do with that? What are we saying when we say we’re restricting you to some amount of CPU? Is it a priority number? What is it?

Chris: Cgroups, it’s restricting what you can do from the aspect of specifying what limits and constraints you have against those resources. It’s not whether or not you can see it. It’s what … how much of it you get, right? It’s like for processes, it’s not what processes you can see. That’s done through namespacing. Instead, it’s like maybe you’re only allowed to create a maximum of 20 processes. Okay, that’s your limit. You can’t have more than 20 processes running inside of your container. That would be what you’d use Cgroups for.
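
The 20-process limit Chris mentions maps onto a single file write. Below is a minimal Python sketch. The control file name pids.max is the real one used by the pids controller; the helper names are ours, and the demo writes into a temporary stand-in directory so it runs without root. On a real host the directory would be a cgroup under the cgroup mount point, and the kernel would enforce the value.

```python
import os
import tempfile


def set_limit(cgroup_dir, control_file, value):
    """Write one limit into a cgroup directory, e.g. pids.max = 20."""
    with open(os.path.join(cgroup_dir, control_file), "w") as fh:
        fh.write(f"{value}\n")


def read_limit(cgroup_dir, control_file):
    """Read a limit back as a string, without the trailing newline."""
    with open(os.path.join(cgroup_dir, control_file)) as fh:
        return fh.read().strip()


if __name__ == "__main__":
    # Stand-in directory (an assumption for the demo); a plain directory
    # only stores the number, it does not limit anything.
    demo_cgroup = tempfile.mkdtemp()
    set_limit(demo_cgroup, "pids.max", 20)
    print(read_limit(demo_cgroup, "pids.max"))
```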

Jon: Okay.

Chris: Right, but the processes that you see, that’s done by the namespace.

Jon: Yup, okay. There’s a big part of what I just said that Chris has corrected me on, so listeners, it’s not about what you can see, it’s about what you can do. But I still have my question about what the "what you can do" part means for a CPU.

Chris: Memory is really straightforward, right? When the operating system instantiates a process, it can say, okay, this is how much memory this thing gets, and so that’s pretty crystal clear. With CPU and CPU cores, some of that is a little bit more fungible, right? CPU cores is pretty straightforward, you can say, you want to get one core …

Jon: Yeah, so against those CPU cores, you only get to use two or whatever, yeah.

Chris: For CPU utilization, it’s more along the lines of CPU shares, right? You can say, this is the amount of CPU shares it should get. Really, it’s not a hard constraint. It’s more of a target to achieve, right?

Jon: What is a CPU share?

Chris: This is really … we see this in practice with container orchestration systems like ECS, right? When we’re creating an ECS task, we say how much CPU and memory it gets. Memory is a hard limit, and CPU is more of a watermark. If there’s available CPU on that host, then containers can burst above whatever their shares are set to, but it’s a way of allocating how much …

Jon: Totally, this is making sense, right? You just divide up all of the instructions per second into shares, so maybe it means you have a billion instructions per second, and if you divided that into a thousand shares, then each share is worth a million instructions.

Chris: Right, yeah. It’s just a way of saying, like hey, there’s a finite amount of power that the … of CPU power that’s here, what percentage of it do I expect my process to use of that?
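
One way to read "shares" is as relative weights that only matter when CPU is contended. The Python sketch below shows that arithmetic. The function name and the example group names are ours, and it models the proportional idea only, not the scheduler itself.

```python
def cpu_fractions(shares):
    """Given {group: shares} for the groups that are busy right now,
    return the fraction of CPU time each would be entitled to."""
    total = sum(shares.values())
    return {name: weight / total for name, weight in shares.items()}


if __name__ == "__main__":
    # Three busy groups: the one with twice the shares gets twice the time.
    print(cpu_fractions({"web": 1024, "worker": 512, "cron": 512}))
    # If only one group is busy, it can use everything (the "burst" case).
    print(cpu_fractions({"worker": 512}))
```

This is why the absolute number matters less than the ratio between groups: 1024 against 512 divides the CPU the same way as 2 against 1.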

Jon: Does the CPU share setting let you find out how many CPU shares there are? Is there a command you can run to say, hey, how many CPU shares are there, and then I want to use like three of them? I don’t even know what a normal number is. Is it three, is it 3,000, is it 300,000?

Chris: I think it’s one of those things that’s a bit arbitrary and it’s up to whatever platform you’re running on.

Jon: Okay.

Chris: For this particular point, the CPU share setting in Cgroups, I’m not sure if the Linux kernel actually does anything with it and enforces it, or if it’s more a way of tracking and keeping that metadata of what should be there. An example of this again is ECS. ECS is using the CPU shares on your task to decide whether or not it should schedule something on a particular machine. With ECS, they do come up with a number: for them, every CPU is 1024 shares. For every core that you have on that EC2 instance, there are 1024 CPU units, and your task says how many units it’s taking up.
If you have a task that says, I’m going to use 512, you can only ever spin up two of those tasks on that EC2 instance if there’s only one core on it. If it’s a T2 micro or something like that, there’s only one virtual core, so it’ll only ever schedule two tasks on it, because they’ve basically said, we’re going to use up all the CPU between these two tasks.
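
The ECS bookkeeping in this example is simple integer arithmetic. A minimal Python sketch, using the 1024 units per core figure from the episode; the function name is ours and it ignores memory and any other placement constraints.

```python
ECS_UNITS_PER_VCPU = 1024  # figure quoted in the episode


def tasks_that_fit(vcpus, units_per_task):
    """How many tasks with a given CPU reservation fit on one instance."""
    return (vcpus * ECS_UNITS_PER_VCPU) // units_per_task


if __name__ == "__main__":
    # One virtual core, tasks reserving 512 units each
    print(tasks_that_fit(1, 512))
```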

Jon: Yeah, yeah, yeah. I’m just imagining, if you only have two containers on there and you’ve each given them one, it’s probably going to let them use a lot of CPU just to do the work because there aren’t any more containers on there so it’s probably smart enough to know not to like …

Chris: Right and that’s why it’s not a hard constraint, right? It’s kind of more for just allocation and scheduling and kind of like bookkeeping than it is for like this hard limit.

Jon: Yeah, this is a fascinating topic, and I feel like it’s probably the most hard core part of all of this, and it’s not even all that important, really, at the end of the day for most of our jobs. But it’s fascinating how CPUs divide themselves up, how you tell the CPU what part of it you want, when your instructions are going to get run versus when some other instructions are going to get run, and what kinds of stuff create noisy neighbor issues. That is pretty fascinating to me for some reason.

Chris: Yeah.

Jon: I’m sure, if you were to go and look at the code, I bet there would be quite a bit of code in that area of container implementation and inside the Kernel itself.

Chris: Like CPU scheduling and … yeah, absolutely.

Jon: I would imagine that’s probably one of the more opaque pieces of the Kernel.

Chris: Sure. Yeah, I mean, that’s the hard … I mean ….

Jon: Yeah, hard core.

Chris: That’s what makes the Kernel, right?

Jon: Right.

Chris: It’s between that and managing memory and some devices … I mean, that’s the crux of it. That is namespaces and Cgroups. Those are the two main technologies, along with chroot, that allow for containers; it’s really what containers are. They’re just normal processes, but they have been namespaced into isolation across those seven different possible types of namespaces, and they have Cgroups that limit or constrain which resources they can use and how much of them. That allows for things like making sure that we don’t have containers that take up more resources than they really should, and it allows for sharing and basically divvying up the resources of that machine.

Jon: Cool and I think that it makes total sense to me that there would be one namespace per process but could there be more than one Cgroup per process or is it also just one, like I’m going to run a process, I’m going to be in this name space and also in these three Cgroups or just this Cgroup?

Chris: I don’t think there’s anything stopping a process from belonging to more than one Cgroup. We talked about how you create, modify, and use Cgroups through a virtual file system. Your Cgroups are going to be mounted somewhere on your file system as this virtual file system, and they’re basically just directories set up for these Cgroups. If you want to add a process to a particular Cgroup, it’s literally just navigating to the folder for that Cgroup. There’s going to be a file there called cgroup.procs, and you just do something like echo the PID and append it to the end of that file, and that process is now part of that Cgroup.
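
The "echo the PID into cgroup.procs" step looks like this as a Python sketch. The file name cgroup.procs is the one Chris names; the function names are ours, and the demo uses a temporary stand-in directory so it can run without root. In a real cgroup directory the kernel moves the process when the PID is written; a plain directory only records the number.

```python
import os
import tempfile


def add_to_cgroup(cgroup_dir, pid):
    """Append a PID to cgroup.procs, mirroring `echo PID >> cgroup.procs`."""
    with open(os.path.join(cgroup_dir, "cgroup.procs"), "a") as fh:
        fh.write(f"{pid}\n")


def members(cgroup_dir):
    """List the PIDs recorded in a cgroup directory's cgroup.procs file."""
    with open(os.path.join(cgroup_dir, "cgroup.procs")) as fh:
        return [int(line) for line in fh if line.strip()]


if __name__ == "__main__":
    demo_cgroup = tempfile.mkdtemp()  # stand-in for a real cgroup directory
    add_to_cgroup(demo_cgroup, os.getpid())
    print(members(demo_cgroup))
```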

Jon: Okay.

Chris: You can again do this for multiple ones. At the end of the day, I mean, maybe it’s got to do something with how do you handle conflicts and overwriting if it’s specifying conflicting settings if you will with it but something to do a little bit of more research on as well.

Jon: Cool. I was just wondering, so we have one last little part that we want to get to and we’ve hit about 50 minutes right now but I think it’s so interesting and it won’t really fit on next week so do you think you can get through the pseudo code for creating a container in like three minutes?

Chris: Yeah, absolutely. This is born of one of the talks that I went to at DockerCon a couple of years back. It was a talk on namespaces and Cgroups and how these relate to containers, and as part of that, the speaker did a demo of, let’s go and make some of these calls to create a container, if you will. Again, it takes away a lot of the mystery. Let’s do that, right? I’m just going to walk you through what that process looks like, to hopefully shed some light on it. There’s not a lot going on here at the end of the day.
To create our container, first we need a root file system. For this, we can cheat a little bit and make it really easy: we can just spin up Docker with the BusyBox image. That’s just a base OS image. We launch a container running that image, and then we can use the docker export command to export that file system out onto our host, okay? We’re just doing that, and we can tar it up, and now we have the root file system that we need for our container.

Jon: It’s got /etc, it’s got /var, all that good stuff.
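
The root file system step can be sketched in Python as follows. It assumes Docker is installed and the busybox image can be pulled. The container name rootfs-src, the tar file name, and the function names are ours, and using docker create instead of running the container is our shortcut; docker export works on a created container as well.

```python
import os
import subprocess
import tarfile


def export_commands(image="busybox", name="rootfs-src", tar_path="rootfs.tar"):
    """The Docker CLI calls that produce a tarball of the image's file system."""
    return [
        ["docker", "create", "--name", name, image],
        ["docker", "export", "--output", tar_path, name],
        ["docker", "rm", name],
    ]


def unpack_rootfs(tar_path, dest):
    """Unpack the exported tarball into a directory to chroot into later."""
    os.makedirs(dest, exist_ok=True)
    with tarfile.open(tar_path) as tf:
        tf.extractall(dest)


def build_rootfs(dest="rootfs"):
    """Run the Docker commands, then unpack the result (needs Docker)."""
    for cmd in export_commands():
        subprocess.run(cmd, check=True)
    unpack_rootfs("rootfs.tar", dest)
```

Calling build_rootfs() should leave a rootfs directory containing the familiar top-level directories such as bin and etc.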

Chris: Yeah, so we have that. The next part is, we’re going to do some system calls to set up our namespaces. The first thing we need is a launcher program, and it’s going to be the one responsible for setting up the namespaces. We’ll create a namespace for, let’s say, UTS to give us our new host name. We’ll set the namespace for PIDs so that the only processes we can see are the processes inside our container, and then we’ll also do networking and mount points. This is really easy to do in a language like Go. Go gives you access to syscalls, so you can literally just set up a command object in Go.
You can set it to run yourself, and a shortcut to that is /proc/self/exe. We talked about how everything is a file in Linux, and we talked about the virtual file system for Cgroups. There’s also a virtual file system for processes, and it’s at /proc. The currently running process is /proc/self, and then /proc/self/exe is the executable that is actually being run. So you can set up your command to run yourself again, make the syscall to set up your new namespaces for UTS and PID and mounts, and then run it. When you actually execute that, you’ve got a new process that just got launched.
It’s now in this new namespace, and what you want to do next is set up your actual container executable. You’re going to set up another command. This time you’re going to set the executable to be whatever you want your container command to be. You can make some syscalls to set your host name; I set my host name to Fubar. I can do a chroot to the root file system that I exported from BusyBox. I can do a chdir to make sure that I’m there at the root. I can mount some mount points if I want, and then I go ahead and just run that. For all intents and purposes, that is now a container.
I’ve got my namespacing. I’ve got my new file system. It’s been chrooted, and I can now go into that and I’m not going to be able to see anything outside of that file system. It’s only what’s inside of there. After that, if we want to, we can set constraints with Cgroups. We can find out what our PID is, and we can set up something like CPU shares or set the max processes to 20 so that it can’t kick off any more processes than that.
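
Here is a hedged Python sketch of the launcher described above. Chris’s version is in Go and re-executes /proc/self/exe with new namespaces; this sketch swaps that for unshare(2) followed by fork, which reaches a similar end state with fewer moving parts. It assumes Linux, root privileges, and a root file system directory such as an unpacked BusyBox export. All function names are ours.

```python
import ctypes
import os
import socket
import sys

CLONE_NEWNS = 0x00020000   # mount points
CLONE_NEWUTS = 0x04000000  # host name
CLONE_NEWPID = 0x20000000  # process IDs
NEW_NAMESPACES = CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWNS

MS_REC = 0x4000            # apply to the whole mount tree
MS_PRIVATE = 0x40000       # do not propagate mount changes


def _libc():
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mount.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p,
                           ctypes.c_ulong, ctypes.c_void_p]
    return libc


def _check(ret):
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def run_container(rootfs, hostname, argv):
    """Run argv in new UTS, PID and mount namespaces. Needs root on Linux."""
    libc = _libc()
    _check(libc.unshare(NEW_NAMESPACES))
    pid = os.fork()  # the first child becomes PID 1 in the new PID namespace
    if pid == 0:
        try:
            socket.sethostname(hostname)
            # Keep our mounts from propagating back to the host.
            _check(libc.mount(None, b"/", None, MS_REC | MS_PRIVATE, None))
            os.chroot(rootfs)
            os.chdir("/")
            # A fresh /proc so tools like ps only see the container's processes.
            _check(libc.mount(b"proc", b"/proc", b"proc", 0, None))
            os.execvp(argv[0], argv)
        except OSError as exc:
            print(f"container setup failed: {exc}", file=sys.stderr)
            os._exit(1)
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)


if __name__ == "__main__":
    if len(sys.argv) >= 3:
        # e.g. sudo python3 launcher.py ./rootfs /bin/sh
        sys.exit(run_container(sys.argv[1], "fubar", sys.argv[2:]))
    print("usage: launcher.py ROOTFS COMMAND [ARGS...]")
```

Inside the resulting shell, hostname should print the new name and ps should list only the shell and its children. Cgroup limits are not applied here; they would be a separate step of writing the child’s PID and limits into a cgroup directory.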

Jon: Right, so I think that of those three steps, that second one, where you’re setting up your child namespace and you talked about Go, was a little opaque for me, and I think for our listeners too it’s probably a little tough. It’s probably because of some of the additional syscall stuff that you threw out there on the fly. In order to get everybody over that hump, maybe during the next episode we could talk about syscalls a little bit more, and talk about forking processes. Just talking about a few of those Linux fundamentals that I know our team didn’t have ready at hand when we talked about them at Kelsey’s Camp last year or earlier this year.
I think that our audience would benefit from that and then we could redo this pseudo code example again next week with that in mind along with the other stuff that you want to talk about next week, does that make sense?

Chris: Sounds good. Yeah, and then maybe just to wrap this up and drive it home: containers are just processes. We talked about how proc is exposed as a virtual file system, so if you’re on that host, you can cd to /proc and go to the PID for the particular container that’s running. If you have access, if you have sudo access, there’s going to be a directory, and there’s going to be a file in there called environ. This contains all the environment variables being used by that process. You just cat that file and you can see all the environment variables that are defined inside that container.
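
The environ file is a sequence of NAME=value entries separated by NUL bytes, so reading it takes only a few lines. A minimal Python sketch with function names of our choosing; reading another user’s process normally requires root, as Chris notes.

```python
def parse_environ(raw):
    """Turn the NUL-separated bytes of /proc/<pid>/environ into a dict."""
    env = {}
    for entry in raw.split(b"\0"):
        if not entry:
            continue  # skip the empty piece after the final NUL
        key, _, value = entry.partition(b"=")
        env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def read_environ(pid):
    """Read the environment a process was started with."""
    with open(f"/proc/{pid}/environ", "rb") as fh:
        return parse_environ(fh.read())
```

Calling read_environ with the host-side PID of a containerized process would show the variables that were passed to that container, which is why secrets in environment variables are visible to anyone with root on the host.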

Jon: Right.

Chris: Right, so we’ll hear about things like container leakage, and this is where some of that stuff comes in. Again, you have to have root access in order to cat that file, but these are just normal processes, and it’s just the standard Linux way of managing processes and all the information that it exposes. Hopefully, that’s taken away some of the mystery of containers and how they work.

Jon: Yeah, it definitely helps. I think after today, I totally understand namespaces. I totally get the idea of Cgroups. I was a little mystified around the pseudo code for creating a container, so I think we can clean that up next week and then move on and talk about containerd and RunC.

Chris: Sounds good.

Jon: Cool. Thanks so much Chris, that was super useful.

Chris: All right. Thanks, Jon.

Jon: Yup, talk to you later, bye.

Chris: Bye.

Voiceover: Nobody listens to podcast outros, why are you still here? That’s right, it’s the outro song. Come talk to us at mobycast.fm or on Reddit at r/mobycast.

  • Thanks so much for this chapter! I learned a lot from it, and besides understanding better how containers work under the hood, it also gave me a great intro to Linux Kernel, which was extremely useful for me as a former Windows Kernel developer.
