May 22, 2019

61. Just What is a “Service Mesh”, and If I Get One Will It Make Everything OK? (A Dockercon 2019 Recap)

Jon Christensen and Chris Hickman of Kelsus recap Elton Stoneman’s DockerCon 2019 session titled, Just what is a “service mesh,” and if I get one, will it make everything OK?

Some of the highlights of the show include:

Elton’s Definition of Service Mesh: Communication between software components made into its own backbone for components to talk to each other
Service meshes are DNS, but up a level; service discovery is DNS for microservices
Why you’d need a service mesh: Connections between API, Web, and database servers
Service Mesh Architecture: Proxy to your software component gives access to mesh
General features of service mesh include traffic management, security, and observability
When to use service mesh: Numerous microservices, snowflake implementations, and Cloud-native application pilots
Service mesh offers a lot of functionality, but isn’t free; adds a layer of complication
Service Mesh Alternatives: Feature toggles, dynamic DNS, load balancers, etc.

Links and Resources

Just what is a “service mesh,” and if I get one, will it make everything OK?

Rich: In episode 61 of Mobycast, we recap another DockerCon 2019 session titled “Just What Is A ‘Service Mesh,’ And If I Get One Will It Make Everything OK?” by Elton Stoneman. Welcome to Mobycast, a weekly conversation about cloud-native development, AWS, and building distributed systems. Let’s jump right in.

Jon: Welcome, Chris. It’s another episode of Mobycast.

Chris: Hey, Jon. Good to be back.

Jon: Alright. Two in a row. Our missing range is a little tough. I’ll ask you what you’ve been up to lately, Chris.

Chris: I have been enjoying the weather here in Seattle. It’s sunny. I think it’s going to hit almost 80 degrees today. I’ve been getting a lot of outside time making up for our incredibly long winter. This is just so joyful––seeing the sun, feeling the heat, wearing shorts, and trying not to look at the forecast that’s coming for the next week, or this getting taken away.

Jon: Yeah. It’s awful here. It’s supposed to be the start of river surfing season. If you don’t know what that is, just Google river surfing. I was excited for it. Now the flows are all down, because river flows during melt season go down when it gets cold. They don’t create a nice big wave to surf on.

Chris: You need a flamethrower or something like that.

Jon: Right. It’ll melt all the mountains.

Chris: Yes. Stations and people above.

Jon: We’re going to do another episode talking about one of the other really good talks that you saw at DockerCon. This will probably be a two parter as well. As we said, DockerCon is really more focused on the enterprise and also likely on people that are kind of new to Docker. A lot of the talks that were available were not for people that have been using Docker since its inception.

This one, for me, I’m really excited about. We’re going to talk about service meshes, which until today I’ve known only just enough about to figure that we’re probably fine without them for our clients and within our company at Kelsus. I hope to be proven wrong. Maybe it’s time. Maybe they’ve become something that even users of four or five or a dozen microservices could benefit from.

Let’s get into it, Chris. Let’s talk about service meshes.

Chris: Yeah. It’s a good time to talk about it because the term service mesh has been out there for a while. It’s a pretty hot buzzword, if you will. You hear a lot of talk about it, definitely in relation to the most popular service mesh right now, which is Istio. I saw this talk there at DockerCon, and it made sense to go. The title of the session was, “Just What Is A ‘Service Mesh,’ And If I Get One Will It Make Everything OK?” It was given by Elton Stoneman, who was a developer at Docker.

What’s really good about this session is that it talks about service meshes at a general, high level. Things we’ll kind of cover: What is a service mesh and why do we need them? How do they work? What’s the basic architecture? What functionality do you get with them? Should you use one? What is the cost of doing this? That’s what we’ll talk about today, just to give a good treatment of service meshes. By the end of this, you should definitely have a good idea about that question you had at the top: “Hey, should we be using this for our clients?” The answer should be very evident.

With that, service mesh defined: what is this? Elton had his own simplified definition. He likes to describe it as the communication between software components made into its own thing. Note, it’s not container-specific. It’s not microservice-specific. It’s just referring to this communication backbone, if you will, for these components to be able to talk to each other.

Jon: The old […] in me says that we used to call that middleware but okay.

Chris: There’s been just anything and everything under the sun, various different technologies, we used to have, I don’t know if you remember, CORBA.

Jon: Yeah. That’s exactly what I’m getting at. That handles all that communication.

Chris: The difference here is most of those things weren’t necessarily doing network traffic. Think of a service mesh as the switchboard, the telephone operator that’s connecting calls. Someone tries to make a call and says, “I want to go and talk to this person,” so the operator makes that connection. The person calling doesn’t know how that connection needs to be made. They just ask for the connection to be made. The operator knows; they have the map, if you will, of how to reach them.

That’s kind of what a service mesh is doing, versus things like object request brokers and that middleware. I think that’s more about just interoperability and […]. They were more in the application layer than in the network layer. That’s really where the service mesh lies.

Jon: Okay. With that description, I want to poke on this a little bit more. I know we’re going to get deeper into what it actually is and what it does. It’s making me feel like service meshes are kind of DNS but up a level like DNS for your service.

Chris: It’s a big part of that. You can’t really talk about service mesh without talking about service discovery. That’s a whole big topic, and it’s definitely one of the things that a service mesh will give you. Service discovery is basically DNS for microservices, where you have many, many microservices, and keeping track of what their names are and where they live becomes a difficult problem. Service discovery pushes that further and makes it much more dynamic. That’s definitely something that a service mesh will help facilitate.
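To make the “DNS for microservices” idea concrete, here is a minimal Python sketch of a service registry. It is only an illustration, not how Istio or any particular mesh implements discovery: the `ServiceRegistry` class, the service name `orders-api`, and the addresses are all hypothetical. Instances register under a logical name, and callers resolve that name to one live address in round-robin order.

```python
import itertools


class ServiceRegistry:
    """Toy in-memory registry (hypothetical, for illustration only)."""

    def __init__(self):
        self._instances = {}  # service name -> list of "host:port" strings
        self._cursors = {}    # service name -> round-robin iterator (or None)

    def register(self, name, address):
        """Record a new instance and restart the round-robin cycle."""
        self._instances.setdefault(name, []).append(address)
        self._cursors[name] = itertools.cycle(list(self._instances[name]))

    def deregister(self, name, address):
        """Remove an instance, e.g. after it fails a health check."""
        self._instances[name].remove(address)
        remaining = list(self._instances[name])
        self._cursors[name] = itertools.cycle(remaining) if remaining else None

    def resolve(self, name):
        """Return the next live address for a logical service name."""
        cursor = self._cursors.get(name)
        if cursor is None:
            raise LookupError(f"no live instances of {name!r}")
        return next(cursor)


registry = ServiceRegistry()
registry.register("orders-api", "10.0.0.5:8080")
registry.register("orders-api", "10.0.0.6:8080")
```

The caller only ever knows the name `orders-api`. Which instance answers, and whether an instance has gone away, is the registry’s concern, which is the “much more dynamic” part of the description above.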

Jon: Okay, that helps.

Chris: We have this general definition of service mesh. Why do we need a service mesh? During the talk, Elton walked through some examples. Let’s start with just a simple application. We have a web server and it needs to talk to a database. In order to do that, you need to either know or implement functionality for things like: What is the address of the database? How do I connect to it? What are the timeouts? How does that work? All that logic needs to be thought of. It needs to be implemented and dealt with.

If you just have two components, you just do it once and you’re done. Now, what happens if you add an API server into the mix? You have the web server, an API server, and a database server. Now you have to start doing the same things again. The connections between, say, the web server and the API server raise the same kinds of issues: What’s the address? How do I connect to it? What about encryption, timeouts, and whatnot?

With a microservices architecture, this can very quickly become an unmanageable task, because there are so many different software components and services and so many point-to-point connections that can be made between them. You have to deal with all these communication issues for each one.

Jon: I’m sort of with you, but I’ve got to interrupt again. I apologize. This is where my mind is going. Okay, you just put an apple, an orange, and a banana out. You said, “They all talk to each other the same way, let’s abstract that out.” I’m kind of like, “No, they don’t.” They’re an apple, an orange, and a banana. I will have very different rules for retrying against a database. Maybe a web server calling my API server is fine if it retries in a certain way, but I’m certainly not going to retry the same way against my database. The communication between them is different too. I guess, under it all, it’s TCP/IP, but there’s a reason that we’re not just writing socket code between our API server and our database. It’s all been raised up a level.

I’m nervous because it feels like we’re going to have a bag full of really disparate stuff and we’re going to try to make it interoperable, or try to find the lowest common denominator between all of it. How can that be useful?

Chris: A few things to keep in mind, staying with the fruit analogy. In this particular case, I did mention three components that are quite different in functionality: a web server, a microservice API, and a database server. Call it apple, banana, pear.

However, in a microservices architecture, you’re going to have lots and lots of API servers. They’re all bananas, different varieties and whatnot. Same thing with the apples: you can have lots of different apples, some Cameos, some Gala apples, and whatnot. Think of it from that respect. You definitely have categories of things, so at a high level you can say, “Okay, my retry logic might be different when I’m talking to a database. Maybe not so much different when I’m talking to a RESTful API type of thing.” It’s like having service policies that apply to each type of fruit, if you will. Further, all of these things need water, fertilizer, and sunlight, regardless of what kind of fruit it is.

There’s a lot of commonality there. There’s a lot of duplication. There are things that you can categorize and group together as well, with lots of overlap. That’s what a service mesh is addressing. You don’t have to keep reinventing the wheel, and you keep your architecture DRY.

Jon: Okay. Cool.

Chris: That’s basically the genesis for why service meshes were created. It does boil down to microservices’ architecture and the proliferation of many individual components instead of big monoliths. The point to point communication goes up, this becomes much more of an issue.

This is a good time to talk about how a service mesh works, the basic architecture. Really, the key point here is that it’s a proxy to the software component. A service mesh works this way: for each component that you want to be part of the mesh, there’s going to be a sidecar deployment of a proxy. That proxy is what gets access to the mesh. Now, instead of your software component talking directly to other software components, the only thing it talks to is the proxy. It’s the proxies that are now implementing this whole communication backbone.

Jon: Okay. So you’re not doing some of the stuff yourself, like encryption, maybe even authentication too. Stuff like that is now going to reside in the sidecars that you put everywhere.

Chris: It depends on which mesh you’re using and how much of those features you’re doing in the application. It depends on what we mean by that. If it’s mutual TLS encryption, like the certificates and identities, that can definitely be managed by your mesh.

That’s definitely something key going for––

Jon: Yeah, that helps it become clear. Now I have a single, understandable thing that gets attached to every single container in my system that I can go talk to. That container is responsible for getting information in and out of the actual running container. I get it. Yeah, that can be useful.

Chris: Yes. Your software components talk to the proxy. Then the proxy talks to the other proxies. Things like addresses (what’s the address of this thing I need to talk to?), timeouts, retries, and encryption all now live in the service mesh, as implemented by the proxy.
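The sidecar idea described here can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the behavior of any real mesh proxy: `SidecarProxy`, its `resolve` and `send` callables, and the retry defaults are hypothetical names invented for the example. The point is that the application asks for a service by name, while address lookup and retry handling live in the proxy.

```python
import time


class SidecarProxy:
    """Toy proxy that owns address lookup and retries (hypothetical)."""

    def __init__(self, resolve, send, max_attempts=3, backoff_seconds=0.0):
        self.resolve = resolve  # callable: service name -> address
        self.send = send        # callable: (address, payload) -> response
        self.max_attempts = max_attempts
        self.backoff_seconds = backoff_seconds

    def call(self, service, payload):
        """Send payload to a named service, retrying on connection errors."""
        last_error = None
        for attempt in range(1, self.max_attempts + 1):
            address = self.resolve(service)  # re-resolve on every attempt
            try:
                return self.send(address, payload)
            except ConnectionError as err:
                last_error = err
                time.sleep(self.backoff_seconds * attempt)  # linear backoff
        raise ConnectionError(
            f"{service} unreachable after {self.max_attempts} attempts"
        ) from last_error


# The application code contains no addresses and no retry loops.
def fake_send(address, payload):
    return f"{address} handled {payload}"


proxy = SidecarProxy(lambda name: "10.0.0.5:8080", fake_send)
print(proxy.call("orders-api", "GET /orders/42"))
```

Because the retry count and backoff are constructor arguments rather than application code, different policies can be applied to different destinations, which is one way to address the “apple, orange, banana” concern raised earlier.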

Now, because of that architecture, you have a proxy that’s separate from your application. You can continue to add features to it. Things like rate limiting become very easy to do. You don’t have to make any changes to your application whatsoever. You just have a policy that says, “Hey, when this component makes a request to this other component, I’m not going to let it do it more frequently than once every 60 seconds,” or something like that.
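A rate-limiting policy of the kind described (“not more frequently than once every 60 seconds”) could look roughly like the following Python sketch. The `MinIntervalLimiter` class and the component names are hypothetical, and real meshes typically express this as declarative configuration rather than code. The sketch only shows that the check is keyed on the pair of caller and callee and sits outside the application.

```python
class MinIntervalLimiter:
    """Allow at most one request per interval for each (source, destination)."""

    def __init__(self, min_interval_seconds):
        self.min_interval = min_interval_seconds
        self._last_allowed = {}  # (source, destination) -> time of last allowed call

    def allow(self, source, destination, now):
        """Return True if the request may proceed at time `now` (in seconds)."""
        key = (source, destination)
        last = self._last_allowed.get(key)
        if last is not None and now - last < self.min_interval:
            return False  # too soon since the last allowed request
        self._last_allowed[key] = now
        return True


limiter = MinIntervalLimiter(min_interval_seconds=60)
print(limiter.allow("web", "reports-api", now=0))    # first request passes
print(limiter.allow("web", "reports-api", now=30))   # rejected, only 30s later
```

Passing `now` in explicitly keeps the sketch deterministic; a proxy would normally read a monotonic clock instead.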

Jon: Cool.

Chris: What general features does a service mesh give us? It really boils down to three broad categories. First, we have traffic management, which is a lot of what we’ve talked about. This is service discovery, load balancing, failure handling, health checks, and retry logic. It’s also custom routing and things like fault injection. You can do things like A/B testing and staged rollouts. All that falls under traffic management. That’s one of the big things that a service mesh will give you.
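As a rough illustration of the custom routing behind A/B tests and staged rollouts, the Python sketch below splits traffic between two versions of a service by weight. The function name `pick_version` and the 90/10 split are assumptions made up for the example; a mesh would apply a rule like this in the proxy, driven by configuration.

```python
import random


def pick_version(weights, rng=random):
    """Choose a service version at random, in proportion to its weight.

    `weights` maps a version label to a non-negative weight,
    for example {"v1": 90, "v2": 10} for a 10% canary.
    """
    versions = list(weights)
    return rng.choices(versions, weights=[weights[v] for v in versions], k=1)[0]


# Send roughly 10% of requests to the new version during a staged rollout.
rollout = {"v1": 90, "v2": 10}
print(pick_version(rollout))
```

Shifting more traffic to the new version is then just a matter of changing the weights, with no change to the services themselves.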

A second large component of a service mesh is security. Things like encryption, authentication, and authorization can be covered by your service mesh. That includes things like mutual TLS encryption and certificate management that is completely automated. Your mesh can issue, keep track of, and rotate the certificates that are being used to do that mutual TLS.
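The “keep track of and rotate” part of automated certificate management comes down to a simple decision that the mesh makes on your behalf. The Python sketch below shows one plausible form of that decision. The function `needs_rotation` and the 80% threshold are assumptions for illustration and are not taken from the talk or from any particular mesh.

```python
from datetime import datetime, timedelta


def needs_rotation(issued_at, expires_at, now, rotate_at_fraction=0.8):
    """Return True once a certificate has used up most of its lifetime.

    Rotating before expiry (here at 80% of the lifetime, an assumed
    threshold) leaves time to issue and distribute a replacement.
    """
    lifetime = expires_at - issued_at
    return now >= issued_at + lifetime * rotate_at_fraction


issued = datetime(2019, 5, 1)
expires = issued + timedelta(days=10)
print(needs_rotation(issued, expires, now=issued + timedelta(days=9)))
```

In a mesh, a check like this would run continuously for every workload certificate, which is exactly the kind of chore that is tedious to build into each service by hand.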

Jon: Now it’s becoming just so obvious to me. I think that for a lot of people that are new to this and have been around monoliths, this is stuff that monoliths provide for free. Your Rails framework or your Django framework is going to give you all this stuff, and then you just have to implement features on top of it. Now that everything’s distributed and very small, you still need all those features. You still need all of this stuff that your big monolith framework got you. Now you need it in a really distributed way. It feels like that’s a big part of what a service mesh is giving you back in this hugely distributed world.

Chris: Yeah, maybe. I think part of that too is just the fact that things like microservices architecture have now created these problems. If it’s a monolith and it’s just talking to a database, the idea of creating X.509 certs to do mutual TLS encryption over the wire, and to do auto-rotation of those credentials, is not so much of a deal. Neither are load balancing and custom routing. These become new issues that we have to deal with because we have a microservices architecture.

In that sense, communication is much more complicated in a microservices architecture than it is with a monolith. Service meshes are trying to help alleviate some of that pain.

Rich: Hey, there. This is Rich. Please pardon this quick interruption. We recently passed an internal milestone of 30,000 listeners. I wanted to take a moment to thank you for the support. I was also hoping to encourage you to head on over to iTunes to leave us a review. Positive feedback and constructive criticism are both incredibly important to us. Give us an idea of how we’re doing and we promise to keep publishing new episodes every week. Okay, let’s dive back in.

Chris: The third major category of service mesh features you can expect is observability. These are things like monitoring, logging, distributed tracing, and visualization. Your mesh can give you that. You can see all the communication that’s happening inside your system and where things might be slow or having problems, for troubleshooting and debugging. Distributed tracing becomes a really key problem to solve in a microservices architecture because it’s no longer just, “Oh, a request came into this service. What happened?” Instead, you have to follow the thread of the path that actually went from service A to B to C to D. You need to take all of that into account. That’s distributed tracing. That’s something your service mesh will give you as well.
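The core mechanic of distributed tracing, following one request from service A to B to C to D, is that every hop forwards the same trace identifier. The Python sketch below illustrates this under simplifying assumptions: the header name `x-trace-id`, the `handle` function, and the in-memory `spans` list are hypothetical stand-ins for what a mesh proxy and a tracing backend would do.

```python
import uuid


def handle(service, downstream, headers, spans):
    """Record a span for `service`, then call the next service in the chain.

    The trace ID is taken from the incoming headers if present, otherwise
    a new one is created, so every hop of one request shares the same ID.
    """
    trace_id = headers.get("x-trace-id") or uuid.uuid4().hex
    spans.append((trace_id, service))
    if downstream:
        next_service, rest = downstream[0], downstream[1:]
        # Forward the same trace ID to the next hop.
        handle(next_service, rest, {"x-trace-id": trace_id}, spans)
    return trace_id


spans = []
trace_id = handle("A", ["B", "C", "D"], headers={}, spans=spans)
for recorded_id, service in spans:
    print(service, recorded_id == trace_id)
```

Grouping the recorded spans by trace ID is what lets a tracing tool reconstruct the full A to B to C to D path for a single request.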

Jon: Okay, cool. I feel like I get that. I also feel like I already have an inkling of where this is going and whether we need this or not. My preview of where it’s going, it’s like what is a harder problem for you right now? Dealing with all of your services and all of the things they need to operate them and for them to communicate with one another, is that harder for you?

Chris: I think it all boils down to how much pain do you have with this communication between your microservices. If you have enough microservices where it’s painful, then you need a mesh. But if you really aren’t feeling the pain, like these features that we’re calling out are not really speaking to you, then you probably don’t need it.

Jon: Right. That’s what I was getting at. You don’t need it if you’re just deploying a few services. Maybe you’re not deploying them more than a couple of times a week at most and you have good CI/CD that deploys them automatically. It’s not hard for them to find each other because there are only a few of them. They all have their own rules for retry and failure. Monitoring them is not that hard because you can just hook each of them up to a monitoring system, a server application monitoring system, and see them all in one dashboard at one time. Then it feels like a service mesh is just adding another thing that you have to care for and feed. It’s another component you have to watch out for and make sure it’s running, etcetera, etcetera.

Chris: Absolutely.

Jon: It’s an application. You have to deploy it into your cluster and make sure that it’s got enough bandwidth to take requests. That initial proxy has to be up and running. Otherwise, every one of your services is down.

Chris: A service mesh is not free. It definitely has its own layer of complication. For example, Istio, which we referred to a little bit earlier, is probably the most popular, most robust, and most mature service mesh out there. Just doing a standard installation of […] on a Kubernetes cluster (Elton pointed this out during the talk) spins up 59 containers for just Istio, for just the service mesh. That’s before you start deploying any of your apps.

Jon: That’s amazing.

Chris: The other thing on that point is that it’s actually around 2 million lines of code. There’s a lot there. There’s a lot of functionality that you’re getting, but it’s not free. And it’s not just that; you also have to think about: How do I use this? How do I maintain it? How do I configure it? How do I run it? All that stuff. There’s a learning curve to that.

We’re kind of getting into when you should use a service mesh, and you should definitely consider the cost of the mesh. We’ve touched a little bit on when you should use it: if you’ve got a lot of services where, again, this is causing pain and things like service discovery are a problem. Or you want to do things like rate limiting, or more detailed, fine-grained routing, or you want to make it easier to set a policy for security reasons, an “only these services are going to talk to these services” type of thing. That’s when you want to start thinking about a service mesh.

If you have snowflake implementations, where you find you’re doing lather, rinse, repeat on building some of this functionality into each one of your apps or microservices, that’s another case where you should look at a service mesh as a way of getting out of reinventing the wheel.

You also might consider a service mesh if you’re doing an application pilot from scratch. With a greenfield, cloud-native application, you might consider this as a way of piloting it and getting some familiarity with the service mesh. If you have a handful of microservices, you probably don’t need a service mesh.

Jon: Yeah. I may just have to do a hard disagree there. Unless you’re sure that you’re going to go from zero to 60 very, very quickly, it sure feels like spending a limited development project’s time playing with a service mesh while also trying to invent something brand new is a little wrongheaded. Don’t use that as a time to go do something that’s also super hard, on top of developing a brand new app that’s never existed before.

Chris: Yeah. I think the point that Elton was making here was just that if you’re going to use it, don’t do it on existing applications, where you have to go and change code and whatnot. Do it with something you start from scratch.

Jon: It’s a very tricky problem. When you start from scratch, unless you know right out of the gate that, “Okay, we’re going to start from scratch. This thing is going to have 25 microservices that all need to talk to each other. It’s going to have millions of users and followers.” If you know that from day one and you know that you’re going to need that load balancing capability that’s super flexible, route management, all the things you get from a service mesh.

Maybe you’re a telco and you’re building a new thing. You know that’s what’s going to happen. People are going to flood in and you need that. Okay, sure, that’s probably why you do it. Otherwise, if you’re like, “I don’t know if people are going to use this or not. We’re not sure if this is going to be a big thing.” Maybe you’re a startup or you’re building a new thing within an enterprise, you don’t know how people are going to react to it, and you know it’ll start small. Then it feels like you shouldn’t start there. You should leave it out in order to maximize the number of features you can get per developer hour.

Then, you end up in a situation later where you need it and it seems like the most important feature that your developers out there could think about as they’re developing and adding features to this is like people need to be able to add this to existing architecture. Let’s think really hard about that because it’s architectures that grow. They are the ones that suffer from service tanglement that makes you want to have a service mesh.

Doesn’t it seem like that would be the case? Like, “Oh, crap! Everything’s falling over. We have service tanglement. We don’t know what to do. I’ve heard Istio solved this problem, let’s go add it.” Wouldn’t that be the most normal way into it?

Chris: Yeah. With everything else, it’s like the architecture’s big dents. This is an architecture change; there’s no silver bullet. It’s the same kind of issue we’ve been dealing with over the last five, six, seven years with big-ball-of-mud monoliths and decomposing them into microservices. How do you do that without grinding everything to a halt? It’s the same thing here.

Jon: I just want to push on this more because it doesn’t feel like it should be the same thing. You’ve already got microservices. You decided at the beginning that you’re going to do microservices. The nice thing about microservices is that they are all independently deployable and kind of have the same surface area. From the outside, at least, they all look like little black boxes that are running in containers in a little cluster, maybe. For the most part, they’re not like proteins. They’re pretty similar looking. It seems like the whole point is that it should be possible to take those things and put them into a management system that proxies them. Then they’re like, “Hey, I’m so cool. You’re proxying to me? No problem, cool. I was fine without being proxied too. I’m still fine now that I’m proxied.”

Chris: The level of effort that’s going to take is going to be directly commensurate with what you’ve put into your application to begin with. With all these things the service mesh gives you, like service discovery, mutual TLS, retry logic, and routing, it really depends on your existing architecture, how you’re dealing with that today, and whether you now have code changes. Take even something as simple as service discovery: maybe now you’re using DNS and you need to change. That’s a code change right there. Maybe you’ve implemented TLS encryption between you and other microservices. Now you want to take advantage of mutual TLS and have your service mesh handle the certificates; that’s a code change. The same thing goes if you’ve done quotas or rate limiting, retry logic, all that.

Jon: It’s hard to name a problem, though, because now every single time we start a new project, we’re going to play a game of Russian roulette. Are we going to kill the project because we lost all our velocity from having to implement a service mesh from day one? Or are we going to get a bunch of velocity, get a bunch of users, and then fall over when we can’t apply a service mesh when we desperately need it? It’s like you get to choose which death you would rather have.

Chris: The good news here is that the people that need a service mesh are Netflix. It’s kind of at that scale. For most people, most organizations, I just don’t see them getting complicated enough that it’s like, “Oh, if we don’t have a service mesh we’re going to fail.” You can implement bits and pieces of this without a mesh. There are service mesh alternatives. We talked about the three main things being traffic management, security, and observability. There are remediations for all that stuff outside the service mesh. You can take on as little or as much as you want. You can do things like feature toggles, or you can have dynamic DNS for traffic management. We heavily use load balancers with path-based routing, Application Load Balancers on AWS. We can easily do things like A/B testing if we want, or blue/green deployments.
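Path-based routing, one of the non-mesh alternatives mentioned here, can be illustrated with a small Python sketch. This is not how AWS Application Load Balancer rules are evaluated (those are priority-ordered listener rules); it is a simplified longest-prefix match, and the route table and service names are hypothetical.

```python
# Hypothetical route table: URL path prefix -> backend service name.
ROUTES = {
    "/api/orders": "orders-service",
    "/api": "api-gateway",
    "/": "web-frontend",
}


def route(path, routes=ROUTES):
    """Return the backend whose prefix is the longest match for `path`.

    A prefix matches only on whole path segments, so "/api" matches
    "/api/users" but not "/apiary".
    """
    matches = [
        prefix
        for prefix in routes
        if path == prefix or path.startswith(prefix.rstrip("/") + "/")
    ]
    if not matches:
        raise LookupError(f"no route for {path!r}")
    return routes[max(matches, key=len)]


print(route("/api/orders/42"))
print(route("/about"))
```

Pointing a prefix at a different backend is how this approach supports things like blue/green deployments: the switch happens in the routing table, not in the services.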

The same thing goes for observability. You can go implement tools, or even use hosted services like Datadog or SignalFx. There are just millions of those kinds of services out there. There are good open source tools like Prometheus, etc. There are answers to these problems, and it’s not all or nothing.

Again, unless you really are going to have a very big engineering team or you’re creating many, many microservices, you’re just not going to get to that scale where it’s like, “Oh, I need this to go.”

Jon: Right. Really interesting. It’s especially interesting because as you’re saying that, part of me is thinking: you’re spending time mucking around with routing logic on your load balancers, you’ve got special DNS and specialized lines of code that deal with DNS lookups, you’ve got future […] that you’re hand coding, you’ve got all the other things, and you’re also configuring your application monitoring. That feels like a lot of individual pieces of work. It would be interesting to think about it from that perspective a little bit: Istio is a big beast, but it is a centralized beast that probably has a fairly consistent look and feel when you’re playing with it. If it’s replacing a bunch of features that you’d have to go to umpteen different places to put together on your own, then maybe it has a super huge learning curve once, but once you’ve done it, it’s like, whoa, now all of this is so much easier and I would use this on every project no matter how small.

Chris: Yeah. I’m sure you definitely have people that feel that way and do it. You can totally decide to make that investment and go through that learning curve until you get really, really good at it. It’s a very solid piece of infrastructure that’s very, very powerful. It’s kind of like Kubernetes. Kubernetes is 3.1 million lines of code. Kubernetes is very, very big. It takes a long time to become an expert in Kubernetes.

Kind of an interesting point: these kinds of things are kind of anti-DevOps. Things like Istio and Kubernetes are becoming so complicated that developers are not going to touch them. They don’t use them. They don’t understand them. That’s purely for the ops team.

Jon: Yeah.

Chris: It’s kind of creating that wall again because these tools are becoming so sophisticated and complicated.

Jon: I would fully agree. That’s a great point.

Chris: That’s probably another good rule of thumb on whether or not you should be doubling down on these things. Do you have an ops team? Do you want to have an ops team? That’s probably the route you’re going down.

Jon: Very cool. I think the second episode is going to be a bit deeper dive on how some of these work. I think that’s probably good enough where we’ve talked about sort of all what it is from a high level architecture point of view and whether we should use it or not. Then, we’ll be back next week and talk about this a little more deeply.

Chris: Alright. Sounds good.

Jon: Thanks, Chris.

Chris: Thanks, Jon.

Rich: Well dear listener, you made it to the end. We appreciate your time and invite you to continue the conversation with us online. This episode, along with show notes and other valuable resources is available at mobycast.fm/61. If you have any questions or additional insights, we encourage you to leave us a comment there. Thank you and we’ll see you again next week.
