Just as the Twelve-Factor App methodology was born from real world experience of deploying successful apps at Heroku, architects at AWS created the Well-Architected Framework to document best practices they observed when running workloads in the cloud.
Originally announced as a whitepaper in October 2015, the Well-Architected Framework got center stage treatment at re:Invent 2016 during Werner Vogels’ keynote address. Since then, it has evolved to become an indispensable resource when building and running workloads in AWS.
But the Well-Architected Framework is massive, consisting of 6 core whitepapers that total over 400 pages. It would be easy to dismiss it as just another boring set of documents. But doing so would be a big mistake. There is a lot of gold to be found if you are willing to do some digging.
In this episode of Mobycast, Jon and Chris kick off a three-part series where we grab our shovels and dig deep into the Well-Architected Framework and explain how you can best take advantage of this important resource.
In this episode, we cover the following topics:
- AWS Well-Architected Framework
- Provides consistent approach to evaluating systems against cloud best practices
- Helps advise changes necessary to make specific architecture align with best practices
- Comprised of 3 components:
- General design principles
- Five pillars: Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization
- The review (a series of questions for each pillar)
- General design principles
- Cloud-native has changed everything. In cloud, you can:
- Stop guessing capacity needs
- Test at scale
- Automate all the things to make experimentation easier
- Allow for evolutionary architectures (you are never stuck with a particular technology)
- Drive architectures using data (allows you to make fact based decisions on how to improve your workload)
- Improve through game days
- Pillars in depth
- Operational Excellence
- “Ability to run and monitor systems to deliver business value and to continuously improve supporting processes and procedures”
- Design principles
- Perform operations as code
- Annotate documentation
- Make frequent, small, reversible changes
- Refine operations procedures frequently
- Anticipate failure
- Learn from all operational failures
- Key service: CloudFormation
- Focus areas
- Services: AWS Config, AWS Config Rules
- Services: CloudWatch, X-Ray, CloudTrail, VPC Flow Logs
- Services: Elasticsearch (for searching log data to gain insights), CloudWatch Insights
- Best practices
- Implement telemetry for:
- User activity
- Implement transaction traceability
- Any event for which you raise an alert should have associated runbook
- Runbook defines triggers for escalations
- Users should be notified when system is impacted
- Communicate status through dashboards
- Provide dashboards to communicate the current operating status of the business and provide metrics of interest
- Feedback loops
- Identify areas for improvement
- Gauge impact of changes to the system (i.e. did it make an improvement?)
- Perform operations metrics reviews
- Retrospective analysis of operations metrics
- Use these reviews to identify opportunities for improvement, potential courses of action, and share lessons learned
- Key points
- Runbooks, playbooks
- Document environments
- Make small changes through automation
- Monitor workload with business metrics
- Exercise your response to failures
- Have well-defined escalation management
- In future episodes, we’ll cover the remaining 4 pillars
- AWS Well-Architected Framework – Online/HTML version
- includes drill down pages for each review question, with recommended action items to address that issue
- Are You Well-Architected?
- AWS re:Invent 2016 Keynote: Werner Vogels
- See: 25:45 through 31:25
- AWS Service Health Dashboard
- AWS Personal Health Dashboard
- AWS Well-Architected Framework
- Operational Excellence Pillar
- Security Pillar
- Reliability Pillar
- Performance-Efficiency Pillar
- Cost Optimization Pillar
For a full transcription of this episode, please visit the episode webpage.
Voiceover: Just like Heroku did with the Twelve-Factor App, AWS created the Well-Architected Framework to document the best practices they observed running workloads in the cloud. The Well-Architected Framework started as a whitepaper in 2015, but went to center stage during Werner Vogels’ keynote at re:Invent 2016. As of today, it is massive, consisting of six core whitepapers that total over 400 pages. It would be easy to dismiss it as boring, long-winded documentation, but doing so would be a big mistake. There is a lot of gold to be found, if you are willing to dig. In this episode, Jon and Chris kick off a three-part series. They grab their shovels and dig deep into the Well-Architected Framework. Welcome to Mobycast, a show about techniques and technologies used by the best cloud-native software teams.
Each week, your hosts Jon Christensen and Chris Hickman pick a software concept and dive deep to figure it out.
Jon: Welcome, Chris. It’s another episode of Mobycast.
Chris: Hey, Jon, good to be back.
Jon: Yeah, good to have you back. We’re missing Rich today. We’ll do our best to produce this show on our own, but we have a lot to cover. We’re going to go over the AWS Well-Architected Framework. This might take a couple of episodes, just because it’s got five pillars and each one of those is so huge, holding up so much information, and it’s all really good stuff, so we can’t do our normal pleasantries. We have to go right into it. Chris, do you want to kick us off?
Chris: Yeah, absolutely. Last episode we were talking about software architecture through the lens of the Twelve-Factor App, and so obviously there were 12 factors that we went through. Today, we’re going to come at it from a different approach: AWS has their Well-Architected Framework, which is all about answering the question: Are you well architected? It’s a different lens for looking at your architecture. As you mentioned, core to that, there are five pillars, basically subject areas to focus in on for your architecture. With the Twelve-Factor App, obviously, we had 12 things to talk about, so it might seem like there’s more to cover there. The truth is that’s absolutely not the case at all. The Well-Architected Framework is massive in its scale. It’s very, very comprehensive.
Each one of those pillars can go really deep with many design principles and focus areas, best practices. There’s very much just a wealth of information there. Sometimes it’s kind of hard to unpack, and that’s what we’re going to try to do with this episode.
Jon: Yeah. Right, Chris. I’m sorry to interrupt, but I was going to say that another difference for me is just that the Twelve-Factor App really feels technical. It just feels like as a software developer these are a bunch of things that you can think about when you’re writing an application, and the Well-Architected Framework, I feel like, and this could be wrong, but I feel like it’s easier, and we’ll see this. It’s easier to draw the lines between the pillars and the business needs, so I do feel like this sort of steps up a level and keeps the whole purpose of software in mind, and how much it costs and what other things could happen that might impact your business when you’re talking about the Well-Architected Framework, and it’s not just purely technical for technology’s sake.
Chris: Right, absolutely. Yeah. The Twelve-Factor App is really good for what it’s designed for, and it really is really focused on an application, and the pragmatic things that go into building, and deploying, and operating an application, especially on a PaaS platform like Heroku.
Jon: Sure, yeah.
Chris: So it’s not looking at things like, “Hey, I’m going to optimize the… you know, how does this affect cost? Am I being efficient with how I’m spending money? How much is it going to cost to run this and operate it? Do I have the right people to do it? Am I meeting the needs from the business and what about compliance issues and whatnot?” The Well-Architected Framework, just much more comprehensive. It completely subsumes the Twelve-Factor App and then some. Right?
Jon: Yep. You probably better say what it is or we could spend the entire episode talking about how it’s different than the Twelve-Factor App without ever actually saying what it is. Let’s say what it is.
Chris: The Well-Architected Framework is a documented set of foundational questions to help determine if a specific architecture aligns with what are deemed cloud best practices. What it’s doing is it’s providing a consistent approach to evaluating your architectures against those best practices, and then also it’s going to help advise what changes might be necessary to bring that into alignment. As part of this framework, there are really three main things that it’s comprised of. One is general design principles. Two is the pillars. There are five pillars that hold up this framework. The third component would be the actual review.
This is a series of questions that you go through that are related to each one of these pillars and subsections and focus areas inside those pillars, and basically you go through and ask yourself these questions, and that kind of gives rise to, “Okay. What changes do I need to make?” It’s those three components that go into it. There are supplements to this as well that we’ll get into later. The Well-Architected Framework is documented via whitepapers, and there’s also HTML web page versions of that as well. There is something that’s new-ish is the Well-Architected tool, which is actually pretty useful, and really helps you do this review process and keep track of it and whatnot, and so we’ll get into it as well as we go through this.
Jon: That sounds great, and I guess as we get into it I’ll say from my perspective that you’ll be teaching me. I know in general some of these principles, and I’ve certainly been around it enough that through osmosis I’ve picked up a lot of it, but it’ll be interesting to get a little deeper into it, especially because what I’ll tell you is that my first impression of it is a vomit of information that is not that useful for somebody that already knows what they’re doing. Get away from me. There’s a little bit of that in my mind when I see this amount of documentation thrown out there. It’s like, “Come on. Get over yourselves.” Right? Like, a little bit of me says that.
But I think what’s starting to happen is that you looked into it this week more deeply in order to prepare for this, and the surprise might be, “Oh, there’s some real gems in here, and some things that if you just follow this you’re unlikely to go wrong.” It could be the map to do your job as a software architect. Just follow this map to the treasure. That’s all you have to do. It’s just that the map is like a thousand pages.
Chris: Yeah, and let’s kind of talk about it. There is a wealth of information here. There’s so much to wade through, and it really is kind of difficult to unpack it in some regards, but it is a treasure trove. AWS has spent a lot of time on this. I think they first rolled out the Well-Architected Framework in 2015. This is something that they are continually updating. It’s based upon what they internally do at AWS and Amazon, how they’re actually building things. This is very practical, real world. I mean, none of this is theoretical. It’s not academic or-
Jon: Or ivory tower or any of that.
Chris: No, absolutely.
Jon: Do you get the sense that it’s like kind of Werner Vogels’ baby, if you will? This is kind of the Office of the CTO, if there was one thing that they produced, it’s kind of this?
Chris: I don’t know if I would say it’s Werner’s baby because I kind of see Werner more as like technology and vision, and this is much more along the lines of your VP of engineering-type thing. Right? So that said, he was the one that did unveil it and whatnot, but I would think that his interest aligned more with things like DynamoDB, and technology in general, and things like bringing blockchain to the cloud, and whatnot, as opposed to like, “How do you actually build, deploy, operate systems? What are all the checkboxes that you need to do something like that?” Obviously, he’s interested, but I don’t think that that would be necessarily like his passion.
Chris: If Werner is listening, he can let us know.
Jon: Yeah, let us know, Werner. Go ahead.
Jon: I think we’re going to talk about general design principles.
Chris: Right. Yeah. Really kind of like the overall philosophy of this is just it’s this Well-Architected Framework is saying, “Look. What you did in the past for like on-prem-type applications, everything has changed. Like cloud native has changed everything.” Right? So there is some just core general design principles here that you need to just really embrace, if you are going to be cloud native. So things like you no longer have to guess capacity needs.
Jon: I want to stop you there because who knows who our audience is. Podcasting is the worst for figuring out who actually listens to you. If you listen to this, please let us know who you are, so we can give you the right information that you want, but I am guessing that not that many people that listen to this have ever had to guess capacity needs. That’s like year 2000 or 2005 stuff. That’s over 10 years ago that people were doing that. So I don’t know. It’s just like that doesn’t feel like a relevant point anymore, unless you’re talking to old, slow-moving enterprises, and then it’s almost like a sales pitch. Like, “Stop guessing capacity needs.”
Chris: I think we’re so close to this that it may feel that way, but I think real world this is still something to be thinking about. There are still a lot of people that are on-prem, that are not in the cloud yet, or they’re just working to go hybrid. There is a lot of code out there that’s still on-prem. It’s hard to believe. Right?
Chris: I mean, it really is the case.
Jon: Like call up the [inaudible 00:11:40] a bunch of servers or something, yeah.
Chris: Even when people do move to the cloud, and that’s part of what this general design principle is trying to get at is that when people do move from on-prem to the cloud, they kind of use the same models. Right? They’re just thinking about, “Okay. I don’t have a server on a rack in my on-premise datacenter anymore, so I spin up a server in the cloud in EC2, and I’m just now managing that.” Right?
Jon: Right. So in the interest of guessing capacity, how many EC2s am I going to need?
Chris: Absolutely, right? Who knows? I mean, are you really taking advantage of auto scaling? Are you wiring that in with your scale up and scale down policies? Right?
Jon: Mm-hmm (affirmative).
Chris: So that’s what this is all about is just saying like, “Stop guessing your capacity needs.” This is a game changer in the cloud, now. Right?
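To make the scale-up and scale-down policies Chris mentions a little more concrete, here is a minimal sketch of the proportional rule that a target-tracking style policy roughly follows. It is a simplification for illustration, not AWS's actual algorithm; the function name, the CPU figures, and the instance limits are all assumptions.

```python
import math


def desired_capacity(current_instances, current_metric, target_metric,
                     min_instances=1, max_instances=20):
    """Return how many instances would bring the metric back near its target.

    Simplified proportional rule (an assumption, not AWS's exact logic):
    if the fleet averages twice the target utilization, it needs roughly
    twice the instances. The result is clamped to the configured limits.
    """
    if current_instances < 1 or target_metric <= 0:
        raise ValueError("need at least one instance and a positive target")
    raw = math.ceil(current_instances * current_metric / target_metric)
    return max(min_instances, min(max_instances, raw))


# Traffic spike: 4 instances averaging 90% CPU against a 50% target.
print(desired_capacity(4, 90.0, 50.0))  # scales out

# Quiet period: 8 instances averaging 10% CPU against a 50% target.
print(desired_capacity(8, 10.0, 50.0))  # scales in
```

The point of the sketch is that capacity becomes a function of a measured metric rather than an up-front guess.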
Chris: You need to embrace that. Going through some more general design principles, another one would be you can now test at scale. Before, that was very difficult because you had to buy a bunch of capacity in order to do that. Now, it’s like you can just spin up infrastructure on a moment’s notice, as much as you need, and then when you’re done with it, tear it all down. So testing at scale is just totally something you can do, now. Another general design principle is just to automate all the things to make experimentation easier, so you can use infrastructure as code in the form of things like CloudFormation to rapidly spin up these environments, and the infrastructure you need to go do experiments and tests. Again, like the cloud has enabled this. This is stuff that in the on-prem world is much more difficult to do.
Jon: I don’t know. For those on the inside, if you’re listening and you’ve been doing AWS for a long time, and you’ve played with CloudFormation, that one might ring a little hollow. Like, “Yeah. Experimentation didn’t actually get any easier because I now am spending all of my time maintaining CloudFormation templates that I’m waiting for them to spin up and spin down, and I thought I was going to be able to do experiments.” There’s a little bit of like [inaudible 00:13:45] from that because CloudFormation is very time consuming to [inaudible 00:13:49]. Yes. Once you have it, okay. Now, I can spin up an environment easily [inaudible 00:13:56], but you pay for that. You pay in a lot of tinkering and waiting for environments to spin up just a little bit wrong, and then, “Oh, man. Now, I’ve got to try that again.”
You go, “Oh, it’s still a little bit wrong.” Now, you just lost half a day in those two statements. Anyway, it’s something to try and push back on with AWS and all their magic: that’s not cheap.
Chris: Yeah. Part of that is just how far along you are in maturity on adopting those kind of things. So it’s all a journey, and for some people they’re doing it in bits and pieces or they’re just starting the journey, and then there are folks out there that really have embraced this, and been doing it for years, and for them it really is very easy for them to spin this stuff up. They’re not doing a lot of that troubleshooting.
Chris: So maybe just the important point here is that just now you have the opportunity to leverage that. Actually getting there is maybe hard, but it’s there now. Right?
Chris: Some other design principles: drive architectures using data, so you can now make fact-based decisions on how to improve your workload. I’m not so sure that that’s necessarily specific to the cloud versus on-prem, but definitely something that we want to do in a modern app, right?
Chris: Is to have that feedback loop where we’re collecting the telemetry. We’re looking at the metrics, and that is informing us on what needs to be changed. How do we make improvements and then be able to measure the effects of those changes to make sure that it truly is an improvement. Right?
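The feedback loop described here (collect telemetry, make a change, measure whether it helped) can be sketched in a few lines. This is an illustrative example only: the latency samples, the choice of the 95th percentile, and the 5% improvement threshold are all assumptions.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def change_improved(before_ms, after_ms, pct=95, min_gain=0.05):
    """True if the chosen percentile latency dropped by at least min_gain."""
    before = percentile(before_ms, pct)
    after = percentile(after_ms, pct)
    return after <= before * (1 - min_gain)


# Hypothetical request latencies (ms) sampled before and after a change.
before = [120, 130, 125, 400, 118, 122, 135, 128, 390, 121]
after = [110, 115, 112, 180, 109, 111, 119, 114, 175, 113]

print(percentile(before, 95), percentile(after, 95))
print(change_improved(before, after))
```

Comparing a tail percentile rather than an average is a design choice: slow outliers are usually what users notice, and an average can hide them.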
Chris: With the cloud we definitely want to allow for evolutionary architectures, so you never want to be stuck with a particular technology, and so this is… It’s still a challenge, but I think definitely in cloud it becomes a bit easier to do because you have a great toolbox to work from. You can get these services and options and features as managed services, so you don’t have to go and stand up a box and install software, and run it, and whatnot. If you want to try a new NoSQL technology, it becomes much easier. If you want caching or whatnot, or backup, or multiple networks, and peering and whatnot, all that stuff is much easier to do in the cloud as opposed to on-prem.
Jon: Right, yeah.
Chris: Lastly, another design principle is to improve through game days, which not necessarily cloud specific, but I think the cloud makes this a lot easier to do.
Jon: Can you help me know what a game day is?
Chris: Yeah. So it’s kind of interesting. Game day is like the… It’s kind of like misnamed in a way because it’s like it almost feels like it should be called scrimmage because what you’re doing. So a game day is when you are purposely setting aside time to go and simulate things that can happen that can go wrong, and then how you deal with them. So it’s like let’s have a game day where we test what happens if a whole AZ goes down, and what happens there or let’s have a game day where we’re just going to go and make sure that we can. Let’s simulate a database failure where we have to restore from a backup. Right?
Jon: Got it.
Chris: Just do that. So it’s going through and practicing these things, practicing responding to time-critical failures and other exceptional events, so that you make sure you can do it, and not find out that, “Hey, we really didn’t have this buttoned down,” when it happens in real life. Right?
Chris: It’s practicing those things that we normally don’t do, and those are game days.
Jon: Got it.
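A real game day exercises the live system and the people running it, but a quick capacity check before the exercise can show which failure scenarios are worth rehearsing. The sketch below is only that kind of back-of-the-envelope check for the "a whole AZ goes down" scenario; the AZ names, instance counts, and request figures are hypothetical.

```python
def survives_az_loss(instances_per_az, requests_per_instance, peak_requests):
    """For each AZ, report whether the remaining AZs could carry peak load alone."""
    results = {}
    for failed_az in instances_per_az:
        remaining = sum(count for az, count in instances_per_az.items()
                        if az != failed_az)
        results[failed_az] = remaining * requests_per_instance >= peak_requests
    return results


# Hypothetical fleet: instance counts per availability zone.
fleet = {"az-a": 4, "az-b": 4, "az-c": 2}
report = survives_az_loss(fleet, requests_per_instance=100, peak_requests=700)

for az, ok in report.items():
    print(f"lose {az}: {'OK' if ok else 'UNDER CAPACITY'}")
```

In this made-up fleet, losing either of the two larger zones leaves too little capacity for peak load, which is exactly the kind of finding a game day is meant to surface before it happens for real.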
Jon: It sounds like [inaudible 00:18:19], honestly. That sounds so fun and great, but it’s also a whole day of your team not doing anything, not building features or not dealing with customer issues or whatever it is, like a whole day of just like imagining a Black Swan event happening, and training to deal with it. It’s so important but, at the same time, it’s also like a luxury for a lot of companies, I think.
Chris: So this is a good point, and this is something definitely to kind of call out as we go through this Well-Architected Framework. This is a very comprehensive framework and this is geared towards the most sophisticated systems and workloads that you could run for budgets very big, and for systems that can handle millions of requests, billions of requests, millions of users-type thing down to the other end of the spectrum. As we talk about this and go through it, this is where some of the complexity comes in is trying to figure out, “Okay. What’s really applicable to me?” Yeah. There’s these five pillars, and each one of these pillars has its focus areas, and has its best practices. As we go through the review, there’s going to be questions that ask us about this or that.
You have to look at your own situation and say, “You know what? That’s not something that we’re going to do.” From a pragmatic standpoint, we don’t need to survive like a region failure. It’s just not something that we’re going to invest the time and money on. Likewise, it may be like-
Jon: It’s like, “Why? Why would we need to survive that?” It’s because if 90% of the services in this region go down, and they’re just not available, our little company is going to be the least of people’s worries in a lot of cases. Right?
Chris: Mm-hmm (affirmative).
Jon: Everything else they do is also going to stop working, so it’s not something we need to worry about.
Chris: It’s very much like a cost issue so it becomes… I mean, to do all of this stuff becomes very expensive, both from an actual what my bill from AWS looks like versus just the amount of time and resources that goes into building and maintaining, and operating these things. Right?
Jon: Mm-hmm (affirmative).
Chris: That’s the challenge for all of us as we look through this lens of the Well-Architected Framework is: What’s the right level of architecturedness for you?
Jon: That’s a hard thing that they left as an exercise for the reader, though. Don’t you think? Wouldn’t it be nice if it wasn’t just like, “Here’s a perfect architecture,” and like, “Good luck getting it as close to this as you can”? Instead, something like, “Here are ways of identifying how to make these trade-offs.” Because these trade-offs are the… There’s kind of magic in these trade-offs that comes from years of experience, knowing, “Don’t chase that down. It’s really not likely to happen, so let’s spend our money building this new thing instead of worrying about that potential thing.”
That ends up being the magic of being a software architect or a leader or an engineer, and I feel like this helps you not miss anything that you should be thinking of, but it doesn’t help you make the decisions necessarily. It doesn’t help you know if it’s okay that… and maybe it does, and we just haven’t got to that part, yet.
Chris: No. I think, for the most part, it’s not even addressed, but that said, I think this concept of if you can imagine like a slider, and going from the very far left. It’s like this is the minimal amount of work to do for like, “Okay. I’m not going to [crosstalk 00:22:28]”-
Jon: It’s running!
Chris: … “to be that.” Yeah. It’s serving some requests. Right? Maybe now, maybe not. Versus on the far right, like this is completely buttoned up. This is you’re doing game days. You can support region failures, all that kind of good stuff. You can imagine that slider being able to drag it to whatever your threshold is, and then it would pick out the relevant parts of the Well-Architected Framework, and the relevant questions, and the relevant action items commensurate with that. There is definitely some opportunity there to do it. That seems really straightforward to do, and maybe they will do it.
Jon: Right, but if they don’t, guess who is going to do it. We are. We’re going to do that. We’re going to help you figure this out. Stick with us. We are going to help you learn that stuff you need to know to be able to make these decisions.
Jon: We’ll stand in AWS’s stead for a bit, but in the meantime, let’s figure out what AWS wants you to do. Let’s keep going.
Chris: Right. All right, so like I said, the three components: design principles, the pillars, and then the questions and answers, the review itself. So we went through the design principles, so the next step’s the pillars. As I mentioned, there are five pillars. These are big, comprehensive areas of focus. Maybe just quickly running through the five pillars, so you have operational excellence. You have security. You have reliability. You have performance efficiency, and you have cost optimization. Right? So those are the five broad areas, and everything that goes into being a Well-Architected workload will fall into one or more of those pillars.
Jon: You just used the word workload. I haven’t really used that in my career until I started doing AWS stuff. Is that kind of an AWSie term or is it really a more broad term or a more recent term? I don’t know why it is, but I just haven’t said workloads until I started a bunch of AWS stuff. You?
Chris: No. I think it is kind of an AWS-preferred term. They’re using it in the context of like, “What are the components to provide some business value?” Right?
Chris: So it’s the architect-
Jon: Workload [crosstalk 00:24:54] stuff, yeah. Or it might be computing something. Yeah.
Chris: Yeah. So you need something somewhere to describe it. We might say like, “Web application,” or something like that.
Jon: Yeah. I guess that’s why I started using it is because I was like, “Oh, that’s a handy term. It makes sense. It’s clear what it means, and I didn’t have a word for that before. Now I do.”
Chris: Yep. So moving on to the first pillar, operational excellence. So this one’s all about the ability to run and monitor systems to deliver business value and to continuously improve your supporting processes and procedures. So pretty straightforward here, right? This is like, “Okay. You have your workload, and you want to run and monitor it, and you want to get better as time goes on.” So that’s about operational excellence, and this is a big one, too, because there’s a lot that goes into it. I kind of touched on this before. This is actually one of those things that really separates the medium-performing teams from the really top-notch performing teams. I mean, it’s pretty straightforward to build an application, deploy it, stand it up, and start taking requests.
It’s another thing to do that at scale, and to do it reliably, and everything else that goes into that. A lot of times, I think the real work starts once you go into prod, right?
Chris: Once you go to production, it’s like, “Wait. There’s a lot of stuff here that goes on,” especially if you have some requirements where you have to have some sort of uptime, whether it be three nines or four nines, whatever it may be. So that’s what this pillar is all about.
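To put numbers on "three nines or four nines": the arithmetic below converts an availability percentage into the downtime it allows. Over a 365-day year, three nines (99.9%) leaves roughly 8.8 hours of downtime, while four nines (99.99%) leaves under an hour. The function name is just for illustration.

```python
def allowed_downtime_minutes(availability_pct, period_days=365):
    """Minutes of downtime permitted over the period at a given availability."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


for label, pct in [("three nines", 99.9), ("four nines", 99.99)]:
    minutes = allowed_downtime_minutes(pct)
    print(f"{label} ({pct}%): about {minutes:.1f} minutes per year")
```

Each extra nine cuts the downtime budget by a factor of ten, which is why the operational work grows so quickly as the target rises.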
Jon: Yep. It’s like the picture I have in my mind is when the jets get fueled in midair. When you can do that, then you’re starting to approach operational excellence.
Chris: Yeah, and that’s just one small slice of it. Right?
Chris: There’s everything that goes into that as well. Like, “Okay. How do we actually locate you in 3D space? How do we hook these two planes up?” Everything else that goes along with it, and do it while you’re flying 600 miles an hour.
Chris: Yeah, so why don’t we talk about some of the design principles that form this particular pillar? One would be you want to perform operations as code, and so this is a really strong theme throughout the entire Well-Architected Framework, and everything that AWS does is just automated as much as possible. If it’s something that can be automated, do it. Spend the resources. Spend the time to do it. Really root out all the manual steps, anything that has to be manually performed like that, you need to ask yourself, “Can I automate this?”
Chris: “Can I have a computer do it?” Right?
Chris: So that’s a general design principle here: if you can automate it and have code do your operations, then that’s what you want to strive for. Another principle is annotate documentation. The idea here is that there is some amount of documentation that’s manually produced, but then it should be supplemented with automation, with code. It should be able to build that up, whether it’s documenting an architecture or a code base, or runbooks and procedures, and whatnot. Just think of documentation not as this static thing that someone has to go update in an editor, but instead more of a living thing that can be more easily updated.
Jon: Right, and it updates itself as you update your application.
Chris: Yeah. So another design principle is to make frequent, small reversible changes. Maybe that goes without… Yeah. That sounds good.
Jon: Yeah, but it’s so hard for companies. Oh, my goodness. I think it’s particularly hard for companies that are led by folks that don’t have a lot of experience with technology. They just love the big bang of like, “Look at all these great, shiny new features.” There’s truth in it, right? Apple still releases one big iOS release a year that has big, shiny new features, and they’re all packaged together. It’s sort of 100% against this principle. The better, more safe, more automated, more Well-Architected way of doing it would be if each of those features came out a little bit at a time, small changes, maybe even only to a few people at a time to limit blast radius if there’s a problem.
That’s what AWS is saying to do, and it’s obviously what they do inside of AWS: the feature rollouts for AWS are constant. They’re never ceasing. Apple is getting more that way with iOS, too. [inaudible 00:30:25] I’m like, “Look at that. That’s new.” There was no big media event around it, but I think for people in traditional companies who are used to doing things the old way, it’s really hard to let go of big bang releases, and we work really hard as a consulting company to try to break that habit, but we are not always successful, and we do end up with high-risk releases.
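One common way to release a feature "only to a few people at a time" is a percentage rollout keyed on a stable hash of the user. The sketch below shows the idea under stated assumptions: the function names, the feature name, and the user IDs are hypothetical, and this is not how any particular AWS service or feature-flag product implements it.

```python
import hashlib


def rollout_bucket(user_id, feature):
    """Map a user to a stable bucket from 0 to 99 for one feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(user_id, feature, rollout_pct):
    """True if this user falls inside the current rollout percentage."""
    return rollout_bucket(user_id, feature) < rollout_pct


users = [f"user-{n}" for n in range(1000)]  # hypothetical user ids
for pct in (1, 10, 50, 100):
    enabled = sum(is_enabled(u, "new-checkout", pct) for u in users)
    print(f"{pct}% rollout: {enabled} of {len(users)} users")
```

Because the bucket is derived from a hash rather than a random draw, a user who sees the feature at 10% still sees it at 50%, and dialing the percentage back to 0 reverses the change for everyone, which is what keeps the blast radius small.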
Chris: Yeah, and it’s also… I mean, like an example with Apple. They’re definitely doing this with their backend services. Right?
Chris: It’s like client code, actual mobile. This is your mobile iOS, your actual operating system software. Right?
Chris: So that world is definitely more of the big release things, which-
Jon: It works in marketing, and so when it comes to aligning thoughts from a business value, that is important. So that consumer-facing stuff probably does need to keep behaving that way to an extent.
Chris: It’s a bit of a user education issue, too. Right?
Jon: Mm-hmm (affirmative).
Chris: It would be really confusing if every time you open up your phone, brand-new ways of doing something. You know?
Jon: Right. Yes.
Chris: It’s like, “How do you keep track of what it is?” I think that probably goes into this as well. When we’re talking about workloads, definitely frequent, small reversible changes is definitely the way to go. Another core design principle here is to make sure you’re refining your operations procedures frequently. Again, you’re continually learning. You’re continually improving. You’re never done. So always be ruthlessly looking at your operations, and like, “How can we improve?” Another principle is anticipate failure. Failures are going to happen, so prepare for it.
Chris: Think about what can go wrong, and then how are you going to address that? How are you going to remediate that? Which failures are you not going to be concerned with?
Jon: Even the smallest thing can take a few minutes or an hour or two here and there to think about this.
Chris: Yeah. Lastly, make sure you just learn from your operational failures. So when things do go wrong, which they will, do the analysis. Do the investigation to figure out, “Okay. How did this happen? What went wrong? What changes do we need to make so that something like this doesn’t happen in the future?” So that learning process.
Chris: Operationally excellent. They do point out that the key AWS service here for this pillar is CloudFormation. CloudFormation is the infrastructure-as-code product that they offer that allows you to basically codify what your infrastructure looks like for running your operations for anything and everything, not just what servers get stood up, and how application code gets deployed, but also what alerts get created, what events are being tracked, and whatnot. So all that can be expressed via code, and so CloudFormation is a key part of that.
Jon: Right, and AWS is doing a better job recently. I’ve noticed this year. They’re doing a better job of making sure that things are CloudFormation ready when they’re released. There was quite a while where people were getting upset that new services kept coming out without CloudFormation support, and that seems to be less and less common now.
Chris: Yeah. I know that just at the last re:Invent they mentioned several times that that’s top of mind for them. As they release new services, they definitely want to have CloudFormation support from the get-go.
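To illustrate the idea that alerts can be codified alongside the rest of the infrastructure, here is a minimal sketch that builds a CloudFormation template as a Python dictionary and prints it as JSON. The resource names (`OpsAlertTopic`, `HighErrorAlarm`), the `ExampleApp` namespace, the `Errors` metric, and the threshold are assumptions made up for this example. Only the resource types and property names come from CloudFormation itself.

```python
import json


def build_template(error_threshold: int) -> dict:
    """Build a small CloudFormation template with an SNS topic and one alarm."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative only: an alert expressed as code",
        "Resources": {
            # Hypothetical topic that receives alarm notifications.
            "OpsAlertTopic": {"Type": "AWS::SNS::Topic"},
            # Hypothetical alarm on a custom application error metric.
            "HighErrorAlarm": {
                "Type": "AWS::CloudWatch::Alarm",
                "Properties": {
                    "AlarmDescription": "Too many application errors",
                    "Namespace": "ExampleApp",
                    "MetricName": "Errors",
                    "Statistic": "Sum",
                    "Period": 300,
                    "EvaluationPeriods": 1,
                    "Threshold": error_threshold,
                    "ComparisonOperator": "GreaterThanThreshold",
                    "AlarmActions": [{"Ref": "OpsAlertTopic"}],
                },
            },
        },
    }


template = build_template(error_threshold=10)
print(json.dumps(template, indent=2))
```

The printed JSON could be saved to a file and kept in version control next to the application code, which is the point being made: a change to an alert becomes a reviewable code change.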
Voiceover: Just a sec. There’s something important you need to do. You must have noticed that Mobycast is ad free, but Chris and Jon need your help to make this work for everyone. Please help the Mobycast team by giving us five stars on iTunes, writing positive reviews, and telling your colleagues, friends, neighbors, children, and pets about the show. Go ahead and do it now. Great. I promise not to ask you to do that again.
Jon: Oh, great.
Chris: So the key focus areas for this particular pillar, there are three: prepare, operate, and evolve. So prepare is thinking about everything that… What do we have to have in place? What do we need to do in order to be operationally excellent? Operate is what happens when the workload is actually running. And then evolve is that learning process. How are we going to continually improve? So kind of everything in this pillar falls into these three focus areas.
Jon: I just want to help our listeners understand the outline of the Well-Architected Framework. It looks like each of these pillars has several sub points, and they’re the same for each pillar. So each pillar has design principles. Each pillar has a key service. Each pillar has a focus area. Each pillar has best practices, and each pillar has key points. Is that right?
Chris: Yeah, so I think less the key points. That’s not necessarily a part of the-
Jon: [crosstalk 00:36:01]
Chris: … framework. Right.
Jon: [crosstalk 00:36:03]
Chris: Yeah, so the design principles, the focus areas, the key service, those are definitely all part of it, and then best practices are just examples of specific action items you can do in each one of those focus areas in order to achieve the goals of that particular pillar. So that’s what we’ll be talking… As we go through each one of these pillars, we’ll be calling out these design principles, the focus areas, the key AWS service that enables this pillar, and then we’ll go through some examples of some best practices that you could do.
Jon: Right on. I guess I’m still having a little trouble. Now that I see this like this, I’m sorry that our users can’t see it like this, but we’re talking about this outline that has two things in it, design principles and focus areas, and to me they feel real overlappy, and design principles is kind of a technical thing that I understand from my education, and from my experience, but focus areas is kind of not. It’s more like a wishy-washy term in my mind. I don’t really know how to store this in my mind, and I don’t know how to use it because it’s like I got to think about my design principles. No. I can’t forget my focus areas. There’s too much. Have you figured out a way to let this rest in your head?
Chris: Yeah. I think the good news here, you don’t have to because, at the end of the day, this is all going to come down to the Well-Architected review, and that’s going to guide you through all this stuff. Design principles, I mean, these are not exhaustive, either. In the Well-Architected Framework, they call this out specifically just to kind of show what’s important, and what’s changed now in the cloud native world, versus the way things used to be. These are the kind of things, the kind of action items that you can wrap your mind around that you’ll be doing, but it’s absolutely not exhaustive. We listed six design principles for operational excellence, and we could sit here at a whiteboard, and probably come up with 25 pretty easily.
Jon: Just to kind of poke this in the eye a little bit more, though, really this is problematic for me. I think it’s because of just the way my brain works. I like things to be really specific, and to work together with each other, and I read this focus areas, where I think about focus areas, and in my mind when I think about focus area, I think of, “Okay. Well, there’s a big thing of stuff and I want to focus on an area.” So maybe there’s ways of sending messages around, so my focus area is going to be on message buses or let’s take it out of computers. There is poverty around the world. I’m going to have my focus area be racial inequality. That’s my focus area.
I am a PhD student, and I’m a student in biology and my focus area is going to be on RNA. These focus areas that I’m reading for operational excellence are prepare, operate, evolve. Do you see how that doesn’t make sense?
Chris: Yeah. Think of the focus areas as just segmentations of the things that you need to be doing and thinking about for the particular pillar. So they really are just like buckets, so it’s like, “How are we going to slice this up into categories or subcategories?” That’s what these focus areas are. So for operational excellence, you’re either preparing, you’re operating or you’re evolving. It’s a very straightforward way to just say like, “These are distinct phases of design development in relation to operational excellence.”
Jon: When you just gave this, I was like, “Oh, yeah. That would have been a good term for it.” I think it’s really I’m just kind of angry at the terminology. I think it probably came out of somebody trying to make it consistent across things that are not actually the same things, and that happens. If you take oranges and apples, and you try to talk about… If you try to talk about them in the same exact way, things are just not going to work out right. In this case, operational excellence is a really different thing from security, so yeah. Focus areas, like for operational excellence, maybe it should have talked about phases, and for security maybe focus areas made more sense. We’re not there, yet, but yeah. Sorry. It really does make it hard for me when the terms don’t match the things they refer to. It makes it hard for me to remember things. [inaudible 00:40:55] AWS.
Chris: Yeah. For the rest of this, just think of focus areas as just like segments-
Chris: … of working up just what needs to be done underneath this pillar. For each one of these, it’s going to be a little bit different on how they’ve broken it up, but they’re always going to be called focus areas.
Jon: Okay, cool.
Chris: Some services that are going to be important for this: for prepare, things like AWS Config and AWS Config rules are things that you want to have top of mind. For operate, things like CloudWatch, CloudTrail, VPC Flow Logs, X-Ray, these are all ways to get insight, telemetry, and-
Jon: Run the system [inaudible 00:41:46].
Chris: Yeah, and then for evolve they talk about how Elasticsearch is pretty important here and, also, things like CloudWatch Insights, as well. So this is the… How do you actually go and search your telemetry to gain insights, and whatnot, so that you can make these changes based upon the data? So just to point out, in the Well-Architected Framework they are listing like, “Hey, here are some of the AWS services that are particularly going to be useful for these particular focus areas, and this particular pillar.”
Jon: So the prepare, operate, and evolve kind of maps to deployment-types of tools, operational-types of tools, and all the cool types of tools. Cool.
Chris: Yeah. Maybe we can give some best practices examples for some of these focus areas. So starting off with prepare, one of the things best practice definitely you want to do is implement telemetry, being able to generate that telemetry, and then collect it, and so kind of defining what kind of metrics you want. This is for various different components of your workload. You want telemetry for your application at an application level. You want it at a workload level. You want it at a user activity level, and then you also want it for your dependencies.
Jon: Right. We’ve talked about this before, and that telemetry is a fancy word for metrics, and that’s what you use if you want to seem cool at software parties.
Chris: Yes. See? This is pretty comprehensive. Right?
Chris: Application workload, user activity, dependencies, a lot of folks out there if they have telemetry just for at the application level, they’ve got the checkbox marked. Right?
Jon: Mm-hmm (affirmative).
Chris: They’re like done. That’s really what I think, by and large, most workloads out there are doing. This kind of goes back to the pragmatic approach to this is like how much are you going to do of the Well-Architected Framework. So it’s absolutely best practice. You have to do it at all of these four different levels for collecting and analyzing telemetry, but are you going to do it? Are you actually going to spend the time and the resources, the effort to go do all this?
Jon: Yeah. It’s great to know this stuff so that you can make an informed, trade off decision instead of an indirect trade off decision.
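The four telemetry levels Chris lists (application, workload, user activity, dependencies) lend themselves to a simple coverage check. The sketch below is only an illustration under assumed names: the `Metric` class, the level labels, and the sample metric names are hypothetical, not part of the framework. It reports which levels have no metrics defined yet, which is one way to make the trade-off an informed one.

```python
from dataclasses import dataclass

# The four levels named in the discussion, as hypothetical labels.
LEVELS = ("application", "workload", "user_activity", "dependency")


@dataclass(frozen=True)
class Metric:
    name: str
    level: str


def missing_levels(metrics):
    """Return the telemetry levels that have no metric defined, in LEVELS order."""
    covered = {m.level for m in metrics}
    return [level for level in LEVELS if level not in covered]


# A team that only instruments the application itself.
defined = [
    Metric("request_latency_ms", "application"),
    Metric("error_count", "application"),
]
print(missing_levels(defined))
```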
Chris: Yeah, absolutely. That’s definitely, again, like the thrust of the Well-Architected Framework is to give you everything. This is comprehensive. Again, it’s like everything you need to go run, like Gmail. Right?
Jon: Mm-hmm (affirmative).
Chris: This is it. Everything you need to run DynamoDB. It is covered. Right?
Chris: Yeah. An example of another best practice in the prepare focus area would be implementing transaction traceability. Pretty important with microservices architecture being able to thread a user-initiated action all the way through the system as it touches various microservices, and having that traceability. So pretty complicated, pretty sophisticated, and you may not be doing it, but as a best practice, this is what we would all aspire to.
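A common way to get this kind of traceability is to attach an identifier to the user-initiated request and pass it along on every downstream call. The sketch below is a simplified, in-process illustration: the header name `X-Trace-Id` and the two "services" are hypothetical stand-ins, and real systems would typically rely on a tracing tool rather than hand-rolled code like this.

```python
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name for this sketch


def ensure_trace_id(headers: dict) -> dict:
    """Copy the headers, adding a new trace ID only if none is present."""
    out = dict(headers)
    out.setdefault(TRACE_HEADER, uuid.uuid4().hex)
    return out


def log(service: str, headers: dict, message: str) -> str:
    """Format a log line that carries the trace ID."""
    return f"trace={headers[TRACE_HEADER]} service={service} {message}"


def inventory_service(headers: dict) -> list:
    headers = ensure_trace_id(headers)
    return [log("inventory", headers, "reserved item")]


def order_service(headers: dict) -> list:
    headers = ensure_trace_id(headers)
    lines = [log("orders", headers, "order received")]
    # Pass the same headers downstream so the trace ID is preserved.
    lines += inventory_service(headers)
    return lines


for line in order_service({}):
    print(line)
```

Both log lines share one trace ID, so searching the logs for that ID reconstructs the path of a single user action across services.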
Chris: So operate, some examples of best practices there. They do talk a lot about run books and playbooks. An example of the best practice here would be any event for which you raise an alert should have an associated run book entry. So your run book defines… It will define triggers for escalations for when this particular alert needs to be bumped up a level, and wake someone up or whatnot. So a run book is a… It documents procedures that need to be performed, and so this would be things like make a backup of your database or restore a backup of your database, or encrypt something, and place it into a secure bucket or something like that. It could be how to do a rollback of a deployment.
A run book is about if this happens, this is what you do. They’re well defined events and how to deal with them. A playbook is for how do you basically troubleshoot and deal with issues that come up. This might be something like you’re getting latency in the system at some level. How would you go and what do you do there?
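The rule that every alert should have an associated run book entry is easy to check mechanically. Here is a minimal sketch, assuming alerts and run book entries are kept as plain data. The alert names, the entry fields, and the `alerts_without_runbook` helper are hypothetical.

```python
def alerts_without_runbook(alerts, runbook):
    """Return the alert names that have no run book entry, sorted by name."""
    return sorted(name for name in alerts if name not in runbook)


# Hypothetical alerts the system can raise.
alerts = ["db-disk-full", "high-error-rate", "queue-backlog"]

# Hypothetical run book: each entry has procedure steps and an escalation trigger.
runbook = {
    "db-disk-full": {
        "steps": ["Take a database backup", "Expand the volume"],
        "escalate_after_minutes": 15,
    },
    "high-error-rate": {
        "steps": ["Roll back the last deployment"],
        "escalate_after_minutes": 5,
    },
}

print(alerts_without_runbook(alerts, runbook))
```

A check like this could run whenever alert definitions change, so an alert never ships without a documented response.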
Chris: Right? I know we personally at Kelsus we’ve talked about run books, and we have run books. We combine both of these aspects to it, right?
Chris: I think we could probably do better by actually splitting them up and to really think of them as two separate things as run books and playbooks, and think about how do we automate more of the run book, actions that need to happen, instead of having this be a manual thing? It also gives a bit more structure to it as well, just from like what does this contain in it-type thing. Again, as a best practice, if you have an event that you’re going to generate an alert for, well, you better have a run book entry for it. Another best practice in the operate focus area is that if the system is impacted, then your user should be notified. They should have a way of knowing that.
You’ll notice most of the big services out there do this. They have status pages. So GitHub or AWS itself or just about everyone. Salesforce, they all have a status page that you can go see. Are there issues going on? They typically will have feeds, maybe like Twitter feeds or whatnot, letting people know what’s going on. Just again, as a best practice, your user should be notified. Is something going on?
Jon: Yeah. I think even the biggest systems out there don’t really do that great a job of that. It’s always like, “Hmm. I think there’s a problem here. Let’s go check Twitter and see if everyone agrees,” kind of thing, and then people point to the status page, and say things like, “It’s not working for me,” but the status page is not updated, yet, like that kind of stuff. Even for the biggest, most famously well-operated companies in the world it’s like an aspirational best practice.
Chris: It is. That is really frustrating, too, when you know there’s something wrong and you go to the status page, and you’re like, “No. Thumbs up. It’s all green.”
Chris: It’s like, “Come on. What use is this?” Yeah. It’s not easy to get it right, but as a best practice, it’s definitely something to aspire to. Some examples in the evolve focus area, feedback loops. These help you identify areas for improvement. They help you gauge the impact of changes to the system. When you did have this idea to make this change based upon the data that you’re seeing and whatnot, you need to be able to measure it, and really ascertain. Did it actually improve or is it not improving? Did it make things worse? Kind of having regular operational reviews where you go and you look at the metrics for a particular period, and use these reviews to identify opportunities for like, “How can we improve? What are some of the potential courses of action?”
Just in general sharing lessons learned so you can be talking about what operational events you had during that period, what the overall health of the workloads were, how many issues came up and were resolved, all sorts of things. Kind of getting into that diligence of saying, “Hey, operations is really key. It’s important. We do want to have this constant learning and improvement, and so we’re going to treat it as a first-class citizen and have these regular reviews of our operations.”
Jon: For the smallest team, I can imagine an operational review being kind of reduced down to even just like a once a month check-in kind of like a, “Hey, what part of running the system do we just hate right now?” What part of it’s just bothering us? What part do we want to fix so that we don’t have to do it manually or fix little broken threads everywhere every time this happens kind of stuff?
Chris: Yeah. An ad-hoc thing like that is definitely a step in the right direction, and it’s actually not too much work to put some-
Chris: … data behind that, too. Right?
Jon: Mm-hmm (affirmative).
Chris: It could be something as simple as just measuring the number of bugs that get generated during the course of a certain amount of time or how many alerts were generated during a specific time or, if you do things with your logging emitting at warning or error levels, how many of those events happened. Those are all pretty easy to plug into and get data on, and they’re definitely good heuristics for just how well you’re operating. So maybe just to wrap up with this pillar then some really just key points are run books, playbooks. Make sure you have them. Document your environments, so multiple different environments usually for your workloads. You know, dev, stage, prod, whatnot.
Make your small changes through automation. Monitor your workload with business metrics. Practice your response to failures. Don’t wait for them to happen. Have those game days, and have well-defined escalation management. Be very clear about when something needs to be addressed manually by a real person versus what can be done in code, versus what can be looked at later. Understand what things should be escalated and raised up. If you’re waking someone up, you’d better-
Jon: It better be important. Yeah.
Jon: Those are the kinds of things that you don’t even really need a Well-Architected Framework to figure out. That’s the stuff that every company that operates software starts to figure out pretty quickly. Yeah.
Chris: Yeah, especially if you’re the person getting woken up at 2:30 in the morning. Right?
Chris: At some point, you got to cry uncle and say, “No.”
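The simple review heuristic mentioned a moment ago, counting how many warning and error events happened during a period, can be sketched in a few lines. This is only an illustration: the `review_counts` helper, the event tuples, and the dates are made up, and a real review would pull these events from whatever logging system is in use.

```python
from collections import Counter
from datetime import datetime


def review_counts(events, start, end):
    """Count WARNING and ERROR events with start <= timestamp < end."""
    counts = Counter()
    for timestamp, level in events:
        if start <= timestamp < end and level in ("WARNING", "ERROR"):
            counts[level] += 1
    return counts


# Hypothetical log events as (timestamp, level) pairs.
events = [
    (datetime(2020, 3, 2, 9, 0), "ERROR"),
    (datetime(2020, 3, 5, 14, 30), "WARNING"),
    (datetime(2020, 3, 9, 2, 30), "ERROR"),
    (datetime(2020, 3, 9, 8, 0), "INFO"),
    (datetime(2020, 4, 1, 0, 0), "ERROR"),  # outside the review period
]

march = review_counts(events, datetime(2020, 3, 1), datetime(2020, 4, 1))
print(dict(march))
```

Comparing these counts from one review period to the next gives a rough signal of whether operations are getting better or worse.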
Jon: Well, there is so much here, Chris. I mean, this is just huge this Well-Architected Framework, and I don’t think we can do another pillar today.
Chris: I don’t think so, either. I think there is… and we’re still just kind of like scratching the surface here. The point I want to stress is that there is just a wealth of information here. This is based upon running these very big workloads in AWS, and everything that goes into it, and so there’s a lot of really good information here. Again, you’re going to have to figure out what’s applicable to you, and what you can implement, versus what you can’t, but it’s a lot of really good information. Yeah. We got through one pillar, and we got four left. This is probably going to end up being like a three parter.
Jon: Yeah. All right. Well, thank you. Thank you, everyone, for listening. We’ll talk to you next week.
Chris: All right. We’ll see you. Bye.
Voiceover: Nobody listens to podcast outros. Why are you still here? Oh, that’s right. It’s the outro song. Come talk to us at Mobycast.com.fm or on Reddit at R/Mobycast.