79. Are You Well Architected? The Well-Architected Framework – Part 2

Summary

In episode #78 of Mobycast, we introduced the AWS Well-Architected Framework, an indispensable resource of best practices when running workloads in the cloud. We explained that the framework defines five pillars of excellence, and we dug deep on the first pillar, “Operational Excellence”.

If you missed that episode, hit pause now and go listen to that one first. It’s ok, we’ll wait for you.

Now, in this episode of Mobycast, Jon and Chris continue their three-part series on the AWS Well-Architected Framework and discuss the next two pillars of excellence: “Security” and “Reliability”.

Show Details

In this episode, we cover the following topics:

  • Pillars in depth
    • Security
      • “Ability to protect information, systems, and assets while delivering business value through risk assessments and mitigation strategies”
      • Design principles
        • Implement strong identity foundation
        • Enable traceability
        • Security at all layers
        • Automate security best practices
        • Protect data in transit and at rest
        • Keep people away from data
        • Prepare for security events
      • Key service: AWS IAM
      • Focus areas
        • Identity and access management
          • Services: IAM, AWS Organizations, MFA
        • Detective controls
          • Services: CloudTrail, CloudWatch, AWS Config, GuardDuty
        • Infrastructure protection
          • Services: VPC, Shield, WAF
        • Data protection
          • Services: KMS, ELB (encryption), Macie (detect sensitive data)
        • Incident response
          • Services: IAM, CloudFormation
      • Best practices
        • Identity and access management
          • AWS Cognito
            • Act as broker between login providers
            • Securely access any AWS service from mobile device
        • Data protection
          • Encrypt
            • Encryption at rest
            • Encryption in transit
            • Encrypted backups
          • Versioning
          • Storage resiliency
          • Detailed logging
        • Incident response
          • Employ strategy of templated “clean rooms”
            • Create new trusted environment to conduct investigation
            • Use CloudFormation to easily create the “clean room” environment
    • Reliability
      • “Ability to recover from failures, dynamically acquire resources to meet demand and mitigate disruptions such as network issues”
      • Design principles
        • Test recovery procedures
        • Auto recover from failures
        • Scale horizontally to increase availability
        • Stop guessing capacity
        • Manage change with automation
      • Key service: CloudWatch
      • Focus areas
        • Foundations
          • Services: IAM, VPC, Trusted Advisor (visibility into service limits), Shield (protect from DDoS)
        • Change management
          • Services: CloudTrail, AWS Config, CloudWatch, Auto Scaling
        • Failure management
          • Services: CloudFormation, S3, Glacier, KMS
      • Best practices
        • Foundations
          • Take into account physical and service limits
          • High availability
            • No single points of failure (SPOF)
            • Multi-AZ design
            • Load balancing
            • Auto scaling
            • Redundant connectivity
            • Software resilience
        • Failure management
          • Backup and disaster recovery
            • RPO, RTO
          • Inject failures to test resiliency
      • Key points
        • Plan network topology
        • Manage your AWS service and rate limits
        • Monitor your system
        • Automate responses to demand
        • Backup
  • In the next episode, we’ll cover the remaining two pillars and discuss how to perform a Well-Architected Review.

Links

Whitepapers

End Song

The Runner (David Last Remix) – Fax

 

For a full transcription of this episode, please visit the episode webpage.

We’d love to hear from you! You can reach us at:

Voiceover: In episode 78 of Mobycast, we introduced the AWS Well-Architected Framework, an indispensable resource of best practices when running workloads in the cloud. We explained that the framework defines five pillars of excellence, and we dug deep on the first pillar, operational excellence. If you missed that episode, you can hit pause, and we’ll wait here while you catch up. Now, in this episode of Mobycast, Jon and Chris continue their three-part series on the AWS Well-Architected Framework and discuss the next two pillars of excellence, security and reliability.
Welcome to Mobycast, a show about the techniques and technologies used by the best cloud native software teams. Each week, your hosts, Jon Christensen and Chris Hickman, pick a software concept and dive deep to figure it out.

Jon Christensen: Welcome, Chris. It’s another episode of Mobycast.

Chris Hickman: Hey, Jon, good to be back.

Jon Christensen: Good to have you back. So, again without Rich, another week goes by. We’re missing the man. He should be back soon. We are still in the middle of just the depths of learning about software architecture, and there is so much here, so much to learn, and we can’t get it all done in a quick 40 minute conversation. So, two weeks ago, we started this off with the Twelve-Factor app, and that was pretty easy and digestible. It was just like, here’s some things you do with an app to make sure that it’s pretty good, and you can deploy it onto your platform as a service, and it works well and is pretty reliable. Then, last week we got into the real hardcore stuff. We started doing the Well-Architected Framework from AWS, which is just a beast. It was like, here’s everything you need to know and do if you’re going to run distributed systems that are capable of serving millions and millions of people and have umpteen-nines worth of uptime.
There are five pillars, and last week, we got through … There’s five pillars, and there’s also some general design principles and just the introduction of the whole thing. So, last week we talked about just the whole thing, and what it is, and some of the general design principles. Then, we talked about the first pillar, which was operational excellence. This week, we’re going to talk about two more pillars, security and reliability. And, I’m excited. I think that we’ve noticed that not everybody that listens to Mobycast loves our episodes about security. So, we’ll keep that tight. We might not even put it in the title, but we do have to talk about it. It’s pretty important. So, we’ll talk about security and reliability. Chris, do you want to kind of kick those off for us, and any kind of, talking about stuff we talked about before too is welcome.

Chris Hickman: Sure, yeah. I mean, you set it up perfectly. So, we’re just going to roll right into it here with the second pillar being security. There’s a lot, again. There’s a lot here. Security is extremely important. There are breaches every day, and these are really expensive, right? They cost millions and millions and millions of dollars across the board. So, security, we always kind of laugh at it a little bit. No one likes to talk about it, and no one likes to do it, but it’s so important. I think the great news here is that there are so many tools and services that make it so much easier.
It doesn’t have to be a really big thing, right? You can actually get a big area of coverage with security with not a lot of effort. But, you just have to make that commitment to actually include it. That’s one of the reasons why it’s here. It’s one of the pillars. I mean, I remember the first time that you were looking at the Well-Architected Framework, and you were like, “Scalability, that should be a pillar, right?” It’s not, right? So, instead it’s in reliability; that’s really kind of where that is. But, security is so important it gets its own pillar.

Jon Christensen: Yeah, and security is the high fiber diet of software engineering. You definitely want to do the things that are not the best tasting in order to have good results.

Chris Hickman: Yeah, indeed. Indeed. Yeah, so let’s dive into it. Security pillar, they define it as, security, it’s the ability to protect information, systems, and assets while delivering business value through risk assessments and mitigation strategies. So, this is all about-

Jon Christensen: That’s a mouthful. That is a mouthful.

Chris Hickman: Yeah, but so, the important thing here is just, I think it’s risk assessments, so knowing where you have risk areas, and then developing your mitigation strategies, right? So, how are you going to-

Jon Christensen: Right on. Let’s talk about the first three things too, though. Why bother separating out information, systems, and assets? What are those three things? I guess data, that maybe in databases. Systems are like computers or running software, maybe. Running software is what I would think of as systems. Then, assets would be the actual computers or load balancers or physical devices, kind of thing. Or, maybe, yeah, I guess that’s probably what they’re talking about when they talk about assets. I guess I was also thinking assets could be video files or something. Those could be assets, because it’s weird. It’s one of those words, that it means one thing in one situation, and another in another. If you look at your Node.js directory, and there’s a thing called assets in it, that’s going to be your images and your HTML, your images, and your videos, and just things like that, GIFs, whatever. This is not that kind of assets. This is business assets, I would guess.

Chris Hickman: Again, this is kind of like the general description of what it is. We’ll get into this with the focus areas, and this will become definitely a bit more clear on what it is that needs to be protected, right?

Jon Christensen: Yeah, I guess I interrupted and wondered, because it’s like, well, if this is so carefully worded, then those three things must be the things. There must not be anything else, or whatever. But, it could just be that a writer did this, and those came out. Cool.

Chris Hickman: All right. Yeah, so let’s walk through the design principles for the security pillar.

Jon Christensen: Before you do, I just want to remind people from last week that every pillar has design principles and has the focus areas and has best practices. Those are the three things we talked about. Last week, the operational excellence pillar also had a key service. Oh, it looks like security has key services as well.

Chris Hickman: Yeah, for each pillar, we will talk about a key service.

Jon Christensen: Key service. Okay, cool. So, design principles, key service, focus areas, and best practices. Here we go with design principles.

Chris Hickman: All right. Yeah, so the first design principle is we want to implement a very strong identity foundation. So, this is kind of core to protection, just the fact that you can do things like authentication and authorization. So, that’s the first thing. Second, enable traceability. So, being able to see what’s happening and what’s happened in your system from a security standpoint.

Jon Christensen: Well, that one’s nice, because we talked about that one last week, too. So it’s like, kill two pillars with one stone, because you wanted the traceability, at least transactional traceability. We talked about that last week, like putting in X-Ray or something like that.

Chris Hickman: Yeah, so that was, yeah, request traceability or transaction traceability. This is now traceability from the aspect of security and access.

Jon Christensen: Okay, so this is more like who is actually touching systems, what IAM accounts are doing things in your AWS account, kind of stuff.

Chris Hickman: Yeah, I mean, think about it. When they do have these breaches, usually, it’s not like … They don’t catch them in the act, usually. Something gives rise where they’ve realized that, oh no, we’ve had a breach. Then, they have to do the forensics on it, right? So, that’s when they’re going back to things like logs and whatever, what other sort of data that they have to do that forensic analysis. So, that’s that traceability part.

Jon Christensen: Right, and this is for both. I guess I take back one thing I just said. It is for the AWS systems and IAM roles and accounts and stuff that are touching that, but it’s also your application, so any users that are doing things in your application. That should all be traced, as well. Nobody should be able to change state of your application without being who they are and having that be audited in some way, going to some sort of log somewhere.

Chris Hickman: Yeah, and this feeds into the next design principle, which is just security at all the layers. You want security at the data layer. You want it at the infrastructure layer. You want it at the application layer. You want it at the network, just all the layers, right?

Jon Christensen: Right. Cool. I could have written this myself.

Chris Hickman: The Well-Architected Framework, by Jon Christensen. Yes. Cool, so moving on to the next design principle, automate your security best practices. As we talked about before, a huge theme throughout the entire Well-Architected Framework is automate all the things, right? Anything that can be automated, go ahead and make the investment to automate it. So, we did that in operational excellence. Automate as much as we can of operations, and so the same thing here with security. Take your best practices and automate them as much as you can.
Another key design principle here is protect your data in transit and at rest, very topical coming off of our encryption series, those episodes where we talked in depth about encryption and the difference between doing encryption in transit versus at rest. Another principle is keep people away from data. Really, what this is saying is if someone has no need to access data, then don’t allow them to. So, it’s the principle of least privilege here, applied to the actual data itself. So, this is obviously super important for folks that have regulations that they need to comply with, maybe HIPAA or PCI or whatnot. So, making sure that you’re giving least privilege access.

Jon Christensen: I think it’s worth telling a quick story about this, because this just came up with one of our clients just the other day. I feel like this is sensitive enough that I should kind of keep who it is and who the third party is to myself, but I’ll try to at least explain it. There was a service that would allow people to write messages to each other, direct messages to each other. That service has a feature that is an extra add on, pay for feature that allows you to do content moderation. So, you could go and see if people are saying bad things to each other and stop that, prevent them or tell them not to or whatever.
The guiding principle that we were using is, you know what? Let’s give people the tools to not accept messages from certain people if they want, so to kind of block them or report them. But, we don’t really want to be in the business of knowing what people say to each other. We just don’t want to even know. So, it was a conscious decision that we were going to absolutely not use those moderation tools. We were not going to buy that feature. We did not want to have access to that feature. You don’t want to be in the same room as the temptation. Make sure, we just didn’t want it. Let’s have a policy be that the keys are behind locked doors. Only one or two people have access to the keys, ever. If there’s some reason that some messages need to be decrypted, it better be a legal reason, and we don’t mess around with that stuff. I think that’s a telling example of that security principle you just mentioned. Don’t make that data available if it is sensitive data and it could cause harm to people.

Chris Hickman: Yeah, exactly. All right, and then for design principles to close it out, the last one is to prepare for security events. So, this is just again, kind of knowing the inevitability. There’s probably going to be some kind of incident. So, what’s your response going to be to that? And, just having that thought out. How will you do it? Again, this goes back to risk assessments and mitigation strategies.

Jon Christensen: Yeah. Call everyone into a room, and everyone runs around screaming and shouting, pulling their hair. That would be my plan.

Chris Hickman: We’ll work on that. I think we can do a little bit better than that.

Jon Christensen: Go ahead.

Chris Hickman: Yeah, so key service. This one’s pretty obvious: it’s AWS Identity and Access Management (IAM), which is definitely going to be key to the security pillar here. IAM is the core identity foundation for AWS and all its services. It gives you support for users, roles, groups. You can have policies. You can have fine-grained access control. So, definitely, this is going to be the foundation for most of the things that we do in the security pillar.
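To make that fine-grained access control idea concrete, here’s a minimal sketch in Python of a least-privilege IAM policy document. The bucket name, prefix, and policy name are all hypothetical; the boto3 call for attaching it is shown only as a comment.

```python
import json

def read_only_s3_policy(bucket: str, prefix: str) -> dict:
    """Build a least-privilege IAM policy document that allows only
    reading objects under one prefix of one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            }
        ],
    }

if __name__ == "__main__":
    # Creating the policy would use boto3's IAM client, e.g.:
    #   import boto3
    #   iam = boto3.client("iam")
    #   iam.create_policy(
    #       PolicyName="app-reports-read",
    #       PolicyDocument=json.dumps(read_only_s3_policy("my-app-data", "reports")))
    print(json.dumps(read_only_s3_policy("my-app-data", "reports"), indent=2))
```

The point is simply that the grant names one action on one resource path, nothing broader.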

Jon Christensen: Right on. It does feel like this is a bit of a push, though, from AWS. Yes, yes, of course you need to use IAM for any infrastructural stuff. So, you’re going to have roles around ECS, or you’re going to have roles and users that can talk to certain EC2 instances, and that all makes sense. Of course you have had that. But, I do feel like AWS for a while now, and more and more successfully, is trying to push IAM into your applications themselves, and that’s through tools like Cognito and federated identities, where some pool of your users might take on an IAM role in order to access certain files in S3 or whatever. I just feel like that’s the direction AWS wants to go. They want to get IAM into your app, instead of having you own everything about your users in your own databases and application software. It just feels that way to me.

Chris Hickman: Yeah, a common theme just with AWS and any one of the cloud providers is they have their core services, and a big part of the value proposition is the synergy that you get, and the interoperability and the integration between those services. So, IAM is their identity and access management platform that all their services are going to use. So, it’s going to, the more you use that and leverage that, the easier it’s going to be for you to integrate into the system. It’s kind of like what we talked a little bit about with KMS with encryption, where again, KMS is the core service for managing keys and doing things like encryption, and it’s very well integrated with all the AWS services. So, if you want to make things as easy as possible, if you want to have that deep integration, then you’re going to have to use KMS, or you should use KMS as opposed to … They give you the tools that you can bring your own, do your own thing, but if you do that, you’re going to have some heavy lifting to do.

Jon Christensen: Yeah, and today I can imagine doing something where our application’s actual users that are in our user table in some database have some sort of direct affinity to IAM roles in order to be able to access certain things. Maybe some users have a different role than other users. I can really see doing that today, and I would say five years ago, it wouldn’t have even crossed my mind. Cool.

Chris Hickman: Yeah, indeed. All right, so rolling on, focus areas. So, five focus areas here for the security pillar. One, identity and access management. Two is detective controls. Three, infrastructure protection. Four is data protection, and the last one is incident response. These focus areas, we can now look back at that original statement of the ability to protect information, systems, and assets, right? So, what does it mean by that, by information, systems, and assets? So, these focus areas kind of give clues so you know what we’re talking about here. One of the focus areas is infrastructure protection. That is probably in this case, it’s the assets, and maybe kind of getting into systems, as well.
Then, we have data protection, which is going to line up well with the information part of what we’re protecting. Those are the two focus areas around protection, with infrastructure and data. Then, we have the core identity and access management as being one of the focus areas. Then, we have these detective controls and incident response. Detective controls are basically the things allowing this traceability for who’s doing what. Then, incident response again is, what are we going to do if there is a security event? How do we handle that?

Jon Christensen: Last week, I went on a tirade about focus areas, because I felt like prepare, operate, evolve are not … They don’t grammatically fit that word, but these ones, they feel right to me. These do feel like focus areas of security. Well done on this one, AWS.

Chris Hickman: Cool. Yeah, wait until we get to reliability. I don’t think you’re going to be too happy with those focus areas. But, we’ll get there.

Jon Christensen: Right on, and today, too.

Chris Hickman: Yeah, and today. Then, maybe just to call out some of the AWS services that will be applicable to each one of these focus areas. So, for identity and access management, obviously IAM. That’s its reason for being. IAM literally is “identity and access management,” right? So, it’s almost like “GNU’s Not Unix.” Then, other things though, AWS Organizations and MFA would definitely be kind of key features and services to be thinking about in that focus area. For detective controls, we’ll have things like CloudTrail, CloudWatch.

Jon Christensen: By the way, there’s notable lack of Cognito on that one that I thought I’d point out.

Chris Hickman: Only just because it’s … I mean, that would definitely fall into that for sure, but it’s not as common a use case, right? Cognito is a bit more at the edges for various different workloads. It’s only for certain workloads that you’re probably going to use Cognito. So, that’s why it’s not one of those core services that we’re talking about right now, but absolutely it falls under that umbrella. So, yeah, for the detective controls, CloudTrail, CloudWatch, AWS Config, and GuardDuty would be some of those important services to be thinking about. CloudTrail is the capability for logging basically all the API calls being made in the system. It’s going to vary by service exactly what’s logged as far as the API calls, but once you enable that for your account, you’ll get that logging, and you’ll be able to see those events.

Jon Christensen: I think it’s worth saying that when you say all the API calls, that really means everything that happens inside of AWS, because nothing really happens in AWS without an API call. When you’re using the console, under the covers an API call happens. When you’re using CloudFormation, under the covers an API call happens, right?

Chris Hickman: It is.

Jon Christensen: Everything that happens.

Chris Hickman: The kind of unfortunate thing is that CloudTrail is not going to log all the API calls. It’s only going to, and again it’s on a per-service basis, but the service is kind of dictating what API calls will get logged to CloudTrail. So, as an example, not too long ago, we had an IAM user account set up, and it had a key pair, an access key and secret key, so developer credentials associated with that account. This had been set up a long time ago. These credentials were being used by some application in the system, of which there were many, many applications, right? We could see when these credentials were last used, so we knew that they were still being used, but we had no way of seeing where this was actually being used, right?

Chris Hickman: And so the hope was CloudTrail would help here. And it turns out with CloudTrail, it’s only going to log the API calls for those credentials on management tasks, not on actual user actions. So the calls for using those credentials to do some operation, that’s not going to get logged. So, it was one of those things where, like, this feels perfect, CloudTrail will really help here, and it turns out it wasn’t going to help us at all.

Jon Christensen: I want to say that there’s some irony about the AWS Well-Architected Framework and the security best practices, and AWS itself tripping over that, with that example you just gave. But yeah, I’m not sure. There’s also, it feels like, wiggle room for them to [crosstalk 00:01:03].

Chris Hickman: If anyone out there has a better idea for how to figure out what actual application is using a set of developer keys, I’d be open to hearing that. We tried many different … Just anything and everything between CloudTrail and VPC flow logs, just anything and everything and just not really coming up with anything good to try to figure out, who is using this?
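For what it’s worth, the kind of CloudTrail lookup we were attempting can be sketched with boto3’s `lookup_events` call, filtering on the access key ID. As discussed above, this only surfaces management-plane events, so it has exactly the blind spot Chris describes; the key ID below is made up.

```python
def management_events_for_key(cloudtrail_client, access_key_id, max_results=50):
    """Return (event name, event time) pairs that CloudTrail recorded for
    an access key. Note: CloudTrail only records management-plane calls
    by default, so ordinary data-plane usage of the key won't show up."""
    response = cloudtrail_client.lookup_events(
        LookupAttributes=[
            {"AttributeKey": "AccessKeyId", "AttributeValue": access_key_id}
        ],
        MaxResults=max_results,
    )
    return [(e["EventName"], e["EventTime"]) for e in response["Events"]]

# Usage (hypothetical key ID):
#   import boto3
#   cloudtrail = boto3.client("cloudtrail", region_name="us-east-1")
#   for name, when in management_events_for_key(cloudtrail, "AKIAEXAMPLE123"):
#       print(name, when)
```

So you can see who created or rotated the key, but not which application is signing requests with it, which is why this came up empty for us.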
Infrastructure protection. So, for that focus area, some key services you can use are going to be VPC and all the network components that go along with that: how you’re building out your subnets and your NACLs, security groups, firewalls, all that kind of stuff. AWS Shield is another thing here, helping you with distributed denial of service (DDoS) attacks, and then WAF, which is the web application firewall.

Jon Christensen: WAF, that’s what I’m thinking.

Chris Hickman: So, some good services there to take advantage of. For data protection, KMS. We’ve talked about this at length in the encryption series. ELB is in here just from the respect that ELBs will do TLS for you, so let them, so they can do your TLS termination. And then [crosstalk 00:02:39].

Jon Christensen: Although I think it’s worth pointing out, that if you let your ELBs do your TLS termination, then you’re not technically all the way end to end with your encryption.

Chris Hickman: No, you’re not. But it’s actually really hard to do that.

Jon Christensen: Yeah, we had to do it for a client that had to be HIPAA compliant, and it wasn’t sufficient to have the TLS terminated at the ELB. Gosh, that was a mouthful. Cool. Incident response?

Chris Hickman: Some key services there are going to be, IAM and CloudFormation. The Well-Architected Framework calls out CloudFormation here, which is a little bit of a head scratcher at first. But really what it is, is their approach to these kinds of things is to basically build a clean room environment where they can go do the forensic analysis. So, CloudFormation comes in here where it allows you to stand up a complete replica of the environment, do the replay and do the forensic analysis, in this clean room. Which is, again, pretty sophisticated. [crosstalk 00:03:48].
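A minimal sketch of kicking off that clean-room stack with boto3, assuming you already maintain a CloudFormation template for the environment. The template URL, stack naming scheme, and tags here are invented for illustration.

```python
def clean_room_stack_params(template_url: str, incident_id: str) -> dict:
    """Build the create_stack parameters for spinning up an isolated
    'clean room' replica of the environment for forensic analysis."""
    return {
        "StackName": f"clean-room-{incident_id}",
        "TemplateURL": template_url,
        # Tag the stack so it's obvious this is an investigation environment.
        "Tags": [{"Key": "purpose", "Value": "incident-forensics"}],
        # Needed if the clean-room template creates IAM roles for investigators.
        "Capabilities": ["CAPABILITY_NAMED_IAM"],
    }

# Usage (hypothetical template location and incident ID):
#   import boto3
#   cfn = boto3.client("cloudformation", region_name="us-east-1")
#   cfn.create_stack(**clean_room_stack_params(
#       "https://s3.amazonaws.com/my-templates/clean-room.yaml", "2019-0042"))
```

The payoff is that the investigation runs against a disposable replica, not the (possibly compromised) production environment.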
Those are the key focus areas and their associated AWS services that are probably going to be pretty … you want to keep pretty close by. As far as best practices go, why don’t we talk about a few examples of best practices in some of these focus areas?
One would be in the identity and access management focus area, we could use Cognito. You mentioned that before. This is absolutely one of those best practices to act as a broker between login providers and also if you want to do secure access of any AWS service from a mobile device. And we talked about this in a previous episode where we specifically just talked about Cognito and your experience of building an app that needed to access these AWS services. I think it was Dynamo and how it could do that securely. And so Cognito was the answer there, where it’s basically creating these user pools that then get mapped to temporary IAM credentials.
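A rough sketch of that flow with boto3. Strictly speaking, it’s a Cognito identity pool (federated identities) that vends the temporary IAM credentials; the pool ID below is a placeholder, and with a real login provider you’d also pass a Logins map to both calls.

```python
def guest_aws_credentials(cognito_identity_client, identity_pool_id):
    """Exchange an (unauthenticated) Cognito identity for temporary,
    scoped-down IAM credentials. With a login provider, a Logins map
    would be passed to both calls to authenticate the user first."""
    identity = cognito_identity_client.get_id(IdentityPoolId=identity_pool_id)
    creds = cognito_identity_client.get_credentials_for_identity(
        IdentityId=identity["IdentityId"]
    )
    # Contains AccessKeyId, SecretKey, SessionToken, Expiration.
    return creds["Credentials"]

# Usage (hypothetical pool ID):
#   import boto3
#   ci = boto3.client("cognito-identity", region_name="us-east-1")
#   creds = guest_aws_credentials(
#       ci, "us-east-1:00000000-0000-0000-0000-000000000000")
#   # These temporary credentials can then sign calls to, e.g., DynamoDB.
```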

Jon Christensen: Yeah. But that was a JavaScript React.js app with a Node layer in Lambda, so it’s not just mobile. Cognito isn’t. Go ahead.

Chris Hickman: In the data protection focus area, some best practices there are definitely just encrypt all the things. We want to use encryption at rest, we want to do encryption in transit, and we want to encrypt our backups. We talked about encryption at length, and there’s really no reason not to. So just encrypt all of the things. For data protection, you also want to do things like versioning. So S3 supports versioning of objects. This gives you resilience from mutations that are bad. You want to be able to recover and roll back to a previous version, and you can turn off deletion of versions as well to protect that.
Another best practice on data protection would be dealing with storage resiliency. Making sure that you have a resilient storage platform. So again, things like S3, built out to [crosstalk 00:06:18]. That’s pretty durable. You really don’t have to worry about losing any data there. Detailed logging goes along with this as well.
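As a sketch, turning on versioning and default encryption for a bucket comes down to two boto3 calls; the bucket name and KMS key alias below are hypothetical.

```python
def versioning_config():
    """Parameters for s3.put_bucket_versioning: keep every object version."""
    return {"Status": "Enabled"}

def default_encryption_config(kms_key_id=None):
    """Parameters for s3.put_bucket_encryption: SSE-KMS when a key is
    given, otherwise S3-managed keys (SSE-S3)."""
    if kms_key_id:
        default = {"SSEAlgorithm": "aws:kms", "KMSMasterKeyID": kms_key_id}
    else:
        default = {"SSEAlgorithm": "AES256"}
    return {"Rules": [{"ApplyServerSideEncryptionByDefault": default}]}

# Usage (bucket name and key alias are hypothetical):
#   import boto3
#   s3 = boto3.client("s3")
#   s3.put_bucket_versioning(
#       Bucket="my-app-data", VersioningConfiguration=versioning_config())
#   s3.put_bucket_encryption(
#       Bucket="my-app-data",
#       ServerSideEncryptionConfiguration=default_encryption_config("alias/app-data"))
```

With both set, a bad mutation is recoverable via an older version, and every new object lands encrypted at rest by default.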
In the incident response focus area, an example of a best practice there is, again, using that templated clean room strategy for doing the forensic analysis when there is a security event. Template out your environment in CloudFormation and use that to quickly spin up a new trusted environment to conduct your investigation.

Jon Christensen: And again that one sounds like a big teams only type of thing. Maybe the thing is when you do have a security incident, it doesn’t matter how big or small your team is, it’s all hands on deck and that could be a great way to start to take a look at it.

Chris Hickman: Yeah, it’s one of those things where, obviously, hopefully you never have to do this. Odds are, if you are employing a strategy like this, then you’re a big enough company and team to be able to support having these folks on board to do this.

Jon Christensen: Exactly. All right, so those are best practices, right? There weren’t any others we were going to talk about for security [crosstalk 00:07:43].

Chris Hickman: Everyone, you can open your eyes, we’ve made it through [crosstalk 00:29:48].

Jon Christensen: Fantastic.

Voiceover: Just a sec. There’s something important you need to do. You must have noticed that Mobycast is ad free, but Chris and Jon need your help to make this work for everyone. Please help the Mobycast team by giving us five stars on iTunes, writing positive reviews, and telling your colleagues, friends, neighbors, children, and pets about the show. Go ahead and do it now. Great. I promise not to ask you to do that again.

Jon Christensen: So, time to talk about reliability. I think you can get through a good part of this today, and then two more pillars next week. Reliability. Go for it.

Chris Hickman: Yeah, pillar three is reliability. What is that? Reliability is the ability to recover from failures, to dynamically acquire resources to meet demand, and to mitigate disruptions such as network issues. This is, again, a big, comprehensive swath of functionality that it’s dealing with, right? It’s just everything that goes into having a workload that’s always available. It’s operating, and people can use it in a timely fashion. So, we’re going to deal with failures. We’re going to deal with scaling up and down. And we’re going to be resilient to disruptions that shouldn’t take us down completely. We might do some graceful degradation.

Jon Christensen: Right. And you know, reliability is so big that it makes me think about “ilities” in general. I’m not sure if you’ve heard that term as a computer term, just “ilities.” Have you heard that?

Chris Hickman: Yeah, we talk about things like scalability, reliability,

Jon Christensen: Dependability, [crosstalk 00:31:47] is probably one of them. It’s the grandfather maybe of all the ilities. It covers a lot of them. And maybe that’s why they chose this one to be their pillar.

Chris Hickman: Yeah, it definitely covers a lot. That’s why there’s not the scalability pillar, because scalability is part of this reliability. Reliability is even bigger than that. Otherwise if scalability was a pillar, we’d have more than five.
So let’s talk about design principles for reliability. One is to test your recovery procedures. Which is a big one. I think it’s one that a lot of people don’t really do.

Jon Christensen: I think we just lost the 999 out of 1000 [inaudible 00:32:43].

Chris Hickman: Definitely this is one of the core principles there is having recovery procedures and testing them.

Jon Christensen: Isn’t that what AWS is for though? Aren’t they supposed to be the ones that test that?

Chris Hickman: Well, it depends on what’s… This goes to the [crosstalk 00:11:02].

Jon Christensen: [crosstalk 00:33:03] but I think there starts to be that sort of attitude around managed services. “It’s a managed service. They take care of that work.”

Chris Hickman: [crosstalk 00:33:10] for managed services, they are, for the most part, dealing with that. If Lambda has a problem, if they have to recover from a Lambda failure, they’re doing that. That’s a fully managed service. Things like RDS, that’s a managed application, it’s semi-managed, it’s not completely. There’s responsibility there on both sides. AWS is doing some stuff. They’re responsible for the actual database. But you’re responsible for the data. So there’s some back and forth there. And then of course, if you’re self-hosting, then that’s all on you.
Another key design principle here is to auto recover from failures. Automate all the things. Same thing here. As much as we can, we want to automate whatever steps need to happen to recover from a failure. And as we’ll go into, if you have really well-defined runbooks and playbooks, that will feed into this as well.
Another core principle, scale horizontally to increase availability. We talked about this in the Twelve-Factor app, where it’s all about you’re going to scale. You have these stateless processes and you scale by just adding more processes. That’s that horizontal scaling. We’re just going to keep adding these duplicate services that can do work and they’re all stateless.
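The horizontal scaling idea Chris describes, identical stateless workers that can each handle any request, can be sketched in a few lines of Python. The doubling handler is just a stand-in for real work:

```python
from concurrent.futures import ThreadPoolExecutor

def handle(request):
    # Stateless: the result depends only on the input, so any
    # identical worker can pick up any request.
    return request * 2

def process_all(requests, workers):
    # "Scaling out" is just raising the worker count; because no
    # worker holds state, duplicate workers are interchangeable.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(handle, requests))
```

Going from `workers=2` to `workers=4` changes throughput, not results, which is exactly why stateless processes scale so cleanly.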

Jon Christensen: Right. And speaking of ilities, that was one, availability

Chris Hickman: Indeed. Another one is stop guessing capacity. This goes back to one of the overall design principles of the Well-Architected framework, and we talked about this in last week’s episode, and it’s specifically in the reliability pillar as well. This is all about, as we talked about in the definition, dynamically acquiring resources to meet demand. That’s what this is. Take advantage of all the services and features in the cloud that allow you to spin things up, spin things down, go to that model where you’re only paying for what it is that you need, and you don’t need to over-provision and you don’t need to run out of capacity unless you’ve actually built that into your architecture erroneously.

Jon Christensen: Right. Well, but is it worth saying though that this doesn’t necessarily mean that you have to be able to write a system that can scale from five users to 5 billion with one architecture, but rather that for that then your system can handle scaling up and down for the maximum amount of workload that it can handle.
I’m thinking, for example, about you just may have a system that’s built on a regular RDBMS database and it can scale pretty big, but at a certain point you’re just going to need to add more. It’s that flexible architecture that the Well-Architected framework is also talking about. You’re going to have to add more components, certain kinds of caching, maybe other types of databases, in order to be able to scale beyond a certain point. It doesn’t matter if you’ve built auto scaling into it, there is going to be that place where it’s, okay, this can’t handle the load anymore. That’s not the same as not guessing capacity, but it is similar. It’s like, well, we’ll build this to handle up to 100,000 users and then after that we’re going to have to do something.

Chris Hickman: Yeah. You can think of it as that your workload’s going to go through evolutions and it’s going to go through milestones. So you may have an MVP and then v2, v3, v4. Each one of those may have a different architecture. Or you’re going to make changes to the architecture. But for each one of those milestones, you’re not guessing on capacity. That’s the core thing that should be going on.
And then last design principle is to manage your change with automation.

Jon Christensen: [inaudible 00:00:37:34].

Chris Hickman: [crosstalk 00:37:37]. Just automate, automate, automate as much as you can. For the design principles here, we talked about auto recover from failures, this is basically use automation to manage change as well. Failures and change, it’s just everything that can be automated, you want to automate it. Take people out of the equation. Because people make mistakes. People forget. We’re not good at doing those kinds of things versus that’s exactly what computers are good at.

Jon Christensen: And this is such a great business thing too. We talked about how this Well-Architected framework reflects back into the business and not just a pure technology part of the… Anyway, it’s not just technology.
If you think about Amazon itself and how it became such a powerhouse, it is because of that automate all the things mindset. As a bookseller, they started automating stuff and then they had made so much cool automation technology that they started selling it, and they’re just automating on top of that too. If you want to build another Amazon, there’s something to learn from that.
Even as we sit around saying, well, some of this stuff is maybe a little too much to reach for for a team with a lower budget, or a smaller team, or a team with certain kinds of risk tolerance, because they’re trying to grow fast or something, maybe they don’t have all the time to worry about game day type activities. But that one thing, automate all the things, any team can have that mindset and that can lead to amazing growth.

Chris Hickman: Yeah. And it’s also those are investments that end up usually paying off at some point. Again, maybe something that manually it takes you one hour a week to do versus to automate it might take 20 hours, but after 20 weeks, it’s now free. You’re now getting back time. And so now, that’s a win. You start doing those things consistently and you really get the network effect.
So key service for reliability, that would be CloudWatch. CloudWatch is going to give us insight into exactly what’s going on in the system. It’s going to give us events for when certain conditions are reached and whatnot. It’s going to help us scale our capacity. It’s going to tell us when latency is too high and we need to deal with that. CloudWatch is going to be just a core service here in the reliability pillar.

Jon Christensen: It also will be the cloud’s version of Cron. So it can schedule things to happen that maybe should be happening when the system’s not busy. All kinds of great stuff in CloudWatch.

Chris Hickman: Yeah, hats off to CloudWatch. It’s pretty amazing. Just if you think about what it has to do, it’s a really complicated, hard problem. And it’s so key to just about everything. If CloudWatch has any problems, just about everything breaks. Just the cron job part of it. Just having scheduled events, scheduled CloudWatch events, just that problem alone, for the number of events that it’s dealing with, and then each of those schedules, just that alone, trying to build a service to do that is a pretty monumental task. And that’s just a small slice of what CloudWatch does.
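As a rough illustration of the scheduled-events piece Jon and Chris mention, here are the parameters you might hand to CloudWatch Events (now EventBridge) for a nightly schedule. The rule name is invented, and the boto3 call is shown only as a comment rather than executed:

```python
# AWS cron expressions have six fields: minute hour day-of-month
# month day-of-week year, and "?" is required in either the
# day-of-month or day-of-week position. This fires nightly at
# 03:00 UTC, e.g. during a low-traffic window.
nightly_rule = {
    "Name": "nightly-maintenance",          # hypothetical rule name
    "ScheduleExpression": "cron(0 3 * * ? *)",
    "State": "ENABLED",
}

# In a real setup you would send this with boto3, roughly:
#   import boto3
#   boto3.client("events").put_rule(**nightly_rule)
# and then attach a target (a Lambda function, an ECS task, etc.).
```

The same dictionary shape also accepts `rate(…)` expressions, e.g. `rate(5 minutes)`, for simple fixed intervals.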

Jon Christensen: Yeah, I know that’s one of the most impressive services once you get to know it. At first it’s like, wait, this feels crappy for a logger. That’s not all it is.

Chris Hickman: Yeah. Well, don’t get me started on CloudWatch Logs. Focus areas. Three focus areas here. Foundations, change management, and failure management. This is again one of those head scratchers. It’s like, so we’re in the reliability pillar with three focus areas, and they’re foundations, change management, and failure management.

Jon Christensen: I can see failure management. That makes sense. In order to be reliable, you’ve got to make sure you’re managing failures and it’s a focus. There’s other ways to being reliable.
Change management. Now that feels just like AWS, like, “Man, remember that one time where we had a bad configuration and we blew out one of our regions.” So that’s probably what that’s about. Because they made a change to their configuration and totally brought down a region. And then foundations, I don’t know what that is. It doesn’t really speak to me.

Chris Hickman: Yeah. When I came across these I was just like, this is surprising, given that it’s the reliability pillar. It feels like it’s almost not enough surface area being covered by these three things. [crosstalk 00:42:57]. That said, you could say foundations ends up covering just about everything that’s not either change or failure. It’s foundation.

Jon Christensen: That’s right. Well, and reliability is responsible for the ever increasing costs of software too. I often talk about how software expectations just go higher and higher. Building a website is not just as easy as throwing out some HTML and you’re like, “Look, here’s my website.” So much more goes into it and the tools get harder. React.js is not an easy thing to learn. You don’t learn it in an afternoon.
So, why are these tools so hard? I think a lot of it comes back to this reliability piece, especially for backend developers. It’s just a black eye for anybody now to have a site go down, or to have a service go down, or to have a very visible error. That all comes into reliability. Or to have a deployment, or to have a release that just fails, and then to automate recovery of that and to limit blast radius, those are all expected things now, they’re not like, oh yeah…

Jon Christensen: … limit blast radius. Those are all expected things now. They’re not like, “Oh yeah, everybody makes mistakes.” It’s like no, a lot of companies don’t really make mistakes anymore. So which kind of company are you going to be? One that doesn’t make mistakes or one that does?

Chris Hickman: Yeah.

Jon Christensen: Yeah. So it gets harder, expectations get higher, software gets more expensive.

Chris Hickman: Yep. Yeah, indeed. Yeah. So maybe we can talk a little bit about some of the core AWS services for each one of these focus areas. So with foundations, some of those services are going to be things like IAM, VPC, Trusted Advisor. Trusted Advisor because of its visibility into service limits. So service limits are going to be pretty important from a foundational standpoint, right?

Jon Christensen: Yeah, yeah. You can bring down a system if you’re not paying attention. Yeah.

Chris Hickman: And then Shield, again, we talked about that previously. So this is going to protect you against things like distributed denial of service attacks. For change management, things like CloudTrail, AWS Config, for sure.
So for those people that are not familiar with AWS Config, that’s a way for you to track the state of your system and then any changes that happen to that state, get events on that. So really good.

Jon Christensen: So does CloudFormation use that as a dependent service? Because I noticed their CloudFormation tracks drift. Is that using AWS Config?

Chris Hickman: I don’t know for sure. I won’t be surprised either way if CloudFormation has its own way of doing it and Config is a whole different way. But yeah, they have drift detection, right?

Jon Christensen: Sure. This is what you made and now it’s changed [crosstalk 00:45:50]. You’ve changed your IAM policy. It’s like, that’s drift.

Chris Hickman: Yeah. Anything being managed by CloudFormation if it’s not configured that way in your environment, then that’s drift.

Jon Christensen: Cool.

Chris Hickman: And then CloudWatch is definitely part of this change management and auto scaling. And this is like the scaling up and down to meet demand, right? So that’s considered part of the change management philosophy.

Jon Christensen: OK. Makes sense.

Chris Hickman: For failure management, things like CloudFormation, S3, Glacier-

Jon Christensen: Glacier!?

Chris Hickman: Yeah. The reason here for failure is basically backup and DR strategies, right? So S3 is for backups and then Glacier’s for archives. So it’s all about having those copies of the data and being able to recover when failures happen. Right?

Jon Christensen: Sure OK.

Chris Hickman: Yeah. So yeah. So let’s talk about some best practices. In the foundations focus area, we talked about this a bit before, but take into account your physical and service limits for your workload, right?
So this is both at the network level, understanding things like CIDR ranges, and how many IPs you have, and dealing with service limits, both from how many API requests you can make in a certain time period to other infrastructure limits, like load balancers. Like how many load balancers can you have in an account or a region type thing.
So just really making sure that you’ve thought that stuff through and you’re not putting up any brick walls as your system evolves. Another best practice in the foundations is to really think about high availability, and what does that mean to your workload, and how are you going to architect it to have high availability?
So there’s a big, big, big topic and there’s lots of questions here to ask yourself and things to think about. But some of the major things to think about would be like making sure you don’t have any single points of failure, right? That would be a big, big problem, right?
To avoid a SPOF, you want to be making sure that you have a multi-availability-zone design, right? So you don’t want to be just in one AZ, right? You want to be in at least two and probably three. You want to take advantage of things like load balancing, you want to take advantage of auto scaling.
If you have a hybrid environment or VPNs or whatnot, you’re going to want to think about redundant connectivity. So if you use Direct Connect, you want to make sure you have redundancy on that. Same thing goes with VPNs and any other kinds of networking you may have.
And another big open area is software resilience. This goes to our applications, right? Like, making our applications more resilient, and that’s just a wide open … We covered probably at least four or five of the principles of the Twelve-Factor app, right? All are underneath this concept of software resilience.

Jon Christensen: Right. I wanted to mention that the availability stuff, a lot of it kind of comes for free with typical web applications when you’re using the Cloud. Like, it’s not that hard to do it. If you’re using the console to set up your database, you just click the right thing and boom, you got a high availability database. Or if you’re using ECS, you just choose to have more than one node running your containers and you got high availability.
And a lot of that stuff is super easy with the Cloud. Like, it’s almost harder to get it wrong and that’s on purpose, right? And you mentioned if you’re running a hybrid environment, that’s when you got to think about having additional ways of … More than one way to have internet connectivity.
And that got me thinking about some of our own projects and just in particular IoT. And it’s like whenever I’m thinking about availability and connectivity, it’s often because I’m worried about little IoT things, like little devices that are here and there, trying to send data to AWS.
That’s when I’ve gotten wrapped around the axle in recent years when it comes to availability and dealing with it because the Cloud takes care of so much of the other stuff.

Chris Hickman: Yeah. I mean, there’s no doubt. I mean, we build mobile apps too, right? So building mobile apps that actually still do the right thing when they’re transferring between cell towers, right? They lose coverage and they go off of wifi to LTE. So, what happens? People go into airplane mode, right? So it’s definitely-

Jon Christensen: It seems like that’s-

Chris Hickman: … had some challenges.

Jon Christensen: Yeah. And that’s where some of the hard thinking about that stuff happens now, more so than … As long as you just follow some best practices around not having single points of failure in the Cloud, the rest of it … And then running good software, or actually having applications that don’t fall over because of their own bad code, then that availability problem takes care of itself in the Cloud.

Chris Hickman: Yeah, I mean I definitely agree with you. The Cloud makes it a lot easier to build a workload that has high availability. But that said, it’s still not easy. I mean, I would bet that the majority of workloads in the Cloud do not have all of these basic characteristics. Right?
So there are architectures out there that have single points of failure, there are architectures out there that they’re not multi-AZ, right? It’s all on a single AZ. There are ones that are not doing auto scaling or the very least they’re not scaling up and down automatically. It’s a manual thing.
And then obviously in software resilience, being able to handle services thrashing and crashing. I mean, even silly things like session state still trip people up. So, again, it’s kind of easy to take all this stuff for granted, like it’s kind of easy to build a high availability system. But in practice I think people still find it challenging.

Jon Christensen: Mm-hmm (affirmative). Mm-hmm (affirmative).

Chris Hickman: All right, so continuing on best practices in the failure management focus area. Things here are just backup and disaster recovery and having those plans. Thinking about your RPO, which is your recovery point objective and your RTO, which is your recovery time objective, understanding what those are and what you’re designing for, you should be thinking about that.
So recovery point objective is really just saying, what is the increment of data that you can afford to lose? So you might say we have to have data within the last 30 minutes. That’s as much as we’re willing to lose. Or it can be five minutes, right? But that’s your recovery point objective.
And then your recovery time objective is how long is it going to take you to recover when this does go bad? And so if you have backups on S3, then it’s like how long is it going to take to detect the failure, get the appropriate backup, and then get that backup restored? And that would be your time, your RTO, your recovery time objective.
And that might be different if your backups are actually archived to Glacier or something like that. So keep those things in mind. Have your disaster recovery plan in place. And then obviously you want to have game days and practice going through those kinds of processes.
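The RPO/RTO arithmetic Chris walks through can be sketched in Python. The durations below are purely illustrative, not recommendations:

```python
from datetime import datetime, timedelta

def data_loss(last_backup: datetime, failure: datetime) -> timedelta:
    # Worst-case data loss: everything written after the last
    # backup is gone. This window must stay within your RPO.
    return failure - last_backup

def estimated_rto(detect: timedelta, fetch: timedelta,
                  restore: timedelta) -> timedelta:
    # Recovery time is the whole pipeline: notice the failure,
    # pull the backup (slower from Glacier than S3), restore it.
    return detect + fetch + restore

rpo = timedelta(minutes=30)
loss = data_loss(datetime(2019, 6, 1, 12, 0), datetime(2019, 6, 1, 12, 20))
meets_rpo = loss <= rpo  # True: 20 minutes lost against a 30-minute RPO
```

Swapping the `fetch` term from an S3 download to a Glacier retrieval (hours instead of minutes) is exactly why archived backups change your achievable RTO.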

Jon Christensen: Very cool. Yeah.

Chris Hickman: Yeah. And then another thing, best practice and failure management is inject failures into your system to test resiliency.

Jon Christensen: Yeah.

Chris Hickman: So this is basically chaos engineering and you can go crazy with it. It’s again, pretty sophisticated. It’s not really new anymore, right? It’s been around … Like, the concept has been around for five plus years, especially thanks to Netflix and its Simian Army. But definitely as a best practice, I think this is what we can aspire to do.
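A toy version of failure injection, just to make the idea concrete: wrap a dependency call so it sometimes raises, the way Chaos Monkey-style tooling does at the infrastructure level. This is a sketch of the concept, not any particular tool’s API:

```python
import random

def chaotic(func, failure_rate=0.1, rng=random):
    # Wrap a call so it occasionally fails, the way a network
    # dependency might in production. Only useful once the caller
    # already retries or degrades gracefully, as discussed above.
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("injected failure")
        return func(*args, **kwargs)
    return wrapper
```

Passing a seeded `random.Random` instance as `rng` makes test runs reproducible, and a `failure_rate` of 1.0 or 0.0 pins the behavior for unit tests.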

Jon Christensen: Yeah, yeah. I like it. Every time I hear about it it’s like I want to do it and I’ll admit we haven’t done it yet, but it seems fun. Seems like a fun coding challenge [inaudible 00:55:06] overcome some of the things that a chaos type testing could result in.

Chris Hickman: Yeah. I mean, it really doesn’t make sense to do it until you’ve got the fundamentals down, right? I mean, you have to be able to recover from failures automatically.

Jon Christensen: Right, yeah.

Chris Hickman: You have to have software that’s resilient to these things. And so otherwise if you don’t have that stuff in place, then inject failures all you want. It’s just going to crash the whole thing, right?

Jon Christensen: Yeah, yeah.

Chris Hickman: That’s not going to be too terribly useful.

Jon Christensen: Yes, that’s the thing. Right.

Chris Hickman: You have to build this thing out with the idea that failures are going to happen and this thing will continue to run when these failures do happen.

Jon Christensen: Mm-hmm (affirmative).

Chris Hickman: Cool.

Jon Christensen: Cool.

Chris Hickman: So to wrap this up with reliability, we can just talk about maybe some of just the key points to keep top of mind. So one is, plan your network topology. So again, don’t design yourself into a box by creating one subnet with the entire 64K address space.
Think carefully about your CIDRs, make sure if you’re going to be having peered VPCs or other networks like hybrid networks and on-prem networks, that you don’t have collisions between your subnets. Just plan your network topology because it is really kind of … It’s important and it could turn out to become really important down the road. And it’s definitely one of the first things you do, so-

Jon Christensen: If you’re planning a network topology, I mean we don’t really have time to get all the way into this, but maybe there’s a little rule of thumb that would be easy for people? Like, I think it would be safe to say it’s easier to grow a subnet than it is to shrink it.
So maybe think of it from that perspective? There’s a limited IP address space, you don’t want collisions, you don’t want to run out of IP addresses. But if you give yourself 16 or 24 or 32 IP addresses and you discover you need 35, that’s not a hard change to make.

Chris Hickman: It depends on a lot of things, right? It just depends on-

Jon Christensen: Sure, sure.

Chris Hickman: … everything else and how you defined it or what’s the interrelationship between these things, what you’re putting on the subnets. So I mean, ideally you just want to … You want to rightsize them and you don’t want to have to-

Jon Christensen: No.

Chris Hickman: -change them.

Jon Christensen: And then you’re going, wait a minute though, that’s so against the well-architected framework rules. Didn’t we say in the very beginning that we don’t want to guess at capacity needs? Like, literally that is a tenet. It’s so funny that it’s like wait, but actually make sure you have enough IP addresses.

Chris Hickman: Well, this is a little bit different because this is your network topology. And so your network is not going to scale up and down elastically, right? It just doesn’t work that way.

Jon Christensen: But it does. Isn’t that computers? And computers, don’t they use an IP address? And wait, now you need to know how many computers you’re going to need.

Chris Hickman: Right. But you have things like your mask, right? Your CIDR address space, right? And that’s kind of a fixed thing and you can’t … I mean, sometimes you can’t even change that, right? I guess an example would be like if you’re in a VPC, if you make a subnet too big, well then you’re kind of hosed.
You can’t add any more addresses, you can’t add any more subnets because the space is being taken up by the other one.

Jon Christensen: And that’s where I was going with my original comment. Make them small and then you can increase their size later if you have to.

Chris Hickman: Yeah, just as long as there’s the … Again, I think there’s a bunch of different factors that go into play there. I think it’s kind of easy to think … You can think of your network topology in the form of building blocks. Also, it’s okay to … I mean, there’s some size to a subnet that makes sense to you. And it could be like it’s a subnet that’s 256 IPs.
And that’s a good building block size. Or it could be 512, or it could be 1K or something like that, but it probably shouldn’t be like 32K. Like, putting 32K addresses on a subnet doesn’t make a lot of sense. And likewise, 16 is probably too small, and same thing with 32 and even 64.
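The subnet-sizing math here is easy to check with Python’s standard ipaddress module. The 10.0.0.0/16 VPC range is just an example, and the “minus 5” reflects the addresses AWS reserves in every subnet:

```python
import ipaddress

# Carve a VPC's /16 into /24 "building block" subnets. A /24 holds
# 256 addresses; AWS reserves 5 per subnet (network, broadcast,
# router, DNS, future use), leaving 251 you can actually assign.
vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = list(vpc.subnets(new_prefix=24))

block_size = subnets[0].num_addresses   # 256 addresses per /24
usable_in_aws = block_size - 5          # 251 after AWS reservations
total_blocks = len(subnets)             # 256 such blocks fit in a /16
```

Swapping `new_prefix=24` for `new_prefix=23` gives 512-address blocks, which is the quickest way to compare the building-block sizes being debated here.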

Jon Christensen: So maybe the thing is, use some other company’s crappy way of doing it. Like, think about it the way some other company does. Like AWS has a thing where they kind of limit any subnet to have 5000 IPs to limit blast radius. So if you need more than 5000, then that’s another deployment, right?
And then the deployments may need to talk to each other through some other system, like network to network type communication. But I think that’s a bigger architecture, right? So you’re not like, “Oh, I need to be able to scale from one to 100,000 machines in this horizontal scale up that I’m doing.”
No, it’s like I’m going to scale from one to 5000 and then I’m going to scale again. I’m going to make another one and it’s going to scale from one to 5000. Then I’m going to make another one and it’s going to scale from one to 5000.
And that way you can kind of meet the tenet, because that’s what I’m sort of joking about. Like okay, you’re not guessing at capacity means you’re just setting up something that’s able to scale. And then also a network that can kind of handle scaling again, and again, and again, and again as needed.

Chris Hickman: Mm-hmm (affirmative). Yeah, so it’s not this kind of elastic principle that we’re kind of used to with things like compute and storage and whatnot. But you can think of it as just building blocks and it’s really easy to add on these additional blocks as you need them.
And again, the takeaway here is just think it through. Like, understand what those trade offs are, understand what limitations you might be putting in at the very beginning just by when you’re setting up your VPC, right? There’s been many, many people that have been burned by that when they set up their VPC, then find out down the road it’s like, “Oh no. Now I have to go create another VPC.”

Jon Christensen: Right, right. And VPCs really are the root of all evil when it comes to AWS because not only do they catch those people that you just mentioned, but you’re trying to have a podcast and keep it to 40 minutes. And then you start talking about VPCs and boom, you just blew out in the podcast.

Chris Hickman: Yeah, indeed. So moving on, some other key points. Just make sure you manage your AWS service and rate limits. So know what those are. Again, make sure you’re not putting up any artificial walls that are going to block you as you grow. Monitor your system. So CloudWatch is instrumental here.
Just really know what’s going on in your system so you know when you do have to scale up or when you had to scale down. Know when things are failing, know when things change. Automate your responses to demand. So again, take advantage of those services, those features that will provide that elasticity.
So use auto scaling, have the scale up, the scale down events. Use some of the managed services where it makes sense, that just have basically infinite scalability. So Lambda, look at things like Fargate, DynamoDB. I mean, all these things really help with that so that you don’t have to manually deal with demand.
And then final one would … Obvious backup, right? So you’re not going to deal well with failures if you’re not backing up. So backup, restore, disaster recovery. Keep all that top of mind and have plans for it and practice.
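As a sketch of “automate your responses to demand,” here is roughly what a target-tracking scaling policy for an ECS service looks like. The cluster and service names are invented, and the actual boto3 call is left as a comment rather than executed:

```python
# A target-tracking policy keeps a CloudWatch metric near a target
# value by adding or removing capacity, so scale-up and scale-down
# happen without anyone watching a dashboard.
scaling_policy = {
    "ServiceNamespace": "ecs",
    "ResourceId": "service/my-cluster/my-service",   # hypothetical names
    "ScalableDimension": "ecs:service:DesiredCount",
    "PolicyName": "cpu-target-tracking",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        "TargetValue": 50.0,  # aim for ~50% average CPU across tasks
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
    },
}

# Roughly, after registering the service as a scalable target:
#   import boto3
#   boto3.client("application-autoscaling").put_scaling_policy(**scaling_policy)
```

CloudWatch then emits the alarms that drive the scaling, which is why Chris keeps coming back to it as the key service for this pillar.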

Jon Christensen: That’s good. All right. Two more pillars down. So we’ve made it through operational excellence, security, and reliability. We’ve got two more to go next week.

Chris Hickman: We do.

Jon Christensen: Looking forward to it.

Chris Hickman: All right. Sounds good.

Jon Christensen: Thanks so much for this week’s education, Chris. Talk to you next week.

Chris Hickman: Yeah, all right. Thanks. See you.

Voiceover: Nobody listens to podcast outros, why are you still here? Oh, that’s right. It’s the outro song. Come talk to us at mobycast.fm or on Reddit at r/mobycast.
