42. The Birth of NoSQL and DynamoDB – Part 4

What’s under the hood of Amazon’s DynamoDB? Jon Christensen and Chris Hickman of Kelsus continue their discussion of DynamoDB, this time focusing on its architecture and components. They draw on a re:Invent presentation titled Amazon DynamoDB Under the Hood: How We Built a Hyper-Scale Database.

Some of the highlights of the show include:

  • Partition keys and global secondary indexes determine how data is partitioned across storage nodes; this is what allows DynamoDB to scale out instead of up (see the sketch after this list)
  • Understand how a database is built to make architecture/component definitions less abstract
  • DynamoDB has four components:
    1. Request Router: Frontline service that receives and handles requests
    2. Storage Node: Services responsible for persisting and retrieving data
    3. Partition Metadata System: Keeps track of where data is located
    4. Auto Admin: Handles housekeeping aspects to manage system
  • What level of uptime availability do you want to guarantee?
  • Replication: Strongly Consistent vs. Eventually Consistent
  • Walkthrough of Workflow: What happens when, what does it mean when…
  • DynamoDB architecture and components are designed to improve speed and scalability
  • Split Partitions: The longer your database is up and the more data you put into it, the more likely you are to get a hot partition or partitions that are too big
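
As a rough illustration of the first highlight above, here is a minimal sketch, using the boto3 Python SDK with hypothetical table and attribute names, of how a table declares the partition key, sort key, and global secondary index that drive how data gets partitioned:

```python
import boto3

dynamodb = boto3.resource("dynamodb")

# Hypothetical "Concerts" table: the partition key decides which storage
# partition an item lands on, the sort key orders items within that
# partition, and the GSI is partitioned separately on its own key.
table = dynamodb.create_table(
    TableName="Concerts",
    KeySchema=[
        {"AttributeName": "band", "KeyType": "HASH"},   # partition key
        {"AttributeName": "date", "KeyType": "RANGE"},  # sort key
    ],
    AttributeDefinitions=[
        {"AttributeName": "band", "AttributeType": "S"},
        {"AttributeName": "date", "AttributeType": "S"},
        {"AttributeName": "venue", "AttributeType": "S"},
    ],
    GlobalSecondaryIndexes=[{
        "IndexName": "venue-date-index",
        "KeySchema": [
            {"AttributeName": "venue", "KeyType": "HASH"},
            {"AttributeName": "date", "KeyType": "RANGE"},
        ],
        "Projection": {"ProjectionType": "ALL"},
    }],
    BillingMode="PAY_PER_REQUEST",
)
```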

Links and Resources

DynamoDB

re:Invent

Amazon DynamoDB Under the Hood: How we built a hyper-scale database

Paxos Algorithm

Amazon S3

Amazon Relational Database Service (RDS)

MongoDB

JSON

Kelsus

Secret Stache Media

Rich: In episode 42 of Mobycast, we continue our conversation on DynamoDB, this time diving deep into its architecture and components. Welcome to Mobycast, a weekly conversation about containerization, Docker, and modern software deployment. Let’s jump right in.

Jon: Welcome, Chris. It’s another episode of Mobycast.

Chris: Hey, Jon. How are you doing?

Jon: Good. We’re missing Rich today. Hopefully, we’ll be able to put this together. His producer talent is much needed and when we don’t have him, I do feel a little lack of proper orientation but we’ve got to get this done this week, so we’re charging on without him. We’ll see you next week, Rich.

Chris: Sorely missed.

Jon: Yes. We’ve been doing this series on DynamoDB. My favorite episode of the series was the first one when you talked about just your own personal relationship with the world of internet-scaled databases and how you did a startup that built a product which DynamoDB later referenced as prior art. That was my favorite.

After that, we did one just about what DynamoDB is all about, and we got into a high-level overview of it. We talked a lot about the database elements last time: tables and items, partition keys, sort keys, and secondary indexes as well.

I think that just even saying that, this time I do feel like I have more to learn there and I may need to even go back and listen to what we’ve talked about before. But this time, we want to talk more about the architecture and components of how DynamoDB is put together.

Before we do that, one last thing that we’ve talked about last time is partition keys and global secondary indexes and how they determine how data is partitioned across storage nodes. I think that that is going to be fundamental to talking about the architecture. If we keep in mind that data is partitioned across storage nodes, a key feature of being able to scale out as opposed to scaling up, then having that in mind, this episode will talk about the architecture, and I think that that concept will come up probably again and again, wouldn’t you say, Chris?

Chris: Yeah. A lot of those database elements that we’ve talked about definitely tie back to the architecture; the architecture decisions and the components that DynamoDB has directly support them.

Jon: I think one of the things that will be nice about today’s episode is talking about how the whole database is put together, as we continue to talk about tables and items, partition keys and sort keys. Understanding how the database is built will help us understand those things. As a user, if you don’t get into understanding how the thing is put together, then some of those definitions, like what is a partition key, feels a little abstract.

Let’s talk about the real stuff under the hood. It looks like our first thing to talk about is the architecture and components. It looks like the first thing you’ve listed here, Chris, is the request router.

Chris: Yes. Maybe just to back up a little bit. We’ve talked about re:Invent. Both you and I went there and attended a lot of interesting discussions. This whole series was kicked off based upon, in part, Werner Vogels’ keynote where he talked about his first day at Amazon.

Again, thousands of sessions there at re:Invent. One of the ones that I went to was titled Amazon DynamoDB Under The Hood: How We Built A Hyper-Scale Database. When I saw this in the catalog, I was like, “Oh, this is awesome. This is really great.”

What it was is a talk by one of the senior principal engineers on the DynamoDB team. They promised to disclose for the first time the underpinnings of DynamoDB and how they run a fully managed non-relational database used by more than 100,000 customers. The talk covers the underlying technical aspects of how an application works with DynamoDB for authentication, metadata, storage nodes, streams, backup, and global replication.

It felt like a real privilege, a treat where Amazon was opening up the kimono, if you will, and giving a deep dive into how DynamoDB has been architected, constructed, and how it works. I was very interested in this, given my background, to be able to see how it compares and contrasts. That’s what we’ll talk about today: basically a recap of that session, going into the various components of the system that they covered.

Jon: All right, wonderful. I’m excited to do it, so let’s get started.

Chris: Sure. First off, we can talk about, at a high level, the components that make up DynamoDB. For our purposes, there are four components that we’ll discuss. One is the request router. The request router is basically the frontline service that receives requests, handles them, and is responsible for basically all the primitives that DynamoDB supports.

Another component is the storage node. These are the actual services responsible for persisting data and retrieving it. Then there’s the partition metadata system. This is the part of the system that keeps track of what data is where, which is super important. It’s basically the glue that allows the request router to know which storage nodes to talk to. That’s a pretty key part of the system.

The other one that we can talk about is what they call auto admin. This is a management component that does all the housekeeping necessary for running a system like this. It deals with partition splits (we’ll talk about what that means), makes sure the metadata is up-to-date, provisions tables, and a bunch of other things. Basically, there’s a lot of housekeeping that goes on in a system that’s this dynamic, and that’s what auto admin is there for. Those are the four components that we’ll talk about.

Jon: Okay. Hearing there are four components, we said request router, we said storage node, we said partition metadata system, and we said auto admin. Just logically, my mind was expecting there to be something like a query executor, a place where the compute gets done for an individual query. I assume that’s sort of a combination of the partition metadata system and the storage node: the partition metadata system basically saying, “Hey, these are the two storage nodes that are going to do this work for you,” and the storage nodes then doing the work of going and finding the stuff on a disk and getting it back. I was waiting for you to say, “And there’s the query execution module,” or something like that, the thing that does the work of getting stuff.

Chris: Yeah. I think it basically is handled by the mesh of the system itself. I’m not sure what you’re alluding to there.

Jon: […] spread across some of the different things that you mentioned, these four pieces?

Chris: Yeah. I mean, in a way. Again, the front end is these request routers, and these are what clients are talking to. Clients shouldn’t have to know how stuff’s partitioned, where things are, and whatnot. Request routers, that’s their job. They take these incoming requests, whether they be PUTs or GETs, queries, DELETEs, whatever that request is, work in concert with the metadata system to figure out the right storage node to go talk to, and then forward that request to the storage node.

Jon: I think one of the answers is right there in the name, request router. It’s literally routing to where the data is, routing to the storage nodes, am I understanding you right? Otherwise, what is the purpose of the word ‘router’? What are we routing to?

Chris: That’s exactly what I mean. The primary purpose of the request router is, given an input request, to consult the metadata system, figure out where the data for this lives, and then forward that request on. It requests the data from that particular node, retrieves the results, probably does some additional processing after they come back, and then sends them off to the client as a response.

The storage node is not just a disk. There’s software there for sure. When a request comes in looking something up by key, the storage node is consulting its index and going and retrieving that particular piece of data.

Jon: Yeah, and I think that that’s what I’m trying to get my head around. Maybe I’m thinking about things in terms of how load balancers work, or how other clustered applications work, or traditional database clusters. Originally, I was imagining, “Oh, the request router is going to get me to some machines because maybe it’s distributing load or something,” but what it’s really more like is this: even though when I go and set up my DynamoDB table, I’m saying in the console or via the API, “Hey, make me a table,” I don’t really see multiple machines or nodes becoming involved in the creation of that table. It’s all totally a black box to me.

But behind the scenes, maybe I’ve made a table of concerts coming up over the next five years, and all the concerts of bands from A to G are in one node, H to K are in another node, and everything above that is in a third node, just for the sake of argument. Then I make a request for a band whose name starts with the letter R. That’s going to be in the third node, the request router is the thing that’s going to get me to that node, and that’s where the work of getting the information out of the table is going to happen.

It’s hard to get my head around because I thought I just made a table and I would just be talking to a system that knows about that table. But it sounds like behind the scenes I’m talking to a number of different nodes, each of which may contain parts of my table?

Chris: Yeah, and it definitely goes back to one of the key principles of NoSQL and why relational databases have problems scaling. In order to scale, you really have to scale out, and to scale out means you have to partition data. That’s really one of the core fundamental things that DynamoDB does. These front-end request routers are basically like partition managers or partition consultants, or whatnot. They’re the ones responsible for taking these input requests and saying, “You know what? Your data could be spread over 10 different storage nodes or something like that, each responsible for part of it. Okay, which one is it?” and then routing to it. It’s almost like you don’t really have one database. It’s like you have five, and the request router is the one figuring out which one the request should go to.

From that standpoint, it really is a router. It’s almost like DNS in a way. Basically, that partition metadata system is the routing table. The request router is consulting those routing tables to figure out what address it should go talk to for this particular request. It’s looking at that partition key or a global secondary index and saying, “Okay, what does that actually map to?” It consults the partition metadata system and then routes the request accordingly.
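
To make the routing-table analogy concrete, here is a toy sketch in Python. Everything in it is invented for illustration (the hash function, the layout of the partition map, the node names); DynamoDB’s actual partition metadata service is not public.

```python
import bisect
import hashlib

# Invented partition map: each entry is (upper bound of the key-hash range,
# replica group [leader, secondary, secondary]) that owns that range.
PARTITION_MAP = [
    (0x55555555, ["node-a1", "node-a2", "node-a3"]),
    (0xAAAAAAAA, ["node-b1", "node-b2", "node-b3"]),
    (0xFFFFFFFF, ["node-c1", "node-c2", "node-c3"]),
]

def route(partition_key):
    """Hash the partition key and look up the storage nodes that own it."""
    key_hash = int.from_bytes(hashlib.md5(partition_key.encode()).digest()[:4], "big")
    upper_bounds = [upper for upper, _ in PARTITION_MAP]
    index = bisect.bisect_left(upper_bounds, key_hash)
    return PARTITION_MAP[index][1]

print(route("band#R"))  # e.g. ['node-b1', 'node-b2', 'node-b3']
```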

Jon: Right on. I think this is making sense to me. I have so many questions, though, but let’s try to push forward.

Chris: Sure. We talked quite a bit about the request router. Maybe some additional things to talk about with the storage nodes: there are two main types of storage nodes. There’s the leader, and then there are the secondaries. You can think of the leader as the master. This is the one that always has the most up-to-date version of the data. Then, there are two secondaries. These are the replicas. It’s three storage nodes, essentially, for every partition of your database. These replicas receive the updates from the leader. That’s something to keep in mind. As I mentioned, the leader propagates those writes to its peer secondary storage nodes.

A big part of this is, “Hey, we need to know what happens when there’s a problem. What if the leader fails? What if a secondary fails?” All that stuff has to be thought through, and there need to be ways to handle it. In this particular talk, they shared that the leader sends out heartbeat messages every 1.5 seconds to its secondary nodes. If the secondary nodes don’t receive those heartbeats, the system will perform an election to basically elect a new leader. They use an algorithm called Paxos, which has been in the academic domain for quite some time. There have been a few papers written describing it. It’s a pretty common algorithm for dealing with these distributed systems. When you need to elect a leader, a primary, a master if you will, they use Paxos.
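
Here is a tiny sketch of the failure-detection side of what Chris describes. The 1.5-second heartbeat interval comes from the talk; the missed-heartbeat threshold and the structure of the code are assumptions for illustration, and a real election would run Paxos among the replicas rather than this stub.

```python
import time

HEARTBEAT_INTERVAL = 1.5  # seconds, per the re:Invent talk
MISSED_HEARTBEATS = 3     # invented threshold for illustration

class SecondaryNode:
    def __init__(self):
        self.last_heartbeat = time.monotonic()

    def on_heartbeat(self):
        # The leader is alive; remember when we last heard from it.
        self.last_heartbeat = time.monotonic()

    def should_start_election(self):
        # If the leader has gone quiet for a few intervals, it is time to
        # trigger a leader election (the real system uses Paxos for this).
        silence = time.monotonic() - self.last_heartbeat
        return silence > HEARTBEAT_INTERVAL * MISSED_HEARTBEATS
```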

Jon: At this point, we have storage nodes, and each partition has a leader and two secondaries. It feels like that stuff, which is part of the overall architecture of DynamoDB, is really solving for the availability piece and the reliability and resilience of the data. It feels like the solution there is really similar to what we may already be familiar with from relational databases: “Oh, I’m going to make a leader and two followers of my relational database, read replicas, and that’s how I’m going to make sure that if the main ever goes down, I’ll have backups.” But it feels like that piece is, “We’re going to do that for each partition of the DynamoDB table, and who knows how many partitions we have? We have partition keys that decide how the partitions get divvied up. We don’t know how many there are, but for each one, we’re going to have this clustering architecture behind it so we can trust that our data’s going to be super reliable and just there.” Is that a fair description of what’s happening at the storage node level?

Chris: This is one of the patterns for distributed systems. When you scale out, that means you’re talking clusters and you’re either clustering stateful or stateless resources and both of those have very common patterns. It’s just different for each application. You’re just going to be implementing that same pattern in whatever way makes sense for your particular domain.

Just like RDS. If you want to run your relational database in a safe, secure, reliable way, then yeah, you better have replicas, you better be cross-AZ, and you have to know when failures happen, what failover means, and how that works. Same thing goes for S3. Everything is replicated across three AZs.

Jon: That’s the subtlety that I’m getting at. It’s hard to put my finger right on it, but the subtlety is that, right now, the part of the DynamoDB architecture that we’re talking about is really related to the fact that it’s a managed service. These decisions about how to provide availability have been made by Amazon. Whereas, if we were doing Mongo or something and we wanted each partition to have four nodes, we could do that.

If DynamoDB wasn’t a managed service and was instead something you could just install somewhere, then we might decide that our needs really call for more than one leader and two secondaries per partition. But that’s not something we get to do, because DynamoDB is only a managed service and this is how Amazon has set it up.

Chris: Yeah, absolutely. This all goes to what level of uptime availability they want to guarantee. This is why they do sometimes offer different options. You can go S3 One Zone if you want. You pay a little bit less, but you don’t get as good availability as the default that goes across three zones. With Dynamo, there are no options there right now, but a database is definitely pretty core, pretty critical, so this is something most people would want anyhow.

Jon: You would hope, but they totally wouldn’t be. They’d be like, “Oh, yeah. I think I have a read replica and we do backups that way. Well, we don’t do backups, but I think they’re automated. We’ve never tested a restore.” That’s what really happens in the real world.

Chris: Of course, if someone like Amazon did that, then the first outage would be front-page news everywhere. Obviously, Amazon knows how to run things, and they have learned from experience and whatnot. With a managed service, they get to decide how it’s architected, how many replicas and duplicates there are, what all of that looks like, because they also have to keep it up and live up to the SLA they’re promising.

Jon: You have a note down here about replication, which we talked about already, but you also have another about strongly consistent versus eventually consistent?

Chris: Yeah. With just about any kind of system like this where you have replication going on, there’s almost always this question: do you want strongly consistent replicas or eventually consistent replicas? There are trade-offs between the two, between performance and that consistency model.

Strongly consistent, in general, usually means we’re not going to call this operation done until it has been replicated to all the replicas. It takes longer to commit that request because it has to do multiple writes and it has to wait for—

Jon: Potentially locking up, locking things up longer, if you need to prevent other writes from happening while you write that one thing or other reads from happening while you write that stuff.

Chris: Yeah. Depending on how it’s architected, chances are you don’t necessarily have contention on that particular row that you’re writing, but it may very well cause contention at the request routing level. So that’s strongly consistent.

Probably the more normal mode for systems like this is eventually consistent. Eventually consistent just says, “Hey, we’re going to write to the master, make that commit, report back that this request is done, and return to the caller, while the replication happens after the fact.”

It may be possible that when the caller gets back the response saying, “Yep, your write happened,” if they actually tried to read from one of the replicas, the data wouldn’t be there yet. It would still be the old version because the replication hasn’t happened. But eventually, it will make it there. That’s eventually consistent.

This reduces the number of write operations you have to do whenever you make a mutation, but it comes at the cost that you may have inconsistencies in your data for some short period of time. It sounds kind of bad to say, “Oh, eventually the data will be there. Eventually, it will be consistent.” But in practice, it’s really not a huge problem because you’re already dealing with a very dynamic system.

If you’re doing something like a shopping cart and trying to do a transaction or something of that nature, then the consistency problem becomes more important. But say you’re doing something like, “Hey, I created a new message,” or you’re in a chat application and your clients are requesting the chat history. One person may make a request and see the new message while the other one doesn’t see it yet. But then they request the messages a few seconds later and there it is. To the end user, it was no big deal, or they didn’t really notice anything at all, right?
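
For what this looks like from the caller’s side, here is a minimal boto3 sketch with hypothetical table and key names. DynamoDB reads are eventually consistent by default, and you opt into a strongly consistent read per request.

```python
import boto3

table = boto3.resource("dynamodb").Table("ChatMessages")  # hypothetical table

key = {"room_id": "general", "sent_at": "2018-11-28T10:15:00Z"}

# Default: eventually consistent read; it may briefly miss the newest write.
maybe_stale = table.get_item(Key=key)

# Strongly consistent read: reflects all acknowledged writes, at the cost of
# higher latency (and, on provisioned tables, double the read capacity).
fresh = table.get_item(Key=key, ConsistentRead=True)
```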

Jon: Right and everybody just refreshes their browser. If they don’t see the thing that they’re expecting, it’s sort of already ingrained from age three years old onwards to try to refresh whatever you’re looking at. So, people are kind of trained in advance to deal with an eventual inconsistency.

Chris: Yup. Swipe down to see the spinner.

Jon: Yup.

Chris: Maybe we should talk a little bit about a walkthrough. This is actually how they started off the talk, just doing a walkthrough of the operations to get a piece of data or put a piece of data. It’s a lot easier to do this in diagram format. That’s one of the reasons why we talked through it this way, hopefully painting a better mental picture before we walk through it.

We’ve probably already touched on this a little bit, but just walking through it: what does it mean to do a GET? Maybe you have, again, a table of chat messages or something like that, and you’re now requesting the most recent chat message. So, what happens there?

In that particular case, the DynamoDB client makes its call. It gets sent to a front-end request router, and the request routers are obviously duplicated and behind load balancers so that they scale out. The request router receives that request from the client. It’s then going to authenticate the caller, so it does that dance to make sure this particular caller is authorized to do this.

After that, it’s going to consult the partition metadata system. It’s going to look at the item being requested: where does that live, based upon the partition key or the global secondary index? Again, it’s doing the equivalent of looking at a routing table: where should this go? Once it figures that out, the request router will go and retrieve that item from its assigned storage node. That’s essentially the workflow for a GET request.
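
From the client’s perspective, that whole GET workflow is hidden behind a single call. A sketch, again with hypothetical names, of asking for the most recent chat message:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("ChatMessages")  # hypothetical table

# One client call; the request router, partition metadata lookup, and
# storage-node read all happen behind the scenes.
response = table.query(
    KeyConditionExpression=Key("room_id").eq("general"),
    ScanIndexForward=False,  # walk the sort key newest-first
    Limit=1,                 # just the most recent message
)
latest_message = response["Items"][0] if response["Items"] else None
```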

Jon: In my mind, I’m just trying to picture how this works. I imagine that the request router has the query, it’s able to figure out what the table is from the query, and it maybe uses the partition metadata system to figure out which storage nodes the data that the query is requesting is going to be on.

But then in my mind, it’s forwarding the whole part of the query that’s relevant to the particular storage node to that storage node, and that that storage node itself is doing some work, like, “Oh, okay. I just got this request for this data. That’s my data and I know how I’ve got my data organized, so I’m going to reach into the file system at this location and grab the next 10 blocks of data or whatever based on my understanding of how the data is laid out inside of my node, and then I’m going to send it back.”

I guess what I’m getting at is that, in my mind, the storage node is doing the work of knowing where the data is on disk and retrieving it, as opposed to just holding the data while the request router or the partition metadata system says, “Go get it.” You know what I mean? In one sense, the master overall computer could say, “Go into my network file system and grab this out,” and in the other, it’s actually sending the query on and the storage node is going, “Okay, now I’m going to go inside the disk and get the stuff I want out.” You see how that works?

Chris: Yeah. Again, don’t think of the storage node as just the disk. It really is a database. There’s a database engine there and all the things that go along with that. It has indexes. There’s B-tree indexes for all the indexes on that particular table. It uses that to figure out how to go and efficiently retrieve that data. It’s a full-out database engine.

Jon: Okay, cool. Now I have a better mental picture.

Chris: Awesome. The next walkthrough is just the corollary: how you do writes. So, what does a PUT look like in this system? A lot of it is pretty similar, especially the first part. The DynamoDB client makes the call, it hits the front-end request routers, the authentication happens, the router consults the partition metadata system to figure out who the leader is, the storage node leader for that particular piece of data, and then the request router forwards that write request to the storage node leader.

The leader then replicates that request to its secondary nodes, and the acknowledgment of the request is sent back afterward. This is the eventual consistency model that’s the default. It sends back the acknowledgment after the write has happened on the leader and it’s been replicated to at least one secondary. It does this in case of failure, so there’s never just one copy; there need to be at least two. Some other systems use things like transaction logs, and based upon this talk it wasn’t totally clear exactly what they’re doing, but the default behavior is to write to the leader. The leader also replicates to the two secondaries, and it will return that confirmation once the write has reached at least one of the secondaries, even if the other secondary is not yet updated. Does that make sense?
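
And the corresponding write, as seen from the client, with hypothetical names again; everything about leaders, secondaries, and acknowledgments stays behind this one call.

```python
import boto3

table = boto3.resource("dynamodb").Table("ChatMessages")  # hypothetical table

# A single PutItem request. Behind the scenes, the request router forwards it
# to the partition's leader storage node, which acknowledges once the write
# has also reached at least one secondary.
table.put_item(
    Item={
        "room_id": "general",
        "sent_at": "2018-11-28T10:16:30Z",
        "author": "jon",
        "body": "Welcome to episode 42!",
    }
)
```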

Jon: Listening to this, I just have a hard time believing that all of this could be efficient, because we have a load balancer handling requests in front of the request router, the request router is talking to the partition metadata system, that then figures out which leader to talk to and sends a query or a PUT to it, then the actual writing to disk happens, and then all of it travels back up the tree and back out to the client.

All that needs to happen lightning fast, and here I’ve got a JavaScript unit test that takes two-and-a-half seconds to run. It certainly wouldn’t be okay if it took 2½ seconds for all of that to happen, but it seems like a whole lot of stuff has to happen. How can they possibly make it all fast?

Chris: Yeah, and the truth is this all happens in milliseconds, right?

Jon: Right, yeah.

Chris: But think about it. This is actually pretty extreme. I mean, this is pretty barebones and streamlined. This is built for speed and it’s built for scalability. Those front-end request routers are built to be very, very fast. They’re lightweight. I’m sure they’re caching partition metadata information; they probably have the map in memory on their nodes, so it’s very fast to figure out which storage node to go talk to. The networking connectivity they have with the storage nodes is obviously the best you can get. Amazon has optimized for that, so these requests over to the storage nodes are very fast. Then, the database engine on those storage nodes is very tight, purpose-built just for this.

Jon: I guess one thing that I still kind of poke at here is something that maybe isn’t necessarily classically slow, but in my experience has typically been slow: lots of JSON parsing. I don’t know if it’s the case that the request router receives a request that contains a big old chunk of JSON in it, but I imagine it could be. If it is the case that this whole database is just dealing in lots of JSON, wouldn’t that also tend to be a problem?

Chris: Yeah. For the most part, the JSON is not too much of an issue because as far as the system cares, it’s just a bag of data.

Jon: True, yeah, if you’re just getting a whole document out. It’s like, “Yup, let me go get that bag of data,” but what if I’ve got to traverse some JSON to figure out, “Okay, deep inside here I can see which storage node I need to write to”? Don’t some queries require that kind of processing?

Chris: Again, you’re going to go off the partition key or the global secondary index to determine which partition you’re going to. Once you get to that point, the rest of the core processing is all there on that storage node. That’s what those B-tree indexes are for when you do want to access data based upon a property inside your JSON document. It’s not parsing the JSON at that point. It’s going off its indexes, high-speed indexes that are implemented as B-trees. It’s very fast, very efficient.
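
Here is a sketch of what querying by a non-key attribute looks like, assuming the hypothetical venue-date-index global secondary index from the earlier table sketch. The storage node answers out of its B-tree index rather than parsing each item’s JSON.

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Concerts")  # hypothetical table

# Query by "venue", which is not part of the table's primary key, by going
# through the global secondary index declared on that attribute.
response = table.query(
    IndexName="venue-date-index",
    KeyConditionExpression=Key("venue").eq("Red Rocks"),
)
shows = response["Items"]
```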

Jon: I think that’s just fascinating because it’s not intuitive. Essentially, this whole database’s world is storing JSON, that’s what it’s all about, and yet it’s built to never really have to deal much in JSON. It’s built to not have to parse JSON very much. The whole point of it is to avoid having to do that. Pretty wild to think about.

Chris: Yeah. It’s really similar to any database system. Look at SQL. At the end of the day, there’s this certain layout format that it’s writing data to disk and it’s got to know offsets. It’s got to do parsing and figure out exactly where that stuff is. It’s got to handle these queries that come in and it’s going against whatever fields that you’re querying on. So, it has its indexes to go figure out, “Okay, where is it on disk?” to go retrieve the stuff. But at that point, it’s just pulling back the row, the record. From that standpoint, JSON’s really just a different way of marshalling the data.

Jon: Very cool. I think we’re running out of time here. We’ve gone a little long but I want to give us a chance to add anything more before we shut it down?

Chris: I think we’ve made really good progress here. The only thing we really didn’t touch on too much is the auto admin component, which is slightly unfortunate. It is a pretty important part of the system, and there are some really interesting problems that arise with a system like this. The biggest thing is partitions. The longer your database is up and the more data you put into it, the more likely it is that, “Hey, I’m going to get a hot partition,” or, “Partitions are just going to get too big. Performance is starting to become an issue, or I’m starting to reach a limit. So, what do I do? I need to actually split this up.” This partition has to go from one to two, and how do you do that? How do you decide when to do it, how do you do the data migration, and how does all that stay in sync?

All of that happens behind the scenes, without the clients knowing about it. That’s the reason for things like this auto admin component. That’s one of the things it does, along with others: it detects failures, it detects corruption, and it handles all that kind of general housekeeping. It’s not a very glamorous component, but it’s really important. Without it, this whole thing just does not work.
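
A purely illustrative sketch of the kind of decision the auto admin has to make. The thresholds below are loosely based on DynamoDB’s historically documented per-partition limits, but the real heuristics Chris mentions are internal and not public.

```python
# Illustrative thresholds only; the real auto admin heuristics are internal.
MAX_PARTITION_BYTES = 10 * 1024**3  # roughly the documented ~10 GB partition size
MAX_READ_UNITS = 3000               # roughly the documented per-partition read throughput
MAX_WRITE_UNITS = 1000              # roughly the documented per-partition write throughput

def should_split_partition(size_bytes, recent_read_units, recent_write_units):
    """Split when a partition has grown too big or become too hot."""
    return (
        size_bytes > MAX_PARTITION_BYTES
        or recent_read_units > MAX_READ_UNITS
        or recent_write_units > MAX_WRITE_UNITS
    )
```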

Jon: That makes sense. Those were going to be some of my questions, like, “How do you know how many partitions you have? Does it even matter? How much data do you have to put in before a new partition gets created?” I think some of the answers to that are inside how the auto admin works.

Chris: Yup, absolutely. Since Amazon runs this as a service, those are questions they decide how to answer, and they have thresholds. Obviously, they have heuristics and whatnot, based upon various metrics. It could be performance response time; they don’t want things to get beyond a certain limit from a partition size standpoint; it could be things like replication lag. A bunch of different things go into saying, “Okay. You know what? It’s time to do a split.”

Jon: Makes sense. Next week, we’ll talk a little bit more about Python and how that relates back to this architecture we talked about. We’ll do one more conversation about DynamoDB before moving on in the world of AWS.

Chris: Awesome. Sounds good.

Rich: Well, dear listener, you made it to the end. We appreciate your time and invite you to continue the conversation with us online. This episode, along with show notes and other valuable resources, is available at mobycast.fm/42. If you have any questions or additional insights, we encourage you to leave us a comment there. Thank you and we’ll see you again next week.
