41. The Birth of NoSQL and DynamoDB – Part 3

Jon Christensen and Chris Hickman of Kelsus and Rich Staats of Secret Stache continue their discussion on the birth of NoSQL and DynamoDB. They examine DynamoDB’s architecture and popularity as a solution for Internet-scale databases.

Some of the highlights of the show include:

  • Challenges, evolution, and reasons associated with Internet-scale data
  • DynamoDB has been around a long time, but people are finally using it
  • DynamoDB and MongoDB are document or key-value stores that offer scalability and event-driven programming to reduce complexity
  • Techniques for keeping a NoSQL database's replicated data in sync
  • Importance of indexes and understanding query patterns
  • DynamoDB's table concept: a collection of documents/key-value items; a table must have a partition key to uniquely identify items and determine data distribution
  • Sort keys create indexes (i.e., local/global secondary indexes) that sort items within a partition
  • Query a DynamoDB database based on what's being stored, using keys; analyze queries to determine their effectiveness

Links and Resources

  • AWS
  • re:Invent
  • DynamoDB
  • NoSQL
  • MongoDB
  • Groupon
  • JSON
  • PostgreSQL
  • Kelsus
  • Secret Stache Media

Rich: In episode 41 of Mobycast, we continue our conversation on the birth of NoSQL and DynamoDB. In particular, we pull back the covers on DynamoDB, examine its architecture and discuss why it’s such a popular solution for internet-scale databases.

Welcome to Mobycast, a weekly conversation about containerization, Docker, and modern software deployment. Let's jump right in.

Jon: Hello and welcome, Rich and Chris, to another episode of Mobycast.

Chris: Hey guys. Hey, Jon.

Rich: Hey, Chris.

Jon: All right. Normally I'd ask how it's going and we'd talk about our weeks, but we're on episode three now of what have been some of my most fun times here at Mobycast so far, which is just talking about DynamoDB and Chris' history in the internet data storage world. This time we're going to steer back away from Chris' personal story, which I'm kind of sad about because that's so fun for me and so interesting, but we do have to cover some technical stuff and hope that people can learn from this podcast. We're going to talk about DynamoDB a little bit more and really dig into the internals of what it is and how it works and all that stuff.

I think that an obvious place to start with a conversation like this is NoSQL, and just why NoSQL. It's been talked about to death, but let's at least glance over some of the highlights of that conversation. Go ahead, Chris.

Chris: We touched on this in the previous two episodes: what are the challenges of being at internet scale, the types of data being collected, and whatnot. Maybe just as a recap, let's talk a little bit about the evolution of data. Previously, before we needed internet-scale databases, SQL and relational databases definitely ruled the world. Maybe the interesting thing to consider is that, for the most part, SQL came about as a way to optimize storage. When it was developed, storage was very expensive. By going with a normalized data model, where you basically aren't replicating the data, you could save on storage space. That was one of the primary motivations behind building these relational systems and SQL.

You optimize for storage because storage is really expensive: the data is normalized so you don't have duplicates, and you have these multiple tables of data that connect together through keys. These systems are also built to scale vertically. In order to handle more load, you get a bigger machine, or you add more storage or more memory or CPU power.

Here we are, another 20 or 30 years into the future, and now storage is really cheap. We have lots and lots of it. This was one of the reasons for going to NoSQL. NoSQL denormalizes the data: rather than making references to your data in other collections, you just keep it all together, basically as an atomic unit. There's some duplication of data for sure but, again, storage is not nearly so much of an issue as scalability and the ability to scale horizontally. That was really what drove this evolution from SQL to NoSQL: basically just realizing that the constraints were now different, the economics of the resources being used were different, and this model came about.

Jon: Right. I remember memes, before they were called memes, joking about people who were going to Mongo a long time ago, basically making fun of the fact that people were using it for problems that were not internet scale. It's like, are you internet scale? No? Well, a lot of people were not internet scale. It was that scalability that was the first driving force behind it. Wasn't it someone like Groupon that was one of the first users of Mongo, in particular, at scale?

Chris: Maybe. I don’t recall exactly who some of those first marquee customers we’re for Mongo. It was probably 2010, around that time frame I think when Mongo appeared on the scene.

Jon: It came after CouchDB, but I think it came on a little stronger, a little more buttoned up and ready for people to use.

Chris: I think they probably had more resources behind them, so they were able to iterate more quickly and get some of that market recognition and share that comes along with it.

Jon: We are going to talk about DynamoDB here. What's interesting for me about DynamoDB is that it's been around longer than either of those. We went deep into the history of that before, but I think it's only this year, and it could just be me projecting on the world, but I think it's only this year that people are realizing that Dynamo is not an ugly stepchild of Mongo, it's not a newer entrant, and it's not a half-thought-out thing; it's actually the OG of NoSQL databases, it's really very good, and it's managed in AWS. I think people are realizing this and talking about it a lot and making the decision to use it. Let's talk about what it is.

Chris: DynamoDB, just like MongoDB and other NoSQL offerings, for the most part, they're document stores, they're key-value stores. The data is denormalized; you're basically fetching something based upon some kind of key, and you can think of the rest as a bag of data that's along for the ride. It's a document, or it's a value-type thing. DynamoDB in particular is definitely geared towards event-driven programming, which is a very popular pattern. We've talked about this before in previous episodes of Mobycast. It's very popular and it's going to get even more popular. When designing distributed systems that are complicated, going to an event-driven model is key to reducing the complexity. DynamoDB has native support for that; we can talk a little bit about that as we dive into it more.
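
One concrete form of the native support Chris alludes to is DynamoDB Streams, which can invoke AWS Lambda functions in response to table writes. Here's a minimal sketch of such a handler in Python; the handler shape matches the Streams event format, but the notify_signup helper and its reaction are hypothetical.

```python
# A minimal sketch of DynamoDB's event-driven hook: a Lambda function
# invoked by DynamoDB Streams whenever the table changes.
def handler(event, context):
    for record in event["Records"]:
        # eventName is one of INSERT, MODIFY, or REMOVE.
        if record["eventName"] == "INSERT":
            # NewImage is the item as written, in DynamoDB's typed
            # JSON format, e.g. {"userId": {"S": "7d9f..."}}.
            new_image = record["dynamodb"]["NewImage"]
            notify_signup(new_image)  # hypothetical downstream reaction

def notify_signup(item):
    print("new item written:", item)
```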

Of course, DynamoDB is designed to be very, very scalable and very performant. One of the things that was pointed out at re:Invent this fall is that a single table in Dynamo can handle 4 million transactions per second. That's a lot.

Jon: No way. I was not expecting to hear that.

Chris: 4 million transactions a second. That ends up being a quarter billion a minute?

Rich: Right. That’s unbelievable.

Chris: That’s every four minutes doing a billion transaction. Yes. It’s very scalable and very performant. That’s the power of scale out, that’s the power of horizontal scaling and partitioning and sharding. That’s why we use systems are really good at this. They can do this partitioning and sharding because of that NoSQL data model. It’s the data denormalized. It’s not normalized. When it’s normalized data, like it is a relational database system, you have this network of connections and links that just make it very, very difficult to isolate that data and to partition it up. This is really what most people systems are really good at. Hence why they’re really good at this internet-scale database is the kind of data that we need to store for this internet applications.

Rich: I want to dive in to something here that always confused me a little bit. When you're working with a NoSQL database, you know that there is a lot of data that's just replicated in there, and you kind of don't worry about it too much, because that denormalization is just how NoSQL databases work. But what I get tripped up on is keeping all that data in sync. What I'm saying is, if you have an attribute in 10 or 15 different kinds of documents spread out all over your NoSQL database, and you need to update just that attribute, then, in my mind, that means you've got to go find all those documents and update that attribute. My mind just trips over that and says, how can that possibly be performant?

I don’t know if you could just talk about how do we not trip over that a little bit. I realized that I didn’t prepare you for that question. That could be a fairly difficult question so we can punt it, but that’s where I get stuck up.

Chris: Yeah. It just really depends on what data is changing, the kind of data that you're replicating, and whether or not it needs to be updated. Are you doing this for query reasons, for building indexes? That gives rise to other techniques. There are definitely techniques you can use so that you don't really have this problem; you can design yourself around it. It doesn't make sense to say, okay, I'm going to go update the title of a document, and that means I now have to go make 15 write operations because I've replicated that across 15 other documents in my table. You wouldn't want to do that. It just doesn't make sense.

You would design your model and your data so that you weren't having to do that. You can pick and choose what data you want to replicate and keep together, and whether you want to keep data that may change and is common. I think a good example of this would be a list of types of users. Maybe you store the type of user, and maybe you want to keep a friendly label for it; it's a string. You have admin, you have read-only user, you have a normal user. You have various different types and the labels that go along with them. You may include that label in each one of your user documents. You can replicate that across; you can have a million users registered in your system, and they all have that as a label, but chances are you're not going to change it. You're not going to update the label for the admin user, necessarily, right?

The same thing if you kept track of states and their postal abbreviations: you can duplicate that data, but chances are it's not changing. You don't have to worry about that.
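
As a concrete sketch of the kind of rarely-changing, denormalized label Chris is describing, here's what such a user item might look like. This assumes Python with boto3 (the AWS SDK); the table name, attribute names, and values are all hypothetical, since the episode doesn't specify any.

```python
import boto3

# Hypothetical table; in a real app this would already exist.
dynamodb = boto3.resource("dynamodb")
users = dynamodb.Table("Users")

# Each user item carries its own copy of the friendly role label.
# "Administrator" is duplicated across every admin's item, but since
# it effectively never changes, the duplication costs cheap storage
# rather than expensive multi-document updates.
users.put_item(
    Item={
        "userId": "7d9f1c2e-0000-4000-8000-000000000001",  # UUID partition key
        "name": "Steve",
        "roleType": "admin",           # the machine-readable type
        "roleLabel": "Administrator",  # denormalized friendly label
    }
)
```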

Rich: This exact problem is the exact thing I get wrapped around the axle on when I think about NoSQL database design. I know you want to get back to DynamoDB and what it is and how it's designed, but just bear with me for a little bit.

You got your admins, you got your normal users, and you got your read-only users, and they each have a different label. I would imagine the kinds of queries you want to do would be: okay, get me Steve, and what kind of user is Steve? He's an admin. Okay, cool, Steve can do this. But then the other kind of query you might want to do is: tell me all the admins. And if you only have the admin role as an attribute under the users, then that's a painful query. You have to go get all of the users, then look at all of their role types, and then, okay, now I can tell you who all the admins are. That's not a good way of doing that query.

But then, okay, say we're not going to worry about duplication of data, and instead we're going to have all the users, with their role in each user record, and then we're also going to have another document that just has, at the top level, each user type. At the top level it's: here's the admins, here's the normal users, and here's the read-only users. And then, under each one of those, there will be a list of every user, or their IDs or something.

It just seems to me that there could be a problem having both of those sitting around, having that data stored in two different ways. You see what I mean? Doesn't it seem like that could be a problem?

Chris: This is where indexes come into play, for sure. If you had the query pattern where you just have to go get all normal users or all admin users, you would just index on that particular attribute in your table. That would end up being a very fast and very performant operation.

Rich: Okay. An index is essentially just going to take all that data and store it in another way, another order, so that you can get stuff in the order that it's stored in. The index has to be kept up to date and all that, and that could be expensive, but that's fine. You can let the database system take care of that. Okay, got it. That makes sense.

Chris: Indexes are super important with things like DynamoDB, and so is really understanding what your query patterns are, where you need indexes, and how to do that.
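
To make that indexed query concrete, here's a minimal sketch in Python with boto3, assuming a hypothetical Users table with a secondary index named roleType-index on the roleType attribute (the global secondary indexes Chris introduces a bit later are DynamoDB's mechanism for this).

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
users = dynamodb.Table("Users")

# Query the index rather than scanning the whole table: DynamoDB
# goes straight to the items whose roleType is "admin".
response = users.query(
    IndexName="roleType-index",  # hypothetical index on roleType
    KeyConditionExpression=Key("roleType").eq("admin"),
)
admins = response["Items"]
```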

Rich: I could be wrong on this, but I think when I used Mongo on a big project, and it must have been like 2010 or 2011, it did not have indexes. It wasn't a feature. I could be wrong, but I think they just didn't have it. That was what was killing us. We were like, oh my God, these queries are slow. Let's just […] the data. Okay, now they're fast again. Oh no, we have a horrible mess of data on our hands.

Chris: Yeah, personally, I started using Mongo in 2012, and it had index support.

Rich: Interesting. It could very well have. It could have been that we were just missing them. But indexes are making a ton of sense.

Chris: Yes. And there’s lots of support for that, with DynamoDB and maybe this is a good time to talk about just like the overall terminology or DynamoDB. Like with relational systems, Dynamo has this concept of table. The table is basically a collection of these documents or these key-value things. Those particular components referred to as items. You have tables, tables that composed of zero to n items, a table must have a partition key. This is really important identifying what is the partition key for your table. Definitely it’s a primary key ready, uniquely identifies an item, but what’s really important about it is that, it’s determining the data distribution for all the items that are going into that table. This is what is being needed to share the data, to partition the data. You want something that is uniformly distributed and fairly random. You wouldn’t want a partition key on a status value or user type or gender, those wouldn’t be good partition keys. But something like a UUID, would be a good choice type of thing.

Partition keys, again: this is the primary way you're going to uniquely identify the items inside a table, and it determines your data distribution, how well the data can be fanned out and distributed across all the storage resources that are there. In addition to the partition key, you also have something called the sort key. Sort keys basically create indexes that allow you to sort items within that particular partition.
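
Here's a rough sketch of those two concepts in Python with boto3: declaring a table whose items are uniquely identified and distributed by a partition key and ordered by a sort key. The table and attribute names are invented for illustration.

```python
import boto3

client = boto3.client("dynamodb")

# A hypothetical Orders table: items are partitioned by customer
# (a UUID, so the data distributes evenly) and sorted by order date
# within each customer's partition.
client.create_table(
    TableName="Orders",
    AttributeDefinitions=[
        {"AttributeName": "customerId", "AttributeType": "S"},
        {"AttributeName": "orderDate", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "customerId", "KeyType": "HASH"},  # partition key
        {"AttributeName": "orderDate", "KeyType": "RANGE"},  # sort key
    ],
    BillingMode="PAY_PER_REQUEST",  # on-demand, just to keep the sketch short
)
```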

Some other terminology, and we can talk a little more about this as we have time: there's the concept of a global secondary index. What this is, is a way to provide an alternate partition key. These are pretty powerful, but they're also really expensive. I believe the limit is that you can have a total of five of these.

Jon: You mean expensive in terms of how much extra compute power it takes, or storage resources it takes, or memory it takes?

Chris: This is creating duplicates of your data behind the scenes. It's basically saying, I'm going to use a different partition key. It's almost like a whole different database; it's a copy, shuffled in a totally different way. That's what a global secondary index means. The expense there is that the primary table's partition key gets updated first, and then after that, it has to make the updates to the global secondary indexes. That's where some of that performance cost comes into play: the updates that now have to go to the global secondary indexes.

And there is latency; those are eventually consistent updates. It's asynchronous, but that's just what you want. If it was synchronous, your performance would really go down quite a bit. There's some latency there.
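
Continuing the hypothetical Orders table from above, here's a hedged sketch of declaring a global secondary index at table-creation time: the same data, re-partitioned by a different key, which is the behind-the-scenes copy Chris describes.

```python
import boto3

client = boto3.client("dynamodb")

client.create_table(
    TableName="Orders",
    AttributeDefinitions=[
        {"AttributeName": "customerId", "AttributeType": "S"},
        {"AttributeName": "orderDate", "AttributeType": "S"},
        {"AttributeName": "status", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "customerId", "KeyType": "HASH"},
        {"AttributeName": "orderDate", "KeyType": "RANGE"},
    ],
    GlobalSecondaryIndexes=[
        {
            # Hypothetical index name; it re-partitions the same data
            # by status instead of customer, like a shuffled copy of
            # the table. Writes to the base table propagate to it
            # asynchronously, i.e., eventually consistent reads.
            "IndexName": "status-orderDate-index",
            "KeySchema": [
                {"AttributeName": "status", "KeyType": "HASH"},
                {"AttributeName": "orderDate", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
    BillingMode="PAY_PER_REQUEST",
)
```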

And then there is the concept of a local secondary index, which is similar to the sort key. It provides you an alternate sort key, but it's local to the partition that it's on. Those are some of the overall terms but, at the end of the day, it's pretty straightforward and simple. You have these tables, the tables are composed of items, and the items do have a schema, but the schema's flexible. That's one of the primary principles of NoSQL. Again, it's up to you how you use this data: how structured you want it to be and how flexible it needs to be.

And then you have these indexes and keys that determine how the data's going to be partitioned and how you can access it in the most efficient manner that you need.

Rich: Yeah, pretty straightforward, and familiar for people that maybe have not worked with NoSQL databases and come from the relational database world.

Chris: Indeed. With a relational database, you're going to look at your query patterns and create indexes on certain columns in your database, because you want to be able to query on those, and you don't want to do a full table scan, which means having to look at every single item in the table to find out whether or not it meets the search criteria. You create an index so that you can very efficiently retrieve those values. The same concept applies with NoSQL and with Dynamo and its indexes and keys.

Rich: There’s a whole section on Architecture and components that we wanted to talk about but I’m thinking we can save that for the next episode. Before we talk about that, there is one thing that’s bothering that I don’t know about DynamoDB and it really is for developers, it’s the key, it’s the crucial thing. How do you query a DynamoDB database, do you use SQL, or do you use a proprietary DynamoDB language? And if it is a near thing that you have to learn in the query language, how hard is it and how confident can you be as you’re learning it that you are writing good queries. As you answer that question, like how confident can you be, just think about how hard it is to be confident that your SQL queries are good? It’s a few years of experience before like okay, that’s a bank query, yup, and this is going to be a better query.

Chris: NoSQL databases are definitely built for modern apps. They're very developer-friendly, and you don't really have to go learn a new query language. For the most part, it depends again on what it is you're storing. If you're using it as a key-value store, obviously your queries are based upon the key. You're saying, "Here's my key. Go fetch this value."

For documents, again, you can reference those by key, by the partition key, or if you want, you can do queries where you're saying, "Go get me all the documents where the user is a superuser." In which case, you're probably using something that looks a lot like JSON.

It’s going to be very, very familiar to developers. You’re just going to say, “Here’s my JSON expression for the attribute that I’m going to search on.” For every attribute that matches this value and there’ll be like certain place holders for operators. Like, do you want to do and is an or, is it greater than or less than. Some just very familiar concepts for putting together Boolean expressions for calling in and acquiring this data.

There’s a very rich ecosystem of tools and information to let you know how your queries are doing. Just any DynamoDB has the support, Mongo has it as well. You can go analyze your queries, you can get reports on slow running queries, what’s taking a long time, and then also just hands on how to go fix it. There’s a lot of support there.

Jon: Cool. That’s good. SQl, being what it is, it’s definitely a weak spot for a lot of folks. It’s a weak spot that people address, because if SQL’s not going away but to have double the amount of weak spots because of NoSQL database it’s difficult over query languages SQL would be very really burdensome for the whole community, so it’s good to hear.

Chris: SQL is very robust, very extensive; there are very few people that understand every function, every component of the language. It's constantly being added to as well. You go look at the documentation for the latest version of PostgreSQL and what it supports, and it's, oh wow, there are 20 different ways and operators you can use to go roll up aggregations, and it's like, okay, what do I do here?

Jon: Yeah, totally.

Chris: Versus the NoSQL space, which is not nearly as expansive. It's much more constrained. This reflects the data model: the data model is simpler, so therefore the query language is simpler.

Jon: Excellent. Yeah. I think we'll wrap it up there. Definitely a good first look into how Dynamo's put together, and I think we'll probably do one more next week, just to talk about how Dynamo sits in AWS and what's going on there, because it's not an application you install on your own servers, so what is this thing and how is it put together? We'll talk about that next week.

Chris: Alright. Thanks guys.

Jon: Good talking with you too, and thanks for putting it together, Rich. Talk to you next week. Bye-bye.

Rich: Thanks guys.

Well dear listener, you made it to the end. We appreciate your time and invite you to continue the conversation with us online. This episode, along with show notes and other valuable resources, is available at mobycast.fm/41. If you have any questions or additional insights, we encourage you to leave us a comment there. Thank you and we'll see you again next week.
