Rick Houlihan did a talk a few years ago about designing the data layer for an application using DynamoDB. The most common reaction I get from people I show it to, most of them Amazon SDEs who operate services that use DynamoDB, is "Holy shit what is this wizardry?!"
One of the biggest mistakes people make with dynamo is thinking that it's just a relational database with no relations. It's not.
It's an incredible system, but it requires a lot of deep knowledge to get the full benefits, and it requires you, often, to design your data layer very well up-front. I actually don't recommend using it for a system that hasn't mostly stabilized in design.
But when used right, it's an incredibly performant beast of a data store.
It's worth noting that a lot of the early database designs, including this 2018 video, pre-date some dramatic improvements to DynamoDB usability.
I think the biggest ones were:
- an increase in the number of GSIs you can create (Dec 2018) [1]
- the introduction of on-demand capacity mode [2]
- an increase in the default limit for number of tables you can create (Mar 2022) [3]
I don't think these new features necessarily make the single-table, overloaded GSI strategy that's discussed in the video obsolete, but they enable applications which are growing to adopt an incremental GSI approach and use multiple tables as their data access patterns mature.
Some other posters have recommended Alex DeBrie's dynamodb book and I also think that's an excellent resource, but I'd caution people who are getting into dynamodb not to be scared by the claims that dynamodb is inflexible to data access changes, since AWS has been adding a lot of functionality to support multi-table, unknown access patterns, emerging secondary indexes, etc.
Something else important to mention is that DynamoDB now re-consolidates partitions.
This is a lousy explanation, but read/write quota is split evenly over all partitions. Partitions are created based on the hash key used, and there's an upper limit on how much data can be stored in any given partition. So if you end up with a hot hash key with lots of stuff in it, that data gets split over more and more partitions, and the effective throughput goes down (quota is split evenly over partitions).
I believe this is still a general risk, and you need to be extremely canny about your choice of hash key to avoid it, but historically they couldn't reconsolidate partitions. So you'd end up with a table in a terrible state, with quota having to be sky high to still get effective performance. The only option then was to completely rotate tables: create a new table with a better hash key and migrate the data (or whatever else you needed to do).
Now at least, once the data is gone, the partitions will reconsolidate, so an entire table isn't a complete loss.
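To make the even-split arithmetic above concrete, here's a rough sketch in Python. It only models the simplified "quota divided evenly over partitions" picture from this comment; the function name and the numbers are made up for illustration, and it isn't a statement of how DynamoDB allocates capacity internally.

```python
def per_partition_throughput(provisioned_units: float, partitions: int) -> float:
    """Simplified model: table-level quota is split evenly across partitions."""
    if partitions <= 0:
        raise ValueError("partitions must be positive")
    return provisioned_units / partitions


# Hypothetical numbers: the same table-level quota spread over more partitions
# leaves less and less throughput for any single (possibly hot) hash key.
for partitions in (1, 10, 100):
    share = per_partition_throughput(300, partitions)
    print(f"{partitions:>3} partitions -> {share} units each")
```

The point is just that the partition count stays high after a data spike, so a quota that looks fine at the table level can be tiny per partition.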
This bit me badly: an application that did significant autoscaling and hit a peak of 30,000 read/write requests per second, but typically did more like 300.
The Amazon support engineer told us that we had over a hundred partitions (which even he admitted was high for that number), and so our quota was effectively giving us 0 IOPS per partition. This obviously didn't work, and their only solution was "scale it back up, copy everything to a new table". We did that, but it was an engineering effort I'd rather have avoided.
People don't need to be scared; they just need to do their homework.
In my opinion, having more tables and more GSIs available won't help you very much if you started with a flawed data model (unless you kept making the same design mistakes 256 times). A team that tries to claw back from a flawed table design by piling up GSIs is just in for a world of pain.
So if you are planning to go with Dynamo:
- Read about the data modeling techniques
- Figure out your access patterns
- Check if your application and model can withstand the eventual consistency of GSIs
- Have a plan to rework your data model if requirements change: Are you going to incrementally rewrite your table? Are you going to export it and bulk load a fixed data model? How much is that going to cost?
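As a rough illustration of the "figure out your access patterns" step, here's a small pure-Python sketch of how one known access pattern ("list a customer's orders, oldest first") can drive the key design. Everything here is hypothetical: the `CUSTOMER#`/`ORDER#` key formats, the `TABLE` dict, and the `query` helper are stand-ins for a real table, and the in-memory scan is only there to show the shape of the lookup, not how DynamoDB executes it.

```python
# Hypothetical single-table layout: (partition key, sort key) -> attributes.
TABLE = {
    ("CUSTOMER#42", "PROFILE"): {"name": "Ada"},
    ("CUSTOMER#42", "ORDER#2024-01-15#a1"): {"total": 30},
    ("CUSTOMER#42", "ORDER#2024-03-02#b7"): {"total": 12},
    ("CUSTOMER#77", "ORDER#2024-02-20#c3"): {"total": 99},
}


def query(table: dict, pk: str, sk_prefix: str = "") -> list[str]:
    """Return sort keys under one partition key that start with sk_prefix, in order."""
    return sorted(
        sk for (item_pk, sk) in table
        if item_pk == pk and sk.startswith(sk_prefix)
    )


# Access pattern: "all orders for customer 42, oldest first".
# Putting an ISO date in the sort key makes lexicographic order match date order.
print(query(TABLE, "CUSTOMER#42", "ORDER#"))
```

If a new requirement shows up that these keys can't answer (say, "all orders over $50 across customers"), that's the point where you're into the rework plan from the last bullet.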
I also recommend Alex DeBrie's "The DynamoDB Book" (https://www.dynamodbbook.com/). It is a great resource that talks about these design patterns in depth. It has served me and my team well over the past few years.
For explicitness & searchability, commenting with the title of this talk, which is indeed excellent, not limited to DynamoDB, and which was kind of a revelation after years of using DynamoDB suboptimally:
Definitely one of my favorite talks by Rick and I apply lessons learned in that video on a daily basis.
Must have watched that video about 4-5 times before I really grasped the topics, since I started my career in a way that burned the concept of relational databases into my head. Breaking from that pattern of thought was difficult, initially.
Indeed, with GSIs etc. you can implement a priority queue or store data in the order you want. Once you are clear on the access patterns of your app, DynamoDB is amazing to model for and will scale with your app. But if you are not clear about your app's access patterns or need ad hoc queries, then DynamoDB is not a good fit.
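For the priority queue idea, a minimal sketch of the usual trick: build a sort key whose string order matches the order you want to read items in. The `queue_sort_key` helper, the 5-digit padding, and the sample jobs are all assumptions for illustration; the sorting here is done in plain Python to stand in for the sort-key ordering a table or GSI would give you.

```python
def queue_sort_key(priority: int, enqueued_at: str) -> str:
    """Zero-pad the priority so string order matches numeric order.

    Without padding, "10#..." would sort before "9#...".
    Lower number = higher priority; ties break on the ISO timestamp.
    """
    return f"{priority:05d}#{enqueued_at}"


# Hypothetical jobs: (priority, enqueue time as an ISO 8601 string).
jobs = [
    (10, "2024-01-01T00:00:02Z"),
    (9, "2024-01-01T00:00:03Z"),
    (9, "2024-01-01T00:00:01Z"),
]

keys = sorted(queue_sort_key(p, t) for p, t in jobs)
print(keys[0])  # the next job to process
```

Reading the first item in sort-key order then behaves like popping the head of the queue, as long as every writer builds the key the same way.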
Can be performant, nowadays anyway. Worked with a team who built their own implementation because Amazon's was too slow and expensive.
It's a weird model. Too small of a dataset and it doesn't quite make sense to use Dynamo. Too big of a dataset and it's full of footguns. Medium-sized may be too expensive.
Too-small seems to be the perfect use case for DDB. I need someplace to stash stuff and look it up by key. A full RDS is overkill, as is anything else that requires nodes that charge by the hour.
Thank you for this recommendation, I'm on a DynamoDB contract job and... really learning to think hard about key structure and designing for efficient querying, rather than efficient storage.
Thanks, bookmarked this. It's good to see a proper take on data modelling on document stores instead of just "throw any old JSON in there, it'll be fine!!!"
https://youtu.be/HaEPXoXVf2k