> Normalization was built for a world with very different assumptions. In the data centers of the 1980s, storage was at a premium and compute was relatively cheap. But the times have changed. Storage is cheap as can be, while compute is at a premium.
Normalisation isn't primarily about saving storage; it's about avoiding update anomalies, i.e. correctness.
I'm reminded of being 18, in my first year of college, when I had this lovely database professor who was good with crowds. He'd get us all to memorize and repeat certain phrases as a group.
"Why do we normalize?"
150 students, in unison: "To make better relations"
"And why do we DE-normalize?"
150 students, in unison: "Performance"
"And what is a database?"
150 students, in unison: "A self-describing set of integrated records"
That was 16 years ago, and I'm still able to say those answers verbatim.
Maybe to someone who could make sense of the DDL and read the language the labels and column names are written in. And understand all the implicit units, the rules around nulls/empties, the presence of magic strings (SSN, SKU) and special numbers (-1), and on and on. For that you need something like RDF and a proper data model.
Aren't you conflating the lexicon of data management with specific implementations of a relational database management system (RDBMS)?
Sorry, but your response sounds snarky and reminds me of all the ego hurdles I had to overcome when leaving/loving databases and set theory. Please remember that your comment could be someone's first introduction or an early step in learning.
If you use Oracle, PostgreSQL or MySQL (those are the ones I'm familiar with), you can always query the data dictionary and see how your tables relate. For me that is self-describing.
From the abstract of [1], Codd's stated motivation for applying relation theory to data storage:
"Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed."
From Section 1.4, "Normal Form":
"A relation whose domains are all simple can be represented in storage by a two-dimensional column-homogeneous array of the kind discussed above. Some more complicated data structure is necessary for a relation with one or more nonsimple domains. For this reason (and others to be cited below) the possibility of eliminating nonsimple domains appears worth investigating. There is, in fact, a very simple elimination procedure, which we shall call normalization."
As I read this, normalization was originally "about" making storage simpler than it was with contemporaneous models. Section 2.3 ("Consistency") discusses how "data bank state" can become inconsistent, and how such inconsistencies might be addressed, up to and including "the user or someone responsible for the security and integrity of the data is notified".
I think it's reasonable to infer that guaranteed transactional consistency (what I think kpmah means above by "correctness") and the space-saving properties of eliminating redundant data both happened later, and both fell out of the initial motivation of simplification.
[The relational model] provides a basis for a high level data language which will yield maximal independence between programs on the one hand and machine representation and organization of data on the other.
A further advantage of the relational view is that it forms a sound basis for treating derivability, redundancy, and consistency of relations [...]
So I would say it is mainly about flexibility, correctness, and the possibility of creating a simple yet powerful query language.
Great point! It was originally about both of these things, but the storage aspect isn't discussed much anymore because it's really not a concern.
The data integrity issue is still a concern, and I talk about that in the book. You need to manage data integrity in your application and think about how to handle updates properly. But it's completely doable and many people have.
> You need to manage data integrity in your application
This is just another way of saying you need to implement your own system for managing consistency. Dynamo offers transactions now, but they don't offer actual serialization. Your transaction will simply fail if it runs into any contention. You might think that's OK, just retry. But because you've chosen to sacrifice modelling your data, this will happen a lot. If you want to use aggregates in your Dynamo, you have to update an aggregate field every time you update your data, which means you create single points of contention for your failure-prone transactions to run into all over your app. There are ways around this, but they all come at the cost of further sacrificing consistency guarantees. The main issue is that once you've ended up in an inconsistent state, it's incredibly difficult to even know it's happened, let alone to fix it.
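To make that contention point concrete, here's a minimal sketch, not anything from the article: the table name ("Orders"), the ORDER#/CUSTOMER# key scheme, and the "lifetime_total" aggregate attribute are all invented. It shows a DynamoDB transaction that writes an order and bumps a denormalized per-customer aggregate in the same call, retrying when contention cancels it:

```python
import boto3

# Hypothetical table/keys: "Orders", ORDER#<id> / CUSTOMER#<id> items,
# and a denormalized "lifetime_total" aggregate attribute.
dynamodb = boto3.client("dynamodb")

def add_to_order(customer_id: str, order_id: str, amount: int, retries: int = 3) -> None:
    for attempt in range(retries):
        try:
            dynamodb.transact_write_items(
                TransactItems=[
                    {  # the order item itself
                        "Update": {
                            "TableName": "Orders",
                            "Key": {"pk": {"S": f"ORDER#{order_id}"}, "sk": {"S": "META"}},
                            "UpdateExpression": "ADD order_total :amt",
                            "ExpressionAttributeValues": {":amt": {"N": str(amount)}},
                        }
                    },
                    {  # the per-customer aggregate: every order update funnels
                       # through this one item, a single point of contention
                        "Update": {
                            "TableName": "Orders",
                            "Key": {"pk": {"S": f"CUSTOMER#{customer_id}"}, "sk": {"S": "AGGREGATE"}},
                            "UpdateExpression": "ADD lifetime_total :amt",
                            "ExpressionAttributeValues": {":amt": {"N": str(amount)}},
                        }
                    },
                ]
            )
            return
        except dynamodb.exceptions.TransactionCanceledException:
            # Any concurrent writer touching either item cancels the whole
            # transaction; all you can do is retry.
            if attempt == retries - 1:
                raise
```

Every write path that touches an order has to go through the aggregate item, which is exactly where the cancel-and-retry loop starts to hurt under load.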
Then you run into the issue of actually implementing your data access. With a denormalized data set, implementing all your data access comes at the expense of increasing interface complexity. Your schema ends up having objects, which contain arrays of objects, which contain arrays... and you have to design all of your interfaces around keys that belong to the top level parent object.
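As a toy illustration of that shape (field names invented, not anyone's real schema), a single denormalized item ends up looking something like this, and every access path has to start from the top-level parent key:

```python
# One denormalized "customer" document: objects containing arrays of objects.
customer = {
    "pk": "CUSTOMER#123",
    "name": "Acme Corp",
    "orders": [
        {
            "order_id": "ORDER#42",
            "items": [
                {"sku": "SKU-1", "qty": 2},
                {"sku": "SKU-7", "qty": 1},
            ],
        },
    ],
}

# Even a tiny change to one nested line item has to be addressed through the
# parent key and then walked down by hand.
for order in customer["orders"]:
    if order["order_id"] == "ORDER#42":
        for item in order["items"]:
            if item["sku"] == "SKU-7":
                item["qty"] += 3
```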
The relational model wasn’t designed to optimize one type of performance over another. It was designed to optimize operating on a relational dataset, regardless of the implementation details of the underlying management system. Trying to cram a relational set into a NoSQL DB unavoidably comes at the expense of some very serious compromises, with the primary benefit being that the DB itself is easier to administer. It’s not as simple as cost of storage vs cost of compute. NoSQL DBs like dynamo (actually especially dynamo) are great technology, with many perfectly valid use cases. But RDBMS is not one of them, and everybody I’ve seen attempt to use it as an RDBMS eventually regrets it.
Are there any basic examples you can give around maintaining that integrity?
I'm liking DynamoDB for tasks that fit nicely within a single domain, have relatively pain-free access patterns, etc. And I've found good fits, but there are some places where the eventual consistency model makes me nervous.
I'm specifically thinking about updating multiple different DynamoDB keys that might need to be aggregated for a data object. The valid answer may be "don't do that!" – if so, what should I do?
> Are there any basic examples you can give around maintaining that integrity?
For those types of use cases, the OP's advice would actually require implementing a fully bespoke concurrency control system in your business logic layer. Without trying to disparage the OP, this is, for all intents and purposes, impossible (aside from also being very, very impractical). There are some things you can do to create additional almost-functional (though still highly impractical) consistency controls for Dynamo (like throttling through FIFO queues), but they all end up being worse performance and scaling trade-offs than you'd get from simply using an RDBMS.
A lot of it boils down to the fact that Dynamo doesn't have (and wasn't designed to have) locking, meaning that pretty much any concurrency control system you want to implement on top of it is eventually going to run into a brick wall. The best you'd possibly be able to do is a very, very slow and clunky reimplementation of some of Spanner's design patterns.
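For a sense of what "concurrency control on top of it" usually looks like in practice, here's a rough sketch of the common optimistic-locking workaround: carry a version attribute and make every write conditional on the version you last read. The table and attribute names ("Orders", "version", "status") are hypothetical, and this only protects a single item, not multi-item invariants, which is where the brick wall sits.

```python
import boto3

# Hypothetical table "Orders" with a numeric "version" attribute on each item.
dynamodb = boto3.client("dynamodb")

def close_order(order_id: str, max_attempts: int = 5) -> None:
    key = {"pk": {"S": f"ORDER#{order_id}"}, "sk": {"S": "META"}}
    for _ in range(max_attempts):
        item = dynamodb.get_item(TableName="Orders", Key=key, ConsistentRead=True)["Item"]
        seen = int(item["version"]["N"])
        try:
            dynamodb.update_item(
                TableName="Orders",
                Key=key,
                UpdateExpression="SET #st = :closed, #v = :next",
                ConditionExpression="#v = :seen",  # reject if someone wrote in between
                ExpressionAttributeNames={"#st": "status", "#v": "version"},
                ExpressionAttributeValues={
                    ":closed": {"S": "CLOSED"},
                    ":next": {"N": str(seen + 1)},
                    ":seen": {"N": str(seen)},
                },
            )
            return
        except dynamodb.exceptions.ConditionalCheckFailedException:
            continue  # lost the race: re-read and try again
    raise RuntimeError("gave up after repeated write conflicts")
```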
Yeah, the limited transaction support is a killer for many use cases.
Thankfully, it’s not actually all that hard to implement your own “dynamo layer” on top of an SQL database and get most of the scaling benefits without giving up real transactions.
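As a rough illustration of that idea (the schema and function names here are invented, not the commenter's actual design), a "dynamo layer" can be as little as a partition-key/sort-key table with a JSON payload column, so every put/get rides on the database's real transactions:

```python
import json
import sqlite3

# Minimal sketch: DynamoDB-style put_item/get_item keyed by partition key +
# sort key, backed by an ordinary SQL table (SQLite here for brevity).
conn = sqlite3.connect("dynamo_layer.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS items (
        pk      TEXT NOT NULL,   -- partition key
        sk      TEXT NOT NULL,   -- sort key
        payload TEXT NOT NULL,   -- item body stored as JSON
        PRIMARY KEY (pk, sk)
    )
""")

def put_item(pk: str, sk: str, item: dict) -> None:
    with conn:  # each call is a real ACID transaction
        conn.execute(
            "INSERT INTO items (pk, sk, payload) VALUES (?, ?, ?) "
            "ON CONFLICT(pk, sk) DO UPDATE SET payload = excluded.payload",
            (pk, sk, json.dumps(item)),
        )

def get_item(pk: str, sk: str):
    row = conn.execute(
        "SELECT payload FROM items WHERE pk = ? AND sk = ?", (pk, sk)
    ).fetchone()
    return json.loads(row[0]) if row else None

put_item("ORDER#42", "META", {"status": "open", "total": 0})
print(get_item("ORDER#42", "META"))
```

When you do need multi-item invariants or aggregates, you can wrap several of these writes in one database transaction instead of reinventing consistency in the application layer.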