This article points out some of the pain associated with index maintenance. It should also point out that ALL indexes on a table suffer from the same issue. If your 20-column table has 7 indexes, then the suggestions should be applied 7x.
It is conventional wisdom that indexes are absolutely essential for any relational table that is of at least reasonable size (e.g. thousands of rows) and is accessed more often than daily. Indexes can be a pain to create and maintain; but they can greatly speed up queries and primary key validations. The pain mostly comes from having to figure out which indexes to create and how often to maintain them, rather than from the creation and maintenance operations themselves.
Indexes also have a performance penalty for any table updates. Creating new rows, updating existing rows, or deleting rows all require updates to each index.
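To make that cost concrete, here is a toy in-memory sketch (my own illustration, not any particular engine's implementation) of why every insert pays once per index; with 7 indexes, the loop below runs 7 times per row:

```python
# Toy sketch of per-index write cost (illustrative only, not a real engine).
class Table:
    def __init__(self, indexed_columns):
        self.rows = []                                        # the base table
        self.indexes = {col: {} for col in indexed_columns}   # col -> value -> row ids

    def insert(self, row):
        row_id = len(self.rows)
        self.rows.append(row)                  # one write for the row itself...
        for col, idx in self.indexes.items():  # ...plus one write per index
            idx.setdefault(row[col], []).append(row_id)
        return row_id

t = Table(indexed_columns=["email", "country"])
t.insert({"email": "a@example.com", "country": "US", "age": 30})
```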
But are indexes really required? I am creating a new kind of general-purpose data management system (a kind of object store) called Didgets. The tagging mechanism that I invented to attach tags to each data object is a set of key-value stores that essentially form a set of columnar stores.
I found that these columnar stores could also be used to create regular relational database tables. The data is structured such that indexes are not needed. All the tests I have run (up to a thousand columns and over 100 million rows) show that query speeds are equal to, or better than, those of other database systems that are well indexed.
The system is still under development, so it is still missing some key features that would make it a drop-in replacement for other databases; but it proves that it is possible to structure relational data such that query speeds can be optimal without needing separate indexing structures that have to be maintained.
I’m not clear on how you’re deviating from a normal columnar/OLAP database?
> I found that these columnar stores could also be used to create regular relational database tables.
Doesn’t every columnar store do this? Redshift, IQ, Snowflake, ClickHouse, DuckDB etc
> but it proves that it is possible to structure relational data such that query speeds can be optimal without needing separate indexing structures that have to be maintained.
Doesn’t every columnar database already prove this?
I am not an expert on all the other columnar stores out there; but it is my understanding that they are used almost exclusively for OLAP workloads. By 'regular database tables', I meant those that handle transaction processing (inserts, updates, deletes) along with queries.
My system does analytics well, but it is also very fast with changing data.
I also think that some of those systems (e.g. DuckDB) also use indexes.
They’re used by OLAP workloads because columnar properties fit better — namely, storing data column-wise obviously makes row-wise operations more expensive and column-wise operations cheaper; this usually corresponds to point look-ups vs aggregations. That cascades into things like constraint maintenance being more expensive, row-level triggers becoming a psychotic pattern, etc. Column-wise (de-)compression also doubles down on this.
They still do all the regular CRUD operations and maintain transactional semantics; they just naturally prefer bulk operations.
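To illustrate the layout difference with a toy example (no specific engine implied):

```python
# Row-wise vs column-wise layout, and which operations each makes cheap.
rows = [
    {"id": 1, "name": "ann", "amount": 10.0},
    {"id": 2, "name": "bob", "amount": 12.5},
]

row_store = rows                         # one contiguous record per row
col_store = {                            # one contiguous array per column
    "id":     [r["id"] for r in rows],
    "name":   [r["name"] for r in rows],
    "amount": [r["amount"] for r in rows],
}

point_lookup = row_store[0]              # cheap row-wise: touches one record
aggregation = sum(col_store["amount"])   # cheap column-wise: touches one array

# Reassembling row 0 from col_store touches every column array, which is
# why point operations (and row-level triggers) get more expensive.
row0 = {col: vals[0] for col, vals in col_store.items()}
```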
Redshift is the purest take on this I’ve seen, to the point that they simply don’t support most constraints or triggers, and data is allocated in 2MB immutable chunks, such that non-bulk operations undergo ridiculous amounts of write amplification and slow to a crawl. Afaik other OLAP databases are not this extreme, and support reasonable throughput on point operations (and triggers, constraints, etc.) — in the sense that it’s definitely slower, but not comically slower. (Aside: Aurora is also a pure take on transactional workloads, such that bulk aggregations are comically slow.)
> I also think that some of those systems (e.g. DuckDB) also use indexes.
I’m pretty sure they all use indexes, in the same fashion I expect you to (I’m guessing your system doesn’t do table scans for every single query). Columnar databases just get indexes like zone maps for “free”, in the sense that they can simply be applied on top of the actual dataset without having to maintain a separate copy of the data, à la row-wise databases. So it’s an implicit index automatically generated on every column — not user-maintained or specified. I expect your system does exactly the same (because it would be unreasonable not to).
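For reference, a minimal zone-map sketch, assuming the usual min/max-summary-per-block design (the function names here are mine, not DuckDB's or Didgets' actual code):

```python
# A zone map keeps a (min, max) summary per block of a column; queries skip
# any block whose range cannot contain the value they are looking for.
def build_zone_map(column_values, block_size=1024):
    zones = []
    for start in range(0, len(column_values), block_size):
        block = column_values[start:start + block_size]
        zones.append((min(block), max(block)))   # one summary per block
    return zones

def candidate_blocks(zones, target):
    # Only these blocks need to be read; all others are pruned up front.
    return [i for i, (lo, hi) in enumerate(zones) if lo <= target <= hi]
```

Because the summaries are derived from the data itself, they can be rebuilt per block on write instead of being maintained as a separate user-specified structure.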
> My system does analytics well, but it is also very fast with changing data.
Talk more, please & thank you. I expect everything above to be inherent properties/outcomes of the data layout, so I’m quite curious what you’ve done.
My project Didgets (short for Data Widgets), started out as a file system replacement. I wanted to create an object store that would store traditional file data, but also make file searches much faster and more powerful than other file systems allow; especially on systems with hundreds of millions of files on them. To enhance this, I wanted to be able to attach contextual tags to each Didget that would make searches much more meaningful without needing to analyze file content during the search.
To facilitate the file operations, I needed data structures to support them. I decided that these data structures (used/free bitmaps, file records, tags, etc.) should be stored and managed within other Didgets that had special handling. Each tag was basically a key-value pair that mapped the Didget ID (key) to a string, number, or other data type (value).
Rather than rely on some external process like Redis to handle tags, I decided to build my own. Each defined tag has a data type, and all values for that tag are stored together (like column values in a columnar store). I split the tag handling into two distinct pieces. All the values are deduplicated, reference counted, and stored within a 'Values Didget'. The keys (along with pointers to the values) are stored within a 'Links Didget'.
This makes analytic functions fast (each unique value is stored only once) and allows for various mapping strategies (one-to-one, one-to-many, many-to-one, or many-to-many). The values and the links are stored within individual blocks that are arranged using hashes and other metadata constraints. For any given query, usually only a small number of blocks need to be inspected.
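A simplified sketch of that split (illustrative only; the real implementation stores these inside Didgets with block-level layout, and every name below is made up for the example):

```python
class ValuesDidget:
    """Deduplicated, reference-counted storage for one tag's values."""
    def __init__(self):
        self.values = []       # each unique value stored exactly once
        self.refcounts = []    # reference count per unique value
        self.slots = {}        # value -> slot, for dedup on insert

    def add(self, value):
        slot = self.slots.get(value)
        if slot is None:
            slot = len(self.values)
            self.values.append(value)
            self.refcounts.append(0)
            self.slots[value] = slot
        self.refcounts[slot] += 1
        return slot

class LinksDidget:
    """Maps Didget IDs (keys) to slots in a ValuesDidget."""
    def __init__(self, values_didget):
        self.values = values_didget
        self.links = {}        # didget_id -> value slot

    def set_tag(self, didget_id, value):
        self.links[didget_id] = self.values.add(value)

# Analytics like counting distinct values fall out of the layout for free:
values = ValuesDidget()
links = LinksDidget(values)
links.set_tag(1, "red"); links.set_tag(2, "red"); links.set_tag(3, "blue")
print(len(values.values))   # 2 unique values, each stored once
```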
I expected analytic operations to be very fast, like with other OLAP systems; but I was pleasantly surprised at how fast I could make traditional OLTP operations run on it.
I have some short demo videos that show not only what it can do, but also benchmark many operations against other databases. Links to the videos are in my user profile.
Interesting ideas. I'm very interested in database ideas that bring new capabilities or better ways to accomplish old ones.
W.r.t. query speeds on your columnar storage engine, you will obviously have much better writes than row-oriented storage engines. This limits your write capabilities though. Any effort you put into restoring write speeds necessitates an extra step to maintain the columnar stores - which puts you back into the group of databases maintaining indices that you criticize above.
I think modern databases are bringing new ideas on how to accelerate both write and query speeds simultaneously with tradeoffs around CAP.
Although I have done much more benchmark testing against other databases for query speeds, I haven't noticed any significant speed degradation on writes.
Could you clarify what you mean by 'this limits your write capabilities'?
> W.r.t. query speeds on your columnar storage engine, you will obviously have much better writes than row-oriented storage engines.
This should have said reads, not writes. Columnar storage takes significantly more effort to handle writes because it must do many more IOs across the different columns, potentially more de/compression cycles, etc.
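Roughly, under an assumed model (illustrative touch counting, not real I/O accounting for any engine):

```python
# One-row insert: a row store appends a single record, while a column store
# appends to every column; each append here stands in for one I/O touch.
from collections import defaultdict

col_files = defaultdict(list)   # column name -> simulated on-disk array
row_file = []                   # simulated row-store heap

def insert_columnar(row):
    for col, val in row.items():
        col_files[col].append(val)        # one touch per column
    return len(row)                       # I/O touches

def insert_rowwise(row):
    row_file.append(tuple(row.values()))  # one touch total
    return 1

row = {f"c{i}": i for i in range(20)}     # a 20-column row
print(insert_columnar(row), "column-store touches vs", insert_rowwise(row))
```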