Why don’t more companies / startups choose ScyllaDB rather than Postgres or MySQL?
ScyllaDB is a C++ rewrite of Cassandra, so its speed and scalability look like a clear advantage over the Java-based Cassandra, and Discord is using ScyllaDB at scale too.
Because Postgres and MySQL are more familiar, and by the time they're falling over, you have raised 80M and can afford to hire folks like me whose entire job is pestering developers to stop reading-after-writing, to stop forcing queries to the primary (just in case there's lag!), to stop turning "select id from table" into "select data_I_want from other_table where id IN (list-of-a-million-ids)", to beg AWS for early access to the fastest possible machines, and to systematically destroy the custom ORM you built during the first 5 years of the company, before you had customers.
Scylla/Cassandra is exactly the database you want when you're serving the entire planet.
Postgres is exactly the database you want when you're desperately searching for product-market fit and just need things to work and code to get written.
Turns out it's much better to start simple and maybe face scaling problems than to start complex and maybe face scaling problems anyway.
Curious how you approach business units asking for reports that require multiple joins and can "filter by any column" and "sort by any column" and also offer pagination and then complain about why it's so slow? This is MySQL by the way.
The sorting by any column and pagination is really killer. Can't do cursor-based pagination, so you get LIMIT/OFFSET, which is terrible for performance, at least in this context. Indexes are often useless in this circumstance as well, due to the filter-by-any-column requirement. We get lucky sometimes with a hard-coded WHERE clause as a baseline filter for the entire report that hits an index, but not always. Just for fun, add in plenty of LIKE '%X%', because limiting them to STARTS WITH is out of the question; it must be CONTAINS.
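Roughly what the two pagination styles look like, with made-up table and column names; keyset pagination only works when you can pin down a stable sort key, which is exactly what "sort by any column" takes away:

SELECT id, name, created_at
FROM orders
ORDER BY created_at, id
LIMIT 50 OFFSET 100000;
-- OFFSET: the server still produces and discards the first 100000 rows
-- before returning 50, so later pages keep getting slower.

SELECT id, name, created_at
FROM orders
WHERE (created_at, id) > ('2024-01-01 12:34:56', 98765)  -- last row of the previous page
ORDER BY created_at, id
LIMIT 50;
-- Keyset ("cursor"): seek straight to the bookmark, so with an index on
-- (created_at, id) a deep page ideally costs about the same as page 1
-- (row-value comparisons aren't always index-friendly; some engines want
-- the expanded AND/OR form instead).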
It's a constant source of frustration for the business and for the dev team.
Sounds like the job for a denormalized OLAP data structure (e.g. star schema) with a BI tool. This is the kind of use case they were made for. In other words, you've rediscovered the motivation for data warehouses.
Thank you for the feedback. I'll do some research along these lines. We have moved a report or two to Power BI, which has been much faster than similar reports built using dynamic queries. Keeping the data in sync and the Power BI learning curve have not been easy.
In our case we switched to holding that data in Elasticsearch. Now we have 2 problems: running Elasticsearch and keeping the data in Elasticsearch in sync with MySQL. We took an application-level approach to triggering record sync to ES but somewhere in our ORM it's doing some magic and doesn't sync in certain places. Database triggers are the next stop.
If we could do away with ES and go back to MySQL for what are glorified data grids with 20+ column filtering, sorting, and no full text requirements, we would.
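For what it's worth, if the trigger route does happen, one shape that sidesteps the ORM magic is to have the trigger only append the changed row's key to an outbox table, and let a separate worker push to Elasticsearch. A rough MySQL-flavored sketch, all names made up:

CREATE TABLE es_sync_outbox (
  id         BIGINT AUTO_INCREMENT PRIMARY KEY,
  entity     VARCHAR(64) NOT NULL,
  entity_id  BIGINT      NOT NULL,
  changed_at TIMESTAMP   NOT NULL DEFAULT CURRENT_TIMESTAMP
);

DELIMITER //
CREATE TRIGGER products_es_sync
AFTER UPDATE ON products
FOR EACH ROW
BEGIN
  -- Fires no matter which code path wrote the row, ORM magic or not.
  INSERT INTO es_sync_outbox (entity, entity_id) VALUES ('product', NEW.id);
END//
DELIMITER ;
-- Similar triggers go on INSERT/DELETE; the worker polls the outbox,
-- reindexes those ids into ES, and deletes the rows it has processed.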
We use ElasticSearch in a very limited capacity for keyword searches against consumer product metadata. We haven't ventured into using it for BI reporting, but I'm interested. I'm sure there is plenty of knowledge I can google, but any early insights or traps to avoid when moving from SQL-based reporting to something in ES?
If you want to sort by any column and paginate on any column (so you can have your cursor), a BTREE (MySQL index) is sort of the wrong structure in the first place. You probably want a document store for something like this. Scylla!
More pragmatically: just force bookmarked scans, limit the columns that can be sorted by, use a replica with indexes that the writer doesn't have, pay for fast disks, use one of the new parallel query engines like Aurora or Vitess, or just tell the business people to wait a dang minute, the computer is thinking!
The lazy approach is to denormalise the data in advance and load the result into memory. If it’s small enough you can simply return the entire result set to a reporting tool and query it locally.
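In MySQL that can be as simple as a flat, pre-joined reporting table rebuilt on a schedule (hypothetical names), which the grid then filters and sorts without touching the normalized tables:

CREATE TABLE report_orders_flat AS
SELECT o.id,
       c.name  AS customer_name,
       p.title AS product_title,
       o.quantity,
       o.total,
       o.created_at
FROM orders o
JOIN customers c ON c.id = o.customer_id
JOIN products  p ON p.id = o.product_id;
-- Rebuilt nightly (or on demand); if the result set is small enough it can
-- be shipped whole to the reporting tool and filtered client-side.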
I think it depends on the problem. I've worked at a tiny telco (<100 people) where MySql absolutely wouldn't have worked. We ended up using Scylla DB. We had billions of rows in single tables.
If you're launching an e-commerce site that sells shoes or something, yeah, you probably aren't going to sell a billion shoes every year so MySql is probably fine.
Finally, I use this example a lot: GitHub thought they could scale out MySql by writing their own Operator system that sharded MySql. They put thousands of engineering hours into it, and they still had global database outages that lost customer data.
You get that shit for free using Scylla/Cassandra: the ability to replicate across data centers, with fine-grained control to do things like enforce local quorum so replication doesn't impact write speed, etc.
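Concretely, the multi-DC story is a couple of lines of CQL plus a per-query consistency level (data center names here are made up):

CREATE KEYSPACE app
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us_east': 3,
    'eu_west': 3
  };
-- Every row is kept 3 times in each data center. Writes issued at
-- consistency level LOCAL_QUORUM only wait for a majority of replicas in
-- the local DC, so the cross-DC hop never sits on the write path; the
-- other DC catches up asynchronously.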
You know as I get closer and closer to official graybeard status I start having more of a hair trigger when someone says "just use __________" as if that solves everything and doesn't just create a list of a dozen or two other problems.
Everything is tradeoffs, you need to look at what you're trying to do and see which problems you are okay with. Some of these will be show-stoppers at larger scale. Some might be show-stoppers now.
But "just use __________" is never the right answer for any serious discussion.
What's wrong is replacing a single query with a join by multiple round trips that implement part of the join logic in application code, because you are using an RDBMS as if it were a fairly dumb repository of disconnected tables.
SELECT id FROM table1
Is fine, if what your app wants is the ids from table1. It's not good if it is just a prelude to:
SELECT * FROM table2
WHERE table1_id in (...ids from previous query...)
instead of just doing:
SELECT table2.*
FROM table1
INNER JOIN table2 ON (
table2.table1_id = table1.id
)
No query optimizer will automatically convert separate queries, connected by application logic it can't see shuttling data between them, into a single join.
Lots of application developers don't really understand how SQL works. So instead of learning to do a join (or a subquery, or anything like that) they just issue multiple queries from their application backend.
The posters up thread are talking about novice Python / JavaScript / whatever code which issues queries in sequence: first a query to get the IDs. Then once that query comes back to the application, the IDs are passed back to the database for a single query of a different table.
The query optimizer can’t help here because it doesn’t know what the IDs are used for. It just sees a standalone query for a bunch of IDs. Then later, an unrelated query which happens to use those same IDs to query a different table.
SELECT * FROM table2
WHERE table1_id in (SELECT id FROM table1 WHERE ...)
will be optimized to a join by the query planner (which may or may not be true, depending on how confident you are in the stability of the implementation details of your RDBMS's query planner). But in most circumstances, there is no subquery, it's more like:
SELECT * FROM table2
WHERE table1_id in (452345, 529872, 120395, ...)
Where the list of IDs is fetched from an API call to some microservice, or from a poorly used/designed ORM library.
What's really bad is when they (or the ORM) generate a separate select statement for each ID in a loop. Selecting thousands of rows one row at a time from an unindexed table can be hilariously slow.
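From the database's point of view, the loop version looks like the first block below (the classic N+1 pattern), when one set-based statement would do:

SELECT * FROM table2 WHERE table1_id = 452345;
SELECT * FROM table2 WHERE table1_id = 529872;
SELECT * FROM table2 WHERE table1_id = 120395;
-- ...thousands more round trips, each paying network latency and, without
-- an index on table1_id, a full table scan.

SELECT * FROM table2 WHERE table1_id IN (452345, 529872, 120395 /* ... */);
-- One round trip, one scan or index range, same result.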
And this doesn't even get into bizarre edge cases, like how Postgres treats foo = ANY(ARRAY['bar', 'baz']) differently from foo IN (VALUES ('bar'), ('baz')). PROTIP: the latter is often MUCH faster. I've learned this twice this week.
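For anyone who hasn't run into that one, the two spellings look like this (table and column names made up). The VALUES form is planned like a small table the planner can hash or join against, while the array form is a single scalar parameter, which is why the plans can diverge so much with long lists; version and statistics matter, so measure:

SELECT * FROM items WHERE sku = ANY(ARRAY['bar', 'baz', 'qux']);

SELECT * FROM items WHERE sku IN (VALUES ('bar'), ('baz'), ('qux'));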
It looks like what they're describing is implementing database joins in the application: 2 network round trips, first one for ids from a table and the second to get values from another table with foreign keys corresponding to the ids.
I have supported a microservices environment where each service had its own database, and indeed "foreign keys" were mostly application-level concerns. A real pain, and I don't think any of us would have chosen that approach in hindsight.
I hope there are better ways to design microservices.
At least in our case, it comes down to expectations.
For the `IN` query the DB doesn't know how many elements are in the list. In MSSQL, it would default to assuming "well, probably a short list" which blows out performance when that's not the case.
When you first insert into the temp table, the DB can reasonably say "Oh, this table has n elements" and switch the query plan accordingly.
In addition, you can throw an index on the temp table, which can also improve performance. Assuming the table you are querying against is indexed on ID, and the temp table of IDs is indexed too, the engine doesn't have to assume random access as it pulls out each id. (Effectively, it just has to navigate the tree nodes in order rather than doing a full lookup into the tree for every id.)
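A rough MSSQL-flavored sketch of that temp-table approach, with made-up names:

CREATE TABLE #wanted_ids (id BIGINT NOT NULL PRIMARY KEY);

INSERT INTO #wanted_ids (id)
VALUES (452345), (529872), (120395);  -- typically bulk-loaded from the app

SELECT t.*
FROM big_table AS t
JOIN #wanted_ids AS w ON w.id = t.id;
-- The optimizer now sees real row counts (and an index) on the id list,
-- instead of guessing "probably a short list" for a big IN (...) parameter,
-- and can walk both indexes in order.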
I've worked with a large Cassandra cluster at a previous job, was involved with a Scylla deployment, and have a lot of experience with the architecture and operations of both databases.
I wouldn't consider Cassandra/Scylla a replacement for Postgres/MySQL unless you have a very specific problem, namely that you need a highly available architecture that must sustain a high write throughput. Plenty of companies have this problem and Scylla is a great product, but choosing it when you are just starting out, or when you are unsure you need it, will hurt. You lose a lot of flexibility, and if you model your data incorrectly or need it in a way you didn't foresee, most of the advantages of Cassandra become disadvantages.
There are lots of tradeoffs when choosing Cassandra:
1. If you delete data, tombstones are a problem that nobody wants to have to think about.
2. Denormalizing data instead of joining isn't always practical or feasible (there's a rough sketch of what that looks like after this list).
3. If you're not dealing with large amounts of data, Postgres and MySQL are just easier. There's less headache and it's easier to build against them.
4. One of the big advertised strengths of Cassandra is handling high write volume. Many applications have relatively few writes and large amounts of reads.
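On point 2, a sketch of what denormalization means in practice (hypothetical CQL tables): every query pattern gets its own table, and the application writes each order to all of them:

CREATE TABLE orders_by_user (
  user_id   uuid,
  order_id  timeuuid,
  total     decimal,
  PRIMARY KEY (user_id, order_id)
);

CREATE TABLE orders_by_day (
  order_day date,
  order_id  timeuuid,
  user_id   uuid,
  total     decimal,
  PRIMARY KEY (order_day, order_id)
);
-- New orders are inserted into both tables (often in a logged batch).
-- A query pattern you didn't foresee means a new table plus a backfill,
-- which is exactly the flexibility you give up versus Postgres/MySQL.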
There are a lot of factors at play. But I think the biggest one is lack of expertise in the community. For Postgres it isn't that hard to hire a professional with 10 years of real-world experience.
Besides that, you need to keep in mind that Scylla isn't a silver bullet. There are some workloads that it can't handle very well. Where I work we tried to use it in a workload with high reads and writes, and it had trouble keeping up.
In the end we switched back to postgres + redis because it was cheaper.
Probably the ecosystem and documentation. PaaS like Render, Fly, and even Heroku from back in the day offer addons/plugins/whatever that make the obvious choice out to be Postgres/MySQL. Then there's other tooling such as migration and various ORM adapters. Then there's a wealth of blogs, stack overflow q&a's, etc. And then Postgres/MySQL gets you quite far. Start with a single node -> scale the node -> add read replicas -> shard. Then, sharding becomes painful? Start looking at other data stores, remodeling your data, or eventing, etc.
Because SQL and relational data. They're (MySQL and Pg) a lot more familiar to set up and maintain. There are a lot of managed-service offerings for the usual SQL databases.
Cassandra/Scylla come with a lot of limitations compared to a typical SQL DB. They were developed for specific use cases and aren't at all comparable to SQL DBs; both are completely different beasts compared to standard SQL DBs.
Is there any benchmark data to show the difference between no-sql DBs written in different languages when it comes to a real distributed system rather than just a handful of machines?
There are a bunch of common workloads (including, I'm guessing, HN itself) where the trade-offs for distributed databases make them a lot harder to use well, or where distributed databases don't make sense.
Government agencies I've worked with usually build an app, put it in production, and run it with very few evolutions for 15+ years.
Data is the most important stuff for them and being able to read it is therefore very important.
A nosql DB structure can usually only be read by the code that goes with it and if nobody understands the code you're doomed.
In 2017, I had to extract data from an app from 1994.
The app was built with a language that wasn't widely used even at the time and has disappeared today.
It disappeared before all the software companies put their doc on the internet.
You can't run it on anything newer than Windows 2000.
There was only one guy in Montréal that knew how to code and run that thing.
As the DB engine was relational and came with a tool like sqlplus, I was able to extract all the data, and we rebuilt the app based on the structure we found in the database.
If you want to do that with NoSQL, you must maintain separate documentation that explains your structure... And we all know how devs like documentation.