Why don’t more companies / startups choose ScyllaDB rather than Postgres or MySQL?
ScyllaDB is a C++ rewrite of Cassandra, so its speed and scalability look like a clear advantage over the Java-based Cassandra, and Discord is using ScyllaDB at scale too.
Because Postgres and MySQL are more familiar, and by the time they're falling over, you have raised 80M and can afford to hire folks like me whose entire job is pestering developers to stop reading-after-writing, to stop forcing queries to the primary (just in case there's lag!), to stop turning "select id from table" into "select data_I_want from other_table where id IN (list-of-a-million-ids)", to beg AWS for early access to the fastest possible machines, and to systematically destroy the custom ORM you built during the first 5 years of the company, before you had customers.
Scylla/Cassandra is exactly the database you want when you're serving the entire planet.
Postgres is exactly the database you want when you're desperately searching for product-market fit and just need things to work and code to get written.
Turns out it's much better to start simple and maybe face scaling problems than to start complex and maybe face scaling problems anyway.
Curious how you approach business units asking for reports that require multiple joins and can "filter by any column" and "sort by any column" and also offer pagination and then complain about why it's so slow? This is MySQL by the way.
The sorting by any column and pagination is really killer. Can't do cursor-based pagination, so you get LIMIT/OFFSET, which is terrible for performance, at least in this context. Indexes are often useless in this circumstance as well, due to the filter-by-any-column requirement. We get lucky sometimes with a hard-coded WHERE clause as a baseline filter for the entire report that hits an index, but not always. Just for fun, add in plenty of LIKE '%X%', because limiting them to STARTS WITH is out of the question; it must be CONTAINS.
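Roughly what the two pagination styles look like, with made-up table and column names; keyset pagination only works when you can pin down a stable sort key, which is exactly what "sort by any column" takes away:

SELECT id, name, created_at
FROM orders
ORDER BY created_at, id
LIMIT 50 OFFSET 100000;
-- OFFSET: the server still produces and discards the first 100000 rows
-- before returning 50, so later pages keep getting slower.

SELECT id, name, created_at
FROM orders
WHERE (created_at, id) > ('2024-01-01 12:34:56', 98765)  -- last row of the previous page
ORDER BY created_at, id
LIMIT 50;
-- Keyset ("cursor"): seek straight to the bookmark, so with an index on
-- (created_at, id) a deep page ideally costs about the same as page 1
-- (row-value comparisons aren't always index-friendly; some engines want
-- the expanded AND/OR form instead).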
It's a constant source of frustration for the business and for the dev team.
Sounds like the job for a denormalized OLAP data structure (e.g. star schema) with a BI tool. This is the kind of use case they were made for. In other words, you've rediscovered the motivation for data warehouses.
Thank you for the feedback. I'll do some research along these lines. We have moved a report or two to Power BI, which has been much faster than similar reports built using dynamic queries. Keeping the data in sync and the Power BI learning curve have not been easy.
In our case we switched to holding that data in Elasticsearch. Now we have 2 problems: running Elasticsearch and keeping the data in Elasticsearch in sync with MySQL. We took an application-level approach to triggering record sync to ES but somewhere in our ORM it's doing some magic and doesn't sync in certain places. Database triggers are the next stop.
If we could do away with ES and go back to MySQL for what are glorified data grids with 20+ column filtering, sorting, and no full text requirements, we would.
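For what it's worth, if the trigger route does happen, one shape that sidesteps the ORM magic is to have the trigger only append the changed row's key to an outbox table, and let a separate worker push to Elasticsearch. A rough MySQL-flavored sketch, all names made up:

CREATE TABLE es_sync_outbox (
  id         BIGINT AUTO_INCREMENT PRIMARY KEY,
  entity     VARCHAR(64) NOT NULL,
  entity_id  BIGINT      NOT NULL,
  changed_at TIMESTAMP   NOT NULL DEFAULT CURRENT_TIMESTAMP
);

DELIMITER //
CREATE TRIGGER products_es_sync
AFTER UPDATE ON products
FOR EACH ROW
BEGIN
  -- Fires no matter which code path wrote the row, ORM magic or not.
  INSERT INTO es_sync_outbox (entity, entity_id) VALUES ('product', NEW.id);
END//
DELIMITER ;
-- Similar triggers go on INSERT/DELETE; the worker polls the outbox,
-- reindexes those ids into ES, and deletes the rows it has processed.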
We use ElasticSearch in a very limited capacity for keyword searches against consumer product metadata. We haven't ventured into using it for BI reporting, but I'm interested. I'm sure there is plenty of knowledge I can google, but any early insights or traps to avoid when moving from SQL-based reporting to something in ES?
If you want to sort by any column and paginate on any column (so you can have your cursor), a BTREE (MySQL index) is sort of the wrong structure in the first place. You probably want a document store for something like this. Scylla!
More pragmatically: just force bookmarked scans, limit the columns that can be sorted by, use a replica with indexes that the writer doesn't have, pay for fast disks, use one of the new parallel query engines like Aurora or Vitess, or just tell the business people to wait a dang minute, the computer is thinking!
The lazy approach is to denormalise the data in advance and load the result into memory. If it’s small enough you can simply return the entire result set to a reporting tool and query it locally.
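In MySQL that can be as simple as a flat, pre-joined reporting table rebuilt on a schedule (hypothetical names), which the grid then filters and sorts without touching the normalized tables:

CREATE TABLE report_orders_flat AS
SELECT o.id,
       c.name  AS customer_name,
       p.title AS product_title,
       o.quantity,
       o.total,
       o.created_at
FROM orders o
JOIN customers c ON c.id = o.customer_id
JOIN products  p ON p.id = o.product_id;
-- Rebuilt nightly (or on demand); if the result set is small enough it can
-- be shipped whole to the reporting tool and filtered client-side.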
I think it depends on the problem. I've worked at a tiny telco (<100 people) where MySql absolutely wouldn't have worked. We ended up using Scylla DB. We had billions of rows in single tables.
If you're launching an e-commerce site that sells shoes or something, yeah, you probably aren't going to sell a billion shoes every year so MySql is probably fine.
Finally, I use this example a lot: GitHub thought they could scale out MySql by writing their own Operator system that sharded MySql. They put thousands of engineering hours into it, and they still had global database outages that lost customer data.
You get that shit for free using Scylla/Cassandra: the ability to replicate across data centers, with fine-grained control to do things like enforce local quorum so replication doesn't impact write speed, etc.
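Concretely, the multi-DC story is a couple of lines of CQL plus a per-query consistency level (data center names here are made up):

CREATE KEYSPACE app
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us_east': 3,
    'eu_west': 3
  };
-- Every row is kept 3 times in each data center. Writes issued at
-- consistency level LOCAL_QUORUM only wait for a majority of replicas in
-- the local DC, so the cross-DC hop never sits on the write path; the
-- other DC catches up asynchronously.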
You know as I get closer and closer to official graybeard status I start having more of a hair trigger when someone says "just use __________" as if that solves everything and doesn't just create a list of a dozen or two other problems.
Everything is tradeoffs, you need to look at what you're trying to do and see which problems you are okay with. Some of these will be show-stoppers at larger scale. Some might be show-stoppers now.
But "just use __________" is never the right answer for any serious discussion.
What's wrong is replacing a single query with a join by multiple round trips that implement part of the join logic in application code, because you are using an RDBMS as if it were a fairly dumb repository of disconnected tables.
SELECT id FROM table1
Is fine, if what your app wants is the ids from table1. It's not good if it is just a prelude to:
SELECT * FROM table2
WHERE table1_id in (...ids from previous query...)
instead of just doing:
SELECT table2.*
FROM table1
INNER JOIN table2 ON (
table2.table1_id = table1.id
)
No query optimizer will automatically convert separate queries, connected by application logic it can't see shuttling data between them, into a single join.
Lots of application developers don't really understand how SQL works. So instead of learning to do a join (or a subquery, or anything like that) they just issue multiple queries from their application backend.
The posters up thread are talking about novice Python / JavaScript / whatever code which issues queries in sequence: first a query to get the IDs. Then once that query comes back to the application, the IDs are passed back to the database for a single query of a different table.
The query optimizer can’t help here because it doesn’t know what the IDs are used for. It just sees a standalone query for a bunch of IDs. Then later, an unrelated query which happens to use those same IDs to query a different table.
SELECT * FROM table2
WHERE table1_id in (SELECT id FROM table1 WHERE ...)
will be optimized to a join by the query planner (which may or may not be true, depending on how confident you are in the stability of the implementation details of your RDBMS's query planner). But in most circumstances, there is no subquery, it's more like:
SELECT * FROM table2
WHERE table1_id in (452345, 529872, 120395, ...)
Where the list of IDs is fetched from an API call to some microservice, or from a poorly used/designed ORM library.
What's really bad is when they (or the ORM) generate a separate select statement for each ID in a loop. Selecting thousands of rows one row at a time from an unindexed table can be hilariously slow.
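From the database's point of view, the loop version looks like the first block below (the classic N+1 pattern), when one set-based statement would do:

SELECT * FROM table2 WHERE table1_id = 452345;
SELECT * FROM table2 WHERE table1_id = 529872;
SELECT * FROM table2 WHERE table1_id = 120395;
-- ...thousands more round trips, each paying network latency and, without
-- an index on table1_id, a full table scan.

SELECT * FROM table2 WHERE table1_id IN (452345, 529872, 120395 /* ... */);
-- One round trip, one scan or index range, same result.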
And this doesn't even get into bizarre edge cases, like how Postgres treats foo = ANY(ARRAY['bar', 'baz']) differently from foo IN (VALUES ('bar'), ('baz')). PROTIP: the latter is often MUCH faster. I've learned this twice this week.
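For anyone who hasn't run into that one, the two spellings look like this (table and column names made up). The VALUES form is planned like a small table the planner can hash or join against, while the array form is a single scalar parameter, which is why the plans can diverge so much with long lists; version and statistics matter, so measure:

SELECT * FROM items WHERE sku = ANY(ARRAY['bar', 'baz', 'qux']);

SELECT * FROM items WHERE sku IN (VALUES ('bar'), ('baz'), ('qux'));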
It looks like what they're describing is implementing database joins in the application: 2 network round trips, first one for ids from a table and the second to get values from another table with foreign keys corresponding to the ids.
I have supported a microservices environment where each service had its own database, and indeed "foreign keys" were mostly application-level concerns. A real pain, and I don't think any of us would have chosen that approach in hindsight.
I hope there are better ways to design microservices.
At least in our case, it comes down to expectations.
For the `IN` query the DB doesn't know how many elements are in the list. In MSSQL, it would default to assuming "well, probably a short list" which blows out performance when that's not the case.
When you first insert into the temp table, the DB can reasonably say "Oh, this table has n elements" and switch the query plan accordingly.
In addition, you can throw an index on the temp table, which can also improve performance. Assuming the table you are querying against is indexed on ID, and the temp table of IDs is indexed too, the engine doesn't have to assume random access as it pulls out each id. (Effectively, it just has to navigate the tree nodes in order rather than doing a full lookup into the tree for every id.)
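A rough MSSQL-flavored sketch of that temp-table approach, with made-up names:

CREATE TABLE #wanted_ids (id BIGINT NOT NULL PRIMARY KEY);

INSERT INTO #wanted_ids (id)
VALUES (452345), (529872), (120395);  -- typically bulk-loaded from the app

SELECT t.*
FROM big_table AS t
JOIN #wanted_ids AS w ON w.id = t.id;
-- The optimizer now sees real row counts (and an index) on the id list,
-- instead of guessing "probably a short list" for a big IN (...) parameter,
-- and can walk both indexes in order.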
I've worked with a large Cassandra cluster at a previous job, was involved with a Scylla deployment, and have a lot of experience with the architecture and operations of both databases.
I wouldn't consider Cassandra/Scylla a replacement for Postgres/MySQL unless you have a very specific problem, namely that you need a highly available architecture that must sustain a high write throughput. Plenty of companies have this problem and Scylla is a great product, but choosing it when you are just starting out, or when you are unsure you need it, will hurt. You lose a lot of flexibility, and if you model your data incorrectly or need it in a way you didn't foresee, most of the advantages of Cassandra become disadvantages.
There are lots of tradeoffs when choosing Cassandra:
1. If you delete data, tombstones are a problem that nobody wants to have to think about.
2. Denormalizing data instead of joining isn't always practical or feasible (there's a rough sketch of what that looks like after this list).
3. If you're not dealing with large amounts of data, Postgres and MySQL are just easier. There's less headache and it's easier to build against them.
4. One of the big advertised strengths of Cassandra is handling high write volume. Many applications have relatively few writes and large amounts of reads.
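On point 2, a sketch of what denormalization means in practice (hypothetical CQL tables): every query pattern gets its own table, and the application writes each order to all of them:

CREATE TABLE orders_by_user (
  user_id   uuid,
  order_id  timeuuid,
  total     decimal,
  PRIMARY KEY (user_id, order_id)
);

CREATE TABLE orders_by_day (
  order_day date,
  order_id  timeuuid,
  user_id   uuid,
  total     decimal,
  PRIMARY KEY (order_day, order_id)
);
-- New orders are inserted into both tables (often in a logged batch).
-- A query pattern you didn't foresee means a new table plus a backfill,
-- which is exactly the flexibility you give up versus Postgres/MySQL.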
There are a lot of factors at play. But I think the biggest one is lack of expertise in the community. For Postgres it isn't that hard to hire a professional with 10 years of real-world experience.
Besides that, you need to keep in mind that Scylla isn't a silver bullet. There are some workloads that it can't handle very well. Where I work we tried to use it in a workload with high reads and writes, and it had trouble keeping up.
In the end we switched back to postgres + redis because it was cheaper.
Probably the ecosystem and documentation. PaaS like Render, Fly, and even Heroku from back in the day offer addons/plugins/whatever that make the obvious choice out to be Postgres/MySQL. Then there's other tooling such as migration and various ORM adapters. Then there's a wealth of blogs, stack overflow q&a's, etc. And then Postgres/MySQL gets you quite far. Start with a single node -> scale the node -> add read replicas -> shard. Then, sharding becomes painful? Start looking at other data stores, remodeling your data, or eventing, etc.
Because SQL and relational data. They're (MySQL and Pg) a lot more familiar to set up and maintain. There are a lot of managed-service offerings for the usual SQL databases.
Cassandra/Scylla come with a lot of limitations compared to a typical SQL DB. They were developed for specific use cases and aren't at all comparable to SQL DBs; both are completely different beasts compared to standard SQL DBs.
Is there any benchmark data to show the difference between no-sql DBs written in different languages when it comes to a real distributed system rather than just a handful of machines?
There are a bunch of common workloads (including, I'm guessing, HN itself) where the trade-offs for distributed databases make them a lot harder to use well, or where distributed databases don't make sense.
Government agencies I've worked with usually build an app, put it in production, and run it with very few evolutions for 15+ years.
Data is the most important stuff for them and being able to read it is therefore very important.
A nosql DB structure can usually only be read by the code that goes with it and if nobody understands the code you're doomed.
In 2017, I had to extract data from an app from 1994.
The app was built with a language that wasn't widely used even at the time and has disappeared today.
It disappeared before all the software companies put their doc on the internet.
You can't run it on anything newer than Windows 2000.
There was only one guy in Montréal that knew how to code and run that thing.
As the DB engine was relational and came with a tool like sqlplus, I was able to extract all the data, and we rebuilt the app based on the structure we found in the database.
If you want to do that with NoSQL, you must maintain separate documentation that explains your structure... And we all know how devs like documentation.