
Why don’t more companies / startups choose ScyllaDB rather than Postgres or MySQL?

ScyllaDB is a C++ rewrite of Cassandra, so it looks like speed and scalability are a clear advantage over the Java-based Cassandra, and Discord is using ScyllaDB at scale too.



Because Postgres and MySQL are more familiar, and by the time they're falling over you have raised $80M and can afford to hire folks like me, whose entire job is pestering developers to stop reading-after-writing, to stop forcing queries to the primary (just in case there's lag!), to stop turning "select id from table" into "select data_I_want from other_table where id IN (list-of-a-million-ids)", to beg AWS for early access to the fastest possible machines, and to systematically destroy the custom ORM you built during the first 5 years of the company, before you had customers.

Scylla/Cassandra is exactly the database you want when you're serving the entire planet.

Postgres is exactly the database you want when you're desperately searching for product market-fit and just need things to work and code to get written.

Turns out it's much better to start simple and maybe face scaling problems later than to start complex and still maybe face scaling problems.


Curious how you approach business units asking for reports that require multiple joins, can "filter by any column" and "sort by any column", must offer pagination, and who then complain about why it's so slow? This is MySQL, by the way.

The sorting by any column plus pagination is the real killer. You can't do cursor-based pagination, so you get LIMIT/OFFSET, which is terrible for performance, at least in this context. Indexes are often useless here as well because of the filter-by-any-column requirement. We get lucky sometimes with a hard-coded WHERE clause as a baseline filter for the entire report that hits an index, but not always. Just for fun, add in plenty of LIKE '%X%', because limiting them to STARTS WITH is out of the question; it must be CONTAINS.

It's a constant source of frustration for the business and for the dev team.
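For whichever columns do have a usable index, keyset (cursor) pagination sidesteps the OFFSET scan entirely. A minimal sketch using sqlite3 to stay self-contained (the `orders` table and its columns are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, i * 1.5) for i in range(1, 101)])

def page_after(last_id, page_size=10):
    # Keyset pagination: WHERE id > ? walks the primary-key index,
    # instead of OFFSET scanning and discarding all earlier rows.
    return conn.execute(
        "SELECT id, total FROM orders WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, page_size),
    ).fetchall()

page1 = page_after(0)
page2 = page_after(page1[-1][0])  # the cursor is the last id of the previous page
print(page1[0][0], page2[0][0])  # -> 1 11
```

The catch, as the comment says, is that this only works when the sort column is indexed and the cursor is stable, which "sort by any column" breaks.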


Sounds like the job for a denormalized OLAP data structure (e.g. star schema) with a BI tool. This is the kind of use case they were made for. In other words, you've rediscovered the motivation for data warehouses.


Thank you for the feedback. I'll do some research along these lines. We have moved a report or two to Power BI, which has been much faster than similar reports built using dynamic queries. Keeping the data in sync and the Power BI learning curve have not been easy.


In our case we switched to holding that data in Elasticsearch. Now we have 2 problems: running Elasticsearch and keeping the data in Elasticsearch in sync with MySQL. We took an application-level approach to triggering record sync to ES but somewhere in our ORM it's doing some magic and doesn't sync in certain places. Database triggers are the next stop.

If we could do away with ES and go back to MySQL for what are glorified data grids with 20+ column filtering, sorting, and no full text requirements, we would.


We use ElasticSearch in a very limited capacity for keyword searches against consumer product metadata. We haven't ventured into using it for BI reporting, but I'm interested. I'm sure there is plenty of knowledge I can google, but any early insights or traps to avoid when moving from SQL-based reporting to something in ES?


If you want to sort by any column and paginate on any column (so you can have your cursor), a BTREE (MySQL index) is sort of the wrong structure in the first place. You probably want a document store for something like this. Scylla!

More pragmatically, just force bookmarked scans - limit the columns that can be sorted by - use a replica with indexes that the writer doesn’t have - pay for fast disks - use one of the new parallel query engines like Aurora or Vitess - just tell the business people to wait a dang minute, computer is thinking!


The lazy approach is to denormalise the data in advance and load the result into memory. If it’s small enough you can simply return the entire result set to a reporting tool and query it locally.
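The "lazy approach" above can be sketched in a few lines: denormalise once with a join, then do the arbitrary filtering and sorting locally. All table and column names here are invented for the sketch:

```python
import sqlite3
from operator import itemgetter

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1, 5.0), (11, 2, 9.0), (12, 1, 2.5);
""")

# Denormalise once: pay for the join a single time.
rows = conn.execute("""
    SELECT o.id AS order_id, c.name AS customer, o.total
    FROM orders o JOIN customers c ON c.id = o.customer_id
""").fetchall()

# "Filter by any column, sort by any column" is now a cheap local operation.
report = sorted((r for r in rows if r[2] > 3.0), key=itemgetter(2), reverse=True)
print(report)  # -> [(11, 'Bob', 9.0), (10, 'Ada', 5.0)]
```

This obviously only works while the denormalised result set fits comfortably in memory, which is exactly the "if it's small enough" caveat.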


I think it depends on the problem. I've worked at a tiny telco (<100 people) where MySql absolutely wouldn't have worked. We ended up using Scylla DB. We had billions of rows in single tables.

If you're launching an e-commerce site that sells shoes or something, yeah, you probably aren't going to sell a billion shoes every year so MySql is probably fine.

Finally, I use this example a lot: GitHub thought they could scale out MySQL by writing their own orchestration system that sharded MySQL. They put thousands of engineering hours into it, and they still had global database outages that lost customer data.

You get that shit for free with Scylla/Cassandra: the ability to replicate across data centers, with fine-grained control to do things like enforce local quorum so you don't impact write speed, etc.

They probably made the wrong choice.


Cassandra does not make you immune to database issues: https://monzo.com/blog/2019/09/08/why-monzo-wasnt-working-on...


Yes. You still have to read the docs before randomly setting “auto_bootstrap” to false.

Or just use Astra and not worry about scaling your own cluster.


Did GP make such a claim? I didn’t see it, but maybe the comment was edited.


Are you referring to their use of Vitess?

https://github.blog/2021-09-27-partitioning-githubs-relation...

GitHub didn't write it, Google/YouTube did.


Or just use RDS Aurora for same benefits out of the box at scale.


You know as I get closer and closer to official graybeard status I start having more of a hair trigger when someone says "just use __________" as if that solves everything and doesn't just create a list of a dozen or two other problems.

Everything is tradeoffs, you need to look at what you're trying to do and see which problems you are okay with. Some of these will be show-stoppers at larger scale. Some might be show-stoppers now.

But "just use __________" is never the right answer for any serious discussion.


Aurora won't work nearly as well for write-heavy/time-series type workloads. It's probably a good middle ground for a lot of applications though.


> systematically destroy the custom ORM you built during the first 5 years of the company, before you had customers.

Ouch, haha, that hurt


What's wrong with "select id from table"? (presumably "where" skipped for brevity of the example)


What's wrong is replacing a single query with a join by multiple round trips that implement part of the join logic in application code, because you are using an RDBMS as if it were a fairly dumb repository of disconnected tables.

  SELECT id FROM table1
Is fine, if what your app wants is the ids from table1. It's not good if it is just a prelude to:

  SELECT * FROM table2
  WHERE table1_id IN (...ids from previous query...)
instead of just doing:

  SELECT table2.*
  FROM table1
  INNER JOIN table2 ON (
    table2.table1_id = table1.id
  )


Wow, it is hard to believe anyone would use the first option.


Most query optimisers would automatically convert such subqueries into joins.


No query optimizer will automatically convert separate queries, connected by application logic the optimizer can't see shuttling data between them, into a single join.


Not if the dev has removed the context of which table the id values come from.


Can you explain a bit further? If the subquery does not have the context of where the value comes from, I would think the query would error out?


Lots of application developers don’t really understand how sql works. So instead of learning to do a join (or do a sub query or anything like that) they just issue multiple queries from their application backend.

The posters up thread are talking about novice Python / JavaScript / whatever code which issues queries in sequence: first a query to get the IDs. Then once that query comes back to the application, the IDs are passed back to the database for a single query of a different table.

The query optimizer can’t help here because it doesn’t know what the IDs are used for. It just sees a standalone query for a bunch of IDs. Then later, an unrelated query which happens to use those same IDs to query a different table.
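The pattern being described can be made concrete with a small sqlite3 sketch (table names echo the upthread example; the schema is invented). The first version issues two queries with Python glue in between, which no optimiser can see across; the second hands the whole join to the database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, flag INTEGER);
    CREATE TABLE table2 (id INTEGER PRIMARY KEY, table1_id INTEGER, data TEXT);
    INSERT INTO table1 VALUES (1, 1), (2, 0), (3, 1);
    INSERT INTO table2 VALUES (10, 1, 'a'), (11, 2, 'b'), (12, 3, 'c');
""")

# Anti-pattern: two round trips, with half the join done in application code.
ids = [r[0] for r in conn.execute("SELECT id FROM table1 WHERE flag = 1")]
placeholders = ",".join("?" * len(ids))
split = conn.execute(
    f"SELECT data FROM table2 WHERE table1_id IN ({placeholders})", ids
).fetchall()

# What the optimiser can actually work with: a single join.
joined = conn.execute("""
    SELECT table2.data
    FROM table1
    INNER JOIN table2 ON table2.table1_id = table1.id
    WHERE table1.flag = 1
""").fetchall()

print(sorted(split) == sorted(joined))  # -> True
```

Both return the same rows; the difference is that only in the second can the planner choose indexes, join order, and avoid shipping the intermediate id list over the wire.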


What you're saying is that:

  SELECT * FROM table2
  WHERE table1_id in (SELECT ids FROM table1 WHERE ...)
will be optimized to a join by the query planner (which may or may not be true, depending on how confident you are in the stability of the implementation details of your RDBMS's query planner). But in most circumstances, there is no subquery, it's more like:

  SELECT * FROM table2
  WHERE table1_id in (452345, 529872, 120395, ...)
Where the list of IDs is fetched from an API call to some microservice, or from a poorly used/designed ORM library.


What's really bad is where they (or the ORM) generate a separate SELECT statement for each ID in a loop. Selecting thousands of rows one row at a time from an unindexed table can be hilariously slow.
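The N+1 loop versus a single batched query, sketched with sqlite3 (the `items` table is invented for illustration); with an in-memory database both return the same rows, but against a real server the first costs one network round trip per id:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(i, f"item{i}") for i in range(1000)])
wanted = list(range(0, 1000, 7))

# N+1 pattern: one query per id.
one_by_one = [conn.execute("SELECT name FROM items WHERE id = ?", (i,)).fetchone()[0]
              for i in wanted]

# Batched: one query for the whole id list.
ph = ",".join("?" * len(wanted))
batched = [r[0] for r in conn.execute(
    f"SELECT name FROM items WHERE id IN ({ph}) ORDER BY id", wanted)]

print(one_by_one == batched)  # -> True
```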


And this doesn't even get into bizarre edge cases, like how Postgres treats foo = ANY(ARRAY['bar', 'baz']) differently from foo IN (VALUES(('bar'), ('baz'))). PROTIP: the latter is often MUCH faster. I've learned this twice this week.


It looks like what they're describing is implementing database joins in the application: 2 network round trips, first one for ids from a table and the second to get values from another table with foreign keys corresponding to the ids.


Isn't implementing joins in the application what microservices are all about?


No, microservices are all about senior management's ability to map applications to people to yell at.


I have supported a microservices environment where each service had its own database, and indeed "foreign keys" were mostly application-level concerns. A real pain, and I don't think any of us would have chosen that approach in hindsight.

I hope there are better ways to design microservices.


You're getting confused with NoSQL ;)


IDK if Postgres has this problem, but MSSQL does. If you do

  SELECT * FROM table WHERE id IN (blah)
it ends up with some pretty terrible performance.

It's often a lot faster to do something like:

  INSERT INTO #tempTable (id) VALUES (...)

  SELECT * FROM table
  JOIN #tempTable t ON t.id = table.id


I can't believe this is a thing in 2023! If this is faster, shouldn't the database do that transparently in the background?!


At least in our case, it comes down to expectations.

For the `IN` query the DB doesn't know how many elements are in the list. In MSSQL, it would default to assuming "well, probably a short list" which blows out performance when that's not the case.

When you first insert into the temp table, the DB can reasonably say "Oh, this table has n elements" and switch the query plan accordingly.

In addition, you can throw an index on the temp table, which can also improve performance. Assuming the table you are querying against is indexed on id, when the id list is also indexed the engine doesn't have to assume random access as it pulls out each id. (Effectively, it just has to navigate the tree nodes in order rather than doing a full descent into the tree for every value.)
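The temp-table trick, sketched with sqlite3 rather than MSSQL so it runs anywhere (table and column names are invented; sqlite's planner differs from MSSQL's, so this shows the shape of the technique, not the MSSQL performance win):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO big VALUES (?, ?)",
                 [(i, f"v{i}") for i in range(1000)])
ids = [3, 42, 500]

# Materialise the id list: now the planner knows its size, and the
# PRIMARY KEY gives it an index to merge against.
conn.execute("CREATE TEMP TABLE wanted (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO wanted VALUES (?)", [(i,) for i in ids])

rows = conn.execute("""
    SELECT big.val FROM big
    JOIN wanted ON wanted.id = big.id
    ORDER BY big.id
""").fetchall()
print([r[0] for r in rows])  # -> ['v3', 'v42', 'v500']
```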


I read it as "`where` skipped at code time because the table is never going to be too large".


>> Because Postgres and MySQL are more familiar

If familiarity is the primary problem, it should be relatively easy to fix.


>choose ScyllaDB rather than Postgres or MySQL?

I've worked with a large Cassandra cluster at a previous job, was involved with a Scylla deployment, and have a lot of experience with the architecture and operations of both databases.

I wouldn't consider Cassandra/Scylla a replacement for Postgres/MySQL unless you have a very specific problem, namely that you need a highly available architecture that must sustain a high write throughput. Plenty of companies have this problem and Scylla is a great product, but choosing it when you are just starting out, or when you are unsure you need it, will hurt. You lose a lot of flexibility, and if you model your data incorrectly, or need to query it in a way you didn't foresee, most of the advantages of Cassandra become disadvantages.


There are lots of tradeoffs when choosing Cassandra:

1. If you delete data, tombstones are a problem that nobody wants to have to think about.

2. Denormalizing data instead of joining isn't always practical or feasible.

3. If you're not dealing with large amounts of data, Postgres and MySQL are just easier. There's less headache and it's easier to build against them.

4. One of the big advertised strengths of Cassandra is handling high write volume. Many applications have relatively few writes and large amounts of reads.


Unless it's a commenting system ;)


Fast writes and slow reads was true in 2012. The project has been busy since.


Fast reads are commonplace. Having fast writes for a workload without an especially high volume of writes is not selling anyone on a database.


Because I (we) want rigid schemas with transaction-wrapped DDL migrations more than I want blazing speed at Discord's scale.


You can have a rigid schema with Cassandra/ScyllaDB. Transactions are a whole other thing though for sure.


Plus complex queries with rich capabilities


There are a lot of factors at play, but I think the biggest one is lack of expertise in the community. For Postgres it isn't that hard to hire a professional with 10 years of real-world experience.

Besides that, you need to keep in mind that Scylla isn't a silver bullet. There are some workloads it can't handle very well. Where I work we tried to use it for a workload with high reads and writes, and it had trouble keeping up.

In the end we switched back to postgres + redis because it was cheaper.


Can you be more specific? I am surprised Scylla got overwhelmed; it powers Discord.


Why would you use a database that is closely linked to a VC-backed startup when PostgreSQL works fine for you?


Probably the ecosystem and documentation. PaaS like Render, Fly, and even Heroku from back in the day offer addons/plugins/whatever that make the obvious choice out to be Postgres/MySQL. Then there's other tooling such as migration and various ORM adapters. Then there's a wealth of blogs, stack overflow q&a's, etc. And then Postgres/MySQL gets you quite far. Start with a single node -> scale the node -> add read replicas -> shard. Then, sharding becomes painful? Start looking at other data stores, remodeling your data, or eventing, etc.


Because SQL and relational data. They're (MySQL and Pg) a lot more familiar to set up and maintain. There are a lot of managed-service offerings for the usual SQL databases.

Cassandra/Scylla come with a lot of limitations compared to a typical SQL DB. They were developed for specific use cases and are a completely different beast from standard SQL databases.


Transactions mean you can build faster, as you don't have to write a bunch of code to achieve those guarantees yourself.


Is there any benchmark data to show the difference between no-sql DBs written in different languages when it comes to a real distributed system rather than just a handful of machines?


There are a bunch of common workloads (including, I'm guessing, HN itself) where the trade-offs for distributed databases make them a lot harder to use well, or where distributed databases don't make sense.


Government agencies I've worked with usually build an app and put it in production with very few changes for 15+ years.

Data is the most important stuff for them and being able to read it is therefore very important.

A nosql DB structure can usually only be read by the code that goes with it and if nobody understands the code you're doomed.

In 2017, I had to extract data from an app from 1994. The app was built with a language that wasn't widely used at the time and has disappeared today; it disappeared before all the software companies put their docs on the internet. You can't run it on anything newer than Windows 2000. There was only one guy in Montréal who knew how to code and run that thing.

As the DB engine was relational, with a tool like sqlplus I was able to extract all the data, and we rebuilt the app based on the structure we found in the database.

If you want to do that with NoSQL, you must maintain separate documentation that explains your structure... and we all know how devs like documentation.


ecosystem integration

Postgres is super well supported, basically every tool out there has a connector to postgres, and most people have familiarity

If I am honest, I never even bother researching alternatives, I always go with postgres out of habit. I know it, and it has never let me down


That speed thing is a nice bit of marketing from Scylla but not really true in the real world.



