Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Clickhouse is a really cool and stupidly fast columnar database


I understand why OLAP writes are faster but is there any reason why OLTPs can't achieve similar read performance with denormalized and sharded data?


Aggregation is a huge reason. For rolling up data, something can’t like Clickhouse can’t be beat by oltp


What problems can I solve with a columnar database?

What type of data benefits from that type of Database?


Imagine you have a small business that tracks in the order of 10's - 100's of millions of events (pageviews, clicks, whatever), and you have reporting you want to run. Trying to do this in PG/MySQL would likely need to use materialized views so your reports don't take a long time to run. You could store your event data in CH directly, or use ELT/ETL process to sync/copy it into clickhouse just for reporting. Then, your queries would be very fast. It's must faster (for certain types of queries, mainly timeseries queries or queries involving aggregation of many rows). It's faster because of how the data is stored on disk. It's NOT good for fetching/updating/deleting single rows however.

It's originally designed to handle hundreds of columns, and billions of rows, but I think it can still apply to much smaller use cases that value performance. I'm implementing it currently in a similar scenario, and I'm using AirByte OSS version to ELT from postgres. Then I'm using tableau or some other BI tool to analyze that data much more effectively (I will be trying to perform complex aggregations/group by reports on 100mm rows)


Row based databases are optimized for accessing compete rows and joins. Columnar storage is optimized for accessing all, or many column values across rows. This makes aggregates and applying transformation logic faster with columnar storage than row based storage. Ie they are great for data warehouses and other analytical workloads.

Ps, great and still highly relevant resource covering all the major database system designs, their advantages and drawbacks: https://www.oreilly.com/library/view/designing-data-intensiv...


Less about the data itself and more about the specific operations you want to do on it.

Large aggregations, massive datasets, large joins, and workloads that are ready heavy and eschew row-level mutations.

They get used for data analysis frequently, time series data and associated analysis meshes quite nicely too. ClickHouse itself was originally built to support arbitrary analytical queries on clickstream data at pretty massive scale. Cloudflare uses it for live analytics, Uber uses it for logs.


An over simplification:

Columnar stores are optimized for reads. Row stores are optimized for writes.


This is an overly simplistic but also correct answer: clickhouse was developed for analytics on clickstreams.

Technically the overall idea is that if you have lots of queries that only read certain columns and your database stores rows contiguously it's a waste to read a whole row and then discard columns.

Also compression (such as run length or delta or even ztsd) often works better if you give it a block of data that's from one column (such as a timestamp or tag value).


That’s a longer subject that fits in a comment here.

If you are _actually_ interested I suggest using google search to find some good sites that go over what a column oriented database does/is used for.

This isn’t hard; I’ll get you started:

https://www.kdnuggets.com/2021/02/understanding-nosql-databa...


Or he, you know, could just ask, because that is the spirit of discussion.



Way more than that now.


Personally tried it, it can handle logs nicely. And from their page, many more things

https://clickhouse.com/use-cases


Uber wrote a blog on using Clickhouse to store logs: https://www.uber.com/blog/logging/


Columnar databases let you do fast aggregations and read only the columns you are interested in. They are for analyzing data.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: