
Before I proceed, let me just state that I have no particular knowledge of either RDBMS' internals or Kafka, so this is nothing but my own amateur musings on the subject.

Now, I wholeheartedly agree that Kafka is more scalable, but I think the key point here is that there is no particular law of nature as to why that is the case. It may just be a historical accident of how RDBMSs - and PostgreSQL in particular - have evolved. Further: many of the properties of Kafka are in fact also desirable properties for the PostgreSQL transaction log.

My take on inopinatus's observation, together with the Samza article [1] mentioned in this discussion, is as follows. You can think of Postgres as two "products" (bounded contexts, if you like):

- a stream-based, possibly replicated, transaction log;

- a projection of that transaction log into relational calculus, plus all of the associated machinery.

Thus far there has been no need to think of these as clearly separate "products", but Kafka makes it obvious that they are. In truth, the number of tools processing the WAL outside of PostgreSQL was already hinting in this direction; Kafka just made it explicit.

From this perspective, it seems a tad expensive to take the original transaction log, convert it to an RDBMS representation, convert that back into events and, in some cases, store those events as a stream in Kafka. It would be much more efficient to simply use the original transaction log directly - and this is why, to me, even Debezium [2] / Bottled Water [3] appear to be one layer too many. To the best of my understanding, this line of reasoning is also in line with the observations in the Samza article [1]. Where I believe I differ from the article is in thinking that the RDBMS representation also adds a lot of value to applications - I see both having a role (e.g. a streaming vs. batch processing sort of thing). I think this would derail the present discussion too much, so I won't go into it.
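To make "use the transaction log directly" concrete: PostgreSQL already exposes a decoded view of the WAL through logical decoding, which is the same facility Debezium builds on. Below is a minimal sketch in Python, assuming psycopg2, a server running with wal_level = logical, the built-in test_decoding plugin, and made-up database/slot names:

    import psycopg2

    conn = psycopg2.connect("dbname=app")  # hypothetical database
    conn.autocommit = True
    cur = conn.cursor()

    # Create a replication slot backed by the built-in test_decoding plugin.
    cur.execute("SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding')")

    # ... writes happen elsewhere ...

    # Read decoded WAL changes straight from the log - no trigger tables,
    # no polling of application rows.
    cur.execute("SELECT lsn, xid, data FROM pg_logical_slot_get_changes('demo_slot', NULL, NULL)")
    for lsn, xid, data in cur.fetchall():
        print(lsn, xid, data)  # e.g. "table public.users: INSERT: id[integer]:1 ..."

A non-relational "client" of the log would consume something like this stream, skipping the round-trip through tables entirely.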

In conclusion: to the untrained eye, it seems that the right thing to do is to extract the transaction log out of PostgreSQL and make it as scalable as Kafka. Then, allow it to log "things" which are not necessarily "projectable" into the relational plane. PostgreSQL then becomes just a client of the transaction log, alongside other "kinds" of clients. I suspect that this is what will ultimately happen, but the engineering work required will probably span a decade or more.

My 2 Angolan Kwanzas, at any rate.

[1] https://www.confluent.io/blog/turning-the-database-inside-ou...

[2] https://debezium.io/

[3] https://github.com/confluentinc/bottledwater-pg



Kafka partitions individual streams (topics) and distributes those partitions across the cluster. That is, the entire "database" is spread across nodes. With postgres, your unit of scalability is the entire database: you can't natively have some tables on one node and some tables on another, for example.
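To see what that unit of scalability looks like from the client side, here is a sketch using confluent-kafka's admin API, with made-up broker and topic names:

    from confluent_kafka.admin import AdminClient, NewTopic

    admin = AdminClient({"bootstrap.servers": "broker1:9092"})

    # One "stream", split into 12 partitions, each replicated 3 times.
    # The cluster spreads those partitions (and their replicas) across
    # brokers, so one topic's reads and writes are served by many nodes.
    futures = admin.create_topics(
        [NewTopic("events", num_partitions=12, replication_factor=3)])
    for topic, future in futures.items():
        future.result()  # raises if creation failed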

Kafka is also just more optimized for what it does. Postgres does a superset of what kafka does, so kafka is unsurprisingly better able to optimize for its use case. It has a zero-copy path that can shuttle data between disk and network without copying it through user space (using the sendfile syscall). It doesn't wait for disk flushes when doing writes, because it achieves durability via replication instead.
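To make the zero-copy point concrete, here is roughly what that path looks like at the syscall level. This is illustrative Python, not Kafka's actual code (the broker is JVM code using FileChannel.transferTo, which maps to sendfile on Linux); serve_segment and its arguments are made up:

    import os
    import socket

    def serve_segment(path: str, conn: socket.socket) -> None:
        # sendfile() moves bytes from the file to the socket inside the
        # kernel (via the page cache), never copying them through user
        # space - which is how a broker can serve log segments cheaply.
        fd = os.open(path, os.O_RDONLY)
        try:
            size = os.fstat(fd).st_size
            offset = 0
            while offset < size:
                sent = os.sendfile(conn.fileno(), fd, offset, size - offset)
                if sent == 0:  # nothing left to send
                    break
                offset += sent
        finally:
            os.close(fd)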

Also, don't forget the things you'd have to do when implementing consumers. How will you load-balance a stream between consumers? Meaning, if you have a stream and you want multiple consumers working through it in parallel, how can you make sure they aren't duplicating work, and that they can handle the consumer group growing/shrinking? How will you handle checkpointing, where each consumer tracks what it has processed so far? What about stream data rolling off (retention)?
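For contrast, this is roughly what the Kafka client gives you out of the box. A sketch assuming confluent-kafka, with made-up topic/group names; handle() is a hypothetical application callback:

    from confluent_kafka import Consumer

    # Every process running this with the same group.id is assigned a
    # disjoint subset of the topic's partitions, and the group is
    # rebalanced automatically as consumers join or leave.
    c = Consumer({
        "bootstrap.servers": "broker1:9092",
        "group.id": "billing-workers",   # hypothetical group
        "auto.offset.reset": "earliest",
        "enable.auto.commit": False,
    })
    c.subscribe(["events"])

    while True:
        msg = c.poll(1.0)
        if msg is None or msg.error():
            continue
        handle(msg.value())        # hypothetical handler
        c.commit(message=msg)      # checkpoint, stored broker-side per group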

All of this is doable with pg, but you'd have to implement it yourself. With kafka and its client drivers, this is handled for you.
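To illustrate the "implement it yourself" part: a common hand-rolled pattern on the pg side uses FOR UPDATE SKIP LOCKED so that concurrent workers claim disjoint rows. A sketch assuming psycopg2 and a made-up events(id, payload) table; process() is a hypothetical handler:

    import psycopg2

    conn = psycopg2.connect("dbname=app")  # hypothetical database

    def consume_one() -> bool:
        # One transaction per message: SKIP LOCKED lets concurrent
        # workers grab different rows, and the DELETE doubles as the
        # "offset commit" - logic you must write and test yourself.
        with conn, conn.cursor() as cur:
            cur.execute("""
                DELETE FROM events
                WHERE id = (SELECT id FROM events
                            ORDER BY id
                            LIMIT 1
                            FOR UPDATE SKIP LOCKED)
                RETURNING id, payload
            """)
            row = cur.fetchone()
            if row is not None:
                process(row)  # hypothetical handler
            return row is not None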


As I said, I haven't given a lot of thought to this, so please take my opinion with a grain of salt - but I believe that once you split the log out of PostgreSQL, a lot of functionality of this ilk could start to be considered. If and when it is added, I think it would make for a stronger PostgreSQL in the end. However, I do understand this is an insane amount of work. In a way, it bears some similarities to splitting GTK out of GIMP: an extremely difficult thing to do, but ultimately a massive win for both projects. This would be even harder, but ultimately, greatly advantageous.


Musing similarly... Or just replace PostgreSQL's transaction log with Kafka, given all of Kafka's advantages. It seems to me that traditional RDBMSs cram too many seemingly independent subsystems into a single system, each of which, if separated behind a well-defined interface, could be made substantially better. I think there's more room for software akin to SQLite - a library replacing RDBMS functionality within an application, thus allowing for composability at the language-linking level as opposed to the process-linking level (done with an orchestrator).



