Hacker News | lfittl's comments

It's worth reading this follow-up LKML post by Andres Freund (who works on Postgres): https://lore.kernel.org/lkml/yr3inlzesdb45n6i6lpbimwr7b25kqk...

>If this somehow does end up being a reproducible performance issue (I still suspect something more complicated is going on), I don't see how userspace could be expected to mitigate a substantial perf regression in 7.0 that can only be mitigated by a default-off non-trivial functionality also introduced in 7.0.

They said the magic words to get Linus to start flipping tables: never break userspace. Unusably slow is broken.

> Maybe we should, but requiring the use of a new low level facility that was introduced in the 7.0 kernel, to address a regression that exists only in 7.0+, seems not great.

Completely right. This sounds like a communication failure. Maybe Linux maintainers should pick a few applications that get "priority support", where problems with those applications are also treated as problems with Linux itself. Breaking Postgres is a serious regression.

Reminds me of a situation where Fedora couldn't be updated if you had Wine installed, and one side of the argument was "user applications are a user problem" while the other was "it's Wine, like come on".


I for one liked the old and simple WE DO NOT BREAK USERSPACE attitude.

https://linuxreviews.org/WE_DO_NOT_BREAK_USERSPACE


Performance regressions are different from ABI incompatibilities. If the kernel refused to do any work that slowed down any userspace program, development would go a lot slower.

Or be a lot uglier. See: Microsoft replacing its own API surfaces with binary-compatible representations to work around companies like Adobe adding "perf improvements" like bypassing the kernel-provided object constructors, because it saved them a few cycles to just hard-code the objects they wanted and memcpy them into existence.

Microsoft's whole "let's just ship all the DLLs" attitude is a big part of the reason a Windows install is so bloated now.

Eventually you'd expect that something has to give.


Slow pace is appropriate for a mature kernel that the entire world relies on.

Not sure that's true anymore. I've encountered a few userspace breaks in io_uring, at least.

Funny how "use hugepages" is right there on the table and 99% of users ignore it.

I’m absolutely flabbergasted by the performance left on the table; even by myself - just yesterday I learned Gentoo’s emerge can use git and be a billion times faster.

The time spent by emerge is utterly dwarfed by the time spent building the packages, so who cares? Maybe it's different if you're installing a binary system, but I don't think most people are doing that.

If you can emerge in 2.86s user time, you can do it right before you emerge world, meaning it's all "done in one interaction" (even if the actual emerge takes an hour, you don't have to look at it).

Whereas if emerge is taking 5-10 minutes, you have to remember to come back to it, or script it.


When using multiple overlays, emerge-webrsync is ungodly slower than git.

That's really not universally true. Building can be parallelized on modern multi-core CPUs (minus configure), emerge cannot be, and portage is really, really slow.

Note that it's not just a single post, and there's additional information to be found in following the full thread. :)

Yes, and in the following messages the conclusion was that the regression is mitigated when using huge pages.

This seems bad; Splunk advises you to turn off THP due to its small read/write characteristics: https://help.splunk.com/en/splunk-enterprise/release-notes-a...

Bad because as of Splunk 10.x, Splunk bundles postgres to integrate with their SOAR platform. Parenthetically, this practice of bundling stuff with Splunk is making vuln remediation a real pain. Splunk bundles its own python, mongod, and now postgres, instead of doing dependency checking. They're going to have to keep doing it as long as they release a .tgz and not just an RPM. The most recent postgres vuln is not fixed in Splunk.


Huge pages and THP are not the same thing.

Which you should always use anyway if you can.

Hmmm, it's not always that clear cut.

For example, Redis officially advised people to disable it due to a latency impact:

https://redis.io/docs/latest/operate/oss_and_stack/managemen...

Pretty sure Redis even outputs a warning to the logs upon startup when it detects hugepages are enabled.

Note that I'm not a Redis expert, I just remember this from when I ran it as a dependency for other software I was using.


1) That is about transparent huge pages, which are a different thing, and 2) it is always clear cut for PostgreSQL: if you can, you should always use huge pages (the non-transparent kind).
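For anyone wanting to act on that, the non-transparent setup is roughly two settings; a sketch (the page count is illustrative — size it from shared_buffers plus some headroom):

```
# /etc/sysctl.conf (or sysctl -w): reserve a pool of 2 MiB huge pages.
# Verify afterwards via the HugePages_* lines in /proc/meminfo.
vm.nr_hugepages = 8500

# postgresql.conf: the default 'try' silently falls back to 4 kB pages
# when the pool is missing; 'on' makes startup fail loudly instead.
huge_pages = on
```

Recent Postgres versions (17+) also expose `SHOW huge_pages_status;` to confirm what actually happened at startup.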

That's transparent huge pages, which are also not the setting recommended for PostgreSQL.

Java can work with transparent huge pages (in addition to preallocated huge pages), but you use -XX:+AlwaysPreTouch to map them in during startup so that at runtime there won't be any delays or jitter. Redis should add a similar option.
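For reference, the HotSpot flags being described look roughly like this (heap sizes are made up for illustration):

```
# Preallocated (explicit) huge pages, faulted in eagerly at startup:
java -Xms16g -Xmx16g -XX:+UseLargePages -XX:+AlwaysPreTouch ...

# Or transparent huge pages, still pre-touched so promotion doesn't
# cause jitter at runtime:
java -Xms16g -Xmx16g -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch ...
```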

AIUI in that thread they're saying "0.51x" the perf on a 96-core arm64 machine and they're also saying they cannot reproduce it on a 96-core amd64 machine.

So it's not going to affect everybody who's both running PostgreSQL and upgrading to the latest kernel. Conditions seem to be: arm64, shitloads of cores, kernel 7.0, current version of PostgreSQL.

That is not going to be 100% of the installed PostgreSQL DBs out there in the wild when 7.0 lands in a few weeks.


It's a huge issue for ARM-based systems that hardly anyone uses or tests on them (in production).

Yes, Macs going ARM has been a huge boon, but I've also seen crazy regressions on AWS Graviton (compared to how it's supposed to perform), on .NET (and Node as well), which frankly I have no expertise or time to dig into.

Which was the main reason we ultimately cancelled our migration.

I'm sure this is the same reason why it's important to AWS.


Macs are actually part of the pain with ARM64 Linux, because the Linux ARM server distros tend to use 64 kB pages while Apple Silicon supports only 4 kB and 16 kB, and that causes non-trivial bugs at times (funnily enough, I first encountered that in a database company...)

It was later reproduced on the same machine without huge pages enabled. PICNIC?

Yes, I did reproduce it (to a much smaller degree, but it's just a 48c/96t machine). But it's an absurd workload in an insane configuration. Not using huge pages hurts way more than the regression due to PREEMPT_LAZY does.

With what we know so far, I expect that there are just about no real world workloads that aren't already completely falling over that will be affected.


So why does it happen only without hugepages? Is the extra overhead / TLB pressure enough to trigger the issue in some way? Or is it because the regular pages get swapped out (which hugepages can't be)?

I don't fully know, but I suspect it's just that, due to the minor faults and TLB misses, there is terrible contention on the spinlock when using 4k pages, regardless of PREEMPT_LAZY (that's easily reproducible). Which is then made worse by preempting more with the lock held.

So perhaps this is a regression specifically in the arm64 code, or, said differently, maybe it's a performance bug that has been there for a long time but was covered up by the scheduler part that was removed?

The following messages concluded that using huge pages mitigates the regression, while not using huge pages reproduces it.

Could be either of those, or something else entirely. Or even measurement error.

Turns out the amd64 machine had huge pages enabled, and after disabling those the regression was there on amd64 too. So arm64 vs amd64 was a red herring.

Of course it's not a nice regression, but you should not run PostgreSQL on large servers without huge pages enabled, so this regression will only hurt people who have a bad configuration. That said, I think these bad configurations are common out there, especially in containerized environments where the one running PostgreSQL may not have the ability to enable huge pages.


Still, that huge a regression affecting multiple platforms doesn't sound great. Did they narrow down the root cause?

That should be obvious to anyone who read the initial message. The regression was caused by a configuration change that changed the default from PREEMPT_NONE to PREEMPT_LAZY. If you don't know what those options do, use the source. (<https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...>)

Yes, I had a good laugh at that. It might technically be a regression, but not one that most people will see in practice. Pretty weird that someone at Amazon is bothering to run those tests without hugepages.

I doubt they explicitly said "I'll run without huge pages, which is an important AWS configuration". They probably just forgot a step. And "someone at Amazon" describes a lot of people; multiply your mental probability tables accordingly.

The number of people at Amazon is pretty much irrelevant; the org is going to ensure that someone is keeping an eye on kernel performance, but also that the work isn’t duplicative.

Surely they would be testing the configuration(s) that they use in production? They’re not running RDS without hugepages turned on, right?


> The number of people at Amazon is pretty much irrelevant; the org is going to ensure that someone is keeping an eye on kernel performance, but also that the work isn’t duplicative.

I'd guess they have dozens of people across say a Linux kernel team, a Graviton hardware integration team, an EC2 team, and an Amazon RDS for PostgreSQL team who might at one point or another run a benchmark like this. They probably coordinate to an extent, but not so much that only one person would ever run this test. So yes it is duplicative. And they're likely intending to test the configurations they use in production, yes, but people just make mistakes.


True; to err is human. But it is weird that they didn't just fire up a standard RDS instance in one or more sizes and test those. After all, it's already automated; two clicks on the website get you a standard configuration and a couple more get you a 96-core Graviton CPU. I just wonder how the mistake happened.

You're assuming that they ran the workload with huge-pages disabled unintentionally.

No… I’m assuming that they didn’t use the same automation that creates RDS clusters for actual customers. No doubt that automation configures the EC2 nodes sanely, with hugepages turned on. Leaving them turned off in this benchmark could have been accidental, but some accident of that kind was bound to happen as soon as the tests use any kind of setup that is different from what customers actually get.

You're again assuming that having huge pages turned on always brings a net benefit, which it doesn't. I have at least one example where it didn't bring any observable benefit, while at the same time it incurred extra code complexity and server administration overhead, and necessitated extra documentation.

FYI: huge pages aren't just a system-wide toggle; there are a variety of things you can do:

* explicit huge pages

* transparent huge pages system-wide default

* app-specific or even mapping-specific toggles

* various memory allocator settings to raise its effectiveness

It would be really surprising to me to see a workload for which it's optimal to not use huge pages anywhere on the system.


It is a system-wide toggle in the sense that it requires you to first enable huge pages and then set them up, even if you just want to use explicit huge pages from within your code (madvise, mmap). I wasn't talking about THP.

When you deploy software all around the globe and not only on your servers that you fully control this becomes problematic. Even in the latter case it is frowned upon by admins/teams if you can't prove the benefit.

Yes, there are workloads where huge pages do not bring any measurable benefit; I don't understand why that would be questionable. Even if they don't hurt runtime performance, which they can, the extra work and complexity they incur is in a sense not optimal compared to the baseline of not using huge pages.


For production Postgres, I would assume it's close to no effect?

If someone is running Postgres in a serious backend environment, I doubt they are using Ubuntu or even touching 7.x for months (or years). It'll be some flavor of Debian or Red Hat still on 6.x (maybe even 5.x). Those same users won't touch 7.x until there have been months of testing by distros.


Ubuntu is used in many serious backend environments. Heroku runs tens of thousands (if not more) instances of Ubuntu on its fleet. Or at least it did through the teens and early 2020s.

https://devcenter.heroku.com/articles/stack


Do they upgrade to the new LTS the day it is released?

Ubuntu's upgrade tools wait until the .1 release for LTSes, so your typical installation would wait at least half a year.

Not historically.

And they are right. This is because a lot of junior sysadmins believe that newer = better.

But the reality:

  a) may get irreversible upgrades (e.g. new underlying database structure) 
  b) permanently worse performance / regressions (e.g. iOS 26)
  c) added instability
  d) new security issues (litellm)
  e) time wasted migrating / debugging
  f) may need rewrite of consumers / users of APIs / sys calls
  g) potential new IP or licensing issues
etc.

A couple of reasons to upgrade something are:

  a) new features provide genuine comfort or performance upgrade (or... some revert)
  b) there is an extremely critical security issue
  c) you do not care about stability because reverting is uneventful and production impact is nil (e.g. Claude Code)
but 99% of the time, if it ain't broke, don't fix it.

https://en.wikipedia.org/wiki/2024_CrowdStrike-related_IT_ou...


On the other hand, I suspect LLMs will dramatically decrease the window between a vulnerability being discovered and that vulnerability being exploited in the wild, especially for open-source projects.

Even if the vulnerability itself is discovered through other means than by an LLM, it's trivial to ask a SOTA model to "monitor all new commits to project X and decide which ones are likely patching an exploitable vulnerability, and then write a PoC." That's a lot easier than finding the vulnerability itself.

I won't be surprised if update windows (for open source networked services) shrink to ~10 minutes within a year or two. It's going to be a brutal world.


Too often I see IT departments use this as an excuse to only upgrade when they absolutely have to, usually with little to no testing in advance, which leaves them constantly being back-footed by incompatibility issues.

The idea of testing new versions of software in advance (software they'll be forced to use eventually) never seems to occur to them, or they spend so much time fighting fires they never get around to it.


All fair points. On the other hand, as a general rule, isn't it important to stay on currently supported versions of the software that you run?

YMMV, but in my experience, projects like PostgreSQL that have been reliable tend to continue to be so.


There is serious as in "corporate-serious" and serious as in "engineer-serious".

I’ve seen more 5k+-core fleets running Ubuntu in prod than not, in my career. Industries include healthcare, US government, US government contractor, marketing, finance.

In other words, those industries that used to run Windows before?

I'd say about 2/3 of the places I've worked started on Linux without a Windows precedent other than workstations. I can't speak for the experience of the founding staff, though; they might have preferred Ubuntu due to Windows experience--if so, I'm curious as to why/what those have to do with each other.

That said, Ubuntu in large production fleets isn't too bad. Sure, other distros are better, but Ubuntu's perfectly serviceable in that role. It needs talented SRE staff making sure automation, release engineering, monitoring, and de/provisioning behave well, but that's true of any you-run-the-underlying-VM large cloud deployment.


A customer of mine is running on Ubuntu 22.04 and the plan is to upgrade to 26.04 in Q1 2027. We'll have to add performance regression to the plan.

Are you running ARM servers?

…which confirms all of my stereotypes. Looks like the AWS engineer who reported it used an m8g.24xlarge instance with 384 GB of RAM, but somehow didn't know or care to enable huge pages. And once they're enabled, the performance regression disappears.

Because such settings aren't obvious to those not familiar with them. LLMs should make discoverability easier, though.

Honest question: what's the value of running the benchmark and reporting a performance regression if the author is not familiar with basic operation of the software? I'd argue that not understanding those settings disqualifies you from making statements about it.

The performance was reduced without a settings change. That is still a regression even if huge pages mitigates the problem.

I'd be curious to know if there's still a regression with hugepages turned on in older kernels.

If you are benchmarking something and the only changed variable between benchmarks is the kernel, that is useful information, even if your environment isn't correctly set up.


Some software clearly wants hugepages disabled, so it's not always the slam dunk people seem to be making it out to be.

e.g. Redis:

https://redis.io/docs/latest/operate/oss_and_stack/managemen...


Thanks for posting! There is a hand-edited transcript here as well, for those who prefer text: https://pganalyze.com/blog/5mins-postgres-19-better-planner-...

And, it's noted in the video/transcript, but for clarity: this is talking about new extensibility in Postgres 19 that makes it easier to build Postgres planner hints / plan management extensions.

The patch that was committed is part of a larger proposal (pg_plan_advice) which, if it ends up being committed, would add a version of planner hints to Postgres itself (in contrib). It remains to be seen where that goes for Postgres 19.


This is great, I have harped on PG's aversion to plan hints quite a bit in the past (and also lack of plan caching and reuse).

One of the biggest issues I run into operationally with relational DBs is the lack of plan stability and the inability, as a developer, to tell the database what I know is always the right thing to do for my application.

I use MS SQL Server more than PG currently and if it did not have plan hinting it would have been catastrophic for our customers.

It's still not perfect, and over the years I have started to think that having smart optimizers that use live statistics may be more of a detriment than a help, due to constantly fighting bad plan generation at typically the worst times (late at night in production). At least with compiler optimization it happens at build time, the results can be reasoned about, and the outcome is stable. SQL is like some insane runtime JIT system, like a JavaScript engine that de-optimizes, but non-deterministically, based on the data in the system at the time.

Using various tricks and writing the query in slightly different ways while praying to the planner god to pick the correct plan is, well, infuriating due to the lack of control over the system.

I much prefer systems like Linq or Kusto where it's still declarative but the pipeline execution order follows the order as written in the code. One of the most helpful and typical hints that solves issues is simply forcing join order and subquery/WHERE order, so that I can say: do this filter first, then this other one second; then the optimizer typically picks the obviously correct indexes to use. Bad plans typically rewrite the order to do a much less selective thing first, destroying performance. I as the developer normally know the correct order of operations for my application's use and write the query that way.


I've actually come around to the Postgres way of thinking. We shouldn't want or need plan hints usually.

Literally every slow Postgres statement I worked on in the last few years was due to lack of accurate statistics, missing indexes, or just badly designed queries. Every one was fixable at the source, by actually fixing the core issue.

This was in stark contrast to the myriads of Oracle queries I also debugged. The larger older ones had accumulated a "crust" of plan hints over the years. Most not so well thought out and not valid anymore. In fact, often just removing all hints made the query faster rather than slower on newer Oracle versions.

It's so tempting to just want to add a plan hint to "fix" the suboptimal query plan. However, the Postgres query planner often has an actual reason for why it does what it does and overall I've found the decisions to be very consistent.


>I've actually come around to the Postgres way of thinking. We shouldn't want or need plan hints usually.

They only come out at night, mostly.

PG is 40 years old and still has planner bugs being fixed up regularly, and having no control and waiting for a new version when a hint could fix the issue at runtime is an obvious problem that should have been addressed long ago.

It's great the devs want to make the planner perfect and strive for that, it is an unattainable goal worth pursuing IMO. Escape hatches are required hence the very popular pg_hint_plan extension.

But in the end after many years of dealing with these things I have come to the opposite conclusion, let the query language drive the plan more directly and senior devs can fix juniors devs mistakes in the apps source code and the plans will be committed in source control for all to see and reference going forward.

SQL comes from an idea of non technical people querying a system in ad-hoc ways, still useful, but if you are technically competent in data structures and programming and making an application that uses the db, the planner just gets in your way at least in my experience.


Specifically on the cost of forking a process for each connection (vs using threads), there are active efforts to make Postgres multi-threaded.

Since Postgres is a mature project, this is a non-trivial effort. See the Postgres wiki for some context: https://wiki.postgresql.org/wiki/Multithreading

But, I'm hopeful that 2-3 years from now we'll see this bear fruit. The recent asynchronous read I/O improvements in Postgres 18 show that Postgres can evolve; one just needs to be patient, potentially help contribute, and find workarounds (connection pooling, in this case).


Would be nice if the OrioleDB improvements were to be incorporated in postgresql proper some day.. https://www.slideshare.net/slideshow/solving-postgresql-wick...


Since there seems to be some confusion in the comments about why pg_query chose Protobufs in the first place, let me add some context as the original author of pg_query (but not involved with PgDog, though Lev has shared this work by email beforehand).

The initial motivation for developing pg_query was for pganalyze, where we use it to parse queries extracted from Postgres, to find the referenced tables, and these days also rewrite and format queries. That use case runs in the background, and as such is much less performance critical.

pg_query actually initially used a JSON format for the parse output (AST), but we changed that to Protobuf a few major releases ago, because Protobuf makes it easy to have typed bindings in the different languages we support (Ruby, Go, Rust, Python, etc). Alternatives (e.g. using FFI directly) make sense for Rust, but would require a lot of maintained glue code for other languages.

All that said, I'm supportive of Lev's effort here, and we'll add some additional functions (see [0]) in the libpg_query library to make using it directly (i.e. via FFI) easier. But I don't see Protobuf going away, because in non-performance critical cases, it is more ergonomic across the different bindings.

[0]: https://github.com/pganalyze/libpg_query/pull/321


I think there are two aspects to that:

1) When do pages get removed? (file on disk gets smaller)

Regular vacuum can truncate the tail of a table if those pages at the end are fully empty. That may or may not happen in a typical workload, and Postgres isn't particular about placing new entries in earlier pages. Otherwise you do need a VACUUM FULL/pg_squeeze.

2) Does a regular VACUUM rearrange a single page when it works on it? (i.e. remove empty pockets of data within an 8kb page, which I think the author calls compacting)

I think the answer to that is yes, e.g. when looking at the Postgres docs on page layout [0] the following sentence stands out: "Because an item identifier is never moved until it is freed, its index can be used on a long-term basis to reference an item, even when the item itself is moved around on the page to compact free space". That means things like HOT pruning can occur without breaking index references (which modify the versions of the tuple on the same page, but keep the item identifier in the same place), but (I think) during VACUUM, even breaking index references is allowed when cleaning up dead item identifiers.

[0]: https://www.postgresql.org/docs/current/storage-page-layout....

Edit: And of course you should trust the parallel comment by anarazel to be the correct answer to this :)


The article has a section where it estimates index bloat based on comparing the number of index reltuples * 40 bytes (?), compared to the size of the file on disk.

This is problematic, first of all because I don't think the math is right (see [0] for a more comprehensive query that takes into account column sizes), and second because it ignores the effects of B-Tree index deduplication in Postgres 13+: [1]

In my experience, fast bloat estimation queries can work okay for table bloat, but for index bloat I'd recommend instead looking at the change in page density over time (i.e. track relpages divided by reltuples), or just go direct to running pgstatindex outside business hours.

[0]: https://github.com/pgexperts/pgx_scripts/blob/master/bloat/i... [1]: https://www.postgresql.org/docs/current/btree.html#BTREE-DED...
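A sketch of the page-density approach (the index name is a placeholder; pgstatindex comes from the pgstattuple extension and is heavy enough that it's best run off-peak):

```
-- Cheap: estimated tuples per page from planner statistics; watch the
-- trend over time rather than the absolute number.
SELECT relname, relpages, reltuples,
       reltuples / NULLIF(relpages, 0) AS est_tuples_per_page
FROM pg_class
WHERE relkind = 'i' AND relpages > 0
ORDER BY relpages DESC
LIMIT 20;

-- Precise but heavier, for a single index:
-- CREATE EXTENSION IF NOT EXISTS pgstattuple;
-- SELECT avg_leaf_density, leaf_fragmentation
-- FROM pgstatindex('my_index_name');  -- placeholder name
```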


Every time Postgres advice says to “schedule [important maintenance] during low traffic period” (OP) or “outside business hours”, it reinforces my sense that it’s not suitable for performance-sensitive data path on a 24/7/365 service and I’m not sure it really aims to be. (To be fair, running it like that for several years and desperately trying to make it work also gave me that feeling. But I’m kind of aghast that necessary operational maintenance still carries these caveats.)


> Every time Postgres advice says to “schedule [important maintenance] during low traffic period” (OP) or “outside business hours”, it reinforces my sense that it’s not suitable for performance-sensitive data path on a 24/7/365 service and I’m not sure it really aims to be.

It's a question of resource margins. If you have regular and predictable windows of low resource utilization, you can afford to run closer to the sun during busy periods, deferring (and amortizing, to some degree) maintenance costs till later. If you have a 24/7/365 service, you need considerably higher safety margins.

Also, there's a lot of terrible advice on the internet, if you haven't noticed.

> (To be fair, running it like that for several years and desperately trying to make it work also gave me that feeling. But I’m kind of aghast that necessary operational maintenance still carries these caveats.)

To be fair, I find Oxide's continual low-info griping against postgres a bit tedious. There's plenty of weaknesses in postgres, but criticizing postgres based on 10+ year old experiences of running an, at the time, outdated postgres, on an outdated OS is just ... not useful? Like, would it be useful to criticize Oxide's lack of production hardware availability in 2021 or so?

Edit: duplicated word removed


> It's a question of resource margins.

What you describe is true and very important (more margin lets you weather more disruption), but it's not the whole story. The problem we had was queueing delays mainly due to I/O contention. The disks had the extra IOPS for the maintenance operation, but the resulting latency for all operations was higher. This meant overall throughput decreased when the maintenance was going on. The customer, finally accepting the problem, thought: "we'll just build enough extra shards to account for the degradation". But it just doesn't work like that. If the degradation is 30%, and you reduce the steady-state load on the database by 30%, that doesn't change the fact that when the maintenance is ongoing, even if the disks have the IOPS for the extra load, latency goes up. Throughput will still degrade. What they wanted was predictability but we just couldn't give that to them.

> To be fair, I find Oxide's continual low-info griping against postgres a bit tedious. There's plenty of weaknesses in postgres, but criticizing postgres based on 10+ year old experiences of running an, at the time, outdated postgres, on an outdated OS is just ... not useful?

First, although I work at Oxide, please don't think I speak for Oxide. None of this happened at Oxide. It informed some of the choices we made at Oxide and we've talked about that publicly. I try to remember to include the caveat that this information is very dated (and I made that edit immediately after my initial comment above).

I admit that some of this has been hard for me personally to let go. These issues dominated my professional life for three very stressful years. For most of that time (and several years earlier), the community members we reached out to were very dismissive, saying either these weren't problems, or they were known problems and we were wrong for not avoiding them, etc. And we certainly did make mistakes! But many of those problems were later acknowledged by the community. And many have been improved -- which is great! What remains is me feeling triggered when it feels like users' pain is being casually dismissed.

I'm sorry I let my crankiness slip into the comment above. I try to leave out the emotional baggage. Nonetheless, I do feel like it's a problem that, intentionally or otherwise, a lot of the user base has absorbed the idea that it's okay for necessary database maintenance to significantly degrade performance because folks will have some downtime in which to run it.


> First, although I work at Oxide, please don't think I speak for Oxide. None of this happened at Oxide. It informed some of the choices we made at Oxide and we've talked about that publicly. I try to remember to include the caveat that this information is very dated (and I made that edit immediately after my initial comment above).

I said oxide, because it's come up so frequently and at such length on the oxide podcast... Without that I probably wouldn't have commented here. It's one thing to comment on bad experiences, but at this point it feels like more like bashing. And I feel like an open source focused company should treat other folks working on open source with a bit more, idk, respect (not quite the right word, but I can't come up with a better one right now).

I probably shouldn't have commented on this here. But I read the message after just having spent a Sunday morning looking into a problem, and I guess that made me more thin-skinned than usual.

> For most of that time (and several years earlier), the community members we reached out to were very dismissive, saying either these weren't problems, or they were known problems and we were wrong for not avoiding them, etc.

I agree that the wider community sometimes has/had the issue of excusing away postgres problems. While I try to avoid doing that, I certainly have fallen prey to that myself.

Leaving fandom-like stuff aside, there's an aspect of having been told over and over that we're doing xyz wrong and that things would never work that way, and succeeding (to some degree) regardless. While ignoring some common wisdom has been advantageous, I think there's also plenty where we have just been high on our own supply.

> What remains is me feeling triggered when it feels like users' pain is being casually dismissed.

Was that done in this thread?


I don't agree that we have been "bashing" Postgres. As far as I can tell, Postgres has come up a very small number of times over the years: certainly on the CockroachDB episode[0] (where our experience with Postgres is germane, as it was very much guiding our process for finding a database for Oxide) and then again this year when we talked about our use of statemaps on a Rust async issue[1] (where our experience with Postgres was again relevant because it in part motivated the work that we had used to develop the tooling that we again used on the Rust issue).

I (we?) think Postgres is incredibly important, and I think we have properly contextualized our use of it. Moreover, I think it is unfair to simply deny us our significant experience with Postgres because it was not unequivocally positive -- or to dismiss us recounting some really difficult times with the system as "bashing" it. Part of being a consequential system is that people will have experience with it; if one views recounting that experience as showing insufficient "respect" to its developers, it will have the effect of discouraging transparency rather than learning from it.

[0] https://oxide-and-friends.transistor.fm/episodes/whither-coc...

[1] https://oxide-and-friends.transistor.fm/episodes/when-async-...


I'm certainly very biased (having worked on postgres for way too long), so it's entirely plausible that I've over-observed and over-analyzed the criticism, leading to my description.

> I (we?) think Postgres is incredibly important, and I think we have properly contextualized our use of it. Moreover, I think it is unfair to simply deny us our significant experience with Postgres because it was not unequivocally positive -- or to dismiss us recounting some really difficult times with the system as "bashing" it. Part of being a consequential system is that people will have experience with it; if one views recounting that experience as showing insufficient "respect" to its developers, it will have the effect of discouraging transparency rather than learning from it.

I agree that criticism is important and worthwhile! It's helpful though if it's at least somewhat actionable. We can't travel back in time to fix the problems you had in the early 2010s... My experience of the criticism of the last years from the "oxide corner" was that it sometimes felt somewhat unrelated to the context and to today's postgres.

> if one views recounting that experience as showing insufficient "respect" to its developers

I should really have come up with a better word, but I'm still blanking on choosing a really apt word, even though I know it exists. I could try to blame ESL for it, but I can't come up with a good German word for it either... Maybe "goodwill". Basically believing that the other party is trying to do the right thing.


>> What remains is me feeling triggered when it feels like users' pain is being casually dismissed.

> Was that done in this thread?

Well, I raised a general problem around 24/7/365 use cases (rooted in my operational experience, reinforced by the more-current words that I was replying to and the OP) and you called it "tedious", "low-info griping". Yes, that seems pretty dismissive.

(Is it fair? Though I thought the podcast episodes were fairly specific, they probably glossed over details. They weren't intended to be about those issues per se. I did write a pretty detailed post though: https://www.davepacheco.net/blog/2024/challenges-deploying-p... (Note the prominent caveat at the top about the experience being dated.))

You also wrote:

> running an, at the time, outdated postgres, on an outdated OS

Yes, pointing out that the software is old and that the OS is unusual (it was never outdated; it was just not Linux) is a common way to quickly dismiss users' problems. If the problems had been fixed in newer versions, that'd be one thing. Many (if not all) of them hadn't been. But also: the reason we were running an old version was precisely that it was a 24/7/365 service and there was no way to update databases without downtime, especially replicated ones, nor a great way to mitigate risk (e.g., a mode for running the new software without updating the on-disk format so that you can go back if it's a disaster). This should be seen as a signal of the problem, not a reason to dismiss it (as I feel like you're doing here). As for the OS, I can only think of one major issue we hit that was OS-specific. (We did make a major misconfiguration related to the filesystem that certainly made many of our issues much worse.)

I get that it sucks to keep hearing about problems from years ago. All of this was on 9.2 - 9.6 -- certainly ancient today. When this comes up, I try to balance sharing my operational experience with the fact that it's dated by just explaining that it's dated. After all, all experience is dated. Readers can ignore it if they want, do some research, or folks in the PostgreSQL world can update me when specific things are no longer a problem. That's how I learned that the single-threaded WAL receiver had been updated, apparently in part because of our work: https://x.com/MengTangmu/status/1828665449850294518 (full thread: https://x.com/MengTangmu/status/1828665439234474350). I'll happily share these updates wherever I would otherwise share my gripes!


Regarding pgstattuple specifically: if this is a 24/7/365 service and you're concerned about the I/O impact of reading the full table or index, you could also run it on a replica. For tables there is pgstattuple_approx, which is much better at managing its impact, but there is no equivalent for indexes today.
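As a sketch of what that looks like (table and index names are placeholders; pgstattuple ships as a contrib extension):

```sql
-- Requires the pgstattuple contrib extension
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- Exact statistics: these scan the entire relation, so on a 24/7
-- system you'd run them on a replica or in a low-traffic window
SELECT * FROM pgstattuple('orders');
SELECT * FROM pgstatindex('orders_pkey');

-- Approximate statistics: skips all-visible pages, so the I/O
-- impact is much lower (tables only, no index equivalent today)
SELECT * FROM pgstattuple_approx('orders');
```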

The REINDEX CONCURRENTLY mentioned in the OP could also be run at other times of the day - the main issue is again the I/O impact (with potentially some locking concerns at the very end of the concurrent reindex, when the new index is swapped in).
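For example (index name is a placeholder):

```sql
-- Rebuilds the index without blocking concurrent writes; the full
-- I/O cost is still incurred, and a short lock is taken at the very
-- end to swap in the newly built index
REINDEX INDEX CONCURRENTLY orders_pkey;
```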

There are no magic solutions here - other databases have to deal with the same practical limitations, though Postgres is sometimes a bit slow to adopt operational best practices in core (e.g. the pg_squeeze mentioned in the OP may finally get an in-core "REPACK CONCURRENTLY" equivalent in Postgres 19, but it's been a long time getting there).


OP here - thank you for mentioning replicas, I forgot to include that. Though if the scan hits the replica pretty hard and `hot_standby_feedback` is on, it can still affect the primary.


Postgres-compatible cloud databases like AlloyDB address this issue.


Yep, I find cloud storage performance quite frustrating, but it's the reality for many production database deployments I've seen.

It's worth noting that even on really fast local NVMe drives the new asynchronous I/O work delivers performance benefits, since it's so much more efficient at issuing I/Os and reducing syscall overhead (in the io_uring case).

Andres Freund (one of the principal authors of the new functionality) did a lot of benchmarking on local NVMe drives during development. Here is one mailing list thread I could find that shows a 2x-and-better benefit with the patch set at the time: https://www.postgresql.org/message-id/flat/uvrtrknj4kdytuboi...


It depends on the I/O method - as described in the article, "io_uring" is only available on Linux (and requires building with liburing, as well as io_uring being enabled in the kernel), but the default (as of beta1) is actually "worker", which works on any operating system.

The "worker" method uses a dedicated pool of I/O worker processes that run in the background, and whilst not as performant as io_uring in our benchmark, did clearly outperform the "sync" method (which is the same as what Postgres currently has in 17 and older).
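As a sketch, the relevant postgresql.conf settings look like this (values shown are illustrative; check the Postgres 18 docs for the exact defaults):

```ini
# postgresql.conf
io_method = worker       # default: pool of background I/O worker processes
#io_method = io_uring    # Linux only; requires a build with liburing
#io_method = sync        # synchronous I/O, matching Postgres 17 and older
io_workers = 3           # worker pool size, used when io_method = worker
```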


> "io_uring" is only available on Linux

Windows now also has IORing (see my comment above)


Yes, but what's your point? It's not like we can take this new Postgres version and just compile it on a Windows machine supporting IORing, can we? It requires effort and time - many times from volunteers - and given that most Postgres deployments run on Linux, it's understandable that the developers focus on that platform first.


I misunderstood - I thought you were saying io_uring was a Linux-specific feature in general, but you meant Postgres's support for it is Linux-specific.


Yeah, as one of the main authors of libpg_query, I think the primary thing that makes this easier is that Postgres has good abstractions internally, and the parser works independently from other parts (e.g. the community discourages adding settings that affect parser behavior).

Over the years we've only had to maintain a small set of patches on top of the Postgres source [0], together with some mocks and our libclang-based extraction logic [1]. Of course it would be nice if Postgres just packaged this directly like it packages the client libraries, but there is non-trivial effort involved to do that. From what I recall, the main issue is that error handling and memory allocations work differently in the client-side libraries (and so that would have to either also be moved out of the backend source, or use some other abstraction).
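For reference, a minimal sketch of calling libpg_query from C (function and struct names are from the library's public pg_query.h header; building this requires linking against libpg_query):

```c
#include <stdio.h>
#include <pg_query.h>

int main(void)
{
    /* Parse SQL into Postgres' raw parse tree, serialized as JSON */
    PgQueryParseResult result = pg_query_parse("SELECT id FROM users WHERE active");

    if (result.error)
        fprintf(stderr, "parse error at %d: %s\n",
                result.error->cursorpos, result.error->message);
    else
        printf("%s\n", result.parse_tree);

    pg_query_free_parse_result(result);
    return 0;
}
```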

[0]: https://github.com/pganalyze/libpg_query/tree/17-latest/patc...

[1]: https://github.com/pganalyze/libpg_query/blob/17-latest/scri...


Unrelated to your comment but a big fan of pganalyze. Makes Postgres infinitely more grokable for developers who aren’t experts at running and tuning a database. Keep up the good work!


Thanks, glad to hear! I like to think that one of the reasons pganalyze is a good product (though there are always parts I'd like to improve, and feedback is always welcome) is because we like to use it ourselves to optimize our own database, where we can :)


pganalyze | Senior Solutions Engineer | REMOTE (US/Canada) | Full-time | $160-200k + equity | https://pganalyze.com

At pganalyze, we're redefining how developers optimize one of the world's most popular databases, PostgreSQL. Our software gives companies like Atlassian, Robinhood, and Notion the tools to solve their most complex Postgres performance challenges, ensuring their mission-critical applications run smoothly at scale.

For our Solutions Engineering role you will directly work with our VP of Sales, as well as our Founder & CEO (me!) on strategic deals, and develop the playbook for integrating pganalyze in complex customer environments. You will also actively contribute to both our documentation and the product itself. Our team is cross-functional, and our goal is to improve the pganalyze product where we can, to allow repeated delivery of a great customer experience.

We are looking for candidates with a strong background in working on Linux VMs in the cloud or on-premise, and with Kubernetes or containers. Go experience is a big plus, since our agent (the pganalyze collector) is written in Go, and this role may involve making the collector work in different environments, and improving cloud provider API integrations.

Details on https://pganalyze.com/careers/senior-solutions-engineer

Interested? Email me (the Founder & CEO) at lukas AT pganalyze.com - or fill out the form on the website.

