
Should never use uuidv4 (the builtin gen_random_uuid()) for identifiers anyway since it clusters horribly and complete randomness is the absolute enemy of anything you want to "look up".

uuid_generate_v1mc() from uuid-ossp is a much better choice if you really want a uuid identifier, since it will at least cluster much better.

And yes, absolutely _never_ make it your primary key. I've seen that mistake far too many times, and it's always a nightmare to fix once there's a non-trivial amount of data and the database is miserable to work with.



>randomness is the absolute enemy of anything you want to "look up".

Absolutely true that having data ordered on a column that's one of the search predicates makes lookups faster. But aren't there many situations where the PK is essentially random even if it's an auto-incrementing integer?

Like in a customers table, is the order in which someone became your customer ever really relevant for most operations? It's essentially a random id, especially if you're only looking up a single customer (or things related to that single customer).

Insert performance I could see being an issue. I'm not super familiar with Postgres performance tuning, but if the engine handles appends to a single hot last page better than inserts spread across all pages (which I think is the case, given how the WAL operates), I can see that being a compelling reason to go with a sequential PK.
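That locality intuition can be sketched with a toy model. Everything below is made up for illustration (it's nothing like Postgres's real B-tree code); it just measures how many distinct leaf pages the most recent inserts touch under the two key orders:

```python
import bisect
import random

def pages_touched(keys, page_size=64, window=256):
    """Toy model of B-tree leaves: pages are sorted runs of keys that
    split at the midpoint when full. Returns how many distinct pages
    the last `window` inserts landed on (pages identified by their
    first key). Sequential keys should hit one hot rightmost page;
    random keys should scatter across the whole index."""
    pages = [[]]  # globally ordered list of sorted leaf pages
    touched = []
    for k in keys:
        i = 0  # linear scan for the target page; fine for a toy model
        while i + 1 < len(pages) and pages[i + 1][0] <= k:
            i += 1
        bisect.insort(pages[i], k)
        touched.append(pages[i][0])  # record which page we dirtied
        if len(pages[i]) > page_size:  # naive midpoint split
            mid = page_size // 2
            pages.insert(i + 1, pages[i][mid:])
            pages[i] = pages[i][:mid]
    return len(set(touched[-window:]))

random.seed(0)
n = 5000
seq = pages_touched(range(n))                 # sequential inserts
rnd = pages_touched(random.sample(range(n), n))  # same keys, random order
```

With sequential keys the last 256 inserts touch only a handful of pages; with random keys they touch a large fraction of the whole index, which is exactly the cache-pressure argument.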


You're correct that for simple one-off cases, in a new-ish table, the performance difference is tiny if you can measure it at all.

As tables and indices age, they'll bloat, which can cause additional latency. Again, you may not notice this for small queries, but it can start becoming obvious in more complex queries.

The main issue I've seen is that since software and hardware is so absurdly fast, in the beginning none of this is noticed. It isn't until you've had to upsize the hardware multiple times that someone stops to think about examining schema / query performance, and at that point, they're often unwilling to do a large refactor. Then it becomes my problem as a DBRE, and I sigh and explain the technical reasons why UUIDs suck in a DB, and inevitably get told to make it work as best I can.

> Insert performance I could see being an issue.

And UPDATE, since Postgres doesn't actually update in place (under MVCC it writes a new row version and marks the old one dead, effectively an INSERT + DELETE).



I think that is putting it rather strongly. I use UUID7 (for its better clustering) for anything that both 1) may end up having a lot of rows (1M+), and 2) where leaking the creation time in the UUID is acceptable.

Otherwise I use UUID4.
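For anyone curious why v7 clusters better: the layout puts a millisecond timestamp in the leading bits, so byte order matches time order. A minimal sketch of the RFC 9562 layout (illustrative only, use a vetted library in production; the `ms` parameter is my addition for determinism):

```python
import os
import time
import uuid

def uuidv7(ms=None):
    """Minimal UUIDv7 sketch per RFC 9562: 48 bits of unix-epoch
    milliseconds, version/variant bits, then random bits."""
    if ms is None:
        ms = int(time.time() * 1000)
    b = bytearray(ms.to_bytes(6, "big") + os.urandom(10))
    b[6] = (b[6] & 0x0F) | 0x70  # version 7
    b[8] = (b[8] & 0x3F) | 0x80  # RFC 4122 variant
    return uuid.UUID(bytes=bytes(b))

# ids generated later sort later, because byte order == time order:
a = uuidv7(ms=1_700_000_000_000)
b = uuidv7(ms=1_700_000_000_001)
assert a.bytes < b.bytes
```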


> where leaking the creation time in the UUID is acceptable.

I've seen tons of people cite this as a problem for various reasons, but not a single person has ever given an actual example in the wild of it causing problems. This isn't war; the German Tank Problem is not nearly as applicable as people think.

Worse (from a DB perspective), it's often used as an excuse for poor API design. "I have to have a random ID, because the API is /api/user/<user-id>" and the like. OK, so don't do that? Put the user ID into a JWT (encrypted first if you'd like). Or use an external ID that is random, and an internal ID that's a sequential ID which maps to the external. Use the internal ID as the PK.
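The external/internal split is simple to sketch. All class and field names below are hypothetical; the dicts stand in for a table and its indexes:

```python
import itertools
import uuid

class UserStore:
    """Sketch of the two-id pattern: a sequential internal PK for
    joins and index clustering, plus a random external id that is
    the only identifier clients ever see."""
    def __init__(self):
        self._seq = itertools.count(1)
        self._rows = {}         # stands in for the table, keyed by PK
        self._by_external = {}  # stands in for an index on external_id

    def create(self, name):
        pk, ext = next(self._seq), uuid.uuid4()
        self._rows[pk] = {"id": pk, "external_id": ext, "name": name}
        self._by_external[ext] = pk
        return ext  # only the random external id leaves the system

    def get(self, external_id):
        return self._rows[self._by_external[external_id]]

store = UserStore()
ext = store.create("alice")
assert store.get(ext)["id"] == 1  # internal PK stays sequential
```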


One year later, in a News Article:

"$person_name's lawyer could not be reached for comment, but it is noted that the userid mentioned in the lawsuit was created on 2022-12-03, the same day as the photos were uploaded, and the same day as the arrest warrant was issued..."
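And the embedded timestamp really is trivial to read back out of a v7 id; a sketch assuming the RFC 9562 layout (the sample id below is constructed by hand for illustration):

```python
import datetime
import uuid

def v7_created_at(u):
    """Recover the embedded creation time from a UUIDv7: the first
    48 bits are milliseconds since the unix epoch (RFC 9562)."""
    ms = int.from_bytes(u.bytes[:6], "big")
    return datetime.datetime.fromtimestamp(ms / 1000, tz=datetime.timezone.utc)

# hand-built v7 id whose timestamp field is 2022-12-03T00:00:00Z:
ms = int(datetime.datetime(2022, 12, 3,
                           tzinfo=datetime.timezone.utc).timestamp() * 1000)
u = uuid.UUID(bytes=ms.to_bytes(6, "big") + b"\x70" + b"\x00" * 9)
created = v7_created_at(u)
```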

Using the id directly as the PK admittedly just makes things a bit simpler: one less bit of information to track down. I know bridge tables are not hard, but it's one extra step the developer has to be aware of and plan ahead for (and possibly something all your teams and customer-support people need to be aware of when they're tracking down an issue with an account).

UUIDv4 lets you move fast now and pay your performance-piper next year rather than now.


PK should be bigserial no matter what. "Use an external ID that is random" has the same clustering problem mentioned above since you will need an index on it, so it's not always the right answer, but I would default to it.


> same clustering problem

Not necessarily. Depending on your secrecy desire, you could include the internal ID in JWT or cookie, and then locate the row (including the external UUID) using that. Another option would be to build an index with `INCLUDE external_id`. That way, it doesn’t impact the B+tree (kind of) but it still gets pulled along for query results. The leaf nodes would become larger, but the overall ordering wouldn’t change (I think – never actually tried this).


If you mean to encrypt your internal IDs when sending to clients and decrypt when receiving, it'd work. I've never seen it done this way, maybe because with standard 2048-bit RSA you get a 2048-bit output at least, which would bloat up the API responses.
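You don't need RSA for this: any reversible keyed permutation over 64-bit ints keeps the token the same width as the id. A toy Feistel-network sketch (not a vetted cipher, purely to show the size argument; all names are made up):

```python
import hashlib

MASK32 = 0xFFFFFFFF

def _round(half, key, i):
    """Keyed round function: 32 bits derived from SHA-256."""
    data = key + i.to_bytes(1, "big") + half.to_bytes(4, "big")
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")

def feistel64(n, key, rounds=4, decrypt=False):
    """Toy 4-round Feistel network over 64-bit ints: a reversible
    internal-id -> opaque-token mapping whose output stays 64 bits
    wide, unlike RSA's 2048-bit ciphertext."""
    left, right = (n >> 32) & MASK32, n & MASK32
    if not decrypt:
        for i in range(rounds):
            left, right = right, left ^ _round(right, key, i)
    else:  # run the rounds backwards to invert
        for i in reversed(range(rounds)):
            left, right = right ^ _round(left, key, i), left
    return (left << 32) | right

token = feistel64(12345, b"secret")
assert feistel64(token, b"secret", decrypt=True) == 12345
assert token < 2**64  # no ciphertext bloat in the API response
```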


I like it as a primary key because you don't have to worry about clashes if you rejig data. But I don't work at scale with data, so performance hasn't been a blocker for this approach.


If you're not working at scale, IMO this is also not a difficult problem to overcome with integer IDs. Dump the DB into a CSV and increment the ID column by N, then reload. Or copy the table to a new table, add a trigger to copy from the original to the new with an ID increment, then do `UPDATE <temp_table> SET id = id + N WHERE id <= $MAX_ID_ORIGINALLY_SEEN`.
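The offset trick, sketched in Python rather than SQL (function and field names are made up; the dicts stand in for table rows):

```python
def merge_with_offset(existing, incoming, key="id"):
    """Sketch of the id-shift reload: bump every incoming id by N
    (the current max) so the two row sets can't collide. In a real
    migration you'd rewrite foreign keys by the same offset."""
    n = max(row[key] for row in existing) if existing else 0
    remapped = [{**row, key: row[key] + n} for row in incoming]
    return existing + remapped

a = [{"id": 1, "name": "x"}, {"id": 2, "name": "y"}]
b = [{"id": 1, "name": "z"}]       # would clash with a's id 1
merged = merge_with_offset(a, b)   # b's row becomes id 3
```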


Yeah, but that requires basically a special favor from the DBA to create the trigger for what feels like a basic, common-enough data-load task...

What I normally see though is that the auto-incrementing sequence (at least in MS SQL Server or Oracle) isn't clever enough to say "wow, that id already exists on your table somehow? Here let me bump the sequence again and you try again with a higher ID..."

Instead you get a 2am alarm because the generated sequence somehow ran into the block of IDs that you inserted into the table and crashed with a PK unique constraint violation.

Hence UUIDv7 or ULIDs being easy to insert from the temp table into the main table.
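That 2am failure mode is easy to model (toy code, nothing like real sequence internals, but the key point holds: nextval() never looks at the table):

```python
import itertools

class Table:
    """Toy model of a serial PK: the sequence hands out ids without
    checking existing rows, so a manual insert ahead of it causes a
    duplicate-key failure when the sequence catches up."""
    def __init__(self):
        self.rows = set()
        self.seq = itertools.count(1)

    def insert_auto(self):
        pk = next(self.seq)  # like nextval(): blind to table contents
        if pk in self.rows:
            raise ValueError(f"duplicate key value: id={pk}")
        self.rows.add(pk)
        return pk

    def insert_manual(self, pk):  # e.g. a bulk load with explicit ids
        self.rows.add(pk)

t = Table()
t.insert_auto()      # sequence hands out 1
t.insert_manual(2)   # loaded data claims id 2
try:
    t.insert_auto()  # sequence hands out 2 -> unique violation
    crashed = False
except ValueError:
    crashed = True
```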


It's easier if you have a non-PK UUID column you can use as a crutch.


What is rejigging data? Combining tables into one but keeping original PKs?


Yes. I found it often necessary at my last job, either due to loading client data from various datasets and staging tables (ideally in an idempotent way to prevent mistakes or double entries), or because someone asked "Hey, if we already have the data in system A, can we just copy the data to system B?"


> And yes, absolutely _never_ make it your primary key.

If I don't have lots of range queries, why not then? Only because of bloat (I think fragmentation is the more precise term here)?


Because there's a good chance down the line you will need to do some sort of range query. Let's say you want to add and backfill a column. Not too bad: you create a partial index where the column is null and use that for backfilling the data.

But at a certain scale that starts taking too long and a bigint column would be quicker. Or you decide you need to periodically scan the table in batches for some reason. Perhaps to export the contents to a data warehouse as part of an initial snapshot.

You can skip enumerating these possibilities by having a bigint surrogate key from the get-go. There are other advantages as well, like better joins and temporal locality when the bigint index can be used rather than the uuid one.
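The batched-scan case is the clearest win; a sketch where the yielded ranges stand in for cheap `WHERE id > lo AND id <= hi` predicates that a random UUID key can't give you:

```python
def scan_in_batches(max_id, batch=1000):
    """Sketch of batched range scans over a bigint PK: walk the key
    space in fixed-size ranges, each of which the index can satisfy
    with a cheap range scan."""
    lo = 0
    while lo < max_id:
        hi = min(lo + batch, max_id)
        yield (lo, hi)  # stands in for: SELECT ... WHERE id > lo AND id <= hi
        lo = hi
```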



