Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

You require three fields (or four): email at registration, a date for that entry (together these create a natural key), and current email (this one not part of the key and editable).

We're almost all the way to a Tag URI[0], so you could combine it with the user's name or username or any other identifier that fits the spec[1] (you could even use the website's own name) and you have a (definitely two thirds, probably 100%) natural key.

It's stable over time and unique, easy to mint, and has a standard behind it. The user also gets to change their contact details without any problem related to the key.

[0] https://taguri.org/

[1] http://www.faqs.org/rfcs/rfc4151.html



But now then if you want to expose a detail page for that user the id for identifying that page has to include all this potentially personal information about them?

e.g. instead of mysocialmedia.com/users/2374927

you would be showing

mysocialmedia.com/users/email@example.com-2024-06-05-mysocialmedia.com

Then exposing a lot of information that you may have not wanted to expose.


You don’t have to use the PK as the URL slug. Even if you want to route that way, you can have an internal ID and external ID. This is one way to use something random like a UUIDv4 for display without incurring the costs (at least, some of them) of having it as a PK.


And then if you want to list other entities to that user you will have to start mapping the external id and foreign relationships every time to external users?

And also if you are doing exception logging, for ids/primary keys there's higher odds of them being logged out, including your own logs and also external platforms.

It feels like having primary key set up like this just will complicate everything unnecessarily for the future including many edge cases that you don't foresee.

Just have the main ID not have any meaning.

It shouldn't contain information about the date, it shouldn't be auto increment, it should really be just random.


The solution I outlined is the one GitLab and PlanetScale both use internally, so it has been tested at scale and works well, for both Postgres (the former) and MySQL (the latter).

> It shouldn't contain information about the date, it shouldn't be auto increment, it should really be just random.

That’s a great way to tank performance. You want your PK to be k-sortable.


> And then if you want to list other entities to that user you will have to start mapping the external id and foreign relationships every time to external users?

If we're talking about relational database engines, that's what they do, relate things. One join statement is much the same as another.


Except you're encoding PII in the ID, which makes them plainly visible to people who should not have access to user data, and hard or impossible to change. Sure, I could e.g. change my e-mail and the contact data would be updated, but you still have the old e-mail associated with my account via ID. I'm not sure this would fly under GDPR.


Erm, don't show the ID to people who don't need it.

Aside from that, it's not a violation of GDPR to keep personal information (that they consented to you having) in order to process business for that person. Using an email address as a unique identifier is not a violation, using it to spam them would be. If they're willing to give you their current email why not an old one?


> Erm, don't show the ID to people who don't need it.

How do you communicate with other people in your company about a customer without sending around PII if the customer's ID is PII?

Maybe we could create a field that uniquely identifies the customer that isn't PII. Then that could be used to uniquely identify a customer in places where we don't want to expose their PII. But then... why not just use this unique ID as the key?


Do you often send the auto-incremented int (that would be the default substitute to this) when communicating with others? Then why would you send this?

It's so strange an argument. Right now you have my username but not my email address, yet you can still query the website database and get certain data that you're allowed to see. There are so many ways to query a particular user's data, and they would all depend on what you're trying to do, needing the specific key would mean you should have access to it anyway and it could be given on per case basis anyway.


> Do you often send the auto-incremented int (that would be the default substitute to this) when communicating with others?

Frequently yes. It is extremely common to communicate about specific records using the ID for that record. The fact that this sort of behavior is extremely common is pretty clearly indicated by the question itself.

> There are so many ways to query a particular user's data, and they would all depend on what you're trying to do, needing the specific key would mean you should have access to it anyway

A responsible organization at scale with limit and log access to customer data. I should be able to determine if two people are talking about the same customer record without needing access to that record's PII.

It is much better to have an artificial key that is linked to this data. There is no upside to the natural key and many, many downsides.


> Do you often send the auto-incremented int (that would be the default substitute to this) when communicating with others?

It's not an int, but yes, we have a unique synthetic identifier that serves as the database PK and as a means of communicating about a customer in insecure channels without exposing PII. "Customer ID ### is having an issue with such-and-such."

To turn your second part back around: why a natural key? What is the function of minting a natural key if humans are meant to use something else?


> To turn your second part back around: why a natural key? What is the function of minting a natural key if humans are meant to use something else?

Because non-natural keys are unnecessary in the presence of a natural key, and unnecessary things bring in complexity.

> "Customer ID ### is having an issue with such-and-such."

Then you need access to the customer's ID, but the devil here is in the detail you didn't add, the such-and-such.

> communicating about a customer in insecure channels

Use secure channels…


> Because non-natural keys are unnecessary in the presence of a natural key, and unnecessary things bring in complexity.

None of the things you've presented are actually "natural" keys, they are pieces of information that you've made assumptions about to shoehorn them into being usable as a "natural key".

> Use secure channels…

No channel is perfectly secure. As channels become more secure, they become harder to use and add complexity. The more places you store customer data, the more risk you create. The attempt to force semantic data to serve as "natural key" has now added risk and complexity to your entire communication infrastructure.


I don’t believe you understand what a natural key is, but aside from that, I’m not the one advocating passing around IDs like that isn’t a security failing. If you wouldn’t put it in a URL then you shouldn’t be passing it around anyway.


> Use secure channels…

When it comes to PII at my company, secure channels means "encrypted email only". No Slack, no Jira, no chat in video calls.

That's just not feasible for 100% of communications.


Then use a time-limited token, you can assign it to a particular role or support engineer too. You could do fancy things like making it pronounceable… there are so many options that do not involve passing around keys (while fearing you might leak an email address, which is less worrying than the key, IMO).




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: