
This is what made it a funny thought experiment to me. I was doing some space-related stuff a while back, and when you're dealing with something as small as 100 Mb, with the added constraint of up to a light-year or more of latency (which makes error correction immensely costly), having a quantum-processor receiver find the collision of a unique hash could be faster than transmitting the data with errors.

There's probably a distance at which the latency in light-years is longer than the time it takes a quantum processor with some shortcuts to exhaust the search space for the collision on the hash - and potentially hashing algorithms that contain information and hints about a hyperplane that generates the original data over the field of possibilities. To a quantum computer with sufficient qubits, at a long enough latency distance away, the 32-bit hash of a 2048-bit RSA key may "arrive" faster than transmitting the whole key.
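As a rough sketch of the scales involved (Python; the 1 GH/s classical hash rate is just an assumed figure, and the quantum line only illustrates the ~sqrt(N) query scaling of Grover-style search):

    import math

    LIGHT_YEAR_SECONDS = 365.25 * 24 * 3600  # one-way latency over 1 light-year, ~3.16e7 s
    HASHES_PER_SECOND = 1e9                  # assumed classical rate (1 GH/s)

    search_space = 2 ** 32                   # every possible 32-bit hash value
    classical_seconds = search_space / HASHES_PER_SECOND
    grover_queries = math.isqrt(search_space)  # quantum search needs ~sqrt(N) queries

    print(f"one-way latency for 1 ly: {LIGHT_YEAR_SECONDS:.2e} s")
    print(f"classical exhaust of 2^32: {classical_seconds:.2f} s")
    print(f"Grover-style queries for 2^32: {grover_queries}")

Even classically, enumerating a 32-bit space is trivial next to a year of latency - the hard part, as the replies point out, is that enumerating it doesn't tell you which preimage was meant.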

It feels like a fundamental misunderstanding of something in communication theory, but this was the thinking that prompted it.



Using the pigeonhole principle, consider how many 100-megabit values must, on average, hash to the same 32-bit value. There are 2^100,000,000 possible inputs and only 2^32 possible digests, so on average 2^(100,000,000 - 32) inputs share each digest - a huge number. This is a bit of a paradox in cryptographic hashing: finding a collision is difficult, but the number of collisions is enormous.
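A quick Python sanity check on the size of that exponent (the figures follow directly from the bit counts above):

    import math

    input_bits = 100_000_000  # a 100-megabit message
    hash_bits = 32

    # On average, 2^(input_bits - hash_bits) inputs share any one digest.
    preimages_log2 = input_bits - hash_bits
    decimal_digits = int(preimages_log2 * math.log10(2)) + 1
    print(f"~2^{preimages_log2:,} preimages per digest")
    print(f"(a number with about {decimal_digits:,} decimal digits)")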


Yeah, you're fundamentally missing what I wrote, as well as the pigeonhole principle and the basics of information theory.

Let's say you replace the quantum computer with an oracle. You have an infinitely long dictionary; you can turn to any page, find your hash, and see every possible input that could produce that hash. There are billions and billions of options that all match your hash - how could you pick which one is definitely the "right" one?

You're no closer to knowing what the transmitted message was, because of the astronomical number of inputs that you now have to sift through.

[ Technically, you're (length of hash in bits) closer. Say you have a 10-bit hash, i.e. 1024 values. There are, on average, two 11-bit inputs that produce that hash, four 12-bit inputs, eight 13-bit inputs, and so on, following the powers of 2. At an input of N = 30 bits, you have 2^20 =~ 1 million possible inputs with that exact same hash. A 1,000-bit input? 2^990 of them - a number with roughly 300 decimal digits. ]
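You can watch that doubling directly with a toy 10-bit hash (a sketch: SHA-256 truncated to 10 bits stands in for the oracle's dictionary):

    import hashlib
    from collections import Counter

    HASH_BITS = 10  # toy digest: 1024 possible values

    def tiny_hash(data: bytes) -> int:
        # Truncate SHA-256 to HASH_BITS bits so exhaustive search is feasible.
        full = int.from_bytes(hashlib.sha256(data).digest(), "big")
        return full & ((1 << HASH_BITS) - 1)

    for n_bits in range(11, 19):
        # Hash every n_bits-bit input and count how many land on digest 0.
        counts = Counter(tiny_hash(i.to_bytes((n_bits + 7) // 8, "big"))
                         for i in range(2 ** n_bits))
        print(f"{n_bits}-bit inputs sharing one digest: {counts[0]} "
              f"(expected ~{2 ** (n_bits - HASH_BITS)})")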


This part - that a cryptographic hash has a massive number of possible collisions for a given plaintext, just spread over such a large field, and incrementing with the input size, so that it's merely unlikely you'll ever find one - is the counterintuitive part. I haven't implemented a secure hashing algorithm from scratch, so this aspect would have come out in that exercise.

Even working in security and crypto, the presumption about a cryptographic hash from a standard algorithm is that it's universally unique for the data it hashes - barring someone finding collisions, at which point the algorithm gets deprecated. Most people deal with hashes at a higher level of abstraction.

Hash collisions have been used in a few high-profile attacks in the last decade - most notably when some SSL certificates were still using MD5, and more recently a smart contract issue where the contract validated only the last 4 bytes of a cryptographic hash and someone produced a key that exploited that with a collision. Unfortunately, the people who don't make those mistakes are both very rare and often difficult to work with.
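For a sense of how weak a 4-byte check is, here's a sketch of a birthday search in Python for two inputs whose SHA-256 digests share the same last 4 bytes (about 2^16 attempts on average):

    import hashlib

    def last4(data: bytes) -> bytes:
        # The mistake in question: comparing only the digest's last 4 bytes.
        return hashlib.sha256(data).digest()[-4:]

    seen = {}  # 4-byte suffix -> first input that produced it
    i = 0
    while True:
        msg = i.to_bytes(8, "big")
        suffix = last4(msg)
        if suffix in seen:
            print(f"collision after {i + 1:,} tries:")
            print(f"  {seen[suffix].hex()} and {msg.hex()} share suffix {suffix.hex()}")
            break
        seen[suffix] = msg
        i += 1

Hitting one specific target's 4 bytes (a second preimage rather than any colliding pair) takes ~2^32 attempts, which is still entirely practical offline.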

The idea that there are many, many potential collisions for a given input to a cryptographic hash doesn't come up much unless someone manages to find a single collision in one. I'd even posit that nobody outside academic cryptography circles is including SHA-256 collisions in their threat scenarios right now.

However, the absurd/imaginary/hypothetical/counterfactual I was raising is that a future quantum computer could produce all inputs of that length in a reasonable amount of time. The question was more about what is sufficient to reconstruct the data, and it was funny to know something was specifically wrong but not know why.

In any case, I appreciate the time taken to illustrate it.




