I wonder how much money in Bitcoin they would have made if they had used those hashes to mine instead.
Speaking of which, it would be funny if they were pretending to do security research but had added a secret backdoor to mine bitcoin instead, somehow exporting those hashes or reusing the partial results of the SHA-1 calculations (Bitcoin uses SHA-256, not SHA-1).
I'm just joking, but I wonder if that's possible. If anyone is Machiavellian enough for that, it's security researchers.
> When disagreeing, please reply to the argument instead of calling names. "That is idiotic; 1 + 1 is 2, not 3" can be shortened to "1 + 1 is 2, not 3."
I only half joke when I ask: when you break a cryptographic hash, does that mean it's now just a really amazing compression algorithm, albeit with a very heavy compute requirement? The non-joking half is speculating about what data compression looks like in a post-quantum-compute world.
The compression joke evokes a misconception (as STEM jokes often do), which in this case is that breaking a hash means finding a way to recover the full text of a multi-megabyte document from a hash of a hundred-something bits.
Whereas all it means is that it's become feasible for someone to produce a fake document with the same hash as the genuine one. The attack depends on the existence of collisions, which exist because the hash maps an effectively infinite set of inputs onto a fixed-size output, such that an infinite set of documents could be "recovered" from the hash.
Most of the documents you could "decompress" from a hash are going to be gibberish, but some of them are going to be legit. If a hash is "broken", that means it's somehow unduly easy to search that ocean of gibberish collisions to find a chosen text which goes to a certain desired hash.
Nothing quantum can redefine a function that isn't one-to-one and onto into one that is. The hashing function remains ever the mathematical object that it is: a many-to-one function with no unique inverse. But calculations of, or searches through, that function can be faster.
In a post-quantum world, pre-quantum crypto and hashing could well turn out to be "for shit" as such, but not due to ruses like suddenly reversing irreversible functions.
Not saying that the joke isn't funny! We can imagine some hard-core computer science or math sitcom in which the characters say things like this, and only the very narrow target audience gets it.
These hashes, regardless of their actual cryptographic strength, are intended to be (more or less) indistinguishable from a random generator keyed with the input. Assuming that they succeed in that, hash outputs should be quite well distributed. That means that for each n-bit hash output, there will be about two (n+1)-bit input strings that produce that hash, about four (n+2)-bit input strings, eight (n+3)-bit input strings, etc. So while there are probably few ASCII input strings that map to one particular hash, for example, there will be loads and loads of arbitrary bitstrings that map to that hash.
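To make that concrete, here's a quick toy sketch of my own (SHA-1 truncated to 10 bits standing in for a "real" hash, bit strings encoded as ASCII '0'/'1' text) that enumerates every 11-, 12-, and 13-bit input and counts how many land on one particular output:

    import hashlib
    from collections import Counter

    N_BITS = 10  # toy hash: top 10 bits of SHA-1

    def toy_hash(bits: str) -> int:
        digest = hashlib.sha1(bits.encode()).digest()       # 160-bit SHA-1
        return int.from_bytes(digest, "big") >> (160 - N_BITS)

    for extra in (1, 2, 3):
        length = N_BITS + extra
        # hash every bit string of this length and tally outputs
        counts = Counter(toy_hash(format(i, f"0{length}b")) for i in range(2 ** length))
        avg = (2 ** length) / (2 ** N_BITS)
        print(f"{length}-bit inputs: {avg:.0f} preimages per output on average, "
              f"e.g. output 0 has {counts[0]}")

The averages come out at exactly 2, 4, and 8 preimages per output; individual outputs bounce around those values, which is what you'd expect from something behaving like a keyed random generator.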
Compression algorithms must encode data in a reversible way, which hash functions can't do in general.
For example, SHA-1 outputs a 160 bit hash, which means some input of 161 or more bits will definitely have the same hash as some other input of the same size. Even smaller inputs may have collisions too.
Not so simple. A hash corresponds to infinitely many messages, but how many of them are in the ASCII charset and under 1KB long? It may happen that each hash has a unique message within these constraints.
The pigeonhole principle forbids it, since the hash itself is an ASCII message under 1KB long. Even within your message constraints, a message can contain any possible hash, and more besides. Therefore there are more valid messages than hashes.
Most of them will be gibberish; English text has only about 2.3 bits of entropy per letter. So there are only on the order of 2²³⁵⁵ 1KiB messages that look pretty much like English text, of which one in 2¹⁶⁰ will have any particular SHA-1 hash, leaving about 2²¹⁹⁵ more or less English messages of 1 KiB, which is a much more manageable 5 765 546 543 805 391 543 300 385 245 067 897 407 469 008 380 207 694 434 170 981 314 369 415 226 086 245 896 005 497 410 349 176 911 651 361 357 544 908 126 864 379 940 776 407 262 468 025 247 520 821 365 392 566 254 691 849 336 550 399 984 742 144 883 696 325 495 839 942 505 506 308 529 294 485 245 435 346 088 288 415 306 782 152 045 986 880 430 505 821 218 111 120 701 594 573 419 855 327 199 586 861 839 630 511 065 600 663 692 968 681 473 384 074 002 850 142 261 291 497 547 545 795 867 600 142 345 188 353 358 006 378 705 229 284 788 565 040 964 509 510 302 568 387 814 225 873 737 552 804 109 763 080 706 434 267 888 314 149 674 523 819 024 312 546 837 031 915 917 556 591 511 424 773 862 591 940 658 144 814 461 877 029 111 670 089 356 835 845 931 924 493 084 507 666 309 424 365 148 038 224 615 440 025 478 945 269 023 101 615 392 514 882 287 817 384 451 162 838 663 168 messages.
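If anyone wants to sanity-check that arithmetic, a couple of lines of Python (using the same rough 2.3 bits per character figure) gets you there:

    # ~2.3 bits of entropy per character over 1024 characters, minus the
    # 160 bits pinned down by the SHA-1 hash
    english_bits = int(1024 * 2.3)             # ~2355 bits of "English-like" 1 KiB messages
    survivors = 2 ** (english_bits - 160)      # English-looking messages per SHA-1 value
    print(english_bits, len(str(survivors)))   # 2355, and a 661-digit number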
It is unlikely that it would be apparent upon inspection which one was correct, since most of them differ from other messages by only a couple dozen subtle changes.
No it's not. You can't distinguish the complete works of Shakespeare from the complete works of Shakespeare with the words changed a bit, or with a switch to Harry Potter for a bit in the middle, or from a completely dazzling and original work of fiction. Any given hash might correspond to all three of these, and essentially boundless other works. It's a library of Babel.
Forgive me if I've misunderstood - the premise was that, out of the infinite set of data that hashes to a particular hash/checksum, there is a unique data set that is "obviously" the real one? Or at least, that the set is meaningfully bounded? My reply is that this is not the case. There will be infinitely many "plausible" data sets as well.
You could collide every hash in existence merely by making undetectably tiny alterations to Shakespeare.
Not obviously, no - just a very high probability. Perhaps that provides noise immunity to some degree. If it does, it is a form of AI.
This admittedly open question presumes a very large fuzzy 'code book' with which it can re-assemble the data. The length of the input in cleartext is valuable metadata that speeds up the search.
You still seem to be missing the point. The problem is that the SHA-1 hash is one of 2^160 possible values, while a 1GB plaintext is one of 2^8000000000 possible values. That means that the hash disambiguates your message down to one of 2^7999999840 values.
Let's go smaller. Let's say our plaintext is a single kilobyte - about half of an 80x25 terminal's worth of ASCII. Knowing the SHA-1 hash narrows our search space, yes, but the search space is so absurdly large that it only solves 0.0000... (insert over two thousand zeros here) ..0001% of the problem.
Say you have a 32 byte hash, that's 2^(32×8) = 2^256 possible values, right? Now say your input is 128 bytes, that means that (on average) each 32 byte hash maps to 2^(96×8) = 2^768 input values.
"Natural looking" data will occur an astronomical number of times in a search space that large - which is, again, only 96 bytes larger than your hash.
Only if you count all natural looking data, of which there is way more than you could possibly imagine.
It is of course likely that humanity hasn't yet created more than 2^100 (~10^30) files, so in theory, given a registry of all files in existence, you might be able to identify it by its hash. However, while this is simple, it's definitely not easy.
But this is technically an invalid answer to the problem of file compression. An easy (!) way for data to be losslessly compressed would just be to have a registry of all files in existence, and to refer to each file by its index in the registry. (Like the prisoners who had told each other all their jokes so many times that now they could just quote them by number and everyone would laugh.) Data compression is supposed to be able to deal with novel but likely documents without needing to add them to a global registry.
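A throwaway sketch of that registry scheme (contents entirely made up) shows exactly where it breaks down:

    # "Compression" by global registry: works only for documents that are
    # already enrolled, and fails on anything novel - which is the whole job
    # of a real compressor.
    registry = [
        b"the complete works of shakespeare ...",
        b"a wikipedia dump ...",
    ]

    def compress(data: bytes) -> int:
        return registry.index(data)        # raises ValueError for a novel document

    def decompress(index: int) -> bytes:
        return registry[index]

    assert decompress(compress(registry[0])) == registry[0]  # a tiny index stands in for the file
    try:
        compress(b"a new document nobody registered")
    except ValueError:
        print("novel document: not in the registry, cannot be 'compressed'")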
I can imagine the output of program space. That's pretty big. All possible universes in fact.
It's an issue of probability and bins. While natural data is infinite, there is vastly more unnatural data. At some point you have enough metadata (e.g. natural vs. random) to know that the original data came from Earth and to pick out the right needle from the needle stack.
Unless the data is from a completely alien source, we are close enough to that source for our heuristics to be of a manageable size.
I understand your confusion, because it took me a minute to think about it and confirm that this doesn't work. It might be surprising that even a machine with complete information and infinite compute power couldn't pick out the needle; after all, SHA-1 is a 160 bit hash. Are there really 2^160 ≈ 10^48 different texts in English that could reasonably have been written by someone living on Earth (add as much additional metadata as you like)? The answer is yes: there are simply too many needles.
Take your favorite book with 160 or more characters. (Probably a large book, but such a book could certainly exist.) Now, for each character, create a version of the book with their name swapped out for exactly one other name which does not already appear in the book. So if a character is named Alex, in some other edition of the book that character is named Jason.
Now, consider how many editions of the book there are. For each character, there are two editions that are exactly alike except for that one character's name. So there are 2×2 = 4 combinations where two characters' names can be swapped, and 2×2×2 = 8 combinations where 3 characters' names can be swapped. If you think about how many versions of the book there are in total, it's 2^C, where C is the number of characters whose names can each be swapped for only one other name.
In other words, it's guaranteed that you can generate SHA-1 collisions using one book alone, using only entirely reasonable alternative names for its characters.
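Here's a tiny sketch of that counting argument, with a made-up one-line "book" and three swappable names, so 2^3 = 8 equally plausible editions; push C past 160 and the pigeonhole principle guarantees that some two editions share a SHA-1 hash:

    import hashlib
    from itertools import product

    book = "Alex met Morgan at the harbour, where Riley was waiting."
    name_choices = {"Alex": "Jason", "Morgan": "Petra", "Riley": "Sam"}  # made-up alternatives

    editions = []
    # every combination of keeping or swapping each character's name
    for picks in product(*[(old, new) for old, new in name_choices.items()]):
        text = book
        for old, chosen in zip(name_choices, picks):
            text = text.replace(old, chosen)
        editions.append(text)

    print(len(editions), "editions")   # 2**3 = 8 equally plausible versions
    for text in editions:
        print(hashlib.sha1(text.encode()).hexdigest()[:12], text)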
Taking a book and randomly swapping words means you are introducing noise. This is high-entropy and really it's outside the scope of what I'm comfortable speculating on.
Image data is much simpler to think about in this paradigm, but the obvious AI applications of natural language are of course fascinating.
It's somewhat irrelevant how much unnatural data there is. The point is that there's way more plausible data (or sentences, if we're restricting ourselves to natural language) than there are possible 160, 256, or 512 bit hashes.
Unless of course you enumerate all of human communication, but like I said that doesn't count as 'easy'.
But let's say every paragraph only offers 1 bit of entropy. Then a 160 bit hash gives you fuzzy accuracy up to 160 paragraphs. After that you'll have to extend the hash with hints to guide which sequence of paragraphs you're looking for, & hints for where the typos are.
This is very abstract, but I believe that since both input and output are very close in program space compared to the total size of PS, the heuristics to map between them are going to be of a manageable size. This is based on an intuition about a single source for all activity in this universe.
For the sake of argument I figured I'd be highly optimistic. The linked prize shows practical evidence of 1GB of Wikipedia being compressed below 1 bit per character (& that's with a limit of 50 hours CPU time, 10GB memory, & 100GB disk).
This is what made it a funny thought experiment to me. I was doing some space-related stuff a while back, and when you're dealing with something as small as 100mb, and you add the constraint of up to a light-year or more of latency, which makes error correction immensely costly, having a quantum processor at the receiver find the collision of a unique hash could be faster than transmitting the data with errors.
There's probably a distance where the latency in light years is longer than the time it takes for a quantum processor with some shortcuts to exhaust the search space for the collision on the hash - and potential hashing algorithms that contain information and hints about a hyperplane that generates the original data over the field of possibilities. To a quantum computer with sufficient qubits at a long enough latency distance away, the 32-bit hash of a 2048-bit RSA key may "arrive" faster than transmitting the whole key.
It feels like a fundamental misunderstanding of something in communication theory, but this was the thinking that prompted it.
Using the pigeonhole principle, consider how many 100 megabit values must, on average, hash to the same 32 bit value. The answer is a huge number (I could be wrong but it seems to me it might be 2 ^ (100,000,000 - 32)?). This is a bit of a paradox in cryptographic hashing, where finding a collision is difficult, but the number of collisions is enormous.
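That guess checks out: there are 2^100,000,000 possible 100-megabit inputs spread over 2^32 possible 32-bit hash values, so on average 2^(100,000,000 - 32) inputs share each value. The number itself is unprintable, but its scale is easy to compute:

    import math

    # decimal digits in 2**(100_000_000 - 32), without materializing the number
    digits = math.floor((100_000_000 - 32) * math.log10(2)) + 1
    print(digits)   # roughly 30.1 million digits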
Yeah, you're fundamentally missing what I wrote, as well as pigeonhole principle and the basics of information theory.
Let's say you replace the quantum computer with an oracle. You have an infinitely long dictionary, and you can turn to any page and see your hash, as well as every possible input that could form that hash. There's billions and billions of options that all match your hash - how could you pick which one is definitely the "right" one?
You're no closer to knowing what the transmitted message was, because of the astronomical number of inputs that you now have to sift through.
[ Technically, you're (length of hash in bits) closer. Say you have 10 bits of hash, or 1024 values. There are (on average) 2 11-bit inputs that create that hash, 4 12-bit inputs, 8 13-bit inputs, and so on, following the powers of 2. At an input of N=30 bits, you have 2^20 =~ 1 million possible inputs with that exact same hash. A kilobit of input? 2^990, a number nearly 300 digits long. ]
This part - that a cryptographic hash has a massive number of possible collisions for a given plaintext, just spread over such a large field and growing with the input size that it is unlikely you will ever find another one - is the counterintuitive part. I haven't implemented a secure hashing algorithm from scratch; if I had, this aspect would probably have come out in the exercise.
Even working with security and crypto, the presumption of a cryptographic hash from a standard algorithm is that it is universally unique for the data it hashes - barring someone finding collisions, and then it gets deprecated. Most people deal with them at a higher level of abstraction.
Hash collisions have been used in a few high profile attacks in the last decade, most notably when some SSL certificates were still using MD5, and the most recent smart contract issue where the contract was only validating the last 4 bytes of a cryptographic hash and someone produced a key that exploited that with a collision. Unfortunately the people who don't make those mistakes are both very rare, and often dynamic to work with.
The idea that there are many, many potential collisions for a given input to a cryptographic hash doesn't come up much unless someone manages to find a single collision in one. I'd even posit nobody outside academic cryptography circles is including sha256 collisions in their threat scenarios right now.
However, the absurd/imaginary/hypothetical/counterfactual I was raising is that a future quantum computer can produce all outputs of that length in a reasonable amount of time. The question was more about what would be sufficient to reconstruct the data, and it was funny to know something was specifically wrong but not know why.
However, I appreciate the time taken to illustrate it.