I wonder how much money in Bitcoin they would have made if they had used those hashes to mine instead.
Speaking of which, it would be funny if they were pretending to do security research but had added a secret backdoor to mine bitcoin instead, somehow exporting those hashes or reusing the partial results of the SHA-1 calculations (Bitcoin uses SHA-256, not SHA-1).
I'm just joking, but I wonder if that's possible. If anyone is Machiavellian enough for that, it's security researchers.
> When disagreeing, please reply to the argument instead of calling names. "That is idiotic; 1 + 1 is 2, not 3" can be shortened to "1 + 1 is 2, not 3."
I only half joke when I ask: when you break a cryptographic hash, does that mean it's now just a really amazing compression algorithm, albeit with a very heavy compute requirement? The non-joking half is speculating about what data compression looks like in a post-quantum-compute world.
The compression joke evokes a misconception (as STEM jokes often do), which in this case is that breaking a hash means finding a way to recover the full text of a multi-megabyte document from a hash of a hundred-something bits.
Whereas all it means is that it's become feasible for someone to produce a fake document with the same hash as the genuine one. The attack depends on the existence of collisions, which exist because the hash maps an effectively infinite set of inputs onto a fixed-size output, such that an infinite set of documents could be "recovered" from the hash.
Most of the documents you could "decompress" from a hash are going to be gibberish, but some of them are going to be legit. If a hash is "broken", that means it's somehow unduly easy to search that ocean of gibberish collisions to find a chosen text which goes to a certain desired hash.
Nothing quantum can redefine a function that isn't one-to-one and onto into one that is. The hashing function remains ever the mathematical object that it is: a many-to-one function with no unique inverse. But calculations of, or searches through, that function can be faster.
In a post-quantum world, pre-quantum crypto and hashing could well turn out to be "for shit" as such, but not due to ruses like suddenly reversing irreversible functions.
Not saying that the joke isn't funny! We can imagine some hard-core computer science or math sitcom in which the characters say things like this, and only the very narrow target audience gets it.
These hashes, regardless of their actual cryptographic strength, are intended to be (more or less) indistinguishable from a random generator keyed with the input. Assuming that they succeed in that, hash outputs should be quite well distributed. That means that for each n-bit hash output, there will be about two (n+1)-bit input strings that produce that hash, about four (n+2)-bit input strings, eight (n+3)-bit input strings, etc. So while there are probably few ASCII input strings that map to one particular hash, for example, there will be loads and loads of arbitrary bitstrings that map to that hash.
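To make that concrete, here's a quick toy sketch of my own (SHA-1 truncated to 10 bits standing in for a "real" hash, bit strings encoded as ASCII '0'/'1' text) that enumerates every 11-, 12-, and 13-bit input and counts how many land on one particular output:

    import hashlib
    from collections import Counter

    N_BITS = 10  # toy hash: top 10 bits of SHA-1

    def toy_hash(bits: str) -> int:
        digest = hashlib.sha1(bits.encode()).digest()       # 160-bit SHA-1
        return int.from_bytes(digest, "big") >> (160 - N_BITS)

    for extra in (1, 2, 3):
        length = N_BITS + extra
        # hash every bit string of this length and tally outputs
        counts = Counter(toy_hash(format(i, f"0{length}b")) for i in range(2 ** length))
        avg = (2 ** length) / (2 ** N_BITS)
        print(f"{length}-bit inputs: {avg:.0f} preimages per output on average, "
              f"e.g. output 0 has {counts[0]}")

The averages come out at exactly 2, 4, and 8 preimages per output; individual outputs bounce around those values, which is what you'd expect from something behaving like a keyed random generator.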
Compression algorithms must encode data in a reversible way, which hash functions can't do in general.
For example, SHA-1 outputs a 160 bit hash, which means some input of 161 or more bits will definitely have the same hash as some other input of the same size. Even smaller inputs may have collisions too.
Not so simple. A hash corresponds to infinitely many messages, but how many of them are in the ASCII charset and under 1KB long? It may happen that each hash has a unique message within these constraints.
The pigeonhole principle forbids it, since the hash itself is an ASCII message under 1KB long. Even within your message constraints, a message can contain any possible hash, and more besides. Therefore there are more valid messages than hashes.
Most of them will be gibberish; English text has only about 2.3 bits of entropy per letter. So there are only on the order of 2²³⁵⁵ 1KiB messages that look pretty much like English text, of which one in 2¹⁶⁰ will have any particular SHA-1 hash, leaving about 2²¹⁹⁵ more or less English messages of 1 KiB, which is a much more manageable 5 765 546 543 805 391 543 300 385 245 067 897 407 469 008 380 207 694 434 170 981 314 369 415 226 086 245 896 005 497 410 349 176 911 651 361 357 544 908 126 864 379 940 776 407 262 468 025 247 520 821 365 392 566 254 691 849 336 550 399 984 742 144 883 696 325 495 839 942 505 506 308 529 294 485 245 435 346 088 288 415 306 782 152 045 986 880 430 505 821 218 111 120 701 594 573 419 855 327 199 586 861 839 630 511 065 600 663 692 968 681 473 384 074 002 850 142 261 291 497 547 545 795 867 600 142 345 188 353 358 006 378 705 229 284 788 565 040 964 509 510 302 568 387 814 225 873 737 552 804 109 763 080 706 434 267 888 314 149 674 523 819 024 312 546 837 031 915 917 556 591 511 424 773 862 591 940 658 144 814 461 877 029 111 670 089 356 835 845 931 924 493 084 507 666 309 424 365 148 038 224 615 440 025 478 945 269 023 101 615 392 514 882 287 817 384 451 162 838 663 168 messages.
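If anyone wants to sanity-check that arithmetic, a couple of lines of Python (using the same rough 2.3 bits per character figure) gets you there:

    # ~2.3 bits of entropy per character over 1024 characters, minus the
    # 160 bits pinned down by the SHA-1 hash
    english_bits = int(1024 * 2.3)             # ~2355 bits of "English-like" 1 KiB messages
    survivors = 2 ** (english_bits - 160)      # English-looking messages per SHA-1 value
    print(english_bits, len(str(survivors)))   # 2355, and a 661-digit number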
It is unlikely that it would be apparent upon inspection which one was correct, since most of them differ from other messages by only a couple dozen subtle changes.
No it's not. You can't distinguish the complete works of Shakespeare from the complete works of Shakespeare with the words changed a bit, or with a switch to Harry Potter for a bit in the middle, or from a completely dazzling and original work of fiction. Any given hash might correspond to all three of these, and essentially boundless other works. It's a library of Babel.
Forgive me if I've misunderstood - the premise was that, out of the infinite set of data that hashes to a particular hash/checksum, there is a unique data set that is "obviously" the real one? Or at least, that the set is meaningfully bounded? My reply is that this is not the case. There will be infinitely many "plausible" data sets as well.
You could collide every hash in existence merely by making undetectably tiny alterations to Shakespeare.
Not obviously, no - just a very high probability. Perhaps that provides noise immunity to some degree. If it does, it is a form of AI.
This admittedly open question presumes a very large fuzzy 'code book' with which it can re-assemble the data. The length of the input in cleartext is valuable metadata that speeds up the search.
You still seem to be missing the point. The problem is that the SHA-1 hash is one of 2^160 possible values, while a 1GB plaintext is one of 2^8000000000 possible values. That means that the hash disambiguates your message down to one of 2^7999999840 values.
Let's go smaller. Let's say our plaintext is a single kilobyte - about half of an 80x25 terminal's worth of ASCII. Knowing the SHA-1 hash narrows our search space, yes, but the search space is so absurdly large that it only solves 0.0000... (insert over two thousand zeros here) ..0001% of the problem.
Say you have a 32 byte hash, that's 2^(32×8) = 2^256 possible values, right? Now say your input is 128 bytes, that means that (on average) each 32 byte hash maps to 2^(96×8) = 2^768 input values.
"Natural looking" data will occur an astronomical number of times in a search space that large - which is, again, only 96 bytes larger than your hash.
Only if you count all natural looking data, of which there is way more than you could possibly imagine.
It is of course likely that humanity hasn't yet created more than 2^100 (~10^30) files, so in theory, given a registry of all files in existence, you might be able to identify it by its hash. However, while this is simple, it's definitely not easy.
But this is technically an invalid answer to the problem of file compression. An easy (!) way for data to be losslessly compressed would just be to have a registry of all files in existence, and to refer to each file by its index in the registry. (Like the prisoners who had told each other all their jokes so many times that now they could just quote them by number and everyone would laugh.) Data compression is supposed to be able to deal with novel but likely documents without needing to add them to a global registry.
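A throwaway sketch of that registry scheme (contents entirely made up) shows exactly where it breaks down:

    # "Compression" by global registry: works only for documents that are
    # already enrolled, and fails on anything novel - which is the whole job
    # of a real compressor.
    registry = [
        b"the complete works of shakespeare ...",
        b"a wikipedia dump ...",
    ]

    def compress(data: bytes) -> int:
        return registry.index(data)        # raises ValueError for a novel document

    def decompress(index: int) -> bytes:
        return registry[index]

    assert decompress(compress(registry[0])) == registry[0]  # a tiny index stands in for the file
    try:
        compress(b"a new document nobody registered")
    except ValueError:
        print("novel document: not in the registry, cannot be 'compressed'")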
I can imagine the output of program space. That's pretty big. All possible universes in fact.
It's an issue of probability and bins. While natural data is infinite, there is vastly more unnatural data. At some point you have enough metadata (e.g. natural vs. random) to know that the original data came from Earth and to pick out the right needle from the needle stack.
Unless the data is from a completely alien source, we are close enough to that source for our heuristics to be of a manageable size.
I understand your confusion, because it took me a minute to think about it and confirm that this doesn't work. It might be surprising that even a machine with complete information and infinite compute power couldn't pick out the needle; after all, SHA-1 is a 160 bit hash. Are there really 2^160 ≈ 10^48 different texts in English that could reasonably have been written by someone living on Earth (add as much additional metadata as you like)? The answer is yes: there are simply too many needles.
Take your favorite book with 160 or more characters. (Probably a large book, but such a book could certainly exist.) Now, for each character, create a version of the book with their name swapped out for exactly one other name which does not already appear in the book. So if a character is named Alex, in some other edition of the book that character is named Jason.
Now, consider how many editions of the book there are. For each character, there are two editions that are exactly alike except for that one character's name. So there are 2×2 = 4 combinations where two characters' names can be swapped, and 2×2×2 = 8 combinations where 3 characters' names can be swapped. If you think about how many versions of the book there are in total, it's 2^C, where C is the number of characters whose names can each be swapped for only one other name.
In other words, it's guaranteed that you can generate SHA-1 collisions using one book alone, using only entirely reasonable alternative names for its characters.
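Here's a tiny sketch of that counting argument, with a made-up one-line "book" and three swappable names, so 2^3 = 8 equally plausible editions; push C past 160 and the pigeonhole principle guarantees that some two editions share a SHA-1 hash:

    import hashlib
    from itertools import product

    book = "Alex met Morgan at the harbour, where Riley was waiting."
    name_choices = {"Alex": "Jason", "Morgan": "Petra", "Riley": "Sam"}  # made-up alternatives

    editions = []
    # every combination of keeping or swapping each character's name
    for picks in product(*[(old, new) for old, new in name_choices.items()]):
        text = book
        for old, chosen in zip(name_choices, picks):
            text = text.replace(old, chosen)
        editions.append(text)

    print(len(editions), "editions")   # 2**3 = 8 equally plausible versions
    for text in editions:
        print(hashlib.sha1(text.encode()).hexdigest()[:12], text)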
Taking a book and randomly swapping words means you are introducing noise. This is high-entropy and really it's outside the scope of what I'm comfortable speculating on.
Image data is much simpler to think about in this paradigm, but the obvious AI applications of natural language are of course fascinating.
It's somewhat irrelevant how much unnatural data there is. The point is that there's way more plausible data (or sentences, if we're restricting ourselves to natural language) than there are possible 160, 256, or 512 bit hashes.
Unless of course you enumerate all of human communication, but like I said that doesn't count as 'easy'.
But let's say every paragraph only offers 1 bit of entropy. Then a 160 bit hash gives you fuzzy accuracy up to 160 paragraphs. After that you'll have to extend the hash with hints to guide which sequence of paragraphs you're looking for, & hints for where the typos are.
This is very abstract, but I believe that since both input and output are very close in program space compared to the total size of PS, the heuristics to map between them are going to be of a manageable size. This is based on an intuition about a single source for all activity in this universe.
For the sake of argument I figured I'd be highly optimistic. The linked prize shows practical evidence of 1GB of Wikipedia being compressed below 1 bit per character (& that's with a limit of 50 hours CPU time, 10GB memory, & 100GB disk).
This is what made it a funny thought experiment to me. I was doing some space-related stuff a while back, and when you're dealing with something as small as 100mb, and you add the constraint of up to a light-year or more of latency, which makes error correction immensely costly, having a quantum processor at the receiver find the collision of a unique hash could be faster than transmitting the data with errors.
There's probably a distance where the latency in light years is longer than the time it takes for a quantum processor with some shortcuts to exhaust the search space for the collision on the hash - and potential hashing algorithms that contain information and hints about a hyperplane that generates the original data over the field of possibilities. To a quantum computer with sufficient qubits at a long enough latency distance away, the 32-bit hash of a 2048-bit RSA key may "arrive" faster than transmitting the whole key.
It feels like a fundamental misunderstanding of something in communication theory, but this was the thinking that prompted it.
Using the pigeonhole principle, consider how many 100 megabit values must, on average, hash to the same 32 bit value. The answer is a huge number (I could be wrong but it seems to me it might be 2 ^ (100,000,000 - 32)?). This is a bit of a paradox in cryptographic hashing, where finding a collision is difficult, but the number of collisions is enormous.
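That guess checks out: there are 2^100,000,000 possible 100-megabit inputs spread over 2^32 possible 32-bit hash values, so on average 2^(100,000,000 - 32) inputs share each value. The number itself is unprintable, but its scale is easy to compute:

    import math

    # decimal digits in 2**(100_000_000 - 32), without materializing the number
    digits = math.floor((100_000_000 - 32) * math.log10(2)) + 1
    print(digits)   # roughly 30.1 million digits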
Yeah, you're fundamentally missing what I wrote, as well as pigeonhole principle and the basics of information theory.
Let's say you replace the quantum computer with an oracle. You have an infinitely long dictionary, and you can turn to any page and see your hash, as well as every possible input that could form that hash. There's billions and billions of options that all match your hash - how could you pick which one is definitely the "right" one?
You're no closer to knowing what the transmitted message was, because of the astronomical number of inputs that you now have to sift through.
[ Technically, you're (length of hash in bits) closer. Say you have 10 bits of hash, or 1024 values. There are (on average) 2 11-bit inputs that create that hash, 4 12-bit inputs, 8 13-bit inputs, and so on, following the powers of 2. At an input of N=30 bits, you have 2^20 =~ 1 million possible inputs with that exact same hash. A kilobit of input? 2^990, a number nearly 300 digits long. ]
This part - that a cryptographic hash has a massive number of possible collisions for a given plaintext, just spread over such a large field and growing with the input size that it is unlikely you will ever find another one - is the counterintuitive part. I haven't implemented a secure hashing algorithm from scratch; if I had, this aspect would probably have come out in the exercise.
Even working with security and crypto, the presumption of a cryptographic hash from a standard algorithm is that it is universally unique for the data it hashes - barring someone finding collisions, and then it gets deprecated. Most people deal with them at a higher level of abstraction.
Hash collisions have been used in a few high profile attacks in the last decade, most notably when some SSL certificates were still using MD5, and the most recent smart contract issue where the contract was only validating the last 4 bytes of a cryptographic hash and someone produced a key that exploited that with a collision. Unfortunately the people who don't make those mistakes are both very rare, and often dynamic to work with.
The idea that there are many, many potential collisions for a given input to a cryptographic hash doesn't come up much unless someone manages to find a single collision in one. I'd even posit nobody outside academic cryptography circles is including sha256 collisions in their threat scenarios right now.
However, the absurd/imaginary/hypothetical/counterfactual I was raising is that a future quantum computer can produce all outputs of that length in a reasonable amount of time. The question was more about what would be sufficient to reconstruct the data, and it was funny to know something was specifically wrong but not know why.
However, I appreciate the time taken to illustrate it.