
Which basically never matters and in any case where it actually does, gzip will make it equal again.


Zip-then-encrypt leaks information about the plaintext. If it's life or death, better not to compress at all.


Only when the attacker can choose part of the plaintext and do the same thing over and over again with different chosen plaintexts to compare results.

Yes, there are scenarios where that matters. However, the vast majority of use cases of UTF-8 don't fit that, or don't even use encryption at all.


That is not the only way. There are other ways of knowing partial contents of files and changes to files, depending on the situation. If the document is a known form in which one of five boxes is checked by the sender, it's probably not hard to rule out certain selections based on the ciphertext length, if not pin down the contents exactly.


I'm not sure I entirely understand your example (if there are 5 checkboxes and 1 is checked, presumably the length would be the same regardless of which one is checked). To your broader point, though, I agree that scenarios along those lines exist (e.g. fingerprinting known communication based on length); most of them apply even better when not using compression.


The checkbox example is completely plausible. There is no guarantee that all checkboxes lead to the same number of bytes changed in the file when checked. What if the format makes a note of the page number wherever a checkbox is checked? 1X could be two bytes and 15X would be three.

And even if the format only stored the checkbox states as a single bit each (unlikely), compression algorithms don't care. They will behave differently on different byte sequences, which can easily lead to a difference in output length.

Also, it's already been done with voice calls with no attacker-controlled data: https://web.archive.org/web/20080901185111/https://technolog...


The attack you're referring to is not specific to compression. It's the same class of attack that can reveal keystrokes over older versions of ssh based on packet size and timing, even on uncompressed connections. Conversely, fixed-bitrate voice streams don't have the same vulnerability as variable-bitrate encodings even though they're still compressed.

The version of your checkbox example which is vulnerable without any formal data compression is when the checkbox is encoded in a field that is only included or changes in length if the value isn't the default, common in uncompressed variable-length encodings like JSON.
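A minimal sketch of that uncompressed case, assuming a hypothetical form encoder (`encode_form` and the `subscribed` field are made up for illustration) that only writes a field when it differs from the default:

```python
import json

def encode_form(name: str, subscribed: bool = False) -> bytes:
    """Serialize a hypothetical form, omitting `subscribed` when it is the default."""
    record = {"name": name}
    if subscribed:  # only non-default values are written out
        record["subscribed"] = True
    return json.dumps(record).encode("utf-8")

default_len = len(encode_form("alice"))
checked_len = len(encode_form("alice", subscribed=True))

# The two lengths differ, so a length-preserving cipher would reveal the checkbox.
print(default_len, checked_len)
```

No compressor is involved here; the encoding alone makes the length depend on the value.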


I'm sure that the people getting hacked care deeply about whether the attack they suffered was sui generis.

Also, zip/deflate etc. were not designed to eliminate side-channel leakage. Some compression schemes (with padding) can obviously mitigate leaks, but it has to be done deliberately.


Any of it has to be done deliberately. The length of the data reveals something about its contents whether it's compressed or not.

The special concern with compression is when attacker-controlled data is compressed against secret data because then the attacker can measure the length multiple times and deduce the secret based not just on the length but on how the length changes when the secret is constant and the attacker-controlled data varies. This can be mitigated with random padding (makes the attack take many times more iterations because it now requires statistical sampling) or prevented by compressing the sensitive data and attacker-controlled data separately.
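A rough sketch of the "compress separately" mitigation, using Python's `zlib` as a stand-in for whatever compressor is in use. The secret, the separator and both guesses are made-up values, and encryption is assumed to preserve length, so only compressed sizes are compared:

```python
import zlib

SECRET = b"session_token=7f3a9c1e5b2d4a68"  # made-up secret for illustration

def joint_len(attacker: bytes) -> int:
    # Secret and attacker data share one compression context.
    return len(zlib.compress(SECRET + b" :::: " + attacker))

def separate_len(attacker: bytes) -> int:
    # Each part gets its own context, so no back-reference crosses the boundary.
    return len(zlib.compress(SECRET)) + len(zlib.compress(attacker))

right = b"session_token=7f3a9c1e5b2d4a68"
wrong = b"session_token=0123456789abcdef"

# Joint: the matching guess is encoded as one long back-reference, so it is shorter.
print(joint_len(right), joint_len(wrong))

# Separate: subtracting what the attacker can compute alone leaves a constant,
# namely the compressed size of the secret.
print(separate_len(right) - len(zlib.compress(right)),
      separate_len(wrong) - len(zlib.compress(wrong)))
```

With separate contexts the attacker still learns the compressed size of the secret, just not how it reacts to their input.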


If your example needs additional assumptions to be a relevant example, then you should probably state them when you bring up the example.


like what lol


Encryption is completely unrelated to the task at hand, which is text encoding and compressing, and text encoding is not encryption.


Huh, never heard that before. Does it leak more information than just encrypting without zipping? Struggling to imagine how this attack works.


It's an extension of the chosen-plaintext attack, and so requires the attacker to be able to send custom text that they know is in the encrypted payload. If the unencrypted payload is "our-secret-data :::: some user specified text", then the attacker can eventually determine the contents of our-secret-data by observing how the size of the encrypted response changes as they change the text when the compression step matches up with a part of the secret data. It can be defeated by adding random-length padding after compression and before the encryption step, though.
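As a toy illustration of that oracle (not a full attack), the sketch below assumes the payload layout from the comment, a made-up secret, and a cipher that preserves length. The attacker submits whole candidate values and keeps whichever produces the shortest response:

```python
import zlib

SECRET = b"api_key=q7Zp2LmX9vRt"  # hypothetical secret the server prepends

def observed_length(user_text: bytes) -> int:
    """Ciphertext size on the wire, assuming a length-preserving cipher."""
    payload = SECRET + b" :::: " + user_text
    return len(zlib.compress(payload))

candidates = [
    b"api_key=a1Bc3DeF5gHi",
    b"api_key=q7Zp2LmX9vRt",
    b"api_key=Kx8Wn4Yj6UsO",
]

# The guess that duplicates the secret compresses best, so it gives the
# shortest response.
best_guess = min(candidates, key=observed_length)
print(best_guess)
```

Real attacks of this kind guess a few bytes at a time, where the size difference per guess is much smaller and noisier than in this whole-value version.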


Essentially if you zip something, repeated text will be deduplicated.

For example "FooFoo" will be smaller than "FooBar" since there is a repeated pattern in the first one.

The attacker can look at the file size and make guesses about how repetitive the text is if they know what the uncompressed or normal size is.

This gets more powerful if the attacker can insert some of their own plaintext.

For example, if the plaintext is "Foo" and the attacker inserts "Fo" (giving "FooFo"), the result will be smaller than if they inserted "zq", where there is no pattern. By making lots of guesses, the attacker can figure out the secret part of the text a little bit at a time just by observing the size of the ciphertext after inserting different guesses.


Encrypting without zipping doesn't leak any information about the content. You can't rule out certain byte sequences (other than by total length) just by looking at the ciphertext length.

If "oui" compresses to two bytes and "non" compresses to one byte, and then you go over them with a stream cipher, which is which:

A: ;

B: *&
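To make the point concrete, here is a sketch with a toy one-time-pad style stream cipher (illustration only, not something to deploy). The `CODEBOOK` stands in for a hypothetical compressor that produces the sizes given above:

```python
import os

# Hypothetical compressor output, following the sizes given in the comment.
CODEBOOK = {"oui": b"\x01\x02", "non": b"\x03"}

def stream_encrypt(plaintext: bytes) -> bytes:
    """Toy stream cipher: XOR with a fresh random keystream of equal length."""
    keystream = os.urandom(len(plaintext))
    return bytes(p ^ k for p, k in zip(plaintext, keystream))

for answer, compressed in CODEBOOK.items():
    ciphertext = stream_encrypt(compressed)
    # The ciphertext bytes are random, but its length equals the input length.
    print(answer, len(ciphertext))

# Without the compression step, both answers encrypt to the same length.
print(len(stream_encrypt(b"oui")), len(stream_encrypt(b"non")))
```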


This has nothing to do with compression. If you use "yes" and "no" instead of "oui" and "non" (which just happen to be three characters each) and you compress "yes" to "T" and "no" to "F" then the uncompressed text will be the leaky one.


It’s an example meant to prove the idea.


Yes, and my example was an example meant to prove the opposite idea. The point is that it is irrelevant whether you compress or not. You can leak information either way.


I leak the length of my phone call and you leak:

1. the length of your phone call; and

2. what language you were speaking; oh and

3. half the words you said

(i.e. pwned)

https://web.archive.org/web/20080901185111/https://technolog...


> you leak [a bunch of stuff]

How? Remember, the uncompressed text gets encrypted too.


It's in the article if you would bother to read it LOL. "simply measuring the size of packets without decoding them can identify whole words and phrases with a high rate of accuracy . . . [the researchers] can search for chosen phrases within the encrypted data"


Ah.

That article is about voice calls. Totally different topic. Nothing to do with UTF-8.


Cryptography noob here: I'm confused by "Encrypting without zipping doesn't leak any information about the content." Logically speaking, if we compress first and therefore "the content" will now refer to "the zipped content", doesn't this mean we still can't get any useful information?


Not OP, but 'zipping and encrypting' one thing (a file for example) does not leak information by itself. The problem comes when an adversary is able to see the length of your encrypted data, and then can see how that length changes over time - especially if the attacker can control part of the input fed to the compressor.

So if you compressed the string "Bob likes yams" and I could convince you to append a string to it and compress again, then I could see how much the compressed length changed.

If the string I gave you was something already in your data then the string would compress more than it would if the string I gave you was not already in your data - "Bob likes yams and potatoes" will be larger than "Bob likes yams likes Bob".

If the only thing I can see about your data is the length and how it changes under compression - and I can get you to compress that along with data that I hand to you - then eventually I can learn the secret parts of your data.
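A quick way to try this with Python's `zlib`. The two suffixes here are chosen to have the same length (the strings above differ in raw length too), so any size difference comes from compression alone:

```python
import zlib

base = b"Bob likes yams"
repeated = base + b" likes yams"   # suffix already present in the data
novel = base + b" and potato"      # suffix of the same length, not present

assert len(repeated) == len(novel)

# The repeated suffix becomes a single back-reference; the novel one does not.
print(len(zlib.compress(repeated)), len(zlib.compress(novel)))
```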


Encryption generally leaks the size of the plaintext.

This is true in both the compressed and non-compressed case. However with compression the size of the plaintext depends on the contents, so the leak of the size can matter more than when not using compression.

Even without compression this can matter sometimes. Imagine encrypting "yes" vs "no".


> Encryption generally leaks the size of the plaintext.

Ah, I see. Naïvely, this seems like a really bad thing for an encryption algorithm to do—is there no way around it? Like, why is encryption different from hashing in this regard?


There are methods, but they are generally very inefficient bandwidth-wise. The usual approach is to add extra text (padding) so that all messages are a fixed size (or e.g. some power of 2). The higher the fixed size, the less information is leaked and the less efficient it is. E.g. if you pad to 64 MB but need to transmit a 1 MB message, that is 63 MB of extra data to transmit.
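A minimal sketch of power-of-two padding, assuming a made-up framing (a 4-byte big-endian length prefix followed by zero bytes) so the recipient can strip the padding after decryption:

```python
import struct

def pad_to_power_of_two(message: bytes) -> bytes:
    """Prefix the true length, then zero-pad to the next power-of-two size."""
    framed = struct.pack(">I", len(message)) + message
    target = 1
    while target < len(framed):
        target *= 2
    return framed + b"\x00" * (target - len(framed))

def unpad(padded: bytes) -> bytes:
    """Recover the original message using the length prefix."""
    (length,) = struct.unpack(">I", padded[:4])
    return padded[4:4 + length]

for size in (5, 20, 28):
    padded = pad_to_power_of_two(b"x" * size)
    print(size, len(padded))
```

Messages of 20 and 28 bytes land in the same 32-byte bucket, so an observer cannot tell them apart by length; the cost is the wasted bytes.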

Part of the problem (afaik) is that we lack good math tools to analyze the trade-offs of different padding sizes vs. how much extra privacy they provide. This makes it hard to reason about how much padding is "enough".

Another approach is adding a random amount of padding. This can be defeated if you can force the victim to resend messages (whose sizes you then average out).
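A small simulation of that averaging, under made-up assumptions: the victim resends the same 100-byte message each time, and the padding is drawn uniformly from 0 to 32 bytes. The fixed seed is only there to make the run repeatable:

```python
import random

TRUE_LENGTH = 100   # hypothetical message size the victim keeps resending
MAX_PAD = 32        # padding drawn uniformly from 0..MAX_PAD bytes

def observe(rng: random.Random) -> int:
    """One ciphertext length as seen by the attacker."""
    return TRUE_LENGTH + rng.randint(0, MAX_PAD)

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [observe(rng) for _ in range(10_000)]

# Subtract the expected padding from the mean; the minimum also works here.
estimate = sum(samples) / len(samples) - MAX_PAD / 2
print(round(estimate), min(samples))
```

The padding only raises the number of observations needed; it does not remove the signal.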

Hashing is different because you don't have to reconstruct the message from the hash. With encryption the recipient needs to decrypt the message eventually and get the original back. However, there is no way to transmit a (maximally compressed) message in less space than it takes up.

There are special cases where this doesn't apply, e.g. if you send a specific number of bytes on a fixed, agreed-upon schedule.


Yes, of course it leaks more information than encryption without compression, because that’s just encryption which doesn’t leak anything.

In an enormous number of real-world cases adversaries can end up including attacker-controlled input alongside secret data. In that case you can guess at secret data and if you guess correctly, you get smaller compressed output. But even without that, imagine the worst case: a 1TB file that compresses to a handful of bytes. Pretty clearly the overwhelming majority of the text is just duplicate bytes. That’s information which is leaked.
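A scaled-down version of that worst case, using 1 MB instead of 1 TB and Python's `zlib` as an example compressor:

```python
import os
import zlib

SIZE = 1_000_000  # scaled down from the 1 TB in the comment

repetitive = bytes(SIZE)        # all zero bytes
random_data = os.urandom(SIZE)  # effectively incompressible

# The compressed size alone separates "almost all duplicate bytes" from
# "no exploitable redundancy", even if both outputs are then encrypted.
print(len(zlib.compress(repetitive)), len(zlib.compress(random_data)))
```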



