Strange Loops: Ken Thompson and the Self-referencing C Compiler

kwantam · on Sept 9, 2013

The original article, which this (kind of) summarizes: "Reflections on Trusting Trust" by Ken Thompson.

http://cm.bell-labs.com/who/ken/trust.html

Sort of disappointing that the blog entry doesn't bother linking to the original, which is at least as well written.

sjay · on Sept 10, 2013

From the original, Ken on the press lionizing hackers:

"I have watched kids testifying before Congress. It is clear that they are completely unaware of the seriousness of their acts. There is obviously a cultural gap. The act of breaking into a computer system has to have the same social stigma as breaking into a neighbor's house. It should not matter that the neighbor's door is unlocked. The press must learn that misguided use of a computer is no more amazing than drunk driving of an automobile."

Nowadays, I here more about the opposite problem -- using overly broad, non-technical legalese to convict kids of "hacks" that aren't really hacks. I wonder what Ken would think of Snowden, Barrett Brown, weev et al?

alanctgardner2 · on Sept 10, 2013

I'm pretty sure the context of that quote is completely applicable to weev. "It should not matter that the neighbor's door is unlocked", is clearly Ken saying that something being technically simple does not make it morally or legally justifiable.

Dylan16807 · on Sept 10, 2013

But at the same time, simply entering is not going to get the feds after you. And while standing on someone's lawn and staring into their living room certainly has a stigma, it will never get you put in prison for decades. (For the weev analogy pretend they have some financial documents on the table in the living room.)

acchow · on Sept 10, 2013

Was first introduced to this article in CS 458/658 at Waterloo.

My favourite part was "This is a deep concept. It is as close to a 'learning' program as I have seen" since getting a compiler to recognize when, where, and how to inject (self-propagating) backdoor code seemed practically impossible.

It also expressed the idea of "self-referencing" more succinctly than the backdoor trick.

WalterBright · on Sept 9, 2013

This relies on new compiler binaries being built from the previous compiler binaries.

The cycle can be broken by using a different compiler with a different pedigree in the bootstrap process. In fact, this suggests a way to detect the back door:

Given Compiler A which is the previous generation, and Compiler B which has a different pedigree, we want to generate Compiler C and detect if it is compromised:

A compiles C which compiles C1

B compiles C which compiles C2

C1 and C2 should match. If it does, and A is "known good", then C is now "known good".

Of course, this can be compromised, too, but it will make it very much harder to do so.

pmiller2 · on Sept 9, 2013

Of course, the hard part is to detect whether C1 and C2 "match." Just because 2 different compilers generate different code for the same source doesn't mean one or the other compiler was compromised. By Rice's theorem, it's undecidable in a general case to decide whether two computable functions are equal, so you technically can't do this step in a general way.

WalterBright · on Sept 9, 2013

Do a binary diff of the executables. That's unambiguous.

Remember that C1 and C2 are generated by two different compilers generated from the same source code. The same source code should generate the same binary.

jackowayed · on Sept 9, 2013

No, the same source code will generate different binaries. The spec doesn't say exactly how source code has to map to machine code, just what should happen based on source code. Differences in the optimizers are almost certain to yield differences in the optimized binaries, and even with optimizations off, binaries are likely to be different due to different compilers making decisions that are more-or-less arbitrary differently.

(Plus a backdoor could easily be set to run only when optimizations are on. Almost all production builds that one would want to exploit should be built optimized, and it would make the exploit a bit harder to spot in compiled binaries.)

WalterBright · on Sept 9, 2013

Please read the process again. Let me rephrase it:

A compiles C and produces C1. C1 compiles C and produces D1.

B compiles C and produces C2. C2 compiles C and produces D2.

D1 and D2 should be identical.

jackowayed · on Sept 9, 2013

Oh, I see, generate two different binaries of the same compiler with different compilers, and you know something is up if the same source compiled by those two separate binaries of the same compiler differ. Assuming the compiler is deterministic, that should work pretty well.

I wonder how well that would work in practice. There are actually quite a few bugs in C compilers[1], and compilers are complicated enough that I could see them hitting some of them when they are compiled. Has anyone done this before?

1: http://www.stanford.edu/class/cs343/resources/finding-bugs-c...

lambda · on Sept 10, 2013

Yes. David A Wheeler did his PhD thesis on this[1], both with some toy compilers to demonstrate that it did detect a trusting trust attack, and eventually bootstrapping gcc using icc, and was able to get it to compile deterministically and verify that it produced identical binaries.

1: http://www.dwheeler.com/trusting-trust/

pmiller2 · on Sept 9, 2013

It doesn't matter. "Trusting Trust" is just intended to show that by compromising the C compiler, you can compromise any other binary on the system. So, how are you going to tell D1 and D2 are identical if the diff tools are compromised? Compare them off the machine? Well, what if the file transfer tools/drivers are compromised? Boot to a custom kernel and compare using a different OS? Well, better hope your boot firmware isn't compromised, etc.

I grant that this is not a very practical attack, but it can be made arbitrarily hard to detect, based on the threat model the original attacker (who compromised the C compiler) uses.

lambda · on Sept 10, 2013

> I grant that this is not a very practical attack, but it can be made arbitrarily hard to detect, based on the threat model the original attacker (who compromised the C compiler) uses.

Not really. The more binaries you have to be able to subvert for it to work, the harder the attack becomes. There are lots of binary diff tools out there, lots of compilers, lots of assemblers, etc. Writing an attack that could detect them all and defeat them all would be so hard as to be impractical; being able to generically recognize something like "this is a compiler" or "this is an implementation of AES" would probably involve solving the halting problem. And if you can't solve it generically, you have to start adding bigger and bigger suspicious binary blobs in which encode all of the patterns for all of the programs you're looking to subvert.

Remember, the NSA does not have infinite resources, nor can they solve unsolveable problems. There is a practical limit to how paranoid you need to be.

Practically, subverting general purpose computation, like subverting a compiler or subverting general purpose instructions on a CPU, is likely to be too easy to detect and too hard to implement to be worth it.

Much easier is subverting a random number generator. That's incredibly hard to detect, and easy to implement. AES encrypting an incrementing counter with a key known only to the NSA looks an awful lot like a random stream, but with that key they can trivially figure out where it is in the stream and what will come next.

That's why Intel's insistence on getting the Linux kernel to blindly trust the RdRand instruction was quite worrying[1], while applying only a bit of paranoia and using David A Wheeler's approach to defeat the trusting trust attack[2] is likely sufficient to have faith in things like your compiler.

1: https://en.wikipedia.org/wiki/RdRand 2: http://www.dwheeler.com/trusting-trust/

vidarh · on Sept 10, 2013

You are addressing a completely different issue than the paper did. Yes, the general case is nearly impossibly hard.

But that is not the point. The point is that against a determined adversary, you can not trust an arbitrary program, even if that program is compiled from source, unless you can also trust every other component on your system.

The point then, is to encourage that "bit of paranoia" and make people understand that thinking you're safe just because you can read through the source or have another mechanism for producing or obtaining what you might think is a pristine, safe copy of a piece of software is a false sense of security.

While this particular attack to my knowledge was only a thought experiment and I've never heard of it occurring in the wild, note that at least a few viruses for example propagate by modifying binaries to act as carriers, and often modify the system to obscure their presence (e.g. report incorrect file sizes, and not read back the modified data). This is largely the same threat, and in many ways far more practical because it doesn't require people to take the step of trying to recompile applications.

Having the source (or a "known good" source of your application) does not help you, as your binary is infected when it is handled by the compromised system. Taking checksums etc. on the compromised systems may not help you.

jlgreco · on Sept 9, 2013

Use a card deck as your persistent storage. Give the card deck to undergrads to verify by hand. ;)

IgorPartola · on Sept 9, 2013

Right. Use 1000 different hash functions to hash the binaries and compare. Surely it is impractical to compromise all future hashing algorithms by detecting their pattern.

jlgreco · on Sept 9, 2013

Oh, I'm just thinking that you only write and read from storage that can be manually verified. You therefore wouldn't verify the output of your compiler with anything that was compiled by that compiler.. instead you would verify it with the owners manual and some underlings to do the grunt work.

So you run your compiler, and it punches cards for you. You then turn off the machine, remove the cards, and verify them. If that checks out, then you boot the machine with the cards again. Anything that persists the reboot has to be on those cards, and is therefore subject to uncompromised inspection.

DannoHung · on Sept 9, 2013

Not intrinsically.

InclinedPlane · on Sept 10, 2013

> The same source code should generate the same binary.

This is utterly false and reveals a deep misunderstanding of compiler technology. Aside from optimizations and exactly how they are implemented in each compiler there are still many more or less arbitrary choices that a compiler needs to make in order to translate source code into machine code. Address layout being one of the most prevalent. Indeed, these choices are so arbitrary and lead to such a high degree of difference in the binary forms that the same compiler running on the same code will often generate different binary output. And the same compilers running on very slightly different code bases will quickly generate substantially divergent output. This is why programs like bsdiff and courgette exist, because comparing binaries is actually an enormously non-trivial problem.

WalterBright · on Sept 10, 2013

> This is utterly false and reveals a deep misunderstanding of compiler technology.

I've written many professional compilers, front to back, and I use the binary difference technique to verify that the compiler is capable of exactly reproducing itself.

If you've got a compiler that generates different binaries depending on the time of day, the address the compiler was loaded at, or something else that is not the compiler switches + source code provided to it, you've got a compiler with serious QA issues.

InclinedPlane · on Sept 10, 2013

That's nice that your compiler works that way but that's not the way most do.

More importantly the salient point was about comparing the binary output of different compilers.

While it's certainly possible to create tools which make it make it possible to determine if the binary output of different compilers are effectively the same such tools are very non-trivial to create. The idea that different compilers are likely to produce exactly identical output is sheer fantasy.

WalterBright · on Sept 10, 2013

> More importantly the salient point was about comparing the binary output of different compilers.

No, it wasn't. I explained it (apparently badly) 3 times now. There's another iteration of bootstrap compiling in there before the output is compared.

chii · on Sept 10, 2013

I think a lot of people gets confused by your explanation earlier in the comment thread because one it is difficult to do recursion, and two, see above. ;)

But yes, i think this method does work. You'd have to trust all the pipeline programs used in between.

e12e · on Sept 10, 2013

No, the point was comparing the binary output of the same compiler, compiled by itself, compiled by different compilers... (Yeah, I missed that the first time round, too).

Patient0 · on Sept 10, 2013

Next time you might want to look up who Walter Bright is before jumping to the conclusion that you know more about compilers than he does (http://en.wikipedia.org/wiki/Walter_Bright)

mhurron · on Sept 9, 2013

Do not different compilers make different optimizations? Could not the output be then expected to be different?

WalterBright · on Sept 9, 2013

When A compiles C, and B compiles C, then when we run C to recompile C, it will be doing the SAME optimizations, and should produce exactly the same binaries.

I.e. take the iteration one more time.

(I actually do this as part of the test suite for the Digital Mars C++ compiler - it takes two iterations until the binaries match exactly.)

mcpherrinm · on Sept 9, 2013

Rice's theorem is nice and all, but in the real world we can model executables as things other than turing machines and answer questions about them.

If I was trying to detect a "trusting trust" type backdoor, one thing I might do is model the control flow of the executable as a graph, and use something like approximate graph isomorphism to detect additional large blocks of code.

asabjorn · on Sept 10, 2013

A solution to this problem happens to be my thesis topic, and yes it is a very difficult problem. We achieve this by a mechanism to randomly test functions in binaries without knowing the function call signature.

jtemplestein · on Sept 9, 2013

True. But this isn't useful unless you know which code C to do the test on (in the OP's original example C would be the code of the login program).

The only way to tell if a compiler will put a backdoor into certain programs is to read the disassembled source.

SEMW · on Sept 9, 2013

No. In WalterBright's example, code C is the compiler (the one which you suspect may be backdoored), not the login program. A and B are also compilers, one of which is known-good.

So: the difference you'll find isn't the login-backdoor itself, it's the code that waits for a login program (or another compiler) to be compiled, and inserts that backdoor. This avoids the problem of having to know what the login program or whatever is.

sharkbot · on Sept 10, 2013

> A and B are also compilers, one of which is known-good.

'Known-good' is exactly the problem Thompson's essay is describing. There is no 'known-good'. Instead, you have decided to root your chain of trust with a compiler you call 'known-good' (say, B). But ultimately, that trust is arbitrary: your compiler B relied upon un-investigated components at some point in its heritage: a hex editor, a disk drive controller, a CPU, an LCD screen, etc.

The best you can do is reduce the trusted base to the smallest amount, then make explicit your trust relationships as much as possible. Ideally, the trusted base would be physics and logic with a metaphysical certainty of the universality of those laws; we're a long way from that situation :)

SEMW · on Sept 10, 2013

Yes, of course. I don't think WalterBright was claiming that this method eliminates the theoretical possibility of all trusting-trust-type attacks entirely, only that it gives you a good shot at detecting the specific type that Thompson gave as his example (the compromised compiler). The point of the method being that since compiler B can be crap (slow, non-optimising, only implementing the minimum required to compile compiler C), it can be something you could knock up yourself, so reducing your level of trust to those components below the level of the compiler.

So yes, you're still relying on things below that. That doesn't make the exercise 'arbitrary' or pointless. Good risk management is reducing the chances of the most probable attacks, and a compromised compiler is (ISTM) a much more likely attack than e.g. a compromised CPU.

Seems to me that putting theoretical perfection too far ahead of pragmatism just gives the (wrong and damaging) impression that not being able to solve all trusting-trust issues entirely means it's not worth their time trying to solve the more tractable ones.

sharkbot · on Sept 10, 2013

I disagree with you, but you're not wrong. Taking the pragmatic approach to trust is perfectly valid, and the only short to medium term option at present. I personally find it distasteful, but that's my academic bias showing.

However, Thompson's essay is both a practical attack and an abstract idea; I find the notion of trust with self referential systems more interesting than the specifics of how to detect a backdoored compiler.

SEMW · on Sept 10, 2013

That's fair enough. Cheers for the discussion.

mkehrt · on Sept 9, 2013

Nice article, but just read the original: http://cm.bell-labs.com/who/ken/trust.html

bediger4000 · on Sept 9, 2013

I'm surprised no one has mentioned David A. Wheeler's "Diverse Double Compiling" (http://www.dwheeler.com/trusting-trust/) as a solution to this problem.

csense · on Sept 10, 2013

Let's pretend a compiler with this backdoor will always correctly detect when it's compiling a compiler. I.e. looksLikeCompilerCode() and generateCompilerWithBackDoorDetection() are oracles.

With this assumption, to write a compiler that's safe, you can't run it through an existing possibly-compromised compiler. You would have to bootstrap your safe compiler, as if for a totally new language.

So you'd have to initially write it in a different language. You probably can't write it in most existing languages, e.g. Python and Java are both implemented in C. Because, if you tried to write a safe compiler in Python, a sufficiently smart [1] C compiler would have been able to tell that you were compiling a Python interpreter when it compiled /usr/bin/python, and inserted a backdoor into that Python interpreter which will trigger when the interpreter is interpreting a C compiler written in Python.

You'd basically have to consider any code that has ever passed through an automated tool to be potentially backdoored, so you'd have to start writing in machine code (no assembler allowed, of course, because it's probably written in C). Of course, you could program an assembler in pure machine code (or using a potentially-tainted assembler, verifying its output by hand).

[1] It'd either be a general artificial intelligence of human level programming ability, or some kind of magical oracle.

e12e · on Sept 10, 2013

Reading this, I was reminded of an article using the dos debug[1] command to bootstrap tools written from (memorized) assembly language.

Note, I can't find the actual article now - I think it may have been in one of the early issues of Phrack[2].

[1] http://www.intel-assembler.it/portale/5/Write-an-assembly-pr...

[2] http://www.phrack.org

gleenn · on Sept 9, 2013

This is awesome because it means backdoors could have been inserted many many versions ago in a compiler and as long as there was a chain of using the compiler to compile itself through each version (as most compilers like GCC are), the hack would be passed ad infinitum. Pretty evil.

WalterBright · on Sept 9, 2013

The chain also breaks if the detected compiler source code pattern changes enough so it is no longer detected.

_dp9d · on Sept 9, 2013

Tinfoil hat time.

Let your imagination run if someone had done this at some point to a build of GCC being used by Linus...

zerohp · on Sept 10, 2013

Linus doesn't distribute kernel binaries.

unimpressive · on Sept 9, 2013

There is more than one way to defeat this attack. One of the simpler ones would be to create a formally verified C compiler in X86 assembly, which you can use to compile an older version of GCC or whatever, and then use that version to compile the next one, and so on and so forth until you have a guaranteed clean modern version.

I don't think any buildchains currently implement anything like this though.

barbs · on Sept 10, 2013

Interesting! Out of curiosity, if this were the Gnu C Compiler, would this be illegal, since the GPL requires the source to be distributed with the binary, and the source and binary don't match?

belorn · on Sept 10, 2013

GPLv3:

  The “source code” for a work means the preferred form
  of the work for making modifications to it.
  ...
  The “Corresponding Source” for a work in object code form
  means all the source code needed to generate, install, and
  (for an executable work) run the object code and to modify
  the work, including scripts to control those activities.

adrusi · on Sept 10, 2013

The is no definition of "match" beyond just what the compiler produces. No spec dictates what the output of a compiler should be.

barbs · on Sept 11, 2013

Does this mean that, in theory, if someone wanted to modify some GPL'd code to merge it into a proprietary project, and they didn't want to GPL the entire project, they could modify the compiler to compile the code with the modifications? It'd be a completely round-about way to do it, but in theory would this be legal?

If it were a GPL-licensed compiler (e.g. GCC), they wouldn't need to distribute the compiler-code changes, since they would only be using it internally and not distributing the binary for the compiler itself.

Of course, they could just modify and use a shared-library of the GPL'd code or whatever. At least, that's my understanding, could be wrong...

adrusi · on Sept 12, 2013

Well they couldn't use gcc, because that's also GPL and they'd have to release its source, but if I'm not missing some more nuanced part of the license, they could probably compile with a fork of clang.

barbs · on Sept 17, 2013

See, I don't think they'd have to release the source for gcc, since they're not releasing the gcc binary.

My understanding is, under the GPL, if you distribute the binary, you need to distribute the source.