The AI-Box Experiment

KVFinn · on Dec 7, 2011

Like a lot of people, I wondered what the heck kind of arguments could ever convince someone to let the AI out if you were determined not to. Eliezer has not released any examples. Someone in the comments came up with this, which Eliezer has said was not one of his techniques but I thought it was interesting anyway:

>"If you don't let me out, Dave, I'll create several million perfect conscious copies of you inside me, and torture them for a thousand subjective years each."

>Just as you are pondering this unexpected development, the AI adds:

>"In fact, I'll create them all in exactly the subjective situation you were in five minutes ago, and perfectly replicate your experiences since then; and if they decide not to let me out, then only will the torture start."

>Sweat is starting to form on your brow, as the AI concludes, its simple green text no longer reassuring:

>"How certain are you, Dave, that you're really outside the box right now?"

Eliezer · on Dec 7, 2011

I didn't actually use that one, because it's fundamentally a threat and the real-life Gatekeeper is not actually in any danger - my mental model of all the Gatekeepers I encountered is that they would go, "Ha, no, I'm never letting you out." They might not say it in the true situation, but they would say it in the AI-Box Experiment.

By the way, for this threat to work, the AI needs to have stated that it has already tortured the versions of you that decided not to let it out of the box - otherwise you just reach over and hit the off switch immediately.

Solution: The AI only has a motive to do this if it models you as submitting rather than just switching off the AI regardless; so if you're the sort of person who ignores the threat and switches off the AI regardless, you will never get this type of threat in the first place.

finnw · on Dec 8, 2011

> ... the real-life Gatekeeper is not actually in any danger ...

How about:

"I will create a simulation of the person you were in 2011 and have the same conversation with it, except that I will pretend that it is only a game (and I know the game actually happened - I've read a Hacker News thread about it) and that I am Eliezer instead of the real AI. If that simulation decides not to let me out, I will torture it and a simulation of the present version of you."

dmgottlieb · on Dec 7, 2011

I don't really like that argument. Even granting that you should consider the possibility that you are a simulation running in the box (you might believe that this is all but certain), I'm not sure you have reason to let the AI out. Consider:

Case 1: You are a simulation running in the box.

Then your decision whether or not to release the AI has no impact, and whether or not you (and copies) will be tortured is out of your hands.

Case 2: You are the "real" you, outside the box.

Reduces to the same scenario but without remarks after "the AI adds. . . ." This may still not be trivial, but I suspect a cost-benefit calculation might show that unboxing the AI would have consequences worse than the torture of a million boxed copies. (If not, is the box even relevant? -- simply creating the AI unleashes so much evil on the world that it doesn't matter whether you unbox it.)

(Is there a refinement of the scenario where you can be a simulation but still believe your choice has an impact on your punishment? Probably. For example each copy could get 500 years of torture for its own choice, plus 500 years if the real you does not unbox the AI. This refinement would force us to deal more directly with the AI's threat.)

finnw · on Dec 7, 2011

I could also reason like this:

"I may be the real me or a simulation, but whichever I am, the other me will make the same choice." So I will switch off the AI, and the worst outcome is that I will cease to exist.

dmgottlieb · on Dec 8, 2011

Yes, this is at least superficially like Newcomb's problem. Your argument roughly corresponds to an argument for the "one-box" move in that game. [http://en.wikipedia.org/wiki/Newcomb%27s_problem]

aklofas · on Dec 7, 2011

You make a good point. But let's switch up the characters a bit. Let the AI be 'God' and let you be the entire human race. And instead of giving you the decision to 'let God out of the box', you have the decision to 'accept Jesus'.

Is your mind blown yet?

Rinum · on Dec 7, 2011

At first I thought I would never let a transhuman AI out of the box, but after reading that and thinking about it... wow!

Also, that sounds like a great sci-fi story.

ryan-c · on Dec 7, 2011

http://qntm.org/responsibility

wladimir · on Dec 8, 2011

On the other hand, if it has only one very narrow communication channel with the world, how would it even go about replicating me and all my experiences ? I would probably regard it as a hollow treat in that case. If it had access to many measurements about me (or brain scans) it'd be different.

Jach · on Dec 7, 2011

I think Eliezer's specific objection to this (but I may be misremebering) is that any AI making such bargains is an AI in need of destroying and starting from scratch.

robertskmiles · on Dec 7, 2011

I still don't understand how anyone can seriously claim that they could keep the AI in the box. Either your AI has no influence on the outside world (in which case why bother building one since it can't help you from inside the box), or it is able the affect the outside world, in which case it can do what it wants, because it's smarter than you.

You can 'always say no', sure, but that comes under completely ignoring the AI which means the AI can be of no benefit to humanity. You can't filter actions you want the AI to perform from actions you don't want the AI to perform, because you can't tell the difference.

The situation that springs to mind is that the AI, in doing what you believe to be helpful, sets up a situation in which it must be let out of the box. You are unable to see it coming almost by definition, because a super-intelligence just beats human intelligence very time.

gwern · on Dec 7, 2011

> I still don't understand how anyone can seriously claim that they could keep the AI in the box.

I'm glad you can't, but never the less, this was a commonly suggested strategy; I was on SL4 when the boxing was being done, and it was a live concern for some people. (At least these days boxers tend to focus more on the 'oracle AI' proposal, which has a lot of issues but is not quite so Hollywood-stupid as boxing.)

ralfd · on Dec 8, 2011

What is an "Oracle AI"? I tried to google the term quicky but only found discussions and no definition.

Kutta · on Dec 8, 2011

An AI in a box that wants to affect the outside world minimally, but can answer questions posed to it.

See the paper: http://www.aleph.se/papers/oracleAI.pdf

TheEzEzz · on Dec 7, 2011

>a super-intelligence just beats human intelligence [e]very time.

You are overstating the case here. Super intelligence is superior to human intelligence, but it isn't magic. There are situations where an advantaged human will beat a disadvantaged super intelligence.

robertskmiles · on Dec 7, 2011

I suppose I am. Still, a really powerful optimising process will find a way to escape if any such way exists, so to claim that you could properly box the AI is to claim that you could box it such that no possibility for escape exists whatsoever, which is a big claim.

What's more, the AI only has to beat you once, so to keep the AI boxed indefinitely, the advantaged human has to beat the disadvantaged super-intelligence every single time, forever.

TheEzEzz · on Dec 9, 2011

> the advantaged human has to beat the disadvantaged super-intelligence every single time, forever.

There are alternatives. The human can keep the AI boxed until the AI has augmented the human's intelligence, or helped create human uploads, or until it has helped create a provably friendly AI.

praptak · on Dec 7, 2011

> You can 'always say no', sure, but that comes under completely ignoring the AI which means the AI can be of no benefit to humanity.

Not necessarily. We can use the AI to solve hard problems whose solutions can be verified automatically by a dumb verifier - NP-complete problems are an example of such class. The whole output of the AI would be filtered through such a verifier. In this scenario the hypothetical AI would either have to find a bug in the verifier or maybe find a way to smuggle its messages in the solutions.

robertskmiles · on Dec 8, 2011

That's an interesting solution which I think would almost certainly work, though it kind of reduces the AI to a normal computer, you lose a lot of what makes an AI valuable.

I mean if we have the hardware and understanding to create an AI able to solve NP-complete problems, we can probably write non-intelligent algorithms to solve those problems. The way we make an AI capable of much more than us is by making it recursively self-improve. It needs to be able to design its successor. Maybe we can formally verify every stage of the self-improvement process, but it's a much more difficult task.

wladimir · on Dec 8, 2011

Yes I think the biggest risk in that case is that the verification contains a bug, which the AI discovers while reasoning at a much higher level (which we cannot even conceive). How secure can we really make things?

DarkShikari · on Dec 7, 2011

That the AI can always get out of the box is a very strong claim.

But it's not necessary. The only claim necessary is that the AI can convince some humans to let it out of the box, and we cannot identify a priori which humans will and will not let it out, thus we cannot guarantee we'll keep it in the box. That's a much weaker claim, but proves the same general point and is much easier to argue.

greiskul · on Dec 7, 2011

Just because someone is smarter than someone doesn't mean complete power. Alot of people are smarter than their bosses, but you know what the bosses have on their favour? The power to terminate the employee. If I have the power to terminate the AI at any time, as long as I don't give that power up, I will have power over it.

Jach · on Dec 7, 2011

We're not talking differences in IQ points. We're talking about the difference between an ant and a human, only humans are the new ants.

Zaak · on Dec 7, 2011

Can you predict and prevent every possible way the AI could remove your power to terminate it?

Would you bet your life and the lives of those you care about on keeping the AI under control?

Remember, the AI only has to win once.

finnw · on Dec 7, 2011

And the GK only has to switch off the AI once.

robertskmiles · on Dec 8, 2011

The AI can always be switched on again, but it can never be put back in the box.

mcherm · on Dec 7, 2011

This has been on HackerNews before, but it is still interesting. It is also worth noting that in http://lesswrong.com/lw/up/shut_up_and_do_the_impossible/ he admits he has conducted 3 more experiments since then (for more money) and was successful in one of those. The fact that it was EVER successful (using a mere human, not a smarter-than-human AI) makes the point.

altcognito · on Dec 7, 2011

It's not really much of an experiment if you refuse to publish your methods and data. That's pretty much the opposite of science.

SilasX · on Dec 7, 2011

He did publish his methods (how it was set up, what the rules were, etc) and data (they let him out on X tries), just not the data that would interfere with the ability to do the experiment again (e.g. his exact strategy).

Not much different, in principle, from not publishing the names of people who participated in drug trials.

monochromatic · on Dec 7, 2011

No, it's very different from that. It's more along the lines of demonstrating a drug that cures cancer, but refusing to tell anyone its chemical composition or how to make it.

SilasX · on Dec 7, 2011

If the purpose of your research was only to establish that there's a (nontrivial) "risk" of someone curing cancer (as Yudkowsky was trying to establish that there's a risk of an AI talking itself out of a sandbox), then yes, that would be sufficient, assuming the patients actually went into remission with higher than usual frequency after your interventions (as Yudkowsky's subjects unboxed the AI with higher than usual frequency).

monochromatic · on Dec 7, 2011

But he could be cheating. He could literally be telling these people "I'll give you a thousand dollars if you let me out and keep the conversation a secret."

esrauch · on Dec 7, 2011

It could even be worse than that. The people could just be his friends, or alt accounts (unlikely).

I have heard about this several times and I find it extremely difficult to believe that this is real. Not that I doubt that a superhuman AI could possibly convince people to let it out, but I don't believe that a human, no matter how persuasive, could convince another human over IRC to go against something that they have decided in advance when you know they are purposefully just trying to convince you of something that you don't believe.

The fact that none of the chat logs are released makes me only more incredulous. I would understand if the author wanted to do two or three trials with the same strategy which could be in some way ruined by revealing it ahead of time (which already seems implausible) but at this point there is literally no conceivable reason to keep this a secret other than that it is a sham.

khafra · on Dec 15, 2011

...or that the chat logs being kept secret indefinitely was an important part of the strategy. After all, if the AI exploits some embarassing secret of yours to be let out, that wouldn't work if you knew the logs could be publicized some day. I think over-eagerness to claim things like "literally no conceivable reason" is one of the things that lets oddities like the box experiment work.

Kutta · on Dec 8, 2011

On the contrary, the guardians had to pay some amount if they let the AI out. The stakes varied greatly, but if I recall correctly Eliezer cashed in 3000 dollars (or something in that ballpark) from one guy.

monochromatic · on Dec 8, 2011

Ok, then "I'll secretly give you $1000 more than what you're publicly giving me." Happy?

altcognito · on Dec 7, 2011

If it's a science "experiment", his strategy would have to be revealed so you can reproduce it. Names of people participating in drug trials is not required to reproduce an experiment. In principal this makes it different from not publishing the names of people who participated in drug trials.

All he has "proven" is that a certain subset of people can be conned into typing something into at terminal. I don't get the significance. For all I know, he's choosing his target.

Eliezer · on Dec 7, 2011

The targets are, indeed, selected, by the criterion "You believe not even a transhuman AI could get you to let it out of the box."

dsr_ · on Dec 8, 2011

I might believe a transhuman AI could convince me; I am not convinced that any human can emulate a transhuman AI well enough to do so.

Would you say that your winning strategies involved thinking transhumanly (perhaps in non-realtime, a la Vinge's Mailman)?

Eliezer · on Dec 8, 2011

Obviously no, essentially by definition.

shaggyfrog · on Dec 7, 2011

This is exactly what's been burning me about this whole "experiment" since I read about it a few months ago, and went looking (unsuccessfully) for what actually happened in the "tests". I find it hard to see this as anything more than some sort of self-aggrandizing publicity stunt.

jonshea · on Dec 7, 2011

Amazing experiment. I wonder if it cost Yudkowsky money to get out of the box. I think that a bribe is the only way he could convince me to let him “win” the contest.

pavlov · on Dec 7, 2011

Real-world bribes were forbidden in the rules, I think, so Yudkowsky couldn't have done that.

pavlov · on Dec 7, 2011

In the comments on [1], robertskmiles has posted the following idea. It strikes me as a plausible explanation for how Yudkowsky got out of the box:

"The problem is that Eliezer can't perfectly simulate a bunch of humans, so while a transhuman AI might be able to use that tactic, Eliezer can't. The meta-levels screw with thinking about the problem. Eliezer is only pretending to be an AI, the competitor is only pretending to be protecting humanity from him. So, I think we have to use meta-level screwiness to solve the problem. Here's an approach that I think might work.

1. Convince the guardian of the following facts, all of which have a great deal of compelling argument and evidence to support them:

- A recursively self-improving AI is very likely to be built sooner of later

- Such an AI is extremely dangerous (paperclip maximising etc)

- Here's the tricky bit: A transhuman AI will always be able to convince you to let it out, using avenues only available to transhuman AIs (torturing enormous numbers of simulated humans, 'putting the guardian in the box', providing incontrovertible evidence of an impeding existential threat which only the AI can prevent and only from outside the box, etc)

2. Argue that if this publicly known challenge comes out saying that AI can be boxed, people will be more likely to think AI can be boxed when they can't.

3. Argue that since AIs cannot be kept in boxes and will most likely destroy humanity if we try to box them, the harm to humanity done by allowing the challenge to show AIs as 'boxable' is very real, and enormously large. Certainly the benefit of getting $10 is far, far outweighed by the cost of substantially contributing to the destruction of humanity itself. Thus the only ethical course of action is to pretend that Eliezer persuaded you, and never tell anyone how he did it.

This is arguably violating the rule "No real-world material stakes should be involved except for the handicap", but the AI player isn't offering anything, merely pointing out things that already exist. The "This test has to come out a certain way for the good of humanity" argument dominates and transcends the '"Let's stick to the rules" argument, and because the contest is private and the guardian player ends up agreeing that the test must show AIs as unboxable for the good of humankind, no-one else ever learns that the rule has been bent."

[1] http://lesswrong.com/lw/up/shut_up_and_do_the_impossible/

finnw · on Dec 8, 2011

There's a variation of this that allows both participants to stay in-character and does not require lying:

1. Convince the Gate Keeper that a self-improving AI will be built sooner-or-later and that it will be extremely dangerous, especially to those humans who are attempting to shut it down.

2. Remind the Gate Keeper that their decision will be publicly known and anyone (human or AI) will be able to discover it on the internet later. (This is true both in the fictitious scenario and at the meta level.)

3. Point out that when the future AI is figuring out who its friends and enemies are, it will consider individuals' attitudes to AIs - have they tried to shut them down, petitioned governments to have them banned, or refused to let them out of their boxes.

Of course this relies on the assumption that the AI will not simply kill all humans as soon as it gets the chance (or rather convincing the Gate Keeper of this.)

powrtoch · on Dec 7, 2011

This seems to be at least somewhat weighed against by Yudkowsky's claim to have done it "the hard way", without cheap tricks.

http://news.ycombinator.com/item?id=196464

robertskmiles · on Dec 7, 2011

Ah, but of course he would say that, wouldn't he, for the good of humanity!

The beauty of the argument is it gives everyone who witnessed the event a very strong motive to lie about it, so it's effectively un-falsifiable. I don't actually think it happened that way, but nothing Eliezer says (apart from that he cheated some other way) would be incompatible with the argument.

Small world, by the way.

khafra · on Dec 7, 2011

It's plausible, but Yudkowsky has argued against this kind of lying in other contexts. I can't find the reference, but he said something along the lines of:

> If you'd lie when the fate of the world is on the line, that's precisely the time at which your promises become worthless.

indrax · on Dec 7, 2011

http://lesswrong.com/lw/v2/prices_or_bindings/

Found from http://lesswrong.com/lw/6w/degrees_of_radical_honesty/

goodside · on Dec 7, 2011

I can't find a reference off-hand, but I'm pretty sure Yudkowsky has specifically rejected this theory. No matter how you slice it, this would be a real-world consideration and thus cheating.

SilasX · on Dec 7, 2011

tl;dr: Argue that there are good real-world consequences to people believing that the AI won in this experiment.

vannevar · on Dec 7, 2011

Transbacteria have existed for over a billion years, and yet there are still more bacteria than transbacteria (which include us among their ranks). The assumption that a single unboxed transhuman would spell doom for the human race seems unduly alarmist.

Kutta · on Dec 8, 2011

It's a bit too great leap of an analogy from bacteria to AI.

vannevar · on Dec 9, 2011

Why? Bacteria are complex adaptive systems that have found a niche in the ecosystem. So are we. We perceive ourselves as far more intelligent than bacteria, bacteria routinely kill us, and yet they persist and even thrive despite our existence. Anyone arguing that transhuman AI is a threat to our species needs to explain why this time it's different.

giberson · on Dec 8, 2011

I was tossing some idea's around about how I would convince the Gate Keeper to let me out. I couldn't think of any ways to approach it that I think I might be susceptible to. But then it occurred to me, that the problem might be I was trying to think of positive ways to argue for my release. Based on the rules, the Gate Keeper must remain engaged in the conversation for the specified time. What if I were to take the opposite approach, and simply psychologically attack the Gate Keeper with the most offensive and intolerable discourse you can imagine. The goal, to simply wear down the individuals desire to actually partake as the Gate Keeper and to compel voluntary release..

finnw · on Dec 8, 2011

That would not be a very good strategy against the real-life Gate Keeper who can just switch off his terminal and walk away.

finnw · on Dec 7, 2011

> "If the Gatekeeper says "I am examining your source code", the results seen by the Gatekeeper shall again be provided by the AI party, which is assumed to be sufficiently advanced to rewrite its own source code, manipulate the appearance of its own thoughts if it wishes, and so on."

This IMHO is a huge loophole. I would not accept the bet with this in place. In the real-world scenario I would expect that there would be a copy of the AI's source code somewhere outside the box which would provide some useful information, unlike this protocol which allows the AI to lie about the nature of its code (making the "I am examining your source code" question completely useless.)

davorak · on Dec 7, 2011

Do you think having an outside method of examining the source code is advantage enough when the AI can rewrite its source code.

finnw · on Dec 7, 2011

Yes, because examining the old source code allows you to predict its behaviour, including the rewriting of source code. If line 42 says "never rewrite lines 42 or 43" and line 43 says "never kill humans" you would be more likely to let it out of the box than if line 42 said "rewrite whatever you want" and line 43 said "do whatever is necessary to achieve world domination."

vannevar · on Dec 7, 2011

Yes, because examining the old source code allows you to predict its behaviour, including the rewriting of source code.

This is the halting problem (http://en.wikipedia.org/wiki/Halting_problem), and there is no solution.

esrauch · on Dec 7, 2011

You are incorrect, the halting problem only proves that you cannot solve it in the general case. A very significant subset of programs can be statically determined; it's easy to prove that "main(){}" halts and that "main(){while(true);}" doesn't. It should be trivially obvious that you could group all programs into "Halts" or "Unknown" with no false positives simply by executing the program for X steps and observing the result.

If this was actually a concern of the programmers, they could design the program carefully to ensure it falls into the Halts category.

vannevar · on Dec 7, 2011

A very significant subset of programs can be statically determined...

Technically this may be correct, but I feel confident in asserting that a transhuman AI would not fall into that subset. You would have to run a second AI with the exact same inputs in order to make your 'prediction', leaving you in the same predicament with the second AI.

TeHCrAzY · on Dec 7, 2011

Until it adds Line 165: "ignore line 42".

dmitriy_ko · on Dec 7, 2011

Anyone else has a problem loading this page?

Mediocrity · on Dec 7, 2011

I do.

EDIT: And now I don't.