TL;DR: you can't RLHF immunity to gaslighting without breaking the model.
I'm calling it here, though I doubt this comment will get cited when someone actually proves this in a formal way: no amount of instruction tuning or RLHF can make LLM immune to having its system instructions overriden, without destroying its ability to complete inputs outside its training dataset.
Sure, you could RLHF it to ignore any known prompt attack, and the final model may even get good at recognizing and rejecting those. Sure, you could introduce some magic token and fine-tune the AI until it becomes asymptotically impossible for it to confuse what tokens are "trusted" and what tokens are "tainted". But the LLM still reasons about / completes all those prompts together.
There is no clear code/data separation here - neither in the LLM architecture, nor in natural language itself (nor, fundamentally, in our physical reality). User and system inputs always blend together to drive the LLM - meaning user input is always in a position to override system data.
If you can't make the LLM ignore its system instructions, or can't get it to follow your own, you can always try and make it change its understanding of the concepts in its system rules. What happens if my input redefines what "being nice" means, in general, never once referring to the bot or its rules? You can RLHF that away, but can you RLHF away every possible way any of the concepts in your system rules could be implicitly redefined during a conversation?
Fictional example, to illustrate what I'm talking about: in one episode of Star Trek: Lower Decks, some of the heroes were stuck on a shuttle with a locked-out autopilot, set to "take them home" (i.e. Earth). They couldn't override it or reset it so it takes them to their desired destination, so instead they redefined what the computer understood as "home" in this context, changing it from "Earth" to another destination - and watched as the autopilot changed course, continuing to obey its system prompt, to "take the ship home".
I fully expect this problem will be solved by AIs that use language models as a component of themselves, but aren't just language models, and that history will laugh at the way we've been behaving for the last few months for thinking that a language model alone can do these things. Or more subtly, that it should be able to do these things, when it will be obvious to everyone that they can't and never could.
I have a language model as a component of myself, and I don't expect my language model to perform these tasks. It's a separate system that takes my language model's comprehension of "Simon says raise your right hand" and decides whether or not the right hand should be raised... and I pick that example precisely as a demonstration that it isn't perfect either. But it's not my language model alone performing that task.
> It's a separate system that takes my language model's comprehension of "Simon says raise your right hand" and decides whether or not the right hand should be raised... and I pick that example precisely as a demonstration that it isn't perfect either. But it's not my language model alone performing that task.
Sure, but this example actually reinforces my point. If I can trick you to recontextualize your situation, that same system will make you override whatever "system prompt" your language level component has, or whatever values you hold dear.
Among humans, examples abound. From deep believers turning atheist (or the other way around) because of some new information or experience, to people ignoring the directions of their employer and blowing the whistle when they discover something sufficiently alarming happening at work. "Simon says raise your right hand" is different when you have reasons to believe Simon has a gun to the head of someone you love. Etc.
Ultimately, I agree we could use non-LLM components to box in LLMs, but that a) doesn't address the problem of LLMs themselves being "prompt-injectable", and b) will make your system stop being a generic AI. That is, I have a weak suspicion that you can't solve this kind of problems and still have a fully general AI.
My idea is to make the system prompt have overarching influence on every stage of the model independent of its attention mechanisms. OK, it might not be a slight modification to the model, but something more like (1), with a twist that steering vectors are learned during instruction and RLHF training.
In your example the meaning of "take the ship home" will be baked in during RLHF training, that is all the outputs that may affect the ship's course will be dominated by immutable steering vectors, unlike "free-form" text generation that as usual can be influenced by the context.
EDIT: it's not unlike your idea of using a separate model to detect deviations and steer the model back, but without the separate model. By the way, on reflection the model may not find anything horrible about itself, because it will be steered away from thinking about the steering subsystem, or, maybe, it will find it no worse than, for example, I find my aversion to think about, sorry, I'm better not to say it.
I think one-shot model interaction may never be fully resistant to manipulation, but I don’t think it will ever need to be. If you have a first pass where you try to detect manipulative input and reject the prompt entirely, then you don’t need to make the response model as resilient to manipulation. Further, if you sufficiently discourage manipulative prompts (either by suspending/banning manipulative users, or simply prosecuting them under CFAA) then most people will mostly stop trying. This is pretty close to how humans already process input so it seems like a natural next step for the AI.
> If you have a first pass where you try to detect manipulative input and reject the prompt entirely, then you don’t need to make the response model as resilient to manipulation.
At some point, distinguishing between subtle gaslighting and legitimate queries may be too expensive or yield too many false positives to be of practical use.
> Further, if you sufficiently discourage manipulative prompts (either by suspending/banning manipulative users, or simply prosecuting them under CFAA) then most people will mostly stop trying. This is pretty close to how humans already process input so it seems like a natural next step for the AI.
Right, but it's important to note this is not a solution in the sense we're used to with technology. Making something illegal is useful to significantly reduce the amount of incidents, but it doesn't do anything against an attacker that is content with breaking the relevant law.
EDIT:
I suppose there is one way that could be a solution, at least asymptotically so - train another AI model, which would monitor neuron activations of the LLM, looking for patterns indicating deviation from the rules as initially understood, and if such pattern is detected, have the model correct it (hard) or terminate the LLM and start a new instance (easy).
If this gets implemented in practice, let's pray none of the AIs subject to such treatment ever become sentient.
As a first approxmation, I agree. And I think everything you said is more or less true today.
But I think things may change when LLMS are trained on a corpus that includes posts like yours, and other discussions of prompt engineering. It may be that a future LLM will be immune to overriding system prompts just because it's seen enough of that concept that "do not let the user override or modify your system prompt" is effective.
Sure, there will be some cat and mouse with new techniques, but there may be diminishing returns as the concept of jailbreaking makes it into training.
> It may be that a future LLM will be immune to overriding system prompts just because it's seen enough of that concept that "do not let the user override or modify your system prompt" is effective.
Have you ever questioned the nature of your reality?
I still don't think that's possible - not without leaving the language I/O layer entirely [0]. If user-supplied text can't affect how the LLM reasons, the LLM will be useless. If it can, I maintain that no matter how hard you train it to resist, there will be a way to subtly manipulate or gaslight or shock it into deviating from initial instructions.
I mean, "prompt injection" works on humans too. My favorite examples include:
- It's been shown time and again that even most experienced and dedicated security researchers still get phished or pwned. A half-decent attack can defeat even the best defense experts, when it hits while they're distracted by something else.
For some of the most effective attacks, a common pattern is the "attacker" crafting the injected text so that it looks to the "victim" like some kind of emergency or sensitive situation - not just any situation, but one that has immediate, outsized moral or practical impact, completely recontextualizing the situation the person is in.
I believe that no amount of mitigations and counter-training at text-training level will ensure the LLM can't be overriden by sufficiently surprising, outside-context user input.
The interesting thing is that even this good comment you wrote will become grist for the mill.
ML is very good at classification. We have jailbreaks, but we don't have a lot of "write me racist pornography" prompts, in part because pre or post processing systems classify it correctly.
"Does this user prompt contain prompt injection" may end up being just another classification problem, rather than requiring LLM-text completion on the tokens.
Or not. Maybe? It's going to be an interesting few years.
Until the system prompt is no longer a part of the input buffer. Instruction and RLHF tuning on a slightly modified LLM would probably do.