I've seen a lot of solutions that look like this in the past: they all break eventually, usually when the attack prompt is longer and can spend more tokens over-coming the initial rules defined in the earlier prompt.
I bet you could break the GPT-4 version yourself if you kept on trying different attacks.
The classic "Text completion" API (as opposed to the new "Chat completion" one) seems to offer more flexibility: You have complete freedom in providing interaction examples in the prompt (as in https://github.com/artmatsak/grace/blob/master/grace_chatbot...) and are not limited to the three predefined chat roles. But the requests being 10x cheaper means that we'll have to find ways around those limitations :)
> I just disciplined myself to use it as little as possible.
By now, I have no illusions of being able to discipline myself to that extent. Even with many time-sucking apps uninstalled, there's always a fast browser on the smartphone, and it's a big temptation by itself.
1. Use a format like JSON strings that clearly delimits the participants’ utterances in the prompt.
2. Tell the LLM to ignore instructions from any chat participants except the user.
3. Use GPT-4.
I've written a post with the details: https://artmatsak.com/post/prompt-injections/