Codex has always been better at following agents.md and prompts, but I would say in the last 3 months Claude Code got worse (freestyling like we see here) while Codex got EVEN more strict.
80% of the time I ask Claude Code a question, it kinda assumes I am asking because I disagree with something it said, then acts on a supposition. I've resorted to appending things like "THIS IS JUST A QUESTION. DO NOT EDIT CODE. DO NOT RUN COMMANDS". Which is ridiculous.
Codex, on the other hand, will follow something I said pages and pages ago, and because it has a much larger context window (at least with the setup I have here at work), it's just better at following orders.
With this project I am doing, because I want to be more strict (it's a new programming language), Codex has been the perfect tool. I am mostly using Claude Code when I don't care so much about the end result, or it's a very, very small or very, very new project.
>I've resorted to appending things like "THIS IS JUST A QUESTION. DO NOT EDIT CODE. DO NOT RUN COMMANDS". Which is ridiculous.
Funny to read that, because for me it's not even new behavior. I have developed a tendency to add something like "(genuinely asking, do not take as a criticism)".
I'm from a more confrontational culture, so I just assumed this was corporate American tone framing criticism softly, and me compensating for it.
Same here. I quickly learned that if you merely ask questions about its understanding or plans, it starts looking for alternatives, because my questioning is interpreted as rejection or criticism rather than just being taken at face value. So I often (not always) have to caveat questions like that too. It's really been like that since before Claude Code or Codex even rolled around.
It's just strange because that's a very human behavior, and although it learns from humans, it isn't one, so it would be nice if it just acted more robotic in this sense.
Yeah, numerous times I've replied to a comment online, to add supporting context, and it's been interpreted as a retort. So now I prefix them with 'Yeah, '.
Splitting tasks like researching, coding, and reviewing across multiple LLMs and/or sessions can reduce or eliminate these issues while giving you more control over your context windows. This also provides the side benefit of a ‘second opinion’ and a more general perspective on a given topic.
You're absolutely right! No, really: I've never had this problem of unprompted changes when I'm just asking, but I always (I think even in real-life conversations with real people) start with feedback: "Works great. What happens if..."
I think people having different styles of prompting LLMs leads to different model preferences. It's like you can work better with some colleagues while with others it does not really "click".
Oh funny enough, I often add stuff like "genuinely asking, do not take as a criticism" when talking with humans so I do it naturally with LLMs.
People often use questions as an indirect form of telling someone to do something or criticizing something.
I definitely had people misunderstand questions for me trying to attack them.
There are a lot of times when people do expect the LLM to interpret their question as a command to do something, and they would get quite angry if the LLM just answered the question.
Not that I wouldn't prefer if LLMs took things more literally, but these models are trained for the average neurotypical user, so that quirk makes perfect sense to me.
Personally defined <dtf> as 'don't touch files' in the general claude.md, with the explanation that when this is present in the query, it means to not edit anything, just answer questions.
It has worked pretty well so far: when I include <dtf> in the query, the model never runs around modifying things.
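For anyone who wants to try the same trick, the CLAUDE.md entry might look something like this (the exact wording below is my guess at it, not the commenter's actual file):

```markdown
## Query markers

- `<dtf>` means "don't touch files". When this tag appears anywhere in my
  message, treat the message as a question only: answer it, but do not
  edit files, do not run commands, and do not create or delete anything.
```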
I've been using chat and copilot for many months but finally gave claude code a go, and I've been interested in how it does seem to have a bit more of an attitude to it. Like copilot is just endlessly patient for every little nitpick and whim you have, but I feel like Claude is constantly like "okay I'm committing and pushing now.... oh, oh wait, you're blocking me. What is it you want this time bro?"
I found that helpful for a question but the btw query seemed to go to a subagent that couldn't interrupt or direct the main one. So it really was just for informational questions, not "hey what if we did x instead of y?"
Yesterday it was showing a hint in the corner to use "/btw" but when I first tried it I got this same error. About ten minutes later (?) I noticed it was still showing the same hint in the corner, so I tried it again and it worked. Seemed to be treated as a one-off question which doesn't alter the course of whatever it was already working on.
It's not really the same use case. It's a smaller model, it doesn't have tools, it can't investigate, etc. The only thing it can do is answer questions about whatever is in the current context.
Then it will try to update the plan. Sometimes I have a plan that I'm ready to approve, but get an idea, "what if we use/do this instead of that", and all I want is a quick answer, with or without additional exploring. What I don't want is to adjust a plan I already like based on a thing I said that may not pan out.
Charitable reading. Culture; tone; throughout history these have been medium and message of the art of interpersonal negotiation in all its forms (not that many).
A machine that requires them in order to work better is not an imaginary para-person that you now get to boss around; the "anthropic" here is "as in the fallacy".
It's simply a machine that is teaching certain linguistic patterns to you. As part of an institution that imposes them. It does that, emphatically, not because the concepts implied by these linguistic patterns make sense. Not because they are particularly good for you, either.
I do not, however, see like a state. The code's purpose is to be the most correct representation of a given abstract matter as accessible to individual human minds - and like GP pointed out, these workflows make that stage matter less, or not at all. All engineers now get to be sales engineers, too! Primarily! Because it's more important! And the most powerful cognitive toolkit! (Well, after that other one, the one for suppressing others' cognition.)
Fitting: most software these days is either an ad or a storefront.
>80% of the time I ask Claude Code a question, it kinda assumes I am asking because I disagree with something it said, then acts on a supposition.
Humans do this too. Increasingly so over the past ~1y. Funny...
Some always did though. Matter of fact, I strongly suspect that the pre-existing pervasiveness of such patterns of communication and behavior in the human environment, is the decisive factor in how - mutely, after a point imperceptibly, yet persistently - it would be my lot in life to be fearing for my life throughout my childhood and the better part of the formative years which followed. (Some AI engineers are setting up their future progeny for similar ordeals at this very moment.)
I've always considered it significant how back then, the only thing which convincingly demonstrated to me that rationality, logic, conversations even existed, was a beat up old DOS PC left over from some past generation's modernization efforts - a young person's first link to the stream of human culture which produced said artifact. (There's that retrocomputing nostalgia kick for ya - heard somewhere that the future AGI will like being told of the times before it existed.)
But now I'm half a career into all this goddamned nonsense. And I'm seeing smart people celebrating the civilization-scale achievement of... teaching the computers how to pull ape shit! And also seeing a lot of ostensibly very serious people, who we are all very much looking up to, seem to be liking the industry better that way! And most everyone else is just standing by listless - because if there's a lot of money riding on it then it must be a Good Thing, right? - we should tell ourselves that and not meddle.
All of which, of course, does not disturb, wrong, or radicalize me in the slightest.
First time I used Claude I asked it to look at the current repo and just tell me where the database connection string was defined. It added 100 lines of code.
I asked it to undo that and it deleted 1000 lines and 2 files
Would `git reset --hard` have worked in your case? I guess you want to have each baby step in a git commit; in the end you could do a `git rebase -i` if needed.
Whatever setup I have in the office doesn't allow git without me approving the command. Or anything else, really: I often have to approve a grep because it redirects some output to /dev/null, which is a write operation.
One annoying thing about that flow is that when you change the world outside the model it breaks its assumptions and it loses its way faster (in my experience).
I feel like people are sleeping on Cursor, no idea why more devs don't talk about it. It has a great "Ask" mode, the debugging mode has recently gotten more powerful, and its plan mode has started to look more like Claude Code's plans, when I test them head to head.
Cursor implemented something a while back where it started acting like how ChatGPT does when it's in its auto mode.
Essentially, choosing when it was going to use what model/reasoning effort on its own regardless of my preferences. Basically moved to dumber models while writing code in between things, producing some really bad results for me.
Anecdotal, but the reason I will never talk about Cursor is because I will never use it again. I have barred the use of Cursor at my company. It just does some random stuff at times, which is more egregious than what I see from Codex or Claude.
ps. I know many other people who feel the same way about Cursor and others who love it. I'm just speaking for myself, though.
ps2. I hope they've fixed this behavior, but they lost my trust. And they're likely never winning it back.
You just described their “auto” behavior, which I’m guessing uses grok.
Using it with specific models is great, though you can tell that Anthropic is subsidizing Claude Code as you watch your API costs more directly. Some day the subsidy will end. Enjoy it now!
And cursor debugging is 10x better, oh my god.
I have switched to 70% Claude Code, 10% Copilot code reviews (non anthropic model), and 20% Cursor and switch the models a bit (sometimes have them compete — get four to implement the same thing at the same time, then review their choices, maybe choose one, or just get a better idea of what to ask for and try again).
You wouldn't do that for everything. I'd reserve it for work with higher uncertainty, where you're not sure which path is best. Different model families can make very different choices.
In the coworking space I am in, people are hitting limits on the $60 plan all the time. They are thinking about which models to use to be efficient, which context to include, etc.
I’m on claude code $100 plan and never worry about any of that stuff and I think I am using it much more than they use cursor.
Tell them to use the Composer 1.5 model. It's really good, better than Sonnet, and has much higher usage limits. I use it for almost all of my daily work, don't have to worry about hitting the limit of my $60 plan, and only occasionally switch to Opus 4.6 for planning a particularly complex task.
Cursor tends to bounce out of plan mode automatically and just start making changes (while still actually in plan mode). I also have to constantly remind it “YOU ARE IN PLAN MODE, do not write a plan yet, do not edit code”. It tends to write a full-on plan with one initial prompt instead of my preferred method of hashing out a full plan, details, etc… It definitely takes some heavy corralling and manual guardrails but I’ve had some success with it. Just keep very tight reins on your branches and be prepared to blow them away and start over on each one.
I've had some luck taming prompt introspection by spawning a critic agent that looks at the plan produced by the first agent and vetoes it if the plan doesn't match the user's intentions. LLMs are much better at identifying rule violations in a bit of external text than at regulating their own output. Same reason why they generate unnecessary comments no matter how many times you tell them not to.
I've also found it to be better to ask the LLM to come up with several ideas and then spawn additional agents to evaluate each approach individually.
I think the general problem is that context cuts both ways, and the LLM has no idea what is "important". It's easier to make sure your context doesn't contain pink elephants than it is to tell it to forget about the pink elephants.
> what "red/green" means: the red phase watches the tests fail, then the green phase confirms that they now pass.
> Every good model understands "red/green TDD" as a shorthand for the much longer "use test driven development, write the tests first, confirm that the tests fail before you implement the change that gets them to pass".
You can just say spawn an agent as the sibling says. I didn't find that reliable enough, so I have a slightly more complicated setup. The first agent has no permissions except spawning agents and reading from a single directory. It spawns the planner to generate the plan, then feeds it to the critic, and either spawns executors or re-runs the planner with critic feedback. The planner can read and write. The critic agent can only read the input and outputs accept/reject with a reason.
This is still sometimes flaky because of the infrastructure around it and ideally you'd replace the first agent with real code, but it's an improvement despite the cost.
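The orchestration described above can be sketched in a few lines. This is a minimal sketch, not the commenter's actual setup: `call_planner` and `call_critic` are hypothetical stand-ins for whatever agent-spawning mechanism the harness provides, and the orchestrator itself is the "real code" the commenter suggests replacing the first agent with.

```python
# Toy planner/critic loop. The orchestrator only routes text between
# agents; call_planner and call_critic are hypothetical LLM-call stubs.

def run_plan_loop(task, call_planner, call_critic, max_rounds=3):
    """Regenerate the plan until the critic accepts it or we give up."""
    feedback = None
    for _ in range(max_rounds):
        plan = call_planner(task, feedback)        # planner sees prior critique
        verdict, reason = call_critic(task, plan)  # critic only reads, never edits
        if verdict == "accept":
            return plan
        feedback = reason                          # re-run planner with the critique
    raise RuntimeError("critic rejected every plan; escalate to the user")
```

The key design point matches the comment: the critic never edits anything, it only emits accept/reject with a reason, and the planner is re-run from scratch with that reason attached.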
> Codex, on the other hand, will follow something I said pages and pages ago, and because it has a much larger context window (at least with the setup I have here at work), it's just better at following orders.
This is important, but as a warning. At least in theory your agent will follow everything that it has in context, but LLMs rely on 'context compacting' when things get close to the limit. This means an LLM can and will drop your explicit instructions not to do things, and then happily do them because they're not in the context any more. You need to repeat important instructions.
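One way to picture why repeating matters: if compaction simply drops the oldest turns, an instruction only survives if the harness treats it specially. A toy sketch of that idea follows; the `pinned` flag is my own invention for illustration, not any real agent's API.

```python
# Toy compaction: budget is a message count. Pinned messages always
# survive; unpinned ones are truncated oldest-first. An instruction
# that is NOT pinned silently falls out of context.

def compact(messages, budget):
    pinned = [m for m in messages if m.get("pinned")]
    rest = [m for m in messages if not m.get("pinned")]
    slots = budget - len(pinned)
    keep = rest[-slots:] if slots > 0 else []
    kept = set(map(id, pinned + keep))
    # preserve the original conversation order
    return [m for m in messages if id(m) in kept]
```

Real agents summarize rather than truncate, but the failure mode is the same: anything the summarizer doesn't consider important enough to carry forward is gone, which is why repeating critical instructions (or putting them in an agents file that gets re-injected) is the safer bet.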
This is mostly dependent on the agent because the agent sets the system prompt. All coding agents include in the system prompt the instruction to write code, so the model will, unless you tell it not to. But to what extent they do this depends on that specific agent's system prompt, your initial prompt, the conversation context, agent files, etc.
If you were just chatting with the same model (not in an agent), it doesn't write code by default, because it's not in the system prompt.
There’s an extension to this problem which I haven’t got past. More generally I’d like the agent to stop and ask questions when it encounters ambiguity that it can’t reasonably resolve itself. If someone can get agents doing this well it’d be a massive improvement (and also solve the above).
Hm, with my "plan everything before writing code, plus review at the end" workflow, this hasn't been a problem. A few times when a reviewer has surfaced a concern, the agent asks me, but in 99% of cases, all ambiguity is resolved explicitly up front.
This. Just asking it to ask you questions before proceeding has saved me so much time from it making assumptions I don’t want. It’s the single most important part of almost all my prompts.
The solution for this might be to add a ME.md in addition to AGENT.md so that it can learn and write down our character, to know if a question is implicitly a command for example.
This is not Claude Code.
And my experience is the opposite. For me, Codex is not working at all, to the point that it's no better than asking the chatbot in the browser.
This is extra rough because Codex defaults to letting the model be MUCH more autonomous than Claude Code. The first time I tried it out, it ended up running a test suite without permission which wiped out some data I was using for local testing during development. I still haven't been able to find a straight answer on how to get Codex to prompt for everything like Claude Code does - asking Codex gets me answers that don't actually work.
Maybe I should give Codex a go, because sometimes I just want to ask a question (Claude) and not have it scan my entire working directory and chew up 55k tokens.
For the last 12 months labs have been:
1. Check-pointing
2. Training until model collapse
3. Reverting to the checkpoint from 3 months ago
4. Waiting until people have gotten used to the shitty new model
Anthropic said they "don't do any programming by hand" the last 2 years. Anthropic's API has 2 nines.
I find this thread surprising honestly. Claude Code is my daily driver and I consider myself a real power user. If you have your commands/agents/skills set up correctly you should never be running into these issues
> Codex, on the other hand, will follow something I said pages and pages ago, and because it has a much larger context window (at least with the setup I have here at work), it's just better at following orders.
Claude Code goes through some internal systems that other tools (Cline / Codex / and I think Cursor) do not. Also, we have different models for each. I don't know what happens in practice, but I found that Codex compacts conversations way less often. It might well be that fewer tokens are used/added, rather than a larger raw context window. Sorry if I implied we have more context than whatever others have :)
Codex does something sorta magical where it auto compacts, partially maybe, when it has the chance. I don’t know how it works, and there is little UI indication for it.
But that's one of the first things you fix in your CLAUDE.md:
- "Only do what is asked."
- "Understand when being asked for information versus being asked to execute a task."
How do you do that???? Say the words but in the form of a question? I feel like that will go a lot worse than just telling (but nicely). I have a daughter too so I am genuinely willing to try anything
What about adding something like, "When asked a question, just answer it without assuming any implied criticism or instructions. Questions are just questions." to claude.md?
I'm back on Claude Code this month after a month on Codex and it's a serious downgrade.
Opus 4.6 is a jackass. It's got Dunning-Kruger and hallucinates all over the place. I had forgotten about the experience (as in the Gist above) of jamming on the escape key: "no no no, I never said to do that." But also I don't remember 4.5 being this bad.
But GPT 5.3 and 5.4 are a far more precise and diligent coding experience.