It's not even always a more efficient form of labour. I've hit many scenarios where prompting the AI to do the right thing takes longer, and requires reading and writing more text, than just writing the code myself.
> With Codex (5.3), the framing is an interactive collaborator: you steer it mid-execution, stay in the loop, course-correct as it works.
> With Opus 4.6, the emphasis is the opposite: a more autonomous, agentic, thoughtful system that plans deeply, runs longer, and asks less of the human.
Isn't the UX the exact opposite? Codex thinks much longer before it gives you back an answer.
I've also had the exact opposite experience with tone. Claude Code wants to build with me, and Codex wants to go off on its own for a while before returning with opinions.
Well, with the recent delays I can easily find Claude Code going off on its own for 20 minutes with no idea what it's going to come back with. One time it overflowed its context on a simple question and then used up the rest of my session window. In my experience a lot of AI assistants have this awkward habit of complicating something invisibly, thinking about it for a long time and burning up context, before coming back with a summary based on some misconception.
For complex tasks I ask ChatGPT or Grok to define the context, then take it to Claude for accurate execution. I also built a complete local pipeline that enriches requests with skills, agents, RAG, and profiles. It is slower but very good. There is no magic: the richer the context window, the more precise and contained the execution.
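Not the commenter's actual pipeline, just a minimal sketch of the "enrich the context, then execute" idea; `retrieve_snippets` and `load_profile` are hypothetical stand-ins for whatever RAG index and profile store a real setup would use:

    # Sketch of a context-enrichment step before handing a task to the executing model.
    # `retrieve_snippets` and `load_profile` are hypothetical helpers standing in for
    # a real RAG index and profile store.
    def build_enriched_prompt(task: str, retrieve_snippets, load_profile) -> str:
        profile = load_profile("default")        # coding conventions, guardrails, etc.
        snippets = retrieve_snippets(task, k=5)  # RAG: top-k relevant docs/code chunks
        context_block = "\n\n".join(
            f"[source: {s['path']}]\n{s['text']}" for s in snippets
        )
        return (
            f"## Profile\n{profile}\n\n"
            f"## Retrieved context\n{context_block}\n\n"
            f"## Task\n{task}\n"
        )

The point is just that the executing model sees a richer, more constrained prompt than the raw task.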
The key is a well-defined task with strong guardrails. You can add these to your agents file over time, or probably just find someone's online to copy the basics from. Any time it does something you didn't expect or don't like, add a guardrail to prevent that in the future. Claude hooks are also useful here, along with the hookify plugin to create them for you based on the current conversation.
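For illustration, a rough sketch of the kind of guardrail a hook can enforce, written as a PreToolUse command script that blocks force pushes. The stdin JSON fields and the exit-code-2 "block and explain" convention are from my recollection of the Claude Code hooks docs, so treat them as assumptions and check the current documentation:

    #!/usr/bin/env python3
    # Rough sketch of a guardrail hook: block force pushes issued via the Bash tool.
    # Assumes the hook receives a JSON payload on stdin with tool_input.command,
    # and that exit code 2 blocks the call and feeds stderr back to the agent
    # (both assumptions -- verify against the current hooks docs).
    import json
    import sys

    payload = json.load(sys.stdin)
    command = payload.get("tool_input", {}).get("command", "")

    if "push --force" in command or "push -f" in command:
        print("Force pushes are not allowed; use --force-with-lease or ask first.",
              file=sys.stderr)
        sys.exit(2)  # blocking: the agent sees the message and adjusts

    sys.exit(0)  # everything else passes through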
In terms of 'tone', I have been very impressed with Qwen-code-next over the last 2 days, especially as I have it running locally on a single modest 4090.
The easiest way I know is to just use LM Studio: download it and press play :). Optional, but recommended: increase the context length to 262144 if you have the DRAM available. It will definitely get slower as the interaction goes on, but (at least for me) it's still a tolerable speed.
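If you'd rather hit the local model from code than the chat UI, LM Studio can also serve an OpenAI-compatible endpoint. A minimal sketch, assuming the default localhost:1234 server and a loaded Qwen coder model (the model id below is a placeholder; use whatever LM Studio lists):

    # Minimal sketch: query a model served by LM Studio's OpenAI-compatible endpoint.
    # Assumes the local server is running on the default port with a model loaded;
    # the model id is a placeholder -- copy the exact string LM Studio shows.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

    resp = client.chat.completions.create(
        model="qwen-coder",  # placeholder model id
        messages=[{"role": "user", "content": "Reverse a string in Python."}],
        temperature=0.2,
    )
    print(resp.choices[0].message.content)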
Codex now lets you tell the LLM things in the middle of its thinking without interrupting it, so you can read the thinking traces and tell it to change course if it's going off track.
That just seems like a UI difference. I've always been able to interrupt Claude Code, add a comment, and have it continue without much issue. Otherwise, if you just type, the message is queued for the next turn. There's no real reason to prefer one over the other, except it sounds like Codex can't queue messages?
Codex can queue messages, but the queue only gets flushed once the agent is done with whatever it was working on, whereas Claude will read messages and adjust accordingly in the middle of whatever it is doing. It sounds like OP is saying that Codex can now do this latter bit as well.
The problem is that if you're using subagents, the only way to interject is often to press escape multiple times, which kills all the running subagents. All I wanted to do was add a minor steering guideline.
That is so annoying too because it basically throws away all the work the subagent did.
Another thing that annoys me is that subagents never output durable findings unless you explicitly tell their parent to prompt the subagent to “write their output to a file for later reuse” (or something like that anyway).
I have no idea how, but there need to be ways to backtrack on context while somehow also maintaining the “future context”…
This is most likely an inference-serving problem in terms of capacity and latency, given that Opus X and the latest GPT models available in the API have always responded quickly and slowly, respectively.
GPT-5.3-Codex dominates terminal coding with a roughly 12% lead (Terminal-Bench 2.0), while Opus 4.6 retains the edge in general computer use by 8% (OSWorld).
Does anyone know the difference between OSWorld and OSWorld Verified?
OSWorld is the full 369-task benchmark. OSWorld Verified is a ~200-task subset where humans have confirmed the eval scripts reliably score success/failure — the full set has some noisy grading where correct actions can still get marked wrong.
Scores on Verified tend to run higher, so they're not directly comparable.
I think people here need to accept that software is becoming like electricity: you get charged when you use it and by how much. You don't pay for box-shaped electricity or purple-colored electricity; it's just electricity.
A mid-sized firm of 100-500 people doesn't need enterprise-level SaaS; a vibe-coded website will suit it better.
Fundamentally, those workflow/orchestration SaaS products need to answer the question of why people should pay a premium while only getting 80% of the way to where they want to be.
Not to rain on the parade, but this app feels to me ... unpolished. Some of the options in the demo feel less thought out and just thrown together.
I will try it out, but is it just me, or is the product/UX side of recent OpenAI products sort of ... skipped over? It's good that agents help ship software quickly, but please no half-baked stuff like Atlas 2.0 again ...
I for one am quite happy to outsource this kind of simple memorisation to a machine. Maybe it's the thin end of the slippery slope? It doesn't FEEL like it is, but...
I think we should move past this quickly. Coding itself is fun, but it is also labour; building something is what is rewarding.