Hacker News | philipbjorge's comments

I can't find the specific issues in their repo right now, but I've been somewhat suspicious that their tool over-reports token savings, and there are many issues to that effect in the repo.

I'm not likely to install it again in my latest configuration, instead applying some specific tricks to things like `make test` so they print nothing on success and only emit output on a non-zero exit code, that sort of thing. Anecdotally, I see GPT-5.5 often automatically applying context-limiting flags to the bash it writes :shrug:
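
For anyone wanting to try the quiet-on-success trick, here is a minimal sketch in Python. The `run_quiet` name and the `tail_lines` default are made up for illustration; this is not how RTK or any particular tool does it, just one way to keep passing runs out of the agent's context.

```python
import subprocess
import sys


def run_quiet(cmd, tail_lines=40):
    """Run cmd and return (exit_code, text_to_show).

    On success the text is empty, so nothing lands in the agent's context.
    On failure only the last `tail_lines` lines of combined output are kept.
    """
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout
        text=True,
    )
    if proc.returncode == 0:
        return 0, ""
    tail = proc.stdout.splitlines()[-tail_lines:]
    return proc.returncode, "\n".join(tail)


# Hypothetical usage from a wrapper script:
#   code, text = run_quiet(["make", "test"])
#   if text: print(text)
#   sys.exit(code)
```

The exit code is passed through unchanged, so the agent can still tell a pass from a failure even when it sees no output.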


I've had the same experience with RTK, where my agent got stuck in a loop with a faulty RTK command and could not escape it since RTK hard overwrites anything automatically. I've uninstalled it again for the time being.


I've done some pretty incredible things with LLMs. If this were sqlite with its exhaustive test suite... OK, I can see it.

It's hard for me to see this not becoming a pile of slop, but hey, maybe I'm wrong.


Can you fill me in on how this impacts conductor? How are they using `claude -p`?


Tracking with `ccusage`, I pretty easily hit $2000/mo in API equivalent credits and while I'd consider myself a power user, I'm a responsible one that's generally always in the loop. If I were using `claude -p`, this would effectively be a kneecapping.


I was already preparing to move off of Claude Code; in many ways it is a false summit.


Been loving pi and codex lately. Good to build resiliency and self sufficiency into these systems.


Moving from CC to Codex would be the opposite of what I want. It would be like trading Burger King for McDonald's: both are bad for you.


This seems like less of a today thing and more of an ancient human tendency.

A lot of Buddhist practice is basically trying to train against immediately collapsing reality into self/other, right/wrong, craving/aversion.

Practicing this with Elon Musk is effectively ultra hard mode.

--

Though I do think there’s a subtle irony here too — the original commenter may simply be describing their own emotional reaction/disillusionment, while your response risks collapsing them into "part of the problem."

Feels like everybody in the thread is pointing at the same tendency from different angles.


You might search for a concept like `/handoff` that's in ampcode. I'm sure someone's built a skill for just this.


That's not going to work if the service is down, however.


Ahh good point -- I've handled this by switching my harness to `pi` but recognize that may not be for everyone and doesn't directly address OP's question.


So happy to have diversified my model providers this past couple of weeks. GPT-5.5 has had no trouble slotting into Opus workloads. Will be fun to try out more of the models as time goes on to build some resiliency into my engineering workflows :).


I think if codex can fill in some functional gaps that shouldn’t be that huge - like having defined agents in plugins like Claude code - it’s actually a preferable product. It’s faster in every way and seems to manage context a lot better - compaction isn’t a complete end-of-the-world event to be avoided at all costs. With the addition of defined thinking, the fact that it actually seems to follow tool-calling instructions, its handling of permissions, and other features, it’s frankly a better tool overall. 5.5 seems to be a reasonable model.

Anthropic seems to have really killed their advantage, squandering the immense goodwill they built up by blundering over and over again with the developer community over the last few months.

Tonight, for instance, after the incident had recovered, I restarted my work. On my Max account my usage period was completely exhausted in 4 minutes of Sonnet subagent work. This was long after prime time, and the workload was a fraction of what I normally do.

These days I run codex concurrently and have gotten my marketplaces and plugins and MCPs adapted to it - other than the agents which I do lean heavily on - and generally find it a capable replacement. Anthropic needs to take notice and get their house in order.


I found GPT 5.4 terrible. I just tested 5.5 and compared with Opus it's still not great.


What I found was that I *strongly* preferred Claude Code with its defaults. Codex was almost unusable to me -- It would spit out a 4-5 page plan where it kept repeating itself, where Claude would give me a crisp 1-2 pager I could actually review.

*But* I don't work with the defaults -- I work with my own prompt framework based off of superpowers.

Given sufficient prompt scaffolding, I've found the models relatively interchangeable -- _I might_ be getting some of this for free by basing my own system off of superpowers which is used across various harnesses -- In other words achieving this kind of portability may be a lot harder than it looks and I'm benefiting from other people's work.


The problem I ran into was that, using the workflow I use with Claude, the code being written wasn't good: missing edge cases, incomplete.

After reviewing the code, I also found it was annoying to get GPT 5.4 to actually fix the code based on my prompts compared with Opus. I had to be far more specific and direct (which is related to the missing edge cases, incompleteness, etc.).


I lack a bit of context. Can you point me to a place that explains what you use?


I haven't really shared what I use, I'm still deciding if that's something I want to do.

To get an idea of what I'm talking about, you could install https://github.com/obra/superpowers/ into both Codex and Claude Code -- You'll find that the behavior is remarkably similar if you A/B compare them on the same problems. CC occasionally misses things that Codex gets and vice versa.

Overall the output structure and final code is remarkably similar... Which is pretty different than if you just run them with their default system prompts. I'd throw codex out the window with its default outputs.


In what harness?


codex. codex is also pretty garbage compared with claude code. The permissioning system in claude code with auto mode is now pretty fantastic. With codex the only vaguely usable mode is yolo mode, which is bad for obvious reasons.


I’ve been comparing Claude Code and Codex extensively side by side over the past couple of weeks with my favorite prompting framework superpowers…

From my perspective, Claude Code is decidedly not better than Codex. They’re slightly different and work better together. I would have no issues dropping CC entirely and using codex 100%.

If you’re working off of “defaults”, in other words no custom prompting, Claude Code does perform a lot better out of the box. I think this matters, but if you’re a professional software developer, I’d make the case that you should be owning your tools and moving beyond the baked in prompts.


GPT 5.4 has been performing great in my harness.


This looks remarkably similar to https://github.com/vercel-labs/agent-browser

How is it different?


To be honest, I hadn't seen that one yet!

The main difference is likely the targeting philosophy. webctl relies heavily on ARIA roles/semantics (e.g. role=button name="Save") rather than injected IDs or CSS selectors. I find this makes the automation much more robust to UI changes.
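
To make the role/name targeting idea concrete, here is a toy sketch in Python. It is not webctl's actual implementation: `Node`, `parse_query`, and `find_all` are hypothetical names, and the hand-built tree stands in for a real accessibility tree. The point it illustrates is that a query like `role=button name="Save"` keeps matching when ids change or the element gets wrapped in a new container.

```python
import shlex
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy accessibility-tree node; real trees carry far more state."""
    role: str
    name: str = ""
    attrs: dict = field(default_factory=dict)    # id/class etc., never used for matching
    children: list = field(default_factory=list)


def parse_query(query):
    """Turn 'role=button name="Save"' into {'role': 'button', 'name': 'Save'}."""
    return dict(part.split("=", 1) for part in shlex.split(query))


def find_all(root, query):
    """Depth-first search for nodes whose role and accessible name match."""
    want = parse_query(query)
    matches = []
    stack = [root]
    while stack:
        node = stack.pop()
        role_ok = node.role == want.get("role", node.role)
        name_ok = node.name == want.get("name", node.name)
        if role_ok and name_ok:
            matches.append(node)
        stack.extend(reversed(node.children))  # keep document order
    return matches


# The same UI before and after a redesign: the id changed and the Save
# button moved into a wrapper group, but its role and name did not.
before = Node("form", children=[
    Node("button", "Save", {"id": "btn-1"}),
    Node("button", "Cancel"),
])
after = Node("form", children=[
    Node("group", children=[Node("button", "Save", {"id": "x9f3"})]),
    Node("button", "Cancel"),
])
```

Running `find_all(before, 'role=button name="Save"')` and the same query against `after` each returns exactly one node, whereas a selector keyed on `#btn-1` would only match the first tree.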

Also, I went with Python for V1 simply for iteration speed and ecosystem integration. I'd love to rewrite in Rust eventually, but Python was the most efficient way to get a stable tool working for my specific use case.


vibium clicker, too. https://github.com/VibiumDev/vibium/blob/main/CONTRIBUTING.m...

"browser automation for ai agents" is a popular idea these days.

