Pure TUI is solid - I’ve been running all my pets inside that cage for several weeks with no issues. Auto-updates work, session renewals work, config updates work etc.
But lately I’ve been using agents to test via browsers, and starting headless browsers from the agent is flakey. I’m working on that but it’s hard to find a secure default to run Chrome.
In the repo, I have policies for running the Claude desktop app and VSCode inside the same sandbox (so can do yolo mode there too), so there is hope for sandboxing headless Chrome as well.
Did a migration myself last week from using playwright mcp towards playwright-cli instead. Which has been playing much nicer so far. I guess you would run into the same issues you've already mentioned about running chrome headless in one of these sandboxes.
playwright-cli works out of the box, and I just merged support for agent-browser. If you end up testing out Safehouse, and have any issues, just create an issue on GitHub, and I'll check it out. Browser usage is definitely among my use cases.
I really don't have any numbers to back this up. But it feels like the sweet spot is around ~500k context size. Anything larger then that, you usually have scoping issues, trying to do too much at the same time, or having having issues with the quality of what's in the context at all.
For me, I would say speed (not just time to first token, but a complete generation) is more important then going for a larger context size.
From my point of view, you're either choosing between instruction following or more creative solutions.
Codex models tend to be extremely good at following instructions, to the point that it won't do any additional work unless you ask it to. GPT-5.1 and GPT-5.2 on the other hand is a little bit more creative.
Models from Anthropics on the other hand is a lot more loosy goosy on the instructions, and you need to keep an eye on it much more often.
I'm using models interchangeably from both providers all the time depending on the task at hand. No real preference if one is better then the other, they're just specialized on different things
> The last thing I'll mention is that Claude Code (Sonnet 4.5) is still very token-happy, in that it eagerly goes above and beyond when not always necessary. Codex (gpt-5-codex) on the other hand, does exactly what you ask, almost to a fault.
I very much share your experience. As for the time being I like the experience with codex over claude, just because I find my self in a position where I know much sooner when to step in and just doing it manually.
With claude I find my self in a typing exercise much more often, I could probably get better of knowing when to stop ofc.
Hi all, my apologies for the very late reply. I honestly didn't expect this post to get any attention and haven't checked back until now.
I want to clarify that the ideas in this manifesto are entirely my own. As I mentioned in the preface, English is not my native language, so I used AI tools to translate my original text. This is likely why some parts have an "AI-like" feel.
Thank you for the feedback; it's genuinely helpful. The substance of the work is what I truly wanted to share.
But anti-cheat hasn't been about blocking every possible way of cheating for some time now. It's been about making it as in convenient as possible, thus reducing the amount of cheaters.
Is the current fad of using kernel level anti-cheats what we want? hell nah.
The responsibility of keeping a multi-player session clean of cheaters, was previously shared between the developers and server owners. While today this responsibility has fallen mostly on developers (or rather game studios) since they want to own the whole experience.
I've been trying to get microsandbox to play nicely. But this is much closer to what I actually need.
I glimpsed through the site and the script. But couldn't really see any obvious gotchas.
Any you've found so far which hasn't been documented yet?