For what it's worth, I've been trying Opus 4.1 in VS Code through GitHub Copilot and it's been really bad. Maybe worse than Sonnet and GPT 4.1. I'm not sure why it was doing so poorly.
In one instance, I asked it to optimize a roughly 80-line C# method that matches object positions by object ID and delta-encodes them against the previous frame. It seemed confused about how all this should work and output completely wrong code. It had all the context it needed in the file, and the method is fairly self-contained. Other models did much better. GPT-5 understood what to do immediately.
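For anyone unfamiliar with the technique, here is a rough sketch of what "match by object ID and delta-encode against the previous frame" usually means. This is not the C# method in question (that code isn't shown here); it's Python for brevity, and the names `delta_encode`, `delta_decode`, and the integer (x, y, z) position layout are assumptions made up for illustration. It also assumes that objects missing from the previous frame are encoded against the origin.

```python
from typing import Dict, Tuple

# Hypothetical position type: integer (x, y, z) coordinates.
Position = Tuple[int, int, int]
ORIGIN: Position = (0, 0, 0)


def delta_encode(prev: Dict[int, Position],
                 curr: Dict[int, Position]) -> Dict[int, Position]:
    """Return per-object deltas from prev to curr, keyed by object ID.

    Assumption: an object that is not in prev is encoded against the
    origin, so its delta equals its absolute position.
    """
    deltas: Dict[int, Position] = {}
    for obj_id, pos in curr.items():
        base = prev.get(obj_id, ORIGIN)  # match by ID, not by list order
        deltas[obj_id] = tuple(c - b for c, b in zip(pos, base))
    return deltas


def delta_decode(prev: Dict[int, Position],
                 deltas: Dict[int, Position]) -> Dict[int, Position]:
    """Rebuild the current frame from the previous frame plus deltas."""
    return {
        obj_id: tuple(b + d for b, d in zip(prev.get(obj_id, ORIGIN), delta))
        for obj_id, delta in deltas.items()
    }


prev_frame = {1: (10, 20, 0), 2: (5, 5, 5)}
curr_frame = {1: (11, 20, 0), 2: (5, 4, 5), 3: (7, 7, 7)}

deltas = delta_encode(prev_frame, curr_frame)
restored = delta_decode(prev_frame, deltas)
```

The point of matching by ID is that objects can be reordered, added, or removed between frames, so an index-based diff would pair the wrong positions. The round trip through `delta_decode` is a cheap way to check whether an "optimized" version still behaves the same.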
I tried a few other tasks/questions that also had underwhelming results. Now I've switched to using GPT-5.
If you have a quick prompt you'd like me to try, I can share the results.
But they definitely don't take into account whatever prompts the tools are really using (or MS is using a neutered version to reduce cost). So I would agree with the suggestion. Using Sonnet through Copilot seems very, very different from Cursor, Cline, or Claude Code.
Using the exact same model, Copilot consistently fails to finish tasks or makes a mess. It is consistent at this across IDEs (i.e. the JetBrains plugin generates nearly identical bad results to VS Code Copilot). I then discard everything it did and try the exact same (user) prompt in Cursor, Claude Code, or Cline with the same model, and it does the same task perfectly.
I've used both aider and opencode with both Opus and Sonnet. Opencode, at least initially, used Claude Code's exact prompt; and I found the results surprisingly different.
Perhaps it shouldn't be surprising; after all, we do want the LLMs to listen to the prompts and act differently. And the Claude team will presumably be tuning both Claude and Claude Code's prompts to each other to optimize their own experience, so it's perhaps not surprising that Claude and Claude Code's prompts work well together.
To me it seems that Opus is really good at writing code if you give it a spec. The other day I had GPT come up with a spec for a DnD text game that uses the GPT API. Opus one-shotted a 1k-line program.
However, if I'm not detailed with it, it does seem to make weird choices that end up being unmaintainable. It's like it has poor creative instincts but is really good at following the directions you give it.
Opus seems to need more babysitting IME, which is great if you are going to actually pair program. Terrible if you like leaving it to do its own thing or try to do multiple things at once.
I just want a model that feels like an extension of me. For example, if there's a task I can describe in one sentence - "add a rest api for user management in the db, and make sure only users in the admin group are allowed to use it" - it should result in an API endpoint that's properly wired up to the right places, with the model doing what I tell it and nothing else, even if it would logically follow from what I told it.
And if it gets confused, needs clarification, or has its own initiative - I want it to stop and ask.
Oh, and it needs to be fast: its tokens per minute should be as fast as I can read what it generates (and I can read boilerplate-y code quite fast). It shouldn't stop and think on every prompt, only when it needs to, and it should be much faster and more granular in backtracking.
The loop of waiting on the AI then having to fix and steer it constantly as it doggedly follows its own ideas has really taken the enjoyment out of vibe coding for me.
Have it break the problem into phases. Have it run unit tests after every phase. Only move forward after all the tests for the phase have passed. I'm using the free Qwen3-Coder, and with proper prompting it is fairly good.
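If it helps to picture the gating rule, here is a minimal sketch of "only move forward once the phase's tests pass". It is not tied to Qwen3-Coder or any particular tool; `run_phases`, the phase names, and the stand-in test callables are all hypothetical. In practice each callable would wrap whatever test runner you use.

```python
from typing import Callable, List, Tuple

# Hypothetical phase: a name plus a callable that reports whether
# that phase's unit tests passed.
Phase = Tuple[str, Callable[[], bool]]


def run_phases(phases: List[Phase]) -> List[str]:
    """Run phases in order and stop at the first one whose tests fail.

    Returns the names of the phases that completed with passing tests.
    """
    completed: List[str] = []
    for name, tests_pass in phases:
        if not tests_pass():
            break  # do not start later phases on top of a failing one
        completed.append(name)
    return completed


# Stand-in results for illustration: the second phase fails.
phases: List[Phase] = [
    ("parse input", lambda: True),
    ("core logic", lambda: False),
    ("cli wrapper", lambda: True),
]

completed = run_phases(phases)
```

Here the third phase never runs because the second one failed, which is the whole idea: fix the failing phase first instead of letting the model pile more code on top of it.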
I spend a lot of time planning tasks, generating various documents per PR (requirements, questions, todo), having AI poke at my ideas (business/product/UX/code-wise), etc.
After 45 minutes of back and forth, we generally end up with a detailed plan.
This also has many benefits:
- writing tests becomes very simple (unit, integration, E2Es)
- writing documentation becomes very simple
- writing meaningful PRs becomes very simple
It is quite boring though, not gonna lie. But that's a price I have accepted for quality.
Also, clarifying the ideas so much beforehand often leads me to come up with creative ideas later in the day, when I go for walks and mentally review what we've done and how.
You might want to try Claude Code if you haven't. It's perfect for exactly this plan-then-build flow with a ton of documents. A colleague set up some strict code guidelines, right down to, say, putting constructors at the top and constants at the bottom, using this name for this, snake case for that. Code quality just shoots up with these details. Can't just hack away with a blunt axe.
People tend to hate Claude Code because it's not vibe coding anymore but it was never really meant to be.