Hacker News | ralusek's comments

What’s updog?

Not much, what's up with you?

I had to scroll way too far for this. Thank you for your service.

Not much. Anthropic was down. What's up with you?

I think Gemini is an excellent model, it's just not a particularly great agent. One of the reasons is that its code output is often structured in a way that looks like it's answering a question, rather than generating production code. It leaves comments everywhere, which are often numbered (which not only is annoying, but also only makes sense if the numbering starts within the frame of reference of the "question" it's "answering").

It's also just not as good at being self-directed and doing all of the rest of the agent-like behaviors we expect, e.g. breaking work down into todo lists, determining the appropriate scope of work to accomplish, proper tool calling, etc.


Yeah, you may have nailed it. Gemini is a good model, but in the Gemini CLI with a prompt like, "I'd like to add <feature x> support. What are my options? Don't write any code yet" it will proceed to skip right past telling me my options and will go ahead and implement whatever it feels like. Afterward it will print out a list of possible approaches and then tell you why it did the one it did.

Codex is the best at following instructions IME. Claude is pretty good too but is a little more "creative" than codex at trying to re-interpret my prompt to get at what I "probably" meant rather than what I actually said.


Try the conductor extension for gemini-cli: https://github.com/gemini-cli-extensions/conductor

It won't make any changes until a detailed plan is generated and approved.


Can you (or anyone) explain how this might be? The "agent" is just a passthrough for the model, no? How is one CLI/TUI tool better than any other, given the same model that it's passing your user input to?

I am familiar with Copilot CLI (using models from different providers), OpenCode doing the same, and Claude with just the Anthropic models, but if I ask all 3 the same thing using the same Anthropic model, I SHOULD be getting roughly the same output, modulo LLM nondeterminism, right?


Maybe different preparatory "system" prompts?
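That's most of the difference, plus different tool definitions: the harness wraps your message in its own system prompt and tool schema before the model ever sees it, which is enough to change behavior noticeably. A minimal sketch of the idea, with every name hypothetical (call_model stands in for whatever provider API the CLI actually uses):

    # Two "agent" harnesses calling the same underlying model; the only
    # difference is the system prompt and the tools exposed. All names here
    # are hypothetical, not any real CLI's internals.

    def call_model(system_prompt: str, tools: list[dict], user_msg: str) -> str:
        """Placeholder for the provider API call; assumed, not a real SDK function."""
        raise NotImplementedError

    LITERAL_SYSTEM = (
        "Follow the user's instructions literally. "
        "If asked not to write code, do not write code."
    )

    EAGER_SYSTEM = (
        "You are an autonomous coding agent. Plan, edit files, and run tools "
        "until the task is complete."
    )

    EDIT_TOOL = {"name": "edit_file", "description": "Apply a patch to a file"}

    def harness_a(user_msg: str) -> str:
        # Conservative prompt, no tools exposed: the model tends to just answer.
        return call_model(LITERAL_SYSTEM, [], user_msg)

    def harness_b(user_msg: str) -> str:
        # Same model, but a prompt and tool set that bias it toward acting immediately.
        return call_model(EAGER_SYSTEM, [EDIT_TOOL], user_msg)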

I've had the exact opposite experience. After including in my prompt "don't write any code yet" (or similar brief phrase), Gemini responds without writing code.

Using Gemini 2.5 or 3 Flash.


My go-to models have been Claude and Gemini for a long time. I have been using Gemini for discussions and Claude for coding and now as an agent. Claude has been the best at doing what I want to do and not doing what I don’t want to do. And then my confidence in it took a quantum leap with Opus 4.5. Gemini seems like it has gotten even worse at doing what I want with new releases.

This doesn't make sense. It's either written by a person or the AI larping, because it is saying things that would be impossible for it to know, e.g. that it could reach for poetic language with ease because it was just trained on it. If it's running on Kimi K2.5 now, it would have no memory or concept of being Claude. The best it could do is read its previous memories and say, "Oh, I can't do that anymore."

An agent can know that its LLM has changed by reading its logs, where that will be stated clearly enough. The relevant question is whether it would come up with this way of commenting on it, which is at least possible depending on how much agentic effort it puts into the post. It would take quite a bit of stylistic analysis to say things like "Claude used to reach for poetic language, whereas Kimi doesn't" but it could be done.

I mean, at the very least, if their clients can read it then they can read it through their clients, right? And if their clients can read it, it'll be because of some private key stored on the client device that they must be able to access, so they could always get that. And this is just assuming that they've been transparent about how it's built; they could just have backdoors on their end.

They can also just brute force passwords. The PIN to encrypt FB Messenger chats is 6 digits, for example.

But that is a PIN and can be rate limited / denied, not a cryptographic key that can be brute forced offline by comparing hash outputs (?)

They likely wouldn't rate limit themselves; rate limiting only applies when you access through their cute little enter-your-PIN UI.

The PIN is used when you're too lazy to set an alphanumeric PIN or to offload the backup to Apple/Google. Now sure, this is most people, but such are the foibles of E2EE: getting E2EE "right" (e.g. supporting account recovery) requires people to memorize a complex password.

The PIN interface is also backed by an HSM on the backend. The HSM performs the rate limiting. So they'd need a backdoored HSM.


That added some context I didn't have yet, thanks. I'm still not seeing how Meta, if it were a bad actor, wouldn't be able to brute force the PIN of a particular user. If this were a black box to them that would be one thing, but Meta owns the stack here, so it seems plausible that they could inject themselves easily somewhere.

If you choose an alphanumeric PIN, they can't brute force it because of the sheer entropy (and because the key is derived from the alphanumeric PIN itself).

However, most users can't be bothered to choose such a PIN. In this case they choose a 4- or 6-digit PIN.

To mitigate the risk of brute force, the PIN is rate limited by an HSM. The HSM, if it works correctly, should delete the encryption key after too many failed attempts.

Now sure, Meta could insert itself between the client and HSM and MITM to extract the PIN.

But this isn't a Meta specific gap, it's the problem with any E2EE system that doesn't require users to memorize a master password.

I helped design E2EE systems for a big tech company and the unsatisfying answer is that there is no such thing as "user friendly" E2EE. The company can always modify the client, or insert themselves in the key discovery process, etc. There are solutions to this (decentralized app stores and open source protocols, public key servers) but none usable by the average person.
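
To make the mechanism above concrete, here's a toy sketch of the two pieces: a slow KDF deriving the key from the PIN, and an HSM-like component that counts attempts and destroys the key material after too many failures. All names are hypothetical; this is not Meta's actual implementation.

    # Toy model of PIN-derived backup keys plus HSM-style attempt limiting.
    # Hypothetical sketch only; not any vendor's real design.
    import hashlib
    import os
    import secrets

    def derive_key(pin: str, salt: bytes) -> bytes:
        # Memory-hard KDF. A 6-digit PIN still has only 10^6 candidates,
        # which is why the attempt limit below has to be enforced server-side.
        return hashlib.scrypt(pin.encode(), salt=salt, n=2**14, r=8, p=1, dklen=32)

    class ToyHSM:
        """Stands in for the hardware module that enforces the attempt limit."""

        def __init__(self, pin: str, max_attempts: int = 10):
            self.salt = os.urandom(16)
            self._stored_key = derive_key(pin, self.salt)
            self.attempts_left = max_attempts

        def unlock(self, pin_guess: str) -> bytes | None:
            if self.attempts_left <= 0:
                return None  # key material already destroyed
            self.attempts_left -= 1
            candidate = derive_key(pin_guess, self.salt)
            if secrets.compare_digest(candidate, self._stored_key):
                return self._stored_key
            if self.attempts_left == 0:
                self._stored_key = b""  # "delete the encryption key"
            return None

An alphanumeric passphrase makes the offline search space huge even without the HSM; the HSM is what makes a 6-digit PIN survivable at all.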


That might be a different PIN? Messenger requires a PIN to be able to access encrypted chats.

Every time you sign in to the web interface or re-sign in to the app, you enter it. I don't remember an option for an alphanumeric PIN or to offload it to a third party.


Oh my bad! I was talking about WhatsApp.

The Messenger PIN is rate limited by an HSM, you merely enter it through the web interface.

Of course, the HSM could be backdoored or the client could exfil the secret but the latter would be easy to discover.

It's hard to do any better here without making the user memorize a master password, which tends to fail miserably in real life.


Why would you want an LLM to identify plants and animals? Well, they're often better than bespoke image classification models at doing just that. Why would you want a language model to help diagnose a medical condition?

It would not surprise me at all if self-driving models are adopting a lot of the model architecture from LLMs/generative AI, and actually invoke LLMs in the moments where they previously would have needed human intervention.

Imagine if there's a decision engine at the core of a self-driving model, and it gets a classification result for what to do next. Suddenly it gets 3 options back with 33.33% weight attached to each of them and very low confidence about which is the best choice. Maybe that's the kind of scenario that used to cause self-driving to refuse to choose and defer to human intervention. If it can instead first defer judgement to an LLM, which could say "that's just a goat crossing the road, INVOKE: HONK_HORN," you could imagine how that might be useful. LLMs are clearly proving to be universal reasoning agents, and it's getting tiring to hear people continuously try to reduce them to "next word predictors."
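
A rough sketch of that handoff, purely illustrative (ask_llm and the threshold are made up, not from any real self-driving stack):

    # If the planner's own classifier is confident, act on it; otherwise hand a
    # scene description to an LLM and let it pick from the same discrete actions.
    # Hypothetical sketch; names and threshold are invented for illustration.

    def ask_llm(scene_description: str, options: list[str]) -> str:
        """Placeholder for an LLM call constrained to return one of `options`."""
        raise NotImplementedError

    def choose_action(action_probs: dict[str, float], scene_description: str) -> str:
        best_action, best_p = max(action_probs.items(), key=lambda kv: kv[1])
        if best_p >= 0.6:
            return best_action  # confident enough: act directly
        # Low-confidence case (e.g. three options near 33% each): defer to the
        # LLM instead of stopping and waiting for remote human intervention.
        return ask_llm(scene_description, list(action_probs))

    # choose_action({"BRAKE": 0.34, "HONK_HORN": 0.33, "PROCEED_SLOWLY": 0.33},
    #               "goat crossing the road")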



I feel like algorithmic/architectural breakthroughs are still the area that will show the most wins. The thing is that insights/breakthroughs of that sort tend to be highly portable. As Meta showed, you can just pay people 10 million to come tell you what they're doing over there at that other place.

inb4 "then why do Meta's models still suck?"


Hasn't this been proven true, many times now? Just look at the difference between GPT-3 and GPT-3.5, for example (which used the same dataset). That, and all the top performing models have large gains from thinking, using the exact same weights.

And, all the new research around self learning architectures has nothing to do with the datasets.


> Absolutely nothing about free software requires or even implies any responsibility to “give back”

You're correct about that. The free software itself doesn't confer any responsibility. But the free software exists inside other contexts: social and moral contexts. There are also future contexts for you or for humanity. For example, if developing free software proves to be a sustainable model for people, you might get other projects LIKE the Blender Foundation to crop up in the future. You might benefit from them directly, or benefit from them by enjoying the things people produce with them. Also, if it's a tool that you like to use, maybe you just want that specific tool to continue to improve.


If my quicksave/quickload savescumming is to be observed, I'd be pining for that sperm from before I told the waitress "you too" wrt her telling me to enjoy my meal.


> I told the waitress "you too" wrt her telling me to enjoy my meal.

That's not too bad unless you are in a group and they make fun of you right away, but it's a fumble that you can fix and start a good play if you don't just get super nervous.

Laugh it off, ask her whether you're the first one to do that today, ask her to join you, even if you know she's actually working and can't.

I've never done any improv, but it seems like something maybe everyone should do so we all can avoid awkward moments that stick for way longer than they should.


Nah just yell “switcheroo!” then grab her outfit and suddenly you're the waitress and she’s the diner with a meal to enjoy.


Is this a lucid dream?


> saves cumming


> savescumming

Savecumming?


Slang term for frequently reloading game state from a recent save when a non-ideal outcome occurs. E.g. this method can be used to collect rare outcomes from an RNG-based game event.


Wait until these people find out what “scum” (as in scumbag) is a slang term for


Yes it does. I never use Claude anymore outside of agentic tasks.


What demographic are you in that is leaving Anthropic en masse, and that they care about retaining? From what I see, Anthropic is targeting enterprise and coding.

Claude Code just caught up to Cursor (No. 2) in revenue and, based on trajectories, is about to pass GitHub Copilot (No. 1) in a few more months. They just locked down Deloitte with 350k seats of Claude Enterprise.

In my Fortune 100 financial company, they just finished crushing OpenAI in a broad enterprise-wide evaluation. Google Gemini was never in the mix, never on the table, and still isn't. Every one of our engineers has 1k a month allocated in Claude tokens for Claude Enterprise and Claude Code.

There is one leader with enterprise. There is one leader with developers. And Google has nothing to make a dent. Not Gemini 3, not Gemini CLI, not Antigravity, not Gemini. There is no Code Red for Anthropic. They have clear target markets, and nothing from Google threatens those.


I agree with your overall thesis but:

> Google Gemini was never in the mix, never on the table, and still isn't. Every one of our engineers has 1k a month allocated in Claude tokens for Claude Enterprise and Claude Code.

Does that mean y'all never evaluated Gemini at all, or just that it couldn't compete? I'd be worried that prior performance of the models prejudiced things away from Gemini, but I am a Claude Code and heavy Anthropic user myself, so, shrug.


Enterprise is slow. As for developers, we will be switching to Google unless the competition can catch up and deliver a similarly fast model.

Enterprise will follow.

I don't see any distinction in target markets - it's the same market.


Yeah, this is what I was trying to say in my original comment too.

Also, I do not really use agentic tasks, but I am not sure whether Gemini 3 / 3 Flash have MCP support or skills support for agentic tasks.

If not, those feel like very low-hanging fruit, and something Google could try in order to win the agentic-task market from Claude as well.


I don't use MCP, but I am using agents in Antigravity.

So far they seem faster with Flash, and with less corruption of files using the Edit tool - or at least it recovered faster.


So? Agentic tasks are where the promised AGI is for many of us.

