LinkedIn - it takes you to the allow/deny page but doesn't automate things. It used to be that the LinkedIn login would get stuck in a cycle around this, but now it just dumps you on to the consent page.
Indeed, that's what I kind of hinted at in https://news.ycombinator.com/item?id=46442195 and, coincidentally, in https://news.ycombinator.com/item?id=46437688 shortly after: OK, one can "generate" a "solution", which is much easier than before... but until we can somehow verify that it actually does what it says it does (and we know about hallucinations and have no reason to believe that has changed), testing itself, especially of well-known "problems", becomes more and more important.
That being said, it doesn't answer the "why" in the first place, which is an even more important question. At least it does help somewhat when comparing with existing alternatives.
Folks think, they write code, they do their own localized evaluation and testing, then they commit and then the rest of the (down|up)stream process begins.
LLMs skip over the "actually verify that the code I just wrote does what I intended it to" step. Granted, most humans don't do this step as thoroughly and carefully as would be desirable (sometimes through laziness, sometimes because of a belief in (down|up)stream testing processes). But LLMs don't do it at all.
They absolutely can do that if you give them the tools. Seeing Claude (I use it with opencode agents) run curl and playwright to verify and then fix its implementation was a real 'wow' moment for me.
We have different experiences. Often I'll see Claude et al. find creative ways to fulfill the task without satisfying my intent, e.g., changing the implementation plan I specifically asked for, changing tolerances or even tests, and frequently disabling tests.
Yeah, I feel that. When it happens, your only way out is to write down a more extensive implementation plan first. For me that's the point where I start regretting having tried to implement something using AI... But admittedly, most of the time revising the implementation plan and running the agent again is still faster than I could have done on my own (I try to make implementation tasks explicit in the form of a markdown file, which has worked pretty well so far).
I see these "you had a different experience than me" comments around AI coding agents a lot and can concur; I'll have a different experience with Copilot even from day to day: sometimes it's great, and other days it's so bad I give up on using it at all.
Honestly makes me wonder: will AGI just give us agents that get into bad moods and don't want to work for the day because they're tired or just don't feel like it?
Don’t downvote because you don’t like the question.
It obviously adds to the discussion: paid and non-paid accounts are being conflated daily in threads like these!
They’re not the same tier account!
Free users, especially ones deemed less interesting to learn from for the future, are given table-scraps when they feel it’s necessary for load reasons.
Exactly. There's an impedance mismatch between those using the free/cheap tiers and those paying a premium, so the discussion gets squirrely because one side is talking about apples and the other oranges.
> LLM's skip over the "actually verify that the code I just wrote does what I intended it to" step.
I'm not sure where this idea comes from. Just instruct it to write and run unit tests and document as it goes. All of the ones I've used will happily do so.
You still have to verify that the unit tests are valid, but that's still far less work than skipping them or writing the code/tests yourself.
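To make "verify the tests are valid" concrete, here's a minimal sketch of the kind of generated test you'd still review yourself (node:test syntax; parseAmount, its module path, and the expected value are all hypothetical). The test can run and pass while still encoding the wrong intent, and only a human review catches that.

```typescript
// Hypothetical generated test: it runs and passes, but a reviewer still has to
// confirm the expected value matches the actual requirement, not just the code.
import { test } from "node:test";
import assert from "node:assert/strict";

import { parseAmount } from "./amounts"; // hypothetical module under test

test("parses a formatted dollar amount", () => {
  // Is 1234.5 really the intended result for "$1,234.50"? That's the human's call.
  assert.equal(parseAmount("$1,234.50"), 1234.5);
});
```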
I disagree that it's less work. It just rewrites tests carte blanche. I've seen it rewrite and rewrite tests to the point of undermining the original test intention. So now, instead of intentionally writing code and a new unit test, I need to intentionally go and review EVERY unit test it touched. Every. Time.
It also doesn't necessarily rewrite documentation as the implementation changes. I've seen documentation rot happen within the same coding session.
One commercial equivalent to the project I work on, called ProTools (a DAW), has a test "harness" that took 6 people more than a year to write and takes more than a week to execute.
Last month, I made a minor change to our own code and verified that it worked (it did!). Earlier this week, I was notified of an entirely different workflow that had been broken by the change I had made. The only sort of automated testing that would have detected this would have been similar in scope and scale to the ProTools test harness, and neither an individual human nor an LLM is going to run that.
Moreover, that workflow was entirely graphically based, so unless Claude Opus 4.5 or whatever today's flavor of vibe coding LLM agent is has access to a testing system that allows it to inject mouse events into a running instance of our application (hint: it does not), there's no way it could run an effective test for this sort of code change.
I have no doubt that Claude et al. can verify that their carefully defined module does the very limited task it is supposed to do, for cases where "carefully defined" and "very limited" are appropriate. If that's the only sort of coding you do, I am sorry for your loss.
> access to a testing system that allows it to inject mouse events into a running instance of our application
FWIW that's precisely what https://pptr.dev is all about. To your broader point, though, designing a good harness itself remains very challenging and requires actually understanding what the value to the user is, the software architecture (e.g. to bypass user interaction and test the API first), etc.
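To make that concrete, here's a minimal Puppeteer sketch (the URL, selectors, and expected status text are all hypothetical) that drives a page with synthetic mouse events and then asserts on the visible result:

```typescript
// Minimal sketch: driving a page's UI with synthetic mouse events via Puppeteer.
// The URL, selectors, coordinates, and expected text are hypothetical placeholders.
import puppeteer from "puppeteer";

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto("http://localhost:3000"); // hypothetical local dev server

  // Click a control by selector, or inject raw mouse events at coordinates (e.g. a drag).
  await page.click("#export-button"); // hypothetical element id
  await page.mouse.move(200, 150);
  await page.mouse.down();
  await page.mouse.move(400, 150);
  await page.mouse.up();

  // Assert on some visible result of the interaction.
  const status = await page.$eval("#status", el => el.textContent);
  if (status !== "Export complete") {
    throw new Error(`unexpected status: ${status}`);
  }

  await browser.close();
})();
```

The hard part, as noted above, isn't injecting the events; it's deciding which interactions and assertions actually capture value for the user.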
No I was sharing an example of a framework that does include "a testing system that allows it to inject mouse events".
That being said, injecting mouse events and the like isn't hard to do: e.g. start with a fixed resolution (using xrandr), then drive the UI with xdotool or similar (rough sketch below). Ideally, if the application exposes accessibility features, it won't be as finicky.
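For a native (non-browser) app on Linux, a rough sketch along those lines, assuming an X11 session with xrandr and xdotool installed (the output name, window title, and coordinates are placeholders):

```typescript
// Minimal sketch, assuming an X11 session with xrandr and xdotool installed.
// The output name, window title, and coordinates below are hypothetical.
import { execSync } from "node:child_process";

// Fix the display resolution so recorded coordinates stay stable between runs.
execSync("xrandr --output HDMI-1 --mode 1920x1080"); // output name is an assumption

// Focus the application window and inject a click at a known position.
execSync('xdotool search --name "MyDAW" windowactivate'); // hypothetical window title
execSync("xdotool mousemove 640 360 click 1");

// Type into whatever control now has focus.
execSync('xdotool type "hello"');
```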
My point, though, was just to show that GUI testing is not infeasible.
Apparently there is even a "UI Testing for devs & agents" product, https://www.chromatic.com, which I found via Visual TDD: https://www.chromatic.com/blog/visual-test-driven-developmen... I can't recommend it personally, but it does show that even though the person I was replying to can't use Puppeteer in their context, the tooling does exist and the principles still apply.
> My point, though, was just to show that GUI testing is not infeasible.
Indeed, which is why I mentioned the ProTools test harness and the fact that it took 6 people a year to write and takes a week to run (or took a week, at some point in the past; it might be more or less now).
With the NES there are all sorts of weird edge cases, some of which involve NMI flags and resets; the PPU in general is kinda tricky to get right. Claude has had *massive* issues with this, and I've had to take control and completely throw out code it's generated. I'm restarting it with a clean slate though, as there are still issues with some of the underlying abstractions. The PPU is still the bane of my existence, as is DMA; I don't like the instruction pipeline, and I haven't even gotten to the APU. It's getting an 80/130 on accuracy coin.
Though, when it came to creating a WASM target, Claude was largely able to do it with minimal input on my end. Actually, getting the WASM emulator running in the browser was the least painful part of this project.
You will run into three problems:

1) "The Wall": once any project becomes large enough, you need the context window to be *very* specific and scoped, with explicit details of what is expected, the success criteria, and the deliverables.

2) Ambiguity means Claude will choose the path of least resistance, and will pedantically avoid/add things which are not specced. Stubs for functions, "beyond scope", and "deferred" are some favorite excuses for not refactoring or fixing obvious issues (anything that would go beyond the context window will be punted work; Claude knows this but won't tell you).

3) Chatbots *loooove* to talk; it will vomit code for days. Removing code/documentation is anathema to Claude, with "backward compatibility", "deprecated", and "legacy" being its favorite excuses.
This sounds exhausting. Once the thrill of seeing code rapidly generated wears off, I wonder if it's even worth it. If someone is going to use code they didn't write, why not just pull down some open-source implementation from somewhere and build on top of it? It basically gets you the same thing, but without the LLM hassles, and you can start building on a saner foundation.
They've all started cracking down; in the past year the Barclays and Lloyds apps have broken on my phone.
TSB still works for now, but even for a bank they're technologically incompetent so I'm going to just assume they're behind the curve rather than willingly not using SafetyNet.
The only one I would bank on still working in the future is Monzo, since, like you say, they detect it, just give you a scary warning, and let you continue.
Barclays have always played silly games with this stuff, they used to fund a whole team whose job it was to waste time on security theatre (this was nearly ten years ago).
I have this set as my OS default and also forced for all webpages; I just find it so clear and easy to read. On the occasions when I have to browse the web without it, I don't struggle per se, but I definitely find that I have to read more slowly, and I find myself rereading words more often.
> Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?
I think people misunderstand this quote. Cleverness in this context refers to complexity, and generally stems from falling in love with some complex mechanism you dream up to solve a problem rather than challenging yourself to create something simpler and easier to maintain. Bolting together bits of LLM-created code is far more likely to be “clever” rather than good.
Juniors grow into mids, and eventually into seniors. OSS contributors eventually learn the codebase, you talk to them, you all get invested in the shared success of the project, and sometimes you even become friends.
For me, personally, I just don't see the point of putting that same effort into a machine. It won't learn or grow from the corrections I make in that PR, so why bother? I might as well have written it myself and saved the merge review headache.
Maybe one day it'll reach perfect parity of what I could've written myself, but today isn't that day.
I wonder if that difference in mentality is a large part of the pro- vs anti-AI debate.
To me the AI is a very smart tool, not a very dumb co-worker. When I use the tool, my goal is for _me_ to learn from _its_ mistakes, so I can get better at using the tool. Code I produce using an AI tool is my code. I don't produce it by directly writing it, but my techniques guide the tool through the generation process and I am responsible for the fitness and quality of the resulting code.
I accept that the tool doesn't learn like a human, just like I accept that my IDE or a screwdriver doesn't learn like a human. But I myself can improve the performance of the AI coding by developing my own skills through usage and then applying those skills.
> It won't learn or grow from the corrections I make in that PR, so why bother?
That does not match my experience. As the codebases I've worked on with LLMs become more opinionated and stylized, they seem to do a better job of following the existing work. And over time the models have absolutely improved in terms of their ability to understand issues and offer solutions. Each new release has solved problems for me that the previous ones struggled with.
Re: interpersonal interactions, I don't find that the LLM has pushed them out or away. My projects still have groups of interested folk who talk and joke and learn and have fun. What the LLMs have addressed for me in part is the relative scarcity of labor for such work. I'm not hacking on the Linux Kernel with 10,000 contributors. Even with a dozen contributors, the amount of contributed code is relatively low and only in areas they are interested in. The LLM doesn't mind if I ask it to do something super boring. And it's been surprisingly helpful in chasing down bugs.
> Maybe one day it'll reach perfect parity of what I could've written myself, but today isn't that day.
Regardless of whether or not that happens, they've already been useful for me for at least 9 months. Since O3, which is the first one that really started to understand Rust's borrow checker in my experience. My measure isn't whether or not it writes code as well as I do, but how productive I am when working with it compared to not. In my measurements with SLOCCount over the last 9 months, I'm about 8x more productive than the previous 15 years without (as long as I've been measuring). And that's allowed me to get to projects which have been on the shelf for years.