Your entire argument is derived from a pseudoscientific field without any peer-reviewed research. Mechanistic interpretability is a joke invented by AI firms to sell chatbots.
Lol that's a stupid ass response, especially when half the papers are from universities in China. You think Chinese universities are trying to sell ChatGPT subscriptions? Ridiculous. You're just falling behind in tech knowledge.
And apparently you think peer-reviewed papers presented at NeurIPS and other conferences are pseudoscience. (For the people not versed in ML: NeurIPS is where the 2017 paper "Attention Is All You Need", which started the modern ML revolution, was presented.)
(Not GP) There was a well-recognized reproducibility problem in the ML field before LLM-mania, and that's considering published papers with proper peer review. The current state of affairs is in some ways even less rigorous than that, and then some people in the field feel free to overextend their conclusions into other fields like neuroscience.
We're in the "mad science" regime because at the current speed of progress, adding rigor would sacrifice velocity. Preprints are the lifeblood of the field because they can be put out earlier and start contributing sooner.
Anthropic, much as you hate them, has some of the best mechanistic interpretability researchers and AI wranglers across the entire industry. When they find things, they find things. Your "not scientifically rigorous" is just a flimsy excuse to dismiss the findings that make you deeply uncomfortable.
Did you just invent a nonsense fallacy to use as a bludgeon here? "Stochastic parrot fallacy" does not exist, and there's actually quite a bit of evidence supporting the stochastic parrot hypothesis.
I imagine "stochastic parrot fallacy" could be their term for using the hypothesis to dismiss LLMs even where they can be useful; i.e., dismissing them for their weaknesses alone and ignoring their strengths. (Of course, we have no way to know for sure without their input.)
Oh, please. There’s always a way to blame the user; it’s a catch-22. The fact is that coding agents aren’t perfect, and it’s quite common for them to fail. Refer to the recent C-compiler nonsense Anthropic tried to pull for proof.
It fails far less often than I do at the cookie cutter parts of my job, and it’s much faster and cheaper than I am.
Being honest, I probably have to write some properly clever code or do some actual design as a dev lead like… 2% of my time? At most? At the rest of the code-related work I do, it’s outperforming me.
Now, maybe you’re somehow different to me, but I find it hard to believe that the majority of devs out there are balancing binary trees and coming up with shithot unique algorithms all day, rather than mangling some formatting, improving db performance, picking the right pattern for some backend, and so on, day to day.
That’s a devastating benchmark design flaw. Sick of these bullshit benchmarks designed solely to hype AI. AI boosters turn around and use them as ammo, despite not understanding them.
Relax. Anyone who's genuinely interested in the question will see with a few searches that LLMs can play chess fine, although the post-trained models mostly seem to have regressed. The problem is that people are more interested in validating their own assumptions than anything else.
This exact game has been played 60 thousand times on lichess. The piece sacrifice Grok performed on move 6 has been played 5 million times on lichess. Every single move Grok made is also the most-played move on lichess.
This reminds me of Stefan Zweig’s The Royal Game, where the protagonist survives Nazi captivity by memorizing every game in a chess book his captors dropped (excellent book btw; I am aware I just invoked Godwin’s law, and aware of the irony of doing so in this thread). The protagonist became “good” at chess simply by memorizing a lot of games.