I've found that to be accurate when asking it questions that require ~PhD level knowledge to answer. e.g. Gemini and ChatGPT both seem to be capable of answering questions I have as I work through a set of notes on algebraic geometry.
Its performance on riddles has always seemed mostly irrelevant to me. Want to know if models can program? Ask them to program, and give them access to a compiler (they can now).
Want to know if it can handle PhD-level questions? Ask it questions a PhD (or at least a grad student) would ask it.
They also reflect the tone and knowledge of the user and question. Ask it about your cat's astrological sign and you get emojis and short sentences in list form. Ask it why large atoms are unstable and you get paragraphs with larger vocabulary. Use jargon and it becomes more of an expert. etc.
I don't know about algebraic geometry, but AI is absolutely terrible at communications and social sciences. I know because I can tell when my postgraduate students use it.
Are you sure? What about when you use it? e.g. I suppose asking it to critique experimental design and analytical methodology, or identify potential confounders and future areas to explore, or help summarize nearby research, etc.
If you can tell when your students use it, presumably you mean they're just copying whatever, which just sounds like that student doesn't know what they're doing or is being lazy. That doesn't mean the model isn't capable; it means an incapable person won't know what they'd want to ask of it.
Additionally, even for similar prompts, my experience is that the models for professional use (e.g. gpt-codex) take on a much more professional tone and level of pragmatism (e.g. no sycophancy) than models for general consumer entertainment use (e.g. chatgpt).
I use AI for coding, but not for anything involving writing text; it's just horrendous at it. It just spews verbose slop, devoid of meaning, original thought or nuanced critique.
> That doesn't mean the model isn't capable; it means an incapable person won't know what they'd want to ask of it.
So it's user error again then, eh? PhD experts are able to help even "incapable" students, that's often a big part of their job.
Weird, my experience is that they are full of nuance. e.g. here is a snippet of my discussion with Gemini:
> Would you like to see why Q is "flexible" (Flat) while Z/2 is "rigid" (Not Flat) using this "crushing" vs. "preserving" logic? It explains why localized rings are almost always better to work with.
> Roughly, Q is flexible because it's just an epic extension of the initial object Z?
> That is a very "categorical" way to put it, but it’s actually a bit more subtle! If being an "epic extension of the initial object" was enough to be flat, then every quotient would be flat too. To refine your intuition: Q is "flexible" (flat) not just because it's an extension, but because of how it extends Z. Z/2 is a Quotient: It adds a constraint (2=0). Constraints are "rigid." As we saw, if you multiply by 2, everything collapses to zero. That's a "hidden kernel," which breaks left exactness. Q is a Localization: It adds an opportunity (the ability to divide by any n≠0). This is the definition of "flexibility."
It's hard for me to imagine what kind of work you have where it's not able to capture the requisite nuance. Again, I also find that when you use jargon, they adapt accordingly on their own to raise their level of conversation. They also seem to no longer have an issue with saying "yep exactly!" or "ehh not quite" (and provide counterarguments) as necessary.
Obviously if someone just says "write my paper" or whatever and gives that to you, that won't work well. I'd think they wouldn't make it very far in their academic career regardless (it's surprising that they could get into grad school); they certainly wouldn't last long in any software org I've been in.
Why would you not have the bot write in a formal language (e.g. Lean) and then just typecheck it? Then you only need to decide that your definitions were interesting. LLMs are now extremely good at programming given a compiler/typechecker, so I'd expect them to be good at formal math as well. It's nearly (if not precisely) the exact same activity.
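To make that concrete (a toy sketch of my own, not anything from this thread): the model can emit whatever it likes, but the claim only "counts" once Lean's kernel accepts it, e.g.

    -- toy example: this only lands if the typechecker agrees
    theorem add_comm_example (a b : Nat) : a + b = b + a :=
      Nat.add_comm a b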
Wouldn't a referendum to limit immigration be the way to reveal their preference? Obviously immigrants would tautologically prefer to move there. How is a citizen to "vote" against that via the market? Discriminate and refuse to rent/sell to any immigrants? Charge them more to try to offset their perceived loss of utility? What portion of the country is even in a position to be asked the question via the market?
Again, how is money supposed to measure value here? Are people supposed to look into whether every company interacts with immigrants in any way and then boycott them if they do? The only avenue I see is for people to look at the aggregate economic benefits of immigration and then decide to limit it anyway, effectively treating the opportunity cost as the price they're willing to pay.
Most of the people I know with money are difficult to convince to spend it. e.g. rich people don't buy designer bags; poor people do. My wife makes all of our food; we do delivery or go out to eat maybe once every year or two. We have no recurring subscriptions (other than utilities). Our phone bill is $20 for both of us. etc.
We also live in an area where outdoor ads are banned (which tends to be the case in wealthy areas IME), and I block ads on our computers, so we rarely encounter them. Consumerism is gauche.
Paid blogs/articles are worse than nothing. They are anti-information. If you did successfully eliminate those things, the currently niche places with honest discussion would be easier to find.
Google has a way of knowing. They can ask for documentation on who their customers are and what markets they operate in, and do some due diligence. Just like they have ways of knowing whether the ads they run are for blatant scams.
I'm not saying Google doesn't know if a company is in a particular market, I'm saying that a) Google doesn't know what market I'm searching for something from and b) even if they know both from context, it puts them in some awkward positions.
e.g. Vice Media has a trademark on "motherboard" that covers the tech news blog website service.
Is it now impossible for Asus to place an ad for the official Asus motherboard blog on the search term "motherboard"?
Is it legal to advertise for "motherboard" for any good or service other than a tech news blog website?
Is it now illegal to advertise a website featuring in-depth motherboard reviews using the term "motherboard"?
If I search for "motherboard website", what is Google allowed to show me for ads, given they don't know if I'm looking for the Vice website, or motherboard reviews, or the Asus homepage?
If a plain search for "motherboard" results in Vice's website not being in the top results, is Vice allowed to advertise on their own trademark to put it above other results? (Either above organic results, or above paid results for motherboard manufacturers, depending on whether you're allowing the latter.)
> Is it legal to advertise for "motherboard" for any good or service other than a tech news blog website?
Roughly speaking (modulo dilution which doesn't seem like it'd apply here), that's my understanding of trademark law. So your questions are all basically trivially answered, and those things are fine. A human should be able to review such cases.
HBF is NAND and integrated in-package like HBM. 3D XPoint or Optane would be extremely valuable today as part of the overall system architecture, but they were power-intensive enough that this particular use probably wouldn't be feasible.
(Though maybe it ends up being better if you're doing lots of random tiny 4k reads. It's hard to tell because the technology is discontinued as GP said, whereas NAND has kept progressing.)
Why do people have to make this stuff so complicated? An API that requires a key and enabling an MCP server and configuring your client to fetch markdown files on the fly? There's documentation on how to set things up to be able to get the documentation? Why not just a tar with all the docs? How big are they? A couple MB? Agents are really good at using grep on text files. So is my text editor.
Want it to be easy to update? Make it a git repo with all the docs. My agent already knows to always do a git fetch before interacting with a repo in a new session. Or you can fetch on a timer. Whatever.
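Something like this is the whole workflow I mean (URL and paths made up):

    # hypothetical: the vendor publishes its docs as a plain git repo of markdown
    git clone --depth 1 https://docs.example.com/vendor-docs.git ~/docs/vendor
    git -C ~/docs/vendor pull --ff-only     # the "fetch before using it" step
    grep -ril "rate limit" ~/docs/vendor    # agents (and editors) are good at this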
I haven't yet figured out the point of this MCP stuff. Codex seems to have innate knowledge of how to curl jira and confluence and gitlab and prometheus and SQL databases and more. All you need to configure is a .netrc file and put the hostname in AGENTS.md. Are MCP tools even composable? Like can the model pipe the response to grep or jq or another MCP call without it entering/wasting context? Or is a normal CRUD API strictly more powerful and easier to use?
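For reference, the entire "configuration" I'm describing is roughly this (hostnames and tokens made up):

    # ~/.netrc -- curl reads this when invoked with --netrc / -n
    machine jira.internal.example.com login me@example.com password <api-token>
    machine gitlab.example.com        login me              password <access-token>

plus a line in AGENTS.md listing which hostnames exist and telling it to use curl with --netrc.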
You don't even need to do git or a tarball! HTTP/HTML already has an "API" for serving Markdown to any agent like an LLM which wants it, because you can easily set a server to return Markdown based on the Accept header (content negotiation is kinda why that functionality exists in the first place).
I set my nginx to return the Markdown source (which is just $URL.md) for my website; any LLM which wants up-to-date docs from my website can do so as easily as `curl --header 'Accept: text/markdown' 'https://gwern.net/archiving'`. One simple flag. Boom, done.
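If you want to do the same, the idea is roughly this (a sketch of the approach, not my exact config):

    # sketch: when the client sends Accept: text/markdown, try the $URL.md source first
    map $http_accept $md_suffix {
        default            "";
        "~text/markdown"   ".md";
    }
    server {
        # ...
        location / {
            try_files $uri$md_suffix $uri $uri/ =404;
        }
    }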
The point of MCP is discoverability. A CRUD app is better, except you have to waste context telling your LLM a bunch of details. With MCP you only put into its context what the circumstances are where it applies, and it can just invoke it. You could write a bunch of little wrapper scripts around each API you want to use and have basically reinvented MCP for yourself.
But this is entirely beside the point. The point of MCP is bundling those exact things into a standardized plugin that’s easy for people to share with others.
MCP is useful because I can add one in a single click for an external service (say, my CI provider). And it gives the provider some control over how the agent accesses resources (for example, more efficient/compressed, agent-oriented log retrieval vs the full log dump a human wants). And it can set up the auth token when you install it.
So yeah, the agent could write some of those queries manually (might need me to point it to the docs), and I could write helpers… or I could just one-click install the plugin and be done with it.
I don’t get why people get worked up over MCP; it’s just a (perhaps temporary) tool to help us get more context into agents in a more standard way than everyone writing a million different markdown files and helper scripts.
"The point of MCP is bundling those exact things into a standardized plugin that’s easy for people to share with others." Like... a CLI/API?
"MCP is useful because I can add one in a single click for an external service" Like... a CLI/API? [edit: sorry, not click, single 'uv' or 'brew' command]
"So yeah, the agent could write some those queries manually" Or, you could have a high-level CLI/API instead of a raw one?
"I don’t get why people get worked up over MCP" Because we tried them and got burned?
"to help us get more context into agents in a more standard way than everyone writing a million different markdown files and helper scripts." Agreed it's slightly annoying to add 'make sure to use this CLI/API for this purpose' in AGENTS.md but really not much. It's not a million markdown files tho. I think you're missing some existing pattern here.
Again, I fail to see how most MCPs are not lazy tools that could be well-scoped discoverable safe-to-use CLI/APIs.
That's literally what they are. It's a dead simple self describing JSONRPC API that you can understand if you spend 5 seconds looking at it. I don't get why people get so worked up over it as if it's some big over-engineered spec.
I can run an MCP on my local machine and connect it to an LLM FE in a browser.
I can use the GitHub MCP without installing anything on my machine at all.
I can run agents as root in a VM and give them access to things via an MCP running outside of the VM without giving them access to secrets.
It's an objectively better solution than just giving it CLIs.
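(If you've never looked: the self-description is basically a tools/list round trip. The tool below is made up and field names are from memory, but the shape really is about this simple.)

    -> {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}
    <- {"jsonrpc": "2.0", "id": 1, "result": {"tools": [
         {"name": "search_issues",
          "description": "Search issues by free-text query",
          "inputSchema": {"type": "object",
                          "properties": {"query": {"type": "string"}}}}]}}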
All true except that CLI tools are composable and don't pollute your context when run via a script. The missing link for MCP would be a CLI utility to invoke it.
How does the agent know what clis/tools it has available? If there's an `mcpcli --help` that dumps the tool calls, we've just moved the problem.
The composition argument is compelling though. Instead of clis though, what if the agent could write code where the tools are made available as functions?
> what if the agent could write code where the tools are made available as functions?
Exactly, that would be of great help.
> If there's an `mcpcli --help` that dumps the tool calls, we've just moved the problem.
I see I worded my comment completely wrong... My bad. Indeed MCP tool definitions should probably be in context. What I dislike about MCP is that the IO immediately goes into context for the AI Agents I've seen.
Example: Very early on when Cursor just received beta MCP support I tried a Google Maps MCP from somewhere on the net; asked Cursor "Find me boxing gyms in Amsterdam". The MCP call then dumped a HATEOAS-annotated massive JSON causing Cursor to run out of context immediately. If it had been a CLI tool instead, Cursor could have wrapped it in say a `jq` to keep the context clean(er).
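With the hypothetical `mcpcli` shim from up-thread, the agent could have filtered before anything touched context, something like (tool and field names made up):

    mcpcli call google-maps search_places --query "boxing gyms in Amsterdam" \
      | jq '[.places[] | {name, address, rating}]'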
I mean, what was keeping Cursor from running jq there? It's just a matter of being integrated poorly, which is largely why there was a rethink of "we just made this harder on ourselves, let's accomplish this with skills instead".
The last time I looked at MCPs closely, they appeared to pollute context and just hang there consuming context constantly. Whereas a self-documenting API or CLI tool enabled progressive discovery.
Has this changed?
My uncharitable interpretation is that MCP servers are NJ design for agents, and high quality APIs and CLIs are MIT design.
But at the end of the day, MCP is about making it easy/standard to pull in context from different sources. For example, to get logs from a CI run for my PR, or to look at jira tickets, or to interact with GitHub. Sure, a very simple API baked into the model’s existing context is even better (Claude will just use the GH CLI for lots of stuff, no MCP there).
MCP is literally just a way for end users to be able to quickly plug in to those ecosystems. Like, yeah, I could make some extra documentation about how to use my CI provider’s API, put an access token somewhere the agent can use… or I could just add the remote MCP and the agent has what it needs to figure out what the API looks like.
It also lets the provider (say, Jira) get some control over how models access your service instead of writing whatever API requests they feel like.
Like, MCP is really not that crazy. It’s just a somewhat standard way to make plugins for getting extra context. Sure, agents are good at writing API requests, but they’re not so good at knowing why, when, or what to use.
People get worked up over the word “protocol” like it has to mean some kind of super advanced and clever transport-layer technology, but I digress :p
You're making the convenience argument, but I'm making the architecture argument. They're not the same thing.
You say "a very simple API baked into the model's existing context is even better". So we agree? MCP's design actively discourages that better path.
"Agents are good at writing API requests, but not so good at knowing why, when, or what to use". This is exactly what progressive discovery solves. A good CLI has --help. A good API has introspection. MCP's answer is "dump all the tool schemas into context and let the model figure it out," which is O(N) context cost at all times vs O(1) until you actually need something.
"It's just a standard way to make plugins" The plugin pattern of "here are 47 tool descriptions, good luck" is exactly the worse-is-better tradeoff I'm describing. Easy to wire up, expensive at runtime, and it gets worse as you add more servers.
The NJ/MIT analogy isn't about complexity, it's about where the design effort goes. MCP puts the effort into easy integration. A well-designed API puts the effort into efficient discovery. One scales, the other doesn't.
I tried using the Microsoft Azure DevOps MCP and it immediately filled up 160k of my context window with what I can only assume was a listing of an absurd number of projects. Now I just instruct it to make direct API calls for the specific resources. I don’t know, maybe I’m doing something wrong in Cursor, or maybe Microsoft is just cranking out garbage (possible), but to get that context down I had to uncheck all the myriad features that the MCP supplies.
Sure, but Google isn't maintaining two sets of documentation here; the MCP server is just a thin wrapper around the webpage with a little search tool. So it's still the docs for humans, just with a different delivery mechanism. Which is fine, but you can understand why, when hypertext exists largely for this exact purpose, folks would find it odd and overcomplicated to reinvent the web over JSON-RPC for robots.
How do you manage auth for services that use OAuth?
I’ve been wrapping the agent’s curl calls in a small CLI that handles the auth, but I’m wondering if other people have come up with something lighter/more portable.
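For the curious, the shape I mean is roughly this sketch (`oauth-helper` is a made-up stand-in for whatever actually mints/refreshes the token):

    #!/bin/sh
    # authcurl: the agent runs `authcurl <url> ...`; token handling never enters its context
    TOKEN="$(oauth-helper token "$1")" || exit 1
    exec curl --silent --header "Authorization: Bearer $TOKEN" "$@"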
It doesn't even feel good. My recollection (having used a MacBook for work for a few years) is that they have aluminium shells with sharp edges that would irritate my wrists. That never happens with a soft plastic shell like a ThinkPad's.