ritz_labringue's comments | Hacker News

Well, it's really a VSCode extension that lets you run Codex CLI in the IDE. Not the "cloud" version of Codex... So GP is technically correct


The short answer is "batch size". These days, LLMs are what we call "Mixture of Experts", meaning they only activate a small subset of their weights at a time. This makes them a lot more efficient to run at high batch size.

If you try to run GPT-4 at home, you'll still need enough VRAM to load the entire model, which means you'll need several H100s (each one costs like $40k). But you will be under-utilizing those cards by a huge amount for personal use.

It's a bit like saying "How come Apple can make iPhones for billions of people but I can't even build a single one in my garage?"


> These days, LLMs are what we call "Mixture of Experts", meaning they only activate a small subset of their weights at a time. This makes them a lot more efficient to run at high batch size.

I don't really understand why you're trying to connect MoE and batching here. Your stated mechanism is not only incorrect but actually the wrong way around.

The efficiency of batching comes from balancing compute against memory bandwidth: you load a tile of parameters from VRAM into cache, apply those weights to all the batched requests, and only then load the next tile.

So batching only helps when multiple queries need the same weights at the same step. For dense models, that's always the case. For MoE it isn't, precisely because not all weights are activated for every token. Suddenly your batching becomes a complex scheduling problem, since the experts at a given layer won't all see the same load. Surely a solvable problem, but MoE is not the enabler of batching; it makes batching significantly harder.
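To make the amortization concrete, here's a toy numpy sketch (made-up shapes, purely to illustrate the arithmetic-intensity point):

    import numpy as np

    batch, d_in, d_out = 64, 4096, 4096
    x = np.random.randn(batch, d_in).astype(np.float32)  # 64 requests at the same layer
    W = np.random.randn(d_in, d_out).astype(np.float32)  # one weight "tile"

    # W is read from memory once but used for 64 requests' worth of FLOPs,
    # so the per-request weight traffic is 1/64th of what a batch of 1 pays.
    y = x @ W

With a dense model, every request in the batch hits W, so this amortization happens automatically. With MoE, only the requests routed to this particular expert show up in x, so the effective batch per weight tile shrinks.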


You’re right, I conflated two things. MoE improves compute efficiency per token (only a few experts run), but it doesn’t meaningfully reduce memory footprint.

For fast inference you typically keep all experts in memory (or shard them), so VRAM still scales with the total number of experts.

Practically, that’s why home setups are wasteful: you buy a GPU for its VRAM capacity, but MoE only exercises a fraction of that compute on each token, so some experts/devices sit idle (because you are the only one using the model).

MoE does not make batching more efficient, but it demands larger batches to maximize compute utilization and to amortize routing. Dense models batch trivially (same weights every token). MoE batches well once the batch is large enough so each expert has work. So the point isn’t that MoE makes batching better, but that MoE needs bigger batches to reach its best utilization.
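For what it's worth, the "needs bigger batches" part is easy to see with a toy routing simulation (hypothetical numbers, top-1 routing for simplicity):

    import numpy as np

    n_experts = 64
    rng = np.random.default_rng(0)

    for batch in (8, 4096):
        # Pretend the router sends each token to one expert, roughly uniformly.
        assignments = rng.integers(0, n_experts, size=batch)
        tokens_per_expert = np.bincount(assignments, minlength=n_experts)
        busy = int((tokens_per_expert > 0).sum())
        print(f"batch={batch}: {busy}/{n_experts} experts busy, "
              f"max {tokens_per_expert.max()} tokens on one expert")

With a batch of 8, at most 8 of the 64 experts have any work and each loaded expert serves only a token or two; with a batch of 4096, every expert gets roughly 64 tokens and its weights are amortized like a small dense matmul.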


I'm actually not sure I understand how MoE helps here. If you can route a single request to a specific subnetwork then yes, it saves compute for that request. But if you have a batch of 100 requests, unless they are all routed exactly the same, which feels unlikely, aren't you actually increasing the number of weights that need to be processed? (at least with respect to an individual request in the batch).


Essentially, inference is well-amortized across the many users.


I wonder then if it's possible to keep the rarely used experts in main memory and the frequently used ones in VRAM.
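That's roughly what some local-inference setups do: keep the always-used weights (attention, embeddings) plus the hottest experts in VRAM, leave the cold experts in system RAM, and pay a PCIe copy when a cold one is selected. A minimal PyTorch-flavored sketch, with made-up class and cache names (not any particular framework's API):

    import torch

    class ExpertCache:
        """Cold experts live in CPU RAM; a small cache keeps recently used ones in VRAM."""
        def __init__(self, experts_cpu: dict, max_gpu: int = 2):
            self.cpu = experts_cpu   # expert_id -> torch.nn.Module on CPU
            self.gpu = {}            # expert_id -> torch.nn.Module on CUDA
            self.max_gpu = max_gpu

        def get(self, expert_id):
            if expert_id not in self.gpu:
                if len(self.gpu) >= self.max_gpu:
                    evicted_id, evicted = self.gpu.popitem()  # crude eviction, not true LRU
                    self.cpu[evicted_id] = evicted.to("cpu")  # spill back to RAM
                # The host-to-device copy over PCIe happens here.
                self.gpu[expert_id] = self.cpu.pop(expert_id).to("cuda")
            return self.gpu[expert_id]

It works, but throughput is bounded by PCIe bandwidth whenever the working set of experts doesn't fit in VRAM, which is why it suits single-user, latency-tolerant setups better than serving.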


Great metaphor


I don't think I agree with the premise. Sure, there are lots of car accidents in absolute terms, but given how many people drive and how error-prone driving inherently is, most people are actually pretty decent drivers.


Driving is not error-prone; cars rarely break in unexpected ways.

People driving and making decisions are error prone.

A simple test is to watch how people turn. Do they turn early, potentially hitting the curb or cutting it too close to pedestrians? Or do they increase their radius by turning late? The latter are better drivers.

Edit: here are more tests,

- do they signal

- do they cut off others

- do they let those who signal in

- do they drive too slow or too fast for the given road and conditions

- do they have an awareness of all cars around them

- do they block the passing lane

- do they maintain a reasonable distance behind other cars

- do they let emergency vehicles pass

etc.


AI is really useful when you already know what code needs to be written. If you can explain it properly, the AI will write it faster than you can and you'll save time because it is quick to check that this is actually the code you wanted to write. So "programming with AI" means programming in your mind and then using the AI to materialize it in the codebase.


Well, kinda? I often know what chunks/functions I need, but I'm too lazy to think through exactly how to implement them, how they should work inside. Yeah, you need to have an overall idea of what you are trying to make.


Harry Potter action figures trade almost entirely on J. K. Rowling’s expressive choices. Every unlicensed toy competes head‑to‑head with the licensed one and slices off a share of a finite pot of fandom spending. Copyright law treats that as classic market substitution and rightfully lets the author police it.

Dropping the novels into a machine‑learning corpus is a fundamentally different act. The text is not being resold, and the resulting model is not advertised as “official Harry Potter.” The books are just statistical nutrition. One ingredient among millions. Much like a human writer who reads widely before producing new work. No consumer is choosing between “Rowling’s novel” and “the tokens her novel contributed to an LLM,” so there’s no comparable displacement of demand.

In economic terms, the merch market is rivalrous and zero‑sum; the training market is non‑rivalrous and produces no direct substitute good. That asymmetry is why copyright doctrine (and fair‑use case law) treats toy knock‑offs and corpus building very differently.


I very much agree, and I think people who are in denial about the usefulness of these tools are in for a bad time.

I've seen this firsthand multiple times: people who really don't want it to work will (unconsciously or not) sabotage themselves by writing vague prompts or withholding context/tips they'd naturally give a human colleague.

Then when the LLM inevitably fails, they get their "gotcha!" moment.


I think the people who are in denial about the uselessness of these tools are in for a bad time.

I've been playing with language models for seven years now. I've even trained them from scratch. I'm playing with aider and I use the chats.

I give them lots of context and ask specific questions about things I know. They always get things wrong in subtle ways that make me not trust them for things I don't know. Sometimes they can point me to real documentation.

gemma3:4b on my laptop with aider can merge a diff in about twenty minutes of 4070 GPU time. incredible technology. truly groundbreaking.

call me in ten years if they figure out how to scale these things without just adding 10x compute for each 1x improvement.

I mean hell, the big improvements over the last year aren't even to do with learning. Agents are just systems code. RAG is better prompting. System prompts are just added context. call me when GPT 5 drops, and isn't an incremental improvement


"gemma3:4b on my laptop with aider"

Found the problem!


It does require writing good instructions for the LLM to properly use the tables, and it works best if you carefully pick the tables your agent is allowed to use beforehand. We have many users who use it for everyday work with real data (definitely not toy problems).


If only we had a language to accurately describe what we want to retrieve from the database! Alas, one can only dream!


> It does require writing good instructions for the LLM to properly use the tables

--- start quote ---

prompt engineering is nothing but an attempt to reverse-engineer a non-deterministic black box for which any of the parameters below are unknown:

- training set

- weights

- constraints on the model

- layers between you and the model that transform both your input and the model's output that can change at any time

- availability of compute for your specific query

- and definitely some more details I haven't thought of

https://dmitriid.com/prompting-llms-is-not-engineering

--- end quote ---


What else is engineering then if not taming the unknown and the unknowable? How is building a bridge any different? Do you know everything in advance about the composition of terrain, the traffic, the wind and the earthquakes? Or are you making educated guesses about unknown quantities to get something that fits into some parameters that are Good Enough(TM) for the given purpose?


> and the unknowable

This is the crux. Sure, for high-level software (e.g. web apps), many parts of the system will feel like black boxes, but low-level software does not generally have this problem. Sure, sometimes you have to deal with a binary blob driver, but more often than not you're in control of, and able to debug, most or all of the software running on your system.

> Building a bridge

There should NOT be significant unknowns when you're building a bridge; that is how people die. You turn those parameters into "knowns with high confidence", which is not something you can even begin to do for the LLM parameters described above.


> How is building a bridge any different?

In absolutely every way that matters and in all the details that don't matter.

> Do you know everything in advance about the composition of terrain, the traffic, the wind and the earthquakes?

No, and there are established procedures for determining those facts.

"This magical incantation that I pretend works better because the US is asleep and more compute is available" is not such a procedure.


Yes, you are perfectly right. Our product pushes users to be selective about the tables they give a given agent access to for a given use case :+1:

The tricky part is correctly supporting multiple systems, each with its own quirks. All the way to Salesforce, which is an entirely different beast in terms of query language. We're working on it right now and will likely follow up with a blog post there :+1:


Salesforce architect here (from a partner firm, not the mothership directly). Salesforce's query language, SOQL, is definitely a different beast, as you say. I'd like to learn more about the issues you're having with the integration, specifically the permissions enforcement. I may be misunderstanding what you meant in the blog post, but if you're passing a SOQL query through the REST API, the results will be scoped by default to the permissions of the user that went through the OAuth flow. My email is in my profile if you're open to connecting.
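For reference, the kind of call being described looks roughly like this (Python sketch; the instance URL, API version, and token are placeholders):

    import requests

    INSTANCE = "https://yourInstance.my.salesforce.com"  # placeholder
    TOKEN = "access-token-from-the-oauth-flow"           # placeholder

    resp = requests.get(
        f"{INSTANCE}/services/data/v59.0/query",
        params={"q": "SELECT Id, Name FROM Account LIMIT 10"},
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    resp.raise_for_status()
    # The records returned are limited to what the OAuth user can see:
    # object- and field-level security and sharing rules are enforced server-side.
    print(resp.json()["records"])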


It's not on par with o1, let alone o1-pro


It's on par, better, or worse depending on the problem. o1 is significantly worse than Claude 3.5 at Rust programming, for example; at least for me.


Claude really likes producing code, that’s for sure. I feel like it’s a useful tool once I’ve deconstructed a project past a certain point.


It’s pretty on par with o1, better at many coding questions.


I found it better at reasoning, worse at coding.

Not doubting your experience, just thinking how subjective it all is.


Why can't you just create a pool of databases and truncate all tables after each run? That should be pretty fast.


DELETE is faster on a small amount of data.

But yeah, DELETE-ing is faster than creating a DB from a template (depends on how much you have to delete, of course). However, templates allow parallelism by giving each test its own database. I ended up doing a mix of both: create a DB if there's not one available, or reuse one if available, always tearing down data.


TRUNCATE is faster than DELETE. You could have 100 dbs, each test first acquires one, runs, truncates, releases. No need to create more dbs on the fly.
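A minimal sketch of that pooling pattern (Python + psycopg2; pool size and table names are hypothetical):

    import queue
    import psycopg2

    POOL_SIZE = 8
    TEST_TABLES = ["users", "orders"]  # hypothetical tables under test

    # Pre-created databases test_0 .. test_7, all with the same schema.
    pool = queue.Queue()
    for i in range(POOL_SIZE):
        pool.put(psycopg2.connect(dbname=f"test_{i}"))

    def run_test(test_fn):
        conn = pool.get()                # acquire a database for this test
        try:
            test_fn(conn)
        finally:
            with conn.cursor() as cur:   # wipe the data, keep the schema
                cur.execute("TRUNCATE " + ", ".join(TEST_TABLES)
                            + " RESTART IDENTITY CASCADE")
            conn.commit()
            pool.put(conn)               # release it for the next test

Each test grabs its own database from the queue, so concurrently running tests never share data, and the TRUNCATE at the end keeps the databases reusable.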


Researchers are paid 2x what engineers are paid at OAI. Even if it's not the same job, there's still one that is "higher level" than the other.


In terms of pay at OAI, sure.

But being an engineer isn’t just a lesser form of being a researcher.

It’s not a “level” in that sense. Like OAI isn’t going to fire an engineer and replace them with a researcher.

