Thanks for this (and to Izkata for the suggestion). I now have about 100 (okay, minor exaggeration, but not as much as you'd like it to be) AGENTS.md/CLAUDE.md files and agent descriptions where I will want to systematically validate whether shifting toward first person helps adherence for...
I'm realising I need to start setting up an automated test-suite for my prompts...
You can absolutely learn LLMs' strengths and weaknesses too.
E.g. Claude gets "bored" easily (it will even tell you this if you give it overly repetitive tasks). The solution is simple: Since we control context and it has no memory outside of that, make it pretend it's not doing repetitive tasks by having the top agent "only" manage and sub-divide the work, and farm out each sub-task to a sub-agent who won't get bored because it only sees a small part of the problem.
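In code, the shape of it is roughly this (a minimal sketch using the Anthropic Python SDK; the model name, prompts and chunking are placeholders, not what I actually run):

    # Sketch: the coordinator is just code that subdivides the work; each chunk is
    # handled by a sub-agent with a completely fresh context, so no single
    # conversation ever "sees" the whole tedious job.
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    def run_subagent(instructions: str, chunk: str) -> str:
        # A fresh message list per call: the sub-agent has no memory of other chunks.
        response = client.messages.create(
            model="claude-3-5-haiku-latest",  # placeholder; pick whatever model fits the sub-task
            max_tokens=1024,
            system=instructions,
            messages=[{"role": "user", "content": chunk}],
        )
        return response.content[0].text

    def process_repetitive_task(instructions: str, chunks: list[str]) -> list[str]:
        # The "top agent" could itself be a model that only plans and delegates;
        # plain code is enough when the subdivision is mechanical.
        return [run_subagent(instructions, chunk) for chunk in chunks]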
> Also that thing that LLMs don’t actually learn. You can threaten to chop their fingers off if they do something again… they don’t have fingers, they don’t recall, and can’t actually tell if they did the thing. “I’m not lying, oops I am, no I’m not, oops I am… lemme delete the home directory and see if that helps…”
No; like characters in a "Groundhog Day" scenario, they also don't remember or change their behaviour while you figure out how to get them to do what you want, so you can test and adjust and find what makes them do it, and while it's not perfectly deterministic, you get close.
And unlike with humans, sometimes the "not learning" helps us address other parts of the problem. E.g. if they learned, the "sub-agent trick" above wouldn't work, because they'd realise they were carrying out a bunch of tedious tasks instead of remaining oblivious, since we let them forget in between each one.
LLMs in their current form need harnesses, and we can learn - and are learning - which types of harnesses work well. Incidentally, a lot of them do work on humans too (despite our pesky memory making it harder to slip things past us), and a lot of them are methods we know from the very long history of figuring out how to make messy, unreliable humans adhere to processes.
E.g. to go back to my top example of getting adherence to a boring, repetitive task: Create checklists, subdivide the task with individual reporting gates, spread it across a team if you can, put in place a review process (with a checklist). All of these are techniques that work both on human teams and LLMs to improve process adherence.
> AGENTS.md, on the other hand, is context. Models have been trained to follow context since the dawn of the thing.
The skills frontmatter ends up in context as well.
If AGENTS.md outperforms skills in a given agent, it comes down specifically to how the skills frontmatter is extracted and injected into the context, because that is the only difference between the two approaches.
EDIT: I haven't tried to check this so this is pure speculation, but I suppose there is the possibility that some agents might use a smaller model to selectively decide what skills frontmatter to include in context for a bigger model. E.g. you could imagine Claude passing the prompt + skills frontmatter to Haiku to selectively decide what to include before passing to Sonnet or Opus. In that case, depending on approach, putting it directly in AGENTS.md might simply be a question of what information is prioritised in the output passed to the full model. (Again: this is pure speculation of a possible approach; though it is one I'd test if I were to pick up writing my own coding agent again)
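To make the speculation concrete, that selection step could look roughly like this (everything here - model names, prompt wording, frontmatter shape - is made up for illustration):

    # Hypothetical sketch: a small model reads the user prompt plus all skill
    # frontmatter and decides which skills are worth surfacing to the big model.
    # Model names, the prompt and the catalog format are assumptions.
    import anthropic

    client = anthropic.Anthropic()

    def select_frontmatter(user_prompt: str, frontmatters: dict[str, str]) -> list[str]:
        catalog = "\n\n".join(f"[{name}]\n{text}" for name, text in frontmatters.items())
        msg = client.messages.create(
            model="claude-3-5-haiku-latest",  # the cheap "selector" model
            max_tokens=200,
            system="Given a user prompt and a catalog of skill descriptions, reply with a "
                   "comma-separated list of skill names worth loading, or the word 'none'.",
            messages=[{"role": "user", "content": f"Prompt: {user_prompt}\n\nCatalog:\n{catalog}"}],
        )
        answer = msg.content[0].text.strip()
        return [] if answer.lower() == "none" else [n.strip() for n in answer.split(",")]

    def compose_context(user_prompt: str, frontmatters: dict[str, str]) -> str:
        chosen = select_frontmatter(user_prompt, frontmatters)
        selected = "\n\n".join(frontmatters[n] for n in chosen if n in frontmatters)
        return f"{selected}\n\n{user_prompt}" if selected else user_prompt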
But really the overall point is that AGENTS.md vs. skills here is still entirely a question of what ends up in the "raw" context/prompt that gets passed to the full model, so this is just nuance to my original answer with respect to possible ways that raw prompt could be composed.
No, it's more than that - they didn't just put the skills instructions directly in AGENTS.md, they put the whole index for the docs (the skill in this case being a docs lookup) in there, so there's nothing to 'do': the skill output is already in context (or at least pointers to it - the index - if not the actual file contents), not just the front matter.
Hence the submission's conclusion:
> Our working theory [for why this performs better] comes down to three factors.
> No decision point. With AGENTS.md, there's no moment where the agent must decide "should I look this up?" The information is already present.
> Consistent availability. Skills load asynchronously and only when invoked. AGENTS.md content is in the system prompt for every turn.
> No ordering issues. Skills create sequencing decisions (read docs first vs. explore project first). Passive context avoids this entirely.
> No, it's more than that - they didn't just put the skills instructions directly in AGENTS.md, they put the whole index for the docs (the skill in this case being a docs lookup) in there, so there's nothing to 'do': the skill output is already in context (or at least pointers to it - the index - if not the actual file contents), not just the front matter.
The point remains: That is still just down to how you compose the context/prompt that actually goes to the model.
Nothing stops an agent from including logic to inline the full set of skills if the context is short enough. The point of skills is to provide a mechanism for managing context to reduce the need for summarization/compaction or explicit management, and so allow you to e.g. have a lot of them available.
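As a sketch of what I mean (purely illustrative; the token budget, the 4-chars-per-token estimate and the frontmatter convention are all assumptions):

    # Sketch: inline full skill files when everything fits comfortably in context,
    # otherwise fall back to frontmatter-only and let the agent load files on demand.
    from pathlib import Path

    def rough_tokens(text: str) -> int:
        return len(text) // 4  # crude heuristic, good enough for a budget check

    def frontmatter_only(body: str) -> str:
        # Assumes a leading YAML-style block delimited by '---' lines.
        parts = body.split("---")
        return parts[1].strip() if len(parts) >= 3 else body[:500]

    def compose_skills_context(skill_files: list[Path], budget_tokens: int = 20_000) -> str:
        bodies = [path.read_text() for path in skill_files]
        if sum(rough_tokens(body) for body in bodies) <= budget_tokens:
            # Few/small skills: inlining everything makes them behave just like
            # an AGENTS.md-style reference that is always present.
            return "\n\n".join(bodies)
        return "\n\n".join(frontmatter_only(body) for body in bodies)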
(And this kind of makes the article largely moot - it's slightly neat to know it might be better to just inline the skills if you have few enough that they won't seriously fill up your context, but the main value of skills comes when you have enough of them that this isn't the case)
Conversely, nothing prevents the agent from using lossy processing with a smaller, faster model on AGENTS.md either before passing it to the main model, e.g. if context is getting out of hand, or if the developer of a given agent thinks they have a way of making adherence better by transforming them.
These are all tooling decisions, not features of the models.
However you compose the context for the skill, the model has to generate output like 'use skill docslookup(blah)' vs. just 'according to the docs in context' (or even 'read file blah.txt mentioned in context'), which training can affect.
This is assuming you make the model itself decide whether the skill is relevant, and that is one way of doing it, but there is no reason that needs to be the case.
Of course training can affect it, but the point is that there is nothing about skills that needs to be different from just sending all the skill files as part of the context, because that is a valid way of implementing skills, though it loses the primary benefit of skills, namely the ability to have more documentation of how to do things than fits in context.
Other options that also do not require the main model to know what to include range from direct string matching (e.g. against /<someskill>), via embeddings, to passing a question to a smaller model (e.g. "are any of these descriptions relevant to this prompt: ...").
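A trivial version of the string-matching end of that spectrum (illustrative only; the "/skillname" convention and the keyword fallback are assumptions, and an embedding lookup or a small-model question could slot into the same place):

    # Sketch: pick skills without involving the main model. Explicit "/skillname"
    # mentions win; otherwise fall back to naive keyword overlap against the skill
    # descriptions. Swapping the fallback for embeddings or a small-model call
    # wouldn't change the interface.
    import re

    def select_skills(prompt: str, skills: dict[str, str]) -> list[str]:
        # skills maps skill name -> frontmatter/description text
        explicit = [name for name in skills if f"/{name}" in prompt]
        if explicit:
            return explicit
        prompt_words = set(re.findall(r"[a-z]+", prompt.lower()))
        scored = []
        for name, description in skills.items():
            overlap = len(prompt_words & set(re.findall(r"[a-z]+", description.lower())))
            if overlap:
                scored.append((overlap, name))
        return [name for _, name in sorted(scored, reverse=True)[:3]]  # arbitrary cap of 3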
I use a hybrid setup. I have 9 tiled desktops, and 1 floating, across 2 monitors, so I "make a mess" on the floating one and the rest are task focused, typically with one dedicated to a single browser window for my "main" browsing.
If you use the API, you pay for a specific model, yes, but even then there are "workarounds" for them, such as, as someone else pointed out, reducing the amount of time they let it "think".
If you use the subscriptions, the terms specifically say that beyond the caps they can limit your "model and feature usage, at our discretion".
Sure. I was separating the model - which Anthropic promises not to downgrade - and the "thinking time" - which Anthropic doesn't promise not to downgrade. It seems the latter is very likely the culprit in this case.
Old school Gemini used to do this. It was super obvious because mid-day the model would go from stupid to completely brain dead. I have a screenshot of Google's FAQ on my PC from 2024-09-13 that says this (I took it to post to Discord):
> How do I know which model Gemini is using in its responses?
> We believe in using the right model for the right task. We use various models at hand for specific tasks based on what we think will provide the best experience.
There's an 8k license for "recommended" (not "core", which is free under CC-BY-4.0 for all purposes) data if you are a service provider or broadcaster:
Or they just move their "headquarters" and the US part of the business will be a subsidiary.
This is an old and well-tested strategy.
E.g. Commodore International formally had its head office in The Bahamas, but the entire leadership team worked out of the US.
You can try putting more constraints on what will get a company considered a US company to catch those kinds of structures, but as you indirectly point out, there are really only downsides to playing that game.
I think the human + agent thing absolutely will make a huge difference. I see regularly that Claude can go totally off piste and eventually claw itself back with a proper agent setup, but it will take a lot of time if I don't spot it and get it back on track.
I have one project Claude is working on right now where I'm testing a setup to attempt to take myself more out of the loop, because that is the hard part. It's "easy" to get an agent to multiply your output. It's hard to make that scale with your willingness to spend on tokens rather than with your ability to read and review and direct.
I've ended up with roughly this (it's nothing particularly special; there's a rough code sketch of the loop after the list):
- Runs an evaluator that scores the current state across multiple metrics.
- If a given score is above a given threshold, expand the test suite automatically.
- If the score is below a given threshold, spawn a "research agent" that investigates why the scores don't meet expectations.
- The research agent delivers a report that is passed to an implementation agent.
- The main agent re-runs the scoring, and if it doesn't show an improvement on one or more of the metrics, the commit is discarded and notes are made of what was tried and why it failed.
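In code it's roughly this (a sketch only; the agent invocations are passed in as callables because they just wrap whatever CLI/API calls your setup uses, and the threshold and discard policy are arbitrary):

    # Sketch of one iteration of the loop. Every callable stands in for an agent
    # invocation; only the git call is a real command.
    import subprocess
    from typing import Callable

    SCORE_THRESHOLD = 0.8  # arbitrary

    def git(*args: str) -> None:
        subprocess.run(["git", *args], check=True)

    def iteration(
        evaluate: Callable[[], dict[str, float]],      # metric name -> score
        expand_tests: Callable[[], None],              # grow the test suite
        research: Callable[[dict[str, float]], str],   # weak metrics -> report
        implement: Callable[[str], None],              # report -> a commit with the attempted fix
        record_failure: Callable[[str], None],         # note what was tried and why it failed
    ) -> None:
        scores = evaluate()
        weak = {metric: score for metric, score in scores.items() if score < SCORE_THRESHOLD}
        if not weak:
            expand_tests()            # everything above threshold: raise the bar instead
            return
        report = research(weak)       # "research agent": why don't the scores meet expectations?
        implement(report)             # "implementation agent": attempt a fix as a commit
        new_scores = evaluate()
        if all(new_scores.get(metric, 0.0) <= scores[metric] for metric in weak):
            git("reset", "--hard", "HEAD~1")  # no improvement on any weak metric: discard
            record_failure(report)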
It takes a bit of trial and error to get it right (e.g. "it's the test suite that is wrong" came up early, and the main agent was almost talked into revising the test suite to remove the "problematic" tests), but a division sort of like this lets Claude do more sensible stuff for me. Throwing away commits feels drastic - an option is to let it run a little cycle of commit -> evaluate -> redo a few times before the final judgement, maybe - but so far it feels like it'll scale better. Less crap makes it into the project.
And I think this will work better than treating these agents as if they are developers whose output costs 100x as much.
Code so cheap it is disposable should change the workflows.
So while I agree this is a better demonstration of a good way to build a browser, it's a less interesting demonstration as well. Now that we've seen people show that something like FastRender is possible, expect people to experiment with similarly ambitious projects but with more thought put into scoring/evaluation, including on code size and dependencies.
> I think the human + agent thing absolutely will make a huge difference.
Just the day(s) before, I was thinking about this too, and I think what will make the biggest difference is humans who possess "Good Taste". I wrote a bunch about it here: https://emsh.cat/good-taste/
I think the ending is most apt, and where I think we're going wrong right now:
> I feel like we're building the wrong things. The whole vibe right now is "replace the human part" instead of "make better tools for the human part". I don't want a machine that replaces my taste, I want tools that help me use my taste better; see the cut faster, compare directions, compare architectural choices, find where I've missed things, catch when we're going into generics, and help me make sharper intentional choices.
For some projects, "better tools for the human part" is sufficient and awesome.
But for other projects, being able to scale with little or no human involvement suddenly turns some things that were borderline profitable or not possible to make profitable at all with current salaries vs. token costs into viable businesses.
Where it works, it's a paradigm shift - for both good and bad.
So it depends what you're trying to solve for. I have projects in both categories.
Personally I think the part where you try to eliminate humans from involvement is gonna lead to too much trouble and be too inflexible, and the results will be bad. It's what I've seen so far; I haven't seen anything pointing to it being feasible, but I'd be happy to be corrected.
It really depends on the type of tasks. There are many tasks LLMs do for me entirely autonomously already, because they do it well enough that it's no longer worth my time.
And even if the models themselves for some reason were to never get better than what we have now, we've only scratched the surface of harnesses to make them better.
We know a lot about how to make groups of people achieve things individual members never could, and most of the same techniques work for LLMs, but it takes extra work to figure out how to most efficiently work around limitations such as lack of integrated long-term memory.
A lot of that work is in its infancy. E.g. I have a project I'm working on now where I'm up to a couple of dozen agents, and every day I'm learning more about how to structure them to squeeze the most out of the models.
One learning that feels relevant to the linked article: Instead of giving an agent the whole task across a large dataset that'd overwhelm context, it often helps to have an agent - which can use Haiku, because it's fine if it's dumb - comb the data for <information relevant to the specific task>, generate a list of findings, and have the bigger model use that list as a guide.
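Concretely, that pattern is something like this (a rough sketch with the Anthropic Python SDK; the model names, prompts and chunking are placeholders rather than what I actually run):

    # Sketch: a cheap model combs the data chunk by chunk and keeps only what is
    # relevant to the task; the bigger model then works from that short digest
    # instead of the raw dataset.
    import anthropic

    client = anthropic.Anthropic()

    def extract_relevant(task: str, chunk: str) -> str:
        msg = client.messages.create(
            model="claude-3-5-haiku-latest",  # the cheap "comb the data" model
            max_tokens=512,
            system="List only facts from the input that are relevant to the task. "
                   "If nothing is relevant, reply with the single word NONE.",
            messages=[{"role": "user", "content": f"Task: {task}\n\nInput:\n{chunk}"}],
        )
        return msg.content[0].text

    def run_with_digest(task: str, chunks: list[str]) -> str:
        notes = [n for n in (extract_relevant(task, c) for c in chunks) if n.strip() != "NONE"]
        msg = client.messages.create(
            model="claude-sonnet-4-20250514",  # placeholder for the bigger model
            max_tokens=2048,
            messages=[{"role": "user",
                       "content": f"Task: {task}\n\nRelevant notes combed from the data:\n" + "\n".join(notes)}],
        )
        return msg.content[0].text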
So the progress we're seeing is not just raw model improvements, but also work like that in this article: figuring out how to squeeze the best results out of any given model, and that work would continue to yield improvements for years even if models somehow stopped improving.