irrationalfab's comments

+1... like with a large enough engineering team, this is ultimately a guardrails problem, which in my experience with agentic coding is very solvable, at least in certain domains.


As with large engineering teams, I have little faith that people will suddenly develop the discipline to do the tedious, annoying, difficult work of building good enough guardrails now.

We don't even build guardrails that keep humans who test stuff as they go from introducing subtle bugs by accident; removing more eyes from that introduces new risks (although LLMs are also better at avoiding certain types of bugs, like copypasta shit).

"Test your tests" gets very difficult as a product evolves and increases in complexity. Few contracts (whether unit test level or "automation clicking on the element on the page") level are static enough to avoid needing to rework the tests, which means reworking the testing of the tests, ...

I think we'll find out just how low the general public's tolerance for bugs and regressions is.


No question this will be hard to do.

But I am not so pessimistic. I do think it will be possible, because it is more fun to test your tests now than in the pre-LLM era. You just need a little bit of knowledge and patience, and the LLM absorbs most of the psychic pain.

If programmers get accustomed to testing their tests, software might actually get better.


Agent/MCP/Skills might be "Netscape-y" in the sense that today's formats will evolve fast. But Netscape still mattered: it lost the market, not the ideas. The patterns survived (JavaScript, cookies, SSL/TLS, progressive rendering) and became best practices we take for granted.

The durable pattern here isn't a specific file format. It's on-demand capability discovery: a small index with concise metadata so the model can find what's available, then pull details only when needed. That's a real improvement over tool calling and MCP's "preload all tools up front" approach, and it mirrors how humans work. Even as models bake more know-how into their weights, novel capabilities will always be created faster than retraining cycles. And even if context becomes unlimited, preloading everything up front remains wasteful when most of it is irrelevant to the task at hand.
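
A minimal sketch of that pattern, assuming a hypothetical skills/ directory where each skill keeps a one-line description at the top of its SKILL.md (only the small index is always in context; the full file is read on demand):

    import pathlib

    SKILLS_DIR = pathlib.Path("skills")  # hypothetical layout: skills/<name>/SKILL.md

    def build_index() -> str:
        """Small always-loaded index: just a name and a one-line description per skill."""
        lines = []
        for skill in sorted(SKILLS_DIR.iterdir()):
            manifest = skill / "SKILL.md"
            if manifest.exists():
                # assumption: the first line of SKILL.md is a one-line description
                lines.append(f"- {skill.name}: {manifest.read_text().splitlines()[0]}")
        return "\n".join(lines)

    def load_skill(name: str) -> str:
        """Full instructions, pulled into context only when the model asks for this skill."""
        return (SKILLS_DIR / name / "SKILL.md").read_text()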

So even if "Skills" gets replaced, discoverability and progressive disclosure likely survive.


There's a pattern I keep seeing: LLMs used to replace things we already know how to do deterministically. Parsing a known HTML structure, transforming a table, running a financial simulation. It works, but it's like using a helicopter to cross the street: expensive, slow, and not guaranteed to land exactly where you intended.

The real opportunity with Agent Skills isn't just packaging prompts. It's providing a mechanism that enables a clean split: LLM as the control plane (planning, choosing tools, handling ambiguous steps) and code or sub-agents as the data/execution plane (fetching, parsing, transforming, simulating, or executing NL steps in a separate context).
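
Roughly what I have in mind, as a sketch (the step registry and the choose_step callback here are hypothetical, not part of any spec): the LLM only picks the next step, and each step is deterministic code behind a small dict-in/dict-out contract.

    from dataclasses import dataclass
    from typing import Callable, Optional

    @dataclass
    class Step:
        """Execution-plane unit: deterministic code behind a declared input/output contract."""
        name: str
        description: str
        run: Callable[[dict], dict]  # contract: dict in, dict out

    STEPS = {
        "fetch": Step("fetch", "Download the raw table", lambda state: {"rows": ["..."]}),
        "transform": Step("transform", "Normalize the table", lambda state: {"table": state["rows"]}),
        "simulate": Step("simulate", "Run the financial simulation", lambda state: {"result": 0.0}),
    }

    def control_loop(choose_step: Callable[[str, dict, dict], Optional[str]], goal: str) -> dict:
        """Control plane: the LLM plans and picks the next step; it never does the heavy lifting."""
        state: dict = {}
        while True:
            choice = choose_step(goal, {k: s.description for k, s in STEPS.items()}, state)
            if choice is None:  # the model decides the goal is met
                return state
            state.update(STEPS[choice].run(state))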

This requires well-defined input/output contracts and a composition model. I opened a discussion on whether Agent Skills should support this kind of composability:

https://github.com/agentskills/agentskills/issues/11


The same applies to context vs. a database. If a reasoning model makes a decision about something, it should be put off to the side and stored as a value/variable/entry somewhere. Instead of using pages and pages of context, it makes sense for some tasks to "press" decisions so they become a more permanent part of the conversation. You can somewhat accomplish that with NotebookLM, by turning results into notes and notes into sources, but NotebookLM is insular and doesn't have the research and imaging features of Gemini.
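
As a toy sketch of what I mean, with a hypothetical decision store: once the model commits to something, the decision is kept as a named entry, and only those compact entries (not the pages of reasoning behind them) get replayed into later prompts.

    decisions: dict[str, str] = {}  # hypothetical store: name -> pressed decision, kept outside the context window

    def press(name: str, decision: str) -> None:
        """Keep a decision as a named entry instead of re-deriving it from pages of old context."""
        decisions[name] = decision

    def preamble() -> str:
        """Only the compact, pressed decisions get replayed into the next prompt."""
        return "\n".join(f"{name}: {value}" for name, value in decisions.items())

    press("tone", "neutral, third person")
    press("target_length", "about 1200 words")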

Also, writing from top to bottom has its disadvantages. It makes sense to emulate the human writing process and work in passes, fleshing the text out in some and, conversely, summarizing it in others.

Current LLMs can brute force these things through emulation/observation/mimicry, but that isn't as good as doing it the right way. Not only would I like to see "skills" but also "processes", where you create a well-defined order in which tasks are accomplished in sequence. Repeatable templates. This would essentially include variables in the templates, set for replacement.


> Not only would I like to see "skills" but also "processes", where you create a well-defined order in which tasks are accomplished in sequence. Repeatable templates. This would essentially include variables in the templates, set for replacement.

You can do this with Gemini commands and extensions.

https://cloud.google.com/blog/topics/developers-practitioner...


Maybe I'm not explaining it well.

The template would be more about defining the output, and I imagine it working more recursively.

Say we are building a piece of journalism. First pass: do these things. Second pass: build more coherent topic sentences. Third pass: build an introduction.

Right now, the way models write from top to bottom, the introduction paragraph seems to inform the body, and then the body is just a stretched-out version of the intro. Whereas how it should work is that the body is written first and then condensed into topic sentences and an introduction.

I find myself having to baby models, "we are going to do this, lets do the first one. ok now lets do the second one, ok now the third one. you forgot the instructions, lets revise with the parameters you were given initially. now lets put it all together."

I'm babbling; I just think these interfaces need a better way to define "let's write paragraph 4 first, followed by blah blah" to better structure the order in which they tackle tasks.
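
Something like this is what I'm picturing, purely as a hypothetical sketch: you declare the pass order up front (body first, then topic sentences, then intro), each pass is a template with variables, and each pass's output fills a variable for the later passes.

    from string import Template

    # hypothetical pass order: the body is written first, then condensed upward
    PASSES = [
        ("body", Template("Write the body of an article about $topic, one section per bullet:\n$outline")),
        ("topic_sentences", Template("Condense each section of this body into one topic sentence:\n$body")),
        ("intro", Template("Write an introduction built only from these topic sentences:\n$topic_sentences")),
    ]

    def run_passes(llm, variables: dict) -> dict:
        """Execute the passes in the declared order; each result becomes a variable for later passes."""
        for name, template in PASSES:
            variables[name] = llm(template.substitute(variables))
        return variables

    # usage: run_passes(llm, {"topic": "local housing policy", "outline": "- costs\n- zoning\n- demand"})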


I've recently been doing some work with Autodesk. It would be great for an LLM to be as comfortable with the "vocabulary" of these applications as it is with code. Maybe part of this involves creating a language for CAD design in the first place. But the principle that we need to build out vocabularies and subsequently generate and expose "sentences" (workflows) for LLMs to train on seems like a promising direction.

Of course this requires substantial buy in from application owners - create the vocabulary - and users - agree to expose and share the sentences they generate - but the results would be worth it.


Mildly amusing, since I remember AutoCAD having a Lisp interpreter ~30 years ago…?


AutoCAD had LISP from the beginning.

https://www.fourmilab.ch/autofile/


100%

Additionally, I can't even get Claude or Codex to reliably use the prompt and simple rules (use this command to compile) in an agents.md or whatever required markdown file is needed. Why would I assume they will reliably handle skills prompts spread about a codebase?

I've even seen tool usage deteriorate while it's thinking and self-commanding through its output to, say, read code from a file. Sometimes it uses tail, while other times it gets confused by the output and then writes a basic Python program to parse lines and strings from the same file, effectively producing the same output as before. How bizarre!


Skills are about empowering LLMs with tools, so the heavy lifting can still be deterministic. Furthermore, pipelines built with LLMs are simpler and less brittle, since handling variation is the essence of machine learning.


Pipelines built with LLMs may be simpler, but they are definitely more brittle and even more non-deterministic.

If AI were deterministic, what difference would a different AI model make?


Isn't at least part of that GH issue something that this https://docs.boundaryml.com/guide/introduction/what-is-baml is also trying to solve? LLM calls should be functions with defined inputs and outputs. That was their starting point.

IIUC their most recent arc focuses on prompt optimization [0], where you can optimize, using DSPy and the GEPA optimization algorithm [1], with relative weights on different things like errors, token usage, and complexity.

[0] https://docs.boundaryml.com/guide/baml-advanced/prompt-optim... [1] https://github.com/gepa-ai/gepa?tab=readme-ov-file


Where in this post's article are you seeing this pattern?

> Parsing a known HTML structure

In most cases, HTML structures that are being parsed aren't known. If they're known, you control them, and you don't need to parse them in the first place. If they're someone else's, who knows when they'll change, or under what condition they're different.

But really, I don't see the stuff you're talking about happening in prod for non-one-off use cases. I see LLMs used in prod exactly for data whose shape you don't know in advance, and there's an enormous number of such cases. If the same logic is needed every time, of course you don't have an LLM execute that logic; you have the LLM write a deterministic script.


I agree partly.

Skills essentially boil down to distributed parts of a main prompt. If you consider a state model, you can see this pattern: the task is the state, and combining the task's specific skills defines the current prompt augmentation. When the task changes, another prompt emerges.
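
In rough pseudo-Python (the skill registry below is purely illustrative, not from any spec), the state-model view looks like this:

    # illustrative only: the task acts as the state, and its attached skills augment the prompt
    SKILLS = {
        "review_pr": ["diff-reading.md", "security-checklist.md"],
        "write_tests": ["test-conventions.md", "coverage-policy.md"],
    }

    def prompt_for(task: str, base_prompt: str, load) -> str:
        """Combining the task's specific skills defines the current prompt augmentation."""
        augmentation = "\n\n".join(load(path) for path in SKILLS.get(task, []))
        return f"{base_prompt}\n\n{augmentation}"

    # when the task changes, another prompt emerges:
    # prompt_for("write_tests", base, load) != prompt_for("review_pr", base, load)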

In the end, it is the clear guidance of the Agent that is the deciding factor.


> Parsing a known HTML structure, transforming a table, running a financial simulation.

Transforming an arbitrary table is still hard, especially a table on a webpage or in a document. Sometimes I even struggle to find the right library. The effort does not seem worth it for a one-off need for such a transformation, either. An LLM can be a great tool for these tasks.


Anthropic is positioning Claude in Chrome as a beta feature for day-to-day use, expanding beyond the earlier research preview.

- Pull data from dashboards into one analysis doc

- Address slide comments automatically

- Build with Claude Code, test in Chrome

https://www.youtube.com/watch?v=rBJnWMD0Pho


> If those features aren't supported by the widget's hard-coded schema, you're out of luck as a user.

Chat paired with the pre-built and on-demand widgets addresses this limitation.

For example, in the keynote demo, they showed how the chat interface lets you perform advanced filtering that pulls together information from multiple sources, like filtering Zillow houses to only those near a dog park.


Yes, because it seems that Zillow exposes those specific filters as part of the input schema. As long as it's part of the schema, ChatGPT can generate a useful input for the widget. But my point is that this is very brittle.


Isn't that as brittle as any system constrained to providing only certain types of outputs? Please elaborate.


A fully generative UI with on-the-fly schema would be less brittle because you can guarantee that the schema and the intelligent widget can fully satisfy the user’s request. The bottleneck here is the intelligence of the model computing this, but we are already at the point where this is not much of a problem and it will disappear as the models continue to improve.
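
As a loose sketch of what an on-the-fly schema could look like (everything here is hypothetical; it is not how ChatGPT apps or Zillow actually work): the model emits both the widget schema and the data for this particular request, so the only fixed contract is the envelope.

    import json

    def generate_widget(llm, user_request: str) -> dict:
        """Ask the model for a schema *and* data tailored to this request, then validate minimally."""
        raw = llm(
            "Return JSON with two keys: 'schema' (the fields this widget needs, with types) "
            f"and 'data' (values for those fields), satisfying: {user_request}"
        )
        widget = json.loads(raw)
        assert set(widget) == {"schema", "data"}  # the only fixed contract is this envelope
        return widget

    # e.g. generate_widget(llm, "homes near a dog park, under $600k, sorted by commute time")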

I think most software will follow this trend and become generated on-demand over the next decade.


> Chat paired with the pre-built and on-demand widgets addresses this limitation

The only place I can see this working is if the LLM is generating a rich UI on the fly. Otherwise, you're arguing that a text-based UX is going to beat flashy, colourful things.


This feels like the death of the app, and the rise of the micro-app.


Very interesting tech. Ephemeral, high-fidelity preview environments that require zero setup are a key enabler. They let you rapidly validate changes within the complete context of a web or mobile app, accelerating feedback loops and cutting friction for minor updates. This also empowers business users to safely implement small, self-contained UI adjustments which is particularly powerful when combined with LLM-driven suggestions.


This nails a real problem. Non-trivial PRs need two passes: first grok the entrypoints and touched files to grasp the conceptual change and review order, then dive into each block of changes with context.


Like convex.dev


Ironically, LLMs might make it very hard for new frameworks to gain popularity since they are trained on the popular ones.


If we're not there already, it's just a matter of time before LLMs can read and understand a framework they haven't seen before and use it anyway.

LLMs are already trained on JavaScript at a deep level; as LLM reasoning and RAG techniques improve, there will be a time in the not-too-distant future when an LLM can be pointed to the website of a new framework and be able to use it.


There's actually a huge twist to this: JS people deprecate things so fast that most of the training data will be for a version that either warns about deprecations on every run, or has been fully removed from the API.

