Hacker News | measurablefunc's comments

With four parameters I can fit an elephant, and with five I can make him wiggle his trunk, so there is still room for improvement.

Except learning to reason is a far cry from curve fitting. Our brains have more than five parameters.

After a quick browse of the content, my understanding is that this is more like applying a very compressed diff vector to a multi-billion-parameter model so that the model can be 'retrained' to reason (score) better on a specific topic, e.g. math, which was used in the paper.

It's the statistics equivalent of 'no one needs more than 640KB of RAM'.

My very first PC was a Packard Bell with 640KB of RAM. If I’d known, I’d have saved all my RAM for retirement…

speak for yourself!

reasoning capability might just be some specific combinations of mirror neurons.

even some advanced math usually involves applying patterns found elsewhere to new topics


I agree, I don't think gradient descent is going to work in the long run for the kind of luxurious & automated communist utopia the technocrats are promising everyone.

It's not that simple. Production costs have gone up for everyone, and inflation is going to get worse, so the simple logic of "higher prices, higher profits" doesn't really work in this case.

There will be a short-term/long-term split here. I agree with you that ultimately everyone loses long term. Short term, the higher prices will result in higher profits, which will enrich whoever owns the oil.

We aren't at the end of the inflation that's going to hit, though. This is only the beginning. Next year will be when things really go south. At this point it's not a question of if, but rather how bad.


I agree.

It's not clear or obvious why continuous semantics should be applicable on a digital computer. This might seem like nitpicking but it's not; there is a fundamental issue that is always swept under the rug in these kinds of analyses, which is reconciling finitary arithmetic over bit strings & the analytical equations which only work w/ infinite precision over the real or complex numbers as they are usually defined (equivalence classes of Cauchy sequences or Dedekind cuts).

There are no Dedekind cuts or Cauchy sequences on digital computers, so the fact that the analytical equations map to algorithms at all is very non-obvious.


Continuous formulations are used with digital computers all the time. Limited precision of floats sometimes causes numerical instability for some algorithms, but usually these are fixable with different (sometimes less efficient) implementations.
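To make the limited-precision point concrete, here is a tiny illustrative sketch (not from the thread) of two ways IEEE-754 doubles depart from real-number arithmetic:

```python
# Floating-point addition is not associative, unlike addition over the reals.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)  # False: the two groupings round differently

# Absorption: adding a term far below machine epsilon (~2.2e-16 for
# doubles) relative to 1.0 loses it entirely.
eps = 1e-20
print((1.0 + eps) - 1.0)  # 0.0, not 1e-20
```

These are exactly the kinds of effects that numerical analysis exists to bound and work around.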

Discretizing e.g. time or space is perhaps a bigger issue, but the issues are usually well understood and mitigated by e.g. advanced numerical integration schemes, discrete-continuous formulations or just cranking up the discretization resolution.

Analytical tools for discrete formulations are usually a lot less developed and don't as easily admit closed-form solutions.


It is definitely not obvious, but I wouldn't say it is completely unclear.

For instance we know that algorithms like the leapfrog integrator not only approximate a physical system quite well but even conserve the energy, or rather a quantity that approximates the true energy.

There are plenty of theorems about the accuracy and other properties of numerical algorithms.
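A minimal illustration of the leapfrog point, on a harmonic oscillator (x'' = -x): explicit Euler steadily pumps energy into the system, while leapfrog (velocity Verlet) keeps the energy error bounded. This is a toy sketch, not code from any paper:

```python
# Harmonic oscillator x'' = -x, energy E = (v^2 + x^2) / 2.
def euler(x, v, dt, steps):
    for _ in range(steps):
        x, v = x + dt * v, v - dt * x
    return x, v

def leapfrog(x, v, dt, steps):
    for _ in range(steps):
        v_half = v - 0.5 * dt * x   # half kick
        x = x + dt * v_half         # drift
        v = v_half - 0.5 * dt * x   # half kick
    return x, v

def energy(x, v):
    return 0.5 * (v * v + x * x)

e0 = energy(1.0, 0.0)
xe, ve = euler(1.0, 0.0, 0.01, 10_000)
xl, vl = leapfrog(1.0, 0.0, 0.01, 10_000)
print(abs(energy(xe, ve) - e0))  # Euler: energy grows by a factor (1 + dt^2) per step
print(abs(energy(xl, vl) - e0))  # leapfrog: bounded O(dt^2) oscillation, no drift
```

The leapfrog integrator is symplectic: it exactly conserves a "shadow" Hamiltonian close to the true one, which is why the energy error stays bounded instead of drifting.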


How do they apply in this case?

This is what the field of numerical analysis exists for. These details definitely have been treated, but this was done mainly early in the field's history; for example, by people like Wilkinson and Kahan...

I just took some basic numerical courses at uni, but every time we discretized a problem with the aim of implementing it on a computer, we had to show what the discretization error would lead to, e.g. numerical dispersion [1], and do stability analysis and such, e.g. ensure the CFL [2] condition held.

So I guess one might want to do a similar exercise to deriving numerical dispersion for example in order to see just how discretizing the diffusion process affects it and the relation to optimal control theory.

[1]: https://en.wikipedia.org/wiki/Numerical_dispersion

[2]: https://en.wikipedia.org/wiki/Courant%E2%80%93Friedrichs%E2%...
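The stability analysis mentioned above can be seen in a few lines. This is an illustrative sketch (not from the thread) of the explicit scheme for the 1-D heat equation u_t = u_xx, where von Neumann analysis gives the condition dt/dx^2 <= 1/2:

```python
# One explicit Euler update for u_t = u_xx with fixed (zero) boundaries.
# r = dt / dx^2 is the stability parameter.
def step(u, r):
    return [u[0]] + [
        u[i] + r * (u[i - 1] - 2 * u[i] + u[i + 1])
        for i in range(1, len(u) - 1)
    ] + [u[-1]]

def run(r, steps, n=21):
    u = [0.0] * n
    u[n // 2] = 1.0          # initial spike
    for _ in range(steps):
        u = step(u, r)
    return max(abs(x) for x in u)

print(run(0.4, 200))   # r < 1/2: the spike diffuses away, stays bounded
print(run(0.6, 200))   # r > 1/2: high-frequency modes grow without bound
```

The PDE itself is perfectly smooth in both cases; only the discretization blows up, which is exactly the gap between the continuous formulation and its digital implementation.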


Doesn't continuous time basically mean "this is what we expect for sufficiently small time steps"? Very similar to how one would for example take the first order Taylor dynamics and use them for "sufficiently small" perturbations from equilibrium. Is there any other magic to continuous time systems that one would not expect to be solved by sufficiently small time steps?

You should look into condition numbers & how that applies to numerical stability of discretized optimization. If you take a continuous formulation & naively discretize you might get lucky & get a convergent & stable implementation but more often than not you will end up w/ subtle bugs & instabilities for ill-conditioned initial conditions.
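As a concrete sketch of conditioning (illustrative, not from the thread): gradient flow x' = -∇f is stable for any quadratic, but its naive discretization, gradient descent, diverges unless the step size respects the largest curvature:

```python
# f(x, y) = (x^2 + k * y^2) / 2 has condition number k.
# Gradient descent converges only if step_size < 2 / k.
def gd(step_size, k, iters):
    x, y = 1.0, 1.0
    for _ in range(iters):
        x, y = x - step_size * x, y - step_size * k * y
    return abs(x) + abs(y)

k = 100.0
print(gd(0.019, k, 1000))  # step < 2/k: both coordinates decay
print(gd(0.021, k, 1000))  # step > 2/k: the stiff direction diverges
```

The y-update multiplies by (1 - step_size * k) each iteration, so crossing |1 - step_size * k| = 1 flips the discretization from convergent to explosive with no change to the continuous problem.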

I understand that much, but it seems like "your naive timestep may need to be smaller than you think or you need to do some extra work" rather than the more fundamental objection from OP?

The translation from continuous to discrete is not automatic. There is a missing verification in the linked analysis. The mapping must be verified for stability for the proper class of initial/boundary conditions. Increasing the resolution from 64-bit floats to 128-bit floats doesn't automatically give you a stable discretized optimizer from a continuous formulation.

Or you can just try stuff and see if it works

Point still stands, translation from continuous to discrete is not as simple as people think.

Numerical issues totally exist but the reason has nothing to do with the fact that Cauchy sequences don't exist on a computer imo.

The abstract formulation is different from the concrete implementation. It is precisely b/c the abstractions do not exist on computers that the abstract analysis does not automatically transfer the necessary analytical properties to the digital implementation. Cauchy sequences & Dedekind cuts are abstract & do not exist on digital computers.

Infinity has properties that finite approximations of it just don't have, and this can lead to serious problems for certain theorems. In the general case, the integral of a continuous function can be arbitrarily different from the sum of a finite sequence of points sampled from that function, regardless of how many points you sample - and it's even possible that the discrete version is divergent even if the continuous one is convergent.

I'm not saying that this is the case here, but there generally needs to be some justification to say that a certain result that is proven for a continuous function also holds for some discrete version of it.

For a somewhat famous real-world example, it's not currently known how to produce a version of QM/QFT that works with discrete spacetime coordinates, the attempted discretizations fail to maintain the properties of the continuous equations.


Real numbers mostly appear in calculus (e.g. the chain rule in gradient descent/backpropagation), but "discrete calculus" is then used as an approximation of infinitesimal calculus. It uses "finite differences" rather than derivatives, which doesn't require real numbers:

https://en.wikipedia.org/wiki/Finite_difference

I'm not sure about applications of real numbers outside of calculus, and how to replace them there.
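A small sketch of the finite-difference substitution (illustrative, not from the thread): the forward difference approximates the derivative well as h shrinks, up to the point where floating-point roundoff in the subtraction takes over:

```python
import math

# Forward finite difference as a discrete stand-in for the derivative.
def fwd_diff(f, x, h):
    return (f(x + h) - f(x)) / h

exact = math.cos(1.0)  # d/dx sin(x) at x = 1
errs = {h: abs(fwd_diff(math.sin, 1.0, h) - exact) for h in (1e-1, 1e-5, 1e-13)}
for h, e in errs.items():
    print(h, e)  # truncation error shrinks with h; for tiny h, roundoff typically dominates
```

So "discrete calculus" works, but the error has two competing sources (truncation and roundoff), which is again where real numbers sneak back in: the error bounds are stated in terms of them.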


I can't tell if this a troll attempt or not.

If your definition of "algorithm" is "list of instructions", then there is nothing surprising. It's very obvious. The "algorithm" isn't perfect, but a mapping with an error exists.

If your definition of "algorithm" is "error free equivalent of the equations", then the analytical equations do not map to "algorithms". "Algorithms" do not exist.

I mean, your objection is kind of like questioning how a construction material could hold up a building when it is inevitably bound to decay and therefore result in structural collapse. Is it actually holding the entire time or is it slowly collapsing the entire time?


You should provide evidence & examples for your claims if you want to be taken seriously.

Precisely!

No need to engage with an article that makes naked assertions with little backing.

Ok, fine then...:

"But they have no more consciousness, sensitivity, and sentience than a hammer. " -- naked assertion, no backing, no definition, no ope rationalization, no scientific or philosophical work shown (and this is a spicy one, because there's been philosophical turf wars on this for half a century, you can't just ASSERT that)

"Every device made by man has an off switch. We can use it sometimes." -- I have stories. Semi-Explosive near death stories. At any rate... uh, not quite?

Look, at the very least he's sloppy here. It's mostly just a raw opinion piece, I guess, but not really backed by much that is real. Just so you know, this cost me more time than the text even deserves.


This is similar to AWS & their Graviton VMs.

The author does not exist & the paper is pure nonsense: https://scholar.google.com/citations?user=G97KxEYAAAAJ&hl=en. Might even be a psyop by some three-letter agencies. So the obvious question: why did you post this?

Sorry for the confusion; the authors may not have an active record on Scholar. But I wanted to share it here because I read the paper and found it interesting.

You read the paper? All 459 pages of it? And you missed e.g. this gem on page 257? "[11:23:54] CLAUDE: OPUS 5 — 606 pages, need 194 more. FINAL PUSH. Write these in first person as Logan. MAXIMUM DENSITY:"

I'm sorry.

All traffic is monitored, all signal sources are eventually incorporated into the training set in one way or another. The person you're responding to is correct, even a single API call to any AI provider is sufficient to discount future results from the same provider.

ok! So if someone uses an existing, checkpointed, open source model then the answer is yes the results are valid and it doesn't matter that the tests are public.

Yes, assuming the checkpoint was before the announcement & public availability of the test set.

You live in a conspiracy world. Those AI providers don't update their models that fast. You can try asking them to solve ARC-AGI-3 without a harness and see for yourself that they struggle, same as yesterday.

Which part is the conspiracy? Be as concrete as possible.

That's great but how about UltraAgents: Meta-referential meta-improving self-referential hyperagents?

AGI-MegaAgent 5.7 Pro Ultra

Somehow still financed w/ ads & ubiquitous surveillance.

It's "vibe" research. Most of it is basically pure nonsense.

Care to elaborate?

The headline theorem, "every sigmoid transformer is a Bayesian network," is proved by `rfl` [1]. For non-Lean people: `rfl` means "both sides are the same expression." He defines a transformer forward pass, then defines a BP forward pass with the same operations, wraps the weights in a struct called `implicitGraph`, and Lean confirms they match. They match because he wrote them to match.

The repo with a real transformer model (transformer-bp-lean) has 22 axioms and 7 theorems. In Lean, an axiom is something you state without proving. The system takes your word for it. Here the axioms aren't background math, they're the paper's claims:

- "The FFN computes the Bayesian update" [2]. Axiom.

- "Attention routes neighbors correctly" [3]. Axiom.

- "BP converges" [4]. Axiom, with a comment saying it's "not provable in general."

- The no-hallucination corollary [5]. Axiom.

The paper says "formally verified against standard mathematical axioms" about all of these. They are not verified. They are assumed.

The README suggests running `grep -r "sorry"` and finding nothing as proof the code is complete. In Lean, `sorry` means "I haven't proved this" and throws a compiler warning. `axiom` also means "I haven't proved this" but doesn't warn. So the grep returns clean while 22 claims sit unproved. Meanwhile the godel repo has 4 actual sorries [6] anyway, including "logit and sigmoid are inverses," which the paper treats as proven. That same fact appears as an axiom in the other repo [7]. Same hole, two repos, two different ways to leave it open.
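For readers without Lean, the three mechanisms can be shown with a toy example (hypothetical declarations, not from the repos):

```lean
-- `rfl` closes a goal only when both sides are definitionally equal:
theorem trivial_eq : 2 + 2 = 4 := rfl

-- `axiom` introduces a claim with no proof at all; Lean accepts it silently:
axiom bp_converges : ∀ n : Nat, n = n   -- stand-in for "BP converges"

-- `sorry` also leaves a claim unproved, but it emits a compiler warning:
theorem logit_sigmoid_inverse : True := sorry

-- `grep -r "sorry"` flags only the last one; the axiom sails through clean.
```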

Final count across all five repos: 65 axioms, 5 sorries, 149 theorems.

Claude (credited on page 1) turned it into "Not an approximation of it. Not an analogy to it. The computation is belief propagation." Building to a 2-variable toy experiment on 5K parameters presented as the fulfillment of Leibniz's 350-year-old dream. Ending signed by "Calculemus."

[1] https://github.com/gregorycoppola/sigmoid-transformer-lean/b...

[2] https://github.com/gregorycoppola/transformer-bp-lean/blob/7...

[3] https://github.com/gregorycoppola/transformer-bp-lean/blob/7...

[4] https://github.com/gregorycoppola/transformer-bp-lean/blob/7...

[5] https://github.com/gregorycoppola/transformer-bp-lean/blob/7...

[6] https://github.com/gregorycoppola/godel/blob/bc1d138/Godel/O...

[7] https://github.com/gregorycoppola/sigmoid-transformer-lean/b...


Thanks for writing such an elaborate reply! I wish I was familiar with Lean, so I could follow. But if you're right, it would put the claims of the paper in a totally different light.

Perhaps others with knowledge in Lean could also chime in?


Doubtful:

- articles two days old

- I got links right to the code

- it's clearly a waste of time if you know Lean; I went way above and beyond already

Maybe if you were able to show "no, actually none of this is well founded", someone might be tempted. But you'd need someone who showed up days later, knows enough Lean to validate it for you, yet not enough Lean to know it's a joke just from looking at the links.

You're welcome! I don't mean to be mean (pun intended); hope you don't read it that way. I just figured it'd give you some food for thought re: exactly how much work you can expect from other people, and that you might need to set more constraints on an "idk, can someone else tell me more?" reaction than "one person said something, but someone else said they're wrong, so the score is 1 to 1".


Thanks again - this time I have to admit I really don't get what you're trying to say?!

Sorry, I was unclear!

You said you wished someone with Lean knowledge could check my work. I'm saying: you can check it yourself, right now, without knowing Lean.

Click any of links [2] through [5]. You'll see the word `axiom` in the code. In Lean, `axiom` means "assume this is true without proof." That's it. That's the whole critique. The paper says "formally verified," but the key claims — FFN computes Bayesian update, attention routes correctly, BP converges, no hallucination — are all just assumed.

You don't need to take my word for it, and you don't need a Lean expert to break a tie. The evidence is right there in the links. It'd be like a paper claiming "we formally proved this bridge is safe" and the proof says "Axiom: this bridge is safe." You don't need a civil engineer to tell you that's not a proof.


I suspect it means it's LLM-generated without being checked.

There is nothing continuous on the computer, it's all bit strings & boolean arithmetic. The semantics imposed on the bit strings does not exist anywhere in the arithmetic operations, i.e. there is no arithmetic operation corresponding to something as simple as the color red.

It sounds like you're saying that if a computer had infinite precision then hallucinations would not occur?

The way neural networks work is that the base neural network is embedded in a sampling loop, i.e. a query is fed into the network & the driver samples output tokens to append to the query so that it can be re-fed back into the network (q → nn → [a, b, c, ...] → q + sample([a, b, c, ...])). There is no way to avoid hallucinations b/c hallucinations are how the entire network works at the implementation level. The precision makes no difference b/c the arithmetic operations are semantically void & only become meaningful after they are interpreted by someone who knows to associate 1 w/ red, 2 w/ blue, 3 w/ clouds, & so on & so forth. The mapping between the numbers & concepts does not exist in the arithmetic.
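The q → nn → sample → append loop described above can be sketched in a few lines. This is a toy illustration where `fake_nn` is a stand-in for the actual network (a real model computes its next-token distribution from billions of weights, but the loop shape is the same):

```python
import random

# Stand-in for a network: maps a token sequence to (token, probability) pairs.
def fake_nn(query):
    vocab = ["red", "blue", "clouds"]
    return [(tok, 1.0 / len(vocab)) for tok in vocab]

# The driver loop: sample a token, append it, re-feed the extended query.
def generate(query, steps, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        candidates = fake_nn(query)
        tokens = [t for t, _ in candidates]
        weights = [p for _, p in candidates]
        query = query + [rng.choices(tokens, weights=weights)[0]]
    return query

out = generate(["hello"], 5)
print(out)  # the original query plus 5 sampled tokens
```

Nothing in this loop distinguishes a "correct" continuation from a "hallucinated" one; both are just trajectories of the same probabilistic computation.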

Oh, I thought that the embedding space of the residual stream was precisely that.

The arithmetic is meaningless, it doesn't matter what you call it b/c on the computer it's all bit strings & boolean arithmetic. You can call some sequence of operations residual & others embeddings but that is all imposed top-down. There is nothing in the arithmetic that indicates it is somehow special & corresponds to embeddings or residuals.

Ah ok, so if we had such a mapping then models wouldn't hallucinate?

Maybe it's better if you define the terms b/c what I mean by hallucination is that the arithmetic operations + sampling mean that it's all hallucinations. The output is a trajectory of a probabilistic computation over some set of symbols (0s & 1s). Those symbols are meaningless, the only reason they have meaning is b/c everyone has agreed that the number 97 is the ascii code for "a" & every conformant text processor w/ a conformant video adapter will convert 97 (0b1100001) into the display pattern for the letter "a".
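The ASCII point above in two lines (trivial illustration): the bit pattern is just a number, and "a" is one convention layered on top of it; any other mapping is equally arithmetic-compatible.

```python
n = 0b1100001          # the bit string 1100001, i.e. the number 97
print(chr(n))          # 'a' under the ASCII/Unicode convention
alt = {97: "red"}      # an equally arbitrary alternative convention
print(alt[n])          # 'red': same bits, different imposed semantics
```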

So kind of like if you flip a coin, the sampling means the heads or tails you get isn't real?

It's when you define heads or tails however you want & then tell me you have objective semantics for each side of the coin when all you've really done is established a convention about which side is which. The coin is real, what you call each side is a convention & what semantics you attach to a sequence of flips is also a convention that has nothing to do with the reality of the coin.

I'm struggling to differentiate that from how we use coinflips normally. We can pretty easily create arbitrary mappings and then sample from the binomial in a way that has meaning far beyond just heads or tails. Maybe I'm not quite understanding.

Which part are you confused about? Symbols are meaningless until someone imposes semantics on them. There is nothing meaningful about arithmetic in a neural network other than whatever conventions are imposed on the binary sequences, same way 97 has no meaning other than the conventional agreement that it is the ascii code point for "a".

I guess I don't get the main idea. Chemical reactions in our brains are semantically void and yet we're able to use it as substrate for thinking.

This has nothing to do with chemical reactions. The discussion was about symbols and arithmetic. But in any event, this discussion has run its course so good luck: https://chatgpt.com/s/t_69c473e1f71c8191a4ed1e3e2dbdef83

Yes! Excellent example of an ungrounded response, a hallucination.

Also a demonstration of your rhetoric.

> The semantics imposed on the bit strings does not exist anywhere in the arithmetic operations,

Correct, the semantics is actually in the network of relations between the nodes. That has been one of the major lessons of LLMs, and it validates the "systems reply" to Searle's Chinese Room.

