
The "distillation attacks" are mostly using Claude as LLM-as-a-judge. They are not training on the reasoning chains in an SFT fashion.


So they're paying expensive input tokens to extract at best a tiny amount of information ("judgment") per request? That's even less like "distillation" than the other claim of them trying to figure out reasoning by asking the model to think step by step.


LLM-as-a-judge is quite an effective method for RLing a model, similar to RLHF but more objective and scalable. But yes, Anthropic is making it out to be more serious than it is. Plus DeepSeek only did it for 125k requests, significantly fewer than the other labs, but Anthropic still listed them first to create FUD.


What's the SOTA open-source or weights-available turn-taking model these days? I tried pipecat/smart-turn-v3 and the results are not good. It only works well when you say a short sentence in a clear voice. Anything else will cause it to wait indefinitely. Closed-source API models are obviously a lot better, but they add network latency and the cost adds up.


Distilling from a teacher (Opus 4.5) and scaling RL more.


So fewer parameters but "better" weights?


> i need to wash my helicopter at the helicopter wash. it is 50m away, should i walk or fly there with my helicopter.

Sonnet 4.6: Walk! Flying a helicopter 50 metres would be more trouble than it's worth — by the time you've done your pre-flight checks, spun up the rotors, lifted off, and then safely landed again, you'd have walked there and back twice. Just stroll over.


I think this is my favourite test. You can just tell it was programmed on smug Reddit comments talking about how Americans drive to places 50 metres away.


The smug, non-informative, confidently wrong tone these LLMs have learned from such comments drives me mad.


It's amusing, but when it comes to doing actual work, I just don't care if my LLM fails things like this.

I'm not trying to trick it, so falling for tricks is harmless for my use cases. Does it write quality, secure code? Does it give me accurate answers about coding/physics/biology? If it gets those wrong, that's a problem. If it fails to solve riddles, well, that'll be a problem iff I decide to build a riddle solver using it.


Additionally, I don't think that these kinds of failures say much about overall intelligence. Humans are largely visual creatures, and we fall prey to innumerable visual illusions where we fail to see what's actually there or imagine something that isn't there under certain visual patterns.

LLMs are largely textual creatures, and they fail to see things that are there or imagine things that aren't there under certain textual patterns.

I don't think you would say a human "isn't really intelligent" because it imagines grey spots at the intersection of black squares on a white background even though they aren't there.


I want one that responds by asking how full the parking lots are and how busy the left turn lane is.


TBH I would first walk there to check that they can take me on the spot, and if so, ask them to either please come clean it (only 50m away) or, if they cannot, fly it there. So walking seems very rational to me.


Sure, just pick up the building containing the compressors, water hoses/sprayers, soap, and required drainage and water filtration system, and bring it 50 metres down the road.


Asked Gemini and it said to use ground handling wheels. I think it actually makes sense to use those for this distance.


Ah yes the new "how many r's in strawberry" question, some poor intern has to go vacuum up all these gotcha social media posts so they can train the next model on this.


It was mentioned that the performance of DuckDB is similar to that of Polars (among others). In that case why would one choose DuckDB over Polars? The only differentiator seems to be that you do the querying with standard SQL instead of the library specific APIs.


- performance is often better, especially on “out of core” (“streaming”, spill to disk data sizes). Polars has done a ton of work on their streaming engine but they’re still catching up

- you don’t need to use Python (but Pythonic wrappers like Ibis exist; disclaimer I worked on Ibis, you can find my blogs on performance comparisons easily there); CLI, WASM, etc. w/o Python

- governance: DuckDB as OSS is set up in a more sustainable way (DuckDB Labs + DuckDB Foundation). while there is a VC-backed company (MotherDuck), it doesn't employ the primary developers or control the project in the same way the Polars company does

- overall just simplicity and focus. tends to break less, solely focused on single-node, easy to extend, etc. — not trying to do a cloud product, distributed computing, or GPU execution


They are heavily focused on code. Claude Code likely generates hundreds of millions of lines of Python a day; making the language a little bit better with $1.5M is extremely high leverage.


And if this money improves PyPI security (part of the focus), that reduces the chance of Claude Code adding malicious packages to a code base (a well publicized case of this could be a big PR headache for Anthropic). This donation is likely much better leverage than trying to somehow add mitigation at the Claude Code level.


Care to elaborate on how $1.5M makes Python better?


The donation is earmarked for security concerns, i.e. improving PyPI from a security perspective to prevent/mitigate supply chain attacks, etc. This means a healthier Python ecosystem, which also benefits their products, which likely rely on said ecosystem more than any other.


You’re asking how money can be used to improve software?


Yes, because lots of these comments seem to imply that more money necessarily makes it better, which is often not the case.


Implement transformers yourself (i.e. in NumPy). You'll never truly understand it by just watching videos.


Seconding this. The terms "Query" and "Value" are largely arbitrary and meaningless in practice. Look at how to implement this in PyTorch and you'll see these are just weight matrices that implement a projection of sorts, and self-attention is always just self_attention(x, x, x), or self_attention(x, x, y) in some cases, where x and y are outputs from previous layers.

Plus with different forms of attention, e.g. merged attention, and the research into why / how attention mechanisms might actually be working, the whole "they are motivated by key-value stores" thing starts to look really bogus. Really it is that the attention layer allows for modeling correlations and/or multiplicative interactions among a dimension-reduced representation.
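To make the "just weight matrices" point concrete, here's a minimal NumPy sketch of scaled dot-product self-attention. All names and shapes here are illustrative, not from any particular library; the point is that Q, K, and V are nothing more than learned projections of the inputs.

```python
import numpy as np

def self_attention(x_q, x_k, x_v, w_q, w_k, w_v):
    """Scaled dot-product attention: Q/K/V are just projections."""
    q = x_q @ w_q                                   # (seq, d_k)
    k = x_k @ w_k                                   # (seq, d_k)
    v = x_v @ w_v                                   # (seq, d_v)
    scores = q @ k.T / np.sqrt(k.shape[-1])         # (seq, seq) similarity matrix
    # softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                              # (seq, d_v)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                        # 5 tokens, model dim 16
w_q, w_k, w_v = (rng.normal(size=(16, 8)) for _ in range(3))
out = self_attention(x, x, x, w_q, w_k, w_v)        # the usual a(x, x, x) call
```

Calling it with (x, x, y) instead of (x, x, x) gives you the cross-attention case from the comment above, with no change to the function itself.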


>the terms "Query" and "Value" are largely arbitrary and meaningless in practice

This is the most confusing thing about it imo. Those words all mean something but they're just more matrix multiplications. Nothing was being searched for.


Better resources will note the terms are just historical and not really relevant anymore, and just remain a naming convention for self-attention formulas. IMO it is harmful to learning and good pedagogy to say they are anything more than this, especially as we better understand the real thing they are doing is approximating feature-feature correlations / similarity matrices, or perhaps even more generally, just allow for multiplicative interactions (https://openreview.net/forum?id=rylnK6VtDH).


Do you think the dimension reduction is necessary? Or is it just practical (due to current hardware scarcity)?


Definitely mostly just a practical thing IMO, especially with modern attention variants (sparse attention, FlashAttention, linear attention, merged attention etc). Not sure it is even hardware scarcity per se / solely, it would just be really expensive in terms of both memory and FLOPs (and not clearly increase model capacity) to use larger matrices.

Also, for the specific part where you, in code for encoder-decoder transformers, call a(x, x, y) instead of the usual a(x, x, x) attention call (what Alammar calls "encoder-decoder attention" in his diagram just before the "The Decoder Side"), you have different matrix sizes, so dimension reduction is needed to make the matrix multiplications work out nicely too.

But in general it is just a compute thing IMO.


I personally don't think implementation is as enlightening, as far as really understanding what the model is doing, as this statement implies. I had done that many times, but it wasn't until reading about the relationship to kernel methods that it really clicked for me what is happening under the hood.

Don't get me wrong, implementing attention is still great (and necessary), but even with something as simple as linear regression, implementing it doesn't really give you the entire conceptual model. I do think implementation helps to understand the engineering of these models, but it still requires reflection and study to start to understand conceptually why they are working and what they're really doing (I would, of course, argue I'm still learning about linear models in that regard!)


Do you have a tutorial that I can follow?


If you have 20 hours to spare I highly recommend this youtube playlist from Andrej Karpathy https://www.youtube.com/watch?v=VMj-3S1tku0&list=PLAqhIrjkxb...

It starts with the fundamentals of how backpropagation works, then advances to building a few simple models, and ends with building a GPT-2 clone. It won't teach you everything about AI models but it gives you a solid foundation for branching out.


The most valuable tutorial will be translating from the paper itself. The more hand holding you have in the process, the less you'll be learning conceptually. The pure manipulation of matrices is rather boring and uninformative without some context.

I also think the implementation is more helpful for understanding the engineering work to run these models than for getting a deeper mathematical understanding of what the model is doing.


It does NOT. Response healing only fixes JSON syntax errors, not schema differences.

See https://news.ycombinator.com/item?id=46332119


Tokens are sampled from logits using the constraints after a normal forward pass. The forward pass is the expensive part of LLM inference which isn't affected by structured output.
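A toy sketch of that sampling step (not any specific engine's API): the forward pass produces logits as usual, and the constraint is applied only afterwards, by masking disallowed tokens before sampling.

```python
import numpy as np

def constrained_sample(logits, allowed_token_ids, rng):
    """Mask out disallowed tokens, then sample from what remains."""
    masked = np.full_like(logits, -np.inf)
    masked[allowed_token_ids] = logits[allowed_token_ids]
    probs = np.exp(masked - masked.max())    # disallowed tokens get probability 0
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

rng = np.random.default_rng(0)
logits = rng.normal(size=100)   # output of the (expensive) forward pass
tok = constrained_sample(logits, [3, 17, 42], rng)
```

The masking itself is cheap relative to the forward pass that produced the logits, which is why structured output adds little inference cost.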


Yes, but if the constraints only permit a single valid token anyway for some positions, you could skip the forward pass entirely for those positions and just return that token.

The other idea was a bit more theoretical: if you know only a handful of tokens are valid, then calculating the logits of the other tokens in the forward pass is wasteful, as they won't affect the sampling process. However, it's probably not worth the cost to optimize that, as it only affects the last layer and might be mostly amortized by SIMD parallel processing anyway.
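The single-valid-token shortcut could look like this. Everything here is a toy stand-in: a real implementation would query a grammar engine for the allowed tokens at each position rather than take a precomputed list, and the fake model below just returns random logits.

```python
import numpy as np

def generate(model_forward, allowed_per_step):
    """Decode, skipping the forward pass whenever the grammar
    permits exactly one token at the current position."""
    out = []
    for allowed in allowed_per_step:      # allowed token ids per position
        if len(allowed) == 1:
            out.append(allowed[0])        # forced token: no forward pass needed
            continue
        logits = model_forward(out)       # the expensive part
        masked = np.full_like(logits, -np.inf)
        masked[list(allowed)] = logits[list(allowed)]
        out.append(int(masked.argmax()))  # greedy pick for simplicity
    return out

# toy "model": random logits over a 10-token vocabulary
rng = np.random.default_rng(0)
fake_forward = lambda ctx: rng.normal(size=10)
tokens = generate(fake_forward, [[7], [1, 2, 3], [0]])
```

Positions 0 and 2 never touch the model at all, which is the whole point of the optimization.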


Very confused. When you enable structured output the response should adhere to the JSON schema EXACTLY, not best-effort, by constraining the output via guided decoding. This is even documented in OpenRouter's structured output doc:

> The model will respond with a JSON object that strictly follows your schema

Gemini is listed as a model supporting structured output, and yet its fail rate is 0.39% (Gemini 2.0 Flash)!! I get that structured output has a high performance cost but advertising it as supported when in reality it's not is a massive red flag.

Worse yet, response healing only fixes JSON syntax errors, not schema adherence. This is only mentioned at the end of the article, which people are clearly not going to read.

WTF


You're exactly right. The llguidance library [1,2] seems to have emerged as the go-to solution for this by virtue of being >10X faster than its competition. It's work from some past colleagues of mine at Microsoft Research based on the theory of (regex) derivatives, which we previously used to ship a novel kind of regex engine for .NET. It's cool work and AFAIK should ensure full adherence to a JSON grammar.

llguidance is used in vLLM, SGLang, internally at OpenAI and elsewhere. At the same time, I also see a non-trivial JSON error rate from Gemini models in large scale synthetic generations, so perhaps Google hasn't seen the "llight" yet and are using something less principled.

1: https://guidance-ai.github.io/llguidance/llg-go-brrr

2: https://github.com/guidance-ai/llguidance


Cool stuff! I don't get how all the open source inference frameworks have this down but the big labs don't...

Gemini [0] is falsely advertising this:

> This capability guarantees predictable and parsable results, ensures format and type-safety, enables the programmatic detection of refusals, and simplifies prompting.

[0]: https://ai.google.dev/gemini-api/docs/structured-output?exam...

