
> Unlike many public benchmarks, the PR Benchmark is private, and its data is not publicly released. This ensures models haven’t seen it during training, making results fairer and more indicative of real-world generalization.

This is key.

Public benchmarks are essentially trust-based and the trust just isn't there.


Unless you're running the LLM yourself (locally), private benchmarks are also trust-based, aren't they?


Yes, but in a case like this it's a neutral third party running the benchmark. So there isn't a direct incentive for them to favor one lab over another.

With public benchmarks we're trusting the labs not to cheat. And it's easy to "cheat" accidentally - they actually need to make a serious effort to not contaminate the training data.

And there are massive incentives for the labs to cheat in order to get the hype going around their launch and justify their massive investments in training. It doesn't have to be the CEO who's directing it. It can even be one or a few researchers who are responsible for a specific area of model performance and are under tremendous pressure to deliver.


The problem is that when using a model hosted by those labs (e.g. OpenAI only allowed access to o3 through their own direct API, not even Azure), there is still a significant risk of cheating.

There's a long history of that sort of behaviour. ISPs gaming bandwidth tests when they detect one is being run. Software recognizing being run in a VM or on a particular configuration. I don't think it's a stretch to assume some of the money at OpenAI and others has gone into spotting likely benchmark queries and throwing on a little more compute or tagging them for future training.

I would be outright shocked if most of these benchmarks are even attempting serious countermeasures.


How does this ensure models haven’t seen it during training - is it a different benchmark per model release?


Then you just need to use different data the next time you evaluate. That is much more indicative of real-world generalization: after all, you don't normally do multiple PRs on the same pieces of code. The current approach risks leaking the dataset selectively and/or fudging the results, because they can't be verified. Transparency is key when doing this kind of benchmark: now we have to trust the entity doing the benchmarking rather than relying on independent verification of the results, and with the amount of money at stake here I don't think that's the way to go.


anyone else concerned that training models on synthetic, LLM-generated data might push us into a linguistic feedback loop? relying on LLM text for training could bias the next model towards even more overuse of words like "delve", "showcasing", and "underscores"...


Twitter thread about this by the author: https://x.com/jonasgeiping/status/1888985929727037514


If you don't have a twitter account and want to read the full thread: https://xcancel.com/jonasgeiping/status/1888985929727037514


If you keep digging in that thread the author posts a gist containing information on how the recurrence works:

https://gist.github.com/JonasGeiping/65959599ca637d72d50c96c...


Until we get real-time learning to work in production, every AI tool feels like it's getting dumber over time. It goes very quickly from "wow, this is magic" to noticing all the little gaps. I think we have a fundamental expectation that intelligence learns, and when it doesn't, it just doesn't seem that smart.


The weirdness of LLMs is that they're so damn good at so many things, but then you see these glaring gaps that instantly make them seem dumb. We desperately need benchmarks and evals that test these kinds of hard-to-pin-down cognitive abilities.


Absolutely. This is not a new observation, but another thing they struggle with is self-reporting confidence intervals. When I've asked LLMs to classify/tag things along with a confidence metric, the number seems random and has no connection to the quality or difficulty of the classification.
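
For reference, this is the kind of setup I mean, just a sketch: ask for a label plus a self-reported confidence and parse it back (the prompt, model name, and JSON format are illustrative, not a recommended recipe).

    import json
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def classify(text: str) -> dict:
        # Ask for a label plus a self-reported confidence in one shot.
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": (
                    "Classify the sentiment of the following text as positive, "
                    "negative, or neutral, and give a confidence between 0 and 1. "
                    'Reply only with JSON: {"label": ..., "confidence": ...}\n\n' + text
                ),
            }],
        )
        return json.loads(response.choices[0].message.content)

    # In practice the returned "confidence" often has little connection to how
    # ambiguous the input actually is, which is the problem described above.
    print(classify("The movie was fine, I guess."))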


> validates each test to ensure it runs successfully, passes, and increases code coverage

This seems to be based on the open-source Cover-Agent, which implements Meta's TestGen-LLM paper. https://www.qodo.ai/blog/we-created-the-first-open-source-im...

Each generated test is automatically run — it needs to pass and increase coverage, otherwise it's discarded.

This means you're guaranteed to get working tests that aren't repetitions of existing tests. You just need to do a quick review to check that they aren't doing something strange and they're good to go.
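
Mechanically, the loop amounts to something like this rough sketch (hypothetical helper names, pytest-cov and a src/ layout assumed; not the actual Cover-Agent code):

    import subprocess

    def coverage_percent() -> float:
        """Run the suite under coverage and parse the total percentage (assumes pytest-cov)."""
        result = subprocess.run(
            ["pytest", "--cov=src", "--cov-report=term"],
            capture_output=True, text=True,
        )
        if result.returncode != 0:
            return -1.0  # failing suite: candidate is invalid
        for line in result.stdout.splitlines():
            if line.startswith("TOTAL"):
                return float(line.split()[-1].rstrip("%"))
        return -1.0

    def try_candidate(test_code: str, test_file: str, baseline: float) -> bool:
        """Append a generated test; keep it only if the suite passes and coverage rises."""
        with open(test_file, "a") as f:
            f.write("\n" + test_code)
        new_cov = coverage_percent()
        if new_cov <= baseline:
            # Revert: the candidate either failed or added no new coverage.
            with open(test_file) as f:
                content = f.read()
            with open(test_file, "w") as f:
                f.write(content[: -len("\n" + test_code)])
            return False
        return True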


What's the reasoning behind generating tests until they pass? Isn't the point of tests to discover erroneous corner cases?

What purpose does this serve besides the bragging rights of 'we need 90% coverage otherwise Sonarqube fails the build'?


Unit tests are more commonly written to future proof code from issues down the road, rather than to discover existing bugs. A code base with good test coverage is considered more maintainable — you can make changes without worrying that it will break something in an unexpected place.

I think automating test coverage would be really useful if you needed to refactor a legacy project — you want to be sure that as you change the code, the existing functionality is preserved. I could imagine running this to generate tests and get to good coverage before starting the refactor.


>Unit tests are more commonly written to future proof code from issues down the road, rather than to discover existing bugs. A code base with good test coverage is considered more maintainable — you can make changes without worrying that it will break something in an unexpected place.

The problem is that a lot of unit tests could accurately be described as testing "that the code does what the code does." If future changes to your code also require you to modify your tests (which they likely will), then your tests are largely useless. And if tests for parts of your code that you aren't changing start failing when you make code changes, that means you made terrible design decisions in the first place that led to your code being too tightly coupled (or having too many side effects, or something like global mutable state).

Integration tests are far, far more useful than unit tests. A good type system and avoiding the bad design patterns I mentioned handle 95% of what unit tests could conceivably be useful for.


I disagree. In my experience, poorly designed tests test implementation rather than behavior. To test behavior you must know what is actually supposed to happen when the user presses a button.

One of the issues with getting high coverage is that often tests need to be written for testing implementation, rather than desired outcomes.

Why is this an issue? As you mentioned, testing is useful for future proofing codebases and making sure changing the code doesn't break existing use cases.

When tests look for desired behavior, this usually means that unless the spec changes, all tests should pass.

The problem is when you test implementation: suppose you do a refactoring, a cleanup, or extend the code to support future use cases, and the tests start failing. Clearly something must be changed in the tests - but what? Which cases encode actual important rules about how the code should behave, and which ones were just tautologically testing that the code did what it did?

This introduces murkiness and diminishes the value of tests.
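
A toy example of the distinction (hypothetical Cart class; just a sketch):

    class Cart:
        def __init__(self):
            self._items = {}  # internal detail: name -> price

        def add(self, name, price):
            self._items[name] = price

        def total(self):
            return sum(self._items.values())

    def test_behavior():
        # Encodes the spec: adding two items yields their combined total.
        # Survives a refactor that swaps the dict for a list of tuples.
        cart = Cart()
        cart.add("apple", 2)
        cart.add("bread", 3)
        assert cart.total() == 5

    def test_implementation():
        # Tautological: asserts the internal representation, so the same refactor
        # breaks it even though the observable behavior hasn't changed.
        cart = Cart()
        cart.add("apple", 2)
        assert cart._items == {"apple": 2}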



GitHub repo: https://github.com/aiola-lab/whisper-ner

Hugging Face Demo: https://huggingface.co/spaces/aiola/whisper-ner-v1

Pretty good article that focuses on the privacy/security aspect of this — having a single model that does ASR and NER:

https://venturebeat.com/ai/aiola-unveils-open-source-ai-audi...


Wouldn't it be better to run normal Whisper and then NER on top of the transcription, before streaming a response or writing anything to disk?

What advantage does this offer?
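
Something like this two-step version is what I have in mind, just a sketch (model names are illustrative):

    import whisper
    from transformers import pipeline

    asr = whisper.load_model("base")
    ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

    result = asr.transcribe("call_recording.wav")
    text = result["text"]  # the full raw transcript exists here, sensitive spans included

    # Replace recognized entities from the end of the string backwards so offsets stay valid.
    for ent in sorted(ner(text), key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]

    print(text)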


I think one of the biggest advantages is the security/privacy benefit — you can see in the demo that the model can mask entities instead of tagging them. This means that instead of transcribing and then scrubbing sensitive info, you can prevent the sensitive info from ever being transcribed. Another potential benefit is lower latency. The paper doesn't specifically mention latency, but it seems to be on par with normal Whisper, so you save all of the time it would normally take to do entity tagging, which is a big deal for real-time applications.


I've worked on some enterprise NER systems (specifically privacy/redaction), and in almost all cases the cost of a missed masking was significantly higher than the cost of latency (of course, in an ideal world you'd have both).

And in all the research we did, the best solutions ended up passing through a workflow of 1. NN-based NER, 2. regex, and 3. dictionary lookups to really clean the information. Using a single method worked well in customer demos but always tripped over what we had thought were edge cases in prod.
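
Roughly, the layered approach looked like this (the NER step is a placeholder; the labels, terms, and regex are made-up examples, not our production rules):

    import re

    CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")    # 2. regex for structured PII
    SENSITIVE_TERMS = {"acme corp", "project falcon"}  # 3. dictionary lookups

    def ner_spans(text: str) -> list[tuple[int, int]]:
        """Placeholder for the NN-based NER pass: return (start, end) spans to mask."""
        return []  # e.g. PERSON / ADDRESS spans from a token-classification model

    def redact(text: str) -> str:
        # 1. NN-based NER: mask from the end so earlier offsets stay valid.
        for start, end in sorted(ner_spans(text), reverse=True):
            text = text[:start] + "[REDACTED]" + text[end:]
        # 2. Regex catches structured things the model misses (card numbers, etc.).
        text = CARD_RE.sub("[REDACTED]", text)
        # 3. Dictionary of known sensitive terms as a final sweep.
        for term in SENSITIVE_TERMS:
            text = re.sub(re.escape(term), "[REDACTED]", text, flags=re.IGNORECASE)
        return text

    print(redact("Card 4111 1111 1111 1111 belongs to a client of Acme Corp."))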

That being said, the latency point makes sense. This might work great in conversational use cases, picking out intent and responding, where every millisecond helps in making things sound natural.


The general principle is that "pipelines" impose a restriction where the errors of the first step get baked-in and can't effectively use the knowledge of the following step(s) to fix them.

So if the first step isn't near-perfect (which ASR isn't) and if there is some information or "world knowledge" in the later step(s) which is helpful in deciding that (which is true with respect to knowledge about named entities and ASR) then you can get better accuracy by having an end-to-end system where you don't attempt to pick just one best option at the system boundary. Also, joint training can be helpful, but that IMHO is less important.
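
A toy illustration of that point, with made-up hypotheses and scores: if the ASR step commits to its single best guess, the downstream entity knowledge can't rescue it, but if you keep the n-best list you can rescore with that knowledge.

    asr_nbest = [
        ("meet me at pen station at nine", 0.62),   # acoustically top-ranked
        ("meet me at penn station at nine", 0.58),  # slightly lower acoustically
    ]
    known_entities = {"penn station"}

    def rescore(hypothesis: str, acoustic_score: float) -> float:
        # Add a small bonus when a known entity appears in the hypothesis.
        bonus = 0.1 if any(e in hypothesis for e in known_entities) else 0.0
        return acoustic_score + bonus

    best_hypothesis, _ = max(asr_nbest, key=lambda h: rescore(*h))
    print(best_hypothesis)  # the entity knowledge flips the decision to the second hypothesis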


From my experience, ASR-to-NER pipelines don't perform adequately out of the box. Even though SOTA ASR systems claim 85% word accuracy, the distribution of errors is worth looking into. Errors around critical entities like credit card numbers or addresses are particularly likely, and even a small mistake renders the result useless.

These ASR errors cascade into the NER step, further degrading recall and precision. Combining ASR and NER into a joint model or integrated approach can reduce these issues in theory; it's just more complex to implement and less commonly used.


Yeah, I’m also curious about that. Does combining ASR and NER into one model improve performance for either?


Almost definitely. You can think of there being a kind of triangle inequality for cascading different systems: manually combined systems almost always perform worse given comparable data and model capacity. Put another way, you have tied the model's hands by forcing it to bottleneck through a representation you chose.


Looks like only inference code is available, with no fine-tuning code.


PDFs are now also supported in the API: https://docs.anthropic.com/en/docs/build-with-claude/pdf-sup...

I think Anthropic is actually the first to support PDFs in their API
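
A hedged sketch of what sending a PDF looks like per the linked docs (the model name is illustrative, and depending on the API version a beta header may also be required; check the current spec):

    import base64
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    with open("report.pdf", "rb") as f:
        pdf_b64 = base64.standard_b64encode(f.read()).decode()

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # illustrative model choice
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "document",
                 "source": {"type": "base64",
                            "media_type": "application/pdf",
                            "data": pdf_b64}},
                {"type": "text", "text": "Summarize this document."},
            ],
        }],
    )
    print(message.content[0].text)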


The OpenAI Assistants API has PDF support, I believe.

