
Came across this; the purchase option for the Verum 2 headphones is printable only. Does anyone know if these will be available as a physical purchase option? Thanks

What's strange in this discussion of chips and export bans is that there's been zero discussion of cloud access. I guess networked computers are difficult for America's gerontocrats to understand.

I've rented H100s with no problem on American servers, and there's no KYC or anything; they let anybody do it.


I despise ads. I take any chance I can to pay for my content rather than support ad-based revenue.

But you can’t solve that issue with policy. It’s a cultural issue. People are not willing to pay for the content they consume (with money).

Not to mention you would collapse the US economy (I’m not sure if you’re US based, just speaking from my perspective), and likely others, if you applied a blanket ban on ad-supported media.


I appreciate this framing a lot. It is actually close to how I think about the result internally. The paper focuses on the geometric behavior of intermediate representations, and classification is the cleanest setting to study that. Generative decoding is a much harder problem, and the limitations section already makes that distinction explicit.

Recasting the work as a “classification-native distilled model” or “discriminative foundation model” is a good way to signal scope without underselling the contribution. You're right that discriminative understanding requires far fewer parameters than generation, and my experiments reinforce that.

This will help me get better. The goal for the next revision is exactly what you describe: make the setup clearer, emphasize the intended domain, and avoid suggestive wording that implies capabilities the method does not claim. Duly noted. Your suggestions on positioning and title direction are genuinely helpful, and I’ll incorporate some of this thinking when I prepare the academic submission.

Thanks for taking the time to articulate it so clearly. I appreciate your time and your critique.


This is awesome! lol

Imagine turning your wildest business idea into a fully functional, production-ready app today – without touching a single line of code.

That's the magic of Blink.new, the revolutionary AI-powered platform that does all the heavy lifting for you.

Just describe your vision in plain English (e.g., "Create a fitness tracker app with user logins and progress charts"), and Blink's advanced AI handles everything: seamless databases, secure authentication, blazing-fast APIs, file storage with global CDN, and even integrations with GPT for smart features like chatbots or image generation.

Perfect for entrepreneurs, creators, and dreamers who want to launch MVPs, side hustles, or full-scale SaaS products lightning-fast.

No more endless tutorials, buggy prototypes, or developer fees – just pure, effortless creation.

Thousands are already building games, e-commerce sites, and productivity tools that scale globally with custom domains and 99.9% uptime.

Ready to revolutionize your workflow? Sign up now with this exclusive link and dive in for free – get instant access to all features and start prototyping immediately.

Your first app awaits!


What are you, a psychiatrist?

The average American thinks the U.S. is the best country in the world. I say that as an American. To your point, if people saw how the rest of the world lives and how happy many of those 7.88 billion people are, they would start being more vocal about our endless cycle of working until you are 85 to be able to pay your property taxes.

Ah yes brilliant. Instead of trying to address these issues at their source let’s just let kids form immaterial connections online and guarantee they never learn how to form any sort of in person communication skills!

Thank you for the thoughtful comments. Really. This is actually the most constructive feedback in the thread so far.

A few clarifications.

1. On the LaTeX citations and figure references: that part is definitely on me. I had never used LaTeX before this project and moved extremely fast. There's a lot of weird mumbo jumbo involved in formatting and converting it to a PDF; that part isn't interesting to me, and I try to move past it quickly. I did use AI tools for typesetting help, and I clearly didn't clean up all the placeholder references. Entirely my mistake, not an attempt to fabricate sources. I'll fix the citations and figure links in the next revision so they meet normal academic standards.

2. Architecture transparency and reproducibility: the open-source repo contains every component used for the scientific claim:

- extraction of activation fields
- rank reduction
- probing
- training the student model
- running inference with the student alone

The proprietary references in the paper refer only to optimization layers (CUDA kernels, scheduler heuristics, etc.) that aren't required for the scientific result. They're not hand-wavy secret parts of the method, just production-grade accelerations I'm still packaging separately for licensing.

The core idea—extract, compress, probe, distill—is fully reproduced in the repo.

3. On the "secret sauce" concern: there actually isn't any. The paper may read like I'm hinting at hidden architecture, but the method is intentionally simple. The novelty is in how much task-relevant geometry survives after severe rank reduction, not in a complex architecture. The "anchor layers" are just early and mid-layer activations concatenated before compression.

4. Baseline comparisons: good point on comparing to:

1. a standard small transformer of the same size

2. a distillation from a single layer’s activations

I do have partial results for both, and you’re right that including them would sharpen the contribution. I’ll incorporate them into the revised version.

5. Writing clarity and background: fair critique. I wrote this at the same time I was building the entire stack, which means the prose lagged behind the experiments. I can expand failure modes, limitations, and benchmark context to make the narrative clearer.

6. On the term "meaning field": naming is tricky, and I thought it captured everything I'm working on pretty effectively. Also, I think it will make more sense when you see everything I'm releasing in the near future. I used it because I felt it captures the intuition behind low-rank activation structure, but I'm not attached to the term. "Compressed activation representation" is probably clearer for a paper audience. I'll adjust based on reviewer expectations.

7. On your summary of the method: your restatement is close, but not quite it. The student isn't trained to reconstruct specific layers, but to match the compressed field extracted from multiple layers. It's not a smaller transformer trying to imitate concatenated layers, but a model trying to predict a learned low-dimensional latent that carries most of the task-relevant signal (a rough sketch follows).
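To make the shape of the pipeline concrete, here is a minimal sketch of the extract → compress → distill loop. To be clear, this is a simplified stand-in and not the repo's code: the teacher name, layer indices, pooling, and the SVD-style compression below are all illustrative choices.

    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    # Frozen teacher; any encoder that exposes hidden states works for the sketch.
    teacher = AutoModel.from_pretrained("bert-base-uncased",
                                        output_hidden_states=True).eval()
    tok = AutoTokenizer.from_pretrained("bert-base-uncased")

    @torch.no_grad()
    def extract_anchors(texts, early=2, mid=6):
        # "Anchor layers": early and mid activations, mean-pooled over tokens
        # and concatenated before compression.
        batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
        hs = teacher(**batch).hidden_states            # tuple of (B, T, D)
        return torch.cat([hs[early].mean(1), hs[mid].mean(1)], dim=-1)

    def compress(anchors, rank=256):
        # Severe rank reduction; truncated SVD is one plausible choice.
        # Assumes more samples than the target rank.
        centered = anchors - anchors.mean(0)
        _, _, vh = torch.linalg.svd(centered, full_matrices=False)
        return centered @ vh[:rank].T                  # (B, rank) field

    # Tiny student trained to predict the compressed field directly. The
    # teacher is never loaded again after the extraction pass.
    student = nn.Sequential(nn.EmbeddingBag(tok.vocab_size, 128),
                            nn.Linear(128, 256))
    opt = torch.optim.AdamW(student.parameters(), lr=1e-3)

    def train_step(ids, field):
        opt.zero_grad()
        loss = nn.functional.mse_loss(student(ids), field)
        loss.backward()
        opt.step()
        return loss.item()

The structurally important part: the teacher appears only in the extraction step, and the student's target is the compressed field, not any raw layer.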

All of your points are duly noted, and they will help me to adapt, grow, and mature my work and future releases.

Thank you, sincerely. This is the kind of feedback that actually improves me and the work as well.


I have a chronic disease and make over 500K dollars, and I can tell you that US healthcare's ability (from primary care to specialists) to help me stay on track or identify health issues has been nil. If it weren't for the fact that I second-guess every recommendation and go pay out of pocket for tests (even though I pay $4K+ in insurance premiums), I would have been dead by now. No, the US does not have the best healthcare, not even close.

Scenario 1: You fall head first from the 10th floor. US healthcare has a higher chance of saving your life. Scenario 2: You are an average person who hopes to get preventive medical care. You will die in the U.S. of the most basic medical condition.


A few clarifications, since most of the points here come from asking LLMs to summarize the repo rather than running the code directly.

1. The teacher only runs during field extraction. That step is offline. Once the fields are saved, the transformer is no longer needed. The student training and student-only inference scripts do not load the teacher at all. Compression refers to the field representation and the student head, not the extraction pass.

2. The HellaSwag file is a placeholder, not a required part of the method. It's included so the structure mirrors the paper’s tasks, and it points to the description in the text. The core experiments (RTE, SST-2, CIFAR-10 intention probe, etc.) all have complete working code paths.

3. The AN1 head is intentionally simple. Linear probes are the baseline way to test whether compressed intermediate representations preserve structure (a generic example appears after this list). The key result is how much task-relevant geometry survives in a low-rank field. The novelty is in the compression behavior, not in inventing a new classifier architecture.

4. The student model exists and is trained independently of the teacher. This is what produces the classification results in the paper. The student doesn't call the teacher during inference, which is exactly the point.

5. DistilBERT’s SST-2 score isn’t the relevant comparison. The experiment isn’t “beat a small transformer.” It’s “how far can a 256-dimensional compressed field distilled from a frozen 70B model get on a downstream task?” The result speaks to representational compression, not leaderboard performance.

6. The 2 tok/s number is for the specific configuration used in the economic section. Different hardware, precision modes, and serving stacks vary by an order of magnitude. The point was to illustrate cost scaling, not claim a universal throughput ceiling.
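Since item 3 keeps coming up: the probe really is nothing more exotic than a linear classifier over the saved fields, along these lines (file names are placeholders for whatever the extraction scripts write out):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X = np.load("fields.npy")     # (N, 256) compressed fields; placeholder path
    y = np.load("labels.npy")     # (N,) task labels; placeholder path

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print("probe accuracy:", probe.score(X_te, y_te))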

If there’s a specific part of the implementation you believe contradicts the paper, feel free to point to the line and we can discuss that human to human. The repo is small by design, so everything is easy to check directly without relying on LLM summaries.


There's no way a tardigrade is half a sea snail.

That's not how the method works... The full transformer is only needed once to extract the activation fields. That step can even be done offline. Then the teacher can be discarded entirely. The compression result refers to the size of the learned field representation and the small student head that operates directly on it. Simple. No fake claim there. Inference with the student does not involve the transformer at all.

If you look at the student-only scripts in the repo, those runs never load the teacher. That's the novel part.
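Schematically, that path looks like the following. The artifact names are placeholders, but the point stands: no teacher checkpoint is referenced anywhere in it.

    import torch

    # Load the distilled student and the classification head; the teacher
    # is never constructed in this script.
    student = torch.load("student.pt", weights_only=False).eval()
    head = torch.load("probe_head.pt", weights_only=False).eval()

    @torch.no_grad()
    def classify(ids: torch.Tensor) -> int:
        field = student(ids)               # predicted compressed field, (1, 256)
        return int(head(field).argmax())   # label from the field alone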


When someone shifts from engaging with the actual results to attacking the person, it usually tells you more about their internal state than about the work itself. I'm glad I have a new fan though.

Oh, so you didn't run the repo, and you remembered something you read once that looked like it matched. This contribution is meaningless.

The simplest way to resolve any doubt is to run the code. Every result in the paper comes from reproducible scripts in the repo, not from speculative reasoning or LLM-assisted invention.


Looking at a possible rebrand in the near future haha.

The Substack isn't what was supposed to be evaluated; the repo is. That's creative writing, and the repo is scientific. Two different things; one has nothing to do with the other. The technical direction here is straightforward, almost boring in a sense: freeze the teacher, extract intermediate activations, compress, then train a student to match the compressed fields. Sometimes when people aren't able to evaluate the work, they dig for something else online that they can comment on or bring down. The only thing I can offer in response is the simplest one: look at the code and the experiments themselves, not the narrative around them. Everything in the paper is fully reproducible from the reference implementation, and every number in the results section came from running those scripts, not from a model filling in blanks. The surprise is not in the prose, but in how much structure those early-layer fields ended up carrying.

If you think something in the repo looks wrong or inflated, I’m happy to walk through it point by point. I have no problem with hard questions. What matters to me is whether the experiments hold when someone else runs them, not whether the story around them fits a certain aesthetic.


That limitation is already accounted for in how the title is meant to be read. The 224× compression result is specifically about the structure of intermediate activations on classification tasks. The paper makes that explicit in multiple places, including the Limitations section, where generation is identified as an entirely separate challenge.

The title reflects the strongest verified result in the domain the method currently supports, not a universal claim across all modalities. In other words, the compression result is real, but it shouldn't be interpreted as applying to generative decoding... yet.


No one trusts Sam Altman. The trouble is that the media remains in its neutral reporting mode, which gives anyone who attains a title the benefit of what that title would normally entail, plus an unwarranted benefit of the doubt on everything they have obviously done, as if this were a criminal court with no possibility of ever actually consulting its jury.

I guess my "vibe" is just better than your coding :)... Let me explain a few things, if you will: a few clarifications so the discussion stays aligned with what the experiment is actually measuring.

1. The HellaSwag “binary collapse” is intentional and not a leaderboard claim. This work doesn’t attempt to benchmark HellaSwag in the standard four-choice setting. The goal is to probe whether a single frozen layer carries enough information for a small head to distinguish correct versus incorrect continuations. That's a representational geometry test, not a SOTA claim. Binary framing raises the baseline, but that's expected and documented. It's not meant to compare against full LLM HellaSwag results.

2. No adversarial filtering was done. I am using HuggingFace’s standard split directly. Nothing was removed or curated. The experiment doesn't claim robustness or benchmark competitiveness, so the “easier eval” framing doesn’t really apply.

3. EOS extraction isn't cheating; it's the whole point of the probe. The extraction logic takes the final token's hidden state, which is basic and standard for classification heads and probing studies (see the sketch after this list). If the EOS token captures a high-level sequence summary, that's exactly the structural feature being examined. The result is meant to show how much task-relevant signal is already present in that early representation, not to present a new generative mechanism.

4. The purpose of the work is clearly narrow by design. This is not proposed as a drop-in replacement for full-transformer inference. The paper states that directly. The contribution is about how much structure a single early layer encodes and how far a tiny head can go under strict frozen-teacher constraints. So several of the criticisms make assumptions about goals the work never even claimed.
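For anyone who wants to see the extraction move itself, it is the usual last-token read over hidden states. The model name and layer index below are illustrative, not my exact configuration:

    import torch
    from transformers import AutoModel, AutoTokenizer

    model = AutoModel.from_pretrained("gpt2", output_hidden_states=True).eval()
    tok = AutoTokenizer.from_pretrained("gpt2")

    @torch.no_grad()
    def last_token_state(text, layer=6):
        ids = tok(text, return_tensors="pt")
        hs = model(**ids).hidden_states[layer]   # (1, T, D) at one layer
        return hs[0, -1]                         # the final token's hidden state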

Thank you for the feedback and for taking the time.


Great read. So, as I understand it, the reason AI is generally good at web dev and not so good at other low-level stuff is "more data to train on, better the model": a model trained on the whole web > a model trained on a single language (in this case C). And I also think you can't go lower than C in this case, since below that the code is architecture-specific, so there's even less data to train on. Would love to hear what others think.

Brute force!? Language modeling is a factorial time and memory problem. Someone comes up with a successful method that’s quadratic in the input sequence length and you’re complaining…?

I think there's no "you", just an illusion that there's this uninterrupted "you"-ness from birth to death. It's a very useful illusion for the most part.

I view life (in the philosophical sense; consciousness) as the stream of subjective experiences (qualia) that arise out of life (in the biological sense; neurons and such). Right now my life consists of a collection of sustained interest in this discussion, a little hunger, the qualia of seeing the screen and the realization that I'm sitting a bit uncomfortably. In a few moments "I" will be a collection of other ephemeral qualia.

There's no "real" continuation between one experience and the next, just like there's no real continuation between my past "self" and my future "self", but they're both extremely useful illusions. I'll eat to satisfy the hunger that was registered a moment ago or change my position to get comfortable. I'll be responsible for "my" previous actions as well. I'll basically be able to function as a temporally continuous being.

On the topic of immortality, I'd like to be virtually immortal so I can pursue my goals indefinitely. If I stop having goals or feel like I've had enough, I could always kill myself. My goals arise from my ethics, my biological needs and probably many other things. Why would I be OK with biology and death preventing me from achieving my goals at some arbitrary age?

So for me, "immortality" is both being able to continue the illusions of self indefinitely (which, I admit, feels good intrinsically) and being able to continue the pursuit of my goals indefinitely. The goals seem to actually have more "real" continuity than "I" do.

The most troubling thing with immortality is the biological imperative to live that makes suicide so hard. But I think after a few centuries many people will reach that point. It's not a bad thing, it's just a personal choice.


So yeah, I basically ended up building my own calorie tracking app because every other app was just annoying and way too complicated for my brain. Tracking my macros was a full headache and since I’m ADHD it was even worse. I’m going to the gym like 8 days a week and tried everything. MyFitnessPal, random apps, paid ones, all of them.

One day I kinda did the math and realized it was taking me like 3 to 4 minutes just to log ONE meal. Multiply that by how many meals a day and then a month and then a whole year… yeah nah. I’m not doing that forever.

Then those “take a picture and the AI tells you calories” apps came out around 2023 and I swear I thought I found heaven. I tried like 3 of them, paid of course, and all of them were trash. It actually took me more time because I had to fix every macro manually one by one. So annoying.

So I stopped tracking the normal way and did something different. I just wrote my food in my iPhone notes and at the end of the day I gave the text to ChatGPT. It gave me the macros and calories and if it was wrong I would ask again or fix the portions. Sometimes I even tried Perplexity because it was better at some stuff. That system actually worked perfectly.

I told my gym bros about it and they started doing the same and then their friends told their friends and so on.

Then it hit me… why not make an app that looks like a normal notes app but has AI already inside and it understands what I write about food automatically without me doing anything extra.

And that is how Notecal AI started.

I added a 3 day free trial so if you hate it or it doesn’t fit you, just unsubscribe. No hard feelings.


I was replying in the context of what you were replying to, where they could either spend $30k or make a private root. I'm not sure they were actually using EV, but for it to cost $30k, and given the topic of the thread, it seems plausible they were using some technicality of EV or similar to reduce public domain-validation requirements.

> It started, as many things do these days, by scrolling on X.

Thanks Blake. This is the sign I needed to cancel my Twitter subscription and hellban X from my LAN.


Author here. I started this project after reading the earlier threads on DeepSeek-OCR [1][2]. I got really excited about "vision for context compression," but after reading their paper, a couple things were bugging me.

They show good OCR results (image → text), but the pitch is context compression (text → image → text). They never actually test that pipeline. So I implemented it: render text, compress to vision tokens, reconstruct. Then I compared against just compressing the text embeddings directly. Mean pooling (averaging embeddings in a sliding window) nearly matched DeepSeek-OCR. A small conv encoder crushed both.
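For reference, the mean-pooling baseline is only a few lines. The sketch below uses non-overlapping windows for simplicity (the actual baseline averages over a sliding window, and the window size here is arbitrary):

    import torch

    def mean_pool_compress(emb: torch.Tensor, window: int = 8) -> torch.Tensor:
        # emb: (T, D) token embeddings -> (ceil(T / window), D) compressed
        T, D = emb.shape
        pad = (-T) % window
        if pad:                                  # zero-pad the last window
            emb = torch.cat([emb, emb.new_zeros(pad, D)])
        return emb.view(-1, window, D).mean(dim=1)

    x = torch.randn(100, 768)
    print(mean_pool_compress(x).shape)           # torch.Size([13, 768])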

Ok so fine, maybe vision isn't special for reconstruction. But maybe the path matters more than the destination. Do the representations learned through vision work better for language modeling? I finetuned the checkpoints from the reconstruction experiments for next-token prediction. Vision and mean pooling couldn't beat truncation, but the conv encoder could. I didn't do any architecture search. It just worked.

That said, this is preliminary work. I just wanted to answer the obvious next questions. So far, the findings don't support the "vision for context compression" narrative.

Happy to answer questions.

[1] https://news.ycombinator.com/item?id=45640594
[2] https://news.ycombinator.com/item?id=45658928


Facts don't matter when Elon's feelings are at stake

This doesn't change my argument at all.

The more money you have, the more you benefit from this ruling. Now you can buy a service that was not possible to buy before.

