This isn't suggesting no one understands how these models are architected, nor is anyone saying that SDPA / matrix multiplication isn't understood by those who create these systems.

What's being said is that the result of training and the way in which information is processed in latent space is opaque.

There are strategies to dissect a model's inner workings, but this is an active and incomplete area of research.


Whatever comes out of any LLM will directly depend on the data you fed it and which answers you reinforced as correct. There is nothing unknown or mystical about it.


The same could be said of people, revealing the emptiness of this idea. Knowing the process at a mechanistic level says nothing about the outcome. Some people output German, some English. Its sub-mechanisms are plastic and emergent.


I use the SingleFile extension to archive every page I visit.

It's easy to set up, but be warned, it takes up a lot of disk space.

    $ du -h ~/archive/webpages
    1.1T /home/andrew/archive/webpages
https://github.com/gildas-lormeau/SingleFile


storage is cheap, but if you wanted to improve this:

1. find a way to dedup media

2. ensure content blockers are doing well

3. for news articles, put it through readability and store the markdown instead. if you wanted to be really fancy, you could instead attempt to programmatically create a "template" of sites you've visited with multiple endpoints, so the style is retained but you're not storing it with every page. alternatively a good compression algo could do this, if you had your directory like /home/andrew/archive/boehs.org.tar.gz and inside of the tar all the boehs.org pages you visited are saved

4. add full-text search (FTS) and embeddings over the pages (rough sketch below)
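
For 4, a minimal sketch of the FTS part with sqlite's FTS5 (assuming one SingleFile .html per page under the archive dir, and bs4 for text extraction; embeddings left out):

    import glob
    import sqlite3

    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    db = sqlite3.connect("/home/andrew/archive/index.db")
    db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS pages USING fts5(path, title, body)")

    for path in glob.glob("/home/andrew/archive/webpages/**/*.html", recursive=True):
        with open(path, encoding="utf-8", errors="ignore") as f:
            soup = BeautifulSoup(f.read(), "html.parser")
        title = soup.title.string if soup.title and soup.title.string else ""
        body = soup.get_text(" ", strip=True)
        db.execute("INSERT INTO pages VALUES (?, ?, ?)", (path, title, body))

    db.commit()

    # e.g. every archived page mentioning btrfs and dedup
    for path, title in db.execute("SELECT path, title FROM pages WHERE pages MATCH ?", ("btrfs dedup",)):
        print(path, title)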


1 and partly 3 - I use btrfs with compression and deduping for games and other stuff. Works really well and is "invisible" to you.


dedup on btrfs requires setting up a cronjob. And you need to pick one of the dedup tools too. It's not completely invisible in my mind because of this ;)


>storage is cheap

It is. 1.1TB is both:

- objectively an incredibly huge amount of information

- something that can be stored for the cost of less than a day of this industry's work

Half my reluctance to store big files is just an irrational fear of the effort of managing it.


> - something that can be stored for the cost of less than a day of this industry's work

Far, far less even. You can grab a 1TB external SSD from a good name for less than a day's work at minimum wage in the UK.

I keep getting surprised at just how cheap large storage is every time I need to update stuff.


How do you manage those? Do you have a way to search them, or a specific way to catalogue them, which will make it easy to find exactly what you need from them?


KaraKeep is a decent self-hostable app that can receive SingleFile pages by pointing the SingleFile browser extension at the KaraKeep API. This allows me to search for archived pages. (Plus auto-summarization and tagging via LLM.)


Very naive question, surely. What does KaraKeep provide that grep doesn't?


Jokes aside, it has a mobile app.


I don't get it. How does that help him search files on his local file system? Or is he syncing an index of his entire web history to his mobile device?


GP is using the SingleFile browser extension, which allows him to download the entire page as a single .html file. But SingleFile also allows sending that page to KaraKeep directly instead of downloading it to his local file system (if he's hosting KaraKeep on a NAS on his network). He can then use the mobile app or the KaraKeep web UI to search and view that archived page. KaraKeep does the indexing (including auto-tagging via LLM).


I see now, thank you.


Thanks. I didn't know about this and it looks great.

A couple of questions:

- do you store them compressed or plain?

- what about private info like bank accounts or health insurance?

I guess for privacy one could train oneself to use private browsing mode.

Regarding compression, for thousands of files don't all those per-file compression headers add up? Wouldn't there be space savings from having a global compression dictionary and only storing the encoded data?
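
For instance, zstd can train a shared dictionary over a sample of the files and then compress each one against it. A rough sketch with the python-zstandard package (the directory layout is just a guess):

    import glob
    import zstandard as zstd

    paths = glob.glob("/home/andrew/archive/webpages/**/*.html", recursive=True)
    samples = [open(p, "rb").read() for p in paths[:2000]]

    # train a ~110 KB shared dictionary on a sample of the pages
    dictionary = zstd.train_dictionary(112640, samples)

    # compress each page against it, so boilerplate shared across pages
    # (headers, CSS, inlined scripts) is effectively stored once
    compressor = zstd.ZstdCompressor(level=19, dict_data=dictionary)
    for p in paths:
        with open(p, "rb") as src, open(p + ".zst", "wb") as dst:
            dst.write(compressor.compress(src.read()))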


> do you store them compressed or plain?

Can’t speak to your other issues, but I would think the right file system will save you here. Hopefully someone with more insight can provide color, but my understanding is that file systems like ZFS were specifically built for use cases like this, where you have a large set of data you want to store in a space-efficient manner. Rather than a compression dictionary, I believe tech like ZFS simply looks at bytes on disk and compresses those.


By default, singlefile only saves when you tell it to, so there's no worry about leaking personal information.

I haven't put the effort in to make a "bookmark server" that accomplishes what SingleFile does but on the internet, because of how well SingleFile works.


i was considering a similar setup, but i don't really trust extensions. i'm curious:

- Do you also archive logged-in pages, infinite scrollers, banking sites, fb etc.?

- How many entries is that?

- How often do you go back to the archive? is stuff easy to find?

- do you have any organization or additional process (eg bookmarks)?

did you try integrating it with llms/rag etc yet?


You can just fork it, audit the code, add your own changes, and self host / publish.


yes, you're right. i'm not helpless, and all the new ai tools make this even easier.


Are you automating this in some fashion? Is there another extension you've authored (or similar) to invoke SingleFile functionality on each new page load?


Have you tried MHTML?


SingleFile is way more convenient as it saves to a standard HTML file. The only thing I know that easily reads MHTML/.mht files is Internet Explorer.


Chrome and Edge read them just fine? The format is actually the same as .eml AFAIK.


I remember having issues, but it could be because the .mht's I had were so old (I think I used Internet Explorer's Save As... function to generate them).


I've had such issues with them in the past too, yeah. I never figured out the root cause. But in recent times I haven't had issues, for whatever that's worth. (I also haven't really tried to open many of the old files either.)


You must have several TB of the internet on disk by now...


OpenAI clearly states that they train on your data https://help.openai.com/en/articles/5722486-how-your-data-is...


> By default, we do not train on any inputs or outputs from our products for business users, including ChatGPT Team, ChatGPT Enterprise, and the API. We offer API customers a way to opt-in to share data with us, such as by providing feedback in the Playground, which we then use to improve our models. Unless they explicitly opt-in, organizations are opted out of data-sharing by default.

The business bit is confusing (I guess they see the API as a business product), but they do not train on API data.


So for posterity, in this subthread we found that OpenAI indeed trains on user data and it isn't something that only DeepSeek does.


So for posterity, in this subthread we found that I can use OpenAI without them training on my data, whereas I cannot with DeepSeek.


What do you mean? They both say the same thing for usage through API. You can also use DeepSeek on your own compute.


Where does DeepSeek say that about API usage? Their privacy policy says they store all data on servers in China, and their terms of use says that they can use any user data to improve their services. I can’t see anything where they say that they don’t train on API data.


> Services for businesses, such as ChatGPT Team, ChatGPT Enterprise, and our API Platform

> By default, we do not train on any inputs or outputs from our products for business users, including ChatGPT Team, ChatGPT Enterprise, and the API.

So on the API they don't train by default; for other paid subscriptions they mention you can opt out.


It's easy to argue that Llama-3.3 8B performs better than GPT-3.5. Compare their benchmarks, and try the two side-by-side.

Phi-4 is yet another step towards a small, open, GPT-4 level model. I think we're getting quite close.

Check the benchmarks comparing to GPT-4o on the first page of their technical report if you haven't already https://arxiv.org/pdf/2412.08905


Did you mean Llama-3.1 8B? Llama 3.3 currently only has a 70B model as far as I’m aware.


Great project, looking forward to seeing more as this develops.

Also FYI, your mail server seems to be down.


Thank you, and good catch.

We recently acquired deepsilicon.com, and it looks like the forwarding hasn't been registered yet. abhi@deepsilicon.net should work.


Provided a constant temperature of 1.0, you can train the model on prompts with probabilistic requests, with the loss determined by KL divergence.

Expectation: 80% left, 20% right

Model sampling probability: 99% left, 1% right

    >>> import math
    >>> 0.80 * math.log(0.99 / 0.80) + 0.20 * math.log(0.01 / 0.20)
    -0.42867188234223175

Model sampling probability: 90% left, 10% right

    >>> 0.80 * math.log(0.9 / 0.80) + 0.20 * math.log(0.1 / 0.20)
    -0.04440300758688229

(These values are -KL(expectation || model), so the loss is their negation: the closer the model's sampling distribution is to the requested 80/20 split, the closer the KL divergence is to zero.)

Of course, if you change the temperature, this will break any probabilistic expectations from training in this manner.
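
A toy sketch of that objective in PyTorch, optimizing raw logits directly instead of a real LM just to show the mechanics (loss = KL(expectation || model) at temperature 1.0):

    import torch
    import torch.nn.functional as F

    target = torch.tensor([[0.80, 0.20]])           # requested left/right split
    logits = torch.zeros(1, 2, requires_grad=True)  # stand-in for the LM head output

    opt = torch.optim.Adam([logits], lr=0.1)
    for _ in range(200):
        log_probs = F.log_softmax(logits, dim=-1)   # temperature-1.0 sampling distribution
        loss = F.kl_div(log_probs, target, reduction="batchmean")
        opt.zero_grad()
        loss.backward()
        opt.step()

    print(F.softmax(logits, dim=-1))  # converges to ~[0.80, 0.20]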


Look at it from an algorithmic perspective. In computer science, many algorithms take a non-constant number of steps to execute. However, in transformer models there are a limited number of decoder blocks, and a limited number of FFN layers in each block. This puts a theoretical upper bound on the complexity of the algorithms a decoder network can execute in a single token-generation pass.

This explains why GPT-4 cannot accurately perform large-number multiplication and decimal exponentiation. [0]

This extends to general natural language generation. While some answers can be immediately retrieved or generated by a "cache" / algorithm that exists in latent space, some tokens come out better when their latent-space algorithm is executed over multiple steps.

[0] https://www.semanticscholar.org/reader/817e52b815560f95171d8...
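
As a back-of-the-envelope illustration of the bound (the layer count is a made-up placeholder, not any particular model's):

    # schoolbook multiplication of two n-digit numbers sums n partial
    # products one after another, so the chain of dependent steps grows
    # with n, while the decoder stack has a fixed depth per token
    FIXED_DEPTH = 32  # hypothetical number of decoder blocks

    for n in (4, 8, 32, 128):
        dependent_steps = n  # one running addition per partial product
        verdict = "fits" if dependent_steps <= FIXED_DEPTH else "exceeds"
        print(f"{n:>3}-digit multiply: ~{dependent_steps} dependent additions, {verdict} depth {FIXED_DEPTH}")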


> Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking

This paper suggests that a large language model should "think ahead" by predicting not only the next token but also a "supporting thought." The approach involves generating all tokens simultaneously, allowing for a single forward pass that produces both the next token and a supporting thought, which might consist of, for example, 16 tokens.

This supporting thought influences the model's prediction. The process is then extended to multiple supporting thoughts by ingeniously masking cross-attention between thoughts to ensure their independence. So in essence we can fill all the remaining context with supporting thoughts and benefit from all of them in the same single forward pass.

The supporting thoughts themselves are trained with the objective to maximize the probability of a longer sequence ahead, using RL. So they are trained to optimize for longer-term, instead of the myopic next token prediction task.

https://arxiv.org/abs/2403.09629
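
Not the paper's code, but the cross-attention masking idea looks roughly like this: every token sees the shared prefix and earlier tokens of its own thought, and nothing from the other thoughts (a sketch in PyTorch):

    import torch

    def thought_mask(prefix_len: int, num_thoughts: int, thought_len: int) -> torch.Tensor:
        """Boolean attention mask (True = may attend)."""
        total = prefix_len + num_thoughts * thought_len
        mask = torch.zeros(total, total, dtype=torch.bool)

        # causal attention within the shared prefix
        mask[:prefix_len, :prefix_len] = torch.tril(
            torch.ones(prefix_len, prefix_len, dtype=torch.bool))

        for t in range(num_thoughts):
            start = prefix_len + t * thought_len
            end = start + thought_len
            mask[start:end, :prefix_len] = True      # thoughts see the whole prefix
            mask[start:end, start:end] = torch.tril(  # ...and themselves, causally
                torch.ones(thought_len, thought_len, dtype=torch.bool))
        return mask

    print(thought_mask(prefix_len=3, num_thoughts=2, thought_len=2).int())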


Very interested in the expansion of RL for transformers, but I can't quite tell what this project is.

Could you please add links to the documentation in the README where it states "It includes detailed documentation"?

Also, maybe DPO should use the DDPG acronym instead, so your repo's Deterministic Policy Optimization isn't confused with trl's Direct Preference Optimization.


A few days ago I saw a post using NeuralFlow to help explain the repetition problem.

https://old.reddit.com/r/LocalLLaMA/comments/1ap8mxh/what_ca...

> I’ve done some investigation into this. In a well trained model, if you plot the intermediate output for the last token in the sequence, you see the values update gradually layer to layer. In a model that produces repeating sequences I almost always see a sudden discontinuity at some specific layer. The residual connections are basically flooding the next layer with a distribution of values outside anything else in the dataset.

> The discontinuity is pretty classic overfitting. You’ve both trained a specific token to attend primarily to itself and also incentivized that token to be sampled more often. The result is that if that token is ever included at the end of the context the model is incentivized to repeat it again.

...

> Literally just plotting the output of the layer normalized between zero and one. For one token in mistral 7B it’s a 4096 dimension tensor. Because of the residual connections if you plot that graph for every layer you get a really nice visualization.

> Edit: Here's my visualization. It’s a simple idea but I've never personally seen it done before. AFAIK this is a somewhat novel way to look at transformer layer output.

> Initial output: https://imgur.com/sMwEFEw

> Over-fit output: https://imgur.com/a0obyUj

> Second edit: Code to generate the visualization: https://github.com/valine/NeuralFlow

This is nearly identical to the overfitting example in the repo, only really representing a binary, but it's a good start. Perhaps some transformations can be applied to help further?
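
If anyone wants to try the visualization without the repo, the core idea (stack the last token's hidden state from every layer, normalize, plot) is only a few lines with transformers. A simplified sketch, not valine's exact code:

    import torch
    import matplotlib.pyplot as plt
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "mistralai/Mistral-7B-v0.1"  # any causal LM works
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16, device_map="auto")

    inputs = tok("The quick brown fox", return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)

    # hidden_states is a tuple of (num_layers + 1) tensors of shape (batch, seq, hidden);
    # take the last token's vector from each layer
    x = torch.stack([h[0, -1, :] for h in out.hidden_states]).float().cpu()
    x = (x - x.min()) / (x.max() - x.min())  # normalize to [0, 1]

    plt.imshow(x, aspect="auto", cmap="viridis")
    plt.xlabel("hidden dimension")
    plt.ylabel("layer")
    plt.show()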


Not a materials science expert, but per their paper they use DFT to verify stability, then use the verification results to improve the model.

> candidate structures filtered using GNoME are evaluated using DFT calculations with standardized settings from the Materials Project. Resulting energies of relaxed structures not only verify the stability of crystal structures but are also incorporated into the iterative active-learning workflow as further training data and structures for candidate generation

