
I do it for myself - the desire to post a comment motivates me to read a little more.

A little correction:

> Explicit <eot> is only used in training.

Of course an explicit <eot> is present in the context at inference time, because the LLM was trained to produce verbal tokens after <eot>. It's just that the <eot> is placed into the context in one of the two ways described above.

BTW, I do not understand why the <eot> is not produced by the LLM itself, as you describe. It seems reasonable and natural.

Is that to save computation on unembedding while in latent thought mode? But unembedding takes a small fraction of the compute, so it should not be an issue. Does something prevent reliable learning of how and when to produce the <eot>? But they managed to train a binary classifier. Then why a separate classifier, why not rely on the LLM learning it?
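
To make the cost question concrete, here is a minimal sketch of the two options in PyTorch (the names stop_head and eot_id and all dimensions are my assumptions, not from the paper):

    import torch
    import torch.nn as nn

    hidden_size, vocab_size = 4096, 128_000
    h = torch.randn(1, hidden_size)  # last hidden state of a latent step

    # Option A: let the LLM emit <eot> itself via the normal unembedding.
    # One hidden_size x vocab_size matmul per latent step - tiny next to
    # the dozens of transformer layers that produced h in the first place.
    unembed = nn.Linear(hidden_size, vocab_size, bias=False)
    eot_id = 128_001  # hypothetical token id for <eot>
    emits_eot = unembed(h).argmax(dim=-1).item() == eot_id

    # Option B: a separate binary stop classifier on the hidden state,
    # which is apparently what they trained instead.
    stop_head = nn.Linear(hidden_size, 1)
    stop_now = torch.sigmoid(stop_head(h)).item() > 0.5

Option B is even cheaper per step (hidden_size x 1 instead of hidden_size x vocab_size), but as said above, either way the cost is negligible, so compute alone does not seem to explain the choice.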

Another thought: maybe better names for the special tokens would be not "begin of thought" (<bot>) and "end of thought" (<eot>), but rather something like "pause speech" and "begin speech", because neither humans nor LLMs stop thinking while speaking.


