
So, what's happening here on the surface is that it's an optimization (a fairly meaningful one, from the looks of it) aimed at doing roughly the same thing we could already do with chain-of-thought (CoT), but IMO the downstream effects of this sort of change could be much more significant.

LLMs can already do a decent amount of "processing" in a single token generation because of the number of layers they have. Each layer has its own weights rather than sharing them, so it's not exactly a recurrent network doing multiple steps, but the layers do stack sequences of context-dependent transformations on top of each other; no matter how you cut it, if getting to a problem's answer requires 100 sequential steps, you won't be able to do it in a single token output from a 20-layer LLM. To some approximation, CoT is just a way to give the network more chances to transform the data than it has layers - each additional token of output is a shot at baking another vector the size of the token embedding into each layer's state, enriching what's been computed so far.
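To make the "more passes = more sequential steps" point concrete, here's a throwaway numpy toy (made-up shapes and weights, nothing to do with any real model): a fixed stack of layers gives you a fixed number of sequential transformations per token, and each extra generated token buys you another full pass through the stack.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 16          # hidden size (made up)
    n_layers = 4    # depth of the toy "network"

    # Each "layer" is just a fixed random affine map + nonlinearity, standing
    # in for one context-dependent transformation.
    layers = [(rng.standard_normal((d, d)) / np.sqrt(d), rng.standard_normal(d))
              for _ in range(n_layers)]

    def forward(x):
        # One token's worth of compute: exactly n_layers sequential steps.
        for W, b in layers:
            x = np.tanh(x @ W + b)
        return x

    x = rng.standard_normal(d)
    one_pass = forward(x)           # a single forward pass = n_layers steps

    # Generating k extra tokens (as in CoT) means k extra passes, i.e. roughly
    # k * n_layers more transformations applied in sequence.
    k = 5
    state = x
    for _ in range(k):
        state = forward(state)

    print("steps in one pass:", n_layers)
    print("steps across", k, "passes:", k * n_layers)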

The problem with chain of thought is that as you add each new token, at the input level of the network your computation basically starts over from the raw text, just with one additional token. You don't even have access to all the stuff you already figured out in the deepest layers of the network during the previous step! If you were processing "All wunguses are glurgles, and Joe is a wungus", then somewhere in those deepest layers, as you're generating the next token, you've almost certainly got some vector that basically represents "therefore Joe is a glurgle". But with chain of thought you've got to first output "t", then "h", then "e", and so on (I know those aren't tokens, let's pretend letter == token for argument's sake), and during that process almost ALL of the work being done by the network is mere bookkeeping, slowly dumping that thought into the output stream. Only once you get the whole sentence out can you start processing the next token at the first layer with the information that Joe is, in fact, a glurgle, in hand. Which is a damn shame, because it's been sitting right there in the deeper layers of the network, parallel to the previous tokens, the whole time; it just wasn't available for the shallow layers to process directly, because you were casting most of the info away and "rounding" to a single token.
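Here's the same toy extended into a plain greedy decoding loop, to show the bottleneck I mean (again, made-up weights, argmax instead of a real sampler): the whole final hidden state gets collapsed to one token id, and only that id's embedding re-enters the bottom of the stack on the next step.

    import numpy as np

    rng = np.random.default_rng(0)
    d, vocab = 16, 50
    embed = rng.standard_normal((vocab, d))       # token embedding table
    unembed = rng.standard_normal((d, vocab))     # output projection
    W = rng.standard_normal((d, d)) / np.sqrt(d)

    def transformer_stack(x):
        # Stand-in for the whole layer stack; returns a rich d-dim "thought".
        return np.tanh(x @ W)

    token_id = 3
    for step in range(4):
        x = embed[token_id]                  # the shallow layers only ever see this
        h = transformer_stack(x)             # the deep state: the whole "thought"
        logits = h @ unembed
        token_id = int(np.argmax(logits))    # "rounding" the thought to one token
        # h is discarded here; the next step restarts from embed[token_id], so
        # only about log2(vocab) bits of the thought survive into the next pass.
        print(f"step {step}: emitted token {token_id}")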

With Coconut's approach, you don't need to output "therefore Joe is a glurgle" token by token to continue the train of thought, you can essentially pass the entire thought through as a single uber-token, and the next pass can generate a new entire thought, and so on.
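Sketching that difference in the same toy setup (my reading of the idea, not the paper's actual code): skip the collapse-to-a-token-id step and feed the final hidden state straight back in as the next input embedding, so the whole d-dimensional "thought" survives between steps.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 16
    W = rng.standard_normal((d, d)) / np.sqrt(d)
    embed = rng.standard_normal((50, d))

    def transformer_stack(x):
        return np.tanh(x @ W)         # stand-in for the full layer stack

    h = embed[3]                      # start from an ordinary token embedding
    for step in range(4):
        h = transformer_stack(h)      # the full vector, not a token id,
                                      # becomes the next step's input
    print("final latent thought:", h[:4], "...")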

It's a pretty straightforward idea, IMO the neat bit is that they were able to train the network to work well in this way by leveraging CoT. I'm guessing you probably don't need to act as if these are two distinct modes of operation, you could instead always have this side channel of "continuous thought" running, even when you have generated a normal token, coming through as a separate input to the first attention block. You still might want to have a "thinking" token when you need to sit there and let the thing do more work, but you'd generally increase the information flow from time step to time step, which would allow the net to keep thinking in the background even as it's doing the gruntwork of outputting whatever its current "buffered" thought is.
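Purely speculative, but that always-on side channel could look something like this in the same toy setup (the `side` projection is entirely hypothetical, my invention for illustration): the input at each step is the current token's embedding plus a projection of the previous step's final hidden state, so the latent thought keeps flowing even while normal tokens come out.

    import numpy as np

    rng = np.random.default_rng(0)
    d, vocab = 16, 50
    W = rng.standard_normal((d, d)) / np.sqrt(d)
    embed = rng.standard_normal((vocab, d))
    unembed = rng.standard_normal((d, vocab))
    side = rng.standard_normal((d, d)) / np.sqrt(d)   # hypothetical side-channel projection

    def transformer_stack(x):
        return np.tanh(x @ W)

    token_id, prev_h = 3, np.zeros(d)
    for step in range(4):
        x = embed[token_id] + prev_h @ side     # token embedding + carried-over thought
        h = transformer_stack(x)
        token_id = int(np.argmax(h @ unembed))  # a normal token still comes out
        prev_h = h                              # but the full thought is carried forward too
        print(f"step {step}: token {token_id}")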
