
> 4. memory attention: SAM2 is a transformer with memory across frames! special "object pointer" tokens stored in a "memory bank" FIFO queue of recent and prompted frames. Has this been explored in language models? whoa?

Interesting. How do you think this could be introduced to LLMs? I imagine that in video, some special tokens are preserved in the input to the next frame, so it's a bit like how LLMs see previous messages in chat history, but filtered down to only certain categories of tokens to limit context size.
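The mechanism described above can be sketched roughly like this. This is a hypothetical illustration, not code from the SAM2 paper: a fixed-size FIFO "memory bank" of summary ("object pointer") tokens from recent frames, prepended to each new frame's input so the model can attend to the past without unbounded context growth. All names (`encode_frame`, `MEMORY_SIZE`, the pointer-selection rule) are made up for the sketch.

```python
from collections import deque

MEMORY_SIZE = 6  # keep pointer tokens from the last 6 frames (assumption)

# FIFO queue: once full, the oldest frame's tokens drop off automatically.
memory_bank = deque(maxlen=MEMORY_SIZE)

def encode_frame(frame_tokens, memory_bank):
    # Context = memory tokens from past frames + current frame's tokens.
    context = [tok for past in memory_bank for tok in past] + frame_tokens
    # ... a transformer would attend over `context` here ...
    # Store only a small set of summary tokens per frame, not the whole
    # frame, so the memory contribution to context stays bounded.
    object_pointers = frame_tokens[:2]  # placeholder selection rule
    memory_bank.append(object_pointers)
    return context

# Example: process 10 frames of 4 tokens each.
for i in range(10):
    frame = [f"f{i}_t{j}" for j in range(4)]
    ctx = encode_frame(frame, memory_bank)

# Context is capped at MEMORY_SIZE * 2 memory tokens + 4 frame tokens.
print(len(memory_bank), len(ctx))
```

The key design point is the cap: by the tenth frame the context holds at most 6 × 2 = 12 memory tokens plus the 4 current tokens, no matter how long the video runs.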

I believe this is a trick already borrowed from LLMs into the video space.

(I didn't read the paper, so that's speculation on my side)


