
I covered SAM 1 a year ago (https://news.ycombinator.com/item?id=35558522). Notes from a quick read of the SAM 2 paper (https://ai.meta.com/research/publications/sam-2-segment-anyt...):

1. SAM 2 was trained on 256 A100 GPUs for 108 hours (SAM 1 was 68 hrs on the same cluster). Taking the upper-end $2/hr A100 price off gpulist, that works out to 256 x 108 x $2 ≈ $55k to train - surprisingly cheap for adding video understanding?

2. new dataset: the new SA-V dataset is "only" 50k videos, with careful attention given to scene/object/geographical diversity, including that of the annotators. I wonder if LAION or DataComp (AFAICT the only other real players in the open image data space) can reach this standard...

3. bootstrapped annotation: similar to SAM 1, a 3-phase approach where 16k initial annotations across 1.4k videos were then expanded by another 63k and 197k with SAM 1 + SAM 2 assistance, with annotation time accelerating dramatically (89% faster than with SAM 1 alone) by the end

4. memory attention: SAM 2 is a transformer with memory across frames! Special "object pointer" tokens are stored in a "memory bank" FIFO queue of recent and prompted frames (toy sketch at the end of this comment). Has this been explored in language models? whoa?

(written up in https://x.com/swyx/status/1818074658299855262)
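
Re point 4, here's a toy sketch of roughly how I picture the memory bank + memory attention fitting together, based on the paper's description. Module names, dims, and capacities are all made up by me, and the per-frame inputs are random stand-ins; this is not Meta's code:

    # Toy sketch of a SAM 2-style memory bank + memory attention.
    # My own simplification of the paper's description, not Meta's code.
    from collections import deque
    import torch
    import torch.nn as nn

    class ToyMemoryBank:
        def __init__(self, num_recent=6):
            self.recent = deque(maxlen=num_recent)    # FIFO: memories from recent frames
            self.prompted = []                        # memories from user-prompted frames
            self.obj_ptrs = deque(maxlen=num_recent)  # "object pointer" tokens from the decoder

        def add(self, frame_memory, obj_ptr, prompted=False):
            (self.prompted if prompted else self.recent).append(frame_memory)
            self.obj_ptrs.append(obj_ptr)

        def tokens(self):
            # everything the current frame is allowed to attend to
            mems = list(self.prompted) + list(self.recent) + list(self.obj_ptrs)
            return torch.cat(mems, dim=1)             # (1, total_memory_tokens, dim)

    class ToyMemoryAttention(nn.Module):
        def __init__(self, dim=256, heads=8):
            super().__init__()
            self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, frame_tokens, memory_tokens):
            # condition the current frame's features on the memory bank
            attended, _ = self.cross(frame_tokens, memory_tokens, memory_tokens)
            return frame_tokens + attended

    # per-frame loop: encode frame -> attend to memory -> decode mask -> push new memory
    bank, attn = ToyMemoryBank(), ToyMemoryAttention()
    bank.add(torch.randn(1, 16, 256), torch.randn(1, 1, 256), prompted=True)  # the clicked frame
    for _ in range(3):                                # later, unprompted frames
        frame_tokens = torch.randn(1, 64 * 64, 256)   # stand-in for the image encoder output
        conditioned = attn(frame_tokens, bank.tokens())
        bank.add(torch.randn(1, 16, 256), torch.randn(1, 1, 256))  # stand-in for the memory encoder

If I'm reading the paper right, the real model keeps separate caps for recent vs prompted frames and builds each memory with a small memory encoder that fuses the frame embedding with the predicted mask; the sketch collapses all of that into random stand-ins.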



A colleague of mine has written up a quick explainer on the key features (https://encord.com/blog/segment-anything-model-2-sam-2/). The memory attention module for keeping track of objects throughout a video is very clever - that is one of the trickiest problems to solve, alongside occlusion. We've spent so much time trying to fix these issues in our CV projects; now it looks like Meta has done the work for us :-)


> 4. memory attention: SAM2 is a transformer with memory across frames! special "object pointer" tokens stored in a "memory bank" FIFO queue of recent and prompted frames. Has this been explored in language models? whoa?

Interesting - how do you think this could be introduced to LLMs? I imagine that in video, some special tokens are preserved in the input to the next frame, kind of like how LLMs see previous messages in chat history, but filtered down to only certain categories of tokens to limit the context size.

I believe this is a trick already borrowed from LLMs into the video space.

(I haven't read the paper, so that's speculation on my part)
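
Something like this toy loop is what I'm picturing for the chat case - compress() here is just a made-up stand-in for whatever learned filter would pick out the "special" tokens, not a real API:

    # Toy sketch of the analogy for chat: instead of feeding the full history,
    # keep only a few "special" tokens per turn in a bounded FIFO memory.
    from collections import deque

    MEMORY_SLOTS = 8                        # cap on how much carried-over context we keep

    def compress(turn_text):
        # pretend the first few tokens are the "special" ones worth remembering
        return " ".join(turn_text.split()[:4])

    memory = deque(maxlen=MEMORY_SLOTS)     # FIFO "memory bank" of compressed turns

    def build_prompt(user_msg):
        context = " | ".join(memory)
        return f"[memory: {context}]\nuser: {user_msg}"

    for msg in ["track the red car", "it went behind the truck", "is it visible again?"]:
        prompt = build_prompt(msg)          # the model only ever sees the compressed memory
        memory.append(compress(msg))        # push this turn's "special tokens" into the bank
        print(prompt)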


I might be in the minority, but I'm not that surprised by the results or the not-so-significant GPU hours. I've been doing video segment tracking for a while now, using SAM for mask generation and some of the robust academic video object segmentation models for tracking the mask (see CUTIE, presented at CVPR this year: https://hkchengrex.com/Cutie/).

I still need to read the SAM 2 paper, but point 4 sounds a lot like what Rex has in CUTIE. CUTIE can consistently track segments across video frames even if they get occluded or go out of frame for a while.


Seems like there's functional overlap between segmentation models and the autofocus algorithms developed by Canon and Sony for their high-end cameras.

The Canon R1, for example, will not only continually track a particular object even when it's partially occluded, but will also pre-focus on where it predicts the object will be when it emerges from being totally hidden. It can also be programmed by the user to focus on a particular face to the exclusion of all else.
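
No idea what Canon actually ships, but the "pre-focus where it will re-emerge" behaviour can be approximated with something as simple as constant-velocity extrapolation of the subject's last observed motion. A toy sketch (definitely not Canon's algorithm):

    # Toy sketch: predict where an occluded subject will re-emerge by
    # extrapolating its last observed velocity, then pre-focus there.
    def predict_position(track, frames_ahead):
        """track: list of (x, y) detections up to the moment of occlusion."""
        (x1, y1), (x2, y2) = track[-2], track[-1]
        vx, vy = x2 - x1, y2 - y1               # last observed per-frame velocity
        return x2 + vx * frames_ahead, y2 + vy * frames_ahead

    # subject moving right behind an obstacle; pre-focus 10 frames ahead
    track = [(100, 200), (110, 200), (120, 201), (131, 201)]
    print(predict_position(track, frames_ahead=10))  # ~(241, 201): aim focus there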


Of course, Facebook has had a video tracking ML model for a year or so - CoTracker [1] - it just tracks pixels rather than segments.

[1] https://co-tracker.github.io/



