- "We extend SAM to video", because it was previously only for images and its capabilities are being extended to videos
- "by considering images as a video with a single frame", explaining how they support and build upon the previous image functionality
The main assumptions here are that going from images to videos is a step up rather than a different thing entirely, and that the previous level is always supported.
"retrofit" implies that the ability to handle images was bolted on afterwards. "extend to video" implies this is a natural continuation of the image functionality, so the next part of the sentence is explaining why there is a natural continuation.