Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Any idea on how does Sonnet does this, is the image annotated with bounding boxes on text boxes etc. along with its coordinates before sending to sonnet and it responds with box name back or co-ordinate back or ? is SAM2 used for segmenting everything before sending to sonnet ?


They don't discuss this at all on their blog other than "Training Claude to count pixels accurately was critical." My speculation on how they accomplished it is either explicit tokenizer support with spacial encoding similar to how single-digit tokenization improves math abilities or an extensive pretraining like Molmo.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: