Any idea how Sonnet does this? Is the image annotated with bounding boxe...

theredsix · on Nov 1, 2024

They don't discuss this at all on their blog, other than "Training Claude to count pixels accurately was critical." My speculation is that they accomplished it either with explicit tokenizer support for spatial encoding, similar to how single-digit tokenization improves math abilities, or with extensive pretraining like Molmo's.
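
To make the single-digit tokenization analogy concrete, here is a toy sketch of how a pixel coordinate might be split under two schemes. This is purely illustrative: `digit_level_tokens`, `chunked_tokens`, and the `click(1024,768)` string are hypothetical names I made up, and nothing here reflects how Claude's tokenizer actually works.

```python
import re


def digit_level_tokens(text: str) -> list[str]:
    """Split so that every digit is its own token; non-digit runs stay whole."""
    return re.findall(r"\d|[^\d\s]+", text)


def chunked_tokens(text: str, chunk: int = 3) -> list[str]:
    """Toy stand-in for a tokenizer that merges digits into multi-digit chunks."""
    tokens: list[str] = []
    for piece in re.findall(r"\d+|[^\d\s]+", text):
        if piece.isdigit():
            # Cut the number left to right into fixed-size chunks.
            tokens.extend(piece[i:i + chunk] for i in range(0, len(piece), chunk))
        else:
            tokens.append(piece)
    return tokens


click = "click(1024,768)"  # hypothetical action string
print(digit_level_tokens(click))
print(chunked_tokens(click))
```

With digit-level splitting, each token position maps to a fixed place value, so 1024 and 1025 differ by one token in a predictable slot. With chunking, 1024 becomes "102" and "4", and the token boundaries no longer line up with place value, which is the usual argument for why it hurts arithmetic.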