Both of them are "visually grounded", meaning that if you ask for the location of something in an image, they can output its exact x/y pixel coordinates! Not many models can do this, and even fewer of those are large enough to actually reason through sequences of actions well.
> This method is so unreasonably effective I can't believe it works, but it's never failed me yet. Whenever you are in the throes of a cataplectic attack, lying motionless and completely helpless, focus all your energy into "finding" the tip of your index finger (either one will do).
Amazing, this is the exact method I found independently to escape sleep paralysis, which thankfully only happens before or after sleeping for me.
Pasting a URL in NewsBlur also uses several of these techniques to find the feed(s), and it is open source, so the feed-finding code could be ripped out of NewsBlur as an alternative to this.