But that's not the way NNs work. NNs learn classification on a particular representation only. From the NN perspective, this kind of description is exactly zero samples (of the required representation).
> NNs learn classification on a particular representation only
To perform the task based on the description, you need to have learned to associate descriptions of images with the images themselves. To train that ability you'll need lots of paired examples in both representations.
The fact that the description is in a different representation from the image you want to recognize doesn't change the fact that both must use representations the network has been trained on.
Cross-representation learning is still useful because it increases the range of usable data (e.g. it can improve game-playing agents by telling them how to play https://arxiv.org/abs/1704.05539), but it doesn't magically enable learning from zero samples (just more kinds of samples).
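To make the "paired examples" point concrete, here's a toy sketch (not the method from the linked paper): given paired feature vectors from two hypothetical representation spaces ("image" and "text"), a simple least-squares fit learns a cross-representation map, and a description can then retrieve the matching image by nearest neighbor. All names and the synthetic data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: each concept has an "image" vector and a "text" vector living in
# different representation spaces (hypothetical random features).
n_concepts, d_img, d_txt = 50, 32, 24
img = rng.normal(size=(n_concepts, d_img))
# Text features are an unknown linear function of the image features plus
# noise, standing in for "the same concept in another representation".
true_map = rng.normal(size=(d_img, d_txt))
txt = img @ true_map + 0.05 * rng.normal(size=(n_concepts, d_txt))

# With PAIRED examples we can fit a map between the two spaces.
W, *_ = np.linalg.lstsq(txt, img, rcond=None)  # text space -> image space

# Retrieval: for each description, find the nearest image embedding.
pred = txt @ W
dists = np.linalg.norm(pred[:, None, :] - img[None, :, :], axis=-1)
matches = dists.argmin(axis=1)
accuracy = (matches == np.arange(n_concepts)).mean()
```

The key point the sketch illustrates: the map `W` only exists because we had matched pairs to fit it on. Without paired data there is nothing to align the two representations, which is why a description alone is "zero samples" for a network that was never trained on that pairing.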
> To train that ability you'll need lots of paired examples in both representations.
People rarely need lots of paired examples, except possibly once, when they learn the concept of "paired representation," and even then a few examples probably suffice to learn an extremely abstract concept that is then reused in all subsequent learning. People simply don't learn high-level concepts through statistical clustering.
> that both need to use a representation that has been trained on.
But this training is of a very different nature. As far as I know, no one claims that NNs learn in a way remotely similar to how people learn (nor are they meant to), certainly not when it comes to high-level concepts.