Suddenly, a leopard print sofa appears (rocknrollnerd.github.io)
530 points by anglerfish on June 20, 2015 | 109 comments


This article would not come as a surprise to anyone who works with ConvNets. Sadly, that might not be the case for those outside the field, largely due to the media's inadequate coverage of our advances (but this is common outside our field too). No one in the field really believes ConvNets see better than humans. They are very good single-glance texture recognizers. It's as if you flashed an image and looked at it for a split second without giving yourself a chance to look around and take some time to gain any higher-level scene understanding. If you tried this with this image you might also think you had seen a leopard. Another point to make is not from the modeling side but from the data side. If in the training data the leopard texture is highly indicative of a leopard, then the ConvNet will learn to strongly associate it as such. As the article mentions, a quick hack would be to make sure that your training data contains many leopard-textured items of different classes. You might then expect the ConvNet to seek other features to latch on to and become less reliant on the texture itself.
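
As a rough illustration of that data-side hack (a toy sketch of my own, not from any particular pipeline): one cheap way to approximate such items is to blend a leopard texture onto images from other classes while keeping their original labels. The file names here are made up.

  # Toy sketch: overlay a leopard texture onto non-leopard training images,
  # keeping the original label, so the texture stops being a reliable cue.
  from PIL import Image

  def add_texture(image_path, texture_path, alpha=0.35):
      img = Image.open(image_path).convert("RGB")
      tex = Image.open(texture_path).convert("RGB").resize(img.size)
      # Image.blend mixes the two pixel-wise: (1 - alpha) * img + alpha * tex
      return Image.blend(img, tex, alpha)

  # A sofa image keeps the label "sofa" even though it now looks leopard-printed.
  add_texture("sofa_0001.jpg", "leopard_texture.jpg").save("sofa_0001_leopard.jpg")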

Also, we carried out an experiment on ImageNet and the outcome was that "One human labeler (me, incidentally) with a fixed amount of training and a slightly-above average determination reached ~5% top-5 error on a subset of ImageNet test set". The media sees this and it immediately gets spun to "AI now Super-Human. And we're all going to die." It makes a lot of us cringe every time.

Many people in Computer Vision now consider ImageNet "squeezed" out of juice - we're good at texture recognition and at recognizing objects in plain view, and we're now searching for harder tasks with more dynamic range relative to human performance, in areas such as harder 3D/spatial tasks, Image Captioning, Visual Q&A, etc. The hope is that these harder datasets might in turn guide us in developing models with more nuanced understanding.


Might you by chance be familiar with Rodney Brooks' work on subsumption architectures [1]? If not, I would summarize the underlying idea (my words not his) as "don't try to jump too many layers of abstraction in one go" [2].

So I wonder to what extent you would consider this a predictable outcome from the classifier in question not being part of a subsumptive architecture --- which at a guess would look like

  - glance/texture responses, fed into
  - boundary-recognition layers, fed into
  - object persistence/tracking layers, fed into
  - abstract scene reasoning
It seems to me, as a non-vision researcher (I mainly worked in planning and control), that the most obvious counterargument to the image being a spotted cat is based on boundary/object/scene reasoning, and that it's "reasonable" for the texture/glance layer to say "looks a lot like a cat texture".

[1] https://en.wikipedia.org/?title=Subsumption_architecture

[2] I realize this may seem, superficially, anathema to deep network research, which advocates letting the network find its own intermediate levels of abstraction. But it's actually compatible in my view because Brooks advocates (again, paraphrasing quite a bit) that the separate layers should have different objective functions, and that in fact the need for different objective functions (in a prioritized order) is the cause of emergent layering in nature. "First, don't die. Second, find shelter. Third, find food etc." So one can imagine deep networks each finding their own locally useful abstractions for each objective function in the "Maslow" chain, while still having some macro architecture that tracks human-imposed design principles.


The limit is that training cannot force abstraction. You can only reach abstraction if you have enough neuron space and the dataset is big enough to avoid over-fitting to textures.

The problem is, human vision doesn't work just by feeding in a bitmap. We have structure to decode spatial relationships, shapes, and maybe even shadow/light relations. There's no way we're going to see a classifier working on raw color arrays match our vision capabilities.


Seems simple enough to feed an NN with that abstracted data.

However, the advantage to the texture approach is it's abstracted from a lot of other information. You don't want a classifier to say sofa, when it's a picture of a person on a sofa.


But then you're biasing it toward your perception:

http://www.bespokesofalondon.co.uk/assets/Uploads/bespoke-so...

Anyway, it does work perfectly well if that's what you need, but most proponents are trying to use deep NNs to classify 'as good as humans do'.


No, I actually would not think I saw a leopard. Humans are really good at recognizing things with faces and legs. Those humans that didn't have the ability to recognize a leopard in a split second were already eaten thousands of years ago.


Hmm, in your opinion do you think this would be a good technique then for digitizing paper maps? And if so, could you point in the direction of a library or textbook you'd recommend?


Do you mean cartographic images? Like a map of your city or state -> machine learning -> some data like Google Maps?


The next step is video. Adding a temporal dimension will emphasize extrapolating the true 3D shapes of recognized objects.


Not just 3D shapes, but also understanding actions as they develop over time, with recurrent neural networks.


What if you took these same Neural Networks as they exist now, and tweaked the input and the parameters slightly? For the input, use individual frames of an hour-long video of a leopard (in order), and instead of having it just identify whether or not there is a leopard, have it identify what in each image is the leopard, and have it try to predict the next frame.

It seems that this is more like the way that we learn to identify things. Then once we establish an understanding of a base class (big cat) we can apply that same model to new cats that we have never seen before with just a picture.


I think a recurrent neural network is more suitable here as the size of context for each example might be different, instead of fixing the frame size.


Somewhat of a tangent, but this made me think of a quote found on HN last year:

Context: Evolutionary algorithms and analog electronic circuits

> One thing stands out when you try playing with evolutionary systems. Evolution is _really_ good at gaming the system. Unless you are very careful at specifying all of the constraints that you care about you can end up with a solution that is very clever but not quite what you had in mind. Here power consumption is the issue. If you tried to evolve a sturdy chair you might end up with something that is 1mm tall. or maybe a fuel efficient car that exploits continental drift.

I think it's the same here: the net is never gonna be better than what it needs to be, and it is probably always gonna take the easy route.


You don't even need a neural net for that, take any global optimization method and give it a somewhat ill-defined scoring function, it will instantly run circles around you laughing.


There's an alife program called DarwinBots where small bots powered by mutating code compete against each other to survive and reproduce.

Given enough time, you'd expect them to develop clever behaviors, but instead they just fuzz-tested the sim and locked in on exploits of bugs or environment settings. They only got a bit more clever when connecting different sims running on different conditions.

Eyes already use different kinds and densities of sensors optimized for either detail and color or movement/edges. I wouldn't expect a single learning method, even after optimizing it to its limits, to be above what two or more layers of different methods could do, especially when trying to avoid exploits like the tank story.


> Given enough time, you'd expect them to develop clever behaviors, but instead they just fuzz-tested the sim and locked in on exploits of bugs or environment settings.

Classic A-life! Also, not so different from the spirit of actual biology.

> They only got a bit more clever when connecting different sims running on different conditions.

Diversity is very important for evolution on many levels. What many don't realize (especially, I note, evolution deniers) is that the ecosystem as a whole provides a very complex and continually varying epiphenomenal fitness function to any given organism.


If you don't have a sufficiently complex genotype-phenotype mapping and the system is not evolvable (see Günter Wagner's work), then you shouldn't expect more complex phenotypes. Understanding the genetic representation is going to be an important step toward open-ended evolutionary systems.


> They only got a bit more clever when connecting different sims running on different conditions.

That's part of the reason why a lot of these nets are trained with added noise, as well as drop-out (randomly disabling 50% of the hidden neurons at every training step).

The drop-out tactic especially is effective at preventing "exploits" of the neural-net kind, which otherwise appear in the form of large correlated weights (really big weights depending on other really big opposing weights to cancel them out--it works, but it doesn't help learning).

Either way, adding noisy hurdles helps because exploits are usually edge cases, and noise makes them less dependable, as the region of fitness space very close to an exploitable spot is usually not very high-ranking at all (which is why you don't want your classifiers ending up there).
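
For the curious, here's a minimal numpy sketch of the drop-out idea described above, in the common "inverted" formulation (the 50% rate and the array shape are arbitrary):

  import numpy as np

  def dropout_forward(activations, p_drop=0.5, training=True):
      # "Inverted" dropout: zero out a random fraction of the units during
      # training and scale the survivors by 1/(1 - p_drop), so the expected
      # activation is unchanged and nothing needs rescaling at test time.
      if not training:
          return activations
      mask = (np.random.rand(*activations.shape) >= p_drop) / (1.0 - p_drop)
      return activations * mask

  hidden = np.random.randn(4, 8)      # a small batch of hidden-layer activations
  print(dropout_forward(hidden))      # roughly half the entries are zeroed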


Darwinbots uses actual computer code to control the robots. This makes it really hard for evolution to work with. Most mutations just break the code, and very very few mutations create anything interesting. And the simulation is too slow to explore millions of different possibilities to make up for the difficulty. What makes it worse is they are usually asexual.

However I think that's ok. Most of the fun with DarwinBots is programming your own bots. There used to be (still are?) competitions where people wrote their own bots and had them compete under different conditions.


> So I guess, there's still a lot of work to be done.

And I think this is the most interesting part.

One of the most depressing things about all of the "this image recognition algorithm performs better than humans on this task" claims is the idea that we've pretty much solved the problem, and it's just a matter of some more optimization and tweaking to handle a few edge cases.

This kind of problem, where the dominant solution simply gets it so wrong, and the problem cases are uncommon enough that any statistical solution is generally going to treat them as noise, reveals that there is in fact likely plenty of room for entirely new, novel ways of approaching the problem to handle these kinds of cases better.

It's actually more exciting that there's so much more to be done, than to say "well, it's basically a solved problem, we just need to do some tweaking and optimization."


Definitely agree here, want to add a link to this paper which shows how far we still have to go (and questions whether the current models will ever replicate human vision): http://arxiv.org/abs/1412.1897


This reminds me of one of Richard Feynman's famous quotes: “We are trying to prove ourselves wrong as quickly as possible, because only in that way can we find progress.”

Indeed, discovering these "broken" edge cases is exactly what we need to converge upon a more correct solution.


https://neil.fraser.name/writing/tank/

This is a classic story of a neural net failure.

The net was able to find tanks hiding in the trees with amazing accuracy. Too amazing. It turned out the photos of the hidden tanks were all taken on a cloudy day, while the images without tanks were taken on a clear day.


Thank you for this article; very thought provoking.

My nitpick:

> When each student was given a heavy book of MNIST database, hundreds of pages filled with endless hand-written digit series, 60000 total, written in different styles, bold or italic, distinctly or sketchy.

> ...

> So, are you going to say that was not the case?

I understand the point the author is making. Human brains are really good at taking limited examples and correctly extrapolating them to new cases. That is, of course, the goal of intelligence. Machine Learning has gotten better at this generalization, but has a long way to go. And ConvNets as they exist today will not achieve that, no matter how much training you perform on them.

This specific example is inaccurate though. Let us aggressively simplify and low-ball by saying that humans see at 24fps. Humans of course don't see in discrete frames, but this simplification doesn't detract from my argument and makes quantifying easier. So, if you give a human a single page of numbers, and they look at it for an hour, they have now seen >86k examples. That's 86k examples with twitching saccades, and from both eyes. That's in just an hour of looking at numbers.

Prior to being given that page of numbers, most children will have been alive for 4-5 years. That's 3 billion examples from a wide variety of subjects (we ignore sleeping cycles, because we're already low-balling this fps figure, and because the brain is still learning and visualizing during sleep).
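
(For what it's worth, the back-of-the-envelope numbers check out:)

  fps = 24                        # deliberately low-balled "frame rate" for human vision
  per_hour = fps * 60 * 60        # 86,400 "examples" from an hour of looking
  per_4_years = per_hour * 24 * 365 * 4
  print(per_hour, per_4_years)    # 86400, 3027456000 -- roughly 3 billion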

And humans are born with a pre-built visual cortex. Edge detection, gradient detection, etc. are all already built for us. CNNs learn that from scratch.

The author's real point is still valid, though, don't get me wrong. I'm just nitpicking.


Humans don't need to do data augmentation over several thousands of frames just to learn to recognize an object or image.

In the words of Joshua Tenenbaum and coauthors, "human children learning names for object concepts routinely make strong generalizations from just a few examples".

You can check this out for yourself on the brilliant illustration that went with it: http://i.imgur.com/5axtXSo.png From Tenenbaum J.B. et al, "How to grow a mind: statistics, structure, and abstraction," March 2011, Science, DOI:10.1126/science.1192788.


Forget seeing a symbol once, you can recognise and represent a symbol without ever having seen it.

Test your humanness; draw these symbols:

"Like an E but rotated so the prongs point upwards"

"Like a snake but with two heads. Snakes down, up, down, up, down."

"Like a walking stick with the handle pointing left and looping back around."

(answer for A: Russian letter Sha) (answer for B: Kannada letter Uu) (answer for C: Tamil vowel sign I)


Convnets can do this. Geoffrey Hinton has a wonderful lecture, where he trained a digit recognizer on everything but 7's and 8's (IIRC.) He then let another convnet tell it which numbers looked more or less like 7's and 8's. E.g. "that 9 looks kind of like an 8. That 1 looks kind of like a 7", etc.

And then it was able to correctly recognize 7's and 8's, despite never having actually seen one. I'm simplifying somewhat, but it was super cool.

I don't know why people are so focused on one-shot learning, or think that NNs can't do it. Neural networks learn features from lots of (possibly unlabelled) data. That's the whole point. Once you have those features, you can use them for all sorts of things. You can show it an image, and then measure how close other images are to it. Thereby learning from a single example.
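
A minimal sketch of that last idea (my own illustration, not Hinton's actual setup): given some feature extractor trained elsewhere, classify a new image by nearest cosine similarity to a single labelled example per class.

  import numpy as np

  def one_shot_classify(query, labelled_examples, feature_fn):
      # labelled_examples: dict mapping label -> a single example image.
      # feature_fn: maps an image to a 1-D feature vector, e.g. the
      # penultimate-layer activations of a network pretrained on other data.
      q = feature_fn(query)
      q = q / np.linalg.norm(q)
      sims = {}
      for label, example in labelled_examples.items():
          f = feature_fn(example)
          sims[label] = float(np.dot(q, f / np.linalg.norm(f)))  # cosine similarity
      return max(sims, key=sims.get)

  # Toy usage with a dummy "feature extractor" (just flatten the image):
  flatten = lambda img: np.asarray(img, dtype=float).ravel()
  examples = {"seven": np.random.rand(8, 8), "eight": np.random.rand(8, 8)}
  print(one_shot_classify(np.random.rand(8, 8), examples, flatten))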


That's a bunch of noisy pictures of very similar numbers. You can add noise to the pictures you're training your CNN with; it's not really going to help. It's still texture without structure or context.

You can show me novel symbols, with me only looking at a single example of each for a few seconds, and I can manage good categorization.


Maybe we need a different representation at the upper layers, to capture higher concepts. Perhaps cross domain learning, combining learning from text for example with learning from video/images would help a lot.


On the other hand, nobody has been upset that humans are constantly misidentifying that jaguar print sofa as a leopard print.


Well-played!

I myself have researched leopard spots since I painted our toilet floor in them. It's a lead sheet, and the paint had worn off, which probably wasn't the healthiest thing. My housemates had filled the toilet with memorabilia from an African trip, so leopard-print paintjob it was.

Which entailed looking up leopardprint online. Very little of which actually looks like leopard rosettes, and now I have a problem with almost anything trying to pass itself off as leopardprint. Anyway, I can't say that my paintjob is a particularly good reproduction, but at least it's 'spiritually correct'... :)


They would be if you ran screaming from someone's living room because of the thing in the corner. I think it's because, from a practical point of view, it's a different category of error?


Leopards (or jaguars) are complex 3-dimensional shapes with quite a lot of degrees of freedom (considering all the body parts that can move independently). These shapes can produce a lot of different 2D contours.

My son keeps telling me that infants are fine with, say, a truck transforming into a clown (when it emerges from the other side of a visual barrier) but not with it transforming into TWO of something. Apparently, babies subjectively experience this (visual transformation) all the time -- mom moves a plate and what seemed like a big circle is now a flat line or whatever.

So humans apparently get tons and tons of experience with visually mapping 3d reality to mere 2d imagery. I have been thinking somewhat about this of late, in terms of physical attractiveness or "image" -- that pictures of a woman posted on a blog capture a 2d version of her, but people interacting with her are interacting with a 3d living, moving creature who also has smell and a voice, and her movements may be elegant or may not be elegant. Which is a thought process relevant to a project of mine, something people here surely will have no interest in. But where it is relevant to this article is that we are doing this wrong: humans have thousands of hours of practice of looking at 3d reality and figuring out how to interpret 2d images as representative of that 3d reality. Image recognition software is just dealing with 2d images. I don't see how it can hope to compete. Humans don't come preinstalled with the software to make that distinction. We acquire it with enormous repetition.

When do we make a robot and give it some baseline parameters and a learning algorithm (and set it loose in 3d reality and have it learn)? That is when we can get scared about human-like AI that can compete on image recognition.


Indeed, arguments like "My mother didn't have to purchase 30,000 mugs to teach me what one looks like" miss the fact that we humans spent so much time (almost all of our waking time in fact) interpreting 3D reality, an endless stream of repetitive tasks.


> but not with it transforming into TWO of something

I suppose that's why Banach–Tarski is considered a paradox.


https://www.imageidentify.com/ correctly identified the image as "a small sofa". I think rotating the image is questionable, since there could be an algorithm for first orienting the image correctly based on light and shadow, and then the image recognition could be run.

https://www.imageidentify.com/result/1ixb9603m9ix1


But those algorithms would be very limited in usefulness. Systems such as imageidentify.com should be at least trying an ensemble of algorithms, many of which I suppose should be invariant under translation and rotation.

Edit: There's a comment about invariance in this thread [1] and apparently CNNs are not invariant under rotation.

[1] https://news.ycombinator.com/item?id=9750133


We did not train ImageIdentify to be invariant under arbitrarily large rotations. This is fairly easy to do: show the network couches rotated at all angles.
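
For concreteness, that kind of augmentation just means expanding the training set with rotated copies of each image; a minimal PIL sketch (the angles and file names are arbitrary):

  # Sketch of rotation augmentation: duplicate each training image at several
  # angles so the network sees the class in all orientations.
  from PIL import Image

  def rotated_copies(path, angles=(0, 90, 180, 270)):
      img = Image.open(path)
      # expand=True grows the canvas so the corners aren't cropped off
      return [img.rotate(angle, expand=True) for angle in angles]

  for i, copy in enumerate(rotated_copies("couch_0001.jpg")):
      copy.save("couch_0001_rot%d.jpg" % i)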


It could be that the dataset is so large that there are very close matches to any given common picture, and the algorithm is actually awful at picking up details.

Giving it an image that we know has all the relevant details of a sofa, but that likely won't have close matches in the dataset, can give us an idea of how clever it is.


Suddenly, a shaded rock appears.

https://upload.wikimedia.org/wikipedia/commons/7/77/Martian_...

We're doing humans wrong. Maybe not all wrong, and of course, humans are extremely useful things, but think about it: sometimes it almost looks like we're already there. There's always going to be an anomaly; lots of them, actually, considering all the things shaded in different patterns. Something has to change.

I agree that we aren't there, but we'll never be there; every system can be fooled, it's just a question of 95%, 99% or 99.99%.


> We're doing humans wrong.

I don't follow. If you asked a human what the linked image looked like, they'd likely say a face, but if you then asked them what it actually was, they're all going to change their answer to a rock, even specifically a rock on Mars (if given a colour version of this image).

It's true that humans see patterns that aren't there, but does that detract from our ability to recognise objects?


This analogy doesn't hold, because the whole point of these classifiers is to classify things the same way humans would.

The fact that all classifiers -- including human beings -- fail in some cases is a separate issue. The goal is to create a computer classifier that succeeds and fails in the same cases humans do.


Magicians would be out of work if our vision were 'flawless.'


This is fascinating and well written.

I tried the unrotated sofa image on Wolfram's ImageIdentify and it correctly identified a settee [1]. So it presumably gathered that from the shape of the image rather than the pattern. It is peculiar though that it can't see the shape under a simple rotation. Or perhaps the margin of confidence levels between sofa and leopard were so narrow that a rotation was enough to tip it in favour of the leopard? I'd be interested to see the inner workings of this.

[1] http://i.imgur.com/6f6Co5O.png


I tried Wolfram ImageIdentify with a bunch of bicycle photos and it insisted on identifying them as "Bicycle Saddle".

I kept trying different ones and it kept identifying as "Bicycle Saddle"...


To be fair that is one of the few things on a bike that isn't a triangle.


hmmm, good point!


I had some success with pictures of sofas from a top view, but a lot came out as complete nonsense like "nail" or "light bulb". It seems that for at least some patterns it is trained to view things in the particular orientation you would normally see them in. I imagine that if it did exhaustive rotation searches on a lot of objects the results would often be completely incorrect.


Really wasn't expecting a Terry Davis comment at the bottom.


I really couldn't comprehend what that was all about and had to go read a Vice article about him to understand.


His comments rather reminded me of the network identifying the couch as a Jaguar. He seems to have similar network issues.


Who is Terry Davis?


I didn't know either, but found this: http://motherboard.vice.com/read/gods-lonely-programmer

Scary and fascinating.


I felt like Alice down the rabbit hole after getting sucked into reading about TempleOS. Quite an unexpected side-effect of reading an article on neural networks!


The MNIST analogy reminds me of the "Teaching Me Softly" article that was posted here last year:

> When Vladimir Vapnik teaches his computers to recognize handwriting, he does something similar. While there’s no whispering involved, Vapnik does harness the power of “privileged information.” Passed from student to teacher, parent to child, or colleague to colleague, privileged information encodes knowledge derived from experience. That is what Vapnik was after when he asked Natalia Pavlovich, a professor of Russian poetry, to write poems describing the numbers 5 and 8, for consumption by his learning algorithms. The result sounded like nothing any programmer would write. One of her poems on the number 5 read,

> He is running. He is flying. He is looking ahead. He is swift. He is throwing a spear ahead. He is dangerous. It is slanted to the right. Good snaked-ness. The snake is attacking. It is going to jump and bite. It is free and absolutely open to anything. It shows itself, no kidding. Brown_Cornerart

> All told, Pavlovich wrote 100 such poems, each on a different example of a handwritten 5 or 8, as shown in the figure to the right. Some had excellent penmanship, others were squiggles. One 5 was, “a regular nice creature. Strong, optimistic and good,” while another seemed “ready to rush forward and attack somebody.” Pavlovich then graded each of the 5s and 8s on 21 different attributes derived from her poems. For example, one handwritten example could have an ‘‘aggressiveness” rating of 2 out of 2, while another could show “stability” to a strength of 2 out of 3.

> So instructed, Vapnik’s computer was able to recognize handwritten numbers with far less training than is conventionally required. A learning process that might have required 100,000 samples might now require only 300. The speedup was also independent of the style of the poetry used. When Pavlovich wrote a second set of poems based on Ying-Yang opposites, it worked about equally well. Vapnik is not even certain the teacher has to be right—though consistency seems to count.

http://nautil.us/issue/6/secret-codes/teaching-me-softly

That article in turn reminded me strongly of "Metaphors We Live By" by Lakoff & Johnson, and the works they have written since, where they claim that humans make sense of the world using systems of rich, conceptual metaphors. As I understand, the work is well-known to machine learning researchers.


Obviously these classifiers do often focus on patterns, rather than shapes, and that's probably something that could be worked on, but I don't think an image classifier can possibly be expected to, at the level it is operating, identify the leopard-print sofa all on its own. Clearly there's a higher order process at work than image recognition here - after all, when a human is faced with a sofa-shaped object with a leopardskin pattern on it, there are two hypotheses that need to be evaluated: 1) this is a sofa patterned to look like a leopard; or 2) this is a leopard, shaped like a sofa. Rejecting the less plausible of those two scenarios is obviously a higher-order activity. If the image classifier is at least firing off the concepts 'leopard' and 'sofa' with some level of probability, it's doing its job pretty well.


Then we need to integrate higher order knowledge about the world collected from text (Wikipedia and the like).


Obviously!


OP here, and thank you kind sirs and ladies for your feedback.

I'd just like to answer the recurring objection: yes, our visual experience contains a lot of frames, and that seemingly refutes my MNIST example; however, you do forget about the other part of a supervised dataset, namely labels. Do we have a label provided for each thing we see in our life? Obviously not. How much time do you need to familiarize yourself with a new entity, like an unknown glyph or symbol? I can't provide a concrete example, but I guess a single math class was enough for all of you to recognize all the digits the next day. You can test it right now by looking at some unknown alphabet and then looking at it again upside down - you'll recognize it perfectly, except for mental rotation issues (which occur even for well-known letters and symbols).


> I guess a single math class was enough for all of you to recognize all the digits the next day.

I'm curious what makes you think that. My experience with what's going on at my son's school is telling me that the children spend a massive amount of time on getting recognition of digits and letters right.


When I was living in China I had difficulty recognizing handwritten 9's and 1's. Their 9's are less half-circular than Westerners' are, and the 1's have a long stroke at the top that looks like a sloppily written 7 to me. I would frequently look at a handwritten number and have to analyze what it was.


> How much time do you need to familiarize yourself with a new entity, like an unknown glyph or symbol?

I think you're looking at your subjects wrong - don't pretend your computer is an adult (which has been learning for most of its life) - rather, consider it an infant learning letters/digits/objects for the first time. Doing so, you might come across a learning curve similar to the one you have described. With the additional caveat that we (or at least I) don't know how to bring the computer to the level of a fully grown man.

As for the problem presented in CNN, if the problem is not having the structure, why not gray-scale the structure as a secondary level for the CNN?

I'm not really from the field so excuse me if this was complete BS


What is the upside down experiment meant to prove? Seems to me that the mental rotation issues indicate that mental image processing is not very rotation tolerant, but rather needs a hardwired (and slow) counter-rotation step added to cope with rotated symbols, which you could just as well tack onto a neural network. Am I missing the point?

Also, being able to consciously recognize letters is relatively easy, but the normal reading process, with which people recognize well-known letters and instantly unconsciously convert them to sounds, does require quite a bit of repetition of those letters before it starts to kick in...


That's right, CNNs tend to be more concerned with textures rather than with overall shapes. One problem is that CNNs discard a lot of pose information of the detected features during pooling. Another problem is that there is no top-down verification such as "hmm, leopards always have {heads, legs, tails ...}, I should scan the input for these... nope, it doesn't fit at all, I should exclude everything that has {heads, legs, tails ...} from my interpretation." In a human brain that likely happens in some distributed fashion without considering individual classes, but by just inhibiting everything that can't be verified upon looking twice (or more times).
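
A toy illustration of the pooling point: max-pooling reports that a feature fired somewhere in a region, but not where, so quite different spatial arrangements can pool to identical maps.

  import numpy as np

  def max_pool(x, size=2):
      h, w = x.shape
      return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

  # Two different spatial arrangements of the same "detected features"...
  a = np.array([[1, 0, 0, 0],
                [0, 0, 0, 0],
                [0, 0, 0, 0],
                [0, 0, 0, 1]])
  b = np.array([[0, 0, 0, 0],
                [0, 1, 0, 0],
                [0, 0, 1, 0],
                [0, 0, 0, 0]])
  # ...pool to exactly the same map: the "where" within each region is discarded.
  print(max_pool(a))   # [[1 0]
  print(max_pool(b))   #  [0 1]] for both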


A while ago I tried an interesting experiment. I fed a famous psychology image into a bunch of different image recognition systems.

This image: https://i.imgur.com/2aCqMx2.png

And here are the results: https://imgur.com/a/8ndyq

This doesn't really prove anything, but I thought it was interesting. It is, of course, unreasonable to expect ML algorithms to perform decently so far outside of the space they were trained on.

But I suspect that part of the reason they don’t do well is that they are purely feed forward. Humans also don’t see the image at first. It takes time to find the pattern, and then everything clicks into place and you can’t unsee it.

This might have something to do with recurrency. But more importantly, information feeds down the hierarchy as well as up. Features above, give information back down to features below. So once you see the dog, that tells the lower level features that they are seeing legs and heads, which says they are seeing outlines of more basic 3 dimensional shapes, and so on.

I think it also requires a decent understanding of 3D space, to fit the observed pattern to 3D models which could have produced it. I'm not certain if regular NNs observing static images are optimal for learning that.

More here: https://www.reddit.com/r/MachineLearning/comments/399ooe/tes...


For context, ImageNet does have a sofa category for labels :/

http://image-net.org/search?q=sofa


The Caffe models are only trained on the 1000 category subset of ImageNet used for the competition: http://image-net.org/challenges/LSVRC/2014/browse-synsets

There are no sofas in this list; the closest thing I can find is a "studio couch, day bed": http://imagenet.stanford.edu/synset?wnid=n04344873


Right. The "studio couches" are couches.


Just to be sure, Wolfram Alpha Image Identify was correct, down to the particular type of sofa.

https://www.imageidentify.com/result/1ixb9603m9ix1


Google image search says

  Best guess for this image: cat bed furniture
https://goo.gl/vXwSaj


That's cheating though; you can see the first few results are for the exact same image with a dog pasted on it, and contain those terms.


ConvNets have gotten popular because of their strong empirical results. All the recent work on visualizing CNNs suggests that the community working on Deep Learning still has a lot to learn about their own algorithms.

But high-level notions like a Jaguar is a cat-like animal aren't necessary to perform well on an N-way classification task like ImageNet.

What's more important to note is everybody knows there's plenty wrong with a pure appearance-based approach like CNNs. Every few years a new approach pops up that is based on ontologies, an approach inspired by Plato, etc, but these systems require a lot of time and effort. More importantly, they don't perform as well on large-scale benchmarks. In the publish-or-perish world, you can jump on the CNN bandwagon or start reading Aristotle's metaphysics and never earn your PhD.


I doubt that reviewers for NIPS would reject a paper with a novel approach because it didn't perform at a best-in-class level, provided it offered a way forward.

If it doesn't work at all, or isn't a new idea, that's different.


I have seen some Kaggle competitors do image transformations and put the data back into the training set to increase the robustness of the classifier. For instance, rotating images, slightly skewing them, etc.

I would propose that for this leopard problem, instead of just skewing the images, you also perform transformations on the COLOR and put the images back into the training set.

Maybe applying certain filters, such as dimming the saturation or contrast of images, so that the contrast between the leopard spots was less visible (i.e. "a leopard in low lighting") - maybe this would force the neural net to learn more than just its print.

Knowing the right set of color filters to apply to all images could be tricky though.
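
A rough sketch of what those color transformations could look like using PIL's ImageEnhance (the enhancement factors here are arbitrary guesses, not tuned values):

  # Sketch: add washed-out copies of each training image (lower saturation and
  # contrast, e.g. "a leopard in low lighting"), keeping the original label.
  from PIL import Image, ImageEnhance

  def color_variants(path, factors=(1.0, 0.6, 0.3)):
      img = Image.open(path).convert("RGB")
      variants = []
      for f in factors:
          washed = ImageEnhance.Color(img).enhance(f)        # reduce saturation
          washed = ImageEnhance.Contrast(washed).enhance(f)  # reduce contrast
          variants.append(washed)
      return variants

  for i, variant in enumerate(color_variants("leopard_0001.jpg")):
      variant.save("leopard_0001_color%d.jpg" % i)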


Let's say I didn't want to use the ImageNet or CaffeNet pre-trained models but wanted to train my own model (say, of thousands of images of sofas, leopards, jaguars, and cheetahs); are there any tutorials that walk through the process of building a CNN on your own data?

(I've seen the comments like https://news.ycombinator.com/item?id=9584325 and watched the lectures and youtube walkthroughs, but they're all theoretical and I'm looking for documented code to go along with that theory)


For doing image training, Caffe is pretty good. Here's a starting point: http://caffe.berkeleyvision.org/tutorial/
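
If it helps, once you've worked through that tutorial and written the network/solver prototxt files it describes, the Python side of kicking off training is short. A sketch using pycaffe, with placeholder paths standing in for your own files:

  # Sketch of driving Caffe training from Python (pycaffe). The prototxt and
  # caffemodel paths are placeholders, not real files.
  import caffe

  caffe.set_mode_cpu()    # or caffe.set_mode_gpu()

  # solver.prototxt points at your train/val network definition and sets the
  # learning-rate schedule, max iterations, snapshotting, etc.
  solver = caffe.SGDSolver("models/sofas_vs_leopards/solver.prototxt")

  # Optionally fine-tune from pretrained weights instead of a random init:
  # solver.net.copy_from("bvlc_reference_caffenet.caffemodel")

  solver.solve()          # trains until max_iter from the solver config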


Yes, ConvNets are limited and results are pretty arbitrary sometimes.

The net correctly identified "leopard". Was it taught about sofas? Who knows, maybe Sofa had a high score as well on the output.

Or, look at the Dalmatian/Cherry picture. The net identified "Dalmatian" which is a 100% valid response! But whoever labeled it wanted "cherry". The picture is 50% cherry 50% dalmatian.

Pictures often have more than one element and a pure ConvNet is "one picture to one label"


Ha! I was wondering this exact thing. How could it identify a sofa if it never learned about one? It seems very contrived. At the same time, the example is kind of clear, but we are making neural nets that are emulating Pollock and Dali, not Monet. If you can dig it. The whole beauty is the overlapping interstitial matrix of weighted values that leads to these beautiful discoveries. To rank such a fractal-like algorithm on whether or not it predicts a label satisfactory to the average human is to mis-apply the elegance and potential of these mathematical wonders, in my humble opinion.


This is the best HN submission I've seen in a very long time. Really thought-provoking.


I agree. I don't work with machine learning or neural nets, and thus don't have more than the most cursory layman's understanding of them. This article read really well and was quite informative about the problems this technology is facing.


Is there any work into building self-verification into these type of networks? For example based on hierarchical categories of concepts?

If part of the network is trained on the concept of a cat, and whether or not an image is a cat is fed into training of the leopard, it seems like the problems would be avoided. Or is the notion that with enough training data and deep enough networks the concept of "leopard is cat" will be learned?


We spend so much effort trying to engineer intelligence, when we would get a lot farther reverse engineering intelligence. Whenever AI makes a big advance, the analog was already known by neuroscientists. There is also clearly no comprehension of the importance of the topology (circuitry) defining a neural network. We always assume a fully connected network, and draw them out as such, but we don't stop to consider that many of those Wijk interactions are completely spurious, meaning they have no information-bearing role. If you strip them away you'll start to reveal the underlying circuit at work. I've published theoretical results using artificial gene networks, but the results should be similar for ANNs. http://m.msb.embopress.org/content/4/1/213.abstract


> We spend so much effort trying to engineer intelligence, when we would get a lot farther reverse engineering intelligence. Whenever AI makes a big advance, the analog was already known by neuroscientists.

The problem with attempting to understand intelligence by reverse engineering the human brain is that we cannot know a priori which aspects of the human brain are necessary for intelligence to arise, and which are merely consequences/side effects of biology and chemistry. Once we discover some technique that works in a practical setting (e.g. on ImageNet), then it is fairly straightforward to find the biological analogy in the brain.

In fact, Geoff Hinton explicitly advocates an approach of "try things, keep what works, and figure out how it relates to the brain". The inverse is like finding a needle in a haystack.

> There is also clearly no comprehension of the importance of the topology (circuitry) defining a neural network. We always assume a fully connected network, and draw them out as such, but we don't stop to consider that many of those Wijk interactions are completely spurious, meaning they have no information-bearing role.

The purpose of training a deep neural network from data is to automatically discover what the topological circuitry of the network should be, rather than engineering it by hand. In the brain, some prior knowledge is encoded via genetics, while the rest is learned. The effect of sparsity of the weights in deep neural networks is an active area of research [1].

> If you strip them away you'll start to reveal the underlying circuit at work. I've published theoretical results using artificial gene networks, but the results should be similar for ANNs.

Very interesting. If I understand correctly, the cost you are attempting to minimize is phenotypic variation, which you measure as the gross cost of perturbation (GCP). Would this cost be analogous to sensitivity to adversarial examples in the case of convolutional neural networks [2]?

[1] http://www.jmlr.org/papers/volume14/thom13a/thom13a.pdf

[2] http://arxiv.org/abs/1412.6572


Isn't the solution to simply have two NNs? One trained to identify leopards, another trained to identify sofas.

Regardless of how computationally expensive NNs may be now, wait a few years, and then train millions of them on different classes of objects and run them concurrently to identify new pictures.


I don't work with NNs at all, but it kind of seems like the author set up their CNN with a large enough filter to see a whole spot at once, but not large enough to see a whole cat at once. Then he complains that it doesn't know what a cat looks like. Would it be possible to make a larger filter with a lower resolution such that the overhead is the same as the smaller filter but it can get a higher-level view of the image?

Also, the author spends the first section of the article determining that it is in fact a jaguar-print sofa (which the model also confirms) but continues to throw around the word "leopard". They're not making it any easier for the future machine learning algorithms that try to identify an image by the text surrounding it. ;)


TinEye tells me that the leopard print "sofa" is actually a dog bed ;) [0]

[0] http://www.tineye.com/search/4c4ce7b6558e8d3c4dd443439e80556...


Really, humans are not so different. Online and in person, we pattern-match on a few scant signals, sometimes jumping to ridiculous conclusions as a result. (1) Granted, we have much better machinery for recognizing the kinematics of fellow animals. There's compelling evolutionary reasons for animals to become really, really good at this.

If one paired the present classifiers with Amazon Mechanical Turk, just providing one bit of information -- "is-it-an-animal?" -- I wonder how well the current classifiers would fare in relation to human beings?

(1) - Ironically, the more "cosmopolitan" people become, the quicker they are to jump to such conclusions!


Well, you can try this one; it is open source and based on Caffe: http://imgdetect.alexgirard.com/# The image link: http://rocknrollnerd.github.io/assets/article_images/2015-05... It says 'studio couch, day bed' with 97% confidence. Very likely because it is part of ImageNet.


http://cloudsightapi.com/

This, given the leopard couch, returns "brown leopard print couch".


> the problem won't be solved by collecting even larger datasets and using more GPUs, because leopard print sofas are inevitable.

The models have room for improvement, but it's not clear to me that larger datasets won't solve the problem. Larger datasets and more processing power is exactly why neural nets have surged in effectiveness recently. Who knows how much further current models can go with more data and processing power?


read the Hinton paper cited in the OP -- no amount of processing power will make CNNs represent structured latent variables


That is neat. Just for fun I tried to figure out how to set up Caffe on EC2. I got it to work, but I didn't get CUDA up and it probably would have been faster with Anaconda. But for what it is worth: here are my current notes https://github.com/JohnMount/CaffeECSExample


Sorry to go off-topic, but the way that font combined 'st' (with a loop back from the 't' to the top-center of the 's') was very visually distracting to me. Did that bother anyone else?


How about detecting vector edge shapes and unifying that result with the existing classifier? Surely a leopard sofa cannot have the same edge vector shape as a real big cat.


That's an idea from 1970-1980s AI, called the "primal sketch" model. The concept was to take an image and try to turn it into a line drawing, then extract the topology and geometry.[1] Further processing might yield a 3D model.

This sort of works on simple situations without too much edge noise. It's been used for industrial robot vision, where what matters are the outside edges of the part. It's not too useful when there's clutter, occlusion, or noisy textures.

More recent thinking is to find surfaces, rather than edges. This works well if you have a 3D imager, such as a Kinect. You can get a 3D model of the scene. Occlusion remains a problem, but texture noise doesn't hurt.

[1] http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/GOME...
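
For reference, the first step of that pipeline, turning the image into an edge map, is the easy part; a minimal Sobel-gradient sketch with scipy (the threshold is chosen arbitrarily):

  # Minimal "first step" sketch: turn a grayscale image into an edge map with
  # Sobel gradients. The hard parts described above (topology, geometry, 3D)
  # all come after this.
  import numpy as np
  from scipy import ndimage

  def edge_map(gray, threshold=0.2):
      gx = ndimage.sobel(gray, axis=0)
      gy = ndimage.sobel(gray, axis=1)
      magnitude = np.hypot(gx, gy)
      magnitude /= magnitude.max() or 1.0   # normalize, guarding against all-zero input
      return magnitude > threshold          # boolean "line drawing"

  edges = edge_map(np.random.rand(64, 64))  # stand-in for a real grayscale image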


I'm happy somebody is trying to put some sense into the whole absurdly overblown machine learning field.


I'm not sure it's absurdly overblown but individual advances/findings can be way overhyped or at least over-generalized. ML/AI has been incrementally delivering pretty impressive results within certain constraints. That's great but there's then a widespread tendency to extrapolate those results to the broader case--and then absurd/stupid-looking results happen.

We certainly see the same thing with autonomous vehicles. Given very accurate mapping and a particular set of environmental and type-of-road conditions, cars can do so well that it's tempting to say they're 95% of the way to fully-autonomous. But dump them in a Boston snowstorm and you see they're really not even close. (Which isn't to say that bounded use cases can't be very useful.)


Overhype has always been the enemy of AI/ML, leading to unrealistic expectations, then disappointment, then distrust.

But it might be different this time...


I wouldn't call it overblown. It has been applied very successfully in a few places; for example, voice recognition.

Right now we're at the stage of the "seven blind men and the elephant". Over time, our eyes will open and things will start making sense.


Voice recognition has gotten a lot better. I'm almost impressed by my Amazon Echo. That said, for an arbitrary recording of say a conference presentation, you need to either use a human transcriber or expect to spend a LOT of time cleaning things up.

(A lot probably has to do with switching to more data-based approaches.)


It's not overblown any more than "cloud computing" is overblown. Misreported, maybe, but there is a huge shift happening in the industry right now.

If not, why else would this exist http://www.nvidia.com/object/tesla-supercomputing-solutions....?


From the outside, there appear to be significant advances; it's just they seem to come in clusters, rather than linearly. The news has turned into science fiction!

It would seem CNNs were a significant step up, but the author hints at inferring structure as the next tack to take.


Maybe the algorithm did detect a face in the couch, as we see them in trees (wood sprites!).


Suddenly Terrence Davis appears!


After rotating the image 90 degrees, the predicted result changes substantially. The author should not be surprised. A convolutional neural network is translationally invariant, not rotationally invariant.


Standard convnets do not contain explicit rotational invariance (unless you include a layer such as this: arXiv:1506.02025v1). They can however learn rotational invariance if you feed them rotated images.
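
A quick toy demonstration of that asymmetry (a single hand-written filter, not any particular network): circular convolution commutes with translation but not, in general, with rotation.

  import numpy as np
  from scipy import ndimage

  img = np.random.rand(8, 8)
  kernel = np.array([[1., 0., -1.],    # an asymmetric, edge-like filter
                     [2., 0., -2.],
                     [1., 0., -1.]])
  conv = lambda x: ndimage.convolve(x, kernel, mode="wrap")

  # Translation: filtering a shifted image == shifting the filtered image.
  shifted = np.roll(img, 3, axis=1)
  print(np.allclose(conv(shifted), np.roll(conv(img), 3, axis=1)))  # True

  # Rotation: filtering a rotated image != rotating the filtered image
  # (unless the filter itself happens to be rotation-symmetric).
  rotated = np.rot90(img)
  print(np.allclose(conv(rotated), np.rot90(conv(img))))            # almost surely False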


Fascinating read.


gi/go


I want that sofa.



