Hacker News

That’s a loaded sentence with two terms I have no idea about: depth 3 and polynomial time.


@averagewall already explained what depth 3 is.

The explanation of poly time is a bit longer. Let me briefly explain the PAC learning paradigm in the context of this paper.

Consider the set of all depth-3 neural networks with n-dimensional input and a k-dimensional hidden layer whose weights have total norm 1. Call this set H. Note that H is a well-defined set, parameterized by the weights of the networks.

Let ∊ be an arbitrary positive error value and let δ be an arbitrary probability value; we are interested in ∊ ≈ 0 and δ ≈ 0. A solution to the PAC learning problem is an algorithm, say A, which takes as input a dataset D generated from a distribution Π, runs in time polynomial in the size of D, and outputs a specific neural network h that lies in H. Furthermore, no matter what Π is, with probability at least 1−δ, the net h that A produces must have error at most ∊ above the optimal error achievable by any member of H. This is the PAC learning paradigm.

The paper assumes that Π is a distribution over the n-dimensional unit sphere, so all examples have the same Euclidean norm, and it constrains the weight vectors to have unit norm. With these assumptions, it gives an algorithm that requires |D| = poly(n, k, 1/∊) samples and runs in time poly(|D|).
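To make the shape of that guarantee concrete, here is a minimal Python sketch of an illustrative sample-size bound of the form poly(n, k, 1/∊)·log(1/δ). The function and the particular polynomial are made up for illustration; the actual bound is in the paper.

```python
import math

def sample_size_bound(n, k, eps, delta):
    """Illustrative (NOT the paper's) PAC sample-complexity bound:
    polynomial in n, k, and 1/eps, logarithmic in 1/delta."""
    return math.ceil((n * k / eps ** 2) * math.log(1.0 / delta))

# Halving the error target eps roughly quadruples the data needed,
# while shrinking the failure probability delta is comparatively cheap.
print(sample_size_bound(10, 5, 0.1, 0.05))
print(sample_size_bound(10, 5, 0.05, 0.05))
```

The point is only that the required dataset size grows polynomially in n, k, and 1/∊, and just logarithmically in 1/δ.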


The abstract says depth-3 means an NN with one hidden layer. So I guess the 3 refers to the input layer, one hidden layer, and the output layer.
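For concreteness, that input → hidden → output structure can be sketched as a forward pass in plain Python (toy weights and a ReLU hidden layer, all chosen arbitrarily for illustration):

```python
def relu(x):
    return max(0.0, x)

def forward(x, W1, b1, W2, b2):
    """Depth-3 (one-hidden-layer) network:
    input layer -> hidden layer (ReLU) -> output layer (linear)."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]

# toy weights: 2 inputs -> 2 hidden units -> 1 output
x = [1.0, -1.0]
W1 = [[1.0, 0.0], [0.0, 1.0]]   # hidden = relu([1.0, -1.0]) = [1.0, 0.0]
b1 = [0.0, 0.0]
W2 = [[2.0, 3.0]]               # output = 2*1.0 + 3*0.0 + 0.5
b2 = [0.5]
print(forward(x, W1, b1, W2, b2))  # prints [2.5]
```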

Polynomial time roughly means "fast enough that it could be practical at large scale if you have enough computing power". It's in contrast to exponential time, which means "no amount of money can ever buy enough computers to do it at scale".

My question is what's the existing learning time for such a network? Maybe it's already just as good but these guys have a proof that it works in general instead of just hoping and finding that it usually seems to be fast enough?


While the idea that "polynomial time is fast and exponential time is slow" is quite popular, it's important to realize that those are descriptions of the growth in computation time for ever larger input sizes.

In practice, you are usually only interested in a limited range of inputs, for which an exponential time algorithm may turn out to be faster. If you really want the fastest algorithm, you need to actually measure the runtime on realistic data.
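A quick sanity check of that point, using made-up cost models (n³ steps for a "fast" polynomial algorithm vs. 2ⁿ steps for a "slow" exponential one): for small inputs the exponential algorithm actually does less work.

```python
def poly_steps(n):
    return n ** 3   # hypothetical polynomial-time cost model

def exp_steps(n):
    return 2 ** n   # hypothetical exponential-time cost model

# Largest input size at which the exponential algorithm is still no worse:
last_exp_win = max(n for n in range(2, 50) if exp_steps(n) <= poly_steps(n))
print(last_exp_win)  # prints 9  (2^9 = 512 <= 9^3 = 729, but 2^10 > 10^3)
```

With realistic leading constants the crossover can sit much further out, which is why measuring on real data matters.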


My question is more about how many shallow neural networks are still in use. The "deep" in deep learning typically means much more than three layers.


You're absolutely right. But this is still a huge step forward. There has been work on the VC dimension of neural networks for a long time (and it's been shown to be finite), which is a necessary but not sufficient condition for efficient PAC learnability.

If it can be done for 3 layers, then maybe it can be done for more. And I happen to really like it when my problems have polynomial time guarantees.

[VC dimension is a quantitative measure related to the complexity that a model's parameters allow it to express. I think of it as analogous to entropy for physical systems.]
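As a toy illustration of VC dimension (my example, not anything from the paper): threshold functions on the real line have VC dimension 1 — they shatter any single point but no pair of points. A brute-force shattering check in Python, probing a finite grid of thresholds:

```python
def shatters(hypotheses, points):
    """True if the hypotheses realize every possible labeling of `points`.
    Each hypothesis is a function point -> bool."""
    achieved = {tuple(h(p) for p in points) for h in hypotheses}
    return len(achieved) == 2 ** len(points)

# Hypothesis class: thresholds f_t(x) = (x >= t), sampled on a grid.
thresholds = [t / 10 for t in range(-10, 21)]
hyps = [lambda x, t=t: x >= t for t in thresholds]

print(shatters(hyps, [0.5]))        # prints True: one point is shattered
print(shatters(hyps, [0.3, 0.7]))   # prints False: (True, False) is unreachable
```

The labeling (True, False) is impossible because 0.3 ≥ t forces 0.7 ≥ t, which is exactly why this class has VC dimension 1.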


Apart from depth-wagging contests, this might become less relevant if we go by the notion "poly time == solved". The reason is that 3-layer neural networks are universal approximators of smooth functions.

I have to read the paper carefully for the assumptions made, but if the result holds for the entire class of 3-layer NNs, or for a class big enough not to sacrifice universal approximation, then this would be a Big Deal.

Of course for practical applications, poly-time may not mean that it is a solved problem, or that the poly-time algorithm is the best algorithm to use on a typical instance. The exponent or the leading constant could be very high, and deeper networks may well offer more conducive properties.


I think this assessment of what's "typical" may just be based on posturing. For some reason you can't brag about training a relatively shallow neural network (is efficiency not valued in this field?).

My counter-assessment is that the space of problems you solve just by making your NN deeper is very small. Such problems clearly exist, but I've seen few compelling examples outside of image classification.



