This is joint work with Santosh Vempala and with one of his students.
And probably everyone in this room has at least a passing familiarity with the amazing advances in machine learning that have been gracing newspaper covers for the past couple of years. So, for example, we now have AIs for the game of Go that can beat 9-dan Go players. So that's pretty cool.
And, depending on how you frame the task anyway, we can now train computers to classify images more accurately than humans. We have self-driving cars, and amazing speech recognition systems on our phones. And underlying all of this stunning empirical work is one family of algorithms, or algorithmic techniques, called deep learning. On the other hand, despite this empirical success of machine learning using deep learning, there isn't a whole lot of theoretical explanation for why it should be the case that these algorithms succeed as well as they seem to in practice. So that's the motivating question for this talk: why is it that deep learning works? Let me state formally the problem that we're trying to solve here: supervised learning.
You have some family of concepts, or functions, let's say real-valued, over some n-dimensional input space endowed with some probability distribution. I'm going to secretly take one function from this family, call it f*, and I'm going to draw a bunch of samples from the input distribution, and I'm going to give you the labeled pairs (x, f*(x)), where x is drawn from the distribution and f*(x) is my secretly chosen function from the family evaluated on that sample. And your job is to come up with some function that behaves like my hidden one, even when you draw new samples from the distribution.
Behaves like it; what does that mean? So you're going to minimize some loss function. In this case I've written mean squared loss, but it doesn't matter; you can choose whatever loss function you want. And you just want to make sure that that loss, that distance between the function you came up with and my hidden concept, is less than some desired accuracy epsilon.
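To make this setup concrete, here is a minimal sketch in Python; the hidden concept, the input dimension, and the sample sizes are all my own illustrative choices, not anything from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10          # input dimension (illustrative)
m = 1000        # number of labeled samples

# A hypothetical hidden concept f*: here a fixed random one-hidden-layer
# tanh network, standing in for the secretly chosen function.
W = rng.standard_normal((5, n))
b = rng.standard_normal(5)
f_star = lambda X: np.tanh(X @ W.T + b).sum(axis=1)

# The learner sees labeled pairs (x, f*(x)) with x drawn from N(0, I_n).
X = rng.standard_normal((m, n))
y = f_star(X)

# The learner's job: output a hypothesis h with small expected squared loss,
# E[(h(x) - f*(x))^2] < epsilon, on fresh samples from the distribution.
def mean_squared_loss(h, X_fresh, y_fresh):
    return np.mean((h(X_fresh) - y_fresh) ** 2)

# e.g. the trivial constant hypothesis predicting the mean training label:
h_const = lambda X_: np.full(len(X_), y.mean())
X_test = rng.standard_normal((m, n))
loss = mean_squared_loss(h_const, X_test, f_star(X_test))
```

The constant hypothesis is just a baseline; the point is that success is measured on fresh draws from the same distribution, not on the training samples.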
OK, so how are you going to do this? Well, probably you're going to think carefully about the specific family of concepts and the specific distribution, and you're going to come up with some clever algorithm. Or you could just try throwing some deep learning at it. This is the sort of generic solution for trying to solve a learning problem without having to know anything at all about it.
So you construct a neural network. What is a neural network? It's basically some directed graph. On one end I have my inputs, the n inputs corresponding to the n dimensions of my input distribution; on the other end I have my output, in this case just some real number. And in between I have some layers, hidden layers they're called, of neurons. And what are these neurons doing? Each neuron is taking some affine combination of the values on the previous layer, and it's setting its value to some specific nonlinear function of that affine combination. So, for example, a sigmoid function would be a popular choice. So this is a feedforward neural network, and it then computes some function of the inputs, which is a nest of, say, L sigmoids, where L is the number of layers.
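A minimal sketch of such a feedforward network, with made-up layer sizes and random weights just to show the structure:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Feedforward pass: each layer holds weights W and biases b, and every
    neuron applies the sigmoid nonlinearity to its affine combination of the
    previous layer's values."""
    h = x
    for W, b in layers:
        h = sigmoid(W @ h + b)
    return h

rng = np.random.default_rng(1)
n = 4
# Two hidden layers of 3 units each, then a single linear output weight.
layers = [(rng.standard_normal((3, n)), rng.standard_normal(3)),
          (rng.standard_normal((3, 3)), rng.standard_normal(3))]
w_out = rng.standard_normal(3)

x = rng.standard_normal(n)
output = w_out @ forward(x, layers)   # a single real number
```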
OK, well, how am I going to figure out what these weights are, for this combination that each neuron should compute? Well, just use gradient descent. So, if people can see this at the bottom here, I estimate the gradient of my loss function with respect to the weights and biases at each neuron in my network, using some of the labeled samples. And I use that to iteratively improve the network, hopefully getting closer and closer to something that has small loss.
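A toy sketch of that training loop, for a one-hidden-layer tanh network on an easy illustrative target (the learning rate, sizes, and target are my choices, not the talk's):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, m = 5, 8, 2000

# Labeled samples from a toy target (illustrative; not the talk's hard family).
X = rng.standard_normal((m, n))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]

# One-hidden-layer tanh network: h(x) = a . tanh(W x + b).
W = 0.1 * rng.standard_normal((k, n))
b = np.zeros(k)
a = 0.1 * rng.standard_normal(k)

def loss(W, b, a):
    return np.mean((np.tanh(X @ W.T + b) @ a - y) ** 2)

initial_loss = loss(W, b, a)
lr = 0.05
for step in range(500):
    idx = rng.integers(0, m, size=64)      # a batch of labeled samples
    Xb, yb = X[idx], y[idx]
    H = np.tanh(Xb @ W.T + b)              # hidden-layer activations
    err = H @ a - yb                       # prediction error on the batch
    # Estimated gradients of the squared loss w.r.t. all weights and biases.
    grad_a = 2 * H.T @ err / len(idx)
    G = 2 * (err[:, None] * a) * (1 - H ** 2)   # backprop through tanh
    grad_W = G.T @ Xb / len(idx)
    grad_b = G.mean(axis=0)
    a -= lr * grad_a
    W -= lr * grad_W
    b -= lr * grad_b

final_loss = loss(W, b, a)
```

On an easy target like this, the iterative updates do drive the loss down; the talk's point is precisely that this recipe fails on the hard family constructed later.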
So, is that basic concept clear? Deep learning, then, is just the infinite variations you can put on this theme: change your architecture, change your gates here, use some fancy form of gradient descent there.
So why should this work? We'd like to give some setting, at least, in which we can prove that it does work; in other words, that if you do this, it's going to converge in a reasonable amount of time to a good solution.
Well, that's too much to hope for in general, obviously; I mean, learning is hard in general. So what about just the case where the family of functions that we're trying to learn is in fact realizable, where it is itself defined in terms of neural networks? Well, that's probably not going to help very much, because all functions are approximately realizable as neural networks. This has been known for almost thirty years now.
So this isn't much of a restriction. To say that the family of functions is realizable, we need something stronger: say, functions that are realizable as small neural networks. So maybe if I restrict the number of hidden units in my neural network, and I'm only trying to learn functions that are of that form, then maybe I can hope to achieve this. Well, that's not going to work either. Even if I have just three neurons in the true concept class, with threshold gates as the nonlinear function, then this is NP-hard.
And even if I'm trying to learn improperly, so I'm not required to come up with a neural network that represents the neural network I'm trying to learn, then under certain cryptographic assumptions, for example, even this can't be done. So we need to put some stronger constraints on the family of functions that we want to be able to learn before we can hope to prove anything here.
So what would be a sort of natural restriction on the family of neural networks that might allow us to learn? Well, I'm not sure, but let's think of a nice simple family of neural networks and see if we can generalize from there. So let's say that I guarantee you that my secretly chosen function is exactly represented as a neural network like this: not too many units, in a single hidden layer, some nice smooth activation function, and a natural input distribution, say Gaussian. Independent Gaussian inputs; you can't get much nicer than that. And then a linear combination of the hidden values as my output. So if I guarantee you that my input distribution looks like this and my family of functions looks like this, then surely, if I can learn anything, I can learn this. So let's try.
[In response to an audience question:] Yeah, that would be enough, but I'm trying to do something even weaker than that; I'm happy to learn it improperly. So I just want you to come up with a function that is close to this one. Yes, I've chosen the weights here, and now your job is to learn that function over this input distribution. Yeah, and I'll tell you what that activation is, also: it's, say, the hyperbolic tangent function.
So, no, that's given to you: the distribution is given to you, this particular nonlinear function is given to you. The only thing that's not given to you is the specific combination that's used here, this W and this b. So the only thing I've chosen is those weights; everything else I've told you, and your goal is to learn this.
And again, you don't need to tell me the weights or the biases there; you just need to come up with some function, some representation, which doesn't need to be the same one, for this function. So, you know, choose your favorite architecture, choose whatever activation units you want to use to learn it. You can represent this tiny thing as some huge multilayer network if you want. Did it work?
No, it didn't. So that's the result I'm going to talk about today: even for this, the simplest possible case, where we have this wonderful smooth input distribution, smooth activation units, and not too many of them, even this seems to be difficult to learn.
So, to be able to state the setting in which our results apply, let me introduce statistical query algorithms. A statistical query algorithm is one that doesn't need to rely on the individual labeled examples in order to figure out the function. Instead, it queries some oracle. And what sort of queries does it make? It chooses some bounded function of labeled examples, and some tolerance parameter tau, and it receives an answer that is close to the expectation of that function. How close? Within the tolerance parameter tau. Remember, the neural network training algorithms that we had in mind here were ones that were going to estimate some sort of gradient. And estimating the gradient of a loss function doesn't require that you actually see labeled examples; all it requires is that you have something close to the expected value of this function here. So in the case of gradient descent, the way you realize it as a statistical query algorithm is that you think of your queries as the gradients of your loss function. Does that make sense?
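As a toy illustration of the framework (my own sketch, with a deliberately easy linear target), here is a statistical query oracle and a gradient descent that talks only to it; note the queries are clipped so they are bounded, as the definition requires:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3
w_star = np.array([1.0, -2.0, 0.5])   # hidden target (linear, for simplicity)
X = rng.standard_normal((20000, n))   # large sample standing in for the distribution
Y = X @ w_star

def sq_oracle(query, tau):
    """STAT oracle: `query` maps a labeled example (x, y) into [-1, 1];
    the oracle returns E[query(x, y)] up to error tau. Here the expectation
    is simulated by an empirical mean plus bounded adversarial noise."""
    return query(X, Y).mean() + rng.uniform(-tau, tau)

def grad_estimate(w, tau):
    """Each coordinate of the loss gradient E[2 (w.x - y) x_i] is obtained
    as one (clipped, hence bounded) statistical query."""
    g = np.empty(n)
    for i in range(n):
        g[i] = sq_oracle(lambda X, Y, i=i:
                         np.clip((X @ w - Y) * X[:, i], -1.0, 1.0), tau)
    return 2 * g

# Gradient descent that never touches individual examples, only SQ answers.
w = np.zeros(n)
for _ in range(200):
    w -= 0.1 * grad_estimate(w, tau=0.01)
```

On this easy target the SQ-based gradient descent recovers the hidden weights; the lower bound coming up says that for the hard family of networks, any such algorithm needs exponentially many queries.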
So gradient descent can be understood as a statistical query algorithm, in some sense. And more generally, tensor decomposition methods, all sorts of algorithms for learning, can be realized as statistical query algorithms, and there are actually very few problems in machine learning where we don't have a statistical query algorithm that performs as well as the best other algorithms for the problem. So this is a very general framework.
But nevertheless, what we show is that no statistical query algorithm is going to succeed at this task of learning neural networks of this form. So in particular, there's some family of functions, realized in this form with not too many sigmoid gates in a single layer, and in fact you can take any log-concave product distribution as your input. And if you're trying to learn this using a statistical query algorithm, for example one that estimates gradients using statistical queries, then you're going to need exponentially many queries in the input dimension, assuming the tolerance is small, smaller than roughly one over the number of units in the network. So even getting a single query at this tolerance means you need more samples than the number of free parameters in the network.
Sorry, let me clarify: we have an additional parameter s here; s is the sharpness of the sigmoid functions that we're talking about. This is what a sharp sigmoid function looks like. But this particular unit isn't important either; the result generalizes to essentially arbitrary activation functions.
I should mention earlier work by Shamir, which gives exponential lower bounds against a sort of vanilla form of gradient descent; in particular, he estimates exactly what these gradients are and shows that they're going to be exponentially small. This is a somewhat more general result, in the sense that it applies to any statistical query algorithm, not just the most basic forms of gradient descent.
And this result, furthermore, applies to functions that are actually realizable in practice. So how do we construct this family of functions? It's fairly straightforward.
I have some pair of sigmoid units here; if I add them together, I can get something that's sort of a bump function. And if I add up a bunch of bump functions like that, I get something that looks sort of periodic. It's not actually periodic, because I've only added up finitely many of these things, but it looks reasonably close to periodic.
What am I going to do with this? Well, I'm going to have my family of functions be this periodic, or close to periodic, function evaluated at the sum of half of the inputs. So choose n/2 of the inputs, sum them up, and then apply this nearly periodic function to the sum. I have exponentially many such functions, just by varying the subset that I've chosen here, and I can realize this as a neural network: the weights will all be plus or minus one, and the biases will shift these units out in either direction from the origin. So this is what these functions are doing: they're sort of vaguely periodic functions applied to a sum of inputs. Does that make sense?
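The construction can be sketched as follows; the sharpness, widths, and number of bumps here are illustrative parameters of mine, not the paper's exact choices. A difference of two shifted sharp sigmoids gives a bump, a sum of shifted bumps gives a nearly periodic function psi, and each member of the family applies psi to the sum of the inputs in a chosen subset S:

```python
import numpy as np

def sigmoid(z, s=10.0):
    """Sharp sigmoid: the slope parameter s controls the sharpness."""
    return 1.0 / (1.0 + np.exp(-s * z))

def bump(z, width=0.5):
    """Difference of two shifted sigmoids: roughly an indicator of
    the interval [-width, width]."""
    return sigmoid(z + width) - sigmoid(z - width)

def psi(z, num_bumps=8, period=2.0):
    """Sum of shifted bumps: nearly periodic on a bounded range. The biases
    shifting the bumps away from the origin grow with num_bumps, matching
    the talk's later remark that the bias parameters become large."""
    shifts = period * (np.arange(num_bumps) - num_bumps / 2)
    return sum(bump(z - c) for c in shifts)

def f_S(x, S):
    """A member of the hard family: psi applied to the sum of the inputs
    indexed by the subset S (all hidden weights are 0 or +-1)."""
    return psi(x[S].sum())

rng = np.random.default_rng(4)
n = 20
S = rng.choice(n, size=n // 2, replace=False)  # one of exponentially many subsets
x = rng.standard_normal(n)
value = f_S(x, S)
```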
And in fact, you can add a little bit of noise here if it bothers you that this weight matrix is rank one; you can add a little bit of noise to make it at least polynomially well-conditioned. So this is now a reasonable family of functions. How do we show that it's hard to learn? We estimate the statistical dimension of this family. What is statistical dimension?
The usual definition is that a family of functions over an input distribution D has statistical dimension d with average correlation gamma if, for every 1/d fraction of my functions, the average correlation of that subset is small, smaller than the parameter gamma. If you can prove a bound like this for your family of functions C, then the theorem is that you will need at least d queries, of tolerance roughly gamma, to learn your family. This definition is the sort of usual definition that applies to classification problems.
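In symbols, the classification-style definition reads roughly as follows; the notation is my reconstruction of what the talk describes, with the exact tolerance dependence left at the talk's level of precision ("gamma-ish"):

```latex
% The family \mathcal{C} over input distribution D has statistical
% dimension d with average correlation \gamma if, for every subset
% \mathcal{C}' \subseteq \mathcal{C} with |\mathcal{C}'| \ge |\mathcal{C}|/d,
\[
  \frac{1}{|\mathcal{C}'|^{2}}
  \sum_{f,\, g \,\in\, \mathcal{C}'}
  \Bigl|\, \mathbb{E}_{x \sim D}\bigl[f(x)\, g(x)\bigr] \Bigr|
  \;\le\; \gamma .
\]
% The lower bound: any statistical query algorithm that learns
% \mathcal{C} using queries of tolerance on the order of \gamma
% (up to polynomial factors) must make at least d queries.
```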
For our setting, we're trying to do a regression problem; we have a real-valued function that we're trying to learn, so we need a rather more complicated definition that's nevertheless similar in spirit. And coming up with this proper generalization of statistical dimension to regression problems was, I guess, one of the basic problems in proving this result.
So how do you generalize this to regression problems? Well, you need the same sort of condition, except that instead of just having your family of functions have small average correlation, you need what are essentially indicator functions, applied to the members of your family, such that these compositions have small average correlation. So you take a bunch of what we call epsilon-mollifiers, which are sort of smoothed-out versions of indicator functions, apply those to the functions in your family, and now you want all of these compositions to have small average correlation. And we show that this is enough: if you can show a bound like this, then you will need d queries with tolerance gamma in order to learn your family of functions.
There's a lot of notation here, but this is what these mollifiers are doing: they're sort of like indicator functions. If you have some value y0, and you compose the mollifier at that value y0 with your function, you get sort of a smoothed-out indicator for your function being close to that value. And you want all of these compositions to have small average correlation.
So this is the statement: if you can prove this, then your statistical query algorithm is going to need exponentially many of these queries. So for us, it suffices to show that most members of our family are uncorrelated, or nearly uncorrelated, when we compose them with what is essentially an indicator function. How do we do that? Well, I claim that it suffices to show that an indicator of one of these functions in our family doesn't change much when we condition on the sum of some subset of the input values. If I have some other subset T, some other function in the family, and I look at just the input values that are relevant to both functions and sum those up, I get some value; this is the value over here. I want this to be uncorrelated with that value, and it's going to be enough to condition just on the sum of the values where they overlap.
Now, I have some nice log-concave input distribution, and this function f_S was sort of periodic, and is therefore, after conditioning, also sort of periodic. And therefore I can take this Z that I'm conditioning on to be less than the period of this function, assuming I add some epsilons here and there. And then, proof by picture: if I only shift by a small amount, the two functions are sort of close. So that's the hand-wavy proof of this result.
Now, I said at the beginning of the talk that deep learning was working astoundingly well in practice, and yet I've just claimed to prove that it can't even learn these simple functions. So what gives? Well, actually, these bounds do in fact apply in practice. If you actually construct the functions here, the ones we show are difficult to learn in the statistical query model, then in fact, even at very small input dimensions, you see a threshold beyond which these functions are difficult to learn, and the threshold is exactly given by the tolerance parameter that we prove in the theorem.
So this graph here is showing the loss of a bunch of different networks, or rather the best over many different networks, trying to learn instances of this size. And this large loss that they end up with here is exactly the loss that you would get from the best constant approximation to these functions. So in fact, these functions cannot be learned in practice.
Which leaves the question: why? What is it about all of these natural instances, Go, image classification, what have you, where we can empirically get good results? What is it about these problem instances that's making them tractable? That, I think, is the really interesting problem going forward. And there have been some positive results for training neural networks. For example, there's a paper by Janzamin, Sedghi, and Anandkumar that gives an SQ algorithm, with strong non-degeneracy assumptions on the weight matrix and access to the density function, that allows you to learn in the statistical query setting. And there is also a very recent paper by Goel and Klivans that gives a non-SQ algorithm, and in addition has some weaker non-degeneracy assumptions. But neither of these algorithms is the algorithm that's actually used in practice to train neural networks, gradient descent.
And so the really interesting question here is: what sort of weak non-degeneracy conditions can we put on these neural networks, what can we assume about the structure of these neural networks, that will let us actually learn them? What this result does is rule out the easiest possible path forward, which would be that maybe just some smooth, nice input distribution and smooth activation functions would be enough. No: we're going to need some assumptions on the actual structure of the network in order to get positive results.
I'll close on that positive note.
[In response to an audience question:] It's an interesting question; I don't really have any intuition for it. I think it should still be possible to construct equally hard examples; I don't think that's the essential feature here. What does sort of stand out, on further analysis of these hard examples, is that the bias parameters need to become very large: they scale with the input dimension. I don't have an example where you have bounded biases, and I sort of suspect that that should be an easy case, or, I mean, a tractable case.
[In response to an audience question:] Yeah, absolutely. I mean, independent Gaussian inputs are very natural if you're a mathematician; they're not necessarily very natural if you're Mother Nature. You could say that maybe you have some underlying low-dimensional manifold, and your samples are drawn from somewhere close to this low-dimensional manifold. But now you need to make some assumptions about the manifold in order to be able to say that it's tractable. Maybe the problem is that nature isn't actually that high-dimensional, or not, I mean, asymptotically high-dimensional.
[In response to an audience question:] Yes, so absolutely, you can represent any function arbitrarily closely using two layers, even actually a single layer, as long as you're willing to use exponentially many units; and a reasonably sized two-layer neural network will represent any reasonable function that you want. Yeah, but how do you find out what that neural network is? That's the problem.