OK, good afternoon everyone. My name is Alireza Aghasi; I am an assistant professor at Georgia State University, working with the School of Business. Today's talk, as the title shows, is about pruning deep neural networks with a convex scheme that we have been recently developing.
This might be a useful line of research for people with a compressed sensing or optimization background who are looking for fundamental developments in deep neural networks.
Before we start, I want to thank my collaborators, Afshin and Justin, both from the ECE department; this work would definitely not have been possible without their help and support.
Let me just start with a very brief introduction to compressed sensing, although I know most people in this room are experts. Compressed sensing refers to the recovery and reconstruction of signals and images from far fewer samples than the desired resolution of the image.
Aside from its wide applicability, compressed sensing actually comes with interesting and beautiful performance guarantees. So it is not only a rich and interesting theory at its core; it also has very widespread applications. It combines different areas of mathematics: we have optimization theory, convex analysis, probability theory and random matrices, concentration of measure, harmonic analysis and sampling theory, and all of these areas mesh together in producing the compressed sensing research that has been going on over the past ten to fifteen years, since its start around two thousand five. Many interesting problems in physics, signal processing, imaging, and information theory have been tackled through these kinds of techniques.
On the other hand, we have deep learning and neural networks, and these models have been around for over half a century; they have basically had their ups and downs over and over. Very recently they became popular again, and now we have more of a chance to revisit them and think more deeply about them because of the tools that we have now. Basically, they are computational systems inspired by the biological structure of neural networks.
The classic neural network models mostly route back to the nineteen forties and fifties, to the works of the scientists of the time. Their first exploitation for vision purposes was around the nineteen sixties and seventies, and more recently they were rebranded as deep learning. Around two thousand ten or two thousand eleven they became popular again, very popular, simply because some of these AI models based on neural networks were able to achieve human-level perception on certain tasks, especially in image recognition.
So they became popular again. However, despite their wide popularity, there is less of a rigorous or theoretical understanding of how these models work. On the other hand, they have been around for almost seventy years, yet you would see comparatively less theoretical work being done in the area of deep learning and neural networks, and that is mainly because of the extreme complexity of the underlying models. Only very recently have computer scientists started working on the mathematics of our understanding of how these deep networks work.
So in the presentation that we have today, we're going to talk about how to promote a specific structure into the architecture of a deep neural network. Why do we want that? By promoting a specific structure such as sparsity, we're going to reduce the complexity of these models, and by reducing the complexity, and having performance guarantees on how this reduction works, we're going to be able to reduce the model variance and have probably better predictions and maybe faster processing times. We're going to detail that as we move forward.
We're going to specifically focus on a post-training procedure. Associated with each neural network you have a training phase, where you train your neural network based on the training data; that is the training phase. We're going to focus on a post-training phase, so the algorithms that we're going to talk about today are mainly in the context of post-training, and we're going to say why post-training is something interesting to do.
How can we use this post-training framework to develop and design sparse structures in the topology of deep networks, with the possibility of using computationally efficient tools such as convex models, despite the very complicated and non-convex nature of these models? We're also going to present performance guarantees on how well we can perform this task; if we want to do this model reduction, we can show that it does not come as an ad hoc algorithm, it comes with performance guarantees as well.
We're going to go through all of that in today's lecture. So let's start with pruning of deep models. I'm going to start with the classic and simple problem of recovering the solution to a linear model, a linear system of equations. Let's start with an underdetermined linear system of equations: we have this underdetermined system X W = Y.
Probably the system is also noisy, so we may only want X W and Y to be as close as possible in some norm. Among all the solutions that this linear system can have, we're going to look for the sparsest solution, the most simple solution, through this optimization: minimizing the number of nonzero elements in W subject to the consistency between Y and the output of the measurement.
Despite the simple formulation, this problem is NP-hard, meaning that we cannot solve it in polynomial time; it is computationally not a tractable optimization to work with. Instead, what we can do is consider a convex relaxation: in the objective, instead of having the ℓ0 norm of W, we can use the ℓ1 norm of W and still stick to the same constraint set.
The second optimization is thus a relaxation of the first one, and a very surprising result is that, under certain conditions, when X, the measurement matrix, is generic, such as a random Gaussian matrix, the solution to the latter coincides with the solution to the former. Meaning that, for instance, as long as X is a Gaussian matrix and we have P of order S log N, which is the sample complexity that we have for Lasso-type problems, then solving the ℓ1 program suffices to recover the sparse solution W.
Here S is the cardinality of the sparse solution, and what this statement says is that the number of samples doesn't need to scale with N, which is the canonical dimension of the signal; it only needs to scale with S, which is the number of nonzeros in the signal.
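As a concrete illustration of this recovery phenomenon (a minimal sketch, not the speaker's code; the dimensions and the reformulation of the ℓ1 problem as a linear program are my own choices), exact recovery from far fewer samples than the dimension looks like this:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, p, s = 100, 30, 3                     # signal dimension N, samples P, sparsity S
X = rng.standard_normal((p, n))          # generic (Gaussian) measurement matrix
w_true = np.zeros(n)
w_true[rng.choice(n, s, replace=False)] = rng.standard_normal(s)
y = X @ w_true

# Basis pursuit: min ||w||_1 s.t. X w = y, as an LP in (u, v) with w = u - v, u, v >= 0
c = np.ones(2 * n)
A_eq = np.hstack([X, -X])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * n))
w_hat = res.x[:n] - res.x[n:]
print(np.max(np.abs(w_hat - w_true)))    # recovery error
```

With P = 30 samples on the order of S log N, the ℓ1 program recovers the S-sparse signal in dimension N = 100 essentially exactly.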
Also, similar ideas and tools have been used to simplify linear models; you are probably all familiar with Lasso-type models. One interesting characteristic about the Lasso is that while we are training, we can apply the ℓ1 penalty, so we can actually do model selection along with the training. What the Lasso does is reduce the model complexity for us, so the model variance can be potentially reduced, and thereby the prediction accuracy improves.
Now, what about nonlinear models such as deep networks? Let me first start with a brief introduction to the architecture of a neural network. In the case of a neural network, we feed the training samples to the first layer of the network. What is the first layer? It's simply a linear unit: a matrix W1 that applies to whatever vector is the input. So you have W1 applied to the input X, and it generates some intermediate response at this level. Then this intermediate response goes through an element-wise nonlinearity: basically, every entry uniformly goes through an element-wise nonlinear unit. So some nonlinear operation is applied to every entry of the matrix, and that generates the output; the output is fed to another linear unit, W2, again followed by an element-wise nonlinear unit, and the process continues for whatever number of layers, however deep you want to proceed in the network. In the case of linear models, the unknowns of the problem are a vector or a matrix W, whereas in the case of neural networks the unknowns of the problem, the parameters that we need to learn, are W1, W2, through WL. And this is actually more or less general; for instance, in the case of convolutional layers we can also consider this operation, simply because convolution is a linear operator.
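The layered map just described can be sketched in a few lines (a generic illustration with made-up sizes; I use the convention, consistent with the recursion later in the talk, that each layer applies W transpose):

```python
import numpy as np

def relu(z):
    # element-wise nonlinearity: every entry passes through max(., 0)
    return np.maximum(z, 0.0)

def forward(X, weights):
    """Pass the columns of X (one sample per column) through the layers:
    a linear unit W_l followed by the element-wise nonlinearity, repeatedly."""
    Y = X
    for W in weights:
        Y = relu(W.T @ Y)
    return Y

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 10))          # 10 samples of dimension 4
weights = [rng.standard_normal((4, 8)),   # W1: 4 inputs -> 8 hidden units
           rng.standard_normal((8, 3))]   # W2: 8 hidden units -> 3 outputs
Y = forward(X, weights)
print(Y.shape)  # (3, 10)
```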
Now, similar to what we did with linear models and the Lasso, what we want to think about is: is there anything we can do with neural networks? If you look at each linear unit, each layer of the network, you have a matrix that relates the input nodes to the output nodes, and when you just learn the model, most of the time this matrix is a dense matrix, meaning that if you look at it from a graph perspective, it is something like a graph in which every input node is connected to every output node. What we want to do, in order to perform model reduction, is to prune these complex connections and replace the matrix with something sparser, an adjacency matrix such that every node at the input is connected to only a few nodes at the output. By this technique we're able to reduce the complexity of each layer, hopefully reduce the complexity of the overall network, and end up with a reduced variance.
Now, if we want to follow a similar procedure as we did for the Lasso, the natural thing to do is probably to just penalize the sum of the ℓ1 norms of the W matrices subject to a consistency between the network output and the training data. So we have W1 through WL and an input X; the network model is a complicated function that relates the input to the output, and we have Y as the training response. So probably what we want to do is minimize this objective subject to this constraint. But the constraint we have is actually a very complicated construct: remember, we had X as the input, and W1, the first linear unit, applied to X; then it goes through a nonlinear unit, then another matrix, W2, is applied to this output, and this chain goes on for L levels to produce a quite complex function. Feeding this complex function inside a norm and forming a constraint gives us an extremely complex construct, so understanding how this optimization works is not actually possible.
There has been no clear understanding of how this optimization works, but people would say, OK, maybe we don't really need that; we can just try it. These are the kinds of techniques that the practitioners of neural networks deal with: they say, let's just address this optimization with some descent scheme, and wherever the landing point is, take it as the solution. But you are very likely to land in a bad local minimum when it comes to an optimization like this. Some people have also considered more ad hoc techniques such as dropout; these are all heuristic algorithms that kind of inspire or promote sparsity in the course of training.
What we're going to present is the Net-Trim algorithm, which is a convex remedy to this problem. We recently presented it at NIPS, and since then we have been able to generalize the results and make the computational tools more powerful. We're going to go through that in detail.
So to start, let me depict the architecture of the network again. We have this L-layer network; we have some training samples, each of some dimension, and we have P different training samples, so we can stack them up in a training matrix in which each column represents one training sample. This matrix goes through a linear unit, producing an intermediate response Y1; then a nonlinear unit is applied, and it goes through another linear unit, producing an intermediate response Y2, and so the flow continues until we get to YL, which is the output of the last layer.
Now, if we want a recursive equation for how this model works, we have Y_ell equal to a nonlinearity applied to W_ell transpose times Y_(ell minus 1), where W_ell is the matrix associated with each layer; I'm going to explain why we're using this form. We are specifically considering the ReLU, the rectified linear unit, as the nonlinearity: first of all because it is very widely used, people love to use it, and second because it actually helps us develop a more beautiful theory compared to other activation functions.
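Written out (my notation for what is on the slide), the recursion is:

```latex
Y_0 = X, \qquad
Y_\ell = \operatorname{ReLU}\!\left( W_\ell^{\top} Y_{\ell - 1} \right),
\quad \ell = 1, \dots, L,
```

where the ReLU acts entry-wise, mapping each entry z to max(z, 0), and each column of Y_ell is the response of layer ell to one training sample.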
What is the big picture here? Let me just briefly explain. In deep neural networks, the problem of overfitting and model variance is actually more of a concern than the bias. What do I mean by this? If you train a neural network, and you dig into the many codes online that train a network, once you just reach a local minimum, in many cases that solution is more or less good enough. From one local solution to another, they are more or less good enough; it is more a matter of sometimes overfitting the model, and among the solutions, the ones with a better balance generalize better.
Based on this fact, and knowing that maybe this stage of the training is not as critical as we thought, we can look at a network that is already trained: if we have a network that is already trained, we can retrain it in order to reduce the variance of the network, with only a little bit of increase in the bias.
Let me just briefly explain how we do this. Suppose that we have a network that is already trained; we have an input to the network, and we have access to the intermediate response Y1, because we have access to all these W matrices that are already learned. And just like Y1, we have access to Y2, Y3, up to YL. So we have access to all these intermediate responses; now, is there a possibility that we can relate the same inputs and outputs with a simpler transformation?
Previously we had W1 relating X and Y1; is there a chance that we can explore a W1 hat that relates X and a Y1 hat, where we want Y1 hat to be as close as possible to Y1? Basically, if we consider the intermediate responses as checkpoints, in the initially trained model we have some complex path between these checkpoints; as you can see, these are complex paths. What we want to do is pass through the same checkpoints, but take a shorter path, a very simple path, which is what we design these W hat matrices to do.
So now that we have the general idea, let's talk about how to address this computationally. Consider just a single layer; because this is a distributed framework, all we need is the input and the output of that same layer. Suppose X is the input, Y is the output, and W is the initially trained matrix. What we probably want to do is explore sparse solutions that relate the output and the input, so the most natural thing that we can think of is penalizing the ℓ1 norm of U, which is the solution, basically W hat, subject to a consistency between the new model and the old model, meaning that we want the ReLU of U transpose X and Y, which is the old response, to be as close as possible; the closeness we can control with ε.
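In symbols (the notation is mine, following the description above), the per-layer retraining program is:

```latex
\widehat{W} \;=\; \arg\min_{U}\; \| U \|_{1}
\quad \text{subject to} \quad
\left\| \operatorname{ReLU}\!\left( U^{\top} X \right) - Y \right\|_{F} \le \epsilon ,
```

where X and Y are the input and output of the already-trained layer and ε controls how close the retrained response must stay to the original one.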
Now, are we done here? Not yet; we can improve this. Although we have a convex objective, the constraint set is actually non-convex: if you plot the function that forms the constraint, you can see it is not a convex function. Probably you would say that both of the functions that form this constraint are convex, but the composition of two convex functions is not necessarily convex. Now, there is a way to actually convexify this problem, and all we need to do is basically enforce similar activation patterns before and after the retraining. Going through a little bit of detail: as long as we impose these activation constraints, instead of this constraint set, which is just not convex, we can have a convex constraint set. And what is this convex set?
We have the output matrix Y, which is the output of the ReLU function, so Y has either zero or positive entries. We're going to look at all the positive entries of Y, the strictly positive entries, and form the least-squares objective for those entries; for the ones that are zero, that means the ReLU has not been activated there, so for those we impose that the activation under the new solution stays non-positive as well. So all we do is just follow similar activation patterns before and after retraining, and we end up with this convex approximation.
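To make this concrete, here is a minimal numpy sketch of the convexified per-layer program; I am solving a penalty form of it with plain proximal gradient (the function name, the toy sizes, and the solver choice are my own; the practical implementation in the talk is the ADMM scheme mentioned later):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def net_trim_layer(X, Y, lam=0.05, iters=3000):
    """Sketch of the convexified per-layer retraining: least squares on the
    entries where the original response Y is strictly positive, a one-sided
    penalty pushing U^T X non-positive where Y is zero, and an l1 prox step
    to promote sparsity in the new weights U."""
    M = (Y > 0).astype(float)                 # activation pattern of the trained layer
    step = 1.0 / (np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant of the smooth part
    U = np.zeros((X.shape[0], Y.shape[0]))
    for _ in range(iters):
        Z = U.T @ X
        R = M * (Z - Y) + (1.0 - M) * relu(Z)  # residual on / off the support
        U -= step * (X @ R.T)                  # gradient step
        U = np.sign(U) * np.maximum(np.abs(U) - step * lam, 0.0)  # l1 prox
    return U

rng = np.random.default_rng(0)
n, m, p = 20, 10, 200
W = np.zeros((n, m))
W[rng.random((n, m)) < 0.2] = 1.0          # a genuinely sparse "true" layer
X = rng.standard_normal((n, p))
Y = relu(W.T @ X)                          # responses of the trained layer
U = net_trim_layer(X, Y)
print(np.count_nonzero(np.abs(U) > 1e-2), "nonzeros vs", np.count_nonzero(W))
```

Because the retrained weights only have to reproduce the activation pattern and the active responses, a sparse solution that matches the layer's behavior is found.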
So instead of writing this out at length every time, I'm just going to use the shorthand C(ε; X, Y, 0), and let me explain what this constraint set is: basically, X is the input to the model, Y is the output of the model, ε is the discrepancy that we use in here, and zero is the right-hand side of the inequality constraints. Here we use the right-hand side zero, but for some Net-Trim algorithms that I'm going to explain next, we need to have this right-hand side be something other than zero, and that's why I use this redundant formulation, to also take care of that situation.
OK, now here comes the first result. You might say, OK, let's just retrain every layer individually, reassemble the network, and now run the network. Every layer has an ε discrepancy; is this going to blow up, are we going to have large discrepancies at the end? The answer is no: we can actually bound the overall discrepancy of the network. So yes, exactly: you just solve the convex problem for every layer individually; you have the input, you have the output, so you independently solve that convex problem for each layer, come up with a new weight matrix, and assemble a network with the retrained matrices. Because every layer has an ε discrepancy compared to the previous one, probably one thing you might be concerned about is whether these errors are going to propagate along the network and cause basically a large discrepancy at the end. What we really care about is for YL hat and YL to be as close as possible; we don't want a large discrepancy between the output before and after. It turns out that if we work with normalized weights, basically because ReLU networks are scalable and the normalization is not actually affecting the function, then as long as we have normalized weights and we run this retraining for every layer individually, the responses of the post-training network and the initial network to the training data are going to be close; L times ε is the maximum discrepancy that we can expect out of the normalized network.
This is actually exactly the next slide. On the next slide we're going to talk about a cascade version of this retraining: in the cascade version, what happens is that instead of retraining each layer individually, the output of the previous retrained layer is going to be used for the retraining of the next layer.
For a simple formulation, let me just think of a network with two layers: we have X as the input, Y as the intermediate output, and Z as the final output. In a parallel scheme, we have W_XY and W_YZ already learned through the initial model, and we want to find W_XY hat and W_YZ hat by having X and Y as the primary terms of the constraints in the one case, and Y and Z as the primary terms of the constraints in the other.
We then retrain each layer individually, reassemble the network, and see how it performs. An alternative solution, as our friend just pointed out, would interestingly be to look at cascade training. We retrain the first layer in just the standard way we used for the parallel case; now, once we have retrained the first layer, we have access to W_XY hat instead of W_XY. So how about using Y hat instead of Y in retraining the next layer? We form Y hat by applying the ReLU to the output of the retrained first layer, and in the next layer we're going to use this Y hat as the first input to the constraint set, with zero as the other input to the constraint set.
Now, it turns out that because of some feasibility issues we cannot use zero anymore here as the third parameter: basically, the right-hand-side expression for the inequality needs to be slackened a little, so that we don't end up with any feasibility issues, and ε also has to be modified a little bit. So basically we're going to stick to C(ε2; Y hat, Z, V): you're going to have a V here, and what is V? V is going to be formed from W_YZ transpose applied through Y hat, which is actually a matrix with small norm. And we're going to have ε2 given by this expression; the extra term in this expression is actually small, and this γ, this gamma, is a parameter that controls the sparsity, the level of sparsity that we want to promote into the network. Basically, if you have few connections, if you have few parameters in the model, and we are working with a ton of data, then if you set γ to one you wouldn't get that much sparsity back; but by setting γ to a number slightly more than one, you're going to receive more and more sparsity. So this γ directly controls the level of sparsity.
Now comes the next discrepancy result. If we retrain every layer through this cascade process and reassemble the network, we can actually show that the discrepancy doesn't blow up in this case either: the overall discrepancy before and after retraining is going to be bounded by γ to the power (L minus 1) over 2, multiplied by ε, where ε is the already small quantity we used for the first layer and γ is some number close to one, say 1.1. So you still have just a modest factor, and basically the overall discrepancy does not blow up. Now, the interesting thing about the cascade version versus the parallel version is that with the cascade we probably get more sparsity for the same level of discrepancy compared to a parallel scheme; however, the parallel scheme is computationally more favorable, because we can just retrain every layer individually, in parallel.
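The cascade data flow just described can be sketched end to end; this is a toy illustration (the function names, sizes, and the inner prox-gradient solver are my own simplifications, and I omit the V slack and the γ adjustment for clarity):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def retrain_layer(X, Y, lam=0.05, iters=3000):
    """Sparse retraining of one layer: least squares on the entries where the
    original response Y is active, a one-sided penalty where it is inactive,
    and an l1 prox step to promote sparsity in the new weights."""
    M = (Y > 0).astype(float)                 # activation pattern of the initial model
    step = 1.0 / (np.linalg.norm(X, 2) ** 2)
    U = np.zeros((X.shape[0], Y.shape[0]))
    for _ in range(iters):
        Z = U.T @ X
        R = M * (Z - Y) + (1.0 - M) * relu(Z)
        U -= step * (X @ R.T)
        U = np.sign(U) * np.maximum(np.abs(U) - step * lam, 0.0)
    return U

def cascade_net_trim(X, weights):
    """Retrain layer by layer: the targets (checkpoints) come from the
    original responses, while the inputs come from the already-retrained
    previous layers, as in the cascade scheme."""
    Y_ref, Y_hat, new_weights = X, X, []
    for W in weights:
        Y_ref = relu(W.T @ Y_ref)             # checkpoint from the initial model
        W_hat = retrain_layer(Y_hat, Y_ref)
        new_weights.append(W_hat)
        Y_hat = relu(W_hat.T @ Y_hat)         # propagate the retrained response
    return new_weights

rng = np.random.default_rng(0)
n0, n1, n2, p = 15, 12, 8, 300
W1 = np.zeros((n0, n1)); W1[rng.random((n0, n1)) < 0.25] = 1.0
W2 = np.zeros((n1, n2)); W2[rng.random((n1, n2)) < 0.25] = 1.0
X = rng.standard_normal((n0, p))
pruned = cascade_net_trim(X, [W1, W2])
```

Feeding each layer the retrained output of the previous one keeps the final response close to the original network's, which is what the cascade discrepancy bound formalizes.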
In either case, we have actually developed an ADMM scheme that we can use to retrain our models, and we have that code available online. Now let's talk about the analysis and the sample complexity. Remember when we talked about the Lasso: it came with a particularly beautiful theory; we could talk about the sample complexity of the algorithm and say that if P is of order S log N, then we can probably recover the sparse solution. Now, in the case of neural networks and this retraining scheme, we can actually show similar results, and that's kind of surprising, because we did not expect this to happen for neural networks, which are extremely complex models. So let me just go through the problem here. The key question is: if there exists a sparse transformation matrix relating the input and the output of a layer, how many samples are required to learn such a model? We know the answer for that in the case of the Lasso; what is the answer for neural networks?
What we are interested in is the underdetermined case, similar to the assumption we had for the Lasso, and we're going to work with ε equal to zero, meaning that we're going to think of having more degrees of freedom in the network than the ones imposed by the training samples. Another way of saying that is that the relationship between the input and the output of each layer can be established with more than only one matrix, and among all these matrices we probably want to stick to the sparsest solution, the best solution.
Now, it turns out that when we take ε to be zero, the optimization problem that we had previously, which is here, decouples into M different optimization problems, each in terms of a column of W. So the optimization decouples into each column individually: instead of addressing an optimization over the whole matrix, we can address M optimizations, where M is the number of columns of W, each dealing with a vector. So now that we're at this point, let's look at this as our central optimization: we want to minimize the ℓ1 norm of w subject to some linear equality constraints and some inequality constraints.
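With ε = 0 the matrix program splits column-wise; writing w_m for the m-th column of W, y_m for the m-th row of the layer's output, and Ω_m for the samples on which neuron m was active in the initial model (notation mine), each neuron is retrained by:

```latex
\min_{w_m}\; \| w_m \|_{1}
\quad \text{subject to} \quad
w_m^{\top} X_{\Omega_m} = y_{m,\Omega_m}, \qquad
w_m^{\top} X_{\Omega_m^{c}} \le 0,
```

which is the central ℓ1 optimization with linear equality and inequality constraints referred to above.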
Now, one other observation: when we have these ReLU-activated networks, we can say that as long as the samples are i.i.d., which is a very reasonable assumption to make, meaning that one sample does not affect another sample; the entries of, say, a first-layer sample X1 can among themselves be dependent, but one sample X1 and another sample, say X2, should be independent, and there is no structure in how we take the samples to train the model. As long as we have this, and we pass the samples through a ReLU-activated network, we can claim that the intermediate responses, Y11, Y21, et cetera, and Y12, Y22, et cetera, are going to remain i.i.d. The independence is not anything complicated to see: we have independence at the input, and our system is memoryless across samples, so we're going to maintain that independence as we move on through the layers; and the identical distribution follows from the fact that the same operation is performed on every sample in each layer. So the samples are going to remain i.i.d.
Now we address the following problem, which would be a general problem for any layer of the network: suppose we have a layer whose inputs are i.i.d. sub-Gaussian sample points, and we are retraining it with P different samples. How large should this P be if we want to recover an S-sparse solution? If we solve this problem, it actually applies to all the layers in the network, independent of where they are; so let's look at it.
Here is the general statement of the theorem. First of all, because we were able to decouple the optimization, instead of talking about retraining the whole layer we can talk about retraining each neuron individually, so the problem is now a smaller problem. So suppose that you have a trained neuron, and the relationship between the input and the output of the neuron is the one that comes naturally from the model we described. We have the inputs: an input matrix formed as the concatenation of these P intermediate responses, y1 through yP, and they are independent; these are independent, and they are coming from a sub-Gaussian distribution. These are all the natural assumptions that we make.
Suppose that you fix the constants β and t in the statement; then as long as P is of order S log N, you are good with recovering the sparsest solution, a similar result to what we had previously for the Lasso.
Now, the constant C takes care of the probability of success; it is a universal constant. μ is actually a constant that relates to the norm of this virtual input. The virtual input is an input that is either y, whenever w0 transpose y is greater than zero, or the zero vector, when w0 transpose y is less than zero, where w0 is basically the weight the network learned initially for this neuron.
Now, based on this result, as long as we can show, for the virtual input (which, as we just defined, is y multiplied by the indicator of whether w0 transpose y is greater than zero or less than zero), that the minimum eigenvalue of its covariance matrix is bounded away from zero, and that the norm of the centered input is less than some constant, then we are good, with S log N sample complexity.
This is actually something we can easily verify; for instance, for the first layer, when we have a Gaussian input, we can verify these two conditions very conveniently. Now, what are the main challenges of the proof? First of all, in the compressed sensing type of proofs that we have, we are talking about these measurement matrices, and those measurement matrices are normally taken from zero-mean distributions. Here we don't have zero-mean random variables: because our random variables pass through the activation, they become nonnegative, so we have either zeros or some positive numbers; our random variables are not centered, and this causes problems in the analysis. Let me just give you one example: think of compressed sensing with a Rademacher matrix, where the entries are randomly minus one and plus one; you can handle that quite conveniently. Now think of the alternative problem, where you have as the random entries zeros and ones: with this new measurement matrix, the RIP doesn't hold anymore. The restricted isometry property was one of the strongest tools that we had, and it doesn't apply anymore, so we get into problems; even other schemes, other powerful techniques such as the bowling scheme, require the measurements to be zero-mean.
This is one of the challenges. The other challenge is that we have this support Ω, which is actually the activation pattern of the initial model; this comes from the initial model, and this Ω kind of promotes a dependence between the columns and rows of these measurement matrices, so we end up with random matrices whose rows and columns are dependent. And aside from equality constraints, we also have inequality constraints. So we have multiple challenges, and there is a way to actually overcome all of these challenges; it is basically the story of two birds with one stone.
Here is the trick. Let's consider y, our input, and consider the indicator of the set Ω, and multiply Y by the diagonal matrix of the indicator of Ω, meaning that wherever you are inside Ω, we just leave the corresponding column of Y, and wherever you are off Ω, we replace that column with μ. Here μ is just some vector we can choose, and we're going to choose the mean of the virtual input. Now, what you can see is that this modified matrix can be formed by independent draws of the virtual input. So we had this dependence issue, but if we think of the input as our random variable, we can think of the columns as independent draws of that virtual input; and μ we can control, in a way that this modified matrix is zero-mean, by simply taking μ to be the expectation of the virtual input.
Now here's a statement that simplifies everything for us: consider some scalar that is nonzero; as long as we can assure that the pair of W-star and this scalar is the unique solution to this auxiliary optimization, then we can claim that W-star is the solution to the initial optimization. So we end up where we actually wanted to end up: an optimization with everything in the form of equality constraints, a large zero-mean matrix, and only one extra vector appended to what we already needed to show, with W-star as the anchor solution.
So I'm not going to go into more details of the proof, but we use the golfing scheme technique to handle the remainder of it. We use a variant of a concentration inequality to link the constants to the smallest eigenvalue of the virtual input covariance matrix, and for the two parts of the measurement matrix, the modified matrix we had and the appended vector, we bound the Rademacher complexity and come up with the overall sample complexity result.
So let's look at some simulations. Consider a toy example: we have a classification problem of classifying two sets of points lying on a sphere, using a network of size 2 by 200 by 200 by 2, so the middle layer is 200 by 200. If you look at the classifier after the initial training, it looks something like this; if we do the retraining, the classifier is going to be more or less what you see on the right, and you wouldn't see any major differences.
The difference between the two is controlled by how large the values of epsilon are. Now, with almost no loss in the bias, we have been able to actually prune the network and come up with one which is ninety-three percent sparser than the initial one. This is the weight adjacency matrix of the initial model, and this is the trimmed model; you can see the retrained model is extremely sparse, without any loss in the bias.
This is another example, where we have used l1 regularization and dropout to train our models. These are themselves techniques to come up with a sparser network, and if you take that model and still apply Net-Trim to it, you are going to end up with networks that are seven times sparser for this specific example. This is just a comparison of the weights for the last layer.
So, a few final things about the simulations we have done: they are fast and scalable, and the codes are actually available online; part of the code is going to be posted in a few weeks or so. Retraining a layer, for instance on the MNIST benchmark, only takes about one to two minutes, so that's very fast, totally comparable to training the model initially.
Regarding the iterations: we probably need something like one hundred to two hundred of them, and all that every iteration of the algorithm requires is passing the input Y through the layer twice. That's basically matrix multiplications, and that's the whole complexity of a single iteration of our scheme.
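To make the claimed per-iteration cost concrete, here is a hedged sketch: the dominant work in one iteration is two passes of the input through the layer, i.e. two matrix products (dimensions and variable names are illustrative, not from the speaker's code):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n = 200, 200, 1000         # input dim, output dim, number of samples
Y = rng.normal(size=(d, n))      # layer input
W = rng.normal(size=(d, k))      # current weight iterate

# One iteration's dominant cost: Y passes through the layer twice.
Z = W.T @ Y                      # forward pass, shape (k, n)
G = W @ Z                        # second (adjoint-style) pass, shape (d, n)

# Roughly 2 * (2*d*k*n) floating-point operations per iteration.
flops_per_iter = 2 * 2 * d * k * n
```

At these sizes that is on the order of 10^8 flops per iteration, which is consistent with the one-to-two-minute retraining times quoted above.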
Now, the retraining can actually be handled with only a portion of the training data. We were talking about passing all the data through the network again for the retraining, but we don't really need that; with only a small portion of the training data we can still do the same job.
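A tiny sketch of that subsampling idea; the 10% fraction here is illustrative, since the actual fraction used is not stated in the talk:

```python
import numpy as np

rng = np.random.default_rng(2)
n_train, frac = 10_000, 0.1

# Draw a small random subset of training indices; only these samples
# are passed through the network for the retraining step.
subset = rng.choice(n_train, size=int(frac * n_train), replace=False)
```

The per-layer retraining program would then see only `len(subset)` samples instead of all `n_train`.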
Once we have done that, we can fine-tune the result if desired: because we already know which weights are going to be retained, we can retrain the model to further reduce the bias of the model.
Compared to some of the ad hoc techniques, such as the squeeze-net pruning approach, Net-Trim is actually faster and more reliable. This is a comparison for a feed-forward network of this size, and this is a comparison between the cascade and the parallel schemes. The statement here is similar to what we mentioned previously: for the same sparsity ratio, the discrepancy of the parallel scheme would be larger than the discrepancy of the cascade framework. However, this is only with regard to the training data; when it comes to the test data, their performances are not really that much different. Okay, so why not just use the parallel scheme, when we really don't have any issues on the test side? And that's why, especially for large data sets, for big data, the parallel scheme is the path to go.
This is just one comparison of Net-Trim with this squeeze-net approach. What they propose is pruning the network by looking at the weights; it's a very naive idea. They look at the network, and wherever a weight is close to zero they just set it to zero, and do a fine-tuning after that; that's all. This process can actually be very tricky, and in many instances, and these are instances that we explored in our simulations, we end up with very bad solutions, because we have no mathematical understanding of what happens: we just zero out a bunch of numbers and redo the training. It's something like, instead of applying lasso to a linear model, just looking at the weights that are small, setting them to zero, and retraining; for feature selection that would not be a wise thing to do.
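The naive baseline being criticized can be sketched in a few lines; `magnitude_prune` is a hypothetical helper of my own, and the choice of threshold is exactly the ad hoc part the speaker objects to:

```python
import numpy as np

def magnitude_prune(W, keep_fraction):
    """Zero out all but the largest-magnitude entries of W."""
    k = max(1, int(round(keep_fraction * W.size)))
    thresh = np.sort(np.abs(W), axis=None)[-k]
    mask = np.abs(W) >= thresh
    return W * mask, mask

W = np.array([[0.9, -0.02, 0.4],
              [0.01, -1.2, 0.05]])
W_pruned, mask = magnitude_prune(W, keep_fraction=0.5)
# a fine-tuning pass (retraining the surviving weights) would follow here
```

Nothing in this rule accounts for how the zeroed weights interact through the layer, which is why, as noted above, it can land on very bad solutions.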
These are just the comparisons, and we can see that Net-Trim consistently wins over the squeeze-net approach. This is the output of Net-Trim, and you can see that with only one run it can find the most important features; it can identify the most important features for you.
So, in conclusion, we talked about this post-processing of networks and how it can help us to employ powerful tools from convex analysis and concentration of measure. Regarding other interesting structures: we talked about sparsity, but a lot of other interesting structures can be considered for the networks.
Another problem that might be of interest is the size of the networks, the number of nodes that we pick for every layer. That is actually more based on experience; there is no exact process to do it. So one thing that we can do is actually to start with a very large network initially, just let the network overfit, that doesn't matter, and then apply Net-Trim to shrink it to the right size.
The pruning process is computationally distributable, so it is very well suited for big data. Also, Net-Trim is a post-processing scheme, so whatever tools you want to use for training, you can let them stay as they are and just enjoy all of that; once you learn the model, you can just apply Net-Trim to it and enjoy the model reduction. Thank you very much for your attention.
[Audience question, inaudible]

Yes.
Actually, one thing that I wanted to say is that you don't have to apply Net-Trim to every layer in your network; you can just apply it to some of the layers, and it still does the job. You can apply it only to the ones that are worth the retraining: let's say the most important ones are the dense layers, the largest layers; then you can just apply Net-Trim to those layers, and the smaller ones, which don't have any impact on the compression, you can leave as they are.
[Audience question, inaudible]
So, first of all, when we have smaller models the forward processing is fast, so we can have faster neural networks at work. And second, which is actually the more important one: if we don't increase the bias, we can reduce the model variance, and that is going to improve the test performance. If we apply Net-Trim, similar to what we have with lasso, it slightly increases the bias of the model, but it reduces the variance, and overall it can improve the test error.

[Audience question, inaudible]
You know, that's actually the naive idea that we initially talked about: it's a very complex model, and in most of the cases you end up with a bad solution. Actually, that's one of the reasons; look, this is the output of that optimization, and on top of that, if we apply Net-Trim we are going to get a solution like this. This one is the optimal solution, and that one is not. And yes, we actually used Net-Trim on top of the dropout, so we actually used two pruning schemes simultaneously.
[Audience question, inaudible]
There is actually a very high possibility of getting trapped in local minima. The geometry of these functions, the underlying functions, is very complex; it has so many local minima, minima of different biases, and it's very likely that you just get trapped in one, and there is no easy way to escape it.
Yeah, thank you. Thank you.