Thank you, Justin. Can you hear me? OK, this is working. Yes, OK, good. Let me know if you can hear me in the back of the room; we'll figure it out.
Thank you very much for the introduction. Today I'm going to tell you about an algorithm for distributed optimization called asynchronous gradient push. In particular, it will be asynchronous, and it will be a multi-agent optimization algorithm; I'll go into detail about what that means. This is joint work with my graduate student Mido Assran, who is a student at McGill and is also working part time with me at Facebook AI Research in Montreal. I don't know if I really need to say anything about Facebook AI Research, but I thought I'd say one word about it. And there's a picture of him, so you know what he looks like.
So first, a word about Facebook AI Research. This is still relatively new for me: like Justin said, since September I'm on leave from McGill and working at Facebook AI Research, the AI research lab within Facebook. Our chief scientist is Yann LeCun, and the mission is to advance the state of the art of AI through open research, for the benefit of all. So, maybe different from some other corporate research labs, really everything we do is supposed to be in the open: we publish everything, and we open-source the vast majority of the code that we're using to do our experiments, so that others can benefit from it and build on it. We hope that we can create this open ecosystem of research, helping to advance the state of the art.
FAIR has four locations. When it was founded, there was Menlo Park, at the Facebook headquarters, and New York. You've probably heard about FAIR because there are two faculty members at Georgia Tech who are also affiliated with FAIR; they're in the Menlo Park location right now. Yann is based out of New York, but he of course travels a lot. The two international locations are in Paris and, most recently, Montreal; that lab opened in September. The picture I've got down here is of the current team. We started out with four researchers in September, and now we're up to, at last count, around seventeen: roughly half research scientists and engineers, and the other half students, mostly from local universities in Montreal, McGill and Université de Montréal.
We're working on a whole bunch of areas. A quick plug: if any of you is interested in Facebook AI Research, or in particular in the Montreal location, we're definitely hiring, so feel free to come talk to me after the presentation if you're curious about that.
OK, so on to distributed optimization. The focus of the talk today is going to be on solving problems of the form: minimize some function F over some vector x. I'm going to focus on continuous optimization problems that don't have any inherent constraints; we'll keep it simple with this unconstrained setting, where F decomposes into a sum of n terms.
In this multi-agent framework I'll be talking about, the way things are going to be set up, I want you to have in the back of your mind something like distributed data-parallel processing in machine learning. I've got a very large set of training instances, and let's say I want to solve some sort of supervised learning problem. Of course, the overall loss function, the objective I would like to minimize to fit a model to this data, is going to be a sum over all of the data terms, under the usual i.i.d. kind of assumption. We're going to think about chunking the data up into n groups, where n is the number of processors I'm using for this algorithm: some portion of the data at processor 1, some portion at processor 2, and so on up to processor n. The sum of those local objectives gives me the overall objective function I would like to minimize. Keep in the back of your mind, too, that x would be the parameters of the model that we're trying to fit to the data.
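To make this concrete, here is a minimal sketch, with a least-squares objective and variable names of my own choosing (not from the talk), of how the global objective F(x) = sum_i f_i(x) arises from splitting one dataset across n processors:

```python
import numpy as np

def local_loss(A_i, b_i, x):
    """Local objective f_i(x) held by one processor (hypothetical example)."""
    r = A_i @ x - b_i
    return 0.5 * np.dot(r, r)

rng = np.random.default_rng(0)
A = rng.standard_normal((12, 3))   # full dataset
b = rng.standard_normal(12)
x = rng.standard_normal(3)         # model parameters: the shared argument x

n = 4                              # number of processors
A_parts = np.array_split(A, n)     # each processor holds one chunk of data
b_parts = np.array_split(b, n)

# F(x) = sum_i f_i(x): summing the local losses recovers the global loss
F_sum = sum(local_loss(Ai, bi, x) for Ai, bi in zip(A_parts, b_parts))
r = A @ x - b
F_full = 0.5 * np.dot(r, r)
assert np.isclose(F_sum, F_full)
```

The point is only that the argument x is common to every term, while the data defining each f_i lives on a different processor.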
So we're going to be in this data-parallel setting.
Often, the reason we would want to do distributed optimization in the first place, at least in the context of machine learning, is to get some sort of speedup: leveraging parallel processing power to do the computation, or to find a good solution, faster.
It could also happen that the data underlying the problem, in other words the local objective functions, are inherently distributed just based on the way the data was collected; that often happens. Maybe the data is too large to fit on a single server. And there are lots of other motivations one could give; we'll see some of them in the talk today, things like robustness, if we want to worry about certain processors failing while we're doing a computation.
Also, having parallel processors sometimes allows us to load data faster; in fact, in many applications, loading the data can be quite time-consuming in and of itself, so loading in parallel can help too. These are just a few motivations.
I want to mention that I'm focusing on a supervised machine learning kind of application today specifically because this is part of the machine learning seminar, but the kinds of algorithms I'll be talking about have applications to a pretty wide variety of domains. Similar optimization problems with this structure come up, for example, in distributed sensing systems, in distributed power systems, as we were talking about earlier today, or in the control of multi-robot systems, where problems like agreeing on a place to meet, or agreeing on maintaining a certain formation, can be formulated as an optimization problem of this form. Again, we have one term for each agent, each node in the network, and they need to coordinate in order to find a minimum of this overall objective.
OK. So, in terms of distributed processing: this is the problem we're talking about, and our objective function factors this way into a sum of different terms. Now keep in mind that the variable x, the argument, is the same for each of these terms; we want to find one solution x which minimizes this overall sum.
Of course, since our function F separates into a sum of functions, so does its gradient, and this is where we try to exploit parallel processing; that's the typical way it comes into the picture. Maybe the most straightforward approach you might have come across in the past is something like a master-worker architecture, so I want to put into context here where the multi-agent approach comes from. Master-worker is not multi-agent, but it builds up to one simple way to think about multi-agent algorithms.
In a master-worker architecture, you think of the master as really being the one driving the optimization process, but in order to compute a gradient faster, it distributes that computation across multiple workers, n of them in this case. Each worker is responsible for one of the terms in the objective function. The master sends the current value of the iterate x_k to each worker, the workers evaluate their local gradients, and then, to get back a gradient of the overall objective function, we just need to add those up. So the workers send them back to the master; once the master has them all, it can sum them up and take a step in that gradient direction. Gradient descent is the prototypical, very simple first-order optimization algorithm for what we'll talk about today; of course we could think about doing more complicated things in this kind of setup as well, but let's keep it simple with gradient descent for now.
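As an illustration, here is a single-process sketch of that master-worker loop (the least-squares f_i's and all names are my own, not the talk's): the "master" broadcasts x_k, collects the workers' local gradients, sums them, and takes a step.

```python
import numpy as np

def worker_grad(A_i, b_i, x):
    # each worker evaluates the gradient of its local term f_i at x
    return A_i.T @ (A_i @ x - b_i)

rng = np.random.default_rng(1)
A = rng.standard_normal((15, 2))
b = rng.standard_normal(15)
A_parts = np.array_split(A, 3)     # three workers, one data chunk each
b_parts = np.array_split(b, 3)

x = np.zeros(2)
alpha = 1.0 / np.linalg.norm(A, 2) ** 2   # safe step size for this example
for k in range(500):
    # master broadcasts x_k; workers return their local gradients
    grads = [worker_grad(Ai, bi, x) for Ai, bi in zip(A_parts, b_parts)]
    # master sums them and takes a gradient step
    x = x - alpha * sum(grads)

# sanity check against the closed-form least-squares solution
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.linalg.norm(x - x_star) < 1e-6
```

In a real deployment, the broadcast and the gather would be network operations, which is exactly where the bottleneck discussed next comes from.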
OK, so that would be a master-worker kind of architecture.
This can work well in certain circumstances, but in others you might worry about the master becoming a single point of failure, a bottleneck in the system. First of all, the communication bandwidth in and out of the master needs to scale something like linearly in the size of the network, or, if you're more clever, maybe only logarithmically, something like that.
And of course, if the master fails, if its hard drive crashes, or if for whatever other reason it stalls or delays, that's going to slow down the entire process.
One way to avoid this master-worker setup is to just get rid of the master. Then you might think of what I would call a fully connected network topology: we have our n workers, and each one of them maintains a local copy of the optimization variable.
They're still going to perform something like gradient descent, in the following sense. Assume they're all initialized to the same value x_0. They can each locally compute their gradient, so each of them has the gradient of f_i, and then they can send a copy of that to every other node in the network. Each node sums those up, so it too has the gradient of the overall objective function, and each node can locally take a gradient descent step. This avoids having a master, but of course it requires a huge amount of communication bandwidth, because now we need to send a message from every node to every other node. That might be undesirable, and you might think it's also perhaps not necessary; this is where we get to what we call the multi-agent setup. In the fully connected setup I just described, we summed up the local gradients, but of course we can also think about averaging those local gradients, and the factor of 1/n that we get from the average can just be thought of as rescaling the step size parameter alpha. Rescale by an amount appropriate to the size of the network, and we're back to good old gradient descent.
So now let's replace this average over all agents in the network by a local average over a subset; here, my script N_i is going to denote the neighbors of node i. Instead of averaging over everybody, we're just going to average over a subset: in this example, node 1 is going to average, say, its own local gradient with node 2's and node 3's, node n is going to average over 2's and 3's, node 3 is going to average over 1's and n's, and so on and so forth. None of them is actually averaging over all of the gradients, but you might hope that a local average like this is going to be somewhat close to the overall average. The other thing you might hope for is that, over time, information about the gradients at node 2 propagates through these steps to eventually reach node 3, even though there's no direct connection between 2 and 3.
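A one-step numerical illustration of this setup (the toy quadratics f_i(x) = (1/2)(x - t_i)^2 and the small neighbor lists are my own invention, not from the talk): starting from a common x_0, a single neighbor-averaged update already leaves the agents with different local values, which is exactly the challenge discussed next.

```python
import numpy as np

# Each agent i averages gradients over its own neighbor set N_i (which
# includes itself); the sets below are an arbitrary example graph.
neighbors = {0: [0, 1, 2], 1: [1, 2, 3], 2: [2, 0], 3: [3, 2]}
targets = np.array([0.0, 1.0, 3.0, 7.0])    # f_i(x) = 0.5 * (x - t_i)^2
x0 = 1.0                                     # common initialization
alpha = 0.1

grads = x0 - targets                         # grad f_i(x0), same x0 everywhere
x1 = np.array([x0 - alpha * grads[neighbors[i]].mean() for i in range(4)])

# the four local copies have already diverged from one another
assert len(set(np.round(x1, 9))) == 4
```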
That's what we hope for; of course, we're going to see that there are some challenges that come up in this kind of setting as well. To point out one of the main challenges: even if all the agents are initialized to the same value, think about what happens after just one step. Each agent i evaluates its gradient at x_0, which is the same for everybody, but when they perform their first update, each one uses an average over a potentially different subset of gradients from the other agents, so none of them takes exactly the same gradient step. That means we're going to have different local values x_i at the different nodes, which will drift away from each other somewhat. We're going to hope that this averaging process keeps them close enough over time, so that when we take a gradient at one agent's point and a gradient at another agent's point and average them, those two locations are relatively close to each other. Then we might think this is reasonable under some typical assumptions that you might be familiar with if you've studied optimization.
Things like: often we assume the function is smooth, so its gradient is Lipschitz continuous; in other words, if I have two points which are close to each other, their gradients are not going to be too far apart either. But of course we're potentially going to incur some additional error by having this kind of averaging process, and we're going to see whether we can control it; that will be the gist of the talk. In fact, there's already a very broad literature on this topic of multi-agent optimization.
Let me run through it to give you a little historical perspective. If you want to trace back, not to the origins, but to the seminal work on parallel and distributed algorithms for optimization, that really goes back to the 1980s, to the work of John Tsitsiklis during his PhD at MIT, some of it in collaboration with Dimitri Bertsekas and Michael Athans. They looked at algorithms in a slightly different setup from what I'll be talking about today: they considered both synchronous and asynchronous algorithms, and we're really going to be interested in the asynchronous ones. The flavor of algorithm they studied was more of a block coordinate descent form: you think of having your overall decision variables you want to optimize over, and each agent is responsible for just one block; it only updates that block and then sends the updates to all the other agents. In that kind of setup, all the agents need to know the global objective function; that's the way they formulate the problem, and it makes sense in certain applications. But for the applications I'm talking about today, each agent really only has one term of the objective function, so we can't directly apply the same algorithms.
Now, the algorithms I'll be building on today rest on some protocols for distributed averaging; I'll tell you about them in a moment. Those came out in 2003, in work by Kempe, who I believe was a PhD student at Cornell at the time; I think he's now a professor at USC, last time I checked. That was an algorithm called push-sum, for distributed averaging, and we're going to build on top of it today. It's just for solving an averaging problem, and in 2003 they analyzed it in a very restricted case, where the network is fully connected, so every agent sends messages to potentially all other agents. And rather than a truly asynchronous setup, they considered a randomized one, where the order in which the nodes perform updates is randomized, but they didn't account for things like delays. I'll get to what I really mean by an asynchronous model on the next slide.
Continuing with this literature on multi-agent optimization: some of the first work on that came out in the late 2000s, from Nedić and Ozdaglar, who proposed what many consider the first synchronous multi-agent gradient descent algorithm; it's essentially what I had on the previous slide. They analyzed it, I believe, in the setting of potentially non-differentiable functions, so they had subgradients instead of gradients, and at some point they also incorporated the idea of using projections if you have constraints.
In any case, there's been a large body of work in this area since around 2009. In 2012, with my former PhD student Konstantinos Tsianos, we proposed the first algorithm to combine these different ideas: multi-agent optimization and this push-sum type of distributed averaging. Previous algorithms required some sort of symmetry in the network, where if one agent communicates with another, the other always reciprocates, and they do this in a nice synchronous manner. When it comes time to actually implement these algorithms, one thing you find quickly, as we were discussing this morning, is that it's actually tricky to impose some of those constraints. Not always, but in some situations it does become a pain to get things to work that way. One nice thing about this push-sum approach, as we'll see, is that it really lets you do things in an asynchronous manner, where you don't have to worry about enforcing symmetric communications in your network; directed links are fine.
Since that 2012 paper, there's been a sequence of other papers improving on it and making it faster, still in the spirit of faster synchronous push-sum-based optimization algorithms: a series of works by Zeng and Yin at UCLA, by Xi and Khan at Tufts University, and by Nedić, Olshevsky, and Shi at Boston University and Arizona State. This is really what I think would be considered the state of the art right now in terms of what we know how to do for multi-agent optimization over a network. We're going to see how we can improve upon that today. And as I said at the beginning, feel free to stop me at any point if there are questions; I'm happy to make this interactive.
OK, so in this talk I'm going to move away from synchronous algorithms and think about asynchronous ones. One of the key problems with being synchronous is that you move at the pace of the slowest node. If you have a large network of nodes and one of them, for whatever reason, is taking longer to do its computation (maybe you're on a cluster where the processors are shared among multiple processes, so another task is eating up some of the cycles on that CPU; maybe the hardware in your system is heterogeneous, so some machines are just faster than others, or have more processors or more memory than others), all those kinds of issues lead to potentially one node being slower than the rest. In the synchronous approach, and I'll show a picture of this in a moment, the network can't advance to the next iteration until everyone has finished the previous one. That's why we move at the pace of the slowest node, and of course this motivates being asynchronous: we don't wait for the slow ones, everybody just goes as fast as they can, and hopefully we don't have any idling.
But of course there are challenges that come up with being asynchronous too. The things I want to account for today, when thinking about asynchrony, are, first, that when one node sends a message, there can be some delay between the time it transmits the message over the network and the time it's received at its destination; that's one thing we want our algorithm, or our analysis, to somehow account for. The second, really a consequence of the first, is that because nodes are going to be at different iterations as they proceed with the computation, one node might be computing its k-th update, but it won't be using the iteration k-1 values from its neighbors; it will be using stale information. We need to account for that in the analysis, and the algorithm should be robust to it.
The other issue that can come up: if you have heterogeneous node capabilities, like I said, then nodes might update at different rates, so one node might be able to perform more updates than the others. We need some way, in our algorithm and in our system, to ensure that that node doesn't dominate, so that we don't change the actual problem we're trying to solve and we still converge to solutions of the original problem we're interested in.
OK, so our contributions in this work are to propose an asynchronous multi-agent optimization method, asynchronous gradient push; to provide some theoretical convergence guarantees, which I'll tell you about today; and, experimentally, in experiments on a cluster, actually implementing this in a distributed way, to show that the asynchronous approach is for sure more robust than the synchronous algorithms, and in fact is even more competitive in certain circumstances: it scales slightly better than some of these state-of-the-art synchronous methods as you add more nodes to the network. I'll show you experiments to support those conclusions.
All right, so first let me tell you about push-sum for distributed averaging. Consider a setting where we have our agents in a graph, and the connections between different agents are not necessarily symmetric; for example, node 1 will send messages to node 7, but 7 will never send messages back to 1. This setting can be well motivated for different reasons in different applications, but for what we're thinking about today, we're going to see that having this kind of directed communication is helpful in the asynchronous setting, because it lets us analyze the case where a node transmits messages to its neighbors when it has finished some computation and doesn't wait for replies before it continues. You can think of that as one outgoing transmission, where all the links are directed out from that node and everybody else is just doing their own thing.
OK, so first, let's build up to that kind of setting.
Let me start by defining the matrix P, which is column stochastic and captures the connectivity of this network. If my network has n nodes, then P is an n-by-n matrix whose entries P_ij are greater than zero if and only if j sends a message to i; that's going to be my convention here. Because P is column stochastic, if I sum over all the rows i for a given fixed column j, that sums to one, and this is true for every column.
From this, we can think of these matrices as defining a random walk over the graph, and we know from Perron-Frobenius theory for random walks that we're fine as long as the graph satisfies certain properties, things like the chain being aperiodic and irreducible, properties you hear about in graduate-level probability. Aperiodicity we can ensure in the graph context by assuming that every node always sends a message to itself; that avoids periodicities. Irreducibility corresponds to the graph being what's called strongly connected, meaning I can follow a directed set of edges to get from any node in the network to any other node. So we can translate aperiodicity and irreducibility into graph concepts as well.
Under that setting, we know that products of these matrices converge to a matrix where all columns are the same, and each column is equal to the stationary distribution of this random walk. So pi here is the stationary distribution, and the limit is a rank-one matrix whose left vector is the stationary distribution and whose right vector is the vector of all ones; that's why every column is exactly the same.
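This limit is easy to check numerically. Below is a small sanity check (the 3-node matrix P is my own example, not from the talk) that powers of a column-stochastic matrix for a strongly connected graph with self-loops converge to the rank-one matrix pi times the all-ones row vector:

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.0],
              [0.5, 0.4, 0.5],
              [0.0, 0.3, 0.5]])    # columns sum to 1; graph strongly connected
assert np.allclose(P.sum(axis=0), 1.0)

Pk = np.linalg.matrix_power(P, 100)
pi = Pk[:, 0]                       # every column is (approximately) pi
assert np.allclose(Pk, np.outer(pi, np.ones(3)), atol=1e-10)

# consequence described next: P^k x -> pi * sum(x) at every node
x0 = np.array([2.0, 4.0, 6.0])
assert np.allclose(Pk @ x0, pi * x0.sum(), atol=1e-8)
```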
The consequence is that if I look at the limiting product of P^k times some initial vector x, every node converges to a limit which is the stationary vector times a constant, where the constant is just the sum of all the initial values in that vector x: we sum up all the initial values in x, and multiply by the stationary probability at each agent. Now, if we were in the nice situation where the stationary distribution is uniform, so each entry is 1/n, then every agent would just be computing the average of the initial values. But in general, designing that kind of random walk for a directed graph is not always possible, and even when it is possible, it's not necessarily computationally easy; it's not always easy to find that kind of random walk.
So we're going to get around that somehow. The way to think about it is that the values of x are going to be something like the gradient at each node, and we would like to get something like an average gradient out after doing some steps of averaging, passing messages around. Yes, so the way to think about this, and we'll see it when we get to the algorithm, is: I evaluate my gradient, and I'm going to write it out.
Let me write this over here. In the matrix equation, if I write x_{k+1} = P x_k (can everybody see this?), then what's happening at node i is x_i^{k+1} = sum over j of P_ij x_j^k. The only P_ij which are non-zero are those for which j sends a message to i. So the way to think about this, if x is a gradient or whatever value it is, in a synchronous kind of setup: once we have these x_k values, node j sends a weighted copy of its value to each of its neighbors; the only non-zero terms correspond to neighbors, so we only send messages to our neighbors. Node j takes its value, weights it by P_ij, and sends that message to neighbor i. And since we're assuming that node j knows who it's sending messages to, in other words which of its outgoing weights are non-zero, we can choose those weights in such a way that they sum to one.
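The blackboard equation can be mirrored in code. This sketch (the graph and names are assumptions of mine) shows that the local "push" rule, where each node j sends weighted shares P[i, j] * x_j to its out-neighbors i, reproduces one synchronous iteration x_{k+1} = P x_k:

```python
import numpy as np

# out_neighbors[j] lists the nodes i that j sends to (self-loops included);
# these match exactly the non-zero entries of column j of P.
out_neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}
P = np.array([[0.5, 0.3, 0.0],
              [0.5, 0.4, 0.5],
              [0.0, 0.3, 0.5]])   # P[i, j] > 0 iff j sends a message to i

x = np.array([2.0, 4.0, 6.0])
inbox = np.zeros(3)
for j, outs in out_neighbors.items():
    for i in outs:
        inbox[i] += P[i, j] * x[j]   # j sends a weighted copy of x_j to i

assert np.allclose(inbox, P @ x)     # same as one synchronous iteration
```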
That's right, yes; sorry, that's a bad choice of notation on my part. Maybe another letter would have been a better choice, some other vector. But that's correct: don't necessarily think of this as x, because we're not going to average the decision variables themselves; we'll average the gradients.
OK.
OK, so now this gets us to the push-sum algorithm, which is essentially implementing these linear iterations in a decentralized manner. The trick, to avoid the different weightings by the stationary probability at different nodes, is going to be to rescale appropriately. We know that we converge to a value where agent i holds the sum multiplied by the value of the stationary distribution at that node, so we're actually going to run two linear iterations in parallel: one for our values x (again, think of them as the gradients, not the iterates themselves), and one for a weight, which we all initialize to one.
As we run these, we run the same linear iteration for both; this is just writing down recursively exactly the equations I had here. We know that the x's, the gradients, converge to the stationary vector weighted by the sum of the initial values, and the weights converge to the stationary distribution times the inner product of a vector of all ones with a vector of all ones, because that's how we initialized the vector w. The inner product of two vectors of length n whose entries all equal one is n itself, so the weight vector converges to n times the stationary distribution. So now, at each node, when we take the element-wise division of x by this weight we've computed, we see that it converges to the average. Even though nodes are only communicating locally, this is a nice asymptotic result.
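Here is a compact simulation of the rescaling just described (the particular matrix P is my own example): both x and the weights w undergo the same linear iteration, and the per-node ratio z = x / w recovers the exact average even though the stationary distribution is not uniform.

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.0],
              [0.5, 0.4, 0.5],
              [0.0, 0.3, 0.5]])        # column stochastic, strongly connected

x = np.array([2.0, 4.0, 6.0])          # e.g. local gradients; average is 4
w = np.ones(3)                         # push-sum weights, all initialized to 1
for k in range(100):
    x = P @ x                          # x_k -> pi * sum(x_0)
    w = P @ w                          # w_k -> pi * n

z = x / w                              # element-wise rescaling at each node
assert np.allclose(z, np.full(3, 4.0), atol=1e-8)   # average of [2, 4, 6]
```

Note that neither limit alone is the average; only the ratio cancels the unknown stationary probabilities.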
OK. Is this clear so far?
So essentially, instead of having a different importance at each node, based on the stationary distribution of this random walk, we can cancel that out by just sending around one extra scalar parameter. I should also emphasize that I've written this as if we have just one scalar value x at each node to begin with, but of course we can implement the same kind of linear iterations where we replace the scalar value at each node by a vector, and we get the same thing. We still only need one additional scalar w, not one per dimension, because that rescaling factor is the same for every dimension.
OK.
So the z is just being defined right here: it's the element-wise ratio of x and this weight at node i. In these two steps, node i is just summing up the messages received from its neighbors; those messages contain a weighted version of the value of x at the neighbor and a weighted version of the value of w at that neighbor, from the previous iteration. We take the ratio of those two, and that gives us the vector (or scalar) z, depending on the setup, which is converging to the average.
The other thing I'll mention, which I don't have on the slide, is that there's a lot of existing convergence theory for random walks on graphs, even in this context of directed graphs, where we can not only guarantee that we're converging asymptotically to this average, but also quantify how far away we are from the average after a finite number of steps. Typically, that rate of convergence depends on the structure of the graph: you would expect that in a graph that's just one long chain, it's going to take a very long time for information to travel from one node all the way to the other side, whereas a graph that's very well connected might converge much faster. That can be quantified; I won't go into the details, but just know that it's there, and we'll see it come up later in the talk.
OK, so that leads to the next question. Everything is synchronous the way I've described it so far: in order for a node i to compute its value at iteration k+1, it needs to receive the messages from all of its neighbors. So a first natural question, before we even talk about optimization, is to think about what happens if we want to do this averaging in an asynchronous way. Let's say, for whatever reason, we want to account for the fact that messages might be delayed, and that different nodes might take different amounts of time to do their computations. We're going to think of a loop that looks something like the following. We still initialize the same set of variables, and then we repeat the following: send messages weighted by these probabilities.
My and my J.
because I'm thinking about this
from the perspective agent so
it's sending it to its out neighbors J.
waited appropriately to reach out
neighbor after it finishes sending
it's going to look and
see what messages it's received so
I mean, typically when you implement
distributed algorithms, I guess one
sort of standard library that you might
use, if you're in a high-performance
computing setting, is the
Message Passing Interface, MPI.
And in MPI there's different
ways to do message passing, but
one way is through sort of an asynchronous
mechanism, where you think of having a send
buffer and a receive buffer at each agent.
And so when you're going to send a message,
you write that message into your send
buffer and it gets sent
as soon as it can, and in the meantime
you can do some computations. And there's
going to be a receive buffer that will
just store whatever messages you received
while the computation was taking place. You
can probe to see if you have messages in
the queue and process those and so the way
to think about this is at this point we've
written our messages into the send buffer
and we're going to let the underlying
protocols take care of transmitting
those however long that will take and
at the same time we're going to
probe our receive buffer, our receive queue,
and see if we have messages and
whatever we have in that queue.
We're going to take those messages and
just do the same kind of aggregation:
we take whatever we had before and
we sum them up, essentially doing what we
had here. But in some circumstances we
might get messages only from a subset
of neighbors, and in other circumstances
we might get more than one message from
the same neighbor, because maybe that one
is going faster than we are. But at the end
of the day we can think of this as just
performing that same aggregation, but
now with some sort of delay.
No.
If by "the event" you mean you get some
sort of signal once you've got one message
from each neighbor, no, that's not
the way we're going to do it.
We're going to send these messages, so
yes, we're going to
send them, which requires calling a command
to write messages to a buffer to be sent.
Once the send command completes, we're
going to probe our receive buffer and
take whatever messages we find in that
receive buffer until it's empty, and
perform this computation. So like I said,
we may have fewer messages than
the number of neighbors we have, and we may
have more messages in some cases. OK.
Yes.
If no neighbor has sent us any messages
in the time that it took us to send
ours to our neighbors, then when we probe
here we're going to see that we have
no received messages, and so we'll repeat.
So far I'm just describing this in
terms of an averaging algorithm, right? So
maybe that's why it's a little bit
unclear how it's going to fit into
optimization. I think we're going to get
there very quickly, but yeah, we're just
starting with some initial values, and
the goal of this algorithm would
just be to have the values z_i at
every node converge to the average
of the initial values. Right, so
that's all this computation
would perform; it's just a distributed
computation for
calculating averages, to start out with.
OK, maybe if I proceed to the next slide or
two we'll see how optimization
comes into the picture. So
it's clear that this would just
converge to the average, right?
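To make the loop concrete, here is a minimal simulation sketch of the asynchronous push-sum averaging just described. This is my illustration, not code from the talk: the function name, the random-activation model, and the bounded random delays standing in for the MPI send/receive buffers are all assumptions.

```python
import random
from collections import deque

def async_push_sum_average(x0, out_neighbors, num_events=5000, max_delay=3, seed=0):
    """Simulate the asynchronous push-sum averaging loop described above.

    x0            : initial scalar value at each node
    out_neighbors : out_neighbors[i] lists the nodes i sends to; together
                    with the implicit self-loop the graph should be
                    strongly connected
    A message is delivered up to `max_delay` events after it is sent,
    standing in for the MPI-style send/receive buffers in the talk.
    """
    rng = random.Random(seed)
    n = len(x0)
    x = list(x0)                          # push-sum numerators
    w = [1.0] * n                         # push-sum weights
    inbox = [deque() for _ in range(n)]   # receive buffers
    in_flight = []                        # (deliver_at, dest, dx, dw)

    for t in range(num_events):
        i = rng.randrange(n)              # node i finishes a local step
        share = 1.0 / (len(out_neighbors[i]) + 1)   # uniform out-weights
        for j in out_neighbors[i]:        # write messages to the send buffer
            due = t + rng.randrange(1, max_delay + 1)
            in_flight.append((due, j, share * x[i], share * w[i]))
        x[i] *= share                     # keep our own share (self-loop)
        w[i] *= share
        # deliver messages whose delay has elapsed
        remaining = []
        for msg in in_flight:
            due, j, dx, dw = msg
            if due <= t:
                inbox[j].append((dx, dw))
            else:
                remaining.append(msg)
        in_flight = remaining
        # node i drains its receive buffer and aggregates
        while inbox[i]:
            dx, dw = inbox[i].popleft()
            x[i] += dx
            w[i] += dw
    return [xi / wi for xi, wi in zip(x, w)]   # the ratios z_i
```

On a strongly connected graph (here a directed ring) the ratios z_i = x_i / w_i at every node approach the average of the initial values, which is the claim above.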
So now, first of all, let me... OK.
So everything I talked about so
far up to here was synchronous, and
now we're talking about asynchronous
algorithms. Now, when it comes to analyzing
asynchronous algorithms,
first I want to see what's the benefit,
and the benefit can look something like
this. It could be that for one agent it
takes some time; here the gray bars
indicate some processing time and
the black bars indicate some transmission
time. And so in the synchronous setup we
would need to wait until all nodes have
finished processing and transmitting, which
means the neighbors of each
one have received their message; now they
can proceed with beginning to compute
iteration number two. That's one way to
think about what's happening here:
agents two and three are the neighbors of one, so
at this point they have both received
that message and they can start
doing their next computation so we can see
that while they are waiting they're idling:
they're not doing any processing,
they're not doing any communication.
So we're hoping the picture in the
asynchronous setup will look something more
like this, where it might take one agent
a long time to do its computation,
much longer than another agent, let's
say. Sorry, my figure got
truncated somehow, but
it should say agents one, two, and three.
Whenever this agent finishes its first
computation it's going to write those
messages to be sent to some send buffer
and that can happen in parallel while
it's doing its next computation. That's
kind of the picture to have in mind. So
now we no longer have idling; this is
kind of an idealized scenario, but
we can hope that this is going to
make things more efficient, potentially.
Now, in terms of analyzing this kind
of an algorithm, we're going to, for
the purposes of analysis
only, think about a sequence
of discrete times k equals one, two, and
three, and so we can think of this global
iteration counter
incrementing every time one agent
finishes its local computation. OK, so
in fact this kind of a model for
analyzing asynchronous algorithms traces
its roots back to the work of Tsitsiklis,
Bertsekas, and Athans, who analyzed
the same kind of scheme as well. OK, but
is that part clear? So we're just
going to think about... because the only
time something really changes is when
we've completed one of these computational
phases. And so at this point in time, almost
everybody in the network is not going to
have a change but Agent two is going to
have some change to its local variables.
And at the next point in time when Agent
three finishes it's going to have some
change to its local variables, and we
proceed in this manner. Is that clear? So
I've taken what would
otherwise be an asynchronous,
sort of, you could think of it as
a continuous-time scheme, but I've
boiled it down to saying I'm only going to
analyze this process at these discrete
instants of time where something
interesting is happening, and for
the purposes of analysis I can
get a global order of those. OK,
so now let me throw one more twist into
my averaging setup which is to no longer
think about just the process of averaging
but also to incorporate some kind of
a perturbation into the scheme. So in this
kind of a setup, I'm going to think of all
the agents starting with some initial value,
and they still have their weights, and
now when they send a message it's not
just going to be a weighted version of x
from the previous time;
there's going to be some perturbation
eta that's added to that
before it gets transmitted.
OK.
And in fact, eventually we're going to think
of these perturbations as the gradients,
with x
as the decision variable. So
I was wrong when I told you before, well,
not exactly wrong. OK, so for
each received message then we do the same
thing we aggregate those guys and
we aggregate our weights as well right so
this was just kind of the first component.
If we think of our messages as being
two values, the first value is this one; we
don't really actually care about what that
computation was, because
that was done at agent j
before it sent us the message. We get
some value here, we're
going to add that to x;
we get some other value, we're
going to add that to w.
Right, these still look like linear
iterations as well, that's the thing
I wanted to point out, but now instead of
being just x at k plus one equals P_k
times x at k, we also have this additional
perturbation factor that's being
thrown into the mix.
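Written out, the iteration being described, stacking the values across nodes, has roughly the following shape (a reconstruction from the verbal description; eta collects the perturbations the agents add, and the exact indexing conventions follow the paper):

```latex
x^{(k+1)} = P_k \, x^{(k)} + \eta^{(k)} , \qquad
w^{(k+1)} = P_k \, w^{(k)} , \qquad
z_i^{(k+1)} = \frac{x_i^{(k+1)}}{w_i^{(k+1)}} .
```

Without the eta term, the first recursion reduces to the plain linear iteration x at k plus one equals P_k times x at k mentioned above.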
OK. And of course the question now is,
what does this converge to,
does it converge to anything and
of course it depends on what's
happening with these perturbations.
So these are not going to be exogenous
perturbations here.
The perturbations are going to be
intentionally added by each agent
we're going to see this is going
to be my framework for analyzing
an optimization method you probably could
also think of other scenarios where when
you transmit a message there can be errors
in the transmission and that might lead to
some other sort of perturbation I'm
not thinking of that scenario so
I'm actually thinking of the case where
the network might delay my messages but
it doesn't mess with the bits.
That's that's right so that's going to be
also an additional complicating factor.
OK So let me keep going here.
So our first result, for the asynchronous
perturbed push-sum algorithm
from the previous slide,
tells us what happens
in different cases depending on
how those perturbations eta behave. So
this is big and messy;
maybe we don't need to worry about the
details, but what we're bounding here is
the difference between z_i, which was
those ratios that we talked about, so
that's correcting for this stationary
distribution factor, minus x-bar, which
is the average of all the x's
across all the nodes at iteration k.
OK so you can think of maybe we're
trying to track this average value but
the average is also being perturbed at
every step of the way so we're going
to be able to say that that's bounded by
some constant C times another constant q
to the power k,
times the one-norm. One-norms
come up naturally here because
we're talking about stochastic matrices,
and so probability
vectors; in that case the one-norm kind of
falls out from the analysis, but that's OK.
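In symbols, the bound just described has roughly the following shape (reconstructed from the verbal description; the precise constants C and q are discussed next):

```latex
\bigl\lVert z_i^{(k)} - \bar{x}^{(k)} \bigr\rVert
\;\le\;
C \, q^{k} \, \bigl\lVert x^{(0)} \bigr\rVert_1
\;+\;
C \sum_{s=1}^{k} q^{\,k-s} \, \bigl\lVert \eta^{(s)} \bigr\rVert_1 ,
\qquad 0 < q < 1 .
```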
So we have this constant times our
initial vector, plus something that depends
on the perturbations. OK, so let me tell
you briefly about these two constants C and
q. At the end of the day they
depend on two quantities so
first of all assume that my graph is
strongly connected this gets back to
the aperiodic, irreducible kinds of
assumptions that I mentioned earlier.
The other assumption I'm making is
that there is some value tau-bar which is
an upper bound on the delays that I might
experience. And the way to think
about this is that it's an upper bound on
the number of events that can happen between
when one agent finishes
its previous computation to when
it finishes the next computation and all
those messages arrive at its neighbors. So
I'm going to say that that can't
get too big; it can be big enough
to account for different
rates at different agents and
delays in the network, but it can't grow
unbounded. OK, so if I'm in that setting,
my two constants C and q essentially depend
on two things: they depend on that delay, and
they also depend on the structure
of the network
over which we're passing messages. So I won't
go into the details of all these constants,
but lambda is a parameter related
to the network connectivity
I can go into more details if
you're interested maybe offline.
Tau-bar here is the maximum delay.
And d-max is the maximum number of outgoing
neighbors. And in fact this is for
a particular case where I'm thinking of
these p_ij's as just being uniform, so each
agent passes a uniform fraction to itself
and all of its outgoing neighbors, so
it's one over the out-degree plus one. Yeah,
and so that's what we have here.
[Inaudible audience question.]
Yeah, and they have to assume things
like the perturbations are white noise,
essentially, right? So
here it's a different setup, yeah.
Yeah.
Not exactly, no. So
let me give two examples here,
two corollaries of this theorem, that
would maybe then help lead us into
the actual optimization algorithm.
Hopefully I'll do OK on time.
So the first one is, let's
consider two scenarios. So
in one case, these perturbations
remain bounded at all times.
OK, then the perturbations are entering here,
and we see we have this quantity q.
Regardless of the network topology, as
long as it's connected, we can guarantee
q is going to be between zero and one, and
so as we take powers of it, because it's
strictly less than one, these terms
are going to have a negligible effect in
the long run. And so
if these values, these perturbations eta,
remain bounded over all time, then we get a
bound that looks like the following, which
is that this tracking error, right, so
we're looking at how far is the
estimate at any node
from the average, if you could average
over all of their current values at time k.
That is going to remain below this
constant C L over one minus q. So L
here is our assumed upper bound
on the perturbations; C
is again related to the network topology
and the delays in the network, as is q.
The other one says that if
those perturbations go to zero,
then in fact these two things
are going to converge. So
really, the left-hand side here, to get
back to what it
is talking about: it's talking about how
much disagreement there is in the network,
how far away is the value at agent i from
the average of all of their values. So
I want to make sure that that disagreement
doesn't get too bad, essentially.
Exactly, a consensus,
either approximate or
exact, depending on what happens here. And
now we're going to essentially
set up this optimization algorithm
such that the perturbations are our
gradients. And so if
we're converging to a stationary point,
which is going to be a solution if we
assume it's convex, then the magnitude
of the gradient should go to zero,
and we're going to be in this case
where at least the disagreement is going
to go to zero, and we're left to see if we
still end up with some error, potentially.
Right, and in the case where the gradients
are not necessarily going to zero but
they're going to be bounded, we're going
to have some disagreement, and
that's something we'll
also have to deal with.
And essentially, just to kind of
jump ahead, you can think of
this case as being the case where
we take a constant step size:
we're not exactly going to converge, but
we're going to converge to some
ball with some radius, where
the gradient remains bounded. And
in the case where we use
diminishing step sizes, then we'll actually
see things go to zero, and we'll be back
to convergence and exact consensus.
OK OK maybe I'll skip the proof but
to come back to the sketch
if there's time after.
So now, in a nutshell, let me just tell you,
I think I've already said this in words, but
here's the actual asynchronous
gradient push algorithm.
We think of each node as knowing its number
of out-neighbors, the number of
agents it is going to send messages to.
It has some initial vector x_i,
its initial weight is one, and it has
some step size alpha, which is positive and
not too big, for the theory to work out.
So now we're going to split our
asynchronous loop into two parts the first
part we're going to do local computation
which is where we take this ratio
from the values we've been computing by
gossip, by push-sum, and we do a local
gradient descent step, where we add in some
perturbation here based on the gradient of
the local objective at this local value.
Optionally, we may update the step size, or
we may keep it constant.
OK, and then we're going to do
the same asynchronous push-sum kind
of updates, where we send messages;
now p_ij is just equal to one over
the out-degree of each node, and
the same for the weights, and we process
whatever is in our local buffer, we
just sum things up, and we repeat until
some termination criterion is satisfied, for
example until a certain
amount of time has passed or
a certain number of updates
have been performed.
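As a sketch, one node's iteration of the asynchronous gradient push loop just described could look like the following. The helper name, argument layout, and returned message list are illustrative assumptions; a real implementation would write the messages into an MPI send buffer instead.

```python
import numpy as np

def agp_local_step(x, w, alpha, grad_fi, inbox, out_degree):
    """One node's iteration of the asynchronous gradient push loop above.

    x, w       : this node's push-sum numerator (array) and weight (float)
    alpha      : step size (may be diminished between calls)
    grad_fi    : gradient of the local objective f_i
    inbox      : (dx, dw) messages drained from the receive buffer
    out_degree : number of out-neighbors, self excluded

    Returns updated (x, w) and the message to write into the send
    buffer for every out-neighbor.
    """
    # local computation: de-biased estimate, then gradient perturbation
    z = x / w
    x = x - alpha * grad_fi(z)
    # push: uniform weight over self plus out-neighbors
    share = 1.0 / (out_degree + 1)
    msg = (share * x, share * w)
    x, w = share * x, share * w           # keep our own share
    # aggregate whatever was waiting in the receive buffer
    for dx, dw in inbox:
        x = x + dx
        w = w + dw
    return x, w, [msg] * out_degree
```

Driving two such nodes that exchange messages with a one-step delay, and using toy local objectives f_i(z) = (z - c_i)^2 / 2, the ratios z_i = x_i / w_i settle near the global minimizer, the average of the c_i.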
OK, so the question is what happens for
this. And so, essentially
applying those two corollaries that
we had on the previous slide for
the perturbed push-sum algorithm, along
with a few assumptions, which are just that
the local objective functions
are strongly convex and
have capital-M
Lipschitz continuous derivatives,
and that we have some global
minimizer x-star, so the
problem is well posed: under
those kinds of assumptions,
first of all, we can show that these
gradients are going to remain bounded.
The bound depends on
the Lipschitz constant M,
as well as kind of the degree of
asynchrony and the structure of
the network, which tells us how far
things can get away from each other. OK, so
based on those parameters I
can tell you the constant.
So then the main results that we get:
the first one is in the case of
diminishing step sizes. So
if we run the algorithm with
step sizes that satisfy these
typical kinds of properties that we like,
say, from classic stochastic approximation
methods, then we can show that the lim inf
of this difference, right, so
this is now not our disagreement
but how far away z_i is from
the solution to the problem, is going to be
bounded above by a few quantities: L,
which was our constant
bounding how big the gradient can
be at the end of the day,
which is related to the Lipschitz constant
and the network.
Parameters: little m
is the strong convexity parameter, and
then we have the square root of two times
this value here, which is
the degree of asynchrony. So
this is the maximum processing
delay that any node can incur.
If there is no delay, then
tau-bar-proc would be
equal to one, which would be saying that
the difference between two
consecutive instants k where
a node finishes doing a gradient
computation is just one, right? So
that term would not be there.
And n is the size of the network here. OK, so
this gives us a worst case upper bound
on how far we're going to be
away from the optimal solution.
And in the case where we use a constant
step size if that step size is not too
large then we get a similar kind of
bound but where now in addition to this
penalty we paid before we also have
this additional term, alpha C L over one
minus q, which should look familiar: this
was the right-hand side in the case where
the gradients just remained bounded, which
is what you might expect to happen in this
constant step size case. And so this is due
to the disagreement, and it's going to be
proportional to the step size, so we
have some control over that. OK, I do want
to kind of add one caveat: I think it's nice
that we have these theoretical guarantees.
As often happens in the analysis of
these algorithms, it turns out that
a lot of these constants
can be quite large or
very pessimistic so
it's not necessarily going to be so
useful for setting a step size
parameter in practice, but at least it
gives us some insight into what the
parameters are that matter in the problem, and
what the interplay is
between them in a worst-case setting.
OK.
In my last couple of minutes, let me tell
you about the experiments that we've done,
and then I'll wrap up. This
should be quick. I'll just tell you
about one experiment where now we tie
back into machine learning: the local
objective functions at
each agent are going to be based on
regularized multiclass
logistic regression. So it's this
function that we have up here, which, because
of our regularizer, is strongly convex.
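For concreteness, here is a sketch of one agent's local loss and gradient, multiclass logistic regression with an L2 regularizer. The function name and the regularization weight `lam` are illustrative assumptions, not values from the talk.

```python
import numpy as np

def local_objective(W, X, y, lam=1e-2):
    """Regularized multiclass logistic regression at one agent (a sketch).

    W  : (num_features, num_classes) parameter matrix
    X  : (num_local_samples, num_features) local inputs
    y  : (num_local_samples,) integer class labels
    lam: L2 regularization weight; lam > 0 makes the local objective
         strongly convex, as the theory above assumes.

    Returns the local loss and its gradient with respect to W.
    """
    scores = X @ W
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)       # softmax probabilities
    n = X.shape[0]
    loss = -np.log(probs[np.arange(n), y]).mean() + 0.5 * lam * np.sum(W * W)
    probs[np.arange(n), y] -= 1.0                   # softmax cross-entropy grad
    grad = X.T @ probs / n + lam * W
    return loss, grad
```

With the seven classes and fifty-four features mentioned below, W would have a few hundred entries, which is the gradient size the nodes pass back and forth.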
And so we're doing this with
the Covertype dataset, so
we have seven classes, each
input has fifty-four features,
and there's over half a million training
instances. I'm going to do this over
networks of different sizes, and I'm going
to partition the training instances
uniformly over the nodes in my network.
And so the parameters that we're trying
to fit here now jump from being x
to W, because x
is my input and y is my labels
of the classes in the data, so W
is the parameters that we're
optimizing over; sorry
about the change in notation.
OK, but yes. All that to say,
the dimensionality
of the parameter vector is the product of
the number of classes and the number of
features, so a roughly three-hundred-fifty-
dimensional parameter vector, and
the gradient is also roughly three hundred
fifty dimensional; that's what we're
passing back and forth. And
we're going to compare with some of those
methods that I mentioned, which
are state of the art. I'm not going to go
into the details of these; I can tell you
offline. Synchronous gradient push is
really just similar to what I
told you about today, but where we operate
in the synchronous manner, so each
node doesn't proceed until it's received
messages from all of its neighbors. These
other two, EXTRA-Push and Push-DIGing,
in a nutshell do something similar, but
they also use an extrapolation term. So
you can think of it, it's not
exactly the same thing as using momentum
or Nesterov averaging, but it's kind of
a way to extrapolate to get better
convergence in the strongly convex case.
And so what do we see? First I'm going
to look at what happens as we increase the
size of the network, and I'm also going
to look at two cases. I should mention that
all of this was implemented in MPI,
and the experiments were done over
a high-performance computing
cluster, kind of
the academic cluster in Canada.
We'll look at one case where we just let
things run under standard conditions, so
nodes may be shared with other tasks,
and so there might be
small amounts of delay or perturbation
due to that, but we're not going to do
anything else. And then, to sort of stress
test what's happening, we're going to look at
another case where we take one
agent, the one with rank zero, and
at each step we artificially add a small
amount of delay, in order to see how having
one delayed agent perturbs the whole rest
of the process. It's
not going to be surprising that this is
really going to slow down the synchronous
algorithms; the question is how does
this affect the asynchronous algorithms.
And so let me step you through first
just varying the number of nodes so
these are the kinds of
curves that we see this is
now the error to the optimal
value as a function of time.
for the three different algorithms:
Push-DIGing, which is kind of a state-
of-the-art method using extrapolation;
synchronous gradient push; and
asynchronous gradient push in blue.
So the blue one is
the one I'm advocating for today. So
if we have eight nodes, the blue is kind
of hiding underneath the green here, so
synchronous and asynchronous
aren't doing too differently. And
you might be surprised: this
extrapolation method, which is faster in
theory, if I were to plot
this in terms of error as a function of
iterations, in fact the red one would
be faster, but it's slower in this
case because at every step it requires
communicating twice as much information
to do that extrapolation, and that actually
plays a non-trivial role in this problem.
Now if we add delays to the mix all the
methods are slower so the curves all go
up but clearly the asynchronous one goes
up much less than the others so the delays
here have a much more significant
effect on the synchronous methods.
And as we increase the size of the network
these effects become more pronounced and
so when we get to sixty four nodes in fact
it turns out that the asynchronous method,
even in the standard setting,
seems like it's scaling better
than the synchronous method or
this extrapolation method,
which is not something we were necessarily
expecting to see, but it's nice that it
scales slightly better,
because there is less idling, and
idling becomes more pronounced in the
large-network setting. And then of course,
clearly in this case, maybe the most
distinguishing thing here is that
in this large-network case
the asynchronous method
still converges almost as fast as if it
were running in the unperturbed setting.
That's probably because we just added
these delays at a single agent, so
it's really just slowing down a small
part of the overall network, but
that has a significant effect for the
synchronous methods. If we look at this in
terms of scaling now, so if I ask
what's my speed-up, say T_N
here was the time to get the training
error to zero point zero one.
And if I look at the ratio of
four times how long it took for
the four-node case to
the N-node case, where N
is bigger than four:
well, when N
is equal to four this is equal to one, and
it should be equal to one.
And as we increase N,
it can't be exactly equal to one, for
reasons we can talk about, but
as we increase N, in the ideal setting, if we
were getting a linear speed-up, then we
would see this sticking exactly to that
line. This would be like saying, if you
go to sixty-four nodes, you
would have a sixty-four-times speed-up
relative to what you would get if you just
had one node. Now of course in practice
we never expect to get exactly that linear
speed-up, because of overhead due to
communication and other things, but clearly
the asynchronous method is scaling better,
which is what we saw in those previous
slides as well. And in the delayed setting
we still see some good scaling from the
asynchronous method, whereas the synchronous
methods kind of hit a plateau because of
that delay being added at every iteration.
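The scaling metric can be computed as below; the exact normalization on the slide is hard to recover from the transcript, so treat this convention, speed-up over one node estimated from the four-node baseline, as an assumption.

```python
def scaling_curve(T, base=4):
    """Estimated speed-up over a single node, as in the scaling plots.

    T maps network size N to the wall-clock time T_N to reach training
    error 0.01.  Assuming the base case (here four nodes) already scales
    linearly from one node, the speed-up over a single node is
    base * T[base] / T[N]; under ideal linear scaling this equals N.
    """
    return {N: base * T[base] / t for N, t in T.items()}
```

For example, halving the time each time the network doubles traces the ideal line, while a plateau in T_N shows up as a flat speed-up curve.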
OK So let me just wrap up let me
conclude because I think I've already
gone a little bit longer
than I should have.
So to summarize, I guess the contributions
of the talk today are this algorithm,
asynchronous gradient push. We
have theoretical convergence
guarantees, and the experiments I just
showed you hopefully convince you that
this method is also going to be robust to
perturbations and delays in the network.
And some of the ongoing work and
extensions we're thinking about:
The theory that I showed you today was for
continuously differentiable convex
functions in the strongly convex setting.
We also have theory now, actually,
that's not in the paper,
for the non-differentiable convex setting,
under some additional assumptions
which are needed in that context,
if that's of interest.
Other things we're thinking about and
planning to look into: the case where we
have stochastic gradients; the algorithm
I talked about today was deterministic.
The case where we have time-varying
topologies: for the algorithm I
talked about today, the underlying
topology was fixed, so the set of
outgoing agents that one node communicates
to at every step is the same every
time it communicates; we can also look at
what happens if that changes over time.
Non-convex functions are definitely
something I'm interested in exploring
more; there, of course,
we can only hope to
converge to a stationary point, but
it would be interesting to see if
agents can still reach some sort
of consensus in that setting. And
I think it's also interesting
to think about lower bounds for
distributed stochastic optimization.
You know, we have lower bounds for
centralized classic optimization, in
the sense of how many gradient evaluations
we need to get the error to a certain
level, but in the distributed setting it
would also be interesting to look into
the trade-off between how
much computation we need to do and
how much communication we do. What
are lower bounds on that trade-off, to
understand what is the best we can
hope to achieve? That, as far as I know, is
not known right now. So,
all the work that I talked about today
is from my student, my
co-author's, master's thesis, and
it's also available in a paper which
you can download from that link, and
of course, if you have questions, my email
is there, and I'm happy to talk
more; I'll be sticking around after the talk
too. So thank you for your attention.
[Inaudible audience question.]
So I have not thought about
the dynamic setting very much
in the context of
distributed optimisation.
Yes, so in everything I've talked about
today, really, we should think about
those objective functions as being defined
at the outset, fixed, not changing over time.
I've thought about tracking problems in a
slightly different setting, which I'm happy
to talk about offline too, but in this
distributed setup there isn't,
not in what we've talked about today.
I will say that there has been work by
people in the control theory community,
not that I'm aware of in the asynchronous
setting, but in the synchronous setting
there's definitely consensus kinds of
control, distributed control. But in
the asynchronous setup, I haven't
followed the literature very closely,
so there might be something I'm missing, but
I'm not aware of anything in that setting.