[Speaker introduction; the recording is mostly unintelligible here.] ... after graduating he [inaudible] postdoc ... he was a professor at Princeton ... [inaudible] research papers ... this work has been recognized with many awards ... he's a member of the National Academy [inaudible] and works in many areas ...
All right.
Thank you. Thank you for that very generous introduction, and thank you very much for inviting me here; I'm very honored.
So today I'm going to talk about the contextual bandits problem. I just had my name on here, but actually this covers a lot of joint work, and some work that's not even mine.
So to make things concrete let me just
start with a couple of examples just.
To try to get intuitions going.
So the first example is actually a fairly
realistic example a case where people
actually do care about this kind
of problem which is how do you
place advertisements or content or
something else on a web page so
when you visit a website or
when the website gets visited by a user
the website has some
information about the user or
the person's profile the prows
in history but type of thing and
then based on that information the website
needs to make a decision needs to decide
what advertisement to reason or
what news article or
other content to present to the user and
then the user responds in some way so
maybe the person clicks on the news
article or doesn't do anything or
leaves the page and so on and
this happens repeatedly and
the goal is to for
the website to make choices that will
elicit some kind of desired
behavior in the user so
typically want the person to click a lot
to click on the news article a lot for
instance OK here's another
example this one's
more cartoonish but I again it's to
try to give you the intuition of
the problem that we're looking at so
this is medical treatment or
a very simplified cartoon
version of medical treatment so
what happens here is that a doctor
is visited by a patient and
the patient arrives with symptoms
family history test results in other
words the doctor actually knows something
about the patient which is good and
then based on that information
the doctor decides on a treatment so
the treatment might be prescribe a drug
prescribed physical therapy do nothing.
And then the patient
responds in some way so
the patient gets better gets
worse never comes back and so on.
And that happens repeatedly and
again the goal is to make choices choices
that will help the patient
get better quickly and
as often as possible it's
to maximize good outcomes.
OK, so both of these are examples of what we call the contextual bandits problem. So abstractly, the problem looks like this: there's some agent that's repeatedly making decisions to try to maximize reward. In this talk I'll call the agent the learner, since the goal is to do some learning. So what happens repeatedly is that the learner, which could be the doctor or the website in the examples, is presented with a context, which might be information about the patient or the web user. Then the learner chooses an action: prescribe a drug, do nothing, and so on. And then the learner observes a reward of some kind; the reward is the outcome. We're thinking of the rewards as real-valued, so maybe, for instance, one if the person clicks on the news article and zero otherwise. But importantly, the learner only gets to see the reward for the action that was actually chosen. Just like in real life: you make a decision, and then you only see the outcome of that decision; you don't get to find out what would have happened if you had done something else. So it's that fact, the fact that you only get to see the outcome of the action that was taken, that makes it what's called a bandit problem; the name is inherited for historical reasons from the first problem of this kind.
So this happens repeatedly, and then the goal of course is to choose actions that will maximize the long-term rewards, right?
So it's called a bandit problem because of this, like I said, and it's called the contextual bandits problem because there's context, because there's this additional but very realistic information that can be used to make decisions. OK, and we claim that this is a very general and fundamental problem: how to learn to make smart decisions from experience.
OK, so what are the issues that come up in thinking about this problem? So first of all (I guess I've been neglecting this side of the room; I'll point over here for a while), there's this classic dilemma between exploitation and exploration. On the one hand, the learner wants to behave in a way that takes advantage of what's already been learned; if you're a doctor, you want to make decisions that you already believe are the best decisions for that patient. But on the other hand, if you only act in the way that you think is best so far, then you might be neglecting other options that might have better outcomes, so you also need to be doing some kind of exploration. So there's a dilemma: how do you balance between those two goals?
And that comes up in any bandit problem.
In addition, in the contextual bandits problem, we also have context that we're trying to take advantage of as effectively as possible. So when you have context, there are just many more ways of behaving, because you can kind of make a separate decision for every single patient, or every single web user; there are exponentially many more possible ways of behaving. So what do you do in that case? You can't learn separately for every context, because, well, you might not ever see the same context a second time; you might not ever see exactly the same patient in exactly the same situation twice. So somehow we have to generalize across contexts.
Another problem is selection bias. So of course we're trying to do some learning, we're trying to gather statistics, we're trying to do exploring, but at the same time we're trying to act in a way that we think will give high reward; we're trying to explore while exploiting. And that means that the data that we're collecting will be highly biased; it will be highly skewed by the manner in which we're collecting it, so all our statistics are going to be skewed.
And then of course we care about efficiency: we want algorithms that we can actually use, so we want them to use moderate time and space resources.
OK, so that's the contextual bandits problem. So in this talk, what I want to do is give a kind of overview. It's an overview talk, it's not a survey; I'm certainly not covering everything that's out there in the field. But I do want to talk about some of the algorithms that are useful in this area, and I especially want to focus on some of the techniques that have been developed for how to attack contextual bandits and some of its variants.
And what we're looking for are algorithms that will be general-purpose: we want to be able to use them in as many settings as possible; we want them to be off-the-shelf, in the same way that supervised learning algorithms are off-the-shelf; we want them to be practical, so they should be fast and easy to use. And we especially want to be able to learn complex behaviors, complex ways of making decisions based on context.
And of course we also want strong guarantees, guarantees that we can prove, and especially strong statistical guarantees. In other words, we want to be able to learn how to behave intelligently as quickly as we can based on the available information. OK, so that's what we're aiming for.
So here's a very short outline of the talk. I'm going to start by taking this learning problem and describing it more formally.
And that actually takes a certain amount of work to do. Then I'll talk about algorithms; that's really the main part of the talk. And then this last part, depending on whether I have time, is to talk about an application and next steps. OK, so let's start at the beginning.
OK, so I already described the problem kind of informally, but let me just go back and describe it again a little bit more formally and introduce a little bit of notation; there will be a lot of math.
So again, learning is happening through a sequence of interactions of some kind, or rounds. So there are capital T rounds all together, and little t is a kind of generic round. On each round t, what happens is the learner observes some context, like information about a patient; we'll denote that by x_t. And then, based on that context, the learner chooses an action a_t. To make things simple, we'll just assume there's a fixed number of actions, K actions; you can think of K as like ten or one hundred, something like that. And we'll just number them one through K. So the learner chooses an action, and then the learner gets to see the reward, importantly only for the action that was actually selected: r_t(a_t).

And I kind of left out another step that's happening in here that we need to think about when we're trying to formalize the problem. So before the learner chooses an action, the world, or nature, or the environment, or whatever you want to call it, decides what the reward will be for each one of the K actions, for all of the K actions. So we can think of that as a reward vector, a vector of length K, with one component for each of these actions, where each component specifies what the reward is for each of those K actions. And here I'm assuming that the rewards are just bounded between zero and one. OK, so those possible rewards are selected by nature, but of course they're not revealed to the learner; then the learner chooses an action and gets to see the reward for the action selected.
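The repeated interaction just described can be sketched as a simple loop. This is a minimal toy sketch, not the talk's formal protocol: the context format, the particular distribution, and the placeholder random "learner" are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3      # number of actions
T = 100    # number of rounds (capital T in the talk)

def draw_context_and_rewards(rng):
    """Stand-in for nature's fixed distribution over (context, reward-vector)
    pairs; the (age, is_male) context and uniform rewards are invented here."""
    x = (int(rng.integers(18, 80)), bool(rng.integers(0, 2)))
    r = rng.random(K)   # one reward in [0, 1] per action, hidden from the learner
    return x, r

total_reward = 0.0
for t in range(T):
    x_t, r_t = draw_context_and_rewards(rng)   # nature chooses x_t and the reward vector
    a_t = int(rng.integers(0, K))              # placeholder learner: uniformly random action
    total_reward += r_t[a_t]                   # bandit feedback: only r_t[a_t] is revealed
```

The point of the sketch is the feedback structure: the learner never touches `r_t` as a whole, only the single entry `r_t[a_t]`.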
Yeah?

OK, let me say in a second where these get chosen from. So for now, nature chooses the contexts and the reward vectors; the world gives the condition of this patient, and then, I guess it would be more like God or something, says what the outcome will be for each one of the treatments before it happens.

Yes, very much so.

Yes, I'll talk about that in one second.

OK, are you asking if the reward has to be the same for every context? No, it doesn't have to be. No, it doesn't have to be; let me talk about that in one second, yeah.
OK, so, right, so the goal of course is to maximize the reward, or sorry, the sum of the rewards, the sum of the rewards over the capital T time steps. I'm actually going to refine this goal in a minute, but for now let's say that's the goal.
And then, coming to the question that was just asked: where do these pairs of contexts and reward vectors come from? So for now (I'm going to change this in a minute, or later in the talk), let's say that there's a distribution that generates those pairs x_t and the reward vector r_t; it's a fixed distribution. So what happens from one round to another is independent, so they're i.i.d. But I am definitely not saying that x_t and r_t are independent of each other; that would mean context doesn't give any information about the rewards. On the contrary, we're specifically assuming that context does give useful information about the rewards, and the goal is to try to figure out how to act based on that context to maximize rewards.
[Audience question.] The distribution is not known, no. OK.
OK, so to make it more concrete, here's a little toy example, just to drive all these ideas home. So in this tiny toy example there are, let's say, K equals three actions. So what happens on the first round is we get some context: maybe this is a web user, he's a man, he's fifty years old, whatever other information we have about this person. So then nature chooses what the rewards will be for each of the three actions, but of course does not reveal them to the learner. Then the learner chooses one of the actions, maybe action two, and gets to see what the reward is for that action. And then we continue to the next round: now there's a female user, and she's eighteen years old; nature chooses some rewards, we choose an action, and so on. OK, so in this way the learner accumulates a sum of rewards that we're trying to maximize.
So.
Yeah, so what's our goal? Like I said, our goal is to try to figure out how to choose actions based on context. So what the learner is trying to do is discover a rule that maps contexts to actions. We call that kind of a rule a policy, and so the aim is to find a good policy. Just for concreteness, here's an example of a possible policy; this is just an example. So this policy says: if the web user is male, choose action two; otherwise, if she's over forty-five years old, choose action one; and otherwise choose action three. So it doesn't have to have this form by any means; all a policy has to do is to be a mapping, a function that maps contexts to actions. That's all it has to do: given a new context, here's the action that it recommends taking.
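As code, the example policy on the slide is just a small function from context to action. This is an illustrative sketch: the (age, is_male) context encoding is assumed, and the middle branch's action is garbled in the recording, so "action one" there is an assumption.

```python
def example_policy(context):
    """The slide's nested if-then-else policy; context is an (age, is_male) pair."""
    age, is_male = context
    if is_male:
        return 2    # male -> action two
    elif age > 45:
        return 1    # over forty-five -> action one (assumed; garbled in the recording)
    else:
        return 3    # otherwise -> action three
```

Any function with this shape, decision tree, neural network, or otherwise, counts as a policy.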
OK, now, before the learning process begins, we, as the users or designers of this learning algorithm, need to decide on the form of the policy that we'll be using during the learning. This is just like in regular supervised learning: before you start learning, you decide you're going to use decision trees, or neural networks, or boosting, or whatever; what you're doing is choosing the form of, in that case, the classification rule or prediction rule that you decide to use. So in the same way, in this setting, before learning we need to decide on the form of the policy that we're using; like maybe we will use these kinds of if-then-else rules.

So when you decide on the form of the policies, what you're really doing is specifying a space of possible policies that can be used, and I'll be denoting that space by capital Pi. So for instance, in this case maybe we would use all decision trees; this doesn't look like a decision tree, but it is a decision tree: a decision tree is just this kind of nested if-then-else rule.
OK, and again, that's just an example; in general we're imagining that we've chosen some space of policies that we're working with, maybe neural networks and so on.
OK, so tacitly what we're doing is assuming that one of the policies in the space gives high reward, that there's some policy that has this form that gives high rewards. It's just a guess, or an intuition, that we use.
OK.
So the goal now is to learn, through experimentation, exploration, and so on, to do almost as well as the best policy in the space. And to keep things simple, we're going to assume that the space we're working with is finite, but we're also imagining that it might be extremely large, like maybe the space of all decision trees, some very large class of functions or policies, so exponential in whatever reasonable measure you want.
So by doing that, by allowing the space to be so large, we're explicitly allowing these policies to be very complex and expressive; we're explicitly allowing for behaviors that are more complex. And we hope that that will be a very powerful approach, because it will mean algorithms that can learn these complex behaviors, complex ways of making decisions. But it also makes the problem hard. When the policy space is so large, everything you want to do with the policy space, all the obvious things, is infeasible, because we're imagining that the space is exponentially large. And it's also hard in this setting because we're trying to do learning about all of the policies in the space at the same time, while we're also simultaneously trying to perform as well as the best one.
And when we select an action, we only get to see the reward for the policies that would have chosen that same action, so we don't learn anything about the other policies. So in other words, it's really just the same exploration-versus-exploitation dilemma, but now it's on a gigantic scale, because we're working with this huge space of possible ways of choosing actions.
OK, so let me just go back to the formal model. This part is exactly the same, but I just want to refine what the goal is. The goal now is to achieve high reward, high total reward, or equivalently high average reward, in comparison to the best policy in the space. So before, I said we just want to maximize reward; now I'm saying we want to get high reward relative to, or in comparison to, the best policy in this particular space of policies, capital Pi. In other words, we're now going to measure performance in terms of regret. Don't worry too much about the math, but basically this term over here is the learner's average reward, averaged over the capital T time steps, and this is the average reward of the best policy in the space capital Pi, and the difference is called the regret. And so what we're looking for are algorithms that will drive the regret to zero, because if the regret goes to zero, it's saying that the algorithm is getting reward which is almost as good as the best policy in this space. Algorithms that do that are called no-regret algorithms.
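In symbols (using the earlier notation, with x_t the context, a_t the learner's action, and r_t the reward vector), the quantity being described is:

```latex
\mathrm{Regret}_T
  \;=\; \max_{\pi \in \Pi} \frac{1}{T} \sum_{t=1}^{T} r_t\bigl(\pi(x_t)\bigr)
  \;-\; \frac{1}{T} \sum_{t=1}^{T} r_t(a_t)
```

A no-regret algorithm is one for which this difference goes to zero as T grows.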
By the way, it's perfectly fine to ask questions if anybody has any.

Correct, the algorithm doesn't have to follow a policy in the space; we allow the algorithm to choose actions any way it wants to, but the way its performance is being judged is in comparison to that space.
OK, so that's our learning model, or the formalization of the learning problem. So what I want to do now is talk about algorithms, and we'll actually talk about two kinds of algorithms. I've set things up so that our goal is to get to the bandit setting, but it makes things simpler to first think about the full-information setting, so that's what we'll do first.
So what is the full-information setting? The full-information setting is the same as the bandit setting, but now the learner gets to see the rewards for all of the actions, not just the action that was selected. So the setup is the same: we get to see a context, nature chooses rewards for all of the actions, and the learner selects an action, let's say action two. But what's different is that after the action is selected, the learner also gets to find out what the reward would have been for the other actions that were not chosen; it gets that kind of counterfactual information. And the learning process continues in the same way, with the learner collecting rewards in the same way.
And there are a number of things that make this setting a lot easier, but one of the things that's really nice about it is that you can pick any policy, let's say we pick some policy pi, whatever it happens to be, and we can go back and figure out what rewards would have been received if we had listened to that policy on every single round. So we can go back and say: on round one this policy would have chosen action three, on round two it would have chosen action three again, and so on; we can compute the actions that would have been taken on every round by that one particular policy. And then from that we can figure out what reward we would have received if we had always listened to that policy. And if we take the average of those numbers, the average reward, that's a good estimate of the policy's expected reward, expected with respect to a random choice of context and reward vector.
And so now the obvious thing is, on every round, figure out which policy seems to be giving the highest reward based on the actions that it picks, and then just follow that policy: whatever that policy says to do, do it. So that very simple algorithm is called the follow-the-leader algorithm; it's sometimes also called greedy, I think people call it greedy also, but I'm calling it follow the leader. So again, what we do on every round: we figure out an estimate of the expected reward of every policy, in other words we look at the rewards that each policy would have received so far, and we choose the best one (that's what I mean by the empirically best policy in that space), and then we just choose the action selected by that policy on the current context. Very simple, OK.
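The follow-the-leader step just described can be sketched as follows. This is a toy sketch under made-up interfaces: the policy space is a small explicit list of functions, and the "find the empirically best policy" step is done by brute force, which only works because the toy space is tiny.

```python
import numpy as np

def follow_the_leader(history, policies, x_t):
    """Full-information follow-the-leader step.

    history  : list of (context, reward_vector) pairs observed so far
    policies : a tiny, explicit policy space: functions context -> action index
    x_t      : the current context
    """
    if not history:
        return policies[0](x_t)   # no data yet: any default choice will do
    # Average reward each policy would have earned had we always followed it.
    avg_reward = [np.mean([r[pi(x)] for x, r in history]) for pi in policies]
    leader = policies[int(np.argmax(avg_reward))]   # empirically best policy
    return leader(x_t)                              # play its recommended action
```

In practice the argmax over a huge space would be delegated to the oracle discussed next, not enumerated.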
This algorithm gives optimal regret: the regret is like the square root of the log of the size of the policy space divided by the number of time steps T. So that's nice, because it means the regret is going to zero at a rate of like one over the square root of T, which is good, and it's only logarithmic in the size of the policy space. So at least in terms of statistics, the regret is very moderate in the size of the policy space; we can handle these huge policy spaces. But in terms of computation, the hard part, the challenge here, is how you find the best policy in this space capital Pi. So to do that, you need to have some method of finding the empirically best policy; you need some algorithm, or subroutine, or what we call an oracle. OK, I'll call it an oracle, but it's really just an algorithm or subroutine for finding the best policy based on what's been observed so far, all of the contexts and rewards that have been seen so far. And that's the hard part. So we call it an argmax oracle; it's also called the ERM oracle, and a bunch of other names.
So if you think about the problem that this oracle is solving, it turns out that it's actually solving a very standard kind of learning problem: it's solving just a supervised classification learning problem, in other words the kind of learning problem that people have been working on literally for decades, for three or four or more decades. And we have a whole bunch of methods around for doing that, you know: neural nets, decision trees, boosting; we've got a whole host of methods for solving that problem. And even though this looks very different, and we're talking about things like actions and policies and contexts, what's really happening is it's actually just a standard classification problem that we're solving here. So that's great, because it means that we can take any of these off-the-shelf algorithms and plug them in and use them as an oracle for finding the best policy.
OK, so we're kind of accumulating techniques, and I'll just point them out as we go along. We've already seen a couple of techniques: the first is to just estimate the expected reward of every policy, and another technique is to use this oracle, the existing method that we might already have sitting around, to find the best policy. OK, so that was a really simple algorithm; maybe I won't talk about the proof.
So far I've only talked about this i.i.d. setting, the setting where the data is being generated at random, what I'll call the stochastic setting. But that setting is not always realistic: you might have data that's drifting, where there are temporal correlations of some kind, or you could be in a truly adversarial setting, like a game-playing setting, for instance. So there's also been work on a more adversarial model, a setting where we do not assume that the data is being generated stochastically. In that case, the contexts and reward vectors can be chosen in a completely arbitrary way; we don't assume that they're random, and so they could even be chosen by an adversary. So it's seemingly a much more difficult model; it is a much more difficult learning problem.
In particular, an algorithm like follow the leader, which is so simple, doesn't work in this setting: you can set up examples where the adversary can force very low rewards even though there's one policy that gives very high reward. So you can set it up so that an adversary can force high regret.
Nevertheless, there is an algorithm, actually there are many algorithms, but here's an example of one algorithm that can deal with this setting, which is called the hedge algorithm, which is based on the weighted majority algorithm of Littlestone and Warmuth.
So in the hedge algorithm, what we do is we maintain one weight for every policy in the space, one weight for every single policy in the space. And then on every round, we select a random policy from the space, where the probability of choosing each policy is proportional to its weight, proportional to the weight that we're keeping around. And then we just use the action that was selected by that policy, just like we did before; so we're using a weighted combination of policies to make our choices. And then when we observe the rewards, we increase the weights of the policies according to the reward that they received: the more reward they get, the more we increase their weight.
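A minimal sketch of the two hedge steps just described: sampling a policy in proportion to its weight, and the multiplicative weight update. The function names, interfaces, and the multiplicative-exponential form of the update are assumptions for illustration (one common variant of exponential weights), not the talk's exact algorithm.

```python
import numpy as np

def hedge_sample(weights, rng):
    """Pick a policy index with probability proportional to its weight."""
    p = weights / weights.sum()
    return int(rng.choice(len(weights), p=p))

def hedge_update(weights, policy_rewards, eta):
    """Multiplicatively increase each policy's weight according to the reward
    (in [0, 1]) its chosen action earned this round (full information)."""
    return weights * np.exp(eta * policy_rewards)
```

With a learning rate eta on the order of sqrt(ln|Pi| / T), this family of updates is what gives the optimal regret mentioned next; note the weight vector has one entry per policy, which is exactly the space problem the talk goes on to raise.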
So this algorithm yields optimal regret, I mean, if you set it up right; there are obviously a lot of details left out. And it gets that optimal regret even in the adversarial setting, which is nice.
But the problem is that this algorithm is keeping around one weight for every policy in the entire space, so that means its time and space requirements are linear in the size of the policy space; it's going to be much too slow if the policy space is gigantic. Nevertheless, it's still useful: it's used, like, in game playing; you can use it to play games or to solve games.
And the boosting algorithm AdaBoost actually was derived from hedge, taking advantage of the fact that this algorithm works even in an adversarial setting. OK, so here we have a third technique, which is to use a weighted combination of the policies. All right, so, well, let me skip that.
OK, so where are we? We talked about these two methods, follow the leader and hedge. Follow the leader works in the stochastic setting; hedge works in the non-stochastic setting; they both get optimal regret. Follow the leader is efficient if you have access to an oracle; hedge is not efficient if the policy space is gigantic. So you might ask: is it possible to get both? Is it possible to get an algorithm which gets no regret and is oracle-efficient, in other words is efficient if we have an oracle, and works even in the non-stochastic, adversarial setting?
So this appears to be impossible, because of a result of Hazan and Koren; it looks like you can't do that. OK, so that's the full-information setting, and we're going to use those techniques even in the bandit setting, which I'm going to talk about now. So of course, in the bandit setting, you only see rewards for the actions that were taken. Whereas in the full-information setting we get to see the rewards for all the actions, now we only get to see it for the actions that were selected.
So how far can we get with what we talked about before? Well, we can still figure out what the learner's own total reward is, because we get to see the reward every time the learner chooses an action. But now, if we pick a policy, we can still figure out what actions would have been taken by that policy, but unfortunately we only get to see the rewards on a subset of the rounds for that particular policy. Like, on round one the learner chose action two but this policy would have chosen action three, so we don't know what reward would have been received on that round; on round two, it happens they would have both chosen the same action, so we do see what reward would have been received by that policy.
OK.
So we still might like to use an oracle in the same way, to try to find an empirically good policy; it's a natural thing to do. But in order to do that, we have to overcome two difficulties: first of all, we only get to see some of the rewards, as I said, and furthermore, the rewards that we do see are very highly biased, because the learner is again trying to choose actions to get high reward, so the statistics that are being gathered will be biased.
OK, now, of course, I talked about this exploitation-exploration trade-off; let me just try to drive home the point that we really need exploration in this setting. So here's just a little tiny example to show what the problem is, of why an algorithm like, say, follow the leader does not work in this setting if you don't have any exploration built in. So imagine there are two drugs, and you're trying to figure out which one is better. Drug A is pretty good: it's got a cure rate of sixty percent. Drug B is a lot better: it's got a cure rate of eighty percent. And so you get some people, and you try out the drugs, and you see what happens.
Well, it's possible, just because of statistical fluctuations, that in early trials A, the worse drug, might actually appear better than B, just because you have a small sample size; that's possible. And if that happens, if you get off on the wrong track in the beginning, then an algorithm like follow the leader, which always chooses the seemingly best action, will get stuck: since A looks better, it will just keep on trying A, and its estimate of how good A is will get better and better, and we'll never try out B to see that it would have been better if we had tried it.
OK, so an algorithm like follow the leader really can get stuck, and that's why we need exploration; and of course this problem is much more extreme for more complex policies. So we need exploration, and the natural thing to do is to explicitly inject exploration into an algorithm like follow the leader. So you can take follow the leader and explicitly inject exploration into that algorithm, and that leads to techniques that are called epsilon-greedy, or a more refined version is called epoch-greedy. So the algorithm looks like this: on every round we choose an action in one of two ways. With, let's say, ninety percent probability (if epsilon is point one), we choose the best policy so far; we do exploitation, we do what follow the leader does. And then ten percent of the time, we'll just choose one of the actions uniformly at random; we'll do explicit exploration. And as before, we can use an oracle to find the best policy, it turns out.
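The two-way choice just described can be sketched in a few lines; the function name and interface here are hypothetical, for illustration only.

```python
import numpy as np

def epsilon_greedy_action(best_policy, x_t, K, epsilon, rng):
    """Epsilon-greedy action choice.

    best_policy : the empirically best policy so far (as found by the oracle)
    K           : number of actions
    epsilon     : exploration probability, e.g. 0.1 for the 90/10 split above
    """
    if rng.random() < epsilon:
        return int(rng.integers(0, K))   # explore: uniform over the K actions
    return best_policy(x_t)              # exploit: do what follow-the-leader does
```

The uniform branch is what guarantees every action keeps getting sampled, which is exactly what saves us from the stuck-on-drug-A failure above.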
So this is a very simple and fast algorithm if you have the oracle. That's good. What's not so good is that the regret you get is not optimal. The regret looks like this: there's this inner term, which is K, the number of actions, times the log of the number of policies, over T, but now raised to the one-third power, so the cube root of this quantity. It turns out that the optimal regret will have a square root rather than a cube root. So it might not look like a lot, one half versus one third, what's the big deal, but actually it leads to a very big difference in performance.
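Written out, the two rates being compared (with K actions and policy space Pi) are roughly:

```latex
\text{epsilon-greedy:}\quad
O\!\left( \left( \frac{K \,\ln|\Pi|}{T} \right)^{1/3} \right)
\qquad
\text{optimal:}\quad
O\!\left( \sqrt{ \frac{K \,\ln|\Pi|}{T} } \right)
```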
[Audience question.] I'm not sure if anybody's sat down and proven a lower bound; I'm not positive whether it's actually known, but in practice it shows up. OK, so another technique is to do explicit exploration, using uniform sampling over the actions.
OK, now, one of the problems I talked about quite a bit is selection bias; I talked about why it's such a big problem, why the statistics will be skewed. In fact, there's this really simple and old trick for getting around selection bias, which is called inverse propensity weighting. So let's consider a really simple case: you've just got this random variable X, and you want to estimate its expected value. For instance, X might be whether some coin comes up heads, where it's not a fair coin, and you're trying to estimate that probability, the bias of the coin. Well, that's kind of a boring problem, but to make it a little bit more interesting, let's say that with probability p we get to observe that random variable once, like we get to see one flip of that coin, but with probability one minus p we don't get to observe the random variable at all; we don't even get to flip the coin. So what do you do? How do you estimate the bias of the coin when part of the time you don't even get to see it at all? Which is basically the situation we're in, where you get to see the outcomes for some actions but not for others.
But like I said, there's this little trick:
you can define a new
random variable called X-hat,
which, in the first case,
where the random variable is
observed, takes whatever outcome
was observed and divides by P;
that will be X-hat.
And if the random variable
was not observed, so
in the second case, we'll just define X-hat
to be zero.
So whereas we don't always get to see X,
the random variable that we care about,
we do always know the value of X-hat,
regardless of which case we're in.
And you can probably do
the expectation in your head
to see that the expected value
of this random variable X-hat
is exactly the same as
the expected value of X: with probability P
it's X over P, and otherwise it's zero,
so the expectation is P times E[X] over P,
which is just E[X]. OK, so X-hat
itself will be an unbiased
estimate of the expected value of X.
It's a nice little trick, and
we can use it in our setting too,
to get unbiased estimates for
the rewards of all of the actions,
even the ones that we didn't get to see.
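To see the trick concretely, here is a small simulation of the coin example; this is my own illustration, with made-up numbers (bias 0.3, observation probability 0.25). The average of X-hat recovers the true bias even though the coin is only observed a quarter of the time.

```python
import random

def ipw_estimate(true_bias, p, n, rng):
    """Average of n draws of X_hat: outcome / p when the flip is observed
    (probability p), and 0 when it is not (probability 1 - p)."""
    total = 0.0
    for _ in range(n):
        if rng.random() < p:                    # we get to see one coin flip
            flip = 1.0 if rng.random() < true_bias else 0.0
            total += flip / p                   # inverse propensity weighting
        # unobserved draws contribute 0 to the sum
    return total / n

rng = random.Random(0)
est = ipw_estimate(0.3, 0.25, 200_000, rng)     # coin seen only a quarter of the time
```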
Great, so with all this stuff
we can get these unbiased estimates; so
does all this talk about selection bias
not matter? Is it not a problem?
Well, no, because even though we can get
these unbiased estimates, even though we
can get the right expected value,
the variance might be extremely large.
So I was talking about
bias being a problem, but
the real problem is variance: in
order to get good estimators,
we need to be controlling variance; that's
really the thing we need to be controlling.
And sometimes you can do that with
this uniform sampling of actions, and
sometimes you have to do
something more sophisticated. OK, so
this is another technique, inverse
propensity weighting, that we can use to
get unbiased estimators.
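The variance problem is easy to see in the same coin example (again my own illustration): Var(X-hat) works out to bias/P minus bias squared, so it blows up as the observation probability P shrinks, even though the estimate stays unbiased.

```python
import random

def ipw_variance(true_bias, p, n, rng):
    """Empirical variance of X_hat for a coin with the given bias,
    observed only with probability p."""
    samples = []
    for _ in range(n):
        if rng.random() < p:
            flip = 1.0 if rng.random() < true_bias else 0.0
            samples.append(flip / p)            # rare observations get huge weight 1/p
        else:
            samples.append(0.0)
    mean = sum(samples) / n
    return sum((s - mean) ** 2 for s in samples) / n

rng = random.Random(1)
v_common = ipw_variance(0.3, 0.5, 100_000, rng)   # coin observed half the time
v_rare = ipw_variance(0.3, 0.01, 100_000, rng)    # coin observed 1% of the time
```

Analytically, the two variances are about 0.51 and 30: shrinking P by a factor of fifty inflates the variance by roughly the same factor.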
I talked about bandits in the stochastic
setting with epsilon-greedy.
In fact, you can also do bandits
in the non-stochastic setting, where
an adversary is choosing the outcomes,
the rewards and the contexts. So
the first algorithm to do
this was called EXP4. For
the longest time I didn't
know how to pronounce this: should I say
'E-X-P-four,' which is this awful mouthful,
or 'exp-four,' which is this awkward thing to say?
And then I had this brilliant idea:
we can pronounce it 'EXP4,'
so now I can try to
get everybody in the world to say
'EXP4' for this algorithm.
So anyway, this algorithm is
a contextual bandits algorithm, even
though we didn't call it that, and it works
even in the non-stochastic setting.
And basically what this algorithm does is
combine techniques that we've already
talked about, hedge, uniform sampling of
actions, and inverse propensity weighting, and
if you combine those in the right way you
can get regret which is optimal. So whereas
we had that one-third rate for epsilon-greedy,
here we get a square-root rate.
And let me not talk about that.
OK, so it's got this nice regret, but
the problem is that
the time and space requirements are linear
in the size of the policy space, so
if you're working with a huge
policy space, it's hopeless.
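To make "linear in the size of the policy space" concrete, here is a sketch of EXP4 in the spirit described: exponential weights over the policies (the "experts"), a gamma share of uniform exploration, and inverse propensity weighting for the updates. It follows the standard EXP4 structure with details simplified, so treat it as illustrative rather than as the exact algorithm; notice that every round loops over all policies, twice.

```python
import math
import random

def exp4(policies, num_actions, rounds, gamma, draw_context, draw_reward, rng):
    """Sketch of EXP4. Time and space per round are linear in len(policies)."""
    weights = [1.0] * len(policies)
    total_reward = 0.0
    for _ in range(rounds):
        x = draw_context()
        advice = [pi(x) for pi in policies]           # every policy recommends an action
        wsum = sum(weights)
        probs = [gamma / num_actions] * num_actions   # explicit uniform exploration
        for w, a in zip(weights, advice):
            probs[a] += (1 - gamma) * w / wsum        # exploitation mass from the experts
        a = rng.choices(range(num_actions), weights=probs)[0]
        r = draw_reward(x, a)                         # reward assumed in [0, 1]
        total_reward += r
        r_hat = r / probs[a]                          # unbiased IPW reward estimate
        weights = [w * math.exp(gamma * r_hat / num_actions) if rec == a else w
                   for w, rec in zip(weights, advice)]
        top = max(weights)
        weights = [w / top for w in weights]          # rescale to avoid overflow
    return total_reward, weights
```

On the same toy problem as before (reward 1 when the action matches the context), the matching policy accumulates the weight and the realized reward comfortably beats uniform play.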
OK, so where are we?
So I talked about these two algorithms for
the bandit setting,
epsilon-greedy and EXP4.
Epsilon-greedy works in the stochastic setting;
EXP4 works in the non-stochastic setting.
Epsilon-greedy is efficient if we have access to
an oracle, just like follow the leader;
EXP4 is inefficient if the policy
space is huge, just like
hedge. And for EXP4
we can get optimal regret;
for epsilon-greedy we don't. And
again, this one-third
versus one-half looks small.
But it's actually really big. So
if you turn the bounds around
and you say, if you want to get regret
epsilon, how many rounds do you need?
Well, in this case you would need
about one over epsilon cubed
rounds; and by the way, you probably
picked up that I'm ignoring K and the size
of the policy space here, because they're fixed.
So you'd need about one over epsilon cubed
rounds versus one over epsilon squared
rounds in the case of EXP4. So
the difference between cubed and
squared can be really big.
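To put numbers on the cubed-versus-squared gap (ignoring K and the log of the policy-space size, as the speaker does), a quick check of what the two rates demand for a target average regret of one percent:

```python
def rounds_needed(eps, power):
    """Smallest T (up to constants) with T ** (-1.0 / power) <= eps."""
    return (1.0 / eps) ** power

eps = 0.01                        # target average regret of one percent
cubed = rounds_needed(eps, 3)     # epsilon-greedy's T^(-1/3) rate
squared = rounds_needed(eps, 2)   # the optimal T^(-1/2) rate
# a million rounds versus ten thousand
```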
OK. So again we can ask: is it possible to
get the best of both? Now, before, I used
'best of both' to mean something slightly
different than what I mean here. So here, by
best of both, what I mean is an algorithm
which works in the stochastic setting and
which is efficient if you have an oracle,
like epsilon-greedy, but which gets optimal regret.
So in other words, can we take what's up
here and change it to optimal regret?
And that was open for quite a few years.
Turns out that the answer is yes.
So there's one last algorithm
I want to talk about.
So this algorithm's official
name is ILOVETOCONBANDITS; I
can't remember where the heck
that name comes from, but
it doesn't matter, because everybody
calls it the mini-monster algorithm.
I'll say in a minute why it's
called the mini-monster algorithm.
In this algorithm, what
we do is use all
of the techniques that I've
talked about so far, and
one of those techniques is to find
a weighted combination of the policies.
But now we're going to find that weighted
combination of policies in a different way
than we did for hedge. So now, every
round, we're going to find this weighted
combination so that it satisfies certain
properties that we care about: we're going
to say first what properties we want it to
satisfy, and then we're going to go off and
look for it. And there are just two
properties, and they're both natural.
The first one is we want that weighted
combination to give low regret, or
low estimated regret; in other words,
we want to do exploitation, we want
to choose actions that we think
will give high reward, of course. And
secondly, we want to have low variance
in the estimates that come out of using
that weighted combination. So we want to
make sure that our future estimates of
policy values will be decent estimates, and
that basically reduces to them
having bounded variance. And
the first one, I already said, is
exploitation; the second really
corresponds to gathering statistics, so
it's really talking about exploration. So
we're again making those explicit. It
turns out you can write these down.
Mathematically, when you write out what these
mean, you end up with this gigantic,
very complex optimization problem, with
lots of unpleasant terms, very big and
unpleasant. But it turns out that
there's this really simple and efficient
numerical algorithm that you
can use, using an oracle, one of
these oracles that I talked about, to
find a solution which satisfies these
criteria, which satisfies these properties.
And the basic idea of the algorithm
is to just find a violated constraint,
fix it, and repeat until done.
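For reference, the two properties can be written schematically as a feasibility problem over a weight vector Q on policies. This is my reconstruction from the "Taming the Monster" line of work this part of the talk draws on, with constants c1, c2 elided and mu roughly sqrt(log N / (K T)) acting as the exploration floor:

```latex
\text{find a distribution } Q \text{ over policies such that} \\
\sum_{\pi} Q(\pi)\,\widehat{\operatorname{Reg}}(\pi) \;\le\; c_1\, K \mu
\qquad \text{(low estimated regret: exploitation)} \\
\forall \pi:\;\;
\mathbb{E}_{x}\!\left[\frac{1}{Q^{\mu}(\pi(x)\mid x)}\right]
\;\le\; c_2\, K \;+\; \frac{\widehat{\operatorname{Reg}}(\pi)}{\mu}
\qquad \text{(low variance: exploration)}
```

Here Q^mu is Q smoothed with a mu-floor of uniform exploration, and the "find a violated constraint and fix it" loop is a coordinate-descent-style procedure on this problem, with each step implemented through the oracle.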
So I can't give the details, because
there are a lot of details, but
that's the main idea of
the algorithm. OK, now if you
do that, you end up with an algorithm which
gives... OK, it's not actually optimal, but
it's close to optimal: it's that optimal
rate, but with some log terms thrown in. And
the algorithm is really fast. So
we can measure the speed of the algorithm
by how many times it calls
the oracle on every round, and
it turns out that on average, the number
of times this algorithm needs to call
the oracle is, if you ignore all these
other terms and just look at T,
about one over square root of T
times per round. Or, said differently,
it's calling the oracle only
once every square root of T rounds, and
on all the other rounds it's not
doing anything, it's just collecting
statistics, it's not calling the oracle. So
it's calling the oracle far
less than once per round, so
it's very fast and nearly
optimal. I should say
that a lot of these ideas, the algorithm, is
very much based on a previous algorithm
that also had an official name,
RandomizedUCB, but was much more complicated,
so complicated that even the authors
came to call it the monster algorithm, and
so that's why this algorithm
is called the mini-monster.
And if you look at the paper,
you'll see what I'm talking about.
OK, so yeah, the last technique is to
take the properties you care about,
write them out as an optimization
problem, and solve it.
So let's see how much time I have; I should finish.
OK.
What about the space?
So, for space, you do have to remember all of
the data that you've seen so
far; the main
bottleneck is storing all of
the examples you've seen so far. So,
yeah, it's not exponential, it's
polynomial, but in practice that's not so
great, so that actually would be
an open area to try to improve.
You're right, you're not
storing one thing per policy, so
it's not exponential, it's
polynomial: after T
rounds your data is like O(T). I don't
know what other factors are in there, but
in practice even that
can get annoying if T
is really big.
OK. So,
I am short on time; let me just very
quickly talk about
an application of this. So
this is work that was done by other
people at Microsoft: they built
a service called the Multi-World Testing
Decision Service, which is basically just
a system for solving contextual
bandit problems. But
it's a system, so it's general-purpose, and
it's modular, and it's set up so that
it would be really easy to interface with
existing systems, real-world systems
like web servers and all that other
stuff that I don't know anything about.
And one of the things they did in
designing this system was to try to reduce
common errors. If you look at
an algorithm like epsilon-greedy, you look
at that algorithm and you're like, how hard
could it be to implement that algorithm?
It turns out, on a real system,
that can be
challenging, just gathering statistics
in the right way, because these
systems are actually amazingly complex, and
you have things happening in a stream, and
you don't always know what's
happening downstream and upstream, and so
on; it actually can be hard just
gathering correct statistics. So
they set everything up so it would
eliminate many of those common errors.
And it works really well,
and it's very general-purpose. So,
for instance, they applied it to the MSN
home page,
where they had tried many
learning methods in the past and
none of them had been successful; but
using this system they were able to
get a big lift in performance,
like a huge relative lift in how
many people actually click on things, and
now it's used all the time, actually
on the MSN home page.
OK. So let me skip these, since
I'm pretty much out of time. So,
I talked about contextual bandits, and
I talked about some of the challenges:
how to get computational efficiency,
how to work with these very large
policy spaces, which is equivalent
to saying that we're looking for
very complex behaviors; how to get
optimal statistical performance,
to learn as quickly as possible based
on the data that's available; and
how to handle even adversarial settings.
I guess we don't have algorithms for
doing all of those things at once, but we talked
about several combinations of those.
So the research community has been
steadily building up these methods for
dealing with the challenges that I've talked about
here, and one of the lessons is that
theory is really indispensable,
that it's been very helpful,
really necessary,
to work with the theory to end up with
algorithms that work in practice.
So let me stop there.
Yeah, it's a good question, and I'm drawing
a blank on what's been done in that area.
Yeah, it's a good question; I'm sorry.
That's certainly a reasonable thing.
Well, I guess in other areas people have
looked at things where you actually have
some market process that's
generating the data, for instance,
and gotten some traction
there. But,
you know, in terms of people applying it to
the bandits setting in particular,
maybe people have ported it over here.
Well, in the IID setting the context does
not have that property, that's right;
there the context could have
this historical form and so on.
Does that help with
designing algorithms, in
your work?
So, the way we set things up,
no: we just have these fixed policies
that are mapping contexts, which might be
a history of everything that's happened so
far, to actions, and so
they're just kind of
existing in isolation.
But in a particular implementation you
could kind of build that into your
policies, perhaps.
Just rewards? The adversary chooses everything.
Yeah, yeah. Actually, yeah, OK, there
we go: yes, there has been some
work on that setting.
Well, we allow the adversary
to do anything. So,
if you want to make
it formal, the adversary could be
an arbitrary mapping from everything
that's happened so far to
a context and a reward, for instance;
it's truly adversarial.
A probability distribution over them?
Well, the adversary is allowed to be
adaptive, so the adversary could do that,
but the adversary doesn't have to: the
adversary can just
exist at a certain point in time and
say, here's everything I've seen so
far, and based on that I'm picking this
context and this reward.
So there are really no restrictions
on the adversary in this model.
Does that make sense? I mean, you can think of it as
a game; that would be one way to set
it up, that the adversary is a player in
a game, if you want to make things formal.
Am I not answering your question?
Well, I think... I see
what you're getting at. So
the catch is that we're just aiming for
low regret. So,
yes, the adversary
can set things up so
that the actual rewards that
are received are terrible,
so the algorithm might get
terrible rewards and do awfully. But
what these algorithms say is that the only
way the adversary can do that is to make
every policy in the entire space also
bad, so that the regret will be small:
the algorithm is terrible, but so
were all the policies. So
that's why we have this kind of tacit
assumption that one policy is good; if
that's the case and you have low regret,
then the algorithm will perform well.
OK Thank you.