[00:00:05]
>> We are very fortunate to have him right here at the University of Chicago. PhD at MIT, a startup, and now at Berkeley. He has all kinds of fellowships and awards, one of which I was there to witness him receiving. I like to call it the Nobel Prize, if
[00:00:40]
you want to, but it was the first time such an award had been given, certainly the first I've ever witnessed. He works on control, optimization, and machine learning, and is probably best known for his work on matrix completion. I've actually known him a long time. Do you remember when that was? Over ten years ago.
[00:01:08]
I don't like talking about things that are that old, but it's all still relevant. That was 2007, before the recession. I was in the middle of grad school, working on my own things, and he was presenting a poster titled "Random Features for Large-Scale Kernel Machines,"
[00:01:33]
which combined two of the things I was working on, so I was naturally very skeptical. It felt kind of obvious; I couldn't believe they had made a paper out of it. But I was impressed. Then flash forward ten years, and this paper is receiving the Test of Time Award, which is given to especially important papers.
[00:02:02]
The other thing I wanted to mention, if you haven't seen it: when you win this award, one of the co-authors gets to give a talk, which happened at the NIPS 2017 Test of Time Award.
[00:02:25]
At some point. But this is actually not his most important paper. I looked it up again today and it's still there; it's not on your website, but it's on the Internet. I love the Internet. It's a paper from a couple of years before, an empirical
[00:02:48]
study. It's worth it just for the pictures; sit down and look at the paper. Here is the conclusion: the helmets amplify frequency bands between 1.2 and 1.4 gigahertz. According to the FCC, these bands are supposedly reserved for radio location and other communication with satellites. Another band coincides with mobile phone technology; though not affiliated with government, these bands are in the hands of multinational corporations. It requires no stretch of the imagination
[00:03:30]
to conclude that the current helmet craze is likely to have been propagated by the US government, possibly with the involvement of the FCC. We hope this report will encourage the paranoid community to develop improved helmet designs. So I think that gives you a flavor of the kind of thing we can expect. Ben is going to
[00:03:52]
talk today about learning to control.

>> Thank you, Mark, that was an amazing introduction. Let me just say, the best thing about that paper, citations aside, is that I got to be a Wait Wait Don't Tell Me question. That was really the best thing to come out of the aluminum foil helmet paper by far, and I have bragging rights there.
[00:04:13]
Anyway, thank you, that was awesome. I'll come back to something Mark said in the introduction in a little bit. This is a bit of a dry title, which I'm really fond of, because why lead with something outrageous when I can get outrageous progressively as I move on? So we'll start boring.
[00:04:37]
Before I actually dive into anything, let's see if my clicker even works from here. One second, it's very important that my clicker actually works. Victory. The most important slide is this one: this talk is the product of multiple projects with a great team at Berkeley.
[00:05:01]
These are the six students and postdocs most involved with everything I say here, and none of it could have been done without them. Anything inflammatory or offensive I say is my opinion, and anything you find interesting is probably from these guys, so we'll keep the allocation of credit that way.
[00:05:22]
So the broader topic is machine learning, which is what you're all here for; it's kind of unavoidable now. People think it can solve all problems. Maybe we've had a bit of a reckoning over the past couple of years, but I still think there is this pervasive notion. You have companies like Google claiming they are "AI first," claiming they can solve all their problems with artificial intelligence. And indeed, people are now trying to push machine learning into problems that face people: we solved Go, and then people think that means we can solve all planning and actuation problems. There are lots of impressive demos of machine learning in robotic systems. So the hope is that we can use machine learning in systems facing more and more important problems, going from labeling images of cats and dogs to perhaps controlling self-driving cars, allocating our supply chains, or maybe controlling our power grid. But the problem is that as you move from the frivolous into things that are essentially mission critical,
[00:06:36]
the constraints that we usually apply to machine learning have to be tightened. It's not enough to say "this works OK, I'm going to ship it," because people can and have died. I used to say it was possible that people would die, but people have died: we now have multiple fatalities involving self-driving cars.
[00:06:56]
And I think there is other evidence that machine learning systems have led to loss of life. So it's a weird kind of technological responsibility that we didn't used to have. When machine learning didn't work that well, it was fine to do crazy stuff, but now that it's actually important, we all have to change our perspective. In particular, I'm interested in understanding not just how to make these things safe; there are fascinating research problems that come along with this. If I have a machine learning system that has to interact with its world and can choose actions based on what it has seen before, that's very different from standard supervised learning, certainly more than what we would learn in an undergraduate course. Broadly, moving from supervised learning, which is just making predictions, to reinforcement learning, which is choosing actions that have consequences based on things we've seen before, is a big shift that we have to understand. Broadly, I'd call reinforcement learning the study of using data to enhance the future manipulation of a dynamical system, and by dynamical I mean the broadest possible setting.
[00:08:12]
Now, if you had asked me what that field was called ten years ago, I would have said control theory, because that's the community I was brought up in. I feel like reinforcement learning wasn't mainstream at that point, although people had solved backgammon, but not Go. So now the question is: what's the difference between these two? What I want to talk about today is how we can unify these two views of systems that interact with their environment: systems that collect data, use that data to make decisions, and whose decisions have consequences that can be evaluated.
[00:08:45]
That's what I'm hoping to merge today, and I'm going to focus on one question. The big question is: what are the limits of these learning systems that interact with their environment? What are their capabilities? When do they fail? How do we put these approaches on equal footing? I want to focus on one baby step towards that broader issue: how well do I have to understand a system in order to control it?
[00:09:15]
And by control I mean what control theorists mean, which is usually just making things, say, fly in a sensible way, though we can think more broadly about the consequences of how we use that word in machine learning systems. Today we'll focus on the standard notion of control. What's cool is that lots of great research problems sit at this interface, and it touches all sorts of areas: statistical learning theory, and I think actually understanding control theory is important, and of course, for me, since optimization has always been the foundation of everything I've done, it ties into a lot of cool things in optimization as well. So hopefully there's something here for everyone. OK, so what is control theory? Hopefully everybody knows.
[00:10:09]
Our computer science undergraduates don't have to take this class, so for anybody here who skipped it as an undergrad, this is the five-minute introduction. It deserves more than five minutes, but let's just give you the flavor and the idea. You have a dynamical system. A dynamical system is something with some internal state; the state is everything I need to predict the future. And you have this thing you get to control: you choose an action, and that action, based on the current state, influences the next state. The goal is typically to choose inputs that make the system, the x, do something. So here x is the state, a sufficient statistic such that if I know the state I can predict the future, and u is the input, with dimension p. This notation is from the controls
[00:11:05]
perspective. If you take a course in reinforcement learning, you learn the exact same thing, except x would be called s for state and u would be called a for action. Frankly, that's better notation, but I'm going with what I grew up with, so x and u today.
[00:11:23]
So where does the optimization come in? It comes in with optimal control, on slide six. Optimal control actually specifies the goal: what do I want to do with the system? I've added one new signal here, called e, which is going to be my error signal, or an external disturbance signal; it could also be noise.
[00:11:49]
So essentially what happens is: I have a system, I get to pick u, but I don't get to pick e. I'm going to assume e is random; things get more difficult if you assume it's adversarial, but let's start with the simple case where e is random. The goal is to find a sequence of u's
[00:12:06]
that makes some cost small. OK, so c is a cost we'd like to make small. It's interesting: control theorists are inherently pessimistic people, so they minimize costs, while computer scientists, being optimistic people, maximize rewards. Of course, as you know, these are equivalent to each other; it's just whether you're an optimist or a pessimist.
[00:12:29]
So we pick that cost in advance. e_t is some external noise process that we don't get to control, and f again is our dynamics: the state transition function that governs how the system will evolve. I'm using tau to denote a trajectory. As I said, in reinforcement learning and control we are about taking all the past data and making a new action: tau summarizes all the past data, and pi summarizes what we do with that data. We're going to find something that looks at our past data and maps it into a new action. In reinforcement learning that's called a policy,
[00:13:10]
and in controls it would be called the controller. The goal is to find the best policy: the one that makes this cost as small as possible in expectation with respect to that noise process. Let me do a concrete example, because that was a lot of notation for the general control problem. Let's do something simple, especially if you've never seen this before. We have this quadrotor, and we want to move the quadrotor over here.
[00:13:38]
How do we do that? Well, the first thing we have to do is understand how the dynamics evolve, and that would probably be via Newton's laws. It could be more complicated, but let's be really simple and say that the derivative of the position, call it z, is the velocity, and the derivative of the velocity is the acceleration.
[00:14:02]
And the other thing we know is that force equals mass times acceleration. OK, so with simple Newton's laws I can put that together and get an optimization problem. That's the dynamics that governs my system, and you can write it out with matrices. It's a really simple thing with matrices here: the state variable has two components, the position and the velocity.
[00:14:27]
The input is the force I get to apply, and the system evolves in this way. Then I pick a cost, and the cost here simply says: you pay a cost of one if you're not at that blue dot, and a cost of zero if you are at the blue dot. It's a discrete cost. I don't know where I came up with that one; I pulled it out of a hat, and that's actually one of the most important lessons to take away from today: I get to choose the cost. I don't get to choose the dynamics. You can engineer the dynamics somewhat, but there are a lot of physical laws and constraints that govern them; the cost is the thing you get to pick. And this is exactly the same as in machine learning. In machine learning, the simplest thing you do is classification, and in classification what you really want to do is minimize the number of mistakes, but that's hard. So you come up with a surrogate, like logistic regression: now I can minimize something that's convex, and then evaluate how well that works on the thing that was not convex. The same thing is true here: instead of minimizing
[00:15:34]
this discrete cost, I could minimize the sum of the squares of that first component. This is not exactly what I want to do, but it bounds that cost, or approximates it with something. And I can do other things: for example, I definitely don't have an infinite amount of force to apply, so I can penalize how much force I apply as well.
[00:15:58]
And now I have a quadratic cost, and I have my dynamics as constraints. The dynamics in this case happen to be linear: the new state is a linear function of the current state and the control input. So linear dynamics, quadratic cost. You could solve this without knowing anything special: feed it into CVX, or TensorFlow if you want, and solve it yourself. It turns out there's an easier way to do it, but that's fine; you could just do that right now without knowing a thing.
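To make that concrete, here is a minimal sketch (my own illustration, not from the slides) of the double-integrator quadrotor above, solved by the standard backward Riccati recursion rather than a generic solver. The time step, horizon, and cost weights are arbitrary illustrative choices.

```python
import numpy as np

# Double integrator, discretized with time step dt:
# position' = velocity, velocity' = force / mass.
dt, mass = 0.1, 1.0
A = np.array([[1.0, dt],
              [0.0, 1.0]])
B = np.array([[0.0],
              [dt / mass]])

Q = np.diag([1.0, 0.0])   # quadratic penalty on position error only
R = np.array([[0.1]])     # quadratic penalty on control effort

# Finite-horizon LQR via the backward Riccati recursion.
T = 200
P = Q.copy()
gains = []
for _ in range(T):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P = Q + A.T @ P @ A - A.T @ P @ B @ K
    gains.append(K)
gains.reverse()           # gains[t] is the gain to use at time t

# Roll out from position 1, velocity 0, applying u_t = -K_t x_t.
x = np.array([1.0, 0.0])
for K in gains:
    u = -K @ x
    x = A @ x + B @ u
# x[0] is now driven close to the origin
```

The same problem could indeed be handed to a generic convex solver; the recursion just exploits the linear-quadratic structure.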
[00:16:30]
What's interesting is that the cost I have here actually allows you to tune the controller depending on what kind of constraints you have. If you really don't want to use battery, you could set R to 10; in that case it's going to take you a long time to get to the origin, and maybe you'll overshoot and have to come back, but in doing so you expend very little battery. On the other hand, if I set R to 1, now I have only a tiny overshoot and I get to the origin very quickly, but I'm expending a lot of power, as you can see in that first initial transient: you spend a lot of force right away. So, just as we do in machine learning all the time, we tune the cost function to get the behavior we want. The same is true in controls, and I think this is something really important in reinforcement learning and controls: you get to pick the cost, but you have to make sure that when you solve it, it does what you want; you have to understand the outcome of picking that cost. This example, generalized a little bit, is called the linear quadratic regulator, perhaps the most classic problem in optimal control; it's a standard thing you would learn in a controls course. The idea is minimizing a quadratic cost
[00:17:49]
subject to linear dynamics. As I said, you don't get to pick whether your system has linear dynamics, or, really, you have to engineer your system to work in a regime where it has linear dynamics; we can talk about that later. But Newton's laws are linear, so that's nice. And you have a quadratic cost, which I get to pick. So in some sense this is the simplest control problem I know, and I'm going to use it as a way to understand
[00:18:19]
issues in reinforcement learning. Now, where does the learning come in? There is no learning yet; this is optimal control. I told you that you could solve that problem, and there are a variety of ways to solve it that all kind of generalize. For example, you could solve it using backpropagation, which in optimal control is called the method of adjoints. It turns out the method of adjoints is older than backpropagation, but that's fine, we don't have to quibble about that. People were flying satellites in the sixties using this, and as I was discussing with Felix this morning, it's still a very popular technique.
[00:18:56]
There are a variety of other ways to solve these problems, and they all kind of generalize. But the hard part, where the learning comes in and it stops being pure optimization, is when we don't know the rules that govern the dynamics. As I said, you get to pick the cost function, but you don't get to pick the dynamics, and a lot of the time the dynamics are not known, or not known to the accuracy you'd like. So the question is: what do I do if I don't know how the system evolves? Now I have to do some learning, or, if you're a controls person, you call it system identification. They are the same thing: learning or identification.
[00:19:36]
So the big challenge is how to do optimal control when I don't know the dynamics. What's interesting is that I have to get some information about the system: I have to do some probing, run some kind of experiment, extract information from that experiment, and then act.
[00:19:54]
That's the situation we're in. Let me motivate this with another example. I really like this example because I take every opportunity I can to make fun of Google; they were nice to me last month, but this one is really DeepMind, so let's go after DeepMind.
[00:20:10]
There's this weird thing where controls, in some sense, was far too successful for its own good. The number of systems in this room powered by control systems is too big to count, and yet I feel like control theorists don't get the respect they deserve; control engineers are a little bit disrespected. On the other hand, if you're a machine learning scientist, you get quarterback salaries
[00:20:38]
and an adult-sized playground. So let me tell you about something DeepMind likes to work on. They call this the problem of data center cooling. As you know, Google has a lot of data centers, and they generate a lot of heat because they're all running TensorFlow.
[00:20:52]
Very hot GPUs. So you have to cool the machines: you have these giant warehouses filled with machines, and you build these insanely big cooling systems. The question is: what's the right way to cool those data centers, to get them down to a reasonable temperature,
[00:21:08]
so that the machines don't melt? This is a good problem to have. We actually had a data center fire at Berkeley in the last couple of years, still not solved; it's embarrassing. One was more of a hardware issue, and the other... let's not talk about that. So this is a real problem, a hard problem. On the other hand, I'd say this is a great example of how control theorists don't know how to brand things. Call it "data center cooling" and you get written up in every major publication, including the New York Times. But ask a control theorist what this problem is, and they'll say: that's HVAC.
[00:21:46]
Right, so air conditioning: not that exciting. You're not going to get an article in the New York Times on air conditioning, but "data center cooling," that's cool. I bring this up because the question is: what would you actually do to solve this problem? I could model every single bit of the data center, do a finite element calculation, and do PDE-constrained optimization for data center cooling. OK, we don't do that for HVAC; probably overkill. A different idea: I could do some bulk modeling, modeling the heat sources and the connectivity, and from that bulk model do some kind of control based on this
[00:22:26]
simpler, more approximate, coarser model. Or I could do the thing where I just look at all the sensors in the data center, which I've augmented with a ton of sensors, and try to learn a policy that goes directly from sensors to actions. Which is the right one? I think that's the hardest question; I don't know what the right thing to do is. Identifying everything is probably not the right thing to do for data center cooling; for high-performance aerodynamics it is the right thing to do. The harder you push up against your control authority, your ability to actually control, the more you really have to be modeling and predicting things. So there are certainly cases where that's important, but probably not air conditioning. For air conditioning we can maybe identify something much coarser and then do some kind of
[00:23:23]
control based on that model. This is the foundation of model predictive control, which I don't have time to really go into today, but I'll talk about it for one slide at the end. The idea of model predictive control is really simple, and actually really beautiful: you make a prediction of what's going to happen over some long time horizon, say from here to infinity. I predict the future, I take one step
[00:23:47]
based on the plan over that entire horizon, then I see what happens, and then I replan. It shifts the effort from trying to be clever about building a good controller to trying to be good at optimizing quickly, and that's actually how a variety of these systems work. And then there's another idea, which is that I don't need models at all: I just go directly from sensors to actions.
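The predict-one-step-replan loop above can be sketched in a few lines. This is my own minimal illustration, reusing the earlier double integrator; since there are no constraints here, each planning step reduces to a short finite-horizon LQR solve, so plain numpy suffices (a real MPC would solve a constrained program at each step).

```python
import numpy as np

# Same double integrator as before.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q, R = np.diag([1.0, 0.0]), np.array([[0.1]])

def plan_first_input(x, horizon=20):
    """Plan over `horizon` steps via the Riccati recursion and
    return only the first input of the planned trajectory."""
    P = Q.copy()
    K_first = None
    for _ in range(horizon):
        K_first = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ A - A.T @ P @ B @ K_first
    return -K_first @ x

# Receding-horizon loop: plan, take one step, observe, replan.
x = np.array([1.0, 0.0])
for _ in range(300):
    u = plan_first_input(x)
    x = A @ x + B @ u
# the replanning loop also drives the position to the origin
```

The point of the scheme only shows up once you add constraints or model error; here it simply recovers LQR-like behavior.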
[00:24:13]
And essentially this is how people characterize reinforcement learning, but in controls we actually have something pretty similar: PID control does something very similar to this. Just in case you don't know PID control, let me tell you about it very quickly; the point here is that I don't know what the right answer is, and I want to know how to distinguish these approaches. PID control is a really simple idea. I have some signal I would like to track; for simplicity, let's say I want to drive some error to zero, like I want the temperature in the room to be 68 degrees. What I do is look at the deviation, and my control law is a linear combination of three things: the error, the derivative of the error, and the integral of the error. Proportional, integral, derivative: that's what PID stands for.
[00:25:04]
And in fact, it turns out people have done surveys of this, and essentially 95 percent of the controllers in production in industry are proportional-integral only. So two parameters, and very little modeling actually has to be done to tune these PI controllers: by doing a few simple experiments you can set those values and get to 80 percent of where you want to go, maybe 90 percent. You can spend the extra effort to get really optimal, or you can get pretty good with very simple tuning rules that were invented in the forties.
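A minimal PI(D) loop looks like the following. This is my own sketch, not the talk's slides: the plant is a made-up first-order thermal model relaxing toward a 50-degree ambient while the controller tries to hold 68 degrees, and the gains are illustrative rather than set by any named tuning rule.

```python
def pid_step(error, state, kp=2.0, ki=0.5, kd=0.0, dt=0.1):
    """One step of u = kp*e + ki*integral(e) + kd*de/dt.
    `state` carries (integral, previous error) between calls."""
    integral, prev_error = state
    integral += error * dt
    derivative = (error - prev_error) / dt
    u = kp * error + ki * integral + kd * derivative
    return u, (integral, error)

# Plant: temp' = -(temp - ambient)/tau + u, controlled to 68 degrees.
temp, ambient, tau, dt = 50.0, 50.0, 5.0, 0.1
setpoint = 68.0
state = (0.0, 0.0)
for _ in range(2000):
    u, state = pid_step(setpoint - temp, state, dt=dt)
    temp += dt * (-(temp - ambient) / tau + u)
# the integral term removes the steady-state error, so temp settles at 68
```

Note that with kd left at zero this is exactly the proportional-integral controller the survey statistic refers to: two gains, no model.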
[00:25:39]
OK, so the question now is: if we want to push into these more complex domains, domains with perceptual information, with much more complicated interaction, with much higher-dimensional state spaces, how much do you actually have to model to do these more advanced controls, and how do you step outward from PID? Can you learn to compensate and adapt and do all sorts of things under changing conditions and more adversarial environments? That's the big research challenge.
[00:26:14]
Just for example, self-driving cars: if you just want a self-driving car to drive very slowly around the same track over and over again, you can do that with PID control; lane following with PID control is pretty simple. But if you want to go drive on the 101 in San Jose, that's much more complicated, and the question is how we interpolate between the two.
[00:26:37]
I realize I'm using the example of the 101; I have no idea how people drive here, I don't drive in this city, so I can't comment. Oh, people are nice? Good to know, good to know.
[00:26:57]
All right, so this is our setup. There's a lot of notation, but it's fairly straightforward: we want to minimize a cost, we have this unknown state transition function, and the question is how you pick policies to solve this. This actually lets us reinvent reinforcement learning just by looking at this particular question.
[00:27:18]
OK, let's get into it. This is where I put on my statistical learning theory hat and ask: how exactly would we compare methods? What's the right way to compare methods, in a way that machine learning people would appreciate? Control theory people and machine learning people have different objectives, even in how they evaluate things. So let's talk about how a machine learning person might evaluate this. This is classic; we didn't invent it. Over the last 20 years in reinforcement learning, similar kinds of models have been proposed for evaluating methods. The idea is that I can generate N trajectories of length T
[00:28:05]
in advance, in the laboratory, and I want to build a controller that has the smallest cost given some sampling budget, which for simplicity is just N times T. That's my challenge, that's my oracle model: find something with the smallest cost given a fixed number of samples. It's important to note that there are lots of different settings; some people want this to work in an online setting. This is not an online setting: this is "do work in the lab and then deploy," and the question is how much work in the lab you have to do before you can certify, with high probability, what happens when you deploy. I'm going to talk about high probability, but you could also do this in expectation; I'm happy with either.
[00:28:55]
So what's the optimal way to do this? Obviously I have to generate some trajectories, do something in the lab; how many samples do I need to get a near-optimal controller? That's what we're trying to get at, and we're going to use LQR to get at it. This is based on an ethos. I don't fully buy the linearization ethos, but my feeling in machine learning is that if you can't understand what happens on linear models, you're not going to understand what happens on non-linear models, so it's good to start with linear.
[00:29:34]
Moreover, if someone tells you they have this amazing thing for non-linear models and it doesn't work on linear models, I'm very suspicious about it working on those non-linear models. OK, so that's an ethos; it doesn't always work, but we're going to use it here. It's actually been pretty fruitful for my group; we've been doing this for a while, it's one of our driving principles, and we have a lot of different ways of looking even at deep learning in this context,
[00:30:01]
and Bayesian optimization in this context. It's nice to look at the linear case, see what happens there, and then there's this converse you come back to: do the lessons learned on the linear models hold up in the non-linear case as well? The answer is sometimes, but it's an instructive loop. OK, so let's do that for control in the context of LQR. LQR will be my simple linear model. Again, the reason I like LQR is that it's been studied forever, so
[00:30:37]
we understand it really well when we know the dynamics. The dynamics are linear, so the estimation is fairly straightforward. The other thing about LQR is that it's super useful: people use it, and you can use LQR to build a variety of controllers in a variety of settings, so I think it's a nice baseline. That's where we're going to start. OK, so remember, we get to generate N trajectories of length T, and the goal is to build a controller with the smallest error given a fixed sampling budget. Now, something interesting happens as I go from slide A to slide B: I added this limit here. What happened? Here I was talking about a fixed time horizon; let me make the problem harder, or harder and easier at the same time. I'm not only interested in the cost over some fixed time horizon; what if I'm interested in the cost over an arbitrarily long time horizon?
[00:31:36]
We only get to run finitely many things in the lab, but we'd like to deploy these things for a long time. So we still have the same oracle model: you run for some fixed number of samples, but then I want you to predict what's going to happen over infinite time, deploy this in the wild, and guarantee that it's going to work over an infinite horizon. I made it a little harder. What's funny is that when you make this harder, the problem when I know everything is actually much simpler, because it has this beautiful form: on the infinite time horizon, the optimal controller is just "look at your state and multiply by a fixed matrix," and that's your control. It's a beautiful property of LQR. So there's some matrix with a very small number of parameters, just the dimension of the state times the dimension of the input, and that is the optimal solution. The way you solve for it is something called the discrete algebraic Riccati equation: you solve this equation and it gives you the K. But maybe I could just find this K
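For the double-integrator numbers used earlier, this fixed gain can be computed by iterating the Riccati recursion to its fixed point, which is exactly a solution of the discrete algebraic Riccati equation (a library call such as scipy's `solve_discrete_are` would give the same P). A minimal sketch, with all numbers illustrative:

```python
import numpy as np

# Infinite-horizon LQR for the double integrator: iterate the
# Riccati map until P stops changing, then read off the fixed gain K.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q, R = np.diag([1.0, 0.0]), np.array([[0.1]])

P = Q.copy()
for _ in range(10000):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P_next = Q + A.T @ P @ A - A.T @ P @ B @ K
    if np.max(np.abs(P_next - P)) < 1e-12:
        break
    P = P_next

K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # u = -K x

# Sanity check: the closed loop A - B K is stable
# (spectral radius strictly below one).
rho = max(abs(np.linalg.eigvals(A - B @ K)))
```

K here has exactly state-dimension times input-dimension entries (2 numbers), which is the "very few parameters" point from the talk.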
[00:32:38]
by some other means. So the obvious control strategy, I say obvious, this is what people have been doing since the sixties, is: estimate the heck out of the model, get a really good estimate of A and B, and then once you have a good estimate, treat it as true and build a controller from that. Get as good an estimate as you can with your sampling budget, treat that estimate as true, and then build the optimal controller assuming your estimate is true. This is called the principle of certainty equivalence. I call it nominal control, because we're using this nominal model, and it's shorter to say as well; both terms are used. So we're discussing nominal control: using what we think is a reasonable model as the truth. But there are other approaches.
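The two-step recipe just described, least-squares estimation followed by certainty-equivalence synthesis, fits in a short numpy sketch. Everything concrete here is my own illustration, not from the talk: the double-integrator system, the noise scale, the cost matrices, and the fixed-point iteration used to solve the discrete algebraic Riccati equation (a library solver would serve equally well).

```python
import numpy as np

rng = np.random.default_rng(0)

def lqr_gain(A, B, Q, R, iters=500):
    """Solve the discrete algebraic Riccati equation by fixed-point
    iteration; return the static gain K of the policy u = -K x."""
    P = Q.copy()
    for _ in range(iters):
        BtP = B.T @ P
        K = np.linalg.solve(R + BtP @ B, BtP @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return K

# True (unknown) system, used only to generate data: a double integrator.
A_true = np.array([[1.0, 1.0], [0.0, 1.0]])
B_true = np.array([[0.0], [1.0]])

# Step 1: "estimate the heck out of the model" with least squares.
T = 500
xs, us = [np.zeros(2)], []
for _ in range(T):
    u = rng.normal(size=1)   # random excitation
    xs.append(A_true @ xs[-1] + B_true @ u + 0.01 * rng.normal(size=2))
    us.append(u)
Z = np.hstack([np.array(xs[:-1]), np.array(us)])   # rows are [x_t, u_t]
Theta, *_ = np.linalg.lstsq(Z, np.array(xs[1:]), rcond=None)
A_hat, B_hat = Theta.T[:, :2], Theta.T[:, 2:]

# Step 2: certainty equivalence -- treat the estimate as the truth
# and synthesize the controller from it.
K = lqr_gain(A_hat, B_hat, np.eye(2), np.array([[1.0]]))

# Sanity check: the controller built from the estimate should still
# stabilize the *true* closed loop, i.e. A - B K has spectral radius < 1.
rho = max(abs(np.linalg.eigvals(A_true - B_true @ K)))
```

With this much well-excited data the estimate is essentially exact, which is the point of the story that follows: on easy problems the boring nominal recipe is extremely sample-efficient.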
[00:33:28]
That would be one approach, but my friends in reinforcement learning would tell me, oh, we could just do something like this, right? Or not even reinforcement learning. If I went and talked to Jake, I don't know if Jake would tell me to do that; maybe he'd say, hey, why not epsilon-greedy? Let's try epsilon-greedy. OK, well, here's what we'd have to say.
[00:33:52]
Here's my epsilon-greedy approach, or at least that's how it looks to me. Epsilon-greedy means: try the thing you think is good now, but keep a little bit of probability of doing something else, to explore. There's this tradeoff between exploitation, which is trying what we believe is good now, forever, and exploration, which is, you know, maybe there's something else out there that might work. So one thing you could say is, I'm going to tell the epsilon-greedy algorithm that the optimal controller is a fixed gain K, and then you could play whatever K you have now and then update it. Right, OK, so here's a strategy: I sample a bunch of random vectors from a Gaussian, and what I do is add, at every time step, one of these random vectors to my control signal. That allows me to explore.
[00:34:36]
And so here's the algorithm: I compute how much cost I accrue in the lab, and then based on that I update my controller using this rule, which says the new controller is just a combination of the old controller and this thing. What is this thing? Does anybody know where that expression comes from? So that's not just some ad hoc update rule, and this is the punchline here: it turns out that rule comes from something called policy gradient.
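The estimator this rule is built on can be checked numerically on a toy cost. A REINFORCE-style policy-gradient estimator only ever evaluates the cost at randomly perturbed parameters, never differentiates it, yet its average recovers the gradient. The quadratic toy cost and the constants below are my own choices, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def cost(k):
    # Stand-in for the closed-loop cost of playing gain k; a simple
    # quadratic whose gradient we know analytically (2k).
    return k ** 2

k, sigma, n = 1.0, 0.1, 20000

# Policy-gradient / REINFORCE style estimator: sample Gaussian
# perturbations, weight the observed cost by the perturbation
# direction.  No derivative of `cost` is ever taken.
deltas = rng.normal(size=n)
g_hat = np.mean(cost(k + sigma * deltas) * deltas) / sigma

g_true = 2.0 * k  # analytic gradient of k^2 at k = 1
```

Weighting the sampled cost by the random perturbation is exactly the shape of the update rule on the slide: a zeroth-order, derivative-free estimate of the gradient, i.e. random search in disguise.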
[00:35:05]
That's the update rule in policy gradient. So I actually lied: that's not really epsilon-greedy, I'm using something called policy gradient. And what's funny about this is, if you actually look at what policy gradient does... who's seen policy gradient derived before? Not as many as I would expect, which is great, because you guys are pure; this is good. Don't go derive it, because it's crazy: you go through pages of crazy things with probabilities and logs and all this other fun stuff, and you get out this formula, and people say, hey, you have a gradient method to solve the reinforcement learning problem. But if you look at this case, I'm just computing the cost and multiplying it by something; that's not a gradient method. When we take policy gradient and apply it to LQR, and we're going to get to a couple of places in a second, it's not a gradient method, it's essentially just a random search method. And so it's this kind of thing where there's a lot of formalism, there's a lot of excitement about this, there are
[00:36:03]
lots of people trying to analyze this. But the question is: why is that a good algorithm? It looks weird. It looks weird to me. And it turns out that when you run it, it is weird, at least on all these control problems where I try to apply this.
[00:36:18]
linearization principle. When I run policy gradient, here's what happened. So here's this dashed line, which I got by solving the Riccati equation for this system; this system is 2 by 2. For the control theorists in the room, or people who remember feedback control: no, it's not stable, it's a marginally stable system, and this is part of why policy gradient doesn't like it. OK, so I have this messy system, and the blue lines are my error bars over 10 trials, and the solid line there is
[00:36:53]
the median performance. And you see that essentially the median performance, after 30,000 samples in the lab, is matching the best thing I could have done had I actually known the system in advance, but it's still very noisy. OK, so let me tell you something else that I think is also surprising: if I just did nominal control, this certainty equivalence principle, with 10 samples it's indistinguishable from the dashed line. So certainty equivalence, which is what people have been doing for 50 years,
[00:37:26]
at least on this problem, a two-state double integrator, certainty equivalence is thousands of times better in terms of sample efficiency. [Audience: presumably because it's a low-dimensional problem?] That's one question; somebody else asked something similar before. And yes, to be clear: when I say nominal control, I mean estimate A and B. I don't know A and B; I just estimate them, and I solve the Riccati equation with the estimates. That's right. And in this other one I just run policy gradient.
[00:38:07]
OK, there are lots of hypotheses for why. The thing is, we could dive into this a bit, because I was told that this is what OpenAI uses, and they built their name on something very similar; this is like their bread and butter. OpenAI lost at Dota, but whatever, almost won at Dota, using policy gradient. And so they're like, how can I be saying that policy gradient can't learn this double integrator? Well, don't take my word for it. I mean, I might be Lance Armstrong, who also said that, quote, claiming that I'm not doping, but.
[00:38:40]
Don't take my word for it, don't take my word. I mean, again, there are a lot of people in the popular press claiming that this works, but go read the fine print. So here are some quotes from my friends at OpenAI; actually I'm not sure they like me, but OK.
[00:38:59]
These are direct quotes from their blog: reinforcement learning results are tricky to reproduce, performance is very noisy, algorithms have many moving parts which allow for subtle bugs, and many papers don't report all the required tricks. I don't know what a bug is here, honestly, but that's neither here nor there; what a bug in a machine learning system actually is, is very complicated, and something we should talk about. RL algorithms are challenging to implement correctly; good results typically only come after fixing many seemingly trivial bugs. Again, they're using that word.
[00:39:29]
This is from the horse's mouth. Probably even more compelling: there's really nice work from folks at McGill who actually analyzed what happens when you change the random seeds in existing GitHub repos. Here's the plot for one algorithm, TRPO, and you can see that over two different sets of random seeds, five in the top curve and five in the bottom, you get non-overlapping performance on this particular benchmark. That's weird. What's even more damning than that to me, honestly, is this: three implementations of the same algorithm, by the same authors, in three GitHub repos.
[00:40:17]
And in all three cases they ran the exact same algorithm and got three completely different results: three different curves of how you're actually accruing returns over time. So that's not great. I don't want that in my car. I mean, it's fine for video games, although apparently not that fine yet; they'll get there, it might win some of the video games. But again, you don't want that kind of unpredictability from changing the random seed, because with that kind of unpredictability we can't actually put this in a reliable system. So there has to be something different, there has to be a better way, and I wouldn't be giving this talk unless I had some ideas. So let's try to dive into that. How exactly would we distinguish what policy gradient and its ilk are doing? There are essentially three ways to solve these reinforcement learning problems, and this is why I like the optimal control framework: they're really easy to distinguish by which part of the optimization problem you estimate. There are these methods called model-free methods. Model-free methods essentially say it's too hard to estimate state transitions from data, so I'm going to approach a different part of the problem. The policy gradient methods I was discussing are policy search methods: they just pick a policy and then try to jiggle that policy until it works. As you might imagine, they're all essentially just random search; they're all derivative-free optimization methods, which is OK, but as we've seen it doesn't work that well, it's sample-inefficient. A different approach, which I don't have time to talk about today, is called approximate dynamic programming. Approximate dynamic programming estimates the cost function from data, and then from that approximation tries to solve something called the Bellman equation. This is, if anyone has heard of DQN, which is what people use for Atari, what that does. But then there's this other approach that doesn't get as much attention, which is: fit the model from data.
[00:42:23]
Again, like the certainty equivalence approach. And this one is actually kind of the one I'm fond of, obviously. The idea is, again, what we talked about: collect simulation data, fit the dynamics using supervised learning, boring, just run least squares to fit the dynamical model, and then there's this tricky part of solving another problem, a surrogate problem, in the hope of solving the first one. Now there's the certainty equivalence principle, which is just take the solution, plug it in as true, and we saw that work, at least on that double integrator. But is there something else we could do, where we actually use the fact that we know we're not solving the right problem, and that allows us to maybe get away with fewer samples and also have better robustness properties? The way we do that is something I call coarse-ID control. The idea is straightforward. One: it's very easy to estimate coarse models of dynamical systems. Maybe you can come up with a system where it's really complicated, but for the most part, for everything that actually is interacting with the world, we have some notion of how these things move. Honestly, even in the places where people tell me it's difficult, they're still running simulators, and if they're running a simulator, that means they have a model; they could look at their code and we would have some way forward. But OK. The second thing, and this is actually the part I think is really interesting, is we can also use supervised learning to bound our uncertainty.
[00:43:45]
This is something we don't really talk about enough unless you're an adaptive learning person, so yeah, there's a few of you; but again, this is something I would never cover in an undergraduate class, and we probably should. That is, oftentimes, when I actually solve one of these learning problems, I can not only estimate the model but also bound the uncertainty in the model at the same time; you can get both. And if we can get both, then we can do what is called robust control, and robust control is a very mature, very powerful framework. The idea of robust control is: I want to build a controller
[00:44:18]
that works not just for the plant that I estimated, not just for the model I estimated, but for every single model inside the uncertainty set. You want to build something that works for everything in the uncertainty set. So let me walk you through how that works for LQR. We have an approach, and it's an approach, not the approach; it works well, but obviously there's more to do.
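The earlier point, that supervised learning can both estimate the model and bound its own uncertainty, can be sketched with a crude bootstrap: refit the least-squares model on resampled transitions and take the spread of the refits as a heuristic error bar. The scalar system below is my own toy, and resampling transitions independently ignores the temporal dependence the talk warns about, so treat this as a sketch of the idea rather than the method from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy scalar system x_{t+1} = a x_t + b u_t + noise, with (a, b) unknown.
a_true, b_true, T = 0.9, 0.5, 300
x = np.zeros(T + 1)
u = rng.normal(size=T)
for t in range(T):
    x[t + 1] = a_true * x[t] + b_true * u[t] + 0.05 * rng.normal()

Z = np.column_stack([x[:-1], u])   # regressors [x_t, u_t]
y = x[1:]

def fit(idx):
    # Least-squares fit of (a, b) on the selected transitions.
    theta, *_ = np.linalg.lstsq(Z[idx], y[idx], rcond=None)
    return theta

theta_hat = fit(np.arange(T))      # point estimate (a_hat, b_hat)

# Bootstrap: refit on resampled transitions; the spread of the refits
# serves as a heuristic uncertainty for each parameter.
boots = np.array([fit(rng.integers(0, T, size=T)) for _ in range(200)])
err_bar = boots.std(axis=0)
```

The pair (theta_hat, err_bar) is exactly the input a robust synthesis step wants: a nominal model plus a radius of models it must also handle.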
[00:44:41]
And I'm going to get to this; somebody asked about this before the talk, and we can talk about it more later. OK, so the first question is: how do you get error bars on your model? So we run our dynamical system; how many samples do you actually need to estimate A and B? What can we do from one experiment of T time steps, or N experiments of T time steps?
[00:45:11]
One tricky part here is, when I want to analyze what the error in A and B is, I have this problem that all of the samples are correlated with the thing I'm trying to estimate. But, and I'm not going through all the details here, you can prove kind of standard rates: for single trajectories you essentially get the parametric rate you would expect. We expect that the length of the time horizon should scale like the number of states plus the number of inputs, divided by the square of the precision we're looking for.
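Written out, the rate being described has, as I understand it, the following shape, with n the state dimension, p the input dimension, and constants and logarithmic factors suppressed:

```latex
T \;\gtrsim\; \frac{n + p}{\epsilon^{2}}
\quad\Longrightarrow\quad
\max\bigl\{\, \lVert \widehat{A} - A \rVert ,\; \lVert \widehat{B} - B \rVert \,\bigr\} \;\le\; \epsilon
```

In other words, to halve the estimation error you need roughly four times the data, exactly the parametric behavior one would expect from independent samples.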
[00:45:40]
This is essentially what we would expect if it were independent data, and even for dependent data we can prove this. There are other interesting consequences that come from the dependencies, but again, there's a really great paper by Max and Stephen, cited at the bottom there, that goes into all the delicacies; there are lots of weird, delicate things that happen here
[00:46:04]
in that estimation. [Audience: is that the estimate or the truth?] A-hat is the estimate; A is the truth. Yes. And so again, what we get is not the true model; we get an estimate. We run this, solve this, we get an estimate, and I can tell you roughly what the error is. And then I can say, roughly, that I solve this robust optimization problem. This is just LQR, but now I'm allowing the dynamics to have a Delta-A and a Delta-B, which are my uncertainties; A-hat is my estimate, thank you for pointing that out, A-hat is the estimate, and Delta-A and Delta-B are my uncertainties. I want to take the worst instance of the uncertainty: over all possible perturbations, pick the worst one, and solve for that. Now, we can't solve this exactly, it's a hard problem, but we have a semidefinite programming relaxation that solves it, and we get the following result. If I have time I might go over how to do this, but probably not.
[00:47:01]
The main idea, and actually the whole idea, is that you want to push those Delta-A's and Delta-B's into the cost and then do a perturbation analysis. What's cool is that this first term is what we have from estimation, and this is actually generically true no matter the instance: this term is just your estimation fidelity, so if epsilon-A and epsilon-B are small you're going to get this term, and then this other term is some stuff that depends on the instance. But again, what this says is that your estimation quality governs your control quality, if you do things robustly. And what's also cool is this tells you how long you have to run before you know that the cost will be finite on the infinite time horizon.
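As I understand the guarantee being described, it has roughly this shape, with constants suppressed and C an instance-dependent quantity capturing system properties:

```latex
\frac{J(\widehat{K}) - J(K_{\star})}{J(K_{\star})}
\;\lesssim\;
C \,\bigl(\epsilon_{A} + \epsilon_{B}\bigr),
\qquad
\epsilon_{A} \ge \lVert \widehat{A} - A \rVert,\;\;
\epsilon_{B} \ge \lVert \widehat{B} - B \rVert ,
```

valid once the estimation errors are small enough: the relative suboptimality of the robustly synthesized controller is controlled by how well you identified the model.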
[00:47:43]
So if you know properties about your system, which are captured in the constant C, which we spell out in the paper in full detail, if you know those properties are bounded, this can tell you in advance how many experiments you'll actually have to run before you can guarantee something. And by just turning the analysis around, you can also get data-dependent bounds, which tell you when you're guaranteed to actually have a stable system; stable here just means you don't get infinite cost, you get finite cost, which is important. So rather than discuss how the math works, let me show you why the robustness matters. This is a contrived example, but contrived examples get you into the New York Times, as we've seen, so we might as well do it, right? Here's a really bad model of a data center that needs to be cooled. There are the racks, there are fans that cool the racks, and the racks shed heat to each other. We can model this as heat sources: if they sit there, they just generate heat, then they shed heat to their neighbors, and the fans cool them. Now, what's interesting about this model: if I turn off the control and run it forever, because this model has eigenvalues that are bigger than one, the A matrix's eigenvalues are bigger than one, you get infinite cost. OK, so it's unstable. Now, if I do least squares and don't have a lot of data, I might estimate one of the diagonal entries to be less than one. And if you estimate one of the diagonal entries to be less than one, and you really want to show that you can improve data center cooling, you would actually want to turn the electricity off there, because now you're really efficient.
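The failure mode described here is easy to reproduce with a toy matrix. The numbers are mine and purely illustrative: each "rack" slightly amplifies its own heat (diagonal entries just above one) and sheds a little to its neighbors, so the open-loop system is unstable, while a least-squares misestimate that drops a diagonal entry below one would make that rack look like it cools itself.

```python
import numpy as np

# Toy 3-rack heat model in the spirit of the example: diagonal entries
# just above one (each rack amplifies its own heat), small coupling to
# neighbors.  All numbers are illustrative, not from the talk.
A = np.array([[1.01, 0.01, 0.00],
              [0.01, 1.01, 0.01],
              [0.00, 0.01, 1.01]])

# Spectral radius above one means the uncontrolled system is unstable.
rho = max(abs(np.linalg.eigvals(A)))

# Uncontrolled rollout: the heat blows up, so the cost is infinite
# on the infinite horizon.
x = np.ones(3)
for _ in range(500):
    x = A @ x
grew = np.linalg.norm(x) > 1e3

# A misestimate with a diagonal entry below one makes that rack look
# self-cooling: the trap certainty equivalence can fall into with
# too little data.
A_bad = A.copy()
A_bad[0, 0] = 0.99
```

A controller synthesized against A_bad as if it were the truth would happily spend no effort cooling rack 0, which is exactly the unsafe "turn the electricity off" policy described above.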
[00:49:17]
I mean, right away, they have these metrics that you want to hit, and if your metric is just minimizing the amount of power, that is what you would do, but it would not be safe. So if your system identification tells you that something is stable when it's really unstable, that causes an issue.
[00:49:34]
So here is this experiment, some simulations that we did. The orange is the certainty equivalence principle, treating the estimate as true. The blue is robust LQR, where I actually tell the algorithm what the error in A and B is. Green is where I estimate the error in A and B using a very conservative bootstrap. And you see that even if you're super conservative with your estimates, you're not too far off in cost. Moreover, this plot to me is the interesting one: how frequently are you getting solutions that don't have infinite cost? What we see is that both the blue and the green, with less than 100 samples, 100 percent of the time are giving you
[00:50:18]
not just finite cost, but certificates that you have finite cost, which is nice. I mean, what's actually happening here is not that the algorithm is returning an unstable model; what's happening is the algorithm is saying, I can't find a solution. What you're finding here are infeasible problems. And you see, too, that even as you're finding feasible problems, there's an issue at that point, which is not great, but eventually it's all stable, after 600 samples.
[00:50:47]
And after 600 samples, you're still, 10 percent of the time, returning an unstable model with certainty equivalence. So the robustness does matter. It's a bit of a contrived example, but it's worth showing that unless you think about robustness, you can actually have these kinds of unforeseen consequences. I think that's probably true in all of machine learning: you have to think about consequences that you might not have intended.
[00:51:17]
The robust method gives you the stable controller. Just very quickly: we ran some of the model-free methods on this example too, and that particular policy search method is still terrible; it still doesn't work. We did compare, and they're not even close.
[00:51:35]
I think there are a couple of takeaways from this. One, and I think this is the really interesting thing, and the thing that's hard for machine learning people who have been successful in all these domains to realize, is that even LQR is a hard problem. Doing stuff where you take actions, and those actions propagate through costs over time, is really difficult, and here we had to bring in a lot of heavy machinery. We found that the result in that paper I talked about, on estimation, contradicts a lot of received wisdom; I can explain to people who know this stuff why, but let's not go into it. Even LQR estimation needed some new stuff. So it's really that these problems are hard; they need new techniques.
[00:52:13]
And if we really want to be successful, we have to do a lot of work, which I think is the really important point. I mean, Elon Musk told me that in 2017 I would have a car that could self-drive across the country. I don't think that's happening today, anyway.
[00:52:31]
And Tesla? No, I mean... I don't think he stepped down; it hasn't happened, anyway. I think the important thing here, though, is that even the simplest RL problems are really hard. That doesn't mean we shouldn't do them; it just means that we have a lot of work to do.
[00:52:51]
So there's other stuff I don't have time to talk about today. We've actually been looking at this kind of adaptive control problem, and we've looked at some other problems, actually simplifying some of the policy search methods that people have been studying. Something I'm really interested in is how you understand the tradeoff between learning and safety, which is actually one of the most pressing problems. If you have a simulator, it's fine: you can have your robot crash all the time. But if you have an autonomous vehicle, you don't want to actually crash the car. So how you do this tradeoff between learning and exploration on the one hand and safety on the other is super important. I'm going to just skip ahead for a second here, so let me wrap up; I just want to say one thing.
[00:53:37]
Rather than going too much into this other material: OK, we took a show of hands for reinforcement learning, and there is a camp, though surprisingly not a lot of you here at Georgia Tech, which is interesting; we could talk about that. And in control theory, I know there are a lot of people here who do this. Just so we have the framing: I said a lot of things about direct policy search, but I think there is amazing stuff that can be done at the interface of machine learning and control, and I just don't want this to be a turf war. I think that intersection is super fascinating, and it's already fascinating. And lest I get too much credit for it: there are amazing people who've been working in this area for a long time who are doing really cool stuff. So I like to say I don't really care what we call it. I came up with a name, I'm calling it actionable intelligence, because I don't want to pick sides here.
[00:54:23]
But I think there's something broadly important here, and that's why I want to close with this. Look, Mark said I'm best known for my work in matrix completion; I don't know if that's true, but let's go with it. For me, I got interested in matrix completion because at the time people were interested in recommender systems, and at the time recommender systems seemed like this kind of silly, innocuous thing: I recommend you good music, or recommend movies you'd like, and that's simple.
[00:54:53]
But what we've definitely learned over the last couple of years is that recommender systems have really crazy, unforeseen consequences. They drive people to radical action, they drive people to radical thoughts, and they inflame kind of our worst human impulses, and we don't think about that. And I think the thing that's really important is that every machine learning system that's put in feedback with people
[00:55:17]
is a control system, or an actionable intelligence system, or a reinforcement learning system. It is not supervised learning; it is no longer supervised learning, even if we're just running supervised learning inside it. OK, why is it a control system? I put it in feedback with people: I recommend stuff to people, they watch what I recommend, I look at that data, and I retrain. Look, I have a feedback loop now, around both the policy and the supervised learning I'm running, without thinking about the consequences. And so I think the really important thing moving forward is that we can't avoid these questions anymore in machine learning. All the questions that are really facing us now, all the challenges and all the frailties that we're seeing in machine learning, have to be understood in this broader context: this is what happens when you put machine learning in feedback loops. And to me, this is simultaneously the most frightening part of our current machine learning infrastructure, but also where, I think, a lot of the exciting stuff in that intersection is. So hopefully you guys are excited to work on this too, and I would love to talk to you more about it. Thank you.
[00:56:26]
[Q&A] Mark had to take off to teach. Go ahead. That's a great question, a great question. I will repeat the question for the recording. The question is: one of the reasons why people like model-free methods is because they claim generality. However, I would argue that, algorithmically, I could solve any problem
[00:57:17]
that we currently solve on computers using a SAT solver. That's also generic, but you don't do that, right? For efficiency, optimality, all sorts of reliability reasons, you actually build custom solvers. And so I think there are principles you have to learn as you push out towards these more complicated problems, towards reinforcement learning; there are general principles to learn, but I'm not sure that we should necessarily be looking for one hammer for everything.
[00:57:50]
Just go ahead. Is it stable? No, it's not, no. So to repeat the question: the question is, is this feedback loop between machine learning and people stable? And I think the answer has been decidedly no, and we have a lot of examples of it being decidedly no. You know, so far I didn't pick on anyone, so let me pick on Facebook.
[00:58:21]
That's interesting. So the question is not about infinite cost exactly; the question is more like non-infinite cost versus... well, first of all, is it even linear? The theory I gave is only for the linear case; the nonlinear case is work for later. But.
[00:58:41]
The stability question, I think, is really interesting, and it's something I didn't mention; I tried to mention it before everybody left. It's undeniable that Facebook has inflamed genocide in Myanmar. That was what I was talking about with that recommendation system. You give the internet to everybody, and you put them all next to each other, and they don't like each other, and it goes bad. So we know that there are social dynamics that have really bad negative consequences.
[00:59:05]
And I think that's the big challenge for all of us, right? Those places are hiring; how do you go there and make them do better? Can we make it stable? I think that's the interesting research question. There's a whole subset of machine learning people who are worried about this, about doing machine learning as if people mattered. That, to me, is more important than what the people working on autonomous cars are doing. So that's exactly the big challenge: how do we do that?
[00:59:39]
We have time, so, you know, if you'd like, you can come up here, I think.