
Pieter Abbeel: Deep Reinforcement Learning | Lex Fridman Podcast #10



link |
00:00:00.000
The following is a conversation with Pieter Abbeel.
link |
00:00:03.040
He's a professor at UC Berkeley and the director of the Berkeley Robotics Learning Lab.
link |
00:00:07.760
He's one of the top researchers in the world working on how to make robots understand and
link |
00:00:13.200
interact with the world around them, especially using imitation and deep reinforcement learning.
link |
00:00:19.680
This conversation is part of the MIT course on artificial general intelligence
link |
00:00:24.160
and the artificial intelligence podcast. If you enjoy it, please subscribe on YouTube,
link |
00:00:29.040
iTunes, or your podcast provider of choice, or simply connect with me on Twitter at Lex
link |
00:00:34.160
Fridman, spelled F R I D. And now, here's my conversation with Pieter Abbeel.
link |
00:00:41.440
You've mentioned that if there was one person you could meet, it would be Roger Federer. So let
link |
00:00:46.480
me ask, when do you think we will have a robot that fully autonomously can beat Roger Federer
link |
00:00:52.720
at tennis? Roger Federer level player at tennis? Well, first, if you can make it happen for me
link |
00:00:59.840
to meet Roger, let me know. In terms of getting a robot to beat him at tennis, it's kind of an
link |
00:01:07.840
interesting question because for a lot of the challenges we think about in AI, the software
link |
00:01:15.280
is really the missing piece. But for something like this, the hardware is nowhere near either. To
link |
00:01:22.800
really have a robot that can physically run around, the Boston Dynamics robots are starting to get
link |
00:01:28.240
there, but still not really human level ability to run around and then swing a racket.
link |
00:01:36.720
So you think that's a hardware problem? I don't think it's a hardware problem only. I think it's
link |
00:01:40.160
a hardware and a software problem. I think it's both. And I think they'll have independent progress.
link |
00:01:45.600
So I'd say the hardware maybe in 10, 15 years. On clay, not grass. I mean, grass is probably hard.
link |
00:01:53.360
With the sliding? Yeah. Well, clay, I'm not sure what's harder, grass or clay. The clay involves
link |
00:02:00.080
sliding, which might be harder to master actually. Yeah. But you're not limited to bipedal. I mean,
link |
00:02:09.360
I'm sure there's no... Well, if we can build a machine, it's a whole different question, of
link |
00:02:12.560
course. If you can say, okay, this robot can be on wheels, it can move around on wheels and
link |
00:02:18.000
can be designed differently, then I think that can be done sooner probably than a full humanoid
link |
00:02:24.880
type of setup. What do you think about swinging a racket? So you've worked on basic manipulation.
link |
00:02:31.120
How hard do you think is the task of swinging a racket, to be able to hit a nice backhand
link |
00:02:36.480
or a forehand? Let's say we just set it up stationary, a nice robot arm, let's say. You know,
link |
00:02:44.240
a standard industrial arm, and it can watch the ball come and then swing the racket.
link |
00:02:50.560
It's a good question. I'm not sure it would be super hard to do. I mean, I'm sure it would require
link |
00:02:57.600
a lot... If we do it with reinforcement learning, it would require a lot of trial and error. It's
link |
00:03:01.520
not going to swing it right the first time around, but yeah, I don't see why it couldn't
link |
00:03:08.240
swing it the right way. I think it's learnable. I think if you set up a ball machine, let's say
link |
00:03:12.320
on one side and then a robot with a tennis racket on the other side, I think it's learnable
link |
00:03:20.160
and maybe a little bit of pre-training in simulation. Yeah, I think that's feasible.
link |
00:03:25.360
I think the swinging the racket is feasible. It'd be very interesting to see how much precision it
link |
00:03:28.880
can get. I mean, that's where... I mean, some of the human players can hit it on the lines,
link |
00:03:37.760
which is very high precision. With spin. The spin is an interesting one, whether RL can learn to
link |
00:03:44.320
put a spin on the ball. Well, you got me interested. Maybe someday we'll set this up.
link |
00:03:51.040
Your answer is basically, okay, for this problem, it sounds fascinating, but for the general problem
link |
00:03:55.440
of a tennis player, we might be a little bit farther away. What's the most impressive thing
link |
00:03:59.840
you've seen a robot do in the physical world? So physically, for me, it's
link |
00:04:08.720
the Boston Dynamics videos always just hit home and I'm just super impressed. Recently, the robot
link |
00:04:16.560
running up the stairs during the parkour type thing. I mean, yes, we don't know what's underneath.
link |
00:04:22.160
They don't really write a lot of detail, but even if it's hard coded underneath,
link |
00:04:27.040
which it might or might not be just the physical abilities of doing that parkour,
link |
00:04:30.800
that's a very impressive robot right there. So have you met SpotMini or any of those robots in
link |
00:04:36.000
person? I met SpotMini last year in April at the MARS event that Jeff Bezos organizes. They
link |
00:04:43.040
brought it out there and it was nicely following Jeff around. When Jeff left the room, they had it
link |
00:04:49.840
following him along, which is pretty impressive. So I think there's some confidence to know that
link |
00:04:55.680
there's no learning going on in those robots. The psychology of it, so while knowing that,
link |
00:05:00.080
while knowing there's not, if there's any learning going on, it's very limited,
link |
00:05:03.920
I met SpotMini earlier this year and knowing everything that's going on,
link |
00:05:09.520
having one on one interaction, so I get to spend some time alone.
link |
00:05:14.400
And there's immediately a deep connection on the psychological level,
link |
00:05:18.720
even though you know the fundamentals, how it works, there's something magical.
link |
00:05:23.280
So do you think about the psychology of interacting with robots in the physical world,
link |
00:05:29.040
even you just showed me the PR2 robot, and it had a little bit of something like a face,
link |
00:05:37.040
there's something that immediately draws you to it.
link |
00:05:40.480
Do you think about that aspect of the robotics problem?
link |
00:05:45.040
Well, it's very hard with BRETT here. We gave him a name, Berkeley Robot
link |
00:05:50.560
for the Elimination of Tedious Tasks. It's very hard to not think of the robot as a person,
link |
00:05:56.480
and it seems like everybody calls him a he for whatever reason, but that also makes it more
link |
00:06:00.560
a person than if it was a it. And it seems pretty natural to think of it that way.
link |
00:06:07.200
This past weekend it really struck me. I've seen Pepper many times in videos,
link |
00:06:12.400
but then I was at an event organized by, this was by Fidelity, and they had scripted Pepper to help
link |
00:06:19.280
moderate some sessions, and they had scripted Pepper to have the personality of a child a
link |
00:06:24.880
little bit. And it was very hard to not think of it as its own person in some sense, because it
link |
00:06:31.360
was just kind of jumping, it would just jump into conversation making it very interactive.
link |
00:06:35.120
The moderator would be saying something and Pepper would just jump in: hold on, how about me,
link |
00:06:38.720
how about me, can I participate in this? And you're just like, okay, this is like a person,
link |
00:06:43.600
and that was 100% scripted. And even then it was hard not to have that sense of somehow
link |
00:06:48.800
there is something there. So as we have robots interact in this physical world, is that a signal
link |
00:06:55.120
that could be used in reinforcement learning? You've worked a little bit in this direction,
link |
00:07:00.160
but do you think that that psychology can be somehow pulled in? Yes, that's a question I would
link |
00:07:05.920
say a lot, a lot of people ask. And I think part of why they ask it is they're thinking about
link |
00:07:14.160
how unique are we really still as people, like after they see some results, they see
link |
00:07:18.560
a computer play Go, or they see a computer do this or that, and they're like, okay, but can it really have
link |
00:07:23.200
emotion? Can it really interact with us in that way? And then once you're around robots,
link |
00:07:28.960
you already start feeling it. And I think that kind of maybe methodologically, the way that I
link |
00:07:33.760
think of it is, if you run something like reinforcement learning, it's about optimizing some
link |
00:07:38.560
objective, and there's no reason that the objective couldn't be tied into how much
link |
00:07:48.240
does a person like interacting with this system? And why couldn't the reinforcement learning system
link |
00:07:53.120
optimize for the robot being fun to be around? And why wouldn't it then naturally become more
link |
00:07:59.040
more interactive and more and more maybe like a person or like a pet? I don't know what it would
link |
00:08:03.920
exactly be, but more and more have those features and acquire them automatically. As long as you
link |
00:08:08.720
can formalize an objective of what it means to like something. But how do you measure it, what's the ground
link |
00:08:16.320
truth? How do you get the reward from the human? Because you have to somehow collect that information
link |
00:08:21.360
from the human. But you're saying if you can formulate it as an objective, it can be learned.
link |
00:08:27.120
There's no reason it couldn't emerge through learning. And maybe one way to formulate it as an
link |
00:08:30.800
objective, you wouldn't have to necessarily score it explicitly. So standard rewards are
link |
00:08:35.840
numbers. And numbers are hard to come by. Is this a 1.5 or a 1.7 on some scale? That's very hard to do
link |
00:08:41.920
for a person. But much easier is for a person to say, okay, what you did the last five minutes
link |
00:08:47.680
was much nicer than what you did the previous five minutes. And that now gives a comparison. And in fact,
link |
00:08:53.600
there have been some results on that. For example, Paul Christiano and collaborators at OpenAI had
link |
00:08:58.080
the hopper, the MuJoCo hopper, a one-legged robot, do backflips purely from feedback: I like
link |
00:09:05.840
this better than that. That's kind of equally good. And after a bunch of interactions, it figured
link |
00:09:11.280
out what it was the person was asking for, namely a backflip. And so I think the same thing.
link |
00:09:16.080
It wasn't trying to do a backflip. It was just getting a score from the comparison score from
link |
00:09:20.880
the person, based on the person having in their own mind: I want this to do a backflip. But
link |
00:09:27.760
the robot didn't know what it was supposed to be doing. It just knew that sometimes the person
link |
00:09:32.480
said, this is better, this is worse. And then the robot figured out what the person was actually
link |
00:09:37.120
after was a backflip. And I imagine the same would be true for things like more interactive
link |
00:09:42.560
robots that the robot would figure out over time. Oh, this kind of thing apparently is appreciated
link |
00:09:47.520
more than this other kind of thing.
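As a rough sketch of the comparison-based reward learning described above, in the spirit of the OpenAI work mentioned, a small reward model can be trained so that whichever trajectory segment the human preferred gets the higher predicted return; an ordinary RL algorithm then optimizes the learned reward instead of a hand-written one. All interfaces and sizes here are illustrative assumptions, not details from the conversation.

```python
# Sketch: learn a reward model from pairwise human comparisons ("I like this
# better than that"), then hand it to a standard RL algorithm as the reward.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def segment_return(self, obs, act):
        # Predicted per-step rewards, summed over a short trajectory segment.
        return self.net(torch.cat([obs, act], dim=-1)).sum()

def preference_loss(model, seg_a, seg_b, human_prefers_a):
    # Bradley-Terry style objective: the preferred segment should score higher.
    ra = model.segment_return(*seg_a)                     # seg = (obs, actions) tensors
    rb = model.segment_return(*seg_b)
    logits = torch.stack([ra, rb]).unsqueeze(0)           # shape (1, 2)
    target = torch.tensor([0 if human_prefers_a else 1])  # index of preferred segment
    return nn.functional.cross_entropy(logits, target)
```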
link |
00:09:54.720
So when I first picked up Richard Sutton's reinforcement learning book, before sort of this deep learning, before the reemergence
link |
00:10:02.480
of neural networks as a powerful mechanism for machine learning, RL seemed to me like magic.
link |
00:10:07.600
It was beautiful. So that seemed like what intelligence is: RL, reinforcement learning. So how
link |
00:10:18.000
do you think we can possibly learn anything about the world when the reward for the actions is delayed
link |
00:10:24.320
and is so sparse? Why do you think RL works? Why do you think you can learn anything
link |
00:10:32.160
under such sparse rewards, whether it's regular reinforcement learning or deep reinforcement
link |
00:10:37.600
learning? What's your intuition? So part of that is, why does RL need
link |
00:10:45.600
so many samples, so many experiences to learn from? Because really what's happening is when you
link |
00:10:51.040
have a sparse reward, you do something maybe for like, I don't know, you take 100 actions and then
link |
00:10:56.240
you get a reward, or maybe you get like a score of three. And I'm like, okay, three. Not sure what
link |
00:11:01.920
that means. You go again and now you get two. And now you know that that sequence of 100 actions
link |
00:11:06.960
that you did the second time around somehow was worse than the sequence of 100 actions you did
link |
00:11:10.640
the first time around. But it's tough to know which of those actions were better or worse.
link |
00:11:15.040
Some might have been good and bad in either one. And so that's why you need so many experiences.
link |
00:11:19.680
But once you have enough experiences, effectively RL is teasing that apart. It's starting to say,
link |
00:11:24.080
okay, what is consistently there when you get a higher reward and what's consistently there when
link |
00:11:28.640
you get a lower reward? And then kind of the magic of the policy gradient update is to say,
link |
00:11:34.720
now let's update the neural network to make the actions that were kind of present when things are
link |
00:11:39.520
good, more likely, and make the actions that are present when things are not as good, less likely.
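That update is essentially the vanilla policy gradient (REINFORCE). A minimal sketch, assuming a Gym-style environment with discrete actions; the environment interface and hyperparameters here are illustrative, not from the conversation.

```python
# Sketch of the policy gradient idea: actions present when returns were high
# are made more likely; actions present when returns were low, less likely.
import torch
import torch.nn as nn

class Policy(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

def reinforce_update(policy, optimizer, env, episodes=10, gamma=0.99):
    losses = []
    for _ in range(episodes):
        obs, _ = env.reset()                      # Gymnasium-style API assumed
        log_probs, rewards, done = [], [], False
        while not done:
            dist = policy(torch.as_tensor(obs, dtype=torch.float32))
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            obs, reward, terminated, truncated, _ = env.step(action.item())
            rewards.append(reward)
            done = terminated or truncated
        # Discounted return from each step onward: the "credit" each action gets.
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.insert(0, g)
        returns = torch.tensor(returns)
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        losses.append(-(torch.stack(log_probs) * returns).sum())
    optimizer.zero_grad()
    torch.stack(losses).mean().backward()
    optimizer.step()
```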
link |
00:11:44.960
So that is the counterpoint. But it seems like you would need to run it a lot more than
link |
00:11:50.480
you do. Even though right now, people could say that RL is very inefficient. But it seems to be
link |
00:11:55.120
way more efficient than one would imagine on paper, that the simple updates to the policy,
link |
00:12:01.760
the policy gradient, that somehow you can learn, exactly as you said, what are the common actions
link |
00:12:07.520
that seem to produce some good results, that it somehow can learn anything.
link |
00:12:12.640
It seems counterintuitive, at least. Is there some intuition behind it?
link |
00:12:16.800
Yeah, so I think there's a few ways to think about this. The way I tend to think about it
link |
00:12:24.720
mostly originally. And so when we started working on deep reinforcement learning here at Berkeley,
link |
00:12:29.920
which was maybe 2011, 12, 13, around that time, John Schulman was a PhD student initially kind of
link |
00:12:36.880
driving it forward here. And kind of the way we thought about it at the time was if you think
link |
00:12:44.480
about rectified linear units or kind of rectifier type neural networks, what do you get? You get
link |
00:12:51.360
something that's piecewise linear feedback control. And if you look at the literature,
link |
00:12:56.960
linear feedback control is extremely successful, can solve many, many problems surprisingly well.
link |
00:13:03.520
I remember, for example, when we did helicopter flight, if you're in a stationary flight regime,
link |
00:13:07.200
not a non stationary, but a stationary flight regime like hover, you can use linear feedback
link |
00:13:12.080
control to stabilize the helicopter, a very complex dynamical system. But the controller
link |
00:13:16.960
is relatively simple. And so I think a big part of it is that if you do feedback control,
link |
00:13:22.240
even though the system you control can be very, very complex, often,
link |
00:13:26.000
relatively simple control architectures can already do a lot. But then also just linear
link |
00:13:31.520
is not good enough. And so one way you can think of these neural networks is that they kind of
link |
00:13:35.840
tile the space, which people were already trying to do more by hand or with finite state machines,
link |
00:13:40.880
saying this linear controller here, that linear controller there. The neural network
link |
00:13:44.560
learns to tile the space, saying linear controller here, another linear controller there,
link |
00:13:48.160
but it's more subtle than that. So it's benefiting from this linear control aspect, it's
link |
00:13:52.000
benefiting from the tiling, but it's somehow tiling it one dimension at a time. Because if
link |
00:13:57.760
let's say you have a two layer network, in that hidden layer, when a unit makes a transition from active
link |
00:14:04.160
to inactive or the other way around, that is essentially one axis, but not axis aligned, but
link |
00:14:09.600
one direction that you change. And so you have this kind of very gradual tiling of the space,
link |
00:14:15.200
with a lot of sharing between the linear controllers that tile the space. And that was
link |
00:14:19.840
always my intuition as to why to expect that this might work pretty well. It's essentially
link |
00:14:25.280
leveraging the fact that linear feedback control is so good. But of course, not enough. And this
link |
00:14:30.000
is a gradual tiling of the space with linear feedback controls that share a lot of expertise
link |
00:14:35.520
across them. So that that's, that's really nice intuition. But do you think that scales to the
link |
00:14:41.120
So that's a really nice intuition. But do you think that scales to the more and more general problems, when you start going up in the number of control dimensions,
link |
00:14:48.160
when you start going down in terms of how often you get a clean reward signal,
link |
00:14:55.280
does that intuition carry forward to those crazier or weirder worlds that we think of as the real
link |
00:15:00.960
world? So I think where things get really tricky in the real world compared to the things we've
link |
00:15:10.000
looked at so far with great success in reinforcement learning is
link |
00:15:16.160
the time scales, which it takes to an extreme. So when you think about the real world, I mean,
link |
00:15:22.800
I don't know, maybe some student decided to do a PhD here, right? Okay, that's that's a decision,
link |
00:15:28.560
that's a very high level decision. But if you think about their lives, I mean, any person's life,
link |
00:15:34.000
it's a sequence of muscle fiber contractions and relaxations. And that's how you interact with
link |
00:15:39.360
the world. And that's a very high frequency control thing. But it's ultimately what you do
link |
00:15:44.480
and how you affect the world. Until I guess we have brain readings, you can maybe do it slightly
link |
00:15:49.280
differently. But typically, that's how you affect the world. And the decision of doing a PhD is
link |
00:15:55.120
like so abstract relative to what you're actually doing in the world. And I think that's where
link |
00:16:00.240
credit assignment becomes just completely beyond what any current RL algorithm can do. And we need
link |
00:16:07.360
hierarchical reasoning at a level that is just not available at all yet. Where do you think we can
link |
00:16:13.360
pick up hierarchical reasoning by which mechanisms? Yeah, so maybe let me highlight what I think the
link |
00:16:19.360
limitations are of what already was done 20, 30 years ago. In fact, you'll find reasoning systems
link |
00:16:27.600
that reason over relatively long horizons. But the problem is that they were not grounded in the real
link |
00:16:33.200
world. So people would have to hand design some kind of logical, dynamical descriptions of the
link |
00:16:43.040
world. And that didn't tie into perception. And so that didn't tie into real objects and so forth.
link |
00:16:49.120
And so that was a big gap. Now with deep learning, we start having the ability to really see with
link |
00:16:57.920
sensors process that and understand what's in the world. And so it's a good time to try to
link |
00:17:02.800
bring these things together. I see a few ways of getting there. One way to get there would be to say
link |
00:17:08.080
deep learning can get bolted on somehow to some of these more traditional approaches.
link |
00:17:12.160
Now bolted on would probably mean you need to do some kind of end to end training,
link |
00:17:16.160
where you say, my deep learning processing somehow leads to a representation that in turn
link |
00:17:22.720
uses some kind of traditional underlying dynamical systems that can be used for planning.
link |
00:17:29.680
And that's, for example, the direction Aviv Tamar and Thanard Kurutach here have been pushing
link |
00:17:33.920
with Causal InfoGAN. And of course, other people too. That's one way: can we
link |
00:17:38.800
somehow force it into the form factor that is amenable to reasoning?
link |
00:17:43.520
Another direction we've been thinking about for a long time and didn't make any progress on
link |
00:17:50.160
was more information theoretic approaches. So the idea there was that what it means to take
link |
00:17:56.880
a high level action is to choose a latent variable now that tells you a lot about what's
link |
00:18:03.840
going to be the case in the future, because that's what it means to take a high level action.
link |
00:18:08.640
I say, okay, I decide I'm going to navigate to the gas station because I need to get
link |
00:18:14.480
gas for my car. Well, that'll now take five minutes to get there. But the fact that I get
link |
00:18:18.800
there, I could already tell that from the high level action I took much earlier.
link |
00:18:24.480
That we had a very hard time getting success with, not saying it's a dead end,
link |
00:18:30.080
necessarily, but we had a lot of trouble getting that to work. And then we started revisiting
link |
00:18:34.160
the notion of what are we really trying to achieve? What we're trying to achieve is
link |
00:18:39.600
not necessarily a hierarchy per se, but you could think about what does hierarchy give us?
link |
00:18:44.160
What we hope it would give us is better credit assignment. What is better credit assignment
link |
00:18:50.560
giving us? It gives us faster learning. And so faster learning is ultimately maybe
link |
00:18:58.640
what we're after. And so that's where we ended up with the RL squared paper on learning to
link |
00:19:03.840
reinforcement learn, which at the time Rocky Duan led. And that's exactly the meta learning
link |
00:19:10.640
approach where we say, okay, we don't know how to design hierarchy. We know what we want to get
link |
00:19:15.040
from it. Let's just end-to-end optimize for what we want to get from it and see if it might emerge.
link |
00:19:20.000
And we saw things emerge. The maze navigation had consistent motion down hallways,
link |
00:19:25.920
which is what you want. A hierarchical controller should say, I want to go down this hallway.
link |
00:19:29.520
And then when there is an option to take a turn, I can decide whether to take a turn or not and
link |
00:19:33.040
repeat. It even had the notion of where you have been before or not, to not revisit places you've
link |
00:19:38.480
been before. It still didn't scale yet to the real world kind of scenarios I think you had in mind,
link |
00:19:45.840
but it was some sign of life that maybe you can meta learn these hierarchical concepts.
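A rough sketch of the RL² idea just described, with illustrative names and a Gym-style environment assumed: an ordinary RL algorithm trains the weights of a recurrent policy, but the hidden state is carried across episodes of the same sampled task, so fast adaptation, and possibly hierarchy-like behavior, can emerge inside that hidden state.

```python
# Sketch of "learning to reinforcement learn": recurrent policy whose hidden state
# persists across episodes of one sampled task; outer RL trains the weights.
import torch
import torch.nn as nn

class RL2Policy(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        # Input: current obs, previous action (one-hot), previous reward, done flag.
        self.rnn = nn.GRUCell(obs_dim + n_actions + 2, hidden)
        self.head = nn.Linear(hidden, n_actions)
        self.n_actions = n_actions

    def step(self, obs, prev_action, prev_reward, prev_done, h):
        one_hot = nn.functional.one_hot(prev_action, self.n_actions).float()
        inp = torch.cat([obs, one_hot, prev_reward.unsqueeze(-1), prev_done.unsqueeze(-1)], dim=-1)
        h = self.rnn(inp, h)
        return torch.distributions.Categorical(logits=self.head(h)), h

def rollout_trial(policy, task_env, episodes_per_trial=2):
    # One "trial" = several episodes on the SAME task; hidden state is NOT reset.
    h = torch.zeros(1, policy.rnn.hidden_size)
    prev_a, prev_r, prev_d = torch.zeros(1, dtype=torch.long), torch.zeros(1), torch.zeros(1)
    log_probs, rewards = [], []
    for _ in range(episodes_per_trial):
        obs, _ = task_env.reset()                 # new episode, same task, same hidden state
        done = False
        while not done:
            o = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
            dist, h = policy.step(o, prev_a, prev_r, prev_d, h)
            a = dist.sample()
            log_probs.append(dist.log_prob(a))
            obs, r, terminated, truncated, _ = task_env.step(a.item())
            done = terminated or truncated
            prev_a, prev_r, prev_d = a, torch.tensor([float(r)]), torch.tensor([float(done)])
            rewards.append(r)
    return log_probs, rewards   # feed whole trials into any policy-gradient update
```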
link |
00:19:51.040
I mean, it seems like through these meta learning concepts, we get at the, what I think is one of
link |
00:19:58.000
the hardest and most important problems of AI, which is transfer learning. So it's generalization.
link |
00:20:06.240
How far along this journey towards building general systems are we being able to do transfer
link |
00:20:12.160
learning? Well, so there's some signs that you can generalize a little bit. But do you think
link |
00:20:18.320
we're on the right path or totally different breakthroughs are needed to be able to transfer
link |
00:20:25.360
knowledge between different learned models? Yeah, I'm pretty torn on this in that I think
link |
00:20:34.000
there are some very impressive results already, right? I mean, I would say when even with the
link |
00:20:44.400
initial kind of big breakthrough in 2012 with AlexNet, right? The initial thing is,
link |
00:20:50.160
okay, great, this does better on ImageNet, hence image recognition. But then immediately thereafter,
link |
00:20:57.600
there was of course the notion that, wow, with what was learned on ImageNet, if you now want to solve
link |
00:21:04.080
a new task, you can fine-tune AlexNet for the new task. And that was often found to be the even
link |
00:21:11.280
bigger deal that you learn something that was reusable, which was not often the case before
link |
00:21:15.920
usually machine learning, you learn something for one scenario. And that was it. And that's
link |
00:21:19.520
really exciting. I mean, that's just a huge application. That's probably the biggest
link |
00:21:23.200
success of transfer learning today in terms of scope and impact. That was a huge breakthrough.
link |
00:21:28.960
And then recently, I feel like similar kind of by scaling things up, it seems like this has been
link |
00:21:37.040
expanded upon like people training even bigger networks, they might transfer even better. If
link |
00:21:41.440
you look at, for example, some of the OpenAI results on language models, and also the recent
link |
00:21:46.480
Google results on language models: they are trained for just prediction, and then they get
link |
00:21:54.320
reused for other tasks. And so I think there is something there where somehow if you train a
link |
00:21:59.600
big enough model on enough things, it seems to transfer. Some DeepMind results that I thought
link |
00:22:05.200
were very impressive, the UNREAL results, where it was learning to navigate mazes in ways where
link |
00:22:12.160
it wasn't just doing reinforcement learning, but it had other objectives it was optimizing for. So I
link |
00:22:16.880
think there's a lot of interesting results already. I think maybe where it's hard to wrap my head
link |
00:22:23.680
around this is to what extent, or when, do we call something generalization, right? Or the levels
link |
00:22:30.160
of generalization involved in these different tasks, right? So you draw this, by the way, just
link |
00:22:37.360
to frame things. I've heard you say somewhere, it's the difference in learning to master versus
link |
00:22:43.280
learning to generalize. That it's a nice line to think about. And I guess you're saying it's a gray
link |
00:22:49.680
area of where learning to master ends and learning to generalize starts.
link |
00:22:54.640
I think I might have heard this. I might have heard it somewhere else. And I think it might have
link |
00:22:58.800
been one of your interviews, maybe the one with Yoshua Bengio, not 100% sure. But I like the example
link |
00:23:05.120
and I'm not sure who it was, but the example was essentially: if you use current deep
link |
00:23:12.000
learning techniques, what we're doing to predict, let's say the relative motion of our planets,
link |
00:23:20.480
it would do pretty well. But then now if a massive new mass enters our solar system,
link |
00:23:28.320
it would probably not predict what will happen, right? And that's a different kind of
link |
00:23:32.880
generalization. That's a generalization that relies on the ultimate simplest explanation
link |
00:23:38.400
that we have available today to explain the motion of planets, whereas just pattern recognition
link |
00:23:42.640
could predict our current solar system motion pretty well. No problem. And so I think that's
link |
00:23:48.160
an example of a kind of generalization that is a little different from what we've achieved so far.
link |
00:23:54.480
And it's not clear if just, you know, regularizing more and forcing it to come up with a simpler,
link |
00:24:01.360
simpler, simpler explanation. Look, this is not simple, but that's what physics researchers do,
link |
00:24:05.280
right, to say, can I make this even simpler? How simple can I get this? What's the simplest
link |
00:24:10.000
equation that can explain everything, right? The master equation for the entire dynamics of the
link |
00:24:14.560
universe. We haven't really pushed that direction as hard in deep learning, I would say. Not sure
link |
00:24:20.960
if it should be pushed, but it seems a kind of generalization you get from that that you don't
link |
00:24:24.960
get in our current methods so far. So I just talked to Vladimir Vapnik, for example, who was
link |
00:24:30.400
a statistician in statistical learning, and he kind of dreams of creating the E equals Mc
link |
00:24:39.200
squared for learning, right, the general theory of learning. Do you think that's a fruitless pursuit
link |
00:24:46.480
in the near term, within the next several decades?
link |
00:24:51.680
I think that's a really interesting pursuit. And in the following sense, in that there is a
link |
00:24:56.800
lot of evidence that the brain is pretty modular. And so I wouldn't maybe think of it as the theory,
link |
00:25:05.440
maybe, the underlying theory, but more kind of the principle where there have been findings where
link |
00:25:14.160
people who are blind will use the part of the brain usually used for vision for other functions.
link |
00:25:20.240
And even after some kind of, if people get rewired in some way, they might be able to reuse parts of
link |
00:25:26.800
their brain for other functions. And so what that suggests is some kind of modularity. And I think
link |
00:25:35.040
it is a pretty natural thing to strive for to see, can we find that modularity? Can we find this
link |
00:25:41.120
thing? Of course, not every part of the brain is exactly the same. Not everything can be
link |
00:25:45.440
rewired arbitrarily. But if you think of things like the neocortex, which is a pretty big part of
link |
00:25:50.080
the brain, that seems fairly modular from the findings so far. Can you design something
link |
00:25:56.880
equally modular? And if you can just grow it, it becomes more capable, probably. I think that would
link |
00:26:01.840
be the kind of interesting underlying principle to shoot for that is not unrealistic.
link |
00:26:07.200
Do you think you prefer math or empirical trial and error for the discovery of the essence of what
link |
00:26:14.400
it means to do something intelligent? So reinforcement learning embodies both groups, right?
link |
00:26:19.680
To prove that something converges, prove the bounds. And then at the same time, a lot of those
link |
00:26:25.760
successes are, well, let's try this and see if it works. So which do you gravitate towards? How do
link |
00:26:31.280
you think of those two parts of your brain? So maybe I would prefer we could make the progress
link |
00:26:41.600
with mathematics. And the reason maybe I would prefer that is because often if you have something you
link |
00:26:46.560
can mathematically formalize, you can leapfrog a lot of experimentation. And experimentation takes
link |
00:26:54.080
a long time to get through. And there's a lot of trial and error in the reinforcement learning research
link |
00:27:01.440
process. But you need to do a lot of trial and error before you get to a success. So if you can
link |
00:27:05.040
leapfrog that, to my mind, that's what the math is about. And hopefully once you do a bunch of
link |
00:27:10.400
experiments, you start seeing a pattern, you can do some derivations that leapfrog some experiments.
link |
00:27:16.240
But I agree with you. I mean, in practice, a lot of the progress has been such that we have not
link |
00:27:20.160
been able to find the math that allows it to leapfrog ahead. And we are kind of making gradual
link |
00:27:25.840
progress one step at a time. A new experiment here, a new experiment there that gives us new
link |
00:27:30.480
insights and gradually building up, but not getting to something yet where we're just, okay,
link |
00:27:35.280
here's an equation that now explains something that, you know, would have been two years of
link |
00:27:39.920
experimentation to get there, but this tells us what the result is going to be.
link |
00:27:44.880
Unfortunately, not so much yet. Not so much yet. But your hope is there. In trying to teach robots
link |
00:27:52.800
or systems to do everyday tasks, or even in simulation, what do you think you're more excited
link |
00:28:01.200
about? Imitation learning or self play? So letting robots learn from humans, or letting robots plan
link |
00:28:10.560
on their own, try to figure it out in their own way, and eventually play, eventually interact with humans,
link |
00:28:18.240
or solve whatever the problem is. What's more exciting to you? What do you think is more promising
link |
00:28:23.200
as a research direction? So when we look at self play, what's so beautiful about it is that it
link |
00:28:34.240
goes back to kind of the challenges in reinforcement learning. So the challenge
link |
00:28:37.680
of reinforcement learning is getting signal. And if you never succeed, you don't get any signal.
link |
00:28:43.200
In self play, you're on both sides. So one of you succeeds. And the beauty is also one of you
link |
00:28:49.040
fails. And so you see the contrast, you see the one version of me that did better than the other
link |
00:28:53.520
version. And so every time you play yourself, you get signal. And so whenever you can turn
link |
00:28:58.400
something into self play, you're in a beautiful situation where you can naturally learn much
link |
00:29:04.160
more quickly than in most other reinforcement learning environments. So I think, I think if
link |
00:29:10.080
somehow we can turn more reinforcement learning problems into self play formulations, that would
link |
00:29:15.760
go really, really far. So far, self play has been largely around games where there is natural
link |
00:29:21.760
opponents. But if we could do self play for other things, and let's say, I don't know,
link |
00:29:25.440
a robot learns to build a house, I mean, that's a pretty advanced thing to try to do for a robot,
link |
00:29:29.360
but maybe it tries to build a hut or something. If that can be done through self play, it would
link |
00:29:34.240
learn a lot more quickly if somebody can figure it out. And I think that would be something where
link |
00:29:38.560
it goes closer to kind of the mathematical leapfrogging where somebody figures out a
link |
00:29:42.560
formalism to say, okay, for any RL problem, by applying this and this idea, you can turn it
link |
00:29:47.680
into a self play problem where you get signal a lot more easily.
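A minimal sketch of why self-play never runs out of signal: both sides are copies of the same agent, so every game yields one winning and one losing trajectory to contrast. The game and agent interfaces below are assumptions for illustration only.

```python
# Sketch: self-play loop for a two-player game; win/loss gives dense signal.
import copy

def self_play_iteration(agent, make_game, games=100):
    opponent = copy.deepcopy(agent)            # frozen copy of the current policy
    winning, losing = [], []
    for _ in range(games):
        game = make_game()
        trajectories = {0: [], 1: []}
        while not game.done():
            player = game.current_player()
            actor = agent if player == 0 else opponent
            state = game.observe(player)
            action = actor.act(state)
            trajectories[player].append((state, action))
            game.apply(action)
        winner = game.winner()                 # assume no draws, for simplicity
        winning.append(trajectories[winner])
        losing.append(trajectories[1 - winner])
    # Contrast the winning and losing versions of "me", e.g. with a policy
    # gradient that treats win/loss as a +1 / -1 reward.
    agent.update(winning, losing)
```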
link |
00:29:52.400
The reality is that for many problems, we don't know how to turn them into self play. And so either we need to provide
link |
00:29:57.680
a detailed reward, one that doesn't just reward for achieving a goal, but rewards for making progress,
link |
00:30:02.640
and that becomes time consuming. And once you're starting to do that, let's say you want a robot
link |
00:30:06.480
to do something, you need to give all this detailed reward. Well, why not just give a
link |
00:30:09.920
demonstration? Because why not just show the robot. And now the question is, how do you show
link |
00:30:15.920
the robot? One way to show it is to teleoperate the robot, and then the robot really experiences things.
link |
00:30:20.800
And that's nice, because that's really high signal to noise ratio data. And we've done a lot
link |
00:30:24.480
of that. And you teach your robot skills. In just 10 minutes, you can teach your robot a new basic
link |
00:30:29.360
skill, like, okay, pick up the bottle, place it somewhere else. That's a skill, no matter where
link |
00:30:33.360
the bottle starts, maybe it always goes on to a target or something. That's fairly easy to teach
link |
00:30:38.000
your robot with teleop. Now, what's even more interesting, if you can now teach your robot
link |
00:30:43.120
through third person learning, where the robot watches you do something, and doesn't experience
link |
00:30:48.480
it, but just watches it and says, okay, well, if you're showing me that, that means I should
link |
00:30:52.880
be doing this. And I'm not going to be using your hand, because I don't get to control your hand,
link |
00:30:56.880
but I'm going to use my hand, I do that mapping. And so that's where I think one of the big breakthroughs
link |
00:31:02.000
has happened this year. This was led by Chelsea Finn here. It's almost like learning a machine
link |
00:31:07.520
translation for demonstrations where you have a human demonstration and the robot learns to
link |
00:31:12.000
translate it into what it means for the robot to do it. And that was a meta learning formulation,
link |
00:31:17.440
learn from one to get the other. And that I think opens up a lot of opportunities to learn a lot
link |
00:31:23.440
more quickly.
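For the teleoperation case mentioned a moment earlier, where the robot experiences the demonstrated states and actions first-hand, imitation reduces to supervised learning. A minimal behavioral cloning sketch follows (data shapes and network sizes are illustrative); the third-person, meta-learned version adds a learned translation from the human's demonstration on top of something like this.

```python
# Sketch: behavioral cloning from teleoperated demonstrations.
import torch
import torch.nn as nn

def behavior_clone(demo_obs, demo_actions, obs_dim, act_dim, epochs=50):
    # demo_obs: (N, obs_dim) tensor, demo_actions: (N, act_dim) tensor from teleop.
    policy = nn.Sequential(
        nn.Linear(obs_dim, 128), nn.ReLU(),
        nn.Linear(128, act_dim),
    )
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    for _ in range(epochs):
        loss = nn.functional.mse_loss(policy(demo_obs), demo_actions)  # match the demo
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy
```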
link |
00:31:28.080
So my focus is on autonomous vehicles. Do you think this approach of third person watching, is autonomous driving amenable to this kind of approach?
link |
00:31:33.840
So for autonomous driving, I would say third person is slightly easier. And the reason I'm
link |
00:31:42.080
going to say it's slightly easier to do with third person is because the car dynamics are very well
link |
00:31:48.320
understood. So the easier than first person, you mean, or easier than. So I think the distinction
link |
00:31:56.560
between third person and first person is not a very important distinction for autonomous driving.
link |
00:32:01.680
They're very similar. Because the distinction is really about who turns the steering wheel.
link |
00:32:07.760
And or maybe let me put it differently. How to get from a point where you are now to a point,
link |
00:32:15.280
let's say a couple of meters in front of you. And that's a problem that's very well understood.
link |
00:32:19.120
And that's the only distinction between third and first person there. Whereas with the robot
link |
00:32:22.480
manipulation, interaction forces are very complex. And it's still a very different thing.
link |
00:32:27.840
For autonomous driving, I think there's still the question imitation versus RL.
link |
00:32:33.840
Well, so imitation gives you a lot more signal. I think where imitation is lacking and needs
link |
00:32:39.520
some extra machinery is that, in its normal format, it doesn't think about goals or objectives.
link |
00:32:48.480
And of course, there are versions of imitation learning, inverse reinforcement learning type
link |
00:32:52.240
imitation, which also thinks about goals. I think then we're getting much closer. But I think it's
link |
00:32:57.440
very hard to think of a fully reactive car generalizing well, if it really doesn't have a notion
link |
00:33:05.120
of objectives. To generalize well, to the kind of generality that you would want, you want more than
link |
00:33:10.720
just that reactivity that you get from just behavioral cloning slash supervised learning.
link |
00:33:17.040
So a lot of the work, whether it's self play or even imitation learning would benefit
link |
00:33:22.560
significantly from simulation, from effective simulation, and you're doing a lot of stuff
link |
00:33:27.440
in the physical world and in simulation, do you have hope for greater and greater
link |
00:33:33.520
power of simulation, it being boundless eventually, to where most of what we need
link |
00:33:40.160
to operate in the physical world, what could be simulated to a degree that's directly
link |
00:33:45.600
transferable to the physical world? Are we still very far away from that? So I think
link |
00:33:55.840
we could even rephrase that question in some sense, please. And so the power of simulation,
link |
00:34:04.720
as simulators get better and better, of course, becomes stronger, and we can learn more in
link |
00:34:09.760
simulation. But there's also another version, which is where you say the simulator doesn't
link |
00:34:13.760
even have to be that precise. As long as it's somewhat representative. And instead of trying
link |
00:34:19.120
to get one simulator that is sufficiently precise to learn and transfer really well to the real
link |
00:34:24.480
world, I'm going to build many simulators, an ensemble of simulators,
link |
00:34:30.080
not any single one of them is sufficiently representative of the real world such that
link |
00:34:35.120
it would work if you train in there. But if you train in all of them, then there is something
link |
00:34:41.760
that's good in all of them. The real world will just be, you know, another one of them. That's,
link |
00:34:47.840
you know, not identical to any one of them, but just another one of them.
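A minimal sketch of that ensemble-of-simulators idea, often called domain randomization; the simulator factory and the physical parameter ranges below are illustrative assumptions.

```python
# Sketch: train across many randomized simulators so the real world looks like
# just another draw from the same distribution.
import random

def sample_randomized_sim(make_sim):
    params = {
        "mass": random.uniform(0.8, 1.2),            # +/- 20% around nominal
        "friction": random.uniform(0.5, 1.5),
        "motor_gain": random.uniform(0.9, 1.1),
        "sensor_noise_std": random.uniform(0.0, 0.02),
        "latency_steps": random.randint(0, 2),
    }
    return make_sim(**params)

def train(agent, make_sim, iterations=1000):
    for _ in range(iterations):
        env = sample_randomized_sim(make_sim)        # fresh draw from the simulator family
        agent.collect_and_update(env)                # any RL algorithm runs inside each draw
```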
link |
00:34:50.720
Now, it's just another sample from the distribution of simulators.
link |
00:34:53.120
Exactly.
link |
00:34:53.360
We do live in a simulation. So this is just one, one other one.
link |
00:34:57.600
I'm not sure about that. But yeah, it's definitely a very advanced simulator if it is.
link |
00:35:03.440
Yeah, it's a pretty good one. I've talked to Stuart Russell. It's something you think about a little bit
link |
00:35:08.960
too. Of course, you're like really trying to build these systems. But do you think about the future
link |
00:35:13.120
of AI? A lot of people have concern about safety. How do you think about AI safety as you build
link |
00:35:18.880
robots that are operating in the physical world? What is, yeah, how do you approach this problem
link |
00:35:24.960
in an engineering kind of way in a systematic way?
link |
00:35:29.200
So when a robot is doing things, you kind of have a few notions of safety to worry about. One is that
link |
00:35:36.720
the robot is physically strong and of course could do a lot of damage. Same for cars, which we can
link |
00:35:43.760
think of as robots too in some way. And this could be completely unintentional. So it could be not
link |
00:35:49.360
the kind of long term AI safety concerns that, okay, AI is smarter than us. And now what do we do?
link |
00:35:54.240
But it could be just very practical. Okay, this robot, if it makes a mistake,
link |
00:35:58.800
what are the results going to be? Of course, simulation comes in a lot there to test in simulation.
link |
00:36:04.080
It's a difficult question. And I'm always wondering, like I always wonder, let's say you look at,
link |
00:36:10.960
let's go back to driving, because a lot of people know driving well, of course.
link |
00:36:15.120
What do we do to test somebody for driving, right, to get a driver's license? What do they
link |
00:36:20.800
really do? I mean, you fill out some tests, and then you drive and I mean, for a few minutes,
link |
00:36:27.680
in suburban California, the driving test is just: you drive around the block, pull over, you
link |
00:36:34.800
do a stop sign successfully, and then, you know, you pull over again, and you're pretty much done.
link |
00:36:40.000
And you're like, okay, if a self driving car did that, would you trust it that it can drive?
link |
00:36:46.720
And I'd be like, no, that's not enough for me to trust it. But somehow for humans,
link |
00:36:50.560
we've figured out that somebody being able to do that is representative
link |
00:36:54.480
of them being able to do a lot of other things. And so I think somehow for humans,
link |
00:36:59.840
we figured out representative tests of what it means if you can do this, what you can really do.
link |
00:37:05.760
Of course, testing humans, humans don't want to be tested at all times. Self driving cars or
link |
00:37:09.840
robots could be tested more often probably, you can have replicas that get tested and are known
link |
00:37:13.760
to be identical because they use the same neural net and so forth. But still, I feel like we don't
link |
00:37:19.600
have this kind of unit tests or proper tests for robots. And I think there's something very
link |
00:37:25.040
interesting to be thought about there, especially as you update things, your software improves,
link |
00:37:29.440
you have a better self driving car suite, you update it. How do you know it's indeed more
link |
00:37:34.640
capable on everything than what you had before that you didn't have any bad things creep into it?
link |
00:37:41.440
So I think that's a very interesting direction of research that there is no real solution yet,
link |
00:37:45.680
except that somehow for humans, we do because we say, okay, you have a driving test, you passed,
link |
00:37:50.640
you can go on the road now, and you might have an accident every, like, a million or 10 million miles,
link |
00:37:55.760
something pretty phenomenal compared to that short test that is being done.
link |
00:38:01.520
So let me ask, you've mentioned that Andrew Ng, by example,
link |
00:38:06.000
showed you the value of kindness. And do you think the space of
link |
00:38:11.440
policies, good policies for humans and for AI, is populated by policies that
link |
00:38:21.440
are filled with kindness, or ones that are the opposite, exploitation, even evil. So if you just look
link |
00:38:28.880
at the sea of policies we operate under as human beings, or if AI system had to operate in this
link |
00:38:34.400
real world, do you think it's really easy to find policies that are full of kindness,
link |
00:38:39.440
like we naturally fall into them? Or is it like a very hard optimization problem?
link |
00:38:47.920
I mean, there is kind of two optimizations happening for humans, right? So for humans,
link |
00:38:52.720
there's kind of the very long term optimization, which evolution has done for us. And we're kind of
link |
00:38:57.440
predisposed to like certain things. And that's in some sense, what makes our learning easier,
link |
00:39:02.640
because I mean, we know things like pain and hunger and thirst. And the fact that we know about those
link |
00:39:10.000
is not something that we were taught. That's kind of innate. When we're hungry, we're unhappy.
link |
00:39:13.840
When we're thirsty, we're unhappy. When we have pain, we're unhappy. And ultimately evolution
link |
00:39:20.720
built that into us to think about those things. And so I think there is a notion that it seems
link |
00:39:25.040
somehow humans evolved in general to prefer to get along in some ways. But at the same time,
link |
00:39:33.840
also to be very territorial and kind of centric to their own tribe. It seems like that's the kind
link |
00:39:43.040
of space we converged on to. I mean, I'm not an expert in anthropology, but it seems like we're
link |
00:39:47.360
very kind of good within our own tribe, but need to be taught to be nice to other tribes.
link |
00:39:54.480
Well, if you look at Steven Pinker, he highlights this pretty nicely in
link |
00:40:00.720
The Better Angels of Our Nature, where he talks about violence decreasing consistently over time.
link |
00:40:05.520
So whatever tension, whatever teams we pick, it seems that the long arc of history goes
link |
00:40:11.360
towards us getting along more and more. So do you think that
link |
00:40:17.840
do you think it's possible to teach RL based robots this kind of kindness, this kind of ability
link |
00:40:27.280
to interact with humans, this kind of policy? Even, let me ask a fun one: do you think
link |
00:40:33.040
it's possible to teach an RL based robot to love a human being and to inspire that human to love
link |
00:40:38.800
the robot back? So, like, an RL based algorithm that leads to a happy marriage? That's an interesting
link |
00:40:48.080
question. Maybe I'll answer it with another question, right? Because I mean, but I'll come
link |
00:40:56.080
back to it. So another question you can have is okay. I mean, how close does some people's
link |
00:41:02.000
happiness get from interacting with just a really nice dog? Like, I mean, dogs, you come home,
link |
00:41:09.760
that's what dogs do. They greet you. They're excited. It makes you happy when you come home
link |
00:41:14.000
to your dog. You're just like, okay, this is exciting. They're always happy when I'm here.
link |
00:41:18.160
I mean, if they don't greet you, because maybe whatever, your partner took them on a trip or
link |
00:41:22.560
something, you might not be nearly as happy when you get home, right? And so the kind of,
link |
00:41:27.600
it seems like the level of reasoning a dog has is pretty sophisticated, but then it's still not yet
link |
00:41:33.600
at the level of human reasoning. And so it seems like we don't even need to achieve human level
link |
00:41:38.240
reasoning to get like very strong affection with humans. And so my thinking is, why not, right?
link |
00:41:44.320
Why couldn't we, with an AI, achieve the kind of level of affection that humans feel
link |
00:41:51.360
among each other or with friendly animals and so forth? So question, is it a good thing for us
link |
00:41:59.280
or not? That's another thing, right? Because I mean, but I don't see why not. Why not? Yeah.
link |
00:42:07.040
So Elon Musk says love is the answer. Maybe he should say love is the objective function and
link |
00:42:12.640
then RL is the answer, right? Well, maybe. Pieter, thank you so much. I don't want to take
link |
00:42:19.280
up more of your time. Thank you so much for talking today. Well, thanks for coming by.
link |
00:42:23.360
Great to have you visit.