Pieter Abbeel: Deep Reinforcement Learning | Lex Fridman Podcast #10
The following is a conversation with Pieter Abbeel. He's a professor at UC Berkeley and the director of the Berkeley Robot Learning Lab. He's one of the top researchers in the world working on how we make robots understand and interact with the world around them, especially using imitation learning and deep reinforcement learning.

This conversation is part of the MIT course on Artificial General Intelligence and the Artificial Intelligence podcast. If you enjoy it, please subscribe on YouTube, iTunes, or your podcast provider of choice, or simply connect with me on Twitter at Lex Fridman.

And now, here's my conversation with Pieter Abbeel.
You've mentioned that if there was one person you could meet, it would be Roger Federer. So let me ask: when do you think we'll have a robot that fully autonomously can beat Roger Federer at tennis? A Roger Federer-level player at tennis?

Well, first, if you can make it happen for me to meet Roger... In terms of getting a robot to beat him at tennis, it's kind of an interesting question, because for a lot of the challenges we think about in AI, the software is really the missing piece, but for something like this, the hardware is nowhere near either. To really have a robot that can physically run around, the Boston Dynamics robots are starting to get there, but still not really human-level ability to run around and then swing a racket.

So you think that's a hardware problem?

I don't think it's a hardware problem only. I think it's a hardware and a software problem. I think it's both, and I think they'll have independent progress. So I'd say the hardware maybe in 10, 15 years.

On clay, not grass. I mean, grass is probably harder.

With the clay, I'm not sure what's harder, grass or clay. The clay involves sliding, which might be harder to master, actually, yeah.

But you're not limited to a bipedal robot. I mean, I'm sure there's no...

Well, if we can build a machine, it's a whole different question, of course. If you can say, okay, this robot can be on wheels, it can move around on wheels and can be designed differently, then I think that can be done sooner, probably, than a full humanoid type of setup.
What do you think about swinging a racket? You've worked on basic manipulation. How hard do you think the task of swinging a racket would be, to be able to hit a nice backhand or a forehand? Let's say we just set it up stationary, a nice robot arm, let's say a standard industrial arm, and it can watch the ball come and then swing the racket.

It's a good question. I'm not sure it would be super hard to do. I mean, I'm sure it would require a lot; if we do it with reinforcement learning, it would require a lot of trial and error. It's not gonna swing it right the first time around, but yeah, I don't see why it couldn't swing it the right way. I think it's learnable. If you set up a ball machine, let's say on one side, and then a robot with a tennis racket on the other side, I think it's learnable, maybe with a little bit of pre-training in simulation. Yeah, I think that's feasible. I think swinging the racket is feasible.

It'd be very interesting to see how much precision it could achieve. 'Cause I mean, that's where, I mean, some of the human players can hit it on the lines, which is very high precision.

With spin. The spin is an interesting question, whether RL can learn to put a spin on the ball.

Well, you got me interested. Maybe someday we'll set this up.

Sure, you got me intrigued.

Your answer is basically, okay, for this problem, it sounds fascinating, but for the general problem of a tennis player, we might be a little bit farther away.
What's the most impressive thing you've seen a robot do in the physical world?

Physically, for me, it's the Boston Dynamics videos. Those always just hit home, and I'm just super impressed. Recently, the robot running up the stairs, doing the parkour-type thing. I mean, yes, we don't know what's underneath; they don't really write a lot of detail. But even if it's hard-coded underneath, which it might or might not be, just the physical ability of doing that parkour, that's very impressive.

So have you met Spot Mini, or any of those robots, in person?

I met Spot Mini last year in April, at the MARS event that Jeff Bezos organizes. They brought it out there, and it was nicely following Jeff around. When Jeff left the room, they had it follow him along, which was pretty impressive.

So I think there's some confidence in knowing that there's no learning going on in those robots. The psychology of it: so while knowing that, while knowing there's not, if there's any learning going on, it's very limited. I met Spot Mini earlier this year, and knowing everything that's going on, having a one-on-one interaction, so I got to spend some time alone with it, there's immediately a deep connection on the psychological level. Even though you know the fundamentals of how it works, there's something magical.
link |
So do you think about the psychology of interacting
link |
with robots in the physical world?
link |
Even you just showed me the PR2, the robot,
link |
and there was a little bit something like a face,
link |
had a little bit something like a face.
link |
There's something that immediately draws you to it.
link |
Do you think about that aspect of the robotics problem?
link |
Well, it's very hard with Brad here.
link |
We'll give him a name, Berkeley Robot
link |
for the Elimination of Tedious Tasks.
link |
It's very hard to not think of the robot as a person
link |
and it seems like everybody calls him a he
link |
for whatever reason, but that also makes it more a person
link |
than if it was a it, and it seems pretty natural
link |
to think of it that way.
link |
This past weekend really struck me.
link |
I've seen Pepper many times on videos,
link |
but then I was at an event organized by,
link |
this was by Fidelity, and they had scripted Pepper
link |
to help moderate some sessions,
link |
and they had scripted Pepper
link |
to have the personality of a child a little bit,
link |
and it was very hard to not think of it
link |
as its own person in some sense
link |
because it would just jump in the conversation,
link |
making it very interactive.
link |
Moderate would be saying, Pepper would just jump in,
link |
hold on, how about me?
link |
Can I participate in this too?
link |
And you're just like, okay, this is like a person,
link |
and that was 100% scripted, and even then it was hard
link |
not to have that sense of somehow there is something there.
link |
So as we have robots interact in this physical world,
link |
is that a signal that could be used
link |
in reinforcement learning?
link |
You've worked a little bit in this direction,
link |
but do you think that psychology can be somehow pulled in?
link |
Yes, that's a question I would say
link |
a lot of people ask, and I think part of why they ask it
link |
is they're thinking about how unique
link |
are we really still as people?
link |
Like after they see some results,
link |
they see a computer play Go, they see a computer do this,
link |
that, they're like, okay, but can it really have emotion?
link |
Can it really interact with us in that way?
link |
And then once you're around robots,
link |
you already start feeling it,
link |
and I think that kind of maybe mythologically,
link |
the way that I think of it is
link |
if you run something like reinforcement learning,
link |
it's about optimizing some objective,
link |
and there's no reason that the objective
link |
couldn't be tied into how much does a person like
link |
interacting with this system,
link |
and why could not the reinforcement learning system
link |
optimize for the robot being fun to be around?
link |
And why wouldn't it then naturally become
link |
more and more interactive and more and more
link |
maybe like a person or like a pet?
link |
I don't know what it would exactly be,
link |
but more and more have those features
link |
and acquire them automatically.
As long as you can formalize an objective of what it means to like something. What's the ground truth? How do you get the reward from the human? Because you have to somehow collect that information from the human. But you're saying if you can formulate it as an objective, it can be learned.

There's no reason it couldn't emerge through learning. And maybe one way to formulate it as an objective: you wouldn't have to necessarily score it explicitly. So standard rewards are numbers, and numbers are hard to come by. Is this a 1.5 or a 1.7 on some scale? That's very hard for a person to do. But it's much easier for a person to say, okay, what you did the last five minutes was much nicer than what you did the previous five minutes. And that gives a comparison.
And in fact, there have been some results on that. For example, Paul Christiano and collaborators at OpenAI had the MuJoCo Hopper, a one-legged robot, doing backflips purely from that kind of feedback: I like this better than that, these are about equally good. And after a bunch of interactions, it figured out what the person was asking for, namely a backflip.

And so I think the same thing...

Oh, so it wasn't trying to do a backflip? It was just getting a comparison score from the person, based on...?

The person had in mind, in their own mind, "I want it to do a backflip," but the robot didn't know what it was supposed to be doing. It just knew that sometimes the person said this is better, this is worse, and then the robot figured out that what the person was actually after was a backflip. And I'd imagine the same would be true for things like more interactive robots: the robot would figure out over time, oh, this kind of thing apparently is appreciated more than this other kind of thing.
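The comparison-based setup described here can be sketched in a few lines. Below is a minimal illustrative version, not the actual OpenAI implementation: a reward model is fit to pairwise preferences with the Bradley-Terry-style logistic loss used in the Christiano et al. work, and a simulated "human" plus a simple linear reward model stand in as assumptions.

```python
import numpy as np

# Hedged sketch of learning a reward function from pairwise preferences.
# The linear reward model, the synthetic "human", and the feature encoding
# are all illustrative assumptions, not the actual Hopper setup.

rng = np.random.default_rng(0)

def segment_features(n_steps=10, dim=4):
    """A trajectory segment, encoded as per-step feature vectors."""
    return rng.normal(size=(n_steps, dim))

# Hidden "true" reward the human implicitly judges by (unknown to the learner).
true_w = np.array([1.0, -0.5, 0.0, 2.0])

def human_prefers(seg_a, seg_b):
    # The simulated human says which segment they like better.
    return (seg_a @ true_w).sum() > (seg_b @ true_w).sum()

# Learned reward: r(s) = w . phi(s). Train w by logistic regression on the
# Bradley-Terry model: P(A preferred over B) = sigmoid(sum r(A) - sum r(B)).
w = np.zeros(4)
lr = 0.05
for _ in range(2000):
    a, b = segment_features(), segment_features()
    label = 1.0 if human_prefers(a, b) else 0.0
    logit = (a @ w).sum() - (b @ w).sum()
    p = 1.0 / (1.0 + np.exp(-logit))
    grad = (p - label) * (a.sum(axis=0) - b.sum(axis=0))  # d loss / d w
    w -= lr * grad

# The learned reward should rank fresh segments the same way the human does.
agree = sum(
    human_prefers(a, b) == ((a @ w).sum() > (b @ w).sum())
    for a, b in [(segment_features(), segment_features()) for _ in range(200)]
)
print(f"agreement with human on held-out pairs: {agree}/200")
```

Once such a reward model is fit, an ordinary RL algorithm can be run against it, which is the overall structure of the backflip result described above.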
So when I first picked up Richard Sutton's reinforcement learning book, before this deep learning era, before the reemergence of neural networks as a powerful mechanism for machine learning, RL seemed to me like magic. It seemed like what intelligence is: reinforcement learning. So how do you think we can possibly learn anything about the world when the reward for the actions is so delayed, so sparse? Why do you think RL works? Why do you think you can learn anything under such sparse rewards, whether it's regular reinforcement learning or deep reinforcement learning? What's your intuition?

The counterpart of that is: why does RL need so many samples, so many experiences, to learn from? Because really what's happening is, when you have a sparse reward, you do something for, I don't know, let's say you take 100 actions, and then you get a reward. And maybe you get a score of three. And you're like, okay, three, not sure what that means. You go again, and now you get two. And now you know that the sequence of 100 actions you did the second time around was somehow worse than the sequence of 100 actions you did the first time around. But it's tough to know which of those actions were better or worse; some might have been good and some bad in either one. And so that's why it needs so many experiences. But once you have enough experiences, effectively RL is teasing that apart. It's trying to say, okay, what is consistently there when you get a higher reward, and what's consistently there when you get a lower reward? And then the magic of, for example, the policy gradient update is to say: now let's update the neural network to make the actions that were present when things were good more likely, and make the actions that were present when things were not as good less likely.
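That update can be written down in a few lines. Here is a hedged toy sketch of REINFORCE, under illustrative assumptions (a stateless two-action softmax policy and episodes of 100 actions scored only by a single sparse return at the end), showing how actions present in high-scoring episodes become more likely:

```python
import numpy as np

# Toy REINFORCE: an episode is 100 actions, the only signal is one score
# at the end, and the policy-gradient update upweights whatever actions
# co-occurred with higher-than-baseline scores. The environment (score =
# fraction of times the "good" action was picked) is an assumption.

rng = np.random.default_rng(0)
T = 100               # actions per episode before any reward arrives
logits = np.zeros(2)  # stateless softmax policy over 2 actions

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

baseline = 0.0
for episode in range(500):
    probs = softmax(logits)
    actions = rng.choice(2, size=T, p=probs)
    ret = (actions == 1).sum() / T  # sparse score, known only at the end

    # REINFORCE: grad log pi(a) = one_hot(a) - probs, summed over the
    # episode and scaled by (return - baseline).
    grad = np.zeros(2)
    for a in actions:
        g = np.zeros(2)
        g[a] = 1.0
        grad += g - probs
    adv = ret - baseline
    logits += 0.01 * adv * grad
    baseline = 0.9 * baseline + 0.1 * ret  # running average as baseline

print("final P(good action):", softmax(logits)[1])
```

Even though no single action is ever told whether it was good, the probability of the action that is consistently present in better episodes rises, which is exactly the "teasing apart" described above.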
So that's the counterpoint, but it seems like you would need to run it a lot more than you do. Even though right now people could say that RL is very inefficient, it seems to be way more efficient than one would imagine on paper. That simple updates to the policy, the policy gradient, can somehow learn, exactly as you just said, what the common actions are that seem to produce good results, that that can learn anything at all, seems counterintuitive, at least. Is there some intuition behind it?
Yeah, so I think there are a few ways to think about this. The way I tend to think about it mostly, originally: when we started working on deep reinforcement learning here at Berkeley, which was maybe 2011, '12, '13, around that time, John Schulman was a PhD student initially kind of driving it forward here. And the way we thought about it at the time was: if you think about rectified linear units, rectifier-type neural networks, you get something that's piecewise-linear feedback control. And if you look at the literature, linear feedback control is extremely successful, and can solve many, many problems surprisingly well. I remember, for example, when we did helicopter flight: if you're in a stationary flight regime, not a non-stationary one, but a stationary flight regime like hover, you can use linear feedback control to stabilize the helicopter, a very complex dynamical system, but the controller is relatively simple. And so I think that's a big part of it: if you do feedback control, even though the system you control can be very, very complex, often relatively simple control architectures can already do a lot.

But then just linear is not good enough. And so one way you can think of these neural networks is that they tile the space, which people were already trying to do more by hand, or with finite state machines: this linear controller here, that linear controller there. A neural network learns to tile the space, with one linear controller here and another linear controller there, but it's more subtle than that. It's benefiting from the linear control aspect, it's benefiting from the tiling, but it's somehow tiling the space one direction at a time. Because if, let's say, you have a two-layer network, and in that hidden layer you make a transition from active to inactive or the other way around, that's essentially one axis, not axis-aligned, but one direction in which you change. And so you have this very gradual tiling of the space, where there's a lot of sharing between the linear controllers that tile the space. And that was always my intuition as to why to expect that this might work pretty well: it's essentially leveraging the fact that linear feedback control is so good, but of course not enough, and it's a gradual tiling of the space with linear feedback controllers that share a lot of expertise across them.
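This tiling intuition can be checked numerically. The sketch below (layer sizes and random weights are arbitrary assumptions for the demo) verifies that within one region where the hidden units' on/off pattern stays fixed, a two-layer ReLU "policy" is exactly an affine feedback law u = K x + k, with K determined by which units are active:

```python
import numpy as np

# A ReLU network is an exactly affine map inside each region where the
# hidden units' on/off pattern is fixed; region boundaries are where a
# hidden unit flips. The shapes here are illustrative assumptions.

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(16, 4)), rng.normal(size=16)  # hidden layer
W2, b2 = rng.normal(size=(2, 16)), rng.normal(size=2)   # "control" output

def policy(x):
    h = W1 @ x + b1
    return W2 @ np.maximum(h, 0.0) + b2, (h > 0)

x0 = rng.normal(size=4)
u0, pattern = policy(x0)

# Inside this activation region the network equals one linear controller:
# u = K x + const, with K = W2 . diag(pattern) . W1.
K = W2 @ (W1 * pattern[:, None])

# Tiny perturbations that keep the same on/off pattern stay exactly on
# that linear law.
for _ in range(100):
    x = x0 + 1e-4 * rng.normal(size=4)
    u, p = policy(x)
    if (p == pattern).all():
        assert np.allclose(u - u0, K @ (x - x0), atol=1e-9)

print("within one activation region, the ReLU policy is exactly linear")
```

Each hidden unit flipping on or off changes one rank-one term of K, which is the "one direction at a time" sharing between neighboring linear controllers described above.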
So that's a really nice intuition, but do you think it scales to the more and more general problems, when you start going up in the number of dimensions, when you start going down in terms of how often you get a clean reward signal? Does that intuition carry forward to those crazier, weirder worlds that we think of as the real world?

So I think where things get really tricky in the real world, compared to the things we've looked at so far with great success in reinforcement learning, is the time scales, which take this to an extreme. When you think about the real world, I mean, I don't know, maybe some student decided to do a PhD here, right? Okay, that's a decision. That's a very high-level decision. But if you think about their life, any person's life, it's a sequence of muscle fiber contractions and relaxations, and that's how you interact with the world. And that's a very high-frequency control thing, but it's ultimately what you do and how you affect the world, until, I guess, we have brain readings and you can maybe do it slightly differently. But typically that's how you affect the world. And the decision of doing a PhD is so abstract relative to what you're actually doing in the world. And I think that's where credit assignment becomes just completely beyond what any current RL algorithm can do, and we need hierarchical reasoning at a level that is just not available at all yet.
Where do you think we can pick up hierarchical reasoning? By which mechanisms?

Yeah, so maybe let me first highlight what I think the limitations are of what was already done 20, 30 years ago. Back then, you'll find reasoning systems that reason over relatively long horizons, but the problem is that they were not grounded in the real world. People would have to hand-design some kind of logical, dynamical description of the world, and that didn't tie into perception, so it didn't tie into real objects and so forth. And that was a big gap. Now, with deep learning, we're starting to have the ability to really see with sensors, process that, and understand what's in the world. And so it's a good time to try to bring these things together.

I see a few ways of getting there. One way would be to say deep learning can get bolted on, somehow, to some of these more traditional approaches. Now, "bolted on" would probably mean you need to do some kind of end-to-end training, where you say: my deep learning processing somehow leads to a representation that in turn feeds into some kind of traditional underlying dynamical system that can be used for planning. And that's, for example, the direction Aviv Tamar and Thanard Kurutach here have been pushing, with Causal InfoGAN, and of course other people too. Can we somehow force it into a form factor that is amenable to reasoning?
Another direction we've been thinking about for a long time, and didn't make much progress on, was more information-theoretic approaches. The idea there was that what it means to take a high-level action is to choose a latent variable now that tells you a lot about what's going to be the case in the future, because that's what it means to take a high-level action. I say, okay, I decide I'm gonna navigate to the gas station, because I need to get gas for my car. Well, it'll now take five minutes to get there, but the fact that I'll get there, you could already tell that from the high-level action I took much earlier. That, we had a very hard time getting success with. I'm not saying it's a dead end necessarily, but we had a lot of trouble getting it to work.
And then we started revisiting the notion of what we're really trying to achieve. What we're trying to achieve is not necessarily hierarchy per se; you could ask, what does hierarchy give us? What we hope it would give us is better credit assignment. And what does better credit assignment give us? It gives us faster learning, right? And so faster learning is ultimately maybe what we're after. And so that's where we ended up with the RL² paper on learning to reinforcement learn, which at the time Rocky Duan led. And that's exactly the meta-learning approach, where you say: okay, we don't know how to design hierarchy, but we know what we want to get from it. Let's just end-to-end optimize for what we want to get from it and see if it might emerge. And we saw things emerge. The maze navigation had consistent motion down hallways, which is what you want: a hierarchical controller should say, I want to go down this hallway, and then when there's an option to take a turn, I can decide whether to take the turn or not, and repeat. It even had a notion of where you've been before or not, so as to not revisit places you've been before. It still didn't scale yet to the real-world kind of scenarios I think you had in mind, but it was some sign of life that maybe you can meta-learn these hierarchical concepts.
It seems like, through these meta-learning concepts, you get at what I think is one of the hardest and most important problems of AI, which is transfer learning, generalization. How far along the journey toward building general systems are we, being able to do transfer learning well? There are some signs that you can generalize a little bit, but do you think we're on the right path, or are totally different breakthroughs needed to be able to transfer knowledge between different learned models?

Yeah, I'm pretty torn on this, in that I think there are some very impressive... well, there are just some very impressive results already. I mean, I would say, even with the initial big breakthrough in 2012 with AlexNet, the initial thing was: okay, great, this does better on ImageNet, hence image recognition. But then immediately thereafter there was, of course, the realization that, wow, what was learned on ImageNet, if you now want to solve a new task, you can fine-tune AlexNet for the new task. And that was often found to be the even bigger deal: you learned something that was reusable, which was not often the case before. Usually in machine learning, you learned something for one scenario, and that was it.

And that's really exciting. I mean, that's a huge application. That's probably the biggest success of transfer learning today in terms of scope and impact.
That was a huge breakthrough. And then recently, I feel like something similar has happened by scaling things up: this has been expanded upon, with people training even bigger networks, and they might transfer even better. If you look, for example, at some of the OpenAI results on language models, and some of the recent Google results on language models, they're trained just for prediction, and then they get reused for other tasks. So I think there is something there, where somehow, if you train a big enough model on enough things, it seems to transfer. There were also some DeepMind results that I thought were very impressive, the UNREAL results, where it learned to navigate mazes in a way where it wasn't just doing reinforcement learning, but it had other objectives it was optimizing for. So I think there are a lot of interesting results already.

I think maybe where it's hard to wrap my head around is: to what extent, or when, do we call something generalization? What are the levels of generalization in the real world, or the levels of generalization involved in these different tasks, right?
You draw this, by the way, just to frame things: I've heard you say somewhere it's the difference between learning to master versus learning to generalize. That's a nice line to think about.

And I guess you're saying it's a gray area between what's learning to master and what's learning to generalize. I think I might have heard this, I might have heard it somewhere else, and I think it might have been one of your interviews, maybe the one with Yoshua Bengio, I'm not 100% sure. But I liked the example, I'm not sure whose it was. The example was essentially: if you use current deep learning techniques to predict, let's say, the relative motion of our planets, it would do pretty well. But now, if a massive new mass enters our solar system, it would probably not predict what will happen, right? And that's a different kind of generalization. That's a generalization that relies on the ultimate simplest explanation that we have available today for the motion of the planets, whereas just pattern recognition could predict our current solar system's motion pretty well, no problem. And so I think that's an example of a kind of generalization that is a little different from what we've achieved so far. And it's not clear if just regularizing more, and forcing it to come up with a simpler and simpler explanation, gets you there. But that's what physics researchers do, right? They say, can I make this even simpler? How simple can I get this? What's the simplest equation that can explain everything, the master equation for the entire dynamics of the universe? We haven't really pushed that direction as hard in deep learning, I would say. I'm not sure if it should be pushed, but it seems a kind of generalization you get from that, that you don't get from our current methods so far.
So I just talked with Vladimir Vapnik, for example, the statistician of statistical learning, and he kind of dreams of creating the E = mc² of learning, right? The general theory of learning. Do you think that's a fruitless pursuit in the near term, within the next several decades?

I think that's a really interesting pursuit, in the following sense: there is a lot of evidence that the brain is pretty modular. So I wouldn't maybe think of it as the theory, maybe the underlying theory, but more as the principle. There have been findings where people who are blind will use the part of the brain usually used for vision for other functions. And even if people get rewired in some way, they might be able to reuse parts of their brain for other functions. And so what that suggests is some kind of modularity, and I think it's a pretty natural thing to strive for, to see: can we find that modularity? Can we find this thing? Of course, every part of the brain is not exactly the same; not everything can be rewired arbitrarily. But if you think of something like the neocortex, which is a pretty big part of the brain, that seems fairly modular from the findings so far. Can you design something equally modular? And if you can just grow it, it probably becomes more capable. I think that would be the kind of interesting underlying principle to shoot for, and it's not unrealistic.
Do you prefer math or empirical trial and error for the discovery of the essence of what it means to do something intelligent? Reinforcement learning embodies both camps, right? Proving that something converges, proving the bounds, and then, at the same time, a lot of the successes are, well, let's try this and see if it works. So which do you gravitate towards? How do you think of those two parts of your brain?

Maybe I would prefer we could make the progress with mathematics. And the reason I would maybe prefer that is because, often, if you have something you can mathematically formalize, you can leapfrog a lot of experimentation, and experimentation takes a long time to get through. A lot of trial and error, it's kind of a reinforcement learning view of your own research process: you need to do a lot of trial and error before you get to a success. So if you can leapfrog that, to my mind, that's what the math is about. And hopefully, once you do a bunch of experiments, you start seeing a pattern, and you can do some derivations that leapfrog some experiments. But I agree with you: in practice, a lot of the progress has been such that we have not been able to find the math that allows you to leapfrog ahead. And we're making gradual progress one step at a time, a new experiment here, a new experiment there, that gives us new insights, gradually building up, but not yet getting to the point where we can say, okay, here's an equation that explains what would have been, you know, two years of experimentation to get to, but this tells us what the result is going to be. Unfortunately, not so much yet.
link |
In trying to teach robots or systems
link |
to do everyday tasks or even in simulation,
link |
what do you think you're more excited about?
link |
Imitation learning or self play?
link |
So letting robots learn from humans
link |
or letting robots plan their own
link |
to try to figure out in their own way
link |
and eventually play, eventually interact with humans
link |
or solve whatever the problem is.
link |
What's the more exciting to you?
link |
What's more promising you think as a research direction?
link |
So when we look at self play,
link |
what's so beautiful about it is goes back
link |
to kind of the challenges in reinforcement learning.
link |
So the challenge of reinforcement learning
link |
is getting signal.
link |
And if you don't never succeed, you don't get any signal.
link |
In self play, you're on both sides.
link |
So one of you succeeds.
link |
And the beauty is also one of you fails.
link |
And so you see the contrast.
link |
You see the one version of me that did better
link |
than the other version.
link |
So every time you play yourself, you get signal.
link |
And so whenever you can turn something into self play,
link |
you're in a beautiful situation
link |
where you can naturally learn much more quickly
link |
than in most other reinforcement learning environments.
link |
So I think if somehow we can turn more
link |
reinforcement learning problems
link |
into self play formulations,
link |
that would go really, really far.
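The property described here, that every game of self play produces both a winner and a loser and therefore learning signal, can be sketched in a toy example. The game and the update rule below are illustrative assumptions, not anything from the conversation: both sides share one policy, each picks a number, the bigger number wins, and the winning copy's action is reinforced.

```python
import random

ACTIONS = list(range(5))

def sample(weights):
    # Sample an action index with probability proportional to its weight.
    total = sum(weights)
    r = random.random() * total
    for a, w in enumerate(weights):
        r -= w
        if r <= 0:
            return a
    return len(ACTIONS) - 1

def self_play(episodes=2000, lr=0.1, seed=0):
    random.seed(seed)
    weights = [1.0] * len(ACTIONS)  # one shared policy, played against itself
    for _ in range(episodes):
        a1, a2 = sample(weights), sample(weights)
        if a1 == a2:
            continue  # a tie gives no contrast, so no update
        winner, loser = max(a1, a2), min(a1, a2)
        # The contrast between the winning and losing copy is the signal:
        weights[winner] += lr
        weights[loser] = max(0.01, weights[loser] - lr)
    return weights

w = self_play()
print(max(range(len(w)), key=lambda a: w[a]))  # the dominant action wins out
```

Because the same policy sits on both sides, every non-tie episode yields an update, which is the "signal on every game" property being described.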
link |
So far, self play has been largely around games
link |
where there are natural opponents.
link |
But if we could do self play for other things,
link |
and let's say, I don't know,
link |
a robot learns to build a house.
link |
I mean, that's a pretty advanced thing
link |
to try to do for a robot,
link |
but maybe it tries to build a hut or something.
link |
If that can be done through self play,
link |
it would learn a lot more quickly
link |
if somebody can figure that out.
link |
And I think that would be something
link |
where it goes closer to kind of the mathematical leapfrogging
link |
where somebody figures out a formalism to say,
link |
okay, any RL problem, by applying this and this idea,
link |
you can turn it into a self play problem
link |
where you get signal a lot more easily.
link |
Reality is, many problems we don't know
link |
how to turn into self play.
link |
And so either we need to provide a detailed reward,
link |
one that doesn't just reward achieving the goal
link |
but also rewards making progress,
link |
and that becomes time consuming.
link |
And once you're starting to do that,
link |
let's say you want a robot to do something,
link |
you need to give all this detailed reward.
link |
Well, why not just give a demonstration?
link |
Because why not just show the robot?
link |
And now the question is, how do you show the robot?
link |
One way to show it is to teleoperate the robot,
link |
and then the robot really experiences things.
link |
And that's nice, because that's really high signal
link |
to noise ratio data, and we've done a lot of that.
link |
And you can teach your robot skills: in just 10 minutes,
link |
you can teach your robot a new basic skill,
link |
like okay, pick up the bottle, place it somewhere else.
link |
That's a skill, no matter where the bottle starts,
link |
maybe it always goes onto a target or something.
link |
That's fairly easy to teach your robot with teleop.
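Teleoperated demonstrations like this reduce to plain supervised learning, behavioral cloning: the operator's observation action pairs become a labeled dataset. Below is a minimal sketch, with a made-up one dimensional "move toward the bottle" task standing in for real robot data; the demonstrator's rule and the linear policy are illustrative assumptions.

```python
def collect_demos(n=100):
    # Pretend teleoperation data: the operator commands a move of half
    # the remaining distance to the bottle at each step.
    demos = []
    for i in range(n):
        bottle_offset = (i - n / 2) / n  # observation
        action = 0.5 * bottle_offset     # the operator's command
        demos.append((bottle_offset, action))
    return demos

def behavioral_cloning(demos, lr=0.5, epochs=200):
    # Fit action ~= w * observation by least squares, via gradient descent.
    w = 0.0
    for _ in range(epochs):
        grad = sum((w * obs - act) * obs for obs, act in demos) / len(demos)
        w -= lr * grad
    return w

w = behavioral_cloning(collect_demos())
print(round(w, 3))  # recovers the demonstrator's gain of about 0.5
```

The high signal-to-noise point in the conversation shows up here: every demonstrated step is a direct supervised label, no exploration needed.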
link |
Now, what's even more interesting
link |
if you can now teach your robot
link |
through third person learning,
link |
where the robot watches you do something
link |
and doesn't experience it, but just kind of watches you
link |
and says, okay, well, if you're showing me that,
link |
that means I should be doing this.
link |
And I'm not gonna be using your hand,
link |
because I don't get to control your hand,
link |
but I'm gonna use my hand, I do that mapping.
link |
And so that's where I think one of the big breakthroughs
link |
has happened this year.
link |
This was led by Chelsea Finn here.
link |
It's almost like learning a machine translation
link |
for demonstrations, where you have a human demonstration,
link |
and the robot learns to translate it
link |
into what it means for the robot to do it.
link |
And that was a meta learning formulation,
link |
learn from one to get the other.
link |
And that, I think, opens up a lot of opportunities
link |
to learn a lot more quickly.
link |
So my focus is on autonomous vehicles.
link |
Do you think this approach of third person watching
link |
is something autonomous driving is amenable to?
link |
So for autonomous driving,
link |
I would say third person is slightly easier.
link |
And the reason I'm gonna say it's slightly easier
link |
to do with third person is because
link |
the car dynamics are very well understood.
link |
Easier than first person, you mean?
link |
So I think the distinction between third person
link |
and first person is not a very important distinction
link |
for autonomous driving.
link |
They're very similar.
link |
Because the distinction is really about
link |
who turns the steering wheel.
link |
Or maybe, let me put it differently.
link |
How to get from a point where you are now
link |
to a point, let's say, a couple meters in front of you.
link |
And that's a problem that's very well understood.
link |
And that's the only distinction
link |
between third and first person there.
link |
Whereas with the robot manipulation,
link |
interaction forces are very complex.
link |
And it's still a very different thing.
link |
For autonomous driving,
link |
I think there is still the question,
link |
imitation versus RL.
link |
So imitation gives you a lot more signal.
link |
I think where imitation is lacking
link |
and needs some extra machinery is,
link |
in its normal format,
link |
it doesn't think about goals or objectives.
link |
And of course, there are versions of imitation learning
link |
inverse reinforcement learning type imitation learning,
link |
which also think about goals.
link |
I think then we're getting much closer.
link |
But I think it's very hard to imagine a
link |
fully reactive car generalizing well
link |
if it really doesn't have a notion of objectives,
link |
generalizing to the degree
link |
that you would want.
link |
You'd want more than just that reactivity
link |
that you get from just behavioral cloning
link |
slash supervised learning.
link |
So a lot of the work,
link |
whether it's self play or even imitation learning,
link |
would benefit significantly from simulation,
link |
from effective simulation.
link |
And you're doing a lot of stuff
link |
in the physical world and in simulation.
link |
Do you have hope for greater and greater
link |
power of simulation being boundless eventually
link |
to where most of what we need to operate
link |
in the physical world could be simulated
link |
to a degree that's directly transferable
link |
to the physical world?
link |
Or are we still very far away from that?
link |
So I think we could even rephrase that question a little bit.
link |
So, the power of simulation, right?
link |
As simulators get better and better,
link |
it of course becomes stronger,
link |
and we can learn more in simulation.
link |
But there's also another version
link |
which is where you say the simulator
link |
doesn't even have to be that precise.
link |
As long as it's somewhat representative
link |
and instead of trying to get one simulator
link |
that is sufficiently precise to learn in
link |
and transfer really well to the real world,
link |
I'm gonna build many simulators.
link |
Ensemble of simulators?
link |
Ensemble of simulators.
link |
Not any single one of them is sufficiently representative
link |
of the real world such that it would work
link |
if you train in there.
link |
But if you train in all of them,
link |
then there is something that's good in all of them.
link |
The real world will just be another one of them
link |
that's not identical to any one of them
link |
but just another one of them.
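This ensemble of simulators idea, often called domain randomization, can be sketched with a toy one parameter dynamics model (an illustrative assumption, not the lab's actual setup): train a single policy gain against many simulators that differ in a hidden dynamics multiplier, then evaluate on a "real world" that is just another draw from the same distribution.

```python
import random

def make_simulator(rng):
    # Each simulator has its own hidden dynamics multiplier.
    g = rng.uniform(0.8, 1.2)
    def step_error(k):
        # Error of a one-step controller: ideally k * g == 1.
        return (k * g - 1.0) ** 2
    return step_error, g

def train_across_ensemble(n_sims=20, epochs=500, lr=0.05, seed=0):
    rng = random.Random(seed)
    sims = [make_simulator(rng)[0] for _ in range(n_sims)]
    k = 0.0
    for _ in range(epochs):
        # Numerical gradient of the average error across all simulators.
        eps = 1e-4
        avg = lambda kk: sum(s(kk) for s in sims) / n_sims
        grad = (avg(k + eps) - avg(k - eps)) / (2 * eps)
        k -= lr * grad
    return k

k = train_across_ensemble()
# The "real world" is just another simulator from the same distribution:
real_error, _ = make_simulator(random.Random(123))
print(real_error(k) < 0.1)  # good in all of them means decent in this one too
```

No single simulator matches the real one, but the gain that works across all of them transfers reasonably, which is the point being made.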
link |
Another sample from the distribution of simulators.
link |
We do live in a simulation,
link |
so this is just one other one.
link |
I'm not sure about that, but yeah.
link |
It's definitely a very advanced simulator if it is.
link |
Yeah, it's a pretty good one.
link |
I've talked to Stuart Russell.
link |
It's something you think about a little bit too.
link |
Of course, you're really trying to build these systems,
link |
but do you think about the future of AI?
link |
A lot of people have concern about safety.
link |
How do you think about AI safety?
link |
As you build robots that are operating in the physical world,
link |
what is, yeah, how do you approach this problem
link |
in an engineering kind of way, in a systematic way?
link |
So when a robot is doing things,
link |
you kind of have a few notions of safety to worry about.
link |
One is that the robot is physically strong
link |
and of course could do a lot of damage.
link |
Same for cars, which we can think of as robots too.
link |
And this could be completely unintentional.
link |
So it could be not the kind of longterm AI safety concerns
link |
that, okay, AI is smarter than us and now what do we do?
link |
But it could be just very practical.
link |
Okay, this robot, if it makes a mistake,
link |
what are the results going to be?
link |
Of course, simulation comes in a lot there
link |
to test in simulation. It's a difficult question.
link |
And I'm always wondering, like,
link |
let's say you look at, let's go back to driving
link |
because a lot of people know driving well, of course.
link |
What do we do to test somebody for driving, right?
link |
Get a driver's license. What do they really do?
link |
I mean, you fill out some tests and then you drive.
link |
And I mean, it's suburban California.
link |
That driving test is just you drive around the block,
link |
pull over, you do a stop sign successfully,
link |
and then you pull over again and you're pretty much done.
link |
And you're like, okay, if a self driving car did that,
link |
would you trust it that it can drive?
link |
And I'd be like, no, that's not enough for me to trust it.
link |
But somehow for humans, we've figured out
link |
that somebody being able to do that is representative
link |
of them being able to do a lot of other things.
link |
And so I think somehow for humans,
link |
we figured out representative tests
link |
of what it means if you can do this, what you can really do.
link |
Of course, testing humans,
link |
humans don't wanna be tested at all times.
link |
Self driving cars or robots
link |
could be tested more often probably.
link |
You can have replicas that get tested
link |
that are known to be identical
link |
because they use the same neural net and so forth.
link |
But still, I feel like we don't have these kinds of unit tests
link |
or proper tests for robots.
link |
And I think there's something very interesting
link |
to be thought about there,
link |
especially as you update things.
link |
Your software improves,
link |
you have a better self driving car suite, you update it.
link |
How do you know it's indeed more capable on everything
link |
than what you had before,
link |
that you didn't have any bad things creep into it?
link |
So I think that's a very interesting direction of research
link |
that there is no real solution yet,
link |
except that somehow for humans we do.
link |
Because we say, okay, you have a driving test, you passed,
link |
you can go on the road now,
link |
and humans have accidents every like a million
link |
or 10 million miles, something pretty phenomenal
link |
compared to that short test that is being done.
link |
So let me ask, you've mentioned that Andrew Ng by example
link |
showed you the value of kindness.
link |
Do you think the space of policies,
link |
good policies for humans and for AI
link |
is populated by policies filled with kindness
link |
or ones that are the opposite, exploitation, even evil?
link |
So if you just look at the sea of policies
link |
we operate under as human beings,
link |
or if AI system had to operate in this real world,
link |
do you think it's really easy to find policies
link |
that are full of kindness,
link |
like we naturally fall into them?
link |
Or is it like a very hard optimization problem?
link |
I mean, there is kind of two optimizations
link |
happening for humans, right?
link |
So for humans, there's kind of the very long term
link |
optimization which evolution has done for us
link |
and we're kind of predisposed to like certain things.
link |
And that's in some sense what makes our learning easier
link |
because I mean, we know things like pain
link |
and hunger and thirst.
link |
And the fact that we know about those
link |
is not something that we were taught, that's kind of innate.
link |
When we're hungry, we're unhappy.
link |
When we're thirsty, we're unhappy.
link |
When we have pain, we're unhappy.
link |
And ultimately evolution built that into us
link |
to think about those things.
link |
And so I think there is a notion that
link |
it seems somehow humans evolved in general
link |
to prefer to get along in some ways,
link |
but at the same time also to be very territorial
link |
and kind of centric to their own tribe.
link |
Like it seems like that's the kind of space
link |
we converged onto.
link |
I mean, I'm not an expert in anthropology,
link |
but it seems like we're very kind of good
link |
within our own tribe, but need to be taught
link |
to be nice to other tribes.
link |
Well, if you look at Steven Pinker,
link |
he highlights this pretty nicely in
link |
Better Angels of Our Nature,
link |
where he talks about violence decreasing over time.
link |
So whatever tension, whatever teams we pick,
link |
it seems that the long arc of history
link |
goes towards us getting along more and more.
link |
So do you think that, do you think it's possible
link |
to teach RL based robots this kind of kindness,
link |
this kind of ability to interact with humans,
link |
this kind of policy, even to, let me ask a fun one.
link |
Do you think it's possible to teach RL based robot
link |
to love a human being and to inspire that human
link |
to love the robot back?
link |
So, like, an RL based algorithm that leads to a happy marriage.
link |
That's an interesting question.
link |
Maybe I'll answer it with another question, right?
link |
Because I mean, but I'll come back to it.
link |
So another question you can have is okay.
link |
I mean, how much happiness do some people get
link |
from interacting with just a really nice dog?
link |
Like, I mean, dogs, you come home,
link |
that's what dogs do.
link |
They greet you, they're excited,
link |
makes you happy when you come home to your dog.
link |
You're just like, okay, this is exciting.
link |
They're always happy when I'm here.
link |
And if they don't greet you, cause maybe whatever,
link |
your partner took them on a trip or something,
link |
you might not be nearly as happy when you get home, right?
link |
And so the kind of, it seems like the level of reasoning
link |
a dog has is pretty sophisticated,
link |
but then it's still not yet at the level of human reasoning.
link |
And so it seems like we don't even need to achieve
link |
human level reasoning to get like very strong affection.
link |
And so my thinking is why not, right?
link |
Why couldn't, with an AI, couldn't we achieve
link |
the kind of level of affection that humans feel
link |
among each other or with friendly animals and so forth?
link |
So question, is it a good thing for us or not?
link |
That's another thing, right?
link |
Because I mean, but I don't see why not.
link |
Why not, yeah, so Elon Musk says love is the answer.
link |
Maybe he should say love is the objective function
link |
and then RL is the answer, right?
link |
Oh, Pieter, thank you so much.
link |
I don't want to take up more of your time.
link |
Thank you so much for talking today.
link |
Well, thanks for coming by.
link |
Great to have you visit.