
Pieter Abbeel: Deep Reinforcement Learning | Lex Fridman Podcast #10



link |
00:00:00.000
The following is a conversation with Pieter Abbeel.
link |
00:00:03.120
He's a professor at UC Berkeley
link |
00:00:04.840
and the director of the Berkeley Robotics Learning Lab.
link |
00:00:07.840
He's one of the top researchers in the world
link |
00:00:10.080
working on how we make robots understand
link |
00:00:13.080
and interact with the world around them,
link |
00:00:15.360
especially using imitation and deep reinforcement learning.
link |
00:00:19.720
This conversation is part of the MIT course
link |
00:00:22.360
on Artificial General Intelligence
link |
00:00:24.080
and the Artificial Intelligence podcast.
link |
00:00:26.400
If you enjoy it, please subscribe on YouTube,
link |
00:00:29.060
iTunes, or your podcast provider of choice,
link |
00:00:31.680
or simply connect with me on Twitter at Lex Fridman,
link |
00:00:34.840
spelled F R I D.
link |
00:00:36.920
And now, here's my conversation with Pieter Abbeel.
link |
00:00:41.400
You've mentioned that if there was one person
link |
00:00:44.120
you could meet, it would be Roger Federer.
link |
00:00:46.200
So let me ask, when do you think we'll have a robot
link |
00:00:50.120
that fully autonomously can beat Roger Federer at tennis?
link |
00:00:54.760
A Roger Federer level player at tennis?
link |
00:00:57.520
Well, first, if you can make it happen for me to meet Roger,
link |
00:01:00.720
let me know.
link |
00:01:01.560
In terms of getting a robot to beat him at tennis,
link |
00:01:07.440
it's kind of an interesting question
link |
00:01:08.920
because for a lot of the challenges we think about in AI,
link |
00:01:14.560
the software is really the missing piece,
link |
00:01:16.760
but for something like this,
link |
00:01:18.620
the hardware is nowhere near either.
link |
00:01:22.720
To really have a robot that can physically run around,
link |
00:01:26.560
the Boston Dynamics robots are starting to get there,
link |
00:01:28.560
but still not really human level ability to run around
link |
00:01:33.040
and then swing a racket.
link |
00:01:36.920
So you think that's a hardware problem?
link |
00:01:38.400
I don't think it's a hardware problem only.
link |
00:01:39.960
I think it's a hardware and a software problem.
link |
00:01:41.640
I think it's both.
link |
00:01:43.160
And I think they'll have independent progress.
link |
00:01:45.680
So I'd say the hardware maybe in 10, 15 years.
link |
00:01:51.680
On clay, not grass.
link |
00:01:52.920
I mean, grass is probably harder.
link |
00:01:53.760
With the sliding?
link |
00:01:54.600
Yeah.
link |
00:01:55.420
With the clay, I'm not sure what's harder, grass or clay.
link |
00:01:58.920
The clay involves sliding,
link |
00:02:01.600
which might be harder to master actually, yeah.
link |
00:02:06.040
But you're not limited to a bipedal.
link |
00:02:08.940
I mean, I'm sure there's no...
link |
00:02:09.780
Well, if we can build a machine,
link |
00:02:11.480
it's a whole different question, of course.
link |
00:02:13.200
If you can say, okay, this robot can be on wheels,
link |
00:02:16.300
it can move around on wheels and can be designed differently,
link |
00:02:19.400
then I think that can be done sooner probably
link |
00:02:23.040
than a full humanoid type of setup.
link |
00:02:26.280
What do you think about swinging a racket?
link |
00:02:27.760
So you've worked on basic manipulation.
link |
00:02:31.240
How hard do you think the task of swinging a racket is,
link |
00:02:34.240
being able to hit a nice backhand or a forehand?
link |
00:02:39.480
Let's say we just set up stationary,
link |
00:02:42.720
a nice robot arm, let's say, a standard industrial arm,
link |
00:02:46.580
and it can watch the ball come and then swing the racket.
link |
00:02:50.700
It's a good question.
link |
00:02:51.540
I'm not sure it would be super hard to do.
link |
00:02:56.200
I mean, I'm sure it would require a lot,
link |
00:02:58.240
if we do it with reinforcement learning,
link |
00:03:00.000
it would require a lot of trial and error.
link |
00:03:01.520
It's not gonna swing it right the first time around,
link |
00:03:03.380
but yeah, I don't see why it couldn't
link |
00:03:07.920
swing it the right way.
link |
00:03:09.480
I think it's learnable.
link |
00:03:10.340
I think if you set up a ball machine,
link |
00:03:12.160
let's say on one side,
link |
00:03:13.800
and then a robot with a tennis racket on the other side,
link |
00:03:17.780
I think it's learnable
link |
00:03:20.280
and maybe with a little bit of pre-training in simulation.
link |
00:03:22.940
Yeah, I think that's feasible.
link |
00:03:25.560
I think swinging the racket is feasible.
link |
00:03:27.280
It'd be very interesting to see how much precision
link |
00:03:28.900
it can get.
link |
00:03:31.840
Cause I mean, that's where, I mean,
link |
00:03:35.400
some of the human players can hit it on the lines,
link |
00:03:37.920
which is very high precision.
link |
00:03:39.240
With spin, the spin is an interesting question,
link |
00:03:42.840
whether RL can learn to put a spin on the ball.
link |
00:03:45.760
Well, you got me interested.
link |
00:03:46.880
Maybe someday we'll set this up.
link |
00:03:48.400
Sure, you got me intrigued.
link |
00:03:51.120
Your answer is basically, okay,
link |
00:03:52.680
for this problem, it sounds fascinating,
link |
00:03:54.160
but for the general problem of a tennis player,
link |
00:03:56.480
we might be a little bit farther away.
link |
00:03:58.560
What's the most impressive thing you've seen a robot do
link |
00:04:01.260
in the physical world?
link |
00:04:04.140
So physically for me,
link |
00:04:06.480
it's the Boston Dynamics videos.
link |
00:04:10.920
They always just bring it home, and I'm just super impressed.
link |
00:04:15.680
Recently, the robot running up the stairs,
link |
00:04:17.700
doing the parkour type thing.
link |
00:04:19.440
I mean, yes, we don't know what's underneath.
link |
00:04:22.280
They don't really write a lot of detail,
link |
00:04:23.940
but even if it's hard coded underneath,
link |
00:04:27.040
which it might or might not be, just the physical abilities
link |
00:04:29.800
of doing that parkour, that's very impressive.
link |
00:04:32.680
So have you met Spot Mini
link |
00:04:34.960
or any of those robots in person?
link |
00:04:36.840
I met Spot Mini last year in April at the MARS event
link |
00:04:41.040
that Jeff Bezos organizes.
link |
00:04:42.960
They brought it out there
link |
00:04:44.160
and it was nicely following around Jeff.
link |
00:04:47.760
When Jeff left the room, they had it follow him along,
link |
00:04:50.640
which is pretty impressive.
link |
00:04:52.160
So I think there's some confidence to know
link |
00:04:55.680
that there's no learning going on in those robots.
link |
00:04:58.040
The psychology of it, so while knowing that,
link |
00:05:00.160
while knowing there's not,
link |
00:05:01.140
if there's any learning going on, it's very limited.
link |
00:05:04.040
I met Spot Mini earlier this year
link |
00:05:06.840
and knowing everything that's going on,
link |
00:05:09.520
having one on one interaction,
link |
00:05:11.000
so I got to spend some time alone and there's immediately
link |
00:05:15.960
a deep connection on the psychological level.
link |
00:05:18.640
Even though you know the fundamentals, how it works,
link |
00:05:21.000
there's something magical.
link |
00:05:23.240
So do you think about the psychology of interacting
link |
00:05:27.560
with robots in the physical world?
link |
00:05:29.080
Even you just showed me the PR2, the robot,
link |
00:05:33.720
and there was a little bit something like a face,
link |
00:05:36.860
had a little bit something like a face.
link |
00:05:38.480
There's something that immediately draws you to it.
link |
00:05:40.600
Do you think about that aspect of the robotics problem?
link |
00:05:45.160
Well, it's very hard with BRETT here.
link |
00:05:48.400
We gave him a name, BRETT, the Berkeley Robot
link |
00:05:50.680
for the Elimination of Tedious Tasks.
link |
00:05:52.200
It's very hard to not think of the robot as a person
link |
00:05:56.560
and it seems like everybody calls him a he
link |
00:05:58.880
for whatever reason, but that also makes it more a person
link |
00:06:01.160
than if it was a it, and it seems pretty natural
link |
00:06:06.360
to think of it that way.
link |
00:06:07.320
This past weekend really struck me.
link |
00:06:08.680
I've seen Pepper many times on videos,
link |
00:06:13.360
but then I was at an event organized by,
link |
00:06:15.360
this was by Fidelity, and they had scripted Pepper
link |
00:06:18.880
to help moderate some sessions,
link |
00:06:22.800
and they had scripted Pepper
link |
00:06:23.920
to have the personality of a child a little bit,
link |
00:06:26.520
and it was very hard to not think of it
link |
00:06:28.600
as its own person in some sense
link |
00:06:31.920
because it would just jump in the conversation,
link |
00:06:34.560
making it very interactive.
link |
00:06:35.880
The moderator would be saying something, and Pepper would just jump in,
link |
00:06:37.960
hold on, how about me?
link |
00:06:40.120
Can I participate in this too?
link |
00:06:41.360
And you're just like, okay, this is like a person,
link |
00:06:43.720
and that was 100% scripted, and even then it was hard
link |
00:06:46.640
not to have that sense of somehow there is something there.
link |
00:06:50.640
So as we have robots interact in this physical world,
link |
00:06:54.440
is that a signal that could be used
link |
00:06:56.120
in reinforcement learning?
link |
00:06:57.440
You've worked a little bit in this direction,
link |
00:07:00.240
but do you think that psychology can be somehow pulled in?
link |
00:07:04.360
Yes, that's a question I would say
link |
00:07:07.160
a lot of people ask, and I think part of why they ask it
link |
00:07:11.320
is they're thinking about how unique
link |
00:07:14.960
are we really still as people?
link |
00:07:16.680
Like after they see some results,
link |
00:07:18.120
they see a computer play Go, they see a computer do this,
link |
00:07:21.440
that, they're like, okay, but can it really have emotion?
link |
00:07:23.760
Can it really interact with us in that way?
link |
00:07:26.760
And then once you're around robots,
link |
00:07:29.100
you already start feeling it,
link |
00:07:30.120
and I think that, kind of maybe methodologically,
link |
00:07:33.180
the way that I think of it is
link |
00:07:34.720
if you run something like reinforcement learning,
link |
00:07:37.640
it's about optimizing some objective,
link |
00:07:39.920
and there's no reason that the objective
link |
00:07:45.360
couldn't be tied into how much does a person like
link |
00:07:49.380
interacting with this system,
link |
00:07:50.720
and why could not the reinforcement learning system
link |
00:07:53.220
optimize for the robot being fun to be around?
link |
00:07:56.720
And why wouldn't it then naturally become
link |
00:07:58.940
more and more interactive and more and more
link |
00:08:01.400
maybe like a person or like a pet?
link |
00:08:03.200
I don't know what it would exactly be,
link |
00:08:04.600
but more and more have those features
link |
00:08:06.640
and acquire them automatically.
link |
00:08:08.320
As long as you can formalize an objective
link |
00:08:10.880
of what it means to like something,
link |
00:08:13.440
what, how you exhibit, what's the ground truth?
link |
00:08:16.800
How do you get the reward from human?
link |
00:08:19.560
Because you have to somehow collect
link |
00:08:20.760
that information from you, the human.
link |
00:08:22.400
But you're saying if you can formulate it as an objective,
link |
00:08:26.280
it can be learned.
link |
00:08:27.240
There's no reason it couldn't emerge through learning,
link |
00:08:29.380
and maybe one way to formulate it as an objective,
link |
00:08:31.480
you wouldn't have to necessarily score it explicitly,
link |
00:08:33.800
so standard rewards are numbers,
link |
00:08:36.560
and numbers are hard to come by.
link |
00:08:38.740
This is a 1.5 or a 1.7 on some scale.
link |
00:08:41.320
It's very hard to do for a person,
link |
00:08:43.060
but much easier is for a person to say,
link |
00:08:45.420
okay, what you did the last five minutes
link |
00:08:47.800
was much nicer than what you did the previous five minutes,
link |
00:08:51.160
and that now gives a comparison.
link |
00:08:53.080
And in fact, there have been some results on that.
link |
00:08:55.320
For example, Paul Christiano and collaborators at OpenAI
link |
00:08:57.880
had the Hopper, the MuJoCo Hopper, a one legged robot,
link |
00:09:02.040
going through backflips purely from feedback.
link |
00:09:05.600
I like this better than that.
link |
00:09:06.920
That's kind of equally good,
link |
00:09:08.640
and after a bunch of interactions,
link |
00:09:10.920
it figured out what the person was asking for,
link |
00:09:13.080
namely a backflip.
link |
00:09:14.400
And so I think the same thing.
link |
00:09:15.920
Oh, it wasn't trying to do a backflip.
link |
00:09:18.640
It was just getting a comparison score
link |
00:09:20.820
from the person based on?
link |
00:09:23.320
The person had in mind, in their own mind,
link |
00:09:26.080
I want it to do a backflip,
link |
00:09:27.400
but the robot didn't know what it was supposed to be doing.
link |
00:09:30.760
It just knew that sometimes the person said,
link |
00:09:32.800
this is better, this is worse,
link |
00:09:34.560
and then the robot figured out
link |
00:09:36.020
what the person was actually after was a backflip.
link |
00:09:38.760
And I'd imagine the same would be true
link |
00:09:40.040
for things like more interactive robots,
link |
00:09:43.120
that the robot would figure out over time,
link |
00:09:45.100
oh, this kind of thing apparently is appreciated more
link |
00:09:48.160
than this other kind of thing.
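(A minimal sketch of the preference-based reward learning idea described above, in the spirit of the Christiano et al. result: fit a reward model so that segments the person said were better score higher. The linear reward model, toy data, and all names here are illustrative assumptions, not the actual system.)

```python
# Minimal sketch: learn a reward model from pairwise "this was better" feedback.
# All names and the linear reward form are illustrative assumptions.
import numpy as np

def segment_return(w, features):
    """Predicted return of a trajectory segment = sum of per-step linear rewards."""
    return np.sum(features @ w)

def train_reward_from_preferences(pairs, labels, dim, lr=0.5, steps=300):
    """pairs: list of (features_a, features_b), each an array of shape (T, dim).
    labels: 1.0 if the human preferred segment a, 0.0 if segment b."""
    w = np.zeros(dim)
    for _ in range(steps):
        grad = np.zeros(dim)
        for (fa, fb), y in zip(pairs, labels):
            # Bradley-Terry model: P(a preferred over b) = sigmoid(R(a) - R(b))
            diff = segment_return(w, fa) - segment_return(w, fb)
            p_a = 1.0 / (1.0 + np.exp(-diff))
            # Gradient of the log-likelihood of the observed preference
            grad += (y - p_a) * (fa.sum(axis=0) - fb.sum(axis=0))
        w += lr * grad / len(pairs)
    return w

# Toy usage: the "true" preference favors the first feature dimension.
rng = np.random.default_rng(0)
pairs = [(rng.normal(size=(10, 3)), rng.normal(size=(10, 3))) for _ in range(100)]
labels = [1.0 if fa[:, 0].sum() > fb[:, 0].sum() else 0.0 for fa, fb in pairs]
w = train_reward_from_preferences(pairs, labels, dim=3)
print("learned reward weights:", w)  # first component should dominate
```

Once a reward model like this is fit, an ordinary RL algorithm can optimize against it, which is how a backflip can emerge without anyone ever writing a backflip reward by hand.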
link |
00:09:50.200
So when I first picked up Sutton's,
link |
00:09:54.000
Richard Sutton's reinforcement learning book,
link |
00:09:56.200
before sort of this deep learning,
link |
00:10:01.280
before the reemergence of neural networks
link |
00:10:03.360
as a powerful mechanism for machine learning,
link |
00:10:05.640
RL seemed to me like magic.
link |
00:10:08.320
It was beautiful.
link |
00:10:10.280
So that seemed like what intelligence is,
link |
00:10:13.560
RL reinforcement learning.
link |
00:10:15.520
So how do you think we can possibly learn anything
link |
00:10:20.320
about the world when the reward for the actions
link |
00:10:22.980
is delayed, is so sparse?
link |
00:10:25.840
Like where is, why do you think RL works?
link |
00:10:30.560
Why do you think you can learn anything
link |
00:10:32.800
under such sparse rewards,
link |
00:10:35.040
whether it's regular reinforcement learning
link |
00:10:36.880
or deep reinforcement learning?
link |
00:10:38.640
What's your intuition?
link |
00:10:40.580
The counterpart of that is why is RL,
link |
00:10:44.480
why does it need so many samples,
link |
00:10:47.240
so many experiences to learn from?
link |
00:10:49.640
Because really what's happening is
link |
00:10:50.760
when you have a sparse reward,
link |
00:10:53.040
you do something maybe for like, I don't know,
link |
00:10:55.200
you take 100 actions and then you get a reward.
link |
00:10:57.440
And maybe you get like a score of three.
link |
00:10:59.760
And I'm like okay, three, not sure what that means.
link |
00:11:03.000
You go again and now you get two.
link |
00:11:05.040
And now you know that that sequence of 100 actions
link |
00:11:07.160
that you did the second time around
link |
00:11:08.320
somehow was worse than the sequence of 100 actions
link |
00:11:10.600
you did the first time around.
link |
00:11:11.920
But it's tough to now know which of those actions
link |
00:11:14.440
were better or worse.
link |
00:11:15.280
Some might have been good and bad in either one.
link |
00:11:17.480
And so that's why it needs so many experiences.
link |
00:11:19.840
But once you have enough experiences,
link |
00:11:21.280
effectively RL is teasing that apart.
link |
00:11:23.480
It's trying to say okay, what is consistently there
link |
00:11:26.640
when you get a higher reward
link |
00:11:27.840
and what's consistently there when you get a lower reward?
link |
00:11:30.000
And then kind of the magic of sometimes
link |
00:11:32.480
the policy gradient update is to say
link |
00:11:34.720
now let's update the neural network
link |
00:11:37.000
to make the actions that were kind of present
link |
00:11:39.160
when things are good more likely
link |
00:11:41.460
and make the actions that are present
link |
00:11:43.080
when things are not as good less likely.
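(A tiny sketch of the policy gradient update just described: actions that co-occur with higher-than-average return get pushed up, the rest get pushed down. The toy bandit environment and all names are illustrative assumptions.)

```python
# Minimal REINFORCE sketch: increase the log-probability of actions that
# co-occur with higher-than-average reward, decrease the rest.
# The toy two-armed bandit and all names are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(2)                # policy parameters for 2 discrete actions
true_means = np.array([1.0, 3.0])   # action 1 is better on average

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

baseline = 0.0
for step in range(2000):
    probs = softmax(logits)
    action = rng.choice(2, p=probs)
    reward = rng.normal(true_means[action], 1.0)   # noisy reward signal
    baseline = 0.99 * baseline + 0.01 * reward     # running-average baseline
    advantage = reward - baseline
    # grad of log pi(a) w.r.t. logits for a softmax policy: one-hot(a) - probs
    grad_logp = -probs
    grad_logp[action] += 1.0
    logits += 0.05 * advantage * grad_logp         # policy gradient ascent

print("final action probabilities:", softmax(logits))  # should favor action 1
```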
link |
00:11:45.140
So that is the counterpoint,
link |
00:11:47.000
but it seems like you would need to run it
link |
00:11:49.540
a lot more than you do.
link |
00:11:50.920
Even though right now people could say
link |
00:11:52.760
that RL is very inefficient,
link |
00:11:54.480
but it seems to be way more efficient
link |
00:11:56.320
than one would imagine on paper.
link |
00:11:58.880
That the simple updates to the policy,
link |
00:12:02.040
the policy gradient, that somehow you can learn,
link |
00:12:04.960
exactly you just said, what are the common actions
link |
00:12:07.740
that seem to produce some good results?
link |
00:12:09.820
That that somehow can learn anything.
link |
00:12:12.800
It seems counterintuitive at least.
link |
00:12:15.600
Is there some intuition behind it?
link |
00:12:16.920
Yeah, so I think there's a few ways to think about this.
link |
00:12:21.920
The way I tend to think about it mostly originally,
link |
00:12:26.440
so when we started working on deep reinforcement learning
link |
00:12:29.080
here at Berkeley, which was maybe 2011, 12, 13,
link |
00:12:32.760
around that time, John Schulman was a PhD student
link |
00:12:36.160
initially kind of driving it forward here.
link |
00:12:39.520
And the way we thought about it at the time was
link |
00:12:44.080
if you think about rectified linear units
link |
00:12:47.000
or kind of rectifier type neural networks,
link |
00:12:50.240
what do you get?
link |
00:12:51.080
You get something that's piecewise linear feedback control.
link |
00:12:55.080
And if you look at the literature,
link |
00:12:57.120
linear feedback control is extremely successful,
link |
00:12:59.360
can solve many, many problems surprisingly well.
link |
00:13:03.720
I remember, for example, when we did helicopter flight,
link |
00:13:05.700
if you're in a stationary flight regime,
link |
00:13:07.320
not a non stationary, but a stationary flight regime
link |
00:13:10.440
like hover, you can use linear feedback control
link |
00:13:12.520
to stabilize a helicopter, very complex dynamical system,
link |
00:13:15.580
but the controller is relatively simple.
link |
00:13:18.480
And so I think that's a big part of it is that
link |
00:13:20.660
if you do feedback control, even though the system
link |
00:13:23.220
you control can be very, very complex,
link |
00:13:25.000
often relatively simple control architectures
link |
00:13:28.760
can already do a lot.
link |
00:13:30.560
But then also just linear is not good enough.
link |
00:13:32.600
And so one way you can think of these neural networks
link |
00:13:35.120
is that sometimes they tile the space,
link |
00:13:37.120
which people were already trying to do more by hand
link |
00:13:39.480
or with finite state machines,
link |
00:13:41.000
say this linear controller here,
link |
00:13:42.520
this linear controller here.
link |
00:13:43.840
Neural network learns to tile the space
link |
00:13:45.640
and say linear controller here,
link |
00:13:46.600
another linear controller here,
link |
00:13:48.320
but it's more subtle than that.
link |
00:13:50.080
And so it's benefiting from this linear control aspect,
link |
00:13:52.000
it's benefiting from the tiling,
link |
00:13:53.600
but it's somehow tiling it one dimension at a time.
link |
00:13:57.440
Because if let's say you have a two layer network,
link |
00:13:59.440
if in that hidden layer, you make a transition
link |
00:14:03.360
from active to inactive or the other way around,
link |
00:14:06.560
that is essentially one axis, but not axis aligned,
link |
00:14:09.520
but one direction that you change.
link |
00:14:12.360
And so you have this kind of very gradual tiling
link |
00:14:14.780
of the space where you have a lot of sharing
link |
00:14:16.800
between the linear controllers that tile the space.
link |
00:14:19.560
And that was always my intuition as to why
link |
00:14:21.720
to expect that this might work pretty well.
link |
00:14:24.820
It's essentially leveraging the fact
link |
00:14:26.160
that linear feedback control is so good,
link |
00:14:28.560
but of course not enough.
link |
00:14:29.880
And this is a gradual tiling of the space
link |
00:14:31.800
with linear feedback controls
link |
00:14:33.520
that share a lot of expertise across them.
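(A small sketch of that intuition: a ReLU policy network is a piecewise linear feedback controller, and within one activation pattern it is exactly a linear gain on the state. Network sizes and random weights are illustrative assumptions.)

```python
# Sketch of the intuition: a ReLU policy network is piecewise linear feedback
# control. Within a fixed activation pattern, action = (effective gain) @ state + bias.
# The sizes and random weights are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
state_dim, hidden, action_dim = 4, 8, 2
W1, b1 = rng.normal(size=(hidden, state_dim)), rng.normal(size=hidden)
W2, b2 = rng.normal(size=(action_dim, hidden)), rng.normal(size=action_dim)

def policy(x):
    h = W1 @ x + b1
    mask = (h > 0).astype(float)       # which "tile" of state space we are in
    return W2 @ (mask * h) + b2, mask

def local_linear_gain(mask):
    """Effective linear feedback gain inside the region with this activation pattern."""
    return W2 @ (mask[:, None] * W1)

x = rng.normal(size=state_dim)
action, mask = policy(x)
K = local_linear_gain(mask)
# Inside this region the network is exactly a linear controller:
assert np.allclose(action, K @ x + W2 @ (mask * b1) + b2)
print("local feedback gain K:\n", K)
```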
link |
00:14:36.620
So that's really nice intuition,
link |
00:14:39.040
but do you think that scales to the more
link |
00:14:41.520
and more general problems of when you start going up
link |
00:14:44.720
the number of dimensions when you start
link |
00:14:49.480
going down in terms of how often
link |
00:14:52.760
you get a clean reward signal?
link |
00:14:55.400
Does that intuition carry forward to those crazier,
link |
00:14:58.800
weirder worlds that we think of as the real world?
link |
00:15:03.360
So I think where things get really tricky
link |
00:15:08.040
in the real world compared to the things
link |
00:15:09.760
we've looked at so far with great success
link |
00:15:11.920
in reinforcement learning is the time scales,
link |
00:15:17.320
which takes us to an extreme.
link |
00:15:18.960
So when you think about the real world,
link |
00:15:21.800
I mean, I don't know, maybe some student
link |
00:15:24.320
decided to do a PhD here, right?
link |
00:15:26.920
Okay, that's a decision.
link |
00:15:28.760
That's a very high level decision.
link |
00:15:30.840
But if you think about their lives,
link |
00:15:32.680
I mean, any person's life,
link |
00:15:34.080
it's a sequence of muscle fiber contractions
link |
00:15:37.440
and relaxations, and that's how you interact with the world.
link |
00:15:40.360
And that's a very high frequency control thing,
link |
00:15:42.800
but it's ultimately what you do
link |
00:15:44.640
and how you affect the world,
link |
00:15:46.600
until I guess we have brain readings
link |
00:15:48.320
and you can maybe do it slightly differently.
link |
00:15:49.800
But typically that's how you affect the world.
link |
00:15:52.600
And the decision of doing a PhD is so abstract
link |
00:15:56.360
relative to what you're actually doing in the world.
link |
00:15:59.320
And I think that's where credit assignment
link |
00:16:01.120
becomes just completely beyond
link |
00:16:04.800
what any current RL algorithm can do.
link |
00:16:06.760
And we need hierarchical reasoning
link |
00:16:09.000
at a level that is just not available at all yet.
link |
00:16:12.520
Where do you think we can pick up hierarchical reasoning?
link |
00:16:14.920
By which mechanisms?
link |
00:16:16.960
Yeah, so maybe let me highlight
link |
00:16:18.680
what I think the limitations are
link |
00:16:20.640
of what already was done 20, 30 years ago.
link |
00:16:26.080
In fact, you'll find reasoning systems
link |
00:16:27.720
that reason over relatively long horizons,
link |
00:16:30.960
but the problem is that they were not grounded
link |
00:16:32.800
in the real world.
link |
00:16:34.200
So people would have to hand design
link |
00:16:39.160
some kind of logical, dynamical descriptions of the world
link |
00:16:43.920
and that didn't tie into perception.
link |
00:16:46.360
And so it didn't tie into real objects and so forth.
link |
00:16:49.280
And so that was a big gap.
link |
00:16:51.120
Now with deep learning, we start having the ability
link |
00:16:53.960
to really see with sensors, process that
link |
00:16:59.560
and understand what's in the world.
link |
00:17:01.440
And so it's a good time to try
link |
00:17:02.840
to bring these things together.
link |
00:17:04.960
I see a few ways of getting there.
link |
00:17:06.480
One way to get there would be to say
link |
00:17:08.160
deep learning can get bolted on somehow
link |
00:17:10.120
to some of these more traditional approaches.
link |
00:17:12.280
Now bolted on would probably mean
link |
00:17:14.120
you need to do some kind of end to end training
link |
00:17:16.320
where you say my deep learning processing
link |
00:17:18.600
somehow leads to a representation
link |
00:17:20.840
that in term uses some kind of traditional
link |
00:17:24.640
underlying dynamical systems that can be used for planning.
link |
00:17:29.840
And that's, for example, the direction Aviv Tamar
link |
00:17:32.280
and Thanard Kurutach here have been pushing
link |
00:17:34.080
with Causal InfoGAN, and of course other people too.
link |
00:17:36.720
That's one way.
link |
00:17:38.200
Can we somehow force it into the form factor
link |
00:17:41.080
that is amenable to reasoning?
link |
00:17:43.760
Another direction we've been thinking about
link |
00:17:46.520
for a long time and didn't make any progress on
link |
00:17:50.200
was more information theoretic approaches.
link |
00:17:53.640
So the idea there was that what it means
link |
00:17:56.560
to take high level action is to take
link |
00:17:59.960
and choose a latent variable now
link |
00:18:02.560
that tells you a lot about what's gonna be the case
link |
00:18:04.640
in the future.
link |
00:18:05.480
Because that's what it means to take a high level action.
link |
00:18:09.400
I say okay, I decide I'm gonna navigate
link |
00:18:13.040
to the gas station because I need to get gas for my car.
link |
00:18:15.480
Well, that'll now take five minutes to get there.
link |
00:18:17.880
But the fact that I get there,
link |
00:18:19.280
I could already tell that from the high level action
link |
00:18:22.320
I took much earlier.
link |
00:18:24.480
That we had a very hard time getting success with.
link |
00:18:28.440
Not saying it's a dead end necessarily,
link |
00:18:30.640
but we had a lot of trouble getting that to work.
link |
00:18:33.120
And then we started revisiting the notion
link |
00:18:34.720
of what are we really trying to achieve?
link |
00:18:37.800
What we're trying to achieve is not necessarily hierarchy
link |
00:18:40.680
per se, but you could think about
link |
00:18:41.720
what does hierarchy give us?
link |
00:18:44.280
What we hope it would give us is better credit assignment.
link |
00:18:49.120
What is better credit assignment?
link |
00:18:51.240
It's giving us, it gives us faster learning, right?
link |
00:18:55.760
And so faster learning is ultimately maybe what we're after.
link |
00:18:59.800
And so that's where we ended up with the RL squared paper
link |
00:19:03.400
on learning to reinforcement learn,
link |
00:19:06.040
which at the time Rocky Duan led.
link |
00:19:08.840
And that's exactly the meta learning approach
link |
00:19:11.080
where you say, okay, we don't know how to design hierarchy.
link |
00:19:14.240
We know what we want to get from it.
link |
00:19:15.760
Let's just end-to-end optimize for what we want to get
link |
00:19:18.240
from it and see if it might emerge.
link |
00:19:20.200
And we saw things emerge.
link |
00:19:21.240
The maze navigation had consistent motion down hallways,
link |
00:19:26.120
which is what you want.
link |
00:19:27.160
A hierarchical control should say,
link |
00:19:28.320
I want to go down this hallway.
link |
00:19:29.720
And then when there is an option to take a turn,
link |
00:19:31.640
I can decide whether to take a turn or not and repeat.
link |
00:19:33.840
Even had the notion of where have you been before or not
link |
00:19:37.280
to not revisit places you've been before.
link |
00:19:39.960
It still didn't scale yet
link |
00:19:42.520
to the real world kind of scenarios I think you had in mind,
link |
00:19:46.000
but it was some sign of life
link |
00:19:47.200
that maybe you can meta learn these hierarchical concepts.
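(A skeleton of the RL-squared idea sketched above: a recurrent policy receives the observation, previous action, and previous reward, and keeps its hidden state across episodes of the same task, so fast adaptation can emerge inside the recurrence. Only the rollout interface is shown, on a toy bandit distribution; sizes, names, and the outer meta-training loop are assumptions.)

```python
# Skeleton of the RL^2 idea: a recurrent policy receives (obs, prev action,
# prev reward) and keeps its hidden state across episodes of the same task,
# so "learning" the task can happen inside the recurrent dynamics.
# Sizes, names, and the missing outer meta-training loop are assumptions.
import torch

obs_dim, n_actions, hidden_dim = 1, 2, 32
cell = torch.nn.GRUCell(obs_dim + n_actions + 1, hidden_dim)
head = torch.nn.Linear(hidden_dim, n_actions)

def rollout_task(bandit_probs, episodes=5, steps=10):
    h = torch.zeros(1, hidden_dim)               # hidden state persists across episodes
    prev = torch.zeros(1, obs_dim + n_actions + 1)
    log_probs, rewards = [], []
    for _ in range(episodes):
        for _ in range(steps):
            h = cell(prev, h)
            dist = torch.distributions.Categorical(logits=head(h))
            a = dist.sample()
            r = torch.bernoulli(bandit_probs[a])  # reward from the sampled task
            log_probs.append(dist.log_prob(a))
            rewards.append(r)
            one_hot = torch.nn.functional.one_hot(a, n_actions).float()
            prev = torch.cat([torch.zeros(1, obs_dim), one_hot, r.view(1, 1)], dim=1)
    return torch.stack(log_probs), torch.stack(rewards)

# One sampled task from the task distribution; an outer loop would optimize
# the policy gradient of total return across many such sampled tasks.
task = torch.tensor([0.2, 0.8])
logp, rew = rollout_task(task)
print("total reward on this task:", rew.sum().item())
```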
link |
00:19:51.160
I mean, it seems like through these meta learning concepts,
link |
00:19:56.160
you get at what I think is one of the hardest
link |
00:19:59.800
and most important problems of AI,
link |
00:20:02.360
which is transfer learning.
link |
00:20:04.040
So it's generalization.
link |
00:20:06.280
How far along this journey
link |
00:20:08.480
towards building general systems are we?
link |
00:20:11.160
Being able to do transfer learning well.
link |
00:20:13.600
So there's some signs that you can generalize a little bit,
link |
00:20:17.520
but do you think we're on the right path
link |
00:20:19.600
or it's totally different breakthroughs are needed
link |
00:20:23.760
to be able to transfer knowledge
link |
00:20:26.800
between different learned models?
link |
00:20:31.240
Yeah, I'm pretty torn on this in that
link |
00:20:33.840
I think there are some very impressive.
link |
00:20:35.560
Well, there's just some very impressive results already.
link |
00:20:40.520
I mean, I would say when,
link |
00:20:44.040
even with the initial kind of big breakthrough in 2012
link |
00:20:47.240
with AlexNet, the initial thing is okay, great.
link |
00:20:52.160
This does better on ImageNet, hence image recognition.
link |
00:20:55.680
But then immediately thereafter,
link |
00:20:57.840
there was of course the notion that,
link |
00:21:00.520
wow, what was learned on ImageNet
link |
00:21:03.320
and you now wanna solve a new task,
link |
00:21:05.000
you can fine tune AlexNet for new tasks.
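(A sketch of that reuse in PyTorch: load an ImageNet-pretrained AlexNet, swap the final classifier layer, and fine-tune on the new task. The dataset stand-in, class count, and hyperparameters are placeholder assumptions.)

```python
# Sketch of the reuse described here: take an ImageNet-pretrained AlexNet and
# fine-tune it for a new task by replacing the final classifier layer.
# The dummy data, class count, and hyperparameters are placeholder assumptions.
import torch
import torchvision.models as models

num_new_classes = 10
net = models.alexnet(pretrained=True)   # downloads ImageNet-trained weights

# Optionally freeze the convolutional features and only train the new head.
for p in net.features.parameters():
    p.requires_grad = False

# Replace the last classification layer for the new task.
net.classifier[6] = torch.nn.Linear(net.classifier[6].in_features, num_new_classes)

optimizer = torch.optim.SGD(
    filter(lambda p: p.requires_grad, net.parameters()), lr=1e-3, momentum=0.9)
loss_fn = torch.nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch standing in for the new dataset.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, num_new_classes, (8,))
optimizer.zero_grad()
loss = loss_fn(net(images), labels)
loss.backward()
optimizer.step()
print("fine-tuning step loss:", loss.item())
```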
link |
00:21:09.080
And that was often found to be the even bigger deal
link |
00:21:12.040
that you learn something that was reusable,
link |
00:21:14.320
which was not often the case before.
link |
00:21:16.040
Usually machine learning, you learn something
link |
00:21:17.520
for one scenario and that was it.
link |
00:21:19.320
And that's really exciting.
link |
00:21:20.280
I mean, that's a huge application.
link |
00:21:22.280
That's probably the biggest success
link |
00:21:23.680
of transfer learning today in terms of scope and impact.
link |
00:21:27.920
That was a huge breakthrough.
link |
00:21:29.040
And then recently, I feel like similar kind of,
link |
00:21:33.040
by scaling things up, it seems like
link |
00:21:34.760
this has been expanded upon.
link |
00:21:36.160
Like people training even bigger networks,
link |
00:21:37.960
they might transfer even better.
link |
00:21:39.480
If you looked at, for example,
link |
00:21:41.200
some of the OpenAI results on language models
link |
00:21:43.400
and some of the recent Google results on language models,
link |
00:21:47.560
they're learned for just prediction
link |
00:21:51.040
and then they get reused for other tasks.
link |
00:21:54.960
And so I think there is something there
link |
00:21:56.680
where somehow if you train a big enough model
link |
00:21:58.520
on enough things, it seems to transfer.
link |
00:22:01.360
There were also some DeepMind results that I thought were very impressive,
link |
00:22:03.600
the UNREAL results, where it learned to navigate mazes
link |
00:22:09.240
in ways where it wasn't just doing reinforcement learning,
link |
00:22:11.240
but it had other objectives it was optimizing for.
link |
00:22:14.280
So I think there's a lot of interesting results already.
link |
00:22:17.240
I think maybe where it's hard to wrap my head around this,
link |
00:22:22.520
to which extent or when do we call something generalization?
link |
00:22:26.720
Or the levels of generalization in the real world,
link |
00:22:29.760
or the levels of generalization involved
link |
00:22:31.880
in these different tasks, right?
link |
00:22:36.240
You draw this, by the way, just to frame things.
link |
00:22:39.280
I've heard you say somewhere, it's the difference
link |
00:22:41.400
between learning to master versus learning to generalize,
link |
00:22:44.920
that it's a nice line to think about.
link |
00:22:47.880
And I guess you're saying that it's a gray area
link |
00:22:50.920
of what learning to master and learning to generalize,
link |
00:22:53.680
where one starts.
link |
00:22:54.520
I think I might have heard this.
link |
00:22:56.120
I might have heard it somewhere else.
link |
00:22:57.840
And I think it might've been one of your interviews,
link |
00:23:00.480
maybe the one with Yoshua Bengio, I'm not 100% sure.
link |
00:23:03.720
But I liked the example, I'm not sure who it was,
link |
00:23:08.440
but the example was essentially,
link |
00:23:10.600
if you use current deep learning techniques,
link |
00:23:13.320
what we're doing to predict, let's say,
link |
00:23:17.200
the relative motion of our planets, it would do pretty well.
link |
00:23:22.200
But then now if a massive new mass enters our solar system,
link |
00:23:28.440
it would probably not predict what will happen, right?
link |
00:23:32.120
And that's a different kind of generalization.
link |
00:23:33.600
That's a generalization that relies
link |
00:23:34.960
on the ultimate simplest, simplest explanation
link |
00:23:38.560
that we have available today
link |
00:23:40.240
to explain the motion of planets,
link |
00:23:41.600
whereas just pattern recognition could predict
link |
00:23:43.700
our current solar system motion pretty well, no problem.
link |
00:23:47.320
And so I think that's an example
link |
00:23:48.880
of a kind of generalization that is a little different
link |
00:23:52.440
from what we've achieved so far.
link |
00:23:54.560
And it's not clear if just regularizing more
link |
00:23:59.720
and forcing it to come up with a simpler, simpler,
link |
00:24:01.840
simpler explanation, saying, look, this is not simple enough yet, would get you there.
link |
00:24:03.840
But that's what physics researchers do, right?
link |
00:24:05.600
They say, can I make this even simpler?
link |
00:24:08.220
How simple can I get this?
link |
00:24:09.440
What's the simplest equation that can explain everything?
link |
00:24:12.400
The master equation for the entire dynamics of the universe,
link |
00:24:15.560
we haven't really pushed that direction as hard
link |
00:24:17.600
in deep learning, I would say.
link |
00:24:20.740
Not sure if it should be pushed,
link |
00:24:22.040
but it seems a kind of generalization you get from that
link |
00:24:24.560
that you don't get in our current methods so far.
link |
00:24:27.400
So I just talked to Vladimir Vapnik, for example,
link |
00:24:30.040
who's a statistician, one of the founders of statistical learning theory,
link |
00:24:34.200
and he kind of dreams of creating
link |
00:24:37.000
the E equals MC squared for learning, right?
link |
00:24:41.080
The general theory of learning.
link |
00:24:42.460
Do you think that's a fruitless pursuit
link |
00:24:44.640
in the near term, within the next several decades?
link |
00:24:51.800
I think that's a really interesting pursuit
link |
00:24:53.560
in the following sense, in that there is a lot of evidence
link |
00:24:58.040
that the brain is pretty modular.
link |
00:25:03.480
And so I wouldn't maybe think of it as the theory,
link |
00:25:05.520
maybe the underlying theory, but more kind of the principle
link |
00:25:10.700
where there have been findings where
link |
00:25:12.840
people who are blind will use the part of the brain
link |
00:25:16.600
usually used for vision for other functions.
link |
00:25:21.640
And even after some kind of,
link |
00:25:24.720
if people get rewired in some way,
link |
00:25:26.440
they might be able to reuse parts of their brain
link |
00:25:28.700
for other functions.
link |
00:25:30.400
And so what that suggests is some kind of modularity.
link |
00:25:35.160
And I think it is a pretty natural thing to strive for
link |
00:25:39.280
to see, can we find that modularity?
link |
00:25:41.720
Can we find this thing?
link |
00:25:43.200
Of course, every part of the brain is not exactly the same.
link |
00:25:45.960
Not everything can be rewired arbitrarily.
link |
00:25:48.600
But if you think of things like the neocortex,
link |
00:25:50.240
which is a pretty big part of the brain,
link |
00:25:52.300
that seems fairly modular from what the findings so far.
link |
00:25:56.560
Can you design something equally modular?
link |
00:25:59.240
And if you can just grow it,
link |
00:26:00.560
it becomes more capable probably.
link |
00:26:02.520
I think that would be the kind of interesting
link |
00:26:04.940
underlying principle to shoot for that is not unrealistic.
link |
00:26:09.400
Do you think you prefer math or empirical trial and error
link |
00:26:15.200
for the discovery of the essence of what it means
link |
00:26:17.560
to do something intelligent?
link |
00:26:19.000
So reinforcement learning embodies both groups, right?
link |
00:26:22.120
To prove that something converges, prove the bounds.
link |
00:26:26.400
And then at the same time, a lot of those successes are,
link |
00:26:29.320
well, let's try this and see if it works.
link |
00:26:31.560
So which do you gravitate towards?
link |
00:26:33.400
How do you think of those two parts of your brain?
link |
00:26:39.920
Maybe I would prefer we could make the progress
link |
00:26:44.560
with mathematics.
link |
00:26:45.600
And the reason maybe I would prefer that is because often
link |
00:26:48.040
if you have something you can mathematically formalize,
link |
00:26:52.840
you can leapfrog a lot of experimentation.
link |
00:26:55.800
And experimentation takes a long time to get through.
link |
00:26:58.800
And a lot of trial and error,
link |
00:27:01.280
it's kind of reinforcement learning in your research process,
link |
00:27:04.120
but you need to do a lot of trial and error
link |
00:27:05.560
before you get to a success.
link |
00:27:06.720
So if you can leapfrog that, to my mind,
link |
00:27:08.520
that's what the math is about.
link |
00:27:10.480
And hopefully once you do a bunch of experiments,
link |
00:27:13.280
you start seeing a pattern.
link |
00:27:14.440
You can do some derivations that leapfrog some experiments.
link |
00:27:18.320
But I agree with you.
link |
00:27:19.160
I mean, in practice, a lot of the progress has been such
link |
00:27:21.360
that we have not been able to find the math
link |
00:27:23.680
that allows you to leapfrog ahead.
link |
00:27:25.120
And we are kind of making gradual progress
link |
00:27:28.100
one step at a time, a new experiment here,
link |
00:27:30.440
a new experiment there that gives us new insights
link |
00:27:32.920
and gradually building up,
link |
00:27:34.400
but not getting to something yet where we're just,
link |
00:27:36.600
okay, here's an equation that now explains what,
link |
00:27:39.120
you know, would otherwise
link |
00:27:40.560
have been two years of experimentation to get to,
link |
00:27:42.540
and this tells us what the result's going to be.
link |
00:27:45.440
Unfortunately, not so much yet.
link |
00:27:47.560
Not so much yet, but your hope is there.
link |
00:27:50.200
In trying to teach robots or systems
link |
00:27:53.680
to do everyday tasks or even in simulation,
link |
00:27:58.340
what do you think you're more excited about?
link |
00:28:02.740
Imitation learning or self play?
link |
00:28:04.800
So letting robots learn from humans
link |
00:28:08.700
or letting robots play on their own
link |
00:28:11.340
to try to figure it out in their own way
link |
00:28:13.880
and eventually play, eventually interact with humans
link |
00:28:18.320
or solve whatever the problem is.
link |
00:28:20.180
What's the more exciting to you?
link |
00:28:21.860
What's more promising you think as a research direction?
link |
00:28:24.660
So when we look at self play,
link |
00:28:32.300
what's so beautiful about it is it goes back
link |
00:28:34.900
to kind of the challenges in reinforcement learning.
link |
00:28:37.260
So the challenge of reinforcement learning
link |
00:28:38.460
is getting signal.
link |
00:28:40.580
And if you never succeed, you don't get any signal.
link |
00:28:43.300
In self play, you're on both sides.
link |
00:28:46.740
So one of you succeeds.
link |
00:28:48.020
And the beauty is also one of you fails.
link |
00:28:49.980
And so you see the contrast.
link |
00:28:51.100
You see the one version of me that did better
link |
00:28:53.300
than the other version.
link |
00:28:54.140
So every time you play yourself, you get signal.
link |
00:28:57.260
And so whenever you can turn something into self play,
link |
00:29:00.100
you're in a beautiful situation
link |
00:29:02.080
where you can naturally learn much more quickly
link |
00:29:04.820
than in most other reinforcement learning environments.
link |
00:29:07.980
So I think if somehow we can turn more
link |
00:29:12.460
reinforcement learning problems
link |
00:29:13.720
into self play formulations,
link |
00:29:15.500
that would go really, really far.
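(A minimal sketch of why self play gives signal: the same policy plays both sides of a symmetric game, here rock-paper-scissors against a periodically frozen copy of itself, so every game yields a win/loss reward without any hand-designed objective. The game and update rule are illustrative assumptions.)

```python
# Minimal self-play sketch: the learner plays a symmetric game against a
# frozen past copy of itself, so every game produces a win/loss signal
# even though no external reward was designed. Everything here is a toy.
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=3)               # policy over rock/paper/scissors
PAYOFF = np.array([[0, -1, 1],            # payoff[a, b] = reward to player a
                   [1, 0, -1],
                   [-1, 1, 0]])

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

opponent_logits = logits.copy()           # frozen past copy of ourselves
for step in range(20000):
    probs = softmax(logits)
    a = rng.choice(3, p=probs)                        # learner's move
    b = rng.choice(3, p=softmax(opponent_logits))     # past self's move
    reward = PAYOFF[a, b]                             # signal on every single game
    grad = -probs
    grad[a] += 1.0
    logits += 0.01 * reward * grad                    # policy gradient on win/loss
    if step % 500 == 0:
        opponent_logits = logits.copy()               # periodically refresh opponent

print("policy after self-play:", softmax(logits))  # should move toward a mixed strategy (Nash is 1/3 each)
```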
link |
00:29:17.180
So far, self play has been largely around games
link |
00:29:20.720
where there is natural opponents.
link |
00:29:22.820
But if we could do self play for other things,
link |
00:29:24.740
and let's say, I don't know,
link |
00:29:25.580
a robot learns to build a house.
link |
00:29:26.940
I mean, that's a pretty advanced thing
link |
00:29:28.380
to try to do for a robot,
link |
00:29:29.500
but maybe it tries to build a hut or something.
link |
00:29:31.900
If that can be done through self play,
link |
00:29:34.140
it would learn a lot more quickly
link |
00:29:35.420
if somebody can figure that out.
link |
00:29:36.500
And I think that would be something
link |
00:29:37.980
where it goes closer to kind of the mathematical leapfrogging
link |
00:29:41.560
where somebody figures out a formalism to say,
link |
00:29:43.900
okay, any RL problem by playing this and this idea,
link |
00:29:47.200
you can turn it into a self play problem
link |
00:29:48.700
where you get signal a lot more easily.
link |
00:29:50.740
Reality is, many problems we don't know
link |
00:29:52.780
how to turn into self play.
link |
00:29:53.980
And so either we need to provide detailed reward.
link |
00:29:56.980
That doesn't just reward for achieving a goal,
link |
00:29:58.940
but rewards for making progress,
link |
00:30:00.780
and that becomes time consuming.
link |
00:30:02.660
And once you're starting to do that,
link |
00:30:03.900
let's say you want a robot to do something,
link |
00:30:05.060
you need to give all this detailed reward.
link |
00:30:07.180
Well, why not just give a demonstration?
link |
00:30:09.340
Because why not just show the robot?
link |
00:30:11.940
And now the question is, how do you show the robot?
link |
00:30:14.540
One way to show it is to teleoperate the robot,
link |
00:30:16.620
and then the robot really experiences things.
link |
00:30:19.020
And that's nice, because that's really high signal
link |
00:30:21.140
to noise ratio data, and we've done a lot of that.
link |
00:30:23.060
And you can teach your robot skills: in just 10 minutes,
link |
00:30:26.020
you can teach your robot a new basic skill,
link |
00:30:27.860
like okay, pick up the bottle, place it somewhere else.
link |
00:30:30.300
That's a skill, no matter where the bottle starts,
link |
00:30:32.420
maybe it always goes onto a target or something.
link |
00:30:34.940
That's fairly easy to teach your robot with teleop.
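(A sketch of that teleoperation route as behavioral cloning: log observation-action pairs while a human drives the robot, then fit the policy by supervised learning. The demonstration data and network sizes are placeholder assumptions.)

```python
# Sketch of the teleoperation route: log (observation, action) pairs while a
# human drives the robot, then fit the policy with plain supervised learning
# (behavioral cloning). The demo data and network sizes are placeholder assumptions.
import torch

obs_dim, act_dim = 16, 7                 # e.g. joint states in, joint commands out
policy = torch.nn.Sequential(
    torch.nn.Linear(obs_dim, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, act_dim))

# Stand-in for ~10 minutes of teleoperated "pick up the bottle" demonstrations.
demo_obs = torch.randn(500, obs_dim)
demo_act = torch.randn(500, act_dim)

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
for epoch in range(200):
    pred = policy(demo_obs)
    loss = torch.nn.functional.mse_loss(pred, demo_act)   # imitate the human's actions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print("final imitation loss:", loss.item())
```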
link |
00:30:38.100
Now, what's even more interesting
link |
00:30:40.340
if you can now teach your robot
link |
00:30:41.380
through third person learning,
link |
00:30:43.100
where the robot watches you do something
link |
00:30:45.700
and doesn't experience it, but just kind of watches you.
link |
00:30:48.500
It doesn't experience it, but just watches it
link |
00:30:49.820
and says, okay, well, if you're showing me that,
link |
00:30:52.180
that means I should be doing this.
link |
00:30:53.800
And I'm not gonna be using your hand,
link |
00:30:55.380
because I don't get to control your hand,
link |
00:30:57.100
but I'm gonna use my hand, I do that mapping.
link |
00:30:59.540
And so that's where I think one of the big breakthroughs
link |
00:31:02.140
has happened this year.
link |
00:31:03.340
This was led by Chelsea Finn here.
link |
00:31:06.460
It's almost like learning a machine translation
link |
00:31:08.280
for demonstrations, where you have a human demonstration,
link |
00:31:11.340
and the robot learns to translate it
link |
00:31:12.820
into what it means for the robot to do it.
link |
00:31:15.900
And that was a meta learning formulation,
link |
00:31:17.560
learn from one to get the other.
link |
00:31:20.380
And that, I think, opens up a lot of opportunities
link |
00:31:23.020
to learn a lot more quickly.
link |
00:31:24.540
So my focus is on autonomous vehicles.
link |
00:31:26.580
Do you think this approach of third person watching,
link |
00:31:29.940
the autonomous driving is amenable
link |
00:31:31.980
to this kind of approach?
link |
00:31:33.860
So for autonomous driving,
link |
00:31:36.660
I would say third person is slightly easier.
link |
00:31:41.580
And the reason I'm gonna say it's slightly easier
link |
00:31:43.460
to do with third person is because
link |
00:31:46.620
the car dynamics are very well understood.
link |
00:31:49.540
So the...
link |
00:31:51.020
Easier than first person, you mean?
link |
00:31:53.980
Or easier than...
link |
00:31:55.700
So I think the distinction between third person
link |
00:31:57.540
and first person is not a very important distinction
link |
00:32:00.180
for autonomous driving.
link |
00:32:01.840
They're very similar.
link |
00:32:03.460
Because the distinction is really about
link |
00:32:06.100
who turns the steering wheel.
link |
00:32:09.180
Or maybe, let me put it differently.
link |
00:32:12.340
How to get from a point where you are now
link |
00:32:14.860
to a point, let's say, a couple meters in front of you.
link |
00:32:17.440
And that's a problem that's very well understood.
link |
00:32:19.240
And that's the only distinction
link |
00:32:20.260
between third and first person there.
link |
00:32:21.920
Whereas with the robot manipulation,
link |
00:32:23.220
interaction forces are very complex.
link |
00:32:25.420
And it's still a very different thing.
link |
00:32:27.980
For autonomous driving,
link |
00:32:29.940
I think there is still the question,
link |
00:32:31.420
imitation versus RL.
link |
00:32:34.580
So imitation gives you a lot more signal.
link |
00:32:36.740
I think where imitation is lacking
link |
00:32:38.900
and needs some extra machinery is,
link |
00:32:42.380
it doesn't, in its normal format,
link |
00:32:45.460
doesn't think about goals or objectives.
link |
00:32:48.580
And of course, there are versions of imitation learning
link |
00:32:51.060
and versus reinforcement learning type imitation learning
link |
00:32:52.900
which also thinks about goals.
link |
00:32:54.640
I think then we're getting much closer.
link |
00:32:57.100
But I think it's very hard to think of a
link |
00:32:59.620
fully reactive car, generalizing well.
link |
00:33:04.060
If it really doesn't have a notion of objectives
link |
00:33:05.960
to generalize well to the kind of general
link |
00:33:08.540
that you would want.
link |
00:33:09.500
You'd want more than just that reactivity
link |
00:33:12.160
that you get from just behavioral cloning
link |
00:33:13.660
slash supervised learning.
link |
00:33:17.100
So a lot of the work,
link |
00:33:19.560
whether it's self play or even imitation learning,
link |
00:33:22.060
would benefit significantly from simulation,
link |
00:33:24.860
from effective simulation.
link |
00:33:26.540
And you're doing a lot of stuff
link |
00:33:27.580
in the physical world and in simulation.
link |
00:33:29.660
Do you have hope for greater and greater
link |
00:33:33.620
power of simulation being boundless eventually
link |
00:33:38.380
to where most of what we need to operate
link |
00:33:40.740
in the physical world could be simulated
link |
00:33:43.780
to a degree that's directly transferable
link |
00:33:46.460
to the physical world?
link |
00:33:47.580
Or are we still very far away from that?
link |
00:33:51.660
So I think we could even rephrase that question
link |
00:33:57.780
in some sense.
link |
00:33:58.780
Please.
link |
00:34:00.360
And so the power of simulation, right?
link |
00:34:04.940
As simulators get better and better,
link |
00:34:06.580
of course, becomes stronger
link |
00:34:08.980
and we can learn more in simulation.
link |
00:34:11.260
But there's also another version
link |
00:34:12.460
which is where you say the simulator
link |
00:34:13.660
doesn't even have to be that precise.
link |
00:34:15.900
As long as it's somewhat representative
link |
00:34:18.660
and instead of trying to get one simulator
link |
00:34:21.060
that is sufficiently precise to learn in
link |
00:34:23.140
and transfer really well to the real world,
link |
00:34:25.300
I'm gonna build many simulators.
link |
00:34:27.100
Ensemble of simulators?
link |
00:34:28.260
Ensemble of simulators.
link |
00:34:29.940
Not any single one of them is sufficiently representative
link |
00:34:33.580
of the real world such that it would work
link |
00:34:36.740
if you train in there.
link |
00:34:37.900
But if you train in all of them,
link |
00:34:40.700
then there is something that's good in all of them.
link |
00:34:43.600
The real world will just be another one of them
link |
00:34:47.620
that's not identical to any one of them
link |
00:34:49.700
but just another one of them.
link |
00:34:50.940
Another sample from the distribution of simulators.
link |
00:34:53.180
Exactly.
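(A sketch of that ensemble-of-simulators idea, often called domain randomization: every training episode samples new physics parameters, so whatever works across the whole ensemble has a better chance of treating the real world as just one more sample. The toy dynamics, parameter ranges, and controller are illustrative assumptions.)

```python
# Sketch of the "ensemble of simulators" idea (domain randomization): each
# training episode samples new physics parameters, so a controller that works
# across all of them has a better shot in the real world, which is treated
# as just one more sample. The toy dynamics and ranges are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def sample_simulator():
    """Each call returns a randomized 1-D 'push a block to the origin' simulator."""
    mass = rng.uniform(0.5, 2.0)
    friction = rng.uniform(0.05, 0.3)
    def step(pos, vel, force, dt=0.05):
        acc = (force - friction * vel) / mass
        vel = vel + acc * dt
        pos = pos + vel * dt
        return pos, vel
    return step

def evaluate_gains(kp, kd, episodes=50):
    """Average final distance to the goal across many randomized simulators."""
    total = 0.0
    for _ in range(episodes):
        step = sample_simulator()
        pos, vel = 1.0, 0.0
        for _ in range(100):
            force = -kp * pos - kd * vel       # simple PD controller as the "policy"
            pos, vel = step(pos, vel, force)
        total += abs(pos)
    return total / episodes

# Crude random search for controller gains that work across the whole ensemble.
best = min(((evaluate_gains(kp, kd), kp, kd)
            for kp in [1, 3, 10] for kd in [0.5, 1, 3]))
print("best average error %.4f with kp=%s kd=%s" % best)
```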
link |
00:34:54.020
We do live in a simulation,
link |
00:34:54.860
so this is just one other one.
link |
00:34:57.780
I'm not sure about that, but yeah.
link |
00:35:01.580
It's definitely a very advanced simulator if it is.
link |
00:35:03.580
Yeah, it's a pretty good one.
link |
00:35:05.700
I've talked to Stuart Russell.
link |
00:35:07.660
It's something you think about a little bit too.
link |
00:35:09.460
Of course, you're really trying to build these systems,
link |
00:35:12.060
but do you think about the future of AI?
link |
00:35:13.780
A lot of people have concern about safety.
link |
00:35:16.380
How do you think about AI safety?
link |
00:35:18.240
As you build robots that are operating in the physical world,
link |
00:35:21.460
what is, yeah, how do you approach this problem
link |
00:35:25.060
in an engineering kind of way, in a systematic way?
link |
00:35:29.220
So when a robot is doing things,
link |
00:35:32.340
you kind of have a few notions of safety to worry about.
link |
00:35:36.240
One is that the robot is physically strong
link |
00:35:39.380
and of course could do a lot of damage.
link |
00:35:42.340
Same for cars, which we can think of as robots too
link |
00:35:44.840
in some way.
link |
00:35:46.780
And this could be completely unintentional.
link |
00:35:48.340
So it could be not the kind of longterm AI safety concerns
link |
00:35:51.780
that, okay, AI is smarter than us and now what do we do?
link |
00:35:54.380
But it could be just very practical.
link |
00:35:55.860
Okay, this robot, if it makes a mistake,
link |
00:35:58.920
what are the results going to be?
link |
00:36:00.700
Of course, simulation comes in a lot there
link |
00:36:02.280
to test in simulation. It's a difficult question.
link |
00:36:07.780
And I'm always wondering, like, I always wonder,
link |
00:36:09.540
let's say you look at, let's go back to driving
link |
00:36:12.020
because a lot of people know driving well, of course.
link |
00:36:15.280
What do we do to test somebody for driving, right?
link |
00:36:18.940
Get a driver's license. What do they really do?
link |
00:36:21.420
I mean, you fill out some tests and then you drive.
link |
00:36:26.660
And I mean, in suburban California,
link |
00:36:29.500
That driving test is just you drive around the block,
link |
00:36:32.940
pull over, you do a stop sign successfully,
link |
00:36:36.500
and then you pull over again and you're pretty much done.
link |
00:36:40.060
And you're like, okay, if a self driving car did that,
link |
00:36:44.500
would you trust it that it can drive?
link |
00:36:46.840
And I'd be like, no, that's not enough for me to trust it.
link |
00:36:48.900
But somehow for humans, we've figured out
link |
00:36:51.540
that somebody being able to do that is representative
link |
00:36:55.220
of them being able to do a lot of other things.
link |
00:36:57.900
And so I think somehow for humans,
link |
00:36:59.980
we figured out representative tests
link |
00:37:02.660
of what it means if you can do this, what you can really do.
link |
00:37:05.860
Of course, testing humans,
link |
00:37:07.380
humans don't wanna be tested at all times.
link |
00:37:09.180
Self driving cars or robots
link |
00:37:10.300
could be tested more often probably.
link |
00:37:11.980
You can have replicas that get tested
link |
00:37:13.460
that are known to be identical
link |
00:37:14.820
because they use the same neural net and so forth.
link |
00:37:17.140
But still, I feel like we don't have this kind of unit tests
link |
00:37:21.260
or proper tests for robots.
link |
00:37:24.420
And I think there's something very interesting
link |
00:37:25.520
to be thought about there,
link |
00:37:26.780
especially as you update things.
link |
00:37:28.540
Your software improves,
link |
00:37:29.580
you have a better self driving car suite, you update it.
link |
00:37:32.320
How do you know it's indeed more capable on everything
link |
00:37:35.960
than what you had before,
link |
00:37:37.280
that you didn't have any bad things creep into it?
link |
00:37:41.500
So I think that's a very interesting direction of research
link |
00:37:43.540
that there is no real solution yet,
link |
00:37:46.340
except that somehow for humans we do.
link |
00:37:47.980
Because we say, okay, you have a driving test, you passed,
link |
00:37:50.820
you can go on the road now,
link |
00:37:51.940
and humans have accidents every like a million
link |
00:37:54.900
or 10 million miles, something pretty phenomenal
link |
00:37:57.860
compared to that short test that is being done.
link |
00:38:01.660
So let me ask, you've mentioned that Andrew Ng by example
link |
00:38:06.100
showed you the value of kindness.
link |
00:38:10.100
Do you think the space of policies,
link |
00:38:14.580
good policies for humans and for AI
link |
00:38:17.500
is populated by policies that with kindness
link |
00:38:22.500
or ones that are the opposite, exploitation, even evil?
link |
00:38:28.220
So if you just look at the sea of policies
link |
00:38:30.300
we operate under as human beings,
link |
00:38:32.540
or if AI system had to operate in this real world,
link |
00:38:35.300
do you think it's really easy to find policies
link |
00:38:38.060
that are full of kindness,
link |
00:38:39.580
like we naturally fall into them?
link |
00:38:41.340
Or is it like a very hard optimization problem?
link |
00:38:48.100
I mean, there is kind of two optimizations
link |
00:38:50.300
happening for humans, right?
link |
00:38:52.300
So for humans, there's kind of the very long term
link |
00:38:54.140
optimization which evolution has done for us
link |
00:38:56.900
and we're kind of predisposed to like certain things.
link |
00:39:00.780
And that's in some sense what makes our learning easier
link |
00:39:02.780
because I mean, we know things like pain
link |
00:39:05.420
and hunger and thirst.
link |
00:39:08.420
And the fact that we know about those
link |
00:39:10.100
is not something that we were taught, that's kind of innate.
link |
00:39:12.740
When we're hungry, we're unhappy.
link |
00:39:14.060
When we're thirsty, we're unhappy.
link |
00:39:16.220
When we have pain, we're unhappy.
link |
00:39:18.420
And ultimately evolution built that into us
link |
00:39:21.760
to think about those things.
link |
00:39:22.600
And so I think there is a notion that
link |
00:39:24.660
it seems somehow humans evolved in general
link |
00:39:28.220
to prefer to get along in some ways,
link |
00:39:32.820
but at the same time also to be very territorial
link |
00:39:36.940
and kind of centric to their own tribe.
link |
00:39:41.620
Like it seems like that's the kind of space
link |
00:39:43.580
we converged onto.
link |
00:39:44.660
I mean, I'm not an expert in anthropology,
link |
00:39:46.660
but it seems like we're very kind of good
link |
00:39:49.260
within our own tribe, but need to be taught
link |
00:39:52.860
to be nice to other tribes.
link |
00:39:54.660
Well, if you look at Steven Pinker,
link |
00:39:56.300
he highlights this pretty nicely in
link |
00:40:00.740
The Better Angels of Our Nature,
link |
00:40:02.340
where he talks about violence decreasing over time
link |
00:40:04.980
consistently.
link |
00:40:05.800
So whatever tension, whatever teams we pick,
link |
00:40:08.340
it seems that the long arc of history
link |
00:40:11.100
goes towards us getting along more and more.
link |
00:40:14.220
So. I hope so.
link |
00:40:15.420
So do you think that, do you think it's possible
link |
00:40:20.620
to teach RL based robots this kind of kindness,
link |
00:40:26.180
this kind of ability to interact with humans,
link |
00:40:28.380
this kind of policy, even to, let me ask a fun one.
link |
00:40:32.860
Do you think it's possible to teach RL based robot
link |
00:40:35.140
to love a human being and to inspire that human
link |
00:40:38.580
to love the robot back?
link |
00:40:40.020
So like an RL based algorithm that leads to a happy marriage.
link |
00:40:47.540
That's an interesting question.
link |
00:40:48.860
Maybe I'll answer it with another question, right?
link |
00:40:52.820
Because I mean, but I'll come back to it.
link |
00:40:56.700
So another question you can have is okay.
link |
00:40:58.940
I mean, how close does some people's happiness get
link |
00:41:03.560
from interacting with just a really nice dog?
link |
00:41:07.620
Like, I mean, dogs, you come home,
link |
00:41:09.900
that's what dogs do.
link |
00:41:10.740
They greet you, they're excited,
link |
00:41:12.660
makes you happy when you come home to your dog.
link |
00:41:14.700
You're just like, okay, this is exciting.
link |
00:41:16.460
They're always happy when I'm here.
link |
00:41:18.340
And if they don't greet you, cause maybe whatever,
link |
00:41:21.300
your partner took them on a trip or something,
link |
00:41:23.540
you might not be nearly as happy when you get home, right?
link |
00:41:26.100
And so the kind of, it seems like the level of reasoning
link |
00:41:30.260
a dog has is pretty sophisticated,
link |
00:41:32.200
but then it's still not yet at the level of human reasoning.
link |
00:41:35.660
And so it seems like we don't even need to achieve
link |
00:41:37.840
human level reasoning to get like very strong affection
link |
00:41:40.460
with humans.
link |
00:41:41.700
And so my thinking is why not, right?
link |
00:41:44.220
Why, with an AI, couldn't we achieve
link |
00:41:47.140
the kind of level of affection that humans feel
link |
00:41:51.460
among each other or with friendly animals and so forth?
link |
00:41:57.480
So question, is it a good thing for us or not?
link |
00:41:59.740
That's another thing, right?
link |
00:42:01.380
Because I mean, but I don't see why not.
link |
00:42:05.980
Why not, yeah, so Elon Musk says love is the answer.
link |
00:42:09.020
Maybe he should say love is the objective function
link |
00:42:12.660
and then RL is the answer, right?
link |
00:42:14.700
Well, maybe.
link |
00:42:17.660
Oh, Pieter, thank you so much.
link |
00:42:18.880
I don't want to take up more of your time.
link |
00:42:20.260
Thank you so much for talking today.
link |
00:42:21.900
Well, thanks for coming by.
link |
00:42:23.500
Great to have you visit.