
Sergey Levine: Robotics and Machine Learning | Lex Fridman Podcast #108



link |
00:00:00.000
The following is a conversation with Sergey Levine, a professor at Berkeley and a world
link |
00:00:05.360
class researcher in deep learning, reinforcement learning, robotics, and computer vision, including
link |
00:00:10.860
the development of algorithms for end to end training of neural network policies that combine
link |
00:00:15.660
perception and control, scalable algorithms for inverse reinforcement learning, and, in
link |
00:00:21.160
general, deep RL algorithms.
link |
00:00:24.100
Quick summary of the ads.
link |
00:00:25.340
Two sponsors, Cash App and ExpressVPN.
link |
00:00:28.740
Please consider supporting the podcast by downloading Cash App and using code LexPodcast
link |
00:00:34.100
and signing up at expressvpn.com slash lexpod.
link |
00:00:38.920
Click the links, buy the stuff, it's the best way to support this podcast and, in general,
link |
00:00:44.340
the journey I'm on.
link |
00:00:45.340
If you enjoy this thing, subscribe on YouTube, review it with 5 stars on Apple Podcasts, follow
link |
00:00:51.100
on Spotify, support it on Patreon, or connect with me on Twitter at lexfridman.
link |
00:00:57.740
As usual, I'll do a few minutes of ads now and never any ads in the middle that can break
link |
00:01:01.540
the flow of the conversation.
link |
00:01:04.020
This show is presented by Cash App, the number one finance app in the App Store.
link |
00:01:08.460
When you get it, use code lexpodcast.
link |
00:01:11.780
Cash App lets you send money to friends, buy Bitcoin, and invest in the stock market with
link |
00:01:15.940
as little as one dollar.
link |
00:01:18.380
Since Cash App does fractional share trading, let me mention that the order execution algorithm
link |
00:01:23.460
that works behind the scenes to create the abstraction of fractional orders is an algorithmic
link |
00:01:29.020
marvel.
link |
00:01:30.020
So, big props to the Cash App engineers for taking a step up to the next layer of abstraction
link |
00:01:34.500
over the stock market, making trading more accessible for new investors and diversification
link |
00:01:40.100
much easier.
link |
00:01:41.100
So, again, if you get Cash App from the App Store or Google Play and use the code lexpodcast,
link |
00:01:48.300
you get $10, and Cash App will also donate $10 to FIRST, an organization that is helping
link |
00:01:54.220
to advance robotics and STEM education for young people around the world.
link |
00:01:59.840
This show is also sponsored by ExpressVPN.
link |
00:02:04.220
Get it at expressvpn.com slash lexpod to support this podcast and to get an extra three months
link |
00:02:11.680
free on a one year package.
link |
00:02:14.500
I've been using ExpressVPN for many years.
link |
00:02:17.380
I love it.
link |
00:02:18.580
I think ExpressVPN is the best VPN out there.
link |
00:02:22.020
They told me to say it, but it happens to be true in my humble opinion.
link |
00:02:26.300
It doesn't log your data, it's crazy fast, and it's easy to use literally just one big
link |
00:02:31.160
power on button.
link |
00:02:32.580
Again, it's probably obvious to you, but I should say it again, it's really important
link |
00:02:37.700
that they don't log your data.
link |
00:02:40.140
It works on Linux and every other operating system, but Linux, of course, is the best
link |
00:02:45.180
operating system.
link |
00:02:46.620
Shout out to my favorite flavor, Ubuntu MATE 20.04.
link |
00:02:50.780
Once again, get it at expressvpn.com slash lexpod to support this podcast and to get
link |
00:02:56.620
an extra three months free on a one year package.
link |
00:03:00.940
And now, here's my conversation with Sergey Levine.
link |
00:03:05.500
What's the difference between a state of the art human, such as you and I, well, I don't
link |
00:03:10.260
know if we qualify as state of the art humans, but a state of the art human and a state of
link |
00:03:14.540
the art robot?
link |
00:03:16.500
That's a very interesting question.
link |
00:03:19.100
Robot capability is, I think, a very tricky thing to understand because
link |
00:03:26.860
there are some things that are difficult that we wouldn't think are difficult and some things
link |
00:03:29.620
that are easy that we wouldn't think are easy.
link |
00:03:33.060
And there's also a really big gap between capabilities of robots in terms of hardware
link |
00:03:37.740
and their physical capability and capabilities of robots in terms of what they can do autonomously.
link |
00:03:43.060
There is a little video that I think robotics researchers really like to show, especially
link |
00:03:47.460
robotics learning researchers like myself, from 2004 from Stanford, which demonstrates
link |
00:03:53.220
a prototype robot called the PR1, and the PR1 was a robot that was designed as a home
link |
00:03:58.340
assistance robot.
link |
00:03:59.340
And there's this beautiful video showing the PR1 tidying up a living room, putting away
link |
00:04:03.980
toys and at the end bringing a beer to the person sitting on the couch, which looks really
link |
00:04:10.380
amazing.
link |
00:04:11.660
And then the punchline is that this robot is entirely controlled by a person.
link |
00:04:16.060
So in some ways the gap between a state of the art human and state of the art robot,
link |
00:04:20.660
if the robot has a human brain, is actually not that large.
link |
00:04:23.980
Now obviously like human bodies are sophisticated and very robust and resilient in many ways,
link |
00:04:28.340
but on the whole, if we're willing to like spend a bit of money and do a bit of engineering,
link |
00:04:32.620
we can kind of close the hardware gap almost.
link |
00:04:35.880
But the intelligence gap, that one is very wide.
link |
00:04:40.420
And when you say hardware, you're referring to the physical, sort of the actuators, the
link |
00:04:43.820
actual body of the robot, as opposed to the hardware on which the cognition, the hardware
link |
00:04:49.020
of the nervous system.
link |
00:04:50.020
Yes, exactly.
link |
00:04:51.020
I'm referring to the body rather than the mind.
link |
00:04:54.660
So that means that the work is kind of cut out for us.
link |
00:04:56.660
Like while we can still make the body better, we kind of know that the big bottleneck right
link |
00:05:00.500
now is really the mind.
link |
00:05:02.880
And how big is that gap?
link |
00:05:03.980
How big is the difference in your sense of ability to learn, ability to reason, ability
link |
00:05:11.300
to perceive the world between humans and our best robots?
link |
00:05:16.880
The gap is very large and the gap becomes larger the more unexpected events can happen
link |
00:05:23.720
in the world.
link |
00:05:24.720
So essentially the spectrum along which you can measure the size of that gap is the spectrum
link |
00:05:30.860
of how open the world is.
link |
00:05:32.220
If you control everything in the world very tightly, if you put the robot in like a factory
link |
00:05:36.120
and you tell it where everything is and you rigidly program its motion, then it can do
link |
00:05:41.420
things, you know, one might even say in a superhuman way.
link |
00:05:43.580
It can move faster, it's stronger, it can lift up a car and things like that.
link |
00:05:47.280
But as soon as anything starts to vary in the environment, now it'll trip up.
link |
00:05:51.300
And if many, many things vary like they would like in your kitchen, for example, then things
link |
00:05:55.700
are pretty much like wide open.
link |
00:05:57.940
Now, again, we're going to stick a bit on the philosophical questions, but how much
link |
00:06:03.820
on the human side of the cognitive abilities in your sense is nature versus nurture?
link |
00:06:11.140
So how much of it is a product of evolution and how much of it is something we'll learn
link |
00:06:18.420
from sort of scratch from the day we're born?
link |
00:06:22.060
I'm going to read into your question as asking about the implications of this for AI.
link |
00:06:26.260
Because I'm not a biologist, I can't really like speak authoritatively.
link |
00:06:30.540
So before we go on: if it's all about learning, then there's more hope
link |
00:06:36.580
for AI.
link |
00:06:38.540
So the way that I look at this is that, you know, well, first, of course, biology is very
link |
00:06:44.220
messy.
link |
00:06:45.300
And if you ask the question, how does a person do something or how does a person's mind
link |
00:06:49.980
do something, you can come up with a bunch of hypotheses and oftentimes you can find
link |
00:06:54.220
support for many different, often conflicting hypotheses.
link |
00:06:58.220
One way that we can approach the question of what the implications of this for AI are
link |
00:07:03.380
is we can think about what's sufficient.
link |
00:07:05.500
So you know, maybe a person is from birth very, very good at some things like, for example,
link |
00:07:11.220
recognizing faces.
link |
00:07:12.220
There's a very strong evolutionary pressure to do that.
link |
00:07:13.980
If you can recognize your mother's face, then you're more likely to survive and therefore
link |
00:07:18.820
people are good at this.
link |
00:07:20.560
But we can also ask like, what's the minimum sufficient thing?
link |
00:07:23.940
And one of the ways that we can study the minimal sufficient thing is we could, for
link |
00:07:27.060
example, see what people do in unusual situations.
link |
00:07:29.380
If you present them with things that evolution couldn't have prepared them for, you know,
link |
00:07:33.860
our daily lives actually do this to us all the time.
link |
00:07:36.360
We didn't evolve to deal with, you know, automobiles and space flight and whatever.
link |
00:07:41.500
So there are all these situations that we can find ourselves in and we do very well
link |
00:07:45.460
there.
link |
00:07:46.460
Like I can give you a joystick to control a robotic arm, which you've never used before
link |
00:07:50.580
and you might be pretty bad for the first couple of seconds.
link |
00:07:52.940
But if I tell you like your life depends on using this robotic arm to like open this door,
link |
00:07:58.260
you'll probably manage it.
link |
00:07:59.660
Even though you've never seen this device before, you've never used these joystick
link |
00:08:03.140
controls, and you'll kind of muddle through it.
link |
00:08:04.820
And that's not your evolved natural ability.
link |
00:08:08.580
That's your flexibility, your adaptability.
link |
00:08:11.340
And that's exactly where our current robotic systems really kind of fall flat.
link |
00:08:14.860
But I wonder how much general knowledge, almost what we think of as common sense, sits in pre-trained models
link |
00:08:22.500
underneath all of that.
link |
00:08:24.220
So that ability to adapt to a joystick requires you to have a kind of, you know,
link |
00:08:32.100
I'm human.
link |
00:08:33.100
So it's hard for me to introspect all the knowledge I have about the world, but it seems
link |
00:08:37.220
like there might be an iceberg underneath of the amount of knowledge we actually bring
link |
00:08:42.180
to the table.
link |
00:08:43.260
That's kind of the open question.
link |
00:08:45.260
There's absolutely an iceberg of knowledge that we bring to the table, but I think it's
link |
00:08:48.900
very likely that iceberg of knowledge is actually built up over our lifetimes.
link |
00:08:54.060
Because we have, you know, we have a lot of prior experience to draw on.
link |
00:08:58.700
And it kind of makes sense that the right way for us to, you know, to optimize our,
link |
00:09:05.060
our efficiency, our evolutionary fitness and so on is to utilize all of that experience
link |
00:09:10.300
to build up the best iceberg we can get.
link |
00:09:13.360
And that's actually one of the, you know, while that sounds an awful lot like what machine
link |
00:09:16.620
learning actually does, I think that for modern machine learning, it's actually a really big
link |
00:09:20.240
challenge to take this unstructured mass of experience and distill out something that
link |
00:09:25.320
looks like a common sense understanding of the world.
link |
00:09:28.340
And perhaps part of that is not because something about machine learning itself is
link |
00:09:32.660
broken or hard, but because we've been a little too rigid in subscribing to a very
link |
00:09:38.340
supervised, very rigid notion of learning, you know, kind of the input output, X's go
link |
00:09:42.460
to Y's sort of model.
link |
00:09:43.980
And maybe what we really need to do is to view the world more as like a mass of experience
link |
00:09:51.260
that is not necessarily providing any rigid supervision, but sort of providing many, many
link |
00:09:55.060
instances of things that could be.
link |
00:09:56.980
And then you take that and you distill it into some sort of common sense understanding.
link |
00:10:00.700
I see what you're, you're painting an optimistic, beautiful picture, especially from the robotics
link |
00:10:06.700
perspective because that means we just need to invest and build better learning algorithms,
link |
00:10:12.540
figure out how we can get access to more and more data for those learning algorithms to
link |
00:10:17.620
extract signal from, and then accumulate that iceberg of knowledge.
link |
00:10:22.260
It's a beautiful picture.
link |
00:10:23.740
It's a hopeful one.
link |
00:10:25.100
I think it's potentially a little bit more than just that.
link |
00:10:29.020
And this is, this is where we perhaps reach the limits of our current understanding.
link |
00:10:32.880
But one thing that I think that the research community hasn't really resolved in a satisfactory
link |
00:10:37.700
way is how much it matters where that experience comes from, like, you know, do you just like
link |
00:10:43.540
download everything on the internet and cram it into essentially the 21st century analog
link |
00:10:48.860
of the giant language model and then see what happens or does it actually matter whether
link |
00:10:54.540
your machine physically experiences the world, in the sense that it actually attempts
link |
00:10:59.380
things, observes the outcome of its actions and kind of augments its experience that way.
link |
00:11:03.860
And it chooses which parts of the world it gets to interact with and observe and learn
link |
00:11:09.500
from.
link |
00:11:10.500
Right.
link |
00:11:11.500
It may be that the world is so complex that simply obtaining a large mass of sort of
link |
00:11:16.700
IID samples of the world is a very difficult way to go.
link |
00:11:21.140
But if you are actually interacting with the world and essentially performing this sort
link |
00:11:25.040
of hard negative mining by attempting what you think might work, observing the sometimes
link |
00:11:30.060
happy and sometimes sad outcomes of that and augmenting your understanding using that experience
link |
00:11:35.620
and you're just doing this continually for many years, maybe that sort of data in some
link |
00:11:40.140
sense is actually much more favorable to obtaining a common sense understanding.
link |
00:11:44.800
One reason we might think that this is true is that, you know, what we associate with
link |
00:11:49.700
common sense or lack of common sense is often characterized by the ability to reason about
link |
00:11:55.140
kind of counterfactual questions like, you know, here is this bottle of water
link |
00:12:01.000
sitting on the table, everything is fine; if I were to knock it over, which I'm not going
link |
00:12:04.780
to do.
link |
00:12:05.780
But if I were to do that, what would happen?
link |
00:12:07.700
And I know that nothing good would happen from that.
link |
00:12:10.360
But if I have a bad understanding of the world, I might think that that's a good way for me
link |
00:12:14.100
to like, you know, gain more utility.
link |
00:12:16.840
If I actually go about my daily life doing the things that my current understanding of
link |
00:12:22.300
the world suggests will give me high utility, in some ways, I'll get exactly the right supervision
link |
00:12:28.760
to tell me not to do those bad things and to keep doing the good things.
link |
00:12:33.200
So there's a spectrum between IID, random walk through the space of data, and then there's
link |
00:12:39.220
what we humans do, I don't even know if we do it optimally, but that might be beyond.
link |
00:12:45.820
So this open question that you raised, where do you think systems, intelligent systems
link |
00:12:52.540
that would be able to deal with this world fall?
link |
00:12:56.460
Can we do pretty well by reading all of Wikipedia, sort of randomly sampling it like language
link |
00:13:02.120
models do?
link |
00:13:03.900
Or do we have to be exceptionally selective and intelligent about which aspects of the
link |
00:13:09.620
world we interact with?
link |
00:13:12.100
So I think this is first an open scientific problem, and I don't have like a clear answer,
link |
00:13:15.980
but I can speculate a little bit.
link |
00:13:18.300
And what I would speculate is that you don't need to be super, super careful.
link |
00:13:23.580
I think it's less about like, being careful to avoid the useless stuff, and more about
link |
00:13:28.480
making sure that you hit on the really important stuff.
link |
00:13:31.620
So perhaps it's okay, if you spend part of your day, just, you know, guided by your curiosity,
link |
00:13:37.540
reading interesting regions of your state space, but it's important for you to, you
link |
00:13:42.140
know, every once in a while, make sure that you really try out the solutions that your
link |
00:13:47.060
current model of the world suggests might be effective, and observe whether those solutions
link |
00:13:51.120
are working as you expect or not.
link |
00:13:53.060
And perhaps some of that is really essential to have kind of a perpetual improvement loop.
link |
00:13:59.740
This perpetual improvement loop is really like, that's really the key, the key that's
link |
00:14:03.540
going to potentially distinguish the best current methods from the best methods of tomorrow
link |
00:14:07.860
in a sense.
link |
00:14:08.860
How important do you think is exploration or total out of the box thinking exploration
link |
00:14:15.820
in this space as you jump to totally different domains?
link |
00:14:19.300
So you kind of mentioned there's an optimization problem, you kind of explore the specifics
link |
00:14:24.260
of a particular strategy, whatever the thing you're trying to solve.
link |
00:14:27.820
How important is it to explore totally outside of the strategies that have been working for
link |
00:14:33.040
you so far?
link |
00:14:34.040
What's your intuition there?
link |
00:14:35.040
Yeah, I think it's a very problem dependent kind of question.
link |
00:14:38.900
And I think that that's actually, you know, in some ways that question gets at one of
link |
00:14:45.580
the big differences between sort of the classic formulation of a reinforcement learning problem
link |
00:14:51.580
and some of the sort of more open ended reformulations of that problem that have been explored in
link |
00:14:57.480
recent years.
link |
00:14:58.480
So classically reinforcement learning is framed as a problem of maximizing utility, like any
link |
00:15:02.740
kind of rational AI agent, and then anything you do is in service to maximizing that utility.
link |
00:15:08.940
But a very interesting kind of way to look at, I'm not necessarily saying this is the
link |
00:15:15.220
best way to look at it, but an interesting alternative way to look at these problems
link |
00:15:17.820
is as something where you first get to explore the world, however you please, and then afterwards
link |
00:15:24.300
you will be tasked with doing something.
link |
00:15:26.700
And that might suggest a somewhat different solution.
link |
00:15:28.960
So if you don't know what you're going to be tasked with doing, and you just want to
link |
00:15:31.860
prepare yourself optimally for whatever your uncertain future holds, maybe then you will
link |
00:15:35.980
choose to attain some sort of coverage, build up sort of an arsenal of cognitive tools,
link |
00:15:41.820
if you will, such that later on when someone tells you, now your job is to fetch the coffee
link |
00:15:46.400
for me, you will be well prepared to undertake that task.
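(As a concrete illustration of attaining coverage before any task is specified, here is a minimal Python sketch using a count-based novelty bonus; the environment interface and every name in it are illustrative assumptions, not a specific algorithm from the conversation.)

    def explore_for_coverage(env_reset, env_step, actions, explore_steps=1000):
        # Act to visit rarely seen states, with no task reward at all.
        # env_step(state, action) -> next_state is assumed deterministic and
        # states are assumed hashable; a later phase (not shown) would reuse
        # this coverage once a task like "fetch the coffee" finally arrives.
        visit_counts = {}
        state = env_reset()
        for _ in range(explore_steps):
            # pick the action whose successor has been visited least so far
            action = min(actions, key=lambda a: visit_counts.get(env_step(state, a), 0))
            state = env_step(state, action)
            visit_counts[state] = visit_counts.get(state, 0) + 1
        return visit_counts

    # toy 1-D chain world: states are integers 0..10, actions step left or right
    coverage = explore_for_coverage(
        env_reset=lambda: 0,
        env_step=lambda s, a: max(0, min(10, s + a)),
        actions=[-1, 1],
    )
    print(sorted(coverage))

(In a real system the count table would be replaced by a learned density model or intrinsic reward, but the structure, explore for coverage now and exploit for a task later, is the same.)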
link |
00:15:49.180
And that you see that as the modern formulation of the reinforcement learning problem, as
link |
00:15:54.380
a kind of the more multitask, the general intelligence kind of formulation.
link |
00:16:00.460
I think that's one possible vision of where things might be headed.
link |
00:16:04.500
I don't think that's by any means the mainstream or standard way of doing things, and it's
link |
00:16:08.220
not like if I had to...
link |
00:16:09.940
But I like it.
link |
00:16:10.940
It's a beautiful vision.
link |
00:16:11.940
So maybe you actually take a step back.
link |
00:16:14.220
What is the goal of robotics?
link |
00:16:16.700
What's the general problem of robotics we're trying to solve?
link |
00:16:18.940
You actually kind of painted two pictures here.
link |
00:16:21.260
One of sort of the narrow, one of the general.
link |
00:16:23.340
What in your view is the big problem of robotics?
link |
00:16:26.780
And ridiculously philosophical high level questions.
link |
00:16:31.200
I think that maybe there are two ways I can answer this question.
link |
00:16:34.620
One is there's a very pragmatic problem, which is like what would make robots, what would
link |
00:16:41.100
sort of maximize the usefulness of robots?
link |
00:16:44.060
And there the answer might be something like a system that can perform whatever
link |
00:16:53.620
task a human user sets for it, within the physical constraints, of course.
link |
00:16:59.580
If you tell it to teleport to another planet, it probably can't do that.
link |
00:17:02.560
But if you ask it to do something that's within its physical capability, then potentially
link |
00:17:06.440
with a little bit of additional training or a little bit of additional trial and error,
link |
00:17:10.420
it ought to be able to figure it out in much the same way as like a human teleoperator
link |
00:17:14.180
ought to figure out how to drive the robot to do that.
link |
00:17:16.760
That's kind of the very pragmatic view of what it would take to kind of solve the robotics
link |
00:17:22.740
problem, if you will.
link |
00:17:24.960
But I think that there is a second answer, and that answer is a lot closer to why I want
link |
00:17:29.480
to work on robotics, which is that I think it's less about what it would take to do a
link |
00:17:34.300
really good job in the world of robotics, but more the other way around, what robotics
link |
00:17:39.160
can bring to the table to help us understand artificial intelligence.
link |
00:17:44.840
So your dream fundamentally is to understand intelligence?
link |
00:17:48.260
Yes.
link |
00:17:49.260
And I think that's the dream for many people who actually work in this space.
link |
00:17:53.120
I think that there's something very pragmatic and very useful about studying robotics, but
link |
00:17:58.640
I do think that a lot of people that go into this field actually, you know, the things
link |
00:18:02.920
that they draw inspiration from are the potential for robots to like help us learn about intelligence
link |
00:18:09.400
and about ourselves.
link |
00:18:10.720
So that's fascinating that robotics is basically the space by which you can get closer to understanding
link |
00:18:18.280
the fundamentals of artificial intelligence.
link |
00:18:20.680
So what is it about robotics that's different from some of the other approaches?
link |
00:18:25.440
So if we look at some of the early breakthroughs in deep learning or in the computer vision
link |
00:18:30.020
space and the natural language processing, there's really nice clean benchmarks that
link |
00:18:34.920
a lot of people competed on and thereby came up with a lot of brilliant ideas.
link |
00:18:38.540
What's the fundamental difference to you between computer vision purely defined and ImageNet
link |
00:18:43.760
and kind of the bigger robotics problem?
link |
00:18:46.640
So there are a couple of things.
link |
00:18:48.480
One is that with robotics, you kind of have to take away many of the crutches.
link |
00:18:55.520
So you have to deal with both the particular problems of perception, control, and so on,
link |
00:19:01.760
but you also have to deal with the integration of those things.
link |
00:19:04.560
And you know, classically, we've always thought of the integration as kind of a separate problem.
link |
00:19:08.800
So a classic kind of modular engineering approach is that we solve the individual subproblems
link |
00:19:12.800
and then wire them together and then the whole thing works.
link |
00:19:16.080
And one of the things that we've been seeing over the last couple of decades is that, well,
link |
00:19:19.720
maybe studying the thing as a whole might lead to just like very different solutions
link |
00:19:24.200
than if we were to study the parts and wire them together.
link |
00:19:26.640
So the integrative nature of robotics research helps us see, you know, the different perspectives
link |
00:19:32.320
on the problem.
link |
00:19:34.240
Another part of the answer is that with robotics, it casts a certain paradox into very sharp
link |
00:19:40.960
relief.
link |
00:19:41.960
This is sometimes referred to as Moravec's paradox, the idea that in artificial intelligence,
link |
00:19:48.480
things that are very hard for people can be very easy for machines and vice versa.
link |
00:19:52.800
Things that are very easy for people can be very hard for machines.
link |
00:19:54.880
So you know, integral and differential calculus is pretty difficult to learn for people.
link |
00:20:02.080
But if you program a computer to do it, it can compute derivatives and integrals for you all
link |
00:20:06.080
day long without any trouble.
link |
00:20:08.400
Whereas some things like, you know, drinking from a cup of water, very easy for a person
link |
00:20:13.320
to do, very hard for a robot to deal with.
link |
00:20:16.720
And sometimes when we see such blatant discrepancies, that gives us a really strong hint that we're
link |
00:20:21.680
missing something important.
link |
00:20:23.160
So if we really try to zero in on those discrepancies, we might find that little bit that we're missing.
link |
00:20:28.000
And it's not that we need to make machines better or worse at math and better at drinking
link |
00:20:32.320
water, but just that by studying those discrepancies, we might find some new insight.
link |
00:20:37.800
So that could be in any space, it doesn't have to be robotics.
link |
00:20:41.680
But you're saying, I mean, it's kind of interesting that robotics seems to have a lot of those
link |
00:20:48.560
discrepancies.
link |
00:20:49.560
So the Hans Moravec paradox is probably referring to the space of physical interaction,
link |
00:20:56.600
like you said, object manipulation, walking, all the kind of stuff we do in the physical
link |
00:21:00.640
world.
link |
00:21:01.640
How do you make sense, if you were to try to disentangle the Moravec paradox, like why is
link |
00:21:13.280
there such a gap in our intuition about it?
link |
00:21:17.800
Why do you think manipulating objects is so hard from everything you've learned from applying
link |
00:21:23.420
reinforcement learning in this space?
link |
00:21:25.480
Yeah, I think that one reason is maybe that for many of the other problems that we've
link |
00:21:33.760
studied in AI and computer science and so on, the notion of input output and supervision
link |
00:21:41.120
is much, much cleaner.
link |
00:21:42.380
So computer vision, for example, deals with very complex inputs.
link |
00:21:45.920
But it's comparatively a bit easier, at least up to some level of abstraction, to cast it
link |
00:21:52.080
as a very tightly supervised problem.
link |
00:21:54.840
It's comparatively much, much harder to cast robotic manipulation as a very tightly supervised
link |
00:21:59.640
problem.
link |
00:22:00.720
You can do it, it just doesn't seem to work all that well.
link |
00:22:03.440
So you could say that, well, maybe we get a labeled data set where we know exactly which
link |
00:22:06.980
motor commands to send, and then we train on that.
link |
00:22:09.200
But for various reasons, that's not actually such a great solution.
link |
00:22:13.800
And it also doesn't seem to be even remotely similar to how people and animals learn to
link |
00:22:17.440
do things, because we're not told by our parents, here's how you fire your muscles in order
link |
00:22:22.980
to walk.
link |
00:22:24.280
So we do get some guidance, but the really low level detailed stuff we figure out mostly
link |
00:22:29.080
on our own.
link |
00:22:30.080
And that's what you mean by tightly coupled, that every single little sub action gets a
link |
00:22:34.400
supervised signal of whether it's a good one or not.
link |
00:22:37.560
Right.
link |
00:22:38.560
So while in computer vision, you could sort of imagine up to a level of abstraction that
link |
00:22:41.360
maybe somebody told you this is a car and this is a cat and this is a dog, in motor
link |
00:22:45.640
control, it's very clear that that was not the case.
link |
00:22:49.400
If we look at sort of the sub spaces of robotics, that, again, as you said, robotics integrates
link |
00:22:57.120
all of them together, and we get to see how this beautiful mess interplays.
link |
00:23:00.880
But so there's nevertheless still perception.
link |
00:23:04.040
So it's the computer vision problem, broadly speaking, understanding the environment.
link |
00:23:09.880
And there's also maybe you can correct me on this kind of categorization of the space,
link |
00:23:14.600
and there's prediction, trying to anticipate what things are going to do in the future
link |
00:23:20.480
in order for you to be able to act in that world.
link |
00:23:24.440
And then there's also this game theoretic aspect of how your actions will change the
link |
00:23:31.580
behavior of others.
link |
00:23:34.120
In this kind of space, what, and this is bigger than reinforcement learning, this is just
link |
00:23:38.640
broadly looking at the problem of robotics, what's the hardest problem here?
link |
00:23:42.840
Or is there, or is what you said true that when you start to look at all of them together,
link |
00:23:52.280
that's a whole nother thing, like you can't even say which one individually is harder
link |
00:23:57.360
because all of them together, you should only be looking at them all together.
link |
00:24:01.400
I think when you look at them all together, some things actually become easier.
link |
00:24:05.240
And I think that's actually pretty important.
link |
00:24:07.520
So back in 2014, we had some work, basically our first work on end to end reinforcement
link |
00:24:16.040
learning for robotic manipulation skills from vision, which at the time was something that
link |
00:24:21.040
seemed a little inflammatory and controversial in the robotics world.
link |
00:24:25.520
But other than the inflammatory and controversial part of it, the point that we were actually
link |
00:24:30.320
trying to make in that work is that for the particular case of combining perception and
link |
00:24:35.720
control, you could actually do better if you treat them together than if you try to separate
link |
00:24:39.480
them.
link |
00:24:40.480
And the way that we tried to demonstrate this is we picked a fairly simple motor control
link |
00:24:43.240
task where a robot had to insert a little red trapezoid into a trapezoidal hole.
link |
00:24:49.560
And we had our separated solution, which involved first detecting the hole using a pose detector
link |
00:24:54.800
and then actuating the arm to put it in.
link |
00:24:57.720
And then our end-to-end solution, which just mapped pixels to the torques.
link |
00:25:01.780
And one of the things we observed is that if you use the end-to-end solution, essentially
link |
00:25:05.960
the pressure on the perception part of the model is actually lower.
link |
00:25:08.400
Like it doesn't have to figure out exactly where the thing is in 3D space.
link |
00:25:11.320
It just needs to figure out where it is, you know, distributing the errors in such a way
link |
00:25:15.500
that the horizontal difference matters more than the vertical difference because vertically
link |
00:25:19.280
it just pushes it down all the way until it can't go any further.
link |
00:25:22.320
And there, perceptual errors are a lot less harmful, whereas perpendicular to the direction
link |
00:25:26.480
of motion, perceptual errors are much more harmful.
link |
00:25:29.060
So the point is that if you combine these two things, you can trade off errors between
link |
00:25:33.560
the components optimally to best accomplish the task.
link |
00:25:38.120
And the components can actually be weaker while still leading to better overall performance.
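(A minimal sketch of that contrast, written with PyTorch on a toy flattened image; the architectures, the hand-written servo law, and all names here are illustrative assumptions, not the actual 2014 system.)

    import torch
    import torch.nn as nn

    class PoseEstimator(nn.Module):
        # stage 1 of the modular pipeline: image -> (x, y, z) pose estimate,
        # typically supervised on pose labels
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(64 * 64, 128), nn.ReLU(), nn.Linear(128, 3))
        def forward(self, image):
            return self.net(image)

    def modular_policy(image, pose_estimator, gain=1.0):
        # stage 2: a fixed hand-written servo law driving the estimated error to zero
        return -gain * pose_estimator(image)

    class EndToEndPolicy(nn.Module):
        # pixels straight to motor commands, trained on task success, so
        # perceptual errors can be distributed where they hurt the task least
        def __init__(self, n_commands=3):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(64 * 64, 128), nn.ReLU(), nn.Linear(128, n_commands))
        def forward(self, image):
            return self.net(image)

    image = torch.zeros(1, 64 * 64)                      # placeholder flattened 64x64 image
    print(modular_policy(image, PoseEstimator()).shape)  # torch.Size([1, 3])
    print(EndToEndPolicy()(image).shape)                 # torch.Size([1, 3])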
link |
00:25:41.680
It's a profound idea.
link |
00:25:44.000
I mean, in the space of pegs and things like that, it's quite simple.
link |
00:25:48.400
It almost is tempting to overlook, but that seems to be at least intuitively an idea that
link |
00:25:55.080
should generalize to basically all aspects of perception and control, that one strengthens
link |
00:26:01.280
the other.
link |
00:26:02.280
Yeah.
link |
00:26:03.280
And we, you know, people who have studied sort of perceptual heuristics in humans and
link |
00:26:07.080
animals find things like that all the time.
link |
00:26:08.960
So one very well known example of this is something called the gaze heuristic, which
link |
00:26:12.400
is a little trick that you can use to intercept a flying object.
link |
00:26:17.280
So if you want to catch a ball, for instance, you could try to localize it in 3D space,
link |
00:26:21.960
estimate its velocity, estimate the effect of wind resistance, solve a complex system
link |
00:26:25.040
of differential equations in your head.
link |
00:26:27.480
Or you can maintain a running speed so that the object stays in the same position in
link |
00:26:33.280
your field of view.
link |
00:26:34.280
So if it dips a little bit, you speed up.
link |
00:26:35.760
If it rises a little bit, you slow down.
link |
00:26:38.200
And if you follow the simple rule, you'll actually arrive at exactly the place where
link |
00:26:40.800
the object lands and you'll catch it.
link |
00:26:43.060
And humans use it when they play baseball, human pilots use it when they fly airplanes
link |
00:26:46.960
to figure out if they're about to collide with somebody, frogs use this to catch insects
link |
00:26:50.520
and so on and so on.
link |
00:26:51.580
So this is something that actually happens in nature.
link |
00:26:53.640
And I'm sure this is just one instance of it, one that scientists were able to identify
link |
00:26:57.120
just because it's so prevalent, but there are probably
link |
00:27:00.440
many others.
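(To make the trick concrete, here is a minimal Python sketch of the gaze heuristic as a simple feedback rule; the function name and the gain are assumptions for illustration.)

    def gaze_heuristic_step(current_speed, elevation_angle, prev_elevation_angle, gain=5.0):
        # Keep the object at a fixed spot in your visual field: if it dips
        # (elevation decreases), speed up; if it rises, slow down. No 3D
        # localization or physics model is required.
        angle_change = elevation_angle - prev_elevation_angle
        return current_speed - gain * angle_change

    # the ball dips slightly between glances, so the runner speeds up
    print(gaze_heuristic_step(current_speed=3.0, elevation_angle=0.48, prev_elevation_angle=0.50))  # ~3.1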
link |
00:27:01.440
Do you have a, just so we can zoom in as we talk about robotics, do you have a canonical
link |
00:27:06.840
problem, sort of a simple, clean, beautiful representative problem in robotics that you
link |
00:27:12.800
think about when you're thinking about some of these problems?
link |
00:27:16.000
We talked about robotic manipulation, to me that seems intuitively, at least the robotics
link |
00:27:23.600
community has converged towards that as a space that's the canonical problem.
link |
00:27:28.760
If you agree, then maybe do you zoom in in some particular aspect of that problem that
link |
00:27:33.240
you just like?
link |
00:27:34.240
Like if we solve that problem perfectly, it'll unlock a major step towards human level intelligence.
link |
00:27:44.040
I don't think I have like a really great answer to that.
link |
00:27:46.360
And I think partly the reason I don't have a great answer kind of has to do with the,
link |
00:27:53.040
it has to do with the fact that the difficulty is really in the flexibility and adaptability
link |
00:27:57.420
rather than in doing a particular thing really, really well.
link |
00:28:01.160
So it's hard to just say like, oh, if you can, I don't know, like shuffle a deck of
link |
00:28:06.680
cards as fast as like a Vegas casino dealer, then you'll be very proficient.
link |
00:28:12.920
It's really the ability to quickly figure out how to do some arbitrary new thing well
link |
00:28:21.120
enough to like, you know, to move on to the next arbitrary thing.
link |
00:28:26.160
But the source of newness and uncertainty, have you found problems in which it's easy
link |
00:28:33.680
to generate new newness?
link |
00:28:38.520
New types of newness.
link |
00:28:40.120
Yeah.
link |
00:28:41.120
So a few years ago, so if you had asked me this question around like 2016, maybe I would
link |
00:28:46.920
have probably said that robotic grasping is a really great example of that because it's
link |
00:28:51.840
a task with great real world utility.
link |
00:28:54.320
Like you will get a lot of money if you can do it well.
link |
00:28:57.320
What is robotic grasping?
link |
00:28:58.960
Picking up any object with a robotic hand.
link |
00:29:02.400
Exactly.
link |
00:29:03.400
So you will get a lot of money if you do it well, because lots of people want to run warehouses
link |
00:29:06.680
with robots and it's highly non trivial because very different objects will require very different
link |
00:29:13.360
grasping strategies.
link |
00:29:15.240
But actually since then, people have gotten really good at building systems to solve this
link |
00:29:19.740
problem to the point where I'm not actually sure how much more progress we can make with
link |
00:29:25.880
that as like the main guiding thing.
link |
00:29:29.560
But it's kind of interesting to see the kind of methods that have actually worked well
link |
00:29:32.960
in that space because robotic grasping classically used to be regarded very much as kind of almost
link |
00:29:39.760
like a geometry problem.
link |
00:29:41.400
So people who have studied the history of computer vision will find this very familiar
link |
00:29:46.620
that it's kind of in the same way that in the early days of computer vision, people
link |
00:29:49.760
thought of it very much as like an inverse graphics thing.
link |
00:29:52.480
In robotic grasping, people thought of it as an inverse physics problem essentially.
link |
00:29:57.000
You look at what's in front of you, figure out the shapes, then use your best estimate
link |
00:30:01.160
of the laws of physics to figure out where to put your fingers, and you pick up the thing.
link |
00:30:05.960
And it turns out that what works really well for robotic grasping, instantiated in many different
link |
00:30:10.360
recent works, including our own but also ones from many other labs, is to use learning
link |
00:30:15.960
methods with some combination of either exhaustive simulation or like actual real world trial
link |
00:30:21.200
and error.
link |
00:30:22.200
And it turns out that those things actually work really well and then you don't have to
link |
00:30:24.360
worry about solving geometry problems or physics problems.
link |
00:30:29.160
What are, just by the way, in the grasping, what are the difficulties that have been worked
link |
00:30:35.040
on?
link |
00:30:36.040
So one is like the materials of things, maybe occlusions on the perception side.
link |
00:30:41.080
Why is it such a difficult, why is picking stuff up such a difficult problem?
link |
00:30:45.360
Yeah, it's a difficult problem because the number of things that you might have to deal
link |
00:30:50.920
with or the variety of things that you have to deal with is extremely large.
link |
00:30:54.940
And oftentimes things that work for one class of objects won't work for other classes of
link |
00:30:59.680
objects.
link |
00:31:00.680
So if you, if you get really good at picking up boxes and now you have to pick up plastic
link |
00:31:05.400
bags, you know, you just need to employ a very different strategy.
link |
00:31:09.800
And there are many properties of objects, more than just their geometry, that have
link |
00:31:15.440
to do with, you know, the bits that are easier to pick up, the bits that are hard to pick
link |
00:31:19.580
up, the bits that are more flexible, the bits that will cause the thing to pivot and bend
link |
00:31:23.440
and drop out of your hand versus the bits that result in a nice secure grasp.
link |
00:31:28.000
Things that are flexible, things that if you pick them up the wrong way, they'll fall upside
link |
00:31:31.520
down and the contents will spill out.
link |
00:31:33.840
So there's all these little details that come up, but the task can still kind of be characterized
link |
00:31:38.820
as one task.
link |
00:31:39.820
Like there's a very clear notion of you did it or you didn't do it.
link |
00:31:43.800
So in terms of spilling things, there creeps in this notion that starts to sound and feel
link |
00:31:50.880
like common sense reasoning.
link |
00:31:53.060
Do you think solving the general problem of robotics requires common sense reasoning,
link |
00:32:01.720
requires general intelligence, this kind of human level capability of, you know, like
link |
00:32:09.440
you said, be robust and deal with uncertainty, but also be able to sort of reason and assimilate
link |
00:32:14.320
different pieces of knowledge that you have?
link |
00:32:17.120
Yeah.
link |
00:32:18.120
What are your thoughts on the need
link |
00:32:23.040
of common sense reasoning in the space of the general robotics problem?
link |
00:32:28.560
So I'm going to slightly dodge that question and say that I think maybe actually it's the
link |
00:32:32.520
other way around is that studying robotics can help us understand how to put common sense
link |
00:32:38.120
into our AI systems.
link |
00:32:40.600
One way to think about common sense is that, and why our current systems might lack common
link |
00:32:45.080
sense is that common sense is an emergent property of actually having to interact with
link |
00:32:51.640
a particular world, a particular universe, and get things done in that universe.
link |
00:32:56.120
So you might think that, for instance, like an image captioning system, maybe it looks
link |
00:33:01.420
at pictures of the world and it types out English sentences.
link |
00:33:05.880
So it kind of deals with our world.
link |
00:33:09.360
And then you can easily construct situations where image captioning systems do things that
link |
00:33:12.860
defy common sense, like give it a picture of a person wearing a fur coat and it'll say
link |
00:33:16.460
it's a teddy bear.
link |
00:33:18.560
But I think what's really happening in those settings is that the system doesn't actually
link |
00:33:22.800
live in our world.
link |
00:33:24.160
It lives in its own world that consists of pixels and English sentences and doesn't actually
link |
00:33:28.480
consist of having to put on a fur coat in the winter so you don't get cold.
link |
00:33:33.280
So perhaps the reason for the disconnect is that the systems that we have now simply inhabit
link |
00:33:39.860
a different universe.
link |
00:33:40.860
And if we build AI systems that are forced to deal with all of the messiness and complexity
link |
00:33:45.120
of our universe, maybe they will have to acquire common sense to essentially maximize their
link |
00:33:50.520
utility.
link |
00:33:51.680
Whereas the systems we're building now don't have to do that.
link |
00:33:53.600
They can take some shortcuts.
link |
00:33:56.560
That's fascinating.
link |
00:33:57.560
You've a couple of times already sort of reframed the role of robotics in this whole thing.
link |
00:34:02.400
And for some reason, I don't know if my way of thinking is common, but I thought like
link |
00:34:08.160
we need to understand and solve intelligence in order to solve robotics.
link |
00:34:13.240
And you're kind of framing it as, no, robotics is one of the best ways to just study artificial
link |
00:34:18.080
intelligence and build sort of like, robotics is like the right space in which you get to
link |
00:34:24.940
explore some of the fundamental learning mechanisms, fundamental sort of multimodal multitask aggregation
link |
00:34:33.880
of knowledge mechanisms that are required for general intelligence.
link |
00:34:36.760
It's a really interesting way to think about it, but let me ask about learning.
link |
00:34:41.580
Can the general sort of robotics, the epitome of the robotics problem be solved purely through
link |
00:34:47.000
learning, perhaps end to end learning, sort of learning from scratch as opposed to injecting
link |
00:34:55.860
human expertise and rules and heuristics and so on?
link |
00:35:00.120
I think that in terms of the spirit of the question, I would say yes.
link |
00:35:04.680
I mean, I think that though in some ways it's maybe like an overly sharp dichotomy, I think
link |
00:35:12.360
that in some ways when we build algorithms, at some point a person does something, a person
link |
00:35:20.120
turned on the computer, a person implemented TensorFlow.
link |
00:35:26.460
But yeah, I think that in terms of the point that you're getting at, I do think the answer
link |
00:35:29.840
is yes.
link |
00:35:30.840
I think that we can solve many problems that have previously required meticulous manual
link |
00:35:36.600
engineering through automated optimization techniques.
link |
00:35:40.120
And actually one thing I will say on this topic is I don't think this is actually a
link |
00:35:43.560
very radical or very new idea.
link |
00:35:45.200
I think people have been thinking about automated optimization techniques as a way to do control
link |
00:35:51.300
for a very, very long time.
link |
00:35:53.680
And in some ways what's changed is really more the name.
link |
00:35:58.040
So today we would say that, oh, my robot does machine learning, it does reinforcement learning.
link |
00:36:03.800
Maybe in the 1960s you'd say, oh, my robot is doing optimal control.
link |
00:36:08.520
And maybe the difference between typing out a system of differential equations and doing
link |
00:36:12.560
feedback linearization versus training a neural net, maybe it's not such a large difference.
link |
00:36:17.040
It's just pushing the optimization deeper and deeper into the thing.
link |
00:36:21.840
Well, it's interesting you think that way, but especially with deep learning that the
link |
00:36:28.360
accumulation of sort of experiences in data form to form deep representations starts to
link |
00:36:35.480
feel like knowledge as opposed to optimal control.
link |
00:36:38.880
So this feels like there's an accumulation of knowledge through the learning process.
link |
00:36:42.920
Yes.
link |
00:36:43.920
Yeah.
link |
00:36:44.920
So I think that is a good point.
link |
00:36:45.920
That one big difference between learning based systems and classic optimal control systems
link |
00:36:49.720
is that learning based systems in principle should get better and better the more they
link |
00:36:53.840
do something.
link |
00:36:54.840
Right.
link |
00:36:55.840
And I do think that that's actually a very, very powerful difference.
link |
00:36:58.160
So if we look back at the world of expert systems and symbolic AI and so on of using
link |
00:37:04.640
logic to accumulate expertise, human expertise, human encoded expertise, do you think that
link |
00:37:11.640
will have a role at some point?
link |
00:37:13.680
The deep learning, machine learning, reinforcement learning has shown incredible results and
link |
00:37:20.620
breakthroughs and just inspired thousands, maybe millions of researchers.
link |
00:37:26.620
But there's this less popular now, but it used to be popular idea of symbolic AI.
link |
00:37:32.680
Do you think that will have a role?
link |
00:37:35.240
I think in some ways the descendants of symbolic AI actually already have a role.
link |
00:37:44.740
So this is the highly biased history from my perspective.
link |
00:37:49.000
You say that, well, initially we thought that rational decision making involves logical
link |
00:37:53.920
manipulation.
link |
00:37:54.920
So you have some model of the world expressed in terms of logic.
link |
00:37:59.940
You have some query, like what action do I take in order for X to be true?
link |
00:38:04.760
And then you manipulate your logical symbolic representation to get an answer.
link |
00:38:08.520
What that turned into somewhere in the 1990s is, well, instead of building kind of predicates
link |
00:38:14.240
and statements that have true or false values, we'll build probabilistic systems where things
link |
00:38:20.800
have probabilities associated and probabilities of being true and false.
link |
00:38:23.160
And that turned into Bayes nets.
link |
00:38:25.280
And that provided sort of a boost to what were really still essentially logical inference
link |
00:38:30.440
systems, just probabilistic logical inference systems.
link |
00:38:33.240
And then people said, well, let's actually learn the individual probabilities inside
link |
00:38:37.940
these models.
link |
00:38:39.560
And then people said, well, let's not even specify the nodes in the models, let's just
link |
00:38:43.240
put a big neural net in there.
link |
00:38:45.500
But in many ways, I see these as actually kind of descendants from the same idea.
link |
00:38:48.960
It's essentially instantiating rational decision making by means of some inference process
link |
00:38:54.040
and learning by means of an optimization process.
link |
00:38:57.840
So in a sense, I would say, yes, that it has a place.
link |
00:39:00.320
And in many ways that place is, it already holds that place.
link |
00:39:04.480
It's already in there.
link |
00:39:05.480
Yeah.
link |
00:39:06.480
It's just quite different.
link |
00:39:07.480
It looks slightly different than it was before.
link |
00:39:09.000
Yeah.
link |
00:39:10.000
But there are some things that we can think about that make this a little bit more obvious.
link |
00:39:13.200
Like if I train a big neural net model to predict what will happen in response to my
link |
00:39:17.760
robot's actions, and then I run probabilistic inference, meaning I invert that model to
link |
00:39:22.880
figure out the actions that lead to some plausible outcome, like to me, that seems like a kind
link |
00:39:26.300
of logic.
link |
00:39:27.520
You have a model of the world that just happens to be expressed by a neural net, and you are
link |
00:39:32.000
doing some inference procedure, some sort of manipulation on that model to figure out
link |
00:39:37.880
the answer to a query that you have.
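(A minimal sketch of that inference-by-inversion idea, with random-shooting search standing in for the probabilistic inference; the toy dynamics and all names are assumptions, not a specific method.)

    import numpy as np

    def plan_through_model(dynamics_model, state, goal, horizon=10, n_candidates=1000, seed=0):
        # Invert a learned forward model by search: find the action sequence
        # whose predicted outcome best answers the query "end up near the goal".
        rng = np.random.default_rng(seed)
        best_cost, best_plan = np.inf, None
        for _ in range(n_candidates):
            actions = rng.uniform(-1.0, 1.0, size=(horizon, state.shape[0]))
            s = state
            for a in actions:
                s = dynamics_model(s, a)          # roll the model forward
            cost = np.linalg.norm(s - goal)       # distance of predicted outcome from the goal
            if cost < best_cost:
                best_cost, best_plan = cost, actions
        return best_plan

    # toy linear dynamics standing in for a trained neural-net model
    plan = plan_through_model(lambda s, a: s + 0.1 * a, np.zeros(2), np.array([1.0, 0.0]))
    print(plan.shape)  # (10, 2)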
link |
00:39:39.680
It's the interpretability.
link |
00:39:41.160
It's the explainability, though, that seems to be lacking more so because the nice thing
link |
00:39:46.100
about sort of expert systems is you can follow the reasoning of the system that to us mere
link |
00:39:52.200
humans is somehow compelling.
link |
00:39:56.320
It's just I don't know what to make of this fact that there's a human desire for intelligent
link |
00:40:04.020
systems to be able to convey in a poetic way to us why it made the decisions it did, like
link |
00:40:12.680
tell a convincing story.
link |
00:40:15.520
And perhaps that's like a silly human thing, like we shouldn't expect that of intelligent
link |
00:40:22.720
systems.
link |
00:40:23.720
I'm super happy that there are intelligent systems out there.
link |
00:40:27.800
But if I were to sort of psychoanalyze the researchers at the time, I would say expert
link |
00:40:33.640
systems connected to that part, that desire of AI researchers for systems to be explainable.
link |
00:40:40.120
I mean, maybe on that topic, do you have a hope that sort of inferences of learning based
link |
00:40:48.000
systems will be as explainable as the dream was with expert systems, for example?
link |
00:40:55.040
I think it's a very complicated question because I think that in some ways the question of
link |
00:40:59.120
explainability is kind of very closely tied to the question of like performance, like,
link |
00:41:07.440
you know, why do you want your system to explain itself so that when it screws up, you can
link |
00:41:11.520
kind of figure out why it did it.
link |
00:41:14.960
But in some ways that's a much bigger problem, actually.
link |
00:41:17.360
Like your system might screw up and then it might screw up in how it explains itself.
link |
00:41:22.880
Or you might have some bug somewhere so that it's not actually doing what it was supposed
link |
00:41:26.640
to do.
link |
00:41:27.640
So, you know, maybe a good way to view that problem is really as a problem, as a bigger
link |
00:41:32.360
problem of verification and validation, of which explainability is sort of one component.
link |
00:41:38.640
I see.
link |
00:41:39.640
I just see it differently.
link |
00:41:41.200
I see explainability, you put it beautifully, I think you actually summarize the field of
link |
00:41:45.400
explainability.
link |
00:41:46.400
But to me, there's another aspect of explainability, which is like storytelling that has nothing
link |
00:41:52.880
to do with errors or with, like, it uses errors as elements of its story as opposed to a fundamental
link |
00:42:05.120
need to be explainable when errors occur.
link |
00:42:08.240
It's just that for other intelligent systems to be in our world, we seem to want to tell
link |
00:42:12.520
each other stories.
link |
00:42:14.800
And that's true in the political world, that's true in the academic world.
link |
00:42:19.840
And that, you know, neural networks are less capable of doing that, or perhaps they're
link |
00:42:24.480
equally capable of storytelling.
link |
00:42:26.920
Maybe it doesn't matter what the fundamentals of the system are.
link |
00:42:30.360
You just need to be a good storyteller.
link |
00:42:32.900
Maybe one specific story I can tell you about in that space is actually about some work
link |
00:42:38.240
that was done by my former collaborator, who's now a professor at MIT named Jacob Andreas.
link |
00:42:43.360
Jacob actually works in natural language processing, but he had this idea to do a little bit of
link |
00:42:47.280
work in reinforcement learning on how natural language can basically structure the internals
link |
00:42:53.360
of policies trained with RL.
link |
00:42:55.880
And one of the things he did is he set up a model that attempts to perform some task
link |
00:43:01.360
that's defined by a reward function, but the model reads in a natural language instruction.
link |
00:43:06.560
So this is a pretty common thing to do in instruction following.
link |
00:43:08.880
So you tell it like, you know, go to the red house and then it's supposed to go to the red house.
link |
00:43:13.640
But then one of the things that Jacob did is he treated that sentence, not as a command
link |
00:43:18.300
from a person, but as a representation of the internal kind of a state of the mind of
link |
00:43:25.600
this policy, essentially.
link |
00:43:26.680
So that when it was faced with a new task, what it would do is it would basically try
link |
00:43:30.320
to think of possible language descriptions, attempt to do them and see if they led to
link |
00:43:34.760
the right outcome.
link |
00:43:35.760
So it would kind of think out loud, like, you know, I'm faced with this new task.
link |
00:43:38.680
What am I going to do?
link |
00:43:39.680
Let me go to the red house.
link |
00:43:40.680
Oh, that didn't work.
link |
00:43:41.680
Let me go to the blue room or something.
link |
00:43:43.840
Let me go to the green plant.
link |
00:43:45.560
And once it got some reward, it would say, oh, go to the green plant.
link |
00:43:47.700
That's what's working.
link |
00:43:48.700
I'm going to go to the green plant.
link |
00:43:49.700
And then you could look at the string that it came up with, and that was a description
link |
00:43:51.800
of how it thought it should solve the problem.
link |
00:43:54.480
So you could do, you could basically incorporate language as internal state and you can start
link |
00:43:58.800
getting some handle on these kinds of things.
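A rough sketch of that search-over-descriptions idea (the env, the candidate instruction set, and the instruction_policy interface here are hypothetical stand-ins, not the actual setup from that work): the agent tries candidate language descriptions, keeps the one that earns reward, and the winning string doubles as a readable account of its strategy.

```python
def solve_new_task(env, candidate_instructions, instruction_policy, episodes=5):
    """Try each natural-language description and keep the one that earns reward.

    env                    : assumed to expose reset() -> obs and step(action) -> (obs, reward, done)
    candidate_instructions : strings like "go to the red house"
    instruction_policy     : assumed callable (instruction, obs) -> action
    """
    best_instruction, best_return = None, float("-inf")
    for instruction in candidate_instructions:
        total = 0.0
        for _ in range(episodes):
            obs, done = env.reset(), False
            while not done:
                obs, reward, done = env.step(instruction_policy(instruction, obs))
                total += reward
        if total > best_return:
            best_instruction, best_return = instruction, total
    # The chosen string is the "internal state" that describes how the task was solved.
    return best_instruction
```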
link |
00:44:01.000
And then what I was kind of trying to get to is that also, if you add to the reward
link |
00:44:05.400
function, the convincingness of that story.
link |
00:44:10.160
So I have another reward signal of like people who review that story, how much they like
link |
00:44:15.640
it.
link |
00:44:16.640
So that, you know, initially that could be a hyperparameter sort of hard coded heuristic
link |
00:44:22.880
type of thing, but it's an interesting notion of the convincingness of the story becoming
link |
00:44:30.420
part of the reward function, the objective function of the explainability.
link |
00:44:34.160
That's in the world of sort of Twitter and fake news, that might be a scary notion that
link |
00:44:40.800
the nature of truth may not be as important as how convincing
link |
00:44:45.640
you are in telling the story around the facts.
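As a toy illustration of the proposal (not anything from published work): the reviewer score of the explanation simply becomes one more term in the reward.

```python
def combined_reward(task_reward, story_score, weight=0.1):
    # story_score: hypothetical rating of how convincing reviewers found the explanation;
    # weight could start as a hard-coded heuristic, as suggested above.
    return task_reward + weight * story_score
```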
link |
00:44:49.880
Well, let me ask the basic question.
link |
00:44:55.480
You're one of the world class researchers in reinforcement learning, deep reinforcement
link |
00:44:58.700
learning, certainly in the robotic space.
link |
00:45:01.920
What is reinforcement learning?
link |
00:45:04.500
I think that what reinforcement learning refers to today is really just the kind of the modern
link |
00:45:09.960
incarnation of learning based control.
link |
00:45:13.100
So classically reinforcement learning has a much more narrow definition, which is that
link |
00:45:16.420
it's literally learning from reinforcement, like the thing does something and then it
link |
00:45:20.520
gets a reward or punishment.
link |
00:45:22.760
But really I think the way the term is used today is it's used to refer more broadly to
link |
00:45:26.680
learning based control.
link |
00:45:28.280
So some kind of system that's supposed to be controlling something and it uses data
link |
00:45:33.460
to get better.
link |
00:45:34.800
And what does control mean?
link |
00:45:35.920
So this action is the fundamental element there.
link |
00:45:38.520
It means making rational decisions.
link |
00:45:41.140
And rational decisions are decisions that maximize a measure of utility.
link |
00:45:44.420
And sequentially, so you made decisions time and time and time again.
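In symbols, that sequential, utility-maximizing view is the standard expected-return objective (a textbook formulation written out for reference, not quoted from the conversation):

\[
\max_{\pi}\; \mathbb{E}\!\left[\sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t)\right],
\qquad a_t \sim \pi(\cdot \mid s_t), \quad s_{t+1} \sim p(\cdot \mid s_t, a_t).
\]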
link |
00:45:48.360
Now like it's easier to see that kind of idea in the space of maybe games and the space
link |
00:45:54.820
of robotics.
link |
00:45:55.820
Do you see it bigger than that?
link |
00:45:58.880
Is it applicable?
link |
00:45:59.880
Like where are the limits of the applicability of reinforcement learning?
link |
00:46:04.280
Yeah, so rational decision making is essentially the encapsulation of the AI problem viewed
link |
00:46:12.120
through a particular lens.
link |
00:46:13.120
So any problem that we would want a machine to do, an intelligent machine, can likely
link |
00:46:18.560
be represented as a decision making problem.
link |
00:46:20.960
Classifying images is a decision making problem, although not a sequential one typically.
link |
00:46:26.760
Controlling a chemical plant is a decision making problem.
link |
00:46:30.680
Deciding what videos to recommend on YouTube is a decision making problem.
link |
00:46:34.640
And one of the really appealing things about reinforcement learning is if it does encapsulate
link |
00:46:39.800
the range of all these decision making problems, perhaps working on reinforcement learning
link |
00:46:43.760
is one of the ways to reach a very broad swath of AI problems.
link |
00:46:50.480
What is the fundamental difference between reinforcement learning and maybe supervised
link |
00:46:55.720
machine learning?
link |
00:46:57.840
So reinforcement learning can be viewed as a generalization of supervised machine learning.
link |
00:47:02.840
You can certainly cast supervised learning as a reinforcement learning problem.
link |
00:47:05.680
You can just say your loss function is the negative of your reward.
link |
00:47:09.120
But you have stronger assumptions.
link |
00:47:10.120
You have the assumption that someone actually told you what the correct answer was, that
link |
00:47:14.560
your data was IID and so on.
link |
00:47:16.040
So you could view reinforcement learning as essentially relaxing some of those assumptions.
link |
00:47:20.400
Now that's not always a very productive way to look at it because if you actually have
link |
00:47:22.800
a supervised learning problem, you'll probably solve it much more effectively by using supervised
link |
00:47:26.760
learning methods because it's easier.
link |
00:47:29.600
But you can view reinforcement learning as a generalization of that.
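To make that reduction concrete, here is a minimal sketch (an illustration, not anything stated in the conversation beyond "loss equals negative reward"): a labeled example becomes a one-step episode whose reward is the negative of the supervised loss.

```python
import numpy as np

def supervised_as_one_step_rl(model, x, y, loss_fn):
    """Cast one labeled example (x, y) as a bandit-style, one-step RL interaction.

    The 'state' is the input x, the 'action' is the model's prediction,
    and the reward is simply the negative of the supervised loss.
    """
    action = model(x)             # the prediction plays the role of an action
    reward = -loss_fn(action, y)  # reward = -loss, as described above
    return reward

# Toy usage with a linear model and squared error:
w = np.array([0.5, -1.0])
model = lambda x: float(w @ x)
loss_fn = lambda pred, target: (pred - target) ** 2
print(supervised_as_one_step_rl(model, np.array([1.0, 2.0]), 0.3, loss_fn))
```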
link |
00:47:32.560
No, for sure.
link |
00:47:33.560
But they're fundamentally different.
link |
00:47:36.040
That's a mathematical statement.
link |
00:47:37.320
That's absolutely correct.
link |
00:47:38.960
But it seems that reinforcement learning, the kind of tools we bring to the table today
link |
00:47:43.480
are different.
link |
00:47:44.480
So maybe down the line, everything will be a reinforcement learning problem.
link |
00:47:49.080
Just like you said, image classification should be mapped to a reinforcement learning problem.
link |
00:47:53.760
But today, the tools and ideas, the way we think about them are different, sort of supervised
link |
00:48:01.000
learning has been used very effectively to solve basic narrow AI problems.
link |
00:48:07.080
Reinforcement learning kind of represents the dream of AI.
link |
00:48:11.680
It's very much so in the research space now in sort of captivating the imagination of
link |
00:48:17.240
people of what we can do with intelligent systems, but it hasn't yet had as wide of
link |
00:48:22.960
an impact as the supervised learning approaches.
link |
00:48:25.520
So my question comes from the more practical sense, like what do you see is the gap between
link |
00:48:32.520
the more general reinforcement learning and the very specific kind of decision
link |
00:48:38.480
making, with one step in the sequence, that is supervised learning?
link |
00:48:43.200
So from a practical standpoint, I think that one thing that is potentially a little tough
link |
00:48:49.040
now, and this is I think something that we'll see, this is a gap that we might see closing
link |
00:48:53.000
over the next couple of years, is the ability of reinforcement learning algorithms to effectively
link |
00:48:57.680
utilize large amounts of prior data.
link |
00:49:00.600
So one of the reasons why it's a bit difficult today to use reinforcement learning for all
link |
00:49:05.440
the things that we might want to use it for is that in most of the settings where we want
link |
00:49:10.120
to do rational decision making, it's a little bit tough to just deploy some policy that
link |
00:49:15.200
does crazy stuff and learns purely through trial and error.
link |
00:49:18.960
It's much easier to collect a lot of data, a lot of logs of some other policy that you've
link |
00:49:23.260
got, and then maybe if you can get a good policy out of that, then you deploy it and
link |
00:49:28.360
let it kind of fine tune a little bit.
link |
00:49:30.880
But algorithmically, it's quite difficult to do that.
link |
00:49:33.520
So I think that once we figure out how to get reinforcement learning to bootstrap effectively
link |
00:49:37.940
from large data sets, then we'll see very, very rapid growth in applications of these
link |
00:49:44.160
technologies.
link |
00:49:45.160
So this is what's referred to as off policy reinforcement learning or offline RL or batch
link |
00:49:48.800
RL.
link |
00:49:50.080
And I think we're seeing a lot of research right now that's bringing us closer and closer
link |
00:49:53.640
to that.
link |
00:49:54.640
Can you maybe paint the picture of the different methods?
link |
00:49:57.160
So you said off policy, what's value based reinforcement learning?
link |
00:50:02.000
What's policy based?
link |
00:50:03.000
What's model based?
link |
00:50:04.000
What's off policy, on policy?
link |
00:50:05.000
What are the different categories of reinforcement learning?
link |
00:50:07.600
Okay.
link |
00:50:08.600
So one way we can think about reinforcement learning is that it's, in some very fundamental
link |
00:50:14.360
way, it's about learning models that can answer kind of what if questions.
link |
00:50:20.200
So what would happen if I take this action that I hadn't taken before?
link |
00:50:24.360
And you do that, of course, from experience, from data.
link |
00:50:26.840
And oftentimes you do it in a loop.
link |
00:50:28.400
So you build a model that answers these what if questions, use it to figure out the best
link |
00:50:32.900
action you can take, and then go and try taking that and see if the outcome agrees with what
link |
00:50:36.720
you predicted.
link |
00:50:38.880
So the different kinds of techniques basically refer to different ways of doing it.
link |
00:50:43.320
So model based methods answer a question of what state you would get, basically what would
link |
00:50:48.840
happen to the world if you were to take a certain action.
link |
00:50:50.960
Value based methods, they answer the question of what value you would get, meaning what
link |
00:50:55.080
utility you would get.
link |
00:50:57.180
But in a sense, they're not really all that different because they're both really just
link |
00:51:00.940
answering these what if questions.
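Schematically, the two families can be seen as two different answerers of the same "what if" question; the interfaces below are illustrative stand-ins, not any particular published algorithm.

```python
import numpy as np

class DynamicsModel:
    """Model-based view: answers 'what state would I get?' for (state, action)."""
    def __init__(self, A, B):
        self.A, self.B = A, B                     # toy linear dynamics matrices
    def what_if(self, state, action):
        return self.A @ state + self.B @ action   # predicted next state

class QFunction:
    """Value-based view: answers 'what utility would I get?' for (state, action)."""
    def __init__(self, table):
        self.table = table                        # e.g. a dict for a small discrete problem
    def what_if(self, state, action):
        return self.table.get((state, action), 0.0)
```

Both objects are fit from experience and both are queried counterfactually, which is exactly where the difficulty described next comes from.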
link |
00:51:03.360
Now unfortunately for us, with current machine learning methods, answering what if questions
link |
00:51:07.240
can be really hard because they are really questions about things that didn't happen.
link |
00:51:12.520
If you wanted to answer what if questions about things that did happen, you wouldn't
link |
00:51:14.960
need a learned model.
link |
00:51:15.960
You would just like repeat the thing that worked before.
link |
00:51:19.080
And that's really a big part of why RL is a little bit tough.
link |
00:51:23.480
So if you have a purely on policy kind of online process, then you ask these what if
link |
00:51:28.960
questions, you make some mistakes, then you go and try doing those mistaken things.
link |
00:51:33.280
And then you observe kind of the counter examples that will teach you not to do those things
link |
00:51:36.640
again.
link |
00:51:37.760
If you have a bunch of off policy data and you just want to synthesize the best policy
link |
00:51:42.240
you can out of that data, then you really have to deal with the challenges of making
link |
00:51:46.760
these counterfactual predictions.
link |
00:51:47.760
First of all, what's a policy?
link |
00:51:50.520
A policy is a model or some kind of function that maps from observations of the world to
link |
00:51:59.200
actions.
link |
00:52:00.200
So in reinforcement learning, we often refer to the current configuration of the world
link |
00:52:05.360
as the state.
link |
00:52:06.360
So we say the state kind of encompasses everything you need to fully define where the world is
link |
00:52:10.000
at the moment.
link |
00:52:11.560
And depending on how we formulate the problem, we might say you either get to see the state
link |
00:52:15.200
or you get to see an observation, which is some snapshot, some piece of the state.
link |
00:52:19.840
So policy just includes everything in it in order to be able to act in this world.
link |
00:52:25.880
Yes.
link |
00:52:26.880
And so what does off policy mean?
link |
00:52:29.200
Yeah, so the terms on policy and off policy refer to how you get your data.
link |
00:52:33.560
So if you get your data from somebody else who was doing some other stuff, maybe you
link |
00:52:37.480
get your data from some manually programmed system that was just running in the world
link |
00:52:43.760
before that's referred to as off policy data.
link |
00:52:46.640
But if you got the data by actually acting in the world based on what your current policy
link |
00:52:50.200
thinks is good, we call that on policy data.
link |
00:52:53.420
And obviously on policy data is more useful to you because if your current policy makes
link |
00:52:58.120
some bad decisions, you will actually see that those decisions are bad.
link |
00:53:01.860
Off policy data, however, might be much easier to obtain because maybe that's all the logged
link |
00:53:06.040
data that you have from before.
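A small sketch to pin down the terminology, assuming a generic env/policy interface (the reset/step signature here is an assumption, not a specific library):

```python
import pickle

def collect_on_policy(env, policy, steps):
    """On-policy: the data comes from acting with the current policy."""
    data, obs = [], env.reset()
    for _ in range(steps):
        action = policy(obs)
        next_obs, reward, done = env.step(action)   # assumed (obs, reward, done) interface
        data.append((obs, action, reward, next_obs))
        obs = env.reset() if done else next_obs
    return data

def load_off_policy(log_path):
    """Off-policy: the data was logged earlier by some other controller or policy."""
    with open(log_path, "rb") as f:
        return pickle.load(f)                       # e.g. a list of transition tuples
```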
link |
00:53:08.680
So we talked about offline, we talked about autonomous vehicles, so you can envision off policy kind
link |
00:53:14.920
of approaches in robotic spaces where there's already a ton of robots out there, but they
link |
00:53:19.880
don't get the luxury of being able to explore based on a reinforcement learning framework.
link |
00:53:26.360
So how do we make, again, open question, but how do we make off policy methods work?
link |
00:53:32.040
Yeah.
link |
00:53:33.040
So this is something that has been kind of a big open problem for a while.
link |
00:53:37.140
And in the last few years, people have made a little bit of progress on that.
link |
00:53:41.800
You know, I can tell you about, and it's not by any means solved yet, but I can tell you
link |
00:53:44.740
some of the things that, for example, we've done to try to address some of the challenges.
link |
00:53:49.680
It turns out that one really big challenge with off policy reinforcement learning is
link |
00:53:53.640
that you can't really trust your models to give accurate predictions for any possible
link |
00:53:59.680
action.
link |
00:54:00.680
So if I've never tried to, if in my data set I never saw somebody steering the car off
link |
00:54:05.880
the road onto the sidewalk, my value function or my model is probably not going to predict
link |
00:54:11.240
the right thing if I ask what would happen if I were to steer the car off the road onto
link |
00:54:14.480
the sidewalk.
link |
00:54:15.680
So one of the important things you have to do to get off policy RL to work is you have
link |
00:54:20.600
to be able to figure out whether a given action will result in a trustworthy prediction or
link |
00:54:24.600
not.
link |
00:54:25.600
And you can use a kind of distribution estimation methods, kind of density estimation methods
link |
00:54:31.240
to try to figure that out.
link |
00:54:32.240
So you could figure out that, well, this action, my model is telling me that it's great, but
link |
00:54:35.920
it looks totally different from any action I've taken before, so my model is probably
link |
00:54:38.680
not correct.
link |
00:54:39.680
And you can incorporate regularization terms into your learning objective that will essentially
link |
00:54:45.200
tell you not to ask those questions that your model is unable to answer.
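Read as pseudocode, that recipe might look roughly like this (a simplified sketch in the spirit of behavior-regularized offline RL, with a diagonal Gaussian standing in for a real density model):

```python
import numpy as np

def fit_action_density(actions):
    """Crude stand-in for a density model: a diagonal Gaussian over dataset actions."""
    mean, std = actions.mean(axis=0), actions.std(axis=0) + 1e-6
    def log_prob(action):
        z = (action - mean) / std
        return float(-0.5 * np.sum(z ** 2 + np.log(2.0 * np.pi * std ** 2)))
    return log_prob

def regularized_score(q_value, action, behavior_log_prob, alpha=1.0):
    """Prefer actions the value model likes, but only if they resemble the data.
    The second term is the regularizer that avoids asking questions the model cannot answer."""
    return q_value(action) + alpha * behavior_log_prob(action)
```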
link |
00:54:50.880
What would lead to breakthroughs in this space, do you think?
link |
00:54:54.040
Like what's needed?
link |
00:54:55.480
Is this a data set question?
link |
00:54:57.240
Do we need to collect big benchmark data sets that allow us to explore the space?
link |
00:55:03.780
Is it new kinds of methodologies?
link |
00:55:08.560
Like what's your sense?
link |
00:55:09.960
Or maybe coming together in a space of robotics and defining the right problem to be working
link |
00:55:14.160
on?
link |
00:55:15.160
I think for off policy reinforcement learning in particular, it's very much an algorithms
link |
00:55:18.200
question right now.
link |
00:55:19.880
And this is something that I think is great because an algorithms question is one that
link |
00:55:25.320
just takes some very smart people to get together and think about it really hard, whereas if
link |
00:55:29.800
it was like a data problem or a hardware problem, that would take some serious engineering.
link |
00:55:34.780
So that's why I'm pretty excited about that problem because I think that we're in a position
link |
00:55:38.340
where we can make some real progress on it just by coming up with the right algorithms.
link |
00:55:42.200
In terms of which algorithms they could be, the problems at their core are very related
link |
00:55:47.900
to problems in things like causal inference.
link |
00:55:51.640
Because what you're really dealing with is situations where you have a model, a statistical
link |
00:55:55.960
model, that's trying to make predictions about things that it hadn't seen before.
link |
00:56:00.620
And if it's a model that's generalizing properly, that'll make good predictions.
link |
00:56:04.840
If it's a model that picks up on spurious correlations, that will not generalize properly.
link |
00:56:09.000
And then you have an arsenal of tools you can use.
link |
00:56:11.100
You could, for example, figure out what are the regions where it's trustworthy, or on
link |
00:56:15.200
the other hand, you could try to make it generalize better somehow, or some combination of the
link |
00:56:18.760
two.
link |
00:56:20.800
Is there room for mixing where most of it, like 90, 95% is off policy, you already have
link |
00:56:30.160
the data set, and then you get to send the robot out to do a little exploration?
link |
00:56:36.360
What's that role of mixing them together?
link |
00:56:38.880
Yeah, absolutely.
link |
00:56:39.880
I think that this is something that you actually described very well at the beginning of our
link |
00:56:45.320
discussion when you talked about the iceberg.
link |
00:56:47.480
This is the iceberg.
link |
00:56:48.480
The 99% of your prior experience, that's your iceberg.
link |
00:56:51.720
You'd use that for off policy reinforcement learning.
link |
00:56:54.160
And then, of course, if you've never opened that particular kind of door with that particular
link |
00:56:59.240
lock before, then you have to go out and fiddle with it a little bit.
link |
00:57:02.120
And that's that additional 1% to help you figure out a new task.
link |
00:57:05.320
And I think that's actually a pretty good recipe going forward.
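Written as a loop, that recipe might look like this (a schematic outline; the agent's train_offline, act, and update_online methods are assumed placeholders, not a real API):

```python
def iceberg_recipe(offline_dataset, env, agent, online_steps=1000):
    # 1) The iceberg: learn as much as possible from logged prior experience.
    agent.train_offline(offline_dataset)            # assumed offline / off-policy update
    # 2) The visible tip: a small amount of real interaction to handle whatever the
    #    prior data did not cover, like the unfamiliar door with the unfamiliar lock.
    obs = env.reset()
    for _ in range(online_steps):
        action = agent.act(obs)
        next_obs, reward, done = env.step(action)
        agent.update_online(obs, action, reward, next_obs)   # fine-tuning update
        obs = env.reset() if done else next_obs
    return agent
```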
link |
00:57:08.200
Is this, to you, the most exciting space of reinforcement learning now?
link |
00:57:12.840
Or, maybe taking a step back, not just now, but what, to you, is
link |
00:57:18.240
the most beautiful idea, apologize for the romanticized question, but the beautiful idea
link |
00:57:23.240
or concept in reinforcement learning?
link |
00:57:27.280
In general, I actually think that one of the things that is a very beautiful idea in reinforcement
link |
00:57:32.640
learning is just the idea that you can obtain a near optimal control or near optimal policy
link |
00:57:41.800
without actually having a complete model of the world.
link |
00:57:45.640
This is, you know, it's something that feels perhaps kind of obvious if you just hear the
link |
00:57:53.080
term reinforcement learning or you think about trial and error learning.
link |
00:57:55.880
But from a controls perspective, it's a very weird thing because classically, you know,
link |
00:58:01.800
we think about engineered systems and controlling engineered systems as the problem of writing
link |
00:58:07.480
down some equations and then figuring out given these equations, you know, basically
link |
00:58:11.000
solve for X, figure out the thing that maximizes its performance.
link |
00:58:16.820
And the theory of reinforcement learning actually gives us a mathematically principled framework
link |
00:58:21.360
to think, to reason about, you know, optimizing some quantity when you don't actually know
link |
00:58:27.080
the equations that govern that system.
link |
00:58:28.900
And to me, that actually seems kind of, you know, very elegant, not something
link |
00:58:35.040
that sort of becomes immediately obvious, at least in the mathematical sense.
link |
00:58:40.160
Does it make sense to you that it works at all?
link |
00:58:42.960
Well, I think it makes sense when you take some time to think about it, but it is a little
link |
00:58:48.360
surprising.
link |
00:58:49.360
Well, then taking a step into the more deeper representations, which is also very surprising
link |
00:58:56.720
of sort of the richness of the state space, the space of environments that this kind of
link |
00:59:04.840
approach can operate in, can you maybe say what is deep reinforcement learning?
link |
00:59:10.480
Well, deep reinforcement learning simply refers to taking reinforcement learning algorithms
link |
00:59:16.100
and combining them with high capacity neural net representations.
link |
00:59:20.520
Which is, you know, kind of, it might at first seem like a pretty arbitrary thing, just take
link |
00:59:24.140
these two components and stick them together.
link |
00:59:26.560
But the reason that it's something that has become so important in recent years is that
link |
00:59:32.320
reinforcement learning, it kind of faces an exacerbated version of a problem that has
link |
00:59:38.160
faced many other machine learning techniques.
link |
00:59:40.080
So if we go back to like, you know, the early two thousands or the late nineties, we'll
link |
00:59:45.360
see a lot of research on machine learning methods that have some very appealing mathematical
link |
00:59:50.780
properties, like they reduce to convex optimization problems, for instance, but they require very
link |
00:59:56.220
special inputs.
link |
00:59:57.220
They require a representation of the input that is clean in some way.
link |
01:00:01.600
Like for example, clean in the sense that the classes in your multi class classification
link |
01:00:06.320
problems separate linearly.
link |
01:00:07.720
So they have some kind of good representation and we call this a feature representation.
link |
01:00:12.560
And for a long time, people were very worried about features in the world of supervised
link |
01:00:15.520
learning because somebody had to actually build those features so you couldn't just
link |
01:00:18.560
take an image and plug it into your logistic regression or your SVM or something.
link |
01:00:22.920
You had to take that image and process it using some handwritten code.
link |
01:00:26.840
And then neural nets came along and they could actually learn the features and suddenly we
link |
01:00:30.900
could apply learning directly to the raw inputs, which was great for images, but it was even
link |
01:00:35.360
more great for all the other fields where people hadn't come up with good features yet.
link |
01:00:40.020
And one of those fields was actually reinforcement learning, because in reinforcement learning,
link |
01:00:43.400
the notion of features, if you don't use neural nets and you have to design your own features
link |
01:00:46.840
is very, very opaque.
link |
01:00:48.580
Like it's very hard to imagine, let's say I'm playing chess or go.
link |
01:00:53.920
What is a feature with which I can represent the value function for go or even the optimal
link |
01:00:58.760
policy for go linearly?
link |
01:00:59.760
Like I don't even know how to start thinking about it.
link |
01:01:03.100
And people tried all sorts of things that would write down, you know, an expert chess
link |
01:01:06.040
player looks for whether the knight is in the middle of the board or not.
link |
01:01:09.160
So that's a feature: knight in middle of board.
link |
01:01:11.760
And they would write these like long lists of kind of arbitrary made up stuff.
link |
01:01:15.960
And that was really kind of getting us nowhere.
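For contrast, a purely illustrative sketch of the two styles: a linear value function over hand-invented features versus a small network that consumes the raw board and learns its own features. The feature functions and the tiny ReLU stack are stand-ins, not anyone's actual chess evaluator.

```python
import numpy as np

# Hypothetical hand-written features of the kind described above.
def knight_in_center(board):
    return float(board[3:5, 3:5].any())

def material_balance(board):
    return float(board.sum())

def handcrafted_value(board, weights):
    """Old style: a person invents the features, the model is just linear on top."""
    features = np.array([knight_in_center(board), material_balance(board)])
    return float(weights @ features)

def neural_value(board, layers):
    """Deep RL style: the raw board goes in and the features are learned inside."""
    h = board.astype(float).ravel()
    for W, b in layers:                   # (weight matrix, bias) pairs of a small MLP
        h = np.maximum(0.0, W @ h + b)    # ReLU layers as a stand-in for a real network
    return float(h.sum())
```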
link |
01:01:17.680
And that's a little, chess is a little more accessible than the robotics problem.
link |
01:01:21.960
Absolutely.
link |
01:01:22.960
Right.
link |
01:01:23.960
There's at least experts in the different features for chess, but still like the neural
link |
01:01:30.340
network there, to me, that's, I mean, you put it eloquently and almost made it seem
link |
01:01:35.700
like a natural step to add neural networks, but the fact that neural networks are able
link |
01:01:41.000
to discover features in the control problem, it's very interesting.
link |
01:01:45.640
It's hopeful.
link |
01:01:46.640
I'm not sure what to think about it, but it feels hopeful that the control problem has
link |
01:01:51.880
features to be learned.
link |
01:01:54.680
Like I guess my question is, is it surprising to you how far the deep side of deep reinforcement
link |
01:02:02.360
learning was able to go, what space of problems it has been able to tackle, especially
link |
01:02:07.560
in games with AlphaStar and AlphaZero and just the representation power there and in
link |
01:02:17.600
the robotics space and what is your sense of the limits of this representation power
link |
01:02:23.120
in the control context?
link |
01:02:26.120
I think that in regard to the limits that here, I think that one thing that makes it
link |
01:02:32.900
a little hard to fully answer this question is because in settings where we would like
link |
01:02:39.380
to push these things to the limit, we encounter other bottlenecks.
link |
01:02:44.040
So like the reason that I can't get my robot to learn how to like, I don't know, do the
link |
01:02:51.480
dishes in the kitchen, it's not because its neural net is not big enough.
link |
01:02:56.220
It's because when you try to actually do trial and error learning, reinforcement learning,
link |
01:03:02.680
directly in the real world where you have the potential to gather these large, highly
link |
01:03:07.840
varied and complex data sets, you start running into other problems.
link |
01:03:11.720
Like one problem you run into very quickly, it'll first sound like a very pragmatic problem,
link |
01:03:16.920
but it actually turns out to be a pretty deep scientific problem.
link |
01:03:19.480
Take the robot, put it in your kitchen, have it try to learn to do the dishes with trial
link |
01:03:22.320
and error.
link |
01:03:23.320
It'll break all your dishes and then we'll have no more dishes to clean.
link |
01:03:27.120
Now you might think this is a very practical issue, but there's something to this, which
link |
01:03:30.080
is that if you have a person trying to do this, a person will have some degree of common
link |
01:03:33.720
sense.
link |
01:03:34.720
They'll break one dish, they'll be a little more careful with the next one, and if they
link |
01:03:37.360
break all of them, they're going to go and get more or something like that.
link |
01:03:41.200
So there's all sorts of scaffolding that comes very naturally to us for our learning process.
link |
01:03:46.800
Like if I have to learn something through trial and error, I have the common sense to
link |
01:03:50.720
know that I have to try multiple times.
link |
01:03:53.120
If I screw something up, I ask for help or I reset things or something like that.
link |
01:03:57.440
And all of that is kind of outside of the classic reinforcement learning problem formulation.
link |
01:04:02.100
There are other things that can also be categorized as kind of scaffolding, but are very important.
link |
01:04:07.360
Like for example, where do you get your reward function?
link |
01:04:09.520
If I want to learn how to pour a cup of water, well, how do I know if I've done it correctly?
link |
01:04:15.360
Now that probably requires an entire computer vision system to be built just to determine
link |
01:04:18.840
that, and that seems a little bit inelegant.
link |
01:04:21.220
So there are all sorts of things like this that start to come up when we think through
link |
01:04:24.460
what we really need to get reinforcement learning to happen at scale in the real world.
link |
01:04:28.560
And many of these things actually suggest a little bit of a shortcoming in the problem
link |
01:04:32.320
formulation and a few deeper questions that we have to resolve.
link |
01:04:36.240
That's really interesting.
link |
01:04:37.240
I talked to David Silver about AlphaZero, and it seems like there's no, again, we haven't
link |
01:04:45.440
hit the limit at all in the context where there's no broken dishes.
link |
01:04:50.200
So in the case of Go, you can, it's really about just scaling compute.
link |
01:04:55.080
So again, like the bottleneck is the amount of money you're willing to invest in compute
link |
01:05:00.760
and then maybe the different, the scaffolding around how difficult it is to scale compute
link |
01:05:06.160
maybe, but there, there's no limit.
link |
01:05:09.000
And it's interesting, now we'll move to the real world and there's the broken dishes,
link |
01:05:12.640
there's all the, and the reward function, like you mentioned, that's really nice.
link |
01:05:17.080
So what, how do we push forward there?
link |
01:05:19.920
Do you think there's, there's this kind of a sample efficiency question that people bring
link |
01:05:25.680
up of, you know, not having to break a hundred thousand dishes.
link |
01:05:30.740
Is this an algorithm question?
link |
01:05:33.020
Is this a data selection like question?
link |
01:05:37.680
What do you think?
link |
01:05:38.680
How do we, how do we not break too many dishes?
link |
01:05:41.320
Yeah.
link |
01:05:42.320
Well, one way we can think about that is that maybe we need to be better at reusing
link |
01:05:51.360
our data, building that iceberg.
link |
01:05:54.080
So perhaps it's too much to hope that you can have a machine that, in isolation,
link |
01:06:02.560
in a vacuum without anything else, can just master complex tasks in, like, minutes the
link |
01:06:07.280
way that people do, but perhaps it also doesn't have to, perhaps what it really needs to do
link |
01:06:10.840
is have an existence, a lifetime where it does many things and the previous things that
link |
01:06:16.240
it has done, prepare it to do new things more efficiently.
link |
01:06:20.400
And you know, the study of these kinds of questions typically falls under categories
link |
01:06:24.260
like multitask learning or meta learning, but they all fundamentally deal with the same
link |
01:06:29.200
general theme, which is use experience for doing other things to learn to do new things
link |
01:06:35.640
efficiently and quickly.
link |
01:06:37.240
So what do you think about if we just look at the one particular case study of a Tesla
link |
01:06:41.880
autopilot that is quickly approaching a million vehicles on the road, where some
link |
01:06:48.520
percentage of the time, 30, 40% of the time is driven using the computer vision, multitask
link |
01:06:54.440
HydraNet, right?
link |
01:06:57.960
And then the other percent, that's what they call it, HydraNet.
link |
01:07:03.040
The other percent is human controlled.
link |
01:07:06.360
In the human side, how can we use that data?
link |
01:07:09.920
What's your sense?
link |
01:07:12.920
What's the signal?
link |
01:07:13.920
Do you have ideas in this autonomous vehicle space, where people can lose their lives?
link |
01:07:17.900
You know, it's a safety critical environment.
link |
01:07:21.560
So how do we use that data?
link |
01:07:23.960
So I think that actually the kind of problems that come up when we want systems that are
link |
01:07:33.000
reliable and that can kind of understand the limits of their capabilities, they're actually
link |
01:07:37.040
very similar to the kind of problems that come up when we're doing off policy reinforcement
link |
01:07:40.680
learning.
link |
01:07:41.680
So as I mentioned before, in off policy reinforcement learning, the big problem is you need to know
link |
01:07:46.120
when you can trust the predictions of your model, because if you're trying to evaluate
link |
01:07:50.880
some pattern of behavior for which your model doesn't give you an accurate prediction, then
link |
01:07:54.240
you shouldn't use that to modify your policy.
link |
01:07:57.360
It's actually very similar to the problem that we're faced when we actually then deploy
link |
01:08:00.200
that thing and we want to decide whether we trust it in the moment or not.
link |
01:08:05.120
So perhaps we just need to do a better job of figuring out that part, and that's a very
link |
01:08:08.360
deep research question, of course, but it's also a question that a lot of people are working
link |
01:08:11.460
on.
link |
01:08:12.460
So I'm pretty optimistic that we can make some progress on that over the next few years.
link |
01:08:15.920
What's the role of simulation in reinforcement learning, deep reinforcement learning, reinforcement
link |
01:08:20.400
learning?
link |
01:08:21.400
Like how essential is it?
link |
01:08:23.000
It's been essential for the breakthroughs so far for some interesting breakthroughs.
link |
01:08:28.160
Do you think it's a crutch that we rely on?
link |
01:08:31.440
I mean, again, this connects to our off policy discussion, but do you think we can ever get
link |
01:08:37.360
rid of simulation or do you think simulation will actually take over?
link |
01:08:40.160
We'll create more and more realistic simulations that will allow us to solve actual real world
link |
01:08:46.000
problems, like transfer the models we learn in simulation to real world problems.
link |
01:08:49.960
I think that simulation is a very pragmatic tool that we can use to get a lot of useful
link |
01:08:54.360
stuff to work right now, but I think that in the long run, we will need to build machines
link |
01:09:00.000
that can learn from real data because that's the only way that we'll get them to improve
link |
01:09:03.400
perpetually because if we can't have our machines learn from real data, if they have to rely
link |
01:09:08.680
on simulated data, eventually the simulator becomes the bottleneck.
link |
01:09:11.680
In fact, this is a general thing.
link |
01:09:13.560
If your machine has any bottleneck that is built by humans and that doesn't improve from
link |
01:09:19.120
data, it will eventually be the thing that holds it back.
link |
01:09:23.400
And if you're entirely reliant on your simulator, that'll be the bottleneck.
link |
01:09:25.900
If you're entirely reliant on a manually designed controller, that's going to be the bottleneck.
link |
01:09:30.520
So simulation is very useful.
link |
01:09:32.160
It's very pragmatic, but it's not a substitute for being able to utilize real experience.
link |
01:09:39.840
And by the way, this is something that I think is quite relevant now, especially in the context
link |
01:09:44.600
of some of the things we've discussed, because some of these kind of scaffolding issues that
link |
01:09:48.840
I mentioned, things like the broken dishes and the unknown reward function, like these
link |
01:09:52.000
are not problems that you would ever stumble on when working in a purely simulated kind
link |
01:09:57.700
of environment, but they become very apparent when we try to actually run these things in
link |
01:10:01.720
the real world.
link |
01:10:02.720
To throw a brief wrench into our discussion, let me ask, do you think we're living in a
link |
01:10:07.080
simulation?
link |
01:10:08.080
Oh, I have no idea.
link |
01:10:09.080
Do you think that's a useful thing to even think about, about the fundamental physics
link |
01:10:15.960
nature of reality?
link |
01:10:18.880
Or another perspective, the reason I think the simulation hypothesis is interesting is
link |
01:10:24.520
to think about how difficult is it to create sort of a virtual reality game type situation
link |
01:10:33.080
that will be sufficiently convincing to us humans or sufficiently enjoyable that we wouldn't
link |
01:10:38.760
want to leave.
link |
01:10:39.760
I mean, that's actually a practical engineering challenge.
link |
01:10:43.560
And I personally really enjoy virtual reality, but it's quite far away.
link |
01:10:47.820
I kind of think about what would it take for me to want to spend more time in virtual reality
link |
01:10:52.520
versus the real world.
link |
01:10:55.320
And that's a sort of a nice clean question because at that point, if I want to live in
link |
01:11:03.920
a virtual reality, that means we're just a few years away where a majority of the population
link |
01:11:08.040
lives in a virtual reality.
link |
01:11:09.040
And that's how we create the simulation, right?
link |
01:11:11.480
You don't need to actually simulate the quantum gravity and just every aspect of the universe.
link |
01:11:19.860
And that's an interesting question for reinforcement learning too, is if we want to make sufficiently
link |
01:11:24.800
realistic simulations that may blur the difference between sort of the real world and the simulation,
link |
01:11:32.520
whereby some of the things we've been talking about, some of the problems, go away
link |
01:11:37.640
if we can create actually interesting, rich simulations.
link |
01:11:40.840
It's an interesting question.
link |
01:11:41.840
And it actually, I think your question casts your previous question in a very interesting
link |
01:11:46.320
light, because in some ways asking whether we can, well, the more kind of practical version
link |
01:11:53.560
is like, you know, can we build simulators that are good enough to train essentially
link |
01:11:57.600
AI systems that will work in the world?
link |
01:12:02.200
And it's kind of interesting to think about this, about what this implies, if true, it
link |
01:12:06.440
kind of implies that it's easier to create the universe than it is to create a brain.
link |
01:12:11.260
And that seems like, put this way, it seems kind of weird.
link |
01:12:14.520
The aspect of the simulation most interesting to me is the simulation of other humans.
link |
01:12:21.120
That seems to be a complexity that makes the robotics problem harder.
link |
01:12:27.980
Now I don't know if every robotics person agrees with that notion.
link |
01:12:32.040
Just as a quick aside, what are your thoughts about when the human enters the picture of
link |
01:12:38.040
the robotics problem?
link |
01:12:39.960
How does that change the reinforcement learning problem, the learning problem in general?
link |
01:12:44.560
Yeah, I think that's a, it's a kind of a complex question.
link |
01:12:48.720
And I guess my hope for a while had been that if we build these robotic learning systems
link |
01:12:56.680
that are multitask, that utilize lots of prior data and that learn from their own experience,
link |
01:13:03.280
the bit where they have to interact with people will be perhaps handled in much the same way
link |
01:13:07.480
as all the other bits.
link |
01:13:08.840
So if they have prior experience of interacting with people and they can learn from their
link |
01:13:12.440
own experience of interacting with people for this new task, maybe that'll be enough.
link |
01:13:16.640
Now, of course, if it's not enough, there are many other things we can do and there's
link |
01:13:20.700
quite a bit of research in that area.
link |
01:13:22.880
But I think it's worth a shot to see whether the multi agent interaction, the ability to
link |
01:13:29.400
understand that other beings in the world have their own goals and intentions and thoughts
link |
01:13:35.220
and so on, whether that kind of understanding can emerge automatically from simply learning
link |
01:13:41.580
to do things with them and maximize utility.
link |
01:13:44.160
That information arises from the data.
link |
01:13:46.940
You've said something about gravity, that you don't need to explicitly inject anything
link |
01:13:53.400
into the system.
link |
01:13:54.400
They can be learned from the data.
link |
01:13:55.840
And gravity is an example of something that could be learned from data, so like the physics
link |
01:13:59.740
of the world.
link |
01:14:05.300
What are the limits of what we can learn from data?
link |
01:14:08.520
Do you really think we can?
link |
01:14:10.460
So a very simple, clean way to ask that is, do you really think we can learn gravity from
link |
01:14:15.600
just data, the idea, the laws of gravity?
link |
01:14:19.920
So something that I think is a common kind of pitfall when thinking about prior knowledge
link |
01:14:25.720
and learning is to assume that just because we know something, it's better to
link |
01:14:33.360
tell the machine about that rather than have it figure it out on its own.
link |
01:14:36.880
In many cases, things that are important that affect many of the events that the machine
link |
01:14:44.060
will experience are actually pretty easy to learn.
link |
01:14:48.360
If every time you drop something, it falls down, yeah, you might get the Newton's version,
link |
01:14:54.320
not Einstein's version, but it'll be pretty good and it will probably be sufficient for
link |
01:14:58.680
you to act rationally in the world because you see the phenomenon all the time.
link |
01:15:03.320
So things that are readily apparent from the data, we might not need to specify those by
link |
01:15:07.640
hand.
link |
01:15:08.640
It might actually be easier to let the machine figure them out.
link |
01:15:10.320
It just feels like that there might be a space of many local minima in terms of theories
link |
01:15:17.400
of this world that we would discover and get stuck on, that Newtonian mechanics is not necessarily
link |
01:15:25.760
easy to come by.
link |
01:15:27.320
Yeah.
link |
01:15:28.320
And in fact, in some fields of science, for example, human civilization is itself full
link |
01:15:33.040
of these local optima.
link |
01:15:34.040
So for example, if you think about how people tried to figure out biology and medicine for
link |
01:15:40.520
the longest time, the kind of rules, the kind of principles that serve us very well in our
link |
01:15:45.800
day to day lives actually serve us very poorly in understanding medicine and biology.
link |
01:15:50.160
We had kind of very superstitious and weird ideas about how the body worked until the
link |
01:15:55.320
advent of the modern scientific method.
link |
01:15:58.020
So that does seem to be a failing of this approach, but it's also a failing of human
link |
01:16:02.080
intelligence arguably.
link |
01:16:04.380
Maybe a small aside, but some, you know, the idea of self play is fascinating in reinforcement
link |
01:16:09.680
learning sort of these competitive, creating a competitive context in which agents can
link |
01:16:14.840
play against each other in a, sort of at the same skill level and thereby increasing each
link |
01:16:20.340
other's skill level.
link |
01:16:21.340
It seems to be this kind of self improving mechanism is exceptionally powerful in the
link |
01:16:26.320
context where it could be applied.
link |
01:16:29.020
First of all, is that beautiful to you that this mechanism work as well as it does?
link |
01:16:34.920
And also can we generalize to other contexts like in the robotic space or anything that's
link |
01:16:41.880
applicable to the real world?
link |
01:16:43.840
I think that it's a very interesting idea, but I suspect that the bottleneck to actually
link |
01:16:51.560
generalizing it to the robotic setting is actually going to be the same as the bottleneck
link |
01:16:56.240
for everything else that we need to be able to build machines that can get better and
link |
01:17:01.200
better through natural interaction with the world.
link |
01:17:04.760
And once we can do that, then they can go out and play with, they can play with each
link |
01:17:08.400
other, they can play with people, they can play with the natural environment.
link |
01:17:13.040
But before we get there, we've got all these other problems we've got, we have to get out
link |
01:17:16.040
of the way.
link |
01:17:17.040
So there's no shortcut around that.
link |
01:17:18.040
You have to interact with the natural environment.
link |
01:17:21.160
Well because in a, in a self play setting, you still need a mediating mechanism.
link |
01:17:24.660
So the, the reason that, you know, self play works for a board game is because the rules
link |
01:17:30.080
of that board game mediate the interaction between the agents.
link |
01:17:33.780
So the kind of intelligent behavior that will emerge depends very heavily on the nature
link |
01:17:37.760
of that mediating mechanism.
link |
01:17:39.920
So on the side of reward functions, that's coming up with good reward functions seems
link |
01:17:44.360
to be the thing that we associate with general intelligence, like human beings seem to value
link |
01:17:50.760
the idea of developing our own reward functions of, you know, at arriving at meaning and so
link |
01:17:57.000
on.
link |
01:17:58.440
And yet for reinforcement learning, we often kind of specify that's the given.
link |
01:18:02.840
What's your sense of how we develop reward, you know, good reward functions?
link |
01:18:08.360
Yeah, I think that's a very complicated and very deep question.
link |
01:18:12.160
And you're completely right that classically in reinforcement learning, this question,
link |
01:18:16.520
I guess, has kind of been treated as a non issue, that you sort of treat the reward as this
link |
01:18:21.420
external thing that comes from some other bit of your biology and you kind of don't
link |
01:18:27.360
worry about it.
link |
01:18:28.520
And I do think that that's actually, you know, a little bit of a mistake that we should worry
link |
01:18:32.520
about it.
link |
01:18:33.520
And we can approach it in a few different ways.
link |
01:18:34.920
We can approach it, for instance, by thinking of rewards as a communication medium.
link |
01:18:39.040
We can say, well, how does a person communicate to a robot what its objective is?
link |
01:18:43.400
You can approach it also as a sort of more of an intrinsic motivation medium.
link |
01:18:47.720
You could say, can we write down kind of a general objective that leads to good capability?
link |
01:18:55.200
Like for example, can you write down some objectives such that even in the absence of
link |
01:18:58.000
any other task, if you maximize that objective, you'll sort of learn useful things.
link |
01:19:02.680
This is something that has sometimes been called unsupervised reinforcement learning,
link |
01:19:07.040
which I think is a really fascinating area of research, especially today.
link |
01:19:11.600
We've done a bit of work on that recently.
link |
01:19:13.040
One of the things we've studied is whether we can have some notion of unsupervised reinforcement
link |
01:19:19.920
learning by means of, you know, information theoretic quantities, like for instance, minimizing
link |
01:19:25.160
a Bayesian measure of surprise.
link |
01:19:26.660
This is an idea that was, you know, pioneered actually in the computational neuroscience
link |
01:19:30.160
community by folks like Carl Friston.
link |
01:19:32.900
And we've done some work recently that shows that you can actually learn pretty interesting
link |
01:19:35.980
skills by essentially behaving in a way that allows you to make accurate predictions about
link |
01:19:41.920
the world.
link |
01:19:42.920
Like do the things that will lead to you getting the right answer for prediction.
link |
01:19:48.840
But you can, you know, by doing this, you can sort of discover stable niches in the
link |
01:19:52.960
world.
link |
01:19:53.960
You can discover that if you're playing Tetris, then correctly, you know, clearing the rows
link |
01:19:57.940
will let you play Tetris for longer and keep the board nice and clean, which sort of satisfies
link |
01:20:01.840
some desire for order in the world.
link |
01:20:04.180
And as a result, get some degree of leverage over your domain.
link |
01:20:07.400
So we're exploring that pretty actively.
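A bare-bones sketch of that kind of intrinsic objective (a deliberately simplified illustration, not the actual method from the work mentioned): reward the agent with the log-likelihood of what actually happened under its own predictive model, i.e. the negative of its surprise.

```python
import numpy as np

def surprise_minimizing_reward(predicted_mean, predicted_std, observed_next_state):
    """Intrinsic reward = log-likelihood of the observed outcome under the agent's
    own predictive model, so maximizing it means minimizing surprise."""
    z = (observed_next_state - predicted_mean) / predicted_std
    log_prob = -0.5 * np.sum(z ** 2 + np.log(2.0 * np.pi * predicted_std ** 2))
    return float(log_prob)

# An agent maximizing this tends to seek out stable, predictable niches,
# e.g. keeping the Tetris board clean so the future stays easy to forecast.
```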
link |
01:20:08.800
Is there a role for a human notion of curiosity in itself being the reward, sort of discovering
link |
01:20:15.960
new things about the world?
link |
01:20:19.880
So one of the things that I'm pretty interested in is actually whether discovering new things
link |
01:20:26.000
can actually be an emergent property of some other objective that quantifies capability.
link |
01:20:30.760
So new things for the sake of new things might not by itself be the right
link |
01:20:36.440
answer, but perhaps we can figure out an objective for which discovering new things is actually
link |
01:20:42.280
the natural consequence.
link |
01:20:44.480
That's something we're working on right now, but I don't have a clear answer for you there
link |
01:20:47.400
yet that's still a work in progress.
link |
01:20:49.640
You mean just that it's a curious observation to see sort of creative patterns of curiosity
link |
01:20:57.640
on the way to optimize for a particular task?
link |
01:21:00.980
On the way to optimize for a particular measure of capability.
link |
01:21:05.520
Is there ways to understand or anticipate unexpected unintended consequences of particular
link |
01:21:15.040
reward functions, sort of anticipate the kind of strategies that might be developed and
link |
01:21:22.280
try to avoid highly detrimental strategies?
link |
01:21:27.120
So classically, this is something that has been pretty hard in reinforcement learning
link |
01:21:30.260
because it's difficult for a designer to have good intuition about, you know, what a learning
link |
01:21:35.380
algorithm will come up with when they give it some objective.
link |
01:21:38.960
There are ways to mitigate that.
link |
01:21:40.340
One way to mitigate it is to actually define an objective that says like, don't do weird
link |
01:21:45.240
stuff.
link |
01:21:46.240
You can actually quantify it.
link |
01:21:47.240
You can say just like, don't enter situations that have low probability under the distribution
link |
01:21:52.340
of states you've seen before.
link |
01:21:54.720
It turns out that that's actually one very good way to do off policy reinforcement learning
link |
01:21:57.840
actually.
link |
01:21:59.560
So we can do some things like that.
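Schematically, that "don't do weird stuff" term could be folded into the reward like this (a toy formulation, assuming some density model state_log_prob fit to previously visited states; the threshold and penalty weight are made-up knobs):

```python
def shaped_reward(task_reward, state, state_log_prob, threshold=-10.0, penalty=1.0):
    """Penalize entering states that are improbable under the distribution of states
    seen before, as a crude guard against unintended strategies."""
    novelty = max(0.0, threshold - state_log_prob(state))   # how far below 'familiar' we are
    return task_reward - penalty * novelty
```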
link |
01:22:02.500
If we slowly venture in speaking about reward functions into greater and greater levels
link |
01:22:08.360
of intelligence, there's, I mean, Stuart Russell thinks about this, the alignment of AI systems
link |
01:22:16.280
with us humans.
link |
01:22:18.160
So how do we ensure that AGI systems align with us humans?
link |
01:22:23.040
It's kind of a reward function question of specifying the behavior of AI systems such
link |
01:22:32.320
that their success aligns with this, with the broader intended success interest of human
link |
01:22:39.640
beings.
link |
01:22:40.640
Do you have thoughts on this?
link |
01:22:41.640
Do you have kind of concerns of where reinforcement learning fits into this, or are you really
link |
01:22:45.840
focused on the current moment of us being quite far away and trying to solve the robotics
link |
01:22:50.840
problem?
link |
01:22:51.840
I don't have a great answer to this, but, you know, and I do think that this is a problem
link |
01:22:56.780
that's important to figure out.
link |
01:22:59.520
For my part, I'm actually a bit more concerned about the other side of the, of this equation
link |
01:23:04.520
that, you know, maybe rather than unintended consequences for objectives that are specified
link |
01:23:11.920
too well, I'm actually more worried right now about unintended consequences for objectives
link |
01:23:15.980
that are not optimized well enough, which might become a very pressing problem when
link |
01:23:21.480
we, for instance, try to use these techniques for safety critical systems like cars and
link |
01:23:26.520
aircraft and so on.
link |
01:23:28.520
I think at some point we'll face the issue of objectives being optimized too well, but
link |
01:23:32.360
right now I think we're, we're more likely to face the issue of them not being optimized
link |
01:23:36.240
well enough.
link |
01:23:37.240
But you don't think unintended consequences can arise even when you're far from optimality,
link |
01:23:41.360
sort of like on the path to it?
link |
01:23:43.200
Oh no, I think unintended consequences can absolutely arise.
link |
01:23:46.960
It's just, I think right now the bottleneck for improving reliability, safety and things
link |
01:23:52.000
like that is more with systems that like need to work better, that need to optimize their
link |
01:23:57.400
objectives better.
link |
01:23:58.400
Do you have thoughts, concerns about existential threats of human level intelligence that have,
link |
01:24:05.360
if we put on our hat of looking in 10, 20, 100, 500 years from now, do you have concerns
link |
01:24:11.700
about existential threats of AI systems?
link |
01:24:15.720
I think there are absolutely existential threats for AI systems, just like there are for any
link |
01:24:19.400
powerful technology.
link |
01:24:22.480
But I think that the, these kinds of problems can take many forms and, and some of those
link |
01:24:28.240
forms will come down to, you know, people with nefarious intent.
link |
01:24:34.200
Some of them will come down to AI systems that have some fatal flaws.
link |
01:24:38.960
And some of them will, will of course come down to AI systems that are too capable in
link |
01:24:42.380
some way.
link |
01:24:44.740
But among this set of potential concerns, I would actually be much more concerned about
link |
01:24:50.320
the first two right now, and principally the one with nefarious humans, because, you know,
link |
01:24:55.040
just through all of human history, actually it's the nefarious humans that have been the
link |
01:24:57.160
problem, not the nefarious machines, than I am about the others.
link |
01:25:01.680
And I think that right now the best that I can do to make sure things go well is to build
link |
01:25:07.080
the best technology I can and also hopefully promote responsible use of that technology.
link |
01:25:13.820
Do you think RL Systems has something to teach us humans?
link |
01:25:19.000
You said nefarious humans getting us in trouble.
link |
01:25:21.080
I mean, machine learning systems have in some ways have revealed to us the ethical flaws
link |
01:25:26.960
in our data.
link |
01:25:27.960
In that same kind of way, can reinforcement learning teach us about ourselves?
link |
01:25:32.680
Has it taught something?
link |
01:25:34.480
What have you learned about yourself from trying to build robots and reinforcement learning
link |
01:25:40.600
systems?
link |
01:25:42.920
I'm not sure what I've learned about myself, but maybe part of the answer to your question
link |
01:25:49.960
might become a little bit more apparent once we see more widespread deployment of reinforcement
link |
01:25:55.180
learning for decision making support in domains like healthcare, education, social media,
link |
01:26:02.720
etc.
link |
01:26:03.720
And I think we will see some interesting stuff emerge there.
link |
01:26:06.720
We will see, for instance, what kind of behaviors these systems come up with in situations where
link |
01:26:12.800
there is interaction with humans and where they have a possibility of influencing human
link |
01:26:17.840
behavior.
link |
01:26:18.840
I think we're not quite there yet, but maybe in the next few years we'll see some interesting
link |
01:26:22.360
stuff come out in that area.
link |
01:26:23.800
I hope outside the research space, because the exciting space where this could be observed
link |
01:26:28.880
is sort of large companies that deal with large data, and I hope there's some transparency.
link |
01:26:35.200
One of the things that's unclear when I look at social networks and just online is why
link |
01:26:40.400
an algorithm did something or whether even an algorithm was involved.
link |
01:26:45.200
And that'd be interesting from a research perspective, just to observe the results of
link |
01:26:52.080
algorithms, to open up that data, or to at least be sufficiently transparent about the
link |
01:26:58.320
behavior of these AI systems in the real world.
link |
01:27:02.280
What's your sense?
link |
01:27:03.280
I don't know if you looked at the blog post, Bitter Lesson, by Rich Sutton, where it looks
link |
01:27:08.380
at sort of the big lesson of researching AI and reinforcement learning is that simple
link |
01:27:16.520
methods, general methods that leverage computation seem to work well.
link |
01:27:21.480
So basically don't try to do any kind of fancy algorithms, just wait for computation to get
link |
01:27:26.280
fast.
link |
01:27:28.480
Do you share this kind of intuition?
link |
01:27:31.160
I think the high level idea makes a lot of sense.
link |
01:27:34.200
I'm not sure that my takeaway would be that we don't need to work on algorithms.
link |
01:27:37.480
I think that my takeaway would be that we should work on general algorithms.
link |
01:27:43.800
And actually, I think that this idea of needing to better automate the acquisition of experience
link |
01:27:52.360
in the real world actually follows pretty naturally from Rich Sutton's conclusion.
link |
01:27:58.780
So if the claim is that automated general methods plus data leads to good results, then
link |
01:28:06.600
it makes sense that we should build general methods and we should build the kind of methods
link |
01:28:09.760
that we can deploy and get them to go out there and collect their experience autonomously.
link |
01:28:14.440
I think that one place where the current state of things falls a little
link |
01:28:19.200
bit short of that is actually going out there and collecting the data autonomously,
link |
01:28:23.560
which is easy to do in a simulated board game, but very hard to do in the real world.
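(A minimal, hypothetical sketch, in Python, of the kind of loop being described here: a general learning method that goes out, collects its own experience, and keeps improving from it. The env and agent objects and their reset, step, act, store, and update methods are assumed stand-ins loosely following the common Gym-style convention, not anything specified in the conversation.)

    def lifelong_learning_loop(env, agent, num_steps):
        # The agent gathers its own experience and never stops learning from it.
        obs = env.reset()
        for _ in range(num_steps):
            action = agent.act(obs)                            # general policy, no task-specific hand engineering
            next_obs, reward, done, _info = env.step(action)   # interaction with the (real or simulated) world
            agent.store(obs, action, reward, next_obs, done)   # experience is collected autonomously
            agent.update()                                     # more experience should mean a better policy
            obs = env.reset() if done else next_obs
        return agent

The hard part, as noted above, is that env.step is cheap in a simulated board game but slow, risky, and expensive when the environment is the physical world.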
link |
01:28:27.440
Yeah, it keeps coming back to this one problem, right?
link |
01:28:31.840
Your mind is focused there now, on this real-world problem.
link |
01:28:35.800
It just seems scary, the step of collecting the data, and it seems unclear to me how we
link |
01:28:43.840
can do it effectively.
link |
01:28:44.840
Well, you know, seven billion people in the world, each of them had to do that at some
link |
01:28:49.360
point in their lives.
link |
01:28:51.040
And we should leverage that experience that they've all accumulated.
link |
01:28:54.860
We should be able to try to collect that kind of data.
link |
01:28:58.440
Okay, big questions.
link |
01:29:02.760
Maybe stepping back through your life, what book or books, technical or fiction or philosophical,
link |
01:29:10.480
had a big impact on the way you saw the world, on the way you thought about the world,
link |
01:29:15.840
your life in general?
link |
01:29:19.480
And maybe what books, if it's different, would you recommend people consider reading on their
link |
01:29:24.160
own intellectual journey?
link |
01:29:26.320
It could be within reinforcement learning, but it could be very much bigger.
link |
01:29:30.280
I don't know if this is like a scientifically, like, particularly meaningful answer.
link |
01:29:39.360
But like, the honest answer is that I actually found a lot of the work by Isaac Asimov to
link |
01:29:45.800
be very inspiring when I was younger.
link |
01:29:47.720
I don't know if that has anything to do with AI necessarily.
link |
01:29:50.840
You don't think it had a ripple effect in your life?
link |
01:29:53.380
Maybe it did.
link |
01:29:56.200
But yeah, I think that a vision of a future where, first of all, artificial intelligence
link |
01:30:06.800
systems, artificial robotic systems, have, you know, kind of a
link |
01:30:10.880
big place, a big role in society, and where we try to imagine the limiting
link |
01:30:18.560
case of technological advancement and how that might play out in our future history.
link |
01:30:25.640
But yeah, I think that that was in some way influential.
link |
01:30:30.720
I don't really know how.
link |
01:30:33.720
I would recommend it.
link |
01:30:34.720
I mean, if nothing else, you'd be well entertained.
link |
01:30:37.040
When did you yourself first fall in love with the idea of artificial intelligence,
link |
01:30:41.840
get captivated by this field?
link |
01:30:45.080
So my honest answer here is actually that I only really started to think about it as
link |
01:30:52.280
something that I might want to do pretty late, in graduate school.
link |
01:30:56.200
And a big part of that was that until, you know, somewhere around 2009, 2010, it just
link |
01:31:02.400
wasn't really high on my priority list because I didn't think that it was something where
link |
01:31:06.920
we were going to see very substantial advances in my lifetime.
link |
01:31:11.560
And you know, maybe in terms of my career, the time when I really decided I wanted to
link |
01:31:18.120
work on this was when I actually took a seminar course that was taught by Professor Andrew
link |
01:31:23.480
Ng.
link |
01:31:24.480
And, you know, at that point, I, of course, had like a decent understanding of the technical
link |
01:31:29.320
things involved.
link |
01:31:30.320
But one of the things that really resonated with me was when he said in the opening lecture
link |
01:31:33.640
something to the effect of like, well, he used to have graduate students come to him
link |
01:31:37.140
and talk about how they want to work on AI, and he would kind of chuckle and give them
link |
01:31:40.920
some math problem to deal with.
link |
01:31:42.600
But now he's actually thinking that this is an area where we might see like substantial
link |
01:31:45.940
advances in our lifetime.
link |
01:31:47.840
And that kind of got me thinking because, you know, in some abstract sense, yeah, like
link |
01:31:52.280
you can kind of imagine that, but in a very real sense, when someone who had been working
link |
01:31:56.940
on that kind of stuff their whole career suddenly says that, yeah, like that had some effect
link |
01:32:02.520
on me.
link |
01:32:03.520
Yeah, this might be a special moment in the history of the field.
link |
01:32:08.040
That this is where we might see some interesting breakthroughs.
link |
01:32:14.060
So in the space of advice, somebody who's interested in getting started in machine learning
link |
01:32:19.120
or reinforcement learning, what advice would you give to maybe an undergraduate student
link |
01:32:23.720
or maybe even younger? What are the first steps to take, and further on, what are the
link |
01:32:30.520
steps to take on that journey?
link |
01:32:32.800
So something that I think is important to do is to not be afraid to like spend time
link |
01:32:43.160
imagining the kind of outcome that you might like to see.
link |
01:32:46.280
So you know, one outcome might be a successful career, a large paycheck or something, or
link |
01:32:51.480
state of the art results on some benchmark, but hopefully that's not the thing that's
link |
01:32:54.920
like the main driving force for somebody.
link |
01:32:57.760
But I think that if someone who is a student considering a career in AI like takes a little
link |
01:33:04.360
while, sits down and thinks like, what do I really want to see?
link |
01:33:07.420
What do I want to see a machine do?
link |
01:33:09.120
What do I want to see a robot do?
link |
01:33:10.320
What do I want to do?
link |
01:33:11.320
What do I want to see a natural language system do? Which is like, imagine, you know, imagine
link |
01:33:15.200
it almost like a commercial for a future product or something or like, like something that
link |
01:33:19.040
you'd like to see in the world and then actually sit down and think about the steps that are
link |
01:33:23.520
necessary to get there.
link |
01:33:25.160
And hopefully that thing is not a better number on ImageNet classification.
link |
01:33:29.000
It's like, it's probably like an actual thing that we can't do today that would be really
link |
01:33:32.000
awesome.
link |
01:33:33.000
Whether it's a robot butler or, you know, a really awesome healthcare decision making
link |
01:33:38.280
support system, whatever it is that you find inspiring.
link |
01:33:41.760
And I think that thinking about that and then backtracking from there and imagining the
link |
01:33:45.240
steps needed to get there will actually lead to much better research.
link |
01:33:48.240
It'll lead to rethinking the assumptions.
link |
01:33:50.480
It'll lead to working on the bottlenecks that other people aren't working on.
link |
01:33:55.880
And then, naturally turning to you, we've talked about reward functions and you just gave
link |
01:34:01.080
advice on looking forward, how you'd like to see, what kind of change you would like
link |
01:34:05.440
to make in the world.
link |
01:34:06.920
What do you think, ridiculous, big question, what do you think is the meaning of life?
link |
01:34:11.560
What is the meaning of your life?
link |
01:34:13.480
What gives you fulfillment, purpose, happiness and meaning?
link |
01:34:20.540
That's a very big question.
link |
01:34:24.600
What's the reward function under which you are operating?
link |
01:34:27.640
Yeah.
link |
01:34:28.640
I think one thing that does give, you know, if not meaning, at least satisfaction is some
link |
01:34:33.600
degree of confidence that I'm working on a problem that really matters.
link |
01:34:37.400
I feel like it's less important to me to like actually solve a problem, but it's quite nice
link |
01:34:42.960
to pick things to spend my time on that I believe really matter.
link |
01:34:49.400
And I try pretty hard to look for that.
link |
01:34:53.080
I don't know if it's easy to answer this, but if you're successful, what does that look
link |
01:34:59.160
like?
link |
01:35:00.160
What's the big dream?
link |
01:35:01.880
Now, of course, success is built on top of success and you keep going forever, but what
link |
01:35:09.840
is the dream?
link |
01:35:10.840
Yeah.
link |
01:35:11.840
So one very concrete thing or maybe as concrete as it's going to get here is to see machines
link |
01:35:18.040
that actually get better and better the longer they exist in the world.
link |
01:35:23.420
And on the surface, one might even think that that's something
link |
01:35:26.820
that we have today, but I think we really don't.
link |
01:35:28.840
I think that there is an unending complexity in the universe, and to date, all of the machines
link |
01:35:38.480
that we've been able to build don't sort of improve up to the limit of that complexity.
link |
01:35:44.200
They hit a wall somewhere.
link |
01:35:45.660
Maybe they hit a wall because they're in a simulator that is only a very limited,
link |
01:35:50.260
very pale imitation of the real world, or they hit a wall because they rely on a labeled
link |
01:35:54.320
data set, but they never hit the wall of like running out of stuff to see.
link |
01:36:00.400
So I'd like to build a machine that can go as far as possible.
link |
01:36:04.920
Runs up against the ceiling of the complexity of the universe.
link |
01:36:08.160
Yes.
link |
01:36:09.160
Well, I don't think there's a better way to end it, Sergey.
link |
01:36:12.000
Thank you so much.
link |
01:36:13.000
It's a huge honor.
link |
01:36:14.000
I can't wait to see the amazing work that you have yet to publish, and in the education space,
link |
01:36:20.280
in terms of reinforcement learning.
link |
01:36:21.820
Thank you for inspiring the world.
link |
01:36:23.000
Thank you for the great research you do.
link |
01:36:24.720
Thank you.
link |
01:36:25.720
Thanks for listening to this conversation with Sergey Levine and thank you to our sponsors,
link |
01:36:31.000
Cash App and ExpressVPN.
link |
01:36:33.560
Please consider supporting this podcast by downloading Cash App and using code LexPodcast
link |
01:36:40.360
and signing up at expressvpn.com slash LexPod.
link |
01:36:44.840
Click all the links, buy all the stuff, it's the best way to support this podcast and the
link |
01:36:50.900
journey I'm on.
link |
01:36:51.900
If you enjoy this thing, subscribe on YouTube, review it with five stars on Apple Podcast,
link |
01:36:57.440
support it on Patreon, or connect with me on Twitter at Lex Friedman, spelled somehow
link |
01:37:02.900
if you can figure out how without using the letter E, just F R I D M A N.
link |
01:37:08.920
And now let me leave you with some words from Salvador Dali.
link |
01:37:14.120
Intelligence without ambition is a bird without wings.
link |
01:37:18.820
Thank you for listening and hope to see you next time.