
Sergey Levine: Robotics and Machine Learning | Lex Fridman Podcast #108



link |
00:00:00.000
The following is a conversation with Sergey Levine, a professor at Berkeley and a world
link |
00:00:05.360
class researcher in deep learning, reinforcement learning, robotics, and computer vision,
link |
00:00:10.320
including the development of algorithms for end to end training of neural network policies
link |
00:00:15.040
that combine perception and control, scalable algorithms for inverse reinforcement learning,
link |
00:00:20.320
and, in general, deep RL algorithms.
link |
00:00:23.840
Quick summary of the ads.
link |
00:00:25.200
Two sponsors, Cash App and ExpressVPN. Please consider supporting the podcast by
link |
00:00:30.320
downloading Cash App and using code LexPodcast and signing up at expressvpn.com slash LexPod.
link |
00:00:38.640
Click the links, buy the stuff, it's the best way to support this podcast and, in general,
link |
00:00:44.160
the journey I'm on.
link |
00:00:46.080
If you enjoy this thing, subscribe on YouTube, review it with 5 stars on Apple Podcast,
link |
00:00:50.720
follow on Spotify, support it on Patreon, or connect with me on Twitter at Lex Fridman.
link |
00:00:57.520
As usual, I'll do a few minutes of ads now and never any ads in the middle that can break
link |
00:01:01.520
the flow of the conversation. This show is presented by Cash App, the number one finance
link |
00:01:06.960
app in the App Store. When you get it, use code LexPodcast. Cash App lets you send money to friends
link |
00:01:13.600
by Bitcoin and invest in the stock market with as little as $1. Since Cash App does fractional
link |
00:01:19.600
share trading, let me mention that the order execution algorithm that works behind the scenes
link |
00:01:25.040
to create the abstraction of the fractional orders is an algorithmic marvel. So big props to
link |
00:01:30.720
the Cash App engineers for taking a step up to the next layer of abstraction over the stock market,
link |
00:01:35.680
making trading more accessible for new investors and diversification much easier.
link |
00:01:41.520
So again, if you get Cash App from the App Store, Google Play and use the code LexPodcast,
link |
00:01:47.440
you get $10 and Cash App will also donate $10 to FIRST, an organization that is helping to
link |
00:01:54.240
advance robotics and STEM education for young people around the world. This show is also sponsored
link |
00:02:01.600
by ExpressVPN. Get it at expressvpn.com slash LexPod to support this podcast and to get an extra
link |
00:02:11.040
three months free on a one year package. I've been using ExpressVPN for many years. I love it.
link |
00:02:18.480
I think ExpressVPN is the best VPN out there. They told me to say it, but it happens to be true
link |
00:02:24.240
in my humble opinion. It doesn't log your data. It's crazy fast and it's easy to use literally
link |
00:02:30.320
just one big power on button. Again, it's probably obvious to you, but I should say it again,
link |
00:02:36.400
it's really important that they don't log your data. It works on Linux and every other operating
link |
00:02:42.240
system, but Linux of course is the best operating system. Shout out to my favorite flavor, Ubuntu
link |
00:02:48.800
MATE 20.04. Once again, get it at expressvpn.com slash LexPod to support this podcast and to get
link |
00:02:56.560
an extra three months free on a one year package. And now here's my conversation with Sergey Levine.
link |
00:03:04.240
What's the difference between a state of the art human, such as you and I, well,
link |
00:03:10.000
I don't know if we qualify as state of the art humans, but a state of the art human and a state
link |
00:03:14.320
of the art robot? That's a very interesting question. Robot capability is, it's kind of a,
link |
00:03:22.320
I think it's a very tricky thing to understand because there are some things that are difficult
link |
00:03:28.080
that we wouldn't think are difficult and some things that are easy that we wouldn't think are easy.
link |
00:03:31.360
And there's also a really big gap between capabilities of robots in terms of
link |
00:03:37.280
hardware and their physical capability and capabilities of robots in terms of what they
link |
00:03:40.880
can do autonomously. There is a little video that I think robotics researchers really like
link |
00:03:46.720
to show, especially robotics learning researchers like myself from 2004 from Stanford,
link |
00:03:52.320
which demonstrates a prototype robot called the PR1. And the PR1 was a robot that was designed as a
link |
00:03:57.840
home assistance robot. And there's this beautiful video showing the PR1 tidying up a living room,
link |
00:04:03.440
putting away toys, and at the end, bringing a beer to the person sitting on the couch,
link |
00:04:09.440
which looks really amazing. And then the punchline is that this robot is entirely controlled by a
link |
00:04:14.960
person. So you can, in some ways, the gap between a state of the art human and a state of the art
link |
00:04:19.760
robot, if the robot has a human brain, is actually not that large. Now, obviously,
link |
00:04:24.240
like human bodies are sophisticated and very robust and resilient in many ways. But on the whole,
link |
00:04:29.360
if we're willing to like spend a bit of money and do a bit of engineering,
link |
00:04:32.400
we can kind of close the hardware gap almost. But the intelligence gap, that one is very wide.
link |
00:04:40.160
And when you say hardware, you're referring to the physical sort of the actuators, the actual
link |
00:04:44.160
body of the robot as opposed to the hardware on which the cognition, the nervous, the hardware of
link |
00:04:49.040
the nervous system. Yes, exactly. I'm referring to the body rather than the mind.
link |
00:04:52.960
So that means that, kind of, the work is cut out for us. While we can still make the
link |
00:04:58.240
body better, we kind of know that the big bottleneck right now is really the mind.
link |
00:05:02.560
And how big is that gap? How big is the difference in your sense of ability to learn,
link |
00:05:09.600
ability to reason, ability to perceive the world between humans and our best robots?
link |
00:05:15.920
The gap is very large, and the gap becomes larger the more unexpected events can happen
link |
00:05:23.600
in the world. So essentially, the spectrum along which you can measure the size of that gap is
link |
00:05:30.240
the spectrum of how open the world is. If you control everything in the world very tightly,
link |
00:05:33.680
if you put the robot in like a factory and you tell it where everything is and you rigidly
link |
00:05:38.240
program its motion, then it can do things, one might even say, in a superhuman way,
link |
00:05:43.440
it can move faster, it's stronger, it can lift up a car and things like that. But as soon as
link |
00:05:48.080
anything starts to vary in the environment, now it'll trip up. And if many, many things vary,
link |
00:05:52.480
like they would like in your kitchen, for example, then things are pretty much like wide open.
link |
00:05:58.720
Now, again, we're going to stick a bit on the philosophical questions, but
link |
00:06:03.120
how much on the human side of the cognitive abilities in your sense is nature versus nurture?
link |
00:06:10.400
So how much of it is a product of evolution and how much of it is something we'll learn from
link |
00:06:18.560
sort of scratch from the day we're born? I'm going to read into your question as asking about
link |
00:06:24.240
the implications of this for AI, because I'm not a biologist, I can't really like speak
link |
00:06:29.360
authoritatively. So until we learn it, if it's all about learning, then there's more hope for AI.
link |
00:06:37.280
Yeah. So the way that I look at this is that,
link |
00:06:42.400
you know, well, first, of course, biology is very messy. And it's, if you ask the question,
link |
00:06:47.680
how does a person do something, or how does a person's mind do something,
link |
00:06:51.040
you can come up with a bunch of hypotheses, and oftentimes you can find support for many
link |
00:06:54.800
different often conflicting hypotheses. One way that we can approach the question of
link |
00:07:00.800
what the implications of this for AI are, is we can think about what's sufficient.
link |
00:07:04.640
So, you know, maybe a person is, from birth, very, very good at some things like, for example,
link |
00:07:10.320
recognizing faces. There's a very strong evolutionary pressure to do that. If you can
link |
00:07:13.520
recognize your mother's face, then you're more likely to survive, and therefore people are good
link |
00:07:18.640
at this. But we can also ask like, what's the minimum sufficient thing? And one of the ways
link |
00:07:23.920
that we can study the minimal sufficient thing is we could, for example, see what people do in
link |
00:07:27.840
unusual situations. If you present them with things that evolution couldn't have prepared them for,
link |
00:07:31.840
you know, our daily lives actually do this to us all the time. We didn't evolve to deal with,
link |
00:07:36.720
you know, automobiles and space flight and whatever. So, there are all these situations
link |
00:07:41.280
that we can find ourselves in. And we do very well there. Like, I can give you a joystick
link |
00:07:46.320
to control a robotic arm, which you've never used before. And you might be pretty bad for the
link |
00:07:50.880
first couple of seconds. But if I tell you, like, your life depends on using this robotic arm to,
link |
00:07:54.960
like, open this door, you'll probably manage it. Even though you've never seen this device before,
link |
00:08:00.000
you've never used a joystick to control it, and you'll kind of muddle through it. And that's
link |
00:08:04.320
not your evolved natural ability. That's your flexibility, your adaptability. And that's exactly
link |
00:08:10.640
where our current robotic systems really kind of fall flat.
link |
00:08:13.120
But I wonder how much general, almost what we think of as common sense,
link |
00:08:20.240
comes from pre-trained models underneath all of that. So that ability to adapt to a joystick
link |
00:08:25.040
requires you to have a kind of, you know, I'm human. So it's hard for me to introspect all
link |
00:08:31.440
the knowledge I have about the world. But it seems like there might be an iceberg underneath
link |
00:08:37.440
of the amount of knowledge we actually bring to the table. That's kind of the open question.
link |
00:08:41.520
I think there's absolutely an iceberg of knowledge that we bring to the table. But
link |
00:08:45.520
I think it's very likely that iceberg of knowledge is actually built up over our
link |
00:08:49.600
lifetimes. Because we have, you know, we have a lot of prior experience to draw on. And it kind
link |
00:08:56.640
of makes sense that the right way for us to, you know, to optimize our efficiency, our evolutionary
link |
00:09:03.520
fitness, and so on, is to utilize all that experience to build up the best iceberg we can
link |
00:09:09.280
get. And, you know, while that sounds an awful lot like what machine
link |
00:09:13.920
learning actually does, I think that for modern machine learning, it's actually
link |
00:09:18.400
a really big challenge to take this
link |
00:09:21.600
unstructured mass of experience and distill out something that looks like a common sense
link |
00:09:26.560
understanding of the world. And perhaps part of that is it's not because something about
link |
00:09:31.120
machine learning itself is broken or hard, but because we've been a little too rigid
link |
00:09:36.960
in subscribing to a very supervised, very rigid notion of learning, you know, kind of the input
link |
00:09:41.440
output Xs go to Ys sort of model. And maybe what we really need to do is to view the world more
link |
00:09:48.320
as like a mass of experience that is not necessarily providing any rigid supervision,
link |
00:09:53.760
but sort of providing many, many instances of things that could be. And then you take that
link |
00:09:57.520
and you distill it into some sort of common sense understanding.
link |
00:10:02.000
I see. Well, you're painting an optimistic, beautiful picture, especially from the robotics
link |
00:10:06.720
perspective, because that means we just need to invest and build better learning algorithms,
link |
00:10:12.240
figure out how we can get access to more and more data for those learning algorithms to extract
link |
00:10:18.160
signal from and then accumulate that iceberg of knowledge. It's a beautiful picture. It's a
link |
00:10:24.000
hopeful one. I think it's potentially a little bit more than just that. And this is where we
link |
00:10:30.080
perhaps reach the limits of our current understanding. But one thing that I think that
link |
00:10:34.880
the research community hasn't really resolved in a satisfactory way is how much it matters
link |
00:10:40.160
where that experience comes from. Like, you know, do you just like download everything on the
link |
00:10:44.400
internet and cram it into essentially the 21st century analog of the giant language model and
link |
00:10:51.280
then see what happens? Or does it actually matter whether your machine physically experiences the
link |
00:10:56.240
world or in the sense that it actually attempts things, observes the outcome of its actions and
link |
00:11:01.760
kind of augments its experience that way? That it chooses which parts of the world it
link |
00:11:06.960
gets to interact with and observe and learn from. Right. It may be that the world is so complex
link |
00:11:12.720
that simply obtaining a large mass of sort of IID samples of the world is a very difficult way
link |
00:11:20.000
to go. But if you are actually interacting with the world and essentially performing this sort of
link |
00:11:25.040
hard negative mining by attempting what you think might work, observing the sometimes happy and
link |
00:11:30.400
sometimes sad outcomes of that, and augmenting your understanding using that experience and
link |
00:11:35.600
you're just doing this continually for many years, maybe that sort of data in some sense
link |
00:11:40.800
is actually much more favorable to obtaining a common sense understanding. One reason we might
link |
00:11:45.200
think that this is true is that what we associate with common sense or lack of common sense
link |
00:11:51.920
is often characterized by the ability to reason about kind of counterfactual questions. Like,
link |
00:11:57.360
if I were to... Here, this bottle of water is sitting on the table, everything is fine,
link |
00:12:02.320
if I were to knock it over, which I'm not going to do, but if I were to do that, what would happen?
link |
00:12:06.400
And I know that nothing good would happen from that, but if I have a bad understanding of the
link |
00:12:11.680
world, I might think that that's a good way for me to gain more utility. If I actually go about
link |
00:12:19.200
daily life doing the things that my current understanding of the world suggests will give
link |
00:12:23.120
me high utility, in some ways, I'll get exactly the right supervision to tell me not to do those
link |
00:12:30.640
bad things and to keep doing the good things. So, there's a spectrum between IID, random walk
link |
00:12:36.960
through the space of data, and what we humans do. I don't even know if we do it optimally, but
link |
00:12:43.600
there might be something beyond. So, this open question that you raised, where do you think systems,
link |
00:12:51.520
intelligent systems that would be able to deal with this world fall? Can we do pretty well
link |
00:12:57.360
by reading all of Wikipedia, randomly sampling it, like language models do, or do we have to be
link |
00:13:04.720
exceptionally selective and intelligent about which aspects of the world we try?
link |
00:13:11.840
So, I think this is first an open scientific problem, and I don't have a clear answer,
link |
00:13:15.760
but I can speculate a little bit. And what I would speculate is that you don't need to be
link |
00:13:20.800
super, super careful. I think it's less about being careful to avoid the useless stuff,
link |
00:13:27.680
and more about making sure that you hit on the really important stuff. So, perhaps it's okay
link |
00:13:32.640
if you spend part of your day just guided by your curiosity, visiting interesting regions of
link |
00:13:38.800
your state space, but it's important for you to, every once in a while, make sure that you really
link |
00:13:43.840
try out the solutions that your current model of the world suggests might be effective,
link |
00:13:49.680
and observe whether those solutions are working as you expect or not. And perhaps some of that
link |
00:13:55.040
is really essential to have a perpetual improvement loop. This perpetual improvement
link |
00:14:00.480
loop is really the key that's going to potentially distinguish the best current methods from the
link |
00:14:06.480
best methods of tomorrow, in a sense. How important do you think exploration is, or totally out of the
link |
00:14:11.840
box thinking exploration, in this space, to jump to totally different domains? So, you mentioned
link |
00:14:20.080
there's an optimization problem, you explore the specifics of a particular strategy, whatever the
link |
00:14:26.080
thing you're trying to solve. How important is it to explore totally outside of the strategies
link |
00:14:32.080
that have been working for you so far? What's your intuition there?
link |
00:14:34.960
Yeah, I think it's a very problem dependent kind of question. And I think that that's actually,
link |
00:14:39.760
you know, in some ways, that question gets at one of the big differences between
link |
00:14:48.400
sort of the classic formulation of a reinforcement learning problem and some of the sort of more
link |
00:14:54.640
open ended reformulations of that problem that had been explored in recent years. So,
link |
00:14:58.160
classically, reinforcement learning is framed as a problem of maximizing utility, like any kind of
link |
00:15:03.040
rational AI agent, and then anything you do is in service to maximizing that utility.
link |
00:15:07.440
But a very interesting kind of way to look at, I'm not necessarily saying this is the best way to
link |
00:15:15.440
look at it, but an interesting alternative way to look at these problems is as something where
link |
00:15:19.680
you first get to explore the world however you please, and then afterwards you will be tasked
link |
00:15:24.960
with doing something. And that might suggest a somewhat different solution. So, if you don't
link |
00:15:29.200
know what you're going to be tasked with doing, and you just want to prepare yourself optimally
link |
00:15:32.880
for whatever your uncertain future holds, maybe then you will choose to attain some sort of
link |
00:15:37.920
coverage, build up sort of an arsenal of cognitive tools, if you will, such that later on when someone
link |
00:15:44.000
tells you, now your job is to fetch the coffee for me, you will be well prepared to undertake that
link |
00:15:48.480
task. And that you see that as the modern formulation of the reinforcement learning
link |
00:15:53.680
problem as a kind of the more multitask, the general intelligence kind of formulation.
link |
00:15:59.040
I think that's one possible vision of where things might be headed. I don't think that's by any means
link |
00:16:04.160
the mainstream or standard way of doing things, and it's not like if I had to...
link |
00:16:08.480
But I like it. It's a beautiful vision. So, maybe actually take a step back. What is the goal of
link |
00:16:14.400
robotics? What's the general problem of robotics we're trying to solve? You actually kind of painted
link |
00:16:18.480
two pictures here, one of sort of the narrow, one of the general. What in your view is the big
link |
00:16:23.360
problem of robotics? Again, ridiculously philosophical question.
link |
00:16:29.200
I think that maybe there are two ways I can answer this question. One is there's a very
link |
00:16:34.640
pragmatic problem, which is what would make robots... What would sort of maximize the
link |
00:16:41.440
usefulness of robots? And there the answer might be something like a system that
link |
00:16:50.400
can perform whatever task a human user sets for it, within the physical constraints,
link |
00:16:58.800
of course. If you ask it to teleport to another planet, it probably can't do that. But if you ask it to do
link |
00:17:03.520
something that's within its physical capability, then potentially with a little bit of additional
link |
00:17:07.920
training or a little bit of additional trial and error, it ought to be able to figure it out
link |
00:17:11.760
in much the same way as like a human teleoperator ought to figure out how to drive the robot to
link |
00:17:16.160
do that. That's kind of the very pragmatic view of what it would take to kind of solve the robotics
link |
00:17:22.720
problem, if we will. But I think that there is a second answer, and that answer is a lot closer
link |
00:17:28.800
to why I want to work on robotics, which is that I think it's less about what it would take to do a
link |
00:17:34.240
really good job in the world of robotics, but more the other way around of what robotics
link |
00:17:39.040
can bring to the table to help us understand artificial intelligence.
link |
00:17:42.800
So your dream, fundamentally, is to understand intelligence?
link |
00:17:47.520
Yes. I think that's the dream for many people who actually work in this space. I think that
link |
00:17:54.560
there's something very pragmatic and very useful about studying robotics. But I do think that a
link |
00:17:59.200
lot of people that go into this field, actually, the things that they draw inspiration from are the
link |
00:18:05.520
potential for robots to help us learn about intelligence and about ourselves.
link |
00:18:10.160
So that's fascinating that robotics is basically the space by which you can get closer to
link |
00:18:17.600
understanding the fundamentals of artificial intelligence. So what is it about robotics
link |
00:18:22.720
that's different from some of the other approaches? So if we look at some of the early breakthroughs
link |
00:18:27.760
in deep learning or in the computer vision space and the natural language processing,
link |
00:18:32.400
there's really nice, clean benchmarks that a lot of people competed on, and thereby came
link |
00:18:36.880
up with a lot of brilliant ideas. What's the fundamental difference between computer vision
link |
00:18:42.080
purely defined and ImageNet and kind of the bigger robotics problem?
link |
00:18:46.400
So there are a couple of things. One is that with robotics, you kind of have to take away
link |
00:18:54.160
many of the crutches. So you have to deal with both the particular problems of perception,
link |
00:19:00.880
control, and so on. But you also have to deal with the integration of those things.
link |
00:19:03.600
And, you know, classically, we've always thought of the integration as kind of a separate problem.
link |
00:19:08.560
So a classic kind of modular engineering approach is that we solve the individual
link |
00:19:12.160
sub problems, then wire them together, and then the whole thing works. And one of the
link |
00:19:16.400
things that we've been seeing over the last couple of decades is that, well, maybe
link |
00:19:20.400
studying the thing as a whole might lead to just like very different solutions than if we were
link |
00:19:24.800
to study the parts and wire them together. So the integrative nature of robotics research
link |
00:19:29.680
helps us see, you know, the different perspectives on the problem. Another part of the answer is that
link |
00:19:36.160
with robotics, it casts a certain paradox into very clear relief. So this is sometimes referred
link |
00:19:42.960
to as Moravec's paradox, the idea that in artificial intelligence, things that are very hard for people
link |
00:19:50.480
can be very easy for machines. And vice versa, things that are very easy for people can be
link |
00:19:53.760
very hard for machines. So, you know, integral and differential calculus is pretty difficult to
link |
00:20:01.120
learn for people. But if you program a computer to do it, it can derive derivatives and integrals
link |
00:20:05.600
for you all day long without any trouble. Whereas some things like, you know, drinking from a cup
link |
00:20:11.600
of water, very easy for a person to do very hard for a robot to deal with. And sometimes when we
link |
00:20:17.920
see such blatant discrepancies, that gives us a really strong hint that we're missing something
link |
00:20:22.160
important. So if we really try to zero in on those discrepancies, we might find that little bit
link |
00:20:27.120
that we're missing. And it's not that we need to make machines better or worse at math and better
link |
00:20:31.840
at drinking water, but just that by studying those discrepancies, we might find some new insight.
link |
00:20:37.600
So that could be, that could be in any space, doesn't have to be robotics. But you're saying,
link |
00:20:43.600
I mean, I get, it's kind of interesting that robotics seems to have a lot of those discrepancies.
link |
00:20:49.280
So the Hans Moravec paradox is probably referring to the space of the physical
link |
00:20:55.920
interaction, like you said, object manipulation, walking, all the kind of stuff we do in the physical
link |
00:21:00.560
world. How do you make sense of it, if you were to try to disentangle the Moravec paradox? Like, why is
link |
00:21:13.200
there such a gap in our intuition about it? Why do you think manipulating objects is so
link |
00:21:20.000
hard from everything you've learned from applying reinforcement learning in this space?
link |
00:21:25.520
Yeah, I think that one reason is maybe that for many of the other problems that
link |
00:21:33.600
we've studied in AI and computer science and so on, the notion of input, output and supervision
link |
00:21:41.120
is much, much cleaner. So computer vision, for example, deals with very complex inputs,
link |
00:21:45.600
but it's comparatively a bit easier, at least up to some level of abstraction, to cast it as a very
link |
00:21:52.400
tightly supervised problem. It's comparatively much, much harder to cast robotic manipulation as a
link |
00:21:58.560
very tightly supervised problem. You can do it. It just doesn't seem to work all that well. So you
link |
00:22:03.520
could say that, well, maybe we get a labeled data set where we know exactly which motor commands
link |
00:22:07.600
to send and then we train on that. But for various reasons, that's not actually like such a great
link |
00:22:12.400
solution. And it also doesn't seem to be even remotely similar to how people and animals learn
link |
00:22:17.280
to do things because we're not told by like our parents, here's how you fire your muscles in order
link |
00:22:22.960
to walk. We do get some guidance, but the really low level detailed stuff we figure out mostly on
link |
00:22:29.120
our own. And that's what you mean by tightly coupled that every single little subaction gets a
link |
00:22:34.320
supervised signal of whether it's a good one or not. Right. So while in computer vision, you
link |
00:22:39.040
could sort of imagine up to a level of abstraction that maybe somebody told you this is a car and
link |
00:22:43.280
this is a cat and this is a dog. In motor control, it's very clear that that was not the case.
link |
00:22:49.120
If we look at sort of the subspaces of robotics that, again, as you said, robotics integrates
link |
00:22:57.120
all of them together and we get to see how this beautiful mess interplays. But so there's nevertheless
link |
00:23:02.080
still perception. So it's the computer vision problem, broadly speaking, understanding the
link |
00:23:08.800
environment. Then there's also, maybe you can correct me on this kind of categorization of the
link |
00:23:13.760
space, then there's prediction, trying to anticipate what things are going to do in
link |
00:23:19.920
the future in order for you to be able to act in that world. And then there's also this game
link |
00:23:26.880
theoretic aspect of how your actions will change the behavior of others. In this kind of space,
link |
00:23:36.000
and this is bigger than reinforcement learning, this is just broadly looking at the problem
link |
00:23:39.760
of robotics. What's the hardest problem here? Or is there, or is what you said
link |
00:23:47.680
true that when you start to look at all of them together, that's a whole
link |
00:23:52.640
another thing. You can't even say which one individually is harder because all of them
link |
00:23:58.080
together, you should only be looking at them all together. I think when you look at them all together,
link |
00:24:03.280
some things actually become easier. And I think that's actually pretty important.
link |
00:24:09.200
Back in 2014, we had some work, basically our first work on end to end reinforcement learning
link |
00:24:16.160
for robotic manipulation skills from vision, which at the time was something that seemed
link |
00:24:21.520
a little inflammatory and controversial in the robotics world. But other than the inflammatory
link |
00:24:28.000
and controversial part of it, the point that we were actually trying to make in that work is that
link |
00:24:32.560
for the particular case of combining perception and control, you could actually do better if you
link |
00:24:37.200
treat them together than if you try to separate them. And the way that we tried to demonstrate
link |
00:24:41.120
this is we picked a fairly simple motor control task where a robot had to insert a little red
link |
00:24:46.640
trapezoid into a trapezoidal hole. And we had our separated solution, which involved first
link |
00:24:52.880
detecting the hole using a pose detector, and then actuating the arm to put it in,
link |
00:24:57.360
and then our end to end solution, which just mapped pixels to the torques. And one of the things we
link |
00:25:02.880
observed is that if you use the end to end solution, essentially the pressure on the perception part
link |
00:25:06.960
of the model is actually lower. It doesn't have to figure out exactly where the thing is in 3D
link |
00:25:10.480
space. It just needs to figure out where it is, distributing the errors in such a way that
link |
00:25:16.000
the horizontal difference matters more than the vertical difference, because vertically it just
link |
00:25:19.360
pushes it down all the way until it can't go any further. And there, perceptual errors are a lot
link |
00:25:23.840
less harmful, whereas perpendicular to the direction of motion, perceptual errors are much
link |
00:25:27.680
more harmful. So the point is that if you combine these two things, you can trade off errors between
link |
00:25:33.440
the components optimally to best accomplish the task. And the components can actually be weaker
link |
00:25:39.600
while still leading to better overall performance.
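As a side note for readers, here is a minimal sketch of the two designs being contrasted (the network shapes, controller gain, and arm-state layout are placeholder assumptions, not the actual 2014 system): a separated pipeline that first estimates the hole pose and then computes torques from that estimate, versus an end to end policy that maps pixels and joint state directly to torques, so perceptual errors only matter to the extent that they hurt the task.

    import torch
    import torch.nn as nn

    def make_encoder():
        # Small convolutional encoder over the camera image (sizes are arbitrary).
        return nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.Flatten(),
        )

    # Separated design: perception outputs an explicit 3D pose estimate,
    # and a hand-written controller consumes it.
    pose_detector = nn.Sequential(make_encoder(), nn.LazyLinear(3))

    def separated_policy(image, arm_state, gain=5.0):
        hole_pose = pose_detector(image)
        # Placeholder proportional rule; assumes the first 3 entries of
        # arm_state are the end-effector position (an assumption for this sketch).
        return gain * (hole_pose - arm_state[:, :3])

    # End to end design: pixels and joint state go in, torques come out,
    # trained on task success rather than on intermediate pose labels.
    class EndToEndPolicy(nn.Module):
        def __init__(self, n_joints=7):
            super().__init__()
            self.encoder = make_encoder()
            self.head = nn.Sequential(nn.LazyLinear(64), nn.ReLU(), nn.Linear(64, n_joints))

        def forward(self, image, arm_state):
            features = torch.cat([self.encoder(image), arm_state], dim=-1)
            return self.head(features)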
link |
00:25:41.760
It's a profound idea. I mean, in the space of pegs and things like that, it's quite simple.
link |
00:25:48.560
It almost is tempting to overlook. But that seems to be at least intuitively an idea that should
link |
00:25:56.000
generalize to basically all aspects of perception and control.
link |
00:25:59.760
Of course.
link |
00:26:00.160
That one strengthens the other.
link |
00:26:01.840
Yeah. And people who have studied perceptual heuristics in humans and animals find things
link |
00:26:07.840
like that all the time. So one very well known example is something called the gaze heuristic,
link |
00:26:12.160
which is a little trick that you can use to intercept a flying object. So if you want to
link |
00:26:17.440
catch a ball, for instance, you could try to localize it in 3D space, estimate its velocity,
link |
00:26:22.480
estimate the effect of wind resistance, solve a complex system of differential equations in your
link |
00:26:25.920
head. Or you can maintain a running speed so that the object stays in the same position
link |
00:26:32.880
in your field of view. So if it dips a little bit, you speed up. If it rises a little bit,
link |
00:26:36.400
you slow down. And if you follow the simple rule, you'll actually arrive at exactly the
link |
00:26:40.320
place where the object lands and you'll catch it. And humans use it when they play baseball.
link |
00:26:45.040
Human pilots use it when they fly airplanes to figure out if they're about to collide with
link |
00:26:48.240
somebody. Frogs use it to catch insects and so on and so on. So this is something that
link |
00:26:52.560
actually happens in nature. And I'm sure this is just one instance of it that we were able to
link |
00:26:56.160
identify, that scientists were able to identify, just because it's so prevalent,
link |
00:26:59.840
but there are probably many others.
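As a rough illustration of the gaze heuristic described here (a sketch with made-up units and gain, not a model of what people actually compute), the rule can be written as a one-line feedback controller on the ball's apparent elevation in the visual field:

    def gaze_heuristic_speed(current_speed, elevation_angle, prev_elevation_angle,
                             gain=2.0):
        """Adjust running speed so the ball's elevation angle in the visual field
        stays constant: if the ball appears to rise, slow down; if it dips, speed up.
        Angles are in radians; `gain` is an arbitrary tuning constant (assumption)."""
        drift = elevation_angle - prev_elevation_angle
        return max(0.0, current_speed - gain * drift)

    # Example: the ball dipped slightly in the field of view, so speed up a little.
    new_speed = gaze_heuristic_speed(current_speed=4.0,
                                     elevation_angle=0.48,
                                     prev_elevation_angle=0.50)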
link |
00:27:00.800
Do you have a, just so we can zoom in as we talk about robotics, do you have a canonical problem,
link |
00:27:07.040
sort of a simple, clean, beautiful representative problem in robotics that you think about when
link |
00:27:14.080
you're thinking about some of these problems? We talked about robotic manipulation. To me,
link |
00:27:18.800
that seems intuitively, at least the robotics community has converged towards that as a space
link |
00:27:26.720
that's the canonical problem. If you agree, then maybe you can zoom in on some particular
link |
00:27:32.080
aspect of that problem that you just like. Like if we solve that problem perfectly,
link |
00:27:36.880
it'll unlock a major step towards human level intelligence.
link |
00:27:43.920
I don't think I have like a really great answer to that. And I think partly the reason I don't
link |
00:27:47.760
have a great answer kind of has to do with the fact that the difficulty
link |
00:27:54.800
is really in the flexibility and adaptability rather than in doing a particular thing really,
link |
00:27:59.760
really well. So it's hard to just say like, oh, if you can shuffle a deck of cards as fast as like
link |
00:28:08.000
a Vegas casino dealer, then you'll be very proficient. It's really the ability to quickly
link |
00:28:15.280
figure out how to do some arbitrary new thing well enough to move on to the next arbitrary thing.
link |
00:28:25.920
But the source of newness and uncertainty, have you found problems in which it's easy to
link |
00:28:34.800
generate newness, new types of newness?
link |
00:28:39.680
Yeah. So a few years ago, if you'd asked me this question around like 2016, maybe,
link |
00:28:46.560
I would have probably said that robotic grasping is a really great example of that because
link |
00:28:51.520
it's a task with great real world utility. Like you will get a lot of money if you can do it well.
link |
00:28:57.040
What is robotic grasping?
link |
00:28:58.720
Picking up any object.
link |
00:29:00.720
With a robotic hand.
link |
00:29:02.240
Exactly. So you will get a lot of money if you do it well because lots of people want to run
link |
00:29:06.160
warehouses with robots. And it's highly non trivial because very different objects will
link |
00:29:12.480
require very different grasping strategies. But actually, since then, people have gotten
link |
00:29:18.000
really good at building systems to solve this problem to the point where I'm not actually
link |
00:29:22.320
sure how much more progress we can make with that as like the main guiding thing.
link |
00:29:29.280
But it's kind of interesting to see the kind of methods that have actually worked well in that
link |
00:29:33.120
space because robotic grasping classically used to be regarded very much as kind of almost like
link |
00:29:39.760
a geometry problem. So people who have studied the history of computer vision will find this
link |
00:29:45.920
very familiar that kind of in the same way that in the early days of computer vision,
link |
00:29:49.440
people thought of it very much as like an inverse graphics thing. In robotic grasping,
link |
00:29:53.440
people thought of it as an inverse physics problem. Essentially, you look at what's in front of you,
link |
00:29:58.480
figure out the shapes, then use your best estimate of the laws of physics to figure out
link |
00:30:02.720
where to put your fingers on, you pick up the thing. And it turns out that what works really
link |
00:30:07.120
well for robotic grasping instantiated in many different recent works, including our own, but
link |
00:30:12.240
also ones from many other labs is to use learning methods with some combination of either exhaustive
link |
00:30:19.280
simulation or like actual real world trial and error. And it turns out that those things actually
link |
00:30:23.120
work really well. And then you don't have to worry about solving geometry problems or physics
link |
00:30:26.640
problems. So, just by the way, in grasping, what are the difficulties that have been worked on?
link |
00:30:35.120
So one is like the materials of things, maybe occlusions on the perception side. Why is it
link |
00:30:41.200
so difficult? Why is picking stuff up such a difficult problem? Yeah, it's a difficult problem
link |
00:30:47.120
because the number of things that you might have to deal with or the variety of things that you
link |
00:30:52.560
have to deal with is extremely large. And oftentimes, things that work for one class of
link |
00:30:58.400
objects won't work for another class of objects. So if you get really good at picking up boxes,
link |
00:31:03.680
and now you have to pick up plastic bags, you just need to employ a very different strategy.
link |
00:31:09.440
And there are many properties of objects that are more than just their geometry,
link |
00:31:15.120
it has to do with the bits that are easier to pick up, the bits that are hard to pick up,
link |
00:31:19.600
the bits that are more flexible, the bits that will cause the thing to pivot and bend and drop
link |
00:31:23.920
out of your hand versus the bits that result in a nice secure grasp, things that are flexible,
link |
00:31:28.960
things that if you pick them up the wrong way, they'll fall upside down and the contents will
link |
00:31:32.800
spill out. So there's all these little details that come up. But the task is still kind of can
link |
00:31:38.160
be characterized as one task, like there's a very clear notion of you did it or you didn't do it.
link |
00:31:42.240
So in terms of spilling things, there creeps in this notion that starts to sound and feel like
link |
00:31:51.040
common sense reasoning. Do you think solving the general problem of robotics requires
link |
00:31:59.920
common sense reasoning, requires general intelligence, this kind of human level capability of,
link |
00:32:07.040
you know, like you said, be robust and deal with uncertainty, but also be able to sort of reason
link |
00:32:13.360
and assimilate different pieces of knowledge that you have. Yeah. What are your thoughts on
link |
00:32:22.000
the needs of common sense reasoning in the space of the general robotics problem?
link |
00:32:28.400
So I'm going to slightly dodge that question and say that I think maybe actually,
link |
00:32:32.240
it's the other way around is that studying robotics can help us understand how to put
link |
00:32:37.520
common sense into our AI systems. One way to think about common sense is that, and why our current
link |
00:32:44.080
systems might lack common sense, is that common sense is an emergent property of
link |
00:32:50.080
actually having to interact with a particular world, a particular universe, and get things done
link |
00:32:54.880
in that universe. So you might think that, for instance, like an image captioning system,
link |
00:32:59.680
maybe it looks at pictures of the world and it types out English sentences. So it kind of
link |
00:33:06.160
deals with our world. And then you can easily construct situations where image captioning
link |
00:33:11.040
systems do things that defy common sense, like give it a picture of a person wearing a fur coat,
link |
00:33:15.520
and it'll say it's a teddy bear. But I think what's really happening in those settings
link |
00:33:20.080
is that the system doesn't actually live in our world, it lives in its own world that consists
link |
00:33:24.960
of pixels and English sentences, and doesn't actually consist of having to put on a fur coat
link |
00:33:30.240
in the winter so you don't get cold. So perhaps the reason for the disconnect is that the
link |
00:33:36.640
systems that we have now simply inhabit a different universe. And if we build AI systems
link |
00:33:41.680
that are forced to deal with all of the messiness and complexity of our universe, maybe they will
link |
00:33:46.400
have to acquire common sense to essentially maximize their utility. Whereas the systems
link |
00:33:51.520
we're building now don't have to do that, they can take some shortcut.
link |
00:33:55.120
That's fascinating. You have, a couple of times already, sort of reframed the role of robotics
link |
00:34:00.320
in this whole thing. And for some reason, I don't know if my way of thinking is common,
link |
00:34:05.920
but I thought like, we need to understand and solve intelligence in order to solve robotics.
link |
00:34:12.080
And you're kind of framing it as, no, robotics is one of the best ways to just study
link |
00:34:16.880
artificial intelligence and build it, sort of like robotics is the right space in which
link |
00:34:23.040
you get to explore some of the fundamental learning mechanisms, fundamental sort of
link |
00:34:29.360
multimodal, multitask aggregation of knowledge mechanisms that are required for general
link |
00:34:36.000
intelligence. That's a really interesting way to think about it. But let me ask about learning.
link |
00:34:41.280
Can the general sort of robotics, the epitome of the robotics problem be solved purely
link |
00:34:46.720
through learning, perhaps end to end learning, sort of learning from scratch,
link |
00:34:54.480
as opposed to injecting human expertise and rules and heuristics and so on?
link |
00:34:59.920
I think that in terms of the spirit of the question, I would say yes. I mean, I think that
link |
00:35:06.640
though in some ways it may be like an overly sharp dichotomy, like, you know, I think that
link |
00:35:12.480
in some ways when we build algorithms, at some point, a person does something.
link |
00:35:19.600
A person turned on the computer, a person implemented TensorFlow.
link |
00:35:26.160
But yeah, I think that in terms of the point that you're getting at, I do think the answer
link |
00:35:29.760
is yes. I think that we can solve many problems that have previously required meticulous manual
link |
00:35:36.480
engineering through automated optimization techniques. And actually, one thing I will say
link |
00:35:40.960
on this topic is I don't think this is actually a very radical or very new idea. I think people
link |
00:35:46.080
have been thinking about automated optimization techniques as a way to do control for a very,
link |
00:35:51.680
very long time. And in some ways, what's changed is really more the name. So today we would say that,
link |
00:35:59.680
oh, my robot does machine learning, it does reinforcement learning, maybe in the 1960s,
link |
00:36:04.560
you'd say, oh, my robot is doing optimal control. And maybe the difference between typing out a
link |
00:36:10.400
system of differential equations and doing feedback linearization versus training in neural net,
link |
00:36:15.600
maybe it's not such a large difference. It's just pushing the optimization deeper and deeper into
link |
00:36:21.040
the thing. Well, it is interesting, you think that way, but especially with deep learning,
link |
00:36:26.480
that the accumulation of experiences in data form to form deep representations starts to feel
link |
00:36:35.920
like knowledge as opposed to optimal control. So this feels like there's an accumulation of
link |
00:36:41.120
knowledge through the learning process. Yes. Yeah. So I think that is a good point that
link |
00:36:45.760
one big difference between learning based systems and classic optimal control systems is that
link |
00:36:50.240
learning based systems in principle should get better and better the more they do something.
link |
00:36:54.960
And I do think that that's actually a very, very powerful difference.
link |
00:36:58.000
So look back at the world of expert systems, the symbolic AI and so on,
link |
00:37:03.200
of using logic to accumulate expertise, human expertise, human encoded expertise.
link |
00:37:10.960
Do you think that will have a role at some points? The deep learning, machine learning,
link |
00:37:16.080
reinforcement learning has shown incredible results and breakthroughs and just inspired
link |
00:37:23.280
thousands, maybe millions of researchers. But there's this less popular now,
link |
00:37:30.480
but it used to be popular idea of symbolic AI. Do you think that will have a role?
link |
00:37:35.040
I think in some ways, the kind of the descendants of symbolic AI actually already have a role. So
link |
00:37:45.840
this is the highly biased history from my perspective. You could say that, well, initially we
link |
00:37:50.560
thought that rational decision making involves logical manipulation. So you have some model of
link |
00:37:56.480
the world expressed in terms of logic. You have some query like, what action do I take in order to
link |
00:38:03.120
for X to be true? And then you manipulate your logical symbolic representation to get an answer.
link |
00:38:08.240
What that turned into somewhere in the 1990s is, well, instead of building kind of predicates
link |
00:38:14.160
and statements that have true or false values, we'll build probabilistic systems where things
link |
00:38:20.720
have probabilities associated and probabilities of being true and false. And that turned into
link |
00:38:23.520
Bayes Nets. And that provided sort of a boost to what were really still essentially logical
link |
00:38:29.760
inference systems, just probabilistic logical inference systems. And then people said, well,
link |
00:38:34.240
let's actually learn the individual probabilities inside these models. And then people said, well,
link |
00:38:40.560
let's not even specify the nodes in the models, let's just put a big neural net in there. But in
link |
00:38:45.520
many ways, I see these as actually kind of descendants from the same idea. It's essentially
link |
00:38:49.440
instantiating rational decision making by means of some inference process, and learning by means
link |
00:38:54.960
of an optimization process. So in a sense, I would say yes, that it has a place. And in many
link |
00:39:00.640
ways, you know, it already holds that place. It's already in there. Yeah,
link |
00:39:05.600
it just looks slightly different than it did before. But
link |
00:39:09.680
there are some things that we can think about that make this a little bit more obvious. Like,
link |
00:39:13.360
if I train a big neural net model to predict what will happen in response to my robot's actions,
link |
00:39:18.800
and then I run probabilistic inference, meaning I invert that model to figure out the actions that
link |
00:39:23.680
lead to some plausible outcome. Like, to me, that seems like a kind of logic. You have a model of
link |
00:39:28.320
the world, it just happens to be expressed by a neural net. And you are doing some inference
link |
00:39:33.040
procedure, some sort of manipulation on that model to figure out, you know, the answer to a
link |
00:39:38.240
query that you have. It's the interpretability, it's the explainability, though, that seems to
link |
00:39:43.280
be lacking more so because the nice thing about sort of expert systems is you can follow the
link |
00:39:49.040
reasoning of the system, which to us mere humans is somehow compelling. It's just,
link |
00:39:58.080
I don't know what to make of this fact that there's a human desire for intelligent systems to be able
link |
00:40:05.040
to convey in a poetic way to us why it made the decisions it did. Like, tell a convincing story.
link |
00:40:15.040
And perhaps that's like a silly human thing. Like, we shouldn't expect that of intelligent
link |
00:40:22.240
systems. Like, we should be super happy that there is intelligent systems out there. But
link |
00:40:29.200
if I were to sort of psychoanalyze the researchers at the time, I would say expert systems connected
link |
00:40:34.080
to that part, that desire for AI researchers for systems to be explainable. I mean, maybe on that
link |
00:40:40.880
topic, do you have a hope that sort of inference systems of learning based systems will be as
link |
00:40:50.160
explainable as the dream was with expert systems, for example? I think it's a very complicated
link |
00:40:56.080
question because I think that in some ways, the question of explainability is kind of very closely
link |
00:41:03.040
tied to the question of performance. Like, why do you want your system to explain itself? Well,
link |
00:41:09.280
it's so that when it screws up, you can kind of figure out why it did it. But in some ways,
link |
00:41:15.280
that's a much bigger problem, actually. Like, your system might screw up and then it might
link |
00:41:20.080
screw up in how it explains itself, or you might have some bug somewhere so that it's not actually
link |
00:41:25.760
doing what it was supposed to do. So, maybe a good way to view that problem is really as a
link |
00:41:30.880
problem, as a bigger problem of verification and validation of which explainability is sort of
link |
00:41:37.680
one component. I see. I just see it differently. I see explainability. You put it beautifully. I
link |
00:41:43.920
think you actually summarized the field of explainability. But to me, there's another
link |
00:41:48.320
aspect of explainability, which is like storytelling that has nothing to do with errors, or
link |
00:41:54.960
rather, it uses errors as elements of its story, as opposed to a fundamental
link |
00:42:05.200
need to be explainable when errors occur. It's just that for intelligent systems to be in
link |
00:42:10.880
our world, we seem to want to tell each other stories. And that that's true in the political
link |
00:42:17.360
world. That's true in the academic world. And, you know, neural networks are less capable
link |
00:42:23.280
of doing that. Or perhaps they're equally capable of storytelling. Maybe it doesn't
link |
00:42:27.520
matter what the fundamentals of the system are, you just need to be a good storyteller.
link |
00:42:32.560
Maybe one specific story I can tell you about in that space is actually about some work that was
link |
00:42:38.320
done by my former collaborator, who's now a professor at MIT, named Jacob Andreas. Jacob actually
link |
00:42:44.160
works in natural language processing, but he had this idea to do a little bit of work in reinforcement
link |
00:42:48.240
learning, on how natural language can basically structure the internals of policies
link |
00:42:54.160
trained with RL. And one of the things he did is he set up a model that attempts to perform some
link |
00:43:00.800
tasks that's defined by a reward function. But the model reads in a natural language instruction.
link |
00:43:06.320
So this is a pretty common thing to do an instruction following. So you tell it like,
link |
00:43:09.760
you know, go to the red house, and then it's supposed to go to the red house. But then one
link |
00:43:13.840
of the things that Jacob did is he treated that sentence not as a command from a person,
link |
00:43:19.360
but as a representation of the internal kind of state of the mind of this policy, essentially,
link |
00:43:26.480
so that when it was faced with a new task, what it would do is it would basically try to think of
link |
00:43:31.040
possible language descriptions, attempt to do them and see if they led to the right outcome.
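A very rough sketch of the search loop being described (every name here is hypothetical, and this is an illustration of the idea rather than Jacob Andreas's actual system): the agent proposes candidate language descriptions, executes an instruction-conditioned policy for each, and keeps whichever description earns reward.

    def solve_new_task(env, policy, candidate_descriptions, episodes_per_candidate=3):
        """Try each natural-language description as a hypothesis about how to act.
        `policy(observation, description)` is an instruction-conditioned policy and
        `env` follows a gym-style reset/step interface (both are assumptions)."""
        best_description, best_return = None, float("-inf")
        for description in candidate_descriptions:   # e.g. "go to the red house"
            total = 0.0
            for _ in range(episodes_per_candidate):
                obs, done = env.reset(), False
                while not done:
                    obs, reward, done, _ = env.step(policy(obs, description))
                    total += reward
            if total > best_return:
                best_description, best_return = description, total
        # The winning string doubles as a human-readable account of the solution.
        return best_description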
link |
00:43:35.440
So would it kind of think out loud, like, you know, I'm faced with this new task, what am I
link |
00:43:38.800
going to do? Let me go to the red house. Oh, that didn't work. Let me go to the blue room or something,
link |
00:43:43.680
let me go to the green plant. And once it got some reward, it would say, oh, go to the green
link |
00:43:47.280
plant, that's what's working, I'm going to go to the green plant. And then you could look at the
link |
00:43:50.160
string that it came up with, and that was a description of how it thought it should solve
link |
00:43:53.120
the problem. So you could do, you could basically incorporate language as internal state, and you
link |
00:43:58.400
can start getting some handle on these kinds of things. And then what I was kind of trying to
link |
00:44:02.960
get to is that also if you add to the reward function, the convincingness of that story.
link |
00:44:09.280
So I have another reward signal of like, people who review that story, how much they like it.
link |
00:44:16.560
So that, you know, initially that could be a hyper parameter sort of hard coded heuristic
link |
00:44:22.800
type of thing, but it's an interesting notion of the convincingness of the story becoming part
link |
00:44:30.640
of the reward function, the objective function of the explainability. It's in the world of sort of
link |
00:44:36.320
Twitter and fake news that might be a scary notion that the nature of truth may not be as
link |
00:44:42.640
important as how convincing you are in telling the story around
link |
00:44:48.640
the facts. Well, let me ask the basic question. You're one of the world class researchers in
link |
00:44:56.960
reinforcement learning, deep reinforcement learning, certainly in the robotics space.
link |
00:45:01.600
What is reinforcement learning? I think that what reinforcement learning
link |
00:45:05.920
refers to today is really just the kind of the modern incarnation of learning based control.
link |
00:45:12.800
So classically, reinforcement learning has a much more narrow definition, which is that
link |
00:45:16.320
it's, you know, literally learning from reinforcement, like the thing does something
link |
00:45:20.080
and then it gets a reward or punishment. But really, I think the way the term is used today is
link |
00:45:24.560
it's used more broadly for learning based control. So some kind of system that's supposed
link |
00:45:29.520
to be controlling something and it uses data to get better. And what does control mean? So this
link |
00:45:36.160
action is the fundamental element there. It means making rational decisions.
link |
00:45:40.880
And rational decisions are decisions that maximize a measure of utility.
link |
00:45:44.160
And sequentially, so you made decisions time and time and time again. Now, like,
link |
00:45:49.840
it's easier to see that kind of idea in the space of maybe games and the space of robotics.
link |
00:45:56.240
Do you see it bigger than that? Is it applicable? Like, where are the limits of the applicability
link |
00:46:02.800
of reinforcement learning? Yeah, so rational decision making is essentially the
link |
00:46:09.120
encapsulation of the AI problem viewed through a particular lens. So any problem that we would
link |
00:46:14.800
want a machine to do, an intelligent machine can likely be represented as a decision making problem.
link |
00:46:20.400
Classifying images is a decision making problem, although not a sequential one typically.
link |
00:46:24.720
You know, controlling a chemical plant is a decision making problem,
link |
00:46:30.240
deciding what videos to recommend on YouTube is a decision making problem.
link |
00:46:34.320
And one of the really appealing things about reinforcement learning is, if it does encapsulate
link |
00:46:39.680
the range of all these decision making problems, perhaps working on reinforcement learning is,
link |
00:46:44.640
you know, one of the ways to reach a very broad swath of AI problems.
link |
00:46:48.400
But what do you see as the fundamental difference between reinforcement learning and maybe supervised
link |
00:46:55.680
machine learning? So reinforcement learning can be viewed as a generalization of supervised
link |
00:47:01.520
machine learning. You can certainly cast supervised learning as a reinforcement learning problem.
link |
00:47:05.520
You can just say your loss function is the negative of your reward. But you have stronger
link |
00:47:09.680
assumptions. You have the assumption that someone actually told you what the correct answer was,
link |
00:47:13.440
that your data was IID and so on. So you could view reinforcement learning as essentially relaxing
link |
00:47:18.960
some of those assumptions. Now, that's not always a very productive way to look at it,
link |
00:47:22.000
because if you actually have a supervised learning problem, you'll probably solve it
link |
00:47:25.120
much more effectively by using supervised learning methods, because it's easier. But
link |
00:47:30.240
you can view reinforcement learning as a generalization of that.
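As a toy illustration of that reduction (a minimal sketch with made-up numbers, not anything described by the guest): a supervised example becomes a one-step episode whose reward is just the negative of the loss.

```python
import numpy as np

# Minimal sketch: casting a supervised classification example as a one-step
# RL problem. All names and numbers here are illustrative assumptions.

def loss(prediction, label):
    # Ordinary supervised 0/1 loss for one example.
    return 0.0 if prediction == label else 1.0

def reward(prediction, label):
    # The RL view: reward is simply the negative of the supervised loss.
    return -loss(prediction, label)

# One "episode" = one labeled example; the "action" is the predicted class.
scores, label = np.array([0.2, 0.8]), 1
action = int(np.argmax(scores))   # the policy's decision for this input
print(reward(action, label))      # 0.0 here: correct answer, maximal reward
```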
link |
00:47:32.320
No, for sure. But they're fundamentally different. That's a mathematical statement.
link |
00:47:37.040
That's absolutely correct. But it seems that reinforcement learning, the kind of tools
link |
00:47:42.240
we're bringing to the table today are different. So maybe down the line, everything will be a
link |
00:47:47.440
reinforcement learning problem. Just like you said, image classification should be mapped to a
link |
00:47:52.560
reinforcement learning problem. But today, the tools and ideas, the way we think about them are
link |
00:47:57.920
different. Sort of supervised learning has been used very effectively to solve basic,
link |
00:48:04.640
narrow AI problems. Reinforcement learning kind of represents the dream of AI. It's very much so
link |
00:48:13.200
in the research space now, in sort of captivating the imagination of people of what we can do with
link |
00:48:18.640
intelligent systems. But it hasn't yet had as wide of an impact as the supervised learning
link |
00:48:24.480
approaches. So my question comes in a more practical sense. What do you see as the
link |
00:48:30.480
gap between the more general reinforcement learning and the very specific decision making of
link |
00:48:37.200
supervised learning, where there's just one step in the sequence?
link |
00:48:42.960
So from a practical standpoint, I think that one thing that is potentially a little tough now,
link |
00:48:49.280
and this is, I think, something that we'll see, this is a gap that we might see closing over
link |
00:48:53.040
the next couple of years, is the ability of reinforcement learning algorithms to effectively
link |
00:48:57.680
utilize large amounts of prior data. So one of the reasons why it's a bit difficult today
link |
00:49:03.280
to use reinforcement learning for all the things that we might want to use it for,
link |
00:49:07.040
is that in most of the settings where we want to do rational decision making,
link |
00:49:12.000
it's a little bit tough to just deploy some policy that does crazy stuff and learns purely
link |
00:49:17.520
through trial and error. It's much easier to collect a lot of data, a lot of logs of some
link |
00:49:22.480
other policy that you've got, and then maybe you, you know, if you can get a good policy out of that,
link |
00:49:27.520
then you deploy it and let it kind of fine tune a little bit. But algorithmically,
link |
00:49:31.760
it's quite difficult to do that. So I think that once we figure out how to get reinforcement learning
link |
00:49:36.800
to bootstrap effectively from large datasets, then we'll see very, very rapid growth in
link |
00:49:43.360
applications of these technologies. So this is what's referred to as off policy reinforcement
link |
00:49:46.880
learning or offline RL or batch RL. And I think we're seeing a lot of research right now that
link |
00:49:52.320
does bring us closer and closer to that. Can you maybe paint the picture of the different methods,
link |
00:49:57.040
as you said, off policy. What's value based reinforcement learning, what's policy based,
link |
00:50:02.640
what's model based, what's off policy versus on policy, what are the different categories of reinforcement
link |
00:50:06.880
learning? Yeah. So one way we can think about reinforcement learning is that it's in some
link |
00:50:13.360
very fundamental way about learning models that can answer kind of what if questions. So
link |
00:50:20.240
what would happen if I take this action that I hadn't taken before? And you do that, of course,
link |
00:50:25.040
from experience, from data. And oftentimes you do it in a loop. So you build a model that answers
link |
00:50:29.840
these what if questions, use it to figure out the best action you can take, and then go and try
link |
00:50:34.400
taking that and see if the outcome agrees with what you predicted. So the different kinds of
link |
00:50:40.560
techniques basically refer to different ways of doing it. So model based methods answer a question
link |
00:50:45.440
of what state you would get, basically, what would happen to the world if you were to take a
link |
00:50:50.080
certain action. Value based methods answer the question of what value you would get, meaning
link |
00:50:54.880
what utility you would get. But in a sense, they're not really all that different, because
link |
00:50:59.520
they're both really just answering these what if questions. Now, unfortunately, for us, with
link |
00:51:05.040
current machine learning methods, answering what if questions can be really hard, because
link |
00:51:08.640
they are really questions about things that didn't happen. If you wanted to answer what if questions
link |
00:51:13.040
about things that did happen, you wouldn't need to learn a model, you would just repeat the
link |
00:51:15.920
thing that worked before. And that's really a big part of why RL is a little bit tough. So if you
link |
00:51:23.680
have a purely on policy kind of online process, then you ask these what if questions, you make
link |
00:51:29.360
some mistakes, then you go and try doing those mistaken things. And then you observe kind of
link |
00:51:33.840
the counter examples that will teach you not to do those things again. If you have a bunch of
link |
00:51:37.920
off policy data, and you just want to synthesize the best policy you can out of that data, then
link |
00:51:43.200
you really have to deal with the challenges of making these counterfactual predictions.
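A minimal sketch of the two flavors of "what if" model described above, using toy stand-in functions (the names and dynamics are illustrative assumptions, not a real learned system):

```python
# Two ways of answering "what would happen if I took this action?"
# In practice both are learned from data; here they are toy stand-ins.

def dynamics_model(state, action):
    # Model-based view: predict the next state of the world.
    return state + action                 # made-up linear dynamics

def q_function(state, action):
    # Value-based view: predict the utility (return) of taking this action.
    return -(state + action) ** 2         # made-up utility: prefer ending near zero

state = 3.0
candidate_actions = [-4.0, -1.0, 2.0]

# Both answer counterfactual queries about actions never actually taken here.
for a in candidate_actions:
    print(a, dynamics_model(state, a), q_function(state, a))

# Use the answers to pick an action, then (in the loop described above)
# go try it and check whether the outcome matches the prediction.
best_action = max(candidate_actions, key=lambda a: q_function(state, a))
```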
link |
00:51:47.200
First of all, what's a policy?
link |
00:51:49.760
A policy is a model or some kind of function that maps from observations of the world to actions.
link |
00:51:58.960
So in reinforcement learning, we often refer to the current configuration of the world as the
link |
00:52:04.960
state. So we say the state kind of encompasses everything you need to fully define where the
link |
00:52:09.520
world is at at the moment. And depending on how we formulate the problem, we might say you either
link |
00:52:14.000
get to see the state or you get to see an observation, which is some snapshot or piece of the state.
link |
00:52:19.520
So the policy just includes everything it needs in order to be able to act in this world.
link |
00:52:25.440
Yes.
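A minimal sketch of that definition (purely illustrative; the weights are made up rather than learned): a policy is just a function from observations to actions.

```python
import numpy as np

# A policy: any function that maps observations (or states) to actions.
# In RL the parameters would be learned; here they are made-up constants.

class LinearPolicy:
    def __init__(self, weights):
        self.weights = weights

    def act(self, observation):
        # Observation in, action out -- that is all a policy is.
        return self.weights @ observation

policy = LinearPolicy(np.array([[0.5, -1.0],
                                [2.0,  0.1]]))
observation = np.array([1.0, 0.2])   # a snapshot or piece of the state
action = policy.act(observation)     # the decision the policy makes
```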
link |
00:52:26.000
And so what does off policy mean?
link |
00:52:29.440
Yeah, so the terms on policy and off policy refer to how you get your data.
link |
00:52:32.720
So if you get your data from somebody else who was doing some other stuff, maybe you get your data
link |
00:52:37.280
from some manually programmed system that was just running in the world before,
link |
00:52:43.680
that's referred to as off policy data. But if you got the data by actually acting in the world based
link |
00:52:48.480
on what your current policy thinks is good, we call that on policy data. And obviously,
link |
00:52:53.200
on policy data is more useful to you because if your current policy makes some bad decisions,
link |
00:52:58.560
you will actually see that those decisions are bad. Off policy data, however, might be much easier
link |
00:53:02.800
to obtain because maybe that's all the log data that you have from before.
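A minimal sketch of that distinction on a made-up toy environment (all names here are illustrative assumptions): the same collection loop produces off policy data when driven by an old, hand-coded controller, and on policy data when driven by the current policy.

```python
import random

class ToyEnvironment:
    """A made-up one-dimensional environment, purely for illustration."""
    def reset(self):
        self.state = 0.0
        return self.state

    def step(self, action):
        self.state += action
        reward = -abs(self.state)                    # prefer staying near zero
        done = abs(self.state) > 5
        return self.state, reward, done

def old_controller(obs):
    # Stands in for a manually programmed system that was running before.
    return 1.0 if obs < 2.0 else -1.0

def current_policy(obs):
    # Stands in for the policy currently being learned.
    return random.choice([-1.0, 1.0])

def collect(env, act_fn, steps):
    data, obs = [], env.reset()
    for _ in range(steps):
        action = act_fn(obs)
        next_obs, reward, done = env.step(action)
        data.append((obs, action, reward, next_obs))
        obs = env.reset() if done else next_obs
    return data

off_policy_data = collect(ToyEnvironment(), old_controller, steps=100)  # someone else's logs
on_policy_data = collect(ToyEnvironment(), current_policy, steps=100)   # our own behavior
```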
link |
00:53:07.840
So we talked about offline, talked about autonomous vehicles, so you can envision
link |
00:53:13.520
off policy kind of approaches in robotic spaces where there's already a ton of robots out there,
link |
00:53:18.880
but they don't get the luxury of being able to explore based on a reinforcement learning framework.
link |
00:53:25.440
So how do we make, again, open question, but how do we make off policy methods work?
link |
00:53:32.240
Yeah, so this is something that has been kind of a big open problem for a while. And in the last
link |
00:53:37.520
few years, people have made a little bit of progress on that. I can tell you about, and it's
link |
00:53:43.120
not by any means solved yet, but I can tell you some of the things that, for example, we've done to
link |
00:53:46.720
try to address some of the challenges. It turns out that one really big challenge with off policy
link |
00:53:52.560
reinforcement learning is that you can't really trust your models to give accurate predictions
link |
00:53:58.640
for any possible action. So if I've never tried to, if in my data set I never saw somebody steering
link |
00:54:05.120
the car off the road onto the sidewalk, my value function or my model is probably not going to
link |
00:54:10.800
predict the right thing if I ask what would happen if I were to steer the car off the road onto the
link |
00:54:14.480
sidewalk. So one of the important things you have to do to get off policy RL to work is you have
link |
00:54:20.560
to be able to figure out whether a given action will result in a trustworthy prediction or not.
link |
00:54:25.120
And you can use kind of distribution estimation methods, kind of density estimation methods
link |
00:54:31.120
to try to figure that out. So you could figure out that, well, this action, my model is telling me
link |
00:54:34.640
that it's great, but it looks totally different from any action I've taken before. So my model is
link |
00:54:38.400
probably not correct. And you can incorporate regularization terms into your learning objective
link |
00:54:44.080
that will essentially tell you not to ask those questions that your model is unable to answer.
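A minimal sketch of that idea (an illustrative toy, not any specific published algorithm): fit a crude density model to the actions seen in the logged data, then penalize candidate actions that the model says look unlike anything in the data, so the learner doesn't rely on value predictions it has no reason to trust.

```python
import numpy as np

# Logged actions from the dataset (one-dimensional, purely illustrative).
rng = np.random.default_rng(0)
logged_actions = rng.normal(loc=0.0, scale=1.0, size=1000)

# Crude density estimate over logged actions: a single Gaussian fit.
mu, sigma = logged_actions.mean(), logged_actions.std()

def log_density(action):
    return -0.5 * ((action - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def q_estimate(action):
    # Stand-in for a learned value model; it can be badly wrong far from the data.
    return action ** 2

def regularized_score(action, penalty_weight=10.0):
    # "Don't ask questions the model can't answer": down-weight low-density actions.
    return q_estimate(action) + penalty_weight * log_density(action)

candidates = np.linspace(-5.0, 5.0, 101)
naive_best = candidates[np.argmax([q_estimate(a) for a in candidates])]        # trusts the model everywhere
safe_best = candidates[np.argmax([regularized_score(a) for a in candidates])]  # stays near the data
```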
link |
00:54:50.720
What would lead to breakthroughs in this space, do you think? Like what's needed? Is this a data set
link |
00:54:56.640
question? Do we need to collect big benchmark data sets that allow us to explore the space?
link |
00:55:03.600
Is it new kinds of methodologies? Like what's your sense? Or maybe coming together in a space
link |
00:55:11.520
of robotics and defining the right problem to be working on? I think for off policy reinforcement
link |
00:55:16.400
learning in particular, it's very much an algorithms question right now. And this is something that
link |
00:55:21.760
I think is great because an algorithms question is something that just takes some very smart people to
link |
00:55:26.960
get together and think about it really hard. Whereas if it was like a data problem or hardware
link |
00:55:32.400
problem, that would take some serious engineering. So that's why I'm pretty excited about that
link |
00:55:36.640
problem because I think that we're in a position where we can make some real progress on it just
link |
00:55:40.160
by coming up with the right algorithms. In terms of which algorithms those could be, the problems at
link |
00:55:45.520
their core are very related to problems in things like causal inference because what you're really
link |
00:55:52.400
dealing with is situations where you have a model, a statistical model that's trying to make predictions
link |
00:55:57.760
about things that it hadn't seen before. And if it's a model that's generalizing properly,
link |
00:56:03.120
that'll make good predictions. If it's a model that picks up on various correlations, it will
link |
00:56:07.200
not generalize properly. And then you have an arsenal of tools you can use. You could, for example,
link |
00:56:11.600
figure out what are the regions where it's trustworthy, or on the other hand, you could try
link |
00:56:15.840
to make it generalize better somehow, or some combination of the two. Is there room for
link |
00:56:22.720
mixing where most of it, like 90, 95% is off policy, you already have the data set,
link |
00:56:31.040
and then you get to send the robot out to do a little exploration? What's that role of
link |
00:56:37.120
mixing them together? Yeah, absolutely. I think that this is something that you actually
link |
00:56:42.960
described very well at the beginning of our discussion when you talked about the iceberg.
link |
00:56:47.200
This is the iceberg, that the 99% of your prior experience, that's your iceberg. You'd use that
link |
00:56:52.000
for off policy reinforcement learning. And then of course, if you've never opened that particular
link |
00:56:57.920
kind of door with that particular lock before, then you have to go out and fiddle with it a little
link |
00:57:01.760
bit. And that's that additional 1% to help you figure out a new task. And I think that's actually
link |
00:57:05.840
like a pretty good recipe going forward. Is this to you the most exciting space of reinforcement
link |
00:57:11.920
learning now? Or, maybe taking a step back, not just now, what's to you the
link |
00:57:18.160
most beautiful idea? Apologies for the romanticized question, but what's the most beautiful idea or concept in
link |
00:57:24.320
reinforcement learning? In general, I actually think that one of the things that is a very beautiful
link |
00:57:31.680
idea in reinforcement learning is just the idea that you can obtain a near optimal control or
link |
00:57:40.080
near optimal policy without actually having a complete model of the world. It's something that
link |
00:57:48.720
feels perhaps kind of obvious if you just hear the term reinforcement learning or you think about
link |
00:57:54.400
trial and error learning. But from a controls perspective, it's a very weird thing because
link |
00:57:58.720
classically, we think about engineered systems and controlling engineered systems as the problem
link |
00:58:07.040
of writing down some equations and then figuring out, given these equations, basically like solve
link |
00:58:11.200
for x, figure out the thing that maximizes its performance. And the theory of reinforcement
link |
00:58:18.560
learning actually gives us a mathematically principled framework to think, to reason about
link |
00:58:22.720
optimizing some quantity when you don't actually know the equations that govern that system.
link |
00:58:28.640
And to me, that actually seems kind of very elegant, not something that
link |
00:58:36.880
becomes immediately obvious, at least in the mathematical sense.
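One standard, concrete instance of that idea (textbook tabular Q-learning on a made-up chain environment; sketched here as an illustration, not anything specific to this conversation): the learner only ever sees sampled transitions, and never writes down or solves the equations of the system.

```python
import random

# Tabular Q-learning: near-optimal control from samples alone.
n_states, actions = 5, [0, 1]             # a tiny made-up chain; action 1 moves right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def sample_transition(state, action):
    # The environment: a black box from the learner's point of view.
    next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward

state = 0
for _ in range(5000):
    if random.random() < epsilon:
        action = random.choice(actions)                     # explore
    else:
        action = max(actions, key=lambda a: Q[(state, a)])  # exploit current estimates
    next_state, reward = sample_transition(state, action)
    target = reward + gamma * max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (target - Q[(state, action)])
    state = 0 if next_state == n_states - 1 else next_state
```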
link |
00:58:39.920
Does it make sense to you that it works at all?
link |
00:58:43.520
Well, I think it makes sense when you take some time to think about it, but it is a little
link |
00:58:48.000
surprising. Well, then taking a step into the deeper representations, which is also very
link |
00:58:55.840
surprising of sort of the richness of the state space, the space of environments that
link |
00:59:04.160
this kind of approach can operate in. Can you maybe say what is deep reinforcement learning?
link |
00:59:10.880
Well, deep reinforcement learning simply refers to taking reinforcement learning algorithms and
link |
00:59:16.160
combining them with high capacity neural net representations, which might at first seem like
link |
00:59:22.880
a pretty arbitrary thing, just take these two components and stick them together. But the
link |
00:59:26.640
reason that it's something that has become so important in recent years is that reinforcement
link |
00:59:33.040
learning, it kind of faces an exacerbated version of a problem that has faced many other machine
link |
00:59:39.200
learning techniques. So if we go back to the early 2000s or the late 90s, we'll see a lot
link |
00:59:46.000
of research on machine learning methods that have some very appealing mathematical properties,
link |
00:59:51.280
like they reduce to convex optimization problems, for instance. But they require very
link |
00:59:56.160
special inputs. They require a representation of the input that is clean in some way, like for
link |
01:00:01.680
example, clean in the sense that the classes in your multi class classification problems
link |
01:00:06.480
separate linearly. So they have some kind of good representation, and we call this a feature
link |
01:00:10.640
representation. And for a long time, people were very worried about features in the world of supervised
link |
01:00:15.440
learning, because somebody had to actually build those features, so you couldn't just take an image
link |
01:00:19.280
and plug it into your logistic regression or your SVM or something, someone had to take that image
link |
01:00:23.600
and process it using some handwritten code. And then neural nets came along and they could
link |
01:00:28.240
actually learn the features. And suddenly, we could apply learning directly to the raw inputs,
link |
01:00:33.280
which was great for images, but it was even more great for all the other fields where people hadn't
link |
01:00:37.920
come up with good features yet. And one of those fields actually reinforcement learning,
link |
01:00:41.680
because in reinforcement learning, the notion of features, if you don't use neural nets and you
link |
01:00:45.360
have to design your own features, is very opaque. It's very hard to imagine, let's say I'm playing
link |
01:00:51.920
chess or Go. What is a feature with which I can represent the value function for Go or even the
link |
01:00:58.320
optimal policy for Go linearly? I don't even know how to start thinking about it. And people
link |
01:01:03.200
tried all sorts of things. They would write down that an expert chess player looks for whether the
link |
01:01:07.520
knight is in the middle of the board or not. So that's a feature: is the knight in the middle of the board.
link |
01:01:11.040
And they would write these like long lists of kind of arbitrary made up stuff. And that was
link |
01:01:16.160
really kind of getting us nowhere. And chess is a little more accessible than
link |
01:01:20.400
the robotics problem. Absolutely. Right. There are at least experts in the different
link |
01:01:25.840
features for chess. But still like the neural network there, to me, that's, I mean, you put it
link |
01:01:34.000
eloquently and almost made it seem like a natural step to add neural networks. But the fact that
link |
01:01:39.760
neural networks are able to discover features in the control problem is very interesting. It's
link |
01:01:45.680
hopeful. I'm not sure what to think about it, but it feels hopeful that the control problem has
link |
01:01:51.920
features to be learned. Like, I guess my question is, is it surprising to you how far the deep side
link |
01:02:01.440
of deep reinforcement learning has been able to go, what space of problems it has been able to tackle,
link |
01:02:05.920
especially in games with AlphaStar and AlphaZero, and just the representation power
link |
01:02:16.080
there and in the robotics space. And what is your sense of the limits of this representation power
link |
01:02:23.200
in the control context? In regard to the limits here, I think that one thing
link |
01:02:32.320
that makes it a little hard to fully answer this question is because in settings where we would
link |
01:02:39.120
like to push these things to the limit, we encounter other bottlenecks. So like the reason
link |
01:02:45.840
that I can't get my robot to learn how to like, I don't know, do the dishes in the kitchen. It's
link |
01:02:53.600
not because its neural net is not big enough. It's because when you try to actually do trial
link |
01:02:59.600
and error learning, reinforcement learning directly in the real world, where you have the
link |
01:03:04.640
potential to gather these large, very, you know, highly varied and complex data sets,
link |
01:03:09.760
you start running into other problems. One problem you run into very quickly will at
link |
01:03:15.200
first sound like a very pragmatic problem, but it actually turns out to be a pretty deep scientific
link |
01:03:18.400
problem. Take the robot, put it in your kitchen, have it try to learn to do the dishes with trial and
link |
01:03:22.400
error, it'll break all your dishes, and then you'll have no more dishes to clean. Now you might think
link |
01:03:27.600
this is a very practical issue, but there's something to this, which is that if you have a
link |
01:03:30.960
person trying to do this, you know, a person will have some degree of common sense, they'll
link |
01:03:34.400
break one dish, they'll be a little more careful with the next one. And if they break all of them,
link |
01:03:37.920
they're going to go and get more or something like that. So there's all sorts of scaffolding
link |
01:03:42.880
that comes very naturally to us for our learning process. Like, you know, if I have to
link |
01:03:48.240
learn something through trial and error, I have a common sense to know that I have to, you know,
link |
01:03:52.000
try multiple times. If I screw something up, I ask for help or I reset things or something like that.
link |
01:03:56.320
And all of that is kind of outside of the classic reinforcement learning problem formulation.
link |
01:04:01.840
There are other things that can also be categorized as kind of scaffolding,
link |
01:04:06.320
but are very important, like, for example, where do you get your reward function. If I want to
link |
01:04:09.760
learn how to pour a cup of water, well, how do I know if I've done it correctly? Now that probably
link |
01:04:15.840
requires an entire computer vision system to be built just to determine that. And that seems a
link |
01:04:19.840
little bit inelegant. So there are all sorts of things like this that start to come up when we
link |
01:04:24.000
think through what we really need to get reinforcement learning to happen at scale in the real world.
link |
01:04:28.160
And many of these things actually suggest a little bit of a shortcoming in the problem
link |
01:04:32.320
formulation and a few deeper questions that we have to resolve. That's really interesting. I
link |
01:04:37.360
talked to David Silver about AlphaZero. And it seems like, again, we
link |
01:04:45.120
haven't hit the limit at all in the context when there's no broken dishes. So in the case of Go,
link |
01:04:52.000
it's really about just scaling compute. So again, the bottleneck is the amount of
link |
01:04:58.720
money you're willing to invest in compute, and then maybe the different, the scaffolding around
link |
01:05:04.320
how difficult it is to scale compute, maybe. But there's no limit there. And it's interesting.
link |
01:05:09.840
Now we move to the real world, and there's the broken dishes, there's all the, and the reward
link |
01:05:14.160
function like you mentioned, that's really nice. So what, how do we push forward there? Do you think
link |
01:05:20.160
there's this kind of sample efficiency question that people bring up, you know, not
link |
01:05:26.560
having to break 100,000 dishes? Is this an algorithm question? Is this a data selection
link |
01:05:34.800
like question? What do you think? How do we not break too many dishes?
link |
01:05:39.920
Yeah. Well, one way we can think about that is that maybe we need to be better at
link |
01:05:49.280
reusing our data, building that iceberg. So perhaps it's too much to hope that
link |
01:05:56.800
you can have a machine that, in isolation, in a vacuum, without anything else, can just master
link |
01:06:03.440
complex tasks in like, in minutes, the way that people do. But perhaps it also doesn't have to,
link |
01:06:08.240
perhaps what it really needs to do is have an existence, a lifetime where it does many things
link |
01:06:13.600
and the previous things that it has done, prepare it to do new things more efficiently.
link |
01:06:18.160
And, you know, the study of these kinds of questions typically falls under categories
link |
01:06:22.640
like multitask learning or meta learning. But they all fundamentally deal with the same
link |
01:06:27.360
general theme, which is to use experience from doing other things to learn to do new things
link |
01:06:34.000
efficiently and quickly. So what do you think about if you just look at the one particular
link |
01:06:38.880
case study of Tesla Autopilot, which is quickly approaching a million vehicles on the
link |
01:06:44.960
road, where some percentage of the time, 30, 40% of the time, it's driving using the computer vision
link |
01:06:51.440
multitask HydraNet, right? That's what they call it, HydraNet.
link |
01:07:00.960
And then the other percent is human controlled. From the human side, how can we use that data? What's
link |
01:07:08.480
your sense? So like, what's the signal? Do you have ideas in this autonomous vehicle space
link |
01:07:14.480
when people can lose their lives? You know, it's a safety critical environment. So how do we use
link |
01:07:20.880
that data? So I think that actually the kind of problems that come up when we want systems that
link |
01:07:30.320
are reliable and that can kind of understand the limits of their capabilities, they're actually
link |
01:07:35.680
very similar to the kind of problems that come up when we're doing off policy reinforcement
link |
01:07:39.360
learning. So as I mentioned before, in off policy reinforcement learning, the big problem is you
link |
01:07:44.080
need to know when you can trust the predictions of your model, because if you're trying to evaluate
link |
01:07:49.680
some pattern of behavior for which your model doesn't give you an accurate prediction, then you
link |
01:07:53.120
shouldn't use that to modify your policy. It's actually very similar to the problem that we're
link |
01:07:57.520
faced when we actually then deploy that thing. And we want to decide whether we trust it in the
link |
01:08:02.240
moment or not. So perhaps we just need to do a better job of figuring out that part. And that's
link |
01:08:08.000
a very deep research question, of course. But it's also a question that a lot of people are
link |
01:08:11.120
working on. So I'm pretty optimistic that we can make some progress on that over the next few years.
link |
01:08:15.600
What's the role of simulation in reinforcement learning, deeper reinforcement learning,
link |
01:08:19.760
reinforcement learning? Like how essential is it? It's been essential for the breakthroughs so far,
link |
01:08:26.480
for some interesting breakthroughs. Do you think it's a crutch that we rely on? I mean,
link |
01:08:32.080
again, it's the connection to our off policy discussion. But do you think we can ever get rid
link |
01:08:37.440
of simulation? Or do you think simulation will actually take over, and we'll create more and more
link |
01:08:40.800
realistic simulations that will allow us to solve actual real world problems, like transferring the models
link |
01:08:47.120
we learn in simulation to real world problems? I think that simulation is a very pragmatic tool
link |
01:08:52.480
that we can use to get a lot of useful stuff to work right now. But I think that in the long run,
link |
01:08:57.520
we will need to build machines that can learn from real data, because that's the only way that we'll
link |
01:09:02.640
get them to improve perpetually. Because if we can't have our machines learn from real data,
link |
01:09:08.000
if they have to rely on simulated data, eventually the simulator becomes the bottleneck.
link |
01:09:12.240
In fact, this is a general thing. If your machine has any bottleneck that is built by humans,
link |
01:09:17.760
and that doesn't improve from data, it will eventually be the thing that holds it back.
link |
01:09:22.960
And if you're entirely reliant on your simulator, that'll be the bottleneck. If you're entirely
link |
01:09:26.720
reliant on a manually designed controller, that's going to be the bottleneck. So simulation is very
link |
01:09:31.680
useful. It's very pragmatic. But it's not a substitute for being able to utilize real experience.
link |
01:09:39.600
And this is, by the way, this is something that I think is quite relevant now, especially in the
link |
01:09:44.080
context of some of the things we've discussed, because some of these kind of scaffolding issues
link |
01:09:48.640
that I mentioned, things like the broken dishes and the unknown reward functions, like these are
link |
01:09:52.080
not problems that you would ever stumble on when working in a purely simulated kind of environment.
link |
01:09:58.480
But they become very apparent when we try to actually run these things in the real world.
link |
01:10:03.040
To throw a brief wrench into our discussion, let me ask, do you think we're living in a simulation?
link |
01:10:07.920
Oh, I have no idea.
link |
01:10:09.760
Do you think that's a useful thing to even think about the fundamental physics nature of reality?
link |
01:10:16.800
Or another perspective? The reason I think the simulation hypothesis is interesting is
link |
01:10:24.400
to think about how difficult is it to create sort of a virtual reality game type situation
link |
01:10:32.800
that will be sufficiently convincing to us humans or sufficiently enjoyable that we wouldn't want
link |
01:10:38.880
to leave. That's actually a practical engineering challenge. And I personally really enjoy virtual
link |
01:10:45.440
reality, but it's quite far away. But I kind of think about, what would it take for me to want
link |
01:10:50.480
to spend more time in virtual reality versus the real world? And that's sort of a nice
link |
01:10:57.600
clean question. Because at that point, we've reached, if I want to live in a virtual reality,
link |
01:11:04.640
that means we're just a few years away from a majority of the population living in a virtual
link |
01:11:08.640
reality. And that's how we create the simulation, right? You don't need to actually simulate the
link |
01:11:12.800
quantum gravity and just every aspect of the universe. And that's a really,
link |
01:11:20.400
that's an interesting question for reinforcement learning too, is if we want to make sufficiently
link |
01:11:24.800
realistic simulations that may blend the difference between sort of the real world and
link |
01:11:30.240
the simulation, and thereby some of the problems we've been talking about kind of
link |
01:11:36.800
go away, if we can create actually interesting, rich simulations. It's an interesting question.
link |
01:11:41.520
And it actually, I think your question casts your previous question in a very interesting light,
link |
01:11:46.720
because in some ways, asking whether we can, well, the more practical version is like,
link |
01:11:53.920
can we build simulators that are good enough to train essentially AI systems that will work
link |
01:11:59.680
in the world? And it's kind of interesting to think about this, about what this implies. If true,
link |
01:12:06.160
it kind of implies that it's easier to create the universe than it is to create a brain.
link |
01:12:09.600
And put this way, that seems kind of weird.
link |
01:12:14.320
The aspect of the simulation most interesting to me is the simulation of other humans.
link |
01:12:20.800
That seems to be a complexity that makes the robotics problem harder. Now,
link |
01:12:27.920
I don't know if every robotics person agrees with that notion. Just as a quick aside,
link |
01:12:33.440
what are your thoughts about when the human enters the picture of the robotics problem? How
link |
01:12:39.840
does that change the reinforcement learning problem, the learning problem in general?
link |
01:12:44.880
Yeah, I think that's a kind of a complex question. And I guess my hope for a while had been that
link |
01:12:53.520
if we build these robotic learning systems that are multitask, that utilize lots of prior data,
link |
01:13:00.880
and that learn from their own experience, the bit where they have to interact with people
link |
01:13:05.440
will be perhaps handled in much the same way as all the other bits. So if they have prior
link |
01:13:09.440
experience of interacting with people and they can learn from their own experience of interacting
link |
01:13:13.440
with people for this new task, maybe that'll be enough. Now, of course, if it's not enough,
link |
01:13:19.120
there are many other things we can do. And there's quite a bit of research in that area.
link |
01:13:22.560
But I think it's worth a shot to see whether the multi agent interaction, the ability to understand
link |
01:13:29.920
that other beings in the world have their own goals and intentions and thoughts and so on,
link |
01:13:34.960
whether that kind of understanding can emerge automatically from simply learning to do things
link |
01:13:41.360
and maximize utility. That information arises from the data. You've said something
link |
01:13:46.960
about gravity, sort of that you don't need to explicitly inject anything into the system,
link |
01:13:53.040
it can be learned from the data. And gravity is an example of something that could be learned
link |
01:13:57.360
from data, sort of like the physics of the world. What are the limits of what we can learn from
link |
01:14:06.800
data? So a very simple, clean way to ask that is, do you really think we can learn gravity
link |
01:14:15.280
from just data, the idea, the laws of gravity? So something that I think is a common kind of
link |
01:14:23.040
pitfall when thinking about prior knowledge and learning is to assume that just because we know
link |
01:14:30.800
something, that it's better to tell the machine about it rather than have it figure it out
link |
01:14:35.600
and so on. In many cases, things that are important, that affect many of the events
link |
01:14:43.440
that the machine will experience are actually pretty easy to learn. If every time you drop
link |
01:14:48.960
something, it falls down. Yeah, you might get Newton's version, not Einstein's version,
link |
01:14:55.840
but it'll be pretty good and it will probably be sufficient for you to act rationally in the world
link |
01:15:00.720
because you see the phenomenon all the time. So things that are readily apparent from the data,
link |
01:15:06.000
we might not need to specify those by hand. It might actually be easier to let the machine
link |
01:15:09.120
figure them out. It just feels like there might be a space of many local
link |
01:15:13.360
minima in terms of theories of this world that we would discover and get stuck on.
link |
01:15:20.240
Yeah, of course.
link |
01:15:21.120
That Newtonian mechanics is not necessarily easy to come by.
link |
01:15:27.520
Yeah, and well, in fact, in some fields of science, for example, human civilizations
link |
01:15:32.480
fell into these local optima. So for example, if you think about how people
link |
01:15:36.880
tried to figure out biology and medicine, for the longest time, the kind of rules, the kind of
link |
01:15:44.080
principles that serve us very well in our day to day lives actually serve us very poorly
link |
01:15:47.760
in understanding medicine and biology. We had very superstitious and weird ideas about how
link |
01:15:53.920
the body worked until the advent of the modern scientific method. So that does seem to be
link |
01:15:59.920
a failing of this approach, but it's also a failing of human intelligence arguably.
link |
01:16:03.040
Yeah, maybe a small aside, but the idea of self play is fascinating in reinforcement learning,
link |
01:16:10.000
sort of these competitive, creating a competitive context in which agents can play against each
link |
01:16:15.760
other at sort of the same skill level and thereby increasing each other's skill level.
link |
01:16:20.960
It seems to be this kind of self improving mechanism is exceptionally powerful in the
link |
01:16:26.320
context where it could be applied. First of all, is that beautiful to you that this mechanism
link |
01:16:32.720
works as well as it does, and also, can it be generalized to other contexts, like the robotics space
link |
01:16:40.880
or anything that's applicable to the real world?
link |
01:16:43.760
I think that it's a very interesting idea, but I suspect that the bottleneck to actually
link |
01:16:51.520
generalizing it to the robotic setting is actually going to be the same as
link |
01:16:55.760
the bottleneck for everything else, that we need to be able to build machines that can get better
link |
01:17:01.040
and better through natural interaction with the world. And once we can do that, then they can go
link |
01:17:06.480
out and play with each other, they can play with people, they can play with the natural environment.
link |
01:17:12.720
But before we get there, we've got all these other problems we have to get out of the way.
link |
01:17:16.160
So there's no shortcut around that. You have to interact with the natural environment that...
link |
01:17:20.880
Well, because in a self play setting, you still need a mediating mechanism. So the reason that
link |
01:17:25.760
self play works for a board game is because the rules of that board game
link |
01:17:31.200
mediate the interaction between the agents. So the kind of intelligent behavior that will
link |
01:17:35.520
emerge depends very heavily on the nature of that mediating mechanism.
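A minimal self play sketch on a trivial made-up game (rock-paper-scissors; everything here is an illustrative assumption): both players are copies of the same learner, and the rules of the game are the mediating mechanism that shapes what "getting better" means.

```python
import random

MOVES = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def sample(policy):
    # Sample a move in proportion to the (unnormalized) preferences.
    return random.choices(MOVES, weights=[policy[m] for m in MOVES])[0]

def best_response(opponent_counts):
    # Play the move that beats the opponent's most frequent move so far.
    likely = max(MOVES, key=lambda m: opponent_counts[m])
    return next(m for m in MOVES if BEATS[m] == likely)

policy = {m: 1.0 for m in MOVES}            # the single learner's preferences
opponent_counts = {m: 1 for m in MOVES}

for episode in range(10000):
    frozen_copy = dict(policy)              # the "other" player: a snapshot of itself
    my_move, their_move = sample(policy), sample(frozen_copy)
    opponent_counts[their_move] += 1
    # Nudge preferences toward whatever currently beats the copy's behavior;
    # the game's rules (BEATS) are what mediate the interaction.
    policy[best_response(opponent_counts)] += 0.01
```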
link |
01:17:39.680
So on the side of reward functions, coming up with a good reward function seems to
link |
01:17:44.480
be the thing that we associate with general... like human beings seem to value the idea of
link |
01:17:51.360
developing our own reward functions, of arriving at meaning and so on. And yet for reinforcement
link |
01:17:59.440
learning, we often specify that as a given. What's your sense of how we develop good reward
link |
01:18:07.680
functions? Yeah, I think that's a very complicated and very deep question. And you're completely
link |
01:18:12.720
right that classically in reinforcement learning, this question has been treated as a non-issue,
link |
01:18:19.200
that you treat the reward as this external thing that comes from some other bit of your biology
link |
01:18:26.320
and you don't worry about it. And I do think that that's actually a little bit of a mistake that
link |
01:18:31.920
we should worry about it. And we can approach it in a few different ways. We can approach it,
link |
01:18:35.760
for instance, by thinking of reward as a communication medium. We can say, well,
link |
01:18:39.440
how does a person communicate to a robot what its objective is? You can approach it also as
link |
01:18:45.120
sort of more of an intrinsic motivation medium. You could say, can we write down
link |
01:18:50.320
kind of a general objective that leads to good capability? Like, for example, can you write
link |
01:18:55.920
down some objective such that even in the absence of any other task, if you maximize that objective,
link |
01:19:00.080
you'll sort of learn useful things. This is something that has sometimes been called unsupervised
link |
01:19:06.000
reinforcement learning, which I think is a really fascinating area of research, especially today.
link |
01:19:11.360
We've done a bit of work on that recently. One of the things we've studied is whether
link |
01:19:14.640
we can have some notion of unsupervised reinforcement learning by means of
link |
01:19:22.000
information theoretic quantities, like, for instance, minimizing a Bayesian measure of surprise. This
link |
01:19:26.640
is an idea that was pioneered actually in the computational neuroscience community by folks
link |
01:19:30.880
like Carl Friston. And we've done some work recently that shows that you can actually learn
link |
01:19:34.960
pretty interesting skills by essentially behaving in a way that allows you to make accurate predictions
link |
01:19:41.600
about the world. It seems a little circular. Do the things that will lead to you getting the right
link |
01:19:46.000
answer for prediction. But by doing this, you can sort of discover stable niches in the world.
link |
01:19:52.880
You can discover that if you're playing Tetris, then correctly clearing the rows will let you
link |
01:19:58.240
play Tetris for longer and keep the board nice and clean, which sort of satisfies some desire
link |
01:20:02.320
for order in the world. And as a result, get some degree of leverage over your domain.
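A minimal sketch of an intrinsic, prediction-based reward in that spirit (a generic illustration with made-up dynamics, not the specific method being referenced): the agent is rewarded for landing in states its own running model predicts well, which pushes it toward stable, predictable niches.

```python
import numpy as np

visited_states = []      # states the agent has experienced so far

def log_density(state):
    # Crude running Gaussian model of visited states (stand-in for a real model).
    if len(visited_states) < 2:
        return 0.0
    mu, sigma = np.mean(visited_states), np.std(visited_states) + 1e-3
    return float(-0.5 * ((state - mu) / sigma) ** 2 - np.log(sigma))

def intrinsic_reward(next_state):
    # High when the outcome is unsurprising under the agent's own model.
    return log_density(next_state)

# Toy loop: a noisy random walk where "staying put" is the predictable niche.
rng = np.random.default_rng(0)
state = 0.0
for _ in range(1000):
    action = rng.choice([-1.0, 0.0, 1.0])
    next_state = state + action + rng.normal(scale=0.1)
    r = intrinsic_reward(next_state)   # would drive policy learning in a full system
    visited_states.append(next_state)
    state = next_state
```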
link |
01:20:06.560
So we're exploring that pretty actively. Is there a role for a human notion of curiosity
link |
01:20:12.560
in itself being the reward sort of discovering new things about the world?
link |
01:20:19.600
So one of the things that I'm pretty interested in is actually whether
link |
01:20:23.040
discovering new things can actually be an emergent property of some other objective
link |
01:20:28.640
that quantifies capability. So new things for the sake of new things, maybe might not by itself
link |
01:20:36.000
be the right answer, but perhaps we can figure out an objective for which discovering new things
link |
01:20:41.760
is actually the natural consequence. That's something we're working on right now,
link |
01:20:45.680
but I don't have a clear answer for you there yet. That's still a work in progress.
link |
01:20:49.360
You mean just as a curious observation to see sort of creative patterns of curiosity
link |
01:20:57.680
on the way to optimize for a particular task?
link |
01:21:00.720
On the way to optimize for a particular measure of capability.
link |
01:21:03.520
Is there ways to understand or anticipate unexpected, unintended consequences of
link |
01:21:14.480
particular reward functions? Sort of anticipate the kind of strategies that might be developed
link |
01:21:21.840
and try to avoid highly detrimental strategies?
link |
01:21:25.680
Yeah. So classically, this is something that has been pretty hard in reinforcement learning
link |
01:21:30.240
because it's difficult for a designer to have good intuition about what a learning algorithm
link |
01:21:35.600
will come up with when they give it some objective. There are ways to mitigate that.
link |
01:21:40.080
One way to mitigate it is to actually define an objective that says, don't do weird stuff.
link |
01:21:45.840
You can actually quantify it and say just don't enter situations that have low probability
link |
01:21:51.200
under the distribution of states you've seen before.
link |
01:21:54.480
It turns out that that's actually one very good way to do off policy reinforcement learning,
link |
01:21:57.760
actually. So we can do some things like that.
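One hedged way to write that "don't do weird stuff" objective (standard notation; $p_{\text{data}}$ is assumed to be some density model fit to previously seen states):

$$\tilde{r}(s, a) = r(s, a) + \lambda \log p_{\text{data}}(s),$$

so the learner is explicitly penalized for entering states that had low probability under its prior experience.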
link |
01:22:02.320
If we slowly venture in speaking about reward functions into greater and greater levels of
link |
01:22:08.480
intelligence, there's, I mean, Stuart Russell thinks about this, the alignment of AI systems
link |
01:22:16.240
with us humans. So how do we ensure that AGI systems align with us humans?
link |
01:22:22.800
It's kind of a reward function question of specifying the behavior of AI systems
link |
01:22:31.840
such that their success aligns with the broader intended success interests of human beings.
link |
01:22:40.160
Do you have thoughts on this? Do you have concerns of where reinforcement learning fits into this?
link |
01:22:45.120
Or are you really focused on the current moment of us being quite far away and trying to solve
link |
01:22:50.080
the robotics problem? I don't have a great answer to this. And I do think that this is a problem
link |
01:22:56.720
that's important to figure out. For my part, I'm actually a bit more concerned about the other
link |
01:23:03.040
side of this equation that maybe rather than unintended consequences for objectives that
link |
01:23:11.280
are specified too well, I'm actually more worried right now about unintended consequences for
link |
01:23:15.520
objectives that are not optimized well enough, which might become a very pressing problem
link |
01:23:21.120
when we, for instance, try to use these techniques for safety critical systems like
link |
01:23:26.000
cars and aircraft and so on. I think at some point we'll face the issue of objectives being
link |
01:23:30.960
optimized too well, but right now I think we're more likely to face the issue of them not being
link |
01:23:35.760
optimized well enough. But you don't think unintended consequences can arise even when
link |
01:23:39.840
you're far from optimality, sort of like on the path to it? Oh, no, I think unintended
link |
01:23:44.640
consequences can absolutely arise. It's just I think right now the bottleneck for improving
link |
01:23:50.320
reliability, safety and things like that is more with systems that need to work better,
link |
01:23:56.400
that need to optimize their objective better. Do you have thoughts, concerns about existential
link |
01:24:01.760
threats of human level intelligence? If we put on our hat of looking in 10, 20, 100, 500 years from
link |
01:24:10.080
now, do you have concerns about existential threats of AI systems? I think there are absolutely
link |
01:24:16.560
existential threats for AI systems just like there are for any powerful technology.
link |
01:24:22.240
But I think that these kinds of problems can take many forms and some of those forms will
link |
01:24:29.440
come down to people with nefarious intent. Some of them will come down to AI systems that have
link |
01:24:37.280
some fatal flaws and some of them will of course come down to AI systems that are too capable in
link |
01:24:42.400
some way. But among this set of potential concerns, I would actually be much more concerned about the
link |
01:24:50.320
first two right now and principally the one with nefarious humans because just through all
link |
01:24:55.200
of human history actually it's the nefarious humans that have been the problem not the nefarious
link |
01:24:58.080
machines, than I am about the others. And I think that right now the best that I can do to make
link |
01:25:05.200
sure things go well is to build the best technology I can and also hopefully promote
link |
01:25:10.000
responsible use of that technology. Do you think RL systems have something to teach us humans?
link |
01:25:18.800
You said nefarious humans getting us in trouble. I mean, machine learning systems in some ways
link |
01:25:23.680
have revealed to us the ethical flaws in our data in that same kind of way. Can reinforcement learning
link |
01:25:30.560
teach us about ourselves? Has it taught something? What have you learned about yourself from trying
link |
01:25:37.200
to build robots and reinforcement learning systems? I'm not sure what I've learned about myself but
link |
01:25:45.280
maybe part of the answer to your question might become a little bit more apparent once we see
link |
01:25:53.200
more widespread deployment of reinforcement learning for decision making support in domains
link |
01:25:59.440
like healthcare, education, social media, etc. And I think we will see some interesting stuff
link |
01:26:05.440
emerge there. We will see for instance what kind of behaviors these systems come up with
link |
01:26:11.280
in situations where there is interaction with humans and where they have possibility of
link |
01:26:16.720
influencing human behavior. I think we're not quite there yet but maybe in the next two years
link |
01:26:21.520
we'll see some interesting stuff come out in that area. I hope outside the research space because
link |
01:26:25.920
the exciting space where this could be observed is sort of large companies that deal with large
link |
01:26:31.440
data and I hope there's some transparency. One of the things that's unclear when I look at social
link |
01:26:37.440
networks and just online is why an algorithm did something or whether even an algorithm was involved
link |
01:26:44.960
and that'd be interesting from a research perspective, just to observe the results of algorithms,
link |
01:26:52.960
to open up that data or to at least be sufficiently transparent about the behavior of these AI systems
link |
01:26:59.680
in the real world. What's your sense? I don't know if you looked at the blog post The Bitter Lesson
link |
01:27:05.600
by Rich Sutton where it looks at sort of the big lesson of research in AI and reinforcement learning
link |
01:27:14.800
is that simple methods, general methods that leverage computation seem to work well. So basically
link |
01:27:22.240
don't try to do any kind of fancy algorithms just wait for computation to get fast. Do you share
link |
01:27:28.800
this kind of intuition? I think the high level idea makes a lot of sense. I'm not sure that my
link |
01:27:34.960
takeaway would be that we don't need to work on algorithms. I think that my takeaway would be that
link |
01:27:39.920
we should work on general algorithms and actually I think that this idea of needing to better automate
link |
01:27:50.000
the acquisition of experience in the real world actually follows pretty naturally from Rich
link |
01:27:56.720
Sutton's conclusion. So if the claim is that automated general methods plus data leads to good
link |
01:28:05.040
results, then it makes sense that we should build general methods and we should build the kind of
link |
01:28:08.880
methods that we can deploy and get them to go out there and collect their experience autonomously.
link |
01:28:13.760
I think that one place where I think that the current state of things falls a little bit short
link |
01:28:19.040
of that is actually the going out there and collecting the data autonomously, which is easy to
link |
01:28:24.080
do in a simulated board game but very hard to do in the real world. Yeah, it keeps coming back to
link |
01:28:28.640
this one problem, right? So your mind is focused there now, on this real world. It just seems scary,
link |
01:28:38.160
the step of collecting the data and it seems unclear to me how we can do it effectively.
link |
01:28:45.120
Well, you know, 7 billion people in the world, each of them had to do that at some point in
link |
01:28:49.600
their lives. And we should leverage that experience that they've all done. We should be able to try
link |
01:28:55.600
to collect that kind of data. Okay, big questions. Maybe stepping back to your life, what book or
link |
01:29:05.920
books, technical or fiction or philosophical had a big impact on the way you saw the world,
link |
01:29:13.920
and the way you thought about the world, your life in general. And maybe what books,
link |
01:29:20.800
if it's different, would you recommend people consider reading on their own intellectual
link |
01:29:25.440
journey? It could be within reinforcement learning, but it could be very much bigger.
link |
01:29:32.160
I don't know if this is like a scientifically, like, particularly meaningful answer, but
link |
01:29:39.280
like, the honest answer is that I actually found a lot of the work by Isaac Asimov to be very
link |
01:29:45.680
inspiring when I was younger. I don't know if that has anything to do with AI necessarily.
link |
01:29:49.920
You don't think it had a ripple effect in your life?
link |
01:29:52.400
Maybe it did. But yeah, I think that a vision of a future where, well, first of all,
link |
01:30:03.360
artificial, I might say artificial intelligence system, artificial robotic systems,
link |
01:30:08.400
robotic systems have, you know, kind of a big place, a big role in society,
link |
01:30:13.280
and where we try to imagine the sort of the limiting case of technological advancement
link |
01:30:20.480
and how that might play out in our future history. But yeah, I think that that was
link |
01:30:28.960
in some way influential. I don't really know how, but I would recommend it. I mean,
link |
01:30:34.560
if nothing else, you'd be well entertained. When did you first, yourself, like fall in love with
link |
01:30:39.600
idea of artificial intelligence get captivated by this field?
link |
01:30:44.880
So my honest answer here is actually that I only really started to think about it as
link |
01:30:51.760
something that I might want to do pretty late, actually, in graduate school.
link |
01:30:55.840
And a big part of that was that until, you know, somewhere around 2009, 2010,
link |
01:31:01.040
it just wasn't really high on my priority list because I didn't think that it was something
link |
01:31:06.640
where we were going to see very substantial advances in my lifetime. And, you know, maybe
link |
01:31:14.160
in terms of my career, the time when I really decided I wanted to work on this was when I
link |
01:31:20.400
actually took a seminar course that was taught by Professor Andrew Ng. And, you know, at that
link |
01:31:25.760
point, I of course had like a decent understanding of the technical things involved.
link |
01:31:30.000
But one of the things that really resonated with me was when he said in the opening lecture,
link |
01:31:33.600
something to the effect of, well, he used to have graduate students come to him and talk about
link |
01:31:37.920
how they wanted to work on AI and he would kind of chuckle and give them some math problem to deal
link |
01:31:41.840
with. But now he's actually thinking that this is an area where we might see like substantial
link |
01:31:45.840
advances in our lifetime. And that kind of got me thinking because, you know, in some abstract
link |
01:31:51.200
sense, yeah, like you can kind of imagine that. But in a very real sense, when someone who had
link |
01:31:56.400
been working on that kind of stuff their whole career suddenly says that, yeah,
link |
01:32:01.680
that had some effect on me. Yeah, this might be a special moment in the history of the field.
link |
01:32:07.680
This is where we might see some interesting breakthroughs. So in the space of
link |
01:32:14.880
advice, somebody who's interested in getting started in machine learning or reinforcement
link |
01:32:19.920
learning, what advice would you give to maybe an undergraduate student or maybe even younger,
link |
01:32:25.040
what are the first steps to take? And further on, what are the steps to take on that journey?
link |
01:32:32.560
So something that I think is important to do is to not be afraid to spend time imagining
link |
01:32:43.600
the kind of outcome that you might like to see. So one outcome might be a successful career,
link |
01:32:49.440
a large paycheck or something, or state of the art results on some benchmark.
link |
01:32:53.440
But hopefully that's not the thing that's like the main driving force for somebody.
link |
01:32:57.440
But I think that if someone who's a student considering a career in AI, like, takes a
link |
01:33:04.080
little while, sits down and thinks like, what do I really want to see? What do I want to see a machine
link |
01:33:08.240
do? What do I want to see a robot do? And what do I want to
link |
01:33:11.360
see a natural language system do? Just imagine it, you know, almost like a commercial
link |
01:33:16.400
for a future product or something or like like something that you'd like to see in the world,
link |
01:33:20.400
and then actually sit down and think about the steps that are necessary to get there.
link |
01:33:24.880
And hopefully that thing is not a better number on ImageNet classification. It's
link |
01:33:29.200
probably like an actual thing that we can't do today that would be really awesome, whether it's
link |
01:33:33.040
a robot butler or a, you know, a really awesome healthcare decision making support system,
link |
01:33:38.720
whatever it is that you find inspiring. And I think that thinking about that and then
link |
01:33:43.680
backtracking from there and imagining the steps needed to get there will actually
link |
01:33:46.880
lead to much better research. It'll lead to rethinking the assumptions. It'll lead to
link |
01:33:51.360
working on the bottlenecks that other people aren't working on.
link |
01:33:55.600
And then naturally turning to you, we've talked about reward functions, and you just gave
link |
01:34:01.040
advice on looking forward to what kind of change you would like to make
link |
01:34:05.680
in the world. What do you think, ridiculous, big question? What do you think is the meaning
link |
01:34:10.640
of life? What is the meaning of your life? What gives you fulfillment, purpose, happiness, and
link |
01:34:18.080
meaning? That's a very big question. What's the reward function under which you're operating?
link |
01:34:27.440
Yeah, I think one thing that does give, you know, if not meaning at least satisfaction is
link |
01:34:33.360
some degree of confidence that I'm working on a problem that really matters. I feel like it's
link |
01:34:37.840
less important to me to actually solve a problem, but it's quite nice to spend my
link |
01:34:46.080
time on things that I believe really matter. And I try pretty hard to look for that.
link |
01:34:52.880
I don't know if it's easy to answer this, but if you're successful, what does that look like?
link |
01:34:59.680
What's the big dream? Of course, success is built on top of success and you keep going forever,
link |
01:35:06.800
but what is the dream? Yeah, so one very concrete thing or maybe as concrete as it's going to get
link |
01:35:15.280
here is to see machines that actually get better and better the longer they exist in the world.
link |
01:35:23.200
And that kind of seems like on the surface, one might even think that that's something that we
link |
01:35:26.960
have today, but I think we really don't. I think that there is an unending complexity in the universe
link |
01:35:34.960
and to date, all of the machines that we've been able to build don't sort of improve up to the limit
link |
01:35:42.160
of that complexity. They hit a wall somewhere. Maybe they hit a wall because they're in a simulator
link |
01:35:47.200
that is only a very limited, very pale imitation of the real world, or they hit a wall
link |
01:35:52.160
because they rely on a labeled dataset, but they never hit the wall of like running out of stuff
link |
01:35:57.520
to see. So, you know, I'd like to build a machine that can go as far as possible.
link |
01:36:03.760
And that runs up against the ceiling of the complexity of the universe. Yes.
link |
01:36:09.280
Well, I don't think there's a better way to end it, Sergei. Thank you so much. It's a huge honor.
link |
01:36:13.280
I can't wait to see the amazing work that you have yet to publish, and in the education space in terms
link |
01:36:20.560
of reinforcement learning. Thank you for inspiring the world. Thank you for the great research you
link |
01:36:24.000
do. Thank you. Thanks for listening to this conversation with Sergei Levine and thank you
link |
01:36:29.200
to our sponsors, Cash App and ExpressVPN. Please consider supporting this podcast by
link |
01:36:35.600
downloading Cash App and using code LexPodcast and signing up at expressvpn.com
link |
01:36:42.640
slash lexpod. Click all the links, buy all the stuff. It's the best way to support this podcast
link |
01:36:50.160
and the journey I'm on. If you enjoy this thing, subscribe on YouTube, review it with
link |
01:36:55.200
five stars on Apple Podcast. Support it on Patreon or connect with me on Twitter at Lex Fridman
link |
01:37:01.200
spelled somehow if you can figure out how without using the letter E, just F R I D M A N.
link |
01:37:08.800
And now let me leave you with some words from Salvador Dali. Intelligence without ambition
link |
01:37:14.960
is a bird without wings. Thank you for listening and hope to see you next time.