
David Silver: AlphaGo, AlphaZero, and Deep Reinforcement Learning | Lex Fridman Podcast #86



link |
00:00:00.000
The following is a conversation with David Silver,
link |
00:00:02.560
who leads the Reinforcement Learning Research Group
link |
00:00:05.000
at DeepMind, and was the lead researcher
link |
00:00:07.840
on AlphaGo, AlphaZero, and co-led the AlphaStar
link |
00:00:12.080
and MuZero efforts, and a lot of important work
link |
00:00:14.760
in reinforcement learning in general.
link |
00:00:17.160
I believe AlphaZero is one of the most important
link |
00:00:20.840
accomplishments in the history of artificial intelligence.
link |
00:00:24.160
And David is one of the key humans who brought AlphaZero
link |
00:00:27.760
to life together with a lot of other great researchers
link |
00:00:30.560
at DeepMind.
link |
00:00:31.880
He's humble, kind, and brilliant.
link |
00:00:35.160
We were both jet lagged, but didn't care and made it happen.
link |
00:00:39.040
It was a pleasure and truly an honor to talk with David.
link |
00:00:43.280
This conversation was recorded before the outbreak
link |
00:00:45.720
of the pandemic.
link |
00:00:46.960
For everyone feeling the medical, psychological,
link |
00:00:49.520
and financial burden of this crisis,
link |
00:00:51.600
I'm sending love your way.
link |
00:00:53.360
Stay strong, we're in this together, we'll beat this thing.
link |
00:00:57.680
This is the Artificial Intelligence Podcast.
link |
00:01:00.360
If you enjoy it, subscribe on YouTube,
link |
00:01:02.480
review it with five stars on Apple Podcast,
link |
00:01:04.760
support on Patreon, or simply connect with me on Twitter
link |
00:01:07.960
at Lex Fridman, spelled F R I D M A N.
link |
00:01:12.040
As usual, I'll do a few minutes of ads now
link |
00:01:14.520
and never any ads in the middle
link |
00:01:16.080
that can break the flow of the conversation.
link |
00:01:18.360
I hope that works for you
link |
00:01:19.680
and doesn't hurt the listening experience.
link |
00:01:22.560
Quick summary of the ads.
link |
00:01:23.920
Two sponsors, Masterclass and Cash App.
link |
00:01:27.360
Please consider supporting the podcast
link |
00:01:29.040
by signing up to Masterclass at masterclass.com slash Lex
link |
00:01:34.000
and downloading Cash App and using code LexPodcast.
link |
00:01:38.760
This show is presented by Cash App,
link |
00:01:41.120
the number one finance app in the app store.
link |
00:01:43.480
When you get it, use code LexPodcast.
link |
00:01:46.960
Cash App lets you send money to friends, buy Bitcoin,
link |
00:01:50.040
and invest in the stock market with as little as $1.
link |
00:01:53.800
Since Cash App allows you to buy Bitcoin,
link |
00:01:56.040
let me mention that cryptocurrency
link |
00:01:57.840
in the context of the history of money is fascinating.
link |
00:02:01.400
I recommend The Ascent of Money as a great book on this history.
link |
00:02:05.320
Debits and credits on ledgers started around 30,000 years ago.
link |
00:02:10.040
The US dollar created over 200 years ago,
link |
00:02:12.840
and Bitcoin, the first decentralized cryptocurrency,
link |
00:02:15.840
released just over 10 years ago.
link |
00:02:18.600
So given that history, cryptocurrency is still very much
link |
00:02:21.880
in its early days of development,
link |
00:02:23.880
but it's still aiming to and just might
link |
00:02:26.480
redefine the nature of money.
link |
00:02:29.040
So again, if you get Cash App from the app store or Google Play
link |
00:02:32.360
and use the code LexPodcast, you get $10,
link |
00:02:35.880
and Cash App will also donate $10 to FIRST,
link |
00:02:38.640
an organization that is helping to advance robotics
link |
00:02:41.080
and STEM education for young people around the world.
link |
00:02:44.840
This show is sponsored by Masterclass.
link |
00:02:46.960
Sign up at masterclass.com slash Lex
link |
00:02:49.480
to get a discount and to support this podcast.
link |
00:02:52.000
In fact, for a limited time now,
link |
00:02:53.560
if you sign up for an all access pass for a year,
link |
00:02:56.600
you get another all access pass
link |
00:02:59.480
to share with a friend.
link |
00:03:01.200
Buy one, get one free.
link |
00:03:02.600
When I first heard about Masterclass,
link |
00:03:04.280
I thought it was too good to be true.
link |
00:03:06.240
For $180 a year, you get an all access pass
link |
00:03:09.680
to watch courses from, to list some of my favorites:
link |
00:03:12.920
Chris Hadfield on space exploration,
link |
00:03:15.120
Neil deGrasse Tyson on scientific thinking and communication,
link |
00:03:18.080
Will Wright, the creator of SimCity and The Sims, on game design,
link |
00:03:22.760
Jane Goodall on conservation,
link |
00:03:24.640
Carlos Santana on guitar.
link |
00:03:26.560
His song Europa could be the most beautiful
link |
00:03:29.040
guitar song ever written.
link |
00:03:30.960
Garry Kasparov on chess, Daniel Negreanu on poker,
link |
00:03:34.240
and many, many more.
link |
00:03:35.640
Chris Hadfield explaining how rockets work
link |
00:03:37.840
and the experience of being launched into space alone
link |
00:03:40.400
is worth the money.
link |
00:03:41.640
For me, the key is to not be overwhelmed
link |
00:03:44.680
by the abundance of choice.
link |
00:03:46.200
Pick three courses you want to complete,
link |
00:03:48.040
watch each of them all the way through.
link |
00:03:50.080
It's not that long, but it's an experience
link |
00:03:51.880
that will stick with you for a long time, I promise.
link |
00:03:55.240
It's easily worth the money.
link |
00:03:56.760
You can watch it on basically any device.
link |
00:03:59.160
Once again, sign up on masterclass.com slash Lex
link |
00:04:02.280
to get a discount and to support this podcast.
link |
00:04:05.600
And now, here's my conversation with David Silver.
link |
00:04:09.720
What was the first program you've ever written?
link |
00:04:12.160
And what programming language?
link |
00:04:13.920
Do you remember?
link |
00:04:14.840
I remember very clearly, yeah.
link |
00:04:16.120
My parents brought home this BBC Model B microcomputer.
link |
00:04:22.000
It was just this fascinating thing to me.
link |
00:04:24.160
I was about seven years old and couldn't resist
link |
00:04:27.720
just playing around with it.
link |
00:04:29.960
So I think first program ever was writing my name out
link |
00:04:35.400
in different colors and getting it to loop and repeat that.
link |
00:04:39.560
And there was something magical about that,
link |
00:04:41.600
which just led to more and more.
link |
00:04:43.320
How did you think about computers back then?
link |
00:04:46.280
Like the magical aspect of it, that you can write a program
link |
00:04:49.640
and there's this thing that you just gave birth to
link |
00:04:52.840
that's able to create sort of visual elements
link |
00:04:56.240
and live on its own.
link |
00:04:57.640
Or did you not think of it in those romantic notions?
link |
00:04:59.960
Was it more like, oh, that's cool.
link |
00:05:02.440
I can solve some puzzles.
link |
00:05:05.240
It was always more than solving puzzles.
link |
00:05:06.880
It was something where, you know,
link |
00:05:08.600
there were these limitless possibilities.
link |
00:05:13.400
Once you have a computer in front of you,
link |
00:05:14.720
you can do anything with it.
link |
00:05:16.400
I used to play with Lego with the same feeling.
link |
00:05:18.000
You can make anything you want out of Lego,
link |
00:05:20.000
but even more so with a computer, you know,
link |
00:05:21.840
you're not constrained by the amount of kit you've got.
link |
00:05:24.480
And so I was fascinated by it and started pulling out
link |
00:05:26.960
the user guide and the advanced user guide
link |
00:05:29.560
and then learning.
link |
00:05:30.680
So I started in BASIC and then later 6502 assembly.
link |
00:05:34.600
My father also became interested in this machine
link |
00:05:38.360
and gave up his career to go back to school
link |
00:05:40.240
and study for a master's degree
link |
00:05:42.960
in artificial intelligence, funnily enough,
link |
00:05:46.040
at Essex University when I was seven.
link |
00:05:48.560
So I was exposed to those things at an early age.
link |
00:05:52.000
He showed me how to program in Prolog
link |
00:05:54.840
and do things like querying your family tree.
link |
00:05:57.600
And those are some of my earliest memories
link |
00:05:59.760
of trying to figure things out on a computer.
link |
00:06:04.040
Those are the early steps in computer science programming,
link |
00:06:07.120
but when did you first fall in love
link |
00:06:09.320
with artificial intelligence or with the ideas,
link |
00:06:12.040
the dreams of AI?
link |
00:06:14.840
I think it was really when I went to study at university.
link |
00:06:19.000
So I was an undergrad at Cambridge
link |
00:06:20.880
and studying computer science.
link |
00:06:23.800
And I really started to question,
link |
00:06:27.560
you know, what really are the goals?
link |
00:06:29.480
What's the goal?
link |
00:06:30.320
Where do we want to go with computer science?
link |
00:06:32.760
And it seemed to me that the only step
link |
00:06:37.360
of major significance to take was to try
link |
00:06:40.880
and recreate something akin to human intelligence.
link |
00:06:44.200
If we could do that, that would be a major leap forward.
link |
00:06:47.480
And that idea, I certainly wasn't the first to have it,
link |
00:06:50.960
but it, you know, nestled within me somewhere
link |
00:06:53.480
and became like a bug.
link |
00:06:55.480
You know, I really wanted to crack that problem.
link |
00:06:58.880
So you thought it was, like you had a notion
link |
00:07:00.760
that this is something that human beings can do,
link |
00:07:03.000
that it is possible to create an intelligent machine.
link |
00:07:07.280
Well, I mean, unless you believe in something metaphysical,
link |
00:07:11.360
then what are our brains doing?
link |
00:07:13.400
Well, at some level they're information processing systems,
link |
00:07:17.240
which are able to take whatever information is in there,
link |
00:07:22.440
transform it through some form of program
link |
00:07:24.800
and produce some kind of output,
link |
00:07:26.120
which enables that human being to do all the amazing things
link |
00:07:29.360
that they can do in this incredible world.
link |
00:07:31.800
So then do you remember the first time
link |
00:07:35.480
you've written a program that,
link |
00:07:37.960
because you also had an interest in games.
link |
00:07:40.080
Do you remember the first time you wrote a program
link |
00:07:41.960
that beat you in a game?
link |
00:07:45.680
Or, more generally, beat you at anything?
link |
00:07:47.360
Sort of achieved super David Silver level performance?
link |
00:07:54.280
So I used to work in the games industry.
link |
00:07:56.440
So for five years I programmed games for my first job.
link |
00:08:01.280
So it was an amazing opportunity
link |
00:08:03.080
to get involved in a startup company.
link |
00:08:05.800
And so I was involved in building AI at that time.
link |
00:08:12.080
And so for sure there was a sense of building handcrafted,
link |
00:08:18.200
what people used to call AI in the games industry,
link |
00:08:20.280
which I think is not really what we might think of as AI
link |
00:08:23.120
in its fullest sense,
link |
00:08:24.000
but something which is able to take actions
link |
00:08:29.280
in a way which makes things interesting
link |
00:08:31.440
and challenging for the human player.
link |
00:08:35.000
And at that time I was able to build
link |
00:08:38.360
these handcrafted agents,
link |
00:08:39.400
which in certain limited cases could do things
link |
00:08:41.360
better than me,
link |
00:08:45.360
but mostly in these kind of twitch-like scenarios
link |
00:08:47.920
where they were able to do things faster
link |
00:08:50.000
or because they had some pattern
link |
00:08:51.680
which they were able to exploit repeatedly.
link |
00:08:55.400
I think if we're talking about real AI,
link |
00:08:58.520
the first experience for me came after that
link |
00:09:00.800
when I realized that this path I was on
link |
00:09:05.600
wasn't taking me towards,
link |
00:09:06.840
it wasn't dealing with that bug which I still had inside me
link |
00:09:10.200
to really understand intelligence and try and solve it.
link |
00:09:14.240
That everything people were doing in games
link |
00:09:15.760
was short term fixes rather than long term vision.
link |
00:09:19.920
And so I went back to study for my PhD,
link |
00:09:22.760
which, funnily enough, was trying to apply reinforcement learning
link |
00:09:26.320
to the game of Go.
link |
00:09:27.880
And I built my first Go program using reinforcement learning,
link |
00:09:31.360
a system which would by trial and error play against itself
link |
00:09:35.000
and was able to learn which patterns were actually helpful
link |
00:09:40.000
to predict whether it was gonna win or lose the game
link |
00:09:42.240
and then choose the moves that led
link |
00:09:44.520
to the combination of patterns
link |
00:09:45.640
that would mean that you're more likely to win.
link |
00:09:47.760
And that system, that system beat me.
link |
00:09:50.360
And how did that make you feel?
link |
00:09:53.400
Made me feel good.
link |
00:09:54.240
I mean, was there sort of the, yeah,
link |
00:09:57.000
it's a mix of a sort of excitement
link |
00:09:59.560
and was there a tinge of sort of like,
link |
00:10:02.480
almost like a fearful awe?
link |
00:10:04.440
You know, it's like in 2001: A Space Odyssey,
link |
00:10:08.240
kind of realizing that you've created something that,
link |
00:10:12.680
you know, that's achieved human level intelligence
link |
00:10:19.160
in this one particular little task.
link |
00:10:21.160
And in that case, I suppose neural networks
link |
00:10:23.400
weren't involved.
link |
00:10:24.320
There were no neural networks in those days.
link |
00:10:26.840
This was pre deep learning revolution.
link |
00:10:30.560
But it was a principled self learning system
link |
00:10:33.000
based on a lot of the principles which people
link |
00:10:36.120
are still using in deep reinforcement learning.
link |
00:10:40.200
How did I feel?
link |
00:10:41.200
I think I found it immensely satisfying
link |
00:10:46.600
that a system which was able to learn
link |
00:10:49.600
from first principles for itself
link |
00:10:51.320
was able to reach the point
link |
00:10:52.400
that it was understanding this domain
link |
00:10:56.240
better than I could and able to outwit me.
link |
00:11:00.040
I don't think it was a sense of awe.
link |
00:11:01.560
It was a sense of satisfaction,
link |
00:11:04.560
that something I felt should work had worked.
link |
00:11:08.640
So to me, AlphaGo, and I don't know how else to put it,
link |
00:11:11.840
but to me, AlphaGo and AlphaGo Zero,
link |
00:11:14.560
mastering the game of Go is again, to me,
link |
00:11:18.520
the most profound and inspiring moment
link |
00:11:20.400
in the history of artificial intelligence.
link |
00:11:23.440
So you're one of the key people behind this achievement
link |
00:11:26.560
and I'm Russian.
link |
00:11:27.580
So I really felt the first sort of seminal achievement
link |
00:11:31.840
when Deep Blue beat Garry Kasparov in 1997.
link |
00:11:34.800
So as far as I know, the AI community at that point
link |
00:11:40.680
largely saw the game of Go as unbeatable by AI
link |
00:11:43.960
using the sort of state-of-the-art
link |
00:11:46.160
brute-force search methods.
link |
00:11:48.760
Even if you consider, at least the way I saw it,
link |
00:11:51.480
even if you consider arbitrary exponential scaling
link |
00:11:55.920
of compute, Go would still not be solvable,
link |
00:11:59.160
hence why it was thought to be impossible.
link |
00:12:01.380
So given that the game of Go was thought impossible to master,
link |
00:12:07.660
what was the dream for you?
link |
00:12:09.460
You just mentioned your PhD thesis
link |
00:12:11.420
of building the system that plays Go.
link |
00:12:14.020
What was the dream for you that you could actually
link |
00:12:16.060
build a computer program that achieves world class,
link |
00:12:20.100
not necessarily beats the world champion,
link |
00:12:21.860
but achieves that kind of level of playing Go?
link |
00:12:24.900
First of all, thank you, that's very kind words.
link |
00:12:27.260
And funnily enough, I just came from a panel
link |
00:12:31.380
where I was actually in a conversation
link |
00:12:34.500
with Garry Kasparov and Murray Campbell,
link |
00:12:36.060
who was one of the creators of Deep Blue.
link |
00:12:38.980
And it was their first meeting together since the match.
link |
00:12:43.260
So that just occurred yesterday.
link |
00:12:44.500
So I'm literally fresh from that experience.
link |
00:12:47.300
So these are amazing moments when they happen,
link |
00:12:50.760
but where did it all start?
link |
00:12:52.280
Well, for me, it started when I became fascinated
link |
00:12:55.020
in the game of Go.
link |
00:12:56.100
So Go for me, I've grown up playing games.
link |
00:12:59.180
I've always had a fascination in board games.
link |
00:13:01.820
I played chess as a kid, I played Scrabble as a kid.
link |
00:13:06.060
When I was at university, I discovered the game of Go.
link |
00:13:08.940
And to me, it just blew all of those other games
link |
00:13:11.180
out of the water.
link |
00:13:12.020
It was just so deep and profound in its complexity
link |
00:13:15.580
with endless levels to it.
link |
00:13:17.700
What I discovered was that I could devote
link |
00:13:22.700
endless hours to this game.
link |
00:13:25.940
And I knew in my heart of hearts
link |
00:13:28.180
that no matter how many hours I would devote to it,
link |
00:13:30.340
I would never become a grandmaster.
link |
00:13:34.300
Or, there was another path.
link |
00:13:35.980
And the other path was to try and understand
link |
00:13:38.180
how you could get some other intelligence
link |
00:13:40.340
to play this game better than I would be able to.
link |
00:13:43.500
And so even in those days, I had this idea that,
link |
00:13:46.780
what if, what if it was possible to build a program
link |
00:13:49.340
that could crack this?
link |
00:13:51.100
And as I started to explore the domain,
link |
00:13:53.260
I discovered that this was really the domain
link |
00:13:57.500
where people felt deeply that if progress
link |
00:14:01.300
could be made in Go,
link |
00:14:02.140
it would really mean a giant leap forward for AI.
link |
00:14:06.340
It was the challenge where all other approaches had failed.
link |
00:14:10.980
This is coming out of the era you mentioned,
link |
00:14:13.460
which was in some sense, the golden era
link |
00:14:15.980
for the classical methods of AI, like heuristic search.
link |
00:14:19.940
In the 90s, they all fell one after another,
link |
00:14:23.340
not just chess with Deep Blue, but checkers,
link |
00:14:26.580
backgammon, Othello.
link |
00:14:28.900
There were numerous cases where systems
link |
00:14:33.340
built on top of heuristic search methods
link |
00:14:35.940
with these high performance systems
link |
00:14:37.980
had been able to defeat the human world champion
link |
00:14:40.380
in each of those domains.
link |
00:14:41.980
And yet in that same time period,
link |
00:14:44.900
there was a million dollar prize available
link |
00:14:47.420
for the game of Go, for the first system
link |
00:14:50.700
to beat a human professional player.
link |
00:14:52.700
And at the end of that time period,
link |
00:14:54.700
in the year 2000 when the prize expired,
link |
00:14:57.140
the strongest Go program in the world
link |
00:15:00.060
was defeated by a nine year old child
link |
00:15:02.700
when that nine year old child was giving nine free moves
link |
00:15:05.820
to the computer at the start of the game
link |
00:15:07.500
to try and even things up.
link |
00:15:09.820
And a computer Go expert beat that same strongest program
link |
00:15:13.900
with 29 handicap stones, 29 free moves.
link |
00:15:18.140
So that's what the state of affairs was
link |
00:15:20.420
when I became interested in this problem
link |
00:15:23.380
in around 2003 when I started working on computer Go.
link |
00:15:29.500
There was nothing, there was very, very little
link |
00:15:33.180
in the way of progress towards meaningful performance,
link |
00:15:36.700
again, anything approaching human level.
link |
00:15:39.180
And so people, it wasn't through lack of effort,
link |
00:15:42.900
people had tried many, many things.
link |
00:15:44.980
And so there was a strong sense
link |
00:15:46.700
that something different would be required for Go
link |
00:15:49.900
than had been needed for all of these other domains
link |
00:15:52.220
where AI had been successful.
link |
00:15:54.220
And maybe the single clearest example
link |
00:15:56.380
is that Go, unlike those other domains,
link |
00:15:59.820
had this kind of intuitive property
link |
00:16:02.460
that a Go player would look at a position
link |
00:16:04.740
and say, hey, here's this mess of black and white stones.
link |
00:16:09.580
But from this mess, oh, I can predict
link |
00:16:12.740
that this part of the board has become my territory,
link |
00:16:15.860
this part of the board has become your territory,
link |
00:16:17.900
and I've got this overall sense that I'm gonna win
link |
00:16:20.260
and that this is about the right move to play.
link |
00:16:22.380
And that intuitive sense of judgment,
link |
00:16:24.780
of being able to evaluate what's going on in a position,
link |
00:16:28.220
it was pivotal to humans being able to play this game
link |
00:16:31.820
and something that people had no idea
link |
00:16:33.340
how to put into computers.
link |
00:16:35.060
So this question of how to evaluate a position,
link |
00:16:37.780
how to come up with these intuitive judgments
link |
00:16:40.140
was the key reason why Go was so hard
link |
00:16:44.980
in addition to its enormous search space,
link |
00:16:47.900
and the reason why methods
link |
00:16:49.740
which had succeeded so well elsewhere failed in Go.
link |
00:16:53.220
And so people really felt deep down that in order to crack Go
link |
00:16:57.980
we would need to get something akin to human intuition.
link |
00:17:00.420
And if we got something akin to human intuition,
link |
00:17:02.700
we'd be able to solve many, many more problems in AI.
link |
00:17:06.860
So for me, that was the moment where it's like,
link |
00:17:09.260
okay, this is not just about playing the game of Go,
link |
00:17:11.980
this is about something profound.
link |
00:17:13.620
And it was back to that bug
link |
00:17:15.020
which had been itching me all those years.
link |
00:17:17.740
This is the opportunity to do something meaningful
link |
00:17:19.660
and transformative, and I guess a dream was born.
link |
00:17:23.780
That's a really interesting way to put it.
link |
00:17:25.340
So almost this realization that you need to find,
link |
00:17:29.140
formulate Go as a kind of a prediction problem
link |
00:17:31.540
versus a search problem was the intuition.
link |
00:17:34.820
I mean, maybe that's the wrong crude term,
link |
00:17:37.380
but to give it the ability to kind of intuit things
link |
00:17:44.020
about positional structure of the board.
link |
00:17:47.060
Now, okay, but what about the learning part of it?
link |
00:17:51.340
Did you have a sense that you have to,
link |
00:17:54.940
that learning has to be part of the system?
link |
00:17:57.580
Again, something that, as far as I think,
link |
00:18:01.060
except with TD-Gammon in the 90s with RL a little bit,
link |
00:18:05.220
hasn't been part of those state of the art game playing
link |
00:18:07.500
systems.
link |
00:18:08.580
So I strongly felt that learning would be necessary.
link |
00:18:12.820
And that's why my PhD topic back then was trying
link |
00:18:16.020
to apply reinforcement learning to the game of Go
link |
00:18:20.100
and not just learning of any type,
link |
00:18:21.820
but I felt that the only way to really have a system
link |
00:18:26.180
to progress beyond human levels of performance
link |
00:18:29.220
wouldn't just be to mimic how humans do it,
link |
00:18:31.060
but to understand for themselves.
link |
00:18:33.140
And how else can a machine hope to understand
link |
00:18:36.580
what's going on except through learning?
link |
00:18:39.020
If you're not learning, what else are you doing?
link |
00:18:40.420
Well, you're putting all the knowledge into the system.
link |
00:18:42.540
And that just feels like something which decades of AI
link |
00:18:47.860
have told us is maybe not a dead end,
link |
00:18:50.580
but certainly has a ceiling to the capabilities.
link |
00:18:53.380
It's known as the knowledge acquisition bottleneck,
link |
00:18:55.420
that the more you try to put into something,
link |
00:18:58.500
the more brittle the system becomes.
link |
00:19:00.380
And so you just have to have learning.
link |
00:19:02.780
You have to have learning.
link |
00:19:03.620
That's the only way you're going to be able to get a system
link |
00:19:06.900
which has sufficient knowledge in it,
link |
00:19:10.380
millions and millions of pieces of knowledge,
link |
00:19:11.900
billions, trillions of a form
link |
00:19:14.220
that it can actually apply for itself
link |
00:19:15.580
and understand how those billions and trillions
link |
00:19:18.000
of pieces of knowledge can be leveraged in a way
link |
00:19:20.940
which will actually lead it towards its goal
link |
00:19:22.780
without conflict or other issues.
link |
00:19:27.500
Yeah, I mean, if I put myself back in that time,
link |
00:19:30.620
I just wouldn't think like that.
link |
00:19:33.180
Without a good demonstration of RL,
link |
00:19:34.860
I would think more in the symbolic AI,
link |
00:19:37.740
like not learning, but sort of a simulation
link |
00:19:42.780
of knowledge base, like a growing knowledge base,
link |
00:19:46.940
but it would still be sort of pattern based,
link |
00:19:50.060
like basically have little rules
link |
00:19:52.800
that you kind of assemble together
link |
00:19:54.660
into a large knowledge base.
link |
00:19:56.660
Well, in a sense, that was the state of the art back then.
link |
00:19:59.820
So if you look at the Go programs,
link |
00:20:01.140
which had been competing for this prize I mentioned,
link |
00:20:05.320
they were an assembly of different specialized systems,
link |
00:20:09.860
some of which used huge amounts of human knowledge
link |
00:20:11.900
to describe how you should play the opening,
link |
00:20:14.860
how you should, all the different patterns
link |
00:20:16.740
that were required to play well in the game of Go,
link |
00:20:21.460
endgame theory, combinatorial game theory,
link |
00:20:24.620
and combined with more principled search based methods,
link |
00:20:28.620
which were trying to solve for particular sub parts
link |
00:20:31.280
of the game, like life and death,
link |
00:20:34.100
connecting groups together,
link |
00:20:36.840
all these amazing sub problems
link |
00:20:38.100
that just emerge in the game of Go,
link |
00:20:40.420
there were different pieces all put together
link |
00:20:43.280
into this like collage,
link |
00:20:45.240
which together would try and play against a human.
link |
00:20:49.120
And although not all of the pieces were handcrafted,
link |
00:20:54.620
the overall effect was nevertheless still brittle,
link |
00:20:56.780
and it was hard to make all these pieces work well together.
link |
00:21:00.220
And so really, what I was pressing for
link |
00:21:02.660
and the main innovation of the approach I took
link |
00:21:05.600
was to go back to first principles and say,
link |
00:21:08.440
well, let's back off that
link |
00:21:10.380
and try and find a principled approach
link |
00:21:12.860
where the system can learn for itself,
link |
00:21:16.900
just from the outcome, like learn for itself.
link |
00:21:19.300
If you try something, did that help or did it not help?
link |
00:21:22.660
And only through that procedure can you arrive at knowledge,
link |
00:21:26.380
which is verified.
link |
00:21:27.940
The system has to verify it for itself,
link |
00:21:29.760
not relying on any other third party
link |
00:21:31.620
to say this is right or this is wrong.
link |
00:21:33.540
And so that principle was already very important
link |
00:21:38.180
in those days, but unfortunately,
link |
00:21:39.820
we were missing some important pieces back then.
link |
00:21:43.260
So before we dive into maybe
link |
00:21:46.580
discussing the beauty of reinforcement learning,
link |
00:21:49.140
let's take a step back, we kind of skipped it a bit,
link |
00:21:52.660
but the rules of the game of Go,
link |
00:21:55.940
what are the elements of it, perhaps contrasting with chess,
link |
00:22:02.100
that you sort of really enjoyed as a human being,
link |
00:22:07.100
and also that make it really difficult
link |
00:22:09.620
as an AI machine learning problem.
link |
00:22:13.100
So the game of Go has remarkably simple rules.
link |
00:22:16.740
In fact, so simple that people have speculated
link |
00:22:19.180
that if we were to meet alien life at some point,
link |
00:22:22.220
that we wouldn't be able to communicate with them,
link |
00:22:23.820
but we would be able to play Go with them.
link |
00:22:26.140
They'd probably have discovered the same rule set.
link |
00:22:28.980
So the game is played on a 19 by 19 grid,
link |
00:22:32.260
and you play on the intersections of the grid
link |
00:22:34.140
and the players take turns.
link |
00:22:35.580
And the aim of the game is very simple.
link |
00:22:37.580
It's to surround as much territory as you can,
link |
00:22:40.820
as many of these intersections with your stones
link |
00:22:43.600
and to surround more than your opponent does.
link |
00:22:46.180
And the only nuance to the game is that
link |
00:22:48.800
if you fully surround your opponent's piece,
link |
00:22:50.500
then you get to capture it and remove it from the board
link |
00:22:52.420
and it counts as your own territory.
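
To make that capture rule concrete, here is a minimal sketch in Python of computing a group of stones and its liberties by flood fill; a group with no liberties is captured. The board encoding and names are illustrative assumptions, not any particular Go engine's API.

EMPTY, BLACK, WHITE = 0, 1, 2
SIZE = 19

def group_and_liberties(board, row, col):
    """Flood-fill the group of stones containing (row, col); return (group, liberties)."""
    color = board[row][col]
    group, liberties, frontier = set(), set(), [(row, col)]
    while frontier:
        r, c = frontier.pop()
        if (r, c) in group:
            continue
        group.add((r, c))
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < SIZE and 0 <= nc < SIZE:
                if board[nr][nc] == EMPTY:
                    liberties.add((nr, nc))    # an empty adjacent intersection
                elif board[nr][nc] == color:
                    frontier.append((nr, nc))  # same-colored neighbor joins the group
    return group, liberties

def remove_if_captured(board, row, col):
    """Remove the group at (row, col) if it has no liberties; return stones taken."""
    group, liberties = group_and_liberties(board, row, col)
    if liberties:
        return 0
    for r, c in group:
        board[r][c] = EMPTY
    return len(group)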
link |
00:22:54.460
Now from those very simple rules, immense complexity arises.
link |
00:22:58.320
There's kind of profound strategies
link |
00:22:59.820
in how to surround territory,
link |
00:23:02.020
how to kind of trade off between
link |
00:23:04.680
making solid territory yourself now
link |
00:23:07.140
compared to building up influence
link |
00:23:09.260
that will help you acquire territory later in the game,
link |
00:23:11.300
how to connect groups together,
link |
00:23:12.580
how to keep your own groups alive,
link |
00:23:16.620
which patterns of stones are most useful
link |
00:23:19.940
compared to others.
link |
00:23:21.500
There's just immense knowledge.
link |
00:23:23.920
And human Go players have played this game for,
link |
00:23:27.180
it was discovered thousands of years ago,
link |
00:23:29.260
and human Go players have built up
link |
00:23:30.860
this immense knowledge base over the years.
link |
00:23:33.760
It's studied very deeply and played by
link |
00:23:36.300
something like 50 million players across the world,
link |
00:23:38.780
mostly in China, Japan, and Korea,
link |
00:23:41.220
where it's an important part of the culture,
link |
00:23:43.700
so much so that it's considered one of the
link |
00:23:45.900
four ancient arts that were required of Chinese scholars.
link |
00:23:49.860
So there's a deep history there.
link |
00:23:51.680
But there's interesting qualities.
link |
00:23:53.100
So if I sort of compare to chess,
link |
00:23:55.620
in the same way that Go is in Chinese culture,
link |
00:23:59.380
chess in Russia is also considered
link |
00:24:01.860
one of the sacred arts.
link |
00:24:03.980
So if we contrast sort of Go with chess,
link |
00:24:06.460
there's interesting qualities about Go.
link |
00:24:09.300
Maybe you can correct me if I'm wrong,
link |
00:24:10.840
but the evaluation of a particular static board
link |
00:24:15.700
is not as reliable.
link |
00:24:18.780
Like you can't, in chess you can kind of assign points
link |
00:24:21.820
to the different units,
link |
00:24:23.860
and it's kind of a pretty good measure
link |
00:24:26.620
of who's winning, who's losing.
link |
00:24:27.980
In Go, it's not so clear.
link |
00:24:29.800
Yeah, so in the game of Go,
link |
00:24:31.300
you find yourself in a situation where
link |
00:24:33.420
both players have played the same number of stones.
link |
00:24:36.020
Actually, captures at a strong level of play
link |
00:24:38.380
happen very rarely, which means that
link |
00:24:40.260
at any moment in the game,
link |
00:24:41.180
you've got the same number of white stones and black stones.
link |
00:24:43.700
And the only thing which differentiates
link |
00:24:45.180
how well you're doing is this intuitive sense
link |
00:24:48.180
of where are the territories ultimately
link |
00:24:50.740
going to form on this board?
link |
00:24:52.180
And if you look at the complexity of a real Go position,
link |
00:24:57.260
it's mind-boggling, that kind of question
link |
00:25:00.560
of what will happen in 300 moves from now
link |
00:25:02.660
when you see just a scattering of 20 white
link |
00:25:05.420
and black stones intermingled.
link |
00:25:07.860
And so that challenge is the reason
link |
00:25:12.780
why position evaluation is so hard in Go
link |
00:25:15.540
compared to other games.
link |
00:25:17.420
In addition to that, it has an enormous search space.
link |
00:25:19.300
So there's around 10 to the 170 positions
link |
00:25:23.380
in the game of Go.
link |
00:25:24.380
That's an astronomical number.
link |
00:25:26.220
And that search space is so great
link |
00:25:28.540
that traditional heuristic search methods
link |
00:25:30.500
that were so successful in things like Deep Blue
link |
00:25:32.500
and chess programs just kind of fall over in Go.
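
A quick back-of-envelope check of that number in Python: each of the 361 intersections can be empty, black, or white, giving 3^361 raw configurations; only the legal ones count, a total later computed exactly by Tromp and Farneback to be roughly 2.08 x 10^170.

raw = 3 ** (19 * 19)      # every intersection is empty, black, or white
print(len(str(raw)) - 1)  # 172, i.e. 3^361 is about 1.7 x 10^172
# Restricting to legal positions (every group keeps at least one liberty)
# brings the count down to roughly 2.08 x 10^170.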
link |
00:25:36.060
So at which point did reinforcement learning
link |
00:25:39.440
enter your life, your research life, your way of thinking?
link |
00:25:43.980
We just talked about learning,
link |
00:25:45.460
but reinforcement learning is a very particular
link |
00:25:47.780
kind of learning.
link |
00:25:49.660
One that's both philosophically sort of profound,
link |
00:25:53.060
but also one that's pretty difficult to get to work
link |
00:25:55.860
as if we look back in the early days.
link |
00:25:58.500
So when did that enter your life
link |
00:26:00.300
and how did that work progress?
link |
00:26:02.300
So I had just finished working in the games industry
link |
00:26:06.300
at this startup company.
link |
00:26:07.660
And I took a year out to discover for myself
link |
00:26:13.080
exactly which path I wanted to take.
link |
00:26:14.780
I knew I wanted to study intelligence,
link |
00:26:17.140
but I wasn't sure what that meant at that stage.
link |
00:26:19.220
I really didn't feel I had the tools
link |
00:26:21.420
to decide on exactly which path I wanted to follow.
link |
00:26:24.860
So during that year, I read a lot.
link |
00:26:27.180
And one of the things I read was Sutton and Barto,
link |
00:26:31.460
the sort of seminal textbook
link |
00:26:33.340
on an introduction to reinforcement learning.
link |
00:26:35.900
And when I read that textbook,
link |
00:26:39.100
I just had this resonating feeling
link |
00:26:43.500
that this is what I understood intelligence to be.
link |
00:26:47.820
And this was the path that I felt would be necessary
link |
00:26:51.420
to go down to make progress in AI.
link |
00:26:55.780
So I got in touch with Rich Sutton
link |
00:27:00.300
and asked him if he would be interested
link |
00:27:02.740
in supervising me on a PhD thesis in computer go.
link |
00:27:07.780
And he basically said
link |
00:27:11.940
that if he's still alive, he'd be happy to.
link |
00:27:15.740
But unfortunately, he'd been struggling
link |
00:27:19.460
with very serious cancer for some years.
link |
00:27:21.780
And he really wasn't confident at that stage
link |
00:27:23.980
that he'd even be around to see the end of it.
link |
00:27:26.340
But fortunately, that part of the story
link |
00:27:28.660
worked out very happily.
link |
00:27:29.860
And I found myself out there in Alberta.
link |
00:27:32.780
They've got a great games group out there
link |
00:27:34.820
with a history of fantastic work in board games,
link |
00:27:38.700
as well as Rich Sutton, the father of RL.
link |
00:27:40.860
So it was the natural place for me to go in some sense
link |
00:27:43.580
to study this question.
link |
00:27:45.900
And the more I looked into it,
link |
00:27:48.420
the more strongly I felt that this
link |
00:27:53.500
wasn't just the path to progress in computer go.
link |
00:27:56.260
But really, this was the thing I'd been looking for.
link |
00:27:59.340
This was really an opportunity
link |
00:28:04.900
to frame what intelligence means.
link |
00:28:08.420
Like what are the goals of AI in a clear,
link |
00:28:12.260
single clear problem definition,
link |
00:28:14.220
such that if we're able to solve
link |
00:28:15.620
that clear single problem definition,
link |
00:28:18.780
in some sense, we've cracked the problem of AI.
link |
00:28:21.180
So to you, reinforcement learning ideas,
link |
00:28:24.860
at least sort of echoes of it,
link |
00:28:26.220
would be at the core of intelligence.
link |
00:28:29.420
It is at the core of intelligence.
link |
00:28:31.340
And if we ever create a human level intelligence system,
link |
00:28:34.900
it would be at the core of that kind of system.
link |
00:28:37.460
Let me say it this way, that I think it's helpful
link |
00:28:39.580
to separate out the problem from the solution.
link |
00:28:42.340
So I see the problem of intelligence,
link |
00:28:45.980
I would say it can be formalized
link |
00:28:48.460
as the reinforcement learning problem,
link |
00:28:50.700
and that that formalization is enough
link |
00:28:52.820
to capture most, if not all of the things
link |
00:28:56.180
that we mean by intelligence,
link |
00:28:58.460
that they can all be brought within this framework
link |
00:29:01.060
and gives us a way to access them in a meaningful way
link |
00:29:03.500
that allows us as scientists to understand intelligence
link |
00:29:08.620
and us as computer scientists to build them.
link |
00:29:12.820
And so in that sense, I feel that it gives us a path,
link |
00:29:16.260
maybe not the only path, but a path towards AI.
link |
00:29:20.300
And so do I think that any system in the future
link |
00:29:24.940
that's solved AI would have to have RL within it?
link |
00:29:29.700
Well, I think if you ask that,
link |
00:29:30.700
you're asking about the solution methods.
link |
00:29:33.420
I would say that if we have such a thing,
link |
00:29:35.500
it would be a solution to the RL problem.
link |
00:29:37.860
Now, what particular methods have been used to get there?
link |
00:29:41.180
Well, we should keep an open mind
link |
00:29:42.300
about the best approaches to actually solve any problem.
link |
00:29:45.660
And the things we have right now for reinforcement learning,
link |
00:29:49.420
I believe they've got a lot of legs,
link |
00:29:53.500
but maybe we're missing some things.
link |
00:29:54.860
Maybe there's gonna be better ideas.
link |
00:29:56.460
I think we should remain modest,
link |
00:29:59.060
and we're at the early days of this field
link |
00:30:02.380
and there are many amazing discoveries ahead of us.
link |
00:30:04.980
For sure, the specifics,
link |
00:30:06.300
especially of the different kinds of RL approaches currently,
link |
00:30:09.580
there could be other things that fall
link |
00:30:11.260
into the very large umbrella of RL.
link |
00:30:13.420
But if it's okay, can we take a step back
link |
00:30:16.700
and kind of ask the basic question
link |
00:30:18.940
of what is to you reinforcement learning?
link |
00:30:22.540
So reinforcement learning is the study
link |
00:30:25.500
and the science and the problem of intelligence
link |
00:30:31.340
in the form of an agent that interacts with an environment.
link |
00:30:35.460
So the problem you're trying to solve
link |
00:30:36.660
is represented by some environment,
link |
00:30:38.100
like the world in which that agent is situated.
link |
00:30:40.700
And the goal of RL is clear
link |
00:30:42.500
that the agent gets to take actions.
link |
00:30:45.580
Those actions have some effect on the environment
link |
00:30:47.580
and the environment gives back an observation
link |
00:30:49.180
to the agent saying, this is what you see or sense.
link |
00:30:52.820
And one special thing which it gives back
link |
00:30:54.780
is called the reward signal,
link |
00:30:56.300
how well it's doing in the environment.
link |
00:30:58.100
And the reinforcement learning problem
link |
00:30:59.900
is to simply take actions over time
link |
00:31:04.380
so as to maximize that reward signal.
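
That loop can be made concrete with a minimal sketch in Python. The toy environment, its hidden target, and the simple averaging agent are all illustrative assumptions; the point is only the interface: action out, observation and reward in, and learning driven by the reward signal alone.

import random

class Environment:
    """Toy environment: reward is higher the closer the action is to a hidden target."""
    def __init__(self):
        self.target = 7

    def step(self, action):
        reward = -abs(action - self.target)  # the reward signal: how well the agent is doing
        observation = reward                 # what the agent gets to see or sense
        return observation, reward

class Agent:
    """Picks actions over time so as to maximize the reward signal."""
    def __init__(self, actions):
        self.values = {a: 0.0 for a in actions}
        self.counts = {a: 0 for a in actions}

    def act(self):
        if random.random() < 0.1:                     # explore occasionally
            return random.choice(list(self.values))
        return max(self.values, key=self.values.get)  # otherwise exploit what's known

    def learn(self, action, reward):
        self.counts[action] += 1
        n = self.counts[action]
        self.values[action] += (reward - self.values[action]) / n  # running average

env, agent = Environment(), Agent(actions=range(10))
for _ in range(1000):
    action = agent.act()
    observation, reward = env.step(action)
    agent.learn(action, reward)
print(max(agent.values, key=agent.values.get))  # settles on the hidden target, 7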
link |
00:31:07.260
So a couple of basic questions.
link |
00:31:11.060
What types of RL approaches are there?
link |
00:31:13.860
So I don't know if there's a nice, brief, in-words way
link |
00:31:17.820
to paint the picture of sort of value based,
link |
00:31:21.500
model based, policy based reinforcement learning.
link |
00:31:25.820
Yeah, so now if we think about,
link |
00:31:27.860
okay, so there's this ambitious problem definition of RL.
link |
00:31:31.940
It's really, it's truly ambitious.
link |
00:31:33.380
It's trying to capture and encircle
link |
00:31:34.860
all of the things in which an agent interacts
link |
00:31:36.980
with an environment and say, well,
link |
00:31:38.460
how can we formalize and understand
link |
00:31:39.820
what it means to crack that?
link |
00:31:41.980
Now let's think about the solution method.
link |
00:31:43.820
Well, how do you solve a really hard problem like that?
link |
00:31:46.460
Well, one approach you can take
link |
00:31:48.060
is to decompose that very hard problem
link |
00:31:51.700
into pieces that work together to solve that hard problem.
link |
00:31:55.380
And so you can kind of look at the decomposition
link |
00:31:58.020
that's inside the agent's head, if you like,
link |
00:32:00.660
and ask, well, what form does that decomposition take?
link |
00:32:03.740
And some of the most common pieces that people use
link |
00:32:06.140
when they're kind of putting
link |
00:32:07.300
the solution method together,
link |
00:32:09.540
some of the most common pieces that people use
link |
00:32:11.660
are whether or not that solution has a value function.
link |
00:32:14.820
That means, is it trying to predict,
link |
00:32:16.740
explicitly trying to predict how much reward
link |
00:32:18.540
it will get in the future?
link |
00:32:20.060
Does it have a representation of a policy?
link |
00:32:22.740
That means something which is deciding how to pick actions.
link |
00:32:25.700
Is that decision making process explicitly represented?
link |
00:32:28.980
And is there a model in the system?
link |
00:32:31.980
Is there something which is explicitly trying to predict
link |
00:32:34.380
what will happen in the environment?
link |
00:32:36.540
And so those three pieces are, to me,
link |
00:32:40.500
some of the most common building blocks.
link |
00:32:42.340
And I understand the different choices in RL
link |
00:32:47.020
as choices of whether or not to use those building blocks
link |
00:32:49.860
when you're trying to decompose the solution.
link |
00:32:52.580
Should I have a value function represented?
link |
00:32:54.260
Should I have a policy represented?
link |
00:32:56.700
Should I have a model represented?
link |
00:32:58.420
And there are combinations of those pieces
link |
00:33:00.180
and, of course, other things that you could
link |
00:33:01.700
add into the picture as well.
link |
00:33:03.140
But those three fundamental choices
link |
00:33:04.980
give rise to some of the branches of RL
link |
00:33:06.900
with which we're very familiar.
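
One way to picture those three building blocks is as interfaces; which of them an agent represents explicitly is exactly the design choice being described. A sketch in Python, with purely illustrative signatures:

from abc import ABC, abstractmethod

class ValueFunction(ABC):
    @abstractmethod
    def value(self, state):
        """Predict how much reward will be received in the future from this state."""

class Policy(ABC):
    @abstractmethod
    def action(self, state):
        """Decide which action to pick in this state."""

class Model(ABC):
    @abstractmethod
    def predict(self, state, action):
        """Predict what will happen in the environment: the next state and reward."""

# Value-based methods make ValueFunction explicit and derive behavior from it;
# policy-based methods make Policy explicit; model-based methods learn a Model
# and plan with it. Combinations of the pieces (e.g. actor-critic) give rise
# to the familiar branches of RL.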
link |
00:33:08.580
And so those, as you mentioned,
link |
00:33:10.860
there is a choice of what's specified
link |
00:33:14.300
or modeled explicitly.
link |
00:33:17.180
And the idea is that all of these
link |
00:33:20.460
are somehow implicitly learned within the system.
link |
00:33:23.420
So it's almost a choice of how you approach a problem.
link |
00:33:28.500
Do you see those as fundamental differences
link |
00:33:30.260
or are these almost like small specifics,
link |
00:33:35.420
like the details of how you solve a problem
link |
00:33:37.500
but they're not fundamentally different from each other?
link |
00:33:40.900
I think the fundamental idea is maybe at the higher level.
link |
00:33:45.940
The fundamental idea is the first step
link |
00:33:48.660
of the decomposition is really to say,
link |
00:33:50.860
well, how are we really gonna solve any kind of problem
link |
00:33:55.060
where you're trying to figure out how to take actions
link |
00:33:57.380
and just from this stream of observations,
link |
00:33:59.780
you've got some agent situated in its sensory motor stream
link |
00:34:02.140
and getting all these observations in,
link |
00:34:04.300
getting to take these actions, and what should it do?
link |
00:34:06.140
How can you even broach that problem?
link |
00:34:07.420
You know, maybe the complexity of the world is so great
link |
00:34:10.780
that you can't even imagine how to build a system
link |
00:34:13.220
that would understand how to deal with that.
link |
00:34:15.700
And so the first step of this decomposition is to say,
link |
00:34:18.540
well, you have to learn.
link |
00:34:19.540
The system has to learn for itself.
link |
00:34:22.020
And so note that the reinforcement learning problem
link |
00:34:24.420
doesn't actually stipulate that you have to learn.
link |
00:34:27.060
Like you could maximize your rewards without learning.
link |
00:34:29.340
It just wouldn't do a very good job of it.
link |
00:34:32.380
So learning is required
link |
00:34:34.420
because it's the only way to achieve good performance
link |
00:34:36.900
in any sufficiently large and complex environment.
link |
00:34:40.500
So that's the first step.
link |
00:34:42.260
And so that step gives commonality
link |
00:34:43.740
to all of the other pieces,
link |
00:34:45.340
because now you might ask, well, what should you be learning?
link |
00:34:48.780
What does learning even mean?
link |
00:34:49.900
You know, in this sense, you know, learning might mean,
link |
00:34:52.260
well, you're trying to update the parameters
link |
00:34:55.740
of some system, which is then the thing
link |
00:34:59.060
that actually picks the actions.
link |
00:35:00.860
And those parameters could be representing anything.
link |
00:35:03.460
They could be parameterizing a value function or a model
link |
00:35:06.820
or a policy.
link |
00:35:08.540
And so in that sense, there's a lot of commonality
link |
00:35:10.860
in that whatever is being represented there
link |
00:35:12.380
is the thing which is being learned,
link |
00:35:13.580
and it's being learned with the ultimate goal
link |
00:35:15.740
of maximizing rewards.
link |
00:35:17.500
But the way in which you decompose the problem
link |
00:35:20.300
is really what gives the semantics to the whole system.
link |
00:35:23.140
Like, are you trying to learn something to predict well,
link |
00:35:27.300
like a value function or a model?
link |
00:35:28.580
Are you learning something to perform well, like a policy?
link |
00:35:31.700
And the form of that objective
link |
00:35:34.020
is kind of giving the semantics to the system.
link |
00:35:36.300
And so it really is, at the next level down,
link |
00:35:39.260
a fundamental choice,
link |
00:35:40.300
and we have to make those fundamental choices
link |
00:35:42.860
as system designers or enable our algorithms
link |
00:35:46.180
to be able to learn how to make those choices for themselves.
link |
00:35:49.340
So then the next step you mentioned,
link |
00:35:52.020
the very first thing you have to deal with is,
link |
00:35:56.020
can you even take in this huge stream of observations
link |
00:36:00.060
and do anything with it?
link |
00:36:01.540
So the natural next basic question is,
link |
00:36:05.060
what is deep reinforcement learning?
link |
00:36:08.140
And what is this idea of using neural networks
link |
00:36:11.540
to deal with this huge incoming stream?
link |
00:36:14.580
So amongst all the approaches for reinforcement learning,
link |
00:36:18.220
deep reinforcement learning
link |
00:36:19.420
is one family of solution methods
link |
00:36:23.180
that tries to utilize powerful representations
link |
00:36:29.700
that are offered by neural networks
link |
00:36:31.620
to represent any of these different components
link |
00:36:35.740
of the solution, of the agent,
link |
00:36:37.980
like whether it's the value function
link |
00:36:39.660
or the model or the policy.
link |
00:36:41.820
The idea of deep learning is to say,
link |
00:36:43.460
well, here's a powerful toolkit that's so powerful
link |
00:36:46.700
that it's universal in the sense
link |
00:36:48.180
that it can represent any function
link |
00:36:50.140
and it can learn any function.
link |
00:36:52.020
And so if we can leverage that universality,
link |
00:36:55.020
that means that whatever we need to represent
link |
00:36:57.940
for our policy or for our value function or for a model,
link |
00:37:00.260
deep learning can do it.
link |
00:37:01.940
So that deep learning is one approach
link |
00:37:04.860
that offers us a toolkit
link |
00:37:06.620
that has no ceiling to its performance,
link |
00:37:09.460
that as we start to put more resources into the system,
link |
00:37:12.500
more memory and more computation and more data,
link |
00:37:17.180
more experience, more interactions with the environment,
link |
00:37:20.140
that these are systems that can just get better
link |
00:37:22.220
and better and better at doing whatever the job is
link |
00:37:24.420
we've asked them to do,
link |
00:37:25.340
whatever we've asked that function to represent,
link |
00:37:27.740
it can learn a function that does a better and better job
link |
00:37:31.140
of representing that knowledge,
link |
00:37:33.340
whether that knowledge be estimating
link |
00:37:35.500
how well you're gonna do in the world,
link |
00:37:36.660
the value function,
link |
00:37:37.700
whether it's gonna be choosing what to do in the world,
link |
00:37:40.660
the policy,
link |
00:37:41.500
or whether it's understanding the world itself,
link |
00:37:43.860
what's gonna happen next, the model.
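
As a minimal sketch of what that looks like concretely, here is a tiny two-layer network in Python with numpy whose two heads stand in for a policy and a value function. The sizes and names are illustrative assumptions, not those of any specific system.

import numpy as np

rng = np.random.default_rng(0)
OBS, HIDDEN, ACTIONS = 32, 64, 4

W1 = rng.normal(0, 0.1, (HIDDEN, OBS))            # shared learned representation
W_policy = rng.normal(0, 0.1, (ACTIONS, HIDDEN))  # policy head
W_value = rng.normal(0, 0.1, (1, HIDDEN))         # value head

def forward(observation):
    h = np.tanh(W1 @ observation)        # features extracted from the observation
    logits = W_policy @ h
    policy = np.exp(logits - logits.max())
    policy /= policy.sum()               # softmax: probability of picking each action
    value = np.tanh(W_value @ h)[0]      # scalar estimate of future reward
    return policy, value

policy, value = forward(rng.normal(size=OBS))
print(policy, value)
# In deep RL these weights are trained by gradient descent, so the same
# universal toolkit can represent the policy, the value function, or a model.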
link |
00:37:45.780
Nevertheless, the fact that neural networks
link |
00:37:49.100
are able to learn incredibly complex representations
link |
00:37:53.780
that allow you to do the policy, the model
link |
00:37:55.780
or the value function is, at least to my mind,
link |
00:38:00.780
exceptionally beautiful and surprising.
link |
00:38:02.980
Like, was it surprising to you?
link |
00:38:07.980
Can you still believe it works as well as it does?
link |
00:38:10.660
Do you have good intuition about why it works at all
link |
00:38:13.980
and works as well as it does?
link |
00:38:18.500
I think, let me take two parts to that question.
link |
00:38:22.140
I think it's not surprising to me
link |
00:38:26.740
that the idea of reinforcement learning works
link |
00:38:30.180
because in some sense,
link |
00:38:34.420
I feel it's the only thing which can ultimately work.
link |
00:38:36.860
And so I feel we have to address it
link |
00:38:39.460
and success must be possible
link |
00:38:41.940
because we have examples of intelligence.
link |
00:38:44.140
And it must at some level be
link |
00:38:47.020
possible to acquire experience
link |
00:38:49.500
and use that experience to do better
link |
00:38:51.740
in a way which is meaningful to environments
link |
00:38:55.260
of the complexity that humans can deal with.
link |
00:38:57.180
It must be.
link |
00:38:58.980
Am I surprised that our current systems
link |
00:39:00.540
can do as well as they can do?
link |
00:39:03.540
I think one of the big surprises for me
link |
00:39:05.460
and a lot of the community
link |
00:39:09.060
is really the fact that deep learning
link |
00:39:13.660
can continue to perform so well
link |
00:39:18.660
despite the fact that these neural networks
link |
00:39:21.980
that they're representing
link |
00:39:23.180
have these incredibly nonlinear kind of bumpy surfaces
link |
00:39:27.340
which to our kind of low dimensional intuitions
link |
00:39:30.540
make it feel like surely you're just gonna get stuck
link |
00:39:33.300
and learning will get stuck
link |
00:39:34.540
because you won't be able to make any further progress.
link |
00:39:37.940
And yet the big surprise is that learning continues
link |
00:39:42.580
and these what appear to be local optima
link |
00:39:45.860
turn out not to be because in high dimensions
link |
00:39:48.020
when we make really big neural nets,
link |
00:39:49.780
there's always a way out
link |
00:39:51.580
and there's a way to go even lower
link |
00:39:52.980
and then you're still not in a local optimum
link |
00:39:55.900
because there's some other pathway
link |
00:39:57.180
that will take you out and take you lower still.
link |
00:39:59.380
And so no matter where you are,
link |
00:40:00.580
learning can proceed and do better and better and better
link |
00:40:04.580
without bound.
link |
00:40:06.380
And so that is a surprising
link |
00:40:09.900
and beautiful property of neural nets
link |
00:40:13.220
which I find elegant and beautiful
link |
00:40:16.860
and somewhat shocking that it turns out to be the case.
link |
00:40:20.460
As you said, which I really like
link |
00:40:22.540
to our low dimensional intuitions, that's surprising.
link |
00:40:27.940
Yeah, we're very tuned to working
link |
00:40:31.980
within a three dimensional environment.
link |
00:40:33.900
And so to start to visualize
link |
00:40:36.300
what a billion dimensional neural network surface
link |
00:40:41.300
that you're trying to optimize over,
link |
00:40:42.740
what that even looks like is very hard for us.
link |
00:40:45.620
And so I think that really,
link |
00:40:47.940
if you try to account for the,
link |
00:40:52.780
essentially the AI winter
link |
00:40:54.260
where people gave up on neural networks,
link |
00:40:56.780
I think it's really down to that lack of ability
link |
00:41:00.300
to generalize from low dimensions to high dimensions
link |
00:41:03.260
because back then we were in the low dimensional case.
link |
00:41:05.780
People could only build neural nets
link |
00:41:07.180
with 50 nodes in them or something.
link |
00:41:11.460
And to imagine that it might be possible
link |
00:41:14.180
to build a billion dimensional neural net
link |
00:41:15.980
and it might have a completely different,
link |
00:41:17.500
qualitatively different property was very hard to anticipate.
link |
00:41:21.340
And I think even now we're starting to build the theory
link |
00:41:24.580
to support that.
link |
00:41:26.420
And it's incomplete at the moment,
link |
00:41:28.260
but all of the theory seems to be pointing in the direction
link |
00:41:30.900
that indeed this is an approach which truly is universal
link |
00:41:34.820
both in its representational capacity, which was known,
link |
00:41:37.220
but also in its learning ability, which is surprising.
link |
00:41:40.860
And it makes one wonder what else we're missing
link |
00:41:44.780
due to our low dimensional intuitions
link |
00:41:47.620
that will seem obvious once it's discovered.
link |
00:41:51.700
I often wonder, when we one day do have AIs
link |
00:41:57.580
which are superhuman in their abilities
link |
00:42:00.980
to understand the world,
link |
00:42:05.380
what will they think of the algorithms
link |
00:42:07.540
that we developed back now?
link |
00:42:08.940
Will it be looking back at these days
link |
00:42:11.540
and thinking that, will we look back and feel
link |
00:42:17.100
that these algorithms were naive first steps
link |
00:42:19.580
or will they still be the fundamental ideas
link |
00:42:21.500
which are used even in 100,000, 10,000 years?
link |
00:42:26.180
It's hard to know.
link |
00:42:27.500
They'll look back at this conversation
link |
00:42:30.300
and with a smile, maybe a little bit of a laugh.
link |
00:42:34.820
I mean, my sense is, I think just like when we used
link |
00:42:40.140
to think that the sun revolved around the earth,
link |
00:42:45.860
they'll see our systems of today, reinforcement learning
link |
00:42:49.540
as too complicated, that the answer was simple all along.
link |
00:42:54.460
There's something, just like you said in the game of Go,
link |
00:42:58.180
I mean, I love the systems of like cellular automata,
link |
00:43:01.700
that there's simple rules from which incredible complexity
link |
00:43:05.020
emerges, so it feels like there might be
link |
00:43:08.180
some really simple approaches,
link |
00:43:10.540
just like Rich Sutton says, right?
link |
00:43:12.660
These simple methods with compute over time
link |
00:43:17.700
seem to prove to be the most effective.
link |
00:43:20.700
I 100% agree.
link |
00:43:21.900
I think that if we try to anticipate
link |
00:43:27.780
what will generalize well into the future,
link |
00:43:30.660
I think it's likely to be the case
link |
00:43:32.900
that it's the simple, clear ideas
link |
00:43:35.540
which will have the longest legs
link |
00:43:36.780
and which will carry us furthest into the future.
link |
00:43:39.340
Nevertheless, we're in a situation
link |
00:43:40.860
where we need to make things work today,
link |
00:43:43.260
and sometimes that requires putting together
link |
00:43:44.940
more complex systems where we don't have
link |
00:43:47.420
the full answers yet as to what
link |
00:43:49.580
those minimal ingredients might be.
link |
00:43:51.580
So speaking of which, if we could take a step back to Go,
link |
00:43:55.060
what was MoGo and what was the key idea behind the system?
link |
00:44:00.780
So back during my PhD on Computer Go,
link |
00:44:04.420
around about that time, there was a major new development
link |
00:44:08.900
which actually happened in the context of Computer Go,
link |
00:44:12.780
and it was really a revolution in the way
link |
00:44:16.660
that heuristic search was done,
link |
00:44:18.700
and the idea was essentially that
link |
00:44:21.820
a position could be evaluated or a state in general
link |
00:44:26.300
could be evaluated not by humans saying
link |
00:44:30.620
whether that position is good or not,
link |
00:44:33.500
or even humans providing rules
link |
00:44:35.100
as to how you might evaluate it,
link |
00:44:37.220
but instead by allowing the system
link |
00:44:40.860
to randomly play out the game until the end multiple times
link |
00:44:45.820
and taking the average of those outcomes
link |
00:44:48.100
as the prediction of what will happen.
link |
00:44:50.620
So for example, if you're in the game of Go,
link |
00:44:53.020
the intuition is that you take a position
link |
00:44:55.380
and you get the system to kind of play random moves
link |
00:44:58.100
against itself all the way to the end of the game
link |
00:45:00.100
and you see who wins.
link |
00:45:01.740
And if black ends up winning
link |
00:45:03.220
more of those random games than white,
link |
00:45:05.140
well, you say, hey, this is a position that favors black.
link |
00:45:07.420
And if white ends up winning more of those random games
link |
00:45:09.580
than black, then it favors white.
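As a concrete sketch of that idea, here is a minimal Monte Carlo position evaluator in Python. The game interface (copy, is_over, legal_moves, play, winner) is a hypothetical stand-in for a real Go implementation, not any particular library:

```python
import random

def rollout_value(position, player, num_rollouts=100):
    """Estimate how favorable `position` is for `player` by averaging
    the outcomes of random games played from it to the very end."""
    wins = 0
    for _ in range(num_rollouts):
        game = position.copy()
        while not game.is_over():
            # Both sides play uniformly random legal moves until the end.
            game.play(random.choice(game.legal_moves()))
        if game.winner() == player:
            wins += 1
    return wins / num_rollouts   # e.g. 0.7 means ~70% of random games won
```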
link |
00:45:13.620
So that idea was known as Monte Carlo search,
link |
00:45:18.140
and a particular form of Monte Carlo search
link |
00:45:21.140
that became very effective and was developed in computer Go
link |
00:45:24.140
first by Rémi Coulom in 2006,
link |
00:45:26.620
and then taken further by others
link |
00:45:29.140
was something called Monte Carlo tree search,
link |
00:45:31.860
which basically takes that same idea
link |
00:45:34.020
and uses that insight so that every node of a search tree
link |
00:45:39.020
is evaluated by the average of the random play outs
link |
00:45:42.140
from that node onwards.
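A compact sketch of that loop, using the same hypothetical game interface as above: each node's value is the running average of the random playouts that passed through it, and a UCT style rule trades off promising moves against under explored ones. For brevity this sketch scores every playout from a single player's perspective; a real implementation alternates perspectives at each ply.

```python
import math
import random

class Node:
    def __init__(self, state):
        self.state = state
        self.children = {}     # move -> Node
        self.visits = 0
        self.value_sum = 0.0   # total playout outcome seen through this node

    def mean_value(self):
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(node, c=1.4):
    # UCT: prefer children with a high average outcome or few visits.
    return max(node.children.values(),
               key=lambda ch: ch.mean_value()
               + c * math.sqrt(math.log(node.visits) / (ch.visits + 1)))

def random_playout(state):
    # Random self-play to the end of the game; 1.0 if black wins, else 0.0.
    game = state.copy()
    while not game.is_over():
        game.play(random.choice(game.legal_moves()))
    return 1.0 if game.winner() == "black" else 0.0

def mcts_move(root_state, num_simulations=1000):
    root = Node(root_state)
    for _ in range(num_simulations):
        node, path = root, [root]
        # 1. Selection: descend via UCT until we reach a leaf.
        while node.children:
            node = select_child(node)
            path.append(node)
        # 2. Expansion: add one child per legal move at the leaf.
        if not node.state.is_over():
            for move in node.state.legal_moves():
                node.children[move] = Node(node.state.copy_and_play(move))
        # 3. Evaluation: a random playout from the leaf, as sketched above.
        outcome = random_playout(node.state)
        # 4. Backup: every node on the path absorbs the playout outcome.
        for n in path:
            n.visits += 1
            n.value_sum += outcome
    # Play the most visited move at the root.
    return max(root.children, key=lambda m: root.children[m].visits)
```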
link |
00:45:44.260
And this idea, when you think about it,
link |
00:45:46.820
was very powerful
link |
00:45:49.220
and suddenly led to huge leaps forward
link |
00:45:51.620
in the strength of computer Go playing programs.
link |
00:45:55.180
And among those, the strongest of the Go playing programs
link |
00:45:58.500
in those days was a program called MoGo,
link |
00:46:00.700
which was the first program to actually reach
link |
00:46:03.860
human master level on small boards, nine by nine boards.
link |
00:46:07.660
And so this was a program by someone called Sylvain Gelly,
link |
00:46:11.860
who's a good colleague of mine,
link |
00:46:13.140
and I worked with him a little bit in those days,
link |
00:46:16.780
as part of my PhD thesis.
link |
00:46:18.420
And MoGo was a first step towards the latest successes
link |
00:46:23.500
we saw in computer Go,
link |
00:46:25.460
but it was still missing a key ingredient.
link |
00:46:28.020
MoGo was evaluating purely by random rollouts against itself.
link |
00:46:33.860
And in a way, it's truly remarkable
link |
00:46:36.380
that random play should give you anything at all.
link |
00:46:39.500
Why in this perfectly deterministic game
link |
00:46:42.580
that's very precise and involves these very exact sequences,
link |
00:46:46.860
why is it that randomization is helpful?
link |
00:46:52.100
And so the intuition is that randomization
link |
00:46:54.100
captures something about the nature of the search tree,
link |
00:46:59.060
that from a given position, you're understanding
link |
00:47:01.820
the nature of the search tree from that node onwards
link |
00:47:04.580
by using randomization.
link |
00:47:06.980
And this was a very powerful idea.
link |
00:47:09.220
And I've seen this in other spaces,
link |
00:47:12.580
talked to Richard Karp and so on,
link |
00:47:14.660
randomized algorithms somehow magically
link |
00:47:17.340
are able to do exceptionally well
link |
00:47:19.740
and simplify the problem somehow.
link |
00:47:23.540
Makes you wonder about the fundamental nature
link |
00:47:25.660
of randomness in our universe.
link |
00:47:27.620
It seems to be a useful thing.
link |
00:47:29.500
But so from that moment,
link |
00:47:32.100
can you maybe tell the origin story
link |
00:47:33.980
and the journey of AlphaGo?
link |
00:47:36.100
Yeah, so programs based on Monte Carlo tree search
link |
00:47:39.460
were a first revolution
link |
00:47:41.580
in the sense that they led to suddenly programs
link |
00:47:44.740
that could play the game to any reasonable level,
link |
00:47:47.900
but they plateaued.
link |
00:47:50.100
It seemed that no matter how much effort
link |
00:47:51.900
people put into these techniques,
link |
00:47:53.180
they couldn't exceed the level
link |
00:47:54.820
of amateur Dan level Go players.
link |
00:47:58.060
So strong players,
link |
00:47:59.580
but not anywhere near the level of professionals,
link |
00:48:02.580
nevermind the world champion.
link |
00:48:04.460
And so that brings us to the birth of AlphaGo,
link |
00:48:08.380
which happened in the context of a startup company
link |
00:48:12.300
known as DeepMind.
link |
00:48:14.540
I heard of them.
link |
00:48:15.460
Where a project was born.
link |
00:48:19.020
And the project was really a scientific investigation
link |
00:48:23.700
where myself and Aja Huang
link |
00:48:27.900
and an intern, Chris Maddison,
link |
00:48:30.660
were exploring a scientific question.
link |
00:48:33.220
And that scientific question was really,
link |
00:48:37.300
is there another fundamentally different approach
link |
00:48:39.620
to this key question of Go,
link |
00:48:42.140
the key challenge of how can you build that intuition
link |
00:48:45.740
and how can you just have a system
link |
00:48:47.580
that could look at a position
link |
00:48:48.940
and understand what move to play
link |
00:48:51.260
or how well you're doing in that position,
link |
00:48:53.340
who's gonna win?
link |
00:48:54.820
And so the deep learning revolution had just begun.
link |
00:48:59.140
Competitions like ImageNet had suddenly been won
link |
00:49:03.460
by deep learning techniques back in 2012.
link |
00:49:06.540
And following that, it was natural to ask,
link |
00:49:08.620
well, if deep learning is able to scale up so effectively
link |
00:49:12.460
with images to understand them enough to classify them,
link |
00:49:16.660
well, why not Go?
link |
00:49:17.500
Why not take the black and white stones of the Go board
link |
00:49:22.700
and build a system which can understand for itself
link |
00:49:25.340
what that means in terms of what move to pick
link |
00:49:27.540
or who's gonna win the game, black or white?
link |
00:49:31.140
And so that was our scientific question
link |
00:49:32.540
which we were probing and trying to understand.
link |
00:49:35.660
And as we started to look at it,
link |
00:49:37.860
we discovered that we could build a system.
link |
00:49:40.860
So in fact, our very first paper on AlphaGo
link |
00:49:43.620
was actually a pure deep learning system
link |
00:49:47.020
which was trying to answer this question.
link |
00:49:49.460
And we showed that actually a pure deep learning system
link |
00:49:52.420
with no search at all was actually able
link |
00:49:54.860
to reach human dan level, master level
link |
00:49:58.260
at the full game of Go, 19 by 19 boards.
link |
00:50:01.740
And so without any search at all,
link |
00:50:04.020
suddenly we had systems which were playing
link |
00:50:06.060
at the level of the best Monte Carlo tree search systems,
link |
00:50:10.100
the ones with randomized rollouts.
link |
00:50:11.780
So first of all, sorry to interrupt,
link |
00:50:13.100
but that's kind of a groundbreaking notion.
link |
00:50:16.620
That's like basically a definitive step away
link |
00:50:20.700
from a couple of decades
link |
00:50:22.700
of essentially search dominating AI.
link |
00:50:26.300
So how did that make you feel?
link |
00:50:28.940
Was it surprising from a scientific perspective in general,
link |
00:50:33.020
how did it make you feel?
link |
00:50:33.980
I found this to be profoundly surprising.
link |
00:50:37.340
In fact, it was so surprising that we had a bet back then.
link |
00:50:41.780
And like many good projects, bets are quite motivating.
link |
00:50:44.980
And the bet was whether it was possible
link |
00:50:47.900
for a system based purely on deep learning,
link |
00:50:52.140
with no search at all to beat a dan level human player.
link |
00:50:55.900
And so we had someone who joined our team
link |
00:51:00.100
who was a dan level player.
link |
00:51:01.100
He came in and we had this first match against him and...
link |
00:51:06.660
Which side of the bet were you on, by the way?
link |
00:51:09.420
The losing or the winning side?
link |
00:51:11.740
I tend to be an optimist with the power
link |
00:51:14.660
of deep learning and reinforcement learning.
link |
00:51:18.420
So the system won,
link |
00:51:21.140
and we were able to beat this human dan level player.
link |
00:51:24.260
And for me, that was the moment where it was like,
link |
00:51:26.420
okay, something special is afoot here.
link |
00:51:29.460
We have a system which without search
link |
00:51:32.620
is able to already just look at this position
link |
00:51:36.180
and understand things as well as a strong human player.
link |
00:51:39.580
And from that point onwards,
link |
00:51:41.500
I really felt that reaching the top levels of human play,
link |
00:51:49.060
professional level, world champion level,
link |
00:51:50.820
I felt it was actually an inevitability.
link |
00:51:56.620
And if it was an inevitable outcome,
link |
00:51:59.700
I was rather keen that it would be us that achieved it.
link |
00:52:03.020
So we scaled up.
link |
00:52:05.420
This was something where,
link |
00:52:06.820
so I had lots of conversations back then
link |
00:52:09.380
with Demis Hassabis, the head of DeepMind,
link |
00:52:14.660
who was extremely excited.
link |
00:52:16.100
And we made the decision to scale up the project,
link |
00:52:21.140
brought more people on board.
link |
00:52:23.380
And so AlphaGo became something where we had a clear goal,
link |
00:52:30.060
which was to try and crack this outstanding challenge of AI
link |
00:52:33.700
to see if we could beat the world's best players.
link |
00:52:37.300
And this led within the space of not so many months
link |
00:52:42.460
to playing against the European champion Fan Hui
link |
00:52:45.780
in a match which became memorable in history
link |
00:52:48.940
as the first time a Go program
link |
00:52:50.660
had ever beaten a professional player.
link |
00:52:53.940
And at that time we had to make a judgment
link |
00:52:56.220
as to when and whether we should go
link |
00:52:59.700
and challenge the world champion.
link |
00:53:01.780
And this was a difficult decision to make.
link |
00:53:04.140
Again, we were basing our predictions on our own progress
link |
00:53:08.460
and had to estimate based on the rapidity
link |
00:53:11.300
of our own progress when we thought we would exceed
link |
00:53:15.340
the level of the human world champion.
link |
00:53:17.620
And we tried to make an estimate and set up a match
link |
00:53:20.420
and that became the AlphaGo versus Lee Sedol match in 2016.
link |
00:53:27.100
And we should say, spoiler alert,
link |
00:53:29.900
that AlphaGo was able to defeat Lee Sedol.
link |
00:53:33.740
That's right, yeah.
link |
00:53:34.980
So maybe we could take even a broader view.
link |
00:53:39.980
AlphaGo involves both learning from expert games
link |
00:53:45.900
and as far as I remember, a self play component
link |
00:53:51.220
to where it learns by playing against itself.
link |
00:53:54.260
But in your sense, what was the role of learning
link |
00:53:57.580
from expert games there?
link |
00:53:59.060
And in terms of your self evaluation,
link |
00:54:01.380
whether you can take on the world champion,
link |
00:54:04.140
what was the thing that you're trying to do more of?
link |
00:54:06.980
Sort of train more on expert games
link |
00:54:09.420
or was there now another,
link |
00:54:12.620
I'm asking so many poorly phrased questions,
link |
00:54:15.620
but did you have a hope or dream that self play
link |
00:54:19.580
would be the key component at that moment yet?
link |
00:54:24.460
So in the early days of AlphaGo,
link |
00:54:26.420
we used human data to explore the science
link |
00:54:29.780
of what deep learning can achieve.
link |
00:54:31.380
And so when we had our first paper that showed
link |
00:54:34.620
that it was possible to predict the winner of the game,
link |
00:54:37.820
that it was possible to suggest moves,
link |
00:54:39.700
that was done using human data.
link |
00:54:41.260
Solely human data.
link |
00:54:42.380
Yeah, and so the reason that we did it that way
link |
00:54:45.100
was at that time we were exploring separately
link |
00:54:47.620
the deep learning aspect
link |
00:54:48.940
from the reinforcement learning aspect.
link |
00:54:51.100
That was the part which was new and unknown
link |
00:54:53.420
to me at that time was how far could that be stretched?
link |
00:54:58.260
Once we had that, it then became natural
link |
00:55:00.540
to try and use that same representation
link |
00:55:03.060
and see if we could learn for ourselves
link |
00:55:04.940
using that same representation.
link |
00:55:06.580
And so right from the beginning,
link |
00:55:08.340
actually our goal had been to build a system
link |
00:55:11.940
using self play.
link |
00:55:14.220
And to us, the human data right from the beginning
link |
00:55:16.860
was an expedient step to help us for pragmatic reasons
link |
00:55:20.860
to go faster towards the goals of the project
link |
00:55:24.540
than we might be able to starting solely from self play.
link |
00:55:27.540
And so in those days, we were very aware
link |
00:55:29.820
that we were choosing to use human data
link |
00:55:32.780
and that might not be the long-term holy grail of AI,
link |
00:55:37.380
but that it was something which was extremely useful to us.
link |
00:55:40.860
It helped us to understand the system.
link |
00:55:42.260
It helped us to build deep learning representations
link |
00:55:44.380
which were clear and simple and easy to use.
link |
00:55:48.420
And so really I would say it served a purpose
link |
00:55:51.980
not just as part of the algorithm,
link |
00:55:53.300
but something which I continue to use in our research today,
link |
00:55:56.180
which is trying to break down a very hard challenge
link |
00:56:00.100
into pieces which are easier to understand for us
link |
00:56:02.500
as researchers and develop.
link |
00:56:04.180
So if you use a component based on human data,
link |
00:56:07.740
it can help you to understand the system
link |
00:56:10.340
such that then you can build
link |
00:56:11.340
the more principled version later that does it for itself.
link |
00:56:15.220
So as I said, the AlphaGo victory,
link |
00:56:19.660
and I don't think I'm sort of romanticizing this notion.
link |
00:56:23.740
I think it's one of the greatest moments
link |
00:56:25.140
in the history of AI.
link |
00:56:26.980
So were you cognizant of this magnitude
link |
00:56:29.900
of the accomplishment at the time?
link |
00:56:32.300
I mean, are you cognizant of it even now?
link |
00:56:35.900
Because to me, I feel like it's something that would,
link |
00:56:38.580
we mentioned how the AGI systems of the future
link |
00:56:41.300
will look back.
link |
00:56:42.500
I think they'll look back at the AlphaGo victory
link |
00:56:46.100
as like, holy crap, they figured it out.
link |
00:56:49.140
This is where it started.
link |
00:56:51.700
Well, thank you again.
link |
00:56:52.740
I mean, it's funny because I guess I've been working on,
link |
00:56:56.220
I've been working on computer Go for a long time.
link |
00:56:58.100
So I'd been working at the time of the AlphaGo match
link |
00:57:00.300
on computer Go for more than a decade.
link |
00:57:03.020
And throughout that decade, I'd had this dream
link |
00:57:06.060
of what would it be like to, what would it be like really
link |
00:57:08.780
to actually be able to build a system
link |
00:57:12.220
that could play against the world champion.
link |
00:57:14.300
And I imagined that that would be an interesting moment
link |
00:57:17.500
that maybe some people might care about that
link |
00:57:20.300
and that this might be a nice achievement.
link |
00:57:24.140
But I think when I arrived in Seoul
link |
00:57:27.500
and discovered the legions of journalists
link |
00:57:31.540
that were following us around and the 100 million people
link |
00:57:34.220
that were watching the match online live,
link |
00:57:37.620
I realized that I'd been off in my estimation
link |
00:57:40.140
of how significant this moment was
link |
00:57:41.900
by several orders of magnitude.
link |
00:57:43.980
And so there was definitely an adjustment process
link |
00:57:48.980
to realize that this was something
link |
00:57:53.140
which the world really cared about
link |
00:57:55.620
and which was a watershed moment.
link |
00:57:57.980
And I think there was that moment of realization.
link |
00:58:01.380
But it's also a little bit scary
link |
00:58:02.540
because if you go into something thinking
link |
00:58:05.580
it's gonna be maybe of interest
link |
00:58:08.420
and then discover that 100 million people are watching,
link |
00:58:10.860
it suddenly makes you worry about
link |
00:58:12.220
whether some of the decisions you'd made
link |
00:58:13.660
were really the best ones or the wisest,
link |
00:58:16.140
or were going to lead to the best outcome.
link |
00:58:18.260
And we knew for sure that there were still imperfections
link |
00:58:20.580
in AlphaGo, which were gonna be exposed
link |
00:58:22.700
to the whole world watching.
link |
00:58:24.420
And so, yeah, it was I think a great experience
link |
00:58:28.180
and I feel privileged to have been part of it,
link |
00:58:32.220
privileged to have led that amazing team.
link |
00:58:35.980
I feel privileged to have been in a moment of history
link |
00:58:38.860
like you say, but also lucky that in a sense
link |
00:58:43.700
I was insulated from the knowledge of,
link |
00:58:46.420
I think it would have been harder to focus on the research
link |
00:58:48.860
if the full kind of reality of what was gonna come to pass
link |
00:58:52.500
had been known to me and the team.
link |
00:58:55.340
I think it was, we were in our bubble
link |
00:58:57.620
and we were working on research
link |
00:58:58.740
and we were trying to answer the scientific questions
link |
00:59:01.580
and then bam, the public sees it.
link |
00:59:04.540
And I think it was better that way in retrospect.
link |
00:59:07.500
Were you confident that, I guess,
link |
00:59:10.180
what were the chances that you could get the win?
link |
00:59:13.580
So just like you said, I'm a little bit more familiar
link |
00:59:19.060
with another accomplishment
link |
00:59:20.300
that we may not even get a chance to talk about.
link |
00:59:22.380
I talked to Oriol Vinyals about AlphaStar
link |
00:59:24.500
which is another incredible accomplishment,
link |
00:59:26.260
but here with AlphaStar and beating StarCraft,
link |
00:59:31.140
there was already a track record with AlphaGo.
link |
00:59:34.460
This was really the first time
link |
00:59:36.260
you get to see reinforcement learning
link |
00:59:39.900
face the best human in the world.
link |
00:59:41.700
So what was your confidence like, what were the odds?
link |
00:59:45.000
Well, we actually... Was there a bet?
link |
00:59:47.860
Funnily enough, there was.
link |
00:59:49.100
So just before the match,
link |
00:59:52.100
we weren't betting on anything concrete,
link |
00:59:54.300
but we all held out a hand.
link |
00:59:56.520
Everyone in the team held out a hand
link |
00:59:57.980
at the beginning of the match.
link |
00:59:59.620
And the number of fingers that they had out on their hand
link |
01:00:01.500
was supposed to represent how many games
link |
01:00:03.420
they thought we would win against Lee Sedol.
link |
01:00:06.300
And there was an amazing spread in the team's predictions.
link |
01:00:10.540
But I have to say, I predicted four to one.
link |
01:00:15.060
And the reason was based purely on data.
link |
01:00:18.580
So I'm a scientist first and foremost.
link |
01:00:20.620
And one of the things which we had established
link |
01:00:23.140
was that AlphaGo in around one in five games
link |
01:00:27.260
would develop something which we called a delusion,
link |
01:00:29.540
which was a kind of hole in its knowledge
link |
01:00:31.980
where it wasn't able to fully understand
link |
01:00:34.840
everything about the position.
link |
01:00:36.100
And that hole in its knowledge would persist
link |
01:00:38.080
for tens of moves throughout the game.
link |
01:00:41.700
And we knew two things.
link |
01:00:42.720
We knew that if there were no delusions,
link |
01:00:44.480
that AlphaGo seemed to be playing at a level
link |
01:00:46.620
that was far beyond any human capabilities.
link |
01:00:49.420
But we also knew that if there were delusions,
link |
01:00:52.020
the opposite was true.
link |
01:00:53.780
And in fact, that's what came to pass.
link |
01:00:58.300
We saw all of those outcomes.
link |
01:01:00.180
And Lee Sedol in one of the games
link |
01:01:02.900
played a really beautiful sequence
link |
01:01:04.580
that AlphaGo just hadn't predicted.
link |
01:01:08.180
And after that, it led it into this situation
link |
01:01:11.800
where it was unable to really understand the position fully
link |
01:01:14.980
and found itself in one of these delusions.
link |
01:01:17.900
So indeed, yeah, 4-1 was the outcome.
link |
01:01:20.780
So yeah, and can you maybe speak to it a little bit more?
link |
01:01:23.220
What were the five games?
link |
01:01:25.620
What happened?
link |
01:01:26.460
Is there interesting things that come to memory
link |
01:01:29.900
in terms of the play of the human or the machine?
link |
01:01:33.600
So I remember all of these games vividly, of course.
link |
01:01:37.220
Moments like these don't come too often
link |
01:01:39.320
in the lifetime of a scientist.
link |
01:01:42.460
And the first game was magical because it was the first time
link |
01:01:49.900
that a computer program had defeated a world
link |
01:01:53.700
champion in this grand challenge of Go.
link |
01:01:57.020
And there was a moment where AlphaGo invaded Lee Sedol's
link |
01:02:04.580
territory towards the end of the game.
link |
01:02:07.900
And that's quite an audacious thing to do.
link |
01:02:09.920
It's like saying, hey, you thought
link |
01:02:11.260
this was going to be your territory in the game,
link |
01:02:12.580
but I'm going to stick a stone right in the middle of it
link |
01:02:14.920
and prove to you that I can break it up.
link |
01:02:17.980
And Lee Sedol's face just dropped.
link |
01:02:20.260
He wasn't expecting a computer to do something that audacious.
link |
01:02:26.140
The second game became famous for a move known as move 37.
link |
01:02:30.820
This was a move that was played by AlphaGo that broke
link |
01:02:36.540
all of the conventions of Go, that the Go players were
link |
01:02:39.340
so shocked by this.
link |
01:02:40.260
They thought that maybe the operator had made a mistake.
link |
01:02:45.300
They thought that there was something crazy going on.
link |
01:02:48.180
And it just broke every rule that Go players
link |
01:02:50.580
are taught from a very young age.
link |
01:02:52.580
For this kind of move, called a shoulder hit, they're taught
link |
01:02:55.300
you can only play it on the third line or the fourth line,
link |
01:02:58.820
and AlphaGo played it on the fifth line.
link |
01:03:00.700
And it turned out to be a brilliant move
link |
01:03:03.500
and made this beautiful pattern in the middle of the board that
link |
01:03:06.100
ended up winning the game.
link |
01:03:08.500
And so this really was a clear instance
link |
01:03:12.300
where we could say computers exhibited creativity,
link |
01:03:16.020
that this was really a move that was something
link |
01:03:18.620
humans hadn't known about, hadn't anticipated.
link |
01:03:22.620
And computers discovered this idea.
link |
01:03:24.860
They were the ones to say, actually, here's
link |
01:03:27.460
a new idea, something new, not in the domains
link |
01:03:30.700
of human knowledge of the game.
link |
01:03:33.460
And now the humans think this is a reasonable thing to do.
link |
01:03:38.260
And it's part of Go knowledge now.
link |
01:03:41.580
The third game, something special
link |
01:03:44.300
happens when you play against a human world champion, which,
link |
01:03:46.860
again, I hadn't anticipated before going there,
link |
01:03:48.860
which is these players are amazing.
link |
01:03:53.300
Lee Sedol was a true champion, 18 time world champion,
link |
01:03:56.460
and had this amazing ability to probe AlphaGo
link |
01:04:01.020
for weaknesses of any kind.
link |
01:04:03.500
And in the third game, he was losing,
link |
01:04:06.200
and we felt we were sailing comfortably to victory.
link |
01:04:09.740
But he managed to, from nothing, stir up this fight
link |
01:04:14.620
and build what's called a double ko,
link |
01:04:17.060
these kinds of repetitive positions.
link |
01:04:20.500
And he knew that historically, no computer Go program had ever
link |
01:04:24.180
been able to deal correctly with double ko positions.
link |
01:04:26.780
And he managed to summon one out of nothing.
link |
01:04:29.800
And so for us, this was a real challenge.
link |
01:04:33.220
Would AlphaGo be able to deal with this,
link |
01:04:35.340
or would it just kind of crumble in the face of this situation?
link |
01:04:38.660
And fortunately, it dealt with it perfectly.
link |
01:04:41.460
The fourth game was amazing in that Lee Sedol
link |
01:04:46.180
appeared to be losing this game.
link |
01:04:48.380
AlphaGo thought it was winning.
link |
01:04:49.900
And then Lee Sedol did something,
link |
01:04:52.000
which I think only a true world champion can do,
link |
01:04:55.020
which is he found a brilliant sequence
link |
01:04:57.980
in the middle of the game, a brilliant sequence
link |
01:04:59.860
that led him to really just transform the position.
link |
01:05:05.220
He kind of found just a piece of genius, really.
link |
01:05:10.780
And after that, AlphaGo's evaluation just tumbled.
link |
01:05:15.660
It thought it was winning this game.
link |
01:05:17.220
And all of a sudden, it tumbled and said, oh, now
link |
01:05:20.540
I've got no chance.
link |
01:05:21.460
And it started to behave rather oddly at that point.
link |
01:05:24.420
In the final game, for some reason, we as a team
link |
01:05:27.540
were convinced, having seen AlphaGo in the previous game
link |
01:05:30.960
suffer from delusions,
link |
01:05:31.980
that it
link |
01:05:34.220
was suffering from another delusion.
link |
01:05:35.940
We were convinced that it was misevaluating the position
link |
01:05:38.340
and that something was going terribly wrong.
link |
01:05:41.260
And it was only in the last few moves of the game
link |
01:05:43.740
that we realized that actually, although it
link |
01:05:46.780
had been predicting it was going to win all the way through,
link |
01:05:49.460
it really was.
link |
01:05:51.380
And so somehow, it just taught us yet again
link |
01:05:54.220
that you have to have faith in your systems.
link |
01:05:56.180
When they exceed your own level of ability
link |
01:05:58.700
and your own judgment, you have to trust in them
link |
01:06:01.340
to know better than you, the designer, once you've
link |
01:06:06.300
bestowed in them the ability to judge better than you can,
link |
01:06:10.580
then trust the system to do so.
link |
01:06:13.020
So just like in the case of Deep Blue beating Gary Kasparov,
link |
01:06:18.900
so for Garry, that was, I think, the first time he'd ever lost, actually,
link |
01:06:23.120
to anybody.
link |
01:06:24.460
And I mean, there's a similar situation with Lee Sedol.
link |
01:06:27.740
It's a tragic loss for humans, but a beautiful one,
link |
01:06:36.580
I think, that's kind of, from the tragedy,
link |
01:06:40.780
sort of emerges over time
link |
01:06:45.020
a kind of inspiring story.
link |
01:06:47.300
But Lee Sedol recently announced his retirement.
link |
01:06:52.180
I don't know if we can look too deeply into it,
link |
01:06:56.020
but he did say that even if I become number one,
link |
01:06:59.540
there's an entity that cannot be defeated.
link |
01:07:02.620
So what do you think about these words?
link |
01:07:05.460
What do you think about his retirement from the game of Go?
link |
01:07:08.020
Well, let me take you back, first of all,
link |
01:07:09.660
to the first part of your comment about Garry Kasparov,
link |
01:07:12.420
because actually, at the panel yesterday,
link |
01:07:15.700
he specifically said that when he first lost to Deep Blue,
link |
01:07:19.780
he viewed it as a failure.
link |
01:07:22.340
He viewed that this had been a failure of his.
link |
01:07:24.940
But later on in his career, he said
link |
01:07:27.220
he'd come to realize that actually, it was a success.
link |
01:07:30.420
It was a success for everyone, because this marked
link |
01:07:33.380
a transformational moment for AI.
link |
01:07:37.180
And so even for Garry Kasparov, he
link |
01:07:39.120
came to realize that that moment was pivotal
link |
01:07:42.500
and actually meant something much more
link |
01:07:45.420
than his personal loss in that moment.
link |
01:07:49.960
Lee Sedol, I think, was much more cognizant of that,
link |
01:07:53.840
even at the time.
link |
01:07:54.860
And so in his closing remarks to the match,
link |
01:07:57.940
he really felt very strongly that what
link |
01:08:01.580
had happened in the AlphaGo match
link |
01:08:02.940
was not only meaningful for AI, but for humans as well.
link |
01:08:06.580
And he felt as a Go player that it had opened his horizons
link |
01:08:09.940
and meant that he could start exploring new things.
link |
01:08:12.700
It brought his joy back for the game of Go,
link |
01:08:14.460
because it had broken all of the conventions and barriers
link |
01:08:18.620
and meant that suddenly, anything was possible again.
link |
01:08:23.700
So I was sad to hear that he'd retired,
link |
01:08:26.060
but he's been a great world champion over many, many years.
link |
01:08:31.180
And I think he'll be remembered for that ever more.
link |
01:08:36.180
He'll be remembered as the last person to beat AlphaGo.
link |
01:08:39.340
I mean, after that, we increased the power of the system.
link |
01:08:43.100
And the next version of AlphaGo beat the other strong human
link |
01:08:49.580
players 60 games to nil.
link |
01:08:52.260
So what a great moment for him and something
link |
01:08:55.580
to be remembered for.
link |
01:08:58.020
It's interesting that you spent time at AAAI on a panel
link |
01:09:02.380
with Garry Kasparov.
link |
01:09:05.220
What, I mean, it's almost, I'm just
link |
01:09:07.460
curious to learn the conversations you've
link |
01:09:12.020
had with Garry, because he's also now,
link |
01:09:15.260
he's written a book about artificial intelligence.
link |
01:09:17.420
He's thinking about AI.
link |
01:09:18.900
He has kind of a view of it.
link |
01:09:21.140
And he talks about AlphaGo a lot.
link |
01:09:23.820
What's your sense?
link |
01:09:26.940
Arguably, I'm not just being Russian,
link |
01:09:28.620
but I think Garry is the greatest chess player
link |
01:09:31.100
of all time, probably one of the greatest game
link |
01:09:34.700
players of all time.
link |
01:09:36.540
And you sort of at the center of creating
link |
01:09:41.700
a system that beats one of the greatest players of all time.
link |
01:09:45.300
So what is that conversation like?
link |
01:09:46.740
Is there anything, any interesting digs, any bets,
link |
01:09:50.420
any funny things, any profound things?
link |
01:09:53.660
So Garry Kasparov has an incredible respect
link |
01:09:58.220
for what we did with AlphaGo.
link |
01:10:01.140
And it's an amazing tribute coming from him of all people
link |
01:10:07.540
that he really appreciates and respects what we've done.
link |
01:10:11.780
And I think he feels that the progress which has happened
link |
01:10:14.580
in computer chess goes further. Later, after AlphaGo,
link |
01:10:19.100
we built the AlphaZero system, which
link |
01:10:23.060
defeated the world's strongest chess programs.
link |
01:10:26.700
And to Garry Kasparov, that moment in computer chess
link |
01:10:29.860
was more profound than Deep Blue.
link |
01:10:32.980
And the reason he believes it mattered more
link |
01:10:35.660
was because it was done with learning
link |
01:10:37.620
and a system which was able to discover for itself
link |
01:10:39.940
new principles, new ideas, which were
link |
01:10:42.620
able to play the game in a way which he hadn't
link |
01:10:47.740
known about, nor had anyone.
link |
01:10:50.180
And in fact, one of the things I discovered at this panel
link |
01:10:53.180
was that the current world champion, Magnus Carlsen,
link |
01:10:56.500
apparently recently commented on his improvement
link |
01:11:00.460
in performance.
link |
01:11:01.820
And he attributed it to AlphaZero,
link |
01:11:03.860
that he's been studying the games of AlphaZero.
link |
01:11:05.860
And he's changed his style to play more like AlphaZero.
link |
01:11:08.700
And it's led to him actually increasing his rating
link |
01:11:13.820
to a new peak.
link |
01:11:15.100
Yeah, I guess to me, just like to Garry,
link |
01:11:18.420
the inspiring thing is that, and just like you said,
link |
01:11:21.340
with reinforcement learning, reinforcement learning
link |
01:11:25.140
and deep learning, machine learning
link |
01:11:26.940
feels like what intelligence is.
link |
01:11:29.540
And you could attribute it to a bitter viewpoint
link |
01:11:35.900
from Garry's perspective, from our human perspective,
link |
01:11:39.500
saying that pure search that IBM Deep Blue was doing
link |
01:11:43.740
is not really intelligence, but somehow it didn't feel like it.
link |
01:11:47.780
And so that's the magical.
link |
01:11:49.100
I'm not sure what it is about learning that
link |
01:11:50.900
feels like intelligence, but it does.
link |
01:11:54.620
So I think we should not demean the achievements of what
link |
01:11:58.220
was done in previous eras of AI.
link |
01:12:00.060
I think that Deep Blue was an amazing achievement in itself.
link |
01:12:04.140
And that heuristic search of the kind that was used by Deep
link |
01:12:07.900
Blue had some powerful ideas that were in there,
link |
01:12:11.420
but it also missed some things.
link |
01:12:13.220
So the fact that the evaluation function, the way
link |
01:12:16.860
that the chess position was understood,
link |
01:12:18.620
was created by humans and not by the machine
link |
01:12:22.540
is a limitation, which means that there's
link |
01:12:26.740
a ceiling on how well it can do.
link |
01:12:28.900
But maybe more importantly, it means
link |
01:12:30.900
that the same idea cannot be applied in other domains
link |
01:12:33.740
where we don't have access to the human grandmasters
link |
01:12:38.500
and that ability to encode exactly their knowledge
link |
01:12:41.140
into an evaluation function.
link |
01:12:43.060
And the reality is that the story of AI
link |
01:12:45.060
is that most domains turn out to be of the second type
link |
01:12:48.580
where knowledge is messy, it's hard to extract from experts,
link |
01:12:52.020
or it isn't even available.
link |
01:12:53.940
And so we need to solve problems in a different way.
link |
01:12:59.860
And I think AlphaGo is a step towards solving things
link |
01:13:02.740
in a way which puts learning as a first class citizen
link |
01:13:07.780
and says systems need to understand for themselves
link |
01:13:11.420
how to understand the world, how to judge the value of any action
link |
01:13:19.300
that they might take within that world
link |
01:13:20.780
and any state they might find themselves in.
link |
01:13:22.780
And in order to do that, we make progress towards AI.
link |
01:13:29.060
Yeah, so one of the nice things about taking a learning
link |
01:13:32.980
approach to the game of Go or game playing
link |
01:13:36.540
is that the things you learn, the things you figure out,
link |
01:13:39.380
are actually going to be applicable to other problems
link |
01:13:42.540
that are real world problems.
link |
01:13:44.100
That's ultimately, I mean, there's
link |
01:13:47.060
two really interesting things about AlphaGo.
link |
01:13:49.100
One is the science of it, just the science of learning,
link |
01:13:52.420
the science of intelligence.
link |
01:13:54.540
And then the other is while you're actually
link |
01:13:56.980
figuring out how to build systems that
link |
01:13:59.900
would be potentially applicable in other applications,
link |
01:14:04.140
medical, autonomous vehicles, robotics,
link |
01:14:06.580
I mean, it just opens the door to all kinds of applications.
link |
01:14:10.580
So the next incredible step, really the profound step
link |
01:14:16.340
is probably AlphaGo Zero.
link |
01:14:18.220
I mean, it's arguable.
link |
01:14:20.500
I kind of see them all as the same place.
link |
01:14:22.420
But really, and perhaps you were already
link |
01:14:24.300
thinking that AlphaGo Zero is the natural...
link |
01:14:26.740
It was always going to be the next step.
link |
01:14:29.180
But it's removing the reliance on human expert games
link |
01:14:33.340
for pre training, as you mentioned.
link |
01:14:35.340
So how big of an intellectual leap
link |
01:14:38.260
was this that self play could achieve superhuman level
link |
01:14:43.420
performance on its own?
link |
01:14:45.580
And maybe could you also say, what is self play?
link |
01:14:48.580
You kind of mentioned it a few times.
link |
01:14:51.580
So let me start with self play.
link |
01:14:55.180
So the idea of self play is something
link |
01:14:58.300
which is really about systems learning for themselves,
link |
01:15:01.940
but in the situation where there's more than one agent.
link |
01:15:05.660
And so if you're in a game, and the game
link |
01:15:08.300
is played between two players, then self play
link |
01:15:11.100
is really about understanding that game just
link |
01:15:15.140
by playing games against yourself
link |
01:15:17.540
rather than against any actual real opponent.
link |
01:15:19.940
And so it's a way to kind of discover strategies
link |
01:15:23.860
without having to actually need to go out and play
link |
01:15:27.900
against any particular human player, for example.
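A minimal sketch of that loop in Python, with every name here (new_game, agent.select_move, and the game methods) a hypothetical stand-in rather than DeepMind's actual code:

```python
def self_play_game(agent, new_game):
    """Generate one training game by letting `agent` play both sides."""
    game, history = new_game(), []
    while not game.is_over():
        move = agent.select_move(game)   # the same agent moves for both colors
        history.append((game.state(), move))
        game.play(move)
    outcome = game.winner()              # e.g. +1 if black won, -1 if white won
    # Label every position with the eventual result, so the agent can learn
    # both from the moves it chose and from how the game finally ended.
    return [(state, move, outcome) for state, move in history]
```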
link |
01:15:36.020
The main idea of AlphaZero was really
link |
01:15:38.940
to try and step back from any of the knowledge
link |
01:15:45.300
that we put into the system and ask the question,
link |
01:15:47.820
is it possible to come up with a single elegant principle
link |
01:15:52.980
by which a system can learn for itself all of the knowledge
link |
01:15:57.380
which it requires to play a game such as Go?
link |
01:16:00.780
Importantly, by taking knowledge out,
link |
01:16:03.220
you not only make the system less brittle in the sense
link |
01:16:08.860
that perhaps the knowledge you were putting in
link |
01:16:10.620
was just getting in the way and maybe stopping the system
link |
01:16:13.860
learning for itself, but also you make it more general.
link |
01:16:17.820
The more knowledge you put in, the harder
link |
01:16:20.260
it is for a system to actually be
link |
01:16:23.460
taken out of the system in which it's kind of been designed,
link |
01:16:26.700
and placed in some other system that maybe would need
link |
01:16:29.340
a completely different knowledge base to understand
link |
01:16:31.420
and perform well.
link |
01:16:32.860
And so the real goal here is to strip out all of the knowledge
link |
01:16:36.900
that we put in to the point that we can just plug it
link |
01:16:39.580
into something totally different.
link |
01:16:41.700
And that, to me, is really the promise of AI
link |
01:16:45.260
is that we can have systems such as that which,
link |
01:16:47.700
no matter what the goal is, no matter what goal
link |
01:16:51.540
we set to the system, we can come up
link |
01:16:53.980
with an algorithm which can be placed into that world,
link |
01:16:57.580
into that environment, and can succeed
link |
01:16:59.940
in achieving that goal.
link |
01:17:01.780
And then that, to me, is almost the essence of intelligence
link |
01:17:06.620
if we can achieve that.
link |
01:17:07.980
And so AlphaZero is a step towards that.
link |
01:17:11.340
And it's a step that was taken in the context of two player
link |
01:17:15.300
perfect information games like Go and chess.
link |
01:17:18.820
We also applied it to Japanese chess.
link |
01:17:21.460
So just to clarify, the first step
link |
01:17:23.660
was AlphaGo Zero.
link |
01:17:25.540
The first step was to try and take all of the knowledge out
link |
01:17:29.860
of AlphaGo in such a way that it could
link |
01:17:32.580
play in a fully self discovered way, purely from self play.
link |
01:17:39.620
And to me, the motivation for that
link |
01:17:41.300
was always that we could then plug it into other domains.
link |
01:17:44.980
But we saved that until later.
link |
01:17:48.060
Well, in fact, I mean, just for fun,
link |
01:17:52.860
I could tell you exactly the moment
link |
01:17:54.300
where the idea for AlphaZero occurred to me.
link |
01:17:57.460
Because I think there's maybe a lesson there for researchers
link |
01:18:00.380
who are too deeply embedded in their research
link |
01:18:03.180
and working 24/7 to try and come up with the next idea,
link |
01:18:08.140
which is it actually occurred to me on honeymoon.
link |
01:18:13.660
And I was at my most fully relaxed state,
link |
01:18:17.140
really enjoying myself, and just bing,
link |
01:18:22.900
the algorithm for AlphaZero just appeared in its full form.
link |
01:18:29.860
And this was actually before we played against Lee Sedol.
link |
01:18:33.180
But we just didn't.
link |
01:18:35.780
I think we were so busy trying to make sure
link |
01:18:39.140
we could beat the world champion that it was only later
link |
01:18:43.460
that we had the opportunity to step back and start
link |
01:18:47.420
examining that sort of deeper scientific question of whether
link |
01:18:51.060
this could really work.
link |
01:18:52.340
So nevertheless, so self play is probably
link |
01:18:56.260
one of the most profound ideas that represents, to me at least,
link |
01:19:03.340
artificial intelligence.
link |
01:19:05.500
But the fact that you could use that kind of mechanism
link |
01:19:09.780
to, again, beat world class players,
link |
01:19:13.020
that's very surprising.
link |
01:19:14.860
So to me, it feels like you have to train
link |
01:19:19.180
on a large number of expert games.
link |
01:19:21.300
So was it surprising to you?
link |
01:19:22.740
What was the intuition?
link |
01:19:23.660
Can you sort of think, not necessarily at that time,
link |
01:19:26.540
even now, what's your intuition?
link |
01:19:27.980
Why this thing works so well?
link |
01:19:30.060
Why it's able to learn from scratch?
link |
01:19:31.900
Well, let me first say why we tried it.
link |
01:19:34.580
So we tried it both because I feel
link |
01:19:36.500
that it was the deeper scientific question
link |
01:19:38.540
to be asking to make progress towards AI,
link |
01:19:42.140
and also because, in general, in my research,
link |
01:19:44.980
I don't like to do research on questions for which we already
link |
01:19:49.060
know the likely outcome.
link |
01:19:51.060
I don't see much value in running an experiment where
link |
01:19:53.380
you're 95% confident that you will succeed.
link |
01:19:57.700
And so we could have tried maybe to take AlphaGo and do
link |
01:20:02.260
something which we knew for sure it would succeed on.
link |
01:20:05.060
But much more interesting to me was to try it on the things
link |
01:20:07.620
which we weren't sure about.
link |
01:20:09.460
And one of the big questions on our minds
link |
01:20:12.980
back then was, could you really do this with self play alone?
link |
01:20:16.220
How far could that go?
link |
01:20:17.660
Would it be as strong?
link |
01:20:19.540
And honestly, we weren't sure.
link |
01:20:22.340
It was 50-50, I think.
link |
01:20:25.380
If you'd asked me, I wasn't confident
link |
01:20:27.340
that it could reach the same level as these systems,
link |
01:20:30.660
but it felt like the right question to ask.
link |
01:20:33.860
And even if it had not achieved the same level,
link |
01:20:36.780
I felt that that was an important direction
link |
01:20:41.620
to be studying.
link |
01:20:42.900
And so then, lo and behold, it actually
link |
01:20:48.300
ended up outperforming the previous version of AlphaGo
link |
01:20:52.380
and indeed was able to beat it by 100 games to zero.
link |
01:20:55.940
So what's the intuition as to why?
link |
01:20:59.780
I think the intuition to me is clear,
link |
01:21:02.380
that whenever you have errors in a system, as we did in AlphaGo,
link |
01:21:09.420
AlphaGo suffered from these delusions.
link |
01:21:11.700
Occasionally, it would misunderstand
link |
01:21:13.300
what was going on in a position and misevaluate it.
link |
01:21:15.940
How can you remove all of these errors?
link |
01:21:19.700
Errors arise from many sources.
link |
01:21:21.820
For us, they were arising both starting from the human data,
link |
01:21:25.300
but also from the nature of the search
link |
01:21:27.740
and the nature of the algorithm itself.
link |
01:21:29.780
But the only way to address them in any complex system
link |
01:21:33.180
is to give the system the ability
link |
01:21:36.180
to correct its own errors.
link |
01:21:37.940
It must be able to correct them.
link |
01:21:39.500
It must be able to learn for itself
link |
01:21:41.420
when it's doing something wrong and correct for it.
link |
01:21:44.660
And so it seemed to me that the way to correct delusions
link |
01:21:47.820
was indeed to have more iterations of reinforcement
link |
01:21:51.340
learning, that no matter where you start,
link |
01:21:53.540
you should be able to correct those errors
link |
01:21:55.740
until it gets to play that out and understand,
link |
01:21:58.380
oh, well, I thought that I was going to win in this situation,
link |
01:22:01.420
but then I ended up losing.
link |
01:22:03.220
That suggests that I was misevaluating something.
link |
01:22:05.420
There's a hole in my knowledge, and now the system
link |
01:22:07.620
can correct for itself and understand how to do better.
link |
01:22:11.580
Now, if you take that same idea and trace it back
link |
01:22:14.300
all the way to the beginning, it should
link |
01:22:16.540
be able to take you from no knowledge,
link |
01:22:19.180
from completely random starting point,
link |
01:22:21.740
all the way to the highest levels of knowledge
link |
01:22:24.740
that you can achieve in a domain.
link |
01:22:27.100
And the principle is the same, that if you bestow a system
link |
01:22:30.620
with the ability to correct its own errors,
link |
01:22:33.540
then it can take you from random to something slightly
link |
01:22:36.180
better than random because it sees the stupid things
link |
01:22:39.540
that the random player is doing, and it can correct them.
link |
01:22:41.580
And then it can take you from that slightly better system
link |
01:22:43.940
and understand, well, what's that doing wrong?
link |
01:22:45.900
And it takes you on to the next level and the next level.
link |
01:22:49.300
And this progress can go on indefinitely.
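Schematically, the error-correcting cycle being described looks something like the loop below (a simplified editorial sketch; `self_play_game` and `train_step` are hypothetical stand-ins, not the published training code):

```python
def training_loop(network, num_iterations, games_per_iteration):
    for _ in range(num_iterations):
        data = []
        for _ in range(games_per_iteration):
            # Play with the current network. Wherever it misjudged a position,
            # the game's final outcome contradicts its value estimate.
            data.extend(self_play_game(network))
        # Regressing value estimates toward actual outcomes corrects those
        # errors, yielding a slightly stronger network; repeating the cycle
        # carries you from random play toward ever stronger play.
        network = train_step(network, data)
    return network
```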
link |
01:22:52.980
And indeed, what would have happened
link |
01:22:55.300
if we'd carried on training AlphaGo Zero for longer?
link |
01:22:59.420
We saw no sign of it slowing down its improvements,
link |
01:23:03.340
or at least it was certainly carrying on to improve.
link |
01:23:06.660
And presumably, if you had the computational resources,
link |
01:23:11.060
this could lead to better and better systems
link |
01:23:14.500
that discover more and more.
link |
01:23:15.740
So your intuition is fundamentally
link |
01:23:18.940
there's not a ceiling to this process.
link |
01:23:21.620
One of the surprising things, just like you said,
link |
01:23:24.660
is the process of patching errors.
link |
01:23:27.340
It intuitively makes sense that this is,
link |
01:23:31.060
that reinforcement learning should be part of that process.
link |
01:23:33.580
But what is surprising is in the process
link |
01:23:36.060
of patching your own lack of knowledge,
link |
01:23:39.260
you don't open up other holes.
link |
01:23:41.980
You keep sort of, like there's a monotonic decrease
link |
01:23:46.660
of your weaknesses.
link |
01:23:48.500
Well, let me back this up.
link |
01:23:50.140
I think science always should make falsifiable hypotheses.
link |
01:23:53.780
So let me back up this claim with a falsifiable hypothesis,
link |
01:23:57.060
which is that if someone was to, in the future,
link |
01:23:59.780
take AlphaZero as an algorithm
link |
01:24:02.380
and run it with greater computational resources
link |
01:24:07.460
than we have available today,
link |
01:24:10.580
then I would predict that they would be able
link |
01:24:12.860
to beat the previous system 100 games to zero.
link |
01:24:15.380
And that if they were then to do the same thing
link |
01:24:17.260
a couple of years later,
link |
01:24:19.260
that that would beat that previous system 100 games to zero,
link |
01:24:22.100
and that that process would continue indefinitely
link |
01:24:25.180
throughout at least my human lifetime.
link |
01:24:27.580
Presumably the game of Go would set the ceiling.
link |
01:24:31.020
I mean.
link |
01:24:31.860
The game of Go would set the ceiling,
link |
01:24:33.220
but the game of Go has 10 to the 170 states in it.
link |
01:24:35.980
So the ceiling is unreachable by any computational device
link |
01:24:40.420
that can be built out of the 10 to the 80 atoms
link |
01:24:44.540
in the universe.
link |
01:24:46.620
You asked a really good question,
link |
01:24:47.900
which is, do you not open up other errors
link |
01:24:51.180
when you correct your previous ones?
link |
01:24:53.660
And the answer is yes, you do.
link |
01:24:56.180
And so it's a remarkable fact
link |
01:24:58.660
about this class of two player game
link |
01:25:02.260
and also true of single agent games
link |
01:25:05.220
that essentially progress will always lead you to,
link |
01:25:11.780
if you have sufficient representational resource,
link |
01:25:15.100
like imagine you
link |
01:25:16.620
could represent every state in a big table of the game,
link |
01:25:20.180
then we know for sure that a process of self improvement
link |
01:25:24.060
will lead all the way in the single agent case
link |
01:25:27.140
to the optimal possible behavior,
link |
01:25:29.100
and in the two player case to the minimax optimal behavior.
link |
01:25:31.820
And that is the best way that I can play
link |
01:25:35.300
knowing that you're playing perfectly against me.
link |
01:25:38.020
And so for those cases,
link |
01:25:39.780
we know that even if you do open up some new error,
link |
01:25:44.700
that in some sense you've made progress.
link |
01:25:46.940
You're progressing towards the best that can be done.
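That tabular claim can be seen in miniature in a toy game. Here is a tiny, self-contained example (an editorial illustration): in a simple Nim variant where players alternately take one to three stones and taking the last stone wins, plain recursion over the full state table computes the minimax-optimal value of every position.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def minimax_value(stones):
    """+1 if the player to move wins with perfect play, -1 otherwise."""
    if stones == 0:
        return -1   # the previous player took the last stone and won
    # Try every legal move; my value is the best of the negated replies.
    return max(-minimax_value(stones - take)
               for take in (1, 2, 3) if take <= stones)

print([minimax_value(n) for n in range(1, 9)])
# -> [1, 1, 1, -1, 1, 1, 1, -1]: multiples of 4 are lost for the mover.
```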
link |
01:25:50.460
So AlphaGo was initially trained on expert games
link |
01:25:55.220
with some self play.
link |
01:25:56.460
AlphaGo Zero removed the need to be trained on expert games.
link |
01:26:00.220
And then another incredible step for me,
link |
01:26:03.980
because I just love chess,
link |
01:26:05.740
is to generalize that further, in AlphaZero,
link |
01:26:09.500
to be able to play the game of Go,
link |
01:26:12.220
beating AlphaGo Zero and AlphaGo,
link |
01:26:14.620
and then also being able to play the game of chess
link |
01:26:18.140
and others.
link |
01:26:19.140
So what was that step like?
link |
01:26:20.980
What are the interesting aspects there
link |
01:26:23.580
that were required to make that happen?
link |
01:26:26.660
I think the remarkable observation,
link |
01:26:29.940
which we saw with AlphaZero,
link |
01:26:31.980
was that actually without modifying the algorithm at all,
link |
01:26:35.740
it was able to play and crack
link |
01:26:37.500
some of AI's greatest previous challenges.
link |
01:26:41.300
In particular, we dropped it into the game of chess.
link |
01:26:44.780
And unlike the previous systems like Deep Blue,
link |
01:26:47.180
which had been worked on for years and years,
link |
01:26:50.420
we were able to beat
link |
01:26:52.660
the world's strongest computer chess program convincingly
link |
01:26:57.300
using a system whose knowledge was fully discovered
link |
01:27:00.940
from scratch with its own principles.
link |
01:27:04.940
And in fact, one of the nice things that we found
link |
01:27:08.180
was that we also achieved the same result
link |
01:27:11.540
in Japanese chess, a variant of chess
link |
01:27:13.500
where you get to capture pieces
link |
01:27:15.180
and then place them back down on your own side
link |
01:27:17.660
as an extra piece.
link |
01:27:18.980
So a much more complicated variant of chess.
link |
01:27:21.860
And we also beat the world's strongest programs
link |
01:27:24.780
and reached superhuman performance in that game too.
link |
01:27:28.020
And the very first time that we'd ever run the system
link |
01:27:32.100
on that particular game,
link |
01:27:34.460
was the version that we published
link |
01:27:35.860
in the paper on AlphaZero.
link |
01:27:38.700
It just worked out of the box, literally, no touching it.
link |
01:27:41.700
We didn't have to do anything.
link |
01:27:42.860
And there it was, superhuman performance,
link |
01:27:45.260
no tweaking, no twiddling.
link |
01:27:47.860
And so I think there's something beautiful
link |
01:27:49.540
about that principle that you can take an algorithm
link |
01:27:52.980
and without twiddling anything, it just works.
link |
01:27:57.700
Now, to go beyond AlphaZero, what's required?
link |
01:28:02.740
AlphaZero is just a step.
link |
01:28:05.460
And there's a long way to go beyond that
link |
01:28:06.940
to really crack the deep problems of AI.
link |
01:28:10.980
But one of the important steps is to acknowledge
link |
01:28:13.500
that the world is a really messy place.
link |
01:28:16.260
It's this rich, complex, beautiful,
link |
01:28:18.500
but messy environment that we live in.
link |
01:28:21.980
And no one gives us the rules.
link |
01:28:23.460
Like no one knows the rules of the world.
link |
01:28:26.140
At least maybe we understand that it operates
link |
01:28:28.500
according to Newtonian or quantum mechanics
link |
01:28:31.180
at the micro level or according to relativity
link |
01:28:34.020
at the macro level.
link |
01:28:35.140
But that's not a model that's useful for us as people
link |
01:28:38.420
to operate in it.
link |
01:28:40.220
Somehow the agent needs to understand the world for itself
link |
01:28:43.780
in a way where no one tells it the rules of the game.
link |
01:28:46.300
And yet it can still figure out what to do in that world,
link |
01:28:50.860
deal with this stream of observations coming in,
link |
01:28:53.580
rich sensory input coming in,
link |
01:28:55.300
actions going out in a way that allows it to reason
link |
01:28:58.300
in the way that AlphaGo or AlphaZero can reason
link |
01:29:01.460
in the way that these Go and chess playing programs
link |
01:29:03.660
can reason.
link |
01:29:04.820
But in a way that allows it to take actions
link |
01:29:07.780
in that messy world to achieve its goals.
link |
01:29:11.500
And so this led us to the most recent step
link |
01:29:15.260
in the story of AlphaGo,
link |
01:29:17.460
which was a system called MuZero.
link |
01:29:19.500
And MuZero is a system which learns for itself
link |
01:29:23.380
even when the rules are not given to it.
link |
01:29:25.420
It actually can be dropped into a system
link |
01:29:28.180
with messy perceptual inputs.
link |
01:29:29.700
We actually tried it in some Atari games,
link |
01:29:33.860
the canonical domains of Atari
link |
01:29:36.540
that have been used for reinforcement learning.
link |
01:29:38.540
And this system learned to build a model
link |
01:29:42.900
of these Atari games that was sufficiently rich
link |
01:29:46.940
and useful enough for it to be able to plan successfully.
link |
01:29:51.380
And in fact, that system not only went on
link |
01:29:53.500
to beat the state of the art in Atari,
link |
01:29:56.660
but the same system without modification
link |
01:29:59.300
was able to reach the same level of superhuman performance
link |
01:30:02.980
in Go, chess, and shogi that we'd seen in AlphaZero,
link |
01:30:06.900
showing that even without the rules,
link |
01:30:08.700
the system can learn for itself just by trial and error,
link |
01:30:11.100
just by playing this game of Go.
link |
01:30:13.100
And no one tells you what the rules are,
link |
01:30:15.020
but you just get to the end and someone says win or loss.
link |
01:30:19.580
You play this game of chess and someone says win or loss,
link |
01:30:22.020
or you play a game of breakout in Atari
link |
01:30:25.540
and someone just tells you your score at the end.
link |
01:30:28.020
And the system for itself figures out
link |
01:30:30.580
essentially the rules of the system,
link |
01:30:31.900
the dynamics of the world, how the world works.
link |
01:30:35.180
And not in any explicit way, but just implicitly,
link |
01:30:39.580
enough understanding for it to be able to plan
link |
01:30:41.820
in that system in order to achieve its goals.
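A minimal sketch of that idea follows, with toy numpy linear maps standing in for the deep networks (the sizes, names, and random weights are illustrative assumptions; the real MuZero trains these functions end to end and plans over them with Monte Carlo tree search): a representation function encodes observations into a latent state, a dynamics function rolls that state forward under imagined actions, and a prediction function outputs the policy and value used for planning, so the real rules are never consulted.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS, HID, N_ACT = 8, 16, 4                            # toy sizes (assumptions)
W_h = rng.normal(scale=0.1, size=(HID, OBS))          # h: observation -> state
W_g = rng.normal(scale=0.1, size=(HID, HID + N_ACT))  # g: state, action -> state
W_r = rng.normal(scale=0.1, size=(HID + N_ACT,))      # reward head of g
W_p = rng.normal(scale=0.1, size=(N_ACT, HID))        # f: state -> policy logits
W_v = rng.normal(scale=0.1, size=(HID,))              # f: state -> value

def represent(obs):
    """h(o): encode a raw observation into a latent state."""
    return np.tanh(W_h @ obs)

def dynamics(state, action):
    """g(s, a): predict the next latent state and immediate reward."""
    x = np.concatenate([state, np.eye(N_ACT)[action]])
    return np.tanh(W_g @ x), float(W_r @ x)

def predict(state):
    """f(s): policy and value estimates that guide the search."""
    logits = W_p @ state
    policy = np.exp(logits) / np.exp(logits).sum()
    return policy, float(W_v @ state)

# Planning happens entirely in latent space: encode one observation,
# then unroll imagined actions without ever touching the real rules.
s = represent(rng.normal(size=OBS))
for a in (0, 2, 1):                                   # an imagined action sequence
    policy, value = predict(s)
    s, r = dynamics(s, a)
    print(f"action={a}  predicted_reward={r:+.3f}  value={value:+.3f}")
```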
link |
01:30:45.460
And that's the fundamental process
link |
01:30:48.020
that you have to go through when you're facing
link |
01:30:49.660
any uncertain kind of environment
link |
01:30:51.500
as you would in the real world,
link |
01:30:53.180
is figuring out the sort of rules,
link |
01:30:55.060
the basic rules of the game.
link |
01:30:56.540
That's right.
link |
01:30:57.380
So that allows it to be applicable
link |
01:31:00.620
to basically any domain that could be digitized
link |
01:31:05.860
in the way that it needs to be in order to be consumable,
link |
01:31:10.020
sort of in order for the reinforcement learning framework
link |
01:31:12.140
to be able to sense the environment,
link |
01:31:13.700
to be able to act in the environment and so on.
link |
01:31:15.540
The full reinforcement learning problem
link |
01:31:16.980
needs to deal with worlds that are unknown and complex
link |
01:31:21.300
and the agent needs to learn for itself
link |
01:31:23.700
how to deal with that.
link |
01:31:24.820
And so MuZero is a further step in that direction.
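In code terms, "digitizing a domain" for reinforcement learning amounts to exposing an interface like the following sketch (the Gym-style method names are a common convention assumed here, not something specified in the conversation): observations come in, actions go out, and a scalar reward signals progress.

```python
from typing import Any, Tuple

class Environment:
    def reset(self) -> Any:
        """Return the first observation of an episode."""
        raise NotImplementedError

    def step(self, action: int) -> Tuple[Any, float, bool]:
        """Apply an action; return (observation, reward, done)."""
        raise NotImplementedError

def run_episode(env: Environment, policy) -> float:
    """Any domain exposing this loop is fair game for RL."""
    obs, total, done = env.reset(), 0.0, False
    while not done:
        obs, reward, done = env.step(policy(obs))
        total += reward
    return total
```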
link |
01:31:29.460
One of the things that inspired the general public
link |
01:31:32.180
and just in conversations I have like with my parents
link |
01:31:34.540
or something, with my mom, who just loves what was done
link |
01:31:38.300
is kind of at least the notion
link |
01:31:40.340
that there was some display of creativity,
link |
01:31:42.140
some new strategies, new behaviors that were created.
link |
01:31:45.860
That again has echoes of intelligence.
link |
01:31:48.900
So is there something that stands out?
link |
01:31:50.780
Do you see it the same way that there's creativity
link |
01:31:52.940
and there's some behaviors, patterns that you saw
link |
01:31:57.220
that AlphaZero was able to display that are truly creative?
link |
01:32:01.820
So let me start by saying that I think we should ask
link |
01:32:06.660
what creativity really means.
link |
01:32:08.260
So to me, creativity means discovering something
link |
01:32:13.820
which wasn't known before, something unexpected,
link |
01:32:16.860
something outside of our norms.
link |
01:32:19.700
And so in that sense, the process of reinforcement learning
link |
01:32:24.700
or the self play approach that was used by AlphaZero
link |
01:32:29.460
is the essence of creativity.
link |
01:32:31.780
It's really saying at every stage,
link |
01:32:34.180
you're playing according to your current norms
link |
01:32:36.500
and you try something and if it works out,
link |
01:32:39.980
you say, hey, here's something great,
link |
01:32:42.980
I'm gonna start using that.
link |
01:32:44.580
And then that process, it's like a micro discovery
link |
01:32:47.180
that happens millions and millions of times
link |
01:32:49.580
over the course of the algorithm's life
link |
01:32:51.580
where it just discovers some new idea,
link |
01:32:54.180
oh, this pattern, this pattern's working really well for me,
link |
01:32:56.500
I'm gonna start using that.
link |
01:32:58.300
And now, oh, here's this other thing I can do,
link |
01:33:00.420
I can start to connect these stones together in this way
link |
01:33:03.740
or I can start to sacrifice stones or give up on pieces
link |
01:33:08.660
or play shoulder hits on the fifth line or whatever it is.
link |
01:33:12.060
The system's discovering things like this for itself
link |
01:33:13.940
continually, repeatedly, all the time.
link |
01:33:16.740
And so it should come as no surprise to us then
link |
01:33:19.580
if you leave these systems going,
link |
01:33:21.740
that they discover things that are not known to humans,
link |
01:33:25.740
that to the human norms are considered creative.
link |
01:33:30.580
And we've seen this several times.
link |
01:33:32.900
In fact, in AlphaGo Zero,
link |
01:33:35.700
we saw this beautiful timeline of discovery
link |
01:33:39.220
where what we saw was that there are these opening patterns
link |
01:33:44.020
that humans play, called joseki;
link |
01:33:45.500
these are like the patterns that humans learn
link |
01:33:47.820
to play in the corners and they've been developed
link |
01:33:49.660
and refined over literally thousands of years
link |
01:33:51.940
in the game of Go.
link |
01:33:53.220
And what we saw was in the course of the training,
link |
01:33:57.220
AlphaGo Zero, over the course of the 40 days
link |
01:34:00.100
that we trained this system,
link |
01:34:01.900
it starts to discover exactly these patterns
link |
01:34:05.620
that human players play.
link |
01:34:06.980
And over time, we found that all of the joseki
link |
01:34:10.180
that humans played were discovered by the system
link |
01:34:13.180
through this process of self play
link |
01:34:15.620
and this sort of essential notion of creativity.
link |
01:34:19.660
But what was really interesting was that over time,
link |
01:34:22.500
it then starts to discard some of these
link |
01:34:24.900
in favor of its own joseki that humans didn't know about.
link |
01:34:28.220
And it starts to say, oh, well,
link |
01:34:29.540
you thought that the knight's move pincer joseki
link |
01:34:33.020
was a great idea,
link |
01:34:35.060
but here's something different you can do there
link |
01:34:37.060
which makes some new variation
link |
01:34:38.740
that humans didn't know about.
link |
01:34:40.380
And actually now the human Go players
link |
01:34:42.420
study the joseki that AlphaGo played
link |
01:34:44.660
and they become the new norms
link |
01:34:46.580
that are used in today's top level Go competitions.
link |
01:34:51.260
That never gets old.
link |
01:34:52.540
Even just the first, to me,
link |
01:34:54.740
maybe just makes me feel good as a human being
link |
01:34:58.300
that a self play mechanism that knows nothing about us humans
link |
01:35:01.900
discovers patterns that we humans do.
link |
01:35:04.540
That's just like an affirmation
link |
01:35:06.340
that we're doing okay as humans.
link |
01:35:08.420
Yeah.
link |
01:35:09.260
In this domain and other domains,
link |
01:35:12.540
we figured out it's like the Churchill quote
link |
01:35:14.820
about democracy.
link |
01:35:15.780
It's the, you know, it sucks,
link |
01:35:18.380
but it's the best one we've tried.
link |
01:35:20.260
So in general, taking a step outside of Go
link |
01:35:24.460
and you have like a million accomplishments
link |
01:35:27.180
that I have no time to talk about
link |
01:35:29.540
with AlphaStar and so on and the current work.
link |
01:35:32.860
But in general, this self play mechanism
link |
01:35:36.660
that you've inspired the world with
link |
01:35:38.180
by beating the world champion Go player.
link |
01:35:40.620
Do you see that as,
link |
01:35:43.820
do you see it being applied in other domains?
link |
01:35:47.180
Do you have sort of dreams and hopes
link |
01:35:50.620
that it's applied in both the simulated environments
link |
01:35:53.780
and the constrained environments of games?
link |
01:35:56.380
Constrained, I mean, AlphaStar really demonstrates
link |
01:35:59.020
that you can remove a lot of the constraints,
link |
01:36:00.500
but nevertheless, it's in a digital simulated environment.
link |
01:36:04.100
Do you have a hope, a dream that it starts being applied
link |
01:36:07.220
in the robotics environment?
link |
01:36:09.100
And maybe even in domains that are safety critical
link |
01:36:12.940
and so on and have, you know,
link |
01:36:15.180
have a real impact in the real world,
link |
01:36:16.580
like autonomous vehicles, for example,
link |
01:36:18.260
which seems like a very far out dream at this point.
link |
01:36:21.140
So I absolutely do hope and imagine
link |
01:36:25.540
that we will get to the point where ideas
link |
01:36:27.980
just like these are used in all kinds of different domains.
link |
01:36:31.140
In fact, one of the most satisfying things
link |
01:36:32.700
as a researcher is when you start to see other people
link |
01:36:35.340
use your algorithms in unexpected ways.
link |
01:36:39.100
So in the last couple of years, there have been,
link |
01:36:41.060
you know, a couple of nature papers
link |
01:36:43.180
where different teams, unbeknownst to us,
link |
01:36:47.140
took AlphaZero and applied exactly those same algorithms
link |
01:36:51.980
and ideas to real world problems of huge meaning to society.
link |
01:36:57.580
So one of them was the problem of chemical synthesis,
link |
01:37:00.980
and they were able to beat the state of the art
link |
01:37:02.940
in finding pathways of how to actually synthesize chemicals,
link |
01:37:08.700
retrosynthesis.
link |
01:37:11.980
And the second paper actually just came out
link |
01:37:14.060
a couple of weeks ago in Nature,
link |
01:37:16.620
showed that in quantum computation,
link |
01:37:19.500
you know, one of the big questions is how to understand
link |
01:37:22.740
the nature of the wave function in quantum computation
link |
01:37:27.660
and a system based on AlphaZero beat the state of the art
link |
01:37:30.340
by quite some distance there again.
link |
01:37:32.380
So these are just examples.
link |
01:37:34.060
And I think, you know, the lesson,
link |
01:37:36.300
which we've seen elsewhere in machine learning
link |
01:37:38.500
time and time again, is that if you make something general,
link |
01:37:42.620
it will be used in all kinds of ways.
link |
01:37:44.140
You know, you provide a really powerful tool to society,
link |
01:37:47.340
and those tools can be used in amazing ways.
link |
01:37:51.700
And so I think we're just at the beginning,
link |
01:37:53.580
and for sure, I hope that we see all kinds of outcomes.
link |
01:37:58.900
So the other side of the question of the reinforcement
link |
01:38:03.340
learning framework is, you know,
link |
01:38:05.540
you usually want to specify a reward function
link |
01:38:07.620
and an objective function.
link |
01:38:11.180
What do you think about sort of ideas of intrinsic rewards
link |
01:38:13.780
of when we're not really sure about, you know,
link |
01:38:19.260
if we take, you know, human beings as an existence proof
link |
01:38:23.660
that we don't seem to be operating
link |
01:38:25.820
according to a single reward,
link |
01:38:27.820
do you think that there are interesting ideas
link |
01:38:32.100
for when you don't know how to truly specify the reward,
link |
01:38:35.540
you know, that there's some flexibility
link |
01:38:38.140
for discovering it intrinsically or so on
link |
01:38:40.620
in the context of reinforcement learning?
link |
01:38:42.700
So I think, you know, when we think about intelligence,
link |
01:38:45.020
it's really important to be clear
link |
01:38:46.740
about the problem of intelligence.
link |
01:38:48.380
And I think it's clearest to understand that problem
link |
01:38:51.180
in terms of some ultimate goal
link |
01:38:52.660
that we want the system to try and solve for.
link |
01:38:55.340
And after all, if we don't understand the ultimate purpose
link |
01:38:57.900
of the system, do we really even have
link |
01:39:00.860
a clearly defined problem that we're solving at all?
link |
01:39:04.340
Now, within that, as with your example for humans,
link |
01:39:10.380
the system may choose to create its own motivations
link |
01:39:13.980
and subgoals that help the system
link |
01:39:16.340
to achieve its ultimate goal.
link |
01:39:19.060
And that may indeed be a hugely important mechanism
link |
01:39:22.380
to achieve those ultimate goals,
link |
01:39:23.820
but there is still some ultimate goal
link |
01:39:25.500
that I think the system needs to be measured
link |
01:39:27.060
and evaluated against.
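One common way to reconcile the two, sketched below purely as an illustration (the count-based novelty bonus and its weight are assumptions, not something David specifies): let the agent train on the extrinsic reward plus a decaying intrinsic bonus, while the ultimate goal it is evaluated against stays the unmodified extrinsic reward.

```python
from collections import defaultdict

visit_counts = defaultdict(int)
BETA = 0.5  # weight on the intrinsic bonus (arbitrary assumption)

def training_reward(state, extrinsic_reward):
    """Shaped signal used only while learning: extrinsic + novelty bonus."""
    visit_counts[state] += 1
    bonus = BETA / (visit_counts[state] ** 0.5)  # decays as states repeat
    return extrinsic_reward + bonus

def evaluation_reward(state, extrinsic_reward):
    """The ultimate goal the system is measured against stays fixed."""
    return extrinsic_reward
```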
link |
01:39:29.660
And even for humans, I mean, humans,
link |
01:39:31.380
we're incredibly flexible.
link |
01:39:32.420
We feel that, you know, any goal that we're given,
link |
01:39:35.180
we can master to some degree.
link |
01:39:40.220
But if we think of those goals, really, you know,
link |
01:39:41.860
like the goal of being able to pick up an object
link |
01:39:44.860
or the goal of being able to communicate
link |
01:39:47.180
or influence people to do things in a particular way
link |
01:39:50.980
or whatever those goals are, they're really subgoals
link |
01:39:56.940
that we set ourselves.
link |
01:39:58.580
You know, we choose to pick up the object.
link |
01:40:00.900
We choose to communicate.
link |
01:40:02.100
We choose to influence someone else.
link |
01:40:05.340
And we choose those because we think it will lead us
link |
01:40:07.660
to something later on.
link |
01:40:10.460
We think that's helpful to us to achieve some ultimate goal.
link |
01:40:15.100
Now, I don't want to speculate whether or not humans
link |
01:40:18.260
as a system necessarily have a singular overall goal
link |
01:40:20.900
of survival or whatever it is.
link |
01:40:23.540
But I think the principle for understanding
link |
01:40:25.660
and implementing intelligence has to be
link |
01:40:28.140
that if we're trying to understand intelligence
link |
01:40:30.100
or implement our own,
link |
01:40:31.420
there has to be a well defined problem.
link |
01:40:33.180
Otherwise, I think it's like an admission
link |
01:40:37.500
of defeat, that for there to be hope for understanding
link |
01:40:41.500
or implementing intelligence, we have to know what we're doing.
link |
01:40:44.060
We have to know what we're asking the system to do.
link |
01:40:46.420
Otherwise, if you don't have a clearly defined purpose,
link |
01:40:48.860
you're not going to get a clearly defined answer.
link |
01:40:51.620
The ridiculous big question that has to naturally follow,
link |
01:40:56.420
because I have to pin you down on this thing,
link |
01:41:00.820
is that, nevertheless, one of the big silly
link |
01:41:03.340
or big real questions before humans, the meaning of life,
link |
01:41:08.060
is us trying to figure out our own reward function.
link |
01:41:11.180
And you just kind of mentioned that if you want to build
link |
01:41:13.300
intelligent systems and you know what you're doing,
link |
01:41:16.260
you should be at least cognizant to some degree
link |
01:41:18.380
of what the reward function is.
link |
01:41:20.300
So the natural question is what do you think
link |
01:41:23.700
is the reward function of human life,
link |
01:41:26.260
the meaning of life for us humans,
link |
01:41:29.260
the meaning of our existence?
link |
01:41:32.980
I think I'd be speculating beyond my own expertise,
link |
01:41:36.620
but just for fun, let me do that.
link |
01:41:38.460
Yes, please.
link |
01:41:39.420
And say, I think that there are many levels
link |
01:41:41.180
at which you can understand a system
link |
01:41:43.020
and you can understand something as optimizing
link |
01:41:46.420
for a goal at many levels.
link |
01:41:48.900
And so you can understand the,
link |
01:41:52.540
let's start with the universe.
link |
01:41:54.500
Does the universe have a purpose?
link |
01:41:55.780
Well, it feels like, at one level, it's
link |
01:41:58.100
just following certain mechanical laws of physics
link |
01:42:02.340
and that's led to the development of the universe.
link |
01:42:04.620
But at another level, you can view it as actually,
link |
01:42:08.500
there's the second law of thermodynamics that says
link |
01:42:10.300
that the universe is increasing in entropy over time, forever.
link |
01:42:13.340
And now there's a view that's been developed
link |
01:42:15.380
by certain people at MIT that
link |
01:42:17.900
you can think of this as almost like a goal of the universe,
link |
01:42:20.660
that the purpose of the universe is to maximize entropy.
link |
01:42:24.900
So there are multiple levels
link |
01:42:26.060
at which you can understand a system.
link |
01:42:28.820
The next level down, you might say,
link |
01:42:30.660
well, if the goal is to maximize entropy,
link |
01:42:34.060
well, how can that be done by a particular system?
link |
01:42:40.020
And maybe evolution is something that the universe
link |
01:42:42.780
discovered in order to kind of dissipate energy
link |
01:42:45.900
as efficiently as possible.
link |
01:42:48.060
And by the way, I'm borrowing from Max Tegmark
link |
01:42:49.940
for some of these metaphors, the physicist.
link |
01:42:53.900
But if you can think of evolution
link |
01:42:55.460
as a mechanism for dispersing energy,
link |
01:42:59.380
then evolution, you might say, becomes a goal in itself.
link |
01:43:04.180
If evolution disperses energy
link |
01:43:06.620
by reproducing as efficiently as possible,
link |
01:43:09.340
what's evolution then?
link |
01:43:10.580
Well, it's now got its own goal within that,
link |
01:43:13.700
which is to actually reproduce as effectively as possible.
link |
01:43:19.300
And now how does reproduction,
link |
01:43:22.260
how is that made as effective as possible?
link |
01:43:25.020
Well, you need entities within that
link |
01:43:27.580
that can survive and reproduce as effectively as possible.
link |
01:43:29.900
And so it's natural that in order to achieve
link |
01:43:31.620
that high level goal, those individual organisms
link |
01:43:33.860
discover brains, intelligences,
link |
01:43:37.700
which enable them to support the goals of evolution.
link |
01:43:43.220
And those brains, what do they do?
link |
01:43:45.340
Well, perhaps the early brains,
link |
01:43:47.820
maybe they were controlling things at some direct level.
link |
01:43:51.980
Maybe they were the equivalent of preprogrammed systems,
link |
01:43:54.220
which were directly controlling what was going on
link |
01:43:57.540
and setting certain things in order
link |
01:43:59.940
to achieve these particular goals.
link |
01:44:03.060
But that led to another level of discovery,
link |
01:44:05.940
which was learning systems.
link |
01:44:07.260
There are parts of the brain
link |
01:44:08.100
which are able to learn for themselves
link |
01:44:10.140
and learn how to program themselves to achieve any goal.
link |
01:44:13.460
And presumably there are parts of the brain
link |
01:44:16.580
where goals are set to parts of that system
link |
01:44:20.340
and that provides this very flexible notion of intelligence
link |
01:44:23.020
that we as humans presumably have,
link |
01:44:25.020
which is kind of
link |
01:44:26.820
the reason we feel that we can achieve any goal.
link |
01:44:30.020
So it's a very long winded answer to say that,
link |
01:44:32.980
I think there are many perspectives
link |
01:44:34.700
and many levels at which intelligence can be understood.
link |
01:44:38.620
And at each of those levels,
link |
01:44:40.460
you can take multiple perspectives.
link |
01:44:42.220
You can view the system as something
link |
01:44:43.940
which is optimizing for a goal,
link |
01:44:45.420
which is understanding it at a level
link |
01:44:47.820
by which we can maybe implement it
link |
01:44:49.500
and understand it as AI researchers or computer scientists,
link |
01:44:53.340
or you can understand it at the level
link |
01:44:54.780
of the mechanistic thing which is going on,
link |
01:44:56.420
that there are these atoms bouncing around in the brain
link |
01:44:58.780
and they lead to the outcome of that system.
link |
01:45:01.380
And that's not in contradiction with the fact
link |
01:45:02.940
that it's also a decision making system
link |
01:45:07.100
that's optimizing for some goal and purpose.
link |
01:45:10.140
I've never heard the description of the meaning of life
link |
01:45:14.380
structured so beautifully in layers,
link |
01:45:16.860
but you did miss one layer, which is the next step,
link |
01:45:19.860
which you're responsible for,
link |
01:45:21.740
which is creating the artificial intelligence layer
link |
01:45:27.420
on top of that.
link |
01:45:28.260
And I can't wait to see, well, I may not be around,
link |
01:45:31.740
but I can't wait to see what the next layer beyond that will be.
link |
01:45:36.860
Well, let's just take that argument
link |
01:45:39.260
and pursue it to its natural conclusion.
link |
01:45:41.300
So the next level indeed is: how can our learning brain
link |
01:45:46.860
achieve its goals most effectively?
link |
01:45:49.180
Well, maybe it does so by us as learning beings
link |
01:45:56.180
building a system which is able to solve for those goals
link |
01:46:00.180
more effectively than we can.
link |
01:46:02.180
And so when we build a system to play the game of Go,
link |
01:46:05.140
when I said that I wanted to build a system
link |
01:46:06.940
that can play Go better than I can,
link |
01:46:08.740
I've enabled myself to achieve that goal of playing Go
link |
01:46:12.180
better than I could by directly playing it
link |
01:46:14.500
and learning it myself.
link |
01:46:15.820
And so now a new layer has been created,
link |
01:46:18.740
which is systems which are able to achieve goals
link |
01:46:21.260
for themselves.
link |
01:46:22.620
And ultimately there may be layers beyond that
link |
01:46:25.060
where they set sub goals to parts of their own system
link |
01:46:28.500
in order to achieve those and so forth.
link |
01:46:32.980
So the story of intelligence, I think,
link |
01:46:36.700
is a multi layered one and a multi perspective one.
link |
01:46:39.980
We live in an incredible universe.
link |
01:46:41.980
David, thank you so much, first of all,
link |
01:46:43.980
for dreaming of using learning to solve Go
link |
01:46:47.900
and building intelligent systems
link |
01:46:50.100
and for actually making it happen
link |
01:46:52.260
and for inspiring millions of people in the process.
link |
01:46:56.100
It's truly an honor.
link |
01:46:57.060
Thank you so much for talking today.
link |
01:46:58.500
Okay, thank you.
link |
01:46:59.940
Thanks for listening to this conversation
link |
01:47:01.300
with David Silver and thank you to our sponsors,
link |
01:47:04.060
Masterclass and Cash App.
link |
01:47:05.980
Please consider supporting the podcast
link |
01:47:07.740
by signing up to Masterclass at masterclass.com slash Lex
link |
01:47:12.100
and downloading Cash App and using code LexPodcast.
link |
01:47:15.740
If you enjoy this podcast, subscribe on YouTube,
link |
01:47:18.020
review it with five stars on Apple Podcast,
link |
01:47:20.260
support it on Patreon,
link |
01:47:21.420
or simply connect with me on Twitter at LexFriedman.
link |
01:47:25.260
And now let me leave you with some words from David Silver.
link |
01:47:28.700
My personal belief is that we've seen something
link |
01:47:31.300
of a turning point where we're starting to understand
link |
01:47:34.460
that many abilities like intuition and creativity
link |
01:47:38.180
that we've previously thought were in the domain only
link |
01:47:40.820
of the human mind are actually accessible
link |
01:47:43.340
to machine intelligence as well.
link |
01:47:45.500
And I think that's a really exciting moment in history.
link |
01:47:48.220
Thank you for listening and hope to see you next time.