
David Silver: AlphaGo, AlphaZero, and Deep Reinforcement Learning | Lex Fridman Podcast #86



link |
00:00:00.000
The following is a conversation with David Silver, who leads the Reinforcement Learning
link |
00:00:04.040
Research Group at DeepMind, and was the lead researcher on AlphaGo and AlphaZero, and co-led
link |
00:00:11.160
the AlphaStar and MuZero efforts, and a lot of important work in reinforcement learning
link |
00:00:15.640
in general.
link |
00:00:17.160
I believe AlphaZero is one of the most important accomplishments in the history of artificial
link |
00:00:23.040
intelligence, and David is one of the key humans who brought AlphaZero to life together
link |
00:00:28.840
with a lot of other great researchers at DeepMind.
link |
00:00:32.000
He's humble, kind, and brilliant.
link |
00:00:34.440
We were both jet lagged, but didn't care and made it happen.
link |
00:00:38.600
It was a pleasure and truly an honor to talk with David.
link |
00:00:43.400
This conversation was recorded before the outbreak of the pandemic.
link |
00:00:47.040
For everyone feeling the medical, psychological, and financial burden of this crisis, I'm
link |
00:00:51.840
sending love your way.
link |
00:00:53.480
Stay strong. We're in this together, we'll beat this thing.
link |
00:00:57.800
This is the Artificial Intelligence Podcast.
link |
00:00:59.920
If you enjoy it, subscribe on YouTube, review it with 5 stars on Apple Podcasts, support
link |
00:01:05.120
it on Patreon, or simply connect with me on Twitter at Lex Fridman, spelled F R I D M A N.
link |
00:01:12.120
As usual, I'll do a few minutes of ads now and never any ads in the middle that can break
link |
00:01:16.480
the flow of the conversation.
link |
00:01:18.360
I hope that works for you and doesn't hurt the listening experience.
link |
00:01:22.720
Quick summary of the ads.
link |
00:01:24.000
Two sponsors.
link |
00:01:25.360
Masterclass and Cash App.
link |
00:01:27.480
Please consider supporting the podcast by signing up to Masterclass at masterclass.com
link |
00:01:32.640
slash lex, and downloading Cash App and using code Lex Podcast.
link |
00:01:38.920
This show is presented by Cash App, the number one finance app in the App Store.
link |
00:01:43.240
When you get it, use code Lex Podcast.
link |
00:01:47.200
Cash App lets you send money to friends, buy Bitcoin, and invest in the stock market with
link |
00:01:51.520
as little as $1.
link |
00:01:53.960
Since Cash App allows you to buy Bitcoin, let me mention that cryptocurrency in the context
link |
00:01:58.560
of the history of money is fascinating.
link |
00:02:01.400
I recommend Ascent of Money as a great book on this history.
link |
00:02:05.360
Debits and credits on ledgers started around 30,000 years ago.
link |
00:02:09.720
The US dollar was created over 200 years ago, and Bitcoin, the first decentralized cryptocurrency,
link |
00:02:15.920
released just over 10 years ago.
link |
00:02:18.720
So given that history, cryptocurrency is still very much in its early days of development,
link |
00:02:24.000
but is still aiming to and just might redefine the nature of money.
link |
00:02:29.160
So again, if you get Cash App from the App Store or Google Play and use the code Lex Podcast,
link |
00:02:34.720
you get $10, and Cash App will also donate $10 to FIRST, an organization that is helping
link |
00:02:40.080
to advance robotics and STEM education for young people around the world.
link |
00:02:45.000
This show is sponsored by Masterclass, sign up at masterclass.com slash Lex to get a discount
link |
00:02:50.360
and to support this podcast.
link |
00:02:51.640
In fact, for a limited time now, if you sign up for an All Access Pass for a year, you
link |
00:02:56.760
get another All Access Pass to share with a friend.
link |
00:03:01.280
Buy one, get one free.
link |
00:03:02.720
When I first heard about Masterclass, I thought it was too good to be true.
link |
00:03:06.400
For $180 a year, you get an All Access Pass to watch courses from, to list some of my favorites:
link |
00:03:13.000
Chris Hadfield on space exploration, Neil deGrasse Tyson on scientific thinking and communication,
link |
00:03:18.160
Will Wright, the creator of SimCity and The Sims, on game design, Jane Goodall on conservation,
link |
00:03:24.680
Carlos Santana on guitar, his song Europa could be the most beautiful guitar song ever
link |
00:03:29.920
written.
link |
00:03:30.920
Garry Kasparov on chess, Daniel Negreanu on poker, and many, many more.
link |
00:03:35.680
Chris Hadfield explaining how rockets work and the experience of being launched into
link |
00:03:39.400
space alone is worth the money.
link |
00:03:41.680
For me, the key is to not be overwhelmed by the abundance of choice.
link |
00:03:46.320
Pick three courses you want to complete.
link |
00:03:48.240
Watch each of them all the way through.
link |
00:03:49.680
It's not that long, but it's an experience that will stick with you for a long time.
link |
00:03:53.600
I promise.
link |
00:03:54.600
It's easily worth the money.
link |
00:03:56.760
You can watch it on basically any device.
link |
00:03:59.240
Once again, sign up on masterclass.com slash lex to get a discount and to support this
link |
00:04:03.840
podcast.
link |
00:04:05.640
And now here's my conversation with David Silver.
link |
00:04:09.800
What was the first program you ever wrote, and in what programming language, do you remember?
link |
00:04:14.760
I remember very clearly, yeah, my parents brought home this BBC model B microcomputer.
link |
00:04:22.040
It was just this fascinating thing to me.
link |
00:04:24.160
I was about seven years old and couldn't resist just playing around with it.
link |
00:04:30.040
So I think first program ever was writing my name out in different colors and getting
link |
00:04:37.160
it to loop and repeat that.
link |
00:04:39.720
And there was something magical about that, which just led to more and more.
link |
00:04:44.520
How did you think about computers back then?
link |
00:04:46.760
The magical aspect of it, that you can write a program and there's this thing that you
link |
00:04:51.640
just gave birth to that's able to create visual elements and live on its own.
link |
00:04:57.560
Or did you not think of it in those romantic notions?
link |
00:05:00.040
Was it more like, oh, that's cool.
link |
00:05:02.400
I can solve some puzzles.
link |
00:05:05.360
It was always more than solving puzzles.
link |
00:05:06.920
It was something where there were these limitless possibilities once you have a computer in
link |
00:05:14.280
front of you.
link |
00:05:15.280
You can do anything with it.
link |
00:05:16.280
I used to play with Lego with the same feeling.
link |
00:05:18.040
You can make anything you want out of Lego, but even more so with a computer.
link |
00:05:21.480
You're not constrained by the amount of kit you've got.
link |
00:05:24.600
And so I was fascinated by it and started pulling out the user guide and the advanced
link |
00:05:28.960
user guide and then learning.
link |
00:05:30.760
So I started in BASIC and then later 6502 assembly.
link |
00:05:34.640
My father also became interested in this machine and gave up his career to go back to school
link |
00:05:40.160
and study for a master's degree in artificial intelligence, funnily enough, at Essex University
link |
00:05:47.080
when I was seven.
link |
00:05:48.720
So I was exposed to those things at an early age.
link |
00:05:52.080
He showed me how to program in Prolog and do things like querying your family tree.
link |
00:05:57.880
And those are some of my earliest memories of trying to figure things out on a computer.
link |
00:06:04.280
Those are the early steps in computer science programming, but when did you first fall in
link |
00:06:09.000
love with artificial intelligence or with the ideas, the dreams of AI?
link |
00:06:14.840
I think it was really when I went to study at university.
link |
00:06:19.120
So I was an undergrad at Cambridge and studying computer science.
link |
00:06:26.080
And I really started to question, you know, what really are the goals?
link |
00:06:29.640
What's the goal?
link |
00:06:30.640
Where do we want to go with computer science?
link |
00:06:33.000
And it seemed to me that the only step of major significance to take was to try and recreate
link |
00:06:42.240
something akin to human intelligence.
link |
00:06:44.240
If we could do that, that would be a major leap forward.
link |
00:06:47.800
And that idea, I certainly wasn't the first to have it, but it, you know, nestled within
link |
00:06:52.440
me somewhere and became like a bug, you know, I really wanted to crack that problem.
link |
00:06:58.600
So you thought it was like, you had a notion that this is something that human beings can
link |
00:07:03.200
do, that it is possible to create an intelligent machine?
link |
00:07:07.000
Well, I mean, unless you believe in something metaphysical, then what are our brains doing?
link |
00:07:13.560
Well, at some level, they're information processing systems, which are able to take whatever information
link |
00:07:21.960
is in there, transform it through some form of program and produce some kind of output,
link |
00:07:26.120
which enables that human being to do all the amazing things that they can do in this incredible
link |
00:07:30.640
world.
link |
00:07:31.640
So then, do you remember the first time you wrote a program that, because you
link |
00:07:38.160
also had an interest in games, do you remember the first time you wrote a program that
link |
00:07:42.040
beat you in a game?
link |
00:07:45.800
That, or beat you at anything, sort of achieved super-David-Silver-level performance?
link |
00:07:54.400
So I used to work in the games industry.
link |
00:07:56.520
So for five years, I programmed games for my first job.
link |
00:08:01.360
So it was an amazing opportunity to get involved in a startup company.
link |
00:08:05.920
And so I was involved in building AI at that time.
link |
00:08:12.200
And so for sure, there was a sense of building handcrafted what people used to call AI in
link |
00:08:19.480
the games industry, which I think is not really what we might think of as AI in its fuller
link |
00:08:23.480
sense, but something which is able to take actions in a way which makes things interesting
link |
00:08:31.400
and challenging for the human player.
link |
00:08:35.160
And at that time, I was able to build these handcrafted agents, which in certain limited
link |
00:08:40.320
cases could do things better than me, but mostly in these kinds of
link |
00:08:46.680
twitch like scenarios where they were able to do things faster or because they had some
link |
00:08:51.240
pattern which they were able to exploit repeatedly.
link |
00:08:55.400
I think if we're talking about real AI, the first experience for me came after that when
link |
00:09:01.760
I realized that this path I was on wasn't taking me there, it wasn't dealing with
link |
00:09:08.360
that bug which I still had inside me to really understand intelligence and try and solve it.
link |
00:09:14.720
Everything people were doing in games was short term fixes rather than long term vision.
link |
00:09:20.840
And so I went back to study for my PhD, which was, funnily enough, trying to apply reinforcement
link |
00:09:26.640
learning to the game of Go.
link |
00:09:28.520
And I built my first Go program using reinforcement learning, a system which would by trial and
link |
00:09:33.840
error play against itself and was able to learn which patterns were actually helpful
link |
00:09:40.520
to predict whether it was going to win or lose the game and then choose the moves that
link |
00:09:44.720
led to the combination of patterns that would mean that you're more likely to win.
link |
00:09:48.440
And that system, that system beat me.
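To make the idea concrete, here is a minimal Python sketch (an illustration, not David's actual system, which used more sophisticated temporal-difference learning) of scoring positions by learned pattern weights and updating those weights from self-play outcomes; `patterns_after` is a hypothetical helper that extracts the patterns a move would produce.

```python
import math

# Illustrative sketch only: evaluate a position by summing learned weights
# for the patterns it contains, and after each self-play game nudge those
# weights toward the observed outcome (1 = win, 0 = loss).

def predict_win(weights, patterns):
    """Estimated probability of winning, given the patterns in a position."""
    score = sum(weights.get(p, 0.0) for p in patterns)
    return 1.0 / (1.0 + math.exp(-score))

def learn_from_game(weights, visited, outcome, lr=0.01):
    """Update pattern weights from one finished self-play game."""
    for patterns in visited:                  # pattern sets seen during the game
        error = outcome - predict_win(weights, patterns)
        for p in patterns:
            weights[p] = weights.get(p, 0.0) + lr * error

def choose_move(weights, moves, patterns_after):
    """Pick the move whose resulting pattern combination looks most winning."""
    return max(moves, key=lambda m: predict_win(weights, patterns_after(m)))
```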
link |
00:09:51.040
And how did that make you feel?
link |
00:09:53.520
Made me feel good.
link |
00:09:54.520
I mean, was it a mix of a sort of excitement, and was there a tinge of sort of like almost
link |
00:10:02.720
like a fearful awe?
link |
00:10:04.600
You know, it's like in 2001: A Space Odyssey, kind of realizing that you've created something
link |
00:10:11.880
that's achieved human level intelligence in this one particular little task.
link |
00:10:21.240
And in that case, I suppose the neural networks weren't involved.
link |
00:10:24.440
There were no neural networks in those days.
link |
00:10:26.920
This was pre deep learning revolution, but it was a principled self learning system based
link |
00:10:33.240
on a lot of the principles which people still use in deep reinforcement learning.
link |
00:10:40.320
How did I feel?
link |
00:10:41.320
I think I found it immensely satisfying that a system which was able to learn from first
link |
00:10:50.080
principles for itself was able to reach the point that it was understanding this domain
link |
00:10:56.360
better than I could and able to outwit me.
link |
00:11:00.000
I don't think it was a sense of awe.
link |
00:11:01.520
It was a sense of satisfaction that something I felt should work had worked.
link |
00:11:08.160
So to me, AlphaGo, and I don't know how else to put it, but to me, AlphaGo and AlphaGo
link |
00:11:13.120
Zero mastering the game of Go is, again, to me, the most profound and inspiring moment
link |
00:11:20.360
in the history of artificial intelligence.
link |
00:11:23.560
So you're one of the key people behind this achievement.
link |
00:11:26.680
And I'm Russian, so I really felt the first sort of seminal achievement when Deep Blue
link |
00:11:32.640
beat Garry Kasparov in 1997.
link |
00:11:36.920
So as far as I know, the AI community at that point largely saw the game of Go as unbeatable
link |
00:11:42.720
by AI using the state-of-the-art brute-force search methods.
link |
00:11:49.080
Even if you consider at least the way I saw it, even if you consider arbitrary exponential
link |
00:11:54.920
scaling of compute, Go would still not be solvable, which is why it was thought to be impossible.
link |
00:12:02.640
So given that the game of Go was impossible to master, when was the dream for you?
link |
00:12:09.520
You just mentioned your PhD thesis of building the system that plays Go.
link |
00:12:13.640
When was the dream for you that you could actually build a computer program that achieves
link |
00:12:19.120
world class, not necessarily beats the world champion, but achieves that kind of level
link |
00:12:23.480
of playing Go?
link |
00:12:24.480
First of all, thank you.
link |
00:12:25.480
Those were very kind words.
link |
00:12:28.640
And funnily enough, I just came from a panel where I was actually in a conversation with
link |
00:12:34.600
Garry Kasparov and Murray Campbell, one of the creators of Deep Blue.
link |
00:12:39.200
And it was their first meeting together since the match, so that just occurred yesterday.
link |
00:12:44.560
So I'm literally fresh from that experience.
link |
00:12:47.440
So these are amazing moments when they happen, but where did it all start?
link |
00:12:52.000
Well, for me, it started when I became fascinated in the game of Go.
link |
00:12:56.200
So Go, for me, I've grown up playing games, I've always had a fascination in board games.
link |
00:13:01.800
I played chess as a kid, I played Scrabble as a kid.
link |
00:13:06.160
When I was at university, I discovered the game of Go, and to me, it just blew all of
link |
00:13:10.520
those other games out of the water, it was just so deep and profound in its complexity
link |
00:13:15.480
with endless levels to it.
link |
00:13:17.920
What I discovered was that I could devote endless hours to this game, and I knew in
link |
00:13:27.320
my heart of hearts that no matter how many hours I would devote to it, I would never
link |
00:13:30.680
become a grandmaster.
link |
00:13:34.560
Or there was another path, and the other path was to try and understand how you could get
link |
00:13:39.400
some other intelligence to play this game better than I would be able to.
link |
00:13:43.560
And so even in those days, I had this idea that, what if it was possible to build a program
link |
00:13:49.240
that could crack this?
link |
00:13:51.160
And as I started to explore the domain, I discovered that this was really the domain
link |
00:13:57.340
where people felt deeply that if progress could be made in Go, it would really mean
link |
00:14:04.520
a giant leap forward for AI.
link |
00:14:06.380
It was the challenge where all other approaches had failed.
link |
00:14:11.000
This is coming out of the era you mentioned, which was in some sense the golden era for
link |
00:14:16.760
the classical methods of AI, like heuristic search.
link |
00:14:19.960
In the 90s, they all fell one after another, not just chess with Deep Blue, but checkers,
link |
00:14:26.720
backgammon, Othello.
link |
00:14:28.920
There were numerous cases where systems built on top of heuristic search methods with these
link |
00:14:36.560
high performance systems had been able to defeat the human world champion in each of
link |
00:14:40.640
those domains.
link |
00:14:42.080
And yet in that same time period, there was a million dollar prize available for the game
link |
00:14:49.520
of Go, for the first system to beat a human professional player.
link |
00:14:52.960
And at the end of that time period, in the year 2000, when the prize expired, the strongest
link |
00:14:58.480
Go program in the world was defeated by a nine-year-old child.
link |
00:15:02.800
And that nine-year-old child was giving nine free moves to the computer at the start of
link |
00:15:06.920
the game to try and even things up.
link |
00:15:09.960
And a computer Go expert beat that same strongest program with 29 handicap stones, 29 free moves.
link |
00:15:18.200
So that's what the state of affairs was when I became interested in this problem in around
link |
00:15:23.880
2003 when I started working on computer Go.
link |
00:15:29.560
There was nothing.
link |
00:15:30.560
There was just very, very little in the way of progress towards meaningful performance
link |
00:15:36.640
at anything approaching human level.
link |
00:15:39.240
And so, it wasn't through lack of effort, people had tried many, many things.
link |
00:15:45.000
And so there was a strong sense that something different would be required for Go than had
link |
00:15:50.760
been needed for all of these other domains where AI had been successful.
link |
00:15:54.280
And maybe the single clearest example is that Go, unlike those other domains, had this kind
link |
00:16:00.800
of intuitive property that a Go player would look at a position and say, hey, here's this
link |
00:16:07.040
mess of black and white stones.
link |
00:16:09.680
But from this mess, oh, I can predict that this part of the board will become my territory,
link |
00:16:15.920
this part of the board will become your territory, and I've got this overall sense that I'm going
link |
00:16:19.880
to win and that this is about the right move to play.
link |
00:16:22.440
And that intuitive sense of judgment of being able to evaluate what's going on in a position,
link |
00:16:28.240
it was pivotal to humans being able to play this game and something that people had no
link |
00:16:32.840
idea how to put into computers.
link |
00:16:35.120
So this question of how to evaluate a position, how to come up with these intuitive judgments
link |
00:16:39.960
was the key reason why Go was so hard in addition to its enormous search space and the reason
link |
00:16:48.320
why methods which had succeeded so well elsewhere failed in Go.
link |
00:16:53.040
And so people really felt deep down that in order to crack Go, we would need to get something
link |
00:16:59.000
akin to human intuition.
link |
00:17:00.520
And if we got something akin to human intuition, we'd be able to solve many, many more problems
link |
00:17:06.000
in AI.
link |
00:17:07.000
So for me, that was the moment where it's like, okay, this is not just about playing
link |
00:17:11.240
the game of Go.
link |
00:17:12.240
This is about something profound.
link |
00:17:13.680
And it was back to that bug which had been itching me all those years.
link |
00:17:17.800
This is the opportunity to do something meaningful and transformative and I guess a dream was
link |
00:17:22.840
born.
link |
00:17:23.840
That's a really interesting way to put it.
link |
00:17:25.400
So almost this realization that you need to formulate Go as a kind of a prediction
link |
00:17:30.840
problem versus a search problem was the intuition.
link |
00:17:34.720
I mean, maybe that's the wrong crude term, but to give it the ability to kind of intuit
link |
00:17:43.520
things about positional structure of the board.
link |
00:17:47.120
Now, okay, but what about the learning part of it?
link |
00:17:51.040
Did you have a sense that you have to, that learning has to be part of the system?
link |
00:17:57.520
Again, something that hadn't really, as far as I think, except with TD-Gammon in the
link |
00:18:03.360
90s with RL a little bit, been part of those state-of-the-art game-playing systems.
link |
00:18:08.800
So I strongly felt that learning would be necessary and that's why my PhD topic back
link |
00:18:15.160
then was trying to apply reinforcement learning to the game of Go.
link |
00:18:20.240
And not just learning of any type, but I felt that the only way to really have a system
link |
00:18:26.040
to progress beyond human levels of performance wouldn't just be to mimic how humans do it,
link |
00:18:31.120
but for it to understand for itself.
link |
00:18:33.440
And how else can a machine hope to understand what's going on, except through learning?
link |
00:18:39.120
If you're not learning, what else are you doing?
link |
00:18:40.480
Well, you're putting all the knowledge into the system and that just feels like something
link |
00:18:45.480
which decades of AI have told us is maybe not a dead end, but certainly has a ceiling
link |
00:18:52.280
to the capabilities.
link |
00:18:53.280
It's known as the knowledge acquisition bottleneck that the more you try to put into something,
link |
00:18:58.520
the more brittle the system becomes.
link |
00:19:00.520
And so you just have to have learning.
link |
00:19:02.840
You have to have learning.
link |
00:19:03.840
That's the only way you're going to be able to get a system which has sufficient knowledge
link |
00:19:08.960
in it, millions and millions of pieces of knowledge, billions, trillions, in a form that it
link |
00:19:14.320
can actually apply for itself and understand how those billions and trillions of pieces
link |
00:19:18.520
of knowledge can be leveraged in a way which will actually lead it towards its goal without
link |
00:19:23.480
conflict or other issues.
link |
00:19:26.520
Yeah.
link |
00:19:27.520
I mean, if I put myself back in that time, I just wouldn't think like that without a good
link |
00:19:33.720
demonstration of RL.
link |
00:19:34.800
I would think more in terms of symbolic AI, like not learning, but sort of an accumulation
link |
00:19:42.720
of a knowledge base, like a growing knowledge base, but it would still be sort of pattern
link |
00:19:48.840
based, like basically have little rules that you kind of assemble together into a large
link |
00:19:55.200
knowledge base.
link |
00:19:56.800
Well, in a sense, that was the state of the art back then.
link |
00:19:59.840
So if you look at the Go programs which had been competing for this prize I mentioned,
link |
00:20:05.400
they were an assembly of different specialized systems, some of which used huge amounts of
link |
00:20:11.240
human knowledge to describe how you should play the opening, all the different
link |
00:20:16.240
patterns that were required to play well in the game of Go, end game theory, combinatorial
link |
00:20:23.680
game theory, and combined with more principled search-based methods, which were trying
link |
00:20:29.120
to solve for particular sub parts of the game, like life and death, connecting groups together,
link |
00:20:36.880
all these amazing sub problems that just emerge in the game of Go, there were different pieces
link |
00:20:42.440
all put together into this collage, which together would try and play against a human.
link |
00:20:49.360
And although not all of the pieces were handcrafted, the overall effect was nevertheless still
link |
00:20:56.280
brittle and it was hard to make all these pieces work well together.
link |
00:21:00.320
And so really, what I was pressing for and the main innovation of the approach I took
link |
00:21:05.480
was to go back to first principles and say, well, let's back off that and try and find
link |
00:21:11.280
a principled approach where the system can learn for itself.
link |
00:21:17.240
Just from the outcome, like, you know, learn for itself, if you try something, did that
link |
00:21:21.080
help or did it not help?
link |
00:21:22.840
And only through that procedure can you arrive at knowledge which is verified, the system
link |
00:21:28.280
has to verify it for itself, not relying on any other third party to say this is right
link |
00:21:32.400
or this is wrong.
link |
00:21:33.400
And so that principle was already very important in those days, but unfortunately we were missing
link |
00:21:40.760
some important pieces back then.
link |
00:21:43.360
So before we dive into maybe discussing the beauty of reinforcement learning, let's take
link |
00:21:49.400
a step back, we kind of skipped it a bit, but the rules of the game of Go, the elements
link |
00:21:58.920
of it, perhaps contrasting with chess, that you really enjoyed as a human being and
link |
00:22:07.280
also that make it really difficult as an AI machine learning problem.
link |
00:22:13.160
So the game of Go has remarkably simple rules, in fact, so simple that people have speculated
link |
00:22:19.080
that if we were to meet alien life at some point that we wouldn't be able to communicate
link |
00:22:23.360
with them, but we would be able to play a game of Go with them, probably have discovered
link |
00:22:26.880
the same ruleset, so the game is played on a 19 by 19 grid and you play on the intersections
link |
00:22:33.640
of the grid and the players take turns.
link |
00:22:36.000
And the aim of the game is very simple, it's to surround as much territory as you can as
link |
00:22:40.800
many of these intersections with your stones and to surround more than your opponent does.
link |
00:22:46.480
And the only nuance to the game is that if you fully surround your opponent's piece,
link |
00:22:50.520
then you get to capture it and remove it from the board and it counts as your own territory.
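As a concrete illustration of that capture rule, here is a small Python sketch, under assumed conventions (the board as a dict from intersection coordinates to stone color), of the liberty check that decides whether a chain has been fully surrounded:

```python
SIZE = 19  # the full board; the rule is identical on smaller boards

def neighbors(point):
    """Orthogonally adjacent intersections that lie on the board."""
    r, c = point
    return [(r + dr, c + dc)
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= r + dr < SIZE and 0 <= c + dc < SIZE]

def has_liberties(board, start):
    """Flood-fill the chain containing `start`; True if any adjacent point
    is empty. A chain with no liberties is captured and removed."""
    color, seen, stack = board[start], {start}, [start]
    while stack:
        for n in neighbors(stack.pop()):
            if n not in board:                  # empty intersection: a liberty
                return True
            if board[n] == color and n not in seen:
                seen.add(n)
                stack.append(n)
    return False
```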
link |
00:22:54.480
Now from those very simple rules, immense complexity arises, there's kind of profound
link |
00:22:59.080
strategies in how to surround territory, how to kind of trade off between making solid
link |
00:23:05.240
territory yourself now, compared to building up influence that will help you acquire territory
link |
00:23:10.440
later in the game, how to connect groups together, how to keep your own groups alive, which patterns
link |
00:23:17.800
of stones are most useful compared to others.
link |
00:23:21.560
There's just immense knowledge. Humans have played this game since it was
link |
00:23:27.520
discovered thousands of years ago and human Go players have built up this immense knowledge
link |
00:23:31.600
base over the years.
link |
00:23:33.840
It's studied very deeply and played by something like 50 million players across the world, mostly
link |
00:23:39.000
in China, Japan and Korea, where it's an important part of the culture, so much so that it's
link |
00:23:44.400
considered one of the four ancient arts that was required of Chinese scholars.
link |
00:23:49.960
There's a deep history there.
link |
00:23:51.760
But there's an interesting quality. So if I were to compare it to chess, in the same
link |
00:23:57.320
way that Go is part of Chinese culture, chess in Russia is also considered one of
link |
00:24:02.040
the sacred arts.
link |
00:24:04.040
So if we contrast Go with chess, there are interesting qualities about Go, maybe you can correct
link |
00:24:10.040
me if I'm wrong, but the evaluation of a particular static board is not as reliable, like you
link |
00:24:18.960
can't, in chess you can kind of assign points to the different units, and it's kind of a
link |
00:24:25.800
pretty good measure of who's winning, who's losing.
link |
00:24:28.080
It's not so clear how to do so in Go.
link |
00:24:30.040
Yeah, so in the game of Go, you find yourself in a situation where both players have played
link |
00:24:34.200
the same number of stones; actually, captures at strong levels of play happen very rarely,
link |
00:24:39.400
which means that at any moment in the game you've got the same number of white stones
link |
00:24:42.320
and black stones, and the only thing which differentiates how well you're doing is this
link |
00:24:47.160
intuitive sense of where are the territories ultimately going to form on this board?
link |
00:24:52.640
And if you look at the complexity of a real Go position, it's mind boggling that question
link |
00:25:00.480
of what will happen in 300 moves from now when you see just a scattering of 20 white
link |
00:25:05.320
and black stones intermingled, and so that challenge is the reason why position evaluation
link |
00:25:14.120
is so hard in Go compared to other games.
link |
00:25:17.480
In addition to that, it has an enormous search space, so there's around 10 to the 170 positions
link |
00:25:23.280
in the game of Go, that's an astronomical number, and that search space is so great
link |
00:25:28.440
that traditional heuristic search methods that were so successful in things like Deep
link |
00:25:32.080
Blue and chess programs just kind of fall over in Go.
link |
00:25:36.240
So at which point did reinforcement learning enter your life, your research life, your way
link |
00:25:43.360
of thinking?
link |
00:25:44.360
We just talked about learning, but reinforcement learning is a very particular kind of learning,
link |
00:25:49.840
one that's both philosophically sort of profound, but also one that's pretty difficult to get
link |
00:25:55.080
to work, if we look back at the early days.
link |
00:25:58.560
So when did that enter your life and how did that work progress?
link |
00:26:02.480
So I had just finished working in the games industry at this startup company, and I took
link |
00:26:09.720
a year out to discover for myself exactly which path I wanted to take, I knew I wanted
link |
00:26:15.360
to study intelligence, but I wasn't sure what that meant at that stage, I really didn't
link |
00:26:20.080
feel I had the tools to decide on exactly which path I wanted to follow.
link |
00:26:24.920
So during that year, I read a lot, and one of the things I read was Sutton and Barto,
link |
00:26:31.280
the sort of seminal textbook, an introduction to reinforcement learning, and when I read
link |
00:26:37.960
that textbook, I just had this resonating feeling that this is what I understood intelligence
link |
00:26:46.320
to be.
link |
00:26:48.080
And this was the path that I felt would be necessary to go down to make progress in AI.
link |
00:26:55.920
So I got in touch with Rich Sutton and asked him if he would be interested in supervising
link |
00:27:03.560
me on a PhD thesis in computer Go, and he basically said that if he was still alive, he'd
link |
00:27:14.840
be happy to, but unfortunately he'd been struggling with very serious cancer for some years, and
link |
00:27:22.000
he really wasn't confident at that stage that he'd even be around to see the end of it.
link |
00:27:26.560
But fortunately that part of the story worked out very happily, and I found myself out there
link |
00:27:32.160
in Alberta, they've got a great games group out there with a history of fantastic work
link |
00:27:36.200
in board games, as well as Rich Sutton, the father of RL, so it was the natural place
link |
00:27:42.440
for me to go in some sense to study this question.
link |
00:27:46.320
And the more I looked into it, the more strongly I felt that this wasn't just the path to progress
link |
00:27:55.440
in computer go, but really this was the thing I'd been looking for.
link |
00:27:59.440
This was really an opportunity to frame what intelligence means, like what are the goals
link |
00:28:09.800
of AI in a single, clear problem definition such that if we're able to solve that clear
link |
00:28:16.040
single problem definition, in some sense we've cracked the problem of AI.
link |
00:28:21.280
So to you, reinforcement learning ideas, at least sort of echoes of it, would be at the
link |
00:28:27.000
core of intelligence.
link |
00:28:29.800
Is it the core of intelligence?
link |
00:28:31.480
And if we ever create a human level intelligence system, it would be at the core of that kind
link |
00:28:36.440
of system.
link |
00:28:37.440
Let me say it this way, that I think it's helpful to separate out the problem from the solution.
link |
00:28:42.480
So I see the problem of intelligence, I would say it can be formalized as the reinforcement
link |
00:28:49.640
learning problem, and that that formalization is enough to capture most if not all of the
link |
00:28:55.880
things that we mean by intelligence, that they can all be brought within this framework
link |
00:29:01.000
and gives us a way to access them in a meaningful way that allows us as scientists to understand
link |
00:29:07.920
intelligence and us as computer scientists to build it.
link |
00:29:12.920
And so in that sense, I feel that it gives us a path, maybe not the only path, but a
link |
00:29:17.600
path towards AI.
link |
00:29:20.640
And so do I think that any system in the future that's solved AI would have to have RL within
link |
00:29:29.320
it?
link |
00:29:30.320
Well, I think if you ask that, you're asking about the solution methods.
link |
00:29:33.440
I would say that if we have such a thing, it would be a solution to the RL problem.
link |
00:29:38.000
Now, what particular methods have been used to get there?
link |
00:29:40.960
Well, we should keep an open mind about the best approaches to actually solve any problem.
link |
00:29:46.040
And the things we have right now for reinforcement learning, I believe they've got a lot
link |
00:29:53.080
of legs, but maybe we're missing some things.
link |
00:29:55.040
Maybe there's going to be better ideas.
link |
00:29:56.200
I think we should keep, let's remain modest, and we're at the early days of this field,
link |
00:30:02.880
and there are many amazing discoveries ahead of us.
link |
00:30:05.080
For sure.
link |
00:30:06.080
The specifics, especially of the different kinds of RL approaches currently, there could
link |
00:30:09.840
be other things that fall into the very large umbrella of RL.
link |
00:30:13.600
But if it's okay, can we take a step back and ask the basic question of what is, to
link |
00:30:20.360
you, reinforcement learning?
link |
00:30:22.720
So reinforcement learning is the study and the science and the problem of intelligence
link |
00:30:31.400
in the form of an agent that interacts with an environment.
link |
00:30:35.600
So the problem you're trying to solve is represented by some environment, like the world in which
link |
00:30:38.880
that agent is situated.
link |
00:30:40.880
And the goal of RL is clear, that the agent gets to take actions.
link |
00:30:45.760
Those actions have some effect on the environment, and the environment gives back an observation
link |
00:30:49.160
to the agent saying, you know, this is what you see or sense.
link |
00:30:52.960
And one special thing which it gives back is called the reward signal, how well it's
link |
00:30:56.760
doing in the environment.
link |
00:30:58.240
And the reinforcement learning problem is to simply take actions over time so as to maximize
link |
00:31:05.200
that reward signal.
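In code, that loop might be sketched as follows; the `env`/`agent` interface here is an assumption for illustration, loosely echoing common RL conventions rather than any particular library:

```python
# Minimal sketch of the agent-environment loop: the agent acts, the
# environment returns an observation and a reward, and the agent's
# objective is to maximize cumulative reward over time.

def run_episode(env, agent, max_steps=1000):
    observation = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(observation)               # agent takes an action
        observation, reward, done = env.step(action)  # environment responds
        agent.learn(observation, reward)              # agent adapts from the reward signal
        total_reward += reward
        if done:
            break
    return total_reward
```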
link |
00:31:07.480
So a couple of basic questions, what types of RL approaches are there?
link |
00:31:14.000
So I don't know if there's a nice, brief, in-words way to paint a picture of sort of value-based,
link |
00:31:21.840
model based, policy based reinforcement learning.
link |
00:31:25.400
Yeah.
link |
00:31:26.400
So now if we think about, okay, so there's this ambitious problem definition of RL.
link |
00:31:32.040
It's really, you know, it's truly ambitious.
link |
00:31:33.480
It's trying to capture and encircle all of the things in which an agent interacts with
link |
00:31:37.080
an environment and say, well, how can we formalize and understand what it means to crack that?
link |
00:31:42.200
Now let's think about the solution method.
link |
00:31:43.760
Well, how do you solve a really hard problem like that?
link |
00:31:46.320
Well, one approach you can take is to decompose that very hard problem into pieces that work
link |
00:31:53.400
together to solve that hard problem.
link |
00:31:56.240
And so you can kind of look at the decomposition that's inside the agent's head, if you like,
link |
00:32:00.880
and ask, well, what form does that decomposition take?
link |
00:32:03.960
And some of the most common pieces that people use when they're kind of putting this system,
link |
00:32:07.920
the solution method together, some of the most common pieces that people use are whether
link |
00:32:12.480
or not that solution has a value function.
link |
00:32:15.000
That means is it trying to predict, explicitly trying to predict how much reward it will get
link |
00:32:18.920
in the future?
link |
00:32:19.920
Does it have a representation of a policy?
link |
00:32:22.960
That means something which is deciding how to pick actions.
link |
00:32:25.880
Is that decision making process explicitly represented?
link |
00:32:29.480
And is there a model in the system?
link |
00:32:32.160
Is there something which is explicitly trying to predict what will happen in the environment?
link |
00:32:36.760
And so those three pieces are, to me, some of the most common building blocks.
link |
00:32:42.680
And I understand the different choices in RL as choices of whether or not to use those
link |
00:32:49.240
building blocks when you're trying to decompose the solution.
link |
00:32:52.800
Should I have a value function represented?
link |
00:32:54.440
Should I have a policy represented?
link |
00:32:56.880
Should I have a model represented?
link |
00:32:58.640
And there are combinations of those pieces and, of course, other things that you could
link |
00:33:01.400
add into the picture as well.
link |
00:33:03.320
But those three fundamental choices give rise to some of the branches of RL with which
link |
00:33:07.240
we are very familiar.
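A rough, purely illustrative sketch of those three optional building blocks as Python interfaces (value-based methods such as Q-learning represent the first, policy-gradient methods the second, model-based methods the third, and many agents combine several):

```python
class ValueFunction:
    def expected_return(self, state):
        """Predict how much reward will be received from this state onwards."""
        raise NotImplementedError

class Policy:
    def select_action(self, state):
        """Decide which action to take in this state."""
        raise NotImplementedError

class Model:
    def predict(self, state, action):
        """Predict the environment's response: (next_state, reward)."""
        raise NotImplementedError
```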
link |
00:33:08.760
And so those, as you mentioned, there is a choice of what's specified or modeled explicitly.
link |
00:33:17.600
And the idea is that all of these are somehow implicitly learned within the system.
link |
00:33:23.560
So it's almost a choice of how you approach a problem.
link |
00:33:28.560
Do you see those as fundamental differences or are these almost like small specifics,
link |
00:33:35.600
like the details of how you solve the problem, but they're not fundamentally different from
link |
00:33:39.040
each other?
link |
00:33:40.920
I think the fundamental idea is maybe at the higher level.
link |
00:33:46.000
The fundamental idea is the first step of the decomposition is really to say, well,
link |
00:33:52.440
how are we really going to solve any kind of problem where you're trying to figure out
link |
00:33:56.520
how to take actions and just from this stream of observations, you know, you've got some
link |
00:34:00.240
agent situated in its sensory motor stream and getting all these observations in, getting
link |
00:34:04.520
to take these actions.
link |
00:34:05.520
And what should it do?
link |
00:34:06.520
How can you even broach that problem?
link |
00:34:07.520
You know, maybe the complexity of the world is so great that you can't even imagine how
link |
00:34:12.400
to build a system that would understand how to deal with that.
link |
00:34:15.880
And so the first step of this decomposition is to say, well, you have to learn.
link |
00:34:19.600
The system has to learn for itself.
link |
00:34:22.160
And so note that the reinforcement learning problem doesn't actually stipulate that you
link |
00:34:26.240
have to learn.
link |
00:34:27.240
Like, you could try to maximize your rewards without learning, you just wouldn't do a
link |
00:34:30.680
very good job of it.
link |
00:34:32.520
So learning is required because it's the only way to achieve good performance in any sufficiently
link |
00:34:37.880
large and complex environment.
link |
00:34:40.600
So that's the first step.
link |
00:34:42.080
And so that step gives commonality to all of the other pieces, because now you might
link |
00:34:45.920
ask, well, what should you be learning?
link |
00:34:48.920
What does learning even mean?
link |
00:34:49.920
You know, in this sense, you know, learning might mean, well, you're trying to update
link |
00:34:54.480
the parameters of some system, which is then the thing that actually picks the actions.
link |
00:35:01.480
And those parameters could be representing anything, they could be parametrizing a value
link |
00:35:05.280
function or a model or a policy.
link |
00:35:08.640
And so in that sense, there's a lot of commonality in that whatever is being represented there
link |
00:35:12.320
is the thing which is being learned and it's being learned with the ultimate goal of maximizing
link |
00:35:16.320
rewards.
link |
00:35:17.480
But the way in which you decompose the problem is really what gives the semantics to the
link |
00:35:22.560
whole system.
link |
00:35:23.560
You're trying to learn something to predict well, like a value function or a model, or
link |
00:35:28.640
you're learning something to perform well, like a policy.
link |
00:35:32.200
And the form of that objective is kind of giving the semantics to the system.
link |
00:35:36.440
And so it really is, at the next level down, a fundamental choice.
link |
00:35:40.360
And we have to make those fundamental choices as system designers or enable our algorithms
link |
00:35:46.120
to be able to learn how to make those choices for themselves.
link |
00:35:49.440
So then the next step you mentioned, the very first thing you have to deal with is can you
link |
00:35:56.280
even take in this huge stream of observations and do anything with it?
link |
00:36:01.640
So the natural next basic question is, what is deep reinforcement learning
link |
00:36:07.960
and what is this idea of using neural networks to deal with this huge incoming stream?
link |
00:36:14.640
So amongst all the approaches for reinforcement learning, deep reinforcement learning is
link |
00:36:19.400
one family of solution methods that tries to utilize powerful representations that are
link |
00:36:29.840
offered by neural networks to represent any of these different components of the solution,
link |
00:36:37.080
of the agent.
link |
00:36:38.080
Like whether it's the value function or the model or the policy, the idea of deep learning
link |
00:36:42.640
is to say, well, here's a powerful toolkit that's so powerful that it's universal in
link |
00:36:47.800
the sense that it can represent any function and it can learn any function.
link |
00:36:52.280
And so if we can leverage that universality, that means that whatever we need to represent
link |
00:36:57.880
for our policy or for our value function for a model, deep learning can do it.
link |
00:37:02.000
So that deep learning is one approach that offers us a toolkit that has no ceiling to
link |
00:37:08.680
its performance, that as we start to put more resources into the system, more memory and
link |
00:37:13.600
more computation and more data, more experience, more interactions with the environment, that
link |
00:37:20.680
these are systems that can just get better and better and better at doing whatever the
link |
00:37:24.000
job is we've asked them to do.
link |
00:37:25.760
Whatever we've asked that function to represent, it can learn a function that does a better
link |
00:37:31.600
and better job of representing that knowledge, whether that knowledge be estimating how well
link |
00:37:36.920
you're going to do in the world, the value function, whether it's going to be choosing
link |
00:37:40.040
what to do in the world, the policy, or whether it's understanding the world itself, what's
link |
00:37:45.040
going to happen next, the model.
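As one minimal illustration (an assumption for concreteness, not DeepMind's architecture), a small neural network parameterizing a value function for a Go-sized input might look like:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.1, (361, 64))   # 19x19 board flattened: one input per point
b1 = np.zeros(64)
w2 = rng.normal(0, 0.1, 64)

def value(board_vector):
    """Estimated probability that the player to move wins, from an encoded board."""
    hidden = np.maximum(0.0, board_vector @ W1 + b1)   # one ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(hidden @ w2)))        # squash to (0, 1)
```

The same kind of parameterized function, scaled up, could equally stand in for the policy or the model, with learning adjusting `W1`, `b1`, and `w2` from experience.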
link |
00:37:47.120
Nevertheless, the fact that neural networks are able to learn incredibly complex representations
link |
00:37:54.840
that allow you to represent the policy, the model or the value function is at least to my mind
link |
00:38:01.640
exceptionally beautiful and surprising.
link |
00:38:04.200
Was it surprising to you?
link |
00:38:08.880
Can you still believe it works as well as it does?
link |
00:38:11.560
Do you have good intuition about why it works at all and works as well as it does?
link |
00:38:19.280
I think let me take two parts to that question.
link |
00:38:22.880
I think it's not surprising to me that the idea of reinforcement learning works because
link |
00:38:31.000
in some sense, I feel it's the only thing which can ultimately work.
link |
00:38:37.600
I feel we have to address it, and success must be possible because we have examples
link |
00:38:43.440
of intelligence.
link |
00:38:45.840
It must at some level be possible to acquire experience and use that experience
link |
00:38:51.000
to do better in a way which is meaningful to environments of the complexity that humans
link |
00:38:57.040
can deal with.
link |
00:38:58.040
It must be.
link |
00:38:59.040
Am I surprised that our current systems can do as well as they can do?
link |
00:39:04.000
I think one of the big surprises for me and a lot of the community is really the fact
link |
00:39:11.760
that deep learning can continue to perform so well despite the fact that these neural
link |
00:39:22.200
networks that they're representing have these incredibly nonlinear kind of bumpy surfaces
link |
00:39:27.960
which to our kind of low dimensional intuitions make it feel like surely you're just going
link |
00:39:33.280
to get stuck and learning will get stuck because you won't be able to make any further progress.
link |
00:39:38.840
Yet, the big surprise is that learning continues and these what appear to be local optima turn
link |
00:39:46.640
out not to be because in high dimensions when we make really big neural nets, there's always
link |
00:39:50.800
a way out and there's a way to go even lower and then you're still not in a local optima
link |
00:39:56.440
because there's some other pathway that will take you out and take you lower still.
link |
00:40:00.440
No matter where you are, learning can proceed and do better and better and better without
link |
00:40:05.560
bound.
link |
00:40:08.360
That is a surprising and beautiful property of neural nets which I find elegant and beautiful
link |
00:40:17.960
and somewhat shocking that it turns out to be the case.
link |
00:40:21.120
As you said, which I really like, to our low dimensional intuitions, that's surprising.
link |
00:40:28.160
Yeah.
link |
00:40:29.160
We're very tuned to working within a three dimensional environment and so to start to
link |
00:40:36.160
visualize what a billion dimensional neural network surface that you're trying to optimize
link |
00:40:42.600
over, what that even looks like is very hard for us and so I think that really if you try
link |
00:40:48.280
to account for essentially the AI winter where people gave up on neural networks, I think
link |
00:40:57.040
it's really down to that lack of ability to generalize from low dimensions to high dimensions
link |
00:41:03.160
because back then we were in the low dimensional case, people could only build neural nets
link |
00:41:07.120
with 50 nodes in them or something and to imagine that it might be possible to build
link |
00:41:14.560
a billion-dimensional neural net and it might have a completely qualitatively
link |
00:41:18.120
different property was very hard to anticipate and I think even now we're starting to build
link |
00:41:23.480
the theory to support that and it's incomplete at the moment but all of the theory seems
link |
00:41:29.480
to be pointing in the direction that indeed this is an approach which truly is universal
link |
00:41:34.760
both in its representational capacity which was known but also in its learning ability
link |
00:41:38.400
which is surprising.
link |
00:41:41.520
It makes one wonder what else we're missing due to our low dimensional intuitions that
link |
00:41:48.880
will seem obvious once it's discovered.
link |
00:41:51.720
I often wonder when we one day do have AIs which are superhuman in their abilities to
link |
00:42:01.800
understand the world, what will they think of the algorithms that we're developing now?
link |
00:42:09.040
Will they be looking back at these days and thinking that
link |
00:42:17.200
these algorithms were naive first steps or will they still be the fundamental ideas which
link |
00:42:21.560
are used even in 100, 1,000, 10,000 years?
link |
00:42:26.840
They'll watch this conversation back with a smile and a little bit of a laugh.
link |
00:42:35.480
My sense is I think just like when we used to think that the sun revolved around the
link |
00:42:44.800
earth, they'll see our systems of today, reinforcement learning included, as too complicated, and that the answer
link |
00:42:52.200
was simple all along.
link |
00:42:54.560
There's something, just like you said, in the game of Go. I mean, I love systems like
link |
00:43:00.040
cellular automata, where there's simple rules from which incredible complexity emerges.
link |
00:43:06.160
So it feels like there might be some very simple approaches.
link |
00:43:10.720
Just like Rich Sutton says, these simple methods with compute, over time, seem to prove to be
link |
00:43:19.600
the most effective.
link |
00:43:20.600
I 100% agree. I think that if we try to anticipate what will generalize well into the future,
link |
00:43:30.640
I think it's likely to be the case that it's the simple clear ideas which will have the
link |
00:43:36.080
longest legs and which will carry us further into the future.
link |
00:43:39.440
And nevertheless we're in a situation where we need to make things work today and sometimes
link |
00:43:43.800
that requires putting together more complex systems where we don't have the full answers
link |
00:43:48.800
yet as to what those minimal ingredients might be.
link |
00:43:51.680
So speaking of which, if we could take a step back to Go, what was MoGo and what was the
link |
00:43:58.880
key idea behind the system?
link |
00:44:00.920
So back during my PhD on computer Go around about that time there was a major new development
link |
00:44:08.000
which actually happened in the context of computer Go, and it was really a revolution
link |
00:44:16.360
in the way that heuristic search was done and the idea was essentially that a position
link |
00:44:24.320
could be evaluated or a state in general could be evaluated not by humans saying whether
link |
00:44:30.840
that position is good or not or even humans providing rules as to how you might evaluate
link |
00:44:36.240
it but instead by allowing the system to randomly play out the game until the end multiple times
link |
00:44:45.800
and taking the average of those outcomes as the prediction of what will happen.
link |
00:44:50.680
So for example if you're in the game of Go the intuition is that you take a position
link |
00:44:55.280
and you get the system to kind of play random moves against itself all the way to the end
link |
00:44:59.480
of the game and you see who wins and if black ends up winning more of those random games
link |
00:45:04.080
than white, well, you say, hey, this is a position that favors black, and if white ends up winning
link |
00:45:08.560
more of those random games than black then it favors white.
link |
00:45:13.680
So that idea was known as Monte Carlo search and a particular form of Monte Carlo search
link |
00:45:23.080
that became very effective and was developed in computer Go first by Rémi Coulom in 2006
link |
00:45:28.640
and then taken further by others was something called Monte Carlo tree search which basically
link |
00:45:34.200
takes that same idea and uses that insight so that every node of a search tree is
link |
00:45:41.080
evaluated by the average of the random playouts from that node onwards.
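In code, the core of that evaluation idea might look like the following sketch, where `copy`, `is_terminal`, `legal_moves`, `play`, and `winner` are assumed methods of a hypothetical game-state object:

```python
import random

def rollout_value(position, player, simulations=100):
    """Estimate a position's value as the fraction of purely random
    playouts from it that `player` goes on to win."""
    wins = 0
    for _ in range(simulations):
        state = position.copy()
        while not state.is_terminal():
            state = state.play(random.choice(state.legal_moves()))
        if state.winner() == player:
            wins += 1
    return wins / simulations
```

In Monte Carlo tree search, an averaged rollout value of this kind is what backs up through the nodes of the search tree.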
link |
00:45:46.880
And this idea was very powerful and suddenly led to huge leaps forward in the strength
link |
00:45:52.040
of computer Go playing programs and among those the strongest of the Go playing programs
link |
00:45:58.360
in those days was a program called MoGo, which was the first program to actually reach human
link |
00:46:04.400
master level on small boards, nine-by-nine boards.
link |
00:46:07.960
And so this was a program by someone called Sylvain Gelly, who's a good colleague of mine;
link |
00:46:13.000
I worked with him a little bit in those days as part of my PhD thesis, and MoGo was a
link |
00:46:20.640
first step towards the latest successes we saw in computer Go but it was still missing
link |
00:46:26.160
a key ingredient. MoGo was evaluating purely by random rollouts against itself, and in a
link |
00:46:34.000
way it's truly remarkable that random play should give you anything at all like why in
link |
00:46:40.200
this perfectly deterministic game that's very precise and involves these very exact sequences
link |
00:46:46.680
why is it that randomization is helpful? And so the intuition is that randomization
link |
00:46:54.040
captures something about the nature of the search tree from a position that you're understanding
link |
00:47:01.640
the nature of the search tree from that node onwards by using randomization and this was
link |
00:47:07.240
a very powerful idea.
link |
00:47:09.400
And I've seen this in other spaces when I talk to Richard Karp and so on, randomized
link |
00:47:15.080
algorithms somehow magically are able to do exceptionally well and simplifying the problem
link |
00:47:22.560
somehow makes you wonder about the fundamental nature of randomness in our universe.
link |
00:47:27.680
It seems to be a useful thing but so from that moment, can you maybe tell the origin
link |
00:47:33.440
story in the journey of AlphaGo?
link |
00:47:36.160
Yeah, so programs based on Monte Carlo tree search were a first revolution in the sense
link |
00:47:41.960
that they led to suddenly programs that could play the game to any reasonable level but
link |
00:47:48.040
they plateaued, it seemed that no matter how much effort people put into these techniques
link |
00:47:53.120
they couldn't exceed the level of amateur dan-level Go players.
link |
00:47:58.160
So strong players but not anywhere near the level of professionals, never mind the world
link |
00:48:03.200
champion.
link |
00:48:04.600
And so that brings us to the birth of AlphaGo which happened in the context of a startup
link |
00:48:11.880
company known as DeepMind where a project was born and the project was really a scientific
link |
00:48:21.880
investigation where myself and Aja Huang and an intern, Chris Maddison, were exploring
link |
00:48:31.480
a scientific question and that scientific question was really, is there another fundamentally
link |
00:48:38.680
different approach to this key question of go, the key challenge of how can you build
link |
00:48:44.600
that intuition in?
link |
00:48:46.040
How can you just have a system that could look at a position and understand what move
link |
00:48:50.800
to play or how well you're doing in that position, who's going to win?
link |
00:48:54.920
And so the deep learning revolution had just begun; competitions like ImageNet had suddenly
link |
00:49:02.920
been won by deep learning techniques back in 2012 and following that it was natural
link |
00:49:08.040
to ask, well, if deep learning is able to scale up so effectively with images to understand
link |
00:49:14.160
them enough to classify them, well, why not go, why not take the black and white stones
link |
00:49:21.800
of the go board and build a system which can understand for itself what that means in terms
link |
00:49:26.360
of what move to pick or who's going to win the game, black or white?
link |
00:49:31.200
And so that was our scientific question which we were probing and trying to understand and
link |
00:49:36.600
as we started to look at it, we discovered that we could build a system, so in fact our
link |
00:49:41.320
very first paper on AlphaGo was actually a pure deep learning system which was trying
link |
00:49:48.080
to answer this question and we showed that actually a pure deep learning system with
link |
00:49:52.440
no search at all was actually able to reach human dan level, master level, at the full
link |
00:49:58.640
game of go, 19 by 19 boards.
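As an illustration of the kind of system being described, here is a toy convolutional policy network over a 19 by 19 board, written in PyTorch; the feature planes and layer sizes are made up for the sketch and are not the architecture from the paper:

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Toy convolutional policy network for 19x19 Go. Inputs are a few
    feature planes (e.g. own stones, opponent stones, empty points);
    the output is a probability over the 361 board points. Sizes are
    illustrative, not the AlphaGo paper's architecture."""
    def __init__(self, in_planes=3, channels=64, blocks=4):
        super().__init__()
        layers = [nn.Conv2d(in_planes, channels, 3, padding=1), nn.ReLU()]
        for _ in range(blocks):
            layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU()]
        layers.append(nn.Conv2d(channels, 1, 1))  # one logit per board point
        self.net = nn.Sequential(*layers)

    def forward(self, boards):                 # boards: (batch, in_planes, 19, 19)
        logits = self.net(boards).flatten(1)   # (batch, 361)
        return torch.softmax(logits, dim=1)    # move probabilities
```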
link |
00:50:01.840
And so without any search at all, suddenly we had systems which were playing at the level
link |
00:50:06.600
of the best Monte Carlo tree search systems, the ones with randomized rollouts.
link |
00:50:11.840
So first of all, sorry to interrupt, but that's kind of a groundbreaking notion that's like
link |
00:50:17.440
basically a definitive step away from a couple of decades of essentially search dominating
link |
00:50:24.920
AI.
link |
00:50:25.920
Yeah.
link |
00:50:26.920
How did that make you feel? Was it surprising from a scientific perspective? In general, how
link |
00:50:33.080
did it make you feel?
link |
00:50:34.080
I found this to be profoundly surprising.
link |
00:50:37.360
In fact, it was so surprising that we had a bet back then and like many good projects,
link |
00:50:43.640
bets are quite motivating and the bet was whether it was possible for a system based
link |
00:50:50.240
purely on deep learning, no search at all, to beat a dan level human player.
link |
00:50:56.680
And so we had someone who joined our team who was a dan level player, and he came in and
link |
00:51:03.040
we had this first match against him.
link |
00:51:06.120
And which side of the bet were you on, by the way, the losing or the winning side?
link |
00:51:11.640
I tend to be an optimist with the power of deep learning and reinforcement learning.
link |
00:51:18.520
So the system won, and we were able to beat this human dan level player.
link |
00:51:24.320
And for me, that was the moment where it was like, okay, something special is afoot
link |
00:51:28.880
here.
link |
00:51:29.880
We have a system which without search is able to already just look at this position and
link |
00:51:36.160
understand things as well as a strong human player.
link |
00:51:39.760
And from that point onwards, I really felt that reaching the top levels of human play,
link |
00:51:48.560
you know, professional level, world champion level, I felt it was actually an inevitability
link |
00:51:52.880
and if it was an inevitable outcome, I was rather keen that it would be us that
link |
00:52:01.560
achieved it.
link |
00:52:03.120
So we scaled up.
link |
00:52:05.440
This was something where, you know, I had lots of conversations back then with Demis
link |
00:52:10.320
Hassabis, the head of DeepMind, who was extremely excited.
link |
00:52:17.440
And we made the decision to scale up the project, brought more people on board.
link |
00:52:24.640
And so AlphaGo became something where we had a clear goal, which was to try and crack this
link |
00:52:32.400
outstanding challenge of AI to see if we could beat the world's best players.
link |
00:52:37.680
And this led within the space of not so many months to playing against the European champion
link |
00:52:44.800
Fan Hui in a match which became, you know, memorable in history as the first time a go
link |
00:52:50.200
program had ever beaten a professional player.
link |
00:52:54.040
And at that time, we had to make a judgment as to when and whether we should go
link |
00:52:59.560
and challenge the world champion.
link |
00:53:02.120
And this was a difficult decision to make again.
link |
00:53:04.560
We were basing our predictions on our own progress and had to estimate based on the
link |
00:53:10.800
rapidity of our own progress when we thought we would exceed the level of the human world
link |
00:53:16.600
champion and we tried to make an estimate and set up a match and that became the AlphaGo
link |
00:53:22.520
versus Lee Sedol match in 2016.
link |
00:53:27.440
And we should say, spoiler alert, that AlphaGo was able to defeat Lee Sedol.
link |
00:53:33.920
That's right.
link |
00:53:34.920
Yeah.
link |
00:53:35.920
Maybe we could take even a broader view, AlphaGo involves both learning from expert games
link |
00:53:45.840
and, as far as I remember, a self play component too, where it learns by playing against itself.
link |
00:53:54.440
But in your sense, what was the role of learning from expert games there?
link |
00:53:59.240
And in terms of your self evaluation, whether you can take on the world champion, what was
link |
00:54:04.760
the thing that you're trying to do more of, sort of train more on expert games?
link |
00:54:09.520
Or was there another... I'm asking so many poorly phrased questions, but did you have
link |
00:54:17.960
a hope or dream that self play would be the key component at that moment yet?
link |
00:54:24.560
So in the early days of AlphaGo, we used human data to explore the science of what deep learning
link |
00:54:30.480
can achieve.
link |
00:54:31.480
And so when we had our first paper that showed that it was possible to predict the winner
link |
00:54:37.360
of the game, that it was possible to suggest moves, that was done using human data.
link |
00:54:41.160
Oh, solely human data.
link |
00:54:42.800
Yeah.
link |
00:54:43.800
And so the reason that we did it that way was at that time we were exploring separately
link |
00:54:47.560
the deep learning aspect from the reinforcement learning aspect.
link |
00:54:51.240
That was the part which was new and unknown to me at that time: how far could that
link |
00:54:56.440
be stretched?
link |
00:54:58.440
Once we had that, it then became natural to try and use that same representation and
link |
00:55:03.200
see if we could learn for ourselves using that same representation.
link |
00:55:06.680
And so right from the beginning, actually, our goal had been to build a system using
link |
00:55:12.160
self play.
link |
00:55:14.320
And to us, the human data right from the beginning was an expedient step to help us for pragmatic
link |
00:55:20.280
reasons to go faster towards the goals of the project than we might have been able to starting
link |
00:55:25.760
solely from self play.
link |
00:55:27.920
And so in those days, we were very aware that we were choosing to use human data and that
link |
00:55:32.960
might not be the long term holy grail of AI, but that it was something which was extremely
link |
00:55:40.120
useful to us.
link |
00:55:41.120
It helped us to understand the system.
link |
00:55:42.120
It helped us to build deep learning representations which were clear and simple and easy to use.
link |
00:55:48.480
And so really, I would say it served a purpose not just as part of the algorithm, but something
link |
00:55:54.120
which I continue to use in our research today, which is trying to break down a very hard
link |
00:55:59.400
challenge into pieces which are easier to understand for us as researchers and develop.
link |
00:56:04.240
So if you use a component based on human data, it can help you to understand the system such
link |
00:56:10.480
that then you can build the more principled version later that does it for itself.
link |
00:56:15.360
So as I said, the AlphaGo victory, and I don't think I'm sort of romanticizing this
link |
00:56:23.400
notion.
link |
00:56:24.400
I think it's one of the greatest moments in the history of AI.
link |
00:56:27.120
So were you cognizant of the magnitude of the accomplishment at the time?
link |
00:56:32.080
I mean, are you cognizant of it even now?
link |
00:56:35.560
Because to me, I feel like it's something that, as we mentioned, the AGI systems of
link |
00:56:40.800
the future will look back on.
link |
00:56:42.440
I think they'll look back at the AlphaGo victory as like, holy crap, they figured it out.
link |
00:56:49.240
This is where it started.
link |
00:56:51.800
Well thank you again.
link |
00:56:52.800
I mean, it's funny because I guess I've been working on computer
link |
00:56:57.120
go for a long time.
link |
00:56:58.120
So I'd been working at the time of the AlphaGo match on computer go for more than a decade.
link |
00:57:03.120
And throughout that decade, I'd had this dream of what would it be like to, what would it
link |
00:57:07.960
be like really to actually be able to build a system that could play against the world
link |
00:57:13.280
champion.
link |
00:57:14.280
And I imagined that that would be an interesting moment that maybe, you know, some people might
link |
00:57:19.160
care about that and that this might be, you know, a nice achievement.
link |
00:57:24.200
But I think when I arrived in Seoul and discovered the legions of journalists that were following
link |
00:57:32.160
us around and the hundred million people that were watching the match online live, I realized
link |
00:57:38.120
that I'd been off in my estimation of how significant this moment was by several orders
link |
00:57:42.600
of magnitude.
link |
00:57:44.440
And so there was definitely an adjustment process to realize that this was something
link |
00:57:53.040
which the world really cared about and which was a watershed moment.
link |
00:57:58.000
And I think there was that moment of realization, which was also a little bit scary because,
link |
00:58:03.160
you know, if you go into something thinking it's going to be maybe of interest and then
link |
00:58:08.640
discover that a hundred million people are watching, it suddenly makes you worry about
link |
00:58:12.120
whether some of the decisions you've made were really the best ones or the wisest or
link |
00:58:16.240
were going to lead to the best outcome.
link |
00:58:18.400
And we knew for sure that there were still imperfections in AlphaGo, which were going
link |
00:58:22.040
to be exposed to the whole world watching.
link |
00:58:24.480
And so, yeah, it was, I think, a great experience.
link |
00:58:28.280
And I feel privileged to have been part of it, privileged to have led that amazing team.
link |
00:58:35.880
I feel privileged to have been in a moment of history, like you say.
link |
00:58:40.960
But also lucky that, you know, in a sense, I was insulated from the knowledge of, I think
link |
00:58:46.600
it would have been harder to focus on the research if the full kind of reality of what
link |
00:58:51.320
was going to come to pass had been known to me and the team.
link |
00:58:55.280
I think it was, you know, we were in our bubble and we were working on research and we were
link |
00:58:58.840
trying to answer the scientific questions.
link |
00:59:01.640
And then bam, you know, the public sees it.
link |
00:59:04.880
And I think it was better that way in retrospect.
link |
00:59:07.600
Were you confident that, I guess, what were the chances that you could get the win?
link |
00:59:13.720
So just like you said, I'm a little bit more familiar with another accomplishment that
link |
00:59:20.320
we may not even get a chance to talk about.
link |
00:59:22.320
I talked to Oriol Vinyals about AlphaStar, which is another incredible accomplishment.
link |
00:59:26.360
But here, you know, with AlphaStar beating the best at StarCraft, there was like already a track
link |
00:59:32.480
record with AlphaGo.
link |
00:59:34.560
This is like really the first time you get to see reinforcement learning face the best
link |
00:59:40.800
human in the world.
link |
00:59:41.800
So what was your confidence like?
link |
00:59:43.400
What were the odds?
link |
00:59:44.400
Well, we actually...
link |
00:59:45.400
Was there a bet?
link |
00:59:46.400
Funnily enough, there was.
link |
00:59:50.520
So just before the match, we weren't betting on anything concrete, but we all held out
link |
00:59:55.880
a hand.
link |
00:59:56.880
Everyone in the team held out a hand at the beginning of the match.
link |
00:59:59.760
And the number of fingers that they had out on that hand was supposed to represent how
link |
01:00:02.960
many games they thought we would win against Lee Sedol.
link |
01:00:06.480
And there was an amazing spread in the team's predictions.
link |
01:00:10.600
But I have to say, I predicted four one.
link |
01:00:15.520
And the reason was based purely on data.
link |
01:00:18.680
So I'm a scientist first and foremost.
link |
01:00:20.840
And one of the things which we had established was that AlphaGo in around one in five games
link |
01:00:27.200
would develop something which we called a delusion, which was a kind of hole in its
link |
01:00:31.480
knowledge where it wasn't able to fully understand everything about the position.
link |
01:00:36.200
And that hole in its knowledge would persist for tens of moves throughout the game.
link |
01:00:41.800
And we knew two things.
link |
01:00:42.800
We knew that if there were no delusions that AlphaGo seemed to be playing at a level that
link |
01:00:46.680
was far beyond any human capabilities.
link |
01:00:49.520
But we also knew that if there were delusions, the opposite was true.
link |
01:00:55.360
And in fact, that's what came to pass.
link |
01:00:58.320
We saw all of those outcomes, and Lee Sedol in one of the games played a really beautiful
link |
01:01:03.920
sequence that AlphaGo just hadn't predicted.
link |
01:01:08.320
And after that, it led it into this situation where it was unable to really understand the
link |
01:01:14.280
position fully and found itself in one of these delusions.
link |
01:01:18.480
So indeed, four one was the outcome.
link |
01:01:20.880
So yeah.
link |
01:01:21.880
And can you maybe speak to it a little bit more?
link |
01:01:23.400
What were the five games?
link |
01:01:25.880
What happened? Are there interesting things that come to memory in terms of the play of
link |
01:01:31.520
the human or the machine?
link |
01:01:33.720
So I remember all of these games vividly, of course.
link |
01:01:37.440
Moments like these don't come too often in the lifetime of a scientist.
link |
01:01:42.760
And the first game was magical because it was the first time that a computer program
link |
01:01:52.680
had defeated a world champion in this grand challenge of go.
link |
01:01:57.320
And there was a moment where AlphaGo invaded Lee Sedol's territory towards the end of the
link |
01:02:06.120
game.
link |
01:02:08.600
And that's quite an audacious thing to do.
link |
01:02:10.040
It's like saying, hey, you thought this was going to be your territory in the game, but
link |
01:02:12.680
I'm going to stick a stone right in the middle of it and prove to you that I can break it
link |
01:02:17.080
up.
link |
01:02:18.280
And Lee Sedol's face just dropped.
link |
01:02:20.320
He wasn't expecting a computer to do something that audacious.
link |
01:02:26.200
The second game became famous for a move known as Move 37.
link |
01:02:30.920
This was a move that was played by AlphaGo that broke all of the conventions of go.
link |
01:02:38.560
The Go players were so shocked by this, they thought that maybe the operator had made
link |
01:02:43.840
a mistake.
link |
01:02:45.960
They thought there was something crazy going on, and it just broke every rule that go players
link |
01:02:50.480
are taught from a very young age.
link |
01:02:52.280
They're just taught, you know, this kind of move called a shoulder hit.
link |
01:02:56.320
You can only play it on the third line or the fourth line, and AlphaGo played it on
link |
01:02:59.720
the fifth line.
link |
01:03:01.560
And it turned out to be a brilliant move and made this beautiful pattern in the middle
link |
01:03:05.240
of the board that ended up winning the game.
link |
01:03:08.680
And so this really was a clear instance where we could say computers exhibited creativity,
link |
01:03:16.120
that this was really a move that was something humans hadn't known about, hadn't anticipated.
link |
01:03:22.840
And computers discovered this idea.
link |
01:03:24.800
They were the ones to say, actually, here's a new idea, something new, not in the domains
link |
01:03:30.320
of human knowledge of the game.
link |
01:03:34.320
And now the humans think this is a reasonable thing to do, and it's part of go knowledge
link |
01:03:40.160
now.
link |
01:03:41.160
The third game, something special happens when you play against a human world champion,
link |
01:03:46.720
which again, I hadn't anticipated before going there, which is, you know, these players
link |
01:03:52.080
are amazing.
link |
01:03:53.080
Lee Sedol was a true champion, 18 time world champion, and had this amazing ability to
link |
01:03:59.360
probe AlphaGo for weaknesses of any kind.
link |
01:04:03.520
And in the third game, he was losing, and we felt we were sailing comfortably to victory,
link |
01:04:09.880
but he managed to, from nothing, stir up this fight and build what's called a double ko,
link |
01:04:17.120
these kind of repetitive positions.
link |
01:04:20.560
And he knew that historically, no computer go program had ever been able to deal correctly
link |
01:04:25.200
with double ko positions, and he managed to summon one out of nothing.
link |
01:04:30.000
And so for us, you know, this was a real challenge, like would AlphaGo be able to deal with this,
link |
01:04:35.320
or would it just kind of crumble in the face of this situation?
link |
01:04:38.800
And fortunately, it dealt with it perfectly.
link |
01:04:41.480
The fourth game was amazing in that Lee Sedol appeared to be losing this game, AlphaGo thought
link |
01:04:48.880
it was winning, and then Lee Sedol did something which I think only a true world champion can
link |
01:04:54.720
do, which is he found a brilliant sequence in the middle of the game, a brilliant sequence
link |
01:04:59.760
that led him to really just transform the position, it kind of, he found just a piece
link |
01:05:09.400
of genius really.
link |
01:05:11.000
And after that, AlphaGo, its evaluation just tumbled, it thought it was winning this game,
link |
01:05:17.240
and all of a sudden it tumbled and said, oh, now I've got no chance, and it starts to behave
link |
01:05:22.160
rather oddly at that point.
link |
01:05:24.440
In the final game, for some reason, we as a team were convinced having seen AlphaGo in
link |
01:05:29.400
the previous game suffer from delusions, we as a team were convinced that it was suffering
link |
01:05:34.560
from another delusion, we were convinced that it was misevaluating the position and
link |
01:05:38.280
that something was going terribly wrong.
link |
01:05:41.280
And it was only in the last few moves of the game that we realized that actually, although
link |
01:05:46.600
it had been predicting it was going to win all the way through, it really was.
link |
01:05:51.440
And so somehow, you know, it just taught us yet again that you have to have faith in your
link |
01:05:55.680
systems when they exceed your own level of ability and your own judgment, you have to
link |
01:06:00.720
trust in them to know better than you, the designer, once you've bestowed in them the
link |
01:06:07.040
ability to judge better than you can, then trust the system to do so.
link |
01:06:13.160
So just like in the case of Deep Blue beating Gary Kasparov, so Gary is, I think the first
link |
01:06:21.680
time he's ever lost actually to anybody, and I mean, there's a similar situation with
link |
01:06:27.120
Lee Sedol, it's a tragic, it's a tragic loss for humans, but a beautiful one.
link |
01:06:36.400
I think that's kind of, from the tragedy, sort of emerges over time, emerges a kind
link |
01:06:45.280
of inspiring story. But Lee Sedol recently announced his retirement, and I don't know if we can
link |
01:06:54.600
look too deeply into it, but he did say that even if I become number one, there's an entity
link |
01:07:00.220
that cannot be defeated.
link |
01:07:02.720
So what do you think about these words?
link |
01:07:05.560
What do you think about his retirement from the game of Go?
link |
01:07:07.720
Well, let me take you back first of all to the first part of your comment about Gary Kasparov,
link |
01:07:12.560
who was actually at the panel yesterday.
link |
01:07:15.760
He specifically said that when he first lost to Deep Blue, he viewed it as a failure.
link |
01:07:22.420
He viewed that this had been a failure of his, but later on in his career, he said he'd
link |
01:07:27.240
come to realize that actually it was a success, it was a success for everyone, because this
link |
01:07:32.240
marked a transformational moment for AI, and so even for Gary Kasparov, he came to realize
link |
01:07:40.080
that that moment was pivotal and actually meant something much more than his personal
link |
01:07:47.480
loss in that moment.
link |
01:07:49.920
Lee Sedol, I think, was much more cognizant of that even at the time.
link |
01:07:54.920
So in his closing remarks to the match, he really felt very strongly that what had happened
link |
01:08:02.080
in the AlphaGo match was not only meaningful for AI, but for humans as well, and he felt
link |
01:08:06.920
as a go player that it had opened his horizons and meant that he could start exploring new
link |
01:08:12.280
things.
link |
01:08:13.280
It brought his joy back for the game of go because it had broken all of the conventions
link |
01:08:17.960
and barriers and meant that suddenly anything was possible again.
link |
01:08:23.640
And so I was sad to hear that he'd retired, but he's been a great world champion over
link |
01:08:29.800
many, many years, and I think he'll be remembered for that ever more.
link |
01:08:36.280
He'll be remembered as the last person to beat AlphaGo.
link |
01:08:39.360
I mean, after that, we increased the power of the system, and the next version of AlphaGo
link |
01:08:45.760
beat the other strong human players 60 games to nil.
link |
01:08:52.400
So what a great moment for him and something to be remembered for.
link |
01:08:58.120
It's interesting that you spent time at AAAI on a panel with Gary Kasparov.
link |
01:09:05.440
But I mean, it's almost, I'm just curious to learn the conversations you've had with
link |
01:09:12.680
Gary because he's also now, he's written a book about artificial intelligence.
link |
01:09:17.480
He's thinking about AI.
link |
01:09:18.960
He has kind of a view of it, and he talks about AlphaGo a lot.
link |
01:09:24.080
What's your sense, arguably, I'm not just being Russian, but I think Gary is the greatest
link |
01:09:30.560
chess player of all time.
link |
01:09:32.880
Probably one of the greatest game players of all time.
link |
01:09:36.840
And you sort of at the center of creating a system that beats one of the greatest players
link |
01:09:44.440
of all time.
link |
01:09:45.440
So what is that conversation like?
link |
01:09:46.720
Is there anything, any interesting digs, any bets, any funny things, any profound things?
link |
01:09:53.760
So Gary Kasparov has an incredible respect for what we did with AlphaGo, and it's an
link |
01:10:02.560
amazing tribute coming from him, of all people, that he really appreciates and respects what
link |
01:10:10.120
we've done.
link |
01:10:11.880
And I think he feels that way about the progress which has happened in computer chess, where later,
link |
01:10:17.880
after AlphaGo, we built the AlphaZero system, which defeated the world's strongest chess
link |
01:10:24.920
programs.
link |
01:10:26.920
And to Gary Kasparov, that moment in computer chess was more profound than Deep Blue.
link |
01:10:33.160
And the reason he believes it mattered more was because it was done with learning and
link |
01:10:37.680
a system which was able to discover for itself new principles, new ideas, which were able
link |
01:10:42.720
to play the game in a way which he hadn't known about, or anyone else.
link |
01:10:50.320
And in fact, one of the things I discovered at this panel was that the current world champion
link |
01:10:55.240
Magnus Carlsen apparently recently commented on his improvement in performance, and he
link |
01:11:02.400
attributes it to AlphaZero, that he's been studying the games of AlphaZero, he's changed
link |
01:11:06.280
his style to play more like AlphaZero.
link |
01:11:08.920
And it's led to him actually increasing his rating to a new peak.
link |
01:11:14.800
Yeah, I guess to me, just like to Gary, the inspiring thing is that, and just like you
link |
01:11:20.960
said with reinforcement learning, reinforcement learning and deep learning, machine learning
link |
01:11:26.920
feels like what intelligence is.
link |
01:11:30.000
And you could attribute it to sort of a bitter viewpoint from Gary's perspective, from us
link |
01:11:38.240
humans perspective, saying that pure search that IBM Deep Blue was doing is not really
link |
01:11:44.600
intelligence, but somehow it didn't feel like it.
link |
01:11:47.840
And so that's the magical.
link |
01:11:49.040
I'm not sure what it is about learning that feels like intelligence, but it does.
link |
01:11:54.720
So I think we should not demean the achievements of what was done in previous areas of AI.
link |
01:12:00.040
I think that Deep Blue was an amazing achievement in itself, and that heuristic search of the
link |
01:12:06.800
kind that was used by Deep Blue had some powerful ideas that were in there.
link |
01:12:11.480
But it also missed some things.
link |
01:12:13.320
So the fact that the evaluation function, the way that the chess position was understood,
link |
01:12:18.720
was created by humans and not by the machine is a limitation, which means that there's
link |
01:12:26.640
a ceiling on how well it can do.
link |
01:12:29.000
But maybe more importantly, it means that the same idea cannot be applied in other domains
link |
01:12:33.600
where we don't have access to the kind of human grandmasters and that ability to kind
link |
01:12:39.400
of encode exactly their knowledge into an evaluation function.
link |
01:12:42.760
And the reality is that the story of AI is that most domains turn out to be of the second
link |
01:12:48.200
type where knowledge is messy, it's hard to extract from experts or it isn't even available.
link |
01:12:54.120
And so we need to solve problems in a different way.
link |
01:12:59.920
And I think AlphaGo is a step towards solving things in a way which puts learning as a first
link |
01:13:06.760
class citizen and says, systems need to understand for themselves how to understand the world,
link |
01:13:14.080
how to judge the value of any action that they might take within that world in any state
link |
01:13:21.360
they might find themselves in.
link |
01:13:23.080
And in order to do that, we make progress towards AI.
link |
01:13:28.920
Yeah.
link |
01:13:29.920
So one of the nice things about this, about taking a learning approach to the game of
link |
01:13:35.040
go or game playing is that the things you learn, the things you figure out are actually
link |
01:13:39.680
going to be applicable to other problems that are real world problems.
link |
01:13:44.320
That's ultimately, I mean, there's two really interesting things about AlphaGo.
link |
01:13:49.280
One is the science of it, just the science of learning, the science of intelligence.
link |
01:13:54.720
And then the other is, while you're doing that, you're actually figuring out how to build systems
link |
01:13:59.680
that would be potentially applicable in other applications, medical, autonomous vehicles,
link |
01:14:05.960
robotics.
link |
01:14:06.960
And it just opens the door to all kinds of applications.
link |
01:14:10.800
So the next incredible step, really the profound step is probably AlphaGo Zero.
link |
01:14:18.080
I mean, it's arguable I kind of see them all as the same place, but really, and perhaps
link |
01:14:23.720
you were already thinking that AlphaGo Zero is the natural, it was always going to be
link |
01:14:28.040
the next step, but it's removing the reliance on human expert games for pre training as
link |
01:14:34.480
you mentioned.
link |
01:14:35.600
So how big of an intellectual leap was this, that self play could achieve superhuman level
link |
01:14:43.480
performance on its own?
link |
01:14:45.800
And maybe could you also say what is self play, we kind of mentioned it a few times.
link |
01:14:51.840
So let me start with self play.
link |
01:14:55.440
So the idea of self play is something which is really about systems learning for themselves,
link |
01:15:02.240
but in the situation where there's more than one agent.
link |
01:15:05.840
And so if you're in a game, and the game is played between two players, then self play
link |
01:15:11.040
is really about understanding that game just by playing games against yourself rather than
link |
01:15:17.840
against any actual real opponent.
link |
01:15:20.120
And so it's a way to kind of discover strategies without having to actually need to go out and
link |
01:15:27.520
play against any particular human player, for example.
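A minimal sketch of that idea, assuming hypothetical `game` and `policy` interfaces: one policy plays both sides, and the finished game becomes training data.

```python
def self_play_episode(game, policy):
    """Minimal self-play sketch with hypothetical `game` and `policy`
    interfaces: the same policy chooses moves for both sides, and the
    finished game (positions, moves, final result) becomes training data."""
    state, history = game.initial_state(), []
    while not game.is_terminal(state):
        move = policy.choose(state)    # one policy plays both colors
        history.append((state, move))
        state = game.play(state, move)
    return history, game.winner(state)
```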
link |
01:15:36.080
The main idea of AlphaZero was really to try and step back from any of the knowledge
link |
01:15:45.240
that we put into the system and ask the question, is it possible to come up with a single elegant
link |
01:15:52.040
principle by which a system can learn for itself all of the knowledge which it requires
link |
01:15:58.000
to play a game such as Go?
link |
01:16:01.440
Importantly by taking knowledge out, you not only make the system less brittle in the sense
link |
01:16:08.760
that perhaps the knowledge you were putting in was just getting in the way and maybe stopping
link |
01:16:12.840
the system learning for itself, but also you make it more general.
link |
01:16:17.920
The more knowledge you put in, the harder it is for a system to actually be
link |
01:16:23.640
taken out of the system in which it's kind of been designed, and placed in some other
link |
01:16:28.320
system that maybe would need a completely different knowledge base to understand and
link |
01:16:31.440
perform well.
link |
01:16:32.920
And so the real goal here is to strip out all of the knowledge that we put in to the
link |
01:16:38.040
point that we can just plug it into something totally different.
link |
01:16:41.960
And that to me is really the promise of AI is that we can have systems such as that,
link |
01:16:47.360
which no matter what the goal is, no matter what goal we set to the system, we can come
link |
01:16:53.760
up with, we have an algorithm which can be placed into that world, into that environment
link |
01:16:58.640
and can succeed in achieving that goal.
link |
01:17:02.000
And then that to me is almost the essence of intelligence if we can achieve that.
link |
01:17:08.120
And so AlphaZero is a step towards that, and it's a step that was taken in the context
link |
01:17:13.600
of two player perfect information games like Go and chess.
link |
01:17:18.840
We also applied it to Japanese chess.
link |
01:17:21.560
So just to clarify, the first step was AlphaGo Zero.
link |
01:17:25.560
The first step was to try and take all of the knowledge out of AlphaGo in such a way
link |
01:17:31.520
that it could play in a fully self discovered way, purely from self play.
link |
01:17:39.680
And to me, the motivation for that was always that we could then plug it into other domains,
link |
01:17:45.080
but we saved that until later.
link |
01:17:48.080
Well, in fact, I mean, just for fun, I could tell you exactly the moment where the idea
link |
01:17:55.280
for AlphaZero occurred to me, because I think there's maybe a lesson there for researchers
link |
01:18:00.280
who are kind of too deeply embedded in their research and working 24/7 to try and come
link |
01:18:05.840
up with the next idea, which is, it actually occurred to me on honeymoon, and I was at
link |
01:18:14.440
my most fully relaxed state, really enjoying myself, and just, like, the algorithm
link |
01:18:23.600
for AlphaZero just appeared, in its full form, and this was actually before we played
link |
01:18:31.240
against Lee Sedol, but we just didn't, I think we were so busy trying to make sure we could
link |
01:18:39.360
beat the world champion, that it was only later that we had the opportunity to step
link |
01:18:46.280
back and start examining that sort of deeper scientific question of whether this could
link |
01:18:51.280
really work.
link |
01:18:52.440
So nevertheless, so self play is probably one of the most profound ideas that represents
link |
01:19:02.200
to me at least, artificial intelligence.
link |
01:19:05.760
But the fact that you could use that kind of mechanism to, again, beat world class players,
link |
01:19:13.240
that's very surprising.
link |
01:19:14.920
So to me, it feels like you would have to train on a large number of expert games.
link |
01:19:21.380
So was it surprising to you?
link |
01:19:22.840
What was the intuition?
link |
01:19:23.840
Can you sort of think, not necessarily at that time, even now, what's your intuition?
link |
01:19:28.080
Why this thing works so well?
link |
01:19:29.560
Why is it able to learn from scratch?
link |
01:19:31.600
Well, let me first say why we tried it.
link |
01:19:34.640
So we tried it both because I feel that it was the deeper scientific question to be asking,
link |
01:19:40.160
to make progress towards AI.
link |
01:19:42.200
And also because in general, in my research, I don't like to do research on questions for
link |
01:19:47.520
which we already know the likely outcome.
link |
01:19:51.080
I don't see much value in running an experiment where you're 95% confident that you will succeed.
link |
01:19:57.720
And so we could have tried maybe to take AlphaGo and do something which we knew for sure it
link |
01:20:03.840
would succeed on.
link |
01:20:04.840
But much more interesting to me was to try it on the things which we weren't sure about.
link |
01:20:09.640
And one of the big questions on our minds back then was, could you really do this with
link |
01:20:15.160
self play alone?
link |
01:20:16.280
How far could that go?
link |
01:20:17.800
Would it be as strong?
link |
01:20:19.720
And honestly, we weren't sure.
link |
01:20:21.960
It was 50, 50, I think.
link |
01:20:25.440
If you'd asked me, I wasn't confident that it could reach the same level as these systems,
link |
01:20:30.800
but it felt like the right question to ask.
link |
01:20:33.960
And even if it had not achieved the same level, I felt that that was an important direction
link |
01:20:41.640
to be studying.
link |
01:20:43.040
And so then, lo and behold, it actually ended up outperforming the previous version of AlphaGo
link |
01:20:52.280
and indeed was able to beat it by 100 games to zero.
link |
01:20:56.080
So what's the intuition as to why?
link |
01:20:59.760
I think the intuition to me is clear that whenever you have errors in a system, as we
link |
01:21:08.640
did in AlphaGo, AlphaGo suffered from these delusions.
link |
01:21:11.760
Occasionally, it would misunderstand what was going on in a position and misevaluate
link |
01:21:15.280
it.
link |
01:21:16.280
How can you remove all of these errors?
link |
01:21:20.000
Errors arise from many sources.
link |
01:21:21.960
For us, they were arising both from starting from the human data, but also from the nature
link |
01:21:27.120
of the search and the nature of the algorithm itself.
link |
01:21:29.960
But the only way to address them in any complex system is to give the system the ability to
link |
01:21:36.280
correct its own errors.
link |
01:21:38.120
It must be able to correct them.
link |
01:21:39.560
It must be able to learn for itself when it's doing something wrong and correct for it.
link |
01:21:44.800
And so it seemed to me that the way to correct delusions was indeed to have more iterations
link |
01:21:50.600
of reinforcement learning, that no matter where you start, you should be able to correct
link |
01:21:54.720
those errors until it gets to play that out and understand, oh, well, I thought that I
link |
01:22:00.160
was going to win in this situation, but then I ended up losing.
link |
01:22:03.480
That suggests that I was misevaluating something, there's a hole in my knowledge and now the
link |
01:22:07.320
system can correct for itself and understand how to do better.
link |
01:22:11.600
Now if you take that same idea and trace it back all the way to the beginning, it should
link |
01:22:16.480
be able to take you from no knowledge, from completely random starting point, all the
link |
01:22:22.080
way to the highest levels of knowledge that you can achieve in a domain.
link |
01:22:27.280
And the principle is the same, that if you give, if you bestow a system with the ability
link |
01:22:31.800
to correct its own errors, then it can take you from random to something slightly better
link |
01:22:36.400
than random because it sees the stupid things that the random is doing and it can correct
link |
01:22:41.160
them.
link |
01:22:42.160
And then it can take you from that slightly better system and understand, well, what's
link |
01:22:44.920
that doing wrong?
link |
01:22:45.920
And it takes you on to the next level and the next level and this progress can go on indefinitely.
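One hedged sketch of that error-correcting loop, with hypothetical game, policy, value-network, and encoding interfaces: positions the system thought it was winning but went on to lose get relabeled by the actual outcome.

```python
import torch
import torch.nn.functional as F

def improvement_step(game, policy, value_net, optimizer, n_games=100):
    """Hedged sketch of the error-correcting loop: `game`, `policy`,
    `value_net`, and `encode` are all hypothetical interfaces. Positions
    the network thought were winning but that went on to lose get pulled
    toward the result that actually happened."""
    for _ in range(n_games):
        state, visited = game.initial_state(), []
        while not game.is_terminal(state):
            visited.append(state)
            state = game.play(state, policy.choose(state))
        winner = game.winner(state)
        for s in visited:
            # Relabel the position with reality: 1 if the side to move won.
            target = torch.tensor(1.0 if game.player_to_move(s) == winner else 0.0)
            loss = F.mse_loss(value_net(game.encode(s)).squeeze(), target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```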
link |
01:22:53.120
And indeed, what would have happened if we'd carried on training AlphaGo Zero for longer?
link |
01:22:59.520
We saw no sign of it slowing down its improvements, or at least it was certainly carrying on to
link |
01:23:04.960
improve.
link |
01:23:06.920
And presumably, if you had the computational resources, this could lead to better and better
link |
01:23:13.760
systems that discover more and more.
link |
01:23:15.840
So your intuition is fundamentally there's not a ceiling to this process.
link |
01:23:21.960
One of the surprising things, just like you said, is the process of patching errors.
link |
01:23:27.560
It intuitively makes sense that reinforcement learning should be part of that process.
link |
01:23:33.760
But what is surprising is in the process of patching your own lack of knowledge, you don't
link |
01:23:40.120
open up other holes.
link |
01:23:42.440
You keep sort of like there's a monotonic decrease of your weaknesses.
link |
01:23:47.880
Well, let me back this up.
link |
01:23:50.120
I think science always should make falsifiable hypotheses.
link |
01:23:53.000
Yes.
link |
01:23:54.000
So let me back up this claim with a falsifiable hypothesis, which is that if someone was to,
link |
01:23:59.240
in the future, take AlphaZero as an algorithm and run it with greater computational
link |
01:24:06.840
resources than we had available today, then I would predict that they would be able to
link |
01:24:12.880
beat the previous system 100 games to zero.
link |
01:24:15.480
And that if they were then to do the same thing a couple of years later, that that would
link |
01:24:19.640
beat that previous system 100 games to zero, and that that process would continue indefinitely
link |
01:24:25.120
throughout at least my human lifetime.
link |
01:24:28.000
Probably the game of go would set the ceiling.
link |
01:24:30.640
I mean, the game of go would set the ceiling, but the game of go has 10 to the 170 states
link |
01:24:35.320
in it.
link |
01:24:36.320
So the ceiling is unreachable by any computational device that can be built out of the 10 to
link |
01:24:43.880
the 80 atoms in the universe.
link |
01:24:46.720
You asked a really good question, which is, do you not open up other errors when you correct
link |
01:24:52.360
your previous ones?
link |
01:24:53.800
And the answer is yes, you do.
link |
01:24:56.320
And so it's a remarkable fact about this class of two player game and also true of single
link |
01:25:03.520
agent games, that essentially progress will always lead you forward. If you have sufficient
link |
01:25:13.640
representational resources, like imagine you could represent every state in a big
link |
01:25:17.920
table of the game, then we know for sure that a process of self improvement will lead all
link |
01:25:25.040
the way in the single agent case to the optimal possible behavior, and in the two player case
link |
01:25:29.960
to the minimax optimal behavior, that is, the best way that I can play knowing that you're
link |
01:25:35.760
playing perfectly against me.
link |
01:25:38.120
And so for those cases, we know that even if you do open up some new error that in some
link |
01:25:45.080
sense you've made progress, you're progressing towards the best that can be done.
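The tabular claim can be made concrete with a toy minimax backup; assuming a small game with hashable states and a hypothetical `game` interface like the earlier sketches, backing values up over the full table yields exactly this minimax-optimal play.

```python
def minimax_value(state, game, cache=None):
    """Toy tabular minimax for a tiny two-player zero-sum game, with a
    hypothetical `game` interface and hashable states. With the whole
    game held in a table, the backed-up values are the minimax-optimal
    ones: the best you can do against a perfect opponent."""
    cache = {} if cache is None else cache
    if state not in cache:
        if game.is_terminal(state):
            cache[state] = game.outcome(state)  # +1 / 0 / -1 from player one's view
        else:
            child_values = [minimax_value(game.play(state, m), game, cache)
                            for m in game.legal_moves(state)]
            # Player one maximizes, player two minimizes.
            cache[state] = (max(child_values)
                            if game.player_to_move(state) == 1
                            else min(child_values))
    return cache[state]
```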
link |
01:25:50.600
So AlphaGo was initially trained on expert games with some self play. AlphaGo Zero removed
link |
01:25:57.960
the need to be trained on expert games.
link |
01:26:00.520
And then another incredible step for me, because I just love chess, is to generalize that further,
link |
01:26:07.880
in AlphaZero, to be able to play the game of Go, beating AlphaGo Zero and AlphaGo,
link |
01:26:14.880
and then also being able to play the game of chess and others.
link |
01:26:19.320
So what was that step like?
link |
01:26:21.200
What's the interesting aspects there that required to make that happen?
link |
01:26:26.640
I think the remarkable observation which we saw with AlphaZero was that actually without
link |
01:26:33.240
modifying the algorithm at all, it was able to play and crack some of AI's greatest previous
link |
01:26:39.800
challenges.
link |
01:26:41.400
In particular, we dropped it into the game of chess.
link |
01:26:45.080
And unlike the previous systems like Deep Blue, which had been worked on for years and years,
link |
01:26:51.800
we were able to beat the world's strongest computer chess program convincingly using
link |
01:26:57.520
a system that discovered everything on its own, from scratch, with its own principles.
link |
01:27:05.080
And in fact, one of the nice things that we found was that in fact, we also achieved the
link |
01:27:10.960
same result in Japanese chess, a variant of chess where you get to capture pieces and then
link |
01:27:15.600
place them back down on your own side as an extra piece.
link |
01:27:19.120
So a much more complicated variant of chess.
link |
01:27:22.040
And we also beat the world's strongest programs and reached superhuman performance in that
link |
01:27:26.840
game too.
link |
01:27:28.440
And the very first time that we'd ever run the system on that particular game was
link |
01:27:34.760
the version that we published in the paper on Alpha zero.
link |
01:27:38.880
It just worked out of the box, literally no touching it, we didn't have to do anything
link |
01:27:42.880
and there it was, superhuman performance, no tweaking, no twiddling.
link |
01:27:48.040
And so I think there's something beautiful about that principle that you can take an
link |
01:27:52.080
algorithm and without twiddling anything, it just works.
link |
01:27:57.880
Now to go beyond AlphaZero, what's required?
link |
01:28:03.040
AlphaZero is just a step.
link |
01:28:05.640
And there's a long way to go beyond that to really crack the deep problems of AI.
link |
01:28:10.360
But one of the important steps is to acknowledge that the world is a really messy place.
link |
01:28:16.360
It's this rich, complex, beautiful, but messy environment that we live in and no one gives
link |
01:28:22.560
us the rules.
link |
01:28:23.560
Like no one knows the rules of the world, at least maybe we understand that it operates
link |
01:28:28.440
according to Newtonian or quantum mechanics at the micro level or according to relativity
link |
01:28:33.960
at the macro level, but that's not a model that's useful for us as people to operate
link |
01:28:39.280
in it.
link |
01:28:40.520
Somehow the agent needs to understand the world for itself in a way where no one tells
link |
01:28:45.240
it the rules of the game and yet it can still figure out what to do in that world, deal
link |
01:28:51.080
with this stream of observations coming in, rich sensory input coming in, actions going
link |
01:28:55.920
out in a way that allows it to reason in the way that Alpha zero can reason in the way
link |
01:29:01.720
that these go and chess playing programs can reason, but in a way that allows it to take
link |
01:29:07.320
actions in that messy world to achieve its goals.
link |
01:29:11.600
And so this led us to the most recent step in the story of AlphaGo, which was a system
link |
01:29:18.160
called MuZero, and MuZero is a system which learns for itself even when the rules are
link |
01:29:24.240
not given to it.
link |
01:29:25.520
It actually can be dropped into a system with messy perceptual inputs.
link |
01:29:29.800
We actually tried it in some Atari games, the canonical domains of Atari that have
link |
01:29:36.720
been used for reinforcement learning, and this system learned to build a model of these
link |
01:29:43.280
Atari games that was sufficiently rich and useful enough for it to be able to plan successfully.
link |
01:29:51.520
And in fact, that system not only went on to beat the state of the art in Atari, but
link |
01:29:56.800
the same system without modification was able to reach the same level of superhuman performance
link |
01:30:03.000
in Go, Chess, and Shogi that we'd seen in Alpha Zero, showing that even without the
link |
01:30:08.280
rules, the system can learn for itself just by trial and error, just by playing this game
link |
01:30:12.440
of Go, and no one tells you what the rules are, but you just get to the end and someone
link |
01:30:16.880
says, you know, win or loss, or you play this game of chess and someone says win or loss,
link |
01:30:22.080
or you play a game of breakout in Atari and someone just tells you, you know, your score
link |
01:30:27.160
at the end.
link |
01:30:28.160
And the system for itself figures out essentially the rules of the system, the dynamics of the
link |
01:30:32.720
world, how the world works, and not in any explicit way, but just implicitly enough understanding
link |
01:30:40.720
for it to be able to plan in that system in order to achieve its goals.
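In the spirit of that description (and emphatically not the published MuZero architecture), here is a heavily simplified sketch of a learned model with representation, dynamics, and prediction components; all sizes are illustrative.

```python
import torch
import torch.nn as nn

class LearnedModel(nn.Module):
    """Heavily simplified sketch in the spirit of MuZero's learned model
    (not the published architecture; all sizes illustrative): a
    representation function encodes a raw observation into a hidden
    state, a dynamics function rolls that hidden state forward given an
    action, and prediction heads output a policy and a value, so planning
    can happen entirely in hidden space, without the real rules."""
    def __init__(self, obs_dim=64, hidden=128, n_actions=8):
        super().__init__()
        self.represent = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.dynamics = nn.Sequential(nn.Linear(hidden + n_actions, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)
        self.value_head = nn.Linear(hidden, 1)

    def initial_state(self, observation):
        return self.represent(observation)

    def next_state(self, hidden_state, action_onehot):
        return self.dynamics(torch.cat([hidden_state, action_onehot], dim=-1))

    def predict(self, hidden_state):
        return self.policy_head(hidden_state), self.value_head(hidden_state)
```

Planning then runs entirely inside the hidden space: roll `next_state` forward under candidate actions and score the resulting states with `predict`, never touching the environment's real rules.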
link |
01:30:45.600
And that's the fundamental process that you have to go through when you're facing
link |
01:30:49.920
any uncertain kind of environment, like you would in the real world: figuring out
link |
01:30:54.040
sort of the rules, the basic rules of the game.
link |
01:30:56.720
That's right.
link |
01:30:57.720
So that allows it to be applicable to basically any domain that could be digitized in the
link |
01:31:06.000
way that it needs to in order to be consumable, sort of in order for the reinforcement learning
link |
01:31:11.680
framework to be able to sense the environment, to be able to act in the environment, and so on.
link |
01:31:15.640
The full reinforcement learning problem needs to deal with worlds that are unknown and complex
link |
01:31:21.280
and the agent needs to learn for itself how to deal with that.
link |
01:31:24.960
And so MuZero is a step, a further step in that direction.
link |
01:31:29.880
One of the things that inspired the general public, and just in conversations I have with
link |
01:31:33.960
my parents or something, with my mom, who just loves what was done, is kind of at least
link |
01:31:39.640
a notion that there was some display of creativity, some new strategies, new behaviors that were
link |
01:31:45.320
created.
link |
01:31:46.320
That again has echoes of intelligence.
link |
01:31:49.000
So is there something that stands out?
link |
01:31:50.840
Do you see it the same way that there's creativity and there's some behaviors, patterns that
link |
01:31:56.640
you saw that AlphaZero was able to display that are truly creative?
link |
01:32:02.040
So let me start by saying that I think we should ask what creativity really means.
link |
01:32:08.360
So to me, creativity means discovering something which wasn't known before, something unexpected,
link |
01:32:17.120
something outside of our norms.
link |
01:32:19.800
And so in that sense, the process of reinforcement learning or the self play approach that was
link |
01:32:27.920
used by AlphaZero is the essence of creativity.
link |
01:32:32.240
It's really saying at every stage, you're playing according to your current norms and
link |
01:32:37.280
you try something.
link |
01:32:39.040
And if it works out, you say, hey, here's something great, I'm going to start using that.
link |
01:32:45.040
And then that process, it's like a micro discovery that happens millions and millions of times
link |
01:32:49.920
over the course of the algorithm's life, where it just discovers some new idea, oh, this
link |
01:32:55.000
pattern, this pattern's working really well for me, I'm going to start using that.
link |
01:32:58.400
Oh, now, oh, here's this other thing I can do, I can start to connect these stones together
link |
01:33:03.080
in this way, or I can start to sacrifice stones or give up on pieces or play shoulder hits
link |
01:33:10.520
on the fifth line or whatever it is.
link |
01:33:12.460
The system's discovering things like this for itself continually, repeatedly all the
link |
01:33:16.160
time.
link |
01:33:17.160
And so it should come as no surprise to us then, when if you leave these systems going,
link |
01:33:22.200
that they discover things that are not known to humans, that to the human norms are considered
link |
01:33:29.520
creative.
link |
01:33:30.720
And we've seen this several times, in fact, in AlphaGo Zero, we saw this beautiful timeline
link |
01:33:37.560
of discovery where what we saw was that there are these opening patterns that humans play
link |
01:33:44.520
called joseki, these are like the patterns that humans learn to play in the corners and
link |
01:33:48.800
they've been developed and refined over literally thousands of years in the game of Go.
link |
01:33:53.360
And what we saw was in the course of the training AlphaGo Zero, over the course of the 40 days
link |
01:34:00.000
that we trained this system, it starts to discover exactly these patterns that human
link |
01:34:05.920
players play.
link |
01:34:07.200
And over time, we found that all of the joseki that humans played were discovered by the
link |
01:34:12.760
system through this process of self play and this sort of essential notion of creativity.
link |
01:34:19.720
But what was really interesting was that over time, it then starts to discard some of these
link |
01:34:24.320
in favor of its own joseki that humans didn't know about.
link |
01:34:27.880
And it starts to say, oh, well, you thought that the knight's move pincer joseki was a great
link |
01:34:33.400
idea.
link |
01:34:35.200
But here's something different you can do there, which makes some new variation that
link |
01:34:38.920
the humans didn't know about.
link |
01:34:40.520
And actually now the human Go players study the joseki that AlphaGo played and they become
link |
01:34:45.600
the new norms that are used in today's top level Go competitions.
link |
01:34:51.440
That never gets old.
link |
01:34:52.800
And just, to me, it maybe just makes me feel good as a human being that a self
link |
01:34:59.160
play mechanism that knows nothing about us humans discovers patterns that we humans do.
link |
01:35:04.920
It's like an affirmation that we're doing okay as humans.
link |
01:35:10.640
In this domain and other domains, we figured out it's like the Churchill quote about democracy.
link |
01:35:15.960
It's the, you know, it sucks, but it's the best one we've tried.
link |
01:35:20.440
So in general, taking a step outside of Go, you have like a million accomplishments
link |
01:35:27.160
that I have no time to talk about with AlphaStar and so on and the current work.
link |
01:35:33.120
But in general, this self play mechanism that you've inspired the world with by beating
link |
01:35:39.080
the world champion Go player.
link |
01:35:42.380
Do you see that, do you see it being applied in other domains? Do you have sort of dreams and hopes
link |
01:35:50.400
that it's applied in both the simulated environments and the constrained environments of games?
link |
01:35:56.600
I mean, AlphaStar really demonstrates that you can remove a lot of the constraints, but
link |
01:36:00.760
nevertheless, it's an individual simulated environment.
link |
01:36:04.160
Do you have a hope or dream that it starts being applied in the robotics environment
link |
01:36:09.040
and maybe even in domains that are safety critical and so on and have, you know, have
link |
01:36:15.280
a real impact in the real world like autonomous vehicles, for example, which seems like a very
link |
01:36:19.000
far out dream at this point.
link |
01:36:21.280
So I absolutely do hope and imagine that we will get to the point where ideas just like
link |
01:36:28.280
these are used in all kinds of different domains.
link |
01:36:30.600
In fact, one of the most satisfying things as a researcher is when you start to see other
link |
01:36:34.980
people use your algorithms in unexpected ways.
link |
01:36:39.200
So in the last couple of years, there have been, you know, a couple of nature papers
link |
01:36:43.080
where different teams unbeknownst to us took AlphaZero and applied exactly those same algorithms
link |
01:36:51.840
and ideas to real world problems of huge meaning to society.
link |
01:36:57.640
So one of them was the problem of chemical synthesis and they were able to beat the state
link |
01:37:02.040
of the art in finding pathways of how to actually synthesize chemicals, retrosynthesis.
link |
01:37:12.040
And the second paper actually just came out a couple of weeks ago in Nature showed that
link |
01:37:17.920
in quantum computation, you know, one of the big questions is how to understand the
link |
01:37:22.760
nature of the function in quantum computation and a system based on AlphaZero beat the state
link |
01:37:29.840
of the art by quite some distance there again.
link |
01:37:33.000
So these are just examples.
link |
01:37:34.080
And I think, you know, the lesson which we've seen elsewhere in machine learning time and
link |
01:37:38.800
time again is that if you make something general, it will be used in all kinds of ways.
link |
01:37:44.200
You know, you provide a really powerful tool to society and those tools can be used in
link |
01:37:49.680
amazing ways.
link |
01:37:51.800
And so I think we're just at the beginning and for sure, I hope that we see all kinds
link |
01:37:56.720
of outcomes.
link |
01:37:59.000
So the other side of the question of the reinforcement learning framework is, you know, you usually want
link |
01:38:06.080
to specify a reward function and an objective function.
link |
01:38:11.440
What do you think about sort of ideas of intrinsic rewards, when we're not really sure about,
link |
01:38:17.080
you know, if we take, you know, human beings as existence proof that we don't seem to
link |
01:38:24.440
be operating according to a single reward.
link |
01:38:27.960
Do you think that there's interesting ideas for when you don't know how to truly specify
link |
01:38:34.640
the reward, you know, that there's some flexibility for discovering it intrinsically or so on
link |
01:38:40.520
in the context of reinforcement learning?
link |
01:38:42.820
So I think, you know, when we think about intelligence, it's really important to be
link |
01:38:46.280
clear about the problem of intelligence.
link |
01:38:48.480
And I think it's clearest to understand that problem in terms of some ultimate goal that
link |
01:38:52.720
we want the system to try and solve for.
link |
01:38:55.520
And after all, if we don't understand the ultimate purpose of the system, do we really
link |
01:39:00.360
even have a clearly defined problem that we're solving at all?
link |
01:39:04.400
Now within that, as with your example for humans, the system may choose to create its
link |
01:39:12.920
own motivations and sub goals that help the system to achieve its ultimate goal.
link |
01:39:19.320
And that may indeed be a hugely important mechanism to achieve those ultimate goals.
link |
01:39:23.920
But there is still some ultimate goal, I think, that the system needs to be measurable and evaluated
link |
01:39:28.360
against.
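As a toy illustration of sub-goals serving a fixed ultimate goal (purely illustrative, not a DeepMind method), an agent can train on the extrinsic reward plus a count-based curiosity bonus while still being evaluated on the extrinsic reward alone:

```python
from collections import Counter

class CuriosityShapedReward:
    """Toy illustration only: the ultimate (extrinsic) goal stays fixed
    and is what the system is evaluated against, but the training signal
    adds a count-based novelty bonus, a self-created sub-goal of seeking
    rarely visited states. States are assumed hashable; beta weights the
    intrinsic term."""
    def __init__(self, beta=0.1):
        self.visits = Counter()
        self.beta = beta

    def __call__(self, state, extrinsic_reward):
        self.visits[state] += 1
        bonus = self.visits[state] ** -0.5  # rarer states earn a bigger bonus
        return extrinsic_reward + self.beta * bonus
```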
link |
01:39:29.360
And even for humans, I mean, humans, we're incredibly flexible.
link |
01:39:32.480
We feel that we can, you know, any goal that we're given, we feel we can master to some
link |
01:39:38.440
degree.
link |
01:39:40.320
But if we think of those goals really, you know, like the goal of being able to pick
link |
01:39:44.120
up an object or the goal of being able to communicate or influence people to do things
link |
01:39:49.800
in a particular way or whatever those goals are, really, they're sub goals really that
link |
01:39:57.320
we set ourselves.
link |
01:39:58.320
You know, we choose to pick up the object.
link |
01:40:01.000
We choose to communicate.
link |
01:40:02.200
We choose to influence someone else.
link |
01:40:05.400
And we choose those because we think it will lead us to something, you know, later on.
link |
01:40:10.520
We think that that's helpful to us to achieve some ultimate goal.
link |
01:40:15.160
Now I don't want to speculate whether or not humans as a system necessarily have a singular
link |
01:40:20.160
overall goal of survival or whatever it is.
link |
01:40:23.560
But I think the principle for understanding and implementing intelligence has to be
link |
01:40:27.840
that if we're trying to understand intelligence or implement our own, there has to be a well
link |
01:40:32.120
defined problem.
link |
01:40:33.120
Otherwise, if it's not, I think it's like an admission of defeat. For there to be any
link |
01:40:39.880
hope of understanding or implementing intelligence, we have to know what we're doing.
link |
01:40:44.080
We have to know what we're asking the system to do.
link |
01:40:46.240
Otherwise, if you don't have a clearly defined purpose, you're not going to get a clearly
link |
01:40:49.560
defined answer.
link |
01:40:52.240
The ridiculously big question that has to naturally follow, and I have to pin you down on this,
link |
01:41:00.720
is that, nevertheless, one of the big silly or big real questions before humans, the meaning
link |
01:41:06.920
of life, is us trying to figure out our own reward function.
link |
01:41:11.320
And you just kind of mentioned that if you want to build intelligent systems and you
link |
01:41:15.480
know what you're doing, you should be at least cognizant to some degree of what the reward
link |
01:41:19.040
function is.
link |
01:41:20.360
So the natural question is, what do you think is the reward function of human life, the
link |
01:41:26.360
meaning of life for us humans, the meaning of our existence?
link |
01:41:32.960
I think I'd be speculating beyond my own expertise, but just for fun, let me do that and say
link |
01:41:39.760
I think that there are many levels at which you can understand a system and you can understand
link |
01:41:44.360
something as optimizing for a goal at many levels.
link |
01:41:49.040
And so you can understand... let's start with the universe: does the universe
link |
01:41:55.080
have a purpose?
link |
01:41:56.080
It feels like, at one level, it's just following certain mechanical laws of physics,
link |
01:42:02.240
and that's led to the development of the universe.
link |
01:42:04.720
But at another level, you can view it as actually, there's the second law of thermodynamics
link |
01:42:09.880
that says that entropy is increasing over time, forever.
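(In symbols, as an editor's gloss rather than Silver's words: for an isolated system the second law says

    \frac{dS}{dt} \ge 0,

where S is the entropy; it never decreases.)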
link |
01:42:13.440
And now there's a view that's been developed by certain people at MIT that you can think
link |
01:42:18.320
of this as almost like a goal of the universe, that the purpose of the universe is to maximize
link |
01:42:23.200
entropy.
link |
01:42:25.000
So there are multiple levels at which you can understand a system.
link |
01:42:28.880
The next level down, you might say, well, if the goal is to maximize entropy, well,
link |
01:42:38.520
how can that be done by a particular system?
link |
01:42:40.080
And maybe evolution is something that the universe discovered in order to kind of dissipate
link |
01:42:45.520
energy as efficiently as possible.
link |
01:42:47.320
And by the way, I'm borrowing some of these metaphors from the physicist Max Tegmark.
link |
01:42:54.000
But if you can think of evolution as a mechanism for dispersing energy, then evolution, you
link |
01:43:01.720
might say, becomes a goal in itself: if evolution disperses energy by reproducing
link |
01:43:07.320
as efficiently as possible, what's evolution then?
link |
01:43:10.720
Well, it's now got its own goal within that, which is to actually reproduce as effectively
link |
01:43:17.960
as possible.
link |
01:43:18.960
And now, how does reproduction, how is that made as effective as possible?
link |
01:43:24.280
Well, you need entities within that that can survive and reproduce as effectively as possible.
link |
01:43:30.000
And so it's natural that, in order to achieve that high-level goal, those individual organisms
link |
01:43:33.800
discover brains, intelligences, which enable them to support the goals of evolution.
link |
01:43:43.320
And those brains, what do they do?
link |
01:43:45.560
Well, perhaps the early brains, maybe they were controlling things at some direct level.
link |
01:43:52.160
Maybe they were the equivalent of preprogrammed systems, which were directly controlling what
link |
01:43:55.800
was going on and setting certain things in order to achieve these particular goals.
link |
01:44:03.160
But that led to another level of discovery, which was learning systems, parts of the brain
link |
01:44:07.840
which were able to learn for themselves and learn how to program themselves to achieve
link |
01:44:12.360
any goal. And presumably there are parts of the brain where goals are set for parts of
link |
01:44:18.720
that system, and that provides this very flexible notion of intelligence that we as humans presumably
link |
01:44:24.400
have, which is the ability to kind of write our own goals, the reason we feel that we can achieve any
link |
01:44:29.080
goal.
link |
01:44:30.080
So, it's a very long-winded answer to say that, you know, I think there are many perspectives
link |
01:44:34.640
and many levels at which intelligence can be understood.
link |
01:44:38.960
And at each of those levels, you can take multiple perspectives, you know, you can view
link |
01:44:42.600
the system as something which is optimizing for a goal, which is understanding it at a
link |
01:44:46.640
level by which we can maybe implement it and understand it as AI researchers or computer
link |
01:44:52.320
scientists, or you can understand it at the level of the mechanistic thing which is going
link |
01:44:56.120
on, that there are these atoms bouncing around in the brain and they lead to the outcome
link |
01:45:00.320
of that system. And that's not in contradiction with the fact that it's also a decision-making
link |
01:45:06.680
system that's optimizing for some goal and purpose.
link |
01:45:10.120
I've never heard the description of the meaning of life structured so beautifully in layers,
link |
01:45:16.520
but you did miss one layer, which is the next step, which you're responsible for, which
link |
01:45:21.840
is creating the artificial intelligence layer on top of that.
link |
01:45:28.440
Indeed.
link |
01:45:29.440
And I can't wait to see, well, I may not be around, but I can't wait to see what the
link |
01:45:34.280
next layer beyond that will be.
link |
01:45:36.920
Well, let's just take that argument and pursue it to its natural conclusion.
link |
01:45:41.640
So the next level, indeed, is: how can our learning brain achieve its goals most effectively?
link |
01:45:49.280
Well, maybe it does so by us as learning beings, building a system which is able to solve for
link |
01:45:59.600
those goals more effectively than we can.
link |
01:46:02.200
And so when we build a system to play the game of Go, when I said that I wanted to build
link |
01:46:06.480
a system that can play Go better than I can, I've enabled myself to achieve that goal of
link |
01:46:10.600
playing Go better than I could have by directly playing it and learning it myself.
link |
01:46:15.880
And so now a new layer has been created, which is systems which are able to achieve goals
link |
01:46:21.240
for themselves.
link |
01:46:22.840
And ultimately, there may be layers beyond that where they set subgoals to parts of their
link |
01:46:27.480
own system in order to achieve those and so forth.
link |
01:46:33.120
So the story of intelligence, I think, is a multi-layered one and a multi-perspective one.
link |
01:46:40.040
We live in an incredible universe.
link |
01:46:41.920
David, thank you so much, first of all, for dreaming of using learning to solve Go and
link |
01:46:48.040
building intelligent systems, and for actually making it happen and for inspiring millions
link |
01:46:54.680
of people in the process.
link |
01:46:56.160
It's truly an honor.
link |
01:46:57.160
Thank you so much for talking today.
link |
01:46:58.160
Okay.
link |
01:46:59.160
Thank you.
link |
01:47:00.160
Thanks for listening to this conversation with David Silver.
link |
01:47:02.440
And thank you to our sponsors, Masterclass and Cash App.
link |
01:47:06.080
Please consider supporting the podcast by signing up to Masterclass at masterclass.com slash
link |
01:47:11.200
Lex, and downloading Cash App and using code Lex Podcast.
link |
01:47:15.760
If you enjoy this podcast, subscribe on YouTube, review it with five stars on Apple Podcasts,
link |
01:47:20.560
support on Patreon, or simply connect with me on Twitter at Lex Fridman.
link |
01:47:25.320
And now, let me leave you with some words from David Silver.
link |
01:47:28.720
My personal belief is that we've seen something of a turning point where we're starting to
link |
01:47:33.600
understand that many abilities like intuition and creativity that we've previously thought
link |
01:47:39.600
were in the domain only of the human mind are actually accessible to machine intelligence
link |
01:47:44.240
as well.
link |
01:47:45.560
And I think that's a really exciting moment in history.
link |
01:47:49.400
Thank you for listening and hope to see you next time.