David Silver: AlphaGo, AlphaZero, and Deep Reinforcement Learning | Lex Fridman Podcast #86
The following is a conversation with David Silver, who leads the Reinforcement Learning Research Group at DeepMind. He was the lead researcher on AlphaGo and AlphaZero, co-led the AlphaStar and MuZero efforts, and has done a lot of important work in reinforcement learning in general. I believe AlphaZero is one of the most important accomplishments in the history of artificial intelligence, and David is one of the key humans who brought AlphaZero to life, together with a lot of other great researchers. He's humble, kind, and brilliant. We were both jet-lagged, but didn't care and made it happen. It was a pleasure and truly an honor to talk with David.

This conversation was recorded before the outbreak of the pandemic. For everyone feeling the medical, psychological, and financial burden of this crisis, I'm sending love your way. Stay strong, we're in this together, we'll beat this thing.

This is the Artificial Intelligence Podcast. If you enjoy it, subscribe on YouTube, review it with five stars on Apple Podcasts, support it on Patreon, or simply connect with me on Twitter at Lex Fridman, spelled F R I D M A N. As usual, I'll do a few minutes of ads now and never any ads in the middle that can break the flow of the conversation. I hope that works for you and doesn't hurt the listening experience.
Quick summary of the ads: two sponsors, Masterclass and Cash App. Please consider supporting the podcast by signing up to Masterclass at masterclass.com slash Lex and downloading Cash App and using code LexPodcast.

This show is presented by Cash App, the number one finance app in the App Store. When you get it, use code LexPodcast. Cash App lets you send money to friends, buy Bitcoin, and invest in the stock market with as little as $1. Since Cash App allows you to buy Bitcoin, let me mention that cryptocurrency, in the context of the history of money, is fascinating. I recommend The Ascent of Money as a great book on this history. Debits and credits on ledgers started around 30,000 years ago. The US dollar was created over 200 years ago, and Bitcoin, the first decentralized cryptocurrency, was released just over 10 years ago. So given that history, cryptocurrency is still very much in its early days of development, but it's still aiming to, and just might, redefine the nature of money. So again, if you get Cash App from the App Store or Google Play and use the code LexPodcast, you get $10, and Cash App will also donate $10 to FIRST, an organization that is helping to advance robotics and STEM education for young people around the world.
This show is sponsored by Masterclass. Sign up at masterclass.com slash Lex to get a discount and to support this podcast. In fact, for a limited time now, if you sign up for an all-access pass for a year, you get another all-access pass to share with a friend: buy one, get one free. When I first heard about Masterclass, I thought it was too good to be true. For $180 a year, you get an all-access pass to watch courses from, to list some of my favorites: Chris Hadfield on space exploration, Neil deGrasse Tyson on scientific thinking and communication, Will Wright, the creator of SimCity and The Sims, on game design, Jane Goodall on conservation, Carlos Santana on guitar (his song Europa could be the most beautiful guitar song ever written), Garry Kasparov on chess, Daniel Negreanu on poker, and many, many more. Chris Hadfield explaining how rockets work and the experience of being launched into space alone is worth the money. For me, the key is to not be overwhelmed by the abundance of choice. Pick three courses you want to complete and watch each of them all the way through. It's not that long, but it's an experience that will stick with you for a long time, I promise. It's easily worth the money. You can watch it on basically any device. Once again, sign up at masterclass.com slash Lex to get a discount and to support this podcast.

And now, here's my conversation with David Silver.
What was the first program you've ever written, and in what programming language?

I remember very clearly, yeah. My parents brought home this BBC Model B microcomputer. It was just this fascinating thing to me. I was about seven years old and couldn't resist just playing around with it. So I think the first program ever was writing my name out in different colors and getting it to loop and repeat that. And there was something magical about that, which just led to more and more.
How did you think about computers back then? Like the magical aspect of it, that you can write a program and there's this thing that you just gave birth to that's able to create sort of visual elements and live on its own. Or did you not think of it in those romantic notions? Was it more like, oh, that's cool, I can solve some puzzles?

It was always more than solving puzzles. It was something where, you know, there were these limitless possibilities. Once you have a computer in front of you, you can do anything with it. I used to play with Lego with the same feeling. You can make anything you want out of Lego, but even more so with a computer, you know, you're not constrained by the amount of kit you've got. And so I was fascinated by it and started pulling out the user guide and the advanced user guide and then learning. So I started in BASIC and then later 6502 assembly. My father also became interested in this machine and gave up his career to go back to school and study for a master's degree in artificial intelligence, funnily enough, at Essex University when I was seven. So I was exposed to those things at an early age. He showed me how to program in Prolog and do things like querying your family tree. And those are some of my earliest memories of trying to figure things out on a computer.
Those are the early steps in computer science and programming, but when did you first fall in love with artificial intelligence, or with the ideas?

I think it was really when I went to study at university. So I was an undergrad at Cambridge, studying computer science. And I really started to question, you know, what really are the goals? Where do we want to go with computer science? And it seemed to me that the only step of major significance to take was to try and recreate something akin to human intelligence. If we could do that, that would be a major leap forward. And that idea, I certainly wasn't the first to have it, but it, you know, nestled within me somewhere and became like a bug. You know, I really wanted to crack that problem.

So you had a notion that this is something that human beings can do, that it is possible to create an intelligent machine?

Well, I mean, unless you believe in something metaphysical, then what are our brains doing? Well, at some level they're information processing systems, which are able to take whatever information is in there, transform it through some form of program, and produce some kind of output which enables that human being to do all the amazing things that they can do in this incredible world.
So then, do you remember the first time you've written a program that beat you in a game? Or, more generally, beat you at anything, sort of achieved super-David-Silver-level performance?

So I used to work in the games industry. For five years I programmed games for my first job. It was an amazing opportunity to get involved in a startup company, and so I was involved in building AI at that time. And so for sure there was a sense of building handcrafted, what people used to call AI in the games industry, which I think is not really what we might think of as AI in its fullest sense, but something which is able to take actions in a way which makes things interesting and challenging for the human player. And at that time I was able to build these handcrafted agents, which in certain limited cases could do things better than me, but mostly in these kind of twitch-like scenarios where they were able to do things faster, or because they had some pattern which they were able to exploit repeatedly. I think if we're talking about real AI, the first experience for me came after that, when I realized that this path I was on wasn't taking me towards, it wasn't dealing with that bug which I still had inside me, to really understand intelligence and try and solve it. Everything people were doing in games was short-term fixes rather than long-term vision. And so I went back to study for my PhD, which was, funnily enough, trying to apply reinforcement learning to the game of Go. And I built my first Go program using reinforcement learning: a system which would, by trial and error, play against itself and was able to learn which patterns were actually helpful to predict whether it was going to win or lose the game, and then choose the moves that led to the combination of patterns that would mean that you're more likely to win. And that system, that system beat me.
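The learning loop David describes, self-play by trial and error, predicting wins and losses, then preferring the moves that lead to predicted wins, can be sketched on a much smaller game. The miniature take-away game, the table of state values, and all parameter choices below are illustrative assumptions for this sketch, not details of his actual Go program.

```python
import random

def train_self_play(pile=10, episodes=5000, alpha=0.1, eps=0.1, seed=0):
    """Learn V[n] = estimated probability that the player to move wins
    with n stones left, purely from self-play outcomes (Monte Carlo).
    Rules of the toy game: players alternately take 1 or 2 stones;
    whoever takes the last stone wins."""
    rng = random.Random(seed)
    V = {n: 0.5 for n in range(1, pile + 1)}
    V[0] = 0.0  # no stones left: the player to move has already lost
    for _ in range(episodes):
        n, player, visited = pile, 0, []
        while n > 0:
            moves = [m for m in (1, 2) if m <= n]
            if rng.random() < eps:                      # explore
                m = rng.choice(moves)
            else:                                       # exploit: leave the
                m = min(moves, key=lambda k: V[n - k])  # opponent a bad state
            visited.append((n, player))
            n -= m
            player ^= 1
        winner = player ^ 1  # whoever took the last stone wins
        for state, p in visited:  # nudge predictions toward the outcome
            target = 1.0 if p == winner else 0.0
            V[state] += alpha * (target - V[state])
    return V

V = train_self_play()
```

Under these assumptions, states where the player to move is theoretically lost (a multiple of 3 stones remaining) end up with low predicted win probability, and the greedy policy derived from those predictions plays the toy game well, the same shape of idea, scaled down enormously.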
And how did that make you feel?

Made me feel good.

I mean, was there sort of, yeah, a mix of excitement, and was there a tinge of, like, almost a fearful awe? You know, like in 2001: A Space Odyssey, kind of realizing that you've created something that's achieved human-level intelligence in this one particular little task. And in that case, I suppose, neural networks...

There were no neural networks in those days. This was pre-deep-learning revolution. But it was a principled self-learning system based on a lot of the principles which people are still using in deep reinforcement learning. I think I found it immensely satisfying that a system which was able to learn from first principles for itself was able to reach the point that it was understanding this domain better than I could, and able to outwit me. I don't think it was a sense of awe. It was a sense of satisfaction, that something I felt should work had worked.
So to me, AlphaGo, and I don't know how else to put it, but to me, AlphaGo and AlphaGo Zero mastering the game of Go is, again to me, the most profound and inspiring moment in the history of artificial intelligence. And you're one of the key people behind this achievement. I really felt the first sort of seminal achievement was when Deep Blue beat Garry Kasparov in 1997. As far as I know, the AI community at that point largely saw the game of Go as unbeatable by AI using the state-of-the-art brute-force search methods. Even if you consider, at least the way I saw it, arbitrary exponential scaling of compute, Go would still not be solvable, hence why it was thought to be impossible. So given that the game of Go was thought impossible to master, what was the dream for you? You just mentioned your PhD thesis of building the system that plays Go. What was the dream for you, that you could actually build a computer program that achieves world-class play, not necessarily beats the world champion, but achieves that kind of level of playing Go?
First of all, thank you, those are very kind words. And funnily enough, I just came from a panel where I was actually in a conversation with Garry Kasparov and Murray Campbell, one of the creators of Deep Blue. And it was their first meeting together since the match. That just occurred yesterday, so I'm literally fresh from that experience. These are amazing moments when they happen. But where did it all start? Well, for me, it started when I became fascinated by the game of Go. I've grown up playing games; I've always had a fascination with board games. I played chess as a kid, I played Scrabble as a kid. When I was at university, I discovered the game of Go, and to me, it just blew all of those other games away. It was just so deep and profound in its complexity, with endless levels to it. What I discovered was that I could devote endless hours to this game, and I knew in my heart of hearts that no matter how many hours I would devote to it, I would never become a grandmaster. Or, there was another path. And the other path was to try and understand how you could get some other intelligence to play this game better than I would be able to. And so even in those days, I had this idea: what if, what if it was possible to build a program that could crack this? And as I started to explore the domain, I discovered that this was really the domain where people felt deeply that if progress could be made in Go, it would really mean a giant leap forward for AI. It was the challenge where all other approaches had failed.
This is coming out of the era you mentioned, which was in some sense the golden era for the classical methods of AI, like heuristic search. In the 90s, they all fell one after another: not just chess with Deep Blue, but checkers, backgammon, Othello. There were numerous cases where systems built on top of heuristic search methods, these high-performance systems, had been able to defeat the human world champion in each of those domains. And yet in that same time period, there was a million-dollar prize available for the game of Go, for the first system to beat a human professional player. And at the end of that time period, in the year 2000, when the prize expired, the strongest Go program in the world was defeated by a nine-year-old child, and that nine-year-old child was giving nine free moves to the computer at the start of the game to try and even things up. And a computer Go expert beat that same strongest program with 29 handicap stones, 29 free moves. So that's what the state of affairs was when I became interested in this problem, in around 2003, when I started working on computer Go. There was very, very little in the way of progress towards meaningful performance, anything approaching human level. And it wasn't through lack of effort; people had tried many, many things. And so there was a strong sense that something different would be required for Go than had been needed for all of these other domains where AI had been successful.
And maybe the single clearest example is that Go, unlike those other domains, had this kind of intuitive property. A Go player would look at a position and say, hey, here's this mess of black and white stones, but from this mess, oh, I can predict that this part of the board has become my territory, this part of the board has become your territory, and I've got this overall sense that I'm going to win, and that this is about the right move to play. And that intuitive sense of judgment, of being able to evaluate what's going on in a position, was pivotal to humans being able to play this game, and something that people had no idea how to put into computers. So this question of how to evaluate a position, how to come up with these intuitive judgments, was the key reason why Go was so hard, in addition to its enormous search space, and the reason why methods which had succeeded so well elsewhere failed in Go. And so people really felt deep down that in order to crack Go, we would need to get something akin to human intuition. And if we got something akin to human intuition, we'd be able to solve many, many more problems in AI. So for me, that was the moment where it's like, okay, this is not just about playing the game of Go, this is about something profound. And it was back to that bug which had been itching me all those years. This was the opportunity to do something meaningful and transformative, and I guess a dream was born.
That's a really interesting way to put it. So almost this realization that you need to formulate Go as a kind of prediction problem, versus a search problem, was the intuition. I mean, maybe that's the wrong, crude term, but to give it the ability to kind of intuit things about the positional structure of the board. Now, okay, but what about the learning part of it? Did you have a sense that learning has to be part of the system? Again, something that hasn't really, as far as I think, except with TD-Gammon in the 90s with RL a little bit, been part of those state-of-the-art game-playing systems.
So I strongly felt that learning would be necessary. And that's why my PhD topic back then was trying to apply reinforcement learning to the game of Go. And not just learning of any type: I felt that the only way to really have a system progress beyond human levels of performance wouldn't just be to mimic how humans do it, but for it to understand for itself. And how else can a machine hope to understand what's going on except through learning? If you're not learning, what else are you doing? Well, you're putting all the knowledge into the system. And that just feels like something which decades of AI have told us is maybe not a dead end, but certainly has a ceiling to its capabilities. It's known as the knowledge acquisition bottleneck: the more you try to put into something, the more brittle the system becomes. And so you just have to have learning. You have to have learning. That's the only way you're going to be able to get a system which has sufficient knowledge in it, millions and millions of pieces of knowledge, billions, trillions, of a form that it can actually apply for itself, and understand how those billions and trillions of pieces of knowledge can be leveraged in a way which will actually lead it towards its goal, without conflict or other issues.
Yeah, I mean, if I put myself back in that time, I just wouldn't think like that. Without a good demonstration of RL, I would think more in terms of symbolic AI: not learning, but sort of a simulation of a knowledge base, like a growing knowledge base, but it would still be sort of pattern-based, like basically having little rules that you kind of assemble together into a large knowledge base.

Well, in a sense, that was the state of the art back then. So if you look at the Go programs which had been competing for this prize I mentioned, they were an assembly of different specialized systems, some of which used huge amounts of human knowledge to describe how you should play the opening, all the different patterns that were required to play well in the game of Go, endgame theory, combinatorial game theory, combined with more principled search-based methods which were trying to solve particular sub-parts of the game, like life and death, connecting groups together, all these amazing sub-problems that just emerge in the game of Go. There were different pieces all put together into this kind of collage, which together would try and play against a human. And although not all of the pieces were handcrafted, the overall effect was nevertheless still brittle, and it was hard to make all these pieces work well together. And so really, what I was pressing for, and the main innovation of the approach I took, was to go back to first principles and say, well, let's back off that and try and find a principled approach where the system can learn for itself, just from the outcome. Learn for itself: if you try something, did that help or did it not help? And only through that procedure can you arrive at knowledge which is verified. The system has to verify it for itself, not rely on any other third party to say this is right or this is wrong. And so that principle was already very important in those days, but unfortunately, we were missing some important pieces back then.
So before we dive into maybe discussing the beauty of reinforcement learning, let's take a step back. We kind of skipped it a bit, but what are the rules of the game of Go, the elements of it, perhaps contrasting with chess, that you really enjoyed as a human being, and also that make it really difficult as an AI, machine learning problem?

So the game of Go has remarkably simple rules. In fact, so simple that people have speculated that if we were to meet alien life at some point, we might not be able to communicate with them, but we would be able to play Go with them; they'd probably have discovered the same rule set. So the game is played on a 19 by 19 grid, you play on the intersections of the grid, and the players take turns. The aim of the game is very simple: it's to surround as much territory as you can, as many of these intersections, with your stones, and to surround more than your opponent does. And the only nuance to the game is that if you fully surround your opponent's pieces, then you get to capture them and remove them from the board, and it counts as your own territory.
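The capture rule has a crisp algorithmic form: a connected group of stones is captured exactly when it has no liberties, that is, no empty points adjacent to it. Here is a minimal sketch of that check using a flood fill; the board representation and function names are assumptions for illustration, not any particular Go engine's API.

```python
def group_and_liberties(board, start):
    """Flood-fill the group of stones connected to `start` and collect
    its liberties (empty points adjacent to the group).
    `board` maps (row, col) -> 'B', 'W', or '.' for empty."""
    color = board[start]
    group, liberties = set(), set()
    frontier = [start]
    while frontier:
        r, c = frontier.pop()
        if (r, c) in group:
            continue
        group.add((r, c))
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (nr, nc) not in board:
                continue  # off the edge of the board
            if board[(nr, nc)] == '.':
                liberties.add((nr, nc))
            elif board[(nr, nc)] == color:
                frontier.append((nr, nc))  # same color: part of the group
    return group, liberties

# A tiny 3x3 corner: a white stone at (0,0) fully surrounded by black.
board = {(r, c): '.' for r in range(3) for c in range(3)}
board[(0, 0)] = 'W'
board[(0, 1)] = 'B'
board[(1, 0)] = 'B'
group, libs = group_and_liberties(board, (0, 0))
captured = len(libs) == 0
```

In a full rules engine, the same check would run after every move, removing any opponent group whose liberty set has just become empty.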
Now, from those very simple rules, immense complexity arises. There are kind of profound strategies in how to surround territory, how to trade off between making solid territory yourself now compared to building up influence that will help you acquire territory later in the game, how to connect groups together, how to keep your own groups alive, which patterns of stones are most useful compared to others. There's just immense knowledge. The game was discovered thousands of years ago, and human Go players have built up this immense knowledge base over the years. It's studied very deeply and played by something like 50 million players across the world, mostly in China, Japan, and Korea, where it's an important part of the culture, so much so that it's considered one of the four ancient arts that were required of Chinese scholars. So there's a deep history there.
But there are interesting qualities. So if I compare it to chess: in the same way that Go is part of Chinese culture, chess in Russia is also considered one of the sacred arts. So if we contrast Go with chess, there are interesting qualities about Go. Maybe you can correct me if I'm wrong, but the evaluation of a particular static board is not as reliable. In chess you can kind of assign points to the different units, and that's a pretty good measure of who's winning and who's losing. In Go it's not so clear.
Yeah, so in the game of Go, you find yourself in a situation where both players have played the same number of stones. Actually, captures at a strong level of play happen very rarely, which means that at any moment in the game, you've got the same number of white stones and black stones. And the only thing which differentiates how well you're doing is this intuitive sense of where the territories are ultimately going to form on this board. And if you look at the complexity of a real Go position, it's mind-boggling, that kind of question of what will happen 300 moves from now, when you see just a scattering of 20 white and black stones intermingled. And so that challenge is the reason why position evaluation is so hard in Go compared to other games. In addition to that, it has an enormous search space. There are around 10 to the 170 positions in the game of Go. That's an astronomical number. And that search space is so great that traditional heuristic search methods, which were so successful in things like Deep Blue and chess programs, just kind of fall over in Go.
So at which point did reinforcement learning enter your life, your research life, your way of thinking? We just talked about learning, but reinforcement learning is a very particular kind of learning. One that's both philosophically sort of profound, but also one that's pretty difficult to get to work, if we look back at the early days. So when did that enter your life, and how did that work progress?
So I had just finished working in the games industry, at this startup company, and I took a year out to discover for myself exactly which path I wanted to take. I knew I wanted to study intelligence, but I wasn't sure what that meant at that stage. I really didn't feel I had the tools to decide on exactly which path I wanted to follow. So during that year, I read a lot. And one of the things I read was Sutton and Barto, the sort of seminal textbook, an introduction to reinforcement learning. And when I read that textbook, I just had this resonating feeling that this is what I understood intelligence to be, and this was the path that I felt would be necessary to go down to make progress in AI. So I got in touch with Rich Sutton and asked him if he would be interested in supervising me on a PhD thesis in computer Go. And he basically said that if he was still alive, he'd be happy to. But unfortunately, he'd been struggling with very serious cancer for some years, and he really wasn't confident at that stage that he'd even be around to see the end of it. But fortunately, that part of the story worked out very happily, and I found myself out there in Alberta. They've got a great games group out there, with a history of fantastic work in board games, as well as Rich Sutton, the father of RL. So it was the natural place for me to go, in some sense, to study this question.
And the more I looked into it, the more strongly I felt that this wasn't just the path to progress in computer Go. Really, this was the thing I'd been looking for. This was really an opportunity to frame what intelligence means, like, what are the goals of AI, in a single clear problem definition, such that if we're able to solve that single clear problem definition, in some sense we've cracked the problem of AI.

So to you, reinforcement learning ideas, at least sort of echoes of them, would be at the core of intelligence. It is at the core of intelligence. And if we ever create a human-level intelligence system, it would be at the core of that kind of system?
Let me say it this way: I think it's helpful to separate out the problem from the solution. So I see the problem of intelligence, I would say it can be formalized as the reinforcement learning problem, and that formalization is enough to capture most, if not all, of the things that we mean by intelligence. They can all be brought within this framework, and it gives us a way to access them in a meaningful way that allows us as scientists to understand intelligence, and us as computer scientists to build it. And so in that sense, I feel that it gives us a path, maybe not the only path, but a path towards AI. And so do I think that any system in the future that solved AI would have to have RL within it? Well, I think if you ask that, you're asking about the solution methods. I would say that if we have such a thing, it would be a solution to the RL problem. Now, what particular methods have been used to get there? Well, we should keep an open mind about the best approaches to actually solve any problem. And the things we have right now for reinforcement learning, I believe they've got a lot of legs, but maybe we're missing some things, maybe there are going to be better ideas. Let's remain modest: we're in the early days of this field, and there are many amazing discoveries ahead of us.
link |
For sure, the specifics,
link |
especially of the different kinds of RL approaches currently,
link |
there could be other things that fall
link |
into the very large umbrella of RL.
link |
But if it's okay, can we take a step back
link |
and kind of ask the basic question
link |
of what is to you reinforcement learning?
link |
So reinforcement learning is the study
link |
and the science and the problem of intelligence
link |
in the form of an agent that interacts with an environment.
link |
So the problem you're trying to solve
link |
is represented by some environment,
link |
like the world in which that agent is situated.
link |
And the goal of RL is clear
link |
that the agent gets to take actions.
link |
Those actions have some effect on the environment
link |
and the environment gives back an observation
link |
to the agent saying, this is what you see or sense.
link |
And one special thing which it gives back
link |
is called the reward signal,
link |
how well it's doing in the environment.
link |
And the reinforcement learning problem
link |
is to simply take actions over time
link |
so as to maximize that reward signal.
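The interaction loop described here can be sketched in a few lines of code; the environment, agent, and all names below are illustrative toys I've made up for the sketch, not anything from DeepMind's systems:

```python
import random

# A minimal sketch of the RL loop: the agent takes actions, the
# environment returns an observation and a scalar reward, and the
# objective is to maximize cumulative reward over time.

class ToyEnvironment:
    """A 1D corridor: move right to reach the goal at position 5."""
    def __init__(self):
        self.position = 0

    def step(self, action):  # action: -1 (left) or +1 (right)
        self.position = max(0, min(5, self.position + action))
        observation = self.position
        reward = 1.0 if self.position == 5 else 0.0
        done = self.position == 5
        return observation, reward, done

class RandomAgent:
    """Picks actions at random; a learning agent would improve on this."""
    def act(self, observation):
        return random.choice([-1, +1])

def run_episode(env, agent, max_steps=100):
    total_reward = 0.0
    observation = env.position
    for _ in range(max_steps):
        action = agent.act(observation)
        observation, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```

Note that nothing here stipulates learning; a learning agent is simply one whose `act` improves with experience, which is the point made a little later in the conversation.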
link |
So a couple of basic questions.
link |
What types of RL approaches are there?
link |
So I don't know if there's a nice, brief, in-words way
link |
to paint the picture of sort of value based,
link |
model based, policy based reinforcement learning.
link |
Yeah, so now if we think about,
link |
okay, so there's this ambitious problem definition of RL.
link |
It's really, it's truly ambitious.
link |
It's trying to capture and encircle
link |
all of the things in which an agent interacts
link |
with an environment and say, well,
link |
how can we formalize and understand
link |
what it means to crack that?
link |
Now let's think about the solution method.
link |
Well, how do you solve a really hard problem like that?
link |
Well, one approach you can take
link |
is to decompose that very hard problem
link |
into pieces that work together to solve that hard problem.
link |
And so you can kind of look at the decomposition
link |
that's inside the agent's head, if you like,
link |
and ask, well, what form does that decomposition take?
link |
And some of the most common pieces that people use
link |
when they're kind of putting
link |
the solution method together,
link |
some of the most common pieces that people use
link |
are whether or not that solution has a value function.
link |
That means, is it trying to predict,
link |
explicitly trying to predict how much reward
link |
it will get in the future?
link |
Does it have a representation of a policy?
link |
That means something which is deciding how to pick actions.
link |
Is that decision making process explicitly represented?
link |
And is there a model in the system?
link |
Is there something which is explicitly trying to predict
link |
what will happen in the environment?
link |
And so those three pieces are, to me,
link |
some of the most common building blocks.
link |
And I understand the different choices in RL
link |
as choices of whether or not to use those building blocks
link |
when you're trying to decompose the solution.
link |
Should I have a value function represented?
link |
Should I have a policy represented?
link |
Should I have a model represented?
link |
And there are combinations of those pieces
link |
and, of course, other things that you could
link |
add into the picture as well.
link |
But those three fundamental choices
link |
give rise to some of the branches of RL
link |
with which we're very familiar.
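The three building blocks just described (value function, policy, model) can be pictured as optional components of an agent; this is a schematic sketch with made-up names, not a real RL library:

```python
# Each component may or may not be represented explicitly:
#   policy:         state -> action (how to act)
#   value function: state -> expected future reward (how well you'll do)
#   model:          (state, action) -> next state (what happens next)

class Agent:
    def __init__(self, policy=None, value_function=None, model=None):
        self.policy = policy
        self.value_function = value_function
        self.model = model

    def act(self, state):
        if self.policy is not None:
            return self.policy(state)
        raise NotImplementedError("no explicit policy; derive one from value/model")

# e.g. a value-based agent with a model can derive a policy greedily,
# picking the action whose predicted next state has the highest value:
def greedy_policy_from_value(value_function, actions, model):
    def policy(state):
        return max(actions, key=lambda a: value_function(model(state, a)))
    return policy
```

Different branches of RL correspond to which of these slots are filled in explicitly and which are left implicit.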
link |
And so those, as you mentioned,
link |
there is a choice of what's specified
link |
or modeled explicitly.
link |
And the idea is that all of these
link |
are somehow implicitly learned within the system.
link |
So it's almost a choice of how you approach a problem.
link |
Do you see those as fundamental differences
link |
or are these almost like small specifics,
link |
like the details of how you solve a problem
link |
but they're not fundamentally different from each other?
link |
I think the fundamental idea is maybe at the higher level.
link |
The fundamental idea is the first step
link |
of the decomposition is really to say,
link |
well, how are we really gonna solve any kind of problem
link |
where you're trying to figure out how to take actions
link |
and just from this stream of observations,
link |
you've got some agent situated in its sensory motor stream
link |
and getting all these observations in,
link |
getting to take these actions, and what should it do?
link |
How can you even broach that problem?
link |
You know, maybe the complexity of the world is so great
link |
that you can't even imagine how to build a system
link |
that would understand how to deal with that.
link |
And so the first step of this decomposition is to say,
link |
well, you have to learn.
link |
The system has to learn for itself.
link |
And so note that the reinforcement learning problem
link |
doesn't actually stipulate that you have to learn.
link |
Like you could maximize your rewards without learning.
link |
It would just, wouldn't do a very good job of it.
link |
So learning is required
link |
because it's the only way to achieve good performance
link |
in any sufficiently large and complex environment.
link |
So that's the first step.
link |
And so that step gives commonality
link |
to all of the other pieces,
link |
because now you might ask, well, what should you be learning?
link |
What does learning even mean?
link |
You know, in this sense, you know, learning might mean,
link |
well, you're trying to update the parameters
link |
of some system, which is then the thing
link |
that actually picks the actions.
link |
And those parameters could be representing anything.
link |
They could be parameterizing a value function or a model or a policy.
link |
And so in that sense, there's a lot of commonality
link |
in that whatever is being represented there
link |
is the thing which is being learned,
link |
and it's being learned with the ultimate goal
link |
of maximizing rewards.
link |
But the way in which you decompose the problem
link |
is really what gives the semantics to the whole system.
link |
Like, are you trying to learn something to predict well,
link |
like a value function or a model?
link |
Are you learning something to perform well, like a policy?
link |
And the form of that objective
link |
is kind of giving the semantics to the system.
link |
And so it really is, at the next level down,
link |
a fundamental choice,
link |
and we have to make those fundamental choices
link |
as system designers or enable our algorithms
link |
to be able to learn how to make those choices for themselves.
link |
So then the next step you mentioned,
link |
the very first thing you have to deal with is,
link |
can you even take in this huge stream of observations
link |
and do anything with it?
link |
So the natural next basic question is,
link |
what is deep reinforcement learning?
link |
And what is this idea of using neural networks
link |
to deal with this huge incoming stream?
link |
So amongst all the approaches for reinforcement learning,
link |
deep reinforcement learning
link |
is one family of solution methods
link |
that tries to utilize powerful representations
link |
that are offered by neural networks
link |
to represent any of these different components
link |
of the solution, of the agent,
link |
like whether it's the value function
link |
or the model or the policy.
link |
The idea of deep learning is to say,
link |
well, here's a powerful toolkit that's so powerful
link |
that it's universal in the sense
link |
that it can represent any function
link |
and it can learn any function.
link |
And so if we can leverage that universality,
link |
that means that whatever we need to represent
link |
for our policy or for our value function or for a model,
link |
deep learning can do it.
link |
So that deep learning is one approach
link |
that offers us a toolkit
link |
that has no ceiling to its performance,
link |
that as we start to put more resources into the system,
link |
more memory and more computation and more data,
link |
more experience, more interactions with the environment,
link |
that these are systems that can just get better
link |
and better and better at doing whatever the job is
link |
we've asked them to do,
link |
whatever we've asked that function to represent,
link |
it can learn a function that does a better and better job
link |
of representing that knowledge,
link |
whether that knowledge be estimating
link |
how well you're gonna do in the world,
link |
the value function,
link |
whether it's gonna be choosing what to do in the world,
link |
or whether it's understanding the world itself,
link |
what's gonna happen next, the model.
link |
Nevertheless, the fact that neural networks
link |
are able to learn incredibly complex representations
link |
that allow you to do the policy, the model
link |
or the value function is, at least to my mind,
link |
exceptionally beautiful and surprising.
link |
Like, was it surprising to you?
link |
Can you still believe it works as well as it does?
link |
Do you have good intuition about why it works at all
link |
and works as well as it does?
link |
I think, let me take two parts to that question.
link |
I think it's not surprising to me
link |
that the idea of reinforcement learning works
link |
because in some sense,
link |
I feel it's the only thing which can ultimately work.
link |
And so I feel we have to address it,
link |
and success must be possible
link |
because we have examples of intelligence.
link |
And it must at some level be
link |
possible to acquire experience
link |
and use that experience to do better
link |
in a way which is meaningful to environments
link |
of the complexity that humans can deal with.
link |
Am I surprised that our current systems
link |
can do as well as they can do?
link |
I think one of the big surprises for me
link |
and a lot of the community
link |
is really the fact that deep learning
link |
can continue to perform so well
link |
despite the fact that these neural networks
link |
that they're representing
link |
have these incredibly nonlinear kind of bumpy surfaces
link |
which to our kind of low dimensional intuitions
link |
make it feel like surely you're just gonna get stuck
link |
and learning will get stuck
link |
because you won't be able to make any further progress.
link |
And yet the big surprise is that learning continues
link |
and these what appear to be local optima
link |
turn out not to be because in high dimensions
link |
when we make really big neural nets,
link |
there's always a way out
link |
and there's a way to go even lower
link |
and then you're still not in a local optimum
link |
because there's some other pathway
link |
that will take you out and take you lower still.
link |
And so no matter where you are,
link |
learning can proceed and do better and better and better
link |
And so that is a surprising
link |
and beautiful property of neural nets
link |
which I find elegant and beautiful
link |
and somewhat shocking that it turns out to be the case.
link |
As you said, which I really like
link |
to our low dimensional intuitions, that's surprising.
link |
Yeah, we're very tuned to working
link |
within a three dimensional environment.
link |
And so to start to visualize
link |
what a billion dimensional neural network surface
link |
that you're trying to optimize over,
link |
what that even looks like is very hard for us.
link |
And so I think that really,
link |
if you try to account for the,
link |
essentially the AI winter
link |
where people gave up on neural networks,
link |
I think it's really down to that lack of ability
link |
to generalize from low dimensions to high dimensions
link |
because back then we were in the low dimensional case.
link |
People could only build neural nets
link |
with 50 nodes in them or something.
link |
And to imagine that it might be possible
link |
to build a billion dimensional neural net
link |
and it might have a completely different,
link |
qualitatively different property was very hard to anticipate.
link |
And I think even now we're starting to build the theory.
link |
And it's incomplete at the moment,
link |
but all of the theory seems to be pointing in the direction
link |
that indeed this is an approach which truly is universal
link |
both in its representational capacity, which was known,
link |
but also in its learning ability, which is surprising.
link |
And it makes one wonder what else we're missing
link |
due to our low dimensional intuitions
link |
that will seem obvious once it's discovered.
link |
I often wonder, when we one day do have AIs
link |
which are superhuman in their abilities
link |
to understand the world,
link |
what will they think of the algorithms
link |
that we developed back now?
link |
Will it be looking back at these days
link |
and thinking that, will we look back and feel
link |
that these algorithms were naive first steps
link |
or will they still be the fundamental ideas
link |
which are used even in 100,000, 10,000 years?
link |
It's hard to know.
link |
They'll watch back to this conversation
link |
and with a smile, maybe a little bit of a laugh.
link |
I mean, my sense is, I think just like when we used
link |
to think that the sun revolved around the earth,
link |
they'll see our systems of today, reinforcement learning
link |
as too complicated, that the answer was simple all along.
link |
There's something, just like you said in the game of Go,
link |
I mean, I love the systems of like cellular automata,
link |
that there's simple rules from which incredible complexity
link |
emerges, so it feels like there might be
link |
some really simple approaches,
link |
just like Rich Sutton says, right?
link |
These simple methods with compute over time
link |
seem to prove to be the most effective.
link |
I think that if we try to anticipate
link |
what will generalize well into the future,
link |
I think it's likely to be the case
link |
that it's the simple, clear ideas
link |
which will have the longest legs
link |
and which will carry us furthest into the future.
link |
Nevertheless, we're in a situation
link |
where we need to make things work today,
link |
and sometimes that requires putting together
link |
more complex systems where we don't have
link |
the full answers yet as to what
link |
those minimal ingredients might be.
link |
So speaking of which, if we could take a step back to Go,
link |
what was MoGo and what was the key idea behind the system?
link |
So back during my PhD on Computer Go,
link |
around about that time, there was a major new development
link |
which actually happened in the context of Computer Go,
link |
and it was really a revolution in the way
link |
that heuristic search was done,
link |
and the idea was essentially that
link |
a position could be evaluated or a state in general
link |
could be evaluated not by humans saying
link |
whether that position is good or not,
link |
or even humans providing rules
link |
as to how you might evaluate it,
link |
but instead by allowing the system
link |
to randomly play out the game until the end multiple times
link |
and taking the average of those outcomes
link |
as the prediction of what will happen.
link |
So for example, if you're in the game of Go,
link |
the intuition is that you take a position
link |
and you get the system to kind of play random moves
link |
against itself all the way to the end of the game
link |
and you see who wins.
link |
And if black ends up winning
link |
more of those random games than white,
link |
well, you say, hey, this is a position that favors black.
link |
And if white ends up winning more of those random games
link |
than black, then it favors white.
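The Monte Carlo evaluation idea described here can be sketched directly; the `game` interface below is a hypothetical stand-in, not the actual MoGo code:

```python
import random

# Estimate how good a position is by playing uniformly random moves
# to the end of the game many times and averaging the outcomes.

def rollout(game, state):
    """Random playout from `state` to a terminal state; returns 1 if
    the first player wins, 0 otherwise."""
    while not game.is_terminal(state):
        move = random.choice(game.legal_moves(state))
        state = game.apply(state, move)
    return game.winner_is_first_player(state)

def monte_carlo_value(game, state, num_rollouts=100):
    """The average of the random playouts: an estimate of how much
    the position favors the first player."""
    wins = sum(rollout(game, state) for _ in range(num_rollouts))
    return wins / num_rollouts
```

Monte Carlo tree search, discussed next, applies exactly this estimate at every node of a search tree rather than only at the root.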
link |
So that idea was known as Monte Carlo search,
link |
and a particular form of Monte Carlo search
link |
that became very effective and was developed in computer Go
link |
first by Rémi Coulom in 2006,
link |
and then taken further by others
link |
was something called Monte Carlo tree search,
link |
which basically takes that same idea
link |
and uses that insight so that every node of a search tree
link |
is evaluated by the average of the random playouts
link |
from that node onwards.
link |
And this idea, when you think about it,
link |
was very powerful
link |
and suddenly led to huge leaps forward
link |
in the strength of computer Go playing programs.
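A compact sketch of Monte Carlo tree search in the spirit of those 2006-era programs, using the standard UCB1 selection rule (UCT); the `game` interface is hypothetical, and this is a schematic rather than any program's actual implementation:

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}      # move -> Node
        self.visits = 0
        self.total_reward = 0.0

def ucb1(parent, child, c=1.4):
    # Average playout value (exploitation) plus a bonus for
    # rarely-visited children (exploration).
    return (child.total_reward / child.visits
            + c * math.sqrt(math.log(parent.visits) / child.visits))

def mcts(game, root_state, num_iterations=200):
    root = Node(root_state)
    for _ in range(num_iterations):
        node = root
        # 1. Selection: descend while the node is fully expanded.
        while node.children and len(node.children) == len(game.legal_moves(node.state)):
            node = max(node.children.values(), key=lambda ch: ucb1(node, ch))
        # 2. Expansion: add one untried child, if any.
        untried = [m for m in game.legal_moves(node.state) if m not in node.children]
        if untried and not game.is_terminal(node.state):
            move = random.choice(untried)
            child = Node(game.apply(node.state, move), parent=node)
            node.children[move] = child
            node = child
        # 3. Simulation: random playout to the end of the game.
        state = node.state
        while not game.is_terminal(state):
            state = game.apply(state, random.choice(game.legal_moves(state)))
        reward = game.outcome(state)
        # 4. Backpropagation: update the averages up the tree.
        while node is not None:
            node.visits += 1
            node.total_reward += reward
            node = node.parent
    # Recommend the most-visited move from the root.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```

Every node's value is thus the running average of the random playouts through it, which is exactly the insight described above.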
link |
And among those, the strongest of the Go playing programs
link |
in those days was a program called MoGo,
link |
which was the first program to actually reach
link |
human master level on small boards, nine by nine boards.
link |
And so this was a program by someone called Sylvain Gelly,
link |
who's a good colleague of mine,
link |
but I worked with him a little bit in those days,
link |
part of my PhD thesis.
link |
And MoGo was a first step towards the latest successes
link |
we saw in computer Go,
link |
but it was still missing a key ingredient.
link |
MoGo was evaluating purely by random rollouts against itself.
link |
And in a way, it's truly remarkable
link |
that random play should give you anything at all.
link |
Why in this perfectly deterministic game
link |
that's very precise and involves these very exact sequences,
link |
why is it that randomization is helpful?
link |
And so the intuition is that randomization
link |
captures something about the nature of the search tree,
link |
from a position that you're understanding
link |
the nature of the search tree from that node onwards
link |
by using randomization.
link |
And this was a very powerful idea.
link |
And I've seen this in other spaces,
link |
talked to Richard Karp and so on,
link |
randomized algorithms somehow magically
link |
are able to do exceptionally well
link |
and simplifying the problem somehow.
link |
Makes you wonder about the fundamental nature
link |
of randomness in our universe.
link |
It seems to be a useful thing.
link |
But so from that moment,
link |
can you maybe tell the origin story
link |
and the journey of AlphaGo?
link |
Yeah, so programs based on Monte Carlo tree search
link |
were a first revolution
link |
in the sense that they led to suddenly programs
link |
that could play the game to any reasonable level,
link |
but they plateaued.
link |
It seemed that no matter how much effort
link |
people put into these techniques,
link |
they couldn't exceed the level
link |
of amateur dan level Go players.
link |
So strong players,
link |
but not anywhere near the level of professionals,
link |
nevermind the world champion.
link |
And so that brings us to the birth of AlphaGo,
link |
which happened in the context of a startup company
link |
known as DeepMind.
link |
Where a project was born.
link |
And the project was really a scientific investigation
link |
where myself and Aja Huang
link |
and an intern, Chris Maddison,
link |
were exploring a scientific question.
link |
And that scientific question was really,
link |
is there another fundamentally different approach
link |
to this key question of Go,
link |
the key challenge of how can you build that intuition
link |
and how can you just have a system
link |
that could look at a position
link |
and understand what move to play
link |
or how well you're doing in that position,
link |
And so the deep learning revolution had just begun.
link |
Benchmarks like ImageNet had suddenly been won
link |
by deep learning techniques back in 2012.
link |
And following that, it was natural to ask,
link |
well, if deep learning is able to scale up so effectively
link |
with images to understand them enough to classify them,
link |
Why not take the black and white stones of the Go board
link |
and build a system which can understand for itself
link |
what that means in terms of what move to pick
link |
or who's gonna win the game, black or white?
link |
And so that was our scientific question
link |
which we were probing and trying to understand.
link |
And as we started to look at it,
link |
we discovered that we could build a system.
link |
So in fact, our very first paper on AlphaGo
link |
was actually a pure deep learning system
link |
which was trying to answer this question.
link |
And we showed that actually a pure deep learning system
link |
with no search at all was actually able
link |
to reach human dan level, master level
link |
at the full game of Go, 19 by 19 boards.
link |
And so without any search at all,
link |
suddenly we had systems which were playing
link |
at the level of the best Monte Carlo tree search systems,
link |
the ones with randomized rollouts.
link |
So first of all, sorry to interrupt,
link |
but that's kind of a groundbreaking notion.
link |
That's like basically a definitive step away
link |
from a couple of decades
link |
of essentially search dominating AI.
link |
So how did that make you feel?
link |
Was it surprising from a scientific perspective in general,
link |
how did it make you feel?
link |
I found this to be profoundly surprising.
link |
In fact, it was so surprising that we had a bet back then.
link |
And like many good projects, bets are quite motivating.
link |
And the bet was whether it was possible
link |
for a system based purely on deep learning,
link |
with no search at all to beat a dan level human player.
link |
And so we had someone who joined our team
link |
who was a dan level player.
link |
He came in and we had this first match against him and...
link |
Which side of the bet were you on, by the way?
link |
The losing or the winning side?
link |
I tend to be an optimist with the power
link |
of deep learning and reinforcement learning.
link |
So the system won,
link |
and we were able to beat this human dan level player.
link |
And for me, that was the moment where it was like,
link |
okay, something special is afoot here.
link |
We have a system which without search
link |
is able to already just look at this position
link |
and understand things as well as a strong human player.
link |
And from that point onwards,
link |
I really felt that reaching the top levels of human play,
link |
professional level, world champion level,
link |
I felt it was actually an inevitability.
link |
And if it was an inevitable outcome,
link |
I was rather keen that it would be us that achieved it.
link |
This was something where,
link |
so I had lots of conversations back then
link |
with Demis Hassabis, the head of DeepMind,
link |
who was extremely excited.
link |
And we made the decision to scale up the project,
link |
brought more people on board.
link |
And so AlphaGo became something where we had a clear goal,
link |
which was to try and crack this outstanding challenge of AI
link |
to see if we could beat the world's best players.
link |
And this led within the space of not so many months
link |
to playing against the European champion Fan Hui
link |
in a match which became memorable in history
link |
as the first time a Go program
link |
had ever beaten a professional player.
link |
And at that time we had to make a judgment
link |
as to when and whether we should go
link |
and challenge the world champion.
link |
And this was a difficult decision to make.
link |
Again, we were basing our predictions on our own progress
link |
and had to estimate based on the rapidity
link |
of our own progress when we thought we would exceed
link |
the level of the human world champion.
link |
And we tried to make an estimate and set up a match
link |
and that became the AlphaGo versus Lee Sedol match in 2016.
link |
And we should say, spoiler alert,
link |
that AlphaGo was able to defeat Lee Sedol.
link |
That's right, yeah.
link |
So maybe we could take even a broader view.
link |
AlphaGo involves both learning from expert games
link |
and as far as I remember, a self play component
link |
to where it learns by playing against itself.
link |
But in your sense, what was the role of learning
link |
from expert games there?
link |
And in terms of your self evaluation,
link |
whether you can take on the world champion,
link |
what was the thing that you're trying to do more of?
link |
Sort of train more on expert games
link |
or was there now another,
link |
I'm asking so many poorly phrased questions,
link |
but did you have a hope or dream that self play
link |
would be the key component at that moment yet?
link |
So in the early days of AlphaGo,
link |
we used human data to explore the science
link |
of what deep learning can achieve.
link |
And so when we had our first paper that showed
link |
that it was possible to predict the winner of the game,
link |
that it was possible to suggest moves,
link |
that was done using human data.
link |
A solely human data.
link |
Yeah, and so the reason that we did it that way
link |
was at that time we were exploring separately
link |
the deep learning aspect
link |
from the reinforcement learning aspect.
link |
That was the part which was new and unknown
link |
to me at that time was how far could that be stretched?
link |
Once we had that, it then became natural
link |
to try and use that same representation
link |
and see if we could learn for ourselves
link |
using that same representation.
link |
And so right from the beginning,
link |
actually our goal had been to build a system that could learn for itself.
link |
And to us, the human data right from the beginning
link |
was an expedient step to help us for pragmatic reasons
link |
to go faster towards the goals of the project
link |
than we might be able to starting solely from self play.
link |
And so in those days, we were very aware
link |
that we were choosing to use human data
link |
and that might not be the longterm holy grail of AI,
link |
but that it was something which was extremely useful to us.
link |
It helped us to understand the system.
link |
It helped us to build deep learning representations
link |
which were clear and simple and easy to use.
link |
And so really I would say it served a purpose
link |
not just as part of the algorithm,
link |
but something which I continue to use in our research today,
link |
which is trying to break down a very hard challenge
link |
into pieces which are easier to understand for us
link |
as researchers and develop.
link |
So if you use a component based on human data,
link |
it can help you to understand the system
link |
such that then you can build
link |
the more principled version later that does it for itself.
link |
So as I said, the AlphaGo victory,
link |
and I don't think I'm sort of romanticizing this notion.
link |
I think it's one of the greatest moments
link |
in the history of AI.
link |
So were you cognizant of this magnitude
link |
of the accomplishment at the time?
link |
I mean, are you cognizant of it even now?
link |
Because to me, I feel like it's something that would,
link |
we mentioned what the AGI systems of the future
link |
I think they'll look back at the AlphaGo victory
link |
as like, holy crap, they figured it out.
link |
This is where it started.
link |
Well, thank you again.
link |
I mean, it's funny because I guess I've been working on,
link |
I've been working on computer Go for a long time.
link |
So I'd been working at the time of the AlphaGo match
link |
on computer Go for more than a decade.
link |
And throughout that decade, I'd had this dream
link |
of what would it be like to, what would it be like really
link |
to actually be able to build a system
link |
that could play against the world champion.
link |
And I imagined that that would be an interesting moment
link |
that maybe some people might care about that
link |
and that this might be a nice achievement.
link |
But I think when I arrived in Seoul
link |
and discovered the legions of journalists
link |
that were following us around and the 100 million people
link |
that were watching the match online live,
link |
I realized that I'd been off in my estimation
link |
of how significant this moment was
link |
by several orders of magnitude.
link |
And so there was definitely an adjustment process
link |
to realize that this was something
link |
which the world really cared about
link |
and which was a watershed moment.
link |
And I think there was that moment of realization.
link |
But it's also a little bit scary
link |
because if you go into something thinking
link |
it's gonna be maybe of interest
link |
and then discover that 100 million people are watching,
link |
it suddenly makes you worry about
link |
whether some of the decisions you'd made
link |
were really the best ones or the wisest,
link |
or were going to lead to the best outcome.
link |
And we knew for sure that there were still imperfections
link |
in AlphaGo, which were gonna be exposed
link |
to the whole world watching.
link |
And so, yeah, it was I think a great experience
link |
and I feel privileged to have been part of it,
link |
privileged to have led that amazing team.
link |
I feel privileged to have been in a moment of history
link |
like you say, but also lucky that in a sense
link |
I was insulated from the knowledge of,
link |
I think it would have been harder to focus on the research
link |
if the full kind of reality of what was gonna come to pass
link |
had been known to me and the team.
link |
I think it was, we were in our bubble
link |
and we were working on research
link |
and we were trying to answer the scientific questions
link |
and then bam, the public sees it.
link |
And I think it was better that way in retrospect.
link |
Were you confident that, I guess,
link |
what were the chances that you could get the win?
link |
So just like you said, I'm a little bit more familiar
link |
with another accomplishment
link |
that we may not even get a chance to talk about.
link |
I talked to Oriol Vinyals about AlphaStar,
link |
which is another incredible accomplishment,
link |
but with AlphaStar beating the best at StarCraft,
link |
there was already a track record from AlphaGo.
link |
This is the really first time
link |
you get to see reinforcement learning
link |
face the best human in the world.
link |
So what was your confidence like, what was the odds?
link |
Well, we actually. Was there a bet?
link |
Funnily enough, there was.
link |
So just before the match,
link |
we weren't betting on anything concrete,
link |
but we all held out a hand.
link |
Everyone in the team held out a hand
link |
at the beginning of the match.
link |
And the number of fingers that they had out on their hand
link |
was supposed to represent how many games
link |
they thought we would win against Lee Sedol.
link |
And there was an amazing spread in the team's predictions.
link |
But I have to say, I predicted four, one.
link |
And the reason was based purely on data.
link |
So I'm a scientist first and foremost.
link |
And one of the things which we had established
link |
was that AlphaGo in around one in five games
link |
would develop something which we called a delusion,
link |
which was a kind of hole in its knowledge
link |
where it wasn't able to fully understand
link |
everything about the position.
link |
And that hole in its knowledge would persist
link |
for tens of moves throughout the game.
link |
And we knew two things.
link |
We knew that if there were no delusions,
link |
that AlphaGo seemed to be playing at a level
link |
that was far beyond any human capabilities.
link |
But we also knew that if there were delusions,
link |
the opposite was true.
link |
And in fact, that's what came to pass.
link |
We saw all of those outcomes.
link |
And Lee Sedol in one of the games
link |
played a really beautiful sequence
link |
that AlphaGo just hadn't predicted.
link |
And after that, AlphaGo was led into this situation
link |
where it was unable to really understand the position fully
link |
and found itself in one of these delusions.
link |
So indeed, yeah, 4 to 1 was the outcome.
link |
So yeah, and can you maybe speak to it a little bit more?
link |
What were the five games?
link |
Is there interesting things that come to memory
link |
in terms of the play of the human or the machine?
link |
So I remember all of these games vividly, of course.
link |
Moments like these don't come too often
link |
in the lifetime of a scientist.
link |
And the first game was magical because it was the first time
link |
that a computer program had defeated a world
link |
champion in this grand challenge of Go.
link |
And there was a moment where AlphaGo invaded Lee Sedol's
link |
territory towards the end of the game.
link |
And that's quite an audacious thing to do.
link |
It's like saying, hey, you thought
link |
this was going to be your territory in the game,
link |
but I'm going to stick a stone right in the middle of it
link |
and prove to you that I can break it up.
link |
And Lee Sedol's face just dropped.
link |
He wasn't expecting a computer to do something that audacious.
link |
The second game became famous for a move known as move 37.
link |
This was a move that was played by AlphaGo that broke
link |
all of the conventions of Go. The Go players were
link |
so shocked by it that
link |
they thought that maybe the operator had made a mistake.
link |
They thought that there was something crazy going on.
link |
And it just broke every rule that Go players
link |
are taught from a very young age.
link |
There's this kind of move called a shoulder hit.
link |
You can only play it on the third line or the fourth line,
link |
and AlphaGo played it on the fifth line.
link |
And it turned out to be a brilliant move
link |
and made this beautiful pattern in the middle of the board that
link |
ended up winning the game.
link |
And so this really was a clear instance
link |
where we could say computers exhibited creativity,
link |
that this was really a move that was something
link |
humans hadn't known about, hadn't anticipated.
link |
And computers discovered this idea.
link |
They were the ones to say, actually, here's
link |
a new idea, something new, not in the domains
link |
of human knowledge of the game.
link |
And now the humans think this is a reasonable thing to do.
link |
And it's part of Go knowledge now.
link |
The third game, something special
link |
happens when you play against a human world champion, which,
link |
again, I hadn't anticipated before going there,
link |
which is these players are amazing.
link |
Lee Sedol was a true champion, 18-time world champion,
link |
and had this amazing ability to probe AlphaGo
link |
for weaknesses of any kind.
link |
And in the third game, he was losing,
link |
and we felt we were sailing comfortably to victory.
link |
But he managed to, from nothing, stir up this fight
link |
and build what's called a double ko,
link |
these kind of repetitive positions.
link |
And he knew that historically, no computer Go program had ever
link |
been able to deal correctly with double ko positions.
link |
And he managed to summon one out of nothing.
link |
And so for us, this was a real challenge.
link |
Would AlphaGo be able to deal with this,
link |
or would it just kind of crumble in the face of this situation?
link |
And fortunately, it dealt with it perfectly.
link |
The fourth game was amazing in that Lee Sedol
link |
appeared to be losing this game.
link |
AlphaGo thought it was winning.
link |
And then Lee Sedol did something,
link |
which I think only a true world champion can do,
link |
which is he found a brilliant sequence
link |
in the middle of the game, a brilliant sequence
link |
that led him to really just transform the position.
link |
He kind of found just a piece of genius, really.
link |
And after that, AlphaGo, its evaluation just tumbled.
link |
It thought it was winning this game.
link |
And all of a sudden, it tumbled and said, oh, now
link |
I've got no chance.
link |
And it started to behave rather oddly at that point.
link |
In the final game, for some reason, we as a team
link |
were convinced, having seen AlphaGo in the previous game,
link |
suffer from delusions.
link |
We as a team were convinced that it
link |
was suffering from another delusion.
link |
We were convinced that it was misevaluating the position
link |
and that something was going terribly wrong.
link |
And it was only in the last few moves of the game
link |
that we realized that actually, although it
link |
had been predicting it was going to win all the way through,
link |
it really was winning after all.
link |
And so somehow, it just taught us yet again
link |
that you have to have faith in your systems.
link |
When they exceed your own level of ability
link |
and your own judgment, you have to trust in them
link |
to know better than you, the designer, once you've
link |
bestowed in them the ability to judge better than you can,
link |
then trust the system to do so.
link |
So just like in the case of Deep Blue beating Garry Kasparov,
link |
that was, I think, the first time he'd ever lost a match, actually.
link |
And I mean, there's a similar situation with Lee Sedol.
link |
It's a tragic loss for humans, but a beautiful one,
link |
I think, that's kind of, from the tragedy,
link |
sort of emerges over time, emerges
link |
a kind of inspiring story.
link |
But Lee Sedol recently announced his retirement.
link |
I don't know if we can look too deeply into it,
link |
but he did say that even if I become number one,
link |
there's an entity that cannot be defeated.
link |
So what do you think about these words?
link |
What do you think about his retirement from the game of Go?
link |
Well, let me take you back, first of all,
link |
to the first part of your comment about Garry Kasparov,
link |
because actually, at the panel yesterday,
link |
he specifically said that when he first lost to Deep Blue,
link |
he viewed it as a failure.
link |
He viewed that this had been a failure of his.
link |
But later on in his career, he said
link |
he'd come to realize that actually, it was a success.
link |
It was a success for everyone, because this marked a
link |
transformational moment for AI.
link |
And so even for Garry Kasparov, he
link |
came to realize that that moment was pivotal
link |
and actually meant something much more
link |
than his personal loss in that moment.
link |
Lee Sedol, I think, was much more cognizant of that.
link |
And so in his closing remarks to the match,
link |
he really felt very strongly that what
link |
had happened in the AlphaGo match
link |
was not only meaningful for AI, but for humans as well.
link |
And he felt as a Go player that it had opened his horizons
link |
and meant that he could start exploring new things.
link |
It brought his joy back for the game of Go,
link |
because it had broken all of the conventions and barriers
link |
and meant that suddenly, anything was possible again.
link |
So I was sad to hear that he'd retired,
link |
but he's been a great world champion over many, many years.
link |
And I think he'll be remembered for that ever more.
link |
He'll be remembered as the last person to beat AlphaGo.
link |
I mean, after that, we increased the power of the system.
link |
And the next version of AlphaGo beat other strong human
link |
players 60 games to nil.
link |
So what a great moment for him and something
link |
to be remembered for.
link |
It's interesting that you spent time at AAAI on a panel
link |
with Garry Kasparov.
link |
What, I mean, it's almost, I'm just
link |
curious to learn the conversations you've
link |
had with Garry, because he's also now,
link |
he's written a book about artificial intelligence.
link |
He's thinking about AI.
link |
He has kind of a view of it.
link |
And he talks about AlphaGo a lot.
link |
What's your sense?
link |
Arguably, I'm not just being Russian,
link |
but I think Garry is the greatest chess player
link |
of all time, probably one of the greatest game
link |
players of all time.
link |
And you sort of at the center of creating
link |
a system that beats one of the greatest players of all time.
link |
So what is that conversation like?
link |
Is there anything, any interesting digs, any bets,
link |
any funny things, any profound things?
link |
So Garry Kasparov has an incredible respect
link |
for what we did with AlphaGo.
link |
And it's an amazing tribute coming from him of all people
link |
that he really appreciates and respects what we've done.
link |
And I think he feels that the progress which has happened
link |
in computer chess, which later after AlphaGo,
link |
we built the AlphaZero system, which
link |
defeated the world's strongest chess programs.
link |
And to Garry Kasparov, that moment in computer chess
link |
was more profound than Deep Blue.
link |
And the reason he believes it mattered more
link |
was because it was done with learning
link |
and a system which was able to discover for itself
link |
new principles, new ideas, which were
link |
able to play the game in a way which he hadn't always
link |
known about or anyone.
link |
And in fact, one of the things I discovered at this panel
link |
was that the current world champion, Magnus Carlsen,
link |
apparently recently commented on his improvement.
link |
And he attributed it to AlphaZero,
link |
that he's been studying the games of AlphaZero.
link |
And he's changed his style to play more like AlphaZero.
link |
And it's led to him actually increasing his rating.
link |
Yeah, I guess to me, just like to Garry,
link |
the inspiring thing is that, and just like you said,
link |
with reinforcement learning, reinforcement learning
link |
and deep learning, machine learning
link |
feels like what intelligence is.
link |
And you could attribute it to a bitter viewpoint
link |
from Garry's perspective, from us humans perspective,
link |
saying that the pure search that IBM's Deep Blue was doing
link |
is not really intelligence, that somehow it didn't feel like it.
link |
And so that's the magical.
link |
I'm not sure what it is about learning that
link |
feels like intelligence, but it does.
link |
So I think we should not demean the achievements of what
link |
was done in previous eras of AI.
link |
I think that Deep Blue was an amazing achievement in itself.
link |
And that heuristic search of the kind that was used by Deep
link |
Blue had some powerful ideas that were in there,
link |
but it also missed some things.
link |
So the fact that the evaluation function, the way
link |
that the chess position was understood,
link |
was created by humans and not by the machine
link |
is a limitation, which means that there's
link |
a ceiling on how well it can do.
link |
But maybe more importantly, it means
link |
that the same idea cannot be applied in other domains
link |
where we don't have access to the human grandmasters
link |
and that ability to encode exactly their knowledge
link |
into an evaluation function.
link |
And the reality is that the story of AI
link |
is that most domains turn out to be of the second type
link |
where knowledge is messy, it's hard to extract from experts,
link |
or it isn't even available.
link |
And so we need to solve problems in a different way.
link |
And I think AlphaGo is a step towards solving things
link |
in a way which puts learning as a first class citizen
link |
and says systems need to understand for themselves
link |
how to understand the world, how to judge the value of any action
link |
that they might take within that world
link |
and any state they might find themselves in.
link |
And in order to do that, we make progress towards AI.
link |
Yeah, so one of the nice things about taking a learning
link |
approach to the game of Go or game playing
link |
is that the things you learn, the things you figure out,
link |
are actually going to be applicable to other problems
link |
that are real world problems.
link |
That's ultimately, I mean, there's
link |
two really interesting things about AlphaGo.
link |
One is the science of it, just the science of learning,
link |
the science of intelligence.
link |
And then the other is while you're actually
link |
learning to figuring out how to build systems that
link |
would be potentially applicable in other applications,
link |
medical, autonomous vehicles, robotics,
link |
I mean, it's just open the door to all kinds of applications.
link |
So the next incredible step, really the profound step
link |
is probably AlphaGo Zero.
link |
I mean, it's arguable.
link |
I kind of see them all as the same place.
link |
But really, and perhaps you were already
link |
thinking that AlphaGo Zero is the natural next step,
link |
that it was always going to be the next step.
link |
But it's removing the reliance on human expert games
link |
for pre training, as you mentioned.
link |
So how big of an intellectual leap
link |
was this that self play could achieve superhuman level
link |
performance on its own?
link |
And maybe could you also say, what is self play?
link |
Kind of mention it a few times.
link |
So let me start with self play.
link |
So the idea of self play is something
link |
which is really about systems learning for themselves,
link |
but in the situation where there's more than one agent.
link |
And so if you're in a game, and the game
link |
is played between two players, then self play
link |
is really about understanding that game just
link |
by playing games against yourself
link |
rather than against any actual real opponent.
link |
And so it's a way to kind of discover strategies
link |
without having to actually need to go out and play
link |
against any particular human player, for example.
link |
The main idea of Alpha Zero was really
link |
to try and step back from any of the knowledge
link |
that we put into the system and ask the question,
link |
is it possible to come up with a single elegant principle
link |
by which a system can learn for itself all of the knowledge
link |
which it requires to play a game such as Go?
link |
Importantly, by taking knowledge out,
link |
you not only make the system less brittle in the sense
link |
that perhaps the knowledge you were putting in
link |
was just getting in the way and maybe stopping the system
link |
learning for itself, but also you make it more general.
link |
The more knowledge you put in, the harder
link |
it is for a system to actually be placed,
link |
taken out of the system in which it's kind of been designed,
link |
and placed in some other system that maybe would need
link |
a completely different knowledge base to understand it.
link |
And so the real goal here is to strip out all of the knowledge
link |
that we put in to the point that we can just plug it
link |
into something totally different.
link |
And that, to me, is really the promise of AI
link |
is that we can have systems such as that which,
link |
no matter what the goal is, no matter what goal
link |
we set to the system, we can come up
link |
with an algorithm which can be placed into that world,
link |
into that environment, and can succeed
link |
in achieving that goal.
link |
And then that, to me, is almost the essence of intelligence
link |
if we can achieve that.
link |
And so AlphaZero is a step towards that.
link |
And it's a step that was taken in the context of two player
link |
perfect information games like Go and chess.
link |
We also applied it to Japanese chess.
link |
So just to clarify, what was the first step?
link |
The first step was to try and take all of the knowledge out
link |
of AlphaGo in such a way that it could
link |
play in a fully self discovered way, purely from self play.
link |
And to me, the motivation for that
link |
was always that we could then plug it into other domains.
link |
But we saved that until later.
link |
Well, in fact, I mean, just for fun,
link |
I could tell you exactly the moment
link |
where the idea for AlphaZero occurred to me.
link |
Because I think there's maybe a lesson there for researchers
link |
who are too deeply embedded in their research
link |
and working 24-7 to try and come up with the next idea,
link |
which is it actually occurred to me on honeymoon.
link |
And I was at my most fully relaxed state,
link |
really enjoying myself, and just bing,
link |
the algorithm for AlphaZero just appeared in its full form.
link |
And this was actually before we played against Lee Sedol.
link |
But we just didn't.
link |
I think we were so busy trying to make sure
link |
we could beat the world champion that it was only later
link |
that we had the opportunity to step back and start
link |
examining that sort of deeper scientific question of whether
link |
this could really work.
link |
So nevertheless, so self play is probably
link |
one of the most profound ideas that represents, to me at least,
link |
artificial intelligence.
link |
But the fact that you could use that kind of mechanism
link |
to, again, beat world class players,
link |
that's very surprising.
link |
So to me, it feels like you have to train
link |
in a large number of expert games.
link |
So was it surprising to you?
link |
What was the intuition?
link |
Can you sort of think, not necessarily at that time,
link |
even now, what's your intuition?
link |
Why this thing works so well?
link |
Why it's able to learn from scratch?
link |
Well, let me first say why we tried it.
link |
So we tried it both because I feel
link |
that it was the deeper scientific question
link |
to be asking to make progress towards AI,
link |
and also because, in general, in my research,
link |
I don't like to do research on questions for which we already
link |
know the likely outcome.
link |
I don't see much value in running an experiment where
link |
you're 95% confident that you will succeed.
link |
And so we could have tried maybe to take AlphaGo and do
link |
something which we knew for sure it would succeed on.
link |
But much more interesting to me was to try it on the things
link |
which we weren't sure about.
link |
And one of the big questions on our minds
link |
back then was, could you really do this with self play alone?
link |
How far could that go?
link |
Would it be as strong?
link |
And honestly, we weren't sure.
link |
It was 50-50, I think.
link |
If you'd asked me, I wasn't confident
link |
that it could reach the same level as these systems,
link |
but it felt like the right question to ask.
link |
And even if it had not achieved the same level,
link |
I felt that that was an important direction to explore.
link |
And so then, lo and behold, it actually
link |
ended up outperforming the previous version of AlphaGo
link |
and indeed was able to beat it by 100 games to zero.
link |
So what's the intuition as to why?
link |
I think the intuition to me is clear,
link |
that whenever you have errors in a system, as we did in AlphaGo,
link |
AlphaGo suffered from these delusions.
link |
Occasionally, it would misunderstand
link |
what was going on in a position and misevaluate it.
link |
How can you remove all of these errors?
link |
Errors arise from many sources.
link |
For us, they were arising both starting from the human data,
link |
but also from the nature of the search
link |
and the nature of the algorithm itself.
link |
But the only way to address them in any complex system
link |
is to give the system the ability
link |
to correct its own errors.
link |
It must be able to correct them.
link |
It must be able to learn for itself
link |
when it's doing something wrong and correct for it.
link |
And so it seemed to me that the way to correct delusions
link |
was indeed to have more iterations of reinforcement
link |
learning, that no matter where you start,
link |
you should be able to correct those errors
link |
until it gets to play that out and understand,
link |
oh, well, I thought that I was going to win in this situation,
link |
but then I ended up losing.
link |
That suggests that I was misevaluating something.
link |
There's a hole in my knowledge, and now the system
link |
can correct for itself and understand how to do better.
link |
Now, if you take that same idea and trace it back
link |
all the way to the beginning, it should
link |
be able to take you from no knowledge,
link |
from completely random starting point,
link |
all the way to the highest levels of knowledge
link |
that you can achieve in a domain.
link |
And the principle is the same, that if you bestow a system
link |
with the ability to correct its own errors,
link |
then it can take you from random to something slightly
link |
better than random because it sees the stupid things
link |
that the random is doing, and it can correct them.
link |
And then it can take you from that slightly better system
link |
and understand, well, what's that doing wrong?
link |
And it takes you on to the next level and the next level.
link |
And this progress can go on indefinitely.
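The ladder David describes, from random play to ever stronger play, can be caricatured in a few lines: a challenger policy is a small perturbation of the incumbent and is promoted only when it actually wins. Everything here (the guessing game, the hidden target, the mutation step) is a made-up stand-in for illustration, not the real training procedure.

```python
import random

def play(pa, pb):
    """Return 0 if policy pa wins, 1 if pb wins, in a toy guessing game
    where the policy closer to a hidden target wins (ties go to pa)."""
    target = 7
    return 0 if abs(pa - target) <= abs(pb - target) else 1

def improve(generations=300):
    champion = random.randint(0, 50)  # a completely random starting point
    for _ in range(generations):
        challenger = champion + random.choice((-1, 1))  # slight variation
        if play(challenger, champion) == 0:
            champion = challenger  # an error was corrected; keep the fix
    return champion
```

Each accepted challenger is one of the "micro discoveries": the system sees something the previous version was doing wrong and corrects it, climbing from random toward the best policy.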
link |
And indeed, what would have happened
link |
if we'd carried on training AlphaGo Zero for longer?
link |
We saw no sign of it slowing down its improvements,
link |
or at least it was certainly carrying on to improve.
link |
And presumably, if you had the computational resources,
link |
this could lead to better and better systems
link |
that discover more and more.
link |
So your intuition is fundamentally
link |
there's not a ceiling to this process.
link |
One of the surprising things, just like you said,
link |
is the process of patching errors.
link |
It intuitively makes sense that this is,
link |
that reinforcement learning should be part of that process.
link |
But what is surprising is in the process
link |
of patching your own lack of knowledge,
link |
you don't open up other patches.
link |
You keep sort of, like there's a monotonic decrease
link |
of your weaknesses.
link |
Well, let me back this up.
link |
I think science always should make falsifiable hypotheses.
link |
So let me back up this claim with a falsifiable hypothesis,
link |
which is that if someone was to, in the future,
link |
take Alpha Zero as an algorithm
link |
and run it with greater computational resources
link |
than we had available today,
link |
then I would predict that they would be able
link |
to beat the previous system 100 games to zero.
link |
And that if they were then to do the same thing
link |
a couple of years later,
link |
that that would beat that previous system 100 games to zero,
link |
and that that process would continue indefinitely
link |
throughout at least my human lifetime.
link |
Presumably the game of Go would set the ceiling.
link |
The game of Go would set the ceiling,
link |
but the game of Go has 10 to the 170 states in it.
link |
So the ceiling is unreachable by any computational device
link |
that can be built out of the 10 to the 80 atoms in the universe.
link |
You asked a really good question,
link |
which is, do you not open up other errors
link |
when you correct your previous ones?
link |
And the answer is yes, you do.
link |
And so it's a remarkable fact
link |
about this class of two player game
link |
and also true of single agent games
link |
that essentially progress will always lead you to,
link |
if you have sufficient representational resource,
link |
like imagine you had,
link |
could represent every state in a big table of the game,
link |
then we know for sure that a progress of self improvement
link |
will lead all the way in the single agent case
link |
to the optimal possible behavior,
link |
and in the two player case to the minimax optimal behavior.
link |
And that is the best way that I can play
link |
knowing that you're playing perfectly against me.
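The tabular case David mentions, where every state can be held in a big table and self-improvement provably reaches the minimax-optimal answer, can be shown exactly on a toy game small enough to solve. The game below is invented for illustration; the point is that with full representational resources the computed values are the minimax values and no errors remain.

```python
from functools import lru_cache

TARGET = 10  # toy game: add 1 or 2 per turn; reaching exactly 10 wins

@lru_cache(maxsize=None)  # the cache is the "big table" over all states
def minimax(total):
    """+1 if the player to move from `total` wins under perfect play, else -1."""
    for move in (1, 2):
        if total + move == TARGET:
            return 1  # immediate win
        if total + move < TARGET and minimax(total + move) == -1:
            return 1  # move to a position that is lost for the opponent
    return -1
```

For this game the lost positions for the player to move are exactly totals 1, 4, and 7, which matches what the self-play sketch earlier would eventually discover.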
link |
And so for those cases,
link |
we know that even if you do open up some new error,
link |
that in some sense you've made progress.
link |
You're progressing towards the best that can be done.
link |
So AlphaGo was initially trained on expert games
link |
with some self play.
link |
AlphaGo Zero removed the need to be trained on expert games.
link |
And then another incredible step for me,
link |
because I just love chess,
link |
is to generalize that further in AlphaZero
link |
to be able to play the game of Go,
link |
beating AlphaGo Zero and AlphaGo,
link |
and then also being able to play the game of chess.
link |
So what was that step like?
link |
What's the interesting aspects there
link |
that required to make that happen?
link |
I think the remarkable observation,
link |
which we saw with AlphaZero,
link |
was that actually without modifying the algorithm at all,
link |
it was able to play and crack
link |
some of AI's greatest previous challenges.
link |
In particular, we dropped it into the game of chess.
link |
And unlike the previous systems like Deep Blue,
link |
which had been worked on for years and years,
link |
we were able to beat
link |
the world's strongest computer chess program convincingly
link |
using a system that was fully discovered
link |
from scratch with its own principles.
link |
And in fact, one of the nice things that we found
link |
was that in fact, we also achieved the same result
link |
in Japanese chess, a variant of chess
link |
where you get to capture pieces
link |
and then place them back down on your own side
link |
as an extra piece.
link |
So a much more complicated variant of chess.
link |
And we also beat the world's strongest programs
link |
and reached superhuman performance in that game too.
link |
And the very first time that we'd ever run the system
link |
on that particular game
link |
was the version that we published
link |
in the paper on AlphaZero.
link |
It just worked out of the box, literally, no touching it.
link |
We didn't have to do anything.
link |
And there it was, superhuman performance,
link |
no tweaking, no twiddling.
link |
And so I think there's something beautiful
link |
about that principle that you can take an algorithm
link |
and without twiddling anything, it just works.
link |
Now, to go beyond AlphaZero, what's required?
link |
AlphaZero is just a step.
link |
And there's a long way to go beyond that
link |
to really crack the deep problems of AI.
link |
But one of the important steps is to acknowledge
link |
that the world is a really messy place.
link |
It's this rich, complex, beautiful,
link |
but messy environment that we live in.
link |
And no one gives us the rules.
link |
Like no one knows the rules of the world.
link |
At least maybe we understand that it operates
link |
according to Newtonian or quantum mechanics
link |
at the micro level or according to relativity
link |
at the macro level.
link |
But that's not a model that's useful for us as people.
link |
Somehow the agent needs to understand the world for itself
link |
in a way where no one tells it the rules of the game.
link |
And yet it can still figure out what to do in that world,
link |
deal with this stream of observations coming in,
link |
rich sensory input coming in,
link |
actions going out in a way that allows it to reason
link |
in the way that AlphaGo or AlphaZero can reason
link |
in the way that these Go and chess playing programs can.
link |
But in a way that allows it to take actions
link |
in that messy world to achieve its goals.
link |
And so this led us to the most recent step
link |
in the story of AlphaGo,
link |
which was a system called MuZero.
link |
And MuZero is a system which learns for itself
link |
even when the rules are not given to it.
link |
It actually can be dropped into a system
link |
with messy perceptual inputs.
link |
We actually tried it in some Atari games,
link |
the canonical domains of Atari
link |
that have been used for reinforcement learning.
link |
And this system learned to build a model
link |
of these Atari games that was sufficiently rich
link |
and useful enough for it to be able to plan successfully.
link |
And in fact, that system not only went on
link |
to beat the state of the art in Atari,
link |
but the same system without modification
link |
was able to reach the same level of superhuman performance
link |
in Go, chess, and shogi that we'd seen in AlphaZero,
link |
showing that even without the rules,
link |
the system can learn for itself just by trial and error,
link |
just by playing this game of Go.
link |
And no one tells you what the rules are,
link |
but you just get to the end and someone says win or loss.
link |
You play this game of chess and someone says win or loss,
link |
or you play a game of breakout in Atari
link |
and someone just tells you your score at the end.
link |
And the system for itself figures out
link |
essentially the rules of the system,
link |
the dynamics of the world, how the world works.
link |
And not in any explicit way, but just implicitly,
link |
enough understanding for it to be able to plan
link |
in that system in order to achieve its goals.
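Structurally, the MuZero idea David describes rests on three learned functions: an encoder from observations to a latent state, a dynamics function that rolls the latent state forward given an action, and a prediction head giving a policy and value. The sketch below stubs all three with trivial placeholders (the real ones are deep networks trained end to end, and none of these names come from DeepMind's code); it only shows how planning can proceed entirely inside the learned model, never consulting the environment's true rules.

```python
def h_representation(observation):
    # encoder: observation -> latent state (placeholder: identity)
    return tuple(observation)

def g_dynamics(latent, action):
    # learned transition: (latent, action) -> (next latent, predicted reward)
    return latent + (action,), 0.0

def f_prediction(latent):
    # learned heads: latent -> (policy prior over actions, value estimate)
    return {0: 0.5, 1: 0.5}, 0.0

def plan(observation, depth=2):
    """Tiny exhaustive lookahead carried out entirely in latent space."""
    def search(latent, d):
        policy, value = f_prediction(latent)
        if d == 0:
            return value
        best = float("-inf")
        for action in policy:
            nxt, reward = g_dynamics(latent, action)
            best = max(best, reward + search(nxt, d - 1))
        return best

    root = h_representation(observation)
    scores = {}
    for action in f_prediction(root)[0]:
        nxt, reward = g_dynamics(root, action)
        scores[action] = reward + search(nxt, depth - 1)
    return max(scores, key=scores.get)
```

The environment only ever supplies observations and, at the end, a score; the rules live implicitly in `g_dynamics`, which is exactly the sense in which the system "figures out the rules for itself."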
link |
And that's the fundamental process
link |
that you have to go through when you're facing
link |
any uncertain kind of environment
link |
that you would in the real world,
link |
is figuring out the sort of the rules,
link |
the basic rules of the game.
link |
So that allows it to be applicable
link |
to basically any domain that could be digitized
link |
in the way that it needs to in order to be consumable,
link |
sort of in order for the reinforcement learning framework
link |
to be able to sense the environment,
link |
to be able to act in the environment and so on.
link |
The full reinforcement learning problem
link |
needs to deal with worlds that are unknown and complex
link |
and the agent needs to learn for itself
link |
how to deal with that.
link |
And so MuZero is a further step in that direction.
link |
One of the things that inspired the general public
link |
and just in conversations I have like with my parents
link |
or something with my mom that just loves what was done
link |
is kind of at least the notion
link |
that there was some display of creativity,
link |
some new strategies, new behaviors that were created.
link |
That again has echoes of intelligence.
link |
So is there something that stands out?
link |
Do you see it the same way that there's creativity
link |
and there's some behaviors, patterns that you saw
link |
that AlphaZero was able to display that are truly creative?
link |
So let me start by saying that I think we should ask
link |
what creativity really means.
link |
So to me, creativity means discovering something
link |
which wasn't known before, something unexpected,
link |
something outside of our norms.
link |
And so in that sense, the process of reinforcement learning
link |
or the self play approach that was used by AlphaZero
link |
is the essence of creativity.
link |
It's really saying at every stage,
link |
you're playing according to your current norms
link |
and you try something and if it works out,
link |
you say, hey, here's something great,
link |
I'm gonna start using that.
link |
And then that process, it's like a micro discovery
link |
that happens millions and millions of times
link |
over the course of the algorithm's life
link |
where it just discovers some new idea,
link |
oh, this pattern, this pattern's working really well for me,
link |
I'm gonna start using that.
link |
And now, oh, here's this other thing I can do,
link |
I can start to connect these stones together in this way
link |
or I can start to sacrifice stones or give up on pieces
link |
or play shoulder hits on the fifth line or whatever it is.
link |
The system's discovering things like this for itself
link |
continually, repeatedly, all the time.
link |
And so it should come as no surprise to us then
link |
when if you leave these systems going,
link |
that they discover things that are not known to humans,
link |
that to the human norms are considered creative.
link |
And we've seen this several times.
link |
In fact, in AlphaGo Zero,
link |
we saw this beautiful timeline of discovery
link |
where what we saw was that there are these opening patterns
link |
that humans play called joseki,
link |
these are like the patterns that humans learn
link |
to play in the corners and they've been developed
link |
and refined over literally thousands of years
link |
in the game of Go.
link |
And what we saw was in the course of the training,
link |
AlphaGo Zero, over the course of the 40 days
link |
that we trained this system,
link |
it starts to discover exactly these patterns
link |
that human players play.
link |
And over time, we found that all of the joseki
link |
that humans played were discovered by the system
link |
through this process of self play
link |
and this sort of essential notion of creativity.
link |
But what was really interesting was that over time,
link |
it then starts to discard some of these
link |
in favor of its own joseki that humans didn't know about.
link |
And it starts to say, oh, well,
link |
you thought the knight's move pincer joseki was the way to play,
link |
but here's something different you can do there
link |
which makes some new variation
link |
that humans didn't know about.
link |
And actually now the human Go players
link |
study the joseki that AlphaGo played
link |
and they become the new norms
link |
that are used in today's top level Go competitions.
link |
That never gets old.
link |
Even just the first part, to me,
link |
makes me feel good as a human being:
link |
that a self play mechanism that knows nothing about us humans
link |
discovers the same patterns that we humans do.
link |
That's just like an affirmation
link |
that we're doing okay as humans.
link |
In this domain and other domains,
link |
we've figured things out. It's like the Churchill quote
link |
about democracy: you know, it sucks,
link |
but it's the best one we've tried.
link |
So in general, taking a step outside of Go,
link |
you have like a million accomplishments
link |
that I have no time to talk about,
link |
with AlphaStar and so on and the current work.
link |
But in general, this self play mechanism
link |
that you've inspired the world with
link |
by beating the world champion Go player.
link |
Do you see that as,
link |
do you see it being applied in other domains?
link |
Do you have sort of dreams and hopes
link |
that it's applied in both the simulated environments
link |
and the constrained environments of games?
link |
Constrained, I mean, AlphaStar really demonstrates
link |
that you can remove a lot of the constraints,
link |
but nevertheless, it's in a digital simulated environment.
link |
Do you have a hope, a dream that it starts being applied
link |
in the robotics environment?
link |
And maybe even in domains that are safety critical
link |
and so on and have, you know,
link |
have a real impact in the real world,
link |
like autonomous vehicles, for example,
link |
which seems like a very far out dream at this point.
link |
So I absolutely do hope and imagine
link |
that we will get to the point where ideas
link |
just like these are used in all kinds of different domains.
link |
In fact, one of the most satisfying things
link |
as a researcher is when you start to see other people
link |
use your algorithms in unexpected ways.
link |
So in the last couple of years, there have been,
link |
you know, a couple of nature papers
link |
where different teams, unbeknownst to us,
link |
took AlphaZero and applied exactly those same algorithms
link |
and ideas to real world problems of huge meaning to society.
link |
So one of them was the problem of chemical synthesis,
link |
and they were able to beat the state of the art
link |
in finding pathways of how to actually synthesize chemicals,
link |
that is, retrosynthesis.
link |
And the second paper actually just came out
link |
a couple of weeks ago in Nature,
link |
showed that in quantum computation,
link |
you know, one of the big questions is how to understand
link |
the nature of the wave function in quantum computation
link |
and a system based on AlphaZero beat the state of the art
link |
by quite some distance there again.
link |
So these are just examples.
link |
And I think, you know, the lesson,
link |
which we've seen elsewhere in machine learning
link |
time and time again, is that if you make something general,
link |
it will be used in all kinds of ways.
link |
You know, you provide a really powerful tool to society,
link |
and those tools can be used in amazing ways.
link |
And so I think we're just at the beginning,
link |
and for sure, I hope that we see all kinds of outcomes.
link |
So the other side of the question of reinforcement
link |
learning framework is, you know,
link |
you usually want to specify a reward function
link |
and an objective function.
link |
What do you think about sort of ideas of intrinsic rewards
link |
of when we're not really sure about, you know,
link |
if we take, you know, human beings as existence proof
link |
that we don't seem to be operating
link |
according to a single reward,
link |
do you think that there's interesting ideas
link |
for when you don't know how to truly specify the reward,
link |
you know, that there's some flexibility
link |
for discovering it intrinsically or so on
link |
in the context of reinforcement learning?
link |
So I think, you know, when we think about intelligence,
link |
it's really important to be clear
link |
about the problem of intelligence.
link |
And I think it's clearest to understand that problem
link |
in terms of some ultimate goal
link |
that we want the system to try and solve for.
link |
And after all, if we don't understand the ultimate purpose
link |
of the system, do we really even have
link |
a clearly defined problem that we're solving at all?
link |
Now, within that, as with your example for humans,
link |
the system may choose to create its own motivations
link |
and subgoals that help the system
link |
to achieve its ultimate goal.
link |
And that may indeed be a hugely important mechanism
link |
to achieve those ultimate goals,
link |
but there is still some ultimate goal
link |
against which I think the system
link |
needs to be measured and evaluated.
link |
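One common way this idea shows up in reinforcement learning practice can be sketched as follows. This is my own illustrative assumption, a simple count-based novelty bonus, not necessarily the mechanism Silver has in mind: the agent gets an intrinsic bonus for visiting unfamiliar states, but the bonus decays, so behaviour is still ultimately judged against the extrinsic goal.

```python
import math
from collections import Counter

# A hedged sketch of how self-set motivations can coexist with an
# ultimate goal in RL: add an intrinsic novelty bonus to the extrinsic
# (task) reward. The count-based bonus is an illustrative choice, not
# something taken from the conversation.
visit_counts = Counter()

def shaped_reward(state, extrinsic_reward, beta=0.1):
    """Extrinsic reward plus an intrinsic bonus of beta / sqrt(N(s)).

    The bonus shrinks as a state is revisited, so in the long run the
    agent is still measured against the extrinsic, ultimate goal."""
    visit_counts[state] += 1
    return extrinsic_reward + beta / math.sqrt(visit_counts[state])
```

On the first visit to a state the bonus is the full `beta`; by the hundredth visit it has decayed to a tenth of that, so exploration is encouraged early without redefining what success ultimately means.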
And even for humans, I mean, humans,
link |
we're incredibly flexible.
link |
We feel that, you know, any goal that we're given,
link |
we can master to some degree.
link |
But if we think of those goals, really, you know,
link |
like the goal of being able to pick up an object
link |
or the goal of being able to communicate
link |
or influence people to do things in a particular way
link |
or whatever those goals are, they're really subgoals
link |
that we set ourselves.
link |
You know, we choose to pick up the object.
link |
We choose to communicate.
link |
We choose to influence someone else.
link |
And we choose those because we think it will lead us
link |
to something later on.
link |
We think that's helpful to us to achieve some ultimate goal.
link |
Now, I don't want to speculate whether or not humans
link |
as a system necessarily have a singular overall goal
link |
of survival or whatever it is.
link |
But I think the principle for understanding
link |
and implementing intelligence has to be
link |
that if we're trying to understand intelligence
link |
or implement our own,
link |
there has to be a well defined problem.
link |
Otherwise, I think it's like an admission
link |
of defeat. For there to be hope of understanding
link |
or implementing intelligence, we have to know what we're doing.
link |
We have to know what we're asking the system to do.
link |
Otherwise, if you don't have a clearly defined purpose,
link |
you're not going to get a clearly defined answer.
link |
The ridiculous big question that has to naturally follow,
link |
because I have to pin you down on this thing,
link |
that nevertheless, one of the big silly
link |
or big real questions before humans is the meaning of life,
link |
is us trying to figure out our own reward function.
link |
And you just kind of mentioned that if you want to build
link |
intelligent systems and you know what you're doing,
link |
you should be at least cognizant to some degree
link |
of what the reward function is.
link |
So the natural question is what do you think
link |
is the reward function of human life,
link |
the meaning of life for us humans,
link |
the meaning of our existence?
link |
I think I'd be speculating beyond my own expertise,
link |
but just for fun, let me do that.
link |
And say, I think that there are many levels
link |
at which you can understand a system
link |
and you can understand something as optimizing
link |
for a goal at many levels.
link |
And so you can understand the,
link |
let's start with the universe.
link |
Does the universe have a purpose?
link |
Well, at one level, it feels like it's just
link |
following certain mechanical laws of physics,
link |
and that that's led to the development of the universe.
link |
But at another level, you can view it as actually,
link |
there's the second law of thermodynamics that says
link |
that entropy is increasing over time forever.
link |
And now there's a view that's been developed
link |
by certain people at MIT that
link |
you can think of this as almost like a goal of the universe,
link |
that the purpose of the universe is to maximize entropy.
link |
So there are multiple levels
link |
at which you can understand a system.
link |
The next level down, you might say,
link |
well, if the goal is to maximize entropy,
link |
well, how can that be done by a particular system?
link |
And maybe evolution is something that the universe
link |
discovered in order to kind of dissipate energy
link |
as efficiently as possible.
link |
And by the way, I'm borrowing from Max Tegmark
link |
for some of these metaphors, the physicist.
link |
But if you can think of evolution
link |
as a mechanism for dispersing energy,
link |
then evolution, you might say, becomes a goal.
link |
If evolution disperses energy
link |
by reproducing as efficiently as possible,
link |
then what's evolution?
link |
Well, it's now got its own goal within that,
link |
which is to actually reproduce as effectively as possible.
link |
And now how does reproduction,
link |
how is that made as effective as possible?
link |
Well, you need entities within that
link |
that can survive and reproduce as effectively as possible.
link |
And so it's natural that in order to achieve
link |
that high level goal, those individual organisms
link |
discover brains, intelligences,
link |
which enable them to support the goals of evolution.
link |
And those brains, what do they do?
link |
Well, perhaps the early brains,
link |
maybe they were controlling things at some direct level.
link |
Maybe they were the equivalent of preprogrammed systems,
link |
which were directly controlling what was going on
link |
and setting certain things in order
link |
to achieve these particular goals.
link |
But that led to another level of discovery,
link |
which was learning systems.
link |
There are parts of the brain
link |
which are able to learn for themselves
link |
and learn how to program themselves to achieve any goal.
link |
And presumably there are parts of the brain
link |
that set goals for other parts of that system,
link |
and that provides this very flexible notion of intelligence
link |
that we as humans presumably have,
link |
which is the reason we feel
link |
that we can achieve any goal.
link |
So it's a very long winded answer to say that,
link |
I think there are many perspectives
link |
and many levels at which intelligence can be understood.
link |
And at each of those levels,
link |
you can take multiple perspectives.
link |
You can view the system as something
link |
which is optimizing for a goal,
link |
which is understanding it at a level
link |
by which we can maybe implement it
link |
and understand it as AI researchers or computer scientists,
link |
or you can understand it at the level
link |
of the mechanistic thing which is going on
link |
that there are these atoms bouncing around in the brain.
link |
And the fact that they lead to the outcome of that system
link |
is not in contradiction with the fact
link |
that it's also a decision making system
link |
that's optimizing for some goal and purpose.
link |
I've never heard the description of the meaning of life
link |
structured so beautifully in layers,
link |
but you did miss one layer, which is the next step,
link |
which you're responsible for,
link |
which is creating the artificial intelligence layer.
link |
And I can't wait to see, well, I may not be around,
link |
but I can't wait to see what the next layer beyond that will be.
link |
Well, let's just take that argument
link |
and pursue it to its natural conclusion.
link |
So the next level indeed is for how can our learning brain
link |
achieve its goals most effectively?
link |
Well, maybe it does so by us as learning beings
link |
building a system which is able to solve for those goals
link |
more effectively than we can.
link |
And so when we build a system to play the game of Go,
link |
when I said that I wanted to build a system
link |
that can play Go better than I can,
link |
I've enabled myself to achieve that goal of playing Go
link |
better than I could by directly playing it
link |
and learning it myself.
link |
And so now a new layer has been created,
link |
which is systems which are able to achieve goals more effectively than we can ourselves.
link |
And ultimately there may be layers beyond that
link |
where they set sub goals to parts of their own system
link |
in order to achieve those and so forth.
link |
So the story of intelligence, I think,
link |
is a multi layered one and a multi perspective one.
link |
We live in an incredible universe.
link |
David, thank you so much, first of all,
link |
for dreaming of using learning to solve Go
link |
and building intelligent systems
link |
and for actually making it happen
link |
and for inspiring millions of people in the process.
link |
It's truly an honor.
link |
Thank you so much for talking today.
link |
Thanks for listening to this conversation
link |
with David Silver and thank you to our sponsors,
link |
Masterclass and Cash App.
link |
Please consider supporting the podcast
link |
by signing up to Masterclass at masterclass.com slash Lex
link |
and downloading Cash App and using code LexPodcast.
link |
If you enjoy this podcast, subscribe on YouTube,
link |
review it with five stars on Apple Podcast,
link |
support it on Patreon,
link |
or simply connect with me on Twitter at LexFridman.
link |
And now let me leave you with some words from David Silver.
link |
My personal belief is that we've seen something
link |
of a turning point where we're starting to understand
link |
that many abilities like intuition and creativity
link |
that we've previously thought were in the domain only
link |
of the human mind are actually accessible
link |
to machine intelligence as well.
link |
And I think that's a really exciting moment in history.
link |
Thank you for listening and hope to see you next time.