Ilya Sutskever: Deep Learning | Lex Fridman Podcast #94
The following is a conversation with Ilya Sutskever,
cofounder and chief scientist of OpenAI,
one of the most cited computer scientists in history
with over 165,000 citations,
and to me, one of the most brilliant and insightful minds
ever in the field of deep learning.
There are very few people in this world
who I would rather talk to and brainstorm with
about deep learning, intelligence, and life in general
than Ilya, on and off the mic.
This was an honor and a pleasure.
This conversation was recorded
before the outbreak of the pandemic.
For everyone feeling the medical, psychological,
and financial burden of this crisis,
I'm sending love your way.
Stay strong, we're in this together, we'll beat this thing.
This is the Artificial Intelligence Podcast.
If you enjoy it, subscribe on YouTube,
review it with five stars on Apple Podcast,
support it on Patreon,
or simply connect with me on Twitter
at lexfriedman, spelled F R I D M A N.
As usual, I'll do a few minutes of ads now
and never any ads in the middle
that can break the flow of the conversation.
I hope that works for you
and doesn't hurt the listening experience.
This show is presented by Cash App,
the number one finance app in the App Store.
When you get it, use code LEXPODCAST.
Cash App lets you send money to friends,
buy Bitcoin, and invest in the stock market
with as little as $1.
Since Cash App allows you to buy Bitcoin,
let me mention that cryptocurrency
in the context of the history of money is fascinating.
I recommend Ascent of Money as a great book on this history.
Both the book and audiobook are great.
Debits and credits on ledgers
started around 30,000 years ago.
The US dollar was created over 200 years ago,
and Bitcoin, the first decentralized cryptocurrency,
was released just over 10 years ago.
So given that history,
cryptocurrency is still very much in its early days
of development, but it's still aiming to,
and just might, redefine the nature of money.
So again, if you get Cash App from the App Store
or Google Play and use the code LEXPODCAST,
you get $10 and Cash App will also donate $10 to FIRST,
an organization that is helping advance robotics
and STEM education for young people around the world.
And now, here's my conversation with Ilya Sutskever.
You were one of the three authors, with Alex Krizhevsky
and Geoff Hinton, of the famed AlexNet paper
that is arguably the paper that marked
the big catalytic moment
that launched the deep learning revolution.
Take us back to that time.
What was your intuition about neural networks,
about the representational power of neural networks?
And maybe you could mention how did that evolve
over the next few years, up to today,
over the past 10 years?
Yeah, I can answer that question.
At some point in about 2010 or 2011,
I connected two facts in my mind.
Basically, the realization was this.
At some point, we realized that we can train very large,
I shouldn't say very, tiny by today's standards,
but large and deep neural networks
end to end with backpropagation.
At some point, different people obtained this result.
I obtained this result.
The first moment in which I realized
that deep neural networks are powerful
was when James Martens invented
the Hessian free optimizer in 2010.
And he trained a 10 layer neural network end to end,
from scratch, without pretraining.
And when that happened, I thought, this is it.
Because if you can train a big neural network,
a big neural network can represent a very complicated function.
Because if you have a neural network with 10 layers,
it's as though you allow the human brain
to run for some number of milliseconds.
Neuron firings are slow.
And so in maybe 100 milliseconds,
your neurons only fire 10 times.
So it's also kind of like 10 layers.
And in 100 milliseconds,
you can perfectly recognize any object.
So I already had the idea then
that we need to train a very big neural network
on lots of supervised data.
And then it must succeed,
because we can find the best neural network.
And then there's also theory
that if you have more data than parameters,
you won't overfit.
Today, we know that actually this theory is very incomplete,
and you won't overfit even if you have less data
than parameters, but definitely,
if you have more data than parameters, you won't overfit.
So the fact that neural networks
were heavily overparameterized wasn't discouraging to you?
So you were thinking about the theory
that the number of parameters,
the fact that there's a huge number of parameters, is okay?
Is it gonna be okay?
I mean, there was some evidence before that it was okayish,
but the theory was mostly that if you had a big data set
and a big neural net, it was going to work.
The overparameterization just didn't really
figure much as a problem.
I thought, well, with images,
you're just gonna add some data augmentation
and it's gonna be okay.
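The data augmentation he mentions is easy to sketch. Below is a toy, illustrative version of the label-preserving image transforms that were standard in that era, a horizontal flip plus a random crop, written in plain NumPy; the function name and sizes are assumptions for the demo, not from any specific system.

```python
import numpy as np

def augment(image, rng, crop=24):
    """Produce a label-preserving variant of a square image array.

    Horizontal flips and random crops were the standard image
    augmentations of the AlexNet era; this is a toy NumPy version.
    """
    h, w = image.shape[:2]
    # Randomly mirror the image left-right half the time.
    if rng.random() < 0.5:
        image = image[:, ::-1]
    # Take a random crop of size crop x crop from the image.
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    return image[top:top + crop, left:left + crop]

rng = np.random.default_rng(0)
img = np.arange(32 * 32, dtype=np.float32).reshape(32, 32)
out = augment(img, rng)  # a 24x24 crop, possibly mirrored
```

Each pass over the data then sees slightly different versions of the same labeled image, which is the effect he is pointing at.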
So where was any doubt coming from?
The main doubt was, can we train a bigger,
will we have enough compute to train
a big enough neural net?
With backpropagation?
Backpropagation I thought would work.
The thing which wasn't clear
was whether there would be enough compute
to get a very convincing result.
And then at some point, Alex Krizhevsky wrote
these insanely fast CUDA kernels
for training convolutional neural nets,
and it was, bam, let's do this.
Let's do ImageNet, and it's gonna be the greatest thing.
Was your intuition, most of your intuition,
from empirical results by you and by others?
So like just actually demonstrating
that a piece of program can train
a 10 layer neural network?
Or was there some pen and paper
or marker and whiteboard thinking intuition?
Because you just connected a 10 layer
large neural network to the brain.
So you just mentioned the brain.
So in your intuition about neural networks,
does the human brain come into play as an intuition builder?
I mean, you gotta be precise with these analogies
between artificial neural networks and the brain.
But there is no question that the brain is a huge source
of intuition and inspiration for deep learning researchers,
all the way from Rosenblatt in the 60s.
Like, the whole idea of a neural network
is directly inspired by the brain.
You had people like McCulloch and Pitts who were saying,
hey, you got these neurons in the brain.
And hey, we recently learned about the computer.
Can we use some ideas from the computer and automata
to design some kind of computational object
that's going to be simple and computational
and kind of like the brain?
And they invented the neuron.
So they were inspired by it back then.
Then you had the convolutional neural network from Fukushima,
and then later Yann LeCun, who said, hey,
if you limit the receptive fields of a neural network,
it's going to be especially suitable for images,
as it turned out to be true.
So there was a very small number of examples
where analogies to the brain were successful.
And I thought, well, probably an artificial neuron
is not that different from the brain
if you squint hard enough.
So let's just assume it is and roll with it.
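The receptive-field idea he credits to Fukushima and LeCun can be sketched in a few lines. This toy NumPy convolution (illustrative, not any particular network) shows the two ingredients: each output unit sees only a small patch of the input, and every position reuses the same weights.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide one shared kernel over the image: each output unit only
    sees a small patch (its receptive field), and every position
    reuses the same weights."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]  # the receptive field
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.eye(5)                # toy 5x5 input
kernel = np.ones((3, 3)) / 9.0   # a simple averaging filter
feature_map = conv2d(image, kernel)
```

The weight sharing is what makes the architecture especially suitable for images: the same local pattern detector is applied everywhere.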
So now we're at a time where deep learning
is very successful.
So let us squint less and say, let's open our eyes
and say, what to you is an interesting difference
between the human brain and neural networks?
Now, I know you're probably not an expert,
neither a neuroscientist nor a biologist,
but loosely speaking, what's the difference
between the human brain and artificial neural networks
that's interesting to you for the next decade or two?
That's a good question to ask.
What is an interesting difference between the neurons,
between the brain and our artificial neural networks?
So I feel like today, artificial neural networks,
so we all agree that there are certain dimensions
in which the human brain vastly outperforms our models.
But I also think that there are some ways
in which our artificial neural networks
have a number of very important advantages over the brain.
Looking at the advantages versus disadvantages
is a good way to figure out what is the important difference.
So the brain uses spikes, which may or may not be important.
Yeah, it's a really interesting question.
Do you think it's important or not?
That's one big architectural difference
between artificial neural networks and the brain.
It's hard to tell, but my prior is not very high,
and I can say why.
There are people who are interested
in spiking neural networks.
And basically what they figured out
is that they need to simulate
the non-spiking neural networks in spikes.
And that's how they're gonna make them work.
If you don't simulate the non-spiking neural networks
in spikes, it's not going to work,
because the question is, why should it work?
And that connects to questions around backpropagation
and questions around deep learning.
You've got this giant neural network.
Why should it work at all?
Why should the learning rule work at all?
It's not self-evident,
especially if you, let's say, if you were just starting
in the field and you read the very early papers,
you can say, hey, people are saying,
let's build neural networks.
That's a great idea, because the brain is a neural network.
So it would be useful to build neural networks.
Now let's figure out how to train them.
It should be possible to train them, probably, but how?
And so the big idea is the cost function.
That's the big idea.
The cost function is a way of measuring the performance
of the system according to some measure.
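The cost-function-plus-gradient-descent idea fits in a toy sketch. Here is a minimal illustrative example on a one-parameter quadratic cost; the target value 3.0 and the step size are arbitrary choices for the demo, not from anything discussed here.

```python
def cost(w):
    """A toy scalar cost: how far w is from the target value 3."""
    return (w - 3.0) ** 2

def grad(w):
    """Analytic gradient of the cost above."""
    return 2.0 * (w - 3.0)

# Gradient descent: repeatedly step against the gradient of the cost,
# so the scalar measure of performance keeps improving.
w = 0.0
for _ in range(100):
    w -= 0.1 * grad(w)
```

Everything downstream, the reasoning about what the system optimizes, rests on having this single scalar measure to descend.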
By the way, that is a big, actually, let me think,
is that a difficult idea to arrive at,
and how big of an idea is that,
that there's a single cost function?
Sorry, let me take a pause.
Is supervised learning a difficult concept to come to?
All concepts are very easy in retrospect.
Yeah, it seems trivial now,
but the reason I asked that,
and we'll talk about it, is, are there other things?
Are there things that don't necessarily have a cost function,
maybe have many cost functions,
or maybe have dynamic cost functions,
or maybe a totally different kind of architecture?
Because we have to think like that
in order to arrive at something new, right?
So the good examples of things
which don't have clear cost functions are GANs.
Right. In a GAN, you have a game.
So instead of thinking of a cost function,
where you wanna optimize,
where you know that you have an algorithm, gradient descent,
which will optimize the cost function,
and then you can reason about the behavior of your system
in terms of what it optimizes,
with a GAN, you say, I have a game,
and I'll reason about the behavior of the system
in terms of the equilibrium of the game.
But it's all about coming up with these mathematical objects
that help us reason about the behavior of our system.
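For reference, the game he describes is usually written as the standard GAN minimax objective, with generator \(G\) and discriminator \(D\) pushing the same quantity in opposite directions:

```latex
\min_G \max_D \;
\mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log D(x)\right]
+ \mathbb{E}_{z \sim p_z}\!\left[\log\bigl(1 - D(G(z))\bigr)\right]
```

Rather than a single cost being minimized, the object of interest is the equilibrium of this two-player game.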
Right, that's really interesting.
Yeah, so GAN is the only one, it's kind of a,
the cost function is emergent from the comparison.
I don't know if it has a cost function.
I don't know if it's meaningful
to talk about the cost function of a GAN.
It's kind of like the cost function of biological evolution
or the cost function of the economy.
You can talk about regions
to which it will go towards, but I don't think
the cost function analogy is the most useful.
So if evolution doesn't, that's really interesting.
So if evolution doesn't really have a cost function,
like something akin to our mathematical conception
of a cost function, then do you think cost functions
in deep learning are holding us back?
Yeah, so you just kind of mentioned that cost function
is a nice first profound idea.
Do you think that's a good idea?
Do you think it's an idea we'll go past?
So self play starts to touch on that a little bit
in reinforcement learning systems.
Self play, and also ideas around exploration,
where you're trying to take actions
that surprise a predictor.
I'm a big fan of cost functions.
I think cost functions are great
and they serve us really well.
And I think that whenever we can do things
with cost functions, we should.
And you know, maybe there is a chance
that we will come up with some
yet another profound way of looking at things
that will involve cost functions in a less central way.
But I don't know, I think cost functions are,
I mean, I would not bet against cost functions.
Are there other things about the brain
that pop into your mind that might be different
and interesting for us to consider
in designing artificial neural networks?
So we talked about spiking a little bit.
I mean, one thing which may potentially be useful,
I think people, neuroscientists, have figured out
something about the learning rule of the brain.
I'm talking about spike timing dependent plasticity,
and it would be nice if some people
would just study that in simulation.
Wait, sorry, spike timing dependent plasticity?
Yeah, that's right.
It's a particular learning rule that uses spike timing
to determine how to update the synapses.
So it's kind of like, if a synapse fires into the neuron
before the neuron fires,
then it strengthens the synapse,
and if the synapse fires into the neuron
shortly after the neuron fired,
then it weakens the synapse.
Something along this line.
I'm 90% sure it's right, so if I said something wrong here,
don't get too angry.
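The rule he sketches can be written down in a few lines. This is a toy version of the classic pair-based STDP update; the time-window and learning-rate constants are illustrative assumptions, not from any particular study.

```python
import math

def stdp_update(weight, t_pre, t_post, a_plus=0.1, a_minus=0.12, tau=20.0):
    """Pair-based STDP: strengthen the synapse if the presynaptic spike
    precedes the postsynaptic spike, weaken it if it follows.

    t_pre, t_post: spike times in milliseconds (toy constants above).
    """
    dt = t_post - t_pre
    if dt > 0:
        # Pre fired before post: potentiation, decaying with the gap.
        weight += a_plus * math.exp(-dt / tau)
    elif dt < 0:
        # Pre fired after post: depression, also decaying with the gap.
        weight -= a_minus * math.exp(dt / tau)
    return weight

w_potentiated = stdp_update(0.5, t_pre=10.0, t_post=15.0)
w_depressed = stdp_update(0.5, t_pre=15.0, t_post=10.0)
```

Studying exactly this kind of local, timing-based update in simulation is what he suggests.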
But you sounded brilliant while saying it.
But the timing, that's one thing that's missing.
The temporal dynamics is not captured.
I think that's like a fundamental property of the brain,
the timing of the signals.
Well, you have recurrent neural networks.
But you think of that as, I mean,
that's a very crude, simplified,
what's that called?
There's a clock, I guess, to recurrent neural networks.
It seems like the brain is the general,
the continuous version of that,
the generalization where all possible timings are possible,
and then within those timings is contained some information.
You think recurrent neural networks,
the recurrence in recurrent neural networks,
can capture the same kind of phenomena as the timing
that seems to be important for the brain,
in the firing of neurons in the brain?
I mean, I think recurrent neural networks are amazing,
and they can do, I think, anything
we'd want a system to do.
Right now, recurrent neural networks
have been superseded by transformers,
but maybe one day they'll make a comeback,
maybe they'll be back, we'll see.
Let me, on a small tangent, ask,
do you think they'll be back?
So much of the breakthroughs recently
that we'll talk about, in natural language processing
and language modeling, has been with transformers
that don't emphasize recurrence.
Do you think recurrence will make a comeback?
Well, some kind of recurrence, I think, very likely.
Recurrent neural networks, as they're typically thought of,
for processing sequences, I think that's also possible.
What is, to you, a recurrent neural network?
Generally speaking, I guess,
what is a recurrent neural network?
You have a neural network which maintains
a high dimensional hidden state,
and then when an observation arrives,
it updates its high dimensional hidden state
through its connections in some way.
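That definition fits in a few lines of NumPy. A minimal sketch of a vanilla RNN cell, with made-up sizes and random weights purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, obs_size = 8, 3

# The high-dimensional hidden state, plus the connections that update it.
h = np.zeros(hidden_size)
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_xh = rng.normal(scale=0.1, size=(hidden_size, obs_size))

def step(h, x):
    """One RNN step: fold the new observation x into the hidden state."""
    return np.tanh(W_hh @ h + W_xh @ x)

# Feed a short sequence of observations; the state carries the history.
for x in [rng.normal(size=obs_size) for _ in range(5)]:
    h = step(h, x)
```

The whole memory of the sequence lives in `h`, updated through the connections exactly as he describes.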
So do you think, that's what expert systems did, right?
Symbolic AI, the knowledge based systems,
growing a knowledge base is maintaining a hidden state,
which is its knowledge base,
and is growing it by sequential processing.
Do you think of it more generally in that way,
or is it simply the more constrained form
of a hidden state, with certain kinds of gating units,
that we think of today with LSTMs?
I mean, the hidden state is technically
what you described there, the hidden state
that goes inside the LSTM or the RNN or something like this.
But then what should be contained,
if you want to make the expert system analogy,
I mean, you could say that
the knowledge is stored in the connections,
and then the short term processing
is done in the hidden state.
Yes, could you say that?
So sort of, do you think there's a future of building
large scale knowledge bases within the neural networks?
So we're gonna pause on that confidence,
because I want to explore that.
Well, let me zoom back out and ask,
back to the history of ImageNet.
Neural networks have been around for many decades.
What do you think were the key ideas
that led to their success,
that ImageNet moment and beyond,
the success in the past 10 years?
Okay, so the question is,
to make sure I didn't miss anything,
the key ideas that led to the success
of deep learning over the past 10 years.
Exactly, even though the fundamental thing
behind deep learning has been around for much longer.
So the key idea about deep learning,
or rather, the key fact about deep learning
before deep learning started to be successful,
is that it was underestimated.
People who worked in machine learning
simply didn't think that neural networks could do much.
People didn't believe that large neural networks
could be trained.
People thought, well, there was a lot of debate
going on in machine learning
about what are the right methods and so on.
And people were arguing because there was no way
to get hard facts.
And by that, I mean, there were no benchmarks
which were truly hard, that if you do really well on them,
then you can say, look, here's my system.
That's when this field becomes a little bit more
of an engineering field.
So in terms of deep learning,
to answer the question directly,
the ideas were all there.
The thing that was missing was a lot of supervised data
and a lot of compute.
Once you have a lot of supervised data and a lot of compute,
then there is a third thing which is needed as well,
and that is conviction.
Conviction that if you take the right stuff,
which already exists, and apply and mix it
with a lot of data and a lot of compute,
that it will in fact work.
And so that was the missing piece.
You needed the data,
you needed the compute, which showed up in terms of GPUs,
and you needed the conviction to realize
that you need to mix them together.
So that's really interesting.
So I guess the presence of compute
and the presence of supervised data
allowed the empirical evidence to do the convincing
of the majority of the computer science community.
So I guess there's a key moment with Jitendra Malik
and Alexei, Alyosha, Efros, who were very skeptical, right?
And then there's Geoffrey Hinton,
who was the opposite of skeptical.
And there was a convincing moment,
and I think ImageNet served as that moment.
And they represented this kind of,
were the big pillars of the computer vision community,
kind of the wizards got together,
and then all of a sudden there was a shift.
And it's not enough for the ideas to all be there
and the compute to be there,
it has to convince the cynicism that existed.
It's interesting that people just didn't believe
for a couple of decades.
Yeah, well, but it's more than that.
It's kind of, when put this way,
it sounds like, well, those silly people
who didn't believe, what were they missing?
But in reality, things were confusing,
because neural networks really did not work on anything,
and they were not the best method
on pretty much anything as well.
And it was pretty rational to say,
yeah, this stuff doesn't have any traction.
And that's why you need to have these very hard tasks
which produce undeniable evidence.
And that's how we make progress.
And that's why the field is making progress today,
because we have these hard benchmarks
which represent true progress.
And this is why we are able to avoid endless debate.
So, incredibly, you've contributed
some of the biggest recent ideas in AI,
in computer vision, language, natural language processing,
reinforcement learning, sort of everything in between.
There may not be a topic you haven't touched.
And of course, the fundamental science of deep learning.
What is the difference to you between vision, language,
and, as in reinforcement learning, action,
as learning problems?
And what are the commonalities?
Do you see them as all interconnected,
or are they fundamentally different domains
that require different approaches?
Okay, that's a good question.
Machine learning is a field with a lot of unity,
a huge amount of unity.
In fact.
What do you mean by unity?
Like overlap of ideas?
Overlap of ideas, overlap of principles.
In fact, there's only one or two or three principles
which are very, very simple.
And then they apply in almost the same way
to the different modalities,
to the different problems.
And that's why today, when someone writes a paper
on improving optimization of deep learning in vision,
it improves the different NLP applications
and it improves the different
reinforcement learning applications.
Reinforcement learning.
So I would say that computer vision
and NLP are very similar to each other.
Today they differ in that they have
slightly different architectures.
We use transformers in NLP,
and we use convolutional neural networks in vision.
But it's also possible that one day this will change
and everything will be unified with a single architecture.
Because if you go back a few years ago
in natural language processing,
there were a huge number of architectures,
where every different tiny problem had its own architecture.
Today, there's just one transformer
for all those different tasks.
And if you go back in time even more,
you had even more and more fragmentation,
and every little problem in AI
had its own little subspecialization
and, you know, its own little collection of skills,
people who would know how to engineer the features.
Now it's all been subsumed by deep learning.
We have this unification.
And so I expect vision to become unified
with natural language as well.
Or rather, I shouldn't say expect, I think it's possible.
I don't wanna be too sure, because
I think the convolutional neural net
is very computationally efficient.
RL does require slightly different techniques
because you really do need to take action.
You really need to do something about exploration.
Your variance is much higher.
But I think there is a lot of unity even there.
And I would expect, for example, that at some point
there will be some broader unification
between RL and supervised learning
where somehow the RL will be making decisions
to make the supervised learning go better.
And it will be, I imagine, one big black box
and you just throw, you know, you shovel things into it
and it just figures out what to do
with whatever you shovel at it.
I mean, reinforcement learning has some aspects
of language and vision combined almost.
There's elements of a long term memory
that you should be utilizing
and there's elements of a really rich sensory space.
So it seems like the union of the two or something like that.
I'd say something slightly different.
I'd say that reinforcement learning is neither,
but it naturally interfaces
and integrates with the two of them.
Do you think action is fundamentally different?
So yeah, what is interesting about,
what is unique about learning a policy, learning to act?
Well, so one example, for instance,
is that when you learn to act,
you are fundamentally in a non-stationary world,
because as your actions change,
the things you see start changing.
You experience the world in a different way.
And this is not the case for
the more traditional static problem,
where you have some distribution
and you just apply a model to that distribution.
You think it's a fundamentally different problem,
or is it just a more difficult generalization
of the problem of understanding?
I mean, it's a question of definitions, almost.
There is a huge amount of commonality, for sure.
You take gradients, we try to approximate gradients
in both cases.
In the case of reinforcement learning,
you have some tools to reduce the variance of the gradients.
There's lots of commonality.
You use the same neural net in both cases.
You compute the gradient, you apply Adam in both cases.
So, I mean, there's lots in common, for sure,
but there are some small differences
which are not completely insignificant.
It's really just a matter of your point of view,
what frame of reference,
how much do you wanna zoom in or out
as you look at these problems.
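One of the variance-reduction tools he alludes to is subtracting a baseline from the reward in the score-function (REINFORCE) gradient estimator. A toy sketch with a one-parameter Gaussian policy; all numbers are illustrative assumptions for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a one-parameter Gaussian policy, action ~ N(theta, 1),
# with reward equal to the action itself, so the true policy gradient
# d/dtheta E[reward] is exactly 1.
theta = 2.0
actions = rng.normal(loc=theta, scale=1.0, size=100_000)
rewards = actions

# Score-function (REINFORCE) estimator: grad-log-prob times reward.
score = actions - theta          # d/dtheta of log N(action | theta, 1)
naive = score * rewards
# Subtracting a baseline (here the mean reward) leaves the expected
# gradient unchanged but shrinks the variance of the estimate.
with_baseline = score * (rewards - rewards.mean())
```

Both estimators average to the same gradient; the baselined one just gets there with far less noise, which is the point of the tool.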
Which problem do you think is harder?
So people like Noam Chomsky believe
that language is fundamental to everything,
so it underlies everything.
Do you think language understanding is harder
than visual scene understanding, or vice versa?
I think that asking if a problem is hard is slightly wrong.
I think the question is a little bit wrong,
and I wanna explain why.
So what does it mean for a problem to be hard?
Okay, the uninteresting, dumb answer to that
is, there's a benchmark,
and there's a human level performance on that benchmark,
and how much effort is required
to reach human level on that benchmark.
So from the perspective of how much,
until we get to human level on a very good benchmark.
Yeah, I understand what you mean by that.
So what I was going to say is that a lot of it depends on,
once you solve a problem, it stops being hard,
and that's always true.
And so whether something is hard or not depends
on what our tools can do today.
So you say, today, true human level
language understanding and visual perception are hard
in the sense that there is no way
of solving the problem completely in the next three months.
So I agree with that statement.
Beyond that, my guess would be as good as yours.
Oh, okay, so you don't have a fundamental intuition
about how hard language understanding is.
I think, you know, I changed my mind.
I'd say language is probably going to be harder.
I mean, it depends on how you define it.
Like if you mean absolute top notch,
100% language understanding, I'll go with language.
But then, if I show you a piece of paper with letters on it,
is that, you see what I mean?
You have a vision system,
you say it's the best human level vision system.
I show you, I open a book, and I show you letters.
Will it understand how these letters form into words
and sentences and meaning?
Is this part of the vision problem?
Where does vision end and language begin?
Yeah, so Chomsky would say it starts at language.
So vision is just a little example of the kind
of structure and fundamental hierarchy of ideas
that's already represented in our brains somehow,
that's represented through language.
But where does vision stop and language begin?
That's a really interesting question.
So one possibility is that it's impossible
to achieve really deep understanding in either images
or language without basically using the same kind of system.
So you're going to get the other for free.
I think it's pretty likely that yes,
if we can get one, our machine learning is probably
that good that we can get the other.
But I'm not 100% sure.
And also, I think a lot of it really does depend
on your definitions.
Of like perfect vision.
Because reading is vision, but should it count?
Yeah, to me, so my definition is if a system looked
at an image and then a system looked at a piece of text
and then told me something about that
and I was really impressed.
You'll be impressed for half an hour
and then you're gonna say, well, I mean,
all the systems do that, but here's the thing they don't do.
Yeah, but I don't have that with humans.
Humans continue to impress me.
Well, the ones, okay, so I'm a fan of monogamy.
So I like the idea of marrying somebody,
being with them for several decades.
So I believe in the fact that yes, it's possible
to have somebody continuously giving you
pleasurable, interesting, witty new ideas, friends.
They continue to surprise you.
The surprise, it's that injection of randomness.
It seems to be a nice source of, yeah, continued inspiration,
like the wit, the humor.
I think, yeah, that would be,
it's a very subjective test,
but I think if you have enough humans in the room.
Yeah, I understand what you mean.
link |
Yeah, I feel like I misunderstood
link |
what you meant by impressing you.
link |
I thought you meant to impress you with its intelligence,
link |
with how well it understands an image.
link |
I thought you meant something like,
link |
I'm gonna show it a really complicated image
link |
and it's gonna get it right.
link |
And you're gonna say, wow, that's really cool.
link |
Our systems of January 2020 have not been doing that.
link |
Yeah, no, I think it all boils down to like
link |
the reason people click like on stuff on the internet,
link |
which is like, it makes them laugh.
link |
So it's like humor or wit or insight.
link |
I'm sure we'll get that as well.
link |
So forgive the romanticized question,
link |
but looking back to you,
link |
what is the most beautiful or surprising idea
link |
in deep learning or AI in general you've come across?
link |
So I think the most beautiful thing about deep learning
link |
is that it actually works.
link |
And I mean it, because you got these ideas,
link |
you got the little neural network,
link |
you got the back propagation algorithm.
link |
And then you've got some theories as to,
link |
this is kind of like the brain.
link |
So maybe if you make it large,
link |
if you make the neural network large
link |
and you train it on a lot of data,
link |
then it will do the same function that the brain does.
link |
And it turns out to be true, that's crazy.
link |
And now we just train these neural networks
link |
and you make them larger and they keep getting better.
link |
And I find it unbelievable.
link |
I find it unbelievable that this whole AI stuff
link |
with neural networks works.
link |
Have you built up an intuition of why?
link |
Are there a lot of bits and pieces of intuitions,
link |
of insights of why this whole thing works?
link |
I mean, some, definitely.
link |
Well, we know that optimization, we now have good,
link |
we've had lots of empirical,
link |
huge amounts of empirical reasons
link |
to believe that optimization should work
link |
on most problems we care about.
link |
Do you have insights of why?
link |
So you just said empirical evidence.
link |
Is most of your sort of empirical evidence
link |
kind of convinces you?
link |
It's like evolution is empirical.
link |
It shows you that, look,
link |
this evolutionary process seems to be a good way
link |
to design organisms that survive in their environment,
link |
but it doesn't really get you to the insights
link |
of how the whole thing works.
link |
I think a good analogy is physics.
link |
You know how you say, hey, let's do some physics calculation
link |
and come up with some new physics theory
link |
and make some prediction.
link |
But then you got around the experiment.
link |
You know, you got around the experiment, it's important.
link |
So it's a bit the same here,
link |
except that maybe sometimes the experiment
link |
came before the theory.
link |
But it still is the case.
link |
You know, you have some data
link |
and you come up with some prediction.
link |
You say, yeah, let's make a big neural network.
link |
And it's going to work much better than anything before it.
link |
And it will in fact continue to get better
link |
as you make it larger.
link |
And it turns out to be true.
link |
That's amazing when a theory is validated like this.
link |
It's not a mathematical theory.
link |
It's more of a biological theory almost.
link |
So I think there are not terrible analogies
link |
between deep learning and biology.
link |
I would say it's like the geometric mean
link |
of biology and physics.
link |
That's deep learning.
link |
The geometric mean of biology and physics.
link |
I think I'm going to need a few hours
link |
to wrap my head around that.
link |
Because just to find the geometric,
link |
just to find the set of what biology represents.
link |
Well, in biology, things are really complicated.
link |
Theories are really, really,
link |
it's really hard to have good predictive theory.
link |
And in physics, the theories are too good.
link |
In physics, people make these super precise theories
link |
which make these amazing predictions.
link |
And in machine learning, we're kind of in between.
link |
Kind of in between, but it'd be nice
link |
if machine learning somehow helped us
link |
discover the unification of the two
link |
as opposed to sort of the in between.
link |
That's, you're kind of trying to juggle both.
link |
So do you think there are still beautiful
link |
and mysterious properties in neural networks
link |
that are yet to be discovered?
link |
I think that we are still massively underestimating deep learning.
link |
What do you think it will look like?
link |
Like what, if I knew, I would have done it, you know?
link |
So, but if you look at all the progress
link |
from the past 10 years, I would say most of it,
link |
I would say there've been a few cases
link |
where some were things that felt like really new ideas
link |
showed up, but by and large it was every year
link |
we thought, okay, deep learning goes this far.
link |
Nope, it actually goes further.
link |
And then the next year, okay, now this is peak deep learning.
link |
We are really done.
link |
Nope, it goes further.
link |
It just keeps going further each year.
link |
So that means that we keep underestimating,
link |
we keep not understanding it.
link |
It has surprising properties all the time.
link |
Do you think it's getting harder and harder?
link |
Need to make progress?
link |
It depends on what you mean.
link |
I think the field will continue to make very robust progress
link |
for quite a while.
link |
I think for individual researchers,
link |
especially people who are doing research,
link |
it can be harder because there is a very large number
link |
of researchers right now.
link |
I think that if you have a lot of compute,
link |
then you can make a lot of very interesting discoveries,
link |
but then you have to deal with the challenge
link |
of managing a huge compute cluster
link |
to run your experiments.
link |
It's a little bit harder.
link |
So I'm asking all these questions
link |
that nobody knows the answer to,
link |
but you're one of the smartest people I know,
link |
so I'm gonna keep asking.
link |
So let's imagine all the breakthroughs
link |
that happen in the next 30 years in deep learning.
link |
Do you think most of those breakthroughs
link |
can be done by one person with one computer?
link |
Sort of in the space of breakthroughs,
link |
do you think compute will be,
link |
compute and large efforts will be necessary?
link |
I mean, I can't be sure.
link |
When you say one computer, you mean how large?
link |
I think it's pretty unlikely.
link |
I think it's pretty unlikely.
link |
I think that there are many,
link |
the stack of deep learning is starting to be quite deep.
link |
If you look at it, you've got all the way from the ideas,
link |
the systems to build the data sets,
link |
the distributed programming,
link |
the building the actual cluster,
link |
the GPU programming, putting it all together.
link |
So now the stack is getting really deep
link |
and I think it becomes,
link |
it can be quite hard for a single person
link |
to become, to be world class
link |
in every single layer of the stack.
link |
What about the, what Vladimir Vapnik
link |
really insists on is taking MNIST
link |
and trying to learn from very few examples.
link |
So being able to learn more efficiently.
link |
Do you think that's, there'll be breakthroughs in that space
link |
that would, may not need the huge compute?
link |
I think there will be a large number of breakthroughs
link |
in general that will not need a huge amount of compute.
link |
So maybe I should clarify that.
link |
I think that some breakthroughs will require a lot of compute
link |
and I think building systems which actually do things
link |
will require a huge amount of compute.
link |
That one is pretty obvious.
link |
If you want to do X and X requires a huge neural net,
link |
you gotta get a huge neural net.
link |
But I think there will be lots of,
link |
I think there is lots of room for very important work
link |
being done by small groups and individuals.
link |
Can you maybe sort of on the topic
link |
of the science of deep learning,
link |
talk about one of the recent papers
link |
that you released, the Deep Double Descent,
link |
where bigger models and more data hurt.
link |
I think it's a really interesting paper.
link |
Can you describe the main idea?
link |
So what happened is that some,
link |
over the years, some small number of researchers noticed
link |
that it is kind of weird that when you make
link |
the neural network larger, it works better
link |
and it seems to go in contradiction
link |
with statistical ideas.
link |
And then some people made an analysis showing
link |
that actually you got this double descent bump.
link |
And what we've done was to show that double descent occurs
link |
for pretty much all practical deep learning systems.
link |
And that it'll be also, so can you step back?
link |
What's the X axis and the Y axis of a double descent plot?
link |
So you can look, you can do things like,
link |
you can take your neural network
link |
and you can start increasing its size slowly
link |
while keeping your data set fixed.
link |
So if you increase the size of the neural network slowly,
link |
and if you don't do early stopping,
link |
that's a pretty important detail,
link |
then when the neural network is really small,
link |
you make it larger,
link |
you get a very rapid increase in performance.
link |
Then you continue to make it larger.
link |
And at some point performance will get worse.
link |
And it gets worst exactly at the point
link |
at which it achieves zero training error,
link |
precisely zero training loss.
link |
And then as you make it larger,
link |
it starts to get better again.
link |
And it's kind of counterintuitive
link |
because you'd expect deep learning phenomena to be monotonic.
link |
And it's hard to be sure what it means,
link |
but it also occurs in the case of linear classifiers.
link |
And the intuition basically boils down to the following.
link |
When you have a large data set and a small model,
link |
then small, tiny random,
link |
so basically what is overfitting?
link |
Overfitting is when your model is somehow very sensitive
link |
to the small random unimportant stuff in your data set.
link |
In the training data.
link |
In the training data set, precisely.
link |
So if you have a small model and you have a big data set,
link |
and there may be some random thing,
link |
some training cases are randomly in the data set
link |
and others may not be there,
link |
but the small model is kind of insensitive
link |
to this randomness because it's the same,
link |
there is pretty much no uncertainty about the model
link |
when the data set is large.
link |
So at the very basic level to me,
link |
it is the most surprising thing
link |
that neural networks don't overfit every time very quickly
link |
before ever being able to learn anything.
link |
The huge number of parameters.
link |
So here is, so there is one way, okay.
link |
So maybe, so let me try to give the explanation
link |
and maybe that will be, that will work.
link |
So you've got a huge neural network.
link |
Let's suppose you've got, you have a huge neural network,
link |
you have a huge number of parameters.
link |
And now let's pretend everything is linear,
link |
which is not, let's just pretend.
link |
Then there is this big subspace
link |
where your neural network achieves zero error.
link |
And SGD is going to find approximately the point.
link |
Approximately the point with the smallest norm
link |
And that can also be proven to be insensitive
link |
to the small randomness in the data
link |
when the dimensionality is high.
link |
But when the dimensionality of the data
link |
is equal to the dimensionality of the model,
link |
then there is a one to one correspondence
link |
between all the data sets and the models.
link |
So small changes in the data set
link |
actually lead to large changes in the model.
link |
And that's why performance gets worse.
link |
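The sensitivity argument above can be checked directly in the linear case. A minimal sketch, with dimensions of my own choosing: compare how much the minimum-norm solution moves when one label is nudged, at the threshold (dimension equals number of data points) versus well past it.

```python
import numpy as np

rng = np.random.default_rng(1)

def sensitivity(n, d, trials=10):
    # Average relative change in the minimum-norm solution when a
    # single training label is perturbed slightly.
    out = []
    for _ in range(trials):
        X = rng.normal(size=(n, d))
        y = rng.normal(size=n)
        w = np.linalg.pinv(X) @ y       # min-norm least-squares fit
        y2 = y.copy()
        y2[0] += 0.01                   # tiny change to one data point
        w2 = np.linalg.pinv(X) @ y2
        out.append(np.linalg.norm(w2 - w) / np.linalg.norm(w))
    return float(np.mean(out))

# d == n: one-to-one map between data sets and models, so the small
# perturbation moves the model a lot (ill-conditioned square system).
at_threshold = sensitivity(n=50, d=50)
# d >> n: the min-norm solution barely moves.
overparam = sensitivity(n=50, d=2000)
```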
So this is the best explanation more or less.
link |
So then it would be good for the model
link |
to have more parameters, so to be bigger than the data.
link |
But only if you don't early stop.
link |
If you introduce early stop in your regularization,
link |
you can make the double descent bump
link |
almost completely disappear.
link |
What is early stop?
link |
Early stopping is when you train your model
link |
and you monitor your validation performance.
link |
And then if at some point validation performance
link |
starts to get worse, you say, okay, let's stop training.
link |
We are good enough.
link |
So the magic happens after that moment.
link |
So you don't want to do the early stopping.
link |
Well, if you don't do the early stopping,
link |
you get the very pronounced double descent.
link |
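Early stopping as just defined, monitor validation loss and stop when it stalls, is a short loop. A minimal sketch on a toy linear problem (all names and hyperparameters here are my own, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny linear-regression setup, split into train and validation halves.
X = rng.normal(size=(200, 30))
w_true = rng.normal(size=30)
y = X @ w_true + 0.5 * rng.normal(size=200)
X_tr, y_tr, X_va, y_va = X[:100], y[:100], X[100:], y[100:]

def val_loss(w):
    return float(np.mean((X_va @ w - y_va) ** 2))

# Gradient descent with early stopping: keep the best weights seen on
# the validation set, stop after `patience` steps without improvement.
w = np.zeros(30)
best_w, best_loss, since_best = w.copy(), val_loss(w), 0
lr, patience = 0.001, 20
for step in range(5000):
    grad = 2 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
    w -= lr * grad
    loss = val_loss(w)
    if loss < best_loss:
        best_w, best_loss, since_best = w.copy(), loss, 0
    else:
        since_best += 1
        if since_best >= patience:
            break  # validation stopped improving: "we are good enough"
```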
Do you have any intuition why this happens?
link |
Oh, sorry, early stopping?
link |
No, the double descent.
link |
Well, yeah, so I try...
link |
The intuition is basically is this,
link |
that when the data set has as many degrees of freedom
link |
as the model, then there is a one to one correspondence between the data set and the model.
link |
And so small changes to the data set
link |
lead to noticeable changes in the model.
link |
So your model is very sensitive to all the randomness.
link |
It is unable to discard it.
link |
Whereas it turns out that when you have
link |
a lot more data than parameters
link |
or a lot more parameters than data,
link |
the resulting solution will be insensitive
link |
to small changes in the data set.
link |
Oh, so it's able to, let's nicely put,
link |
discard the small changes, the randomness.
link |
The randomness, exactly.
link |
The spurious correlation which you don't want.
link |
Jeff Hinton suggested we need to throw back propagation.
link |
We already kind of talked about this a little bit,
link |
but he suggested that we need to throw away
link |
back propagation and start over.
link |
I mean, of course some of that is a little bit
link |
wit and humor, but what do you think?
link |
What could be an alternative method
link |
of training neural networks?
link |
Well, the thing that he said precisely is that
link |
to the extent that you can't find back propagation
link |
in the brain, it's worth seeing if we can learn something
link |
from how the brain learns.
link |
But back propagation is very useful
link |
and we should keep using it.
link |
Oh, you're saying that once we discover
link |
the mechanism of learning in the brain,
link |
or any aspects of that mechanism,
link |
we should also try to implement that in neural networks?
link |
If it turns out that we can't find
link |
back propagation in the brain.
link |
If we can't find back propagation in the brain.
link |
Well, so I guess your answer to that is
link |
back propagation is pretty damn useful.
link |
So why are we complaining?
link |
I mean, I personally am a big fan of back propagation.
link |
I think it's a great algorithm because it solves
link |
an extremely fundamental problem,
link |
which is finding a neural circuit
link |
subject to some constraints.
link |
And I don't see that problem going away.
link |
So that's why I really, I think it's pretty unlikely
link |
that we'll have anything which is going to be
link |
dramatically different.
link |
It could happen, but I wouldn't bet on it right now.
link |
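The "fundamental problem" backpropagation solves, fitting a neural circuit to constraints by the chain rule, fits in a page. A minimal numpy sketch of my own (not OpenAI code): a two-layer net trained by backprop on XOR, the classic task no linear model can solve.

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR: the constraint set the circuit must satisfy.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Two-layer net: 2 -> 8 -> 1, tanh hidden units, sigmoid output.
W1 = rng.normal(scale=0.5, size=(2, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1)); b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

losses = []
lr = 0.5
for _ in range(3000):
    # Forward pass.
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    losses.append(float(np.mean((p - y) ** 2)))
    # Backward pass: the chain rule applied layer by layer, which is
    # all backpropagation is.
    dp = 2 * (p - y) / len(X) * p * (1 - p)
    dW2 = h.T @ dp; db2 = dp.sum(0)
    dh = dp @ W2.T * (1 - h ** 2)
    dW1 = X.T @ dh; db1 = dh.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```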
So let me ask a sort of big picture question.
link |
Do you think neural networks can be made to reason?
link |
Well, if you look, for example, at AlphaGo or AlphaZero,
link |
the neural network of AlphaZero plays Go,
link |
which we all agree is a game that requires reasoning,
link |
better than 99.9% of all humans.
link |
Just the neural network, without the search,
link |
just the neural network itself.
link |
Doesn't that give us an existence proof
link |
that neural networks can reason?
link |
To push back and disagree a little bit,
link |
we all agree that Go is reasoning.
link |
I think I agree, I don't think it's a trivial,
link |
so obviously reasoning like intelligence
link |
is a loose gray area term a little bit.
link |
Maybe you disagree with that.
link |
But yes, I think it has some of the same elements of reasoning.
link |
Reasoning is almost like akin to search, right?
link |
There's a sequential element of reasoning
link |
of stepwise consideration of possibilities
link |
and sort of building on top of those possibilities
link |
in a sequential manner until you arrive at some insight.
link |
So yeah, I guess playing Go is kind of like that.
link |
And when you have a single neural network
link |
doing that without search, it's kind of like that.
link |
So there's an existence proof
link |
in a particular constrained environment
link |
that a process akin to what many people call reasoning
link |
exists, but more general kind of reasoning.
link |
There is one other existence proof.
link |
Oh boy, which one?
link |
Us humans.
link |
Okay, all right, so do you think the architecture
link |
that will allow neural networks to reason
link |
will look similar to the neural network architectures we have today?
link |
I think, well, I don't wanna make
link |
too overly definitive statements.
link |
I think it's definitely possible
link |
that the neural networks that will produce
link |
the reasoning breakthroughs of the future
link |
will be very similar to the architectures that exist today.
link |
Maybe a little bit more recurrent,
link |
maybe a little bit deeper.
link |
But these neural nets are so insanely powerful.
link |
Why wouldn't they be able to learn to reason?
link |
Humans can reason.
link |
So why can't neural networks?
link |
So do you think the kind of stuff we've seen
link |
neural networks do is a kind of just weak reasoning?
link |
So it's not a fundamentally different process.
link |
Again, this is stuff nobody knows the answer to.
link |
So when it comes to our neural networks,
link |
the thing which I would say is that neural networks
link |
are capable of reasoning.
link |
But if you train a neural network on a task
link |
which doesn't require reasoning, it's not going to reason.
link |
This is a well known effect where the neural network
link |
will solve the problem that you pose in front of it
link |
in the easiest way possible.
link |
Right, that takes us to one of the brilliant sort of ways
link |
you've described neural networks,
link |
which is you've referred to neural networks
link |
as the search for small circuits
link |
and maybe general intelligence
link |
as the search for small programs,
link |
which I found as a metaphor very compelling.
link |
Can you elaborate on that difference?
link |
Yeah, so the thing which I said precisely was that
link |
if you can find the shortest program
link |
that outputs the data at your disposal,
link |
then you will be able to use it
link |
to make the best prediction possible.
link |
And that's a theoretical statement
link |
which can be proved mathematically.
link |
Now, you can also prove mathematically
link |
that finding the shortest program
link |
which generates some data is not a computable operation.
link |
No finite amount of compute can do this.
link |
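The shortest-program-as-best-predictor idea can be illustrated, very crudely, with an off-the-shelf compressor standing in for program length: among candidate continuations of a sequence, prefer the one that makes the whole sequence compress to the fewest bytes. This sketch and its function names are my own illustration of the principle, not a method from the conversation.

```python
import zlib

def compressed_size(data: bytes) -> int:
    # Compressed length as a rough, computable proxy for
    # "length of the shortest program that outputs the data".
    return len(zlib.compress(data, 9))

def predict_next(seq: bytes, candidates: bytes) -> int:
    # Pick the candidate byte whose appended sequence compresses
    # smallest, i.e. the continuation with the shortest description.
    return min(candidates, key=lambda c: compressed_size(seq + bytes([c])))

seq = b"ab" * 200  # ends in 'b', so the pattern-consistent next byte is 'a'
best = predict_next(seq, b"axz")
```

Real Kolmogorov complexity is uncomputable, exactly as stated above; zlib is just the cheapest computable stand-in that still shows the prediction-by-compression flavor.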
So then with neural networks,
link |
neural networks are the next best thing
link |
that actually works in practice.
link |
We are not able to find the best,
link |
the shortest program which generates our data,
link |
but we are able to find a small,
link |
but now that statement should be amended,
link |
even a large circuit which fits our data in some way.
link |
Well, I think what you meant by the small circuit
link |
is the smallest needed circuit.
link |
Well, the thing which I would change now,
link |
back then I really haven't fully internalized
link |
the overparameterized results.
link |
The things we know about overparameterized neural nets,
link |
now I would phrase it as a large circuit
link |
whose weights contain a small amount of information,
link |
which I think is what's going on.
link |
If you imagine the training process of a neural network
link |
as you slowly transmit entropy
link |
from the dataset to the parameters,
link |
then somehow the amount of information in the weights
link |
ends up being not very large,
link |
which would explain why they generalize so well.
link |
So the large circuit might be one that's helpful
link |
for the generalization.
link |
Yeah, something like this.
link |
But do you see it important to be able to try
link |
to learn something like programs?
link |
I mean, if we can, definitely.
link |
I think it's kind of, the answer is kind of yes,
link |
if we can do it, we should do things that we can do.
link |
It's the reason we are pushing on deep learning,
link |
the fundamental reason, the root cause
link |
is that we are able to train them.
link |
So in other words, training comes first.
link |
We've got our pillar, which is the training pillar.
link |
And now we're trying to contort our neural networks
link |
around the training pillar.
link |
We gotta stay trainable.
link |
This is an invariant we cannot violate.
link |
And so being trainable means starting from scratch,
link |
knowing nothing, you can actually pretty quickly
link |
converge towards knowing a lot.
link |
But it means that given the resources at your disposal,
link |
you can train the neural net
link |
and get it to achieve useful performance.
link |
Yeah, that's a pillar we can't move away from.
link |
Because if you say, hey, let's find the shortest program,
link |
well, we can't do that.
link |
So it doesn't matter how useful that would be.
link |
So do you think, you kind of mentioned
link |
that the neural networks are good at finding small circuits
link |
or large circuits.
link |
Do you think then the matter of finding small programs
link |
So the, sorry, not the size or the type of data.
link |
Sort of ask, giving it programs.
link |
Well, I think the thing is that right now,
link |
finding, there are no good precedents
link |
of people successfully finding programs really well.
link |
And so the way you'd find programs
link |
is you'd train a deep neural network to do it basically.
link |
Which is the right way to go about it.
link |
But there's not good illustrations of that.
link |
It hasn't been done yet.
link |
But in principle, it should be possible.
link |
Can you elaborate a little bit,
link |
what's your answer in principle?
link |
Put another way, you don't see why it's not possible.
link |
Well, it's kind of like more, it's more a statement of,
link |
I think that it's, I think that it's unwise
link |
to bet against deep learning.
link |
And if it's a cognitive function
link |
that humans seem to be able to do,
link |
then it doesn't take too long
link |
for some deep neural net to pop up that can do it too.
link |
Yeah, I'm there with you.
link |
I've stopped betting against neural networks at this point
link |
because they continue to surprise us.
link |
What about long term memory?
link |
Can neural networks have long term memory?
link |
Something like knowledge bases.
link |
So being able to aggregate important information
link |
over long periods of time that would then serve
link |
as useful sort of representations of state
link |
that you can make decisions by,
link |
so have a long term context
link |
based on which you're making the decision.
link |
So in some sense, the parameters already do that.
link |
The parameters are an aggregation of the neural,
link |
of the entirety of the neural nets experience,
link |
and so they count as long term knowledge.
link |
And people have trained various neural nets
link |
to act as knowledge bases and, you know,
link |
investigated with, people have investigated
link |
language models as knowledge bases.
link |
So there is work there.
link |
Yeah, but in some sense, do you think in every sense,
link |
do you think there's a, it's all just a matter of coming up
link |
with a better mechanism of forgetting the useless stuff
link |
and remembering the useful stuff?
link |
Because right now, I mean, there's not been mechanisms
link |
that do remember really long term information.
link |
What do you mean by that precisely?
link |
Precisely, I like the word precisely.
link |
So I'm thinking of the kind of compression of information
link |
the knowledge bases represent.
link |
Sort of creating a, now I apologize for my sort of
link |
human centric thinking about what knowledge is,
link |
because neural networks aren't interpretable necessarily
link |
with the kind of knowledge they have discovered.
link |
But a good example for me is knowledge bases,
link |
being able to build up over time something like
link |
the knowledge that Wikipedia represents.
link |
It's a really compressed, structured knowledge base.
link |
Obviously not the actual Wikipedia or the language,
link |
but like a semantic web, the dream that semantic web
link |
represented, so it's a really nice compressed knowledge base
link |
or something akin to that in the noninterpretable sense
link |
as neural networks would have.
link |
Well, the neural networks would be noninterpretable
link |
if you look at their weights, but their outputs
link |
should be very interpretable.
link |
Okay, so yeah, how do you make very smart neural networks
link |
like language models interpretable?
link |
Well, you ask them to generate some text
link |
and the text will generally be interpretable.
link |
Do you find that the epitome of interpretability,
link |
like can you do better?
link |
Like can you add, because you can't, okay,
link |
I'd like to know what does it know and what doesn't it know?
link |
I would like the neural network to come up with examples
link |
where it's completely dumb and examples
link |
where it's completely brilliant.
link |
And the only way I know how to do that now
link |
is to generate a lot of examples and use my human judgment.
link |
But it would be nice if a neural network
link |
had some self awareness about it.
link |
Yeah, 100%, I'm a big believer in self awareness
link |
and I think that, I think neural net self awareness
link |
will allow for things like the capabilities,
link |
like the ones you described, like for them to know
link |
what they know and what they don't know
link |
and for them to know where to invest
link |
to increase their skills most optimally.
link |
And to your question of interpretability,
link |
there are actually two answers to that question.
link |
One answer is, you know, we have the neural net
link |
so we can analyze the neurons and we can try to understand
link |
what the different neurons and different layers mean.
link |
And you can actually do that
link |
and OpenAI has done some work on that.
link |
But there is a different answer, which is that,
link |
I would say that's the human centric answer where you say,
link |
you know, you look at a human being, you can't read,
link |
how do you know what a human being is thinking?
link |
You ask them, you say, hey, what do you think about this?
link |
What do you think about that?
link |
And you get some answers.
link |
The answers you get are sticky in the sense
link |
you already have a mental model.
link |
You already have a mental model of that human being.
link |
You already have an understanding of like a big conception
link |
of that human being, how they think, what they know,
link |
how they see the world and then everything you ask,
link |
you're adding onto that.
link |
And that stickiness seems to be,
link |
that's one of the really interesting qualities
link |
of the human being is that information is sticky.
link |
You don't, you seem to remember the useful stuff,
link |
aggregate it well and forget most of the information
link |
that's not useful, that process.
link |
But that's also pretty similar to the process
link |
that neural networks do.
link |
It's just that neural networks are much crappier at it.
link |
It doesn't seem to be fundamentally that different.
link |
But just to stick on reasoning for a little longer,
link |
you said, why not?
link |
Why can't I reason?
link |
What's a good impressive feat, benchmark to you
link |
of reasoning that you'll be impressed by
link |
if neural networks were able to do?
link |
Is that something you already have in mind?
link |
Well, I think writing really good code,
link |
I think proving really hard theorems,
link |
solving open ended problems with out of the box solutions.
link |
And sort of theorem type, mathematical problems.
link |
Yeah, I think those ones are a very natural example of reasoning.
link |
If you can prove an unproven theorem,
link |
then it's hard to argue you don't reason.
link |
And so by the way, and this comes back to the point
link |
about the hard results, if you have machine learning,
link |
deep learning as a field is very fortunate
link |
because we have the ability to sometimes produce
link |
these unambiguous results.
link |
And when they happen, the debate changes,
link |
the conversation changes.
link |
It's a converse, we have the ability
link |
to produce conversation changing results.
link |
Conversation, and then of course, just like you said,
link |
people kind of take that for granted
link |
and say that wasn't actually a hard problem.
link |
Well, I mean, at some point we'll probably run out
link |
Yeah, that whole mortality thing is kind of a sticky problem
link |
that we haven't quite figured out.
link |
Maybe we'll solve that one.
link |
I think one of the fascinating things
link |
in your entire body of work,
link |
but also the work at OpenAI recently,
link |
one of the conversation changes has been
link |
in the world of language models.
link |
Can you briefly kind of try to describe
link |
the recent history of using neural networks
link |
in the domain of language and text?
link |
Well, there's been lots of history.
link |
I think the Elman network was a small,
link |
tiny recurrent neural network applied to language back in the 80s.
link |
So the history is really, you know, fairly long at least.
link |
And the thing that started,
link |
the thing that changed the trajectory
link |
of neural networks and language
link |
is the thing that changed the trajectory
link |
of all deep learning and that's data and compute.
link |
So suddenly you move from small language models,
link |
which learn a little bit,
link |
and with language models in particular,
link |
there's a very clear explanation
link |
for why they need to be large to be good,
link |
because they're trying to predict the next word.
link |
So when you don't know anything,
link |
you'll notice very, very broad strokes,
link |
surface level patterns,
link |
like sometimes there are characters
link |
and there is a space between those characters.
link |
You'll notice this pattern.
link |
And you'll notice that sometimes there is a comma
link |
and then the next character is a capital letter.
link |
You'll notice that pattern.
link |
Eventually you may start to notice
link |
that there are certain words occur often.
link |
You may notice that spellings are a thing.
link |
You may notice syntax.
link |
And when you get really good at all these,
link |
you start to notice the semantics.
link |
You start to notice the facts.
link |
But for that to happen,
link |
the language model needs to be larger.
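The "broad strokes, surface level patterns" described above can be illustrated with the simplest possible model. The snippet below is a toy count-based bigram model on a made-up string, not any model discussed in the conversation: even this picks up that a space tends to follow a comma, while everything beyond such statistics (words, syntax, semantics) requires progressively more capacity.

```python
# Toy illustration: a count-based character bigram model learns the
# "surface level" patterns described above, e.g. space after comma.
# The text and model are invented for illustration.
from collections import Counter, defaultdict

text = "hello world, How are you? I am fine, Thanks for asking, Bye"

# Count how often each character follows each other character.
bigrams = defaultdict(Counter)
for prev, nxt in zip(text, text[1:]):
    bigrams[prev][nxt] += 1

def predict_next(ch):
    """Most likely next character after `ch` under the bigram counts."""
    return bigrams[ch].most_common(1)[0][0]

print(predict_next(","))  # prints " ": a space usually follows a comma
```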
link |
So that's, let's linger on that,
link |
because that's where you and Noam Chomsky disagree.
link |
So you think we're actually taking incremental steps,
link |
a sort of larger network, larger compute
link |
will be able to get to the semantics,
link |
to be able to understand language
link |
without what Noam likes to sort of think of
link |
as a fundamental understandings
link |
of the structure of language,
link |
like imposing your theory of language
link |
onto the learning mechanism.
link |
So you're saying the learning,
link |
you can learn from raw data,
link |
the mechanism that underlies language.
link |
Well, I think it's pretty likely,
link |
but I also want to say that I don't really know precisely
link |
what Chomsky means when he talks about it.
link |
You said something about imposing your theory of language.
link |
I'm not 100% sure what he means,
link |
but empirically it seems that
link |
when you inspect those larger language models,
link |
they exhibit signs of understanding the semantics
link |
whereas the smaller language models do not.
link |
We've seen that a few years ago
link |
when we did work on the sentiment neuron.
link |
We trained a small, you know,
link |
smallish LSTM to predict the next character
link |
in Amazon reviews.
link |
And we noticed that when you increase the size of the LSTM
link |
from 500 LSTM cells to 4,000 LSTM cells,
link |
then one of the neurons starts to represent the sentiment
link |
of the article, sorry, of the review.
link |
Sentiment is a pretty semantic attribute.
link |
It's not a syntactic attribute.
link |
And for people who might not know,
link |
I don't know if that's a standard term,
link |
but sentiment is whether it's a positive
link |
or a negative review.
link |
Is the person happy with something
link |
or is the person unhappy with something?
link |
And so here we had very clear evidence
link |
that a small neural net does not capture sentiment
link |
while a large neural net does.
link |
Well, our theory is that at some point
link |
you run out of syntax to model,
link |
you've gotta focus on something else.
link |
And with size, you quickly run out of syntax to model
link |
and then you really start to focus on the semantics
link |
would be the idea.
link |
And so I don't wanna imply that our models
link |
have complete semantic understanding
link |
because that's not true,
link |
but they definitely are showing signs
link |
of semantic understanding,
link |
partial semantic understanding,
link |
but the smaller models do not show those signs.
link |
Can you take a step back and say,
link |
what is GPT2, which is one of the big language models
link |
that was the conversation changer
link |
in the past couple of years?
link |
Yeah, so GPT2 is a transformer
link |
with one and a half billion parameters
link |
that was trained on about 40 billion tokens of text
link |
which were obtained from web pages
link |
that were linked to from Reddit articles
link |
with more than three upvotes.
link |
And what's a transformer?
link |
The transformer, it's the most important advance
link |
in neural network architectures in recent history.
link |
What is attention maybe too?
link |
Cause I think that's an interesting idea,
link |
not necessarily sort of technically speaking,
link |
but the idea of attention versus maybe
link |
what recurrent neural networks represent.
link |
Yeah, so the thing is the transformer
link |
is a combination of multiple ideas simultaneously
link |
of which attention is one.
link |
Do you think attention is the key?
link |
No, it's a key, but it's not the key.
link |
The transformer is successful
link |
because it is the simultaneous combination
link |
of multiple ideas.
link |
And if you were to remove either idea,
link |
it would be much less successful.
link |
So the transformer uses a lot of attention,
link |
but attention existed for a few years.
link |
So that can't be the main innovation.
link |
The transformer is designed in such a way
link |
that it runs really fast on the GPU.
link |
And that makes a huge amount of difference.
link |
This is one thing.
link |
The second thing is that transformer is not recurrent.
link |
And that is really important too,
link |
because it is more shallow
link |
and therefore much easier to optimize.
link |
So in other words, it uses attention,
link |
it is a really great fit to the GPU
link |
and it is not recurrent,
link |
so therefore less deep and easier to optimize.
link |
And the combination of those factors make it successful.
link |
So now it makes great use of your GPU.
link |
It allows you to achieve better results
link |
for the same amount of compute.
link |
And that's why it's successful.
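The contrast drawn here, attention plus a non-recurrent design that fits the GPU, can be sketched in a few lines. This is a minimal NumPy version of the scaled dot-product attention operation, with invented sizes; the point is that, unlike an RNN, every position is processed at once with batched matrix multiplies, which is exactly the kind of computation GPUs are good at.

```python
# Minimal sketch of scaled dot-product attention. Unlike a recurrent
# net, all positions are computed in parallel with matrix multiplies,
# which is why the architecture is such a good fit for GPUs.
import numpy as np

def attention(Q, K, V):
    """Attend over all positions at once: softmax(QK^T / sqrt(d)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq, seq) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted mix of values

rng = np.random.default_rng(0)
seq_len, d = 5, 8                                   # illustrative sizes
Q = rng.normal(size=(seq_len, d))
K = rng.normal(size=(seq_len, d))
V = rng.normal(size=(seq_len, d))

out = attention(Q, K, V)
print(out.shape)  # (5, 8): one output per position, computed in parallel
```

A real transformer stacks this with learned Q/K/V projections, multiple heads, feed-forward layers, and residual connections; this sketch shows only the attention core named in the conversation.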
link |
Were you surprised how well transformers worked
link |
So you worked on language.
link |
You've had a lot of great ideas
link |
before transformers came about in language.
link |
So you got to see the whole set of revolutions
link |
Were you surprised?
link |
I mean, it's hard to remember
link |
because you adapt really quickly,
link |
but it definitely was surprising.
link |
It definitely was.
link |
In fact, you know what?
link |
I'll retract my statement.
link |
It was pretty amazing.
link |
It was just amazing to see it generate this text.
link |
And you know, you gotta keep in mind
link |
that at that time we'd seen all this progress in GANs,
link |
and the samples produced by GANs
link |
were just amazing.
link |
You have these realistic faces,
link |
but text hasn't really moved that much.
link |
And suddenly we moved from, you know,
link |
whatever GANs were in 2015
link |
to the best, most amazing GANs in one step.
link |
And that was really stunning.
link |
Even though theory predicted,
link |
yeah, you train a big language model,
link |
of course you should get this,
link |
but then to see it with your own eyes,
link |
it's something else.
link |
And yet we adapt really quickly.
link |
And now there's sort of some cognitive scientists
link |
write articles saying that GPT2 models
link |
don't truly understand language.
link |
So we adapt quickly to how amazing
link |
the fact that they're able to model the language so well is.
link |
So what do you think is the bar?
link |
For impressing us that it...
link |
Do you think that bar will continuously be moved?
link |
I think when you start to see
link |
really dramatic economic impact,
link |
that's when I think that's in some sense the next barrier.
link |
Because right now, if you think about the work in AI,
link |
it's really confusing.
link |
It's really hard to know what to make of all these advances.
link |
It's kind of like, okay, you got an advance
link |
and now you can do more things
link |
and you've got another improvement
link |
and you've got another cool demo.
link |
At some point, I think people who are outside of AI,
link |
they can no longer distinguish this progress anymore.
link |
So we were talking offline
link |
about translating Russian to English
link |
and how there's a lot of brilliant work in Russian
link |
that the rest of the world doesn't know about.
link |
That's true for Chinese,
link |
it's true for a lot of scientists
link |
and just artistic work in general.
link |
Do you think translation is the place
link |
where we're going to see sort of economic big impact?
link |
I think there is a huge number of...
link |
I mean, first of all,
link |
I wanna point out that translation already today is huge.
link |
I think billions of people interact
link |
with big chunks of the internet primarily through translation.
link |
So translation is already huge
link |
and it's hugely positive too.
link |
I think self driving is going to be hugely impactful
link |
and that's, it's unknown exactly when it happens,
link |
but again, I would not bet against deep learning, so I...
link |
So there's deep learning in general,
link |
but you think this...
link |
Deep learning for self driving.
link |
Yes, deep learning for self driving.
link |
But I was talking about sort of language models.
link |
We veered off a little bit.
link |
You're not seeing a connection between driving and language.
link |
Or rather both use neural nets.
link |
That'd be a poetic connection.
link |
I think there might be some,
link |
like you said, there might be some kind of unification
link |
towards a kind of multitask transformers
link |
that can take on both language and vision tasks.
link |
That'd be an interesting unification.
link |
Now let's see, what can I ask about GPT two more?
link |
There's not much to ask.
link |
It's, you take a transformer, you make it bigger,
link |
you give it more data,
link |
and suddenly it does all those amazing things.
link |
Yeah, one of the beautiful things is that GPT,
link |
the transformers are fundamentally simple to explain,
link |
Do you think bigger will continue
link |
to show better results in language?
link |
Sort of like what are the next steps
link |
with GPT two, do you think?
link |
I mean, I think for sure seeing
link |
what larger versions can do is one direction.
link |
Also, I mean, there are many questions.
link |
There's one question which I'm curious about
link |
and that's the following.
link |
So right now GPT two,
link |
so we feed it all this data from the internet,
link |
which means that it needs to memorize
link |
all those random facts about everything in the internet.
link |
And it would be nice if the model could somehow
link |
use its own intelligence to decide
link |
what data it wants to accept
link |
and what data it wants to reject.
link |
People don't learn all data indiscriminately.
link |
We are super selective about what we learn.
link |
And I think this kind of active learning,
link |
I think would be very nice to have.
link |
Yeah, listen, I love active learning.
link |
So let me ask, does the selection of data,
link |
can you just elaborate that a little bit more?
link |
Do you think the selection of data is,
link |
like I have this kind of sense
link |
that the optimization of how you select data,
link |
so the active learning process is going to be a place
link |
for a lot of breakthroughs, even in the near future?
link |
Because there hasn't been many breakthroughs there
link |
I feel like there might be private breakthroughs
link |
that companies keep to themselves
link |
because the fundamental problem has to be solved
link |
if you want to solve self driving,
link |
if you want to solve a particular task.
link |
What do you think about the space in general?
link |
Yeah, so I think that for something like active learning,
link |
or in fact, for any kind of capability, like active learning,
link |
the thing that it really needs is a problem.
link |
It needs a problem that requires it.
link |
It's very hard to do research about the capability
link |
if you don't have a task,
link |
because then what's going to happen
link |
is that you will come up with an artificial task,
link |
get good results, but not really convince anyone.
link |
Right, like we're now past the stage
link |
where getting a result on MNIST, some clever formulation
link |
of MNIST will convince people.
link |
That's right, in fact, you could quite easily
link |
come up with a simple active learning scheme on MNIST
link |
and get a 10x speed up, but then, so what?
link |
And I think that with active learning,
link |
the need, active learning will naturally arise
link |
as problems that require it pop up.
link |
That's how I would, that's my take on it.
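The simplest active-learning scheme alluded to above (the kind that gives an easy speedup on a toy task) can be sketched as follows. Everything here, the toy linearly separable data and the least-squares stand-in for a classifier, is invented for illustration: instead of labeling examples at random, the loop queries the examples the current model is least certain about.

```python
# Hedged sketch of uncertainty-based active learning on invented data:
# repeatedly fit a model on the labeled set, then request a label for
# the unlabeled example the model is least certain about.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))      # unlabeled pool
true_w = np.array([1.0, -2.0])
y = (X @ true_w > 0).astype(float)         # hidden labels (the "oracle")

labeled = [int(i) for i in rng.choice(200, size=5, replace=False)]

def fit(idx):
    """Least-squares linear 'classifier' on the labeled subset."""
    return np.linalg.lstsq(X[idx], y[idx] - 0.5, rcond=None)[0]

for _ in range(10):                        # 10 query rounds
    w = fit(labeled)
    margin = np.abs(X @ w)                 # small margin = uncertain
    margin[labeled] = np.inf               # don't re-query known labels
    labeled.append(int(np.argmin(margin))) # label the most uncertain point

w = fit(labeled)
acc = float(((X @ w > 0) == (y > 0.5)).mean())
print(f"accuracy after {len(labeled)} labels: {acc:.2f}")
```

The point being made in the conversation still stands: a scheme like this is easy to make work on a toy problem, which is exactly why it convinces no one until a real task requires it.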
link |
There's another interesting thing
link |
that OpenAI has brought up with GPT2,
link |
which is when you create a powerful
link |
artificial intelligence system,
link |
and it was unclear what kind of detrimental,
link |
once you release GPT2,
link |
what kind of detrimental effect it will have.
link |
Because if you have a model
link |
that can generate a pretty realistic text,
link |
you can start to imagine that it would be used by bots
link |
in some way that we can't even imagine.
link |
So there's this nervousness about what is possible to do.
link |
So you did a really kind of brave
link |
and I think profound thing,
link |
which is start a conversation about this.
link |
How do we release powerful artificial intelligence models
link |
If we do at all, how do we privately discuss
link |
with other, even competitors,
link |
about how we manage the use of the systems and so on?
link |
So from this whole experience,
link |
you released a report on it,
link |
but in general, are there any insights
link |
that you've gathered from just thinking about this,
link |
about how you release models like this?
link |
I mean, I think that my take on this
link |
is that the field of AI has been in a state of childhood.
link |
And now it's exiting that state
link |
and it's entering a state of maturity.
link |
What that means is that AI is very successful
link |
and also very impactful.
link |
And its impact is not only large, but it's also growing.
link |
And so for that reason, it seems wise to start thinking
link |
about the impact of our systems before releasing them,
link |
maybe a little bit too soon, rather than a little bit too late.
link |
And with the case of GPT2, like I mentioned earlier,
link |
the results really were stunning.
link |
And it seemed plausible, it didn't seem certain,
link |
it seemed plausible that something like GPT2
link |
could easily be used to reduce the cost of disinformation.
link |
And so there was a question of what's the best way
link |
to release it, and a staged release seemed logical.
link |
A small model was released,
link |
and there was time to see the,
link |
many people use these models in lots of cool ways.
link |
There've been lots of really cool applications.
link |
There haven't been any negative applications that we know of.
link |
And so eventually it was released,
link |
but also other people replicated similar models.
link |
That's an interesting caveat though, that we know of.
link |
So in your view, staged release,
link |
is at least part of the answer to the question of how do we,
link |
what do we do once we create a system like this?
link |
It's part of the answer, yes.
link |
Is there any other insights?
link |
Like say you don't wanna release the model at all,
link |
because it's useful to you for whatever the business is.
link |
Well, plenty of people don't release models already.
link |
Right, of course, but is there some moral,
link |
ethical responsibility when you have a very powerful model
link |
to sort of communicate?
link |
Like, just as you said, when you had GPT2,
link |
it was unclear how much it could be used for misinformation.
link |
It's an open question, and getting an answer to that
link |
might require that you talk to other really smart people
link |
that are outside of your particular group.
link |
Have you, please tell me there's some optimistic pathway
link |
for people to be able to use this model
link |
for people across the world to collaborate
link |
on these kinds of cases?
link |
Or is it still really difficult from one company
link |
to talk to another company?
link |
So it's definitely possible.
link |
It's definitely possible to discuss these kind of models
link |
with colleagues elsewhere,
link |
and to get their take on what to do.
link |
How hard is it though?
link |
Do you see that happening?
link |
I think that's a place where it's important
link |
to gradually build trust between companies.
link |
Because ultimately, all the AI developers
link |
are building technology which is going to be
link |
increasingly more powerful.
link |
the way to think about it is that ultimately
link |
we're all in it together.
link |
Yeah, I tend to believe in the better angels of our nature,
link |
but I do hope that when you build a really powerful
link |
AI system in a particular domain,
link |
that you also think about the potential
link |
negative consequences of, yeah.
link |
It's an interesting and scary possibility
link |
that there will be a race for AI development
link |
that would push people to close that development,
link |
and not share ideas with others.
link |
I don't love this.
link |
I've been a pure academic for 10 years.
link |
I really like sharing ideas and it's fun, it's exciting.
link |
What do you think it takes to,
link |
let's talk about AGI a little bit.
link |
What do you think it takes to build a system
link |
of human level intelligence?
link |
We talked about reasoning,
link |
we talked about long term memory, but in general,
link |
what does it take, do you think?
link |
Well, I can't be sure.
link |
But I think the deep learning,
link |
plus maybe another,
link |
plus maybe another small idea.
link |
Do you think self play will be involved?
link |
So you've spoken about the powerful mechanism of self play
link |
where systems learn by sort of exploring the world
link |
in a competitive setting against other entities
link |
that are similarly skilled as them,
link |
and so incrementally improve in this way.
link |
Do you think self play will be a component
link |
of building an AGI system?
link |
Yeah, so what I would say, to build AGI,
link |
I think it's going to be deep learning plus some ideas.
link |
And I think self play will be one of those ideas.
link |
I think that that is a very,
link |
self play has this amazing property
link |
that it can surprise us in truly novel ways.
link |
For example, like we, I mean,
link |
pretty much every self play system,
link |
both our Dota bot.
link |
I don't know if, OpenAI had a release about multi agent
link |
where you had two little agents
link |
who were playing hide and seek.
link |
And of course, also alpha zero.
link |
They were all produced surprising behaviors.
link |
They all produce behaviors that we didn't expect.
link |
They are creative solutions to problems.
link |
And that seems like an important part of AGI
link |
that our systems don't exhibit routinely right now.
link |
And so that's why I like this area.
link |
I like this direction because of its ability to surprise us.
link |
And an AGI system would surprise us fundamentally.
link |
And to be precise, not just a random surprise,
link |
but to find the surprising solution to a problem
link |
that's also useful.
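The self-play dynamic described here, copies of the same learner improving against each other at a matched skill level, can be illustrated with a tiny runnable example. This is not any of the systems mentioned (Dota, hide-and-seek, AlphaZero); it is fictitious play in rock-paper-scissors, where each copy best-responds to the other's empirical play and the population drifts toward the equilibrium strategy.

```python
# Toy self-play: two copies of the same learner play rock-paper-scissors,
# each best-responding to the opponent's empirical move frequencies
# (fictitious play). Empirical play drifts toward uniform (the
# equilibrium). Purely illustrative, not any system named above.
import numpy as np

BEATS = {0: 2, 1: 0, 2: 1}  # rock beats scissors, paper beats rock, ...

counts = [np.ones(3), np.ones(3)]  # each side's observed opponent moves

for _ in range(5000):
    moves = []
    for me in (0, 1):
        likely = int(np.argmax(counts[me]))  # opponent's most common move
        # Best response: play the move that beats it.
        moves.append([m for m, b in BEATS.items() if b == likely][0])
    counts[0][moves[1]] += 1
    counts[1][moves[0]] += 1

empirical = counts[0] / counts[0].sum()
print(np.round(empirical, 2))  # close to [0.33, 0.33, 0.33]
```

Even in this trivial setting the key property shows up: neither player is trained against a fixed, hand-designed opponent; the curriculum is generated by the learners themselves.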
link |
Now, a lot of the self play mechanisms
link |
have been used in the game context
link |
or at least in the simulation context.
link |
How far along the path to AGI
link |
do you think will be done in simulation?
link |
How much faith, promise do you have in simulation
link |
versus having to have a system
link |
that operates in the real world?
link |
Whether it's the real world of digital real world data
link |
or real world like actual physical world of robotics.
link |
I don't think it's an either or.
link |
I think simulation is a tool and it helps.
link |
It has certain strengths and certain weaknesses
link |
and we should use it.
link |
Yeah, but okay, I understand that.
link |
That's true, but one of the criticisms of self play,
link |
one of the criticisms of reinforcement learning
link |
is that its current results,
link |
while amazing, have been demonstrated
link |
in simulated environments
link |
or very constrained physical environments.
link |
Do you think it's possible to escape them,
link |
escape the simulator environments
link |
and be able to learn in non simulator environments?
link |
Or do you think it's possible to also just simulate
link |
in a photo realistic and physics realistic way,
link |
the real world in a way that we can solve real problems
link |
with self play in simulation?
link |
So I think that transfer from simulation to the real world
link |
is definitely possible and has been exhibited many times
link |
by many different groups.
link |
It's been especially successful in vision.
link |
Also open AI in the summer has demonstrated a robot hand
link |
which was trained entirely in simulation
link |
in a certain way that allowed for sim to real transfer
link |
Is this for the Rubik's cube?
link |
Yeah, that's right.
link |
I wasn't aware that was trained in simulation.
link |
It was trained in simulation entirely.
link |
Really, so it wasn't in the physical,
link |
the hand wasn't trained?
link |
No, 100% of the training was done in simulation
link |
and the policy that was learned in simulation
link |
was trained to be very adaptive.
link |
So adaptive that when you transfer it,
link |
it could very quickly adapt to the physical world.
link |
So the kind of perturbations with the giraffe
link |
or whatever the heck it was,
link |
those weren't, were those part of the simulation?
link |
Well, the simulation was generally,
link |
so the simulation was trained to be robust
link |
to many different things,
link |
but not the kind of perturbations we've had in the video.
link |
So it's never been trained with a glove.
link |
It's never been trained with a stuffed giraffe.
link |
So in theory, these are novel perturbations.
link |
Correct, it's not in theory, in practice.
link |
Those are novel perturbations?
link |
Well, that's okay.
link |
That's a clean, small scale,
link |
but clean example of a transfer
link |
from the simulated world to the physical world.
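The standard idea behind this kind of sim-to-real transfer is domain randomization: randomize the simulator's parameters every episode, so the policy must be robust to a whole family of "worlds" and the real world looks like just one more sample. The sketch below is a schematic stand-in, not OpenAI's actual training setup; the parameter names and ranges are invented.

```python
# Schematic domain randomization loop. The simulator, parameters, and
# training step are placeholders invented for illustration.
import random

def make_randomized_sim():
    """Sample one simulated 'world' with perturbed physics."""
    return {
        "friction": random.uniform(0.5, 1.5),
        "object_mass": random.uniform(0.8, 1.2),
        "actuator_delay": random.randint(0, 3),
    }

def train_episode(policy, sim):
    # Placeholder: roll out `policy` in `sim` and update it.
    policy["episodes"] += 1

random.seed(0)
policy = {"episodes": 0}
for _ in range(1000):
    sim = make_randomized_sim()   # new physics every episode
    train_episode(policy, sim)

print(policy["episodes"])  # trained across 1000 differently-perturbed worlds
```

This also fits the "adaptive policy" point above: a policy that has had to cope with many perturbed simulations is more likely to adapt quickly when the physical world turns out to be yet another variation.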
link |
Yeah, and I will also say
link |
that I expect the transfer capabilities
link |
of deep learning to increase in general.
link |
And the better the transfer capabilities are,
link |
the more useful simulation will become.
link |
Because then you could take,
link |
you could experience something in simulation
link |
and then learn a moral of the story,
link |
which you could then carry with you to the real world.
link |
As humans do all the time when they play computer games.
link |
So let me ask sort of a embodied question,
link |
staying on AGI for a sec.
link |
Do you think AGI system would need to have a body?
link |
We need to have some of those human elements
link |
of self awareness, consciousness,
link |
sort of fear of mortality,
link |
sort of self preservation in the physical space,
link |
which comes with having a body.
link |
I think having a body will be useful.
link |
I don't think it's necessary,
link |
but I think it's very useful to have a body for sure,
link |
because you can learn a whole new,
link |
you can learn things which cannot be learned without a body.
link |
But at the same time, I think that if you don't have a body,
link |
you could compensate for it and still succeed.
link |
Well, there is evidence for this.
link |
For example, there are many people who were born deaf
link |
and blind and they were able to compensate
link |
for the lack of modalities.
link |
I'm thinking about Helen Keller specifically.
link |
So even if you're not able to physically interact
link |
with the world, and if you're not able to,
link |
I mean, I actually was getting at,
link |
maybe let me ask on the more particular,
link |
I'm not sure if it's connected to having a body or not,
link |
but the idea of consciousness
link |
and a more constrained version of that is self awareness.
link |
Do you think an AGI system should have consciousness?
link |
We can't define, whatever the heck you think consciousness is.
link |
Yeah, hard question to answer,
link |
given how hard it is to define it.
link |
Do you think it's useful to think about?
link |
I mean, it's definitely interesting.
link |
I think it's definitely possible
link |
that our systems will be conscious.
link |
Do you think that's an emergent thing that just comes from,
link |
do you think consciousness could emerge
link |
from the representation that's stored within neural networks?
link |
So like that it naturally just emerges
link |
when you become more and more,
link |
you're able to represent more and more of the world?
link |
Well, I'd say I'd make the following argument,
link |
which is humans are conscious.
link |
And if you believe that artificial neural nets
link |
are sufficiently similar to the brain,
link |
then there should at least exist artificial neural nets
link |
that should be conscious too.
link |
You're leaning on that existence proof pretty heavily.
link |
Okay, so that's the best answer I can give.
link |
No, I know, I know, I know.
link |
There's still an open question
link |
if there's not some magic in the brain that we're not,
link |
I mean, I don't mean a non materialistic magic,
link |
but that the brain might be a lot more complicated
link |
and interesting than we give it credit for.
link |
If that's the case, then it should show up.
link |
And at some point we will find out
link |
that we can't continue to make progress.
link |
But I think it's unlikely.
link |
So we talk about consciousness,
link |
but let me talk about another poorly defined concept, intelligence.
link |
Again, we've talked about reasoning,
link |
we've talked about memory.
link |
What do you think is a good test of intelligence for you?
link |
Are you impressed by the test that Alan Turing formulated
link |
with the imitation game with natural language?
link |
Is there something in your mind
link |
that you will be deeply impressed by
link |
if a system was able to do?
link |
I mean, lots of things.
link |
There's a certain frontier of capabilities today.
link |
And there exist things outside of that frontier.
link |
And I would be impressed by any such thing.
link |
For example, I would be impressed by a deep learning system
link |
which solves a very pedestrian task,
link |
like machine translation or computer vision task
link |
or something which never makes a mistake
link |
a human wouldn't make under any circumstances.
link |
I think that is something
link |
which has not yet been demonstrated
link |
and I would find it very impressive.
link |
Yeah, so right now they make mistakes in different,
link |
they might be more accurate than human beings,
link |
but they still, they make a different set of mistakes.
link |
So my, I would guess that a lot of the skepticism
link |
that some people have about deep learning
link |
is when they look at their mistakes and they say,
link |
well, those mistakes, they make no sense.
link |
Like if you understood the concept,
link |
you wouldn't make that mistake.
link |
And I think that changing that would be,
link |
that would inspire me.
link |
That would be, yes, this is progress.
link |
Yeah, that's a really nice way to put it.
link |
But I also just don't like that human instinct
link |
to criticize a model as not intelligent.
link |
That's the same instinct as we do
link |
when we criticize any group of creatures as the other.
link |
Because it's very possible that GPT2
link |
is much smarter than human beings at many things.
link |
That's definitely true.
link |
It has a lot more breadth of knowledge.
link |
Yes, breadth of knowledge
link |
and even perhaps depth on certain topics.
link |
It's kind of hard to judge what depth means,
link |
but there's definitely a sense in which
link |
humans don't make mistakes that these models do.
link |
The same is applied to autonomous vehicles.
link |
The same is probably gonna continue being applied
link |
to a lot of artificial intelligence systems.
link |
We find, this is the annoying thing.
link |
This is the process of, in the 21st century,
link |
the process of analyzing the progress of AI
link |
is the search for one case where the system fails
link |
in a big way where humans would not.
link |
And then many people writing articles about it.
link |
And then broadly, the public generally gets convinced
link |
that the system is not intelligent.
link |
And we pacify ourselves by thinking it's not intelligent
link |
because of this one anecdotal case.
link |
And this seems to continue happening.
link |
Yeah, I mean, there is truth to that.
link |
Although I'm sure that plenty of people
link |
are also extremely impressed
link |
by the system that exists today.
link |
But I think this connects to the earlier point
link |
we discussed that it's just confusing
link |
to judge progress in AI.
link |
And you have a new robot demonstrating something.
link |
How impressed should you be?
link |
And I think that people will start to be impressed
link |
once AI starts to really move the needle on the GDP.
link |
So you're one of the people that might be able
link |
to create an AGI system here.
link |
Not you, but you and OpenAI.
link |
If you do create an AGI system
link |
and you get to spend sort of the evening
link |
with it, him, her, what would you talk about, do you think?
link |
The very first time?
link |
Well, the first time I would just ask all kinds of questions
link |
and try to get it to make a mistake.
link |
And I would be amazed that it doesn't make mistakes
link |
and just keep asking broad questions.
link |
What kind of questions do you think?
link |
Would they be factual or would they be personal,
link |
emotional, psychological?
link |
What do you think?
link |
Would you ask for advice?
link |
I mean, why would I limit myself
link |
talking to a system like this?
link |
Now, again, let me emphasize the fact
link |
that you truly are one of the people
link |
that might be in the room where this happens.
link |
So let me ask sort of a profound question about,
link |
I've just talked to a Stalin historian.
link |
I've been talking to a lot of people who are studying power.
link |
Abraham Lincoln said,
link |
"'Nearly all men can stand adversity,
link |
"'but if you want to test a man's character, give him power.'"
link |
I would say the power of the 21st century,
link |
maybe the 22nd, but hopefully the 21st,
link |
would be the creation of an AGI system
link |
and the people who have control,
link |
direct possession and control of the AGI system.
link |
So what do you think, after spending that evening
link |
having a discussion with the AGI system,
link |
what do you think you would do?
link |
Well, the ideal world I'd like to imagine
link |
is one where humanity
link |
are like the board members of a company
link |
where the AGI is the CEO.
link |
So it would be, I would like,
link |
the picture which I would imagine
link |
is you have some kind of different entities,
link |
different countries or cities,
link |
and the people that live there vote
link |
for what the AGI that represents them should do,
link |
and the AGI that represents them goes and does it.
link |
I think a picture like that, I find very appealing.
link |
You could have multiple AGIs,
link |
an AGI for a city, for a country,
link |
and it would be trying to, in effect,
link |
take the democratic process to the next level.
link |
And the board can always fire the CEO.
link |
Essentially, press the reset button, say.
link |
Press the reset button.
link |
Rerandomize the parameters.
link |
But let me sort of, that's actually,
link |
okay, that's a beautiful vision, I think,
link |
as long as it's possible to press the reset button.
link |
Do you think it will always be possible
link |
to press the reset button?
link |
So I think that it definitely will be possible to build.
link |
So you're talking, so the question
link |
that I really understand from you is,
link |
will humans, will people, have control
link |
over the AI systems that they build?
link |
And my answer is, it's definitely possible
link |
to build AI systems which will want
link |
to be controlled by their humans.
link |
Wow, that's part of their design,
link |
so it's not just that they can't help but be controlled,
link |
but rather,
link |
one of the objectives of their existence
link |
is to be controlled.
link |
In the same way that human parents
link |
generally want to help their children,
link |
they want their children to succeed.
link |
It's not a burden for them.
link |
They are excited to help their children and to feed them
link |
and to dress them and to take care of them.
link |
And I believe with high conviction
link |
that the same will be possible for an AGI.
link |
It will be possible to program an AGI,
link |
to design it in such a way
link |
that it will have a similar deep drive
link |
that it will be delighted to fulfill.
link |
And the drive will be to help humans flourish.
link |
But let me take a step back to that moment
link |
where you create the AGI system.
link |
I think this is a really crucial moment.
link |
And between that moment
link |
and the Democratic board members with the AGI at the head,
link |
there has to be a relinquishing of power.
link |
So, like George Washington, despite all the bad things he did,
link |
one of the big things he did is he relinquished power.
link |
He, first of all, didn't want to be president.
link |
And even when he became president,
link |
he didn't just keep serving indefinitely,
link |
as most dictators do.
link |
Do you see yourself being able to relinquish control
link |
over an AGI system,
link |
given how much power you can have over the world,
link |
at first financial, just making a lot of money, right?
link |
And then control, by having possession of the AGI system.
link |
I'd find it trivial to do that.
link |
I'd find it trivial to relinquish this kind of power.
link |
I mean, the kind of scenario you are describing
link |
sounds terrifying to me.
link |
I would absolutely not want to be in that position.
link |
Do you think you represent the majority
link |
or the minority of people in the AI community?
link |
It's an open question, an important one.
link |
Are most people good, is another way to ask it.
link |
So I don't know if most people are good,
link |
but I think that when it really counts,
link |
people can be better than we think.
link |
That's beautifully put, yeah.
link |
Are there specific mechanisms you can think of
link |
for aligning AI values to human values?
link |
Is that, do you think about these problems
link |
of continued alignment as we develop the AI systems?
link |
In some sense, the kind of question which you are asking is,
link |
so if I were to translate the question to today's terms,
link |
it would be a question about how to get an RL agent
link |
that's optimizing a value function which itself is learned.
link |
And if you look at humans, humans are like that
link |
because the reward function, the value function of humans
link |
is not external, it is internal.
link |
And there are definite ideas
link |
of how to train a value function.
link |
Basically, an objective, as objective as possible,
link |
perception system
link |
that would be trained separately to recognize,
link |
to internalize human judgments on different situations.
link |
And then that component would then be integrated
link |
as the base value function
link |
for some more capable RL system.
link |
You could imagine a process like this.
link |
I'm not saying this is the process,
link |
I'm saying this is an example
link |
of the kind of thing you could do.
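The kind of process described here, a model trained separately on human judgments and then plugged in as the value function for a more capable system, can be sketched in a few lines. This is only an illustrative toy, not any specific OpenAI system: the linear "situations", the hidden judgment function, and the candidate-action setup are all assumptions made for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: situations are feature vectors, and a hidden
# "human judgment" function scores them. We fit a separate reward
# model to those judgments, then a simple policy picks the action
# that maximizes the *learned* reward rather than the hidden one.
dim = 4
true_w = np.array([1.0, -2.0, 0.5, 0.0])  # stands in for human values

def human_judgment(x):
    # What a human annotator would report for situation x.
    return true_w @ x

# Step 1: train the reward model on labeled situations (least squares).
X = rng.normal(size=(200, dim))
y = np.array([human_judgment(x) for x in X])
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Step 2: the RL component uses the learned model as its value function.
def learned_reward(x):
    return w_hat @ x

candidates = rng.normal(size=(50, dim))  # possible "actions"
best = max(candidates, key=learned_reward)

# If the reward model internalized the judgments, the action chosen
# under the learned reward also scores highly under the true one.
print(human_judgment(best))
```

Here the learned component recovers the hidden judgment function almost exactly because the toy is linear and noiseless; the interesting difficulties in the real version are exactly the gaps between the learned value function and what humans actually judge.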
link |
So on that topic of the objective functions
link |
of human existence,
link |
what do you think is the objective function
link |
that's implicit in human existence?
link |
What's the meaning of life?
link |
I think the question is wrong in some way.
link |
I think that the question implies
link |
that there is an objective answer
link |
which is an external answer,
link |
you know, your meaning of life is X.
link |
I think what's going on is that we exist
link |
and that's amazing.
link |
And we should try to make the most of it
link |
and try to maximize our own value
link |
and enjoyment of the very short time while we do exist.
link |
But an objective function, since action does require one,
link |
is definitely there in some form,
link |
but it's difficult to make it explicit,
link |
and maybe impossible to make it explicit,
link |
I guess is what you're getting at.
link |
And that's an interesting fact of an RL environment.
link |
Well, but I was making a slightly different point,
link |
which is that humans want things,
link |
and their wants create the drives that cause them to act.
link |
You know, our wants are our objective functions,
link |
our individual objective functions.
link |
We can later decide that we want to change,
link |
that what we wanted before is no longer good
link |
and we want something else.
link |
Yeah, but they're so dynamic.
link |
There's gotta be some underlying, sort of Freudian things,
link |
like sexual stuff.
link |
There's people who think it's the fear of death,
link |
and there's also the desire for knowledge,
link |
and, you know, all these kinds of things,
link |
procreation, sort of all the evolutionary arguments.
link |
there might be some kind of fundamental objective function
link |
from which everything else emerges,
link |
but it seems like it's very difficult to make it explicit.
link |
I think that there probably is an evolutionary objective function,
link |
which is to survive and procreate
link |
and make sure you make your children succeed.
link |
That would be my guess,
link |
but it doesn't give an answer to the question
link |
of what's the meaning of life.
link |
I think you can see how humans are part of this big process,
link |
this ancient process.
link |
We exist on a small planet and that's it.
link |
So given that we exist, try to make the most of it
link |
and try to enjoy more and suffer less as much as we can.
link |
Let me ask two silly questions about life.
link |
One, do you have regrets?
link |
Moments that if you went back, you would do differently.
link |
And two, are there moments that you're especially proud of
link |
that made you truly happy?
link |
So I can answer that, I can answer both questions.
link |
Of course, there's a huge number of choices
link |
and decisions that I've made
link |
that with the benefit of hindsight,
link |
I wouldn't have made them.
link |
And I do experience some regret,
link |
but I try to take solace in the knowledge
link |
that at the time I did the best I could.
link |
And in terms of things that I'm proud of,
link |
I'm very fortunate to have done things I'm proud of
link |
and they made me happy for some time,
link |
but I don't think that that is the source of happiness.
link |
So your academic accomplishments, all the papers,
link |
you're one of the most cited people in the world.
link |
All of the breakthroughs I mentioned
link |
in computer vision and language and so on,
link |
what is the source of happiness and pride for you?
link |
I mean, all those things are a source of pride for sure.
link |
I'm very grateful for having done all those things
link |
and it was very fun to do them.
link |
But, you know, happiness,
link |
well, my current view is that happiness comes,
link |
to a very large degree,
link |
from the way we look at things.
link |
You know, you can have a simple meal
link |
and be quite happy as a result,
link |
or you can talk to someone and be happy as a result as well.
link |
Or conversely, you can have a meal and be disappointed
link |
that the meal wasn't a better meal.
link |
So I think a lot of happiness comes from that,
link |
but I'm not sure, I don't want to be too confident.
link |
Being humble in the face of the uncertainty
link |
seems to be also a part of this whole happiness thing.
link |
Well, I don't think there's a better way to end it
link |
than meaning of life and discussions of happiness.
link |
So Ilya, thank you so much.
link |
You've given me a few incredible ideas.
link |
You've given the world many incredible ideas.
link |
I really appreciate it and thanks for talking today.
link |
Yeah, thanks for stopping by, I really enjoyed it.
link |
Thanks for listening to this conversation
link |
with Ilya Sutskever and thank you
link |
to our presenting sponsor, Cash App.
link |
Please consider supporting the podcast
link |
by downloading Cash App and using the code LEXPodcast.
link |
If you enjoy this podcast, subscribe on YouTube,
link |
review it with five stars on Apple Podcast,
link |
support it on Patreon, or simply connect with me on Twitter
link |
at lexfridman.
link |
And now let me leave you with some words
link |
from Alan Turing on machine learning.
link |
Instead of trying to produce a program
link |
to simulate the adult mind,
link |
why not rather try to produce one
link |
which simulates the child?
link |
If this were then subjected
link |
to an appropriate course of education,
link |
one would obtain the adult brain.
link |
Thank you for listening and hope to see you next time.