Ilya Sutskever: Deep Learning | Lex Fridman Podcast #94



link |
00:00:00.000
The following is a conversation with Ilya Sutskever,
link |
00:00:03.160
cofounder and chief scientist of OpenAI,
link |
00:00:06.120
one of the most cited computer scientists in history
link |
00:00:09.360
with over 165,000 citations,
link |
00:00:13.480
and to me, one of the most brilliant and insightful minds
link |
00:00:17.080
ever in the field of deep learning.
link |
00:00:20.000
There are very few people in this world
link |
00:00:21.680
who I would rather talk to and brainstorm with
link |
00:00:24.040
about deep learning, intelligence, and life in general
link |
00:00:27.760
than Ilya, on and off the mic.
link |
00:00:30.680
This was an honor and a pleasure.
link |
00:00:33.720
This conversation was recorded
link |
00:00:35.240
before the outbreak of the pandemic.
link |
00:00:37.200
For everyone feeling the medical, psychological,
link |
00:00:39.480
and financial burden of this crisis,
link |
00:00:41.440
I'm sending love your way.
link |
00:00:43.160
Stay strong, we're in this together, we'll beat this thing.
link |
00:00:47.160
This is the Artificial Intelligence Podcast.
link |
00:00:49.640
If you enjoy it, subscribe on YouTube,
link |
00:00:51.760
review it with five stars on Apple Podcast,
link |
00:00:54.060
support it on Patreon,
link |
00:00:55.120
or simply connect with me on Twitter
link |
00:00:57.000
at lexfriedman, spelled F R I D M A N.
link |
00:01:00.560
As usual, I'll do a few minutes of ads now
link |
00:01:03.000
and never any ads in the middle
link |
00:01:04.320
that can break the flow of the conversation.
link |
00:01:06.600
I hope that works for you
link |
00:01:07.980
and doesn't hurt the listening experience.
link |
00:01:10.960
This show is presented by Cash App,
link |
00:01:13.440
the number one finance app in the App Store.
link |
00:01:15.720
When you get it, use code LEXPODCAST.
link |
00:01:18.840
Cash App lets you send money to friends,
link |
00:01:20.960
buy Bitcoin, invest in the stock market
link |
00:01:23.440
with as little as $1.
link |
00:01:25.440
Since Cash App allows you to buy Bitcoin,
link |
00:01:27.520
let me mention that cryptocurrency
link |
00:01:29.320
in the context of the history of money is fascinating.
link |
00:01:33.080
I recommend The Ascent of Money as a great book on this history.
link |
00:01:36.840
Both the book and audio book are great.
link |
00:01:39.600
Debits and credits on ledgers
link |
00:01:41.040
started around 30,000 years ago.
link |
00:01:43.920
The US dollar, created over 200 years ago,
link |
00:01:47.200
and Bitcoin, the first decentralized cryptocurrency,
link |
00:01:50.040
released just over 10 years ago.
link |
00:01:52.080
So given that history,
link |
00:01:53.520
cryptocurrency is still very much in its early days
link |
00:01:55.960
of development, but it's still aiming to
link |
00:01:58.200
and just might redefine the nature of money.
link |
00:02:01.840
So again, if you get Cash App from the App Store
link |
00:02:04.240
or Google Play and use the code LEXPODCAST,
link |
00:02:08.040
you get $10 and Cash App will also donate $10 to FIRST,
link |
00:02:12.480
an organization that is helping advance robotics
link |
00:02:14.880
and STEM education for young people around the world.
link |
00:02:18.600
And now here's my conversation with Ilya Sutskever.
link |
00:02:22.460
You were one of the three authors, with Alex Krizhevsky,
link |
00:02:26.740
and Geoff Hinton, of the famed AlexNet paper
link |
00:02:30.140
that is arguably the paper that marked
link |
00:02:33.500
the big catalytic moment
link |
00:02:35.140
that launched the deep learning revolution.
link |
00:02:37.860
At that time, take us back to that time,
link |
00:02:39.620
what was your intuition about neural networks,
link |
00:02:42.260
about the representational power of neural networks?
link |
00:02:46.000
And maybe you could mention how did that evolve
link |
00:02:48.860
over the next few years up to today,
link |
00:02:51.780
over the 10 years?
link |
00:02:53.460
Yeah, I can answer that question.
link |
00:02:55.260
At some point in about 2010 or 2011,
link |
00:03:00.060
I connected two facts in my mind.
link |
00:03:02.620
Basically, the realization was this,
link |
00:03:07.580
at some point we realized that we can train very large,
link |
00:03:11.300
I shouldn't say very, tiny by today's standards,
link |
00:03:13.380
but large and deep neural networks
link |
00:03:16.560
end to end with backpropagation.
link |
00:03:18.540
At some point, different people obtained this result.
link |
00:03:22.380
I obtained this result.
link |
00:03:23.800
The first moment in which I realized
link |
00:03:26.420
that deep neural networks are powerful
link |
00:03:28.980
was when James Martens invented
link |
00:03:30.780
the Hessian free optimizer in 2010.
link |
00:03:33.620
And he trained a 10 layer neural network end to end
link |
00:03:37.100
without pre training from scratch.
link |
00:03:41.620
And when that happened, I thought this is it.
link |
00:03:43.940
Because if you can train a big neural network,
link |
00:03:45.620
a big neural network can represent a very complicated function.
link |
00:03:49.500
Because if you have a neural network with 10 layers,
link |
00:03:52.700
it's as though you allow the human brain
link |
00:03:55.260
to run for some number of milliseconds.
link |
00:03:58.340
Neuron firings are slow.
link |
00:04:00.380
And so in maybe 100 milliseconds,
link |
00:04:03.220
your neurons only fire 10 times.
link |
00:04:04.700
So it's also kind of like 10 layers.
link |
00:04:06.780
And in 100 milliseconds,
link |
00:04:08.140
you can perfectly recognize any object.
link |
00:04:10.460
So I thought, so I already had the idea then
link |
00:04:13.100
that we need to train a very big neural network
link |
00:04:16.100
on lots of supervised data.
link |
00:04:18.160
And then it must succeed
link |
00:04:19.420
because we can find the best neural network.
link |
00:04:21.360
And then there's also theory
link |
00:04:22.740
that if you have more data than parameters,
link |
00:04:24.500
you won't overfit.
link |
00:04:25.760
Today, we know that actually this theory is very incomplete
link |
00:04:28.100
and you won't overfit even if you have less data
link |
00:04:29.780
than parameters, but definitely,
link |
00:04:31.320
if you have more data than parameters, you won't overfit.
link |
00:04:33.340
So the fact that neural networks
link |
00:04:34.700
were heavily overparameterized wasn't discouraging to you?
link |
00:04:39.100
So you were thinking about the theory
link |
00:04:41.220
that the number of parameters,
link |
00:04:43.080
the fact that there's a huge number of parameters is okay?
link |
00:04:45.220
Is it gonna be okay?
link |
00:04:46.060
I mean, there was some evidence before that it was okayish,
link |
00:04:48.260
but the theory was most,
link |
00:04:49.460
the theory was that if you had a big data set
link |
00:04:51.500
and a big neural net, it was going to work.
link |
00:04:53.080
The overparameterization just didn't really
link |
00:04:55.500
figure much as a problem.
link |
00:04:57.060
I thought, well, with images,
link |
00:04:57.940
you're just gonna add some data augmentation
link |
00:04:59.280
and it's gonna be okay.
link |
00:05:00.420
So where was any doubt coming from?
link |
00:05:02.460
The main doubt was, can we train a bigger,
link |
00:05:04.420
will we have enough compute to train
link |
00:05:05.580
a big enough neural net?
link |
00:05:06.420
With backpropagation.
link |
00:05:07.580
Backpropagation I thought would work.
link |
00:05:09.440
The thing which wasn't clear
link |
00:05:10.660
was whether there would be enough compute
link |
00:05:12.480
to get a very convincing result.
link |
00:05:14.100
And then at some point, Alex Krizhevsky wrote
link |
00:05:15.780
these insanely fast CUDA kernels
link |
00:05:17.500
for training convolutional neural nets.
link |
00:05:19.180
And that was, bam, let's do this.
link |
00:05:20.880
Let's get ImageNet and it's gonna be the greatest thing.
link |
00:05:23.420
Was your intuition, most of your intuition
link |
00:05:25.940
from empirical results by you and by others?
link |
00:05:29.540
So like just actually demonstrating
link |
00:05:31.140
that a piece of program can train
link |
00:05:33.160
a 10 layer neural network?
link |
00:05:34.660
Or was there some pen and paper
link |
00:05:37.360
or marker and whiteboard thinking intuition?
link |
00:05:41.180
Like, cause you just connected a 10 layer
link |
00:05:43.900
large neural network to the brain.
link |
00:05:45.520
So you just mentioned the brain.
link |
00:05:46.580
So in your intuition about neural networks
link |
00:05:49.180
does the human brain come into play as an intuition builder?
link |
00:05:53.820
Definitely.
link |
00:05:54.980
I mean, you gotta be precise with these analogies
link |
00:05:57.500
between artificial neural networks and the brain.
link |
00:06:00.260
But there is no question that the brain is a huge source
link |
00:06:04.080
of intuition and inspiration for deep learning researchers
link |
00:06:07.420
since all the way from Rosenblatt in the 60s.
link |
00:06:10.800
Like if you look at the whole idea of a neural network
link |
00:06:13.820
is directly inspired by the brain.
link |
00:06:15.700
You had people like McCulloch and Pitts who were saying,
link |
00:06:18.060
hey, you got these neurons in the brain.
link |
00:06:22.020
And hey, we recently learned about the computer
link |
00:06:23.820
and automata.
link |
00:06:24.660
Can we use some ideas from the computer and automata
link |
00:06:26.420
to design some kind of computational object
link |
00:06:28.740
that's going to be simple, computational
link |
00:06:31.660
and kind of like the brain and they invented the neuron.
link |
00:06:34.380
So they were inspired by it back then.
link |
00:06:35.980
Then you had the convolutional neural network from Fukushima
link |
00:06:38.580
and then later Yann LeCun who said, hey,
link |
00:06:40.420
if you limit the receptive fields of a neural network,
link |
00:06:42.680
it's going to be especially suitable for images
link |
00:06:45.460
as it turned out to be true.
link |
00:06:46.980
So there was a very small number of examples
link |
00:06:49.940
where analogies to the brain were successful.
link |
00:06:52.340
And I thought, well, probably an artificial neuron
link |
00:06:55.100
is not that different from the brain
link |
00:06:56.740
if you squint hard enough.
link |
00:06:57.660
So let's just assume it is and roll with it.
link |
00:07:00.940
So now we're at a time where deep learning
link |
00:07:02.780
is very successful.
link |
00:07:03.800
So let us squint less and say, let's open our eyes
link |
00:07:08.900
and say, what do you see as an interesting difference
link |
00:07:12.060
between the human brain?
link |
00:07:13.820
Now, I know you're probably not an expert
link |
00:07:16.380
neither a neuroscientist nor a biologist,
link |
00:07:18.220
but loosely speaking, what's the difference
link |
00:07:20.420
between the human brain and artificial neural networks?
link |
00:07:22.420
That's interesting to you for the next decade or two.
link |
00:07:26.300
That's a good question to ask.
link |
00:07:27.860
What is an interesting difference between the neurons
link |
00:07:29.700
of the brain and our artificial neural networks?
link |
00:07:32.900
So I feel like today, artificial neural networks,
link |
00:07:37.140
so we all agree that there are certain dimensions
link |
00:07:39.380
in which the human brain vastly outperforms our models.
link |
00:07:43.000
But I also think that there are some ways
link |
00:07:44.400
in which our artificial neural networks
link |
00:07:46.180
have a number of very important advantages over the brain.
link |
00:07:50.380
Looking at the advantages versus disadvantages
link |
00:07:52.540
is a good way to figure out what is the important difference.
link |
00:07:55.600
So the brain uses spikes, which may or may not be important.
link |
00:08:00.100
Yeah, it's a really interesting question.
link |
00:08:01.380
Do you think it's important or not?
link |
00:08:03.860
That's one big architectural difference
link |
00:08:06.380
between artificial neural networks and the brain.
link |
00:08:08.380
It's hard to tell, but my prior is not very high
link |
00:08:11.700
and I can say why.
link |
00:08:13.500
There are people who are interested
link |
00:08:14.340
in spiking neural networks.
link |
00:08:15.380
And basically what they figured out
link |
00:08:17.460
is that they need to simulate
link |
00:08:19.260
the non spiking neural networks in spikes.
link |
00:08:22.740
And that's how they're gonna make them work.
link |
00:08:24.300
If you don't simulate the non spiking neural networks
link |
00:08:26.340
in spikes, it's not going to work
link |
00:08:27.780
because the question is why should it work?
link |
00:08:29.580
And that connects to questions around back propagation
link |
00:08:31.820
and questions around deep learning.
link |
00:08:34.860
You've got this giant neural network.
link |
00:08:36.900
Why should it work at all?
link |
00:08:38.420
Why should the learning rule work at all?
link |
00:08:43.220
It's not a self evident question,
link |
00:08:44.660
especially if you, let's say if you were just starting
link |
00:08:47.060
in the field and you read the very early papers,
link |
00:08:49.340
you can say, hey, people are saying,
link |
00:08:51.580
let's build neural networks.
link |
00:08:53.740
That's a great idea because the brain is a neural network.
link |
00:08:55.900
So it would be useful to build neural networks.
link |
00:08:58.020
Now let's figure out how to train them.
link |
00:09:00.420
It should be possible to train them probably, but how?
link |
00:09:03.420
And so the big idea is the cost function.
link |
00:09:07.260
That's the big idea.
link |
00:09:08.780
The cost function is a way of measuring the performance
link |
00:09:11.900
of the system according to some measure.
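To make that concrete, here is a minimal sketch, in Python, of a single scalar cost function being driven down by gradient descent; the data, the squared-error cost, and the learning rate are illustrative assumptions, not anything specific from the conversation.

```python
import numpy as np

# Minimal sketch: one parameter w, one scalar cost, plain gradient descent.
# Data and constants are illustrative.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x                              # target function: y = 2x

w = 0.0                                  # the model's single parameter
lr = 0.01                                # learning rate

for step in range(200):
    pred = w * x
    cost = np.mean((pred - y) ** 2)      # one number measuring performance
    grad = np.mean(2.0 * (pred - y) * x) # d(cost)/dw
    w -= lr * grad                       # move downhill on the cost

print(w)  # ends up close to 2.0
```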
link |
00:09:14.940
By the way, that is a big, actually let me think,
link |
00:09:17.180
is that one, a difficult idea to arrive at
link |
00:09:21.180
and how big of an idea is that?
link |
00:09:22.740
That there's a single cost function.
link |
00:09:27.620
Sorry, let me take a pause.
link |
00:09:28.940
Is supervised learning a difficult concept to come to?
link |
00:09:33.340
I don't know.
link |
00:09:34.660
All concepts are very easy in retrospect.
link |
00:09:36.460
Yeah, that's what it seems trivial now,
link |
00:09:38.100
but I, because the reason I asked that,
link |
00:09:40.540
and we'll talk about it, is there other things?
link |
00:09:43.460
Is there things that don't necessarily have a cost function,
link |
00:09:47.180
maybe have many cost functions
link |
00:09:48.620
or maybe have dynamic cost functions
link |
00:09:50.900
or maybe a totally different kind of architectures?
link |
00:09:54.180
Because we have to think like that
link |
00:09:55.500
in order to arrive at something new, right?
link |
00:09:57.980
So the only, so the good examples of things
link |
00:09:59.940
which don't have clear cost functions are GANs.
link |
00:10:03.940
Right. In a GAN, you have a game.
link |
00:10:05.740
So instead of thinking of a cost function,
link |
00:10:08.240
where you wanna optimize,
link |
00:10:09.260
where you know that you have an algorithm gradient descent,
link |
00:10:12.100
which will optimize the cost function,
link |
00:10:13.940
and then you can reason about the behavior of your system
link |
00:10:16.340
in terms of what it optimizes.
link |
00:10:18.140
With a GAN, you say, I have a game
link |
00:10:20.060
and I'll reason about the behavior of the system
link |
00:10:22.220
in terms of the equilibrium of the game.
link |
00:10:24.540
But it's all about coming up with these mathematical objects
link |
00:10:26.540
that help us reason about the behavior of our system.
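As a schematic sketch of that difference: in a GAN there is no single cost being minimized; each player has its own objective, you alternate updates, and you reason about the equilibrium. The toy data, network sizes, and learning rates below are illustrative assumptions.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))  # generator
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))  # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(2000):
    real = torch.randn(64, 1) * 0.5 + 3.0      # "real" data: samples near 3.0
    fake = G(torch.randn(64, 8))               # generator's samples

    # Discriminator's objective: score real high, fake low.
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator's opposing objective: make the discriminator score fake high.
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

print(G(torch.randn(1000, 8)).mean().item())   # drifts toward 3.0 at equilibrium
```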
link |
00:10:30.140
Right, that's really interesting.
link |
00:10:31.180
Yeah, so GAN is the only one, it's kind of a,
link |
00:10:33.420
the cost function is emergent from the comparison.
link |
00:10:36.900
It's, I don't know if it has a cost function.
link |
00:10:38.980
I don't know if it's meaningful
link |
00:10:39.820
to talk about the cost function of a GAN.
link |
00:10:41.340
It's kind of like the cost function of biological evolution
link |
00:10:44.020
or the cost function of the economy.
link |
00:10:45.700
It's, you can talk about regions
link |
00:10:49.460
to which it will go towards, but I don't think,
link |
00:10:55.260
I don't think the cost function analogy is the most useful.
link |
00:10:57.460
So if evolution doesn't, that's really interesting.
link |
00:11:00.100
So if evolution doesn't really have a cost function,
link |
00:11:02.660
like a cost function based on its,
link |
00:11:06.540
something akin to our mathematical conception
link |
00:11:09.860
of a cost function, then do you think cost functions
link |
00:11:12.740
in deep learning are holding us back?
link |
00:11:15.140
Yeah, so you just kind of mentioned that cost function
link |
00:11:18.300
is a nice first profound idea.
link |
00:11:21.380
Do you think that's a good idea?
link |
00:11:23.340
Do you think it's an idea we'll go past?
link |
00:11:26.740
So self play starts to touch on that a little bit
link |
00:11:29.540
in reinforcement learning systems.
link |
00:11:31.700
That's right.
link |
00:11:32.540
Self play and also ideas around exploration
link |
00:11:34.700
where you're trying to take actions
link |
00:11:36.580
that surprise a predictor.
link |
00:11:39.060
I'm a big fan of cost functions.
link |
00:11:40.500
I think cost functions are great
link |
00:11:41.660
and they serve us really well.
link |
00:11:42.740
And I think that whenever we can do things
link |
00:11:44.220
with cost functions, we should.
link |
00:11:45.940
And you know, maybe there is a chance
link |
00:11:47.740
that we will come up with some,
link |
00:11:49.020
yet another profound way of looking at things
link |
00:11:51.340
that will involve cost functions in a less central way.
link |
00:11:54.220
But I don't know, I think cost functions are,
link |
00:11:55.780
I mean, I would not bet against cost functions.
link |
00:12:01.780
Is there other things about the brain
link |
00:12:04.140
that pop into your mind that might be different
link |
00:12:06.940
and interesting for us to consider
link |
00:12:09.740
in designing artificial neural networks?
link |
00:12:12.260
So we talked about spiking a little bit.
link |
00:12:14.300
I mean, one thing which may potentially be useful,
link |
00:12:16.620
I think people, neuroscientists have figured out
link |
00:12:18.660
something about the learning rule of the brain
link |
00:12:20.180
I'm talking about spike-timing-dependent plasticity,
link |
00:12:22.780
and it would be nice if some people
link |
00:12:24.340
would just study that in simulation.
link |
00:12:26.340
Wait, sorry, spike-timing-dependent plasticity?
link |
00:12:28.820
Yeah, that's right.
link |
00:12:29.660
What's that?
link |
00:12:30.500
STDP.
link |
00:12:31.340
It's a particular learning rule that uses spike timing
link |
00:12:33.700
to determine how to update the synapses.
link |
00:12:37.660
So it's kind of like if a synapse fires into the neuron
link |
00:12:40.620
before the neuron fires,
link |
00:12:42.420
then it strengthens the synapse,
link |
00:12:44.380
and if the synapse fires into the neuron
link |
00:12:46.220
shortly after the neuron fired,
link |
00:12:47.860
then it weakens the synapse.
link |
00:12:49.020
Something along this line.
link |
00:12:50.500
I'm 90% sure it's right, so if I said something wrong here,
link |
00:12:54.460
don't get too angry.
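Since the rule is easy to state, here is a minimal sketch of the pairwise version as just described; the exponential time window and the constants are standard textbook choices, assumed for illustration rather than taken from the conversation.

```python
import numpy as np

def stdp_update(w, t_pre, t_post, a_plus=0.01, a_minus=0.012, tau=20.0):
    """Pairwise STDP: pre-before-post strengthens, post-before-pre weakens."""
    dt = t_post - t_pre            # spike-time difference in milliseconds
    if dt > 0:                     # synapse fired into the neuron before it fired
        return w + a_plus * np.exp(-dt / tau)
    return w - a_minus * np.exp(dt / tau)  # fired shortly after: weaken

w = 0.5
w = stdp_update(w, t_pre=10.0, t_post=15.0)  # pre leads post: w goes up
w = stdp_update(w, t_pre=30.0, t_post=22.0)  # pre lags post: w goes down
print(w)
```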
link |
00:12:57.780
But you sounded brilliant while saying it.
link |
00:12:59.340
But the timing, that's one thing that's missing.
link |
00:13:02.500
The temporal dynamics is not captured.
link |
00:13:05.820
I think that's like a fundamental property of the brain
link |
00:13:08.340
is the timing
link |
00:13:12.340
of the signals.
link |
00:13:13.380
Well, you have recurrent neural networks.
link |
00:13:15.500
But you think of that as this,
link |
00:13:18.100
I mean, that's a very crude, simplified,
link |
00:13:21.380
what's that called?
link |
00:13:23.500
There's a clock, I guess, to recurrent neural networks.
link |
00:13:27.660
It's, this seems like the brain is the general,
link |
00:13:30.140
the continuous version of that,
link |
00:13:31.980
the generalization where all possible timings are possible,
link |
00:13:36.100
and then within those timings is contained some information.
link |
00:13:39.940
You think recurrent neural networks,
link |
00:13:42.060
the recurrence in recurrent neural networks
link |
00:13:45.460
can capture the same kind of phenomena as the timing
link |
00:13:51.300
that seems to be important for the brain,
link |
00:13:54.260
in the firing of neurons in the brain?
link |
00:13:56.340
I mean, I think recurrent neural networks are amazing,
link |
00:14:00.740
and they can do, I think they can do anything
link |
00:14:03.900
we'd want them to, we'd want a system to do.
link |
00:14:07.700
Right now, recurrent neural networks
link |
00:14:09.060
have been superseded by transformers,
link |
00:14:10.500
but maybe one day they'll make a comeback,
link |
00:14:12.740
maybe they'll be back, we'll see.
link |
00:14:15.460
Let me, on a small tangent, say,
link |
00:14:17.700
do you think they'll be back?
link |
00:14:19.100
So, so much of the breakthroughs recently
link |
00:14:21.340
that we'll talk about on natural language processing
link |
00:14:24.420
and language modeling has been with transformers
link |
00:14:28.060
that don't emphasize recurrence.
link |
00:14:30.860
Do you think recurrence will make a comeback?
link |
00:14:33.300
Well, some kind of recurrence, I think very likely.
link |
00:14:37.020
Recurrent neural networks, as they're typically thought of
link |
00:14:41.500
for processing sequences, I think it's also possible.
link |
00:14:44.980
What is, to you, a recurrent neural network?
link |
00:14:47.940
Generally speaking, I guess,
link |
00:14:49.300
what is a recurrent neural network?
link |
00:14:50.940
You have a neural network which maintains
link |
00:14:52.380
a high dimensional hidden state,
link |
00:14:54.940
and then when an observation arrives,
link |
00:14:56.820
it updates its high dimensional hidden state
link |
00:14:59.300
through its connections in some way.
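That definition fits in a few lines; here is a minimal sketch with illustrative sizes and a tanh nonlinearity, all assumptions for the sake of the example.

```python
import numpy as np

hidden_size, input_size = 64, 10
W_h = 0.1 * np.random.randn(hidden_size, hidden_size)  # recurrent connections
W_x = 0.1 * np.random.randn(hidden_size, input_size)   # input connections
b = np.zeros(hidden_size)

h = np.zeros(hidden_size)                 # the high-dimensional hidden state
for t in range(100):                      # observations arriving one by one
    x_t = np.random.randn(input_size)
    h = np.tanh(W_h @ h + W_x @ x_t + b)  # update the state through the connections
print(h.shape)                            # (64,)
```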
link |
00:15:03.460
So do you think, that's what expert systems did, right?
link |
00:15:08.140
Symbolic AI, the knowledge based,
link |
00:15:12.380
growing a knowledge base is maintaining a hidden state,
link |
00:15:17.220
which is its knowledge base,
link |
00:15:18.460
and is growing it by sequential processing.
link |
00:15:20.300
Do you think of it more generally in that way,
link |
00:15:22.700
or is it simply, is it the more constrained form
link |
00:15:28.300
of a hidden state with certain kind of gating units
link |
00:15:31.340
that we think of as today with LSTMs and that?
link |
00:15:34.500
I mean, the hidden state is technically
link |
00:15:36.220
what you described there, the hidden state
link |
00:15:37.820
that goes inside the LSTM or the RNN or something like this.
link |
00:15:41.340
But then what should be contained,
link |
00:15:43.220
if you want to make the expert system analogy,
link |
00:15:46.300
I'm not, I mean, you could say that
link |
00:15:49.140
the knowledge is stored in the connections,
link |
00:15:51.060
and then the short term processing
link |
00:15:53.220
is done in the hidden state.
link |
00:15:56.300
Yes, could you say that?
link |
00:15:58.460
So sort of, do you think there's a future of building
link |
00:16:01.660
large scale knowledge bases within the neural networks?
link |
00:16:05.620
Definitely.
link |
00:16:09.020
So we're gonna pause on that confidence,
link |
00:16:11.180
because I want to explore that.
link |
00:16:12.740
Well, let me zoom back out and ask,
link |
00:16:16.900
back to the history of ImageNet.
link |
00:16:19.340
Neural networks have been around for many decades,
link |
00:16:21.380
as you mentioned.
link |
00:16:22.740
What do you think were the key ideas
link |
00:16:24.260
that led to their success,
link |
00:16:25.860
that ImageNet moment and beyond,
link |
00:16:28.700
the success in the past 10 years?
link |
00:16:32.540
Okay, so the question is,
link |
00:16:33.500
to make sure I didn't miss anything,
link |
00:16:35.500
the key ideas that led to the success
link |
00:16:37.460
of deep learning over the past 10 years.
link |
00:16:39.340
Exactly, even though the fundamental thing
link |
00:16:42.860
behind deep learning has been around for much longer.
link |
00:16:45.340
So the key idea about deep learning,
link |
00:16:51.300
or rather the key fact about deep learning
link |
00:16:53.900
before deep learning started to be successful,
link |
00:16:58.220
is that it was underestimated.
link |
00:17:01.260
People who worked in machine learning
link |
00:17:02.860
simply didn't think that neural networks could do much.
link |
00:17:06.220
People didn't believe that large neural networks
link |
00:17:08.740
could be trained.
link |
00:17:10.500
People thought that, well, there was lots of,
link |
00:17:13.340
there was a lot of debate going on in machine learning
link |
00:17:15.620
about what are the right methods and so on.
link |
00:17:17.260
And people were arguing because there were no,
link |
00:17:21.300
there was no way to get hard facts.
link |
00:17:23.340
And by that, I mean, there were no benchmarks
link |
00:17:25.420
which were truly hard that if you do really well on them,
link |
00:17:28.420
then you can say, look, here's my system.
link |
00:17:32.500
That's when you switch from,
link |
00:17:35.220
that's when this field becomes a little bit more
link |
00:17:37.620
of an engineering field.
link |
00:17:38.580
So in terms of deep learning,
link |
00:17:39.620
to answer the question directly,
link |
00:17:42.300
the ideas were all there.
link |
00:17:43.500
The thing that was missing was a lot of supervised data
link |
00:17:46.780
and a lot of compute.
link |
00:17:49.700
Once you have a lot of supervised data and a lot of compute,
link |
00:17:52.580
then there is a third thing which is needed as well.
link |
00:17:54.700
And that is conviction.
link |
00:17:56.340
Conviction that if you take the right stuff,
link |
00:17:59.140
which already exists, and apply and mix it
link |
00:18:01.700
with a lot of data and a lot of compute,
link |
00:18:03.540
that it will in fact work.
link |
00:18:05.940
And so that was the missing piece.
link |
00:18:07.740
It was, you had the, you needed the data,
link |
00:18:10.660
you needed the compute, which showed up in terms of GPUs,
link |
00:18:14.140
and you needed the conviction to realize
link |
00:18:15.780
that you need to mix them together.
link |
00:18:18.420
So that's really interesting.
link |
00:18:19.420
So I guess the presence of compute
link |
00:18:23.100
and the presence of supervised data
link |
00:18:26.100
allowed the empirical evidence to do the convincing
link |
00:18:29.660
of the majority of the computer science community.
link |
00:18:32.020
So I guess there's a key moment with Jitendra Malik
link |
00:18:36.860
and Alexei "Alyosha" Efros, who were very skeptical, right?
link |
00:18:42.580
And then there's Geoffrey Hinton,
link |
00:18:43.980
who was the opposite of skeptical.
link |
00:18:46.660
And there was a convincing moment.
link |
00:18:48.220
And I think ImageNet had served as that moment.
link |
00:18:50.220
That's right.
link |
00:18:51.060
And they represented this kind of,
link |
00:18:52.940
were the big pillars of computer vision community,
link |
00:18:55.860
kind of the wizards got together,
link |
00:18:59.700
and then all of a sudden there was a shift.
link |
00:19:01.460
And it's not enough for the ideas to all be there
link |
00:19:05.260
and the compute to be there,
link |
00:19:06.300
you also have to overcome the cynicism that existed.
link |
00:19:11.380
It's interesting that people just didn't believe
link |
00:19:14.020
for a couple of decades.
link |
00:19:15.900
Yeah, well, but it's more than that.
link |
00:19:18.540
It's kind of, when put this way,
link |
00:19:20.820
it sounds like, well, those silly people
link |
00:19:23.140
who didn't believe, what were they missing?
link |
00:19:25.540
But in reality, things were confusing
link |
00:19:27.500
because neural networks really did not work on anything.
link |
00:19:30.220
And they were not the best method
link |
00:19:31.420
on pretty much anything as well.
link |
00:19:33.540
And it was pretty rational to say,
link |
00:19:35.780
yeah, this stuff doesn't have any traction.
link |
00:19:39.580
And that's why you need to have these very hard tasks
link |
00:19:42.260
which produce undeniable evidence.
link |
00:19:44.860
And that's how we make progress.
link |
00:19:46.900
And that's why the field is making progress today
link |
00:19:48.580
because we have these hard benchmarks
link |
00:19:50.660
which represent true progress.
link |
00:19:52.740
And so, and this is why we are able to avoid endless debate.
link |
00:19:58.300
So incredibly you've contributed
link |
00:20:00.500
some of the biggest recent ideas in AI
link |
00:20:03.020
in computer vision, language, natural language processing,
link |
00:20:07.020
reinforcement learning, sort of everything in between,
link |
00:20:11.300
maybe not GANs.
link |
00:20:12.500
But there may not be a topic you haven't touched.
link |
00:20:16.180
And of course, the fundamental science of deep learning.
link |
00:20:19.580
What is the difference to you between vision, language,
link |
00:20:24.140
and as in reinforcement learning, action,
link |
00:20:26.900
as learning problems?
link |
00:20:28.260
And what are the commonalities?
link |
00:20:29.540
Do you see them as all interconnected?
link |
00:20:31.500
Are they fundamentally different domains
link |
00:20:33.780
that require different approaches?
link |
00:20:38.180
Okay, that's a good question.
link |
00:20:39.620
Machine learning is a field with a lot of unity,
link |
00:20:41.860
a huge amount of unity.
link |
00:20:44.060
In fact. What do you mean by unity?
link |
00:20:45.300
Like overlap of ideas?
link |
00:20:48.340
Overlap of ideas, overlap of principles.
link |
00:20:50.140
In fact, there's only one or two or three principles
link |
00:20:52.660
which are very, very simple.
link |
00:20:54.340
And then they apply in almost the same way,
link |
00:20:57.340
in almost the same way to the different modalities,
link |
00:20:59.940
to the different problems.
link |
00:21:01.340
And that's why today, when someone writes a paper
link |
00:21:04.100
on improving optimization of deep learning and vision,
link |
00:21:07.140
it improves the different NLP applications
link |
00:21:09.300
and it improves the different
link |
00:21:10.140
reinforcement learning applications.
link |
00:21:12.340
Reinforcement learning.
link |
00:21:13.260
So I would say that computer vision
link |
00:21:15.820
and NLP are very similar to each other.
link |
00:21:18.620
Today they differ in that they have
link |
00:21:20.980
slightly different architectures.
link |
00:21:22.180
We use transformers in NLP
link |
00:21:23.900
and we use convolutional neural networks in vision.
link |
00:21:26.500
But it's also possible that one day this will change
link |
00:21:28.900
and everything will be unified with a single architecture.
link |
00:21:31.820
Because if you go back a few years ago
link |
00:21:33.660
in natural language processing,
link |
00:21:36.580
there were a huge number of architectures
link |
00:21:39.340
and every different tiny problem had its own architecture.
link |
00:21:43.380
Today, there's just one transformer
link |
00:21:45.900
for all those different tasks.
link |
00:21:47.460
And if you go back in time even more,
link |
00:21:49.700
you had even more and more fragmentation
link |
00:21:51.380
and every little problem in AI
link |
00:21:53.820
had its own little subspecialization
link |
00:21:55.940
and, you know, its own little collection of skills,
link |
00:21:58.660
people who would know how to engineer the features.
link |
00:22:00.980
Now it's all been subsumed by deep learning.
link |
00:22:02.900
We have this unification.
link |
00:22:04.180
And so I expect vision to become unified
link |
00:22:06.860
with natural language as well.
link |
00:22:08.540
Or rather, I shouldn't say expect, I think it's possible.
link |
00:22:10.500
I don't wanna be too sure because
link |
00:22:12.500
I think the convolutional neural net
link |
00:22:13.780
is very computationally efficient.
link |
00:22:15.540
RL is different.
link |
00:22:16.860
RL does require slightly different techniques
link |
00:22:18.860
because you really do need to take action.
link |
00:22:20.820
You really need to do something about exploration.
link |
00:22:23.860
Your variance is much higher.
link |
00:22:26.020
But I think there is a lot of unity even there.
link |
00:22:28.220
And I would expect, for example, that at some point
link |
00:22:29.980
there will be some broader unification
link |
00:22:33.500
between RL and supervised learning
link |
00:22:35.260
where somehow the RL will be making decisions
link |
00:22:37.180
to make the supervised learning go better.
link |
00:22:38.580
And it will be, I imagine, one big black box
link |
00:22:41.780
and you just throw, you know, you shovel things into it
link |
00:22:44.980
and it just figures out what to do
link |
00:22:46.260
with whatever you shovel at it.
link |
00:22:48.060
I mean, reinforcement learning has some aspects
link |
00:22:50.740
of language and vision combined almost.
link |
00:22:55.180
There's elements of a long term memory
link |
00:22:57.780
that you should be utilizing
link |
00:22:58.900
and there's elements of a really rich sensory space.
link |
00:23:03.100
So it seems like the union of the two or something like that.
link |
00:23:08.420
I'd say something slightly differently.
link |
00:23:10.020
I'd say that reinforcement learning is neither,
link |
00:23:12.740
but it naturally interfaces
link |
00:23:14.900
and integrates with the two of them.
link |
00:23:17.380
Do you think action is fundamentally different?
link |
00:23:19.300
So yeah, what is interesting about,
link |
00:23:21.340
what is unique about policy of learning to act?
link |
00:23:26.060
Well, so one example, for instance,
link |
00:23:27.540
is that when you learn to act,
link |
00:23:29.860
you are fundamentally in a non stationary world
link |
00:23:33.300
because as your actions change,
link |
00:23:35.860
the things you see start changing.
link |
00:23:38.140
You experience the world in a different way.
link |
00:23:41.380
And this is not the case for
link |
00:23:43.300
the more traditional static problem
link |
00:23:44.980
where you have some distribution
link |
00:23:46.380
and you just apply a model to that distribution.
link |
00:23:49.540
You think it's a fundamentally different problem
link |
00:23:51.260
or is it just a more difficult generalization
link |
00:23:55.060
of the problem of understanding?
link |
00:23:57.020
I mean, it's a question of definitions almost.
link |
00:23:59.860
There is a huge amount of commonality for sure.
link |
00:24:02.020
You take gradients, you try, you take gradients.
link |
00:24:04.180
We try to approximate gradients in both cases.
link |
00:24:06.180
In the case of reinforcement learning,
link |
00:24:08.020
you have some tools to reduce the variance of the gradients.
link |
00:24:11.180
You do that.
link |
00:24:13.020
There's lots of commonality.
link |
00:24:13.980
Use the same neural net in both cases.
link |
00:24:16.340
You compute the gradient, you apply Adam in both cases.
link |
00:24:20.820
So, I mean, there's lots in common for sure,
link |
00:24:24.300
but there are some small differences
link |
00:24:26.900
which are not completely insignificant.
link |
00:24:28.940
It's really just a matter of your point of view,
link |
00:24:30.980
what frame of reference,
link |
00:24:32.700
how much do you wanna zoom in or out
link |
00:24:35.020
as you look at these problems?
link |
00:24:37.260
Which problem do you think is harder?
link |
00:24:39.820
So people like Noam Chomsky believe
link |
00:24:41.660
that language is fundamental to everything.
link |
00:24:43.980
So it underlies everything.
link |
00:24:45.700
Do you think language understanding is harder
link |
00:24:48.660
than visual scene understanding or vice versa?
link |
00:24:52.580
I think that asking if a problem is hard is slightly wrong.
link |
00:24:56.260
I think the question is a little bit wrong
link |
00:24:57.500
and I wanna explain why.
link |
00:24:59.460
So what does it mean for a problem to be hard?
link |
00:25:04.340
Okay, the uninteresting, dumb answer to that
link |
00:25:07.220
is there's a benchmark
link |
00:25:10.700
and there's a human level performance on that benchmark
link |
00:25:13.660
and how much effort is required
link |
00:25:16.660
to reach human level on that benchmark.
link |
00:25:19.060
So from the perspective of how much
link |
00:25:20.620
until we get to human level on a very good benchmark.
link |
00:25:25.280
Yeah, I understand what you mean by that.
link |
00:25:28.840
So what I was going to say that a lot of it depends on,
link |
00:25:32.200
once you solve a problem, it stops being hard
link |
00:25:34.000
and that's always true.
link |
00:25:35.960
And so whether something is hard or not depends
link |
00:25:38.160
on what our tools can do today.
link |
00:25:39.720
So you say today through human level,
link |
00:25:43.680
language understanding and visual perception are hard
link |
00:25:46.280
in the sense that there is no way
link |
00:25:48.920
of solving the problem completely in the next three months.
link |
00:25:52.000
So I agree with that statement.
link |
00:25:53.920
Beyond that, my guess would be as good as yours,
link |
00:25:56.600
I don't know.
link |
00:25:57.440
Oh, okay, so you don't have a fundamental intuition
link |
00:26:00.360
about how hard language understanding is.
link |
00:26:02.800
I think, you know, I changed my mind.
link |
00:26:04.280
I'd say language is probably going to be harder.
link |
00:26:06.800
I mean, it depends on how you define it.
link |
00:26:09.160
Like if you mean absolute top notch,
link |
00:26:11.240
100% language understanding, I'll go with language.
link |
00:26:16.160
But then if I show you a piece of paper with letters on it,
link |
00:26:18.880
is that, you see what I mean?
link |
00:26:21.720
You have a vision system,
link |
00:26:22.600
you say it's the best human level vision system.
link |
00:26:25.080
I show you, I open a book and I show you letters.
link |
00:26:28.760
Will it understand how these letters form into words
link |
00:26:30.880
and sentences and meaning?
link |
00:26:32.240
Is this part of the vision problem?
link |
00:26:33.720
Where does vision end and language begin?
link |
00:26:36.080
Yeah, so Chomsky would say it starts at language.
link |
00:26:38.240
So vision is just a little example of the kind
link |
00:26:40.440
of a structure and fundamental hierarchy of ideas
link |
00:26:46.520
that's already represented in our brains somehow
link |
00:26:49.080
that's represented through language.
link |
00:26:51.400
But where does vision stop and language begin?
link |
00:26:57.960
That's a really interesting question.
link |
00:27:07.760
So one possibility is that it's impossible
link |
00:27:09.880
to achieve really deep understanding in either images
link |
00:27:14.720
or language without basically using the same kind of system.
link |
00:27:18.400
So you're going to get the other for free.
link |
00:27:21.440
I think it's pretty likely that yes,
link |
00:27:23.080
if we can get one, our machine learning is probably
link |
00:27:25.840
that good that we can get the other.
link |
00:27:27.320
But I'm not 100% sure.
link |
00:27:30.160
And also, I think a lot of it really does depend
link |
00:27:34.520
on your definitions.
link |
00:27:36.680
Definitions of?
link |
00:27:37.800
Of like perfect vision.
link |
00:27:40.040
Because reading is vision, but should it count?
link |
00:27:44.640
Yeah, to me, so my definition is if a system looked
link |
00:27:47.440
at an image and then a system looked at a piece of text
link |
00:27:52.240
and then told me something about that
link |
00:27:56.040
and I was really impressed.
link |
00:27:58.400
That's relative.
link |
00:27:59.480
You'll be impressed for half an hour
link |
00:28:01.280
and then you're gonna say, well, I mean,
link |
00:28:02.520
all the systems do that, but here's the thing they don't do.
link |
00:28:05.200
Yeah, but I don't have that with humans.
link |
00:28:07.120
Humans continue to impress me.
link |
00:28:08.920
Is that true?
link |
00:28:10.600
Well, the ones, okay, so I'm a fan of monogamy.
link |
00:28:14.000
So I like the idea of marrying somebody,
link |
00:28:16.000
being with them for several decades.
link |
00:28:18.080
So I believe in the fact that yes, it's possible
link |
00:28:20.600
to have somebody continuously giving you
link |
00:28:24.480
pleasurable, interesting, witty new ideas, friends.
link |
00:28:28.560
Yeah, I think so.
link |
00:28:29.960
They continue to surprise you.
link |
00:28:32.080
The surprise, it's that injection of randomness.
link |
00:28:37.080
It seems to be a nice source of, yeah, continued inspiration,
link |
00:28:47.080
like the wit, the humor.
link |
00:28:48.680
I think, yeah, that would be,
link |
00:28:53.560
it's a very subjective test,
link |
00:28:54.840
but I think if you have enough humans in the room.
link |
00:28:58.480
Yeah, I understand what you mean.
link |
00:29:00.440
Yeah, I feel like I misunderstood
link |
00:29:02.000
what you meant by impressing you.
link |
00:29:02.960
I thought you meant to impress you with its intelligence,
link |
00:29:06.440
with how well it understands an image.
link |
00:29:10.120
I thought you meant something like,
link |
00:29:11.640
I'm gonna show it a really complicated image
link |
00:29:13.200
and it's gonna get it right.
link |
00:29:14.040
And you're gonna say, wow, that's really cool.
link |
00:29:15.720
Our systems of January 2020 have not been doing that.
link |
00:29:19.880
Yeah, no, I think it all boils down to like
link |
00:29:23.440
the reason people click like on stuff on the internet,
link |
00:29:26.040
which is like, it makes them laugh.
link |
00:29:28.280
So it's like humor or wit or insight.
link |
00:29:32.640
I'm sure we'll get that as well.
link |
00:29:35.360
So forgive the romanticized question,
link |
00:29:38.120
but looking back to you,
link |
00:29:40.400
what is the most beautiful or surprising idea
link |
00:29:43.080
in deep learning or AI in general you've come across?
link |
00:29:46.760
So I think the most beautiful thing about deep learning
link |
00:29:49.160
is that it actually works.
link |
00:29:51.640
And I mean it, because you got these ideas,
link |
00:29:53.120
you got the little neural network,
link |
00:29:54.640
you got the back propagation algorithm.
link |
00:29:58.920
And then you've got some theories as to,
link |
00:30:00.640
this is kind of like the brain.
link |
00:30:02.040
So maybe if you make it large,
link |
00:30:03.560
if you make the neural network large
link |
00:30:04.840
and you train it on a lot of data,
link |
00:30:05.920
then it will do the same function that the brain does.
link |
00:30:09.640
And it turns out to be true, that's crazy.
link |
00:30:12.480
And now we just train these neural networks
link |
00:30:14.120
and you make them larger and they keep getting better.
link |
00:30:16.640
And I find it unbelievable.
link |
00:30:17.880
I find it unbelievable that this whole AI stuff
link |
00:30:20.600
with neural networks works.
link |
00:30:22.480
Have you built up an intuition of why?
link |
00:30:24.960
Are there a lot of bits and pieces of intuitions,
link |
00:30:27.920
of insights of why this whole thing works?
link |
00:30:31.320
I mean, some, definitely.
link |
00:30:33.240
While we know that optimization, we now have good,
link |
00:30:37.400
we've had lots of empirical,
link |
00:30:40.800
huge amounts of empirical reasons
link |
00:30:42.320
to believe that optimization should work
link |
00:30:44.280
on most problems we care about.
link |
00:30:47.520
Do you have insights of why?
link |
00:30:48.680
So you just said empirical evidence.
link |
00:30:50.720
Is most of your sort of empirical evidence
link |
00:30:56.760
kind of convinces you?
link |
00:30:58.360
It's like evolution is empirical.
link |
00:31:00.360
It shows you that, look,
link |
00:31:01.400
this evolutionary process seems to be a good way
link |
00:31:03.920
to design organisms that survive in their environment,
link |
00:31:08.240
but it doesn't really get you to the insights
link |
00:31:11.400
of how the whole thing works.
link |
00:31:13.960
I think a good analogy is physics.
link |
00:31:16.480
You know how you say, hey, let's do some physics calculation
link |
00:31:19.040
and come up with some new physics theory
link |
00:31:20.480
and make some prediction.
link |
00:31:21.720
But then you gotta run the experiment.
link |
00:31:23.920
You know, you gotta run the experiment, it's important.
link |
00:31:26.040
So it's a bit the same here,
link |
00:31:27.440
except that maybe sometimes the experiment
link |
00:31:29.760
came before the theory.
link |
00:31:31.040
But it still is the case.
link |
00:31:32.040
You know, you have some data
link |
00:31:33.840
and you come up with some prediction.
link |
00:31:35.000
You say, yeah, let's make a big neural network.
link |
00:31:36.560
Let's train it.
link |
00:31:37.400
And it's going to work much better than anything before it.
link |
00:31:39.840
And it will in fact continue to get better
link |
00:31:41.440
as you make it larger.
link |
00:31:42.720
And it turns out to be true.
link |
00:31:43.600
That's amazing when a theory is validated like this.
link |
00:31:47.360
It's not a mathematical theory.
link |
00:31:48.720
It's more of a biological theory almost.
link |
00:31:51.680
So I think there are not terrible analogies
link |
00:31:53.960
between deep learning and biology.
link |
00:31:55.560
I would say it's like the geometric mean
link |
00:31:57.520
of biology and physics.
link |
00:31:58.760
That's deep learning.
link |
00:32:00.240
The geometric mean of biology and physics.
link |
00:32:03.880
I think I'm going to need a few hours
link |
00:32:05.160
to wrap my head around that.
link |
00:32:07.680
Because just to find the geometric,
link |
00:32:10.480
just to find the set of what biology represents.
link |
00:32:16.480
Well, in biology, things are really complicated.
link |
00:32:19.480
Theories are really, really,
link |
00:32:21.000
it's really hard to have good predictive theory.
link |
00:32:22.840
And in physics, the theories are too good.
link |
00:32:25.400
In physics, people make these super precise theories
link |
00:32:27.920
which make these amazing predictions.
link |
00:32:29.360
And in machine learning, we're kind of in between.
link |
00:32:31.440
Kind of in between, but it'd be nice
link |
00:32:33.800
if machine learning somehow helped us
link |
00:32:35.920
discover the unification of the two
link |
00:32:37.720
as opposed to sort of the in between.
link |
00:32:40.800
But you're right.
link |
00:32:41.640
That's, you're kind of trying to juggle both.
link |
00:32:44.920
So do you think there are still beautiful
link |
00:32:46.760
and mysterious properties in neural networks
link |
00:32:48.800
that are yet to be discovered?
link |
00:32:50.160
Definitely.
link |
00:32:51.360
I think that we are still massively underestimating
link |
00:32:53.560
deep learning.
link |
00:32:55.440
What do you think it will look like?
link |
00:32:56.640
Like what, if I knew, I would have done it, you know?
link |
00:33:01.080
So, but if you look at all the progress
link |
00:33:04.000
from the past 10 years, I would say most of it,
link |
00:33:07.040
I would say there've been a few cases
link |
00:33:08.880
where some were things that felt like really new ideas
link |
00:33:12.080
showed up, but by and large it was every year
link |
00:33:15.080
we thought, okay, deep learning goes this far.
link |
00:33:17.160
Nope, it actually goes further.
link |
00:33:19.000
And then the next year, okay, now this is peak deep learning.
link |
00:33:22.480
We are really done.
link |
00:33:23.320
Nope, it goes further.
link |
00:33:24.440
It just keeps going further each year.
link |
00:33:26.040
So that means that we keep underestimating,
link |
00:33:27.600
we keep not understanding it.
link |
00:33:29.160
It has surprising properties all the time.
link |
00:33:31.360
Do you think it's getting harder and harder?
link |
00:33:33.600
To make progress?
link |
00:33:34.440
Need to make progress?
link |
00:33:36.000
It depends on what you mean.
link |
00:33:36.840
I think the field will continue to make very robust progress
link |
00:33:39.960
for quite a while.
link |
00:33:41.120
I think for individual researchers,
link |
00:33:42.800
especially people who are doing research,
link |
00:33:46.120
it can be harder because there is a very large number
link |
00:33:48.240
of researchers right now.
link |
00:33:50.080
I think that if you have a lot of compute,
link |
00:33:51.800
then you can make a lot of very interesting discoveries,
link |
00:33:54.720
but then you have to deal with the challenge
link |
00:33:57.440
of managing a huge compute cluster
link |
00:34:01.680
to run your experiments.
link |
00:34:02.520
It's a little bit harder.
link |
00:34:03.360
So I'm asking all these questions
link |
00:34:04.920
that nobody knows the answer to,
link |
00:34:06.440
but you're one of the smartest people I know,
link |
00:34:08.280
so I'm gonna keep asking.
link |
00:34:10.440
So let's imagine all the breakthroughs
link |
00:34:12.400
that happen in the next 30 years in deep learning.
link |
00:34:15.240
Do you think most of those breakthroughs
link |
00:34:17.120
can be done by one person with one computer?
link |
00:34:22.040
Sort of in the space of breakthroughs,
link |
00:34:23.760
do you think compute will be,
link |
00:34:26.840
compute and large efforts will be necessary?
link |
00:34:32.360
I mean, I can't be sure.
link |
00:34:34.040
When you say one computer, you mean how large?
link |
00:34:36.680
You're clever.
link |
00:34:40.760
I mean, one GPU.
link |
00:34:42.640
I see.
link |
00:34:43.960
I think it's pretty unlikely.
link |
00:34:47.520
I think it's pretty unlikely.
link |
00:34:48.720
I think that there are many,
link |
00:34:51.000
the stack of deep learning is starting to be quite deep.
link |
00:34:54.680
If you look at it, you've got all the way from the ideas,
link |
00:34:59.840
the systems to build the data sets,
link |
00:35:02.200
the distributed programming,
link |
00:35:04.200
the building the actual cluster,
link |
00:35:06.480
the GPU programming, putting it all together.
link |
00:35:09.040
So now the stack is getting really deep
link |
00:35:10.600
and I think it becomes,
link |
00:35:12.280
it can be quite hard for a single person
link |
00:35:14.160
to become, to be world class
link |
00:35:15.680
in every single layer of the stack.
link |
00:35:17.960
What about the, what, like, Vladimir Vapnik
link |
00:35:21.120
really insists on, is taking MNIST
link |
00:35:23.200
and trying to learn from very few examples.
link |
00:35:26.000
So being able to learn more efficiently.
link |
00:35:29.120
Do you think that's, there'll be breakthroughs in that space
link |
00:35:32.120
that would, may not need the huge compute?
link |
00:35:34.880
I think there will be a large number of breakthroughs
link |
00:35:37.920
in general that will not need a huge amount of compute.
link |
00:35:40.640
So maybe I should clarify that.
link |
00:35:42.160
I think that some breakthroughs will require a lot of compute
link |
00:35:45.440
and I think building systems which actually do things
link |
00:35:48.680
will require a huge amount of compute.
link |
00:35:50.200
That one is pretty obvious.
link |
00:35:51.360
If you want to do X and X requires a huge neural net,
link |
00:35:54.720
you gotta get a huge neural net.
link |
00:35:56.560
But I think there will be lots of,
link |
00:35:59.360
I think there is lots of room for very important work
link |
00:36:02.520
being done by small groups and individuals.
link |
00:36:05.120
Can you maybe sort of on the topic
link |
00:36:07.480
of the science of deep learning,
link |
00:36:10.040
talk about one of the recent papers
link |
00:36:12.000
that you released, the Deep Double Descent,
link |
00:36:15.640
where bigger models and more data hurt.
link |
00:36:18.120
I think it's a really interesting paper.
link |
00:36:19.600
Can you describe the main idea?
link |
00:36:22.280
Yeah, definitely.
link |
00:36:23.480
So what happened is that some,
link |
00:36:25.600
over the years, some small number of researchers noticed
link |
00:36:28.840
that it is kind of weird that when you make
link |
00:36:30.760
the neural network larger, it works better
link |
00:36:32.120
and it seems to go in contradiction
link |
00:36:33.320
with statistical ideas.
link |
00:36:34.720
And then some people made an analysis showing
link |
00:36:36.880
that actually you got this double descent bump.
link |
00:36:38.880
And what we've done was to show that double descent occurs
link |
00:36:42.760
for pretty much all practical deep learning systems.
link |
00:36:46.400
And that it'll be also, so can you step back?
link |
00:36:51.560
What's the X axis and the Y axis of a double descent plot?
link |
00:36:55.960
Okay, great.
link |
00:36:57.000
So you can look, you can do things like,
link |
00:37:02.680
you can take your neural network
link |
00:37:04.960
and you can start increasing its size slowly
link |
00:37:07.600
while keeping your data set fixed.
link |
00:37:10.000
So if you increase the size of the neural network slowly,
link |
00:37:14.760
and if you don't do early stopping,
link |
00:37:16.880
that's a pretty important detail,
link |
00:37:20.360
then when the neural network is really small,
link |
00:37:22.480
you make it larger,
link |
00:37:23.560
you get a very rapid increase in performance.
link |
00:37:26.040
Then you continue to make it larger.
link |
00:37:27.280
And at some point performance will get worse.
link |
00:37:30.160
And it's at its worst exactly at the point
link |
00:37:33.920
at which it achieves zero training error,
link |
00:37:36.240
precisely zero training loss.
link |
00:37:38.640
And then as you make it larger,
link |
00:37:39.600
it starts to get better again.
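The sweep he describes can be reproduced in miniature with random-feature linear regression, where the same curve appears; this sketch (all sizes, the tanh features, and the seed are illustrative assumptions) keeps the training set fixed, grows the model, fits to zero training error with the minimum-norm solution, and prints test error, which typically dips, peaks near the interpolation point where the number of features equals the number of training points, and then descends again.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 20                                  # training points, input dimension
X_tr, X_te = rng.normal(size=(n, d)), rng.normal(size=(500, d))
w_true = rng.normal(size=d)
y_tr = X_tr @ w_true + 0.3 * rng.normal(size=n)  # noisy training labels
y_te = X_te @ w_true

for p in [5, 10, 20, 40, 80, 160, 320]:        # growing "model size"
    V = rng.normal(size=(d, p)) / np.sqrt(d)   # fixed random features
    F_tr, F_te = np.tanh(X_tr @ V), np.tanh(X_te @ V)
    beta = np.linalg.pinv(F_tr) @ y_tr         # min-norm least-squares fit
    test_err = np.mean((F_te @ beta - y_te) ** 2)
    print(p, round(float(test_err), 3))        # typically worst near p == n
```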
link |
00:37:41.480
And it's kind of counterintuitive
link |
00:37:42.840
because you'd expect deep learning phenomena
link |
00:37:44.600
to be monotonic.
link |
00:37:46.800
And it's hard to be sure what it means,
link |
00:37:50.040
but it also occurs in the case of linear classifiers.
link |
00:37:53.120
And the intuition basically boils down to the following.
link |
00:37:57.040
When you have a large data set and a small model,
link |
00:38:03.560
then small, tiny random,
link |
00:38:05.000
so basically what is overfitting?
link |
00:38:07.120
Overfitting is when your model is somehow very sensitive
link |
00:38:12.000
to the small random unimportant stuff in your data set.
link |
00:38:16.080
In the training data.
link |
00:38:17.000
In the training data set, precisely.
link |
00:38:19.000
So if you have a small model and you have a big data set,
link |
00:38:23.400
and there may be some random thing,
link |
00:38:24.760
some training cases are randomly in the data set
link |
00:38:27.480
and others may not be there,
link |
00:38:29.080
but the small model is kind of insensitive
link |
00:38:31.640
to this randomness because it's the same,
link |
00:38:34.400
there is pretty much no uncertainty about the model
link |
00:38:37.080
when the data set is large.
link |
00:38:38.320
So, okay.
link |
00:38:39.160
So at the very basic level to me,
link |
00:38:41.200
it is the most surprising thing
link |
00:38:43.360
that neural networks don't overfit every time very quickly
link |
00:38:51.840
before ever being able to learn anything.
link |
00:38:54.040
The huge number of parameters.
link |
00:38:56.280
So here is, so there is one way, okay.
link |
00:38:57.680
So maybe, let me try to give the explanation
link |
00:39:00.240
and maybe that will work.
link |
00:39:02.040
So you've got a huge neural network.
link |
00:39:03.640
Let's suppose you have a huge neural network,
link |
00:39:07.640
you have a huge number of parameters.
link |
00:39:09.760
And now let's pretend everything is linear,
link |
00:39:11.360
which it is not, but let's just pretend.
link |
00:39:13.120
Then there is this big subspace
link |
00:39:15.560
where your neural network achieves zero error.
link |
00:39:18.040
And SGD is going to find approximately the point.
link |
00:39:21.920
That's right.
link |
00:39:22.760
Approximately the point with the smallest norm
link |
00:39:24.480
in that subspace.
link |
00:39:26.720
Okay.
link |
00:39:27.560
And that can also be proven to be insensitive
link |
00:39:30.280
to the small randomness in the data
link |
00:39:33.520
when the dimensionality is high.
link |
00:39:35.360
But when the dimensionality of the data
link |
00:39:37.160
is equal to the dimensionality of the model,
link |
00:39:39.360
then there is a one to one correspondence
link |
00:39:41.040
between all the data sets and the models.
link |
00:39:44.400
So small changes in the data set
link |
00:39:45.680
actually lead to large changes in the model.
link |
00:39:47.360
And that's why performance gets worse.
link |
00:39:48.800
So this is the best explanation more or less.
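A tiny numerical sketch of that linearized picture (again an illustration, not the episode's analysis): with far more parameters than data, the minimum-norm interpolating solution, which SGD approximates, barely moves when a training label is nudged; when the dimensions match one-to-one, the same nudge typically moves the solution much more.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 400                            # many more parameters than data
X, y = rng.normal(size=(n, p)), rng.normal(size=n)

y2 = y.copy()
y2[0] += 0.1                              # small random change to the data set

w  = np.linalg.pinv(X) @ y                # min-norm solution, zero training error
w2 = np.linalg.pinv(X) @ y2
print(np.linalg.norm(w2 - w))             # tiny: the solution is insensitive

Xs = X[:, :n]                             # square case: model size matches data
ws, ws2 = np.linalg.solve(Xs, y), np.linalg.solve(Xs, y2)
print(np.linalg.norm(ws2 - ws))           # typically much larger: one-to-one regime
```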
link |
00:39:52.280
So then it would be good for the model
link |
00:39:54.000
to have more parameters, so to be bigger than the data.
link |
00:39:58.640
That's right.
link |
00:39:59.480
But only if you don't early stop.
link |
00:40:00.800
If you introduce early stopping as regularization,
link |
00:40:02.840
you can make the double descent bump
link |
00:40:04.640
almost completely disappear.
link |
00:40:06.120
What is early stopping?
link |
00:40:07.120
Early stopping is when you train your model
link |
00:40:09.960
and you monitor your validation performance.
link |
00:40:13.640
And then if at some point validation performance
link |
00:40:15.200
starts to get worse, you say, okay, let's stop training.
link |
00:40:17.640
We are good enough.
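As a concrete sketch of the rule just described, where `model`, `train_one_epoch`, and `validation_loss` are hypothetical stand-ins for your network, training step, and validation metric:

```python
import copy

max_epochs, patience = 100, 3              # assumed settings for this sketch
best_val, bad_epochs = float("inf"), 0
for epoch in range(max_epochs):
    train_one_epoch(model)                 # hypothetical training step
    val = validation_loss(model)           # monitor validation performance
    if val < best_val:
        best_val, bad_epochs = val, 0
        best_model = copy.deepcopy(model)  # keep the best checkpoint so far
    else:
        bad_epochs += 1
        if bad_epochs >= patience:         # validation got worse: "we are good enough"
            break
```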
link |
00:40:20.000
So the magic happens after that moment.
link |
00:40:23.160
So you don't want to do the early stopping.
link |
00:40:25.080
Well, if you don't do the early stopping,
link |
00:40:26.680
you get the very pronounced double descent.
link |
00:40:29.200
Do you have any intuition why this happens?
link |
00:40:31.880
Double descent?
link |
00:40:32.880
Oh, sorry, early stopping?
link |
00:40:33.880
No, the double descent.
link |
00:40:34.880
So the...
link |
00:40:35.880
Well, yeah, so I try...
link |
00:40:36.880
Let's see.
link |
00:40:37.880
The intuition is basically this,
link |
00:40:39.880
that when the data set has as many degrees of freedom
link |
00:40:44.120
as the model, then there is a one to one correspondence
link |
00:40:47.560
between them.
link |
00:40:48.560
And so small changes to the data set
link |
00:40:50.760
lead to noticeable changes in the model.
link |
00:40:53.640
So your model is very sensitive to all the randomness.
link |
00:40:55.920
It is unable to discard it.
link |
00:40:57.960
Whereas it turns out that when you have
link |
00:41:01.360
a lot more data than parameters
link |
00:41:03.160
or a lot more parameters than data,
link |
00:41:05.200
the resulting solution will be insensitive
link |
00:41:07.480
to small changes in the data set.
link |
00:41:09.040
Oh, so it's able to, nicely put,
link |
00:41:12.120
discard the small changes, the randomness.
link |
00:41:14.800
The randomness, exactly.
link |
00:41:15.800
The spurious correlation which you don't want.
link |
00:41:19.120
Geoff Hinton suggested we need to throw away back propagation.
link |
00:41:22.120
We already kind of talked about this a little bit,
link |
00:41:23.840
but he suggested that we need to throw away
link |
00:41:25.720
back propagation and start over.
link |
00:41:28.160
I mean, of course some of that is a little bit
link |
00:41:32.080
wit and humor, but what do you think?
link |
00:41:34.960
What could be an alternative method
link |
00:41:36.440
of training neural networks?
link |
00:41:37.920
Well, the thing that he said precisely is that
link |
00:41:40.560
to the extent that you can't find back propagation
link |
00:41:42.440
in the brain, it's worth seeing if we can learn something
link |
00:41:45.960
from how the brain learns.
link |
00:41:47.480
But back propagation is very useful
link |
00:41:48.960
and we should keep using it.
link |
00:41:50.760
Oh, you're saying that once we discover
link |
00:41:52.960
the mechanism of learning in the brain,
link |
00:41:54.720
or any aspects of that mechanism,
link |
00:41:56.520
we should also try to implement that in neural networks?
link |
00:41:59.040
If it turns out that we can't find
link |
00:42:00.640
back propagation in the brain.
link |
00:42:01.960
If we can't find back propagation in the brain.
link |
00:42:06.280
Well, so I guess your answer to that is
link |
00:42:10.160
back propagation is pretty damn useful.
link |
00:42:12.200
So why are we complaining?
link |
00:42:14.280
I mean, I personally am a big fan of back propagation.
link |
00:42:16.800
I think it's a great algorithm because it solves
link |
00:42:18.760
an extremely fundamental problem,
link |
00:42:20.320
which is finding a neural circuit
link |
00:42:24.920
subject to some constraints.
link |
00:42:27.240
And I don't see that problem going away.
link |
00:42:28.800
So that's why I really, I think it's pretty unlikely
link |
00:42:33.280
that we'll have anything which is going to be
link |
00:42:35.680
dramatically different.
link |
00:42:37.040
It could happen, but I wouldn't bet on it right now.
link |
00:42:41.640
So let me ask a sort of big picture question.
link |
00:42:45.200
Do you think neural networks can be made
link |
00:42:49.160
to reason?
link |
00:42:50.720
Why not?
link |
00:42:52.440
Well, if you look, for example, at AlphaGo or AlphaZero,
link |
00:42:57.320
the neural network of AlphaZero plays Go,
link |
00:43:00.720
which we all agree is a game that requires reasoning,
link |
00:43:04.080
better than 99.9% of all humans.
link |
00:43:07.600
Just the neural network, without the search,
link |
00:43:09.440
just the neural network itself.
link |
00:43:11.320
Doesn't that give us an existence proof
link |
00:43:14.160
that neural networks can reason?
link |
00:43:15.720
To push back and disagree a little bit,
link |
00:43:18.320
we all agree that Go is reasoning.
link |
00:43:20.800
I think I agree, though I don't think it's trivial,
link |
00:43:24.800
so obviously reasoning, like intelligence,
link |
00:43:27.080
is a loose gray area term a little bit.
link |
00:43:31.080
Maybe you disagree with that.
link |
00:43:32.640
But yes, I think it has some of the same elements
link |
00:43:36.560
of reasoning.
link |
00:43:37.960
Reasoning is almost like akin to search, right?
link |
00:43:41.640
There's a sequential element of reasoning
link |
00:43:45.680
of stepwise consideration of possibilities
link |
00:43:51.520
and sort of building on top of those possibilities
link |
00:43:54.320
in a sequential manner until you arrive at some insight.
link |
00:43:57.680
So yeah, I guess playing Go is kind of like that.
link |
00:44:00.560
And when you have a single neural network
link |
00:44:02.320
doing that without search, it's kind of like that.
link |
00:44:04.960
So there's an existence proof
link |
00:44:06.160
in a particular constrained environment
link |
00:44:08.160
that a process akin to what many people call reasoning
link |
00:44:13.200
exists, but more general kind of reasoning.
link |
00:44:17.160
So off the board.
link |
00:44:18.880
There is one other existence proof.
link |
00:44:20.520
Oh boy, which one?
link |
00:44:22.160
Us humans?
link |
00:44:23.000
Yes.
link |
00:44:23.840
Okay, all right, so do you think the architecture
link |
00:44:29.840
that will allow neural networks to reason
link |
00:44:33.400
will look similar to the neural network architectures
link |
00:44:37.360
we have today?
link |
00:44:38.840
I think it will.
link |
00:44:39.680
I think, well, I don't wanna make
link |
00:44:41.720
overly definitive statements.
link |
00:44:44.040
I think it's definitely possible
link |
00:44:45.800
that the neural networks that will produce
link |
00:44:48.520
the reasoning breakthroughs of the future
link |
00:44:50.240
will be very similar to the architectures that exist today.
link |
00:44:53.640
Maybe a little bit more recurrent,
link |
00:44:55.360
maybe a little bit deeper.
link |
00:44:57.120
But these neural nets are so insanely powerful.
link |
00:45:02.920
Why wouldn't they be able to learn to reason?
link |
00:45:05.560
Humans can reason.
link |
00:45:07.240
So why can't neural networks?
link |
00:45:09.320
So do you think the kind of stuff we've seen
link |
00:45:11.640
neural networks do is a kind of just weak reasoning?
link |
00:45:14.640
So it's not a fundamentally different process.
link |
00:45:16.600
Again, this is stuff nobody knows the answer to.
link |
00:45:19.680
So when it comes to our neural networks,
link |
00:45:23.000
the thing which I would say is that neural networks
link |
00:45:25.560
are capable of reasoning.
link |
00:45:28.200
But if you train a neural network on a task
link |
00:45:30.560
which doesn't require reasoning, it's not going to reason.
link |
00:45:34.000
This is a well known effect where the neural network
link |
00:45:36.360
will solve the problem that you pose in front of it
link |
00:45:41.360
in the easiest way possible.
link |
00:45:44.440
Right, that takes us to one of the brilliant sort of ways
link |
00:45:51.560
you've described neural networks,
link |
00:45:52.840
which is you've referred to neural networks
link |
00:45:55.480
as the search for small circuits
link |
00:45:57.920
and maybe general intelligence
link |
00:46:01.160
as the search for small programs,
link |
00:46:04.520
which I found as a metaphor very compelling.
link |
00:46:06.960
Can you elaborate on that difference?
link |
00:46:09.200
Yeah, so the thing which I said precisely was that
link |
00:46:13.720
if you can find the shortest program
link |
00:46:17.280
that outputs the data at your disposal,
link |
00:46:20.940
then you will be able to use it
link |
00:46:22.280
to make the best prediction possible.
link |
00:46:25.680
And that's a theoretical statement
link |
00:46:27.000
which can be proved mathematically.
link |
00:46:29.240
Now, you can also prove mathematically
link |
00:46:31.160
that finding the shortest program
link |
00:46:33.920
which generates some data is not a computable operation.
link |
00:46:38.920
No finite amount of compute can do this.
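Both claims have standard formalizations in algorithmic information theory; a compact statement is added here for context, not as a quote from the conversation:

```latex
% Fix a universal Turing machine U. The Kolmogorov complexity of a string x
% is the length of the shortest program that outputs it:
\[
  K_U(x) \;=\; \min\{\, \lvert p \rvert \;:\; U(p) = x \,\}.
\]
% Solomonoff induction weights each program p by 2^{-|p|}, so the shortest
% program dominates the predictive mixture -- the sense in which it yields
% the best possible prediction. But K_U is uncomputable: a procedure that
% computed it would decide the halting problem.
```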
link |
00:46:42.740
So then with neural networks,
link |
00:46:46.060
neural networks are the next best thing
link |
00:46:47.900
that actually works in practice.
link |
00:46:50.140
We are not able to find the best,
link |
00:46:52.860
the shortest program which generates our data,
link |
00:46:55.740
but we are able to find a small,
link |
00:46:58.840
but now that statement should be amended,
link |
00:47:01.580
even a large circuit which fits our data in some way.
link |
00:47:05.280
Well, I think what you meant by the small circuit
link |
00:47:07.180
is the smallest needed circuit.
link |
00:47:10.020
Well, the thing which I would change now,
link |
00:47:12.620
back then I really haven't fully internalized
link |
00:47:14.780
the overparameterized results.
link |
00:47:17.100
The things we know about overparameterized neural nets,
link |
00:47:20.460
now I would phrase it as a large circuit
link |
00:47:24.540
whose weights contain a small amount of information,
link |
00:47:27.780
which I think is what's going on.
link |
00:47:29.160
If you imagine the training process of a neural network
link |
00:47:31.500
as you slowly transmit entropy
link |
00:47:33.780
from the dataset to the parameters,
link |
00:47:37.040
then somehow the amount of information in the weights
link |
00:47:41.060
ends up being not very large,
link |
00:47:42.920
which would explain why they generalize so well.
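One standard way to make "a small amount of information in the weights" precise is the information-theoretic generalization bound of Xu and Raginsky (2017), stated here as background rather than as anything claimed in the episode: for a sigma-subgaussian loss and n training examples,

```latex
\[
  \bigl\lvert \mathbb{E}\,[\mathrm{gen}(W, S)] \bigr\rvert
  \;\le\;
  \sqrt{\frac{2\sigma^{2}\, I(S; W)}{n}},
\]
% where S is the training set and W the learned weights. If training
% transmits only a little entropy from the data set into the parameters,
% I(S;W) is small and the bound forces good generalization, no matter how
% large the circuit itself is.
```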
link |
00:47:45.220
So the large circuit might be one that's helpful
link |
00:47:49.380
for the generalization.
link |
00:47:51.900
Yeah, something like this.
link |
00:47:54.660
But do you see it important to be able to try
link |
00:48:00.220
to learn something like programs?
link |
00:48:02.420
I mean, if we can, definitely.
link |
00:48:04.860
I think it's kind of, the answer is kind of yes,
link |
00:48:08.140
if we can do it, we should do it.
link |
00:48:11.140
It's the reason we are pushing on deep learning,
link |
00:48:15.300
the fundamental reason, the root cause
link |
00:48:18.780
is that we are able to train them.
link |
00:48:21.520
So in other words, training comes first.
link |
00:48:23.880
We've got our pillar, which is the training pillar.
link |
00:48:27.500
And now we're trying to contort our neural networks
link |
00:48:30.020
around the training pillar.
link |
00:48:30.900
We gotta stay trainable.
link |
00:48:31.940
This is an invariant we cannot violate.
link |
00:48:36.380
And so being trainable means starting from scratch,
link |
00:48:40.540
knowing nothing, you can actually pretty quickly
link |
00:48:42.820
converge towards knowing a lot.
link |
00:48:44.580
Or even slowly.
link |
00:48:45.900
But it means that given the resources at your disposal,
link |
00:48:50.700
you can train the neural net
link |
00:48:52.380
and get it to achieve useful performance.
link |
00:48:55.380
Yeah, that's a pillar we can't move away from.
link |
00:48:57.500
That's right.
link |
00:48:58.340
Because if you say, hey, let's find the shortest program,
link |
00:49:01.460
well, we can't do that.
link |
00:49:02.800
So it doesn't matter how useful that would be.
link |
00:49:06.060
We can't do it.
link |
00:49:07.260
So we won't.
link |
00:49:08.460
So do you think, you kind of mentioned
link |
00:49:09.820
that the neural networks are good at finding small circuits
link |
00:49:12.220
or large circuits.
link |
00:49:14.440
Do you think then the matter of finding small programs
link |
00:49:17.540
is just the data?
link |
00:49:19.280
No.
link |
00:49:20.120
So the, sorry, not the size or the type of data.
link |
00:49:25.880
Sort of asking: giving it programs?
link |
00:49:28.940
Well, I think the thing is that right now,
link |
00:49:31.960
there are no good precedents
link |
00:49:34.540
of people successfully finding programs really well.
link |
00:49:38.900
And so the way you'd find programs
link |
00:49:40.660
is you'd train a deep neural network to do it basically.
link |
00:49:44.320
Right.
link |
00:49:45.900
Which is the right way to go about it.
link |
00:49:48.140
But there's not good illustrations of that.
link |
00:49:50.700
It hasn't been done yet.
link |
00:49:51.860
But in principle, it should be possible.
link |
00:49:56.500
Can you elaborate a little bit,
link |
00:49:58.200
what's your answer in principle?
link |
00:50:00.260
Put another way, you don't see why it's not possible.
link |
00:50:04.180
Well, it's more a statement of,
link |
00:50:09.500
I think that it's unwise
link |
00:50:12.020
to bet against deep learning.
link |
00:50:13.420
And if it's a cognitive function
link |
00:50:16.920
that humans seem to be able to do,
link |
00:50:18.680
then it doesn't take too long
link |
00:50:21.700
for some deep neural net to pop up that can do it too.
link |
00:50:25.740
Yeah, I'm there with you.
link |
00:50:27.820
I've stopped betting against neural networks at this point
link |
00:50:33.140
because they continue to surprise us.
link |
00:50:35.740
What about long term memory?
link |
00:50:37.280
Can neural networks have long term memory?
link |
00:50:39.060
Something like knowledge bases.
link |
00:50:42.220
So being able to aggregate important information
link |
00:50:45.540
over long periods of time that would then serve
link |
00:50:49.420
as useful sort of representations of state
link |
00:50:54.420
that you can make decisions by,
link |
00:50:57.720
so have a long term context
link |
00:50:59.480
based on which you're making the decision.
link |
00:51:01.560
So in some sense, the parameters already do that.
link |
00:51:06.000
The parameters are an aggregation of the neural,
link |
00:51:09.000
of the entirety of the neural net's experience,
link |
00:51:10.880
and so they count as long term knowledge.
link |
00:51:15.600
And people have trained various neural nets
link |
00:51:17.740
to act as knowledge bases and, you know,
link |
00:51:20.140
people have investigated
link |
00:51:22.360
language models as knowledge bases.
link |
00:51:23.640
So there is work there.
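For a flavor of that line of work, here is a minimal probe in the style of the language-models-as-knowledge-bases papers (the snippet itself is illustrative, not from the conversation), using the Hugging Face transformers library:

```python
from transformers import pipeline

# Ask a masked language model to fill in a factual blank and read off
# its top guesses -- treating the model itself as a knowledge base.
fill = pipeline("fill-mask", model="bert-base-uncased")
for guess in fill("The capital of France is [MASK].")[:3]:
    print(guess["token_str"], round(guess["score"], 3))
```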
link |
00:51:27.260
Yeah, but in some sense, do you think in every sense,
link |
00:51:29.840
do you think it's all just a matter of coming up
link |
00:51:35.700
with a better mechanism of forgetting the useless stuff
link |
00:51:38.440
and remembering the useful stuff?
link |
00:51:40.240
Because right now, I mean, there haven't been mechanisms
link |
00:51:43.080
that do remember really long term information.
link |
00:51:46.880
What do you mean by that precisely?
link |
00:51:48.880
Precisely, I like the word precisely.
link |
00:51:51.780
So I'm thinking of the kind of compression of information
link |
00:51:58.160
the knowledge bases represent.
link |
00:52:00.480
Sort of creating a, now I apologize for my sort of
link |
00:52:05.680
human centric thinking about what knowledge is,
link |
00:52:08.720
because neural networks aren't interpretable necessarily
link |
00:52:12.880
with the kind of knowledge they have discovered.
link |
00:52:15.780
But a good example for me is knowledge bases,
link |
00:52:18.720
being able to build up over time something like
link |
00:52:21.280
the knowledge that Wikipedia represents.
link |
00:52:24.080
It's a really compressed, structured knowledge base.
link |
00:52:30.840
Obviously not the actual Wikipedia or the language,
link |
00:52:34.360
but like a semantic web, the dream that semantic web
link |
00:52:37.040
represented, so it's a really nice compressed knowledge base
link |
00:52:40.360
or something akin to that in the noninterpretable sense
link |
00:52:44.560
as neural networks would have.
link |
00:52:46.980
Well, the neural networks would be noninterpretable
link |
00:52:48.560
if you look at their weights, but their outputs
link |
00:52:50.820
should be very interpretable.
link |
00:52:52.200
Okay, so yeah, how do you make very smart neural networks
link |
00:52:55.840
like language models interpretable?
link |
00:52:58.040
Well, you ask them to generate some text
link |
00:53:00.280
and the text will generally be interpretable.
link |
00:53:02.120
Do you find that the epitome of interpretability,
link |
00:53:04.720
like can you do better?
link |
00:53:06.000
Like can you add, because you can't, okay,
link |
00:53:08.600
I'd like to know what does it know and what doesn't it know?
link |
00:53:12.240
I would like the neural network to come up with examples
link |
00:53:15.720
where it's completely dumb and examples
link |
00:53:18.640
where it's completely brilliant.
link |
00:53:20.320
And the only way I know how to do that now
link |
00:53:22.280
is to generate a lot of examples and use my human judgment.
link |
00:53:26.440
But it would be nice if a neural network
link |
00:53:28.160
had some self awareness about it.
link |
00:53:31.720
Yeah, 100%, I'm a big believer in self awareness
link |
00:53:34.800
and I think that, I think neural net self awareness
link |
00:53:39.940
will allow for things like the capabilities,
link |
00:53:42.540
like the ones you described, like for them to know
link |
00:53:44.840
what they know and what they don't know
link |
00:53:47.000
and for them to know where to invest
link |
00:53:48.740
to increase their skills most optimally.
link |
00:53:50.800
And to your question of interpretability,
link |
00:53:52.280
there are actually two answers to that question.
link |
00:53:54.360
One answer is, you know, we have the neural net
link |
00:53:56.480
so we can analyze the neurons and we can try to understand
link |
00:53:59.600
what the different neurons and different layers mean.
link |
00:54:02.000
And you can actually do that
link |
00:54:03.440
and OpenAI has done some work on that.
link |
00:54:05.920
But there is a different answer, which is that,
link |
00:54:10.000
I would say that's the human centric answer where you say,
link |
00:54:13.160
you know, you look at a human being, you can't read,
link |
00:54:16.520
how do you know what a human being is thinking?
link |
00:54:18.800
You ask them, you say, hey, what do you think about this?
link |
00:54:20.600
What do you think about that?
link |
00:54:22.340
And you get some answers.
link |
00:54:24.120
The answers you get are sticky in the sense
link |
00:54:26.040
you already have a mental model.
link |
00:54:28.000
You already have a mental model of that human being.
link |
00:54:32.680
You already have an understanding of like a big conception
link |
00:54:37.200
of that human being, how they think, what they know,
link |
00:54:40.400
how they see the world and then everything you ask,
link |
00:54:42.880
you're adding onto that.
link |
00:54:45.520
And that stickiness seems to be,
link |
00:54:49.760
that's one of the really interesting qualities
link |
00:54:51.680
of the human being is that information is sticky.
link |
00:54:55.000
You don't, you seem to remember the useful stuff,
link |
00:54:57.520
aggregate it well and forget most of the information
link |
00:55:00.400
that's not useful, that process.
link |
00:55:02.960
But that's also pretty similar to the process
link |
00:55:05.520
that neural networks do.
link |
00:55:06.760
It's just that neural networks are much crappier
link |
00:55:09.040
at this time.
link |
00:55:10.640
It doesn't seem to be fundamentally that different.
link |
00:55:13.260
But just to stick on reasoning for a little longer,
link |
00:55:17.000
you said, why not?
link |
00:55:18.720
Why can't it reason?
link |
00:55:19.920
What's a good impressive feat, benchmark to you
link |
00:55:23.960
of reasoning that you'll be impressed by
link |
00:55:28.720
if neural networks were able to do?
link |
00:55:30.600
Is that something you already have in mind?
link |
00:55:32.840
Well, I think writing really good code,
link |
00:55:36.520
I think proving really hard theorems,
link |
00:55:39.280
solving open ended problems with out of the box solutions.
link |
00:55:45.880
And sort of theorem type, mathematical problems.
link |
00:55:49.480
Yeah, I think those ones are a very natural example
link |
00:55:52.080
as well.
link |
00:55:52.920
If you can prove an unproven theorem,
link |
00:55:54.480
then it's hard to argue you don't reason.
link |
00:55:57.920
And so by the way, and this comes back to the point
link |
00:55:59.400
about the hard results: machine learning,
link |
00:56:04.400
deep learning as a field is very fortunate
link |
00:56:06.080
because we have the ability to sometimes produce
link |
00:56:08.720
these unambiguous results.
link |
00:56:10.840
And when they happen, the debate changes,
link |
00:56:13.120
the conversation changes.
link |
00:56:14.160
We have the ability
link |
00:56:16.720
to produce conversation changing results.
link |
00:56:19.480
Conversation, and then of course, just like you said,
link |
00:56:21.600
people kind of take that for granted
link |
00:56:23.040
and say that wasn't actually a hard problem.
link |
00:56:25.080
Well, I mean, at some point we'll probably run out
link |
00:56:27.040
of hard problems.
link |
00:56:29.320
Yeah, that whole mortality thing is kind of a sticky problem
link |
00:56:33.640
that we haven't quite figured out.
link |
00:56:35.100
Maybe we'll solve that one.
link |
00:56:37.200
I think one of the fascinating things
link |
00:56:39.120
in your entire body of work,
link |
00:56:40.880
but also the work at OpenAI recently,
link |
00:56:43.040
one of the conversation changes has been
link |
00:56:44.840
in the world of language models.
link |
00:56:47.160
Can you briefly kind of try to describe
link |
00:56:50.280
the recent history of using neural networks
link |
00:56:52.240
in the domain of language and text?
link |
00:56:54.620
Well, there's been lots of history.
link |
00:56:56.620
I think the Elman network was a small,
link |
00:57:00.240
tiny recurrent neural network applied to language
link |
00:57:02.080
back in the 80s.
link |
00:57:03.840
So the history is really, you know, fairly long at least.
link |
00:57:08.640
And the thing that started,
link |
00:57:10.640
the thing that changed the trajectory
link |
00:57:13.440
of neural networks and language
link |
00:57:14.920
is the thing that changed the trajectory
link |
00:57:17.200
of all deep learning and that's data and compute.
link |
00:57:19.660
So suddenly you move from small language models,
link |
00:57:22.720
which learn a little bit,
link |
00:57:24.400
and with language models in particular,
link |
00:57:26.600
there's a very clear explanation
link |
00:57:28.440
for why they need to be large to be good,
link |
00:57:31.620
because they're trying to predict the next word.
link |
00:57:34.600
So when you don't know anything,
link |
00:57:36.840
you'll notice very, very broad strokes,
link |
00:57:40.240
surface level patterns,
link |
00:57:41.480
like sometimes there are characters
link |
00:57:44.840
and there is a space between those characters.
link |
00:57:46.480
You'll notice this pattern.
link |
00:57:47.960
And you'll notice that sometimes there is a comma
link |
00:57:50.040
and then the next character is a capital letter.
link |
00:57:51.920
You'll notice that pattern.
link |
00:57:53.600
Eventually you may start to notice
link |
00:57:55.000
that certain words occur often.
link |
00:57:57.160
You may notice that spellings are a thing.
link |
00:57:59.400
You may notice syntax.
link |
00:58:00.920
And when you get really good at all these,
link |
00:58:03.680
you start to notice the semantics.
link |
00:58:05.880
You start to notice the facts.
link |
00:58:07.820
But for that to happen,
link |
00:58:08.880
the language model needs to be larger.
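A toy sketch of that progression (illustrative only): even a character-bigram counter, about the smallest possible "language model," already picks up the surface patterns described, long before anything like syntax or semantics.

```python
from collections import Counter, defaultdict

text = "Hello, World. Hello, Lex. Hello, Ilya."
counts = defaultdict(Counter)
for a, b in zip(text, text[1:]):     # count which character follows which
    counts[a][b] += 1

def predict(ch):
    return counts[ch].most_common(1)[0][0]

print(repr(predict(",")))   # ' '  -- after a comma comes a space
print(repr(predict(" ")))   # 'H'  -- the most common word-initial character here
```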
link |
00:58:11.440
So that's, let's linger on that,
link |
00:58:14.040
because that's where you and Noam Chomsky disagree.
link |
00:58:18.680
So you think we're actually taking incremental steps,
link |
00:58:23.740
a sort of larger network, larger compute
link |
00:58:25.720
will be able to get to the semantics,
link |
00:58:29.480
to be able to understand language
link |
00:58:32.000
without what Noam likes to sort of think of
link |
00:58:35.520
as a fundamental understandings
link |
00:58:38.640
of the structure of language,
link |
00:58:40.440
like imposing your theory of language
link |
00:58:43.360
onto the learning mechanism.
link |
00:58:45.860
So you're saying the learning,
link |
00:58:48.000
you can learn from raw data,
link |
00:58:50.580
the mechanism that underlies language.
link |
00:58:53.400
Well, I think it's pretty likely,
link |
00:58:56.760
but I also want to say that I don't really know precisely
link |
00:59:01.240
what Chomsky means when he talks about it.
link |
00:59:05.520
You said something about imposing your structure of language.
link |
00:59:08.780
I'm not 100% sure what he means,
link |
00:59:10.520
but empirically it seems that
link |
00:59:12.680
when you inspect those larger language models,
link |
00:59:14.640
they exhibit signs of understanding the semantics
link |
00:59:16.640
whereas the smaller language models do not.
link |
00:59:18.520
We've seen that a few years ago
link |
00:59:19.800
when we did work on the sentiment neuron.
link |
00:59:21.920
We trained a small, you know,
link |
00:59:24.040
smallish LSTM to predict the next character
link |
00:59:27.320
in Amazon reviews.
link |
00:59:28.600
And we noticed that when you increase the size of the LSTM
link |
00:59:31.680
from 500 LSTM cells to 4,000 LSTM cells,
link |
00:59:35.400
then one of the neurons starts to represent the sentiment
link |
00:59:38.600
of the article, sorry, of the review.
link |
00:59:42.040
Now, why is that?
link |
00:59:42.960
Sentiment is a pretty semantic attribute.
link |
00:59:45.280
It's not a syntactic attribute.
link |
00:59:46.880
And for people who might not know,
link |
00:59:48.400
I don't know if that's a standard term,
link |
00:59:49.480
but sentiment is whether it's a positive
link |
00:59:51.200
or a negative review.
link |
00:59:52.040
That's right.
link |
00:59:52.880
Is the person happy with something
link |
00:59:54.320
or is the person unhappy with something?
link |
00:59:55.960
And so here we had very clear evidence
link |
00:59:58.800
that a small neural net does not capture sentiment
link |
01:00:01.960
while a large neural net does.
link |
01:00:03.640
And why is that?
link |
01:00:04.760
Well, our theory is that at some point
link |
01:00:07.480
you run out of syntax to model,
link |
01:00:08.840
you gotta start to focus on something else.
link |
01:00:11.040
And with size, you quickly run out of syntax to model
link |
01:00:15.840
and then you really start to focus on the semantics
link |
01:00:18.360
would be the idea.
link |
01:00:19.420
That's right.
link |
01:00:20.260
And so I don't wanna imply that our models
link |
01:00:22.160
have complete semantic understanding
link |
01:00:23.840
because that's not true,
link |
01:00:25.360
but they definitely are showing signs
link |
01:00:28.260
of semantic understanding,
link |
01:00:29.400
partial semantic understanding,
link |
01:00:30.800
but the smaller models do not show those signs.
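A sketch of how such a neuron is located, in the spirit of the sentiment-neuron work described, with `hidden_states`, `reviews`, and `sentiments` as hypothetical stand-ins for the trained character-level LSTM and a labeled set of reviews:

```python
import numpy as np

# Final hidden state of the trained LSTM for each review: (n_reviews, n_units).
H = np.stack([hidden_states(r) for r in reviews])
labels = np.array(sentiments)        # 1 = positive review, 0 = negative review

# Correlate each unit's activation with the label; the "sentiment neuron"
# is the single unit whose activation alone tracks review sentiment.
corr = [abs(np.corrcoef(H[:, j], labels)[0, 1]) for j in range(H.shape[1])]
best = int(np.argmax(corr))
print(best, corr[best])
```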
link |
01:00:34.520
Can you take a step back and say,
link |
01:00:36.600
what is GPT2, which is one of the big language models
link |
01:00:40.540
that was the conversation changer
link |
01:00:42.520
in the past couple of years?
link |
01:00:43.760
Yeah, so GPT2 is a transformer
link |
01:00:48.120
with one and a half billion parameters
link |
01:00:50.360
that was trained on about 40 billion tokens of text
link |
01:00:56.320
which were obtained from web pages
link |
01:00:58.840
that were linked to from Reddit articles
link |
01:01:01.080
with more than three upvotes.
link |
01:01:02.320
And what's a transformer?
link |
01:01:03.920
The transformer, it's the most important advance
link |
01:01:06.680
in neural network architectures in recent history.
link |
01:01:09.800
What is attention maybe too?
link |
01:01:11.480
Cause I think that's an interesting idea,
link |
01:01:13.280
not necessarily sort of technically speaking,
link |
01:01:15.000
but the idea of attention versus maybe
link |
01:01:18.680
what recurrent neural networks represent.
link |
01:01:21.080
Yeah, so the thing is the transformer
link |
01:01:23.320
is a combination of multiple ideas simultaneously
link |
01:01:25.840
of which attention is one.
link |
01:01:28.140
Do you think attention is the key?
link |
01:01:29.380
No, it's a key, but it's not the key.
link |
01:01:32.460
The transformer is successful
link |
01:01:34.520
because it is the simultaneous combination
link |
01:01:36.760
of multiple ideas.
link |
01:01:37.700
And if you were to remove either idea,
link |
01:01:39.040
it would be much less successful.
link |
01:01:41.480
So the transformer uses a lot of attention,
link |
01:01:43.880
but attention existed for a few years.
link |
01:01:45.860
So that can't be the main innovation.
link |
01:01:48.440
The transformer is designed in such a way
link |
01:01:53.180
that it runs really fast on the GPU.
link |
01:01:56.120
And that makes a huge amount of difference.
link |
01:01:58.200
This is one thing.
link |
01:01:59.360
The second thing is that transformer is not recurrent.
link |
01:02:02.840
And that is really important too,
link |
01:02:04.680
because it is more shallow
link |
01:02:06.380
and therefore much easier to optimize.
link |
01:02:08.440
So in other words, it uses attention,
link |
01:02:10.400
it is a really great fit to the GPU
link |
01:02:14.260
and it is not recurrent,
link |
01:02:15.320
so therefore less deep and easier to optimize.
link |
01:02:17.800
And the combination of those factors make it successful.
link |
01:02:20.720
So now it makes great use of your GPU.
link |
01:02:24.160
It allows you to achieve better results
link |
01:02:26.360
for the same amount of compute.
link |
01:02:28.680
And that's why it's successful.
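For reference, a minimal sketch of the attention ingredient being discussed, one of the several ideas combined in the transformer: scaled dot-product self-attention over a whole sequence as a couple of matrix multiplies, exactly the shape of computation that runs fast on a GPU and needs no recurrence.

```python
import numpy as np

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                     # all pairs at once, no recurrence
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                     # softmax over the keys
    return w @ V                                      # weighted mix of the values

T, d = 8, 16                                          # sequence length, head size
x = np.random.default_rng(0).normal(size=(T, d))
print(attention(x, x, x).shape)                       # (8, 16): self-attention output
```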
link |
01:02:31.080
Were you surprised how well transformers worked
link |
01:02:34.200
and GPT2 worked?
link |
01:02:36.120
So you worked on language.
link |
01:02:37.840
You've had a lot of great ideas
link |
01:02:39.760
before transformers came about in language.
link |
01:02:42.880
So you got to see the whole set of revolutions
link |
01:02:44.960
before and after.
link |
01:02:46.160
Were you surprised?
link |
01:02:47.560
Yeah, a little.
link |
01:02:48.680
A little?
link |
01:02:50.040
I mean, it's hard to remember
link |
01:02:51.920
because you adapt really quickly,
link |
01:02:54.520
but it definitely was surprising.
link |
01:02:55.920
It definitely was.
link |
01:02:56.880
In fact, you know what?
link |
01:02:59.060
I'll retract my statement.
link |
01:03:00.480
It was pretty amazing.
link |
01:03:02.480
It was just amazing to see it generate this text.
link |
01:03:06.000
And you know, you gotta keep in mind
link |
01:03:07.380
that at that time we've seen all this progress in GANs
link |
01:03:10.480
in improving the samples. The samples produced by GANs
link |
01:03:13.280
were just amazing.
link |
01:03:14.720
You have these realistic faces,
link |
01:03:15.960
but text hasn't really moved that much.
link |
01:03:17.960
And suddenly we moved from, you know,
link |
01:03:20.520
whatever GANs were in 2015
link |
01:03:23.120
to the best, most amazing GANs in one step.
link |
01:03:26.200
And that was really stunning.
link |
01:03:27.520
Even though theory predicted,
link |
01:03:29.000
yeah, you train a big language model,
link |
01:03:30.420
of course you should get this,
link |
01:03:31.840
but then to see it with your own eyes,
link |
01:03:33.200
it's something else.
link |
01:03:34.880
And yet we adapt really quickly.
link |
01:03:37.240
And now there are sort of some cognitive scientists
link |
01:03:42.240
writing articles saying that GPT2 models
link |
01:03:47.040
don't truly understand language.
link |
01:03:49.320
So we adapt quickly to how amazing
link |
01:03:51.880
the fact that they're able to model the language so well is.
link |
01:03:55.680
So what do you think is the bar?
link |
01:03:58.840
For what?
link |
01:03:59.680
For impressing us that it...
link |
01:04:02.440
I don't know.
link |
01:04:03.720
Do you think that bar will continuously be moved?
link |
01:04:06.080
Definitely.
link |
01:04:06.920
I think when you start to see
link |
01:04:08.840
really dramatic economic impact,
link |
01:04:11.240
that's when I think that's in some sense the next barrier.
link |
01:04:13.800
Because right now, if you think about the work in AI,
link |
01:04:16.880
it's really confusing.
link |
01:04:18.880
It's really hard to know what to make of all these advances.
link |
01:04:22.560
It's kind of like, okay, you got an advance
link |
01:04:25.560
and now you can do more things
link |
01:04:26.840
and you've got another improvement
link |
01:04:29.080
and you've got another cool demo.
link |
01:04:30.400
At some point, I think people who are outside of AI,
link |
01:04:36.160
they can no longer distinguish this progress.
link |
01:04:38.700
So we were talking offline
link |
01:04:40.040
about translating Russian to English
link |
01:04:41.760
and how there's a lot of brilliant work in Russian
link |
01:04:44.120
that the rest of the world doesn't know about.
link |
01:04:46.440
That's true for Chinese,
link |
01:04:47.580
it's true for a lot of scientists
link |
01:04:50.080
and just artistic work in general.
link |
01:04:52.220
Do you think translation is the place
link |
01:04:53.880
where we're going to see sort of economic big impact?
link |
01:04:57.080
I don't know.
link |
01:04:57.920
I think there is a huge number of...
link |
01:05:00.040
I mean, first of all,
link |
01:05:01.080
I wanna point out that translation already today is huge.
link |
01:05:05.520
I think billions of people interact
link |
01:05:07.500
with big chunks of the internet primarily through translation.
link |
01:05:11.080
So translation is already huge
link |
01:05:13.060
and it's hugely positive too.
link |
01:05:16.400
I think self driving is going to be hugely impactful
link |
01:05:20.320
and that's, it's unknown exactly when it happens,
link |
01:05:24.440
but again, I would not bet against deep learning, so I...
link |
01:05:27.960
So there's deep learning in general,
link |
01:05:29.320
but you think this...
link |
01:05:30.160
Deep learning for self driving.
link |
01:05:31.920
Yes, deep learning for self driving.
link |
01:05:33.120
But I was talking about sort of language models.
link |
01:05:35.320
I see.
link |
01:05:36.160
Just to check.
link |
01:05:36.980
We veered off a little bit.
link |
01:05:38.080
Just to check,
link |
01:05:38.920
you're not seeing a connection between driving and language.
link |
01:05:41.120
No, no.
link |
01:05:41.960
Okay.
link |
01:05:42.800
Or rather both use neural nets.
link |
01:05:44.040
That'd be a poetic connection.
link |
01:05:45.560
I think there might be some,
link |
01:05:47.160
like you said, there might be some kind of unification
link |
01:05:49.160
towards a kind of multitask transformers
link |
01:05:54.480
that can take on both language and vision tasks.
link |
01:05:58.200
That'd be an interesting unification.
link |
01:06:01.400
Now let's see, what more can I ask about GPT2?
link |
01:06:04.940
It's simple.
link |
01:06:05.780
There's not much to ask.
link |
01:06:06.980
It's, you take a transformer, you make it bigger,
link |
01:06:09.960
you give it more data,
link |
01:06:10.800
and suddenly it does all those amazing things.
link |
01:06:12.700
Yeah, one of the beautiful things is that GPT,
link |
01:06:14.920
the transformers are fundamentally simple to explain,
link |
01:06:17.920
to train.
link |
01:06:20.320
Do you think bigger will continue
link |
01:06:23.960
to show better results in language?
link |
01:06:27.060
Probably.
link |
01:06:28.240
Sort of like what are the next steps
link |
01:06:29.760
with GPT2, do you think?
link |
01:06:31.440
I mean, I think for sure seeing
link |
01:06:34.000
what larger versions can do is one direction.
link |
01:06:37.600
Also, I mean, there are many questions.
link |
01:06:41.200
There's one question which I'm curious about
link |
01:06:42.720
and that's the following.
link |
01:06:43.960
So right now GPT2,
link |
01:06:45.360
so we feed it all this data from the internet,
link |
01:06:46.960
which means that it needs to memorize
link |
01:06:48.120
all those random facts about everything in the internet.
link |
01:06:51.840
And it would be nice if the model could somehow
link |
01:06:56.840
use its own intelligence to decide
link |
01:06:59.800
what data it wants to accept
link |
01:07:01.800
and what data it wants to reject.
link |
01:07:03.560
Just like people.
link |
01:07:04.400
People don't learn all data indiscriminately.
link |
01:07:07.160
We are super selective about what we learn.
link |
01:07:09.760
And I think this kind of active learning,
link |
01:07:11.560
I think would be very nice to have.
link |
01:07:14.240
Yeah, listen, I love active learning.
link |
01:07:16.720
So let me ask, does the selection of data,
link |
01:07:21.120
can you just elaborate that a little bit more?
link |
01:07:23.040
Do you think the selection of data is,
link |
01:07:28.160
like I have this kind of sense
link |
01:07:29.880
that the optimization of how you select data,
link |
01:07:33.760
so the active learning process is going to be a place
link |
01:07:38.520
for a lot of breakthroughs, even in the near future?
link |
01:07:42.120
Because there hasn't been many breakthroughs there
link |
01:07:44.040
that are public.
link |
01:07:45.080
I feel like there might be private breakthroughs
link |
01:07:47.560
that companies keep to themselves
link |
01:07:49.320
because the fundamental problem has to be solved
link |
01:07:51.480
if you want to solve self driving,
link |
01:07:52.920
if you want to solve a particular task.
link |
01:07:55.280
What do you think about the space in general?
link |
01:07:57.800
Yeah, so I think that for something like active learning,
link |
01:08:00.160
or in fact, for any kind of capability, like active learning,
link |
01:08:03.760
the thing that it really needs is a problem.
link |
01:08:05.800
It needs a problem that requires it.
link |
01:08:09.360
It's very hard to do research about the capability
link |
01:08:12.080
if you don't have a task,
link |
01:08:12.980
because then what's going to happen
link |
01:08:14.200
is that you will come up with an artificial task,
link |
01:08:16.720
get good results, but not really convince anyone.
link |
01:08:20.640
Right, like we're now past the stage
link |
01:08:22.960
where getting a result on MNIST, some clever formulation
link |
01:08:28.880
of MNIST will convince people.
link |
01:08:30.800
That's right, in fact, you could quite easily
link |
01:08:33.280
come up with a simple active learning scheme on MNIST
link |
01:08:35.320
and get a 10x speed up, but then, so what?
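Something like the following uncertainty-sampling loop is the kind of simple active learning scheme being gestured at (a sketch with `model`, `X`, `y`, and `oracle_label` as hypothetical stand-ins for a classifier, a dataset, and a labeling oracle):

```python
import numpy as np

labeled_X, labeled_y = X[:100], y[:100]     # small labeled seed set
pool = X[100:]                              # unlabeled pool

for _ in range(10):
    model.fit(labeled_X, labeled_y)
    probs = model.predict_proba(pool)       # (n_pool, n_classes)
    uncertainty = 1.0 - probs.max(axis=1)   # least-confident predictions
    pick = np.argsort(uncertainty)[-50:]    # query the 50 most uncertain points
    labeled_X = np.concatenate([labeled_X, pool[pick]])
    labeled_y = np.concatenate([labeled_y, oracle_label(pool[pick])])
    pool = np.delete(pool, pick, axis=0)    # remove the newly labeled points
```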
link |
01:08:39.560
And I think that with active learning,
link |
01:08:41.760
the need, active learning will naturally arise
link |
01:08:45.480
as problems that require it pop up.
link |
01:08:49.240
That's how I would, that's my take on it.
link |
01:08:51.840
There's another interesting thing
link |
01:08:54.140
that OpenAI has brought up with GPT2,
link |
01:08:56.100
which is when you create a powerful
link |
01:09:00.240
artificial intelligence system,
link |
01:09:01.460
and it was unclear what kind of detrimental,
link |
01:09:04.660
once you release GPT2,
link |
01:09:07.460
what kind of detrimental effect it will have.
link |
01:09:09.580
Because if you have a model
link |
01:09:11.540
that can generate a pretty realistic text,
link |
01:09:14.080
you can start to imagine that it would be used by bots
link |
01:09:18.340
in some way that we can't even imagine.
link |
01:09:21.740
So there's this nervousness about what is possible to do.
link |
01:09:24.460
So you did a really kind of brave
link |
01:09:27.100
and I think profound thing,
link |
01:09:28.180
which is start a conversation about this.
link |
01:09:30.100
How do we release powerful artificial intelligence models
link |
01:09:34.900
to the public?
link |
01:09:36.100
If we do it at all, how do we privately discuss
link |
01:09:39.780
with other, even competitors,
link |
01:09:42.200
about how we manage the use of the systems and so on?
link |
01:09:46.060
So from this whole experience,
link |
01:09:47.980
you released a report on it,
link |
01:09:49.580
but in general, are there any insights
link |
01:09:51.820
that you've gathered from just thinking about this,
link |
01:09:55.340
about how you release models like this?
link |
01:09:57.740
I mean, I think that my take on this
link |
01:10:00.700
is that the field of AI has been in a state of childhood.
link |
01:10:05.060
And now it's exiting that state
link |
01:10:06.860
and it's entering a state of maturity.
link |
01:10:09.660
What that means is that AI is very successful
link |
01:10:12.340
and also very impactful.
link |
01:10:14.140
And its impact is not only large, but it's also growing.
link |
01:10:16.980
And so for that reason, it seems wise to start thinking
link |
01:10:21.980
about the impact of our systems before releasing them,
link |
01:10:24.940
maybe a little bit too soon, rather than a little bit too late.
link |
01:10:28.700
And with the case of GPT2, like I mentioned earlier,
link |
01:10:31.900
the results really were stunning.
link |
01:10:34.060
And it seemed plausible, it didn't seem certain,
link |
01:10:37.700
it seemed plausible that something like GPT2
link |
01:10:40.540
could easily be used to reduce the cost of disinformation.
link |
01:10:44.540
And so there was a question of what's the best way
link |
01:10:47.060
to release it, and a staged release seemed logical.
link |
01:10:49.380
A small model was released,
link |
01:10:51.220
and there was time to see.
link |
01:10:54.980
Many people used these models in lots of cool ways.
link |
01:10:57.300
There've been lots of really cool applications.
link |
01:10:59.700
There haven't been any negative applications that we know of.
link |
01:11:03.820
And so eventually it was released,
link |
01:11:05.180
but also other people replicated similar models.
link |
01:11:07.620
That's an interesting qualifier though, "that we know of."
link |
01:11:10.260
So in your view, staged release,
link |
01:11:12.860
is at least part of the answer to the question of how do we,
link |
01:11:20.620
what do we do once we create a system like this?
link |
01:11:22.980
It's part of the answer, yes.
link |
01:11:24.980
Is there any other insights?
link |
01:11:26.900
Like say you don't wanna release the model at all,
link |
01:11:29.340
because it's useful to you for whatever the business is.
link |
01:11:32.820
Well, plenty of people don't release models already.
link |
01:11:36.020
Right, of course, but is there some moral,
link |
01:11:39.660
ethical responsibility when you have a very powerful model
link |
01:11:43.340
to sort of communicate?
link |
01:11:44.860
Like, just as you said, when you had GPT2,
link |
01:11:48.580
it was unclear how much it could be used for misinformation.
link |
01:11:51.340
It's an open question, and getting an answer to that
link |
01:11:54.780
might require that you talk to other really smart people
link |
01:11:57.700
that are outside of your particular group.
link |
01:12:00.940
Have you, please tell me there's some optimistic pathway
link |
01:12:05.500
for people to be able to use this model
link |
01:12:08.900
for people across the world to collaborate
link |
01:12:11.380
on these kinds of cases?
link |
01:12:14.740
Or is it still really difficult from one company
link |
01:12:17.940
to talk to another company?
link |
01:12:19.660
So it's definitely possible.
link |
01:12:21.380
It's definitely possible to discuss these kind of models
link |
01:12:26.220
with colleagues elsewhere,
link |
01:12:28.380
and to get their take on what to do.
link |
01:12:32.300
How hard is it though?
link |
01:12:33.740
I mean.
link |
01:12:36.540
Do you see that happening?
link |
01:12:38.140
I think that's a place where it's important
link |
01:12:40.620
to gradually build trust between companies.
link |
01:12:43.380
Because ultimately, all the AI developers
link |
01:12:47.180
are building technology which is going to be
link |
01:12:48.860
increasingly more powerful.
link |
01:12:50.860
And so it's,
link |
01:12:54.780
the way to think about it is that ultimately
link |
01:12:56.340
we're all in it together.
link |
01:12:58.660
Yeah, I tend to believe in the better angels of our nature,
link |
01:13:03.660
but I do hope that when you build a really powerful
link |
01:13:09.820
AI system in a particular domain,
link |
01:13:11.860
that you also think about the potential
link |
01:13:14.700
negative consequences of, yeah.
link |
01:13:21.420
It's an interesting and scary possibility
link |
01:13:23.020
that there will be a race for AI development
link |
01:13:26.340
that would push people to close that development,
link |
01:13:29.340
and not share ideas with others.
link |
01:13:31.180
I don't love this.
link |
01:13:32.460
I've been a pure academic for 10 years.
link |
01:13:34.340
I really like sharing ideas and it's fun, it's exciting.
link |
01:13:39.220
What do you think it takes to,
link |
01:13:40.420
let's talk about AGI a little bit.
link |
01:13:42.180
What do you think it takes to build a system
link |
01:13:44.100
of human level intelligence?
link |
01:13:45.660
We talked about reasoning,
link |
01:13:47.300
we talked about long term memory, but in general,
link |
01:13:50.060
what does it take, do you think?
link |
01:13:51.380
Well, I can't be sure.
link |
01:13:55.140
But I think the deep learning,
link |
01:13:57.100
plus maybe another,
link |
01:13:58.940
plus maybe another small idea.
link |
01:14:03.740
Do you think self play will be involved?
link |
01:14:05.580
So you've spoken about the powerful mechanism of self play
link |
01:14:09.020
where systems learn by sort of exploring the world
link |
01:14:15.300
in a competitive setting against other entities
link |
01:14:18.340
that are similarly skilled as them,
link |
01:14:20.540
and so incrementally improve in this way.
link |
01:14:23.020
Do you think self play will be a component
link |
01:14:24.540
of building an AGI system?
link |
01:14:26.660
Yeah, so what I would say, to build AGI,
link |
01:14:29.420
I think it's going to be deep learning plus some ideas.
link |
01:14:34.180
And I think self play will be one of those ideas.
link |
01:14:37.780
I think that that is a very,
link |
01:14:41.380
self play has this amazing property
link |
01:14:43.980
that it can surprise us in truly novel ways.
link |
01:14:48.780
For example, like we, I mean,
link |
01:14:53.020
pretty much every self play system,
link |
01:14:55.740
both our Dota bot.
link |
01:14:58.420
I don't know if, OpenAI had a release about multi agent
link |
01:15:02.660
where you had two little agents
link |
01:15:04.340
who were playing hide and seek.
link |
01:15:06.060
And of course, also AlphaZero.
link |
01:15:08.220
They all produced surprising behaviors.
link |
01:15:11.020
They all produce behaviors that we didn't expect.
link |
01:15:13.180
They are creative solutions to problems.
link |
01:15:15.820
And that seems like an important part of AGI
link |
01:15:18.700
that our systems don't exhibit routinely right now.
link |
01:15:22.180
And so that's why I like this area.
link |
01:15:24.900
I like this direction because of its ability to surprise us.
link |
01:15:27.540
To surprise us.
link |
01:15:28.380
And an AGI system would surprise us fundamentally.
link |
01:15:31.180
Yes.
link |
01:15:32.020
And to be precise, not just a random surprise,
link |
01:15:34.500
but to find the surprising solution to a problem
link |
01:15:37.900
that's also useful.
link |
01:15:39.140
Right.
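For concreteness, the self-play loop described above looks roughly like this sketch (everything here, `policy`, `Game`, `play`, `update`, `snapshot`, is a hypothetical stand-in, not any particular system's API): the agent trains against a recent frozen copy of itself, so its opponent stays similarly skilled as it improves.

```python
num_iterations, games_per_iter = 1000, 64    # assumed settings for this sketch

opponent = snapshot(policy)                  # frozen copy of the current agent
for _ in range(num_iterations):
    games = [play(Game(), policy, opponent) for _ in range(games_per_iter)]
    update(policy, games)                    # learn from wins and losses
    win_rate = sum(g.winner == "policy" for g in games) / len(games)
    if win_rate > 0.55:                      # keep the opponent competitive
        opponent = snapshot(policy)
```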
link |
01:15:39.980
Now, a lot of the self play mechanisms
link |
01:15:42.620
have been used in the game context
link |
01:15:45.620
or at least in the simulation context.
link |
01:15:48.380
How far along the path to AGI
link |
01:15:55.100
do you think will be done in simulation?
link |
01:15:56.700
How much faith, promise do you have in simulation
link |
01:16:01.340
versus having to have a system
link |
01:16:03.060
that operates in the real world?
link |
01:16:05.620
Whether it's the real world of digital data,
link |
01:16:09.860
or the real world, like the actual physical world of robotics.
link |
01:16:13.220
I don't think it's an either or.
link |
01:16:15.060
I think simulation is a tool and it helps.
link |
01:16:17.540
It has certain strengths and certain weaknesses
link |
01:16:19.700
and we should use it.
link |
01:16:21.500
Yeah, but okay, I understand that.
link |
01:16:24.540
That's true, but one of the criticisms of self play,
link |
01:16:32.740
one of the criticisms of reinforcement learning
link |
01:16:34.820
is that its current results,
link |
01:16:41.060
while amazing, have been demonstrated
link |
01:16:42.940
in simulated environments
link |
01:16:44.820
or very constrained physical environments.
link |
01:16:46.420
Do you think it's possible to escape them,
link |
01:16:49.180
escape the simulated environments
link |
01:16:50.780
and be able to learn in non simulator environments?
link |
01:16:53.420
Or do you think it's possible to also just simulate
link |
01:16:57.020
in a photo realistic and physics realistic way,
link |
01:17:01.140
the real world in a way that we can solve real problems
link |
01:17:03.780
with self play in simulation?
link |
01:17:06.740
So I think that transfer from simulation to the real world
link |
01:17:10.380
is definitely possible and has been exhibited many times
link |
01:17:14.140
by many different groups.
link |
01:17:16.060
It's been especially successful in vision.
link |
01:17:18.660
Also OpenAI in the summer demonstrated a robot hand
link |
01:17:22.660
which was trained entirely in simulation
link |
01:17:25.260
in a certain way that allowed for sim to real transfer
link |
01:17:27.820
to occur.
link |
01:17:29.860
Is this for the Rubik's cube?
link |
01:17:31.420
Yeah, that's right.
link |
01:17:32.660
I wasn't aware that was trained in simulation.
link |
01:17:34.660
It was trained in simulation entirely.
link |
01:17:37.020
Really, so it wasn't in the physical,
link |
01:17:39.420
the hand wasn't trained?
link |
01:17:40.980
No, 100% of the training was done in simulation
link |
01:17:44.820
and the policy that was learned in simulation
link |
01:17:46.900
was trained to be very adaptive.
link |
01:17:48.980
So adaptive that when you transfer it,
link |
01:17:50.940
it could very quickly adapt to the physical world.
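The recipe being described is usually called domain randomization; a rough sketch follows, with `Simulator`, `policy`, `run_policy`, and `update_policy` as hypothetical stand-ins and the randomized parameters chosen purely for illustration:

```python
import random

num_episodes = 10_000                        # assumed setting for this sketch

def randomized_sim():
    # A fresh "world" each episode: physics the policy can never rely on.
    return Simulator(
        friction=random.uniform(0.5, 1.5),
        object_mass=random.uniform(0.05, 0.2),
        motor_delay=random.uniform(0.0, 0.03),
    )

for episode in range(num_episodes):
    sim = randomized_sim()
    trajectory = run_policy(policy, sim)     # policy must adapt on the fly
    update_policy(policy, trajectory)        # e.g. an RL update such as PPO
```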
link |
01:17:53.940
So the kind of perturbations with the giraffe
link |
01:17:57.380
or whatever the heck it was,
link |
01:17:58.900
those weren't, were those part of the simulation?
link |
01:18:01.860
Well, the simulation was generally,
link |
01:18:04.140
so the policy was trained to be robust
link |
01:18:07.060
to many different things,
link |
01:18:08.140
but not the kind of perturbations we've had in the video.
link |
01:18:10.580
So it's never been trained with a glove.
link |
01:18:12.660
It's never been trained with a stuffed giraffe.
link |
01:18:17.060
So in theory, these are novel perturbations.
link |
01:18:19.340
Correct, it's not in theory, in practice.
link |
01:18:22.020
Those are novel perturbations?
link |
01:18:23.780
Well, that's okay.
link |
01:18:26.420
That's a clean, small scale,
link |
01:18:28.460
but clean example of a transfer
link |
01:18:29.940
from the simulated world to the physical world.
link |
01:18:32.140
Yeah, and I will also say
link |
01:18:33.220
that I expect the transfer capabilities
link |
01:18:35.620
of deep learning to increase in general.
link |
01:18:38.180
And the better the transfer capabilities are,
link |
01:18:40.540
the more useful simulation will become.
link |
01:18:43.660
Because then you could take,
link |
01:18:45.260
you could experience something in simulation
link |
01:18:48.540
and then learn the moral of the story,
link |
01:18:50.340
which you could then carry with you to the real world.
link |
01:18:53.540
As humans do all the time when they play computer games.
link |
01:18:56.980
So let me ask sort of a embodied question,
link |
01:19:01.740
staying on AGI for a sec.
link |
01:19:04.660
Do you think an AGI system would need to have a body?
link |
01:19:07.740
Would it need to have some of those human elements
link |
01:19:09.580
of self awareness, consciousness,
link |
01:19:13.020
sort of fear of mortality,
link |
01:19:15.100
sort of self preservation in the physical space,
link |
01:19:18.140
which comes with having a body.
link |
01:19:20.340
I think having a body will be useful.
link |
01:19:22.420
I don't think it's necessary,
link |
01:19:24.340
but I think it's very useful to have a body for sure,
link |
01:19:26.260
because you can learn a whole new,
link |
01:19:28.900
you can learn things which cannot be learned without a body.
link |
01:19:32.500
But at the same time, I think that if you don't have a body,
link |
01:19:35.420
you could compensate for it and still succeed.
link |
01:19:38.580
You think so?
link |
01:19:39.420
Yes.
link |
01:19:40.260
Well, there is evidence for this.
link |
01:19:41.100
For example, there are many people who were born deaf
link |
01:19:43.340
and blind and they were able to compensate
link |
01:19:46.580
for the lack of modalities.
link |
01:19:48.260
I'm thinking about Helen Keller specifically.
link |
01:19:51.580
So even if you're not able to physically interact
link |
01:19:53.860
with the world, and if you're not able to,
link |
01:19:56.940
I mean, I actually was getting at,
link |
01:19:59.660
maybe let me ask on the more particular,
link |
01:20:02.700
I'm not sure if it's connected to having a body or not,
link |
01:20:05.380
but the idea of consciousness
link |
01:20:07.860
and a more constrained version of that is self awareness.
link |
01:20:11.260
Do you think an AGI system should have consciousness?
link |
01:20:16.300
Which we can't define, whatever the heck you think consciousness is.
link |
01:20:19.420
Yeah, hard question to answer,
link |
01:20:21.580
given how hard it is to define it.
link |
01:20:24.780
Do you think it's useful to think about?
link |
01:20:26.460
I mean, it's definitely interesting.
link |
01:20:28.380
It's fascinating.
link |
01:20:29.860
I think it's definitely possible
link |
01:20:31.820
that our systems will be conscious.
link |
01:20:33.900
Do you think that's an emergent thing that just comes from,
link |
01:20:36.420
do you think consciousness could emerge
link |
01:20:37.780
from the representation that's stored within neural networks?
link |
01:20:40.860
So it naturally just emerges
link |
01:20:42.980
when you become more and more able
link |
01:20:45.100
to represent more and more of the world?
link |
01:20:47.020
Well, I'd say I'd make the following argument,
link |
01:20:48.780
which is humans are conscious.
link |
01:20:53.820
And if you believe that artificial neural nets
link |
01:20:56.060
are sufficiently similar to the brain,
link |
01:20:59.540
then there should at least exist artificial neural nets
link |
01:21:02.700
that should be conscious too.
link |
01:21:04.260
You're leaning on that existence proof pretty heavily.
link |
01:21:06.620
Okay, so that's the best answer I can give.
link |
01:21:12.100
No, I know, I know, I know.
link |
01:21:15.980
There's still an open question
link |
01:21:17.100
if there's not some magic in the brain that we're not,
link |
01:21:20.780
I mean, I don't mean a non-materialistic magic,
link |
01:21:23.620
but that the brain might be a lot more complicated
link |
01:21:27.780
and interesting than we give it credit for.
link |
01:21:29.900
If that's the case, then it should show up.
link |
01:21:32.500
And at some point we will find out
link |
01:21:35.140
that we can't continue to make progress.
link |
01:21:36.580
But I think it's unlikely.
link |
01:21:38.740
So we talk about consciousness,
link |
01:21:40.180
but let me talk about another poorly defined concept
link |
01:21:42.380
of intelligence.
link |
01:21:44.580
Again, we've talked about reasoning,
link |
01:21:46.860
we've talked about memory.
link |
01:21:48.100
What do you think is a good test of intelligence for you?
link |
01:21:51.660
Are you impressed by the test that Alan Turing formulated
link |
01:21:55.700
with the imitation game with natural language?
link |
01:21:58.580
Is there something in your mind
link |
01:22:01.100
that you will be deeply impressed by
link |
01:22:04.260
if a system was able to do?
link |
01:22:06.420
I mean, lots of things.
link |
01:22:07.980
There's a certain frontier of capabilities today.
link |
01:22:13.260
And there exist things outside of that frontier.
link |
01:22:16.900
And I would be impressed by any such thing.
link |
01:22:18.980
For example, I would be impressed by a deep learning system
link |
01:22:24.580
which solves a very pedestrian task,
link |
01:22:27.260
like machine translation or computer vision task
link |
01:22:29.700
or something which never makes a mistake
link |
01:22:33.420
a human wouldn't make under any circumstances.
link |
01:22:37.300
I think that is something
link |
01:22:38.540
which has not yet been demonstrated
link |
01:22:40.060
and I would find it very impressive.
link |
01:22:42.740
Yeah, so right now they make mistakes in different ways,
link |
01:22:44.860
they might be more accurate than human beings,
link |
01:22:46.580
but they still make a different set of mistakes.
link |
01:22:49.100
So I would guess that a lot of the skepticism
link |
01:22:53.420
that some people have about deep learning
link |
01:22:55.780
is when they look at their mistakes and they say,
link |
01:22:57.380
well, those mistakes, they make no sense.
link |
01:23:00.260
Like if you understood the concept,
link |
01:23:01.660
you wouldn't make that mistake.
link |
01:23:04.060
And I think that changing that would be,
link |
01:23:07.380
that would inspire me.
link |
01:23:09.380
That would be, yes, this is progress.
link |
01:23:12.580
Yeah, that's a really nice way to put it.
link |
01:23:15.460
But I also just don't like that human instinct
link |
01:23:18.580
to criticize a model as not intelligent.
link |
01:23:21.540
That's the same instinct we have
link |
01:23:23.180
when we criticize any group of creatures as the other.
link |
01:23:28.820
Because it's very possible that GPT-2
link |
01:23:33.500
is much smarter than human beings at many things.
link |
01:23:36.420
That's definitely true.
link |
01:23:37.620
It has a lot more breadth of knowledge.
link |
01:23:39.380
Yes, breadth of knowledge
link |
01:23:41.020
and even perhaps depth on certain topics.
link |
01:23:46.140
It's kind of hard to judge what depth means,
link |
01:23:48.380
but there's definitely a sense in which
link |
01:23:51.180
humans don't make mistakes that these models do.
link |
01:23:54.860
The same is applied to autonomous vehicles.
link |
01:23:57.780
The same is probably gonna continue being applied
link |
01:23:59.700
to a lot of artificial intelligence systems.
link |
01:24:01.740
We find, this is the annoying thing.
link |
01:24:04.100
This is the process of, in the 21st century,
link |
01:24:06.820
the process of analyzing the progress of AI
link |
01:24:09.460
is the search for one case where the system fails
link |
01:24:13.380
in a big way where humans would not.
link |
01:24:17.020
And then many people writing articles about it.
link |
01:24:20.460
And then broadly, the public generally gets convinced
link |
01:24:24.820
that the system is not intelligent.
link |
01:24:26.580
And we pacify ourselves by thinking it's not intelligent
link |
01:24:29.860
because of this one anecdotal case.
link |
01:24:31.980
And this seems to continue happening.
link |
01:24:34.540
Yeah, I mean, there is truth to that.
link |
01:24:36.900
Although I'm sure that plenty of people
link |
01:24:38.140
are also extremely impressed
link |
01:24:39.220
by the system that exists today.
link |
01:24:40.860
But I think this connects to the earlier point
link |
01:24:42.500
we discussed that it's just confusing
link |
01:24:45.020
to judge progress in AI.
link |
01:24:47.100
Yeah.
link |
01:24:47.940
And you have a new robot demonstrating something.
link |
01:24:50.700
How impressed should you be?
link |
01:24:52.700
And I think that people will start to be impressed
link |
01:24:55.980
once AI starts to really move the needle on the GDP.
link |
01:25:00.380
So you're one of the people that might be able
link |
01:25:02.020
to create an AGI system here.
link |
01:25:03.740
Not you, but you and OpenAI.
link |
01:25:06.820
If you do create an AGI system
link |
01:25:09.020
and you get to spend sort of the evening
link |
01:25:11.940
with it, him, her, what would you talk about, do you think?
link |
01:25:17.900
The very first time?
link |
01:25:19.140
First time.
link |
01:25:19.980
Well, the first time I would just ask all kinds of questions
link |
01:25:23.620
and try to get it to make a mistake.
link |
01:25:25.700
And I would be amazed that it doesn't make mistakes
link |
01:25:28.100
and just keep asking broad questions.
link |
01:25:33.100
What kind of questions do you think?
link |
01:25:34.940
Would they be factual or would they be personal,
link |
01:25:39.100
emotional, psychological?
link |
01:25:40.940
What do you think?
link |
01:25:42.500
All of the above.
link |
01:25:46.100
Would you ask for advice?
link |
01:25:47.260
Definitely.
link |
01:25:49.260
I mean, why would I limit myself
link |
01:25:51.580
talking to a system like this?
link |
01:25:53.140
Now, again, let me emphasize the fact
link |
01:25:56.100
that you truly are one of the people
link |
01:25:57.780
that might be in the room where this happens.
link |
01:26:01.220
So let me ask sort of a profound question about,
link |
01:26:06.540
I've just talked to a Stalin historian.
link |
01:26:08.540
I've been talking to a lot of people who are studying power.
link |
01:26:13.180
Abraham Lincoln said,
link |
01:26:14.780
"'Nearly all men can stand adversity,
link |
01:26:17.700
"'but if you want to test a man's character, give him power.'"
link |
01:26:21.380
I would say the power of the 21st century,
link |
01:26:24.700
maybe the 22nd, but hopefully the 21st,
link |
01:26:28.460
would be the creation of an AGI system
link |
01:26:30.260
and the people who have control,
link |
01:26:33.420
direct possession and control of the AGI system.
link |
01:26:36.260
So what do you think, after spending that evening
link |
01:26:39.500
having a discussion with the AGI system,
link |
01:26:42.900
what do you think you would do?
link |
01:26:45.500
Well, the ideal world I'd like to imagine
link |
01:26:50.180
is one where humanity
link |
01:26:52.820
are like the board members of a company
link |
01:26:57.940
where the AGI is the CEO.
link |
01:26:59.500
So it would be,
link |
01:27:04.500
the picture which I would imagine
link |
01:27:05.860
is you have some kind of different entities,
link |
01:27:09.540
different countries or cities,
link |
01:27:11.700
and the people that live there vote
link |
01:27:13.220
for what the AGI that represents them should do,
link |
01:27:16.220
and the AGI that represents them goes and does it.
link |
01:27:18.660
I think a picture like that, I find very appealing.
link |
01:27:23.660
You could have multiple AGIs,
link |
01:27:24.500
you would have an AGI for a city, for a country,
link |
01:27:26.620
and there would be multiple AGIs, and
link |
01:27:30.740
it would be trying to, in effect,
link |
01:27:33.980
take the democratic process to the next level.
link |
01:27:36.060
And the board can always fire the CEO.
link |
01:27:38.660
Essentially, press the reset button, say.
link |
01:27:40.660
Press the reset button.
link |
01:27:41.500
Rerandomize the parameters.
link |
01:27:42.940
But let me sort of, that's actually,
link |
01:27:45.980
okay, that's a beautiful vision, I think,
link |
01:27:49.060
as long as it's possible to press the reset button.
link |
01:27:53.460
Do you think it will always be possible
link |
01:27:54.980
to press the reset button?
link |
01:27:56.380
So I think that it definitely will be possible to build.
link |
01:28:02.100
So you're talking, so the question
link |
01:28:03.860
that I really understand from you is,
link |
01:28:06.620
will humans, or people, have control
link |
01:28:12.500
over the AI systems that they build?
link |
01:28:14.260
Yes.
link |
01:28:15.100
And my answer is, it's definitely possible
link |
01:28:17.300
to build AI systems which will want
link |
01:28:19.580
to be controlled by their humans.
link |
01:28:21.820
Wow, that's part of their,
link |
01:28:24.020
so it's not just that they can't help but be controlled,
link |
01:28:26.180
but that, as they exist,
link |
01:28:31.540
one of the objectives of their existence
link |
01:28:33.500
is to be controlled.
link |
01:28:34.500
In the same way that human parents
link |
01:28:39.780
generally want to help their children,
link |
01:28:42.460
they want their children to succeed.
link |
01:28:44.420
It's not a burden for them.
link |
01:28:46.020
They are excited to help children and to feed them
link |
01:28:49.340
and to dress them and to take care of them.
link |
01:28:52.460
And I believe with high conviction
link |
01:28:56.300
that the same will be possible for an AGI.
link |
01:28:58.900
It will be possible to program an AGI,
link |
01:29:00.500
to design it in such a way
link |
01:29:01.700
that it will have a similar deep drive
link |
01:29:04.820
that it will be delighted to fulfill.
link |
01:29:07.060
And the drive will be to help humans flourish.
link |
01:29:11.180
But let me take a step back to that moment
link |
01:29:13.940
where you create the AGI system.
link |
01:29:15.460
I think this is a really crucial moment.
link |
01:29:17.460
And between that moment
link |
01:29:21.660
and the Democratic board members with the AGI at the head,
link |
01:29:28.900
there has to be a relinquishing of power.
link |
01:29:31.860
So as George Washington, despite all the bad things he did,
link |
01:29:36.500
one of the big things he did is he relinquished power.
link |
01:29:39.340
He, first of all, didn't want to be president.
link |
01:29:42.180
And even when he became president,
link |
01:29:43.780
he didn't just keep serving
link |
01:29:45.960
as most dictators do, indefinitely.
link |
01:29:49.140
Do you see yourself being able to relinquish control
link |
01:29:55.180
over an AGI system,
link |
01:29:56.300
given how much power you can have over the world,
link |
01:29:59.300
at first financial, just make a lot of money, right?
link |
01:30:02.780
And then control by having possession of an AGI system.
link |
01:30:07.020
I'd find it trivial to do that.
link |
01:30:09.060
I'd find it trivial to relinquish this kind of power.
link |
01:30:11.500
I mean, the kind of scenario you are describing
link |
01:30:15.100
sounds terrifying to me.
link |
01:30:17.420
That's all.
link |
01:30:19.020
I would absolutely not want to be in that position.
link |
01:30:22.420
Do you think you represent the majority
link |
01:30:25.680
or the minority of people in the AI community?
link |
01:30:29.420
Well, I mean.
link |
01:30:30.740
It's an open question, an important one.
link |
01:30:33.780
"Are most people good?" is another way to ask it.
link |
01:30:36.540
So I don't know if most people are good,
link |
01:30:39.340
but I think that when it really counts,
link |
01:30:44.340
people can be better than we think.
link |
01:30:47.040
That's beautifully put, yeah.
link |
01:30:49.260
Are there specific mechanisms you can think of
link |
01:30:51.480
for aligning AI values to human values?
link |
01:30:54.580
Do you think about these problems
link |
01:30:56.680
of continued alignment as we develop the AI systems?
link |
01:31:00.320
Yeah, definitely.
link |
01:31:02.780
In some sense, the kind of question which you are asking is,
link |
01:31:07.320
so if I were to translate the question to today's terms,
link |
01:31:10.660
it would be a question about how to get an RL agent
link |
01:31:17.040
that's optimizing a value function which itself is learned.
link |
01:31:21.160
And if you look at humans, humans are like that
link |
01:31:23.160
because the reward function, the value function of humans
link |
01:31:26.280
is not external, it is internal.
link |
01:31:28.800
That's right.
link |
01:31:30.160
And there are definite ideas
link |
01:31:33.880
of how to train a value function.
link |
01:31:36.760
Basically, you know,
link |
01:31:39.120
an as objective as possible perception system
link |
01:31:42.560
that will be trained separately to recognize,
link |
01:31:47.640
to internalize human judgments on different situations.
link |
01:31:51.960
And then that component would then be integrated
link |
01:31:54.640
as the base value function
link |
01:31:56.520
for some more capable RL system.
link |
01:31:59.040
You could imagine a process like this.
link |
01:32:00.600
I'm not saying this is the process,
link |
01:32:02.440
I'm saying this is an example
link |
01:32:03.800
of the kind of thing you could do.
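As a concrete, hedged illustration of the scheme described here, one could train a separate model on human comparisons of situations and then hand the learned score to an RL agent as its reward. The sketch below uses a Bradley-Terry style preference loss; all class and function names are illustrative assumptions, not a description of OpenAI's systems.

```python
# A hedged sketch of learning a reward/value model from human judgments.
# Shapes, names, and the data format are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a situation (a fixed-size feature vector) to a scalar score
    intended to internalize human judgments."""
    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)

def train_on_comparisons(model, comparisons, epochs=10, lr=1e-3):
    """comparisons: list of (preferred, rejected) observation tensor pairs,
    i.e. situations where a human judged one outcome better than another.

    Bradley-Terry style loss: push the model to score the human-preferred
    situation higher than the rejected one.
    """
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for better, worse in comparisons:
            margin = model(better) - model(worse)
            loss = -F.logsigmoid(margin).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()

# Downstream, the frozen reward model replaces a hand-written reward:
#     reward = reward_model(observation)
# and any standard RL algorithm optimizes against it.
```

The point of the two-stage design is exactly the one made above: the judgment component is trained separately from the agent that optimizes it, so the agent's objective is grounded in internalized human preferences rather than a fixed external formula.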
link |
01:32:05.700
So on that topic of the objective functions
link |
01:32:11.140
of human existence,
link |
01:32:12.120
what do you think is the objective function
link |
01:32:15.020
that's implicit in human existence?
link |
01:32:17.420
What's the meaning of life?
link |
01:32:18.920
Oh.
link |
01:32:28.860
I think the question is wrong in some way.
link |
01:32:31.460
I think that the question implies
link |
01:32:33.780
that there is an objective answer
link |
01:32:35.620
which is an external answer,
link |
01:32:36.580
you know, your meaning of life is X.
link |
01:32:38.620
I think what's going on is that we exist
link |
01:32:40.740
and that's amazing.
link |
01:32:44.220
And we should try to make the most of it
link |
01:32:45.660
and try to maximize our own value
link |
01:32:48.180
and enjoyment of the very short time while we do exist.
link |
01:32:53.220
It's funny,
link |
01:32:54.060
because action does require an objective function,
link |
01:32:58.600
which is definitely there in some form,
link |
01:32:58.600
but it's difficult to make it explicit
link |
01:33:01.080
and maybe impossible to make it explicit,
link |
01:33:02.840
I guess is what you're getting at.
link |
01:33:03.940
And that's an interesting fact of an RL environment.
link |
01:33:08.140
Well, but I was making a slightly different point
link |
01:33:10.540
which is that humans want things
link |
01:33:13.360
and their wants create the drives that cause them to,
link |
01:33:16.980
you know, our wants are our objective functions,
link |
01:33:19.900
our individual objective functions.
link |
01:33:21.960
We can later decide that we want to change,
link |
01:33:24.340
that what we wanted before is no longer good
link |
01:33:26.060
and we want something else.
link |
01:33:27.280
Yeah, but they're so dynamic,
link |
01:33:29.020
there's gotta be some underlying sort of Freudian
link |
01:33:32.180
things, there's like sexual stuff,
link |
01:33:33.980
there's people who think it's the fear of death
link |
01:33:37.220
and there's also the desire for knowledge
link |
01:33:40.300
and you know, all these kinds of things,
link |
01:33:42.100
procreation, sort of all the evolutionary arguments,
link |
01:33:46.220
it seems to be,
link |
01:33:47.100
there might be some kind of fundamental objective function
link |
01:33:49.500
from which everything else emerges,
link |
01:33:54.100
but it seems like it's very difficult to make it explicit.
link |
01:33:56.860
I think there probably is an evolutionary objective function
link |
01:33:58.900
which is to survive and procreate
link |
01:34:00.260
and make sure your children succeed.
link |
01:34:02.560
That would be my guess,
link |
01:34:04.260
but it doesn't give an answer to the question
link |
01:34:06.860
of what's the meaning of life.
link |
01:34:08.180
I think you can see how humans are part of this big process,
link |
01:34:13.260
this ancient process.
link |
01:34:14.340
We exist on a small planet and that's it.
link |
01:34:20.780
So given that we exist, try to make the most of it
link |
01:34:24.220
and try to enjoy more and suffer less as much as we can.
link |
01:34:28.080
Let me ask two silly questions about life.
link |
01:34:32.800
One, do you have regrets?
link |
01:34:34.780
Moments that if you went back, you would do differently.
link |
01:34:39.000
And two, are there moments that you're especially proud of
link |
01:34:42.320
that made you truly happy?
link |
01:34:44.720
So I can answer that, I can answer both questions.
link |
01:34:47.520
Of course, there's a huge number of choices
link |
01:34:51.240
and decisions that I've made
link |
01:34:52.440
that with the benefit of hindsight,
link |
01:34:54.240
I wouldn't have made them.
link |
01:34:55.480
And I do experience some regret,
link |
01:34:56.940
but I try to take solace in the knowledge
link |
01:35:00.120
that at the time I did the best I could.
link |
01:35:02.920
And in terms of things that I'm proud of,
link |
01:35:04.680
I'm very fortunate to have done things I'm proud of
link |
01:35:08.680
and they made me happy for some time,
link |
01:35:10.920
but I don't think that that is the source of happiness.
link |
01:35:14.640
So your academic accomplishments, all the papers,
link |
01:35:17.360
you're one of the most cited people in the world.
link |
01:35:19.940
All of the breakthroughs I mentioned
link |
01:35:21.720
in computer vision and language and so on,
link |
01:35:23.840
what is the source of happiness and pride for you?
link |
01:35:29.560
I mean, all those things are a source of pride for sure.
link |
01:35:31.400
I'm very grateful for having done all those things
link |
01:35:35.180
and it was very fun to do them.
link |
01:35:37.440
But happiness, you know, happiness,
link |
01:35:40.220
well, my current view is that happiness comes,
link |
01:35:42.600
to a very large degree,
link |
01:35:45.260
from the way we look at things.
link |
01:35:47.740
You know, you can have a simple meal
link |
01:35:49.160
and be quite happy as a result,
link |
01:35:51.320
or you can talk to someone and be happy as a result as well.
link |
01:35:54.880
Or conversely, you can have a meal and be disappointed
link |
01:35:58.200
that the meal wasn't a better meal.
link |
01:36:00.420
So I think a lot of happiness comes from that,
link |
01:36:02.360
but I'm not sure, I don't want to be too confident.
link |
01:36:05.520
Being humble in the face of the uncertainty
link |
01:36:07.840
seems to be also a part of this whole happiness thing.
link |
01:36:12.140
Well, I don't think there's a better way to end it
link |
01:36:14.040
than meaning of life and discussions of happiness.
link |
01:36:17.880
So Ilya, thank you so much.
link |
01:36:19.720
You've given me a few incredible ideas.
link |
01:36:22.600
You've given the world many incredible ideas.
link |
01:36:24.860
I really appreciate it and thanks for talking today.
link |
01:36:27.480
Yeah, thanks for stopping by, I really enjoyed it.
link |
01:36:30.520
Thanks for listening to this conversation
link |
01:36:32.040
with Ilya Sutskever and thank you
link |
01:36:33.960
to our presenting sponsor, Cash App.
link |
01:36:36.360
Please consider supporting the podcast
link |
01:36:38.120
by downloading Cash App and using the code LEXPodcast.
link |
01:36:42.600
If you enjoy this podcast, subscribe on YouTube,
link |
01:36:45.400
review it with five stars on Apple Podcast,
link |
01:36:47.960
support it on Patreon, or simply connect with me on Twitter
link |
01:36:51.420
at lexfridman.
link |
01:36:54.120
And now let me leave you with some words
link |
01:36:56.320
from Alan Turing on machine learning.
link |
01:37:00.140
Instead of trying to produce a program
link |
01:37:01.880
to simulate the adult mind,
link |
01:37:03.740
why not rather try to produce one
link |
01:37:06.240
which simulates the child?
link |
01:37:08.740
If this were then subjected
link |
01:37:10.240
to an appropriate course of education,
link |
01:37:12.500
one would obtain the adult brain.
link |
01:37:15.200
Thank you for listening and hope to see you next time.