
Ilya Sutskever: Deep Learning | Lex Fridman Podcast #94



link |
00:00:00.000
The following is a conversation with Ilya Sutskever, cofounder and chief scientist of OpenAI,
link |
00:00:06.080
one of the most cited computer scientists in history with over 165,000 citations,
link |
00:00:13.440
and to me, one of the most brilliant and insightful minds ever in the field of deep learning.
link |
00:00:19.920
There are very few people in this world who I would rather talk to and brainstorm with about
link |
00:00:24.240
deep learning, intelligence, and life in general than Ilya, on and off the mic. This was an honor
link |
00:00:31.920
and a pleasure. This conversation was recorded before the outbreak of the pandemic.
link |
00:00:37.120
For everyone feeling the medical, psychological, and financial burden of this crisis,
link |
00:00:41.360
I'm sending love your way. Stay strong, we're in this together, we'll beat this thing.
link |
00:00:47.120
This is the Artificial Intelligence Podcast. If you enjoy it, subscribe on YouTube,
link |
00:00:51.680
review it with 5 stars on Apple Podcasts, support on Patreon, or simply connect with me on Twitter
link |
00:00:56.880
at Lex Fridman, spelled F R I D M A N. As usual, I'll do a few minutes of ads now and never any
link |
00:01:03.440
ads in the middle that can break the flow of the conversation. I hope that works for you
link |
00:01:07.840
and doesn't hurt the listening experience. This show is presented by Cash App, the number one
link |
00:01:13.840
finance app in the App Store. When you get it, use code Lex Podcast. Cash App lets you send money
link |
00:01:20.000
to friends, buy Bitcoin, invest in the stock market with as little as $1. Since Cash App allows you
link |
00:01:26.480
to buy Bitcoin, let me mention that cryptocurrency in the context of the history of money is
link |
00:01:31.680
fascinating. I recommend The Ascent of Money as a great book on this history. Both the book and
link |
00:01:37.520
audiobook are great. Debits and credits on ledgers started around 30,000 years ago. The US dollar
link |
00:01:44.640
was created over 200 years ago, and Bitcoin, the first decentralized cryptocurrency, was released just over
link |
00:01:50.800
10 years ago. So given that history, cryptocurrency is still very much in its early days of development,
link |
00:01:56.800
but it's aiming to, and just might, redefine the nature of money. So again, if you get Cash App
link |
00:02:03.440
from the App Store, Google Play and use the code Lex Podcast, you get $10 and Cash App will also
link |
00:02:10.400
donate $10 to FIRST, an organization that is helping advance robotics and STEM education
link |
00:02:15.920
for young people around the world. And now, here's my conversation with Ilya Sutskever.
link |
00:02:23.280
You were one of the three authors, with Alex Krizhevsky and Geoff Hinton, of the famed AlexNet paper
link |
00:02:30.000
that is arguably the paper that marked the big catalytic moment that launched the deep learning
link |
00:02:36.000
revolution. At that time, take us back to that time. What was your intuition about neural networks,
link |
00:02:42.160
about the representational power of neural networks? And maybe you could mention how did
link |
00:02:47.840
that evolve over the next few years, up to today, over the 10 years? Yeah, I can answer that question.
link |
00:02:55.120
At some point in about 2010 or 2011, I connected two facts in my mind. Basically,
link |
00:03:03.200
the realization was this. At some point, we realized that we can train very large, I shouldn't say very
link |
00:03:11.680
you know, tiny by today's standards, but large and deep neural networks end to end with back
link |
00:03:17.280
propagation. At some point, different people obtained this result. I obtained this result.
link |
00:03:23.760
The first, the first moment in which I realized that deep neural networks are powerful was when
link |
00:03:29.200
James Martens invented the Hessian free optimizer in 2010. And he trained a 10 layer neural network
link |
00:03:36.240
end to end without pre training from scratch. And when that happened, I thought this is it.
link |
00:03:43.840
Because if you can train a big neural network, a big neural network can represent very complicated
link |
00:03:48.560
function. Because if you have a neural network with 10 layers, it's as though you allow the
link |
00:03:54.480
human brain to run for some number of milliseconds, neuron firings are slow. And so in maybe 100
link |
00:04:02.480
milliseconds, your neurons only fire 10 times. So it's also kind of like 10 layers. And in 100
link |
00:04:07.120
milliseconds, you can perfectly recognize any object. So I thought, so I already had the idea
link |
00:04:12.720
then that we need to train a very big neural network on lots of supervised data. And then it
link |
00:04:18.480
must succeed, because we can find the best neural network. And then there's also theory
link |
00:04:22.560
that if you have more data than parameters, you won't overfit. Today, we know that actually,
link |
00:04:26.720
this theory is very incomplete, and you won't overfit even if you have less data than parameters.
link |
00:04:30.240
But definitely, if you have more data than parameters, you won't overfit.
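As a rough illustration of the rule of thumb being referenced here (my own toy numbers, not from the conversation), here is a minimal least-squares sketch: with far more data points than parameters the fit barely overfits, while at the interpolation threshold it fits the noise.

```python
# A minimal sketch of the data-versus-parameters intuition, assuming a noisy
# linear target and ordinary least squares; numbers are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def fit_and_eval(n_train, n_features):
    w_true = rng.normal(size=n_features)
    X_train = rng.normal(size=(n_train, n_features))
    y_train = X_train @ w_true + 0.5 * rng.normal(size=n_train)
    X_test = rng.normal(size=(1000, n_features))
    y_test = X_test @ w_true + 0.5 * rng.normal(size=1000)
    w_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    train_err = np.mean((X_train @ w_hat - y_train) ** 2)
    test_err = np.mean((X_test @ w_hat - y_test) ** 2)
    return train_err, test_err

# Far more data than parameters: train and test error stay close to the noise level.
print(fit_and_eval(n_train=1000, n_features=20))
# As many parameters as data points: train error ~0, test error blows up.
print(fit_and_eval(n_train=20, n_features=20))
```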
link |
00:04:33.200
So the fact that neural networks were heavily overparameterized, wasn't discouraging to you.
link |
00:04:38.960
So you were thinking about the theory that the number of parameters,
link |
00:04:42.880
the fact there's a huge number of parameters is okay. Is it going to be okay?
link |
00:04:45.840
I mean, there was some evidence before that it was okay, but the theory was most,
link |
00:04:49.360
the theory was that if you had a big data set and a big neural net, it was going to work.
link |
00:04:52.880
The overparameterization just didn't really figure much as a problem. I thought, well,
link |
00:04:57.280
with images, you're just going to add some data augmentation, and it's going to be okay.
link |
00:05:00.240
So where was any doubt coming from? The main doubt was, can we train a big, do we
link |
00:05:04.320
really have enough compute to train a big enough neural net with back propagation?
link |
00:05:07.440
Back propagation, I thought, would work. The thing which wasn't clear was whether
link |
00:05:11.120
there would be enough compute to get a very convincing result. And then at some point,
link |
00:05:14.720
Alex Krizhevsky wrote these insanely fast CUDA kernels for training convolutional neural nets,
link |
00:05:19.040
and that was, bam, let's do this. Let's get ImageNet, and it's going to be the greatest thing.
link |
00:05:23.280
Was your intuition, most of your intuition from empirical results by you and by others?
link |
00:05:29.520
So like just actually demonstrating that a piece of program can train a 10 layer neural network?
link |
00:05:34.560
Or was there some pen and paper or marker and whiteboard thinking intuition?
link |
00:05:41.600
Because you just connected a 10 layer large neural network to the brain. So you just mentioned
link |
00:05:46.080
the brain. So in your intuition about neural networks, does the human brain come into play
link |
00:05:51.840
as an intuition builder? Definitely. I mean, you know, you got to be precise with these analogies
link |
00:05:57.360
between artificial neural networks and the brain. But there's no question that the brain
link |
00:06:02.720
is a huge source of intuition and inspiration for deep learning researchers since all the way
link |
00:06:08.400
from Rosenblatt in the 60s. Like if you look at the whole idea of a neural network is directly
link |
00:06:14.240
inspired by the brain. You had people like McCulloch and Pitts who were saying, hey, you got these
link |
00:06:20.080
neurons in the brain. And hey, we recently learned about the computer and automata.
link |
00:06:24.240
Can we use some ideas from the computer and automata to design some kind of computational
link |
00:06:28.240
object that's going to be simple, computational, and kind of like the brain, and they invented the
link |
00:06:33.280
neuron. So they were inspired by it back then. Then you had the convolutional neural network
link |
00:06:37.280
from Fukushima, and then later Yann LeCun, who said, hey, if you limit the receptive fields of
link |
00:06:42.000
a neural network, it's going to be especially suitable for images as it turned out to be true.
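A back-of-the-envelope comparison (my numbers, not from the conversation) of what limiting the receptive field buys you for a single layer on a 224x224 RGB image:

```python
# Parameter counts for one layer on a 224x224x3 image; illustrative numbers only.
h, w, c_in, c_out = 224, 224, 3, 64

# Fully connected layer: every output unit sees every input pixel.
fully_connected_params = (h * w * c_in) * (h * w * c_out)

# 3x3 convolution: each output unit sees only a local patch, and the same
# weights are reused at every spatial position (the limited receptive field).
k = 3
conv_params = (k * k * c_in) * c_out

print(f"fully connected: {fully_connected_params:,} weights")  # roughly 4.8e11
print(f"3x3 convolution: {conv_params:,} weights")             # 1,728
```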
link |
00:06:46.800
So there was a very small number of examples where analogies to the brain were successful.
link |
00:06:52.240
And then I thought, well, probably an artificial neuron is not that different from the brain if
link |
00:06:56.720
you squint hard enough. So let's just assume it is and roll with it. So we're now at a time where
link |
00:07:02.160
deep learning is very successful. So let us squint less and say, let's open our eyes and say, what
link |
00:07:09.600
to you is an interesting difference between the human brain? Now, I know you're probably not an
link |
00:07:14.960
expert, neither a neuroscientist nor a biologist, but loosely speaking, what's the
link |
00:07:19.920
difference between the human brain and artificial neural networks? That's interesting to you for
link |
00:07:24.000
the next decade or two. That's a good question to ask. What is an interesting difference between
link |
00:07:28.880
the neural between the brain and our artificial neural networks? So I feel like today,
link |
00:07:33.920
artificial neural networks, so we all agree that there are certain dimensions in which the human
link |
00:07:39.920
brain vastly outperforms our models. But I also think that there are some ways in which artificial
link |
00:07:45.200
neural networks have a number of very important advantages over the brain. Looking at the advantages
link |
00:07:51.360
versus disadvantages is a good way to figure out what is the important difference. So the brain
link |
00:07:57.360
uses spikes, which may or may not be important. Yes, that's a really interesting question. Do you
link |
00:08:01.680
think it's important or not? That's one big architectural difference between artificial
link |
00:08:07.200
neural networks and the brain. It's hard to tell, but my prior is not very high. And I can say why. There are
link |
00:08:13.520
people who are interested in spiking neural networks. And basically, what they figured out is
link |
00:08:17.920
that they need to simulate the non spiking neural networks in spikes. And that's how they're going
link |
00:08:23.200
to make them work. If you don't simulate the non spiking neural networks in spikes, it's not going
link |
00:08:27.280
to work because the question is, why should it work? And that connects to questions around back
link |
00:08:30.720
propagation and questions around deep learning. You've got this giant neural network. Why should
link |
00:08:37.200
it work at all? Why should that learning rule work at all? It's not a self evident question,
link |
00:08:44.560
especially if you, let's say if you were just starting in the field and you read the very
link |
00:08:48.000
early papers, you can say, Hey, people are saying, let's build neural networks. That's a great idea
link |
00:08:54.400
because the brain is a neural network. So it would be useful to build neural networks. Now,
link |
00:08:58.480
let's figure out how to train them. It should be possible to train them probably, but how?
link |
00:09:03.360
And so the big idea is the cost function. That's the big idea. The cost function is a way of measuring
link |
00:09:11.200
the performance of the system according to some measure. By the way, that is a big, actually,
link |
00:09:16.560
let me think. Is that a difficult idea to arrive at? And how big of an idea is that
link |
00:09:22.640
that there's a single cost function? Sorry, let me take a pause. Is supervised learning a difficult
link |
00:09:31.360
concept to come to? I don't know. All concepts are very easy in retrospect. Yeah, that's what it
link |
00:09:37.040
seems trivial now, but I, because the reason I asked that, and we'll talk about it, because is there
link |
00:09:42.000
other things? Is there things that don't necessarily have a cost function, maybe have many cost
link |
00:09:48.080
functions, or maybe have dynamic cost functions, or maybe a totally different kind of architectures?
link |
00:09:54.000
Because we have to think like that in order to arrive at something new, right? So the only,
link |
00:09:58.560
so the good examples of things which don't have clear cost functions are GANs.
link |
00:10:03.840
Right. And in a GAN, you have a game. So instead of thinking of a cost function,
link |
00:10:08.080
where you want to optimize, where you know that you have an algorithm gradient descent,
link |
00:10:11.920
which will optimize the cost function. And then you can reason about the behavior of your system
link |
00:10:16.240
in terms of what it optimizes. With GAN, you say, I have a game, and I'll reason about the behavior
link |
00:10:21.680
of the system in terms of the equilibrium of the game. But it's all about coming up with these
link |
00:10:25.600
mathematical objects that help us reason about the behavior of our system. Right. That's really
link |
00:10:30.640
interesting. Yeah. So GAN is the only one. It's kind of a, the cost function is emergent from the
link |
00:10:35.600
comparison. I don't know if it has a cost function. I don't know if it's meaningful to talk about the
link |
00:10:40.160
cost function of a GAN. It's kind of like the cost function of biological evolution or the cost
link |
00:10:44.160
function of the economy. It's, you can talk about regions to which it will go towards, but I don't
link |
00:10:51.840
think, I don't think the cost function analogy is the most useful. So evolution doesn't,
link |
00:10:59.120
that's really interesting. So if evolution doesn't really have a cost function, like a cost function
link |
00:11:03.920
based on something akin to our mathematical conception of a cost function, then do you think
link |
00:11:11.840
cost functions in deep learning are holding us back? Yeah. So you just kind of mentioned that
link |
00:11:17.520
cost function is a nice first profound idea. Do you think that's a good idea? Do you think it's
link |
00:11:23.680
an idea we'll go past? So self play starts to touch on that a little bit in reinforcement
link |
00:11:30.160
learning systems. That's right. Self play and also ideas around exploration where you're trying to
link |
00:11:35.840
take actions that surprise a predictor. I'm a big fan of cost functions. I think cost functions
link |
00:11:41.120
are great and they serve us really well. And I think that whenever we can do things with
link |
00:11:44.560
cost functions, we should. And you know, maybe there is a chance that we will come up with some
link |
00:11:50.240
yet another profound way of looking at things that will involve cost functions in a less central way.
link |
00:11:55.440
But I don't know. I think cost functions are I mean,
link |
00:11:59.840
I would not bet against cost functions. Are there other things about the brain
link |
00:12:04.640
that pop into your mind that might be different and interesting for us to consider in designing
link |
00:12:11.200
artificial neural networks? So we talked about spiking a little bit. I mean, one thing which may
link |
00:12:16.880
potentially be useful, I think people, neuroscientists figured out something about the learning rule of
link |
00:12:20.960
the brain. I'm talking about spike-timing-dependent plasticity. And it would be nice
link |
00:12:25.120
if some people were to study that in simulation. Wait, sorry, spike-timing-dependent plasticity?
link |
00:12:30.160
Yeah, that's STDP. It's a particular learning rule that uses spike timing to figure out how to
link |
00:12:36.640
determine how to update the synapses. So it's kind of like, if a synapse fires into the neuron before
link |
00:12:42.800
the neuron fires, then the synapse is strengthened. And if the synapse fires into the
link |
00:12:47.440
neuron shortly after the neuron fired, then the synapse is weakened, something along those lines.
link |
00:12:52.080
I'm 90% sure it's right. So if I said something wrong here, don't get too angry.
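A toy sketch of the rule being described, assuming the simplified pair-based form of STDP (the speaker himself hedges the exact details, and real STDP curves are more involved):

```python
import math

def stdp_delta_w(t_pre, t_post, a_plus=0.01, a_minus=0.012, tau=20.0):
    """Weight change for one pre/post spike pair; times in milliseconds."""
    dt = t_post - t_pre
    if dt > 0:
        # Pre-synaptic spike arrives before the post-synaptic neuron fires:
        # the synapse is strengthened (potentiation).
        return a_plus * math.exp(-dt / tau)
    # Pre-synaptic spike arrives after the neuron already fired:
    # the synapse is weakened (depression).
    return -a_minus * math.exp(dt / tau)

print(stdp_delta_w(t_pre=10.0, t_post=15.0))  # positive: strengthen
print(stdp_delta_w(t_pre=15.0, t_post=10.0))  # negative: weaken
```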
link |
00:12:58.320
But you sounded brilliant while saying it. But the timing, that's one thing that's missing.
link |
00:13:04.080
The temporal dynamics is not captured. I think that's like a fundamental property of the brain,
link |
00:13:10.000
is the timing of the signals. Well, recurrent neural networks.
link |
00:13:15.280
But you think of that as, I mean, that's a very crude simplified, what's that called?
link |
00:13:21.920
There's a clock, I guess, to recurrent neural networks. This seems like the brain is the
link |
00:13:29.600
general, the continuous version of that, the generalization where all possible timings are
link |
00:13:35.440
possible. And then within those timings is contained some information. You think recurrent neural
link |
00:13:41.520
networks, the recurrence in recurrent neural networks can capture the same kind of phenomena
link |
00:13:48.160
as the timing that seems to be important for the brain in the firing of neurons in the brain?
link |
00:13:55.680
I mean, I think recurrent neural networks are amazing. And I think they can do anything we'd
link |
00:14:04.160
want a system to do. Right now, recurrent neural networks have been superseded by
link |
00:14:09.760
transformers, but maybe one day they'll make a comeback, maybe it'll be back. We'll see.
link |
00:14:13.920
Let me, in a small tangent, say, do you think they'll be back? So so much of the breakthroughs
link |
00:14:20.720
recently that we'll talk about on natural language processing and language modeling has been with
link |
00:14:26.640
transformers that don't emphasize recurrence. Do you think recurrence will make a comeback?
link |
00:14:33.200
Well, some kind of recurrence, I think, very likely. Recurrent neural networks, as they're
link |
00:14:39.120
typically thought of for processing sequences, I think it's also possible.
link |
00:14:43.600
What is, to you, a recurrent neural network? And generally speaking, I guess, what is a
link |
00:14:49.520
recurrent neural network? You have a neural network which maintains a high dimensional
link |
00:14:53.440
hidden state. And then when an observation arrives, it updates its high dimensional hidden state
link |
00:14:59.200
through its connections in some way. So do you think, you know, that's what like expert systems
link |
00:15:06.960
did, right? Symbolic AI, the knowledge base. Growing a knowledge base is maintaining a
link |
00:15:15.920
hidden state, which is its knowledge base and is growing it by sequentially processing. Do you
link |
00:15:20.320
think of it more generally in that way? Or is it simply, is it the more constrained form of
link |
00:15:29.120
a hidden state with certain kinds of gating units that we think of today with LSTMs and that?
link |
00:15:33.520
I mean, the hidden state is technically what you described there, the hidden state that goes
link |
00:15:38.000
inside the LSTM or the RNN or something like this. But then what should be contained, you know,
link |
00:15:43.040
if you want to make the expert system analogy, I'm not, I mean, you could say that the knowledge
link |
00:15:49.440
is stored in the connections and then the short term processing is done in the hidden state.
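A minimal sketch of that picture, with the "knowledge" in the weight matrices and the short-term processing in the hidden state that gets updated as each observation arrives (a vanilla RNN cell in numpy; the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, hidden_dim = 8, 32

# Connections (the weights): long-term knowledge, fixed at inference time.
W_xh = rng.normal(scale=0.1, size=(hidden_dim, obs_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

def step(h, x):
    # Short-term processing: update the high-dimensional hidden state
    # from the new observation through the connections.
    return np.tanh(W_hh @ h + W_xh @ x + b)

h = np.zeros(hidden_dim)                  # the hidden state starts empty
for x in rng.normal(size=(5, obs_dim)):   # a sequence of five observations
    h = step(h, x)
print(h.shape)  # (32,)
```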
link |
00:15:56.160
Yes. Could you say that? So sort of, do you think there's a future of building large scale
link |
00:16:02.720
knowledge bases within the neural networks? Definitely.
link |
00:16:08.880
So we're going to pause on that confidence because I want to explore that. But let me zoom back out
link |
00:16:14.080
and ask back to the history of ImageNet. Neural networks have been around for many decades,
link |
00:16:21.200
as you mentioned. What do you think were the key ideas that led to their success, that ImageNet
link |
00:16:26.720
moment and beyond the success in the past 10 years? Okay, so the question is to make sure I
link |
00:16:33.840
didn't miss anything. The key ideas that led to the success of deep learning over the past 10 years.
link |
00:16:39.280
Exactly. Even though the fundamental thing behind deep learning has been around for much longer.
link |
00:16:44.720
So the key idea about deep learning or rather the key fact about deep learning before deep learning
link |
00:16:55.920
started to be successful is that it was underestimated. People who worked in machine learning
link |
00:17:02.800
simply didn't think that neural networks could do much. People didn't believe that large neural
link |
00:17:08.320
networks could be trained. People thought that, well, there was lots of, there was a lot of debate
link |
00:17:14.320
going on in machine learning about what are the right methods and so on. And people were arguing
link |
00:17:19.200
because there was no way to get hard facts. And by that, I mean, there were no benchmarks which
link |
00:17:25.440
were truly hard, that if you do really well on them, then you can say, look, here's my system.
link |
00:17:32.400
That's when you switch from, that's when this field becomes a little bit more of an engineering
link |
00:17:38.240
field. So in terms of deep learning, to answer the question directly, the ideas were all there.
link |
00:17:43.360
The thing that was missing was a lot of supervised data and a lot of compute.
link |
00:17:49.600
Once you have a lot of supervised data and a lot of compute, then there is a third thing which is
link |
00:17:53.600
needed as well. And that is conviction, conviction that if you take the right stuff, which already
link |
00:17:59.520
exists, and apply it, mixed with a lot of data and a lot of compute, that it will in fact work.
link |
00:18:05.840
And so that was the missing piece. It was you had the, you needed the data, you needed the compute
link |
00:18:11.440
which showed up in terms of GPUs, and you needed the conviction to realize that you need to mix
link |
00:18:16.560
them together. So that's really interesting. So I guess the presence of compute and the presence
link |
00:18:23.600
supervised data allowed the empirical evidence to do the convincing of the majority of the computer
link |
00:18:30.720
science community. So I guess there's a key moment with Jitendra Malik and Alyosha Efros,
link |
00:18:40.160
who were very skeptical, right? And then there's Geoffrey Hinton, who was the opposite of skeptical.
link |
00:18:46.480
And there was a convincing moment. And I think ImageNet served as that moment.
link |
00:18:50.080
That's right. And they represented this kind of, or the big pillars of computer vision community,
link |
00:18:55.760
kind of the wizards got together. And then all of a sudden there was a shift. And
link |
00:19:02.960
it's not enough for the ideas to all be there and the compute to be there. It's
link |
00:19:06.320
for it to convince the cynicism that existed. That's interesting. That people just didn't
link |
00:19:12.960
believe for a couple of decades. Yeah. Well, but it's more than that. It's kind of, when put this
link |
00:19:20.480
way, it sounds like, well, you know, those silly people who didn't believe what were they, what
link |
00:19:24.880
were they missing. But in reality, things were confusing because neural networks really did
link |
00:19:28.720
not work on anything. And they were not the best method on pretty much anything as well.
link |
00:19:32.640
Well, and it was pretty rational to say, yeah, this stuff doesn't have any traction.
link |
00:19:39.520
And that's why you need to have these very hard tasks, which are, which produce undeniable evidence.
link |
00:19:44.800
And that's how we make progress. And that's why the field is making progress today,
link |
00:19:48.480
because we have these hard benchmarks, which represent true progress. And so, and this is
link |
00:19:53.840
why we were able to avoid endless debate. So incredibly, you've contributed some of the
link |
00:20:00.800
biggest recent ideas in AI in computer vision, language, natural language processing, reinforcement
link |
00:20:07.600
learning, sort of everything in between, maybe not GANs. There may not be a topic you haven't
link |
00:20:15.840
touched. And of course, the fundamental science of deep learning. What is the difference to you
link |
00:20:21.520
between vision, language, and as in reinforcement learning, action, as learning problems,
link |
00:20:28.160
and what are the commonalities? Do you see them as all interconnected? Are they fundamentally
link |
00:20:32.400
different domains that require different approaches? Okay, that's a good question.
link |
00:20:39.520
Machine learning is a field with a lot of unity, a huge amount of unity.
link |
00:20:44.000
What do you mean by unity? Like overlap of ideas?
link |
00:20:48.240
Overlap of ideas, overlap of principles. In fact, there's only one or two or three principles,
link |
00:20:52.560
which are very, very simple. And then they apply in almost the same way, in almost the
link |
00:20:58.160
same way to the different modalities to the different problems. And that's why today,
link |
00:21:02.480
when someone writes a paper on improving optimization of deep learning and vision,
link |
00:21:07.040
it improves the different NLP applications, and it improves the different reinforcement
link |
00:21:10.400
learning applications. Reinforcement learning. So I would say that computer vision and NLP are
link |
00:21:16.640
very similar to each other. Today, they differ in that they have slightly different architectures.
link |
00:21:22.080
We use transformers in NLP, and we use convolutional neural networks in vision.
link |
00:21:26.320
But it's also possible that one day this will change and everything will be unified with a
link |
00:21:30.560
single architecture. Because if you go back a few years ago in natural language processing,
link |
00:21:36.400
there were a huge number of architectures; every different tiny problem had its own architecture.
link |
00:21:43.200
Today, there's just one transformer for all those different tasks. And if you go back in time even
link |
00:21:48.960
more, you had even more and more fragmentation and every little problem in AI had its own
link |
00:21:54.320
little subspecialization and sub, you know, little set of collection of skills, people who would
link |
00:21:59.040
know how to engineer the features. Now it's all been subsumed by deep learning. We have this
link |
00:22:03.120
unification. And so I expect vision to become unified with natural language as well. Or rather,
link |
00:22:08.640
I just expect, I think it's possible. I don't want to be too sure because I think the
link |
00:22:12.800
convolutional neural net is very computationally efficient. RL is different. RL does require
link |
00:22:17.600
slightly different techniques because you really do need to take action. You really do need to do
link |
00:22:21.920
something about exploration, your variance is much higher. But I think there is a lot of unity
link |
00:22:27.040
even there. And I would expect, for example, that at some point, there will be some
link |
00:22:32.400
broader unification between RL and supervised learning, where somehow the RL will be making
link |
00:22:36.560
decisions to make the supervised learning go better. And it will be, I imagine one big black
link |
00:22:41.200
box and you just throw every, you know, you shovel, shovel things into it. And it just
link |
00:22:45.440
figures out what to do with whatever you shovel in it. I mean, reinforcement learning has
link |
00:22:49.760
some aspects of language and vision combined, almost. There's elements of a long term
link |
00:22:57.200
memory that you should be utilizing. And there's elements of a really rich sensory space. So it
link |
00:23:03.200
seems like the, it's like the union of the two or something like that. I'd say something
link |
00:23:09.040
slightly differently. I'd say that reinforcement learning is neither, but it naturally interfaces
link |
00:23:14.720
and integrates with the two of them. Do you think action is fundamentally different? So yeah,
link |
00:23:19.600
what is interesting about what is unique about policy of learning to act? Well, so one example,
link |
00:23:26.800
for instance, is that when you learn to act, you are fundamentally in a non stationary world.
link |
00:23:33.120
Because as your actions change, the things you see start changing. You,
link |
00:23:39.440
you experience the world in a different way. And this is not the case for
link |
00:23:43.040
the more traditional static problem where you have some distribution and you just apply a model to
link |
00:23:47.040
that distribution. Do you think it's a fundamentally different problem or is it just a more difficult
link |
00:23:53.840
it's a generalization of the problem of understanding? I mean, it's a question of
link |
00:23:58.560
definitions almost. There is a huge amount of commonality for sure. You take gradients,
link |
00:24:02.880
you take gradients, we try to approximate gradients in both cases. In some case,
link |
00:24:06.480
in the case of reinforcement learning, you have some tools to reduce the variance of the gradients.
link |
00:24:10.960
You do that. There's lots of commonalities, the same neural net in both cases,
link |
00:24:16.080
you compute the gradient, you apply Adam in both cases.
link |
00:24:20.640
So I mean, there's lots in common for sure, but there are some small
link |
00:24:26.160
differences which are not completely insignificant. It's really just a matter of your point of view,
link |
00:24:30.800
what frame of reference, how much do you want to zoom in or out as you look at these
link |
00:24:36.400
problems? Which problem do you think is harder? So people like Noam Chomsky believe that language
link |
00:24:42.000
is fundamental to everything. So it underlies everything. Do you think language understanding
link |
00:24:47.920
is harder than visual scene understanding or vice versa? I think that asking if a problem is hard
link |
00:24:54.480
is slightly wrong. I think the question is a little bit wrong and I want to explain why.
link |
00:24:58.400
Okay. So what does it mean for a problem to be hard? Okay, the non interesting dumb answer to
link |
00:25:06.800
that is there's a benchmark and there's a human level performance on that benchmark. And how
link |
00:25:15.120
much effort is required to reach human level? Okay, benchmark. So from the perspective of how
link |
00:25:20.160
much until we get to human level on a very good benchmark. Yeah, like some I understand what you
link |
00:25:28.080
mean by that. So what I was going to say that a lot of it depends on, you know, once you solve a
link |
00:25:32.720
problem, it stops being hard. And that's always true. And so whether something is hard or not
link |
00:25:37.680
depends on what our tools can do today. So you know, you'd say today, true human level language
link |
00:25:44.000
understanding and visual perception are hard in the sense that there is no way of solving the
link |
00:25:49.520
problem completely in the next three months. Right. So I agree with that statement. Beyond
link |
00:25:54.160
that, I'm just I'd be my guess would be as good as yours. I don't know. Okay, so you don't have a
link |
00:25:59.120
fundamental intuition about how hard language understanding is? I think I'd not change my mind.
link |
00:26:04.160
I'd say language is probably going to be hard. I mean, it depends on how you define it. Like if
link |
00:26:09.280
you mean absolute top notch 100% language understanding, I'll go with language. And so
link |
00:26:16.000
but then if I show you a piece of paper with letters on it, is that you see what I mean?
link |
00:26:20.720
So you have a vision system, you say it's the best human level vision system. I show you I open
link |
00:26:26.000
a book, and I show you letters. Will it understand how these letters form into words and sentences
link |
00:26:31.360
and meaning? Is this part of the vision problem? Where does vision end and language begin?
link |
00:26:36.000
Yeah, so Chomsky would say it starts at language. So vision is just a little example of the kind of
link |
00:26:41.840
structure and, you know, fundamental hierarchy of ideas that's already represented in our brain
link |
00:26:48.080
somehow that's represented through language. But where does vision stop and language begin?
link |
00:26:57.840
That's a really interesting question.
link |
00:27:07.680
So one possibility is that it's impossible to achieve really deep understanding
link |
00:27:11.600
in either images or language without basically using the same kind of system.
link |
00:27:18.240
So you're going to get the other for free. I think I think it's pretty likely that yes,
link |
00:27:22.960
if we can get one we probably our machine learning is probably that good that we can get the other
link |
00:27:27.200
but it's not 100 I'm not 100% sure. And also, I think a lot, a lot of it really does depend on
link |
00:27:34.400
your definitions. Definitions of like perfect vision. Because, you know, reading is vision,
link |
00:27:41.840
but should it count? Yeah, to me, so my definition is if a system looked at an image,
link |
00:27:48.640
and then a system looked at a piece of text, and then told me something about that,
link |
00:27:55.840
and I was really impressed. That's relative. You'll be impressed for half an hour and then
link |
00:28:01.280
you're going to say, well, I mean, all the systems do that. But here's the thing they don't do.
link |
00:28:04.960
Yeah, but I don't have that with humans. Humans continue to impress me.
link |
00:28:08.720
Is that true?
link |
00:28:10.400
Well, the ones, okay, so I'm a fan of monogamy. So I like the idea of marrying somebody being
link |
00:28:16.080
with them for several decades. So I believe in the fact that yes, it's possible to have somebody
link |
00:28:21.360
continuously giving you pleasurable, interesting, witty new ideas, friends. Yeah, I think so. They
link |
00:28:29.920
continue to surprise you. The surprise, it's that injection of randomness seems to be a nice source
link |
00:28:41.840
of, yeah, continued inspiration, like the wit, the humor. I think, yeah, that would be,
link |
00:28:53.520
it's a very subjective test, but I think if you have enough humans in the room.
link |
00:28:57.600
Yeah, I understand what you mean. Yeah, I feel like I misunderstood what you meant by
link |
00:29:02.480
impressing you. I thought you meant to impress you with its intelligence, with how well
link |
00:29:08.080
it understands an image. I thought you meant something like, I'm going to show you
link |
00:29:12.080
a really complicated image and it's going to get it right and you're going to say, wow,
link |
00:29:14.880
that's really cool. The systems of January 2020 have not been doing that.
link |
00:29:19.760
Yeah, no, I think it all boils down to the reason people click like on stuff on the
link |
00:29:25.520
internet, which is like it makes them laugh. So it's like humor or wit or insight.
link |
00:29:32.720
I'm sure we'll get that as well. So forgive the romanticized question, but looking back to you,
link |
00:29:40.320
what is the most beautiful or surprising idea in deep learning or AI in general you've come across?
link |
00:29:46.640
So I think the most beautiful thing about deep learning is that it actually works.
link |
00:29:51.520
And I mean it because you got these ideas, you got the little neural network, you got the back
link |
00:29:54.960
propagation algorithm. And then you got some theories as to, you know, this is kind of like
link |
00:30:01.200
the brain. So maybe if you make it large, if you make the neural network large and you
link |
00:30:04.880
train it on a lot of data, then it will do the same function that the brain does.
link |
00:30:09.520
And it turns out to be true. That's crazy. And now we just train these neural networks and you
link |
00:30:14.160
make them larger and they keep getting better. And I find it unbelievable. I find it unbelievable
link |
00:30:18.720
that this whole AI stuff with neural networks works. Have you built up an intuition of why are
link |
00:30:24.960
there a little bits and pieces of intuitions of insights of why this whole thing works?
link |
00:30:31.200
I mean, some definitely, while we know that optimization, we now have good, you know,
link |
00:30:37.280
we've had lots of empirical, huge amounts of empirical reasons to believe that optimization
link |
00:30:43.280
should work on all most problems we care about. Do you have insights of what, so you just said
link |
00:30:49.280
empirical evidence is most of your sort of empirical evidence kind of convinces you,
link |
00:30:58.240
it's like evolution is empirical, it shows you that look, this evolutionary process seems to be
link |
00:31:03.280
a good way to design organisms that survive in their environment. But it doesn't really get you
link |
00:31:10.480
to the insights of how the whole thing works. I think a good analogy is physics. You know how
link |
00:31:16.880
you say, Hey, let's do some physics calculation and come up with some new physics theory and make
link |
00:31:20.640
some prediction. But then you've got to run the experiment. You know, you've got to run the experiment,
link |
00:31:24.960
it's important. So it's a bit the same here, except that maybe sometimes the experiment came
link |
00:31:29.760
before the theory. But it still is the case, you know, you have some data and you come up with
link |
00:31:34.240
some prediction, you say, Yeah, let's make a big neural network, let's train it, and it's going to
link |
00:31:37.520
work much better than anything before it. And it will in fact continue to get better as you make
link |
00:31:41.680
it larger. And it turns out to be true. That's, that's amazing when a theory is validated like
link |
00:31:46.560
this, you know, it's not a mathematical theory, it's more of a biological theory almost. So I
link |
00:31:51.760
think there are not terrible analogies between deep learning and biology. I would say it's like
link |
00:31:56.160
the geometric mean of biology and physics, that's deep learning. The geometric mean of biology and
link |
00:32:02.800
physics, I think I'm going to need a few hours to wrap my head around that. Because just to find
link |
00:32:08.880
the geometric, just to find the set of what biology represents. Well, biology, in biology,
link |
00:32:17.920
things are really complicated. The theories are really, really, it's really hard to have good
link |
00:32:21.920
predictive theory. And in physics, the theories are too good. In physics, people make
link |
00:32:26.480
these super precise theories, which make these amazing predictions. And in machine learning,
link |
00:32:29.840
they're kind of in between. Kind of in between. But it'd be nice if machine learning somehow
link |
00:32:35.120
helped us discover the unification of the two as opposed to serve the in between.
link |
00:32:40.800
But you're right, that's, you're kind of trying to juggle both. So do you think there are still
link |
00:32:46.240
beautiful and mysterious properties in neural networks that are yet to be discovered? Definitely.
link |
00:32:51.200
I think that we are still massively underestimating deep learning.
link |
00:32:54.000
What do you think it will look like? Like what? If I knew, I would have done it.
link |
00:33:01.120
So, but if you look at all the progress from the past 10 years, I would say most of it,
link |
00:33:06.960
I would say there have been a few cases where some were things that felt like really new ideas
link |
00:33:12.080
showed up. But by and large, it was every year, we thought, okay, deep learning goes this far.
link |
00:33:17.120
Nope, it actually goes further. And then the next year, okay, now you know, this is this is
link |
00:33:21.680
big deep learning. We are really done. Nope, it goes further. It just keeps going further each
link |
00:33:25.440
year. So that means that we keep underestimating it, we keep not understanding its surprising properties
link |
00:33:30.160
all the time. Do you think it's getting harder and harder to make progress?
link |
00:33:35.840
It depends on what we mean. I think the field will continue to make very robust progress
link |
00:33:39.840
for quite a while. I think for individual researchers, especially people who are doing
link |
00:33:45.040
research, it can be harder because there is a very large number of researchers right now.
link |
00:33:49.040
I think that if you have a lot of compute, then you can make a lot of very interesting discoveries,
link |
00:33:53.360
but then you have to deal with the challenge of managing a
link |
00:33:59.440
huge computer cluster to run your experiments. It's a little bit harder.
link |
00:34:01.840
So I'm asking all these questions that nobody knows the answer to, but you're one of the smartest
link |
00:34:06.640
people I know. So I'm going to keep asking the, so let's imagine all the breakthroughs that happen
link |
00:34:11.760
in the next 30 years in deep learning. Do you think most of those breakthroughs can be done by
link |
00:34:16.720
one person with one computer? Sort of in the space of breakthroughs, do you think compute
link |
00:34:23.600
will be, compute and large efforts will be necessary? I mean, I can't be sure. When you say
link |
00:34:32.720
one computer, you mean how large? You're clever. I mean, one GPU. I see. I think it's pretty unlikely.
link |
00:34:47.440
I think it's pretty unlikely. I think that there are many, the stack of deep learning is starting
link |
00:34:52.400
to be quite deep. If you look at it, you've got all the way from the ideas, the systems to build
link |
00:35:00.640
the datasets, the distributed programming, the building the actual cluster, the GPU programming,
link |
00:35:08.000
putting it all together. So the stack is getting really deep. And I think it becomes,
link |
00:35:12.160
it can be quite hard for a single person to become, to be world class in every single layer of the
link |
00:35:16.960
stack. What about what like Vladimir Vapnik really insists on is taking MNIST and trying to learn
link |
00:35:24.240
from very few examples. So being able to learn more efficiently. Do you think that there'll be
link |
00:35:30.800
breakthroughs in that space that would may not need this huge compute? I think there will be a
link |
00:35:36.720
large number of breakthroughs in general that will not need a huge amount of compute. So maybe I
link |
00:35:40.960
should clarify that. I think that some breakthroughs will require a lot of compute. And I think building
link |
00:35:46.720
systems which actually do things will require a huge amount of compute. That one is pretty obvious.
link |
00:35:51.200
If you want to do X, and X requires a huge neural net, you got to get a huge neural net.
link |
00:35:56.480
But I think there will be lots of, I think there is lots of room for very important work being
link |
00:36:02.640
done by small groups and individuals. Can you maybe sort of on the topic of the science of
link |
00:36:08.400
deep learning, talk about one of the recent papers that you've released, the deep double descent,
link |
00:36:15.600
where bigger models and more data hurt. I think it's a really interesting paper.
link |
00:36:19.520
Can you describe the main idea? Yeah, definitely. So what happened is that over
link |
00:36:26.400
the years, some small number of researchers noticed that it is kind of weird that when you make the
link |
00:36:30.720
neural network larger, it works better. And it seems to go in contradiction with statistical
link |
00:36:33.840
ideas. And then some people made an analysis showing that actually you got this double descent
link |
00:36:38.240
bump. And what we've done was to show that double descent occurs for pretty much all practical
link |
00:36:44.560
deep learning systems, and that it will also... So can you step back? What's the X axis and the Y
link |
00:36:53.440
axis of a double descent plot? Okay, great. So you can do things like
link |
00:37:02.480
you can take your neural network. And you can start increasing its size slowly,
link |
00:37:07.440
while keeping your data set fixed. So if you increase the size of the neural network slowly,
link |
00:37:13.840
and if you don't do early stopping, that's a pretty important detail. Then when the
link |
00:37:21.360
neural network is really small, you make it larger, you get a very rapid increase in performance.
link |
00:37:25.920
Then you continue to make it larger. And at some point performance will get worse.
link |
00:37:30.000
And it gets the worst exactly at the point at which it achieves zero training
link |
00:37:35.760
error, precisely zero training loss. And then as you make it larger, it starts to get better again.
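A sketch of that curve in the linear case mentioned just below (minimum-norm least squares on random features), with model size on the x axis and test error on the y axis, the data set fixed, and no early stopping; the setup is illustrative, not the exact one in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 40, 1000, 20
w_true = rng.normal(size=d)
X_train = rng.normal(size=(n_train, d))
y_train = X_train @ w_true + rng.normal(size=n_train)
X_test = rng.normal(size=(n_test, d))
y_test = X_test @ w_true + rng.normal(size=n_test)

for n_features in [5, 20, 40, 80, 320, 1280]:
    proj = rng.normal(size=(d, n_features)) / np.sqrt(d)
    F_train, F_test = np.tanh(X_train @ proj), np.tanh(X_test @ proj)
    # pinv gives the minimum-norm least-squares fit (no early stopping).
    w = np.linalg.pinv(F_train) @ y_train
    test_err = np.mean((F_test @ w - y_test) ** 2)
    print(f"{n_features:5d} features: test error {test_err:8.3f}")
# Expected shape: error falls, peaks near n_features == n_train (40), then falls again.
```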
link |
00:37:40.480
And it's kind of counterintuitive, because you'd expect deep learning phenomena to be
link |
00:37:44.400
monotonic. And it's hard to be sure what it means. But it also occurs in the case of linear
link |
00:37:51.520
classifiers. And the intuition basically boils down to the following. When you have a lot,
link |
00:37:58.000
when you have a large data set and a small model, then the small, tiny, random... So basically,
link |
00:38:05.120
what is overfitting? Overfitting is when your model is somehow very sensitive to the small, random,
link |
00:38:14.000
unimportant stuff in your data set. In the training data. In the training data set, precisely. So if
link |
00:38:19.200
you have a small model, and you have a big data set, and there may be some random thing, you know,
link |
00:38:24.640
some training cases are randomly in the data set, and others may not be there. But the small model,
link |
00:38:29.760
is kind of insensitive to this randomness, because there is
link |
00:38:34.640
pretty much no uncertainty about the model, when the data set is large. So okay, so at the very
link |
00:38:39.520
basic level, to me, it is the most surprising thing that neural networks don't overfit every time,
link |
00:38:48.560
very quickly, before ever being able to learn anything, the huge number of parameters. So here
link |
00:38:56.560
is one way. Okay, so let me try to give the explanation, maybe that will be
link |
00:39:01.360
that will work. So you've got a huge neural network, let's suppose. You have a
link |
00:39:06.720
huge neural network, you have a huge number of parameters. And now let's pretend everything is
link |
00:39:10.880
linear, which it is not, but let's just pretend. Then there is this big subspace where your network
link |
00:39:16.160
achieves zero error. And SGD is going to find approximately the point... Really? That's right,
link |
00:39:22.480
approximately the point with the smallest norm in that subspace. Okay, and that can also be proven
link |
00:39:29.040
to be insensitive to the small randomness in the data, when the dimensionality is high. But when
link |
00:39:35.680
the dimensionality of the data is equal to the dimensionality of the model, then there is a
link |
00:39:39.680
one to one correspondence between all the data sets and the models. So small changes in the data
link |
00:39:45.360
set actually lead to large changes in the model. And that's why performance gets worse. So this
link |
00:39:48.960
is the best explanation more or less. So then it would be good for the model to have more parameters
link |
00:39:56.080
so to be bigger than the data. That's right. But only if you don't early stop. If you introduce
link |
00:40:01.280
early stopping or regularization, you can make the double descent bump almost completely
link |
00:40:05.280
disappear. What is early stopping? Early stopping is when you train your model, and you monitor your
link |
00:40:10.960
test validation performance. And then if at some point validation performance starts to get worse,
link |
00:40:15.760
you say, Okay, let's stop training. We are good. We are good. We are good enough.
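A minimal sketch of the rule just described; `train_one_epoch` and `validation_loss` stand in for whatever training loop and metric you actually have.

```python
def train_with_early_stopping(model, train_one_epoch, validation_loss,
                              max_epochs=100, patience=3):
    # Monitor validation performance and stop once it stops improving.
    best_val = float("inf")
    epochs_without_improvement = 0
    for _ in range(max_epochs):
        train_one_epoch(model)
        val = validation_loss(model)
        if val < best_val:
            best_val, epochs_without_improvement = val, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation got worse: we are good enough
    return model
```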
link |
00:40:19.840
So the magic happens after that moment. So you don't want to do the early stopping.
link |
00:40:24.880
Well, if you don't do the early stopping, you get this very, you get a very pronounced double
link |
00:40:28.480
descent. Do you have any intuition why this happens? Double descent or, sorry, early stopping?
link |
00:40:35.360
No, the double descent. Well, yeah, let's see, the intuition
link |
00:40:40.000
is this that when the data set has as many degrees of freedom as the model, then there is a one to
link |
00:40:48.240
one correspondence between them. And so small changes to the data set lead to noticeable changes
link |
00:40:53.840
in the model. So your model is very sensitive to all the randomness, it is unable to discard it.
link |
00:40:59.440
Whereas, it turns out that when you have a lot more data than parameters, or a lot more parameters
link |
00:41:05.600
than data, the resulting solution will be insensitive to small changes in the data set.
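For reference, the minimum-norm solution alluded to a moment ago can be written as follows (the standard overparameterized linear-regression statement, not a quote from the conversation; it assumes X has full row rank):

```latex
% Among all weight vectors that fit the training data exactly, gradient descent
% initialized at zero converges to the one with the smallest Euclidean norm:
\[
  w^\star \;=\; \arg\min_{w}\ \|w\|_2
  \quad \text{subject to} \quad Xw = y,
  \qquad
  w^\star \;=\; X^{+}y \;=\; X^\top\!\left(XX^\top\right)^{-1} y .
\]
```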
link |
00:41:10.400
So it's able to nicely discard the small changes, the randomness. Exactly. The spurious
link |
00:41:18.720
correlation which you don't want. Jeff Hinton suggested we need to throw away back propagation.
link |
00:41:23.360
We already kind of talked about this a little bit, but he suggested we need to throw away
link |
00:41:27.200
back propagation and start over. I mean, of course, some of that is a little bit
link |
00:41:33.760
wit and humor. But what do you think? What could be an alternative method of training neural networks?
link |
00:41:39.520
Well, the thing that he said precisely is that to the extent that you can't find back propagation
link |
00:41:44.000
in the brain, it's worth seeing if we can learn something from how the brain learns. But back
link |
00:41:49.360
propagation is very useful and we should keep using it. Oh, you're saying that once we discover the
link |
00:41:54.560
mechanism of learning in the brain or any aspects of that mechanism, we should also try to implement
link |
00:41:59.600
that in neural networks? If it turns out that you can't find back propagation in the brain?
link |
00:42:03.600
If we can't find back propagation in the brain? Well, so I guess your answer to that is back
link |
00:42:11.920
propagation is pretty damn useful. So why are we complaining? I mean, I personally am a big fan
link |
00:42:17.440
of back propagation. I think it's a great algorithm because it solves an extremely fundamental problem
link |
00:42:21.840
which is finding a neural circuit subject to some constraints. And I don't see that problem going
link |
00:42:29.840
away. So that's why I really, I think it's pretty unlikely that you'll have anything which is going
link |
00:42:36.160
to be dramatically different. It could happen. But I wouldn't bet on it right now.
link |
00:42:41.120
So let me ask a sort of big picture question. Do you think neural networks can be made to reason?
link |
00:42:51.600
Why not? Well, if you look, for example, at AlphaGo or AlphaZero, the neural network of AlphaZero
link |
00:43:00.320
plays Go, which we all agree is a game that requires reasoning, better than 99.9% of all humans,
link |
00:43:08.400
just the neural network without the search, just the neural network itself.
link |
00:43:12.240
Doesn't that give us an existence proof that neural networks can reason?
link |
00:43:17.680
To push back and disagree a little bit, we all agree that Go is reasoning. I think I agree. I
link |
00:43:24.800
don't think it's that trivial. So obviously reasoning, like intelligence, is a loose gray area term
link |
00:43:32.000
a little bit. Maybe you disagree with that. But yes, I think it has some of the same elements
link |
00:43:37.920
of reasoning. Reasoning is almost akin to search. There's a sequential element of
link |
00:43:46.720
stepwise consideration of possibilities and sort of building on top of those possibilities in a
link |
00:43:54.320
sequential manner until you arrive at some insight. So yeah, I guess playing Go is kind of like that.
link |
00:44:00.400
And when you have a single neural network doing that without search, that's kind of like that.
link |
00:44:04.720
So there's an existence proof in a particular constrained environment that a process akin to
link |
00:44:10.880
what many people call reasoning exists, but more general kind of reasoning. So off the board.
link |
00:44:18.720
There is one other existence proof. Oh boy, which one? Us humans? Yes. Okay. All right. So
link |
00:44:26.000
do you think the architecture that will allow neural networks to reason will look similar
link |
00:44:34.480
to the neural network architectures we have today? I think it will. I think, well, I don't want to make
link |
00:44:41.520
too definitive a statement. I think it's definitely possible that the neural networks
link |
00:44:47.600
that will produce the reasoning breakthroughs of the future will be very similar to the
link |
00:44:51.760
architectures that exist today, maybe a little bit more recurrent, maybe a little bit deeper.
link |
00:44:56.160
But these neural nets are so insanely powerful. Why wouldn't they be able to learn to reason?
link |
00:45:05.440
Humans can reason. So why can't neural networks? So do you think the kind of stuff we've seen
link |
00:45:11.520
neural networks do is a kind of just weak reasoning? So it's not a fundamentally different
link |
00:45:15.840
process. Again, this is stuff we don't nobody knows the answer to. So when it comes to our
link |
00:45:21.120
neural networks, the thing which I would say is that neural networks are capable of reasoning.
link |
00:45:28.160
But if you train a neural network on a task which doesn't require reasoning,
link |
00:45:32.400
it's not going to reason. This is a well known effect where the neural network will solve
link |
00:45:37.040
exactly the problem that you pose in front of it, in the easiest way possible.
link |
00:45:43.040
Right. That takes us to one of the brilliant ways you describe neural networks, which is
link |
00:45:54.000
you've referred to neural networks as the search for small circuits,
link |
00:45:57.840
and maybe general intelligence as the search for small programs,
link |
00:46:04.320
which I find a very compelling metaphor. Can you elaborate on that difference?
link |
00:46:08.560
Yeah. So the thing which I said precisely was that if you can find the shortest program that
link |
00:46:17.520
outputs the data at your disposal, then you will be able to use it to make the best prediction
link |
00:46:23.520
possible. And that's a theoretical statement which can be proved mathematically. Now,
link |
00:46:29.680
you can also prove mathematically that finding the shortest program which generates
link |
00:46:34.800
some data is not a computable operation. No finite amount of compute can do this.
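The textbook formalization behind this claim (my framing, not Ilya's exact words) is Kolmogorov complexity: the length of the shortest program that outputs the data, which is provably uncomputable.

```latex
% K(x) is the length of the shortest program p that makes a fixed universal
% computer U output x; no algorithm computes K(x) for all x, which is the
% "no finite amount of compute can do this" point.
\[
  K(x) \;=\; \min_{\,p\,:\,U(p)\,=\,x}\ \ell(p)
\]
```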
link |
00:46:42.720
So then with neural networks, neural networks are the next best thing that actually works in
link |
00:46:48.800
practice. We are not able to find the best, the shortest program which generates our data,
link |
00:46:55.600
but we are able to find a small, but now that statement should be amended. Even a large circuit
link |
00:47:02.720
which fits our data in some way. Well, I think what you meant by the small
link |
00:47:06.640
circuit is the smallest needed circuit. Well, the thing which I would change now,
link |
00:47:12.240
back then I hadn't really fully internalized the overparameterization results. The things we know
link |
00:47:18.400
about overparameterized neural nets, now I would phrase it as a large circuit whose weights contain
link |
00:47:25.360
a small amount of information, which I think is what's going on. If you imagine the training
link |
00:47:30.320
process of a neural network as you slowly transmit entropy from the data set to the parameters,
link |
00:47:36.880
then somehow the amount of information in the weights ends up being not very large,
link |
00:47:42.720
which would explain why the general is so well. So the large circuit might be one that's
link |
00:47:48.800
helpful for the generalization. Yeah, something like this. But do you see it important to be able
link |
00:47:59.680
to try to learn something like programs? I mean, if we can, definitely. I think it's kind of,
link |
00:48:05.920
the answer is kind of yes, if we can do it. We should do the things that we can do.
link |
00:48:11.840
The reason we are pushing on deep learning, the fundamental reason, the root cause is that we
link |
00:48:19.040
are able to train them. So in other words, training comes first. We've got our pillar,
link |
00:48:25.440
which is the training pillar. And now if you're trying to contort our neural networks around
link |
00:48:30.080
the training pillar, we got to stay trainable. This is an invariant we cannot violate. And so
link |
00:48:38.160
being trainable means starting from scratch, knowing nothing, you can actually pretty quickly
link |
00:48:42.720
converge towards knowing a lot or even slowly. But it means that given the resources at your
link |
00:48:48.480
disposal, you can train the neural net and get it to achieve useful performance. Yeah, that's a
link |
00:48:55.920
pillar we can't move away from. That's right. Whereas if you say, hey,
link |
00:48:59.520
let's find the shortest program, we can't do that. So it doesn't matter how useful that would be.
link |
00:49:05.920
We can't do it, no matter how much we want to. So, you kind of mentioned that the neural networks are
link |
00:49:10.640
good at finding small circuits or large circuits? Do you think then the matter of finding small
link |
00:49:16.400
programs is just about the data? No, sorry, not the size, but the type of data, as in giving it programs.
link |
00:49:28.800
Well, I think the thing is that right now, there are no good precedents of people successfully
link |
00:49:35.440
finding programs really well. And so the way you'd find programs is you'd train a deep neural
link |
00:49:42.320
network to do it basically. Right. Which is the right way to go about it. But there's not good
link |
00:49:49.600
illustrations of that. Yes, it hasn't been done yet. But in principle, it should be possible.
link |
00:49:56.320
Can you elaborate a little bit? What's your insight in principle? And put another way,
link |
00:50:01.120
you don't see why it's not possible. Well, it's more a statement of
link |
00:50:07.440
I think that it's unwise to bet against deep learning. And
link |
00:50:14.880
if it's a cognitive function that humans seem to be able to do, then
link |
00:50:20.320
it doesn't take too long for some deep neural net to pop up that can do it too.
link |
00:50:25.680
Yeah, I'm there with you. I've stopped betting against neural networks at this point
link |
00:50:32.880
because they continue to surprise us. What about long term memory? Can neural networks have long
link |
00:50:38.320
term memory or something like knowledge bases? So being able to aggregate important information
link |
00:50:45.360
over long periods of time, that would then serve as useful sort of representations of
link |
00:50:54.000
state that you can make decisions by. So have a long term context based on which you make the
link |
00:51:00.400
decision. So in some sense, the parameters already do that. The parameters are an aggregation of the
link |
00:51:07.600
entirety of the neural net's experience. And so they count as
link |
00:51:12.080
long term knowledge. And people have trained various neural nets to act as
link |
00:51:18.240
knowledge bases, and, you know, people have investigated language
link |
00:51:22.480
models as knowledge bases. So there is work there. Yeah, but in some sense,
link |
00:51:28.480
do you think it's all just a matter of coming up with a
link |
00:51:35.760
better mechanism of forgetting the useless stuff and remembering the useful stuff? Because right
link |
00:51:40.480
now, I mean, there haven't been mechanisms that remember really long term information.
link |
00:51:46.720
What do you mean by that precisely?
link |
00:51:48.800
Precisely. I like the word precisely. So
link |
00:51:51.920
I'm thinking of the kind of compression of information the knowledge bases represent,
link |
00:52:00.400
sort of creating a, now I apologize for my sort of human centric thinking about
link |
00:52:07.680
what knowledge is because neural networks aren't interpretable necessarily with the
link |
00:52:13.040
kind of knowledge they have discovered. But a good example for me is knowledge bases being
link |
00:52:18.800
able to build up over time something like the knowledge that Wikipedia represents.
link |
00:52:23.920
It's a really compressed structured
link |
00:52:29.680
knowledge base, obviously not the actual Wikipedia or the language, but like a semantic web,
link |
00:52:35.600
the dream that semantic web represented. So it's a really nice compressed knowledge base
link |
00:52:40.240
or something akin to that in the noninterpretable sense as neural networks would have.
link |
00:52:46.800
Well, the neural networks would be noninterpretable if you look at their weights, but their outputs
link |
00:52:50.720
should be very interpretable. Okay, so yeah, how do you make very smart neural networks like
link |
00:52:55.920
language models interpretable? Well, you ask them to generate some text and the text will
link |
00:53:00.720
generally be interpretable. Do you find that to be the epitome of interpretability? Like, can you do better?
link |
00:53:07.600
Because you can't, okay, I'd like to know what it knows and what it doesn't know.
link |
00:53:11.360
I would like the neural network to come up with examples where it's completely dumb
link |
00:53:17.200
and examples where it's completely brilliant. And the only way I know how to do that now is to
link |
00:53:22.480
generate a lot of examples and use my human judgment. But it would be nice if the neural
link |
00:53:27.360
network had some self awareness about it. Yeah, 100%. I'm a big believer in self awareness. And I
link |
00:53:34.320
think that neural net self awareness will allow for things like the capabilities,
link |
00:53:41.760
like the ones you described, like for them to know what they know and what they don't know,
link |
00:53:45.840
and for them to know where to invest to increase their skills most optimally. And to your question
link |
00:53:50.640
of interpretability, there are actually two answers to that question. One answer is, you know,
link |
00:53:54.880
we have the neural net, so we can analyze the neurons and we can try to understand what the
link |
00:53:59.040
different neurons and different layers mean. And you can actually do that and OpenAI has done
link |
00:54:03.680
some work on that. But there is a different answer which is that I would say that's the human
link |
00:54:10.000
centric answer where you say, you know, you look at a human being, you can't read, you know,
link |
00:54:15.920
how do you know what a human being is thinking? You ask them, you say, Hey, what do you think about
link |
00:54:19.680
this? What do you think about that? And you get some answers. The answers you get are sticky. In
link |
00:54:24.960
the sense that you already have a mental model of that human
link |
00:54:30.800
being. You already have an understanding, like a big conception, of that human being,
link |
00:54:37.760
how they think, what they know, how they see the world, and then everything you ask, you're
link |
00:54:42.400
adding onto that. And that stickiness seems to be, that's one of the really interesting qualities
link |
00:54:50.800
of the human being is that information is sticky. You don't, you seem to remember the useful stuff,
link |
00:54:56.640
aggregate it well, and forget most of the information that's not useful. That process,
link |
00:55:02.000
but that's also pretty similar to the process that neural networks do. It's just that neural
link |
00:55:06.960
networks are much crappier at this time. It doesn't seem to be fundamentally that different.
link |
00:55:12.480
But just to stick on reasoning for a little longer, you said, why not? Why can't it reason?
link |
00:55:18.880
What's a good, impressive benchmark of reasoning to you, one that you'd be impressed by if
link |
00:55:28.000
neural networks were able to do? Is that something you already have in mind? Well, I think writing,
link |
00:55:32.960
writing really good code, I think, proving really hard theorems, solving open ended problems
link |
00:55:40.800
with out-of-the-box solutions. And sort of theorem type mathematical problems. Yeah, I think those
link |
00:55:49.760
ones are a very natural example as well. You know, if you can prove an unproven theorem,
link |
00:55:53.920
then it's hard to argue it doesn't reason. And so by the way, and this comes back to the point about
link |
00:55:58.960
the hard results. You know, machine learning, deep learning as a
link |
00:56:04.480
field is very fortunate, because we have the ability to sometimes produce these unambiguous
link |
00:56:09.040
results. And when they happen, the debate changes, the conversation changes. You
link |
00:56:15.280
have the ability to produce conversation changing results. And then of course,
link |
00:56:20.800
just like you said, people kind of take that for granted, say that wasn't actually a hard problem.
link |
00:56:24.960
Well, I mean, at some point, we'll probably run out of hard problems.
link |
00:56:29.280
Yeah, that whole mortality thing is kind of a sticky problem that we haven't quite figured
link |
00:56:34.720
out. Maybe we'll solve that one. I think one of the fascinating things in your entire body of work,
link |
00:56:40.720
but also the work at OpenAI recently, one of the conversation changers has been in the world of
link |
00:56:45.680
language models. Can you briefly kind of try to describe the recent history of using neural
link |
00:56:51.600
networks in the domain of language and text? Well, there's been lots of history. I think the
link |
00:56:57.280
Elman network was a small, tiny recurrent neural network applied to language back in the 80s.
link |
00:57:03.040
So the history is really, you know, fairly long at least. And the thing
link |
00:57:11.120
that changed the trajectory of neural networks and language is the thing that changed the trajectory
link |
00:57:17.040
of all deep learning and that's data and compute. So suddenly you move from small language models,
link |
00:57:22.560
which learn a little bit. And with language models in particular, there's a very clear
link |
00:57:27.600
explanation for why they need to be large, to be good. Because they're trying to predict the
link |
00:57:32.560
next word. So when you don't know anything, you'll notice very, very broad strokes, surface level
link |
00:57:40.800
patterns, like sometimes there are characters and there is space between those characters,
link |
00:57:46.320
you'll notice this pattern. And you'll notice that sometimes there is a comma and then the next
link |
00:57:50.640
character is a capital letter, you'll notice that pattern. Eventually, you may start to notice that
link |
00:57:54.960
there are certain words that occur often, you may notice that spellings are a thing, you may notice
link |
00:57:59.840
syntax. And when you get really good at all these, you start to notice the semantics,
link |
00:58:05.680
you start to notice the facts. But for that to happen, the language model needs to be larger.
link |
00:58:11.360
So let's linger on that, because that's where you and Noam Chomsky disagree.
link |
00:58:18.640
See, you think we're actually taking incremental steps, sort of larger network, larger compute
link |
00:58:25.600
will be able to get to the semantics, be able to understand language without what Noam likes to
link |
00:58:34.640
sort of think of as a fundamental understanding of the structure of language, like imposing
link |
00:58:42.000
your theory of language onto the learning mechanism. So you're saying you can learn from
link |
00:58:49.280
raw data, the mechanism that underlies language? Well, I think it's pretty likely. But I also
link |
00:58:57.040
want to say that I don't really know precisely what Chomsky means when he talks about it.
link |
00:59:05.520
You said something about imposing your structure of language. I'm not 100% sure what he means, but
link |
00:59:11.040
empirically, it seems that when you inspect those larger language models, they exhibit signs of
link |
00:59:15.520
understanding the semantics, whereas the smaller language models do not. We've seen that a few
link |
00:59:19.200
years ago when we did work on the sentiment neuron, we trained a small, you know, smallish LSTM
link |
00:59:25.200
to predict the next character in Amazon reviews. And we noticed that when you increase the size
link |
00:59:30.800
of the LSTM from 500 LSTM cells to 4,000 LSTM cells, then one of the neurons starts to represent
link |
00:59:37.120
the sentiment of the article, sorry, of the review. Now, why is that? Sentiment is a pretty
link |
00:59:43.840
semantic attribute. It's not a syntactic attribute. And for people who might not know, I don't know
link |
00:59:48.560
if that's a standard term, but sentiment is whether it's a positive or negative review.
link |
00:59:51.920
That's right. Like, is the person happy with something or is the person unhappy with something?
link |
00:59:56.000
And so here we had very clear evidence that a small neural net does not capture sentiment
link |
01:00:01.760
while a large neural net does. And why is that? Well, our theory is that at some point,
link |
01:00:07.280
you run out of syntax to model, you've got to focus on something else.
link |
01:00:10.960
And besides, you quickly run out of syntax to model, and then you really start to focus on
link |
01:00:17.360
the semantics. This would be the idea. That's right. And so I don't want to imply that our models
link |
01:00:22.080
have complete semantic understanding, because that's not true. But they definitely are showing
link |
01:00:27.760
signs of semantic understanding, partial semantic understanding, but the smaller models do not show
link |
01:00:32.480
those signs. Can you take a step back and say, what is GPT2, which is one of the big language
link |
01:00:39.920
models that was the conversation changer in the past couple of years? Yeah, so GPT2 is a
link |
01:00:46.320
transformer with one and a half billion parameters that was trained on about 40 billion tokens of
link |
01:00:54.880
text, which were obtained from webpages that were linked to from Reddit articles with more than three
link |
01:01:01.600
upvotes. And what's the transformer? The transformer, it's the most important advance
link |
01:01:06.560
in neural network architectures in recent history. What is attention, maybe, too,
link |
01:01:11.280
because I think that's an interesting idea, not necessarily sort of technically speaking, but
link |
01:01:15.360
the idea of attention versus maybe what recurrent neural networks represent.
link |
01:01:21.040
Yeah. So the thing is, the transformer is a combination of multiple ideas simultaneously
link |
01:01:25.760
of which attention is one. Do you think attention is the key? No, it's a key, but it's not the key.
link |
01:01:32.320
The transformer is successful because it is the simultaneous combination of multiple ideas. And
link |
01:01:37.680
if you were to remove either idea, it would be much less successful. So the transformer uses a
link |
01:01:43.040
lot of attention, but attention has existed for a few years, so that can't be the main innovation.
link |
01:01:48.320
The transformer is designed in such a way that it runs really fast on the GPU.
link |
01:01:56.000
And that makes a huge amount of difference. This is one thing. The second thing is that
link |
01:02:00.240
transformer is not recurrent. And that is really important too, because it is more shallow and
link |
01:02:06.320
therefore much easier to optimize. So in other words, it uses attention, it is a really
link |
01:02:12.160
great fit to the GPU. And it is not recurrent. So therefore, less deep and easier to optimize.
link |
01:02:17.760
And the combination of those factors makes it successful. So now it makes great
link |
01:02:22.640
use of your GPU. It allows you to achieve better results for the same amount of compute.
link |
01:02:28.560
And that's why it's successful. Were you surprised how well transformers worked?
link |
01:02:34.160
And GPT2 worked? So you worked on language. You've had a lot of great ideas before
link |
01:02:40.480
transformers came about in language. So you got to see the whole set of revolutions before and
link |
01:02:45.280
after. Were you surprised? Yeah, a little. A little. Yeah. I mean, it's hard, it's hard to
link |
01:02:51.280
remember because you adapt really quickly. But it definitely was surprising. It definitely was,
link |
01:02:56.800
in fact, you know what, I'll retract my statement. It was pretty amazing.
link |
01:03:02.320
It was just amazing to see it generate this text. And you know, you gotta keep in mind that
link |
01:03:07.680
we've seen, at that time, we've seen all this progress in GANs and improving, you know, the
link |
01:03:12.000
samples produced by GANs were just amazing. You have these realistic faces, but text hasn't really
link |
01:03:16.720
moved that much. And suddenly we moved from, you know, whatever GANs were in 2015, to the best,
link |
01:03:23.760
most amazing GANs in one step. And it was really stunning. Even though theory predicted, yeah,
link |
01:03:29.040
you train a big language model, of course, you should get this, but then to see it with your
link |
01:03:32.560
own eyes, it's something else. And yet we adapt really quickly. And now there's sort of
link |
01:03:41.520
some cognitive scientists write articles saying that GPT2 models don't really understand
link |
01:03:48.240
language. So we adapt quickly to how amazing it is that they're able to model the language so
link |
01:03:54.240
well. So what do you think is the bar for impressing us? Do you think
link |
01:04:03.920
that bar will continuously be moved? Definitely. I think when you start to see really dramatic
link |
01:04:10.160
economic impact, that's when I think that's in some sense the next barrier. Because right now,
link |
01:04:14.480
if you think about the work in AI, it's really confusing. It's really hard to know what to make
link |
01:04:20.640
of all these advances. It's kind of like, okay, you got an advance and now you can do more things
link |
01:04:26.720
and you got another improvement and you got another cool demo. At some point, I think people who are
link |
01:04:34.400
outside of AI, they can no longer distinguish this progress anymore. So we were talking offline about
link |
01:04:40.160
translating Russian to English and how there's a lot of brilliant work in Russian that the rest
link |
01:04:45.440
of the world doesn't know about. That's true for Chinese. It's true for a lot of scientists and
link |
01:04:50.080
just artistic work in general. Do you think translation is the place where we're going
link |
01:04:54.240
to see sort of economic big impact? I don't know. I think there is a huge number of
link |
01:04:59.600
applications. First of all, I want to point out that translation already today is huge.
link |
01:05:05.360
I think billions of people interact with big chunks of the internet primarily through translation.
link |
01:05:10.960
So translation is already huge and it's hugely positive too. I think self driving is going to be
link |
01:05:18.000
hugely impactful. It's unknown exactly when it happens, but again, I would not bet against
link |
01:05:26.320
deep learning. So that's deep learning in general. Deep learning for self driving.
link |
01:05:31.680
Yes, deep learning for self driving. But I was talking about sort of language models.
link |
01:05:35.040
I see. I veered off a little bit. Just to check, you're not seeing a connection
link |
01:05:39.760
between driving and language. No, no. Okay. Well, they'd both use neural nets.
link |
01:05:44.000
That'd be a poetic connection. I think there might be some, like you said, there might be some kind
link |
01:05:48.240
of unification towards a kind of multitask transformers that can take on both language
link |
01:05:56.000
and vision tasks. That'd be an interesting unification. Let's see. What can I ask about
link |
01:06:02.640
GPT2 more? It's simple. There's not much to ask. You take a transformer, you make it bigger,
link |
01:06:09.760
give it more data, and suddenly it does all those amazing things.
link |
01:06:12.480
Yeah. One of the beautiful things is that GPT, the transformers are fundamentally simple to
link |
01:06:17.200
explain, to train. Do you think bigger will continue to show better results in language?
link |
01:06:26.960
Probably. Sort of like what are the next steps with GPT2? Do you think?
link |
01:06:31.280
I mean, I think for sure seeing what larger versions can do is one direction. Also,
link |
01:06:38.000
I mean, there are many questions. There's one question which I'm curious about and that's
link |
01:06:42.880
the following. Right now, GPT2, so we feed it all this data from the internet, which means that
link |
01:06:47.200
it needs to memorize all those random facts about everything in the internet. It would be nice if
link |
01:06:55.280
the model could somehow use its own intelligence to decide what data it wants to accept,
link |
01:07:01.600
and what data it wants to reject, just like people. People don't learn all data indiscriminately.
link |
01:07:07.040
We are super selective about what we learn. I think this kind of active learning
link |
01:07:11.760
would be very nice to have. Yeah. Listen, I love active learning. Let me ask,
link |
01:07:19.280
does the selection of data, can you just elaborate on that a little bit more? Do you think the selection
link |
01:07:23.840
of data is, I have this kind of sense that the optimization of how you select data,
link |
01:07:33.680
so the active learning process is going to be a place for a lot of breakthroughs,
link |
01:07:40.720
even in the near future, because there haven't been many breakthroughs there that are public.
link |
01:07:44.880
I feel like there might be private breakthroughs that companies keep to themselves,
link |
01:07:49.120
because the fundamental problem has to be solved if you want to solve self driving,
link |
01:07:52.800
if you want to solve a particular task. What do you think about the space in general?
link |
01:07:57.680
Yeah, so I think that for something like active learning, or in fact for any kind of capability,
link |
01:08:02.320
like active learning, the thing that it really needs is the problem. It needs a problem that
link |
01:08:06.880
requires it. It's very hard to do research about the capability if you don't have a task,
link |
01:08:12.880
because then what's going to happen is you will come up with an artificial task,
link |
01:08:16.640
get good results, but not really convince anyone. Right. We're now past the stage where
link |
01:08:24.000
getting a result on MNIST, some clever formulation of MNIST will convince people.
link |
01:08:30.720
That's right. In fact, you could quite easily come up with a simple active learning scheme
link |
01:08:34.800
on MNIST and get a 10x speed up, but then so what? I think that active learning will naturally arise
link |
01:08:45.440
as problems that require it to pop up. That's my take on it.
link |
01:08:52.720
There's another interesting thing that OpenAI brought up with GPT2, which is when you create a
link |
01:08:58.720
powerful artificial intelligence system, it was unclear, once you
link |
01:09:04.880
release GPT2, what kind of detrimental effect it will have. Because if you have a model that can
link |
01:09:11.600
generate pretty realistic text, you can start to imagine that it would be used by bots in some
link |
01:09:20.080
way that we can't even imagine. There's this nervousness about what's possible to do.
link |
01:09:24.320
So you did a really brave and profound thing, which just started a conversation about this.
link |
01:09:30.000
How do we release powerful artificial intelligence models to the public, if we do at all? How do we
link |
01:09:38.320
privately discuss with others, even competitors, about how we manage the use of these systems and so on?
link |
01:09:45.920
So from this whole experience, you've released a report on it, but in general, are there any
link |
01:09:51.120
insights that you've gathered from just thinking about this, about how you release models like this?
link |
01:09:57.520
I mean, I think that my take on this is that the field of AI has been in a state of childhood,
link |
01:10:04.880
and now it's exiting that state and it's entering a state of maturity.
link |
01:10:09.440
What that means is that AI is very successful and also very impactful, and its impact is not only
link |
01:10:15.280
large, but it's also growing. And so for that reason, it seems wise to start thinking about
link |
01:10:22.960
the impact of our systems before releasing them, maybe a little bit too soon, rather than a little
link |
01:10:27.920
bit too late. And with the case of GPT2, like I mentioned earlier, the results really were stunning,
link |
01:10:34.960
and it seemed plausible. It didn't seem certain. It seemed plausible that something like GPT2 could
link |
01:10:41.680
easily be used to reduce the cost of disinformation. And so there was a question of what's the best
link |
01:10:49.200
way to release it? And a staged release seemed logical. A small model was released, and there
link |
01:10:54.800
was time to see how people use these models in lots of cool ways. There have been lots of
link |
01:11:00.080
really cool applications. There haven't been any negative applications we know of. And so eventually
link |
01:11:06.800
it was released, but also other people replicated similar models. That's an interesting question,
link |
01:11:10.880
though, that we know of. So in your view, staged release is at least part of the answer to the
link |
01:11:19.040
question of what do we do once we create a system like this? It's part of the answer, yes.
link |
01:11:28.240
Are there any other insights? Say you don't want to release the model at all, because it's useful
link |
01:11:33.120
to you for whatever the business is. Well, there are plenty of people who don't release models
link |
01:11:38.160
already, right? Of course. But is there some moral ethical responsibility when you have a
link |
01:11:45.040
very powerful model to sort of communicate? Just as you said, when you had GPT2, it was
link |
01:11:51.760
unclear how much it could be used for misinformation. It's an open question. And getting an answer to
link |
01:11:57.280
that might require that you talk to other really smart people that are outside of your particular
link |
01:12:03.280
group. Please tell me there's some optimistic pathway for people across the world
link |
01:12:10.400
to collaborate on these kinds of cases? Or is it still really difficult from one company to
link |
01:12:17.840
talk to another company? So it's definitely possible. It's definitely possible to discuss
link |
01:12:24.560
these kinds of models with colleagues elsewhere and to get their take on what to do.
link |
01:12:31.280
How hard is it though? I mean,
link |
01:12:36.320
do you see that happening? I think that's the place where it's important to gradually build
link |
01:12:41.440
trust between companies. Because ultimately, all the AI developers are building technology,
link |
01:12:47.760
which is going to be increasingly more powerful. And so it's
link |
01:12:54.480
the way to think about it is that ultimately, we're all in it together.
link |
01:12:56.880
Yeah, I tend to believe in the better angels of our nature, but I do hope that
link |
01:13:08.960
when you build a really powerful AI system in a particular domain, that you also think about
link |
01:13:14.880
the potential negative consequences of AI. It's an interesting and scary possibility
link |
01:13:23.920
that there will be a race for AI development that would push people to close that development
link |
01:13:30.240
and not share ideas with others. I don't love this. I've been a pure academic for 10 years.
link |
01:13:36.480
I really like sharing ideas and it's fun. It's exciting. Let's talk about AGI a little bit.
link |
01:13:44.400
What do you think it takes to build a system of human level intelligence? We talked about reasoning,
link |
01:13:49.360
we talked about long term memory, but in general, what does it take, do you think?
link |
01:13:53.600
Well, I can't be sure. But I think the deep learning plus maybe another small idea.
link |
01:14:03.600
Do you think self play will be involved? You've spoken about the powerful mechanism of self play,
link |
01:14:08.960
where systems learn by exploring the world in a competitive setting against other entities that
link |
01:14:18.400
are similarly skilled as them and so incrementally improve in this way. Do you think self play will
link |
01:14:23.760
be a component of building an AGI system? Yeah. What I would say to build AGI, I think it's going
link |
01:14:30.880
to be deep learning plus some ideas. I think self play will be one of those ideas. I think that
link |
01:14:39.120
self play has this amazing property that it can surprise us in truly novel ways. For
link |
01:14:49.760
example, I mean, pretty much every self play system, both our Dota bot, I don't know if
link |
01:14:59.280
you saw it, OpenAI had a release about multi agents where you had two little agents who were playing hide and
link |
01:15:05.120
seek. And of course, also AlphaZero, they all produce surprising behaviors. They all produce
link |
01:15:11.520
behaviors that we didn't expect. They are creative solutions to problems. And that seems like an
link |
01:15:17.200
important part of AGI that our systems don't exhibit routinely right now. And so that's why I
link |
01:15:23.840
like this area, I like this direction because of its ability to surprise us.
link |
01:15:27.440
To surprise us. An AGI system would surprise us fundamentally. Yes. But, to be precise, not
link |
01:15:32.720
just a random surprise, but to find a surprising solution to a problem that's also
link |
01:15:38.240
useful. Right. Now, a lot of the self play mechanisms have been used in the game context,
link |
01:15:45.520
or at least in the simulation context. How much, how far along the path to AGI do you
link |
01:15:55.120
think will be done in simulation? How much faith or promise do you have in simulation versus having
link |
01:16:02.160
to have a system that operates in the real world, whether it's the real world of digital real world
link |
01:16:09.280
data or real world, like the actual physical world with robotics? I don't think it's an either or.
link |
01:16:14.800
I think simulation is a tool, and it helps. It has certain strengths and certain weaknesses.
link |
01:16:19.520
And we should use it. Yeah, but okay. I understand that that's
link |
01:16:24.640
true. But one of the criticisms of self play and reinforcement
link |
01:16:34.320
learning is that its current results, while amazing, have been
link |
01:16:42.160
demonstrated in simulated environments, or very constrained physical environments,
link |
01:16:46.240
do you think it's possible to escape them, escape the simulated environments, and be able to learn
link |
01:16:51.600
in non simulated environments? Or do you think it's possible to also just simulate in a photo
link |
01:16:57.520
realistic, and physics realistic way, the real world in a way that we can solve real problems
link |
01:17:03.680
with self play in simulation. So I think that transfer from simulation to the real world is
link |
01:17:10.400
definitely possible, and has been exhibited many times by many different groups. It's been
link |
01:17:16.240
especially successful in vision. Also, OpenAI in the summer demonstrated a robot hand which
link |
01:17:22.640
was trained entirely in simulation, in a certain way that allowed for sim to real transfer to occur.
link |
01:17:29.680
This is for the Rubik's cube. That's right. And I wasn't aware that it was trained in simulation.
link |
01:17:34.560
It was trained in simulation entirely. Really? So it wasn't in the physical world that the hand
link |
01:17:39.680
wasn't trained? No, 100% of the training was done in simulation. And the policy that was
link |
01:17:45.840
learned in simulation was trained to be very adaptive. So adaptive that when you transfer it,
link |
01:17:50.800
it could very quickly adapt to the physical world. So the kind of perturbations
link |
01:17:55.360
with the giraffe or whatever the heck it was, were those part of the simulation?
link |
01:18:01.680
Well, the simulation was general, so the policy was trained to be robust to many
link |
01:18:07.280
different things, but not the kind of perturbations we've had in the video. So it's never been
link |
01:18:11.680
trained with a glove. It's never been trained with a stuffed giraffe. So in theory, these are
link |
01:18:18.160
novel perturbations? Correct. It's not in theory, it's in practice. And those are novel perturbations?
link |
01:18:23.680
Well, that's okay. That's a small scale, but clean, example of a transfer from the simulated
link |
01:18:30.480
world to the physical world. Yeah. And I will also say that I expect the transfer capabilities
link |
01:18:35.440
of deep learning to increase in general. And the better the transfer capabilities are,
link |
01:18:40.320
the more useful simulation will become. Because then you could
link |
01:18:46.240
experience something in simulation, and then learn a moral of the story, which you could then
link |
01:18:50.880
carry with you to the real world, right? As humans do all the time with computer games.
link |
01:18:56.880
So let me ask sort of an embodied question, staying on AGI for a sec. Do you think an AGI system
link |
01:19:06.000
would need to have a body? Would it need to have some of those human elements of self awareness, consciousness,
link |
01:19:12.880
sort of fear of mortality, sort of self preservation in the physical space, which comes with having
link |
01:19:19.120
a body? I think having a body will be useful. I don't think it's necessary. But I think it's
link |
01:19:24.720
very useful to have a body for sure, because you can learn things which
link |
01:19:30.160
cannot be learned without a body. But at the same time, I think that if you don't have
link |
01:19:34.960
a body, you could compensate for it and still succeed. You think so? Yes. Well, there is evidence
link |
01:19:40.480
for this. For example, there are many people who were born deaf and blind, and they were able to
link |
01:19:45.920
compensate for the lack of modalities. I'm thinking about Helen Keller specifically. So even if you're
link |
01:19:52.080
not able to physically interact with the world, and if you're not able to, I mean, I actually was
link |
01:19:57.760
getting at maybe let me ask on the more particular, I'm not sure if it's connected to having a body
link |
01:20:04.800
or not, but the idea of consciousness and a more constrained version of that is self awareness.
link |
01:20:11.120
Do you think an AGI system should have consciousness? Which we can't define, kind of, whatever the
link |
01:20:17.680
heck you think consciousness is. Yeah, hard question to answer, given how hard it is to define it.
link |
01:20:24.560
Do you think it's useful to think about? I mean, it's definitely interesting. It's fascinating.
link |
01:20:29.680
I think it's definitely possible that our systems will be conscious.
link |
01:20:32.880
Do you think that's an emergent thing that just comes from? Do you think consciousness could
link |
01:20:37.360
emerge from the representations that are stored within the neural networks? So, like, it naturally just
link |
01:20:42.400
emerges when you become more and more able to represent more and more of the world?
link |
01:20:46.960
Well, let's say I'd make the following argument, which is humans are conscious. And if you believe
link |
01:20:54.480
that artificial neural nets are sufficiently similar to the brain, then there should at least
link |
01:21:01.200
exist artificial neural nets that should be conscious too. You're leaning on that existence proof pretty
link |
01:21:06.000
heavily. Okay. But that's the best answer I can give. No, I know. I know. I know. There's still
link |
01:21:16.320
an open question if there's not some magic in the brain that we're not aware of. I mean, I don't mean
link |
01:21:21.680
a non materialistic magic, but that the brain might be a lot more complicated and interesting
link |
01:21:28.080
than we give it credit for. If that's the case, then it should show up. And at some point,
link |
01:21:33.440
at some point, we will find out that we can't continue to make progress. But I think it's
link |
01:21:37.360
unlikely. So we talked about consciousness, but let me ask about another poorly defined concept of
link |
01:21:42.400
intelligence. Again, we've talked about reasoning. We've talked about memory. What do you think is
link |
01:21:48.400
a good test of intelligence for you? Are you impressed by the test that Alan Turing formulated
link |
01:21:55.520
with the imitation game of natural language? Is there something in your mind that you will be
link |
01:22:02.800
deeply impressed by if a system was able to do it? I mean, lots of things. There's a certain
link |
01:22:09.040
frontier. There is a certain frontier of capabilities today. And there exist things
link |
01:22:15.360
outside of that frontier. And I would be impressed by any such thing. For example, I would be
link |
01:22:20.720
impressed by a deep learning system, which solves a very pedestrian, you know, pedestrian task
link |
01:22:27.120
like machine translation or a computer vision task or something, which never makes a mistake a human
link |
01:22:33.680
wouldn't make under any circumstances. I think that is something which has not yet been demonstrated.
link |
01:22:39.920
And I would find it very impressive. Yes. So right now, they make mistakes in different ways,
link |
01:22:44.720
they might be more accurate than human beings, but they still make a different set of mistakes.
link |
01:22:49.040
So my guess would be that a lot of the skepticism that some people have about deep learning
link |
01:22:55.680
is when they look at their mistakes and they say, well, those mistakes,
link |
01:22:59.200
they make no sense. Like, if you understood the concept, you wouldn't make that mistake.
link |
01:23:03.920
And I think that changing that would inspire me. That would be,
link |
01:23:09.600
yes, this is progress. Yeah, that's a really nice way to put it.
link |
01:23:15.280
But I also just don't like that human instinct to criticize a model as not intelligent. That's
link |
01:23:21.600
the same instinct we have when we criticize any group of creatures as the other. Because
link |
01:23:31.280
it's very possible that GPT two is much smarter than human beings at many things.
link |
01:23:36.320
That's definitely true. It has a lot more breadth of knowledge.
link |
01:23:39.280
Yes, breadth of knowledge, and even perhaps depth on certain topics. It's kind of
link |
01:23:46.400
hard to judge what depth means, but there's definitely a sense in which humans don't make
link |
01:23:52.560
mistakes that these models do. Yes, the same applies to autonomous vehicles. The same is
link |
01:23:58.160
probably going to continue being applied to a lot of artificial intelligence systems. We find
link |
01:24:02.880
this is the annoying thing. This is the process, in the 21st century, of analyzing
link |
01:24:07.840
the progress of AI is the search for one case where the system fails in a big way where humans
link |
01:24:15.760
would not. And then many people write articles about it. And then broadly, the public
link |
01:24:23.600
generally gets convinced that the system is not intelligent. And we like pacify ourselves by
link |
01:24:28.800
thinking it's not intelligent because of this one anecdotal case. And this seems to continue
link |
01:24:33.520
happening. Yeah, I mean, there is truth to that. Although I'm sure that plenty of people are also
link |
01:24:38.400
extremely impressed by the systems that exist today. But I think this connects to the earlier
link |
01:24:42.160
point we discussed that it's just confusing to judge progress in AI. And you have a new robot
link |
01:24:49.440
demonstrating something. How impressed should you be? And I think that people will start to be
link |
01:24:55.360
impressed once AI starts to really move the needle on the GDP. So you're one of the people that
link |
01:25:01.440
might be able to create an AGI system here, not you, but you and OpenAI. If you do create an
link |
01:25:07.920
AGI system, and you get to spend sort of the evening with it, him, her, what would you talk
link |
01:25:15.680
about do you think? The very first time? The first time? Well, the first time was just,
link |
01:25:21.680
we would just ask all kinds of questions and try to get it to make a mistake. And I
link |
01:25:26.000
would be amazed that it doesn't make mistakes, and just keep asking broad questions. What kind of questions do
link |
01:25:34.000
you think? Would they be factual or would they be personal, emotional, psychological? What do you
link |
01:25:41.120
think? All of the above. Would you ask for advice? Definitely. I mean, why would I limit myself
link |
01:25:51.440
talking to a system like this? Now, again, let me emphasize the fact that you truly are one of
link |
01:25:57.280
the people that might be in the room where this happens. So let me ask a sort of a profound
link |
01:26:04.400
question about this. I've just talked to a Stalin historian, been talking to a lot of people who
link |
01:26:11.360
are studying power. Abraham Lincoln said, nearly all men can stand adversity. But if you want to
link |
01:26:18.240
test a man's character, give him power. I would say the power of the 21st century, maybe the 22nd,
link |
01:26:26.320
but hopefully the 21st would be the creation of an AGI system and the people who have control,
link |
01:26:33.360
direct possession and control of the AGI system. So what do you think after spending that evening
link |
01:26:41.120
having a discussion with the AGI system? What do you think you would do?
link |
01:26:45.040
Well, the ideal world I'd like to imagine is one where humanity is like the board,
link |
01:26:56.960
the board members of a company where the AGI is the CEO. So it would be,
link |
01:27:05.520
The picture which I would imagine is you have some kind of different
link |
01:27:09.760
entities, different countries or cities, and the people that live there vote for what the AGI
link |
01:27:17.520
that represents them should do and the AGI that represents them goes and does it. I think a picture
link |
01:27:22.160
like that, I find very appealing. You could have multiple AGIs, you would have an AGI for a city,
link |
01:27:28.560
for a country and it would be trying to in effect take the democratic process to the next level.
link |
01:27:35.920
And the board can almost fire the CEO. Essentially, press the reset button, say.
link |
01:27:40.480
Press the reset. Rerandomize the parameters.
link |
01:27:42.800
Well, let me sort of, that's actually, okay, that's a beautiful vision, I think,
link |
01:27:48.960
as long as it's possible to press the reset button. Do you think it will always be possible to
link |
01:27:54.960
press the reset button? So I think that it's definitely really possible to build.
link |
01:28:00.000
So you're talking, so the question as I really understand it from you is, will humans, or will
link |
01:28:11.600
people have control over the AI systems that they build? Yes. And my answer is, it's definitely
link |
01:28:16.720
possible to build AI systems which will want to be controlled by their humans. Wow, that's part of
link |
01:28:22.640
their, so it's not just that they can't help but be controlled, but that's one of the objectives
link |
01:28:32.560
of their existence is to be controlled. In the same way that human parents
link |
01:28:39.760
generally want to help their children, they want their children to succeed. It's not a burden for
link |
01:28:45.200
them. They are excited to help the children and to feed them and to dress them and to take care of
link |
01:28:51.360
them. And I believe with high conviction that the same will be possible for an AGI. It will be
link |
01:28:59.040
possible to program an AGI, to design it in such a way that it will have a similar deep drive that
link |
01:29:04.800
it will be delighted to fulfill and the drive will be to help humans flourish. But let me take a step
link |
01:29:12.160
back to that moment where you create the AGI system. I think this is a really crucial moment.
link |
01:29:16.880
And between that moment and the Democratic Board members with the AGI at the head,
link |
01:29:28.800
there has to be a relinquishing of power. So it's George Washington, despite all the bad
link |
01:29:35.680
things he did, one of the big things he did is he relinquished power. He first of all didn't want
link |
01:29:40.240
to be president. And even when he became president, he didn't keep serving as most
link |
01:29:46.240
dictators do, indefinitely. Do you see yourself being able to relinquish control over an AGI system
link |
01:29:56.240
given how much power you can have over the world, at first financial, just make a lot of money,
link |
01:30:02.640
and then control by having possession of an AGI system? I'd find it trivial to do that. I'd find
link |
01:30:09.200
it trivial to relinquish this kind of power. I mean, the kind of scenario you are describing
link |
01:30:14.960
sounds terrifying to me. That's all. I would absolutely not want to be in that position.
link |
01:30:22.320
Do you think you represent the majority or the minority of people in the AGI community?
link |
01:30:29.280
Well, I mean, it's an open question, an important one. Are most people good is another way to ask it.
link |
01:30:35.600
So I don't know if most people are good. But I think that when it really counts,
link |
01:30:44.240
people can be better than we think. That's beautifully put. Yeah.
link |
01:30:49.040
Are there specific mechanisms you can think of for aligning AGI values to human values?
link |
01:30:54.400
Do you think about these problems of continued alignment as we develop the AI systems?
link |
01:31:00.160
Yeah, definitely. In some sense, the kind of question which you are asking is,
link |
01:31:07.200
so if you have to translate the question to today's terms, it would be a question about
link |
01:31:13.280
how to get an RL agent that's optimizing a value function, which itself is learned.
link |
01:31:21.040
And if you look at humans, humans are like that because the reward function,
link |
01:31:24.880
the value function of humans is not external, it is internal.
link |
01:31:28.560
That's right. And there are definite ideas of how to train a value function,
link |
01:31:36.720
basically an objective, as objective as possible, perception system
link |
01:31:42.400
that will be trained separately to recognize, to internalize human judgments on different
link |
01:31:50.160
situations. And then that component would be integrated as the base value
link |
01:31:56.000
function for some more capable RL system. You could imagine a process like this.
link |
01:32:00.400
I'm not saying this is the process. I'm saying this is an example of the kind of thing you could do.
link |
01:32:07.360
So on that topic of the objective functions of human existence, what do you think is the
link |
01:32:13.120
objective function that's implicit in human existence? What's the meaning of life? I think
link |
01:32:28.960
the question is wrong in some way. I think that the question implies that there is an
link |
01:32:34.720
objective answer, which is an external answer, you know, your meaning of life is X. I think
link |
01:32:39.120
what's going on is that we exist and that's amazing. And we should try to make the most
link |
01:32:45.280
of it and try to maximize our own value and enjoyment of our very short time while we do exist.
link |
01:32:53.360
It's funny because action does require an objective function. It's definitely there
link |
01:32:57.280
in some form, but it's difficult to make it explicit and maybe impossible to make it explicit.
link |
01:33:02.720
I guess is what you're getting at. And that's an interesting fact of an RL environment.
link |
01:33:08.000
Well, what I was making a slightly different point is that humans want things and their wants
link |
01:33:13.920
create the drives that cause them to, you know, our wants are our objective functions,
link |
01:33:19.680
our individual objective functions. We can later decide that we want to change,
link |
01:33:24.080
that what we wanted before is no longer good and we want something else.
link |
01:33:27.040
Yeah, but they're so dynamic. There's got to be some underlying sort of Freud.
link |
01:33:32.000
There's things, there's like sexual stuff. There's people who think it's the fear of death.
link |
01:33:37.040
And there's also the desire for knowledge and, you know, all these kinds of things,
link |
01:33:41.920
procreation, sort of all the evolutionary arguments, it seems to be,
link |
01:33:46.880
there might be some kind of fundamental objective function from which everything else emerges.
link |
01:33:53.920
But it seems like it's very difficult to make it explicit.
link |
01:33:56.400
I think there probably is an evolutionary objective function, which is to survive and
link |
01:33:59.520
procreate and make your students succeed. That would be my guess. But it doesn't give an answer
link |
01:34:06.160
to the question of what's the meaning of life. I think you can see how humans are part of this
link |
01:34:11.760
big process, this ancient process. We exist on a small planet. And that's it.
link |
01:34:20.720
So given that we exist, try to make the most of it and try to
link |
01:34:25.040
enjoy more and suffer less as much as we can. Let me ask two silly questions about life.
link |
01:34:30.880
One, do you have regrets, moments that if you went back, you would do differently? And two,
link |
01:34:39.920
are there moments that you're especially proud of that made you truly happy?
link |
01:34:44.640
So I can answer both questions. Of course, there's a huge number of choices and decisions that
link |
01:34:51.840
I have made that, with the benefit of hindsight, I wouldn't have made. And I do experience
link |
01:34:56.160
some regret, but, you know, I try to take solace in the knowledge that at the time I did the best
link |
01:35:01.360
I could. And in terms of things that I'm proud of, I'm very fortunate to have done things I'm
link |
01:35:06.880
proud of. And they made me happy for some time, but I don't think that that is the source of
link |
01:35:12.720
happiness. So your academic accomplishments, all the papers, you're one of the most cited people
link |
01:35:18.720
in the world, all the breakthroughs I mentioned in computer vision and language and so on, is that
link |
01:35:24.720
the source of happiness and pride for you? I mean, all those things are a source of pride,
link |
01:35:30.960
for sure. I'm very grateful for having done all those things. And it was very fun to do them.
link |
01:35:37.280
But happiness, well, my current view is that happiness comes,
link |
01:35:42.800
to a very large degree, from the way we look at things. You know, you can have a
link |
01:35:48.480
simple meal and be quite happy as a result, or you can talk to someone and be happy as a result
link |
01:35:53.760
as well. Or conversely, you can have a meal and be disappointed that the meal wasn't a better meal.
link |
01:36:00.240
So I think a lot of happiness comes from that, but I'm not sure. I don't want to be too confident.
link |
01:36:05.440
Being humble in the face of the uncertainty seems to be also a part of this whole happiness thing.
link |
01:36:12.000
Well, I don't think there's a better way to end it than meaning of life and discussions of happiness.
link |
01:36:17.760
So, Ilya, thank you so much. You've given me a few incredible ideas. You've given the world
link |
01:36:23.600
many incredible ideas. I really appreciate it. And thanks for talking today.
link |
01:36:27.280
Yeah, thanks for stopping by. I really enjoyed it.
link |
01:36:30.320
Thanks for listening to this conversation with Ilya Sutskever. And thank you to our
link |
01:36:34.000
presenting sponsor, Cash App. Please consider supporting the podcast by downloading Cash App
link |
01:36:39.440
and using the code Lex Podcast. If you enjoy this podcast, subscribe on YouTube,
link |
01:36:45.200
review it with five stars on Apple Podcasts, support on Patreon, or simply connect with me
link |
01:36:50.320
on Twitter at Lex Freedman. And now let me leave you with some words from Alan Turing on Machine
link |
01:36:58.160
Learning. Instead of trying to produce a program to simulate the adult mind, why not rather try
link |
01:37:04.800
to produce one which simulates the child? If this were then subjected to an appropriate course of
link |
01:37:11.200
education, one would obtain the adult brain. Thank you for listening and hope to see you next time.