Ilya Sutskever: Deep Learning | Lex Fridman Podcast #94

The following is a conversation with Ilya Sutskever, cofounder and chief scientist of OpenAI, one of the most cited computer scientists in history with over 165,000 citations, and to me, one of the most brilliant and insightful minds ever in the field of deep learning. There are very few people in this world who I would rather talk to and brainstorm with about deep learning, intelligence, and life in general than Ilya, on and off the mic. This was an honor and a pleasure. This conversation was recorded before the outbreak of the pandemic. For everyone feeling the medical, psychological, and financial burden of this crisis, I'm sending love your way. Stay strong, we're in this together, we'll beat this thing.

This is the Artificial Intelligence Podcast. If you enjoy it, subscribe on YouTube, review it with five stars on Apple Podcasts, support it on Patreon, or simply connect with me on Twitter at Lex Fridman, spelled F-R-I-D-M-A-N. As usual, I'll do a few minutes of ads now and never any ads in the middle that can break the flow of the conversation. I hope that works for you and doesn't hurt the listening experience.

This show is presented by Cash App, the number one finance app in the App Store. When you get it, use code Lex Podcast. Cash App lets you send money to friends, buy Bitcoin, and invest in the stock market with as little as $1. Since Cash App allows you to buy Bitcoin, let me mention that cryptocurrency in the context of the history of money is fascinating. I recommend The Ascent of Money as a great book on this history. Both the book and audiobook are great. Debits and credits on ledgers started around 30,000 years ago, the US dollar was created over 200 years ago, and Bitcoin, the first decentralized cryptocurrency, was released just over 10 years ago. So given that history, cryptocurrency is still very much in its early days of development, but it just might redefine the nature of money. So again, if you get Cash App from the App Store or Google Play and use the code Lex Podcast, you get $10, and Cash App will also donate $10 to FIRST, an organization that is helping to advance robotics and STEM education for young people around the world.

And now, here's my conversation with Ilya Sutskever.
You were one of the three authors, with Alex Krizhevsky and Geoff Hinton, of the famed AlexNet paper that is arguably the paper that marked the big catalytic moment that launched the deep learning revolution. Take us back to that time. What was your intuition about neural networks, about the representational power of neural networks? And maybe you could mention how that evolved over the next few years, up to today, over the 10 years.

Yeah, I can answer that question. At some point in about 2010 or 2011, I connected two facts in my mind. Basically, the realization was this. At some point, we realized that we can train very large, I shouldn't say very large, tiny by today's standards, but large and deep neural networks end to end with backpropagation. At some point, different people obtained this result. I obtained this result. The first moment in which I realized that deep neural networks are powerful was when James Martens invented the Hessian-free optimizer in 2010, and he trained a 10-layer neural network end to end, without pretraining, from scratch. And when that happened, I thought, this is it. Because if you can train a big neural network, a big neural network can represent very complicated functions. Because if you have a neural network with 10 layers, it's as though you allow the human brain to run for some number of milliseconds. Neuron firings are slow, and so in maybe 100 milliseconds, your neurons only fire 10 times, so it's also kind of like 10 layers. And in 100 milliseconds, you can perfectly recognize any object. So I already had the idea then that we need to train a very big neural network on lots of supervised data, and then it must succeed, because we can find the best neural network. And then there's also theory that if you have more data than parameters, you won't overfit. Today, we know that actually this theory is very incomplete, and you won't overfit even if you have less data than parameters. But definitely, if you have more data than parameters, you won't overfit.
So the fact that neural networks were heavily overparameterized wasn't discouraging to you? You were thinking about the theory that the huge number of parameters is okay, it's going to be okay?

I mean, there was some evidence before that it was okay, but the theory was that if you had a big data set and a big neural net, it was going to work. The overparameterization just didn't really figure much as a problem. I thought, well, with images, you're just going to add some data augmentation, and it's going to be okay.

So where was any doubt coming from?

The main doubt was, do we really have enough compute to train a big enough neural net with backpropagation? Backpropagation, I thought, would work. The thing which wasn't clear was whether there would be enough compute to get a very convincing result. And then at some point, Alex Krizhevsky wrote these insanely fast CUDA kernels for training convolutional neural nets, and that was, bam, let's do this. Let's get ImageNet, and it's going to be the greatest thing.
Was your intuition mostly from empirical results, by you and by others, just actually demonstrating that a piece of program can train a 10-layer neural network? Or was there some pen-and-paper, or marker-and-whiteboard, thinking intuition? Because you just connected a 10-layer large neural network to the brain. So, in your intuition about neural networks, does the human brain come into play as an intuition builder?

Definitely. I mean, you know, you've got to be precise with these analogies between artificial neural networks and the brain. But there's no question that the brain has been a huge source of intuition and inspiration for deep learning researchers, all the way from Rosenblatt in the 60s. If you look at the whole idea of a neural network, it's directly inspired by the brain. You had people like McCulloch and Pitts who were saying, hey, you've got these neurons in the brain, and hey, we recently learned about the computer and automata. Can we use some ideas from the computer and automata to design some kind of computational object that's going to be simple, computational, and kind of like the brain? And they invented the neuron. So they were inspired by it back then. Then you had the convolutional neural network from Fukushima, and then later Yann LeCun, who said, hey, if you limit the receptive fields of a neural network, it's going to be especially suitable for images, as it turned out to be true.
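The receptive-field idea can be sketched in a few lines. This is a toy illustration, not anyone's actual architecture: each output unit sees only a small local window of the input, and every window shares the same weights, which is exactly a 1-D convolution.

```python
# Toy sketch of limited receptive fields with weight sharing (illustrative only).
# Instead of connecting every input to every output, each output unit sees only
# len(kernel) neighboring inputs, and all units reuse the same kernel weights.

def conv1d(signal, kernel):
    """Valid 1-D convolution: one output per window position."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

edge_detector = [1, -1]          # shared weights that respond to local changes
signal = [0, 0, 1, 1, 1, 0]      # a step "edge" in a 1-D image row
print(conv1d(signal, edge_detector))  # nonzero only at the edges: [0, -1, 0, 0, 1]
```

Sharing one small kernel across positions is what makes the layer both well suited to images and very cheap in parameters compared to a fully connected layer.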
So there was a very small number of examples where analogies to the brain were successful. And I thought, well, probably an artificial neuron is not that different from the brain's, if you squint hard enough. So let's just assume it is and roll with it.

So we're now at a time where deep learning is very successful. So let us squint less, let's open our eyes and say, what to you is an interesting difference between the human brain and artificial neural networks? Now, I know you're not an expert, neither a neuroscientist nor a biologist, but loosely speaking, what's the difference between the human brain and artificial neural networks that's interesting to you for the next decade or two?

That's a good question to ask: what is an interesting difference between the brain and our artificial neural networks? So I feel like today, we all agree that there are certain dimensions in which the human brain vastly outperforms our models. But I also think that there are some ways in which artificial neural networks have a number of very important advantages over the brain. Looking at the advantages versus disadvantages is a good way to figure out what is the important difference. So, the brain uses spikes, which may or may not be important.

Yes, that's a really interesting question. Do you think it's important or not? That's one big architectural difference between artificial neural networks and the brain.

It's hard to tell, but my prior is not very high, and I can say why. There are people who are interested in spiking neural networks. And basically, what they figured out is that they need to simulate the non-spiking neural networks in spikes, and that's how they're going to make them work. If you don't simulate the non-spiking neural networks in spikes, it's not going to work, because the question is, why should it work? And that connects to questions around backpropagation and questions around deep learning. You've got this giant neural network. Why should it work at all? Why should the learning rule work at all? It's not self-evident, especially if, let's say, you were just starting in the field and you read the very early papers. You could say, hey, people are saying, let's build neural networks. That's a great idea, because the brain is a neural network, so it would be useful to build neural networks. Now, let's figure out how to train them. It should be possible to train them, probably, but how?
And so the big idea is the cost function. That's the big idea. The cost function is a way of measuring the performance of the system according to some measure.

By the way, actually, let me think. Is that a difficult idea to arrive at? And how big of an idea is it that there's a single cost function? Sorry, let me take a pause. Is supervised learning a difficult concept to come to?

I don't know. All concepts are very easy in retrospect.

Yeah, it seems trivial now. The reason I ask, and we'll talk about it, is: are there other things? Are there things that don't necessarily have a cost function, maybe have many cost functions, or maybe have dynamic cost functions, or maybe totally different kinds of architectures? Because we have to think like that in order to arrive at something new, right?

So the good examples of things which don't have clear cost functions are GANs. In a GAN, you have a game. So instead of thinking of a cost function that you want to optimize, where you know that you have an algorithm, gradient descent, which will optimize the cost function, and then you can reason about the behavior of your system in terms of what it optimizes, with a GAN you say, I have a game, and I'll reason about the behavior of the system in terms of the equilibrium of the game. But it's all about coming up with these mathematical objects that help us reason about the behavior of our system.

Right, that's really interesting. So the GAN is the only one. It's kind of, the cost function is emergent from the competition.

I don't know if it has a cost function. I don't know if it's meaningful to talk about the cost function of a GAN. It's kind of like the cost function of biological evolution or the cost function of the economy. You can talk about regions towards which it will go, but I don't think the cost function analogy is the most useful.

That's really interesting. So if evolution doesn't really have a cost function, something akin to our mathematical conception of a cost function, then do you think cost functions in deep learning are holding us back? You just kind of mentioned that the cost function is a nice, first, profound idea. Do you think that's an idea we'll go past? Self-play starts to touch on that a little bit in reinforcement learning systems.

That's right. Self-play, and also ideas around exploration, where you're trying to take actions that surprise a predictor. I'm a big fan of cost functions. I think cost functions are great and they serve us really well. And I think that whenever we can do things with cost functions, we should. And, you know, maybe there is a chance that we will come up with some yet another profound way of looking at things that will involve cost functions in a less central way. But I don't know. I would not bet against cost functions.

Are there other things about the brain that pop into your mind that might be different and interesting for us to consider in designing artificial neural networks? We talked about spiking a little bit.

I mean, one thing which may potentially be useful: I think neuroscientists have figured out something about the learning rule of the brain. I'm talking about spike-timing-dependent plasticity, and it would be nice if some people were to study that in simulation.

Wait, sorry, spike-timing-dependent plasticity?

Yeah, that's STDP. It's a particular learning rule that uses spike timing to determine how to update the synapses. So it's kind of like, if a synapse fires into the neuron before the neuron fires, then it strengthens the synapse, and if the synapse fires into the neuron shortly after the neuron fired, then it weakens the synapse, something along those lines. I'm 90% sure it's right, so if I said something wrong here, don't get too angry.
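The rule described here, with the same hedging, can be sketched as a toy function. The amplitudes and time constant below are made-up illustrative values, not biological measurements: the sign of the weight change depends on whether the presynaptic spike leads or lags the postsynaptic one, and its magnitude decays with the time difference.

```python
import math

# Toy sketch of spike-timing-dependent plasticity (STDP), as described above:
# if the presynaptic neuron fires shortly BEFORE the postsynaptic one, the
# synapse is strengthened; if it fires shortly AFTER, it is weakened.
# a_plus, a_minus, and tau_ms are illustrative values, not biological fits.

def stdp_update(dt_ms, a_plus=0.1, a_minus=0.12, tau_ms=20.0):
    """Weight change for a spike-time difference dt = t_post - t_pre (ms)."""
    if dt_ms > 0:    # pre fired before post: potentiation
        return a_plus * math.exp(-dt_ms / tau_ms)
    else:            # pre fired at or after post: depression
        return -a_minus * math.exp(dt_ms / tau_ms)

print(stdp_update(5.0))    # pre leads post by 5 ms: positive (strengthen)
print(stdp_update(-5.0))   # pre lags post by 5 ms: negative (weaken)
```

The exponential window means spike pairs far apart in time barely change the synapse, which is the sense in which the rule "uses spike timing."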
But you sounded brilliant while saying it. But the timing, that's one thing that's missing. The temporal dynamics is not captured. I think that's like a fundamental property of the brain, the timing of the signals.

Well, you have recurrent neural networks.

But you think of that as, I mean, that's a very crude, simplified, what's that called? There's a clock, I guess, to recurrent neural networks. It seems like the brain is the general, the continuous version of that, the generalization where all possible timings are possible, and then within those timings is contained some information. Do you think the recurrence in recurrent neural networks can capture the same kind of phenomena as the timing that seems to be important for the brain, in the firing of neurons in the brain?

I mean, I think recurrent neural networks are amazing, and I think they can do anything we'd want a system to do. Right now, recurrent neural networks have been superseded by transformers, but maybe one day they'll make a comeback. Maybe they'll be back. We'll see.

Let me, on a small tangent, ask, do you think they'll be back? So much of the breakthroughs recently that we'll talk about, in natural language processing and language modeling, has been with transformers, which don't emphasize recurrence. Do you think recurrence will make a comeback?

Well, some kind of recurrence, I think, very likely. Recurrent neural networks as they're typically thought of, for processing sequences, I think that's also possible.

What is, to you, a recurrent neural network? Generally speaking, what is a recurrent neural network?

You have a neural network which maintains a high-dimensional hidden state, and then, when an observation arrives, it updates its high-dimensional hidden state through its connections in some way.

So do you think, you know, that's what expert systems did, right? Symbolic AI, the knowledge-based systems, growing a knowledge base. It's maintaining a hidden state, which is its knowledge base, and it's growing it by sequentially processing. Do you think of it more generally in that way? Or is it simply the more constrained form of a hidden state, with certain kinds of gating units, that we think of today with LSTMs?

I mean, the hidden state is technically what you described there, the hidden state that goes inside the LSTM or the RNN or something like this. But then, if you want to make the expert system analogy, I mean, you could say that the knowledge is stored in the connections, and then the short-term processing is done in the hidden state.
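That definition, a hidden state updated through the connections as each observation arrives, fits in a few lines. This is a minimal sketch; the weights here are arbitrary illustrative numbers, not trained values.

```python
import math

# Minimal sketch of a recurrent update: long-term "knowledge" lives in the
# fixed, learned connection weights, while short-term processing lives in a
# hidden state that is updated as each observation arrives.

def rnn_step(h, x, w_hh, w_xh):
    """One recurrence: new hidden state from the old state and current input."""
    return [math.tanh(sum(w_hh[i][j] * h[j] for j in range(len(h)))
                      + sum(w_xh[i][k] * x[k] for k in range(len(x))))
            for i in range(len(h))]

w_hh = [[0.5, -0.3], [0.2, 0.4]]   # state-to-state connections ("knowledge")
w_xh = [[1.0, 0.0], [0.0, 1.0]]    # input-to-state connections
h = [0.0, 0.0]                     # hidden state ("short-term memory")
for x in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:  # a sequence of observations
    h = rnn_step(h, x, w_hh, w_xh)
print(h)  # the state now summarizes the whole sequence seen so far
```

An LSTM adds gating units on top of exactly this update, to control what enters and leaves the hidden state.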
Yes. Could you say that? So, do you think there's a future of building large-scale knowledge bases within neural networks?

Definitely.

So we're going to pause on that confidence, because I want to explore that. But let me zoom back out and ask, back to the history of ImageNet: neural networks have been around for many decades, as you mentioned. What do you think were the key ideas that led to their success, that ImageNet moment and beyond, the success in the past 10 years?

Okay, so the question is, to make sure I didn't miss anything, the key ideas that led to the success of deep learning over the past 10 years.

Exactly. Even though the fundamental thing behind deep learning has been around for much longer.

So the key idea about deep learning, or rather the key fact about deep learning before deep learning started to be successful, is that it was underestimated. People who worked in machine learning simply didn't think that neural networks could do much. People didn't believe that large neural networks could be trained. There was a lot of debate going on in machine learning about what are the right methods and so on, and people were arguing because there was no way to get hard facts. And by that, I mean, there were no benchmarks which were truly hard, such that if you do really well on them, then you can say, look, here's my system. That's when a field becomes a little bit more of an engineering field. So in terms of deep learning, to answer the question directly, the ideas were all there. The thing that was missing was a lot of supervised data and a lot of compute. Once you have a lot of supervised data and a lot of compute, then there is a third thing which is needed as well, and that is conviction: conviction that if you take the right stuff, which already exists, and mix it with a lot of data and a lot of compute, it will in fact work. And so that was the missing piece. You needed the data, you needed the compute, which showed up in terms of GPUs, and you needed the conviction to realize that you need to mix them together.

So that's really interesting. So I guess the presence of compute and the presence of supervised data allowed the empirical evidence to do the convincing of the majority of the computer science community. I guess there's a key moment with Jitendra Malik and Alyosha Efros, who were very skeptical, right? And then there's Geoffrey Hinton, who was the opposite of skeptical. And there was a convincing moment, and I think ImageNet served as that moment.
That's right. And they represented the big pillars of the computer vision community. Kind of, the wizards got together, and then all of a sudden there was a shift.

And it's not enough for the ideas to all be there and the compute to be there; it has to convince the cynicism that existed. That's interesting, that people just didn't believe for a couple of decades.

Yeah. Well, it's more than that. When put this way, it sounds like, well, you know, those silly people who didn't believe, what were they missing? But in reality, things were confusing, because neural networks really did not work on anything, and they were not the best method on pretty much anything as well. It was pretty rational to say, yeah, this stuff doesn't have any traction. And that's why you need to have these very hard tasks which produce undeniable evidence, and that's how we make progress. And that's why the field is making progress today, because we have these hard benchmarks which represent true progress. And this is why we were able to avoid endless debate.

So, incredibly, you've contributed some of the biggest recent ideas in AI, in computer vision, language, natural language processing, reinforcement learning, sort of everything in between, maybe not GANs. There may not be a topic you haven't touched, and of course, the fundamental science of deep learning. What is the difference to you between vision, language, and, as in reinforcement learning, action, as learning problems? And what are the commonalities? Do you see them as all interconnected, or are they fundamentally different domains that require different approaches?

Okay, that's a good question. Machine learning is a field with a lot of unity, a huge amount of unity.
What do you mean by unity? Like overlap of ideas?

Overlap of ideas, overlap of principles. In fact, there's only one or two or three principles, which are very, very simple, and then they apply in almost the same way to the different modalities, to the different problems. And that's why today, when someone writes a paper on improving optimization of deep learning in vision, it improves the different NLP applications, and it improves the different reinforcement learning applications. So I would say that computer vision and NLP are very similar to each other. Today, they differ in that they have slightly different architectures: we use transformers in NLP, and we use convolutional neural networks in vision. But it's also possible that one day this will change and everything will be unified with a single architecture. Because if you go back a few years in natural language processing, there was a huge number of architectures: every different tiny problem had its own architecture. Today, there's just one transformer for all those different tasks. And if you go back in time even more, you had even more fragmentation, and every little problem in AI had its own little subspecialization, its own little collection of skills, people who would know how to engineer the features. Now it's all been subsumed by deep learning. We have this unification. And so I expect vision to become unified with natural language as well. Or rather, I think it's possible. I don't want to be too sure, because the convolutional neural net is very computationally efficient. RL is different. RL does require slightly different techniques, because you really do need to take action, you really do need to do something about exploration, and your variance is much higher. But I think there is a lot of unity even there, and I would expect, for example, that at some point there will be some broader unification between RL and supervised learning, where somehow the RL will be making decisions to make the supervised learning go better. I imagine one big black box, and you just shovel things into it, and it just figures out what to do with whatever you shovel into it.

I mean, reinforcement learning has some aspects of language and vision combined, almost. There are elements of a long-term memory that you should be utilizing, and there are elements of a really rich sensory space. So it seems like it's the union of the two, or something like that.

I'd say something slightly different. I'd say that reinforcement learning is neither, but it naturally interfaces and integrates with the two of them.

Do you think action is fundamentally different? So, what is interesting, what is unique about the policy of learning to act?

Well, one example, for instance, is that when you learn to act, you are fundamentally in a non-stationary world, because as your actions change, the things you see start changing. You experience the world in a different way. And this is not the case for the more traditional static problem, where you have some distribution and you just apply a model to that distribution.

Do you think it's a fundamentally different problem, or is it just a more difficult, a generalization of, the problem of understanding?

I mean, it's a question of definitions, almost. There is a huge amount of commonality, for sure. You take gradients, we try to approximate gradients in both cases. In the case of reinforcement learning, you have some tools to reduce the variance of the gradients, and you do that. There's lots of commonality: the same neural net in both cases, you compute the gradient, you apply Adam in both cases.
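The commonality described here, estimating a gradient and reducing its variance, can be sketched with the simplest RL setup: a two-armed bandit with a REINFORCE-style score-function gradient and a running-average baseline. The bandit, step sizes, and seed below are illustrative choices, and plain SGD stands in for Adam to keep the sketch short.

```python
import math, random

# Toy sketch of the RL side of the commonality: estimate a gradient (here a
# score-function / REINFORCE estimator) and reduce its variance with a baseline.

random.seed(0)
true_rewards = [0.2, 0.8]        # two arms; arm 1 pays more on average
theta = [0.0, 0.0]               # softmax policy parameters
baseline, lr = 0.0, 0.1

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

for step in range(2000):
    p = softmax(theta)
    a = 0 if random.random() < p[0] else 1        # sample an action
    r = true_rewards[a] + random.gauss(0, 0.1)    # noisy reward
    baseline += 0.01 * (r - baseline)             # running-average baseline
    adv = r - baseline                            # variance-reduced signal
    for i in range(2):  # grad of log pi(a): indicator(a == i) - p[i]
        theta[i] += lr * adv * ((1 if a == i else 0) - p[i])

print(softmax(theta))  # probability mass should concentrate on arm 1
```

The supervised-learning loop would look nearly identical: sample data, compute a gradient of a cost, take a step; the extra machinery here (sampling actions, the baseline) is exactly the "tools to reduce the variance" mentioned above.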
So there's lots in common, for sure, but there are some small differences which are not completely insignificant. It's really just a matter of your point of view, what frame of reference, how much you want to zoom in or out as you look at these problems.

Which problem do you think is harder? People like Noam Chomsky believe that language is fundamental to everything, that it underlies everything. Do you think language understanding is harder than visual scene understanding, or vice versa?

I think that asking if a problem is hard is slightly wrong. I think the question is a little bit wrong, and I want to explain why.

Okay.

So what does it mean for a problem to be hard? The uninteresting, dumb answer to that is, there's a benchmark, and there's a human-level performance on that benchmark, and how much effort is required to reach the human level.

Okay, so from the perspective of how much effort until we get to human level on a very good benchmark. I understand what you mean by that.

What I was going to say is that a lot of it depends on, you know, once you solve a problem, it stops being hard, and that's always true. So whether something is hard or not depends on what our tools can do today. You know, today, human-level language understanding and visual perception are hard in the sense that there is no way of solving either problem completely in the next three months.

Right, so I agree with that statement.

Beyond that, my guess would be as good as yours. I don't know.

Okay, so you don't have a fundamental intuition about how hard language understanding is?

I think, I'd say language is probably going to be hard. I mean, it depends on how you define it. If you mean absolute, top-notch, 100% language understanding, I'll go with language. But then, if I show you a piece of paper with letters on it, do you see what I mean? You have a vision system, you say it's the best human-level vision system. I open a book and I show you letters. Will it understand how these letters form into words and sentences and meaning? Is this part of the vision problem? Where does vision end and language begin?

Yeah, so Chomsky would say it starts at language. Vision is just a little example of the kind of structure and, you know, fundamental hierarchy of ideas that's already represented in our brain somehow, that's represented through language. But where does vision stop and language begin? That's a really interesting question.
So one possibility is that it's impossible to achieve really deep understanding in either images or language without basically using the same kind of system. So you're going to get the other for free.

I think it's pretty likely that, yes, if we can get one, our machine learning is probably that good that we can get the other. But I'm not 100% sure. And also, I think a lot of it really does depend on your definitions.

Definitions of, like, perfect vision. Because, you know, reading is vision, but should it count?

Yeah, to me, my definition is, if a system looked at an image, and then a system looked at a piece of text, and then told me something about that, and I was really impressed.

That's relative. You'll be impressed for half an hour, and then you're going to say, well, I mean, all the systems do that. But here's the thing they don't do.

Yeah, but I don't have that with humans. Humans continue to impress me. Well, the ones, okay, so I'm a fan of monogamy. I like the idea of marrying somebody, being with them for several decades. So I believe in the fact that, yes, it's possible to have somebody continuously giving you pleasurable, interesting, witty, new ideas.

Friends?

Yeah, I think so. They continue to surprise you. The surprise, that injection of randomness, seems to be a nice source of continued inspiration, like the wit, the humor. I think, yeah, it would be a very subjective test, but I think if you have enough humans in the room...

Yeah, I understand what you mean. I feel like I misunderstood what you meant by impressing you. I thought you meant to impress you with its intelligence, with how well it understands an image. I thought you meant something like, I'm going to show you a really complicated image, and it's going to get it right, and you're going to say, wow, that's really cool. The systems of January 2020 have not been doing that.
Yeah, no, I think it all boils down to the reason people click like on stuff on the
link |
internet, which is like it makes them laugh. So it's like humor or wit or insight.
link |
I'm sure we'll get that as well. So forgive the romanticized question, but looking back, what is the most beautiful or surprising idea in deep learning, or AI in general, that you've come across?

So I think the most beautiful thing about deep learning is that it actually works. And I mean it, because you got these ideas, you got the little neural network, you got the backpropagation algorithm, and then you got some theories as to, you know, this is kind of like the brain, so maybe if you make it large, if you make the neural network large and you train it on a lot of data, then it will do the same function that the brain does. And it turns out to be true. That's crazy. And now we just train these neural networks, and you make them larger, and they keep getting better. And I find it unbelievable. I find it unbelievable that this whole AI stuff with neural networks works.

Have you built up an intuition of why? Are there little bits and pieces of intuition, of insights, of why this whole thing works?
link |
Some, definitely. We now have huge amounts of empirical reasons to believe that optimization should work on most problems we care about.

Do you have insights of what... so you just said empirical evidence. Most of your empirical evidence kind of convinces you. It's like evolution is empirical: it shows you that, look, this evolutionary process seems to be a good way to design organisms that survive in their environment. But it doesn't really get you to the insides of how the whole thing works.

I think a good analogy is physics. You know how you say, hey, let's do some physics calculation, come up with some new physics theory, and make some predictions? But then you've got to run the experiment. Running the experiment is important. So it's a bit the same here, except that maybe sometimes the experiment came before the theory. But it still is the case: you have some data and you come up with some prediction. You say, yeah, let's make a big neural network, let's train it, and it's going to work much better than anything before it, and it will in fact continue to get better as you make it larger. And it turns out to be true. That's amazing, when a theory is validated like this. It's not a mathematical theory; it's more of a biological theory, almost. So I think there are not-terrible analogies between deep learning and biology. I would say it's like the geometric mean of biology and physics; that's deep learning.

The geometric mean of biology and physics. I think I'm going to need a few hours to wrap my head around that, because just to find the set of what biology represents...

Well, in biology, things are really complicated. It's really hard to have good predictive theories. And in physics, the theories are too good. In physics, people make these super precise theories which make these amazing predictions. And in machine learning, we're kind of in between.

Kind of in between, but it'd be nice if machine learning somehow helped us discover the unification of the two, as opposed to serving as the in-between.
link |
But you're right, you're kind of trying to juggle both. So do you think there are still beautiful and mysterious properties in neural networks that are yet to be discovered?

Definitely. I think that we are still massively underestimating deep learning.

What do you think it will look like?

Like, if I knew, I would have done it. But if you look at all the progress from the past 10 years, I would say most of it... there have been a few cases where things that felt like really new ideas showed up, but by and large, it was that every year we thought, okay, deep learning goes this far. Nope, it actually goes further. And then the next year: okay, now this is peak deep learning, we are really done. Nope, it goes further. It just keeps going further each year. So that means that we keep underestimating it, we keep not understanding it. It has surprising properties all the time.

Do you think it's getting harder and harder to make progress?

It depends on what we mean. I think the field will continue to make very robust progress for quite a while. I think for individual researchers, especially people who are doing research, it can be harder, because there is a very large number of researchers right now. I think that if you have a lot of compute, then you can make a lot of very interesting discoveries, but then you have to deal with the challenge of managing a huge compute cluster to run your experiments. That's a little bit harder.
link |
So I'm asking all these questions that nobody knows the answer to, but you're one of the smartest people I know, so I'm going to keep asking them. Let's imagine all the breakthroughs that happen in the next 30 years in deep learning. Do you think most of those breakthroughs can be done by one person with one computer? In the space of breakthroughs, do you think compute and large efforts will be necessary?

I mean, I can't be sure. When you say one computer, you mean how large?

You're clever. I mean, one GPU.

I see. I think it's pretty unlikely. I think it's pretty unlikely. The stack of deep learning is starting to be quite deep. If you look at it, you've got everything from the ideas, to the systems to build the datasets, the distributed programming, building the actual cluster, the GPU programming, putting it all together. So the stack is getting really deep, and I think it can be quite hard for a single person to be world class in every single layer of the stack.

What about what Vladimir Vapnik really insists on, taking MNIST and trying to learn from very few examples, so being able to learn more efficiently? Do you think there will be breakthroughs in that space that may not need this huge compute?

I think there will be a large number of breakthroughs in general that will not need a huge amount of compute. So maybe I should clarify that. I think that some breakthroughs will require a lot of compute, and I think building systems which actually do things will require a huge amount of compute. That one is pretty obvious: if you want to do X, and X requires a huge neural net, you've got to get a huge neural net. But I think there is lots of room for very important work to be done by small groups and individuals.

Can you maybe, on the topic of the science of deep learning, talk about one of the recent papers that you've released, the deep double descent paper, where bigger models and more data hurt? I think it's a really interesting paper.
link |
Can you describe the main idea?

Yeah, definitely. So what happened is that, over the years, some small number of researchers noticed that it is kind of weird that when you make the neural network larger, it works better, and it seems to go in contradiction with statistical ideas. Then some people made an analysis showing that actually you get this double descent bump. And what we've done was to show that double descent occurs for pretty much all practical deep learning systems.

Can you step back? What's the x axis and the y axis of a double descent plot?

Okay, great. So you can take your neural network and start increasing its size slowly while keeping your dataset fixed. If you increase the size of the neural network slowly, and if you don't do early stopping, that's a pretty important detail, then when the neural network is really small, as you make it larger, you get a very rapid increase in performance. Then you continue to make it larger, and at some point performance will get worse. And it gets worst exactly at the point at which it achieves zero training error, precisely zero training loss. And then, as you make it larger, it starts to get better again.
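The curve just described can be reproduced in miniature without a deep network: minimum-norm regression on random ReLU features shows the same non-monotonic test error as model size sweeps past the size of the training set. This is a hedged sketch, not the paper's setup; the target function, sizes, and seed below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed training set: n points from a noisy 1-D function (x axis of the
# double-descent plot will be model size, with this dataset held fixed).
n_train, n_test, d = 30, 200, 1
x_train = rng.uniform(-1, 1, (n_train, d))
y_train = np.sin(3 * x_train[:, 0]) + 0.1 * rng.standard_normal(n_train)
x_test = rng.uniform(-1, 1, (n_test, d))
y_test = np.sin(3 * x_test[:, 0])

def random_relu_features(x, W, b):
    # Fixed random first layer; only the linear readout is fit.
    return np.maximum(0.0, x @ W + b)

def fit_min_norm(width):
    # Min-norm least squares: with width >= n_train this interpolates
    # the training data (zero training loss), like SGD run to convergence.
    W = rng.standard_normal((d, width))
    b = rng.standard_normal(width)
    Phi_tr = random_relu_features(x_train, W, b)
    Phi_te = random_relu_features(x_test, W, b)
    w = np.linalg.pinv(Phi_tr) @ y_train
    train_err = np.mean((Phi_tr @ w - y_train) ** 2)
    test_err = np.mean((Phi_te @ w - y_test) ** 2)
    return train_err, test_err

# Sweep model size past the interpolation threshold (width == n_train).
widths = [2, 5, 10, 20, 30, 50, 100, 500, 2000]
errs = [fit_min_norm(m) for m in widths]
```

Plotting the test errors in `errs` against `widths` typically shows the bump described above near `width == n_train`, the point where training error first hits zero, with test error improving again as the model grows further; note that no early stopping is used.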
link |
And it's kind of counterintuitive, because you'd expect deep learning phenomena to be monotonic. It's hard to be sure what it means, but it also occurs in the case of linear classifiers. And the intuition basically boils down to the following. What is overfitting? Overfitting is when your model is somehow very sensitive to the small, random, unimportant stuff in your dataset, in the training dataset precisely. So if you have a small model and you have a big dataset, there may be some randomness, you know, some training cases are randomly in the dataset and others may not be there. But the small model is kind of insensitive to this randomness, because there is pretty much no uncertainty about the model when the dataset is large.

Okay, so at the very basic level, to me the most surprising thing is that neural networks don't overfit every time, very quickly, before ever being able to learn anything, given the huge number of parameters.

So here is one way, okay, let me try to give the explanation; maybe that will work. You have a huge neural network with a huge number of parameters. Now let's pretend everything is linear, which it's not, but let's just pretend. Then there is this big subspace where your neural network achieves zero error, and SGD is going to find approximately the point...

Really?

That's right, approximately the point with the smallest norm in that subspace. And that can also be proven to be insensitive to the small randomness in the data when the dimensionality is high. But when the dimensionality of the data is equal to the dimensionality of the model, then there is a one-to-one correspondence between all the datasets and the models. So small changes in the dataset actually lead to large changes in the model, and that's why performance gets worse. So this is the best explanation, more or less.

So then it would be good for the model to have more parameters, to be bigger than the data?

That's right, but only if you don't early stop. If you introduce early stopping as your regularization, you can make the double descent bump almost completely disappear.

What is early stopping?

Early stopping is when you train your model and you monitor your validation performance. And then, if at some point validation performance starts to get worse, you say, okay, let's stop training. We are good. We are good enough.
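The minimum-norm claim a moment ago can be checked numerically in the linear picture: on an overparameterized least-squares problem, plain gradient descent started from zero lands on the same interpolating solution as the pseudoinverse. A small sketch (the dimensions, seed, and learning rate are arbitrary choices for the demo):

```python
import numpy as np

rng = np.random.default_rng(1)

# Overparameterized linear regression: more parameters than data points,
# so a whole subspace of weight vectors achieves zero training error.
n, p = 10, 50
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Closed-form minimum-norm point of that zero-error subspace.
w_min_norm = np.linalg.pinv(X) @ y

# Plain gradient descent on squared loss, starting from zero.
w = np.zeros(p)
lr = 0.01
for _ in range(20000):
    grad = X.T @ (X @ w - y) / n
    w -= lr * grad

# Gradient descent from zero never leaves the row space of X, so it
# converges to (approximately) the same minimum-norm interpolant.
```

After the loop, `w` both interpolates the data (`X @ w == y` up to numerical error) and matches `w_min_norm`, which is the linear-algebra version of "SGD finds the smallest-norm point in the zero-error subspace."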
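Early stopping as just defined is a few lines of bookkeeping around the training loop. A generic patience-based sketch; the stubbed training step, synthetic validation curve, and patience value are made up for the demo:

```python
def train_with_early_stopping(train_step, val_loss, max_epochs=100, patience=5):
    """Stop once validation loss hasn't improved for `patience` epochs."""
    best, best_epoch, bad = float("inf"), -1, 0
    for epoch in range(max_epochs):
        train_step(epoch)      # one epoch of training (a no-op stub here)
        loss = val_loss()      # monitor held-out validation loss
        if loss < best:
            best, best_epoch, bad = loss, epoch, 0
        else:
            bad += 1
            if bad >= patience:
                break          # validation keeps getting worse: stop
    return best_epoch, best

# Synthetic validation curve: improves until epoch 10, then overfits.
curve = [1.0 / (e + 1) + 0.01 * max(0, e - 10) for e in range(100)]
it = iter(curve)
best_epoch, best_loss = train_with_early_stopping(lambda e: None, lambda: next(it))
```

On this curve, training halts shortly after epoch 10 and reports epoch 10 as the best, which is exactly the "okay, let's stop training, we are good enough" moment described above.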
link |
So the magic happens after that moment. So you don't want to do the early stopping?

Well, if you don't do the early stopping, you get a very pronounced double descent.

Do you have any intuition why this happens? The double descent, or, sorry, early stopping?

No, the double descent.

So, yeah, the intuition is basically this: when the dataset has as many degrees of freedom as the model, then there is a one-to-one correspondence between them, and so small changes to the dataset lead to noticeable changes in the model. Your model is very sensitive to all the randomness; it is unable to discard it. Whereas it turns out that when you have a lot more data than parameters, or a lot more parameters than data, the resulting solution will be insensitive to small changes in the dataset.

So it's able to nicely discard the small changes, the randomness.

Exactly, the spurious correlations, which you don't want.

Jeff Hinton suggested we need to throw away backpropagation. We already kind of talked about this a little bit, but he suggested we need to throw away backpropagation and start over. I mean, of course, some of that is a little bit of wit and humor. But what do you think? What could be an alternative method of training neural networks?

Well, the thing that he said precisely is that, to the extent that you can't find backpropagation in the brain, it's worth seeing if we can learn something from how the brain learns. But backpropagation is very useful and we should keep using it.

Oh, you're saying that once we discover the mechanism of learning in the brain, or any aspects of that mechanism, we should also try to implement that in neural networks?

If it turns out that we can't find backpropagation in the brain.

If we can't find backpropagation in the brain. Well, so I guess your answer to that is that backpropagation is pretty damn useful, so why are we complaining?

I mean, I personally am a big fan of backpropagation. I think it's a great algorithm, because it solves an extremely fundamental problem, which is finding a neural circuit subject to some constraints. And I don't see that problem going away. So that's why I think it's pretty unlikely that we'll have anything which is going to be dramatically different. It could happen, but I wouldn't bet on it right now.
link |
So let me ask a big-picture question. Do you think neural networks can be made to reason?

Why not? If you look, for example, at AlphaGo or AlphaZero: the neural network of AlphaZero plays Go, which we all agree is a game that requires reasoning, better than 99.9% of all humans. Just the neural network, without the search, just the neural network itself. Doesn't that give us an existence proof that neural networks can reason?

To push back and disagree a little bit: do we all agree that Go is reasoning? I think I agree. I don't think it's trivial. Obviously, reasoning, like intelligence, is a loose, gray-area term a little bit. Maybe you disagree with that. But yes, I think it has some of the same elements of reasoning. Reasoning is almost akin to search. There's a sequential element of stepwise consideration of possibilities, and of building on top of those possibilities in a sequential manner until you arrive at some insight. So yeah, I guess playing Go is kind of like that, and when you have a single neural network doing that without search, that's kind of like that. So there's an existence proof, in a particular constrained environment, that a process akin to what many people call reasoning exists. But what about more general kinds of reasoning, off the board?
link |
There is one other existence proof.

Oh boy, which one?

Us humans.

Yes. Okay, all right. So do you think the architecture that will allow neural networks to reason will look similar to the neural network architectures we have today?

I think it will. Well, I don't want to make overly definitive statements. I think it's definitely possible that the neural networks that will produce the reasoning breakthroughs of the future will be very similar to the architectures that exist today, maybe a little bit more recurrent, maybe a little bit deeper. But these neural nets are so insanely powerful; why wouldn't they be able to learn to reason? Humans can reason, so why can't neural networks?

So do you think the kind of stuff we've seen neural networks do is just a kind of weak reasoning, so it's not a fundamentally different process? Again, this is stuff nobody knows the answer to.

When it comes to our neural networks, what I would say is that neural networks are capable of reasoning. But if you train a neural network on a task which doesn't require reasoning, it's not going to reason. This is a well-known effect, where the neural network will solve the problem that you pose in front of it in the easiest way possible.
link |
Right. That takes us to one of the brilliant ways you've described neural networks: you've referred to neural networks as the search for small circuits, and maybe general intelligence as the search for small programs, a metaphor I found very compelling. Can you elaborate on that difference?

Yeah. So the thing which I said precisely was that if you can find the shortest program that outputs the data at your disposal, then you will be able to use it to make the best prediction possible. And that's a theoretical statement which can be proved mathematically. Now, you can also prove mathematically that finding the shortest program which generates some data is not a computable operation. No finite amount of compute can do this.
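The shortest-program quantity here is Kolmogorov complexity, which is uncomputable, but off-the-shelf compressors give a computable upper bound, and even that crude bound can be used for Occam-style prediction: prefer the continuation that makes the compressed whole shortest. A toy sketch, not anything from the conversation; the corpus and candidates are made up, and `zlib` merely stands in for "program length":

```python
import zlib

def code_length(s: str) -> int:
    # Compressed size in bytes: a crude, computable upper bound on the
    # length of a "program" that outputs s.
    return len(zlib.compress(s.encode(), 9))

def best_continuation(context: str, candidates):
    # Occam-style prediction: pick the continuation that adds the fewest
    # bits to the shortest description of context + continuation.
    return min(candidates, key=lambda c: code_length(context + c))

ctx = "the cat sat on the mat. the cat sat on the " * 4
choice = best_continuation(ctx, ["mat.", "xylophone."])
```

Here the repetitive context makes "mat." nearly free for the compressor (it extends an existing repeat), while "xylophone." introduces brand-new symbols, so the compression-based predictor picks the continuation a language model would also favor.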
link |
So then, with neural networks, neural networks are the next best thing that actually works in practice. We are not able to find the shortest program which generates our data, but we are able to find a small... well, actually, that statement should be amended: even a large circuit which fits our data in some way.

Well, I think what you meant by the small circuit is the smallest needed circuit.

The thing which I would change now: back then I hadn't fully internalized the overparameterization results, the things we know about overparameterized neural nets. Now I would phrase it as a large circuit whose weights contain a small amount of information, which I think is what's going on. If you imagine the training process of a neural network as slowly transmitting entropy from the dataset to the parameters, then somehow the amount of information in the weights ends up being not very large, which would explain why they generalize so well.

So the large circuit might be one that's helpful for the generalization.

Yeah, something like this.

But do you see it as important to be able to try to learn something like programs?

I mean, if we can, definitely. The answer is kind of: yes, if we can do it, we should do it.
link |
The reason we are pushing on deep learning, the fundamental reason, the root cause, is that we are able to train neural networks. So in other words, training comes first. We've got our pillar, which is the training pillar, and now we're trying to contort our neural networks around the training pillar. We've got to stay trainable. This is an invariant we cannot violate. Being trainable means that, starting from scratch, knowing nothing, you can actually pretty quickly converge towards knowing a lot, or even slowly. But it means that, given the resources at your disposal, you can train the neural net and get it to achieve useful performance.

Yeah, that's a pillar we can't move away from.

That's right. Whereas if you say, hey, let's find the shortest program, we can't do that. So it doesn't matter how useful that would be; we can't do it. So we won't.

So, you mentioned that neural networks are good at finding small circuits, or large circuits. Do you think, then, that the matter of finding small programs is just a matter of the data? Sorry, not the size or the type of data, but giving it programs.

Well, I think the thing is that right now there are no good precedents of people successfully finding programs really well. And so the way you'd find programs is you'd train a deep neural network to do it, basically.

Right. Which is the right way to go about it, but there aren't good illustrations of that.

Yes, it hasn't been done yet. But in principle, it should be possible.
link |
Can you elaborate a little bit on what your insight is, "in principle"? Put another way, you don't see why it's not possible?

Well, it's more a statement of: I think that it's unwise to bet against deep learning, and if it's a cognitive function that humans seem to be able to do, then it doesn't take too long for some deep neural net to pop up that can do it too.

Yeah, I'm there with you. I've stopped betting against neural networks at this point, because they continue to surprise us. What about long-term memory? Can neural networks have long-term memory, or something like knowledge bases? So, being able to aggregate important information over long periods of time, which would then serve as useful representations of state that you can make decisions by, so you have a long-term context on which you base the decision.

In some sense, the parameters already do that. The parameters are an aggregation of the entirety of the neural net's experience, and so they count as long-term knowledge. And people have trained various neural nets to act as knowledge bases, and, you know, people have investigated language models as knowledge bases. So there is work there.

Yeah, but in some sense, do you think it's all just a matter of coming up with a better mechanism of forgetting the useless stuff and remembering the useful stuff? Because right now, I mean, there haven't been mechanisms that remember really long-term information.
link |
What do you mean by that, precisely?

Precisely... I like the word "precisely". I'm thinking of the kind of compression of information that knowledge bases represent. Now, I apologize for my human-centric thinking about what knowledge is, because neural networks aren't necessarily interpretable with the kind of knowledge they have discovered. But a good example for me is a knowledge base being able to build up, over time, something like the knowledge that Wikipedia represents. It's a really compressed, structured knowledge base, obviously not the actual Wikipedia or the language, but like the semantic web, the dream that the semantic web represented. So a really nice compressed knowledge base, or something akin to that in the non-interpretable sense, as neural networks would have it.

Well, the neural networks would be non-interpretable if you look at their weights, but their outputs should be very interpretable.

Okay, so how do you make very smart neural networks, like language models, interpretable?

Well, you ask them to generate some text, and the text will generally be interpretable.

Do you find that the epitome of interpretability? Like, can you do better?
link |
Because, okay, I'd like to know what it knows and what it doesn't know. I would like the neural network to come up with examples where it's completely dumb and examples where it's completely brilliant. And the only way I know how to do that now is to generate a lot of examples and use my human judgment. But it would be nice if the neural network had some self-awareness about it.

Yeah, 100%. I'm a big believer in self-awareness, and I think that neural net self-awareness will allow for capabilities like the ones you described: for them to know what they know and what they don't know, and for them to know where to invest to increase their skills most optimally. And to your question about interpretability, there are actually two answers. One answer is: we have the neural net, so we can analyze the neurons, and we can try to understand what the different neurons and different layers mean. You can actually do that, and OpenAI has done some work on that. But there is a different answer, which I would say is the human-centric answer, where you say, you know, you look at a human being: how do you know what a human being is thinking? You ask them. You say, hey, what do you think about this? What do you think about that? And you get some answers.

The answers you get are sticky, in the sense that you already have a mental model of that human being. You already have a big conception of that human being, how they think, what they know, how they see the world, and then everything you ask, you're adding on to that. And that stickiness seems to be one of the really interesting qualities of the human being: information is sticky. You seem to remember the useful stuff, aggregate it well, and forget most of the information that's not useful. That process...

But that's also pretty similar to the process that neural networks do; it's just that neural networks are much crappier at this for the time being. It doesn't seem to be fundamentally that different.
link |
But just to stick on reasoning for a little longer: you said, why not, why can't it reason? What's a good, impressive feat, a benchmark, of reasoning that you'd be impressed by, if neural networks were able to do it? Is that something you already have in mind?

Well, I think writing really good code, proving really hard theorems, solving open-ended problems with out-of-the-box solutions.

Sort of theorem-type mathematical problems?

Yeah, I think those are a very natural example as well. You know, if you can prove an unproven theorem, then it's hard to argue it doesn't reason. And, by the way, this comes back to the point about hard results: machine learning, deep learning as a field, is very fortunate, because we have the ability to sometimes produce these unambiguous results, and when they happen, the debate changes, the conversation changes. We have the ability to produce conversation-changing results.

And then, of course, just like you said, people kind of take that for granted and say that wasn't actually a hard problem.

Well, I mean, at some point we'll probably run out of hard problems.

Yeah, that whole mortality thing is kind of a sticky problem that we haven't quite figured out. Maybe we'll solve that one. I think one of the fascinating things in your entire body of work, but also the work at OpenAI recently, one of the conversation changers, has been in the world of language models. Can you briefly try to describe the recent history of using neural networks in the domain of language and text?

Well, there's been lots of history. I think the Elman network was a small, tiny recurrent neural network applied to language back in the 80s.
link |
So the history is really fairly long, at least. And the thing that changed the trajectory of neural networks and language is the thing that changed the trajectory of all of deep learning, and that's data and compute. So suddenly you move from small language models, which learn a little bit. And with language models in particular, there's a very clear explanation for why they need to be large to be good: because they're trying to predict the next word. So when you don't know anything, you'll notice very broad-strokes, surface-level patterns, like sometimes there are characters and there is a space between those characters. You'll notice this pattern. And you'll notice that sometimes there is a comma and then the next character is a capital letter. You'll notice that pattern. Eventually you may start to notice that certain words occur often; you may notice that spellings are a thing; you may notice syntax. And when you get really good at all these, you start to notice the semantics, you start to notice the facts. But for that to happen, the language model needs to be larger.
link |
Let's linger on that, because that's where you and Noam Chomsky disagree. You think we're actually taking incremental steps, that a larger network, larger compute, will be able to get to the semantics, to be able to understand language, without what Noam likes to think of as a fundamental understanding of the structure of language, without imposing your theory of language onto the learning mechanism. So you're saying the learning, you can learn from raw data the mechanism that underlies language?

Well, I think it's pretty likely. But I also want to say that I don't really know precisely what Chomsky means when he talks about this. You said something about imposing your structure on language; I'm not 100% sure what he means. But empirically, it seems that when you inspect those larger language models, they exhibit signs of understanding the semantics, whereas the smaller language models do not. We saw that a few years ago, when we did the work on the sentiment neuron. We trained a smallish LSTM to predict the next character in Amazon reviews, and we noticed that when you increase the size of the LSTM from 500 LSTM cells to 4,000 LSTM cells, then one of the neurons starts to represent the sentiment of the review. Now, why is that? Sentiment is a pretty semantic attribute; it's not a syntactic attribute.

And for people who might not know, I don't know if that's a standard term, but sentiment is whether it's a positive or negative review.
link |
That's right. Like, is the person happy with something or is the person unhappy with something?
link |
And so here we had very clear evidence that a small neural net does not capture sentiment
link |
while a large neural net does. And why is that? Well, our theory is that at some point,
link |
you run out of syntax to models, you start to got to focus on something else.
link |
And besides, you quickly run out of syntax to model, and then you really start to focus on
link |
the semantics. This would be the idea. That's right. And so I don't want to imply that our models
link |
have complete semantic understanding, because that's not true. But they definitely are showing
link |
signs of semantic understanding, partial semantic understanding, but the smaller models do not show
link |
that those signs. Can you take a step back and say, what is GPT2, which is one of the big language
link |
models that was the conversation changer in the past couple of years? Yeah, so GPT2 is a
link |
transformer with one and a half billion parameters that was trained on about 40 billion tokens of
link |
text, which were obtained from webpages that were linked to from Reddit articles with more than three
link |
uploads. And what's the transformer? The transformer, it's the most important advance
link |
in neural network architectures in recent history. What is the tension maybe two,
link |
because I think that's an interesting idea, not necessarily sort of technically speaking, but
link |
the idea of attention versus maybe what recurrent neural networks represent.
link |
Yeah. So the thing is, the transformer is a combination of multiple ideas simultaneously
link |
of which attention is one. Do you think attention is the key? No, it's a key, but it's not the key.
link |
The transformer is successful because it is the simultaneous combination of multiple ideas. And
link |
if you were to remove either idea, it would be much less successful. So the transformer uses a
link |
lot of attention, but attention has existed for a few years, so that can't be the main innovation.
link |
The transformer is designed in such a way that it runs really fast on the GPU.
link |
And that makes a huge amount of difference. This is one thing. The second thing is that
link |
transformer is not recurrent. And that is really important too, because it is more shallow and
link |
therefore much easier to optimize. So in other words, it uses attention, it is a really
link |
great fit to the GPU. And it is not recurrent. So therefore, less deep and easier to optimize.
link |
And the combination of those factors makes it successful. So it makes great
link |
use of your GPU. It allows you to achieve better results for the same amount of compute.
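For readers who want the mechanics, the scaled dot-product attention at the core of the transformer can be sketched in a few lines of NumPy. This is a generic textbook sketch, not OpenAI's implementation; the shapes and values are illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Every query attends to every key in one batched matrix multiply:
    no recurrence, so the whole sequence is processed in parallel."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_queries, n_keys)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                            # (n_queries, d_v)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # 4 query positions, model dim 8
K = rng.normal(size=(6, 8))  # 6 key positions
V = rng.normal(size=(6, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Because every position is computed in one parallel matrix product rather than step by step, there is no recurrence to unroll, which is what Ilya means by the transformer being shallower, GPU-friendly, and easier to optimize.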
link |
And that's why it's successful. Were you surprised how well transformers worked?
link |
And GPT2 worked? So you worked on language. You've had a lot of great ideas before
link |
transformers came about in language. So you got to see the whole set of revolutions before and
link |
after. Were you surprised? Yeah, a little. A little. Yeah. I mean, it's hard, it's hard to
link |
remember because you adapt really quickly. But it definitely was surprising. It definitely was,
link |
in fact, you know what, I'll retract my statement. It was pretty amazing.
link |
It was just amazing to see it generate this text. And you know, you've got to keep in mind that
link |
we've seen, at that time, all this progress in GANs, in improving, you know, the
link |
samples produced by GANs were just amazing. You have these realistic faces, but text hasn't really
link |
moved that much. And suddenly we moved from, you know, whatever GANs were in 2015, to the best,
link |
most amazing GANs in one step. And it was really stunning. Even though theory predicted, yeah,
link |
you train a big language model, of course, you should get this, but then to see it with your
link |
own eyes, it's something else. And yet we adapt really quickly. And now there's sort of
link |
some cognitive scientists write articles saying that GPT2 models don't really understand
link |
language. So we adapt quickly to how amazing it is that they're able to model the language so
link |
well. So what do you think is the bar for impressing us? I don't know, do you think
link |
that bar will continuously be moved? Definitely. I think when you start to see really dramatic
link |
economic impact, that's when I think that's in some sense the next barrier. Because right now,
link |
if you think about the work in AI, it's really confusing. It's really hard to know what to make
link |
of all these advances. It's kind of like, okay, you got an advance and now you can do more things
link |
and you got another improvement and you got another cool demo. At some point, I think people who are
link |
outside of AI, they can no longer distinguish this progress anymore. So we were talking offline about
link |
translating Russian to English and how there's a lot of brilliant work in Russian that the rest
link |
of the world doesn't know about. That's true for Chinese. It's true for a lot of scientists and
link |
just artistic work in general. Do you think translation is the place where we're going
link |
to see sort of economic big impact? I don't know. I think there is a huge number of
link |
applications. First of all, I want to point out that translation already today is huge.
link |
I think billions of people interact with big chunks of the internet primarily through translation.
link |
So translation is already huge and it's hugely positive too. I think self driving is going to be
link |
hugely impactful. It's unknown exactly when it happens, but again, I would not bet against
link |
deep learning. So that's deep learning in general. Deep learning for self driving.
link |
Yes, deep learning for self driving. But I was talking about sort of language models.
link |
I see. I veered off a little bit. Just to check, you're not seeing a connection
link |
between driving and language. No, no. Okay. Or rather, they both use neural nets.
link |
That'd be a poetic connection. I think there might be some, like you said, there might be some kind
link |
of unification towards a kind of multitask transformers that can take on both language
link |
and vision tasks. That'd be an interesting unification. Let's see. What can I ask about
link |
GPT2 more? It's simple. There's not much to ask. You take a transformer, you make it bigger,
link |
give it more data, and suddenly it does all those amazing things.
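The objective really is that simple: predict the next token. A toy count-based bigram model, a hypothetical stand-in for the transformer, which learns the same conditional distribution with vastly more context and parameters, shows the train-then-sample loop:

```python
import random
from collections import Counter, defaultdict

def train_bigram(text):
    """Count-based next-character model: counts for P(next | current)."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(text, text[1:]):
        counts[cur][nxt] += 1
    return counts

def sample(counts, start, length, rng):
    """Generate by repeatedly sampling the next character."""
    out = [start]
    while len(out) <= length:
        followers = counts[out[-1]]
        if not followers:  # dead end: no observed successor
            break
        chars, weights = zip(*followers.items())
        out.append(rng.choices(chars, weights=weights)[0])
    return "".join(out)

corpus = "the cat sat on the mat. the dog sat on the log. "
model = train_bigram(corpus)
generated = sample(model, "t", 40, random.Random(0))
print(generated)
```

Scaling that same next-token objective from character counts to a billion-parameter transformer on 40 billion tokens is, in essence, the step from this toy to GPT2.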
link |
Yeah. One of the beautiful things is that GPT, the transformers are fundamentally simple to
link |
explain, to train. Do you think bigger will continue to show better results in language?
link |
Probably. So what are the next steps with GPT2, do you think?
link |
I mean, I think for sure seeing what larger versions can do is one direction. Also,
link |
I mean, there are many questions. There's one question which I'm curious about and that's
link |
the following. Right now, GPT2, so we feed it all this data from the internet, which means that
link |
it needs to memorize all those random facts about everything in the internet. It would be nice if
link |
the model could somehow use its own intelligence to decide what data it wants to accept
link |
and what data it wants to reject, just like people. People don't learn all data indiscriminately.
link |
We are super selective about what we learn. I think this kind of active learning
link |
would be very nice to have. Yeah. Listen, I love active learning. Let me ask,
link |
does the selection of data, can you just elaborate on that a little bit more? Do you think the selection
link |
of data is, I have this kind of sense that the optimization of how you select data,
link |
so the active learning process is going to be a place for a lot of breakthroughs,
link |
even in the near future, because there hasn't been many breakthroughs there that are public.
link |
I feel like there might be private breakthroughs that companies keep to themselves,
link |
because the fundamental problem has to be solved if you want to solve self driving,
link |
if you want to solve a particular task. What do you think about the space in general?
link |
Yeah, so I think that for something like active learning, or in fact for any kind of capability,
link |
like active learning, the thing that it really needs is the problem. It needs a problem that
link |
requires it. It's very hard to do research about the capability if you don't have a task,
link |
because then what's going to happen is you will come up with an artificial task,
link |
get good results, but not really convince anyone. Right. We're now past the stage where
link |
getting a result on MNIST, some clever formulation of MNIST will convince people.
link |
That's right. In fact, you could quite easily come up with a simple active learning scheme
link |
on MNIST and get a 10x speed up, but then so what? I think that active learning will naturally arise
link |
as problems that require it to pop up. That's my take on it.
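The "easy 10x speed up on a toy task" point can be made concrete with the textbook active-learning example of locating an unknown threshold: a learner that queries the most informative point (the midpoint of its uncertainty interval) needs exponentially fewer labels than one that queries at random. All names and numbers here are illustrative:

```python
import numpy as np

def learn_threshold(label, active, budget, seed=0):
    """Learn an unknown threshold on [0, 1] from yes/no labels.
    Passive: query random points.  Active: always query the midpoint
    of the current uncertainty interval (binary search)."""
    rng = np.random.default_rng(seed)
    lo, hi = 0.0, 1.0
    for _ in range(budget):
        x = (lo + hi) / 2 if active else rng.uniform(0.0, 1.0)
        if label(x):                  # x is at or above the threshold
            hi = min(hi, x)
        else:
            lo = max(lo, x)
    return (lo + hi) / 2, hi - lo     # estimate, remaining uncertainty

true_t = 0.371                        # hidden ground truth
label = lambda x: x >= true_t         # the "oracle" we pay to query
est_a, unc_a = learn_threshold(label, active=True, budget=10)
est_p, unc_p = learn_threshold(label, active=False, budget=10)
print(f"active  uncertainty after 10 labels: {unc_a:.5f}")
print(f"passive uncertainty after 10 labels: {unc_p:.5f}")
```

With the same label budget, the active learner's uncertainty shrinks as 2^-n while the passive one's shrinks roughly as 1/n, which is exactly the kind of easy win on an artificial task that, as Ilya says, doesn't by itself convince anyone.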
link |
There's another interesting thing that OpenAI brought up with GPT2, which is when you create a
link |
powerful artificial intelligence system, it was unclear, once you
link |
release GPT2, what kind of detrimental effect it would have. Because if you have a model that can
link |
generate pretty realistic text, you can start to imagine that it would be used by bots in some
link |
way that we can't even imagine. There's this nervousness about what it's possible to do.
link |
So you did a really brave and profound thing, which just started a conversation about this.
link |
How do we release powerful artificial intelligence models to the public? If we do it at all, how do we
link |
have private discussions with others, even competitors, about how we manage the use of the systems and so on?
link |
So from this whole experience, you've released a report on it, but in general, are there any
link |
insights that you've gathered from just thinking about this, about how you release models like this?
link |
I mean, I think that my take on this is that the field of AI has been in a state of childhood,
link |
and now it's exiting that state and it's entering a state of maturity.
link |
What that means is that AI is very successful and also very impactful, and its impact is not only
link |
large, but it's also growing. And so for that reason, it seems wise to start thinking about
link |
the impact of our systems before releasing them, maybe a little bit too soon, rather than a little
link |
bit too late. And with the case of GPT2, like I mentioned earlier, the results really were stunning,
link |
and it seemed plausible. It didn't seem certain. It seemed plausible that something like GPT2 could
link |
easily be used to reduce the cost of disinformation. And so there was a question of what's the best
link |
way to release it? And a staged release seemed logical. A small model was released, and there
link |
was time to see how people would use these models in lots of cool ways. There have been lots of
link |
really cool applications. There haven't been any negative applications we know of. And so eventually
link |
it was released, but also other people replicated similar models. That's an interesting question,
link |
though, that we know of. So in your view, staged release is at least part of the answer to the
link |
question of what do we do once we create a system like this? It's part of the answer, yes.
link |
Is there any other insights? Say you don't want to release the model at all, because it's useful
link |
to you for whatever the business is. Well, there are plenty of people who don't release models
link |
already, right? Of course. But is there some moral ethical responsibility when you have a
link |
very powerful model to sort of communicate? Just as you said, when you had GPT2, it was
link |
unclear how much it could be used for misinformation. It's an open question. And getting an answer to
link |
that might require that you talk to other really smart people that are outside of your particular
link |
group. Can you please tell me there's some optimistic pathway for people across the world
link |
to collaborate on these kinds of cases? Or is it still really difficult from one company to
link |
talk to another company? So it's definitely possible. It's definitely possible to discuss
link |
these kinds of models with colleagues elsewhere and to get their take on what to do.
link |
How hard is it though? I mean,
link |
do you see that happening? I think that's the place where it's important to gradually build
link |
trust between companies. Because ultimately, all the AI developers are building technology,
link |
which is going to be increasingly more powerful. And so
link |
the way to think about it is that ultimately, we're all in it together.
link |
Yeah, I tend to believe in the better angels of our nature, but I do hope that
link |
when you build a really powerful AI system in a particular domain, that you also think about
link |
the potential negative consequences of AI. It's an interesting and scary possibility
link |
that there will be a race for AI development that would push people to close that development
link |
and not share ideas with others. I don't love this. I've been a pure academic for 10 years.
link |
I really like sharing ideas and it's fun. It's exciting. Let's talk about AGI a little bit.
link |
What do you think it takes to build a system of human level intelligence? We talked about reasoning,
link |
we talked about long term memory, but in general, what does it take, do you think?
link |
Well, I can't be sure. But I think it's deep learning plus maybe another small idea.
link |
Do you think self play will be involved? You've spoken about the powerful mechanism of self play,
link |
where systems learn by exploring the world in a competitive setting against other entities that
link |
are similarly skilled as them and so incrementally improve in this way. Do you think self play will
link |
be a component of building an AGI system? Yeah. What I would say to build AGI, I think it's going
link |
to be deep learning plus some ideas. I think self play will be one of those ideas. I think that
link |
self play has this amazing property that it can surprise us in truly novel ways. For
link |
example, pretty much every self play system, both our Dota bot, and, I don't know if
link |
you've seen, the OpenAI release about multi-agent where you had two little agents playing hide and
link |
seek, and of course also AlphaZero, they all produce surprising behaviors. They all produce
link |
behaviors that we didn't expect. They are creative solutions to problems. And that seems like an
link |
important part of AGI that our systems don't exhibit routinely right now. And so that's why I
link |
like this area, I like this direction because of its ability to surprise us.
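A minimal self-play loop, far simpler than the Dota or hide-and-seek systems, is two regret-matching agents playing rock-paper-scissors: neither is told the equilibrium, yet their average strategies converge toward playing each action a third of the time. This is a standard textbook regret-matching sketch, not OpenAI's code:

```python
import random

ACTIONS = 3  # 0 = rock, 1 = paper, 2 = scissors

def payoff(a, b):
    """+1 if action a beats b, -1 if it loses, 0 on a tie."""
    return [0, 1, -1][(a - b) % 3]

def strategy(regrets):
    """Regret matching: play in proportion to positive regret."""
    pos = [max(r, 0.0) for r in regrets]
    total = sum(pos)
    return [p / total for p in pos] if total > 0 else [1.0 / ACTIONS] * ACTIONS

def self_play(iters, seed=0):
    rng = random.Random(seed)
    regrets = [[0.0] * ACTIONS for _ in range(2)]
    strat_sums = [[0.0] * ACTIONS for _ in range(2)]
    for _ in range(iters):
        strats = [strategy(regrets[p]) for p in range(2)]
        moves = [rng.choices(range(ACTIONS), weights=strats[p])[0] for p in range(2)]
        for p in range(2):
            me, opp = moves[p], moves[1 - p]
            for a in range(ACTIONS):
                # regret of not having played a instead of the sampled move
                regrets[p][a] += payoff(a, opp) - payoff(me, opp)
                strat_sums[p][a] += strats[p][a]
    total = sum(strat_sums[0])
    return [s / total for s in strat_sums[0]]  # player 0's average strategy

avg = self_play(50000)
print([round(p, 3) for p in avg])
```

The uniform mixed strategy that emerges was never programmed in; each agent only ever reacted to its own regrets against the other, which is the self-play property Ilya is pointing at, in miniature.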
link |
To surprise us. And an AGI system would surprise us fundamentally. Yes. And to be precise, not
link |
just a random surprise, but to find a surprising solution to a problem that's also
link |
useful. Right. Now, a lot of the self playing mechanisms have been used in the game context,
link |
or at least in the simulation context. How far along the path to AGI do you
link |
think will be done in simulation? How much faith or promise do you have in simulation versus having
link |
to have a system that operates in the real world, whether it's the real world of digital
link |
data, or the real world of the actual physical world with robotics? I don't think it's an either-or.
link |
I think simulation is a tool. It has its strengths and its weaknesses,
link |
and we should use it. Yeah, but okay, I understand that's
link |
true. But one of the criticisms of self play, one of the criticisms of reinforcement
link |
learning, is that its current results, while amazing, have been
link |
demonstrated in simulated environments, or very constrained physical environments,
link |
do you think it's possible to escape them, escape the simulated environments, and be able to learn
link |
in non simulated environments? Or do you think it's possible to also just simulate, in a photo
link |
realistic and physics realistic way, the real world, so that we can solve real problems
link |
with self play in simulation? So I think that transfer from simulation to the real world is
link |
definitely possible, and has been exhibited many times by many different groups. It's been
link |
especially successful in vision. Also, OpenAI in the summer demonstrated a robot hand which
link |
was trained entirely in simulation, in a certain way that allowed for sim-to-real transfer to occur.
link |
This is for the Rubik's cube. That's right. And I wasn't aware that it
link |
was trained in simulation entirely. Really? So it wasn't trained in the physical world, the hand
link |
wasn't trained? No, 100% of the training was done in simulation. And the policy that was
link |
learned in simulation was trained to be very adaptive. So adaptive that when you transfer it,
link |
it could very quickly adapt to the physical world. So the kind of perturbations
link |
with the giraffe or whatever the heck it was, were those part of the simulation?
link |
Well, the simulation was general, so the policy was trained to be robust to many
link |
different things, but not the kind of perturbations we've had in the video. So it's never been
link |
trained with a glove. It's never been trained with a stuffed giraffe. So in theory, these are
link |
novel perturbations? Correct. It's not in theory, it's in practice. And those are novel perturbations?
link |
Well, that's okay. That's a clean, small scale, but clean example of a transfer from the simulated
link |
world to the physical world. Yeah. And I will also say that I expect the transfer capabilities
link |
of deep learning to increase in general. And the better the transfer capabilities are,
link |
the more useful simulation will become. Because then you could
link |
experience something in simulation, and then learn a moral of the story, which you could then
link |
carry with you to the real world, right? As humans do all the time in computer games.
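What Ilya describes for the Rubik's-cube hand, training the policy against many randomized simulations so it transfers to unseen real dynamics, is often called domain randomization. A toy control sketch (all parameters hypothetical, nothing like OpenAI's actual setup) shows the effect: a gain tuned against many randomized simulators works on an unseen "real" dynamics value, while a gain tuned to a single miscalibrated simulator blows up.

```python
import numpy as np

def rollout_error(k, gain, steps=20):
    """Drive state x toward 0 with controller u = -k*x, under an
    unknown actuator gain: x_{t+1} = x_t + gain * u."""
    x = 1.0
    for _ in range(steps):
        x += gain * (-k * x)
    return abs(x)

rng = np.random.default_rng(0)
# Domain randomization: many simulators with randomized physics
# (actuator gain drawn from [0.6, 2.0]) instead of one nominal simulator.
train_gains = rng.uniform(0.6, 2.0, size=32)

ks = np.linspace(0.1, 2.0, 200)
# Robust policy: best worst-case error across the randomized sims.
robust_k = min(ks, key=lambda k: max(rollout_error(k, g) for g in train_gains))
# Nominal policy: tuned to a single (miscalibrated) simulator with gain 0.6.
nominal_k = min(ks, key=lambda k: rollout_error(k, 0.6))

real_g = 1.8  # the unseen "real world" gain (hypothetical)
print(f"robust policy error on real system:  {rollout_error(robust_k, real_g):.6f}")
print(f"nominal policy error on real system: {rollout_error(nominal_k, real_g):.1f}")
```

The robust policy accepts a slightly worse fit to any single simulator in exchange for working across all of them, which is what lets it survive the transfer.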
link |
So let me ask sort of an embodied question, staying on AGI for a sec. Do you think an AGI system
link |
would need to have a body? Would it need to have some of those human elements of self awareness, consciousness,
link |
sort of fear of mortality, sort of self preservation in the physical space, which comes with having
link |
a body? I think having a body will be useful. I don't think it's necessary. But I think it's
link |
very useful to have a body for sure, because you can learn things which
link |
cannot be learned without a body. But at the same time, I think that if you don't have
link |
a body, you could compensate for it and still succeed. You think so? Yes. Well, there is evidence
link |
for this. For example, there are many people who were born deaf and blind, and they were able to
link |
compensate for the lack of modalities. I'm thinking about Helen Keller specifically. So even if you're
link |
not able to physically interact with the world... I mean, I actually was
link |
getting at, maybe let me ask a more particular question. I'm not sure if it's connected to having a body
link |
or not, but the idea of consciousness and a more constrained version of that is self awareness.
link |
Do you think an AGI system should have consciousness? Which we can't define, kind of whatever the
link |
heck you think consciousness is. Yeah, hard question to answer, given how hard it is to define it.
link |
Do you think it's useful to think about? I mean, it's definitely interesting. It's fascinating.
link |
I think it's definitely possible that our systems will be conscious.
link |
Do you think that's an emergent thing that just comes from? Do you think consciousness could
link |
emerge from the representations that are stored within the networks? So that it naturally just
link |
emerges when you become more and more able to represent more and more of the world?
link |
Well, let's say I'd make the following argument, which is humans are conscious. And if you believe
link |
that artificial neural nets are sufficiently similar to the brain, then there should at least
link |
exist artificial neural nets that would be conscious too. You're leaning on that existence proof pretty
link |
heavily. Okay. But that's the best answer I can give. No, I know. I know. I know. There's still
link |
an open question whether there's some magic in the brain that we're not seeing. I mean, I don't mean
link |
a non materialistic magic, but that the brain might be a lot more complicated and interesting
link |
than we give it credit for. If that's the case, then it should show up. And at some point,
link |
at some point, we will find out that we can't continue to make progress. But I think it's
link |
unlikely. So we talk about consciousness, but let me talk about another poorly defined concept of
link |
intelligence. Again, we've talked about reasoning. We've talked about memory. What do you think is
link |
a good test of intelligence for you? Are you impressed by the test that Alan Turing formulated
link |
with the imitation game of natural language? Is there something in your mind that you will be
link |
deeply impressed by if a system was able to do? I mean, lots of things. There's certain
link |
frontier. There is a certain frontier of capabilities today. And there exists things
link |
outside of that frontier. And I would be impressed by any such thing. For example, I would be
link |
impressed by a deep learning system, which solves a very pedestrian, you know, pedestrian task
link |
like machine translation or a computer vision task, something which never makes a mistake a human
link |
wouldn't make under any circumstances. I think that is something which has not yet been demonstrated.
link |
And I would find it very impressive. Yes. So right now, they make mistakes in different,
link |
they might be more accurate than human beings, but they still make a different set of mistakes.
link |
So I would guess that a lot of the skepticism that some people have about deep learning
link |
is when they look at its mistakes and they say, well, those mistakes,
link |
they make no sense. Like if you understood the concept, you wouldn't make that mistake.
link |
And I think that changing that would inspire me, that would be,
link |
yes, this is progress. Yeah, that's a really nice way to put it.
link |
But I also just don't like that human instinct to criticize a model as not intelligent. That's
link |
the same instinct as when we criticize any group of creatures as the other. Because
link |
it's very possible that GPT2 is much smarter than human beings at many things.
link |
That's definitely true. It has a lot more breadth of knowledge.
link |
Yes, breadth of knowledge and even, and even perhaps depth on certain topics. It's kind of
link |
hard to judge what depth means, but there's definitely a sense in which humans don't make
link |
mistakes that these models do. Yes, the same applies to autonomous vehicles. The same is
link |
probably going to continue being applied to a lot of artificial intelligence systems. We find,
link |
this is the annoying thing, this is the process, in the 21st century, of analyzing
link |
the progress of AI: the search for one case where the system fails in a big way where humans
link |
would not. And then many people write articles about it. And then broadly, the public
link |
gets convinced that the system is not intelligent. And we pacify ourselves by
link |
thinking it's not intelligent because of this one anecdotal case. And this seems to continue
link |
happening. Yeah, I mean, there is truth to that. Although I'm sure that plenty of people are also
link |
extremely impressed by the systems that exist today. But I think this connects to the earlier
link |
point we discussed that it's just confusing to judge progress in AI. And you have a new robot
link |
demonstrating something. How impressed should you be? And I think that people will start to be
link |
impressed once AI starts to really move the needle on the GDP. So you're one of the people that
link |
might be able to create an AGI system here, not you, but you and OpenAI. If you do create an
link |
AGI system, and you get to spend sort of the evening with it, him, her, what would you talk
link |
about do you think? The very first time? The first time? Well, the first time I would just
link |
ask all kinds of questions and try to get it to make a mistake. And I
link |
would be amazed that it doesn't make mistakes, and just keep asking broad questions. What kind of questions do
link |
you think? Would they be factual or would they be personal, emotional, psychological? What do you
link |
think? All of the above. Would you ask for advice? Definitely. I mean, why would I limit myself
link |
talking to a system like this? Now, again, let me emphasize the fact that you truly are one of
link |
the people that might be in the room where this happens. So let me ask a sort of a profound
link |
question about, I've just talked to a Stalin historian, been talking to a lot of people who
link |
are studying power. Abraham Lincoln said, nearly all men can stand adversity. But if you want to
link |
test a man's character, give him power. I would say the power of the 21st century, maybe the 22nd,
link |
but hopefully the 21st would be the creation of an AGI system and the people who have control,
link |
direct possession and control of the AGI system. So what do you think after spending that evening
link |
having a discussion with the AGI system? What do you think you would do?
link |
Well, the ideal world I'd like to imagine is one where humanity are like the board
link |
members of a company, where the AGI is the CEO. So
link |
the picture which I would imagine is you have some kind of different
link |
entities, different countries or cities, and the people that live there vote for what the AGI
link |
that represents them should do and the AGI that represents them goes and does it. I think a picture
link |
like that, I find very appealing. You could have multiple AGIs, you would have an AGI for a city,
link |
for a country and it would be trying to in effect take the democratic process to the next level.
link |
And the board can always fire the CEO. Essentially, press the reset button, say.
link |
Press the reset. Rerandomize the parameters.
link |
Well, let me sort of, that's actually, okay, that's a beautiful vision, I think,
link |
as long as it's possible to press the reset button. Do you think it will always be possible to
link |
press the reset button? So I think that it's definitely possible to build.
link |
So you're talking... so the question that I really understand from you is, will humans, will
link |
people have control over the AI systems that they build? Yes. And my answer is, it's definitely
link |
possible to build AI systems which will want to be controlled by their humans. Wow, that's part of
link |
their... so it's not just that they can't help but be controlled, but that's one of the objectives
link |
of their existence is to be controlled. In the same way that human parents
link |
generally want to help their children, they want their children to succeed. It's not a burden for
link |
them. They are excited to help the children and to feed them and to dress them and to take care of
link |
them. And I believe with high conviction that the same will be possible for an AGI. It will be
link |
possible to program an AGI, to design it in such a way that it will have a similar deep drive that
link |
it will be delighted to fulfill and the drive will be to help humans flourish. But let me take a step
link |
back to that moment where you create the AGI system. I think this is a really crucial moment.
link |
And between that moment and the Democratic Board members with the AGI at the head,
link |
there has to be a relinquishing of power. So, George Washington, despite all the bad
link |
things he did, one of the big things he did is he relinquished power. He first of all didn't want
link |
to be president. And even when he became president, he didn't keep serving indefinitely, as most
link |
dictators do. Do you see yourself being able to relinquish control over an AGI system
link |
given how much power you can have over the world? At first financial, just make a lot of money,
link |
and then control, by having possession of an AGI system? I'd find it trivial to do that. I'd find
link |
it trivial to relinquish this kind of power. I mean, the kind of scenario you are describing
link |
sounds terrifying to me. That's all. I would absolutely not want to be in that position.
link |
Do you think you represent the majority or the minority of people in the AGI community?
link |
Well, I mean, it's an open question, an important one. Are most people good is another way to ask it.
link |
So I don't know if most people are good. But I think that when it really counts,
link |
people can be better than we think. That's beautifully put. Yeah.
link |
Are there specific mechanisms you can think of for aligning AGI values to human values?
link |
Do you think about these problems of continued alignment as we develop the AI systems?
link |
Yeah, definitely. In some sense, the kind of question which you are asking is,
link |
so if you have to translate the question to today's terms, it would be a question about
link |
how to get an RL agent that's optimizing a value function, which itself is learned.
link |
And if you look at humans, humans are like that because the reward function,
link |
the value function of humans is not external, it is internal.
link |
That's right. And there are definite ideas of how to train a value function,
link |
basically an objective, an as-objective-as-possible perception system
link |
that will be trained separately to recognize, to internalize human judgments on different
link |
situations. And then that component would be integrated as the base value
link |
function for some more capable RL system. You could imagine a process like this.
link |
I'm not saying this is the process. I'm saying this is an example of the kind of thing you could do.
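The pipeline just sketched, a value function trained separately on human judgments and then handed to a more capable agent, can be illustrated with a toy, entirely hypothetical version: least squares stands in for the learned perception system, and synthetic scores stand in for human judgments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for human judgments: raters score situations,
# and the hidden preference is a linear function of state features.
true_w = np.array([2.0, -1.0, 0.5])
states = rng.normal(size=(200, 3))
human_scores = states @ true_w + rng.normal(scale=0.1, size=200)

# Step 1: train a reward/value model that internalizes the judgments
# (least squares here; the real proposal would use a learned perception system).
w_hat, *_ = np.linalg.lstsq(states, human_scores, rcond=None)

# Step 2: a downstream agent uses the learned model as its value function,
# picking the candidate outcome the model scores highest.
candidates = rng.normal(size=(50, 3))
best = candidates[np.argmax(candidates @ w_hat)]
print("learned reward weights:", np.round(w_hat, 2))
```

The point of the two-stage split is that the agent never sees the raw human labels; it optimizes the internalized value function, which is the shape of the idea Ilya describes, not its implementation.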
link |
So on that topic of the objective functions of human existence, what do you think is the
link |
objective function that's implicit in human existence? What's the meaning of life? I think
link |
the question is wrong in some way. I think that the question implies that there is an
link |
objective answer, which is an external answer, you know, your meaning of life is X. I think
link |
what's going on is that we exist and that's amazing. And we should try to make the most
link |
of it and try to maximize our own value and enjoyment of a very short time while we do exist.
link |
It's funny because action does require an objective function. It's definitely there
link |
in some form, but it's difficult to make it explicit and maybe impossible to make it explicit.
link |
I guess is what you're getting at. And that's an interesting fact of an RL environment.
link |
Well, what I was making a slightly different point is that humans want things and their wants
link |
create the drives that cause them to, you know, our wants are our objective functions,
link |
our individual objective functions. We can later decide that we want to change,
link |
that what we wanted before is no longer good and we want something else.
link |
Yeah, but they're so dynamic. There's got to be some underlying drive. Sort of Freud,
link |
there's the sexual stuff. There's people who think it's the fear of death.
link |
And there's also the desire for knowledge and, you know, all these kinds of things,
link |
procreation, sort of all the evolutionary arguments, it seems to be,
link |
there might be some kind of fundamental objective function from which everything else emerges.
link |
But it seems like it's very difficult to make it explicit.
link |
I think that probably is an evolutionary objective function, which is to survive and
link |
procreate and make your students succeed. That would be my guess. But it doesn't give an answer
link |
to the question of what's the meaning of life. I think you can see how humans are part of this
link |
big process, this ancient process. We exist on a small planet. And that's it.
link |
So given that we exist, try to make the most of it and try to
link |
enjoy more and suffer less as much as we can. Let me ask two silly questions about life.
link |
One, do you have regrets, moments that if you went back, you would do differently? And two,
link |
are there moments that you're especially proud of that made you truly happy?
link |
So I can answer both questions. Of course, there's a huge number of choices and decisions that
link |
I have made that, with the benefit of hindsight, I wouldn't have made. And I do experience
link |
some regret, but, you know, I try to take solace in the knowledge that at the time I did the best
link |
I could. And in terms of things that I'm proud of, I'm very fortunate to have done things I'm
link |
proud of. And they made me happy for some time, but I don't think that that is the source of
link |
happiness. So your academic accomplishments, all the papers, you're one of the most cited people
link |
in the world, all the breakthroughs I mentioned in computer vision and language and so on.
link |
What is the source of happiness and pride for you? I mean, all those things are a source of pride,
link |
for sure. I'm very grateful for having done all those things. And it was very fun to do them.
link |
But happiness comes, you know, well, my current view is that happiness comes,
link |
to a very large degree, from the way we look at things. You know, you can have a
link |
simple meal and be quite happy as a result, or you can talk to someone and be happy as a result
link |
as well. Or conversely, you can have a meal and be disappointed that the meal wasn't a better meal.
link |
So I think a lot of happiness comes from that, but I'm not sure. I don't want to be too confident.
link |
Being humble in the face of the uncertainty seems to be also a part of this whole happiness thing.
link |
Well, I don't think there's a better way to end it than meaning of life and discussions of happiness.
link |
So, Ilya, thank you so much. You've given me a few incredible ideas. You've given the world
link |
many incredible ideas. I really appreciate it. And thanks for talking today.
link |
Yeah, thanks for stopping by. I really enjoyed it.
link |
Thanks for listening to this conversation with Ilya Sutskever. And thank you to our
link |
presenting sponsor, Cash App. Please consider supporting the podcast by downloading Cash App
link |
and using the code Lex Podcast. If you enjoy this podcast, subscribe on YouTube,
link |
review it with five stars on Apple Podcasts, support on Patreon, or simply connect with me
link |
on Twitter at Lex Fridman. And now let me leave you with some words from Alan Turing on machine
link |
learning. Instead of trying to produce a program to simulate the adult mind, why not rather try
link |
to produce one which simulates the child? If this were then subjected to an appropriate course of
link |
education, one would obtain the adult brain. Thank you for listening and hope to see you next time.