Oriol Vinyals: Deep Learning and Artificial General Intelligence | Lex Fridman Podcast #306

At which point is the neural network a being versus a tool?

The following is a conversation with Oriol Vinyals, his second time on the podcast. Oriol is the research director and deep learning lead at DeepMind and one of the most brilliant thinkers and researchers in the history of artificial intelligence. This is the Lex Fridman podcast. To support it, please check out our sponsors in the description. And now, dear friends, here's Oriol Vinyals.
You are one of the most brilliant researchers in the history of AI, working across all kinds of modalities. Probably the one common theme is that it's always sequences of data. So we're talking about language, images, even biology, and games, as we talked about last time. So you're a good person to ask this: in your lifetime, will we be able to build an AI system that's able to replace me as the interviewer in this conversation, in terms of the ability to ask questions that are compelling to somebody listening? And then a further question: are we close? Will we be able to build a system that replaces you as the interviewee, in order to create a compelling conversation? How far away are we, do you think?
It's a good question. I think partly I would say: do we want that? I really like, now that we have very powerful models, interacting with them and thinking of them as closer to us. The question is, if you remove the human side of the conversation, is that an interesting artifact? And I would say probably not. I've seen, for instance, last time we spoke, we were talking about StarCraft and creating agents that play games, which involves self-play, but ultimately what people cared about was how the agent behaves when the opposite side is a human. So, without a doubt, we will probably be more empowered by AI. Maybe you can source some questions from an AI system. I mean, even today, I would say it's quite plausible that, with your creativity, you might actually find very interesting questions that you can then filter. We call this cherry-picking, sometimes, in the field of language. And likewise, if I had the tools on my side, I could say: look, you're asking this interesting question, and from this answer I like the words chosen by this particular system that generated a few of them. Completely replacing us feels not exactly exciting to me, although in my lifetime, given the trajectory, I think it's possible that there could be interesting self-play interviews, as you're suggesting, that would look or sound quite interesting, and you could probably learn a topic through listening to one of these interviews, at a basic level at least.
So you said it doesn't seem exciting to you, but what if exciting is part of the objective function the thing is optimized over? There's probably a huge amount of data, if you look at it correctly, of humans communicating online, and there are probably ways to measure the degree of, as they call it, engagement. So you could probably optimize for the question that has created the most engaging conversations in the past. So actually, if you strictly use the word exciting, there is probably a way to create optimally exciting conversations that involve AI systems, where at least one side is AI.
Yeah, that makes sense. I think, maybe looping back a bit to games and the game industry: when you design algorithms, you're thinking about winning as the objective, right, or the reward function. But in fact, when we discussed this with Blizzard, the creators of StarCraft in this case, I think what matters is what's exciting, what's fun. If you could measure that and optimize for it, that's probably why we play video games, or why we interact, or listen, or look at cat videos or whatever on the internet. So it's true that modeling reward beyond the obvious reward functions we're used to in reinforcement learning is definitely very exciting. And again, there is actually some progress in a particular aspect of AI which is quite critical, which is, for instance: is a conversation, or is a piece of information, truthful? You could start trying to evaluate this from the internet, which has lots of information. And then, if you can learn such a function, automated ideally, so you can also optimize it more easily, you could actually have conversations that optimize for non-obvious things, such as excitement. So yeah, that's quite possible. And I would say that in that case it would definitely be a fun exercise, and quite unique, to have at least one side that is fully driven by excitement. But obviously there would still be quite a lot of humanity in the system, both from whoever is building it, of course, and also, ultimately, if we think of labeling for excitement, because those labels must come from us; it's just hard to have a computational measure of excitement. As far as I understand, there's no such thing.
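What Oriol describes here, learning a function for something like excitement from human labels and then optimizing against it, can be made concrete with a toy sketch. Everything below, the hand-made features, the sentences, and the labels, is invented for illustration; a real system would learn its features with a neural network over text rather than counting question marks.

```python
# Toy sketch: learn a scalar "excitement" score from human labels,
# then use it as a reward signal. Features, data, and labels are
# all hypothetical.
import math

def features(text):
    # Crude hand-made features; a real system would learn these.
    return [float(text.count("?")), len(text.split()) / 10.0]

def predict(w, x):
    # Logistic model: probability that humans would label "exciting".
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def train(data, steps=500, lr=0.5):
    w = [0.0, 0.0]
    for _ in range(steps):
        for text, label in data:
            x = features(text)
            err = predict(w, x) - label      # gradient of the log-loss
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

# Hypothetical human labels: 1 = exciting, 0 = dull.
data = [
    ("What would aliens think of us?", 1),
    ("The meeting is at 3pm.", 0),
    ("Could an AI fear mortality?", 1),
    ("Please find attached the report.", 0),
]
w = train(data)

def reward(text):
    return predict(w, features(text))
```

A conversation system could then generate several candidate questions and keep the one with the highest reward, which is the optimization step being alluded to.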
Wow, you mentioned truth also. I would actually venture to say that excitement is easier to label than truth, or perhaps it has lower consequences of failure. But there is perhaps the humanness that you mentioned; that's perhaps part of the thing that could be labeled. And that could mean an AI system that's doing dialogue, that's doing conversations, should be flawed, for example. That's the thing you optimize for: have inherent contradictions by design, have flaws by design. Maybe it also needs to have a strong sense of identity, so it has a backstory it told itself that it sticks to. It has memories, not in terms of how the system is designed, but in that it's able to tell stories about its past. It's able to have mortality, and a fear of mortality, in the following way: it has an identity, and if it says something stupid and gets canceled on Twitter, that's the end of that system. It's not like it gets to rebrand itself; that system is done. So maybe it's the high-stakes nature of it, because you can't say anything stupid or you'd be canceled on Twitter, and there are stakes to that; I think that's part of what makes it interesting. And then you have a perspective that you've built up over time and that you stick with, and people can disagree with you. So, holding that perspective strongly, holding maybe a controversial, or at least a strong, opinion. All of those elements feel like they can be learned, because there's a lot of data on the internet of people having opinions. And then you combine that with a metric of excitement, and you can start to create something where, as opposed to optimizing for grammatical clarity and truthfulness, for factual consistency over many sentences, you optimize for humanness. And there's obviously data for humanness on the internet. So I wonder if there's a future where that's part of it. Or, I mean, I sometimes wonder that about myself. I'm a huge fan of podcasts, and I listen to some podcasts and I think: what is interesting about this? What is compelling? The same way you watch other people play games, like you said, watching people play StarCraft, or Magnus Carlsen play chess. I'm not a chess player, but it's still interesting to me, and what is that? Maybe it's the stakes of it, maybe the end of a domination, of a series of wins. I don't know, all those elements somehow connect to a compelling conversation, and I wonder how hard that is to replicate. Because ultimately, all of that connects to the initial proposition of how to test whether an AI is intelligent or not, with the Turing test, and I guess my question comes from the spirit of that test.
Yes. I actually recall, I was just listening to our first podcast, where we discussed the Turing test. I would say that from a neural network, AI-builder perspective, you usually try to map many of these interesting topics you discuss to benchmarks, and then also to actual architectures: how these systems are currently built, how they learn, what data they learn from, what they are learning, right? We're talking about the weights of a mathematical function. And then, looking at the current state of the game, what leaps forward do we need to get to the ultimate stage of all these experiences, lifetime experience, fears, words for which we're currently barely seeing progress? Because what's happening today is that you take all these human interactions, a large and varied corpus of human interactions online, and then you're distilling these sequences, right? Going back to my passion: sequences of words, letters, images, sound; there are more modalities here at play. And then you're trying to just learn a function, a neural network, that maximizes the likelihood of seeing all these sequences.
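The phrase "maximizes the likelihood of seeing all these sequences" has a simple special case that makes the objective concrete: a bigram model whose conditional probabilities are the observed frequencies is exactly the maximum-likelihood model of its class. A neural network optimizes the same objective, but with shared weights instead of one table entry per word pair. The tiny corpus below is invented for illustration.

```python
# Maximum-likelihood fit of the simplest sequence model: a bigram
# table whose probabilities are exactly the frequencies seen in the
# corpus. A neural network plays the same role with shared weights
# instead of a lookup table.
from collections import Counter, defaultdict

def fit_bigram(corpus):
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    # Normalizing the counts gives the conditional P(next | prev)
    # that maximizes the likelihood of the observed sequences.
    return {
        prev: {w: c / sum(nxts.values()) for w, c in nxts.items()}
        for prev, nxts in counts.items()
    }

corpus = ["the cat sat", "the cat ran", "the dog sat"]  # toy corpus
model = fit_bigram(corpus)
```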
Now, I think there are a few places where the way we currently train these models falls short of developing the kinds of capabilities you describe. I'll tell you maybe a couple.
One is the lifetime of an agent or a model. You learn from this data offline, right? You're just passively observing and maximizing likelihood. It's almost like a landscape of mountains: everywhere there's data of humans interacting in this way, you're trying to make the landscape higher, and lower where there's no data. And these models generally don't experience any of this themselves; they're just observers, right? Passive observers of the data. And then we're putting them to generate data when we interact with them, but that's very limiting. The experience they actually have, when they could maybe be further optimizing the weights, we're not even using that.
So to be clear, and again mapping to AlphaGo and AlphaStar: we train the model, and when we deploy it to play against humans, or in this case to interact with humans, like language models, it doesn't even keep training, right? It's not learning, in the sense that the weights learned from the data don't keep changing.
Now, there's something that feels a bit more magical, but is understandable if you're into neural networks, which is: well, they might not learn in the strict sense of the word, the weights changing, and maybe that's what maps to how neurons interconnect and how we learn over our lifetime. But it's true that the context of the conversation that takes place when you talk to these systems is held in their working memory, right? It's almost like you start a computer: it has a hard drive with a lot of information, and you have access to the internet, which has probably all the information, but there's also a working memory that these agents, as we call them, or start calling them, build upon. Now, this memory is very limited. Right now, to be concrete, we're talking about 2,000 words that we hold, and beyond that we start forgetting what we've seen. So you can see that there's already some short-term coherence with what you were saying, which is a very interesting topic, mapping an agent to consistency: if you say, oh, what's your name, it can remember that, but it might forget beyond 2,000 words, which is not that long a context, if we consider that even these podcasts, or books, are much longer. So technically speaking, there's a limitation there, super exciting for people who work on deep learning.
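The "working memory" limit described here can be sketched in a few lines: the model conditions only on the most recent slice of the conversation, and anything older simply falls out. The word-level tokenization and the 2,000-token budget below are simplifications for illustration; real systems count subword tokens.

```python
# Sketch of the fixed "working memory": the model only conditions on
# the most recent tokens, and older context is forgotten. Word-level
# tokens and the 2,000 budget are simplifications.
CONTEXT_LIMIT = 2000

def truncate_context(history, limit=CONTEXT_LIMIT):
    """Keep only the most recent `limit` tokens of the conversation."""
    tokens = " ".join(history).split()
    return tokens[-limit:]

history = ["My name is Alice."] + ["Some filler sentence here."] * 600
context = truncate_context(history)
# The early self-introduction has fallen out of the window:
assert "Alice." not in context
```

This is why the system can answer "what's your name" right after being told, yet loses that fact once enough conversation has gone by.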
But I would say we lack maybe the benchmarks, and the technology, to have this lifetime-like experience of memory that keeps building up. However, the way it learns offline is clearly very powerful. If you had asked me three years ago, I would have said, oh, we're very far. I think we've seen the power of this imitation at internet scale: it has made it feel like at least the basic knowledge about the world is now incorporated into the weights. But the experience is lacking, and in fact, as I said, we don't even train them while we're talking to them; only their working memory, of course, is affected. So that's the dynamic part, but they don't learn the way you and I have learned, basically from when we were born, and probably before. So, lots of fascinating, interesting questions you asked there. I think the one I addressed is this idea of memory and experience versus just observing the world and learning its knowledge, and for that I would point to lots of recent advancements that make me very excited about the field.
And then the second issue that I see is that all these models, we train them from scratch. That's something I would have complained about three years ago, or six years ago, or ten years ago. And if we take inspiration from how we got here, how the universe evolved us and we keep evolving, it feels like that is a missing piece: we should not be training models from scratch. There should be some way in which we can grow models, much as we as a species, and many other elements in the universe, build from previous iterations. And that's from a purely neural network perspective. Even though we would like to make it work, it's proven very hard not to throw away the previous weights, this landscape we learned from the data, and refresh them with a brand new set of weights, given maybe a recent snapshot of the dataset we train on, et cetera, or even a new game we're learning. So it feels like something is missing fundamentally. We might find it, but it's not very clear what it will look like. There are many ideas, and it's super exciting as well.
Just for people who don't know: when you approach a new problem in machine learning, you come up with an architecture that has a bunch of weights, and then you initialize them somehow, which in most cases is some version of random. So that's what you mean by starting from scratch. And it seems like a waste: every time you solve the game of Go, and chess, StarCraft, protein folding, surely there's some way to reuse the weights, as we grow this giant database of neural networks that have solved some of the toughest problems in the world. And so, what is that? Methods for how to reuse weights, how to learn to extract what's generalizable, or at least what has a chance to be, and throw away the rest. And maybe the neural network itself should be able to tell you that. What ideas do you have for better initialization of weights?
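The "from scratch" point can be made concrete with a toy contrast between random initialization and a naive reuse scheme that copies whatever weights transfer and randomly initializes only the rest. The sizes and the copy-a-prefix scheme are invented for illustration; real transfer learning is more structured, for example reusing a pretrained backbone and reinitializing a task head.

```python
# Toy contrast between "training from scratch" and naively reusing
# weights from a previously solved task. Sizes and the copy-a-prefix
# scheme are invented for illustration.
import random

def init_from_scratch(n, seed=0):
    # The usual starting point: small random weights, no prior knowledge.
    rng = random.Random(seed)
    return [rng.gauss(0.0, 0.02) for _ in range(n)]

def init_from_pretrained(pretrained, n, seed=1):
    # Copy what transfers, randomly initialize only the new capacity.
    reused = pretrained[: min(n, len(pretrained))]
    fresh = init_from_scratch(n - len(reused), seed=seed)
    return reused + fresh

go_weights = [0.5] * 8                  # stand-in for a solved task's weights
new_model = init_from_pretrained(go_weights, 12)
```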
Maybe stepping back: if we look at the field of machine learning, and especially deep learning, at the core of deep learning there is this beautiful idea that a single algorithm can solve any task. It's been proven over and over, with an ever-increasing set of benchmarks, and things that were thought impossible are being cracked by this basic principle. That is: you take a neural network of uninitialized weights, like a blank computational brain; then you give it, in the case of supervised learning, ideally a lot of examples of, hey, here is what the input looks like, and the desired output should look like this. Image classification is a very clear example: images mapped to, say, one of a thousand categories; that's what ImageNet is. But many, many problems, if not all, can be mapped this way. And then there's a generic recipe that you can use, and this recipe changes very little. I think that's the core of deep learning research: what is the universal recipe that, for any new given task, I'll be able to use without thinking, without having to work very hard on the problem at stake?
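The generic recipe described above can be sketched as a single training loop that never looks at what the task is: start from blank weights, show input-output pairs, and nudge the weights to reduce a loss. Here it is for a toy 1D regression; the same loop shape applies to image classification and the rest.

```python
# The generic supervised recipe as one task-agnostic loop:
# blank weights -> (input, output) pairs -> gradient steps on a loss.
# Toy 1D regression; the loop shape is the same for any mapping task.
def train(pairs, steps=500, lr=0.05):
    w, b = 0.0, 0.0                      # "blank computational brain"
    for _ in range(steps):
        for x, y in pairs:
            pred = w * x + b             # model output for this input
            err = pred - y               # derivative of the squared loss
            w -= lr * err * x            # nudge weights downhill
            b -= lr * err
    return w, b

pairs = [(x, 2 * x + 1) for x in range(-3, 4)]   # target: y = 2x + 1
w, b = train(pairs)
```

Nothing in `train` knows it is fitting a line; swapping in a different model and loss is the part of the recipe that stays task-specific.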
We have not found this recipe fully, but I think the field is excited to find fewer of the tweaks or tricks that people discover when they work on specific important problems, and more of a general algorithm, right? So at an algorithmic level, I would say we already have something general, which is this formula of training a very powerful model, a neural network, on a lot of data. But in many cases, you need some specificity for the actual problem you're solving. Protein folding, being such an important problem, starts from a basic recipe that was learned before, right? Transformer models, graph neural networks, ideas coming from NLP, like something called BERT, which is a kind of loss you can put in place to help the model; knowledge distillation is another technique, right? So this is the formula. We still had to find some particular things that were specific to AlphaFold. That's very important, because protein folding is such a high-value problem that, as humans, we should solve it no matter whether we need to be a bit specific. And it's possible that some of these learnings will be applied to the next iteration of this recipe that deep learners work on. But it is true that, so far, the recipe is what's common, but the weights you generally throw away,
which feels very sad. Although, maybe especially in the last two or three years, and when we last spoke, I mentioned this area of meta-learning, which is the idea of learning to learn. That idea has seen some progress, starting, I would say, mostly from GPT-3, in the language domain only, in which you could conceive of a model that is trained once. And then this model is not narrow, in the sense that it doesn't only know how to translate a pair of languages, or only know how to assign sentiment to a sentence. These tasks, you can actually teach it by prompting. And this prompting is essentially just showing it a few more examples, input-output examples, almost like the ones you show, algorithmically speaking, to the process of creating the model in the first place. But now you're doing it through language, which is a very natural way for us to learn from one another. I tell you: hey, you should do this new task. I'll tell you a bit more. Maybe you ask me some questions. And now you know the task, right? You didn't need to be retrained from scratch. And we've seen these almost magical moments of doing few-shot prompting through language in the language-only domain. And then in the last two years, we've seen this expanded beyond language, adding vision, adding actions and games; lots of progress to be had.
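Few-shot prompting as described here is, mechanically, just string construction: a few input-output examples followed by the new input, fed to an already-trained model whose continuation is the answer. The layout below is a common convention, not any specific model's API.

```python
# Few-shot prompting as string construction: demonstrations plus the
# new input. The layout is a common convention, not a specific API.
def build_prompt(examples, query):
    lines = ["Decide the sentiment of each sentence."]
    for text, label in examples:
        lines.append(f"Sentence: {text}\nSentiment: {label}")
    lines.append(f"Sentence: {query}\nSentiment:")
    return "\n\n".join(lines)

examples = [
    ("What a wonderful day!", "positive"),
    ("This is terrible.", "negative"),
]
prompt = build_prompt(examples, "I loved the show.")
# `prompt` is fed to an already-trained language model; its
# continuation (ideally "positive") is the answer. No weights change.
```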
But this is maybe, if you ask me how we're going to crack this problem, perhaps one way: you have a single model. The problem with this model is that it's hard to grow it in weights or capacity. But the model is certainly so powerful that you can teach it some tasks in this way, much as I could teach you a new task right now, say a text-based task, or a classification task in vision. But it still feels like more breakthroughs should be had. It's a great beginning, though, right? We have a good baseline. We have an idea that this is maybe how we want to benchmark progress towards AGI, and in my view it's critical to always have a way to benchmark, with the community sort of converging to this overall, which is good to see. And then this is actually what excites me in terms of next steps for deep learning: how to make these models more powerful. How do you train them? If they must grow, should they change their weights as you teach them a task, or not? There are some interesting questions there, many still to be answered.
Yeah, you've opened the door to a bunch of questions I want to ask, but let's first return to your tweet and read it like Shakespeare. You wrote: "Gato is not the end. It's the beginning." And then you wrote "meow" and an emoji of a cat. So, two questions. First, can you explain the meow and the cat emoji? And second, can you explain what Gato is and how it works?
I mean, thanks for reminding me that we're all exposed on Twitter and it's permanently there.

Yes, permanently there. One of the greatest AI researchers of all time: meow and a cat emoji. Can you imagine Turing tweeting meow and a cat?

He probably would. Probably would.

So, yeah, the tweet?

It's important, actually. I put thought into the tweets. I hope people do as well.

Which part, do you think? OK, so there are three sentences: Gato is not the end; Gato is the beginning; meow, cat emoji. Which is the important part?

Definitely that it is the beginning. I mean, I probably was just explaining a bit where the field is going.
But let me tell you about Gato. First, the name Gato comes from a sequence of releases that DeepMind had that used animal names for some of its models, the ones based on this idea of large sequence models. Initially they were language only, but we've been expanding to other modalities. So we had Gopher and Chinchilla; these were language only. Then, more recently, we released Flamingo, which adds vision to the equation, and then Gato, which adds vision and also actions into the mix. As we discussed, actions, especially discrete actions like up, down, left, right: I just told you the actions, but they're words, so you can kind of see how actions naturally map to sequence modeling of words, which is what these models do.
So Gato was named, I believe, and I can only say this from memory, because these things always happen with an amazing team of researchers behind them: before the release, we had a discussion about which animal we would pick. And I think because of the words "general agent", and this is a property quite unique to Gato, we were playing with the "ga" words, and then "gato" arrived, with cat. Gato is, obviously, the Spanish word for cat. I had nothing to do with it, although I'm from Spain.

How do you say cat in Spanish? Now it all makes sense.

Now it all makes sense.

OK, so how do you say meow in Spanish?

That's probably the same. I think you say it the same way, but you write it as M-I-A-U.
All right, so then how does the thing work? You said general; you said language, vision, action. Can you explain what kind of neural networks are involved? What does the training look like? And maybe, what are some of the beautiful ideas within this system?
Yeah. So maybe the basics of Gato are not that dissimilar from much of the work that came before. Here is where the recipe hasn't changed too much. There is a transformer model; that's the kind of neural network that essentially takes in a sequence of modalities, observations, which could be words, could be vision, or could be actions. And the objective you train it on is to predict what the next anything is. And anything means: what's the next action, if the sequence I'm showing you in training is a sequence of actions and observations? Then you're predicting what the next action and the next observation are. So you think of this really as a sequence of bytes. Take any sequence of words, a sequence of interleaved words and images, a sequence of maybe observations that are images and moves in Atari, up, down, left, right. You just think of them as bytes, and you're modeling what the next byte is going to be. And then you might interpret that as an action and play it in a game, or you could interpret it as a word and write it down if you're chatting with the system, and so on.
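The "everything is a sequence of bytes" idea can be sketched as a tokenizer that flattens interleaved text, image patches, and discrete actions into one stream; training is then ordinary next-token prediction over that stream. The token scheme below is invented for illustration and is far cruder than Gato's actual tokenization.

```python
# Flatten interleaved modalities into one token stream; training is
# then next-token prediction over the stream. The token scheme is
# invented and far cruder than Gato's real tokenization.
def tokenize_episode(episode):
    """Turn an interleaved (modality, value) episode into tokens."""
    tokens = []
    for kind, value in episode:
        if kind == "text":
            tokens += [f"w:{w}" for w in value.split()]
        elif kind == "image":
            tokens += [f"px:{p}" for p in value]   # e.g. patch ids
        elif kind == "action":
            tokens.append(f"a:{value}")            # up/down/left/right
    return tokens

episode = [
    ("image", [7, 3, 9]),        # an observation, as patch ids
    ("action", "left"),          # the move taken after seeing it
    ("text", "score went up"),
]
tokens = tokenize_episode(episode)
# A model trained on such streams maximizes P(tokens[i+1] | tokens[:i+1]).
```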
So Gato basically can be thought of as taking as inputs images, text, video, actions. It also actually takes as input some proprioception sensors from robotics, because robotics is one of the tasks it's been trained to do. And then at the output, similarly, it outputs words and actions. It does not output images; that's just by design, we decided not to go that way for now. That's also in part why it's the beginning: because there's more to do, clearly. But that's kind of what Gato is: this brain where you essentially give it any sequence of these observations and modalities, and it outputs the next step. And then off you go: you feed the next step back in and predict the next one, and so on. Now, it is more than a language model, because even though you can chat with Gato, like you can chat with Chinchilla or Flamingo, it is also an agent, right? That's why we call it an agent, the "a" of Gato, and it's also general: it's not an agent that's been trained to be good at only StarCraft, or only Atari, or only Go. It's been trained on a vast variety of datasets.
What makes it an agent, if I may interrupt? The fact that it can generate actions?

Yes. I mean, it's a good question, right? When do we call something a model, and when an agent? I mean, everything is a model, but what makes an agent, in my view, is precisely the capacity to take actions in an environment: you send an action to the environment, the environment might return a new observation, and then you generate the next action, and so on.
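The agent-environment loop just defined can be written down directly: the model emits an action, the environment answers with an observation, and the cycle repeats. The environment and policy below are toy stand-ins.

```python
# The agent-environment loop: the model's action goes out, a new
# observation comes back. Environment and policy are toy stand-ins.
class CounterEnv:
    """Toy environment: the agent tries to raise a counter to 3."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        self.state += 1 if action == "up" else -1
        done = self.state >= 3
        return self.state, done          # observation, episode-over flag

def policy(observation):
    return "up"                          # a trivial stand-in for the model

env = CounterEnv()
obs, done, steps = 0, False, 0
while not done:
    action = policy(obs)                 # the model generates an action...
    obs, done = env.step(action)         # ...the environment answers with
    steps += 1                           # a new observation
```

A passive model, by contrast, would only ever see recorded `(obs, action)` pairs; the loop, not the architecture, is what makes it an agent.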
This actually reminds me of a question from the side of biology: what is life? Which is actually a very difficult question as well. What is living, when you think about life here on planet Earth? And, a question interesting to me about aliens: what is life when we visit another planet? Would we be able to recognize it? This perhaps sounds silly, but I don't think it is: at which point is the neural network a being versus a tool? And it feels like action, the ability to modify its environment, is that fundamental leap.

Yeah, I think it certainly feels like action is a necessary condition to be more alive, but probably not sufficient either.

There's also the whole consciousness thing, whatever.

Yeah, yeah, we can get back to that later.
But anyway, going back to the meow and the Gato: one of the leaps forward, and what took the team a lot of effort and time, was, as you were asking, how Gato has been trained. So I told you Gato is this transformer neural network; it models sequences of actions, words, et cetera. And the way we train it is essentially by pulling together datasets of observations. It's a massive imitation learning algorithm: it imitates, obviously, what word comes next, from the usual datasets we used before, right? These are the web-scale-style datasets of people writing on the web, or chatting, or whatnot. That's an obvious source that we use in all language work. But then we also took a lot of agents that we have at DeepMind. As you know, at DeepMind we're quite interested in reinforcement learning and in learning agents that play in different environments. So we created a dataset of these trajectories, as we call them, or agent experiences. In a way, these are other agents we trained for a single-minded purpose, to, let's say, control a 3D game environment and navigate a maze. So we had all the experience that was created through that one agent interacting with its environment, and we added it to the dataset. And, as I said, we just see all the data, all these sequences of words, or sequences of an agent interacting with an environment, or agents playing Atari, and so on, as the same kind of data. So we mixed these datasets together, and we trained Gato.
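Pooling heterogeneous sources into one training stream, as described above, can be sketched as weighted sampling over named datasets. The source names, contents, and mixture weights below are made up for illustration.

```python
# Weighted sampling over named datasets, pooling web text and agent
# trajectories into one stream. Names, contents, and weights are made up.
import random

def mixed_stream(sources, weights, n, seed=0):
    """Draw n training sequences, picking a source for each draw."""
    rng = random.Random(seed)
    names = list(sources)
    batch = []
    for _ in range(n):
        name = rng.choices(names, weights=weights)[0]
        batch.append((name, rng.choice(sources[name])))
    return batch

sources = {
    "web_text": ["the cat sat", "hello world"],
    "atari": ["obs act obs act"],
    "robotics": ["joint1 joint2 grip"],
}
batch = mixed_stream(sources, weights=[0.6, 0.3, 0.1], n=10)
```

Every drawn sequence is then treated identically by the model, which is the "same kind of data" point in the conversation.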
That's the G part, right?
link |
It's general because it really has mixed,
link |
it doesn't have different brains for each modality
link |
or each narrow task.
link |
It has a single brain.
link |
It's not that big of a brain compared
link |
to most of the neural networks we see these days.
link |
It has one billion parameters.
link |
Some models we're seeing get in the trillions these days
link |
and certainly 100 billion feels like a size
link |
that is very common from when you train this job.
link |
So the actual agent is relatively small,
link |
but it's been trained on a very challenging,
link |
diverse dataset, not only containing all of internet,
link |
but containing all this agent experience
link |
playing very different distinct environments.
link |
So this brings us to the part of the tweet of,
link |
this is not the end, it's the beginning.
link |
It feels very cool to see Gato in principle
link |
is able to control any sort of environments
link |
especially the ones that it's been trained to do,
link |
these 3D games, Atari games,
link |
all sorts of robotics tasks and so on,
link |
but obviously it's not as proficient
link |
as the teachers it learned from on these environments.
link |
It's not obvious that it wouldn't be more proficient.
link |
It's just the current beginning part
link |
is that the performance is such that it's not as good
link |
as if it's specialized to that task.
link |
Right, so it's not as good,
link |
although I would argue size matters here.
link |
So the fact that...
link |
I would argue size always matters.
link |
That's a different question.
link |
But for neural networks, certainly size does matter.
link |
So it's the beginning because it's relatively small.
link |
So obviously scaling this idea up
link |
might make the connections that exist between
link |
text on the internet and playing Atari and so on
link |
more synergistic with one another and you might gain.
link |
And that moment we didn't quite see,
link |
but obviously that's why it's the beginning.
link |
That synergy might emerge with scale.
link |
Right, might emerge with scale.
link |
And also I believe there's some new research
link |
or ways in which you prepare the data
link |
that you might need to sort of make it more clear
link |
to the model that you're not only playing Atari
link |
and it's just, you start from a screen
link |
and here is up and here is down.
link |
Maybe you can think of playing Atari
link |
as there's some sort of context
link |
that is needed for the agent
link |
before it starts seeing,
link |
oh, this is an Atari screen, I'm gonna start playing.
link |
You might require, for instance, to be told in words,
link |
hey, this is in this sequence that I'm showing,
link |
you're gonna be playing an Atari game.
link |
So text might actually be a good driver
link |
to enhance the data, right?
link |
So then these connections might be made more easily, right?
link |
That's an idea that we start seeing in language,
link |
but obviously beyond language this is gonna be effective, right?
link |
It's not like, I don't show you a screen
link |
and you from scratch, you're supposed to learn a game.
link |
There is a lot of context we might set.
link |
So there might be some work needed as well
link |
to set that context, but anyways, there's a lot of work.
link |
So that context puts all the different modalities
link |
on the same level ground if you provide the right context.
link |
So maybe on that point, so there's this task
link |
which may not seem trivial of tokenizing the data,
link |
of converting the data into pieces,
link |
into basic atomic elements
link |
that then could cross modality somehow.
link |
So what's tokenization?
link |
How do you tokenize text?
link |
How do you tokenize images?
link |
How do you tokenize games and actions and robotics tasks?
link |
Yeah, that's a great question.
link |
So tokenization is the entry point
link |
to actually make all the data look like a sequence
link |
because tokens then are just kind of these little puzzle pieces.
link |
We break down anything into these puzzle pieces
link |
and then we just model what this puzzle looks like, right?
link |
When you lay them down in a line,
link |
so to speak, in a sequence.
link |
So in Gato, for text, there's a lot of prior work.
link |
You tokenize text usually by looking
link |
at commonly used substrings, right?
link |
So, ING in English is a very common substring,
link |
so that becomes a token.
link |
It's quite a well-studied problem, tokenizing text,
link |
and Gato just used the standard techniques
link |
that have been developed over many years,
link |
even starting from n-gram models in the 1950s and so on.
link |
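The substring-based tokenization just described can be sketched as a toy byte-pair-encoding pass. This is illustrative only, not Gato's actual tokenizer, which is trained on large corpora; but on a tiny corpus full of "-ing" words, a common substring like ING does end up fused into a single token, exactly as in the example above.

```python
from collections import Counter

def learn_merges(words, num_merges):
    """Learn byte-pair merges: repeatedly fuse the most frequent
    adjacent symbol pair into a single token."""
    # Start with each word as a tuple of characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged = {}
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])  # fuse the pair
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        corpus = merged
    return merges

# Frequent substrings like "in" and then "ing" get their own tokens.
words = ["playing", "learning", "training", "king"] * 10
merges = learn_merges(words, 10)
```

Real tokenizers add byte-level fallbacks and much larger vocabularies, but the principle is the same: commonly used substrings become single tokens.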
Just for context, how many tokens,
link |
like what order of magnitude, number of tokens,
link |
is required for a word?
link |
What are we talking about?
link |
Yeah, for a word in English, right?
link |
I mean, every language is very different.
link |
The current level of granularity of tokenization
link |
generally means it's maybe two to five.
link |
I mean, I don't know the statistics exactly,
link |
but to give you an idea,
link |
we don't tokenize at the level of letters
link |
then it would probably be like,
link |
I don't know what the average length of a word is in English,
link |
but that would be the minimum set of tokens you could use.
link |
So it's bigger than letters, smaller than words.
link |
And you could think of very, very common words like the,
link |
I mean, that would be a single token,
link |
but very quickly you're talking two, three, four tokens or so.
link |
Have you ever tried to tokenize emojis?
link |
Emojis are actually just sequences of letters, so.
link |
Maybe to you, but to me, they mean so much more.
link |
Yeah, you can render the emoji,
link |
but you might, if you actually just.
link |
Yeah, this is a philosophical question.
link |
Are emojis an image or text?
link |
The way we do these things is,
link |
they're actually mapped to small sequences of characters.
link |
So you can actually play with these models
link |
and input emojis, it will output emojis back,
link |
which is actually quite a fun exercise.
link |
You probably can find other tweets about these out there.
link |
But yeah, so anyways, text,
link |
there's like, it's very clear how this is done.
link |
And then in Gato, what we did for images
link |
is we map images to essentially,
link |
we compressed images, so to speak,
link |
into something that is shorter,
link |
because with every pixel at every intensity,
link |
that would mean we have a very long sequence, right?
link |
Like if we were talking about 100 by 100 pixel images,
link |
that would make the sequences far too long.
link |
So what was done there is you just use a technique
link |
that essentially compresses an image
link |
into maybe 16 by 16 patches of pixels,
link |
and then each patch is mapped.
link |
Again, to tokenize, you just essentially quantize this space
link |
into a special word that actually maps
link |
to this little sequence of pixels.
link |
And then you put the pixels together in some raster order,
link |
and then that's how you get the image
link |
that you're processing out of, or into, the model.
link |
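The patch-plus-quantization idea can be sketched like this. The 16×16 patch size matches what's described above; the tiny random codebook is hypothetical (real systems learn it from the statistics of the data), but the mechanics are the same: split into patches in raster order, then snap each patch to its nearest codebook entry.

```python
import numpy as np

def tokenize_image(image, patch, codebook):
    """Split an image into patch x patch blocks (raster order) and map
    each block to the index of its nearest codebook entry."""
    h, w = image.shape
    tokens = []
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            block = image[y:y + patch, x:x + patch].ravel()
            # Nearest-neighbour quantization: lossy and statistics-driven,
            # loosely analogous to how JPEG exploits common patterns.
            dists = np.linalg.norm(codebook - block, axis=1)
            tokens.append(int(np.argmin(dists)))
    return tokens

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 16 * 16))  # hypothetical learned codebook
image = rng.normal(size=(64, 64))           # toy 64x64 grayscale image
tokens = tokenize_image(image, 16, codebook)
# A 64x64 image becomes (64/16)**2 = 16 tokens instead of 4096 pixels.
```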
But there's no semantic aspect to that.
link |
So you're doing some kind of,
link |
you don't need to understand anything about the image
link |
in order to tokenize it currently.
link |
No, you're only using this notion of compression.
link |
So you're trying to find common,
link |
it's like JPG or all these algorithms,
link |
it's actually very similar at the tokenization level.
link |
All we're doing is finding common patterns
link |
and then making sure in a lossy way we compress these images
link |
given the statistics of the images
link |
that are contained in all the data we deal with.
link |
Although you could probably argue that JPG
link |
does have some understanding of images.
link |
Like, because visual information,
link |
maybe color, compressing crudely based on color,
link |
does capture
link |
something important about an image
link |
that's about its meaning, not just about some statistics.
link |
Yeah, I mean, JPEG, as I said,
link |
the algorithms actually look very similar;
link |
they use the discrete cosine transform in JPEG.
link |
The approach we usually do in machine learning
link |
when we deal with images
link |
and we do this quantization step
link |
is a bit more data driven.
link |
So rather than have some sort of Fourier basis
link |
for how frequencies appear in the natural world,
link |
we actually just use the statistics of the images
link |
and then quantize them based on the statistics
link |
much like you do in words, right?
link |
So common substrings are allocated a token
link |
and images is very similar.
link |
But there's no connection.
link |
The token space, if you think of,
link |
oh, like the tokens are integers at the end of the day.
link |
So now, maybe we have about,
link |
I don't know the exact numbers,
link |
but let's say 10,000 tokens for text, right?
link |
Certainly more than characters
link |
because we have groups of characters and so on.
link |
So from one to 10,000,
link |
those are representing all the language
link |
and the words we'll see.
link |
And then images occupy the next set of integers.
link |
So they're completely independent, right?
link |
So from 10,001 to 20,000,
link |
those are the tokens that represent
link |
these other modality images.
link |
And that is an interesting aspect
link |
that makes it orthogonal.
link |
So what connects these concepts is the data, right?
link |
Once you have a data set,
link |
for instance, that captions images
link |
that tells you, oh, this is someone
link |
playing a frisbee on a green field.
link |
Now, the model will need to predict the tokens
link |
from the text 'green field' to the pixels.
link |
And that will start making the connections
link |
between the tokens.
link |
So these connections happen as the algorithm learns.
link |
And then the last, if we think of these integers,
link |
the first few are words, the next few are images.
link |
In Gato, we also allocated the highest range
link |
of integers to actions, right?
link |
Which we discretize and actions are very diverse, right?
link |
In Atari, there are, I don't know, maybe 17 discrete actions;
link |
in robotics, actions might be torques
link |
and forces that we apply.
link |
So we just use kind of similar ideas
link |
to compress these actions into tokens.
link |
And that's how we now map all the space
link |
to these sequences of integers.
link |
But they occupy different space
link |
and what connects them is then the learning algorithm.
link |
That's where the magic happens.
link |
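The disjoint integer ranges described above can be sketched by offsetting each modality's ids into one shared vocabulary. The range sizes here are illustrative, taken from the rough numbers in the conversation, not Gato's actual configuration:

```python
# Illustrative vocabulary layout: text occupies the lowest ids, then
# image-patch tokens, then actions in the highest range, as described.
TEXT_VOCAB = 10_000
IMAGE_VOCAB = 10_000
ACTION_VOCAB = 1_000

def text_token(i):
    """Text tokens: ids 0 .. 9,999."""
    assert 0 <= i < TEXT_VOCAB
    return i

def image_token(i):
    """Image tokens: ids 10,000 .. 19,999 (offset past text)."""
    assert 0 <= i < IMAGE_VOCAB
    return TEXT_VOCAB + i

def action_token(i):
    """Action tokens: ids 20,000 .. 20,999 (the highest range)."""
    assert 0 <= i < ACTION_VOCAB
    return TEXT_VOCAB + IMAGE_VOCAB + i

# One interleaved episode: a caption token, an image-patch token, an action.
sequence = [text_token(42), image_token(7), action_token(3)]
```

The ranges never collide, so the modalities stay orthogonal at the token level; only the learning algorithm connects them.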
So the modalities are orthogonal to each other.
link |
So in the input, everything you add, you add extra tokens.
link |
And then you're shoving all of that into one place.
link |
Yes, the transformer.
link |
And that transformer, that transformer
link |
tries to look at this gigantic token space
link |
and tries to form some kind of representation,
link |
some kind of unique wisdom
link |
about all of these different modalities.
link |
How's that possible?
link |
If you were to sort of like put your psychoanalysis hat on
link |
and try to psychoanalyze this neural network,
link |
is it schizophrenic?
link |
Does it try to, given this very few weights,
link |
represent multiple disjoint things
link |
and somehow have them not interfere with each other?
link |
Or is this a model building on the joint strength,
link |
on whatever is common to all the different modalities?
link |
Like what, if you were to ask a questions,
link |
is it schizophrenic or is it of one mind?
link |
I mean, it is one mind.
link |
And it's actually the simplest algorithm,
link |
which is kind of, in a way, why it feels
link |
like the field hasn't changed
link |
since backpropagation and gradient descent
link |
were proposed for learning neural networks.
link |
So there is obviously details on the architecture.
link |
The current iteration is still the transformer,
link |
which is a powerful sequence modeling architecture.
link |
But then the goal of this, you know,
link |
setting these weights to predict the data
link |
is essentially the same as what I could describe,
link |
I mean, what we described a few years ago for AlphaStar,
link |
language modeling and so on, right?
link |
We take, let's say an Atari game,
link |
we map it to a string of numbers
link |
that will all be probably image space
link |
and action space interleaved.
link |
And all we're gonna do is say, okay,
link |
given the numbers, you know, 1001, 1004, 1005,
link |
the next number that comes is 2006,
link |
which is in the action space.
link |
And you're just optimizing these weights
link |
via very simple gradients, like, you know,
link |
mathematically it's almost the most boring algorithm
link |
you could imagine.
link |
We set the weights so that, given this particular instance,
link |
these weights are set to maximize the probability
link |
of having seen this particular sequence of integers
link |
for this particular game.
link |
And then the algorithm does this
link |
for many, many, many iterations,
link |
looking at different modalities, different games, right?
link |
That's the mixture of the dataset we discussed.
link |
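The "boring" objective just described — set the weights to maximize the probability of the observed next integer — is next-token cross-entropy. A minimal sketch with a plain logit table standing in for the transformer (vocabulary and sequence scaled down from the 1001, 1004, 1005 → 2006 example above):

```python
import numpy as np

def train_next_token(sequences, vocab, lr=0.5, steps=200):
    """Fit a table of next-token logits by gradient descent on
    cross-entropy: maximize the probability of the observed sequences."""
    logits = np.zeros((vocab, vocab))  # logits[current] -> scores over next
    for _ in range(steps):
        grad = np.zeros_like(logits)
        for seq in sequences:
            for cur, nxt in zip(seq, seq[1:]):
                p = np.exp(logits[cur] - logits[cur].max())
                p /= p.sum()                 # softmax over the vocabulary
                p[nxt] -= 1.0                # d(cross-entropy) / d(logits)
                grad[cur] += p
        logits -= lr * grad
    return logits

# Toy interleaved episode with a vocabulary of 8:
# observation tokens 1, 4, 5, then action token 6.
seqs = [[1, 4, 5, 6]]
logits = train_next_token(seqs, vocab=8)
pred = int(np.argmax(logits[5]))  # the token predicted to follow 5
```

A transformer replaces the lookup table with a learned function of the whole context, but the loss and the gradient-descent loop are the same.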
So in a way, it's a very simple algorithm
link |
and the weights, right, they're all shared, right?
link |
So in terms of, is it focusing on one modality or not?
link |
The intermediate weights that are converting
link |
from this input of integers to the target integer
link |
you're predicting next,
link |
those weights certainly are common.
link |
And then the way that tokenization happens,
link |
there is a special place in the neural network
link |
which is we map this integer, like number 1001,
link |
to a vector of real numbers.
link |
Like real numbers, we can optimize them
link |
with gradient descent, right?
link |
The functions we learn are actually
link |
surprisingly differentiable.
link |
That's why we compute gradients.
link |
So this step is the only one
link |
where this orthogonality you mentioned applies.
link |
So mapping a certain token for text or image or actions,
link |
each of these tokens gets its own little vector
link |
of real numbers that represents this.
link |
If you look at the field back many years ago,
link |
people were talking about word vectors or word embeddings.
link |
These are the same.
link |
We have word vectors or embeddings.
link |
We have image vectors or embeddings
link |
and action vectors or embeddings.
link |
And the beauty here is that as you train this model,
link |
if you visualize these little vectors,
link |
it might be that they start aligning
link |
even though they're independent parameters.
link |
They could be anything,
link |
but then it might be that you take the word gato or cat,
link |
which maybe is common enough that it actually has its own token.
link |
And then you take pixels that have a cat
link |
and you might start seeing that these vectors
link |
look like they align, right?
link |
So by learning from this vast amount of data,
link |
the model is realizing the potential connections
link |
between these modalities.
link |
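The per-token embedding table — word vectors, image vectors, and action vectors living side by side as independent parameters — can be sketched like this. The token ids for "gato"/cat are hypothetical, and the training pull toward alignment is simulated with one blending step rather than actual gradient descent:

```python
import numpy as np

rng = np.random.default_rng(1)
VOCAB = 21_000  # text + image + action ranges together
DIM = 64
# One embedding table for all modalities: independent learned vectors.
embeddings = rng.normal(size=(VOCAB, DIM))

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

word_cat = 1_234    # hypothetical id of the text token "cat"
image_cat = 10_777  # hypothetical id of a cat-like image-patch token

# At initialization the two vectors are unrelated...
before = cosine(embeddings[word_cat], embeddings[image_cat])
# ...training on captioned images may pull them together (simulated here
# by blending the image vector toward the word vector).
embeddings[image_cat] = 0.9 * embeddings[word_cat] + 0.1 * embeddings[image_cat]
after = cosine(embeddings[word_cat], embeddings[image_cat])
```

Nothing in the architecture forces this alignment; if it happens, it is because predicting captioned data makes it useful.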
Now I will say there would be another way,
link |
at least in part, to not have these different vectors
link |
for each different modality.
link |
For instance, when I tell you about actions
link |
in certain space, I'm defining actions by words, right?
link |
So you could imagine a world in which I'm not learning
link |
that the action up in Atari is its own number.
link |
The action up in Atari maybe is literally the word
link |
or the sentence, up in Atari, right?
link |
And that would mean we now leverage
link |
much more from the language.
link |
This is not what we did here,
link |
but certainly it might make these connections
link |
much easier to learn and also to teach the model
link |
to correct its own actions and so on, right?
link |
So all this is to say that Gato is indeed the beginning,
link |
and it is a radical idea to do it this way,
link |
but there's probably a lot more to be done
link |
for the results to be more impressive,
link |
not only through scale, but also through some new research
link |
that will come hopefully in the years to come.
link |
So just to elaborate quickly,
link |
you mean one possible next step
link |
or one of the paths that you might take next
link |
is doing the tokenization fundamentally
link |
as a kind of linguistic communication.
link |
So like you convert even images into language.
link |
So doing something like a crude semantic segmentation,
link |
trying to just assign a bunch of words to an image
link |
that like have almost like a dumb entity
link |
explaining as much as it can about the image.
link |
And so you convert that into words
link |
and then you convert games into words
link |
and then you provide the context in words and all of it.
link |
Eventually getting to a point
link |
where everybody agrees with Noam Chomsky
link |
that language is actually at the core of everything
link |
that it's the base layer of intelligence
link |
and consciousness and all that kind of stuff.
link |
You mentioned early on like it's hard to grow.
link |
What did you mean by that?
link |
Cause we're talking about scale might change.
link |
There might be, and we'll talk about this too,
link |
like there's an emergent quality;
link |
there are certain things about these neural networks
link |
that are emergent.
link |
So certain like performance we can see only with scale
link |
and there's some kind of threshold of scale.
link |
So why is it hard to grow something like this Meow network?
link |
So the Meow network is not,
link |
it's not hard to grow if you retrain it.
link |
What's hard is, well, we have now one billion parameters.
link |
We train them for a while.
link |
We spend some amount of work towards building these weights
link |
that are an amazing initial brain
link |
for doing this kind of task we care about.
link |
Could we reuse the weights and expand to a larger brain?
link |
And that is extraordinarily hard,
link |
but also exciting from a research perspective
link |
and a practical point of view, right?
link |
So there's this notion of modularity in software engineering
link |
and we're starting to see some examples
link |
and work that leverages modularity.
link |
In fact, if we go back one step from Gato
link |
to a work that, I would say, trained a much larger,
link |
much more capable network called Flamingo.
link |
Flamingo did not deal with actions,
link |
but it definitely dealt with images
link |
in an interesting way,
link |
kind of akin to what Gato did,
link |
but with a slightly different technique for tokenizing.
link |
But we don't need to go into that detail.
link |
But what Flamingo also did, which GATO didn't do,
link |
and that just happens because these projects,
link |
they're different,
link |
it's a bit of the exploratory nature of research.
link |
The research behind these projects is also modular.
link |
And it has to be, right?
link |
We need to have creativity
link |
and sometimes you need to protect pockets of people,
link |
researchers and so on.
link |
By we, you mean humans.
link |
And also in particular researchers
link |
and maybe even further, DeepMind or other such labs.
link |
And then the neural networks themselves.
link |
So it's modularity all the way down.
link |
So the way that we did modularity,
link |
very beautifully in Flamingo is we took Chinchilla,
link |
which is a language only model,
link |
not an agent if we think of actions
link |
being necessary for agency.
link |
So we took Chinchilla,
link |
we took the weights of Chinchilla,
link |
and then we froze them.
link |
We said, these don't change.
link |
We trained them to be very good at predicting the next word.
link |
It's a very good language model,
link |
state of the art at the time we released it, et cetera, et cetera.
link |
We're gonna add a capability to see, right?
link |
We are gonna add the ability to see to this language model.
link |
So we're gonna attach small pieces of neural networks
link |
at the right places in the model.
link |
It's almost like injecting the network
link |
with some weights and some substructures
link |
in a good way, right?
link |
So you need the research to say, what is effective?
link |
How do you add this capability
link |
without destroying others, et cetera?
link |
So we created a small sub network,
link |
initialized not from random,
link |
but actually from self-supervised learning,
link |
a model that understands vision in general.
link |
And then we took data sets that connect the two modalities,
link |
vision and language.
link |
And then we froze the main part,
link |
the largest portion of the network,
link |
which was Chinchilla, that is 70 billion parameters.
link |
And then we added a few more parameters on top,
link |
trained from scratch, and then some others
link |
that were pre-trained with the capacity to see.
link |
Like it was not tokenization in the way I described for Gato,
link |
but it's a similar idea.
link |
And then we trained the whole system,
link |
parts of it were frozen, parts of it were new.
link |
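The freeze-and-extend recipe just described — a frozen pretrained backbone plus small new trainable pieces attached alongside it — can be sketched at toy scale. The shapes and the single linear "backbone" are illustrative stand-ins, not Flamingo's actual architecture; the point is only that gradients flow into the new adapter while the frozen weights never change:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend "Chinchilla": a frozen linear layer standing in for 70B params.
W_frozen = rng.normal(size=(32, 32))

# New trainable piece: a vision adapter, zero-initialized so the frozen
# language behavior is unchanged before any vision training.
W_adapter = np.zeros((32, 32))

def forward(text_h, image_h):
    # The frozen path is untouched; the adapter adds vision info on top.
    return text_h @ W_frozen + image_h @ W_adapter

def train_step(text_h, image_h, target, lr=0.01):
    global W_adapter
    err = forward(text_h, image_h) - target
    # Gradients flow ONLY into the adapter; W_frozen is never updated.
    W_adapter -= lr * np.outer(image_h, err)

text_h = rng.normal(size=32)   # toy text activations
image_h = rng.normal(size=32)  # toy vision-encoder activations
target = rng.normal(size=32)
snapshot = W_frozen.copy()
for _ in range(500):
    train_step(text_h, image_h, target)
```

In a real framework you would mark the backbone's parameters as non-trainable instead of hand-writing the gradient, but the division of labor is the same.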
And all of a sudden we developed Flamingo,
link |
which is an amazing model that is essentially,
link |
I mean, one way to describe it is as a chatbot
link |
where you can also upload images
link |
and start conversing about images,
link |
but it's also kind of a dialogue style chatbot.
link |
So the input is images and text and the output is text.
link |
And how many parameters you said 70 billion for Chinchilla?
link |
Yeah, Chinchilla is 70 billion.
link |
And then the ones we add on top,
link |
which is almost like a way to overwrite
link |
its activations so that when it sees vision,
link |
it does kind of a correct computation of what it's seeing,
link |
mapping it back to words, so to speak,
link |
that adds an extra 10 billion parameters, right?
link |
So it's total 80 billion, the largest one we released.
link |
And then you train it on a few data sets
link |
that contain vision and language.
link |
And once you interact with the model,
link |
you start seeing that you can upload an image
link |
and start sort of having a dialogue about the image,
link |
which is actually very similar
link |
and akin to what we saw in language-only models.
link |
These prompting abilities that it has,
link |
you can teach it a new vision task, right?
link |
It does things beyond the capabilities
link |
that in theory, the data sets provided in themselves,
link |
but because it leverages a lot of the language knowledge
link |
acquired from Chinchilla,
link |
it actually has this few shot learning ability
link |
and these emergent abilities that we didn't even measure
link |
while we were developing the model,
link |
but once developed, then as you play with the interface,
link |
you can start seeing, wow, okay, yeah, it's cool.
link |
We can upload, I think one of the tweets
link |
on Twitter was about this image of Obama
link |
who is pressing down on a scale
link |
while someone is weighing themselves,
link |
and it's kind of a joke-style image.
link |
And it's notable because I think
link |
Andrej Karpathy a few years ago said,
link |
no computer vision system can understand the subtlety
link |
of this joke in this image, all the things that go on.
link |
And so what we try to do, and it's very anecdotally,
link |
I mean, this is not a proof that we solved this issue,
link |
but it just shows that you can upload now this image
link |
and start conversing with the model, trying to make out
link |
if it gets that there's a joke
link |
because the person weighing themselves
link |
doesn't see that someone behind
link |
is making the weight higher and so on and so forth.
link |
So it's a fascinating capability.
link |
And it comes from this key idea of modularity
link |
where we took a frozen brain
link |
and we just added a new capability.
link |
So the question is, should we,
link |
so in a way you can see even from DeepMind,
link |
we have Flamingo, which took this modular approach
link |
and thus could leverage the scale a bit more reasonably
link |
because we didn't need to retrain a system from scratch.
link |
And on the other hand, we had Gato
link |
which used the same data sets,
link |
but then trained from scratch, right?
link |
And so I guess big question for the community is,
link |
should we train from scratch
link |
or should we embrace modularity?
link |
And this, like, this goes back to modularity
link |
as a way to grow, but reuse seems natural
link |
and it was very effective, certainly.
link |
The next question is, if you go the way of modularity,
link |
is there a systematic way of freezing weights
link |
and joining different modalities
link |
across not just two or three or four networks,
link |
but hundreds of networks
link |
from all different kinds of places,
link |
maybe an open-source network
link |
that looks at weather patterns
link |
and you shove that in somehow
link |
and then you have networks that, I don't know,
link |
do all kinds of things, that play StarCraft
link |
and play all the other video games
link |
and you can keep adding them in without significant effort,
link |
like maybe the effort scales linearly or something like that
link |
as opposed to, like, the more networks you add,
link |
the more you have to worry about the instabilities created.
link |
Yeah, so that vision is beautiful.
link |
I think there's still the question
link |
about within single modalities,
link |
like Chinchilla was reused,
link |
but now if we train the next iteration of language models,
link |
are we gonna use Chinchilla or not?
link |
Yeah, how do you swap out Chinchilla?
link |
Right, so there's still big questions,
link |
but that idea is actually really akin to software engineering,
link |
where we're not reimplementing
link |
libraries from scratch, we're reusing them
link |
and then building ever more amazing things,
link |
including neural networks with software that we're reusing.
link |
So I think this idea of modularity, I like it.
link |
I think it's here to stay.
link |
And that's also why I mentioned,
link |
it's just the beginning, not the end.
link |
You mentioned meta learning.
link |
So given this promise of Gato,
link |
can we try to redefine this term?
link |
That's almost akin to consciousness
link |
because it means different things to different people
link |
throughout the history of artificial intelligence.
link |
But what do you think meta learning is and looks like
link |
now in the five years, 10 years,
link |
will it look like a system like Gato?
link |
What's your sense of what meta learning will look like,
link |
do you think, with all the wisdom we've learned so far?
link |
Yeah, great question.
link |
Maybe it's good to give another data point
link |
looking backwards rather than forward.
link |
So when we talk in 2019,
link |
meta learning meant something that has changed
link |
mostly through the revolution of GPT3 and beyond.
link |
So what meta learning meant at the time
link |
was driven by what benchmarks people cared about
link |
And the benchmarks were about
link |
a capability to learn about object identities.
link |
So it was very much overfitted to vision
link |
and object classification.
link |
And the part that was meta about that was that,
link |
oh, we're not just learning 1,000 categories
link |
that ImageNet tells us to learn.
link |
We're gonna learn object categories
link |
that can be defined when we interact with the model.
link |
So it's interesting to see the evolution.
link |
The way this started was we have a special language
link |
that was a dataset, a small dataset
link |
that we prompted the model with saying,
link |
hey, here is a new classification task.
link |
I'll give you one image and its name,
link |
which was an integer at the time,
link |
and a different image and so on.
link |
So you have a small prompt in the form of a dataset,
link |
a machine learning dataset.
link |
And then you got a system that could
link |
then predict or classify these objects
link |
that you just defined kind of on the fly.
link |
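The "prompt in the form of a dataset" idea can be sketched as building a few-shot context: a handful of (input, integer label) support examples serialized into the context, followed by the query to classify. The prompt format and the two made-up classes here are hypothetical, just to show the mechanics:

```python
def build_fewshot_prompt(examples, query):
    """Serialize a tiny 'dataset' into a prompt: each support example is
    an (input, integer label) pair, followed by the query to classify."""
    lines = ["New classification task:"]
    for x, y in examples:
        lines.append(f"input: {x} -> label: {y}")
    # The model is expected to continue the pattern with the query's label.
    lines.append(f"input: {query} -> label:")
    return "\n".join(lines)

# Hypothetical on-the-fly task: two classes defined purely by the prompt.
support = [("a furry animal that meows", 0),
           ("a vehicle with two wheels", 1)]
prompt = build_fewshot_prompt(support, "a small striped cat")
```

The task exists only inside the prompt; no weights are updated, which is exactly what makes this meta learning rather than ordinary training.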
So fast forward, it was revealed
link |
that language models are few-shot learners.
link |
That's the title of the paper.
link |
So very good title.
link |
Sometimes titles are really good.
link |
So this one is really, really good
link |
because that's the point of GPT3 that showed that, look.
link |
Sure, we can focus on object classification
link |
and what meta learning means
link |
within the space of learning object categories.
link |
This goes back even earlier,
link |
to Omniglot before ImageNet and so on.
link |
So there's a few benchmarks.
link |
Now all of a sudden,
link |
we're a bit unlocked from benchmarks
link |
and through language we can define tasks, right?
link |
So we're literally telling the model some logical task
link |
or little thing that we want it to do.
link |
We prompt it much like we did before,
link |
but now we prompt it through natural language.
link |
And then not perfectly,
link |
I mean, these models have failure modes and that's fine,
link |
but these models then are now doing a new task, right?
link |
So they meta learn this new capability.
link |
Now, that's where we are now.
link |
Flamingo expanded this to vision and language,
link |
but it basically has the same abilities.
link |
You can teach it, for instance, an emergent property
link |
was that you can take pictures of numbers
link |
and then do arithmetic with the numbers just by teaching it.
link |
Oh, that's, I mean, when I show you three plus six,
link |
you know, I want you to output nine
link |
and you show it a few examples and now it does that.
link |
So it went way beyond the,
link |
oh, this ImageNet sort of categorization of images
link |
that we were a bit stuck maybe before this revelation moment
link |
that happened in, I believe it was 2019,
link |
but it was after we chatted.
link |
And in that way it has solved meta learning
link |
as it was previously defined.
link |
Yes, it expanded what it meant.
link |
So that's why you ask, what does it mean?
link |
It's an evolving term.
link |
But here is maybe now looking forward,
link |
looking at what's happening, you know,
link |
obviously in the community with more modalities,
link |
what we can expect.
link |
And I would certainly hope to see the following.
link |
And this is a pretty drastic hope,
link |
but in five years, maybe we chat again.
link |
And we have a system, right?
link |
A set of weights that we can teach to play StarCraft.
link |
Maybe not at the level of AlphaStar,
link |
but play StarCraft a complex game.
link |
We teach it through interactions to prompting.
link |
You can certainly prompt a system.
link |
That's what Gato shows to play some simple Atari games.
link |
So imagine if you start talking to a system,
link |
teaching it a new game, showing it examples of,
link |
you know, in this particular game,
link |
this user did something good.
link |
Maybe the system can even play and ask you questions.
link |
Say, hey, I played this game.
link |
I just played this game.
link |
Can you teach me more?
link |
So in five, maybe 10 years,
link |
these capabilities or what meta learning means
link |
will be much more interactive, much more rich.
link |
And in domains that we used to specialize in, right?
link |
So you see the difference, right?
link |
We built AlphaStar specialized to play StarCraft.
link |
The algorithms were general, but the weights were specialized.
link |
And what we're hoping is that we can teach a network
link |
to play games, to play any game, just using games
link |
as an example, through interacting with it,
link |
teaching it, uploading the Wikipedia page of StarCraft.
link |
Like this is in the horizon,
link |
and obviously the details need to be filled in
link |
and research needs to be done.
link |
But that's how I see meta learning evolving,
link |
which is gonna be beyond prompting.
link |
It's gonna be a bit more interactive.
link |
It's gonna, you know, the system might tell us
link |
to give it feedback after it maybe makes mistakes
link |
or it loses a game.
link |
But it's nonetheless very exciting
link |
because if you think about this this way,
link |
the benchmarks are already there.
link |
We just repurposed the benchmarks, right?
link |
So in a way, I like to map the space of
link |
what maybe AGI means to say, okay, like,
link |
we went 101% performance in Go, in Chess, in StarCraft.
link |
The next iteration might be 20% performance
link |
across quote unquote all tasks, right?
link |
And even if it's not as good, it's fine.
link |
We actually, we have ways to also measure progress
link |
because we have those special agents,
link |
specialized agents and so on.
link |
So this is to me very exciting.
link |
And these next iteration models are definitely hinting
link |
at that direction of progress, which hopefully we can have.
link |
There are obviously some things that could go wrong
link |
in terms of we might not have the tools,
link |
maybe transformers are not enough,
link |
then we must, there's some breakthroughs to come,
link |
which makes the field more exciting
link |
to people like me as well, of course.
link |
But that's, if you ask me five to 10 years,
link |
you might see these models that start to look more like
link |
weights that are already trained.
link |
And then it's more about teaching, or making
link |
them meta learn what you're trying to induce
link |
in terms of tasks and so on.
link |
Well beyond the simple tasks
link |
we're starting to see emerge now, like, you know,
link |
small arithmetic tasks and so on.
link |
So a few questions around that, this is fascinating.
link |
So that kind of teaching interactive,
link |
so it's beyond prompting,
link |
so it's interacting with the neural network,
link |
that's different than the training process.
link |
So it's different than the optimization
link |
over differentiable functions.
link |
This is already trained and now you're teaching,
link |
I mean, it's almost like akin to the brain,
link |
the neurons already set with their connections.
link |
On top of that, you're now using that infrastructure
link |
to build up further knowledge.
link |
Okay, so that's a really interesting distinction
link |
that's actually not obvious
link |
from a software engineering perspective,
link |
that there's a line to be drawn.
link |
Because you always think for a neural network to learn,
link |
it has to be retrained, trained and retrained.
link |
But maybe, and prompting is a way of teaching
link |
a neural network, a little bit of context
link |
about whatever the heck you're trying to get it to do.
link |
So you can maybe expand this prompting capability
link |
by making it interact, that's really, really interesting.
link |
By the way, this is not,
link |
if you look at way back at different ways
link |
to tackle even classification tasks,
link |
so this comes from long standing literature
link |
in machine learning, what I'm suggesting could sound
link |
to some a bit like nearest neighbor.
link |
So nearest neighbor is almost the simplest algorithm
link |
that does not require learning.
link |
So it has this interesting like,
link |
you don't need to compute gradients.
link |
And what nearest neighbor does is,
link |
you quote unquote have a data set or upload a data set.
link |
And then all you need to do is a way to measure distance
link |
And then to classify a new point,
link |
you're just simply computing,
link |
what's the closest point in this massive amount of data?
link |
And that's my answer.
link |
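A minimal sketch of the nearest neighbor procedure just described, with a hypothetical toy dataset and plain Euclidean distance standing in for whatever metric you choose:

```python
import numpy as np

# "Upload" a labeled dataset -- no gradients, no training step at all.
points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
labels = ["cat", "cat", "dog", "dog"]

def classify(query):
    # Measure the distance to every stored point; the closest one answers.
    dists = np.linalg.norm(points - query, axis=1)
    return labels[int(np.argmin(dists))]

print(classify(np.array([0.05, 0.1])))  # the nearby "cat" cluster wins
print(classify(np.array([4.8, 5.1])))   # the "dog" cluster wins
```

To change what the classifier does, you change the stored points or the metric, not any weights.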
So you can think of prompting in a way
link |
as you're uploading not just simple points
link |
and the metric is not the distance between the images
link |
or something simple,
link |
it's something that you compute that's much more advanced.
link |
But in a way, it's very similar, right?
link |
You simply are uploading some knowledge
link |
to this pretrained system. In nearest neighbor,
link |
maybe the metric is learned or not,
link |
but you don't need to further train it.
link |
And then now you immediately get a classifier out of this.
link |
Now it's just an evolution of that concept,
link |
very classical concept in machine learning,
link |
which is, yeah, just learning through
link |
what's the closest point, closest by some distance
link |
and that's it, it's an evolution of that.
link |
And I will say how I saw meta learning
link |
when we worked on a few ideas in 2016,
link |
was precisely through the lens of nearest neighbor,
link |
which is very common in computer vision community, right?
link |
There's a very active area of research
link |
about how do you compute the distance between two images?
link |
But if you have a good distance metric,
link |
you also have a good classifier, right?
link |
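As a sketch of that point, the classifier below is still just nearest neighbor; only the metric changes. Here a cosine distance is computed in a hypothetical embedding space, with a fixed random projection standing in for a learned embedding:

```python
import numpy as np

rng = np.random.default_rng(0)

# A stand-in for a learned embedding: any function mapping inputs to vectors.
W = rng.normal(size=(8, 4))
def embed(x):
    return W @ x

def cosine_dist(a, b):
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# A handful of stored, labeled examples (the "uploaded" knowledge).
support = [rng.normal(size=4) for _ in range(4)]
support_labels = ["land", "land", "water", "water"]

def classify(x):
    # Nearest neighbor, but measured in embedding space.
    e = embed(x)
    dists = [cosine_dist(e, embed(s)) for s in support]
    return support_labels[int(np.argmin(dists))]
```

A better `embed` gives a better metric, and therefore a better classifier, without touching the nearest neighbor logic itself.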
All I'm saying is now these distances
link |
and the points are not just images,
link |
they're like words or sequences of words
link |
and images and actions that teach you something new,
link |
but it might be that technique wise, those come back.
link |
And I will say that it's not necessarily true
link |
that you might not ever train the weights a bit further.
link |
Some aspect of meta learning,
link |
some techniques in meta learning
link |
do actually do a bit of fine tuning as it's called, right?
link |
They train the weights a little bit
link |
when they get a new task.
link |
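A tiny sketch of that fine-tuning flavor: a frozen stand-in for pretrained features, plus a small head trained for a few gradient steps on a new task. All names and numbers here are illustrative, not any specific meta-learning method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "pretrained" features: a fixed random projection stands in for them.
W_frozen = rng.normal(size=(2, 16))
def features(x):
    return np.tanh(x @ W_frozen)

# A new task arrives as a handful of labeled examples.
X = rng.normal(size=(20, 2))
w_true = rng.normal(size=16)            # hidden rule defining the toy task
y = (features(X) @ w_true > 0).astype(float)

# Fine-tune only a small head (logistic regression) with a few gradient steps;
# the feature extractor's weights never change.
w = np.zeros(16)
F = features(X)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(F @ w)))
    w -= 0.5 * F.T @ (p - y) / len(y)

acc = np.mean(((F @ w) > 0) == (y == 1))
```

The point is the split: the big pretrained part stays fixed, and only a few parameters move a little for the new task.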
So as for the how, or how we're gonna achieve this,
link |
as a deep learner and a bit of a skeptic,
link |
we're gonna try a few things,
link |
whether it's a bit of training,
link |
adding a few parameters,
link |
thinking of these as nearest neighbor
link |
or just simply thinking of there's a sequence of words,
link |
it's a prefix and that's the new classifier we'll see, right?
link |
There's the beauty of research,
link |
but what's important is that it is a good goal in itself
link |
that I see as very worthwhile pursuing
link |
for the next stages of not only meta learning.
link |
I think this is basically what's exciting
link |
about machine learning period to me.
link |
Well, the interactive aspect of that
link |
is also very interesting.
link |
The interactive version of nearest neighbor
link |
to help you pull out the classifier from this giant thing.
link |
Okay, is this the way we can go
link |
in five, 10 plus years from any task,
link |
sorry, from many tasks to any task?
link |
So, and what does that mean?
link |
What does it need to be actually trained on?
link |
At which point has the network had enough?
link |
What does a network need to learn about this world
link |
in order to be able to perform any task?
link |
Is it just as simple as language, image, and action?
link |
Or do you need some set of representative images?
link |
Like if you only see land images,
link |
will you know anything about underwater?
link |
Is that somehow fundamentally different?
link |
Those are open questions, I would say.
link |
I mean, the way you put,
link |
let me maybe further your example, right?
link |
If all you see is land images,
link |
but you're reading all about land and water worlds,
link |
but in books, imagine, would that be enough?
link |
Good question, we don't know,
link |
but I guess maybe you can join us
link |
if you want in our quest to find this.
link |
Water world, yeah.
link |
Yes, that's precisely the beauty of research
link |
and that's the research business
link |
which I guess is to figure this out
link |
and ask the right questions
link |
and then iterate with the whole community,
link |
publishing like findings and so on.
link |
But yeah, this is a question.
link |
It's not the only question,
link |
but it's certainly, as you ask, on my mind constantly, right?
link |
And so we'll need to wait for maybe the,
link |
let's say five years, let's hope it's not 10
link |
to see what are the answers.
link |
Some people will largely believe in
link |
unsupervised or self supervised learning
link |
of single modalities and then crossing them.
link |
Some people might think end to end learning
link |
is the answer, modularity is maybe the answer.
link |
but we're just definitely excited to find out.
link |
But it feels like this is the right time
link |
and we're at the beginning of this transition.
link |
We're finally ready to do these kind of general,
link |
big models and agents.
link |
What sort of specific technical thing
link |
about Gato, Flamingo, Chinchilla, Gopher,
link |
any of these is especially beautiful?
link |
That was surprising, maybe.
link |
Is there something that just jumps out at you?
link |
Of course, there's the general thing of like,
link |
you didn't think it was possible
link |
and then you realize it's possible
link |
in terms of the generalizability across modalities
link |
and all that kind of stuff.
link |
Or maybe how small of a network,
link |
relatively speaking, Gato is all that kind of stuff.
link |
But is there some weird little things that were surprising?
link |
Look, I'll give you an answer that's very important
link |
because maybe people don't quite realize this,
link |
but the teams behind these efforts, the actual humans,
link |
that's maybe the surprising bit, in an obviously positive way.
link |
So anytime you see these breakthroughs,
link |
I mean, it's easy to map it to a few people.
link |
There's people that are great at explaining things
link |
and so on, that's very nice.
link |
But maybe the learnings or the meta learnings
link |
that I get as a human about this is,
link |
sure, we can move forward,
link |
but the surprising bit is how important are all the pieces
link |
of these projects, how do they come together?
link |
So I'll give you maybe some of the ingredients
link |
of success that are common across these,
link |
but not the obvious ones in machine learning.
link |
I can always also give you those,
link |
but basically, engineering is critical.
link |
So very good engineering
link |
because ultimately we're collecting data sets, right?
link |
So the engineering of data
link |
and then of deploying the models at scale
link |
into some compute cluster, that cannot be overstated,
link |
that is a huge factor of success.
link |
And it's hard to believe that details matter so much.
link |
We would like to believe that it's true
link |
that there is more and more of a standard formula,
link |
as I was saying, like this recipe that works for everything.
link |
But then when you zoom into each of these projects,
link |
then you realize the devil is indeed in the details.
link |
And then the teams have to work kind of together
link |
towards these goals.
link |
So engineering of data and obviously clusters
link |
and large scale is very important.
link |
And then one that is often not,
link |
maybe nowadays it is more clear is benchmark progress, right?
link |
So we're talking here about multiple months
link |
of tens of researchers and people
link |
that are trying to organize the research and so on
link |
working together and you don't know that you can get there.
link |
I mean, this is the beauty.
link |
Like if you're not risking to trying to do something
link |
that feels impossible, you're not gonna get there,
link |
but you need the way to measure progress.
link |
So the benchmarks that you build are critical.
link |
I've seen this beautifully pay out in many projects.
link |
I mean, maybe the one I've seen it more consistently,
link |
which means we established the metric,
link |
actually the community did,
link |
and then we leveraged that massively, was AlphaFold.
link |
This is a project where the data, the metrics were all there
link |
and all it took was, and it's easier said than done,
link |
an amazing team working not to try
link |
to find some incremental improvement and publish,
link |
which is one way to do research that is valid,
link |
but aim very high and work literally for years
link |
to iterate over that process.
link |
And working for years with the team,
link |
I mean, it is tricky that also happened to happen
link |
partly during a pandemic and so on.
link |
So I think my meta learning from all this is
link |
the teams are critical to the success.
link |
And then if now going to the machine learning,
link |
the part that's surprising is,
link |
so we like architectures like neural networks,
link |
and I would say this was a very rapidly evolving field
link |
until the transformer came.
link |
So attention might indeed be all you need,
link |
which is the title, also a good title,
link |
although in hindsight it's good.
link |
I don't think at the time I thought
link |
this was a great title for a paper,
link |
but that architecture is proving
link |
that the dream of modeling sequences of any bytes,
link |
there is something there that will stick.
link |
And I think these advances in architectures,
link |
in kind of how neural networks are architected
link |
to do what they do.
link |
It's been hard to find one that has been so stable
link |
and relatively has changed very little
link |
since it was invented five or so years ago.
link |
So that is surprising,
link |
it's a surprise that keeps recurring into other projects.
link |
Try to, on a philosophical or technical level,
link |
introspect what is the magic of attention?
link |
What is attention?
link |
There's attention in people that study cognition,
link |
so human attention.
link |
I think there's giant wars over what attention means,
link |
how it works in the human mind.
link |
So there are very simple looks at what attention
link |
is in a neural network from the days of Attention
link |
Is All You Need, but broadly,
link |
do you think there's a general principle
link |
that's really powerful here?
link |
Yeah, so a distinction between transformers and LSTMs,
link |
which were what came before,
link |
and there was a transitional period
link |
where you could use both.
link |
In fact, when we talked about AlphaStar,
link |
we used transformers and LSTMs,
link |
so it was still the beginning of transformers.
link |
They were very powerful,
link |
but LSTMs were still also very powerful sequence models.
link |
So the power of the transformer
link |
is that it has built in what we call an inductive bias
link |
of attention that makes the model,
link |
when you think of a sequence of integers, right?
link |
Like we discussed this before, right?
link |
This is a sequence of words.
link |
When you have to do very hard tasks over these words,
link |
this could be we're gonna translate a whole paragraph,
link |
or we're gonna predict the next paragraph
link |
given 10 paragraphs before.
link |
There's some loose intuition
link |
from how we do it as a human
link |
that is very nicely mimicked and replicated,
link |
structurally speaking, in the transformer,
link |
which is this idea of you're looking for something, right?
link |
So you're sort of when you're,
link |
you just read a piece of text,
link |
now you're thinking what comes next.
link |
You might wanna relook at the text
link |
or look at it from scratch.
link |
I mean, literally, because there's no recurrence.
link |
You're just thinking what comes next,
link |
and it's almost hypothesis driven, right?
link |
So if I'm thinking the next word that I'll write
link |
is cat or dog, okay? The way the transformer works
link |
almost philosophically is it has these two hypotheses.
link |
Is it gonna be cat or is it gonna be dog?
link |
And then it says, okay, if it's cat,
link |
I'm gonna look for certain words, not necessarily cat,
link |
although cat is an obvious word you would look in the past
link |
to see whether it makes more sense to output cat or dog.
link |
And then it does some very deep computation
link |
over the words and beyond, right?
link |
So it combines the words, but it has the query
link |
as we call it, that is cat.
link |
And then similarly for dog, right?
link |
And so it's a very computational way to think about,
link |
look, if I'm thinking deeply about text,
link |
I need to go back to look at all of the text,
link |
attend over it, but it's not just attention,
link |
like what is guiding the attention?
link |
And that was the key insight from an earlier paper,
link |
is not how far away is it?
link |
I mean, how far away is it is important?
link |
What did I just write about?
link |
That's critical, but what you wrote about 10 pages ago
link |
might also be critical.
link |
So you're looking not positionally, but content wise, right?
link |
And transformers have this beautiful way
link |
to query for certain content
link |
and pull it out in a compressed way.
link |
So then you can make a more informed decision.
link |
I mean, that's one way to explain transformers,
link |
but I think it's a very powerful inductive bias.
link |
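The query-and-pull-out mechanism described above can be sketched as scaled dot-product attention. The numbers are a deterministic toy; real models learn the query, key, and value projections:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Each query scores every key by content, not by position,
    # then pulls out a weighted, compressed summary of the values.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores)
    return weights @ V, weights

d = 8
K = np.eye(5, d)                  # 5 past tokens with orthogonal "content"
V = np.arange(5.0)[:, None] * np.ones((5, d))  # value j is the constant vector j
q = 20.0 * K[2:3]                 # a query that asks for the content of token 2

out, w = attention(q, K, V)
# w peaks sharply at position 2: the match is by content,
# wherever token 2 happens to sit in the sequence.
```

Because the scores depend only on the dot product between query and key, a token ten pages back is just as reachable as the last one, which is the contrast with recency-biased recurrent models.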
There might be some details that might change over time,
link |
but I think that is what makes transformers
link |
so much more powerful than the recurrent networks
link |
that were more recency bias based,
link |
which obviously works in some tasks,
link |
but it has major flaws.
link |
Transformer itself has flaws.
link |
And I think the main one, the main challenge is,
link |
these prompts that we just were talking about,
link |
they can be a thousand words long.
link |
But if I'm teaching you StarCraft,
link |
I mean, I'll have to show you videos.
link |
I'll have to point you to whole Wikipedia articles.
link |
We'll have to interact, probably,
link |
as you play, you'll ask me questions.
link |
The context required for us to achieve me
link |
being a good teacher to you on the game
link |
as you would want to do it with a model,
link |
I think, goes well beyond the current capabilities.
link |
So the question is, how do we benchmark this?
link |
And then how do we change the structure
link |
of the architecture?
link |
I think there's ideas on both sides,
link |
but we'll have to see empirically, right?
link |
Obviously what ends up working in the future.
link |
And as you talked about, some of the ideas could be,
link |
keeping the constraint of that length in place,
link |
but then forming like hierarchical representations
link |
to where you can start being much cleverer
link |
in how you use those thousand tokens.
link |
Yeah, that's really interesting.
link |
But it also is possible that this attentional mechanism
link |
where you basically,
link |
you don't have a recency bias,
link |
but you look more generally, you make it learnable.
link |
The mechanism in which way you look back into the past,
link |
you make that learnable.
link |
It's also possible we're at the very beginning of that
link |
because then you might become smarter and smarter
link |
in the way you query the past.
link |
So recent past and distant past
link |
and maybe very, very distant past.
link |
So almost like the attention mechanism
link |
will have to improve and evolve
link |
as well as the tokenization mechanism,
link |
so you can represent long term memory somehow.
link |
And I mean, hierarchies are very,
link |
I mean, it's a very nice word that sounds appealing.
link |
There's lots of work adding hierarchy to the memories.
link |
In practice, it does seem like we keep coming back
link |
to the main formula or main architecture.
link |
That sometimes tells us something.
link |
There's this sentence that a friend of mine told me,
link |
like, whether it wants to work or not.
link |
So transformer was clearly an idea that wanted to work.
link |
And then I think there's some principles we believe
link |
will be needed, but finding the exact details,
link |
details matter so much, right?
link |
That's gonna be tricky.
link |
I love the idea that there's like,
link |
you as a human being, you want some ideas to work.
link |
And then there's the model that wants some ideas to work
link |
and you get to have a conversation to see
link |
which, and most likely the model will win in the end.
link |
Because it's the one, you don't have to do any work.
link |
The model is the one that has to do the work.
link |
So you should listen to the model.
link |
And I really love this idea that you talked about
link |
the humans in this picture.
link |
If I could just briefly ask, one is, you're saying
link |
the benchmarks matter, modulo the humans working on this.
link |
The benchmarks provide a sturdy ground on which to do
link |
these things that seem impossible.
link |
In the darkest of times, they give you hope,
link |
because you can see little signs of improvement.
link |
Like you're not, somehow you're not lost
link |
if you have metrics to measure your improvement.
link |
And then there's other aspect you said elsewhere
link |
and here today, like titles matter.
link |
I wonder how much humans matter
link |
in the evolution of all of this,
link |
meaning individual humans.
link |
You know, something about their interaction,
link |
something about their ideas,
link |
how much they change the direction of all of this.
link |
Like if you change the humans in this picture,
link |
like is it that the model is sitting there
link |
and it wants you, it wants some idea to work?
link |
Or is it the humans, or maybe the model is providing
link |
20 ideas that could work.
link |
And depending on the humans you pick,
link |
they're going to be able to hear some of those ideas.
link |
Like in all the, because you're now directing
link |
all of deep learning at DeepMind,
link |
you get to interact with a lot of projects,
link |
a lot of brilliant researchers.
link |
How much variability is created by the humans?
link |
Yeah, I mean, I do believe humans matter a lot
link |
at the very least at the time scale of years
link |
on when things are happening
link |
and what's the sequencing of it, right?
link |
So you get to interact with people that,
link |
I mean, you mentioned this.
link |
Some people really want some idea to work
link |
and they'll persist.
link |
And then some other people might be more practical.
link |
Like I don't care what idea works.
link |
I care about, you know, cracking protein folding.
link |
And these, at least these two kind of seem opposite sides.
link |
And we've clearly had both historically
link |
and that made certain things happen earlier or later.
link |
So definitely humans involved in all of this endeavor
link |
have had, I would say, years of change or of ordering
link |
how things have happened,
link |
which breakthroughs came before
link |
which other breakthroughs and so on.
link |
So certainly that does happen.
link |
And so one other, maybe one other axis of distinction
link |
is what I called, and this is most commonly used
link |
in reinforcement learning
link |
is the exploration, exploitation tradeoff as well.
link |
It's not exactly what I meant, although quite related.
link |
So when you start trying to help others, right?
link |
Like you become a bit more of a mentor
link |
to a large group of people,
link |
be it a project or the deep learning team
link |
or something or even in the community
link |
when you interact with people in conferences and so on.
link |
You're identifying quickly, right?
link |
Some things that are explorative or exploitative
link |
and it's tempting to try to guide people, obviously.
link |
I mean, that's what makes like our experience,
link |
we bring it and we try to shape things sometimes wrongly.
link |
And there's many times that I've been wrong in the past,
link |
that's great, but it would be wrong to dismiss
link |
any sort of the research styles that I'm observing.
link |
And I often get asked, well, you're in industry, right?
link |
So we do have access to large compute scale and so on.
link |
So there's certain kinds of research
link |
I almost feel like we need to do, responsibly and so on,
link |
but it is almost, we have the particle accelerator here,
link |
so to speak, as in physics, so we need to use it,
link |
we need to answer the questions
link |
that we should be answering right now
link |
for the scientific progress.
link |
But then at the same time, I look at many advances,
link |
including attention, which was discovered
link |
in Montreal initially because of lack of compute, right?
link |
So we were working on sequence to sequence
link |
with my friends over at Google Brain at the time.
link |
And we were using, I think, 8 GPUs,
link |
which was somehow a lot at the time.
link |
And then I think Montreal was a bit more limited in the scale,
link |
but then they discovered this content based attention concept
link |
that then has obviously triggered things like Transformer.
link |
Not everything, obviously, starts with the Transformer.
link |
And there's always a history that is important to recognize
link |
because then you can make sure that then those who might feel now,
link |
well, we don't have so much compute,
link |
you need to then help them optimize that kind of research
link |
that might actually produce amazing change.
link |
Perhaps it's not as short term as some of these advancements
link |
or perhaps it's a different timescale,
link |
but the people and the diversity of the field
link |
is quite critical that we maintain it.
link |
And at times, especially mixed a bit with hype or other things,
link |
it's a bit tricky to be observing maybe too much
link |
of the same thinking across the board.
link |
But the humans definitely are critical.
link |
And I can think of quite a few personal examples
link |
where also someone told me something that had a huge effect.
link |
And then that's why I'm saying, at least in terms of years,
link |
probably some things do happen.
link |
It's also fascinating how constraints somehow
link |
are essential for innovation.
link |
And the other thing you mentioned about engineering,
link |
I have a sneaking suspicion.
link |
Maybe I overvalue it, you know, my love is with engineering.
link |
So I have a sneaking suspicion that all the genius,
link |
a large percentage of the genius
link |
is in the tiny details of engineering.
link |
So I think we like to think the genius is in the big ideas.
link |
I have a sneaking suspicion that because I've
link |
seen the genius of details, of engineering details,
link |
make the night and day difference.
link |
And I wonder if those kind of have a ripple effect over time.
link |
So that too, so that's taken the engineering perspective
link |
that sometimes that quiet innovation
link |
at the level of an individual engineer
link |
or maybe at the small scale of a few engineers
link |
can make all the difference.
link |
That scales, because we're working
link |
on computers that are scaled across large groups,
link |
that one engineering decision can lead to ripple effects.
link |
It's interesting to think about.
link |
Yeah, I mean, engineering, there's also kind of a historical,
link |
it might be a bit random.
link |
Because if you think of the history of how especially
link |
deep learning and neural networks took off,
link |
it feels like a bit random, because GPUs
link |
happen to be there at the right time for a different purpose,
link |
which was to play video games.
link |
So even the engineering that goes into the hardware,
link |
and its time frame might be very different.
link |
I mean, the GPUs evolved throughout many years
link |
when we weren't even looking at that.
link |
So even at that level, that revolution, so to speak,
link |
the ripples are like, we'll see when they stop.
link |
But in terms of thinking of why is this happening,
link |
I think that when I try to categorize it
link |
in sort of things that might not be so obvious,
link |
I mean, clearly there's a hardware revolution.
link |
We are surfing thanks to that.
link |
Data centers as well.
link |
I mean, data centers are where Google, for instance,
link |
obviously they're serving Google,
link |
but it's also, now, thanks to that,
link |
and to having built such amazing data centers,
link |
that we can train these models.
link |
Software is an important one.
link |
I think if I look at the state of how
link |
I had to implement things to implement my ideas,
link |
how I discarded ideas because they were too hard to implement.
link |
Yeah, clearly the times have changed,
link |
and thankfully we are in a much better software position
link |
And then, I mean, obviously there's
link |
research that happens at scale, and more people
link |
enter the field, that's great to see,
link |
but it's almost enabled by these other things.
link |
And last but not least is also data, right?
link |
Curating data sets, labeling data sets, these benchmarks
link |
we think about, maybe we'll want to have all the benchmarks
link |
in one system, but it's still very valuable that someone
link |
put the thought and time and the vision
link |
to build certain benchmarks.
link |
We've seen progress thanks to that.
link |
We're going to repurpose the benchmarks.
link |
That's the beauty of Atari is like we solved it in a way,
link |
but we use it in Gato.
link |
It was critical, and I'm sure there's still a lot more
link |
to do thanks to that amazing benchmark that someone took
link |
the time to put, even though at the time maybe, oh,
link |
you have to think what's the next iteration of architectures.
link |
That's what maybe the field recognizes,
link |
but that's another thing we need to balance
link |
in terms of humans behind.
link |
We need to recognize all these aspects
link |
because they're all critical.
link |
And we tend to think of the genius, the scientist,
link |
and so on, but I'm glad you're, I know you have
link |
a strong engineering background.
link |
But also I'm a lover of data, and there's
link |
a pushback on the engineering comment.
link |
Ultimately, it could be the creatives of benchmarks
link |
who have the most impact.
link |
Andre Capati, who you mentioned,
link |
has recently been talking a lot of trash about ImageNet,
link |
which he has the right to do because of how critical,
link |
how essential he is to the development
link |
and the success of deep learning around ImageNet.
link |
And you're saying that that's actually,
link |
that benchmark is holding back the field.
link |
Because, I mean, especially in his context,
link |
on Tesla autopilot, that's looking at real world behavior
link |
of a system, it's, there's something fundamentally
link |
missing about ImageNet that doesn't capture
link |
the real worldness of things.
link |
That we need to have data sets, benchmarks
link |
that have the unpredictability, the edge cases,
link |
the, whatever the heck it is that makes the real world
link |
so difficult to operate in, we need to have benchmarks for that.
link |
But just to think about the impact of ImageNet
link |
as a benchmark, and that really puts a lot of emphasis
link |
on the importance of a benchmark,
link |
both sort of internally at DeepMind and as a community.
link |
So one is coming in from within, like,
link |
how do I create a benchmark for me to mark and make progress,
link |
and how do I make a benchmark for the community
link |
to mark and push progress.
link |
You have this amazing paper you coauthored,
link |
a survey paper called,
link |
Emergent Abilities of Large Language Models.
link |
It has, again, the philosophy here
link |
that I'd love to ask you about.
link |
What's the intuition about the phenomenon
link |
of emergence in neural networks,
link |
transformers, language models?
link |
Is there a magic threshold beyond which we start
link |
to see certain performance?
link |
And is that different from task to task?
link |
Is that us humans just being poetic and romantic,
link |
or is there literally some level
link |
of which we start to see breakthrough performance?
link |
Yeah, I mean, this is a property
link |
that we start seeing in systems
link |
that actually tend to be,
link |
so in machine learning, traditionally,
link |
again, going to benchmarks.
link |
I mean, if you have some input outputs,
link |
like that is just a single input and a single output,
link |
you generally, when you train these systems,
link |
you see reasonably smooth curves
link |
when you analyze how much the data set size
link |
affects the performance,
link |
or how the model size affects the performance,
link |
or how long you train the system for
link |
affects the performance.
link |
So, if we think of ImageNet,
link |
like the train curves look fairly smooth
link |
and predictable in a way.
link |
And I would say that's probably because
link |
it's kind of a one hop reasoning task, right?
link |
It's like, here is an input
link |
and you think for a few milliseconds
link |
or 100 milliseconds, 300 as a human.
link |
And then you tell me, yeah,
link |
there's an alpaca in this image.
link |
So, in language, we are seeing benchmarks
link |
that require more pondering and more thought in a way, right?
link |
This is just kind of, you need to look for some subtleties
link |
that involve inputs that you might think of,
link |
even if the input is a sentence
link |
describing a mathematical problem,
link |
there is a bit more processing required as a human
link |
and more introspection.
link |
So, I think how these benchmarks work
link |
means that there is actually a threshold,
link |
just going back to how transformers work
link |
in this way of querying for the right questions
link |
to get the right answers.
link |
That might mean that performance becomes random
link |
until the right question is asked
link |
by the querying system of a transformer
link |
or of a language model like a transformer.
link |
And then only then you might start seeing performance
link |
going from random to non random.
link |
And this is more empirical.
link |
There's no formalism or theory behind this yet,
link |
although it might be quite important,
link |
but we're seeing these phase transitions
link |
of random performance until some,
link |
let's say, scale of a model.
link |
And then it goes beyond that.
link |
And it might be that you need to fit
link |
a few low order bits of thought
link |
before you can make progress on the whole task.
link |
And if you could measure, actually,
link |
that breakdown of the task,
link |
maybe you would see more smooth,
link |
oh, like, yeah, this, you know,
link |
once you get this and this and this and this and this,
link |
then you start making progress in the task.
link |
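The "fit a few low order bits before the whole task moves" point can be sketched with a toy model; the smooth per-step sigmoid below is an assumed illustration, not a measured result:

```python
import numpy as np

# Toy sketch of emergence: per-step ability improves smoothly with
# scale, but a benchmark that needs all k steps correct looks flat,
# then suddenly takes off. The sigmoid shape is an assumption.
def per_step_accuracy(log10_params):
    return 1.0 / (1.0 + np.exp(-2.0 * (log10_params - 9.0)))

def task_accuracy(log10_params, k=10):
    # all k reasoning steps must succeed for the task to be scored correct
    return per_step_accuracy(log10_params) ** k

for p in [7, 8, 9, 10, 11]:  # 10M .. 100B parameters
    print(f"1e{p} params: step={per_step_accuracy(p):.3f} "
          f"task={task_accuracy(p):.4f}")
```

On this toy curve the per-step metric climbs smoothly, while the full-task metric sits near zero until around ten billion parameters and then rises quickly, which is the flavor of phase transition being described.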
But it's somehow a bit annoying
link |
because then it means that certain questions
link |
we might ask about architectures,
link |
possibly can only be done at certain scale.
link |
And one thing that, conversely,
link |
I've seen great progress on in the last couple of years
link |
is this notion of science of deep learning
link |
and science of scale in particular, right?
link |
So on the negative is that there's some benchmarks
link |
for which progress might need to be measured
link |
at minimum at certain scale
link |
until you see then what details of the model matter
link |
to make that performance better, right?
link |
So that's a bit of a con.
link |
But what we've also seen is that you can
link |
sort of empirically analyze behavior of models
link |
at scales that are smaller, right?
link |
So let's say to put an example,
link |
we had this chinchilla paper
link |
that revised the so called scaling laws of models.
link |
And that whole study is done
link |
at a reasonably small scale, right?
link |
That may be hundreds of millions
link |
up to one billion parameters.
link |
And then the cool thing is that you create some laws, right?
link |
Some laws, some trends, right?
link |
You extract trends from data that you see.
link |
Okay, like it looks like the amount of data required
link |
to train now a 10x larger model would be this.
link |
And these laws so far,
link |
these extrapolations have helped us save compute
link |
and just get to a better place in terms of the science
link |
of how should we run these models at scale?
link |
How much data, how much depth
link |
and all sorts of questions we start asking
link |
extrapolating from a small scale.
link |
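The workflow just described, fitting trends at small scale and extrapolating upward, can be sketched as follows; the power-law form and all the numbers here are illustrative assumptions, not the actual Chinchilla fits:

```python
import numpy as np

# Sketch of empirical scaling-law fitting: run small models, fit
# loss(N) = a * N^s in log-log space, extrapolate to a larger model.
params = np.array([1e8, 2e8, 4e8, 8e8])   # small-scale training runs
losses = 3.0 * params ** -0.07            # stand-in for measured losses

# linear fit in log-log space: log L = s * log N + log a
s, log_a = np.polyfit(np.log(params), np.log(losses), 1)
a = np.exp(log_a)

predicted = a * (8e9) ** s   # extrapolate to a 10x larger model
print(f"fitted exponent {s:.3f}, predicted loss at 8B params: {predicted:.3f}")
```

The saving comes from never having to train the large model just to learn how much data or depth it needs; the small runs pin down the trend.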
But then this emergence sadly means
link |
that not everything can be extrapolated from small scale,
link |
depending on the benchmark.
link |
And maybe the harder benchmarks are not so good
link |
for extracting these laws,
link |
but we have a variety of benchmarks at least.
link |
So I wonder to which degree the threshold,
link |
the phase shift scale is a function of the benchmark.
link |
So some of that, some of the science of scale
link |
might be engineering benchmarks
link |
where that threshold is low,
link |
sort of taking a main benchmark
link |
and reducing it somehow
link |
where the essential difficulty is left,
link |
but the emergence,
link |
the scale at which the emergence happens is lower.
link |
Just for the science aspect of it
link |
versus the actual real world aspect.
link |
Yeah, so luckily we have quite a few benchmarks,
link |
some of which are simpler
link |
or maybe they're more like,
link |
I think people might call these system one
link |
versus system two style.
link |
So I think what we're now seeing,
link |
luckily, is that extrapolations
link |
from maybe slightly more smooth or simpler benchmarks
link |
are translating to the harder ones.
link |
But that is not to say that
link |
these extrapolations won't hit their limits;
link |
then how much we scale or how we scale
link |
will sadly be a bit suboptimal
link |
until we find better laws, right?
link |
And these laws again are very empirical laws.
link |
They're not like physical laws,
link |
although I wish there would be better theory
link |
about these things as well.
link |
But so far, I would say empirical theory,
link |
as I call it, is way ahead
link |
than actual theory of machine learning.
link |
Let me ask you almost for fun.
link |
So this is not Oriel as a deep mind person
link |
or anything to do with deep mind or Google,
link |
just as a human being
link |
and looking at this news of a Google engineer
link |
who claimed the LaMDA language model was sentient,
link |
and I still need to look into the details of this,
link |
but sort of making an official report
link |
and the claim that he believes there's evidence
link |
that this system has achieved sentience.
link |
And I think this is a really interesting case
link |
on a human level and a psychological level
link |
on a technical machine learning level
link |
of how language models transform our world
link |
and also just philosophical level
link |
of the role of AI systems in a human world.
link |
So what did you, what do you find interesting?
link |
What's your take on all of this
link |
as a machine learning engineer and a researcher
link |
and also as a human being?
link |
Yeah, I mean, a few reactions, quite a few actually.
link |
Have you ever briefly thought, is this thing sentient?
link |
Like even with AlphaStar, wait a minute, what?
link |
Sadly though, I think, yeah, sadly I have not,
link |
yeah, I think the current, any of the current models,
link |
although very useful and very good.
link |
Yeah, I think we're quite far from that.
link |
And there's kind of a converse side story.
link |
So one of my passions is about science in general.
link |
And I think I feel I'm a bit of like a failed scientist.
link |
That's why I came to machine learning
link |
because you always feel and you start seeing this
link |
that machine learning is maybe the science
link |
that can help other sciences as we've seen, right?
link |
Like you, you know, it's such a powerful tool.
link |
So thanks to that angle, right?
link |
That, okay, I love science, I love, I mean,
link |
I love astronomy, I love biology,
link |
but I'm not an expert and I decided,
link |
well, the thing I can do better at is computers.
link |
But especially when I was a bit more involved
link |
in AlphaFold, learning a bit about proteins
link |
and about biology and about life, the complexity,
link |
it feels like it really is like, I mean,
link |
if you start looking at the things that are going on
link |
at the atomic level and also, I mean,
link |
there's obviously that we are maybe inclined
link |
to try to think of neural networks as like the brain,
link |
but the complexities and the amount of magic
link |
that it feels when, I mean, I'm not an expert,
link |
so it naturally feels more magic,
link |
but looking at biological systems
link |
as opposed to these computational brains
link |
just makes me like, wow, there's such level
link |
of complexity difference still, right?
link |
Like orders of magnitude complexity that, sure,
link |
these weights, I mean, we train them
link |
and they do nice things, but they're not at the level
link |
of biological entities, brains, cells.
link |
It just feels like it's just not possible
link |
to achieve the same level of complexity behavior
link |
and my belief when I talk to other beings,
link |
is certainly shaped by this amazement of biology that,
link |
maybe because I know too much,
link |
I don't have about machine learning,
link |
but I certainly feel it's very far fetched
link |
and far in the future to be calling or to be thinking,
link |
well, this mathematical function
link |
that is differentiable is, in fact, sentient and so on.
link |
There's something on that point that it's very interesting.
link |
So you know enough about machines and enough about biology
link |
to know that there's many orders of magnitude
link |
of difference and complexity,
link |
but you know how machine learning works.
link |
So the interesting question from human beings
link |
that are interacting with a system
link |
that don't know about the underlying complexity.
link |
And I've seen people and probably including myself
link |
that have fallen in love with things that are quite simple.
link |
And so maybe the complexity is one part of the picture,
link |
but maybe that's not a necessary,
link |
that's not a necessary condition for sentience,
link |
for perception or emulation of sentience.
link |
Right, so I mean, I guess the other side of this is,
link |
that's how I feel personally.
link |
I mean, you asked me about the person, right?
link |
Now it's very interesting to see
link |
how other humans feel about things, right?
link |
This we are like, again, like I'm not as amazed
link |
about things that I feel are,
link |
this is not as magical as this other thing
link |
because of maybe how I got to learn about it
link |
and how I see the curve a bit more smooth
link |
because I, you know, like just seen the progress
link |
of language models since Shannon in the 50s
link |
and actually looking at that timescale,
link |
it's not that fast progress, right?
link |
I mean, what we were thinking at the time,
link |
like almost a hundred years ago is not that dissimilar
link |
to what we're doing now,
link |
but at the same time, yeah, obviously others,
link |
my experience, right, the personal experience,
link |
I think no one should, you know,
link |
I think no one should tell others how they should feel.
link |
I mean, the feelings are very personal, right?
link |
So how others might feel about the models and so on.
link |
That's one part of the story that is important
link |
to understand for me personally as a researcher.
link |
And then when I maybe disagree
link |
or I don't understand or see that, yeah,
link |
maybe this is not something I think right now is reasonable.
link |
Knowing all that I know, one of the other things
link |
and perhaps partly why it's great to be talking to you
link |
and reaching out to the world about machine learning is,
link |
hey, let's demystify a bit the magic
link |
and try to see a bit more of the math
link |
and the fact that literally to create these models,
link |
if we had the right software,
link |
it would be 10 lines of code
link |
and then just a dump of the internet.
link |
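To make the "ten lines of code and a dump of the internet" point concrete, here is a deliberately tiny stand-in: a character-bigram language model. The counting is a placeholder for the large neural network, and the one-line corpus stands in for the internet dump, but the overall shape of the program really is about this small:

```python
from collections import Counter, defaultdict

# Tiny stand-in "language model": count character transitions in a
# corpus, then generate by repeatedly taking the likeliest next char.
def train(text):
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):   # "training": count bigrams
        counts[a][b] += 1
    return counts

def generate(model, start, n):
    out = start
    for _ in range(n):                 # greedy next-character prediction
        nxt = model[out[-1]].most_common(1)
        if not nxt:
            break
        out += nxt[0][0]
    return out

corpus = "the cat sat on the mat. the cat sat."  # stand-in for the internet
model = train(corpus)
print(generate(model, "th", 10))
```

Swap the counting for a transformer and the toy string for web-scale text, and this is essentially the recipe being described.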
So versus like then the complexity of like the creation
link |
of humans from their inception, right?
link |
And also the complexity of evolution
link |
of the whole universe to where we are
link |
that feels orders of magnitude more complex
link |
and fascinating to me.
link |
So I think, yeah, maybe part of,
link |
what I'm trying to tell you is,
link |
yeah, I think explaining a bit of the magic,
link |
there is a bit of magic.
link |
It's good to be in love obviously with what you do at work.
link |
And I'm certainly fascinated and surprised
link |
quite often as well.
link |
But I think hopefully experts in biology
link |
will tell me this is not as magical,
link |
and I'm happy to learn that through interactions
link |
with the larger community,
link |
we can also have a certain level of education
link |
that in practice also will matter
link |
because I mean, one question is how you feel about this
link |
but then the other very important one is,
link |
you starting to interact with these in products and so on.
link |
It's good to understand a bit what's going on,
link |
what's not going on, what's safe, what's not safe.
link |
Otherwise the technology will not be used properly for good,
link |
which is obviously the goal of all of us, I hope.
link |
So let me then ask the next question.
link |
Do you think in order to solve intelligence
link |
or to replace the Lex bot that does interviews
link |
as we started this conversation with,
link |
do you think the system needs to be sentient?
link |
Do you think it needs to achieve something like consciousness?
link |
And do you think about what consciousness is
link |
in the human mind that could be instructive
link |
for creating AI systems?
link |
Yeah, honestly, I think probably not
link |
to the degree of intelligence that there's this brain
link |
that can learn, can be extremely useful,
link |
can challenge you, can teach you.
link |
Conversely, you can teach it to do things.
link |
I'm not sure it's necessary personally speaking
link |
but if consciousness or any other biological
link |
or evolutionary lesson can be repurposed
link |
to then influence our next set of algorithms,
link |
that is a great way to actually make progress, right?
link |
And the same way I tried to explain transformers
link |
a bit how it feels we operate
link |
when we look at texts specifically,
link |
these insights are very important, right?
link |
So there's a distinction between details
link |
of how the brain might be doing computation.
link |
I think my understanding is, sure, there's neurons
link |
and there's some resemblance to neural networks
link |
but we don't quite understand enough of the brain
link |
in detail, right, to be able to replicate it.
link |
But then more, if you zoom out a bit,
link |
how we then, our thought process, how memory works,
link |
maybe even how evolution got us here,
link |
what's exploration, exploitation,
link |
like how these things happen.
link |
I think these clearly can inform algorithmic level research
link |
and I've seen some examples of these being quite useful
link |
to then guide the research,
link |
even it might be for the wrong reasons, right?
link |
So I think biology and what we know about ourselves
link |
can help a whole lot to build,
link |
essentially what we call AGI, this general,
link |
the real goal, right, the last step of the chain,
link |
hopefully, but consciousness in particular,
link |
I don't myself at least think too hard about
link |
how to add that to the system.
link |
But maybe my understanding is also very personal
link |
about what it means, right?
link |
I think this, even that in itself is a long debate
link |
that I know people have often
link |
and maybe I should learn more about this.
link |
Yeah, and I personally,
link |
I notice the magic often on a personal level,
link |
especially with physical systems like robots.
link |
I have a lot of legged robots now in Austin that I play with
link |
and even when you program them,
link |
when they do things you didn't expect,
link |
there's an immediate anthropomorphization
link |
and you notice the magic
link |
and you start to think about things like sentience
link |
that has to do more with effective communication
link |
and less with any of these kind of dramatic things.
link |
It seems like a useful part of communication.
link |
Having the perception of consciousness
link |
seems like useful for us humans.
link |
We treat each other more seriously.
link |
We are able to do a nearest neighbor,
link |
shoving of that entity into your memory correctly,
link |
all that kind of stuff, seems useful,
link |
at least to fake it even if you never make it.
link |
So maybe like, yeah, mirroring the question
link |
and since you talked to a few people,
link |
then you do think that we'll need to figure something out
link |
in order to achieve intelligence
link |
in a grander sense of the world?
link |
Yeah, I personally believe yes,
link |
but I don't even think it'll be like a separate island
link |
we'll have to travel to.
link |
I think it will emerge quite naturally.
link |
Okay, that's easier for us then, thank you.
link |
But the reason I think it's important to think about
link |
is you will start, I believe,
link |
like with this Google engineer,
link |
you will start seeing this a lot more,
link |
especially when you have AI systems
link |
that are actually interacting with human beings
link |
that don't have an engineering background.
link |
And we have to prepare for that.
link |
Because I do believe there will be a civil rights movement
link |
for robots as silly as it is to say.
link |
There's going to be a large number of people
link |
that realize there's these intelligent entities
link |
with whom I have a deep relationship
link |
and I don't wanna lose them.
link |
They've come to be a part of my life and they mean a lot.
link |
They have a name, they have a story, they have a memory.
link |
And we start to ask questions about ourselves.
link |
Well, this thing sure seems like it's capable of suffering
link |
because it tells all these stories of suffering.
link |
It doesn't wanna die and all those kinds of things.
link |
And we have to start to ask ourselves questions.
link |
What is the difference between a human being and this thing?
link |
And wait, so when you engineer,
link |
I believe from an engineering perspective,
link |
from like a deep mind or anybody that builds systems,
link |
there might be laws in the future
link |
where you're not allowed to engineer systems
link |
with displays of sentience
link |
unless they're explicitly designed to be that,
link |
unless it's a pet.
link |
So if you have a system that's just doing customer support,
link |
you're legally not allowed to display sentience.
link |
We'll start to like ask ourselves that question.
link |
And then so that's going to be part
link |
of the software engineering process.
link |
Do we, which features do we have,
link |
and one of them is communication of sentience?
link |
But it's important to start thinking about that stuff,
link |
especially how much it captivates public attention.
link |
It's definitely a topic that is important we think about.
link |
And I think in a way, I always see not,
link |
I mean, not every movie is equally on point
link |
with certain things,
link |
but certainly science fiction in this sense,
link |
at least has prepared society to start thinking
link |
about certain topics that,
link |
even if it's too early to talk about,
link |
as long as we are like reasonable,
link |
it's certainly going to prepare us for both the research
link |
to come and how to, I mean,
link |
there's many important challenges and topics
link |
that come with building an intelligent system,
link |
many of which you just mentioned, right?
link |
So I think we're never going to be fully ready
link |
unless we talk about these.
link |
And we start also, as I said,
link |
just kind of expanding the people we talk to
link |
to not include only our own researchers and so on.
link |
And in fact, at places like DeepMind,
link |
but also elsewhere, there are more interdisciplinary groups
link |
forming up to start asking
link |
and really working with us on these questions.
link |
Because obviously this is not initially
link |
what your passion is when you do your PhD,
link |
but certainly it is coming, right?
link |
So it's fascinating, it's kind of the thing
link |
that brings me to one of my passions that is learning.
link |
So in this sense, this is kind of a new area
link |
that as a learning system myself,
link |
I want to keep exploring.
link |
And I think it's great that to see parts of the debate
link |
and I've even seen a level of maturity in the conferences
link |
that deal with AI, if you look five years ago,
link |
to now just the amount of workshops and so on
link |
has changed so much. It's impressive to see how much topics
link |
of safety, ethics and so on come to the surface.
link |
And if we're too early, clearly it's fine.
link |
I mean, it's a big field and there's lots of people
link |
with lots of interests that will make progress.
link |
And obviously I don't believe we're too late.
link |
So in that sense, like I think it's great
link |
that we're doing these already.
link |
It's better to be too early than too late
link |
when it comes to super intelligent AI systems.
link |
Let me ask, speaking of sentient AIs,
link |
you gave props to your friend, Ilya Sutskever,
link |
for being elected a Fellow of the Royal Society.
link |
So just as a shout out to a fellow researcher and a friend,
link |
what's the secret to the genius of Ilya Sutskever?
link |
And also, do you believe that his tweets of
link |
as you have hypothesized and Andrej Karpathy did as well,
link |
are generated by a language model?
link |
Yeah, so I strongly believe. Ilya is gonna visit
link |
in a few weeks actually.
link |
So I'll ask him in person, but...
link |
Will he tell you the truth?
link |
Yes, of course, hopefully.
link |
I mean, ultimately we all have shared paths
link |
and there's friendships that go beyond
link |
obviously institutions and so on.
link |
So hope he tells me the truth.
link |
Well, maybe the AI system is holding him hostage somehow.
link |
Maybe he has some videos about, he doesn't wanna release.
link |
So maybe it has taken control over him.
link |
So he can't tell the truth.
link |
If I see him in person, then I'll tell him.
link |
But I think it's a good,
link |
I think Elias's personality, just knowing him for a while.
link |
Yeah, everyone on Twitter, I guess,
link |
gets a different persona, and I think Ilya's
link |
does not surprise me, right?
link |
So I think knowing Ilya from before social media
link |
and before AI was so prevalent,
link |
I recognize a lot of his character.
link |
So that's something for me that I feel good about.
link |
A friend that hasn't changed
link |
or is still true to himself, right?
link |
Obviously, there is though a fact
link |
that your field becomes more popular
link |
and he is obviously one of the main figures in the field
link |
having done a lot of advancement.
link |
So I think that the tricky bit here is
link |
how to balance your true self with the responsibility
link |
that your words carry.
link |
So in this sense, I think, yeah,
link |
like I appreciate the style and I understand it,
link |
but it created debates on like some of his tweets, right?
link |
That maybe it's good, we have them early anyways, right?
link |
But yeah, then the reactions are usually polarizing.
link |
I think we're just seeing kind of the reality
link |
of social media a bit there as well,
link |
reflected on that particular topic
link |
or set of topics he's tweeting about.
link |
Yeah, I mean, it's funny they speak to this tension.
link |
He was one of the early seminal figures
link |
in the field of deep learning.
link |
And so there's a responsibility with that,
link |
but he's also from having interacted with him quite a bit.
link |
He's just a brilliant thinker about ideas.
link |
And which as are you,
link |
and that there's a tension between becoming the manager
link |
versus like the actual thinking through very novel ideas,
link |
the scientist versus the manager.
link |
And he's one of the great scientists of our time.
link |
This was quite interesting.
link |
And also people tell me he's quite silly,
link |
which I haven't quite detected yet,
link |
but in private, we'll have to see about that.
link |
Yeah, yeah, I mean, just on the point of,
link |
I mean, Ilya has been an inspiration.
link |
I mean, quite a few colleagues I can think shaped,
link |
you know, the person you are, like Ilya certainly
link |
gets probably the top spot, if not close to the top.
link |
And if we go back to the question about people in the field,
link |
like how the role would have changed the field or not,
link |
I think Ilya's case is interesting
link |
because he really has a deep belief
link |
in the scaling up of neural networks.
link |
There was a talk that is still famous to this day
link |
from the sequence to sequence paper,
link |
where he was just claiming,
link |
just give me supervised data and a large neural network.
link |
And then, you know, you'll solve basically
link |
all the problems, right?
link |
That vision, right, was already there many years ago.
link |
So it's good to see like someone who is in this case
link |
very deeply into this style of research.
link |
And clearly has had a tremendous track record
link |
of successes and so on.
link |
The funny bit about that talk is that
link |
we rehearsed the talk in a hotel room before
link |
and the original version of that talk
link |
would have been even more controversial.
link |
So maybe I'm the only person
link |
that has seen the unfiltered version of the talk.
link |
And, you know, maybe when the time comes,
link |
maybe we should revisit some of the skip slides
link |
from the talk from Ilya.
link |
But I really think the deep belief
link |
into some certain style of research pays out, right?
link |
It is good to be practical sometimes.
link |
And I actually think Ilya and myself are like practical,
link |
but it's also good.
link |
There's some sort of longterm belief and trajectory.
link |
Obviously, there's a bit of luck involved,
link |
but it might be that that's the right path.
link |
Then you clearly are ahead
link |
and hugely influential to the field, as he has been.
link |
Do you agree with that intuition
link |
that maybe was written about by Rich Sutton
link |
in the bitter lesson, that the biggest lesson
link |
that can be read from 70 years of AI research
link |
is that general methods that leverage computation
link |
are ultimately the most effective.
link |
Do you think that intuition is ultimately correct?
link |
General methods that leverage computation,
link |
allowing the scaling of computation
link |
to do a lot of the work.
link |
And so the basic task of us humans is to design methods
link |
that are more and more general
link |
versus more and more specific to the tasks at hand.
link |
I certainly think this essentially mimics
link |
a bit of the deep learning research,
link |
almost like philosophy,
link |
that on the one hand, we want to be data agnostic.
link |
We don't wanna pre process data sets.
link |
We wanna see the bytes, right?
link |
Like the true data as it is,
link |
and then learn everything on top.
link |
So very much agree with that.
link |
And I think scaling up feels at the very least,
link |
again, necessary for building incredible complex systems.
link |
It's possibly not sufficient
link |
barring that we need a couple of breakthroughs.
link |
I think Rich Sutton mentioned search
link |
being part of the equation of scale and search.
link |
I think with search, what I've seen has been more mixed.
link |
So from that lesson in particular,
link |
search is a bit more tricky
link |
because it is very appealing to search in domains like Go
link |
where you have a clear reward function
link |
that you can then discard some search traces.
link |
But then in some other tasks,
link |
it's not very clear how you would do that.
link |
Although recently one of our recent works
link |
which actually was mostly a continuation,
link |
and even the team and the people involved
link |
pretty much intersected with AlphaStar,
link |
was AlphaCode, in which we actually saw
link |
the bitter lesson how scale of the models
link |
and then a massive amount of search yielded this
link |
kind of very interesting result
link |
of being able to have human level code competition.
link |
So I've seen examples of it being
link |
literally mapped to search and scale.
link |
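The scale-plus-search recipe can be caricatured as "sample many candidates, filter by tests"; the random sampler below is a made-up stand-in for a code model, not AlphaCode's actual pipeline:

```python
import random

# Caricature of AlphaCode-style search: draw many candidate programs
# from a generator and keep only those passing the example tests.
def sample_candidate(rng):
    k = rng.randint(1, 5)
    return k, (lambda x, k=k: x * k)   # candidate program: x -> k*x

def search(num_samples, tests, seed=0):
    rng = random.Random(seed)
    survivors = []
    for _ in range(num_samples):
        k, prog = sample_candidate(rng)
        if all(prog(x) == y for x, y in tests):  # filter on public tests
            survivors.append(k)
    return survivors

# task: recover the program mapping 2 -> 6 and 3 -> 9
found = search(1000, tests=[(2, 6), (3, 9)])
print(f"{len(found)} of 1000 samples survive filtering; all compute 3*x")
```

Scale enters twice: a better model makes each sample more likely to pass, and more compute buys more samples to filter, which is the combination described above.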
I'm not so convinced about the search bit,
link |
but certainly I'm convinced scale will be needed.
link |
So we need general methods.
link |
We need to test them
link |
and maybe we need to make sure that we can scale them
link |
given the hardware that we have in practice,
link |
but then maybe we should also shape
link |
how the hardware looks like
link |
based on which methods might be needed to scale.
link |
And that's an interesting contrast of this GPU comment
link |
that is we got it for free almost
link |
because games were using this,
link |
but maybe now if sparsity is required,
link |
we don't have the hardware, although in theory,
link |
I mean, many people are building
link |
different kinds of hardware these days,
link |
but there's a bit of this notion of hardware lottery
link |
for scale that might actually have an impact
link |
at least on the year, again, scale of years
link |
on how fast we'll make progress
link |
to maybe a version of neural nets
link |
or whatever comes next that might enable
link |
truly intelligent agents.
link |
Do you think in your lifetime we will build an AGI system
link |
that would undeniably be a thing
link |
that achieves human level intelligence and goes far beyond?
link |
I definitely think it's possible
link |
that it will go far beyond,
link |
but I'm definitely convinced
link |
that we will get to human level intelligence.
link |
And I'm hypothesizing about the beyond
link |
because the beyond bit is a bit tricky to define,
link |
especially when we look at the current formula
link |
of starting from this imitation learning standpoint, right?
link |
So we can certainly imitate humans at language and beyond.
link |
So getting at human level through imitation
link |
feels very possible.
link |
Going beyond will require reinforcement learning.
link |
And I think in some areas
link |
that certainly already has paid out.
link |
I mean, Go being an example,
link |
that's my favorite so far
link |
in terms of going beyond human capabilities.
link |
But in general, I'm not sure we can define reward functions
link |
that from a seat of imitating human level intelligence
link |
that is general and then going beyond.
link |
That bit is not so clear in my lifetime,
link |
but certainly human level, yes.
link |
And I mean, that in itself is already quite powerful, I think.
link |
So going beyond, I think it's obviously not,
link |
we're not gonna not try that,
link |
if we then get to superhuman scientists and discovery
link |
and advancing the world,
link |
but at least human level is also in general,
link |
is also very, very powerful.
link |
Well, especially if human level or slightly beyond
link |
is integrated deeply with human society
link |
and there's billions of agents like that,
link |
do you think there's a singularity moment beyond which
link |
our world will be just very deeply transformed
link |
by these kinds of systems?
link |
Because now you're talking about intelligent systems
link |
that are just, I mean,
link |
this is no longer just going from horse and buggy to the car.
link |
It feels like a very different kind of shift
link |
in what it means to be a living entity on earth.
link |
Are you excited about this world?
link |
I'm afraid if there's a lot more,
link |
so I think maybe we'll need to think about
link |
if we truly get there just thinking of limited resources,
link |
like humanity clearly hits some limits
link |
and then there's some balance, hopefully,
link |
that biologically the planet is imposing
link |
and we should actually try to get better at this.
link |
As we know, there's quite a few issues
link |
with having too many people coexisting
link |
in a resource limited way.
link |
So for digital entities, it's an interesting question.
link |
I think such a limit maybe should exist,
link |
but maybe it's gonna be imposed by energy availability
link |
because this also consumes energy.
link |
In fact, most systems are less efficient
link |
than we are in terms of energy required.
link |
But definitely, I think as a society,
link |
we'll need to just work together
link |
to find what would be reasonable in terms of growth
link |
or how we coexist if that is to happen.
link |
I am very excited about the aspects of automation
link |
that give people who don't have access
link |
to certain resources or knowledge
link |
access to them.
link |
I think those are the applications, in a way,
link |
that I'm most excited to see and to personally work towards.
link |
Yeah, there's going to be significant improvements
link |
in productivity and the quality of life
link |
across the whole population, which is very interesting.
link |
But I'm looking even further beyond,
link |
to us becoming a multiplanetary species.
link |
And just as a quick bet, last question,
link |
do you think as humans become multiplanetary species,
link |
go outside our solar system, all that kind of stuff,
link |
do you think there'll be more humans
link |
or more robots in that future world?
link |
So will humans be the quirky, intelligent being of the past?
link |
Or is there something deeply fundamental
link |
to human intelligence that's truly special,
link |
where we will be part of those other planets,
link |
not just AI systems?
link |
I think we're all excited to build AGI to empower us,
link |
or make us more powerful as a human species.
link |
That's not to say there might not be some hybridization.
link |
I mean, this is obviously speculation,
link |
but there are companies also trying to,
link |
the same way medicine is making us better.
link |
Maybe there are other things that are yet to happen on that.
link |
But if the ratio is not at most one to one,
link |
I would not be happy.
link |
So I would hope that we are part of the equation.
link |
But maybe a one-to-one ratio
link |
feels possible, constructive, and so on.
link |
But it would not be good to have an imbalance,
link |
at least from my core beliefs and the why I'm doing what
link |
I'm doing when I go to work and I research what I research.
link |
Well, this is how I know you're human.
link |
And this is how you've passed the Turing test.
link |
And you are one of the special humans, Oriol.
link |
It's a huge honor that you talked with me.
link |
And I hope we get the chance to speak again maybe once
link |
before the singularity, once after,
link |
and see how our view of the world changes.
link |
Thank you again for talking today.
link |
Thank you for the amazing work you do here.
link |
Shining example of a researcher and a human being
link |
in this community.
link |
Thanks a lot, Lex.
link |
Yeah, looking forward to before the singularity, certainly.
link |
Thanks for listening to this conversation with Oriol Vinyals.
link |
To support this podcast, please check out our sponsors
link |
in the description.
link |
And now, let me leave you with some words from Alan Turing.
link |
Those who can imagine anything can create the impossible.
link |
Thank you for listening and hope to see you next time.