
Oriol Vinyals: Deep Learning and Artificial General Intelligence | Lex Fridman Podcast #306



link |
00:00:00.000
at which point is the neural network a being versus a tool?
link |
00:00:08.400
The following is a conversation with Oriol Vinyals,
link |
00:00:11.360
his second time on the podcast.
link |
00:00:13.440
Oriol is the research director
link |
00:00:15.920
and deep learning lead at DeepMind
link |
00:00:18.000
and one of the most brilliant thinkers and researchers
link |
00:00:20.940
in the history of artificial intelligence.
link |
00:00:24.320
This is the Lex Fridman podcast.
link |
00:00:26.640
To support it, please check out our sponsors
link |
00:00:28.840
in the description.
link |
00:00:30.160
And now, dear friends, here's Oriol Vinyals.
link |
00:00:34.480
You are one of the most brilliant researchers
link |
00:00:37.000
in the history of AI,
link |
00:00:38.400
working across all kinds of modalities,
link |
00:00:40.560
probably the one common theme is,
link |
00:00:42.680
it's always sequences of data.
link |
00:00:45.000
So we're talking about languages, images, even biology
link |
00:00:47.960
and games as we talked about last time.
link |
00:00:50.240
So you're a good person to ask this.
link |
00:00:53.360
In your lifetime, will we be able to build an AI system
link |
00:00:57.320
that's able to replace me as the interviewer
link |
00:01:00.760
in this conversation,
link |
00:01:02.600
in terms of ability to ask questions
link |
00:01:04.460
that are compelling to somebody listening?
link |
00:01:06.600
And then further question is, are we close?
link |
00:01:10.640
Will we be able to build a system that replaces you
link |
00:01:13.880
as the interviewee
link |
00:01:16.080
in order to create a compelling conversation?
link |
00:01:18.120
How far away are we, do you think?
link |
00:01:20.040
It's a good question.
link |
00:01:21.800
I think partly I would say, do we want that?
link |
00:01:24.680
I really like when we start now with very powerful models
link |
00:01:29.400
interacting with them
link |
00:01:31.000
and thinking of them more closer to us.
link |
00:01:34.080
The question is, if you remove the human side
link |
00:01:37.040
of the conversation,
link |
00:01:40.240
is that an interesting artifact?
link |
00:01:42.360
And I would say probably not.
link |
00:01:44.440
I've seen, for instance, last time we spoke,
link |
00:01:47.400
like we were talking about StarCraft
link |
00:01:49.920
and creating agents that play games involves self-play,
link |
00:01:54.880
but ultimately what people care about
link |
00:01:56.560
was how does this agent behave
link |
00:01:59.080
when the opposite side is a human?
link |
00:02:02.680
So without a doubt,
link |
00:02:04.720
we will probably be more empowered by AI.
link |
00:02:08.520
Maybe you can source some questions from an AI system.
link |
00:02:12.480
I mean, even today, I would say,
link |
00:02:13.960
it's quite plausible that with your creativity,
link |
00:02:17.040
you might actually find very interesting questions
link |
00:02:19.400
that you can filter.
link |
00:02:20.720
We call this cherry picking sometimes
link |
00:02:22.400
in the field of language.
link |
00:02:24.400
And likewise, if I had now the tools on my side,
link |
00:02:27.520
I could say, look, you're asking this interesting question.
link |
00:02:30.680
From this answer, I like the words chosen
link |
00:02:33.240
by this particular system that created a few words.
link |
00:02:36.600
Completely replacing it feels not exactly exciting to me,
link |
00:02:41.280
although in my lifetime,
link |
00:02:43.760
I mean, given the trajectory,
link |
00:02:45.520
I think it's possible that perhaps
link |
00:02:48.000
there could be interesting maybe self play interviews
link |
00:02:51.760
as you're suggesting that would look
link |
00:02:54.400
or sound quite interesting
link |
00:02:56.160
and probably would educate
link |
00:02:57.720
or you could learn a topic through listening
link |
00:03:00.120
to one of these interviews at a basic level at least.
link |
00:03:03.160
So you said it doesn't seem exciting to you,
link |
00:03:04.800
but what if exciting is part of the objective function
link |
00:03:07.480
the thing is optimized over?
link |
00:03:09.120
So there's probably a huge amount of data of humans,
link |
00:03:12.840
if you look correctly, of humans communicating online,
link |
00:03:16.080
and there's probably ways to measure the degree
link |
00:03:18.840
of, as they call it, engagement.
link |
00:03:21.960
So you can probably optimize the question
link |
00:03:24.160
that has most often created an engaging conversation in the past.
link |
00:03:28.720
So actually, if you strictly use the word exciting,
link |
00:03:33.240
there is probably a way to create
link |
00:03:37.280
optimally exciting conversations
link |
00:03:40.360
that involve AI systems.
link |
00:03:42.200
At least one side is AI.
link |
00:03:44.640
Yeah, that makes sense. I think maybe looping back a bit
link |
00:03:48.040
to games and the game industry,
link |
00:03:50.280
when you design algorithms,
link |
00:03:53.080
you're thinking about winning as the objective,
link |
00:03:55.600
right, or the reward function.
link |
00:03:57.360
But in fact, when we discuss this with Blizzard,
link |
00:04:00.120
the creators of StarCraft in this case,
link |
00:04:02.360
I think what's exciting, fun,
link |
00:04:05.360
if you could measure that and optimize for that,
link |
00:04:09.200
that's probably why we play video games
link |
00:04:11.760
or why we interact or listen
link |
00:04:13.360
or look at cat videos or whatever on the internet.
link |
00:04:16.440
So it's true that modeling reward beyond
link |
00:04:20.000
the obvious reward functions we're used to
link |
00:04:22.080
in reinforcement learning is definitely very exciting.
link |
00:04:25.560
And again, there is some progress actually
link |
00:04:28.200
into a particular aspect of AI,
link |
00:04:31.040
which is quite critical,
link |
00:04:32.120
which is, for instance, whether a conversation
link |
00:04:36.080
or a piece of information is truthful, right?
link |
00:04:38.200
So you could start trying to evaluate these
link |
00:04:41.600
from the internet, right?
link |
00:04:44.400
That has lots of information.
link |
00:04:45.800
And then if you can learn a function automated ideally,
link |
00:04:50.160
so you can also optimize it more easily,
link |
00:04:52.880
then you could actually have conversations
link |
00:04:54.840
that optimize for nonobvious things, such as excitement.
link |
00:04:59.360
So yeah, that's quite possible.
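
To make the idea concrete, here is a minimal, hypothetical sketch of learning such an "excitement" function from engagement labels and using it to score candidate questions. The data, features, and training rule are illustrative placeholders, not any production system:

```python
# A minimal sketch: learn an "excitement" reward function from
# engagement-labeled conversations, then score candidate questions with it.
import math
from collections import Counter

def features(text):
    return Counter(text.lower().split())   # crude bag-of-words

def score(weights, text):
    s = sum(weights.get(w, 0.0) * c for w, c in features(text).items())
    return 1.0 / (1.0 + math.exp(-s))      # squash to [0, 1]

# hypothetical labels mined from online engagement: 1.0 = engaging
data = [("what would aliens think of us", 1.0),
        ("the meeting is at 3pm", 0.0)]

weights, lr = {}, 0.5
for _ in range(100):                        # plain logistic-regression updates
    for text, label in data:
        err = label - score(weights, text)
        for w, c in features(text).items():
            weights[w] = weights.get(w, 0.0) + lr * err * c

# once learned, the reward model can rank or optimize candidate questions
print(round(score(weights, "what would aliens think of us"), 2))
```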
link |
00:05:01.040
And then I would say in that case,
link |
00:05:03.560
it would definitely be fun, a fun exercise
link |
00:05:05.880
and quite unique to have at least one side
link |
00:05:08.040
that is fully driven by an excitement
link |
00:05:11.120
reward function.
link |
00:05:12.840
But obviously there would be still quite a lot
link |
00:05:16.200
of humanity in the system,
link |
00:05:18.200
both from who is building the system, of course,
link |
00:05:21.280
and also ultimately, if we think of labeling for excitement,
link |
00:05:26.040
that those labels must come from us
link |
00:05:28.440
because it's just hard to have a computational measure
link |
00:05:32.520
of excitement.
link |
00:05:33.520
As far as I understand, there's no such thing.
link |
00:05:36.160
Well, you mentioned truth also.
link |
00:05:39.240
I would actually venture to say that excitement
link |
00:05:41.800
is easier to label than truth,
link |
00:05:44.160
or perhaps it has lower consequences of failure.
link |
00:05:49.920
But there is perhaps the humanness that you mentioned.
link |
00:05:55.720
That's perhaps part of a thing that could be labeled.
link |
00:05:58.240
And that could mean an AI system that's doing dialogue,
link |
00:06:02.480
that's doing conversations, should be flawed, for example.
link |
00:06:07.480
That's the thing you optimize for,
link |
00:06:09.440
which is have inherent contradictions by design,
link |
00:06:13.280
have flaws by design.
link |
00:06:15.080
Maybe it also needs to have a strong sense of identity.
link |
00:06:18.760
So it has a backstory, it told itself that it sticks to.
link |
00:06:22.680
It has memories, not in terms of how the system is designed,
link |
00:06:26.880
but it's able to tell stories about its past.
link |
00:06:30.360
It's able to have mortality and fear of mortality
link |
00:06:35.360
in the following way, that it has an identity.
link |
00:06:38.520
And if it says something stupid and gets canceled on Twitter,
link |
00:06:42.440
that's the end of that system.
link |
00:06:44.040
So it's not like you get to rebrand yourself.
link |
00:06:46.680
That system is, that's it.
link |
00:06:48.680
So maybe the high stakes nature of it,
link |
00:06:51.480
because you can't say anything stupid now,
link |
00:06:53.880
or because you'd be canceled on Twitter.
link |
00:06:57.080
And that there's stakes to that.
link |
00:06:59.080
And that, I think, is part of what makes it interesting.
link |
00:07:02.880
And then you have a perspective,
link |
00:07:04.080
like you've built up over time that you stick with,
link |
00:07:07.120
and then people can disagree with you.
link |
00:07:08.560
So holding that perspective strongly,
link |
00:07:11.280
holding sort of maybe a controversial,
link |
00:07:13.400
at least a strong opinion.
link |
00:07:15.720
All of those elements, it feels like they can be learned
link |
00:07:18.240
because it feels like there's a lot of data
link |
00:07:21.240
on the internet of people having an opinion.
link |
00:07:24.120
And then combine that with a metric of excitement.
link |
00:07:27.240
You can start to create something that,
link |
00:07:29.440
as opposed to trying to optimize for sort of,
link |
00:07:34.040
grammatical clarity and truthfulness,
link |
00:07:38.120
the factual consistency over many sentences,
link |
00:07:42.000
you're optimizing for the humanness.
link |
00:07:45.320
And there's obviously data for humanness on the internet.
link |
00:07:48.880
So I wonder if there's a future where that's part,
link |
00:07:53.760
or I mean, I sometimes wonder that about myself.
link |
00:07:56.400
I'm a huge fan of podcasts,
link |
00:07:58.120
and I listened to some podcasts,
link |
00:08:00.760
and I think like, what is interesting about this?
link |
00:08:03.240
What is compelling?
link |
00:08:05.960
The same way you watch other games,
link |
00:08:07.440
like you said, watching people play StarCraft,
link |
00:08:09.160
or watching Magnus Carlsen play chess.
link |
00:08:13.040
So I'm not a chess player,
link |
00:08:14.680
but it's still interesting to me, and what is that?
link |
00:08:16.760
That's the stakes of it,
link |
00:08:19.440
maybe the end of a domination of a series of wins.
link |
00:08:23.400
I don't know, there's all those elements
link |
00:08:25.440
somehow connect to a compelling conversation,
link |
00:08:28.000
and I wonder how hard is that to replace?
link |
00:08:30.200
Because ultimately all of that connects
link |
00:08:31.840
to the initial proposition of how to test,
link |
00:08:35.520
whether an AI is intelligent or not with the Turing test,
link |
00:08:38.680
which I guess my question comes from a place
link |
00:08:41.800
of the spirit of that test.
link |
00:08:43.720
Yes, I actually recall,
link |
00:08:45.480
I was just listening to our first podcast
link |
00:08:47.960
where we discussed Turing test.
link |
00:08:50.400
So I would say from a neural network,
link |
00:08:54.760
AI builder perspective,
link |
00:08:57.680
usually you try to map many of these interesting topics
link |
00:09:03.200
you discuss to benchmarks,
link |
00:09:05.240
and then also to actual architectures
link |
00:09:08.160
on how these systems are currently built,
link |
00:09:10.680
how they learn, what data they learn from,
link |
00:09:13.120
what are they learning, right?
link |
00:09:14.320
We're talking about weights of a mathematical function.
link |
00:09:17.840
And then looking at the current state of the game,
link |
00:09:21.600
maybe what leaps forward do we need
link |
00:09:26.040
to get to the ultimate stage of all these experiences,
link |
00:09:30.680
lifetime experience, fears,
link |
00:09:32.920
words where we're currently barely seeing progress,
link |
00:09:38.040
just because what's happening today is
link |
00:09:40.800
you take all these human interactions,
link |
00:09:44.040
it's a vast variety of human interactions online,
link |
00:09:48.000
and then you're distilling these sequences, right?
link |
00:09:51.680
Going back to my passion,
link |
00:09:53.040
like sequences of words, letters, images, sound,
link |
00:09:56.960
there's more modalities here to be at play.
link |
00:09:59.920
And then you're trying to just learn a function
link |
00:10:03.400
that will be happy, meaning it maximizes the likelihood
link |
00:10:06.800
of seeing all these through a neural network.
link |
00:10:10.960
Now, I think there's a few places
link |
00:10:14.240
where, with the way we currently train these models,
link |
00:10:17.280
we would clearly like to be able to develop
link |
00:10:20.040
the kinds of capabilities you describe.
link |
00:10:22.160
I'll tell you maybe a couple.
link |
00:10:23.560
One is the lifetime of an agent or a model.
link |
00:10:27.640
So you learn from these data offline, right?
link |
00:10:30.840
So you're just passively observing and maximizing these,
link |
00:10:33.880
it's almost like a landscape of mountains.
link |
00:10:37.720
And then everywhere there's data
link |
00:10:39.160
that humans interacted in this way,
link |
00:10:41.040
you're trying to make that higher
link |
00:10:43.000
and then lower where there's no data.
link |
00:10:45.720
And then these models generally don't
link |
00:10:49.520
then experience these themselves;
link |
00:10:51.120
they just are observers, right?
link |
00:10:52.520
They're passive observers of the data.
link |
00:10:54.600
And then we're putting them to then generate data
link |
00:10:57.440
when we interact with them, but that's very limiting.
link |
00:11:00.920
The experience they actually gather,
link |
00:11:03.480
which they could maybe use to optimize
link |
00:11:05.680
or further optimize the weights,
link |
00:11:07.440
we're not even doing that.
link |
00:11:08.640
So to be clear, and again, mapping to AlphaGo, AlphaStar,
link |
00:11:14.080
we train the model.
link |
00:11:15.280
And when we deploy it to play against humans,
link |
00:11:18.280
or in this case, interact with humans
link |
00:11:20.280
like language models, they don't even keep training, right?
link |
00:11:23.560
They're not learning in the sense of the weights
link |
00:11:26.240
that you've learned from the data,
link |
00:11:28.280
they don't keep changing.
link |
00:11:29.840
Now, there's something that feels a bit more magical,
link |
00:11:33.560
but it's understandable if you're into neural nets,
link |
00:11:36.280
which is, well, they might not learn
link |
00:11:39.200
in the strict sense of the word, the weights changing;
link |
00:11:41.560
maybe that's mapping to how neurons interconnect
link |
00:11:44.440
and how we learn over our lifetime.
link |
00:11:46.720
But it's true that the context of the conversation
link |
00:11:50.360
that takes place when you talk to these systems
link |
00:11:55.040
it's held in their working memory, right?
link |
00:11:57.320
It's almost like you start a computer,
link |
00:12:00.200
it has a hard drive that has a lot of information,
link |
00:12:02.920
you have access to the internet,
link |
00:12:04.080
which has probably all the information,
link |
00:12:06.400
but there's also a working memory
link |
00:12:08.560
where these agents, as we call them,
link |
00:12:11.160
or are starting to call them, build upon.
link |
00:12:13.920
Now, this memory is very limited.
link |
00:12:17.000
Right now, to be concrete, we're talking
link |
00:12:19.280
about 2,000 words that we hold,
link |
00:12:21.840
and then beyond that, we start forgetting what we've seen.
link |
00:12:24.920
So you can see that there's some short term coherence
link |
00:12:28.120
already. As you said,
link |
00:12:29.920
I mean, it's a very interesting topic,
link |
00:12:32.400
having sort of a way to map an agent to have consistency,
link |
00:12:37.480
then if you say, oh, what's your name,
link |
00:12:40.840
it could remember that,
link |
00:12:42.320
but then it might forget beyond 2,000 words,
link |
00:12:45.080
which is not that long of context,
link |
00:12:47.560
if we think even of this podcast; books are much longer.
link |
00:12:51.840
So technically speaking, there's a limitation there,
link |
00:12:55.240
super exciting for people that work on deep learning
link |
00:12:58.280
to be working on.
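
As a minimal sketch of that limitation, assuming a word count as a crude stand-in for real tokenization (the 2,048 budget and the helper below are hypothetical):

```python
# The model conditions only on the most recent tokens, so older turns
# silently fall out of context.
CONTEXT_LIMIT = 2048

def build_context(turns, limit=CONTEXT_LIMIT):
    kept, used = [], 0
    for turn in reversed(turns):            # walk backward from the newest turn
        n_tokens = len(turn.split())        # words as a crude stand-in for tokens
        if used + n_tokens > limit:
            break                           # everything older is "forgotten"
        kept.append(turn)
        used += n_tokens
    return list(reversed(kept))

conversation = ["My name is Lex."] + ["and then ..."] * 5000
context = build_context(conversation)
print("My name is Lex." in context)         # False: the name fell out of memory
```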
link |
00:13:00.080
But I would say we lack maybe benchmarks
link |
00:13:03.160
and the technology to have this lifetime-like experience
link |
00:13:07.960
of memory that keeps building up.
link |
00:13:10.960
However, the way it learns offline
link |
00:13:13.240
is clearly very powerful, right?
link |
00:13:14.960
So if you'd asked me three years ago, I would say,
link |
00:13:17.880
oh, we're very far.
link |
00:13:18.720
I think we've seen the power of this imitation again,
link |
00:13:23.200
at the internet scale, that has enabled this
link |
00:13:26.320
to feel like at least the knowledge,
link |
00:13:28.840
the basic knowledge about the world
link |
00:13:30.240
now is incorporated into the weights,
link |
00:13:33.200
but then this experience is lacking.
link |
00:13:36.640
And in fact, as I said, we don't even train them
link |
00:13:39.400
when we're talking to them,
link |
00:13:41.240
other than that their working memory, of course, is affected.
link |
00:13:44.840
So that's the dynamic part,
link |
00:13:46.640
but they don't learn in the same way
link |
00:13:48.320
that you and I have learned from basically
link |
00:13:51.560
when we were born and probably before.
link |
00:13:54.120
So lots of fascinating, interesting questions you asked there.
link |
00:13:57.480
I think the one I mentioned is this idea of memory
link |
00:14:01.760
and experience versus just kind of observing the world
link |
00:14:05.560
and learn its knowledge,
link |
00:14:06.800
for which, I think,
link |
00:14:08.040
I would argue there are lots of recent advancements
link |
00:14:10.400
that make me very excited about the field.
link |
00:14:13.480
And then the second maybe issue that I see is
link |
00:14:18.240
all these models, we train them from scratch.
link |
00:14:21.320
That's something I would have complained three years ago
link |
00:14:24.080
or six years ago or 10 years ago.
link |
00:14:26.480
And it feels, if we take inspiration from how we got here,
link |
00:14:31.480
how the universe evolved us and we keep evolving,
link |
00:14:35.360
it feels that is a missing piece,
link |
00:14:37.960
that we should not be training models from scratch
link |
00:14:41.440
every few months,
link |
00:14:42.600
that there should be some sort of way
link |
00:14:45.360
in which we can grow models, much like we as a species
link |
00:14:49.080
and many other elements in the universe
link |
00:14:51.600
are building from the previous iterations.
link |
00:14:55.120
And that's from just purely neural network perspective.
link |
00:14:59.640
Even though we would like to make it work,
link |
00:15:02.400
it's proven very hard to not throw away
link |
00:15:06.320
the previous weights, right?
link |
00:15:07.760
This landscape we learn from the data
link |
00:15:09.760
and refresh it with a brand new set of weights,
link |
00:15:13.440
given maybe a recent snapshot of this dataset
link |
00:15:17.040
we train on, et cetera,
link |
00:15:18.160
or even a new game we're learning.
link |
00:15:20.040
So that feels like something is missing fundamentally.
link |
00:15:24.240
We might find it, but it's not very clear
link |
00:15:27.520
what it will look like.
link |
00:15:28.480
There's many ideas and it's super exciting as well.
link |
00:15:30.920
Just for people who don't know,
link |
00:15:32.520
when you approach a new problem in machine learning,
link |
00:15:35.800
you're going to come up with an architecture
link |
00:15:38.280
that has a bunch of weights
link |
00:15:41.040
and then you initialize them somehow,
link |
00:15:43.440
which in most cases is some version of random.
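
As a minimal, hypothetical sketch of the two options:

```python
# Initialize weights as random numbers (from scratch) versus reusing
# weights from a previously trained model.
import random

def init_from_scratch(n_weights):
    # "some version of random": small values around zero
    return [random.gauss(0.0, 0.02) for _ in range(n_weights)]

def init_from_pretrained(old_weights):
    # the alternative being argued for: grow from previous iterations
    return list(old_weights)                # then keep training on the new task

w0 = init_from_scratch(10)                  # what happens today, every few months
w1 = init_from_pretrained(w0)               # what reuse could look like
```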
link |
00:15:47.360
So that's what you mean by starting from scratch
link |
00:15:49.040
and it seems like it's a waste every time you solve
link |
00:15:54.520
the game of Go, chess,
link |
00:15:56.760
StarCraft, protein folding, surely there's some way
link |
00:16:01.520
to reuse the weights as we grow this giant database
link |
00:16:05.240
of neural networks that have solved
link |
00:16:09.000
some of the toughest problems in the world.
link |
00:16:10.800
And so some of that is, what is that?
link |
00:16:15.280
Methods, how to reuse weights,
link |
00:16:19.120
how to learn to extract what's generalizable,
link |
00:16:22.520
or at least has a chance to be
link |
00:16:25.200
and throw away the other stuff.
link |
00:16:27.880
And maybe the neural network itself
link |
00:16:29.640
should be able to tell you that.
link |
00:16:31.680
Like what ideas do you have
link |
00:16:35.680
for better initialization of weights?
link |
00:16:37.560
Maybe stepping back, if we look at the field
link |
00:16:40.840
of machine learning, but especially deep learning,
link |
00:16:44.120
at the core of deep learning,
link |
00:16:45.280
there's this beautiful idea that a single algorithm
link |
00:16:49.280
can solve any task.
link |
00:16:50.960
So it's been proven over and over
link |
00:16:54.440
with an ever increasing set of benchmarks
link |
00:16:56.440
and things that were thought impossible
link |
00:16:58.600
that are being cracked by this basic principle.
link |
00:17:02.000
That is, you take a neural network of uninitialized weights,
link |
00:17:05.840
so like a blank computational brain,
link |
00:17:09.680
then you give it, in the case of supervised learning,
link |
00:17:12.600
ideally a lot of examples of,
link |
00:17:14.960
hey, here is what the input looks like
link |
00:17:17.160
and the desired output should look like this.
link |
00:17:19.600
I mean, image classification is very clear example,
link |
00:17:22.400
images to maybe one of a thousand categories,
link |
00:17:25.600
that's what ImageNet is like,
link |
00:17:26.880
but many, many, if not all problems,
link |
00:17:29.120
can be mapped this way.
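
A toy sketch of that universal recipe: pairs of (input, desired output) plus one generic gradient-descent update. Everything here is illustrative; a single weight stands in for a full network, and real systems differ mainly in scale:

```python
# The same update rule, whatever the task: here the "network" learns y = 2x.
def train(examples, steps=1000, lr=0.01):
    w = 0.0                                 # blank "computational brain"
    for _ in range(steps):
        for x, y in examples:
            pred = w * x
            grad = 2 * (pred - y) * x       # d/dw of the squared error
            w -= lr * grad                  # the generic update rule
    return w

print(train([(1, 2), (2, 4), (3, 6)]))      # converges to ~2.0
```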
link |
00:17:30.760
And then there's a generic recipe that you can use,
link |
00:17:35.280
and this recipe with very little change,
link |
00:17:38.640
and I think that's the core of deep learning research,
link |
00:17:41.560
that what is the recipe that is universal,
link |
00:17:44.440
that for any new given task I'll be able to use
link |
00:17:47.400
without thinking, without having to work very hard
link |
00:17:50.360
on the problem at stake.
link |
00:17:52.600
We have not found this recipe,
link |
00:17:54.400
but I think the field is excited to find fewer tweaks
link |
00:18:00.160
or tricks that people find when they work
link |
00:18:02.640
on important problems specific to those
link |
00:18:05.280
and more of a general algorithm, right?
link |
00:18:07.560
So at an algorithmic level,
link |
00:18:09.320
I would say we have something general already,
link |
00:18:11.800
which is this formula of training a very powerful model
link |
00:18:14.520
a neural network, on a lot of data,
link |
00:18:17.000
and in many cases, you need some specificity
link |
00:18:21.200
to the actual problem you're solving.
link |
00:18:23.400
Protein folding being such an important problem
link |
00:18:26.080
has some basic recipe that is learned from before, right?
link |
00:18:30.800
Like transformer models, graph neural networks,
link |
00:18:34.120
ideas coming from NLP, like something called BERT,
link |
00:18:38.600
that is a kind of loss that you can put
link |
00:18:40.840
in place to help the model; knowledge distillation
link |
00:18:44.400
is another technique, right?
link |
00:18:45.680
So this is the formula.
link |
00:18:47.120
We still had to find some particular things
link |
00:18:50.600
that were specific to AlphaFold, right?
link |
00:18:53.640
That's very important because protein folding
link |
00:18:55.920
is such a high value problem that as humans,
link |
00:18:59.160
we should solve it no matter if we need to be a bit specific.
link |
00:19:02.920
And it's possible that some of these learnings
link |
00:19:05.000
will apply to the next iteration of this recipe
link |
00:19:07.440
that deep learning is about.
link |
00:19:09.400
But it is true that so far, the recipe is what's common,
link |
00:19:13.240
but the weights you generally throw away,
link |
00:19:15.920
which feels very sad, although maybe especially
link |
00:19:21.840
in the last two, three years, and when we last spoke,
link |
00:19:24.640
I mentioned this area of meta learning,
link |
00:19:26.640
which is the idea of learning to learn.
link |
00:19:29.600
That idea has seen some progress,
link |
00:19:31.960
starting, I would say, mostly from GPT-3
link |
00:19:35.280
on the language domain only, in which you could conceive
link |
00:19:39.320
a model that is trained once.
link |
00:19:42.160
And then this model is not narrow in that it only
link |
00:19:45.160
knows how to translate a pair of languages
link |
00:19:47.680
or it only knows how to assign sentiment to a sentence.
link |
00:19:51.840
These, actually, you could teach it by prompting,
link |
00:19:55.000
as it's called.
link |
00:19:55.480
And this prompting is essentially just showing it
link |
00:19:58.080
a few more examples, almost like you do show examples,
link |
00:20:01.520
input output examples, algorithmically speaking
link |
00:20:04.120
to the process of creating this model.
link |
00:20:06.200
But now you're doing it through language,
link |
00:20:07.840
which is very natural way for us to learn from one another.
link |
00:20:11.080
I tell you, hey, you should do this new task.
link |
00:20:13.160
I'll tell you a bit more.
link |
00:20:14.640
Maybe you ask me some questions.
link |
00:20:16.120
And now you know the task, right?
link |
00:20:17.840
You didn't need to retrain it from scratch.
link |
00:20:20.360
And we've seen these magical moments
link |
00:20:22.560
almost, in this way of doing few-shot prompting
link |
00:20:26.320
through language, in the language-only domain.
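
A minimal sketch of what such a few-shot prompt looks like; the task is specified entirely in the input text, with no weight updates, and `generate` is a hypothetical stand-in for any large language model's sampling function:

```python
# Few-shot prompting: the examples in the text teach the task in-context.
prompt = """Translate English to French.
sea otter -> loutre de mer
cheese -> fromage
plush giraffe -> girafe en peluche
mint ->"""

# completion = generate(model, prompt)      # the model infers the task in-context
# expected continuation: " menthe"
print(prompt)
```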
link |
00:20:28.600
And then in the last two years, we've
link |
00:20:31.200
seen this expanded beyond language, adding vision,
link |
00:20:35.760
adding actions and games, lots of progress to be had.
link |
00:20:39.480
But this is maybe, if you ask me,
link |
00:20:41.400
about how are we going to crack this problem?
link |
00:20:43.680
This is perhaps one way in which you have a single model.
link |
00:20:48.720
The problem of this model is it's
link |
00:20:50.400
hard to grow in weights or capacity.
link |
00:20:54.240
But the model is certainly so powerful
link |
00:20:56.360
that you can teach it some tasks in this way
link |
00:20:59.600
that I could teach you a new task now if we were,
link |
00:21:02.800
oh, let's say, a text-based task or a classification, a vision-
link |
00:21:06.960
style task.
link |
00:21:08.360
But it still feels like more breakthroughs should be had.
link |
00:21:12.760
But it's a great beginning, right?
link |
00:21:14.000
We have a good baseline.
link |
00:21:15.320
We have an idea that this maybe is the way
link |
00:21:17.720
we want to benchmark progress towards AGI.
link |
00:21:20.680
And I think in my view, that's critical to always have a way
link |
00:21:23.800
to benchmark the community sort of converging
link |
00:21:26.800
to this overall, which is good to see.
link |
00:21:29.120
And then this is actually what excites me in terms of also
link |
00:21:34.320
next steps for deep learning is how
link |
00:21:37.360
to make these models more powerful.
link |
00:21:39.040
How do you train them?
link |
00:21:40.480
How to grow them?
link |
00:21:41.720
If they must grow, should they change their weights
link |
00:21:44.520
as you teach it the task or not?
link |
00:21:46.040
There's some interesting questions, many to be answered.
link |
00:21:48.480
Yeah, you've opened the door to a bunch of questions
link |
00:21:51.720
I want to ask, but let's first return to your tweet
link |
00:21:55.680
and read it like Shakespeare.
link |
00:21:57.120
You wrote, Gato is not the end.
link |
00:21:59.880
It's the beginning.
link |
00:22:01.240
And then you wrote meow and an emoji of a cat.
link |
00:22:06.120
So first, two questions.
link |
00:22:07.680
First, can you explain the meow and the cat emoji?
link |
00:22:10.000
And second, can you explain what Gato is and how it works?
link |
00:22:13.600
Right, indeed.
link |
00:22:14.560
I mean, thanks for reminding me that we're all
link |
00:22:17.920
exposing ourselves on Twitter and it's permanently there.
link |
00:22:20.880
Yes, permanently there.
link |
00:22:21.840
One of the greatest AI researchers of all time,
link |
00:22:25.080
meow and cat emoji.
link |
00:22:27.160
Yes.
link |
00:22:27.480
There you go.
link |
00:22:28.240
Right, so.
link |
00:22:28.920
Can you imagine, like, Turing tweeting meow and a cat?
link |
00:22:32.600
Probably he would, probably would.
link |
00:22:34.280
Probably.
link |
00:22:34.840
So yeah, the tweet?
link |
00:22:36.080
It's important, actually.
link |
00:22:38.280
I put thought on the tweets.
link |
00:22:39.720
I hope people do as well.
link |
00:22:40.720
Which part do you think?
link |
00:22:41.680
OK, so there's three sentences.
link |
00:22:44.840
Gato's not the end.
link |
00:22:46.680
Gato's the beginning.
link |
00:22:48.640
Meow cat emoji, which is the important part.
link |
00:22:51.640
The meow, no, no.
link |
00:22:53.080
Definitely that it is the beginning.
link |
00:22:56.040
I mean, I probably was just explaining a bit
link |
00:23:00.280
where the field is going.
link |
00:23:01.320
But let me tell you about Gato.
link |
00:23:03.680
So first, the name Gato comes from maybe a sequence of releases
link |
00:23:08.720
that DeepMind had that used animal names
link |
00:23:13.120
to name some of their models that
link |
00:23:15.320
are based on this idea of large sequence models.
link |
00:23:19.040
Initially, they're only language,
link |
00:23:20.560
but we're expanding to other modalities.
link |
00:23:23.120
So we had Gopher, Chinchilla; these were language only.
link |
00:23:29.920
And then more recently, we released
link |
00:23:31.960
Flamingo, which adds vision to the equation.
link |
00:23:35.360
And then Gato, which adds vision,
link |
00:23:38.120
and then also actions in the mix.
link |
00:23:41.560
As we discussed, actions, especially discrete actions,
link |
00:23:45.760
like up, down, left, right, I just told you the actions,
link |
00:23:48.720
but they're words.
link |
00:23:49.480
So you can kind of see how actions naturally
link |
00:23:52.480
map to sequence modeling of words, at which these models are
link |
00:23:55.720
very powerful.
link |
00:23:57.000
So Gato was named, I believe, and I can only speak
link |
00:24:02.520
from memory, these things always happen
link |
00:24:06.040
with an amazing team of researchers behind them.
link |
00:24:08.480
So before the release, we had a discussion
link |
00:24:12.160
about which animal would we pick.
link |
00:24:14.200
And I think because of the word general agent,
link |
00:24:18.360
and this is a property quite unique to Gato,
link |
00:24:21.880
we kind of were playing with the GA words.
link |
00:24:24.720
And then Gato arrived, the cat.
link |
00:24:26.960
Yes.
link |
00:24:28.040
And gato is obviously the Spanish word for cat.
link |
00:24:30.240
I had nothing to do with it, although I'm from Spain.
link |
00:24:32.680
Wait, sorry.
link |
00:24:33.280
How do you say cat in Spanish?
link |
00:24:34.640
Gato.
link |
00:24:35.200
Oh, gato.
link |
00:24:35.680
Yeah.
link |
00:24:36.200
Now it all makes sense.
link |
00:24:36.880
OK, I see.
link |
00:24:37.640
I see you.
link |
00:24:38.120
Now it all makes sense.
link |
00:24:39.080
OK, so how do you say meow in Spanish?
link |
00:24:40.800
No, that's probably the same.
link |
00:24:41.920
I think you say it the same way.
link |
00:24:44.360
But you write it as M.I.A.U.
link |
00:24:48.200
It's universal.
link |
00:24:49.240
All right, so then how does the thing work?
link |
00:24:51.640
So you said general, so you said language, vision, action.
link |
00:24:59.200
How does this, can you explain what kind of neural networks
link |
00:25:03.080
are involved?
link |
00:25:04.160
What does the training look like?
link |
00:25:06.320
And maybe what, to you, are some beautiful ideas
link |
00:25:10.880
within this system?
link |
00:25:11.840
Yeah, so maybe the basics of Gato
link |
00:25:16.000
are not that dissimilar from many, many works that came before.
link |
00:25:19.920
So here is where the recipe hasn't changed too much.
link |
00:25:24.240
There is a transformer model.
link |
00:25:25.600
That's the kind of neural network
link |
00:25:28.680
that essentially takes a sequence of modalities,
link |
00:25:33.360
observations that could be words, could be vision,
link |
00:25:37.600
or could be actions.
link |
00:25:38.840
And then the objective that you give it
link |
00:25:42.160
when you train it is to predict what the next anything is.
link |
00:25:46.400
And anything means what's the next action
link |
00:25:48.800
if this sequence that I'm showing you to train
link |
00:25:51.240
is a sequence of actions and observations,
link |
00:25:53.520
then you're predicting what's the next action
link |
00:25:55.640
and the next observation.
link |
00:25:57.120
So you think of this really as a sequence of bytes.
link |
00:26:00.920
So take any sequence of words, a sequence of interleaved words
link |
00:26:05.920
and images, a sequence of maybe observations
link |
00:26:10.400
that are images and moves, like up, down, left, right.
link |
00:26:14.280
And these you just think of them as bytes
link |
00:26:17.640
and you're modeling what's the next byte gonna be like.
link |
00:26:20.600
And you might interpret that as an action
link |
00:26:23.440
and then play it in a game
link |
00:26:25.880
or you could interpret it as a word
link |
00:26:27.720
and then write it down
link |
00:26:29.120
if you're chatting with the system and so on.
link |
00:26:32.480
So Gato basically can be thought of as inputs: images,
link |
00:26:37.840
text, video, actions.
link |
00:26:41.480
It also actually inputs some sort of proprioception
link |
00:26:45.280
sensors from robotics because robotics is one of the tasks
link |
00:26:48.280
that it's been trained to do.
link |
00:26:49.880
And then at the output, similarly,
link |
00:26:51.880
it outputs words, actions.
link |
00:26:53.720
It does not output images.
link |
00:26:55.680
That's just by design,
link |
00:26:57.440
we decided not to go that way for now.
link |
00:27:00.880
That's also in part why it's the beginning
link |
00:27:02.720
because there's more to do clearly.
link |
00:27:04.920
But that's kind of what Gato is,
link |
00:27:06.440
is this brain that essentially you give it any sequence
link |
00:27:09.200
of these observations and modalities
link |
00:27:11.920
and it outputs the next step.
link |
00:27:13.760
And then off you go, you feed the next step back in
link |
00:27:17.400
and predict the next one and so on.
link |
00:27:20.080
Now, it is more than a language model
link |
00:27:24.160
because even though you can chat with Gato,
link |
00:27:26.760
like you can chat with Chinchilla or Flamingo,
link |
00:27:30.520
it also is an agent, right?
link |
00:27:33.200
So that's why we call it the A of Gato,
link |
00:27:37.200
the letter A for agent, and also it's general.
link |
00:27:41.360
It's not an agent that's been trained to be good
link |
00:27:44.000
at only StarCraft or only Atari or only Go.
link |
00:27:47.880
It's been trained on a vast variety of datasets.
link |
00:27:50.920
So...
link |
00:27:51.760
What makes an agent, if I may interrupt,
link |
00:27:53.840
the fact that it can generate actions?
link |
00:27:56.040
Yes, so when we call it,
link |
00:27:58.160
I mean, it's a good question, right?
link |
00:28:00.080
What, when do we call a model an agent?
link |
00:28:02.800
I mean, everything is a model,
link |
00:28:03.880
but what is an agent, in my view,
link |
00:28:05.840
is indeed the capacity to take actions in an environment
link |
00:28:09.720
that you then send to it,
link |
00:28:11.720
and then the environment might return
link |
00:28:13.520
with a new observation
link |
00:28:15.080
and then you generate the next action and so on.
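
A minimal sketch of that loop, with a toy environment and policy (both hypothetical): the model becomes an agent once its outputs are actions that change an environment.

```python
# Agent-environment loop: act, observe, act again.
class Env:
    def __init__(self):
        self.state = 0
    def step(self, action):
        self.state += action                # the environment gets modified
        return self.state                   # and returns a new observation

def policy(observation):
    return 1 if observation < 5 else 0      # choose the next action

env, obs = Env(), 0
for _ in range(10):
    action = policy(obs)                    # agent acts...
    obs = env.step(action)                  # ...environment responds
print(obs)                                  # 5
```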
link |
00:28:17.600
This actually, this reminds me of the question
link |
00:28:20.480
from the side of biology, what is life?
link |
00:28:23.040
Which is actually a very difficult question as well.
link |
00:28:25.400
What is living?
link |
00:28:26.800
What is living when you think about life here
link |
00:28:29.480
on this planet Earth?
link |
00:28:31.040
And a question interesting to me about aliens,
link |
00:28:33.480
what is life when we visit another planet?
link |
00:28:35.760
Would we be able to recognize it?
link |
00:28:37.240
And this feels like it sounds perhaps silly,
link |
00:28:40.240
but I don't think it is.
link |
00:28:41.400
At which point is the neural network
link |
00:28:43.840
a being versus a tool?
link |
00:28:48.320
And it feels like action,
link |
00:28:50.160
ability to modify its environment,
link |
00:28:52.440
is that fundamental leap?
link |
00:28:54.600
Yeah, I think it certainly feels like action
link |
00:28:57.480
is a necessary condition to be more alive,
link |
00:29:02.000
but probably not sufficient either.
link |
00:29:04.440
So sadly I...
link |
00:29:05.280
It's a soul, consciousness thing, whatever.
link |
00:29:06.920
Yeah, yeah, we can get back to that later.
link |
00:29:09.120
But anyways, going back to the meow and the Gato, right?
link |
00:29:12.360
So one of the leaps forward
link |
00:29:16.160
and what took the team a lot of effort and time was,
link |
00:29:20.040
as you were asking, how has Gato been trained?
link |
00:29:23.120
So I told you Gato is this transformer neural network,
link |
00:29:26.120
models actions, sequences of actions, words, et cetera.
link |
00:29:30.600
And then the way we train it
link |
00:29:32.520
is by essentially pooling datasets
link |
00:29:36.840
of observations, right?
link |
00:29:39.440
So it's a massive imitation learning algorithm
link |
00:29:42.640
that imitates, obviously, what the next word is
link |
00:29:46.320
from the usual datasets we used before, right?
link |
00:29:50.160
So these are these web scale style datasets
link |
00:29:53.040
of people writing on the web or chatting or whatnot, right?
link |
00:29:58.520
So that's an obvious source
link |
00:29:59.840
that we use on all language work.
link |
00:30:02.040
But then we also took a lot of agents
link |
00:30:05.640
that we have at DeepMind.
link |
00:30:06.720
I mean, as you know, at DeepMind we're quite interested
link |
00:30:10.960
in reinforcement learning
link |
00:30:13.640
and learning agents that play in different environments.
link |
00:30:17.000
So we kind of created a dataset of these trajectories,
link |
00:30:20.800
as we call them, or agent experiences.
link |
00:30:23.040
So in a way, there are other agents we train
link |
00:30:25.720
for a single-minded purpose to, let's say,
link |
00:30:29.560
control a 3D game environment and navigate a maze.
link |
00:30:33.400
So we had all the experience that was created
link |
00:30:36.120
through the one agent interacting with that environment.
link |
00:30:39.600
And we added this to the dataset, right?
link |
00:30:41.920
And as I said, we just see all the data,
link |
00:30:44.400
all these sequences of words or sequences
link |
00:30:46.440
of this agent interacting with that environment,
link |
00:30:49.720
or agents playing Atari and so on.
link |
00:30:52.200
We see that as the same kind of data.
link |
00:30:54.880
And so we mix these datasets together and we train Gato.
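
A minimal sketch of that mixing: web text and agent trajectories are all flattened into token sequences and sampled from one pool. The contents below are hypothetical placeholders, not the actual training data.

```python
# One pool, one brain: every modality is just another sequence to imitate.
import random

web_text = [["the", "cat", "sat"], ["hello", "world"]]
atari_runs = [["<obs_1>", "<act_up>", "<obs_2>", "<act_down>"]]
maze_runs = [["<obs_9>", "<act_left>", "<obs_10>"]]

pool = web_text + atari_runs + maze_runs

def sample_batch(batch_size=2):
    # training draws sequences from all sources as the same kind of data
    return random.sample(pool, batch_size)

print(sample_batch())
```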
link |
00:31:00.160
That's the G part, right?
link |
00:31:01.600
It's general because it really has mixed,
link |
00:31:05.200
it doesn't have different brains for each modality
link |
00:31:07.520
or each narrow task.
link |
00:31:09.080
It has a single brain.
link |
00:31:10.480
It's not that big of a brain compared
link |
00:31:12.120
to most of the neural networks we see these days.
link |
00:31:14.800
It has one billion parameters.
link |
00:31:18.240
Some models we're seeing get in the trillions these days
link |
00:31:21.080
and certainly 100 billion feels like a size
link |
00:31:25.040
that is very common for these training runs.
link |
00:31:28.960
So the actual agent is relatively small,
link |
00:31:32.680
but it's been trained on a very challenging,
link |
00:31:35.040
diverse dataset, not only containing all of internet,
link |
00:31:38.000
but containing all these agent experience
link |
00:31:40.400
playing very different distinct environments.
link |
00:31:43.160
So this brings us to the part of the tweet of,
link |
00:31:46.440
this is not the end, it's the beginning.
link |
00:31:48.920
It feels very cool to see Gato in principle
link |
00:31:53.120
is able to control any sort of environment,
link |
00:31:56.640
especially the ones that it's been trained on,
link |
00:31:59.160
these 3D games, Atari games,
link |
00:32:01.120
all sorts of robotics tasks and so on,
link |
00:32:04.640
but obviously it's not as proficient
link |
00:32:07.760
as the teachers it learned from on these environments.
link |
00:32:10.560
Not obvious.
link |
00:32:11.760
It's not obvious that it wouldn't be more proficient.
link |
00:32:15.080
It's just the current beginning part
link |
00:32:18.040
is that the performance is such that it's not as good
link |
00:32:21.800
as if it's specialized to that task.
link |
00:32:23.440
Right, so it's not as good,
link |
00:32:25.800
although I would argue size matters here.
link |
00:32:28.080
So the fact that...
link |
00:32:29.160
I would argue size always matters.
link |
00:32:31.360
That's a different question.
link |
00:32:33.400
But for neural networks, certainly size does matter.
link |
00:32:36.240
So it's the beginning because it's relatively small.
link |
00:32:39.640
So obviously scaling this idea up
link |
00:32:42.600
might make the connections that exist between
link |
00:32:47.200
text on the internet and playing Atari and so on
link |
00:32:50.720
more synergistic with one another and you might gain.
link |
00:32:54.240
And that moment we didn't quite see,
link |
00:32:56.360
but obviously that's why it's the beginning.
link |
00:32:58.640
That synergy might emerge with scale.
link |
00:33:01.000
Right, might emerge with scale.
link |
00:33:02.160
And also I believe there's some new research
link |
00:33:04.440
or ways in which you prepare the data
link |
00:33:07.640
that you might need to sort of make it more clear
link |
00:33:10.960
to the model that you're not only playing Atari
link |
00:33:14.160
and it's just you start from a screen
link |
00:33:16.360
and here is up and down and so on.
link |
00:33:18.400
Maybe you can think of playing Atari
link |
00:33:20.680
as there's some sort of context
link |
00:33:22.560
that is needed for the agent
link |
00:33:23.920
before it starts seeing,
link |
00:33:25.200
oh, this is an Atari screen, I'm gonna start playing.
link |
00:33:28.640
You might require, for instance, to be told in words,
link |
00:33:33.400
hey, in this sequence that I'm showing,
link |
00:33:36.880
you're gonna be playing an Atari game.
link |
00:33:39.120
So text might actually be a good driver
link |
00:33:42.000
to enhance the data, right?
link |
00:33:44.440
So then these connections might be made more easily, right?
link |
00:33:46.960
That's an idea that we start seeing in language,
link |
00:33:51.240
but obviously beyond this is gonna be effective, right?
link |
00:33:55.080
It's not like, I don't show you a screen
link |
00:33:57.480
and you from scratch, you're supposed to learn a game.
link |
00:34:01.000
There is a lot of context we might set.
link |
00:34:03.400
So there might be some work needed as well
link |
00:34:05.840
to set that context, but anyways, there's a lot of work.
link |
00:34:10.680
So that context puts all the different modalities
link |
00:34:13.520
on the same level ground, if you provide the context well.
link |
00:34:16.680
So maybe on that point, so there's this task
link |
00:34:20.720
which may not be trivial, of tokenizing the data,
link |
00:34:25.560
of converting the data into pieces,
link |
00:34:28.560
into basic atomic elements
link |
00:34:31.360
that then could cross modality somehow.
link |
00:34:35.320
So what's tokenization?
link |
00:34:37.920
How do you tokenize text?
link |
00:34:39.720
How do you tokenize images?
link |
00:34:42.240
How do you tokenize games and actions and robotics tasks?
link |
00:34:47.120
Yeah, that's a great question.
link |
00:34:48.240
So tokenization is the entry point
link |
00:34:52.880
to actually make all the data look like a sequence
link |
00:34:55.640
because tokens then are just kind of these little puzzle pieces.
link |
00:34:59.520
We break down anything into these puzzle pieces
link |
00:35:01.800
and then we just model what's this puzzle look like, right?
link |
00:35:05.400
When you make it lay down in a line,
link |
00:35:07.760
so to speak in a sequence.
link |
00:35:09.520
So in Gato, for text, there's a lot of prior work.
link |
00:35:14.520
You tokenize text usually by looking
link |
00:35:17.400
at commonly used substrings, right?
link |
00:35:20.080
So, ING in English is a very common substring,
link |
00:35:23.720
so that becomes a token.
link |
00:35:25.560
It's a quite well studied problem, tokenizing text,
link |
00:35:29.080
and Gato just uses the standard techniques
link |
00:35:31.640
that have been developed over many years,
link |
00:35:34.360
even starting from Ngram models in the 1950s and so on.
link |
00:35:38.040
Just for context, how many tokens,
link |
00:35:40.240
like what order, magnitude, number of tokens
link |
00:35:42.680
is required for a word?
link |
00:35:44.560
Yeah.
link |
00:35:45.400
What are we talking about?
link |
00:35:46.240
Yeah, for a word in English, right?
link |
00:35:48.720
I mean, every language is very different.
link |
00:35:51.160
The current level or granularity of tokenization
link |
00:35:53.960
generally means maybe two to five.
link |
00:35:57.880
I mean, I don't know the statistics exactly,
link |
00:36:00.240
but to give you an idea,
link |
00:36:02.200
we don't tokenize at the level of letters
link |
00:36:04.200
then it would probably be like,
link |
00:36:05.560
I don't know what the average length of a word is in English,
link |
00:36:08.120
but that would be the minimum set of tokens you could use.
link |
00:36:11.440
It was bigger than letter, smaller than words.
link |
00:36:13.240
Yes, yes.
link |
00:36:14.080
And you could think of very, very common words like the,
link |
00:36:16.920
I mean, that would be a single token,
link |
00:36:18.880
but very quickly you're talking two, three, four tokens or so.
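
A minimal sketch of that subword tokenization with a tiny hypothetical vocabulary; real tokenizers learn the vocabulary from corpus statistics rather than hard-coding it:

```python
# Common substrings like "ing" get their own IDs, so a word maps to a few tokens.
vocab = {"play": 1, "ing": 2, "the": 3, "p": 4, "l": 5, "a": 6, "y": 7}

def tokenize(word):
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # greedy longest match
            piece = word[i:j]
            if piece in vocab:
                tokens.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError("no token for: " + word[i])
    return tokens

print(tokenize("playing"))                  # [1, 2] -> "play" + "ing"
print(tokenize("the"))                      # [3]    -> one token for a common word
```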
link |
00:36:22.400
Have you ever tried to tokenize emojis?
link |
00:36:24.840
Emojis are actually just sequences of letters, so.
link |
00:36:30.120
Maybe to you, but to me, they mean so much more.
link |
00:36:33.080
Yeah, you can render the emoji,
link |
00:36:34.480
but you might, if you actually just.
link |
00:36:36.880
Yeah, this is a philosophical question.
link |
00:36:39.000
Is emojis an image or a text?
link |
00:36:43.360
The way we do these things is,
link |
00:36:46.080
they're actually mapped to small sequences of characters.
link |
00:36:49.600
So you can actually play with these models
link |
00:36:52.640
and input emojis, it will output emojis back,
link |
00:36:55.840
which is actually quite a fun exercise.
link |
00:36:57.960
You probably can find other tweets about these out there.
link |
00:37:02.320
But yeah, so anyways, text,
link |
00:37:03.640
there's like, it's very clear how this is done.
link |
00:37:06.760
And then in Gato, what we did for images
link |
00:37:10.600
is we map images to essentially,
link |
00:37:13.760
we compressed images, so to speak,
link |
00:37:15.440
into something that looks like less,
link |
00:37:19.080
because every pixel with every intensity
link |
00:37:21.280
would mean we have a very long sequence, right?
link |
00:37:23.800
Like if we were talking about 100 by 100 pixel images,
link |
00:37:27.240
that would make the sequences far too long.
link |
00:37:29.880
So what was done there is you just use a technique
link |
00:37:33.280
that essentially compresses an image
link |
00:37:35.760
into maybe 16 by 16 patches of pixels,
link |
00:37:40.120
and then that is mapped.
link |
00:37:41.760
Again, tokenize, you just essentially quantize this space
link |
00:37:45.320
into a special word that actually maps
link |
00:37:48.960
to this little sequence of pixels.
link |
00:37:51.760
And then you put the pixels together in some raster order,
link |
00:37:55.080
and then that's how you get out of,
link |
00:37:57.760
or into, the image that you're processing.
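
A minimal sketch of that patch-and-quantize step: cut the image into 16x16 patches in raster order and map each patch to its nearest codebook entry. The 1-D codebook and the patch-mean summary are crude hypothetical stand-ins for the learned mapping.

```python
# Patch the image, then quantize each patch to a "visual word" token.
import random

PATCH = 16
random.seed(0)
image = [[random.random() for _ in range(64)] for _ in range(64)]
codebook = [random.random() for _ in range(512)]   # hypothetical prototypes

def patch_tokens(img):
    tokens = []
    for r in range(0, len(img), PATCH):            # raster order: rows, then cols
        for c in range(0, len(img[0]), PATCH):
            patch = [img[r + i][c + j] for i in range(PATCH) for j in range(PATCH)]
            mean = sum(patch) / len(patch)
            # the nearest codebook entry's index becomes the token
            tokens.append(min(range(len(codebook)),
                              key=lambda k: abs(codebook[k] - mean)))
    return tokens

print(len(patch_tokens(image)))                    # 16 tokens for a 64x64 image
```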
link |
00:38:00.760
But there's no semantic aspect to that.
link |
00:38:04.040
So you're doing some kind of,
link |
00:38:05.840
you don't need to understand anything about the image
link |
00:38:07.760
in order to tokenize it currently.
link |
00:38:09.640
No, you're only using this notion of compression.
link |
00:38:12.600
So you're trying to find common,
link |
00:38:15.080
it's like JPG or all these algorithms,
link |
00:38:17.640
it's actually very similar at the tokenization level.
link |
00:38:20.520
All we're doing is finding common patterns
link |
00:38:23.320
and then making sure in a lossy way we compress these images
link |
00:38:27.200
given the statistics of the images
link |
00:38:29.480
that are contained in all the data we deal with.
link |
00:38:31.800
Although you could probably argue that JPG
link |
00:38:34.200
does have some understanding of images.
link |
00:38:36.840
Like, because visual information,
link |
00:38:41.640
maybe color, compressing based,
link |
00:38:45.520
crudely based on color does capture some,
link |
00:38:48.880
something important about an image
link |
00:38:51.160
that's about its meaning, not just about some statistics.
link |
00:38:54.640
Yeah, I mean, JPEG, as I said,
link |
00:38:56.200
the algorithms actually look very similar;
link |
00:38:59.440
they use the cosine transform in JPEG.
link |
00:39:03.720
The approach we usually do in machine learning
link |
00:39:07.160
when we deal with images
link |
00:39:08.280
and we do this quantization step
link |
00:39:10.160
is a bit more data driven.
link |
00:39:11.440
So rather than have some sort of Fourier basis
link |
00:39:14.160
for how frequencies appear in the natural world,
link |
00:39:18.920
we actually just use the statistics of the images
link |
00:39:23.880
and then quantize them based on the statistics
link |
00:39:27.040
much like you do in words, right?
link |
00:39:28.320
So common substrings are allocated a token
link |
00:39:32.440
and images is very similar.
link |
00:39:34.440
But there's no connection.
link |
00:39:37.280
The token space, if you think of,
link |
00:39:39.240
oh, like the tokens are an integer in the end of the day.
link |
00:39:42.440
So now like we work on, maybe we have about,
link |
00:39:46.200
let's say, I don't know the exact numbers,
link |
00:39:48.000
but let's say 10,000 tokens for text, right?
link |
00:39:51.200
Certainly more than characters
link |
00:39:52.840
because we have groups of characters and so on.
link |
00:39:55.360
So from one to 10,000,
link |
00:39:57.000
those are representing all the language
link |
00:39:59.480
and the words we'll see.
link |
00:40:01.000
And then images occupy the next set of integers.
link |
00:40:04.160
So they're completely independent, right?
link |
00:40:05.800
So from 10,001 to 20,000,
link |
00:40:08.920
those are the tokens that represent
link |
00:40:10.640
these other modality images.
link |
00:40:12.760
And that is an interesting aspect
link |
00:40:16.920
that makes it orthogonal.
link |
00:40:18.640
So what connects these concepts is the data, right?
link |
00:40:21.600
Once you have a data set,
link |
00:40:23.760
for instance, that captions images
link |
00:40:26.160
that tells you, oh, this is someone
link |
00:40:27.960
playing a frisbee on a green field.
link |
00:40:30.480
Now, the model will need to predict the tokens
link |
00:40:34.560
from the text green field to then the pixels.
link |
00:40:37.800
And that will start making the connections
link |
00:40:39.760
between the tokens.
link |
00:40:40.600
So these connections happen as the algorithm learns.
link |
00:40:43.640
And then the last, if we think of these integers,
link |
00:40:45.840
the first few are words, the next few are images.
link |
00:40:48.720
In Gato, we also allocated the highest order
link |
00:40:53.720
of integers to actions, right?
link |
00:40:56.280
Which we discretize and actions are very diverse, right?
link |
00:41:00.000
In Atari, there's, I don't know, maybe 17 discrete actions;
link |
00:41:04.160
in robotics, actions might be torques
link |
00:41:07.000
and forces that we apply.
link |
00:41:08.280
So we just use kind of similar ideas
link |
00:41:11.240
to compress these actions into tokens.
link |
00:41:14.360
And then we just, that's how we map now all the space
link |
00:41:18.720
to these sequences of integers.
link |
00:41:20.840
But they occupy different space
link |
00:41:22.520
and what connects them is then the learning algorithm.
link |
00:41:24.880
That's where the magic happens.
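
A minimal sketch of that shared integer space: each modality owns a disjoint range of IDs, and one flat sequence interleaves them. The sizes are illustrative, not Gato's exact values.

```python
# Disjoint ID ranges per modality, one flat stream of integers overall.
TEXT_BASE = 0          # text tokens:   0 .. 9,999
IMAGE_BASE = 10_000    # image tokens:  10,000 .. 19,999
ACTION_BASE = 20_000   # action tokens: 20,000 and up

def text_token(i):   return TEXT_BASE + i
def image_token(i):  return IMAGE_BASE + i
def action_token(i): return ACTION_BASE + i

# an interleaved episode: a caption word, two image patches, then an action
sequence = [text_token(42), image_token(7), image_token(8), action_token(3)]
print(sequence)        # [42, 10007, 10008, 20003]
```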
link |
00:41:26.320
So the modalities are orthogonal to each other
link |
00:41:29.440
in token space.
link |
00:41:30.800
So in the input, everything you add, you add extra tokens.
link |
00:41:35.280
Right.
link |
00:41:36.120
And then you're shoving all of that into one place.
link |
00:41:40.480
Yes, the transformer.
link |
00:41:41.680
And that transformer, that transformer
link |
00:41:45.160
tries to look at this gigantic token space
link |
00:41:49.400
and tries to form some kind of representation,
link |
00:41:52.280
some kind of unique wisdom
link |
00:41:56.800
about all of these different modalities.
link |
00:41:59.280
How's that possible?
link |
00:42:02.240
If you were to sort of like put your psychoanalysis hat on
link |
00:42:06.560
and try to psychoanalyze this neural network,
link |
00:42:09.440
is it schizophrenic?
link |
00:42:11.800
Does it try to, given this very few weights,
link |
00:42:17.200
represent multiple disjoint things
link |
00:42:19.600
and somehow have them not interfere with each other?
link |
00:42:22.840
Or is this a model building on the joint strength,
link |
00:42:28.000
on whatever is common to all the different modalities?
link |
00:42:31.840
Like what, if you were to ask a questions,
link |
00:42:34.560
is it schizophrenic or is it of one mind?
link |
00:42:38.760
I mean, it is one mind.
link |
00:42:41.080
And it's actually the simplest algorithm,
link |
00:42:44.400
which is kind of, in a way, how it feels
link |
00:42:47.480
like the field hasn't changed
link |
00:42:49.840
since back propagation and gradient descent
link |
00:42:52.600
were proposed for learning neural networks.
link |
00:42:55.760
So there is obviously details on the architecture.
link |
00:42:58.720
This has evolved.
link |
00:42:59.640
The current iteration is still the transformer,
link |
00:43:03.080
which is a powerful sequence modeling architecture.
link |
00:43:07.440
But then the goal of this, you know,
link |
00:43:11.000
setting these weights to predict the data
link |
00:43:13.840
is essentially the same as basically I could describe.
link |
00:43:17.240
I mean, as we described a few years ago with AlphaStar,
link |
00:43:19.760
language modeling and so on, right?
link |
00:43:21.640
We take, let's say an Atari game,
link |
00:43:24.640
we map it to a string of numbers
link |
00:43:27.680
that will all be probably image space
link |
00:43:30.400
and action space interleaved.
link |
00:43:32.480
And all we're gonna do is say, okay,
link |
00:43:35.120
given the numbers, you know, 1001, 1004, 1005,
link |
00:43:40.440
the next number that comes is 2006,
link |
00:43:43.280
which is in the action space.
link |
00:43:45.440
And you're just optimizing these weights
link |
00:43:48.920
via very simple gradients, like, you know,
link |
00:43:52.320
mathematically, it's almost the most boring algorithm
link |
00:43:54.720
you could imagine.
link |
00:43:55.920
We set the weights so that given this particular instance,
link |
00:44:00.240
these weights are set to maximize the probability
link |
00:44:04.120
of having seen this particular sequence of integers
link |
00:44:07.320
for this particular game.
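Expressed as code, that "boring" objective is a plain cross-entropy next-token loss over the interleaved sequence; a minimal PyTorch-style sketch, where `model` is assumed to be any autoregressive network returning logits over the shared vocabulary:

```python
import torch
import torch.nn.functional as F

def training_step(model, sequence: torch.Tensor, optimizer) -> float:
    """One gradient step of next-token prediction on an interleaved sequence.

    `sequence` is a 1-D LongTensor mixing image and action tokens,
    e.g. [1001, 1004, 1005, 2006, ...]. `model` is a stand-in for any
    autoregressive sequence model (e.g. a transformer).
    """
    inputs, targets = sequence[:-1], sequence[1:]
    logits = model(inputs.unsqueeze(0))                 # (1, T, vocab_size)
    loss = F.cross_entropy(logits.squeeze(0), targets)  # -log P(next token)
    optimizer.zero_grad()
    loss.backward()                                     # plain gradient descent
    optimizer.step()
    return loss.item()
```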
link |
00:44:09.160
And then the algorithm does this
link |
00:44:11.680
for many, many, many iterations,
link |
00:44:14.840
looking at different modalities, different games, right?
link |
00:44:17.920
That's the mixture of the dataset we discussed.
link |
00:44:20.480
So in a way, it's a very simple algorithm
link |
00:44:24.040
and the weights, right, they're all shared, right?
link |
00:44:27.560
So in terms of, is it focusing on one modality or not?
link |
00:44:30.920
The intermediate weights that are converting
link |
00:44:33.240
from these input of integers to the target integer
link |
00:44:36.240
you're predicting next,
link |
00:44:37.720
those weights certainly are common.
link |
00:44:40.360
And then the way that tokenization happens,
link |
00:44:43.440
there is a special place in the neural network
link |
00:44:45.880
which is we map this integer, like number 1001,
link |
00:44:49.840
to a vector of real numbers.
link |
00:44:51.960
Like real numbers, we can optimize them
link |
00:44:54.800
with gradient descent, right?
link |
00:44:55.960
The functions we learn are actually
link |
00:44:58.320
surprisingly differentiable.
link |
00:44:59.760
That's why we compute gradients.
link |
00:45:01.760
So this step is the only one
link |
00:45:03.960
where this orthogonality you mentioned applies.
link |
00:45:06.600
So mapping a certain token for text or image or actions,
link |
00:45:11.600
each of these tokens gets its own little vector
link |
00:45:15.080
of real numbers that represents this.
link |
00:45:17.240
If you look at the field back many years ago,
link |
00:45:19.600
people were talking about word vectors or word embeddings.
link |
00:45:23.520
These are the same.
link |
00:45:24.360
We have word vectors or embeddings.
link |
00:45:26.040
We have image vectors or embeddings
link |
00:45:28.920
and action vectors or embeddings.
link |
00:45:30.920
And the beauty here is that as you train this model,
link |
00:45:33.960
if you visualize these little vectors,
link |
00:45:36.680
it might be that they start aligning
link |
00:45:38.520
even though they're independent parameters.
link |
00:45:41.120
They could be anything,
link |
00:45:42.880
but then it might be that you take the word gato or cat,
link |
00:45:47.480
which maybe is common enough that it actually has its own token.
link |
00:45:50.240
And then you take pixels that have a cat
link |
00:45:52.440
and you might start seeing that these vectors
link |
00:45:55.320
look like they align, right?
link |
00:45:57.440
So by learning from this vast amount of data,
link |
00:46:00.680
the model is realizing the potential connections
link |
00:46:03.960
between these modalities.
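One way to probe that alignment: every token id, whatever its modality, indexes a row of one learned embedding table, and you can measure the similarity between, say, the text token for "cat" and an image token that co-occurs with cats. A toy sketch with made-up token ids:

```python
import torch
import torch.nn.functional as F

vocab_size, dim = 34_048, 512
embedding = torch.nn.Embedding(vocab_size, dim)  # one table for all modalities

cat_word_id  = 1001    # hypothetical id of the text token "cat"
cat_image_id = 32_500  # hypothetical id of an image token common in cat photos

v_word  = embedding(torch.tensor(cat_word_id))
v_image = embedding(torch.tensor(cat_image_id))

# Near zero at initialization; after training on data that connects the
# modalities, related tokens may drift toward higher similarity.
print(F.cosine_similarity(v_word, v_image, dim=0).item())
```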
link |
00:46:05.680
Now I will say there would be another way,
link |
00:46:07.880
at least in part, to not have these different vectors
link |
00:46:13.200
for each different modality.
link |
00:46:15.560
For instance, when I tell you about actions
link |
00:46:18.400
in certain space, I'm defining actions by words, right?
link |
00:46:22.840
So you could imagine a world in which I'm not learning
link |
00:46:26.560
that the action "up" in Atari is its own number.
link |
00:46:31.120
The action "up" in Atari maybe is literally the word
link |
00:46:34.440
or the sentence "up in Atari", right?
link |
00:46:37.360
And that would mean we now leverage
link |
00:46:39.440
much more from the language.
link |
00:46:41.080
This is not what we did here,
link |
00:46:42.560
but certainly it might make these connections
link |
00:46:45.680
much easier to learn and also to teach the model
link |
00:46:49.120
to correct its own actions and so on, right?
link |
00:46:51.320
So all this to say that Gato is indeed the beginning,
link |
00:46:55.880
that it is a radical idea to do things this way,
link |
00:46:59.480
but there's probably a lot more to be done
link |
00:47:02.400
and the results will become more impressive,
link |
00:47:04.520
not only through scale, but also through some new research
link |
00:47:08.000
that will come hopefully in the years to come.
link |
00:47:10.520
So just to elaborate quickly,
link |
00:47:12.360
you mean one possible next step
link |
00:47:16.720
or one of the paths that you might take next
link |
00:47:20.240
is doing the tokenization fundamentally
link |
00:47:25.240
as a kind of linguistic communication.
link |
00:47:28.320
So like you convert even images into language.
link |
00:47:31.400
So doing something like a crude semantic segmentation,
link |
00:47:35.600
trying to just assign a bunch of words to an image
link |
00:47:38.440
that like have almost like a dumb entity
link |
00:47:42.360
explaining as much as it can about the image.
link |
00:47:45.400
And so you convert that into words
link |
00:47:47.000
and then you convert games into words
link |
00:47:49.320
and then you provide the context in words and all of it.
link |
00:47:53.840
Eventually getting to a point
link |
00:47:56.360
where everybody agrees with Noam Chomsky
link |
00:47:58.120
that language is actually at the core of everything
link |
00:48:00.960
that it's the base layer of intelligence
link |
00:48:04.280
and consciousness and all that kind of stuff.
link |
00:48:05.880
Okay.
link |
00:48:07.520
You mentioned early on like it's hard to grow.
link |
00:48:11.280
What did you mean by that?
link |
00:48:12.800
'Cause we're talking about how scale might change things.
link |
00:48:17.040
There might be, and we'll talk about this too,
link |
00:48:19.000
like there's an emergent,
link |
00:48:22.960
there's certain things about these neural networks
link |
00:48:25.040
that are emergent.
link |
00:48:25.880
So certain like performance we can see only with scale
link |
00:48:29.040
and there's some kind of threshold of scale.
link |
00:48:31.000
So why is it hard to grow something like this Meow network?
link |
00:48:36.680
So the Meow network is not,
link |
00:48:39.840
it's not hard to grow if you retrain it.
link |
00:48:42.600
What's hard is, well, we have now one billion parameters.
link |
00:48:46.840
We train them for a while.
link |
00:48:48.160
We spend some amount of work towards building these weights
link |
00:48:53.160
that are an amazing initial brain
link |
00:48:55.880
for doing this kind of task we care about.
link |
00:48:58.840
Could we reuse the weights and expand to a larger brain?
link |
00:49:03.920
And that is extraordinarily hard,
link |
00:49:06.720
but also exciting from a research perspective
link |
00:49:10.120
and a practical point of view, right?
link |
00:49:12.560
So there's this notion of modularity in software engineering
link |
00:49:17.680
and we're starting to see some examples
link |
00:49:20.520
and work that leverages modularity.
link |
00:49:23.320
In fact, if we go back one step from GATO
link |
00:49:26.360
to a work that, I would say, trained a much larger,
link |
00:49:29.720
much more capable network called Flamingo.
link |
00:49:32.560
Flamingo did not deal with actions,
link |
00:49:34.320
but it definitely dealt with images
link |
00:49:36.080
in an interesting way,
link |
00:49:38.440
kind of akin to what GATO did,
link |
00:49:40.280
but slightly different technique for tokenizing.
link |
00:49:43.000
But we don't need to go into that detail.
link |
00:49:45.440
But what Flamingo also did, which GATO didn't do,
link |
00:49:49.400
and that just happens because these projects,
link |
00:49:51.880
they're different,
link |
00:49:53.800
it's a bit of like the exploratory nature of research,
link |
00:49:56.480
which is great.
link |
00:49:57.320
The research behind these projects is also modular.
link |
00:50:00.640
Yes, exactly.
link |
00:50:01.880
And it has to be, right?
link |
00:50:02.800
We need to have creativity
link |
00:50:05.600
and sometimes you need to protect pockets of people,
link |
00:50:09.240
researchers and so on.
link |
00:50:10.360
By "we" you mean humans.
link |
00:50:11.880
Yes.
link |
00:50:12.840
And also in particular researchers
link |
00:50:14.600
and maybe even, further, DeepMind or other such labs.
link |
00:50:18.840
And then the neural networks themselves.
link |
00:50:21.040
So it's modularity all the way down.
link |
00:50:23.440
All the way down.
link |
00:50:24.280
So the way that we did modularity,
link |
00:50:26.320
very beautifully in Flamingo is we took Chinchilla,
link |
00:50:30.160
which is a language only model,
link |
00:50:32.880
not an agent if we think of actions
link |
00:50:34.760
being necessary for agency.
link |
00:50:36.760
So we took Chinchilla,
link |
00:50:38.640
we took the weights of Chinchilla,
link |
00:50:41.040
and then we froze them.
link |
00:50:42.840
We said, these don't change.
link |
00:50:44.880
We train them to be very good at predicting the next word.
link |
00:50:47.600
It's a very good language model,
link |
00:50:49.480
state of the art at the time you release it, et cetera, et cetera.
link |
00:50:53.000
We're gonna add a capability to see, right?
link |
00:50:55.560
We are gonna add the ability to see to this language model.
link |
00:50:58.360
So we're gonna attach small pieces of neural networks
link |
00:51:02.000
at the right places in the model.
link |
00:51:03.920
It's almost like injecting the network
link |
00:51:07.920
with some weights and some substructures
link |
00:51:10.800
in a good way, right?
link |
00:51:12.880
So you need the research to say, what is effective?
link |
00:51:15.320
How do you add this capability
link |
00:51:16.760
without destroying others, et cetera?
link |
00:51:18.880
So we created a small sub network,
link |
00:51:23.520
initialized not from random,
link |
00:51:25.400
but actually from self supervised learning,
link |
00:51:28.840
a model that understands vision in general.
link |
00:51:32.880
And then we took data sets that connect the two modalities,
link |
00:51:37.320
vision and language.
link |
00:51:38.840
And then we froze the main part,
link |
00:51:41.280
the largest portion of the network,
link |
00:51:42.840
which was Chinchilla, that is 70 billion parameters.
link |
00:51:46.040
And then we added a few more parameters on top,
link |
00:51:49.320
trained from scratch, and then some others
link |
00:51:51.520
that were pre trained with the capacity to see.
link |
00:51:55.360
Like it was not tokenization in the way I described for Gato,
link |
00:51:58.880
but it's a similar idea.
link |
00:52:01.520
And then we trained the whole system,
link |
00:52:03.720
parts of it were frozen, parts of it were new.
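That freezing-plus-gluing recipe is easy to express in code; here is a minimal PyTorch-style sketch, with hypothetical module names standing in for Chinchilla and the pretrained vision encoder (not DeepMind's actual implementation):

```python
import torch.nn as nn

class FrozenLMWithVision(nn.Module):
    """Frozen language model + frozen vision encoder + new trainable glue.

    All names here are placeholders for illustration, not real Flamingo code.
    """
    def __init__(self, language_model: nn.Module, vision_encoder: nn.Module,
                 dim: int):
        super().__init__()
        self.language_model = language_model
        self.vision_encoder = vision_encoder
        for p in self.language_model.parameters():
            p.requires_grad = False   # the 70B Chinchilla weights stay frozen
        for p in self.vision_encoder.parameters():
            p.requires_grad = False   # pretrained self-supervised vision, frozen
        # The small new piece, trained from scratch on vision+language data:
        self.cross_attention = nn.MultiheadAttention(dim, num_heads=8)

    def forward(self, image, text_tokens):
        visual = self.vision_encoder(image)        # assume (patches, batch, dim)
        hidden = self.language_model(text_tokens)  # assume (seq, batch, dim)
        mixed, _ = self.cross_attention(hidden, visual, visual)
        return mixed  # in the real system this feeds the frozen language head
```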
link |
00:52:06.680
And all of a sudden we developed Flamingo,
link |
00:52:09.800
which is an amazing model that is essentially,
link |
00:52:12.720
I mean, to describe it: it is a chatbot
link |
00:52:15.120
where you can also upload images
link |
00:52:17.080
and start conversing about images,
link |
00:52:20.040
but it's also kind of a dialogue style chatbot.
link |
00:52:23.840
So the input is images and text and the output is text.
link |
00:52:26.760
Exactly.
link |
00:52:28.040
And how many parameters you said 70 billion for Chinchilla?
link |
00:52:31.920
Yeah, Chinchilla is 70 billion.
link |
00:52:33.360
And then the ones we add on top,
link |
00:52:34.760
which is almost like a way to overwrite
link |
00:52:39.320
its little activations so that when it sees vision,
link |
00:52:42.560
it does kind of a correct computation of what it's seeing,
link |
00:52:45.440
mapping it back to words, so to speak,
link |
00:52:48.080
that adds an extra 10 billion parameters, right?
link |
00:52:50.960
So it's 80 billion in total, the largest one we released.
link |
00:52:54.080
And then you train it on a few data sets
link |
00:52:57.480
that contain vision and language.
link |
00:52:59.440
And once you interact with the model,
link |
00:53:01.280
you start seeing that you can upload an image
link |
00:53:04.320
and start sort of having a dialogue about the image,
link |
00:53:08.120
which is actually very similar
link |
00:53:10.840
and akin to what we saw in language only.
link |
00:53:12.680
These prompting abilities that it has,
link |
00:53:15.400
you can teach it a new vision task, right?
link |
00:53:17.880
It does things beyond the capabilities
link |
00:53:20.600
that in theory, the data sets provided in themselves,
link |
00:53:24.640
but because it leverages a lot of the language knowledge
link |
00:53:27.240
acquired from Chinchilla,
link |
00:53:29.040
it actually has this few shot learning ability
link |
00:53:31.920
and these emerging abilities that we didn't even measure
link |
00:53:34.800
while we were developing the model,
link |
00:53:36.560
but once developed, then as you play with the interface,
link |
00:53:40.200
you can start seeing, wow, okay, yeah, it's cool.
link |
00:53:42.480
We can upload, I think one of the tweets
link |
00:53:45.160
talking about it on Twitter was this image of Obama
link |
00:53:48.000
who is placing weight on a scale
link |
00:53:50.000
and someone is kind of weighing themselves
link |
00:53:52.560
and it's kind of a joke style image.
link |
00:53:55.040
And it's notable because I think
link |
00:53:57.160
Andrej Karpathy a few years ago said,
link |
00:53:59.520
no computer vision system can understand the subtlety
link |
00:54:03.040
of this joke in this image, all the things that go on.
link |
00:54:06.480
And so what we tried to do, and it's very anecdotal,
link |
00:54:09.760
I mean, this is not a proof that we solved this issue,
link |
00:54:12.320
but it just shows that you can upload now this image
link |
00:54:15.920
and start conversing with the model, trying to make out
link |
00:54:18.600
if it gets that there's a joke
link |
00:54:21.560
because the person weighing themselves
link |
00:54:23.600
doesn't see that someone behind
link |
00:54:25.200
is making the weight higher and so on and so forth.
link |
00:54:28.040
So it's a fascinating capability.
link |
00:54:30.920
And it comes from this key idea of modularity
link |
00:54:33.440
where we took a frozen brain
link |
00:54:35.000
and we just added a new capability.
link |
00:54:37.960
So the question is, should we,
link |
00:54:40.800
so in a way you can see even from DeepMind,
link |
00:54:42.920
we have Flamingo, which took this modular approach
link |
00:54:46.480
and thus could leverage the scale a bit more reasonably
link |
00:54:49.240
because we didn't need to retrain a system from scratch.
link |
00:54:52.400
And on the other hand, we had Gato
link |
00:54:54.280
which used the same data sets,
link |
00:54:56.000
but was trained from scratch, right?
link |
00:54:57.600
And so I guess big question for the community is,
link |
00:55:01.720
should we train from scratch
link |
00:55:02.880
or should we embrace modularity?
link |
00:55:04.800
And this, like, this goes back to modularity
link |
00:55:08.760
as a way to grow, but reuse seems natural
link |
00:55:12.200
and it was very effective, certainly.
link |
00:55:15.040
The next question is, if you go the way of modularity,
link |
00:55:19.120
is there a systematic way of freezing weights
link |
00:55:22.840
and joining different modalities
link |
00:55:25.520
across not just two or three or four networks,
link |
00:55:29.360
but hundreds of networks
link |
00:55:30.680
from all different kinds of places,
link |
00:55:32.440
maybe open source network
link |
00:55:34.320
that looks at weather patterns
link |
00:55:36.440
and you shove that in somehow
link |
00:55:38.040
and then you have networks that, I don't know,
link |
00:55:40.520
do all kinds of things, that play StarCraft
link |
00:55:42.160
and play all the other video games
link |
00:55:44.120
and you can keep adding them in without significant effort,
link |
00:55:49.640
like maybe the effort scales linearly or something like that
link |
00:55:53.320
as opposed to like the more network you add,
link |
00:55:55.000
the more you have to worry about the instabilities created.
link |
00:55:58.000
Yeah, so that vision is beautiful.
link |
00:56:00.000
I think there's still the question
link |
00:56:03.560
about within single modalities,
link |
00:56:05.440
like Chinchilla was reused,
link |
00:56:06.880
but now if we train the next iteration of language models,
link |
00:56:10.240
are we gonna use Chinchilla or not?
link |
00:56:11.880
Yeah, how do you swap out Chinchilla?
link |
00:56:13.160
Right, so there's still big questions,
link |
00:56:15.960
but that idea is actually really akin to software engineering,
link |
00:56:19.440
where we're not reimplementing
link |
00:56:21.360
libraries from scratch, we're reusing
link |
00:56:23.400
and then building ever more amazing things,
link |
00:56:25.440
including neural networks with software that we're reusing.
link |
00:56:29.040
So I think this idea of modularity, I like it.
link |
00:56:32.280
I think it's here to stay.
link |
00:56:34.000
And that's also why I mentioned,
link |
00:56:36.000
it's just the beginning, not the end.
link |
00:56:38.320
You mentioned meta learning.
link |
00:56:39.520
So given this promise of Gato,
link |
00:56:42.920
can we try to redefine this term?
link |
00:56:46.120
That's almost akin to consciousness
link |
00:56:47.720
because it means different things to different people
link |
00:56:50.280
throughout the history of artificial intelligence.
link |
00:56:52.560
But what do you think meta learning is and looks like
link |
00:56:58.240
now in the five years, 10 years,
link |
00:57:00.200
will it look like a system like Gato,
link |
00:57:01.800
but scaled?
link |
00:57:03.280
What's your sense of what meta learning will look like,
link |
00:57:07.120
do you think, with all the wisdom we've learned so far?
link |
00:57:10.600
Yeah, great question.
link |
00:57:11.680
Maybe it's good to give another data point
link |
00:57:14.640
looking backwards rather than forward.
link |
00:57:16.280
So when we talked in 2019,
link |
00:57:23.040
meta learning meant something that has changed
link |
00:57:26.600
mostly through the revolution of GPT3 and beyond.
link |
00:57:31.280
So what meta learning meant at the time
link |
00:57:35.160
was driven by what benchmarks people care about
link |
00:57:37.800
in meta learning.
link |
00:57:38.960
And the benchmarks were about
link |
00:57:41.920
a capability to learn about object identities.
link |
00:57:45.120
So it was very much overfitted to vision
link |
00:57:48.600
and object classification.
link |
00:57:50.520
And the part that was meta about it was that,
link |
00:57:53.040
oh, we're not just learning 1,000 categories
link |
00:57:55.440
that ImageNet tells us to learn.
link |
00:57:57.160
We're gonna learn object categories
link |
00:57:59.320
that can be defined when we interact with the model.
link |
00:58:03.400
So it's interesting to see the evolution.
link |
00:58:06.760
The way this started was we had a special language
link |
00:58:10.840
that was a dataset, a small dataset
link |
00:58:13.320
that we prompted the model with saying,
link |
00:58:16.040
hey, here is a new classification task.
link |
00:58:19.080
I'll give you one image and the name,
link |
00:58:21.840
which was an integer at the time, of the image,
link |
00:58:24.440
and a different image and so on.
link |
00:58:26.080
So you have a small prompt in the form of a dataset,
link |
00:58:30.160
a machine learning dataset.
link |
00:58:31.760
And then you got a system that could
link |
00:58:34.720
then predict or classify these objects
link |
00:58:37.080
that you just defined kind of on the fly.
link |
00:58:40.440
So fast forward, it was revealed
link |
00:58:44.920
that language models are few-shot learners.
link |
00:58:47.560
That's the title of the paper.
link |
00:58:49.240
So very good title.
link |
00:58:50.200
Sometimes titles are really good.
link |
00:58:51.600
So this one is really, really good
link |
00:58:53.640
because that's the point of GPT3 that showed that, look.
link |
00:58:58.680
Sure, we can focus on object classification
link |
00:59:01.080
and on what meta learning means
link |
00:59:02.680
within the space of learning object categories.
link |
00:59:05.520
This goes beyond, or before rather,
link |
00:59:07.200
to also Omniglot, before ImageNet, and so on.
link |
00:59:10.120
So there's a few benchmarks.
link |
00:59:11.600
To now all of a sudden,
link |
00:59:13.120
we're a bit unlocked from benchmarks
link |
00:59:15.320
and through language we can define tasks, right?
link |
00:59:18.000
So we're literally telling the model some logical task
link |
00:59:21.680
or little thing that we wanted to do.
link |
00:59:23.960
We prompt it much like we did before,
link |
00:59:26.040
but now we prompt it through natural language.
link |
00:59:28.600
And then not perfectly,
link |
00:59:30.560
I mean, these models have failure modes and that's fine,
link |
00:59:33.280
but these models then are now doing a new task, right?
link |
00:59:37.280
So they meta learn this new capability.
link |
00:59:40.600
Now, that's where we are now.
link |
00:59:43.520
Flamingo expanded this to visual and language,
link |
00:59:47.360
but it basically has the same abilities.
link |
00:59:49.440
You can teach it, for instance, an emergent property
link |
00:59:52.760
was that you can take pictures of numbers
link |
00:59:55.400
and then do arithmetic with the numbers just by teaching it.
link |
00:59:59.080
Oh, that's, I mean, when I show you three plus six,
link |
01:00:02.040
you know, I want you to output nine
link |
01:00:03.840
and you show it a few examples and now it does that.
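As a flavor of what such teaching looks like in the text-only case, here is a hypothetical few-shot prompt; Flamingo's version interleaves actual images of the digits, but the pattern is the same:

```python
# A hypothetical few-shot prompt; the model is expected to continue the
# pattern it was just "taught", without any weight updates.
prompt = """3 + 6 = 9
2 + 5 = 7
4 + 4 = 8
1 + 7 ="""
# completion = model.generate(prompt)  # expected continuation: " 8"
```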
link |
01:00:06.800
So it went way beyond the,
link |
01:00:09.160
oh, this ImageNet sort of categorization of images
link |
01:00:12.800
where we were maybe a bit stuck before this revelation moment
link |
01:00:17.640
that happened in, I believe it was 2019,
link |
01:00:20.760
or maybe after, we should check.
link |
01:00:21.960
And that way it has solved meta learning
link |
01:00:24.400
as was previously defined.
link |
01:00:26.160
Yes, it expanded what it meant.
link |
01:00:27.840
So that's when you ask, what does it mean?
link |
01:00:29.600
So it's an evolving term.
link |
01:00:31.440
But here is maybe now looking forward,
link |
01:00:35.280
looking at what's happening, you know,
link |
01:00:37.720
obviously in the community with more modalities,
link |
01:00:41.480
what we can expect.
link |
01:00:42.600
And I would certainly hope to see the following.
link |
01:00:45.040
And this is a pretty drastic hope,
link |
01:00:48.480
but in five years, maybe we chat again.
link |
01:00:51.280
And we have a system, right?
link |
01:00:54.520
A set of weights that we can teach to play StarCraft.
link |
01:00:59.880
Maybe not at the level of AlphaStar,
link |
01:01:01.520
but play StarCraft a complex game.
link |
01:01:03.720
We teach it through interactions, through prompting.
link |
01:01:07.000
You can certainly prompt a system.
link |
01:01:08.600
That's what Gato shows to play some simple Atari games.
link |
01:01:11.840
So imagine if you start talking to a system,
link |
01:01:15.440
teaching it a new game, showing it examples of,
link |
01:01:18.360
you know, in this particular game,
link |
01:01:20.960
this user did something good.
link |
01:01:22.760
Maybe the system can even play and ask you questions.
link |
01:01:25.440
Say, hey, I played this game.
link |
01:01:27.000
I just played this game.
link |
01:01:27.920
Did I do well?
link |
01:01:29.120
Can you teach me more?
link |
01:01:30.480
So five, maybe to 10 years,
link |
01:01:33.080
these capabilities or what meta learning means
link |
01:01:36.200
will be much more interactive, much more rich.
link |
01:01:38.880
And through domains that we used to specialize for, right?
link |
01:01:41.640
So you see the difference, right?
link |
01:01:42.920
We built AlphaStar specialized to play StarCraft.
link |
01:01:47.040
The algorithms were general, but the weights were specialized.
link |
01:01:50.480
And what we're hoping is that we can teach a network
link |
01:01:54.200
to play games, to play any game, just using games
link |
01:01:57.400
as an example, through interacting with it,
link |
01:02:00.560
teaching it, uploading the Wikipedia page of StarCraft.
link |
01:02:03.760
Like, this is on the horizon,
link |
01:02:06.120
and obviously the details need to be filled in
link |
01:02:09.400
and research needs to be done.
link |
01:02:10.960
But that's how I see meta learning evolving,
link |
01:02:13.240
which is gonna be beyond prompting.
link |
01:02:15.400
It's gonna be a bit more interactive.
link |
01:02:17.120
It's gonna, you know, the system might tell us
link |
01:02:19.880
to give it feedback after it maybe makes mistakes
link |
01:02:22.360
or it loses a game.
link |
01:02:24.160
But it's nonetheless very exciting
link |
01:02:26.320
because if you think about it this way,
link |
01:02:29.040
the benchmarks are already there.
link |
01:02:30.640
We just repurposed the benchmarks, right?
link |
01:02:33.200
So in a way, I like to map the space of
link |
01:02:38.040
what maybe AGI means to say, okay, like,
link |
01:02:41.520
we went to 101% performance in Go, in Chess, in StarCraft.
link |
01:02:47.920
The next iteration might be 20% performance
link |
01:02:51.960
across quote unquote all tasks, right?
link |
01:02:54.760
And even if it's not as good, it's fine.
link |
01:02:56.360
We actually, we have ways to also measure progress
link |
01:03:00.000
because we have those special agents,
link |
01:03:01.680
specialized agents and so on.
link |
01:03:04.240
So this is to me very exciting.
link |
01:03:06.240
And these next iteration models are definitely hinting
link |
01:03:10.520
at that direction of progress, which hopefully we can have.
link |
01:03:14.760
There are obviously some things that could go wrong
link |
01:03:17.640
in terms of we might not have the tools,
link |
01:03:20.160
maybe transformers are not enough,
link |
01:03:21.640
then there must be some breakthroughs to come,
link |
01:03:24.360
which makes the field more exciting
link |
01:03:26.320
to people like me as well, of course.
link |
01:03:28.680
But that's, if you ask me five to 10 years,
link |
01:03:32.120
you might see these models that start to look more like
link |
01:03:35.280
weights that are already trained.
link |
01:03:36.920
And then it's more about teaching, or making
link |
01:03:40.560
them meta-learn, what you're trying to induce
link |
01:03:45.560
in terms of tasks and so on.
link |
01:03:47.000
Well beyond the simple tasks
link |
01:03:49.760
we're starting to see emerge like, you know,
link |
01:03:51.680
smaller arithmetic tasks and so on.
link |
01:03:54.200
So a few questions around that, this is fascinating.
link |
01:03:57.200
So that kind of teaching interactive,
link |
01:04:01.440
so it's beyond prompting,
link |
01:04:02.760
so it's interacting with the neural network,
link |
01:04:05.240
that's different than the training process.
link |
01:04:08.440
So it's different than the optimization
link |
01:04:12.440
over differentiable functions.
link |
01:04:15.920
This is already trained and now you're teaching,
link |
01:04:19.840
I mean, it's almost like akin to the brain,
link |
01:04:23.960
the neurons already set with their connections.
link |
01:04:26.960
On top of that, you're now using that infrastructure
link |
01:04:30.000
to build up further knowledge.
link |
01:04:32.640
Okay, so that's a really interesting distinction
link |
01:04:36.680
that's actually not obvious
link |
01:04:38.080
from a software engineering perspective,
link |
01:04:40.320
that there's a line to be drawn.
link |
01:04:42.800
Because you always think for a neural network to learn,
link |
01:04:44.880
it has to be retrained, trained and retrained.
link |
01:04:48.360
But maybe, and prompting is a way of teaching
link |
01:04:54.080
a neural network, a little bit of context
link |
01:04:55.960
about whatever the heck you're trying to get it to do.
link |
01:04:58.040
So you can maybe expand this prompting capability
link |
01:05:00.480
by making it interact, that's really, really interesting.
link |
01:05:04.760
By the way, this is not,
link |
01:05:06.400
if you look at way back at different ways
link |
01:05:09.240
to tackle even classification tasks,
link |
01:05:11.880
so this comes from long standing literature
link |
01:05:16.480
in machine learning, what I'm suggesting could sound
link |
01:05:20.360
to some a bit like nearest neighbor.
link |
01:05:23.480
So nearest neighbor is almost the simplest algorithm
link |
01:05:27.160
that does not require learning.
link |
01:05:30.120
So it has this interesting like,
link |
01:05:32.360
you don't need to compute gradients.
link |
01:05:34.400
And what nearest neighbor does is,
link |
01:05:36.200
you quote unquote have a data set or upload a data set.
link |
01:05:40.040
And then all you need to do is a way to measure distance
link |
01:05:43.120
between points.
link |
01:05:44.840
And then to classify a new point,
link |
01:05:46.720
you're just simply computing,
link |
01:05:48.160
what's the closest point in this massive amount of data?
link |
01:05:51.360
And that's my answer.
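For reference, the whole nearest-neighbor algorithm fits in a few lines; a minimal sketch using plain Euclidean distance:

```python
import numpy as np

def nearest_neighbor_classify(query, points, labels):
    """1-nearest-neighbor: no gradients, no training, just a distance metric.

    `points` is the "uploaded" dataset, shape (n, d); `labels` its n labels.
    """
    distances = np.linalg.norm(points - query, axis=1)  # distance to every point
    return labels[int(np.argmin(distances))]            # label of the closest one

# Toy example: two clusters; the query lands nearer the second.
data = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
tags = np.array(["cat", "cat", "dog", "dog"])
print(nearest_neighbor_classify(np.array([4.8, 5.2]), data, tags))  # -> dog
```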
link |
01:05:52.760
So you can think of prompting in a way
link |
01:05:55.560
as you're uploading not just simple points
link |
01:05:58.680
and the metric is not the distance between the images
link |
01:06:02.480
or something simple,
link |
01:06:03.320
it's something that you compute that's much more advanced.
link |
01:06:06.080
But in a way, it's very similar, right?
link |
01:06:08.440
You simply are uploading some knowledge
link |
01:06:12.680
to this pretrained system. In nearest neighbor,
link |
01:06:15.160
Maybe the metric is learned or not,
link |
01:06:17.320
but you don't need to further train it.
link |
01:06:19.520
And then now you immediately get a classifier out of this.
link |
01:06:23.800
Now it's just an evolution of that concept,
link |
01:06:25.880
very classical concept in machine learning,
link |
01:06:27.880
which is, yeah, just learning through
link |
01:06:30.960
what's the closest point, closest by some distance
link |
01:06:33.720
and that's it, it's an evolution of that.
link |
01:06:36.160
And I will say how I saw meta learning
link |
01:06:39.080
when we worked on a few ideas in 2016,
link |
01:06:43.960
was precisely through the lens of nearest neighbor,
link |
01:06:47.280
which is very common in computer vision community, right?
link |
01:06:50.080
There's a very active area of research
link |
01:06:52.200
about how do you compute the distance between two images?
link |
01:06:55.520
But if you have a good distance metric,
link |
01:06:57.640
you also have a good classifier, right?
link |
01:07:00.000
All I'm saying is now these distances
link |
01:07:01.800
and the points are not just images,
link |
01:07:03.840
they're like words or sequences of words
link |
01:07:07.760
and images and actions that teach you something new,
link |
01:07:10.400
but it might be that technique wise, those come back.
link |
01:07:14.800
And I will say that it's not necessarily true
link |
01:07:18.240
that you might not ever train the weights a bit further.
link |
01:07:21.840
Some aspect of meta learning,
link |
01:07:23.920
some techniques in meta learning
link |
01:07:26.080
do actually do a bit of fine tuning as it's called, right?
link |
01:07:28.960
They train the weights a little bit
link |
01:07:31.160
when they get a new task.
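In code, "training the weights a little bit" can be as small as unfreezing one layer for a few gradient steps on the new task; a hedged sketch, where `model.head` is a hypothetical name for the model's final layer:

```python
import torch
import torch.nn.functional as F

def finetune_lightly(model, task_batches, steps: int = 20, lr: float = 1e-5):
    """A few gradient steps on a new task, touching only the final layer.

    `model.head` is a hypothetical attribute; everything else stays frozen,
    in the spirit of light adaptation rather than full retraining.
    """
    for p in model.parameters():
        p.requires_grad = False
    for p in model.head.parameters():
        p.requires_grad = True
    optimizer = torch.optim.Adam(model.head.parameters(), lr=lr)
    for _, (inputs, targets) in zip(range(steps), task_batches):
        loss = F.cross_entropy(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```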
link |
01:07:32.880
So as for the how, how we're gonna achieve this,
link |
01:07:38.000
as a deep learner and a skeptic,
link |
01:07:39.880
we're gonna try a few things,
link |
01:07:41.280
whether it's a bit of training,
link |
01:07:42.680
adding a few parameters,
link |
01:07:44.240
thinking of these as nearest neighbor
link |
01:07:46.000
or just simply thinking of there's a sequence of words,
link |
01:07:49.240
it's a prefix and that's the new classifier we'll see, right?
link |
01:07:53.680
There's the beauty of research,
link |
01:07:55.440
but what's important is that this is a good goal in itself
link |
01:08:00.160
that I see as very worthwhile pursuing
link |
01:08:02.760
for the next stages of not only meta learning.
link |
01:08:05.720
I think this is basically what's exciting
link |
01:08:08.480
about machine learning period to me.
link |
01:08:11.440
Well, the interactive aspect of that
link |
01:08:13.760
is also very interesting.
link |
01:08:15.160
The interactive version of nearest neighbor
link |
01:08:18.760
to help you pull out the classifier from this giant thing.
link |
01:08:23.760
Okay, is this the way we can go
link |
01:08:27.280
in five, 10 plus years from any task,
link |
01:08:32.840
sorry, from many tasks to any task?
link |
01:08:36.240
So, and what does that mean?
link |
01:08:39.480
What does it need to be actually trained on?
link |
01:08:42.800
At which point has the network had enough?
link |
01:08:47.680
What does a network need to learn about this world
link |
01:08:50.440
in order to be able to perform any task?
link |
01:08:52.480
Is it just as simple as language, image, and action?
link |
01:08:57.880
Or do you need some set of representative images?
link |
01:09:02.680
Like if you only see land images,
link |
01:09:05.160
will you know anything about underwater?
link |
01:09:06.720
Is that somehow fundamentally different?
link |
01:09:08.760
I don't know.
link |
01:09:09.600
Those are open questions, I would say.
link |
01:09:12.080
I mean, the way you put,
link |
01:09:13.080
let me maybe further your example, right?
link |
01:09:15.240
If all you see is land images,
link |
01:09:18.400
but you're reading all about land and water worlds,
link |
01:09:21.520
but in books, imagine, would that be enough?
link |
01:09:25.400
Good question, we don't know,
link |
01:09:27.160
but I guess maybe you can join us
link |
01:09:30.400
if you want in our quest to find this.
link |
01:09:32.120
That's precisely.
link |
01:09:33.440
Water world, yeah.
link |
01:09:34.360
Yes, that's precisely the beauty of research
link |
01:09:37.640
and that's the research business
link |
01:09:42.280
where the task, I guess, is to figure this out
link |
01:09:44.400
and ask the right questions
link |
01:09:46.240
and then iterate with the whole community,
link |
01:09:49.520
publishing like findings and so on.
link |
01:09:52.440
But yeah, this is a question.
link |
01:09:55.120
It's not the only question,
link |
01:09:56.080
but certainly, as you ask, it's on my mind constantly, right?
link |
01:10:00.040
And so we'll need to wait for maybe the,
link |
01:10:03.280
let's say five years, let's hope it's not 10
link |
01:10:05.960
to see what the answers are.
link |
01:10:09.400
Some people largely believe in
link |
01:10:11.840
unsupervised or self-supervised learning
link |
01:10:13.840
of single modalities and then crossing them.
link |
01:10:17.040
Some people might think end to end learning
link |
01:10:20.200
is the answer, modularity is maybe the answer.
link |
01:10:23.800
So we don't know,
link |
01:10:24.960
but we're just definitely excited to find out.
link |
01:10:27.520
But it feels like this is the right time
link |
01:10:29.280
and we're positioned at the beginning of this.
link |
01:10:31.720
We're finally ready to do these kinds of general,
link |
01:10:34.640
big models and agents.
link |
01:10:37.640
What sort of specific technical thing
link |
01:10:42.480
about Gato, Flamingo, Chinchilla, Gopher,
link |
01:10:47.400
any of these, is especially beautiful,
link |
01:10:49.560
That was surprising, maybe.
link |
01:10:51.640
Is there something that just jumps out at you?
link |
01:10:55.200
Of course, there's the general thing of like,
link |
01:10:57.600
you didn't think it was possible
link |
01:10:58.920
and then you realize it's possible
link |
01:11:01.720
in terms of the generalizability across modalities
link |
01:11:04.480
and all that kind of stuff.
link |
01:11:05.600
Or maybe how small of a network,
link |
01:11:08.040
relatively speaking, Gato is all that kind of stuff.
link |
01:11:10.480
But is there some weird little things that were surprising?
link |
01:11:15.200
Look, I'll give you an answer that's very important
link |
01:11:18.240
because maybe people don't quite realize this,
link |
01:11:22.600
but the teams behind these efforts, the actual humans,
link |
01:11:27.240
that's maybe the surprise, in an obviously positive way.
link |
01:11:31.720
So anytime you see these breakthroughs,
link |
01:11:34.560
I mean, it's easy to map it to a few people.
link |
01:11:37.160
There's people that are great at explaining things
link |
01:11:39.240
and so on, that's very nice.
link |
01:11:40.760
But maybe the learnings or the meta learnings
link |
01:11:44.720
that I get as a human about this is,
link |
01:11:47.440
sure, we can move forward,
link |
01:11:50.520
but the surprising bit is how important are all the pieces
link |
01:11:56.560
of these projects, how do they come together?
link |
01:12:00.080
So I'll give you maybe some of the ingredients
link |
01:12:03.760
of success that are common across these,
link |
01:12:06.440
but not the obvious ones on machine learning.
link |
01:12:08.480
I can always also give you those,
link |
01:12:11.360
but basically there is engineering is critical.
link |
01:12:17.360
So very good engineering
link |
01:12:19.600
because ultimately we're collecting data sets, right?
link |
01:12:23.800
So the engineering of data
link |
01:12:26.200
and then of deploying the models at scale
link |
01:12:29.800
into some compute cluster, cannot be overstated;
link |
01:12:32.880
that is a huge factor of success.
link |
01:12:36.000
And it's hard to believe that details matter so much.
link |
01:12:41.600
We would like to believe that it's true
link |
01:12:44.080
that there is more and more of a standard formula,
link |
01:12:47.480
as I was saying, like this recipe that works for everything.
link |
01:12:50.600
But then when you zoom into each of these projects,
link |
01:12:53.720
then you realize the devil is indeed in the details.
link |
01:12:57.880
And then the teams have to work kind of together
link |
01:13:01.560
towards these goals.
link |
01:13:03.080
So engineering of data and obviously clusters
link |
01:13:07.560
and large scale is very important.
link |
01:13:09.320
And then one that is often not,
link |
01:13:13.120
maybe nowadays it is more clear is benchmark progress, right?
link |
01:13:17.200
So we're talking here about multiple months
link |
01:13:19.880
of tens of researchers and people
link |
01:13:24.240
that are trying to organize the research and so on
link |
01:13:26.720
working together and you don't know that you can get there.
link |
01:13:31.720
I mean, this is the beauty.
link |
01:13:33.920
Like if you're not risking to trying to do something
link |
01:13:36.840
that feels impossible, you're not gonna get there,
link |
01:13:41.160
but you need the way to measure progress.
link |
01:13:43.440
So the benchmarks that you build are critical.
link |
01:13:47.280
I've seen this beautifully pay out in many projects.
link |
01:13:50.080
I mean, maybe the one I've seen it more consistently,
link |
01:13:53.480
which means we established the metric,
link |
01:13:56.360
actually the community did,
link |
01:13:57.840
and then we leveraged that massively, was AlphaFold.
link |
01:14:01.120
This is a project where the data, the metrics were all there
link |
01:14:06.160
and all it took was, and it's easier said than done,
link |
01:14:09.120
an amazing team working not to try
link |
01:14:12.880
to find some incremental improvement and publish,
link |
01:14:15.400
which is one way to do research that is valid,
link |
01:14:17.960
but aim very high and work literally for years
link |
01:14:22.520
to iterate over that process.
link |
01:14:24.120
And working for years with the team,
link |
01:14:25.680
I mean, it is tricky, and that also happened to happen
link |
01:14:29.800
partly during a pandemic and so on.
link |
01:14:32.200
So I think my meta learning from all this is
link |
01:14:35.280
the teams are critical to the success.
link |
01:14:37.960
And then, if we now go to the machine learning,
link |
01:14:40.200
the part that's surprising is,
link |
01:14:44.760
so we like architectures like neural networks,
link |
01:14:48.720
and I would say this was a very rapidly evolving field
link |
01:14:53.120
until the transformer came.
link |
01:14:54.960
So attention might indeed be all you need,
link |
01:14:58.160
which is the title, also a good title,
link |
01:15:00.280
although only in hindsight is it good.
link |
01:15:02.280
I don't think at the time I thought
link |
01:15:03.440
this is a great title for a paper,
link |
01:15:05.080
but that architecture is proving
link |
01:15:08.960
that the dream of modeling sequences of any bytes,
link |
01:15:13.520
there is something there that will stick.
link |
01:15:15.360
And I think these advances in architectures,
link |
01:15:18.280
in kind of how neural networks are architected
link |
01:15:21.040
to do what they do.
link |
01:15:23.120
It's been hard to find one that has been so stable
link |
01:15:26.080
and relatively has changed very little
link |
01:15:28.960
since it was invented five or so years ago.
link |
01:15:33.040
So that is a surprising,
link |
01:15:35.120
it's a surprise that keeps recurring into other projects.
link |
01:15:38.360
Try to, on a philosophical or technical level,
link |
01:15:42.440
introspect what is the magic of attention?
link |
01:15:45.520
What is attention?
link |
01:15:47.360
There's attention in people that study cognition,
link |
01:15:50.160
so human attention.
link |
01:15:52.120
I think there's giant wars over what attention means,
link |
01:15:55.800
how it works in the human mind.
link |
01:15:57.480
So there are very simple looks at what attention
link |
01:16:00.960
is in a neural network, from the days of Attention
link |
01:16:03.840
Is All You Need. But broadly,
link |
01:16:05.360
do you think there's a general principle
link |
01:16:06.880
that's really powerful here?
link |
01:16:08.840
Yeah, so a distinction between transformers and LSTMs,
link |
01:16:13.400
which were what came before,
link |
01:16:15.400
and there was a transitional period
link |
01:16:17.880
where you could use both.
link |
01:16:19.720
In fact, when we talked about AlphaStar,
link |
01:16:22.040
we used transformers and LSTMs,
link |
01:16:24.320
so it was still the beginning of transformers.
link |
01:16:26.440
They were very powerful,
link |
01:16:27.440
but LSTMs were still also very powerful sequence models.
link |
01:16:31.560
So the power of the transformer
link |
01:16:35.200
is that it has built in what we call an inductive bias
link |
01:16:39.800
of attention that makes the model,
link |
01:16:43.080
when you think of a sequence of integers, right?
link |
01:16:45.760
Like we discussed this before, right?
link |
01:16:47.480
This is a sequence of words.
link |
01:16:49.000
When you have to do very hard tasks over these words,
link |
01:16:54.800
this could be we're gonna translate a whole paragraph,
link |
01:16:57.920
or we're gonna predict the next paragraph
link |
01:16:59.840
given 10 paragraphs before.
link |
01:17:04.280
There's some loose intuition
link |
01:17:08.360
from how we do it as a human
link |
01:17:10.360
that is very nicely mimicked and replicated,
link |
01:17:14.800
structurally speaking, in the transformer,
link |
01:17:16.600
which is this idea of you're looking for something, right?
link |
01:17:21.200
So, sort of, when
link |
01:17:23.960
you just read a piece of text,
link |
01:17:25.760
now you're thinking what comes next.
link |
01:17:27.960
You might wanna relook at the text
link |
01:17:30.640
or look at it from scratch.
link |
01:17:31.800
I mean, really it's because there's no recurrence.
link |
01:17:35.120
You're just thinking what comes next,
link |
01:17:37.360
and it's almost hypothesis driven, right?
link |
01:17:40.080
So if I'm thinking the next word that I'll write
link |
01:17:43.440
is cat or dog, okay? The way the transformer works
link |
01:17:48.760
almost philosophically is it has these two hypotheses.
link |
01:17:52.920
Is it gonna be cat or is it gonna be dog?
link |
01:17:55.680
And then it says, okay, if it's cat,
link |
01:17:58.440
I'm gonna look for certain words, not necessarily cat,
link |
01:18:00.760
although cat is an obvious word you would look for in the past
link |
01:18:03.000
to see whether it makes more sense to output cat or dog.
link |
01:18:05.920
And then it does some very deep computation
link |
01:18:09.480
over the words and beyond, right?
link |
01:18:11.480
So it combines the words, but it has the query
link |
01:18:16.200
as we call it, that is cat.
link |
01:18:18.480
And then similarly for dog, right?
link |
01:18:20.680
And so it's a very computational way to think about,
link |
01:18:24.400
look, if I'm thinking deeply about text,
link |
01:18:27.040
I need to go back to look at all of the text,
link |
01:18:29.600
attend over it, but it's not just attention,
link |
01:18:31.920
like what is guiding the attention?
link |
01:18:33.960
And that was the key insight from an earlier paper,
link |
01:18:36.720
is not just how far away it is.
link |
01:18:39.160
I mean, how far away it is, is important.
link |
01:18:40.800
What did I just write about?
link |
01:18:42.720
That's critical, but what you wrote about 10 pages ago
link |
01:18:46.800
might also be critical.
link |
01:18:48.400
So you're looking not positionally, but content wise, right?
link |
01:18:53.200
And transformers have this beautiful way
link |
01:18:56.080
to query for certain content
link |
01:18:57.920
and pull it out in a compressed way.
link |
01:19:00.320
So then you can make a more informed decision.
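That query-driven lookup is just a soft, content-based retrieval; a small numpy sketch of scaled dot-product attention, the core operation being described:

```python
import numpy as np

def attend(query, keys, values):
    """Scaled dot-product attention: retrieve by content, not by position.

    `query` encodes the hypothesis ("cat or dog?"); `keys` describe each past
    token; the output is a similarity-weighted blend of the past `values`.
    """
    scores = keys @ query / np.sqrt(query.size)        # content match per token
    weights = np.exp(scores) / np.exp(scores).sum()    # softmax over positions
    return weights @ values                            # compressed pulled content

# Toy example: 4 past tokens with 8-dim keys/values; the query pulls most
# strongly from similar content, however far back in the sequence it sits.
rng = np.random.default_rng(0)
keys, values = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(attend(rng.normal(size=8), keys, values).shape)  # (8,)
```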
link |
01:19:03.000
I mean, that's one way to explain transformers,
link |
01:19:05.960
but I think it's a very powerful inductive bias.
link |
01:19:10.040
There might be some details that might change over time,
link |
01:19:12.520
but I think that is what makes transformers
link |
01:19:16.440
so much more powerful than the recurrent networks
link |
01:19:19.920
that were more recency bias based,
link |
01:19:22.440
which obviously works in some tasks,
link |
01:19:24.360
but it has major flaws.
link |
01:19:26.720
Transformer itself has flaws.
link |
01:19:29.320
And I think the main one, the main challenge, is
link |
01:19:32.200
these prompts that we just were talking about,
link |
01:19:35.760
they can be a thousand words long.
link |
01:19:38.080
But if I'm teaching you Starcraft,
link |
01:19:39.960
I mean, I'll have to show you videos.
link |
01:19:41.880
I'll have to point you to whole Wikipedia articles
link |
01:19:44.640
about the game.
link |
01:19:46.160
We'll have to interact probably
link |
01:19:47.560
as you play, you'll ask me questions.
link |
01:19:49.520
The context required for us to achieve me
link |
01:19:52.760
being a good teacher to you on the game
link |
01:19:54.840
as you would want to do it with a model.
link |
01:19:58.560
I think goes well beyond the current capabilities.
link |
01:20:01.640
So the question is, how do we benchmark this?
link |
01:20:03.960
And then how do we change the structure
link |
01:20:06.440
of the architecture?
link |
01:20:07.280
I think there's ideas on both sides,
link |
01:20:08.840
but we'll have to see empirically, right?
link |
01:20:11.280
Obviously what ends up working in the future.
link |
01:20:13.400
And as you talked about, some of the ideas could be,
link |
01:20:15.880
keeping the constraint of that length in place,
link |
01:20:19.480
but then forming like hierarchical representations
link |
01:20:23.080
to where you can start being much cleverer
link |
01:20:26.280
in how you use those thousand tokens.
link |
01:20:28.840
Indeed.
link |
01:20:31.240
Yeah, that's really interesting.
link |
01:20:32.280
But it also is possible that this attentional mechanism
link |
01:20:34.880
where you basically,
link |
01:20:36.200
you don't have a recency bias,
link |
01:20:37.560
but you look more generally, you make it learnable.
link |
01:20:42.000
The mechanism in which way you look back into the past,
link |
01:20:45.280
you make that learnable.
link |
01:20:46.800
It's also possible we're at the very beginning of that
link |
01:20:50.200
because you might become smarter and smarter
link |
01:20:54.400
in the way you query the past.
link |
01:20:58.320
So recent past and distant past
link |
01:21:00.600
and maybe very, very distant past.
link |
01:21:02.360
So almost like the attention mechanism
link |
01:21:05.000
will have to improve and evolve
link |
01:21:07.360
as will the tokenization mechanism,
link |
01:21:12.000
so you can represent long-term memory somehow.
link |
01:21:15.000
Yes.
link |
01:21:16.160
And I mean, hierarchies are very,
link |
01:21:18.240
I mean, it's a very nice word that sounds appealing.
link |
01:21:22.200
There's lots of work adding hierarchy to the memories.
link |
01:21:25.920
In practice, it does seem like we keep coming back
link |
01:21:29.480
to the main formula or main architecture.
link |
01:21:33.040
That sometimes tells us something.
link |
01:21:34.720
There's this sentence that a friend of mine told me,
link |
01:21:37.880
like, whether an idea wants to work or not.
link |
01:21:40.240
So transformer was clearly an idea that wanted to work.
link |
01:21:44.120
And then I think there's some principles we believe
link |
01:21:47.360
will be needed, but finding the exact details,
link |
01:21:50.320
details matter so much, right?
link |
01:21:52.120
That's gonna be tricky.
link |
01:21:53.520
I love the idea that there's like,
link |
01:21:56.120
you as a human being, you want some ideas to work.
link |
01:22:00.520
And then there's the model that wants some ideas to work
link |
01:22:03.600
and you get to have a conversation to see
link |
01:22:05.520
which wins, and most likely the model will win in the end.
link |
01:22:09.640
Because it's the one, you don't have to do any work.
link |
01:22:12.000
The model is the one that has to do the work.
link |
01:22:13.640
So you should listen to the model.
link |
01:22:15.040
And I really love this idea that you talked about
link |
01:22:17.120
the humans in this picture.
link |
01:22:18.120
If I could just briefly ask: one is, you're saying
link |
01:22:22.080
the benchmarks matter, and the modular humans working on this matter.
link |
01:22:27.080
The benchmarks provide a sturdy ground from which to do
link |
01:22:32.520
these things that seem impossible.
link |
01:22:34.720
In the darkest of times, they give you hope,
link |
01:22:39.160
because you can see little signs of improvement.
link |
01:22:41.560
Yes.
link |
01:22:42.400
Like you're not, somehow you're not lost
link |
01:22:45.320
if you have metrics to measure your improvement.
link |
01:22:48.720
And then there's the other aspect you said elsewhere
link |
01:22:51.680
and here today, like titles matter.
link |
01:22:56.600
I wonder how much humans matter
link |
01:23:00.520
in the evolution of all of this,
link |
01:23:02.360
meaning individual humans.
link |
01:23:06.080
You know, something about their interaction,
link |
01:23:08.120
something about their ideas,
link |
01:23:09.200
how much they change the direction of all of this.
link |
01:23:12.960
Like if you change the humans in this picture,
link |
01:23:15.240
like is it that the model is sitting there
link |
01:23:18.240
and it wants you, it wants some idea to work?
link |
01:23:22.520
Or is it the humans, or maybe the model is providing
link |
01:23:25.560
20 ideas that could work.
link |
01:23:26.960
And depending on the humans you pick,
link |
01:23:29.080
they're going to be able to hear some of those ideas.
link |
01:23:31.400
Like, in all this, because you're now directing
link |
01:23:34.600
all of deep learning at DeepMind,
link |
01:23:35.920
you get to interact with a lot of projects,
link |
01:23:37.400
a lot of brilliant researchers.
link |
01:23:40.600
How much variability is created by the humans
link |
01:23:43.080
in all of this?
link |
01:23:44.160
Yeah, I mean, I do believe humans matter a lot
link |
01:23:47.360
at the very least at the time scale of years
link |
01:23:53.440
on when things are happening
link |
01:23:54.840
and what's the sequencing of it, right?
link |
01:23:56.920
So you get to interact with people that,
link |
01:24:00.520
I mean, you mentioned this.
link |
01:24:02.240
Some people really want some idea to work
link |
01:24:05.160
and they'll persist.
link |
01:24:06.720
And then some other people might be more practical.
link |
01:24:09.360
Like I don't care what idea works.
link |
01:24:12.880
I care about, you know, cracking protein folding.
link |
01:24:16.840
And these, at least these two kind of seem opposite sides.
link |
01:24:21.200
We need both.
link |
01:24:22.480
And we've clearly had both historically
link |
01:24:25.680
and that made certain things happen earlier or later.
link |
01:24:29.000
So definitely humans involved in all of this endeavor
link |
01:24:33.480
have had, I would say, an effect of years on the ordering of
link |
01:24:38.640
how things have happened,
link |
01:24:40.480
which breakthroughs came before
link |
01:24:41.840
which other breakthroughs and so on.
link |
01:24:43.280
So certainly that does happen.
link |
01:24:45.800
And so one other, maybe one other axis of distinction
link |
01:24:50.600
is what I called, and this is most commonly used
link |
01:24:53.840
in reinforcement learning
link |
01:24:54.840
is the exploration, exploitation tradeoff as well.
link |
01:24:57.800
It's not exactly what I meant, although quite related.
link |
01:25:00.920
So when you start trying to help others, right?
link |
01:25:07.000
Like you become a bit more of a mentor
link |
01:25:11.480
to a large group of people,
link |
01:25:13.120
be it a project or the deep learning team
link |
01:25:15.200
or something or even in the community
link |
01:25:17.480
when you interact with people in conferences and so on.
link |
01:25:20.480
You're identifying quickly, right?
link |
01:25:24.040
Some things that are explorative or exploitative
link |
01:25:27.080
and it's tempting to try to guide people, obviously.
link |
01:25:30.720
I mean, that's what makes like our experience,
link |
01:25:33.160
we bring it and we try to shape things sometimes wrongly.
link |
01:25:36.760
And there's many times that I've been wrong in the past,
link |
01:25:39.600
that's great, but it would be wrong to dismiss
link |
01:25:45.360
any sort of the research styles that I'm observing.
link |
01:25:49.600
And I often get asked, well, you're in industry, right?
link |
01:25:52.800
So we do have access to large compute scale and so on.
link |
01:25:55.680
So there are certain kinds of research
link |
01:25:57.480
that I almost feel like we need to do, responsibly and so on,
link |
01:26:01.680
but it is almost like we have the particle accelerator here,
link |
01:26:05.200
so to speak, as in physics, so we need to use it,
link |
01:26:07.520
we need to answer the questions
link |
01:26:08.840
that we should be answering right now
link |
01:26:10.440
for the scientific progress.
link |
01:26:12.400
But then at the same time, I look at many advances,
link |
01:26:15.240
including attention, which was discovered
link |
01:26:18.400
in Montreal initially because of lack of compute, right?
link |
01:26:22.440
So we were working on sequence to sequence
link |
01:26:24.960
with my friends over at Google Brain at the time.
link |
01:26:27.920
And we were using, I think, 8 GPUs,
link |
01:26:30.400
which was somehow a lot at the time.
link |
01:26:32.440
And then I think Montreal was a bit more limited in the scale,
link |
01:26:36.160
but then they discovered this content based attention concept
link |
01:26:39.240
that then has obviously triggered things like Transformer.
link |
01:26:43.400
Not everything, obviously, starts with the Transformer.
link |
01:26:46.080
And there's always a history that is important to recognize
link |
01:26:49.960
because then you can make sure that those who might feel now,
link |
01:26:54.160
well, we don't have so much compute,
link |
01:26:56.400
you need to then help them optimize that kind of research
link |
01:27:01.560
that might actually produce amazing change.
link |
01:27:04.280
Perhaps it's not as short term as some of these advancements
link |
01:27:07.960
or perhaps it's a different timescale,
link |
01:27:09.720
but the people and the diversity of the field
link |
01:27:13.040
is quite critical that we maintain it.
link |
01:27:15.760
And at times, especially mixed a bit with hype or other things,
link |
01:27:19.800
it's a bit tricky to be observing maybe too much
link |
01:27:24.160
of the same thinking across the board.
link |
01:27:27.840
But the humans definitely are critical.
link |
01:27:30.520
And I can think of quite a few personal examples
link |
01:27:33.920
where also someone told me something that had a huge effect
link |
01:27:38.880
on some idea.
link |
01:27:40.280
And then that's why I'm saying at least in terms of years,
link |
01:27:43.320
probably some things do happen.
link |
01:27:46.040
It's also fascinating how constraints somehow
link |
01:27:48.200
are essential for innovation.
link |
01:27:51.080
And the other thing you mentioned about engineering,
link |
01:27:53.440
I have a sneaking suspicion.
link |
01:27:54.920
Maybe I overstate it, you know, my love is with engineering.
link |
01:28:00.000
So I have a sneaking suspicion that all the genius,
link |
01:28:04.560
a large percentage of the genius
link |
01:28:06.320
is in the tiny details of engineering.
link |
01:28:09.320
So I think we like to think the genius is in the big ideas.
link |
01:28:17.600
I have a sneaking suspicion that because I've
link |
01:28:20.600
seen the genius of details, of engineering details,
link |
01:28:24.440
make the night and day difference.
link |
01:28:28.840
And I wonder if those kind of have a ripple effect over time.
link |
01:28:32.960
So that too, so that's taken the engineering perspective
link |
01:28:36.880
that sometimes that quiet innovation
link |
01:28:39.400
at the level of an individual engineer
link |
01:28:41.800
or maybe at the small scale of a few engineers
link |
01:28:44.640
can make all the difference.
link |
01:28:45.680
That scales, because we're working
link |
01:28:48.920
on computers that are scaled across large groups,
link |
01:28:53.480
that one engineering decision can lead to ripple effects.
link |
01:28:57.240
It's interesting to think about.
link |
01:28:58.960
Yeah, I mean, engineering, there's also kind of a historical,
link |
01:29:04.240
it might be a bit random.
link |
01:29:06.320
Because if you think of the history of how especially
link |
01:29:10.240
deep learning and neural networks took off,
link |
01:29:12.360
it feels like a bit random, because GPUs
link |
01:29:16.280
happen to be there at the right time for a different purpose,
link |
01:29:18.920
which was to play video games.
link |
01:29:20.680
So even the engineering that goes into the hardware,
link |
01:29:24.920
and its time frame might be very different.
link |
01:29:28.080
I mean, the GPUs were evolved throughout many years
link |
01:29:31.640
where we weren't even looking at that.
link |
01:29:33.920
So even at that level, that revolution, so to speak,
link |
01:29:38.720
the ripples are like, we'll see when they stop.
link |
01:29:42.240
But in terms of thinking of why is this happening,
link |
01:29:47.080
I think that when I try to categorize it
link |
01:29:49.800
in sort of things that might not be so obvious,
link |
01:29:52.760
I mean, clearly there's a hardware revolution.
link |
01:29:55.000
We are surfing thanks to that.
link |
01:29:58.400
Data centers as well.
link |
01:29:59.800
I mean, data centers are where Google, for instance,
link |
01:30:03.240
obviously they're serving Google,
link |
01:30:04.840
but also now, thanks to that
link |
01:30:06.960
and to having built such amazing data centers,
link |
01:30:09.680
we can train these models.
link |
01:30:11.760
Software is an important one.
link |
01:30:13.440
I think if I look at the state of how
link |
01:30:16.720
I had to implement things to implement my ideas,
link |
01:30:20.040
how I discarded ideas because they were too hard to implement.
link |
01:30:23.200
Yeah, clearly the times have changed,
link |
01:30:25.280
and thankfully we are in a much better software position
link |
01:30:28.440
as well.
link |
01:30:29.400
And then, I mean, obviously there's
link |
01:30:31.680
research that happens at scale, and more people
link |
01:30:34.400
enter the field, that's great to see,
link |
01:30:35.920
but it's almost enabled by these other things.
link |
01:30:38.280
And last but not least is also data, right?
link |
01:30:40.600
Curating data sets, labeling data sets, these benchmarks
link |
01:30:43.960
we think about, maybe we'll want to have all the benchmarks
link |
01:30:48.120
in one system, but it's still very valuable that someone
link |
01:30:51.320
put the thought and time and the vision
link |
01:30:53.600
to build certain benchmarks.
link |
01:30:54.880
We've seen progress thanks to that.
link |
01:30:57.760
We're going to repurpose the benchmarks.
link |
01:30:59.280
That's the beauty of Atari is like we solved it in a way,
link |
01:31:04.200
but we use it in Gato.
link |
01:31:06.000
It was critical, and I'm sure there's still a lot more
link |
01:31:09.120
to do thanks to that amazing benchmark that someone took
link |
01:31:12.320
the time to put, even though at the time maybe, oh,
link |
01:31:15.400
you have to think what's the next iteration of architectures.
link |
01:31:19.480
That's what maybe the field recognizes,
link |
01:31:21.440
but that's another thing we need to balance
link |
01:31:24.040
in terms of humans behind.
link |
01:31:25.800
We need to recognize all these aspects
link |
01:31:28.000
because they're all critical.
link |
01:31:29.520
And we tend to think of the genius, the scientist,
link |
01:31:33.600
and so on, but I'm glad you're, I know you have
link |
01:31:36.400
a strong engineering background.
link |
01:31:38.000
But also I'm a lover of data, and there's
link |
01:31:40.760
a pushback on the engineering comment.
link |
01:31:43.280
Ultimately, it could be the creators of benchmarks
link |
01:31:46.120
who have the most impact.
link |
01:31:47.480
Andrej Karpathy, who you mentioned,
link |
01:31:49.240
has recently been talking a lot of trash about ImageNet,
link |
01:31:52.040
which he has the right to do because of how essential
link |
01:31:54.600
he has been to the development
link |
01:31:57.800
and the success of deep learning around ImageNet.
link |
01:32:01.520
And you're saying that that's actually,
link |
01:32:02.960
that benchmark is holding back the field.
link |
01:32:05.480
Because, I mean, especially in his context,
link |
01:32:07.720
on Tesla autopilot, that's looking at real world behavior
link |
01:32:11.040
of a system, it's, there's something fundamentally
link |
01:32:15.840
missing about ImageNet that doesn't capture
link |
01:32:17.960
the real worldness of things.
link |
01:32:20.440
That we need to have data sets, benchmarks
link |
01:32:22.640
that have the unpredictability, the edge cases,
link |
01:32:27.080
the, whatever the heck it is that makes the real world
link |
01:32:29.680
so difficult to operate in, we need to have benchmarks
link |
01:32:33.280
with that, so.
link |
01:32:34.680
But just to think about the impact of ImageNet
link |
01:32:37.760
as a benchmark, and that really puts a lot of emphasis
link |
01:32:42.120
on the importance of a benchmark,
link |
01:32:43.720
both sort of internally at DeepMind and as a community.
link |
01:32:46.680
So one is coming in from within, like,
link |
01:32:50.120
how do I create a benchmark for me to mark and make progress,
link |
01:32:55.280
and how do I make a benchmark for the community
link |
01:32:58.120
to mark and push progress.
link |
01:33:02.520
You have this amazing paper you coauthored,
link |
01:33:05.880
a survey paper called,
link |
01:33:07.400
Emergent Abilities of Large Language Models.
link |
01:33:10.560
It has, again, the philosophy here
link |
01:33:12.520
that I'd love to ask you about.
link |
01:33:14.480
What's the intuition about the phenomenon
link |
01:33:16.640
of emergence in neural networks,
link |
01:33:18.480
transformers, language models?
link |
01:33:20.640
Is there a magic threshold beyond which we start
link |
01:33:24.560
to see certain performance?
link |
01:33:27.200
And is that different from task to task?
link |
01:33:30.000
Is that us humans just being poetic and romantic,
link |
01:33:32.680
or is there literally some level
link |
01:33:35.480
of which we start to see breakthrough performance?
link |
01:33:38.240
Yeah, I mean, this is a property
link |
01:33:40.120
that we start seeing in systems
link |
01:33:43.560
that actually tend to be,
link |
01:33:46.920
so in machine learning, traditionally,
link |
01:33:49.880
again, going to benchmarks.
link |
01:33:51.720
I mean, if you have some input outputs,
link |
01:33:54.920
like that is just a single input and a single output,
link |
01:33:58.320
you generally, when you train these systems,
link |
01:34:01.240
you see reasonably smooth curves
link |
01:34:04.480
when you analyze how much the data set size
link |
01:34:09.640
affects the performance,
link |
01:34:11.000
or how the model size affects the performance,
link |
01:34:13.080
or how long you train the system for
link |
01:34:17.880
affects the performance.
link |
01:34:19.400
So, if we think of ImageNet,
link |
01:34:22.120
like the train curves look fairly smooth
link |
01:34:25.120
and predictable in a way.
link |
01:34:28.200
And I would say that's probably because of the,
link |
01:34:31.400
it's kind of a one hop reasoning task, right?
link |
01:34:36.560
It's like, here is an input
link |
01:34:38.280
and you think for a few milliseconds
link |
01:34:40.840
or 100 milliseconds, 300 as a human.
link |
01:34:43.800
And then you tell me, yeah,
link |
01:34:44.880
there's an alpaca in this image.
link |
01:34:47.920
So, in language, we are seeing benchmarks
link |
01:34:52.840
that require more pondering and more thought in a way, right?
link |
01:34:58.280
This is just kind of, you need to look for some subtleties
link |
01:35:02.000
that involves inputs that you might think of,
link |
01:35:05.520
even if the input is a sentence
link |
01:35:07.880
describing a mathematical problem,
link |
01:35:10.920
there is a bit more processing required as a human
link |
01:35:14.200
and more introspection.
link |
01:35:15.720
So, I think how these benchmarks work
link |
01:35:20.520
means that there is actually a threshold,
link |
01:35:24.760
just going back to how transformers work
link |
01:35:26.800
in this way of querying for the right questions
link |
01:35:29.560
to get the right answers.
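As a minimal sketch of the content-based querying being described here, this is plain scaled dot-product attention; the token count, dimensionality, and random inputs are toy assumptions, not anything specific to the models discussed:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query scores every key, and the softmax-weighted values are
    the 'answers' retrieved for that query."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_queries, n_keys) match scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                               # weighted mixture of values

rng = np.random.default_rng(0)
n, d = 5, 16                       # 5 tokens, 16 dimensions (toy sizes)
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)                   # (5, 16): one retrieved summary per query
```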
link |
01:35:31.160
That might mean that performance becomes random
link |
01:35:35.520
until the right question is asked
link |
01:35:37.800
by the querying system of a transformer
link |
01:35:40.080
or of a language model like a transformer.
link |
01:35:42.880
And then only then you might start seeing performance
link |
01:35:47.720
going from random to non random.
link |
01:35:50.120
And this is more empirical.
link |
01:35:52.680
There's no formalism or theory behind this yet,
link |
01:35:56.320
although it might be quite important,
link |
01:35:57.800
but we're seeing these phase transitions
link |
01:36:00.360
of random performance until some,
link |
01:36:03.280
let's say, scale of a model.
link |
01:36:05.000
And then it goes beyond that.
link |
01:36:06.800
And it might be that you need to fit
link |
01:36:10.560
a few low order bits of thought
link |
01:36:14.080
before you can make progress on the whole task.
link |
01:36:17.200
And if you could measure, actually,
link |
01:36:19.760
that breakdown of the task,
link |
01:36:21.920
maybe you would see more smooth,
link |
01:36:23.480
oh, like, yeah, this, you know,
link |
01:36:24.960
once you get this and this and this and this and this,
link |
01:36:27.800
then you start making progress in the task.
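One way to make that subtask intuition concrete: if a task needs several sub-skills and each improves smoothly with scale, the full task, which requires all of them at once, still looks like a sharp jump. A toy sketch with invented sigmoid parameters and an assumed eight chained sub-skills:

```python
import numpy as np

scales = np.logspace(6, 12, 50)             # hypothetical model sizes (parameters)

def subskill_accuracy(scale, midpoint=1e9, steepness=1.5):
    """A smooth sigmoid in log(scale): no phase transition at this level."""
    return 1.0 / (1.0 + (midpoint / scale) ** steepness)

k = 8                                        # task needs 8 sub-skills chained together
per_skill = subskill_accuracy(scales)        # smooth curve per sub-skill
full_task = per_skill ** k                   # near zero, then an 'emergent' jump

for s, p, f in zip(scales[::10], per_skill[::10], full_task[::10]):
    print(f"scale={s:.1e}  sub-skill acc={p:.3f}  full-task acc={f:.3f}")
```

Measured per sub-skill, progress is smooth; measured only on the end-to-end task, it looks like emergence.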
link |
01:36:30.320
But it's somehow a bit annoying
link |
01:36:33.520
because then it means that certain questions
link |
01:36:37.480
we might ask about architectures,
link |
01:36:40.320
possibly can only be done at certain scale.
link |
01:36:43.040
And conversely, one thing that
link |
01:34:46.120
I've seen great progress on in the last couple of years
link |
01:36:49.200
is this notion of science of deep learning
link |
01:36:52.480
and science of scale in particular, right?
link |
01:36:55.040
So on the negative side, there's some benchmarks
link |
01:36:58.680
for which progress might need to be measured
link |
01:37:01.800
at minimum at certain scale
link |
01:37:04.000
until you see then what details of the model matter
link |
01:37:07.560
to make that performance better, right?
link |
01:37:10.040
So that's a bit of a con.
link |
01:37:11.920
But what we've also seen is that you can,
link |
01:37:16.320
you can sort of empirically analyze behavior of models
link |
01:37:20.040
at scales that are smaller, right?
link |
01:37:22.920
So let's say to put an example,
link |
01:37:25.720
we had this Chinchilla paper
link |
01:37:27.880
that revised the so called scaling laws of models.
link |
01:37:31.400
And that whole study is done
link |
01:37:33.240
at a reasonably small scale, right?
link |
01:37:35.040
That may be hundreds of millions
link |
01:37:36.560
up to one billion parameters.
link |
01:37:38.720
And then the cool thing is that you create some laws, right?
link |
01:37:41.880
Some laws, that is, some trends, right?
link |
01:37:43.680
You extract trends from data that you see.
link |
01:37:46.360
Okay, like it looks like the amount of data required
link |
01:37:49.440
to train now a 10x larger model would be this.
link |
01:37:52.160
And these laws so far,
link |
01:37:54.000
these extrapolations have helped us save compute
link |
01:37:57.520
and just get to a better place in terms of the science
link |
01:38:00.960
of how should we run these models at scale?
link |
01:38:03.840
How much data, how much depth
link |
01:38:05.640
and all sorts of questions we start asking
link |
01:38:08.520
extrapolating from a small scale.
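A minimal sketch of that kind of extrapolation: fit a simple power law to small-scale runs and read off a prediction at 10x the size. The data points and the fitted form below are invented for illustration; they are not the actual Chinchilla measurements or coefficients, and a full treatment would also model data and an irreducible-loss term:

```python
import numpy as np

# Hypothetical small-scale runs: (parameter count, final loss).
params = np.array([1e8, 2e8, 4e8, 8e8, 1e9])
loss   = np.array([3.40, 3.18, 2.98, 2.80, 2.72])

# Fit loss ≈ a * N**(-b) by a straight line in log-log space.
slope, intercept = np.polyfit(np.log(params), np.log(loss), 1)
a, b = np.exp(intercept), -slope

predict = lambda N: a * N ** (-b)
print(f"fit: loss ≈ {a:.2f} * N^(-{b:.3f})")
print(f"extrapolated loss at 10x the largest run: {predict(1e10):.2f}")
```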
link |
01:38:10.600
But then this emergence sadly means
link |
01:38:12.760
that not everything can be extrapolated from scale
link |
01:38:15.680
depending on the benchmark.
link |
01:38:16.880
And maybe the harder benchmarks are not so good
link |
01:38:20.240
for extracting these laws,
link |
01:38:21.960
but we have a variety of benchmarks at least.
link |
01:38:24.160
So I wonder to which degree the threshold,
link |
01:38:27.960
the phase shift scale is a function of the benchmark.
link |
01:38:32.240
So some of that, some of the science of scale
link |
01:38:34.880
might be engineering benchmarks
link |
01:38:38.120
where that threshold is low,
link |
01:38:40.400
sort of taking a main benchmark
link |
01:38:43.880
and reducing it somehow
link |
01:38:46.160
where the essential difficulty is left,
link |
01:38:48.520
but the emergence,
link |
01:38:49.960
the scale at which the emergence happens is lower.
link |
01:38:52.640
Just for the science aspect of it
link |
01:38:54.280
versus the actual real world aspect.
link |
01:38:57.000
Yeah, so luckily we have quite a few benchmarks,
link |
01:38:59.280
some of which are simpler
link |
01:39:00.560
or maybe they're more like,
link |
01:39:01.880
I think people might call these System 1
link |
01:39:03.880
versus System 2 style.
link |
01:39:05.920
So I think what we're now seeing,
link |
01:39:09.360
luckily, is that extrapolations
link |
01:39:11.840
from maybe slightly smoother or simpler benchmarks
link |
01:39:15.800
are translating to the harder ones.
link |
01:39:18.600
But that is not to say that
link |
01:39:20.200
these extrapolations won't hit their limits.
link |
01:39:22.640
And when it does,
link |
01:39:24.240
then how much we scale or how we scale
link |
01:39:27.600
will sadly be a bit suboptimal
link |
01:39:29.480
until we find better laws, right?
link |
01:39:31.840
And these laws again are very empirical laws.
link |
01:39:33.840
They're not like physical laws,
link |
01:39:35.960
although I wish there would be better theory
link |
01:39:38.720
about these things as well.
link |
01:39:40.160
But so far, I would say empirical theory,
link |
01:39:43.040
as I call it, is way ahead
link |
01:39:44.560
than actual theory of machine learning.
link |
01:39:47.880
Let me ask you almost for fun.
link |
01:39:50.520
So this is not Oriol as a DeepMind person
link |
01:39:57.320
or anything to do with DeepMind or Google,
link |
01:39:57.320
just as a human being
link |
01:39:58.880
and looking at these news of a Google engineer
link |
01:40:01.800
who claimed that,
link |
01:40:05.840
I guess the LaMDA language model was sentient
link |
01:40:09.760
or had the,
link |
01:40:11.120
and you still need to look into the details of this,
link |
01:40:14.080
but sort of making an official report
link |
01:40:18.680
and the claim that he believes there's evidence
link |
01:40:21.760
that this system has achieved sentience.
link |
01:40:25.120
And I think this is a really interesting case
link |
01:40:29.560
on a human level and a psychological level
link |
01:40:31.760
on a technical machine learning level
link |
01:40:35.920
of how language models transform our world
link |
01:40:38.360
and also just philosophical level
link |
01:40:39.880
of the role of AI systems in a human world.
link |
01:40:44.120
So what did you, what do you find interesting?
link |
01:40:48.120
What's your take on all of this
link |
01:40:49.720
as a machine learning engineer and a researcher
link |
01:40:52.440
and also as a human being?
link |
01:40:54.320
Yeah, I mean, a few reactions, quite a few actually.
link |
01:40:58.760
Have you ever briefly thought, is this thing sentient?
link |
01:41:02.600
Right, so never.
link |
01:41:04.360
Absolutely never.
link |
01:41:05.200
Like even with like AlphaStar, wait a minute, what?
link |
01:41:08.160
Sadly though, I think, yeah, sadly I have not,
link |
01:41:11.960
yeah, I think the current, any of the current models,
link |
01:41:15.320
although very useful and very good.
link |
01:41:18.960
Yeah, I think we're quite far from that.
link |
01:41:22.440
And there's kind of a converse side story.
link |
01:41:25.360
So one of my passions is about science in general.
link |
01:41:30.440
And I think I feel I'm a bit of like a failed scientist.
link |
01:41:34.520
That's why I came to machine learning
link |
01:41:36.560
because you always feel and you start seeing this
link |
01:41:40.160
that machine learning is maybe the science
link |
01:41:43.200
that can help other sciences as we've seen, right?
link |
01:41:45.440
Like you, you know, it's such a powerful tool.
link |
01:41:48.640
So thanks to that angle, right?
link |
01:41:51.200
That, okay, I love science, I love, I mean,
link |
01:41:53.080
I love astronomy, I love biology,
link |
01:41:54.960
but I'm not an expert and I decided,
link |
01:41:56.880
well, the thing I can do better at is computers.
link |
01:42:00.000
But especially when I was a bit more involved
link |
01:42:04.720
in AlphaFold, learning a bit about proteins
link |
01:42:07.400
and about biology and about life, the complexity,
link |
01:42:13.080
it feels like it really is like, I mean,
link |
01:42:15.000
if you start looking at the things that are going on
link |
01:42:19.240
at the atomic level and also, I mean,
link |
01:42:23.840
there's obviously the fact that we are maybe inclined
link |
01:42:27.680
to try to think of neural networks as like the brain,
link |
01:42:30.400
but the complexities and the amount of magic
link |
01:42:33.760
that it feels when, I mean, I'm not an expert,
link |
01:42:37.080
so it naturally feels more magic,
link |
01:42:38.560
but looking at biological systems
link |
01:42:40.880
as opposed to these computational brains
link |
01:42:46.640
just makes me like, wow, there's such level
link |
01:42:49.560
of complexity difference still, right?
link |
01:42:51.440
Like orders of magnitude complexity that, sure,
link |
01:42:55.240
these weights, I mean, we train them
link |
01:42:56.680
and they do nice things, but they're not at the level
link |
01:43:00.160
of biological entities, brains, cells.
link |
01:43:06.120
It just feels like it's just not possible
link |
01:43:09.000
to achieve the same level of complexity behavior
link |
01:43:12.400
and my belief when I talk to other beings,
link |
01:43:16.320
is certainly shaped by this amazement of biology,
link |
01:43:20.360
an amazement that, maybe because I know too much,
link |
01:43:22.360
I don't have about machine learning,
link |
01:43:23.800
but I certainly feel it's very far fetched
link |
01:43:27.600
and far in the future to be calling or to be thinking,
link |
01:43:31.720
well, this mathematical function
link |
01:43:34.560
that is differentiable is, in fact, sentient and so on.
link |
01:43:39.280
There's something on that point that it's very interesting.
link |
01:43:42.000
So you know enough about machines and enough about biology
link |
01:43:47.080
to know that there's many orders of magnitude
link |
01:43:49.080
of difference and complexity,
link |
01:43:51.920
but you know how machine learning works.
link |
01:43:56.080
So the interesting question from human beings
link |
01:43:58.200
that are interacting with a system
link |
01:43:59.440
that don't know about the underlying complexity.
link |
01:44:02.280
And I've seen people and probably including myself
link |
01:44:05.280
that have fallen in love with things that are quite simple.
link |
01:44:07.960
Yeah, so.
link |
01:44:08.800
And so maybe the complexity is one part of the picture,
link |
01:44:11.520
but maybe that's not a necessary,
link |
01:44:15.960
that's not a necessary condition for sentience,
link |
01:44:18.880
for perception or emulation of sentience.
link |
01:44:25.040
Right, so I mean, I guess the other side of this is,
link |
01:44:28.200
that's how I feel personally.
link |
01:44:29.600
I mean, you asked me about the person, right?
link |
01:44:32.400
Now it's very interesting to see
link |
01:44:34.000
how other humans feel about things, right?
link |
01:44:36.400
This we are like, again, like I'm not as amazed
link |
01:44:40.800
about things that I feel are,
link |
01:44:42.360
this is not as magical as this other thing
link |
01:44:44.600
because of maybe how I got to learn about it
link |
01:44:48.040
and how I see the curve a bit more smooth
link |
01:44:50.520
because I, you know, like just seen the progress
link |
01:44:53.120
of language models since Shannon in the 50s
link |
01:44:56.040
and actually looking at that timescale,
link |
01:44:58.920
the progress is not that fast, right?
link |
01:45:00.880
I mean, what we were thinking at the time,
link |
01:45:03.520
like almost a hundred years ago is not that dissimilar
link |
01:45:07.600
to what we're doing now,
link |
01:45:08.960
but at the same time, yeah, obviously others,
link |
01:45:11.480
my experience, right, the personal experience,
link |
01:45:14.520
I think no one should, you know,
link |
01:45:17.400
I think no one should tell others how they should feel.
link |
01:45:20.720
I mean, the feelings are very personal, right?
link |
01:45:23.000
So how others might feel about the models and so on.
link |
01:45:26.160
That's one part of the story that is important
link |
01:45:28.520
to understand for me personally as a researcher.
link |
01:45:32.080
And then when I maybe disagree
link |
01:45:34.920
or I don't understand or see that, yeah,
link |
01:45:37.120
maybe this is not something I think right now is reasonable.
link |
01:45:40.000
Knowing all that I know, one of the other things
link |
01:45:42.920
and perhaps partly why it's great to be talking to you
link |
01:45:46.640
and reaching out to the world about machine learning is,
link |
01:45:49.880
hey, let's demystify a bit the magic
link |
01:45:53.520
and try to see a bit more of the math
link |
01:45:56.280
and the fact that literally to create these models,
link |
01:45:59.960
if we had the right software,
link |
01:46:01.480
it would be 10 lines of code
link |
01:46:03.680
and then just a dump of the internet.
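A deliberately minimal sketch of that spirit, with a repeated toy string standing in for the dump of the internet and a trivial bigram table standing in for the real architecture; the point is how little scaffolding the recipe needs, not that this is how production models are written:

```python
import torch
import torch.nn as nn

text = "a dump of the internet would go here " * 200   # stand-in corpus
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}
data = torch.tensor([stoi[ch] for ch in text])

# Next-token model: a bigram logit table; swap in a Transformer for the real thing.
model = nn.Embedding(len(vocab), len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

for step in range(200):                       # the entire training loop
    x, y = data[:-1], data[1:]                # predict each next character
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()

print(f"final loss: {loss.item():.3f}")
```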
link |
01:46:06.200
So versus like then the complexity of like the creation
link |
01:46:10.360
of humans from their inception, right?
link |
01:46:13.680
And also the complexity of evolution
link |
01:46:15.880
of the whole universe to where we are
link |
01:46:19.280
that feels orders of magnitude more complex
link |
01:46:22.000
and fascinating to me.
link |
01:46:23.520
So I think, yeah, maybe part of,
link |
01:46:26.080
the only thing I'm thinking about trying to tell you is,
link |
01:46:29.320
yeah, I think explaining a bit of the magic,
link |
01:46:32.680
there is a bit of magic.
link |
01:46:33.640
It's good to be in love obviously with what you do at work.
link |
01:46:37.040
And I'm certainly fascinated and surprised
link |
01:46:39.480
quite often as well.
link |
01:46:41.320
But I think hopefully experts in biology
link |
01:46:45.080
will tell me this is not as magical,
link |
01:46:47.200
and I'm happy to learn that through interactions
link |
01:46:50.880
with the larger community,
link |
01:46:52.320
we can also have a certain level of education
link |
01:46:56.040
that in practice also will matter
link |
01:46:58.400
because I mean, one question is how you feel about this
link |
01:47:00.840
but then the other very important is,
link |
01:47:03.120
you starting to interact with these in products and so on.
link |
01:47:07.000
It's good to understand a bit what's going on,
link |
01:47:09.200
what's not going on, what's safe, what's not safe
link |
01:47:12.320
and so on, right?
link |
01:47:13.160
Otherwise the technology will not be used properly for good
link |
01:47:17.080
which is obviously the goal of all of us, I hope.
link |
01:47:20.560
So let me then ask the next question.
link |
01:47:22.960
Do you think in order to solve intelligence
link |
01:47:25.840
or to replace the Lex bot that does interviews
link |
01:47:29.600
as we started this conversation with,
link |
01:47:31.480
do you think the system needs to be sentient?
link |
01:47:34.880
Do you think it needs to achieve something like consciousness?
link |
01:47:38.840
And do you think about what consciousness is
link |
01:47:41.800
in the human mind that could be instructive
link |
01:47:44.360
for creating AI systems?
link |
01:47:46.800
Yeah, honestly, I think probably not
link |
01:47:51.080
to the degree of intelligence that there's this brain
link |
01:47:57.120
that can learn, can be extremely useful,
link |
01:48:00.360
can challenge you, can teach you.
link |
01:48:03.000
Conversely, you can teach it to do things.
link |
01:48:05.680
I'm not sure it's necessary personally speaking
link |
01:48:08.400
but if consciousness or any other biological
link |
01:48:14.080
or evolutionary lesson can be repurposed
link |
01:48:19.440
to then influence our next set of algorithms,
link |
01:48:22.640
that is a great way to actually make progress, right?
link |
01:48:25.680
And the same way I tried to explain transformers
link |
01:48:28.040
a bit how it feels we operate
link |
01:48:30.240
when we look at texts specifically,
link |
01:48:33.440
these insights are very important, right?
link |
01:48:36.040
So there's a distinction between details
link |
01:48:40.360
of how the brain might be doing computation.
link |
01:48:43.280
I think my understanding is, sure, there's neurons
link |
01:48:46.600
and there's some resemblance to neural networks
link |
01:48:48.560
but we don't quite understand enough of the brain
link |
01:48:51.480
in detail, right, to be able to replicate it.
link |
01:48:55.360
But then more, if you zoom out a bit,
link |
01:48:58.880
how we then, our thought process, how memory works,
link |
01:49:03.440
maybe even how evolution got us here,
link |
01:49:05.680
what's exploration, exploitation,
link |
01:49:07.360
like how these things happen.
link |
01:49:08.800
I think these clearly can inform algorithmic level research
link |
01:49:13.120
and I've seen some examples of these being quite useful
link |
01:49:18.480
to then guide the research,
link |
01:49:19.760
even it might be for the wrong reasons, right?
link |
01:49:21.720
So I think biology and what we know about ourselves
link |
01:49:26.120
can help a whole lot to build,
link |
01:49:29.120
essentially what we call AGI, this general,
link |
01:49:32.960
the real Gato, right, the last step of the chain,
link |
01:49:35.720
hopefully, but consciousness in particular,
link |
01:49:39.240
I don't myself at least think too hard about
link |
01:49:42.960
how to add that to the system.
link |
01:49:44.840
But maybe my understanding is also very personal
link |
01:49:47.880
about what it means, right?
link |
01:49:48.880
I think this, even that in itself is a long debate
link |
01:49:51.800
that I know people have often
link |
01:49:55.360
and maybe I should learn more about this.
link |
01:49:57.800
Yeah, and I personally,
link |
01:49:59.840
I notice the magic often on a personal level,
link |
01:50:02.760
especially with physical systems like robots.
link |
01:50:06.200
I have a lot of legged robots now in Austin that I play with
link |
01:50:11.720
and even when you program them,
link |
01:50:13.280
when they do things you didn't expect,
link |
01:50:15.640
there's an immediate anthropomorphization
link |
01:50:18.640
and you notice the magic
link |
01:50:19.840
and you start to think about things like sentience
link |
01:50:22.680
that has to do more with effective communication
link |
01:50:26.040
and less with any of these kind of dramatic things.
link |
01:50:28.440
It seems like a useful part of communication.
link |
01:50:32.640
Having the perception of consciousness
link |
01:50:36.560
seems like useful for us humans.
link |
01:50:38.280
We treat each other more seriously.
link |
01:50:40.880
We are able to do a nearest neighbor,
link |
01:50:44.480
shoving of that entity into your memory correctly,
link |
01:50:47.720
all that kind of stuff, seems useful,
link |
01:50:49.840
at least to fake it even if you never make it.
link |
01:50:52.520
So maybe like, yeah, mirroring the question
link |
01:50:55.680
and since you talked to a few people,
link |
01:50:57.480
then you do think that we'll need to figure something out
link |
01:51:01.800
in order to achieve intelligence
link |
01:51:04.600
in a grander sense of the word?
link |
01:51:06.520
Yeah, I personally believe yes,
link |
01:51:08.200
but I don't even think it'll be like a separate island
link |
01:51:12.640
we'll have to travel to.
link |
01:51:14.200
I think it will emerge quite naturally.
link |
01:51:16.440
Okay, that's easier for us then, thank you.
link |
01:51:20.160
But the reason I think it's important to think about
link |
01:51:22.840
is you will start, I believe,
link |
01:51:25.160
like with this Google engineer,
link |
01:51:26.360
you will start seeing this a lot more,
link |
01:51:28.800
especially when you have AI systems
link |
01:51:30.560
that are actually interacting with human beings
link |
01:51:33.000
that don't have an engineering background.
link |
01:51:35.200
And we have to prepare for that.
link |
01:51:38.600
Because I do believe there will be a civil rights movement
link |
01:51:41.640
for robots as silly as it is to say.
link |
01:51:44.640
There's going to be a large number of people
link |
01:51:46.840
that realize there's these intelligent entities
link |
01:51:49.040
with whom I have a deep relationship
link |
01:51:51.640
and I don't wanna lose them.
link |
01:51:53.240
They've come to be a part of my life and they mean a lot.
link |
01:51:56.000
They have a name, they have a story, they have a memory.
link |
01:51:59.080
And we start to ask questions about ourselves.
link |
01:52:01.360
Well, this thing sure seems like it's capable of suffering
link |
01:52:07.640
because it tells all these stories of suffering.
link |
01:52:09.880
It doesn't wanna die and all those kinds of things.
link |
01:52:11.720
And we have to start to ask ourselves questions.
link |
01:52:14.480
What is the difference between a human being and this thing?
link |
01:52:16.920
And wait, so when you engineer,
link |
01:52:18.640
I believe from an engineering perspective,
link |
01:52:21.560
from like a DeepMind or anybody that builds systems,
link |
01:52:25.040
there might be laws in the future
link |
01:52:26.560
where you're not allowed to engineer systems
link |
01:52:29.200
with displays of sentience
link |
01:52:32.520
unless they're explicitly designed to be that,
link |
01:52:36.040
unless it's a pet.
link |
01:52:37.400
So if you have a system that's just doing customer support,
link |
01:52:41.280
you're legally not allowed to display sentience.
link |
01:52:44.200
We'll start to like ask ourselves that question.
link |
01:52:47.320
And then so that's going to be part
link |
01:52:49.520
of the software engineering process.
link |
01:52:51.080
Do we, which features do we have,
link |
01:52:53.400
and is one of them communication of sentience?
link |
01:52:56.840
But it's important to start thinking about that stuff,
link |
01:52:58.720
especially how much it captivates public attention.
link |
01:53:01.760
Yeah, absolutely.
link |
01:53:03.240
It's definitely a topic that's important to think about.
link |
01:53:07.920
And I think in a way, I always see not,
link |
01:53:10.840
I mean, not every movie is equally on point
link |
01:53:15.320
with certain things,
link |
01:53:16.160
but certainly science fiction in this sense,
link |
01:53:19.120
at least has prepared society to start thinking
link |
01:53:22.600
about certain topics that,
link |
01:53:24.840
even if it's too early to talk about,
link |
01:53:26.480
as long as we are like reasonable,
link |
01:53:29.520
it's certainly going to prepare us for both the research
link |
01:53:33.920
to come and how to, I mean,
link |
01:53:35.280
there's many important challenges and topics
link |
01:53:38.160
that come with building an intelligent system,
link |
01:53:42.880
many of which you just mentioned, right?
link |
01:53:44.720
So I think we're never going to be fully ready
link |
01:53:49.960
unless we talk about these.
link |
01:53:51.440
And we start also, as I said,
link |
01:53:54.160
just kind of expanding the people we talk to
link |
01:53:59.840
to not include only our own researchers and so on.
link |
01:54:03.280
And in fact, places like DeepMind,
link |
01:54:05.280
but elsewhere there's more interdisciplinary groups
link |
01:54:09.600
forming up to start asking
link |
01:54:11.880
and really working with us on these questions.
link |
01:54:15.000
Because obviously this is not initially
link |
01:54:17.440
what your passion is when you do your PhD,
link |
01:54:19.440
but certainly it is coming, right?
link |
01:54:21.480
So it's fascinating kind of it's the thing
link |
01:54:24.360
that brings me to one of my passions that is learning.
link |
01:54:27.960
So in this sense, this is kind of a new area
link |
01:54:31.760
that as a learning system myself,
link |
01:54:35.200
I want to keep exploring.
link |
01:54:36.720
And I think it's great that to see parts of the debate
link |
01:54:41.080
and I've even seen a level of maturity in the conferences
link |
01:54:44.760
that deal with AI, if you look five years ago,
link |
01:54:48.080
to now just the amount of workshops and so on
link |
01:54:52.080
has changed so much; it's impressive to see how much topics
link |
01:54:56.520
of safety ethics and so on come to the surface,
link |
01:55:00.800
which is great.
link |
01:55:01.680
And if we're too early, clearly it's fine.
link |
01:55:03.840
I mean, it's a big field and there's lots of people
link |
01:55:07.280
with lots of interests that will do progress
link |
01:55:10.280
or make progress.
link |
01:55:11.920
And obviously I don't believe we're too late.
link |
01:55:14.080
So in that sense, like I think it's great
link |
01:55:16.480
that we're doing these already.
link |
01:55:18.240
It's better to be too early than too late
link |
01:55:20.240
when it comes to super intelligent AI systems.
link |
01:55:22.800
Let me ask, speaking of sentient AIs,
link |
01:55:25.520
you gave props to your friend, Ilya Sutskever,
link |
01:55:28.720
for being elected a Fellow of the Royal Society.
link |
01:55:32.000
So just as a shout out to a fellow researcher and a friend,
link |
01:55:35.160
what's the secret to the genius of Ilya Sutskever?
link |
01:55:39.440
And also, do you believe that his tweets of
link |
01:55:42.680
as you have hypothesized and Andrej Karpathy did as well,
link |
01:55:46.040
are generated by a language model?
link |
01:55:48.680
Yeah, so, I strongly believe... Ilya is gonna visit
link |
01:55:53.760
in a few weeks actually.
link |
01:55:54.720
So I'll ask him in person, but...
link |
01:55:58.080
Will he tell you the truth?
link |
01:55:59.240
Yes, of course, hopefully.
link |
01:56:00.760
I mean, ultimately we all have shared paths
link |
01:56:04.080
and there's friendships that go beyond
link |
01:56:07.000
obviously institutions and so on.
link |
01:56:09.880
So hope he tells me the truth.
link |
01:56:11.760
Well, maybe the AI system is holding him hostage somehow.
link |
01:56:14.440
Maybe he has some videos about, he doesn't wanna release.
link |
01:56:17.000
So maybe it has taken control over him.
link |
01:56:19.760
So he can't tell the truth.
link |
01:56:21.000
If I see him in person, then I'll tell him.
link |
01:56:22.640
He will know.
link |
01:56:23.960
But I think it's a good,
link |
01:56:27.640
I think it's Ilya's personality, just knowing him for a while.
link |
01:56:32.440
Yeah, he's, everyone on Twitter, I guess,
link |
01:56:35.320
gets a different persona and I think Ilya's one
link |
01:56:39.640
does not surprise me, right?
link |
01:56:40.920
So I think knowing Ilya from before social media
link |
01:56:43.600
and before AI was so prevalent,
link |
01:56:45.800
I recognize a lot of his character.
link |
01:56:47.560
So that's something for me that I feel good about.
link |
01:56:50.520
A friend that hasn't changed
link |
01:56:52.520
or is still true to himself, right?
link |
01:56:56.040
Obviously, there is, though, the fact
link |
01:56:59.000
that your field becomes more popular
link |
01:57:02.200
and he is obviously one of the main figures in the field
link |
01:57:05.480
having done a lot of advancement.
link |
01:57:06.960
So I think that the tricky bit here is
link |
01:57:09.320
how to balance your true self with the responsibility
link |
01:57:12.240
that your words carry.
link |
01:57:13.640
So in this sense, I think, yeah,
link |
01:57:16.160
like I appreciate the style and I understand it,
link |
01:57:19.400
but it created debates on like some of his tweets, right?
link |
01:57:24.200
That maybe it's good, we have them early anyways, right?
link |
01:57:26.880
But yeah, then the reactions are usually polarizing.
link |
01:57:31.080
I think we're just seeing kind of the reality
link |
01:57:33.080
of social media a bit there as well,
link |
01:57:35.000
reflected on that particular topic
link |
01:57:38.160
or set of topics he's tweeting about.
link |
01:57:40.320
Yeah, I mean, it's funny they speak to this tension.
link |
01:57:42.960
He was one of the early seminal figures
link |
01:57:46.200
in the field of deep learning.
link |
01:57:47.360
And so there's a responsibility with that,
link |
01:57:49.000
but he's also from having interacted with him quite a bit.
link |
01:57:53.160
He's just a brilliant thinker about ideas.
link |
01:57:57.440
And which as are you,
link |
01:58:01.240
and that there's a tension between becoming the manager
link |
01:58:03.760
versus like the actual thinking through very novel ideas,
link |
01:58:08.760
the scientist versus the manager.
link |
01:58:13.600
And he's one of the great scientists of our time.
link |
01:58:17.680
This was quite interesting.
link |
01:58:18.800
And also people tell me he's quite silly,
link |
01:58:20.840
which I haven't quite detected yet,
link |
01:58:23.240
but in private, we'll have to see about that.
link |
01:58:26.000
Yeah, yeah, I mean, just on the point of,
link |
01:58:29.640
I mean, Ilya has been an inspiration.
link |
01:58:33.360
I mean, quite a few colleagues I can think of shaped,
link |
01:58:36.400
you know, the person you are, like Ilya certainly
link |
01:58:40.680
gets probably the top spot, if not close to the top.
link |
01:58:43.800
And if we go back to the question about people in the field,
link |
01:58:48.000
like how the role would have changed the field or not,
link |
01:58:51.760
I think Ilya's case is interesting
link |
01:58:54.000
because he really has a deep belief
link |
01:58:56.840
in the scaling up of neural networks.
link |
01:58:59.640
There was a talk that is still famous to this day
link |
01:59:03.720
from the sequence to sequence paper,
link |
01:59:06.200
where he was just claiming,
link |
01:59:08.400
just give me supervised data and a large neural network.
link |
01:59:11.760
And then, you know, you'll solve basically
link |
01:59:13.720
all the problems, right?
link |
01:59:15.080
That vision, right, was already there many years ago.
link |
01:59:19.800
So it's good to see like someone who is in this case
link |
01:59:22.880
very deeply into this style of research.
link |
01:59:27.200
And clearly has had a tremendous track record
link |
01:59:32.000
of successes and so on.
link |
01:59:34.160
The funny bit about that talk is that
link |
01:59:36.320
we rehearsed the talk in a hotel room before
link |
01:59:39.040
and the original version of that talk
link |
01:59:42.000
would have been even more controversial.
link |
01:59:44.000
So maybe I'm the only person
link |
01:59:46.560
that has seen the unfiltered version of the talk.
link |
01:59:49.200
And, you know, maybe when the time comes,
link |
01:59:51.680
maybe we should revisit some of the skipped slides
link |
01:59:55.120
from the talk from Ilya.
link |
01:59:57.600
But I really think the deep belief
link |
02:00:01.040
in a certain style of research pays out, right?
link |
02:00:03.920
It is good to be practical sometimes.
link |
02:00:06.440
And I actually think Ilya and myself are like practical,
link |
02:00:09.480
but it's also good.
link |
02:00:10.520
There's some sort of longterm belief and trajectory.
link |
02:00:14.920
Obviously, there's a bit of luck involved,
link |
02:00:16.800
but it might be that that's the right path.
link |
02:00:18.880
Then you clearly are ahead
link |
02:00:20.080
and hugely influential to the field, as he has been.
link |
02:00:23.640
Do you agree with that intuition
link |
02:00:25.240
that maybe was written about by Rich Sutton
link |
02:00:29.760
in the bitter lesson, that the biggest lesson
link |
02:00:34.680
that can be read from 70 years of AI research
link |
02:00:37.000
is that general methods that leverage computation
link |
02:00:40.120
are ultimately the most effective.
link |
02:00:42.920
Do you think that intuition is ultimately correct?
link |
02:00:48.680
General methods that leverage computation,
link |
02:00:52.360
allowing the scaling of computation
link |
02:00:54.440
to do a lot of the work.
link |
02:00:56.280
And so the basic task of us humans is to design methods
link |
02:01:01.000
that are more and more general
link |
02:01:02.680
versus more and more specific to the tasks at hand.
link |
02:01:07.160
I certainly think this essentially mimics
link |
02:01:10.400
a bit of the deep learning research,
link |
02:01:14.720
almost like philosophy,
link |
02:01:17.000
that on the one hand, we want to be data agnostic.
link |
02:01:20.480
We don't wanna pre process data sets.
link |
02:01:22.160
We wanna see the bytes, right?
link |
02:01:23.440
Like the true data as it is,
link |
02:01:25.560
and then learn everything on top.
link |
02:01:27.400
So very much agree with that.
link |
02:01:29.840
And I think scaling up feels at the very least,
link |
02:01:32.920
again, necessary for building incredible complex systems.
link |
02:01:39.040
It's possibly not sufficient
link |
02:01:42.160
in that we may still need a couple of breakthroughs.
link |
02:01:45.120
I think Rich Sutton mentioned search
link |
02:01:48.000
being part of the equation of scale and search.
link |
02:01:52.320
I think search, I've seen it, that's been more mixed
link |
02:01:56.600
in my experience.
link |
02:01:57.440
So from that lesson in particular,
link |
02:01:59.360
search is a bit more tricky
link |
02:02:01.200
because it is very appealing to search in domains like Go
link |
02:02:05.360
where you have a clear reward function
link |
02:02:07.480
that you can then discard some search traces.
link |
02:02:10.680
But then in some other tasks,
link |
02:02:12.960
it's not very clear how you would do that.
link |
02:02:15.280
Although one of our recent works,
link |
02:02:18.680
which actually was mostly mimicking or a continuation
link |
02:02:22.160
and even the team and the people involved
link |
02:02:23.720
were pretty much very intersecting with AlphaStar
link |
02:02:27.200
was AlphaCode in which we actually saw
link |
02:02:30.200
the bitter lesson how scale of the models
link |
02:02:32.640
and then a massive amount of search yielded this
link |
02:02:35.240
kind of very interesting result
link |
02:02:36.760
of being able to reach human level in code competitions.
link |
02:02:41.360
So I've seen examples of it being
link |
02:02:43.680
literally mapped to search and scale.
link |
02:02:46.400
I'm not so convinced about the search bit,
link |
02:02:48.160
but certainly I'm convinced scale will be needed.
link |
02:02:50.920
So we need general methods.
link |
02:02:52.680
We need to test them
link |
02:02:53.560
and maybe we need to make sure that we can scale them
link |
02:02:56.160
given the hardware that we have in practice,
link |
02:02:59.120
but then maybe we should also shape
link |
02:03:01.000
how the hardware looks like
link |
02:03:02.920
based on which methods might be needed to scale.
link |
02:03:05.640
And that's an interesting contrast of this GPU comment
link |
02:03:11.640
that is, we got it almost for free
link |
02:03:13.400
because games were using this,
link |
02:03:15.080
but maybe now if sparsity is required,
link |
02:03:19.520
we don't have the hardware, although in theory,
link |
02:03:21.920
I mean, many people are building
link |
02:03:23.240
different kinds of hardware these days,
link |
02:03:24.720
but there's a bit of this notion of hardware lottery
link |
02:03:27.800
for scale that might actually have an impact
link |
02:03:31.280
at least on the year, again, scale of years
link |
02:03:33.480
on how fast we'll make progress
link |
02:03:35.240
to maybe a version of neural nets
link |
02:03:37.680
or whatever comes next that might enable
link |
02:03:41.960
truly intelligent agents.
link |
02:03:44.440
Do you think in your lifetime we will build an AGI system
link |
02:03:49.600
that would undeniably be a thing
link |
02:03:54.080
that achieves human level intelligence and goes far beyond?
link |
02:03:58.560
I definitely think it's possible
link |
02:04:02.400
that it will go far beyond,
link |
02:04:03.760
but I'm definitely convinced
link |
02:04:04.920
that it will reach human level intelligence.
link |
02:04:08.120
And I'm hypothesizing about the beyond
link |
02:04:11.000
because the beyond bit is a bit tricky to define,
link |
02:04:16.600
especially when we look at the current formula
link |
02:04:20.040
of starting from this imitation learning standpoint, right?
link |
02:04:23.800
So we can certainly imitate humans at language and beyond.
link |
02:04:30.760
So getting at human level through imitation
link |
02:04:33.440
feels very possible.
link |
02:04:34.960
Going beyond will require reinforcement learning
link |
02:04:39.120
and other things.
link |
02:04:39.960
And I think in some areas
link |
02:04:41.760
that certainly already has paid out.
link |
02:04:43.640
I mean, Go being an example,
link |
02:04:45.640
that's my favorite so far
link |
02:04:47.360
in terms of going beyond human capabilities.
link |
02:04:50.480
But in general, I'm not sure we can define reward functions
link |
02:04:55.680
that, from a seed of imitating human level intelligence
link |
02:05:00.080
that is general and then going beyond.
link |
02:05:02.960
That bit is not so clear in my lifetime,
link |
02:05:05.320
but certainly human level, yes.
link |
02:05:08.240
And I mean, that in itself is already quite powerful, I think.
link |
02:05:11.400
So going beyond, it's obviously not that
link |
02:05:14.560
we're not gonna try that,
link |
02:05:16.200
to then get to superhuman scientists and discovery
link |
02:05:20.760
and advancing the world,
link |
02:05:22.160
but at least human level is also in general,
link |
02:05:25.560
is also very, very powerful.
link |
02:05:27.560
Well, especially if human level or slightly beyond
link |
02:05:31.520
is integrated deeply with human society
link |
02:05:33.800
and there's billions of agents like that,
link |
02:05:36.520
do you think there's a singularity moment beyond which
link |
02:05:40.040
our world will be just very deeply transformed
link |
02:05:44.240
by these kinds of systems?
link |
02:05:45.680
Because now you're talking about intelligent systems
link |
02:05:47.880
that are just, I mean,
link |
02:05:50.720
this is no longer just going from horse and buggy to the car.
link |
02:05:56.520
It feels like a very different kind of shift
link |
02:05:59.840
in what it means to be a living entity on earth.
link |
02:06:03.360
Are you afraid?
link |
02:06:04.240
Are you excited of this world?
link |
02:06:06.360
I'm afraid if there's a lot more of them,
link |
02:06:09.400
so I think maybe we'll need to think about
link |
02:06:13.080
if we truly get there just thinking of limited resources,
link |
02:06:18.400
like humanity clearly hits some limits
link |
02:06:21.480
and then there's some balance, hopefully,
link |
02:06:23.480
that biologically the planet is imposing
link |
02:06:26.360
and we should actually try to get better at this.
link |
02:06:28.600
As we know, there's quite a few issues
link |
02:06:31.600
with having too many people coexisting
link |
02:06:35.840
in a resource limited way.
link |
02:06:37.640
So for digital entities, it's an interesting question.
link |
02:06:40.360
I think such a limit maybe should exist,
link |
02:06:43.600
but maybe it's gonna be imposed by energy availability
link |
02:06:47.680
because this also consumes energy.
link |
02:06:49.760
In fact, most systems are more inefficient
link |
02:06:53.560
than we are in terms of energy required.
link |
02:06:56.760
But definitely, I think as a society,
link |
02:06:59.520
we'll need to just work together
link |
02:07:02.280
to find what would be reasonable in terms of growth
link |
02:07:06.400
or how we coexist if that is to happen.
link |
02:07:11.440
I am very excited about obviously the aspects of automation
link |
02:07:16.040
that enable people who obviously don't have access
link |
02:07:19.040
to certain resources or knowledge
link |
02:07:22.080
to have that access.
link |
02:07:23.920
I think those are the applications in a way
link |
02:07:26.320
that I'm most excited to see and to personally work towards.
link |
02:07:31.000
Yeah, there's going to be significant improvements
link |
02:07:32.720
in productivity and the quality of life
link |
02:07:34.400
across the whole population, which is very interesting.
link |
02:07:37.040
But I'm looking even far beyond
link |
02:07:39.280
us becoming a multiplanetary species.
link |
02:07:42.720
And just as a quick bet, last question,
link |
02:07:45.400
do you think as humans become multiplanetary species,
link |
02:07:49.240
go outside our solar system, all that kind of stuff,
link |
02:07:52.520
do you think there'll be more humans
link |
02:07:54.480
or more robots in that future world?
link |
02:07:57.240
So will humans be the quirky, intelligent beings of the past?
link |
02:08:04.480
Or is there something deeply fundamental
link |
02:08:06.960
to human intelligence that's truly special,
link |
02:08:09.640
where we will be part of those other planets,
link |
02:08:12.160
not just AI systems?
link |
02:08:13.960
I think we're all excited to build AGI to empower
link |
02:08:21.720
or make us more powerful as a human species.
link |
02:08:25.120
Not to say there might be some hybridization.
link |
02:08:27.600
I mean, this is obviously speculation,
link |
02:08:29.720
but there are companies also trying to,
link |
02:08:32.520
the same way medicine is making us better.
link |
02:08:35.680
Maybe there are other things that are yet to happen on that.
link |
02:08:39.120
But if the ratio is not at most one to one,
link |
02:08:43.360
I would not be happy.
link |
02:08:44.680
So I would hope that we are part of the equation.
link |
02:08:49.200
But maybe a one to one ratio
link |
02:08:52.800
feels possible, constructive, and so on.
link |
02:08:56.280
But it would not be good to have a misbalance,
link |
02:08:59.680
at least from my core beliefs and the why I'm doing what
link |
02:09:03.480
I'm doing when I go to work and I research what I research.
link |
02:09:07.160
Well, this is how I know you're human.
link |
02:09:09.560
And this is how you've passed the Turing test.
link |
02:09:12.800
And you are one of the special humans, Oriol.
link |
02:09:15.000
It's a huge honor that you had talked with me.
link |
02:09:17.160
And I hope we get the chance to speak again maybe once
link |
02:09:20.680
before the singularity, once after,
link |
02:09:23.040
and see how our view of the world changes.
link |
02:09:25.320
Thank you again for talking today.
link |
02:09:26.720
Thank you for the amazing work you do here.
link |
02:09:29.200
Shining example of a researcher and a human being
link |
02:09:32.040
in this community.
link |
02:09:32.960
Thanks a lot, Lex.
link |
02:09:34.080
Yeah, looking forward to before the singularity, certainly.
link |
02:09:37.840
And maybe after.
link |
02:09:39.960
Thanks for listening to this conversation with Oriol Vinyals.
link |
02:09:43.160
To support this podcast, please check out our sponsors
link |
02:09:45.560
in the description.
link |
02:09:47.000
And now, let me leave you with some words from Alan Turing.
link |
02:09:51.200
Those who can imagine anything can create the impossible.
link |
02:09:56.120
Thank you for listening and hope to see you next time.