Oriol Vinyals: Deep Learning and Artificial General Intelligence | Lex Fridman Podcast #306


link |
00:00:00.000
at which point is the neural network a being versus a tool?
link |
00:00:08.400
The following is a conversation with Oriol Vinyals,
link |
00:00:11.360
his second time on the podcast.
link |
00:00:13.440
Oriel is the research director
link |
00:00:15.920
and deep learning lead at DeepMind
link |
00:00:18.000
and one of the most brilliant thinkers and researchers
link |
00:00:20.960
in the history of artificial intelligence.
link |
00:00:24.320
This is the Lex Fridman podcast.
link |
00:00:26.640
To support it, please check out our sponsors
link |
00:00:28.840
in the description.
link |
00:00:30.160
And now, dear friends, here's Oriol Vinyals.
link |
00:00:34.480
You are one of the most brilliant researchers
link |
00:00:37.020
in the history of AI,
link |
00:00:38.440
working across all kinds of modalities.
link |
00:00:40.560
Probably the one common theme is
link |
00:00:42.680
it's always sequences of data.
link |
00:00:45.000
So we're talking about languages, images,
link |
00:00:46.960
even biology and games, as we talked about last time.
link |
00:00:50.240
So you're a good person to ask this.
link |
00:00:53.360
In your lifetime, will we be able to build an AI system
link |
00:00:57.320
that's able to replace me as the interviewer
link |
00:01:00.740
in this conversation,
link |
00:01:02.580
in terms of ability to ask questions
link |
00:01:04.460
that are compelling to somebody listening?
link |
00:01:06.600
And then further question is, are we close?
link |
00:01:10.640
Will we be able to build a system that replaces you
link |
00:01:13.880
as the interviewee
link |
00:01:16.080
in order to create a compelling conversation?
link |
00:01:18.100
How far away are we, do you think?
link |
00:01:20.020
It's a good question.
link |
00:01:21.800
I think partly I would say, do we want that?
link |
00:01:24.680
I really like when we start now with very powerful models,
link |
00:01:29.400
interacting with them and thinking of them
link |
00:01:32.160
closer to us.
link |
00:01:34.080
The question is, if you remove the human side
link |
00:01:37.020
of the conversation, is that an interesting artifact?
link |
00:01:42.320
And I would say, probably not.
link |
00:01:44.440
I've seen, for instance, last time we spoke,
link |
00:01:47.400
like we were talking about StarCraft,
link |
00:01:50.320
and creating agents that play games involves self play,
link |
00:01:54.920
but ultimately what people care about was,
link |
00:01:57.660
how does this agent behave
link |
00:01:59.080
when the opposite side is a human?
link |
00:02:02.700
So without a doubt,
link |
00:02:04.720
we will probably be more empowered by AI.
link |
00:02:08.560
Maybe you can source some questions from an AI system.
link |
00:02:12.480
I mean, that even today, I would say it's quite plausible
link |
00:02:15.020
that with your creativity,
link |
00:02:17.060
you might actually find very interesting questions
link |
00:02:19.400
that you can filter.
link |
00:02:20.740
We call this cherry picking sometimes
link |
00:02:22.420
in the field of language.
link |
00:02:24.400
And likewise, if I had now the tools on my side,
link |
00:02:27.540
I could say, look, you're asking this interesting question.
link |
00:02:30.660
From this answer, I like the words chosen
link |
00:02:33.240
by this particular system that created a few words.
link |
00:02:36.600
Completely replacing it feels not exactly exciting to me.
link |
00:02:41.280
Although in my lifetime, I think, well,
link |
00:02:43.780
I mean, given the trajectory,
link |
00:02:45.520
I think it's possible that perhaps
link |
00:02:48.020
there could be interesting,
link |
00:02:49.880
maybe self play interviews as you're suggesting
link |
00:02:53.040
that would look or sound quite interesting
link |
00:02:56.160
and probably would educate
link |
00:02:57.720
or you could learn a topic through listening
link |
00:03:00.160
to one of these interviews at a basic level at least.
link |
00:03:03.200
So you said it doesn't seem exciting to you,
link |
00:03:04.800
but what if exciting is part of the objective function
link |
00:03:07.520
the thing is optimized over?
link |
00:03:09.120
So there's probably a huge amount of data of humans
link |
00:03:12.840
if you look correctly, of humans communicating online,
link |
00:03:16.080
and there's probably ways to measure the degree of,
link |
00:03:19.280
you know, as they talk about engagement.
link |
00:03:21.920
So you can probably optimize the question
link |
00:03:24.140
that has most often created an engaging conversation in the past.
link |
00:03:28.680
So actually, if you strictly use the word exciting,
link |
00:03:33.200
there is probably a way to create
link |
00:03:37.240
optimally exciting conversations
link |
00:03:40.320
that involve AI systems.
link |
00:03:42.160
At least one side is AI.
link |
00:03:44.600
Yeah, that makes sense, I think,
link |
00:03:46.560
maybe looping back a bit to games and the game industry,
link |
00:03:50.240
when you design algorithms,
link |
00:03:53.040
you're thinking about winning as the objective, right?
link |
00:03:55.800
Or the reward function.
link |
00:03:57.320
But in fact, when we discussed this with Blizzard,
link |
00:04:00.080
the creators of StarCraft in this case,
link |
00:04:02.320
I think what's exciting, fun,
link |
00:04:05.340
if you could measure that and optimize for that,
link |
00:04:09.160
that's probably why we play video games
link |
00:04:11.720
or why we interact or listen or look at cat videos
link |
00:04:14.640
or whatever on the internet.
link |
00:04:16.460
So it's true that modeling reward
link |
00:04:19.500
beyond the obvious reward functions
link |
00:04:21.320
we're used to in reinforcement learning
link |
00:04:23.720
is definitely very exciting.
link |
00:04:25.560
And again, there is some progress actually
link |
00:04:28.240
into a particular aspect of AI, which is quite critical,
link |
00:04:32.140
which is, for instance, is a conversation
link |
00:04:36.120
or is the information truthful, right?
link |
00:04:38.200
So you could start trying to evaluate these
link |
00:04:41.640
from excerpts from the internet, right?
link |
00:04:44.440
That has lots of information.
link |
00:04:45.840
And then if you can learn a function automated ideally,
link |
00:04:50.220
so you can also optimize it more easily,
link |
00:04:52.920
then you could actually have conversations
link |
00:04:54.920
that optimize for non-obvious things such as excitement.
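To make the idea concrete, here is a minimal sketch of learning such a function and using it to rank candidate questions. The toy data, the bag-of-words features, and the averaging rule are all illustrative assumptions, not anything described in the conversation.

```python
# Toy sketch: fit an "excitement" reward from labeled conversations,
# then use it to rank candidate questions. Data and features are made up.
from collections import Counter

labeled = [
    ("what is consciousness and can machines have it", 0.9),  # engagement score
    ("please state your name and date of birth", 0.1),
    ("what scares you most about the future of AI", 0.8),
    ("list the contents of this directory", 0.2),
]

def features(text):
    # Bag-of-words features: just the words in the question.
    return Counter(text.split())

# Fit a trivial reward model: each word gets the average score of the
# labeled questions it appeared in.
word_scores = {}
for text, score in labeled:
    for w in features(text):
        word_scores.setdefault(w, []).append(score)
reward_weights = {w: sum(s) / len(s) for w, s in word_scores.items()}

def predicted_excitement(question):
    words = question.split()
    return sum(reward_weights.get(w, 0.5) for w in words) / len(words)

candidates = ["what do you fear about AI", "state your name"]
print(max(candidates, key=predicted_excitement))
```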
link |
00:04:59.440
So yeah, that's quite possible.
link |
00:05:01.100
And then I would say in that case,
link |
00:05:03.620
it would definitely be a fun exercise
link |
00:05:05.960
and quite unique to have at least one side
link |
00:05:08.120
that is fully driven by an excitement reward function.
link |
00:05:12.840
But obviously, there would be still quite a lot of humanity
link |
00:05:16.960
in the system, both from who is building the system,
link |
00:05:20.800
of course, and also, ultimately,
link |
00:05:23.600
if we think of labeling for excitement,
link |
00:05:26.040
that those labels must come from us
link |
00:05:28.480
because it's just hard to have a computational measure
link |
00:05:32.560
of excitement.
link |
00:05:33.520
As far as I understand, there's no such thing.
link |
00:05:36.160
Well, as you mentioned truth also,
link |
00:05:39.280
I would actually venture to say that excitement
link |
00:05:41.840
is easier to label than truth,
link |
00:05:44.160
or perhaps has lower consequences of failure.
link |
00:05:49.920
But there is perhaps the humanness that you mentioned,
link |
00:05:55.760
that's perhaps part of a thing that could be labeled.
link |
00:05:58.280
And that could mean an AI system that's doing dialogue,
link |
00:06:02.520
that's doing conversations should be flawed, for example.
link |
00:06:07.720
Like that's the thing you optimize for,
link |
00:06:09.440
which is have inherent contradictions by design,
link |
00:06:13.280
have flaws by design.
link |
00:06:15.080
Maybe it also needs to have a strong sense of identity.
link |
00:06:18.760
So it has a backstory it told itself that it sticks to.
link |
00:06:22.680
It has memories, not in terms of how the system is designed,
link |
00:06:26.900
but it's able to tell stories about its past.
link |
00:06:30.440
It's able to have mortality and fear of mortality
link |
00:06:36.040
in the following way that it has an identity.
link |
00:06:39.080
And if it says something stupid
link |
00:06:41.200
and gets canceled on Twitter, that's the end of that system.
link |
00:06:44.680
So it's not like you get to rebrand yourself.
link |
00:06:47.320
That system is, that's it.
link |
00:06:49.320
So maybe the high stakes nature of it,
link |
00:06:52.080
because you can't say anything stupid now,
link |
00:06:54.520
or because you'd be canceled on Twitter.
link |
00:06:57.680
And there's stakes to that.
link |
00:06:59.720
And that, I think, is part of the reason
link |
00:07:01.120
that makes it interesting.
link |
00:07:03.480
And then you have a perspective,
link |
00:07:04.680
like you've built up over time that you stick with,
link |
00:07:07.680
and then people can disagree with you.
link |
00:07:09.100
So holding that perspective strongly,
link |
00:07:11.760
holding sort of maybe a controversial,
link |
00:07:14.000
at least a strong opinion.
link |
00:07:16.280
All of those elements, it feels like they can be learned
link |
00:07:18.800
because it feels like there's a lot of data
link |
00:07:21.720
on the internet of people having an opinion.
link |
00:07:24.520
And then combine that with a metric of excitement,
link |
00:07:27.800
you can start to create something that,
link |
00:07:30.020
as opposed to trying to optimize
link |
00:07:31.680
for sort of grammatical clarity and truthfulness,
link |
00:07:38.120
the factual consistency over many sentences,
link |
00:07:42.000
you optimize for the humanness.
link |
00:07:45.320
And there's obviously data for humanness on the internet.
link |
00:07:48.860
So I wonder if there's a future where that's part,
link |
00:07:53.760
or I mean, I sometimes wonder that about myself.
link |
00:07:56.400
I'm a huge fan of podcasts,
link |
00:07:58.120
and I listen to some podcasts,
link |
00:08:00.760
and I think like, what is interesting about this?
link |
00:08:03.240
What is compelling?
link |
00:08:05.960
The same way you watch other games.
link |
00:08:07.440
Like you said, watching people play StarCraft,
link |
00:08:09.160
or watching Magnus Carlsen play chess.
link |
00:08:13.040
So I'm not a chess player,
link |
00:08:14.920
but it's still interesting to me.
link |
00:08:16.120
What is that?
link |
00:08:16.960
That's the stakes of it,
link |
00:08:19.440
maybe the end of a domination of a series of wins.
link |
00:08:23.400
I don't know, there's all those elements
link |
00:08:25.440
somehow connect to a compelling conversation.
link |
00:08:28.000
And I wonder how hard is that to replace,
link |
00:08:30.200
because ultimately all of that connects
link |
00:08:31.840
to the initial proposition of how to test,
link |
00:08:35.480
whether an AI is intelligent or not with the Turing test,
link |
00:08:38.640
which I guess my question comes from a place
link |
00:08:41.760
of the spirit of that test.
link |
00:08:43.680
Yes, I actually recall,
link |
00:08:45.440
I was just listening to our first podcast
link |
00:08:47.920
where we discussed the Turing test.
link |
00:08:50.380
So I would say from a neural network,
link |
00:08:54.760
AI builder perspective,
link |
00:08:57.640
usually you try to map
link |
00:09:01.360
many of these interesting topics you discuss to benchmarks,
link |
00:09:05.200
and then also to actual architectures
link |
00:09:08.140
on how these systems are currently built,
link |
00:09:10.640
how they learn, what data they learn from,
link |
00:09:13.080
what are they learning, right?
link |
00:09:14.300
We're talking about weights of a mathematical function,
link |
00:09:17.800
and then looking at the current state of the game,
link |
00:09:21.560
maybe what leaps forward we need
link |
00:09:26.000
to get to the ultimate stage of all these experiences,
link |
00:09:30.660
lifetime experience, fears,
link |
00:09:32.880
like words where currently,
link |
00:09:34.800
we're barely seeing progress
link |
00:09:38.020
just because what's happening today
link |
00:09:40.120
is you take all these human interactions,
link |
00:09:44.020
it's a vast variety of human interactions online,
link |
00:09:47.960
and then you're distilling these sequences, right?
link |
00:09:51.640
Going back to my passion,
link |
00:09:53.000
like sequences of words, letters, images, sound,
link |
00:09:56.920
there's more modalities here to be at play.
link |
00:09:59.840
And then you're trying to just learn a function
link |
00:10:03.360
that will be happy,
link |
00:10:04.400
that maximizes the likelihood of seeing all these
link |
00:10:08.840
through a neural network.
link |
00:10:10.880
Now, I think there's a few places
link |
00:10:14.200
where the way currently we train these models
link |
00:10:17.240
would clearly lack the ability to develop
link |
00:10:20.000
the kinds of capabilities you describe.
link |
00:10:22.120
I'll tell you maybe a couple.
link |
00:10:23.560
One is the lifetime of an agent or a model.
link |
00:10:27.640
So you learn from this data offline, right?
link |
00:10:30.820
So you're just passively observing and maximizing these,
link |
00:10:33.880
it's almost like mountains,
link |
00:10:35.360
like a landscape of mountains,
link |
00:10:37.600
and then everywhere there's data
link |
00:10:39.140
that humans interacted in this way,
link |
00:10:41.040
you're trying to make that higher
link |
00:10:43.000
and then lower where there's no data.
link |
00:10:45.720
And then these models generally
link |
00:10:48.480
don't then experience themselves.
link |
00:10:51.160
They just are observers, right?
link |
00:10:52.520
They're passive observers of the data.
link |
00:10:54.600
And then we're putting them to then generate data
link |
00:10:57.440
when we interact with them,
link |
00:10:59.180
but that's very limiting.
link |
00:11:00.900
The experience they actually experience
link |
00:11:03.480
when they could maybe be optimizing
link |
00:11:05.680
or further optimizing the weights,
link |
00:11:07.440
we're not even doing that.
link |
00:11:08.640
So to be clear, and again, mapping to AlphaGo, AlphaStar,
link |
00:11:14.080
we train the model.
link |
00:11:15.280
And when we deploy it to play against humans,
link |
00:11:18.260
or in this case interact with humans,
link |
00:11:20.400
like language models,
link |
00:11:21.840
they don't even keep training, right?
link |
00:11:23.560
They're not learning in the sense of the weights
link |
00:11:26.220
that you've learned from the data,
link |
00:11:28.240
they don't keep changing.
link |
00:11:29.820
Now, there's something that feels a bit more magical,
link |
00:11:33.540
but it's understandable if you're into neural nets,
link |
00:11:36.240
which is, well, they might not learn
link |
00:11:39.180
in the strict sense of the words,
link |
00:11:40.520
the weights changing,
link |
00:11:41.520
maybe that's mapping to how neurons interconnect
link |
00:11:44.400
and how we learn over our lifetime.
link |
00:11:46.680
But it's true that the context of the conversation
link |
00:11:50.320
that takes place when you talk to these systems,
link |
00:11:55.020
it's held in their working memory, right?
link |
00:11:57.280
It's almost like you start the computer,
link |
00:12:00.160
it has a hard drive that has a lot of information,
link |
00:12:02.880
you have access to the internet,
link |
00:12:04.040
which has probably all the information,
link |
00:12:06.360
but there's also a working memory
link |
00:12:08.520
where these agents, as we call them,
link |
00:12:11.120
or start calling them, build upon.
link |
00:12:13.880
Now, this memory is very limited.
link |
00:12:16.640
I mean, right now we're talking, to be concrete,
link |
00:12:19.240
about 2,000 words that we hold,
link |
00:12:21.780
and then beyond that, we start forgetting what we've seen.
link |
00:12:24.880
So you can see that there's some short term coherence
link |
00:12:28.080
already, right, with what you said.
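A minimal sketch of the two points being made here: at deployment the weights are frozen, and the only thing that changes is a bounded, rolling working memory of recent tokens. The class, the 2,048-token limit, and the placeholder reply are illustrative assumptions, not how any particular system is implemented.

```python
# Sketch: a deployed model's weights never change; only its rolling
# context does. The 2,048-token limit is illustrative of the era's
# typical context sizes, roughly the "2,000 words" mentioned above.
MAX_CONTEXT = 2048

class DeployedModel:
    def __init__(self, weights):
        self.weights = weights   # learned offline; never updated here
        self.context = []        # working memory: the only mutable state

    def observe(self, tokens):
        self.context.extend(tokens)
        # Older tokens fall off the window: the model "forgets" them.
        self.context = self.context[-MAX_CONTEXT:]

    def respond(self):
        # A real system would run the transformer over self.context;
        # the point is that self.weights stays fixed throughout.
        return f"(reply conditioned on last {len(self.context)} tokens)"

m = DeployedModel(weights="frozen parameters")
m.observe(["what's", "your", "name", "?"])
print(m.respond())
```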
link |
00:12:29.880
I mean, it's a very interesting topic.
link |
00:12:32.340
Having sort of a mapping for an agent to have consistency,
link |
00:12:37.440
then if you say, oh, what's your name,
link |
00:12:40.800
it could remember that,
link |
00:12:42.280
but then it might forget beyond 2,000 words,
link |
00:12:45.020
which is not that long of context
link |
00:12:47.520
if we think even these podcasts or books are much longer.
link |
00:12:51.800
So technically speaking, there's a limitation there,
link |
00:12:55.160
super exciting from people that work on deep learning
link |
00:12:58.220
to be working on, but I would say we lack maybe benchmarks
link |
00:13:03.080
and the technology to have this lifetime-like experience
link |
00:13:07.880
of memory that keeps building up.
link |
00:13:10.900
However, the way it learns offline
link |
00:13:13.200
is clearly very powerful, right?
link |
00:13:14.920
So if you'd asked me three years ago, I would say,
link |
00:13:17.840
oh, we're very far.
link |
00:13:18.680
I think we've seen the power of this imitation,
link |
00:13:22.240
again, on the internet scale that has enabled this
link |
00:13:26.280
to feel like at least the knowledge,
link |
00:13:28.800
the basic knowledge about the world now
link |
00:13:30.720
is incorporated into the weights,
link |
00:13:33.160
but then this experience is lacking.
link |
00:13:36.600
And in fact, as I said, we don't even train them
link |
00:13:39.360
when we're talking to them,
link |
00:13:41.200
other than their working memory, of course, is affected.
link |
00:13:44.800
So that's the dynamic part,
link |
00:13:46.600
but they don't learn in the same way
link |
00:13:48.300
that you and I have learned, right?
link |
00:13:50.640
From basically when we were born and probably before.
link |
00:13:54.080
So lots of fascinating, interesting questions you asked there.
link |
00:13:57.440
I think the one I mentioned is this idea of memory
link |
00:14:01.720
and experience versus just kind of observe the world
link |
00:14:05.540
and learn its knowledge, which I think for that,
link |
00:14:08.040
I would argue lots of recent advancements
link |
00:14:10.400
that make me very excited about the field.
link |
00:14:13.480
And then the second maybe issue that I see is
link |
00:14:18.240
all these models, we train them from scratch.
link |
00:14:21.320
That's something I would have complained three years ago
link |
00:14:24.100
or six years ago or 10 years ago.
link |
00:14:26.480
And it feels if we take inspiration from how we got here,
link |
00:14:31.440
how the universe evolved us and we keep evolving,
link |
00:14:35.340
it feels that is a missing piece,
link |
00:14:37.920
that we should not be training models from scratch
link |
00:14:41.400
every few months,
link |
00:14:42.560
that there should be some sort of way
link |
00:14:45.320
in which we can grow models much like as a species
link |
00:14:49.040
and many other elements in the universe
link |
00:14:51.560
are building from the previous sort of iterations.
link |
00:14:55.080
And that from a just purely neural network perspective,
link |
00:14:59.600
even though we would like to make it work,
link |
00:15:02.360
it's proven very hard to not throw away
link |
00:15:06.300
the previous weights, right?
link |
00:15:07.720
This landscape we learn from the data
link |
00:15:09.720
and refresh it with a brand new set of weights,
link |
00:15:13.400
given maybe a recent snapshot of these data sets
link |
00:15:17.020
we train on, et cetera, or even a new game we're learning.
link |
00:15:20.000
So that feels like something is missing fundamentally.
link |
00:15:24.200
We might find it, but it's not very clear
link |
00:15:27.480
what it will look like.
link |
00:15:28.460
There's many ideas and it's super exciting as well.
link |
00:15:30.860
Yes, just for people who don't know,
link |
00:15:32.480
when you're approaching a new problem in machine learning,
link |
00:15:35.760
you're going to come up with an architecture
link |
00:15:38.240
that has a bunch of weights
link |
00:15:41.000
and then you initialize them somehow,
link |
00:15:43.400
which in most cases is some version of random.
link |
00:15:47.320
So that's what you mean by starting from scratch.
link |
00:15:49.020
And it seems like it's a waste every time you solve
link |
00:15:54.480
the game of Go and chess, StarCraft, protein folding,
link |
00:15:59.720
like surely there's some way to reuse the weights
link |
00:16:03.200
as we grow this giant database of neural networks
link |
00:16:08.400
that have solved some of the toughest problems in the world.
link |
00:16:10.760
And so some of that is, what is that?
link |
00:16:15.240
Methods, how to reuse weights,
link |
00:16:19.080
how to learn, extract what's generalizable
link |
00:16:22.480
or at least has a chance to be
link |
00:16:25.160
and throw away the other stuff.
link |
00:16:27.840
And maybe the neural network itself
link |
00:16:29.580
should be able to tell you that.
link |
00:16:31.640
Like what, yeah, how do you,
link |
00:16:34.400
what ideas do you have for better initialization of weights?
link |
00:16:37.520
Maybe stepping back,
link |
00:16:38.760
if we look at the field of machine learning,
link |
00:16:41.720
but especially deep learning, right?
link |
00:16:44.040
At the core of deep learning,
link |
00:16:45.240
there's this beautiful idea that a single algorithm
link |
00:16:49.240
can solve any task, right?
link |
00:16:50.920
So it's been proven over and over
link |
00:16:54.400
with an ever-increasing set of benchmarks
link |
00:16:56.420
and things that were thought impossible
link |
00:16:58.580
that are being cracked by this basic principle
link |
00:17:01.960
that is you take a neural network of uninitialized weights,
link |
00:17:05.800
so like a blank computational brain,
link |
00:17:09.620
then you give it, in the case of supervised learning,
link |
00:17:12.580
ideally a lot of examples of,
link |
00:17:14.960
hey, here is what the input looks like
link |
00:17:17.120
and the desired output should look like this.
link |
00:17:19.560
I mean, image classification is a very clear example,
link |
00:17:22.360
images to maybe one of a thousand categories,
link |
00:17:25.560
that's what ImageNet is like,
link |
00:17:26.840
but many, many, if not all problems can be mapped this way.
link |
00:17:30.720
And then there's a generic recipe, right?
link |
00:17:33.840
That you can use.
link |
00:17:35.240
And this recipe with very little change,
link |
00:17:38.600
and I think that's the core of deep learning research, right?
link |
00:17:41.520
That what is the recipe that is universal?
link |
00:17:44.420
That for any new given task,
link |
00:17:46.400
I'll be able to use without thinking,
link |
00:17:48.460
without having to work very hard on the problem at stake.
link |
00:17:52.600
We have not found this recipe,
link |
00:17:54.400
but I think the field is excited to find fewer tweaks
link |
00:18:00.160
or tricks that people find when they work
link |
00:18:02.640
on important problems specific to those
link |
00:18:05.280
and more of a general algorithm, right?
link |
00:18:07.540
So at an algorithmic level,
link |
00:18:09.300
I would say we have something general already,
link |
00:18:11.780
which is this formula of training a very powerful model,
link |
00:18:14.520
a neural network on a lot of data.
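As a concrete instance of that formula, here is a minimal sketch under stated assumptions: randomly initialized weights, then gradient descent on input/output pairs. A tiny logistic regression stands in for the "very powerful model"; the shape of the recipe, not the scale, is the point.

```python
# Minimal instance of the generic recipe: random weights plus gradient
# descent on (input, output) pairs. The data is synthetic.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                            # inputs
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)  # desired outputs

w = rng.normal(size=3)             # the "blank computational brain"
lr = 0.1
for step in range(500):
    p = 1 / (1 + np.exp(-(X @ w)))  # forward pass
    grad = X.T @ (p - y) / len(y)   # gradient of the cross-entropy loss
    w -= lr * grad                  # the boring, universal update

print("accuracy:", ((p > 0.5) == y).mean())
```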
link |
00:18:17.000
And in many cases, you need some specificity
link |
00:18:21.200
to the actual problem you're solving,
link |
00:18:23.400
protein folding being such an important problem
link |
00:18:26.080
has some basic recipe that is learned from before, right?
link |
00:18:30.800
Like transformer models, graph neural networks,
link |
00:18:34.140
ideas coming from NLP, like something called BERT,
link |
00:18:38.600
that is a kind of loss that you can put in place
link |
00:18:41.280
to help. Knowledge distillation is another technique,
link |
00:18:45.460
right?
link |
00:18:46.300
So this is the formula.
link |
00:18:47.120
We still had to find some particular things
link |
00:18:50.560
that were specific to AlphaFold, right?
link |
00:18:53.600
That's very important because protein folding
link |
00:18:55.860
is such a high value problem that as humans,
link |
00:18:59.120
we should solve it no matter
link |
00:19:00.840
if we need to be a bit specific.
link |
00:19:02.880
And it's possible that some of these learnings
link |
00:19:04.940
will apply then to the next iteration of this recipe
link |
00:19:07.380
that deep learning is all about.
link |
00:19:09.340
But it is true that so far, the recipe is what's common,
link |
00:19:13.200
but the weights you generally throw away,
link |
00:19:15.880
which feels very sad.
link |
00:19:17.800
Although, maybe in the last,
link |
00:19:20.440
especially in the last two, three years,
link |
00:19:22.280
and when we last spoke,
link |
00:19:23.560
I mentioned this area of meta learning,
link |
00:19:25.560
which is the idea of learning to learn.
link |
00:19:28.560
That idea and some progress has been had starting,
link |
00:19:32.040
I would say, mostly from GPT-3 in the language domain only,
link |
00:19:36.040
in which you could conceive a model that is trained once.
link |
00:19:41.040
And then this model is not narrow in that it only knows
link |
00:19:44.720
how to translate a pair of languages or even a set of them,
link |
00:19:47.760
or it only knows how to assign sentiment to a sentence.
link |
00:19:51.520
These, actually, you could teach it by prompting,
link |
00:19:55.040
it's called, and this prompting is essentially
link |
00:19:56.880
just showing it a few more examples,
link |
00:19:59.920
almost like you do show input, output examples,
link |
00:20:03.040
algorithmically speaking to the process
link |
00:20:04.900
of creating this model.
link |
00:20:06.320
But now you're doing it through language,
link |
00:20:07.840
which is very natural way for us to learn from one another.
link |
00:20:11.080
I tell you, hey, you should do this new task.
link |
00:20:13.180
I'll tell you a bit more.
link |
00:20:14.600
Maybe you ask me some questions
link |
00:20:16.080
and now you know the task, right?
link |
00:20:17.840
You didn't need to retrain it from scratch.
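For illustration, here is roughly what such a few-shot prompt might look like. The task, the examples, and the `call_model` placeholder are assumptions for the sketch, not an actual API.

```python
# Sketch of few-shot prompting: the "teaching" is just text in the
# context window; no weights are updated.
prompt = """Classify the sentiment of each sentence.

Sentence: I loved every minute of it.
Sentiment: positive

Sentence: The service was painfully slow.
Sentiment: negative

Sentence: This podcast episode was fascinating.
Sentiment:"""

# response = call_model(prompt)  # hypothetical model call; the model
#                                # infers the task from the examples alone
```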
link |
00:20:20.320
And we've seen these magical moments almost
link |
00:20:24.080
in this way to do few-shot prompting through language
link |
00:20:26.960
in the language-only domain.
link |
00:20:28.560
And then in the last two years,
link |
00:20:30.960
we've seen these expanded to beyond language,
link |
00:20:34.640
adding vision, adding actions and games,
link |
00:20:38.040
lots of progress to be had.
link |
00:20:39.480
But this is maybe, if you ask me like about
link |
00:20:42.160
how are we gonna crack this problem?
link |
00:20:43.720
This is perhaps one way in which you have a single model.
link |
00:20:48.760
The problem of this model is it's hard to grow
link |
00:20:52.160
in weights or capacity,
link |
00:20:54.320
but the model is certainly so powerful
link |
00:20:56.400
that you can teach it some tasks, right?
link |
00:20:58.960
In this way that I teach you,
link |
00:21:00.600
I could teach you a new task now,
link |
00:21:02.000
if it were a text-based task
link |
00:21:05.120
or a vision-style classification task.
link |
00:21:08.440
But it still feels like more breakthroughs should be had,
link |
00:21:12.860
but it's a great beginning, right?
link |
00:21:14.040
We have a good baseline.
link |
00:21:15.440
We have an idea that this maybe is the way we want
link |
00:21:18.160
to benchmark progress towards AGI.
link |
00:21:20.800
And I think in my view, that's critical
link |
00:21:22.880
to always have a way to benchmark, and the community
link |
00:21:25.760
is sort of converging to these overall,
link |
00:21:27.840
which is good to see.
link |
00:21:29.240
And then this is actually what excites me
link |
00:21:33.520
in terms of also next steps for deep learning
link |
00:21:36.640
is how to make these models more powerful,
link |
00:21:39.120
how do you train them, how to grow them
link |
00:21:41.760
if they must grow, should they change their weights
link |
00:21:44.520
as you teach them tasks or not?
link |
00:21:46.120
There's some interesting questions, many to be answered.
link |
00:21:48.560
Yeah, you've opened the door
link |
00:21:49.760
to a bunch of questions I want to ask,
link |
00:21:52.320
but let's first return to your tweet
link |
00:21:55.720
and read it like Shakespeare.
link |
00:21:57.160
You wrote, Gato is not the end, it's the beginning.
link |
00:22:01.280
And then you wrote meow and then an emoji of a cat.
link |
00:22:06.200
So first two questions.
link |
00:22:07.740
First, can you explain the meow and the cat emoji?
link |
00:22:10.080
And second, can you explain what Gato is and how it works?
link |
00:22:13.680
Right, indeed.
link |
00:22:14.640
I mean, thanks for reminding me
link |
00:22:16.520
that we're all exposing on Twitter and.
link |
00:22:19.920
Permanently there.
link |
00:22:20.960
Yes, permanently there.
link |
00:22:21.920
One of the greatest AI researchers of all time,
link |
00:22:25.120
meow and cat emoji.
link |
00:22:27.200
Yes. There you go.
link |
00:22:28.280
Right, so.
link |
00:22:29.120
Can you imagine, like, Turing tweeting meow and cat,
link |
00:22:32.720
probably he would, probably would.
link |
00:22:34.360
Probably.
link |
00:22:35.200
So yeah, the tweet is important actually.
link |
00:22:38.020
You know, I put thought into the tweets, I hope people.
link |
00:22:40.720
Which part do you think?
link |
00:22:41.720
Okay, so there's three sentences.
link |
00:22:44.840
Gato is not the end, Gato is the beginning,
link |
00:22:48.640
meow, cat emoji.
link |
00:22:50.120
Okay, which is the important part?
link |
00:22:51.720
The meow, no, no.
link |
00:22:53.120
Definitely that it is the beginning.
link |
00:22:56.080
I mean, I probably was just explaining a bit
link |
00:23:00.340
where the field is going, but let me tell you about Gato.
link |
00:23:03.760
So first the name Gato comes from maybe a sequence
link |
00:23:08.120
of releases that DeepMind had that
link |
00:23:11.820
used animal names to name some of their models
link |
00:23:15.100
that are based on this idea of large sequence models.
link |
00:23:19.120
Initially they're only language,
link |
00:23:20.620
but we are expanding to other modalities.
link |
00:23:23.180
So we had, you know, we had Gopher, Chinchilla,
link |
00:23:28.800
these were language only.
link |
00:23:29.960
And then more recently we released Flamingo,
link |
00:23:32.720
which adds vision to the equation.
link |
00:23:35.460
And then Gato, which adds vision
link |
00:23:38.160
and then also actions in the mix, right?
link |
00:23:41.660
As we discussed, actually, actions,
link |
00:23:44.520
especially discrete actions like up, down, left, right.
link |
00:23:47.600
I just told you the actions, but they're words.
link |
00:23:49.520
So you can kind of see how actions naturally map
link |
00:23:52.800
to sequence modeling of words,
link |
00:23:54.560
which these models are very powerful.
link |
00:23:57.100
So Gato was named after, I believe,
link |
00:24:01.720
I can only go from memory, right?
link |
00:24:03.640
These, you know, these things always happen
link |
00:24:06.080
with an amazing team of researchers behind them.
link |
00:24:08.520
So before the release, we had a discussion
link |
00:24:12.200
about which animal would we pick, right?
link |
00:24:14.240
And I think because of the word general agent, right?
link |
00:24:18.380
And this is a property quite unique to Gato.
link |
00:24:21.920
We kind of were playing with the GA words
link |
00:24:24.760
and then, you know, Gato.
link |
00:24:26.040
Rhymes with cat.
link |
00:24:26.960
Yes.
link |
00:24:28.080
And gato is obviously the Spanish word for cat.
link |
00:24:30.280
I had nothing to do with it, although I'm from Spain.
link |
00:24:32.280
Oh, how do you, wait, sorry.
link |
00:24:33.320
How do you say cat in Spanish?
link |
00:24:34.700
Gato.
link |
00:24:35.540
Oh, gato, okay.
link |
00:24:36.360
Now it all makes sense.
link |
00:24:37.200
Okay, okay, I see, I see, I see.
link |
00:24:38.160
Now it all makes sense.
link |
00:24:39.120
Okay, so.
link |
00:24:39.960
How do you say meow in Spanish?
link |
00:24:40.840
No, that's probably the same.
link |
00:24:41.960
I think you say it the same way,
link |
00:24:44.440
but you write it as M, I, A, U.
link |
00:24:48.120
Okay, it's universal.
link |
00:24:49.240
Yes.
link |
00:24:50.080
All right, so then how does the thing work?
link |
00:24:51.680
So you said general is, so you said language, vision.
link |
00:24:57.520
And action. Action.
link |
00:24:59.240
How does this, can you explain
link |
00:25:01.840
what kind of neural networks are involved?
link |
00:25:04.240
What does the training look like?
link |
00:25:06.360
And maybe, what
link |
00:25:09.380
are some beautiful ideas within the system?
link |
00:25:11.840
Yeah, so maybe the basics of Gato
link |
00:25:16.060
are not that dissimilar from many, many works that came before.
link |
00:25:19.920
So here is where the sort of the recipe,
link |
00:25:22.880
I mean, hasn't changed too much.
link |
00:25:24.200
There is a transformer model
link |
00:25:25.600
that's the kind of neural network
link |
00:25:28.640
that essentially takes a sequence of modalities,
link |
00:25:33.320
observations that could be words,
link |
00:25:36.360
could be vision or could be actions.
link |
00:25:38.800
And then the objective that you train it to do
link |
00:25:42.120
when you train it is to predict what the next anything is.
link |
00:25:46.360
And anything means what's the next action.
link |
00:25:48.760
If this sequence that I'm showing you to train
link |
00:25:51.220
is a sequence of actions and observations,
link |
00:25:53.500
then you're predicting what's the next action
link |
00:25:55.600
and the next observation, right?
link |
00:25:57.100
So you think of these really as a sequence of bytes, right?
link |
00:26:00.880
So take any sequence of words,
link |
00:26:04.220
a sequence of interleaved words and images,
link |
00:26:07.000
a sequence of maybe observations that are images
link |
00:26:11.280
and moves in Atari up, down, left, right.
link |
00:26:14.280
And these you just think of them as bytes
link |
00:26:17.640
and you're modeling what the next byte is gonna be like.
link |
00:26:20.580
And you might interpret that as an action
link |
00:26:23.440
and then play it in a game,
link |
00:26:25.880
or you could interpret it as a word
link |
00:26:27.720
and then write it down
link |
00:26:29.120
if you're chatting with the system and so on.
link |
00:26:32.480
So Gato basically can be thought of as inputs:
link |
00:26:36.600
images, text, video, actions.
link |
00:26:41.480
It also actually inputs some sort of proprioception sensors
link |
00:26:45.800
from robotics because robotics is one of the tasks
link |
00:26:48.280
that it's been trained to do.
link |
00:26:49.860
And then at the output, similarly,
link |
00:26:51.920
it outputs words, actions.
link |
00:26:53.720
It does not output images, that's just by design,
link |
00:26:57.440
we decided not to go that way for now.
link |
00:27:00.880
That's also in part why it's the beginning
link |
00:27:02.760
because there's more to do clearly.
link |
00:27:04.920
But that's kind of what the Gato is,
link |
00:27:06.440
is this brain that essentially you give it any sequence
link |
00:27:09.200
of these observations and modalities
link |
00:27:11.940
and it outputs the next step.
link |
00:27:13.760
And then off you go, you feed the next step back in
link |
00:27:17.380
and predict the next one and so on.
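Here is a minimal sketch of that loop, with made-up token ranges and a stub in place of the transformer: one model, one next-token step, interpreted as a word or an action depending on where the integer falls. None of these values are Gato's real ones.

```python
# Sketch of the agent loop described above. Ranges and predict_next
# are illustrative stand-ins, not Gato's actual token layout or model.
TEXT_RANGE = range(0, 10_000)          # word tokens
IMAGE_RANGE = range(10_000, 20_000)    # image-patch tokens
ACTION_RANGE = range(20_000, 20_018)   # e.g. 18 discrete game actions

def predict_next(sequence):
    # Stand-in for the transformer's prediction of the next integer.
    return 20_003

sequence = [10_432, 10_433, 10_500]    # tokens of the current game frame
nxt = predict_next(sequence)
if nxt in ACTION_RANGE:
    print("act:", nxt - ACTION_RANGE.start)  # send the action to the game
elif nxt in TEXT_RANGE:
    print("say token", nxt)                  # or decode it as a word
sequence.append(nxt)                         # feed it back in and repeat
```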
link |
00:27:20.060
Now, it is more than a language model
link |
00:27:24.160
because even though you can chat with Gato,
link |
00:27:26.780
like you can chat with Chinchilla or Flamingo,
link |
00:27:30.520
it also is an agent, right?
link |
00:27:33.200
So that's why we call it the A of Gato,
link |
00:27:37.200
like the letter A and also it's general.
link |
00:27:41.340
It's not an agent that's been trained to be good
link |
00:27:43.960
at only StarCraft or only Atari or only Go.
link |
00:27:47.860
It's been trained on a vast variety of datasets.
link |
00:27:51.640
What makes it an agent, if I may interrupt,
link |
00:27:53.840
the fact that it can generate actions?
link |
00:27:56.000
Yes, so when we call it, I mean, it's a good question, right?
link |
00:28:00.080
When do we call a model an agent?
link |
00:28:02.760
I mean, everything is a model,
link |
00:28:03.840
but what is an agent in my view is indeed the capacity
link |
00:28:07.360
to take actions in an environment that you then send to it
link |
00:28:11.680
and then the environment might return
link |
00:28:13.480
with a new observation
link |
00:28:15.040
and then you generate the next action.
link |
00:28:17.560
This actually, this reminds me of the question
link |
00:28:20.440
from the side of biology, what is life?
link |
00:28:23.000
Which is actually a very difficult question as well.
link |
00:28:25.380
What is living, what is living when you think about life
link |
00:28:29.200
here on this planet Earth?
link |
00:28:31.000
And a question interesting to me about aliens,
link |
00:28:33.420
what is life when we visit another planet?
link |
00:28:35.720
Would we be able to recognize it?
link |
00:28:37.200
And this feels like, it sounds perhaps silly,
link |
00:28:40.220
but I don't think it is.
link |
00:28:41.360
At which point is the neural network a being versus a tool?
link |
00:28:48.260
And it feels like action, ability to modify its environment
link |
00:28:52.400
is that fundamental leap.
link |
00:28:54.560
Yeah, I think it certainly feels like action
link |
00:28:57.440
is a necessary condition to be more alive,
link |
00:29:01.960
but probably not sufficient either.
link |
00:29:04.400
So sadly I...
link |
00:29:05.240
It's a soul consciousness thing, whatever.
link |
00:29:06.880
Yeah, yeah, we can get back to that later.
link |
00:29:09.080
But anyways, going back to the meow and the gato, right?
link |
00:29:12.320
So one of the leaps forward and what took the team a lot
link |
00:29:17.640
of effort and time was, as you were asking,
link |
00:29:21.280
how has Gato been trained?
link |
00:29:23.080
So I told you Gato is this transformer neural network,
link |
00:29:26.080
models actions, sequences of actions, words, et cetera.
link |
00:29:30.840
And then the way we train it is by essentially pooling
link |
00:29:35.520
data sets of observations, right?
link |
00:29:39.400
So it's a massive imitation learning algorithm
link |
00:29:42.600
that imitates, obviously, what
link |
00:29:45.320
is the next word that comes next from the usual data
link |
00:29:48.520
sets we use before, right?
link |
00:29:50.160
So these are these web scale style data sets of people
link |
00:29:54.600
writing on the web or chatting or whatnot, right?
link |
00:29:58.520
So that's an obvious source that we use on all language work.
link |
00:30:02.000
But then we also took a lot of agents
link |
00:30:05.640
that we have at DeepMind.
link |
00:30:06.720
I mean, as you know, DeepMind, we're quite interested
link |
00:30:10.920
in reinforcement learning and in learning agents
link |
00:30:14.960
that play in different environments.
link |
00:30:17.040
So we kind of created a data set of these trajectories,
link |
00:30:20.760
as we call them, or agent experiences.
link |
00:30:23.120
So in a way, there are other agents
link |
00:30:25.240
we train for a single-minded purpose to, let's say,
link |
00:30:29.560
control a 3D game environment and navigate a maze.
link |
00:30:33.320
So we had all the experience that
link |
00:30:35.320
was created through one agent interacting
link |
00:30:38.160
with that environment.
link |
00:30:39.480
And we added this to the data set, right?
link |
00:30:41.800
And as I said, we just see all the data,
link |
00:30:44.400
all these sequences of words or sequences
link |
00:30:46.440
of this agent interacting with that environment or agents
link |
00:30:51.120
playing Atari and so on.
link |
00:30:52.200
We see it as the same kind of data.
link |
00:30:54.920
And so we mix these data sets together.
link |
00:30:57.440
And we train Gato.
link |
00:31:00.120
That's the G part, right?
link |
00:31:01.600
It's general because it really is mixed.
link |
00:31:05.240
It doesn't have different brains for each modality
link |
00:31:07.560
or each narrow task.
link |
00:31:09.040
It has a single brain.
link |
00:31:10.480
It's not that big of a brain compared
link |
00:31:12.120
to most of the neural networks we see these days.
link |
00:31:14.760
It has 1 billion parameters.
link |
00:31:18.200
Some models we're seeing get in the trillions these days.
link |
00:31:21.120
And certainly, 100 billion feels like a size
link |
00:31:25.080
that is very common when you train these jobs.
link |
00:31:29.000
So the actual agent is relatively small.
link |
00:31:32.640
But it's been trained on a very challenging, diverse data set,
link |
00:31:36.280
not only containing all of the internet
link |
00:31:38.000
but containing all these agent experience playing
link |
00:31:40.680
very different, distinct environments.
link |
00:31:43.240
So this brings us to the part of the tweet of this
link |
00:31:46.720
is not the end, it's the beginning.
link |
00:31:48.880
It feels very cool to see Gato, in principle,
link |
00:31:53.120
is able to control any sort of environment, especially
link |
00:31:57.360
the ones that it's been trained to do, these 3D games, Atari
link |
00:32:00.360
games, all sorts of robotics tasks, and so on.
link |
00:32:04.600
But obviously, it's not as proficient
link |
00:32:07.760
as the teachers it learned from on these environments.
link |
00:32:10.520
Not obvious.
link |
00:32:11.680
It's not obvious that it wouldn't be more proficient.
link |
00:32:15.040
It's just the current beginning part
link |
00:32:17.960
is that the performance is such that it's not as good
link |
00:32:21.760
as if it's specialized to that task.
link |
00:32:23.360
Right.
link |
00:32:23.840
So it's not as good, although I would argue size matters here.
link |
00:32:28.040
In fact, I would argue size always matters.
link |
00:32:31.360
That's a different conversation.
link |
00:32:33.320
But for neural networks, certainly size does matter.
link |
00:32:36.200
So it's the beginning because it's relatively small.
link |
00:32:39.600
So obviously, scaling this idea up
link |
00:32:42.560
might make the connections that exist between text
link |
00:32:48.240
on the internet and playing Atari and so on more
link |
00:32:51.560
synergistic with one another.
link |
00:32:53.320
And you might gain.
link |
00:32:54.240
And that moment, we didn't quite see.
link |
00:32:56.320
But obviously, that's why it's the beginning.
link |
00:32:58.600
That synergy might emerge with scale.
link |
00:33:00.920
Right, might emerge with scale.
link |
00:33:02.160
And also, I believe there's some new research or ways
link |
00:33:05.160
in which you prepare the data that you
link |
00:33:08.480
might need to make it more clear to the model
link |
00:33:11.560
that you're not only playing Atari,
link |
00:33:14.120
and you start from a screen.
link |
00:33:16.240
And here is up and a screen and down.
link |
00:33:18.400
Maybe you can think of playing Atari
link |
00:33:20.640
as there's some sort of context that is needed for the agent
link |
00:33:23.880
before it starts seeing, oh, this is an Atari screen.
link |
00:33:26.920
I'm going to start playing.
link |
00:33:28.640
You might require, for instance, to be told in words,
link |
00:33:33.360
hey, in this sequence that I'm showing,
link |
00:33:36.840
you're going to be playing an Atari game.
link |
00:33:39.080
So text might actually be a good driver to enhance the data.
link |
00:33:44.400
So then these connections might be made more easily.
link |
00:33:46.960
So that's an idea that we start seeing in language.
link |
00:33:51.200
But obviously, beyond language too, this is going to be effective.
link |
00:33:55.080
It's not like I show you a screen
link |
00:33:57.400
and you, from scratch, you're supposed to learn a game.
link |
00:34:01.000
There is a lot of context we might set.
link |
00:34:03.360
So there might be some work needed as well
link |
00:34:05.840
to set that context.
link |
00:34:07.720
But anyways, there's a lot of work.
link |
00:34:10.640
So that context puts all the different modalities
link |
00:34:13.520
on the same level ground to provide the context best.
link |
00:34:16.640
So maybe on that point, so there's
link |
00:34:19.880
this task, which may not seem trivial, of tokenizing the data,
link |
00:34:25.480
of converting the data into pieces,
link |
00:34:28.480
into basic atomic elements that then could cross modalities
link |
00:34:34.400
somehow.
link |
00:34:35.240
So what's tokenization?
link |
00:34:37.840
How do you tokenize text?
link |
00:34:39.640
How do you tokenize images?
link |
00:34:42.160
How do you tokenize games and actions and robotics tasks?
link |
00:34:47.080
Yeah, that's a great question.
link |
00:34:48.320
So tokenization is the entry point
link |
00:34:52.840
to actually make all the data look like a sequence,
link |
00:34:55.600
because tokens then are just these little puzzle pieces.
link |
00:34:59.480
We break down anything into these puzzle pieces,
link |
00:35:01.760
and then we just model, what's this puzzle look like when
link |
00:35:05.560
you make it lay down in a line, so to speak, in a sequence?
link |
00:35:09.600
So in Gato, the text, there's a lot of work.
link |
00:35:15.440
You tokenize text usually by looking
link |
00:35:17.400
at commonly used substrings, right?
link |
00:35:20.040
So ING in English is a very common substring,
link |
00:35:23.680
so that becomes a token.
link |
00:35:25.480
It's quite a well-studied problem, tokenizing text.
link |
00:35:29.080
And Gato just used the standard techniques
link |
00:35:31.560
that have been developed over many years,
link |
00:35:34.280
even starting from n-gram models in the 1950s and so on.
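As a toy illustration of the common-substrings idea, here is a crude greedy longest-match tokenizer over a hand-picked vocabulary. Real systems learn the vocabulary from data, for example with byte-pair encoding; nothing here is the tokenizer Gato actually used.

```python
# Crude longest-match subword tokenizer over a toy vocabulary, to show
# how common substrings like "ing" become single tokens.
vocab = ["learn", "ing", "play", "the", "er", "s", " "]

def tokenize(text):
    tokens, i = [], 0
    while i < len(text):
        # Prefer the longest vocabulary entry matching at position i;
        # fall back to the single character if nothing matches.
        match = next((v for v in sorted(vocab, key=len, reverse=True)
                      if text.startswith(v, i)), text[i])
        tokens.append(match)
        i += len(match)
    return tokens

print(tokenize("learning the players"))
# ['learn', 'ing', ' ', 'the', ' ', 'play', 'er', 's']
```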
link |
00:35:38.000
Just for context, how many tokens,
link |
00:35:40.440
what order of magnitude, number of tokens
link |
00:35:42.640
is required for a word, usually?
link |
00:35:45.120
What are we talking about?
link |
00:35:46.200
Yeah, for a word in English, I mean,
link |
00:35:48.880
every language is very different.
link |
00:35:51.120
The current level or granularity of tokenization
link |
00:35:53.920
generally means it's maybe two to five.
link |
00:35:57.840
I mean, I don't know the statistics exactly,
link |
00:36:00.200
but to give you an idea, we don't tokenize
link |
00:36:03.000
at the level of letters.
link |
00:36:04.160
Then it would probably be, I don't
link |
00:36:05.720
know what the average length of a word is in English,
link |
00:36:08.080
but that would be the minimum set of tokens you could use.
link |
00:36:11.400
It was bigger than letters, smaller than words.
link |
00:36:13.200
Yes, yes.
link |
00:36:13.880
And you could think of very, very common words like the.
link |
00:36:16.840
I mean, that would be a single token,
link |
00:36:18.760
but very quickly you're talking two, three, four tokens or so.
link |
00:36:22.360
Have you ever tried to tokenize emojis?
link |
00:36:24.680
Emojis are actually just sequences of letters, so.
link |
00:36:30.080
Maybe to you, but to me they mean so much more.
link |
00:36:33.000
Yeah, you can render the emoji, but you
link |
00:36:35.080
might if you actually just.
link |
00:36:36.840
Yeah, this is a philosophical question.
link |
00:36:39.360
Is an emoji an image or text?
link |
00:36:43.320
The way we do these things is they're actually
link |
00:36:46.640
mapped to small sequences of characters.
link |
00:36:49.520
So you can actually play with these models
link |
00:36:52.600
and input emojis, it will output emojis back,
link |
00:36:55.760
which is actually quite a fun exercise.
link |
00:36:57.960
You probably can find other tweets about these out there.
link |
00:37:02.240
But yeah, so anyways, text.
link |
00:37:04.440
It's very clear how this is done.
link |
00:37:06.720
And then in Gato, what we did for images
link |
00:37:10.560
is we map images, essentially we compress images,
link |
00:37:14.880
so to speak, into something that looks less
link |
00:37:19.120
like every pixel with every intensity.
link |
00:37:21.320
That would mean we have a very long sequence, right?
link |
00:37:23.840
Like if we were talking about 100 by 100 pixel images,
link |
00:37:27.320
that would make the sequences far too long.
link |
00:37:30.000
So what was done there is you just
link |
00:37:32.520
use a technique that essentially compresses an image
link |
00:37:35.760
into maybe 16 by 16 patches of pixels,
link |
00:37:40.160
and then that is mapped, again, tokenized.
link |
00:37:42.720
You just essentially quantize this space
link |
00:37:45.360
into a special word that actually
link |
00:37:48.720
maps to this little sequence of pixels.
link |
00:37:51.720
And then you put the pixels together in some raster order,
link |
00:37:55.120
and then that's how you get out or in the image
link |
00:37:59.360
that you're processing.
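A minimal sketch of that pipeline, under stated assumptions: a toy grayscale image is cut into 16 by 16 patches in raster order, and each patch is snapped to its nearest entry in a codebook. The random codebook stands in for one derived from image statistics, and the 10,000 offset is the illustrative start of the image-token range discussed later.

```python
# Sketch of image tokenization as described: 16x16 patches, quantized
# to the nearest codebook entry. The codebook here is random, standing
# in for one learned from the statistics of real images.
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((64, 64))             # toy grayscale image
P = 16                                   # patch size
codebook = rng.random((512, P * P))      # 512 "visual words"

tokens = []
for r in range(0, 64, P):
    for c in range(0, 64, P):            # raster order, as in the text
        patch = image[r:r+P, c:c+P].reshape(-1)
        dists = ((codebook - patch) ** 2).sum(axis=1)
        tokens.append(10_000 + int(dists.argmin()))  # offset: image range

print(tokens)  # 16 integers now stand in for 4,096 pixels, lossily
```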
link |
00:38:00.720
But there's no semantic aspect to that,
link |
00:38:04.040
so you're doing some kind of,
link |
00:38:05.840
you don't need to understand anything about the image
link |
00:38:07.760
in order to tokenize it currently.
link |
00:38:09.640
No, you're only using this notion of compression.
link |
00:38:12.600
So you're trying to find common,
link |
00:38:15.080
it's like JPEG or all these algorithms.
link |
00:38:17.640
It's actually very similar at the tokenization level.
link |
00:38:20.520
All we're doing is finding common patterns
link |
00:38:23.320
and then making sure in a lossy way we compress these images
link |
00:38:27.200
given the statistics of the images
link |
00:38:29.480
that are contained in all the data we deal with.
link |
00:38:31.840
Although you could probably argue that JPEG
link |
00:38:34.200
does have some understanding of images.
link |
00:38:38.720
Because visual information, maybe color,
link |
00:38:44.000
compressing crudely based on color
link |
00:38:46.920
does capture something important about an image
link |
00:38:51.160
that's about its meaning, not just about some statistics.
link |
00:38:54.640
Yeah, I mean, JPEG, as I said,
link |
00:38:56.640
the algorithms actually look very similar;
link |
00:38:58.640
they use the cosine transform in JPEG.
link |
00:39:04.120
The approach we usually do in machine learning
link |
00:39:07.120
when we deal with images and we do this quantization step
link |
00:39:10.120
is a bit more data driven.
link |
00:39:11.400
So rather than have some sort of Fourier basis
link |
00:39:14.120
for how frequencies appear in the natural world,
link |
00:39:18.880
we actually just use the statistics of the images
link |
00:39:23.840
and then quantize them based on the statistics,
link |
00:39:27.000
much like you do in words, right?
link |
00:39:28.280
So common substrings are allocated a token
link |
00:39:32.400
and with images it's very similar.
link |
00:39:34.400
But there's no connection.
link |
00:39:36.960
The token space, if you think of,
link |
00:39:39.240
oh, like the tokens are integers
link |
00:39:41.080
at the end of the day.
link |
00:39:42.440
So now like we work on, maybe we have about,
link |
00:39:46.200
let's say, I don't know the exact numbers,
link |
00:39:48.000
but let's say 10,000 tokens for text, right?
link |
00:39:51.160
Certainly more than characters
link |
00:39:52.840
because we have groups of characters and so on.
link |
00:39:55.320
So from one to 10,000, those are representing
link |
00:39:58.280
all the language and the words we'll see.
link |
00:40:01.000
And then images occupy the next set of integers.
link |
00:40:04.160
So they're completely independent, right?
link |
00:40:05.800
So from 10,001 to 20,000,
link |
00:40:08.920
those are the tokens that represent
link |
00:40:10.640
these other modality images.
link |
00:40:12.760
And that is an interesting aspect that makes it orthogonal.
link |
00:40:18.640
So what connects these concepts is the data, right?
link |
00:40:21.600
Once you have a data set,
link |
00:40:23.760
for instance, that captions images that tells you,
link |
00:40:26.880
oh, this is someone playing a frisbee on a green field.
link |
00:40:30.480
Now the model will need to predict the tokens
link |
00:40:34.560
from the text green field to then the pixels.
link |
00:40:37.800
And that will start making the connections
link |
00:40:39.760
between the tokens.
link |
00:40:40.600
So these connections happen as the algorithm learns.
link |
00:40:43.640
And then the last, if we think of these integers,
link |
00:40:45.840
the first few are words, the next few are images.
link |
00:40:48.760
In Gato, we also allocated the highest order of integers
link |
00:40:55.240
to actions, right?
link |
00:40:56.240
Which we discretize and actions are very diverse, right?
link |
00:40:59.920
In Atari, there's, I don't know, 17 discrete actions.
link |
00:41:04.120
In robotics, actions might be torques
link |
00:41:06.960
and forces that we apply.
link |
00:41:08.240
So we just use kind of similar ideas
link |
00:41:11.200
to compress these actions into tokens.
link |
00:41:14.320
And then we just, that's how we map now
link |
00:41:18.000
all the space to these sequence of integers.
link |
00:41:20.800
But they occupy different space
link |
00:41:22.480
and what connects them is then the learning algorithm.
link |
00:41:24.840
That's where the magic happens.
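A small sketch of that shared integer space, using the rough block sizes from the conversation; the helper names and exact sizes are just for illustration.

```python
# Sketch of the shared token space: each modality gets its own disjoint
# block of integer ids, so one sequence can interleave all of them.
TEXT_VOCAB, IMAGE_VOCAB, ACTION_VOCAB = 10_000, 10_000, 1_000

def text_token(i):   return i                             # 0..9,999
def image_token(i):  return TEXT_VOCAB + i                # 10,000..19,999
def action_token(i): return TEXT_VOCAB + IMAGE_VOCAB + i  # 20,000..20,999

def modality(token):
    # Decode which modality a raw integer belongs to.
    if token < TEXT_VOCAB:               return "text"
    if token < TEXT_VOCAB + IMAGE_VOCAB: return "image"
    return "action"

seq = [text_token(42), image_token(7), action_token(3)]
print([modality(t) for t in seq])   # ['text', 'image', 'action']
```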
link |
00:41:26.320
So the modalities are orthogonal
link |
00:41:28.840
to each other in token space.
link |
00:41:30.760
So in the input, everything you add, you add extra tokens.
link |
00:41:35.760
And then you're shoving all of that into one place.
link |
00:41:40.440
Yes, the transformer.
link |
00:41:41.640
And that transformer, that transformer tries
link |
00:41:46.400
to look at this gigantic token space
link |
00:41:49.360
and tries to form some kind of representation,
link |
00:41:52.240
some kind of unique wisdom
link |
00:41:56.760
about all of these different modalities.
link |
00:41:59.240
How's that possible?
link |
00:42:02.120
If you were to sort of like put your psychoanalysis hat on
link |
00:42:06.520
and try to psychoanalyze this neural network,
link |
00:42:09.400
is it schizophrenic?
link |
00:42:11.760
Does it try to, given these very few weights,
link |
00:42:17.160
represent multiple disjoint things
link |
00:42:19.560
and somehow have them not interfere with each other?
link |
00:42:22.800
Or is it somehow building on the joint strength,
link |
00:42:27.960
on whatever is common to all the different modalities?
link |
00:42:31.800
Like what, if you were to ask a question,
link |
00:42:34.520
is it schizophrenic or is it of one mind?
link |
00:42:38.720
I mean, it is one mind and it's actually
link |
00:42:42.640
the simplest algorithm, which that's kind of in a way
link |
00:42:46.800
how it feels like the field hasn't changed
link |
00:42:49.840
since back propagation and gradient descent
link |
00:42:52.600
was proposed for learning neural networks.
link |
00:42:55.760
So there is obviously details on the architecture.
link |
00:42:58.720
This has evolved.
link |
00:42:59.640
The current iteration is still the transformer,
link |
00:43:03.080
which is a powerful sequence modeling architecture.
link |
00:43:07.440
But then the goal of this, you know,
link |
00:43:11.000
setting these weights to predict the data
link |
00:43:13.840
is essentially the same as what I described.
link |
00:43:17.240
I mean, we described a few years ago,
link |
00:43:18.680
AlphaStar, language modeling and so on, right?
link |
00:43:21.600
We take, let's say an Atari game,
link |
00:43:24.600
we map it to a string of numbers
link |
00:43:27.640
that will all be probably image space
link |
00:43:30.360
and action space interleaved.
link |
00:43:32.440
And all we're gonna do is say, okay, given the numbers,
link |
00:43:37.280
you know, 10,001, 10,004, 10,005,
link |
00:43:40.400
the next number that comes is 20,006,
link |
00:43:43.280
which is in the action space.
link |
00:43:45.400
And you're just optimizing these weights
link |
00:43:48.880
via very simple gradients.
link |
00:43:51.720
Like, you know, mathematically it's almost
link |
00:43:53.520
the most boring algorithm you could imagine.
link |
00:43:55.880
We set the weights so that
link |
00:43:57.800
given this particular instance,
link |
00:44:00.200
these weights are set to maximize the probability
link |
00:44:04.080
of having seen this particular sequence of integers
link |
00:44:07.280
for this particular game.
link |
00:44:09.120
And then the algorithm does this
link |
00:44:11.640
for many, many, many iterations,
link |
00:44:14.800
looking at different modalities, different games, right?
link |
00:44:17.920
That's the mixture of the dataset we discussed.
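As a sketch of that objective, assuming a PyTorch-style model that maps a token prefix to next-token logits (the names here are illustrative, not Gato's actual code):

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, tokens: torch.Tensor) -> torch.Tensor:
    """Cross-entropy next-token prediction over a mixed-modality stream.

    tokens: (batch, seq_len) integer ids mixing text, image, and action tokens.
    """
    logits = model(tokens[:, :-1])   # predict token t+1 from tokens up to t
    targets = tokens[:, 1:]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))

# Training is then just gradient descent over the dataset mixture:
#   loss = next_token_loss(model, batch)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```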
link |
00:44:20.480
So in a way, it's a very simple algorithm
link |
00:44:24.040
and the weights, right, they're all shared, right?
link |
00:44:27.560
So in terms of, is it focusing on one modality or not?
link |
00:44:30.920
The intermediate weights that are converting
link |
00:44:33.240
from this input of integers
link |
00:44:35.160
to the target integer you're predicting next,
link |
00:44:37.720
those weights certainly are common.
link |
00:44:40.320
And then the way that tokenization happens,
link |
00:44:43.400
there is a special place in the neural network,
link |
00:44:45.840
which is we map this integer, like number 10,001,
link |
00:44:49.800
to a vector of real numbers.
link |
00:44:51.920
Like real numbers, we can optimize them
link |
00:44:54.760
with gradient descent, right?
link |
00:44:56.120
The functions we learn
link |
00:44:57.080
are actually surprisingly differentiable.
link |
00:44:59.720
That's why we compute gradients.
link |
00:45:01.720
So this step is the only one
link |
00:45:03.920
that this orthogonality you mentioned applies.
link |
00:45:06.560
So mapping a certain token for text or image or actions,
link |
00:45:12.520
each of these tokens gets its own little vector
link |
00:45:15.040
of real numbers that represents this.
link |
00:45:17.200
If you look at the field back many years ago,
link |
00:45:19.560
people were talking about word vectors or word embeddings.
link |
00:45:23.480
These are the same.
link |
00:45:24.320
We have word vectors or embeddings.
link |
00:45:26.040
We have image vectors or embeddings
link |
00:45:28.880
and action vectors or embeddings.
link |
00:45:30.880
And the beauty here is that as you train this model,
link |
00:45:33.920
if you visualize these little vectors,
link |
00:45:36.640
it might be that they start aligning
link |
00:45:38.480
even though they're independent parameters.
link |
00:45:41.400
They could be anything,
link |
00:45:42.840
but then it might be that you take the word gato or cat,
link |
00:45:47.480
which maybe is common enough
link |
00:45:48.520
that it actually has its own token.
link |
00:45:50.200
And then you take pixels that have a cat
link |
00:45:52.400
and you might start seeing
link |
00:45:53.960
that these vectors look like they align, right?
link |
00:45:57.400
So by learning from this vast amount of data,
link |
00:46:00.640
the model is realizing the potential connections
link |
00:46:03.920
between these modalities.
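A small sketch of that embedding step, with hypothetical ids and sizes; the point is only that every integer gets its own learnable vector, whose cross-modal alignment can be probed after training:

```python
import torch

vocab_size, dim = 53_042, 512                    # made-up totals across modalities
embedding = torch.nn.Embedding(vocab_size, dim)  # one learnable vector per id

word_cat_id, image_cat_code = 2_317, 40_105      # hypothetical "cat" token ids
v_word = embedding(torch.tensor(word_cat_id))
v_image = embedding(torch.tensor(image_cat_code))

# Before training this is near zero on average; after training on enough
# paired data, it may drift upward if the modalities align.
similarity = torch.nn.functional.cosine_similarity(v_word, v_image, dim=0)
```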
link |
00:46:05.640
Now, I will say there will be another way,
link |
00:46:07.840
at least in part, to not have these different vectors
link |
00:46:13.160
for each different modality.
link |
00:46:15.520
For instance, when I tell you about actions
link |
00:46:18.360
in a certain space, I'm defining actions by words, right?
link |
00:46:22.800
So you could imagine a world in which I'm not learning
link |
00:46:26.520
that the action up in Atari is its own number.
link |
00:46:31.240
The action up in Atari maybe is literally the word
link |
00:46:34.400
or the sentence "up in Atari", right?
link |
00:46:37.320
And that would mean we now leverage
link |
00:46:39.400
much more from the language.
link |
00:46:41.040
This is not what we did here,
link |
00:46:42.520
but certainly it might make these connections
link |
00:46:45.680
much easier to learn and also to teach the model
link |
00:46:49.080
to correct its own actions and so on, right?
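A tiny sketch of that alternative, explicitly not what Gato did: reuse the text tokenizer for action names instead of allocating an opaque action id. The toy vocabulary is invented for illustration:

```python
# Toy vocabulary; a real system would use its full subword tokenizer.
VOCAB = {"up": 11, "in": 12, "atari": 13}

def tokenize_text(sentence: str) -> list[int]:
    return [VOCAB[w] for w in sentence.lower().split()]

opaque_action = [50_123]                        # dedicated id: meaning learned from scratch
action_as_words = tokenize_text("up in Atari")  # [11, 12, 13]: shares structure with language
```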
link |
00:46:51.280
So all this to say that Gato is indeed the beginning,
link |
00:46:55.840
that it is a radical idea to do things this way,
link |
00:46:59.400
but there's probably a lot more to be done
link |
00:47:02.320
and the results to be more impressive,
link |
00:47:04.440
not only through scale, but also through some new research
link |
00:47:07.920
that will come hopefully in the years to come.
link |
00:47:10.480
So just to elaborate quickly,
link |
00:47:12.280
you mean one possible next step
link |
00:47:16.680
or one of the paths that you might take next
link |
00:47:20.200
is doing the tokenization fundamentally
link |
00:47:25.200
as a kind of linguistic communication.
link |
00:47:28.240
So like you convert even images into language.
link |
00:47:31.320
So doing something like a crude semantic segmentation,
link |
00:47:35.520
trying to just assign a bunch of words to an image,
link |
00:47:38.360
like you have almost a dumb entity
link |
00:47:42.280
explaining as much as it can about the image.
link |
00:47:45.320
And so you convert that into words
link |
00:47:46.920
and then you convert games into words
link |
00:47:49.280
and then you provide the context in words and all of it.
link |
00:47:52.120
And eventually getting to a point
link |
00:47:56.320
where everybody agrees with Noam Chomsky
link |
00:47:58.080
that language is actually at the core of everything.
link |
00:48:00.920
That's it's the base layer of intelligence
link |
00:48:04.240
and consciousness and all that kind of stuff, okay.
link |
00:48:07.520
You mentioned early on like size, it's hard to grow.
link |
00:48:11.240
What did you mean by that?
link |
00:48:12.800
Because we're talking about scale might change.
link |
00:48:17.000
There might be, and we'll talk about this too,
link |
00:48:18.960
like there's emergence, there's certain things
link |
00:48:23.880
about these neural networks that are emergent.
link |
00:48:25.640
So certain like performance we can see only with scale
link |
00:48:28.960
and there's some kind of threshold of scale.
link |
00:48:30.960
So why is it hard to grow something like this Meow network?
link |
00:48:36.640
So the Meow network, it's not hard to grow
link |
00:48:41.120
if you retrain it.
link |
00:48:42.600
What's hard is, well, we have now 1 billion parameters.
link |
00:48:46.840
We train them for a while.
link |
00:48:48.120
We spend some amount of work towards building these weights
link |
00:48:53.120
that are an amazing initial brain
link |
00:48:55.840
for doing these kinds of tasks we care about.
link |
00:48:58.800
Could we reuse the weights and expand to a larger brain?
link |
00:49:03.880
And that is extraordinarily hard,
link |
00:49:06.680
but also exciting from a research perspective
link |
00:49:10.040
and a practical point of view, right?
link |
00:49:12.520
So there's this notion of modularity in software engineering
link |
00:49:17.520
and we're starting to see some examples
link |
00:49:20.360
and work that leverages modularity.
link |
00:49:23.160
In fact, if we go back one step from Gato
link |
00:49:26.200
to a work that, I would say, trained a much larger,
link |
00:49:29.560
much more capable network called Flamingo.
link |
00:49:32.400
Flamingo did not deal with actions,
link |
00:49:34.160
but it definitely dealt with images in an interesting way,
link |
00:49:38.280
kind of akin to what Gato did,
link |
00:49:40.120
but slightly different technique for tokenizing,
link |
00:49:42.840
but we don't need to go into that detail.
link |
00:49:45.280
But what Flamingo also did, which Gato didn't do,
link |
00:49:49.240
and that just happens because these projects,
link |
00:49:51.560
they're different, it's a bit like the exploratory nature
link |
00:49:55.760
of research, which is great.
link |
00:49:57.120
The research behind these projects is also modular.
link |
00:50:00.480
Yes, exactly.
link |
00:50:01.720
And it has to be, right?
link |
00:50:02.640
We need to have creativity
link |
00:50:05.480
and sometimes you need to protect pockets of people,
link |
00:50:09.120
researchers and so on.
link |
00:50:10.200
By we, you mean humans.
link |
00:50:11.760
Yes.
link |
00:50:12.720
And also in particular researchers
link |
00:50:14.480
and maybe even further DeepMind or other such labs.
link |
00:50:18.720
And then the neural networks themselves.
link |
00:50:20.880
So it's modularity all the way down.
link |
00:50:23.480
All the way down.
link |
00:50:24.320
So the way that we did modularity very beautifully
link |
00:50:27.400
in Flamingo is we took Chinchilla,
link |
00:50:30.000
which is a language only model, not an agent,
link |
00:50:33.440
if we think of actions being necessary for agency.
link |
00:50:36.600
So we took Chinchilla, we took the weights of Chinchilla
link |
00:50:40.840
and then we froze them.
link |
00:50:42.640
We said, these don't change.
link |
00:50:44.680
We train them to be very good at predicting the next word.
link |
00:50:47.400
It's a very good language model, state of the art
link |
00:50:50.120
at the time you release it, et cetera, et cetera.
link |
00:50:52.800
We're going to add a capability to see, right?
link |
00:50:55.360
We are going to add the ability to see
link |
00:50:56.800
to this language model.
link |
00:50:58.200
So we're going to attach small pieces of neural networks
link |
00:51:01.800
at the right places in the model.
link |
00:51:03.760
It's almost like I'm injecting the network
link |
00:51:07.760
with some weights and some substructures
link |
00:51:10.640
in a good way, right?
link |
00:51:12.760
So you need the research to say, what is effective?
link |
00:51:15.160
How do you add this capability
link |
00:51:16.600
without destroying others, et cetera.
link |
00:51:18.720
So we created a small subnetwork initialized
link |
00:51:24.280
not from random, but actually from self-supervised learning,
link |
00:51:28.680
from a model that understands vision in general.
link |
00:51:32.720
And then we took data sets that connect the two modalities,
link |
00:51:37.160
vision and language.
link |
00:51:38.680
And then we froze the main part,
link |
00:51:41.120
the largest portion of the network, which was Chinchilla,
link |
00:51:43.640
that is 70 billion parameters.
link |
00:51:45.880
And then we added a few more parameters on top,
link |
00:51:49.160
trained from scratch, and then some others
link |
00:51:51.360
that were pre-trained with the capacity to see,
link |
00:51:55.200
like it was not tokenization
link |
00:51:57.320
in the way I described for Gato, but it's a similar idea.
link |
00:52:01.360
And then we trained the whole system.
link |
00:52:03.560
Parts of it were frozen, parts of it were new.
link |
00:52:06.520
And all of a sudden, we developed Flamingo,
link |
00:52:09.640
which is an amazing model that is essentially,
link |
00:52:12.520
I mean, to describe it, it's a chatbot
link |
00:52:14.960
where you can also upload images
link |
00:52:16.920
and start conversing about images.
link |
00:52:19.880
But it's also kind of a dialogue style chatbot.
link |
00:52:23.680
So the input is images and text and the output is text.
link |
00:52:26.600
Exactly.
link |
00:52:28.480
How many parameters, you said 70 billion for Chinchilla?
link |
00:52:31.760
Yeah, Chinchilla is 70 billion.
link |
00:52:33.200
And then the ones we add on top,
link |
00:52:34.600
which is almost like a way
link |
00:52:38.000
to overwrite its little activations
link |
00:52:40.920
so that when it sees vision,
link |
00:52:42.400
it does kind of a correct computation of what it's seeing,
link |
00:52:45.280
mapping it back to words, so to speak.
link |
00:52:47.920
That adds an extra 10 billion parameters, right?
link |
00:52:50.800
So it's total 80 billion, the largest one we released.
link |
00:52:53.920
And then you train it on a few datasets
link |
00:52:57.320
that contain vision and language.
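A hedged sketch of that freezing recipe; the module shapes and names below are placeholders rather than DeepMind's actual architecture, but they show the pattern of frozen versus trainable parameters:

```python
import torch

# Stand-in for the pretrained language model (the Chinchilla role).
language_model = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=512, nhead=8), num_layers=6)
for p in language_model.parameters():
    p.requires_grad = False  # "we took the weights and then we froze them"

# Stand-ins for the added pieces: a vision encoder initialized from
# self-supervised pretraining, plus new parameters trained from scratch.
vision_encoder = torch.nn.Linear(1024, 512)
adapter = torch.nn.Linear(512, 512)

# Only the new pieces receive gradient updates.
trainable = list(vision_encoder.parameters()) + list(adapter.parameters())
optimizer = torch.optim.Adam(trainable, lr=1e-4)
```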
link |
00:52:59.280
And once you interact with the model,
link |
00:53:01.120
you start seeing that you can upload an image
link |
00:53:04.160
and start sort of having a dialogue about the image,
link |
00:53:07.960
which is actually something
link |
00:53:09.480
very similar and akin to what we saw in language only.
link |
00:53:12.520
These prompting abilities that it has,
link |
00:53:15.240
you can teach it a new vision task, right?
link |
00:53:17.720
It does things beyond the capabilities
link |
00:53:20.440
that in theory the datasets provided in themselves,
link |
00:53:24.480
but because it leverages a lot of the language knowledge
link |
00:53:27.080
acquired from Chinchilla,
link |
00:53:28.880
it actually has this few shot learning ability
link |
00:53:31.760
and these emerging abilities
link |
00:53:33.080
that we didn't even measure
link |
00:53:34.640
once we were developing the model,
link |
00:53:36.400
but once developed, then as you play with the interface,
link |
00:53:40.080
you can start seeing, wow, okay, yeah, it's cool.
link |
00:53:42.320
We can upload, I think one of the tweets
link |
00:53:45.000
talking about this on Twitter was this image of Obama
link |
00:53:47.840
that is placing a weight
link |
00:53:49.840
and someone is kind of weighing themselves
link |
00:53:52.400
and it's kind of a joke-style image.
link |
00:53:54.880
And it's notable because I think Andrej Karpathy
link |
00:53:57.840
a few years ago said,
link |
00:53:59.400
no computer vision system can understand
link |
00:54:02.320
the subtlety of this joke in this image,
link |
00:54:04.720
all the things that go on.
link |
00:54:06.360
And so what we tried to do, and it's very anecdotal,
link |
00:54:09.600
I mean, this is not a proof that we solved this issue,
link |
00:54:12.120
but it just shows that you can upload now this image
link |
00:54:15.720
and start conversing with the model,
link |
00:54:17.560
trying to make out if it gets that there's a joke
link |
00:54:21.360
because the person weighing themselves
link |
00:54:23.520
doesn't see that someone behind
link |
00:54:25.040
is making the weight higher and so on and so forth.
link |
00:54:27.840
So it's a fascinating capability
link |
00:54:30.760
and it comes from this key idea of modularity
link |
00:54:33.240
where we took a frozen brain
link |
00:54:34.800
and we just added a new capability.
link |
00:54:37.760
So the question is, should we,
link |
00:54:40.600
so in a way you can see even from DeepMind,
link |
00:54:42.720
we have Flamingo, which is this modular approach
link |
00:54:46.280
and thus could leverage the scale a bit more reasonably
link |
00:54:49.040
because we didn't need to retrain a system from scratch.
link |
00:54:52.200
And on the other hand, we had Gato,
link |
00:54:54.080
which used the same data sets,
link |
00:54:55.800
but then it was trained from scratch, right?
link |
00:54:57.400
And so I guess big question for the community
link |
00:55:00.480
is should we train from scratch
link |
00:55:02.760
or should we embrace modularity?
link |
00:55:04.640
And this, again, goes back to modularity
link |
00:55:08.640
as a way to grow, but reuse seems natural
link |
00:55:12.040
and it was very effective, certainly.
link |
00:55:14.920
The next question is, if you go the way of modularity,
link |
00:55:18.960
is there a systematic way of freezing weights
link |
00:55:22.680
and joining different modalities across,
link |
00:55:27.040
you know, not just two or three or four networks,
link |
00:55:29.200
but hundreds of networks from all different kinds of places,
link |
00:55:32.280
maybe an open-source network that looks at weather patterns
link |
00:55:36.280
and you shove that in somehow
link |
00:55:37.880
and then you have networks that, I don't know,
link |
00:55:40.360
do all kinds of stuff, play StarCraft
link |
00:55:42.000
and play all the other video games
link |
00:55:43.960
and you can keep adding them in without significant effort,
link |
00:55:49.480
like maybe the effort scales linearly or something like that
link |
00:55:53.160
as opposed to like the more network you add,
link |
00:55:54.880
the more you have to worry about the instabilities created.
link |
00:55:57.840
Yeah, so that vision is beautiful.
link |
00:55:59.840
I think there's still the question
link |
00:56:03.400
about within single modalities, like Chinchilla was reused,
link |
00:56:06.720
but now if we train a next iteration of language models,
link |
00:56:10.120
are we gonna use Chinchilla or not?
link |
00:56:11.720
Yeah, how do you swap out Chinchilla?
link |
00:56:13.040
Right, so there's still big questions,
link |
00:56:15.840
but that idea is actually really akin to software engineering,
link |
00:56:19.280
in which we're not reimplementing libraries from scratch,
link |
00:56:22.280
we're reusing and then building ever more amazing things,
link |
00:56:25.280
including neural networks with software that we're reusing.
link |
00:56:28.880
So I think this idea of modularity, I like it,
link |
00:56:32.120
I think it's here to stay
link |
00:56:33.800
and that's also why I mentioned
link |
00:56:35.840
it's just the beginning, not the end.
link |
00:56:38.160
You've mentioned meta learning,
link |
00:56:39.360
so given this promise of Gato,
link |
00:56:42.760
can we try to redefine this term
link |
00:56:45.960
that's almost akin to consciousness
link |
00:56:47.560
because it means different things to different people
link |
00:56:50.120
throughout the history of artificial intelligence,
link |
00:56:52.360
but what do you think meta learning is
link |
00:56:56.600
and looks like now in the five years, 10 years,
link |
00:57:00.040
will it look like the system like Gato, but scaled?
link |
00:57:03.160
What's your sense of, what does meta learning look like?
link |
00:57:07.000
Do you think with all the wisdom we've learned so far?
link |
00:57:10.480
Yeah, great question.
link |
00:57:11.520
Maybe it's good to give another data point
link |
00:57:14.480
looking backwards rather than forward.
link |
00:57:16.160
So when we talked in 2019,
link |
00:57:22.880
meta learning meant something that has changed
link |
00:57:26.480
mostly through the revolution of GPT3 and beyond.
link |
00:57:31.120
So what meta learning meant at the time
link |
00:57:35.000
was driven by what benchmarks people care about
link |
00:57:37.640
in meta learning.
link |
00:57:38.800
And the benchmarks were about a capability
link |
00:57:42.560
to learn about object identities.
link |
00:57:44.960
So it was very much overfitted to vision
link |
00:57:48.480
and object classification.
link |
00:57:50.360
And the part that was meta about that was that,
link |
00:57:52.880
oh, we're not just learning a thousand categories
link |
00:57:55.320
that ImageNet tells us to learn.
link |
00:57:57.040
We're going to learn object categories
link |
00:57:59.200
that can be defined when we interact with the model.
link |
00:58:03.280
So it's interesting to see the evolution, right?
link |
00:58:06.640
The way this started was we had a special language
link |
00:58:10.720
that was a data set, a small data set
link |
00:58:13.200
that we prompted the model with saying,
link |
00:58:15.920
hey, here is a new classification task.
link |
00:58:18.960
I'll give you one image and the name,
link |
00:58:21.720
which was an integer at the time, of the image,
link |
00:58:24.320
and a different image and so on.
link |
00:58:25.920
So you have a small prompt in the form of a data set,
link |
00:58:30.000
a machine learning data set.
link |
00:58:31.600
And then you got a system that could then predict
link |
00:58:35.480
or classify these objects that you just
link |
00:58:37.440
defined kind of on the fly.
link |
00:58:40.280
So fast forward, it was revealed that language models
link |
00:58:46.480
are few shot learners.
link |
00:58:47.440
That's the title of the paper.
link |
00:58:49.120
So very good title.
link |
00:58:50.080
Sometimes titles are really good.
link |
00:58:51.480
So this one is really, really good.
link |
00:58:53.520
Because that's the point of GPT3 that showed that, look, sure,
link |
00:58:58.800
we can focus on object classification
link |
00:59:00.960
and what meta learning means within the space of learning
link |
00:59:04.160
object categories.
link |
00:59:05.400
This goes beyond, or before rather,
link |
00:59:07.440
to also Omniglot, before ImageNet and so on.
link |
00:59:10.120
So there's a few benchmarks.
link |
00:59:11.560
To now, all of a sudden, we're a bit unlocked from benchmarks.
link |
00:59:15.240
And through language, we can define tasks.
link |
00:59:17.960
So we're literally telling the model
link |
00:59:20.280
some logical task or a little thing that we wanted to do.
link |
00:59:23.920
We prompt it much like we did before,
link |
00:59:26.000
but now we prompt it through natural language.
link |
00:59:28.440
And then not perfectly, I mean, these models have failure modes
link |
00:59:32.280
and that's fine, but these models then
link |
00:59:35.560
are now doing a new task.
link |
00:59:37.240
And so they meta learn this new capability.
link |
00:59:40.520
Now, that's where we are now.
link |
00:59:43.480
Flamingo expanded this to visual and language,
link |
00:59:47.320
but it basically has the same abilities.
link |
00:59:49.440
You can teach it, for instance, an emergent property
link |
00:59:52.720
was that you can take pictures of numbers
link |
00:59:55.320
and then do arithmetic with the numbers just by teaching it,
link |
00:59:59.040
oh, when I show you 3 plus 6, I want you to output 9.
link |
01:00:03.720
And you show it a few examples, and now it does that.
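A sketch of what such a few-shot prompt might look like; the <img:...> placeholders and the generate call are hypothetical stand-ins for interleaved image tokens and a model API:

```python
# Hypothetical few-shot prompt: the task is defined entirely in context,
# with no weight updates.
prompt = (
    "<img:three_plus_six.png> 9\n"
    "<img:two_plus_two.png> 4\n"
    "<img:five_plus_one.png> "
)
# answer = model.generate(prompt)  # hypothetical call; expected completion: "6"
```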
link |
01:00:06.880
So it went way beyond the ImageNet categorization of images
link |
01:00:12.800
that we were a bit stuck on maybe before this revelation
link |
01:00:17.280
moment that happened in, I believe it was 2019,
link |
01:00:19.200
but we'd have to check.
link |
01:00:21.960
In that way, it has solved meta learning
link |
01:00:24.400
as was previously defined.
link |
01:00:26.160
Yes, it expanded what it meant.
link |
01:00:27.880
So that's what you say, what does it mean?
link |
01:00:29.680
So it's an evolving term.
link |
01:00:31.400
But here is maybe now looking forward,
link |
01:00:35.320
looking at what's happening, obviously,
link |
01:00:38.080
in the community with more modalities, what we can expect.
link |
01:00:42.560
And I would certainly hope to see the following.
link |
01:00:45.040
And this is a pretty drastic hope.
link |
01:00:48.400
But in five years, maybe we chat again.
link |
01:00:51.200
And we have a system, a set of weights
link |
01:00:55.920
that we can teach it to play StarCraft.
link |
01:00:59.840
Maybe not at the level of AlphaStar,
link |
01:01:01.480
but play StarCraft, a complex game,
link |
01:01:03.720
we teach it through interactions, through prompting.
link |
01:01:06.920
You can certainly prompt a system.
link |
01:01:08.600
That's what Gato shows, to play some simple Atari games.
link |
01:01:11.880
So imagine if you start talking to a system,
link |
01:01:15.360
teaching it a new game, showing it
link |
01:01:17.280
examples of in this particular game,
link |
01:01:20.960
this user did something good.
link |
01:01:22.720
Maybe the system can even play and ask you questions.
link |
01:01:25.440
Say, hey, I played this game.
link |
01:01:27.000
I just played this game.
link |
01:01:28.040
Did I do well?
link |
01:01:29.040
Can you teach me more?
link |
01:01:30.440
So in five, maybe 10 years, these capabilities,
link |
01:01:34.720
or what meta learning means, will
link |
01:01:36.400
be much more interactive, much more rich,
link |
01:01:38.800
and through domains that we used to specialize in.
link |
01:01:41.640
So you see the difference.
link |
01:01:42.920
We built AlphaStar, specialized to play StarCraft.
link |
01:01:47.000
The algorithms were general, but the weights were specialized.
link |
01:01:50.400
And what we're hoping is that we can teach a network
link |
01:01:54.160
to play games, to play any game, just using games as an example,
link |
01:01:58.560
through interacting with it, teaching it,
link |
01:02:01.440
uploading the Wikipedia page of StarCraft.
link |
01:02:04.000
This is in the horizon.
link |
01:02:06.120
And obviously, there are details that need to be filled
link |
01:02:09.360
and research needs to be done.
link |
01:02:11.000
But that's how I see meta learning evolve,
link |
01:02:13.200
which is going to be beyond prompting.
link |
01:02:15.360
It's going to be a bit more interactive.
link |
01:02:18.080
The system might tell us to give it feedback
link |
01:02:20.720
after it maybe makes mistakes or it loses a game.
link |
01:02:24.080
But it's nonetheless very exciting
link |
01:02:26.240
because if you think about this this way,
link |
01:02:28.960
the benchmarks are already there.
link |
01:02:30.600
We just repurposed the benchmarks.
link |
01:02:33.120
So in a way, I like to map the space of what
link |
01:02:38.440
maybe AGI means to say, OK, we got to 101% performance in Go,
link |
01:02:45.480
in Chess, in StarCraft.
link |
01:02:47.920
The next iteration might be 20% performance
link |
01:02:51.920
across, quote unquote, all tasks.
link |
01:02:54.720
And even if it's not as good, it's fine.
link |
01:02:57.720
We have ways to also measure progress
link |
01:02:59.960
because we have those specialized agents and so on.
link |
01:03:04.320
So this is, to me, very exciting.
link |
01:03:06.160
And these next iteration models are definitely
link |
01:03:10.080
hinting at that direction of progress,
link |
01:03:13.360
which hopefully we can have.
link |
01:03:14.720
There are obviously some things that
link |
01:03:16.440
could go wrong in terms of we might not have the tools.
link |
01:03:20.120
Maybe transformers are not enough.
link |
01:03:22.600
There are some breakthroughs to come, which
link |
01:03:24.920
makes the field more exciting to people like me as well,
link |
01:03:27.560
of course.
link |
01:03:28.600
But that's, if you ask me, five to 10 years,
link |
01:03:32.040
you might see these models that start
link |
01:03:33.800
to look more like weights that are already trained.
link |
01:03:36.880
And then it's more about teaching, or making
link |
01:03:40.520
them meta learn what you're trying
link |
01:03:44.040
to induce in terms of tasks and so on,
link |
01:03:47.000
well beyond the simple tasks we're now
link |
01:03:49.920
starting to see emerge, like small arithmetic tasks
link |
01:03:53.200
and so on.
link |
01:03:54.280
So a few questions around that.
link |
01:03:55.720
This is fascinating.
link |
01:03:57.200
So that kind of teaching, interactive,
link |
01:04:01.440
so it's beyond prompting.
link |
01:04:02.760
So it's interacting with the neural network.
link |
01:04:05.240
That's different than the training process.
link |
01:04:08.520
So it's different than the optimization
link |
01:04:12.440
over differentiable functions.
link |
01:04:15.920
This is already trained.
link |
01:04:17.240
And now you're teaching, I mean, it's
link |
01:04:21.800
almost akin to the brain, the neurons already
link |
01:04:25.560
set with their connections.
link |
01:04:26.960
On top of that, you're now using that infrastructure
link |
01:04:30.000
to build up further knowledge.
link |
01:04:33.640
So that's a really interesting distinction that's actually
link |
01:04:37.200
not obvious from a software engineering perspective,
link |
01:04:40.320
that there's a line to be drawn.
link |
01:04:42.560
Because you always think for a neural network to learn,
link |
01:04:44.880
it has to be retrained, trained and retrained.
link |
01:04:49.880
And prompting is a way of teaching.
link |
01:04:54.000
And you now give it a little bit of context
link |
01:04:55.920
about whatever the heck you're trying to get it to do.
link |
01:04:57.960
So you can maybe expand this prompting capability
link |
01:05:00.440
by making it interact.
link |
01:05:03.320
That's really, really interesting.
link |
01:05:04.680
By the way, this is not new; if you look way back
link |
01:05:08.080
at different ways to tackle even classification tasks,
link |
01:05:11.840
So this comes from longstanding literature
link |
01:05:16.440
in machine learning.
link |
01:05:18.240
What I'm suggesting could sound to some
link |
01:05:20.800
a bit like nearest neighbor.
link |
01:05:23.400
So nearest neighbor is almost the simplest algorithm
link |
01:05:27.120
that does not require learning.
link |
01:05:30.200
So it has this interesting property, that you don't
link |
01:05:32.640
need to compute gradients.
link |
01:05:34.320
And what nearest neighbor does is you, quote unquote,
link |
01:05:37.560
have a data set or upload a data set.
link |
01:05:39.960
And then all you need to do is a way
link |
01:05:42.040
to measure distance between points.
link |
01:05:44.720
And then to classify a new point,
link |
01:05:46.680
you're just simply computing, what's
link |
01:05:48.360
the closest point in this massive amount of data?
link |
01:05:51.240
And that's my answer.
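For reference, a minimal nearest-neighbor classifier of the classical kind described here: upload a dataset, pick a distance, classify by the closest point, and compute no gradients:

```python
import numpy as np

def nearest_neighbor(query, points, labels, distance=None):
    """Return the label of the closest stored point; no training involved."""
    distance = distance or (lambda a, b: float(np.linalg.norm(a - b)))
    dists = [distance(query, p) for p in points]
    return labels[int(np.argmin(dists))]

points = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
labels = ["cat", "dog"]
print(nearest_neighbor(np.array([0.9, 0.8]), points, labels))  # -> dog
```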
link |
01:05:52.720
So you can think of prompting in a way
link |
01:05:55.440
as you're uploading not just simple points.
link |
01:05:58.680
And the metric is not the distance between the images
link |
01:06:02.480
or something simple.
link |
01:06:03.320
It's something that you compute that's much more advanced.
link |
01:06:06.040
But in a way, it's very similar.
link |
01:06:09.040
You simply are uploading some knowledge
link |
01:06:12.600
to this pre-trained system. In nearest neighbor,
link |
01:06:15.040
maybe the metric is learned or not,
link |
01:06:17.280
but you don't need to further train it.
link |
01:06:19.400
And then now you immediately get a classifier out of this.
link |
01:06:23.680
Now it's just an evolution of that concept,
link |
01:06:25.840
very classical concept in machine learning, which
link |
01:06:28.080
is just learning through what's the closest point, closest
link |
01:06:32.640
by some distance, and that's it.
link |
01:06:34.720
It's an evolution of that.
link |
01:06:36.120
And I will say how I saw meta learning when
link |
01:06:39.400
we worked on a few ideas in 2016 was precisely
link |
01:06:44.760
through the lens of nearest neighbor, which
link |
01:06:47.520
is very common in computer vision community.
link |
01:06:50.160
There's a very active area of research
link |
01:06:52.160
about how do you compute the distance between two images.
link |
01:06:55.600
But if you have a good distance metric,
link |
01:06:57.560
you also have a good classifier.
link |
01:06:59.920
All I'm saying is now these distances and the points
link |
01:07:02.680
are not just images.
link |
01:07:03.800
They're like words or sequences of words and images
link |
01:07:08.560
and actions that teach you something new.
link |
01:07:10.400
But it might be that, technique-wise, those come back.
link |
01:07:14.680
And I will say that it's not necessarily true
link |
01:07:18.240
that you might not ever train the weights a bit further.
link |
01:07:21.760
Some aspect of meta learning, some techniques
link |
01:07:24.800
in meta learning do actually do a bit of fine tuning
link |
01:07:28.280
as it's called.
link |
01:07:29.080
They train the weights a little bit when they get a new task.
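A sketch of that fine-tuning flavor, assuming a generic PyTorch model and loss function; the small learning rate and few steps stand in for training the weights "a little bit" on a new task:

```python
import torch

def adapt_to_task(model, loss_fn, task_batches, lr=1e-5, steps=5):
    """Lightly fine-tune pretrained weights on a handful of new-task batches."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for batch in task_batches[:steps]:
        opt.zero_grad()
        loss_fn(model, batch).backward()
        opt.step()
    return model  # same weights, nudged toward the new task
```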
link |
01:07:32.960
So as for the how, or how we're going to achieve this,
link |
01:07:37.960
as a deep learner, I'm very skeptical.
link |
01:07:39.840
We're going to try a few things, whether it's
link |
01:07:41.840
a bit of training, adding a few parameters,
link |
01:07:44.200
thinking of these as nearest neighbor,
link |
01:07:45.960
or just simply thinking of there's a sequence of words,
link |
01:07:49.200
it's a prefix.
link |
01:07:50.440
And that's the new classifier.
link |
01:07:53.000
We'll see.
link |
01:07:53.680
There's the beauty of research.
link |
01:07:55.480
But what's important is that this is a good goal in itself
link |
01:08:00.160
that I see as very worthwhile pursuing for the next stages
link |
01:08:03.800
of not only meta learning.
link |
01:08:05.720
I think this is basically what's exciting about machine learning
link |
01:08:10.160
period to me.
link |
01:08:11.400
Well, and the interactive aspect of that
link |
01:08:13.760
is also very interesting, the interactive version
link |
01:08:16.400
of nearest neighbor to help you pull out the classifier
link |
01:08:22.160
from this giant thing.
link |
01:08:23.760
OK, is this the way we can go in 5, 10 plus years
link |
01:08:31.040
from any task, sorry, from many tasks to any task?
link |
01:08:38.200
And what does that mean?
link |
01:08:39.400
What does it need to be actually trained on?
link |
01:08:42.760
At which point has the network had enough?
link |
01:08:45.400
So what does a network need to learn about this world
link |
01:08:50.360
in order to be able to perform any task?
link |
01:08:52.440
Is it just as simple as language, image, and action?
link |
01:08:57.880
Or do you need some set of representative images?
link |
01:09:02.680
Like if you only see land images,
link |
01:09:05.200
will you know anything about underwater?
link |
01:09:06.760
Is that somehow fundamentally different?
link |
01:09:08.720
I don't know.
link |
01:09:09.800
I mean, those are open questions, I would say.
link |
01:09:12.080
I mean, the way you put it, let me maybe further your example.
link |
01:09:15.280
If all you see is land images but you're
link |
01:09:18.920
reading all about land and water worlds
link |
01:09:21.560
in books, imagine, would that be enough?
link |
01:09:25.360
Good question.
link |
01:09:26.400
We don't know.
link |
01:09:27.120
But I guess maybe you can join us
link |
01:09:30.440
if you want in our quest to find this.
link |
01:09:32.120
That's precisely.
link |
01:09:33.440
Water world, yeah.
link |
01:09:34.360
Yes, that's precisely, I mean, the beauty of research.
link |
01:09:37.640
And that's the research business we're in,
link |
01:09:42.680
I guess, is to figure this out and ask the right questions
link |
01:09:46.160
and then iterate with the whole community,
link |
01:09:49.520
publishing findings and so on.
link |
01:09:52.640
But yeah, this is a question.
link |
01:09:55.160
It's not the only question, but it's certainly, as you ask,
link |
01:09:58.640
on my mind constantly.
link |
01:10:00.080
And so we'll need to wait for maybe the, let's say, five
link |
01:10:03.920
years, let's hope it's not 10, to see what are the answers.
link |
01:10:09.400
Some people will largely believe in unsupervised or
link |
01:10:12.800
self-supervised learning of single modalities
link |
01:10:15.640
and then crossing them.
link |
01:10:18.000
Some people might think end to end learning is the answer.
link |
01:10:21.640
Modularity is maybe the answer.
link |
01:10:23.760
So we don't know, but we're just definitely excited
link |
01:10:27.040
to find out.
link |
01:10:27.560
But it feels like this is the right time
link |
01:10:29.280
and we're at the beginning of this journey.
link |
01:10:31.680
We're finally ready to do these kind of general big models
link |
01:10:36.040
and agents.
link |
01:10:37.960
Is there some sort of specific technical thing
link |
01:10:42.480
about Gato, Flamingo, Chinchilla, Gopher, any of these
link |
01:10:48.040
that is especially beautiful, that was surprising, maybe?
link |
01:10:51.640
Is there something that just jumps out at you?
link |
01:10:55.200
Of course, there's the general thing of like,
link |
01:10:57.600
you didn't think it was possible and then you
link |
01:11:00.600
realize it's possible in terms of the generalizability
link |
01:11:03.560
across modalities and all that kind of stuff.
link |
01:11:05.640
Or maybe how small of a network, relatively speaking,
link |
01:11:08.920
Gato is, all that kind of stuff.
link |
01:11:10.440
But is there some weird little things that were surprising?
link |
01:11:15.160
Look, I'll give you an answer that's very important
link |
01:11:18.200
because maybe people don't quite realize this,
link |
01:11:22.560
but the teams behind these efforts, the actual humans,
link |
01:11:27.200
that's maybe the surprising bit, in an obviously positive way.
link |
01:11:31.640
So anytime you see these breakthroughs,
link |
01:11:34.560
I mean, it's easy to map it to a few people.
link |
01:11:37.080
There's people that are great at explaining things and so on.
link |
01:11:39.680
And that's very nice.
link |
01:11:40.720
But maybe the learnings or the meta learnings
link |
01:11:44.720
that I get as a human about this is, sure, we can move forward.
link |
01:11:50.480
But the surprising bit is how important
link |
01:11:55.640
all the pieces of these projects are,
link |
01:11:58.720
and how they come together.
link |
01:12:00.080
So I'll give you maybe some of the ingredients of success
link |
01:12:04.440
that are common across these, but not the obvious ones
link |
01:12:07.680
on machine learning.
link |
01:12:08.480
I can always also give you those.
link |
01:12:11.320
But basically, engineering is critical.
link |
01:12:17.280
So very good engineering because ultimately we're
link |
01:12:21.120
collecting data sets, right?
link |
01:12:23.720
So the engineering of data and then
link |
01:12:26.640
of deploying the models at scale into some compute cluster
link |
01:12:31.160
cannot be overstated; that is a huge factor of success.
link |
01:12:36.800
And it's hard to believe that details matter so much.
link |
01:12:41.560
We would like to believe that it's
link |
01:12:43.760
true that there is more and more of a standard formula,
link |
01:12:47.360
as I was saying, like this recipe that
link |
01:12:49.360
works for everything.
link |
01:12:50.520
But then when you zoom into each of these projects,
link |
01:12:53.680
then you realize the devil is indeed in the details.
link |
01:12:57.760
And then the teams have to work together towards these goals.
link |
01:13:03.040
So engineering of data and obviously clusters
link |
01:13:07.520
and large scale is very important.
link |
01:13:09.280
And then one that is often not obvious, though maybe nowadays it is more clear,
link |
01:13:15.080
is benchmark progress, right?
link |
01:13:17.160
So we're talking here about multiple months of tens
link |
01:13:20.840
of researchers and people that are
link |
01:13:24.520
trying to organize the research and so on working together.
link |
01:13:28.080
And you don't know that you can get there.
link |
01:13:32.120
I mean, this is the beauty.
link |
01:13:34.520
If you're not risking to trying to do something
link |
01:13:37.320
that feels impossible, you're not going to get there.
link |
01:13:41.600
But you need a way to measure progress.
link |
01:13:43.920
So the benchmarks that you build are critical.
link |
01:13:47.680
I've seen this beautifully play out in many projects.
link |
01:13:50.520
I mean, maybe the one where I've seen it most consistently,
link |
01:13:53.840
where we established the metric,
link |
01:13:56.760
actually the community did,
link |
01:13:58.240
and then we leveraged that massively, is AlphaFold.
link |
01:14:01.520
This is a project where the data, the metrics
link |
01:14:05.160
were all there.
link |
01:14:06.040
And all it took was, and it's easier said than done,
link |
01:14:09.080
an amazing team working not to try
link |
01:14:12.840
to find some incremental improvement
link |
01:14:14.720
and publish, which is one way to do research that is valid,
link |
01:14:17.920
but aim very high and work literally for years
link |
01:14:22.440
to iterate over that process.
link |
01:14:24.080
And working for years with the team,
link |
01:14:25.640
I mean, it is tricky that also happened to happen partly
link |
01:14:30.120
during a pandemic and so on.
link |
01:14:32.280
So I think my meta learning from all this
link |
01:14:34.200
is the teams are critical to the success.
link |
01:14:37.960
And then if now going to the machine learning,
link |
01:14:40.200
the part that's surprising is so we like architectures
link |
01:14:46.880
like neural networks.
link |
01:14:48.680
And I would say this was a very rapidly evolving field
link |
01:14:53.040
until the transformer came.
link |
01:14:54.920
So attention might indeed be all you need,
link |
01:14:58.280
which is the title, also a good title,
link |
01:15:00.280
although in hindsight it's good.
link |
01:15:02.040
I don't think at the time I thought
link |
01:15:03.440
this is a great title for a paper.
link |
01:15:05.040
But that architecture is proving that the dream of modeling
link |
01:15:10.960
sequences of any bytes, there is something there that will stick.
link |
01:15:15.320
And I think these advances in architectures,
link |
01:15:18.280
in how neural networks are architected
link |
01:15:21.000
to do what they do.
link |
01:15:23.080
It's been hard to find one that has been so stable
link |
01:15:26.080
and relatively has changed very little
link |
01:15:28.880
since it was invented five or so years ago.
link |
01:15:33.080
So that is a surprise
link |
01:15:35.840
that keeps recurring in other projects.
link |
01:15:38.280
Try to, on a philosophical or technical level, introspect,
link |
01:15:43.320
what is the magic of attention?
link |
01:15:45.440
What is attention?
link |
01:15:47.280
There's attention in people that study cognition, so human attention.
link |
01:15:50.120
so human attention.
link |
01:15:52.040
I think there's giant wars over what attention means,
link |
01:15:55.760
how it works in the human mind.
link |
01:15:57.440
So there's a very simple view of what
link |
01:16:00.480
attention is in a neural network, from the days of Attention
link |
01:16:03.840
Is All You Need.
link |
01:16:04.440
But do you think there's a general principle that's
link |
01:16:07.520
really powerful here?
link |
01:16:08.760
Yeah, so a distinction between transformers and LSTMs,
link |
01:16:13.400
which were what came before.
link |
01:16:15.360
And there was a transitional period
link |
01:16:17.880
where you could use both.
link |
01:16:19.720
In fact, when we talked about AlphaStar,
link |
01:16:22.000
we used transformers and LSTMs.
link |
01:16:24.240
So it was still the beginning of transformers.
link |
01:16:26.400
They were very powerful.
link |
01:16:27.400
But LSTMs were also very powerful sequence models.
link |
01:16:31.480
So the power of the transformer is
link |
01:16:35.400
that it has built in what we call
link |
01:16:38.440
an inductive bias of attention that makes the model powerful.
link |
01:16:43.040
When you think of a sequence of integers,
link |
01:16:45.720
like we discussed this before, this is a sequence of words.
link |
01:16:50.400
When you have to do very hard tasks over these words,
link |
01:16:54.800
this could be we're going to translate a whole paragraph
link |
01:16:57.840
or we're going to predict the next paragraph given
link |
01:17:00.320
10 paragraphs before.
link |
01:17:04.280
There's some loose intuition from how we do it as a human
link |
01:17:10.360
that is very nicely mimicked and replicated structurally
link |
01:17:15.400
speaking in the transformer, which
link |
01:17:16.840
is this idea of you're looking for something.
link |
01:17:21.160
So you're sort of when you just read a piece of text,
link |
01:17:25.760
now you're thinking what comes next.
link |
01:17:27.880
You might want to relook at the text or look at it from scratch.
link |
01:17:31.800
I mean, literally, it's because there's no recurrence.
link |
01:17:35.040
You're just thinking what comes next.
link |
01:17:37.240
And it's almost hypothesis driven.
link |
01:17:40.040
So if I'm thinking the next word that I write is cat or dog,
link |
01:17:46.600
the way the transformer works almost philosophically
link |
01:17:49.880
is it has these two hypotheses.
link |
01:17:52.840
Is it going to be cat or is it going to be dog?
link |
01:17:55.640
And then it says, OK, if it's cat,
link |
01:17:58.360
I'm going to look for certain words.
link |
01:17:59.920
Not necessarily cat, although cat is an obvious word
link |
01:18:01.920
you would look for in the past to see
link |
01:18:03.520
whether it makes more sense to output cat or dog.
link |
01:18:05.960
And then it does some very deep computation
link |
01:18:09.480
over the words and beyond.
link |
01:18:11.480
So it combines the words, but it has the query
link |
01:18:16.200
as we call it that is cat.
link |
01:18:18.440
And then similarly for dog.
link |
01:18:20.680
And so it's a very computational way to think about, look,
link |
01:18:24.760
if I'm thinking deeply about text,
link |
01:18:27.000
I need to go back to look at all of the text, attend over it.
link |
01:18:30.600
But it's not just attention.
link |
01:18:32.200
What is guiding the attention?
link |
01:18:34.000
And that was the key insight from an earlier paper:
link |
01:18:36.680
it's not just how far away it is.
link |
01:18:39.120
I mean, how far away it is, is important.
link |
01:18:40.840
What did I just write about?
link |
01:18:42.720
That's critical.
link |
01:18:44.120
But what you wrote about 10 pages ago
link |
01:18:46.760
might also be critical.
link |
01:18:48.480
So you're looking not positionally, but content wise.
link |
01:18:53.160
And transformers have this beautiful way
link |
01:18:56.120
to query for certain content and pull it out
link |
01:18:59.480
in a compressed way.
link |
01:19:00.440
So then you can make a more informed decision.
link |
01:19:02.960
I mean, that's one way to explain transformers.
link |
01:19:05.920
But I think it's a very powerful inductive bias.
link |
01:19:10.000
There might be some details that might change over time,
link |
01:19:12.480
but I think that is what makes transformers so much more
link |
01:19:17.360
powerful than the recurrent networks that
link |
01:19:20.080
were more recency bias based, which obviously works
link |
01:19:23.600
in some tasks, but it has major flaws.
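A minimal sketch of that content-based lookup, scaled dot-product attention: each query is matched against all past content rather than just recent positions, and pulls out a compressed summary:

```python
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    """Q: (num_queries, d); K, V: (context_len, d)."""
    scores = Q @ K.T / K.size(-1) ** 0.5  # content match, independent of recency
    weights = F.softmax(scores, dim=-1)   # where each query chooses to look
    return weights @ V                    # compressed pull of relevant content

d = 64
Q, K, V = torch.randn(1, d), torch.randn(10, d), torch.randn(10, d)
summary = attention(Q, K, V)              # shape: (1, 64)
```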
link |
01:19:26.720
Transformer itself has flaws.
link |
01:19:29.240
And I think the main one, the main challenge
link |
01:19:31.680
is these prompts that we just were talking about,
link |
01:19:35.760
they can be 1,000 words long.
link |
01:19:38.040
But if I'm teaching you StarCraft,
link |
01:19:40.440
I'll have to show you videos.
link |
01:19:41.880
I'll have to point you to whole Wikipedia articles
link |
01:19:44.600
about the game.
link |
01:19:46.120
We'll have to interact probably as you play.
link |
01:19:48.040
You'll ask me questions.
link |
01:19:49.480
The context required for us to achieve
link |
01:19:52.320
me being a good teacher to you on the game
link |
01:19:54.720
as you would want to do it with a model, I think
link |
01:19:58.920
goes well beyond the current capabilities.
link |
01:20:01.720
So the question is, how do we benchmark this?
link |
01:20:03.920
And then how do we change the structure of the architectures?
link |
01:20:07.320
I think there's ideas on both sides,
link |
01:20:08.840
but we'll have to see empirically, obviously,
link |
01:20:11.800
what ends up working.
link |
01:20:13.320
And as you talked about, some of the ideas
link |
01:20:15.280
could be keeping the constraint of that length in place,
link |
01:20:19.440
but then forming hierarchical representations
link |
01:20:23.000
to where you can start being much more clever in how
link |
01:20:26.600
you use those 1,000 tokens.
link |
01:20:28.800
Indeed.
link |
01:20:30.920
Yeah, that's really interesting.
link |
01:20:32.240
But it also is possible that this attentional mechanism
link |
01:20:34.840
where you basically, you don't have a recency bias,
link |
01:20:37.560
but you look more generally, you make it learnable.
link |
01:20:42.000
The mechanism in which way you look back into the past,
link |
01:20:45.240
you make that learnable.
link |
01:20:46.760
It's also possible we're at the very beginning of that
link |
01:20:50.160
because you might become smarter and smarter
link |
01:20:54.400
in the way you query the past.
link |
01:20:58.400
So recent past and distant past and maybe very, very distant
link |
01:21:01.800
past.
link |
01:21:02.320
So almost like the attention mechanism
link |
01:21:04.960
will have to improve and evolve as well as the tokenization
link |
01:21:11.280
mechanism so you can represent long term memory somehow.
link |
01:21:14.960
Yes.
link |
01:21:16.080
And I mean, hierarchies are very,
link |
01:21:18.240
I mean, it's a very nice word that sounds appealing.
link |
01:21:22.160
There's lots of work adding hierarchy to the memories.
link |
01:21:25.920
In practice, it does seem like we keep coming back
link |
01:21:29.480
to the main formula or main architecture.
link |
01:21:33.880
That sometimes tells us something.
link |
01:21:35.320
There's a saying that a friend of mine told me,
link |
01:21:38.560
like, an idea either wants to work or it doesn't.
link |
01:21:41.000
So Transformer was clearly an idea that wanted to work.
link |
01:21:44.920
And then I think there's some principles
link |
01:21:47.520
we believe will be needed.
link |
01:21:49.080
But finding the exact details, details matter so much.
link |
01:21:52.880
That's going to be tricky.
link |
01:21:54.200
I love the idea that there's like you as a human being,
link |
01:21:59.440
you want some ideas to work.
link |
01:22:01.280
And then there's the model that wants some ideas
link |
01:22:03.800
to work and you get to have a conversation
link |
01:22:05.960
to see which wins, and most likely the model will win in the end.
link |
01:22:10.520
Because it's the one, you don't have to do any work.
link |
01:22:12.800
The model is the one that has to do the work.
link |
01:22:14.360
So you should listen to the model.
link |
01:22:15.840
And I really love this idea that you
link |
01:22:17.440
talked about the humans in this picture.
link |
01:22:19.160
If I could just briefly ask, one is you're
link |
01:22:21.840
saying the benchmarks, and the modular humans working on this,
link |
01:22:28.960
the benchmarks providing a sturdy ground for the wish
link |
01:22:32.160
to do these things that seem impossible.
link |
01:22:34.680
They give you, in the darkest of times,
link |
01:22:37.880
give you hope because little signs of improvement.
link |
01:22:41.520
Yes.
link |
01:22:42.000
Like somehow you're not lost if you have metrics
link |
01:22:46.560
to measure your improvement.
link |
01:22:48.680
And then there's other aspect.
link |
01:22:50.800
You said elsewhere and here today, like titles matter.
link |
01:22:56.560
I wonder how much humans matter in the evolution
link |
01:23:01.280
of all of this, meaning individual humans.
link |
01:23:06.760
Something about their interactions,
link |
01:23:08.160
something about their ideas, how much they change
link |
01:23:11.200
the direction of all of this.
link |
01:23:12.920
Like if you change the humans in this picture,
link |
01:23:15.440
is it that the model is sitting there
link |
01:23:18.160
and it wants some idea to work?
link |
01:23:22.480
Or is it the humans, or maybe the model
link |
01:23:25.000
is providing you 20 ideas that could work.
link |
01:23:27.000
And depending on the humans you pick,
link |
01:23:29.080
they're going to be able to hear some of those ideas.
link |
01:23:33.160
Because you're now directing all of deep learning at DeepMind,
link |
01:23:35.920
you get to interact with a lot of projects,
link |
01:23:37.720
a lot of brilliant researchers.
link |
01:23:40.600
How much variability is created by the humans in all of this?
link |
01:23:44.080
Yeah, I mean, I do believe humans matter a lot,
link |
01:23:47.320
at the very least at the time scale of years
link |
01:23:53.360
on when things are happening and what's the sequencing of it.
link |
01:23:56.880
So you get to interact with people that, I mean,
link |
01:24:00.800
you mentioned this.
link |
01:24:02.200
Some people really want some idea to work
link |
01:24:05.080
and they'll persist.
link |
01:24:07.040
And then some other people might be more practical,
link |
01:24:09.400
like I don't care what idea works.
link |
01:24:12.840
I care about cracking protein folding.
link |
01:24:16.800
And at least these two kind of seem opposite sides.
link |
01:24:21.240
We need both.
link |
01:24:22.400
And we've clearly had both historically,
link |
01:24:25.680
and that made certain things happen earlier or later.
link |
01:24:28.960
So definitely humans involved in all of this endeavor
link |
01:24:33.400
have had, I would say, years of change or of ordering
link |
01:24:38.640
how things have happened, which breakthroughs came before,
link |
01:24:41.840
which other breakthroughs, and so on.
link |
01:24:43.280
So certainly that does happen.
link |
01:24:45.800
And so one other, maybe one other axis of distinction
link |
01:24:50.600
is what I called, and this is most commonly used
link |
01:24:53.840
in reinforcement learning, the exploration exploitation
link |
01:24:56.920
trade off as well.
link |
01:24:57.800
It's not exactly what I meant, although quite related.
link |
01:25:00.960
So when you start trying to help others,
link |
01:25:07.000
like you become a bit more of a mentor
link |
01:25:11.440
to a large group of people, be it a project or the deep
link |
01:25:14.600
learning team or something, or even in the community
link |
01:25:17.440
when you interact with people in conferences and so on,
link |
01:25:20.760
you're identifying quickly some things that are explorative
link |
01:25:26.040
or exploitative.
link |
01:25:27.080
And it's tempting to try to guide people, obviously.
link |
01:25:30.720
I mean, that's what our experience brings.
link |
01:25:33.160
We bring it, and we try to shape things, sometimes wrongly.
link |
01:25:36.760
And there's many times that I've been wrong in the past.
link |
01:25:39.560
That's great.
link |
01:25:40.800
But it would be wrong to dismiss any sort of the research
link |
01:25:47.800
styles that I'm observing.
link |
01:25:49.880
And I often get asked, well, you're in industry, right?
link |
01:25:52.760
So we do have access to large compute scale and so on.
link |
01:25:55.640
So there are certain kinds of research
link |
01:25:57.360
I almost feel like we need to do responsibly and so on.
link |
01:26:01.640
But it is, of course, like we have the particle accelerator here,
link |
01:26:05.160
so to speak, as in physics.
link |
01:26:06.280
So we need to use it.
link |
01:26:07.480
We need to answer the questions that we
link |
01:26:09.240
should be answering right now for the scientific progress.
link |
01:26:12.320
But then at the same time, I look at many advances,
link |
01:26:15.200
including attention, which was discovered in Montreal
link |
01:26:19.280
initially because of lack of compute, right?
link |
01:26:22.440
So we were working on sequence to sequence
link |
01:26:24.920
with my friends over at Google Brain at the time.
link |
01:26:27.840
And we were using, I think, eight GPUs,
link |
01:26:30.400
which was somehow a lot at the time.
link |
01:26:32.360
And then I think Montreal was a bit more limited in the scale.
link |
01:26:36.080
But then they discovered this content based attention
link |
01:26:38.800
concept that then has obviously triggered things
link |
01:26:42.240
like Transformer.
link |
01:26:43.320
Not everything obviously starts with the Transformer.
link |
01:26:46.280
There's always a history that is important to recognize
link |
01:26:49.920
because then you can make sure that then those who might feel
link |
01:26:53.680
now, well, we don't have so much compute,
link |
01:26:56.320
you need to then help them optimize
link |
01:27:00.320
that kind of research that might actually
link |
01:27:02.320
produce amazing change.
link |
01:27:04.240
Perhaps it's not as short term as some of these advancements
link |
01:27:07.960
or perhaps it's a different time scale.
link |
01:27:09.720
But the people and the diversity of the field
link |
01:27:13.040
is quite critical that we maintain it.
link |
01:27:15.680
And at times, especially mixed a bit with hype or other things,
link |
01:27:19.800
it's a bit tricky to be observing maybe
link |
01:27:23.600
too much of the same thinking across the board.
link |
01:27:27.760
But the humans definitely are critical.
link |
01:27:30.520
And I can think of quite a few personal examples
link |
01:27:33.960
where also someone told me something
link |
01:27:36.640
that had a huge effect onto some idea.
link |
01:27:40.320
And then that's why I'm saying at least in terms of years,
link |
01:27:43.360
probably some things do happen.
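[Aside: a minimal sketch of content-based attention, the concept credited above to the Montreal group. Their original formulation was additive; this sketch uses the simpler scaled dot-product form that the Transformer later adopted. All shapes and values here are illustrative.]

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, keys, values):
    # Content-based attention: score each position by how well its key
    # matches the query, normalize, and return the weighted mix of values.
    scores = query @ keys.T / np.sqrt(query.shape[-1])  # (1, seq_len)
    weights = softmax(scores)                           # where to attend
    return weights @ values                             # (1, d)

rng = np.random.default_rng(0)
d = 16
query = rng.normal(size=(1, d))    # what the decoder is currently looking for
keys = rng.normal(size=(5, d))     # one key per encoder position
values = rng.normal(size=(5, d))   # the content stored at each position
print(attention(query, keys, values).shape)  # (1, 16)

[End of aside.]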
link |
01:27:44.920
Yeah, it's fascinating.
link |
01:27:46.040
And it's also fascinating how constraints somehow
link |
01:27:48.240
are essential for innovation.
link |
01:27:51.040
And the other thing you mentioned about engineering,
link |
01:27:53.440
I have a sneaking suspicion.
link |
01:27:54.960
Maybe I overdo it; my love is with engineering.
link |
01:28:00.040
So I have a sneaky suspicion that all the genius,
link |
01:28:04.480
a large percentage of the genius is
link |
01:28:06.600
in the tiny details of engineering.
link |
01:28:09.320
So I think we like to think
link |
01:28:14.000
the genius is in the big ideas.
link |
01:28:17.600
I have a sneaking suspicion that because I've
link |
01:28:20.600
seen the genius of details, of engineering details,
link |
01:28:24.440
make a night and day difference.
link |
01:28:28.840
And I wonder if those kind of have a ripple effect over time.
link |
01:28:32.960
So that's sort of taking the engineering
link |
01:28:36.360
perspective that sometimes that quiet innovation
link |
01:28:39.400
at the level of an individual engineer
link |
01:28:41.800
or maybe at the small scale of a few engineers
link |
01:28:44.680
can make all the difference.
link |
01:28:46.840
Because we're working on computers that
link |
01:28:50.200
are scaled across large groups, one engineering decision
link |
01:28:55.080
can lead to ripple effects.
link |
01:28:57.320
It's interesting to think about.
link |
01:28:59.000
Yeah, I mean, engineering, there's
link |
01:29:01.160
also kind of a historical aspect; it might be a bit random.
link |
01:29:06.360
Because if you think of the history of how especially
link |
01:29:10.240
deep learning and neural networks took off,
link |
01:29:12.360
it feels a bit random, because GPUs happened
link |
01:29:16.600
to be there at the right time for a different purpose, which
link |
01:29:19.120
was to play video games.
link |
01:29:20.640
So even the engineering that goes into the hardware
link |
01:29:24.920
matters, and the time frame
link |
01:29:27.160
might be very different.
link |
01:29:28.160
I mean, GPUs evolved throughout many years
link |
01:29:31.640
when we weren't even looking at that.
link |
01:29:33.920
So even at that level, that revolution, so to speak,
link |
01:29:38.680
the ripples are like, we'll see when they stop.
link |
01:29:42.200
But in terms of thinking of why is this happening,
link |
01:29:46.960
I think that when I try to categorize it
link |
01:29:49.760
in sort of things that might not be so obvious,
link |
01:29:52.720
I mean, clearly there's a hardware revolution.
link |
01:29:54.920
We are surfing thanks to that.
link |
01:29:58.360
Data centers as well.
link |
01:29:59.720
I mean, data centers are like, I mean, at Google,
link |
01:30:02.680
for instance, obviously they're serving Google.
link |
01:30:04.840
But now, thanks to that
link |
01:30:06.920
and to having built such amazing data centers,
link |
01:30:09.680
we can train these models.
link |
01:30:11.720
Software is an important one.
link |
01:30:13.400
I think if I look at the state of how
link |
01:30:16.640
I had to implement things to implement my ideas,
link |
01:30:20.040
how I discarded ideas because they were too hard
link |
01:30:22.080
to implement.
link |
01:30:23.120
Yeah, clearly the times have changed.
link |
01:30:25.280
And thankfully, we are in a much better software position
link |
01:30:28.440
as well.
link |
01:30:29.400
And then, I mean, obviously there's
link |
01:30:31.680
research that happens at scale and more people
link |
01:30:34.360
enter the field.
link |
01:30:35.120
That's great to see.
link |
01:30:35.920
But it's almost enabled by these other things.
link |
01:30:38.200
And last but not least is also data, right?
link |
01:30:40.560
Curating data sets, labeling data sets,
link |
01:30:43.120
these benchmarks we think about.
link |
01:30:44.920
Maybe we'll want to have all the benchmarks in one system.
link |
01:30:48.880
But it's still very valuable that someone
link |
01:30:51.240
put the thought and the time and the vision
link |
01:30:53.600
to build certain benchmarks.
link |
01:30:54.880
We've seen progress thanks to them.
link |
01:30:56.640
But we're going to repurpose the benchmarks.
link |
01:30:59.280
That's the beauty of Atari: we solved it, in a way.
link |
01:31:04.160
But we used it in Gato.
link |
01:31:06.000
It was critical.
link |
01:31:06.840
And I'm sure there's still a lot more
link |
01:31:09.080
to do thanks to that amazing benchmark
link |
01:31:10.960
that someone took the time to put together,
link |
01:31:13.160
even though at the time maybe people said, oh, you
link |
01:31:15.560
should be thinking about the next iteration of architectures.
link |
01:31:19.480
That's what maybe the field recognizes.
link |
01:31:21.440
But that's another thing we need to balance
link |
01:31:24.040
in terms of the humans behind the work.
link |
01:31:25.760
We need to recognize all these aspects
link |
01:31:27.960
because they're all critical.
link |
01:31:29.440
And we tend to think of the genius, the scientist,
link |
01:31:33.600
and so on.
link |
01:31:34.080
But I'm glad I know you have a strong engineering background.
link |
01:31:38.000
But also, I'm a lover of data.
link |
01:31:40.040
And the pushback on the engineering comment
link |
01:31:43.200
is that ultimately it could be the creators of benchmarks
link |
01:31:46.120
who have the most impact.
link |
01:31:47.400
Andrej Karpathy, who you mentioned,
link |
01:31:49.160
has recently been talking a lot of trash about ImageNet, which
link |
01:31:52.240
he has the right to do because of how essential
link |
01:31:54.960
he was to the development
link |
01:31:57.760
and the success of deep learning around ImageNet.
link |
01:32:01.480
And he's saying that
link |
01:32:02.960
that benchmark is actually holding back the field.
link |
01:32:05.520
Because I mean, especially in his context on Tesla Autopilot,
link |
01:32:09.000
that's looking at real world behavior of a system.
link |
01:32:14.280
There's something fundamentally missing
link |
01:32:16.280
about ImageNet that doesn't capture
link |
01:32:17.920
the real worldness of things.
link |
01:32:20.400
That we need to have data sets, benchmarks that
link |
01:32:23.560
have the unpredictability, the edge cases, whatever
link |
01:32:27.600
the heck it is that makes the real world so
link |
01:32:30.760
difficult to operate in.
link |
01:32:32.280
We need to have benchmarks of that.
link |
01:32:34.640
But just to think about the impact of ImageNet
link |
01:32:37.760
as a benchmark, and that really puts a lot of emphasis
link |
01:32:42.120
on the importance of a benchmark,
link |
01:32:43.720
both sort of internally at DeepMind and as a community.
link |
01:32:46.640
So one is coming in from within, like,
link |
01:32:50.120
how do I create a benchmark for me to measure and make progress?
link |
01:32:55.280
And how do I make a benchmark for the community
link |
01:32:58.120
to measure and push progress?
link |
01:33:02.520
You have this amazing paper you coauthored,
link |
01:33:05.840
a survey paper called Emergent Abilities
link |
01:33:08.600
of Large Language Models.
link |
01:33:10.480
It has, again, the philosophy here
link |
01:33:12.520
that I'd love to ask you about.
link |
01:33:14.520
What's the intuition about the phenomenon of emergence
link |
01:33:17.320
in neural networks, in transformer language models?
link |
01:33:20.600
Is there a magic threshold beyond which
link |
01:33:24.160
we start to see certain performance?
link |
01:33:27.080
And is that different from task to task?
link |
01:33:29.880
Is that us humans just being poetic and romantic?
link |
01:33:32.640
Or is there literally some level at which we start
link |
01:33:36.160
to see breakthrough performance?
link |
01:33:38.120
Yeah, I mean, this is a property that we start seeing in these systems.
link |
01:33:43.520
So in machine learning,
link |
01:33:48.160
traditionally, again, going back to benchmarks.
link |
01:33:51.680
I mean, if you have some input, output, right,
link |
01:33:54.840
like that is just a single input and a single output,
link |
01:33:58.200
you generally, when you train these systems,
link |
01:34:01.200
you see reasonably smooth curves when
link |
01:34:04.760
you analyze how much the data set size affects
link |
01:34:10.040
the performance, or how the model size affects
link |
01:34:12.280
the performance, or how long you train the system for affects
link |
01:34:18.200
the performance, right?
link |
01:34:19.280
So if we think of ImageNet, the training curves
link |
01:34:23.080
look fairly smooth and predictable in a way.
link |
01:34:28.080
And I would say that's probably because it's
link |
01:34:31.520
kind of a one hop reasoning task, right?
link |
01:34:36.520
It's like, here is an input, and you
link |
01:34:39.160
think for a few hundred milliseconds, maybe 100 or 300,
link |
01:34:42.760
as a human, and then you tell me,
link |
01:34:44.560
yeah, there's an alpaca in this image.
link |
01:34:47.840
So in language, we are seeing benchmarks that require more
link |
01:34:55.560
pondering and more thought in a way, right?
link |
01:34:58.200
These are just the kind of tasks where you need to look for some subtleties.
link |
01:35:02.440
It involves inputs where,
link |
01:35:05.840
even if the input is a sentence describing
link |
01:35:08.360
a mathematical problem, there is a bit more processing
link |
01:35:13.080
required as a human and more introspection.
link |
01:35:15.640
So I think how these benchmarks work
link |
01:35:20.440
means that there is actually a threshold.
link |
01:35:24.720
Just going back to how transformers
link |
01:35:26.480
work in this way of querying for the right questions
link |
01:35:29.520
to get the right answers, that might
link |
01:35:31.760
mean that performance stays random
link |
01:35:35.400
until the right question is asked
link |
01:35:37.720
by the querying system of a transformer or of a language
link |
01:35:40.920
model like a transformer.
link |
01:35:42.760
And only then you might start
link |
01:35:46.240
seeing performance going from random to nonrandom.
link |
01:35:50.000
And this is more empirical.
link |
01:35:53.080
There's no formalism or theory behind this yet,
link |
01:35:56.320
although it might be quite important.
link |
01:35:57.760
But we are seeing these phase transitions
link |
01:36:00.320
of random performance until some,
link |
01:36:03.200
let's say, scale of a model.
link |
01:36:04.880
And then it goes beyond that.
link |
01:36:06.680
And it might be that you need to fit
link |
01:36:10.440
a few low order bits of thought before you can make progress
link |
01:36:16.040
on the whole task.
link |
01:36:17.200
And if you could measure, actually,
link |
01:36:19.720
that breakdown of the task, maybe you
link |
01:36:22.240
would see something more smooth, like, yeah,
link |
01:36:25.320
once you get these and these and these and these and these,
link |
01:36:27.760
then you start making progress in the task.
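[Aside: a toy illustration, not from the conversation, of why a task that chains several sub-steps can look emergent. Per-step skill improves smoothly with scale, but the whole task requires every step to be right, so whole-task accuracy stays near zero and then jumps. The sigmoid and the constant k below are made up.]

import math

def substep_accuracy(log10_params):
    # A smooth, made-up improvement curve for a single sub-step.
    return 1.0 / (1.0 + math.exp(-(log10_params - 9.0)))

k = 10  # number of sub-steps the task chains together
for log10_params in range(6, 13):       # 10^6 .. 10^12 parameters
    p = substep_accuracy(log10_params)
    whole_task = p ** k                 # must get every sub-step right
    print(f"10^{log10_params} params: substep={p:.2f}, whole task={whole_task:.3f}")

[End of aside.]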
link |
01:36:30.240
But it's somehow a bit annoying because then it
link |
01:36:35.240
means that certain questions we might ask about architectures
link |
01:36:40.240
possibly can only be answered at a certain scale.
link |
01:36:42.960
And one thing that, conversely, I've
link |
01:36:46.320
seen great progress on in the last couple of years
link |
01:36:49.200
is this notion of science of deep learning and science
link |
01:36:53.120
of scale in particular.
link |
01:36:55.000
So on the negative side, there are
link |
01:36:57.520
some benchmarks for which progress might
link |
01:37:01.000
need to be measured at minimum at a certain scale
link |
01:37:04.000
until you see what details of the model
link |
01:37:07.040
matter to make that performance better.
link |
01:37:09.960
So that's a bit of a con.
link |
01:37:11.880
But what we've also seen is that you can empirically
link |
01:37:17.960
analyze behavior of models at scales that are smaller.
link |
01:37:22.880
So let's say, to give an example, we
link |
01:37:25.920
had this Chinchilla paper that revised the so called scaling
link |
01:37:30.080
laws of models.
link |
01:37:31.320
And that whole study is done at a reasonably small scale,
link |
01:37:35.000
maybe hundreds of millions up to 1 billion parameters.
link |
01:37:38.600
And then the cool thing is that you measure some loss,
link |
01:37:41.880
and you extract trends from the data,
link |
01:37:45.840
so you see, OK, it looks like the amount of data required
link |
01:37:49.400
to train a 10x larger model would be this.
link |
01:37:52.080
And these laws so far, these extrapolations
link |
01:37:55.200
have helped us save compute and just get to a better place
link |
01:37:59.880
in terms of the science of how we should
link |
01:38:02.520
run these models at scale, how much data, how much depth,
link |
01:38:05.640
and all sorts of questions we start
link |
01:38:07.360
asking by extrapolating from a small scale.
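[Aside: a minimal sketch, on made-up numbers, of the kind of empirical extrapolation described here: fit a power law to small-scale runs, then predict the loss of a much larger model. The actual Chinchilla analysis uses a richer functional form and fitting procedure; this only illustrates the idea.]

import numpy as np

# Synthetic (model size, loss) pairs standing in for small-scale runs.
n = np.array([1e8, 2e8, 4e8, 8e8, 1e9])
loss = 1.7 + 8.0 * n ** -0.35            # pretend these were measured

# Fit loss = E + A * N^(-alpha) in log-log space, treating the
# irreducible loss E as known here purely for simplicity.
E = 1.7
slope, intercept = np.polyfit(np.log(n), np.log(loss - E), 1)
alpha, A = -slope, np.exp(intercept)

# Extrapolate: predicted loss for a model 10x larger than the biggest run.
pred = E + A * (1e10) ** -alpha
print(f"alpha={alpha:.2f}, predicted loss at 1e10 params: {pred:.3f}")

[End of aside.]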
link |
01:38:10.560
But then this emergence sadly means that not everything
link |
01:38:13.720
can be extrapolated from scale depending on the benchmark.
link |
01:38:16.920
And maybe the harder benchmarks are not
link |
01:38:19.840
so good for extracting these laws.
link |
01:38:21.920
But we have a variety of benchmarks at least.
link |
01:38:24.160
So I wonder to which degree the threshold, the phase shift
link |
01:38:29.240
scale is a function of the benchmark.
link |
01:38:32.440
So some of the science of scale might
link |
01:38:35.120
be engineering benchmarks where that threshold is low,
link |
01:38:40.400
sort of taking a main benchmark and reducing it somehow
link |
01:38:46.160
where the essential difficulty is left
link |
01:38:48.480
but the scale at which the emergence happens
link |
01:38:51.880
is lower just for the science aspect of it
link |
01:38:54.320
versus the actual real world aspect.
link |
01:38:56.960
Yeah, so luckily we have quite a few benchmarks, some of which
link |
01:38:59.920
are simpler, or maybe more like what I think people might
link |
01:39:02.640
call system one versus system two style.
link |
01:39:05.920
So I think what we're now seeing, luckily,
link |
01:39:09.920
is that extrapolations from maybe slightly more smooth
link |
01:39:14.080
or simpler benchmarks are translating to the harder ones.
link |
01:39:18.560
But that is not to say that this extrapolation won't
link |
01:39:21.480
hit its limits.
link |
01:39:22.560
And when it does, then how much we scale or how we scale
link |
01:39:27.560
will sadly be a bit suboptimal until we find better laws.
link |
01:39:31.760
And these laws, again, are very empirical laws.
link |
01:39:33.840
They're not like physical laws of models,
link |
01:39:35.960
although I wish there were better theory about these
link |
01:39:39.520
things as well.
link |
01:39:40.240
But so far, I would say empirical theory,
link |
01:39:43.040
as I call it, is way ahead of actual theory
link |
01:39:46.000
of machine learning.
link |
01:39:47.800
Let me ask you almost for fun.
link |
01:39:50.560
So this is not, Oriol, as a DeepMind person or anything
link |
01:39:55.840
to do with DeepMind or Google, just as a human being,
link |
01:39:59.080
looking at this news of a Google engineer who claimed
link |
01:40:04.320
that, I guess, the LaMDA language model was sentient.
link |
01:40:11.120
And you still need to look into the details of this.
link |
01:40:14.080
But he made an official report with the claim
link |
01:40:19.440
that he believes there's evidence that this system has
link |
01:40:23.880
achieved sentience.
link |
01:40:25.160
And I think this is a really interesting case
link |
01:40:29.480
on a human level, on a psychological level,
link |
01:40:31.720
on a technical machine learning level of how language models
link |
01:40:37.240
transform our world, and also just philosophical level
link |
01:40:39.840
of the role of AI systems in a human world.
link |
01:40:44.200
So what do you find interesting?
link |
01:40:48.080
What's your take on all of this as a machine learning
link |
01:40:51.080
engineer and a researcher and also as a human being?
link |
01:40:54.240
Yeah, I mean, a few reactions.
link |
01:40:57.440
Quite a few, actually.
link |
01:40:58.680
Have you ever briefly thought, is this thing sentient?
link |
01:41:02.560
Right, so never, absolutely never.
link |
01:41:04.800
Like even with AlphaStar?
link |
01:41:06.240
Wait a minute.
link |
01:41:08.080
Sadly, though, I think, yeah, sadly, I have not.
link |
01:41:11.840
Yeah, I think the current, any of the current models,
link |
01:41:15.280
although very useful and very good,
link |
01:41:18.880
yeah, I think we're quite far from that.
link |
01:41:22.320
And there's kind of a converse side story.
link |
01:41:25.320
So one of my passions is about science in general.
link |
01:41:30.320
And I think I feel I'm a bit of a failed scientist.
link |
01:41:34.440
That's why I came to machine learning,
link |
01:41:36.520
because you always feel, and you start seeing this,
link |
01:41:40.080
that machine learning is maybe the science that
link |
01:41:43.320
can help other sciences, as we've seen.
link |
01:41:46.400
It's such a powerful tool.
link |
01:41:48.640
So thanks to that angle, that, OK, I love science.
link |
01:41:52.480
I love, I mean, I love astronomy.
link |
01:41:53.880
I love biology.
link |
01:41:54.880
But I'm not an expert.
link |
01:41:56.000
And I decided, well, the thing I can do better
link |
01:41:58.600
at is computers.
link |
01:41:59.960
But especially when I was a bit more involved
link |
01:42:04.720
in AlphaFold, learning a bit about proteins
link |
01:42:07.400
and about biology and about life,
link |
01:42:11.440
the complexity, it feels like it really is something else.
link |
01:42:14.840
I mean, if you start looking at the things that are going on
link |
01:42:19.200
at the atomic level, and also, I mean,
link |
01:42:26.360
we are maybe inclined to try to think of neural networks
link |
01:42:29.280
as like the brain.
link |
01:42:30.400
But the complexities and the amount of magic
link |
01:42:33.760
that it feels when, I mean, I'm not an expert,
link |
01:42:37.080
so it naturally feels more magical.
link |
01:42:38.560
But looking at biological systems,
link |
01:42:40.800
as opposed to these computational brains,
link |
01:42:46.640
just makes me like, wow, there's such a level of complexity
link |
01:42:50.320
difference still, like orders of magnitude complexity that,
link |
01:42:54.840
sure, these weights, I mean, we train them
link |
01:42:56.640
and they do nice things.
link |
01:42:58.040
But they're not at the level of biological entities, brains,
link |
01:43:04.320
cells.
link |
01:43:06.000
It just feels like it's just not possible to achieve
link |
01:43:09.680
the same level of complexity of behavior.
link |
01:43:12.400
And my belief, when I talk to other beings,
link |
01:43:16.240
is certainly shaped by this amazement of biology
link |
01:43:20.360
that, maybe because I know too much,
link |
01:43:22.360
I don't have about machine learning,
link |
01:43:23.800
but I certainly feel it's very far fetched and far
link |
01:43:28.120
in the future to be calling or to be thinking,
link |
01:43:31.720
well, this mathematical function that is differentiable
link |
01:43:35.640
is, in fact, sentient and so on.
link |
01:43:39.200
There's something on that point that is very interesting.
link |
01:43:42.000
So you know enough about machines and enough
link |
01:43:46.120
about biology to know that there's
link |
01:43:47.760
many orders of magnitude of difference and complexity.
link |
01:43:51.880
But you know how machine learning works.
link |
01:43:56.080
So the interesting question is for human beings
link |
01:43:58.160
who are interacting with a system and don't know
link |
01:44:00.080
about the underlying complexity.
link |
01:44:02.280
And I've seen people, probably including myself,
link |
01:44:05.240
that have fallen in love with things that are quite simple.
link |
01:44:08.400
And so maybe the complexity is one part of the picture,
link |
01:44:11.520
but maybe that's not a necessary condition for sentience,
link |
01:44:18.840
for perception or emulation of sentience.
link |
01:44:24.760
Right.
link |
01:44:25.280
So I mean, I guess the other side of this
link |
01:44:27.560
is that's how I feel personally.
link |
01:44:29.560
I mean, you asked me about the person, right?
link |
01:44:32.360
Now, it's very interesting to see how other humans feel
link |
01:44:35.560
about things, right?
link |
01:44:37.080
We are, again, I'm not as amazed about things
link |
01:44:41.640
that I feel this is not as magical as this other thing
link |
01:44:44.560
because of maybe how I got to learn about it
link |
01:44:48.040
and how I see the curve a bit more smooth
link |
01:44:50.480
because I've just seen the progress of language models
link |
01:44:54.000
since Shannon in the 50s.
link |
01:44:56.000
And actually looking at that time scale,
link |
01:44:58.920
the progress is not that fast, right?
link |
01:45:00.880
I mean, what we were thinking at the time almost 100 years ago
link |
01:45:06.040
is not that dissimilar to what we're doing now.
link |
01:45:08.880
But at the same time, yeah, obviously others,
link |
01:45:11.440
my experience, the personal experience,
link |
01:45:14.440
I think no one should tell others how they should feel.
link |
01:45:20.680
I mean, the feelings are very personal, right?
link |
01:45:22.920
So how others might feel about the models and so on.
link |
01:45:26.080
That's one part of the story that
link |
01:45:27.840
is important to understand for me personally as a researcher.
link |
01:45:31.960
And then when I maybe disagree or I
link |
01:45:35.200
don't understand or see that, yeah, maybe this is not
link |
01:45:38.200
something I think right now is reasonable,
link |
01:45:39.920
knowing all that I know, one of the other things
link |
01:45:42.840
and perhaps partly why it's great to be talking to you
link |
01:45:46.480
and reaching out to the world about machine learning
link |
01:45:49.200
is, hey, let's demystify a bit the magic
link |
01:45:53.440
and try to see a bit more of the math
link |
01:45:56.200
and the fact that literally to create these models,
link |
01:45:59.800
if we had the right software, it would be 10 lines of code
link |
01:46:03.520
and then just a dump of the internet.
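[Aside: to make the "ten lines of code" remark concrete, roughly what a minimal byte-level language-model training loop looks like with a library such as PyTorch. The corpus file name is a placeholder, the model is a small recurrent net rather than a transformer for brevity, and real training at scale adds enormous engineering on top.]

import torch
import torch.nn as nn

data = torch.tensor(list(open("dump_of_the_internet.txt", "rb").read()), dtype=torch.long)
emb = nn.Embedding(256, 128)               # one embedding per byte value
lstm = nn.LSTM(128, 256, batch_first=True)
head = nn.Linear(256, 256)                 # logits over the next byte
opt = torch.optim.Adam([*emb.parameters(), *lstm.parameters(), *head.parameters()])
for step in range(1000):
    starts = torch.randint(0, len(data) - 65, (32,)).tolist()
    batch = torch.stack([data[j:j + 65] for j in starts])  # 32 windows of 65 bytes
    x, y = batch[:, :-1], batch[:, 1:]                     # predict the next byte
    logits = head(lstm(emb(x))[0])
    loss = nn.functional.cross_entropy(logits.reshape(-1, 256), y.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()

[End of aside.]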
link |
01:46:06.760
Versus then the complexity of the creation of humans
link |
01:46:11.600
from their inception, right?
link |
01:46:13.520
And also the complexity of evolution of the whole universe
link |
01:46:17.600
to where we are that feels orders of magnitude
link |
01:46:21.040
more complex and fascinating to me.
link |
01:46:23.400
So I think, yeah, maybe the main thing
link |
01:46:26.520
I'm trying to tell you is, yeah, I think
link |
01:46:30.240
we should explain a bit of the magic.
link |
01:46:32.560
There is a bit of magic.
link |
01:46:33.600
It's good to be in love, obviously,
link |
01:46:35.240
with what you do at work.
link |
01:46:36.920
And I'm certainly fascinated and surprised quite often as well.
link |
01:46:41.320
But I think, hopefully, experts in biology
link |
01:46:45.040
will tell me this is not as magical.
link |
01:46:47.080
And I'm happy to learn that through interactions
link |
01:46:50.840
with the larger community, we can also
link |
01:46:54.000
have a certain level of education
link |
01:46:56.000
that in practice also will matter because, I mean,
link |
01:46:58.920
one question is how you feel about this.
link |
01:47:00.800
But then the other very important one is
link |
01:47:03.000
you starting to interact with these in products and so on.
link |
01:47:06.960
It's good to understand a bit what's going on,
link |
01:47:09.240
what's not going on, what's safe, what's not safe,
link |
01:47:12.280
and so on, right?
link |
01:47:13.000
Otherwise, the technology will not
link |
01:47:15.280
be used properly for good, which is obviously
link |
01:47:18.120
the goal of all of us, I hope.
link |
01:47:20.480
So let me then ask the next question.
link |
01:47:22.920
Do you think in order to solve intelligence
link |
01:47:25.760
or to replace the Lex bot that does interviews
link |
01:47:29.480
as we started this conversation with,
link |
01:47:31.480
do you think the system needs to be sentient?
link |
01:47:34.880
Do you think it needs to achieve something like consciousness?
link |
01:47:38.720
And do you think about what consciousness
link |
01:47:41.120
is in the human mind that could be instructive for creating AI
link |
01:47:45.360
systems?
link |
01:47:46.720
Yeah.
link |
01:47:47.760
Honestly, I think probably not. To the degree of intelligence
link |
01:47:53.480
where there's this brain that can learn,
link |
01:47:58.760
can be extremely useful, can challenge you, can teach you.
link |
01:48:02.960
Conversely, you can teach it to do things.
link |
01:48:05.600
I'm not sure it's necessary, personally speaking.
link |
01:48:09.080
But if consciousness or any other biological or evolutionary
link |
01:48:15.680
lesson can be repurposed to then influence
link |
01:48:20.880
our next set of algorithms, that is a great way
link |
01:48:24.360
to actually make progress, right?
link |
01:48:25.680
And the same way I try to explain transformers a bit
link |
01:48:28.240
how it feels we operate when we look at text specifically,
link |
01:48:33.360
these insights are very important, right?
link |
01:48:36.000
So there's a distinction between details of how the brain might
link |
01:48:41.240
be doing computation.
link |
01:48:43.200
I think my understanding is, sure, there's neurons
link |
01:48:46.560
and there's some resemblance to neural networks,
link |
01:48:48.520
but we don't quite understand enough of the brain in detail,
link |
01:48:52.200
right, to be able to replicate it.
link |
01:48:55.240
But then if you zoom out a bit, our thought process,
link |
01:49:01.320
how memory works, maybe even how evolution got us here,
link |
01:49:05.560
what's exploration, exploitation,
link |
01:49:07.280
like how these things happen, I think
link |
01:49:09.080
these clearly can inform algorithmic level research.
link |
01:49:12.960
And I've seen some examples of this
link |
01:49:17.040
being quite useful to then guide the research,
link |
01:49:19.720
even it might be for the wrong reasons, right?
link |
01:49:21.640
So I think biology and what we know about ourselves
link |
01:49:26.080
can help a whole lot to build, essentially,
link |
01:49:30.000
what we call AGI, this general, the real Gato, right?
link |
01:49:34.480
The last step of the chain, hopefully.
link |
01:49:36.480
But consciousness in particular, I don't myself
link |
01:49:40.760
at least think too hard about how to add that to the system.
link |
01:49:44.760
But maybe my understanding is also very personal
link |
01:49:47.840
about what it means, right?
link |
01:49:48.840
I think even that in itself is a long debate
link |
01:49:51.760
that I know people have often.
link |
01:49:55.240
And maybe I should learn more about this.
link |
01:49:57.720
Yeah, and I personally, I notice the magic often
link |
01:50:01.680
on a personal level, especially with physical systems
link |
01:50:04.960
like robots.
link |
01:50:06.120
I have a lot of legged robots now in Austin
link |
01:50:10.440
that I play with.
link |
01:50:11.680
And even when you program them, when
link |
01:50:13.480
they do things you didn't expect,
link |
01:50:15.560
there's an immediate anthropomorphization.
link |
01:50:18.560
And you notice the magic, and you
link |
01:50:19.960
start to think about things like sentience
link |
01:50:22.600
that has to do more with effective communication
link |
01:50:26.000
and less with any of these kind of dramatic things.
link |
01:50:30.160
It seems like a useful part of communication.
link |
01:50:32.840
Having the perception of consciousness
link |
01:50:36.560
seems like useful for us humans.
link |
01:50:38.800
We treat each other more seriously.
link |
01:50:40.840
We are able to do a nearest neighbor shoving of that entity
link |
01:50:46.000
into your memory correctly, all that kind of stuff.
link |
01:50:48.640
It seems useful, at least to fake it,
link |
01:50:50.800
even if you never make it.
link |
01:50:52.440
So maybe, like, yeah, mirroring the question.
link |
01:50:55.560
And since you talked to a few people,
link |
01:50:57.440
then you do think that we'll need
link |
01:50:59.880
to figure something out in order to achieve intelligence
link |
01:51:04.560
in a grander sense of the word.
link |
01:51:06.520
Yeah, I personally believe yes, but I don't even
link |
01:51:09.360
think it'll be like a separate island we'll have to travel to.
link |
01:51:14.160
I think it will emerge quite naturally.
link |
01:51:16.400
OK, that's easier for us then.
link |
01:51:19.040
Thank you.
link |
01:51:20.080
But the reason I think it's important to think about
link |
01:51:22.760
is you will start, I believe, like with this Google
link |
01:51:25.800
engineer, you will start seeing this a lot more, especially
link |
01:51:29.320
when you have AI systems that are actually interacting
link |
01:51:31.600
with human beings that don't have an engineering background.
link |
01:51:35.120
And we have to prepare for that.
link |
01:51:38.520
Because I do believe there will be a civil rights
link |
01:51:41.160
movement for robots, as silly as it is to say.
link |
01:51:44.520
There's going to be a large number of people
link |
01:51:46.720
that realize there's these intelligent entities with whom
link |
01:51:49.760
I have a deep relationship, and I don't want to lose them.
link |
01:51:53.160
They've come to be a part of my life, and they mean a lot.
link |
01:51:55.920
They have a name.
link |
01:51:57.120
They have a story.
link |
01:51:58.040
They have a memory.
link |
01:51:59.120
And we start to ask questions about ourselves.
link |
01:52:01.240
Well, this thing sure seems like it's capable of suffering,
link |
01:52:07.520
because it tells all these stories of suffering.
link |
01:52:09.800
It doesn't want to die and all those kinds of things.
link |
01:52:11.960
And we have to start to ask ourselves questions.
link |
01:52:14.400
What is the difference between a human being and this thing?
link |
01:52:16.960
And so when you engineer, I believe
link |
01:52:20.120
from an engineering perspective, from DeepMind or anybody
link |
01:52:23.400
that builds systems, there might be laws in the future
link |
01:52:26.440
where you're not allowed to engineer systems
link |
01:52:29.120
with displays of sentience, unless they're explicitly
link |
01:52:35.120
designed to be that, unless it's a pet.
link |
01:52:37.320
So if you have a system that's just doing customer support,
link |
01:52:41.160
you're legally not allowed to display sentience.
link |
01:52:44.160
We'll start to ask ourselves that question.
link |
01:52:47.200
And then so that's going to be part of the software
link |
01:52:49.920
engineering process.
link |
01:52:52.080
Which features do we have?
link |
01:52:53.320
And one of them is communication of sentience.
link |
01:52:56.440
But it's important to start thinking about that stuff,
link |
01:52:58.680
especially how much it captivates public attention.
link |
01:53:01.640
Yeah, absolutely.
link |
01:53:03.120
It's definitely a topic that is important,
link |
01:53:06.360
that we think about.
link |
01:53:07.880
And I think in a way, I always see not every movie
link |
01:53:12.560
is equally on point with certain things.
link |
01:53:16.080
But certainly science fiction in this sense
link |
01:53:19.000
at least has prepared society to start
link |
01:53:22.120
thinking about certain topics that even if it's
link |
01:53:25.360
too early to talk about, as long as we are reasonable,
link |
01:53:29.400
it's certainly going to prepare us for both the research
link |
01:53:33.840
to come and how to.
link |
01:53:34.920
I mean, there's many important challenges and topics
link |
01:53:38.080
that come with building an intelligent system, many of
link |
01:53:43.200
which you just mentioned.
link |
01:53:44.640
So I think we're never going to be fully ready
link |
01:53:49.880
unless we talk about these.
link |
01:53:51.360
And we start also, as I said, just expanding the people
link |
01:53:58.840
we talk to, to include not only our own researchers and so on.
link |
01:54:03.240
And in fact, places like DeepMind but elsewhere,
link |
01:54:06.480
there's more interdisciplinary groups forming up
link |
01:54:10.320
to start asking and really working
link |
01:54:12.880
with us on these questions.
link |
01:54:14.880
Because obviously, this is not initially
link |
01:54:17.400
what your passion is when you do your PhD,
link |
01:54:19.360
but certainly it is coming.
link |
01:54:21.440
So it's fascinating.
link |
01:54:23.120
It's the thing that brings me to one of my passions
link |
01:54:27.160
that is learning.
link |
01:54:28.000
So in this sense, this is a new area
link |
01:54:31.680
that, as a learning system myself,
link |
01:54:35.120
I want to keep exploring.
link |
01:54:36.640
And I think it's great to see parts of the debate.
link |
01:54:41.000
And even I've seen a level of maturity
link |
01:54:43.720
in the conferences that deal with AI.
link |
01:54:46.400
If you look five years ago to now,
link |
01:54:49.840
just the amount of workshops and so on has changed so much.
link |
01:54:53.040
It's impressive to see how much topics of safety, ethics,
link |
01:54:58.520
and so on come to the surface, which is great.
link |
01:55:01.720
And if it were too early, clearly it's fine.
link |
01:55:03.800
I mean, it's a big field, and there's
link |
01:55:05.920
lots of people with lots of interests
link |
01:55:09.040
that will make progress.
link |
01:55:11.880
And obviously, I don't believe we're too late.
link |
01:55:14.160
So in that sense, I think it's great
link |
01:55:16.440
that we're doing this already.
link |
01:55:18.160
It's better to be too early than too late
link |
01:55:20.200
when it comes to super intelligent AI systems.
link |
01:55:22.720
Let me ask, speaking of sentient AIs,
link |
01:55:25.480
you gave props to your friend Ilya Sutskever
link |
01:55:28.680
for being elected a fellow of the Royal Society.
link |
01:55:31.960
So just as a shout out to a fellow researcher
link |
01:55:34.680
and a friend, what's the secret to the genius of Ilya
link |
01:55:38.240
Sutskever?
link |
01:55:39.400
And also, do you believe that his tweets,
link |
01:55:42.640
as you've hypothesized and Andrej Karpathy did as well,
link |
01:55:46.000
are generated by a language model?
link |
01:55:48.680
Yeah.
link |
01:55:49.360
So I strongly believe Ilya is going to visit in a few weeks,
link |
01:55:54.240
actually.
link |
01:55:54.720
So I'll ask him in person.
link |
01:55:58.000
Will he tell you the truth?
link |
01:55:59.160
Yes, of course, hopefully.
link |
01:56:00.720
I mean, ultimately, we all have shared paths,
link |
01:56:04.040
and there's friendships that go beyond, obviously,
link |
01:56:08.280
institutions and so on.
link |
01:56:09.960
So I hope he tells me the truth.
link |
01:56:11.680
Well, maybe the AI system is holding him hostage somehow.
link |
01:56:14.400
Maybe he has some videos that he doesn't want to release.
link |
01:56:16.920
So maybe it has taken control over him.
link |
01:56:19.720
So he can't tell the truth.
link |
01:56:20.960
Well, if I see him in person, then I think he will know.
link |
01:56:23.920
But I think Ilya's personality, just knowing him for a while,
link |
01:56:33.920
everyone on Twitter, I guess, gets a different persona.
link |
01:56:36.640
And I think Ilya's one does not surprise me.
link |
01:56:40.920
So I think knowing Ilya from before social media
link |
01:56:43.600
and before AI was so prevalent, I
link |
01:56:46.000
recognize a lot of his character.
link |
01:56:47.560
So that's something for me that I
link |
01:56:49.200
feel good about a friend that hasn't changed
link |
01:56:52.520
or is still true to himself.
link |
01:56:55.960
Obviously, there is, though, a fact
link |
01:56:58.960
that your field becomes more popular,
link |
01:57:02.080
and he is obviously one of the main figures in the field,
link |
01:57:05.440
having done a lot of advancement.
link |
01:57:07.040
So I think that the tricky bit here
link |
01:57:09.080
is how to balance your true self with the responsibility
link |
01:57:12.200
that your words carry.
link |
01:57:13.560
So in this sense, I appreciate the style, and I understand it.
link |
01:57:19.360
But it created debates on some of his tweets
link |
01:57:24.160
that maybe it's good we have them early anyways.
link |
01:57:27.920
But yeah, then the reactions are usually polarizing.
link |
01:57:31.040
I think we're just seeing the reality of social media
link |
01:57:34.160
be there as well, reflected on that particular topic
link |
01:57:38.120
or set of topics he's tweeting about.
link |
01:57:40.200
Yeah, I mean, it's funny that he used to speak to this tension.
link |
01:57:42.960
He was one of the early seminal figures
link |
01:57:46.160
in the field of deep learning, so there's
link |
01:57:47.800
a responsibility with that.
link |
01:57:48.960
But he's also, from having interacted with him quite a bit,
link |
01:57:53.200
he's just a brilliant thinker about ideas, as are you.
link |
01:58:01.280
And there's a tension between becoming
link |
01:58:03.120
the manager versus the actual thinking
link |
01:58:06.960
through very novel ideas, the scientist versus the manager.
link |
01:58:13.640
And he's one of the great scientists of our time.
link |
01:58:17.680
So this was quite interesting.
link |
01:58:18.960
And also, people tell me he's quite silly,
link |
01:58:20.840
which I haven't quite detected yet.
link |
01:58:23.200
But in private, we'll have to see about that.
link |
01:58:26.000
Yeah, yeah.
link |
01:58:27.480
I mean, just on the point of, I mean,
link |
01:58:30.000
Ilya has been an inspiration.
link |
01:58:33.360
I mean, quite a few colleagues, I can think,
link |
01:58:35.480
shaped the person you are.
link |
01:58:38.080
Like, Ilya certainly gets probably the top spot,
link |
01:58:42.320
if not close to the top.
link |
01:58:43.800
And if we go back to the question about people in the field,
link |
01:58:47.960
like how their role would have changed the field or not,
link |
01:58:51.680
I think Ilya's case is interesting
link |
01:58:54.000
because he really has a deep belief in the scaling up
link |
01:58:58.760
of neural networks.
link |
01:58:59.640
There was a talk that is still famous to this day
link |
01:59:03.680
from the Sequence to Sequence paper, where he was just
link |
01:59:07.720
claiming, just give me supervised data
link |
01:59:10.560
and a large neural network, and then you'll
link |
01:59:12.800
solve basically all the problems.
link |
01:59:16.240
That vision was already there many years ago.
link |
01:59:19.800
So it's good to see someone who is, in this case,
link |
01:59:22.880
very deeply into this style of research
link |
01:59:27.160
and clearly has had a tremendous track record of successes
link |
01:59:32.800
and so on.
link |
01:59:34.160
The funny bit about that talk is that we rehearsed the talk
link |
01:59:37.520
in a hotel room before, and the original version of that talk
link |
01:59:42.040
would have been even more controversial.
link |
01:59:44.000
So maybe I'm the only person that
link |
01:59:46.760
has seen the unfiltered version of the talk.
link |
01:59:49.520
And maybe when the time comes, maybe we
link |
01:59:52.160
should revisit some of the skip slides
link |
01:59:55.120
from the talk from Ilya.
link |
01:59:57.560
But I really think the deep belief
link |
02:00:01.040
in a certain style of research
link |
02:00:03.240
pays off. It's good to be practical sometimes.
link |
02:00:06.400
And I actually think Ilya and myself are practical,
link |
02:00:09.400
but it's also good to have
link |
02:00:10.440
some sort of long term belief and trajectory.
link |
02:00:14.840
Obviously, there's a bit of luck involved,
link |
02:00:16.720
but it might be that that's the right path.
link |
02:00:18.840
If it is, then you clearly are ahead and hugely influential to the field
link |
02:00:22.320
as he has been.
link |
02:00:23.560
Do you agree with that intuition that maybe
link |
02:00:26.440
was written about by Rich Sutton in The Bitter Lesson,
link |
02:00:33.600
that the biggest lesson that can be read from 70 years of AI
link |
02:00:36.480
research is that general methods that leverage computation
link |
02:00:40.080
are ultimately the most effective?
link |
02:00:42.800
Do you think that intuition is ultimately correct?
link |
02:00:48.560
General methods that leverage computation,
link |
02:00:52.240
allowing the scaling of computation
link |
02:00:54.360
to do a lot of the work.
link |
02:00:56.240
And so the basic task of us humans
link |
02:00:59.640
is to design methods that are more
link |
02:01:01.440
and more general versus more and more specific to the tasks
link |
02:01:05.960
at hand.
link |
02:01:07.040
I certainly think this essentially mimics
link |
02:01:10.320
a bit of the deep learning research,
link |
02:01:14.680
almost like philosophy, that on the one hand,
link |
02:01:18.840
we want to be data agnostic.
link |
02:01:20.480
We don't want to preprocess data sets.
link |
02:01:22.160
We want to see the bytes, the true data as it is,
link |
02:01:25.560
and then learn everything on top.
link |
02:01:27.440
So very much agree with that.
link |
02:01:30.120
And I think scaling up feels, at the very least, again,
link |
02:01:33.360
necessary for building incredibly complex systems.
link |
02:01:38.960
It's possibly not sufficient; it might be that we
link |
02:01:42.880
need a couple of breakthroughs.
link |
02:01:45.080
I think Rich Sutton mentioned search
link |
02:01:47.960
being part of the equation of scale and search.
link |
02:01:52.200
I think search, I've seen it, that's
link |
02:01:55.720
been more mixed in my experience.
link |
02:01:57.400
So from that lesson in particular,
link |
02:01:59.320
search is a bit more tricky because it
link |
02:02:02.480
is very appealing to search in domains like Go,
link |
02:02:05.320
where you have a clear reward function that you can then
link |
02:02:08.080
discard some search traces.
link |
02:02:10.560
But then in some other tasks, it's
link |
02:02:13.160
not very clear how you would do that,
link |
02:02:15.160
although one of our recent works, which actually
link |
02:02:19.320
was mostly mimicking or a continuation,
link |
02:02:22.120
and even the team and the people involved were pretty much very
link |
02:02:25.840
intersecting with AlphaStar, was AlphaCode,
link |
02:02:28.400
in which we actually saw the bitter lesson how
link |
02:02:31.440
scale of the models and then a massive amount of search
link |
02:02:34.240
yielded this kind of very interesting result
link |
02:02:36.760
of reaching human level in code competitions.
link |
02:02:41.280
So I've seen examples of it being
link |
02:02:43.640
literally mapped to search and scale.
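[Aside: a sketch of the scale-plus-search recipe in the spirit of what is described here for AlphaCode: sample a very large number of candidate programs from a language model, then keep only those that pass the problem's visible example tests. The model.generate call and the test format are illustrative placeholders; the actual system also clustered surviving candidates by behavior before submitting.]

import subprocess
import sys

def passes_examples(program, examples):
    # Run a candidate program against the problem's example input/output pairs.
    for stdin_text, expected in examples:
        result = subprocess.run([sys.executable, "-c", program],
                                input=stdin_text, capture_output=True,
                                text=True, timeout=2)
        if result.stdout.strip() != expected.strip():
            return False
    return True

def search(model, problem, examples, num_samples=100_000):
    # The "massive amount of search": sample many candidates, filter hard.
    survivors = []
    for _ in range(num_samples):
        candidate = model.generate(problem)   # placeholder model API
        try:
            if passes_examples(candidate, examples):
                survivors.append(candidate)
        except subprocess.TimeoutExpired:
            pass                              # discard non-terminating candidates
    return survivors

[End of aside.]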
link |
02:02:46.320
I'm not so convinced about the search bit,
link |
02:02:48.120
but certainly I'm convinced scale will be needed.
link |
02:02:51.000
So we need general methods.
link |
02:02:52.600
We need to test them, and maybe we
link |
02:02:54.080
need to make sure that we can scale them given the hardware
link |
02:02:57.080
that we have in practice.
link |
02:02:59.080
But then maybe we should also shape what the hardware looks
link |
02:03:01.920
like based on which methods might be needed to scale.
link |
02:03:05.640
And that's an interesting contrast with these GPU comments,
link |
02:03:11.600
that is, we got it almost for free because games
link |
02:03:14.280
were using these.
link |
02:03:15.080
But maybe now if sparsity is required,
link |
02:03:19.440
we don't have the hardware.
link |
02:03:20.560
Although in theory, many people are
link |
02:03:22.840
building different kinds of hardware these days.
link |
02:03:24.800
But there's a bit of this notion of hardware lottery
link |
02:03:27.760
for scale that might actually have an impact at least
link |
02:03:31.560
on the scale of years on how fast we will make progress
link |
02:03:35.240
to maybe a version of neural nets
link |
02:03:37.680
or whatever comes next that might enable
link |
02:03:41.920
truly intelligent agents.
link |
02:03:44.360
Do you think in your lifetime we will build an AGI system that
link |
02:03:50.520
would undeniably be a thing that achieves human level
link |
02:03:55.640
intelligence and goes far beyond?
link |
02:03:58.480
I definitely think it's possible that it will go far beyond.
link |
02:04:03.720
But I'm definitely convinced that it will
link |
02:04:05.520
reach human level intelligence.
link |
02:04:08.480
And I'm hypothesizing about the beyond
link |
02:04:11.000
because the beyond bit is a bit tricky to define,
link |
02:04:16.520
especially when we look at the current formula of starting
link |
02:04:21.280
from this imitation learning standpoint.
link |
02:04:23.760
So we can certainly imitate humans at language and beyond.
link |
02:04:30.760
So getting at human level through imitation
link |
02:04:33.440
feels very possible.
link |
02:04:34.920
Going beyond will require reinforcement learning
link |
02:04:39.120
and other things.
link |
02:04:39.880
And I think in some areas that certainly already has paid out.
link |
02:04:43.600
I mean, Go being an example that's
link |
02:04:46.000
my favorite so far in terms of going
link |
02:04:48.240
beyond human capabilities.
link |
02:04:50.440
But in general, I'm not sure we can define reward functions
link |
02:04:55.600
that start from a seed of imitating human level
link |
02:04:59.360
intelligence that is general, and then go beyond.
link |
02:05:02.920
Whether that bit happens in my lifetime is not so clear.
link |
02:05:05.280
But certainly, human level, yes.
link |
02:05:08.240
And I mean, that in itself is already quite powerful,
link |
02:05:11.000
I think.
link |
02:05:11.520
So going beyond, I think, obviously
link |
02:05:14.560
we're going to try that, so that we
link |
02:05:17.680
get to superhuman scientists and discovery
link |
02:05:20.760
and advance the world.
link |
02:05:22.160
But at least human level in general
link |
02:05:25.600
is also very, very powerful.
link |
02:05:27.560
Well, especially if human level or slightly beyond
link |
02:05:31.560
is integrated deeply with human society
link |
02:05:33.760
and there's billions of agents like that,
link |
02:05:36.520
do you think there's a singularity moment beyond which
link |
02:05:39.960
our world will be just very deeply transformed
link |
02:05:44.200
by these kinds of systems?
link |
02:05:45.640
Because now you're talking about intelligence systems
link |
02:05:47.840
that are just, I mean, this is no longer just going
link |
02:05:53.040
from horse and buggy to the car.
link |
02:05:56.440
It feels like a very different kind of shift
link |
02:05:59.760
in what it means to be a living entity on Earth.
link |
02:06:03.280
Are you afraid?
link |
02:06:04.240
Are you excited about this world?
link |
02:06:06.280
I'm afraid if there's a lot more of them than us.
link |
02:06:09.360
So I think maybe we'll need to think about if we truly
link |
02:06:13.680
get there just thinking of limited resources
link |
02:06:18.400
like humanity clearly hit some limits
link |
02:06:21.480
and then there's some balance, hopefully,
link |
02:06:23.440
that biologically the planet is imposing.
link |
02:06:26.320
And we should actually try to get better at this.
link |
02:06:28.600
As we know, there's quite a few issues
link |
02:06:31.600
with having too many people coexisting
link |
02:06:35.840
in a resource limited way.
link |
02:06:37.720
So for digital entities, it's an interesting question.
link |
02:06:40.360
I think such a limit maybe should exist.
link |
02:06:43.520
But maybe it's going to be imposed by energy availability
link |
02:06:47.680
because this also consumes energy.
link |
02:06:49.760
In fact, most of these systems are less efficient
link |
02:06:53.560
than we are in terms of energy required.
link |
02:06:56.720
But definitely, I think as a society,
link |
02:06:59.480
we'll need to just work together to find
link |
02:07:03.520
what would be reasonable in terms of growth
link |
02:07:06.400
or how we coexist if that is to happen.
link |
02:07:11.400
I am very excited about, obviously,
link |
02:07:14.640
the aspects of automation that let people
link |
02:07:17.720
who obviously don't have access to certain resources
link |
02:07:20.120
or knowledge have that access.
link |
02:07:23.920
I think those are the applications in a way
link |
02:07:26.280
that I'm most excited to see and to personally work towards.
link |
02:07:30.960
Yeah, there's going to be significant improvements
link |
02:07:32.640
in productivity and the quality of life
link |
02:07:34.320
across the whole population, which is very interesting.
link |
02:07:36.960
But I'm looking even far beyond
link |
02:07:39.200
us becoming a multiplanetary species.
link |
02:07:42.680
And just as a quick bet, last question.
link |
02:07:45.360
Do you think as humans become multiplanetary species,
link |
02:07:49.200
go outside our solar system, all that kind of stuff,
link |
02:07:52.480
do you think there will be more humans
link |
02:07:54.440
or more robots in that future world?
link |
02:07:57.200
So will humans be the quirky, intelligent being of the past
link |
02:08:04.480
or is there something deeply fundamental
link |
02:08:07.000
to human intelligence that's truly special,
link |
02:08:09.560
where we will be part of those other planets,
link |
02:08:12.120
not just AI systems?
link |
02:08:13.920
I think we're all excited to build AGI
link |
02:08:18.640
to empower or make us more powerful as human species.
link |
02:08:25.080
That's not to say there might not be some hybridization.
link |
02:08:27.560
I mean, this is obviously speculation,
link |
02:08:29.680
but there are companies also trying to,
link |
02:08:32.480
the same way medicine is making us better.
link |
02:08:35.640
Maybe there are other things that are yet to happen on that.
link |
02:08:39.080
But if the ratio is not at most one to one,
link |
02:08:43.320
I would not be happy.
link |
02:08:44.520
So I would hope that we are part of the equation,
link |
02:08:49.200
but maybe a one to one ratio feels
link |
02:08:53.280
like possible, constructive and so on,
link |
02:08:56.200
but it would not be good to have an imbalance,
link |
02:08:59.600
at least from my core beliefs and the why I'm doing
link |
02:09:03.280
what I'm doing when I go to work and I research
link |
02:09:05.760
what I research.
link |
02:09:07.120
Well, this is how I know you're human
link |
02:09:09.520
and this is how you've passed the Turing test.
link |
02:09:12.720
And you are one of the special humans, Oriol.
link |
02:09:15.000
It's a huge honor that you would talk with me
link |
02:09:17.120
and I hope we get the chance to speak again,
link |
02:09:19.920
maybe once before the singularity, once after
link |
02:09:23.040
and see how our view of the world changes.
link |
02:09:25.440
Thank you again for talking today.
link |
02:09:26.600
Thank you for the amazing work you do.
link |
02:09:28.200
You're a shining example of a researcher
link |
02:09:31.320
and a human being in this community.
link |
02:09:32.960
Thanks a lot.
link |
02:09:33.800
Like yeah, looking forward to before the singularity
link |
02:09:36.240
certainly and maybe after.
link |
02:09:39.920
Thanks for listening to this conversation
link |
02:09:41.480
with Oriol Vinyals.
link |
02:09:43.120
To support this podcast, please check out our sponsors
link |
02:09:45.520
in the description.
link |
02:09:46.960
And now let me leave you with some words from Alan Turing.
link |
02:09:51.160
Those who can imagine anything can create the impossible.
link |
02:09:55.080
Thank you for listening and hope to see you next time.