
Ishan Misra: Self-Supervised Deep Learning in Computer Vision | Lex Fridman Podcast #206



link |
00:00:00.000
The following is a conversation with Ishan Misra,
link |
00:00:03.240
research scientist at Facebook AI Research,
link |
00:00:05.800
who works on self supervised machine learning
link |
00:00:08.580
in the domain of computer vision,
link |
00:00:10.480
or in other words, making AI systems understand
link |
00:00:14.120
the visual world with minimal help from us humans.
link |
00:00:18.000
Transformers and self attention have been successfully used
link |
00:00:21.720
by OpenAI's GPT-3 and other language models
link |
00:00:25.600
to do self supervised learning in the domain of language.
link |
00:00:28.560
Ishan, together with Yann LeCun and others,
link |
00:00:31.800
is trying to achieve the same success
link |
00:00:33.960
in the domain of images and video.
link |
00:00:36.400
The goal is to leave a robot
link |
00:00:38.320
watching YouTube videos all night,
link |
00:00:40.360
and in the morning, come back to a much smarter robot.
link |
00:00:43.600
I read the blog post, Self Supervised Learning,
link |
00:00:46.000
The Dark Matter of Intelligence by Ishan and Yann LeCun,
link |
00:00:50.360
and then listened to Ishan's appearance
link |
00:00:52.960
on the excellent Machine Learning Street Talk podcast,
link |
00:00:57.200
and I knew I had to talk to him.
link |
00:00:59.160
By the way, if you're interested in machine learning and AI,
link |
00:01:02.860
I cannot recommend the ML Street Talk podcast highly enough.
link |
00:01:07.980
Those guys are great.
link |
00:01:09.640
Quick mention of our sponsors.
link |
00:01:11.280
Onnit, The Information, Grammarly, and Athletic Greens.
link |
00:01:15.400
Check them out in the description to support this podcast.
link |
00:01:18.640
As a side note, let me say that,
link |
00:01:20.480
for those of you who may have been listening
link |
00:01:22.560
for quite a while, this podcast used to be called
link |
00:01:24.960
Artificial Intelligence Podcast,
link |
00:01:27.120
because my life passion has always been,
link |
00:01:29.700
will always be artificial intelligence,
link |
00:01:32.640
both narrowly and broadly defined.
link |
00:01:35.440
My goal with this podcast is still
link |
00:01:37.720
to have many conversations with world class researchers
link |
00:01:40.560
in AI, math, physics, biology, and all the other sciences,
link |
00:01:45.120
but I also want to talk to historians, musicians, athletes,
link |
00:01:49.420
and of course, occasionally comedians.
link |
00:01:51.520
In fact, I'm trying out doing this podcast
link |
00:01:53.600
three times a week now to give me more freedom
link |
00:01:56.200
with guest selection and maybe get a chance
link |
00:01:59.380
to have a bit more fun.
link |
00:02:00.880
Speaking of fun, in this conversation,
link |
00:02:03.160
I challenge the listener to count the number of times
link |
00:02:05.440
the word banana is mentioned.
link |
00:02:08.000
Ishan and I use the word banana as the canonical example
link |
00:02:12.580
at the core of the hard problem of computer vision
link |
00:02:15.200
and maybe the hard problem of consciousness.
link |
00:02:19.880
This is the Lex Fridman Podcast,
link |
00:02:22.640
and here is my conversation with Ishan Misra.
link |
00:02:27.240
What is self supervised learning?
link |
00:02:29.880
And maybe even give the bigger basics
link |
00:02:32.760
of what is supervised and semi supervised learning,
link |
00:02:35.360
and maybe why is self supervised learning
link |
00:02:37.640
a better term than unsupervised learning?
link |
00:02:40.080
Let's start with supervised learning.
link |
00:02:41.600
So typically for machine learning systems,
link |
00:02:43.920
the way they're trained is you get a bunch of humans,
link |
00:02:46.920
the humans point out particular concepts.
link |
00:02:48.600
So if it's in the case of images,
link |
00:02:50.180
you want the humans to come and tell you
link |
00:02:52.960
what is present in the image,
link |
00:02:54.400
draw boxes around them, draw masks of like things,
link |
00:02:57.240
pixels, which are of particular categories or not.
link |
00:03:00.520
For NLP, again, there are like lots
link |
00:03:01.960
of these particular tasks, say about sentiment analysis,
link |
00:03:04.760
about entailment and so on.
link |
00:03:06.620
So typically for supervised learning,
link |
00:03:08.080
we get a big corpus of such annotated or labeled data.
link |
00:03:11.280
And then we feed that to a system
link |
00:03:12.780
and the system is really trying to mimic.
link |
00:03:14.820
So it's taking this input of the data
link |
00:03:16.600
and then trying to mimic the output.
link |
00:03:18.360
So it looks at an image and the human has tagged
link |
00:03:20.680
that this image contains a banana.
link |
00:03:22.400
And now the system is basically trying to mimic that.
link |
00:03:24.680
So that's its learning signal.
link |
00:03:26.680
And so for supervised learning,
link |
00:03:28.000
we try to gather lots of such data
link |
00:03:30.040
and we train these machine learning models
link |
00:03:31.820
to imitate the input output.
link |
00:03:33.460
And the hope is basically by doing so,
link |
00:03:35.600
now on unseen or like new kinds of data,
link |
00:03:38.080
this model can automatically learn
link |
00:03:40.000
to predict these concepts.
link |
00:03:41.320
So this is a standard sort of supervised setting.
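To make the supervised setup concrete, here is a minimal sketch in PyTorch (assuming torch is installed; the random images, the three categories, and the tiny linear model are all hypothetical placeholders): the model is trained to reproduce human-provided labels via a cross-entropy loss.

    import torch
    import torch.nn as nn

    # Hypothetical toy data: a batch of 8 images (3x32x32) with
    # human-provided labels for 3 categories, e.g. banana/cup/table.
    images = torch.randn(8, 3, 32, 32)
    labels = torch.tensor([0, 1, 2, 0, 1, 2, 0, 1])

    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 3))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(10):
        logits = model(images)          # the model's predictions
        loss = loss_fn(logits, labels)  # penalize mismatch with the human labels
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()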
link |
00:03:43.400
For semi supervised setting,
link |
00:03:45.760
the idea typically is that you have,
link |
00:03:47.600
of course, all of the supervised data,
link |
00:03:49.280
but you have lots of other data,
link |
00:03:50.800
which is unsupervised or which is like not labeled.
link |
00:03:53.120
Now, the problem basically with supervised learning
link |
00:03:55.280
and why you actually have all of these alternate
link |
00:03:57.440
sort of learning paradigms is,
link |
00:03:59.400
supervised learning just does not scale.
link |
00:04:01.800
So if you look at for computer vision,
link |
00:04:03.900
the sort of largest,
link |
00:04:05.000
one of the most popular data sets is ImageNet, right?
link |
00:04:07.500
So the entire ImageNet data set has about 22,000 concepts
link |
00:04:11.680
and about 14 million images.
link |
00:04:13.800
So these concepts are basically just nouns
link |
00:04:16.160
and they're annotated on images.
link |
00:04:18.360
And this entire data set was a mammoth data collection
link |
00:04:20.600
effort that actually gave rise
link |
00:04:22.320
to a lot of powerful learning algorithms
link |
00:04:23.840
and is credited with like sort of the rise
link |
00:04:25.640
of deep learning as well.
link |
00:04:27.240
But this data set took about 22 human years
link |
00:04:30.140
to collect, to annotate.
link |
00:04:31.960
And it's not even that many concepts, right?
link |
00:04:33.520
It's not even that many images,
link |
00:04:34.580
14 million is nothing really.
link |
00:04:36.800
Like you have about, I think 400 million images or so,
link |
00:04:39.360
or even more than that uploaded to most of the popular
link |
00:04:41.920
sort of social media websites today.
link |
00:04:44.200
So now supervised learning just doesn't scale.
link |
00:04:46.440
If I want to now annotate more concepts,
link |
00:04:48.680
if I want to have various types of fine grained concepts,
link |
00:04:51.340
then it won't really scale.
link |
00:04:53.240
So now you come up to these sort of different
link |
00:04:54.880
learning paradigms, for example, semi supervised learning,
link |
00:04:57.560
where the idea is you, of course,
link |
00:04:58.600
you have this annotated corpus of supervised data
link |
00:05:01.400
and you have lots of these unlabeled images.
link |
00:05:03.720
And the idea is that the algorithm should basically try
link |
00:05:05.860
to measure some kind of consistency
link |
00:05:08.000
or really try to measure some kind of signal
link |
00:05:10.320
on this sort of unlabeled data
link |
00:05:12.200
to make itself more confident
link |
00:05:14.200
about what it's really trying to predict.
link |
00:05:16.200
So by access to this, lots of unlabeled data,
link |
00:05:19.680
the idea is that the algorithm actually learns
link |
00:05:22.240
to be more confident and actually gets better
link |
00:05:24.560
at predicting these concepts.
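One common way that is done is with a consistency loss: on top of the usual supervised term, the model is asked to give matching predictions for two perturbed views of the same unlabeled image. A minimal sketch, assuming PyTorch and using simple additive noise as the perturbation (real systems use much stronger augmentations):

    import torch
    import torch.nn.functional as F

    def semi_supervised_loss(model, x_labeled, y_labeled, x_unlabeled, weight=1.0):
        # Standard supervised term on the small labeled set.
        supervised = F.cross_entropy(model(x_labeled), y_labeled)

        # Consistency term on unlabeled data: predictions for two noisy
        # views of the same images should agree.
        view1 = x_unlabeled + 0.05 * torch.randn_like(x_unlabeled)
        view2 = x_unlabeled + 0.05 * torch.randn_like(x_unlabeled)
        log_p1 = F.log_softmax(model(view1), dim=-1)
        p2 = F.softmax(model(view2), dim=-1)
        consistency = F.kl_div(log_p1, p2, reduction="batchmean")

        return supervised + weight * consistency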
link |
00:05:26.920
And now we come to the other extreme,
link |
00:05:28.520
which is like self supervised learning.
link |
00:05:30.520
The idea basically is that the machine or the algorithm
link |
00:05:33.040
should really discover concepts or discover things
link |
00:05:35.660
about the world or learn representations about the world
link |
00:05:38.200
which are useful without access
link |
00:05:40.080
to explicit human supervision.
link |
00:05:41.800
So the word supervision is still
link |
00:05:44.360
in the term self supervised.
link |
00:05:46.280
So what is the supervision signal?
link |
00:05:48.560
And maybe that perhaps is when Yann LeCun
link |
00:05:51.240
and you argue that unsupervised
link |
00:05:52.920
is the incorrect terminology here.
link |
00:05:55.040
So what is the supervision signal
link |
00:05:57.440
when the humans aren't part of the picture
link |
00:05:59.720
or not a big part of the picture?
link |
00:06:02.400
Right, so self supervised,
link |
00:06:04.520
the reason that it has the term supervised in itself
link |
00:06:06.840
is because you're using the data itself as supervision.
link |
00:06:10.360
So because the data serves as its own source of supervision,
link |
00:06:13.200
it's self supervised in that way.
link |
00:06:15.160
Now, the reason a lot of people,
link |
00:06:16.400
I mean, we did it in that blog post with Yann,
link |
00:06:18.380
but a lot of other people have also argued
link |
00:06:20.120
for using this term self supervised.
link |
00:06:22.080
So starting from like '94 from Virginia de Sa's group,
link |
00:06:25.680
I think, and now she's at UCSD.
link |
00:06:28.800
Jitendra Malik has said this a bunch of times as well.
link |
00:06:31.640
So you have supervised,
link |
00:06:33.080
and then unsupervised basically means everything
link |
00:06:35.200
which is not supervised,
link |
00:06:36.400
but that includes stuff like semi supervised,
link |
00:06:38.640
that includes other like transductive learning,
link |
00:06:41.280
lots of other sort of settings.
link |
00:06:43.000
So that's the reason like now people are preferring
link |
00:06:46.040
this term self supervised
link |
00:06:47.120
because it explicitly says what's happening.
link |
00:06:49.240
The data itself is the source of supervision
link |
00:06:51.620
and any sort of learning algorithm
link |
00:06:53.120
which tries to extract just sort of data supervision signals
link |
00:06:56.920
from the data itself is a self supervised algorithm.
link |
00:06:59.480
But there is within the data,
link |
00:07:02.160
a set of tricks which unlock the supervision.
link |
00:07:05.560
So can you give maybe some examples
link |
00:07:07.200
and there's innovation ingenuity required
link |
00:07:11.360
to unlock that supervision.
link |
00:07:12.840
The data doesn't just speak to you some ground truth,
link |
00:07:15.600
you have to do some kind of trick.
link |
00:07:17.760
So I don't know what your favorite domain is.
link |
00:07:19.560
So you specifically specialize in visual learning,
link |
00:07:23.000
but is there favorite examples,
link |
00:07:24.480
maybe in language or other domains?
link |
00:07:26.520
Perhaps the most successful applications
link |
00:07:28.300
have been in NLP, natural language processing.
link |
00:07:31.060
So the idea basically being that you can train models
link |
00:07:34.000
where you have a sentence and you mask out certain words.
link |
00:07:37.360
And now these models learn to predict the masked out words.
link |
00:07:40.500
So if you have like the cat jumped over the dog,
link |
00:07:44.000
so you can basically mask out cat.
link |
00:07:45.940
And now you're essentially asking the model
link |
00:07:47.360
to predict what was missing, what did I mask out?
link |
00:07:50.280
So the model is going to predict basically a distribution
link |
00:07:53.220
over all the possible words that it knows.
link |
00:07:55.320
And probably it has like if it's a well trained model,
link |
00:07:58.360
it has a sort of higher probability density
link |
00:08:00.580
for this word cat.
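One quick way to see this masked-word prediction in action, assuming the Hugging Face transformers library and the pretrained bert-base-uncased checkpoint are available:

    from transformers import pipeline

    # A pretrained masked language model: it was trained self-supervised,
    # purely by predicting masked-out words in large text corpora.
    fill_mask = pipeline("fill-mask", model="bert-base-uncased")

    # Ask the model what was masked out of the example sentence.
    for guess in fill_mask("The [MASK] jumped over the dog."):
        print(guess["token_str"], round(guess["score"], 3))
    # A well-trained model puts high probability on words like "cat".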
link |
00:08:02.560
For vision, I would say the sort of more,
link |
00:08:05.520
I mean, the easier example,
link |
00:08:07.480
which is not as widely used these days,
link |
00:08:09.420
is basically say, for example, video prediction.
link |
00:08:12.040
So video is again, a sequence of things.
link |
00:08:14.080
So you can ask the model,
link |
00:08:15.040
so if you have a video of say 10 seconds,
link |
00:08:17.440
you can feed in the first nine seconds to a model
link |
00:08:19.840
and then ask it, hey, what happens basically
link |
00:08:21.960
in the 10th second, can you predict what's going to happen?
link |
00:08:24.500
And the idea basically is because the model
link |
00:08:26.760
is predicting something about the data itself.
link |
00:08:29.440
Of course, you didn't need any human
link |
00:08:31.380
to tell you what was happening
link |
00:08:32.300
because the 10 second video was naturally captured.
link |
00:08:34.600
Because the model is predicting what's happening there,
link |
00:08:36.680
it's going to automatically learn something
link |
00:08:39.020
about the structure of the world, how objects move,
link |
00:08:41.240
object permanence, and these kinds of things.
link |
00:08:44.000
So like, if I have something at the edge of the table,
link |
00:08:45.960
it will fall down.
link |
00:08:47.520
Things like these, which you really don't have to sit
link |
00:08:49.280
and annotate.
link |
00:08:50.280
In a supervised learning setting,
link |
00:08:51.320
I would have to sit and annotate.
link |
00:08:52.280
This is a cup, now I move this cup, this is still a cup,
link |
00:08:55.200
and now I move this cup, it's still a cup,
link |
00:08:56.640
and then it falls down, and this is a fallen down cup.
link |
00:08:58.840
So I won't have to annotate all of these things
link |
00:09:00.440
in a self supervised setting.
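A minimal sketch of that video-prediction pretext task, assuming PyTorch; the tiny random clip and the two-layer predictor are placeholders (a real system would run a ConvNet or transformer over real video), but the key point is that the training target comes from the video itself:

    import torch
    import torch.nn as nn

    # Toy clip: a batch of 4 videos, 10 frames each, frames of size 3x32x32.
    clip = torch.randn(4, 10, 3, 32, 32)
    context = clip[:, :9].flatten(1)  # the "first nine seconds" (frames)
    target = clip[:, 9].flatten(1)    # the tenth frame is the target

    predictor = nn.Sequential(
        nn.Linear(9 * 3 * 32 * 32, 256),
        nn.ReLU(),
        nn.Linear(256, 3 * 32 * 32),
    )

    prediction = predictor(context)
    loss = nn.functional.mse_loss(prediction, target)  # supervision comes from the data itself
    loss.backward()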
link |
00:09:02.040
Isn't that kind of a brilliant little trick
link |
00:09:05.280
of taking a series of data that is consistent
link |
00:09:08.320
and removing one element in that series,
link |
00:09:11.920
and then teaching the algorithm to predict that element?
link |
00:09:17.040
Isn't that, first of all, that's quite brilliant.
link |
00:09:20.700
It seems to be applicable in anything
link |
00:09:23.080
that has the constraint of being a sequence
link |
00:09:27.920
that is consistent with the physical reality.
link |
00:09:30.260
The question is, are there other tricks like this
link |
00:09:34.400
that can generate the self supervision signal?
link |
00:09:37.840
So sequence is possibly the most widely used one in NLP.
link |
00:09:41.200
For vision, the one that is actually used for images,
link |
00:09:44.080
which is very popular these days,
link |
00:09:45.840
is basically taking an image,
link |
00:09:47.600
and now taking different crops of that image.
link |
00:09:50.080
So you can basically decide to crop,
link |
00:09:51.400
say the top left corner,
link |
00:09:53.100
and you crop, say the bottom right corner,
link |
00:09:55.280
and asking a network to basically present it with a choice,
link |
00:09:58.960
saying that, okay, now you have this image,
link |
00:10:01.360
you have this image, are these the same or not?
link |
00:10:04.480
And so the idea basically is that because different crops,
link |
00:10:06.680
like in an image, different parts of the image
link |
00:10:08.480
are going to be related.
link |
00:10:09.800
So for example, if you have a chair and a table,
link |
00:10:12.420
basically these things are going to be close by,
link |
00:10:14.960
versus if you take, again,
link |
00:10:16.860
if you have like a zoomed in picture of a chair,
link |
00:10:19.520
if you're taking different crops,
link |
00:10:20.480
it's going to be different parts of the chair.
link |
00:10:22.340
So the idea basically is that different crops
link |
00:10:25.020
of the image are related,
link |
00:10:26.180
and so the features or the representations
link |
00:10:27.900
that you get from these different crops
link |
00:10:29.080
should also be related.
link |
00:10:30.320
So this is possibly the most like widely used trick
link |
00:10:32.720
these days for self supervised learning and computer vision.
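A minimal sketch of that crop trick, assuming PyTorch and a recent torchvision (which lets RandomResizedCrop operate on tensors); the linear encoder is a placeholder, and real methods such as SimCLR or SwAV add negatives or clustering on top, but the core idea is just that two crops of the same image should map to similar features:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torchvision import transforms

    crop = transforms.RandomResizedCrop(32)  # random crop, resized back to 32x32
    encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))

    image = torch.rand(3, 64, 64)       # one unlabeled image
    view1 = crop(image).unsqueeze(0)    # e.g. roughly a top-left region
    view2 = crop(image).unsqueeze(0)    # e.g. roughly a bottom-right region

    z1 = F.normalize(encoder(view1), dim=-1)
    z2 = F.normalize(encoder(view2), dim=-1)

    # Pull the two representations together: 1 - cosine similarity.
    loss = 1 - (z1 * z2).sum(dim=-1).mean()
    loss.backward()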
link |
00:10:35.760
So again, using the consistency that's inherent
link |
00:10:39.080
to physical reality in visual domain,
link |
00:10:42.000
that's, you know, parts of an image are consistent,
link |
00:10:45.640
and then in the language domain,
link |
00:10:48.400
or anything that has sequences,
link |
00:10:50.280
like language or something that's like a time series,
link |
00:10:53.000
then you can chop up parts in time.
link |
00:10:55.440
It's similar to the story of RNNs and CNNs,
link |
00:11:00.280
of RNNs and ConvNets.
link |
00:11:02.300
You and Yann LeCun wrote the blog post in March, 2021,
link |
00:11:06.640
titled, Self Supervised Learning,
link |
00:11:08.840
The Dark Matter of Intelligence.
link |
00:11:11.080
Can you summarize this blog post
link |
00:11:12.640
and maybe explain the main idea or set of ideas?
link |
00:11:15.660
The blog post was mainly about sort of just telling,
link |
00:11:18.680
I mean, this is really a accepted fact,
link |
00:11:21.680
I would say for a lot of people now,
link |
00:11:22.940
that self supervised learning is something
link |
00:11:24.360
that is going to play an important role
link |
00:11:27.200
for machine learning algorithms
link |
00:11:28.320
that come in the future, and even now.
link |
00:11:30.560
Let me just comment that we don't yet
link |
00:11:33.840
have a good understanding of what dark matter is.
link |
00:11:36.480
That's true.
link |
00:11:37.320
So the idea basically being...
link |
00:11:40.040
So maybe the metaphor doesn't exactly transfer,
link |
00:11:41.840
but maybe it actually perfectly transfers,
link |
00:11:44.840
that we don't know, we have an inkling
link |
00:11:47.880
that it'll be a big part
link |
00:11:49.280
of whatever solving intelligence looks like.
link |
00:11:51.240
Right, so I think self supervised learning,
link |
00:11:52.960
the way it's done right now is,
link |
00:11:54.880
I would say like the first step towards
link |
00:11:56.560
what it probably should end up like learning
link |
00:11:58.600
or what it should enable us to do.
link |
00:12:00.540
So the idea for that particular piece was,
link |
00:12:03.760
self supervised learning is going to be a very powerful way
link |
00:12:06.200
to learn common sense about the world,
link |
00:12:08.420
or like stuff that is really hard to label.
link |
00:12:10.840
For example, like is this piece
link |
00:12:13.760
over here heavier than the cup?
link |
00:12:15.640
Now, for all these kinds of things,
link |
00:12:17.520
you'll have to sit and label these things.
link |
00:12:18.760
So supervised learning is clearly not going to scale.
link |
00:12:21.560
So what is the thing that's actually going to scale?
link |
00:12:23.520
It's probably going to be an agent
link |
00:12:25.060
that can either actually interact with it to lift it up,
link |
00:12:27.920
or observe me doing it.
link |
00:12:29.980
So if I'm basically lifting these things up,
link |
00:12:31.580
it can probably reason about,
link |
00:12:32.600
hey, this is taking him more time to lift up,
link |
00:12:34.760
or the velocity is different,
link |
00:12:36.440
whereas the velocity for this is different,
link |
00:12:37.840
probably this one is heavier.
link |
00:12:39.600
So essentially, by observations of the data,
link |
00:12:42.000
you should be able to infer a lot of things about the world
link |
00:12:44.820
without someone explicitly telling you,
link |
00:12:46.840
this is heavy, this is not,
link |
00:12:48.720
this is something that can pour,
link |
00:12:50.000
this is something that cannot pour,
link |
00:12:51.200
this is somewhere that you can sit,
link |
00:12:52.480
this is not somewhere that you can sit.
link |
00:12:53.920
But you just mentioned ability to interact with the world.
link |
00:12:57.360
There's so many questions that are yet,
link |
00:13:01.000
that are still open, which is,
link |
00:13:02.840
how do you select the set of data
link |
00:13:04.480
over which the self supervised learning process works?
link |
00:13:08.640
How much interactivity like in the active learning
link |
00:13:11.520
or the machine teaching context is there?
link |
00:13:14.400
What are the reward signals?
link |
00:13:16.480
Like how much actual interaction there is
link |
00:13:18.560
with the physical world?
link |
00:13:20.080
That kind of thing.
link |
00:13:21.440
So that could be a huge question.
link |
00:13:24.800
And then on top of that,
link |
00:13:26.720
which I have a million questions about,
link |
00:13:28.960
which we don't know the answers to,
link |
00:13:30.420
but it's worth talking about is,
link |
00:13:32.840
how much reasoning is involved?
link |
00:13:35.120
How much accumulation of knowledge
link |
00:13:38.520
versus something that's more akin to learning
link |
00:13:40.800
or whether that's the same thing.
link |
00:13:43.240
But so we're like, it is truly dark matter.
link |
00:13:46.560
We don't know how exactly to do it.
link |
00:13:49.220
But we are, I mean, a lot of us are actually convinced
link |
00:13:52.040
that it's going to be a sort of major thing
link |
00:13:54.200
in machine learning.
link |
00:13:55.040
So let me reframe it then,
link |
00:13:56.600
that human supervision cannot be at large scale
link |
00:14:01.160
the source of the solution to intelligence.
link |
00:14:04.120
So the machines have to discover the supervision
link |
00:14:08.000
in the natural signal of the world.
link |
00:14:10.240
I mean, the other thing is also
link |
00:14:11.560
that humans are not particularly good labelers.
link |
00:14:14.200
They're not very consistent.
link |
00:14:16.000
For example, like what's the difference
link |
00:14:17.860
between a dining table and a table?
link |
00:14:19.840
Is it just the fact that one,
link |
00:14:21.560
like if you just look at a particular table,
link |
00:14:23.080
what makes us say one is dining table
link |
00:14:24.600
and the other is not?
link |
00:14:26.500
Humans are not particularly consistent.
link |
00:14:28.160
They're not like very good sources of supervision
link |
00:14:30.100
for a lot of these kinds of edge cases.
link |
00:14:32.320
So it may be also the fact that if we want an algorithm
link |
00:14:37.160
or want a machine to solve a particular task for us,
link |
00:14:39.640
we can maybe just specify the end goal
link |
00:14:42.120
and like the stuff in between,
link |
00:14:44.240
we really probably should not be specifying
link |
00:14:46.080
because we're maybe going to confuse it a lot actually.
link |
00:14:49.320
Well, humans can't even answer the meaning of life.
link |
00:14:51.460
So I'm not sure if we're good supervisors
link |
00:14:53.920
of the end goal either.
link |
00:14:55.220
So let me ask you about categories.
link |
00:14:56.960
Humans are not very good at telling the difference
link |
00:14:59.040
between what is and isn't a table, like you mentioned.
link |
00:15:02.800
Do you think it's possible,
link |
00:15:04.520
let me ask you like pretend you're Plato.
link |
00:15:10.080
Is it possible to create a pretty good taxonomy
link |
00:15:14.800
of objects in the world?
link |
00:15:16.400
It seems like a lot of approaches in machine learning
link |
00:15:19.000
kind of assume a hopeful vision
link |
00:15:21.400
that it's possible to construct a perfect taxonomy
link |
00:15:24.080
or it exists perhaps out of our reach,
link |
00:15:26.520
but we can always get closer and closer to it.
link |
00:15:28.800
Or is that a hopeless pursuit?
link |
00:15:31.240
I think it's hopeless in some way.
link |
00:15:33.040
So the thing is for any particular categorization
link |
00:15:36.080
that you create,
link |
00:15:36.920
if you have a discrete sort of categorization,
link |
00:15:38.760
I can always take the nearest two concepts
link |
00:15:40.520
or I can take a third concept and I can blend it in
link |
00:15:42.600
and I can create a new category.
link |
00:15:44.480
So if you were to enumerate N categories,
link |
00:15:46.560
I will always find an N plus one category for you.
link |
00:15:48.880
That's not going to be in the N categories.
link |
00:15:50.680
And I can actually create not just N plus one,
link |
00:15:52.420
I can very easily create far more than N categories.
link |
00:15:55.120
The thing is a lot of things we talk about
link |
00:15:57.280
are actually compositional.
link |
00:15:58.960
So it's really hard for us to come and sit
link |
00:16:01.680
and enumerate all of these out.
link |
00:16:03.200
And they compose in various weird ways, right?
link |
00:16:05.840
Like you have like a croissant and a donut come together
link |
00:16:08.320
to form a cronut.
link |
00:16:09.680
So if you were to like enumerate all the foods up until,
link |
00:16:12.400
I don't know, whenever the cronut was about 10 years ago
link |
00:16:15.160
or 15 years ago,
link |
00:16:16.440
then this entire thing called cronut would not exist.
link |
00:16:19.000
Yeah, I remember there was the most awesome video
link |
00:16:21.760
of a cat wearing a monkey costume.
link |
00:16:23.500
Yeah, yes.
link |
00:16:26.520
People should look it up, it's great.
link |
00:16:28.240
So is that a monkey or is that a cat?
link |
00:16:31.000
It's a very difficult philosophical question.
link |
00:16:33.840
So there is a concept of similarity between objects.
link |
00:16:37.280
So you think that can take us very far?
link |
00:16:39.860
Just kind of getting a good function,
link |
00:16:43.200
a good way to tell which parts of things are similar
link |
00:16:47.920
and which parts of things are very different.
link |
00:16:50.720
I think so, yeah.
link |
00:16:51.780
So you don't necessarily need to name everything
link |
00:16:54.320
or assign a name to everything to be able to use it, right?
link |
00:16:57.840
So there are like lots of...
link |
00:16:59.560
Shakespeare said that, what's in a name?
link |
00:17:01.720
What's in a name, yeah, okay.
link |
00:17:03.200
And I mean, lots of like, for example, animals, right?
link |
00:17:05.840
They don't have necessarily a well formed
link |
00:17:08.120
like syntactic language,
link |
00:17:09.520
but they're able to go about their day perfectly.
link |
00:17:11.800
The same thing happens for us.
link |
00:17:12.880
So, I mean, we probably look at things and we figure out,
link |
00:17:17.080
oh, this is similar to something else that I've seen before.
link |
00:17:19.360
And then I can probably learn how to use it.
link |
00:17:22.000
So I haven't seen all the possible doorknobs in the world.
link |
00:17:26.280
But if you show me,
link |
00:17:27.800
like I was able to get into this particular place
link |
00:17:29.840
fairly easily, I've never seen that particular doorknob.
link |
00:17:32.120
So I of course related it to all the doorknobs that I've seen
link |
00:17:34.360
and I know exactly how it's going to open.
link |
00:17:36.520
I have a pretty good idea of how it's going to open.
link |
00:17:39.440
And I think this kind of translation between experiences
link |
00:17:41.800
only happens because of similarity.
link |
00:17:43.720
Because I'm able to relate it to a doorknob.
link |
00:17:45.360
If I related it to a hairdryer,
link |
00:17:46.600
I would probably be stuck still outside, not able to get in.
link |
00:17:50.400
Again, a bit of a philosophical question,
link |
00:17:52.240
but can similarity take us all the way
link |
00:17:55.600
to understanding a thing?
link |
00:17:58.680
Can having a good function that compares objects
link |
00:18:01.940
get us to understand something profound
link |
00:18:04.900
about singular objects?
link |
00:18:07.200
I think I'll ask you a question back.
link |
00:18:08.600
What does it mean to understand objects?
link |
00:18:11.560
Well, let me tell you what that's similar to.
link |
00:18:13.520
No, so there's an idea of sort of reasoning
link |
00:18:17.680
by analogy kind of thing.
link |
00:18:19.760
I think understanding is the process of placing that thing
link |
00:18:24.920
in some kind of network of knowledge that you have.
link |
00:18:28.440
That it perhaps is fundamentally related to other concepts.
link |
00:18:33.160
So it's not like understanding is fundamentally related
link |
00:18:36.480
by composition of other concepts
link |
00:18:39.280
and maybe in relation to other concepts.
link |
00:18:43.160
And maybe deeper and deeper understanding
link |
00:18:45.800
is maybe just adding more edges to that graph somehow.
link |
00:18:51.840
So maybe it is a composition of similarities.
link |
00:18:55.080
I mean, ultimately, I suppose it is a kind of embedding
link |
00:18:59.560
in that wisdom space.
link |
00:19:02.480
Yeah, okay, wisdom space is good.
link |
00:19:06.480
I think, I do think, right?
link |
00:19:08.040
So similarity does get you very, very far.
link |
00:19:10.720
Is it the answer to everything?
link |
00:19:12.320
I mean, I don't even know what everything is,
link |
00:19:14.120
but it's going to take us really far.
link |
00:19:16.680
And I think the thing is things are similar
link |
00:19:19.640
in very different contexts, right?
link |
00:19:21.640
So an elephant is similar to, I don't know,
link |
00:19:24.320
another sort of wild animal.
link |
00:19:25.600
Let's just pick, I don't know, lion in a different way
link |
00:19:28.500
because they're both four legged creatures.
link |
00:19:30.520
They're also land animals.
link |
00:19:32.040
But of course they're very different
link |
00:19:33.120
in a lot of different ways.
link |
00:19:33.960
So elephants are like herbivores, lions are not.
link |
00:19:37.240
So similarity and particularly dissimilarity
link |
00:19:40.660
also actually helps us understand a lot about things.
link |
00:19:43.720
And so that's actually why I think
link |
00:19:45.200
discrete categorization is very hard.
link |
00:19:47.600
Just like forming this particular category of elephant
link |
00:19:50.060
and a particular category of lion,
link |
00:19:51.840
maybe it's good for just like taxonomy,
link |
00:19:54.360
biological taxonomies.
link |
00:19:55.760
But when it comes to other things which are not as maybe,
link |
00:19:59.760
for example, like grilled cheese, right?
link |
00:20:01.720
I have a grilled cheese,
link |
00:20:02.560
I dip it in tomato and I keep it outside.
link |
00:20:03.960
Now, is that still a grilled cheese
link |
00:20:05.040
or is that something else?
link |
00:20:06.720
Right, so categorization is still very useful
link |
00:20:09.780
for solving problems.
link |
00:20:11.240
But is your intuition then sort of the self supervised
link |
00:20:15.920
should be the, to borrow Yann LeCun's terminology,
link |
00:20:20.880
should be the cake and then categorization,
link |
00:20:23.640
the classification, maybe the supervised like layer
link |
00:20:27.360
should be just like the thing on top,
link |
00:20:29.100
the cherry or the icing or whatever.
link |
00:20:31.020
So if you make it the cake,
link |
00:20:32.920
it gets in the way of learning.
link |
00:20:35.520
If you make it the cake,
link |
00:20:36.360
then you won't be able to sit and annotate everything.
link |
00:20:39.380
That's as simple as it is.
link |
00:20:40.660
Like that's my very practical view on it.
link |
00:20:43.080
It's just, I mean, in my PhD,
link |
00:20:44.920
I sat down and annotated like a bunch of cars
link |
00:20:47.000
for one of my projects.
link |
00:20:48.480
And very quickly, I was just like, it was in a video
link |
00:20:50.640
and I was basically drawing boxes around all these cars.
link |
00:20:53.560
And I think I spent about a week doing all of that
link |
00:20:55.620
and I barely got anything done.
link |
00:20:57.640
And basically this was, I think my first year of my PhD
link |
00:21:00.280
or like a second year of my master's.
link |
00:21:02.700
And then by the end of it, I'm like, okay,
link |
00:21:04.000
this is just hopeless.
link |
00:21:05.000
I can keep doing it.
link |
00:21:05.960
And when I'd done that, someone came up to me
link |
00:21:08.480
and they basically told me, oh, this is a pickup truck.
link |
00:21:10.820
This is not a car.
link |
00:21:12.760
And that's when like, aha, this actually makes sense
link |
00:21:14.800
because a pickup truck is not really like,
link |
00:21:16.140
what was I annotating?
link |
00:21:17.000
Was I annotating anything that is mobile
link |
00:21:19.560
or was I annotating particular sedans
link |
00:21:21.400
or was I annotating SUVs?
link |
00:21:22.660
What was I doing?
link |
00:21:23.600
By the way, the annotation was bounding boxes?
link |
00:21:25.720
Bounding boxes, yeah.
link |
00:21:26.960
There's so many deep, profound questions here
link |
00:21:30.040
that you're almost cheating your way out of
link |
00:21:32.200
by doing self supervised learning, by the way,
link |
00:21:34.400
which is like, what makes for an object?
link |
00:21:37.520
As opposed to solve intelligence,
link |
00:21:39.080
maybe you don't ever need to answer that question.
link |
00:21:42.480
I mean, this is the question
link |
00:21:43.720
that anyone that's ever done annotation
link |
00:21:45.320
because it's so painful gets to ask,
link |
00:21:48.040
like, why am I drawing very careful line around this object?
link |
00:21:55.480
Like, what is the value?
link |
00:21:57.540
I remember when I first saw semantic segmentation
link |
00:22:00.200
where you have like instant segmentation
link |
00:22:03.640
where you have a very exact line
link |
00:22:06.240
around the object in a 2D plane
link |
00:22:09.520
of a fundamentally 3D object projected on a 2D plane.
link |
00:22:13.440
So you're drawing a line around a car
link |
00:22:15.820
that might be occluded.
link |
00:22:16.960
There might be another thing in front of it,
link |
00:22:18.880
but you're still drawing the line
link |
00:22:20.360
of the part of the car that you see.
link |
00:22:23.640
How is that the car?
link |
00:22:25.880
Why is that the car?
link |
00:22:27.880
Like, I had like an existential crisis every time.
link |
00:22:31.040
Like, how's that going to help us understand
link |
00:22:33.560
or solve computer vision?
link |
00:22:35.360
I'm not sure I have a good answer to what's better.
link |
00:22:38.280
And I'm not sure I share the confidence that you have
link |
00:22:41.560
that self supervised learning can take us far.
link |
00:22:46.720
I think I'm more and more convinced
link |
00:22:48.620
that it's a very important component,
link |
00:22:50.880
but I still feel like we need to understand
link |
00:22:52.840
what makes like this dream of maybe what it's called
link |
00:23:00.120
like symbolic AI of arriving,
link |
00:23:03.080
like once you have this common sense base,
link |
00:23:05.580
be able to play with these concepts and build graphs
link |
00:23:10.960
or hierarchies of concepts on top
link |
00:23:13.440
in order to then like form a deep sense
link |
00:23:18.800
of this three dimensional world or four dimensional world
link |
00:23:22.040
and be able to reason and then project that onto 2D plane
link |
00:23:25.480
in order to interpret a 2D image.
link |
00:23:28.520
Can I ask you just an out there question?
link |
00:23:30.960
I remember, I think Andrej Karpathy had a blog post
link |
00:23:35.000
about computer vision, like being really hard.
link |
00:23:39.000
I forgot what the title was, but it was many, many years ago.
link |
00:23:42.080
And he had, I think President Obama stepping on a scale
link |
00:23:44.760
and there was humor and there was a bunch of people laughing
link |
00:23:47.120
and whatever.
link |
00:23:48.440
And there's a lot of interesting things about that image
link |
00:23:52.000
and I think Andrej highlighted a bunch of things
link |
00:23:55.120
about the image that us humans are able
link |
00:23:56.880
to immediately understand.
link |
00:23:59.000
Like the idea, I think of gravity
link |
00:24:00.960
and that you have the concept of a weight.
link |
00:24:04.040
You immediately project because of our knowledge of pose
link |
00:24:08.120
and how human bodies are constructed,
link |
00:24:10.360
you understand how the forces are being applied
link |
00:24:13.040
with the human body.
link |
00:24:14.560
The really interesting other thing
link |
00:24:16.040
that you're able to understand,
link |
00:24:17.400
there's multiple people looking at each other in the image.
link |
00:24:20.480
You're able to have a mental model
link |
00:24:22.360
of what the people are thinking about.
link |
00:24:23.760
You're able to infer like,
link |
00:24:25.320
oh, this person is probably thinks,
link |
00:24:27.520
like is laughing at how humorous the situation is.
link |
00:24:31.240
And this person is confused about what the situation is
link |
00:24:34.200
because they're looking this way.
link |
00:24:35.600
We're able to infer all of that.
link |
00:24:37.560
So that's human vision.
link |
00:24:41.400
How difficult is computer vision?
link |
00:24:45.040
Like in order to achieve that level of understanding
link |
00:24:48.440
and maybe how big of a part
link |
00:24:51.440
does self supervised learning play in that, do you think?
link |
00:24:54.360
And do you still, you know, back,
link |
00:24:56.440
that was like over a decade ago,
link |
00:24:58.440
I think Andrej and I think a lot of people agreed
link |
00:25:00.920
is computer vision is really hard.
link |
00:25:03.320
Do you still think computer vision is really hard?
link |
00:25:06.000
I think it is, yes.
link |
00:25:07.520
And getting to that kind of understanding,
link |
00:25:10.640
I mean, it's really out there.
link |
00:25:12.480
So if you ask me to solve just that particular problem,
link |
00:25:15.360
I can do it the supervised learning route.
link |
00:25:17.560
I can always construct a data set and basically predict,
link |
00:25:19.720
oh, is there humor in this or not?
link |
00:25:21.680
And of course I can do it.
link |
00:25:22.600
Actually, that's a good question.
link |
00:25:23.560
Do you think you can, okay, okay.
link |
00:25:25.200
Do you think you can do human supervised annotation of humor?
link |
00:25:29.000
To some extent, yes.
link |
00:25:29.960
I'm sure it will work.
link |
00:25:30.880
I mean, it won't be as bad as like randomly guessing.
link |
00:25:34.360
I'm sure it can still predict whether it's humorous or not
link |
00:25:36.600
in some way.
link |
00:25:37.840
Yeah, maybe like Reddit upvotes is the signal.
link |
00:25:40.400
I don't know.
link |
00:25:41.240
I mean, it won't do a great job, but it'll do something.
link |
00:25:43.800
It may actually be like, it may find certain things
link |
00:25:46.040
which are not humorous, humorous as well,
link |
00:25:47.560
which is going to be bad for us.
link |
00:25:49.160
But I mean, it'll do, it won't be random.
link |
00:25:52.120
Yeah, kind of like my sense of humor.
link |
00:25:54.520
Okay, so fine.
link |
00:25:55.920
So you can, that particular problem, yes.
link |
00:25:57.520
But the general problem you're saying is hard.
link |
00:25:59.600
The general problem is hard.
link |
00:26:00.440
And I mean, self supervised learning
link |
00:26:02.320
is not the answer to everything.
link |
00:26:03.920
Of course it's not.
link |
00:26:04.760
I think if you have machines that are going to communicate
link |
00:26:07.800
with humans at the end of it,
link |
00:26:08.760
you want to understand what the algorithm is doing, right?
link |
00:26:10.880
You want it to be able to produce an output
link |
00:26:13.720
that you can decipher, that you can understand,
link |
00:26:15.560
or it's actually useful for something else,
link |
00:26:17.440
which again is a human.
link |
00:26:19.360
So at some point in this sort of entire loop,
link |
00:26:22.280
a human steps in.
link |
00:26:23.720
And now this human needs to understand what's going on.
link |
00:26:26.720
And at that point, this entire notion of language
link |
00:26:28.960
or semantics really comes in.
link |
00:26:30.440
If the machine just spits out something
link |
00:26:32.600
and if we can't understand it,
link |
00:26:34.000
then it's not really that useful for us.
link |
00:26:36.280
So self supervised learning is probably going to be useful
link |
00:26:38.440
for a lot of the things before that part,
link |
00:26:40.800
before the machine really needs to communicate
link |
00:26:42.880
a particular kind of output with a human.
link |
00:26:46.080
Because, I mean, otherwise,
link |
00:26:47.800
how is it going to do that without language?
link |
00:26:49.920
Or some kind of communication.
link |
00:26:51.880
But you're saying that it's possible to build
link |
00:26:53.640
a big base of understanding or whatever,
link |
00:26:55.880
of what's a better word? Concepts.
link |
00:26:58.280
Of concepts. Concepts, yeah.
link |
00:26:59.800
Like common sense concepts. Right.
link |
00:27:02.280
Supervised learning in the context of computer vision
link |
00:27:06.120
is something you've focused on,
link |
00:27:07.520
but that's a really hard domain.
link |
00:27:09.000
And it's kind of the cutting edge
link |
00:27:10.480
of what we're, as a community, working on today.
link |
00:27:13.040
Can we take a little bit of a step back
link |
00:27:14.760
and look at language?
link |
00:27:16.320
Can you summarize the history of success
link |
00:27:19.000
of self supervised learning in natural language processing,
link |
00:27:22.480
language modeling?
link |
00:27:23.880
What are transformers?
link |
00:27:25.600
What is the masking, the sentence completion
link |
00:27:28.760
that you mentioned before?
link |
00:27:31.000
How does it lead us to understand anything?
link |
00:27:33.560
Semantic meaning of words,
link |
00:27:34.800
syntactic role of words and sentences?
link |
00:27:37.640
So I'm, of course, not the expert on NLP.
link |
00:27:40.120
I kind of follow it a little bit from the sides.
link |
00:27:43.480
So the main sort of reason
link |
00:27:45.760
why all of this masking stuff works is,
link |
00:27:47.880
I think it's called the distributional hypothesis in NLP.
link |
00:27:50.880
The idea basically being that words
link |
00:27:52.640
that occur in the same context
link |
00:27:54.400
should have similar meaning.
link |
00:27:55.960
So if you have the blank jumped over the blank,
link |
00:27:59.040
it basically, whatever is like in the first blank
link |
00:28:01.960
is basically an object that can actually jump,
link |
00:28:04.120
is going to be something that can jump.
link |
00:28:05.840
So a cat or a dog, or I don't know, sheep, something,
link |
00:28:08.360
all of these things can basically be in that particular context.
link |
00:28:11.680
And now, so essentially the idea is that
link |
00:28:13.440
if you have words that are in the same context
link |
00:28:16.080
and you predict them,
link |
00:28:17.360
you're going to learn lots of useful things
link |
00:28:20.040
about how words are related,
link |
00:28:21.520
because you're predicting by looking at their context
link |
00:28:23.600
where the word is going to be.
link |
00:28:24.920
So in this particular case, the blank jumped over the fence.
link |
00:28:28.280
So now if it's a sheep, the sheep jumped over the fence,
link |
00:28:30.960
the dog jumped over the fence.
link |
00:28:32.440
So essentially the algorithm or the representation
link |
00:28:35.600
basically puts together these two concepts together.
link |
00:28:37.640
So it says, okay, dogs are going to be kind of related to sheep
link |
00:28:40.280
because both of them occur in the same context.
link |
00:28:42.760
Of course, now you can decide
link |
00:28:44.480
depending on your particular application downstream,
link |
00:28:46.800
you can say that dogs are absolutely not related to sheep
link |
00:28:49.200
because well, I don't, I really care about dog food,
link |
00:28:52.120
for example, I'm a dog food person
link |
00:28:54.240
and I really want to give this dog food
link |
00:28:55.640
to this particular animal.
link |
00:28:57.320
So depending on what your downstream application is,
link |
00:29:00.120
of course, this notion of similarity or this notion
link |
00:29:03.040
or this common sense that you've learned
link |
00:29:04.320
may not be applicable.
link |
00:29:05.840
But the point is basically that this,
link |
00:29:08.080
just predicting what the blanks are
link |
00:29:09.960
is going to take you really, really far.
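A minimal sketch of that distributional idea using gensim's Word2Vec (assuming gensim is installed; the toy corpus below is made up): words that fill the same blank end up with nearby vectors.

    from gensim.models import Word2Vec

    # Tiny toy corpus where "dog" and "sheep" occur in the same contexts.
    sentences = [
        ["the", "dog", "jumped", "over", "the", "fence"],
        ["the", "sheep", "jumped", "over", "the", "fence"],
        ["the", "dog", "ran", "across", "the", "field"],
        ["the", "sheep", "ran", "across", "the", "field"],
    ]

    model = Word2Vec(sentences, vector_size=16, window=2, min_count=1, epochs=200)

    # Words that share contexts tend to end up with similar vectors
    # (a toy corpus, so the exact number will vary run to run).
    print(model.wv.similarity("dog", "sheep"))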
link |
00:29:11.760
So there's a nice feature of language
link |
00:29:14.040
that the number of words in a particular language
link |
00:29:18.720
is very large, but it's finite
link |
00:29:20.800
and it's actually not that large
link |
00:29:22.080
in the grand scheme of things.
link |
00:29:24.160
I still got it because we take it for granted.
link |
00:29:26.560
So first of all, when you say masking,
link |
00:29:28.400
you're talking about this very process of the blank,
link |
00:29:31.560
of removing words from a sentence
link |
00:29:33.440
and then having the knowledge of what word went there
link |
00:29:36.760
in the initial data set,
link |
00:29:38.520
that's the ground truth that you're training on
link |
00:29:41.080
and then you're asking the neural network
link |
00:29:43.480
to predict what goes there.
link |
00:29:46.560
That's like a little trick.
link |
00:29:49.240
It's a really powerful trick.
link |
00:29:50.880
The question is how far that takes us.
link |
00:29:53.320
And the other question is, is there other tricks?
link |
00:29:56.280
Because to me, it's very possible
link |
00:29:58.680
there's other very fascinating tricks.
link |
00:30:00.720
I'll give you an example in autonomous driving,
link |
00:30:05.200
there's a bunch of tricks
link |
00:30:06.920
that give you the self supervised signal back.
link |
00:30:10.360
For example, very similar to sentences, but not really,
link |
00:30:16.280
which is you have signals from humans driving the car
link |
00:30:20.240
because a lot of us drive cars to places.
link |
00:30:23.640
And so you can ask the neural network to predict
link |
00:30:27.800
what's going to happen the next two seconds
link |
00:30:30.240
for a safe navigation through the environment.
link |
00:30:33.400
And the signal comes from the fact
link |
00:30:36.200
that you also have knowledge of what happened
link |
00:30:38.640
in the next two seconds, because you have video of the data.
link |
00:30:42.080
The question in autonomous driving, as it is in language,
link |
00:30:46.760
can we learn how to drive autonomously
link |
00:30:50.200
based on that kind of self supervision?
link |
00:30:53.480
Probably the answer is no.
link |
00:30:55.360
The question is how good can we get?
link |
00:30:57.800
And the same with language, how good can we get?
link |
00:31:00.200
And are there other tricks?
link |
00:31:02.160
Like we get sometimes super excited by this trick
link |
00:31:04.680
that works really well.
link |
00:31:05.720
But I wonder, it's almost like mining for gold.
link |
00:31:09.120
I wonder how many signals there are in the data
link |
00:31:12.760
that could be leveraged that are like there.
link |
00:31:17.200
I just wanted to kind of linger on that
link |
00:31:18.600
because sometimes it's easy to think
link |
00:31:20.840
that maybe this masking process is self supervised learning.
link |
00:31:24.840
No, it's only one method.
link |
00:31:27.200
So there could be many, many other methods,
link |
00:31:29.280
many tricky methods, maybe interesting ways
link |
00:31:33.840
to leverage human computation in very interesting ways
link |
00:31:36.880
that might actually border on semi supervised learning,
link |
00:31:39.920
something like that.
link |
00:31:40.840
Obviously the internet is generated by humans
link |
00:31:43.520
at the end of the day.
link |
00:31:44.720
So all that to say is what's your sense
link |
00:31:48.760
in this particular context of language,
link |
00:31:50.680
how far can that masking process take us?
link |
00:31:54.680
So it has stood the test of time, right?
link |
00:31:56.240
I mean, so Word2vec, the initial sort of NLP technique
link |
00:31:59.800
that was using this to now, for example,
link |
00:32:02.120
like all the BERT and all these big models that we get,
link |
00:32:05.880
BERT and Roberta, for example,
link |
00:32:07.560
all of them are still sort of based
link |
00:32:08.760
on the same principle of masking.
link |
00:32:10.600
It's taken us really far.
link |
00:32:12.120
I mean, you can actually do things like,
link |
00:32:14.400
oh, these two sentences are similar or not,
link |
00:32:16.240
whether this particular sentence follows this other sentence
link |
00:32:18.680
in terms of logic, so entailment,
link |
00:32:20.480
you can do a lot of these things
link |
00:32:21.760
with just this masking trick.
link |
00:32:23.640
So I'm not sure if I can predict how far it can take us,
link |
00:32:28.320
because when it first came out, when Word2vec was out,
link |
00:32:31.480
I don't think a lot of us would have imagined
link |
00:32:33.480
that this would actually help us do some kind
link |
00:32:35.960
of entailment problems and really that well.
link |
00:32:38.520
And so just the fact that by just scaling up
link |
00:32:40.920
the amount of data that we're training on
link |
00:32:42.320
and using better and more powerful neural network
link |
00:32:45.120
architectures has taken us from that to this,
link |
00:32:47.600
is just showing you maybe how poor predictors we are,
link |
00:32:52.600
as humans, how poor we are at predicting
link |
00:32:54.880
how successful a particular technique is going to be.
link |
00:32:57.360
So I think I can say something now,
link |
00:32:58.680
but like 10 years from now,
link |
00:33:00.040
I'll look completely stupid basically predicting this.
link |
00:33:02.800
In the language domain, is there something in your work
link |
00:33:07.160
that you find useful and insightful
link |
00:33:09.560
and transferable to computer vision,
link |
00:33:12.560
but also just, I don't know, beautiful and profound
link |
00:33:15.720
that I think carries through to the vision domain?
link |
00:33:18.160
I mean, the idea of masking has been very powerful.
link |
00:33:21.000
It has been used in vision as well for predicting,
link |
00:33:23.680
like you say, the next sort of, if you have
link |
00:33:25.800
a set of frames and you predict
link |
00:33:28.080
what's going to happen in the next frame.
link |
00:33:29.360
So that's been very powerful.
link |
00:33:30.960
In terms of modeling, like just
link |
00:33:32.880
in terms of architecture, I think you would have asked
link |
00:33:34.600
about transformers a while back.
link |
00:33:36.880
That has really become like,
link |
00:33:38.480
it has become super exciting for computer vision now.
link |
00:33:40.800
Like in the past, I would say year and a half,
link |
00:33:42.760
it's become really powerful.
link |
00:33:44.160
What's a transformer?
link |
00:33:45.240
Right.
link |
00:33:46.080
I mean, the core part of a transformer
link |
00:33:47.440
is something called the self attention model.
link |
00:33:49.040
So it came out of Google
link |
00:33:50.440
and the idea basically is that if you have N elements,
link |
00:33:53.760
what you're creating is a way for all of these N elements
link |
00:33:56.480
to talk to each other.
link |
00:33:57.880
So the idea basically is that you are paying attention.
link |
00:34:01.800
Each element is paying attention
link |
00:34:03.160
to each of the other element.
link |
00:34:04.960
And basically by doing this,
link |
00:34:06.760
it's really trying to figure out,
link |
00:34:08.960
you're basically getting a much better view of the data.
link |
00:34:11.440
So for example, if you have a sentence of like four words,
link |
00:34:14.480
the point is if you get a representation
link |
00:34:16.320
or a feature for this entire sentence,
link |
00:34:18.320
it's constructed in a way such that each word
link |
00:34:21.280
has paid attention to everything else.
link |
00:34:23.840
Now, the reason it's like different from say,
link |
00:34:26.120
what you would do in a ConvNet
link |
00:34:28.440
is basically that in the ConvNet,
link |
00:34:29.560
you would only pay attention to a local window.
link |
00:34:31.400
So each word would only pay attention
link |
00:34:33.160
to its next neighbor or like one neighbor after that.
link |
00:34:36.160
And the same thing goes for images.
link |
00:34:37.840
In images, you would basically pay attention to pixels
link |
00:34:40.120
in a three cross three or a seven cross seven neighborhood.
link |
00:34:42.800
And that's it.
link |
00:34:43.680
Whereas with the transformer, the self attention mainly,
link |
00:34:46.000
the sort of idea is that each element
link |
00:34:48.760
needs to pay attention to each other element.
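A minimal sketch of that self-attention step in plain PyTorch, with made-up dimensions and random projection matrices: every element computes a softmax-weighted sum over all the other elements, rather than only a local window.

    import torch
    import torch.nn.functional as F

    def self_attention(x, wq, wk, wv):
        # x: (N, d) sequence of N elements (words, or image patches).
        q, k, v = x @ wq, x @ wk, x @ wv
        scores = q @ k.T / k.shape[-1] ** 0.5  # every element scores every other element
        weights = F.softmax(scores, dim=-1)    # attention weights over the whole sequence
        return weights @ v                     # each output mixes information from all elements

    d = 8
    x = torch.randn(4, d)                      # e.g. a four-word sentence
    wq, wk, wv = (torch.randn(d, d) for _ in range(3))
    out = self_attention(x, wq, wk, wv)
    print(out.shape)                           # torch.Size([4, 8]): one updated vector per element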
link |
00:34:50.440
And when you say attention,
link |
00:34:51.960
maybe another way to phrase that
link |
00:34:53.400
is you're considering a context,
link |
00:34:57.680
a wide context in terms of the wide context of the sentence
link |
00:35:01.560
in understanding the meaning of a particular word
link |
00:35:05.160
and in computer vision that's understanding
link |
00:35:06.960
a larger context to understand the local pattern
link |
00:35:10.040
of a particular local part of an image.
link |
00:35:13.080
Right, so basically if you have say,
link |
00:35:14.960
again, a banana in the image,
link |
00:35:16.520
you're looking at the full image first.
link |
00:35:18.600
So whether it's like, you know,
link |
00:35:19.920
you're looking at all the pixels that are of a kitchen
link |
00:35:22.200
or a dining table and so on.
link |
00:35:23.760
And then you're basically looking at the banana also.
link |
00:35:25.920
Yeah, by the way, in terms of,
link |
00:35:27.200
if we were to train the funny classifier,
link |
00:35:29.240
there's something funny about the word banana.
link |
00:35:32.000
Just wanted to anticipate that.
link |
00:35:33.840
I am wearing a banana shirt, so yeah.
link |
00:35:36.200
Is there bananas on it?
link |
00:35:39.720
Okay, so masking has worked for the vision context as well.
link |
00:35:42.440
And so this transformer idea has worked as well.
link |
00:35:44.320
So basically looking at all the elements
link |
00:35:46.280
to understand a particular element
link |
00:35:48.160
has been really powerful in vision.
link |
00:35:49.920
The reason is like a lot of things
link |
00:35:52.080
when you're looking at them in isolation.
link |
00:35:53.480
So if you look at just a blob of pixels,
link |
00:35:55.600
so Antonio Torralba at MIT used to have
link |
00:35:57.520
this like really famous image,
link |
00:35:58.960
which I looked at when I was a PhD student.
link |
00:36:01.040
But he would basically have a blob of pixels
link |
00:36:02.840
and he would ask you, hey, what is this?
link |
00:36:04.960
And it looked basically like a shoe
link |
00:36:06.840
or like it could look like a TV remote.
link |
00:36:08.880
It could look like anything.
link |
00:36:10.080
And it turns out it was a beer bottle.
link |
00:36:12.360
But I'm not sure it was one of these three things,
link |
00:36:14.120
but basically he showed you the full picture
link |
00:36:15.440
and then it was very obvious what it was.
link |
00:36:17.560
But the point is just by looking at
link |
00:36:19.240
that particular local window, you couldn't figure it out.
link |
00:36:21.880
Because of resolution, because of other things,
link |
00:36:23.880
it's just not easy always to just figure it out
link |
00:36:26.080
by looking at just the neighborhood of pixels,
link |
00:36:27.960
what these pixels are.
link |
00:36:29.680
And the same thing happens for language as well.
link |
00:36:32.000
For the parameters that have to learn
link |
00:36:33.920
something about the data,
link |
00:36:35.160
you need to give it the capacity
link |
00:36:37.200
to learn the essential things.
link |
00:36:39.160
Like if it's not actually able to receive the signal at all,
link |
00:36:42.680
then it's not gonna be able to learn that signal.
link |
00:36:44.320
And in order to understand images, to understand language,
link |
00:36:47.320
you have to be able to see words in their full context.
link |
00:36:50.720
Okay, what is harder to solve, vision or language?
link |
00:36:54.960
Visual intelligence or linguistic intelligence?
link |
00:36:57.880
So I'm going to say computer vision is harder.
link |
00:36:59.840
My reason for this is basically that
link |
00:37:02.800
language of course has a big structure to it
link |
00:37:05.000
because we developed it.
link |
00:37:06.880
Whereas vision is something that is common
link |
00:37:08.720
in a lot of animals.
link |
00:37:09.960
Everyone is able to get by, a lot of these animals
link |
00:37:12.520
on earth are actually able to get by without language.
link |
00:37:15.080
And a lot of these animals we also deem to be intelligent.
link |
00:37:18.280
So clearly intelligence does have
link |
00:37:20.920
like a visual component to it.
link |
00:37:22.520
And yes, of course, in the case of humans,
link |
00:37:24.240
it of course also has a linguistic component.
link |
00:37:26.400
But it means that there is something far more fundamental
link |
00:37:28.720
about vision than there is about language.
link |
00:37:30.840
And I'm sorry to anyone who disagrees,
link |
00:37:32.960
but yes, this is what I feel.
link |
00:37:34.360
So that's being a little bit reflected in the challenges
link |
00:37:38.880
that have to do with the progress
link |
00:37:40.800
of self supervised learning, would you say?
link |
00:37:42.520
Or is that just a peculiar accident
link |
00:37:45.560
of the progress of the AI community
link |
00:37:47.400
that we focused on like,
link |
00:37:48.600
or we discovered self attention and transformers
link |
00:37:51.680
in the context of language first?
link |
00:37:53.640
So like the self supervised learning success
link |
00:37:55.520
for vision actually has not much to do
link |
00:37:58.880
with the transformers part.
link |
00:37:59.960
I would say it's actually been independent a little bit.
link |
00:38:02.480
I think it's just that the signal was a little bit different
link |
00:38:05.360
for vision than there was for like NLP
link |
00:38:08.120
and probably NLP folks discovered it before.
link |
00:38:11.240
So for vision, the main success
link |
00:38:12.680
has basically been these crops so far,
link |
00:38:14.840
like taking different crops of images.
link |
00:38:16.960
Whereas for NLP, it was this masking thing.
link |
00:38:18.920
But also the level of success
link |
00:38:20.480
is still much higher for language.
link |
00:38:22.080
It has.
link |
00:38:22.920
So that has a lot to do with,
link |
00:38:24.800
I mean, I can get into a lot of details.
link |
00:38:26.920
For this particular question, let's go for it, okay.
link |
00:38:29.040
So the first thing is language is very structured.
link |
00:38:32.280
So you are going to produce a distribution
link |
00:38:34.080
over a finite vocabulary.
link |
00:38:35.920
English has a finite number of words.
link |
00:38:37.680
It's actually not that large.
link |
00:38:39.520
And you need to produce basically,
link |
00:38:41.640
when you're doing this masking thing,
link |
00:38:42.760
all you need to do is basically tell me
link |
00:38:44.160
which one of these like 50,000 words it is.
link |
00:38:46.440
That's it.
link |
00:38:47.280
Now for vision, let's imagine doing the same thing.
link |
00:38:49.560
Okay, we're basically going to blank out
link |
00:38:51.480
a particular part of the image
link |
00:38:52.600
and we ask the network or this neural network
link |
00:38:54.680
to predict what is present in this missing patch.
link |
00:38:58.080
It's combinatorially large, right?
link |
00:38:59.960
You have 256 pixel values.
link |
00:39:02.560
If you're even producing basically a seven cross seven
link |
00:39:04.840
or a 14 cross 14 like window of pixels,
link |
00:39:07.960
at each of these 196 or each of these 49 locations,
link |
00:39:11.320
you have 256 values to predict.
link |
00:39:13.720
And so it's really, really large.
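As a back-of-the-envelope comparison of the two prediction problems (the 50,000-word vocabulary and 8-bit pixel values are illustrative assumptions, not exact figures):

    # Rough size of the prediction target in each case. The 50,000-word
    # vocabulary and 8-bit pixel values are illustrative assumptions.
    vocab_size = 50_000                    # masked language modeling: one softmax over ~50k words
    print(f"language: choose 1 of {vocab_size} words")

    pixel_values = 256                     # possible intensities per pixel
    for side in (7, 14):
        locations = side * side            # 49 or 196 pixels in the masked patch
        digits = round(locations * 2.408)  # log10(256) ~= 2.408
        print(f"vision: {side}x{side} patch -> 256^{locations} ~= 10^{digits} possible patches")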
link |
00:39:15.240
And very quickly, the kind of like prediction problems
link |
00:39:18.960
that we're setting up are going to be extremely
link |
00:39:20.800
like intractable for us.
link |
00:39:22.760
And so the thing is for NLP, it has been really successful
link |
00:39:24.960
because we are very good at predicting,
link |
00:39:27.520
like doing this like distribution over a finite set.
link |
00:39:30.840
And the problem is when this set becomes really large,
link |
00:39:33.480
we are going to become really, really bad
link |
00:39:35.520
at making these predictions
link |
00:39:36.960
and at solving basically this particular set of problems.
link |
00:39:41.000
So if you were to do it exactly in the same way
link |
00:39:44.200
as NLP for vision, there is very limited success.
link |
00:39:47.000
The way stuff is working right now
link |
00:39:48.960
is actually not by predicting these masks.
link |
00:39:51.640
It's basically by saying that you take these two
link |
00:39:53.640
like crops from the image,
link |
00:39:55.120
you get a feature representation from it.
link |
00:39:57.040
And just saying that these two features,
link |
00:39:58.640
so they're like vectors,
link |
00:40:00.400
just saying that the distance between these vectors
link |
00:40:02.000
should be small.
link |
00:40:03.200
And so it's a very different way of learning
link |
00:40:06.640
from the visual signal than there is from NLP.
link |
00:40:09.160
Okay, the other reason is the distributional hypothesis
link |
00:40:11.360
that we talked about for NLP, right?
link |
00:40:12.920
So a word given its context,
link |
00:40:15.160
basically the context actually supplies
link |
00:40:16.560
a lot of meaning to the word.
link |
00:40:18.440
Now, because there are just finite number of words
link |
00:40:22.280
and there is a finite way in like which we compose them.
link |
00:40:25.760
Of course, the same thing holds for pixels,
link |
00:40:27.440
but in language, there's a lot of structure, right?
link |
00:40:29.760
So I always say whatever,
link |
00:40:31.000
the dash jumped over the fence, for example.
link |
00:40:33.760
There are lots of these sentences that you'll get.
link |
00:40:36.720
And from this, you can actually look at
link |
00:40:38.680
this particular sentence might occur
link |
00:40:40.160
in a lot of different contexts as well.
link |
00:40:41.480
This exact same sentence
link |
00:40:42.600
might occur in a different context.
link |
00:40:44.080
So the sheep jumped over the fence,
link |
00:40:45.560
the cat jumped over the fence,
link |
00:40:46.800
the dog jumped over the fence.
link |
00:40:48.160
So you immediately get a lot of these words,
link |
00:40:50.480
which are because this particular token itself
link |
00:40:52.720
has so much meaning,
link |
00:40:53.560
you get a lot of these tokens or these words,
link |
00:40:55.480
which are actually going to have sort of
link |
00:40:57.720
this related meaning across given this context.
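This fill-in-the-blank behavior is easy to poke at directly; here is a small sketch assuming the Hugging Face transformers library and the standard bert-base-uncased checkpoint are available.

    # Fill-in-the-blank with a masked language model, assuming the Hugging Face
    # `transformers` library and the standard `bert-base-uncased` checkpoint.
    from transformers import pipeline

    fill = pipeline("fill-mask", model="bert-base-uncased")
    for guess in fill("The [MASK] jumped over the fence."):
        print(guess["token_str"], round(guess["score"], 3))
    # Plausible fills (dog, cat, horse, ...) share meaning because they share this context.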
link |
00:41:00.560
Whereas for vision, it's much harder
link |
00:41:02.640
because just by like pure,
link |
00:41:04.160
like the way we capture images,
link |
00:41:05.600
lighting can be different.
link |
00:41:07.440
There might be like different noise in the sensor.
link |
00:41:09.800
So the thing is you're capturing a physical phenomenon
link |
00:41:12.240
and then you're basically going through
link |
00:41:13.840
a very complicated pipeline of like image processing.
link |
00:41:16.400
And then you're translating that into
link |
00:41:18.040
some kind of like digital signal.
link |
00:41:20.400
Whereas with language, you write it down
link |
00:41:23.520
and you transfer it to a digital signal,
link |
00:41:25.040
almost like it's a lossless like transfer.
link |
00:41:27.520
And each of these tokens are very, very well defined.
link |
00:41:30.160
There could be a little bit of an argument there
link |
00:41:32.840
because language as written down
link |
00:41:36.120
is a projection of thought.
link |
00:41:39.400
This is one of the open questions is
link |
00:41:42.560
if you perfectly can solve language,
link |
00:41:46.320
are you getting close to being able to solve easily
link |
00:41:50.040
with flying colors past the Turing test kind of thing.
link |
00:41:52.800
So that's, it's similar, but different
link |
00:41:56.560
and the computer vision problem in the 2D plane
link |
00:41:59.760
is a projection of the three dimensional world.
link |
00:42:02.640
So perhaps there are similar problems there.
link |
00:42:05.640
Maybe this is a good.
link |
00:42:06.480
I mean, I think what I'm saying is NLP is not easy.
link |
00:42:08.560
Of course, don't get me wrong.
link |
00:42:09.520
Like abstract thought expressed in knowledge
link |
00:42:12.920
or knowledge basically expressed in language
link |
00:42:14.600
is really hard to understand, right?
link |
00:42:16.720
I mean, we've been communicating with language for so long
link |
00:42:19.160
and it is of course a very complicated concept.
link |
00:42:22.000
The thing is at least getting like somewhat reasonable,
link |
00:42:27.000
like being able to solve some kind of reasonable tasks
link |
00:42:29.880
with language, I would say slightly easier
link |
00:42:32.080
than it is with computer vision.
link |
00:42:33.640
Yeah, I would say, yeah.
link |
00:42:35.360
So that's well put.
link |
00:42:36.600
I would say getting impressive performance on language
link |
00:42:40.840
is easier.
link |
00:42:43.360
I feel like for both language and computer vision,
link |
00:42:45.320
there's going to be this wall of like,
link |
00:42:49.440
like this hump you have to overcome
link |
00:42:52.240
to achieve superhuman level performance
link |
00:42:54.800
or human level performance.
link |
00:42:56.600
And I feel like for language, that wall is farther away.
link |
00:43:00.200
So you can get pretty nice.
link |
00:43:01.880
You can do a lot of tricks.
link |
00:43:04.080
You can show really impressive performance.
link |
00:43:06.520
You can even fool people that your tweeting
link |
00:43:09.680
or your blog post writing
link |
00:43:11.480
or your question answering has intelligence behind it.
link |
00:43:16.880
But to truly demonstrate understanding of dialogue,
link |
00:43:22.360
of continuous long form dialogue
link |
00:43:25.000
that would require perhaps big breakthroughs.
link |
00:43:28.600
In the same way in computer vision,
link |
00:43:30.440
I think the big breakthroughs need to happen earlier
link |
00:43:33.400
to achieve impressive performance.
link |
00:43:36.600
This might be a good place to, you already mentioned it,
link |
00:43:38.760
but what is contrastive learning
link |
00:43:41.120
and what are energy based models?
link |
00:43:43.840
Contrastive learning is sort of the paradigm of learning
link |
00:43:46.840
where the idea is that you are learning this embedding space
link |
00:43:50.680
or so you're learning this sort of vector space
link |
00:43:52.680
of all your concepts.
link |
00:43:54.520
And the way you learn that is basically by contrasting.
link |
00:43:56.760
So the idea is that you have a sample,
link |
00:43:59.120
you have another sample that's related to it.
link |
00:44:01.000
So that's called the positive
link |
00:44:02.840
and you have another sample that's not related to it.
link |
00:44:05.080
So that's negative.
link |
00:44:06.080
So for example, let's just take an NLP
link |
00:44:08.320
or in a simple example in computer vision.
link |
00:44:10.960
So you have an image of a cat, you have an image of a dog
link |
00:44:14.480
and for whatever application that you're doing,
link |
00:44:16.520
say you're trying to figure out what the pets are,
link |
00:44:18.840
you're saying that these two images are related.
link |
00:44:20.280
So image of a cat and dog are related,
link |
00:44:22.280
but now you have another third image of a banana
link |
00:44:25.400
because you don't like that word.
link |
00:44:26.960
So now you basically have this banana.
link |
00:44:28.920
Thank you for speaking to the crowd.
link |
00:44:30.640
And so you take both of these images
link |
00:44:32.560
and you take the image from the cat,
link |
00:44:34.440
the image from the dog,
link |
00:44:35.280
you get a feature from both of them.
link |
00:44:36.760
And now what you're training the network to do
link |
00:44:38.160
is basically pull both of these features together
link |
00:44:42.080
while pushing them away from the feature of a banana.
link |
00:44:44.720
So this is the contrastive part.
link |
00:44:45.840
So you're contrasting against the banana.
link |
00:44:47.840
So there's always this notion of a negative and a positive.
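A minimal sketch of that pull/push in PyTorch, with random vectors standing in for the cat, dog, and banana images and a single linear layer standing in for the real network:

    import torch
    import torch.nn.functional as F

    # Toy contrastive step: pull the cat and dog features together,
    # push them away from the banana feature. Everything here is a stand-in.
    torch.manual_seed(0)
    encoder = torch.nn.Linear(512, 128)            # stand-in for a real backbone

    cat, dog, banana = (torch.randn(1, 512) for _ in range(3))
    z_cat = F.normalize(encoder(cat), dim=-1)
    z_dog = F.normalize(encoder(dog), dim=-1)      # positive for the cat
    z_ban = F.normalize(encoder(banana), dim=-1)   # negative for the cat

    temperature = 0.1
    pos = (z_cat * z_dog).sum(-1) / temperature
    neg = (z_cat * z_ban).sum(-1) / temperature

    # InfoNCE-style loss: the positive should win a softmax over {positive, negative}.
    logits = torch.cat([pos, neg]).unsqueeze(0)    # shape (1, 2)
    loss = F.cross_entropy(logits, torch.tensor([0]))
    loss.backward()                                # gradients pull positives, push negatives
    print(loss.item())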
link |
00:44:51.520
Now, energy based models are like one way
link |
00:44:54.160
that Yann sort of explains a lot of these methods.
link |
00:44:57.480
So Yann basically, I think a couple of years
link |
00:45:00.680
or more than that, like when I joined Facebook,
link |
00:45:02.840
Yann used to keep mentioning this word, energy based models.
link |
00:45:05.080
And of course I had no idea what he was talking about.
link |
00:45:07.200
So then one day I caught him in one of the conference rooms
link |
00:45:09.680
and I'm like, can you please tell me what this is?
link |
00:45:11.240
So then like very patiently,
link |
00:45:13.120
he sat down with like a marker and a whiteboard.
link |
00:45:15.960
And his idea basically is that
link |
00:45:18.280
rather than talking about probability distributions,
link |
00:45:20.280
you can talk about energies of models.
link |
00:45:21.920
So models are trying to minimize certain energies
link |
00:45:24.000
in certain space,
link |
00:45:24.960
or they're trying to maximize a certain kind of energy.
link |
00:45:28.200
And the idea basically is that
link |
00:45:29.760
you can explain a lot of the contrastive models,
link |
00:45:32.200
GANs, for example,
link |
00:45:33.280
which are like Generative Adversarial Networks.
link |
00:45:36.000
A lot of these modern learning methods
link |
00:45:37.880
or VAEs, which are Variational Autoencoders,
link |
00:45:39.880
you can really explain them very nicely
link |
00:45:41.840
in terms of an energy function
link |
00:45:43.160
that they're trying to minimize or maximize.
link |
00:45:45.320
And so by putting this common sort of language
link |
00:45:48.360
for all of these models,
link |
00:45:49.720
what looks very different in machine learning
link |
00:45:51.800
that, oh, VAEs are very different from what GANs are,
link |
00:45:54.160
are very, very different from what contrastive models are,
link |
00:45:56.440
you actually get a sense of like,
link |
00:45:57.560
oh, these are actually very, very related.
link |
00:46:00.120
It's just that the way or the mechanism
link |
00:46:02.520
in which they're sort of maximizing
link |
00:46:04.200
or minimizing this energy function is slightly different.
link |
00:46:07.000
It's revealing the commonalities
link |
00:46:08.920
between all these approaches
link |
00:46:10.400
and putting a sexy word on top of it, like energy.
link |
00:46:13.000
And so similarities,
link |
00:46:14.360
two things that are similar have low energy.
link |
00:46:16.760
Like the low energy signifying similarity.
link |
00:46:20.360
Right, exactly.
link |
00:46:21.200
So basically the idea is that if you were to imagine
link |
00:46:23.560
like the embedding as a manifold, a 2D manifold,
link |
00:46:26.480
you would get a hill or like a high sort of peak
link |
00:46:28.920
in the energy manifold,
link |
00:46:30.600
wherever two things are not related.
link |
00:46:32.400
And basically you would have like a dip
link |
00:46:34.080
where two things are related.
link |
00:46:35.520
So you'd get a dip in the manifold.
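One illustrative way to write that down (an example choice of energy, not the only one): let the energy be the squared distance between embeddings, low in the dips and high on the peaks, with a simple margin pushing unrelated pairs up.

    import torch
    import torch.nn.functional as F

    # Illustrative energy function: squared distance between embeddings.
    # Related pairs should sit in a dip (low energy); unrelated pairs on a peak
    # (high energy), here encouraged with a simple margin. All inputs are stand-ins.
    def energy(a, b):
        return ((a - b) ** 2).sum(-1)

    net = torch.nn.Linear(512, 64)                 # stand-in embedding network
    x, x_related, x_unrelated = (torch.randn(1, 512) for _ in range(3))

    e_pos = energy(net(x), net(x_related))         # want low: a dip in the manifold
    e_neg = energy(net(x), net(x_unrelated))       # want high: a peak
    margin = 1.0
    loss = (e_pos + F.relu(margin - e_neg)).mean() # push negatives up only until the margin
    loss.backward()
    print(e_pos.item(), e_neg.item())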
link |
00:46:37.080
And in the self supervised context,
link |
00:46:40.200
how do you know two things are related
link |
00:46:42.280
and two things are not related?
link |
00:46:44.120
Right.
link |
00:46:44.960
So this is where all the sort of ingenuity or tricks
link |
00:46:46.920
comes in, right?
link |
00:46:47.840
So for example, like you can take
link |
00:46:50.840
the fill in the blank problem,
link |
00:46:52.160
or you can take in the context problem.
link |
00:46:54.360
And what you can say is two words
link |
00:46:55.920
that are in the same context are related.
link |
00:46:57.800
Two words that are in different contexts are not related.
link |
00:47:00.560
For images, basically two crops
link |
00:47:02.280
from the same image are related.
link |
00:47:03.960
And whereas a third image is not related at all.
link |
00:47:06.440
Or for a video, it can be two frames
link |
00:47:08.200
from that video are related
link |
00:47:09.200
because they're likely to contain
link |
00:47:10.800
the same sort of concepts in them.
link |
00:47:12.720
Whereas a third frame
link |
00:47:13.720
from a different video is not related.
link |
00:47:15.600
So it basically is, it's a very general term.
link |
00:47:18.320
Contrastive learning is nothing really
link |
00:47:19.680
to do with self supervised learning.
link |
00:47:20.840
It actually is very popular in for example,
link |
00:47:23.240
like any kind of metric learning
link |
00:47:25.200
or any kind of embedding learning.
link |
00:47:26.920
So it's also used in supervised learning.
link |
00:47:28.920
And the thing is because we are not really using labels
link |
00:47:32.080
to get these positive or negative pairs,
link |
00:47:34.560
it can basically also be used for self supervised learning.
link |
00:47:37.640
So you mentioned one of the ideas
link |
00:47:39.000
in the vision context that works
link |
00:47:42.760
is to have different crops.
link |
00:47:45.280
So you could think of that as a way
link |
00:47:47.080
to sort of manipulating the data
link |
00:47:49.480
to generate examples that are similar.
link |
00:47:53.280
Obviously, there's a bunch of other techniques.
link |
00:47:55.800
You mentioned lighting as a very,
link |
00:47:58.440
in images lighting is something that varies a lot
link |
00:48:01.680
and you can artificially change those kinds of things.
link |
00:48:04.520
There's the whole broad field of data augmentation,
link |
00:48:07.720
which manipulates images in order to increase arbitrarily
link |
00:48:11.800
the size of the data set.
link |
00:48:13.400
First of all, what is data augmentation?
link |
00:48:15.840
And second of all, what's the role of data augmentation
link |
00:48:18.120
in self supervised learning and contrastive learning?
link |
00:48:22.000
So data augmentation is just a way like you said,
link |
00:48:24.760
it's basically a way to augment the data.
link |
00:48:26.680
So you have say n samples.
link |
00:48:28.640
And what you do is you basically define
link |
00:48:30.120
some kind of transforms for the sample.
link |
00:48:32.280
So you take your say image
link |
00:48:33.640
and then you define a transform
link |
00:48:34.880
where you can just increase say the colors
link |
00:48:37.320
like the colors or the brightness of the image
link |
00:48:39.120
or increase or decrease the contrast of the image
link |
00:48:41.320
for example, or take different crops of it.
link |
00:48:44.560
So data augmentation is just a process
link |
00:48:46.240
to like basically perturb the data
link |
00:48:49.040
or like augment the data, right?
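A sketch of such a transform pipeline using torchvision; the particular transforms and parameter values are illustrative assumptions, roughly in the spirit of common self-supervised recipes.

    from PIL import Image
    import torchvision.transforms as T

    # Illustrative augmentation pipeline; the transforms and parameters are
    # assumptions, roughly in the spirit of common self-supervised recipes.
    augment = T.Compose([
        T.RandomResizedCrop(224, scale=(0.2, 1.0)),   # random crop, resized to 224x224
        T.RandomHorizontalFlip(),
        T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
        T.RandomGrayscale(p=0.2),
        T.GaussianBlur(kernel_size=23),               # mild blur
        T.ToTensor(),
    ])

    img = Image.open("some_image.jpg")                # hypothetical input image
    view_1 = augment(img)                             # two different random perturbations
    view_2 = augment(img)                             # of the SAME image
    # The model is then trained so the features of view_1 and view_2 end up similar.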
link |
00:48:51.080
And so it has played a fundamental role
link |
00:48:53.160
for computer vision for self supervised learning especially.
link |
00:48:56.640
The way most of the current methods work
link |
00:48:59.160
contrastive or otherwise is by taking an image
link |
00:49:02.720
in the case of images is by taking an image
link |
00:49:05.320
and then computing basically two perturbations of it.
link |
00:49:08.560
So these can be two different crops of the image
link |
00:49:11.480
with like different types of lighting
link |
00:49:12.920
or different contrast or different colors.
link |
00:49:15.000
So you jitter the colors a little bit and so on.
link |
00:49:17.840
And now the idea is basically because it's the same object
link |
00:49:21.720
or because it's like related concepts
link |
00:49:23.440
in both of these perturbations,
link |
00:49:25.200
you want the features from both of these perturbations
link |
00:49:27.960
to be similar.
link |
00:49:28.920
So now you can use a variety of different ways
link |
00:49:31.320
to enforce this constraint,
link |
00:49:32.600
like these features being similar.
link |
00:49:34.200
You can do this by contrastive learning.
link |
00:49:36.040
So basically, both of these things are positives,
link |
00:49:38.440
a third sort of image is negative.
link |
00:49:40.440
You can do this basically by like clustering.
link |
00:49:43.480
For example, you can say that both of these images should,
link |
00:49:46.960
the features from both of these images
link |
00:49:48.120
should belong in the same cluster because they're related,
link |
00:49:50.560
whereas image like another image
link |
00:49:52.280
should belong to a different cluster.
link |
00:49:53.880
So there's a variety of different ways
link |
00:49:55.160
to basically enforce this particular constraint.
link |
00:49:57.560
By the way, when you say features,
link |
00:49:59.080
it means there's a very large neural network
link |
00:50:01.680
that is extracting patterns from the image
link |
00:50:03.640
and the kind of patterns it extracts
link |
00:50:05.160
should be either identical or very similar.
link |
00:50:08.440
That's what that means.
link |
00:50:09.640
So the neural network basically takes in the image
link |
00:50:11.880
and then outputs a set of like,
link |
00:50:14.160
basically a vector of like numbers,
link |
00:50:16.600
and that's the feature.
link |
00:50:17.720
And you want this feature for both of these
link |
00:50:20.000
like different crops that you computed to be similar.
link |
00:50:22.120
So you want this vector to be identical
link |
00:50:24.520
in its like entries, for example.
link |
00:50:26.120
Be like literally close
link |
00:50:28.120
in this multi dimensional space to each other.
link |
00:50:31.640
And like you said,
link |
00:50:32.600
close can mean part of the same cluster or something like that
link |
00:50:35.960
in this large space.
link |
00:50:37.440
First of all, that,
link |
00:50:38.920
I wonder if there is connection
link |
00:50:40.680
to the way humans learn to this,
link |
00:50:43.760
almost like maybe subconsciously,
link |
00:50:48.040
in order to understand a thing,
link |
00:50:50.120
you kind of have to see it from two, three multiple angles.
link |
00:50:54.680
I wonder, I have a lot of friends
link |
00:50:57.320
who are neuroscientists maybe and cognitive scientists.
link |
00:51:00.200
I wonder if that's in there somewhere.
link |
00:51:03.200
Like in order for us to place a concept in its proper place,
link |
00:51:08.560
we have to basically crop it in all kinds of ways,
link |
00:51:12.440
do basic data augmentation on it
link |
00:51:14.400
in whatever very clever ways that the brain likes to do.
link |
00:51:17.640
Right.
link |
00:51:19.040
Like spinning around in our minds somehow
link |
00:51:21.160
that that is very effective.
link |
00:51:23.080
So I think for some of them, we like need to do it.
link |
00:51:25.040
So like babies, for example, pick up objects,
link |
00:51:27.000
like move them and put them close to their eye and whatnot.
link |
00:51:30.120
But for certain other things,
link |
00:51:31.200
actually we are good at imagining it as well, right?
link |
00:51:33.800
So if you, I have never seen, for example,
link |
00:51:35.960
an elephant from the top.
link |
00:51:36.960
I've never basically looked at it from like top down.
link |
00:51:39.560
But if you showed me a picture of it,
link |
00:51:40.720
I could very well tell you that that's an elephant.
link |
00:51:43.760
So I think some of it, we're just like,
link |
00:51:45.320
we naturally build it or transfer it from other objects
link |
00:51:47.840
that we've seen to imagine what it's going to look like.
link |
00:51:50.920
Has anyone done that with augmentation?
link |
00:51:53.280
Like imagine all the possible things
link |
00:51:56.920
that are occluded or not there,
link |
00:51:59.880
but not just like normal things, like wild things,
link |
00:52:03.360
but they're nevertheless physically consistent.
link |
00:52:06.960
So, I mean, people do kind of like
link |
00:52:09.720
occlusion based augmentation as well.
link |
00:52:11.800
So you place in like a random like box, gray box
link |
00:52:14.760
to sort of mask out a certain part of the image.
link |
00:52:17.440
And the thing is basically you're kind of occluding it.
link |
00:52:20.000
For example, you place it say on half of a person's face.
link |
00:52:23.600
So basically saying that, you know,
link |
00:52:24.920
something below their nose is occluded
link |
00:52:26.680
because it's grayed out.
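For occlusion specifically, torchvision's RandomErasing does roughly this on tensor images; the parameters below are illustrative.

    import torch
    import torchvision.transforms as T

    # Occlusion-style augmentation: gray out a random rectangle of the image.
    # Parameters are illustrative.
    occlude = T.RandomErasing(p=1.0, scale=(0.1, 0.3), value=0.5)

    img = torch.rand(3, 224, 224)          # stand-in for a real image tensor
    occluded = occlude(img)                # same image with a random gray patch masked out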
link |
00:52:28.280
So, you know, I meant like, you have like, what is it?
link |
00:52:31.680
A table and you can't see behind the table.
link |
00:52:33.880
And you imagine there's a bunch of elves
link |
00:52:37.080
with bananas behind the table.
link |
00:52:38.840
Like, I wonder if there's useful
link |
00:52:40.440
to have a wild imagination for the network
link |
00:52:44.200
because that's possible or maybe not elves,
link |
00:52:46.120
but like puppies and kittens or something like that.
link |
00:52:49.000
Just have a wild imagination
link |
00:52:51.240
and like constantly be generating that wild imagination.
link |
00:52:55.080
Because in terms of data augmentation,
link |
00:52:57.560
as currently applied, it's super ultra, very boring.
link |
00:53:01.200
It's very basic data augmentation.
link |
00:53:02.920
I wonder if there's a benefit to being wildly imaginative
link |
00:53:07.040
while trying to be consistent with physical reality.
link |
00:53:11.880
I think it's a kind of a chicken and egg problem, right?
link |
00:53:14.200
Because to have like amazing data augmentation,
link |
00:53:16.400
you need to understand what the scene is.
link |
00:53:18.520
And we're trying to do data augmentation
link |
00:53:20.640
to learn what a scene is anyway.
link |
00:53:22.080
So it's basically just keeps going on.
link |
00:53:23.760
Before you understand it,
link |
00:53:24.800
just put elves with bananas
link |
00:53:26.400
until you know it not to be true.
link |
00:53:29.360
Just like children have a wild imagination
link |
00:53:31.680
until the adults ruin it all.
link |
00:53:33.960
Okay, so what are the different kinds of data augmentation
link |
00:53:36.960
that you've seen to be effective in visual intelligence?
link |
00:53:40.800
For like vision,
link |
00:53:42.040
it's a lot of these image filtering operations.
link |
00:53:44.160
So like blurring the image,
link |
00:53:46.520
you know, all the kind of Instagram filters
link |
00:53:48.160
that you can think of.
link |
00:53:49.440
So like arbitrarily like make the red super red,
link |
00:53:52.520
make the green super green, like saturate the image.
link |
00:53:55.840
Rotation, cropping.
link |
00:53:56.960
Rotation, cropping, exactly.
link |
00:53:58.440
All of these kinds of things.
link |
00:53:59.560
Like I said, lighting is a really interesting one to me.
link |
00:54:02.600
Like that feels like really complicated to do.
link |
00:54:04.760
I mean, they don't,
link |
00:54:05.600
the augmentations that we work on aren't like
link |
00:54:08.040
that involved,
link |
00:54:08.880
they're not going to be like
link |
00:54:09.720
physically realistic versions of lighting.
link |
00:54:11.280
It's not that you're assuming
link |
00:54:12.680
that there's a light source up
link |
00:54:13.680
and then you're moving it to the right
link |
00:54:15.080
and then what does the thing look like?
link |
00:54:17.000
It's really more about like brightness of the image,
link |
00:54:19.160
overall brightness of the image
link |
00:54:20.400
or overall contrast of the image and so on.
link |
00:54:22.520
But this is a really important point to me.
link |
00:54:25.080
I always thought that data augmentation
link |
00:54:28.680
holds an important key
link |
00:54:31.640
to big improvements in machine learning.
link |
00:54:33.840
And it seems that it is an important aspect
link |
00:54:36.640
of self supervised learning.
link |
00:54:39.080
So I wonder if there's big improvements to be achieved
link |
00:54:42.560
on much more intelligent kinds of data augmentation.
link |
00:54:46.680
For example, currently,
link |
00:54:48.320
maybe you can correct me if I'm wrong,
link |
00:54:50.160
data augmentation is not parameterized.
link |
00:54:52.760
Yeah.
link |
00:54:53.600
You're not learning.
link |
00:54:54.440
You're not learning.
link |
00:54:55.280
To me, it seems like data augmentation potentially
link |
00:54:59.800
should involve more learning
link |
00:55:02.000
than the learning process itself.
link |
00:55:04.160
Right.
link |
00:55:05.360
You're almost like thinking of like generative kind of,
link |
00:55:08.800
it's the elves with bananas.
link |
00:55:10.240
You're trying to,
link |
00:55:11.080
it's like very active imagination
link |
00:55:13.280
of messing with the world
link |
00:55:14.880
and teaching that mechanism for messing with the world
link |
00:55:17.640
to be realistic.
link |
00:55:19.120
Right.
link |
00:55:20.480
Because that feels like,
link |
00:55:22.640
I mean, it's imagination.
link |
00:55:24.200
It's just, as you said,
link |
00:55:25.600
it feels like us humans are able to,
link |
00:55:29.440
maybe sometimes subconsciously,
link |
00:55:30.680
imagine before we see the thing,
link |
00:55:33.000
imagine what we're expecting to see,
link |
00:55:35.480
like maybe several options.
link |
00:55:37.240
And especially, we probably forgot,
link |
00:55:38.800
but when we were younger,
link |
00:55:40.480
probably the possibilities were wilder, more numerous.
link |
00:55:44.200
And then as we get older,
link |
00:55:45.160
we come to understand the world
link |
00:55:47.400
and the possibilities of what we might see
link |
00:55:51.040
becomes less and less and less.
link |
00:55:53.120
So I wonder if you think there's a lot of breakthroughs
link |
00:55:55.600
yet to be had in data augmentation.
link |
00:55:57.160
And maybe also can you just comment on the stuff we have,
link |
00:55:59.760
is that a big part of self supervised learning?
link |
00:56:02.120
Yes.
link |
00:56:02.960
So data augmentation is like key to self supervised learning
link |
00:56:05.520
at least the kind of augmentation that we're using.
link |
00:56:08.320
And basically the fact that we're trying to learn
link |
00:56:11.040
these neural networks that are predicting these features
link |
00:56:13.920
from images that are robust under data augmentation
link |
00:56:17.080
has been the key for visual self supervised learning.
link |
00:56:19.560
And they play a fairly fundamental role to it.
link |
00:56:22.400
Now, the irony of all of this is that
link |
00:56:24.600
for like deep learning purists will say
link |
00:56:26.720
the entire point of deep learning is that
link |
00:56:28.640
you feed in the pixels to the neural network
link |
00:56:31.160
and it should figure out the patterns on its own.
link |
00:56:33.120
So if it really wants to look at edges,
link |
00:56:34.480
it should look at edges.
link |
00:56:35.640
You shouldn't really like really go
link |
00:56:36.720
and handcraft these like features, right?
link |
00:56:38.600
You shouldn't go tell it that look at edges.
link |
00:56:41.160
So data augmentation
link |
00:56:42.360
should basically be in the same category, right?
link |
00:56:44.400
Why should we tell the network
link |
00:56:46.040
or tell this entire learning paradigm
link |
00:56:48.200
what kinds of data augmentation that we're looking for?
link |
00:56:50.840
We are encoding a very sort of human specific bias there
link |
00:56:55.200
that we know things are like,
link |
00:56:57.560
if you change the contrast of the image,
link |
00:56:59.200
it should still be an apple
link |
00:57:00.280
or it should still see apple, not banana.
link |
00:57:02.240
And basically if we change like colors,
link |
00:57:05.880
it should still be the same kind of concept.
link |
00:57:08.040
Of course, this is not one,
link |
00:57:09.880
this doesn't feel like super satisfactory
link |
00:57:12.480
because a lot of our human knowledge
link |
00:57:14.560
or our human supervision
link |
00:57:15.760
is actually going into the data augmentation.
link |
00:57:17.600
So although we are calling it self supervised learning,
link |
00:57:19.680
a lot of the human knowledge
link |
00:57:21.040
is actually being encoded in the data augmentation process.
link |
00:57:23.520
So it's really like,
link |
00:57:24.360
we've kind of sneaked away the supervision at the input
link |
00:57:27.120
and we're like really designing
link |
00:57:28.520
these nice list of data augmentations
link |
00:57:30.360
that are working very well.
link |
00:57:31.640
Of course, the idea is that it's much easier
link |
00:57:33.720
to design a list of data augmentation than it is to do.
link |
00:57:36.600
So humans are nevertheless doing less and less work
link |
00:57:39.640
and maybe leveraging their creativity more and more.
link |
00:57:42.600
And when we say data augmentation is not parameterized,
link |
00:57:45.080
it means it's not part of the learning process.
link |
00:57:48.200
Do you think it's possible to integrate
link |
00:57:50.560
some of the data augmentation into the learning process?
link |
00:57:53.280
I think so.
link |
00:57:54.120
I think so.
link |
00:57:54.960
And in fact, it will be really beneficial for us
link |
00:57:57.440
because a lot of these data augmentations
link |
00:57:59.720
that we use in vision are very extreme.
link |
00:58:01.840
For example, like when you have certain concepts,
link |
00:58:05.400
again, a banana, you take the banana
link |
00:58:08.160
and then basically you change the color of the banana, right?
link |
00:58:10.560
So you make it a purple banana.
link |
00:58:12.440
Now this data augmentation process
link |
00:58:14.200
is actually independent of the,
link |
00:58:15.920
like it has no notion of what is present in the image.
link |
00:58:18.920
So it can change this color arbitrarily.
link |
00:58:20.520
It can make it a red banana as well.
link |
00:58:22.560
And now what we're doing is we're telling
link |
00:58:24.040
the neural network that this red banana
link |
00:58:26.160
and so a crop of this image which has the red banana
link |
00:58:29.280
and a crop of this image where I changed the color
link |
00:58:30.960
to a purple banana should be,
link |
00:58:32.360
the features should be the same.
link |
00:58:34.080
Now bananas aren't red or purple mostly.
link |
00:58:36.680
So really the data augmentation process
link |
00:58:38.560
should take into account what is present in the image
link |
00:58:41.120
and what are the kinds of physical realities
link |
00:58:43.080
that are possible.
link |
00:58:43.920
It shouldn't be completely independent of the image.
link |
00:58:45.840
So you might get big gains if you,
link |
00:58:48.840
instead of being drastic, do subtle augmentation
link |
00:58:51.560
but realistic augmentation.
link |
00:58:53.280
Right, realistic.
link |
00:58:54.120
I'm not sure if it's subtle, but like realistic for sure.
link |
00:58:56.280
If it's realistic, then even subtle augmentation
link |
00:58:59.600
will give you big benefits.
link |
00:59:00.680
Exactly, yeah.
link |
00:59:01.840
And it will be like for particular domains
link |
00:59:05.040
you might actually see like,
link |
00:59:06.440
if for example, now we're doing medical imaging,
link |
00:59:08.960
there are going to be certain kinds
link |
00:59:10.160
of like geometric augmentation
link |
00:59:11.440
which are not really going to be very valid
link |
00:59:13.480
for the human body.
link |
00:59:15.080
So if you were to like actually loop in data augmentation
link |
00:59:18.280
into the learning process,
link |
00:59:19.480
it will actually be much more useful.
link |
00:59:21.320
Now this actually does take us
link |
00:59:23.280
to maybe a semi supervised kind of a setting
link |
00:59:25.120
because you do want to understand
link |
00:59:27.480
what is it that you're trying to solve.
link |
00:59:29.080
So currently self supervised learning
link |
00:59:30.880
kind of operates in the wild, right?
link |
00:59:32.720
So you do the self supervised learning
link |
00:59:34.960
and the purists and all of us basically say that,
link |
00:59:37.560
okay, this should learn useful representations
link |
00:59:39.440
and they should be useful for any kind of end task,
link |
00:59:42.320
no matter if it's like banana recognition
link |
00:59:44.280
or like autonomous driving.
link |
00:59:46.240
Now it's a tall order.
link |
00:59:47.760
Maybe the first baby step for us should be that,
link |
00:59:50.480
okay, if you're trying to loop in this data augmentation
link |
00:59:52.640
into the learning process,
link |
00:59:53.920
then we at least need to have some sense
link |
00:59:56.000
of what we're trying to do.
link |
00:59:56.840
Are we trying to distinguish
link |
00:59:57.760
between different types of bananas
link |
00:59:59.560
or are we trying to distinguish between banana and apple
link |
01:00:02.040
or are we trying to do all of these things at once?
link |
01:00:04.400
And so some notion of like what happens at the end
link |
01:00:07.920
might actually help us do much better at this side.
link |
01:00:10.840
Let me ask you a ridiculous question.
link |
01:00:14.320
If I were to give you like a black box,
link |
01:00:16.280
like a choice to have an arbitrary large data set
link |
01:00:19.520
of real natural data
link |
01:00:22.320
versus really good data augmentation algorithms,
link |
01:00:26.640
which would you like to train in a self supervised way on?
link |
01:00:31.320
So natural data from the internet are arbitrary large,
link |
01:00:35.040
so unlimited data,
link |
01:00:37.360
or it's like more controlled good data augmentation
link |
01:00:41.760
on the finite data set.
link |
01:00:43.600
The thing is like,
link |
01:00:44.440
because our learning algorithms for vision right now
link |
01:00:47.240
really rely on data augmentation,
link |
01:00:49.360
even if you were to give me
link |
01:00:50.480
like an infinite source of like image data,
link |
01:00:52.880
I still need a good data augmentation algorithm.
link |
01:00:54.600
You need something that tells you
link |
01:00:56.080
that two things are similar.
link |
01:00:57.400
Right.
link |
01:00:58.240
And so something,
link |
01:00:59.080
because you've given me an arbitrary large data set,
link |
01:01:01.600
I still need to use data augmentation
link |
01:01:03.760
to take that image, construct
link |
01:01:05.360
like these two perturbations of it,
link |
01:01:06.920
and then learn from it.
link |
01:01:08.240
So the thing is our learning paradigm
link |
01:01:09.960
is very primitive right now.
link |
01:01:11.640
Yeah.
link |
01:01:12.480
Even if you were to give me lots of images,
link |
01:01:13.800
it's still not really useful.
link |
01:01:15.200
A good data augmentation algorithm
link |
01:01:16.520
is actually going to be more useful.
link |
01:01:18.040
So you can like reduce down the amount of data
link |
01:01:21.160
that you give me by like 10 times,
link |
01:01:22.920
but if you were to give me
link |
01:01:23.760
a good data augmentation algorithm,
link |
01:01:25.040
that would probably do better
link |
01:01:26.440
than giving me like 10 times the size of that data,
link |
01:01:29.040
but me having to rely on
link |
01:01:30.800
like a very primitive data augmentation algorithm.
link |
01:01:32.640
Like through tagging and all those kinds of things,
link |
01:01:35.040
is there a way to discover things
link |
01:01:37.240
that are semantically similar on the internet?
link |
01:01:39.600
Obviously there is, but they might be extremely noisy.
link |
01:01:42.520
And the difference might be farther away
link |
01:01:45.760
than you would be comfortable with.
link |
01:01:47.840
So, I mean, yes, tagging will help you a lot.
link |
01:01:49.720
It'll actually go a very long way
link |
01:01:51.480
in figuring out what images are related or not.
link |
01:01:54.360
And then, so, but then the purists would argue
link |
01:01:57.480
that when you're using human tags,
link |
01:01:58.880
because these tags are like supervision,
link |
01:02:01.200
is it really self supervised learning now?
link |
01:02:03.960
Because you're using human tags
link |
01:02:05.320
to figure out which images are like similar.
link |
01:02:07.960
Hashtag no filter means a lot of things.
link |
01:02:10.440
Yes.
link |
01:02:11.280
I mean, there are certain tags
link |
01:02:12.360
which are going to be applicable pretty much to anything.
link |
01:02:15.280
So they're pretty useless for learning.
link |
01:02:18.240
But I mean, certain tags are actually like
link |
01:02:20.800
the Eiffel Tower, for example,
link |
01:02:22.240
or the Taj Mahal, for example.
link |
01:02:23.800
These tags are like very indicative of what's going on.
link |
01:02:26.480
And they are, I mean, they are human supervision.
link |
01:02:29.440
Yeah.
link |
01:02:30.280
This is one of the tasks of discovering
link |
01:02:31.880
from human generated data strong signals
link |
01:02:34.880
that could be leveraged for self supervision.
link |
01:02:39.560
Like humans are doing so much work already.
link |
01:02:42.240
Like many years ago, there was something that was called,
link |
01:02:45.120
I guess, human computation back in the day.
link |
01:02:48.000
Humans are doing so much work.
link |
01:02:50.240
It'd be exciting to discover ways to leverage
link |
01:02:53.480
the work they're doing to teach machines
link |
01:02:55.840
without any extra effort from them.
link |
01:02:57.960
An example could be, like we said, driving,
link |
01:03:00.160
humans driving and machines can learn from the driving.
link |
01:03:03.000
I always hope that there could be some supervision signal
link |
01:03:06.760
discovered in video games,
link |
01:03:08.160
because there's so many people that play video games
link |
01:03:10.720
that it feels like so much effort is put into video games,
link |
01:03:15.840
into playing video games,
link |
01:03:17.680
and you can design video games somewhat cheaply
link |
01:03:21.760
to include whatever signals you want.
link |
01:03:24.640
It feels like that could be leveraged somehow.
link |
01:03:27.520
So people are using that.
link |
01:03:28.680
Like there are actually folks right here in UT Austin,
link |
01:03:30.840
like Philipp Krähenbühl is a professor at UT Austin.
link |
01:03:33.760
He's been like working on video games
link |
01:03:36.160
as a source of supervision.
link |
01:03:38.000
I mean, it's really fun.
link |
01:03:39.000
Like as a PhD student,
link |
01:03:40.040
getting to basically play video games all day.
link |
01:03:42.200
Yeah, but so I do hope that kind of thing scales
link |
01:03:44.920
and like ultimately boils down to discovering
link |
01:03:48.080
some undeniably very good signal.
link |
01:03:51.600
It's like masking in NLP.
link |
01:03:54.040
But that said, there's non contrastive methods.
link |
01:03:57.640
What do non contrastive energy based
link |
01:04:00.840
self supervised learning methods look like?
link |
01:04:03.520
And why are they promising?
link |
01:04:05.640
So like I said about contrastive learning,
link |
01:04:07.800
you have this notion of a positive and a negative.
link |
01:04:10.720
Now, the thing is, this entire learning paradigm
link |
01:04:13.640
really requires access to a lot of negatives
link |
01:04:17.160
to learn a good sort of feature space.
link |
01:04:19.040
The idea is if I tell you, okay,
link |
01:04:21.680
so a cat and a dog are similar,
link |
01:04:23.680
and they're very different from a banana.
link |
01:04:25.680
The thing is, this is a fairly simple analogy, right?
link |
01:04:28.000
Because bananas look visually very different
link |
01:04:30.840
from what cats and dogs do.
link |
01:04:32.440
So very quickly, if this is the only source
link |
01:04:34.440
of supervision that I'm giving you,
link |
01:04:36.600
your learning is not going to be like,
link |
01:04:38.080
after a point, the neural network
link |
01:04:39.760
is really not going to learn a lot.
link |
01:04:41.640
Because the negative that you're getting
link |
01:04:42.960
is going to be so random.
link |
01:04:43.880
So it can be, oh, a cat and a dog are very similar,
link |
01:04:46.640
but they're very different from a Volkswagen Beetle.
link |
01:04:49.880
Now, like this car looks very different
link |
01:04:51.920
from these animals again.
link |
01:04:52.920
So the thing is in contrastive learning,
link |
01:04:54.880
the quality of the negative sample really matters a lot.
link |
01:04:58.120
And so what has happened is basically that
link |
01:05:00.800
typically these methods that are contrastive
link |
01:05:02.840
really require access to lots of negatives,
link |
01:05:04.880
which becomes harder and harder to sort of scale
link |
01:05:06.920
when designing a learning algorithm.
link |
01:05:09.000
So that's been one of the reasons
link |
01:05:10.920
why non contrastive methods have become like popular
link |
01:05:13.680
and why people think that they're going to be more useful.
link |
01:05:16.360
So a non contrastive method, for example,
link |
01:05:18.440
like clustering is one non contrastive method.
link |
01:05:20.880
The idea basically being that you have
link |
01:05:22.480
two of these samples, so the cat and dog
link |
01:05:25.880
or two crops of this image,
link |
01:05:27.680
they belong to the same cluster.
link |
01:05:30.400
And so essentially you're basically doing clustering online
link |
01:05:33.320
when you're learning this network,
link |
01:05:35.080
and which is very different from having access
link |
01:05:36.720
to a lot of negatives explicitly.
link |
01:05:38.920
The other way which has become really popular
link |
01:05:40.840
is something called self distillation.
link |
01:05:43.120
So the idea basically is that you have a teacher network
link |
01:05:45.680
and a student network,
link |
01:05:47.480
and the teacher network produces a feature.
link |
01:05:49.520
So it takes in the image
link |
01:05:51.080
and basically the neural network figures out the patterns
link |
01:05:53.680
gets the feature out.
link |
01:05:55.240
And there's another neural network
link |
01:05:56.800
which is the student neural network
link |
01:05:57.960
and that also produces a feature.
link |
01:05:59.920
And now all you're doing is basically saying
link |
01:06:01.640
that the features produced by the teacher network
link |
01:06:03.960
and the student network should be very similar.
link |
01:06:06.120
That's it.
link |
01:06:06.960
There is no notion of a negative anymore.
link |
01:06:09.200
And that's it.
link |
01:06:10.040
So it's all about similarity maximization
link |
01:06:11.800
between these two features.
link |
01:06:13.680
And so all I need to now do is figure out
link |
01:06:16.320
how to have these two sorts of parallel networks,
link |
01:06:18.680
a student network and a teacher network.
link |
01:06:20.600
And basically researchers have figured out
link |
01:06:23.000
very cheap methods to do this.
link |
01:06:24.240
So you can actually have for free really
link |
01:06:26.760
two types of neural networks.
link |
01:06:29.000
They're kind of related,
link |
01:06:30.120
but they're different enough that you can actually
link |
01:06:32.040
basically have a learning problem set up.
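A minimal sketch of that setup, in the spirit of self-distillation methods like BYOL or DINO: the teacher is an exponential moving average of the student, so the two networks stay related but different, and no negatives are needed. Sizes and data here are placeholders.

    import copy
    import torch
    import torch.nn.functional as F

    # Self-distillation sketch: the student is trained to match the teacher's
    # feature on another view of the same image; the teacher is a slowly moving
    # average of the student and receives no gradients. Sizes are placeholders.
    student = torch.nn.Sequential(
        torch.nn.Linear(512, 256), torch.nn.ReLU(), torch.nn.Linear(256, 128))
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad = False

    view_a, view_b = torch.randn(8, 512), torch.randn(8, 512)  # stand-ins for two augmented views

    s = F.normalize(student(view_a), dim=-1)
    with torch.no_grad():
        t = F.normalize(teacher(view_b), dim=-1)
    loss = (2 - 2 * (s * t).sum(-1)).mean()          # maximize cosine similarity, no negatives
    loss.backward()

    momentum = 0.99
    with torch.no_grad():                            # teacher slowly tracks the student
        for tp, sp in zip(teacher.parameters(), student.parameters()):
            tp.mul_(momentum).add_(sp, alpha=1 - momentum)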
link |
01:06:34.000
So you can ensure that they always remain different enough.
link |
01:06:38.200
So the thing doesn't collapse into something boring.
link |
01:06:41.040
Exactly.
link |
01:06:41.880
So the main sort of enemy of self supervised learning,
link |
01:06:44.360
any kind of similarity maximization technique is collapse.
link |
01:06:47.560
A collapse means that you learn the same feature
link |
01:06:50.520
representation for all the images in the world,
link |
01:06:53.160
which is completely useless.
link |
01:06:54.640
Everything's a banana.
link |
01:06:55.640
Everything is a banana.
link |
01:06:56.560
Everything is a cat.
link |
01:06:57.400
Everything is a car.
link |
01:06:59.200
And so all we need to do is basically come up with ways
link |
01:07:02.120
to prevent collapse.
link |
01:07:03.320
Contrastive learning is one way of doing it.
link |
01:07:05.360
And then for example, like clustering or self distillation
link |
01:07:07.840
or other ways of doing it.
link |
01:07:09.240
We also had a recent paper where we used like
link |
01:07:11.840
decorrelation between like two sets of features
link |
01:07:15.400
to prevent collapse.
link |
01:07:16.760
So that's inspired a little bit by like Horace Barlow's
link |
01:07:18.880
neuroscience principles.
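A sketch of that decorrelation idea, in the spirit of that line of work: push the cross-correlation matrix between the two views' features toward the identity, which rules out the collapsed solution. The sizes and weighting below are illustrative.

    import torch

    # Decorrelation-style objective: cross-correlate the features of two views
    # across the batch, drive the diagonal toward 1 (views agree) and the
    # off-diagonal toward 0 (dimensions are decorrelated, so no collapse).
    torch.manual_seed(0)
    batch, dim = 256, 128
    z1 = torch.randn(batch, dim, requires_grad=True)   # stand-in features, view 1
    z2 = torch.randn(batch, dim, requires_grad=True)   # stand-in features, view 2

    z1n = (z1 - z1.mean(0)) / z1.std(0)                # normalize each feature dimension
    z2n = (z2 - z2.mean(0)) / z2.std(0)
    c = (z1n.t() @ z2n) / batch                        # (dim, dim) cross-correlation matrix

    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    loss = on_diag + 5e-3 * off_diag                   # weighting is illustrative
    loss.backward()
    print(loss.item())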
link |
01:07:20.680
By the way, I should comment that whoever counts
link |
01:07:23.520
the number of times the word banana, apple, cat and dog
link |
01:07:27.760
were used in this conversation wins the internet.
link |
01:07:30.120
I wish you luck.
link |
01:07:32.240
What is SwAV and the main improvement proposed
link |
01:07:36.760
in the paper, Unsupervised Learning of Visual Features
link |
01:07:40.360
by Contrasting Cluster Assignments?
link |
01:07:42.960
SwAV basically is a clustering based technique,
link |
01:07:46.400
which is for again, the same thing for self supervised
link |
01:07:49.240
learning in vision where we have two crops.
link |
01:07:52.440
And the idea basically is that you want the features
link |
01:07:55.280
from these two crops of an image to lie in the same cluster
link |
01:07:58.920
and basically crops that are coming from different images
link |
01:08:02.520
to be in different clusters.
link |
01:08:03.960
Now, typically in a sort of,
link |
01:08:05.880
if you were to do this clustering,
link |
01:08:07.120
you would perform clustering offline.
link |
01:08:09.520
What that means is you would,
link |
01:08:11.040
if you have a dataset of N examples,
link |
01:08:13.160
you would run over all of these N examples,
link |
01:08:15.360
get features for them, perform clustering.
link |
01:08:17.520
So basically get some clusters
link |
01:08:19.480
and then repeat the process again.
link |
01:08:21.960
So this is offline basically because I need to do one pass
link |
01:08:24.640
through the data to compute its clusters.
link |
01:08:27.200
SwAV is basically just a simple way of doing this online.
link |
01:08:30.200
So as you're going through the data,
link |
01:08:31.800
you're actually computing these clusters online.
link |
01:08:34.800
And so of course there is like a lot of tricks involved
link |
01:08:37.480
in how to do this in a robust manner without collapsing,
link |
01:08:40.120
but this is this sort of key idea to it.
link |
01:08:42.440
Is there a nice way to say what is the key methodology
link |
01:08:45.480
of the clustering that enables that?
link |
01:08:47.640
Right, so the idea basically is that
link |
01:08:51.000
when you have N samples,
link |
01:08:52.720
we assume that we have access to,
link |
01:08:54.920
like there are always K clusters in a dataset.
link |
01:08:57.040
K is a fixed number.
link |
01:08:57.880
So for example, K is 3000.
link |
01:09:00.160
And so if you have any,
link |
01:09:02.200
when you look at any sort of small number of examples,
link |
01:09:04.840
all of them must belong to one of these K clusters.
link |
01:09:08.000
And we impose this equipartition constraint.
link |
01:09:10.320
What this means is that basically
link |
01:09:15.200
your entire set of N samples
link |
01:09:16.880
should be equally partitioned into K clusters.
link |
01:09:19.440
So all your K clusters are basically equal,
link |
01:09:21.800
they have equal contribution to these N samples.
link |
01:09:24.400
And this ensures that we never collapse.
link |
01:09:26.520
So collapse can be viewed as a way
link |
01:09:28.280
in which all samples belong to one cluster, right?
link |
01:09:30.640
So all this, if all features become the same,
link |
01:09:33.160
then you have basically just one mega cluster.
link |
01:09:35.120
You don't even have like 10 clusters or 3000 clusters.
link |
01:09:38.120
So SwAV basically ensures that at each point,
link |
01:09:40.960
all these 3000 clusters are being used
link |
01:09:42.960
in the clustering process.
link |
01:09:45.040
And that's it.
link |
01:09:46.240
Basically just figure out how to do this online.
link |
01:09:48.480
And again, basically just make sure
link |
01:09:50.960
that two crops from the same image belong to the same cluster
link |
01:09:54.160
and others don't.
link |
01:09:55.720
And the fact they have a fixed K makes things simpler.
link |
01:09:58.840
Fixed K makes things simpler.
link |
01:10:00.360
Our clustering is not like really hard clustering,
link |
01:10:02.560
it's soft clustering.
link |
01:10:03.720
So basically you can be 0.2 to cluster number one
link |
01:10:06.880
and 0.8 to cluster number two.
link |
01:10:08.440
So it's not really hard.
link |
01:10:09.880
So essentially, even though we have like 3000 clusters,
link |
01:10:12.720
we can actually represent a lot of clusters.
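Putting those pieces together, here is a rough sketch of one SwAV-style step: soft assignments to K prototypes, a few balancing iterations to enforce the equipartition constraint, and a swapped prediction between the two crops. The sizes, temperatures, and iteration counts are illustrative, not the actual recipe.

    import torch
    import torch.nn.functional as F

    # Rough SwAV-style step. Sizes, temperatures, and iteration counts are illustrative.
    torch.manual_seed(0)
    batch, dim, K = 64, 128, 3000
    prototypes = F.normalize(torch.randn(K, dim), dim=-1)                  # K cluster centers
    z1 = F.normalize(torch.randn(batch, dim, requires_grad=True), dim=-1)  # crop 1 features
    z2 = F.normalize(torch.randn(batch, dim, requires_grad=True), dim=-1)  # crop 2 features

    def balanced_assignments(scores, n_iters=3):
        # Soft assignments whose cluster totals are equalized (the equipartition constraint).
        q = torch.exp(scores / 0.05).t()                     # (K, batch)
        q /= q.sum()
        for _ in range(n_iters):
            q /= q.sum(dim=1, keepdim=True); q /= K          # equal mass per cluster
            q /= q.sum(dim=0, keepdim=True); q /= q.shape[1] # equal mass per sample
        return (q * q.shape[1]).t()                          # (batch, K) soft assignments

    with torch.no_grad():
        q1 = balanced_assignments(z1 @ prototypes.t())       # targets from crop 1
        q2 = balanced_assignments(z2 @ prototypes.t())       # targets from crop 2

    p1 = F.log_softmax(z1 @ prototypes.t() / 0.1, dim=-1)    # crop 1 predicts clusters
    p2 = F.log_softmax(z2 @ prototypes.t() / 0.1, dim=-1)
    loss = -0.5 * ((q2 * p1).sum(-1).mean() + (q1 * p2).sum(-1).mean())  # swapped prediction
    loss.backward()
    print(loss.item())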
link |
01:10:15.160
What is SEER, S E E R?
link |
01:10:19.200
And what are the key results and insights in the paper,
link |
01:10:23.080
Self Supervised Pre Training of Visual Features in the Wild?
link |
01:10:27.360
What is this big, beautiful SEER system?
link |
01:10:30.680
SEER, so I'll first go to SwAV
link |
01:10:32.920
because SwAV is actually like one
link |
01:10:34.360
of the key components for SEER.
link |
01:10:35.760
So SwAV was, when we used SwAV,
link |
01:10:37.800
it was demonstrated on ImageNet.
link |
01:10:39.760
So typically like self supervised methods,
link |
01:10:42.920
the way we sort of operate is like in the research community,
link |
01:10:46.160
we kind of cheat.
link |
01:10:47.160
So we take ImageNet, which of course I talked about
link |
01:10:49.720
as having lots of labels.
link |
01:10:51.280
And then we throw away the labels,
link |
01:10:52.920
like throw away all the hard work that went behind
link |
01:10:54.920
basically the labeling process.
link |
01:10:56.800
And we pretend that it is unsupervised.
link |
01:11:00.240
But the problem here is that we have,
link |
01:11:02.840
like when we collected these images,
link |
01:11:05.120
the ImageNet dataset has a particular distribution
link |
01:11:08.200
of concepts, right?
link |
01:11:09.920
So these images are very curated.
link |
01:11:11.720
And what that means is these images, of course,
link |
01:11:15.240
belong to a certain set of noun concepts.
link |
01:11:17.640
And also ImageNet has this bias that all images
link |
01:11:20.360
contain an object, which is like very big
link |
01:11:22.440
and it's typically in the center.
link |
01:11:24.040
So when you're talking about a dog, it's a well framed dog,
link |
01:11:26.120
it's towards the center of the image.
link |
01:11:28.320
So a lot of the data augmentation,
link |
01:11:29.760
a lot of the sort of hidden assumptions
link |
01:11:31.480
in self supervised learning,
link |
01:11:33.400
actually really exploit this bias of ImageNet.
link |
01:11:37.360
And so, I mean, a lot of my work,
link |
01:11:39.680
a lot of work from other people always uses ImageNet
link |
01:11:42.000
sort of as the benchmark to show the success
link |
01:11:44.200
of self supervised learning.
link |
01:11:45.440
So you're implying that there's particular limitations
link |
01:11:47.680
to this kind of dataset?
link |
01:11:49.200
Yes, I mean, it's basically because our data augmentation
link |
01:11:51.880
that we designed, like all data augmentation
link |
01:11:55.320
that we designed for self supervised learning in vision
link |
01:11:57.480
are kind of overfit to ImageNet.
link |
01:11:59.360
But you're saying a little bit hard coded
link |
01:12:02.400
like the cropping.
link |
01:12:03.800
Exactly, the cropping parameters,
link |
01:12:05.480
the kind of lighting that we're using,
link |
01:12:07.280
the kind of blurring that we're using.
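For concreteness, a typical self supervised augmentation pipeline in torchvision looks something like this; the specific ranges below are made up for illustration, and it is exactly these hand tuned numbers that end up implicitly fit to ImageNet style, object centered photos:

    from torchvision import transforms

    # a typical self supervised augmentation pipeline; the exact parameters
    # below are illustrative, not the ones used in any particular paper
    augment = transforms.Compose([
        transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),   # cropping
        transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),            # lighting / color
        transforms.RandomGrayscale(p=0.2),
        transforms.GaussianBlur(kernel_size=23),               # blurring
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ])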
link |
01:12:08.800
Yeah, but you would, for a more in-the-wild dataset,
link |
01:12:11.960
you would need to be clever or more careful
link |
01:12:16.240
in setting the range of parameters
link |
01:12:17.520
and those kinds of things.
link |
01:12:18.920
So for SEER, our main goal was twofold.
link |
01:12:21.360
One, basically to move away from ImageNet for training.
link |
01:12:24.680
So the images that we used were like uncurated images.
link |
01:12:27.680
Now there's a lot of debate
link |
01:12:28.600
whether they're actually curated or not,
link |
01:12:30.040
but I'll talk about that later.
link |
01:12:32.360
But the idea was basically,
link |
01:12:33.880
these are going to be random internet images
link |
01:12:36.400
that we're not going to filter out
link |
01:12:37.920
based on like particular categories.
link |
01:12:40.080
So we did not say that, oh, images that belong to dogs
link |
01:12:42.880
and cats should be the only images
link |
01:12:44.280
that come in this dataset, banana.
link |
01:12:47.000
And basically, other images should be thrown out.
link |
01:12:50.040
So we didn't do any of that.
link |
01:12:51.800
So these are random internet images.
link |
01:12:53.560
And of course, it also goes back to like the problem
link |
01:12:56.040
of scale that you talked about.
link |
01:12:57.320
So these were basically about a billion or so images.
link |
01:13:00.120
And for context,
link |
01:13:01.560
the ImageNet version that we used earlier
link |
01:13:02.800
was 1 million images.
link |
01:13:04.280
So this is basically going like
link |
01:13:05.400
three orders of magnitude more.
link |
01:13:07.600
The idea was basically to see
link |
01:13:08.600
if we can train a very large convolutional model
link |
01:13:11.800
in a self supervised way on this uncurated,
link |
01:13:14.440
but really large set of images.
link |
01:13:16.400
And how well would this model do?
link |
01:13:18.280
So is self supervised learning really overfit to ImageNet
link |
01:13:21.440
or can it actually work in the wild?
link |
01:13:23.840
And it was also out of curiosity,
link |
01:13:25.720
what kind of things will this model learn?
link |
01:13:27.520
Will it actually be able to still figure out
link |
01:13:30.080
different types of objects and so on?
link |
01:13:32.000
Would there be particular kinds of tasks
link |
01:13:33.720
that would actually do better than an ImageNet train model?
link |
01:13:38.160
And so for SEER, one of our main findings was that
link |
01:13:40.960
we can actually train very large models
link |
01:13:43.120
in a completely self supervised way
link |
01:13:44.800
on lots of internet images
link |
01:13:46.360
without really necessarily filtering them out.
link |
01:13:48.600
Which was in itself a good thing
link |
01:13:49.760
because it's a fairly simple process, right?
link |
01:13:51.960
So you get images which are uploaded
link |
01:13:54.080
and you basically can immediately use them
link |
01:13:55.800
to train a model in an unsupervised way.
link |
01:13:57.680
You don't really need to sit and filter them out.
link |
01:13:59.720
These images can be cartoons, these can be memes,
link |
01:14:02.040
these can be actual pictures uploaded by people.
link |
01:14:04.440
And you don't really care about what these images are.
link |
01:14:06.160
You don't even care about what concepts they contain.
link |
01:14:08.520
So this was a very sort of simple setup.
link |
01:14:10.280
What image selection mechanism would you say
link |
01:14:12.880
is there like inherent in some aspect of the process?
link |
01:14:18.840
So you're kind of implying that there's almost none,
link |
01:14:21.280
but what is there would you say if you were to introspect?
link |
01:14:24.960
Right, so it's not completely uncurated.
link |
01:14:28.480
One way of imagining uncurated
link |
01:14:30.400
is basically you have like cameras
link |
01:14:32.920
that can take pictures at random viewpoints.
link |
01:14:35.200
When people upload pictures to the internet,
link |
01:14:37.400
they are typically going to care about the framing of it.
link |
01:14:40.320
They're not going to upload, say,
link |
01:14:41.840
the picture of a zoomed in wall, for example.
link |
01:14:43.800
Well, when you say internet, do you mean social networks?
link |
01:14:46.080
Yes. Okay.
link |
01:14:47.160
So these are not going to be like pictures
link |
01:14:48.680
of like a zoomed in table or a zoomed in wall.
link |
01:14:51.400
So it's not really completely uncurated
link |
01:14:53.160
because people do have the like photographer's bias
link |
01:14:55.800
where they do want to keep things
link |
01:14:57.040
towards the center a little bit,
link |
01:14:58.640
or like really have like nice looking things
link |
01:15:01.320
and so on in the picture.
link |
01:15:02.680
So that's the kind of bias that typically exists
link |
01:15:05.640
in this data set and also the user base, right?
link |
01:15:07.720
You're not going to get lots of pictures
link |
01:15:09.320
from different parts of the world
link |
01:15:10.520
because there are certain parts of the world
link |
01:15:12.120
where people may not actually be uploading
link |
01:15:14.320
a lot of pictures to the internet
link |
01:15:15.440
or may not even have access to a lot of internet.
link |
01:15:17.360
So this is a giant data set and a giant neural network.
link |
01:15:21.720
I don't think we've talked about what architectures
link |
01:15:24.800
work well for SSL, for self supervised learning.
link |
01:15:29.320
For SEER and for SwAV, we were using convolutional networks,
link |
01:15:32.480
but recently in a work called DINO,
link |
01:15:34.160
we've basically started using transformers for vision.
link |
01:15:36.880
Both seem to work really well, ConvNets and transformers.
link |
01:15:39.840
And depending on what you want to do,
link |
01:15:41.120
you might choose to use a particular formulation.
link |
01:15:43.560
So for SEER, it was a ConvNet.
link |
01:15:45.400
It was particularly a RegNet model,
link |
01:15:47.480
which was also a work from Facebook.
link |
01:15:49.720
RegNets are like really good when it comes to compute
link |
01:15:52.640
versus like accuracy.
link |
01:15:54.760
So because it was a very efficient model,
link |
01:15:56.920
compute and memory wise efficient,
link |
01:15:59.680
and basically it worked really well in terms of scaling.
link |
01:16:02.480
So we used a very large RegNet model
link |
01:16:04.200
and trained it on a billion images.
link |
01:16:05.480
Can you maybe quickly comment on what RegNets are?
link |
01:16:09.680
It comes from this paper, Designing Network Design Spaces.
link |
01:16:13.520
This is a super interesting concept
link |
01:16:15.520
that emphasizes how to create efficient neural networks,
link |
01:16:18.400
large neural networks.
link |
01:16:19.520
So one of the sort of key takeaways from this paper,
link |
01:16:21.800
which the authors, like whenever you hear them
link |
01:16:23.400
present this work, they keep saying is,
link |
01:16:26.040
a lot of neural networks are characterized
link |
01:16:27.960
in terms of flops, right?
link |
01:16:29.040
Flops basically being the floating point operations.
link |
01:16:31.480
And people really love to use flops to say,
link |
01:16:33.320
this model is like really computationally heavy,
link |
01:16:36.200
or like our model is computationally cheap and so on.
link |
01:16:39.000
Now it turns out that flops are really not a good indicator
link |
01:16:41.880
of how good a particular network is,
link |
01:16:43.840
like how efficient it is really.
link |
01:16:45.960
And a better indicator is the activations
link |
01:16:49.120
or the memory that is being used by this particular model.
link |
01:16:52.160
And so designing, like one of the key findings
link |
01:16:55.000
from this paper was basically that you need to design
link |
01:16:57.400
network families or neural network architectures
link |
01:17:00.160
that are actually very efficient in the memory space as well,
link |
01:17:02.800
not just in terms of pure flops.
link |
01:17:04.840
So RegNet is basically a network architecture family
link |
01:17:07.600
that came out of this paper that is particularly good
link |
01:17:10.280
at both flops and the sort of memory required for it.
link |
01:17:13.600
And of course it builds upon like earlier work,
link |
01:17:15.800
like ResNet being like the sort of more popular inspiration
link |
01:17:18.640
for it, where you have residual connections.
link |
01:17:20.440
But one of the things in this work is basically
link |
01:17:22.440
they also use like squeeze excitation blocks.
link |
01:17:25.120
So it's a lot of nice sort of technical innovation
link |
01:17:27.120
in all of this from prior work,
link |
01:17:28.760
and a lot of the ingenuity of these particular authors
link |
01:17:31.440
in how to combine these multiple building blocks.
link |
01:17:34.160
But the key constraint was optimize for both flops
link |
01:17:36.880
and memory when you're basically doing this,
link |
01:17:38.360
don't just look at flops.
link |
01:17:39.600
And that allows you to, what,
link |
01:17:42.360
sort of have very large networks through this process,
link |
01:17:47.320
that are optimized for efficiency, for low memory.
link |
01:17:51.280
Also, just in terms of pure hardware,
link |
01:17:53.600
they fit very well on GPU memory.
link |
01:17:55.880
So they can be like really powerful neural network
link |
01:17:57.920
architectures with lots of parameters, lots of flops,
link |
01:18:00.200
but also because they're like efficient in terms of
link |
01:18:02.760
the amount of memory that they're using,
link |
01:18:04.040
you can actually fit a lot of these,
link |
01:18:06.600
you can fit a very large model on a single GPU for example.
link |
01:18:09.600
Would you say that the choice of architecture
link |
01:18:14.280
matters more than the choice of maybe data augmentation
link |
01:18:17.640
techniques?
link |
01:18:18.560
Is there a possibility to say what matters more?
link |
01:18:21.720
You kind of imply that you can probably go really far
link |
01:18:24.400
with just using basic ConvNets.
link |
01:18:27.600
All right, I think like data and data augmentation,
link |
01:18:30.600
the algorithm being used for the self supervised training
link |
01:18:33.280
matters a lot more than the particular kind of architecture.
link |
01:18:36.400
With different types of architecture,
link |
01:18:37.680
you will get different like properties in the resulting
link |
01:18:40.320
sort of representation.
link |
01:18:41.720
But really, I mean, the secret sauce is in the augmentation
link |
01:18:44.640
and the algorithm being used to train them.
link |
01:18:47.080
The architectures, I mean, at this point,
link |
01:18:49.240
a lot of them perform very similarly,
link |
01:18:51.680
depending on like the particular task that you care about,
link |
01:18:53.840
they have certain advantages and disadvantages.
link |
01:18:56.400
Is there something interesting to be said about what it
link |
01:18:58.680
takes with SEER to train a giant neural network?
link |
01:19:01.920
You're talking about a huge amount of data,
link |
01:19:04.160
a huge neural network.
link |
01:19:05.800
Is there something interesting to be said of how to
link |
01:19:08.280
effectively train something like that fast?
link |
01:19:11.280
Lots of GPUs.
link |
01:19:13.000
Okay.
link |
01:19:15.480
I mean, so the model was like a billion parameters.
link |
01:19:18.800
And it was trained on a billion images.
link |
01:19:20.840
So basically the same number of parameters
link |
01:19:23.320
as the number of images, and it took a while.
link |
01:19:26.160
I don't remember the exact number, it's in the paper,
link |
01:19:28.600
but it took a while.
link |
01:19:31.840
I guess I'm trying to get at is,
link |
01:19:34.640
when you're thinking of scaling this kind of thing,
link |
01:19:38.680
I mean, one of the exciting possibilities of self
link |
01:19:42.600
supervised learning is the several orders of magnitude
link |
01:19:45.920
scaling of everything, both the neural network
link |
01:19:49.000
and the size of the data.
link |
01:19:50.920
And so the question is,
link |
01:19:52.600
do you think there's some interesting tricks to do large
link |
01:19:56.520
scale distributed compute,
link |
01:19:57.880
or is that really outside of even deep learning?
link |
01:20:00.920
That's more about like hardware engineering.
link |
01:20:04.360
I think more and more,
link |
01:20:07.240
a lot of systems are designed
link |
01:20:10.160
basically taking into account
link |
01:20:11.400
the machine learning needs, right?
link |
01:20:12.520
So because whenever you're doing this kind of
link |
01:20:14.760
distributed training, there is a lot of intercommunication
link |
01:20:17.040
between nodes.
link |
01:20:17.880
So like gradients or the model parameters are being passed.
link |
01:20:20.680
So you really want to minimize communication costs
link |
01:20:22.840
when you really want to scale these models up.
link |
01:20:25.280
You basically want to be able to do
link |
01:20:29.240
as limited an amount of communication as possible.
link |
01:20:31.520
So currently like a dominant paradigm
link |
01:20:33.320
is synchronized sort of training.
link |
01:20:35.160
So essentially after every sort of gradient step,
link |
01:20:38.520
you basically have a synchronization step
link |
01:20:41.240
between all the sort of compute chips
link |
01:20:43.440
that you're training on.
link |
01:20:45.720
I think asynchronous training was popular,
link |
01:20:47.880
but it doesn't seem to perform as well.
link |
01:20:50.440
But in general, I think that's sort of the,
link |
01:20:53.400
I guess it's outside my scope as well.
link |
01:20:55.320
But the main thing is like minimize the amount of
link |
01:21:00.000
synchronization steps that you have.
link |
01:21:01.960
That has been the key takeaway, at least in my experience.
link |
01:21:04.680
The others I have no idea about, how to design the chip.
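As a minimal sketch of that synchronized setup in PyTorch (assuming the distributed process group has already been initialized and each worker owns one GPU; the model and batch here are stand-ins):

    import torch
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    # assumes torch.distributed.init_process_group(...) has been called
    # and each worker process has been assigned its own GPU
    model = DDP(nn.Linear(128, 10).cuda())
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    images = torch.randn(32, 128).cuda()       # stand-in for one batch
    loss = model(images).pow(2).mean()         # stand-in loss
    optimizer.zero_grad()
    loss.backward()    # the synchronization step: DDP all-reduces (averages)
                       # every gradient across all workers before the update
    optimizer.step()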
link |
01:21:06.600
Yeah, there are very few things that make Jim Keller's eyes
link |
01:21:11.200
light up as much as talking about giant computers doing
link |
01:21:15.360
like that fast communication that you're talking about
link |
01:21:18.040
when they're training machine learning systems.
link |
01:21:21.240
What is VISSL, V I S S L, the PyTorch based SSL library?
link |
01:21:27.880
What are the use cases that you might have?
link |
01:21:30.120
VISSL basically was born out of a lot of us at Facebook
link |
01:21:33.040
doing self supervised learning research.
link |
01:21:35.120
So it's a common framework in which we have like a lot of
link |
01:21:38.800
self supervised learning methods implemented for vision.
link |
01:21:41.720
It also has in itself a benchmark of tasks
link |
01:21:45.920
that you can evaluate the self supervised representations on.
link |
01:21:48.800
So the use case for it is basically for anyone who's either
link |
01:21:51.640
trying to evaluate their self supervised model
link |
01:21:53.760
or train their self supervised model,
link |
01:21:56.000
or a researcher who's trying to build
link |
01:21:57.800
a new self supervised technique.
link |
01:21:59.240
So it's basically supposed to be all of these things.
link |
01:22:01.520
So as a researcher before VISSL, for example,
link |
01:22:04.480
or like when we started doing this work fairly seriously
link |
01:22:06.960
at Facebook, it was very hard for us to go and implement
link |
01:22:09.960
every self supervised learning model,
link |
01:22:11.880
test it out in a like sort of consistent manner.
link |
01:22:14.560
The experimental setup was very different
link |
01:22:16.440
across different groups.
link |
01:22:18.160
Even when someone said that they were reporting
link |
01:22:20.440
ImageNet accuracy, it could mean lots of different things.
link |
01:22:23.200
So with VISSL, we tried to really sort of standardize that
link |
01:22:25.400
as much as possible.
link |
01:22:26.400
And there was a paper like we did in 2019
link |
01:22:28.280
just about benchmarking.
link |
01:22:29.800
And so VISSL basically builds upon a lot of this kind of work
link |
01:22:32.880
that we did about like benchmarking.
link |
01:22:35.160
And then every time
link |
01:22:37.200
we come up with a self supervised learning method,
link |
01:22:39.000
a lot of us try to push that into VISSL as well,
link |
01:22:41.240
just so that it basically is like the central piece
link |
01:22:43.480
where a lot of these methods can reside.
link |
01:22:46.400
Just out of curiosity, people maybe,
link |
01:22:49.240
so certainly outside of Facebook, but just researchers,
link |
01:22:52.040
or just even people that know how to program in Python
link |
01:22:54.960
and know how to use PyTorch, what would be the use case?
link |
01:22:58.680
What would be a fun thing to play around with VISSL on?
link |
01:23:01.360
Like what's a fun thing to play around
link |
01:23:04.320
with self supervised learning on, would you say?
link |
01:23:07.960
Is there a good Hello World program?
link |
01:23:09.800
Like is it always about big size that's important to have,
link |
01:23:14.640
or is there fun little smaller case playgrounds
link |
01:23:18.880
to play around with?
link |
01:23:19.760
So we're trying to like push something towards that.
link |
01:23:22.440
I think there are a few setups out there,
link |
01:23:24.360
but nothing like super standard on the smaller scale.
link |
01:23:26.840
I mean, ImageNet in itself is actually pretty big also.
link |
01:23:29.320
So that is not something
link |
01:23:31.440
which is like feasible for a lot of people.
link |
01:23:33.520
But we are trying to push
link |
01:23:34.880
for smaller sort of use cases.
link |
01:23:36.400
The thing is, at a smaller scale,
link |
01:23:39.000
a lot of the observations
link |
01:23:40.320
or a lot of the algorithms that work
link |
01:23:41.800
don't necessarily translate into the medium
link |
01:23:43.760
or the larger scale.
link |
01:23:45.000
So it's really tricky to come up
link |
01:23:46.160
with a good small scale setup
link |
01:23:47.480
where a lot of your empirical observations
link |
01:23:49.160
will really translate to the other setup.
link |
01:23:51.560
So it's been really challenging.
link |
01:23:53.280
I've been trying to do that for a little bit as well
link |
01:23:54.920
because it does take time to train stuff on ImageNet.
link |
01:23:56.880
It does take time to train on like more images,
link |
01:23:59.880
but pretty much every time I've tried to do that,
link |
01:24:02.240
it's been unsuccessful
link |
01:24:03.080
because all the observations I draw
link |
01:24:04.480
from my set of experiments on a smaller data set
link |
01:24:07.440
don't translate into ImageNet
link |
01:24:09.440
or like don't translate into another sort of data set.
link |
01:24:11.760
So it's been hard for us to figure this one out,
link |
01:24:14.240
but it's an important problem.
link |
01:24:15.760
So there's this really interesting idea
link |
01:24:17.960
of learning across multiple modalities.
link |
01:24:20.840
You have a CVPR 2021 best paper candidate
link |
01:24:26.400
titled audio visual instance discrimination
link |
01:24:29.280
with cross modal agreement.
link |
01:24:31.440
What are the key results, insights in this paper
link |
01:24:33.880
and what can you say in general
link |
01:24:35.240
about the promise and power of multimodal learning?
link |
01:24:37.640
For this paper, it actually came as a little bit
link |
01:24:40.000
of a shock to me at how well it worked.
link |
01:24:41.960
So I can describe what the problem set up was.
link |
01:24:44.160
So it's been used in the past by lots of folks
link |
01:24:46.560
like for example, Andrew Owens from MIT,
link |
01:24:48.400
Alyosha Efros from Berkeley,
link |
01:24:49.960
Andrew Zisserman from Oxford.
link |
01:24:51.160
So a lot of these people have been
link |
01:24:52.200
sort of showing results in this.
link |
01:24:53.840
Of course, I was aware of this result,
link |
01:24:55.520
but I wasn't really sure how well it would work in practice
link |
01:24:58.600
for like other sort of downstream tasks.
link |
01:25:00.600
So the results kept getting better.
link |
01:25:02.440
And I wasn't sure if like a lot of our insights
link |
01:25:04.200
from self supervised learning would translate
link |
01:25:05.920
into this multimodal learning problem.
link |
01:25:08.320
So multimodal learning is when you have like,
link |
01:25:12.880
when you have multiple modalities.
link |
01:25:14.280
That's not even cool.
link |
01:25:15.680
Okay, so the particular modalities
link |
01:25:19.400
that we worked on in this work were audio and video.
link |
01:25:22.040
So the idea was basically, if you have a video,
link |
01:25:23.920
you have its corresponding audio track.
link |
01:25:25.880
And you want to use both of these signals,
link |
01:25:27.560
the audio signal and the video signal
link |
01:25:29.280
to learn a good representation for video
link |
01:25:31.280
and good representation for audio.
link |
01:25:32.720
Like this podcast.
link |
01:25:33.720
Like this podcast, exactly.
link |
01:25:35.480
So what we did in this work was basically train
link |
01:25:38.160
two different neural networks,
link |
01:25:39.400
one on the video signal, one on the audio signal.
link |
01:25:41.960
And what we wanted is basically the features
link |
01:25:43.800
that we get from both of these neural networks
link |
01:25:45.400
should be similar.
link |
01:25:46.800
So it should basically be able to produce
link |
01:25:48.720
the same kinds of features from the video
link |
01:25:51.120
and the same kinds of features from the audio.
link |
01:25:53.240
Now, why is this useful?
link |
01:25:54.280
Well, for a lot of these objects that we have,
link |
01:25:56.680
there is a characteristic sound, right?
link |
01:25:58.280
So trains, when they go by,
link |
01:25:59.520
they make a particular kind of sound.
link |
01:26:00.760
Boats make a particular kind of sound.
link |
01:26:02.480
People, when they're jumping around,
link |
01:26:03.840
will like shout, whatever.
link |
01:26:06.240
Bananas don't make a sound.
link |
01:26:07.280
So you can't learn anything about bananas there.
link |
01:26:09.400
Or when humans mentioned bananas.
link |
01:26:11.640
Well, yes, when they say the word banana, then.
link |
01:26:13.520
So you can't trust basically anything
link |
01:26:15.080
that comes out of a human's mouth as a source,
link |
01:26:17.120
that source of audio is useless.
link |
01:26:19.040
The typical use case is basically like,
link |
01:26:20.640
for example, someone playing a musical instrument.
link |
01:26:22.440
So guitars have a particular kind of sound and so on.
link |
01:26:24.720
So because a lot of these things are correlated,
link |
01:26:27.120
the idea in multimodal learning
link |
01:26:28.480
is to take these two kinds of modalities,
link |
01:26:30.160
video and audio, and learn a common embedding space,
link |
01:26:33.160
a common feature space where both of these
link |
01:26:35.240
related modalities can basically be close together.
link |
01:26:38.560
And again, you use contrastive learning for this.
link |
01:26:40.600
So in contrastive learning, basically the video
link |
01:26:43.360
and the corresponding audio are positives.
link |
01:26:45.520
And you can take any other video or any other audio
link |
01:26:48.200
and that becomes a negative.
link |
01:26:49.840
And so basically that's it.
link |
01:26:51.000
It's just a simple application of contrastive learning.
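A minimal sketch of that cross modal contrastive setup (a simplified version, not the exact loss from the paper; the batch size, feature dimension, and temperature are illustrative):

    import torch
    import torch.nn.functional as F

    # features from the two networks for a batch of N clips:
    # v[i] and a[i] come from the same video, so they are the positive pair
    N, d = 16, 256
    v = F.normalize(torch.randn(N, d), dim=1)     # video features
    a = F.normalize(torch.randn(N, d), dim=1)     # audio features

    # similarity of every video in the batch to every audio in the batch
    logits = v @ a.T / 0.07                       # temperature is illustrative

    # the i-th video should match the i-th audio; every other pairing is a negative
    targets = torch.arange(N)
    loss = F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)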
link |
01:26:53.720
The main sort of finding from this work for us
link |
01:26:56.840
was basically that you can actually learn
link |
01:26:58.680
very, very powerful feature representations,
link |
01:27:00.760
very, very powerful video representations.
link |
01:27:02.840
So you can learn the sort of video network
link |
01:27:05.400
that we ended up learning can actually be used
link |
01:27:07.480
for downstream, for example, recognizing human actions
link |
01:27:11.000
or recognizing different types of sounds, for example.
link |
01:27:14.440
So this was sort of the key finding.
link |
01:27:17.160
Can you give kind of an example of a human action
link |
01:27:20.200
or like just so we can build up intuition
link |
01:27:23.400
of what kind of thing?
link |
01:27:24.360
Right, so there is this data set called kinetics,
link |
01:27:26.880
for example, which has like 400 different types
link |
01:27:28.640
of human actions.
link |
01:27:29.480
So people jumping, people doing different kinds of sports
link |
01:27:32.880
or different types of swimming.
link |
01:27:34.240
So like different strokes and swimming, golf and so on.
link |
01:27:37.600
So there are like just different types of actions
link |
01:27:39.640
right there.
link |
01:27:40.560
And the point is this kind of video network
link |
01:27:42.600
that you learn in a self supervised way
link |
01:27:44.360
can be used very easily to kind of recognize
link |
01:27:46.920
these different types of actions.
link |
01:27:48.880
It can also be used for recognizing
link |
01:27:50.440
different types of objects.
link |
01:27:53.120
And what we did is we tried to visualize
link |
01:27:54.760
whether the network can figure out
link |
01:27:56.080
where the sound is coming from.
link |
01:27:57.880
So basically, give it a video
link |
01:27:59.840
say of a person just strumming a guitar,
link |
01:28:03.000
but of course, there is no audio in this.
link |
01:28:04.760
And now you give it this sound of a guitar.
link |
01:28:07.160
And you ask like basically try to visualize
link |
01:28:08.880
where the network thinks the sound is coming from.
link |
01:28:12.520
And that can kind of basically draw like
link |
01:28:14.560
when you visualize it,
link |
01:28:15.400
you can see that it's basically focusing on the guitar.
link |
01:28:17.480
Yeah, that's surreal.
link |
01:28:18.320
And the same thing, for example,
link |
01:28:20.160
for certain people's voices,
link |
01:28:21.480
like famous celebrities voices,
link |
01:28:22.920
it can actually figure out where their mouth is.
link |
01:28:26.040
So it can actually distinguish different people's voices,
link |
01:28:28.600
for example, a little bit as well.
link |
01:28:30.480
Without that ever being annotated in any way.
link |
01:28:33.680
Right, so this is all what it had discovered.
link |
01:28:35.520
We never pointed out that this is a guitar
link |
01:28:38.200
and this is the kind of sound it produces.
link |
01:28:40.080
It can actually naturally figure that out
link |
01:28:41.520
because it's seen so many correlations of this sound
link |
01:28:44.200
coming with this kind of like an object
link |
01:28:46.680
that it basically learns to associate this sound
link |
01:28:49.040
with this kind of an object.
link |
01:28:50.000
Yeah, that's really fascinating, right?
link |
01:28:52.760
That's really interesting.
link |
01:28:53.600
So the idea with this kind of network
link |
01:28:55.200
is then you then fine tune it for a particular task.
link |
01:28:57.920
So this is forming like a really good knowledge base
link |
01:29:01.880
within a neural network based on which you could then
link |
01:29:04.320
train a little bit more to accomplish a specific task.
link |
01:29:07.720
Well, so you don't need a lot of videos of humans
link |
01:29:11.680
doing actions annotated.
link |
01:29:12.800
You can just use a few of them to basically get your.
link |
01:29:16.040
How much insight do you draw from the fact
link |
01:29:18.520
that it can figure out where the sound is coming from?
link |
01:29:23.440
I'm trying to see, so that's kind of very,
link |
01:29:26.160
it's very CVPR beautiful, right?
link |
01:29:28.520
It's a cool little insight.
link |
01:29:30.000
I wonder how profound that is.
link |
01:29:33.000
Does it speak to the idea that multiple modalities
link |
01:29:39.320
are somehow much bigger than the sum of their parts?
link |
01:29:44.120
Or is it really, really useful to have multiple modalities?
link |
01:29:48.000
Or is it just that cool thing that there's parts
link |
01:29:50.640
of our world that can be revealed like effectively
link |
01:29:57.320
through multiple modalities,
link |
01:29:58.400
but most of it is really all about vision
link |
01:30:01.200
or about one of the modalities?
link |
01:30:03.880
I would say I'm tending a little more towards the second part.
link |
01:30:07.760
So most of it can be sort of figured out with one modality,
link |
01:30:10.680
but having an extra modality always helps you.
link |
01:30:13.160
So in this case, for example,
link |
01:30:14.560
like one thing is,
link |
01:30:17.720
if you observe someone cutting something
link |
01:30:19.400
and you don't have any sort of sound there,
link |
01:30:21.960
whether it's an apple or whether it's an onion,
link |
01:30:25.080
it's very hard to figure that out.
link |
01:30:26.720
But if you hear someone cutting it,
link |
01:30:28.240
it's very easy to figure it out because apples and onions
link |
01:30:30.760
make a very different kind of characteristic sound
link |
01:30:33.560
when they're cut.
link |
01:30:34.840
So to really figure this out, based on audio
link |
01:30:36.880
it's much easier.
link |
01:30:38.240
So your life will become much easier
link |
01:30:40.040
when you have access to different kinds of modalities.
link |
01:30:42.280
And the other thing is, so I like to relate it in this way,
link |
01:30:45.040
it may be like completely wrong,
link |
01:30:46.360
but the distributional hypothesis in NLP,
link |
01:30:49.320
where context basically gives kind of meaning to that word,
link |
01:30:53.040
sound kind of does that too.
link |
01:30:55.040
So if you have the same sound,
link |
01:30:57.160
so that's the same context across different videos,
link |
01:30:59.840
you're very likely to be observing the same kind of concept.
link |
01:31:03.000
So that's the kind of reason
link |
01:31:04.280
why it figures out the guitar thing, right?
link |
01:31:06.440
It observed the same sound across multiple different videos
link |
01:31:09.760
and it figures out maybe this is the common factor
link |
01:31:11.880
that's actually doing it.
link |
01:31:13.240
I wonder, I used to have this argument with my dad a bunch
link |
01:31:17.440
about creating general intelligence,
link |
01:31:19.760
whether smell is an important,
link |
01:31:22.840
like if that's important sensory information,
link |
01:31:25.480
mostly we're talking about like falling in love
link |
01:31:27.560
with an AI system and for him,
link |
01:31:30.000
smell and touch are important.
link |
01:31:31.440
And I was arguing that it's not at all.
link |
01:31:33.880
It's important, it's nice and everything,
link |
01:31:35.320
but like you can fall in love with just language really,
link |
01:31:38.400
but a voice is very powerful and vision is next
link |
01:31:41.400
and smell is not that important.
link |
01:31:43.880
Can I ask you about this process of active learning?
link |
01:31:46.880
You mentioned interactivity.
link |
01:31:49.200
Right.
link |
01:31:50.040
Is there some value
link |
01:31:52.920
within the self supervised learning context
link |
01:31:57.040
to select parts of the data in intelligent ways
link |
01:32:02.280
such that they would most benefit the learning process?
link |
01:32:06.880
So I think so.
link |
01:32:07.720
I mean, I know I'm talking to an active learning fan here,
link |
01:32:10.320
so of course I know the answer.
link |
01:32:12.640
First you were talking bananas
link |
01:32:14.000
and now you're talking about active learning.
link |
01:32:15.600
I love it.
link |
01:32:16.720
I think Yann LeCun told me that active learning
link |
01:32:18.800
is not that interesting.
link |
01:32:20.480
I think back then I didn't want to argue with him too much,
link |
01:32:24.400
but when we talk again,
link |
01:32:26.040
we're gonna spend three hours arguing about active learning.
link |
01:32:28.400
My sense was you can go extremely far with active learning,
link |
01:32:32.760
perhaps farther than anything else.
link |
01:32:34.920
Like to me, there's this kind of intuition
link |
01:32:37.960
that similar to data augmentation,
link |
01:32:40.840
you can get a lot from the data,
link |
01:32:45.280
from intelligent optimized usage of the data.
link |
01:32:50.480
I'm trying to speak generally in such a way
link |
01:32:53.200
that includes data augmentation
link |
01:32:55.280
and active learning,
link |
01:32:57.040
that there's something about maybe interactive exploration
link |
01:32:59.880
of the data that at least is part
link |
01:33:03.640
of the solution to intelligence, like an important part.
link |
01:33:07.160
I don't know what your thoughts are
link |
01:33:08.200
on active learning in general.
link |
01:33:09.320
I actually really like active learning.
link |
01:33:10.840
So back in the day we did this largely ignored CVPR paper
link |
01:33:14.200
called learning by asking questions.
link |
01:33:16.520
So the idea was basically you would train an agent
link |
01:33:18.240
that would ask a question about the image.
link |
01:33:20.120
It would get an answer
link |
01:33:21.520
and basically then it would update itself.
link |
01:33:23.360
It would see the next image.
link |
01:33:24.360
It would decide what's the next hardest question
link |
01:33:26.800
that I can ask to learn the most.
link |
01:33:28.760
And the idea was basically because it was being smart
link |
01:33:31.320
about the kinds of questions it was asking,
link |
01:33:33.480
it would learn in fewer samples.
link |
01:33:35.080
It would be more efficient at using data.
link |
01:33:37.880
And we did find to some extent
link |
01:33:39.400
that it was actually better than randomly asking questions.
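A generic sketch of that kind of loop, using predictive uncertainty to pick what to ask about (this is not the setup from the paper; the model, pool, and oracle below are stand-ins, and for brevity queried examples are not removed from the pool):

    import torch
    import torch.nn.functional as F

    def most_uncertain(model, pool):
        # pick the unlabeled example whose predicted distribution has the
        # highest entropy, i.e. the one the model knows the least about
        with torch.no_grad():
            probs = F.softmax(model(pool), dim=1)
            entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=1)
        return entropy.argmax().item()

    model = torch.nn.Linear(32, 5)                 # stand-in classifier
    pool = torch.randn(100, 32)                    # stand-in unlabeled pool
    oracle = lambda i: torch.randint(0, 5, (1,))   # stand-in for a human answer
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for _ in range(10):                            # ask ten questions
        i = most_uncertain(model, pool)            # choose what to ask about
        y = oracle(i)                              # get the answer
        loss = F.cross_entropy(model(pool[i:i+1]), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()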
link |
01:33:42.040
Kind of weird thing about active learning
link |
01:33:43.480
is it's also a chicken and egg problem
link |
01:33:45.160
because when you look at an image,
link |
01:33:47.120
to ask a good question about the image,
link |
01:33:48.640
you need to understand something about the image.
link |
01:33:50.880
You can't ask a completely arbitrarily random question.
link |
01:33:53.440
It may not even apply to that particular image.
link |
01:33:55.480
So there is some amount of understanding or knowledge
link |
01:33:57.600
that basically keeps getting built
link |
01:33:59.160
when you're doing active learning.
link |
01:34:01.280
So I think active learning by itself is really good.
link |
01:34:04.560
And the main thing we need to figure out is basically
link |
01:34:07.240
how do we come up with a technique
link |
01:34:09.600
to first model what the model knows
link |
01:34:13.320
and also model what the model does not know.
link |
01:34:16.000
I think that's the sort of beauty of it.
link |
01:34:18.360
Because when you know that there are certain things
link |
01:34:20.480
that you don't know anything about,
link |
01:34:22.120
asking a question about those concepts
link |
01:34:23.640
is actually going to bring you the most value.
link |
01:34:26.480
And I think that's the sort of key challenge.
link |
01:34:28.360
Now, self supervised learning by itself,
link |
01:34:29.960
like selecting data for it and so on,
link |
01:34:31.480
that's actually really useful.
link |
01:34:32.640
But I think that's a very narrow view
link |
01:34:33.960
of looking at active learning.
link |
01:34:35.080
If you look at it more broadly,
link |
01:34:36.360
it is basically about if the model has knowledge
link |
01:34:40.040
about N concepts,
link |
01:34:41.400
and it is weak basically about certain things.
link |
01:34:43.840
So it needs to ask questions
link |
01:34:45.280
either to discover new concepts
link |
01:34:46.880
or to basically increase its knowledge
link |
01:34:49.200
about these N concepts.
link |
01:34:50.400
So at that level, it's a very powerful technique.
link |
01:34:53.200
I actually do think it's going to be really useful.
link |
01:34:56.520
Even in like simple things such as like data labeling,
link |
01:34:59.040
it's super useful.
link |
01:35:00.240
So here is like one simple way
link |
01:35:02.920
that you can use active learning.
link |
01:35:04.280
For example, you have your self supervised model,
link |
01:35:06.880
which is very good at predicting similarities
link |
01:35:08.760
and dissimilarities between things.
link |
01:35:10.760
And so if you label a picture as basically say a banana,
link |
01:35:15.480
now you know that all the images
link |
01:35:17.720
that are very similar to this image
link |
01:35:19.200
are also likely to contain bananas.
link |
01:35:21.480
So probably when you want to understand
link |
01:35:24.240
what else is a banana,
link |
01:35:25.160
you're not going to use these other images.
link |
01:35:26.880
You're actually going to use an image
link |
01:35:28.160
that is not completely dissimilar,
link |
01:35:31.120
but somewhere in between,
link |
01:35:32.320
which is not super similar to this image,
link |
01:35:33.840
but not super dissimilar either.
link |
01:35:35.640
And that's going to tell you a lot more
link |
01:35:37.120
about what this concept of a banana is.
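A sketch of that heuristic with self supervised embeddings (the pool size and thresholds are made up; the point is only the in-between band):

    import torch
    import torch.nn.functional as F

    # self supervised embeddings for an unlabeled pool, plus one labeled banana image
    pool = F.normalize(torch.randn(1000, 128), dim=1)
    banana = F.normalize(torch.randn(1, 128), dim=1)

    sim = (pool @ banana.T).squeeze(1)    # cosine similarity to the labeled image

    # very similar images are probably also bananas, so labeling them adds little;
    # the informative ones sit in between (thresholds are purely illustrative)
    probably_banana = (sim > 0.9).nonzero().squeeze(1)
    worth_asking = ((sim > 0.5) & (sim < 0.9)).nonzero().squeeze(1)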
link |
01:35:39.520
So that's kind of a heuristic.
link |
01:35:41.840
I wonder if it's possible to also learn ways
link |
01:35:46.840
to discover the most likely,
link |
01:35:50.640
the most beneficial image.
link |
01:35:52.880
So like, so not just looking at a thing
link |
01:35:54.920
that's somewhat similar to a banana,
link |
01:35:58.360
but not exactly similar,
link |
01:35:59.920
but have some kind of more complicated learning system,
link |
01:36:03.480
like learned discovering mechanism
link |
01:36:07.000
that tells you what image to look for.
link |
01:36:09.360
Like how, yeah, like actually in a self supervised way,
link |
01:36:14.240
learning strictly a function that says,
link |
01:36:17.160
is this image going to be very useful to me
link |
01:36:20.440
given what I currently know?
link |
01:36:22.000
I think there's a lot of synergy there.
link |
01:36:23.880
It's just, I think, yeah, it's going to be explored.
link |
01:36:27.520
I think very much related to that.
link |
01:36:29.240
I kind of think of what Tesla Autopilot is doing
link |
01:36:33.480
currently as kind of active learning.
link |
01:36:36.720
There's something that Andrej Karpathy and their team
link |
01:36:39.120
are calling a data engine.
link |
01:36:41.440
So you're basically deploying a bunch of instantiations
link |
01:36:45.640
of a neural network into the wild,
link |
01:36:47.920
and they're collecting a bunch of edge cases
link |
01:36:50.640
that are then sent back for annotation,
link |
01:36:53.920
with edge cases defined as near failures
link |
01:36:56.680
or some weirdness on a particular task
link |
01:36:59.960
that's then sent back.
link |
01:37:01.400
It's the not-exactly-a-banana
link |
01:37:04.000
but almost-a-banana cases sent back for annotation.
link |
01:37:07.200
And then there's this loop that keeps going
link |
01:37:09.200
and you keep retraining and retraining.
link |
01:37:11.600
And the active learning step there,
link |
01:37:13.280
or whatever you want to call it,
link |
01:37:14.800
is the cars themselves that are sending you back the data.
link |
01:37:19.120
Like, what the hell happened here?
link |
01:37:20.760
This was weird.
link |
01:37:22.840
What are your thoughts about that sort of deployment
link |
01:37:26.440
of neural networks in the wild?
link |
01:37:28.240
Another way to ask the question, but first, your thoughts.
link |
01:37:31.360
And maybe if you want to comment,
link |
01:37:33.840
are there applications for autonomous driving,
link |
01:37:36.960
like computer vision based autonomous driving,
link |
01:37:40.160
applications of self supervised learning
link |
01:37:42.040
in the context of computer vision based autonomous driving?
link |
01:37:47.520
So I think so.
link |
01:37:48.360
I think for self supervised learning
link |
01:37:49.560
to be used in autonomous driving,
link |
01:37:50.800
there are lots of opportunities.
link |
01:37:51.800
I mean, just like pure consistency in predictions
link |
01:37:54.880
is one way, right?
link |
01:37:55.840
So because you have this nice sequence of data
link |
01:38:00.280
that is coming in, a video stream of it,
link |
01:38:02.320
associated of course with the actions
link |
01:38:04.040
that say the car took,
link |
01:38:05.240
you can form a very nice predictive model
link |
01:38:07.640
of what's happening.
link |
01:38:08.480
So for example,
link |
01:38:11.400
one way possibly in which they're figuring out
link |
01:38:14.440
what data to get labeled is basically
link |
01:38:15.880
through prediction uncertainty, right?
link |
01:38:17.440
So you predict that the car was going to turn right.
link |
01:38:20.360
So this was the action that was going to happen,
link |
01:38:21.840
say in the shadow mode.
link |
01:38:23.080
And now the driver turned left.
link |
01:38:24.640
And this is a really big surprise.
link |
01:38:27.160
So basically by forming these good predictive models,
link |
01:38:30.120
you are, I mean, these are kind of self supervised models.
link |
01:38:32.840
Prediction models are basically being trained
link |
01:38:34.600
just by looking at what's going to happen next
link |
01:38:36.800
and asking them to predict what's going to happen next.
link |
01:38:38.960
So I would say this is really like one use
link |
01:38:40.720
of self supervised learning.
link |
01:38:42.320
It's a predictive model
link |
01:38:43.440
and you're learning a predictive model
link |
01:38:44.680
basically just by looking at what data you have.
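A sketch of that surprise based selection (hypothetical names and threshold; the policy model and frame features are stand-ins for whatever the real system uses):

    import torch
    import torch.nn.functional as F

    # a predictive model runs in "shadow mode": it predicts the next action
    # (say, left / straight / right) without actually controlling the car
    def surprise(policy, observation, driver_action):
        with torch.no_grad():
            probs = F.softmax(policy(observation), dim=-1)
        # low probability assigned to what the human actually did = big surprise
        return -probs[0, driver_action].clamp_min(1e-9).log().item()

    policy = torch.nn.Linear(64, 3)     # stand-in for the real predictive model
    frame = torch.randn(1, 64)          # stand-in for the sensor features
    driver_action = 0                   # e.g. the driver actually turned left

    # clips whose surprise crosses a (made-up) threshold get sent back for labeling
    if surprise(policy, frame, driver_action) > 2.0:
        print("flag this clip for annotation")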
link |
01:38:46.880
Is there something about that active learning context
link |
01:38:49.600
that you find insights from?
link |
01:38:53.000
Like that kind of deployment of the system,
link |
01:38:54.760
seeing cases where it doesn't perform as you expected
link |
01:38:59.120
and then retraining the system based on that?
link |
01:39:01.000
I think that, I mean, that really resonates with me.
link |
01:39:03.600
It's super smart to do it that way.
link |
01:39:05.560
Because I mean, the thing is with any kind
link |
01:39:08.520
of like practical system, like autonomous driving,
link |
01:39:11.160
there are those edge cases that are the things
link |
01:39:13.040
that are actually the problem, right?
link |
01:39:14.520
I mean, highway driving or like freeway driving
link |
01:39:17.440
has basically been like,
link |
01:39:19.120
there has been a lot of success in that particular part
link |
01:39:21.120
of autonomous driving for a long time.
link |
01:39:22.840
I would say like since the eighties or something.
link |
01:39:25.560
Now the point is all these failure cases
link |
01:39:28.000
are the sort of reason why autonomous driving
link |
01:39:30.600
hasn't become like super, super mainstream and available
link |
01:39:33.800
like in every possible car right now.
link |
01:39:35.640
And so basically by really scaling this problem out
link |
01:39:38.200
by really trying to get all of these edge cases out
link |
01:39:40.440
as quickly as possible,
link |
01:39:41.880
and then just like using those to improve your model,
link |
01:39:43.920
that's super smart.
link |
01:39:45.640
And prediction uncertainty to do that
link |
01:39:47.120
is like one really nice way of doing it.
link |
01:39:49.800
Let me put you on the spot.
link |
01:39:52.040
So we mentioned offline Jitendra,
link |
01:39:55.240
he thinks that the Tesla computer vision approach
link |
01:39:58.240
or really any approach for autonomous driving
link |
01:40:00.800
is very far away.
link |
01:40:02.680
How many years away,
link |
01:40:05.440
if you have to bet all your money on it,
link |
01:40:06.960
are we to solving autonomous driving
link |
01:40:09.600
with this kind of computer vision only
link |
01:40:12.000
machine learning based approach?
link |
01:40:13.600
Okay, so what does solving autonomous driving mean?
link |
01:40:15.400
Does it mean solving it in the US?
link |
01:40:17.200
Does it mean solving it in India?
link |
01:40:18.480
Because I can tell you
link |
01:40:19.320
that very different types of driving are happening.
link |
01:40:21.200
Not India, not Russia.
link |
01:40:23.800
In the United States, autonomous,
link |
01:40:26.200
so what solving means is when the car says it has control,
link |
01:40:31.880
it is fully liable.
link |
01:40:34.040
You can go to sleep, it's driving by itself.
link |
01:40:37.800
So this is highway and city driving,
link |
01:40:39.720
but not everywhere, but mostly everywhere.
link |
01:40:42.280
And it's, let's say significantly better,
link |
01:40:45.040
like say five times fewer accidents than humans.
link |
01:40:50.480
Sufficiently safe such that the public feels
link |
01:40:53.960
like that transition is enticing, beneficial
link |
01:40:57.960
both for our safety and financially
link |
01:40:59.480
and all those kinds of things.
link |
01:41:01.040
Okay, so first disclaimer,
link |
01:41:02.240
I'm not an expert in autonomous driving.
link |
01:41:04.200
So let me put it out there.
link |
01:41:05.920
I would say like at least five to 10 years.
link |
01:41:09.360
This would be my guess from now.
link |
01:41:12.920
Yeah, I'm actually very impressed.
link |
01:41:14.640
Like when I sat in a friend's Tesla recently
link |
01:41:16.760
and of course, like looking on that screen,
link |
01:41:20.600
it basically shows all the detections and everything
link |
01:41:22.800
the car is doing as you're driving by,
link |
01:41:24.640
and that's super distracting for me as a person
link |
01:41:26.880
because all I keep looking at is like the bounding boxes
link |
01:41:29.440
in the cars it's tracking and it's really impressive.
link |
01:41:31.760
Like especially when it's raining and it's able to do that,
link |
01:41:34.280
that was the most impressive part for me.
link |
01:41:36.000
It's actually able to get through rain and do that.
link |
01:41:38.520
And one of the things like a lot of us believed,
link |
01:41:41.720
and I would put myself in that category
link |
01:41:44.040
is that LIDAR based sort of technology for autonomous driving
link |
01:41:47.680
was the key driver, right?
link |
01:41:48.720
So Waymo was using it for the longest time.
link |
01:41:50.960
And Tesla then decided to go this completely other route
link |
01:41:53.280
that we are not going to even use LIDAR.
link |
01:41:55.760
So their initial system I think was camera and radar based
link |
01:41:58.720
and now they're actually moving
link |
01:41:59.640
to a completely like vision based system.
link |
01:42:02.000
And so that was just like, it sounded completely crazy.
link |
01:42:04.640
Like LIDAR is very useful in cases
link |
01:42:07.000
where you have low visibility.
link |
01:42:09.240
Of course it comes with its own set of complications.
link |
01:42:11.720
But now to see that happen on a live Tesla,
link |
01:42:15.160
that basically just proves everyone wrong
link |
01:42:16.960
I would say in a way.
link |
01:42:18.120
And that's just working really well.
link |
01:42:20.520
I think there were also like a lot of advancements
link |
01:42:22.720
in camera technology.
link |
01:42:23.920
Now there were like, I know at CMU when I was there
link |
01:42:26.280
there was a particular kind of camera
link |
01:42:27.960
that had been developed that was really good
link |
01:42:30.040
at basically low visibility setting.
link |
01:42:32.760
So like lots of snow and lots of rain
link |
01:42:34.400
it could actually still have a very reasonable visibility.
link |
01:42:37.640
And I think there are lots of these kinds of innovations
link |
01:42:39.360
that will happen on the sensor side itself
link |
01:42:40.960
which is actually going to make this very easy
link |
01:42:42.840
in the future.
link |
01:42:43.840
And so maybe that's actually why I'm more optimistic
link |
01:42:46.080
about vision based self, like autonomous driving.
link |
01:42:49.000
I was going to call it self supervised driving, but.
link |
01:42:51.960
Vision based autonomous driving.
link |
01:42:53.520
That's the reason I'm quite optimistic about it
link |
01:42:55.480
because I think there are going to be lots
link |
01:42:56.640
of these advances on the sensor side itself.
link |
01:42:58.960
So acquiring this data
link |
01:43:00.720
we're actually going to get much better about it.
link |
01:43:02.640
And then of course, once we're able to scale out
link |
01:43:05.080
and get all of these edge cases in
link |
01:43:06.800
as like Andrej described
link |
01:43:08.720
I think that's going to make us go very far away.
link |
01:43:11.720
Yeah, so it's funny.
link |
01:43:13.560
I'm very much with you on the five to 10 years
link |
01:43:16.280
maybe 10 years
link |
01:43:17.840
but you made it, I'm not sure how you made it sound
link |
01:43:21.760
but for some people
link |
01:43:23.640
that might seem like really far away.
link |
01:43:25.360
And then for other people, it might seem like very close.
link |
01:43:30.440
There's a lot of fundamental questions
link |
01:43:32.320
about how much game theory is in this whole thing.
link |
01:43:36.880
So like, how much is this simply a collision avoidance
link |
01:43:41.160
problem and how much of it is you still interacting
link |
01:43:45.200
with other humans in the scene
link |
01:43:46.960
and you're trying to create an experience
link |
01:43:48.800
that's compelling.
link |
01:43:49.640
So you want to get from point A to point B quickly
link |
01:43:53.080
you want to navigate the scene in a safe way
link |
01:43:55.280
but you also want to show some level of aggression
link |
01:43:58.480
because well, certainly this is why you're screwed in India
link |
01:44:02.000
because you have to show aggression.
link |
01:44:03.320
Or Jersey or New Jersey.
link |
01:44:04.840
Or Jersey, right.
link |
01:44:05.680
So like, or New York, or basically any major city.
link |
01:44:11.200
But I think it's probably Elon
link |
01:44:13.240
that I talked the most about this with,
link |
01:44:14.800
and what's a surprise is the level to which
link |
01:44:17.720
they're not considering human beings
link |
01:44:20.080
as a huge problem in this, as a source of problems.
link |
01:44:22.960
Like the driving is fundamentally a robot on robot
link |
01:44:29.000
versus the environment problem
link |
01:44:31.160
versus like you can just consider humans
link |
01:44:33.960
not part of the problem.
link |
01:44:35.160
I used to think humans almost certainly
link |
01:44:38.840
have to be modeled really well.
link |
01:44:41.200
Pedestrians and cyclists and humans inside other cars
link |
01:44:44.360
you have to have like mental models for them.
link |
01:44:46.320
You cannot just see it as objects
link |
01:44:48.280
but more and more it's like,
link |
01:44:51.400
it's the same kind of intuition breaking thing
link |
01:44:53.720
that self supervised learning does, which is,
link |
01:44:57.000
well maybe through the learning
link |
01:44:58.840
you'll get all the human information you need.
link |
01:45:04.080
Right?
link |
01:45:04.920
Like maybe you'll get it just with enough data.
link |
01:45:07.760
You don't need to have explicit good models
link |
01:45:09.680
of human behavior.
link |
01:45:10.800
Maybe you get it through the data.
link |
01:45:12.120
So, I mean my skepticism also comes from just knowing
link |
01:45:14.640
a lot of automotive companies
link |
01:45:16.360
and how difficult it is to be innovative.
link |
01:45:18.600
I was skeptical that they would be able at scale
link |
01:45:22.560
to convert the driving scene across the world
link |
01:45:27.400
into digital form such that you can create
link |
01:45:30.640
this data engine at scale.
link |
01:45:33.160
And the fact that Tesla is at least getting there
link |
01:45:36.640
or is already there makes me think that
link |
01:45:41.640
it's now starting to be coupled
link |
01:45:43.680
to this self supervised learning vision
link |
01:45:47.600
which is like if that's gonna work
link |
01:45:49.840
if through purely this process you can get really far
link |
01:45:52.920
then maybe you can solve driving that way.
link |
01:45:54.880
I don't know.
link |
01:45:55.720
I tend to believe we don't give enough credit
link |
01:46:00.000
to how amazing humans are both at driving
link |
01:46:05.920
and at supervising autonomous systems.
link |
01:46:09.360
And also, and this is something I wish for,
link |
01:46:13.200
I wish there was much more driver sensing inside Teslas
link |
01:46:17.120
and much deeper consideration of human factors
link |
01:46:21.200
like understanding psychology and drowsiness
link |
01:46:24.680
and all those kinds of things
link |
01:46:26.200
when the car does more and more of the work.
link |
01:46:28.720
How to keep utilizing the little human supervision
link |
01:46:32.960
that is needed to keep this whole thing safe.
link |
01:46:35.080
I mean it's a fascinating dance of human robot interaction.
link |
01:46:38.440
To me autonomous driving for a long time
link |
01:46:42.120
is a human robot interaction problem.
link |
01:46:45.040
It is not a robotics problem or computer vision problem.
link |
01:46:48.040
Like you have to have a human in the loop.
link |
01:46:50.000
But so which is why I think it's 10 years plus.
link |
01:46:53.320
But I do think there'll be a bunch of cities and contexts
link |
01:46:56.280
where geo restricted it will work really, really damn well.
link |
01:47:02.360
So I think for me that gets to five if I'm being optimistic
link |
01:47:05.000
and it's going to be five for a lot of cases
link |
01:47:07.360
and 10 plus, yeah, I agree with you.
link |
01:47:09.200
10 plus basically if we want to cover most of the,
link |
01:47:13.120
say, contiguous United States or something.
link |
01:47:15.240
Oh, interesting.
link |
01:47:16.080
So my optimistic is five and pessimistic is 30.
link |
01:47:20.280
30.
link |
01:47:21.120
I have a long tail on this one.
link |
01:47:22.480
I haven't watched enough driving videos.
link |
01:47:24.440
I've watched enough pedestrians to think like we may be,
link |
01:47:29.160
like there's a small part of me still, not a small,
link |
01:47:31.680
like a pretty big part of me that thinks
link |
01:47:34.360
we will have to build AGI to solve driving.
link |
01:47:37.520
Oh, well.
link |
01:47:38.440
Like there's something to me,
link |
01:47:39.640
like because humans are part of the picture,
link |
01:47:41.800
deeply part of the picture,
link |
01:47:44.000
and also human society is part of the picture
link |
01:47:46.080
in that human life is at stake.
link |
01:47:47.920
Anytime a robot kills a human,
link |
01:47:50.840
it's not clear to me that that's not a problem
link |
01:47:54.280
that machine learning will also have to solve.
link |
01:47:56.360
Like it has to, you have to integrate that
link |
01:47:59.400
into the whole thing.
link |
01:48:00.240
Just like Facebook or social networks,
link |
01:48:03.280
one thing is to say how to make
link |
01:48:04.600
a really good recommender system.
link |
01:48:06.720
And then the other thing is to integrate
link |
01:48:08.640
into that recommender system,
link |
01:48:10.240
all the journalists that will write articles
link |
01:48:12.080
about that recommender system.
link |
01:48:13.880
Like you have to consider the society
link |
01:48:15.880
within which the AI system operates.
link |
01:48:18.400
And in order to, and like politicians too,
link |
01:48:21.000
this is the regulatory stuff for autonomous driving.
link |
01:48:24.200
It's kind of fascinating that the more successful
link |
01:48:26.720
your AI system becomes,
link |
01:48:28.720
the more it gets integrated in society
link |
01:48:31.600
and the more precious politicians
link |
01:48:33.560
and the public and the clickbait journalists
link |
01:48:36.000
and all the different fascinating forces
link |
01:48:38.040
of our society start acting on it.
link |
01:48:40.360
And then it's no longer how good you are
link |
01:48:42.240
at doing the initial task.
link |
01:48:43.960
It's also how good you are at navigating human nature,
link |
01:48:47.000
which is a fascinating space.
link |
01:48:49.920
What do you think are the limits of deep learning?
link |
01:48:52.600
If you allow me, we'll zoom out a little bit
link |
01:48:54.800
into the big question of artificial intelligence.
link |
01:48:58.120
You said dark matter of intelligence is self supervised
link |
01:49:02.080
learning, but there could be more.
link |
01:49:04.320
What do you think the limits of self supervised learning
link |
01:49:07.760
and just learning in general, deep learning are?
link |
01:49:10.720
I think like for deep learning in particular,
link |
01:49:12.680
because self supervised learning is I would say
link |
01:49:14.640
a little bit more vague right now.
link |
01:49:16.800
So I wouldn't, like for something that's so vague,
link |
01:49:18.680
it's hard to predict what its limits are going to be.
link |
01:49:21.960
But like I said, I think anywhere you want to interact
link |
01:49:25.240
with humans, self supervised learning kind of hits a boundary
link |
01:49:27.920
very quickly because you need to have an interface
link |
01:49:29.960
to be able to communicate with the human.
link |
01:49:31.600
So really like if you have just like vacuous concepts
link |
01:49:35.040
or like just like nebulous concepts discovered
link |
01:49:37.360
by a network, it's very hard to communicate those
link |
01:49:39.920
with the human without like inserting some kind
link |
01:49:41.760
of human knowledge or some kind of like human bias there.
link |
01:49:45.600
In general, I think for deep learning,
link |
01:49:47.040
the biggest challenge is just like data efficiency.
link |
01:49:50.680
Even with self supervised learning,
link |
01:49:52.600
even with anything else, if you just see
link |
01:49:54.920
a single concept once, like one image of like,
link |
01:49:59.280
I don't know, whatever you want to call it,
link |
01:50:01.200
like any concept, it's really hard for these methods
link |
01:50:03.840
to generalize by looking at just one or two samples
link |
01:50:07.040
of things and that has been a real challenge.
link |
01:50:09.760
I think that's actually why like these edge cases,
link |
01:50:11.680
for example, for Tesla are actually that important.
link |
01:50:14.520
Because if you see just one instance of the car failing
link |
01:50:18.040
and if you just annotate that and you get that
link |
01:50:20.280
into your data set, you have like very limited guarantee
link |
01:50:23.560
that it's not going to happen again.
link |
01:50:25.160
And you're actually going to be able to recognize
link |
01:50:26.720
this kind of instance in a very different scenario.
link |
01:50:28.640
So like when it was snowing, so you got that thing labeled
link |
01:50:31.400
when it was snowing, but now when it's raining,
link |
01:50:33.240
you're actually not able to get it.
link |
01:50:34.640
Or you basically have the same scenario
link |
01:50:36.600
in a different part of the world.
link |
01:50:37.440
So the lighting was different or so on.
link |
01:50:39.120
So it's just really hard for these models,
link |
01:50:41.000
like deep learning especially to do that.
link |
01:50:42.720
What's your intuition?
link |
01:50:43.560
How do we solve handwritten digit recognition problem
link |
01:50:47.800
when we only have one example for each number?
link |
01:50:51.200
It feels like humans are using something like transfer learning.
link |
01:50:54.720
Right.
link |
01:50:55.560
I think we are good at transferring knowledge a little bit.
link |
01:50:59.240
We are just better at like for a lot of these problems
link |
01:51:02.640
where we are generalizing from a single sample
link |
01:51:04.840
or recognizing from a single sample,
link |
01:51:06.960
we are using a lot of our own domain knowledge
link |
01:51:08.760
and a lot of our like inductive bias
link |
01:51:10.320
into that one sample to generalize it.
link |
01:51:12.280
So I've never seen you write the number nine, for example.
link |
01:51:15.320
And if you were to write it, I would still get it.
link |
01:51:17.440
And if you were to write a different kind of alphabet
link |
01:51:19.280
and like write it in two different ways,
link |
01:51:20.840
I would still probably be able to figure out
link |
01:51:22.360
that these are the same two characters.
link |
01:51:24.720
It's just that I have been very used
link |
01:51:26.320
to seeing handwritten digits in my life.
link |
01:51:29.080
The other sort of problem with any deep learning system
link |
01:51:31.360
or any kind of machine learning system is like,
link |
01:51:33.080
it's guarantees, right?
link |
01:51:34.200
There are no guarantees for it.
link |
01:51:35.880
Now you can argue that humans also don't have any guarantees.
link |
01:51:38.200
Like there is no guarantee that I can recognize a cat
link |
01:51:41.160
in every scenario.
link |
01:51:42.280
I'm sure there are going to be lots of cats
link |
01:51:43.920
that I don't recognize, lots of scenarios
link |
01:51:45.720
in which I don't recognize cats in general.
link |
01:51:48.120
But I think from just a sort of application perspective,
link |
01:51:52.840
you do need guarantees, right?
link |
01:51:54.760
We call these things algorithms.
link |
01:51:56.960
Now algorithms, like traditional CS algorithms
link |
01:51:59.080
have guarantees.
link |
01:51:59.960
Sorting is a guarantee.
link |
01:52:01.480
If you were to call sort on a particular array of numbers,
link |
01:52:05.600
you are guaranteed that it's going to be sorted.
link |
01:52:07.640
Otherwise it's a bug.
link |
01:52:09.320
Now for machine learning,
link |
01:52:10.160
it's very hard to characterize this.
link |
01:52:12.440
We know for a fact that a cat recognition model
link |
01:52:15.440
is not going to recognize cats,
link |
01:52:17.040
every cat in the world in every circumstance.
link |
01:52:19.720
I think most people would agree with that statement,
link |
01:52:22.040
but we are still okay with it.
link |
01:52:23.600
We still don't call this a bug.
link |
01:52:25.400
Whereas in traditional computer science
link |
01:52:26.720
or traditional science,
link |
01:52:27.840
like if you have this kind of failure case existing,
link |
01:52:29.960
then you think of it as like something is wrong.
link |
01:52:33.160
I think there is this sort of notion
link |
01:52:34.520
of nebulous correctness for machine learning.
link |
01:52:37.000
And that's something we just need to be very comfortable
link |
01:52:38.840
with.
link |
01:52:39.680
And for deep learning,
link |
01:52:40.520
or like for a lot of these machine learning algorithms,
link |
01:52:42.680
it's not clear how do we characterize
link |
01:52:44.680
this notion of correctness.
link |
01:52:46.320
I think it's a limitation in our understanding,
link |
01:52:48.120
or at least a limitation in our phrasing of this.
link |
01:52:51.160
And if we were to come up with better ways
link |
01:52:53.080
to understand this limitation,
link |
01:52:55.040
then it would actually help us a lot.
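A toy illustration of that contrast between hard guarantees and statistical correctness, with a made-up threshold "recognizer" standing in for a learned model: the sorting routine admits a postcondition you can assert on every input, while the model only admits an aggregate measure like accuracy.

```python
import random

def is_sorted(xs):
    return all(a <= b for a, b in zip(xs, xs[1:]))

data = [random.randint(0, 100) for _ in range(1000)]
assert is_sorted(sorted(data))  # holds for every input; a violation would be a bug

# Hypothetical stand-in "cat recognizer": a noisy threshold rule on a 1-D score.
def recognizer(score):
    return score > 0.5

examples = [(s, (s + random.gauss(0, 0.3)) > 0.5) for s in
            (random.random() for _ in range(1000))]
acc = sum(recognizer(s) == label for s, label in examples) / len(examples)
print(f"accuracy ~ {acc:.2f}")  # no assertion that it is right on every single example
```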
link |
01:52:57.160
Do you think there's a distinction
link |
01:52:58.840
between the concept of learning
link |
01:53:01.800
and the concept of reasoning?
link |
01:53:04.240
Do you think it's possible for neural networks to reason?
link |
01:53:10.280
So I think of it slightly differently.
link |
01:53:11.680
So for me, learning is whenever
link |
01:53:14.520
I can like make a snap judgment.
link |
01:53:16.040
So if you show me a picture of a dog,
link |
01:53:17.200
I can immediately say it's a dog.
link |
01:53:18.880
But if you give me like a puzzle,
link |
01:53:20.680
like whatever, a Rube Goldberg machine
link |
01:53:23.480
of like things going to happen,
link |
01:53:24.960
then I have to reason because I've never,
link |
01:53:26.440
it's a very complicated setup.
link |
01:53:27.600
I've never seen that particular setup.
link |
01:53:29.280
And I really need to draw and like imagine in my head
link |
01:53:32.200
what's going to happen to figure it out.
link |
01:53:34.640
So I think, yes, neural networks are really good
link |
01:53:36.840
at recognition, but they're not very good at reasoning.
link |
01:53:41.160
Because they have seen something before
link |
01:53:44.120
or seen something similar before, they're very good
link |
01:53:46.360
at making those sort of snap judgments.
link |
01:53:48.240
But if you were to give them a very complicated thing
link |
01:53:50.680
that they've not seen before,
link |
01:53:52.480
they have very limited ability right now
link |
01:53:55.320
to compose different things.
link |
01:53:56.560
Like, oh, I've seen this particular part before.
link |
01:53:58.240
I've seen this particular part before.
link |
01:54:00.040
And now probably like this is how
link |
01:54:01.400
they're going to work in tandem.
link |
01:54:02.920
It's very hard for them to come up
link |
01:54:04.160
with these kinds of things.
link |
01:54:05.200
Well, there's a certain aspect to reasoning
link |
01:54:08.800
that you can maybe convert into the process of programming.
link |
01:54:11.880
And so there's the whole field of program synthesis
link |
01:54:14.320
and people have been applying machine learning
link |
01:54:17.240
to the problem of program synthesis.
link |
01:54:18.920
And the question is, can they, the step of composition,
link |
01:54:22.680
why can't that be learned?
link |
01:54:25.280
You know, this step of like building things,
link |
01:54:29.400
like little intuitions, concepts on top of each other,
link |
01:54:33.200
can that be learnable?
link |
01:54:35.280
What's your intuition there?
link |
01:54:36.800
Or like, I guess similar set of techniques,
link |
01:54:39.440
do you think that will be applicable?
link |
01:54:42.040
So I think it is, of course, it is learnable
link |
01:54:44.640
because like we are prime examples of machines
link |
01:54:47.080
that have like, or individuals that have learned this, right?
link |
01:54:49.480
Like humans have learned this.
link |
01:54:51.080
So it is, of course, it is a technique
link |
01:54:52.760
that is very easy to learn.
link |
01:54:55.840
I think where we are kind of hitting a wall
link |
01:54:58.400
basically with like current machine learning
link |
01:55:00.480
is the fact that when the network learns
link |
01:55:03.400
all of this information,
link |
01:55:04.640
we basically are not able to figure out
link |
01:55:07.480
how well it's going to generalize to an unseen thing.
link |
01:55:10.640
And we have no, like a priori, no way of characterizing that.
link |
01:55:15.040
And I think that's basically telling us a lot about,
link |
01:55:18.480
like a lot about the fact that we really don't know
link |
01:55:20.720
what this model has learned and how well it's basically,
link |
01:55:22.760
because we don't know how well it's going to transfer.
link |
01:55:25.120
There's also a sense in which it feels like
link |
01:55:28.080
we humans may not be aware of how much like background,
link |
01:55:34.400
how good our background model is,
link |
01:55:36.760
how much knowledge we just have slowly building
link |
01:55:39.880
on top of each other.
link |
01:55:41.400
It feels like neural networks
link |
01:55:42.480
are constantly throwing stuff out.
link |
01:55:43.840
Like you'll do some incredible thing
link |
01:55:45.360
where you're learning a particular task in computer vision,
link |
01:55:49.040
you celebrate your state of the art successes
link |
01:55:51.240
and you throw that out.
link |
01:55:52.720
Like, it feels like it's,
link |
01:55:54.240
you're never using stuff you've learned
link |
01:55:56.720
for your future successes in other domains.
link |
01:56:00.080
And humans are obviously doing that exceptionally well,
link |
01:56:03.240
still throwing stuff away in their mind,
link |
01:56:05.840
but keeping certain kernels of truth.
link |
01:56:07.840
Right, so I think we're like,
link |
01:56:09.200
continual learning is sort of the paradigm
link |
01:56:11.080
for this in machine learning.
link |
01:56:11.920
And I don't think it's a very well explored paradigm.
link |
01:56:15.160
We have like things in deep learning, for example,
link |
01:56:17.440
catastrophic forgetting is like one of the standard things.
link |
01:56:20.160
The thing basically being that if you teach a network
link |
01:56:23.120
like to recognize dogs,
link |
01:56:24.760
and now you teach that same network to recognize cats,
link |
01:56:27.400
it basically forgets how to recognize dogs.
link |
01:56:29.040
So it forgets very quickly.
link |
01:56:30.800
I mean, and whereas a human,
link |
01:56:32.520
if you were to teach someone to recognize dogs
link |
01:56:34.560
and then to recognize cats,
link |
01:56:35.880
they don't forget immediately how to recognize these dogs.
link |
01:56:38.440
I think that's basically sort of what you're trying to get.
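A minimal PyTorch sketch of the catastrophic forgetting effect described above: train a small classifier on one synthetic task, then keep training it only on a second task, and watch accuracy on the first task collapse. The toy model, the synthetic tasks, and the hyperparameters are illustrative assumptions, not anyone's actual setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_task(center):
    # Hypothetical toy "task": separate two Gaussian clusters in 2-D.
    x0 = torch.randn(200, 2) + torch.tensor(center)
    x1 = torch.randn(200, 2) - torch.tensor(center)
    x = torch.cat([x0, x1])
    y = torch.cat([torch.zeros(200, dtype=torch.long),
                   torch.ones(200, dtype=torch.long)])
    return x, y

def train(model, x, y, steps=300):
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.cross_entropy(model(x), y).backward()
        opt.step()

def accuracy(model, x, y):
    return (model(x).argmax(dim=1) == y).float().mean().item()

model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))

xa, ya = make_task([3.0, 3.0])    # "task A", say dogs
xb, yb = make_task([3.0, -3.0])   # "task B", say cats

train(model, xa, ya)
print("task A accuracy after training on A:", accuracy(model, xa, ya))

train(model, xb, yb)  # continue training on task B only
print("task A accuracy after training on B:", accuracy(model, xa, ya))  # typically drops sharply
print("task B accuracy after training on B:", accuracy(model, xb, yb))
```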
link |
01:56:40.640
Yeah, I just, I wonder if like
link |
01:56:42.400
the long term memory mechanisms
link |
01:56:44.720
or the mechanisms that store not just memories,
link |
01:56:47.080
but concepts that allow you to then reason
link |
01:56:54.240
and compose concepts,
link |
01:56:57.200
if those things will look very different
link |
01:56:59.000
than neural networks,
link |
01:56:59.880
or if you can do that within a single neural network
link |
01:57:02.320
with some particular sort of architecture quirks,
link |
01:57:06.040
that seems to be a really open problem.
link |
01:57:07.720
And of course I go up and down on that
link |
01:57:09.440
because there's something so compelling to the symbolic AI
link |
01:57:14.840
or to the ideas of logic based sort of expert systems.
link |
01:57:20.320
You have like human interpretable facts
link |
01:57:22.440
that build on top of each other.
link |
01:57:24.080
It's really annoying like with self supervised learning
link |
01:57:27.800
that the AI is not very explainable.
link |
01:57:31.120
Like you can't like understand
link |
01:57:33.360
all the beautiful things it has learned.
link |
01:57:35.520
You can't ask it like questions,
link |
01:57:38.400
but then again, maybe that's a stupid thing
link |
01:57:40.960
for us humans to want.
link |
01:57:42.440
Right, I think whenever we try to like understand it,
link |
01:57:45.240
we are putting our own subjective human bias into it.
link |
01:57:47.840
Yeah.
link |
01:57:48.680
And I think that's the sort of problem
link |
01:57:50.000
with self supervised learning,
link |
01:57:51.000
the goal is that it should learn naturally from the data.
link |
01:57:54.280
So now if you try to understand it,
link |
01:57:55.520
you are using your own preconceived notions
link |
01:57:58.640
of what this model has learned.
link |
01:58:00.600
And that's the problem.
link |
01:58:03.480
High level question.
link |
01:58:04.640
What do you think it takes to build a system
link |
01:58:07.920
with superhuman, maybe let's say human level
link |
01:58:10.520
or superhuman level general intelligence?
link |
01:58:13.520
We've already kind of started talking about this,
link |
01:58:15.560
but what's your intuition?
link |
01:58:17.760
Like, does this thing have to have a body?
link |
01:58:20.760
Does it have to interact richly with the world?
link |
01:58:25.400
Does it have to have some more human elements
link |
01:58:27.920
like self awareness?
link |
01:58:30.480
I think emotion.
link |
01:58:32.240
I think emotion is something which is like,
link |
01:58:35.720
it's not really attributed typically
link |
01:58:37.520
in standard machine learning.
link |
01:58:38.440
It's not something we think about,
link |
01:58:39.560
like there is NLP, there is vision,
link |
01:58:41.040
there is no like emotion.
link |
01:58:42.560
Emotion is never a part of all of this.
link |
01:58:44.600
And that just seems a little bit weird to me.
link |
01:58:47.080
I think the reason basically being that there is surprise
link |
01:58:50.320
and like, basically emotion is like one of the reasons
link |
01:58:53.800
emotions arise is like what happens
link |
01:58:55.800
and what do you expect to happen, right?
link |
01:58:57.120
There is like a mismatch between these things.
link |
01:58:59.440
And so that gives rise to like,
link |
01:59:01.080
I can either be surprised or I can be saddened
link |
01:59:03.520
or I can be happy and all of this.
link |
01:59:05.320
And so this basically indicates
link |
01:59:07.960
that I already have a predictive model in my head
link |
01:59:10.160
and something that I predicted or something
link |
01:59:11.840
that I thought was likely to happen.
link |
01:59:13.720
And then there was something that I observed
link |
01:59:15.120
that happened that there was a disconnect
link |
01:59:16.720
between these two things.
link |
01:59:18.280
And that basically is like maybe one of the reasons
link |
01:59:21.840
like you have a lot of emotions.
link |
01:59:24.280
Yeah, I think, so I talk to people a lot about this,
link |
01:59:26.880
like Lisa Feldman Barrett.
link |
01:59:29.120
I think that's an interesting concept of emotion
link |
01:59:31.720
but I have a sense that emotion primarily
link |
01:59:36.880
in the way we think about it,
link |
01:59:38.080
which is the display of emotion
link |
01:59:40.320
is a communication mechanism between humans.
link |
01:59:43.800
So it's a part of basically human to human interaction,
link |
01:59:48.240
an important part, but just the part.
link |
01:59:50.200
So it's like, I would throw it into the full mix
link |
01:59:55.040
of communication.
link |
01:59:58.040
And to me, communication can be done with objects
link |
02:00:01.240
that don't look at all like humans.
link |
02:00:04.360
Okay.
link |
02:00:05.440
I've seen our ability to anthropomorphize
link |
02:00:07.560
our ability to connect with things that look like a Roomba
link |
02:00:10.680
our ability to connect.
link |
02:00:12.000
First of all, let's talk about other biological systems
link |
02:00:14.720
like dogs, our ability to love things
link |
02:00:17.440
that are very different than humans.
link |
02:00:19.400
But they do display emotion, right?
link |
02:00:20.960
I mean, dogs do display emotion.
link |
02:00:23.200
So they don't have to be anthropomorphic
link |
02:00:25.320
for them to like display the kind of emotions
link |
02:00:27.600
that we do.
link |
02:00:28.440
Exactly.
link |
02:00:29.280
So, I mean, but then the word emotion starts to lose.
link |
02:00:33.920
So then we have to be, I guess specific, but yeah.
link |
02:00:36.280
So have rich flavorful communication.
link |
02:00:39.520
Communication, yeah.
link |
02:00:40.360
Yeah, so like, yes, it's full of emotion.
link |
02:00:43.000
It's full of wit and humor and moods
link |
02:00:49.080
and all those kinds of things, yeah.
link |
02:00:50.280
So you're talking about like flavor.
link |
02:00:53.720
Flavor, yeah.
link |
02:00:54.560
Okay, let's call it that.
link |
02:00:55.400
So there's content and then there is flavor
link |
02:00:57.240
and I'm talking about the flavor.
link |
02:00:58.440
Do you think it needs to have a body?
link |
02:01:00.280
Do you think like to interact with the physical world?
link |
02:01:02.840
Do you think you can understand the physical world
link |
02:01:04.640
without being able to directly interact with it?
link |
02:01:07.080
I don't think so, yeah.
link |
02:01:08.440
I think at some point we will need to bite the bullet
link |
02:01:10.720
and actually interact with the physical,
link |
02:01:12.680
as much as I like working on like passive computer vision
link |
02:01:15.880
where I just like sit in my arm chair
link |
02:01:17.280
and look at videos and learn.
link |
02:01:19.040
I do think that we will need to have some kind of embodiment
link |
02:01:22.760
or some kind of interaction
link |
02:01:24.600
to figure out things about the world.
link |
02:01:26.960
What about consciousness?
link |
02:01:28.640
Do you think, how often do you think about consciousness
link |
02:01:32.320
when you think about your work?
link |
02:01:34.320
You could think of it
link |
02:01:35.280
as the more simple thing of self awareness,
link |
02:01:38.640
of being aware that you are a perceiving,
link |
02:01:43.880
sensing, acting thing in this world.
link |
02:01:46.840
Or you can think about the bigger version of that,
link |
02:01:50.320
which is consciousness,
link |
02:01:51.640
which is having it feel like something to be that entity,
link |
02:01:57.200
the subjective experience of being in this world.
link |
02:01:59.560
So I think of self awareness a little bit more
link |
02:02:01.440
than like the broader goal of it,
link |
02:02:03.400
because I think self awareness is pretty critical
link |
02:02:06.120
for like any kind of like any kind of AGI
link |
02:02:09.280
or whatever you want to call it that we build,
link |
02:02:10.680
because it needs to contextualize what it is
link |
02:02:13.960
and what role it's playing
link |
02:02:15.600
with respect to all the other things that exist around it.
link |
02:02:17.960
I think that requires self awareness.
link |
02:02:19.680
It needs to understand that it's an autonomous car, right?
link |
02:02:23.520
And what does that mean?
link |
02:02:24.920
What are its limitations?
link |
02:02:26.240
What are the things that it is supposed to do and so on?
link |
02:02:29.080
What is its role in some way?
link |
02:02:30.760
Or, I mean, these are the kinds of things
link |
02:02:34.240
that we kind of expect from it, I would say.
link |
02:02:36.880
And so that's the level of self awareness
link |
02:02:39.360
that's, I would say, basically required at least,
link |
02:02:42.200
if not more than that.
link |
02:02:44.280
Yeah, I tend to, on the emotion side,
link |
02:02:46.440
believe that it has to have,
link |
02:02:48.360
it has to be able to display consciousness.
link |
02:02:52.560
Display consciousness, what do you mean by that?
link |
02:02:54.360
Meaning like for us humans to connect with each other
link |
02:02:57.600
or to connect with other living entities,
link |
02:03:01.680
I think we need to feel,
link |
02:03:04.200
like in order for us to truly feel
link |
02:03:06.840
like that there's another being there,
link |
02:03:09.400
we have to believe that they're conscious.
link |
02:03:11.440
And so we won't ever connect with something
link |
02:03:14.960
that doesn't have elements of consciousness.
link |
02:03:17.320
Now I tend to think that that's easier to achieve
link |
02:03:21.560
than it may sound,
link |
02:03:23.080
because we anthropomorphize stuff so hard.
link |
02:03:25.720
Like you have a mug that just like has wheels
link |
02:03:28.760
and like rotates every once in a while and makes a sound.
link |
02:03:31.920
I think a couple of days in,
link |
02:03:34.320
especially if you don't hang out with humans,
link |
02:03:39.520
you might start to believe that mug on wheels is conscious.
link |
02:03:42.200
So I think we anthropomorphize pretty effectively
link |
02:03:44.840
as human beings.
link |
02:03:46.040
But I do think that it's in the same bucket
link |
02:03:49.240
that we'll call emotion,
link |
02:03:50.920
that show that you're,
link |
02:03:54.720
I think of consciousness as the capacity to suffer.
link |
02:03:58.320
And if you're an entity that's able to feel things
link |
02:04:02.400
in the world and to communicate that to others,
link |
02:04:06.640
I think that's a really powerful way
link |
02:04:08.520
to interact with humans.
link |
02:04:10.880
And in order to create an AGI system,
link |
02:04:13.200
I believe you should be able to richly interact with humans.
link |
02:04:18.000
Like humans would need to want to interact with you.
link |
02:04:21.120
Like it can't be like,
link |
02:04:22.200
it's the self supervised learning versus like,
link |
02:04:27.400
like the robot shouldn't have to pay you
link |
02:04:29.280
to interact with it.
link |
02:04:30.400
So like it should be a natural fun thing.
link |
02:04:33.600
And then you're going to scale up significantly
link |
02:04:36.080
how much interaction it gets.
link |
02:04:39.080
It's the Alexa prize,
link |
02:04:40.840
which they were trying to get me to be a judge
link |
02:04:43.400
on their contest.
link |
02:04:44.680
Let's see if I want to do that.
link |
02:04:46.040
But their challenge is to talk to you,
link |
02:04:50.560
make the human sufficiently interested
link |
02:04:53.960
that the human keeps talking for 20 minutes.
link |
02:04:56.160
To Alexa?
link |
02:04:57.000
To Alexa, yeah.
link |
02:04:58.600
And right now they're not even close to that
link |
02:05:00.240
because it just gets so boring when you're like,
link |
02:05:02.560
when the intelligence is not there,
link |
02:05:04.280
it gets very not interesting to talk to it.
link |
02:05:06.920
And so the robot needs to be interesting.
link |
02:05:08.960
And one of the ways it can be interesting
link |
02:05:10.440
is display the capacity to love, to suffer.
link |
02:05:14.680
And I would say that essentially means
link |
02:05:17.480
the capacity to display consciousness.
link |
02:05:20.920
Like it is an entity, much like a human being.
link |
02:05:25.160
Of course, what that really means,
link |
02:05:27.320
I don't know if that's fundamentally a robotics problem
link |
02:05:30.520
or some kind of problem that we're not yet even aware.
link |
02:05:33.040
Like if it is truly a hard problem of consciousness,
link |
02:05:36.040
I tend to maybe optimistically think it's a,
link |
02:05:38.600
we can pretty effectively fake it till we make it.
link |
02:05:42.640
So we can display a lot of human like elements for a while.
link |
02:05:46.400
And that will be sufficient to form
link |
02:05:49.080
really close connections with humans.
link |
02:05:52.000
To you, what's the most beautiful idea
link |
02:05:53.720
in self supervised learning?
link |
02:05:55.840
Like when you sit back with, I don't know,
link |
02:05:59.040
with a glass of wine and an armchair
link |
02:06:03.200
and just at a fireplace,
link |
02:06:06.080
just thinking how beautiful this world that you get
link |
02:06:08.720
to explore is, what do you think
link |
02:06:10.560
is the especially beautiful idea?
link |
02:06:13.800
The fact that like object level,
link |
02:06:16.480
what objects are and some notion of objectness emerges
link |
02:06:19.960
from these models by just like self supervised learning.
link |
02:06:23.680
So for example, like one of the things, like the DINO paper
link |
02:06:28.920
that I was a part of at Facebook is the object sort
link |
02:06:33.040
of boundaries emerge from these representations.
link |
02:06:35.600
So if you have like a dog running in the field,
link |
02:06:38.060
the boundaries around the dog,
link |
02:06:39.440
the network is basically able to figure out
link |
02:06:42.320
what the boundaries of this dog are automatically.
link |
02:06:45.520
And it was never trained to do that.
link |
02:06:47.040
It was never trained to, no one taught it
link |
02:06:50.160
that this is a dog and these pixels belong to a dog.
link |
02:06:52.680
It's able to group these things together automatically.
link |
02:06:55.000
So that's one.
link |
02:06:56.160
I think in general, that entire notion that this dumb idea
link |
02:07:00.000
that you take like these two crops of an image
link |
02:07:01.960
and then you say that the features should be similar,
link |
02:07:04.120
that has resulted in something like this,
link |
02:07:06.040
like the model is able to figure out
link |
02:07:07.920
what the dog pixels are and so on.
link |
02:07:10.320
That just seems like so surprising.
link |
02:07:13.440
And I mean, I don't think a lot of us even understand
link |
02:07:16.200
how that is happening really.
link |
02:07:18.120
And it's something we are taking for granted,
link |
02:07:20.800
maybe like a lot in terms of how we're setting up
link |
02:07:23.120
these algorithms, but it's just,
link |
02:07:24.920
it's a very beautiful and powerful idea.
link |
02:07:26.780
So it's really fundamentally telling us something about
link |
02:07:30.240
that there is so much signal in the pixels
link |
02:07:32.440
that we can be super dumb about it,
link |
02:07:34.120
about how we are setting up
link |
02:07:35.200
the self supervised learning problem.
link |
02:07:37.080
And despite being like super dumb about it,
link |
02:07:39.600
we'll actually get very good,
link |
02:07:41.640
like we'll actually get something that is able to do
link |
02:07:44.000
very like surprising things.
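To make the "two crops should have similar features" idea concrete, here is a toy sketch in PyTorch: take two random crops of each image, encode both with the same network, and pull the two embeddings together. The tiny encoder, the crop augmentation, and the plain cosine loss are simplifying assumptions; real methods like DINO or SwAV add momentum encoders, centering/clustering, or stop-gradients to keep the representations from collapsing.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def random_crop(img, size=24, out=32):
    # img: (C, H, W). Take a random square crop, then resize back to a fixed size.
    _, h, w = img.shape
    top = torch.randint(0, h - size + 1, (1,)).item()
    left = torch.randint(0, w - size + 1, (1,)).item()
    crop = img[:, top:top + size, left:left + size].unsqueeze(0)
    return F.interpolate(crop, size=(out, out), mode="bilinear",
                         align_corners=False).squeeze(0)

# A deliberately tiny stand-in encoder (not the actual DINO/ViT architecture).
encoder = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 64),
)
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

images = torch.rand(8, 3, 32, 32)  # placeholder batch; real images would go here

for step in range(10):  # a few illustrative steps
    view1 = torch.stack([random_crop(img) for img in images])
    view2 = torch.stack([random_crop(img) for img in images])
    z1, z2 = encoder(view1), encoder(view2)
    # The "dumb" objective: two crops of the same image should embed similarly.
    loss = -F.cosine_similarity(z1, z2, dim=1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```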
link |
02:07:45.720
I wonder if there's other like objectness
link |
02:07:48.280
or other concepts that can emerge.
link |
02:07:51.600
I don't know if you follow Francois Chollet,
link |
02:07:53.600
he had the competition for intelligence
link |
02:07:56.600
that basically it's kind of like an IQ test,
link |
02:07:59.560
but for machines, but for an IQ test,
link |
02:08:02.400
you have to have a few concepts that you want to apply.
link |
02:08:05.360
One of them is objectness.
link |
02:08:07.800
I wonder if those concepts can emerge
link |
02:08:11.520
through self supervised learning on billions of images.
link |
02:08:14.760
I think something like object permanence
link |
02:08:16.320
can definitely emerge, right?
link |
02:08:17.440
So that's like a fundamental concept which we have,
link |
02:08:20.240
maybe not through images, through video,
link |
02:08:21.480
but that's another concept that should be emerging from it
link |
02:08:25.160
because it's not something that,
link |
02:08:26.760
like even if we don't teach humans
link |
02:08:29.120
about this concept of object permanence,
link |
02:08:31.520
it actually emerges.
link |
02:08:32.500
And the same thing for like animals, like dogs,
link |
02:08:34.100
I think object permanence automatically
link |
02:08:36.360
is something that they are born with.
link |
02:08:38.080
So I think it should emerge from the data.
link |
02:08:40.320
It should emerge basically very quickly.
link |
02:08:42.440
I wonder if ideas like symmetry, rotation,
link |
02:08:45.880
these kinds of things might emerge.
link |
02:08:47.920
So I think rotation, probably yes.
link |
02:08:50.360
Yeah, rotation, yes.
link |
02:08:51.640
I mean, there's some constraints in the architecture itself,
link |
02:08:55.200
but it's interesting if all of them could be,
link |
02:08:59.240
like counting was another one, being able to kind of
link |
02:09:04.280
understand that there's multiple objects
link |
02:09:06.240
of the same kind in the image and be able to count them.
link |
02:09:10.040
I wonder if all of that could be,
link |
02:09:11.560
if constructed correctly, they can emerge
link |
02:09:14.360
because then you can transfer those concepts
link |
02:09:16.480
to then interpret images at a deeper level.
link |
02:09:20.680
Right.
link |
02:09:21.520
Counting, I do believe, I mean, it should be possible.
link |
02:09:24.680
We don't know yet,
link |
02:09:25.920
but I do think it's not that far in the realm of possibility.
link |
02:09:29.720
Yeah, that'd be interesting
link |
02:09:30.560
if using self supervised learning on images
link |
02:09:33.240
can then be applied to then solving those kinds of IQ tests,
link |
02:09:36.520
which seem currently to be kind of impossible.
link |
02:09:40.440
What idea do you believe might be true
link |
02:09:43.320
that most people think is not true
link |
02:09:46.600
or don't agree with you on?
link |
02:09:48.560
Is there something like that?
link |
02:09:50.040
So this is going to be a little controversial,
link |
02:09:52.400
but okay, sure.
link |
02:09:53.500
I don't believe in simulation.
link |
02:09:55.340
Like actually using simulation to do things very much.
link |
02:09:58.840
Just to clarify, because this is a podcast
link |
02:10:01.040
where you talk about, are we living in a simulation often?
link |
02:10:03.600
You're referring to using simulation to construct worlds
link |
02:10:08.000
that you then leverage for machine learning.
link |
02:10:10.320
Right, yeah.
link |
02:10:11.160
For example, like one example would be like
link |
02:10:13.080
to train an autonomous car driving system.
link |
02:10:15.520
You basically first build a simulator,
link |
02:10:17.400
which builds like the environment of the world.
link |
02:10:19.840
And then you basically have a lot of like,
link |
02:10:22.680
you train your machine learning system in that.
link |
02:10:25.320
So I believe it is possible,
link |
02:10:27.560
but I think it's a really expensive way of doing things.
link |
02:10:30.920
And at the end of it, you do need the real world.
link |
02:10:33.760
So I'm not sure.
link |
02:10:35.520
So maybe for certain settings,
link |
02:10:36.920
like maybe the payout is so large,
link |
02:10:38.880
like for autonomous driving, the payout is so large
link |
02:10:40.880
that you can actually invest that much money to build it.
link |
02:10:43.360
But I think as a general sort of principle,
link |
02:10:45.480
it does not apply to a lot of concepts.
link |
02:10:47.040
You can't really build simulations of everything.
link |
02:10:49.720
Not only because like one, it's expensive,
link |
02:10:51.520
but second, it's also not possible for a lot of things.
link |
02:10:54.800
So in general, like there's a lot of work
link |
02:10:59.400
on like using synthetic data and like synthetic simulators.
link |
02:11:02.120
I generally am not very, like I don't believe in that.
link |
02:11:05.840
So you're saying it's very challenging visually,
link |
02:11:09.040
like to correctly like simulate the visual,
link |
02:11:11.960
like the lighting, all those kinds of things.
link |
02:11:13.600
I mean, all these companies that you have, right?
link |
02:11:15.680
So like Pixar and like whatever,
link |
02:11:17.880
all these companies are,
link |
02:11:19.840
all this like computer graphics stuff
link |
02:11:21.540
is really about accurately,
link |
02:11:22.920
a lot of them is about like accurately trying to figure out
link |
02:11:26.120
how the lighting is and like how things reflect off
link |
02:11:28.760
of one another and so on,
link |
02:11:30.440
and like how sparkly things look and so on.
link |
02:11:32.280
So it's a very hard problem.
link |
02:11:34.040
So do we really need to solve that first
link |
02:11:37.200
to be able to like do computer vision?
link |
02:11:39.440
Probably not.
link |
02:11:40.640
And for me, in the context of autonomous driving,
link |
02:11:44.800
it's very tempting to be able to use simulation, right?
link |
02:11:48.040
Because it's a safety critical application,
link |
02:11:50.560
but the other limitation of simulation that perhaps
link |
02:11:54.960
is a bigger one than the visual limitation
link |
02:11:58.440
is the behavior of objects.
link |
02:12:00.840
So you're ultimately interested in edge cases.
link |
02:12:03.920
And the question is,
link |
02:12:05.000
how well can you generate edge cases in simulation,
link |
02:12:08.800
especially with human behavior?
link |
02:12:11.080
I think another problem is like for autonomous driving,
link |
02:12:13.480
it's a constantly changing world.
link |
02:12:15.260
So say autonomous driving like in 10 years from now,
link |
02:12:18.600
like there are lots of autonomous cars,
link |
02:12:20.800
but they're still going to be humans.
link |
02:12:22.440
So now there are 50% of the agents say, which are humans,
link |
02:12:25.240
50% of the agents that are autonomous,
link |
02:12:26.880
like car driving agents.
link |
02:12:28.600
So now the mixture has changed.
link |
02:12:30.120
So now the kinds of behaviors that you actually expect
link |
02:12:32.360
from the other agents or other cars on the road
link |
02:12:35.200
are actually going to be very different.
link |
02:12:36.760
And as the proportion of the number of autonomous cars
link |
02:12:39.120
to humans keeps changing,
link |
02:12:40.480
this behavior will actually change a lot.
link |
02:12:42.640
So now if you were to build a simulator based on
link |
02:12:44.520
just like right now to build them today,
link |
02:12:46.480
you don't have that many autonomous cars on the road.
link |
02:12:48.440
So you would try to like make all of the other agents
link |
02:12:50.560
in that simulator behave as humans,
link |
02:12:52.920
but that's not really going to hold true 10, 15, 20,
link |
02:12:55.760
30 years from now.
link |
02:12:57.400
Do you think we're living in a simulation?
link |
02:12:59.280
No.
link |
02:13:01.520
How hard is it?
link |
02:13:02.840
This is why I think it's an interesting question.
link |
02:13:04.880
How hard is it to build a video game,
link |
02:13:07.780
like virtual reality game where it is so real,
link |
02:13:12.660
forget like ultra realistic to where
link |
02:13:15.840
you can't tell the difference,
link |
02:13:17.400
but like it's so nice that you just want to stay there.
link |
02:13:20.860
You just want to stay there and you don't want to come back.
link |
02:13:24.960
Do you think that's doable within our lifetime?
link |
02:13:29.380
Within our lifetime, probably.
link |
02:13:31.700
Yeah.
link |
02:13:32.540
I eat healthy, I live long.
link |
02:13:33.880
Does that make you sad that there'll be like
link |
02:13:39.400
like population of kids that basically spend 95%,
link |
02:13:44.280
99% of their time in a virtual world?
link |
02:13:50.120
Very, very hard question to answer.
link |
02:13:53.380
For certain people, it might be something
link |
02:13:55.760
that they really derive a lot of value out of,
link |
02:13:58.160
derive a lot of enjoyment and like happiness out of,
link |
02:14:00.760
and maybe the real world wasn't giving them that.
link |
02:14:03.140
That's why they did that.
link |
02:14:03.980
So maybe it is good for certain people.
link |
02:14:05.960
So ultimately, if it maximizes happiness,
link |
02:14:09.400
Right, I think if.
link |
02:14:10.240
Or we could judge.
link |
02:14:11.060
Yeah, I think if it's making people happy,
link |
02:14:12.780
maybe it's okay.
link |
02:14:14.440
Again, I think this is a very hard question.
link |
02:14:18.320
So like you've been a part of a lot of amazing papers.
link |
02:14:23.520
What advice would you give to somebody
link |
02:14:25.640
on what it takes to write a good paper?
link |
02:14:29.220
Grad students writing papers now,
link |
02:14:31.020
is there common things that you've learned along the way
link |
02:14:34.540
that you think it takes,
link |
02:14:35.760
both for a good idea and a good paper?
link |
02:14:39.020
Right, so I think both of these have picked up
link |
02:14:44.140
from like lots of people I've worked with in the past.
link |
02:14:46.580
So one of them is picking the right problem
link |
02:14:48.740
to work on in research is as important
link |
02:14:51.100
as like finding the solution to it.
link |
02:14:53.720
So I mean, there are multiple reasons for this.
link |
02:14:56.220
So one is that there are certain problems
link |
02:14:59.000
that can actually be solved in a particular timeframe.
link |
02:15:02.380
So now say you want to work on finding the meaning of life.
link |
02:15:06.420
This is a great problem.
link |
02:15:07.460
I think most people will agree with that.
link |
02:15:09.460
But do you believe that your talents
link |
02:15:12.260
and like the energy that you'll spend on it
link |
02:15:13.860
will make some kind of meaningful progress
link |
02:15:17.300
in your lifetime?
link |
02:15:18.860
If you are optimistic about it, then go ahead.
link |
02:15:21.020
That's why I started this podcast.
link |
02:15:22.140
I keep asking people about the meaning of life.
link |
02:15:24.080
I'm hoping by episode like 220, I'll figure it out.
link |
02:15:27.460
Oh, not too many episodes to go.
link |
02:15:30.300
All right, cool.
link |
02:15:31.780
Maybe today, I don't know, but you're right.
link |
02:15:33.820
So that seems intractable at the moment.
link |
02:15:36.300
Right, so I think it's just the fact of like,
link |
02:15:39.060
if you're starting a PhD, for example,
link |
02:15:41.100
what is one problem that you want to focus on
link |
02:15:43.020
that you do think is interesting enough,
link |
02:15:45.740
and you will be able to make a reasonable amount
link |
02:15:47.800
of headway into it that you think you'll be doing a PhD for?
link |
02:15:50.540
So in that kind of a timeframe.
link |
02:15:53.100
So that's one.
link |
02:15:53.920
Of course, there's the second part,
link |
02:15:54.780
which is what excites you genuinely.
link |
02:15:56.380
So you shouldn't just pick problems
link |
02:15:57.620
that you are not excited about,
link |
02:15:59.020
because as a grad student or as a researcher,
link |
02:16:01.860
you really need to be passionate about it
link |
02:16:03.220
to continue doing that,
link |
02:16:04.580
because there are so many other things
link |
02:16:05.740
that you could be doing in life.
link |
02:16:07.100
So you really need to believe in that
link |
02:16:08.260
to be able to do that for that long.
link |
02:16:10.740
In terms of papers, I think the one thing
link |
02:16:12.660
that I've learned is,
link |
02:16:15.580
like in the past, whenever I used to write things,
link |
02:16:17.780
and even now, whenever I do that,
link |
02:16:18.940
I try to cram in a lot of things into the paper,
link |
02:16:21.420
whereas what really matters
link |
02:16:22.820
is just pushing one simple idea, that's it.
link |
02:16:25.760
That's all because the paper is going to be like,
link |
02:16:29.980
whatever, eight or nine pages.
link |
02:16:32.180
If you keep cramming in lots of ideas,
link |
02:16:34.240
it's really hard for the single thing
link |
02:16:36.240
that you believe in to stand out.
link |
02:16:38.020
So if you really try to just focus,
link |
02:16:40.900
especially in terms of writing,
link |
02:16:41.940
really try to focus on one particular idea
link |
02:16:43.820
and articulate it out in multiple different ways,
link |
02:16:46.220
it's far more valuable to the reader as well,
link |
02:16:49.020
and basically to the reader, of course,
link |
02:16:51.600
because they get to,
link |
02:16:53.100
they know that this particular idea
link |
02:16:54.420
is associated with this paper,
link |
02:16:56.140
and also for you, because you have,
link |
02:16:59.260
when you write about a particular idea in different ways,
link |
02:17:01.080
you think about it more deeply.
link |
02:17:02.700
So as a grad student, I used to always wait till
link |
02:17:06.020
maybe the last week or whatever, to write the paper,
link |
02:17:08.700
because I used to always believe
link |
02:17:10.280
that doing the experiments
link |
02:17:11.380
was actually the bigger part of research than writing.
link |
02:17:13.860
And my advisor always told me
link |
02:17:15.260
that you should start writing very early on,
link |
02:17:16.660
and I thought, oh, it doesn't matter,
link |
02:17:17.900
I don't know what he's talking about.
link |
02:17:19.700
But I think more and more I realized that's the case.
link |
02:17:22.020
Whenever I write something that I'm doing,
link |
02:17:24.060
I actually think much better about it.
link |
02:17:26.440
And so if you start writing early on,
link |
02:17:28.820
you actually, I think, get better ideas,
link |
02:17:31.220
or at least you figure out holes in your theory,
link |
02:17:33.820
or particular experiments that you should run
link |
02:17:36.260
to plug those holes, and so on.
link |
02:17:38.740
Yeah, I'm continually surprised
link |
02:17:40.340
how many really good papers throughout history
link |
02:17:43.620
are quite short and quite simple.
link |
02:17:48.340
And there's a lesson to that.
link |
02:17:50.180
If you want to dream about writing a paper
link |
02:17:52.620
that changes the world,
link |
02:17:54.180
and you wanna go by example, they're usually simple.
link |
02:17:58.120
And that's, it's not cramming,
link |
02:18:01.280
or it's focusing on one idea, and thinking deeply.
link |
02:18:07.200
And you're right that the writing process itself
link |
02:18:10.340
reveals the idea.
link |
02:18:12.280
It challenges you to really think about what is the idea
link |
02:18:15.320
that explains it, the thread that ties it all together.
link |
02:18:19.040
And so a lot of famous researchers I know
link |
02:18:21.540
actually would start off, like, first they were,
link |
02:18:24.760
even before the experiments were in,
link |
02:18:27.000
a lot of them would actually start
link |
02:18:28.360
with writing the introduction of the paper,
link |
02:18:30.400
with zero experiments in.
link |
02:18:32.160
Because that at least helps them figure out
link |
02:18:33.800
what they're trying to solve,
link |
02:18:35.800
and how it fits in the context of things right now.
link |
02:18:38.660
And that would really guide their entire research.
link |
02:18:40.680
So a lot of them would actually first write intros
link |
02:18:42.360
with zero experiments in,
link |
02:18:43.560
and that's how they would start projects.
link |
02:18:46.040
Some basic questions about people maybe
link |
02:18:49.800
that are more like beginners in this field.
link |
02:18:51.960
What's the best programming language to learn
link |
02:18:54.080
if you're interested in machine learning?
link |
02:18:56.600
I would say Python,
link |
02:18:57.440
just because it's the easiest one to learn.
link |
02:19:00.320
And also a lot of like programming
link |
02:19:03.160
and machine learning happens in Python.
link |
02:19:05.000
So if you don't know any other programming language,
link |
02:19:07.600
Python is actually going to get you a long way.
link |
02:19:09.560
Yeah, it seems like sort of a,
link |
02:19:11.680
it's a toss up question because it seems like Python
link |
02:19:14.000
is so much dominating the space now.
link |
02:19:16.800
But I wonder if there's an interesting alternative.
link |
02:19:18.520
Obviously there's like Swift,
link |
02:19:19.960
and there's a lot of interesting alternatives popping up,
link |
02:19:22.740
even JavaScript.
link |
02:19:23.960
Or R, more like for the data science applications.
link |
02:19:28.880
But it seems like Python more and more
link |
02:19:31.240
is actually being used to teach like introduction
link |
02:19:34.160
to programming at universities.
link |
02:19:35.880
So it just combines everything very nicely.
link |
02:19:39.840
Even harder question.
link |
02:19:41.840
What are the pros and cons of PyTorch versus TensorFlow?
link |
02:19:46.120
I see.
link |
02:19:48.440
Okay.
link |
02:19:49.280
You can go with no comment.
link |
02:19:51.360
So a disclaimer to this is that the last time
link |
02:19:53.400
I used TensorFlow was probably like four years ago.
link |
02:19:56.400
And so it was right when it had come out
link |
02:19:58.160
because so I started on like deep learning in 2014 or so,
link |
02:20:02.660
and the dominant sort of framework for us then
link |
02:20:06.480
for vision was Caffe, which was out of Berkeley.
link |
02:20:09.040
And we used Caffe a lot, it was really nice.
link |
02:20:12.120
And then TensorFlow came in,
link |
02:20:13.360
which was basically like Python first.
link |
02:20:15.080
So Caffe was mainly C++,
link |
02:20:17.040
and it had like very loose kind of Python binding.
link |
02:20:19.040
So Python wasn't really the first language you would use.
link |
02:20:21.320
You would really use either MATLAB or C++
link |
02:20:24.680
like get stuff done in like Caffe.
link |
02:20:28.240
And then Python of course became popular a little bit later.
link |
02:20:30.920
So TensorFlow was basically around that time.
link |
02:20:32.620
So 2015, 2016 is when I last used it.
link |
02:20:36.120
It's been a while.
link |
02:20:37.200
And then what, did you use Torch or did you?
link |
02:20:40.600
So then I moved to LuaTorch, which was the torch in Lua.
link |
02:20:44.040
And then in 2017, I think basically pretty much
link |
02:20:46.780
to PyTorch completely.
link |
02:20:48.420
Oh, interesting.
link |
02:20:49.260
So you went to Lua, cool.
link |
02:20:50.520
Yeah.
link |
02:20:51.480
Huh, so you were there before it was cool.
link |
02:20:54.200
Yeah, I mean, so LuaTorch was really good
link |
02:20:56.320
because it actually allowed you
link |
02:20:59.000
to do a lot of different kinds of things.
link |
02:21:01.340
So, Caffe was very rigid in terms of its structure.
link |
02:21:03.880
Like you would create a neural network once and that's it.
link |
02:21:06.800
Whereas if you wanted like very dynamic graphs and so on,
link |
02:21:09.320
it was very hard to do that.
link |
02:21:10.200
And LuaTorch was much more friendly
link |
02:21:11.600
for all of these things.
link |
02:21:13.560
Okay, so in terms of PyTorch and TensorFlow,
link |
02:21:15.600
my personal bias is PyTorch
link |
02:21:17.280
just because I've been using it longer
link |
02:21:19.080
and I'm more familiar with it.
link |
02:21:20.780
And also that PyTorch is much easier to debug
link |
02:21:23.560
is what I find because it's imperative in nature
link |
02:21:26.300
compared to like TensorFlow, which is not imperative.
link |
02:21:28.620
But that's telling you a lot that basically
link |
02:21:30.480
the imperative design is sort of a way
link |
02:21:33.320
in which a lot of people are taught programming
link |
02:21:35.240
and that's what actually makes debugging easier for them.
link |
02:21:38.160
So like I learned programming in C, C++.
link |
02:21:40.480
And so for me, imperative way of programming is more natural.
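As a small illustration of why eager, imperative execution makes debugging feel natural, here is a hypothetical PyTorch module where you can print tensor shapes and statistics, or drop a breakpoint, right in the middle of the forward pass; nothing here is meant to represent a real model.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 16)
        self.fc2 = nn.Linear(16, 2)

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        # Imperative execution: intermediate tensors are ordinary Python objects,
        # so you can inspect them here or set a breakpoint (import pdb; pdb.set_trace()).
        print("hidden:", h.shape, "mean:", h.mean().item())
        return self.fc2(h)

net = TinyNet()
out = net(torch.randn(4, 8))
print("output:", out.shape)
```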
link |
02:21:44.040
Do you think it's good to have
link |
02:21:45.280
kind of these two communities, this kind of competition?
link |
02:21:48.480
I think PyTorch is kind of more and more
link |
02:21:50.680
becoming dominant in the research community,
link |
02:21:52.520
but TensorFlow is still very popular
link |
02:21:54.600
in the more sort of application machine learning community.
link |
02:21:57.920
So do you think it's good to have
link |
02:21:59.640
that kind of split in code bases?
link |
02:22:02.080
Or so like the benefit there is the competition challenges
link |
02:22:06.560
the library developers to step up their game.
link |
02:22:09.980
But the downside is there's these code bases
link |
02:22:12.720
that are in different libraries.
link |
02:22:15.180
Right, so I think the downside is that,
link |
02:22:17.080
I mean, for a lot of research code
link |
02:22:18.480
that's released in one framework
link |
02:22:19.640
and if you're using the other one,
link |
02:22:20.600
it's really hard to like really build on top of it.
link |
02:22:23.800
But thankfully the open source community
link |
02:22:25.800
in machine learning is amazing.
link |
02:22:27.080
So whenever like something pops up in TensorFlow,
link |
02:22:30.840
you wait a few days and someone who's like super sharp
link |
02:22:33.200
will actually come and translate that particular code
link |
02:22:35.340
based into PyTorch and basically have figured that
link |
02:22:38.380
all the nooks and crannies out.
link |
02:22:39.700
So the open source community is amazing
link |
02:22:41.800
and they really like figure out this gap.
link |
02:22:44.280
So I think in terms of like having these two frameworks
link |
02:22:47.560
or multiple, I think of course there are different use cases
link |
02:22:49.720
so there are going to be benefits
link |
02:22:51.080
to using one or the other framework.
link |
02:22:52.840
And like you said, I think competition is just healthy
link |
02:22:54.720
because both of these frameworks keep
link |
02:22:57.360
or like all of these frameworks really sort of
link |
02:22:59.060
keep learning from each other
link |
02:23:00.120
and keep incorporating different things
link |
02:23:01.640
to just make them better and better.
link |
02:23:03.760
What advice would you have for someone
link |
02:23:06.320
new to machine learning, you know,
link |
02:23:09.680
maybe just started or haven't even started
link |
02:23:11.520
but are curious about it and who want to get in the field?
link |
02:23:14.880
Don't be afraid to get your hands dirty.
link |
02:23:16.620
I think that's the main thing.
link |
02:23:17.640
So if something doesn't work,
link |
02:23:19.120
like really drill into why things are not working.
link |
02:23:22.200
Can you elaborate what getting your hands dirty means?
link |
02:23:24.520
Right, so for example, like if an algorithm,
link |
02:23:27.540
if you try to train the network and it's not converging,
link |
02:23:29.720
whatever, rather than trying to like Google the answer
link |
02:23:32.240
or trying to do something,
link |
02:23:33.400
like really spend those like five, eight, 10, 15, 20,
link |
02:23:36.320
whatever number of hours really trying
link |
02:23:37.560
to figure it out yourself.
link |
02:23:39.000
Because in that process, you'll actually learn a lot more.
link |
02:23:41.320
Yeah.
link |
02:23:42.520
Googling is of course like a good way to solve it
link |
02:23:44.600
when you need a quick answer.
link |
02:23:45.960
But I think initially, especially like when you're starting
link |
02:23:48.120
out, it's much nicer to like figure things out by yourself.
link |
02:23:51.840
And I just say that from experience
link |
02:23:52.960
because like when I started out,
link |
02:23:54.280
there were not a lot of resources.
link |
02:23:55.480
So we would like in the lab, a lot of us,
link |
02:23:57.880
like we would look up to senior students
link |
02:23:59.680
and then the senior students were of course busy
link |
02:24:01.360
and they would be like, hey, why don't you go figure it out?
link |
02:24:03.080
Because I just don't have the time.
link |
02:24:04.320
I'm working on my dissertation or whatever.
link |
02:24:06.480
I'm finishing my PhD thesis.
link |
02:24:07.640
And so then we would sit down
link |
02:24:08.760
and like just try to figure it out.
link |
02:24:10.480
And that I think really helped me.
link |
02:24:12.440
That has really helped me figure a lot of things out.
link |
02:24:15.040
I think in general, if I were to generalize that,
link |
02:24:18.720
I feel like persevering through any kind of struggle
link |
02:24:22.720
on a thing you care about is good.
link |
02:24:25.640
So you're basically, you try to make it seem
link |
02:24:27.960
like it's good to spend time debugging,
link |
02:24:30.840
but really any kind of struggle, whatever form that takes,
link |
02:24:33.680
it could be just Googling a lot.
link |
02:24:36.080
Just basically anything, just sticking with it
link |
02:24:38.720
and going through the hard thing that could take a form
link |
02:24:41.000
of implementing stuff from scratch.
link |
02:24:43.200
It could take the form of reimplementing
link |
02:24:45.600
with different libraries
link |
02:24:46.520
or different programming languages.
link |
02:24:49.320
It could take a lot of different forms,
link |
02:24:50.560
but struggle is good for the soul.
link |
02:24:53.520
So like in Pittsburgh, where I did my PhD,
link |
02:24:55.800
the thing was it used to snow a lot.
link |
02:24:58.360
And so when it snowed, you really couldn't do much.
link |
02:25:00.800
So the thing that a lot of people said
link |
02:25:02.880
was snow builds character.
link |
02:25:05.320
Because when it's snowing, you can't do anything else.
link |
02:25:07.480
You focus on work.
link |
02:25:09.040
Do you have advice in general for people,
link |
02:25:10.800
you've already exceptionally successful, you're young,
link |
02:25:13.400
but do you have advice for young people starting out
link |
02:25:15.760
in college or maybe in high school?
link |
02:25:18.160
Advice for their career, advice for their life,
link |
02:25:21.040
how to pave a successful path in career and life?
link |
02:25:25.760
I would say just be hungry.
link |
02:25:27.640
Always be hungry for what you want.
link |
02:25:29.680
And I think I've been inspired by a lot of people
link |
02:25:33.280
who are just driven and who really go for what they want,
link |
02:25:36.720
no matter what, like you shouldn't want it,
link |
02:25:39.440
you should need it.
link |
02:25:40.480
So if you need something,
link |
02:25:41.480
you basically go towards the ends to make it work.
link |
02:25:44.360
How do you know when you come across a thing
link |
02:25:47.840
that's like you need?
link |
02:25:51.120
I think there's not going to be any single thing
link |
02:25:53.080
that you're going to need.
link |
02:25:53.920
There are going to be different types of things
link |
02:25:54.920
that you need, but whenever you need something,
link |
02:25:56.600
you just go push for it.
link |
02:25:57.920
And of course, you may not get it,
link |
02:26:00.040
or you may find that this was not even the thing
link |
02:26:01.960
that you were looking for, it might be a different thing.
link |
02:26:03.640
But the point is like you're pushing through things
link |
02:26:06.240
and that actually brings a lot of skills
link |
02:26:08.960
and builds a certain kind of attitude
link |
02:26:12.880
which will probably help you get the other thing
link |
02:26:15.680
once you figure out what's really the thing that you want.
link |
02:26:18.080
Yeah, I think a lot of people are,
link |
02:26:20.480
I've noticed, kind of afraid of that
link |
02:26:22.520
because one, it's a fear of commitment.
link |
02:26:24.880
And two, there's so many amazing things in this world,
link |
02:26:26.880
you almost don't want to miss out
link |
02:26:28.120
on all the other amazing things
link |
02:26:29.440
by committing to this one thing.
link |
02:26:31.080
So I think a lot of it has to do with just
link |
02:26:32.720
allowing yourself to notice that thing
link |
02:26:37.920
and just go all the way with it.
link |
02:26:41.560
I mean, I also like failure, right?
link |
02:26:43.240
So I know this is like super cheesy that failure
link |
02:26:47.280
is something that you should be prepared for and so on,
link |
02:26:49.760
but I do think, I mean, especially in research,
link |
02:26:52.520
for example, failure is something that happens
link |
02:26:54.400
almost every day, like experiments failing
link |
02:26:58.160
and not working.
link |
02:26:59.080
And so you really need to be so used to it.
link |
02:27:02.240
You need to have a thick skin,
link |
02:27:03.880
but only basically through that,
link |
02:27:06.280
like when you get through it is when you find
link |
02:27:07.880
the one thing that's actually working.
link |
02:27:09.560
So Thomas Edison was like one person like that, right?
link |
02:27:11.840
So I really, like when I was a kid,
link |
02:27:13.680
I used to really read about how he found like the filament,
link |
02:27:17.040
the light bulb filament.
link |
02:27:18.680
And then he, I think his thing was like,
link |
02:27:20.560
he tried 990 things that didn't work
link |
02:27:23.120
or something of the sort.
link |
02:27:24.320
And then they asked him like, so what did you learn?
link |
02:27:26.920
Because all of these were failed experiments.
link |
02:27:28.480
And then he says, oh, these 990 things don't work.
link |
02:27:31.600
And I know that.
link |
02:27:32.440
Did you know that?
link |
02:27:33.280
I mean, that's really inspiring.
link |
02:27:35.960
So you spent a few years on this earth
link |
02:27:38.480
performing a self supervised kind of learning process.
link |
02:27:43.960
Have you figured out the meaning of life yet?
link |
02:27:46.400
I told you I'm doing this podcast
link |
02:27:47.720
to try to get the answer.
link |
02:27:49.120
I'm hoping you could tell me,
link |
02:27:50.720
what do you think the meaning of it all is?
link |
02:27:54.320
I don't think I figured this out.
link |
02:27:55.800
No, I have no idea.
link |
02:27:57.120
Do you think AI will help us figure it out
link |
02:28:02.560
or do you think there's no answer?
link |
02:28:03.880
The whole point is to keep searching.
link |
02:28:05.480
I think, yeah, I think it's an endless sort of quest for us.
link |
02:28:08.800
I don't think AI will help us there.
link |
02:28:10.560
This is like a very hard, hard, hard question
link |
02:28:13.600
which so many humans have tried to answer.
link |
02:28:15.440
Well, that's the interesting thing
link |
02:28:16.400
about the difference between AI and humans.
link |
02:28:19.560
Humans don't seem to know what the hell they're doing.
link |
02:28:21.880
And AI is almost always operating
link |
02:28:23.720
under well defined objective functions.
link |
02:28:28.360
And I wonder whether our lack of ability
link |
02:28:33.680
to define good longterm objective functions
link |
02:28:37.240
or introspect what is the objective function
link |
02:28:40.400
under which we operate, if that's a feature or a bug.
link |
02:28:44.400
I would say it's a feature
link |
02:28:45.240
because then everyone actually has very different kinds
link |
02:28:47.440
of objective functions that they're optimizing
link |
02:28:49.360
and those objective functions evolve
link |
02:28:51.320
and change dramatically through the course
link |
02:28:53.400
of their life.
link |
02:28:54.240
That's actually what makes us interesting, right?
link |
02:28:56.000
If otherwise, like if everyone was doing
link |
02:28:58.040
the exact same thing, that would be pretty boring.
link |
02:29:00.560
We do want like people with different kinds
link |
02:29:02.600
of perspectives, also people evolve continuously.
link |
02:29:06.160
That's like, I would say the biggest feature of being human.
link |
02:29:09.320
And then we get to like the ones that die
link |
02:29:11.160
because they do something stupid.
link |
02:29:12.560
We get to watch that, see it and learn from it.
link |
02:29:15.440
And as a species, we take that lesson
link |
02:29:20.360
and become better and better
link |
02:29:22.600
because of all the dumb people in the world
link |
02:29:24.280
that died doing something wild and beautiful.
link |
02:29:29.080
Ishan, thank you so much for this incredible conversation.
link |
02:29:31.840
We did a depth first search through the space
link |
02:29:37.080
of machine learning and it was fun and fascinating.
link |
02:29:41.640
So it's really an honor to meet you
link |
02:29:43.920
and it was a really awesome conversation.
link |
02:29:45.760
Thanks for coming down today and talking with me.
link |
02:29:48.200
Thanks Lex, I mean, I've listened to you.
link |
02:29:50.240
I told you it was unreal for me to actually meet you
link |
02:29:52.400
in person and I'm so happy to be here, thank you.
link |
02:29:55.000
Thanks man.
link |
02:29:56.680
Thanks for listening to this conversation
link |
02:29:58.200
with Ishan Misra and thank you to Onnit,
link |
02:30:01.280
The Information, Grammarly and Athletic Greens.
link |
02:30:05.280
Check them out in the description to support this podcast.
link |
02:30:08.560
And now let me leave you with some words
link |
02:30:10.440
from Arthur C. Clarke.
link |
02:30:12.480
Any sufficiently advanced technology
link |
02:30:14.920
is indistinguishable from magic.
link |
02:30:18.120
Thank you for listening and hope to see you next time.