
Ishan Misra: Self-Supervised Deep Learning in Computer Vision | Lex Fridman Podcast #206



link |
00:00:00.000
The following is a conversation with Ishan Misra,
link |
00:00:03.240
research scientist at Facebook AI Research,
link |
00:00:05.840
who works on self supervised machine learning
link |
00:00:08.600
in the domain of computer vision.
link |
00:00:10.480
Or in other words, making AI systems understand
link |
00:00:14.160
the visual world with minimal help from us humans.
link |
00:00:18.000
Transformers and self attention have been successfully used
link |
00:00:21.720
by OpenAI's GPT-3 and other language models
link |
00:00:25.600
to do self supervised learning in the domain of language.
link |
00:00:28.600
Ishan, together with Yann LeCun and others,
link |
00:00:31.800
is trying to achieve the same success
link |
00:00:33.920
in the domain of images and video.
link |
00:00:36.360
The goal is to leave a robot watching YouTube videos
link |
00:00:39.520
all night and in the morning,
link |
00:00:41.240
come back to a much smarter robot.
link |
00:00:43.560
I read the blog post self supervised learning
link |
00:00:45.960
the dark matter of intelligence by Ishan and Yann LeCun
link |
00:00:50.320
and then listened to Ishan's appearance
link |
00:00:52.920
on the excellent machine learning street talk podcast.
link |
00:00:57.160
And I knew I had to talk to him.
link |
00:00:59.160
By the way, if you're interested in machine learning and AI,
link |
00:01:02.840
I cannot recommend the ML street talk podcast highly enough.
link |
00:01:07.960
Those guys are great.
link |
00:01:09.640
Quick mention of our sponsors on it,
link |
00:01:12.040
the information, Grammarly and Athletic Greens.
link |
00:01:15.400
Check them out in the description to support this podcast.
link |
00:01:18.640
As a side note, let me say that for those of you
link |
00:01:21.680
who may have been listening for quite a while,
link |
00:01:23.240
this podcast used to be called
link |
00:01:24.960
artificial intelligence podcast.
link |
00:01:27.120
Because my life passion has always been,
link |
00:01:29.680
will always be artificial intelligence,
link |
00:01:32.640
both narrowly and broadly defined.
link |
00:01:35.440
My goal with this podcast is still to have many conversations
link |
00:01:39.080
with world class researchers in AI,
link |
00:01:41.720
math, physics, biology and all the other sciences.
link |
00:01:45.120
But I also want to talk to historians, musicians, athletes
link |
00:01:49.400
and of course, occasionally comedians.
link |
00:01:51.520
In fact, I'm trying out doing this podcast
link |
00:01:53.600
three times a week now to give me more freedom
link |
00:01:56.200
with guest selection and maybe get a chance
link |
00:01:59.400
to have a bit more fun.
link |
00:02:00.880
Speaking of fun, in this conversation,
link |
00:02:03.160
I challenged the listener to count the number of times
link |
00:02:05.440
the word banana is mentioned.
link |
00:02:08.000
Ishan and I used the word banana as the canonical example
link |
00:02:12.600
at the core of the hard problem of computer vision
link |
00:02:15.200
and maybe the hard problem of consciousness.
link |
00:02:19.880
This is the Lex Fridman podcast
link |
00:02:22.640
and here is my conversation with Ishan Misra.
link |
00:02:27.240
What is self supervised learning?
link |
00:02:29.880
And maybe even give the bigger basics
link |
00:02:32.760
of what is supervised and semi supervised learning.
link |
00:02:35.360
And maybe why is self supervised learning
link |
00:02:37.640
a better term than unsupervised learning?
link |
00:02:40.080
Let's start with supervised learning.
link |
00:02:41.560
So typically for machine learning systems,
link |
00:02:43.920
the way they're trained is you get a bunch of humans.
link |
00:02:46.920
The humans point out particular concepts.
link |
00:02:48.600
So if it's in the case of images,
link |
00:02:50.160
you want the humans to come and tell you
link |
00:02:51.960
what is present in the image,
link |
00:02:54.400
draw boxes around them,
link |
00:02:55.800
draw masks of things, pixels,
link |
00:02:57.680
which are of particular categories or not.
link |
00:03:00.440
For NLP, again, there are lots of these particular tasks,
link |
00:03:03.240
say about sentiment analysis,
link |
00:03:04.760
about entailment and so on.
link |
00:03:06.640
So typically for supervised learning,
link |
00:03:08.080
we get a big corpus of such annotated or labeled data
link |
00:03:11.280
and then we feed that to a system
link |
00:03:12.800
and the system is really trying to mimic,
link |
00:03:14.800
so it's taking this input of the data
link |
00:03:16.600
and then trying to mimic the output.
link |
00:03:18.360
So it looks at an image and the human has tagged
link |
00:03:20.680
that this image contains a banana
link |
00:03:22.440
and now the system is basically trying to mimic that.
link |
00:03:24.680
So that's its learning signal.
link |
00:03:26.720
And so for supervised learning,
link |
00:03:28.040
we try to gather lots of such data
link |
00:03:30.080
and we train these machine learning models
link |
00:03:31.880
to imitate the input output.
link |
00:03:33.480
And the hope is basically by doing so,
link |
00:03:35.640
now on unseen or like new kinds of data,
link |
00:03:38.120
this model can automatically learn to predict these concepts.
link |
00:03:41.360
So this is a standard sort of supervised setting.
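For illustration, here is a minimal sketch of that supervised setup, assuming PyTorch; the model, label set, and tensor sizes are made up, and random tensors stand in for real annotated images. The point is just that the loss pushes the model to mimic the human-provided label.

```python
# Minimal supervised-learning sketch (assumes PyTorch; toy sizes and labels).
import torch
import torch.nn as nn

classes = ["banana", "cup", "table"]           # human-defined label set
model = nn.Sequential(                         # tiny image classifier
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 128), nn.ReLU(),
    nn.Linear(128, len(classes)),
)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 32, 32)             # stand-in for annotated images
labels = torch.randint(0, len(classes), (8,))  # stand-in for human labels ("banana", ...)

logits = model(images)                         # model's guess
loss = loss_fn(logits, labels)                 # penalize disagreement with the human label
loss.backward()
opt.step()
```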
link |
00:03:43.440
For semi supervised setting,
link |
00:03:45.800
the idea typically is that you have,
link |
00:03:47.640
of course, all of the supervised data,
link |
00:03:49.320
but you have lots of other data
link |
00:03:50.840
which is unsupervised or which is like not labeled.
link |
00:03:53.160
Now the problem basically with supervised learning
link |
00:03:55.320
and why you actually have all of these alternate
link |
00:03:57.480
sort of learning paradigms is
link |
00:03:59.440
supervised learning just does not scale.
link |
00:04:01.840
So if you look at for computer vision,
link |
00:04:03.960
one of the largest and most popular datasets
link |
00:04:06.320
is ImageNet, right?
link |
00:04:07.520
So the entire ImageNet dataset
link |
00:04:09.360
has about 22,000 concepts and about 14 million images.
link |
00:04:13.840
So these concepts are basically just nouns
link |
00:04:16.200
and they're annotated on images.
link |
00:04:18.360
And this entire dataset was a mammoth data collection effort.
link |
00:04:20.840
It actually gave rise to a lot of powerful learning algorithms
link |
00:04:23.800
and is credited with like sort of the rise
link |
00:04:25.600
of deep learning as well.
link |
00:04:27.200
But this dataset took about 22 human years
link |
00:04:30.120
to collect, to annotate.
link |
00:04:31.920
And it's not even that many concepts, right?
link |
00:04:33.480
It's not even that many images.
link |
00:04:34.520
14 million is nothing really.
link |
00:04:36.760
Like you have about I think 400 million images or so
link |
00:04:39.320
or even more than that uploaded to most of the popular
link |
00:04:41.880
sort of social media websites today.
link |
00:04:44.160
So now supervised learning just doesn't scale.
link |
00:04:46.400
If I want to now annotate more concepts,
link |
00:04:48.640
if I want to have these various types of fine grained concepts,
link |
00:04:51.280
then it won't really scale.
link |
00:04:53.200
So now you come up to these sort of different learning paradigms,
link |
00:04:55.680
for example, semi supervised learning,
link |
00:04:57.520
where the idea is, of course,
link |
00:04:58.560
you have this annotated corpus of supervised data
link |
00:05:01.360
and you have lots of these unlabeled images.
link |
00:05:03.680
And the idea is that the algorithm should basically try
link |
00:05:05.840
to measure some kind of consistency
link |
00:05:07.960
or really try to measure some kind of signal
link |
00:05:10.280
on this sort of unlabeled data
link |
00:05:12.160
to make itself more confident
link |
00:05:14.160
about what it's really trying to predict.
link |
00:05:16.160
So by having access to lots of this unlabeled data,
link |
00:05:19.640
the idea is that the algorithm actually learns
link |
00:05:22.200
to be more confident and actually gets better
link |
00:05:24.520
at predicting these concepts.
link |
00:05:26.880
And now we come to the other extreme,
link |
00:05:28.480
which is like self supervised learning.
link |
00:05:30.480
The idea basically is that the machine
link |
00:05:32.280
or the algorithm should really discover concepts
link |
00:05:34.720
or discover things about the world
link |
00:05:36.400
or learn representations about the world which are useful
link |
00:05:39.200
without access to explicit human supervision.
link |
00:05:41.760
So the word supervision is still in the term self supervised.
link |
00:05:46.280
So what is the supervision signal?
link |
00:05:48.560
And maybe that perhaps is why Yann LeCun and you argue
link |
00:05:52.040
that unsupervised is the incorrect terminology here.
link |
00:05:55.040
So what is the supervision signal
link |
00:05:57.440
when the humans aren't part of the picture
link |
00:05:59.720
or not a big part of the picture?
link |
00:06:02.400
Right, so self supervised,
link |
00:06:04.520
the reason it has the term supervised in itself
link |
00:06:06.800
is because you're using the data itself as supervision.
link |
00:06:10.360
So because the data serves as its own source of supervision
link |
00:06:13.240
it's self supervised in that way.
link |
00:06:15.200
Now the reason a lot of people,
link |
00:06:16.440
I mean, we did it in that blog post with Yann,
link |
00:06:18.440
but a lot of other people have also argued
link |
00:06:20.160
for using this term self supervised.
link |
00:06:22.120
So starting from like '94 from Virginia de Sa's group
link |
00:06:25.640
at I think UCSD and now she's at UCSD.
link |
00:06:28.840
Jitendra Malik has said this a bunch of times as well.
link |
00:06:31.680
So you have supervised
link |
00:06:33.120
and then unsupervised basically means everything
link |
00:06:35.240
which is not supervised,
link |
00:06:36.440
but that includes stuff like semi supervised
link |
00:06:38.680
that includes other like transductive learning
link |
00:06:41.320
lots of other sort of settings.
link |
00:06:43.040
So that's the reason like now people
link |
00:06:45.520
are preferring this term self supervised
link |
00:06:47.160
because it explicitly says what's happening.
link |
00:06:49.280
The data itself is the source of supervision
link |
00:06:51.640
and any sort of learning algorithm
link |
00:06:53.160
which tries to extract just sort of data supervision signals
link |
00:06:56.960
from the data itself is a self supervised algorithm.
link |
00:06:59.520
But there is within the data a set of tricks
link |
00:07:03.200
which unlock the supervision.
link |
00:07:05.600
So can you give me some examples?
link |
00:07:07.240
And there's innovation, ingenuity required
link |
00:07:11.400
to unlock that supervision.
link |
00:07:12.880
The data doesn't just speak to you some ground truth.
link |
00:07:15.640
You have to do some kind of trick.
link |
00:07:17.800
So I don't know what your favorite domain is.
link |
00:07:19.600
So you specifically specialize in visual learning
link |
00:07:23.040
but is there favorite examples
link |
00:07:24.520
maybe in language or other domains?
link |
00:07:26.560
Perhaps the most successful applications
link |
00:07:28.320
have been in NLP, natural language processing.
link |
00:07:31.080
So the idea basically being that you can train models
link |
00:07:34.040
that you have a sentence and you mask out certain words
link |
00:07:37.400
and now these models learn to predict the masked out words.
link |
00:07:40.520
So if you have like the cat jumped over the dog.
link |
00:07:44.040
So you can basically mask out cat
link |
00:07:46.000
and now you're essentially asking the model to predict
link |
00:07:47.920
what was missing, what did I mask out?
link |
00:07:50.320
So the model is going to predict basically
link |
00:07:52.480
a distribution over all the possible words that it knows
link |
00:07:55.360
and probably it has like if it's a well trained model
link |
00:07:58.400
it has a sort of higher probability density
link |
00:08:00.600
for this word cat.
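As a concrete illustration of that masked-word prediction, here is a hedged sketch assuming the Hugging Face transformers library and a pretrained BERT checkpoint; the sentence is just the example from the conversation. A fill-mask model returns a distribution over candidate words for the masked position.

```python
# Hedged sketch: requires `pip install transformers` and downloads a pretrained BERT.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("The [MASK] jumped over the dog."):
    print(candidate["token_str"], round(candidate["score"], 3))
# A well-trained model should place high probability on words like "cat".
```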
link |
00:08:02.600
For vision I would say the sort of more,
link |
00:08:05.520
I mean the easier example
link |
00:08:07.480
which is not as widely used these days
link |
00:08:09.400
is basically say for example video prediction.
link |
00:08:12.040
So video is again a sequence of things.
link |
00:08:14.080
So you can ask the model.
link |
00:08:15.040
So if you have a video of say 10 seconds
link |
00:08:17.440
you can feed in the first nine seconds to a model
link |
00:08:19.840
and then ask it, hey, what happens basically
link |
00:08:21.960
in the 10 second?
link |
00:08:22.800
Can you predict what's going to happen?
link |
00:08:24.480
And the idea basically is because the model
link |
00:08:26.760
is predicting something about the data itself.
link |
00:08:29.440
Of course you didn't need any human to tell you
link |
00:08:31.680
what was happening because the 10 second video
link |
00:08:33.160
was naturally captured.
link |
00:08:34.600
Because the model is predicting what's happening there
link |
00:08:36.680
it's going to automatically learn something
link |
00:08:39.040
about the structure of the world, how objects move,
link |
00:08:41.240
object permanence and these kinds of things.
link |
00:08:44.000
So like if I have something at the edge of the table
link |
00:08:45.960
it'll fall down.
link |
00:08:47.520
Things like these which you really don't have
link |
00:08:48.960
to sit and annotate.
link |
00:08:50.240
In a supervised learning setting
link |
00:08:51.320
I would have to sit and annotate.
link |
00:08:52.280
This is a cup, now I move this cup, this is still a cup
link |
00:08:55.200
and now I move this cup, it's still a cup
link |
00:08:56.640
and then it falls down and this is a fallen down cup.
link |
00:08:58.840
So I won't have to annotate all of these things
link |
00:09:00.400
in a self supervised setting.
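A toy version of that video-prediction pretext task might look like the following sketch, with random tensors standing in for real video and a made-up architecture: feed the first frames, predict the last one, and the "label" is simply the frame that was naturally recorded.

```python
# Self-supervised next-frame prediction sketch (assumes PyTorch; toy shapes).
import torch
import torch.nn as nn

video = torch.randn(4, 10, 3, 64, 64)         # batch of 10-frame clips (B, T, C, H, W)
context, target = video[:, :9], video[:, 9]   # first 9 frames in, 10th frame is the target

predictor = nn.Sequential(                     # toy model: collapse time, predict one frame
    nn.Conv2d(9 * 3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1),
)
pred = predictor(context.flatten(1, 2))        # (B, 27, 64, 64) -> (B, 3, 64, 64)
loss = nn.functional.mse_loss(pred, target)    # supervision comes from the video itself
loss.backward()
```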
link |
00:09:02.000
Isn't that kind of a brilliant little trick
link |
00:09:05.240
of taking a series of data that is consistent
link |
00:09:08.280
and removing one element in that series
link |
00:09:11.880
and then teaching the algorithm to predict that element?
link |
00:09:17.000
Isn't that, first of all, that's quite brilliant.
link |
00:09:20.680
It seems to be applicable in anything that
link |
00:09:24.200
has the constraint of being a sequence
link |
00:09:27.880
that is consistent with the physical reality.
link |
00:09:31.760
The question is, are there other tricks like this
link |
00:09:34.400
that can generate the self supervision signal?
link |
00:09:37.840
So sequence is possibly the most widely used one in NLP.
link |
00:09:41.200
For vision, the one that is actually used for images
link |
00:09:44.080
which is very popular these days
link |
00:09:45.840
is basically taking an image
link |
00:09:47.600
and now taking different crops of that image.
link |
00:09:50.080
So you can basically decide to crop say the top left corner
link |
00:09:53.080
and you crop say the bottom right corner
link |
00:09:55.280
and asking a network, basically presenting it
link |
00:09:58.080
with a choice saying that, okay, now you have this image,
link |
00:10:01.360
you have this image, are these the same or not?
link |
00:10:04.480
And so the idea basically is that
link |
00:10:05.760
because different crop, like in an image,
link |
00:10:07.480
different parts of the image are going to be related.
link |
00:10:09.800
So for example, if you have a chair and a table,
link |
00:10:12.400
basically these things are going to be close by
link |
00:10:15.080
versus if you take, again, if you have like a zoomed
link |
00:10:18.360
in picture of a chair, if you're taking different crops,
link |
00:10:20.520
it's going to be different parts of the chair.
link |
00:10:22.360
So the idea basically is that different crops
link |
00:10:25.040
of the image are related.
link |
00:10:26.200
And so the features or the representations
link |
00:10:27.920
that you get from these different crops
link |
00:10:29.080
should also be related.
link |
00:10:30.320
So this is possibly the most widely used trick
link |
00:10:32.720
these days for self supervised learning in computer vision.
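Sketched in code, that crop trick is the core of contrastive methods in the SimCLR family: take two random crops of the same image, embed both, and train the features of matching crops to agree. This is a simplified sketch under stated assumptions (PyTorch and torchvision, a toy linear encoder), not any particular paper's exact recipe.

```python
# Crop-based contrastive sketch (assumes PyTorch + torchvision; toy encoder).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import transforms

crop = transforms.RandomResizedCrop(32)        # two random crops = two "views" of one image
images = torch.rand(8, 3, 64, 64)              # stand-in for unlabeled images
view1 = torch.stack([crop(img) for img in images])
view2 = torch.stack([crop(img) for img in images])

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
z1 = F.normalize(encoder(view1), dim=1)
z2 = F.normalize(encoder(view2), dim=1)

logits = z1 @ z2.t() / 0.1                     # similarity of every crop to every other crop
labels = torch.arange(len(images))             # crop i should match crop i of the same image
loss = F.cross_entropy(logits, labels)         # InfoNCE-style objective
loss.backward()
```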
link |
00:10:35.760
So again, using the consistency
link |
00:10:38.400
that's inherent to physical reality in visual domain,
link |
00:10:42.040
that's parts of an image are consistent.
link |
00:10:45.640
And then in the language domain or anything
link |
00:10:49.120
that has sequences like language or something
link |
00:10:51.600
that's like a time series,
link |
00:10:53.000
then you can chop off parts in time.
link |
00:10:55.440
It's similar to the story of RNNs and CNNs,
link |
00:11:00.280
of RNNs and ConvNets.
link |
00:11:02.280
You and Yann LeCun wrote the blog post in March 2021,
link |
00:11:06.640
titled self supervised learning,
link |
00:11:08.840
the dark matter of intelligence.
link |
00:11:11.080
Can you summarize this blog post
link |
00:11:12.640
and maybe explain the main idea or set of ideas?
link |
00:11:15.640
The blog post was mainly about sort of just telling,
link |
00:11:18.680
I mean, this is really a accepted fact,
link |
00:11:21.680
I would say for a lot of people now,
link |
00:11:22.960
that self supervised learning is something
link |
00:11:24.360
that is going to play
link |
00:11:26.600
an important role for machine learning algorithms
link |
00:11:28.320
that come in the future and even now.
link |
00:11:30.400
Well, let me just comment that we don't yet
link |
00:11:33.840
have a good understanding of what dark matter is.
link |
00:11:36.480
That's true.
link |
00:11:37.320
So the idea basically being.
link |
00:11:40.080
Maybe the metaphor doesn't exactly transfer,
link |
00:11:41.840
but maybe it's actually perfectly transfers
link |
00:11:44.840
that we don't know.
link |
00:11:45.680
We have an inkling that it'll be a big part
link |
00:11:49.280
of whatever solving intelligence looks like.
link |
00:11:51.200
Right.
link |
00:11:52.040
I think self supervised learning,
link |
00:11:53.000
the way it's done right now is,
link |
00:11:54.880
I would say like the first step towards what it probably
link |
00:11:57.360
should end up like learning
link |
00:11:58.600
or what it should enable us to do.
link |
00:12:00.520
So the idea for that particular piece was
link |
00:12:03.760
self supervised learning is going to be a very powerful way
link |
00:12:06.200
to learn common sense about the world
link |
00:12:08.400
or like stuff that is really hard to label.
link |
00:12:10.800
For example, like is this piece
link |
00:12:13.760
over here heavier than the cup?
link |
00:12:15.640
Now, for all these kinds of things,
link |
00:12:17.520
you'll have to sit and label these things.
link |
00:12:18.760
So supervised learning is clearly not going to scale.
link |
00:12:21.560
So what is the thing that's actually going to scale?
link |
00:12:23.520
It's probably going to be an agent
link |
00:12:25.040
that can either actually interact with it to lift it up
link |
00:12:27.920
or observe me doing it.
link |
00:12:29.960
So if I'm basically lifting these things up,
link |
00:12:31.560
it can probably reason about,
link |
00:12:32.600
hey, this is taking him more time to lift up
link |
00:12:34.760
or the velocity is different,
link |
00:12:36.200
whereas the velocity for this is different,
link |
00:12:37.840
probably this one is heavier.
link |
00:12:39.600
So essentially by observations of the data,
link |
00:12:42.000
you should be able to infer a lot of things
link |
00:12:44.240
about the world without someone explicitly telling you,
link |
00:12:46.840
this is heavy, this is not,
link |
00:12:48.720
this is something that can pour,
link |
00:12:50.000
this is something that cannot pour,
link |
00:12:51.200
this is somewhere that you can sit,
link |
00:12:52.480
this is not somewhere that you can sit.
link |
00:12:53.920
But you've just mentioned the ability
link |
00:12:55.520
to interact with the world.
link |
00:12:57.440
There's so many questions that are yet to be,
link |
00:13:01.000
that are still open,
link |
00:13:02.240
which is how do you select a set of data
link |
00:13:04.480
over which the self supervised learning process works?
link |
00:13:08.640
How much interactivity, like in the active learning
link |
00:13:11.520
or the machine teaching context,
link |
00:13:13.440
is there, what are the reward signals?
link |
00:13:16.480
Like how much actual interaction there is
link |
00:13:18.560
with the physical world?
link |
00:13:20.080
That kind of thing.
link |
00:13:21.440
So that could be a huge question.
link |
00:13:24.800
And then on top of that,
link |
00:13:26.720
which I have a million questions about,
link |
00:13:29.000
which we don't know the answers to,
link |
00:13:30.440
but it's worth talking about is,
link |
00:13:32.840
how much reasoning is involved?
link |
00:13:35.120
How much accumulation of knowledge
link |
00:13:38.520
versus something that's more akin to learning
link |
00:13:40.800
or whether that's the same thing.
link |
00:13:43.240
But so we're like, it is truly dark matter.
link |
00:13:46.560
We don't know how exactly to do it,
link |
00:13:49.200
but we are, I mean, a lot of us are actually convinced
link |
00:13:52.040
that it's going to be a sort of major thing
link |
00:13:54.200
in machine learning.
link |
00:13:55.040
So let me reframe it then,
link |
00:13:56.600
that human supervision cannot be at large scale,
link |
00:14:01.160
the source of the solution to intelligence.
link |
00:14:04.120
So the machines have to discover the supervision
link |
00:14:08.000
in the natural signal of the world.
link |
00:14:10.240
I mean, the other thing is also that
link |
00:14:12.000
humans are not particularly good labelers,
link |
00:14:14.200
they're not very consistent.
link |
00:14:16.000
For example, like, what's the difference
link |
00:14:17.880
between a dining table and a table?
link |
00:14:19.840
Is it just the fact that one,
link |
00:14:21.560
like if you just look at a particular table,
link |
00:14:23.080
what makes us say one is dining table
link |
00:14:24.600
and the other is not?
link |
00:14:26.520
Humans are not particularly consistent,
link |
00:14:28.160
they're not like very good sources of supervision
link |
00:14:30.120
for a lot of these kind of edge cases.
link |
00:14:32.320
So it may be also the fact that if we want,
link |
00:14:35.880
like want an algorithm or want a machine
link |
00:14:37.960
to solve a particular task for us,
link |
00:14:39.640
we can maybe just specify the end goal
link |
00:14:42.120
and like the stuff in between,
link |
00:14:44.240
we really probably should not be specifying
link |
00:14:46.080
because we're maybe just going to confuse it a lot actually.
link |
00:14:49.320
Well, humans can't even answer the meaning of life.
link |
00:14:51.440
So I'm not sure if we're good supervisors
link |
00:14:53.920
of the end goal either.
link |
00:14:55.240
So let me ask you about categories.
link |
00:14:56.960
Humans are not very good at telling the difference
link |
00:14:59.040
between what is and isn't a table, like you mentioned.
link |
00:15:02.840
Do you think it's possible,
link |
00:15:04.520
let me ask you, like, pretend you're Plato.
link |
00:15:10.120
Is it possible to create a pretty good taxonomy
link |
00:15:14.800
of objects in the world?
link |
00:15:16.400
It seems like a lot of approaches in machine learning
link |
00:15:19.000
kind of assume a hopeful vision
link |
00:15:21.400
that it's possible to construct a perfect taxonomy
link |
00:15:24.080
or it exists perhaps out of our reach,
link |
00:15:26.520
but we can always get closer and closer to it.
link |
00:15:28.840
Or is that a hopeless pursuit?
link |
00:15:31.240
I think it's hopeless in some way.
link |
00:15:33.040
So the thing is for any particular categorization
link |
00:15:36.080
that you create,
link |
00:15:36.920
if you have a discrete sort of categorization,
link |
00:15:38.760
I can always take the nearest two concepts
link |
00:15:40.520
or I can take a third concept and I can blend it in
link |
00:15:42.600
and I can create a new category.
link |
00:15:44.480
So if you were to enumerate N categories,
link |
00:15:46.560
I will always find an N plus one category for you.
link |
00:15:48.880
That's not going to be in the N categories.
link |
00:15:50.680
And I can actually create not just N plus one,
link |
00:15:52.400
I can very easily create far more than N categories.
link |
00:15:55.120
The thing is,
link |
00:15:55.960
a lot of things we talk about are actually compositional.
link |
00:15:58.960
So it's really hard for us to come and sit
link |
00:16:01.680
and enumerate all of these out.
link |
00:16:03.200
And they compose in various weird ways, right?
link |
00:16:05.840
Like you have a croissant and a doughnut
link |
00:16:07.840
come together to form a cronut.
link |
00:16:09.680
So if you were to like enumerate all the foods up until,
link |
00:16:12.400
I don't know, whenever the cronut was about 10 years ago
link |
00:16:15.160
or 15 years ago,
link |
00:16:16.440
then this entire thing called cronut would not exist.
link |
00:16:19.000
Yeah, I remember there was the most awesome video
link |
00:16:21.760
of a cat wearing a monkey costume.
link |
00:16:23.520
Yeah, yes.
link |
00:16:26.520
People should look it up, it's great.
link |
00:16:28.240
So is that a monkey or is that a cat?
link |
00:16:31.000
It's a very difficult philosophical question.
link |
00:16:33.840
So there is a concept of similarity between objects.
link |
00:16:37.280
So you think that can take us very far?
link |
00:16:39.880
Just kind of getting a good function,
link |
00:16:43.200
a good way to tell which parts of things are similar
link |
00:16:47.920
and which parts of things are very different?
link |
00:16:50.720
I think so, yeah.
link |
00:16:51.800
So you don't necessarily need to name everything
link |
00:16:54.320
or assign a name to everything to be able to use it, right?
link |
00:16:57.840
So there are like lots of...
link |
00:16:59.560
Shakespeare said that, what's in a name?
link |
00:17:01.720
What's in a name?
link |
00:17:02.560
Yeah, okay.
link |
00:17:03.400
I mean, lots of like, for example, animals, right?
link |
00:17:05.840
They don't have necessarily a well formed
link |
00:17:08.120
like syntactic language,
link |
00:17:09.520
but they're able to go about their day perfectly.
link |
00:17:11.800
The same thing happens for us.
link |
00:17:12.880
So, I mean, we probably look at things and we figure out,
link |
00:17:17.040
oh, this is similar to something else that I've seen before.
link |
00:17:19.320
And then I can probably learn how to use it.
link |
00:17:22.000
So I haven't seen all the possible doorknobs in the world.
link |
00:17:26.280
But if you show me, like I was able to get into
link |
00:17:29.000
this particular place fairly easily,
link |
00:17:30.360
I've never seen that particular doorknob.
link |
00:17:32.120
So I, of course, related to all the doorknobs
link |
00:17:33.920
that I've seen and I know exactly how it's going to open.
link |
00:17:36.520
I have a pretty good idea of how it's going to open.
link |
00:17:39.400
And I think this kind of translation between experiences
link |
00:17:41.800
only happens because of similarity.
link |
00:17:43.680
Because I'm able to relate it to a doorknob.
link |
00:17:45.360
If I related it to a hairdryer,
link |
00:17:46.560
I would probably be stuck still outside,
link |
00:17:48.360
not able to get in.
link |
00:17:50.360
Again, a bit of a philosophical question,
link |
00:17:52.200
but is, can similarity take us all the way
link |
00:17:55.600
to understanding a thing?
link |
00:17:58.640
Can having a good function that compares objects
link |
00:18:01.920
get us to understand something profound
link |
00:18:04.880
about singular objects?
link |
00:18:07.160
I think I'll ask you a question back.
link |
00:18:08.600
What does it mean to understand objects?
link |
00:18:11.560
Well, let me tell you what that's similar to.
link |
00:18:13.480
No.
link |
00:18:14.320
So there's an idea of sort of reasoning
link |
00:18:17.680
by analogy kind of thing.
link |
00:18:19.760
I think understanding is the process of placing that thing
link |
00:18:24.920
in some kind of network of knowledge that you have.
link |
00:18:28.440
That it perhaps is fundamentally related to other concepts.
link |
00:18:33.160
So it's not like understanding is fundamentally related
link |
00:18:36.480
by composition of other concepts
link |
00:18:39.240
and maybe in relation to other concepts.
link |
00:18:43.160
And maybe deeper and deeper understanding
link |
00:18:45.800
is maybe just adding more edges to that graph somehow.
link |
00:18:51.840
So maybe it is a composition of similarities.
link |
00:18:55.080
I mean, ultimately, I suppose it is a kind of embedding
link |
00:18:59.560
in that wisdom space.
link |
00:19:02.480
Yeah, okay, wisdom space is good.
link |
00:19:06.520
I think, I do think, right?
link |
00:19:08.080
So similarity does get you very, very far.
link |
00:19:10.760
Is it the answer to everything?
link |
00:19:12.360
I mean, I don't even know what everything is,
link |
00:19:14.160
but it's going to take us really far.
link |
00:19:16.720
And I think the thing is things are similar
link |
00:19:19.680
in very different contexts, right?
link |
00:19:21.680
So an elephant is similar to, I don't know,
link |
00:19:24.360
another sort of wild animal, let's just pick,
link |
00:19:26.280
I don't know, lion in a different way
link |
00:19:28.560
because they're both four legged creatures.
link |
00:19:30.560
They're also land animals.
link |
00:19:32.080
But of course, they're very different
link |
00:19:33.160
in a lot of different ways.
link |
00:19:34.000
So elephants are like herbivores, lions are not.
link |
00:19:37.280
So similarity does, similarity and particularly dissimilarity
link |
00:19:40.680
also sort of actually helps us understand a lot about things.
link |
00:19:43.760
And so that's actually why I think
link |
00:19:45.240
discrete categorization is very hard.
link |
00:19:47.640
Just like forming this particular category of elephant
link |
00:19:50.080
and a particular category of lion,
link |
00:19:51.880
maybe it's good for like just like taxonomy,
link |
00:19:54.400
biological taxonomies.
link |
00:19:55.800
But when it comes to like other things
link |
00:19:57.240
which are not as maybe, for example, like grilled cheese,
link |
00:20:01.160
right? I have a grilled cheese I dip it in tomato
link |
00:20:03.080
and I keep it outside.
link |
00:20:04.000
Now, is that still a grilled cheese
link |
00:20:05.080
or is that something else?
link |
00:20:06.760
All right, so categorization is still very useful
link |
00:20:09.800
for solving problems.
link |
00:20:11.280
But is your intuition then sort of the self supervised
link |
00:20:15.960
should be the, to borrow Yann LeCun's terminology,
link |
00:20:20.960
should be the cake and then categorization,
link |
00:20:23.680
the classification, maybe the supervised like layer
link |
00:20:27.400
should be just like the thing on top,
link |
00:20:29.120
the cherry or the icing or whatever.
link |
00:20:31.040
So if you make it the cake, it gets in the way of learning.
link |
00:20:35.560
If you make it the cake, then you don't,
link |
00:20:37.000
we won't be able to sit and annotate everything.
link |
00:20:39.400
That's as simple as it is.
link |
00:20:40.680
Like that's my very practical view on it.
link |
00:20:43.120
It's just, I mean, in my PhD,
link |
00:20:44.960
I sat down and annotated like a bunch of cars
link |
00:20:47.040
for one of my projects.
link |
00:20:48.520
And very quickly I was just like,
link |
00:20:49.920
it was in a video and I was basically drawing boxes
link |
00:20:52.200
around all these cars.
link |
00:20:53.600
And I think I spent about a week doing all of that
link |
00:20:55.640
and I barely got anything done.
link |
00:20:57.680
And basically this was, I think my first year of my PhD
link |
00:21:00.320
or like the second year of my master's.
link |
00:21:02.720
And then by the end of it, I'm like, okay,
link |
00:21:04.040
this is just hopeless.
link |
00:21:05.040
I can't keep doing it.
link |
00:21:06.000
And when I'd done that, someone came up to me
link |
00:21:08.520
and they basically told me,
link |
00:21:09.600
oh, this is a pickup truck.
link |
00:21:10.880
This is not a car.
link |
00:21:12.800
And that's like, aha, this actually makes sense
link |
00:21:14.840
because a pickup truck is not really like,
link |
00:21:16.160
what was I annotating?
link |
00:21:17.040
Was I annotating anything that is mobile?
link |
00:21:19.600
Or was I annotating particular sedans
link |
00:21:21.440
or was I annotating SUVs?
link |
00:21:22.680
What was I doing?
link |
00:21:23.640
By the way, the annotation was bounding boxes?
link |
00:21:25.760
Bounding boxes.
link |
00:21:27.000
There's so many deep, profound questions here.
link |
00:21:30.080
You're almost cheating your way out of it
link |
00:21:32.200
by doing self supervised learning, by the way,
link |
00:21:34.440
which is like, what makes for an object?
link |
00:21:37.520
As opposed to solve intelligence,
link |
00:21:39.080
maybe you don't ever need to answer that question.
link |
00:21:42.480
I mean, this is the question that anyone
link |
00:21:44.160
that's ever done annotation because it's so painful
link |
00:21:47.240
gets to ask like, why am I drawing a
link |
00:21:51.960
very careful line around this object?
link |
00:21:55.480
Like what is the value?
link |
00:21:57.560
I remember when I first saw semantic segmentation
link |
00:22:00.240
where you have like instant segmentation
link |
00:22:03.680
where you have a very exact line
link |
00:22:06.280
around the object in a 2D plane
link |
00:22:09.560
of a fundamentally 3D object projected on a 2D plane.
link |
00:22:13.480
So you're drawing a line around a car
link |
00:22:15.840
that might be occluded.
link |
00:22:17.000
There might be another thing in front of it,
link |
00:22:18.880
but you're still drawing the line
link |
00:22:20.400
of the part of the car that you see.
link |
00:22:23.680
How is that the car?
link |
00:22:26.280
Why is that the car?
link |
00:22:27.920
Like I had like an existential crisis every time.
link |
00:22:31.080
Like how is that going to help us understand
link |
00:22:33.600
or solve computer vision?
link |
00:22:35.400
I'm not sure I have a good answer to what's better.
link |
00:22:38.320
And I'm not sure I share the confidence that you have
link |
00:22:41.600
that self supervised learning can take us far.
link |
00:22:46.760
I think I'm more and more convinced
link |
00:22:48.680
that it's a very important component,
link |
00:22:50.920
but I still feel like we need to understand what makes,
link |
00:22:54.240
like this dream of maybe what it's called symbolic AI
link |
00:23:01.440
of arriving, like once you have this common sense base,
link |
00:23:05.600
be able to play with these concepts
link |
00:23:09.040
and build graphs or hierarchies of concepts on top
link |
00:23:13.480
in order to then form a deep sense
link |
00:23:18.840
of this three dimensional world or four dimensional world
link |
00:23:22.080
and be able to reason and then project that
link |
00:23:24.480
onto a 2D plane in order to interpret a 2D image.
link |
00:23:28.560
Can I ask you just an out there question?
link |
00:23:31.000
I remember, I think Andrej Karpathy had a blog post
link |
00:23:35.040
about computer vision, like being really hard.
link |
00:23:39.040
I forgot what the title was, but it's many, many years ago.
link |
00:23:42.120
And he had, I think President Obama stepping on a scale
link |
00:23:44.800
and there was humor and there was a bunch of people
link |
00:23:46.600
laughing and whatever.
link |
00:23:48.480
And there's a lot of interesting things about that image
link |
00:23:52.040
and I think Andrej highlighted a bunch of things
link |
00:23:55.160
about the image that us humans are able to immediately
link |
00:23:57.680
understand, like the idea, I think of gravity
link |
00:24:00.960
and that you can, you have the concept of a weight.
link |
00:24:04.080
You have a, you immediately project,
link |
00:24:06.560
because of our knowledge of pose
link |
00:24:08.160
and how human bodies are constructed,
link |
00:24:10.360
you understand how the forces are being applied
link |
00:24:13.040
with the human body.
link |
00:24:14.600
They're really interesting.
link |
00:24:15.560
Other thing that you're able to understand
link |
00:24:17.440
is multiple people looking at each other in the image.
link |
00:24:20.520
You're able to have a mental model
link |
00:24:22.360
of what the people are thinking about.
link |
00:24:23.760
You're able to infer like, oh, this person is probably
link |
00:24:26.960
laughing at how humorous the situation is.
link |
00:24:31.240
And this person is confused about what the situation is
link |
00:24:34.200
because they're looking this way.
link |
00:24:35.600
We're able to infer all of that.
link |
00:24:37.560
So that's human vision.
link |
00:24:41.400
How difficult is computer vision?
link |
00:24:45.040
Like in order to achieve that level of understanding
link |
00:24:48.440
and maybe how big of a part
link |
00:24:51.440
does self supervised learning play in that, do you think?
link |
00:24:54.400
And do you still, you know, back,
link |
00:24:56.440
that was like over a decade ago,
link |
00:24:58.440
I think Andrej and I think a lot of people agreed
link |
00:25:00.960
is computer vision is really hard.
link |
00:25:03.360
Do you still think computer vision is really hard?
link |
00:25:06.000
I think it is, yes.
link |
00:25:07.520
And getting to that kind of understanding,
link |
00:25:10.600
I mean, it's really out there.
link |
00:25:12.480
So if you ask me to solve just that particular problem,
link |
00:25:15.360
I can do it the supervised learning route.
link |
00:25:17.560
I can always construct a data set and basically predict,
link |
00:25:19.720
oh, is there humor in this or not?
link |
00:25:21.600
And of course I can do it.
link |
00:25:22.600
Actually, that's a good question.
link |
00:25:23.560
Do you think you can, okay, okay.
link |
00:25:25.200
Do you think you can do human supervised annotation
link |
00:25:28.120
of humor?
link |
00:25:29.040
To some extent, yes.
link |
00:25:29.960
I'm sure it'll work.
link |
00:25:30.880
I mean, it won't be as bad as like randomly guessing.
link |
00:25:34.400
I'm sure it can still predict whether it's humorous
link |
00:25:36.200
or not in some way.
link |
00:25:37.840
Yeah, maybe like Reddit upvotes is the signal.
link |
00:25:40.400
I don't know.
link |
00:25:41.240
I mean, it won't do a great job, but it'll do something.
link |
00:25:43.800
It may actually be like it may find certain things
link |
00:25:46.040
which are not humorous, humorous as well,
link |
00:25:47.560
which is going to be bad for us.
link |
00:25:49.160
But I mean, it'll do a, it won't be random.
link |
00:25:52.120
Yeah, kind of like my sense of humor.
link |
00:25:54.520
Okay, so fine.
link |
00:25:55.920
So you can, that particular problem, yes.
link |
00:25:57.520
But the general problem you're saying is hard.
link |
00:25:59.600
The general problem is hard.
link |
00:26:00.480
And I mean, self supervised learning
link |
00:26:02.320
is not the answer to everything.
link |
00:26:03.920
Of course it's not.
link |
00:26:04.760
I think if you have machines that are going to communicate
link |
00:26:07.800
with humans at the end of it,
link |
00:26:08.760
you want to understand what the algorithm is doing, right?
link |
00:26:10.880
You want it to be able to like produce an output
link |
00:26:13.720
that you can decipher, that you can understand,
link |
00:26:15.600
or it's actually useful for something else,
link |
00:26:17.480
which again is a human.
link |
00:26:19.360
So at some point in this sort of entire loop,
link |
00:26:22.280
a human steps in.
link |
00:26:23.720
And now this human needs to understand what's going on.
link |
00:26:26.760
And at that point, this entire notion of language
link |
00:26:28.960
or semantics really comes in.
link |
00:26:30.440
If the machine just spits out something,
link |
00:26:32.600
and if we can't understand it,
link |
00:26:34.000
then it's not really that useful for us.
link |
00:26:36.280
So self supervised learning is probably going to be useful
link |
00:26:38.440
for a lot of the things before that part.
link |
00:26:40.760
Before the machine really needs to communicate
link |
00:26:42.880
a particular kind of output with a human.
link |
00:26:46.120
Because I mean, otherwise,
link |
00:26:47.840
how is it going to do that without language?
link |
00:26:49.960
Or some kind of communication,
link |
00:26:51.920
but you're saying that it's possible to build
link |
00:26:53.680
a big base of understanding or whatever of,
link |
00:26:57.240
what's it about?
link |
00:26:58.080
Concepts.
link |
00:26:58.920
Concepts, yeah.
link |
00:26:59.760
Of like common sense concepts.
link |
00:27:02.320
Supervised learning in the context of computer vision
link |
00:27:06.160
is something you focused on,
link |
00:27:07.560
but that's a really hard domain.
link |
00:27:09.040
And it's kind of the cutting edge
link |
00:27:10.520
of what we're as a community working on today.
link |
00:27:13.080
Can we take a little bit of a step back
link |
00:27:14.800
and look at language?
link |
00:27:16.360
Can you summarize the history of success
link |
00:27:19.040
of self supervised learning
link |
00:27:20.480
in natural language processing, language modeling?
link |
00:27:23.920
What are transformers?
link |
00:27:25.640
What is the masking, the sentence completion
link |
00:27:28.800
that you mentioned before?
link |
00:27:31.040
How does it lead us to understand anything?
link |
00:27:33.600
Semantic meaning of words,
link |
00:27:34.880
syntactic role of words and sentences.
link |
00:27:37.680
So I'm of course not the expert in NLP.
link |
00:27:40.760
I kind of follow it a little bit from the sides.
link |
00:27:43.520
So the main sort of reason
link |
00:27:45.800
why all of this masking stuff works
link |
00:27:47.440
is I think it's called the distributional hypothesis
link |
00:27:49.920
in NLP.
link |
00:27:50.920
The idea basically being that words
link |
00:27:52.680
that occur in the same context
link |
00:27:54.440
should have similar meaning.
link |
00:27:56.000
So if you have the blank jumped over the blank,
link |
00:27:59.080
basically whatever is like in the first blank
link |
00:28:02.000
is going to be an object that can actually jump,
link |
00:28:04.160
something that can jump.
link |
00:28:05.880
So a cat or a dog or I don't know, sheep, something,
link |
00:28:08.360
all of these things can basically be
link |
00:28:09.720
in that particular context.
link |
00:28:11.680
And now so essentially the idea is that
link |
00:28:13.440
if you have words that are in the same context
link |
00:28:16.080
and you predict them,
link |
00:28:17.360
you're going to learn a lots of useful things
link |
00:28:20.040
about how words are related
link |
00:28:21.520
because you're predicting by looking at their context
link |
00:28:23.600
what the word is going to be.
link |
00:28:24.920
So in this particular case,
link |
00:28:26.200
the blank jumped over the fence.
link |
00:28:28.280
So now if it's a sheep,
link |
00:28:29.680
the sheep jumped over the fence,
link |
00:28:30.960
the dog jumped over the fence.
link |
00:28:32.440
So essentially the algorithm
link |
00:28:34.400
or the representation basically puts together
link |
00:28:36.520
these two concepts together.
link |
00:28:37.640
So it says, okay, dogs are going to be kind of related to sheep
link |
00:28:40.280
because both of them occur in the same context.
link |
00:28:42.760
Of course, now you can decide
link |
00:28:44.480
depending on your particular application downstream,
link |
00:28:46.800
you can say that dogs are absolutely not related to sheep
link |
00:28:49.200
because well, I really care about dog food, for example.
link |
00:28:53.040
I'm a dog food person
link |
00:28:54.240
and I really want to give this dog food
link |
00:28:55.640
to this particular animal.
link |
00:28:57.320
So depending on what your downstream application is,
link |
00:29:00.120
of course, this notion of similarity
link |
00:29:02.120
or this notion or this common sense
link |
00:29:03.960
that you've learned may not be applicable.
link |
00:29:05.840
But the point is basically that this,
link |
00:29:08.080
just predicting what the blanks are
link |
00:29:09.960
is going to take you really, really far.
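To make the distributional hypothesis concrete, here is a small sketch assuming the gensim library and a toy three-sentence corpus (so the actual similarity numbers are meaningless): words like "dog" and "sheep" that appear in the same contexts end up with nearby vectors.

```python
# Distributional-hypothesis sketch (assumes `pip install gensim`; toy corpus).
from gensim.models import Word2Vec

corpus = [
    ["the", "dog", "jumped", "over", "the", "fence"],
    ["the", "sheep", "jumped", "over", "the", "fence"],
    ["the", "cat", "jumped", "over", "the", "fence"],
]
model = Word2Vec(sentences=corpus, vector_size=16, window=2,
                 min_count=1, sg=1, epochs=200)
print(model.wv.similarity("dog", "sheep"))  # words sharing contexts drift together
```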
link |
00:29:11.760
So there's a nice feature of language
link |
00:29:14.040
that the number of words in a particular language
link |
00:29:18.720
is very large, but it's finite
link |
00:29:20.800
and it's actually not that large
link |
00:29:22.080
in the grand scheme of things.
link |
00:29:24.160
I still got up because we take it for granted.
link |
00:29:26.560
So first of all, when you say masking,
link |
00:29:28.400
you're talking about this very process
link |
00:29:30.120
of the blank of removing words from a sentence
link |
00:29:33.440
and then having the knowledge of what word went there
link |
00:29:36.760
in the initial data set.
link |
00:29:38.520
That's the ground truth that you're training on
link |
00:29:41.080
and then you're asking the neural network
link |
00:29:43.440
to predict what goes there.
link |
00:29:46.560
That's like a little trick.
link |
00:29:49.240
It's a really powerful trick.
link |
00:29:50.880
The question is how far that takes us
link |
00:29:53.320
and the other question is, is there other tricks?
link |
00:29:56.280
Because to me, it's very possible
link |
00:29:58.680
there's other very fascinating tricks.
link |
00:30:00.720
I'll give you an example in autonomous driving,
link |
00:30:05.200
there's a bunch of tricks
link |
00:30:06.920
that give you the self supervised signal back.
link |
00:30:10.360
For example, very similar to sentences,
link |
00:30:14.440
but not really, which is you have signals
link |
00:30:18.600
from humans driving the car
link |
00:30:20.240
because a lot of us drive cars to places.
link |
00:30:23.640
And so you can ask the neural network to predict
link |
00:30:27.800
what's going to happen in the next two seconds
link |
00:30:30.240
for a safe navigation through the environment.
link |
00:30:33.400
And the signal comes from the fact
link |
00:30:36.200
that you also have knowledge of what happened
link |
00:30:38.640
in the next two seconds
link |
00:30:40.040
because you have video of the data.
link |
00:30:42.160
The question in autonomous driving, as it is in language,
link |
00:30:46.760
can we learn how to drive autonomously
link |
00:30:50.160
based on that kind of self supervision?
link |
00:30:53.480
Probably the answer is no.
link |
00:30:55.360
The question is how good can we get?
link |
00:30:57.760
And the same with language, how good can we get?
link |
00:31:00.080
And are there other tricks?
link |
00:31:02.160
Like we get sometimes super excited
link |
00:31:03.840
by this trick that works really well.
link |
00:31:05.720
But I wonder, it's almost like mining for gold.
link |
00:31:09.120
I wonder how many signals there are in the data
link |
00:31:12.760
that could be leveraged that are like there, right?
link |
00:31:15.760
Is that, I just want to kind of linger on that
link |
00:31:18.600
because sometimes it's easy to think
link |
00:31:20.840
that maybe this masking process is self supervised learning.
link |
00:31:24.840
No, it's only one method.
link |
00:31:27.200
So there could be many, many other methods,
link |
00:31:29.280
many tricky methods,
link |
00:31:32.160
maybe interesting ways to leverage human computation
link |
00:31:35.440
in very interesting ways
link |
00:31:36.880
that might actually border on semi supervised learning,
link |
00:31:39.920
something like that.
link |
00:31:40.840
Obviously the internet is generated by humans
link |
00:31:43.480
at the end of the day.
link |
00:31:44.720
So all that to say is what's your sense
link |
00:31:48.760
in this particular context of language,
link |
00:31:50.720
how far can that masking process take us?
link |
00:31:54.680
So it has stood the test of time, right?
link |
00:31:56.240
I mean, so Word2Vec, the initial sort of NLP technique
link |
00:31:59.800
that was using this to now, for example,
link |
00:32:02.120
like all the BERT and all these big models
link |
00:32:04.920
that we get, BERT and Roberta, for example,
link |
00:32:07.560
all of them are still sort of based
link |
00:32:08.760
on the same principle of masking.
link |
00:32:10.600
It's taken us really far.
link |
00:32:12.120
I mean, you can actually do things like,
link |
00:32:14.400
oh, these two sentences are similar or not,
link |
00:32:16.240
whether this particular sentence follows this other sentence
link |
00:32:18.680
in terms of logic, so entailment.
link |
00:32:20.480
You can do a lot of these things with this,
link |
00:32:22.440
just this masking trick.
link |
00:32:23.640
Yeah, so I'm not sure if I can predict how far it can take us
link |
00:32:28.360
because when it first came out, when Word2Vec was out,
link |
00:32:31.520
I don't think a lot of us would have imagined
link |
00:32:33.520
that this would actually help us do some kind
link |
00:32:36.000
of entailment problems and really that well.
link |
00:32:38.560
And so just the fact that by just scaling up
link |
00:32:40.960
the amount of data that we're training on
link |
00:32:42.360
and using better and more powerful neural network
link |
00:32:45.160
architectures has taken us from that to this,
link |
00:32:47.600
is just showing you what poor predictors we are,
link |
00:32:52.600
as humans, how poor we are at predicting
link |
00:32:54.840
how successful a particular technique is going to be.
link |
00:32:57.360
So I think I can say something now,
link |
00:32:58.680
but like 10 years from now,
link |
00:33:00.040
I'll look completely stupid basically predicting this.
link |
00:33:02.760
In the language domain, is there something in your work
link |
00:33:07.120
that you find useful and insightful
link |
00:33:09.520
and transferable to computer vision,
link |
00:33:12.520
but also just, I don't know, beautiful and profound
link |
00:33:15.720
that I think carries through to the vision domain?
link |
00:33:18.120
I mean, the idea of masking has been very powerful.
link |
00:33:21.000
It has been used in vision as well for predicting,
link |
00:33:23.720
like you say, the next sort of,
link |
00:33:25.360
if you have a sequence of frames,
link |
00:33:27.240
then you predict what's going to happen in the next frame.
link |
00:33:29.400
So that's been very powerful.
link |
00:33:31.000
In terms of modeling, like just
link |
00:33:32.920
in terms of architecture,
link |
00:33:33.800
I think you had asked about transformers a while back.
link |
00:33:36.920
That has really become,
link |
00:33:38.360
like it has become super exciting for computer vision now.
link |
00:33:40.840
Like in the past, I would say year and a half,
link |
00:33:42.800
it's become really powerful.
link |
00:33:44.200
What's a transformer?
link |
00:33:45.240
Right.
link |
00:33:46.080
I mean, the core part of a transformer
link |
00:33:47.440
is something called the self attention model.
link |
00:33:49.040
So it came out of Google.
link |
00:33:50.480
And the idea basically is that if you have N elements,
link |
00:33:53.800
what you're creating is a way
link |
00:33:55.280
for all of these N elements to talk to each other.
link |
00:33:57.920
So the idea basically is that you are paying attention.
link |
00:34:01.840
Each element is paying attention
link |
00:34:03.200
to each of the other element.
link |
00:34:05.000
And basically by doing this,
link |
00:34:06.800
it's really trying to figure out,
link |
00:34:09.000
you're basically getting a much better view of the data.
link |
00:34:11.480
So for example, if you have a sentence of like four words,
link |
00:34:14.520
the point is if you get a representation
link |
00:34:16.360
or a feature for this entire sentence,
link |
00:34:18.360
it's constructed in a way
link |
00:34:19.800
such that each word has paid attention
link |
00:34:22.120
to everything else.
link |
00:34:23.840
Now, the reason it's like different from say,
link |
00:34:26.120
what you would do in a ConvNet is basically
link |
00:34:29.000
that in the ConvNet,
link |
00:34:29.840
you would only pay attention to a local window.
link |
00:34:31.400
So each word would only pay attention
link |
00:34:33.160
to its next neighbor or like one neighbor after that.
link |
00:34:36.160
And the same thing goes for images.
link |
00:34:37.880
In images, you would basically pay attention to pixels
link |
00:34:40.120
in a three cross three or a seven cross seven neighborhood.
link |
00:34:42.800
And that's it.
link |
00:34:43.680
Whereas with the transformer,
link |
00:34:44.840
that self attention mainly the sort of idea
link |
00:34:47.400
is that each element needs to pay attention
link |
00:34:49.680
to each other element.
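Written out, that "every element attends to every other element" idea is just scaled dot-product self-attention. Below is a bare-bones sketch assuming PyTorch, with a single head and none of the extras (multiple heads, residuals, layer norm) that real transformers add.

```python
# Scaled dot-product self-attention sketch (assumes PyTorch; single head, no extras).
import math
import torch
import torch.nn as nn

n, d = 4, 32                                   # e.g. a 4-word sentence, 32-dim features
x = torch.randn(1, n, d)                       # (batch, elements, feature_dim)
wq, wk, wv = (nn.Linear(d, d) for _ in range(3))

q, k, v = wq(x), wk(x), wv(x)
scores = q @ k.transpose(-2, -1) / math.sqrt(d)  # every element scored against every other
attn = scores.softmax(dim=-1)                    # row i: how much element i attends to each j
out = attn @ v                                   # features mixed according to attention
print(attn.shape)  # (1, 4, 4): all-pairs attention, unlike a ConvNet's local window
```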
link |
00:34:50.520
And when you say attention,
link |
00:34:52.040
maybe another way to phrase that
link |
00:34:53.440
is you're considering a context,
link |
00:34:57.720
a wide context in terms of the wide context of the sentence
link |
00:35:01.600
in understanding the meaning of a particular word
link |
00:35:05.200
and a computer vision.
link |
00:35:06.200
That's understanding a larger context
link |
00:35:07.920
to understand the local pattern
link |
00:35:10.080
of a particular local part of an image.
link |
00:35:13.120
Right.
link |
00:35:13.960
So basically if you have say,
link |
00:35:15.000
again a banana in the image,
link |
00:35:16.560
you're looking at the full image first.
link |
00:35:18.640
So whether it's like,
link |
00:35:19.960
you're looking at all the pixels that are of a kitchen
link |
00:35:22.200
or of a dining table and so on.
link |
00:35:23.800
And then you're basically looking at the banana also.
link |
00:35:25.960
Yeah.
link |
00:35:26.800
By the way, in terms of if we were to train
link |
00:35:28.200
the funny classifier,
link |
00:35:29.240
there's something funny about the word banana.
link |
00:35:32.000
Just wanted to anticipate that.
link |
00:35:34.600
I am wearing a banana shirt.
link |
00:35:36.280
Is there bananas on it?
link |
00:35:39.760
Okay. So masking has worked for the vision context as well.
link |
00:35:42.440
And so this transformer idea has worked as well.
link |
00:35:44.320
So basically looking at all the elements
link |
00:35:46.280
to understand a particular element
link |
00:35:48.120
has been really powerful in vision.
link |
00:35:49.880
The reason is like a lot of things
link |
00:35:52.040
when you're looking at them in isolation.
link |
00:35:53.440
So if you look at just a blob of pixels.
link |
00:35:55.560
So Antonio Torralba at MIT used to have this
link |
00:35:57.600
like really famous image,
link |
00:35:58.920
which I looked at when I was a PhD student,
link |
00:36:01.000
where he would basically have a blob of pixels
link |
00:36:02.800
and he would ask you,
link |
00:36:03.640
Hey, what is this?
link |
00:36:04.920
And it looked basically like a shoe
link |
00:36:06.800
or like it could look like a TV remote.
link |
00:36:08.840
It could look like anything.
link |
00:36:10.040
And it turns out it was a beer bottle.
link |
00:36:12.320
But I'm not sure.
link |
00:36:13.160
It was one of these three things,
link |
00:36:14.080
but basically he showed you the full picture
link |
00:36:15.400
and then it was very obvious what it was.
link |
00:36:17.520
But the point is,
link |
00:36:18.440
just by looking at that particular local window,
link |
00:36:20.560
you couldn't figure out
link |
00:36:21.880
because of resolution,
link |
00:36:22.880
because of other things,
link |
00:36:23.880
it's just not easy always to just figure out
link |
00:36:26.080
by looking at just the neighborhood of pixels,
link |
00:36:27.960
what these pixels are.
link |
00:36:29.680
And the same thing happens for language as well.
link |
00:36:32.000
For the parameters that have to learn
link |
00:36:33.920
something about the data,
link |
00:36:35.160
you need to give it the capacity
link |
00:36:37.160
to learn the essential things.
link |
00:36:39.160
Like if it's not actually able to receive the signal at all,
link |
00:36:42.680
then it's not going to be able to learn that signal.
link |
00:36:44.280
And in order to understand images,
link |
00:36:45.920
to understand language,
link |
00:36:47.280
you have to be able to see words in their full context.
link |
00:36:50.680
Okay.
link |
00:36:52.000
What is harder to solve?
link |
00:36:53.280
Vision or language?
link |
00:36:54.920
Visual intelligence or linguistic intelligence?
link |
00:36:57.840
So I'm going to say computer vision is harder.
link |
00:36:59.800
My reason for this is basically that
link |
00:37:02.760
language of course has a big structure to it
link |
00:37:04.960
because we developed it.
link |
00:37:06.840
Whereas vision is something that is common
link |
00:37:08.680
in a lot of animals.
link |
00:37:09.920
Everyone is able to get by,
link |
00:37:11.440
a lot of these animals on Earth
link |
00:37:12.880
are actually able to get by without language.
link |
00:37:15.080
And a lot of these animals,
link |
00:37:16.480
we also deem to be intelligent.
link |
00:37:18.280
So clearly intelligence does have
link |
00:37:20.920
like a visual component to it.
link |
00:37:22.520
And yes, of course in the case of humans,
link |
00:37:24.240
it of course also has a linguistic component.
link |
00:37:26.400
But it means that there is something far more fundamental
link |
00:37:28.720
about vision than there is about language.
link |
00:37:30.840
And I'm sorry to anyone who disagrees,
link |
00:37:32.960
but yes, this is what I feel.
link |
00:37:34.400
So that's being a little bit reflected
link |
00:37:36.960
in the challenges that have to do with the progress
link |
00:37:40.800
of self supervised learning, would you say?
link |
00:37:42.520
Or is that just the peculiar accidents
link |
00:37:45.560
of the progress of the AI community
link |
00:37:47.400
that we focused on?
link |
00:37:48.640
Or we discovered self attention
link |
00:37:50.280
and transformers in the context of language first.
link |
00:37:53.640
So the self supervised learning success
link |
00:37:57.320
for vision actually has not much to do with the transformers part.
link |
00:37:59.960
I would say it's actually been independent a little bit.
link |
00:38:02.480
I think it's just that the signal
link |
00:38:03.960
was a little bit different for vision
link |
00:38:06.760
than there was for like NLP
link |
00:38:08.120
and probably NLP folks discovered it before.
link |
00:38:11.240
So for vision, the main success
link |
00:38:12.680
has basically been this like crops so far,
link |
00:38:14.800
like taking different crops of images.
link |
00:38:16.960
Whereas for NLP, it was this masking thing.
link |
00:38:18.920
But also the level of success
link |
00:38:20.480
is still much higher for language.
link |
00:38:22.080
Yes, it has.
link |
00:38:22.920
So that has a lot to do with,
link |
00:38:24.760
I mean, I can get into a lot of details.
link |
00:38:26.920
For this particular question, let's go for it.
link |
00:38:28.520
Okay, so the first thing is language is very structured.
link |
00:38:32.240
So you are going to produce a distribution
link |
00:38:34.040
over a finite vocabulary.
link |
00:38:35.920
English has a finite number of words.
link |
00:38:37.680
It's actually not that large.
link |
00:38:39.520
And you need to produce basically,
link |
00:38:41.600
when you're doing this masking thing,
link |
00:38:42.760
all you need to do is basically tell me
link |
00:38:44.160
which one of these like 50,000 words it is.
link |
00:38:46.440
That's it.
link |
00:38:47.280
Now for vision, let's imagine doing the same thing.
link |
00:38:49.560
Okay, we're basically going to blank out
link |
00:38:51.480
a particular part of the image.
link |
00:38:52.600
And we ask the network or this neural network
link |
00:38:54.680
to predict what is present in this missing patch.
link |
00:38:58.080
It's combinatorially large, right?
link |
00:38:59.960
You have 256 pixel values.
link |
00:39:02.560
If you're even producing basically a seven cross seven
link |
00:39:04.840
or a 14 cross 14 like window of pixels
link |
00:39:08.000
at each of these 196 or each of these 49 locations,
link |
00:39:11.360
you have 256 values to predict.
link |
00:39:13.760
And so it's really, really large.
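A quick back-of-the-envelope version of this argument, using the illustrative numbers from above (a roughly 50,000-word vocabulary, 256 intensity values per pixel, and a 14 by 14 masked patch):

```python
vocab_size = 50_000          # a masked word: one softmax over a finite vocabulary
pixel_values = 256           # possible intensities per (grayscale) pixel
patch_locations = 14 * 14    # a 14 x 14 masked patch has 196 pixel locations

print(vocab_size)                        # 50,000 outcomes for a masked word
print(pixel_values ** patch_locations)   # astronomically many possible raw patches
```

Even predicting each pixel independently means hundreds of separate 256-way predictions, and the joint space of possible patches is combinatorially huge, which is the intractability being described.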
link |
00:39:15.280
And very quickly, the kind of like prediction problems
link |
00:39:19.000
that we're setting up are going to be extremely
link |
00:39:20.840
like intractable for us.
link |
00:39:22.800
And so the thing is for NLP, it has been really successful
link |
00:39:25.000
because we are very good at predicting,
link |
00:39:27.560
like doing this like distribution over a finite set.
link |
00:39:30.880
And the problem is when this set becomes really large,
link |
00:39:33.520
we're going to become really, really bad
link |
00:39:35.560
at making these predictions.
link |
00:39:37.000
And at solving basically this particular set of problems.
link |
00:39:41.040
So if you were to do it exactly in the same way
link |
00:39:44.240
as NLP for vision, there is very limited success.
link |
00:39:47.040
The way stuff is working right now
link |
00:39:49.000
is actually not by predicting these masks.
link |
00:39:51.680
It's basically by saying that you take these two
link |
00:39:53.680
like crops from the image,
link |
00:39:55.160
you get a feature representation from it.
link |
00:39:57.080
And just saying that these two features,
link |
00:39:58.680
so they're like vectors,
link |
00:40:00.440
just saying that the distance between these vectors
link |
00:40:02.040
should be small.
link |
00:40:03.240
And so it's a very different way of learning
link |
00:40:06.600
from the visual signal than there is from NLP.
link |
00:40:09.160
Okay, the other reason is the distributional hypothesis
link |
00:40:11.320
that we talked about for NLP, right?
link |
00:40:12.920
So a word given its context,
link |
00:40:15.160
basically the context actually supplies a lot
link |
00:40:16.760
of meaning to the word.
link |
00:40:18.440
Now, because there are just finite number of words
link |
00:40:22.280
and there is a finite way in which we compose them,
link |
00:40:25.760
of course, the same thing holds for pixels,
link |
00:40:27.440
but in language, there's a lot of structure, right?
link |
00:40:29.760
So I always say whatever,
link |
00:40:31.000
the dash jumped over the fence, for example.
link |
00:40:33.760
There are lots of these sentences that you'll get.
link |
00:40:36.720
And from this, you can actually look at
link |
00:40:38.680
this particular sentence might occur
link |
00:40:40.160
in a lot of different contexts as well.
link |
00:40:41.480
This exact same sentence might occur in a different context.
link |
00:40:44.080
So the sheep jumped over the fence,
link |
00:40:45.560
the cat jumped over the fence,
link |
00:40:46.800
the dog jumped over the fence.
link |
00:40:48.160
So you immediately get a lot of these words,
link |
00:40:50.440
which are, because this particular token itself
link |
00:40:52.720
has so much meaning, you get a lot of these tokens
link |
00:40:54.840
or these words which are actually going to have
link |
00:40:57.440
sort of this related meaning across, given this context.
link |
00:41:00.560
Whereas for vision, it's much harder.
link |
00:41:02.640
Because just by the pure way we capture images,
link |
00:41:05.600
lighting can be different.
link |
00:41:07.440
There might be different noise in the sensor.
link |
00:41:09.800
So the thing is you're capturing a physical phenomenon
link |
00:41:12.200
and then you're basically going through
link |
00:41:13.840
a very complicated pipeline of image processing
link |
00:41:16.360
and then you're translating that
link |
00:41:17.440
into some kind of digital signal.
link |
00:41:20.400
Whereas with language, you write it down
link |
00:41:23.520
and you transfer it to a digital signal,
link |
00:41:25.040
almost like it's a lossless transfer.
link |
00:41:27.520
And each of these tokens are very, very well defined.
link |
00:41:30.160
There could be a little bit of an argument there
link |
00:41:32.840
because language as written down
link |
00:41:36.120
is a projection of thought.
link |
00:41:39.400
This is one of the open questions is
link |
00:41:42.560
if you perfectly can solve language,
link |
00:41:46.320
are you getting close to being able to solve,
link |
00:41:49.360
easily with flying colors past the Turing test
link |
00:41:51.840
kind of thing.
link |
00:41:52.840
So that's, it's similar, but different
link |
00:41:56.560
and the computer vision problem, the 2D plane,
link |
00:41:59.760
is a projection of the three dimensional world.
link |
00:42:02.680
So perhaps there are similar problems there.
link |
00:42:05.680
Maybe this is a good, yeah.
link |
00:42:06.680
I think what I'm saying is NLP is not easy.
link |
00:42:08.600
Of course, don't get me wrong.
link |
00:42:09.560
Like abstract thought expressed in knowledge
link |
00:42:12.960
or knowledge basically expressed in language
link |
00:42:14.640
is really hard to understand, right?
link |
00:42:16.760
I mean, we've been communicating with language for so long
link |
00:42:19.200
and it's, it is of course a very complicated concept.
link |
00:42:22.040
The thing is, at least getting like some,
link |
00:42:25.600
somewhat reasonable, like being able to solve
link |
00:42:28.600
some kind of reasonable tasks with language,
link |
00:42:30.960
I would say slightly easier than it is
link |
00:42:32.480
with computer vision.
link |
00:42:33.680
Yeah, I would say, yeah.
link |
00:42:35.400
So that's well put.
link |
00:42:36.640
I would say getting impressive performance on language
link |
00:42:40.880
is easier.
link |
00:42:43.400
I feel like for both language and computer vision,
link |
00:42:45.360
there's going to be this wall of like,
link |
00:42:49.480
like this hump you have to overcome
link |
00:42:52.280
to achieve super human level performance
link |
00:42:54.840
or human level performance.
link |
00:42:56.640
And I feel like for language, that wall is farther away.
link |
00:43:00.240
So you can get pretty nice.
link |
00:43:01.920
You can do a lot of tricks.
link |
00:43:04.120
You can show really impressive performance.
link |
00:43:06.560
You can even fool people that your tweeting
link |
00:43:09.720
or your blog post writing
link |
00:43:11.520
or your question answering has intelligence behind it.
link |
00:43:16.920
But to truly demonstrate understanding of dialogue,
link |
00:43:22.400
of continuous long form dialogue,
link |
00:43:25.040
that would require perhaps big breakthroughs.
link |
00:43:28.600
In the same way in computer vision,
link |
00:43:30.440
I think the big breakthroughs need to happen earlier
link |
00:43:33.400
to achieve impressive performance.
link |
00:43:36.640
This might be a good place to, you already mentioned it,
link |
00:43:38.760
but what is contrastive learning
link |
00:43:41.120
and what are energy based models?
link |
00:43:43.840
Contrastive learning is sort of the paradigm of learning
link |
00:43:46.840
where the idea is that you are learning this embedding space
link |
00:43:50.680
or so you're learning this sort of vector space
link |
00:43:52.680
of all your concepts.
link |
00:43:54.520
And the way you learn that is basically by contrasting.
link |
00:43:56.800
So the idea is that you have a sample,
link |
00:43:59.120
you have another sample that's related to it.
link |
00:44:01.000
So that's called the positive
link |
00:44:02.880
and you have another sample that's not related to it.
link |
00:44:05.080
So that's negative.
link |
00:44:06.080
So for example, let's just take an NLP example,
link |
00:44:08.320
or a simple example in computer vision.
link |
00:44:10.960
So you have an image of a cat,
link |
00:44:12.760
you have an image of a dog
link |
00:44:14.480
and for whatever application that you're doing,
link |
00:44:16.520
say you're trying to figure out what pets are,
link |
00:44:18.880
you think that these two images are related.
link |
00:44:20.280
So image of a cat and dog are related,
link |
00:44:22.280
but now you have another third image of a banana
link |
00:44:25.400
because you don't like that word.
link |
00:44:27.000
So now you basically have this banana.
link |
00:44:28.920
Thank you for speaking to the crowd.
link |
00:44:30.640
And so you take both of these images
link |
00:44:32.560
and you take the image from the cat,
link |
00:44:34.440
the image from the dog,
link |
00:44:35.280
you get a feature from both of them.
link |
00:44:36.760
And now what you're training the network to do
link |
00:44:38.160
is basically pull both of these features together
link |
00:44:42.080
while pushing them away from the feature of a banana.
link |
00:44:44.720
So this is the contrastive part.
link |
00:44:45.840
So you're contrasting against the banana.
link |
00:44:47.840
So there's always this notion of a negative and a positive.
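For concreteness, here is a minimal sketch of that pull-together, push-apart idea, assuming we already have feature vectors for the two related images (the cat and the dog) and the unrelated one (the banana). This is just the intuition in code, not any specific published method.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, negative, margin=0.5):
    # Pull anchor and positive together, push the negative at least `margin` further away.
    d_pos = 1 - F.cosine_similarity(anchor, positive, dim=-1)   # want this small
    d_neg = 1 - F.cosine_similarity(anchor, negative, dim=-1)   # want this large
    return F.relu(d_pos - d_neg + margin).mean()

cat, dog, banana = torch.randn(3, 128)   # stand-ins for features from some network
loss = contrastive_loss(cat, dog, banana)
```

In practice, methods of this flavor contrast against many negatives per batch rather than a single one, which is exactly the scaling issue that comes up later in the conversation.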
link |
00:44:51.520
Now, energy based models are like one way
link |
00:44:54.160
that Yann sort of explains a lot of these methods.
link |
00:44:57.480
So Yann basically, I think a couple of years or more
link |
00:45:00.920
than that, like when I joined Facebook,
link |
00:45:02.840
Yann used to keep mentioning this word energy based models.
link |
00:45:05.080
And of course, I had no idea what he was talking about.
link |
00:45:07.200
So then one day I caught him in one of the conference rooms
link |
00:45:09.680
and I'm like, can you please tell me what this is?
link |
00:45:11.240
So then like very patiently,
link |
00:45:13.120
he sat down with like a marker and a whiteboard.
link |
00:45:15.960
And his idea basically is that
link |
00:45:18.280
rather than talking about probability distributions,
link |
00:45:20.280
you can talk about energies of models.
link |
00:45:21.920
So models are trying to minimize certain energies
link |
00:45:23.960
in certain space,
link |
00:45:24.960
or they're trying to maximize a certain kind of energy.
link |
00:45:28.200
And the idea basically is that
link |
00:45:29.760
you can explain a lot of the contrastive models,
link |
00:45:32.200
GANs for example, which are like
link |
00:45:33.880
generative adversarial networks.
link |
00:45:36.000
A lot of these modern learning methods
link |
00:45:37.880
or VAEs, which are variational autoencoders,
link |
00:45:39.920
you can really explain them very nicely
link |
00:45:41.840
in terms of an energy function
link |
00:45:43.160
that they're trying to minimize or maximize.
link |
00:45:45.320
And so by putting this common sort of language
link |
00:45:48.360
for all of these models,
link |
00:45:49.720
what looks very different in machine learning
link |
00:45:51.800
that VAEs are very different from what GANs are,
link |
00:45:54.160
are very different from what contrastive models are,
link |
00:45:56.440
you actually get a sense of like,
link |
00:45:57.560
oh, these are actually very, very related.
link |
00:46:00.120
It's just that the way or the mechanism
link |
00:46:02.520
in which they're sort of maximizing
link |
00:46:04.200
or minimizing this energy function is slightly different.
link |
00:46:07.000
It's revealing the commonalities between all these approaches
link |
00:46:10.400
and putting a sexy word on top of it, like energy.
link |
00:46:12.960
And so similarities, two things that are similar
link |
00:46:15.520
have low energy.
link |
00:46:16.720
Like the low energy signifying similarity.
link |
00:46:20.320
Right, exactly.
link |
00:46:21.160
So basically the idea is that if you were to imagine
link |
00:46:23.520
like the embedding as a manifold, a 2D manifold,
link |
00:46:26.440
you would get a hill or like a high sort of peak
link |
00:46:28.880
in the energy manifold,
link |
00:46:30.560
wherever two things are not related.
link |
00:46:32.360
And basically you would have like a dip
link |
00:46:34.040
where two things are related.
link |
00:46:35.480
So you'd get a dip in the manifold.
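One schematic way to write this energy view down, with generic notation assumed here rather than taken from any particular paper: let F_theta(x, y) be the energy the model assigns to a pair, low for related pairs and high for unrelated ones. A margin-based contrastive objective then reads:

```latex
\mathcal{L}(\theta) = \max\bigl(0,\; m + F_\theta(x, y^{+}) - F_\theta(x, y^{-})\bigr)
```

Training carves dips into the energy landscape at related pairs (x, y+) and raises hills at unrelated ones (x, y-); non-contrastive methods keep only the lower-the-energy-of-positives part and prevent collapse in some other way, as discussed later.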
link |
00:46:37.040
And in the self supervised context,
link |
00:46:40.160
how do you know two things are related
link |
00:46:42.240
and two things are not related?
link |
00:46:44.080
Right.
link |
00:46:44.920
This is where all the sort of ingenuity or tricks comes in,
link |
00:46:47.360
right?
link |
00:46:48.200
So for example, like you can take the fill in the blank
link |
00:46:51.720
problem or you can take in the context problem.
link |
00:46:54.400
And what you can say is two words
link |
00:46:55.960
that are in the same context are related.
link |
00:46:57.840
Two words that are in different contexts are not related.
link |
00:47:00.600
For images, basically two crops from the same image
link |
00:47:03.040
are related and whereas a third image is not related at all.
link |
00:47:06.520
For a video, it can be two frames from that video
link |
00:47:08.840
are related because they're likely to contain
link |
00:47:10.840
the same sort of concepts in them.
link |
00:47:12.760
Whereas a third frame from a different video
link |
00:47:14.480
is not related.
link |
00:47:15.640
So it basically is, it's a very general term.
link |
00:47:18.360
Contrastive learning has nothing really
link |
00:47:19.720
to do with self supervised learning.
link |
00:47:20.920
It actually is very popular in, for example,
link |
00:47:23.280
like any kind of metric learning
link |
00:47:25.240
or any kind of embedding learning.
link |
00:47:26.960
So it's also used in supervised learning.
link |
00:47:28.960
It's also, and the thing is because we are not really
link |
00:47:31.360
using labels to get these positive or negative pairs,
link |
00:47:34.600
it can basically also be used for self supervised learning.
link |
00:47:37.680
So you mentioned one of the ideas in the vision context
link |
00:47:41.080
to that works is to have different crops.
link |
00:47:45.280
So you could think of that as a way to sort of
link |
00:47:47.840
manipulating the data to generate examples that are similar.
link |
00:47:53.320
Obviously, there's a bunch of other techniques.
link |
00:47:55.800
You mentioned lighting as a very, in images,
link |
00:47:59.480
lighting is something that varies a lot
link |
00:48:01.680
and you can artificially change those kinds of things.
link |
00:48:04.520
There's the whole broad field of data augmentation
link |
00:48:07.720
which manipulates images in order to increase arbitrarily
link |
00:48:11.800
the size of the data set.
link |
00:48:13.400
First of all, what is data augmentation?
link |
00:48:15.840
And second of all, what's the role of data augmentation
link |
00:48:18.120
in self supervised learning and contrastive learning?
link |
00:48:22.000
So data augmentation is just a way, like you said,
link |
00:48:24.800
it's basically a way to augment the data.
link |
00:48:26.680
So you have say N samples and what you do is
link |
00:48:29.320
you basically define some kind of transforms for the sample.
link |
00:48:32.280
So you take your say image and then you define a transform
link |
00:48:34.880
where you can just increase the colors or the brightness
link |
00:48:38.680
of the image or increase or decrease the contrast of the image,
link |
00:48:41.320
for example, or take different crops of it.
link |
00:48:44.520
So data augmentation is just a process
link |
00:48:46.240
to basically perturb the data or augment the data.
link |
00:48:51.080
And so it has played a fundamental role
link |
00:48:53.160
for computer vision for self supervised learning,
link |
00:48:55.320
especially the way most of the current methods
link |
00:48:58.920
work, contrastive or otherwise,
link |
00:49:02.720
in the case of images, is by taking an image
link |
00:49:05.320
and then computing basically two perturbations of it.
link |
00:49:08.560
So these can be two different crops of the image
link |
00:49:11.480
with like different types of lighting
link |
00:49:12.920
or different contrast or different colors.
link |
00:49:15.000
So you jitter the colors a little bit and so on.
link |
00:49:17.840
And now the idea is basically because it's the same object
link |
00:49:21.720
or because it's like related concepts
link |
00:49:23.440
in both of these perturbations,
link |
00:49:25.240
you want the features from both of these perturbations
link |
00:49:27.960
to be similar.
link |
00:49:28.920
So now you can use a variety of different ways
link |
00:49:31.320
to enforce this constraint, like these features being similar.
link |
00:49:34.200
You can do this by contrastive learning.
link |
00:49:36.040
So basically both of these things are positives,
link |
00:49:38.440
a third sort of image is negative.
link |
00:49:40.440
You can do this basically by like clustering.
link |
00:49:43.480
For example, you can say that both of these images should,
link |
00:49:46.960
the features from both of these images
link |
00:49:48.120
should belong in the same cluster because they're related.
link |
00:49:50.560
Whereas image, like another image
link |
00:49:52.280
should belong to a different cluster.
link |
00:49:53.880
So there's a variety of different ways
link |
00:49:55.160
to basically enforce this particular constraint.
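For concreteness, a minimal sketch of this two-view recipe using torchvision transforms; the specific parameter values here are illustrative, not the ones from any particular paper.

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224),                      # take a random crop
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4,
                           saturation=0.4, hue=0.1),        # jitter colors and contrast
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
])

def two_views(pil_image):
    # Two independently augmented "positive" views of the same image;
    # the network is then trained to give them similar features.
    return augment(pil_image), augment(pil_image)
```

Both views go through the same network, and the loss, whether contrastive, clustering based, or distillation based, enforces that their features match.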
link |
00:49:57.560
By the way, when you say features,
link |
00:49:59.080
it means there's a very large neural network
link |
00:50:01.680
that extracting patterns from the image
link |
00:50:03.640
and the kind of patterns that extracts
link |
00:50:05.160
should be either identical or very similar.
link |
00:50:08.440
That's what that means.
link |
00:50:09.640
So the neural network basically takes in the image
link |
00:50:11.880
and then outputs a set of basically a vector of numbers.
link |
00:50:16.600
And that's the feature.
link |
00:50:17.720
And you want this feature for both of these different crops
link |
00:50:20.840
that you computed to be similar.
link |
00:50:22.120
So you want this vector to be identical
link |
00:50:24.520
in its entries, for example.
link |
00:50:26.120
Be like literally close in this multidimensional space
link |
00:50:30.040
to each other.
link |
00:50:31.640
And like you said, close can mean part of the same cluster
link |
00:50:34.760
or something like that in this large space.
link |
00:50:37.440
First of all, that, I wonder if there is connection
link |
00:50:40.680
to the way humans learn to this.
link |
00:50:43.760
Almost like maybe subconsciously,
link |
00:50:48.040
in order to understand a thing,
link |
00:50:50.120
you kind of have to see it from two, three multiple angles.
link |
00:50:54.680
I wonder, I have a lot of friends who are neuroscientists
link |
00:50:58.120
maybe and cognitive scientists.
link |
00:50:59.960
I wonder if that's in there somewhere.
link |
00:51:03.200
Like in order for us to place a concept in its proper place,
link |
00:51:08.560
we have to basically crop it in all kinds of ways,
link |
00:51:12.440
do basic data augmentation on it
link |
00:51:14.400
in whatever very clever ways that the brain likes to do.
link |
00:51:17.640
Right.
link |
00:51:19.000
Like spinning around in our mind somehow
link |
00:51:21.160
that that is very effective.
link |
00:51:23.080
So I think for some of them, we need to do it.
link |
00:51:25.040
So like babies, for example, pick up objects,
link |
00:51:27.000
like move them, put them, go sit there and whatnot.
link |
00:51:30.120
But for certain other things,
link |
00:51:31.200
actually we are good at imagining it as well.
link |
00:51:33.800
So if you, I have never seen, for example,
link |
00:51:35.960
an elephant from the top.
link |
00:51:36.960
I've never basically looked at it from top down.
link |
00:51:39.560
But if you showed me a picture of it,
link |
00:51:40.720
I could very well tell you that that's an elephant.
link |
00:51:43.760
So I think some of it, we just like,
link |
00:51:45.320
we naturally build it or transfer it from other objects
link |
00:51:47.840
that we've seen to imagine what it's going to look like.
link |
00:51:50.960
Has anyone done that with the augmentation?
link |
00:51:53.320
Like imagine all the possible things
link |
00:51:56.960
that are occluded or not there,
link |
00:51:59.920
but not just like normal things, like wild things,
link |
00:52:03.400
but they're nevertheless physically consistent.
link |
00:52:07.000
So I mean, people do kind of like occlusion based
link |
00:52:10.800
augmentation as well.
link |
00:52:11.840
So you place in like a random like box, gray box
link |
00:52:14.800
to sort of mask out a certain part of the image.
link |
00:52:17.480
And the thing is basically you're kind of occluding it.
link |
00:52:20.040
For example, you place it say on half of a person's face.
link |
00:52:23.600
So basically saying that, you know,
link |
00:52:24.960
something below their nose is occluded
link |
00:52:26.720
because it's grayed out.
link |
00:52:28.280
So, this is kind of.
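For reference, this kind of gray-box occlusion exists off the shelf, for example as torchvision's RandomErasing; the parameter values here are just illustrative.

```python
from torchvision import transforms

occlude = transforms.Compose([
    transforms.ToTensor(),
    transforms.RandomErasing(p=1.0, scale=(0.1, 0.3), value=0.5),  # gray rectangle
])
```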
link |
00:52:29.120
No, I meant like, you have like, what is it?
link |
00:52:31.680
A table and you can't see behind the table.
link |
00:52:34.240
And you imagine there's a bunch of elves
link |
00:52:37.040
with bananas behind the table.
link |
00:52:38.800
Like I wonder if there's useful to have a,
link |
00:52:41.920
a wild imagination for the network.
link |
00:52:44.160
Because that's possible.
link |
00:52:45.280
Well, maybe not elves, but like puppies
link |
00:52:47.280
and kittens or something like that.
link |
00:52:48.960
Just have a wild imagination and like constantly
link |
00:52:53.080
be generating that wild imagination.
link |
00:52:55.040
Cause in terms of data augmentation
link |
00:52:57.520
that's currently applied, it's super ultra very boring.
link |
00:53:01.160
It's very basic data augmentation.
link |
00:53:02.880
I wonder if, I wonder if there's a benefit
link |
00:53:05.240
to being wildly imaginable while trying to be
link |
00:53:08.880
consistent with physical reality.
link |
00:53:11.840
I think it's a kind of a chicken and egg problem, right?
link |
00:53:14.160
Because to have like amazing data augmentation,
link |
00:53:16.360
you need to understand what the scene is.
link |
00:53:18.480
And what we're trying to do data augmentation
link |
00:53:20.600
to learn what a scene is anyway.
link |
00:53:22.000
So it's basically just keeps going on.
link |
00:53:23.680
Before you understand it, just put elves with bananas
link |
00:53:26.000
until you know it's not to be true.
link |
00:53:29.320
Just like children have a wild imagination
link |
00:53:31.640
until the adults ruin it all.
link |
00:53:33.920
Okay.
link |
00:53:34.760
So what are the different kinds of data augmentation
link |
00:53:36.920
that you've seen to be effective in visual intelligence?
link |
00:53:40.760
For like vision, it's a lot of these image filtering
link |
00:53:43.600
operations.
link |
00:53:44.440
So like blurring the image, you know,
link |
00:53:46.720
all the kind of Instagram filters that you can think of.
link |
00:53:49.400
So like arbitrarily like make the red super red,
link |
00:53:52.480
make the green super greens, like saturate the image.
link |
00:53:55.800
Rotation cropping.
link |
00:53:56.960
Rotation cropping.
link |
00:53:58.000
Exactly.
link |
00:53:58.840
All of these kind of things.
link |
00:53:59.680
Like I said, lighting is a really interesting one to me.
link |
00:54:02.640
Like that feels like really complicated to do.
link |
00:54:04.760
So I mean, they don't, the augmentations that we work on
link |
00:54:07.320
aren't like that involved.
link |
00:54:08.840
So they're not going to be like physically realistic
link |
00:54:10.520
versions of lighting.
link |
00:54:11.360
It's not that you're assuming that there's a light source
link |
00:54:13.520
up and then you're moving it to the right.
link |
00:54:15.080
And then what does the thing look like?
link |
00:54:17.000
It's really more about like brightness of the image,
link |
00:54:19.160
overall brightness of the image or overall contrast
link |
00:54:21.440
of the image and so on.
link |
00:54:22.480
But this is a really important point to me.
link |
00:54:25.080
I always thought that data augmentation
link |
00:54:28.680
holds an important key to big improvements in machine
link |
00:54:33.120
learning.
link |
00:54:33.840
And it seems that it is an important aspect
link |
00:54:36.640
of self supervised learning.
link |
00:54:39.080
So I wonder if there's big improvements
link |
00:54:41.480
to be achieved on much more intelligent kinds of data
link |
00:54:45.280
augmentation.
link |
00:54:46.680
For example, currently, maybe you can correct me
link |
00:54:49.240
if I'm wrong, data augmentation is not parametrized.
link |
00:54:53.280
You're not learning.
link |
00:54:54.400
You're not learning.
link |
00:54:55.240
To me, it seems like data augmentation potentially
link |
00:54:59.760
should involve more learning than the learning process
link |
00:55:03.360
itself.
link |
00:55:05.320
You're almost like thinking of like generative kind of,
link |
00:55:08.800
it's the elves with bananas.
link |
00:55:10.200
You're trying to, it's like very active imagination
link |
00:55:13.240
of messing with the world and teaching that mechanism
link |
00:55:16.480
for messing with the world to be realistic.
link |
00:55:20.440
Because that feels like, I mean, it's imagination.
link |
00:55:24.680
Just as you said, it feels like us humans
link |
00:55:27.240
are able to maybe sometimes subconsciously
link |
00:55:30.680
imagine, before we see the thing,
link |
00:55:33.000
imagine what we're expecting to see.
link |
00:55:35.480
Like maybe several options.
link |
00:55:37.240
And especially, we probably forgot, but when we were younger,
link |
00:55:40.480
probably the possibilities were wild.
link |
00:55:42.600
There are more numerous.
link |
00:55:44.160
And then as we get older, we become to understand the world
link |
00:55:47.360
and the possibilities of what we might see
link |
00:55:51.000
becomes less and less and less.
link |
00:55:53.080
So I wonder if you think there's a lot of breakthroughs yet
link |
00:55:55.800
to be had in data augmentation.
link |
00:55:57.160
And maybe also, can you just comment on the stuff we have?
link |
00:55:59.760
Is that a big part of self supervised learning?
link |
00:56:02.080
Yes.
link |
00:56:02.320
So data augmentation is like key to self supervised learning.
link |
00:56:05.480
It's the kind of augmentation that we're using.
link |
00:56:08.240
And basically, the fact that we're
link |
00:56:10.240
trying to learn these neural networks that
link |
00:56:12.640
are predicting these features from images that
link |
00:56:14.520
are robust under data augmentation
link |
00:56:17.040
has been the key for visual self supervised learning.
link |
00:56:19.480
And they play a fairly fundamental role to it.
link |
00:56:22.320
Now, the irony of all of this is that deep learning purists
link |
00:56:26.120
will say the entire point of deep learning
link |
00:56:28.360
is that you feed in the pixels to the neural network.
link |
00:56:31.120
And it should figure out the patterns on its own.
link |
00:56:33.080
So if it really wants to look at edges,
link |
00:56:34.440
it should look at edges.
link |
00:56:35.640
You shouldn't really go and handcraft these features.
link |
00:56:38.600
You shouldn't go tell it that look at edges.
link |
00:56:41.160
So data augmentation should basically
link |
00:56:43.080
be in the same category.
link |
00:56:44.400
Why should we tell the network or tell this entire learning
link |
00:56:47.480
paradigm what kinds of data augmentation
link |
00:56:49.520
that we're looking for?
link |
00:56:50.840
We are encoding a very sort of human specific bias there
link |
00:56:55.200
that we know things are, if you change the contrast of the image,
link |
00:56:59.160
it should still be an apple.
link |
00:57:00.240
Or it should still be apple, not banana.
link |
00:57:02.200
Thank you.
link |
00:57:03.520
Basically, if we change colors, it
link |
00:57:05.960
should still be the same kind of concept.
link |
00:57:08.040
Of course, this is not one.
link |
00:57:09.880
This doesn't feel like super satisfactory,
link |
00:57:12.480
because a lot of our human knowledge or our human supervision
link |
00:57:15.720
is actually going into the data augmentation.
link |
00:57:17.600
So although we are calling it self supervised learning,
link |
00:57:19.680
a lot of the human knowledge is actually
link |
00:57:21.360
being encoded in the data augmentation process.
link |
00:57:23.520
So it's really like we've kind of sneaked away
link |
00:57:25.480
the supervision at the input.
link |
00:57:27.120
And we're really designing these nice list of data
link |
00:57:29.680
augmentations that are working very well.
link |
00:57:31.640
Of course, the idea is that it's much easier
link |
00:57:33.720
to design a list of data augmentation than it is to do.
link |
00:57:36.600
So humans are, nevertheless,
link |
00:57:38.160
doing less and less work, and maybe leveraging
link |
00:57:40.600
their creativity more and more.
link |
00:57:42.640
And when we say data augmentation is not parameterized,
link |
00:57:45.040
it means it's not part of the learning process.
link |
00:57:48.200
Do you think it's possible to integrate some of the data
link |
00:57:51.320
augmentation into the learning process?
link |
00:57:53.280
I think so.
link |
00:57:53.880
I think so.
link |
00:57:54.280
And in fact, it will be really beneficial for us,
link |
00:57:57.400
because a lot of these data augmentation that we use in vision
link |
00:58:00.360
are very extreme.
link |
00:58:01.840
For example, when you have certain concepts, again, a banana,
link |
00:58:07.240
you take the banana and then basically you
link |
00:58:09.200
change the color of the banana.
link |
00:58:10.520
So you make it a purple banana.
link |
00:58:12.440
Now, this data augmentation process
link |
00:58:14.160
is actually independent of the image, it
link |
00:58:16.320
has no notion of what is present in the image.
link |
00:58:18.800
So it can change this color arbitrarily.
link |
00:58:20.480
It can make it a red banana as well.
link |
00:58:22.520
And now what we're doing is we're
link |
00:58:23.760
telling the neural network that this red banana and,
link |
00:58:27.160
so a crop of this image which has the red banana
link |
00:58:29.240
and a crop of this image where I change the color to a purple
link |
00:58:31.480
banana should be, the features should be the same.
link |
00:58:34.080
Now bananas aren't red or purple, mostly.
link |
00:58:36.680
So really the data augmentation process
link |
00:58:38.520
should take into account what is present in the image
link |
00:58:41.120
and what are the kinds of physical realities that are possible.
link |
00:58:43.720
It shouldn't be completely independent of the image.
link |
00:58:45.840
So you might get big gains if you, instead of being drastic,
link |
00:58:49.960
do subtle augmentation, but realistic augmentation.
link |
00:58:53.240
Right, realistic.
link |
00:58:54.040
I'm not sure if it's subtle, but realistic for sure.
link |
00:58:56.280
If it's realistic, then even subtle augmentation
link |
00:58:59.560
will give you big benefits.
link |
00:59:00.640
Exactly, yeah.
link |
00:59:01.840
And it will be, for particular domains,
link |
00:59:05.040
you might actually see, if, for example, now
link |
00:59:07.480
we're doing medical imaging, there
link |
00:59:09.040
are going to be certain kinds of geometric augmentation
link |
00:59:11.400
that are not really going to be very valid for the human body.
link |
00:59:15.080
So if you were to actually loop in data augmentation
link |
00:59:18.280
into the learning process, it will actually be much more
link |
00:59:20.480
useful.
link |
00:59:21.440
Now, this actually does take us to maybe a semi supervised
link |
00:59:24.480
kind of a setting because you do want to understand
link |
00:59:27.440
what is it that you're trying to solve.
link |
00:59:29.120
So currently self supervised learning kind of
link |
00:59:31.200
operates in the wild, right?
link |
00:59:32.720
So you do the self supervised learning,
link |
00:59:34.960
and the purists and all of us basically say that, OK,
link |
00:59:37.800
this should learn useful representations,
link |
00:59:39.440
and they should be useful for any kind of end task,
link |
00:59:42.320
no matter it's like banana recognition
link |
00:59:44.280
or like autonomous driving.
link |
00:59:46.200
Now, it's a tall order.
link |
00:59:47.760
Maybe the first baby step for us should be that, OK,
link |
00:59:50.760
if you're trying to loop in this data augmentation
link |
00:59:52.640
into the learning process, then we at least
link |
00:59:55.040
need to have some sense of what we're trying to do.
link |
00:59:56.880
Are we trying to distinguish between different types
link |
00:59:58.800
of bananas, or are we trying to distinguish between banana
link |
01:00:01.200
and apple, or are we trying to do all of these things at once?
link |
01:00:04.400
And so some notion of what happens at the end
link |
01:00:07.960
might actually help us do much better at this side.
link |
01:00:10.880
Let me ask you a ridiculous question.
link |
01:00:14.320
If I were to give you like a black box,
link |
01:00:16.280
like a choice to have an arbitrary large data
link |
01:00:19.200
set of real natural data versus really
link |
01:00:23.680
good data augmentation algorithms,
link |
01:00:26.600
which would you like to train in a self supervised way on?
link |
01:00:31.280
So natural data from the internet are arbitrary large,
link |
01:00:35.040
so unlimited data.
link |
01:00:37.360
Or it's like more controlled, good data augmentation
link |
01:00:41.760
on the finite data set.
link |
01:00:43.600
The thing is like because our learning algorithms
link |
01:00:45.720
for vision right now really rely on data augmentation,
link |
01:00:49.360
even if you were to give me like an infinite source of like
link |
01:00:51.880
image data, I still need a good data augmentation algorithm.
link |
01:00:54.520
You need something that tells you
link |
01:00:56.040
that two things are similar.
link |
01:00:57.360
Right.
link |
01:00:58.000
And so something, because you've given me
link |
01:00:59.880
an arbitrarily large data set, I still
link |
01:01:01.960
need to use data augmentation to take that image, construct
link |
01:01:05.320
like these two perturbations of it, and then learn from it.
link |
01:01:08.240
So the thing is our learning paradigm
link |
01:01:09.920
is very primitive right now.
link |
01:01:11.880
Even if you were to give me lots of images,
link |
01:01:13.760
it's still not really useful.
link |
01:01:15.200
A good data augmentation algorithm
link |
01:01:16.520
is actually going to be more useful.
link |
01:01:18.120
So you can reduce down the amount of data
link |
01:01:21.160
that you give me by like 10 times.
link |
01:01:22.560
But if you were to give me a good data augmentation algorithm,
link |
01:01:25.040
that will probably do better than giving me like 10 times
link |
01:01:27.920
the size of that data, but me having to rely on a very
link |
01:01:31.240
primitive data augmentation algorithm.
link |
01:01:32.840
Through tagging and all those kinds of things,
link |
01:01:35.040
is there a way to discover things that are semantically
link |
01:01:38.200
similar on the internet?
link |
01:01:39.640
Obviously there is, but it might be extremely noisy.
link |
01:01:42.560
And the difference might be farther away
link |
01:01:45.800
than you would be comfortable with.
link |
01:01:47.840
So I mean, yes, tagging will help you a lot.
link |
01:01:49.720
It'll actually go a very long way in figuring out
link |
01:01:52.160
what images are related or not.
link |
01:01:54.360
And then so, but then the purists would argue that when
link |
01:01:57.840
you're using human tags, because these tags are like
link |
01:02:00.320
supervision, is it really really self supervised learning
link |
01:02:03.320
now, because you're using human tags to figure out
link |
01:02:05.840
which images are like similar.
link |
01:02:08.000
Hashtag no filter means a lot of things.
link |
01:02:10.440
Yes.
link |
01:02:11.320
I mean, there are certain tags which are going to be
link |
01:02:13.000
applicable pretty much to anything.
link |
01:02:15.360
So they're pretty useless for learning.
link |
01:02:18.320
But I mean, certain tags are actually like
link |
01:02:20.880
the Eiffel Tower, for example, or the Taj Mahal, for example.
link |
01:02:23.880
These tags are like very indicative of what's going on.
link |
01:02:26.520
And they are, I mean, they are human supervision.
link |
01:02:29.520
Yeah.
link |
01:02:30.360
This is one of the tasks of discovering from human
link |
01:02:32.800
generated data, strong signals that could be
link |
01:02:35.480
leveraged for self supervision.
link |
01:02:39.600
Like humans are doing so much work already.
link |
01:02:42.280
Like many years ago, there was something that was called,
link |
01:02:45.160
I guess, human computation back in the day.
link |
01:02:48.040
Humans are doing so much work.
link |
01:02:50.320
It'd be exciting to discover ways to leverage the work
link |
01:02:53.880
they're doing to teach machines without any extra
link |
01:02:57.040
effort from them.
link |
01:02:58.000
An example could be, like we said, driving.
link |
01:03:00.200
Humans driving and machines can learn from the driving.
link |
01:03:03.040
I always hope that there could be some supervision signal
link |
01:03:06.800
discovered in video games, because there's so many
link |
01:03:09.080
people that play video games that it feels like
link |
01:03:11.720
so much effort is put into video games, into playing
link |
01:03:16.520
video games, and you can design video games somewhat
link |
01:03:21.120
cheaply to include whatever signals you want.
link |
01:03:24.640
It feels like that could be leveraged somehow.
link |
01:03:27.560
So people are using that.
link |
01:03:28.680
Like there are actually folks right here in UT Austin,
link |
01:03:30.840
like Philipp Krähenbühl is a professor at UT Austin.
link |
01:03:33.760
He's been working on video games as a source of
link |
01:03:36.760
supervision.
link |
01:03:38.000
I mean, it's really fun, like as a PhD student,
link |
01:03:40.040
getting to basically play video games all day.
link |
01:03:42.200
Yeah, but so I do hope that kind of thing scales.
link |
01:03:44.960
And ultimately, it boils down to discovering some
link |
01:03:48.720
undeniably very good signal.
link |
01:03:51.600
It's like masking in NLP.
link |
01:03:54.040
But that said, there's noncontrastive methods.
link |
01:03:57.640
What do noncontrastive, energy based, self supervised
link |
01:04:01.760
learning methods look like, and why are they promising?
link |
01:04:05.640
So like I said about contrastive learning,
link |
01:04:07.800
you have this notion of a positive and a negative.
link |
01:04:10.720
Now, the thing is, this entire learning paradigm
link |
01:04:13.640
really requires access to a lot of negatives to learn
link |
01:04:17.560
a good sort of feature space.
link |
01:04:19.040
The idea is if I tell you, okay, so a cat and a dog
link |
01:04:23.080
are similar, and they're very different from a banana.
link |
01:04:25.680
The thing is, this is a fairly simple analogy, right?
link |
01:04:28.000
Because bananas look visually very different
link |
01:04:30.840
from what cats and dogs do.
link |
01:04:32.440
So very quickly, if this is the only source of
link |
01:04:34.520
supervision that I'm giving you, your learning is not
link |
01:04:37.400
going to be like, after a point, the neural network
link |
01:04:39.760
is really not going to learn a lot.
link |
01:04:41.640
Because the negative that you're getting
link |
01:04:42.960
is going to be so random.
link |
01:04:43.880
So it can be, oh, a cat and a dog are similar,
link |
01:04:46.640
but they're very different from a Volkswagen Beetle.
link |
01:04:49.880
Now, this car looks very different
link |
01:04:51.880
from these animals again.
link |
01:04:52.920
So the thing is in contrastive learning,
link |
01:04:54.880
the quality of the negative sample really matters a lot.
link |
01:04:58.120
And so what has happened is basically that
link |
01:05:00.800
typically these methods that are contrastive
link |
01:05:02.840
really require access to lots of negatives,
link |
01:05:04.880
which becomes harder and harder to sort of scale
link |
01:05:06.880
when designing a learning algorithm.
link |
01:05:09.000
So that's been one of the reasons
link |
01:05:10.920
why noncontrastive methods have become popular
link |
01:05:13.680
and why people think that they're going to be more useful.
link |
01:05:16.360
So a noncontrastive method, for example,
link |
01:05:18.440
like clustering is one noncontrastive method.
link |
01:05:20.880
The idea basically being that you have two of these samples,
link |
01:05:24.640
so the cat and dog or two crops of this image,
link |
01:05:27.640
they belong to the same cluster.
link |
01:05:30.360
And so essentially you're basically doing clustering online
link |
01:05:33.280
when you're learning this network
link |
01:05:35.040
and which is very different from having access
link |
01:05:36.680
to a lot of negatives explicitly.
link |
01:05:38.920
The other way which has become really popular
link |
01:05:40.840
is something called self distillation.
link |
01:05:43.160
So the idea basically is that you have a teacher network
link |
01:05:45.720
and a student network,
link |
01:05:47.520
and the teacher network produces a feature.
link |
01:05:49.520
So it takes in the image
link |
01:05:51.080
and basically the neural network
link |
01:05:52.800
figures out the patterns, gets the feature out.
link |
01:05:55.240
And there's another neural network
link |
01:05:56.800
which is the student neural network
link |
01:05:57.960
and that also produces a feature.
link |
01:05:59.920
And now all you're doing is basically saying
link |
01:06:01.680
that the features produced by the teacher network
link |
01:06:03.960
and the student network should be very similar.
link |
01:06:06.120
That's it.
link |
01:06:06.960
There is no notion of a negative anymore.
link |
01:06:09.200
And that's it.
link |
01:06:10.040
So it's all about similarity maximization
link |
01:06:11.800
between these two features.
link |
01:06:13.680
And so all I need to now do
link |
01:06:15.760
is figure out how to have these two sorts of parallel networks,
link |
01:06:18.680
a student network and a teacher network.
link |
01:06:20.600
And basically researchers have figured out
link |
01:06:23.000
very cheap methods to do this.
link |
01:06:24.240
So you can actually have for free really
link |
01:06:26.760
two types of neural networks.
link |
01:06:29.000
They're kind of related,
link |
01:06:30.120
but they're different enough
link |
01:06:31.400
that you can actually basically have a learning problem set up.
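A minimal sketch of that student/teacher setup, with the teacher kept as an exponential moving average of the student, which is one common version of the cheap trick being described; the details differ across methods, and the network sizes here are placeholders.

```python
import copy
import torch
import torch.nn.functional as F

# Stand-in networks; in practice these would be large vision backbones.
student = torch.nn.Sequential(torch.nn.Linear(512, 256), torch.nn.ReLU(),
                              torch.nn.Linear(256, 128))
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad = False              # the teacher is never trained by backprop

def distillation_loss(view1, view2):
    # view1, view2: (batch, 512) inputs from two augmentations of the same images.
    s = F.normalize(student(view1), dim=-1)        # student sees one view
    with torch.no_grad():
        t = F.normalize(teacher(view2), dim=-1)    # teacher sees the other view
    return (2 - 2 * (s * t).sum(dim=-1)).mean()    # minimize = maximize cosine similarity

def update_teacher(momentum=0.99):
    # The teacher slowly tracks the student, keeping the two networks related
    # but different enough to avoid trivially collapsing to the same output.
    with torch.no_grad():
        for tp, sp in zip(teacher.parameters(), student.parameters()):
            tp.mul_(momentum).add_(sp, alpha=1 - momentum)
```

The teacher update would be called after each optimizer step on the student, so there is no notion of a negative anywhere, only similarity maximization.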
link |
01:06:34.000
So you can ensure that they always remain different enough
link |
01:06:38.200
so the thing doesn't collapse into something boring.
link |
01:06:41.040
Exactly.
link |
01:06:41.880
So the main sort of enemy of self supervised learning,
link |
01:06:44.360
any kind of similarity maximization technique is collapse.
link |
01:06:47.560
It's a collapse means that you learn
link |
01:06:49.840
the same feature representation
link |
01:06:51.560
for all the images in the world,
link |
01:06:53.160
which is completely useless.
link |
01:06:54.640
Everything is a banana.
link |
01:06:55.640
Everything is a banana.
link |
01:06:56.600
Everything is a cat.
link |
01:06:57.440
Everything is a car.
link |
01:06:58.280
Yeah.
link |
01:06:59.200
And so all we need to do is basically come up
link |
01:07:01.720
with ways to prevent collapse,
link |
01:07:03.320
contrasted learning is one way of doing it.
link |
01:07:05.400
And then for example,
link |
01:07:06.320
like clustering or self distillation
link |
01:07:07.880
or other ways of doing it.
link |
01:07:09.280
We also had a recent paper
link |
01:07:10.440
where we used like decorrelation
link |
01:07:13.160
between like two sets of features to prevent collapse.
link |
01:07:16.800
So that's inspired a little bit
link |
01:07:17.880
by like Horace Barlow's neuroscience principles.
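A sketch of the decorrelation idea in that spirit, written from scratch here, so treat the exact normalization and weighting as assumptions: make the cross-correlation matrix between the two views' standardized features close to the identity, so features agree across views without being redundant with each other.

```python
import torch

def decorrelation_loss(z1, z2, off_diag_weight=5e-3):
    # z1, z2: (batch, dim) features from two augmented views of the same images.
    n, d = z1.shape
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)    # standardize each feature dimension
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    c = (z1.T @ z2) / n                             # d x d cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()  # diagonal should be 1 (views agree)
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # rest should be 0
    return on_diag + off_diag_weight * off_diag
```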
link |
01:07:20.760
By the way, I should comment that whoever counts
link |
01:07:23.600
the number of times then the word banana,
link |
01:07:26.760
apple, cat and dog were using this conversation,
link |
01:07:29.000
wins the internet.
link |
01:07:30.160
I wish you luck.
link |
01:07:31.200
What is SWAV and the main improvement proposed
link |
01:07:36.800
in the paper, Unsupervised Learning of Visual Features
link |
01:07:40.360
by Contrasting Cluster Assignments?
link |
01:07:43.000
SWAV basically is a clustering based technique,
link |
01:07:46.400
which is for again, the same thing
link |
01:07:48.400
for self supervised learning in vision,
link |
01:07:50.760
where we have two crops.
link |
01:07:52.440
And the idea basically is that you want the features
link |
01:07:55.280
from these two crops of an image to lie in the same cluster.
link |
01:07:58.920
And basically crops that are coming from different images
link |
01:08:02.560
to be in different clusters.
link |
01:08:03.960
Now, typically in a sort of,
link |
01:08:05.920
if you were to do this clustering,
link |
01:08:07.160
you would perform clustering offline.
link |
01:08:09.520
What that means is you would,
link |
01:08:11.040
if you have a data set of N examples,
link |
01:08:13.160
you would run over all of these N examples,
link |
01:08:15.360
get features for them, perform clustering.
link |
01:08:17.520
So basically get some clusters
link |
01:08:19.480
and then repeat the process again.
link |
01:08:21.960
So this is offline basically because I need to do one
link |
01:08:24.280
pass through the data to compute its clusters.
link |
01:08:27.240
SWAV is basically just a simple way of doing this online.
link |
01:08:30.200
So as you're going through the data,
link |
01:08:31.840
you're actually computing these clusters online.
link |
01:08:34.800
And so of course, there is like a lot of tricks involved
link |
01:08:37.480
in how to do this in a robust manner without collapsing.
link |
01:08:40.160
But this is the sort of key idea to it.
link |
01:08:42.440
Is there a nice way to say what is the key methodology
link |
01:08:45.480
of the clustering that enables that?
link |
01:08:47.680
Right, so the idea basically is that
link |
01:08:51.000
when you have N samples,
link |
01:08:52.720
we assume that we have access to like,
link |
01:08:55.160
there are always K clusters in a data set.
link |
01:08:57.080
K is a fixed number.
link |
01:08:57.920
So for example, K is 3000.
link |
01:09:00.160
And so if you have any,
link |
01:09:02.240
when you look at any sort of small number of examples,
link |
01:09:04.840
all of them must belong to one of these K clusters.
link |
01:09:08.000
And we impose this equipartition constraint.
link |
01:09:10.360
What this means is that basically,
link |
01:09:15.240
your entire set of N samples
link |
01:09:16.880
should be equally partitioned into K clusters.
link |
01:09:19.440
So all your K clusters are basically equal,
link |
01:09:21.800
they have equal contribution to these N samples.
link |
01:09:24.400
And this ensures that we never collapse.
link |
01:09:26.520
So collapse can be viewed as a way
link |
01:09:28.280
in which all samples belong to one cluster.
link |
01:09:30.680
So all this, if all features become the same,
link |
01:09:33.160
then you have basically just one mega cluster.
link |
01:09:35.160
You don't even have like 10 clusters or 3000 clusters.
link |
01:09:38.160
So SWAV basically ensures that at each point,
link |
01:09:41.000
all these 3000 clusters
link |
01:09:42.400
are being used in the clustering process.
link |
01:09:45.080
And that's it.
link |
01:09:46.280
Basically just figure out how to do this online.
link |
01:09:48.520
And again, basically just make sure
link |
01:09:51.000
that two crops from the same image
link |
01:09:52.600
belong to the same cluster and others don't.
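Very roughly, that online soft-assignment and equipartition constraint can be sketched like this; it is a simplified Sinkhorn-style normalization, not the full SwAV recipe, and K and the dimensions are placeholders.

```python
import torch
import torch.nn.functional as F

K, d = 3000, 128
prototypes = F.normalize(torch.randn(K, d), dim=1)   # K learnable cluster prototypes

def soft_assignments(features, n_iters=3, eps=0.05):
    scores = features @ prototypes.T / eps            # (batch, K) similarity to prototypes
    q = torch.exp(scores)
    for _ in range(n_iters):                          # alternate normalizations to push
        q = q / q.sum(dim=0, keepdim=True)            # each cluster toward equal total mass
        q = q / q.sum(dim=1, keepdim=True)            # each sample sums to 1 (soft assignment)
    return q

feats = F.normalize(torch.randn(32, d), dim=1)        # features for one mini-batch
q = soft_assignments(feats)   # e.g. a sample can be 0.2 in cluster A and 0.8 in cluster B
```

Two crops of the same image are then pushed to produce matching assignments, while the equipartition step keeps all K clusters in use so nothing collapses into one mega cluster.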
link |
01:09:55.760
And the fact they have a fixed K makes things simpler.
link |
01:09:58.880
Fixed K makes things simpler.
link |
01:10:00.400
Our clustering is not like really hard clustering,
link |
01:10:02.600
it's soft clustering.
link |
01:10:03.760
So basically you can be 0.2 to cluster number one
link |
01:10:06.920
and 0.8 to cluster number two.
link |
01:10:08.480
So it's not really hard.
link |
01:10:09.920
So essentially, even though we have like 3000 clusters,
link |
01:10:12.760
we can actually represent a lot of clusters.
link |
01:10:15.200
What is SEER, S E E R?
link |
01:10:19.240
And what are the key results and insights in the paper,
link |
01:10:23.120
self supervised pre training of visual features in the wild?
link |
01:10:27.440
What is this big, beautiful SEER system?
link |
01:10:30.760
SEER, so I'll first go to SWAV
link |
01:10:33.000
because SWAV is actually like one
link |
01:10:34.440
of the key components for SEER.
link |
01:10:35.840
So SWAV was, when we used SWAV,
link |
01:10:37.840
it was demonstrated on ImageNet.
link |
01:10:39.840
So typically like self supervised methods,
link |
01:10:42.960
the way we sort of operate is like
link |
01:10:45.480
in the research community, we kind of cheat.
link |
01:10:47.240
So we take ImageNet, which of course I talked
link |
01:10:49.560
about as having lots of labels.
link |
01:10:51.320
And then we throw away the labels,
link |
01:10:53.200
throw away all the hard work
link |
01:10:54.280
that went behind basically the labeling process.
link |
01:10:56.800
And we pretend that it is unsupervised.
link |
01:11:00.240
But the problem here is that we have,
link |
01:11:02.480
like when we collected these images,
link |
01:11:05.160
the ImageNet dataset has a particular distribution
link |
01:11:08.240
of concepts, right?
link |
01:11:09.960
So these images are very curated.
link |
01:11:11.800
And what that means is these images,
link |
01:11:14.880
of course, belong to a certain set of noun concepts.
link |
01:11:17.680
And also ImageNet has this bias
link |
01:11:19.320
that all images contain an object
link |
01:11:21.240
which is like very big and it's typically in the center.
link |
01:11:24.120
So when you're talking about a dog,
link |
01:11:25.120
it's a well framed dog.
link |
01:11:26.160
It's towards the center of the image.
link |
01:11:28.360
So a lot of the data augmentation,
link |
01:11:29.800
a lot of the sort of hidden assumptions
link |
01:11:31.520
in self supervised learning,
link |
01:11:33.440
actually really exploit this bias of ImageNet.
link |
01:11:37.400
And so, I mean, a lot of my work,
link |
01:11:39.720
a lot of work from other people always uses ImageNet
link |
01:11:42.040
sort of as the benchmark to show
link |
01:11:43.720
the success of self supervised learning.
link |
01:11:45.480
So you're implying that there's particular limitations
link |
01:11:47.720
to this kind of dataset?
link |
01:11:49.240
Yes, I mean, it's basically because our data augmentation
link |
01:11:51.880
that we designed, like all the augmentation
link |
01:11:55.360
that we designed for self supervised learning
link |
01:11:56.880
and vision are kind of overfit to ImageNet.
link |
01:11:59.400
But you're saying a little bit hard coded
link |
01:12:02.400
in like the cropping.
link |
01:12:03.840
Exactly, the cropping parameters,
link |
01:12:05.480
the kind of lighting that we're using,
link |
01:12:07.320
the kind of blurring that we're using.
link |
01:12:08.840
Yeah, but you would, for a more in the wild dataset,
link |
01:12:12.000
you would need to be clever or more careful
link |
01:12:16.240
in setting the range of parameters
link |
01:12:17.520
and those kinds of things.
link |
01:12:18.960
So for SEER, our main goal was twofold: one,
link |
01:12:21.720
basically to move away from ImageNet for training.
link |
01:12:24.720
So the images that we used were like uncurated images.
link |
01:12:27.720
Now there's a lot of debate
link |
01:12:28.640
whether they're actually curated or not,
link |
01:12:30.080
but I'll talk about that later.
link |
01:12:32.360
But the idea was basically these are going to be
link |
01:12:34.520
random internet images that we're not going to filter out
link |
01:12:37.920
based on like a particular categories.
link |
01:12:40.080
So we did not say that, oh, images that belong to dogs
link |
01:12:42.880
and cats should be the only images
link |
01:12:44.280
that come in this dataset, banana.
link |
01:12:47.000
And basically other images should be thrown out.
link |
01:12:50.040
So we didn't do any of that.
link |
01:12:51.800
So these are random internet images.
link |
01:12:53.560
And of course, it also goes back to like the problem
link |
01:12:56.040
of scale that you talked about.
link |
01:12:57.280
So these were basically about a billion or so images.
link |
01:13:00.120
And for context, the ImageNet version
link |
01:13:02.320
that we used was one million images earlier.
link |
01:13:04.320
So this is basically going like three orders
link |
01:13:05.920
of magnitude more.
link |
01:13:07.600
The idea was basically to see if we can train
link |
01:13:09.360
a very large convolutional model in a self supervised way
link |
01:13:13.320
on this uncurated, but really large set of images.
link |
01:13:16.400
And how well would this model do?
link |
01:13:18.280
So is self supervised learning really overfit to ImageNet?
link |
01:13:21.480
Or can it actually work in the wild?
link |
01:13:23.840
And it was also out of curiosity,
link |
01:13:25.720
what kind of things will this model learn?
link |
01:13:27.520
Will it actually be able to still figure out,
link |
01:13:30.080
different types of objects and so on?
link |
01:13:32.000
Would there be particular kinds of tasks
link |
01:13:33.720
it would actually do better than an ImageNet trained model?
link |
01:13:38.160
And so for SEER, one of our main findings was that
link |
01:13:40.960
we can actually train very large models
link |
01:13:43.120
in a completely self supervised way
link |
01:13:44.800
on lots of Internet images
link |
01:13:46.400
without really necessarily filtering them out,
link |
01:13:48.640
which was in itself a good thing
link |
01:13:49.800
because it's a fairly simple process, right?
link |
01:13:52.000
So you get images which are uploaded
link |
01:13:54.120
and you basically can immediately use them
link |
01:13:55.800
to train a model in an unsupervised way.
link |
01:13:57.720
You don't really need to sit and filter them out.
link |
01:13:59.760
These images can be cartoons, these can be memes,
link |
01:14:02.080
these can be actual pictures uploaded by people.
link |
01:14:04.480
And you don't really care about what these images are.
link |
01:14:06.200
You don't even care about what concepts they contain.
link |
01:14:08.560
So this was a very sort of simple setup.
link |
01:14:10.320
What image selection mechanism would you say
link |
01:14:12.920
is there like inherent in some aspect of the process?
link |
01:14:18.880
So you're kind of implying it, there's almost none.
link |
01:14:21.320
But what is there would you say if you were to introspect?
link |
01:14:25.000
Right, so it's not like uncurated can basically like,
link |
01:14:28.960
one way of imagining uncurated is basically
link |
01:14:30.880
you have like cameras that can take pictures
link |
01:14:33.800
at random viewpoints.
link |
01:14:35.240
When people upload pictures to the Internet,
link |
01:14:37.440
they are typically going to care about the framing of it.
link |
01:14:40.360
They're not going to upload, say,
link |
01:14:41.880
the picture of a zoomed in wall, for example.
link |
01:14:43.840
Well, when we say Internet,
link |
01:14:44.920
do you mean social networks?
link |
01:14:46.080
Yes.
link |
01:14:47.160
So these are not going to be like pictures
link |
01:14:48.680
of like a zoomed in table or a zoomed in wall.
link |
01:14:51.400
So it's not really completely uncurated
link |
01:14:53.160
because people do have their like photographers bias,
link |
01:14:55.800
where they do want to keep things towards the center
link |
01:14:57.720
a little bit or like really have like,
link |
01:14:59.920
you know, nice looking things and so on in the picture.
link |
01:15:02.680
So that's the kind of bias that typically exists
link |
01:15:05.640
in this data set.
link |
01:15:06.480
And also the user base, right?
link |
01:15:07.720
You're not going to get lots of pictures
link |
01:15:09.320
from different parts of the world
link |
01:15:10.520
because there are certain parts of the world
link |
01:15:12.080
where people may not actually be uploading
link |
01:15:14.280
a lot of pictures to the Internet
link |
01:15:15.400
or may not even have access to a lot of Internet.
link |
01:15:17.360
So this is a giant data set and a giant neural network.
link |
01:15:21.720
I don't think we've talked about what architectures
link |
01:15:24.760
work well for SSL, for self supervised learning.
link |
01:15:29.280
For SEER and for SwAV,
link |
01:15:30.640
we were using convolutional networks,
link |
01:15:32.440
but recently in a work called DINO,
link |
01:15:34.120
we've basically started using transformers for vision.
link |
01:15:36.840
Both seem to work really well,
link |
01:15:38.560
convnets and transformers, and depending on what you want to do,
link |
01:15:41.120
you might choose to use a particular formulation.
link |
01:15:43.520
So for SEER, it was a convnet.
link |
01:15:45.320
It was particularly a RegNet model,
link |
01:15:47.440
which was also work from Facebook.
link |
01:15:49.680
RegNets are like really good when it comes to compute
link |
01:15:52.600
versus like accuracy.
link |
01:15:54.720
So because it was a very efficient model,
link |
01:15:56.880
compute and memory wise efficient
link |
01:15:59.640
and basically it worked really well in terms of scaling.
link |
01:16:02.440
So we used a very large reg net model
link |
01:16:04.160
and trained it on a billion images.
link |
01:16:05.440
Can you maybe quickly comment on what RegNets are?
link |
01:16:09.680
It comes from this paper,
link |
01:16:10.680
Designing Network Design Spaces.
link |
01:16:13.520
It's just a super interesting concept
link |
01:16:15.520
that emphasizes how to create efficient neural networks,
link |
01:16:18.400
large neural networks.
link |
01:16:19.520
So one of the sort of key takeaways from this paper,
link |
01:16:21.760
which the authors like whenever you hear them present this work,
link |
01:16:24.160
they keep saying is a lot of neural networks
link |
01:16:27.200
are characterized in terms of flops.
link |
01:16:29.040
Flops basically being the floating point operations
link |
01:16:31.480
and people really love to use flops to say,
link |
01:16:33.360
this model is like really computationally heavy
link |
01:16:36.200
or like our model is computationally cheap and so on.
link |
01:16:39.000
Now it turns out that flops are really not a good indicator
link |
01:16:41.880
of how well a particular network is,
link |
01:16:43.840
like how efficient it is really.
link |
01:16:45.960
And what a better indicator is is the activation
link |
01:16:49.120
or the memory that is being used by this particular model.
link |
01:16:52.120
And so designing like one of the key findings
link |
01:16:54.960
from this paper was basically that you need to design
link |
01:16:57.360
network families or neural network architectures
link |
01:17:00.120
that are actually very efficient in the memory space as well,
link |
01:17:02.760
not just in terms of pure flops.
link |
01:17:04.800
So RegNet is basically a network architecture family
link |
01:17:07.560
that came out of this paper
link |
01:17:08.920
that is particularly good at both flops
link |
01:17:11.160
and the sort of memory required for it.
link |
01:17:13.560
And of course it builds upon like earlier work,
link |
01:17:15.760
like ResNet being like the sort of more popular inspiration
link |
01:17:18.560
for it where you have residual connections.
link |
01:17:20.400
But one of the things in this work is basically
link |
01:17:22.400
they also use like squeeze excitation blocks.
link |
01:17:25.080
So it's a lot of nice sort of technical innovation
link |
01:17:27.080
in all of this from prior work
link |
01:17:28.720
and a lot of the ingenuity of these particular authors
link |
01:17:31.400
in how to combine these multiple building blocks.
link |
01:17:34.120
But the key constraint was optimize for both flops
link |
01:17:36.840
and memory when you're basically doing this.
link |
01:17:38.320
Don't just look at flops.
link |
01:17:39.560
And that allows you to have sort of very large
link |
01:17:44.040
networks that through this process are optimized
link |
01:17:49.120
for efficiency, for low memory.
link |
01:17:51.280
Also in just in terms of pure hardware,
link |
01:17:53.600
they fit very well on GPU memory.
link |
01:17:55.880
So they can be like really powerful neural network
link |
01:17:57.920
architectures with lots of parameters, lots of flops,
link |
01:18:00.200
but also because they're like efficient in terms of
link |
01:18:02.760
the amount of memory that they're using,
link |
01:18:04.040
you can actually fit a lot of these on,
link |
01:18:05.960
like you can fit a very large model
link |
01:18:08.120
on a single GPU for example.
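As a rough illustration of the flops versus activation memory distinction, here is a small sketch that loads a RegNet from torchvision, counts its parameters, and measures peak GPU memory for one forward pass. It assumes a CUDA GPU is available; the particular variant and batch size are just examples, and this is not the actual SEER model, which was a much larger RegNet.

```python
import torch
from torchvision import models

# Illustrative only: compare parameter count with peak activation memory.
model = models.regnet_y_16gf().cuda().eval()
num_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {num_params / 1e6:.1f}M")

x = torch.randn(32, 3, 224, 224, device="cuda")   # example batch
torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    _ = model(x)
peak_mb = torch.cuda.max_memory_allocated() / 2**20
print(f"peak GPU memory for one forward pass: {peak_mb:.0f} MiB")
```

Two models with similar flop counts can differ a lot on the second number, which is what decides how large a model actually fits on a single GPU.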
link |
01:18:09.600
Would you say that the choice of architecture matters more
link |
01:18:15.120
than the choice of maybe data augmentation techniques?
link |
01:18:18.560
Is there a possibility to say what matters more?
link |
01:18:21.720
You kind of imply that you can probably go really far
link |
01:18:24.400
with just using basic convnets.
link |
01:18:27.600
All right, I think like data and data augmentation,
link |
01:18:30.600
the algorithm being used for the self supervised training
link |
01:18:33.280
matters a lot more than the particular kind of architecture.
link |
01:18:36.400
With different types of architecture,
link |
01:18:37.680
you will get different like properties
link |
01:18:39.560
in the resulting sort of representation.
link |
01:18:41.720
But really, I mean, the secret sauce is in the data
link |
01:18:44.200
augmentation and the algorithm being used to train them.
link |
01:18:47.080
The architectures, I mean, at this point,
link |
01:18:49.240
a lot of them perform very similarly,
link |
01:18:51.680
depending on like the particular task that you care about,
link |
01:18:53.840
they have certain advantages and disadvantages.
link |
01:18:56.400
Is there something interesting to be said
link |
01:18:58.120
about what it takes with SEER to train
link |
01:19:00.600
a giant neural network?
link |
01:19:01.920
You're talking about a huge amount of data,
link |
01:19:04.160
a huge neural network.
link |
01:19:05.800
Is there something interesting to be said
link |
01:19:07.800
of how to effectively train something like that fast?
link |
01:19:11.280
Lots of GPUs.
link |
01:19:13.040
Okay, so.
link |
01:19:15.480
I mean, so the model was like a billion parameters.
link |
01:19:18.080
Yeah.
link |
01:19:18.920
And it was trained on the billion images.
link |
01:19:20.640
Yeah.
link |
01:19:21.480
So basically the same number of parameters
link |
01:19:23.360
as the number of images.
link |
01:19:24.920
And it took a while.
link |
01:19:26.200
I don't remember the exact number.
link |
01:19:27.480
It's in the paper.
link |
01:19:28.640
But it took a while.
link |
01:19:29.480
I guess what I'm trying to get at is
link |
01:19:34.680
when you're thinking of scaling this kind of thing.
link |
01:19:38.680
I mean, one of the exciting possibilities
link |
01:19:41.920
of self supervised learning is the several orders
link |
01:19:45.320
of magnitude scaling of everything,
link |
01:19:47.360
both the neural network and the size of the data.
link |
01:19:50.920
And so the question is,
link |
01:19:52.600
do you think there's some interesting tricks
link |
01:19:55.120
to do large scale distributed compute?
link |
01:19:57.880
Or is it,
link |
01:19:58.720
or is that really outside of even deep learning?
link |
01:20:00.920
That's more about like hardware engineering.
link |
01:20:04.360
I think more and more there is like this,
link |
01:20:07.240
a lot of like systems are designed,
link |
01:20:10.120
basically taking into account
link |
01:20:11.360
the machine learning needs, right?
link |
01:20:12.480
So because whenever you're doing this kind
link |
01:20:14.640
of distributed training,
link |
01:20:15.480
there is a lot of intercommunication between nodes.
link |
01:20:17.760
So like gradients or the model parameters are being passed.
link |
01:20:20.600
So you really want to minimize communication costs
link |
01:20:22.800
when you really want to scale these models up.
link |
01:20:25.240
You want basically to be able to do as much,
link |
01:20:29.160
like as limited amount of communication as possible.
link |
01:20:31.440
So currently like a dominant paradigm
link |
01:20:33.280
is synchronized sort of training.
link |
01:20:35.000
So essentially after every sort of gradient step,
link |
01:20:38.480
you basically have like a synchronization step
link |
01:20:41.200
between all the sort of compute chips
link |
01:20:43.400
that you're working with.
link |
01:20:45.680
I think asynchronous training was popular,
link |
01:20:47.840
but it doesn't seem to perform as well.
link |
01:20:50.400
But in general, I think that's sort of the,
link |
01:20:53.320
I guess it's outside my scope as well.
link |
01:20:55.280
But the main thing is like minimize the amount
link |
01:20:59.000
of synchronization steps that you have.
link |
01:21:01.880
That has been the key take away at least in my experience.
link |
01:21:04.600
The others, I have no idea about how to design the chip.
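A minimal sketch, assuming PyTorch's DistributedDataParallel, of the synchronized training being described: gradients are all-reduced across workers after each backward pass, and one common way to cut down that communication is gradient accumulation with no_sync, so the all-reduce only runs every few batches. The loss here is a stand-in, not SEER's actual objective.

```python
from contextlib import nullcontext

import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def train_one_epoch(model, loader, optimizer, accumulation_steps=4):
    # Assumes torch.distributed.init_process_group(...) was already called
    # and `model` already lives on this rank's GPU.
    ddp_model = DDP(model)
    ddp_model.train()
    optimizer.zero_grad()
    for step, (images, _) in enumerate(loader):
        images = images.cuda(non_blocking=True)
        sync_now = (step + 1) % accumulation_steps == 0
        # no_sync() skips the gradient all-reduce, so the expensive inter-node
        # communication only happens every `accumulation_steps` batches.
        ctx = nullcontext() if sync_now else ddp_model.no_sync()
        with ctx:
            loss = ddp_model(images).mean()   # stand-in for a real self-supervised loss
            (loss / accumulation_steps).backward()
        if sync_now:
            optimizer.step()
            optimizer.zero_grad()
```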
link |
01:21:06.640
Yeah, there's very few things that I see Jim Keller's eyes
link |
01:21:11.200
light up as much as talking about giant computers doing
link |
01:21:15.360
like that fast communication that you're talking to,
link |
01:21:17.640
well, when they're training machine learning systems.
link |
01:21:21.200
What is VISSL, V I S S L, the PyTorch based SSL library?
link |
01:21:27.880
What are the use cases that you might have?
link |
01:21:30.080
VISSL basically was born out of a lot of us
link |
01:21:32.320
at Facebook doing the self supervised learning research.
link |
01:21:35.120
So it's a common framework in which we have
link |
01:21:38.160
like a lot of self supervised learning methods
link |
01:21:39.920
implemented for vision.
link |
01:21:41.680
It's also, it has in itself like a benchmark of tasks
link |
01:21:45.920
that you can evaluate the self supervised representations on.
link |
01:21:48.800
So the use case for it is basically for anyone
link |
01:21:51.240
who's either trying to evaluate their self supervised model
link |
01:21:53.760
or train their self supervised model
link |
01:21:56.000
or a researcher who's trying to build
link |
01:21:57.800
a new self supervised technique.
link |
01:21:59.240
So it's basically supposed to be all of these things.
link |
01:22:01.520
So as a researcher before VISSL, for example,
link |
01:22:04.480
or like when we started doing this work
link |
01:22:06.160
fairly seriously at Facebook,
link |
01:22:07.920
it was very hard for us to go and implement
link |
01:22:09.960
every self supervised learning model
link |
01:22:11.880
and test it out in a sort of consistent manner.
link |
01:22:14.600
The experimental setup was very different
link |
01:22:16.440
across different groups.
link |
01:22:18.160
Even when someone said that they were reporting ImageNet
link |
01:22:20.800
accuracy, it could mean lots of different things.
link |
01:22:23.200
So with VISSL, we tried to really sort of standardize that
link |
01:22:25.360
as much as possible.
link |
01:22:26.400
And it was a paper like we did in 2019
link |
01:22:28.240
just about benchmarking.
link |
01:22:29.760
And so VISSL basically builds upon a lot of
link |
01:22:32.240
this kind of work that we did about like benchmarking.
link |
01:22:35.160
And then every time we try to like,
link |
01:22:37.160
we come up with a self supervised learning method,
link |
01:22:39.040
a lot of us try to push that into VISSL as well
link |
01:22:41.200
just so that it basically is like the central piece
link |
01:22:43.440
where a lot of these methods can reside.
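The evaluation that such benchmarks standardize is, roughly, the linear evaluation protocol: freeze the self supervised backbone, train only a linear classifier on top, and report accuracy. The sketch below is not the VISSL API, just a plain PyTorch illustration of what that protocol does; the feature dimension, class count, and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn

def linear_probe(backbone, train_loader, num_classes=1000, feat_dim=2048, epochs=10):
    """Freeze a pretrained backbone and fit only a linear classifier on top of it."""
    for p in backbone.parameters():
        p.requires_grad = False            # the self-supervised representation stays fixed
    backbone.eval()

    classifier = nn.Linear(feat_dim, num_classes).cuda()
    optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.cuda(), labels.cuda()
            with torch.no_grad():
                features = backbone(images)   # assumes the backbone outputs (B, feat_dim)
            loss = criterion(classifier(features), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return classifier
```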
link |
01:22:46.360
Just out of curiosity, people maybe,
link |
01:22:49.200
so certainly outside of Facebook, but just researchers,
link |
01:22:52.000
or just even people that know how to program in Python
link |
01:22:54.920
and know how to use PyTorch,
link |
01:22:57.120
what would be the use case?
link |
01:22:58.640
What would be a fun thing to play around with VISSL on?
link |
01:23:01.320
Like what's a fun thing to play around
link |
01:23:04.320
with self supervised learning on, would you say?
link |
01:23:07.920
Is there a good Hello World program?
link |
01:23:09.760
Like is it always about big size
link |
01:23:12.440
that's important to have?
link |
01:23:14.640
Or is there a fun little smaller case
link |
01:23:18.080
playgrounds to play around with?
link |
01:23:19.760
So we're trying to like push something towards that.
link |
01:23:22.440
I think there are a few setups out there,
link |
01:23:24.360
but nothing like super standard on the smaller scale.
link |
01:23:26.760
I mean, ImageNet in itself is actually pretty big also.
link |
01:23:29.320
So that is not something which is like feasible
link |
01:23:32.280
for a lot of people, but we are trying to like push up
link |
01:23:34.920
with like smaller sort of use cases.
link |
01:23:36.400
The thing is at a smaller scale,
link |
01:23:39.000
a lot of the observations or a lot of the algorithms
link |
01:23:41.240
that work don't necessarily translate
link |
01:23:42.800
into the medium or the larger scale.
link |
01:23:45.000
So it's really tricky to come up with a good small scale setup
link |
01:23:47.480
where a lot of your empirical observations
link |
01:23:49.160
will really translate to the other setup.
link |
01:23:51.560
So it's been really challenging.
link |
01:23:53.280
I've been trying to do that for a little bit as well
link |
01:23:54.920
because it does take time to train stuff on ImageNet,
link |
01:23:56.840
it does take time to train on like more images,
link |
01:23:59.880
but pretty much every time I've tried to do that,
link |
01:24:02.240
it's been unsuccessful because all the observations
link |
01:24:04.120
I draw from my set of experiments on a smaller dataset
link |
01:24:06.880
don't translate into ImageNet
link |
01:24:09.240
or like don't translate into another sort of dataset.
link |
01:24:11.720
So it's been hard for us to figure this one out,
link |
01:24:14.160
but it's an important problem.
link |
01:24:15.720
So there's this really interesting idea
link |
01:24:17.920
of learning across multiple modalities.
link |
01:24:20.800
You have a CVPR 2021 best paper candidate
link |
01:24:26.320
titled Audiovisual Instance Discrimination
link |
01:24:29.200
with Crossmodal Agreement.
link |
01:24:31.360
What are the key results, insights in this paper
link |
01:24:33.840
and what can you say in general about the promise
link |
01:24:35.880
and power of multimodal learning?
link |
01:24:37.600
For this paper, it actually came as a little bit
link |
01:24:39.960
of a shock to me at how well it worked.
link |
01:24:41.960
So I can describe what the problem setup was.
link |
01:24:44.120
So it's been used in the past by lots of folks,
link |
01:24:46.520
like for example, Andrew Owens from MIT,
link |
01:24:48.360
Alyosha Efros from Berkeley,
link |
01:24:49.920
Andrew Zisserman from Oxford.
link |
01:24:51.120
So a lot of these people have been sort of showing results
link |
01:24:53.000
in this.
link |
01:24:54.080
Of course, I was aware of this result,
link |
01:24:55.480
but I wasn't really sure how well it would work in practice
link |
01:24:58.600
for like other sort of downstream tasks.
link |
01:25:00.600
So the results kept getting better
link |
01:25:02.440
and I wasn't sure if like a lot of our insights
link |
01:25:04.200
from self supervised learning would translate
link |
01:25:05.920
into this multimodal learning problem.
link |
01:25:08.320
So multimodal learning is when you have like,
link |
01:25:12.880
when you have multiple modalities.
link |
01:25:14.280
And that's not equal.
link |
01:25:15.640
Excellent.
link |
01:25:16.960
Okay, so the particular modalities that we worked on
link |
01:25:20.040
in this work were audio and video.
link |
01:25:22.040
So the idea was basically if you have a video,
link |
01:25:23.920
you have its corresponding audio track.
link |
01:25:25.880
And you want to use both of these signals,
link |
01:25:27.560
the audio signal and the video signal
link |
01:25:29.280
to learn a good representation for video
link |
01:25:31.280
and good representation for audio.
link |
01:25:32.640
Like this podcast.
link |
01:25:33.680
Like this podcast, exactly.
link |
01:25:35.480
So what we did in this work was basically trained
link |
01:25:38.160
two different neural networks,
link |
01:25:39.400
one on the video signal, one on the audio signal.
link |
01:25:41.960
And what we wanted is basically the features
link |
01:25:43.800
that we get from both of these neural networks
link |
01:25:45.400
should be similar.
link |
01:25:46.800
So it should basically be able to produce
link |
01:25:48.720
the same kinds of features from the video
link |
01:25:51.120
and the same kinds of features from the audio.
link |
01:25:53.240
Now, why is this useful?
link |
01:25:54.280
Well, for a lot of these objects that we have,
link |
01:25:56.680
there is a characteristic sound, right?
link |
01:25:58.280
So trains, when they go by,
link |
01:25:59.520
they make a particular kind of sound.
link |
01:26:00.760
Boats make a particular kind of sound.
link |
01:26:02.480
People, when they're jumping around,
link |
01:26:03.840
they will like shout or whatever.
link |
01:26:06.280
Bananas don't make a sound.
link |
01:26:07.320
So well, you can't learn anything about bananas there.
link |
01:26:09.440
Or when humans mention bananas.
link |
01:26:11.680
Well, yes.
link |
01:26:12.520
When they say the word banana, then probably.
link |
01:26:13.840
So you can't trust basically anything
link |
01:26:15.120
that comes out of a human's mouth as a source,
link |
01:26:17.160
that source of audio is useless.
link |
01:26:18.960
So the typical use case is basically like,
link |
01:26:20.680
for example, someone playing a musical instrument.
link |
01:26:22.480
So guitars have a particular kind of sound and so on.
link |
01:26:24.720
So because a lot of these things are correlated,
link |
01:26:27.160
the idea in multimodal learning
link |
01:26:28.480
is to take these two kinds of modalities,
link |
01:26:30.160
video and audio,
link |
01:26:31.360
and learn a common embedding space,
link |
01:26:33.160
a common feature space,
link |
01:26:34.480
where both of these related modalities
link |
01:26:36.120
can basically be close together.
link |
01:26:38.560
And again, you use contrastive learning for this.
link |
01:26:40.600
So in contrastive learning,
link |
01:26:42.080
basically the video and the corresponding audio are positives,
link |
01:26:45.520
and you can take any other video or any other audio,
link |
01:26:48.200
and that becomes a negative.
link |
01:26:49.840
And so basically that's it.
link |
01:26:51.040
It's just a simple application of contrastive learning.
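A minimal sketch of the contrastive setup just described: a video encoder and an audio encoder each produce an embedding, a clip paired with its own audio is a positive, everything else in the batch is a negative, and an InfoNCE style loss pulls positives together. The encoders are stand-ins, and the cross modal agreement part of the paper is not shown here.

```python
import torch
import torch.nn.functional as F

def audio_video_contrastive_loss(video_emb, audio_emb, temperature=0.1):
    """
    video_emb, audio_emb: (B, D) embeddings for the same B clips, in the same order.
    Row i of each is a positive pair; every other pairing in the batch is a negative.
    """
    v = F.normalize(video_emb, dim=1)
    a = F.normalize(audio_emb, dim=1)
    logits = v @ a.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric InfoNCE: match video -> audio and audio -> video.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```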
link |
01:26:53.720
The main sort of finding from this work for us
link |
01:26:56.840
was basically that you can actually learn
link |
01:26:58.680
very, very powerful feature representations,
link |
01:27:00.760
very, very powerful video representations.
link |
01:27:02.840
So you can learn the sort of video network
link |
01:27:05.400
that we ended up learning
link |
01:27:06.520
can actually be used for downstream,
link |
01:27:08.280
for example, recognizing human actions,
link |
01:27:11.000
or recognizing different types of sounds, for example.
link |
01:27:14.440
So this was sort of the key finding.
link |
01:27:17.160
Can you give kind of an example of a human action
link |
01:27:20.200
or just so we can build up intuition
link |
01:27:23.400
of what kind of thing?
link |
01:27:24.360
Right, so there is this data set called Kinetics,
link |
01:27:26.880
for example, which has like 400 different types
link |
01:27:28.640
of human actions.
link |
01:27:29.480
So people jumping, people doing different kinds
link |
01:27:32.360
of sports or different types of swimming.
link |
01:27:34.280
So like different strokes and swimming, golf and so on.
link |
01:27:37.640
So there are like just different types of actions right there.
link |
01:27:40.560
And the point is this kind of video network
link |
01:27:42.640
that you learn in a self supervised way
link |
01:27:44.400
can be used very easily to kind of recognize
link |
01:27:46.960
these different types of actions.
link |
01:27:48.920
It can also be used for recognizing
link |
01:27:50.440
different types of objects.
link |
01:27:53.160
And what we did is we tried to visualize
link |
01:27:54.800
whether the network can figure out
link |
01:27:56.120
where the sound is coming from.
link |
01:27:57.920
So basically give it a video
link |
01:27:59.880
and basically play of a person just strumming a guitar,
link |
01:28:03.040
but of course there is no audio in this.
link |
01:28:04.800
And now you give it the sound of a guitar.
link |
01:28:07.200
And you ask like basically try to visualize
link |
01:28:08.920
where the network thinks the sound is coming from.
link |
01:28:12.560
And then it can kind of basically draw like,
link |
01:28:14.600
when you visualize it,
link |
01:28:15.440
you can see that it's basically focusing on the guitar.
link |
01:28:17.520
Yeah, that's so real.
link |
01:28:18.360
And the same thing, for example,
link |
01:28:20.200
for certain people's voices,
link |
01:28:21.520
like famous celebrities voices,
link |
01:28:22.960
it can actually figure out where their mouth is.
link |
01:28:26.080
So it can actually distinguish different people's voices,
link |
01:28:28.640
for example, a little bit as well.
link |
01:28:30.520
Without that ever being annotated in any way.
link |
01:28:33.680
Right, so this is all what it had discovered.
link |
01:28:35.560
We never pointed out that this is a guitar
link |
01:28:38.240
and this is the kind of sound it produces.
link |
01:28:40.120
It can actually naturally figure that out
link |
01:28:41.560
because it's seen so many correlations of this sound
link |
01:28:44.240
coming with this kind of like an object
link |
01:28:46.720
that it basically learns to associate this sound
link |
01:28:49.080
with this kind of an object.
link |
01:28:50.080
Yeah, that's really fascinating, right?
link |
01:28:52.840
That's really interesting.
link |
01:28:53.680
So the idea with this kind of network
link |
01:28:55.240
is then you then fine tune it for a particular task.
link |
01:28:57.960
So this is forming like a really good knowledge base
link |
01:29:01.880
within a neural network based on which you could then,
link |
01:29:04.320
the train a little bit more
link |
01:29:05.600
to accomplish a specific task well.
link |
01:29:08.800
Exactly, so you don't need a lot of videos of humans
link |
01:29:11.680
doing actions annotated.
link |
01:29:12.800
You can just use a few of them to basically get your.
link |
01:29:16.080
How much insight do you draw from the fact
link |
01:29:18.520
that it can figure out where the sound is coming from?
link |
01:29:23.480
I'm trying to see, so that's kind of very,
link |
01:29:26.160
it's very CVPR, beautiful, right?
link |
01:29:28.120
It's a cool little insight.
link |
01:29:30.000
I wonder how profound that is.
link |
01:29:34.240
Does it speak to the idea that multiple modalities
link |
01:29:39.320
are somehow much bigger than the sum of their parts
link |
01:29:44.120
or is it really, really useful to have multiple modalities
link |
01:29:48.000
or is it just that cool thing that there's parts
link |
01:29:50.640
of our world that can be revealed
link |
01:29:54.560
like effectively through multiple modalities,
link |
01:29:58.360
but most of it is really all about vision
link |
01:30:01.200
or about one of the modalities.
link |
01:30:03.880
I would say tending a little more towards the second part.
link |
01:30:07.800
So most of it can be sort of figured out with one modality,
link |
01:30:10.720
but having an extra modality always helps you.
link |
01:30:13.200
So in this case, for example, like one thing is when you're,
link |
01:30:17.720
if you observe someone cutting something
link |
01:30:19.400
and you don't have any sort of sound there,
link |
01:30:21.960
whether it's an apple or whether it's an onion,
link |
01:30:25.080
it's very hard to figure that out.
link |
01:30:26.720
But if you hear someone cutting it,
link |
01:30:28.240
it's very easy to figure it out
link |
01:30:29.800
because apples and onions make a very different kind
link |
01:30:32.840
of characteristic sound when they're being cut.
link |
01:30:34.840
So you really figure this out based on audio.
link |
01:30:36.880
It's much easier.
link |
01:30:38.240
So your life will become much easier
link |
01:30:40.040
when you have access to different kinds of modalities.
link |
01:30:42.320
And the other thing is,
link |
01:30:43.440
so I like to relate it in this way,
link |
01:30:45.040
it may be like completely wrong,
link |
01:30:46.360
but the distributional hypothesis in NLP, right?
link |
01:30:49.360
Where context basically gives kind of meaning
link |
01:30:51.880
to that word.
link |
01:30:53.080
Sound kind of does that too, right?
link |
01:30:55.080
So if you have the same sound,
link |
01:30:57.040
so that's the same context across different videos,
link |
01:30:59.880
you're very likely to be observing
link |
01:31:01.320
the same kind of concept.
link |
01:31:03.040
So that's the kind of reason
link |
01:31:04.320
why it figures out the guitar thing, right?
link |
01:31:06.480
It observed the same sound across multiple different videos
link |
01:31:09.800
and it figures out maybe this is the common factor
link |
01:31:11.920
that's actually doing it.
link |
01:31:13.280
I wonder, I used to have this argument with my dad a bunch
link |
01:31:17.480
for creating general intelligence,
link |
01:31:19.800
whether smell is important,
link |
01:31:22.880
like if that's important sensory information.
link |
01:31:25.520
Mostly we're talking about like falling in love
link |
01:31:27.600
with an AI system.
link |
01:31:28.960
And for him, smell and touch are important.
link |
01:31:31.440
And I was arguing that it's not at all,
link |
01:31:33.880
it's nice and everything,
link |
01:31:35.320
but like you can fall in love with just language really,
link |
01:31:38.400
but voice is very powerful and vision is next
link |
01:31:41.400
and smell is not that important.
link |
01:31:43.880
Can I ask you about this process of active learning?
link |
01:31:46.880
You mentioned interactivity.
link |
01:31:49.160
Right.
link |
01:31:50.000
Is there some value within the self supervised learning
link |
01:31:56.000
context to select parts of the data in intelligent ways
link |
01:32:02.240
such that they would most benefit the learning process?
link |
01:32:06.840
So I think so.
link |
01:32:07.680
I mean, I know I'm talking to an active learning fan here,
link |
01:32:10.280
so of course I know the answer.
link |
01:32:12.600
First you were talking about bananas
link |
01:32:14.000
and now you're talking about active learning, I love it.
link |
01:32:16.720
I think Yann LeCun told me that active learning
link |
01:32:18.760
is not that interesting.
link |
01:32:20.440
And I think back then I didn't want to argue with him too much,
link |
01:32:24.360
but when we talk again,
link |
01:32:25.680
we're gonna spend three hours arguing about active learning.
link |
01:32:28.400
My sense was you can go extremely far with active learning,
link |
01:32:32.480
you know, perhaps farther than anything else.
link |
01:32:34.920
Like the, to me, there's this kind of intuition
link |
01:32:37.960
that similar to data augmentation,
link |
01:32:40.840
you can get a lot from the data,
link |
01:32:44.160
from intelligent optimized usage of the data.
link |
01:32:50.160
Right.
link |
01:32:51.000
I'm trying to speak generally in such a way
link |
01:32:53.240
that includes data augmentation and active learning,
link |
01:32:57.080
that there's something about maybe interactive exploration
link |
01:32:59.920
of the data that at least as part of the solution
link |
01:33:04.360
to intelligence, like an important part.
link |
01:33:07.160
I don't know what your thoughts
link |
01:33:08.120
are on active learning in general.
link |
01:33:09.360
I actually really like active learning.
link |
01:33:10.880
So back in the day we did this largely ignored
link |
01:33:13.480
CVPR paper called Learning by Asking Questions.
link |
01:33:16.600
So the idea was basically you would train an agent
link |
01:33:18.320
that would ask a question about the image,
link |
01:33:20.160
it would get an answer.
link |
01:33:21.600
And basically then it would update itself,
link |
01:33:23.440
it would see the next image,
link |
01:33:24.440
it would decide what's the next hardest question
link |
01:33:26.880
that I can ask to learn the most.
link |
01:33:28.840
And the idea was basically because it was being smart
link |
01:33:31.360
about the kinds of questions it was asking,
link |
01:33:33.560
it would learn in fewer samples,
link |
01:33:35.160
it would be more efficient at using data.
link |
01:33:37.960
And we did find to some extent
link |
01:33:39.480
that it was actually better than randomly asking questions.
link |
01:33:42.080
Kind of weird thing about active learning is
link |
01:33:43.960
it's also a chicken and egg problem
link |
01:33:45.240
because when you look at an image
link |
01:33:47.200
to ask a good question about the image
link |
01:33:48.720
you need to understand something about the image.
link |
01:33:50.960
You can't ask a completely arbitrarily random question,
link |
01:33:53.480
it may not even apply to that particular image.
link |
01:33:55.560
So there is some amount of understanding or knowledge
link |
01:33:57.680
that basically keeps getting built
link |
01:33:59.240
when you're doing active learning.
link |
01:34:01.360
So I think active learning by itself is really good.
link |
01:34:04.640
And the main thing we need to figure out is basically
link |
01:34:07.280
how do we come up with a technique
link |
01:34:09.680
to first model what the model knows
link |
01:34:13.360
and also model what the model does not know.
link |
01:34:16.040
I think that's the sort of beauty of it, right?
link |
01:34:18.360
Because when you know that there are certain things
link |
01:34:20.520
that you don't know anything about,
link |
01:34:22.160
asking a question about those concepts
link |
01:34:23.640
is actually going to bring you the most value.
link |
01:34:26.520
And I think that's the sort of key challenge.
link |
01:34:28.360
Now self supervised learning by itself,
link |
01:34:29.960
like selecting data for it and so on,
link |
01:34:31.480
that's actually really useful.
link |
01:34:32.680
But I think that's a very narrow view
link |
01:34:34.000
of looking at active learning, right?
link |
01:34:35.120
If you look at it more broadly,
link |
01:34:36.360
it is basically about if the model has a knowledge
link |
01:34:40.080
about N concepts,
link |
01:34:41.440
and it is weak basically about certain things.
link |
01:34:43.880
So it needs to ask questions either to discover new concepts
link |
01:34:46.920
or to basically like increase its knowledge
link |
01:34:49.240
about these N concepts.
link |
01:34:50.440
So at that level, it's a very powerful technique.
link |
01:34:53.240
I actually do think it's going to be really useful.
link |
01:34:56.560
Even in like simple things such as like data labeling,
link |
01:34:59.080
it's super useful.
link |
01:35:00.280
So here is like one simple way
link |
01:35:02.960
that you can use active learning.
link |
01:35:04.320
For example, you have your self supervised model,
link |
01:35:06.920
which is very good at predicting similarities
link |
01:35:08.760
and dissimilarities between things.
link |
01:35:10.800
And so if you label a picture as basically say a banana,
link |
01:35:15.520
now you know that all the images
link |
01:35:17.760
that are very similar to this image
link |
01:35:19.240
are also likely to contain bananas.
link |
01:35:21.480
So probably when you want to understand what else
link |
01:35:24.680
is a banana, you're not going to use these other images.
link |
01:35:26.920
You're actually going to use an image
link |
01:35:28.200
that is not completely dissimilar,
link |
01:35:31.160
but somewhere in between,
link |
01:35:32.360
which is not super similar to this image,
link |
01:35:33.840
but not super dissimilar either.
link |
01:35:35.640
And that's going to tell you a lot more
link |
01:35:37.120
about what this concept of a banana is.
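One hedged sketch of the idea being described: given a self supervised embedding space and one image already labeled banana, rank unlabeled images by cosine similarity to it and pick candidates at intermediate similarity rather than the near duplicates, since those are the ones most likely to refine the boundary of the concept. The similarity band and the number of picks are arbitrary illustrative choices, not values from any particular paper.

```python
import torch.nn.functional as F

def pick_next_to_label(labeled_emb, unlabeled_embs, low=0.4, high=0.7, k=10):
    """
    labeled_emb:    (D,)  embedding of the image already labeled (e.g. "banana")
    unlabeled_embs: (N, D) embeddings of the unlabeled pool
    Returns indices of up to k images whose cosine similarity to the labeled image
    falls in an intermediate band: not near-duplicates, not clearly unrelated.
    """
    sims = F.cosine_similarity(unlabeled_embs, labeled_emb.unsqueeze(0), dim=1)
    in_band = ((sims > low) & (sims < high)).nonzero(as_tuple=True)[0]
    # Prefer the most ambiguous candidates, i.e. those closest to the middle of the band.
    mid = (low + high) / 2
    order = (sims[in_band] - mid).abs().argsort()
    return in_band[order[:k]]
```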
link |
01:35:39.520
So that's kind of a heuristic.
link |
01:35:41.840
I wonder if it's possible to also learn,
link |
01:35:45.320
learn ways to discover the most likely
link |
01:35:50.720
the most beneficial image.
link |
01:35:52.960
So like, so not just looking at a thing
link |
01:35:55.000
that's somewhat similar to a banana,
link |
01:35:58.440
but not exactly similar,
link |
01:36:00.000
but have some kind of more complicated learning system,
link |
01:36:03.560
like learned discovery mechanism
link |
01:36:07.080
that tells you what image to look for.
link |
01:36:09.440
Like how, yeah, like actually in a self supervised way,
link |
01:36:14.360
learning strictly a function that says,
link |
01:36:17.280
is this image going to be very useful to me,
link |
01:36:20.600
given what I currently know?
link |
01:36:22.160
I think there is a lot of synergy there.
link |
01:36:24.040
It's just, I think, yeah, it's going to be explored.
link |
01:36:27.680
I think very much related to that.
link |
01:36:29.400
I kind of think of what Tesla autopilot is doing
link |
01:36:32.400
currently as kind of active learning.
link |
01:36:36.880
There's something that Andrej Karpathy and their team
link |
01:36:39.280
are calling data engine.
link |
01:36:41.280
So you're basically deploying a bunch of instantiations
link |
01:36:45.680
of a neural network into the wild
link |
01:36:47.880
and they're collecting a bunch of edge cases
link |
01:36:50.680
that are then sent back for annotation,
link |
01:36:53.240
in particular, and edge cases are defined as near failure
link |
01:36:56.720
or some weirdness on a particular task
link |
01:37:00.000
that's then sent back, it's the not exactly a banana,
link |
01:37:04.000
but almost a banana cases, sent back for annotation
link |
01:37:07.200
and then there's this loop that keeps going
link |
01:37:09.200
and you keep retraining and retraining
link |
01:37:11.600
and the active learning step there,
link |
01:37:13.360
or whatever you want to call it,
link |
01:37:14.800
is the cars themselves that are sending you back the data,
link |
01:37:19.080
like what the hell happened here?
link |
01:37:20.760
This was weird.
link |
01:37:22.800
What are your thoughts about that sort of deployment
link |
01:37:26.440
of neural networks in the wild?
link |
01:37:28.240
Another way to ask a question for first is your thoughts
link |
01:37:31.320
and maybe if you want to comment,
link |
01:37:33.840
are there applications for autonomous driving,
link |
01:37:36.960
like computer vision based autonomous driving,
link |
01:37:40.160
applications of self supervised learning
link |
01:37:42.040
in the context of computer vision based autonomous driving?
link |
01:37:47.520
So I think so.
link |
01:37:48.360
I think for self supervised learning to be used
link |
01:37:50.040
in autonomous driving, there's lots of opportunities.
link |
01:37:52.720
Just like pure consistency in predictions is one way, right?
link |
01:37:55.840
So because you have this nice sequence of data
link |
01:38:00.280
that is coming in a video stream of it,
link |
01:38:02.360
associated of course with the actions
link |
01:38:04.080
that say the car took,
link |
01:38:05.440
you can form a very nice predictive model
link |
01:38:07.640
of what's happening.
link |
01:38:08.480
So for example, like all the way,
link |
01:38:11.440
like one way possibly in which they're figuring out
link |
01:38:14.480
what data to get labeled is basically
link |
01:38:15.920
through prediction uncertainty, right?
link |
01:38:17.480
So you predict that the car was going to turn right.
link |
01:38:20.400
So this was the action that was going to happen,
link |
01:38:21.880
say in the shadow mode and now the driver turned left.
link |
01:38:24.680
And this is a really big surprise.
link |
01:38:27.200
So basically by forming these good predictive models,
link |
01:38:30.160
you are, I mean, these are kind of self supervised models,
link |
01:38:32.760
right?
link |
01:38:33.600
Prediction models are basically being trained
link |
01:38:34.640
just by looking at what's going to happen next
link |
01:38:36.800
and asking them to predict what's going to happen next.
link |
01:38:38.960
So I would say this is really like one use
link |
01:38:40.720
of self supervised learning.
link |
01:38:42.320
It's a predictive model
link |
01:38:43.440
and you're learning a predictive model
link |
01:38:44.680
basically just by looking at what data you have.
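A small sketch of that surprise based selection idea: run the predictive model in shadow mode, compare its predicted action with what the human driver actually did, and flag frames where the model's negative log likelihood of the real action is large, so they can be sent back for annotation. The action classes and threshold are made up for illustration; this is not a description of Tesla's actual pipeline.

```python
import torch
import torch.nn.functional as F

ACTIONS = ["straight", "left", "right", "brake"]   # illustrative action classes

def flag_for_annotation(action_logits, driver_action_idx, surprise_threshold=2.0):
    """
    action_logits:     (B, num_actions) model predictions made in shadow mode
    driver_action_idx: (B,) indices of what the human driver actually did
    Returns a boolean mask of samples where the real action was very unlikely
    under the model, i.e. the model was confidently wrong or very unsure.
    """
    log_probs = F.log_softmax(action_logits, dim=1)
    nll = -log_probs.gather(1, driver_action_idx.unsqueeze(1)).squeeze(1)
    return nll > surprise_threshold

# Example: the model expected "right" but the driver turned "left".
logits = torch.tensor([[0.1, -2.0, 3.0, 0.0]])       # confident about "right"
driver = torch.tensor([ACTIONS.index("left")])
print(flag_for_annotation(logits, driver))            # tensor([True]) -> send back for labeling
```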
link |
01:38:46.880
Is there something about that active learning context
link |
01:38:49.600
that you find insights from?
link |
01:38:53.000
Like that kind of deployment of the system,
link |
01:38:54.760
seeing cases where it doesn't perform as you expected
link |
01:38:59.120
and then retraining the system based on that?
link |
01:39:01.000
I think that, I mean, that really resonates with me.
link |
01:39:03.600
It's super smart to do it that way.
link |
01:39:05.520
Because I mean, the thing is with any kind of like
link |
01:39:08.760
practical system like autonomous driving,
link |
01:39:11.120
there are those edge cases
link |
01:39:12.600
that are the things that are actually the problem, right?
link |
01:39:14.520
I mean, highway driving or like freeway driving
link |
01:39:17.400
has basically been like, there has been a lot of success
link |
01:39:20.120
in that particular part of autonomous driving
link |
01:39:21.800
for a long time, I would say like since the 80s or something.
link |
01:39:25.520
Now, the point is all these failure cases
link |
01:39:28.000
are the sort of reason why autonomous driving
link |
01:39:30.560
hasn't become like super, super mainstream
link |
01:39:33.360
available like in every possible car right now.
link |
01:39:35.640
And so basically by really scaling this problem out
link |
01:39:38.200
by really trying to get all of these edge cases out
link |
01:39:40.440
as quickly as possible.
link |
01:39:41.840
And then just like using those to improve your model,
link |
01:39:43.920
that's super smart.
link |
01:39:45.640
And prediction uncertainty to do that is like
link |
01:39:47.440
one really nice way of doing it.
link |
01:39:49.800
Let me put you on the spot.
link |
01:39:52.080
So we mentioned offline Jitendra,
link |
01:39:55.320
he thinks that the Tesla computer vision approach
link |
01:39:58.280
or really any approach for autonomous driving
link |
01:40:00.840
is very far away.
link |
01:40:02.720
How many years away, if you have to bet all your money on it,
link |
01:40:07.000
are we just solving autonomous driving
link |
01:40:09.640
with this kind of computer vision only
link |
01:40:12.040
machine learning based approach?
link |
01:40:13.600
Okay, so what does solving autonomous driving mean?
link |
01:40:15.440
Does it mean solving it in the US?
link |
01:40:17.240
Does it mean solving it in India?
link |
01:40:18.520
Because I can tell you that very different types
link |
01:40:20.160
of driving happening.
link |
01:40:21.200
Not India, not Russia.
link |
01:40:22.880
In the United States autonomous,
link |
01:40:26.280
so what solving means is when the car says it has control,
link |
01:40:32.000
it is fully liable.
link |
01:40:34.200
You can go to sleep, it is driving by itself.
link |
01:40:37.920
So this is highway and city driving,
link |
01:40:39.880
but not everywhere, but mostly everywhere.
link |
01:40:42.440
And it's let's say significantly better,
link |
01:40:45.160
like say five times less accidents than humans.
link |
01:40:50.600
Sufficiently safer such that the public feels
link |
01:40:54.080
like that transition is enticing beneficial
link |
01:40:58.040
both for our safety and financing,
link |
01:40:59.640
all those kinds of things.
link |
01:41:01.120
Okay, so first disclaimer,
link |
01:41:02.360
I'm not an expert in autonomous driving.
link |
01:41:04.320
So let me put it out there.
link |
01:41:06.040
I would say like at least five to 10 years.
link |
01:41:09.440
This would be my guess, from now.
link |
01:41:13.800
I'm actually very impressed.
link |
01:41:14.760
Like when I sat in a friend's Tesla recently
link |
01:41:16.880
and of course like looking,
link |
01:41:19.240
so it can, on the screen,
link |
01:41:20.720
it basically shows all the detections and everything
link |
01:41:22.920
that the car is doing as you're driving by.
link |
01:41:24.720
And that's super distracting for me as a person
link |
01:41:26.960
because all I keep looking at is like the bounding boxes
link |
01:41:29.520
and the cars it's tracking, and it's really impressive.
link |
01:41:31.800
Like especially when it's raining
link |
01:41:33.040
and it's able to do that,
link |
01:41:34.360
that was the most impressive part for me.
link |
01:41:36.040
It's actually able to get through rain and do that.
link |
01:41:38.600
And one of the reasons why like a lot of us believed
link |
01:41:41.800
and I would put myself in that category
link |
01:41:44.120
is LiDAR based sort of technology
link |
01:41:46.880
for autonomous driving was the key driver, right?
link |
01:41:48.760
So Waymo was using it for the longest time.
link |
01:41:51.000
And Tesla then decided to go this completely other route
link |
01:41:53.320
that oh, we're not going to even use LiDAR.
link |
01:41:55.800
So their initial system I think was camera and radar based
link |
01:41:58.720
and now they're actually moving
link |
01:41:59.680
to a completely like vision based system.
link |
01:42:02.040
And so that was just like, it sounded completely crazy.
link |
01:42:04.680
Like LiDAR is very useful in cases
link |
01:42:07.040
where you have low visibility.
link |
01:42:09.280
Of course it comes with its own set of complications.
link |
01:42:11.760
But now to see that happen in like on a live Tesla
link |
01:42:15.200
that basically just proves everyone wrong,
link |
01:42:17.040
I would say in a way.
link |
01:42:18.200
And that's just working really well.
link |
01:42:20.600
I think there were also like a lot of advancements
link |
01:42:22.800
in camera technology.
link |
01:42:23.960
Now there were like, I know at CMU when I was there
link |
01:42:26.360
there was a particular kind of camera
link |
01:42:28.000
that had been developed that was really good
link |
01:42:30.120
at basically low visibility setting.
link |
01:42:32.800
So like lots of snow and lots of rain,
link |
01:42:34.480
it could actually still have a very reasonable visibility.
link |
01:42:37.720
And I think there are lots of these kinds of innovations
link |
01:42:39.440
that will happen on the sensor side itself
link |
01:42:41.000
which is actually going to make this very easy
link |
01:42:42.880
in the future.
link |
01:42:43.880
And so maybe that's actually why I'm more optimistic
link |
01:42:46.120
about vision based self like autonomous driving.
link |
01:42:49.040
I was gonna call it self supervised driving,
link |
01:42:50.440
but vision based autonomous driving,
link |
01:42:53.520
that's the reason I'm quite optimistic about it.
link |
01:42:55.440
Because I think there are going to be lots
link |
01:42:56.640
of these advances on the sensor side itself.
link |
01:42:58.960
So acquiring this data,
link |
01:43:00.720
we're actually going to get much better about it.
link |
01:43:02.640
And then of course when once we're able to scale out
link |
01:43:05.080
and get all of these edge cases in,
link |
01:43:06.800
as like Andrej described,
link |
01:43:08.720
I think that's going to make us go very far away.
link |
01:43:11.720
Yeah, so it's funny,
link |
01:43:13.560
I'm very much with you on the five to 10 years,
link |
01:43:16.280
maybe 10 years,
link |
01:43:17.840
but you made it,
link |
01:43:20.080
I'm not sure how you made it sound,
link |
01:43:21.760
but for some people that might seem like really far away,
link |
01:43:25.320
and then for other people,
link |
01:43:27.800
it might seem like very close.
link |
01:43:30.440
There's a lot of fundamental questions
link |
01:43:32.320
about how much game theory is in this whole thing.
link |
01:43:36.880
So how much is this simply collision avoidance problem?
link |
01:43:42.120
And how much of it is,
link |
01:43:44.360
you're still interacting with other humans in the scene,
link |
01:43:46.960
and you're trying to create an experience that's compelling
link |
01:43:49.480
so you want to get from point A to point B quickly,
link |
01:43:53.080
you want to navigate the scene in a safe way,
link |
01:43:55.280
but you also want to show some level of aggression,
link |
01:43:58.480
because, well, certainly this is why you're screwed in India
link |
01:44:02.000
because you have to show aggression.
link |
01:44:03.400
Or Jersey, or New Jersey.
link |
01:44:04.600
Or Jersey.
link |
01:44:05.440
So like, or New York, or basically any major city,
link |
01:44:11.200
but I think it's probably Elon that I talked the most
link |
01:44:14.440
about this, which is a surprise, the level
link |
01:44:16.960
to which they're not considering human beings
link |
01:44:20.080
as a huge problem in this as a source of problem.
link |
01:44:22.960
Like the driving is fundamentally a robot on robot
link |
01:44:29.000
versus the environment problem,
link |
01:44:31.160
versus like you can just consider humans
link |
01:44:33.960
not part of the problem.
link |
01:44:35.160
I used to think humans are almost certainly
link |
01:44:38.840
have to be modeled really well.
link |
01:44:41.200
Pedestrians and cyclists and humans inside of the cars,
link |
01:44:44.360
you have to have like mental models for them.
link |
01:44:46.320
You cannot just see it as objects.
link |
01:44:48.280
But more and more, it's like the,
link |
01:44:51.400
it's the same kind of intuition breaking thing
link |
01:44:53.700
that self supervised learning does,
link |
01:44:56.200
which is, well, maybe through the learning,
link |
01:44:58.840
you'll get all the human,
link |
01:45:01.520
like human information you need, right?
link |
01:45:04.440
Like maybe you'll get it just with enough data.
link |
01:45:07.320
You don't need to have explicit good models
link |
01:45:09.680
of human behavior.
link |
01:45:10.800
Maybe you get it through the data.
link |
01:45:12.120
So I mean, my skepticism also just knowing
link |
01:45:14.640
a lot of automotive companies
link |
01:45:16.360
and how difficult it is to be innovative.
link |
01:45:18.600
I was skeptical that they would be able at scale
link |
01:45:22.520
to convert the driving scene across the world
link |
01:45:27.400
into digital form such that you can create
link |
01:45:30.640
this data engine at scale.
link |
01:45:33.160
And the fact that Tesla is at least getting there
link |
01:45:36.640
or are already there makes me think
link |
01:45:40.400
that it's now starting to be coupled
link |
01:45:43.640
to this self supervised learning vision,
link |
01:45:47.600
which is like, if that's gonna work,
link |
01:45:49.840
if through purely this process you can get really far,
link |
01:45:52.920
then maybe you can solve driving that way.
link |
01:45:54.880
I don't know.
link |
01:45:55.720
I tend to believe we don't give enough credit
link |
01:46:00.000
to how amazing humans are both at driving
link |
01:46:05.960
and at supervising autonomous systems.
link |
01:46:09.400
And also we don't, I wish we were,
link |
01:46:13.240
I wish there was much more driver sensing inside Teslas
link |
01:46:17.160
and much deeper consideration of human factors,
link |
01:46:21.240
like understanding psychology and drowsiness
link |
01:46:24.720
and all those kinds of things.
link |
01:46:26.240
When the car does more and more of the work,
link |
01:46:28.760
how to keep utilizing the little human supervision
link |
01:46:33.000
that is needed to keep this whole thing safe.
link |
01:46:35.120
I mean, it's a fascinating dance of human robot interaction.
link |
01:46:38.480
To me, autonomous driving for a long time
link |
01:46:42.160
is a human robot interaction problem.
link |
01:46:45.080
It is not a robotics problem or computer vision problem.
link |
01:46:48.080
Like you have to have a human in the loop.
link |
01:46:50.040
But so, which is why I think it's 10 years plus.
link |
01:46:53.360
But I do think there'll be a bunch of cities and contexts
link |
01:46:56.320
where geo restricted, it will work really, really damn well.
link |
01:47:01.800
Yeah.
link |
01:47:02.640
So I think for me, it's five if I'm being optimistic
link |
01:47:05.000
and it's going to be five for a lot of cases.
link |
01:47:07.400
And 10 plus, yeah, I agree with you.
link |
01:47:09.240
10 plus, basically, if we want to cover most of,
link |
01:47:13.160
say, contiguous United States or something.
link |
01:47:15.280
Oh, interesting.
link |
01:47:16.120
So my optimistic is five and pessimistic is 30.
link |
01:47:20.320
30.
link |
01:47:21.160
I have a long tail on this one.
link |
01:47:22.520
I haven't watched enough driving videos.
link |
01:47:24.440
I've watched enough pedestrians to think like we may be,
link |
01:47:29.160
like there's a small part of me still, not a small,
link |
01:47:31.680
like a pretty big part of me that thinks
link |
01:47:34.360
we will have to build AGI to solve driving.
link |
01:47:37.560
Oh well.
link |
01:47:38.440
Like there's something to me like,
link |
01:47:40.040
because humans are part of the picture,
link |
01:47:41.800
deeply part of the picture,
link |
01:47:44.000
and also human society is part of the picture
link |
01:47:46.080
in that human life is at stake.
link |
01:47:47.920
Anytime a robot kills a human,
link |
01:47:50.840
it's not clear to me that that's not a problem
link |
01:47:54.280
that machine learning will also have to solve.
link |
01:47:56.360
Like you have to integrate that into the whole thing.
link |
01:48:00.080
Just like Facebook or social networks,
link |
01:48:03.280
one thing is to say how to make
link |
01:48:04.600
a really good recommender system.
link |
01:48:06.720
And then the other thing is to integrate
link |
01:48:08.640
into that recommender system,
link |
01:48:10.240
all the journalists that will write articles
link |
01:48:12.080
about that recommender system.
link |
01:48:13.880
Like you have to consider the society
link |
01:48:15.880
within which the AI system operates.
link |
01:48:18.400
And in order to, and like politicians too,
link |
01:48:21.000
this is regulatory stuff for autonomous driving.
link |
01:48:24.200
It's kind of fascinating that the more successful
link |
01:48:26.720
your AI system becomes,
link |
01:48:28.720
the more it gets integrated in society
link |
01:48:31.600
and the more precious politicians and the public
link |
01:48:34.600
and the clickbait journalists
link |
01:48:36.000
and all the different fascinating forces
link |
01:48:38.040
of our society start acting on it.
link |
01:48:40.360
And then it's no longer how good you are
link |
01:48:42.200
at doing the initial task.
link |
01:48:43.960
It's also how good you are at navigating human nature,
link |
01:48:47.000
which is a fascinating space.
link |
01:48:49.920
What do you think are the limits of deep learning?
link |
01:48:52.600
If you allow me, we'll zoom out a little bit
link |
01:48:54.800
into the big question of artificial intelligence.
link |
01:48:58.080
You said dark matter of intelligence
link |
01:49:01.240
is self supervised learning, but there could be more.
link |
01:49:04.320
What do you think the limits of self supervised learning
link |
01:49:07.760
and just learning in general, deep learning are?
link |
01:49:10.720
I think like for deep learning in particular,
link |
01:49:12.680
because self supervised learning is I would say
link |
01:49:14.640
a little bit more vague right now.
link |
01:49:16.800
So I wouldn't like for something that's so vague,
link |
01:49:18.680
it's hard to predict what its limits are going to be.
link |
01:49:21.960
But like I said, I think anywhere you want to interact
link |
01:49:25.240
with humans, self supervised learning kind of hits a boundary
link |
01:49:27.920
very quickly because you need to have an interface
link |
01:49:29.960
to be able to communicate with the human.
link |
01:49:31.600
So really like if you have just like vacuous concepts
link |
01:49:35.040
or like just like nebulous concepts discovered by a network,
link |
01:49:38.600
it's very hard to communicate those to the human
link |
01:49:40.360
without like inserting some kind of human knowledge
link |
01:49:42.440
or some kind of like human bias there.
link |
01:49:45.600
In general, I think for deep learning,
link |
01:49:47.040
the biggest challenge is just like data efficiency.
link |
01:49:50.680
Even with self supervised learning,
link |
01:49:52.200
even with anything else,
link |
01:49:53.560
if you just see a single concept once,
link |
01:49:57.440
like one image of a, like I don't know
link |
01:49:59.840
whatever you want to call it, like any concept,
link |
01:50:02.520
it's really hard for these methods to generalize
link |
01:50:04.800
by looking at just one or two samples of things.
link |
01:50:07.680
And that has been a real challenge.
link |
01:50:09.760
And I think that's actually why like these edge cases,
link |
01:50:11.680
for example, for Tesla are actually that important.
link |
01:50:14.520
Because if you see just one instance of the car failing,
link |
01:50:18.040
and if you just annotate that
link |
01:50:19.320
and you get that into your data set,
link |
01:50:21.360
you have like a very limited guarantee
link |
01:50:23.560
that it's not going to happen again.
link |
01:50:25.160
And you're actually going to be able to recognize
link |
01:50:26.720
this kind of instance in a very different scenario.
link |
01:50:28.640
So like when it was snowing,
link |
01:50:30.320
so you got that thing labeled when it was snowing,
link |
01:50:32.040
but now when it's raining,
link |
01:50:33.240
you're actually not able to get it.
link |
01:50:34.640
Or you basically have the same scenario
link |
01:50:36.600
in a different part of the world.
link |
01:50:37.440
So the lighting was different or so on.
link |
01:50:39.120
So it's just really hard for these models,
link |
01:50:41.000
like deep learning, especially to do that.
link |
01:50:42.720
What's your intuition?
link |
01:50:43.560
How do we solve the handwritten digit recognition problem
link |
01:50:47.600
when we only have one example for each number?
link |
01:50:51.240
It feels like humans are using something like transfer learning.
link |
01:50:54.760
Right, I think it's,
link |
01:50:56.040
we are good at transferring knowledge a little bit.
link |
01:50:59.280
We are just better at like,
link |
01:51:01.280
for a lot of these problems
link |
01:51:02.680
where we are generalizing from a single sample,
link |
01:51:04.880
recognizing from a single sample,
link |
01:51:07.000
we are using a lot of our own domain knowledge
link |
01:51:08.800
and a lot of our like inductive bias
link |
01:51:10.360
into that one sample to generalize it.
link |
01:51:12.320
So I've never seen you write the number nine, for example.
link |
01:51:15.360
And if you were to write it, I would still get it.
link |
01:51:17.480
And if you were to write a different kind of alphabet
link |
01:51:19.320
and like write it in two different ways,
link |
01:51:20.880
I would still probably be able to figure out
link |
01:51:22.360
that these are the same two characters.
link |
01:51:24.720
It's just that I have been very used to seeing
link |
01:51:26.960
handwritten digits in my life.
link |
01:51:29.080
The other sort of problem with any deep learning system
link |
01:51:31.360
or any kind of machine learning system
link |
01:51:32.720
is like its guarantees, right?
link |
01:51:34.200
There are no guarantees for it.
link |
01:51:35.880
Now you can argue that humans also don't have any guarantees.
link |
01:51:38.200
Like there is no guarantee that I can recognize a cat
link |
01:51:41.160
in every scenario.
link |
01:51:42.280
I'm sure there are going to be lots of cats
link |
01:51:43.920
that I don't recognize,
link |
01:51:45.040
lots of scenarios in which I don't recognize cats
link |
01:51:47.120
in general.
link |
01:51:48.120
But I think from just a sort of application perspective,
link |
01:51:52.880
you do need guarantees, right?
link |
01:51:54.800
We call these things algorithms.
link |
01:51:57.000
Now algorithms, like traditional CS algorithms
link |
01:51:59.120
have guarantees.
link |
01:52:00.000
Sorting is a guarantee.
link |
01:52:01.520
If you were to call sort on a particular array of numbers,
link |
01:52:05.640
you are guaranteed that it's going to be sorted.
link |
01:52:07.680
Otherwise, it's a bug.
link |
01:52:09.360
Now for machine learning, it's very hard to characterize this.
link |
01:52:12.480
We know for a fact that a cat recognition model
link |
01:52:15.480
is not going to recognize cats, every cat in the world
link |
01:52:18.040
in every circumstance.
link |
01:52:19.760
I think most people would agree with that statement.
link |
01:52:22.080
But we are still OK with it.
link |
01:52:23.640
We still don't call this a bug.
link |
01:52:25.400
Whereas in traditional computer science
link |
01:52:26.720
or traditional science, if you have this kind of failure case
link |
01:52:29.520
existing, then you think of it as something is wrong.
link |
01:52:33.200
I think there is this sort of notion of nebulous correctness
link |
01:52:36.080
for machine learning.
link |
01:52:37.040
And that's something we just need to be very comfortable with.
link |
01:52:39.520
And for deep learning or for a lot of these machine learning
link |
01:52:42.000
algorithms, it's not clear how do we characterize this notion
link |
01:52:45.160
of correctness.
link |
01:52:46.360
I think it's a limitation in our understanding
link |
01:52:48.160
or at least a limitation in our phrasing of this.
link |
01:52:51.200
And if we were to come up with better ways
link |
01:52:53.080
to understand this limitation, then it would actually
link |
01:52:55.760
help us a lot.
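As a side note for readers, here is a minimal sketch, not from the conversation, contrasting the hard guarantee of a classical algorithm like sorting with the statistical, guarantee-free notion of correctness we settle for with a learned model; the cat_model, val_images, and val_labels names are hypothetical.

```python
# Illustrative sketch: a classical algorithm's guarantee vs. the
# "nebulous correctness" we accept from a learned classifier.

def is_sorted(xs):
    return all(a <= b for a, b in zip(xs, xs[1:]))

out = sorted([5, 2, 9, 1])
assert is_sorted(out)  # a hard guarantee: if this ever fails, sorted() has a bug

def evaluate(model, images, labels):
    # the best we can do for a learned model: measure accuracy on held-out data
    correct = sum(model(img) == lbl for img, lbl in zip(images, labels))
    return correct / len(labels)

# acc = evaluate(cat_model, val_images, val_labels)  # hypothetical model and data
# acc == 0.95 is a statistical statement, not a guarantee that the next
# cat in a new circumstance will be recognized.
```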
link |
01:52:57.240
Do you think there's a distinction
link |
01:52:58.840
between the concept of learning and the concept of reasoning?
link |
01:53:04.320
Do you think it's possible for neural networks to reason?
link |
01:53:10.320
So I think of it slightly differently.
link |
01:53:11.800
So for me, learning is whenever I can make a snap judgment.
link |
01:53:15.680
So if you show me a picture of a dog,
link |
01:53:17.200
I can immediately say it's a dog.
link |
01:53:18.920
But if you give me a puzzle, whatever,
link |
01:53:21.760
a Rube Goldberg machine of things that are going to happen,
link |
01:53:24.640
then I have to reason.
link |
01:53:25.640
Because it's a very complicated setup.
link |
01:53:27.600
I've never seen that particular setup.
link |
01:53:29.320
And I really need to draw and imagine in my head
link |
01:53:32.200
what's going to happen to figure it out.
link |
01:53:34.680
So I think, yes, neural networks are really good at recognition,
link |
01:53:38.920
but they're not very good at reasoning.
link |
01:53:41.160
Because if they have seen something before or seen
link |
01:53:44.760
something similar before, they're
link |
01:53:45.920
very good at making those sort of snap judgments.
link |
01:53:48.240
But if you were to give them a very complicated thing
link |
01:53:50.680
that they've not seen before, they
link |
01:53:52.600
have very limited ability right now
link |
01:53:55.280
to compose different things.
link |
01:53:56.560
Like, oh, I've seen this particular part before.
link |
01:53:58.320
I've seen this particular part before.
link |
01:54:00.040
And now probably this is how they're going to work in tandem.
link |
01:54:02.920
It's very hard for them to come up with these kinds of things.
link |
01:54:05.200
Well, there's a certain aspect to reasoning
link |
01:54:08.800
that you can maybe convert into the process of programming.
link |
01:54:11.880
And so there's the whole field of the program synthesis.
link |
01:54:14.320
And people have been applying machine learning
link |
01:54:17.240
to the problem of program synthesis.
link |
01:54:18.920
And the question is, can the step of composition,
link |
01:54:22.680
why can't that be learned?
link |
01:54:25.520
This step of building things on top of it,
link |
01:54:29.400
like little intuitions, concepts on top of each other,
link |
01:54:33.240
can that be learnable?
link |
01:54:35.320
What's your intuition there?
link |
01:54:37.760
I guess a similar set of techniques,
link |
01:54:39.480
do you think that would be applicable?
link |
01:54:42.080
So I think it is, of course, learnable.
link |
01:54:44.000
It is learnable because we are prime examples of machines
link |
01:54:47.080
that have, or individuals that have learned this.
link |
01:54:49.640
Humans have learned this.
link |
01:54:51.120
So it is, of course, it is a technique that
link |
01:54:52.920
is very easy to learn.
link |
01:54:55.920
I think where we are kind of hitting a wall basically
link |
01:54:58.920
with current machine learning is the fact
link |
01:55:01.280
that when the network learns all of this information,
link |
01:55:04.680
we basically are not able to figure out how well it's
link |
01:55:08.200
going to generalize to an unseen thing.
link |
01:55:10.680
And we have no a priori, no way of characterizing that.
link |
01:55:15.080
And I think that's basically telling us a lot about the fact
link |
01:55:19.640
that we really don't know what this model has learned
link |
01:55:21.680
and how well it's basically, because we don't know how well
link |
01:55:23.960
it's going to transfer.
link |
01:55:25.240
There's also a sense in which it feels like we humans may not
link |
01:55:29.400
be aware of how much background, how good our background model
link |
01:55:35.960
is, how much knowledge we just have slowly building
link |
01:55:40.080
on top of each other.
link |
01:55:41.240
It feels like neural networks are constantly throwing stuff
link |
01:55:43.760
out.
link |
01:55:44.240
You'll do some incredible thing where
link |
01:55:45.720
you're learning a particular task in computer vision.
link |
01:55:49.240
You celebrate your state of the art successes,
link |
01:55:51.440
and you throw that out.
link |
01:55:53.200
It feels like you're never using stuff
link |
01:55:56.400
you've learned for your future successes in other domains.
link |
01:56:00.280
And humans are obviously doing that exceptionally well,
link |
01:56:03.400
still throwing stuff away in their mind,
link |
01:56:06.000
but keeping certain kernels of truth.
link |
01:56:08.000
Right, so I think we're like, continual learning
link |
01:56:10.280
is sort of the paradigm for this in machine learning.
link |
01:56:12.280
And I don't think it's a very well explored paradigm.
link |
01:56:15.360
We have things in deep learning, for example.
link |
01:56:17.560
Catastrophic forgetting is one of the standard things.
link |
01:56:20.320
The thing basically being that if you teach a network
link |
01:56:23.360
to recognize dogs, and now you teach
link |
01:56:25.560
that same network to recognize cats,
link |
01:56:27.520
it basically forgets how to recognize dogs.
link |
01:56:29.200
So it forgets very quickly.
link |
01:56:31.760
And whereas a human, if you were to teach someone
link |
01:56:33.760
to recognize dogs and then to recognize cats,
link |
01:56:36.040
they don't forget immediately how to recognize these dogs.
link |
01:56:38.600
I think that's basically what you're trying to get.
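A toy sketch, not from the conversation, of the catastrophic forgetting being described: a small network is trained on a synthetic task A, then fine-tuned on task B with no replay, and its accuracy on task A collapses. The data, model, and hyperparameters here are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_task(shift):
    # two Gaussian blobs; the shift moves task B's data elsewhere in input space
    x = torch.randn(400, 2) + shift
    y = (x[:, 0] > shift[0]).long()
    return x, y

task_a = make_task(torch.tensor([0.0, 0.0]))
task_b = make_task(torch.tensor([5.0, 5.0]))

model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

def train(x, y, steps=200):
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()

def accuracy(x, y):
    return (model(x).argmax(1) == y).float().mean().item()

train(*task_a)
print("task A accuracy after training on A:", accuracy(*task_a))  # close to 1.0
train(*task_b)  # sequential fine-tuning, no replay of task A data
print("task A accuracy after training on B:", accuracy(*task_a))  # typically collapses
```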
link |
01:56:40.800
Yeah, I wonder if the long term memory mechanisms,
link |
01:56:44.880
or the mechanisms that store not just memories,
link |
01:56:47.240
but concepts that allow you to reason and compose concepts,
link |
01:56:57.360
if those things will look very different than your networks,
link |
01:57:00.040
or if you can do that within a single neural network
link |
01:57:02.480
with some particular sort of architecture quirks.
link |
01:57:06.160
That seems to be a really open problem.
link |
01:57:07.840
And of course, I go up and down on that
link |
01:57:09.560
because there's something so compelling to the symbolic AI
link |
01:57:15.000
or to the ideas of logic based sort of expert systems.
link |
01:57:20.480
You have human interpretable facts
link |
01:57:22.600
that built on top of each other.
link |
01:57:24.240
It's really annoying with self supervised learning
link |
01:57:27.960
that the AI is not very explainable.
link |
01:57:31.440
You can't understand all the beautiful things it has learned.
link |
01:57:35.680
You can't ask it questions.
link |
01:57:38.520
But then again, maybe that's a stupid thing for us humans
link |
01:57:41.680
to want.
link |
01:57:42.600
Right, I think whenever we try to understand it,
link |
01:57:45.400
we're putting our own subjective human bias into it.
link |
01:57:48.560
And I think that's the sort of problem.
link |
01:57:50.160
With self supervised learning, the goal
link |
01:57:51.560
is that it should learn naturally from the data.
link |
01:57:54.400
So now if you try to understand it,
link |
01:57:55.680
you are using your own preconceived notions
link |
01:57:58.840
of what this model has learned.
link |
01:58:01.080
That's the problem.
link |
01:58:03.560
High level question, what do you think
link |
01:58:05.200
it takes to build a system with super human,
link |
01:58:09.360
maybe let's say human level or super human level,
link |
01:58:11.920
general intelligence?
link |
01:58:13.560
We've already kind of started talking about this,
link |
01:58:15.600
but what's your intuition?
link |
01:58:18.040
Does this thing have to have a body?
link |
01:58:20.880
Does it have to interact richly with the world?
link |
01:58:25.440
Does it have to have some more human elements
link |
01:58:27.920
like self awareness?
link |
01:58:30.520
I think emotion.
link |
01:58:32.280
I think emotion is something which is like it's not really
link |
01:58:36.560
attributed typically in standard machine learning.
link |
01:58:38.440
It's not something we think about.
link |
01:58:39.760
There is NLP, there is vision, there
link |
01:58:41.200
is no emotion.
link |
01:58:42.600
Emotion is never a part of all of this.
link |
01:58:44.600
And that just seems a little bit weird to me.
link |
01:58:47.080
I think the reason basically being that there is surprise
link |
01:58:50.320
and basically surprise is one of the reasons emotion arises,
link |
01:58:54.520
like what happens and what you expect to happen.
link |
01:58:57.120
There is a mismatch between these things.
link |
01:58:59.400
And so that gives rise like I can either be surprised
link |
01:59:02.280
or I can be saddened or I can be happy and all of this.
link |
01:59:05.320
And so this basically indicates that I already
link |
01:59:08.520
have a predictive model in my head
link |
01:59:10.120
and something that I predicted or something
link |
01:59:11.880
that I thought was likely to happen.
link |
01:59:13.640
And then there was something that I observed that happened.
link |
01:59:16.000
There was a disconnect between these two things.
link |
01:59:18.200
And that basically is like maybe one of the reasons
link |
01:59:21.840
why, like, you have a lot of emotions.
link |
01:59:24.240
Yeah, I think so. I talk to people a lot about emotions,
link |
01:59:26.840
like Lisa Feldman Barrett.
link |
01:59:29.080
I think that's an interesting concept of emotion.
link |
01:59:31.680
But I have a sense that emotion primarily
link |
01:59:36.800
in the way we think about it, which
link |
01:59:38.280
is the display of emotion is a communication mechanism
link |
01:59:42.640
between humans.
link |
01:59:43.840
So it's a part of basically human to human interaction.
link |
01:59:48.280
An important part, but just the part.
link |
01:59:50.240
So it's like I would throw it into the full mix
link |
01:59:55.080
of communication.
link |
01:59:58.080
And to me, communication can be done with objects
link |
02:00:01.280
that don't look at all like humans.
link |
02:00:04.360
OK.
link |
02:00:05.480
I've seen our ability to anthropomorphize,
link |
02:00:07.600
our ability to connect with things
link |
02:00:09.120
that look like a Roomba, our ability to connect.
link |
02:00:12.000
First of all, let's talk about other biological systems
link |
02:00:14.720
like dogs, our ability to love things that are very different
link |
02:00:18.200
than humans.
link |
02:00:19.400
But they do display emotion, right?
link |
02:00:20.960
I mean, dogs do display emotion.
link |
02:00:23.200
So they don't have to be anthropomorphic for them
link |
02:00:25.640
to display the kind of emotions that we do.
link |
02:00:28.240
Exactly.
link |
02:00:28.720
So I mean, but then the word emotion starts to lose its meaning.
link |
02:00:33.800
So then we have to be, I guess, specific.
link |
02:00:35.920
But yeah, so have rich, flavorful communication.
link |
02:00:39.400
Communication, yeah.
link |
02:00:40.240
Yeah, so like, yes, it's full of emotion.
link |
02:00:42.960
It's full of wit and humor and moods and all those kinds of things.
link |
02:00:50.040
Yeah, so you're talking about like flavor.
link |
02:00:53.640
Flavor, yeah.
link |
02:00:54.480
OK, let's follow that.
link |
02:00:55.400
So there's content and then there is flavor
link |
02:00:57.200
and I'm talking about the flavor.
link |
02:00:58.400
Do you think it needs to have a body?
link |
02:01:00.240
Do you think like to interact with the physical world,
link |
02:01:02.800
do you think you can understand the physical world
link |
02:01:04.640
without being able to directly interact with it?
link |
02:01:07.040
I don't think so, yeah.
link |
02:01:08.440
I think at some point we will need to bite the bullet
link |
02:01:10.680
and actually interact with the physical world.
link |
02:01:12.680
As much as I like working on like passive computer vision,
link |
02:01:15.880
where I just like sit in my armchair and look at videos
link |
02:01:18.160
and learn, I do think that we will
link |
02:01:20.840
need to have some kind of embodiment
link |
02:01:22.760
or some kind of interaction to figure out
link |
02:01:25.040
things about the world.
link |
02:01:26.960
What about consciousness?
link |
02:01:28.640
Do you think, how often do you think about consciousness
link |
02:01:32.320
when you think about your work?
link |
02:01:34.400
You could think of it as the more simple thing
link |
02:01:36.520
of self awareness, of being aware that you
link |
02:01:40.840
are a perceiving, sensing, acting thing in this world,
link |
02:01:46.840
or you can think about the bigger version of that,
link |
02:01:50.320
which is consciousness, which is having,
link |
02:01:53.800
it feel like something to be that entity,
link |
02:01:57.200
the subjective experience of being in this world.
link |
02:01:59.520
So I think of self awareness a little bit more than the broader
link |
02:02:02.880
goal of it, because I think self awareness
link |
02:02:04.920
is pretty critical for any kind of AGI or whatever you
link |
02:02:09.360
want to call it that we build, because it
link |
02:02:11.480
needs to contextualize what it is and what role it's playing
link |
02:02:15.520
with respect to all the other things that exist around it.
link |
02:02:17.920
I think that requires self awareness.
link |
02:02:19.640
It needs to understand that it's an autonomous car.
link |
02:02:23.440
And what does that mean?
link |
02:02:24.880
What are its limitations?
link |
02:02:26.200
What are the things that it is supposed to do and so on?
link |
02:02:29.040
What is its role in some way?
link |
02:02:30.680
Or, I mean, these are the kind of things
link |
02:02:34.200
that we kind of expect from it, I would say.
link |
02:02:36.840
And so that's the level of self awareness
link |
02:02:39.320
that's, I would say, basically required at least,
link |
02:02:42.160
if not more than that.
link |
02:02:44.240
Yeah, I tend to, on the emotion side,
link |
02:02:46.400
believe that it has to be able to display consciousness.
link |
02:02:52.520
Display consciousness, what do you mean by that?
link |
02:02:54.320
Meaning for us humans to connect with each other
link |
02:02:57.560
or to connect with other living entities,
link |
02:03:01.640
I think in order for us to truly feel
link |
02:03:06.840
like that there's another being there,
link |
02:03:09.360
we have to believe that they're conscious.
link |
02:03:11.400
And so we won't ever connect with something
link |
02:03:14.960
that doesn't have elements of consciousness.
link |
02:03:17.280
Now, I tend to think that that's easier to achieve
link |
02:03:21.520
than it may sound, because we anthropomorphize stuff so hard.
link |
02:03:26.320
You have a mug that just has wheels and rotates
link |
02:03:29.880
every once in a while and makes a sound.
link |
02:03:31.840
I think a couple of days in, especially if you're,
link |
02:03:37.880
if you don't hang out with humans,
link |
02:03:39.480
you might start to believe that mug on wheels is conscious.
link |
02:03:42.160
So I think we anthropomorphize
link |
02:03:43.960
pretty effectively as human beings.
link |
02:03:46.000
But I do think that it's in the same bucket
link |
02:03:49.200
that we'll call emotion,
link |
02:03:50.880
that shows that you're... I think of consciousness as the capacity to suffer.
link |
02:03:58.280
And if you're an entity that's able to feel things in the world
link |
02:04:03.520
and to communicate that to others,
link |
02:04:06.600
I think that's a really powerful way to interact with humans.
link |
02:04:10.880
And in order to create an AGI system,
link |
02:04:13.160
I believe you should be able to richly interact with humans.
link |
02:04:17.920
Like humans would need to want to interact with you.
link |
02:04:21.040
Like it can't be like, it's the self supervised learning versus like,
link |
02:04:27.800
the robot shouldn't have to pay you to interact with it.
link |
02:04:31.320
So it should be a natural, fun thing.
link |
02:04:33.560
And then you're going to scale up significantly
link |
02:04:36.040
how much interaction it gets.
link |
02:04:39.040
It's the Alexa Prize,
link |
02:04:40.800
where they're trying to get me to be a judge on their contest.
link |
02:04:44.360
I'll see if I want to do that.
link |
02:04:45.960
But their challenge is to talk to you,
link |
02:04:50.520
make the human sufficiently interested
link |
02:04:53.920
that the human keeps talking for 20 minutes.
link |
02:04:56.120
To Alexa.
link |
02:04:56.800
To Alexa, yeah.
link |
02:04:58.560
And right now they're not even close to that
link |
02:05:00.200
because it just gets so boring when you're like,
link |
02:05:02.520
when the intelligence is not there,
link |
02:05:04.240
it gets very not interesting to talk to it.
link |
02:05:06.880
And so the robot needs to be interesting.
link |
02:05:08.920
And one of the ways it can be interesting
link |
02:05:10.400
is display the capacity to love, to suffer.
link |
02:05:14.640
And I would say that essentially means
link |
02:05:17.480
the capacity to display consciousness.
link |
02:05:20.920
Like it is an entity, much like a human being.
link |
02:05:25.160
Of course, what that really means,
link |
02:05:27.320
I don't know if that's fundamentally a robotics problem
link |
02:05:30.520
or some kind of problem that we're not yet even aware.
link |
02:05:33.040
Like if it is truly a hard problem of consciousness,
link |
02:05:36.040
I tend to maybe optimistically think it's a,
link |
02:05:40.000
we can pretty effectively fake it till we make it.
link |
02:05:42.640
So we can display a lot of human like elements for a while.
link |
02:05:46.400
And that will be sufficient to form
link |
02:05:49.080
really close connections with humans.
link |
02:05:52.000
What to you is the most beautiful idea
link |
02:05:53.720
in self supervised learning?
link |
02:05:55.840
Like when you sit back with, I don't know,
link |
02:05:59.040
with a glass of wine and armchair
link |
02:06:03.200
and just at a fireplace,
link |
02:06:06.080
just thinking how beautiful this world
link |
02:06:08.320
that you get to explore is,
link |
02:06:10.080
what do you think is the especially beautiful idea?
link |
02:06:13.800
The fact that like object level,
link |
02:06:16.480
what objects are in some notion of objectness emerges
link |
02:06:19.960
from these models by just like self supervised learning.
link |
02:06:23.680
So for example, like one of the things like the DINO paper
link |
02:06:28.920
that I was a part of at Facebook is,
link |
02:06:32.160
the object sort of boundaries emerge
link |
02:06:34.240
from these representations.
link |
02:06:35.600
So if you have like a dog running in the field,
link |
02:06:38.080
the boundaries around the dog,
link |
02:06:39.440
the network is basically able to figure out
link |
02:06:42.320
what the boundaries of this dog are automatically.
link |
02:06:45.520
And it was never trained to do that.
link |
02:06:47.040
It was never trained to,
link |
02:06:49.120
no one taught it that this is a dog
link |
02:06:51.000
and these pixels belong to a dog.
link |
02:06:52.680
It's able to group these things together automatically.
link |
02:06:55.000
So that's one.
link |
02:06:56.160
I think in general that entire notion that
link |
02:06:58.960
this dumb idea that you take like these two crops
link |
02:07:01.400
of an image and then you say that the features
link |
02:07:03.160
should be similar,
link |
02:07:04.120
that has resulted in something like this.
link |
02:07:06.040
Like the model is able to figure out
link |
02:07:07.920
what the dog pixels are and so on.
link |
02:07:10.320
That just seems like so surprising.
link |
02:07:13.440
And I mean, I don't think a lot of us even understand
link |
02:07:15.680
how that is happening really.
link |
02:07:18.120
And it's something we are taking for granted,
link |
02:07:20.800
maybe like a lot in terms of how we're setting up
link |
02:07:23.120
these algorithms,
link |
02:07:24.320
but it's just, it's a very beautiful and powerful idea.
link |
02:07:26.800
So it's really fundamentally telling us something
link |
02:07:28.720
about that there is so much signal in the pixels
link |
02:07:32.440
that we can be super dumb about it
link |
02:07:34.120
about how we're setting up the self supervised learning
link |
02:07:36.040
problem and despite being like super dumb about it,
link |
02:07:39.560
we'll actually get very good,
link |
02:07:41.600
like we'll actually get something that is able to do
link |
02:07:43.960
very like surprising things.
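A bare-bones sketch, assuming PyTorch and torchvision, of the "two crops of the same image should have similar features" idea described here; this is not the actual DINO code. The encoder, crop size, and loss are illustrative, and real methods add negatives, stop-gradients, or a momentum teacher to avoid the trivial collapsed solution.

```python
import torch
import torch.nn.functional as F
import torchvision.transforms as T

augment = T.Compose([
    T.RandomResizedCrop(96, scale=(0.2, 1.0)),  # random crop of the same image
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

def self_supervised_step(encoder, pil_image, optimizer):
    # two random "views" (crops) of one image
    v1 = augment(pil_image).unsqueeze(0)
    v2 = augment(pil_image).unsqueeze(0)

    z1 = F.normalize(encoder(v1), dim=-1)
    z2 = F.normalize(encoder(v2), dim=-1)

    # pull the two views' features together (simplified negative cosine similarity)
    loss = -(z1 * z2).sum(dim=-1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# usage (hypothetical): self_supervised_step(backbone, img, torch.optim.SGD(backbone.parameters(), lr=0.1))
```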
link |
02:07:45.680
I wonder if there's other like objectness,
link |
02:07:48.240
other concepts that can emerge.
link |
02:07:51.560
I don't know if you follow Francois Chollet,
link |
02:07:53.520
he had the competition for intelligence
link |
02:07:56.600
that basically it's kind of like an IQ test
link |
02:07:59.520
but for machines.
link |
02:08:01.200
But for an IQ test, you have to have a few concepts
link |
02:08:04.040
that you want to apply.
link |
02:08:05.360
One of them is objectness.
link |
02:08:07.800
I wonder if those concepts can emerge
link |
02:08:11.520
through self supervised learning on billions of images.
link |
02:08:14.760
I think something like object permanence
link |
02:08:16.320
can definitely emerge, right?
link |
02:08:17.440
So that's like a fundamental concept which we have,
link |
02:08:20.280
maybe not through images, through video,
link |
02:08:21.480
but that's another concept that should be emerging from it.
link |
02:08:25.160
Because it's not something that,
link |
02:08:26.760
like we don't teach humans
link |
02:08:29.120
about this concept of object permanence,
link |
02:08:31.520
it actually emerges.
link |
02:08:32.480
And the same thing for like animals,
link |
02:08:33.640
like dogs, I think object permanence automatically
link |
02:08:36.360
is something that they are born with.
link |
02:08:38.120
So I think it should emerge from the data.
link |
02:08:40.320
It should emerge basically very quickly.
link |
02:08:42.440
I wonder if ideas like symmetry, rotation,
link |
02:08:45.880
these kinds of things might emerge.
link |
02:08:47.920
So I think rotation probably, yes, yeah, rotation, yes.
link |
02:08:51.640
I mean, there's some constraints
link |
02:08:52.680
in the architecture itself.
link |
02:08:54.000
Right.
link |
02:08:55.200
But it's interesting if all of them could be,
link |
02:08:59.240
like counting was another one.
link |
02:09:01.080
You know, being able to kind of understand
link |
02:09:04.880
that there's multiple objects of the same kind in the image
link |
02:09:07.680
and be able to count them.
link |
02:09:10.040
I wonder if all of that could be,
link |
02:09:11.560
if constructed correctly, they can emerge.
link |
02:09:14.360
Cause then you can transfer those concepts
link |
02:09:16.480
to then interpret images at a deeper level.
link |
02:09:20.680
Right.
link |
02:09:21.480
Counting I do believe, I mean, should be possible.
link |
02:09:24.680
You don't know like yet,
link |
02:09:25.920
but I do think it's not that far in the realm of possibility.
link |
02:09:29.720
Yeah, that'd be interesting
link |
02:09:30.560
if using self supervised learning on images
link |
02:09:33.240
can then be applied to then solving those kinds of IQ tests,
link |
02:09:36.520
which seem currently to be kind of impossible.
link |
02:09:40.440
What idea do you believe might be true
link |
02:09:43.320
that most people think is not true
link |
02:09:46.600
or don't agree with you on?
link |
02:09:48.560
Is there something like that?
link |
02:09:50.040
So this is going to be a little controversial,
link |
02:09:52.400
but okay, sure.
link |
02:09:53.520
I don't believe in simulation,
link |
02:09:55.320
like actually using simulation to do things very much.
link |
02:09:58.840
I want to clarify, because this is a podcast
link |
02:10:01.080
where you talk about, are we living in a simulation often?
link |
02:10:03.640
You're referring to using simulation to construct worlds
link |
02:10:08.040
that you then leverage for machine learning.
link |
02:10:10.360
Right. Yeah.
link |
02:10:11.200
For example, like one example would be like to train
link |
02:10:14.000
an autonomous car driving system.
link |
02:10:15.560
You basically first build a simulator,
link |
02:10:17.480
which builds like the environment of the world.
link |
02:10:19.880
And then you basically have a lot of like,
link |
02:10:22.720
you train your machine learning system in that.
link |
02:10:25.360
So I believe it is possible,
link |
02:10:27.600
but I think it's a really expensive way of doing things.
link |
02:10:30.960
And at the end of it, you do need the real world.
link |
02:10:33.800
So I'm not sure.
link |
02:10:35.560
So maybe for certain settings,
link |
02:10:36.960
like maybe the payout is so large,
link |
02:10:38.920
like for autonomous driving,
link |
02:10:39.920
the payout is so large
link |
02:10:40.920
that you can actually invest that much money to build it.
link |
02:10:43.400
But I think as a general sort of principle,
link |
02:10:45.520
it does not apply to a lot of concepts.
link |
02:10:47.080
You can't really build simulations of everything,
link |
02:10:49.760
not only because like one, it's expensive,
link |
02:10:51.560
but second, it's also not possible for a lot of things.
link |
02:10:54.840
So in general, like there is a lot of like,
link |
02:10:58.760
there's a lot of work on like using synthetic data
link |
02:11:00.800
and like synthetic simulators.
link |
02:11:02.120
I generally am not very, like I don't believe in that.
link |
02:11:05.840
So you're saying it's very challenging visually,
link |
02:11:09.040
like to correctly like simulate the visual,
link |
02:11:11.960
like the lighting, all those kinds of things.
link |
02:11:13.600
I mean, I mean, all these companies that you have, right?
link |
02:11:15.680
So like Pixar and like whatever,
link |
02:11:17.880
all these companies,
link |
02:11:19.160
all of this like computer graphics stuff,
link |
02:11:21.560
a lot of it is really about
link |
02:11:23.920
like accurately trying
link |
02:11:25.200
to figure out how the lighting is
link |
02:11:27.160
and like how things reflect off of one another and so on
link |
02:11:30.440
and like how sparkly things look and so on.
link |
02:11:32.280
So it's a very hard problem.
link |
02:11:34.000
So do we really need to solve that first
link |
02:11:37.200
to be able to like do computer vision?
link |
02:11:39.440
Probably not.
link |
02:11:40.640
And for me, in the context of autonomous driving,
link |
02:11:44.800
it's very tempting to be able to use simulation, right?
link |
02:11:48.040
Because it's a safety critical application,
link |
02:11:50.560
but the other limitation of simulation
link |
02:11:53.360
that perhaps is a bigger one than the visual limitation
link |
02:11:58.400
is the behavior of objects.
link |
02:12:00.800
Because so you're ultimately interested in edge cases.
link |
02:12:03.880
And the question is,
link |
02:12:04.960
how well can you generate edge cases in simulation,
link |
02:12:08.760
especially with human behavior?
link |
02:12:11.040
I think another problem is like for autonomous driving, right?
link |
02:12:13.440
It's a constantly changing world.
link |
02:12:15.240
So say autonomous driving like in 10 years from now,
link |
02:12:18.560
like there are lots of autonomous cars,
link |
02:12:20.800
but there's still going to be humans.
link |
02:12:22.480
So now there are 50% of the agents,
link |
02:12:24.360
say which are humans,
link |
02:12:25.280
50% of the agents that are autonomous,
link |
02:12:26.920
like car driving agents.
link |
02:12:28.640
So now the mixture has changed.
link |
02:12:30.160
So now the kinds of behaviors
link |
02:12:31.520
that you actually expect from the other agents
link |
02:12:34.080
or other cars on the road
link |
02:12:35.240
are actually going to be very different.
link |
02:12:36.800
And as the proportion of the number of autonomous cars
link |
02:12:39.160
to humans keeps changing,
link |
02:12:40.520
this behavior will actually change a lot.
link |
02:12:42.680
So now if you were to build a simulator
link |
02:12:44.120
based on just like right now to build them today,
link |
02:12:46.520
you don't have that many autonomous cars on the road.
link |
02:12:48.480
So you'll try to like make all of the other agents
link |
02:12:50.560
in that simulator behave as humans,
link |
02:12:53.000
but that's not really going to hold true
link |
02:12:54.680
10, 15, 20, 30 years from now.
link |
02:12:57.400
Do you think we're living in a simulation?
link |
02:12:59.320
No.
link |
02:13:01.520
How hard is it?
link |
02:13:02.840
This is why I think it's an interesting question.
link |
02:13:04.880
How hard is it to build a video game,
link |
02:13:07.800
like virtual reality game,
link |
02:13:09.560
where it is so real,
link |
02:13:12.680
forget like ultra realistic
link |
02:13:15.240
to where you can't tell the difference,
link |
02:13:17.400
but like it's so nice that you just want to stay there.
link |
02:13:20.880
You just want to stay there
link |
02:13:22.960
and you don't want to come back.
link |
02:13:24.960
Do you think that's doable within our lifetime?
link |
02:13:29.400
Within our lifetime, probably.
link |
02:13:31.680
Yeah.
link |
02:13:32.520
Well, that depends on how long we live.
link |
02:13:33.920
Ha ha ha ha.
link |
02:13:35.760
Does that make you sad
link |
02:13:37.240
that there will be like population of kids
link |
02:13:42.000
that basically spend 95%, 99% of their time
link |
02:13:45.920
in a virtual world?
link |
02:13:50.120
Very, very hard question to answer.
link |
02:13:53.400
For certain people, it might be something
link |
02:13:55.760
that they really derive a lot of value out of,
link |
02:13:58.160
derive a lot of enjoyment and like happiness out of,
link |
02:14:00.760
and maybe the real world wasn't giving them that,
link |
02:14:03.120
that's why they did that.
link |
02:14:03.960
So maybe it is good for certain people.
link |
02:14:06.000
So ultimately, if it maximizes happiness,
link |
02:14:09.400
or we could judge.
link |
02:14:10.760
Yeah, I think if it's making people happy,
link |
02:14:12.760
maybe it's okay.
link |
02:14:14.440
Again, I think this is a very hard question.
link |
02:14:18.320
So like you've been a part of a lot of amazing papers.
link |
02:14:23.520
What advice would you give to somebody
link |
02:14:25.640
on what it takes to write a good paper?
link |
02:14:29.200
Grad students writing papers now,
link |
02:14:31.000
is there common things that you've learned along the way
link |
02:14:34.560
that you think it takes,
link |
02:14:35.760
both for a good idea and a good paper?
link |
02:14:39.920
Right, so I think both of these
link |
02:14:42.840
I've picked up from like lots of people
link |
02:14:45.440
I've worked with in the past.
link |
02:14:46.560
So one of them is picking the right problem
link |
02:14:48.680
to work on in research is as important
link |
02:14:51.040
as like finding the solution to it.
link |
02:14:53.680
So I mean, there are multiple reasons for this.
link |
02:14:56.200
So one is that there are certain problems
link |
02:14:58.960
that can actually be solved in a particular timeframe.
link |
02:15:02.360
So now say you want to work on finding the meaning of life.
link |
02:15:06.400
This is a great problem.
link |
02:15:07.400
I think most people will agree with that.
link |
02:15:09.440
But do you believe that your talents
link |
02:15:12.240
and like the energy that you'll spend on it
link |
02:15:13.840
will make some kind of meaningful progress
link |
02:15:17.280
in your lifetime?
link |
02:15:18.840
If you are optimistic about it, then like go ahead.
link |
02:15:21.040
That's why I started this podcast.
link |
02:15:22.120
I keep asking people about the meaning of life.
link |
02:15:24.080
I'm hoping by episode like 220, I'll figure it out.
link |
02:15:27.480
Oh, not too many episodes to go then.
link |
02:15:30.360
All right, maybe today, I don't know.
link |
02:15:33.080
But you're right.
link |
02:15:33.920
So that seems intractable at the moment.
link |
02:15:36.280
Right, so I think it's just the fact of
link |
02:15:38.560
like if you're starting a PhD for example,
link |
02:15:41.080
what is one problem that you want to focus on
link |
02:15:43.000
that you do think is interesting enough
link |
02:15:45.720
and you will be able to make a reasonable amount
link |
02:15:47.800
of headway into it that you think you'll be doing a PhD for.
link |
02:15:50.520
So in that kind of a timeframe.
link |
02:15:53.080
So that's one.
link |
02:15:53.920
Of course, there's the second part
link |
02:15:54.760
which is what excites you genuinely.
link |
02:15:56.360
So you shouldn't just pick problems
link |
02:15:57.600
that you are not excited about
link |
02:15:59.040
because as a grad student or as a researcher,
link |
02:16:01.840
you really need to be passionate about it
link |
02:16:03.200
to continue doing that
link |
02:16:04.600
because there are so many other things
link |
02:16:05.760
that you could be doing in life.
link |
02:16:07.080
So you really need to believe in that
link |
02:16:08.280
to be able to do that for that long.
link |
02:16:10.760
In terms of papers,
link |
02:16:11.600
I think the one thing that I've learned is
link |
02:16:14.920
I've like in the past,
link |
02:16:16.440
whenever I used to write things
link |
02:16:17.760
and even now whenever I do that,
link |
02:16:18.920
I try to cram in a lot of things into the paper.
link |
02:16:21.400
Whereas what really matters is just pushing
link |
02:16:23.840
one simple idea, that's it.
link |
02:16:25.760
That's all because that's,
link |
02:16:28.480
the paper is going to be like whatever,
link |
02:16:30.320
eight or nine pages.
link |
02:16:32.200
If you keep cramming in lots of ideas,
link |
02:16:34.240
it's really hard for the single thing
link |
02:16:36.240
that you believe in to stand out.
link |
02:16:38.000
So if you really try to just focus
link |
02:16:40.400
on like especially in terms of writing,
link |
02:16:41.920
really try to focus on one particular idea
link |
02:16:43.840
and articulate it out in multiple different ways.
link |
02:16:46.240
It's far more valuable to the reader as well.
link |
02:16:49.040
And basically to the reader, of course,
link |
02:16:51.600
because they get to,
link |
02:16:53.120
they know that this particular idea
link |
02:16:54.400
is associated with this paper.
link |
02:16:56.160
And also for you because you have,
link |
02:16:59.040
like when you write about a particular idea
link |
02:17:00.440
in different ways, you think about it more deeply.
link |
02:17:02.680
So as a grad student,
link |
02:17:03.600
I used to always wait until like maybe the last week
link |
02:17:07.200
or whatever to write the paper
link |
02:17:08.680
because I used to always believe that doing the experiments
link |
02:17:11.320
was actually the bigger part of research than writing.
link |
02:17:13.840
And my advisor always told me
link |
02:17:15.200
that you should start writing very early on.
link |
02:17:16.600
And I thought, oh, it doesn't matter.
link |
02:17:17.840
I don't know what he's talking about.
link |
02:17:19.640
But I think more and more I realized that's the case.
link |
02:17:21.760
Like whenever I write something that I'm doing,
link |
02:17:24.000
I actually think much better about it.
link |
02:17:26.400
And so if you start writing early on,
link |
02:17:28.800
you actually, I think get better ideas
link |
02:17:31.160
or at least you figure out like holes in your theory
link |
02:17:33.760
or like particular experiments
link |
02:17:35.440
that you should run to block those holes and so on.
link |
02:17:38.680
Yeah, I'm continually surprised
link |
02:17:40.320
how many really good papers throughout history
link |
02:17:43.560
are quite short and quite simple.
link |
02:17:48.280
And there's a lesson to that.
link |
02:17:49.800
Like if you want to dream about writing a paper
link |
02:17:52.600
that changes the world and you want to go by example,
link |
02:17:56.760
they're usually simple and that it's not cramming
link |
02:18:01.280
or it's focusing on one idea and thinking deeply
link |
02:18:07.240
and you're right that the writing process itself
link |
02:18:10.360
reveals the idea.
link |
02:18:12.280
It challenges you to really think about what is the idea
link |
02:18:15.320
that explains that the thread that ties it all together.
link |
02:18:19.040
And so like a lot of famous researchers I know
link |
02:18:21.560
actually would start off like,
link |
02:18:24.120
first, even before the experiments were in,
link |
02:18:27.240
a lot of them would actually start
link |
02:18:28.360
with writing the introduction of the paper
link |
02:18:30.400
with zero experiments in.
link |
02:18:32.160
Because that at least helps them figure out
link |
02:18:33.800
what they're trying to solve
link |
02:18:35.800
and how it fits in like the context of things right now.
link |
02:18:38.640
And that would really guide their entire research.
link |
02:18:40.720
So a lot of them would actually first write in intros
link |
02:18:42.360
with like zero experiments in
link |
02:18:43.560
and that's how they would start projects.
link |
02:18:46.040
Some basic questions about people maybe
link |
02:18:49.800
there are more like beginners in this field.
link |
02:18:51.960
What's the best programming language to learn
link |
02:18:54.080
if you're interested in machine learning?
link |
02:18:56.600
I would say Python just because it's the easiest one to learn.
link |
02:19:00.320
And also a lot of like programming
link |
02:19:03.160
in machine learning happens in Python.
link |
02:19:05.000
So it'll, if you don't know any other programming language
link |
02:19:07.600
Python is actually going to get you a long way.
link |
02:19:09.560
Yeah, it seems like sort of a, it's a toss up question
link |
02:19:12.800
because it seems like Python is so much dominating
link |
02:19:15.160
the space now, but I wonder if there's interesting
link |
02:19:18.040
alternative, obviously there's like Swift
link |
02:19:19.960
and there's a lot of interesting alternatives popping up
link |
02:19:22.720
even JavaScript or R, more like for the data science
link |
02:19:27.720
applications, but it seems like Python more and more
link |
02:19:31.240
is actually being used to teach like introduction
link |
02:19:34.160
to programming at universities.
link |
02:19:35.880
So it just combines everything very nicely.
link |
02:19:39.840
Even harder question.
link |
02:19:41.840
What are the pros and cons of PyTorch versus TensorFlow?
link |
02:19:46.120
I see.
link |
02:19:48.400
Okay, so.
link |
02:19:49.360
You can go with no comment.
link |
02:19:51.320
So a disclaimer to this is that the last time
link |
02:19:53.400
I used TensorFlow was probably like four years ago.
link |
02:19:56.400
And so it was right when it had come out
link |
02:19:58.160
because so I started on like deep learning in 2014 or so
link |
02:20:02.640
and the dominant sort of pattern framework for us then
link |
02:20:06.440
for vision was Cafe, which was out of Berkeley
link |
02:20:09.040
and we used Cafe a lot, it was really nice.
link |
02:20:12.120
And then TensorFlow came in, which was basically
link |
02:20:14.080
like Python first.
link |
02:20:15.080
So Caffe was mainly C++ and it had like very loose
link |
02:20:18.080
kind of Python binding.
link |
02:20:19.040
So Python wasn't really the first language you would use.
link |
02:20:21.360
You would really use either MATLAB or C++
link |
02:20:24.680
like get stuff done in like Caffe.
link |
02:20:28.240
And then Python of course became popular a little bit later.
link |
02:20:30.920
So TensorFlow was basically around that time.
link |
02:20:32.600
So 2015, 2016 is when I last used it.
link |
02:20:36.120
It's been a while.
link |
02:20:37.200
And then what, did you use Torch or did you?
link |
02:20:40.600
So then I moved to Lua Torch, which was the Torch in Lua.
link |
02:20:44.000
And then in 2017, I think I basically moved pretty much
link |
02:20:46.760
to PyTorch completely.
link |
02:20:48.400
Oh, interesting.
link |
02:20:49.240
So you went to Lua, cool.
link |
02:20:50.520
Yeah.
link |
02:20:51.440
Huh, so you were there before it was cool.
link |
02:20:54.160
Yeah, I mean, so Lua Torch was really good
link |
02:20:56.320
because it actually allowed you to do a lot
link |
02:20:59.520
of different kinds of things.
link |
02:21:01.360
So, Caffe was very rigid in terms of its structure.
link |
02:21:03.880
Like you would create a neural network once and that's it.
link |
02:21:06.800
Whereas if you wanted like very dynamic graphs and so on,
link |
02:21:09.320
it was very hard to do that.
link |
02:21:10.200
And Lua Torch was much more friendly
link |
02:21:11.600
for all of these things.
link |
02:21:13.560
Okay, so in terms of PyTorch and TensorFlow,
link |
02:21:15.600
my personal bias is PyTorch just because I've been using it
link |
02:21:18.480
longer and I'm more familiar with it.
link |
02:21:20.760
And also that PyTorch is much easier to debug
link |
02:21:23.560
is what I find because it's imperative in nature
link |
02:21:26.320
compared to like TensorFlow, which is not imperative.
link |
02:21:28.680
But that's telling you a lot that basically
link |
02:21:30.520
the imperative design is sort of a way in which a lot
link |
02:21:33.920
of people are taught programming
link |
02:21:35.280
and that's what actually makes debugging easier for them.
link |
02:21:38.200
So like I learned programming in C++.
link |
02:21:40.520
And so for me, imperative way of programming
link |
02:21:42.240
is more natural.
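A tiny illustration, not from the conversation, of what imperative ("define-by-run") execution buys you in PyTorch: intermediate tensors in the forward pass are ordinary Python values, so you can print them or drop into a debugger mid-computation. The network and shapes here are arbitrary.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 16)
        self.fc2 = nn.Linear(16, 2)

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        # inspect live intermediate values exactly where they are computed
        print("hidden stats:", h.mean().item(), h.std().item())
        # import pdb; pdb.set_trace()  # or pause right here in a debugger
        return self.fc2(h)

out = TinyNet()(torch.randn(4, 8))
```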
link |
02:21:44.080
Do you think it's good to have kind of these two communities,
link |
02:21:46.720
this kind of competition?
link |
02:21:48.520
I think PyTorch is kind of more and more becoming dominant
link |
02:21:51.520
in the research community,
link |
02:21:52.560
but TensorFlow is still very popular
link |
02:21:54.600
in the more sort of application machine learning community.
link |
02:21:57.920
So do you think it's good to have that kind of split
link |
02:22:00.480
in code bases or, so like the benefit there
link |
02:22:04.880
is the competition challenges the library developers
link |
02:22:07.800
to step up their game.
link |
02:22:10.000
But the downside is there's these code bases
link |
02:22:12.760
that are in different libraries.
link |
02:22:15.200
Right, so I think the downside is there.
link |
02:22:17.080
I mean, for a lot of research code
link |
02:22:18.480
that's released in one framework
link |
02:22:19.640
and if you're using the other one, it's really hard
link |
02:22:21.600
to like really build on top of it.
link |
02:22:23.920
But thankfully the open source community
link |
02:22:25.800
in machine learning is amazing.
link |
02:22:27.080
So whenever like something pops up in TensorFlow,
link |
02:22:30.840
you wait a few days and someone who's like super sharp
link |
02:22:33.200
will actually come and translate that particular code
link |
02:22:35.360
based into PyTorch and basically have figured
link |
02:22:38.160
that all those nooks and crannies out.
link |
02:22:39.720
So the open source community is amazing
link |
02:22:41.800
and they really like figure out this gap.
link |
02:22:45.240
So I think in terms of like having these two frameworks
link |
02:22:47.560
or multiple, I think of course there are different use cases
link |
02:22:49.720
so there are going to be benefits to using one
link |
02:22:51.600
or the other framework.
link |
02:22:52.880
And like you said, I think competition is just healthy
link |
02:22:54.760
because both of these frameworks keep
link |
02:22:57.400
or like all of these frameworks really sort of keep learning
link |
02:22:59.600
from each other and keep incorporating different things
link |
02:23:01.680
to just make them better and better.
link |
02:23:03.800
What advice would you have for someone
link |
02:23:06.360
new to machine learning?
link |
02:23:09.720
Maybe just started or haven't even started
link |
02:23:11.560
but are curious about it and who want to get in the field.
link |
02:23:14.920
Don't be afraid to get your hands dirty.
link |
02:23:16.640
I think that's the main thing.
link |
02:23:17.640
So if something doesn't work, like really drill
link |
02:23:20.160
into why things are not working.
link |
02:23:22.200
Can you elaborate what your hands dirty means?
link |
02:23:24.520
Right, so for example, like if an algorithm,
link |
02:23:27.560
if you try to train a network and it's not converging,
link |
02:23:29.720
whatever, rather than trying to like Google the answer
link |
02:23:32.240
or trying to do something, like really spend those
link |
02:23:34.360
like five, eight, 10, 15, 20, whatever number of hours
link |
02:23:37.200
really trying to figure it out yourself.
link |
02:23:39.000
Because in that process, you'll actually learn a lot more.
link |
02:23:42.520
Googling is of course like a good way to solve it
link |
02:23:44.600
when you need a quick answer.
link |
02:23:45.960
But I think initially especially like when you're starting out
link |
02:23:48.280
it's much nicer to like figure things out by yourself.
link |
02:23:51.840
And I just say that from experience
link |
02:23:52.960
because like when I started out,
link |
02:23:54.280
there were not a lot of resources.
link |
02:23:55.480
So we would like in the lab a lot of us
link |
02:23:57.880
like we would look up to senior students
link |
02:23:59.680
and the senior students were of course busy
link |
02:24:01.360
and they would be like, hey, why don't you go figure it out
link |
02:24:03.080
because I just don't have the time
link |
02:24:04.320
I'm working on my dissertation or whatever.
link |
02:24:06.480
Ah, the life of PhD students.
link |
02:24:07.640
And so then we would sit down
link |
02:24:08.760
and like just try to figure it out.
link |
02:24:10.480
And that I think really helped me.
link |
02:24:12.440
That has really helped me figure a lot of things out.
link |
02:24:15.080
I think in general, if I were to generalize that,
link |
02:24:18.720
I feel like persevering through any kind of struggle
link |
02:24:22.720
on a thing you care about is good.
link |
02:24:25.680
So you're basically, you try to make it seem like
link |
02:24:28.160
it's good to spend time debugging
link |
02:24:30.840
but really any kind of struggle, whatever form that takes
link |
02:24:33.680
it could be just Googling a lot.
link |
02:24:36.080
Just basically anything just sticking with it
link |
02:24:38.760
and going through the hard thing
link |
02:24:39.960
that could take a form of implementing stuff from scratch.
link |
02:24:43.240
It could take the form of re implementing
link |
02:24:45.640
with different libraries or different programming languages.
link |
02:24:49.360
It could take a lot of different forms
link |
02:24:50.600
but struggle is good for the soul.
link |
02:24:53.560
So like in Pittsburgh, where I did my PhD,
link |
02:24:55.840
the thing was it used to snow a lot, right?
link |
02:24:58.400
And so when it was snowed, you really couldn't do much.
link |
02:25:00.840
So the thing that a lot of people said was snow
link |
02:25:03.720
builds character because when it's snowing,
link |
02:25:06.200
you can't do anything else.
link |
02:25:07.520
You focus on work.
link |
02:25:09.080
Do you have advice in general for people
link |
02:25:10.840
you're already exceptionally successful, you're young,
link |
02:25:13.440
but do you have advice for young people starting out
link |
02:25:15.800
in college or maybe in high school?
link |
02:25:18.160
Advice for their career, advice for their life,
link |
02:25:21.040
how to pave a successful path in career and life.
link |
02:25:25.680
I would say just be hungry,
link |
02:25:27.360
like always be hungry for what you want.
link |
02:25:29.720
And I think like I've been inspired by a lot of people
link |
02:25:33.320
who are just like driven and who really like go
link |
02:25:35.800
for what they want, no matter what like,
link |
02:25:38.440
you shouldn't want it, you should need it.
link |
02:25:40.520
So if you need something, you basically go towards
link |
02:25:42.920
the ends to make it work.
link |
02:25:44.360
How do you know when you come across a thing
link |
02:25:47.840
that's like you need?
link |
02:25:51.040
I think there's not going to be any single thing
link |
02:25:53.080
that you're going to need, there are going to be
link |
02:25:54.120
different types of things that you need,
link |
02:25:55.360
but whenever you need something, you just go push for it.
link |
02:25:57.920
And of course, once you may not get it
link |
02:26:00.080
or you may find that this was not even the thing
link |
02:26:01.960
that you were looking for, it might be a different thing.
link |
02:26:03.640
But the point is like you're pushing through things
link |
02:26:06.240
and that actually brings a lot of skills
link |
02:26:08.960
and like builds a certain kind of attitude
link |
02:26:12.880
which will probably help you get the other thing.
link |
02:26:15.680
Once you figure out what's really the thing that you want.
link |
02:26:18.080
Yeah, I think a lot of people are,
link |
02:26:20.520
I've noticed, kind of afraid of that,
link |
02:26:22.520
because one, it's a fear of commitment.
link |
02:26:24.880
And two, there's so many amazing things in this world.
link |
02:26:26.880
You almost don't want to miss out on all the other
link |
02:26:28.800
amazing things by committing to this one thing.
link |
02:26:31.080
So I think a lot of it has to do with just allowing yourself
link |
02:26:33.840
to like notice that thing.
link |
02:26:37.960
And just go all the way with it.
link |
02:26:41.600
I mean, also like failure, right?
link |
02:26:43.280
So I know this is like super cheesy that failure is something
link |
02:26:47.960
that you should be prepared for and so on.
link |
02:26:49.800
But I do think, I mean, especially in research,
link |
02:26:52.560
for example, failure is something that happens
link |
02:26:54.800
almost every day, like experiments failing
link |
02:26:58.200
and not working.
link |
02:26:59.160
And so you really need to be so used to it.
link |
02:27:02.320
You need to have a thick skin.
link |
02:27:03.920
But it's only basically when you get through it
link |
02:27:07.320
that you find the one thing that's actually working.
link |
02:27:09.640
Like Thomas Edison was one person like that, right?
link |
02:27:11.880
So I really, like when I was a kid,
link |
02:27:13.760
I used to really read about how he found like the filament,
link |
02:27:17.200
the light bulb filament.
link |
02:27:18.760
And then I think his thing was like,
link |
02:27:20.680
he tried 990 things that didn't work or something of the sort.
link |
02:27:24.400
And then they asked him like, so what did you learn?
link |
02:27:26.960
Because all of these were failed experiments.
link |
02:27:28.520
And then he says, oh, these 990 things don't work.
link |
02:27:31.640
And I know that.
link |
02:27:32.240
Did you know that?
link |
02:27:34.120
I mean, that's really inspiring.
link |
02:27:36.000
So you spent a few years on this earth
link |
02:27:38.440
performing a self supervised kind of learning process.
link |
02:27:44.000
Have you figured out the meaning of life yet?
link |
02:27:46.440
I told you I'm doing this podcast to try to get the answer.
link |
02:27:49.240
I'm hoping you could tell me.
link |
02:27:50.760
What do you think the meaning of it all is?
link |
02:27:54.400
I don't think I figured this out.
link |
02:27:55.840
No, I have no idea.
link |
02:27:59.000
Do you think AI will help us figure it out?
link |
02:28:02.400
Or do you think there's no answer?
link |
02:28:03.920
The whole point is to keep searching.
link |
02:28:05.520
I think it's an endless sort of quest for us.
link |
02:28:08.840
I don't think AI will help us there.
link |
02:28:10.600
This is like a very hard, hard, hard question
link |
02:28:13.600
which so many humans have tried to answer.
link |
02:28:15.440
Well, that's the interesting thing about the difference
link |
02:28:17.440
between AI and humans.
link |
02:28:19.560
Humans don't seem to know what the hell they're doing.
link |
02:28:21.880
And AI is almost always operating
link |
02:28:23.760
under well defined objective functions.
link |
02:28:28.400
And I wonder whether our lack of ability
link |
02:28:34.880
to define good long term objective functions,
link |
02:28:37.240
or to say in retrospect what objective function
link |
02:28:40.840
we operate under, is a feature or a bug.
link |
02:28:44.160
I would say it's a feature because then everyone actually
link |
02:28:46.480
has very different kinds of objective functions
link |
02:28:48.440
that they're optimizing.
link |
02:28:49.400
And those objective functions evolve and change dramatically
link |
02:28:52.480
through the course of their lives.
link |
02:28:53.880
That's actually what makes us interesting, right?
link |
02:28:56.040
If otherwise, if everyone was doing the exact same thing,
link |
02:28:59.120
that would be pretty boring.
link |
02:29:00.560
We do want people with different kinds of perspectives.
link |
02:29:03.880
Also, people evolve continuously.
link |
02:29:06.160
That's like, I would say, the biggest
link |
02:29:08.040
feature of being human.
link |
02:29:09.320
And then we get to the ones that die
link |
02:29:11.160
because they do something stupid.
link |
02:29:12.560
We get to watch that, see it, and learn from it.
link |
02:29:15.440
And as a species, we take that lesson
link |
02:29:20.400
and become better and better because of all the dumb people
link |
02:29:23.880
in the world that died doing something wild and beautiful.
link |
02:29:29.080
Ishan, thank you so much for this incredible conversation.
link |
02:29:31.840
We did a depth first search through the space of machine
link |
02:29:37.480
learning.
link |
02:29:38.000
And it was fun and fascinating.
link |
02:29:41.600
So it's really an honor to meet you.
link |
02:29:43.920
And it was a really awesome conversation.
link |
02:29:45.720
Thanks for coming down today and talking with me.
link |
02:29:48.160
Thanks, Lex.
link |
02:29:49.000
I mean, I've listened to you.
link |
02:29:50.200
I told you it was unreal for me to actually meet you in person.
link |
02:29:52.960
And I'm so happy to be here.
link |
02:29:54.080
Thank you.
link |
02:29:55.000
Thanks, man.
link |
02:29:56.680
Thanks for listening to this conversation with Ishan Misra.
link |
02:29:59.360
And thank you to Onnit, The Information, Grammarly,
link |
02:30:03.280
and Athletic Greens.
link |
02:30:05.280
Check them out in the description to support this podcast.
link |
02:30:08.560
And now let me leave you with some words from Arthur C. Clarke.
link |
02:30:12.480
Any sufficiently advanced technology
link |
02:30:14.920
is indistinguishable from magic.
link |
02:30:18.120
Thank you for listening and hope to see you next time.