Ishan Misra: Self-Supervised Deep Learning in Computer Vision | Lex Fridman Podcast #206
The following is a conversation with Ishan Misra, research scientist at Facebook AI Research, who works on self-supervised machine learning in the domain of computer vision, or in other words, making AI systems understand the visual world with minimal help from us humans.

Transformers and self-attention have been successfully used by OpenAI's GPT-3 and other language models to do self-supervised learning in the domain of language. Ishan, together with Yann LeCun and others, is trying to achieve the same success in the domain of images and video. The goal is to leave a robot watching YouTube videos all night, and in the morning, come back to a much smarter robot.

I read the blog post, Self-Supervised Learning: The Dark Matter of Intelligence, by Ishan and Yann LeCun, and then listened to Ishan's appearance on the excellent Machine Learning Street Talk podcast, and I knew I had to talk to him. By the way, if you're interested in machine learning and AI, I cannot recommend the ML Street Talk podcast highly enough. Those guys are great.

Quick mention of our sponsors: Onnit, The Information, Grammarly, and Athletic Greens. Check them out in the description to support this podcast.

As a side note, let me say that, for those of you who may have been listening for quite a while, this podcast used to be called the Artificial Intelligence Podcast, because my life passion has always been, and will always be, artificial intelligence, both narrowly and broadly defined. My goal with this podcast is still to have many conversations with world-class researchers in AI, math, physics, biology, and all the other sciences, but I also want to talk to historians, musicians, athletes, and of course, occasionally comedians. In fact, I'm trying out doing this podcast three times a week now to give me more freedom with guest selection and maybe get a chance to have a bit more fun.

Speaking of fun, in this conversation, I challenge the listener to count the number of times the word banana is mentioned. Ishan and I use the word banana as the canonical example at the core of the hard problem of computer vision, and maybe the hard problem of consciousness.

This is the Lex Fridman Podcast, and here is my conversation with Ishan Misra.
link |
What is self-supervised learning? And maybe even give the basics of what is supervised and semi-supervised learning, and maybe why is self-supervised learning a better term than unsupervised learning?

Let's start with supervised learning. So typically for machine learning systems, the way they're trained is you get a bunch of humans, and the humans point out particular concepts. So in the case of images, you want the humans to come and tell you what is present in the image, draw boxes around them, draw masks over things, pixels which are of particular categories or not. For NLP, again, there are lots of these particular tasks, say about sentiment analysis, about entailment, and so on. So typically for supervised learning, we get a big corpus of such annotated or labeled data, and then we feed that to a system, and the system is really trying to mimic. So it's taking this input of the data and then trying to mimic the output. So it looks at an image, and the human has tagged that this image contains a banana, and now the system is basically trying to mimic that. So that's its learning signal. And so for supervised learning, we try to gather lots of such data, and we train these machine learning models to imitate the input-output mapping. And the hope is basically that by doing so, now on unseen or new kinds of data, this model can automatically learn to predict these concepts. So this is a standard sort of supervised setting.
For the semi-supervised setting, the idea typically is that you have, of course, all of the supervised data, but you have lots of other data which is unsupervised, or which is not labeled. Now, the problem with supervised learning, and why you actually have all of these alternate sort of learning paradigms, is that supervised learning just does not scale.

So if you look at computer vision, one of the largest and most popular data sets is ImageNet, right? So the entire ImageNet data set has about 22,000 concepts and about 14 million images. So these concepts are basically just nouns, and they're annotated on images. And this entire data set was a mammoth data collection effort that actually gave rise to a lot of powerful learning algorithms, and is credited with sort of the rise of deep learning as well. But this data set took about 22 human years to collect, to annotate. And it's not even that many concepts, right? It's not even that many images; 14 million is nothing, really. You have about, I think, 400 million images or so, or even more than that, uploaded to most of the popular sort of social media websites today. So supervised learning just doesn't scale: if I want to now annotate more concepts, if I want to have various types of fine-grained concepts, then it won't really scale.
So now you come to these sort of different learning paradigms, for example, semi-supervised learning, where the idea is, of course, you have this annotated corpus of supervised data, and you have lots of these unlabeled images. And the idea is that the algorithm should basically try to measure some kind of consistency, or really try to measure some kind of signal, on this sort of unlabeled data to make itself more confident about what it's really trying to predict. So by access to lots of this unlabeled data, the idea is that the algorithm actually learns to be more confident, and actually gets better at predicting these concepts.

And now we come to the other extreme, which is self-supervised learning. The idea basically is that the machine or the algorithm should really discover concepts, or discover things about the world, or learn representations about the world which are useful, without access to explicit human supervision.

So the word supervision is still in the term self-supervised. So what is the supervision signal? And maybe that perhaps is why Yann LeCun and you argue that unsupervised is the incorrect terminology here. So what is the supervision signal when the humans aren't part of the picture, or not a big part of the picture?
Right. So self-supervised: the reason that it has the term supervised in it is because you're using the data itself as supervision. So because the data serves as its own source of supervision, it's self-supervised in that way. Now, the reason a lot of people, I mean, we did it in that blog post with Yann, but a lot of other people have also argued for using this term self-supervised, starting from around '94 with Virginia de Sa's group; she's at UCSD now. Jitendra Malik has said this a bunch of times as well. So you have supervised, and then unsupervised basically means everything which is not supervised, but that includes stuff like semi-supervised, that includes things like transductive learning, lots of other sorts of settings. So that's the reason people are now preferring this term self-supervised, because it explicitly says what's happening: the data itself is the source of supervision, and any sort of learning algorithm which tries to extract supervision signals from the data itself is a self-supervised algorithm.
But there is, within the data, a set of tricks which unlock the supervision. So can you give maybe some examples? And there's innovation, ingenuity, required to unlock that supervision. The data doesn't just speak some ground truth to you; you have to do some kind of trick. So I don't know what your favorite domain is. You specifically specialize in visual learning, but are there favorite examples, maybe in language or other domains?

Perhaps the most successful applications have been in NLP, natural language processing. So the idea basically being that you can train models where you have a sentence and you mask out certain words, and now these models learn to predict the masked-out words. So if you have 'the cat jumped over the dog', you can basically mask out 'cat'. And now you're essentially asking the model to predict what was missing: what did I mask out? So the model is going to predict basically a distribution over all the possible words that it knows, and probably, if it's a well-trained model, it has a sort of higher probability density for this word 'cat'.
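As a concrete illustration of this masked-word prediction, here is a minimal sketch using the Hugging Face transformers library with a pretrained BERT model; the library and model choice are my illustration, not something specified in the conversation:

```python
# pip install transformers torch
from transformers import pipeline

# BERT was pretrained with exactly this masked-word objective.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Mask out "cat" and ask for a distribution over the vocabulary.
for prediction in unmasker("The [MASK] jumped over the dog."):
    # Each prediction carries the filled-in token and its probability.
    print(prediction["token_str"], round(prediction["score"], 3))
```

A well-trained model should place words like 'cat' or 'dog' near the top of this distribution; that is the entire learning signal, with no human labels involved.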
For vision, I would say the easier example, which is not as widely used these days, is basically, say, video prediction. So video is, again, a sequence of things. So you can ask the model: if you have a video of, say, 10 seconds, you can feed in the first nine seconds to a model and then ask it, hey, what happens in the tenth second? Can you predict what's going to happen? And the idea basically is, because the model is predicting something about the data itself, of course, you didn't need any human to tell you what was happening, because the 10-second video was naturally captured. Because the model is predicting what's happening there, it's going to automatically learn something about the structure of the world: how objects move, object permanence, and these kinds of things. So, like, if I have something at the edge of the table, it will fall down. Things like these, which you really don't have to sit and annotate. In a supervised learning setting, I would have to sit and annotate: this is a cup; now I move this cup, this is still a cup; and now I move this cup, it's still a cup; and then it falls down, and this is a fallen-down cup. So I won't have to annotate all of these things in a self-supervised setting.
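To make the video-prediction objective concrete, here is a hedged PyTorch sketch; the architecture and shapes are placeholder assumptions of mine, just to show where the supervision comes from:

```python
import torch
import torch.nn as nn

# Toy predictor: given the first T-1 frames, regress the final frame.
# A real video model would be far more elaborate; this only shows the objective.
class NextFramePredictor(nn.Module):
    def __init__(self, frame_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.rnn = nn.GRU(frame_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, frame_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, frame_dim), each frame flattened to a vector
        _, last_hidden = self.rnn(frames)
        return self.head(last_hidden[-1])

frames = torch.randn(8, 10, 1024)   # 8 clips, 10 "seconds" each (fake data)
model = NextFramePredictor(frame_dim=1024)
prediction = model(frames[:, :9])   # feed the first nine seconds
target = frames[:, 9]               # the tenth second is free supervision
loss = nn.functional.mse_loss(prediction, target)
loss.backward()
```

No annotation step appears anywhere; the target frame was captured for free along with the rest of the video.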
Isn't that kind of a brilliant little trick: taking a series of data that is consistent, removing one element in that series, and then teaching the algorithm to predict that element? Isn't that, first of all, that's quite brilliant. It seems to be applicable to anything that has the constraint of being a sequence that is consistent with the physical reality. The question is, are there other tricks like this that can generate the self-supervision signal?

So sequence is possibly the most widely used one in NLP. For vision, the one that is actually used for images, which is very popular these days, is basically taking an image and taking different crops of that image. So you can basically decide to crop, say, the top left corner, and you crop, say, the bottom right corner, and you ask a network, basically present it with a choice, saying: okay, now you have this image and you have this image; are these the same or not? And so the idea basically is that in an image, different parts of the image are going to be related. So for example, if you have a chair and a table, basically these things are going to be close by. Versus if you have, say, a zoomed-in picture of a chair, if you're taking different crops, it's going to be different parts of the chair. So the idea basically is that different crops of the image are related, and so the features or the representations that you get from these different crops should also be related. So this is possibly the most widely used trick these days for self-supervised learning in computer vision.
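A minimal sketch of this two-crop idea, assuming torchvision for the cropping and an untrained ResNet as the feature extractor; both choices, and the kitchen.jpg file, are my placeholders for illustration:

```python
import torch
from torchvision import transforms
from torchvision.models import resnet18
from PIL import Image

# Two random crops of the same image form a "related" pair.
crop = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
])

image = Image.open("kitchen.jpg").convert("RGB")  # hypothetical input image
view_a = crop(image).unsqueeze(0)                 # add a batch dimension
view_b = crop(image).unsqueeze(0)

encoder = resnet18(num_classes=128)               # feature extractor, untrained here
feat_a, feat_b = encoder(view_a), encoder(view_b)

# Training would push this similarity up for crops of the same image
# and down for crops of different images.
similarity = torch.nn.functional.cosine_similarity(feat_a, feat_b)
print(similarity.item())
```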
So again, using the consistency that's inherent to physical reality: in the visual domain, parts of an image are consistent, and then in the language domain, or anything that has sequences, like language or something that's like a time series, you can chop up parts in time. It's similar to the story of RNNs and ConvNets.

You and Yann LeCun wrote a blog post in March 2021 titled Self-Supervised Learning: The Dark Matter of Intelligence. Can you summarize this blog post and maybe explain the main idea or set of ideas?

The blog post was mainly about sort of just telling, I mean, this is really an accepted fact, I would say, for a lot of people now, that self-supervised learning is something that is going to play an important role for machine learning algorithms that come in the future, and even now.

Let me just comment that we don't yet have a good understanding of what dark matter is.

So the idea basically being...

So maybe the metaphor doesn't exactly transfer, but maybe it actually perfectly transfers, in that we don't know; we have an inkling that it'll be a big part of whatever solving intelligence looks like.

Right. So I think self-supervised learning, the way it's done right now, is, I would say, the first step towards what it probably should end up learning, or what it should enable us to do. So the idea for that particular piece was: self-supervised learning is going to be a very powerful way to learn common sense about the world, or stuff that is really hard to label. For example: is this piece over here heavier than the cup?
Now, for all these kinds of things, you'll have to sit and label these things. So supervised learning is clearly not going to scale. So what is the thing that's actually going to scale? It's probably going to be an agent that can either actually interact with it to lift it up, or observe me doing it. So if I'm basically lifting these things up, it can probably reason about: hey, this is taking him more time to lift up, or the velocity is different; whereas the velocity for this one is different, so probably this one is heavier. So essentially, by observations of the data, you should be able to infer a lot of things about the world without someone explicitly telling you: this is heavy, this is not; this is something that can pour, this is something that cannot pour; this is somewhere that you can sit, this is not somewhere that you can sit.

But you just mentioned the ability to interact with the world. There are so many questions that are still open, which is: how do you select the set of data over which the self-supervised learning process works? How much interactivity, like in the active learning or the machine teaching context, is there? What are the reward signals? Like, how much actual interaction is there with the physical world? That kind of thing. So that could be a huge question. And then on top of that, which I have a million questions about, which we don't know the answers to, but it's worth talking about, is: how much reasoning is involved? How much accumulation of knowledge, versus something that's more akin to learning, or whether that's the same thing. So it's like, it is truly dark matter. We don't know how exactly to do it.

But we are, I mean, a lot of us are actually convinced that it's going to be a sort of major thing in machine learning.
So let me reframe it, then: human supervision cannot be, at large scale, the source of the solution to intelligence. So the machines have to discover the supervision in the natural signal of the world.

I mean, the other thing is also that humans are not particularly good labelers. They're not very consistent. For example, what's the difference between a dining table and a table? Is it just the fact that, like, if you just look at a particular table, what makes us say one is a dining table and the other is not? Humans are not particularly consistent. They're not very good sources of supervision for a lot of these kinds of edge cases. So it may also be the fact that if we want an algorithm, or want a machine, to solve a particular task for us, we can maybe just specify the end goal, and the stuff in between we really probably should not be specifying, because we're maybe just going to confuse it a lot, actually.

Well, humans can't even answer the meaning of life, so I'm not sure if we're good supervisors of the end goal either.
So let me ask you about categories. Humans are not very good at telling the difference between what is and isn't a table, like you mentioned. Do you think it's possible, let me ask you to pretend you're Plato: is it possible to create a pretty good taxonomy of objects in the world? It seems like a lot of approaches in machine learning kind of assume a hopeful vision, that it's possible to construct a perfect taxonomy, or that it exists perhaps out of our reach, but we can always get closer and closer to it. Or is that a hopeless pursuit?

I think it's hopeless in some way. So the thing is, for any particular categorization, if you have a discrete sort of categorization, I can always take the nearest two concepts, or I can take a third concept and blend it in, and I can create a new category. So if you were to enumerate N categories, I will always find an N plus one category for you that's not going to be in the N categories. And I can actually create not just N plus one; I can very easily create far more than N categories. The thing is, a lot of things we talk about are actually compositional, so it's really hard for us to come and sit and enumerate all of these out. And they compose in various weird ways, right? Like, you have a croissant and a donut come together into a cronut. So if you were to enumerate all the foods up until, I don't know, whenever the cronut was invented, about 10 years ago, then this entire thing called a cronut would not exist.

Yeah, I remember there was the most awesome video of a cat wearing a monkey costume. People should look it up, it's great. So is that a monkey, or is that a cat? It's a very difficult philosophical question.
So there is a concept of similarity between objects. So you think that can take us very far? Just kind of getting a good function, a good way to tell which parts of things are similar and which parts of things are very different.

So you don't necessarily need to name everything, or assign a name to everything, to be able to use it, right? So there are like lots of...

Shakespeare said that: what's in a name?

What's in a name, yeah, okay. And I mean, lots of, for example, animals, right? They don't necessarily have a well-formed syntactic language, but they're able to go about their day perfectly fine. The same thing happens for us. So, I mean, we probably look at things and we figure out, oh, this is similar to something else that I've seen before, and then I can probably learn how to use it. So I haven't seen all the possible doorknobs in the world. But, like, I was able to get into this particular place fairly easily; I'd never seen that particular doorknob. So I of course related it to all the doorknobs that I've seen, and I know exactly how it's going to open. I have a pretty good idea of how it's going to open. And I think this kind of translation between experiences only happens because of similarity, because I'm able to relate it to a doorknob. If I'd related it to a hairdryer, I would probably be stuck still outside, not able to get in.
Again, a bit of a philosophical question, but can similarity take us all the way to understanding a thing? Can having a good function that compares objects get us to understand something profound about singular objects?

I think I'll ask you a question back: what does it mean to understand objects?

Well, let me tell you what that's similar to. No, so there's an idea of sort of reasoning by analogy kind of thing. I think understanding is the process of placing that thing in some kind of network of knowledge that you have, that it perhaps is fundamentally related to other concepts. So it's not a standalone thing; understanding is fundamentally by composition of other concepts, and maybe in relation to other concepts. And maybe deeper and deeper understanding is maybe just adding more edges to that graph somehow. So maybe it is a composition of similarities. I mean, ultimately, I suppose it is a kind of embedding in that wisdom space.

Yeah, okay, wisdom space is good. I think, I do think, right, similarity does get you very, very far. Is it the answer to everything? I mean, I don't even know what everything is, but it's going to take us really far. And I think the thing is, things are similar in very different contexts, right? So an elephant is similar to, I don't know, another sort of wild animal. Let's just pick, I don't know, a lion, in a different way, because they're both four-legged creatures. They're also land animals. But of course, they're very different in a lot of different ways. So elephants are herbivores; lions are not. So similarity, and particularly dissimilarity, also actually helps us understand a lot about things. And so that's actually why I think discrete categorization is very hard. Just forming this particular category of elephant and a particular category of lion, maybe it's good for taxonomy, biological taxonomies. But when it comes to other things which are not as, maybe, well defined, for example, a grilled cheese, right? I have a grilled cheese, I dip it in tomato, and I keep it outside. Now, is that still a grilled cheese, or is that something else? Right. So categorization is still very useful for solving problems.
But is your intuition, then, that sort of the self-supervised should be, to borrow Yann LeCun's terminology, should be the cake, and then categorization, the classification, maybe the supervised layer, should be just the thing on top, the cherry or the icing or whatever? So if you make it the cake, it gets in the way of learning?

If you make it the cake, then you won't be able to sit and annotate everything. That's as simple as it is. That's my very practical view on it. I mean, in my PhD, I sat down and annotated a bunch of cars for one of my projects. And very quickly, I was just like, it was in a video, and I was basically drawing boxes around all these cars. And I think I spent about a week doing all of that, and I barely got anything done. And basically this was, I think, my first year of my PhD, or like the second year of my master's. And then by the end of it, I'm like, okay, this is just hopeless; I can't keep doing it. And when I'd done that, someone came up to me and they basically told me: oh, this is a pickup truck, this is not a car. And that's when I was like, aha, this actually makes sense, because a pickup truck is not really, like, what was I annotating? Was I annotating anything that is mobile, or was I annotating particular sedans, or was I annotating SUVs?

By the way, the annotation was bounding boxes?

Bounding boxes, yeah.
There are so many deep, profound questions here that you're almost cheating your way out of by doing self-supervised learning, by the way. Which is like: what makes for an object? Whereas to solve intelligence, maybe you don't ever need to answer that question. I mean, this is the question that anyone that's ever done annotation gets to ask, because it's so painful: like, why am I drawing a very careful line around this object? Like, what is the value? I remember when I first saw semantic segmentation, where you have, like, instance segmentation, where you have a very exact line around the object in a 2D plane, of a fundamentally 3D object projected onto a 2D plane. So you're drawing a line around a car that might be occluded. There might be another thing in front of it, but you're still drawing the line of the part of the car that you see. How is that the car? Why is that the car? Like, I had an existential crisis every time. Like, how's that going to help us understand or solve computer vision?

I'm not sure I have a good answer to what's better. And I'm not sure I share the confidence that you have that self-supervised learning can take us far. I think I'm more and more convinced that it's a very important component, but I still feel like we need to understand what makes this dream of, maybe what it's called, like, symbolic AI, of arriving, like, once you have this common sense base, being able to play with these concepts and build graphs or hierarchies of concepts on top, in order to then form a deep sense of this three-dimensional world, or four-dimensional world, and be able to reason, and then project that onto a 2D plane in order to interpret a 2D image.
Can I ask you just an out-there question? I remember, I think Andrej Karpathy had a blog post about computer vision, like, being really hard. I forgot what the title was, but it was many, many years ago. And he had, I think, President Obama stepping on a scale, and there was humor, and there was a bunch of people laughing. And there's a lot of interesting things about that image, and I think Andrej highlighted a bunch of things about the image that us humans are able to immediately understand. Like the idea, I think, of gravity, and that you have the concept of a weight. You immediately project, because of our knowledge of pose and how human bodies are constructed, you understand how the forces are being applied with the human body. The really interesting other thing that you're able to understand: there are multiple people looking at each other in the image. You're able to have a mental model of what the people are thinking about. You're able to infer, like: oh, this person probably thinks, like, is laughing at how humorous the situation is, and this person is confused about what the situation is, because they're looking this way. We're able to infer all of that. So that's human vision. How difficult is computer vision? Like, in order to achieve that level of understanding, and maybe how big of a part does self-supervised learning play in that, do you think? And do you still, you know, back, that was like over a decade ago, I think Andrej and I think a lot of people agreed, computer vision is really hard. Do you still think computer vision is really hard?
I think it is, yes. And getting to that kind of understanding, I mean, it's really out there. So if you ask me to solve just that particular problem, I can do it the supervised learning route. I can always construct a data set and basically predict: oh, is there humor in this or not? And of course I can do it.

Actually, that's a good question. Do you think you can, okay, okay: do you think you can do human-supervised annotation of humor?

To some extent, yes. I'm sure it will work. I mean, it won't be as bad as, like, randomly guessing. I'm sure it can still predict whether it's humorous or not.

Yeah, maybe, like, Reddit upvotes is the signal.

I mean, it won't do a great job, but it'll do something. It may actually be like, it may find certain things which are not humorous, humorous as well, which is going to be bad for us. But I mean, it'll do, it won't be random.

Yeah, kind of like my sense of humor. So you can, that particular problem, yes. But the general problem, you're saying, is hard.

The general problem is hard. And I mean, self-supervised learning is not the answer to everything. Of course it's not. I think if you have machines that are going to communicate with humans at the end of it, you want to understand what the algorithm is doing, right? You want it to be able to produce an output that you can decipher, that you can understand, or that's actually useful for something else, which again is a human. So at some point in this sort of entire loop, a human comes in, and now this human needs to understand what's going on. And at that point, this entire notion of language or semantics really comes in. If the machine just spits out something and we can't understand it, then it's not really that useful for us. So self-supervised learning is probably going to be useful for a lot of the things before that part, before the machine really needs to communicate a particular kind of output with a human. Because, I mean, otherwise, how is it going to do that without language? Or some kind of communication.

But you're saying that it's possible to build a big base of understanding, or whatever, of, what's a better word? Concepts. Of concepts, yeah. Like common sense concepts. Right.
Self-supervised learning in the context of computer vision is something you've focused on, but that's a really hard domain, and it's kind of the cutting edge of what we're, as a community, working on today. Can we take a little bit of a step back and look at language? Can you summarize the history of success of self-supervised learning in natural language processing, language modeling? What are transformers? What is the masking, the sentence completion, that you mentioned before? How does it lead us to understand anything? Semantic meaning of words, syntactic role of words in sentences?
So I'm, of course, not the expert on NLP; I kind of follow it a little bit from the sides. So the main sort of reason why all of this masking stuff works is, I think it's called the distributional hypothesis in NLP. The idea basically being that words that occur in the same context should have similar meaning. So if you have 'the blank jumped over the blank', whatever is in the first blank is basically going to be an object that can actually jump. So a cat or a dog, or, I don't know, a sheep; all of these things can basically be in that particular context.

And so essentially the idea is that if you have words that are in the same context and you predict them, you're going to learn lots of useful things about how words are related, because you're predicting, by looking at their context, what the word is going to be. So in this particular case, 'the blank jumped over the fence': if it's a sheep, 'the sheep jumped over the fence', 'the dog jumped over the fence'. So essentially the algorithm or the representation basically puts these two concepts together. So it says, okay, dogs are going to be kind of related to sheep, because both of them occur in the same context.

Of course, now you can decide, depending on your particular application downstream, you can say that dogs are absolutely not related to sheep, because, well, I really care about dog food, for example. I'm a dog food person, and I really want to give this dog food to this particular animal. So depending on what your downstream application is, of course, this notion of similarity, or this common sense that you've learned, may not be applicable. But the point is basically that just predicting what the blanks are is going to take you really, really far.
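The distributional hypothesis is exactly what word-embedding methods like Word2vec exploit. Here is a minimal sketch with the gensim library on a toy corpus of 'jumped over the fence' sentences; the corpus, parameters, and library choice are my illustration, and on a corpus this tiny the numbers are noisy:

```python
from gensim.models import Word2Vec

# Toy corpus: words sharing a context should end up with similar vectors.
sentences = [
    ["the", "cat", "jumped", "over", "the", "fence"],
    ["the", "dog", "jumped", "over", "the", "fence"],
    ["the", "sheep", "jumped", "over", "the", "fence"],
    ["the", "banana", "sat", "on", "the", "table"],
]

model = Word2Vec(sentences, vector_size=16, window=3, min_count=1, epochs=200)

# "dog" and "sheep" occur in identical contexts, so their similarity
# should tend to be higher than that of "dog" and "banana".
print(model.wv.similarity("dog", "sheep"))
print(model.wv.similarity("dog", "banana"))
```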
So there's a nice feature of language: that the number of words in a particular language is very large, but it's finite, and it's actually not that large in the grand scheme of things. It's easy to forget because we take it for granted. So first of all, when you say masking, you're talking about this very process of the blank: removing words from a sentence and then having the knowledge of what word went there in the initial data set. That's the ground truth that you're training on, and then you're asking the neural network to predict what goes there. That's like a little trick. It's a really powerful trick. The question is how far that takes us. And the other question is, are there other tricks? Because to me, it's very possible there are other very fascinating tricks.

I'll give you an example: in autonomous driving, there's a bunch of tricks that give you the self-supervised signal back. For example, very similar to sentences, but not really, which is: you have signals from humans driving the car, because a lot of us drive cars to places. And so you can ask the neural network to predict what's going to happen in the next two seconds for a safe navigation through the environment. And the signal comes from the fact that you also have knowledge of what happened in the next two seconds, because you have video of the data. The question in autonomous driving, as it is in language, is: can we learn how to drive autonomously based on that kind of self-supervision? Probably the answer is no. The question is, how good can we get? And the same with language: how good can we get? And are there other tricks? Like, we get sometimes super excited by this trick that works really well. But I wonder, it's almost like mining for gold: I wonder how many signals there are in the data that could be leveraged that are just sitting there.

I just wanted to kind of linger on that, because sometimes it's easy to think that maybe this masking process is self-supervised learning. No, it's only one method. So there could be many, many other methods, many tricky methods, maybe interesting ways to leverage human computation in very interesting ways that might actually border on semi-supervised learning, something like that. Obviously the internet is generated by humans at the end of the day. So all that to say is: what's your sense, in this particular context of language, how far can that masking process take us?
So it has stood the test of time, right? I mean, from Word2vec, the initial sort of NLP technique that was using this, to now, for example, all these big models that we get, BERT and RoBERTa, for example: all of them are still sort of based on the same principle of masking. It's taken us really far. I mean, you can actually do things like: are these two sentences similar or not; whether this particular sentence follows this other sentence in terms of logic, so entailment; you can do a lot of these things with just this masking trick. So I'm not sure if I can predict how far it can take us, because when it first came out, when Word2vec was out, I don't think a lot of us would have imagined that this would actually help us do some kind of entailment problems, and really that well. And so just the fact that, by just scaling up the amount of data that we're training on and using better and more powerful neural network architectures, we've gone from that to this, is just showing you how poor predictors we are, as humans, how poor we are at predicting how successful a particular technique is going to be. So I think I can say something now, but, like, 10 years from now, I'll look completely stupid basically predicting this.

In the language domain, is there something in your work that you find useful and insightful and transferable to computer vision, but also just, I don't know, beautiful and profound, that I think carries through to the vision domain?

I mean, the idea of masking has been very powerful. It has been used in vision as well for predicting, like you say, if you have sort of frames, then you predict what's going to happen in the next frame. So that's been very powerful. In terms of modeling, like just in terms of architecture, you had asked about transformers a while back. That has really become, like, it has become super exciting for computer vision now. Like, in the past, I would say, year and a half, it's become really powerful.
What's a transformer?

I mean, the core part of a transformer is something called the self-attention model. So it came out of Google, and the idea basically is that if you have N elements, what you're creating is a way for all of these N elements to talk to each other. So the idea basically is that you are paying attention: each element is paying attention to each of the other elements. And basically, by doing this, it's really trying to figure out, you're basically getting a much better view of the data. So for example, if you have a sentence of, like, four words, the point is, if you get a representation or a feature for this entire sentence, it's constructed in a way such that each word has paid attention to everything else.

Now, the reason it's different from, say, what you would do in a ConvNet is basically that in the ConvNet, you would only pay attention to a local window. So each word would only pay attention to its next neighbor, or one neighbor after that. And the same thing goes for images: in images, you would basically pay attention to pixels in a three-by-three or a seven-by-seven neighborhood. Whereas with the transformer, the self-attention mainly, the sort of idea is that each element needs to pay attention to each other element.
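Here is a hedged sketch of the scaled dot-product self-attention at the core of a transformer: a toy single-head version in plain PyTorch, not any particular production implementation:

```python
import torch

def self_attention(x: torch.Tensor, w_q, w_k, w_v) -> torch.Tensor:
    """Single-head self-attention: every element attends to every element.

    x: (sequence_length, model_dim); w_q, w_k, w_v: (model_dim, head_dim).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Each of the N elements scores its affinity with all N elements.
    scores = (q @ k.T) / (k.shape[-1] ** 0.5)
    weights = torch.softmax(scores, dim=-1)  # (N, N) attention pattern
    return weights @ v                       # context-aware features

# A "sentence" of four elements with 8-dimensional features.
x = torch.randn(4, 8)
w_q, w_k, w_v = (torch.randn(8, 8) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # torch.Size([4, 8]): each word has seen all four words
```

Unlike a ConvNet's fixed local window, the (N, N) weight matrix lets every position draw on the full context.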
And when you say attention, maybe another way to phrase that is you're considering a context, a wide context: the wide context of the sentence in understanding the meaning of a particular word, and in computer vision, a larger context to understand the local pattern of a particular local part of an image.

Right. So basically, if you have, say, again, a banana in the image, you're looking at the full image first. So whether it's, like, you know, you're looking at all the pixels that are of a kitchen, or of a dining table, and so on, and then you're basically looking at the banana also.

Yeah, by the way, in terms of, if we were to train the funny classifier, there's something funny about the word banana. Just wanted to anticipate that.

I am wearing a banana shirt, so yeah.

Are there bananas on it? Okay, so masking has worked in the vision context as well. And so this transformer idea has worked as well. So basically looking at all the elements to understand a particular element has been really powerful in vision.

The reason is, a lot of things are hard to make out when you're looking at them in isolation. So if you look at just a blob of pixels: Antonio Torralba at MIT used to have this really famous image, which I looked at when I was a PhD student. He would basically have a blob of pixels, and he would ask you: hey, what is this? And it looked basically like a shoe, or it could look like a TV remote; it could look like anything. And it turns out it was a beer bottle. I'm not sure it was one of these three things, but basically, he showed you the full picture, and then it was very obvious what it was. But the point is, just by looking at that particular local window, you couldn't figure it out. Because of resolution, because of other things, it's just not always easy to figure out, by looking at just the neighborhood of pixels, what these pixels are. And the same thing happens for language as well.

For the parameters that have to learn something about the data, you need to give them the capacity to learn the essential things. If the model's not actually able to receive the signal at all, then it's not going to be able to learn that signal. And in order to understand images, to understand language, you have to be able to see words in their full context.
Okay, what is harder to solve: vision or language? Visual intelligence or linguistic intelligence?

So I'm going to say computer vision is harder. My reason for this is basically that language, of course, has a big structure to it, because we developed it. Whereas vision is something that is common in a lot of animals. Everyone is able to get by; a lot of these animals on Earth are actually able to get by without language. And a lot of these animals we also deem to be intelligent. So clearly, intelligence does have a visual component to it. And yes, of course, in the case of humans, it also has a linguistic component. But it means that there is something far more fundamental about vision than there is about language. And I'm sorry to anyone who disagrees, but yes, this is what I feel.

So is that being a little bit reflected in the challenges that have to do with the progress of self-supervised learning, would you say? Or is that just a peculiar accident of the progress of the AI community, that we focused on, or we discovered, self-attention and transformers in the context of language first?

So the self-supervised learning success for vision actually has not much to do with the transformers part. I would say it's actually been independent a little bit. I think it's just that the signal was a little bit different for vision than it was for NLP, and probably NLP folks discovered it before. So for vision, the main success has basically been these crops so far, taking different crops of images, whereas for NLP, it was this masking thing. But also the level of success is still much higher for language. So that has a lot to do with, I mean, I can get into a lot of details.
For this particular question? Let's go for it, okay.

So the first thing is, language is very structured. So you are going to produce a distribution over a finite vocabulary. English has a finite number of words; it's actually not that large. And when you're doing this masking thing, all you need to do is basically tell me which one of these, like, 50,000 words it is. Now, for vision, let's imagine doing the same thing. Okay, we're basically going to blank out a particular part of the image, and we ask the network, this neural network, to predict what is present in this missing patch. It's combinatorially large, right? You have 256 pixel values. If you're producing, basically, a seven-by-seven or a fourteen-by-fourteen window of pixels, at each of these 49 or 196 locations you have 256 values to predict. And so it's really, really large. And very quickly, the kind of prediction problems that we're setting up are going to become extremely intractable for us.
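To put a rough number on that combinatorial gap, here is my own back-of-the-envelope arithmetic using the figures mentioned above; it is not from the conversation itself. A masked word is a choice among a vocabulary of roughly $5 \times 10^4$ items, while a masked 7x7 RGB patch with 256 values per channel has

\[
256^{3 \times 7 \times 7} = 2^{8 \cdot 147} = 2^{1176} \approx 10^{354}
\]

possible completions; even a grayscale 7x7 patch gives $256^{49} \approx 10^{118}$. A softmax over all outcomes is hopeless for raw pixels in a way it is not for words.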
And so the thing is, for NLP, it has been really successful because we are very good at predicting, like, doing this distribution over a finite set. And the problem is, when this set becomes really large, we are going to become really, really bad at making these predictions and at solving basically this particular set of problems. So if you were to do it exactly the same way as NLP for vision, there is very limited success. The way stuff is working right now is actually not by predicting these masks. It's basically by saying that you take these two crops from the image, you get a feature representation from each, and, just saying that these two features, so they're like vectors, just saying that the distance between these vectors should be small. And so it's a very different way of learning from the visual signal than there is for NLP.

Okay, the other reason is the distributional hypothesis that we talked about for NLP, right? So a word given its context: basically, the context actually supplies a lot of meaning to the word. Now, there are just a finite number of words, and there is a finite way in which we compose them. Of course, the same thing holds for pixels, but in language there's a lot of structure, right? So I always say, whatever: 'the dash jumped over the fence', for example. There are lots of these sentences that you'll get. And from this, you can actually see that this particular sentence might occur in a lot of different contexts as well; this exact same sentence might occur in a different context. So 'the sheep jumped over the fence', 'the cat jumped over the fence', 'the dog jumped over the fence'. So you immediately get a lot of these words which, because this particular token itself has so much meaning, you get a lot of these tokens, or these words, which are actually going to have sort of this related meaning across this given context.

Whereas for vision, it's much harder, because of just the pure way we capture images: lighting can be different; there might be different noise in the sensor. So the thing is, you're capturing a physical phenomenon, and then you're basically going through a very complicated pipeline of image processing, and then you're translating that into some kind of digital signal. Whereas with language, you write it down and you transfer it to a digital signal; it's almost a lossless transfer. And each of these tokens is very, very well defined.
There could be a little bit of an argument there, because language as written down is a projection of thought. This is one of the open questions: if you can perfectly solve language, are you getting close to being able to pass the Turing test with flying colors, kind of thing? So it's similar, but different, and the computer vision problem, in the 2D plane, is a projection of the three-dimensional world. So perhaps there are similar problems there. Maybe this is a good...

I mean, I think what I'm saying is, NLP is not easy. Of course, don't get me wrong. Abstract thought expressed in knowledge, or knowledge basically expressed in language, is really hard to understand, right? I mean, we've been communicating with language for so long, and it is, of course, a very complicated concept. The thing is, at least getting somewhat reasonable, like, being able to solve some kind of reasonable tasks with language, is, I would say, slightly easier than it is with computer vision.

Yeah, I would say, yeah. So that's well put. I would say getting impressive performance on language is easier. I feel like for both language and computer vision, there's going to be this wall, this hump you have to overcome to achieve superhuman-level performance, or human-level performance. And I feel like for language, that wall is farther away. So you can get pretty nice, you can do a lot of tricks, you can show really impressive performance. You can even fool people that your tweeting, or your blog post writing, or your question answering, has intelligence behind it. But to truly demonstrate understanding of dialogue, of continuous long-form dialogue, that would require perhaps big breakthroughs. In the same way, in computer vision, I think the big breakthroughs need to happen earlier to achieve impressive performance.

This might be a good place to ask, you already mentioned it, but: what is contrastive learning, and what are energy-based models?
Contrastive learning is sort of the paradigm of learning where the idea is that you are learning this embedding space, or you're learning this sort of vector space, of all your concepts. And the way you learn that is basically by contrasting. So the idea is that you have a sample; you have another sample that's related to it, so that's called the positive; and you have another sample that's not related to it, so that's the negative.

So for example, let's just take an NLP example, or a simple example in computer vision. You have an image of a cat, you have an image of a dog, and for whatever application that you're doing, say you're trying to figure out what pets are, you're saying that these two images are related. So the image of a cat and the image of a dog are related. But now you have another, third image, of a banana, because you don't like that word. So now you basically have this banana.

Thank you for speaking to the crowd.

And so you take both of these images, and you take the image of the cat and the image of the dog, and you get a feature from both of them. And now what you're training the network to do is basically pull both of these features together, while pushing them away from the feature of the banana. So this is the contrastive part: you're contrasting against the banana. So there's always this notion of a negative and a positive.
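A hedged sketch of this pull-together, push-apart objective in PyTorch, using a simple triplet-style margin loss; the specific loss and the numbers are my illustration, and contrastive methods in the literature vary:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, negative, margin: float = 0.5):
    """Pull anchor and positive together, push the negative away."""
    # Cosine distances between feature vectors.
    pos_dist = 1 - F.cosine_similarity(anchor, positive)
    neg_dist = 1 - F.cosine_similarity(anchor, negative)
    # Penalize whenever the negative is not at least `margin` farther away.
    return torch.clamp(pos_dist - neg_dist + margin, min=0).mean()

# Stand-ins for features of a cat image, a dog image, and a banana image.
cat = torch.randn(1, 128, requires_grad=True)
dog = torch.randn(1, 128, requires_grad=True)
banana = torch.randn(1, 128, requires_grad=True)

loss = contrastive_loss(anchor=cat, positive=dog, negative=banana)
loss.backward()  # gradients reshape the embedding space
```

In a real self-supervised setup these vectors would come from an encoder, with the positive being another crop of the same image and the negative a crop of a different image.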
link |
Now, energy based models are like one way
link |
that Jan sort of explains a lot of these methods.
link |
So Jan basically, I think a couple of years
link |
or more than that, like when I joined Facebook,
link |
Jan used to keep mentioning this word, energy based models.
link |
And of course I had no idea what he was talking about.
link |
So then one day I caught him in one of the conference rooms
link |
and I'm like, can you please tell me what this is?
link |
So then like very patiently,
link |
he sat down with like a marker and a whiteboard.
link |
And his idea basically is that
link |
rather than talking about probability distributions,
link |
you can talk about energies of models.
link |
So models are trying to minimize certain energies
link |
or they're trying to maximize a certain kind of energy.
link |
And the idea basically is that
link |
you can explain a lot of the contrastive models,
link |
GANs, for example,
link |
which are like Generative Adversarial Networks.
link |
A lot of these modern learning methods
link |
or VAEs, which are Variational Autoencoders,
link |
you can really explain them very nicely
link |
in terms of an energy function
link |
that they're trying to minimize or maximize.
link |
And so by putting this common sort of language
link |
for all of these models,
link |
what looks very different in machine learning
link |
that, oh, VAEs are very different from what GANs are,
link |
are very, very different from what contrastive models are,
link |
you actually get a sense of like,
link |
oh, these are actually very, very related.
link |
It's just that the way or the mechanism
link |
in which they're sort of maximizing
link |
or minimizing this energy function is slightly different.
link |
It's revealing the commonalities
link |
between all these approaches
link |
and putting a sexy word on top of it, like energy.
link |
And so similarities,
link |
two things that are similar have low energy.
link |
Like the low energy signifying similarity.
link |
So basically the idea is that if you were to imagine
link |
like the embedding as a manifold, a 2D manifold,
link |
you would get a hill or like a high sort of peak
link |
in the energy manifold,
link |
wherever two things are not related.
link |
And basically you would have like a dip
link |
where two things are related.
link |
So you'd get a dip in the manifold.
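In code, one simple and purely illustrative choice of energy is the squared distance between embeddings: related pairs should end up with low energy (the dip), unrelated pairs with high energy (the peak).

import torch

def energy(f_x, f_y):
    # Low energy for related pairs, high energy for unrelated ones.
    # Squared Euclidean distance is just one simple choice.
    return ((f_x - f_y) ** 2).sum(dim=-1)

Training then amounts to pushing this energy down for positives and finding some mechanism, contrastive or otherwise, to keep it from going down everywhere.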
link |
And in the self supervised context,
link |
how do you know two things are related
link |
and two things are not related?
link |
So this is where all the sort of ingenuity or tricks come in.
link |
So for example, like you can take
link |
the fill in the blank problem,
link |
or you can take in the context problem.
link |
And what you can say is two words
link |
that are in the same context are related.
link |
Two words that are in different contexts are not related.
link |
For images, basically two crops
link |
from the same image are related.
link |
And whereas a third image is not related at all.
link |
Or for a video, it can be two frames
link |
from that video are related
link |
because they're likely to contain
link |
the same sort of concepts in them.
link |
Whereas a third frame
link |
from a different video is not related.
link |
So it basically is, it's a very general term.
link |
Contrastive learning really has nothing specifically
link |
to do with self supervised learning.
link |
It actually is very popular in for example,
link |
like any kind of metric learning
link |
or any kind of embedding learning.
link |
So it's also used in supervised learning.
link |
And the thing is because we are not really using labels
link |
to get these positive or negative pairs,
link |
it can basically also be used for self supervised learning.
link |
So you mentioned one of the ideas
link |
in the vision context that works
link |
is to have different crops.
link |
So you could think of that as a way
link |
to sort of manipulate the data
link |
to generate examples that are similar.
link |
Obviously, there's a bunch of other techniques.
link |
You mentioned lighting;
link |
in images, lighting is something that varies a lot
link |
and you can artificially change those kinds of things.
link |
There's the whole broad field of data augmentation,
link |
which manipulates images in order to increase arbitrarily
link |
the size of the data set.
link |
First of all, what is data augmentation?
link |
And second of all, what's the role of data augmentation
link |
in self supervised learning and contrastive learning?
link |
So data augmentation is just a way like you said,
link |
it's basically a way to augment the data.
link |
So you have say n samples.
link |
And what you do is you basically define
link |
some kind of transforms for the sample.
link |
So you take your say image
link |
and then you define a transform
link |
where you can just change, say, the colors
link |
or the brightness of the image
link |
or increase or decrease the contrast of the image
link |
for example, or take different crops of it.
link |
So data augmentation is just a process
link |
to like basically perturb the data
link |
or like augment the data, right?
link |
And so it has played a fundamental role
link |
for computer vision for self supervised learning especially.
link |
The way most of the current methods work,
link |
contrastive or otherwise,
link |
in the case of images, is by taking an image
link |
and then computing basically two perturbations of it.
link |
So these can be two different crops of the image
link |
with like different types of lighting
link |
or different contrast or different colors.
link |
So you jitter the colors a little bit and so on.
link |
And now the idea is basically because it's the same object
link |
or because it's like related concepts
link |
in both of these perturbations,
link |
you want the features from both of these perturbations to be similar.
link |
So now you can use a variety of different ways
link |
to enforce this constraint,
link |
like these features being similar.
link |
You can do this by contrastive learning.
link |
So basically, both of these things are positives,
link |
a third sort of image is negative.
link |
You can do this basically by like clustering.
link |
For example, you can say that both of these images should,
link |
the features from both of these images
link |
should belong in the same cluster because they're related,
link |
whereas another image
link |
should belong to a different cluster.
link |
So there's a variety of different ways
link |
to basically enforce this particular constraint.
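A rough sketch of this two-perturbation setup with torchvision transforms follows; the crop size and jitter parameters are illustrative, loosely in the style of recipes like SimCLR rather than the exact settings of any one method.

from torchvision import transforms

# One random "view": crop, flip, color jitter, blur.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4,
                           saturation=0.4, hue=0.1),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])

def two_views(image):
    # Same underlying content, two different perturbations; the network
    # is then trained so the features of both views come out similar.
    return augment(image), augment(image)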
link |
By the way, when you say features,
link |
it means there's a very large neural network
link |
that's extracting patterns from the image,
link |
and the kind of patterns it extracts
link |
should be either identical or very similar.
link |
That's what that means.
link |
So the neural network basically takes in the image
link |
and then outputs a set of like,
link |
basically a vector of like numbers,
link |
and that's the feature.
link |
And you want this feature for both of these
link |
like different crops that you computed to be similar.
link |
So you want this vector to be identical
link |
in its like entries, for example.
link |
Be like literally close
link |
in this multi dimensional space to each other.
link |
And like you said,
link |
close can mean part of the same cluster or something like that
link |
in this large space.
link |
First of all,
link |
I wonder if there is a connection
link |
to the way humans learn,
link |
almost like maybe subconsciously,
link |
in order to understand a thing,
link |
you kind of have to see it from two, three multiple angles.
link |
I wonder, I have a lot of friends
link |
who are neuroscientists maybe and cognitive scientists.
link |
I wonder if that's in there somewhere.
link |
Like in order for us to place a concept in its proper place,
link |
we have to basically crop it in all kinds of ways,
link |
do basic data augmentation on it
link |
in whatever very clever ways that the brain likes to do.
link |
Like spinning around in our minds somehow
link |
that that is very effective.
link |
So I think for some of them, we like need to do it.
link |
So like babies, for example, pick up objects,
link |
like move them and put them close to their eye and whatnot.
link |
But for certain other things,
link |
actually we are good at imagining it as well, right?
link |
So if you, I have never seen, for example,
link |
an elephant from the top.
link |
I've never basically looked at it from like top down.
link |
But if you showed me a picture of it,
link |
I could very well tell you that that's an elephant.
link |
So I think some of it, we're just like,
link |
we naturally build it or transfer it from other objects
link |
that we've seen to imagine what it's going to look like.
link |
Has anyone done that with augmentation?
link |
Like imagine all the possible things
link |
that are occluded or not there,
link |
but not just like normal things, like wild things,
link |
but they're nevertheless physically consistent.
link |
So, I mean, people do kind of like
link |
occlusion based augmentation as well.
link |
So you place in like a random gray box
link |
to sort of mask out a certain part of the image.
link |
And the thing is basically you're kind of occluding it.
link |
For example, you place it say on half of a person's face.
link |
So basically saying that, you know,
link |
something below their nose is occluded
link |
because it's grayed out.
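torchvision ships a ready-made transform for this style of occlusion; the box size and gray value below are illustrative.

import torch
from torchvision import transforms

# Gray out a random box in an image tensor, e.g. masking part of a face.
occlude = transforms.RandomErasing(p=1.0, scale=(0.05, 0.2), value=0.5)
image = torch.rand(3, 224, 224)   # dummy image tensor
occluded = occlude(image)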
link |
So, you know, I meant like, you have like, what is it?
link |
A table and you can't see behind the table.
link |
And you imagine there's a bunch of elves
link |
with bananas behind the table.
link |
Like, I wonder if it's useful
link |
for the network to have a wild imagination
link |
because that's possible or maybe not elves,
link |
but like puppies and kittens or something like that.
link |
Just have a wild imagination
link |
and like constantly be generating that wild imagination.
link |
Because in terms of data augmentation,
link |
as currently applied, it's super ultra, very boring.
link |
It's very basic data augmentation.
link |
I wonder if there's a benefit to being wildly imaginative
link |
while trying to be consistent with physical reality.
link |
I think it's a kind of a chicken and egg problem, right?
link |
Because to have like amazing data augmentation,
link |
you need to understand what the scene is.
link |
And we're trying to do data augmentation
link |
to learn what a scene is anyway.
link |
So it basically just keeps going on.
link |
Before you understand it,
link |
just put elves with bananas
link |
until you know it not to be true.
link |
Just like children have a wild imagination
link |
until the adults ruin it all.
link |
Okay, so what are the different kinds of data augmentation
link |
that you've seen to be effective in visual intelligence?
link |
it's a lot of these image filtering operations.
link |
So like blurring the image,
link |
you know, all the kind of Instagram filters
link |
that you can think of.
link |
So like arbitrarily like make the red super red,
link |
make the greens super green, like saturate the image.
link |
Rotation, cropping.
link |
Rotation, cropping, exactly.
link |
All of these kinds of things.
link |
Like I said, lighting is a really interesting one to me.
link |
Like that feels like really complicated to do.
link |
I mean, the augmentations
link |
that we work on aren't like,
link |
they're not going to be
link |
physically realistic versions of lighting.
link |
It's not that you're assuming
link |
that there's a light source up
link |
and then you're moving it to the right
link |
and then what does the thing look like?
link |
It's really more about like brightness of the image,
link |
overall brightness of the image
link |
or overall contrast of the image and so on.
link |
But this is a really important point to me.
link |
I always thought that data augmentation
link |
holds an important key
link |
to big improvements in machine learning.
link |
And it seems that it is an important aspect
link |
of self supervised learning.
link |
So I wonder if there's big improvements to be achieved
link |
on much more intelligent kinds of data augmentation.
link |
For example, currently,
link |
maybe you can correct me if I'm wrong,
link |
data augmentation is not parameterized.
link |
You're not learning.
link |
You're not learning.
link |
To me, it seems like data augmentation potentially
link |
should involve more learning
link |
than the learning process itself.
link |
You're almost like thinking of like generative kind of,
link |
it's the elves with bananas.
link |
it's like very active imagination
link |
of messing with the world
link |
and teaching that mechanism for messing with the world
link |
Because that feels like,
link |
I mean, it's imagination.
link |
It's just, as you said,
link |
it feels like us humans are able to,
link |
maybe sometimes subconsciously,
link |
imagine before we see the thing,
link |
imagine what we're expecting to see,
link |
like maybe several options.
link |
And especially, we probably forgot,
link |
but when we were younger,
link |
probably the possibilities were wilder, more numerous.
link |
And then as we get older,
link |
we come to understand the world
link |
and the possibilities of what we might see
link |
becomes less and less and less.
link |
So I wonder if you think there's a lot of breakthroughs
link |
yet to be had in data augmentation.
link |
And maybe also can you just comment on the stuff we have,
link |
is that a big part of self supervised learning?
link |
So data augmentation is like key to self supervised learning,
link |
like the kind of augmentation that we're using.
link |
And basically the fact that we're trying to learn
link |
these neural networks that are predicting these features
link |
from images that are robust under data augmentation
link |
has been the key for visual self supervised learning.
link |
And they play a fairly fundamental role to it.
link |
Now, the irony of all of this is that
link |
like deep learning purists will say
link |
the entire point of deep learning is that
link |
you feed in the pixels to the neural network
link |
and it should figure out the patterns on its own.
link |
So if it really wants to look at edges,
link |
it should look at edges.
link |
You shouldn't really go
link |
and handcraft these features, right?
link |
You shouldn't go tell it that look at edges.
link |
So data augmentation
link |
should basically be in the same category, right?
link |
Why should we tell the network
link |
or tell this entire learning paradigm
link |
what kinds of data augmentation that we're looking for?
link |
We are encoding a very sort of human specific bias there
link |
that we know things are like,
link |
if you change the contrast of the image,
link |
it should still be an apple
link |
or it should still see apple, not banana.
link |
And basically if we change like colors,
link |
it should still be the same kind of concept.
link |
Of course,
link |
this doesn't feel like super satisfactory
link |
because a lot of our human knowledge
link |
or our human supervision
link |
is actually going into the data augmentation.
link |
So although we are calling it self supervised learning,
link |
a lot of the human knowledge
link |
is actually being encoded in the data augmentation process.
link |
So it's really like,
link |
we've kind of sneaked away the supervision at the input
link |
and we're like really designing
link |
this nice list of data augmentations
link |
that are working very well.
link |
Of course, the idea is that it's much easier
link |
to design a list of data augmentations than it is to go and label everything.
link |
So humans are nevertheless doing less and less work
link |
and maybe leveraging their creativity more and more.
link |
And when we say data augmentation is not parameterized,
link |
it means it's not part of the learning process.
link |
Do you think it's possible to integrate
link |
some of the data augmentation into the learning process?
link |
And in fact, it will be really beneficial for us
link |
because a lot of these data augmentations
link |
that we use in vision are very extreme.
link |
For example, like when you have certain concepts,
link |
again, a banana, you take the banana
link |
and then basically you change the color of the banana, right?
link |
So you make it a purple banana.
link |
Now this data augmentation process
link |
is actually independent of the,
link |
like it has no notion of what is present in the image.
link |
So it can change this color arbitrarily.
link |
It can make it a red banana as well.
link |
And now what we're doing is we're telling
link |
the neural network that this red banana
link |
and so a crop of this image which has the red banana
link |
and a crop of this image where I changed the color
link |
to a purple banana should be the same,
link |
the features should be the same.
link |
Now bananas aren't red or purple mostly.
link |
So really the data augmentation process
link |
should take into account what is present in the image
link |
and what are the kinds of physical realities
link |
that are possible.
link |
It shouldn't be completely independent of the image.
link |
So you might get big gains if you,
link |
instead of being drastic, do subtle augmentation
link |
but realistic augmentation.
link |
I'm not sure if it's subtle, but like realistic for sure.
link |
If it's realistic, then even subtle augmentation
link |
will give you big benefits.
link |
And it will be like for particular domains
link |
you might actually see like,
link |
if for example, now we're doing medical imaging,
link |
there are going to be certain kinds
link |
of like geometric augmentation
link |
which are not really going to be very valid
link |
for the human body.
link |
So if you were to like actually loop in data augmentation
link |
into the learning process,
link |
it will actually be much more useful.
link |
Now this actually does take us
link |
to maybe a semi supervised kind of a setting
link |
because you do want to understand
link |
what is it that you're trying to solve.
link |
So currently self supervised learning
link |
kind of operates in the wild, right?
link |
So you do the self supervised learning
link |
and the purists and all of us basically say that,
link |
okay, this should learn useful representations
link |
and they should be useful for any kind of end task,
link |
no matter it's like banana recognition
link |
or like autonomous driving.
link |
Now it's a tall order.
link |
Maybe the first baby step for us should be that,
link |
okay, if you're trying to loop in this data augmentation
link |
into the learning process,
link |
then we at least need to have some sense
link |
of what we're trying to do.
link |
Are we trying to distinguish
link |
between different types of bananas
link |
or are we trying to distinguish between banana and apple
link |
or are we trying to do all of these things at once?
link |
And so some notion of like what happens at the end
link |
might actually help us do much better at this side.
link |
Let me ask you a ridiculous question.
link |
If I were to give you like a black box,
link |
like a choice to have an arbitrarily large data set
link |
of real natural data
link |
versus really good data augmentation algorithms,
link |
which would you like to train in a self supervised way on?
link |
So natural data from the internet that's arbitrarily large,
link |
so unlimited data,
link |
or it's like more controlled good data augmentation
link |
on the finite data set.
link |
The thing is like,
link |
because our learning algorithms for vision right now
link |
really rely on data augmentation,
link |
even if you were to give me
link |
like an infinite source of like image data,
link |
I still need a good data augmentation algorithm.
link |
You need something that tells you
link |
that two things are similar.
link |
Because even though you've given me an arbitrarily large data set,
link |
I still need to use data augmentation
link |
to take that image construct,
link |
like these two perturbations of it,
link |
and then learn from it.
link |
So the thing is our learning paradigm
link |
is very primitive right now.
link |
Even if you were to give me lots of images,
link |
it's still not really useful.
link |
A good data augmentation algorithm
link |
is actually going to be more useful.
link |
So you can like reduce down the amount of data
link |
that you give me by like 10 times,
link |
but if you were to give me
link |
a good data augmentation algorithm,
link |
that would probably do better
link |
than giving me like 10 times the size of that data,
link |
but me having to rely on
link |
like a very primitive data augmentation algorithm.
link |
Like through tagging and all those kinds of things,
link |
is there a way to discover things
link |
that are semantically similar on the internet?
link |
Obviously there is, but they might be extremely noisy.
link |
And the difference might be farther away
link |
than you would be comfortable with.
link |
So, I mean, yes, tagging will help you a lot.
link |
It'll actually go a very long way
link |
in figuring out what images are related or not.
link |
But then the purists would argue
link |
that when you're using human tags,
link |
because these tags are like supervision,
link |
is it really self supervised learning now?
link |
Because you're using human tags
link |
to figure out which images are like similar.
link |
Hashtag no filter means a lot of things.
link |
I mean, there are certain tags
link |
which are going to be applicable pretty much to anything.
link |
So they're pretty useless for learning.
link |
But I mean, certain tags are actually like
link |
the Eiffel Tower, for example,
link |
or the Taj Mahal, for example.
link |
These tags are like very indicative of what's going on.
link |
And they are, I mean, they are human supervision.
link |
This is one of the tasks of discovering
link |
from human generated data strong signals
link |
that could be leveraged for self supervision.
link |
Like humans are doing so much work already.
link |
Like many years ago, there was something that was called,
link |
I guess, human computation back in the day.
link |
Humans are doing so much work.
link |
It'd be exciting to discover ways to leverage
link |
the work they're doing to teach machines
link |
without any extra effort from them.
link |
An example could be, like we said, driving,
link |
humans driving and machines can learn from the driving.
link |
I always hope that there could be some supervision signal
link |
discovered in video games,
link |
because there's so many people that play video games
link |
that it feels like so much effort is put into video games,
link |
into playing video games,
link |
and you can design video games somewhat cheaply
link |
to include whatever signals you want.
link |
It feels like that could be leveraged somehow.
link |
So people are using that.
link |
Like there are actually folks right here in UT Austin,
link |
like Philipp Krähenbühl is a professor at UT Austin.
link |
He's been like working on video games
link |
as a source of supervision.
link |
I mean, it's really fun.
link |
Like as a PhD student,
link |
getting to basically play video games all day.
link |
Yeah, but so I do hope that kind of thing scales
link |
and like ultimately boils down to discovering
link |
some undeniably very good signal.
link |
It's like masking in NLP.
link |
But that said, there's non contrastive methods.
link |
What do non contrastive energy based
link |
self supervised learning methods look like?
link |
And why are they promising?
link |
So like I said about contrastive learning,
link |
you have this notion of a positive and a negative.
link |
Now, the thing is, this entire learning paradigm
link |
really requires access to a lot of negatives
link |
to learn a good sort of feature space.
link |
The idea is if I tell you, okay,
link |
so a cat and a dog are similar,
link |
and they're very different from a banana.
link |
The thing is, this is a fairly simple analogy, right?
link |
Because bananas look visually very different
link |
from what cats and dogs do.
link |
So very quickly, if this is the only source
link |
of supervision that I'm giving you,
link |
your learning is not going to be like,
link |
after a point, the neural network
link |
is really not going to learn a lot.
link |
Because the negative that you're getting
link |
is going to be so random.
link |
So it can be, oh, a cat and a dog are very similar,
link |
but they're very different from a Volkswagen Beetle.
link |
Now, like this car looks very different
link |
from these animals again.
link |
So the thing is in contrastive learning,
link |
the quality of the negative sample really matters a lot.
link |
And so what has happened is basically that
link |
typically these methods that are contrastive
link |
really require access to lots of negatives,
link |
which becomes harder and harder to sort of scale
link |
when designing a learning algorithm.
link |
So that's been one of the reasons
link |
why non contrastive methods have become like popular
link |
and why people think that they're going to be more useful.
link |
So a non contrastive method, for example,
link |
like clustering is one non contrastive method.
link |
The idea basically being that you have
link |
two of these samples, so the cat and dog
link |
or two crops of this image,
link |
they belong to the same cluster.
link |
And so essentially you're basically doing clustering online
link |
when you're learning this network,
link |
and which is very different from having access
link |
to a lot of negatives explicitly.
link |
The other way which has become really popular
link |
is something called self distillation.
link |
So the idea basically is that you have a teacher network
link |
and a student network,
link |
and the teacher network produces a feature.
link |
So it takes in the image
link |
and basically the neural network figures out the patterns
link |
gets the feature out.
link |
And there's another neural network
link |
which is the student neural network
link |
and that also produces a feature.
link |
And now all you're doing is basically saying
link |
that the features produced by the teacher network
link |
and the student network should be very similar.
link |
There is no notion of a negative anymore.
link |
So it's all about similarity maximization
link |
between these two features.
link |
And so all I need to now do is figure out
link |
how to have these two sorts of parallel networks,
link |
a student network and a teacher network.
link |
And basically researchers have figured out
link |
very cheap methods to do this.
link |
So you can actually have for free really
link |
two types of neural networks.
link |
They're kind of related,
link |
but they're different enough that you can actually
link |
basically have a learning problem set up.
link |
So you can ensure that they always remain different enough.
link |
So the thing doesn't collapse into something boring.
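One popular version of that trick, used in methods like BYOL and DINO, is to make the teacher an exponential moving average of the student. A minimal sketch, with an illustrative momentum value:

import copy
import torch

student = torch.nn.Sequential(torch.nn.Linear(128, 128),
                              torch.nn.ReLU(),
                              torch.nn.Linear(128, 64))
teacher = copy.deepcopy(student)   # same architecture, separate weights

@torch.no_grad()
def update_teacher(student, teacher, momentum=0.996):
    # The teacher is an exponential moving average of the student:
    # two related-but-different networks essentially for free.
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1 - momentum)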
link |
So the main sort of enemy of self supervised learning,
link |
any kind of similarity maximization technique is collapse.
link |
Collapse means that you learn the same feature
link |
representation for all the images in the world,
link |
which is completely useless.
link |
Everything's a banana.
link |
Everything is a banana.
link |
Everything is a cat.
link |
Everything is a car.
link |
And so all we need to do is basically come up with ways
link |
to prevent collapse.
link |
Contrastive learning is one way of doing it.
link |
And then for example, clustering or self distillation
link |
are other ways of doing it.
link |
We also had a recent paper where we used like
link |
de correlation between like two sets of features
link |
to prevent collapse.
link |
So that's inspired a little bit by like Horace Barlow's
link |
neuroscience principles.
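A sketch in the spirit of that decorrelation objective; this follows the rough shape of the Barlow Twins loss, with an illustrative weighting term.

import torch

def decorrelation_loss(z_a, z_b, lam=0.005):
    # z_a, z_b: (batch, dim) features from two views of the same images.
    z_a = (z_a - z_a.mean(dim=0)) / z_a.std(dim=0)
    z_b = (z_b - z_b.mean(dim=0)) / z_b.std(dim=0)
    c = (z_a.T @ z_b) / z_a.shape[0]          # cross-correlation matrix
    diag = torch.diagonal(c)
    invariance = ((diag - 1) ** 2).sum()      # matching dims should correlate
    redundancy = (c ** 2).sum() - (diag ** 2).sum()   # off-diagonals should not
    return invariance + lam * redundancy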
link |
By the way, I should comment that whoever counts
link |
the number of times the words banana, apple, cat and dog
link |
were used in this conversation wins the internet.
link |
What is SwAV and the main improvement proposed
link |
in the paper Unsupervised Learning of Visual Features
link |
by Contrasting Cluster Assignments?
link |
SwAV basically is a clustering based technique,
link |
which is for again, the same thing for self supervised
link |
learning in vision where we have two crops.
link |
And the idea basically is that you want the features
link |
from these two crops of an image to lie in the same cluster
link |
and basically crops that are coming from different images
link |
to be in different clusters.
link |
Now, typically in a sort of,
link |
if you were to do this clustering,
link |
you would perform clustering offline.
link |
What that means is you would,
link |
if you have a dataset of N examples,
link |
you would run over all of these N examples,
link |
get features for them, perform clustering.
link |
So basically get some clusters
link |
and then repeat the process again.
link |
So this is offline basically because I need to do one pass
link |
through the data to compute its clusters.
link |
SwAV is basically just a simple way of doing this online.
link |
So as you're going through the data,
link |
you're actually computing these clusters online.
link |
And so of course there is like a lot of tricks involved
link |
in how to do this in a robust manner without collapsing,
link |
but this is this sort of key idea to it.
link |
Is there a nice way to say what is the key methodology
link |
of the clustering that enables that?
link |
Right, so the idea basically is that
link |
when you have N samples,
link |
we assume that we have access to,
link |
like there are always K clusters in a dataset.
link |
K is a fixed number.
link |
So for example, K is 3000.
link |
And so if you have any,
link |
when you look at any sort of small number of examples,
link |
all of them must belong to one of these K clusters.
link |
And we impose this equipartition constraint.
link |
What this means is that basically
link |
your entire set of N samples
link |
should be equally partitioned into K clusters.
link |
So all your K clusters are basically equal,
link |
they have equal contribution to these N samples.
link |
And this ensures that we never collapse.
link |
So collapse can be viewed as a way
link |
in which all samples belong to one cluster, right?
link |
So all this, if all features become the same,
link |
then you have basically just one mega cluster.
link |
You don't even have like 10 clusters or 3000 clusters.
link |
So SwAV basically ensures that at each point,
link |
all these 3000 clusters are being used
link |
in the clustering process.
link |
Basically just figure out how to do this online.
link |
And again, basically just make sure
link |
that two crops from the same image belong to the same cluster
link |
And the fact that you have a fixed K makes things simpler.
link |
Fixed K makes things simpler.
link |
Our clustering is not like really hard clustering,
link |
it's soft clustering.
link |
So basically you can be 0.2 to cluster number one
link |
and 0.8 to cluster number two.
link |
So it's not really hard.
link |
So essentially, even though we have like 3000 clusters,
link |
we can actually represent a lot more concepts.
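The equipartition constraint is typically enforced with a few Sinkhorn-Knopp normalization steps. Roughly, following the shape of the procedure in the SwAV paper (the epsilon and iteration count are illustrative):

import torch

@torch.no_grad()
def sinkhorn(scores, n_iters=3, eps=0.05):
    # scores: (batch, K) similarities between features and K prototypes.
    # Returns soft assignments where every prototype receives roughly the
    # same total mass across the batch, i.e. the equipartition constraint.
    q = torch.exp(scores / eps).T      # (K, batch)
    q /= q.sum()
    K, B = q.shape
    for _ in range(n_iters):
        q /= q.sum(dim=1, keepdim=True)    # each cluster gets equal mass
        q /= K
        q /= q.sum(dim=0, keepdim=True)    # each sample's assignments sum to 1
        q /= B
    return (q * B).T                   # (batch, K) soft cluster assignments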
link |
What is SEER, S E E R?
link |
And what are the key results and insights in the paper,
link |
Self Supervised Pre Training of Visual Features in the Wild?
link |
What is this big, beautiful SEER system?
link |
SEER, so I'll first go to SwAV
link |
because SwAV is actually like one
link |
of the key components for SEER.
link |
So SwAV, when we used SwAV,
link |
it was demonstrated on ImageNet.
link |
So typically for self supervised methods,
link |
the way we sort of operate in the research community is,
link |
we take ImageNet, which of course I talked about
link |
as having lots of labels.
link |
And then we throw away the labels,
link |
like throw away all the hard work that went behind
link |
basically the labeling process.
link |
And we pretend that it is unsupervised.
link |
But the problem here is that we have,
link |
like when we collected these images,
link |
the ImageNet dataset has a particular distribution
link |
of concepts, right?
link |
So these images are very curated.
link |
And what that means is these images, of course,
link |
belong to a certain set of noun concepts.
link |
And also ImageNet has this bias that all images
link |
contain an object, which is like very big
link |
and it's typically in the center.
link |
So when you're talking about a dog, it's a well framed dog,
link |
it's towards the center of the image.
link |
So a lot of the data augmentation,
link |
a lot of the sort of hidden assumptions
link |
in self supervised learning,
link |
actually really exploit this bias of ImageNet.
link |
And so, I mean, a lot of my work,
link |
a lot of work from other people always uses ImageNet
link |
sort of as the benchmark to show the success
link |
of self supervised learning.
link |
So you're implying that there's particular limitations
link |
to this kind of dataset?
link |
Yes, I mean, it's basically because our data augmentation
link |
that we designed, like all data augmentation
link |
that we designed for self supervised learning in vision
link |
are kind of overfit to ImageNet.
link |
But you're saying a little bit hard coded
link |
like the cropping.
link |
Exactly, the cropping parameters,
link |
the kind of lighting that we're using,
link |
the kind of blurring that we're using.
link |
Yeah, but you would, for more in the wild dataset,
link |
you would need to be clever or more careful
link |
in setting the range of parameters
link |
and those kinds of things.
link |
So for SEER, our main goal was twofold.
link |
One, basically to move away from ImageNet for training.
link |
So the images that we used were like uncurated images.
link |
Now there's a lot of debate
link |
whether they're actually curated or not,
link |
but I'll talk about that later.
link |
But the idea was basically,
link |
these are going to be random internet images
link |
that we're not going to filter out
link |
based on like particular categories.
link |
So we did not say that, oh, images that belong to dogs
link |
and cats should be the only images
link |
that come in this dataset, banana.
link |
And basically, other images should be thrown out.
link |
So we didn't do any of that.
link |
So these are random internet images.
link |
And of course, it also goes back to like the problem
link |
of scale that you talked about.
link |
So these were basically about a billion or so images.
link |
And for context ImageNet,
link |
the ImageNet version that we use
link |
was 1 million images earlier.
link |
So this is basically going like
link |
three orders of magnitude more.
link |
The idea was basically to see
link |
if we can train a very large convolutional model
link |
in a self supervised way on this uncurated,
link |
but really large set of images.
link |
And how well would this model do?
link |
So is self supervised learning really overfit to ImageNet
link |
or can it actually work in the wild?
link |
And it was also out of curiosity,
link |
what kind of things will this model learn?
link |
Will it actually be able to still figure out
link |
different types of objects and so on?
link |
Would there be particular kinds of tasks
link |
that would actually do better than an ImageNet-trained model?
link |
And so for SEER, one of our main findings was that
link |
we can actually train very large models
link |
in a completely self supervised way
link |
on lots of internet images
link |
without really necessarily filtering them out.
link |
Which was in itself a good thing
link |
because it's a fairly simple process, right?
link |
So you get images which are uploaded
link |
and you basically can immediately use them
link |
to train a model in an unsupervised way.
link |
You don't really need to sit and filter them out.
link |
These images can be cartoons, these can be memes,
link |
these can be actual pictures uploaded by people.
link |
And you don't really care about what these images are.
link |
You don't even care about what concepts they contain.
link |
So this was a very sort of simple setup.
link |
What image selection mechanism would you say
link |
is there like inherent in some aspect of the process?
link |
So you're kind of implying that there's almost none,
link |
but what is there would you say if you were to introspect?
link |
Right, so it's not like fully uncurated.
link |
Like one way of imagining uncurated
link |
is basically you have like cameras
link |
that can take pictures at random viewpoints.
link |
When people upload pictures to the internet,
link |
they are typically going to care about the framing of it.
link |
They're not going to upload, say,
link |
the picture of a zoomed in wall, for example.
link |
Well, when you say internet, do you mean social networks?
link |
So these are not going to be like pictures
link |
of like a zoomed in table or a zoomed in wall.
link |
So it's not really completely uncurated
link |
because people do have the like photographer's bias
link |
where they do want to keep things
link |
towards the center a little bit,
link |
or like really have like nice looking things
link |
and so on in the picture.
link |
So that's the kind of bias that typically exists
link |
in this data set and also the user base, right?
link |
You're not going to get lots of pictures
link |
from different parts of the world
link |
because there are certain parts of the world
link |
where people may not actually be uploading
link |
a lot of pictures to the internet
link |
or may not even have access to a lot of internet.
link |
So this is a giant data set and a giant neural network.
link |
I don't think we've talked about what architectures
link |
work well for SSL, for self supervised learning.
link |
For SEER and for SwAV, we were using convolutional networks,
link |
but recently in a work called DINO,
link |
we've basically started using transformers for vision.
link |
Both seem to work really well, ConvNets and transformers.
link |
And depending on what you want to do,
link |
you might choose to use a particular formulation.
link |
So for SEER, it was a ConvNet.
link |
It was particularly a RegNet model,
link |
which was also a work from Facebook.
link |
RegNets are like really good when it comes to compute
link |
versus like accuracy.
link |
So because it was a very efficient model,
link |
compute and memory wise efficient,
link |
and basically it worked really well in terms of scaling.
link |
So we used a very large RegNet model
link |
and trained it on a billion images.
link |
Can you maybe quickly comment on what RegNets are?
link |
It comes from this paper, Designing Network Design Spaces.
link |
This is a super interesting concept
link |
that emphasizes how to create efficient neural networks,
link |
large neural networks.
link |
So one of the sort of key takeaways from this paper,
link |
which the authors, like whenever you hear them
link |
present this work, they keep saying is,
link |
a lot of neural networks are characterized
link |
in terms of flops, right?
link |
Flops basically being the floating point operations.
link |
And people really love to use flops to say,
link |
this model is like really computationally heavy,
link |
or like our model is computationally cheap and so on.
link |
Now it turns out that flops are really not a good indicator
link |
of how good a particular network is,
link |
like how efficient it really is.
link |
And what a better indicator is, is the activation
link |
or the memory that is being used by this particular model.
link |
And so designing, like one of the key findings
link |
from this paper was basically that you need to design
link |
network families or neural network architectures
link |
that are actually very efficient in the memory space as well,
link |
not just in terms of pure flops.
link |
So RegNet is basically a network architecture family
link |
that came out of this paper that is particularly good
link |
at both flops and the sort of memory required for it.
link |
And of course it builds upon like earlier work,
link |
like ResNet being like the sort of more popular inspiration
link |
for it, where you have residual connections.
link |
But one of the things in this work is basically
link |
they also use like squeeze excitation blocks.
link |
So it's a lot of nice sort of technical innovation
link |
in all of this from prior work,
link |
and a lot of the ingenuity of these particular authors
link |
in how to combine these multiple building blocks.
link |
But the key constraint was optimize for both flops
link |
and memory when you're basically doing this,
link |
don't just look at flops.
link |
And that allows you to, what,
link |
sort of have very large networks through this process
link |
that are optimized for efficiency, for low memory.
link |
Also in just in terms of pure hardware,
link |
they fit very well on GPU memory.
link |
So they can be like really powerful neural network
link |
architectures with lots of parameters, lots of flops,
link |
but also because they're like efficient in terms of
link |
the amount of memory that they're using,
link |
you can actually fit a lot of these on like a,
link |
you can fit a very large model on a single GPU for example.
link |
Would you say that the choice of architecture
link |
matters more than the choice of maybe data augmentation?
link |
Is there a possibility to say what matters more?
link |
You kind of imply that you can probably go really far
link |
with just using basic ConvNets.
link |
All right, I think like data and data augmentation,
link |
the algorithm being used for the self supervised training
link |
matters a lot more than the particular kind of architecture.
link |
With different types of architecture,
link |
you will get different like properties in the resulting
link |
sort of representation.
link |
But really, I mean, the secret sauce is in the augmentation
link |
and the algorithm being used to train them.
link |
The architectures, I mean, at this point,
link |
a lot of them perform very similarly,
link |
depending on like the particular task that you care about,
link |
they have certain advantages and disadvantages.
link |
Is there something interesting to be said about what it
link |
takes with SEER to train a giant neural network?
link |
You're talking about a huge amount of data,
link |
a huge neural network.
link |
Is there something interesting to be said of how to
link |
effectively train something like that fast?
link |
I mean, so the model was like a billion parameters.
link |
And it was trained on a billion images.
link |
So like, basically the same number of parameters
link |
as the number of images, and it took a while.
link |
I don't remember the exact number, it's in the paper,
link |
but it took a while.
link |
I guess what I'm trying to get at is,
link |
when you're thinking of scaling this kind of thing,
link |
I mean, one of the exciting possibilities of self
link |
supervised learning is the several orders of magnitude
link |
scaling of everything, both the neural network
link |
and the size of the data.
link |
And so the question is,
link |
do you think there's some interesting tricks to do large
link |
scale distributed compute,
link |
or is that really outside of even deep learning?
link |
That's more about like hardware engineering.
link |
I think more and more there is like this,
link |
a lot of like systems are designed,
link |
basically taking into account
link |
the machine learning needs, right?
link |
So because whenever you're doing this kind of
link |
distributed training, there is a lot of intercommunication.
link |
So like gradients or the model parameters are being passed around.
link |
So you really want to minimize communication costs
link |
when you really want to scale these models up.
link |
You want basically to be able to do as much,
link |
like as limited amount of communication as possible.
link |
So currently like a dominant paradigm
link |
is synchronized sort of training.
link |
So essentially after every sort of gradient step,
link |
you basically have like a synchronization step
link |
between all the sort of compute chips
link |
that you're running on.
link |
I think asynchronous training was popular,
link |
but it doesn't seem to perform as well.
link |
But in general, I think that's sort of the,
link |
I guess it's outside my scope as well.
link |
But the main thing is like minimize the amount of
link |
synchronization steps that you have.
link |
That has been the key takeaway, at least in my experience.
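For a flavor of what that synchronization step does, here is a bare-bones sketch of gradient averaging. Frameworks like PyTorch DistributedDataParallel handle this for you, with gradient bucketing and communication overlapped with compute; this shows only the idea.

import torch.distributed as dist

def synchronize_gradients(model):
    # Synchronous data parallelism: after the backward pass, average
    # gradients across all workers before taking the optimizer step.
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size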
link |
The others I have no idea about, how to design the chip.
link |
Yeah, there are very few things that make Jim Keller's eyes
link |
light up as much as talking about giant computers doing
link |
that fast communication that you're talking about
link |
when they're training machine learning systems.
link |
What is VISSL, V I S S L, the PyTorch based SSL library?
link |
What are the use cases that you might have?
link |
VISSL basically was born out of a lot of us at Facebook
link |
doing self supervised learning research.
link |
So it's a common framework in which we have like a lot of
link |
self supervised learning methods implemented for vision.
link |
It's also, it has in itself like a benchmark of tasks
link |
that you can evaluate the self supervised representations on.
link |
So the use case for it is basically for anyone who's either
link |
trying to evaluate their self supervised model
link |
or train their self supervised model,
link |
or a researcher who's trying to build
link |
a new self supervised technique.
link |
So it's basically supposed to be all of these things.
link |
So as a researcher before VISSL, for example,
link |
or like when we started doing this work fairly seriously
link |
at Facebook, it was very hard for us to go and implement
link |
every self supervised learning model,
link |
test it out in a like sort of consistent manner.
link |
The experimental setup was very different
link |
across different groups.
link |
Even when someone said that they were reporting
link |
ImageNet accuracy, it could mean lots of different things.
link |
So with VISSL, we tried to really sort of standardize that
link |
as much as possible.
link |
And there was a paper like we did in 2019
link |
just about benchmarking.
link |
And so VISSL basically builds upon a lot of this kind of work
link |
that we did about like benchmarking.
link |
And then every time we try to like,
link |
we come up with a self supervised learning method,
link |
a lot of us try to push that into VISSL as well,
link |
just so that it basically is like the central piece
link |
where a lot of these methods can reside.
link |
Just out of curiosity, people may be,
link |
so certainly outside of Facebook, but just researchers,
link |
or just even people that know how to program in Python
link |
and know how to use PyTorch, what would be the use case?
link |
What would be a fun thing to play around with VISSL on?
link |
Like what's a fun thing to play around
link |
with self supervised learning on, would you say?
link |
Is there a good Hello World program?
link |
Like is it always about big size that's important to have,
link |
or is there fun little smaller case playgrounds
link |
to play around with?
link |
So we're trying to like push something towards that.
link |
I think there are a few setups out there,
link |
but nothing like super standard on the smaller scale.
link |
I mean, ImageNet in itself is actually pretty big also.
link |
So that is not something
link |
which is like feasible for a lot of people.
link |
But we are trying to like push up
link |
with like smaller sort of use cases.
link |
The thing is, at a smaller scale,
link |
a lot of the observations
link |
or a lot of the algorithms that work
link |
don't necessarily translate into the medium
link |
or the larger scale.
link |
So it's really tricky to come up
link |
with a good small scale setup
link |
where a lot of your empirical observations
link |
will really translate to the other setup.
link |
So it's been really challenging.
link |
I've been trying to do that for a little bit as well
link |
because it does take time to train stuff on ImageNet.
link |
It does take time to train on like more images,
link |
but pretty much every time I've tried to do that,
link |
it's been unsuccessful
link |
because all the observations I draw
link |
from my set of experiments on a smaller data set
link |
don't translate into ImageNet
link |
or like don't translate into another sort of data set.
link |
So it's been hard for us to figure this one out,
link |
but it's an important problem.
link |
So there's this really interesting idea
link |
of learning across multiple modalities.
link |
You have a CVPR 2021 best paper candidate
link |
titled Audio-Visual Instance Discrimination
link |
with Cross-Modal Agreement.
link |
What are the key results, insights in this paper
link |
and what can you say in general
link |
about the promise and power of multimodal learning?
link |
For this paper, it actually came as a little bit
link |
of a shock to me at how well it worked.
link |
So I can describe what the problem set up was.
link |
So it's been used in the past by lots of folks
link |
like for example, Andrew Owens from MIT,
link |
Alyosha Efros from Berkeley,
link |
Andrew Zisserman from Oxford.
link |
So a lot of these people have been
link |
sort of showing results in this.
link |
Of course, I was aware of this result,
link |
but I wasn't really sure how well it would work in practice
link |
for like other sort of downstream tasks.
link |
So the results kept getting better.
link |
And I wasn't sure if like a lot of our insights
link |
from self supervised learning would translate
link |
into this multimodal learning problem.
link |
So multimodal learning is when you have like,
link |
when you have multiple modalities.
link |
That's not even cool.
link |
Okay, so the particular modalities
link |
that we worked on in this work were audio and video.
link |
So the idea was basically, if you have a video,
link |
you have its corresponding audio track.
link |
And you want to use both of these signals,
link |
the audio signal and the video signal
link |
to learn a good representation for video
link |
and good representation for audio.
link |
Like this podcast.
link |
Like this podcast, exactly.
link |
So what we did in this work was basically train
link |
two different neural networks,
link |
one on the video signal, one on the audio signal.
link |
And what we wanted is basically the features
link |
that we get from both of these neural networks
link |
should be similar.
link |
So it should basically be able to produce
link |
the same kinds of features from the video
link |
and the same kinds of features from the audio.
link |
Now, why is this useful?
link |
Well, for a lot of these objects that we have,
link |
there is a characteristic sound, right?
link |
So trains, when they go by,
link |
they make a particular kind of sound.
link |
Boats make a particular kind of sound.
link |
People, when they're jumping around,
link |
will like shout, whatever.
link |
Bananas don't make a sound.
link |
So where you can't learn anything about bananas there.
link |
Or when humans mentioned bananas.
link |
Well, yes, when they say the word banana, then.
link |
So you can't trust basically anything
link |
that comes out of a human's mouth as a source,
link |
that source of audio is useless.
link |
The typical use case is basically like,
link |
for example, someone playing a musical instrument.
link |
So guitars have a particular kind of sound and so on.
link |
So because a lot of these things are correlated,
link |
the idea in multimodal learning
link |
is to take these two kinds of modalities,
link |
video and audio, and learn a common embedding space,
link |
a common feature space where both of these
link |
related modalities can basically be close together.
link |
And again, you use contrastive learning for this.
link |
So in contrastive learning, basically the video
link |
and the corresponding audio are positives.
link |
And you can take any other video or any other audio
link |
and that becomes a negative.
link |
And so basically that's it.
link |
It's just a simple application of contrastive learning.
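A sketch of that cross-modal contrastive setup with in-batch negatives; the temperature and the symmetric form are common choices, not necessarily the paper's exact loss.

import torch
import torch.nn.functional as F

def audio_video_loss(video_feats, audio_feats, temperature=0.07):
    # Row i of each (batch, dim) tensor comes from the SAME clip: those
    # pairs are positives; every other clip in the batch is a negative.
    v = F.normalize(video_feats, dim=-1)
    a = F.normalize(audio_feats, dim=-1)
    logits = v @ a.T / temperature          # (batch, batch) similarities
    targets = torch.arange(logits.shape[0])
    # Symmetric: match each video to its audio and each audio to its video.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2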
link |
The main sort of finding from this work for us
link |
was basically that you can actually learn
link |
very, very powerful feature representations,
link |
very, very powerful video representations.
link |
So the sort of video network
link |
that we ended up learning can actually be used
link |
for downstream, for example, recognizing human actions
link |
or recognizing different types of sounds, for example.
link |
So this was sort of the key finding.
link |
Can you give kind of an example of a human action
link |
or like just so we can build up intuition
link |
of what kind of thing?
link |
Right, so there is this data set called kinetics,
link |
for example, which has like 400 different types of human actions.
link |
So people jumping, people doing different kinds of sports
link |
or different types of swimming.
link |
So like different strokes in swimming, golf and so on.
link |
So there are like just different types of actions in there.
link |
And the point is this kind of video network
link |
that you learn in a self supervised way
link |
can be used very easily to kind of recognize
link |
these different types of actions.
link |
It can also be used for recognizing
link |
different types of objects.
link |
And what we did is we tried to visualize
link |
whether the network can figure out
link |
where the sound is coming from.
link |
So basically, give it a video
link |
and basically play, say, a video of a person just strumming a guitar,
link |
but of course, there is no audio in this.
link |
And now you give it this sound of a guitar.
link |
And you ask like basically try to visualize
link |
where the network thinks the sound is coming from.
link |
And it can kind of basically draw it out, like
link |
when you visualize it,
link |
you can see that it's basically focusing on the guitar.
link |
Yeah, that's surreal.
link |
And the same thing, for example,
link |
for certain people's voices,
link |
like famous celebrities voices,
link |
it can actually figure out where their mouth is.
link |
So it can actually distinguish different people's voices,
link |
for example, a little bit as well.
link |
Without that ever being annotated in any way.
link |
Right, so this is all what it had discovered.
link |
We never pointed out that this is a guitar
link |
and this is the kind of sound it produces.
link |
It can actually naturally figure that out
link |
because it's seen so many correlations of this sound
link |
coming with this kind of like an object
link |
that it basically learns to associate this sound
link |
with this kind of an object.
link |
Yeah, that's really fascinating, right?
link |
That's really interesting.
link |
So the idea with this kind of network
link |
is then you then fine tune it for a particular task.
link |
So this is forming like a really good knowledge base
link |
within a neural network based on which you could then
link |
then train a little bit more to accomplish a specific task.
link |
Well, so you don't need a lot of videos of humans
link |
doing actions annotated.
link |
You can just use a few of them to basically get your...
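One common recipe for that step is a linear probe: freeze the pretrained backbone and train only a small classifier head on the few labeled examples. The tiny backbone and the class count below are stand-ins.

import torch

# Stand-in for a self supervised pretrained backbone.
backbone = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, kernel_size=3),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
)
for p in backbone.parameters():
    p.requires_grad = False            # keep the learned features fixed

classifier = torch.nn.Linear(16, 400)  # e.g. 400 Kinetics action classes
# Training then only updates the classifier on the labeled videos:
logits = classifier(backbone(torch.rand(2, 3, 64, 64)))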
link |
How much insight do you draw from the fact
link |
that it can figure out where the sound is coming from?
link |
I'm trying to see, so that's kind of very,
link |
it's very CVPR beautiful, right?
link |
It's a cool little insight.
link |
I wonder how profound that is.
link |
Does it speak to the idea that multiple modalities
link |
are somehow much bigger than the sum of their parts?
link |
Or is it really, really useful to have multiple modalities?
link |
Or is it just that cool thing that there's parts
link |
of our world that can be revealed like effectively
link |
through multiple modalities,
link |
but most of it is really all about vision
link |
or about one of the modalities.
link |
I would say I'm tending a little more towards the second part.
link |
So most of it can be sort of figured out with one modality,
link |
but having an extra modality always helps you.
link |
So in this case, for example,
link |
like one thing is when you're,
link |
if you observe someone cutting something
link |
and you don't have any sort of sound there,
link |
whether it's an apple or whether it's an onion,
link |
it's very hard to figure that out.
link |
But if you hear someone cutting it,
link |
it's very easy to figure it out because apples and onions
link |
make very different kinds of characteristic sounds
link |
when they're cut.
link |
So you really figure this out based on audio,
link |
So your life will become much easier
link |
when you have access to different kinds of modalities.
link |
And the other thing is, so I like to relate it in this way,
link |
it may be like completely wrong,
link |
but the distributional hypothesis in NLP,
link |
where context basically gives kind of meaning to that word,
link |
sound kind of does that too.
link |
So if you have the same sound,
link |
so that's the same context across different videos,
link |
you're very likely to be observing the same kind of concept.
link |
So that's the kind of reason
link |
why it figures out the guitar thing, right?
link |
It observed the same sound across multiple different videos
link |
and it figures out maybe this is the common factor
link |
that's actually doing it.
link |
I wonder, I used to have this argument with my dad a bunch
link |
about creating general intelligence,
link |
whether smell is an important,
link |
like if that's important sensory information,
link |
mostly we're talking about like falling in love
link |
with an AI system and for him,
link |
smell and touch are important.
link |
And I was arguing that it's not at all.
link |
It's important, it's nice and everything,
link |
but like you can fall in love with just language really,
link |
but a voice is very powerful and vision is next
link |
and smell is not that important.
link |
Can I ask you about this process of active learning?
link |
You mentioned interactivity.
link |
Is there some value
link |
within the self supervised learning context
link |
to select parts of the data in intelligent ways
link |
such that they would most benefit the learning process?
link |
I mean, I know I'm talking to an active learning fan here,
link |
so of course I know the answer.
link |
First you were talking about bananas
link |
and now you're talking about active learning.
link |
I think Yann LeCun told me that active learning
link |
is not that interesting.
link |
I think back then I didn't want to argue with him too much,
link |
but when we talk again,
link |
we're gonna spend three hours arguing about active learning.
link |
My sense was you can go extremely far with active learning,
link |
perhaps farther than anything else.
link |
Like to me, there's this kind of intuition
link |
that similar to data augmentation,
link |
you can get a lot from the data,
link |
from intelligent optimized usage of the data.
link |
I'm trying to speak generally in such a way
link |
that includes data augmentation
link |
and active learning,
link |
that there's something about maybe interactive exploration
link |
of the data that at least is part
link |
of the solution to intelligence, like an important part.
link |
I don't know what your thoughts are
link |
on active learning in general.
link |
I actually really like active learning.
link |
So back in the day we did this largely ignored CVPR paper
link |
called Learning by Asking Questions.
link |
So the idea was basically you would train an agent
link |
that would ask a question about the image.
link |
It would get an answer
link |
and basically then it would update itself.
link |
It would see the next image.
link |
It would decide what's the next hardest question
link |
that I can ask to learn the most.
link |
And the idea was basically because it was being smart
link |
about the kinds of questions it was asking,
link |
it would learn in fewer samples.
link |
It would be more efficient at using data.
link |
And we did find to some extent
link |
that it was actually better than randomly asking questions.
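(The actual paper's setup, with questions and answers, is richer than this, but the core active learning loop can be sketched like so: score the unlabeled pool by the model's uncertainty, ask the annotator, i.e. the oracle, about the most uncertain items, and retrain. The data and model here are illustrative stand-ins.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(1000, 20))                     # unlabeled pool (toy)
y_pool = (X_pool[:, 0] + X_pool[:, 1] > 0).astype(int)   # hidden oracle labels

# Small seed set containing both classes.
labeled = list(np.where(y_pool == 0)[0][:5]) + list(np.where(y_pool == 1)[0][:5])
model = LogisticRegression()

for round_ in range(5):
    model.fit(X_pool[labeled], y_pool[labeled])
    probs = model.predict_proba(X_pool)[:, 1]
    uncertainty = -np.abs(probs - 0.5)                   # near 0.5 = least sure
    # Query the oracle on the most uncertain unlabeled items.
    ranked = np.argsort(uncertainty)[::-1]
    new = [i for i in ranked if i not in set(labeled)][:20]
    labeled.extend(new)
```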
link |
The kind of weird thing about active learning
link |
is it's also a chicken and egg problem
link |
because when you look at an image,
link |
to ask a good question about the image,
link |
you need to understand something about the image.
link |
You can't ask a completely arbitrary question.
link |
It may not even apply to that particular image.
link |
So there is some amount of understanding or knowledge
link |
that basically keeps getting built
link |
when you're doing active learning.
link |
So I think active learning by itself is really good.
link |
And the main thing we need to figure out is basically
link |
how do we come up with a technique
link |
to first model what the model knows
link |
and also model what the model does not know.
link |
I think that's the sort of beauty of it.
link |
Because when you know that there are certain things
link |
that you don't know anything about,
link |
asking a question about those concepts
link |
is actually going to bring you the most value.
link |
And I think that's the sort of key challenge.
link |
Now, self supervised learning by itself,
link |
like selecting data for it and so on,
link |
that's actually really useful.
link |
But I think that's a very narrow view
link |
of looking at active learning.
link |
If you look at it more broadly,
link |
it is basically that the model has some knowledge
link |
and it is weak about certain things.
link |
So it needs to ask questions
link |
either to discover new concepts
link |
or to basically increase its knowledge
link |
about these known concepts.
link |
So at that level, it's a very powerful technique.
link |
I actually do think it's going to be really useful.
link |
Even in like simple things such as like data labeling,
link |
it's super useful.
link |
So here is like one simple way
link |
that you can use active learning.
link |
For example, you have your self supervised model,
link |
which is very good at predicting similarities
link |
and dissimilarities between things.
link |
And so if you label a picture as basically say a banana,
link |
now you know that all the images
link |
that are very similar to this image
link |
are also likely to contain bananas.
link |
So probably when you want to understand
link |
what else is a banana,
link |
you're not going to use these other images.
link |
You're actually going to use an image
link |
that is not completely dissimilar,
link |
but somewhere in between,
link |
which is not super similar to this image,
link |
but not super dissimilar either.
link |
And that's going to tell you a lot more
link |
about what this concept of a banana is.
link |
So that's kind of a heuristic.
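(A rough sketch of that heuristic, assuming you already have a self supervised embedding for every image: given one labeled banana, skip the near duplicates and ask for labels at intermediate similarity, where the concept's boundary is least clear. The thresholds and features below are made up for illustration.)

```python
import numpy as np

def select_for_labeling(embeddings, labeled_idx, low=0.4, high=0.7, k=5):
    """Pick images at intermediate cosine similarity to a labeled example."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e @ e[labeled_idx]              # cosine similarity to the "banana"
    mask = (sims > low) & (sims < high)    # not near-duplicate, not unrelated
    candidates = np.where(mask)[0]
    return candidates[np.argsort(-sims[candidates])][:k]

embeddings = np.random.randn(100, 64)      # toy self-supervised features
to_label = select_for_labeling(embeddings, labeled_idx=0)
```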
link |
I wonder if it's possible to also learn ways
link |
to discover the most likely,
link |
the most beneficial image.
link |
So not just looking at a thing
link |
that's somewhat similar to a banana,
link |
but not exactly similar,
link |
but have some kind of more complicated learning system,
link |
like learned discovering mechanism
link |
that tells you what image to look for.
link |
Like how, yeah, like actually in a self supervised way,
link |
learning a function that says,
link |
is this image going to be very useful to me
link |
given what I currently know?
link |
I think there's a lot of synergy there.
link |
It's just, I think, yeah, it's going to be explored.
link |
I think very much related to that.
link |
I kind of think of what Tesla Autopilot is doing
link |
currently as kind of active learning.
link |
There's something that Andrej Karpathy and his team
link |
are calling a data engine.
link |
So you're basically deploying a bunch of instantiations
link |
of a neural network into the wild,
link |
and they're collecting a bunch of edge cases
link |
that are then sent back for annotation,
link |
and edge cases are defined as near failures
link |
or some weirdness on a particular task
link |
that's then sent back.
link |
It's those not exactly a banana,
link |
but almost a banana cases that get sent back for annotation.
link |
And then there's this loop that keeps going
link |
and you keep retraining and retraining.
link |
And the active learning step there,
link |
or whatever you want to call it,
link |
is the cars themselves that are sending you back the data.
link |
Like, what the hell happened here?
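(Written as a sketch, the loop being described looks roughly like this; every function name here is a stand-in for real infrastructure, not Tesla's actual API.)

```python
# Stubs standing in for real infrastructure; only the loop shape matters.
def deploy(model, fleet): pass
def collect_edge_cases(fleet): return ["hard_example"]
def annotate(cases): return [(c, "label") for c in cases]
def retrain(model, labeled): return model

def data_engine_loop(model, fleet, num_iterations=10):
    """Hypothetical sketch of the deploy / collect / retrain cycle."""
    for _ in range(num_iterations):
        deploy(model, fleet)                    # push the current model to cars
        edge_cases = collect_edge_cases(fleet)  # near-failures flagged in the wild
        labeled = annotate(edge_cases)          # humans label the hard cases
        model = retrain(model, labeled)         # fold them back into training
    return model
```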
link |
What are your thoughts about that sort of deployment
link |
of neural networks in the wild?
link |
So the first part of the question is just your general thoughts.
link |
And maybe if you want to comment,
link |
are there applications for autonomous driving,
link |
like computer vision based autonomous driving,
link |
applications of self supervised learning
link |
in the context of computer vision based autonomous driving?
link |
I think for self supervised learning
link |
to be used in autonomous driving,
link |
there are lots of opportunities.
link |
I mean, just like pure consistency in predictions
link |
is one way, right?
link |
So because you have this nice sequence of data
link |
that is coming in, a video stream of it,
link |
associated of course with the actions
link |
that say the car took,
link |
you can form a very nice predictive model
link |
of what's happening.
link |
So for example,
link |
one possible way they're figuring out
link |
what data to get labeled is basically
link |
through prediction uncertainty, right?
link |
So you predict that the car was going to turn right.
link |
So this was the action that was going to happen,
link |
say in the shadow mode.
link |
And now the driver turned left.
link |
And this is a really big surprise.
link |
So basically by forming these good predictive models,
link |
you are, I mean, these are kind of self supervised models.
link |
Prediction models are basically being trained
link |
just by looking at what's going to happen next
link |
and asking them to predict what's going to happen next.
link |
So I would say this is really like one use
link |
of self supervised learning.
link |
It's a predictive model
link |
and you're learning a predictive model
link |
basically just by looking at what data you have.
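(As a sketch of that prediction uncertainty trigger, with made up numbers: the model predicts a distribution over next actions in shadow mode, the driver's actual action arrives, and a large surprise flags the clip for annotation.)

```python
import numpy as np

def surprise(predicted_probs, actual_action):
    """Negative log-probability the model assigned to what the driver did."""
    return -np.log(predicted_probs[actual_action] + 1e-9)

ACTIONS = ["straight", "left", "right"]
predicted = np.array([0.10, 0.05, 0.85])   # model was confident about "right"
actual = ACTIONS.index("left")             # the driver turned left instead

if surprise(predicted, actual) > 2.0:      # threshold chosen for illustration
    print("big surprise: send this clip back for annotation")
```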
link |
Is there something about that active learning context
link |
that you find insights from?
link |
Like that kind of deployment of the system,
link |
seeing cases where it doesn't perform as you expected
link |
and then retraining the system based on that?
link |
I think that, I mean, that really resonates with me.
link |
It's super smart to do it that way.
link |
Because I mean, the thing is with any kind
link |
of like practical system, like autonomous driving,
link |
there are those edge cases that are the things
link |
that are actually the problem, right?
link |
I mean, with highway driving or freeway driving,
link |
there has been a lot of success in that particular part
link |
of autonomous driving for a long time.
link |
I would say like since the eighties or something.
link |
Now the point is all these failure cases
link |
are the sort of reason why autonomous driving
link |
hasn't become like super, super mainstream and available
link |
like in every possible car right now.
link |
And so basically by really scaling this problem out
link |
by really trying to get all of these edge cases out
link |
as quickly as possible,
link |
and then just like using those to improve your model,
link |
that's super smart.
link |
And prediction uncertainty to do that
link |
is like one really nice way of doing it.
link |
Let me put you on the spot.
link |
So we mentioned offline Jitendra,
link |
he thinks that the Tesla computer vision approach
link |
or really any approach for autonomous driving
link |
is still quite far away.
link |
How many years away,
link |
if you have to bet all your money on it,
link |
are we to solving autonomous driving
link |
with this kind of computer vision only
link |
machine learning based approach?
link |
Okay, so what does solving autonomous driving mean?
link |
Does it mean solving it in the US?
link |
Does it mean solving it in India?
link |
Because I can tell you
link |
that there are very different types of driving happening.
link |
Not India, not Russia.
link |
In the United States,
link |
what solving means is when the car says it has control,
link |
it is fully liable.
link |
You can go to sleep, it's driving by itself.
link |
So this is highway and city driving,
link |
but not everywhere, but mostly everywhere.
link |
And it's, let's say significantly better,
link |
like say five times fewer accidents than humans.
link |
Sufficiently safer such that the public feels
link |
like that transition is enticing and beneficial,
link |
both for our safety and our finances,
link |
and all those kinds of things.
link |
Okay, so first disclaimer,
link |
I'm not an expert in autonomous driving.
link |
So let me put it out there.
link |
I would say like at least five to 10 years.
link |
This would be my guess from now.
link |
Yeah, I'm actually very impressed.
link |
Like when I sat in a friend's Tesla recently
link |
and of course, like looking on that screen,
link |
it basically shows all the detections and everything
link |
the car is doing as you're driving by,
link |
and that's super distracting for me as a person
link |
because all I keep looking at is like the bounding boxes
link |
on the cars it's tracking, and it's really impressive.
link |
Like especially when it's raining and it's able to do that,
link |
that was the most impressive part for me.
link |
It's actually able to get through rain and do that.
link |
And one of the reasons why like a lot of us believed
link |
and I would put myself in that category
link |
is that LIDAR based technology for autonomous driving
link |
was the key driver, right?
link |
So Waymo was using it for the longest time.
link |
And Tesla then decided to go this completely other route
link |
that we are not going to even use LIDAR.
link |
So their initial system I think was camera and radar based
link |
and now they're actually moving
link |
to a completely like vision based system.
link |
And so that was just like, it sounded completely crazy.
link |
Like LIDAR is very useful in cases
link |
where you have low visibility.
link |
Of course it comes with its own set of complications.
link |
But now to see that happen on a live Tesla,
link |
that basically just proves everyone wrong
link |
I would say in a way.
link |
And that's just working really well.
link |
I think there were also like a lot of advancements
link |
in camera technology.
link |
I know at CMU when I was there,
link |
there was a particular kind of camera
link |
that had been developed that was really good
link |
at basically low visibility settings.
link |
So with lots of snow and lots of rain,
link |
it could actually still have very reasonable visibility.
link |
And I think there are lots of these kinds of innovations
link |
that will happen on the sensor side itself
link |
which is actually going to make this much easier.
link |
And so maybe that's actually why I'm more optimistic
link |
about vision based self, like autonomous driving.
link |
I was going to call it self supervised driving, but.
link |
Vision based autonomous driving.
link |
That's the reason I'm quite optimistic about it
link |
because I think there are going to be lots
link |
of these advances on the sensor side itself.
link |
So acquiring this data,
link |
we're actually going to get much better at it.
link |
And then of course, once we're able to scale out
link |
and get all of these edge cases in
link |
as Andrej described,
link |
I think that's going to take us very far.
link |
Yeah, so it's funny.
link |
I'm very much with you on the five to 10 years
link |
but I'm not sure how you made it sound,
link |
because for some people
link |
that might seem like really far away.
link |
And then for other people, it might seem like very close.
link |
There's a lot of fundamental questions
link |
about how much game theory is in this whole thing.
link |
So like, how much is this simply a collision avoidance
link |
problem and how much of it is you still interacting
link |
with other humans in the scene
link |
and you're trying to create an experience
link |
that's compelling.
link |
So you want to get from point A to point B quickly
link |
you want to navigate the scene in a safe way
link |
but you also want to show some level of aggression
link |
because well, certainly this is why you're screwed in India
link |
because you have to show aggression.
link |
Or Jersey or New Jersey.
link |
So like, or New York or basically any major city
link |
but I think it's probably Elon
link |
that I've talked with the most about this,
link |
and it's a surprise the level to which
link |
they're not considering human beings
link |
as a huge problem in this, as a source of problems.
link |
Like, to them, driving is fundamentally a robot
link |
versus environment problem,
link |
where you can just consider humans
link |
as not part of the problem.
link |
I used to think humans almost certainly
link |
have to be modeled really well.
link |
Pedestrians and cyclists and humans inside other cars
link |
you have to have like mental models for them.
link |
You cannot just see it as objects
link |
but more and more it's like
link |
the same kind of intuition breaking thing
link |
that self supervised learning does, which is,
link |
well, maybe through the learning
link |
you'll get all the human information you need.
link |
Like maybe you'll get it just with enough data.
link |
You don't need to have explicit good models
link |
of human behavior.
link |
Maybe you get it through the data.
link |
So, I mean, my skepticism also comes from knowing
link |
a lot of automotive companies
link |
and how difficult it is to be innovative.
link |
I was skeptical that they would be able at scale
link |
to convert the driving scene across the world
link |
into digital form such that you can create
link |
this data engine at scale.
link |
And the fact that Tesla is at least getting there
link |
or is already there makes me think that
link |
it's now starting to be coupled
link |
to this self supervised learning vision
link |
which is like if that's gonna work
link |
if through purely this process you can get really far
link |
then maybe you can solve driving that way.
link |
I tend to believe we don't give enough credit
link |
to how amazing humans are both at driving
link |
and at supervising autonomous systems.
link |
And also, this is something I wish we were doing.
link |
I wish there was much more driver sensing inside Teslas
link |
and much deeper consideration of human factors
link |
like understanding psychology and drowsiness
link |
and all those kinds of things
link |
when the car does more and more of the work.
link |
How to keep utilizing the little human supervision
link |
that is needed to keep this whole thing safe.
link |
I mean it's a fascinating dance of human robot interaction.
link |
To me autonomous driving for a long time
link |
is a human robot interaction problem.
link |
It is not a robotics problem or computer vision problem.
link |
Like you have to have a human in the loop.
link |
Which is why I think it's 10 years plus.
link |
But I do think there'll be a bunch of cities and contexts
link |
where geo restricted it will work really, really damn well.
link |
So I think for me that gets to five if I'm being optimistic,
link |
and it's going to be five for a lot of cases
link |
and 10 plus, yeah, I agree with you.
link |
10 plus basically if we want to cover most of the,
link |
say, contiguous United States or something.
link |
So my optimistic is five and pessimistic is 30.
link |
I have a long tail on this one.
link |
I haven't watched enough driving videos.
link |
I've watched enough pedestrians to think
link |
that there's a part of me still, not a small part,
link |
a pretty big part of me that thinks
link |
we will have to build AGI to solve driving.
link |
Like there's something to me,
link |
like because humans are part of the picture,
link |
deeply part of the picture,
link |
and also human society is part of the picture
link |
in that human life is at stake.
link |
Anytime a robot kills a human,
link |
it's not clear to me that that's not a problem
link |
that machine learning will also have to solve.
link |
Like it has to, you have to integrate that
link |
into the whole thing.
link |
Just like Facebook or social networks,
link |
one thing is to say how to make
link |
a really good recommender system.
link |
And then the other thing is to integrate
link |
into that recommender system,
link |
all the journalists that will write articles
link |
about that recommender system.
link |
Like you have to consider the society
link |
within which the AI system operates.
link |
And politicians too,
link |
with the regulatory stuff for autonomous driving.
link |
It's kind of fascinating that the more successful
link |
your AI system becomes,
link |
the more it gets integrated in society
link |
and the more the politicians
link |
and the public and the clickbait journalists
link |
and all the different fascinating forces
link |
of our society start acting on it.
link |
And then it's no longer how good you are
link |
at doing the initial task.
link |
It's also how good you are at navigating human nature,
link |
which is a fascinating space.
link |
What do you think are the limits of deep learning?
link |
If you allow me, we'll zoom out a little bit
link |
into the big question of artificial intelligence.
link |
You said dark matter of intelligence is self supervised
link |
learning, but there could be more.
link |
What do you think the limits of self supervised learning
link |
and just learning in general, deep learning are?
link |
I think for deep learning in particular,
link |
because self supervised learning is, I would say,
link |
a little bit more vague right now.
link |
For something that's so vague,
link |
it's hard to predict what its limits are going to be.
link |
But like I said, I think anywhere you want to interact
link |
with humans, self supervised learning kind of hits a boundary
link |
very quickly because you need to have an interface
link |
to be able to communicate with the human.
link |
So really, if you have just vacuous concepts
link |
or nebulous concepts discovered
link |
by a network, it's very hard to communicate those
link |
to a human without inserting some kind
link |
of human knowledge or some kind of human bias there.
link |
In general, I think for deep learning,
link |
the biggest challenge is just like data efficiency.
link |
Even with self supervised learning,
link |
even with anything else, if you just see
link |
a single concept once, like one image of like,
link |
I don't know, whatever you want to call it,
link |
like any concept, it's really hard for these methods
link |
to generalize by looking at just one or two samples
link |
of things and that has been a real challenge.
link |
I think that's actually why like these edge cases,
link |
for example, for Tesla are actually that important.
link |
Because if you see just one instance of the car failing
link |
and if you just annotate that and you get that
link |
into your data set, you have a very limited guarantee
link |
that it's not going to happen again,
link |
or that you're actually going to be able to recognize
link |
this kind of instance in a very different scenario.
link |
So like when it was snowing, so you got that thing labeled
link |
when it was snowing, but now when it's raining,
link |
you're actually not able to get it.
link |
Or you basically have the same scenario
link |
in a different part of the world.
link |
So the lighting was different or so on.
link |
So it's just really hard for these models,
link |
like deep learning especially to do that.
link |
What's your intuition?
link |
How do we solve the handwritten digit recognition problem
link |
when we only have one example for each number?
link |
It feels like humans are using something like transfer learning.
link |
I think we are good at transferring knowledge a little bit.
link |
We are just better at like for a lot of these problems
link |
where we are generalizing from a single sample
link |
or recognizing from a single sample,
link |
we are using a lot of our own domain knowledge
link |
and a lot of our like inductive bias
link |
into that one sample to generalize it.
link |
So I've never seen you write the number nine, for example.
link |
And if you were to write it, I would still get it.
link |
And if you were to write a different kind of alphabet
link |
and like write it in two different ways,
link |
I would still probably be able to figure out
link |
that these are the same two characters.
link |
It's just that I have been very used
link |
to seeing handwritten digits in my life.
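(One common way to get this kind of single sample generalization out of a machine is to lean on a pretrained embedding and classify by nearest neighbor, so the prior knowledge lives in the representation rather than in per class training. A toy sketch, with a random projection standing in for a real pretrained encoder:)

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(784, 64))     # stand-in for a pretrained encoder

def embed(image):
    return image.flatten() @ W     # the "domain knowledge" would live here

# One labeled example per digit: the one-shot support set.
support_images = rng.normal(size=(10, 28, 28))
support_embeddings = np.stack([embed(im) for im in support_images])

def classify(image):
    e = embed(image)
    dists = np.linalg.norm(support_embeddings - e, axis=1)
    return int(np.argmin(dists))   # nearest labeled example wins

print(classify(rng.normal(size=(28, 28))))
```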
link |
The other sort of problem with any deep learning system
link |
or any kind of machine learning system is like,
link |
its guarantees, right?
link |
There are no guarantees for it.
link |
Now you can argue that humans also don't have any guarantees.
link |
Like there is no guarantee that I can recognize a cat
link |
in every scenario.
link |
I'm sure there are going to be lots of cats
link |
that I don't recognize, lots of scenarios
link |
in which I don't recognize cats in general.
link |
But I think from just a sort of application perspective,
link |
you do need guarantees, right?
link |
We call these things algorithms.
link |
Now algorithms, like traditional CS algorithms,
link |
come with guarantees. Sorting is a guarantee.
link |
If you were to call sort on a particular array of numbers,
link |
you are guaranteed that it's going to be sorted.
link |
Otherwise it's a bug.
link |
Now for machine learning,
link |
it's very hard to characterize this.
link |
We know for a fact that a cat recognition model
link |
is not going to recognize
link |
every cat in the world in every circumstance.
link |
I think most people would agree with that statement,
link |
but we are still okay with it.
link |
We still don't call this a bug.
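(That contrast is easy to state in code: the sort contract can be checked exactly on every run, while the best we can say about a classifier is an empirical estimate on a sample. A minimal illustration with a toy model:)

```python
import random

# An algorithm: the guarantee is checkable on every input.
xs = [random.random() for _ in range(1000)]
ys = sorted(xs)
assert all(ys[i] <= ys[i + 1] for i in range(len(ys) - 1))  # holds by contract

# A model: no per-input guarantee, only a statistical estimate.
def toy_classifier(x):
    return x > 0.5                         # stand-in for a learned model

inputs = [random.random() for _ in range(1000)]
# Toy ground truth with 5% label noise, so the model cannot be perfect.
labels = [(x > 0.5) != (random.random() < 0.05) for x in inputs]
accuracy = sum(toy_classifier(x) == y for x, y in zip(inputs, labels)) / 1000
# accuracy ~ 0.95 is a statement about this sample, not a guarantee.
```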
link |
Whereas in traditional computer science
link |
or traditional science,
link |
like if you have this kind of failure case existing,
link |
then you think of it as like something is wrong.
link |
I think there is this sort of notion
link |
of nebulous correctness for machine learning.
link |
And that's something we just need to be very comfortable with.
link |
And for deep learning,
link |
or like for a lot of these machine learning algorithms,
link |
it's not clear how we characterize
link |
this notion of correctness.
link |
I think that's a limitation in our understanding,
link |
or at least a limitation in our phrasing of this.
link |
And if we were to come up with better ways
link |
to understand this limitation,
link |
then it would actually help us a lot.
link |
Do you think there's a distinction
link |
between the concept of learning
link |
and the concept of reasoning?
link |
Do you think it's possible for neural networks to reason?
link |
So I think of it slightly differently.
link |
So for me, learning is whenever
link |
I can like make a snap judgment.
link |
So if you show me a picture of a dog,
link |
I can immediately say it's a dog.
link |
But if you give me like a puzzle,
link |
like a Rube Goldberg machine
link |
of things that are going to happen,
link |
then I have to reason, because
link |
it's a very complicated setup.
link |
I've never seen that particular setup.
link |
And I really need to draw and like imagine in my head
link |
what's going to happen to figure it out.
link |
So I think, yes, neural networks are really good
link |
at recognition, but they're not very good at reasoning.
link |
Because they have seen something before
link |
or seen something similar before, they're very good
link |
at making those sort of snap judgments.
link |
But if you were to give them a very complicated thing
link |
that they've not seen before,
link |
they have very limited ability right now
link |
to compose different things.
link |
Like, oh, I've seen this particular part before.
link |
I've seen this other particular part before.
link |
And now probably like this is how
link |
they're going to work in tandem.
link |
It's very hard for them to come up
link |
with these kinds of things.
link |
Well, there's a certain aspect to reasoning
link |
that you can maybe convert into the process of programming.
link |
And so there's the whole field of program synthesis
link |
and people have been applying machine learning
link |
to the problem of program synthesis.
link |
And the question is, the step of composition,
link |
why can't that be learned?
link |
You know, this step of building things up,
link |
like little intuitions, concepts on top of each other,
link |
can that be learnable?
link |
What's your intuition there?
link |
Or like, I guess similar set of techniques,
link |
do you think that will be applicable?
link |
So I think it is, of course, it is learnable
link |
because we are prime examples of machines,
link |
or individuals, that have learned this, right?
link |
Like humans have learned this.
link |
So it is, of course, a technique
link |
that is learnable.
link |
I think where we are kind of hitting a wall
link |
basically with like current machine learning
link |
is the fact that when the network learns
link |
all of this information,
link |
we basically are not able to figure out
link |
how well it's going to generalize to an unseen thing.
link |
And we have, a priori, no way of characterizing that.
link |
And I think that's basically telling us a lot about,
link |
like a lot about the fact that we really don't know
link |
what this model has learned and how well it's basically,
link |
because we don't know how well it's going to transfer.
link |
There's also a sense in which it feels like
link |
we humans may not be aware of how much like background,
link |
how good our background model is,
link |
how much knowledge we just have slowly building
link |
on top of each other.
link |
It feels like neural networks
link |
are constantly throwing stuff out.
link |
Like you'll do some incredible thing
link |
where you're learning a particular task in computer vision,
link |
you celebrate your state of the art successes
link |
and you throw that out.
link |
Like, it feels like
link |
you're never using stuff you've learned
link |
for your future successes in other domains.
link |
And humans are obviously doing that exceptionally well,
link |
still throwing stuff away in their mind,
link |
but keeping certain kernels of truth.
link |
Right, so I think we're like,
link |
continual learning is sort of the paradigm
link |
for this in machine learning.
link |
And I don't think it's a very well explored paradigm.
link |
We have like things in deep learning, for example,
link |
catastrophic forgetting is like one of the standard things.
link |
The thing basically being that if you teach a network
link |
like to recognize dogs,
link |
and now you teach that same network to recognize cats,
link |
it basically forgets how to recognize dogs.
link |
So it forgets very quickly.
link |
Whereas a human,
link |
if you were to teach someone to recognize dogs
link |
and then to recognize cats,
link |
they don't forget immediately how to recognize these dogs.
link |
I think that's basically sort of what you're trying to get at.
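(One standard partial fix, as a sketch: rehearsal. Keep a small buffer of old task examples and mix them into every batch of the new task, so the dogs are not simply overwritten by the cats. The model and data below are toys.)

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 2)                  # toy classifier
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Task A ("dogs"): keep a small rehearsal buffer of its examples.
xa, ya = torch.randn(100, 16), torch.randint(0, 2, (100,))
buffer_x, buffer_y = xa[:10], ya[:10]

# Task B ("cats"): mix the buffered old examples into every batch,
# so gradients keep pointing at a solution that works for both tasks.
xb, yb = torch.randn(100, 16), torch.randint(0, 2, (100,))
for step in range(50):
    x = torch.cat([xb, buffer_x])
    y = torch.cat([yb, buffer_y])
    loss = loss_fn(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```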
link |
Yeah, I just, I wonder if like
link |
the long term memory mechanisms
link |
or the mechanisms that store not just memories,
link |
but concepts that allow you to reason
link |
and compose concepts,
link |
if those things will look very different
link |
than neural networks,
link |
or if you can do that within a single neural network
link |
with some particular sort of architecture quirks,
link |
that seems to be a really open problem.
link |
And of course I go up and down on that
link |
because there's something so compelling to the symbolic AI
link |
or to the ideas of logic based sort of expert systems.
link |
You have human interpretable facts
link |
that build on top of each other.
link |
It's really annoying like with self supervised learning
link |
that the AI is not very explainable.
link |
Like you can't like understand
link |
all the beautiful things it has learned.
link |
You can't ask it like questions,
link |
but then again, maybe that's a stupid thing
link |
for us humans to want.
link |
Right, I think whenever we try to like understand it,
link |
we are putting our own subjective human bias into it.
link |
And I think that's the sort of problem
link |
with self supervised learning,
link |
the goal is that it should learn naturally from the data.
link |
So now if you try to understand it,
link |
you are using your own preconceived notions
link |
of what this model has learned.
link |
And that's the problem.
link |
High level question.
link |
What do you think it takes to build a system
link |
with superhuman, maybe let's say human level
link |
or superhuman level general intelligence?
link |
We've already kind of started talking about this,
link |
but what's your intuition?
link |
Like, does this thing have to have a body?
link |
Does it have to interact richly with the world?
link |
Does it have to have some more human elements
link |
like self awareness?
link |
I think emotion is something which is,
link |
it's not really addressed typically
link |
in standard machine learning.
link |
It's not something we think about,
link |
like there is NLP, there is vision,
link |
there is no like emotion.
link |
Emotion is never a part of all of this.
link |
And that just seems a little bit weird to me.
link |
I think the reason is basically that there is surprise,
link |
and basically one of the reasons
link |
emotions arise is the gap between what happens
link |
and what you expect to happen, right?
link |
There is like a mismatch between these things.
link |
And so that gives rise to like,
link |
I can either be surprised or I can be saddened
link |
or I can be happy and all of this.
link |
And so this basically indicates
link |
that I already have a predictive model in my head
link |
and something that I predicted or something
link |
that I thought was likely to happen.
link |
And then there was something that I observed
link |
that happened that there was a disconnect
link |
between these two things.
link |
And that basically is like maybe one of the reasons
link |
like you have a lot of emotions.
link |
Yeah, I think, so I talk to people a lot about this,
link |
like Lisa Feldman Barrett.
link |
I think that's an interesting concept of emotion
link |
but I have a sense that emotion primarily
link |
in the way we think about it,
link |
which is the display of emotion
link |
is a communication mechanism between humans.
link |
So it's a part of basically human to human interaction,
link |
an important part, but just the part.
link |
So it's like, I would throw it into the full mix of communication.
link |
And to me, communication can be done with objects
link |
that don't look at all like humans.
link |
I've seen our ability to anthropomorphize,
link |
our ability to connect with things that look like a Roomba,
link |
our ability to connect.
link |
First of all, let's talk about other biological systems
link |
like dogs, our ability to love things
link |
that are very different than humans.
link |
But they do display emotion, right?
link |
I mean, dogs do display emotion.
link |
So they don't have to be anthropomorphic
link |
for them to display the kinds of emotions we recognize.
link |
So, I mean, but then the word emotion starts to lose meaning.
link |
So then we have to be, I guess specific, but yeah.
link |
So have rich flavorful communication.
link |
Communication, yeah.
link |
Yeah, so like, yes, it's full of emotion.
link |
It's full of wit and humor and moods
link |
and all those kinds of things, yeah.
link |
So you're talking about like flavor.
link |
Okay, let's call it that.
link |
So there's content and then there is flavor
link |
and I'm talking about the flavor.
link |
Do you think it needs to have a body?
link |
Do you think like to interact with the physical world?
link |
Do you think you can understand the physical world
link |
without being able to directly interact with it?
link |
I don't think so, yeah.
link |
I think at some point we will need to bite the bullet
link |
and actually interact with the physical world,
link |
as much as I like working on like passive computer vision
link |
where I just sit in my armchair
link |
and look at videos and learn.
link |
I do think that we will need to have some kind of embodiment
link |
or some kind of interaction
link |
to figure out things about the world.
link |
What about consciousness?
link |
Do you think, how often do you think about consciousness
link |
when you think about your work?
link |
You could think of it
link |
as the simpler thing of self awareness,
link |
of being aware that you are a perceiving,
link |
sensing, acting thing in this world.
link |
Or you can think about the bigger version of that,
link |
which is consciousness,
link |
which is having it feel like something to be that entity,
link |
the subjective experience of being in this world.
link |
So I think of self awareness a little bit more
link |
than like the broader goal of it,
link |
because I think self awareness is pretty critical
link |
for like any kind of AGI
link |
or whatever you want to call it that we build,
link |
because it needs to contextualize what it is
link |
and what role it's playing
link |
with respect to all the other things that exist around it.
link |
I think that requires self awareness.
link |
It needs to understand that it's an autonomous car, right?
link |
And what does that mean?
link |
What are its limitations?
link |
What are the things that it is supposed to do and so on?
link |
What is its role in some way?
link |
Or, I mean, these are the kinds of things
link |
that we kind of expect from it, I would say.