Ishan Misra: Self-Supervised Deep Learning in Computer Vision | Lex Fridman Podcast #206

The following is a conversation with Ishan Misra, research scientist at Facebook AI Research, who works on self-supervised machine learning in the domain of computer vision, or in other words, making AI systems understand the visual world with minimal help from us humans. Transformers and self-attention have been successfully used by OpenAI's GPT-3 and other language models to do self-supervised learning in the domain of language. Ishan, together with Yann LeCun and others, is trying to achieve the same success in the domain of images and video. The goal is to leave a robot watching YouTube videos all night and, in the morning, come back to a much smarter robot.

I read the blog post "Self-Supervised Learning: The Dark Matter of Intelligence" by Ishan and Yann LeCun, and then listened to Ishan's appearance on the excellent Machine Learning Street Talk podcast, and I knew I had to talk to him. By the way, if you're interested in machine learning and AI, I cannot recommend the ML Street Talk podcast highly enough. Those guys are great.

Quick mention of our sponsors: Onnit, The Information, Grammarly, and Athletic Greens. Check them out in the description to support this podcast.

As a side note, let me say that, for those of you who may have been listening for quite a while, this podcast used to be called the Artificial Intelligence Podcast, because my life passion has always been, and will always be, artificial intelligence, both narrowly and broadly defined. My goal with this podcast is still to have many conversations with world-class researchers in AI, math, physics, biology, and all the other sciences. But I also want to talk to historians, musicians, athletes, and, of course, occasionally comedians. In fact, I'm trying out doing this podcast three times a week now, to give me more freedom with guest selection and maybe get a chance to have a bit more fun.

Speaking of fun, in this conversation I challenge the listener to count the number of times the word banana is mentioned. Ishan and I used the word banana as the canonical example at the core of the hard problem of computer vision, and maybe the hard problem of consciousness. This is the Lex Fridman Podcast, and here is my conversation with Ishan Misra.
What is self-supervised learning? And maybe even give the basics of what supervised and semi-supervised learning are. And maybe, why is self-supervised learning a better term than unsupervised learning?
Let's start with supervised learning. So typically, for machine learning systems, the way they're trained is you get a bunch of humans, and the humans point out particular concepts. So in the case of images, you want the humans to come and tell you what is present in the image, draw boxes around things, draw masks of things, pixels which are of particular categories or not. For NLP, again, there are lots of these particular tasks, say about sentiment analysis, about entailment, and so on.

So typically, for supervised learning, we get a big corpus of such annotated or labeled data, and then we feed that to a system, and the system is really trying to mimic: it's taking this input of the data and then trying to mimic the output. So it looks at an image, and the human has tagged that this image contains a banana, and now the system is basically trying to mimic that. So that's its learning signal. And so for supervised learning, we try to gather lots of such data, and we train these machine learning models to imitate the input-output mapping. And the hope is that by doing so, on unseen or new kinds of data, this model can automatically learn to predict these concepts. So this is the standard sort of supervised setting.
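A minimal sketch of that supervised setup, a model imitating human-provided tags. The features, labels, and nearest-centroid "model" here are all hypothetical toys, not anything from the conversation:

```python
import math

# Toy labeled corpus: 2-D features standing in for images, with tags
# a human annotator provided (all names and numbers are hypothetical).
train = [
    ((1.0, 0.2), "banana"),
    ((0.9, 0.1), "banana"),
    ((0.1, 0.9), "not_banana"),
    ((0.2, 1.0), "not_banana"),
]

# Minimal 'model': the centroid (mean feature vector) of each label.
centroids = {}
for label in {lab for _, lab in train}:
    points = [x for x, lab in train if lab == label]
    centroids[label] = tuple(sum(c) / len(points) for c in zip(*points))

def predict(x):
    """Mimic the human labels: pick the label whose centroid is closest."""
    return min(centroids, key=lambda lab: math.dist(x, centroids[lab]))

# On an unseen example, the model generalizes the human-taught concept.
print(predict((0.95, 0.15)))  # → banana
```

The point of the sketch is only the shape of the setup: labeled input-output pairs in, a function that imitates the labels out.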
For the semi-supervised setting, the idea typically is that you have, of course, all of the supervised data, but you also have lots of other data which is unsupervised, which is not labeled.

Now, the problem with supervised learning, and why you actually have all of these alternate learning paradigms, is that supervised learning just does not scale. So if you look at computer vision, one of the largest, most popular datasets is ImageNet, right? The entire ImageNet dataset has about 22,000 concepts and about 14 million images. These concepts are basically just nouns, and they're annotated on images. And this entire dataset was a mammoth data collection effort. It actually gave rise to a lot of powerful learning algorithms, and it's credited with sort of the rise of deep learning as well. But this dataset took about 22 human years to collect, to annotate. And it's not even that many concepts, right? It's not even that many images. 14 million is nothing, really. You have, I think, about 400 million images or so, or even more than that, uploaded to most of the popular social media websites today. So supervised learning just doesn't scale: if I want to now annotate more concepts, if I want to have various types of fine-grained concepts, then it won't really scale.

So now you come to these different learning paradigms, for example semi-supervised learning, where the idea is, of course, you have this annotated corpus of supervised data, and you have lots of these unlabeled images. And the idea is that the algorithm should try to measure some kind of consistency, or really try to measure some kind of signal, on this unlabeled data to make itself more confident about what it's really trying to predict. So by having access to lots of unlabeled data, the idea is that the algorithm actually learns to be more confident and actually gets better at predicting these concepts.

And now we come to the other extreme, which is self-supervised learning. The idea basically is that the machine, or the algorithm, should really discover concepts, or discover things about the world, or learn representations about the world which are useful, without access to explicit human supervision.
So the word supervision is still in the term self-supervised. So what is the supervision signal? And maybe that perhaps is why Yann LeCun and you argue that unsupervised is the incorrect terminology here. So what is the supervision signal when the humans aren't part of the picture, or not a big part of the picture?
Right. So self-supervised, the reason it has the term supervised in it, is because you're using the data itself as supervision. So because the data serves as its own source of supervision, it's self-supervised in that way. Now, a lot of people have argued for using this term self-supervised. I mean, we did it in that blog post with Yann, but a lot of other people have as well, starting from like '94, from Virginia de Sa's group, and I think she's now at UCSD. Jitendra Malik has said this a bunch of times as well. So you have supervised, and then unsupervised basically means everything which is not supervised, but that includes stuff like semi-supervised, that includes transductive learning, lots of other sorts of settings. So that's the reason people are now preferring this term self-supervised, because it explicitly says what's happening: the data itself is the source of supervision, and any learning algorithm which tries to extract supervision signals from the data itself is a self-supervised algorithm.
But there is, within the data, a set of tricks which unlock the supervision. So can you give me some examples? There's innovation, ingenuity required to unlock that supervision. The data doesn't just speak some ground truth to you; you have to do some kind of trick. So I don't know what your favorite domain is. You specifically specialize in visual learning, but are there favorite examples, maybe in language or other domains?
Perhaps the most successful applications have been in NLP, natural language processing. So the idea basically being that you can train models where you have a sentence and you mask out certain words, and now these models learn to predict the masked-out words. So if you have, like, "the cat jumped over the dog," you can basically mask out "cat," and now you're essentially asking the model to predict what was missing: what did I mask out? So the model is going to predict a distribution over all the possible words that it knows, and, if it's a well-trained model, it probably has a higher probability for this word "cat."
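As a minimal sketch of that masked-word idea: a toy counting model, not the transformer-based models like BERT or GPT-3 that actually do this at scale, and the corpus below is made up:

```python
from collections import Counter, defaultdict

# Toy corpus standing in for web-scale text (all sentences made up).
corpus = [
    "the cat jumped over the dog",
    "the cat jumped over the dog",
    "the fox jumped over the dog",
    "the cat sat on the mat",
]

# 'Train' by counting: for each position, the context is the sentence
# with that word replaced by [MASK], and the word itself is the target.
context_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        context = tuple(words[:i] + ["[MASK]"] + words[i + 1:])
        context_counts[context][w] += 1

def predict_masked(masked_sentence):
    """Return a distribution over the masked-out word, given the context."""
    counts = context_counts[tuple(masked_sentence.split())]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

dist = predict_masked("the [MASK] jumped over the dog")
print(dist)  # 'cat' gets the highest probability in this toy corpus
```

No human labeled anything here: the sentence itself provides both the input (the masked context) and the target (the hidden word), which is exactly the self-supervision trick.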
For vision, I would say the easier example, which is not as widely used these days, is, say, video prediction. So video is again a sequence of things. So if you have a video of, say, 10 seconds, you can feed the first nine seconds to a model and then ask it: hey, what happens next? Can you predict what's going to happen? And the idea basically is that, because the model is predicting something about the data itself, you didn't need any human to tell you what was happening, because the 10-second video was naturally captured. Because the model is predicting what's happening there, it's going to automatically learn something about the structure of the world: how objects move, object permanence, these kinds of things. So, like, if I have something at the edge of the table... things like these, which you really don't have to sit and annotate. In a supervised learning setting, I would have to sit and annotate: this is a cup; now I move this cup, this is still a cup; and now I move this cup, it's still a cup; and then it falls down, and this is a fallen-down cup. So I won't have to annotate all of these things in a self-supervised setting.
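The next-frame idea can be sketched in a toy one-dimensional setting. A hypothetical constant-velocity "object" stands in for video frames here; real video models predict pixels or learned features, not a single number:

```python
# Toy 'video': each frame is the 1-D position of an object sliding at
# constant velocity (a hypothetical stand-in for real pixel frames).
frames = [float(t) for t in range(10)]  # positions at t = 0..9

# Self-supervision: consecutive frames form (input, target) pairs --
# no human label needed, the data supervises itself.
pairs = list(zip(frames[:-1], frames[1:]))

# Fit a linear next-frame predictor x_{t+1} = a * x_t + b by least squares.
n = len(pairs)
mean_x = sum(x for x, _ in pairs) / n
mean_y = sum(y for _, y in pairs) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / sum(
    (x - mean_x) ** 2 for x, _ in pairs
)
b = mean_y - a * mean_x

prediction = a * frames[-1] + b  # predict the unseen 10th frame from the 9th
print(prediction)  # → 10.0 for this constant-velocity toy
```

The predictor recovers the motion rule (here, "keep moving at the same speed") purely from how the sequence continues, with nobody annotating positions or velocities.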
Isn't that kind of a brilliant little trick: taking a series of data that is consistent, removing one element in that series, and then teaching the algorithm to predict that element? First of all, that's quite brilliant. It seems to be applicable to anything that has the constraint of being a sequence that is consistent with physical reality. The question is: are there other tricks like this that can generate the self-supervision signal?
So sequence is possibly the most widely used one in NLP. For vision, the one that is actually used for images, and is very popular these days, is taking an image and then taking different crops of that image. So you can decide to crop, say, the top-left corner, and you crop, say, the bottom-right corner, and you present a network with a choice, saying: okay, now you have this image and you have this image; are these the same or not? And the idea basically is that different parts of an image are going to be related. So, for example, if you have a chair and a table, these things are going to be close by. Or if you have a zoomed-in picture of a chair and you're taking different crops, they're going to be different parts of the chair. So the idea is that different crops of the image are related, and so the features, or the representations, that you get from these different crops should also be related. This is possibly the most widely used trick these days for self-supervised learning in computer vision.
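A sketch of the crop-similarity signal only, under toy assumptions: "images" are random vectors, "crops" are the image plus a little noise, and similarity is plain cosine. Real methods (SimCLR-style contrastive learning and relatives) additionally train an encoder so that same-image crops agree; none of that training loop is shown here:

```python
import random

random.seed(0)

def rand_vec(n):
    return [random.gauss(0.0, 1.0) for _ in range(n)]

def add(u, v, scale=1.0):
    return [a + scale * b for a, b in zip(u, v)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

# Toy 'images': 16-dim vectors. Two crops of the same image are the
# image plus small noise; a crop of another image is unrelated.
image_a, image_b = rand_vec(16), rand_vec(16)
crop_a1 = add(image_a, rand_vec(16), 0.01)  # e.g. top-left crop
crop_a2 = add(image_a, rand_vec(16), 0.01)  # e.g. bottom-right crop
crop_b = add(image_b, rand_vec(16), 0.01)   # crop of a different image

# Crops of the same image score higher than crops of different images:
# that relation, not any human label, is the supervision signal.
same = cosine(crop_a1, crop_a2)
diff = cosine(crop_a1, crop_b)
print(same > diff)
```

The "same or not?" question Ishan describes is exactly this comparison, with the network's learned representations in place of the raw vectors.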
So, again, using the consistency that's inherent to physical reality: in the visual domain, it's that parts of an image are consistent, and in the language domain, or anything that has sequences, like language or something that's a time series, you can chop off parts in time. It's similar to the story of RNNs and CNNs, of RNNs and convnets.
You and Yann LeCun wrote the blog post in March 2021 titled "Self-Supervised Learning: The Dark Matter of Intelligence." Can you summarize this blog post, and maybe explain the main idea, or set of ideas?

The blog post was mainly about telling, I mean, this is really an accepted fact, I would say, for a lot of people now, that self-supervised learning is something that is going to play an important role for machine learning algorithms that come in the future, and even now.
Well, let me just comment that we don't yet have a good understanding of what dark matter is.

So the idea basically being...

Maybe the metaphor doesn't exactly transfer, but maybe it actually transfers perfectly, in that we don't know. We have an inkling that it'll be a big part of whatever solving intelligence looks like.
I think self-supervised learning, the way it's done right now, is, I would say, the first step towards what it probably should end up learning, or what it should enable us to do. So the idea for that particular piece was: self-supervised learning is going to be a very powerful way to learn common sense about the world, or stuff that is really hard to label. For example: is this piece over here heavier than the cup? For all these kinds of things, you'd have to sit and label them, so supervised learning is clearly not going to scale. So what is the thing that's actually going to scale? It's probably going to be an agent that can either actually interact with the object to lift it up, or observe me doing it. So if I'm lifting these things up, it can probably reason: hey, this is taking him more time to lift up, or the velocity is different for this one, so probably this one is heavier. So essentially, by observation of the data, you should be able to infer a lot of things about the world without someone explicitly telling you: this is heavy, this is not; this is something that can pour, this is something that cannot pour; this is somewhere you can sit, this is not somewhere you can sit.
But you've just mentioned the ability to interact with the world. There are so many questions that are still open, which is: how do you select the set of data over which the self-supervised learning process works? How much interactivity, like in the active learning or the machine teaching context, is there? What are the reward signals? How much actual interaction is there with the physical world? That kind of thing. So that could be a huge question. And then on top of that, which I have a million questions about, which we don't know the answers to, but it's worth talking about: how much reasoning is involved? How much accumulation of knowledge, versus something that's more akin to learning, or whether that's the same thing?

So we're like... it is truly dark matter. We don't know exactly how to do it, but a lot of us are actually convinced that it's going to be a major thing in machine learning.
So let me reframe it, then: human supervision cannot be, at large scale, the source of the solution to intelligence. The machines have to discover the supervision in the natural signal of the world.
I mean, the other thing is also that humans are not particularly good labelers; they're not very consistent. For example, what's the difference between a dining table and a table? If you just look at a particular table, what makes us say one is a dining table and the other is not? Humans are not particularly consistent; they're not very good sources of supervision for a lot of these kinds of edge cases. So it may also be the fact that if we want an algorithm, or want a machine, to solve a particular task for us, we can maybe just specify the end goal, and the stuff in between we really probably should not be specifying, because we're maybe just going to confuse it a lot, actually.
Well, humans can't even answer the meaning of life, so I'm not sure if we're good supervisors of the end goal either. So let me ask you about categories. Humans are not very good at telling the difference between what is and isn't a table, like you mentioned. Do you think it's possible, and let me ask you to pretend you're Plato: is it possible to create a pretty good taxonomy of objects in the world? It seems like a lot of approaches in machine learning kind of assume a hopeful vision that it's possible to construct a perfect taxonomy, or that it exists, perhaps out of our reach, but we can always get closer and closer to it. Or is that a hopeless pursuit?
I think it's hopeless in some way. So the thing is, for any particular categorization, if you have a discrete categorization, I can always take the nearest two concepts, or I can take a third concept, blend it in, and create a new category. So if you were to enumerate N categories, I will always find an (N+1)th category for you that's not going to be in the N categories. And I can actually create not just N plus one; I can very easily create far more than N categories. A lot of things we talk about are actually compositional, so it's really hard for us to sit and enumerate all of these out. And they compose in various weird ways, right? Like, you have a croissant and a doughnut come together to form a cronut. So if you were to enumerate all the foods up until, I don't know, whenever the cronut appeared, about 10 years ago, then this entire thing called a cronut would not exist.

Yeah, I remember there was the most awesome video of a cat wearing a monkey costume. People should look it up, it's great. So is that a monkey, or is that a cat? It's a very difficult philosophical question.
So there is a concept of similarity between objects. So you think that can take us very far? Just kind of getting a good function, a good way to tell which parts of things are similar and which parts of things are very different?

So you don't necessarily need to name everything, or assign a name to everything, to be able to use it, right? So there are lots of...

Shakespeare said that: what's in a name?

I mean, lots of, for example, animals, right? They don't necessarily have a well-formed syntactic language, but they're able to go about their day perfectly. The same thing happens for us. We probably look at things and figure out: oh, this is similar to something else I've seen before, and then I can probably learn how to use it. So I haven't seen all the possible doorknobs in the world, but I was able to get into this particular place fairly easily, even though I've never seen that particular doorknob. Of course, I related it to all the doorknobs that I've seen, and I know exactly how it's going to open. I have a pretty good idea of how it's going to open. And I think this kind of translation between experiences only happens because of similarity, because I'm able to relate it to a doorknob. If I related it to a hairdryer, I would probably still be stuck outside, not able to get in.

Again, a bit of a philosophical question, but can similarity take us all the way to understanding a thing? Can having a good function that compares objects get us to understand something profound about singular objects?
I think I'll ask you a question back: what does it mean to understand objects?

Well, let me tell you what that's similar to. So there's an idea of reasoning by analogy, kind of thing. I think understanding is the process of placing that thing in some kind of network of knowledge that you have, that it perhaps is fundamentally related to other concepts. So understanding is fundamentally related by composition of other concepts, and maybe in relation to other concepts. And maybe deeper and deeper understanding is just adding more edges to that graph somehow. So maybe it is a composition of similarities. I mean, ultimately, I suppose it is a kind of embedding in that wisdom space.

Yeah, okay, wisdom space is good. I do think, right, similarity does get you very, very far. Is it the answer to everything? I mean, I don't even know what everything is, but it's going to take us really far.
And I think the thing is, things are similar in very different contexts, right? So an elephant is similar to, I don't know, another wild animal, let's just pick, I don't know, a lion, in one way, because they're both four-legged creatures; they're also land animals. But of course, they're very different in a lot of other ways: elephants are herbivores, lions are not. So similarity, and particularly dissimilarity, also actually helps us understand a lot about things. And that's actually why I think discrete categorization is very hard. Forming this particular category of elephant and a particular category of lion, maybe it's good for taxonomy, biological taxonomies. But when it comes to other things which are not as, maybe... for example, grilled cheese, right? I have a grilled cheese, I dip it in tomato, and I keep it outside. Now, is that still a grilled cheese, or is that something else?
All right, so categorization is still very useful for solving problems. But is your intuition, then, that self-supervised should be, to borrow Yann LeCun's terminology, the cake, and then categorization, the classification, maybe the supervised layer, should be just the thing on top: the cherry, or the icing, or whatever? So if you make it the cake, it gets in the way of learning?

If you make it the cake, then we won't be able to sit and annotate everything. It's as simple as that. That's my very practical view on it.
It's just... I mean, in my PhD, I sat down and annotated a bunch of cars for one of my projects. It was in a video, and I was basically drawing boxes around all these cars. And I think I spent about a week doing all of that, and I barely got anything done. And this was, I think, my first year of my PhD, or like the second year of my master's. And by the end of it, I was like: okay, this is just hopeless; I can't keep doing it. And while I was doing that, someone came up to me and told me: oh, this is a pickup truck, this is not a car. And I was like: aha, this actually makes sense, because a pickup truck is not really... what was I annotating? Was I annotating anything that is mobile? Or was I annotating particular sedans, or was I annotating SUVs?

By the way, the annotation was bounding boxes?
There are so many deep, profound questions here that you're almost cheating your way out of by doing self-supervised learning, by the way, like: what makes for an object? As opposed to, to solve intelligence, maybe you don't ever need to answer that question. I mean, this is the question that anyone who's ever done annotation gets to ask, because it's so painful: why am I drawing a very careful line around this object? What is the value? I remember when I first saw semantic segmentation, where you have instance segmentation, where you have a very exact line around the object in a 2D plane, of a fundamentally 3D object projected onto a 2D plane. So you're drawing a line around a car that might be occluded; there might be another thing in front of it, but you're still drawing the line of the part of the car that you see. How is that the car? Why is that the car? I had an existential crisis every time: how is that going to help us understand and solve computer vision?

I'm not sure I have a good answer to what's better. And I'm not sure I share the confidence that you have that self-supervised learning can take us far. I think I'm more and more convinced that it's a very important component, but I still feel like we need to understand what makes for... like, this dream of maybe what's called symbolic AI, of arriving, once you have this common-sense base, at being able to play with these concepts and build graphs or hierarchies of concepts on top, in order to then form a deep sense of this three-dimensional world, or four-dimensional world, and be able to reason, and then project that onto a 2D plane in order to interpret a 2D image.
Can I ask you just an out-there question? I remember, I think, Andrej Karpathy had a blog post about computer vision being really hard. I forgot what the title was, but it was many, many years ago. And he had, I think, President Obama stepping on a scale, and there was humor, and there were a bunch of people laughing, and whatever. And there are a lot of interesting things about that image, and I think Andrej highlighted a bunch of things about the image that us humans are able to immediately understand. Like the idea, I think, of gravity, and that you have the concept of a weight. You immediately project, because of our knowledge of pose and how human bodies are constructed, you understand how the forces are being applied within the human body. They're really interesting. The other thing that you're able to understand is multiple people looking at each other in the image. You're able to have a mental model of what the people are thinking about. You're able to infer: oh, this person is probably laughing at how humorous the situation is, and this person is confused about what the situation is, because they're looking this way. We're able to infer all of that. So that's human vision. How difficult is computer vision, in order to achieve that level of understanding? And maybe, how big of a part does self-supervised learning play in that, do you think? And do you still... you know, back then, that was like over a decade ago, I think Andrej, and I think a lot of people, agreed that computer vision is really hard. Do you still think computer vision is really hard?
I think it is, yes. And getting to that kind of understanding... I mean, it's really out there. So if you ask me to solve just that particular problem, I can do it the supervised learning route: I can always construct a dataset and basically predict, oh, is there humor in this or not? And of course I can do it.

Actually, that's a good question. Do you think you can do that with human-supervised annotation?

To some extent, yes. I'm sure it'll work. I mean, it won't be as bad as randomly guessing. I'm sure it can still predict whether it's humorous or not in some way.

Yeah, maybe, like, Reddit upvotes is the signal.

I mean, it won't do a great job, but it'll do something. It may actually find certain things which are not humorous, humorous as well, which is going to be bad for us. But I mean, it won't be random.

Yeah, kind of like my sense of humor.
So you can, for that particular problem, yes. But the general problem, you're saying, is hard.

The general problem is hard. And, I mean, self-supervised learning is not the answer to everything; of course it's not. I think if you have machines that are going to communicate with humans at the end of it, you want to understand what the algorithm is doing, right? You want it to be able to produce an output that you can decipher, that you can understand, or that's actually useful for something else, which again is a human. So at some point in this entire loop, a human comes in, and this human needs to understand what's going on. And at that point, this entire notion of language, or semantics, really comes in: if the machine just spits out something and we can't understand it, then it's not really that useful for us. So self-supervised learning is probably going to be useful for a lot of the things before that part, before the machine really needs to communicate a particular kind of output with a human. Because, I mean, otherwise, how is it going to do that without language?

Or some kind of communication. But you're saying that it's possible to build a big base of understanding, or whatever, of...

Of, like, common-sense concepts.
Self supervised learning in the context of computer vision
link |
is something you focused on,
link |
but that's a really hard domain.
link |
And it's kind of the cutting edge
link |
of what we're as a community working on today.
link |
Can we take a little bit of a step back
link |
and look at language?
link |
Can you summarize the history of success
link |
of self supervised learning
link |
in natural language processing, language modeling?
link |
What are transformers?
link |
What is the masking, the sentence completion
link |
that you mentioned before?
link |
How does it lead us to understand anything?
link |
Semantic meaning of words,
link |
syntactic role of words and sentences.
link |
So I'm of course not the expert in NLP.
link |
I kind of follow it a little bit from the sides.
link |
So the main sort of reason
link |
why all of this masking stuff works
link |
is, I think, what's called the distributional hypothesis.
link |
The idea basically being that words
link |
that occur in the same context
link |
should have similar meaning.
link |
So if you have the blank jumped over the blank,
link |
then whatever is in the first blank
link |
is basically an object that can actually jump,
link |
it's going to be something that can jump.
link |
So a cat or a dog or I don't know, sheep, something,
link |
all of these things can basically be
link |
in that particular context.
link |
And now so essentially the idea is that
link |
if you have words that are in the same context
link |
and you predict them,
link |
you're going to learn a lots of useful things
link |
about how words are related
link |
because you're predicting by looking at their context
link |
what the word is going to be.
link |
So in this particular case,
link |
the blank jumped over the fence.
link |
So now if it's a sheep,
link |
the sheep jumped over the fence,
link |
the dog jumped over the fence.
link |
So essentially the algorithm
link |
or the representation basically puts
link |
these two concepts together.
link |
So it says, okay, dogs are going to be kind of related to sheep
link |
because both of them occur in the same context.
link |
Of course, now you can decide
link |
depending on your particular application downstream,
link |
you can say that dogs are absolutely not related to sheep
link |
because well, I really care about dog food, for example.
link |
I'm a dog food person
link |
and I really want to give this dog food
link |
to this particular animal.
link |
So depending on what your downstream application is,
link |
of course, this notion of similarity
link |
or this notion or this common sense
link |
that you've learned may not be applicable.
link |
But the point is basically that this,
link |
just predicting what the blanks are
link |
is going to take you really, really far.
link |
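The fill-in-the-blank idea described above can be sketched with a toy count-based model; the corpus, window size, and function names here are illustrative assumptions, not anything from the conversation:

```python
# Sketch of masked-word prediction via the distributional hypothesis:
# count which words appear in each context, then predict the masked word
# as the most common word seen in that context. Toy corpus for illustration.
from collections import Counter, defaultdict

def build_context_counts(sentences, window=1):
    """Map each (left, right) context to a Counter of center words."""
    counts = defaultdict(Counter)
    for sent in sentences:
        words = sent.lower().split()
        for i, w in enumerate(words):
            left = tuple(words[max(0, i - window):i])
            right = tuple(words[i + 1:i + 1 + window])
            counts[(left, right)][w] += 1
    return counts

def predict_masked(counts, left, right):
    """Most likely word for the blank given its surrounding context."""
    candidates = counts.get((tuple(left), tuple(right)))
    return candidates.most_common(1)[0][0] if candidates else None

corpus = [
    "the dog jumped over the fence",
    "the sheep jumped over the fence",
    "the cat jumped over the fence",
]
counts = build_context_counts(corpus)
# "the ____ jumped": dog, sheep, and cat all fit this context, so the
# model treats them as interchangeable here -- things that can jump.
print(predict_masked(counts, ["the"], ["jumped"]))
```

Real models like BERT learn a neural network instead of raw counts, but the training signal is this same context-to-word prediction.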
So there's a nice feature of language
link |
that the number of words in a particular language
link |
is very large, but it's finite
link |
and it's actually not that large
link |
in the grand scheme of things.
link |
I still get caught up on that because we take it for granted.
link |
So first of all, when you say masking,
link |
you're talking about this very process
link |
of the blank of removing words from a sentence
link |
and then having the knowledge of what word went there
link |
in the initial data set.
link |
That's the ground truth that you're training on
link |
and then you're asking the neural network
link |
to predict where it goes there.
link |
That's like a little trick.
link |
It's a really powerful trick.
link |
The question is how far that takes us
link |
and the other question is, is there other tricks?
link |
Because to me, it's very possible
link |
there's other very fascinating tricks.
link |
I'll give you an example in autonomous driving,
link |
there's a bunch of tricks
link |
that give you the self supervised signal back.
link |
For example, very similar to sentences,
link |
but not really, which is you have signals
link |
from humans driving the car
link |
because a lot of us drive cars to places.
link |
And so you can ask the neural network to predict
link |
what's going to happen the next two seconds
link |
for a safe navigation through the environment.
link |
And the signal comes from the fact
link |
that you also have knowledge of what happened
link |
in the next two seconds
link |
because you have video of the data.
link |
The question in autonomous driving, as it is in language,
link |
can we learn how to drive autonomously
link |
based on that kind of self supervision?
link |
Probably the answer is no.
link |
The question is how good can we get?
link |
And the same with language, how good can we get?
link |
And are there other tricks?
link |
Like we get sometimes super excited
link |
by this trick that works really well.
link |
But I wonder, it's almost like mining for gold.
link |
I wonder how many signals there are in the data
link |
that could be leveraged that are like there, right?
link |
Is that, I just want to kind of linger on that
link |
because sometimes it's easy to think
link |
that maybe this masking process is self supervised learning.
link |
No, it's only one method.
link |
So there could be many, many other methods,
link |
many tricky methods,
link |
maybe interesting ways to leverage human computation
link |
in very interesting ways
link |
that might actually border on semi supervised learning,
link |
something like that.
link |
Obviously the internet is generated by humans
link |
at the end of the day.
link |
So all that to say is what's your sense
link |
in this particular context of language,
link |
how far can that masking process take us?
link |
So it has stood the test of time, right?
link |
I mean, so Word2Vec, the initial sort of NLP technique
link |
that was using this to now, for example,
link |
like all the BERT and all these big models
link |
that we get, BERT and RoBERTa, for example,
link |
all of them are still sort of based
link |
on the same principle of masking.
link |
It's taken us really far.
link |
I mean, you can actually do things like,
link |
oh, these two sentences are similar or not,
link |
whether this particular sentence follows this other sentence
link |
in terms of logic, so entailment.
link |
You can do a lot of these things with this,
link |
just this masking trick.
link |
Yeah, so I'm not sure if I can predict how far it can take us
link |
because when it first came out, when Word2Vec was out,
link |
I don't think a lot of us would have imagined
link |
that this would actually help us do some kind
link |
of entailment problems and really that well.
link |
And so just the fact that by just scaling up
link |
the amount of data that we're training on
link |
and using better and more powerful neural network
link |
architectures has taken us from that to this,
link |
is just showing you what poor predictors we are,
link |
as humans, how poor we are at predicting
link |
how successful a particular technique is going to be.
link |
So I think I can say something now,
link |
but like 10 years from now,
link |
I'll look completely stupid basically predicting this.
link |
In the language domain, is there something in your work
link |
that you find useful and insightful
link |
and transferable to computer vision,
link |
but also just, I don't know, beautiful and profound
link |
that I think carries through to the vision domain?
link |
I mean, the idea of masking has been very powerful.
link |
It has been used in vision as well for predicting,
link |
like you say, the next sort of thing,
link |
if you have video frames,
link |
then you predict what's going to happen in the next frame.
link |
So that's been very powerful.
link |
In terms of modeling, like in just terms
link |
in terms of architecture,
link |
I think you asked about transformers a while back.
link |
That has really become,
link |
like it has become super exciting for computer vision now.
link |
Like in the past, I would say year and a half,
link |
it's become really powerful.
link |
What's a transformer?
link |
I mean, the core part of a transformer
link |
is something called the self attention model.
link |
So it came out of Google.
link |
And the idea basically is that if you have N elements,
link |
what you're creating is a way
link |
for all of these N elements to talk to each other.
link |
So the idea basically is that you are paying attention.
link |
Each element is paying attention
link |
to each of the other element.
link |
And basically by doing this,
link |
it's really trying to figure out,
link |
you're basically getting a much better view of the data.
link |
So for example, if you have a sentence of like four words,
link |
the point is if you get a representation
link |
or a feature for this entire sentence,
link |
it's constructed in a way
link |
such that each word has paid attention
link |
to everything else.
link |
Now, the reason it's like different from say,
link |
what you would do in a ConvNet is basically
link |
that in the ConvNet,
link |
you would only pay attention to a local window.
link |
So each word would only pay attention
link |
to its next neighbor or like one neighbor after that.
link |
And the same thing goes for images.
link |
In images, you would basically pay attention to pixels
link |
in a three cross three or a seven cross seven neighborhood.
link |
Whereas with the transformer,
link |
that self attention mainly the sort of idea
link |
is that each element needs to pay attention
link |
to each other element.
link |
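The every-element-attends-to-every-element idea can be shown in a few lines; this is a bare-bones sketch without the learned query/key/value projections a real transformer uses, so the shapes and simplifications are assumptions for illustration:

```python
# Minimal self-attention: each of the n token features becomes a weighted
# average of ALL n features, with weights from pairwise dot-product
# similarity. No learned projections -- just the core mixing operation.
import numpy as np

def self_attention(x):
    """x: (n, d) array of n token features; returns (n, d)."""
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)                   # (n, n) pairwise scores
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ x                              # every token sees every other

x = np.random.default_rng(0).normal(size=(4, 8))    # a "sentence" of 4 tokens
out = self_attention(x)
print(out.shape)  # (4, 8): same shape, but each row mixed in the full context
```

Contrast this with a convolution, where each output position only mixes a small local window (3x3, 7x7) of its neighbors.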
And when you say attention,
link |
maybe another way to phrase that
link |
is you're considering a context,
link |
a wide context in terms of the wide context of the sentence
link |
in understanding the meaning of a particular word
link |
and in computer vision.
link |
That's understanding a larger context
link |
to understand the local pattern
link |
of a particular local part of an image.
link |
So basically if you have say,
link |
again a banana in the image,
link |
you're looking at the full image first.
link |
So whether it's like,
link |
you're looking at all the pixels that are off a kitchen
link |
or for dining table and so on.
link |
And then you're basically looking at the banana also.
link |
By the way, in terms of if we were to train
link |
the funny classifier,
link |
there's something funny about the word banana.
link |
Just wanted to anticipate that.
link |
I am wearing a banana shirt.
link |
Is there bananas on it?
link |
Okay. So masking has worked for the vision context as well.
link |
And so this transformer idea has worked as well.
link |
So basically looking at all the elements
link |
to understand a particular element
link |
has been really powerful in vision.
link |
The reason is, like, a lot of things are ambiguous
link |
when you're looking at them in isolation.
link |
So if you look at just a blob of pixels.
link |
So Antonio Torralba at MIT used to have this
link |
like really famous image,
link |
which I looked at when I was a PhD student,
link |
where he would basically have a blob of pixels
link |
and he would ask you,
link |
Hey, what is this?
link |
And it looked basically like a shoe
link |
or like it could look like a TV remote.
link |
It could look like anything.
link |
And it turns out it was a beer bottle.
link |
It was one of these three things,
link |
but basically he showed you the full picture
link |
and then it was very obvious what it was.
link |
But just by looking at that particular local window,
link |
you couldn't figure out
link |
because of resolution,
link |
because of other things,
link |
it's just not easy always to just figure out
link |
by looking at just the neighborhood of pixels,
link |
what these pixels are.
link |
And the same thing happens for language as well.
link |
For the parameters that have to learn
link |
something about the data,
link |
you need to give it the capacity
link |
to learn the essential things.
link |
Like if it's not actually able to receive the signal at all,
link |
then it's not going to be able to learn that signal.
link |
And in order to understand images,
link |
to understand language,
link |
you have to be able to see words in their full context.
link |
What is harder to solve?
link |
Vision or language?
link |
Visual intelligence or linguistic intelligence?
link |
So I'm going to say computer vision is harder.
link |
My reason for this is basically that
link |
language of course has a big structure to it
link |
because we developed it.
link |
Whereas vision is something that is common
link |
in a lot of animals.
link |
Everyone is able to get by,
link |
a lot of these animals on Earth
link |
are actually able to get by without language.
link |
And a lot of these animals,
link |
we also deem to be intelligent.
link |
So clearly intelligence does have
link |
like a visual component to it.
link |
And yes, of course in the case of humans,
link |
it of course also has a linguistic component.
link |
But it means that there is something far more fundamental
link |
about vision than there is about language.
link |
And I'm sorry to anyone who disagrees,
link |
but yes, this is what I feel.
link |
So that's being a little bit reflected
link |
in the challenges that have to do with the progress
link |
of self supervised learning, would you say?
link |
Or is that just the peculiar accidents
link |
of the progress of the AI community
link |
that we focused on?
link |
Or we discovered self attention
link |
and transformers in the context of language first.
link |
So like the self supervised learning success was actually,
link |
for vision has not much to do with the transformers part.
link |
I would say it's actually been independent a little bit.
link |
I think it's just that the signal
link |
was a little bit different for vision
link |
than there was for like NLP
link |
and probably NLP folks discovered it before.
link |
So for vision, the main success
link |
has basically been this like crops so far,
link |
like taking different crops of images.
link |
Whereas for NLP, it was this masking thing.
link |
But also the level of success
link |
is still much higher for language.
link |
So that has a lot to do with,
link |
I mean, I can get into a lot of details.
link |
For this particular question, let's go for it.
link |
Okay, so the first thing is language is very structured.
link |
So you are going to produce a distribution
link |
over a finite vocabulary.
link |
English has a finite number of words.
link |
It's actually not that large.
link |
And you need to produce basically,
link |
when you're doing this masking thing,
link |
all you need to do is basically tell me
link |
which one of these like 50,000 words it is.
link |
Now for vision, let's imagine doing the same thing.
link |
Okay, we're basically going to blank out
link |
a particular part of the image.
link |
And we ask the network or this neural network
link |
to predict what is present in this missing patch.
link |
It's combinatorially large, right?
link |
You have 256 pixel values.
link |
If you're even producing basically a seven cross seven
link |
or a 14 cross 14 like window of pixels
link |
at each of these 196 or each of these 49 locations,
link |
you have 256 values to predict.
link |
And so it's really, really large.
link |
And very quickly, the kind of like prediction problems
link |
that we're setting up are going to be extremely
link |
like intractable for us.
link |
And so the thing is for NLP, it has been really successful
link |
because we are very good at predicting,
link |
like doing this like distribution over a finite set.
link |
And the problem is when this set becomes really large,
link |
we're going to become really, really bad
link |
at making these predictions.
link |
And at solving basically this particular set of problems.
link |
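The size gap Ishan is describing can be made concrete with quick arithmetic; the numbers (50,000 words, 7x7 patches, 256 pixel values) come from the conversation, and the single-channel assumption is mine:

```python
# Order-of-magnitude comparison of the prediction spaces discussed above:
# a language model picks one word from a finite vocabulary, while naively
# predicting a missing image patch means choosing every pixel value.
import math

vocab_size = 50_000          # typical NLP vocabulary size
patch_pixels = 7 * 7         # one 7x7 window of pixels
values_per_pixel = 256       # 8-bit intensities, single channel assumed

# Number of distinct 7x7 patches is 256^49; report its order of magnitude.
log10_outcomes = patch_pixels * math.log10(values_per_pixel)
print(f"language: pick 1 of {vocab_size:,} words")
print(f"vision:   pick 1 of ~10^{log10_outcomes:.0f} possible patches")
```

Even for a tiny grayscale patch the outcome space is around 10^118, which is why vision methods compare feature vectors instead of predicting raw pixels over a softmax.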
So if you were to do it exactly in the same way
link |
as NLP for vision, there is very limited success.
link |
The way stuff is working right now
link |
is actually not by predicting these masks.
link |
It's basically by saying that you take these two
link |
like crops from the image,
link |
you get a feature representation from it.
link |
And just saying that these two features,
link |
so they're like vectors,
link |
just saying that the distance between these vectors should be small.
link |
And so it's a very different way of learning
link |
from the visual signal than there is from NLP.
link |
Okay, the other reason is the distributional hypothesis
link |
that we talked about for NLP, right?
link |
So a word given its context,
link |
basically the context actually supplies a lot
link |
of meaning to the word.
link |
Now, because there are just finite number of words
link |
and there is a finite way in which we compose them,
link |
of course, the same thing holds for pixels,
link |
but in language, there's a lot of structure, right?
link |
So I always say whatever,
link |
the dash jumped over the fence, for example.
link |
There are lots of these sentences that you'll get.
link |
And from this, you can actually look at
link |
this particular sentence might occur
link |
in a lot of different contexts as well.
link |
This exact same sentence might occur in a different context.
link |
So the sheep jumped over the fence,
link |
the cat jumped over the fence,
link |
the dog jumped over the fence.
link |
So you immediately get a lot of these words,
link |
which are, because this particular token itself
link |
has so much meaning, you get a lot of these tokens
link |
or these words which are actually going to have
link |
sort of this related meaning across, given this context.
link |
Whereas for vision, it's much harder.
link |
Because just by pure, the way we capture images,
link |
lighting can be different.
link |
There might be different noise in the sensor.
link |
So the thing is you're capturing a physical phenomenon
link |
and then you're basically going through
link |
a very complicated pipeline of image processing
link |
and then you're translating that
link |
into some kind of digital signal.
link |
Whereas with language, you write it down
link |
and you transfer it to a digital signal,
link |
almost like it's a lossless transfer.
link |
And each of these tokens are very, very well defined.
link |
There could be a little bit of an argument there
link |
because language has written down
link |
is a projection of thought.
link |
This is one of the open questions is
link |
if you perfectly can solve language,
link |
are you getting close to being able to pass,
link |
with flying colors, the Turing test?
link |
So that's, it's similar, but different
link |
and the computer vision problem, the 2D plane,
link |
is a projection of a three dimensional world.
link |
So perhaps there are similar problems there.
link |
Maybe this is a good, yeah.
link |
I think what I'm saying is NLP is not easy.
link |
Of course, don't get me wrong.
link |
Like abstract thought expressed in knowledge
link |
or knowledge basically expressed in language
link |
is really hard to understand, right?
link |
I mean, we've been communicating with language for so long
link |
and it's, it is of course a very complicated concept.
link |
The thing is, at least getting like some,
link |
somewhat reasonable, like being able to solve
link |
some kind of reasonable tasks with language,
link |
is, I would say, slightly easier than it is
link |
with computer vision.
link |
Yeah, I would say, yeah.
link |
So that's well put.
link |
I would say getting impressive performance on language is easier.
link |
I feel like for both language and computer vision,
link |
there's going to be this wall of like,
link |
like this hump you have to overcome
link |
to achieve super human level performance
link |
or human level performance.
link |
And I feel like for language, that wall is farther away.
link |
So you can get pretty nice.
link |
You can do a lot of tricks.
link |
You can show really impressive performance.
link |
You can even fool people that you're tweeting
link |
or you're blog posts writing
link |
or your question answering has intelligence behind it.
link |
But to truly demonstrate understanding of dialogue,
link |
of continuous long form dialogue,
link |
that would require perhaps big breakthroughs.
link |
In the same way in computer vision,
link |
I think the big breakthroughs need to happen earlier
link |
to achieve impressive performance.
link |
This might be a good place to, you already mentioned it,
link |
but what is contrastive learning
link |
and what are energy based models?
link |
Contrastive learning is sort of the paradigm of learning
link |
where the idea is that you are learning this embedding space
link |
or so you're learning this sort of vector space
link |
of all your concepts.
link |
And the way you learn that is basically by contrasting.
link |
So the idea is that you have a sample,
link |
you have another sample that's related to it.
link |
So that's called the positive
link |
and you have another sample that's not related to it.
link |
So that's negative.
link |
So for example, let's just take an NLP
link |
or in a simple example in computer vision.
link |
So you have an image of a cat,
link |
you have an image of a dog
link |
and for whatever application that you're doing,
link |
say you're trying to figure out what pets are,
link |
you think that these two images are related.
link |
So image of a cat and dog are related,
link |
but now you have another third image of a banana
link |
because you don't like that word.
link |
So now you basically have this banana.
link |
Thank you for speaking to the crowd.
link |
And so you take both of these images
link |
and you take the image from the cat,
link |
the image from the dog,
link |
you get a feature from both of them.
link |
And now what you're training the network to do
link |
is basically pull both of these features together
link |
while pushing them away from the feature of a banana.
link |
So this is the contrastive part.
link |
So you're contrasting against the banana.
link |
So there's always this notion of a negative and a positive.
link |
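The cat/dog/banana example above can be written down as a simple margin-based loss; this triplet-style formulation is one common way to implement the contrastive idea, and the 2-D toy "features" are made up for illustration:

```python
# Sketch of contrastive learning: pull the anchor (cat) toward the positive
# (dog) and push it away from the negative (banana). A margin-based
# triplet-style loss over toy 2-D feature vectors.
import math

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the positive is closer than the negative by `margin`."""
    return max(0.0, distance(anchor, positive) - distance(anchor, negative) + margin)

cat, dog, banana = [0.9, 0.1], [0.8, 0.2], [0.1, 0.95]
print(triplet_loss(cat, dog, banana))   # 0.0: dog is already near, banana far
print(triplet_loss(cat, banana, dog))   # large: the wrong pairing is penalized
```

In training, gradients of this loss move the feature extractor so related samples cluster together, which is exactly the "pull together, push apart" picture.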
Now, energy based models are like one way
link |
that Jan sort of explains a lot of these methods.
link |
So Jan basically, I think a couple of years or more
link |
than that, like when I joined Facebook,
link |
Jan used to keep mentioning this word energy based models.
link |
And of course, I had no idea what he was talking about.
link |
So then one day I caught him in one of the conference rooms
link |
and I'm like, can you please tell me what this is?
link |
So then like very patiently,
link |
he sat down with like a marker and a whiteboard.
link |
And his idea basically is that
link |
rather than talking about probability distributions,
link |
you can talk about energies of models.
link |
So models are trying to minimize certain energies
link |
or they're trying to maximize a certain kind of energy.
link |
And the idea basically is that
link |
you can explain a lot of the contrastive models,
link |
GANs for example, which are like
link |
generative adversarial networks.
link |
A lot of these modern learning methods
link |
or VAEs, which are variational autoencoders,
link |
you can really explain them very nicely
link |
in terms of an energy function
link |
that they're trying to minimize or maximize.
link |
And so by putting this common sort of language
link |
for all of these models,
link |
what looks very different in machine learning
link |
that VAEs are very different from what GANs are,
link |
are very different from what contrastive models are,
link |
you actually get a sense of like,
link |
oh, these are actually very, very related.
link |
It's just that the way or the mechanism
link |
in which they're sort of maximizing
link |
or minimizing this energy function is slightly different.
link |
It's revealing the commonalities between all these approaches
link |
and putting a sexy word on top of it, like energy.
link |
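A toy rendering of that energy framing: one scalar that is low for compatible pairs and high otherwise. Using squared distance between feature vectors as the energy is just one simple choice, assumed here for illustration:

```python
# Toy energy-based view: a single scalar "energy" that is low when two
# inputs are compatible and high when they are not. Squared distance
# between feature vectors is one simple choice of energy function.
def energy(f_x, f_y):
    """Low energy = the model considers x and y compatible."""
    return sum((a - b) ** 2 for a, b in zip(f_x, f_y))

# Two crops of the same image should map to nearby features (low energy),
# while an unrelated image maps far away (high energy).
crop_a, crop_b, unrelated = [1.0, 2.0], [1.1, 1.9], [5.0, -3.0]
print(energy(crop_a, crop_b))      # small
print(energy(crop_a, unrelated))   # large
```

Contrastive methods, GANs, and VAEs then differ mainly in *how* they shape this energy surface, which is the unifying point being made.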
And so similarities, two things that are similar have low energy?
link |
Like the low energy signifying similarity.
link |
So basically the idea is that if you were to imagine
link |
like the embedding as a manifold, a 2D manifold,
link |
you would get a hill or like a high sort of peak
link |
in the energy manifold,
link |
wherever two things are not related.
link |
And basically you would have like a dip
link |
where two things are related.
link |
So you'd get a dip in the manifold.
link |
And in the self supervised context,
link |
how do you know two things are related
link |
and two things are not related?
link |
This is where all the sort of ingenuity or tricks come in,
link |
So for example, like you can take the fill in the blank
link |
problem or you can take in the context problem.
link |
And what you can say is two words
link |
that are in the same context are related.
link |
Two words that are in different contexts are not related.
link |
For images, basically two crops from the same image
link |
are related, whereas a third image is not related at all.
link |
For a video, it can be two frames from that video
link |
are related because they're likely to contain
link |
the same sort of concepts in them.
link |
Whereas a third frame from a different video is not related.
link |
So it basically is, it's a very general term.
link |
Contrastive learning has nothing really
link |
to do with self supervised learning.
link |
It actually is very popular in, for example,
link |
like any kind of metric learning
link |
or any kind of embedding learning.
link |
So it's also used in supervised learning.
link |
It's also, and the thing is because we are not really
link |
using labels to get these positive or negative pairs,
link |
it can basically also be used for self supervised learning.
link |
So you mentioned one of the ideas in the vision context
link |
that works is to have different crops.
link |
So you could think of that as a way to sort of
link |
manipulate the data to generate examples that are similar.
link |
Obviously, there's a bunch of other techniques.
link |
You mentioned lighting as a very, in images,
link |
lighting is something that varies a lot
link |
and you can artificially change those kinds of things.
link |
There's the whole broad field of data augmentation
link |
which manipulates images in order to increase arbitrarily
link |
the size of the data set.
link |
First of all, what is data augmentation?
link |
And second of all, what's the role of data augmentation
link |
in self supervised learning and contrastive learning?
link |
So data augmentation is just a way, like you said,
link |
it's basically a way to augment the data.
link |
So you have say N samples and what you do is
link |
you basically define some kind of transforms for the sample.
link |
So you take your say image and then you define a transform
link |
where you can just increase the colors or the brightness
link |
of the image or increase or decrease the contrast of the image,
link |
for example, or take different crops of it.
link |
So data augmentation is just a process
link |
to basically perturb the data or augment the data.
link |
And so it has played a fundamental role
link |
for computer vision for self supervised learning,
link |
especially the way most of the current methods
link |
work contrastive or otherwise is by taking an image,
link |
in the case of images, is by taking an image
link |
and then computing basically two perturbations of it.
link |
So these can be two different crops of the image
link |
with like different types of lighting
link |
or different contrast or different colors.
link |
So you jitter the colors a little bit and so on.
link |
And now the idea is basically because it's the same object
link |
or because it's like related concepts
link |
in both of these perturbations,
link |
you want the features from both of these perturbations to be similar.
link |
So now you can use a variety of different ways
link |
to enforce this constraint, like these features being similar.
link |
You can do this by contrastive learning.
link |
So basically both of these things are positives,
link |
a third sort of image is negative.
link |
You can do this basically by like clustering.
link |
For example, you can say that both of these images should,
link |
the features from both of these images
link |
should belong in the same cluster because they're related.
link |
Whereas image, like another image
link |
should belong to a different cluster.
link |
So there's a variety of different ways
link |
to basically enforce this particular constraint.
link |
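The two-perturbation recipe described above can be sketched end to end; the one-dimensional "image", the specific transforms, and the function names are all toy stand-ins for illustration:

```python
# Sketch of the two-view augmentation pipeline: take one image, produce two
# random perturbations of it (crop + brightness jitter), and treat them as
# a positive pair whose features should agree during training.
import random

def random_crop(img, size):
    """Random contiguous slice of a 1-D 'image' (list of pixel values)."""
    start = random.randrange(0, len(img) - size + 1)
    return img[start:start + size]

def jitter_brightness(img, max_shift=10):
    """Shift all pixel values by a random amount, clamped to [0, 255]."""
    shift = random.randint(-max_shift, max_shift)
    return [max(0, min(255, p + shift)) for p in img]

def two_views(img, crop_size=4):
    """Two independent perturbations of the same image: the positive pair."""
    return (jitter_brightness(random_crop(img, crop_size)),
            jitter_brightness(random_crop(img, crop_size)))

random.seed(0)
image = [10, 50, 90, 130, 170, 210, 250, 200]
view_a, view_b = two_views(image)
print(view_a, view_b)  # two different crops/brightnesses of one image
```

A real pipeline applies the same idea to 2-D images with crops, flips, color jitter, and blur, then feeds both views through the same network and enforces feature agreement by contrastive loss or clustering.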
By the way, when you say features,
link |
it means there's a very large neural network
link |
that extracting patterns from the image
link |
and the kind of patterns that extracts
link |
should be either identical or very similar.
link |
That's what that means.
link |
So the neural network basically takes in the image
link |
and then outputs a set of basically a vector of numbers.
link |
And that's the feature.
link |
And you want this feature for both of these different crops
link |
that you computed to be similar.
link |
So you want this vector to be identical
link |
in its entries, for example.
link |
Be like literally close in this multidimensional space
link |
And like you said, close can mean part of the same cluster
link |
or something like that in this large space.
link |
First of all, that, I wonder if there is connection
link |
to the way humans learn to this.
link |
Almost like maybe subconsciously,
link |
in order to understand a thing,
link |
you kind of have to see it from two, three multiple angles.
link |
I wonder, I have a lot of friends who are neuroscientists
link |
maybe and cognitive scientists.
link |
I wonder if that's in there somewhere.
link |
Like in order for us to place a concept in its proper place,
link |
we have to basically crop it in all kinds of ways,
link |
do basic data augmentation on it
link |
in whatever very clever ways that the brain likes to do.
link |
Like spinning it around in our minds somehow,
link |
that that is very effective.
link |
So I think for some of them, we need to do it.
link |
So like babies, for example, pick up objects,
link |
like move them, put them, go sit there and whatnot.
link |
But for certain other things,
link |
actually we are good at imagining it as well.
link |
So if you, I have never seen, for example,
link |
an elephant from the top.
link |
I've never basically looked at it from top down.
link |
But if you showed me a picture of it,
link |
I could very well tell you that that's an elephant.
link |
So I think some of it, we just like,
link |
we naturally build it or transfer it from other objects
link |
that we've seen to imagine what it's going to look like.
link |
Has anyone done that with the augmentation?
link |
Like imagine all the possible things
link |
that are occluded or not there,
link |
but not just like normal things, like wild things,
link |
but they're nevertheless physically consistent.
link |
So I mean, people do kind of like occlusion based
link |
augmentation as well.
link |
So you place in like a random like box, gray box
link |
to sort of mask out a certain part of the image.
link |
And the thing is basically you're kind of occluding it.
link |
For example, you place it say on half of a person's face.
link |
So basically saying that, you know,
link |
something below their nose is occluded
link |
because it's grayed out.
link |
So, this is kind of.
link |
No, I meant like, you have like, what is it?
link |
A table and you can't see behind the table.
link |
And you imagine there's a bunch of elves
link |
with bananas behind the table.
link |
Like I wonder if there's useful to have a,
link |
a wild imagination for the network.
link |
Because that's possible.
link |
Well, maybe not elves, but like puppies
link |
and kittens or something like that.
link |
Just have a wild imagination and like constantly
link |
be generating that wild imagination.
link |
Cause in terms of data augmentation
link |
that's currently applied, it's super ultra very boring.
link |
It's very basic data augmentation.
link |
I wonder if, I wonder if there's a benefit
link |
to being wildly imaginative while trying to be
link |
consistent with physical reality.
link |
I think it's a kind of a chicken and egg problem, right?
link |
Because to have like amazing data augmentation,
link |
you need to understand what the scene is.
link |
And we're trying to do data augmentation
link |
to learn what a scene is anyway.
link |
So it basically just keeps going on.
link |
Before you understand it, just put elves with bananas
link |
until you know it not to be true.
link |
Just like children have a wild imagination
link |
until the adults ruin it all.
link |
So what are the different kinds of data augmentation
link |
that you've seen to be effective in visual intelligence?
link |
For like vision, it's a lot of these image filtering operations.
link |
So like blurring the image, you know,
link |
all the kind of Instagram filters that you can think of.
link |
So like arbitrarily like make the red super red,
link |
make the green super greens, like saturate the image.
link |
Rotation cropping.
link |
Rotation cropping.
link |
All of these kind of things.
link |
Like I said, lighting is a really interesting one to me.
link |
Like that feels like really complicated to do.
link |
So I mean, they don't, the augmentations that we work on
link |
aren't like that involved.
link |
So they're not going to be like physically realistic
link |
versions of lighting.
link |
It's not that you're assuming that there's a light source
link |
up and then you're moving it to the right.
link |
And then what does the thing look like?
link |
It's really more about like brightness of the image,
link |
overall brightness of the image or overall contrast
link |
of the image and so on.
link |
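A minimal sketch of the simple photometric and cropping augmentations just listed, brightness, contrast, and random crops; the parameter ranges are illustrative:

```python
import numpy as np

def jitter_brightness_contrast(image, rng, max_delta=0.4):
    """Global brightness shift plus contrast rescale, the kind of
    'overall brightness / overall contrast' augmentation described
    above. Ranges here are illustrative, not from any paper."""
    brightness = rng.uniform(-max_delta, max_delta)
    contrast = rng.uniform(1 - max_delta, 1 + max_delta)
    mean = image.mean()
    # Contrast scales deviations from the mean; brightness shifts globally.
    out = (image - mean) * contrast + mean + brightness
    return np.clip(out, 0.0, 1.0)

def random_crop(image, rng, crop_frac=0.6):
    """Take a random crop covering crop_frac of each side."""
    h, w = image.shape[:2]
    ch, cw = int(h * crop_frac), int(w * crop_frac)
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    return image[top:top + ch, left:left + cw]
```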
But this is a really important point to me.
link |
I always thought that data augmentation
link |
holds an important key to big improvements in machine learning.
link |
And it seems that it is an important aspect
link |
of self supervised learning.
link |
So I wonder if there's big improvements
link |
to be achieved on much more intelligent kinds of data augmentation.
link |
For example, currently, maybe you can correct me
link |
if I'm wrong, data augmentation is not parametrized.
link |
You're not learning.
link |
You're not learning.
link |
To me, it seems like data augmentation potentially
link |
should involve more learning than the learning process itself.
link |
You're almost like thinking of like generative kind of,
link |
it's the elves with bananas.
link |
You're trying to, it's like very active imagination
link |
of messing with the world and teaching that mechanism
link |
for messing with the world to be realistic.
link |
Because that feels like, I mean, it's imagination.
link |
Just as you said, it feels like us humans
link |
are able to maybe sometimes subconsciously
link |
imagine, before we see the thing,
link |
imagine what we're expecting to see.
link |
Like maybe several options.
link |
And especially, we probably forgot, but when we were younger,
link |
probably the possibilities were wild.
link |
They're more numerous.
link |
And then as we get older, we come to understand the world
link |
and the possibilities of what we might see
link |
becomes less and less and less.
link |
So I wonder if you think there's a lot of breakthroughs yet
link |
to be had in data augmentation.
link |
And maybe also, can you just comment on the stuff we have?
link |
Is that a big part of self supervised learning?
link |
So data augmentation is like key to self supervised learning.
link |
That, and the kind of augmentation that we're using.
link |
And basically, the fact that we're
link |
trying to learn these neural networks that
link |
are predicting these features from images that
link |
are robust under data augmentation
link |
has been the key for visual self supervised learning.
link |
And they play a fairly fundamental role to it.
link |
Now, the irony of all of this is that deep learning purists
link |
will say the entire point of deep learning
link |
is that you feed in the pixels to the neural network.
link |
And it should figure out the patterns on its own.
link |
So if it really wants to look at edges,
link |
it should look at edges.
link |
You shouldn't really go and handcraft these features.
link |
You shouldn't go tell it that look at edges.
link |
So data augmentation should basically
link |
be in the same category.
link |
Why should we tell the network or tell this entire learning
link |
paradigm what kinds of data augmentation
link |
that we're looking for?
link |
We are encoding a very sort of human specific bias there
link |
that we know that, if you change the contrast of the image,
link |
it should still be an apple.
link |
Or it should still be apple, not banana.
link |
Basically, if we change colors, it
link |
should still be the same kind of concept.
link |
Of course, this is not ideal.
link |
This doesn't feel like super satisfactory,
link |
because a lot of our human knowledge or our human supervision
link |
is actually going into the data augmentation.
link |
So although we are calling it self supervised learning,
link |
a lot of the human knowledge is actually
link |
being encoded in the data augmentation process.
link |
So it's really like we've kind of sneaked away
link |
the supervision at the input.
link |
And we're really designing these nice list of data
link |
augmentations that are working very well.
link |
Of course, the idea is that it's much easier
link |
to design a list of data augmentations than it is to do the labeling.
link |
So humans are doing, nevertheless,
link |
doing less and less work, and maybe leveraging
link |
their creativity more and more.
link |
And when we say data augmentation is not parameterized,
link |
it means it's not part of the learning process.
link |
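To make the distinction concrete, here is a toy sketch of what a parameterized augmentation could look like: the brightness shift is a tunable knob rather than a fixed choice. The objective below is hypothetical and only illustrates the parameterization; real learned-augmentation methods are far more involved:

```python
import numpy as np

def brightness_aug(image, delta):
    """A parameterized augmentation: delta is a knob that could,
    in principle, be tuned by a learning process rather than fixed."""
    return np.clip(image + delta, 0.0, 1.0)

def tune_delta(image, target_mean, steps=100, lr=0.5):
    """Toy illustration of 'learning' an augmentation parameter:
    adjust delta so the augmented image hits a target brightness.
    (Ignoring the clip, d(mean)/d(delta) = 1, so the gradient of
    0.5 * err**2 with respect to delta is just err.)"""
    delta = 0.0
    for _ in range(steps):
        err = brightness_aug(image, delta).mean() - target_mean
        delta -= lr * err  # simple gradient step on the toy objective
    return delta
```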
Do you think it's possible to integrate some of the data
link |
augmentation into the learning process?
link |
And in fact, it will be really beneficial for us,
link |
because a lot of these data augmentations that we use in vision are blind to the image content.
link |
For example, when you have certain concepts, again, a banana,
link |
you take the banana and then basically you
link |
change the color of the banana.
link |
So you make it a purple banana.
link |
Now, this data augmentation process
link |
is actually independent of the, it
link |
has no notion of what is present in the image.
link |
So it can change this color arbitrarily.
link |
It can make it a red banana as well.
link |
And now what we're doing is we're
link |
telling the neural network that this red banana and,
link |
so a crop of this image which has the red banana
link |
and a crop of this image where I change the color to a purple
link |
banana should be, the features should be the same.
link |
Now, bananas aren't red or purple, mostly.
link |
So really the data augmentation process
link |
should take into account what is present in the image
link |
and what are the kinds of physical realities that are possible.
link |
It shouldn't be completely independent of the image.
link |
So you might get big gains if you, instead of being drastic,
link |
do subtle augmentation, but realistic augmentation.
link |
I'm not sure if it's subtle, but realistic for sure.
link |
If it's realistic, then even subtle augmentation
link |
will give you big benefits.
link |
And it will be, for particular domains,
link |
you might actually see, if, for example, now
link |
we're doing medical imaging, there
link |
are going to be certain kinds of geometric augmentation
link |
that are not really going to be very valid for the human body.
link |
So if you were to actually loop in data augmentation
link |
into the learning process, it will actually be much more useful.
link |
Now, this actually does take us to maybe a semi supervised
link |
kind of a setting because you do want to understand
link |
what is it that you're trying to solve.
link |
So currently self supervised learning kind of
link |
operates in the wild, right?
link |
So you do the self supervised learning,
link |
and the purists and all of us basically say that, OK,
link |
this should learn useful representations,
link |
and they should be useful for any kind of end task,
link |
no matter it's like banana recognition
link |
or like autonomous driving.
link |
Now, it's a tall order.
link |
Maybe the first baby step for us should be that, OK,
link |
if you're trying to loop in this data augmentation
link |
into the learning process, then we at least
link |
need to have some sense of what we're trying to do.
link |
Are we trying to distinguish between different types
link |
of bananas, or are we trying to distinguish between banana
link |
and apple, or are we trying to do all of these things at once?
link |
And so some notion of what happens at the end
link |
might actually help us do much better at this side.
link |
Let me ask you a ridiculous question.
link |
If I were to give you like a black box,
link |
like a choice to have an arbitrary large data
link |
set of real natural data versus really
link |
good data augmentation algorithms,
link |
which would you like to train in a self supervised way on?
link |
So natural data from the internet are arbitrary large,
link |
so unlimited data.
link |
Or it's like more controlled, good data augmentation
link |
on the finite data set.
link |
The thing is like because our learning algorithms
link |
for vision right now really rely on data augmentation,
link |
even if you were to give me like an infinite source of like
link |
image data, I still need a good data augmentation algorithm.
link |
You need something that tells you
link |
that two things are similar.
link |
And so something, because you've given me
link |
an arbitrarily large data set, I still
link |
need to use data augmentation to take that image, construct
link |
like these two perturbations of it, and then learn from it.
link |
So the thing is our learning paradigm
link |
is very primitive right now.
link |
Even if you were to give me lots of images,
link |
it's still not really useful.
link |
A good data augmentation algorithm
link |
is actually going to be more useful.
link |
So you can reduce down the amount of data
link |
that you give me by like 10 times.
link |
But if you were to give me a good data augmentation algorithm,
link |
that will probably do better than giving me like 10 times
link |
the size of that data, but me having to rely on a very
link |
primitive data augmentation algorithm.
link |
Through tagging and all those kinds of things,
link |
is there a way to discover things that are semantically
link |
similar on the internet?
link |
Obviously there is, but it might be extremely noisy.
link |
And the difference might be farther away
link |
than you would be comfortable with.
link |
So I mean, yes, tagging will help you a lot.
link |
It'll actually go a very long way in figuring out
link |
what images are related or not.
link |
And then so, but then the purists would argue that when
link |
you're using human tags, because these tags are like
link |
supervision, is it really self supervised learning
link |
now, because you're using human tags to figure out
link |
which images are like similar.
link |
Hashtag no filter means a lot of things.
link |
I mean, there are certain tags which are going to be
link |
applicable pretty much to anything.
link |
So they're pretty useless for learning.
link |
But I mean, certain tags are actually like
link |
the Eiffel Tower, for example, or the Taj Mahal.
link |
These tags are like very indicative of what's going on.
link |
And they are, I mean, they are human supervision.
link |
This is one of the tasks of discovering from human
link |
generated data, strong signals that could be
link |
leveraged for self supervision.
link |
Like humans are doing so much work already.
link |
Like many years ago, there was something that was called,
link |
I guess, human computation back in the day.
link |
Humans are doing so much work.
link |
It'd be exciting to discover ways to leverage the work
link |
they're doing to teach machines without any extra effort.
link |
An example could be, like we said, driving.
link |
Humans driving and machines can learn from the driving.
link |
I always hope that there could be some supervision signal
link |
discovered in video games, because there's so many
link |
people that play video games that it feels like
link |
so much effort is put into video games, into playing
link |
video games, and you can design video games somewhat
link |
cheaply to include whatever signals you want.
link |
It feels like that could be leveraged somehow.
link |
So people are using that.
link |
Like there are actually folks right here in UT Austin,
link |
like Philipp Krähenbühl is a professor at UT Austin.
link |
He's been working on video games as a source of supervision.
link |
I mean, it's really fun, like as a PhD student,
link |
getting to basically play video games all day.
link |
Yeah, but so I do hope that kind of thing scales.
link |
And ultimately, it boils down to discovering some
link |
undeniably very good signal.
link |
It's like masking in NLP.
link |
But that said, there are noncontrastive methods.
link |
What do noncontrastive, energy based, self supervised
link |
learning methods look like, and why are they promising?
link |
So like I said about contrastive learning,
link |
you have this notion of a positive and a negative.
link |
Now, the thing is, this entire learning paradigm
link |
really requires access to a lot of negatives to learn
link |
a good sort of feature space.
link |
The idea is if I tell you, okay, so a cat and a dog
link |
are similar, and they're very different from a banana.
link |
The thing is, this is a fairly simple analogy, right?
link |
Because bananas look visually very different
link |
from what cats and dogs do.
link |
So very quickly, if this is the only source of
link |
supervision that I'm giving you, your learning is not
link |
going to be like, after a point, the neural network
link |
is really not going to learn a lot.
link |
Because the negative that you're getting
link |
is going to be so random.
link |
So it can be, oh, a cat and a dog are similar,
link |
but they're very different from a Volkswagen Beetle.
link |
Now, this car looks very different
link |
from these animals again.
link |
So the thing is in contrastive learning,
link |
the quality of the negative sample really matters a lot.
link |
And so what has happened is basically that
link |
typically these methods that are contrastive
link |
really require access to lots of negatives,
link |
which becomes harder and harder to sort of scale
link |
when designing a learning algorithm.
link |
So that's been one of the reasons
link |
why noncontrastive methods have become popular
link |
and why people think that they're going to be more useful.
link |
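The contrastive setup with positives and negatives can be sketched as an InfoNCE-style loss for a single anchor; the temperature value and names are illustrative:

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Sketch of a contrastive (InfoNCE-style) loss for one anchor.

    The anchor is pulled toward its positive and pushed away from
    each negative. If the negatives are trivially different (a cat
    versus a Volkswagen Beetle), the loss saturates near zero and
    the network stops learning much from them.
    """
    def normalize(v):
        return v / (np.linalg.norm(v) + 1e-8)

    a = normalize(anchor)
    # Similarity to the positive first, then to each negative.
    sims = [a @ normalize(positive)] + [a @ normalize(n) for n in negatives]
    logits = np.array(sims) / temperature
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    # Cross-entropy with the positive at index 0.
    return -np.log(probs[0])
```

With an easy (orthogonal) negative the loss is near zero, while a hard negative identical to the anchor yields a much larger loss, which is the sense in which negative quality matters.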
So a noncontrastive method, for example,
link |
like clustering is one noncontrastive method.
link |
The idea basically being that you have two of these samples,
link |
so the cat and dog or two crops of this image,
link |
they belong to the same cluster.
link |
And so essentially you're basically doing clustering online
link |
when you're learning this network
link |
and which is very different from having access
link |
to a lot of negatives explicitly.
link |
The other way which has become really popular
link |
is something called self distillation.
link |
So the idea basically is that you have a teacher network
link |
and a student network,
link |
and the teacher network produces a feature.
link |
So it takes in the image
link |
and basically the neural network
link |
figures out the patterns, gets the feature out.
link |
And there's another neural network
link |
which is the student neural network
link |
and that also produces a feature.
link |
And now all you're doing is basically saying
link |
that the features produced by the teacher network
link |
and the student network should be very similar.
link |
There is no notion of a negative anymore.
link |
So it's all about similarity maximization
link |
between these two features.
link |
And so all I need to now do
link |
is figure out how to have these two sorts of parallel networks,
link |
a student network and a teacher network.
link |
And basically researchers have figured out
link |
very cheap methods to do this.
link |
So you can actually have for free really
link |
two types of neural networks.
link |
They're kind of related,
link |
but they're different enough
link |
that you can actually basically have a learning problem set up.
link |
So you can ensure that they always remain different enough
link |
so the thing doesn't collapse into something boring.
link |
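A rough sketch of the two ingredients of self distillation described here: pure similarity maximization with no negatives, plus one cheap way of keeping the teacher "related but different enough" (an exponential moving average of the student's weights, as in BYOL/DINO-style methods). The momentum value is illustrative:

```python
import numpy as np

def cosine_loss(student_feat, teacher_feat):
    """Similarity maximization: make the student's feature match the
    teacher's (in practice the teacher receives no gradient)."""
    s = student_feat / (np.linalg.norm(student_feat) + 1e-8)
    t = teacher_feat / (np.linalg.norm(teacher_feat) + 1e-8)
    return 1.0 - s @ t  # 0 when the two features point the same way

def ema_update(teacher_w, student_w, momentum=0.99):
    """Keep the teacher a slowly moving average of the student, so the
    two networks stay related but different enough to avoid a trivial
    solution."""
    return momentum * teacher_w + (1.0 - momentum) * student_w
```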
So the main sort of enemy of self supervised learning,
link |
any kind of similarity maximization technique is collapse.
link |
Collapse means that you learn
link |
the same feature representation
link |
for all the images in the world,
link |
which is completely useless.
link |
Everything is a banana.
link |
Everything is a banana.
link |
Everything is a cat.
link |
Everything is a car.
link |
And so all we need to do is basically come up
link |
with ways to prevent collapse,
link |
Contrastive learning is one way of doing it.
link |
And then for example,
link |
like clustering or self distillation
link |
are other ways of doing it.
link |
We also had a recent paper
link |
where we used like decorrelation
link |
between like two sets of features to prevent collapse.
link |
So that's inspired a little bit
link |
by like Horace Barlow's neuroscience principles.
link |
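A sketch of a decorrelation-based objective in the spirit of that work (Barlow Twins-style); the weighting and normalization details here are illustrative, not the exact published formulation:

```python
import numpy as np

def decorrelation_loss(z1, z2, lam=0.005):
    """Redundancy-reduction sketch: z1, z2 are (batch, dim) features
    from two augmented views. Build the cross-correlation matrix of
    the standardized features, drive its diagonal to 1 (the views
    agree) and its off-diagonal to 0 (feature dimensions are
    decorrelated), which rules out the collapsed solution where every
    image gets the same representation."""
    n = z1.shape[0]
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-8)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-8)
    c = (z1.T @ z2) / n  # cross-correlation matrix, shape (dim, dim)
    on_diag = ((np.diag(c) - 1.0) ** 2).sum()
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()
    return on_diag + lam * off_diag
```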
By the way, I should comment that whoever counts
link |
the number of times then the word banana,
link |
apple, cat and dog were using this conversation,
link |
wins the internet.
link |
What is SwAV, and what is the main improvement proposed
link |
in the paper Unsupervised Learning of Visual Features
link |
by Contrasting Cluster Assignments?
link |
SwAV basically is a clustering based technique,
link |
which is for again, the same thing
link |
for self supervised learning in vision,
link |
where we have two crops.
link |
And the idea basically is that you want the features
link |
from these two crops of an image to lie in the same cluster.
link |
And basically crops that are coming from different images
link |
to be in different clusters.
link |
Now, typically in a sort of,
link |
if you were to do this clustering,
link |
you would perform clustering offline.
link |
What that means is you would,
link |
if you have a data set of N examples,
link |
you would run over all of these N examples,
link |
get features for them, perform clustering.
link |
So basically get some clusters
link |
and then repeat the process again.
link |
So this is offline basically because I need to do one
link |
pass through the data to compute its clusters.
link |
SwAV is basically just a simple way of doing this online.
link |
So as you're going through the data,
link |
you're actually computing these clusters online.
link |
And so of course, there is like a lot of tricks involved
link |
in how to do this in a robust manner without collapsing.
link |
But this is the sort of key idea to it.
link |
Is there a nice way to say what is the key methodology
link |
of the clustering that enables that?
link |
Right, so the idea basically is that
link |
when you have N samples,
link |
we assume that we have access to like,
link |
there are always K clusters in a data set.
link |
K is a fixed number.
link |
So for example, K is 3000.
link |
And so if you have any,
link |
when you look at any sort of small number of examples,
link |
all of them must belong to one of these K clusters.
link |
And we impose this equipartition constraint.
link |
What this means is that basically,
link |
your entire set of N samples
link |
should be equally partitioned into K clusters.
link |
So all your K clusters are basically equal,
link |
they have equal contribution to these N samples.
link |
And this ensures that we never collapse.
link |
So collapse can be viewed as a way
link |
in which all samples belong to one cluster.
link |
So all this, if all features become the same,
link |
then you have basically just one mega cluster.
link |
You don't even have like 10 clusters or 3000 clusters.
link |
So SwAV basically ensures that at each point,
link |
all these 3000 clusters
link |
are being used in the clustering process.
link |
Basically just figure out how to do this online.
link |
And again, basically just make sure
link |
that two crops from the same image
link |
belong to the same cluster and others don't.
link |
And the fact they have a fixed K makes things simpler.
link |
Fixed K makes things simpler.
link |
Our clustering is not like really hard clustering,
link |
it's soft clustering.
link |
So basically you can be 0.2 to cluster number one
link |
and 0.8 to cluster number two.
link |
So it's not really hard.
link |
So essentially, even though we have like 3000 clusters,
link |
we can actually represent a lot of clusters.
link |
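The online equipartition idea can be sketched with a Sinkhorn-Knopp-style normalization: alternately balance the mass assigned to each cluster and the mass assigned to each sample, so no single mega-cluster absorbs everything. This is a simplified illustration, not the exact SwAV procedure:

```python
import numpy as np

def sinkhorn(scores, n_iters=3):
    """Turn raw sample-to-prototype scores into soft, balanced
    assignments.

    scores: (batch, K) similarity of each sample to each of K
    prototypes. Returns soft assignments where each row sums to 1
    (soft clustering: e.g. 0.2 to cluster one, 0.8 to cluster two)
    and the columns are approximately balanced (equipartition).
    """
    q = np.exp(scores)
    for _ in range(n_iters):
        q /= q.sum(axis=0, keepdims=True)  # balance mass across clusters
        q /= q.sum(axis=1, keepdims=True)  # each sample's assignment sums to 1
    return q
```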
What is SEER, S E E R?
link |
And what are the key results and insights in the paper,
link |
Self-Supervised Pretraining of Visual Features in the Wild?
link |
What is this big, beautiful SEER system?
link |
SEER, so I'll first go to SwAV
link |
because SwAV is actually like one
link |
of the key components for SEER.
link |
So SwAV was, when we used SwAV,
link |
it was demonstrated on ImageNet.
link |
So typically like self supervised methods,
link |
the way we sort of operate is like
link |
in the research community, we kind of cheat.
link |
So we take ImageNet, which of course I talked
link |
about as having lots of labels.
link |
And then we throw away the labels,
link |
throw away all the hard work
link |
that went behind basically the labeling process.
link |
And we pretend that it is unsupervised.
link |
But the problem here is that we have,
link |
like when we collected these images,
link |
the ImageNet dataset has a particular distribution
link |
of concepts, right?
link |
So these images are very curated.
link |
And what that means is these images,
link |
of course, belong to a certain set of noun concepts.
link |
And also ImageNet has this bias
link |
that all images contain an object
link |
which is like very big and it's typically in the center.
link |
So when you're talking about a dog,
link |
it's a well framed dog.
link |
It's towards the center of the image.
link |
So a lot of the data augmentation,
link |
a lot of the sort of hidden assumptions
link |
in self supervised learning,
link |
actually really exploit this bias of ImageNet.
link |
And so, I mean, a lot of my work,
link |
a lot of work from other people always uses ImageNet
link |
sort of as the benchmark to show
link |
the success of self supervised learning.
link |
So you're implying that there's particular limitations
link |
to this kind of dataset?
link |
Yes, I mean, it's basically because our data augmentation
link |
that we designed, like all the augmentation
link |
that we designed for self supervised learning
link |
in vision are kind of overfit to ImageNet.
link |
But you're saying a little bit hard coded
link |
in like the cropping.
link |
Exactly, the cropping parameters,
link |
the kind of lighting that we're using,
link |
the kind of blurring that we're using.
link |
Yeah, but you would, for a more in the wild dataset,
link |
you would need to be clever or more careful
link |
in setting the range of parameters
link |
and those kinds of things.
link |
So for SEER, our main goal was twofold: one,
link |
basically to move away from ImageNet for training.
link |
So the images that we used were like uncurated images.
link |
Now there's a lot of debate
link |
whether they're actually curated or not,
link |
but I'll talk about that later.
link |
But the idea was basically these are going to be
link |
random internet images that we're not going to filter out
link |
based on like particular categories.
link |
So we did not say that, oh, images that belong to dogs
link |
and cats should be the only images
link |
that come in this dataset, banana.
link |
And basically other images should be thrown out.
link |
So we didn't do any of that.
link |
So these are random internet images.
link |
And of course, it also goes back to like the problem
link |
of scale that you talked about.
link |
So these were basically about a billion or so images.
link |
And for context, the ImageNet version
link |
that we used earlier was one million images.
link |
So this is basically going like three orders
link |
of magnitude more.
link |
The idea was basically to see if we can train
link |
a very large convolutional model in a self supervised way
link |
on this uncurated, but really large set of images.
link |
And how well would this model do?
link |
So is self supervised learning really overfit to ImageNet?
link |
Or can it actually work in the wild?
link |
And it was also out of curiosity,
link |
what kind of things will this model learn?
link |
Will it actually be able to still figure out,
link |
different types of objects and so on?
link |
Would there be particular kinds of tasks
link |
it would actually do better than an ImageNet trained model?
link |
And so for SEER, one of our main findings was that
link |
we can actually train very large models
link |
in a completely self supervised way
link |
on lots of Internet images
link |
without really necessarily filtering them out,
link |
which was in itself a good thing
link |
because it's a fairly simple process, right?
link |
So you get images which are uploaded
link |
and you basically can immediately use them
link |
to train a model in an unsupervised way.
link |
You don't really need to sit and filter them out.
link |
These images can be cartoons, these can be memes,
link |
these can be actual pictures uploaded by people.
link |
And you don't really care about what these images are.
link |
You don't even care about what concepts they contain.
link |
So this was a very sort of simple setup.
link |
What image selection mechanism would you say
link |
is there like inherent in some aspect of the process?
link |
So you're kind of implying it, there's almost none.
link |
But what is there would you say if you were to introspect?
link |
Right, so it's not completely uncurated.
link |
One way of imagining uncurated is basically
link |
you have like cameras that can take pictures
link |
at random viewpoints.
link |
When people upload pictures to the Internet,
link |
they are typically going to care about the framing of it.
link |
They're not going to upload, say,
link |
the picture of a zoomed in wall, for example.
link |
Well, when we say Internet,
link |
do you mean social networks?
link |
So these are not going to be like pictures
link |
of like a zoomed in table or a zoomed in wall.
link |
So it's not really completely uncurated
link |
because people do have their like photographers bias,
link |
where they do want to keep things towards the center
link |
a little bit or like really have like,
link |
you know, nice looking things and so on in the picture.
link |
So that's the kind of bias that typically exists
link |
And also the user base, right?
link |
You're not going to get lots of pictures
link |
from different parts of the world
link |
because there are certain parts of the world
link |
where people may not actually be uploading
link |
a lot of pictures to the Internet
link |
or may not even have access to a lot of Internet.
link |
So this is a giant data set and a giant neural network.
link |
I don't think we've talked about what architectures
link |
work well for SSL, for self supervised learning.
link |
For SEER and for SwAV,
link |
we were using convolutional networks,
link |
but recently, in a work called DINO,
link |
we've basically started using transformers for vision.
link |
Both seem to work really well,
link |
ConvNets and transformers, and depending on what you want to do,
link |
you might choose to use a particular formulation.
link |
So for SEER, it was a ConvNet.
link |
It was particularly a RegNet model,
link |
which was also work from Facebook.
link |
RegNets are like really good when it comes to compute
link |
versus like accuracy.
link |
So because it was a very efficient model,
link |
compute and memory wise efficient
link |
and basically it worked really well in terms of scaling.
link |
So we used a very large RegNet model
link |
and trained it on a billion images.
link |
Can you maybe quickly comment on what RegNets are?
link |
It comes from this paper,
link |
Designing Network Design Spaces.
link |
It's just a super interesting concept
link |
that emphasizes how to create efficient neural networks,
link |
large neural networks.
link |
So one of the sort of key takeaways from this paper,
link |
which the authors like whenever you hear them present this work,
link |
they keep saying is a lot of neural networks
link |
are characterized in terms of flops.
link |
Flops basically being the floating point operations
link |
and people really love to use flops to say,
link |
this model is like really computationally heavy
link |
or like our model is computationally cheap and so on.
link |
Now it turns out that flops are really not a good indicator
link |
of how well a particular network is,
link |
like how efficient it is really.
link |
And what a better indicator is is the activation
link |
or the memory that is being used by this particular model.
link |
And so designing like one of the key findings
link |
from this paper was basically that you need to design
link |
network families or neural network architectures
link |
that are actually very efficient in the memory space as well,
link |
not just in terms of pure flops.
link |
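A back-of-the-envelope illustration of why FLOPs alone can mislead: two conv layers can cost identical FLOPs while needing very different activation memory. The accounting conventions here are rough assumptions (stride 1, same padding, multiply-accumulates counted as 2 FLOPs, float32 activations):

```python
def conv_cost(c_in, c_out, k, h, w):
    """Rough cost accounting for one k x k conv layer producing an
    h x w output: total FLOPs versus the activation memory kept
    around for backprop, the two axes the RegNet-style analysis
    argues should both be optimized."""
    flops = 2 * c_in * c_out * k * k * h * w
    activation_bytes = 4 * c_out * h * w  # float32 output feature map
    return flops, activation_bytes
```

For example, a 3x3 conv with 64 channels at 56x56 resolution and one with 256 channels at 14x14 resolution have exactly the same FLOPs, but the first keeps four times as much activation memory.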
So RegNet is basically a network architecture family
link |
that came out of this paper
link |
that is particularly good at both flops
link |
and the sort of memory required for it.
link |
And of course it builds upon like earlier work,
link |
like ResNet being like the sort of more popular inspiration
link |
for it where you have residual connections.
link |
But one of the things in this work is basically
link |
they also use like squeeze excitation blocks.
link |
So it's a lot of nice sort of technical innovation
link |
in all of this from prior work
link |
and a lot of the ingenuity of these particular authors
link |
in how to combine these multiple building blocks.
link |
But the key constraint was optimize for both flops
link |
and memory when you're basically doing this.
link |
Don't just look at flops.
link |
And that allows you to, what, have very large
link |
networks that, through this process, are optimized
link |
for efficiency, for low memory.
link |
Also, just in terms of pure hardware,
link |
they fit very well on GPU memory.
link |
So they can be like really powerful neural network
link |
architectures with lots of parameters, lots of flops,
link |
but also because they're like efficient in terms of
link |
the amount of memory that they're using,
link |
you can actually fit a lot of these on,
link |
like you can fit a very large model
link |
on a single GPU for example.
link |
Would you say that the choice of architecture matters more
link |
than the choice of maybe data augmentation techniques?
link |
Is there a possibility to say what matters more?
link |
You kind of imply that you can probably go really far
link |
with just using basic ConvNets.
link |
All right, I think like data and data augmentation,
link |
the algorithm being used for the self supervised training
link |
matters a lot more than the particular kind of architecture.
link |
With different types of architecture,
link |
you will get different like properties
link |
in the resulting sort of representation.
link |
But really, I mean, the secret sauce is in the data
link |
augmentation and the algorithm being used to train them.
link |
The architectures, I mean, at this point,
link |
a lot of them perform very similarly,
link |
depending on like the particular task that you care about,
link |
they have certain advantages and disadvantages.
link |
Is there something interesting to be said
link |
about what it takes with SEER to train
link |
a giant neural network?
link |
You're talking about a huge amount of data,
link |
a huge neural network.
link |
Is there something interesting to be said
link |
of how to effectively train something like that fast?
link |
I mean, so the model was like a billion parameters.
link |
And it was trained on a billion images.
link |
So basically the same number of parameters
link |
as the number of images.
link |
And it took a while.
link |
I don't remember the exact number.
link |
It's in the paper.
link |
But it took a while.
link |
I guess what I'm trying to get at is
link |
when you're thinking of scaling this kind of thing.
link |
I mean, one of the exciting possibilities
link |
of self supervised learning is the several orders
link |
of magnitude scaling of everything,
link |
both the neural network and the size of the data.
link |
And so the question is,
link |
do you think there's some interesting tricks
link |
to do large scale distributed compute?
link |
or is that really outside of even deep learning?
link |
That's more about like hardware engineering.
link |
I think more and more there is like this,
link |
a lot of like systems are designed,
link |
basically taking into account
link |
the machine learning needs, right?
link |
So because whenever you're doing this kind
link |
of distributed training,
link |
there is a lot of inter communication between nodes.
link |
So like gradients or the model parameters are being passed.
link |
So you really want to minimize communication costs
link |
when you really want to scale these models up.
link |
You basically want to be able to do
link |
as limited an amount of communication as possible.
link |
So currently like a dominant paradigm
link |
is synchronized sort of training.
link |
So essentially after every sort of gradient step,
link |
you basically have like a synchronization step
link |
between all the sort of compute chips
link |
that you're running on.
link |
I think asynchronous training was popular,
link |
but it doesn't seem to perform as well.
link |
But in general, I think that's sort of the,
link |
I guess it's outside my scope as well.
link |
But the main thing is like minimize the amount
link |
of synchronization steps that you have.
link |
That has been the key take away at least in my experience.
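The synchronous pattern described above can be sketched with a tiny toy simulation (my own illustration, not any specific framework's API): each "worker" computes a gradient on its own data shard, and a single averaging step, standing in for the all-reduce, is the only communication per update, which keeps every replica identical.

```python
import numpy as np

# Toy simulation of synchronous data-parallel SGD: each worker computes
# a gradient on its own shard, then one all-reduce (here, a mean) is the
# only inter-node communication per step, so all replicas stay identical.

def local_gradient(w, x, y):
    # gradient of 0.5 * (w*x - y)^2 with respect to w
    return (w * x - y) * x

def sync_sgd_step(weights, shards, lr=0.1):
    grads = [local_gradient(w, x, y) for w, (x, y) in zip(weights, shards)]
    avg = np.mean(grads)              # the all-reduce: the only communication
    return [w - lr * avg for w in weights]

# two workers start from identical replicas of w = 0.0
shards = [(1.0, 2.0), (2.0, 2.0)]     # one (x, y) pair per worker
weights = [0.0, 0.0]
for _ in range(5):
    weights = sync_sgd_step(weights, shards)

# after every synchronized step the replicas remain bit-identical
assert abs(weights[0] - weights[1]) < 1e-12
```

In a real setup the averaging would be a `torch.distributed` all-reduce over GPUs, and reducing how often (and how much) that step communicates is exactly the cost being minimized.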
link |
The others, I have no idea about how to design the chip.
link |
Yeah, there are very few things that make Jim Keller's eyes
link |
light up as much as talking about giant computers doing
link |
that fast communication you're talking about,
link |
well, when they're training machine learning systems.
link |
What is VISSL, the PyTorch based SSL library?
link |
What are the use cases that you might have?
link |
VISSL basically was born out of a lot of us
link |
at Facebook doing the self supervised learning research.
link |
So it's a common framework in which we have
link |
like a lot of self supervised learning methods
link |
implemented for vision.
link |
It's also, it has in itself like a benchmark of tasks
link |
that you can evaluate the self supervised representations on.
link |
So the use case for it is basically for anyone
link |
who's either trying to evaluate their self supervised model
link |
or train their self supervised model
link |
or a researcher who's trying to build
link |
a new self supervised technique.
link |
So it's basically supposed to be all of these things.
link |
So as a researcher before VISSL, for example,
link |
or like when we started doing this work
link |
fairly seriously at Facebook,
link |
it was very hard for us to go and implement
link |
every self supervised learning model
link |
and test it out in a sort of consistent manner.
link |
The experimental setup was very different
link |
across different groups.
link |
Even when someone said that they were reporting image net
link |
accuracy, it could mean lots of different things.
link |
So with VISSL, we tried to really sort of standardize that
link |
as much as possible.
link |
And it was a paper like we did in 2019
link |
just about benchmarking.
link |
And so VISSL basically builds upon a lot of
link |
this kind of work that we did about like benchmarking.
link |
And then every time we try to like,
link |
we come up with a self supervised learning method,
link |
a lot of us try to push that into VISSL as well
link |
just so that it basically is like the central piece
link |
where a lot of these methods can reside.
link |
Just out of curiosity, people maybe,
link |
so certainly outside of Facebook, but just researchers,
link |
or just even people that know how to program in Python
link |
and know how to use PyTorch,
link |
what would be the use case?
link |
What would be a fun thing to play around with VISSL on?
link |
Like what's a fun thing to play around
link |
with self supervised learning on, would you say?
link |
Is there a good Hello World program?
link |
Like is it always about big size
link |
that's important to have?
link |
Or is there a fun little smaller case
link |
playgrounds to play around with?
link |
So we're trying to like push something towards that.
link |
I think there are a few setups out there,
link |
but nothing like super standard on the smaller scale.
link |
I mean, ImageNet in itself is actually pretty big also.
link |
So that is not something which is like feasible
link |
for a lot of people, but we are trying to like push up
link |
with like smaller sort of use cases.
link |
The thing is at a smaller scale,
link |
a lot of the observations or a lot of the algorithms
link |
that work don't necessarily translate
link |
into the medium or the larger scale.
link |
So it's really tricky to come up with a good small scale setup
link |
where a lot of your empirical observations
link |
will really translate to the other setup.
link |
So it's been really challenging.
link |
I've been trying to do that for a little bit as well
link |
because it does take time to train stuff on ImageNet,
link |
it does take time to train on like more images,
link |
but pretty much every time I've tried to do that,
link |
it's been unsuccessful because all the observations
link |
I draw from my set of experiments on a smaller dataset
link |
don't translate into ImageNet
link |
or like don't translate into another sort of dataset.
link |
So it's been hard for us to figure this one out,
link |
but it's an important problem.
link |
So there's this really interesting idea
link |
of learning across multiple modalities.
link |
You have a CVPR 2021 best paper candidate
link |
titled Audiovisual Instance Discrimination
link |
with Crossmodal Agreement.
link |
What are the key results, insights in this paper
link |
and what can you say in general about the promise
link |
and power of multimodal learning?
link |
For this paper, it actually came as a little bit
link |
of a shock to me at how well it worked.
link |
So I can describe what the problem setup was.
link |
So it's been used in the past by lots of folks,
link |
like for example, Andrew Owens from MIT,
link |
Alyosha Efros from Berkeley,
link |
Andrew Zisserman from Oxford.
link |
So a lot of these people have been sort of showing results
link |
Of course, I was aware of this result,
link |
but I wasn't really sure how well it would work in practice
link |
for like other sort of downstream tasks.
link |
So the results kept getting better
link |
and I wasn't sure if like a lot of our insights
link |
from self supervised learning would translate
link |
into this multimodal learning problem.
link |
So multimodal learning is when you have like,
link |
when you have multiple modalities.
link |
And that's not equal.
link |
Okay, so the particular modalities that we worked on
link |
in this work were audio and video.
link |
So the idea was basically if you have a video,
link |
you have its corresponding audio track.
link |
And you want to use both of these signals,
link |
the audio signal and the video signal
link |
to learn a good representation for video
link |
and good representation for audio.
link |
Like this podcast.
link |
Like this podcast, exactly.
link |
So what we did in this work was basically trained
link |
two different neural networks,
link |
one on the video signal, one on the audio signal.
link |
And what we wanted is basically the features
link |
that we get from both of these neural networks
link |
should be similar.
link |
So it should basically be able to produce
link |
the same kinds of features from the video
link |
and the same kinds of features from the audio.
link |
Now, why is this useful?
link |
Well, for a lot of these objects that we have,
link |
there is a characteristic sound, right?
link |
So trains, when they go by,
link |
they make a particular kind of sound.
link |
Boats make a particular kind of sound.
link |
People, when they're jumping around,
link |
they will like shout or whatever.
link |
Bananas don't make a sound.
link |
So well, you can't learn anything about bananas there.
link |
Or when humans mention bananas.
link |
When they say the word banana, then probably.
link |
So you can't trust basically anything
link |
that comes out of a human's mouth as a source,
link |
that source of audio is useless.
link |
So the typical use case is basically like,
link |
for example, someone playing a musical instrument.
link |
So guitars have a particular kind of sound and so on.
link |
So because a lot of these things are correlated,
link |
the idea in multimodal learning
link |
is to take these two kinds of modalities,
link |
and learn a common embedding space,
link |
a common feature space,
link |
where both of these related modalities
link |
can basically be close together.
link |
And again, you use contrastive learning for this.
link |
So in contrastive learning,
link |
basically the video and the corresponding audio are positives,
link |
and you can take any other video or any other audio,
link |
and that becomes a negative.
link |
And so basically that's it.
link |
It's just a simple application of contrastive learning.
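A minimal toy sketch of that setup (my own illustration, not the paper's code): video clip i and its own audio track i form a positive pair, and every other audio in the batch serves as a negative, with a standard contrastive (InfoNCE-style) loss pulling matching pairs together.

```python
import numpy as np

# Cross-modal contrastive toy example: embeddings from a "video network"
# and an "audio network" are compared; the matching (video_i, audio_i)
# pair is the positive, all other audios in the batch are negatives.

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def cross_modal_nce(video_emb, audio_emb, temperature=0.1):
    v, a = l2_normalize(video_emb), l2_normalize(audio_emb)
    logits = v @ a.T / temperature                   # all (video, audio) similarities
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))              # maximize the matching pairs

rng = np.random.default_rng(0)
content = rng.normal(size=(4, 8))                    # pretend both nets encode the same concepts
aligned = cross_modal_nce(content, content + 0.01 * rng.normal(size=(4, 8)))
mismatched = cross_modal_nce(content, np.roll(content, 1, axis=0))  # each audio from the wrong video
assert aligned < mismatched
```

The same structure scales up when the toy embeddings are replaced by the outputs of the two neural networks and the batch spans many videos.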
link |
The main sort of finding from this work for us
link |
was basically that you can actually learn
link |
very, very powerful feature representations,
link |
very, very powerful video representations.
link |
So you can learn the sort of video network
link |
that we ended up learning
link |
can actually be used for downstream,
link |
for example, recognizing human actions,
link |
or recognizing different types of sounds, for example.
link |
So this was sort of the key finding.
link |
Can you give kind of an example of a human action
link |
or just so we can build up intuition
link |
of what kind of thing?
link |
Right, so there is this data set called Kinetics,
link |
for example, which has like 400 different types of human actions.
link |
So people jumping, people doing different kinds
link |
of sports or different types of swimming.
link |
So like different strokes and swimming, golf and so on.
link |
So there are like just different types of actions right there.
link |
And the point is this kind of video network
link |
that you learn in a self supervised way
link |
can be used very easily to kind of recognize
link |
these different types of actions.
link |
It can also be used for recognizing
link |
different types of objects.
link |
And what we did is we tried to visualize
link |
whether the network can figure out
link |
where the sound is coming from.
link |
So basically give it a video
link |
of a person just strumming a guitar,
link |
but of course there is no audio in this.
link |
And now you give it the sound of a guitar.
link |
And you ask like basically try to visualize
link |
where the network thinks the sound is coming from.
link |
And then it can kind of basically draw like,
link |
when you visualize it,
link |
you can see that it's basically focusing on the guitar.
link |
Yeah, that's so real.
link |
And the same thing, for example,
link |
for certain people's voices,
link |
like famous celebrities voices,
link |
it can actually figure out where their mouth is.
link |
So it can actually distinguish different people's voices,
link |
for example, a little bit as well.
link |
Without that ever being annotated in any way.
link |
Right, so this is all what it had discovered.
link |
We never pointed out that this is a guitar
link |
and this is the kind of sound it produces.
link |
It can actually naturally figure that out
link |
because it's seen so many correlations of this sound
link |
coming with this kind of like an object
link |
that it basically learns to associate this sound
link |
with this kind of an object.
link |
Yeah, that's really fascinating, right?
link |
That's really interesting.
link |
So the idea with this kind of network
link |
is then you then fine tune it for a particular task.
link |
So this is forming like a really good knowledge base
link |
within a neural network based on which you could then,
link |
then train a little bit more
link |
to accomplish a specific task well.
link |
Exactly, so you don't need a lot of videos of humans
link |
doing actions annotated.
link |
You can just use a few of them to basically get your.
link |
How much insight do you draw from the fact
link |
that it can figure out where the sound is coming from?
link |
I'm trying to see, so that's kind of very,
link |
it's very CVPR, beautiful, right?
link |
It's a cool little insight.
link |
I wonder how profound that is.
link |
Does it speak to the idea that multiple modalities
link |
are somehow much bigger than the sum of their parts
link |
or is it really, really useful to have multiple modalities
link |
or is it just that cool thing that there's parts
link |
of our world that can be revealed
link |
like effectively through multiple modalities,
link |
but most of it is really all about vision
link |
or about one of the modalities.
link |
I would say I'm tending a little more towards the second part.
link |
So most of it can be sort of figured out with one modality,
link |
but having an extra modality always helps you.
link |
So in this case, for example, like one thing is when you're,
link |
if you observe someone cutting something
link |
and you don't have any sort of sound there,
link |
whether it's an apple or whether it's an onion,
link |
it's very hard to figure that out.
link |
But if you hear someone cutting it,
link |
it's very easy to figure it out
link |
because apples and onions make very different kinds
link |
of characteristic sounds when they're being cut.
link |
So you really figure this out based on audio.
link |
So your life will become much easier
link |
when you have access to different kinds of modalities.
link |
And the other thing is,
link |
so I like to relate it in this way,
link |
it may be like completely wrong,
link |
but the distributional hypothesis in NLP, right?
link |
Where context basically gives kind of meaning to a word.
link |
Sound kind of does that too, right?
link |
So if you have the same sound,
link |
so that's the same context across different videos,
link |
you're very likely to be observing
link |
the same kind of concept.
link |
So that's the kind of reason
link |
why it figures out the guitar thing, right?
link |
It observed the same sound across multiple different videos
link |
and it figures out maybe this is the common factor
link |
that's actually doing it.
link |
I wonder, I used to have this argument with my dad a bunch
link |
for creating general intelligence,
link |
whether smell is important,
link |
like if that's important sensory information.
link |
Mostly we're talking about like falling in love
link |
with an AI system.
link |
And for him, smell and touch are important.
link |
And I was arguing that it's not at all,
link |
it's nice and everything,
link |
but like you can fall in love with just language really,
link |
but voice is very powerful and vision is next
link |
and smell is not that important.
link |
Can I ask you about this process of active learning?
link |
You mentioned interactivity.
link |
Is there some value within the self supervised learning
link |
context to select parts of the data in intelligent ways
link |
such that they would most benefit the learning process?
link |
I mean, I know I'm talking to an active learning fan here,
link |
so of course I know the answer.
link |
First you were talking bananas
link |
and now you're talking about active learning, I love it.
link |
I think Yann LeCun told me that active learning
link |
is not that interesting.
link |
And I think back then I didn't want to argue with him too much,
link |
but when we talk again,
link |
we're gonna spend three hours arguing about active learning.
link |
My sense was you can go extremely far with active learning,
link |
you know, perhaps farther than anything else.
link |
Like the, to me, there's this kind of intuition
link |
that similar to data augmentation,
link |
you can get a lot from the data,
link |
from intelligent optimized usage of the data.
link |
I'm trying to speak generally in such a way
link |
that includes data augmentation and active learning,
link |
that there's something about maybe interactive exploration
link |
of the data that at least as part of the solution
link |
to intelligence, like an important part.
link |
I don't know what your thoughts
link |
are on active learning in general.
link |
I actually really like active learning.
link |
So back in the day we did this largely ignored
link |
CVPR paper called Learning by Asking Questions.
link |
So the idea was basically you would train an agent
link |
that would ask a question about the image,
link |
it would get an answer.
link |
And basically then it would update itself,
link |
it would see the next image,
link |
it would decide what's the next hardest question
link |
that I can ask to learn the most.
link |
And the idea was basically because it was being smart
link |
about the kinds of questions it was asking,
link |
it would learn in fewer samples,
link |
it would be more efficient at using data.
link |
And we did find to some extent
link |
that it was actually better than randomly asking questions.
link |
Kind of weird thing about active learning is
link |
it's also a chicken and egg problem
link |
because when you look at an image
link |
to ask a good question about the image
link |
you need to understand something about the image.
link |
You can't ask a completely arbitrarily random question,
link |
it may not even apply to that particular image.
link |
So there is some amount of understanding or knowledge
link |
that basically keeps getting built
link |
when you're doing active learning.
link |
So I think active learning in by itself is really good.
link |
And the main thing we need to figure out is basically
link |
how do we come up with a technique
link |
to first model what the model knows
link |
and also model what the model does not know.
link |
I think that's the sort of beauty of it, right?
link |
Because when you know that there are certain things
link |
that you don't know anything about,
link |
asking a question about those concepts
link |
is actually going to bring you the most value.
link |
And I think that's the sort of key challenge.
link |
Now self supervised learning by itself,
link |
like selecting data for it and so on,
link |
that's actually really useful.
link |
But I think that's a very narrow view
link |
of looking at active learning, right?
link |
If you look at it more broadly,
link |
it is basically about if the model has a knowledge
link |
about N concepts,
link |
and it is weak basically about certain things.
link |
So it needs to ask questions either to discover new concepts
link |
or to basically like increase its knowledge
link |
about these N concepts.
link |
So at that level, it's a very powerful technique.
link |
I actually do think it's going to be really useful.
link |
Even in like simple things such as like data labeling,
link |
it's super useful.
link |
So here is like one simple way
link |
that you can use active learning.
link |
For example, you have your self supervised model,
link |
which is very good at predicting similarities
link |
and dissimilarities between things.
link |
And so if you label a picture as basically say a banana,
link |
now you know that all the images
link |
that are very similar to this image
link |
are also likely to contain bananas.
link |
So probably when you want to understand what else
link |
is a banana, you're not going to use these other images.
link |
You're actually going to use an image
link |
that is not completely dissimilar,
link |
but somewhere in between,
link |
which is not super similar to this image,
link |
but not super dissimilar either.
link |
And that's going to tell you a lot more
link |
about what this concept of a banana is.
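That heuristic can be sketched in a few lines (a hypothetical illustration; the embedding values and the fixed mid-similarity target are my own assumptions, not anyone's actual system): given one labeled "banana" embedding, skip the near-duplicates and the clearly dissimilar images, and query a label for an in-between image instead.

```python
import numpy as np

# Active-learning query selection sketch: pick the unlabeled image whose
# cosine similarity to the labeled example is closest to the middle of
# the range -- not a near-duplicate, not obviously unrelated.

def pick_query(labeled_emb, pool_emb, target=0.5):
    labeled = labeled_emb / np.linalg.norm(labeled_emb)
    pool = pool_emb / np.linalg.norm(pool_emb, axis=1, keepdims=True)
    sims = pool @ labeled                 # similarity of each candidate to the banana
    return int(np.argmin(np.abs(sims - target)))

banana = np.array([1.0, 0.0])             # the one labeled example
pool = np.array([
    [0.99, 0.05],                         # near-duplicate banana photo: little new info
    [0.6, 0.8],                           # somewhat banana-like: the useful query
    [-1.0, 0.1],                          # clearly dissimilar: probably not a banana at all
])
assert pick_query(banana, pool) == 1      # the in-between image gets queried
```

A learned acquisition function, as discussed next, would replace the fixed mid-similarity target with something trained end to end.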
link |
So that's kind of a heuristic.
link |
I wonder if it's possible to also learn,
link |
learn ways to discover the most likely
link |
the most beneficial image.
link |
So like, so not just looking a thing
link |
that's somewhat similar to a banana,
link |
but not exactly similar,
link |
but have some kind of more complicated learning system,
link |
like learned discovery mechanism
link |
that tells you what image to look for.
link |
Like how, yeah, like actually in a self supervised way,
link |
learning strictly a function that says,
link |
is this image going to be very useful to me,
link |
given what I currently know?
link |
I think there is a lot of synergy there.
link |
It's just, I think, yeah, it's going to be explored.
link |
I think very much related to that.
link |
I kind of think of what Tesla autopilot is doing
link |
at currently as kind of active learning.
link |
There's something that Andrej Karpathy and his team
link |
are calling data engine.
link |
So you're basically deploying a bunch of instantiations
link |
of a neural network into the wild
link |
and they're collecting a bunch of edge cases
link |
that are then sent back for annotation,
link |
in particular, edge cases defined as near failure
link |
or some weirdness on a particular task.
link |
It's the not exactly a banana,
link |
but almost a banana cases, sent back for annotation
link |
and then there's this loop that keeps going
link |
and you keep retraining and retraining
link |
and the active learning step there,
link |
or whatever you want to call it,
link |
is the cars themselves that are sending you back the data,
link |
like what the hell happened here?
link |
What are your thoughts about that sort of deployment
link |
of neural networks in the wild?
link |
Another way to ask the question, first, what are your thoughts,
link |
and maybe, if you want to comment:
link |
is there applications for autonomous driving,
link |
like computer vision based autonomous driving,
link |
applications of self supervised learning
link |
in the context of computer vision based autonomous driving?
link |
I think for self supervised learning to be used
link |
in autonomous driving, there's lots of opportunities.
link |
Just like pure consistency in predictions is one way, right?
link |
So because you have this nice sequence of data
link |
that is coming in, a video stream,
link |
associated of course with the actions
link |
that say the car took,
link |
you can form a very nice predictive model
link |
of what's happening.
link |
So for example,
link |
one way possibly in which they're figuring out
link |
what data to get labeled is basically
link |
through prediction uncertainty, right?
link |
So you predict that the car was going to turn right.
link |
So this was the action that was going to happen,
link |
say in the shadow mode and now the driver turned left.
link |
And this is a really big surprise.
link |
So basically by forming these good predictive models,
link |
you are, I mean, these are kind of self supervised models.
link |
Prediction models are basically being trained
link |
just by looking at what's going to happen next
link |
and asking them to predict what's going to happen next.
link |
So I would say this is really like one use
link |
of self supervised learning.
link |
It's a predictive model
link |
and you're learning a predictive model
link |
basically just by looking at what data you have.
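The surprise-driven selection just described can be sketched as follows (my own toy invention, not Tesla's actual pipeline): a predictive model runs in shadow mode, and frames where the human driver's action strongly disagrees with the prediction are flagged for labeling and retraining.

```python
# Shadow-mode data selection sketch: the model predicts a distribution
# over the driver's next action; when the action that actually happened
# was assigned low probability, the frame is a "surprise" worth keeping.

def surprise(predicted_probs, actual_action):
    # low predicted probability for what actually happened = high surprise
    return 1.0 - predicted_probs[actual_action]

log = [
    ({"left": 0.05, "straight": 0.90, "right": 0.05}, "straight"),  # as expected
    ({"left": 0.10, "straight": 0.10, "right": 0.80}, "left"),      # model expected right
]
flagged = [i for i, (probs, action) in enumerate(log) if surprise(probs, action) > 0.5]
assert flagged == [1]   # only the surprising frame gets sent back for annotation
```

The thresholds and action vocabulary here are made up for illustration; the point is only that the predictive model itself supplies the signal for which data to collect.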
link |
Is there something about that active learning context
link |
that you find insights from?
link |
Like that kind of deployment of the system,
link |
seeing cases where it doesn't perform as you expected
link |
and then retraining the system based on that?
link |
I think that, I mean, that really resonates with me.
link |
It's super smart to do it that way.
link |
Because I mean, the thing is with any kind of like
link |
practical system like autonomous driving,
link |
there are those edge cases
link |
that are the things that are actually the problem, right?
link |
I mean, highway driving or like freeway driving
link |
has basically been like, there has been a lot of success
link |
in that particular part of autonomous driving
link |
for a long time, I would say like since the 80s or something.
link |
Now, the point is all these failure cases
link |
are the sort of reason why autonomous driving
link |
hasn't become like super, super mainstream
link |
available like in every possible car right now.
link |
And so basically by really scaling this problem out
link |
by really trying to get all of these edge cases out
link |
as quickly as possible.
link |
And then just like using those to improve your model,
link |
that's super smart.
link |
And prediction uncertainty to do that is like
link |
one really nice way of doing it.
link |
Let me put you on the spot.
link |
So we mentioned offline Jitendra,
link |
he thinks that the Tesla computer vision approach
link |
or really any approach for autonomous driving
link |
How many years away, if you have to bet all your money on it,
link |
are we from solving autonomous driving
link |
with this kind of computer vision only
link |
machine learning based approach?
link |
Okay, so what does solving autonomous driving mean?
link |
Does it mean solving it in the US?
link |
Does it mean solving it in India?
link |
Because I can tell you that very different types
link |
of driving are happening.
link |
Not India, not Russia.
link |
In the United States.
link |
so what solving means is when the car says it has control,
link |
it is fully liable.
link |
You can go to sleep; it's driving by itself.
link |
So this is highway and city driving,
link |
but not everywhere, but mostly everywhere.
link |
And it's let's say significantly better,
link |
like say five times less accidents than humans.
link |
Sufficiently safer such that the public feels
link |
like that transition is enticing, beneficial
link |
both for our safety and financially,
link |
all those kinds of things.
link |
Okay, so first disclaimer,
link |
I'm not an expert in autonomous driving.
link |
So let me put it out there.
link |
I would say like at least five to 10 years.
link |
This would be my guess, from now.
link |
I'm actually very impressed.
link |
Like when I sat in a friend's Tesla recently
link |
and of course I was looking,
link |
so on the screen,
link |
it basically shows all the detections and everything
link |
that the car is doing as you're driving by.
link |
And that's super distracting for me as a person
link |
because all I keep looking at is like the bounding boxes
link |
and the cars it's tracking, and it's really impressive.
link |
Like especially when it's raining
link |
and it's able to do that,
link |
that was the most impressive part for me.
link |
It's actually able to get through rain and do that.
link |
And one of the reasons why like a lot of us believed
link |
and I would put myself in that category
link |
is LiDAR based sort of technology
link |
for autonomous driving was the key driver, right?
link |
So Waymo was using it for the longest time.
link |
And Tesla then decided to go this completely other route
link |
that oh, we're not going to even use LiDAR.
link |
So their initial system I think was camera and radar based
link |
and now they're actually moving
link |
to a completely like vision based system.
link |
And so that was just like, it sounded completely crazy.
link |
Like LiDAR is very useful in cases
link |
where you have low visibility.
link |
Of course it comes with its own set of complications.
link |
But now to see that happen in like on a live Tesla
link |
that basically just proves everyone wrong,
link |
I would say in a way.
link |
And that's just working really well.
link |
I think there were also like a lot of advancements
link |
in camera technology.
link |
Now, I know at CMU when I was there,
link |
there was a particular kind of camera
link |
that had been developed that was really good
link |
at basically low visibility setting.
link |
So like lots of snow and lots of rain,
link |
it could actually still have a very reasonable visibility.
link |
And I think there are lots of these kinds of innovations
link |
that will happen on the sensor side itself
link |
which is actually going to make this very easy
link |
And so maybe that's actually why I'm more optimistic
link |
about vision based self like autonomous driving.
link |
I was gonna call it self supervised driving,
link |
but vision based autonomous driving,
link |
that's the reason I'm quite optimistic about it.
link |
Because I think there are going to be lots
link |
of these advances on the sensor side itself.
link |
So acquiring this data,
link |
we're actually going to get much better about it.
link |
And then of course when once we're able to scale out
link |
and get all of these edge cases in,
link |
as like Andre described,
link |
I think that's going to take us very far.
link |
Yeah, so it's funny,
link |
I'm very much with you on the five to 10 years,
link |
I'm not sure how you meant it to sound,
link |
but for some people that might seem like really far away,
link |
and then for other people,
link |
it might seem like very close.
link |
There's a lot of fundamental questions
link |
about how much game theory is in this whole thing.
link |
So how much is this simply collision avoidance problem?
link |
And how much of it is,
link |
you're still interacting with other humans in the scene,
link |
and you're trying to create an experience that's compelling
link |
so you want to get from point A to point B quickly,
link |
you want to navigate the scene in a safe way,
link |
but you also want to show some level of aggression,
link |
because, well, certainly this is why you're screwed in India
link |
because you have to show aggression.
link |
Or Jersey, or New Jersey.
link |
So like, or New York, or basically any major city,
link |
but I think it's probably Elon that I talked the most
link |
about this, which is a surprise to the level
link |
of which they're not considering human beings
link |
as a huge problem in this, as a source of problems.
link |
Like whether driving is fundamentally a robot
link |
versus the environment problem,
link |
versus like you can just consider humans
link |
not part of the problem.
link |
I used to think humans are almost certainly
link |
have to be modeled really well.
link |
Pedestrians and cyclists and humans inside of the cars,
link |
you have to have like mental models for them.
link |
You cannot just see it as objects.
link |
But more and more, it's like the,
link |
it's the same kind of intuition breaking thing
link |
that self supervised learning does,
link |
which is, well, maybe through the learning,
link |
you'll get all the human,
link |
like human information you need, right?
link |
Like maybe you'll get it just with enough data.
link |
You don't need to have explicit good models
link |
of human behavior.
link |
Maybe you get it through the data.
link |
So I mean, my skepticism comes also from knowing
link |
a lot of automotive companies
link |
and how difficult it is to be innovative.
link |
I was skeptical that they would be able at scale
link |
to convert the driving scene across the world
link |
into digital form such that you can create
link |
this data engine at scale.
link |
And the fact that Tesla is at least getting there
link |
or are already there makes me think
link |
that it's now starting to be coupled
link |
to this self supervised learning vision,
link |
which is like, if that's gonna work,
link |
if through purely this process you can get really far,
link |
then maybe you can solve driving that way.
link |
I tend to believe we don't give enough credit
link |
to how amazing humans are both at driving
link |
and at supervising autonomous systems.
link |
And also,
link |
I wish there was much more driver sensing inside Teslas
link |
and much deeper consideration of human factors,
link |
like understanding psychology and drowsiness
link |
and all those kinds of things.
link |
When the car does more and more of the work,
link |
how to keep utilizing the little human supervision
link |
that is needed to keep this whole thing safe.
link |
I mean, it's a fascinating dance of human robot interaction.
link |
To me, autonomous driving for a long time
link |
is a human robot interaction problem.
link |
It is not a robotics problem or computer vision problem.
link |
Like you have to have a human in the loop.
link |
But so, which is why I think it's 10 years plus.
link |
But I do think there'll be a bunch of cities and contexts
link |
where geo restricted, it will work really, really damn well.
link |
So I think for me, it's five if I'm being optimistic
link |
and it's going to be five for a lot of cases.
link |
And 10 plus, yeah, I agree with you.
link |
10 plus, basically, if we want to cover most of,
link |
say, the contiguous United States or something.
link |
So my optimistic is five and pessimistic is 30.
link |
I have a long tail on this one.
link |
I've watched enough driving videos.
link |
I've watched enough pedestrians to think like we may be,
link |
like there's a small part of me still, not a small,
link |
like a pretty big part of me that thinks
link |
we will have to build AGI to solve driving.
link |
Like there's something to me like,
link |
because humans are part of the picture,
link |
deeply part of the picture,
link |
and also human society is part of the picture
link |
in that human life is at stake.
link |
Anytime a robot kills a human,
link |
it's not clear to me that that's not a problem
link |
that machine learning will also have to solve.
link |
Like you have to integrate that into the whole thing.
link |
Just like Facebook or social networks,
link |
one thing is to say how to make
link |
a really good recommender system.
link |
And then the other thing is to integrate
link |
into that recommender system,
link |
all the journalists that will write articles
link |
about that recommender system.
link |
Like you have to consider the society
link |
within which the AI system operates.
link |
And, like, politicians too,
link |
there's the regulatory stuff for autonomous driving.
link |
It's kind of fascinating that the more successful
link |
your AI system becomes,
link |
the more it gets integrated in society
link |
and the more the politicians and the public
link |
and the clickbait journalists
link |
and all the different fascinating forces
link |
of our society start acting on it.
link |
And then it's no longer how good you are
link |
at doing the initial task.
link |
It's also how good you are at navigating human nature,
link |
which is a fascinating space.
link |
What do you think are the limits of deep learning?
link |
If you allow me, we'll zoom out a little bit
link |
into the big question of artificial intelligence.
link |
You said dark matter of intelligence
link |
is self supervised learning, but there could be more.
link |
What do you think the limits of self supervised learning
link |
and just learning in general, deep learning are?
link |
I think like for deep learning in particular,
link |
because self supervised learning is I would say
link |
a little bit more vague right now.
link |
So, for something that's so vague,
link |
it's hard to predict what its limits are going to be.
link |
But like I said, I think anywhere you want to interact
link |
with humans, self supervised learning kind of hits a boundary
link |
very quickly because you need to have an interface
link |
to be able to communicate with the human.
link |
So really like if you have just like vacuous concepts
link |
or like just like nebulous concepts discovered by a network,
link |
it's very hard to communicate those to the human
link |
without like inserting some kind of human knowledge
link |
or some kind of like human bias there.
link |
In general, I think for deep learning,
link |
the biggest challenge is just like data efficiency.
link |
Even with self supervised learning,
link |
even with anything else,
link |
if you just see a single concept once,
link |
like one image of a, like I don't know
link |
whatever you want to call it, like any concept,
link |
it's really hard for these methods to generalize
link |
by looking at just one or two samples of things.
link |
And that has been a real challenge.
link |
And I think that's actually why like these edge cases,
link |
for example, for Tesla are actually that important.
link |
Because if you see just one instance of the car failing,
link |
and if you just annotate that
link |
and you get that into your data set,
link |
you have like a very limited guarantee
link |
that it's not going to happen again.
link |
And you're actually going to be able to recognize
link |
this kind of instance in a very different scenario.
link |
So like when it was snowing,
link |
so you got that thing labeled when it was snowing,
link |
but now when it's raining,
link |
you're actually not able to get it.
link |
Or you basically have the same scenario
link |
in a different part of the world.
link |
So the lighting was different or so on.
link |
So it's just really hard for these models,
link |
like deep learning, especially to do that.
link |
What's your intuition?
link |
How do we solve the handwritten digit recognition problem
link |
when we only have one example for each number?
link |
It feels like humans are using something like transfer learning.
link |
Right, I think it's,
link |
we are good at transferring knowledge a little bit.
link |
We are just better at like,
link |
for a lot of these problems
link |
where we are generalizing from a single sample,
link |
recognizing from a single sample,
link |
we are using a lot of our own domain knowledge
link |
and a lot of our like inductive bias
link |
into that one sample to generalize it.
link |
So I've never seen you write the number nine, for example.
link |
And if you were to write it, I would still get it.
link |
And if you were to write a different kind of alphabet
link |
and like write it in two different ways,
link |
I would still probably be able to figure out
link |
that these are the same two characters.
link |
It's just that I have been very used to seeing
link |
handwritten digits in my life.
link |
The other sort of problem with any deep learning system
link |
or any kind of machine learning system
link |
is like its guarantees, right?
link |
There are no guarantees for it.
link |
Now you can argue that humans also don't have any guarantees.
link |
Like there is no guarantee that I can recognize a cat
link |
in every scenario.
link |
I'm sure there are going to be lots of cats
link |
that I don't recognize,
link |
lots of scenarios in which I don't recognize cats.
link |
But I think from just a sort of application perspective,
link |
you do need guarantees, right?
link |
We call these things algorithms.
link |
Now algorithms, like traditional CS algorithms, come with guarantees.
link |
Sorting is a guarantee.
link |
If you were to call sort on a particular array of numbers,
link |
you are guaranteed that it's going to be sorted.
link |
Otherwise, it's a bug.
link |
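The contrast he's drawing can be made concrete: sorting has a postcondition you can check for every input, while no such universally quantified check exists for a cat recognizer. A minimal sketch (an illustration added here, not from the conversation):

```python
import random

def is_sorted(xs):
    """The postcondition that sort guarantees: adjacent pairs are ordered."""
    return all(a <= b for a, b in zip(xs, xs[1:]))

# sort offers a hard guarantee: for ANY input, the output satisfies
# the postcondition. If it ever doesn't, that's a bug by definition.
for _ in range(1000):
    xs = [random.randint(-100, 100) for _ in range(20)]
    assert is_sorted(sorted(xs))

# A classifier has no analogous predicate: there is nothing we can
# assert over all possible images of cats.
```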
Now for machine learning, it's very hard to characterize this.
link |
We know for a fact that a cat recognition model
link |
is not going to recognize cats, every cat in the world
link |
in every circumstance.
link |
I think most people would agree with that statement.
link |
But we are still OK with it.
link |
We still don't call it a bug.
link |
Whereas in traditional computer science
link |
or traditional science, if you have this kind of failure case
link |
existing, then you think of it as something is wrong.
link |
I think there is this sort of notion of nebulous correctness
link |
for machine learning.
link |
And that's something we just need to be very comfortable with.
link |
And for deep learning or for a lot of these machine learning
link |
algorithms, it's not clear how we characterize this notion.
link |
I think it's a limitation in our understanding,
link |
or at least a limitation in our phrasing of this.
link |
And if we were to come up with better ways
link |
to understand this limitation, then it would actually help us a lot.
link |
Do you think there's a distinction
link |
between the concept of learning and the concept of reasoning?
link |
Do you think it's possible for neural networks to reason?
link |
So I think of it slightly differently.
link |
So for me, learning is whenever I can make a snap judgment.
link |
So if you show me a picture of a dog,
link |
I can immediately say it's a dog.
link |
But if you give me a puzzle, whatever,
link |
a Rube Goldberg machine of things that are going to happen,
link |
then I have to reason.
link |
Because it's a very complicated setup.
link |
I've never seen that particular setup.
link |
And I really need to draw and imagine in my head
link |
what's going to happen to figure it out.
link |
So I think, yes, neural networks are really good at recognition,
link |
but they're not very good at reasoning.
link |
Because if they have seen something before or seen
link |
something similar before, they're
link |
very good at making those sort of snap judgments.
link |
But if you were to give them a very complicated thing
link |
that they've not seen before, they
link |
have very limited ability right now
link |
to compose different things.
link |
Like, oh, I've seen this particular part before.
link |
I've seen this particular part before.
link |
And now probably this is how they're going to work in tandem.
link |
It's very hard for them to come up with these kinds of things.
link |
Well, there's a certain aspect to reasoning
link |
that you can maybe convert into the process of programming.
link |
And so there's the whole field of the program synthesis.
link |
And people have been applying machine learning
link |
to the problem of program synthesis.
link |
And the question is, can the step of composition,
link |
why can't that be learned?
link |
This step of building things on top of it,
link |
like little intuitions, concepts on top of each other,
link |
can that be learnable?
link |
What's your intuition there?
link |
I guess a similar set of techniques,
link |
do you think that would be applicable?
link |
So I think it is, of course, learnable.
link |
It is learnable because we are prime examples of machines
link |
that have, or individuals that have learned this.
link |
Humans have learned this.
link |
So it is, of course, a technique that
link |
is learnable.
link |
I think where we are kind of hitting a wall basically
link |
with current machine learning is the fact
link |
that when the network learns all of this information,
link |
we basically are not able to figure out how well it's
link |
going to generalize to an unseen thing.
link |
And we have no a priori, no way of characterizing that.
link |
And I think that's basically telling us a lot about the fact
link |
that we really don't know what this model has learned
link |
and how well it has learned it, basically, because we don't know how well
link |
it's going to transfer.
link |
There's also a sense in which it feels like we humans may not
link |
be aware of how much background, how good our background model
link |
is, how much knowledge we just have slowly building
link |
on top of each other.
link |
It feels like neural networks are constantly throwing stuff away.
link |
You'll do some incredible thing where
link |
you're learning a particular task in computer vision.
link |
You celebrate your state of the art successes,
link |
and you throw that out.
link |
It feels like you're never using stuff
link |
you've learned for your future successes in other domains.
link |
And humans are obviously doing that exceptionally well,
link |
still throwing stuff away in their mind,
link |
but keeping certain kernels of truth.
link |
Right, so I think we're like, continual learning
link |
is sort of the paradigm for this in machine learning.
link |
And I don't think it's a very well explored paradigm.
link |
We have things in deep learning, for example.
link |
Catastrophic forgetting is one of the standard things.
link |
The thing basically being that if you teach a network
link |
to recognize dogs, and now you teach
link |
that same network to recognize cats,
link |
it basically forgets how to recognize dogs.
link |
So it forgets very quickly.
link |
And whereas a human, if you were to teach someone
link |
to recognize dogs and then to recognize cats,
link |
they don't forget immediately how to recognize these dogs.
link |
I think that's basically what you're trying to get at.
link |
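Catastrophic forgetting shows up even in a one-parameter toy model (an illustration I'm adding, with made-up numbers, not anything from the conversation): gradient descent on task B simply overwrites the weight learned for task A.

```python
def train(w, target, steps=100, lr=0.1):
    """Gradient descent on the loss (w - target)^2."""
    for _ in range(steps):
        w -= lr * 2.0 * (w - target)
    return w

def loss(w, target):
    return (w - target) ** 2

w = 0.5
w = train(w, target=0.0)       # "task A": learn to output 0
loss_a_before = loss(w, 0.0)   # essentially zero

w = train(w, target=1.0)       # "task B": now learn to output 1
loss_a_after = loss(w, 0.0)    # task A performance collapses
```

A human who learns cats after dogs keeps most of the dog knowledge; this single shared parameter has nowhere to keep it, which is the network analogy in miniature.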
Yeah, I wonder if the long term memory mechanisms,
link |
or the mechanisms that store not just memories,
link |
but concepts that allow you to reason and compose concepts,
link |
if those things will look very different than neural networks,
link |
or if you can do that within a single neural network
link |
with some particular sort of architecture quirks.
link |
That seems to be a really open problem.
link |
And of course, I go up and down on that
link |
because there's something so compelling to the symbolic AI
link |
or to the ideas of logic based sort of expert systems.
link |
You have human interpretable facts
link |
that built on top of each other.
link |
It's really annoying with self supervised learning
link |
that the AI is not very explainable.
link |
You can't understand all the beautiful things it has learned.
link |
You can't ask it questions.
link |
But then again, maybe that's a stupid thing for us humans to ask.
link |
Right, I think whenever we try to understand it,
link |
we're putting our own subjective human bias into it.
link |
And I think that's the sort of problem.
link |
With self supervised learning, the goal
link |
is that it should learn naturally from the data.
link |
So now if you try to understand it,
link |
you are using your own preconceived notions
link |
of what this model has learned.
link |
That's the problem.
link |
High level question, what do you think
link |
it takes to build a system with super human,
link |
maybe let's say human level or super human level,
link |
general intelligence?
link |
We've already kind of started talking about this,
link |
but what's your intuition?
link |
Does this thing have to have a body?
link |
Does it have to interact richly with the world?
link |
Does it have to have some more human elements
link |
like self awareness?
link |
I think emotion is something which is, like, not really
link |
considered typically in standard machine learning.
link |
It's not something we think about.
link |
There is NLP, there is vision, and so on.
link |
Emotion is never a part of all of this.
link |
And that just seems a little bit weird to me.
link |
I think the reason basically being that there is surprise
link |
and surprise is basically one of the reasons emotion arises,
link |
like what happens and what you expect to happen.
link |
There is a mismatch between these things.
link |
And so that gives rise like I can either be surprised
link |
or I can be saddened or I can be happy and all of this.
link |
And so this basically indicates that I already
link |
have a predictive model in my head
link |
and something that I predicted or something
link |
that I thought was likely to happen.
link |
And then there was something that I observed that happened.
link |
There was a disconnect between these two things.
link |
And that basically is like maybe one of the reasons
link |
why we have a lot of emotions.
link |
Yeah, so I talk to people a lot about this,
link |
like Lisa Feldman Barrett.
link |
I think that's an interesting concept of emotion.
link |
But I have a sense that emotion primarily
link |
in the way we think about it, which
link |
is the display of emotion, is a communication mechanism.
link |
So it's a part of basically human to human interaction.
link |
An important part, but just the part.
link |
So it's like I would throw it into the full mix of communication.
link |
And to me, communication can be done with objects
link |
that don't look at all like humans.
link |
I've seen our ability to anthropomorphize,
link |
our ability to connect with things
link |
that look like a Roomba, our ability to connect.
link |
First of all, let's talk about other biological systems
link |
like dogs, our ability to love things that are very different from us.
link |
But they do display emotion, right?
link |
I mean, dogs do display emotion.
link |
So they don't have to be anthropomorphic for them
link |
to display the kind of emotions that we do.
link |
So I mean, but then the word emotion starts to lose meaning.
link |
So then we have to be, I guess, specific.
link |
But yeah, so have rich, flavorful communication.
link |
Communication, yeah.
link |
Yeah, so like, yes, it's full of emotion.
link |
It's full of wit and humor and moods and all those kinds of things.
link |
Yeah, so you're talking about like flavor.
link |
OK, let's follow that.
link |
So there's content and then there is flavor
link |
and I'm talking about the flavor.
link |
Do you think it needs to have a body?
link |
Do you think like to interact with the physical world,
link |
do you think you can understand the physical world
link |
without being able to directly interact with it?
link |
I don't think so, yeah.
link |
I think at some point we will need to bite the bullet
link |
and actually interact with the physical world.
link |
As much as I like working on like passive computer vision,
link |
where I just like sit in my armchair and look at videos
link |
and learn, I do think that we will
link |
need to have some kind of embodiment
link |
or some kind of interaction to figure out
link |
things about the world.
link |
What about consciousness?
link |
Do you think, how often do you think about consciousness
link |
when you think about your work?
link |
You could think of it as the more simple thing
link |
of self awareness, of being aware that you
link |
are a perceiving, sensing, acting thing in this world,
link |
or you can think about the bigger version of that,
link |
which is consciousness, which is having,
link |
it feel like something to be that entity,
link |
the subjective experience of being in this world.
link |
So I think of self awareness a little bit more than the broader
link |
goal of it, because I think self awareness
link |
is pretty critical for any kind of AGI or whatever you
link |
want to call it that we build, because it
link |
needs to contextualize what it is and what role it's playing
link |
with respect to all the other things that exist around it.
link |
I think that requires self awareness.
link |
It needs to understand that it's an autonomous car.
link |
And what does that mean?
link |
What are its limitations?
link |
What are the things that it is supposed to do and so on?
link |
What is its role in some way?
link |
Or, I mean, these are the kind of things
link |
that we kind of expect from it, I would say.
link |
And so that's the level of self awareness
link |
that's, I would say, basically required at least,
link |
if not more than that.
link |
Yeah, I tend to, on the emotion side,
link |
believe that it has to be able to display consciousness.
link |
Display consciousness, what do you mean by that?
link |
Meaning for us humans to connect with each other
link |
or to connect with other living entities,
link |
I think in order for us to truly feel
link |
like that there's another being there,
link |
we have to believe that they're conscious.
link |
And so we won't ever connect with something
link |
that doesn't have elements of consciousness.
link |
Now, I tend to think that that's easier to achieve
link |
than it may sound, because we anthropomorphize stuff so hard.
link |
You have a mug that just has wheels and rotates
link |
every once in a while and makes a sound.
link |
I think a couple of days in, especially if you're,
link |
if you don't hang out with humans,
link |
you might start to believe that mug on wheels is conscious.
link |
So I think we anthropomorphize
link |
pretty effectively as human beings.
link |
But I do think that it's in the same bucket
link |
that we'll call emotion,
link |
I think of consciousness as the capacity to suffer.
link |
And if you're an entity that's able to feel things in the world
link |
and to communicate that to others,
link |
I think that's a really powerful way to interact with humans.
link |
And in order to create an AGI system,
link |
I believe you should be able to richly interact with humans.
link |
Like humans would need to want to interact with you.
link |
Like it can't be forced, it's the self supervised learning version of it,
link |
the robot shouldn't have to pay you to interact with it.
link |
So it should be a natural, fun thing.
link |
And then you're going to scale up significantly
link |
how much interaction it gets.
link |
It's the Alexa Prize,
link |
where they're trying to get me to be a judge on their contest.
link |
I'll see if I want to do that.
link |
But their challenge is to talk to you,
link |
make the human sufficiently interested
link |
that the human keeps talking for 20 minutes.
link |
And right now they're not even close to that
link |
because it just gets so boring when you're like,
link |
when the intelligence is not there,
link |
it gets very not interesting to talk to it.
link |
And so the robot needs to be interesting.
link |
And one of the ways it can be interesting
link |
is display the capacity to love, to suffer.
link |
And I would say that essentially means
link |
the capacity to display consciousness.
link |
Like it is an entity, much like a human being.
link |
Of course, what that really means,
link |
I don't know if that's fundamentally a robotics problem
link |
or some kind of problem that we're not yet even aware.
link |
Like if it is truly a hard problem of consciousness,
link |
I tend to maybe optimistically think it's a,
link |
we can pretty effectively fake it till we make it.
link |
So we can display a lot of human like elements for a while.
link |
And that will be sufficient to form
link |
really close connections with humans.
link |
What do you think is the most beautiful idea
link |
in self supervised learning?
link |
Like when you sit back with, I don't know,
link |
with a glass of wine in an armchair
link |
and just at a fireplace,
link |
just thinking how beautiful this world
link |
that you get to explore is,
link |
what do you think is the especially beautiful idea?
link |
The fact that, at the object level,
link |
what objects are, some notion of objectness, emerges
link |
from these models by just like self supervised learning.
link |
So for example, like one of the things, like the DINO paper
link |
that I was a part of at Facebook is,
link |
the object sort of boundaries emerge
link |
from these representations.
link |
So if you have like a dog running in the field,
link |
the boundaries around the dog,
link |
the network is basically able to figure out
link |
what the boundaries of this dog are automatically.
link |
And it was never trained to do that.
link |
It was never trained to,
link |
no one taught it that this is a dog
link |
and these pixels belong to a dog.
link |
It's able to group these things together automatically.
link |
I think in general that entire notion that
link |
this dumb idea that you take like these two crops
link |
of an image and then you say that the features
link |
should be similar,
link |
that has resulted in something like this.
link |
Like the model is able to figure out
link |
what the dog pixels are and so on.
link |
That just seems like so surprising.
link |
And I mean, I don't think a lot of us even understand
link |
how that is happening really.
link |
And it's something we are taking for granted,
link |
maybe a lot, in terms of how we're setting up the problem,
link |
but it's just, it's a very beautiful and powerful idea.
link |
So it's really fundamentally telling us something
link |
about that there is so much signal in the pixels
link |
that we can be super dumb about it
link |
about how we're setting up the self supervised learning
link |
problem and despite being like super dumb about it,
link |
we'll actually get very good,
link |
like we'll actually get something that is able to do
link |
very like surprising things.
link |
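The two-crop setup he's describing can be sketched roughly like this (a toy illustration: the random linear "encoder", crop size, and embedding dimension are all invented for the example; DINO itself uses a vision transformer and a teacher-student objective):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop(image, size):
    """Random square crop of side `size` from an H x W x C array."""
    h, w, _ = image.shape
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    return image[top:top + size, left:left + size]

def embed(crop, weights):
    """Toy encoder: flatten, apply a linear map, L2-normalize."""
    z = crop.reshape(-1) @ weights
    return z / np.linalg.norm(z)

image = rng.random((64, 64, 3))          # stand-in for a photo
weights = rng.random((32 * 32 * 3, 16))  # stand-in for learned weights

z1 = embed(random_crop(image, 32), weights)
z2 = embed(random_crop(image, 32), weights)

# The self-supervised objective: make embeddings of two crops of the
# SAME image similar. Training pushes this loss toward zero, and
# properties like object boundaries emerge as a side effect.
similarity = float(z1 @ z2)
loss = 1.0 - similarity
```

In the real systems, `weights` would be trained to minimize this loss over many images, with extra machinery to avoid the trivial solution where every crop maps to the same vector.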
I wonder if there's other like objectness,
link |
other concepts that can emerge.
link |
I don't know if you follow Francois Chollet,
link |
he had the competition for intelligence
link |
that basically is kind of like an IQ test.
link |
But for an IQ test, you have to have a few concepts
link |
that you want to apply.
link |
One of them is objectness.
link |
I wonder if those concepts can emerge
link |
through self supervised learning on billions of images.
link |
I think something like object permanence
link |
can definitely emerge, right?
link |
So that's like a fundamental concept which we have,
link |
maybe not through images, through video,
link |
but that's another concept that should be emerging from it.
link |
Because it's not something that,
link |
like we don't teach humans about
link |
this concept of object permanence,
link |
it actually emerges.
link |
And the same thing for like animals,
link |
like dogs, I think object permanence automatically
link |
is something that they are born with.
link |
So I think it should emerge from the data.
link |
It should emerge basically very quickly.
link |
I wonder if ideas like symmetry, rotation,
link |
these kinds of things might emerge.
link |
So I think rotation probably, yes, yeah, rotation, yes.
link |
I mean, there's some constraints
link |
in the architecture itself.
link |
But it's interesting if all of them could be,
link |
like counting was another one.
link |
You know, being able to kind of understand
link |
that there's multiple objects of the same kind in the image
link |
and be able to count them.
link |
I wonder if all of that could be,
link |
if constructed correctly, they can emerge.
link |
Cause then you can transfer those concepts
link |
to then interpret images at a deeper level.
link |
Counting I do believe, I mean, should be possible.
link |
We don't know yet,
link |
but I do think it's not that far in the realm of possibility.
link |
Yeah, that'd be interesting
link |
if using self supervised learning on images
link |
can then be applied to then solving those kinds of IQ tests,
link |
which seem currently to be kind of impossible.
link |
What idea do you believe might be true
link |
that most people think is not true
link |
or don't agree with you on?
link |
Is there something like that?
link |
So this is going to be a little controversial,
link |
I don't believe in simulation,
link |
like actually using simulation to do things very much.
link |
I want to clarify, because this is a podcast
link |
where you talk about, are we living in a simulation often?
link |
You're referring to using simulation to construct worlds
link |
that you then leverage for machine learning.
link |
For example, like one example would be like to train
link |
an autonomous car driving system.
link |
You basically first build a simulator,
link |
which builds like the environment of the world.
link |
And then you basically have a lot of like,
link |
you train your machine learning system in that.
link |
So I believe it is possible,
link |
but I think it's a really expensive way of doing things.
link |
And at the end of it, you do need the real world.
link |
So maybe for certain settings,
link |
like maybe the payout is so large,
link |
like for autonomous driving,
link |
the payout is so large
link |
that you can actually invest that much money to build it.
link |
But I think as a general sort of principle,
link |
it does not apply to a lot of concepts.
link |
You can't really build simulations of everything,
link |
not only because, one, it's expensive,
link |
but second, it's also not possible for a lot of things.
link |
So in general, like there is a lot of like,
link |
there's a lot of work on like using synthetic data
link |
and like synthetic simulators.
link |
I generally am not very, like I don't believe in that.
link |
So you're saying it's very challenging visually,
link |
like to correctly like simulate the visual,
link |
like the lighting, all those kinds of things.
link |
I mean, all these companies that you have, right?
link |
So like Pixar and like whatever,
link |
all these companies,
link |
all this like computer graphics stuff,
link |
a lot of it
link |
is about like accurately trying
link |
to figure out how the lighting is
link |
and like how things reflect off of one another and so on
link |
and like how sparkly things look and so on.
link |
So it's a very hard problem.
link |
So do we really need to solve that first
link |
to be able to like do computer vision?
link |
And for me, in the context of autonomous driving,
link |
it's very tempting to be able to use simulation, right?
link |
Because it's a safety critical application,
link |
but the other limitation of simulation
link |
that perhaps is a bigger one than the visual limitation
link |
is the behavior of objects.
link |
Because so you're ultimately interested in edge cases.
link |
And the question is,
link |
how well can you generate edge cases in simulation,
link |
especially with human behavior?
link |
I think another problem is like for autonomous driving, right?
link |
It's a constantly changing world.
link |
So say autonomous driving like in 10 years from now,
link |
like there are lots of autonomous cars,
link |
but there's still going to be humans.
link |
So now there are 50% of the agents,
link |
say which are humans,
link |
50% of the agents that are autonomous,
link |
like car driving agents.
link |
So now the mixture has changed.
link |
So now the kinds of behaviors
link |
that you actually expect from the other agents
link |
or other cars on the road
link |
are actually going to be very different.
link |
And as the proportion of the number of autonomous cars
link |
to humans keeps changing,
link |
this behavior will actually change a lot.
link |
So now if you were to build a simulator
link |
based on just, like, right now, if you build it today,
link |
you don't have that many autonomous cars on the road.
link |
So you'll try to like make all of the other agents
link |
in that simulator behave as humans,
link |
but that's not really going to hold true
link |
10, 15, 20, 30 years from now.
link |
Do you think we're living in a simulation?
link |
This is why I think it's an interesting question.
link |
How hard is it to build a video game,
link |
like virtual reality game,
link |
where it is so real,
link |
forget like ultra realistic
link |
to where you can't tell the difference,
link |
but like it's so nice that you just want to stay there.
link |
You just want to stay there
link |
and you don't want to come back.
link |
Do you think that's doable within our lifetime?
link |
Within our lifetime, probably.
link |
Depends on how long we live.
link |
Does that make you sad
link |
that there will be like population of kids
link |
that basically spend 95%, 99% of their time
link |
in a virtual world?
link |
Very, very hard question to answer.
link |
For certain people, it might be something
link |
that they really derive a lot of value out of,
link |
derive a lot of enjoyment and like happiness out of,
link |
and maybe the real world wasn't giving them that,
link |
that's why they did that.
link |
So maybe it is good for certain people.
link |
So ultimately, if it maximizes happiness,
link |
who are we to judge?
link |
Yeah, I think if it's making people happy,
link |
Again, I think this is a very hard question.
link |
So like you've been a part of a lot of amazing papers.
link |
What advice would you give to somebody
link |
on what it takes to write a good paper?
link |
For grad students writing papers now,
link |
are there common things that you've learned along the way
link |
about what you think it takes,
link |
both for a good idea and a good paper?
link |
Right, so I think both of these
link |
I've picked up from like lots of people
link |
I've worked with in the past.
link |
So one of them is picking the right problem
link |
to work on in research is as important
link |
as like finding the solution to it.
link |
So I mean, there are multiple reasons for this.
link |
So one is that there are certain problems
link |
that can actually be solved in a particular timeframe.
link |
So now say you want to work on finding the meaning of life.
link |
This is a great problem.
link |
I think most people will agree with that.
link |
But do you believe that your talents
link |
and like the energy that you'll spend on it
link |
will make some kind of meaningful progress?
link |
If you are optimistic about it, then like go ahead.
link |
That's why I started this podcast.
link |
I keep asking people about the meaning of life.
link |
I'm hoping by episode like 220, I'll figure it out.
link |
Oh, not too many episodes to go then.
link |
All right, maybe today, I don't know.
link |
So that seems intractable at the moment.
link |
Right, so I think it's just the fact that,
link |
if you're starting a PhD for example,
link |
what is one problem that you want to focus on
link |
that you do think is interesting enough,
link |
and that you will be able to make a reasonable amount
link |
of headway into within the time you'll be doing a PhD?
link |
So in that kind of a timeframe.
link |
Of course, there's the second part
link |
which is what excites you genuinely.
link |
So you shouldn't just pick problems
link |
that you are not excited about
link |
because as a grad student or as a researcher,
link |
you really need to be passionate about it
link |
to continue doing that
link |
because there are so many other things
link |
that you could be doing in life.
link |
So you really need to believe in that
link |
to be able to do that for that long.
link |
In terms of papers,
link |
I think the one thing that I've learned is
link |
I've like in the past,
link |
whenever I used to write things
link |
and even now whenever I do that,
link |
I try to cram in a lot of things into the paper.
link |
Whereas what really matters is just pushing
link |
one simple idea, that's it.
link |
That's all, because
link |
the paper is going to be, like, whatever,
link |
eight or nine pages.
link |
If you keep cramming in lots of ideas,
link |
it's really hard for the single thing
link |
that you believe in to stand out.
link |
So if you really try to just focus
link |
on like especially in terms of writing,
link |
really try to focus on one particular idea
link |
and articulate it out in multiple different ways.
link |
It's far more valuable to the reader as well.
link |
And basically to the reader, of course,
link |
because they get to know
link |
that this particular idea
link |
is associated with this paper.
link |
And also for you, because
link |
like when you write about a particular idea
link |
in different ways, you think about it more deeply.
link |
So as a grad student,
link |
I used to always wait until maybe the last week
link |
or whatever to write the paper
link |
because I used to always believe that doing the experiments
link |
was actually the bigger part of research than writing.
link |
And my advisor always told me
link |
that you should start writing very early on.
link |
And I thought, oh, it doesn't matter.
link |
I don't know what he's talking about.
link |
But I think more and more I realized that's the case.
link |
Like whenever I write something that I'm doing,
link |
I actually think much better about it.
link |
And so if you start writing early on,
link |
you actually, I think get better ideas
link |
or at least you figure out like holes in your theory
link |
or like particular experiments
link |
that you should run to block those holes and so on.
link |
Yeah, I'm continually surprised
link |
how many really good papers throughout history
link |
are quite short and quite simple.
link |
And there's a lesson to that.
link |
Like if you want to dream about writing a paper
link |
that changes the world and you want to go by example,
link |
they're usually simple and that it's not cramming
link |
or it's focusing on one idea and thinking deeply
link |
and you're right that the writing process itself
link |
challenges you to really think about what is the idea,
link |
the thread that ties it all together.
link |
And so a lot of famous researchers I know
link |
would actually start off,
link |
even before the experiments were in,
link |
with writing the introduction of the paper,
link |
with zero experiments in.
link |
Because that at least helps them figure out
link |
what they're trying to solve
link |
and how it fits in like the context of things right now.
link |
And that would really guide their entire research.
link |
So a lot of them would actually first write intros
link |
with like zero experiments in
link |
and that's how they would start projects.
link |
Some basic questions for people who are maybe
link |
more like beginners in this field.
link |
What's the best programming language to learn
link |
if you're interested in machine learning?
link |
I would say Python just because it's the easiest one to learn.
link |
And also a lot of like programming
link |
in machine learning happens in Python.
link |
So it'll, if you don't know any other programming language
link |
Python is actually going to get you a long way.
link |
Yeah, it's sort of a toss up question
link |
because it seems like Python is so much dominating
link |
the space now, but I wonder if there are interesting
link |
alternatives, obviously there's like Swift
link |
and there's a lot of interesting alternatives popping up
link |
even JavaScript or R, more like for the data science
link |
applications, but it seems like Python more and more
link |
is actually being used to teach like introduction
link |
to programming at universities.
link |
So it just combines everything very nicely.
link |
Even harder question.
link |
What are the pros and cons of PyTorch versus TensorFlow?
link |
You can go with no comment.
link |
So a disclaimer to this is that the last time
link |
I used TensorFlow was probably like four years ago.
link |
And so it was right when it had come out
link |
because I started on deep learning in 2014 or so,
link |
and the dominant sort of framework for us then
link |
for vision was Caffe, which was out of Berkeley,
link |
and we used Caffe a lot, it was really nice.
link |
And then TensorFlow came in, which was basically
link |
like Python first.
link |
So Caffe was mainly C++ and it had a very loose
link |
kind of Python binding.
link |
So Python wasn't really the first language you would use.
link |
You would really use either MATLAB or C++
link |
to get stuff done in Caffe.
link |
And then Python of course became popular a little bit later.
link |
So TensorFlow was basically around that time.
link |
So 2015, 2016 is when I last used it.
link |
It's been a while.
link |
And then what, did you use Torch or did you?
link |
So then I moved to Lua Torch, which was the Torch in Lua.
link |
And then in 2017, I think, I moved basically pretty much
link |
to PyTorch completely.
link |
So you went to Lua, cool.
link |
Huh, so you were there before it was cool.
link |
Yeah, I mean, so Lua Torch was really good
link |
because it actually allowed you to do a lot
link |
of different kinds of things.
link |
Caffe, on the other hand, was very rigid in terms of its structure.
link |
Like you would create a neural network once and that's it.
link |
Whereas if you wanted like very dynamic graphs and so on,
link |
it was very hard to do that.
link |
And Lua Torch was much more friendly
link |
for all of these things.
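The dynamic-graph idea being described here can be sketched in plain Python, with no deep learning library at all. This is a toy stand-in, not real Torch or Caffe code, but it shows why define-by-run is flexible: the set of operations actually executed, which is the graph, can depend on the data itself, something a declare-once framework could not express.

```python
# A toy illustration (not real PyTorch/Torch) of why dynamic graphs matter:
# the sequence of operations can depend on the data, so the "graph" is just
# whatever imperative code actually ran on this input.

def dynamic_forward(x, depth_limit=5):
    """Apply a doubling step until the value crosses a threshold.

    In a static-graph framework you would have to declare the network
    once, up front; here the number of "layers" is decided at runtime.
    """
    trace = []  # record of ops actually run: this *is* the graph
    steps = 0
    while x < 100 and steps < depth_limit:
        x = x * 2          # a stand-in for applying a layer
        trace.append("double")
        steps += 1
    return x, trace

print(dynamic_forward(3))    # small input: many "layers" applied
print(dynamic_forward(90))   # large input: a single "layer" applied
```

Two inputs produce two differently shaped "networks," which is exactly the per-example flexibility that rigid, build-once frameworks made hard.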
link |
Okay, so in terms of PyTorch and TensorFlow,
link |
my personal bias is PyTorch just because I've been using it
link |
longer and I'm more familiar with it.
link |
And also, I find that PyTorch is much easier to debug
link |
because it's imperative in nature,
link |
compared to TensorFlow, which is not imperative.
link |
But that's telling you a lot that basically
link |
the imperative design is sort of a way in which a lot
link |
of people are taught programming
link |
and that's what actually makes debugging easier for them.
link |
So like I learned programming in C++.
link |
And so for me, the imperative way of programming feels more natural.
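A toy sketch of what imperative, eager execution buys you when debugging. The layer functions here are hypothetical stand-ins in plain Python, not a real framework API, but the point carries over: each operation runs immediately, so every intermediate value is concrete the moment the line executes, and ordinary print statements, assertions, and pdb work anywhere.

```python
# Imperative (eager) style: each "layer" runs as a normal function call,
# so intermediate results can be inspected with ordinary Python tools.
# These layers are illustrative stand-ins, not a real framework API.

def linear(x, w, b):
    # elementwise affine transform: a stand-in for a linear layer
    return [w * xi + b for xi in x]

def relu(x):
    # clamp negatives to zero
    return [max(0.0, xi) for xi in x]

def forward(x):
    h = linear(x, w=2.0, b=-1.0)
    # Because execution is eager, h is a concrete list right here,
    # so you can drop in checks (or a pdb breakpoint) at any step:
    assert all(isinstance(v, float) for v in h), h
    return relu(h)

print(forward([0.0, 1.0, 2.0]))
```

In a deferred, graph-building style, `h` at this point would be a symbolic node rather than a value, which is exactly what makes that style harder to poke at line by line.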
link |
Do you think it's good to have kind of these two communities,
link |
this kind of competition?
link |
I think PyTorch is kind of more and more becoming dominant
link |
in the research community,
link |
but TensorFlow is still very popular
link |
in the more sort of application machine learning community.
link |
So do you think it's good to have that kind of split
link |
in code bases or, so like the benefit there
link |
is the competition challenges the library developers
link |
to step up their game.
link |
But the downside is there's these code bases
link |
that are in different libraries.
link |
Right, so I think the downside is there.
link |
I mean, a lot of research code
link |
is released in one framework,
link |
and if you're using the other one, it's really hard
link |
to really build on top of it.
link |
But thankfully the open source community
link |
in machine learning is amazing.
link |
So whenever like something pops up in TensorFlow,
link |
you wait a few days and someone who's like super sharp
link |
will actually come and translate that particular code
link |
base into PyTorch and basically figure
link |
all those nooks and crannies out.
link |
So the open source community is amazing
link |
and they really like figure out this gap.
link |
So I think in terms of like having these two frameworks
link |
or multiple, I think of course there are different use cases
link |
so there are going to be benefits to using one
link |
or the other framework.
link |
And like you said, I think competition is just healthy
link |
because both of these frameworks keep
link |
or like all of these frameworks really sort of keep learning
link |
from each other and keep incorporating different things
link |
to just make them better and better.
link |
What advice would you have for someone
link |
new to machine learning?
link |
Maybe just started or haven't even started
link |
but are curious about it and who want to get in the field.
link |
Don't be afraid to get your hands dirty.
link |
I think that's the main thing.
link |
So if something doesn't work, like really drill
link |
into why things are not working.
link |
Can you elaborate what your hands dirty means?
link |
Right, so for example, like if an algorithm,
link |
if you try to train a network and it's not converging,
link |
whatever, rather than trying to like Google the answer
link |
or trying to do something, like really spend those
link |
like five, eight, 10, 15, 20, whatever number of hours
link |
really trying to figure it out yourself.
link |
Because in that process, you'll actually learn a lot more.
link |
Googling is of course like a good way to solve it
link |
when you need a quick answer.
link |
But I think initially especially like when you're starting out
link |
it's much nicer to like figure things out by yourself.
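As a concrete example of this kind of drilling in, here is a toy "training is not converging" scenario in plain Python (illustrative numbers, no real framework): fitting y = 2x by gradient descent, where the culprit turns out to be the learning rate rather than the model.

```python
# A classic "not converging" drill-down, sketched without any framework:
# fit y = 2x by gradient descent on mean squared error and watch what
# the weight actually does under different learning rates.

def train(lr, steps=50):
    w = 0.0
    data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
    for _ in range(steps):
        # gradient of mean squared error with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

# Drilling in (logging w or the loss each step) makes the divergence
# obvious: the step size, not the model or the data, is at fault.
print(train(lr=0.5))    # overshoots and blows up
print(train(lr=0.01))   # converges toward the true weight, 2.0
```

The payoff of debugging it yourself is exactly this: after tracing the update rule by hand once, "loss exploding means lower the learning rate" stops being folklore and becomes something you understand.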
link |
And I just say that from experience
link |
because like when I started out,
link |
there were not a lot of resources.
link |
So we would like in the lab a lot of us
link |
like we would look up to senior students
link |
and the senior students were of course busy
link |
and they would be like, hey, why don't you go figure it out
link |
because I just don't have the time
link |
I'm working on my dissertation or whatever.
link |
Ah, the life of PhD students.
link |
And so then we would sit down
link |
and like just try to figure it out.
link |
And that I think really helped me.
link |
That has really helped me figure a lot of things out.
link |
I think in general, if I were to generalize that,
link |
I feel like persevering through any kind of struggle
link |
on a thing you care about is good.
link |
So you're basically trying to make it seem like
link |
it's good to spend time debugging
link |
but really any kind of struggle, whatever form that takes
link |
it could be just Googling a lot.
link |
Just basically anything just sticking with it
link |
and going through the hard thing
link |
that could take a form of implementing stuff from scratch.
link |
It could take the form of re implementing
link |
with different libraries or different programming languages.
link |
It could take a lot of different forms
link |
but struggle is good for the soul.
link |
So like in Pittsburgh, where I did my PhD,
link |
the thing was it used to snow a lot, right?
link |
And so when it snowed, you really couldn't do much.
link |
So the thing that a lot of people said was snow
link |
builds character because when it's snowing,
link |
you can't do anything else.
link |
You focus on work.
link |
Do you have advice in general for people
link |
you're already exceptionally successful, you're young,
link |
but do you have advice for young people starting out
link |
in college or maybe in high school?
link |
Advice for their career, advice for their life,
link |
how to pave a successful path in career and life.
link |
I would say just be hungry,
link |
like always be hungry for what you want.
link |
And I think like I've been inspired by a lot of people
link |
who are just like driven and who really like go
link |
for what they want, no matter what. Like,
link |
you shouldn't want it, you should need it.
link |
So if you need something, you basically go towards
link |
the ends to make it work.
link |
How do you know when you come across a thing
link |
that's like you need?
link |
I think there's not going to be any single thing
link |
that you're going to need, there are going to be
link |
different types of things that you need,
link |
but whenever you need something, you just go push for it.
link |
And of course, you may not get it
link |
or you may find that this was not even the thing
link |
that you were looking for, it might be a different thing.
link |
But the point is like you're pushing through things
link |
and that actually builds a lot of skills
link |
and a certain kind of attitude,
link |
which will probably help you get the other thing.
link |
Once you figure out what's really the thing that you want.
link |
Yeah, I think a lot of people are,
link |
I've noticed, kind of afraid of that
link |
because one, it's a fear of commitment.
link |
And two, there's so many amazing things in this world.
link |
You almost don't want to miss out on all the other
link |
amazing things by committing to this one thing.
link |
So I think a lot of it has to do with just allowing yourself
link |
to like notice that thing.
link |
And just go all the way with it.
link |
I mean, also like failure, right?
link |
So I know this is like super cheesy that failure is something
link |
that you should be prepared for and so on.
link |
But I do think, I mean, especially in research,
link |
for example, failure is something that happens
link |
almost every day, like experiments failing.
link |
And so you really need to be used to it.
link |
You need to have a thick skin.
link |
But it's only when you get through it
link |
that you find the one thing that's actually working.
link |
Thomas Edison was one person like that, right?
link |
So I really, like when I was a kid,
link |
I used to really read about how he found like the filament,
link |
the light bulb filament.
link |
And then I think his thing was like,
link |
he tried 990 things that didn't work or something of the sort.
link |
And then they asked him like, so what did you learn?
link |
Because all of these were failed experiments.
link |
And then he says, oh, these 990 things don't work.
link |
Did you know that?
link |
I mean, that's really inspiring.
link |
So you spent a few years on this earth
link |
performing a self supervised kind of learning process.
link |
Have you figured out the meaning of life yet?
link |
I told you I'm doing this podcast to try to get the answer.
link |
I'm hoping you could tell me.
link |
What do you think the meaning of it all is?
link |
I don't think I figured this out.
link |
No, I have no idea.
link |
Do you think AI will help us figure it out?
link |
Or do you think there's no answer?
link |
The whole point is to keep searching.
link |
I think it's an endless sort of quest for us.
link |
I don't think AI will help us there.
link |
This is like a very hard, hard, hard question
link |
which so many humans have tried to answer.
link |
Well, that's the interesting thing about the difference
link |
between AI and humans.
link |
Humans don't seem to know what the hell they're doing.
link |
And AI is almost always operating
link |
under well defined objective functions.
link |
And I wonder whether our lack of ability
link |
to define good long term objective functions,
link |
or, in retrospect, the objective function under which
link |
we operate, is a feature or a bug.
link |
I would say it's a feature because then everyone actually
link |
has very different kinds of objective functions
link |
that they're optimizing.
link |
And those objective functions evolve and change dramatically
link |
through their course of their life.
link |
That's actually what makes us interesting, right?
link |
If otherwise, if everyone was doing the exact same thing,
link |
that would be pretty boring.
link |
We do want people with different kinds of perspectives.
link |
Also, people evolve continuously.
link |
That's like, I would say, the biggest
link |
feature of being human.
link |
And then we get to the ones that die
link |
because they do something stupid.
link |
We get to watch that, see it, and learn from it.
link |
And as a species, we take that lesson
link |
and become better and better because of all the dumb people
link |
in the world that died doing something wild and beautiful.
link |
Ishan, thank you so much for this incredible conversation.
link |
We did a depth first search through the space of machine
link |
learning, and it was fun and fascinating.
link |
So it's really an honor to meet you.
link |
And it was a really awesome conversation.
link |
Thanks for coming down today and talking with me.
link |
I mean, I've listened to you.
link |
I told you it was unreal for me to actually meet you in person.
link |
And I'm so happy to be here.
link |
Thanks for listening to this conversation with Ishan Misra.
link |
And thank you to Onnit, The Information, Grammarly,
link |
and Athletic Greens.
link |
Check them out in the description to support this podcast.
link |
And now let me leave you with some words from Arthur C. Clarke.
link |
Any sufficiently advanced technology
link |
is indistinguishable from magic.
link |
Thank you for listening and hope to see you next time.