Jitendra Malik: Computer Vision | Lex Fridman Podcast #110

The following is a conversation with Jitendra Malik, a professor at Berkeley and one of the seminal figures in the field of computer vision, the kind before the deep learning revolution and the kind after. He has been cited over 180,000 times and has mentored many world class researchers in computer vision.

Quick summary of the ads. Two sponsors, one new one, which is BetterHelp, and an old goodie, ExpressVPN. Please consider supporting this podcast by going to betterhelp.com/lex and signing up at expressvpn.com/lexpod. Click the links, buy the stuff, it really is the best way to support this podcast and the journey I'm on. If you enjoy this thing, subscribe on YouTube, review it with five stars on Apple Podcasts, support it on Patreon, or connect with me on Twitter at Lex Fridman, however the heck you spell that. As usual, I'll do a few minutes of ads now and never any ads in the middle that can break the flow of the conversation.
This show is sponsored by BetterHelp, spelled H-E-L-P, help. Check it out at betterhelp.com/lex. They figure out what you need and match you with a licensed professional therapist. It's not a crisis line, it's not self help, it's professional counseling done securely online. I'm a bit from the David Goggins line of creatures, as you may know, and so have some demons to contend with, usually on long runs or all nights working, forever and possibly full of self doubt. It may be because I'm Russian, but I think suffering is essential for creation. But I also think you can suffer beautifully, in a way that doesn't destroy you. For most people, I think a good therapist can help with this, so it's at least worth a try. Check out their reviews, they're good. It's easy, private, affordable, available worldwide. You can communicate by text any time and schedule weekly audio and video sessions. I highly recommend that you check them out at betterhelp.com/lex.
This show is also sponsored by ExpressVPN. Get it at expressvpn.com/lexpod to support this podcast and to get an extra three months free on a one year package. I've been using ExpressVPN for many years. I love it. I think ExpressVPN is the best VPN out there. They told me to say it, but it happens to be true. It doesn't log your data, it's crazy fast, and it's easy to use, literally just one big, sexy power on button. Again, for obvious reasons, it's really important that they don't log your data. It works on Linux and everywhere else too, but really, why use anything else? Shout out to my favorite flavor of Linux, Ubuntu MATE 20.04. Once again, get it at expressvpn.com/lexpod to support this podcast and to get an extra three months free on a one year package.

And now, here's my conversation with Jitendra Malik.
In 1966, Seymour Papert at MIT wrote up a proposal called the Summer Vision Project, to be given, as far as we know, to ten students to work on and solve that summer. That proposal outlined many of the computer vision tasks we still work on today. Why do you think we underestimated, and perhaps still underestimate, how hard computer vision is?

Because most of what we do in vision, we do unconsciously or subconsciously. That effortlessness gives us the sense that, oh, this must be very easy to implement on a computer. This is why the early researchers in AI got it so wrong. However, if you go into the neuroscience or psychology of human vision, the complexity becomes apparent. The fact is that a very large part of the cerebral cortex is devoted to visual processing, and this is true in other primates as well. So once we look at it from a neuroscience or psychology perspective, it becomes quite clear that the problem is very challenging and it will take some time.
You said the higher level parts are the harder parts?

I think vision appears to be easy because most of visual processing is subconscious, so we underestimate the difficulty. Whereas when you are proving a mathematical theorem or playing chess, the difficulty is much more evident, because there it is your conscious brain which is processing the various aspects of the problem solving behavior. In vision, all of this is happening, but it's not in your awareness; it's operating below that.

But it still seems strange. Yes, that's true, but it seems strange that as computer vision researchers, the community broadly, we time and time again make the mistake of thinking the problem is easier than it is. Or maybe it's not a mistake. We'll talk a little bit about autonomous driving, for example, and how hard a vision task that is. Do you think it's just human nature, or is there something fundamental to the vision problem that we underestimate, such that we're still not able to be cognizant of how hard the problem is?
Yeah, I think in the early days it could have been excused, because in the early days all aspects of AI were regarded as too easy. But today it is much less excusable. I think why people fall for this is because of what I call the fallacy of the successful first step. There are many problems in vision where getting to 50% of the solution takes you one minute, getting to 90% can take you a day, getting to 99% may take you five years, and 99.99% may not happen in your lifetime.
I wonder if that's unique to vision. It seems that with language, people are not so confident; with natural language processing, people are a little bit more cautious about our ability to solve that problem. I think for language, people intuit that we have to be able to do natural language understanding. For vision, it seems that we're not cognizant of, or we don't think about, how much understanding is required. It's probably still an open problem. But in your sense, how much understanding is required to solve vision? Put another way, how much of something called common sense reasoning is required to really be able to interpret even static scenes?

So vision operates at all levels, and there are parts which can be solved with what we could call maybe peripheral processing. In the human vision literature, there used to be these terms, sensation, perception and cognition, which roughly speaking referred to the front end of processing, the middle stages of processing, and the higher levels of processing. People made a big deal out of this, and they wanted to study only perception and then dismiss certain problems as being quote cognitive. But really, I think these are artificial divides. The problem is continuous at all levels, and there are challenges at all levels. The techniques that we have today work better at the lower and mid levels of the problem. The higher levels of the problem, quote the cognitive levels, are there, and in many real applications we have to confront them. Now, how much of that is necessary will depend on the application. For some problems it doesn't matter, for some problems it matters a lot. I am, for example, a pessimist on fully autonomous driving in the near future, and the reason is that I think there will be that 0.01% of cases where quite sophisticated cognitive reasoning is called for.
However, there are tasks which are much more robust, in the sense that error is not so much of a problem. For example, let's say you're doing image search, trying to get images based on some visual description. We are very tolerant of errors there, right? I mean, when Google image search gives you some images back and a few of them are wrong, it doesn't hurt anybody. It's not a matter of life and death. But making mistakes when you are driving at 60 miles per hour and could potentially kill somebody is much more important.

So just for the fun of it, since you mentioned it, let's go there briefly: autonomous vehicles. One of the companies in the space, Tesla, with Andrej Karpathy and Elon Musk, is working on a system called Autopilot, which is primarily a vision based system with eight cameras and basically a single neural network, a multitask neural network. They call it HydraNet: multiple heads, so it does multiple tasks, but it is forming the same representation at the core.
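As an aside, here is a minimal PyTorch sketch of the kind of multi-headed, shared-backbone architecture being described. The layer sizes and the two example task heads are invented for illustration; this is not Tesla's actual design.

```python
# Minimal sketch of a multi-task "hydra" network: one shared backbone,
# several task-specific heads reading the same representation.
import torch
import torch.nn as nn

class HydraSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared backbone: maps raw pixels to a common feature vector.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Each head consumes the same representation.
        self.object_head = nn.Linear(64, 10)   # e.g. object classes (invented)
        self.lane_head = nn.Linear(64, 4)      # e.g. lane parameters (invented)

    def forward(self, images):
        features = self.backbone(images)       # shared computation, done once
        return {
            "objects": self.object_head(features),
            "lanes": self.lane_head(features),
        }

outputs = HydraSketch()(torch.randn(2, 3, 128, 128))
print(outputs["objects"].shape, outputs["lanes"].shape)  # (2, 10) (2, 4)
```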
Do you think driving can be converted in this way to purely a vision problem and then solved with learning? Or, even more specifically, in the current approach, what do you think about what the Tesla Autopilot team is doing?

The way I think about it is that there are certainly subsets of the vision based driving problem which are quite solvable. For example, driving in freeway conditions is quite a solvable problem. There were demonstrations of that going back to the 1980s by Ernst Dickmanns in Munich. In the 90s, there were approaches from Carnegie Mellon, and there were approaches from our team. In the 2000s, there were approaches from Stanford, and so on. So autonomous driving in certain settings is very doable. The challenge is to have an autopilot work under all kinds of driving conditions. At that point, it's not just a question of vision or perception, but really also of control and of dealing with all the edge cases.

So where do you think are most of the difficult cases? To me, even highway driving is an open problem, because it applies the same 50, 90, 95, 99 rule, the fallacy of the successful first step, I forget exactly how you put it, that we fall victim to. I think even highway driving has a lot of elements, because to solve autonomous driving you have to completely relinquish the help of a human being. The system is always in control, so you're really going to feel the edge cases. So I think even highway driving is really difficult.
But in terms of the general driving task, do you think vision is the fundamental problem? Or is it also your actions, the interaction with the environment, the ability to... And then there's the middle ground, I don't know if you put that under vision, which is trying to predict the behavior of others. That is a little bit in the world of understanding the scene, but it's also trying to form a model of the actors in the scene and predict their behavior.

I include that in vision, because to me, perception blends into cognition, and building predictive models of other agents in the world, where the agents could be people or other cars, is part of the task of perception. Perception has to tell us not just what is now, but what will happen, because what's now is boring. We care about the future because we act in the future, and we care about the past inasmuch as it informs what's going to happen in the future. So I think we have to build predictive models of the behaviors of people, and those can get quite sophisticated. I've seen examples of this. I own a Tesla, and it has various safety features built in. What I see are these examples where, let's say, there is a skateboarder. I don't want to be too critical, because obviously these systems are always being improved, and for any specific criticism I have, maybe the system six months from now will not have that particular failure mode. But it had the wrong response, and it's because it couldn't predict what this skateboarder would do; that really required a higher level cognitive understanding of what skateboarders typically do, as opposed to a normal pedestrian. What might have been the correct, typical behavior for a pedestrian was not the typical behavior for a skateboarder, right? So to do a good job there, you need to have enough data where you have pedestrians and you also have skateboarders; you've seen enough skateboarders to see what kinds of patterns of behavior they have. So in principle, with enough data, that problem could be solved. But I think our current computer vision systems need far, far more data than humans do for learning those same capabilities.
Say that there is going to be a system that solves autonomous driving. Do you think it will look similar to what we have today, but with a lot more data and perhaps more compute, while the fundamental architecture stays the same, which in the case of Tesla Autopilot is neural networks? Do you think it will look similar in that regard, and we'll just have more data?

That's a scientific hypothesis, as to which way it is going to go. I will tell you what I would bet on, and this is my general philosophical position on how these learning systems have been built. What we have found currently very effective in computer vision, in the deep learning paradigm, is sort of tabula rasa learning, and tabula rasa learning in a supervised way with lots of data.

What's tabula rasa learning?

Tabula rasa in the sense of a blank slate: we just have the system, which is given a series of experiences in this setting, and then it learns from them. Now, let's think about human driving. It is not tabula rasa learning. At the age of 16, a teenager in high school goes into driver ed class, right? At that point they learn, but at the age of 16 they are already visual geniuses, because from zero to 16 they have built a certain repertoire of vision. In fact, most of it has probably been achieved by age two.
In the period up to age two, they learn that the world is three dimensional. They know how objects look from different perspectives. They know about occlusion. They know about the common dynamics of humans and other bodies. They have some notion of intuitive physics. They have built that up from their observations and interactions in early childhood, and of course it is reinforced as they grow up to age 16. So at age 16, when they go into driver ed, what are they learning? They're not learning the visual world afresh; they have a mastery of the visual world. What they are learning is control, okay? They're learning how to be smooth about control, about steering and brakes and so forth. They're learning a sense of typical traffic situations. That education process can be quite short because they are coming in as visual geniuses. And of course, in their future they're going to encounter situations which are very novel. During my driver ed class, I may not have had to deal with a skateboarder. I may not have had to deal with a truck driving in front of me where the back opens up and some junk gets dropped from the truck, and I have to deal with it, right? But I can deal with this as a driver, even though I did not encounter it in my driver ed class. And the reason I can deal with it is because I have all this general visual knowledge and expertise.
And do you think the learning mechanisms we have today can do that kind of long term accumulation of knowledge? Or do we have to go back to the kind of work that led up to expert systems, with knowledge representation? You know, the broader field of artificial intelligence worked on this kind of accumulation of knowledge. Do you think neural networks can do the same?

I don't see any in-principle problem with neural networks doing it, but I think the learning techniques would need to evolve significantly. The current learning techniques that we have are supervised learning: you're given lots of example (x, y) pairs, and you learn the functional mapping between x and y.
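As an aside, a minimal sketch of that supervised paradigm, just to make the (x, y) framing concrete. The toy model, data, and hyperparameters are invented for illustration.

```python
# Supervised learning in miniature: given (x, y) pairs, fit a function
# mapping x to y by gradient descent on a loss.
import torch
import torch.nn as nn

x = torch.randn(256, 8)                        # inputs
y = (x.sum(dim=1, keepdim=True) > 0).float()   # labels from a toy rule

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)   # how far predictions are from the labels
    loss.backward()               # gradients of the loss w.r.t. the weights
    optimizer.step()              # adjust weights to better fit the mapping
```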
I think that human learning is far richer than that. It includes many different components. A child explores the world: for example, a child takes an object and manipulates it in his hand, and therefore gets to see the object from different points of view, and the child has commanded the movement. So that's a kind of learning data, but the learning data has been arranged by the child, and this is a very rich kind of data. The child can do various experiments with the world. So there are many aspects of human learning, and these have been studied in child development by psychologists. What they tell us is that supervised learning is a very small part of it; there are many different aspects of learning. What we would need to do is develop models of all of these and then train our systems with that kind of protocol.

So, new methods of learning, some of which might imitate the human brain. But you have also, in your talks, mentioned the compute side of things, in terms of the difference with the human brain, referencing Hans Moravec. Do you think there's something interesting and valuable to consider about the difference in the computational power of the human brain versus the computers of today, in terms of instructions per second?
Yes. This is a point I've been making for 20 years now. Once upon a time, the way I used to argue this was that we just didn't have the computing power of the human brain; our computers were not quite there. There is a well known trade off: neurons are slow compared to transistors, but we have a lot of them and they have very high connectivity, whereas in silicon you have much faster devices, transistors switch on the order of nanoseconds, but the connectivity is usually smaller. At this point in time, and we are now talking about 2020, we do have, if you consider the latest GPUs and so on, amazing computing power. And if we look back at the Hans Moravec type of calculations, which he did in the 1990s, we may be there today in terms of computing power comparable to the brain, but it's not of the same style; it's of a very different style. For example, the style of computing that we have in our GPUs is far, far more power hungry than the style of computing in the human brain or other biological entities. And we're going to have to solve that efficiency part in order to build real world systems at large scale.
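As an aside, a rough back-of-envelope comparison in the spirit of Moravec-style calculations. All figures below are commonly cited order-of-magnitude estimates, assumed here purely for illustration, not measurements.

```python
# Order-of-magnitude sketch: raw throughput may be comparable, but the
# "style" differs enormously in energy efficiency. All numbers assumed.
neurons = 1e11             # ~number of neurons in the human brain
synapses_per_neuron = 1e3  # order-of-magnitude connectivity
firing_rate_hz = 1e2       # neurons are slow: ~100 Hz at most
brain_ops = neurons * synapses_per_neuron * firing_rate_hz  # ~1e16 "ops"/s
brain_watts = 20           # the brain runs on roughly 20 W

gpu_flops = 1e14           # a modern GPU: ~100 TFLOPS (mixed precision)
gpu_watts = 300            # typical board power

print(f"brain: {brain_ops:.0e} ops/s at {brain_watts} W "
      f"-> {brain_ops / brain_watts:.0e} ops/J")
print(f"gpu:   {gpu_flops:.0e} flops/s at {gpu_watts} W "
      f"-> {gpu_flops / gpu_watts:.0e} flops/J")
# Comparable raw throughput, but orders of magnitude apart in ops per joule.
```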
Let me ask a high level question, taking a step back. How would you articulate the general problem of computer vision? Does such a thing exist? If you look at the computer vision conferences and the work that's been going on, it's often separated into different little segments, breaking the problem of vision apart into segmentation, 3D reconstruction, object detection, image captioning, and so on, with benchmarks for each. But if you were to philosophically say, what is the big problem of computer vision? Does such a thing exist?
Yes, but not in isolation. For all intelligence tasks, I always go back to biology or humans. And if we think about vision or perception in that setting, we realize that perception is always there to guide action. Perception for a biological system does not give any benefits unless it is coupled with action. We can go back and think about the first multicellular animals, which arose in the Cambrian era, you know, 500 million years ago. These animals could move and they could see in some way, and the two activities helped each other. How does movement help? Movement helps because you can get food in different places. But you need to know where to go, and that's really about perception, or seeing. Vision is perhaps the single most important sense for this, but the other senses are also important. So perception and action kind of go together. Early on, it was in these very simple feedback loops, which were about finding food, or avoiding becoming food if there's a predator trying to, you know, eat you up. So we must, at the fundamental level, connect perception to action.
Then as we evolved, perception became more and more sophisticated, because it served many more purposes. And so today we have what seems like a fairly general purpose capability, which can look at the external world and build a model of the external world inside the head. We do have that capability. That model is not perfect, and psychologists have great fun in pointing out the ways in which the model in your head is not a perfect model of the external world; they create various illusions to show the ways in which it is imperfect. But it's amazing how far it has come from the very simple perception action loop that an animal 500 million years ago existed in. Once we have these very sophisticated visual systems, we can then impose a structure on them. It's we as scientists who are imposing that structure, where we have chosen to characterize this part of the system as this quote module of object detection, or this quote module of 3D reconstruction. What's really going on is that all of these processes are running simultaneously, and they are running simultaneously because originally their purpose was in fact to help guide action.
So as a guiding general statement of the problem: you said that in humans, vision is tied to action. Do you think we should also say that ultimately the goal, the problem of computer vision, is to sense the world in a way that helps you act in the world?

I think that's the most fundamental purpose. We have by now hyper evolved, so we have this visual system which can be used for other things, for example, judging the aesthetic value of a painting. And this is not guiding action. Maybe it's guiding action in terms of how much money you will put in your auction bid, but that's a bit of a stretch. The basics are in fact in terms of action, but we have hyper evolved our visual system.

Actually, sorry to interrupt, but perhaps it is fundamentally about action. You kind of jokingly mentioned spending, but perhaps the capitalistic drive behind a lot of the development in this world is about the exchange of money, and the fundamental action is spending money. If you watch Netflix, if you enjoy watching movies, you're using your perception system to interpret the movie; ultimately, your enjoyment of that movie means you'll subscribe to Netflix. So the action is this extra layer that we've developed in modern society, and perhaps it is fundamentally tied to the action of spending money.

Well, certainly with respect to interactions with firms. In this homo economicus role, when you're interacting with firms, it does become that.

What else is there? And that was a rhetorical question.
So to linger on the division between the static and the dynamic: so much of the work in computer vision, so many of the breakthroughs that you've been a part of, have been in the static world, looking at static images. And then you've also worked on, though to a much smaller degree, as has the community, the dynamic: video, dynamic scenes. And then there is robotic vision, which is dynamic, but also where you actually have a robot in the physical world interacting based on that vision. Which problem is harder? The trivial first answer is, well, of course one image is harder. But if you look at the deeper question there: are we, what's the term, cutting ourselves off at the knees, making the problem harder by focusing on images?

That's a fair question. I think sometimes we can simplify a problem so much that we essentially lose part of the juice that could enable us to solve the problem, and one could reasonably argue that to some extent this happens when we go from video to single images. Now, historically, you have to consider the limits imposed by the computation capabilities we had. Many of the choices made in the computer vision community through the 70s, 80s and 90s can be understood as choices which were forced upon us by the fact that we just didn't have access to enough compute.

Not enough memory, not enough hardware.

Not enough compute, not enough storage. So think of these choices. One of the choices is focusing on single images rather than video: that's storage and compute. Another is that we used to detect edges and throw away the image. So we would have an image of, say, 256 by 256 pixels, and instead of keeping around the grayscale values, we detected edges, found the places where the brightness changes a lot, and threw away the rest. This was a major compression device, and the hope was that you could still work with it; the logic was that humans can interpret a line drawing, and this would save us computation. So many of the choices were dictated by that.
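As an aside, a minimal numpy sketch of that edge-detection-as-compression idea, with an invented image and threshold: keep only the pixels where brightness changes a lot, and note how little survives.

```python
# Sketch of edge detection as compression: keep only pixels where
# brightness changes sharply, discard the rest.
import numpy as np

img = np.random.rand(256, 256)       # stand-in grayscale image

# Brightness change in x and y (simple finite differences).
dy, dx = np.gradient(img)
gradient_magnitude = np.hypot(dx, dy)

edges = gradient_magnitude > 0.5     # binary edge map (threshold is arbitrary)

kept = edges.sum()
print(f"edge pixels kept: {kept} of {img.size} "
      f"({100 * kept / img.size:.1f}%)")  # a sparse, line-drawing-like code
```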
I think today we are no longer detecting edges, right? We process images with ConvNets, because we don't need to throw that information away; we don't have those compute restrictions anymore. Now, video is still understudied, because video compute is still quite challenging if you are a university researcher. I think video compute is not so challenging if you are at Google or Facebook or Amazon.

Still super challenging. I just spoke with a VP of engineering at Google, head of YouTube search and discovery, and they still struggle doing stuff on video. It's very difficult, except using techniques that are essentially quite basic computer vision techniques.

No, that's when you want to do things at scale. If you want to operate at the scale of all the content of YouTube, it's very challenging, and there are similar issues at Facebook. But as a researcher, you have more opportunities; you can train large networks with relatively large video data sets. So I think that this is part of the reason why we have so emphasized static images. I think this is changing, and over the next few years I see a lot more progress happening in video. I have this generic statement that, to me, video recognition feels like ten years behind object recognition. You can quantify that, because you can take some of the challenging video data sets, and their performance on action classification is, say, 30%, which is kind of what we used to have around 2009 in object detection. So it's about ten years behind, and whether it will take ten years to catch up is a different question. Hopefully, it will take less than that.
Let me ask a similar question to one I've already asked, but for dynamic scenes: do you think some kind of injection of knowledge bases and reasoning is required to help improve something like action recognition? If we solved the general action recognition problem, what do you think the solution would look like? That's another way to put it.

So I completely agree that knowledge is called for, and that knowledge can be quite sophisticated. The way I would say it is that perception blends into cognition, and cognition brings in issues of memory and this notion of a schema from psychology. Let me use the classic example: you go to a restaurant, right? Things happen in a certain order. You walk in, somebody takes you to a table, the waiter comes, gives you a menu, takes the order, food arrives, eventually the bill arrives, et cetera, et cetera. This is a classic example from the AI of the 1970s. There were these terms, frames and scripts and schemas; these are all quite similar ideas. In the 70s, the way the AI of the time dealt with this was by hand coding it. So they hand coded in this notion of a script, with the various stages and the actors and so on and so forth, and used that to interpret, for example, language.
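As an aside, a toy version of such a hand-coded restaurant script, in the spirit of the 1970s approach described here. The roles, events, and inference rule are invented for illustration.

```python
# A toy, hand-coded "restaurant script": an ordered schema of typical
# events with roles, used to fill in what a story leaves unsaid.
RESTAURANT_SCRIPT = {
    "roles": ["customer", "host", "waiter"],
    "events": [
        ("enter", "customer"),
        ("seat", "host"),
        ("give_menu", "waiter"),
        ("order", "customer"),
        ("serve_food", "waiter"),
        ("bring_bill", "waiter"),
        ("pay", "customer"),
        ("leave", "customer"),
    ],
}

def infer_missing_events(observed):
    """Given a few observed events, fill in what typically happened."""
    names = [event for event, _ in RESTAURANT_SCRIPT["events"]]
    first, last = names.index(observed[0]), names.index(observed[-1])
    return names[first:last + 1]

# A story mentions only ordering and paying; the script lets us infer
# that food was served and a bill was brought in between.
print(infer_missing_events(["order", "pay"]))
```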
I mean, if there's a description of a story involving some people eating at a restaurant, there are all these inferences you can make, because you know what happens typically at a restaurant. So I think this kind of knowledge is absolutely essential, and when we are going to do long form video understanding, we are going to need it. The kinds of technology that we have right now, with 3D convolutions over a couple of seconds of clip or video, are very much tailored towards short term video understanding, not that long term understanding.
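As an aside, a minimal sketch of that short-clip technology: a 3D convolution slides over time as well as space, so its receptive field covers only a couple of seconds of video. The shapes here are invented for illustration.

```python
# A 3D convolution over a short clip: (batch, channels, frames, H, W).
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)    # 16 RGB frames of 112x112 pixels

conv3d = nn.Conv3d(in_channels=3, out_channels=64,
                   kernel_size=(3, 7, 7),  # 3 frames x 7 x 7 pixels
                   stride=(1, 2, 2), padding=(1, 3, 3))

features = conv3d(clip)
print(features.shape)  # torch.Size([1, 64, 16, 56, 56])
# Stacking a few such layers still only integrates tens of frames;
# a restaurant-visit-length "schema" is far beyond this temporal reach.
```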
Long term understanding requires this notion of schemas that I talked about, and perhaps some notions of goals, intentionality, functionality, and so on and so forth. Now, how will we bring that in? We could either revert back to the 70s and say, okay, I'm going to hand code in a script, or we might try to learn it. I tend to believe that we have to find learning ways of doing this, because I think learning ways end up being more robust. And there must be a learning version of the story, because children acquire a lot of this knowledge by, sort of, just observation. It's possible, but I think it's not so typical, that a mother coaches a child through all the stages of what happens in a restaurant. They just go as a family to the restaurant, they eat, they come back, and the child goes through ten such experiences, and then the child has got a schema of what happens when you go to a restaurant. So we somehow need to provide that capability to our systems.
You mentioned the following line from the end of the Alan Turing paper, Computing Machinery and Intelligence, the paper that, as you said, many people know and very few have read, where he proposes the Turing test. This is how you know, because the line is towards the end of the paper: instead of trying to produce a program to simulate the adult mind, why not rather try to produce one which simulates the child's? So that's a really interesting point. If I think about the benchmarks we have before us, the tests of our computer vision systems, they're often kind of trying to get to the adult. So what kind of benchmarks should we have, what kind of tests for computer vision, that mimic the child's learning?

I think we should have those, and we don't have those today. Part of the challenge is that we should really be collecting data of the type that the child experiences. That gets into issues of privacy and so on and so forth, but there are attempts in this direction, to try to collect the kind of data that a child encounters growing up. What's the child's linguistic environment? What's the child's visual environment? If we could collect that kind of data and then develop learning schemes based on that data, that would be one way to do it. I think that's a very promising direction myself. There might be people who would argue that we could just short circuit this in some way, and sometimes we have had success by not imitating nature in detail. The usual example is airplanes, right? We don't build flapping wings. So yes, that's one of the points of debate. In my mind, I would bet on this learning like a child approach.
So one of the fundamental aspects of learning like a child is the interactivity: the child gets to play with the data set it's learning from. It gets to select. You can call that active learning in the machine learning world; you can call it a lot of terms. What are your thoughts about this whole space of being able to play with the data set, or select what you're learning?
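As an aside, a minimal sketch of one common formalization of "selecting what you learn from," uncertainty-based active learning. The toy model and the acquisition rule are assumptions for illustration, not a claim about how children do it.

```python
# Active learning in miniature: ask for labels where the current
# learner is least certain, rather than sampling at random.
import numpy as np

rng = np.random.default_rng(0)
pool = rng.normal(size=(1000, 2))          # unlabeled examples

def predict_proba(x, w):
    """A toy logistic model standing in for the learner."""
    p = 1.0 / (1.0 + np.exp(-(x @ w)))
    return np.stack([1 - p, p], axis=1)

w = rng.normal(size=2)                     # current (poor) weights
probs = predict_proba(pool, w)

# Acquisition: highest predictive entropy = most uncertain examples.
entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
query_indices = np.argsort(entropy)[-10:]  # the 10 most uncertain points
print("query these examples next:", query_indices)
```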
So I believe in that, and I think we could achieve it in two ways, and I think we should use both. One is actual real robotics: real physical embodiments of agents who are interacting with the world. They have a physical body, with dynamics and mass and moments of inertia and friction and all the rest, and the robot learns its body by doing a series of actions. The second is simulation environments. I think simulation environments are getting much, much better. In my life at Facebook AI Research, our group has worked on something called Habitat, which is a simulation environment: a visually photorealistic environment of places like houses or the interiors of various urban spaces and so forth. As you move, you get a picture which is a pretty accurate picture. So you can imagine that subsequent generations of these simulators will be accurate not just visually, but with respect to forces and masses and haptic interactions and so on. And then we have that environment to play with.
Let me state one reason why I think being able to act in the world is important: I think this is one way to break the correlation versus causation barrier. This is something which is of a great deal of interest these days. People like Judea Pearl have talked a lot about how we are neglecting causality, and he describes the entire set of successes of deep learning as just curve fitting, right? But I don't quite agree with that.

He's a troublemaker.

Causality is important, but causality is not a single silver bullet; it's not one single principle, there are many different aspects here. One of our most reliable ways of establishing causal links, and this is the way, for example, the medical community does it, is randomized controlled trials. You pick some set of situations, and in some of them you perform an action, and for certain others you don't, right? So you have a controlled experiment.
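As an aside, a toy simulated version of that recipe: randomize who gets the action, then compare outcomes. All numbers are simulated, purely for illustration.

```python
# A toy randomized controlled trial: because assignment is random,
# a simple difference in means estimates the causal effect.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Randomly assign the action (treatment) by coin flip.
treated = rng.random(n) < 0.5

# Simulated world: the action truly shifts the outcome by +2.0,
# on top of noise. (In reality this effect is what we don't know.)
outcome = 2.0 * treated + rng.normal(size=n)

effect = outcome[treated].mean() - outcome[~treated].mean()
print(f"estimated causal effect: {effect:.2f}")  # close to the true 2.0
```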
Well, the child is in fact performing controlled experiments all the time, right? And that is a way the child gets to build and refine its causal models of the world. My colleague Alison Gopnik, together with a couple of coauthors, has this book called The Scientist in the Crib, referring to children. The part that I like about that is that the scientist wants to build causal models, and the scientist does controlled experiments; I think the child is doing that. So to enable that, we will need to have these active experiments, and I think this could be done, some in the real world and some in simulation.
So you have hope for simulation?

I have hope for simulation.

That's an exciting possibility, if we can get to not just photorealistic but, what's that called, life realistic simulation. So you don't see any fundamental blocks to why we can't eventually simulate the principles of what it means to exist in the world as a physical being?

I don't see any fundamental problems there. And look, the computer graphics community has come a long way. In the early days, going back to the eighties and nineties, they were focusing on visual realism, and they could do the easy stuff, but they couldn't do stuff like hair or fur and so on. Well, they managed to do that. Then they couldn't do physical actions, right? Like a glass bowl that falls down and shatters. But then they could start to do pretty realistic models of that, and so on and so forth. So the graphics people have shown that they can do this forward direction, not just for optical interactions but also for physical interactions. Of course, some of that is very compute intensive, but I think by and by we will find ways of making our models ever more realistic.
In one of your presentations, you break vision apart into early vision, static scene understanding, and dynamic scene understanding, and you raise a few interesting questions. I thought I could just throw some at you to see if you want to talk about them. So, early vision, which in the terms you used earlier is closer to sensation: what can we learn from image statistics that we don't already know? At the lowest level, what can we get from just the statistics, the basics, the variations in the raw pixels, the textures and so on?

What we seem to have learned is that there's a lot of redundancy in these images, and as a result we are able to do a lot of compression. This compression is very important in biological settings, right? You might have 10^8 photoreceptors and only 10^6 fibers in the optic nerve, so you have to do a compression by a factor of 100 to 1. And there are analogs of that happening in our artificial neural networks; that's the early layers.
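As an aside, a minimal sketch of that bottleneck idea: an autoencoder whose early layer compresses a redundant signal by roughly 100 to 1, the way photoreceptor signals are squeezed into far fewer optic nerve fibers. The sizes are scaled down and invented for illustration.

```python
# Compression through an early bottleneck layer, in miniature.
import torch
import torch.nn as nn

dim_in, dim_code = 10_000, 100            # a 100:1 compression

model = nn.Sequential(
    nn.Linear(dim_in, dim_code), nn.ReLU(),  # the "optic nerve" bottleneck
    nn.Linear(dim_code, dim_in),             # reconstruction
)

# Redundant input: 10,000-dim signals that really live on 50 dimensions,
# so they survive the squeeze.
basis = torch.randn(50, dim_in)
x = torch.randn(32, 50) @ basis

recon = model(x)
loss = nn.functional.mse_loss(recon, x)   # train to reconstruct through
loss.backward()                           # the narrow code
```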
So you think there's a lot of compression that can be done at the beginning, just from the statistics. How successful is that kind of image compression?

Well, one way to think about it is just how successful image compression itself is. That's been done with older technologies, but there are several companies which are trying to use these more advanced neural network type techniques for compression, both for static images and for video. One of my former students has a company which is trying to do stuff like this, and I think they are showing quite interesting results. That success is really about image statistics and redundancy.

But that's still not doing compression of the kind where, when I see a picture of a cat, all I have to say is, it's a cat. That's a semantic kind of compression.

Right, this is at the lower level. As I said, that's focusing on low level statistics.

So to linger on that for a little bit: you mentioned the question, how far can bottom up image segmentation go? And you mentioned that the central question for scene understanding is the interplay of bottom up and top down information. Maybe this is a good time to elaborate on that. Maybe define what is bottom up, what is top down, in the context of computer vision.
So today, what we have are very interesting systems, because they work completely bottom up.

What does bottom up mean, sorry?

Bottom up in this case means a feedforward neural network. They start from the raw pixels and they end up with something like cat or not cat, right? So our systems run totally feedforward, but they're trained in a very top down way: they're trained by saying, okay, this is a cat, there's a cat, there's a dog, there's a zebra, et cetera. And I'm not fully happy with either of these choices, because we have completely separated these processes. So what do we know compared to biology? In biology, what we know is that at test time, at runtime, those processes are not purely feedforward; they involve feedback, and they involve much shallower neural networks. The kinds of neural networks we are using in computer vision, say a ResNet 50, have 50 layers. In the brain, in the visual cortex, going from the retina to IT, there are far fewer stages. So biological networks are far shallower, but they have the possibility of feedback; there are backward connections. And this might enable us to deal with more ambiguous stimuli, for example. So the biological solution seems to involve feedback, while the solution in artificial vision seems to be just feedforward, but with a much deeper network. And the two are functionally equivalent, because if you have a feedback network which has, say, three rounds of feedback, you can just unroll it, make it three times the depth, and create it in a totally feedforward way.
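As an aside, a minimal sketch of that equivalence: a shallow block applied with three rounds of feedback (shared weights) computes the same thing as a three-times-deeper feedforward stack. The block itself is invented for illustration.

```python
# Feedback vs. unrolled feedforward: same computation, two views.
import torch
import torch.nn as nn

block = nn.Sequential(nn.Linear(64, 64), nn.ReLU())

def recurrent_forward(x, rounds=3):
    # Feedback view: the SAME shallow block re-processes its own output.
    for _ in range(rounds):
        x = block(x)
    return x

# Unrolled view: one feedforward network, three times the depth,
# whose layers happen to share the block's weights.
unrolled = nn.Sequential(block, block, block)

x = torch.randn(5, 64)
print(torch.allclose(recurrent_forward(x), unrolled(x)))  # True
```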
This is a theme we have written some papers on, and I really feel it should be pursued further.

Some kind of recurrence mechanism.

Yes. So I want to have a little bit more top down at test time. And then, at training time, we currently make use of a lot of top down knowledge. Basically, to learn to segment an object, we have to have all these examples of: this is the boundary of a cat, this is the boundary of a chair, this is the boundary of a horse, and so on. That is too much top down knowledge. How do humans do this? We manage with far less supervision, and we do it in a sort of bottom up way. For example, we are looking at a video stream, and the horse moves, and that enables me to say that all those pixels belong together. The Gestalt psychologists used to call this the principle of common fate.
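As an aside, a toy sketch of grouping by common fate, with synthetic frames and an invented threshold: pixels that move together between two frames get grouped as one entity, with no labels involved.

```python
# Common fate in miniature: a block of pixels shifts between frames,
# and the (crude) change map groups the mover into one entity.
import numpy as np

rng = np.random.default_rng(0)
frame1 = rng.random((64, 64))
frame2 = frame1.copy()

# A "horse": a block of pixels that moves two pixels to the right.
frame2[20:30, 10:40] = 0.0
frame2[20:30, 12:42] = frame1[20:30, 10:40]

# Pixels whose brightness changed are, crudely, the moving entity.
motion = np.abs(frame2 - frame1) > 0.05
print("pixels grouped as one moving thing:", motion.sum())
```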
So there was a bottom up process by which we were able to segment out these objects, and yet we have totally focused on the top down training signal. In my view, the way we have currently handled this top down bottom up interaction in machine vision is not fully satisfactory, and I would rather have a bit of both at both stages, for all computer vision problems, not just segmentation. Now, a question you can ask is: I'm inspired a lot by human vision, and I care about that, but you could be a hard boiled engineer and not give a damn. To you, I would then argue that you would need far less training data if you could make my research agenda fruitful.
Okay, so then maybe taking a step into segmentation and static scene understanding: what is the interaction between segmentation and recognition? You mentioned the movement of objects. For people who don't know computer vision, segmentation is this weird activity that computer vision folks have all agreed is very important: drawing outlines around objects, as opposed to just a bounding box, and then classifying that object. What's the value of segmentation? What is it as a problem in computer vision? How is it fundamentally different from detection, recognition, and the other problems?

So segmentation enables us to say that some set of pixels is an object, without necessarily even being able to name that object or know its properties.

Oh, so you mean segmentation purely as the act of separating an object from its background?

Yes, separating out something that is unified in some way from its background. Entitification, if you will: making an entity out of it.

Entitification, beautifully termed. So I think that we have that capability, and it enables us, as we are growing up, to acquire the names of objects with very little supervision. Suppose the child, let's posit, has this ability to separate out objects. Then when the mother says, pick up your bottle, or, the cat's behaving funny today, the word cat suggests some object, and then the child sort of does the mapping, right? The mother doesn't have to teach specific object labels by pointing to them. Weak supervision works in the context where you have the ability to create objects. So to me, that's a very fundamental capability. There are also applications where this is very important, for example, medical diagnosis. In medical diagnosis you have some brain scan. This is some work that we did in my group, where you have CT scans of people who have had traumatic brain injury, and what the radiologist needs to do is to precisely delineate the various places where there might be bleeds, for example. There are clear needs like that. So there are certainly very practical applications of computer vision where segmentation is necessary, but philosophically, segmentation enables the task of recognition to proceed with much weaker supervision than we require today.
And you think of segmentation as this kind of task that takes in a visual scene and breaks it apart into interesting entities that might be useful for whatever the task is.

Yes, and it is not semantics free. It blends into, it involves, perception and cognition. The mistake that we used to make in the early days of computer vision was to treat it as a purely bottom up perceptual task. It is not just that, because we do revise our notion of segmentation with more experience. For example, there are objects which are nonrigid, like animals or humans, and I think understanding that all the pixels of a human are one entity is actually quite a challenge, because the parts of the human can move independently, and the human wears clothes, so they might be differently colored. So it's all sort of a challenge.
You mentioned that the three R's of computer vision are recognition, reconstruction and reorganization. Can you describe these three R's and how they interact?

So recognition is the easiest one, because that's what I think people generally think of computer vision as achieving these days, which is labels. Is this a chihuahua? It could be very fine grained, like a specific breed of dog or a specific species of bird, or it could be very abstract, like animal. But given a part of an image or a whole image, put a label on it: that's recognition.

Reconstruction you can think of as inverse graphics; that's one way to think about it. Graphics is: you have some internal computer representation of objects arranged in a scene, and what you do is produce a picture, the pixels corresponding to a rendering of that scene. So let's do the inverse of this. We are given an image, and we say, oh, this image arises from some objects in a scene, looked at with a camera from this viewpoint. And we might recover more information about the objects, like their shape, maybe their textures, color, et cetera. That's the reconstruction problem. In a way, you are creating in your head a model of the external world.
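As an aside, a minimal sketch of the forward (graphics) direction that reconstruction inverts: projecting 3D points through a pinhole camera to pixels. The points and camera parameters are invented for illustration.

```python
# Forward graphics: scene points -> pixels, via a pinhole camera.
import numpy as np

points_3d = np.array([[0.0, 0.0, 4.0],    # (x, y, z) in the camera frame;
                      [0.5, 0.2, 5.0],    # z is depth along the optical axis
                      [-0.3, 0.4, 3.0]])

f = 500.0                                  # focal length in pixels (invented)
cx, cy = 320.0, 240.0                      # principal point (image center)

u = f * points_3d[:, 0] / points_3d[:, 2] + cx
v = f * points_3d[:, 1] / points_3d[:, 2] + cy
print(np.stack([u, v], axis=1))            # the rendered pixel coordinates

# Reconstruction is the inverse problem: given only (u, v), depth z is
# lost, so extra assumptions or extra views are needed to get the scene
# back. That is what makes inverse graphics hard.
```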
Reorganization is to do with essentially finding these entities. It's organization: the word organization implies structure. In perception, in psychology, we use the term perceptual organization: an image is not internally represented as just a collection of pixels; we make these entities, we create these entities, objects, whatever you want to call them.

And the relationships between the entities as well, or is it purely about the entities?

It could be about the relationships, but mainly we focus on the fact that there are entities.

So I'm trying to pinpoint what organization means.

Organization means that instead of a uniform grid, we have this structure of objects.

So segmentation is a small part of that?

Segmentation gets us going towards that.

And you kind of have this triangle where they all interact together. So how do you see that interaction? Reorganization is finding the entities, recognition is labeling those entities, and then reconstruction is, what, filling in some more detail?

Well, for example, imputing some 3D objects corresponding to each of those entities; that would be part of it.

So, adding more information that's not there in the raw data.
I started pushing this kind of a view around 2010 or so, because at that time in computer vision, people were just working on many different problems, but they treated each of them as a separate, isolated problem with its own data set; you'd try to solve that and get good numbers on it. I didn't like that approach, because I wanted to see the connections between these problems. When people did divide vision up into modules, the way they would do it was as low level, mid level and high level vision, corresponding roughly to the psychologists' notions of sensation, perception and cognition, and that didn't map to tasks that people cared about. So I tried to promote this particular framework as a way of considering the problems that people in computer vision were actually working on, while being more explicit about the fact that they actually are connected to each other. At that time, I was doing this just on the basis of information flow. Now it turns out, in the last five years or so, post the deep learning revolution, that this architecture has turned out to be very conducive to that, because in these neural networks we are trying to build multiple representations; there can be multiple output heads sharing common representations. So in a certain sense, today, given the reality of the solutions people have, I do not need to preach this anymore. It is just there; it's part of the solution space.
So speaking of neural networks, how much of this problem of computer vision, of reorganization, recognition and reconstruction, can be learned end to end, do you think? Sort of set it and forget it: just plug and play, have a giant data set, perhaps multimodal, and then just learn the entirety of it.

Well, what end to end learning means nowadays is end to end supervised learning, and that, I would argue, is too narrow a view of the problem. I like this child development view, this lifelong learning view, where there are certain capabilities that are built up first, and then other capabilities are built up on top of those. That's what I believe in. End to end learning in the supervised setting, for a very precise task, is to me a sort of limited view of the learning process.
So if we think beyond purely supervised learning, looking back to children: you mentioned six lessons that we can learn from children. Be multimodal, be incremental, be physical, explore, be social, use language. Can you speak to these, perhaps picking the one that you find most fundamental to our time today?

I should say, to give due credit, this list is from a paper by Smith and Gasser, and it reflects what I would call common wisdom among child development people. It's just that this is not common wisdom among people in computer vision, AI and machine learning, so I view my role as trying to bridge the two worlds.
So let's take the example of multimodal. A canonical example of multimodal learning is a child interacting with an object: the child holds a ball and plays with it. At that point, it's getting a touch signal, and the touch signal carries the notion of 3D shape, but it is sparse. And then the child is also seeing a visual signal. Imagine these are two totally different spaces. One is the space of receptors on the skin of the fingers and the thumb and the palm; these map onto neuronal fibers which get activated and lead to some activation in somatosensory cortex. A similar thing would happen if we had a robot hand. And then we have the pixels corresponding to the visual view. But we know that the two correspond to the same object, so that's a very, very strong cross calibration signal. And it is self supervisory, which is beautiful. There's nobody assigning a label. The mother doesn't have to come and assign a label. The child doesn't even have to know that this object is called a ball, and yet the child is learning something about the three dimensional world from this signal.

That's tactile and visual, where there is some work. There is a lot of work currently on audio and visual. So there is some event that happens in the world, and that event has a visual signature and an auditory signature. There's this glass bowl on the table, and it falls and breaks, and I hear the smashing sound and I see the pieces of glass. Okay, I've built that connection between the two, right? This has become a hot topic in computer vision in the last couple of years. There are problems like separating out multiple speakers, which was a classic problem in audition; they call it the problem of source separation, or the cocktail party effect, and so on. But when you try to do it with the visual signal as well, it becomes so much easier and so much more powerful.

So with multimodal there's so much more signal, and you can use that for some kind of weak supervision as well.

Yes, because the signals are occurring at the same time. You have time, which links the two. At a certain moment T1, you've got a certain signal in the auditory domain and a certain signal in the visual domain, and they must be causally related.
link |
Yeah, that's an exciting area.
link |
Not well studied yet.
link |
Yeah, I mean, we have a little bit of work at this, but so much more needs to be done.
link |
So this is a good example.
link |
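To make that idea concrete, here is a minimal sketch, in PyTorch, of temporal co-occurrence acting as the label: audio and video features from the same moment are pulled together, and everything else in the batch is pushed apart. The encoder architectures, feature dimensions, and the InfoNCE-style contrastive loss are illustrative assumptions, not the specific systems discussed here.

```python
# Sketch: time as a self-supervisory signal linking audio and video.
# Encoders and dimensions are placeholder assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioEncoder(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, dim))
    def forward(self, x):          # x: (batch, 1024) spectrogram features
        return F.normalize(self.net(x), dim=-1)

class VideoEncoder(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2048, 256), nn.ReLU(), nn.Linear(256, dim))
    def forward(self, x):          # x: (batch, 2048) frame features
        return F.normalize(self.net(x), dim=-1)

def av_contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """Clips recorded at the same moment are positives; every other pairing
    in the batch is a negative. No human labels anywhere."""
    logits = audio_emb @ video_emb.t() / temperature   # (batch, batch)
    targets = torch.arange(len(logits))                # diagonal = same moment
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Usage with random stand-in features:
audio_enc, video_enc = AudioEncoder(), VideoEncoder()
a = audio_enc(torch.randn(8, 1024))
v = video_enc(torch.randn(8, 2048))
loss = av_contrastive_loss(a, v)
loss.backward()
```

The same recipe would apply to touch and vision: any two streams linked by time can cross-calibrate each other without anyone assigning a label.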
Be physical, that's to do with the thing we talked about earlier: that there's a body acting in a physical world.
link |
And to mention the last one: use language.
link |
So Noam Chomsky believes that language may be at the core of cognition, at the core of
link |
everything in the human mind.
link |
What is the connection between language and vision to you?
link |
What's more fundamental?
link |
Are they neighbors?
link |
Is one the parent and the child, the chicken and the egg?
link |
Oh, it's very clear.
link |
It is vision, which is the parent.
link |
Which is the fundamental ability, okay, it comes before.
link |
So you think vision is more fundamental than language.
link |
And you can think of it either in phylogeny or in ontogeny.
link |
So phylogeny means if you look at evolutionary time, right?
link |
So we have vision that developed 500 million years ago, okay.
link |
Then when we get to maybe five million years ago, you have the first bipedal hominids.
link |
So when we started to walk, then the hands became free.
link |
And so then manipulation, the ability to manipulate objects and build tools and so on and so forth.
link |
So you said 500,000 years ago?
link |
The first multicellular animals, which you can say had some intelligence, arose 500 million years ago.
link |
And now let's fast forward to say the last seven million years, which is the development
link |
of the hominid line, right, where from the other primates, we have the branch which leads
link |
on to modern humans.
link |
Now there are many of these hominids, but the one which, you know, people talk about is
link |
Lucy, because that's a skeleton from three million years ago.
link |
And we know that Lucy walked, okay.
link |
So at this stage you have that the hand is free for manipulating objects and then the
link |
ability to manipulate objects, build tools and the brain size grew in this era.
link |
So okay, so now you have manipulation.
link |
Now we don't know exactly when language arose.
link |
Because no apes have it; Chomsky is correct in that it is a uniquely
link |
human capability, and other primates don't have it.
link |
So it developed somewhere in this era, but I would argue that
link |
it probably developed after we had reached this stage: the human species already
link |
able to manipulate, hands free, a much bigger brain.
link |
And for that, a lot of vision had already had to develop.
link |
So the sensation and the perception, and maybe some of the cognition.
link |
So these ancestors of ours, you know, three, four million years ago,
link |
they had spatial intelligence.
link |
So they knew that the world consists of objects.
link |
They knew that the objects were in certain relationships to each other.
link |
They had observed causal interactions among objects.
link |
They could move in space.
link |
So they had space and time and all of that.
link |
So language builds on that substrate.
link |
I mean, all human languages have constructs
link |
which depend on a notion of space and time.
link |
Where did that notion of space and time come from?
link |
It had to come from perception and action in the world we live in.
link |
Well, you've referred to the spatial intelligence.
link |
So to linger a little bit, we'll mention Turing and his suggestion that we should learn from children.
link |
Nevertheless, language is the fundamental piece of the test of intelligence that Turing proposed.
link |
What do you think is a good test of intelligence?
link |
Are you, what would impress the heck out of you?
link |
Is it fundamentally natural language or is there something in vision?
link |
I don't think we should have a single test of intelligence.
link |
So just like I don't believe in IQ as a single number, I think generally there can be many
link |
capabilities which are correlated perhaps.
link |
So I think there will be accomplishments which are visual accomplishments,
link |
accomplishments in manipulation or robotics, and then accomplishments in language.
link |
But I do believe that language will be the hardest nut to crack.
link |
So what's harder: to pass the spirit of the Turing test, whatever formulation
link |
makes it convincingly natural language, like somebody you would want to
link |
have a beer with, hang out and have a chat with, or general natural scene understanding?
link |
You think language is the tougher problem?
link |
I'm not a fan of it. I think Turing, as he proposed
link |
the test in 1950, was trying to solve a certain problem,
link |
and it made a lot of sense then.
link |
Where we are today, 70 years later, I think we should not worry about that.
link |
I think the Turing test is no longer the right way to channel research in AI, because that,
link |
it takes us down this path of this chat bot, which can fool us for five minutes or whatever.
link |
I think I would rather have a list of 10 different tasks.
link |
I mean, there are tasks in the manipulation domain, tasks
link |
in navigation, tasks in visual scene understanding, tasks in reading a story and answering questions about it.
link |
I mean, so my favorite language understanding task would be, you know, reading a novel and
link |
being able to answer arbitrary questions from it.
link |
And this is not an exhaustive list by any means.
link |
But I think that's where we need to be going.
link |
And each of these, on each of these axes, there's a fair amount of work to be done.
link |
So on the visual understanding side, in this intelligence Olympics that we've set up, what's
link |
a good test, one of many, of visual scene understanding?
link |
Do you think such benchmarks exist?
link |
Sorry to interrupt.
link |
No, there aren't any.
link |
I think essentially, to me, a really good test would be building an aid to the blind.
link |
So suppose there was a blind person and I needed to assist the blind person.
link |
So ultimately, like we said, vision that aids action and survival in this world,
link |
maybe in a simulated world.
link |
It may be easier to measure performance in a simulated world, but what we are ultimately after is performance
link |
in the real world.
link |
So David Hilbert in 1900 proposed 23 open problems in mathematics, some of which are
link |
still unsolved, the most famous of which is probably the Riemann hypothesis.
link |
You've thought about and presented about the Hilbert problems of computer vision.
link |
So let me ask: I don't know when you last presented it, you presented a version
link |
in 2015, but you're kind of the face and the spokesperson for computer vision.
link |
It's your job to state what the open problems are for the field.
link |
So what today are the Hilbert problems of computer vision, do you think?
link |
Let me pick one which I regard as clearly unsolved, which is what I would call long
link |
form video understanding.
link |
So we have a video clip and we want to understand the behavior in there in terms of agents,
link |
their goals, intentionality and make predictions about what might happen.
link |
So that kind of understanding, which goes beyond atomic visual actions.
link |
In the short range, the question is: are you sitting, are you standing, are you catching a ball?
link |
That we can do now, or even if we can't do it fully accurately, if we can do it at 50%,
link |
maybe next year we'll do it at 65% and so forth.
link |
But I think the long range video understanding, I don't think we can do today.
link |
And it blends into cognition, that's the reason why it's challenging.
link |
So you have to understand the entities, you have to track them,
link |
and you have to have some kind of model of their behavior.
link |
And their behavior might be, these are agents, so they are not just like passive objects,
link |
but they're agents, so therefore they would exhibit goal directed behavior.
link |
Okay, so this is one area.
link |
Then I will talk about understanding the world in 3D.
link |
This may seem paradoxical because in a way we have been able to do 3D understanding even
link |
like 30 years ago, right?
link |
But I don't think we currently have the richness of 3D understanding in our computer vision
link |
system that we would like.
link |
So let me elaborate on that a bit.
link |
So currently we have two kinds of techniques which are not fully unified.
link |
So they are the kinds of techniques from multi view geometry that you have multiple pictures
link |
of a scene and you do a reconstruction using stereoscopic vision or structure from motion.
link |
But these techniques totally fail if you just have a single view, because
link |
they rely on multiple view geometry.
link |
Okay, then we have some techniques that we have developed in the computer vision community
link |
which try to guess 3D from single views.
link |
And these techniques are based on supervised learning, on having, at training
link |
time, 3D models of objects available.
link |
And this is completely unnatural supervision, right?
link |
That's not natural; CAD models are not injected into your brain.
link |
Okay, so what would I like?
link |
What I would like would be a kind of learn-as-you-move-around-the-world notion of 3D.
link |
So we have a succession of visual experiences, and as part of that I might
link |
see a chair from different viewpoints or a table from different viewpoints and so on.
link |
And that enables me to build some internal representation.
link |
And then next time I just see a single photograph and it may not even be of that chair, it's
link |
of some other chair.
link |
And I have a guess of what its 3D shape is like.
link |
So you're almost learning the CAD model, kind of.
link |
I mean, the CAD model need not be in the same form as used by computer graphics programs.
link |
Hidden in the representation.
link |
It's hidden in the representation, the ability to predict new views.
link |
And what I would see if I went to such and such position.
link |
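As a rough illustration of that idea, here is a hedged sketch, in PyTorch, of view prediction as self-supervision: encode one view into a latent code, then decode that code plus a target camera pose into the predicted view, so the latent has to carry 3D shape implicitly. The network shapes, the 6-DoF pose encoding, and the pixel loss are all placeholder assumptions, not Malik's own method.

```python
# Sketch: "learning 3D by moving around" -- from one view of a scene,
# predict what a different viewpoint would look like. No CAD models,
# no labels; the second view is the supervision for the first.
import torch
import torch.nn as nn

class ViewPredictor(nn.Module):
    def __init__(self, latent=256):
        super().__init__()
        # Encoder: source image -> implicit 3D scene code.
        self.encode = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Flatten(), nn.Linear(64 * 16 * 16, latent))
        # Decoder: scene code + target camera pose -> predicted image.
        self.decode = nn.Sequential(
            nn.Linear(latent + 6, 64 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1))

    def forward(self, src_img, tgt_pose):
        z = self.encode(src_img)                       # implicit shape code
        return self.decode(torch.cat([z, tgt_pose], dim=1))

# Training pair gathered simply by moving: two views of the same scene.
model = ViewPredictor()
src = torch.randn(4, 3, 64, 64)       # view from pose A
tgt = torch.randn(4, 3, 64, 64)       # view from pose B (acts as the label)
pose = torch.randn(4, 6)              # assumed relative 6-DoF pose A -> B
loss = nn.functional.mse_loss(model(src, pose), tgt)
loss.backward()
```

Nothing in the latent code needs to resemble a computer-graphics CAD model; the 3D knowledge is hidden in whatever representation lets the decoder predict new views.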
By the way, on a small tangent on that, are you okay or comfortable with neural networks
link |
that do achieve visual understanding, that, for example, achieve this kind of 3D understanding,
link |
and you don't know how they do it; you're not able to introspect, to visualize,
link |
understand, or interact with the representation.
link |
So the fact that they're not, or may not be, explainable.
link |
Yeah, I think that's fine.
link |
To me that is, so let me put some caveats on that.
link |
So it depends on the setting.
link |
So first of all, I think the humans are not explainable.
link |
So that's a really good point.
link |
So we, one human to another human is not fully explainable.
link |
I think there are settings where explainability matters and these might be, for example, questions
link |
on medical diagnosis.
link |
So I'm in a setting where maybe the doctor, maybe a computer program, has made a certain
link |
diagnosis, and then depending on the diagnosis, perhaps I should have treatment A or treatment B.
link |
So now, is the computer program's diagnosis based on data which was collected
link |
from American males who are in their 30s and 40s, and maybe not so relevant to me?
link |
Maybe it is relevant, you know, et cetera, et cetera.
link |
I mean, in medical diagnosis, we have major issues to do with the reference class.
link |
So we may have acquired statistics from one group of people and applying it to a different
link |
group of people who may not share all the same characteristics.
link |
There might be error bars in the prediction.
link |
So that prediction should really be taken with a huge grain of salt.
link |
But this has an impact on what treatments should be picked, right?
link |
So there are settings where I want to know more than just, this is the answer.
link |
So in that sense, explainability and interpretability matter.
link |
It's about giving error bounds and a better sense of the quality of the decision.
link |
Where I'm willing to sacrifice interpretability is that I believe that there can be systems
link |
which can be highly performant, but which are internally black boxes.
link |
And that seems to be where it's headed.
link |
Some of the best performing systems are essentially black boxes, fundamentally by their construction.
link |
You and I are black boxes to each other.
link |
So the nice thing about the black boxes we are is that those of us who are charming
link |
are able to convince others, to explain
link |
what's going on inside the black box with narratives, with stories.
link |
So in some sense, neural networks don't have to actually explain what's going on inside.
link |
They just have to come up with stories, real or fake, that convince you that they know what's going on.
link |
And I'm sure we can do that.
link |
We can create those stories, neural networks can create those stories.
link |
And the transformer will be involved.
link |
Do you think we will ever build a system of human level or superhuman level intelligence?
link |
We've kind of defined what it takes to try to approach that, but do you think it's possible?
link |
The thing that we thought we could do, what Turing actually thought we could do by the year 2000.
link |
Do you think we'll ever be able to do it?
link |
So I think there are two answers here.
link |
One question, one answer is in principle, can we do this at some time?
link |
And my answer is yes.
link |
The second answer is a pragmatic one.
link |
Do you think we will be able to do it in the next 20 years or whatever?
link |
And to that my answer is no.
link |
So of course that's a wild guess.
link |
I think that, you know, Donald Rumsfeld is not a favorite person of mine, but one of
link |
his lines was very good, which is about known unknowns and unknown unknowns.
link |
So in the business we are in, there are known unknowns and we have unknown unknowns.
link |
So I think with respect to a lot of what's the case in vision and robotics, I feel like
link |
we have known unknowns.
link |
So I have a sense of where we need to go and what the problems that need to be solved are.
link |
I feel with respect to natural language understanding and high level cognition, it's not just known
link |
unknowns, but also unknown unknowns.
link |
So it is very difficult to put any kind of a timeframe to that.
link |
Do you think some of the unknown unknowns might be positive in that they'll surprise
link |
us and make the job much easier?
link |
So fundamental breakthroughs?
link |
I think that is possible because certainly I have been very positively surprised by how
link |
effective these deep learning systems have been, because I certainly would not have believed they would work.
link |
I think what we knew from the mathematical theory was that convex optimization works.
link |
When there's a single global optimum, then these gradient descent techniques would work.
link |
But these are nonlinear, non convex systems,
link |
with a huge number of variables, so over parametrized.
link |
And the people who used to play with them a lot, the ones who are totally immersed in
link |
the lore and the black magic, they knew that they worked well, even though they were...
link |
I thought like everybody...
link |
No, the claim that I hear from my friends like Yann LeCun and so forth is that they
link |
feel that they were comfortable with them.
link |
But the community as a whole was certainly not.
link |
And I think to me that was the surprise that they actually worked robustly for a wide range
link |
of problems from a wide range of initializations and so on.
link |
And so that was certainly more rapid progress than we expected.
link |
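As a toy check of that surprise, here is a small sketch in PyTorch: a non-convex, over-parametrized network trained by plain gradient descent from several random initializations. The problem, network size, and hyperparameters are arbitrary illustrative choices, not a claim about any particular system.

```python
# Sketch: gradient descent on a non-convex, over-parametrized network
# from a range of random initializations, the behavior that surprised
# the community by working robustly.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.linspace(-1, 1, 64).unsqueeze(1)
Y = torch.sin(3 * X)                      # a simple non-linear target

for trial in range(5):                    # several random initializations
    net = nn.Sequential(nn.Linear(1, 256), nn.Tanh(), nn.Linear(256, 1))
    opt = torch.optim.SGD(net.parameters(), lr=0.1)
    for step in range(2000):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(X), Y)
        loss.backward()
        opt.step()
    # Despite non-convexity, each run typically ends at low training loss.
    print(f"init {trial}: final loss {loss.item():.5f}")
```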
But then there are certainly lots of times, in fact, most of the history of AI is when
link |
we have made less progress at a slower rate than we expected.
link |
So we just keep going.
link |
I think what I regard as really unwarranted are these fears of AGI in 10 years and 20
link |
years and that kind of stuff, because that's based on completely unrealistic models of
link |
how rapidly we will make progress in this field.
link |
So I agree with you, but I've also gotten the chance to interact with very smart people
link |
who really worry about existential threats of AI.
link |
And I, as an open minded person, am sort of taking it in.
link |
Do you think AI systems, through some of the unknown unknowns, not super intelligent AI
link |
but something whose nature we don't quite understand, will have a
link |
detrimental effect on society?
link |
Do you think this is something we should be worried about or we need to first allow the
link |
unknown unknowns to become known unknowns?
link |
I think we need to be worried about AI today.
link |
I think that it is not just a worry we need to have when we get that AGI.
link |
I think that AI is being used in many systems today.
link |
And there might be settings, for example, where it causes biases, or decisions which could
link |
be unfair to some people, or it could be a self driving
link |
car which kills a pedestrian.
link |
So AI systems are being deployed today, right?
link |
And they're being deployed in many different settings, maybe in medical diagnosis, maybe
link |
in a self driving car, maybe in selecting applicants for an interview.
link |
So I would argue that when these systems make mistakes, there are consequences.
link |
And we are in a certain sense responsible for those consequences.
link |
So I would argue that this is a continuous effort.
link |
And this is something that in a way is not so surprising.
link |
It's true of all engineering and scientific progress: with great power comes great responsibility.
link |
So as these systems are deployed, we have to worry about them, and it's a continuous process.
link |
I don't think of it as something which will suddenly happen on some day in 2079 for which
link |
I need to design some clever trick.
link |
I'm saying that these problems exist today and we need to be continuously on the lookout
link |
for safety issues, biases, risks, right?
link |
I mean, a self driving car kills a pedestrian, and one has, right?
link |
I mean, this Uber incident in Arizona, right?
link |
It has happened, right?
link |
This is not about AGI.
link |
In fact, it's about a very dumb intelligence which is still killing people.
link |
The worry people have with AGI is the scale.
link |
But I think you're 100% right. The thing that worries me about AI today, and it's
link |
happening at a huge scale, is recommender systems, recommendation systems.
link |
So if you look at Twitter or Facebook or YouTube, they're controlling the ideas that we have
link |
access to, the news and so on.
link |
And that's a fundamental machine learning algorithm behind each of these recommendations.
link |
And I mean, my life would not be the same without these sources of information.
link |
I'm a totally new human being, and the ideas that I know are very much because of the internet,
link |
because of the algorithms that recommend those ideas.
link |
And as they get smarter and smarter, I mean, that is the AGI: the algorithm
link |
that's recommending the next YouTube video you should watch has control of millions,
link |
of billions of people; that algorithm is already, in a sense, super intelligent and has,
link |
not complete, but very strong control of the population.
link |
For now we can turn off YouTube, we can just go have a normal life outside of it.
link |
But the more and more it gets into our life, the more we depend on
link |
that algorithm and on the different companies that are working on it.
link |
So I think it's, you're right, it's already there.
link |
And YouTube in particular is using computer vision, trying their hardest to understand
link |
the content of videos, so they can connect videos with the people who would
link |
benefit from those videos the most.
link |
And so that development could go in a bunch of different directions, some of which might be harmful.
link |
So yeah, you're right, the threats of AI are here already and we should be thinking about them.
link |
On a philosophical notion, if you could, personal perhaps, if you could relive a moment in
link |
your life outside of family because it made you truly happy or it was a profound moment
link |
that impacted the direction of your life, what moment would you go to?
link |
I don't think of single moments, but I look over the long haul.
link |
I feel that I've been very lucky because I feel that, I think that in scientific research,
link |
a lot of it is about being at the right place at the right time.
link |
And you can work on problems at a time when they're just too premature.
link |
You butt your head against them and nothing happens, because the prerequisites for success are not yet in place.
link |
And then there are times when you are in a field which is all pretty mature and you can
link |
only solve curlicues upon curlicues.
link |
I've been lucky to have been in this field for 34 years, well actually 34 years
link |
as a professor at Berkeley, so longer than that in the field, which when I started was just
link |
some little crazy, absolutely useless field which couldn't really do anything, to
link |
a time when it's really, really solving a lot of practical problems, has offered a lot
link |
of tools for scientific research because computer vision is impactful for images in biology
link |
or astronomy and so on and so forth.
link |
So we have made great scientific progress which has had real practical impact in the world.
link |
And I feel lucky that I got in at a time when the field was very young and at a time when
link |
it's now mature but not fully mature.
link |
It's mature but not done.
link |
I mean, it's really still in a productive phase.
link |
Yeah, I think people 500 years from now would laugh at you calling this field mature.
link |
That is very possible.
link |
So, but you're also, lest I forget to mention, you've also mentored some of the biggest names
link |
of computer vision, computer science and AI today.
link |
So many questions I could ask, but really: what is it, how did you do it?
link |
What does it take to be a good mentor?
link |
What does it take to be a good guide?
link |
Yeah, I feel I've been lucky to have had very, very smart and hardworking
link |
and creative students.
link |
I think some part of the credit just belongs to being at Berkeley.
link |
Those of us who are at top universities are blessed because we have very, very smart and
link |
capable students coming and knocking on our door.
link |
So I have to be humble enough to acknowledge that.
link |
But what have I added?
link |
I think I have added something.
link |
What I have added is, I think what I've always tried to teach them is a sense of picking
link |
the right problems.
link |
So I think that in science, in the short run, success is always based on technical competence.
link |
You're, you know, quick with math or whatever.
link |
I mean, there's certain technical capabilities which make for short range progress.
link |
Long range progress is really determined by asking the right questions and focusing on
link |
the right problems.
link |
And I feel that what I've been able to bring to the table in terms of advising these students
link |
is some sense of taste of what are good problems, what are problems that are worth attacking
link |
now as opposed to waiting 10 years.
link |
What's a good problem?
link |
If you could summarize, is it even possible to summarize, what's your sense of what makes a good problem?
link |
I think I have a sense of what is a good problem. There is a British
link |
scientist, in fact he won a Nobel Prize, Peter Medawar, who has a book on this.
link |
And basically he calls research the art of the soluble.
link |
So we need to sort of find problems which are not yet solved, but which are approachable.
link |
And he sort of refers to this sense that there is this problem which isn't quite solved yet,
link |
but it has a soft underbelly.
link |
There is some place where you can, you know, spear the beast.
link |
And having that intuition that this problem is ripe is a good thing because otherwise
link |
you can just beat your head and not make progress.
link |
So I think that is important.
link |
So if I have that and if I can convey that to students, it's not just that they do great
link |
research while they're working with me, but that they continue to do great research.
link |
So in a sense, I'm proud of my students and their achievements and their great research
link |
even 20 years after they've ceased being my student.
link |
So it's in part helping them develop that sense that a problem is not yet solved,
link |
but it's solvable.
link |
The other thing which I think I bring to the table is a certain intellectual breadth.
link |
I've spent a fair amount of time studying psychology, neuroscience, relevant areas of
link |
applied math and so forth.
link |
So I can probably help them see some connections to disparate things which they might not otherwise see.
link |
So the smart students coming into Berkeley can be very deep; they can think very deeply,
link |
meaning very hard down one particular path. Where I could help them is with the shallow
link |
breadth, to complement their narrow depth, and that's of some value.
link |
Well, it was beautifully refreshing just to hear you naturally jump from psychology back
link |
to computer science in this conversation, back and forth.
link |
That's actually a rare quality and I think it's certainly for students empowering to
link |
think about problems in a new way.
link |
So for that and for many other reasons, I really enjoyed this conversation.
link |
Thank you so much.
link |
It was a huge honor.
link |
Thanks for talking to me.
link |
It's been my pleasure.
link |
Thanks for listening to this conversation with Jitendra Malik and thank you to our sponsors,
link |
BetterHelp and ExpressVPN.
link |
Please consider supporting this podcast by going to betterhelp.com slash Lex and signing
link |
up at expressvpn.com slash LexPod.
link |
Click the links, buy the stuff.
link |
That's how they know I sent you and it really is the best way to support this podcast and
link |
the journey I'm on.
link |
If you enjoy this thing, subscribe on YouTube, review it with five stars on Apple podcast,
link |
support it on Patreon or connect with me on Twitter at Lex Friedman.
link |
Don't ask me how to spell that.
link |
I don't remember it myself.
link |
And now let me leave you with some words from Prince Myshkin in The Idiot by Dostoevsky.
link |
Beauty will save the world.
link |
Thank you for listening and hope to see you next time.