Jitendra Malik: Computer Vision | Lex Fridman Podcast #110

The following is a conversation with Jitendra Malik, a professor at Berkeley and one of the seminal figures in the field of computer vision, the kind before the deep learning revolution and the kind after. He has been cited over 180,000 times and has mentored many world class researchers in computer science.

Quick summary of the ads. Two sponsors: one new one, which is BetterHelp, and an old goodie, ExpressVPN. Please consider supporting this podcast by going to betterhelp.com/lex and signing up at expressvpn.com/lexpod. Click the links, buy the stuff. It really is the best way to support this podcast and the journey I'm on. If you enjoy this thing, subscribe on YouTube, review it with five stars on Apple Podcasts, support it on Patreon, or connect with me on Twitter at Lex Fridman, however the heck you spell that. As usual, I'll do a few minutes of ads now and never any ads in the middle that can break the flow of the conversation.

This show is sponsored by BetterHelp, spelled H-E-L-P, help. Check it out at betterhelp.com/lex. They figure out what you need and match you with a licensed professional therapist in under 48 hours. It's not a crisis line, it's not self help, it's professional counseling done securely online. I'm a bit from the David Goggins line of creatures, as you may know, and so have some demons to contend with, usually on long runs or all-night work sessions, possibly full of self doubt. It may be because I'm Russian, but I think suffering is essential for creation. But I also think you can suffer beautifully, in a way that doesn't destroy you. For most people, I think a good therapist can help in this, so it's at least worth a try. Check out their reviews, they're good. It's easy, private, affordable, and available worldwide. You can communicate by text any time and schedule weekly audio and video sessions. I highly recommend that you check them out at betterhelp.com/lex.

This show is also sponsored by ExpressVPN. Get it at expressvpn.com/lexpod to support this podcast and to get an extra three months free on a one year package. I've been using ExpressVPN for many years. I love it. I think ExpressVPN is the best VPN out there. They told me to say it, but it happens to be true. It doesn't log your data, it's crazy fast, and it's easy to use: literally just one big sexy power on button. Again, for obvious reasons, it's really important that they don't log your data. It works on Linux and everywhere else too. But really, why use anything else? Shout out to my favorite flavor of Linux, Ubuntu MATE 20.04. Once again, get it at expressvpn.com/lexpod to support this podcast and to get an extra three months free on a one year package.
And now, here's my conversation with Jitendra Malik.

In 1966, Seymour Papert at MIT wrote up a proposal called the Summer Vision Project, to be given, as far as we know, to 10 students to work on and solve that summer. That proposal outlined many of the computer vision tasks we still work on today. Why do you think we underestimated, and perhaps still underestimate, how hard computer vision is?

Because most of what we do in vision, we do unconsciously or subconsciously. In human vision, that effortlessness gives us the sense that, oh, this must be very easy to implement on a computer. This is why the early researchers in AI got it so wrong. However, if you go into the neuroscience or psychology of human vision, the complexity becomes very clear. The fact is that a very large part of the cerebral cortex is devoted to visual processing, and this is true in other primates as well. Once we look at it from a neuroscience or psychology perspective, it becomes quite clear that the problem is very challenging and will take some time.

You said the high level parts are the harder parts?

I think vision appears to be easy because most of visual processing is subconscious or unconscious, so we underestimate the difficulty. Whereas when you are proving a mathematical theorem or playing chess, the difficulty is much more evident, because it is your conscious brain which is processing the various aspects of the problem solving behavior. In vision, all this is happening, but it's not in your awareness; it's operating below that.

But it still seems strange...

Yes, that's true.

...that as computer vision researchers, the community broadly, we time and time again make the mistake of thinking the problem is easier than it is. Or maybe it's not a mistake. We'll talk a little bit about autonomous driving, for example, and how hard a vision task that is. Is it just human nature, or is there something fundamental to the vision problem that we underestimate? We're still not able to be cognizant of how hard the problem is.

Yeah, in the early days it could have been excused, because in the early days all aspects of AI were regarded as too easy. But today it is much less excusable. I think people fall for this because of what I call the fallacy of the successful first step. There are many problems in vision where getting 50% of the solution you can get in one minute, getting to 90% can take you a day, getting to 99% may take you five years, and 99.99% may be not in your lifetime.

I wonder if that's unique to vision. It seems that language people are not so confident about; in natural language processing, people are a little bit more cautious about our ability to solve the problem. I think for language, people intuit that we have to be able to do natural language understanding. For vision, it seems that we're not cognizant of, or we don't think about, how much understanding is required. It's probably still an open problem. But in your sense, how much understanding is required to solve vision? Put another way, how much of something called common sense reasoning is required to really be able to interpret even static scenes?

Yeah, so vision operates at all levels, and there are parts which can be solved with what we could call maybe peripheral processing. In the human vision literature, there used to be these terms sensation, perception, and cognition, which roughly speaking referred to the front end of processing, the middle stages of processing, and the higher levels of processing. They made a big deal out of this; they wanted to study only perception and then dismiss certain problems as being, quote, cognitive. But really, I think these are artificial divides. The problem is continuous at all levels, and there are challenges at all levels. The techniques that we have today work better at the lower and mid levels of the problem. The higher levels of the problem, quote, the cognitive levels, are there, and in many real applications we have to confront them. Now, how much of that is necessary will depend on the application. For some problems it doesn't matter; for some problems it matters a lot. So I am, for example, a pessimist on fully autonomous driving in the near future, and the reason is because I think there will be that 0.01% of cases where quite sophisticated cognitive reasoning is called for. However, there are tasks which are much more robust, in the sense that error is not so much of a problem. For example, let's say you're doing image search: you're trying to get images based on some description, some visual description. We are very tolerant of errors there, right? When Google image search gives you some images back and a few of them are wrong, it's okay. It doesn't hurt anybody; it's not a matter of life and death. But making mistakes when you are driving at 60 miles per hour and could potentially kill somebody is much more important.

So just for the fun of it, since you mentioned it, let's go there briefly: autonomous vehicles. One of the companies in the space, Tesla, with Andrej Karpathy and Elon Musk, is working on a system called Autopilot, which is primarily a vision based system with eight cameras and basically a single neural network, a multitask neural network. They call it HydraNet: multiple heads, so it does multiple tasks, but it's forming the same representation at the core.
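(As an illustration of that multiple-heads idea, here is a minimal PyTorch sketch of a multitask network with a shared backbone. This is not Tesla's actual architecture; the tasks, layer sizes, and names are hypothetical.)

```python
import torch
import torch.nn as nn

class MultiHeadNet(nn.Module):
    """Minimal multitask sketch: one shared backbone, several task heads."""
    def __init__(self, num_lanes=4, num_object_classes=10):
        super().__init__()
        # Shared representation ("the core"), computed from raw pixels.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Each head consumes the same features but solves its own task.
        self.lane_head = nn.Linear(64, num_lanes)
        self.object_head = nn.Linear(64, num_object_classes)
        self.depth_head = nn.Linear(64, 1)

    def forward(self, images):
        features = self.backbone(images)
        return {
            "lanes": self.lane_head(features),
            "objects": self.object_head(features),
            "depth": self.depth_head(features),
        }

net = MultiHeadNet()
out = net(torch.randn(2, 3, 128, 128))  # a batch of two RGB frames
print({k: v.shape for k, v in out.items()})
```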
Do you think driving can be converted in this way to purely a vision problem and then solved with learning? Or, even more specifically about the current approach, what do you think about what the Tesla Autopilot team is doing?

The way I think about it is that there are certainly subsets of the vision based driving problem which are quite solvable. For example, driving in freeway conditions is quite a solvable problem. There were demonstrations of that going back to the 1980s by Ernst Dickmanns in Munich. In the 90s there were approaches from Carnegie Mellon, and there were approaches from our team at Berkeley. In the 2000s there were approaches from Stanford, and so on. So autonomous driving in certain settings is very doable. The challenge is to have an autopilot work under all kinds of driving conditions. At that point, it's not just a question of vision or perception, but really also of control and of dealing with all the edge cases.

So where do you think most of the difficult cases are? To me, even highway driving is an open problem, because it applies the same 50, 90, 95, 99 percent rule, the fallacy of the successful first step, I forget how you put it, that we fall victim to. I think even highway driving has a lot of elements, because to solve autonomous driving you have to completely relinquish the fallback of a human being who is always in control, so you're really going to feel the edge cases. So I think even highway driving is really difficult. But in terms of the general driving task, do you think vision is the fundamental problem? Or is it also your action, the interaction with the environment, the ability to... and then there's the middle ground, I don't know if you put that under vision, which is trying to predict the behavior of others. That's a little bit in the world of understanding the scene, but it's also trying to form a model of the actors in the scene and predict their behavior.

Yeah, I include that in vision, because to me, perception blends into cognition, and building predictive models of other agents in the world, which could be people or other cars, is part of the task of perception. Perception always has to tell us not just what is now, but what will happen, because what's now is boring: it's done, it's over with. We care about the future because we act in the future, and we care about the past inasmuch as it informs what's going to happen in the future. So I think we have to build predictive models of the behaviors of people, and those can get quite complicated. I've seen examples of this. I own a Tesla, and it has various safety features built in, and what I see are these examples where, let's say, there is some skateboarder. I don't want to be too critical, because obviously these systems are always being improved, and for any specific criticism I have, maybe the system six months from now will not have that particular failure mode. But it had the wrong response, and it's because it couldn't predict what this skateboarder was going to do, because that really required a higher level cognitive understanding of what skateboarders typically do, as opposed to a normal pedestrian. What might have been the correct, typical behavior for a pedestrian was not the typical behavior for a skateboarder. So to do a good job there, you need to have enough data where you have pedestrians and you also have skateboarders; you've seen enough skateboarders to see what kinds of patterns of behavior they have. So in principle, with enough data, that problem could be solved. But I think our current computer vision systems need far, far more data than humans do for learning those same capabilities.

So say that there is going to be a system that solves autonomous driving. Do you think it will look similar to what we have today, but with a lot more data, perhaps more compute, with the fundamental architecture the same? In the case of Tesla Autopilot, it's neural networks. Do you think it will look similar in that regard, and we'll just have more data?

That's a scientific hypothesis as to which way it is going to go. I will tell you what I would bet on, and this is my general philosophical position on how these learning systems develop. What we have found currently very effective in computer vision, in the deep learning paradigm, is tabula rasa learning, tabula rasa learning in a supervised way with lots and lots of...

What's tabula rasa learning?

Tabula rasa in the sense of a blank slate. We just have a system which is given a series of experiences in this setting, and then it learns there. Now, let's think about human driving. It is not tabula rasa learning. At the age of 16, in high school, a teenager goes into driver's ed class. Now, at that point they learn, but at the age of 16 they are already visual geniuses, because from zero to 16 they have built up a certain repertoire of vision. In fact, most of it has probably been achieved by age two. In this period up to age two, they know that the world is three dimensional. They know what objects look like from different perspectives. They know about occlusion. They know about the common dynamics of humans and other bodies. They have some notion of intuitive physics. So they have built that up from their observations and interactions in early childhood, and of course reinforced it through their growing up to age 16. So then at age 16, when they go into driver's ed, what are they learning? They're not learning afresh the visual world; they have a mastery of the visual world. What they are learning is control. They are learning how to be smooth about control, about steering and brakes and so forth. They're learning a sense of typical traffic situations. Now, that education process can be quite short because they are coming in as visual geniuses. And of course, in their future they're going to encounter situations which are very novel. During my driver's ed class, I may not have had to deal with a skateboarder. I may not have had to deal with a truck driving in front of me where the back opens up and some junk gets dropped from the truck and I have to deal with it. But I can deal with this as a driver, even though I did not encounter it in my driver's ed class. And the reason I can deal with it is because I have all this general visual knowledge and expertise.

And do you think the learning mechanisms we have today can do that kind of long term accumulation of knowledge? Or do we have to do some kind of... the work that led up to expert systems, with knowledge representation, the broader field of artificial intelligence, worked on this kind of accumulation of knowledge. Do you think neural networks can do the same?

I don't see any in-principle problem with neural networks doing it, but I think the learning techniques would need to evolve significantly. The current learning techniques that we have are supervised learning: you're given lots of examples, (X, Y) pairs, and you learn the functional mapping between them. I think human learning is far richer than that. It includes many different components. There is the child who explores the world. For example, a child takes an object and manipulates it in his or her hand, and therefore gets to see the object from different points of view, and the child has commanded the movement. So that's a kind of learning data, but the learning data has been arranged by the child, and this is a very rich kind of data. The child can do various experiments with the world. So there are many aspects of human learning, and these have been studied in child development by psychologists. What they tell us is that supervised learning is a very small part of it. There are many different aspects of learning, and what we would need to do is develop models of all of these and then train our systems with that kind of protocol.
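(For concreteness, a minimal sketch of what supervised learning of (X, Y) pairs means mechanically, on toy data; the model and data here are placeholders, and the point of the passage is that human learning is much richer than this.)

```python
import torch
import torch.nn as nn

# Supervised learning in its purest form: a blank-slate model fit to
# (x, y) pairs and nothing else -- no exploration, no interaction.
xs = torch.randn(1000, 16)           # inputs (e.g., image features)
ys = (xs.sum(dim=1) > 0).long()      # labels supplied by an external teacher

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for step in range(200):
    opt.zero_grad()
    loss = loss_fn(model(xs), ys)    # learn only the functional mapping x -> y
    loss.backward()
    opt.step()
```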
So, new methods of learning, some of which might imitate the human brain. But you've also, in your talks, mentioned the compute side of things, in terms of the difference from the human brain, referencing Hans Moravec. Do you think there's something interesting and valuable to consider about the difference in the computational power of the human brain versus the computers of today, in terms of instructions per second?

Yes. So this is a point I've been making for 20 years now. Once upon a time, the way I used to argue this was that we just didn't have the computing power of the human brain; our computers were not quite there. There is a well known tradeoff: neurons are slow compared to transistors, but we have a lot of them and they have very high connectivity. Whereas in silicon you have much faster devices, transistors switching on the order of nanoseconds, but the connectivity is usually smaller. At this point in time, now that we are talking about 2020, we do have, if you consider the latest GPUs and so on, amazing computing power. And if we look back at Hans Moravec's type of calculations, which he did in the 1990s, we may be there today in terms of computing power comparable to the brain. But it's not in the same style; it's of a very different style. For example, the style of computing that we have in our GPUs is far, far more power hungry than the style of computing that is there in the human brain or other biological entities.
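(A Moravec-style back-of-envelope, using commonly quoted textbook numbers rather than Moravec's exact figures; estimates of "brain-equivalent" compute vary by several orders of magnitude depending on assumptions.)

```latex
% Order-of-magnitude comparison (textbook numbers; estimates vary widely):
\[
\underbrace{10^{11}}_{\text{neurons}}
\times \underbrace{10^{4}}_{\text{synapses/neuron}}
\times \underbrace{10^{2}\,\mathrm{Hz}}_{\text{firing rate}}
\;\approx\; 10^{17}\ \text{synaptic ops/s, at roughly }20\,\mathrm{W}.
\]
% Moravec's own extrapolation from the retina was closer to 10^{14}
% instructions/s; a 2020 GPU delivers on the order of 10^{13}-10^{14}
% FLOP/s while drawing hundreds of watts -- broadly comparable raw
% throughput, at a far higher energy cost per operation.
```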
Yeah, and we're going to have to solve that efficiency problem in order to build actual real world systems at large scale.

Let me ask a high level question, taking a step back. How would you articulate the general problem of computer vision? Does such a thing exist? If you look at the computer vision conferences and the work that's been going on, it's often separated into different little segments, breaking the problem of vision apart into segmentation, 3D reconstruction, object detection, I don't know, image captioning, whatever, with benchmarks for each. But if you were to philosophically say, what is the big problem of computer vision, does such a thing exist?

Yes, but it's not in isolation. For all intelligence tasks I always go back to biology, or humans. And if you think about vision or perception in that setting, we realize that perception is always there to guide action. Perception, for a biological system, does not give any benefit unless it is coupled with action. So we can go back and think about the first multicellular animals, which arose in the Cambrian era, 500 million years ago. These animals could move and they could see in some way, and the two activities helped each other. How does movement help? Movement helps because you can get food in different places. But you need to know where to go, and that's really about perception, or seeing. Vision is perhaps the single most important sense, but the others are also important. So perception and action kind of go together. Earlier, it was in these very simple feedback loops, which were about finding food, or avoiding becoming food if there's a predator trying to eat you up, and so forth. So we must, at the fundamental level, connect perception to action. Then, as we evolved, perception became more and more sophisticated because it served many more purposes. And so today we have what seems like a fairly general purpose capability which can look at the external world and build a model of the external world inside the head. We do have that capability. That model is not perfect, and psychologists have great fun pointing out the ways in which the model in your head is not a perfect model of the external world; they create various illusions to show the ways in which it is imperfect. But it's amazing how far it has come from the very simple perception-action loop of an animal 500 million years ago. Once we have these very sophisticated visual systems, we can then impose a structure on them. It's we as scientists who are imposing that structure, where we have chosen to characterize this part of the system as this, quote, module of object detection, or this, quote, module of 3D reconstruction. What's really going on is that all of these processes are running simultaneously, and they are running simultaneously because originally their purpose was in fact to help guide action.

So as a guiding general statement of the problem: you said in humans it was tied to action; do you think we should also say that ultimately the goal, the problem of computer vision, is to sense the world in a way that helps you act in the world?

Yes, I think that's the most fundamental purpose. We have by now hyper-evolved, so we have this visual system which can be used for other things, for example judging the aesthetic value of a painting. That is not guiding action; maybe it's guiding action in terms of how much money you will put in your auction bid, but that's a bit of a stretch. The basics are in fact in terms of action; we have just hyper-evolved our visual system.

Actually, sorry to interrupt, but perhaps it is fundamentally about action. You kind of jokingly mentioned spending, but perhaps the capitalistic drive that drives a lot of the development in this world is about the exchange of money, and the fundamental action is spending money. If you watch Netflix, if you enjoy watching movies, you're using your perception system to interpret the movie. Ultimately, your enjoyment of that movie means you'll subscribe to Netflix. So the action is this extra layer that we've developed in modern society, and perhaps it is fundamentally tied to the action of spending money.

Well, certainly with respect to interactions with firms. In this homo economicus role, when you're interacting with firms, it does become that.

That's it. What else is there? And that was a rhetorical question. Okay. So, to linger on the division between the static and the dynamic: so much of the work in computer vision, so many of the breakthroughs that you've been a part of, have been in the static world, in looking at static images. And then you've also worked, though to a much smaller degree, as has the community, on looking at the dynamic, at video, at dynamic scenes. And then there is robotic vision, which is dynamic, but also where you actually have a robot in the physical world interacting based on that vision. Which problem is harder? The sort of trivial first answer is, well, of course, one image is harder. But if you look at a deeper question there: are we, what's the term, cutting ourselves off at the knees, making the problem harder, by focusing on images?

That's a fair question. I think sometimes we can simplify a problem so much that we essentially lose part of the juice that could enable us to solve the problem, and one could reasonably argue that to some extent this happens when we go from video to single images. Now, historically, you have to consider the limits imposed by the computational capabilities we had. Many of the choices made in the computer vision community through the 70s, 80s, and 90s can be understood as choices which were forced upon us by the fact that we just didn't have access to enough compute.

Not enough memory, not enough hard drive.

Exactly. Not enough compute, not enough storage. So think of these choices. One of the choices is focusing on single images rather than video. Okay, clear reasons: storage and compute. Another: we used to detect edges and throw away the image. So you have an image which is, say, 256 by 256 pixels, and instead of keeping around the grayscale values, what we did was detect edges, find the places where the brightness changes a lot, and then throw away the rest. This was a major compression device, and the hope was that you could still work with it. The logic was: humans can interpret a line drawing, and this will save us computation. So many of the choices were dictated by that.
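(A toy version of that pipeline: a crude gradient-based edge detector that keeps only a sparse binary map and discards the grayscale values. The threshold and image here are illustrative only.)

```python
import numpy as np

def edge_map(gray, thresh=0.2):
    """Crude edge detector: keep only the places where brightness changes
    a lot. A 256x256 grayscale image becomes a sparse binary map -- the
    kind of aggressive compression early vision systems relied on."""
    gx = np.zeros_like(gray)
    gy = np.zeros_like(gray)
    gx[:, 1:-1] = gray[:, 2:] - gray[:, :-2]   # horizontal gradient
    gy[1:-1, :] = gray[2:, :] - gray[:-2, :]   # vertical gradient
    magnitude = np.hypot(gx, gy)
    return magnitude > thresh * magnitude.max()

img = np.random.rand(256, 256)
edges = edge_map(img)
print(edges.mean())  # fraction of pixels kept; the rest is thrown away
```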
Today we are no longer detecting edges; we process images with convnets, because we don't have those compute restrictions anymore. Now, video is still understudied, because video compute is still quite challenging if you are a university researcher. Video compute is not so challenging if you are at Google or Facebook or Amazon.

Still super challenging. I just spoke with a VP of engineering at Google, the head of YouTube search and discovery, and they still struggle doing stuff on video. It's very difficult, except using techniques that are essentially the techniques of the 90s, some very basic computer vision techniques.

No, that's when you want to do things at scale. If you want to operate at the scale of all the content of YouTube, it's very challenging, and there are similar issues at Facebook. But as a researcher you have more opportunities: you can train large networks with relatively large video data sets.

Yeah.

Yes. So I think this is part of the reason why we have so emphasized static images, and I think this is changing. Over the next few years I see a lot more progress happening in video. I have this generic statement that, to me, video recognition feels like 10 years behind object recognition. And you can quantify that, because you can take some of the challenging video data sets, and the performance on action classification is, say, 30%, which is kind of what we used to have around 2009 in object detection. It's about 10 years behind. Whether it'll take 10 years to catch up is a different question; hopefully it will take less than that.

Let me ask a similar question to one I've already asked, but now for dynamic scenes. Do you think some kind of injection of knowledge bases and reasoning is required to help improve, say, action recognition? If we solve the general action recognition problem, what do you think the solution would look like? Or is there another way?

Yeah. So I completely agree that knowledge is called for, and that knowledge can be quite sophisticated. The way I would say it is that perception blends into cognition, and cognition brings in issues of memory, and this notion of a schema from psychology. Let me use the classic example: you go to a restaurant. Now, things happen in a certain order. You walk in, somebody takes you to a table, the waiter comes, gives you a menu, takes the order, food arrives, eventually the bill arrives, et cetera, et cetera. This is a classic example from the AI of the 1970s. There were the terms frames and scripts and schemas; these are all quite similar ideas. In the 70s, the way the AI of the time dealt with this was by hand coding it. They hand coded in this notion of a script, with the various stages and the actors and so on and so forth, and used that to interpret, for example, language. If there's a description of a story involving some people eating at a restaurant, there are all these inferences you can make, because you know what typically happens at a restaurant. So I think this kind of knowledge is absolutely essential.
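(To make the contrast concrete, here is roughly what a 1970s-style hand-coded script looks like, in the spirit of Schank and Abelson's work on scripts. The structures and names are hypothetical; the point of the passage is that this knowledge should instead be learned.)

```python
from dataclasses import dataclass, field

# A hand-coded script/schema: roles plus an ordered sequence of stages.
@dataclass
class Script:
    name: str
    roles: list = field(default_factory=list)
    stages: list = field(default_factory=list)

restaurant = Script(
    name="restaurant",
    roles=["customer", "host", "waiter", "cook"],
    stages=["enter", "be seated", "receive menu", "order",
            "food arrives", "eat", "bill arrives", "pay", "leave"],
)

def infer_between(script, before, after):
    """Fill in unstated events: what typically happens between two stages."""
    i, j = script.stages.index(before), script.stages.index(after)
    return script.stages[i + 1 : j]

print(infer_between(restaurant, "order", "eat"))  # ['food arrives']
```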
So when we are going to do long form video understanding, we are going to need to do this. The kinds of technology that we have right now, with 3D convolutions over a couple of seconds of a video clip, are very much tailored towards short term video understanding, not long term understanding. Long term understanding requires this notion of schemas that I talked about, perhaps some notions of goals, intentionality, functionality, and so on and so forth. Now, how will we bring that in? We could either revert back to the 70s and say, okay, I'm going to hand code in a script, or we might try to learn it. I tend to believe that we have to find learning ways of doing this, because learning ways land up being more robust. And there must be a learning version of the story, because children acquire a lot of this knowledge by, sort of, just observation. It's possible, but I think not so typical, that at some moment in a child's life a mother coaches the child through all the stages of what happens in a restaurant. They just go as a family, they go to the restaurant, they eat, they come back, and the child goes through ten such experiences, and the child has got a schema of what happens when you go to a restaurant. So we somehow need to provide that capability to our systems.

You mentioned the following line from the end of the Alan Turing paper, Computing Machinery and Intelligence, which many people know and very few have read, where he proposes the Turing test; this is how you know, because it's towards the end of the paper: "Instead of trying to produce a program to simulate the adult mind, why not rather try to produce one which simulates the child's?" So that's a really interesting point. If I think about the benchmarks we have before us, the tests of our computer vision systems, they're often kind of trying to get at the adult. So what kind of benchmarks, what kind of tests for computer vision do you think we should have, that mimic the child's in computer vision?

Yeah, I think we should have those, and we don't have those today. Part of the challenge is that we should really be collecting data of the type that a child experiences. That gets into issues of privacy and so on and so forth, but there are attempts in this direction to try to collect the kind of data that a child encounters growing up. What's the child's linguistic environment? What's the child's visual environment? If we could collect that kind of data and then develop learning schemes based on that data, that would be one way to do it. I think that's a very promising direction myself. There might be people who would argue that we could just short circuit this in some way, and sometimes we have had success by not imitating nature in detail. The usual example is airplanes: we don't build flapping wings. So yes, that's one of the points of debate. In my mind, I would bet on this learning-like-a-child approach.

So one of the fundamental aspects of learning like a child is the interactivity. The child gets to play with the data set it's learning from.

Yes, it gets to select. You can call that active learning in the machine learning world; you can call it a lot of terms. What are your thoughts about this whole space of being able to play with the data set, or select what you're learning?

Yeah. So I believe in that, and I think that we could achieve it in two ways, and I think we should use both. One is actual real robotics: real physical embodiments of agents who are interacting with the world. They have a physical body, with dynamics and mass and moment of inertia and friction and all the rest, and the robot learns its body by doing a series of actions. The second is simulation environments. I think simulation environments are getting much, much better. At Facebook AI Research, our group has worked on something called Habitat, which is a simulation environment: a visually photorealistic environment of places like houses or interiors of various urban spaces and so forth. As you move, you get a picture, which is a pretty accurate picture. So now you can imagine that subsequent generations of these simulators will be accurate not just visually, but with respect to forces and masses and haptic interactions and so on. And then we have that environment to play with.
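(Schematically, the perception-action loop a simulator provides looks like the gym-style sketch below. This is not the actual Habitat API; all names here are hypothetical.)

```python
import random

class SimEnv:
    """Toy simulated environment: the agent acts, the world renders back."""
    def reset(self):
        self.pos = 0
        return self._observe()

    def step(self, action):
        # The action changes the world; the agent gets a new observation.
        self.pos += {"forward": 1, "back": -1}[action]
        return self._observe(), self.pos == 5  # (observation, done flag)

    def _observe(self):
        return {"rgb": f"rendered view from position {self.pos}"}

env = SimEnv()
obs = env.reset()
for _ in range(100):                              # bounded episode
    action = random.choice(["forward", "back"])   # a real agent learns this
    obs, done = env.step(action)                  # perception-action loop
    if done:
        break
```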
Let me state one reason why I think this ability to act in the world is important. I think this is one way to break the correlation versus causation barrier. This is something which is of a great deal of interest these days. People like Judea Pearl have talked a lot about how we are neglecting causality, and he describes the entire set of successes of deep learning as just curve fitting. I don't quite agree.

He's a troublemaker.

He is. But causality is important. Causality is not like a single silver bullet; it's not one single principle. There are many different aspects here. One of our most reliable ways of establishing causal links, and this is the way the medical community, for example, does it, is randomized controlled trials. You pick some situations, and in some of them you perform an action, and in certain others you don't. So you have a controlled experiment.
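(A miniature randomized controlled trial, to show why randomizing who gets the action licenses a causal conclusion. The effect size and noise here are made up.)

```python
import random

random.seed(0)
effect = 2.0                                  # the true causal effect (hidden)

def outcome(treated):
    """Outcome = causal effect of the action (if taken) plus noise."""
    return (effect if treated else 0.0) + random.gauss(0, 1)

# Randomization ensures the two groups differ only in the action itself.
treated = [outcome(True) for _ in range(5000)]
control = [outcome(False) for _ in range(5000)]
estimate = sum(treated) / len(treated) - sum(control) / len(control)
print(f"estimated causal effect: {estimate:.2f}")   # close to 2.0
```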
Well, the child is in fact performing controlled experiments all the time, at a small scale. But that is a way that the child gets to build and refine its causal models of the world. My colleague Alison Gopnik, together with a couple of co-authors, has this book called The Scientist in the Crib, referring to children. The part that I like about that is that the scientist wants to build causal models, and the scientist does controlled experiments, and I think the child is doing that. So to enable that, we will need to have these active experiments, and I think this could be done some in the real world and some in simulation.

So you have hope for simulation?

I have hope for simulation.

That's an exciting possibility, if we can get to not just photorealistic but, what's that called, life-realistic simulation. So you don't see any fundamental blocks to why we can't eventually simulate the principles of what it means to exist in the world?

I don't see any fundamental problems there. And look, the computer graphics community has come a long way. In the early days, going back to the 80s and 90s, they were focusing on visual realism. They could do the easy stuff, but they couldn't do things like hair or fur and so on. Well, they managed to do that. Then they couldn't do physical actions, like a glass bowl that falls down and shatters, but then they could start to do pretty realistic models of that, and so on and so forth. So the graphics people have shown that they can do this forward direction, not just for optical interactions but also for physical interactions. Of course, some of that is very compute intensive, but I think, by and by, we will find ways of making our models ever more realistic.

You break vision apart, in one of your presentations, into early vision, static scene understanding, and dynamic scene understanding, and raise a few interesting questions. I thought I could just throw some at you, to see if you want to talk about them. So, early vision. What is it that you said? Sensation, perception, and cognition. So is this sensation?

Yes.

What can we learn from image statistics that we don't already know? At the lowest level, what can we make of just the statistics, the variations in the raw pixels, the textures and so on?

Yeah. So what we seem to have learned is that there's a lot of redundancy in these images, and as a result we are able to do a lot of compression. This compression is very important in biological settings: you might have 10 to the 8 photoreceptors and only 10 to the 6 fibers in the optic nerve, so you have to do a compression by a factor of 100 to 1. And there are analogs of that happening in our artificial neural networks, in the early layers.

So you think there's a lot of compression that can be done at the beginning, just from the statistics?

Yeah.
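(A toy illustration of why redundancy permits compression at roughly that ratio: if images secretly live on a low-dimensional subspace, a small code reconstructs them almost perfectly. The dimensions here are arbitrary.)

```python
import numpy as np

# 500 "images" of 10,000 pixels each, secretly generated by 10 latent causes.
rng = np.random.default_rng(0)
basis = rng.standard_normal((10, 10_000))
images = rng.standard_normal((500, 10)) @ basis

# Keep the top 100 principal directions: 10,000 -> 100 numbers per image,
# a 100-to-1 compression that loses almost nothing because of redundancy.
U, S, Vt = np.linalg.svd(images, full_matrices=False)
codes = images @ Vt[:100].T          # compressed representation
recon = codes @ Vt[:100]             # decompress
err = np.linalg.norm(images - recon) / np.linalg.norm(images)
print(f"relative reconstruction error: {err:.2e}")  # ~0
```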
How much?

Well, the way to think about it is just: how successful is image compression? That's been done with older technologies, but there are several companies which are trying to use these more advanced neural network type techniques for compression, both for static images and for video. One of my former students has a company which is trying to do stuff like this, and I think they are showing quite interesting results. That success is really about image statistics and video statistics.

But that's still not doing compression of the kind where, when I see a picture of a cat, all I have to say is "it's a cat." That's another, semantic, kind of compression.

Yeah, that's at a higher level; as I said, this is focusing on low level statistics.

So to linger on that for a little bit: you mentioned, how far can bottom up image segmentation go? And in general, you mentioned that the central question for scene understanding is the interplay of bottom up and top down information. Maybe this is a good time to elaborate on that. Maybe define what is bottom up and what is top down in the context of computer vision?

Right. So today what we have are very interesting systems, because they work completely bottom up.

What does bottom up mean, sorry?

Bottom up means, in this case, a feedforward neural network.

Starting from the raw pixels?

Yeah, they start from the raw pixels and they end up with something like "cat" or "not a cat." So our systems run totally feedforward, but they're trained in a very top down way: they're trained by saying, okay, this is a cat, there's a cat, there's a dog, there's a zebra, et cetera. And I'm not fully happy with either of these choices, because we have completely separated these two processes. So what do we know compared to biology? In biology, what we know is that the processes at test time, at runtime, are not purely feedforward; they involve feedback. And they involve much shallower neural networks. The kinds of neural networks we are using in computer vision, say a ResNet-50, have 50 layers. Well, in the brain, in the visual cortex, going from the retina to IT, maybe we have like seven. So they're far shallower, but we have the possibility of feedback: there are backward connections. And this might enable us to deal with more ambiguous stimuli, for example. So the biological solution seems to involve feedback; the solution in artificial vision seems to be just feedforward, but with a much deeper network. And the two are functionally equivalent, because if you have a feedback network which has, say, three rounds of feedback, you can unroll it, make it three times the depth, and create it in a totally feedforward way. We have written some papers on this theme, but I really feel that this theme should be pursued further.
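(A minimal sketch of that equivalence: a weight-tied block applied for three rounds of feedback, which unrolls into a feedforward network three times as deep. The layer shapes are placeholders.)

```python
import torch
import torch.nn as nn

class RecurrentBlock(nn.Module):
    """A shallow block with shared weights, applied for several rounds."""
    def __init__(self, dim=64, rounds=3):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())  # shared weights
        self.rounds = rounds

    def forward(self, x):
        h = x
        for _ in range(self.rounds):   # unrolling this loop produces the
            h = self.f(h) + x          # functionally equivalent feedforward net
        return h

y = RecurrentBlock()(torch.randn(8, 64))
print(y.shape)
```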
Oh, some kind of recurrence mechanism.

Yeah. Okay, so that's one point: I want to have a little bit more top down at test time. Then, at training time, we make use of a lot of top down knowledge right now. Basically, to learn to segment an object, we have to have all these examples of "this is the boundary of a cat," "this is the boundary of a chair," "this is the boundary of a horse," and so on. And this is too much top down knowledge. How do humans do this? We manage with far less supervision, and we do it in a sort of bottom up way. For example, we're looking at a video stream and the horse moves, and that enables me to say that all these pixels are together. The Gestalt psychologists used to call this the principle of common fate.
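(A toy version of common fate: grouping pixels purely by shared motion in an optical flow field, with no labels involved. The flow field here is synthetic.)

```python
import numpy as np

# Gestalt "common fate": pixels that move together belong together.
flow = np.zeros((64, 64, 2))          # optical flow (dx, dy) per pixel
flow[20:40, 20:40] = [3.0, 0.5]       # a "horse" region moving coherently

moving = np.linalg.norm(flow, axis=2) > 1.0   # group by shared motion
print(moving.sum(), "pixels grouped into one entity")  # 400
```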
So there was a bottom up process by which we were able to segment out these objects, and we have totally focused on the top down training signal. So in my view, the way we have currently solved this top down-bottom up interaction in machine vision is not fully satisfactory, and I would rather have a bit of both at both stages.

For all computer vision problems, not just segmentation?

Yes. And the question you can ask is, well, for me, I'm inspired a lot by human vision and I care about that; you could be just a hard boiled engineer and not give a damn. To you, I would then argue that you would need far less training data if you could make my research program, you know, fruitful.

Okay. So then, maybe taking a step into segmentation, static scene understanding: what is the interaction between segmentation and recognition? You mentioned the movement of objects. For people who don't know computer vision: segmentation is this weird activity that computer vision folks have all agreed is very important, of drawing outlines around objects, versus a bounding box, and then classifying that object. What's the value of segmentation? What is it as a problem in computer vision? How is it fundamentally different from detection, recognition, and the other problems?

Yeah. So segmentation enables us to say that some set of pixels is an object, without necessarily even being able to name that object or to know its properties.

Oh, so you mean segmentation purely as the act of separating an object, a blob that's united in some way, from its background?

Yeah. Entitification, if you will: making an entity out of it.

Entitification. Beautiful.

So I think we have that capability, and it enables us, as we are growing up, to acquire the names of objects with very little supervision. Suppose, let's posit, that the child has this ability to separate out objects in the world. Then when the mother says "pick up your bottle," or "the cat's behaving funny today," the word "cat" suggests some object, and the child sort of does the mapping. The mother doesn't have to teach specific object labels by pointing to them. Weak supervision works in the context where you have the ability to create objects. So to me, that's a very fundamental capability. There are also applications where this is very important, for example medical diagnosis. In medical diagnosis, you have some brain scan; this is some work that we did in my group, where you have CT scans of people who have had traumatic brain injury, and what the radiologist needs to do is precisely delineate the various places where there might be bleeds, for example. There are clear needs like that. So there are certainly very practical applications of computer vision where segmentation is necessary. But philosophically, segmentation enables the task of recognition to proceed with much weaker supervision than we require today.

And you think of segmentation as this kind of task that takes in a visual scene and breaks it apart into interesting entities that might be useful for whatever the task is?

Yeah. And it is not semantics free. It blends into, it involves, perception and cognition. The mistake that we used to make in the early days of computer vision was to treat it as a purely bottom up perceptual task. It is not just that, because we do revise our notion of segmentation with more experience. For example, there are objects which are non-rigid, like animals or humans, and I think understanding that all the pixels of a human are one entity is actually quite a challenge, because the parts of the human can move independently, and the human wears clothes, so they might be differently colored. So it's all sort of a challenge.

You mentioned the three R's of computer vision: recognition, reconstruction, reorganization. Can you describe these three R's and how they interact?

Yeah. So recognition is the easiest one, because that's what I think people generally think of as computer vision achieving these days, which is labels. Is this a cat? Is this a dog? Is this a chihuahua? It could be very fine grained, like a specific breed of dog or a specific species of bird, or it could be very abstract, like "animal." But given a part of an image, or a whole image, you put a label on it. So that's recognition. Reconstruction you can think of, essentially, as inverse graphics; that's one way to think about it. Graphics is: you have some internal computer representation of some objects arranged in a scene, and what you do is produce a picture, the pixels corresponding to a rendering of that scene. So let's do the inverse of this. We are given an image and we say: oh, this image arises from some objects in a scene, looked at with a camera from this viewpoint, and we might have more information about the objects, like their shape, maybe their textures, maybe color, et cetera, et cetera. So that's the reconstruction problem. In a way, you are creating in your head a model of the external world. Reorganization has to do with essentially finding these entities. It's organization: the word organization implies structure. In psychology, in perception, we use the term perceptual organization: the world is not internally represented as just a collection of pixels, but we make these entities, we create these entities, objects, whatever you want to call them.

And the relationships between the entities as well? Or is it purely about the entities?

It could be about the relationships, but mainly we focus on the fact that there are entities.

Okay, so I'm trying to pinpoint what "organization" means. So organization is that, instead of a uniform grid, we have this structure of objects. Segmentation is a small part of that?

Yeah, segmentation gets us going towards that.

And you kind of have this triangle where they all interact together.

Yes.

So how do you see that interaction? Reorganization is defining the entities in the world, recognition is labeling those entities, and then reconstruction is, what, filling in the gaps?

Well, for example, imputing some 3D objects corresponding to each of these entities.

That would be part of adding more information that's not there in the raw data.

Correct. I started pushing this kind of view around 2010 or so, because at that time in computer vision, people were just working on many different problems, but they treated each of them as a separate, isolated problem, each with its own data set, and then you'd try to solve that and get good numbers on it. I didn't like that approach, because I wanted to see the connections between them. And when people divided up vision into various modules, the way they would do it was as low level, mid level, and high level vision, corresponding roughly to the psychologist's notions of sensation, perception, and cognition, and that didn't map to tasks that people cared about. So I tried to promote this particular framework as a way of considering the problems that people in computer vision were actually working on, and of being more explicit about the fact that they actually are connected to each other. At that time I was doing this just on the basis of information flow. Now it turns out, in the last five years or so, after the deep learning revolution, that this architecture has turned out to be very conducive to that, because basically in these neural networks we are trying to build multiple representations; there can be multiple output heads sharing common representations. So in a certain sense, today, given the reality of the solutions people have, I do not need to preach this anymore. It is just there; it's part of the solution space.

So speaking of neural networks, how much of this problem of computer vision, of reorganization, recognition, and reconstruction, can be learned end to end, do you think? Sort of set it and forget it, just plug and play, have a giant data set, perhaps multimodal, and then just learn the entirety of it?

Well, what end to end learning means nowadays is end to end supervised learning, and that, I would argue, is too narrow a view of the problem. I like this child development view, this lifelong learning view, one where there are certain capabilities that are built up, and then there are further capabilities which are built up on top of those. So that's what I believe in. End to end learning in a supervised setting, for a very precise task, to me is kind of a limited view of the learning process.

Got it. So thinking beyond purely supervised learning, looking back to children: you mentioned six lessons that we can learn from children: be multimodal, be incremental, be physical, explore, be social, use language. Can you speak to these, perhaps picking one that you find most fundamental to our time today?

Yeah. So I should say, to give due credit, this is from a paper by Smith and Gasser, and it reflects, I would say, common wisdom among child development people. It's just that it is not common wisdom among people in computer vision and AI and machine learning, so I view my role as trying to bridge the two worlds. So let's take the example of multimodal; I like that one. For multimodal, a canonical example is a child interacting with an object. The child holds a ball and plays with it. At that point, it's getting a touch signal, and the touch signal is giving it a notion of 3D shape, but a sparse one. And the child is also seeing a visual signal. Imagine these are two totally different spaces. One is the space of receptors on the skin of the fingers and the thumb and the palm; these map onto neuronal fibers getting activated somewhere, leading to some activation in somatosensory cortex. A similar thing would happen if we had a robot hand. And then we have the pixels corresponding to the visual view. But we know that they correspond to the same object. So that's a very, very strong cross-calibration signal, and it is self supervisory, which is beautiful. There's nobody assigning a label; the mother doesn't have to come and assign a label. The child doesn't even have to know that this object is called a ball. But the child is learning something about the three dimensional world from this signal. That's tactile and visual, and there is some work on that. There is a lot of work currently on audio and visual. So there is some event that happens in the world, and that event has a visual signature and an auditory signature. There is this glass bowl on the table, and it falls and breaks, and I hear the smashing sound and I see the pieces of glass: I've built that connection between the two. This has become a hot topic in computer vision in the last couple of years. There are problems like separating out multiple speakers, which was a classic problem in audition; they call it the problem of source separation, or the cocktail party effect, and so on. But when you try to do it and you also have the visual signal, it becomes so much easier and so much more useful.

So with multimodal there's so much more signal, and you can use that for some kind of weak supervision as well.

Yes, because the signals are occurring at the same time. You have time, which links the two. At a certain moment T1, you get a certain signal in the auditory domain and a certain signal in the visual domain, and they must be causally related.

Yeah, that's an exciting area. Not well studied yet.

Yeah, we have a little bit of work on this, but so much more needs to be done. So this is a good example.
link |
of work at this, but so much more needs to be done. So this is a good example. Be physical,
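To make the time-as-supervision idea concrete, here is a minimal sketch, assuming a PyTorch-style setup, of contrastive audio-visual learning: clips recorded at the same moment are treated as matched pairs, everything else in the batch as mismatched, so time alignment is the only label. All module and feature names are illustrative, not taken from any particular paper, and the same scheme would apply to touch and vision by swapping the audio encoder for a tactile one.

    # Sketch: audio-visual correspondence learned from time alignment alone.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AVEmbedder(nn.Module):
        """Two small encoders mapping each modality into a shared embedding space."""
        def __init__(self, audio_dim=128, video_dim=512, embed_dim=64):
            super().__init__()
            self.audio_net = nn.Sequential(nn.Linear(audio_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim))
            self.video_net = nn.Sequential(nn.Linear(video_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim))

        def forward(self, audio_feats, video_feats):
            a = F.normalize(self.audio_net(audio_feats), dim=-1)
            v = F.normalize(self.video_net(video_feats), dim=-1)
            return a, v

    def contrastive_av_loss(a, v, temperature=0.07):
        # Clips from the same moment are positives; all other pairings in the
        # batch are negatives. Time itself is the only "label".
        logits = a @ v.t() / temperature              # (B, B) similarity matrix
        targets = torch.arange(a.size(0))             # diagonal entries are the matched pairs
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

    # Usage: per-clip features for a batch of 32 time-aligned audio/video clips.
    model = AVEmbedder()
    audio_feats, video_feats = torch.randn(32, 128), torch.randn(32, 512)
    loss = contrastive_av_loss(*model(audio_feats, video_feats))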
Be physical. That's to do with something we talked about earlier, that there's an embodied world.
You mention language, use language. So Noam Chomsky believes that language may be at the core of cognition, at the core of everything in the human mind. What is the connection between language and vision to you? What's more fundamental? Are they neighbors? Is one the parent and one the child, the chicken and the egg?

Oh, it's very clear. It is vision that is the parent, the fundamental ability.

It comes before. You think vision is more fundamental than language?

Correct. And you can think of it either in phylogeny or in ontogeny. Phylogeny means looking at evolutionary time. We have vision that developed 500 million years ago. Then, when we get to maybe 5 million years ago, you have the first bipedal primate. Once we started to walk, the hand became free, and with that came manipulation, the ability to manipulate objects and build tools and so on and so forth.

So you said 500,000 years ago?

No, sorry. The first multicellular animals, which you can say had some intelligence, arose 500 million years ago. Now let's fast forward to the last 7 million years, which is the development of the hominid line, where from the other primates we have the branch which leads to modern humans. There are many of these hominids, but the one people talk about is Lucy, because there's a skeleton from 3 million years ago, and we know that Lucy walked. So at this stage the hand is free for manipulating objects, and then comes the ability to manipulate objects and build tools, and brain size grew in this era. So now you have manipulation. Now, we don't know exactly when language arose.

But after that.

But after that, because no apes have it. Chomsky is correct in that it is a uniquely human capability; other primates don't have it. So it developed somewhere in this era, but I would argue that it probably developed after we had this stage: the human species already able to manipulate, hands free, much bigger brain size.

And for that, a lot of vision already had to have developed. So sensation and perception, and maybe some of the cognition.

Yeah. So these ancestors of ours, three, four million years ago, had spatial intelligence. They knew that the world consists of objects, and that the objects were in certain relationships to each other. They had observed causal interactions among objects. They could move in space. So they had space and time and all of that. Language builds on that substrate. All human languages have constructs which depend on a notion of space and time. Where did that notion of space and time come from? It had to come from perception and action in the world we live in.
Yeah. Well, you've referred to that as spatial intelligence. So to linger a little bit: we mentioned Turing and his suggestion that we should learn from children. Nevertheless, language is the fundamental piece of the test of intelligence that Turing proposed.

Yes.

What do you think is a good test of intelligence? What would impress the heck out of you? Is it fundamentally in natural language, or is there something in vision?

I don't think we should create a single test of intelligence. Just as I don't believe in IQ as a single number, I think there can be many capabilities, which are perhaps correlated. So there will be accomplishments which are visual accomplishments, accomplishments in manipulation or robotics, and then accomplishments in language. I do believe that language will be the hardest nut to crack.

Really? So what's harder: to pass the spirit of the Turing test, whatever formulation makes it convincing natural language, somebody you would want to have a beer with, hang out and have a chat with, or general natural scene understanding? You think language is the harder problem?
I'm not a fan of the Turing test. Turing, as he proposed the test in 1950, was trying to solve a certain problem.

Yeah, imitation.

Yeah. And I think it made a lot of sense then. Where we are today, 70 years later, I think we should not worry about it. The Turing test is no longer the right way to channel research in AI, because it takes us down the path of a chatbot which can fool us for five minutes or whatever. I would rather have a list of 10 different tasks: tasks in the manipulation domain, tasks in navigation, tasks in visual scene understanding, tasks in reading a story and answering questions based on it. My favorite language understanding task would be reading a novel and being able to answer arbitrary questions about it. That, to me, and this is not an exhaustive list by any means, is where we need to be going, and on each of these axes there's a fair amount of work to be done.
So on the visual understanding side, in this intelligence Olympics that we've set up, what's a good test, one of many, of visual scene understanding? Do such benchmarks exist? Sorry to interrupt.

No, there aren't any. To me, a really good test would be an aid to the blind: suppose there was a blind person and I needed to assist that blind person.

So ultimately, like we said, vision that aids in action and survival in this world. Maybe in a simulated world it would be easier to measure performance.

Maybe, but what we are ultimately after is performance in the real world.

So, David Hilbert in 1900 proposed 23 open problems of mathematics, some of which are still unsolved, the most famous of which is probably the Riemann hypothesis. You've thought about and presented on the Hilbert problems of computer vision; I don't know when you last presented that, 2015 perhaps, but versions of it. You're kind of the face and the spokesperson for computer vision, and it's your job to state what the open problems are for the field. So what today are the Hilbert problems of computer vision, do you think?

Let me pick one which I regard as clearly unsolved, which is what I would call long-form video understanding. We have a video clip, and we want to understand the behavior in it in terms of agents, their goals, and intentionality, and make predictions about what might happen: the kind of understanding which goes beyond atomic visual actions. In the short range, the question is, are you sitting, are you standing, are you catching a ball? That we can do now. Or even if we can't do it fully accurately, if we can do it at 50%, maybe next year we'll do it at 65, and so forth. But long-range video understanding I don't think we can do today. It blends into cognition, and that's the reason it's challenging.

So you have to understand the entities, you have to track them, and you have to have some kind of model of their behavior.

Correct. And these are agents, not just passive objects, so they exhibit goal-directed behavior. So this is one area.
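As a rough illustration of the structure such a system would need, here is a hypothetical pipeline skeleton: detect entities, track them over time, fit a goal model per agent, then roll those models forward. This is not a solution; every stage is itself open research, and all type and function names are invented for this sketch.

    # Hypothetical skeleton for long-form video understanding; each stage is open research.
    from dataclasses import dataclass, field

    @dataclass
    class Track:
        entity_id: int
        boxes: list = field(default_factory=list)    # one bounding box per frame

    @dataclass
    class AgentModel:
        entity_id: int
        inferred_goal: str                           # e.g. "reach the door"

    def detect_and_track(frames):
        """Stages 1-2: find entities and link detections into tracks over time."""
        raise NotImplementedError                    # stand-in for a detector plus a tracker

    def infer_goals(tracks):
        """Stage 3: fit a goal-directed behavior model to each agent's trajectory."""
        return [AgentModel(t.entity_id, inferred_goal="unknown") for t in tracks]

    def predict_future(tracks, agent_models, horizon_frames):
        """Stage 4: roll each agent's model forward to anticipate what happens next."""
        # Trivial placeholder: assume each agent stays at its last observed position.
        return {t.entity_id: t.boxes[-1:] * horizon_frames for t in tracks}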
Then I would point to understanding the world in 3D. Now, this may seem paradoxical, because in a way we were able to do 3D understanding even 30 years ago, right? But I don't think we currently have the richness of 3D understanding in our computer vision systems that we would like. Let me elaborate on that a bit. Currently, we have two kinds of techniques which are not fully unified. There are the techniques from multi-view geometry: you have multiple pictures of a scene and you do a reconstruction using stereoscopic vision or structure from motion. But these techniques totally fail if you just have a single view, because they rely on multi-view geometry. Then we have techniques developed in the computer vision community which try to guess 3D from single views. These are based on supervised learning, and they rely on having 3D models of objects available at training time. And this is completely unnatural supervision, right? CAD models are not injected into your brain.

So what would I like? What I would like is a notion of 3D learned as you move around the world. We have a succession of visual experiences, and as part of that I might see a chair from different viewpoints, or a table from different viewpoints, and so on. That enables me to build some internal representation. Then the next time I see a single photograph, and it may not even be of that chair, it may be of some other chair, I have a guess of what its 3D shape is like.

So you're almost learning the CAD model, implicitly?

Yeah, implicitly. And the CAD model need not be in the same form as used by computer graphics programs. It's hidden in the representation.

It's hidden in the representation: the ability to predict new views, what I would see if I went to such-and-such position.
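One way to phrase that objective in code is novel-view prediction. The following is a minimal sketch assuming a PyTorch-style setup, with all module names invented for illustration: encode one view into a latent code that plays the role of the implicit CAD model, condition a decoder on the relative camera pose, and penalize the difference from the actually observed second view. A moving observer supplies both views for free, so no 3D labels are needed.

    # Sketch: self-supervised 3D via novel-view prediction (names are illustrative).
    import torch
    import torch.nn as nn

    class ViewPredictor(nn.Module):
        def __init__(self, latent_dim=256, pose_dim=6):
            super().__init__()
            # Encoder: source image -> latent "implicit shape" code.
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.Flatten(), nn.Linear(64 * 16 * 16, latent_dim))
            # Decoder: latent code plus relative pose -> predicted target image.
            self.decoder = nn.Sequential(
                nn.Linear(latent_dim + pose_dim, 64 * 16 * 16), nn.ReLU(),
                nn.Unflatten(1, (64, 16, 16)),
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1))

        def forward(self, source_view, relative_pose):
            z = self.encoder(source_view)
            return self.decoder(torch.cat([z, relative_pose], dim=1))

    # Training pairs come from moving around: two views of the same scene, no 3D labels.
    model = ViewPredictor()
    source, target = torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64)
    pose = torch.randn(8, 6)                     # relative camera motion between the views
    loss = nn.functional.mse_loss(model(source, pose), target)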
By the way, on a small tangent on that: are you comfortable with neural networks that do achieve visual understanding, for example this kind of 3D understanding, when you're not able to introspect, visualize, or interact with the representation? The fact that they are not, or may not be, explainable?

Yeah, I think that's fine, with some caveats. It depends on the setting. First of all, I think humans are not explainable.

Yeah, that's a really good point. One human to another human is not fully explainable.

I think there are settings where explainability matters, for example questions of medical diagnosis. Suppose a computer program has made a certain diagnosis, and depending on the diagnosis, perhaps I should have treatment A or treatment B. Now, is the computer program's diagnosis based on data which was collected from American males in their 30s and 40s, and maybe not so relevant to me? Maybe it is relevant, et cetera. In medical diagnosis we have major issues to do with the reference class: we may have acquired statistics from one group of people and be applying them to a different group who may not share all the same characteristics. There might be error bars on the prediction, so the prediction should really be taken with a huge grain of salt, and that has an impact on which treatment should be picked. So there are settings where I want to know more than just "this is the answer". In that sense, explainability and interpretability may matter.

It's about giving error bounds and a better sense of the quality of the decision.

Where I'm willing to sacrifice interpretability is that I believe there can be systems which are highly performant but which are internally black boxes.

And that seems to be where we're headed. Some of the best performing systems are essentially black boxes, fundamentally by their construction. You and I are black boxes to each other.

Yeah.

The nice thing about the black boxes we are is that, while we ourselves are black boxes, those of us who are charming are able to convince others, to explain what's going on inside the black box with narratives, with stories. So in some sense neural networks don't have to actually explain what's going on inside. They just have to come up with stories, real or fake, that convince you that they know what's going on. And I'm sure we can do that; neural networks can create those stories.
Yeah. And the transformer will be involved.

Do you think we will ever build a system of human-level or superhuman-level intelligence? We've kind of defined what it takes to approach that, but do you think it's within our reach, the thing we thought we could do, that Turing thought we could do by the year 2000? Do you think we'll ever be able to do it?

Yeah. So there are two answers here. One answer is: in principle, can we do this at some time? And my answer is yes. The second answer is a pragmatic one: do I think we will be able to do it in the next 20 years or whatever? And to that my answer is no. And of course that's a wild guess. Donald Rumsfeld is not a favorite person of mine, but one of his lines was very good, the one about known knowns, known unknowns, and unknown unknowns. In the business we are in, there are known unknowns and there are unknown unknowns. With respect to a lot of what's needed in vision and robotics, I feel we have known unknowns: I have a sense of where we need to go and what the problems that need to be solved are. With respect to natural language understanding and high-level cognition, it's not just known unknowns but also unknown unknowns, so it is very difficult to put any kind of time frame on that.
Do you think some of the unknown unknowns might be positive, in that they'll surprise us and make the job much easier? Fundamental breakthroughs?

I think that is possible, because I have certainly been very positively surprised by how effective these deep learning systems have been; I would not have believed that in 2010. What we knew from mathematical theory was that gradient descent techniques work for convex optimization, where there is a single global optimum. These are nonlinear, nonconvex systems with a huge number of variables.

So, overparameterized.

Overparameterized. And the people who used to play with them a lot, the ones who were totally immersed in the lore and the black magic, knew that they worked well even though they were nonconvex.

Really? I thought everybody was surprised.

No, the claim I hear from my friends like Yann LeCun and so forth is that they felt comfortable with them.

Well, he says that now.

But the community as a whole certainly was not. To me, that was the surprise: that they actually worked robustly, for a wide range of problems, from a wide range of initializations, and so on.
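As a toy illustration of that surprise, here is a minimal, self-contained sketch, not tied to anything specific in the conversation: plain gradient descent on an overparameterized network reliably drives the training loss toward zero from many random initializations, even though the loss surface is nonconvex and the classical convex-optimization guarantees do not apply.

    # Toy demo: gradient descent on a nonconvex, overparameterized problem still works.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    x = torch.rand(64, 1) * 6 - 3
    y = torch.sin(x)                 # nonlinear target; the loss surface is nonconvex

    for trial in range(5):           # several random initializations
        # ~770 parameters for 64 data points: overparameterized.
        net = nn.Sequential(nn.Linear(1, 256), nn.Tanh(), nn.Linear(256, 1))
        opt = torch.optim.SGD(net.parameters(), lr=0.1)
        for step in range(2000):
            opt.zero_grad()
            loss = nn.functional.mse_loss(net(x), y)
            loss.backward()
            opt.step()
        print(f"init {trial}: final training loss {loss.item():.5f}")   # small every time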
And so that was certainly more rapid progress than we expected. But there are also lots of times, in fact most of the history of AI, when we have made progress at a slower rate than we expected, and we just keep going. What I regard as really unwarranted are these fears of AGI in 10 years or 20 years and that kind of stuff, because they are based on completely unrealistic models of how rapidly we will make progress in this field.
So I agree with you, but I've also gotten a chance to interact with very smart people who really worry about the existential threats of AI, and as an open-minded person I'm taking it in. Do you think AI systems, in some way we don't quite understand, not superintelligent AI but something of that nature, will have a detrimental effect on society? Is this something we should be worried about, or do we need to first allow the unknown unknowns to become known unknowns?

I think we need to be worried about AI today. It is not just a worry we need to have when we get to AGI. AI is being used in many systems today, and there might be settings where, for example, it introduces biases or makes decisions which could be harmful, decisions which could be unfair to some people, or a self-driving car kills a pedestrian. AI systems are being deployed today, in many different settings: maybe in medical diagnosis, maybe in a self-driving car, maybe in selecting applicants for an interview. When these systems make mistakes, there are consequences, and we are in a certain sense responsible for those consequences. So I would argue that this is a continuous effort, and in a way that is not so surprising. It's true of all engineering and scientific progress: with great power comes great responsibility. As these systems are deployed, we have to worry about them, and it's a continuous problem. I don't think of it as something which will suddenly happen on some day in 2079, for which I need to design some clever trick. These problems exist today, and we need to be continuously on the lookout for safety issues, biases, and risks. A self-driving car can kill a pedestrian, and one has: there was the Uber incident in Arizona. It has happened. This is not about AGI; in fact, it's about a very dumb intelligence killing people.
The worry people have with AGI is the scale. But I think you're 100% right: the thing that worries me about AI today, and it's happening at a huge scale, is recommendation systems. If you look at Twitter or Facebook or YouTube, they're controlling the ideas we have access to, the news, and so on, and there's fundamentally a machine learning algorithm behind each of those recommendations. My life would not be the same without these sources of information; I'm a totally new human being, and the ideas I know are very much because of the internet, because of the algorithms that recommend those ideas. So as they get smarter and smarter, that is the AGI: the algorithm that's recommending the next YouTube video you should watch has reach over millions, billions of people. That algorithm is in a sense already superintelligent and has control of the population. Not complete control, but very strong control. For now we can turn off YouTube and just go have a normal life outside of it, but the more it gets into our lives, the more we'll depend on the algorithm and on the different companies working on it. So you're right, it's already there. And YouTube in particular is using computer vision, trying their hardest to understand the content of videos so they can connect videos with the people who would benefit from them the most. That development could go in a bunch of different directions, some of which might be harmful. So yeah, you're right: the threats of AI are here already, and we should be thinking about them.

On a philosophical note, a personal one perhaps: if you could relive a moment in your life, outside of family, because it made you truly happy or because it was a profound moment that impacted the direction of your life, what moment would you go to?
I don't think of single moments; I look over the long haul. I feel that I've been very lucky, because in scientific research a lot of it is about being at the right place at the right time. You can work on problems at a time when they're just too premature: you beat your head against them and nothing happens, because the prerequisites for success are not there. And there are times when you are in a field which is already pretty mature, and you can only add curlicues upon curlicues. I've been lucky to have been in this field for, well, 34 years as a professor at Berkeley, so longer than that, from a time when it was some little crazy, absolutely useless field which couldn't really do anything, to a time when it's really solving a lot of practical problems and has offered a lot of tools for scientific research, because computer vision is impactful for images in biology or astronomy and so on and so forth. So we have made great scientific progress, which has had real practical impact in the world. And I feel lucky that I got in at a time when the field was very young, and am here at a time when it is mature, but not fully mature. Mature, but not done; it's still in a productive phase.

Yeah, I think people 500 years from now would laugh at you calling this field mature.

That is very possible.

So, lest I forget to mention: you've also mentored some of the biggest names in computer vision, computer science, and AI today. There are so many questions I could ask, but really: how did you do it? What does it take to be a good mentor, a good guide?

Yeah. I feel I've been lucky to have had very, very smart, hardworking, and creative students. Some part of the credit just belongs to being at Berkeley: those of us at top universities are blessed, because we have very smart and capable students coming and knocking on our door. So I have to be humble enough to acknowledge that. But what have I added? I think I have added something. What I've always tried to teach them is a sense of picking the right problems. In science, in the short run, success is always based on technical competence: you're quick with math, or whatever; there are certain technical capabilities which make for short-range progress. Long-range progress is really determined by asking the right questions and focusing on the right problems. What I feel I've been able to bring to the table in advising these students is some sense of taste for what the good problems are, which problems are worth attacking now as opposed to waiting 10 years.
What's a good problem, if you could summarize? Is that even possible to summarize? What's your sense of a good problem?

I think I have a sense of what a good problem is. There is a British scientist, in fact a Nobel Prize winner, Peter Medawar, who has a book on this. Basically, he says that research is the art of the soluble: we need to find problems which are not yet solved but which are approachable. He refers to this sense that there is a problem which isn't quite solved yet, but it has a soft underbelly; there is some place where you can spear the beast. Having the intuition that a problem is ripe is a good thing, because otherwise you can just beat your head against it and not make progress. So I think that is important. And if I have that, and if I can convey it to students, it's not just that they do great research while they're working with me, but that they continue to do great research. In that sense, I'm proud of my students and their achievements, their great research even 20 years after they've ceased being my students.

So part of it is helping them develop that sense that a problem is not yet solved, but is solvable.

Correct.
The other thing I think I bring to the table is a certain intellectual breadth. I've spent a fair amount of time studying psychology, neuroscience, relevant areas of applied math, and so forth. So I can probably help them see connections to disparate things which they might not have seen otherwise. The smart students coming into Berkeley can be very deep, in the sense that they can think very hard down one particular path. Where I can help is breadth: they have the narrow depth, I add the shallow breadth, and that's of some value.

Well, it was beautifully refreshing just to hear you jump naturally between psychology and computer science in this conversation, back and forth. That's actually a rare quality, and I think for students it's certainly empowering to think about problems in a new way. So for that, and for many other reasons, I really enjoyed this conversation. Thank you so much. It was a huge honor. Thanks for talking to me.
It's been my pleasure.

Thanks for listening to this conversation with Jitendra Malik, and thank you to our sponsors, BetterHelp and ExpressVPN. Please consider supporting this podcast by going to betterhelp.com slash lex and signing up at expressvpn.com slash lex pod. Click the links, buy the stuff. It's how they know I sent you, and it really is the best way to support this podcast and the journey I'm on. If you enjoy this thing, subscribe on YouTube, review it with five stars on Apple Podcasts, support it on Patreon, or connect with me on Twitter at Lex Friedman. Don't ask me how to spell that; I don't remember myself. And now let me leave you with some words from Prince Myshkin in The Idiot by Dostoevsky: Beauty will save the world. Thank you for listening, and hope to see you next time.