
Jitendra Malik: Computer Vision | Lex Fridman Podcast #110



link |
00:00:00.000
The following is a conversation with Jitendra Malik, a professor at Berkeley and one of the
link |
00:00:05.360
seminal figures in the field of computer vision, the kind before the deep learning revolution
link |
00:00:10.960
and the kind after. He has been cited over 180,000 times and has mentored many world
link |
00:00:19.280
class researchers in computer science. Quick summary of the ads. Two sponsors,
link |
00:00:25.520
one new one, which is BetterHelp, and an old goodie, ExpressVPN. Please consider supporting
link |
00:00:32.400
this podcast by going to betterhelp.com slash lex and signing up at expressvpn.com slash lex pod.
link |
00:00:40.640
Click the links, buy the stuff. It really is the best way to support this podcast and the journey
link |
00:00:46.000
I'm on. If you enjoy this thing, subscribe on YouTube, review it with five stars on Apple
link |
00:00:51.280
Podcasts, support it on Patreon, or connect with me on Twitter at Lex Fridman, however
link |
00:00:56.880
the heck you spell that. As usual, I'll do a few minutes of ads now and never any ads in the middle
link |
00:01:02.320
that can break the flow of the conversation. This show is sponsored by BetterHelp spelled H E L P
link |
00:01:10.160
help. Check it out at betterhelp.com slash lex. They figure out what you need and match you with
link |
00:01:16.960
a licensed professional therapist in under 48 hours. It's not a crisis line. It's not self help.
link |
00:01:24.000
It's professional counseling done securely online. I'm a bit from the David Goggins line
link |
00:01:29.760
of creatures, as you may know, and so I have some demons to contend with, usually on long runs
link |
00:01:36.800
or all-nighters working, fueled and possibly full of self-doubt. It may be because I'm Russian,
link |
00:01:43.120
but I think suffering is essential for creation. But I also think you can suffer beautifully in a
link |
00:01:49.760
way that doesn't destroy you. For most people, I think a good therapist can help in this. So it's
link |
00:01:55.280
at least worth a try. Check out their reviews. They're good. It's easy, private, affordable,
link |
00:02:01.600
available worldwide. You can communicate by text, any time, and schedule weekly audio and video
link |
00:02:07.680
sessions. I highly recommend that you check them out at betterhelp.com slash lex. This show is
link |
00:02:16.160
also sponsored by ExpressVPN. Get it at expressvpn.com slash lex pod to support this podcast and
link |
00:02:24.480
to get an extra three months free on a one year package. I've been using ExpressVPN for many years.
link |
00:02:31.200
I love it. I think ExpressVPN is the best VPN out there. They told me to say it, but it happens to
link |
00:02:37.920
be true. It doesn't log your data. It's crazy fast and easy to use: literally just one big sexy
link |
00:02:45.760
power on button. Again, for obvious reasons, it's really important that they don't log your data.
link |
00:02:51.280
It works on Linux and everywhere else too. But really, why use anything else? Shout out to my
link |
00:02:57.760
favorite flavor of Linux, Ubuntu MATE 20.04. Once again, get it at expressvpn.com slash lex pod
link |
00:03:05.920
to support this podcast and to get an extra three months free on a one year package.
link |
00:03:12.960
And now here's my conversation with Jitendra Malik. In 1966, Seymour Papert at MIT wrote up a
link |
00:03:22.640
proposal called the Summer Vision Project, to be given, as far as we know, to 10 students to work on
link |
00:03:29.760
and solve that summer. So that proposal outlined many of the computer vision tasks we still work on
link |
00:03:35.280
today. Why do you think we underestimated, and perhaps still
link |
00:03:41.840
underestimate how hard computer vision is? Because most of what we do in vision, we do unconsciously
link |
00:03:49.600
or subconsciously in human vision. So that effortlessness gives us the sense that, oh,
link |
00:03:57.760
this must be very easy to implement on a computer. Now, this is why the early researchers in AI got
link |
00:04:07.760
it so wrong. However, if you go into neuroscience or psychology of human vision, then the complexity
link |
00:04:17.120
becomes very clear. The fact is that a very large part of the cerebral cortex is devoted to
link |
00:04:24.240
visual processing, and this is true in other primates as well. So once we look at it
link |
00:04:31.280
from a neuroscience or psychology perspective, it becomes quite clear that the problem is very
link |
00:04:37.120
challenging and it will take some time. You said the high level parts are the harder parts?
link |
00:04:42.880
I think vision appears to be easy because most of visual processing is subconscious or
link |
00:04:53.200
unconscious. So we underestimate the difficulty. Whereas when you are like proving a mathematical
link |
00:05:03.440
theorem or playing chess, the difficulty is much more evident. So because it is your conscious
link |
00:05:10.160
brain, which is processing various aspects of the problem solving behavior. Whereas in vision,
link |
00:05:18.560
all this is happening, but it's not in your awareness; it's operating below that level.
link |
00:05:25.520
But it still seems strange. Yes, that's true. But it seems strange that as computer vision
link |
00:05:31.360
researchers, for example, the community broadly is time and time again makes the mistake of
link |
00:05:40.160
thinking the problem is easier than it is. Or maybe it's not a mistake. We'll talk a little bit
link |
00:05:44.480
about autonomous driving, for example, how hard of a vision task that is. Do you think, I mean,
link |
00:05:53.760
is it just human nature or is there something fundamental to the vision problem that we
link |
00:05:58.160
underestimate? We're still not able to be cognizant of how hard the problem is.
link |
00:06:05.680
Yeah, I think in the early days, it could have been excused because in the early days,
link |
00:06:11.600
all aspects of AI were regarded as too easy. But I think today it is much less excusable.
link |
00:06:19.440
And I think why people fall for this is because of what I call the fallacy of the successful
link |
00:06:27.760
first step. There are many problems in vision where getting 50% of the solution you can get in one
link |
00:06:36.720
minute, getting to 90% can take you a day, getting to 99% may take you five years and
link |
00:06:45.440
99.99% may not happen in your lifetime. I wonder if that's unique to vision.
link |
00:06:51.280
It seems that with language, people are not so confident; in natural language processing,
link |
00:06:57.680
people are a little bit more cautious about our ability to solve that problem. I think for
link |
00:07:04.640
language people intuit that we have to be able to do natural language understanding. For vision,
link |
00:07:13.760
it seems that we're not cognizant or we don't think about how much understanding is required.
link |
00:07:18.880
It's probably still an open problem. But in your sense, how much understanding is required to solve
link |
00:07:26.160
vision? Put another way, how much something called common sense reasoning is required to
link |
00:07:35.120
really be able to interpret even static scenes? Yeah, so vision operates at all levels. And there
link |
00:07:43.840
are parts which can be solved with what we could call maybe peripheral processing. So in the human
link |
00:07:52.640
vision literature, there used to be these terms sensation, perception, and cognition, which
link |
00:07:59.120
roughly speaking referred to the front end of processing, middle stages of processing, and
link |
00:08:05.680
higher level of processing. And I think they made a big deal out of this and they wanted
link |
00:08:11.920
to study only perception and then dismiss certain problems as being, quote, cognitive.
link |
00:08:18.960
But really, I think these are artificial divides. The problem is continuous at all levels,
link |
00:08:26.080
and there are challenges at all levels. The techniques that we have today, they work better
link |
00:08:31.920
at the lower and mid levels of the problem. I think the higher levels of the problem, quote,
link |
00:08:37.360
the cognitive levels of the problem are there. And we, in many real applications, we have to
link |
00:08:45.040
confront them. Now, how much that is necessary will depend on the application. For some problems,
link |
00:08:52.080
it doesn't matter. For some problems, it matters a lot. So I am, for example, a pessimist on
link |
00:09:00.240
fully autonomous driving in the near future. And the reason is because I think there will be
link |
00:09:07.760
that 0.01% of the cases where quite sophisticated cognitive reasoning is called for. However,
link |
00:09:16.800
there are tasks which are, first of all, much more robust, in the sense
link |
00:09:24.240
that error is not so much of a problem. For example, let's say you're doing
link |
00:09:32.160
image search. You're trying to get images based on some description, some visual description.
link |
00:09:41.680
We are very tolerant of errors there, right? I mean, when Google image search gives you some
link |
00:09:46.240
images back and a few of them are wrong, it's okay. It doesn't hurt anybody. It's
link |
00:09:52.320
not a matter of life and death. But making mistakes when you are driving at 60 miles per hour
link |
00:10:01.440
and you could potentially kill somebody is much more important. So just for the,
link |
00:10:07.840
for the fun of it, since you mentioned, let's go there briefly about autonomous vehicles.
link |
00:10:12.640
So one of the companies in the space, Tesla, where Andrej Karpathy and Elon Musk are working on
link |
00:10:19.440
a system called autopilot, which is primarily a vision based system with eight cameras and
link |
00:10:26.320
basically a single neural network, a multitask neural network. They call it HydraNet: multiple heads.
link |
00:10:33.280
So it does multiple tasks, but it's forming the same representation at the core.
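As a concrete picture of that multitask idea, here is a minimal PyTorch sketch of a shared-backbone network with multiple output heads. The layer sizes and the two task heads are illustrative assumptions, not Tesla's actual architecture.

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """Toy shared-backbone network: one representation, several task heads."""

    def __init__(self):
        super().__init__()
        # Shared convolutional backbone (sizes are illustrative).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Separate heads read the same core representation.
        self.lane_head = nn.Linear(32, 4)      # hypothetical lane-geometry outputs
        self.object_head = nn.Linear(32, 10)   # hypothetical object-class logits

    def forward(self, x):
        features = self.backbone(x)            # the shared representation
        return self.lane_head(features), self.object_head(features)

net = MultiTaskNet()
lanes, objects = net(torch.randn(1, 3, 128, 128))  # one image, two task outputs
```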
link |
00:10:38.560
Do you think driving can be converted in this way into purely a vision problem and then solved
link |
00:10:45.920
with learning? Or even more specifically in the current approach, what do you think about
link |
00:10:53.760
what Tesla autopilot team is doing? So the way I think about it is that there are certainly
link |
00:11:01.280
subsets of the visual based driving problem, which are quite solvable. So for example,
link |
00:11:06.160
driving in freeway conditions is quite a solvable problem. I think there were demonstrations of that
link |
00:11:14.640
going back to the 1980s by someone called Ernst Dickmanns in Munich. In the 90s, there were
link |
00:11:23.120
approaches from Carnegie Mellon. There were approaches from our team at Berkeley. In the 2000s,
link |
00:11:29.520
there were approaches from Stanford and so on. So autonomous driving in certain settings is
link |
00:11:36.880
very doable. The challenge is to have an autopilot work under all kinds of driving conditions. At
link |
00:11:45.360
that point, it's not just a question of vision or perception, but really also of control and
link |
00:11:51.440
dealing with all the edge cases. So where do you think most of the difficult cases, to me,
link |
00:11:58.000
even highway driving is an open problem, because the same 50, 90, 95, 99 rule applies;
link |
00:12:06.240
the fallacy of the successful first step, I forget how you put it, that we fall victim to.
link |
00:12:12.000
I think even highway driving has a lot of elements because to solve autonomous driving,
link |
00:12:16.880
you have to completely relinquish the fallback help of a human being who's always in control. So
link |
00:12:24.000
you're really going to feel the edge cases. So I think even highway driving is really difficult.
link |
00:12:29.200
But in terms of the general driving task, do you think vision is the fundamental problem?
link |
00:12:34.800
Or is it also your action, the interaction with the environment,
link |
00:12:42.720
the ability to... And then the middle ground, I don't know if you put that under vision,
link |
00:12:47.520
which is trying to predict the behavior of others, which is a little bit in the world of
link |
00:12:53.520
understanding the scene, but it's also trying to form a model of the actors in the scene
link |
00:12:59.920
and predict their behavior. Yeah, I include that in vision because to me, perception blends into
link |
00:13:05.920
cognition, and building predictive models of other agents in the world. Other agents
link |
00:13:12.480
could be people, could be other cars. That is part of the task of perception
link |
00:13:18.240
because perception always has to tell us not just what is now, but what will happen, because what's
link |
00:13:25.440
now is boring. It's done. It's over with. We care about the future because we act in the future.
link |
00:13:33.200
And we care about the past in as much as it informs what's going to happen in the future.
link |
00:13:38.800
So I think we have to build predictive models of behaviors of people and those can get quite
link |
00:13:45.920
complicated. So I mean, I've seen examples of this in actually, I mean, I own a Tesla and
link |
00:13:58.080
it has various safety features built in. And what I see are these examples where
link |
00:14:05.520
let's say there is some skateboarder. I mean, and I don't want to be too critical because
link |
00:14:12.000
obviously these systems are always being improved and any specific criticism I have,
link |
00:14:19.280
maybe the system six months from now will not have that particular failure mode.
link |
00:14:25.600
So it had the wrong response and it's because it couldn't predict what this skateboarder was going
link |
00:14:37.040
to do. And because it really required that higher level cognitive understanding of what
link |
00:14:44.400
skateboarders typically do as opposed to a normal pedestrian. So what might have been
link |
00:14:49.360
the correct behavior for a pedestrian, a typical behavior for pedestrian was not the
link |
00:14:54.880
typical behavior for a skateboarder. And so therefore to do a good job there,
link |
00:15:04.240
you need to have enough data where you have pedestrians, you also have skateboarders,
link |
00:15:09.360
you've seen enough skateboarders to see what kinds of patterns or behavior they have.
link |
00:15:16.320
So it is in principle with enough data that problem could be solved. But I think our current
link |
00:15:25.200
systems, computer vision systems, they need far, far more data than humans do for learning
link |
00:15:31.920
those same capabilities. So say that there is going to be a system that solves autonomous
link |
00:15:36.720
driving. Do you think it will look similar to what we have today, but have a lot more data,
link |
00:15:42.960
perhaps more compute, but the fundamental architecture is the same, which in the case
link |
00:15:48.720
of Tesla autopilot is neural networks? Do you think it will look similar in that regard,
link |
00:15:55.120
and we'll just have more data? That's a scientific hypothesis as to which way it's going to go. I
link |
00:16:02.320
will tell you what I would bet on. And this is my general philosophical position on how these
link |
00:16:10.880
learning systems have been built. What we have found currently very effective in computer vision
link |
00:16:18.000
in the deep learning paradigm is sort of tabula rasa learning, and tabula rasa learning in a
link |
00:16:25.840
supervised way with lots and lots of... What's tabula rasa learning? Tabula rasa in the sense
link |
00:16:30.400
of a blank slate. We just have a system which is given a series of experiences in this setting
link |
00:16:37.600
and then it learns there. Now, let's think about human driving; it is not tabula rasa learning.
link |
00:16:44.480
So at the age of 16 in high school, a teenager goes into driver ed class. And now at that point,
link |
00:16:56.320
they learn, but at the age of 16, they are already visual geniuses because from 0 to 16,
link |
00:17:04.560
they have built a certain repertoire of vision. In fact, most of it has probably been achieved by
link |
00:17:10.480
age two. In this period up to age two, they know that the world is three dimensional.
link |
00:17:18.000
They know what objects look like from different perspectives. They know about occlusion.
link |
00:17:24.480
They know about common dynamics of humans and other bodies. They have some notion of intuitive
link |
00:17:31.120
physics. So they built that up from their observations and interactions in early childhood
link |
00:17:38.480
and of course, reinforced through their growing up to age 16. So then at age 16, when they go into
link |
00:17:46.800
driver ed, what are they learning? They're not learning the visual world afresh. They have a mastery
link |
00:17:53.200
of the visual world. What they are learning is control. They are learning how to be smooth
link |
00:18:00.080
about control, about steering and brakes and so forth. They're learning a sense of typical
link |
00:18:05.920
traffic situations. Now, that education process can be quite short because they are coming in as
link |
00:18:15.600
visual geniuses. And of course, in their future, they're going to encounter situations which are
link |
00:18:22.240
very novel. So during my driver ed class, I may not have had to deal with a skateboarder.
link |
00:18:29.680
I may not have had to deal with a truck driving in front of me where the back opens up and some
link |
00:18:37.760
junk gets dropped from the truck and I have to deal with it. But I can deal with this as a driver,
link |
00:18:45.040
even though I did not encounter this in my driver ed class. And the reason I can deal with it is
link |
00:18:50.000
because I have all this general visual knowledge and expertise. And do you think the learning
link |
00:18:57.680
mechanisms we have today can do that kind of long term accumulation of knowledge? Or do we have to
link |
00:19:06.320
do some kind of... The work that led up to expert systems with knowledge representation,
link |
00:19:13.120
the broader field of artificial intelligence worked on this kind of accumulation of knowledge.
link |
00:19:19.920
Do you think neural networks can do the same? I think I don't see any in principle problem with
link |
00:19:27.440
neural networks doing it. But I think the learning techniques would need to evolve significantly.
link |
00:19:33.520
So the current learning techniques that we have are supervised learning: you're given lots of
link |
00:19:42.240
examples, (X, Y) pairs, and you learn the functional mapping between them.
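To make that supervised setup concrete, here is a minimal sketch: fit a functional mapping to (X, Y) pairs by gradient descent. The toy model and data are assumptions purely for illustration.

```python
import torch
import torch.nn as nn

# Toy (X, Y) pairs: learn y = 2x + 1 from noisy examples.
X = torch.linspace(-1, 1, 100).unsqueeze(1)
Y = 2 * X + 1 + 0.1 * torch.randn_like(X)

model = nn.Linear(1, 1)                          # the functional mapping
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(200):
    loss = nn.functional.mse_loss(model(X), Y)   # error on the (X, Y) pairs
    optimizer.zero_grad()
    loss.backward()                              # gradients w.r.t. the weights
    optimizer.step()                             # adjust the mapping
```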
link |
00:19:49.520
I think that human learning is far richer than that. It includes many different components. There is
link |
00:19:55.760
a child who explores the world and sees... For example, a child takes an object and manipulates it
link |
00:20:06.240
in his or her hand and therefore gets to see the object from different points of view. And the child
link |
00:20:13.120
has commanded the movement. So that's a kind of learning data. But the learning data has been
link |
00:20:17.840
arranged by the child. And this is a very rich kind of data. The child can do various experiments
link |
00:20:26.560
with the world. So there are many aspects of human learning. And these have been studied in
link |
00:20:36.320
child development by psychologists. And what they tell us is that supervised learning is a very
link |
00:20:43.920
small part of it. There are many different aspects of learning. And what we would need to do is to
link |
00:20:51.120
develop models of all of these and then train our systems with that kind of protocol.
link |
00:21:02.160
So new methods of learning, some of which might imitate the human brain. But you also,
link |
00:21:09.040
in your talks, I've mentioned sort of the compute side of things. In terms of the
link |
00:21:13.200
difference in the human brain, referencing Hans Moravec. So do you think there's something
link |
00:21:22.400
interesting, valuable to consider about the difference in the computational power of the human
link |
00:21:28.560
brain versus the computers of today in terms of instructions per second? Yes. So if we go back...
link |
00:21:36.160
So this is a point I've been making for 20 years now. And I think once upon a time, the way I used
link |
00:21:44.480
to argue this was that we just didn't have the computing power of the human brain. Our computers
link |
00:21:49.760
were not quite there. And I mean, there is a well known tradeoff, which we know that
link |
00:21:59.280
the neurons are slow compared to transistors. But we have a lot of them and they have a very high
link |
00:22:08.080
connectivity. Whereas in silicon, you have much faster devices, transistors switch at...
link |
00:22:14.880
On the order of nanoseconds, but the connectivity is usually smaller. At this point in time,
link |
00:22:22.400
I mean, we are now talking about 2020, we do have, if you consider the latest GPUs and so on,
link |
00:22:28.800
amazing computing power. And if we look back at Hans Moravec's type of calculations, which he
link |
00:22:36.960
did in the 1990s, we may be there today in terms of computing power comparable to the brain. But
link |
00:22:43.840
it's not in the same style. It's of a very different style. So I mean, for example, the style of
link |
00:22:52.080
computing that we have in our GPUs is far, far more power hungry than the style of computing that
link |
00:22:59.120
is there in the human brain or other biological entities. Yeah. And the efficiency part
link |
00:23:08.320
is something we're going to have to solve in order to build actual real world systems at large scale.
link |
00:23:14.880
Let me ask sort of the high level question. Taking a step back, how would you articulate
link |
00:23:20.640
the general problem of computer vision? Does such a thing exist? So if you look at the computer vision
link |
00:23:26.560
conferences and the work that's been going on, it's often separated into different little segments,
link |
00:23:33.360
breaking the problem of vision apart into, whether it's segmentation, 3D reconstruction,
link |
00:23:40.000
object detection, I don't know, image captioning, whatever; there are benchmarks for each. But if
link |
00:23:46.400
you were to sort of philosophically say, what is the big problem of computer vision? Does such a
link |
00:23:52.160
thing exist? Yes, but it's not in isolation. So for all intelligence tasks, I always go back to
link |
00:24:04.880
sort of biology or humans. And if you think about vision or perception in that setting,
link |
00:24:12.640
we realize that perception is always to guide action. Perception for a biological system
link |
00:24:19.440
does not give any benefits unless it is coupled with action. So we can go back and think about
link |
00:24:26.240
the first multicellular animals which arose in the Cambrian era 500 million years ago.
link |
00:24:33.360
And these animals could move and they could see in some way. And the two activities helped each
link |
00:24:41.440
other because how does movement help? Movement helps that because you can get food in different
link |
00:24:50.400
places. But you need to know where to go. And that's really about perception or seeing. I mean,
link |
00:24:58.320
vision is perhaps the single most important sense, but all the others are also
link |
00:25:04.800
important. So perception and action kind of go together. So earlier it was in these very
link |
00:25:11.600
simple feedback loops which were about finding food or avoiding becoming food if there's a
link |
00:25:19.040
predator running, trying to eat you up and so forth. So we must at the fundamental level
link |
00:25:27.600
connect perception to action. Then as we evolved, perception became more and more sophisticated
link |
00:25:36.400
because it served many more purposes. And so today we have what seems like a fairly general
link |
00:25:44.000
purpose capability which can look at the external world and build a model of the external world
link |
00:25:50.800
inside the head. We do have that capability. That model is not perfect. And psychologists
link |
00:25:57.680
have great fun in pointing out the ways in which the model in your head is not a perfect model
link |
00:26:03.440
of the external world. They create various illusions to show the ways in which it is
link |
00:26:10.080
imperfect. But it's amazing how far it has come from a very simple perception action loop that
link |
00:26:17.840
existed in, you know, an animal 500 million years ago. Once we have these very sophisticated
link |
00:26:26.640
visual systems, we can then impose a structure on them. It's we as scientists who are imposing
link |
00:26:32.880
that structure where we have chosen to characterize this part of the system as this
link |
00:26:39.280
quote module of object detection or quote this module of 3D reconstruction. What's going on
link |
00:26:46.000
is really all of these processes are running simultaneously and they are running simultaneously
link |
00:26:56.240
because originally their purpose was in fact to help guide action. So as a guiding general
link |
00:27:02.640
statement of a problem, do you think we can say that the general problem of computer vision,
link |
00:27:09.120
you said in humans, it was tied to action. Do you think we should also say that ultimately
link |
00:27:16.080
that the goal, the problem of computer vision is to sense the world in the way that helps you
link |
00:27:24.400
act in the world? Yes, I think that's the most fundamental, that's the most fundamental purpose.
link |
00:27:30.880
We have by now hyper evolved. So we have this visual system which can be used for other things,
link |
00:27:39.040
for example, judging the aesthetic value of a painting. And this is not guiding action,
link |
00:27:46.480
maybe it's guiding action in terms of how much money you will put in your auction bid, but
link |
00:27:51.600
that's a bit of a stretch. But the basics are in fact in terms of action, even when we are not
link |
00:27:58.880
talking about action; we have really hyper evolved our visual
link |
00:28:07.360
system. Actually, just to, sorry to interrupt, but perhaps it is fundamentally about action.
link |
00:28:13.440
You kind of jokingly said it's about spending, but perhaps the capitalistic drive that drives
link |
00:28:20.880
a lot of the development in this world is about the exchange of money and the fundamental action
link |
00:28:26.000
is money. If you watch Netflix, if you enjoy watching movies, you're using your perception
link |
00:28:30.400
system to interpret the movie. Ultimately, your enjoyment of that movie means you'll
link |
00:28:35.200
subscribe to Netflix. So the action is this extra layer that we've developed in modern society,
link |
00:28:42.800
perhaps is fundamentally tied to the action of spending money. Well, certainly with respect to
link |
00:28:49.040
interactions with firms. So in this homo economic role, when you're interacting with firms,
link |
00:29:00.240
it does become that. That's it. What else is there?
link |
00:29:05.440
And that was a rhetorical question. Okay. So to linger on the division between the static and the
link |
00:29:12.960
dynamic, so much of the work in computer vision, so many of the breakthroughs that you've been a
link |
00:29:18.320
part of have been in the static world in looking at static images. And then you've also worked on
link |
00:29:26.080
starting to, but to a much smaller degree, and the community is looking at the dynamic, at video,
link |
00:29:31.360
at dynamic scenes. And then there is robotic vision, which is dynamic, but also where you
link |
00:29:38.080
actually have a robot in the physical world interacting based on that vision.
link |
00:29:41.440
Which problem is harder? The sort of the trivial first answers of, well, of course,
link |
00:29:51.680
one image is harder. But if you look at a deeper question there, are we, what's the term, cutting
link |
00:30:01.680
ourselves off at the knees, making the problem harder by focusing on the images?
link |
00:30:07.120
That's a fair question. I think sometimes we, we can simplify a problem so much
link |
00:30:17.040
that we essentially lose part of the juice that could enable us to solve the problem.
link |
00:30:24.160
And one could reasonably argue that to some extent this happens when we go from video to
link |
00:30:29.680
single images. Now, historically, you have to consider the limits imposed by the
link |
00:30:37.840
computational capabilities we had. So many of the choices made in the computer vision community
link |
00:30:47.040
through the 70s, 80s, 90s can be understood as choices which were forced upon us by the
link |
00:30:56.960
fact that we just didn't have access to enough compute.
link |
00:31:01.600
Not enough memory, not enough hard drive.
link |
00:31:04.080
Exactly. Not enough compute, not enough storage. So think of these choices.
link |
00:31:09.280
So one of the choices is focusing on single images rather than video. Okay,
link |
00:31:15.360
clear reasons: storage and compute. We used to detect edges and
link |
00:31:23.680
throw away the image, right? So you have an image which is, say, 256 by 256 pixels.
link |
00:31:29.600
And instead of keeping around the grayscale value, what we did was we detected edges,
link |
00:31:35.360
found the places where the brightness changes a lot, and then threw away
link |
00:31:41.120
the rest. So this was a major compression device, and the hope was
link |
00:31:47.040
that you could still work with it. And the logic was that humans can interpret a line drawing.
link |
00:31:51.520
And yes, this would save us computation. So many of the choices were dictated by that.
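As a rough sketch of that edge-based compression, the classic recipe is to compute brightness gradients and keep only the pixels where the gradient magnitude is large. The Sobel filter and the threshold value here are standard illustrative choices, not the specific detector of that era.

```python
import numpy as np
from scipy import ndimage

def edge_map(gray, threshold=50.0):
    """Keep only pixels where brightness changes a lot; throw away the rest."""
    gx = ndimage.sobel(gray.astype(float), axis=1)  # horizontal gradient
    gy = ndimage.sobel(gray.astype(float), axis=0)  # vertical gradient
    magnitude = np.hypot(gx, gy)                    # gradient strength
    return magnitude > threshold                    # 1 bit/pixel vs 8 bits/pixel

image = np.random.randint(0, 256, (256, 256))       # stand-in for a real image
edges = edge_map(image)
print(f"kept {edges.mean():.1%} of pixels as edges")
```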
link |
00:32:00.960
I think today we are no longer detecting edges, right? We process images with convnets
link |
00:32:09.440
because we don't have those compute restrictions anymore.
link |
00:32:13.840
Now video is still understudied because video compute is still quite challenging
link |
00:32:18.880
if you are a university researcher. I think video computing is not so challenging if you are at
link |
00:32:25.200
Google or Facebook or Amazon. Still super challenging. I just spoke with the
link |
00:32:31.280
VP of engineering at Google, head of YouTube search and discovery, and they still struggle
link |
00:32:35.920
doing stuff on video. It's very difficult, except using techniques that are essentially
link |
00:32:42.240
the techniques used in the 90s, some very basic computer vision techniques.
link |
00:32:47.120
No, that's when you want to do things at scale. So if you want to operate at the scale of all the
link |
00:32:53.600
content of YouTube, it's very challenging. And there are similar issues in Facebook.
link |
00:32:57.680
But as a researcher, you have more opportunities.
link |
00:33:04.240
You can train large networks with relatively large video data sets. Yeah.
link |
00:33:09.040
Yes. So I think that this is part of the reason why we have so emphasized static images.
link |
00:33:15.360
I think that this is changing. And over the next few years, I see a lot more progress happening
link |
00:33:21.840
in video. So I, I have this generic statement that to me, video recognition feels like 10 years
link |
00:33:30.080
behind object recognition. And you can quantify that because you can take some of the challenging
link |
00:33:36.080
video data sets and their performance on action classification is like say 30%, which is kind
link |
00:33:43.520
of what we used to have around 2009 in object detection, you know, it's like about 10 years
link |
00:33:51.440
behind. And whether it'll take 10 years to catch up is a different question. Hopefully,
link |
00:33:57.440
it will take less than that. Let me ask a similar question I've already asked. But once again,
link |
00:34:03.600
so for dynamic scenes, do you think some kind of injection of knowledge
link |
00:34:10.880
bases and reasoning is required to help improve, say, action recognition? If we
link |
00:34:19.040
solve the general action recognition problem, what do you think the solution would look like?
link |
00:34:24.000
Yeah. So I completely agree that knowledge is called for. And that
link |
00:34:31.920
knowledge can be quite sophisticated. So the way I would say it
link |
00:34:38.000
is that perception blends into cognition.
link |
00:34:43.760
And cognition brings in issues of memory and this notion of a schema from psychology, which is,
link |
00:34:53.440
let me use the classic example, which is you go to a restaurant, right? Now there are things
link |
00:34:59.600
happen in a certain order, you walk in, somebody takes you to a table, waiter comes,
link |
00:35:06.240
gives you a menu, takes the order, food arrives, eventually bill arrives, etc., etc.
link |
00:35:14.800
This is a classic example from AI in the 1970s. There were the terms frames and
link |
00:35:23.440
scripts and schemas. These are all quite similar ideas. Okay. And then in the 70s, the way the AI
link |
00:35:30.640
of the time dealt with it was by hand coding this. So they hand coded in this notion of a script and
link |
00:35:36.960
the various stages and the actors and so on and so forth and use that to interpret, for example,
link |
00:35:44.480
language. I mean, if there's a description of a, of a story involving some people eating at a
link |
00:35:51.600
restaurant, there are all these inferences you can make because you know what happens typically
link |
00:35:58.320
at a restaurant. So I think this kind of, this kind of knowledge is absolutely essential.
link |
00:36:05.920
So I think that when we are going to do long form video understanding,
link |
00:36:11.520
we are going to need to do this. I think the kinds of technology that we have right now, with
link |
00:36:16.720
3D convolutions over a couple of seconds of a video clip, are very much tailored towards short
link |
00:36:23.920
term video understanding, not long term understanding.
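For reference, here is a minimal sketch of a short-clip 3D convolution model in PyTorch: the kernel slides over time as well as space, so the temporal context is limited to the clip it sees. The shapes and class count are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Input is (batch, channels, time, height, width): ~16 frames, a couple of seconds.
clip = torch.randn(1, 3, 16, 112, 112)

model = nn.Sequential(
    nn.Conv3d(3, 32, kernel_size=(3, 3, 3), padding=1),  # convolves over t, h, w
    nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(),
    nn.Linear(32, 400),                # e.g., action-class logits
)
logits = model(clip)                   # context never exceeds the clip length
```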
link |
00:36:30.800
Long term understanding requires this notion of schemas that I talked about, perhaps some notions of goals,
link |
00:36:37.520
intentionality, functionality and so on and so forth. Now, how will we bring that in? So we
link |
00:36:46.160
could either revert back to the 70s and say, okay, I'm going to hand code in a script or
link |
00:36:53.040
we might try to learn it. So I tend to believe that we have to find learning ways of doing this
link |
00:37:02.800
because I think learning ways land up being more robust. And there must be a learning version of
link |
00:37:08.560
the story, because children acquire a lot of this knowledge just by observation. At no
link |
00:37:17.600
moment in a child's life, it's possible but I think not so typical, does a
link |
00:37:25.920
mother coach a child through all the stages of what happens in a restaurant. They just go as a
link |
00:37:31.280
family, they go to the restaurant, they eat, come back and the child goes through 10 such
link |
00:37:37.040
experiences and the child has got a schema of what happens when you go to a restaurant.
link |
00:37:42.480
So we somehow need to, we need to provide that capability to our systems.
link |
00:37:47.840
You mentioned the following line from the end of the Alan Turing paper,
link |
00:37:53.040
Computing Machinery and Intelligence, that, like you said, many people know and
link |
00:37:58.720
very few have read where he proposes the Turing test. This is how you know because it's towards
link |
00:38:05.120
the end of the paper. Instead of trying to produce a program to simulate the adult mind,
link |
00:38:09.920
why not rather try to produce one which simulates the child's? So that's a really interesting point.
link |
00:38:16.960
If I think about the benchmarks we have before us, the tests of our computer vision systems,
link |
00:38:24.320
they're often kind of trying to get to the adult. So what kind of benchmarks should we have?
link |
00:38:30.880
What kind of tests for computer vision do you think we should have that mimic the child's
link |
00:38:35.840
in computer vision? Yeah, I think we should have those and we don't have those today.
link |
00:38:42.560
And I think the part of the challenge is that we should really be collecting data
link |
00:38:49.840
of the type that a child experiences. So that gets into issues of privacy and so on and so
link |
00:38:58.400
forth. But there are attempts in this direction to sort of try to collect the kind of data that
link |
00:39:05.040
a child encounters growing up. So what's the child's linguistic environment? What's the child's
link |
00:39:11.680
visual environment? So if we could collect that kind of data and then develop learning schemes
link |
00:39:20.000
based on that data, that would be one way to do it. I think that's a very promising direction
link |
00:39:27.600
myself. There might be people who would argue that we could just short circuit this in some way
link |
00:39:33.280
and sometimes we have had success by not imitating nature in detail.
link |
00:39:44.160
So the usual example is airplanes. We don't build flapping wings. So yes, that's one of the points
link |
00:39:55.520
of debate. In my mind, I would bet on this learning like a child approach.
link |
00:40:04.960
So one of the fundamental aspects of learning like a child is the interactivity. So the child
link |
00:40:11.760
gets to play with the data set it's learning from. Yes, it gets to select. I mean,
link |
00:40:16.240
you can call that active learning in the machine learning world. You can call it a lot of terms.
link |
00:40:21.760
What are your thoughts about this whole space of being able to play with the data set or select
link |
00:40:28.080
what you're learning? Yeah. So I think that I believe in that and I think that we could achieve
link |
00:40:36.480
it in two ways and I think we should use both. So one is actually real robotics. So real physical
link |
00:40:48.320
embodiments of agents who are interacting with the world and they have a physical body with
link |
00:40:53.920
dynamics and mass and moment of inertia and friction and all the rest, and you learn your
link |
00:41:01.040
body; the robot learns its body by doing a series of actions. The second is simulation
link |
00:41:10.160
environments. So I think simulation environments are getting much, much better. At Facebook
link |
00:41:19.280
AI Research, our group has worked on something called Habitat, which is a simulation
link |
00:41:26.160
environment, which is a visually photo realistic environment of places like houses or interiors
link |
00:41:36.160
of various urban spaces and so forth. And as you move, you get a picture, which is a pretty
link |
00:41:42.800
accurate picture. So now you can imagine that subsequent generations of these simulators
link |
00:41:52.720
will be accurate, not just visually, but with respect to forces and masses and haptic interactions
link |
00:42:01.200
and so on. And then we have that environment to play with. I think that, let me state one reason
link |
00:42:11.200
why I think this being able to act in the world is important. I think that this is one way to break
link |
00:42:18.800
the correlation versus causation barrier. So this is something which is of a great
link |
00:42:25.840
deal of interest these days. People like Judea Pearl have talked a lot about that we are
link |
00:42:33.520
neglecting causality and he describes the entire set of successes of deep learning as just curve
link |
00:42:39.840
fitting. But I don't quite agree. He's a troublemaker, he is. But causality is important. But
link |
00:42:49.440
causality is not like a single silver bullet. It's not like one single principle. There are many
link |
00:42:56.400
different aspects here. And one of our most reliable ways of establishing
link |
00:43:04.240
causal links, and this is the way the medical community, for example, does it, is randomized
link |
00:43:11.200
controlled trials. So you pick some situation, and in some cases you perform an action
link |
00:43:17.680
and in certain others you don't. So you have a controlled experiment.
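As a toy illustration of that randomized-trial logic, here is a sketch that recovers a causal effect by comparing a randomly treated group against a control group. The numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 10_000
treated = rng.random(n) < 0.5                  # randomize who gets the action
true_effect = 2.0                              # hidden causal effect (assumed)
outcome = rng.normal(size=n) + true_effect * treated

# Because assignment was random, the difference in means estimates causation.
estimate = outcome[treated].mean() - outcome[~treated].mean()
print(f"estimated causal effect: {estimate:.2f}")  # close to 2.0
```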
link |
00:43:25.520
Well, the child is in fact performing controlled experiments all the time, on a small scale. But that is a way
link |
00:43:35.520
that the child gets to build and refine its causal models of the world. And my colleague,
link |
00:43:42.640
Alison Gopnik, together with a couple of coauthors, has this book called The
link |
00:43:47.440
Scientist in the Crib, referring to children. The part that I like about that is
link |
00:43:54.080
that the scientist wants to build causal models, and the scientist does controlled
link |
00:44:00.400
experiments. And I think the child is doing that. So to enable that, we will need to
link |
00:44:05.760
have these, these active experiments. And I think this could be done some in the real world
link |
00:44:13.680
and some in simulation. So you have hope for simulation. I have hope for simulation. That's
link |
00:44:18.080
an exciting possibility if we can get to not just photo realistic, but what's that called
link |
00:44:23.920
life realistic simulation. So you don't see any fundamental blocks to why we can't eventually
link |
00:44:33.120
simulate the principles of what it means to exist in the world.
link |
00:44:39.280
I don't see any fundamental problems there. I mean, look, the computer graphics community has come
link |
00:44:44.240
a long way. In the early days, going back to the 80s and 90s, they were
link |
00:44:49.760
focusing on visual realism, right? And then they could do the easy stuff, but they
link |
00:44:54.560
couldn't do stuff like hair or fur and so on. Okay, well, they managed to do that. Then they
link |
00:45:01.520
couldn't do physical actions, right? Like there's a glass bowl, and it falls down and shatters,
link |
00:45:08.160
but then they could start to do pretty realistic models of that and so on and so forth. So the
link |
00:45:13.920
graphics people have shown that they can do this forward direction, not just for optical
link |
00:45:20.400
interactions, but also for physical interactions. So I think, of course, some of that is very
link |
00:45:26.880
compute intensive, but I think by and by, we will find ways of making our models ever more
link |
00:45:33.760
realistic. In one of your presentations, you break vision apart into early vision,
link |
00:45:39.920
static scene understanding, dynamic scene understanding, and raise a few interesting
link |
00:45:43.680
questions. I thought I could just throw some, some at you to see if you want to talk about them.
link |
00:45:50.160
So early vision, so what is it that you said? Sensation, perception, and cognition. So
link |
00:45:58.320
is this sensation? Yes. What can we learn from image statistics that we don't already know?
link |
00:46:05.440
So at the lowest level, what can we learn from just the
link |
00:46:12.960
statistics, the basic variations in the raw pixels, the textures, and so on?
link |
00:46:17.760
Yeah. So what we seem to have learned is, is that there's a lot of redundancy in these images.
link |
00:46:27.920
And as a result, we are able to do a lot of compression and, and this compression is very
link |
00:46:34.000
important in biological settings, right? So you might have 10^8 photoreceptors and
link |
00:46:40.080
only 10^6 fibers in the optic nerve. So you have to do a compression by a factor of
link |
00:46:44.800
100 to 1. And there are analogs of that happening in our artificial
link |
00:46:53.600
neural networks, in the early layers. So you think there's a lot of
link |
00:46:57.680
compression that can be done in the beginning, just from the statistics? Yeah.
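A minimal sketch of the neural-network compression idea: an autoencoder squeezes an image through a narrow bottleneck and reconstructs it, exploiting exactly that statistical redundancy. The dimensions are illustrative assumptions, not any particular company's codec.

```python
import torch
import torch.nn as nn

# Toy autoencoder: compress a 28x28 image (784 values) into 32 numbers.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(784, 32))
decoder = nn.Sequential(nn.Linear(32, 784), nn.Sigmoid())

image = torch.rand(1, 1, 28, 28)
code = encoder(image)                           # ~25:1 compression of the pixels
reconstruction = decoder(code).view(1, 1, 28, 28)
loss = nn.functional.mse_loss(reconstruction, image)  # train to minimize this
```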
link |
00:47:05.440
How much? Well, the way to think about it is just how successful
link |
00:47:12.880
image compression is, right? And that's been done with older technologies,
link |
00:47:19.440
but there are several companies which are trying to use these
link |
00:47:27.040
more advanced neural network type techniques for compression, both for static images as well as
link |
00:47:33.280
for video. One of my former students has a company which is trying to do stuff like this.
link |
00:47:39.760
And I think that they are showing quite interesting results. And I think
link |
00:47:47.920
the success of that is really about image statistics and video statistics.
link |
00:47:53.520
But that's still not doing compression of the kind where, when I see a picture of a cat,
link |
00:47:58.720
all I have to say is it's a cat. That's another semantic kind of compression.
link |
00:48:02.560
Yeah. So this is at the lower level, right? As I said,
link |
00:48:07.280
that's focusing on low level statistics. So to linger on that for a little bit,
link |
00:48:12.960
you mentioned the question of how far bottom up image segmentation can go. And you mentioned
link |
00:48:21.200
that the central question for scene understanding is the interplay of bottom up and top down
link |
00:48:26.000
information. Maybe this is a good time to elaborate on that. Maybe define what is,
link |
00:48:31.920
what is bottom up, what is top down in the context of computer vision?
link |
00:48:36.640
Right. So today what we have are very interesting systems, because they work completely
link |
00:48:44.800
bottom up. What does bottom up mean, sorry? So bottom up means, in this case,
link |
00:48:49.360
a feedforward neural network, starting from the raw pixels.
link |
00:48:53.600
Yeah. They start from the raw pixels and they, they end up with some, something like cat or
link |
00:48:58.960
not a cat, right? So our, our systems are running totally feed forward. They're trained in a very
link |
00:49:05.520
top down way. So they're trained by saying, okay, this is a cat. There's a cat. There's a dog.
link |
00:49:11.280
There's a zebra, et cetera. And I'm not happy with either of these choices fully,
link |
00:49:20.480
because we have completely separated these processes, right? So
link |
00:49:26.640
what do we know compared to biology? In biology, what we
link |
00:49:35.440
know is that at test time, at runtime, the processes are not purely feed
link |
00:49:43.600
forward; they involve feedback, and they involve much shallower neural networks.
link |
00:49:49.840
So the kinds of neural networks we are using in computer vision, say a ResNet-50, have 50 layers.
link |
00:49:55.680
Well, in the brain, in the visual cortex, going from the retina to IT (inferotemporal cortex), maybe we have like seven,
link |
00:50:03.760
right? So they're far shallower, but we have the possibility of feedback. So there are backward
link |
00:50:08.880
connections. And this might enable us to, to deal with the more ambiguous stimuli, for example.
link |
00:50:17.440
So the biological solution seems to involve feedback. The solution in artificial vision
link |
00:50:26.480
seems to be just feedforward, but with a much deeper network. And the two are functionally
link |
00:50:32.560
equivalent, because if you have a feedback network, which just has like three rounds of feedback,
link |
00:50:37.360
you can just unroll it, make it three times the depth, and run it in a totally feedforward
link |
00:50:43.600
way. So this is something we have written some papers on, but I really
link |
00:50:49.840
feel that this theme should be pursued further.
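A minimal sketch of that unrolling equivalence: a shallow block run for three rounds of feedback computes the same function as three weight-tied copies of it stacked feedforward.

```python
import torch
import torch.nn as nn

block = nn.Linear(8, 8)               # a shallow stage with feedback
x = torch.randn(1, 8)

# Recurrent view: run the same stage for three rounds of feedback.
h = x
for _ in range(3):
    h = torch.relu(block(h))

# Unrolled view: three weight-tied copies stacked feedforward.
unrolled = x
for layer in (block, block, block):   # same weights, three times the depth
    unrolled = torch.relu(layer(unrolled))

assert torch.allclose(h, unrolled)    # functionally equivalent
```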
link |
00:50:54.640
Oh, some kind of recurrence mechanism.
link |
00:50:57.120
Yeah. Okay. So I want to have a little bit more top down
link |
00:51:04.960
at test time. Okay. Then at training time, we make use of a lot of top down knowledge right now.
link |
00:51:13.600
So basically to learn to segment an object, we have to have all these examples of this is the
link |
00:51:19.520
boundary of a cat, and this is the boundary of a chair, and this is the boundary of a horse,
link |
00:51:23.360
and so on. And this is too much top down knowledge. How do humans do this? We manage
link |
00:51:30.960
with far less supervision, and we do it in a sort of bottom up way, because, for example,
link |
00:51:37.920
we're looking at a video stream and the horse moves, and that enables me to say that all these
link |
00:51:45.360
pixels are together. So the Gestalt psychologists used to call this the principle of common fate.
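As a rough sketch of common fate as a grouping cue: estimate per-pixel motion between frames, then group pixels whose motion agrees. The flow field here is synthetic and the threshold is an illustrative assumption.

```python
import numpy as np

# Synthetic optical flow: background still, a "horse" patch moving right.
flow = np.zeros((64, 64, 2))            # (dy, dx) per pixel between two frames
flow[20:40, 10:30] = (0.0, 3.0)         # the moving object's pixels share fate

# Common fate: pixels moving together (similar flow) belong together.
speed = np.linalg.norm(flow, axis=2)
moving_group = speed > 1.0               # crude grouping by shared motion
print(f"grouped {moving_group.sum()} pixels as one entity")
```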
link |
00:51:52.880
So there was a bottom up process by which we were able to segment out these objects,
link |
00:51:58.160
and we have totally focused on this top down training signal. So in my view, machine vision has
link |
00:52:06.960
currently settled this top down bottom up interaction in one particular way, but I don't find the
link |
00:52:13.600
solution fully satisfactory, and I would rather have a bit of both at both stages.
link |
00:52:20.000
For all computer vision problems, not just segmentation.
link |
00:52:23.600
And the question that you can ask is, so for me, I'm inspired a lot by human vision,
link |
00:52:30.160
and I care about that. You could be just a hard boiled engineer, not give a damn.
link |
00:52:35.360
So to you, I would then argue that you would need far less training data if you could make my
link |
00:52:43.360
research agenda, you know, fruitful.
link |
00:52:46.480
Okay, so then maybe taking a step into segmentation, static scene understanding,
link |
00:52:53.840
what is the interaction between segmentation and recognition? You mentioned the movement of objects.
link |
00:53:00.560
So for people who don't know computer vision, segmentation is this weird activity that computer
link |
00:53:07.600
vision folks have all agreed is very important of drawing outlines around objects versus
link |
00:53:14.480
a bounding box, and then classifying that object. What's the value of segmentation? What is it
link |
00:53:24.560
as a problem in computer vision? How is it fundamentally different from
link |
00:53:29.280
detection, recognition, and the other problems? Yeah. So segmentation enables us to say
link |
00:53:37.280
that some set of pixels are an object without necessarily even being able to name that object
link |
00:53:45.680
or knowing properties of that object. Oh, so you mean segmentation purely as the act of separating
link |
00:53:53.600
an object, a blob that's united in some way, from its background? Yeah,
link |
00:54:01.200
so entityfication, if you will, making an entity out of it. Entityfication, beautifully. So I think
link |
00:54:09.680
that we have that capability, and that enables us to, as we are growing up, to acquire names of
link |
00:54:20.640
objects with very little supervision. So suppose the child, let's posit that the child has this
link |
00:54:26.800
ability to separate out objects in the world. Then when the mother says, pick up your bottle or
link |
00:54:36.320
the cat's behaving funny today, the word cat suggests some object, and then the child sort
link |
00:54:44.960
of does the mapping. The mother doesn't have to teach specific object labels by pointing to them.
link |
00:54:53.120
Weak supervision works in the context that you have the ability to create objects. So I think
link |
00:55:02.400
that, so to me, that's a very fundamental capability. There are applications where this is very
link |
00:55:09.520
important. For example, medical diagnosis. So in medical diagnosis, you have some brain scan,
link |
00:55:17.600
I mean, this is some work that we did in my group where you have CT scans of people who have had
link |
00:55:24.080
traumatic brain injury, and what the radiologist needs to do is to precisely delineate various
link |
00:55:31.520
places where there might be bleeds, for example. And there are clear needs like that. So there's
link |
00:55:39.920
certainly very practical applications of computer vision where segmentation is necessary. But
link |
00:55:46.320
philosophically, segmentation enables the task of recognition to proceed with much weaker
link |
00:55:54.960
supervision than we require today. And you think of segmentation as this kind of task
link |
00:56:00.720
that takes on a visual scene and breaks it apart into interesting entities that might be useful
link |
00:56:09.600
for whatever the task is. Yeah. And it is not semantics free. It blends into cognition;
link |
00:56:17.520
it involves perception and cognition. I think the mistake that we used
link |
00:56:26.080
to make in the early days of computer vision was to treat it as a purely bottom up perceptual task.
link |
00:56:32.320
It is not just that because we do revise our notion of segmentation with more experience,
link |
00:56:41.280
right? Because, for example, there are objects which are non rigid, like animals or humans.
link |
00:56:46.880
And I think understanding that all the pixels of a human are one entity is actually quite a
link |
00:56:53.360
challenge, because the parts of the human can move independently. The human wears clothes,
link |
00:57:00.560
so they might be differently colored. So it's all sort of a challenge.
link |
00:57:05.360
You mentioned the three R's of computer vision: recognition, reconstruction,
link |
00:57:10.240
reorganization. Can you describe these three R's and how they interact?
link |
00:57:15.200
Yeah. So recognition is the easiest one because that's what I think people generally think of
link |
00:57:24.400
as computer vision achieving these days, which is labels. So is this a cat? Is this a dog?
link |
00:57:32.400
Is this a chihuahua? I mean, it could be very fine grained, like specific breed of a dog
link |
00:57:40.800
or a specific species of bird, or it could be very abstract like animal.
link |
00:57:46.800
But given a part of an image or a whole image, say put a label on it.
link |
00:57:51.360
Yeah. So that's recognition. Reconstruction is essentially, you can think of it as inverse
link |
00:58:02.640
graphics. I mean, that's one way to think about it. So in graphics, you have some internal
link |
00:58:10.320
computer representation of some objects arranged in a scene.
link |
00:58:17.200
And what you do is you produce a picture. You produce the pixels corresponding to a rendering
link |
00:58:22.560
of that scene. So let's do the inverse of this. We are given an image and we say, oh, this image
link |
00:58:35.360
arises from some objects in a scene looked at with a camera from this viewpoint. And we might
link |
00:58:42.240
have more information about the objects like their shape, maybe the textures, maybe color,
link |
00:58:49.920
et cetera, et cetera. So that's the reconstruction problem. In a way, you are in your head creating
link |
00:58:57.680
a model of the external world. Okay. Reorganization has to do essentially with finding these entities.
link |
00:59:06.800
So it's organization. The word organization implies structure. So in perception, in psychology,
link |
00:59:19.760
we use the term perceptual organization: the world, an image, is not
link |
00:59:28.560
internally represented as just a collection of pixels. Rather, we create
link |
00:59:35.280
these entities, objects, whatever you want to call them. And the relationships between the entities
link |
00:59:39.600
as well? Or is it purely about the entities? It could be about the relationships, but mainly
link |
00:59:44.720
we focus on the fact that there are entities. Okay. So I'm trying to pinpoint what the organization
link |
00:59:51.520
means. So organization is that instead of like a uniform grid, we have this structure of objects.
link |
01:00:00.400
So segmentation is a small part of that. So segmentation gets us going towards that.
link |
01:00:08.160
Yeah. And you kind of have this triangle where they all interact together.
link |
01:00:13.440
Yes. So how do you see that interaction? Sort of, reorganization is, yes,
link |
01:00:22.560
defining the entities in the world. Recognition is labeling those entities. And then reconstruction
link |
01:00:31.760
is, what, filling in the gaps? Well, for example, to impute some 3D objects corresponding to each
link |
01:00:40.880
of these entities, that would be part of adding more information that's not there in the raw data.
link |
01:00:47.680
Correct. I mean, I started pushing this kind of view around 2010 or something like that,
link |
01:00:57.920
because at that time in computer vision, people were just working on
link |
01:01:06.480
many different problems, but they treated each of them as a separate, isolated problem,
link |
01:01:11.920
each with its own data set. And then you try to solve that and get good numbers on it.
link |
01:01:16.800
I didn't like that approach, because I wanted to see the connection between these.
link |
01:01:23.440
And if people divided up vision into various modules, the way they would do it was as low
link |
01:01:31.280
level, mid level and high level vision, corresponding roughly to the psychologist's
link |
01:01:36.720
notion of sensation, perception and cognition. And that didn't map to tasks that people
link |
01:01:43.840
cared about. Okay. So therefore, I tried to promote this particular framework as a way
link |
01:01:50.800
of considering the problems that people in computer vision were actually working on
link |
01:01:55.360
and trying to be more explicit about the fact that they actually are connected to each other.
link |
01:02:02.080
And I was at that time just doing this on the basis of information flow. Now it turns out
link |
01:02:08.640
in the last five years or so, post the deep learning revolution, this architecture
link |
01:02:19.600
has turned out to be very conducive to that, because basically in these neural networks,
link |
01:02:27.920
we are trying to build multiple representations. There can be multiple output heads sharing
link |
01:02:35.520
common representations. So in a certain sense, today, given the reality of what solutions people
link |
01:02:42.240
have to this, I do not need to preach this anymore. It is just there. It's part of the solution space.
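As an illustration of multiple output heads over a shared representation, here is a minimal PyTorch sketch; the layer sizes, head names, and task choices are invented for the example, not taken from any specific system discussed here.

```python
import torch
import torch.nn as nn

class MultiHeadVisionNet(nn.Module):
    """One shared backbone feeds three task heads, echoing the three Rs:
    recognition, reconstruction, and reorganization. Sizes are toy."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(          # shared representation
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.recognize = nn.Sequential(         # image-level labels
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_classes))
        self.reconstruct = nn.Conv2d(64, 1, 1)  # per-pixel depth guess
        self.reorganize = nn.Conv2d(64, 2, 1)   # per-pixel entity mask

    def forward(self, x):
        features = self.backbone(x)   # computed once, shared by all heads
        return (self.recognize(features),
                self.reconstruct(features),
                self.reorganize(features))

net = MultiHeadVisionNet()
labels, depth, masks = net(torch.randn(1, 3, 64, 64))
```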
link |
01:02:52.400
So speaking of neural networks, how much of this problem of computer vision,
link |
01:02:59.520
of reorganization, recognition, and reconstruction, how much of it can be learned end to end,
link |
01:03:11.200
do you think? Sort of set it and forget it, just plug and play: have a giant dataset, or multiple,
link |
01:03:20.000
perhaps multimodal, and then just learn the entirety of it. Well, so I think that currently
link |
01:03:28.480
what end to end learning means nowadays is end to end supervised learning. And that I would
link |
01:03:35.440
argue is too narrow a view of the problem. I like this child development view, this lifelong
link |
01:03:43.280
learning view, one where there are certain capabilities that are built up and then there
link |
01:03:48.560
are certain capabilities which are built up on top of that. So that's what I believe in. So I think
link |
01:04:02.240
end to end learning in this supervised setting for a very precise task to me is
link |
01:04:09.600
kind of a limited view of the learning process.
link |
01:04:18.080
Got it. So if we think beyond purely supervised learning, look back to children. You mentioned
link |
01:04:25.200
six lessons that we can learn from children: be multimodal, be incremental, be physical,
link |
01:04:33.360
explore, be social, use language. Can you speak to these, perhaps picking one that you find most
link |
01:04:40.320
fundamental to our time today? Yeah. So I mean, I should say, to give due credit, this is from a
link |
01:04:46.960
paper by Smith and Gasser. And it reflects essentially, I would say, common wisdom among
link |
01:04:55.920
child development people. It's just that this is not common wisdom among people in
link |
01:05:05.120
computer vision and AI and machine learning. So I view my role as trying to bridge the two worlds.
link |
01:05:15.680
So let's take multimodal as an example. I like that. So for multimodal, a canonical example is
link |
01:05:22.880
a child interacting with an object. So then the child holds a ball and plays with it.
link |
01:05:32.320
So at that point, it's getting a touch signal. The touch signal gives a notion of 3D
link |
01:05:41.200
shape, but it is sparse. And then the child is also seeing a visual signal. And these two,
link |
01:05:48.800
so imagine these are two signals in totally different spaces. So one is the space of receptors on the
link |
01:05:55.600
skin of the fingers and the thumb and the palm. And then these map onto neuronal fibers
link |
01:06:02.880
which get activated somewhere, leading to some activation in the somatosensory cortex.
link |
01:06:10.320
I mean, a similar thing will happen if we have a robot hand. And then we have the pixels corresponding
link |
01:06:17.600
to the visual view, but we know that they correspond to the same object. So that's a very,
link |
01:06:25.440
very strong cross calibration signal. And it is self supervisory, which is beautiful.
link |
01:06:32.240
There's nobody assigning a label. The mother doesn't have to come and assign a label.
link |
01:06:37.680
The child doesn't even have to know that this object is called a ball.
link |
01:06:40.960
Okay, but the child is learning something about the three dimensional world from this
link |
01:06:47.840
signal. On tactile and visual there is some work; there is a lot of work currently
link |
01:06:54.880
on audio and visual. Okay, so audio visual: there is some event that happens in the world.
link |
01:07:01.680
And that event has a visual signature, and it has an auditory signature. So there is this
link |
01:07:07.680
glass bowl on the table, and it falls and breaks. And I hear the smashing sound and I see the pieces
link |
01:07:13.440
of glass. Okay, I've built that connection between the two, right? I mean, this has
link |
01:07:21.680
become a hot topic in computer vision in the last couple of years. There are problems like
link |
01:07:28.960
separating out multiple speakers, right? This was a classic problem in audition;
link |
01:07:35.280
they call this the problem of source separation or the cocktail party effect and so on.
link |
01:07:40.400
But when you also have the visual signal, it becomes so much easier and so much
link |
01:07:48.720
more useful. So the multimodal, I mean, there's so much more signal with multimodal and you can use
link |
01:07:56.640
that for some kind of weak supervision as well. Yes, because they are occurring at the same
link |
01:08:02.240
time. So you have time, which links the two, right? So at a certain moment, T1, you got a certain
link |
01:08:08.960
signal in the auditory domain and a certain signal in the visual domain, but they must be causally
link |
01:08:13.760
related. Yeah, that's an exciting area. Not well studied yet. Yeah, I mean, we have a little bit
link |
01:08:19.600
of work on this, but so much more needs to be done. So this is a good example.
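One common way to turn this co-occurrence in time into a training signal is a contrastive loss between the two modalities. The sketch below is a generic formulation with hypothetical audio and video encoders and made-up embedding sizes, not a description of any specific published system.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(audio_emb, video_emb, temperature=0.1):
    """Audio and video clips recorded at the same moment are positives;
    clips from different moments in the batch are negatives. No human
    labels needed: time itself provides the supervision."""
    a = F.normalize(audio_emb, dim=1)
    v = F.normalize(video_emb, dim=1)
    logits = a @ v.t() / temperature      # pairwise similarities
    targets = torch.arange(len(a))        # i-th audio matches i-th video
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# toy embeddings standing in for the outputs of hypothetical encoders
loss = cross_modal_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
```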
link |
01:08:29.040
Be physical, that's to do with something we talked about earlier: that there's an embodied world.
link |
01:08:36.320
To mention language, use language: Noam Chomsky believes that language may be at the core of
link |
01:08:43.040
cognition, at the core of everything in the human mind. What is the connection between language
link |
01:08:48.080
and vision to you? Like what's more fundamental? Are they neighbors, is one the parent and the child,
link |
01:08:55.920
the chicken and the egg? Oh, it's very clear. It is vision that is the parent. The parent is
link |
01:09:01.680
the fundamental ability. Okay, so it comes before. You think vision is more fundamental
link |
01:09:10.080
than language. Correct. And you can think of it either in phylogeny or in ontogeny. So phylogeny
link |
01:09:18.800
means if you look at evolutionary time, right? So we have vision that developed 500 million years
link |
01:09:25.360
ago. Okay. Then something like when we get to maybe like 5 million years ago, you have the first
link |
01:09:33.040
bipedal primate. So when we started to walk, then the hand became free. And so then manipulation,
link |
01:09:40.400
the ability to manipulate objects and build tools and so on and so forth. So you said 500,000 years
link |
01:09:47.040
ago? No, sorry. The first multicellular animals, which you can say had some intelligence, arose
link |
01:09:55.760
500 million years ago. Okay. And now let's fast forward to say the last 7 million years,
link |
01:10:03.680
which is the development of the hominid line, right? Where from the other primates, we have the
link |
01:10:09.600
branch which leads on to modern humans. Now, there are many of these hominids, but the one which
link |
01:10:20.800
people talk about is Lucy, because that's like a skeleton from 3 million years ago. And we know
link |
01:10:25.360
that Lucy walked. Okay. So at this stage, the hand is free for manipulating objects.
link |
01:10:33.520
And then came the ability to manipulate objects and build tools, and the brain size grew in this era.
link |
01:10:43.280
So, okay. So now you have manipulation. Now, we don't know exactly when language arose.
link |
01:10:49.360
But after that. But after that, because no apes have it. I mean, Chomsky is correct in
link |
01:10:56.560
that it is a uniquely human capability, and other primates don't have that.
link |
01:11:04.160
But so it developed somewhere in this era. But I would
link |
01:11:11.360
argue that it probably developed after we had this stage of humans: the human species
link |
01:11:19.200
already able to manipulate, hands free, much bigger brain size. And for that, a lot of
link |
01:11:28.000
vision had already had to develop. So the sensation and the perception, maybe some of the
cognition. Yeah. So these ancestors of ours,
link |
01:11:46.320
you know, three, four million years ago, they had spatial intelligence.
link |
01:11:53.120
So they knew that the world consists of objects. They knew that the objects were in
link |
01:11:57.440
certain relationships to each other. They had observed causal interactions among objects.
link |
01:12:05.040
They could move in space. So they had space and time and all of that. So language builds on that
link |
01:12:12.160
substrate. I mean, all human languages have constructs
link |
01:12:19.680
which depend on a notion of space and time. Where did that notion of space and time come from?
link |
01:12:26.640
It had to come from perception and action in the world we live in.
link |
01:12:30.880
Yeah. Well, you've referred to that as spatial intelligence. Yeah. So to linger a little
link |
01:12:36.160
bit, we mentioned Turing and his suggestion that we should learn from children. Nevertheless, language
link |
01:12:45.440
is the fundamental piece of the test of intelligence that Turing proposed. Yes. What do you think is
link |
01:12:51.520
a good test of intelligence? What would impress the heck out of you? Is it
link |
01:12:57.360
fundamentally in natural language or is there something in vision?
link |
01:13:01.200
I don't think we should have a single test of intelligence.
link |
01:13:09.920
So just like I don't believe in IQ as a single number, I think generally there can be many
link |
01:13:17.200
capabilities which are correlated, perhaps. So I think that there will be
link |
01:13:25.920
accomplishments which are visual accomplishments, accomplishments in
link |
01:13:32.000
manipulation or robotics, and then accomplishments in language. I do believe that language will
link |
01:13:38.240
be the hardest nut to crack. Really? Yeah. So what's harder: to pass the spirit of the Turing
link |
01:13:45.040
test, or whatever formulation of it, convincingly in natural language,
link |
01:13:50.960
like somebody you would want to have a beer with, hang out and have a chat with,
link |
01:13:54.480
or the general natural scene understanding, you think language is the problem?
link |
01:14:01.360
I'm not a fan of the Turing test. I think Turing, as he proposed the test in 1950,
link |
01:14:11.360
was trying to solve a certain problem. Yeah, imitation. Yeah. And I think it made a lot of
link |
01:14:17.280
sense then. Where we are today, 70 years later, I think we should not worry about that. I think
link |
01:14:26.800
the Turing test is no longer the right way to channel research in AI, because it takes
link |
01:14:34.800
us down this path of a chatbot which can fool us for five minutes or whatever. I think
link |
01:14:40.960
I would rather have a list of 10 different tasks. I mean, tasks
link |
01:14:48.480
in the manipulation domain, tasks in navigation, tasks in visual scene understanding,
link |
01:14:53.680
tasks in reading a story and answering questions based on that. I mean, so my favorite language
link |
01:15:02.160
understanding task would be reading a novel and being able to answer arbitrary questions from it.
link |
01:15:07.760
Okay. Right. And this is not an exhaustive list by any means,
link |
01:15:15.600
but I think that's where we need to be going, and
link |
01:15:22.480
on each of these axes, there's a fair amount of work to be done.
link |
01:15:25.920
So on the visual understanding side, in this intelligence Olympics that we've set up,
link |
01:15:30.960
what's a good test, one of many, of visual scene understanding? Do you think such benchmarks
link |
01:15:40.880
exist? Sorry to interrupt. No, there aren't any. I think essentially, to me,
link |
01:15:46.640
a really good test would be an aid to the blind. So suppose there was a blind person and I needed to assist the
link |
01:15:55.120
blind person. So ultimately, like we said, vision aids in action and survival in this
link |
01:16:03.040
world. Yeah. Maybe in a simulated world; maybe it's easier to measure performance in a simulated
link |
01:16:12.640
world. What we are ultimately after is performance in the real world. So David Hilbert in 1900 proposed
link |
01:16:19.920
23 open problems of mathematics, some of which are still unsolved, the most famous of
link |
01:16:26.320
which is probably the Riemann hypothesis. You've thought about and presented on the Hilbert
link |
01:16:31.280
problems of computer vision. So let me ask, what are they today? I don't know when you last
link |
01:16:37.680
presented that, 2015, but versions of it. You're kind of the face and the spokesperson for computer
link |
01:16:44.000
vision. It's your job to state what the open problems are for the field. So what
link |
01:16:52.000
today are the Hilbert problems of computer vision, do you think? Let me pick one,
link |
01:16:58.880
which I regard as clearly, clearly unsolved, which is what I would call long form video
link |
01:17:06.480
understanding. So we have a video clip and we want to understand the behavior in there
link |
01:17:17.120
in terms of agents, their goals, intentionality and make predictions about what might happen.
link |
01:17:27.920
So that's the kind of understanding that goes beyond atomic visual actions. So in the short
link |
01:17:37.760
range, the question is, are you sitting? Are you standing? Are you catching a ball?
link |
01:17:43.760
That we can do now. Or even if we can't do it fully accurately, if we can do it at 50%,
link |
01:17:50.160
maybe next year we'll do it at 65 and so forth. But I think the long range video understanding,
link |
01:17:57.440
I don't think we can do today. And it blends into cognition. That's
link |
01:18:04.640
the reason why it's challenging. So you have to understand the entities,
link |
01:18:10.080
you have to track them, and you have to have some kind of model
link |
01:18:15.040
of their behavior. Correct. And their behavior might be, these are agents. So they're not just
link |
01:18:22.240
like passive objects; they're agents. So therefore, they would exhibit goal
link |
01:18:28.160
directed behavior. Okay. So this is one area. Then I will talk about, say, understanding
link |
01:18:35.120
the world in 3D. Now, this may seem paradoxical because in a way, we have been able to do 3D
link |
01:18:42.080
understanding even like 30 years ago, right? But I don't think we currently have the richness of
link |
01:18:48.640
3D understanding in our computer vision systems that we would like. So let me elaborate on
link |
01:18:56.560
that a bit. So currently, we have two kinds of techniques which are not fully unified. So
link |
01:19:03.280
there are the kinds of techniques from multi view geometry, where you have multiple pictures of a scene
link |
01:19:08.560
and you do a reconstruction using stereoscopic vision or structure from motion. But these techniques
link |
01:19:15.520
totally fail if you just have a single view, because they are relying on
link |
01:19:23.440
this multiple view geometry. Okay.
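To make the contrast concrete, here is a minimal numpy sketch of linear triangulation from two calibrated views; with a single view the same system is underdetermined (any point along the viewing ray fits), which is exactly why single-view methods need priors. The camera setup and point are toy values chosen for illustration.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation: recover a 3D point from its pixel
    projections x1, x2 under two 3x4 camera projection matrices P1, P2.
    Each view contributes two linear constraints; one view alone leaves
    a one-parameter family of solutions along the viewing ray."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # dehomogenize

# toy setup: identity camera, plus a second camera shifted along x
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])
X_true = np.array([0.2, 0.1, 3.0])
x1 = X_true[:2] / X_true[2]                  # projection in camera 1
x2 = (X_true - [0.5, 0, 0])[:2] / X_true[2]  # projection in camera 2
print(triangulate(P1, P2, x1, x2))           # recovers ~[0.2, 0.1, 3.0]
```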
link |
01:19:29.040
Then we have some techniques that we have developed in the computer vision community, which try to guess 3D from single views. And these techniques are based
link |
01:19:35.920
on supervised learning, and they are based on having, at training time, 3D models of objects
link |
01:19:44.160
available. And this is completely unnatural supervision, right? CAD models are
link |
01:19:51.520
not injected into your brain. Okay. So what would I like? What I would like would be a kind of
link |
01:19:59.360
learning-as-you-move-around-the-world notion of 3D. So we have our succession of visual
link |
01:20:11.680
experiences, and as part of that, I might see a chair from different
link |
01:20:20.640
viewpoints, or a table from different viewpoints, and so on. Now, that enables me
link |
01:20:27.680
to build some internal representation. And then next time I just see a single photograph.
link |
01:20:35.120
And it may not even be of that chair, it's of some other chair. And I have a guess of what its
link |
01:20:40.240
3D shape is like. So you're almost learning the CAD model, kind of? Yeah,
link |
01:20:47.040
implicitly. I mean, the CAD model need not be in the same form as used by computer graphics
link |
01:20:51.680
programs. It's hidden in the representation. It's hidden in the representation, the ability
link |
01:20:55.920
to predict new views and what I would see if I went to such and such position.
link |
01:21:03.040
By the way, on a small tangent on that, are you comfortable with
link |
01:21:13.440
neural networks that do achieve visual understanding, that, for example, achieve this kind of 3D
link |
01:21:18.320
understanding, and you don't know how they do it, you're not able to introspect,
link |
01:21:25.040
not able to visualize or understand or interact with the representation? So the fact
link |
01:21:31.920
that they are not, or may not be, explainable. Yeah, I think that's fine. To me, that is, so
link |
01:21:41.760
let me put some caveats on that. So it depends on the setting. So first of all, I think
link |
01:21:53.760
humans are not explainable. Yeah, that's a really good point. One human to another human is not
link |
01:22:00.640
fully explainable. I think there are settings where explainability matters. And these
link |
01:22:07.920
might be, for example, questions of medical diagnosis. So I'm in a setting where
link |
01:22:16.400
maybe the doctor, maybe a computer program has made a certain diagnosis.
link |
01:22:21.120
And then depending on the diagnosis, perhaps I should have treatment A or treatment B,
link |
01:22:25.840
right? So now, is the computer program's diagnosis based on data which was collected
link |
01:22:38.400
from American males who are in their 30s and 40s, and maybe not so relevant to me,
link |
01:22:45.120
maybe it is relevant, you know, et cetera, et cetera. I mean, in medical diagnosis,
link |
01:22:50.160
we have major issues to do with the reference class. So we may have acquired statistics from
link |
01:22:55.440
one group of people and be applying it to a different group of people who may not share all the same
link |
01:23:01.120
characteristics. There might be error bars in the prediction. So that prediction
link |
01:23:08.880
should really be taken with a huge grain of salt. But this has an impact on what treatments
link |
01:23:16.960
should be picked, right? So there are settings where I want to know more than just
link |
01:23:23.840
"this is the answer." So in that sense,
link |
01:23:32.000
explainability and interpretability may matter. It's about giving error bounds and a better sense
link |
01:23:38.320
of the quality of the decision. Where I'm willing to sacrifice interpretability
link |
01:23:46.480
is that I believe that there can be systems which are highly performant, but which are internally
link |
01:23:53.840
black boxes. And that seems to be where it's headed. Some of the best performing systems
link |
01:23:59.440
are essentially black boxes, fundamentally by their construction. You and I are black boxes
link |
01:24:05.600
to each other. Yeah. So the nice thing about the black boxes we are is: we ourselves are black
link |
01:24:12.800
boxes, but those of us who are charming are able to convince others, to explain
link |
01:24:20.480
what's going on inside the black box with narratives, with stories. So in some sense,
link |
01:24:26.960
neural networks don't have to actually explain what's going on inside. They just have to come
link |
01:24:32.320
up with stories, real or fake, that convince you that they know what's going on. And I'm sure we
link |
01:24:39.120
can do that; neural networks can create those stories.
link |
01:24:44.320
Yeah. And the transformer will be involved. Do you think we will ever build a system of
link |
01:24:53.840
human level or super human level intelligence? We've kind of defined what it takes to try to
link |
01:24:59.120
approach that. But do you think that's within our reach? The thing that we
link |
01:25:03.200
thought we could do, what Turing actually thought we could do by the year 2000, right? Do you think
link |
01:25:09.600
we'll ever be able to do it? Yeah. So I think there are two answers here. One answer is
link |
01:25:14.480
in principle, can we do this at some time? And my answer is yes. The second answer is a pragmatic
link |
01:25:23.040
one. Do you think we will be able to do it in the next 20 years or whatever? And to that my
link |
01:25:28.240
answer is no. And of course, that's a wild guess. I think that Donald Rumsfeld is not a
link |
01:25:38.640
favorite person of mine, but one of his lines was very good, which is about known knowns,
link |
01:26:45.200
known unknowns and unknown unknowns. So in the business we are in, there are known unknowns
link |
01:25:52.960
and we have unknown unknowns. So I think with respect to a lot of what's the case in
link |
01:26:01.600
vision and robotics, I feel like we have known unknowns. So I have a sense of where we need
link |
01:26:09.040
to go and what the problems that need to be solved are. I feel that with respect to natural language
link |
01:26:16.080
understanding and high level cognition, it's not just known unknowns, but also unknown unknowns.
link |
01:26:24.000
So it is very difficult to put any kind of a time frame to that.
link |
01:26:30.720
Do you think some of the unknown unknowns might be positive in that they'll surprise us and make
link |
01:26:36.960
the job much easier? So fundamental breakthroughs? I think that is possible because certainly I have
link |
01:26:42.800
been very positively surprised by how effective these deep learning systems have been because I
link |
01:26:50.800
certainly would not have believed that in 2010. I think what we knew from the mathematical theory
link |
01:27:03.760
was that convex optimization works: when there's a single global optimum, then
link |
01:27:07.920
these gradient descent techniques would work. But these are nonlinear, nonconvex
link |
01:27:15.360
systems with a huge number of variables. So overparameterized. Overparameterized. And the people who used to
link |
01:27:22.640
play with them a lot, the ones who were totally immersed in the lore and the black magic, they
link |
01:27:29.680
knew that they worked well even though they were... Really? I thought, like, everybody was...
link |
01:27:36.160
No, the claim that I hear from my friends like Yann LeCun and so forth is that they feel that
link |
01:27:43.520
they were comfortable with them. Well, he says that now. But the community as a whole
link |
01:27:48.800
was certainly not. To me, that was the surprise: that they actually
link |
01:27:56.720
worked robustly for a wide range of problems from a wide range of initializations and so on.
link |
01:28:03.760
And so that was certainly more rapid progress than we expected. But then there are certainly
link |
01:28:13.920
lots of times, in fact most of the history of AI, when we have made less progress, at a slower
link |
01:28:21.120
rate than we expected. So we just keep going. I think what I regard as really unwarranted
link |
01:28:33.120
are these fears of AGI in 10 years and 20 years and that kind of stuff. Because that's based on
link |
01:28:43.040
completely unrealistic models of how rapidly we will make progress in this field.
link |
01:28:48.560
So I agree with you. But I've also gotten a chance to interact with very smart people who
link |
01:28:54.800
really worry about the existential threats of AI. And as an open minded person, I'm sort of
link |
01:29:00.480
taking it in. Do you think AI systems, in some way, through the unknown unknowns, not super
link |
01:29:12.080
intelligent AI, but in ways we don't quite understand, given the nature of super intelligence,
link |
01:29:17.280
will have a detrimental effect on society? Do you think this is something we should be
link |
01:29:22.880
worried about? Or do we need to first allow the unknown unknowns to become known unknowns?
link |
01:29:28.800
I think we need to be worried about AI today. I think that it is not just a worry we need to
link |
01:29:35.600
have when we get to AGI. I think that AI is being used in many systems today. And there might
link |
01:29:43.760
be settings, for example, where it causes biases or decisions which could be harmful, I mean,
link |
01:29:50.960
decisions which could be unfair to some people, or it could be a self driving car which kills
link |
01:29:56.240
a pedestrian. So AI systems are being deployed today, right? And they are being deployed in
link |
01:30:02.720
many different settings, maybe in medical diagnosis, maybe in a self driving car, maybe
link |
01:30:07.360
in selecting applicants for an interview. So I would argue that when these systems
link |
01:30:14.080
make mistakes, there are consequences. And we are in a certain sense responsible for those
link |
01:30:20.880
consequences. So I would argue that this is a continuous effort. And this is something that
link |
01:30:30.320
in a way is not so surprising. It's true of all engineering and scientific progress: with
link |
01:30:37.200
great power comes great responsibility. So as these systems are deployed, we have to worry
link |
01:30:42.320
about them. And it's a continuous problem. I don't think of it as something which will
link |
01:30:47.360
suddenly happen on some day in 2079, for which I need to design some clever trick.
link |
01:30:54.800
I'm saying that these problems exist today. And we need to be continuously on the lookout for
link |
01:31:02.240
worrying about safety, biases, risks, right? I mean, the self driving car kills a pedestrian
link |
01:31:09.680
and they have, right? I mean, there was the Uber incident in Arizona, right? It has happened.
link |
01:31:17.440
This is not about AGI. In fact, it's about a very dumb intelligence which is killing people.
link |
01:31:23.680
The worry people have with AGI is the scale. But I think you're 100% right:
link |
01:31:31.280
the thing that worries me about AI today, and it's happening at a huge scale, is recommender
link |
01:31:37.280
systems, recommendation systems. So if you look at Twitter or Facebook or YouTube, they're controlling
link |
01:31:45.600
the ideas that we have access to, the news and so on. And there's fundamentally a machine learning
link |
01:31:52.000
algorithm behind each of these recommendations. And, I mean, my life would not be the same
link |
01:31:58.320
without these sources of information. I'm a totally new human being. And the ideas that I know
link |
01:32:04.000
are very much because of the internet, because of the algorithms that recommend those ideas.
link |
01:32:08.880
And so as they get smarter and smarter, I mean, that is the AGI: the algorithm that's
link |
01:32:16.400
recommending the next YouTube video you should watch has control of millions, billions of people.
link |
01:32:25.040
That algorithm is already super intelligent and has complete control of the population.
link |
01:32:31.040
Not complete, but very strong control. For now, we can turn off YouTube, we can just go have a
link |
01:32:38.000
normal life outside of that. But as that gets into our life more and more, we will
link |
01:32:45.440
start depending on that algorithm and on the different companies that are working on it. So I think
link |
01:32:49.200
you're right, it's already there. And YouTube in particular is using computer
link |
01:32:55.600
vision, trying their hardest to understand the content of videos so they can
link |
01:33:03.200
connect videos with the people who would benefit from those videos the most. And so that development
link |
01:33:09.840
could go in a bunch of different directions, some of which might be harmful. So yeah, you're
link |
01:33:15.680
right. The threats of AI are here already; we should be thinking about them. On a philosophical
link |
01:33:22.080
note, personal perhaps: if you could relive a moment in your life, outside of family,
link |
01:33:31.760
because it made you truly happy or it was a profound moment that impacted the direction of your life,
link |
01:33:38.720
what moment would you go to?
link |
01:33:43.760
I don't think of single moments, but I look over the long haul. I feel that I've been very lucky
link |
01:33:51.040
because I think that in scientific research, a lot of it is about being at the
link |
01:34:01.040
right place at the right time. You can work on problems at a time when they're just
link |
01:34:08.960
too premature, you know, you beat your head against them and nothing happens, because
link |
01:34:14.800
the prerequisites for success are not there. And then there are times when you are in a field
link |
01:34:21.680
which is already pretty mature and you can only add curlicues upon curlicues. I've been lucky to
link |
01:34:30.800
have been in this field for 34 years, well, actually 34 years as a professor at Berkeley,
link |
01:34:37.840
so longer than that, from when it was just some little crazy, absolutely
link |
01:34:48.800
useless field which couldn't really do anything, to a time when it's really, really
link |
01:34:56.720
solving a lot of practical problems and has offered a lot of tools for scientific research,
link |
01:35:02.960
right, because computer vision is impactful for images in biology or astronomy and so on and
link |
01:35:10.640
so forth. So we have made great scientific progress, which has had real practical
link |
01:35:17.760
impact in the world. And I feel lucky that I got in at a time when the field was
link |
01:35:24.800
very young, and now it's mature, but not fully mature. It's mature, but not
link |
01:35:33.680
done. I mean, it's really still in a productive phase. Yeah, I think people 500 years
link |
01:35:40.560
from now would laugh at you calling this field mature. That is very possible. Yeah. But, lest I
link |
01:35:47.040
forget to mention, you've also mentored some of the biggest names of computer
link |
01:35:53.280
vision, computer science and AI today. There are so many questions I could ask, but really:
link |
01:36:00.640
what is it? How did you do it? What does it take to be a good mentor? What does it take to be
link |
01:36:06.320
a good guide? Yeah, I feel I've been lucky to have had very, very smart and hardworking
link |
01:36:16.640
and creative students. I think some part of the credit just belongs to being at Berkeley. I think
link |
01:36:24.960
those of us who are at top universities are blessed because we have very, very smart and capable
link |
01:36:32.880
students coming and knocking on our door. So I have to be humble enough to acknowledge that.
link |
01:36:39.120
But what have I added? I think I have added something. What I have added is, I think what
link |
01:36:47.760
I've always tried to teach them is a sense of picking the right problems. So, I think that in
link |
01:36:57.840
science, in the short run, success is always based on technical competence. You're, you know,
link |
01:37:05.680
you're quick with math or you are whatever. I mean, there's certain technical capabilities
link |
01:37:11.840
which make for short range progress. Long range progress is really determined by asking the right
link |
01:37:19.360
questions and focusing on the right problems. And I feel that what I've been able to bring to the
link |
01:37:28.160
table in terms of advising these students is some sense of taste of what are good problems.
link |
01:37:36.640
What are problems that are worth attacking now as opposed to waiting 10 years?
link |
01:37:41.360
What's a good problem, if you could summarize? If that's even possible to summarize. Like, what's
link |
01:37:46.720
your sense of a good problem? I think I have a sense of what is a good problem, which is,
link |
01:37:52.560
there is a British scientist, in fact, he won a Nobel Prize, Peter Medawar, who has a book on this.
link |
01:38:02.480
And basically he says research is the art of the soluble. So, we need to sort of find
link |
01:38:10.960
problems which are not yet solved, but which are approachable. And he sort of refers to this
link |
01:38:19.760
sense that there is this problem which isn't quite solved yet, but it has a soft underbelly.
link |
01:38:27.120
There is some place where you can spear the beast. And having that intuition that this
link |
01:38:35.360
problem is ripe is a good thing, because otherwise you can just beat your head and not make progress.
link |
01:38:42.160
So, I think that is important. So, if I have that and if I can convey that to students,
link |
01:38:49.520
it's not just that they do great research while they're working with me, but that they continue
link |
01:38:55.120
to do great research. So, in a sense, I'm proud of my students and their achievements and their
link |
01:39:00.480
great research, even 20 years after they've ceased being my students. So, some part is developing,
link |
01:39:06.800
helping them develop that sense that a problem is not yet solved, but is solvable. Correct.
link |
01:39:12.640
The other thing which I have, which I think I bring to the table, is a certain intellectual
link |
01:39:21.600
breadth. I've spent a fair amount of time studying psychology, neuroscience, relevant
link |
01:39:28.800
areas of applied math and so forth. So, I can probably help them see some connections
link |
01:39:34.960
to disparate things, which they might not have otherwise. So, the smart students coming into
link |
01:39:44.480
Berkeley can be very deep, in the sense that they can think very deeply, meaning very hard down one
link |
01:39:52.400
particular path. But where I could help them is with the shallow breadth, whereas they would have
link |
01:40:02.560
the narrow depth, and that's of some value. Well, it was beautifully refreshing just to hear you
link |
01:40:12.720
naturally jump from psychology back to computer science in this conversation, back and forth.
link |
01:40:17.280
I mean, that's actually a rare quality, and I think for students it's certainly empowering
link |
01:40:23.440
to think about problems in a new way. So, for that and for many other reasons, I really enjoyed
link |
01:40:28.400
this conversation. Thank you so much. It was a huge honor. Thanks for talking to me.
link |
01:40:31.840
It's been my pleasure. Thanks for listening to this conversation with Jitendra Malik and thank
link |
01:40:37.920
you to our sponsors, BetterHelp and ExpressVPN. Please consider supporting this podcast by going
link |
01:40:45.920
to betterhelp.com slash Lex and signing up at expressvpn.com slash Lex pod. Click the links,
link |
01:40:53.840
buy the stuff. It's how they know I sent you and it really is the best way to support this podcast
link |
01:41:00.000
and the journey I'm on. If you enjoy this thing, subscribe on YouTube, review it with five stars
link |
01:41:05.760
on Apple Podcasts, support it on Patreon or connect with me on Twitter at Lex Friedman.
link |
01:41:11.920
Don't ask me how to spell that. I don't remember myself. And now let me leave you with some words
link |
01:41:17.520
from Prince Mishkin in The Idiot by Dostoevsky. Beauty will save the world. Thank you for listening
link |
01:41:25.520
and hope to see you next time.