
Jitendra Malik: Computer Vision | Lex Fridman Podcast #110



00:00:00.000
The following is a conversation with Jitendra Malik, a professor at Berkeley and one of the seminal figures in the field of computer vision, the kind before the deep learning revolution and the kind after. He has been cited over 180,000 times and has mentored many world-class researchers in computer science.
00:00:22.880
Quick summary of the ads. Two sponsors, one new one, which is BetterHelp, and an old goodie, ExpressVPN. Please consider supporting this podcast by going to betterhelp.com/lex and signing up at expressvpn.com/lexpod. Click the links, buy the stuff, it really is the best way to support this podcast and the journey I'm on. If you enjoy this thing, subscribe on YouTube, review it with five stars on Apple Podcasts, support it on Patreon, or connect with me on Twitter at Lex Fridman, however the heck you spell that. As usual, I'll do a few minutes of ads now and never any ads in the middle that can break the flow of the conversation.
00:01:05.240
This show is sponsored by BetterHelp, spelled H-E-L-P, help. Check it out at betterhelp.com/lex. They figure out what you need and match you with a licensed professional therapist in under 48 hours. It's not a crisis line, it's not self-help, it's professional counseling done securely online. I'm a bit from the David Goggins line of creatures, as you may know, and so have some demons to contend with, usually on long runs or all-night work sessions, forever and possibly full of self-doubt. It may be because I'm Russian, but I think suffering is essential for creation. But I also think you can suffer beautifully, in a way that doesn't destroy you. For most people, I think a good therapist can help with this, so it's at least worth a try. Check out their reviews, they're good. It's easy, private, affordable, and available worldwide. You can communicate by text any time and schedule weekly audio and video sessions. I highly recommend that you check them out at betterhelp.com/lex.
00:02:15.440
This show is also sponsored by ExpressVPN. Get it at expressvpn.com/lexpod to support this podcast and to get an extra three months free on a one-year package. I've been using ExpressVPN for many years. I love it. I think ExpressVPN is the best VPN out there. They told me to say it, but it happens to be true. It doesn't log your data, it's crazy fast, and it's easy to use, literally just one big, sexy power-on button. Again, for obvious reasons, it's really important that they don't log your data. It works on Linux and everywhere else too, but really, why use anything else? Shout-out to my favorite flavor of Linux, Ubuntu MATE 20.04. Once again, get it at expressvpn.com/lexpod to support this podcast and to get an extra three months free on a one-year package. And now, here's my conversation with Jitendra Malik.
00:03:18.140
In 1966, Seymour Papert at MIT wrote up a proposal called the Summer Vision Project, to be given, as far as we know, to 10 students to work on and solve that summer. That proposal outlined many of the computer vision tasks we still work on today. Why do you think we underestimated, and perhaps still underestimate, how hard computer vision is?
00:03:46.420
Because most of what we do in vision, we do unconsciously or subconsciously.

In human vision?

In human vision. So that effortlessness gives us the sense that, oh, this must be very easy to implement on a computer. This is why the early researchers in AI got it so wrong. However, if you go into the neuroscience or psychology of human vision, then the complexity becomes very clear. The fact is that a very large part of the cerebral cortex is devoted to visual processing, and this is true in other primates as well. So once we look at it from a neuroscience or psychology perspective, it becomes quite clear that the problem is very challenging and will take some time.
00:04:39.680
You said the higher-level parts are the harder parts?

I think vision appears to be easy because most of visual processing is subconscious or unconscious, so we underestimate the difficulty. Whereas when you are proving a mathematical theorem or playing chess, the difficulty is much more evident, because it is your conscious brain which is processing various aspects of the problem-solving behavior. In vision, all of this is happening, but it's not in your awareness; it's operating below that.
00:05:25.840
But it still seems strange. Yes, that's true, but it seems strange that as computer vision researchers, the community broadly, time and time again, makes the mistake of thinking the problem is easier than it is. Or maybe it's not a mistake. We'll talk a little bit about autonomous driving, for example, and how hard a vision task that is. Do you think it's just human nature, or is there something fundamental to the vision problem that we underestimate? We're still not able to be cognizant of how hard the problem is.
00:06:05.400
Yeah, I think in the early days it could have been excused, because in the early days all aspects of AI were regarded as too easy. But I think today it is much less excusable. I think people fall for this because of what I call the fallacy of the successful first step. There are many problems in vision where you can get 50% of the solution in one minute, getting to 90% can take you a day, getting to 99% may take you five years, and 99.99% may be not in your lifetime.
00:06:49.720
I wonder if that's unique to vision. It seems that with language, people are not so confident; in natural language processing, people are a little bit more cautious about our ability to solve that problem. I think for language, people intuit that we have to be able to do natural language understanding. For vision, it seems that we're not cognizant of, or we don't think about, how much understanding is required.
00:07:19.400
It's probably still an open problem. But in your sense, how much understanding is required to solve vision? Put another way, how much of something called common-sense reasoning is required to really be able to interpret even static scenes?
00:07:39.080
Yeah. So vision operates at all levels, and there are parts which can be solved with what we could call maybe peripheral processing. In the human vision literature, there used to be these terms, sensation, perception, and cognition, which roughly speaking referred to the front end of processing, the middle stages of processing, and the higher levels of processing. And I think they made a big deal out of this; they wanted to study only perception and then dismiss certain problems as being, quote, cognitive. But really, I think these are artificial divides. The problem is continuous at all levels, and there are challenges at all levels. The techniques that we have today work better at the lower and mid levels of the problem. The higher levels of the problem, quote, the cognitive levels, are there, and in many real applications we have to confront them.
00:08:46.480
Now, how much of that is necessary will depend on the application. For some problems it doesn't matter; for some problems it matters a lot. So I am, for example, a pessimist on fully autonomous driving in the near future, and the reason is because I think there will be that 0.01% of cases where quite sophisticated cognitive reasoning is called for. However, there are tasks which are much more robust, in the sense that errors are not so much of a problem. For example, let's say you're doing image search, trying to get images based on some visual description. We are very tolerant of errors there, right? I mean, when Google image search gives you some images back and a few of them are wrong, it's okay. It doesn't hurt anybody; it's not a matter of life and death. But making mistakes when you are driving at 60 miles per hour, where you could potentially kill somebody, is much more consequential.
00:10:06.160
So just for the fun of it, since you mentioned it, let's go there briefly: autonomous vehicles. One of the companies in the space, Tesla, with Andrej Karpathy and Elon Musk, is working on a system called Autopilot, which is primarily a vision-based system with eight cameras and basically a single neural network, a multitask neural network. They call it HydraNet: multiple heads, so it does multiple tasks, but it forms the same representation at the core.
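The multiple-heads idea described here can be illustrated with a minimal sketch. Everything in it is a hypothetical stand-in, not Tesla's actual architecture: the layer sizes and the two example tasks are invented for illustration. The point is only that a shared representation is computed once and every task-specific head reads from it.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Shared backbone weights: stands in for the conv trunk that
# computes the single core representation from camera input.
W_shared = rng.normal(size=(64, 32))

# Task-specific heads, each reading the same shared features.
W_detect = rng.normal(size=(32, 10))  # hypothetical detection logits
W_depth = rng.normal(size=(32, 1))    # hypothetical depth estimate

def forward(features):
    shared = relu(features @ W_shared)  # computed once, reused by all heads
    return {"detection": shared @ W_detect, "depth": shared @ W_depth}

out = forward(rng.normal(size=(64,)))
print(out["detection"].shape, out["depth"].shape)  # (10,) (1,)
```

The design motivation is the one stated in the conversation: the expensive shared computation is amortized across tasks, while each head stays small.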
00:10:38.800
Do you think driving can be converted in this way to purely a vision problem and then solved with learning? Or, even more specifically, on the current approach, what do you think about what the Tesla Autopilot team is doing?
00:10:57.120
The way I think about it is that there are certainly subsets of the vision-based driving problem which are quite solvable. For example, driving in freeway conditions is quite a solvable problem. There were demonstrations of that going back to the 1980s by Ernst Dickmanns in Munich. In the 90s, there were approaches from Carnegie Mellon and approaches from our team at Berkeley. In the 2000s, there were approaches from Stanford, and so on. So autonomous driving in certain settings is very doable. The challenge is to have an autopilot work under all kinds of driving conditions. At that point, it's not just a question of vision or perception, but really also of control and dealing with all the edge cases.
00:11:54.200
So where are most of the difficult cases? To me, even highway driving is an open problem, because it falls victim to the same 50, 90, 95, 99 rule, the fallacy of the successful first step, as you put it. I think even highway driving has a lot of elements, because to solve autonomous driving you have to completely relinquish the help of a human being. The car is always in control, so you're really going to feel the edge cases. So I think even highway driving is really difficult.
00:12:29.480
But in terms of the general driving task, do you think vision is the fundamental problem, or is it also action, the interaction with the environment? And then there's the middle ground, which I don't know if you'd put under vision: trying to predict the behavior of others. That's partly in the world of understanding the scene, but it's also trying to form a model of the actors in the scene and predict their behavior.
00:13:01.640
Yeah, I include that in vision, because to me, perception blends into cognition, and building predictive models of other agents in the world, which could be people or other cars, is part of the task of perception. Perception has to tell us not just what is now, but what will happen, because what's now is boring. It's done, it's over with. We care about the future because we act in the future, and we care about the past inasmuch as it informs what's going to happen in the future. So I think we have to build predictive models of the behaviors of people, and those can get quite complicated.
00:13:48.020
I've seen examples of this. I own a Tesla, and it has various safety features built in. What I see are these examples where, let's say, there is a skateboarder. And I don't want to be too critical, because obviously these systems are always being improved, and any specific criticism I have, maybe the system six months from now will not have that particular failure mode. But it had the wrong response, and it's because it couldn't predict what this skateboarder was going to do, because that really required a higher-level cognitive understanding of what skateboarders typically do, as opposed to a normal pedestrian. What might have been the correct behavior for a pedestrian, the typical behavior for a pedestrian, was not the typical behavior for a skateboarder.
00:14:59.040
Yeah. And so to do a good job there, you need to have enough data where you have pedestrians and you also have skateboarders; you've seen enough skateboarders to see what kinds of patterns of behavior they have. So in principle, with enough data, that problem could be solved. But I think our current computer vision systems need far, far more data than humans do to learn those same capabilities.
00:15:33.760
So say that there is going to be a system that solves autonomous driving. Do you think it will look similar to what we have today, but with a lot more data and perhaps more compute, with the fundamental architecture unchanged? In the case of Tesla Autopilot, that's neural networks. Do you think it will look similar in that regard, and we'll just have more data?
00:15:57.160
That's a scientific hypothesis as to which way it's going to go. I will tell you what I would bet on, and this is my general philosophical position on these learning systems. What we have found currently very effective in computer vision, in the deep learning paradigm, is sort of tabula rasa learning, tabula rasa learning in a supervised way, with lots and lots of...

What's tabula rasa learning?

Tabula rasa in the sense of a blank slate: we just have the system, which is given a series of experiences in this setting, and then it learns there.
00:16:39.960
Now let's think about human driving. It is not tabula rasa learning. At the age of 16, in high school, a teenager goes into driver ed class, right? And at that point they learn, but at the age of 16 they are already visual geniuses, because from zero to 16 they have built a certain repertoire of vision. In fact, most of it has probably been achieved by age two. In the period up to age two, they learn that the world is three-dimensional, how objects look from different perspectives, about occlusion, and about the common dynamics of humans and other bodies; they have some notion of intuitive physics. They built that up from their observations and interactions in early childhood, and of course it is reinforced as they grow up to age 16.
00:17:44.020
So then at age 16, when they go into driver ed, what are they learning? They're not learning the visual world afresh; they have a mastery of the visual world. What they are learning is control. They're learning how to be smooth with control, with steering and brakes and so forth, and they're learning a sense of typical traffic situations. That education process can be quite short, because they are coming in as visual geniuses. And of course in the future they're going to encounter situations which are very novel, right? During my driver ed class, I may not have had to deal with a skateboarder. I may not have had to deal with a truck driving in front of me where the back opens up and some junk gets dropped from the truck and I have to deal with it. But I can deal with this as a driver, even though I did not encounter it in my driver ed class. And the reason I can deal with it is because I have all this general visual knowledge and expertise.
00:18:55.120
And do you think the learning mechanisms we have today can do that kind of long-term accumulation of knowledge? Or do we have to do some kind of, you know, the work that led up to expert systems, with knowledge representation? The broader field of artificial intelligence worked on this kind of accumulation of knowledge. Do you think neural networks can do the same?
00:19:22.040
I don't see any in-principle problem with neural networks doing it, but I think the learning techniques would need to evolve significantly. The current learning techniques that we have are supervised learning: you're given lots of examples, (x, y) pairs, and you learn the functional mapping between them.
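Supervised learning in the sense described here, given (x, y) pairs, fit the mapping between them, can be sketched in a few lines. The data and the linear model below are illustrative assumptions, not anything from the conversation:

```python
# Supervised learning in its simplest form: (x, y) example pairs,
# and a parametric mapping fit to them by gradient descent.
examples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # y = 2x + 1

w, b = 0.0, 0.0  # parameters of the hypothesis f(x) = w*x + b
lr = 0.05        # learning rate
for _ in range(2000):
    for x, y in examples:
        err = (w * x + b) - y  # prediction error on one pair
        w -= lr * err * x      # gradient step for the weight
        b -= lr * err          # gradient step for the bias

# After training, (w, b) is close to the underlying mapping (2, 1).
print(round(w, 2), round(b, 2))
```

The contrast with human learning that follows is the point: nothing here explores, experiments, or arranges its own data; the mapping is learned only from the pairs it is handed.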
00:19:48.520
I think that human learning is far richer than that. It includes many different components. A child explores the world: for example, a child takes an object and manipulates it in his hand, and therefore gets to see the object from different points of view. And the child has commanded the movement. So that's a kind of learning data, but learning data that has been arranged by the child, and this is a very rich kind of data. The child can do various experiments with the world. So there are many aspects of human learning, and these have been studied in child development by psychologists. What they tell us is that supervised learning is a very small part of it; there are many different aspects of learning. What we would need to do is to develop models of all of these and then train our systems with that kind of a protocol.
00:21:02.480
So, new methods of learning, some of which might imitate the human brain. But you have also, in your talks, mentioned the compute side of things, in terms of the difference from the human brain, referencing Hans Moravec. Do you think there's something interesting and valuable to consider about the difference in the computational power of the human brain versus the computers of today, in terms of instructions per second?
00:21:34.360
Yes. This is a point I've been making for 20 years now. Once upon a time, the way I used to argue this was that we just didn't have the computing power of the human brain; our computers were not quite there. There is a well-known trade-off: neurons are slow compared to transistors, but we have a lot of them and they have very high connectivity. Whereas in silicon you have much faster devices, transistors switch on the order of nanoseconds, but the connectivity is usually smaller. At this point in time, and we are now talking about 2020, if you consider the latest GPUs and so on, we do have amazing computing power. And if we look back at the kind of calculations Hans Moravec did in the 1990s, we may be there today in terms of computing power comparable to the brain, but it's not of the same style; it's of a very different style. For example, the style of computing that we have in our GPUs is far, far more power-hungry than the style of computing in the human brain or other biological entities.
00:23:03.920
Yeah. And that efficiency part, we're going to have to solve that in order to build actual real-world systems at large scale.
00:23:15.160
Let me ask a high-level question, taking a step back. How would you articulate the general problem of computer vision? Does such a thing exist? If you look at the computer vision conferences and the work that's been going on, it's often separated into different little segments, breaking the problem of vision apart into, whether it's segmentation, 3D reconstruction, object detection, I don't know, image captioning, whatever, with benchmarks for each. But if you were to philosophically say: what is the big problem of computer vision? Does such a thing exist?
00:23:54.640
Yes, but it's not in isolation. For all intelligence tasks, I always go back to biology, to humans. And if we think about vision or perception in that setting, we realize that perception is always there to guide action. Perception for a biological system does not give any benefits unless it is coupled with action. We can go back and think about the first multicellular animals, which arose in the Cambrian era, you know, 500 million years ago. These animals could move and they could see in some way, and the two activities helped each other. How does movement help? Movement helps because you can get food in different places, but you need to know where to go, and that's really about perception, or seeing. Vision is perhaps the single most important perceptual sense, but all the others are also important. So perception and action kind of go together. Earlier, it was in these very simple feedback loops, which were about finding food, or avoiding becoming food if there's a predator trying to, you know, eat you up, and so forth.
00:25:25.360
So we must, at the fundamental level, connect perception to action. Then as we evolved, perception became more and more sophisticated, because it served many more purposes. So today we have what seems like a fairly general-purpose capability, which can look at the external world and build a model of the external world inside the head. We do have that capability. That model is not perfect, and psychologists have great fun in pointing out the ways in which the model in your head is not a perfect model of the external world; they create various illusions to show the ways in which it is imperfect. But it's amazing how far it has come from the very simple perception-action loop that existed in an animal 500 million years ago. Once we have these very sophisticated visual systems, we can then impose a structure on them. It's we as scientists who are imposing that structure, where we have chosen to characterize this part of the system as, quote, the module of object detection, or, quote, the module of 3D reconstruction. What's going on is really that all of these processes are running simultaneously, and they are running simultaneously because originally their purpose was in fact to help guide action.
00:27:01.000
So as a guiding general statement of the problem: you said that in humans, vision was tied to action. Do you think we should also say that ultimately the goal, the problem of computer vision, is to sense the world in a way that helps you act in the world?
00:27:27.080
Yes, I think that's the most fundamental purpose. We have by now hyper-evolved, so we have this visual system which can be used for other things, for example, judging the aesthetic value of a painting. And this is not guiding action. Maybe it's guiding action in terms of how much money you will put in your auction bid, but that's a bit of a stretch. The basics are in fact in terms of action, but we have hyper-evolved our visual system.
00:28:08.160
Actually, sorry to interrupt, but perhaps it is fundamentally about action. You kind of jokingly said about spending, but perhaps the capitalistic drive that drives a lot of the development in this world is about the exchange of money, and the fundamental action is spending money. If you watch Netflix, if you enjoy watching movies, you're using your perception system to interpret the movie; ultimately, your enjoyment of that movie means you'll subscribe to Netflix. So the action, this extra layer that we've developed in modern society, perhaps is fundamentally tied to the action of spending money.
00:28:47.760
Well, certainly with respect to interactions with firms. In this homo economicus role, when you're interacting with firms, it does become that.

What else is there? And that was a rhetorical question.
00:29:07.800
So to linger on the division between the static and the dynamic: so much of the work in computer vision, so many of the breakthroughs that you've been a part of, have been in the static world, looking at static images. You've also worked on, and to a much smaller degree the community has looked at, the dynamic: video, dynamic scenes. And then there is robotic vision, which is dynamic, but where you actually have a robot in the physical world interacting based on that vision. Which problem is harder? The trivial first answer is, well, of course one image is harder. But if you look at a deeper question there: are we, what's the term, cutting ourselves off at the knees, making the problem harder by focusing on images?
00:30:08.200
That's a fair question. I think sometimes we can simplify a problem so much that we essentially lose part of the juice that could enable us to solve the problem, and one could reasonably argue that to some extent this happens when we go from video to single images. Now, historically, you have to consider the limits imposed by the computational capabilities we had. Many of the choices made in the computer vision community through the 70s, 80s, and 90s can be understood as choices which were forced upon us by the fact that we just didn't have access to enough compute.
00:31:01.760
Not enough memory, not enough hardware.

Exactly. Not enough compute, not enough storage.
link |
00:31:08.240
So think of these choices.
link |
00:31:09.480
So one of the choices is focusing on single images rather than video.
link |
00:31:14.280
Okay.
link |
00:31:15.280
Clear question.
link |
00:31:16.760
Storage and compute.
link |
00:31:19.400
We used to detect edges and throw away the image.
link |
00:31:24.960
Right?
link |
00:31:25.960
So we would have an image which is, say, 256 by 256 pixels, and instead of keeping around
link |
00:31:31.120
the grayscale values, what we did was detect edges, find the places where the brightness
link |
00:31:37.360
changes a lot and then throw away the rest.
link |
00:31:42.040
So this was a major compression device and the hope was that this makes it that you can
link |
00:31:47.640
still work with it and the logic was humans can interpret a line drawing.
link |
00:31:53.480
And yes, and this will save us computation.
link |
00:31:58.240
So many of the choices were dictated by that.
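The detect-edges-and-discard pipeline described here can be sketched in a few lines. This is a toy illustration with simple finite differences, a random stand-in image, and an arbitrary threshold, not a reconstruction of any particular detector from that era:

```python
import numpy as np

# A stand-in 256x256 grayscale image (values 0-255); in practice this
# would come from a file.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(256, 256)).astype(float)

# Finite-difference gradients: places where brightness changes a lot.
gx = np.diff(img, axis=1, prepend=img[:, :1])
gy = np.diff(img, axis=0, prepend=img[:1, :])
magnitude = np.hypot(gx, gy)

# Keep only pixels whose gradient magnitude exceeds a threshold;
# everything else is thrown away.
edges = magnitude > 100.0

# The edge map is a binary image: 1 bit per pixel instead of 8,
# and usually sparse on natural images.
print(edges.shape)  # (256, 256)
print(edges.dtype)  # bool
```

The compression comes from keeping a sparse binary map in place of the full grayscale array, at the cost of all the information the threshold discards.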
link |
00:32:00.920
I think today we are no longer detecting edges, right?
link |
00:32:07.240
We process full images with ConvNets, because we don't need to throw that information away.
link |
00:32:10.840
We don't have those compute restrictions anymore.
link |
00:32:14.040
Now video is still understudied because video compute is still quite challenging if you
link |
00:32:19.880
are a university researcher.
link |
00:32:22.320
I think video computing is not so challenging if you are at Google or Facebook or Amazon.
link |
00:32:29.080
Still super challenging.
link |
00:32:30.080
I just spoke with the VP of engineering at Google, head of the YouTube search and discovery
link |
00:32:35.480
and they still struggle doing stuff on video.
link |
00:32:38.480
It's very difficult except using techniques that are essentially the techniques you used
link |
00:32:44.360
in the 90s.
link |
00:32:45.500
Some very basic computer vision techniques.
link |
00:32:48.680
No, that's when you want to do things at scale.
link |
00:32:51.540
So if you want to operate at the scale of all the content of YouTube, it's very challenging
link |
00:32:56.920
and there are similar issues with Facebook.
link |
00:32:59.440
But as a researcher, you have more opportunities.
link |
00:33:05.840
You can train large networks with relatively large video data sets.
link |
00:33:11.240
So I think that this is part of the reason why we have so emphasized static images.
link |
00:33:17.160
I think that this is changing and over the next few years, I see a lot more progress
link |
00:33:22.800
happening in video.
link |
00:33:25.240
So I have this generic statement that to me, video recognition feels like 10 years behind
link |
00:33:32.560
object recognition and you can quantify that because you can take some of the challenging
link |
00:33:37.840
video data sets and their performance on action classification is like say 30%, which is kind
link |
00:33:45.280
of what we used to have around 2009 in object detection.
link |
00:33:51.840
It's like about 10 years behind and whether it'll take 10 years to catch up is a different
link |
00:33:58.160
question.
link |
00:33:59.160
Hopefully, it will take less than that.
link |
00:34:01.360
Let me ask a similar question I've already asked, but once again, so for dynamic scenes,
link |
00:34:08.600
do you think some kind of injection of knowledge bases and reasoning is required to help improve
link |
00:34:17.280
like action recognition?
link |
00:34:20.400
Like if we solved the general action recognition problem, what do you think the solution would
link |
00:34:28.800
look like? That's another way to put it.
link |
00:34:31.120
So I completely agree that knowledge is called for and that knowledge can be quite sophisticated.
link |
00:34:39.720
So the way I would say it is that perception blends into cognition and cognition brings
link |
00:34:44.960
in issues of memory and this notion of a schema from psychology, which is, let me use the
link |
00:34:54.040
classic example, which is you go to a restaurant, right?
link |
00:34:58.780
Now there are things that happen in a certain order, you walk in, somebody takes you to
link |
00:35:03.580
a table, waiter comes, gives you a menu, takes the order, food arrives, eventually bill arrives,
link |
00:35:13.240
et cetera, et cetera.
link |
00:35:15.160
This is a classic example of AI from the 1970s.
link |
00:35:19.840
There were the terms frames and scripts and schemas; these are all quite similar
link |
00:35:26.080
ideas.
link |
00:35:27.080
Okay, and in the 70s, the way the AI of the time dealt with it was by hand coding this.
link |
00:35:34.280
So they hand coded in this notion of a script and the various stages and the actors and
link |
00:35:40.440
so on and so forth, and use that to interpret, for example, language.
link |
00:35:45.440
I mean, if there's a description of a story involving some people eating at a restaurant,
link |
00:35:52.840
there are all these inferences you can make because you know what happens typically at
link |
00:35:58.440
a restaurant.
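The kind of hand-coded script being described can be caricatured in a few lines. The stage names and the `infer_between` helper are invented for illustration; they are not from any actual 1970s system:

```python
# A toy hand-coded restaurant "script" in the spirit of 1970s AI:
# a fixed ordering of stages for a stereotyped situation.
restaurant_script = [
    "enter", "be_seated", "receive_menu", "order", "eat", "pay", "leave",
]

def infer_between(observed_a, observed_b):
    """Given two observed events, infer the stages that typically
    happened in between, using the script's fixed ordering."""
    i = restaurant_script.index(observed_a)
    j = restaurant_script.index(observed_b)
    return restaurant_script[i + 1:j]

# A story mentions only that the diners ordered and later paid;
# the script lets a system infer the unstated step.
print(infer_between("order", "pay"))  # ['eat']
```

This is exactly the sort of knowledge the conversation suggests should be learned from experience rather than hand-coded.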
link |
00:36:00.240
So I think this kind of knowledge is absolutely essential.
link |
00:36:06.120
So I think that when we are going to do long form video understanding, we are going to
link |
00:36:12.320
need to do this.
link |
00:36:13.400
I think the kinds of technology that we have right now with 3D convolutions over a couple
link |
00:36:19.360
of seconds of a video clip, are very much tailored towards short term video understanding,
link |
00:36:26.080
not that long term understanding.
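To make the "short term" point concrete, here is a small sketch of the arithmetic. The frame count, frame rate, and kernel size are typical illustrative values, not figures from the conversation:

```python
def conv3d_out(size, kernel, stride=1, pad=0):
    """Output length along one axis of a 3D convolution."""
    return (size + 2 * pad - kernel) // stride + 1

# A typical short-clip input: 16 frames of 112x112 video.
frames, h, w = 16, 112, 112

# A 3x3x3 kernel with stride 1 and padding 1 keeps the shape.
print(conv3d_out(frames, 3, 1, 1), conv3d_out(h, 3, 1, 1))  # 16 112

# At 25 fps, 16 frames is well under one second of video: such an
# architecture only ever sees a brief temporal window, nothing like
# the minutes-long structure of a restaurant visit.
print(frames / 25.0)  # 0.64
```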
link |
00:36:28.440
Long term understanding requires this notion of schemas that I talked about, perhaps some
link |
00:36:35.760
notions of goals, intentionality, functionality, and so on and so forth.
link |
00:36:43.120
Now, how will we bring that in?
link |
00:36:46.040
So we could either revert back to the 70s and say, OK, I'm going to hand code in a script
link |
00:36:51.760
or we might try to learn it.
link |
00:36:56.280
So I tend to believe that we have to find learning ways of doing this because I think
link |
00:37:03.560
learning ways land up being more robust.
link |
00:37:06.880
And there must be a learning version of the story because children acquire a lot of this
link |
00:37:12.440
knowledge by sort of just observation.
link |
00:37:16.640
It's possible, but I think it's not so typical,
link |
00:37:24.320
that a mother coaches a child through all the stages of what happens in
link |
00:37:29.560
a restaurant.
link |
00:37:30.560
They just go as a family, they go to the restaurant, they eat, come back, and the child goes through
link |
00:37:36.480
ten such experiences and the child has got a schema of what happens when you go to a
link |
00:37:41.560
restaurant.
link |
00:37:42.720
So we somehow need to provide that capability to our systems.
link |
00:37:48.040
You mentioned the following line from the end of the Alan Turing paper, Computing Machinery
link |
00:37:53.880
and Intelligence, that many people, like you said, many people know and very few have read
link |
00:37:59.680
where he proposes the Turing test.
link |
00:38:03.960
This is how you know because it's towards the end of the paper.
link |
00:38:06.960
Instead of trying to produce a program to simulate the adult mind, why not rather try
link |
00:38:10.940
to produce one which simulates the child's?
link |
00:38:14.440
So that's a really interesting point.
link |
00:38:17.280
If I think about the benchmarks we have before us, the tests of our computer vision systems,
link |
00:38:24.520
they're often kind of trying to get to the adult.
link |
00:38:28.340
So what kind of benchmarks should we have?
link |
00:38:31.160
What kind of tests for computer vision do you think we should have that mimic the child's
link |
00:38:37.400
in computer vision?
link |
00:38:38.400
I think we should have those and we don't have those today.
link |
00:38:42.880
And I think the part of the challenge is that we should really be collecting data of the
link |
00:38:50.240
type that the child experiences.
link |
00:38:55.180
So that gets into issues of privacy and so on and so forth.
link |
00:38:59.400
But there are attempts in this direction to sort of try to collect the kind of data that
link |
00:39:05.080
a child encounters growing up.
link |
00:39:08.600
So what's the child's linguistic environment?
link |
00:39:11.200
What's the child's visual environment?
link |
00:39:13.580
So if we could collect that kind of data and then develop learning schemes based on that
link |
00:39:20.800
data, that would be one way to do it.
link |
00:39:25.160
I think that's a very promising direction myself.
link |
00:39:28.880
There might be people who would argue that we could just short circuit this in some way
link |
00:39:33.920
and sometimes we have had success by not imitating nature in detail.
link |
00:39:44.440
So the usual example is airplanes, right?
link |
00:39:47.520
We don't build flapping wings.
link |
00:39:51.940
So yes, that's one of the points of debate.
link |
00:39:57.160
In my mind, I would bet on this learning like a child approach.
link |
00:40:05.120
So one of the fundamental aspects of learning like a child is the interactivity.
link |
00:40:11.400
So the child gets to play with the data set it's learning from.
link |
00:40:14.200
Yes.
link |
00:40:15.200
So it gets to select.
link |
00:40:16.200
I mean, you can call that active learning.
link |
00:40:19.600
In the machine learning world, you can call it a lot of terms.
link |
00:40:23.660
What are your thoughts about this whole space of being able to play with the data set or
link |
00:40:27.600
select what you're learning?
link |
00:40:29.320
Yeah.
link |
00:40:30.320
So I think that I believe in that and I think that we could achieve it in two ways and I
link |
00:40:38.720
think we should use both.
link |
00:40:40.800
So one is actually real robotics, right?
link |
00:40:45.560
So real physical embodiments of agents who are interacting with the world and they have
link |
00:40:52.880
a physical body with dynamics and mass and moment of inertia and friction and all the
link |
00:40:59.440
rest, and the robot learns its body by doing a series of actions.
link |
00:41:08.400
The second is simulation environments.
link |
00:41:11.640
So I think simulation environments are getting much, much better.
link |
00:41:17.000
In my time at Facebook AI Research, our group has worked on something called Habitat, which
link |
00:41:24.880
is a simulation environment, which is a visually photorealistic environment of places like
link |
00:41:34.560
houses or interiors of various urban spaces and so forth.
link |
00:41:39.680
And as you move, you get a picture, which is a pretty accurate picture.
link |
00:41:45.000
So now you can imagine that subsequent generations of these simulators will be accurate, not
link |
00:41:53.880
just visually, but with respect to forces and masses and haptic interactions and so
link |
00:42:01.600
on.
link |
00:42:03.560
And then we have that environment to play with.
link |
00:42:07.520
I think, let me state one reason why I think being able to act in the world is important.
link |
00:42:16.280
I think that this is one way to break the correlation versus causation barrier.
link |
00:42:23.000
So this is something which is of a great deal of interest these days.
link |
00:42:27.160
I mean, people like Judea Pearl have talked a lot about how we are neglecting causality
link |
00:42:34.660
and he describes the entire set of successes of deep learning as just curve fitting, right?
link |
00:42:42.740
But I don't quite agree with that.
link |
00:42:45.240
He's a troublemaker.
link |
00:42:46.240
He is.
link |
00:42:47.240
But causality is important, but causality is not like a single silver bullet.
link |
00:42:54.520
It's not like one single principle.
link |
00:42:56.160
There are many different aspects here.
link |
00:42:58.660
One of our most reliable ways of establishing causal links,
link |
00:43:05.120
and this is the way, for example, the medical community does this, is randomized controlled
link |
00:43:11.600
trials.
link |
00:43:12.840
So you pick some situations, and now in some of them you perform an action and
link |
00:43:18.440
for certain others you don't, right?
link |
00:43:22.600
So you have a controlled experiment.
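A minimal simulation of the randomized-trial logic, with a made-up treatment effect, shows why random assignment lets a simple difference in means estimate a causal effect:

```python
import random

random.seed(0)

# A toy randomized trial: each outcome is baseline noise plus a fixed
# effect if the subject was treated. The effect size is invented.
true_effect = 2.0

def outcome(treated):
    return random.gauss(0, 1) + (true_effect if treated else 0.0)

n = 10000
# Random assignment breaks any link between who gets treated and
# their baseline, so the difference in group means estimates the
# causal effect of the action.
treated = [outcome(True) for _ in range(n)]
control = [outcome(False) for _ in range(n)]
estimate = sum(treated) / n - sum(control) / n
print(round(estimate, 1))  # close to true_effect (2.0)
```

The child's small-scale experiments mentioned next play the same role: act in some situations, not in others, and compare.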
link |
00:43:23.800
Well, the child is in fact performing controlled experiments all the time, right?
link |
00:43:28.880
Right.
link |
00:43:29.880
Okay.
link |
00:43:30.880
Small scale.
link |
00:43:31.880
In a small scale.
link |
00:43:32.880
But that is a way that the child gets to build and refine its causal models of the world.
link |
00:43:41.240
And my colleague Alison Gopnik, together with a couple of coauthors, has this
link |
00:43:47.000
book called The Scientist in the Crib, referring to the children.
link |
00:43:50.820
The part that I like about that is the scientist wants to build
link |
00:43:57.720
causal models and the scientist does controlled experiments.
link |
00:44:01.820
And I think the child is doing that.
link |
00:44:03.800
So to enable that, we will need to have these active experiments.
link |
00:44:10.240
And I think this could be done, some in the real world and some in simulation.
link |
00:44:14.640
So you have hope for simulation.
link |
00:44:16.840
I have hope for simulation.
link |
00:44:18.120
That's an exciting possibility if we can get to not just photorealistic, but what's that
link |
00:44:22.960
called life realistic simulation.
link |
00:44:27.720
So you don't see any fundamental blocks to why we can't eventually simulate the principles
link |
00:44:35.800
of what it means to exist in the world as a physical being.
link |
00:44:39.440
I don't see any fundamental problems. I mean, look, the computer graphics community
link |
00:44:43.960
has come a long way.
link |
00:44:45.440
So in the early days, going back to the eighties and nineties, they were focusing
link |
00:44:50.600
on visual realism, right?
link |
00:44:52.760
And then they could do the easy stuff, but they couldn't do stuff like hair or fur and
link |
00:44:58.080
so on.
link |
00:44:59.080
Okay, well, they managed to do that.
link |
00:45:01.280
Then they couldn't do physical actions, right?
link |
00:45:04.440
Like there's a glass bowl and it falls down and it shatters, but then they could
link |
00:45:09.120
start to do pretty realistic models of that and so on and so forth.
link |
00:45:13.920
So the graphics people have shown that they can do this forward direction, not just for
link |
00:45:19.920
optical interactions, but also for physical interactions.
link |
00:45:23.880
So I think, of course, some of that is very compute intensive, but I think by and by we
link |
00:45:30.000
will find ways of making our models ever more realistic.
link |
00:45:35.860
You break vision apart into, in one of your presentations, early vision, static scene
link |
00:45:40.600
understanding, dynamic scene understanding, and raise a few interesting questions.
link |
00:45:44.320
I thought I could just throw some at you to see if you want to talk about them.
link |
00:45:50.360
So early vision, so it's, what is it that you said, sensation, perception and cognition.
link |
00:45:58.360
So is this a sensation?
link |
00:46:00.720
Yes.
link |
00:46:01.720
What can we learn from image statistics that we don't already know?
link |
00:46:05.720
So at the lowest level, what can we make from just the statistics, the basics, or the variations
link |
00:46:15.560
in the raw pixels, the textures and so on?
link |
00:46:18.480
Yeah.
link |
00:46:19.480
So what we seem to have learned is that there's a lot of redundancy in these images and as
link |
00:46:28.960
a result, we are able to do a lot of compression and this compression is very important in
link |
00:46:35.000
biological settings, right?
link |
00:46:36.960
So you might have 10 to the 8 photoreceptors and only 10 to the 6 fibers in the optic nerve.
link |
00:46:42.560
So you have to do this compression by a factor of 100 to 1.
link |
00:46:46.880
And so there are analogs of that which are happening in our neural net, artificial neural
link |
00:46:54.760
network.
link |
00:46:55.760
That's the early layers.
link |
00:46:56.760
So you think there's a lot of compression that can be done in the beginning.
link |
00:47:01.520
Just the statistics.
link |
00:47:02.520
Yeah.
link |
00:47:03.520
So how successful is image compression?
link |
00:47:05.640
How much?
link |
00:47:06.640
Well, I mean, the way to think about it is just how successful is image compression,
link |
00:47:14.160
right?
link |
00:47:15.160
And that's been done with older technologies, but there are several
link |
00:47:23.160
companies which are trying to use sort of these more advanced neural network type techniques
link |
00:47:29.160
for compression, both for static images as well as for video.
link |
00:47:34.360
One of my former students has a company which is trying to do stuff like this.
link |
00:47:41.880
And I think that they are showing quite interesting results.
link |
00:47:47.480
And I think that success is really about image statistics and
link |
00:47:52.560
video statistics.
link |
00:47:53.560
But that's still not doing compression of the kind, when I see a picture of a cat, all
link |
00:47:59.120
I have to say is it's a cat, that's another semantic kind of compression.
link |
00:48:02.480
Yeah.
link |
00:48:03.480
So this is at the lower level, right?
link |
00:48:04.800
So we are, as I said, yeah, that's focusing on low level statistics.
link |
00:48:10.280
So to linger on that for a little bit, you mentioned how far can bottom up image segmentation
link |
00:48:17.880
go.
link |
00:48:18.880
You know, you mentioned that the central question for scene understanding is the interplay
link |
00:48:24.680
of bottom up and top down information.
link |
00:48:26.880
Maybe this is a good time to elaborate on that.
link |
00:48:29.980
Maybe define what is bottom up, what is top down in the context of computer vision.
link |
00:48:37.400
Right.
link |
00:48:38.400
So today what we have are very interesting systems because they work completely bottom
link |
00:48:45.160
up.
link |
00:48:46.160
What does bottom up mean, sorry?
link |
00:48:47.920
So bottom up, in this case, means a feed forward neural network.
link |
00:48:52.160
So starting from the raw pixels, yeah, they start from the raw pixels and they end up
link |
00:48:57.020
with some, something like cat or not a cat, right?
link |
00:49:00.600
So our systems are running totally feed forward.
link |
00:49:04.600
They're trained in a very top down way.
link |
00:49:07.560
So they're trained by saying, okay, this is a cat, there's a cat, there's a dog, there's
link |
00:49:11.560
a zebra, et cetera.
link |
00:49:14.440
And I'm not happy with either of these choices fully.
link |
00:49:18.560
Because we have completely separated these processes, right?
link |
00:49:24.960
So what do we know compared to biology?
link |
00:49:34.160
So in biology, what we know is that at test time, at runtime, the processes
link |
00:49:42.500
are not purely feed forward, but they involve feedback.
link |
00:49:46.340
And they involve much shallower neural networks.
link |
00:49:50.080
So the kinds of neural networks we are using in computer vision, say a ResNet 50 has 50
link |
00:49:55.880
layers.
link |
00:49:56.880
Well in the brain, in the visual cortex going from the retina to IT, maybe we have like
link |
00:50:02.800
seven, right?
link |
00:50:04.240
So they're far shallower, but we have the possibility of feedback.
link |
00:50:08.080
So there are backward connections.
link |
00:50:11.000
And this might enable us to deal with the more ambiguous stimuli, for example.
link |
00:50:18.240
So the biological solution seems to involve feedback, the solution in artificial vision
link |
00:50:26.480
seems to be just feed forward, but with a much deeper network.
link |
00:50:30.760
And the two are functionally equivalent because if you have a feedback network, which just
link |
00:50:35.500
has like three rounds of feedback, you can just unroll it and make it three times the
link |
00:50:40.440
depth and create it in a totally feed forward way.
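The equivalence described here, between a few rounds of feedback and a deeper weight-tied feed-forward stack, can be checked with a toy numpy example (the layer sizes and ReLU choice are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# A single shallow layer with feedback: the same weights W are applied
# repeatedly, each round refining the hidden state. Illustrative only;
# real models add learned readouts, normalization, and so on.
W = rng.normal(0, 0.1, size=(16, 16))
x = rng.normal(size=16)

def relu(v):
    return np.maximum(v, 0.0)

# Recurrent form: three rounds of feedback through the same layer.
h = x
for _ in range(3):
    h = relu(W @ h + x)
recurrent_out = h

# Unrolled form: a feed-forward stack three layers deep that shares
# the same weights. Functionally identical to the loop above.
h1 = relu(W @ x + x)
h2 = relu(W @ h1 + x)
h3 = relu(W @ h2 + x)

print(np.allclose(recurrent_out, h3))  # True
```

The unrolled network trades depth for feedback: same function, but with no runtime recurrence and three times the layers.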
link |
00:50:44.520
So this is something which, I mean, we have written some papers on this theme, but I really
link |
00:50:49.800
feel that this theme should be pursued further.
link |
00:50:55.720
Some kind of recurrence mechanism.
link |
00:50:57.440
Yeah.
link |
00:50:58.440
Okay.
link |
00:50:59.440
So I want to have a little bit more top down at test time.
link |
00:51:07.440
Okay.
link |
00:51:08.440
And then at training time, we make use of a lot of top down knowledge right now.
link |
00:51:13.800
So basically to learn to segment an object, we have to have all these examples of this
link |
00:51:19.320
is the boundary of a cat, and this is the boundary of a chair, and this is the boundary
link |
00:51:22.840
of a horse and so on.
link |
00:51:24.640
And this is too much top down knowledge.
link |
00:51:27.960
How do humans do this?
link |
00:51:30.400
We manage with far less supervision and we do it in a sort of bottom up way because
link |
00:51:36.680
for example, we are looking at a video stream and the horse moves and that enables me to
link |
00:51:44.540
say that all these pixels are together.
link |
00:51:47.360
So the Gestalt psychologist used to call this the principle of common fate.
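A toy version of common-fate grouping: pixels that share a motion vector are grouped into one entity. The flow field here is synthetic, standing in for optical flow computed from a real video of, say, a moving horse:

```python
import numpy as np

# An 8x8 field of per-pixel motion vectors between two frames.
flow = np.zeros((8, 8, 2))
flow[2:6, 2:6] = [3.0, 0.0]  # a block of pixels moving right together

# Common fate: pixels moving with significant, shared velocity are
# grouped as one entity; the static background is everything else.
moving = np.linalg.norm(flow, axis=2) > 0.5
labels = moving.astype(int)  # 1 = moving entity, 0 = background

print(labels.sum())  # 16 pixels grouped as one moving entity
```

No object labels are needed for this grouping, which is the point: the segmentation signal comes bottom up, from motion alone.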
link |
00:51:53.180
So there was a bottom up process by which we were able to segment out these objects
link |
00:51:58.160
and we have totally focused on this top down training signal.
link |
00:52:04.540
So in my view, we have currently solved it in machine vision, this top down bottom up
link |
00:52:10.280
interaction, but I don't find the solution fully satisfactory and I would rather have
link |
00:52:17.680
a bit of both at both stages.
link |
00:52:20.200
For all computer vision problems, not just segmentation.
link |
00:52:25.440
And the question that you can ask is, so for me, I'm inspired a lot by human vision and
link |
00:52:30.360
I care about that.
link |
00:52:31.880
You could be just a hard boiled engineer and not give a damn.
link |
00:52:35.560
So to you, I would then argue that you would need far less training data if you could make
link |
00:52:41.960
my research agenda fruitful.
link |
00:52:45.920
Okay, so then maybe taking a step into segmentation, static scene understanding.
link |
00:52:54.120
What is the interaction between segmentation and recognition?
link |
00:52:57.400
You mentioned the movement of objects.
link |
00:53:00.800
So for people who don't know computer vision, segmentation is this weird activity that computer
link |
00:53:07.680
vision folks have all agreed is very important of drawing outlines around objects versus
link |
00:53:15.220
a bounding box and then classifying that object.
link |
00:53:21.920
What's the value of segmentation?
link |
00:53:23.660
What is it as a problem in computer vision?
link |
00:53:27.320
How is it fundamentally different from detection recognition and the other problems?
link |
00:53:31.720
Yeah, so I think, so segmentation enables us to say that some set of pixels are an object
link |
00:53:41.760
without necessarily even being able to name that object or knowing properties of that
link |
00:53:47.120
object.
link |
00:53:48.120
Oh, so you mean segmentation purely as the act of separating an object.
link |
00:53:55.000
From its background.
link |
00:53:56.000
An object that's united in some way, separate from its background.
link |
00:54:01.120
Yeah, so entitification, if you will, making an entity out of it.
link |
00:54:05.760
Entitification, beautifully termed.
link |
00:54:09.280
So I think that we have that capability and that enables us to, as we are growing up,
link |
00:54:17.820
to acquire names of objects with very little supervision.
link |
00:54:23.760
So suppose the child, let's posit that the child has this ability to separate out objects
link |
00:54:28.720
in the world.
link |
00:54:30.080
Then when the mother says, pick up your bottle or the cat's behaving funny today, the word
link |
00:54:42.160
cat suggests some object and then the child sort of does the mapping, right?
link |
00:54:47.740
The mother doesn't have to teach specific object labels by pointing to them.
link |
00:54:55.000
Weak supervision works in the context that you have the ability to create objects.
link |
00:55:01.600
So I think that, so to me, that's a very fundamental capability.
link |
00:55:07.800
There are applications where this is very important, for example, medical diagnosis.
link |
00:55:13.180
So in medical diagnosis, you have some brain scan, I mean, this is some work that we did
link |
00:55:20.180
in my group where you have CT scans of people who have had traumatic brain injury and what
link |
00:55:26.960
the radiologist needs to do is to precisely delineate various places where there might
link |
00:55:32.680
be bleeds, for example, and there are clear needs like that.
link |
00:55:39.840
So there are certainly very practical applications of computer vision where segmentation is necessary,
link |
00:55:46.360
but philosophically segmentation enables the task of recognition to proceed with much weaker
link |
00:55:54.980
supervision than we require today.
link |
00:55:58.000
And you think of segmentation as this kind of task that takes on a visual scene and breaks
link |
00:56:03.960
it apart into interesting entities that might be useful for whatever the task is.
link |
00:56:11.840
Yeah.
link |
00:56:12.840
And it is not semantics free.
link |
00:56:14.760
So I think, I mean, it blends into, it involves perception and cognition.
link |
00:56:22.080
It is not, I think the mistake that we used to make in the early days of computer vision
link |
00:56:28.440
was to treat it as a purely bottom up perceptual task.
link |
00:56:32.520
It is not just that because we do revise our notion of segmentation with more experience,
link |
00:56:41.000
right?
link |
00:56:42.000
Because for example, there are objects which are nonrigid like animals or humans.
link |
00:56:47.320
And I think understanding that all the pixels of a human are one entity is actually quite
link |
00:56:53.280
a challenge because the parts of the human, they can move independently and the human
link |
00:56:59.400
wears clothes, so they might be differently colored.
link |
00:57:02.800
So it's all sort of a challenge.
link |
00:57:05.600
You mentioned the three R's of computer vision are recognition, reconstruction and reorganization.
link |
00:57:12.280
Can you describe these three R's and how they interact?
link |
00:57:15.760
Yeah.
link |
00:57:16.840
So recognition is the easiest one because that's what I think people generally think
link |
00:57:24.240
of as computer vision achieving these days, which is labels.
link |
00:57:30.520
So is this a cat?
link |
00:57:31.600
Is this a dog?
link |
00:57:32.640
Is this a chihuahua?
link |
00:57:35.160
I mean, you know, it could be very fine grained like, you know, specific breed of a dog or
link |
00:57:41.080
a specific species of bird, or it could be very abstract like animal.
link |
00:57:47.080
But given a part of an image or a whole image, say put a label on it.
link |
00:57:51.880
Yeah.
link |
00:57:52.880
That's recognition.
link |
00:57:54.440
Reconstruction is essentially, you can think of it as inverse graphics.
link |
00:58:03.440
I mean, that's one way to think about it.
link |
00:58:07.160
So in graphics, you have an internal computer representation
link |
00:58:14.760
of some objects arranged in a scene.
link |
00:58:17.440
And what you do is you produce a picture, you produce the pixels corresponding to a
link |
00:58:22.080
rendering of that scene.
link |
00:58:24.560
So let's do the inverse of this.
link |
00:58:28.840
We are given an image and we try to, we say, oh, this image arises from some objects in
link |
00:58:38.480
a scene looked at with a camera from this viewpoint.
link |
00:58:41.960
And we might have more information about the objects like their shape, maybe their textures,
link |
00:58:47.520
maybe, you know, color, et cetera, et cetera.
link |
00:58:51.720
So that's the reconstruction problem.
link |
00:58:53.320
In a way, you are in your head creating a model of the external world.
link |
00:59:00.200
Right.
link |
00:59:01.200
Okay.
link |
00:59:02.200
Reorganization is to do with essentially finding these entities.
link |
00:59:09.240
So it's organization, the word organization implies structure.
link |
00:59:15.600
So that in perception, in psychology, we use the term perceptual organization.
link |
00:59:22.760
An image is not internally represented
link |
00:59:30.980
as just a collection of pixels, but we make these entities.
link |
00:59:34.800
We create these entities, objects, whatever you want to call it.
link |
00:59:38.120
And the relationship between the entities as well, or is it purely about the entities?
link |
00:59:42.400
It could be about the relationships, but mainly we focus on the fact that there are entities.
link |
00:59:47.160
Okay.
link |
00:59:48.160
So I'm trying to pinpoint what the organization means.
link |
00:59:52.440
So organization is that instead of like a uniform grid, we have this structure of objects.
link |
01:00:02.120
So segmentation is a small part of that.
link |
01:00:05.400
So segmentation gets us going towards that.
link |
01:00:09.000
Yeah.
link |
01:00:10.120
And you kind of have this triangle where they all interact together.
link |
01:00:13.560
Yes.
link |
01:00:14.560
So how do you see that interaction in sort of reorganization is yes, finding the entities
link |
01:00:23.560
in the world.
link |
01:00:25.200
The recognition is labeling those entities and then reconstruction is what filling in
link |
01:00:32.720
the gaps.
link |
01:00:33.720
Well, for example, say, impute some 3D objects corresponding to each of these entities.
link |
01:00:43.280
That would be part of it.
link |
01:00:44.280
So adding more information that's not there in the raw data.
link |
01:00:48.400
Correct.
link |
01:00:49.400
I mean, I started pushing this kind of a view in the, around 2010 or something like that.
link |
01:00:58.260
Because at that time in computer vision, people were just working
link |
01:01:06.360
on many different problems, but they treated each of them as a separate isolated problem
link |
01:01:11.360
each with its own data set.
link |
01:01:13.880
And then you try to solve that and get good numbers on it.
link |
01:01:17.040
So I wasn't, I didn't like that approach because I wanted to see the connection between these.
link |
01:01:23.840
And if people divided up vision into various modules, the way they would do it
link |
01:01:30.640
is as low level, mid level and high level vision corresponding roughly to the psychologist's
link |
01:01:36.720
notion of sensation, perception and cognition.
link |
01:01:40.180
And that didn't map to tasks that people cared about.
link |
01:01:45.160
Okay.
link |
01:01:46.160
So therefore I tried to promote this particular framework as a way of considering the problems
link |
01:01:52.380
that people in computer vision were actually working on and trying to be more explicit
link |
01:01:58.180
about the fact that they actually are connected to each other.
link |
01:02:02.440
And I was at that time just doing this on the basis of information flow.
link |
01:02:07.400
Now it turns out in the last five years or so in the post, the deep learning revolution
link |
01:02:17.180
that this, this architecture has turned out to be very conducive to that.
link |
01:02:25.000
Because basically in these neural networks, we are trying to build multiple representations.
link |
01:02:33.040
They can be multiple output heads sharing common representations.
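The architecture described here, several task-specific output heads reading one shared representation, can be sketched in a few lines of NumPy; the layer sizes and the two heads are made up purely for illustration, standing in for a deep trunk:

```python
import numpy as np

rng = np.random.default_rng(0)

def backbone(x, W):
    # Shared trunk: one hidden layer with ReLU, standing in for a deep network.
    return np.maximum(0.0, x @ W)

# Made-up sizes: 8-dim input, 16-dim shared representation.
W_shared = rng.normal(size=(8, 16))
W_seg = rng.normal(size=(16, 4))  # head 1, e.g. a segmentation-like output
W_cls = rng.normal(size=(16, 3))  # head 2, e.g. classification logits

x = rng.normal(size=(5, 8))       # batch of 5 inputs
h = backbone(x, W_shared)         # computed once, shared by both heads

seg_out = h @ W_seg               # each head reads the same representation
cls_out = h @ W_cls
print(seg_out.shape, cls_out.shape)  # (5, 4) (5, 3)
```

The key point is that `h` is computed once and reused, so training either head shapes the representation the other head sees.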
link |
01:02:37.280
So in a certain sense today, given the reality of what solutions people have to this, I do
link |
01:02:46.240
not need to preach this anymore.
link |
01:02:48.320
It is, it is just there.
link |
01:02:50.720
It's part of the solution space.
link |
01:02:52.600
So speaking of neural networks, how much of this problem of computer vision, of reorganization,
link |
01:03:02.280
recognition, and reconstruction,
link |
01:03:09.280
How much of it can be learned end to end, do you think?
link |
01:03:12.800
Sort of set it and forget it.
link |
01:03:17.160
Just plug and play, have a giant data set, multiple, perhaps multimodal, and then just
link |
01:03:23.160
learn the entirety of it.
link |
01:03:25.680
Well, so I think that what end to end learning means nowadays is end
link |
01:03:31.440
to end supervised learning.
link |
01:03:34.360
And that I would argue is too narrow a view of the problem.
link |
01:03:38.360
I like this child development view, this lifelong learning view, one where there are certain
link |
01:03:46.440
capabilities that are built up and then there are certain capabilities which are built up
link |
01:03:51.720
on top of that.
link |
01:03:53.320
So that's what I believe in.
link |
01:03:58.700
So I think end to end learning in the supervised setting for a very precise task to me is kind
link |
01:04:13.080
of a limited view of the learning process.
link |
01:04:17.560
Got it.
link |
01:04:18.660
So if we think about beyond purely supervised, looking back to children, you mentioned six
link |
01:04:25.500
lessons that we can learn from children of be multimodal, be incremental, be physical,
link |
01:04:33.400
explore, be social, use language.
link |
01:04:36.520
Can you speak to these, perhaps picking one that you find most fundamental to our time
link |
01:04:42.280
today?
link |
01:04:43.280
Yeah.
link |
01:04:44.280
So I mean, I should say to give a due credit, this is from a paper by Smith and Gasser.
link |
01:04:50.120
And it reflects essentially, I would say common wisdom among child development people.
link |
01:05:00.000
It's just that this is not common wisdom among people in computer vision and AI and machine
link |
01:05:07.040
learning.
link |
01:05:08.040
So I view my role as trying to bridge the two worlds.
link |
01:05:15.920
So let's take an example of a multimodal.
link |
01:05:18.960
I like that.
link |
01:05:20.160
So multimodal, a canonical example is a child interacting with an object.
link |
01:05:28.840
So then the child holds a ball and plays with it.
link |
01:05:32.600
So at that point, it's getting a touch signal.
link |
01:05:35.720
So the touch signal is getting the notion of 3D shape, but it is sparse.
link |
01:05:44.120
And then the child is also seeing a visual signal.
link |
01:05:48.320
And these two signals are in totally different spaces.
link |
01:05:52.640
So one is the space of receptors on the skin of the fingers and the thumb and the palm.
link |
01:05:59.660
And then these map onto neuronal fibers which get activated somewhere.
link |
01:06:06.460
These lead to some activation in somatosensory cortex.
link |
01:06:10.360
I mean, a similar thing will happen if we have a robot hand.
link |
01:06:15.800
And then we have the pixels corresponding to the visual view, but we know that they
link |
01:06:20.440
correspond to the same object.
link |
01:06:24.440
So that's a very, very strong cross calibration signal.
link |
01:06:28.920
And it is self supervisory, which is beautiful.
link |
01:06:32.520
There's nobody assigning a label.
link |
01:06:34.000
The mother doesn't have to come and assign a label.
link |
01:06:37.880
The child doesn't even have to know that this object is called a ball.
link |
01:06:42.760
That the child is learning something about the three dimensional world from this signal.
link |
01:06:49.600
I think on tactile and visual there is some work; there is a lot of work currently
link |
01:06:54.880
on audio and visual.
link |
01:06:57.960
And audio visual, so there is some event that happens in the world and that event has a
link |
01:07:02.600
visual signature and it has an auditory signature.
link |
01:07:07.200
So there is this glass bowl on the table and it falls and breaks and I hear the smashing
link |
01:07:12.020
sound and I see the pieces of glass.
link |
01:07:14.200
Okay, I've built that connection between the two, right?
link |
01:07:19.520
We have people, I mean, this has become a hot topic in computer vision in the last couple
link |
01:07:24.280
of years.
link |
01:07:26.120
There are problems like separating out multiple speakers, right?
link |
01:07:32.560
Which was a classic problem in audition.
link |
01:07:35.460
They call this the problem of source separation or the cocktail party effect and so on.
link |
01:07:40.680
But when you also have the visual signal, it becomes so much easier and so much
link |
01:07:47.560
more useful.
link |
01:07:50.640
So the multimodal, I mean, there's so much more signal with multimodal and you can use
link |
01:07:56.680
that for some kind of weak supervision as well.
link |
01:08:00.240
Yes, because they are occurring at the same time in time.
link |
01:08:03.220
So you have time which links the two, right?
link |
01:08:06.220
So at a certain moment, T1, you've got a certain signal in the auditory domain and a certain
link |
01:08:10.840
signal in the visual domain, but they must be causally related.
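The time-linking idea described here can be sketched as follows: features that co-occur at the same moment score higher than pairs torn apart in time, and that gap is the free supervisory signal. The feature model below is a toy stand-in (the audio stream is just a noisy copy of the visual one), not any particular published method:

```python
import numpy as np

rng = np.random.default_rng(1)

T, d = 50, 8
# Toy features: the same underlying events drive both modalities, so
# time-aligned pairs share structure while misaligned pairs do not.
visual = rng.normal(size=(T, d))
audio = visual + 0.1 * rng.normal(size=(T, d))  # same events, different "sensor"

def mean_cosine(a, v):
    # Average cosine similarity between paired time steps; higher = better aligned.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    return float(np.mean(np.sum(a * v, axis=1)))

aligned = mean_cosine(audio, visual)                       # pairs from the same moment
shuffled = mean_cosine(audio, visual[rng.permutation(T)])  # pairs torn apart in time
print(aligned > shuffled)  # True: co-occurrence in time is the training signal
```

A self-supervised model would be trained to widen exactly this aligned-versus-shuffled gap, with no human labels involved.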
link |
01:08:14.520
Yeah, that's an exciting area.
link |
01:08:16.640
Not well studied yet.
link |
01:08:17.640
Yeah, I mean, we have a little bit of work on this, but so much more needs to be done.
link |
01:08:25.540
So this is a good example.
link |
01:08:28.220
Be physical, that's to do with the thing we talked about earlier, that there's
link |
01:08:34.040
an embodied world.
link |
01:08:36.560
You mentioned language, use language.
link |
01:08:39.440
So Noam Chomsky believes that language may be at the core of cognition, at the core of
link |
01:08:44.160
everything in the human mind.
link |
01:08:46.480
What is the connection between language and vision to you?
link |
01:08:50.760
What's more fundamental?
link |
01:08:51.920
Are they neighbors?
link |
01:08:53.440
Is one the parent and the child, the chicken and the egg?
link |
01:08:58.000
Oh, it's very clear.
link |
01:08:59.000
It is vision, which is the parent.
link |
01:09:00.560
Which is the fundamental ability, okay.
link |
01:09:07.680
It comes before. So you think vision is more fundamental than language.
link |
01:09:11.640
Correct.
link |
01:09:12.640
And you can think of it either in phylogeny or in ontogeny.
link |
01:09:18.240
So phylogeny means if you look at evolutionary time, right?
link |
01:09:22.320
So we have vision that developed 500 million years ago, okay.
link |
01:09:27.160
Then something like when we get to maybe like five million years ago, you have the first
link |
01:09:33.040
bipedal primate.
link |
01:09:34.400
So when we started to walk, then the hands became free.
link |
01:09:38.920
And so then manipulation, the ability to manipulate objects and build tools and so on and so forth.
link |
01:09:45.160
So you said 500,000 years ago?
link |
01:09:47.520
No, sorry.
link |
01:09:48.520
The first multicellular animals, which you can say had some intelligence arose 500 million
link |
01:09:56.720
years ago.
link |
01:09:57.720
Million.
link |
01:09:58.720
Okay.
link |
01:09:59.720
And now let's fast forward to say the last seven million years, which is the development
link |
01:10:05.680
of the hominid line, right, where from the other primates, we have the branch which leads
link |
01:10:10.560
on to modern humans.
link |
01:10:12.840
Now there are many of these hominids, but the one which, you know, people talk about is
link |
01:10:21.680
Lucy, because that's like a skeleton from three million years ago.
link |
01:10:25.080
And we know that Lucy walked, okay.
link |
01:10:28.600
So at this stage the hand is free for manipulating objects, and then the
link |
01:10:34.360
ability to manipulate objects and build tools developed, and the brain size grew in this era.
link |
01:10:43.520
So okay, so now you have manipulation.
link |
01:10:46.140
Now we don't know exactly when language arose.
link |
01:10:49.660
But after that.
link |
01:10:50.660
Because no apes have it. I mean, Chomsky is correct in that it is a uniquely
link |
01:10:57.760
human capability, and other primates don't have that.
link |
01:11:04.440
So it developed somewhere in this era, but I would argue that
link |
01:11:12.040
it probably developed after we had this stage, the human species already
link |
01:11:19.520
able to manipulate, hands free, with much bigger brain size.
link |
01:11:25.440
And for that, a lot of vision had to have already developed.
link |
01:11:31.720
So the sensation and the perception may be some of the cognition.
link |
01:11:35.800
Yeah.
link |
01:11:36.800
So these ancestors
link |
01:11:45.800
of ours, you know, three, four million years ago, they had spatial intelligence.
link |
01:11:53.360
So they knew that the world consists of objects.
link |
01:11:56.240
They knew that the objects were in certain relationships to each other.
link |
01:11:59.720
They had observed causal interactions among objects.
link |
01:12:05.280
They could move in space.
link |
01:12:06.500
So they had space and time and all of that.
link |
01:12:09.000
So language builds on that substrate.
link |
01:12:13.120
So language has a lot of, I mean, all human languages have constructs
link |
01:12:19.800
which depend on a notion of space and time.
link |
01:12:22.840
Where did that notion of space and time come from?
link |
01:12:26.920
It had to come from perception and action in the world we live in.
link |
01:12:30.960
Yeah.
link |
01:12:31.960
Well, you've referred to the spatial intelligence.
link |
01:12:33.560
Yeah.
link |
01:12:34.560
Yeah.
link |
01:12:35.560
So to linger a little bit, we'll mention Turing and his mention of, we should learn from
link |
01:12:42.960
children.
link |
01:12:43.960
Nevertheless, language is the fundamental piece of the test of intelligence that Turing
link |
01:12:49.360
proposed.
link |
01:12:50.360
Yes.
link |
01:12:51.360
What do you think is a good test of intelligence?
link |
01:12:53.840
Are you, what would impress the heck out of you?
link |
01:12:56.480
Is it fundamentally natural language or is there something in vision?
link |
01:13:02.800
I don't think we should have a single test of intelligence.
link |
01:13:10.160
So just like I don't believe in IQ as a single number, I think generally there can be many
link |
01:13:17.200
capabilities which are correlated perhaps.
link |
01:13:21.920
So I think that there will be accomplishments which are visual accomplishments,
link |
01:13:28.920
accomplishments which are accomplishments in manipulation or robotics, and then accomplishments
link |
01:13:36.000
in language.
link |
01:13:37.000
But I do believe that language will be the hardest nut to crack.
link |
01:13:40.400
Really?
link |
01:13:41.400
Yeah.
link |
01:13:42.400
So what's harder, to pass the spirit of the Turing test, or whatever formulation would
link |
01:13:46.840
make it convincing natural language, like somebody you would want to
link |
01:13:52.000
have a beer with, hang out and have a chat with, or general natural scene understanding?
link |
01:13:59.340
You think language is the tougher problem?
link |
01:14:01.440
I'm not a fan of the Turing test. I think Turing, as he proposed
link |
01:14:09.080
the test in 1950, was trying to solve a certain problem.
link |
01:14:13.840
Yeah, imitation.
link |
01:14:14.840
Yeah.
link |
01:14:15.840
And, and I think it made a lot of sense then.
link |
01:14:18.240
Where we are today, 70 years later, I think we should not worry about that.
link |
01:14:26.720
I think the Turing test is no longer the right way to channel research in AI, because
link |
01:14:34.620
it takes us down this path of a chatbot which can fool us for five minutes or whatever.
link |
01:14:39.720
Okay.
link |
01:14:40.720
I think I would rather have a list of 10 different tasks.
link |
01:14:44.400
I mean, there are tasks in the manipulation domain, tasks
link |
01:14:50.720
in navigation, tasks in visual scene understanding, tasks in reading a story and answering questions
link |
01:14:58.120
based on that.
link |
01:14:59.120
I mean, so my favorite language understanding task would be, you know, reading a novel and
link |
01:15:05.520
being able to answer arbitrary questions from it.
link |
01:15:08.560
Okay.
link |
01:15:09.560
Right.
link |
01:15:10.560
I think that to me, and this is not an exhaustive list by any means.
link |
01:15:15.800
So I think that's where we need to be going.
link |
01:15:21.120
And each of these, on each of these axes, there's a fair amount of work to be done.
link |
01:15:26.120
So on the visual understanding side, in this intelligence Olympics that we've set up, what's
link |
01:15:31.240
a good test, one of many, of visual scene understanding?
link |
01:15:39.840
Do you think such benchmarks exist?
link |
01:15:41.320
Sorry to interrupt.
link |
01:15:42.320
No, there aren't any.
link |
01:15:43.680
I think, essentially, to me a really good test would be a really good aid to the blind.
link |
01:15:50.920
So suppose there was a blind person and I needed to assist the blind person.
link |
01:15:57.160
So ultimately, like we said, vision that aids in the action in a survival in this world,
link |
01:16:05.840
maybe in the simulated world.
link |
01:16:09.000
Maybe it's easier to measure performance in a simulated world, but what we are ultimately after is performance
link |
01:16:15.280
in the real world.
link |
01:16:17.680
So David Hilbert in 1900 proposed 23 open problems in mathematics, some of which are
link |
01:16:23.920
still unsolved, most important, famous of which is probably the Riemann hypothesis.
link |
01:16:29.400
You've thought about and presented about the Hilbert problems of computer vision.
link |
01:16:33.240
So let me ask, what do you think today? I don't know when you last presented that,
link |
01:16:38.960
in 2015 perhaps, but versions of it. You're kind of the face and the spokesperson for computer
link |
01:16:44.000
vision.
link |
01:16:45.000
It's your job to state what the open problems are for the field.
link |
01:16:51.840
So what today are the Hilbert problems of computer vision, do you think?
link |
01:16:56.560
Let me pick one which I regard as clearly unsolved, which is what I would call long
link |
01:17:05.760
form video understanding.
link |
01:17:08.280
So we have a video clip and we want to understand the behavior in there in terms of agents,
link |
01:17:20.840
their goals, intentionality and make predictions about what might happen.
link |
01:17:30.600
So that kind of understanding goes beyond atomic visual actions.
link |
01:17:37.120
So in the short range, the question is, are you sitting, are you standing, are you catching
link |
01:17:41.800
a ball?
link |
01:17:44.080
That we can do now, or even if we can't do it fully accurately, if we can do it at 50%,
link |
01:17:50.400
maybe next year we'll do it at 65% and so forth.
link |
01:17:54.000
But I think the long range video understanding, I don't think we can do today.
link |
01:18:01.800
And it blends into cognition, that's the reason why it's challenging.
link |
01:18:06.920
So you have to understand the entities,
link |
01:18:11.280
you have to track them and you have to have some kind of model of their behavior.
link |
01:18:16.960
Correct.
link |
01:18:17.960
And their behavior might be, these are agents, so they are not just like passive objects,
link |
01:18:24.080
but they're agents, so therefore they would exhibit goal directed behavior.
link |
01:18:29.760
Okay, so this is one area.
link |
01:18:32.580
Then I will talk about understanding the world in 3D.
link |
01:18:37.120
This may seem paradoxical because in a way we have been able to do 3D understanding even
link |
01:18:43.020
like 30 years ago, right?
link |
01:18:45.840
But I don't think we currently have the richness of 3D understanding in our computer vision
link |
01:18:51.600
system that we would like.
link |
01:18:55.440
So let me elaborate on that a bit.
link |
01:18:57.560
So currently we have two kinds of techniques which are not fully unified.
link |
01:19:03.340
So there are the techniques from multi view geometry, where you have multiple pictures
link |
01:19:08.080
of a scene and you do a reconstruction using stereoscopic vision or structure from motion.
link |
01:19:14.660
But these techniques totally fail if you just have a single view, because
link |
01:19:21.520
they are relying on this multiple view geometry.
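The multi-view reconstruction referred to here can be illustrated with classic two-view linear (DLT) triangulation: with two camera matrices, the two rays pin down the 3D point, whereas a single view would leave depth undetermined. The cameras and the point below are invented for the example:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    # Linear (DLT) triangulation: each view contributes two rows of A @ X = 0.
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                 # null vector = homogeneous 3D point
    return X[:3] / X[3]

def project(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]        # perspective division to image coordinates

# Two invented cameras: one at the origin, one translated along x (a stereo pair).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

X_true = np.array([0.5, 0.2, 4.0])
x1, x2 = project(P1, X_true), project(P2, X_true)
X_est = triangulate(P1, P2, x1, x2)
print(np.allclose(X_est, X_true))  # True: two views pin the point down
```

With only `x1` and `P1`, any point along the ray through `x1` projects identically, which is exactly why these geometric techniques need multiple views.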
link |
01:19:25.680
Okay, then we have some techniques that we have developed in the computer vision community
link |
01:19:30.240
which try to guess 3D from single views.
link |
01:19:34.440
And these techniques are based on supervised learning, and they are based on having, at training
link |
01:19:41.780
time, 3D models of objects available.
link |
01:19:46.020
And this is completely unnatural supervision, right?
link |
01:19:50.080
That's not, CAD models are not injected into your brain.
link |
01:19:54.000
Okay, so what would I like?
link |
01:19:56.120
What I would like would be a notion of 3D learned as you move around the world.
link |
01:20:06.360
So we have a succession of visual experiences, and as part of that I might
link |
01:20:19.200
see a chair from different viewpoints or a table from different viewpoints and so on.
link |
01:20:24.880
That enables me to build some internal representation.
link |
01:20:31.320
And then next time I just see a single photograph and it may not even be of that chair, it's
link |
01:20:37.260
of some other chair.
link |
01:20:38.960
And I have a guess of what its 3D shape is like.
link |
01:20:42.040
So you're almost learning the CAD model, kind of.
link |
01:20:45.680
Yeah, implicitly.
link |
01:20:46.680
Implicitly.
link |
01:20:47.680
I mean, the CAD model need not be in the same form as used by computer graphics programs.
link |
01:20:52.600
Hidden in the representation.
link |
01:20:53.880
It's hidden in the representation, the ability to predict new views.
link |
01:20:58.240
And what I would see if I went to such and such position.
link |
01:21:04.320
By the way, on a small tangent on that, are you okay or comfortable with neural networks
link |
01:21:14.360
that do achieve visual understanding, that, for example, achieve this kind of 3D understanding,
link |
01:21:19.200
and you don't know how they work, you're not able to visualize
link |
01:21:27.600
or understand or interact with the representation.
link |
01:21:31.120
So the fact that they're not or may not be explainable.
link |
01:21:34.960
Yeah, I think that's fine.
link |
01:21:38.400
To me that is, so let me put some caveats on that.
link |
01:21:44.540
So it depends on the setting.
link |
01:21:46.460
So first of all, I think humans are not explainable.
link |
01:21:55.600
So that's a really good point.
link |
01:21:57.120
One human to another human is not fully explainable.
link |
01:22:02.680
I think there are settings where explainability matters and these might be, for example, questions
link |
01:22:10.880
on medical diagnosis.
link |
01:22:13.520
So I'm in a setting where maybe the doctor, maybe a computer program has made a certain
link |
01:22:19.400
diagnosis and then depending on the diagnosis, perhaps I should have treatment A or treatment
link |
01:22:25.840
B, right?
link |
01:22:28.120
So now, is the computer program's diagnosis based on data which was collected
link |
01:22:38.720
for American males who are in their 30s and 40s and maybe not so relevant to me.
link |
01:22:45.500
Maybe it is relevant, you know, et cetera, et cetera.
link |
01:22:48.560
I mean, in medical diagnosis, we have major issues to do with the reference class.
link |
01:22:53.560
So we may have acquired statistics from one group of people and applying it to a different
link |
01:22:58.680
group of people who may not share all the same characteristics.
link |
01:23:02.880
The data might have, there might be error bars in the prediction.
link |
01:23:07.600
So that prediction should really be taken with a huge grain of salt.
link |
01:23:14.120
But this has an impact on what treatments should be picked, right?
link |
01:23:20.400
So there are settings where I want to know more than just, this is the answer.
link |
01:23:26.800
But what I acknowledge is that, in that sense, explainability and interpretability
link |
01:23:33.840
may matter.
link |
01:23:34.840
It's about giving error bounds and a better sense of the quality of the decision.
link |
01:23:40.840
Where I'm willing to sacrifice interpretability is that I believe that there can be systems
link |
01:23:50.000
which can be highly performant, but which are internally black boxes.
link |
01:23:56.200
And that seems to be where it's headed.
link |
01:23:57.880
Some of the best performing systems are essentially black boxes, fundamentally by their construction.
link |
01:24:04.200
You and I are black boxes to each other.
link |
01:24:06.360
Yeah.
link |
01:24:07.360
So the nice thing is, we ourselves are black boxes, but
link |
01:24:13.960
we're also, those of us who are charming are able to convince others, like explain the
link |
01:24:20.720
black box, what's going on inside, with narratives and stories.
link |
01:24:25.440
So in some sense, neural networks don't have to actually explain what's going on inside.
link |
01:24:31.480
They just have to come up with stories, real or fake, that convince you that they know what's
link |
01:24:37.080
going on.
link |
01:24:38.560
And I'm sure we can do that.
link |
01:24:39.880
We can create those stories, neural networks can create those stories.
link |
01:24:45.080
Yeah.
link |
01:24:46.080
And the transformer will be involved.
link |
01:24:50.040
Do you think we will ever build a system of human level or superhuman level intelligence?
link |
01:24:56.520
We've kind of defined what it takes to try to approach that, but do you think that's
link |
01:25:01.680
within our reach?
link |
01:25:02.680
The thing that we thought we could do, what Turing actually thought we could do by the year
link |
01:25:07.480
2000, right?
link |
01:25:09.480
What do you think we'll ever be able to do?
link |
01:25:11.200
So I think there are two answers here.
link |
01:25:12.880
One answer is, in principle, can we do this at some time?
link |
01:25:18.240
And my answer is yes.
link |
01:25:20.560
The second answer is a pragmatic one.
link |
01:25:23.640
Do you think we will be able to do it in the next 20 years or whatever?
link |
01:25:27.840
And to that my answer is no.
link |
01:25:30.400
So of course that's a wild guess.
link |
01:25:34.680
I think that, you know, Donald Rumsfeld is not a favorite person of mine, but one of
link |
01:25:40.800
his lines was very good, which is about known unknowns and unknown unknowns.
link |
01:25:48.280
So in the business we are in, there are known unknowns and we have unknown unknowns.
link |
01:25:55.040
So I think with respect to a lot of vision and robotics, I feel like
link |
01:26:04.800
we have known unknowns.
link |
01:26:06.960
So I have a sense of where we need to go and what the problems that need to be solved are.
link |
01:26:13.520
I feel with respect to natural language understanding and high level cognition, it's not just known
link |
01:26:21.320
unknowns, but also unknown unknowns.
link |
01:26:24.200
So it is very difficult to put any kind of a timeframe to that.
link |
01:26:30.920
Do you think some of the unknown unknowns might be positive in that they'll surprise
link |
01:26:36.360
us and make the job much easier?
link |
01:26:38.720
So fundamental breakthroughs?
link |
01:26:40.120
I think that is possible because certainly I have been very positively surprised by how
link |
01:26:45.680
effective these deep learning systems have been because I certainly would not have believed
link |
01:26:53.880
that in 2010.
link |
01:26:57.640
I think what we knew from the mathematical theory was that convex optimization works.
link |
01:27:06.160
When there's a single global optimum, then these gradient descent techniques would work.
link |
01:27:11.200
Now these are nonlinear, non-convex systems.
link |
01:27:16.240
Huge number of variables, so over-parametrized.
link |
01:27:18.680
And the people who used to play with them a lot, the ones who are totally immersed in
link |
01:27:26.680
the lore and the black magic, they knew that they worked well, even though they were...
link |
01:27:33.920
Really?
link |
01:27:34.920
I thought like everybody...
link |
01:27:35.920
No, the claim that I hear from my friends like Yann LeCun and so forth is that they
link |
01:27:43.200
feel that they were comfortable with them.
link |
01:27:45.960
But the community as a whole was certainly not.
link |
01:27:50.920
And I think to me that was the surprise that they actually worked robustly for a wide range
link |
01:27:59.820
of problems from a wide range of initializations and so on.
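The surprise described here, gradient descent behaving well on a non-convex loss, shows up even in a toy over-parametrized problem: writing a single weight as a product u*v makes the least-squares loss non-convex in (u, v), yet plain gradient descent still reaches the global minimum. This is only a sketch of the phenomenon, with small positive initializations chosen for simplicity:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy target: y = 3x. Writing the single weight as a product u*v makes the
# least-squares loss L(u, v) = mean((u*v*x - y)^2) non-convex in (u, v),
# yet plain gradient descent still finds the global minimum.
x = rng.normal(size=100)
y = 3.0 * x

def run(seed, steps=500, lr=0.05):
    r = np.random.default_rng(seed)
    u, v = r.uniform(0.5, 1.5, size=2)  # small positive initializations
    for _ in range(steps):
        err = u * v * x - y
        gu = np.mean(2 * err * v * x)   # dL/du
        gv = np.mean(2 * err * u * x)   # dL/dv
        u, v = u - lr * gu, v - lr * gv
    return np.mean((u * v * x - y) ** 2)

losses = [run(s) for s in range(5)]
print(all(loss < 1e-6 for loss in losses))  # True: every run reaches the optimum
```

The convex theory gives no guarantee here, since the loss surface has a whole curve of minima and a saddle at the origin, yet every run converges, a miniature of the larger empirical surprise.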
link |
01:28:04.960
And so that was certainly more rapid progress than we expected.
link |
01:28:13.720
But then there are certainly lots of times, in fact, most of the history of AI is when
link |
01:28:19.520
we have made less progress at a slower rate than we expected.
link |
01:28:24.060
So we just keep going.
link |
01:28:27.360
I think what I regard as really unwarranted are these fears of AGI in 10 years and 20
link |
01:28:39.600
years and that kind of stuff, because that's based on completely unrealistic models of
link |
01:28:44.880
how rapidly we will make progress in this field.
link |
01:28:48.800
So I agree with you, but I've also gotten the chance to interact with very smart people
link |
01:28:54.680
who really worry about existential threats of AI.
link |
01:28:57.840
And I, as an open minded person, am sort of taking it in.
link |
01:29:04.080
Do you think if AI systems in some way, the unknown unknowns, not super intelligent AI,
link |
01:29:12.920
but in ways we don't quite understand the nature of super intelligence, will have a
link |
01:29:18.080
detrimental effect on society?
link |
01:29:20.280
Do you think this is something we should be worried about or we need to first allow the
link |
01:29:25.920
unknown unknowns to become known unknowns?
link |
01:29:29.800
I think we need to be worried about AI today.
link |
01:29:32.960
I think that it is not just a worry we need to have when we get that AGI.
link |
01:29:38.240
I think that AI is being used in many systems today.
link |
01:29:43.360
And there might be settings, for example, when it causes biases or decisions which could
link |
01:29:49.800
be harmful.
link |
01:29:50.800
I mean, decisions which could be unfair to some people or it could be a self driving
link |
01:29:55.400
car which kills a pedestrian.
link |
01:29:57.740
So AI systems are being deployed today, right?
link |
01:30:02.000
And they're being deployed in many different settings, maybe in medical diagnosis, maybe
link |
01:30:05.440
in a self driving car, maybe in selecting applicants for an interview.
link |
01:30:10.000
So I would argue that when these systems make mistakes, there are consequences.
link |
01:30:18.320
And we are in a certain sense responsible for those consequences.
link |
01:30:22.760
So I would argue that this is a continuous effort.
link |
01:30:27.040
It is on us, and this is something that in a way is not so surprising.
link |
01:30:32.440
It's true of all engineering and scientific progress: with great power comes great responsibility.
link |
01:30:40.000
So as these systems are deployed, we have to worry about them and it's a continuous
link |
01:30:44.300
problem.
link |
01:30:45.300
I don't think of it as something which will suddenly happen on some day in 2079 for which
link |
01:30:51.680
I need to design some clever trick.
link |
01:30:54.880
I'm saying that these problems exist today and we need to be continuously on the lookout
link |
01:31:00.800
for worrying about safety, biases, risks, right?
link |
01:31:06.840
I mean, a self driving car can kill a pedestrian, and they have, right?
link |
01:31:11.600
I mean, this Uber incident in Arizona, right?
link |
01:31:16.080
It has happened, right?
link |
01:31:17.760
This is not about AGI.
link |
01:31:18.760
In fact, it's about a very dumb intelligence which is still killing people.
link |
01:31:23.880
The worry people have with AGI is the scale.
link |
01:31:28.480
But I think you're 100% right. The thing that worries me about AI today, and it's
link |
01:31:34.840
happening at a huge scale, is recommender systems, recommendation systems.
link |
01:31:39.320
So if you look at Twitter or Facebook or YouTube, they're controlling the ideas that we have
link |
01:31:47.600
access to, the news and so on.
link |
01:31:50.560
And that's a fundamental machine learning algorithm behind each of these recommendations.
link |
01:31:55.480
I mean, my life would not be the same without these sources of information.
link |
01:32:00.840
I'm a totally new human being and the ideas that I know are very much because of the internet,
link |
01:32:07.180
because of the algorithms that recommend those ideas.
link |
01:32:09.680
And so as they get smarter and smarter, I mean, the algorithm
link |
01:32:16.880
that's recommending the next YouTube video you should watch has control of millions,
link |
01:32:23.480
billions of people. That algorithm is already, in a sense, super intelligent and has
link |
01:32:30.160
control of the population, not complete, but very strong control.
link |
01:32:35.160
For now we can turn off YouTube, we can just go have a normal life outside of that.
link |
01:32:39.920
But the more and more that gets into our life, the more we start depending on
link |
01:32:46.760
that algorithm and on the different companies that are working on the algorithm.
link |
01:32:49.040
So I think it's, you're right, it's already there.
link |
01:32:53.000
And YouTube in particular is using computer vision, trying their hardest to understand
link |
01:32:59.760
the content of videos so they could be able to connect videos with the people who would
link |
01:33:05.680
benefit from those videos the most.
link |
01:33:08.080
And so that development could go in a bunch of different directions, some of which might
link |
01:33:12.860
be harmful.
link |
01:33:14.820
So yeah, you're right, the threats of AI are here already and we should be thinking about
link |
01:33:19.720
them.
link |
01:33:20.720
On a philosophical notion, if you could, personal perhaps, if you could relive a moment in
link |
01:33:29.200
your life outside of family because it made you truly happy or it was a profound moment
link |
01:33:36.280
that impacted the direction of your life, what moment would you go to?
link |
01:33:44.160
I don't think of single moments, but I look over the long haul.
link |
01:33:49.240
I feel that I've been very lucky because I feel that, I think that in scientific research,
link |
01:33:58.840
a lot of it is about being at the right place at the right time.
link |
01:34:03.720
And you can work on problems at a time when they're just too premature.
link |
01:34:10.680
You butt your head against them and nothing happens because the prerequisites for success
link |
01:34:18.440
are not there.
link |
01:34:19.840
And then there are times when you are in a field which is all pretty mature and you can
link |
01:34:25.500
only solve curlicues upon curlicues.
link |
01:34:30.020
I've been lucky to have been in this field for 34 years, well, actually 34 years
link |
01:34:36.920
as a professor at Berkeley, so longer than that in the field, which when I started was just
link |
01:34:44.600
like some little crazy, absolutely useless field which couldn't really do anything, to
link |
01:34:53.600
a time when it's really, really solving a lot of practical problems. It has offered a lot
link |
01:35:01.200
of tools for scientific research because computer vision is impactful for images in biology
link |
01:35:08.580
or astronomy and so on and so forth.
link |
01:35:12.160
And we have, so we have made great scientific progress which has had real practical impact
link |
01:35:18.180
in the world.
link |
01:35:19.400
And I feel lucky that I got in at a time when the field was very young and at a time when
link |
01:35:28.360
it is, it's now mature but not fully mature.
link |
01:35:34.120
It's mature but not done.
link |
01:35:35.600
I mean, it's really still in a productive phase.
link |
01:35:39.040
Yeah, I think people 500 years from now would laugh at you calling this field mature.
link |
01:35:45.680
That is very possible.
link |
01:35:46.680
Yeah.
link |
01:35:47.680
So, but you're also, lest I forget to mention, you've also mentored some of the biggest names
link |
01:35:53.860
of computer vision, computer science and AI today.
link |
01:35:59.200
So many questions I could ask, but really, what is it, how did you do it?
link |
01:36:04.560
What does it take to be a good mentor?
link |
01:36:06.760
What does it take to be a good guide?
link |
01:36:09.200
Yeah, I think what I feel, I've been lucky to have had very, very smart and hardworking
link |
01:36:17.640
and creative students.
link |
01:36:18.920
I think some part of the credit just belongs to being at Berkeley.
link |
01:36:25.600
Those of us who are at top universities are blessed because we have very, very smart and
link |
01:36:32.880
capable students coming in, knocking on our door.
link |
01:36:37.040
So I have to be humble enough to acknowledge that.
link |
01:36:40.440
But what have I added?
link |
01:36:41.960
I think I have added something.
link |
01:36:44.160
What I have added is, I think what I've always tried to teach them is a sense of picking
link |
01:36:52.360
the right problems.
link |
01:36:54.760
So I think that in science, in the short run, success is always based on technical competence.
link |
01:37:04.240
You're, you know, you're quick with math or you are whatever.
link |
01:37:09.080
I mean, there's certain technical capabilities which make for short range progress.
link |
01:37:15.640
Long range progress is really determined by asking the right questions and focusing on
link |
01:37:21.280
the right problems.
link |
01:37:23.280
And I feel that what I've been able to bring to the table in terms of advising these students
link |
01:37:31.320
is some sense of taste of what are good problems, what are problems that are worth attacking
link |
01:37:38.760
now as opposed to waiting 10 years.
link |
01:37:41.680
What's a good problem?
link |
01:37:42.720
If you could summarize, is that possible to even summarize, like what's your sense of
link |
01:37:47.320
a good problem?
link |
01:37:48.320
I think, I think I have a sense of what is a good problem, which is there is a British
link |
01:37:55.400
scientist, in fact, he won a Nobel Prize, Peter Medawar, who has a book on this.
link |
01:38:02.920
And basically he calls it, research is the art of the soluble.
link |
01:38:08.440
So we need to sort of find problems which are not yet solved, but which are approachable.
link |
01:38:18.440
And he sort of refers to this sense that there is this problem which isn't quite solved yet,
link |
01:38:25.080
but it has a soft underbelly.
link |
01:38:26.760
There is some place where you can, you know, spear the beast.
link |
01:38:32.800
And having that intuition that this problem is ripe is a good thing because otherwise
link |
01:38:39.160
you can just beat your head and not make progress.
link |
01:38:42.400
So I think that is important.
link |
01:38:45.840
So if I have that and if I can convey that to students, it's not just that they do great
link |
01:38:52.080
research while they're working with me, but that they continue to do great research.
link |
01:38:56.320
So in a sense, I'm proud of my students and their achievements and their great research
link |
01:39:01.200
even 20 years after they've ceased being my student.
link |
01:39:05.760
So it's in part developing, helping them develop that sense that a problem is not yet solved,
link |
01:39:11.440
but it's solvable.
link |
01:39:12.440
Correct.
link |
01:39:13.440
The other thing which I have, which I think I bring to the table, is a certain intellectual
link |
01:39:21.600
breadth.
link |
01:39:22.600
I've spent a fair amount of time studying psychology, neuroscience, relevant areas of
link |
01:39:29.320
applied math and so forth.
link |
01:39:31.320
So I can probably help them see some connections to disparate things, which they might not
link |
01:39:40.480
have otherwise.
link |
01:39:42.960
So the smart students coming into Berkeley can be very deep, they can think very deeply,
link |
01:39:50.440
meaning very hard down one particular path, but where I could help them is with the shallow
link |
01:39:58.520
breadth, while they have the narrow depth, and that's of some value.
link |
01:40:08.560
Well, it was beautifully refreshing just to hear you naturally jump from psychology back
link |
01:40:14.760
to computer science in this conversation, back and forth.
link |
01:40:18.520
That's actually a rare quality and I think it's certainly for students empowering to
link |
01:40:23.680
think about problems in a new way.
link |
01:40:25.600
So for that and for many other reasons, I really enjoyed this conversation.
link |
01:40:29.440
Thank you so much.
link |
01:40:30.440
It was a huge honor.
link |
01:40:31.440
Thanks for talking to me.
link |
01:40:32.440
It's been my pleasure.
link |
01:40:34.320
Thanks for listening to this conversation with Jitendra Malik and thank you to our sponsors,
link |
01:40:39.840
BetterHelp and ExpressVPN.
link |
01:40:43.120
Please consider supporting this podcast by going to betterhelp.com slash Lex and signing
link |
01:40:49.480
up at expressvpn.com slash LexPod.
link |
01:40:52.940
Click the links, buy the stuff.
link |
01:40:55.440
That's how they know I sent you and it really is the best way to support this podcast and
link |
01:41:00.720
the journey I'm on.
link |
01:41:02.360
If you enjoy this thing, subscribe on YouTube, review it with five stars on Apple podcast,
link |
01:41:07.520
support it on Patreon or connect with me on Twitter at Lex Fridman.
link |
01:41:12.280
Don't ask me how to spell that.
link |
01:41:13.360
I don't remember it myself.
link |
01:41:15.720
And now let me leave you with some words from Prince Mishkin in The Idiot by Dostoevsky.
link |
01:41:22.120
Beauty will save the world.
link |
01:41:24.760
Thank you for listening and hope to see you next time.