Vladimir Vapnik: Statistical Learning | Lex Fridman Podcast #5
The following is a conversation with Vladimir Vapnik. He's the co-inventor of support vector machines, support vector clustering, VC theory, and many foundational ideas in statistical learning. He was born in the Soviet Union and worked at the Institute of Control Sciences in Moscow. Then, in the United States, he worked at AT&T, NEC Labs, Facebook Research, and is now a professor at Columbia University. His work has been cited over 170,000 times. He has some very interesting ideas about artificial intelligence and the nature of learning, especially on the limits of our current approaches and the open problems in the field. This conversation is part of the MIT course on artificial general intelligence and the Artificial Intelligence podcast. If you enjoy it, please subscribe on YouTube, or rate it on iTunes or your podcast provider of choice, or simply connect with me on Twitter or other social networks at Lex Fridman, spelled F R I D. And now, here's my conversation with Vladimir Vapnik.
Einstein famously said that God doesn't play dice.

Yeah.

You have studied the world through the eyes of statistics. So let me ask you, in terms of the nature of reality, the fundamental nature of reality: does God play dice?

We don't know some factors. And because we don't know some factors, which could be important, it looks like God plays dice. But we should describe it. In philosophy, they distinguish between two positions: the position of instrumentalism, where you're creating a theory for prediction, and the position of realism, where you're trying to understand what God did.

Can you describe instrumentalism and realism a little bit?

For example, if you have some mechanical laws, what is that? Is it a law which is true always and everywhere? Or is it a law which allows you to predict the position of a moving element? It is what you believe. You believe that it is God's law, that God created a world which obeys this physical law. Or it is just a law for predictions.

And which one is instrumentalism?

For predictions. If you believe that this is a law of God, and it's always true everywhere, that means that you're a realist. So you're trying to really understand God's thought.
So the way you see the world is as an instrumentalist?

You know, I'm working with some models, models of machine learning. So in these models, we consider a setting, and we try to solve the problem of that setting. And you can do it in two different ways. From the point of view of the instrumentalist, and that's what everybody does now, they say that the goal of machine learning is to find the rule for classification. That is true, but it is an instrument for prediction. But I can say that the goal of machine learning is to learn about conditional probability: how God plays dice, and, if he plays, what is the probability for one, what is the probability for another, in a given situation. But for prediction, I don't need this; I need the rule. But for understanding, I need conditional probability.
So let me just step back a little bit first to talk about the paper you mentioned, which I read last night: parts of the 1960 paper by Eugene Wigner, "The Unreasonable Effectiveness of Mathematics in the Natural Sciences." Such a beautiful paper, by the way. It made me feel, to be honest, to confess, that in my own work in the past few years on deep learning, heavily applied, I was missing out on some of the beauty of nature that math can uncover. So let me just step away from the poetry of that for a second. How do you see the role of math in your life? Is it a tool? Is it poetry? Where does it sit? And does math, for you, have limits in what it can describe?

Some people say that math is the language which God uses.

Uses God?

So I believe that...

Speaks to God, or uses God, or...

Uses God.

Yeah.

So I believe that this article about the unreasonable effectiveness of math says that if you're looking at mathematical structures, they know something about reality. And most scientists from natural science are looking at equations and trying to understand reality. It is the same in machine learning: if you look very carefully at all the equations which define conditional probability, you can understand something about reality, more than from your fantasy.
So math can reveal the simple underlying principles of reality, perhaps.

You know what "simple" means? It is very hard to discover them. But then, when you discover them and look at them, you see how beautiful they are. And it is surprising why people did not see that before. You're looking at equations and deriving from equations. For example, I talked yesterday about the least squares method. And people had a lot of fantasy about how to improve the least squares method. But if you go step by step, solving some equations, you suddenly get some term which, after thinking, you understand describes the position of the observation points. In the least squares method, we throw out a lot of information: we don't look at the composition of the points of observation, we look only at residuals. But when you understand that, it's a very simple idea, though it's not too simple to understand. And you can derive this just from equations.
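Vapnik's least squares remark can be made concrete with a small sketch. This is an illustrative example of the general point, not his actual derivation: the leverage values that fall out of the normal equations depend only on the positions of the observation points, which is exactly the information that looking at residuals alone throws away.

```python
# Illustrative sketch (not Vapnik's derivation): in simple least squares,
# the leverage values h_i = 1/n + (x_i - mean)^2 / Sxx fall out of the
# normal equations. They depend only on the positions of the observation
# points x_i, the information that residuals alone discard.

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 10.0]   # note the isolated point at x = 10
ys = [2.1, 2.4, 3.1, 3.4, 4.2, 7.0]

n = len(xs)
mean_x = sum(xs) / n
sxx = sum((x - mean_x) ** 2 for x in xs)

# Least squares slope and intercept from the normal equations.
slope = sum((x - mean_x) * y for x, y in zip(xs, ys)) / sxx
intercept = sum(ys) / n - slope * mean_x

residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
leverages = [1 / n + (x - mean_x) ** 2 / sxx for x in xs]

# The isolated observation at x = 10 has by far the largest leverage,
# regardless of its y value.
print("leverages:", [round(h, 3) for h in leverages])
```

The leverages sum to the number of fitted parameters (here 2), and the residuals sum to zero: two of the "terms you suddenly get" when you solve the equations step by step.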
So some simple algebra, a few steps, will take you to something surprising that, when you think about it, you understand.

And that is a proof that human intuition is not too rich, and is very primitive. And it does not see very simple situations.

So let me take a step back. In general, yes. But what about human ingenuity, as opposed to intuition? Moments of brilliance. Do you have to be so hard on human intuition? Are there moments of brilliance in human intuition that can leap ahead of math, and then the math will catch up?
I don't think so. I think that the best human intuition is put into axioms, and then it is technical: see where the axioms take you. But they should take the axioms correctly, and the axioms are polished during generations of scientists. And this is integral wisdom.
That is beautifully put. But when you think of Einstein and special relativity, what is the role of imagination, coming first there in the moment of discovery of an idea? There is obviously a mix of math and out-of-the-box imagination there.

That I don't know. Whatever I did, I excluded any imagination. Because whatever I saw in machine learning that comes from imagination, like features, like deep learning, is not relevant to the problem. When you are looking very carefully at the mathematical equations, you are deriving a very simple theory, which goes far beyond, theoretically, whatever people can imagine. Because it is not good fantasy. It is just interpretation, it is just fantasy, but it is not what you need. You don't need any imagination to derive the main principle of machine learning.
When you think about learning and intelligence, maybe thinking about the human brain and trying to describe mathematically the process of learning, something like what happens in the human brain: do you think we have the tools currently? Do you think we will ever have the tools to try to describe that process of learning?

It is not a description of what is going on. It is an interpretation, your interpretation, and your vision can be wrong. You know, one man, Leeuwenhoek, invented the microscope for the first time. Only he had this instrument, and he kept the microscope secret. But he wrote reports to the London Academy of Science. In his reports, he looked everywhere: at the water, at the blood, at the sperm. And he described blood like a fight between a queen and a king. He saw blood cells, red cells, and he imagined that it was armies fighting each other. It was his interpretation of the situation. And he sent this report to the Academy of Science. They looked very carefully, because they believed that he was right. He saw something.

Yes.

But he gave the wrong interpretation. And I believe the same can happen with the brain.
With the brain, yeah.

The most important part. You know, I believe in human language. In some proverbs, there is so much wisdom. For example, people say that one day with a great teacher is better than a thousand days of diligent study. But if I ask you what the teacher does, nobody knows. And that is intelligence. But we know from history, and now from math and machine learning, that a teacher can do a lot.

So what, from a mathematical point of view, is the great teacher?

I don't know.

That's an open question.

No, but we can say what a teacher can do. He can introduce some invariants, some predicates for creating invariants. How does he do it? I don't know, because the teacher knows reality and can describe, from this reality, predicates, invariants. But he knows that when you use invariants, you can decrease the number of observations a hundred times.
So maybe try to pull that apart a little bit. I think you mentioned a piano teacher saying to the student, "play like a butterfly."

Yeah.

I play piano. I played guitar for a long time. Maybe it's romantic, poetic, but it feels like there's a lot of truth in that statement, like there is a lot of instruction in that statement. And so, can you pull that apart? What is that? The language itself may not contain this information.

It is not blah, blah, blah. It affects you.

It's what?

It affects you. It affects your playing.

Yes, it does, but it's not the playing. It feels like: what is the information being exchanged there? What is the nature of that information? What is the representation of that information?
I believe that it is a sort of predicate, but I don't know. That is exactly the intelligence part of machine learning.

Yes.

Because the rest is just mathematical technique. I think that what was discovered recently is that there are two mechanisms of learning. One is called the strong convergence mechanism, and one the weak convergence mechanism. Before, people used only one convergence. In the weak convergence mechanism, you can use predicates. That's what "play like a butterfly" is, and it will immediately affect your playing.
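The strong/weak distinction can be illustrated numerically. This is a hypothetical toy example, not one from the conversation, with sin(nx) standing in for the sequence of functions and one fixed smooth function standing in for a "predicate": f_n(x) = sin(nx) converges weakly to zero on [0, 2π], because its integral against any fixed test function shrinks, while its L2 norm stays at sqrt(π) for every n, so it never converges strongly to zero.

```python
import math

# Hypothetical illustration of weak vs. strong convergence, not Vapnik's
# example. f_n(x) = sin(n*x) on [0, 2*pi]: integrals against a fixed test
# function (a "predicate") shrink toward 0 as n grows (weak convergence to 0),
# but the L2 norm of f_n stays at sqrt(pi) (no strong convergence to 0).

def integrate(f, a, b, steps=20000):
    """Plain midpoint rule, accurate enough for these smooth integrands."""
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

predicate = lambda x: math.exp(-x)  # one fixed smooth test function

for n in (1, 10, 100):
    weak = integrate(lambda x: math.sin(n * x) * predicate(x), 0.0, 2 * math.pi)
    norm = math.sqrt(integrate(lambda x: math.sin(n * x) ** 2, 0.0, 2 * math.pi))
    print(f"n={n:3d}  <f_n, predicate> = {weak:+.4f}  ||f_n|| = {norm:.4f}")
```

The inner product with the predicate goes to zero while the norm does not move: matching a function against predicates is a genuinely weaker, and cheaper, requirement than matching the function itself.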
You know, there is a great English proverb: if it looks like a duck, swims like a duck, and quacks like a duck, then it is probably a duck.

Yes.

But this is exactly about predicates. "Looks like a duck": what does it mean? You saw many ducks; that is your training data. So you have an integral description of how ducks look.

Yeah. The visual characteristics of a duck.

Yeah. And you have a model for recognition. So you would like the theoretical description from the model to coincide with the empirical description which you saw in the training data. Now, "looks like a duck" is general. But what about "swims like a duck"? You should know that ducks swim. You could say "plays chess like a duck." Okay, a duck doesn't play chess, and it is a completely legal predicate, but it is useless. So a teacher can recognize which predicates are not useless.
So, up to now, we don't use these predicates in existing machine learning. That is why we need zillions of data. But in this English proverb, they use only three predicates: looks like a duck, swims like a duck, and quacks like a duck.

So you can't deny the fact that "swims like a duck" and "quacks like a duck" has humor in it, has ambiguity.

Let's talk about "swims like a duck." It doesn't say "jumps like a duck." Why?

Because...

It's not relevant.

But that means that you know ducks, you know different birds, you know animals. And you derive from this that it is relevant to say "swims like a duck."
So, underneath, in order for us to understand "swims like a duck," it feels like we need to know millions of other little pieces of information, which we pick up along the way. You don't think so? There doesn't need to be this knowledge base? Those statements carry some rich information that helps us understand the essence of a duck.

Yeah.

How far are we from integrating predicates?

You know, consider the complete theory of machine learning. What does it do? You have a lot of functions. And then you're talking: "it looks like a duck." You see your training data. From the training data, you recognize how the expected duck should look. Then you remove all functions which do not look the way you think it should look from the training data. So you decrease the set of functions from which you will pick one. Then you apply a second predicate, and again decrease the set of functions. And after that, you pick up the best function you can find. It is standard machine learning.
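The filtering procedure described above can be sketched with an entirely hypothetical toy problem: a pool of one-dimensional threshold classifiers, one predicate (an invariant the admissible functions must satisfy on the training data), and empirical risk minimization over the survivors.

```python
# Hypothetical sketch of predicate filtering: each predicate shrinks the
# admissible set of candidate functions before we pick the best one.
# Candidates here are 1-D threshold classifiers f_t(x) = [x >= t].

train = [(0.5, 0), (1.5, 0), (2.5, 1), (3.5, 1), (4.5, 1)]  # (x, label)

thresholds = [t / 10 for t in range(0, 60)]  # candidate pool of 60 functions

def predict(t, x):
    return 1 if x >= t else 0

def mean_prediction(t):
    return sum(predict(t, x) for x, _ in train) / len(train)

# Predicate (an invariant): the fraction of positives the function predicts
# on the training data should match the observed fraction of positive labels.
observed = sum(y for _, y in train) / len(train)
admissible = [t for t in thresholds
              if abs(mean_prediction(t) - observed) < 1e-9]

# Pick the best survivor by training error.
def error(t):
    return sum(predict(t, x) != y for x, y in train)

best = min(admissible, key=error)
print(f"{len(thresholds)} candidates -> {len(admissible)} admissible, "
      f"best t = {best}")
```

One predicate already cuts the pool sharply; a second predicate would cut the admissible set again before the final selection, which is the step-by-step shrinking Vapnik describes.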
So why do you need not too many examples?

Because your predicates aren't very good?

No, the predicates are very good, because every predicate is invented to decrease the admissible set of functions.

So you talk about the admissible set of functions, and you talk about good functions. What makes a good function?

The admissible set of functions is a set of functions which has small capacity, or small diversity (small VC dimension, for example), which contains a good function inside.

So, by the way, for people who don't know VC: you're the V in VC. How would you describe to a layperson what VC theory is? How would you describe VC?

When you have a machine, the machine is capable of picking up one function from the admissible set of functions. But the set of admissible functions can be big: say, it contains all continuous functions, and then it's useless; you don't have enough examples to pick up a function. But it can be small. Small, we call it capacity, but maybe it is better called diversity: not very different functions in the set. It can be an infinite set of functions, but not very diverse. Then it has small VC dimension, and when the VC dimension is small, you need a small amount of training data. So the goal is to create an admissible set of functions which has small VC dimension and contains a good function. Then you will be able to pick up the function using a small amount of observations.
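For a layperson-friendly illustration (a toy example, not from the conversation): the VC dimension of a function class is the size of the largest point set it can "shatter," that is, label in every possible way. A brute-force check shows that one-dimensional threshold functions shatter a single point but no pair of points, so their VC dimension is 1: a very small, not-very-diverse class.

```python
# Toy illustration of VC dimension: thresholds f_t(x) = [x >= t] on the line.
# A point set is "shattered" if every labeling of it is achieved by some
# threshold in the pool; the pool below is enough for these point sets.

def shattered(points, thresholds):
    labelings = {tuple(1 if x >= t else 0 for x in points) for t in thresholds}
    return len(labelings) == 2 ** len(points)

ts = [-1.0, 1.5, 2.5, 10.0]

print(shattered([2.0], ts))       # one point: both labels reachable -> True
print(shattered([1.0, 2.0], ts))  # two points: labeling (1, 0) impossible -> False
```

No threshold can label the left point 1 and the right point 0, so no two-point set is shattered: the class has VC dimension 1, which is why very little data suffices to pick a function from it.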
So that is the task of learning: creating a set of admissible functions that has a small VC dimension, and then you figure out a clever way of picking one up?

No, that is the goal of learning, which I formulated yesterday. Statistical learning theory does not involve creating the admissible set of functions. In classical learning theory, everywhere, in 100% of textbooks, the admissible set of functions is given. But this is science about nothing, because the most difficult problem is to create the admissible set of functions: given, say, a lot of functions, a continuum set of functions, create an admissible set of functions. That means that it has finite VC dimension, small VC dimension, and contains a good function. So this was out of consideration.

So what's the process of doing that? I mean, it's fascinating. What is the process of creating this admissible set of functions?

That is invariants.

That's invariants.

Yeah. You're looking at properties of the training data. Properties means that you have some function, and you just count what is the average value of the function on the training data. You have a model, and you ask what is the expectation of this function on the model, and they should coincide. So the problem is about how to pick up functions. It can be any function. In fact, it is true for all functions.
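The "average on the data should coincide with the expectation under the model" condition can be sketched numerically. In this hypothetical example, the candidate models are uniform distributions on [0, b], the invariant function is phi(x) = x, and the invariant (empirical mean equals the model expectation b/2) cuts the candidate set down sharply.

```python
# Hypothetical sketch of an invariant: keep only candidate models under which
# the expectation of a function phi matches its empirical average on the data.
# Candidate models: uniform distributions on [0, b]; phi(x) = x, E[phi] = b/2.

data = [0.3, 1.1, 1.9, 0.7, 1.5, 0.5]  # toy training sample

candidates = [b / 100 for b in range(1, 401)]  # b in (0, 4]

def model_mean(b):
    """Expectation of phi(x) = x under Uniform[0, b]."""
    return b / 2

empirical_mean = sum(data) / len(data)

tol = 0.01
admissible = [b for b in candidates
              if abs(model_mean(b) - empirical_mean) < tol]

print(f"empirical mean {empirical_mean:.3f} keeps {len(admissible)} of "
      f"{len(candidates)} candidate models, near b = {min(admissible):.2f}")
```

A second invariant, say phi(x) = x squared, would narrow the admissible set again, which is how each extra predicate substitutes for extra observations.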
But, because when we're talking, say, a duck does not jump, you don't ask the question "jumps like a duck," because it is trivial: it does not jump, and that doesn't help you to recognize it. But you know something about which questions to ask, and you're asking "swims like a duck." And "looks like a duck" is the general situation. "Looks like, say, a guy who has this illness, this disease" is legal too. So there is a general type of predicate, "looks like," and special types of predicates, which are related to the specific problem. And that is the intelligence part of all this business, and that is where the teacher is involved.

Incorporating the specialized predicates.
What do you think about deep learning, as neural networks, these arbitrary architectures, helping accomplish some of the tasks you're thinking about? Their effectiveness, or lack thereof? What are the weaknesses, and what are the possible strengths?

You know, I think that this is fantasy, everything like deep learning, like features. Let me give you this example. One of the greatest books is Churchill's book about the history of the Second World War. And he starts this book by describing that in the old times, when a war was over, the great kings gathered together, almost all of them relatives, and they discussed what should be done, how to create peace. And they came to an agreement. And when the First World War happened, the general public came to power. And they were so greedy that they robbed Germany. And it was clear to everybody that it was not peace, that the peace would last only twenty years, because they were not professionals. And I see the same in machine learning.
There are mathematicians who are looking at the problem from a very deep mathematical point of view. And there are computer scientists who mostly do not know mathematics. They just have an interpretation of it, and they invented a lot of blah-blah-blah interpretations, like deep learning. Why do you need deep learning? Mathematics does not know deep learning. Mathematics does not know neurons. It is just functions. If you want to say piecewise linear functions, say that, and work in the class of piecewise linear functions. But they invent something, and then they try to prove the advantage of that through interpretations, which are mostly wrong. And when that's not enough, they appeal to the brain, about which they know nothing. Nobody knows what's going on in the brain. So I think it is more reliable to work on math. This is a mathematical problem; do your best to solve this problem. Try to understand that there is not only one way of convergence, the strong way of convergence. There is a weak way of convergence, which requires predicates. And if you go through all this stuff, you will see that you don't need deep learning. Even more, I would say that one of the theorems, which is called the representer theorem, says that the optimal solution of the mathematical problem which describes learning is a shallow network, not deep learning.

A shallow network.

Yeah.
The ultimate problem is there.

Absolutely.

In the end, what you're saying is exactly right. The question is: you see no value in throwing something on the table and playing with it, not math? Like a neural network, where, you said, they're throwing something in the bucket; or the biological example, looking at kings and queens, or the cells, with the microscope. You don't see value in imagining the cells as kings and queens, and using that as inspiration and imagination for where the math will eventually lead you? You think that interpretation basically deceives you in a way that's not productive?

I think that if you're trying to analyze this business of learning, and especially the discussion about deep learning, it is a discussion about interpretation, not about things: about what you can say about things.
That's right. But aren't you surprised by the beauty of it? Not mathematical beauty, but the fact that it works at all. Or are you criticizing that very beauty, our human desire to interpret, to find our silly interpretations in these constructs? Let me ask you this: are you surprised, and does it inspire you? How do you feel about the success of a system like AlphaGo at beating the game of Go, using neural networks to estimate the quality of the board and the quality of the positions?

That is your interpretation: "quality of the board."

Yeah, yes.

So it's not our interpretation. The fact is, a neural network system, it doesn't matter, a learning system that I don't think we mathematically understand that well, beats the best human player. It does something that was thought impossible.

That means that it's not a very difficult problem.

So we've empirically discovered that this is not a very difficult problem.

Yeah. It's true.

So maybe... can't argue.
Even more, I would say that if they use deep learning, it is not the most effective way of learning theory. And usually, when people use deep learning, they're using zillions of training data.

Yeah.

But you don't need this. So I described a challenge: can we solve some problems, which deep learning methods (these deep nets) do well, using a hundred times less training data? Even more: some problems deep learning cannot solve, because there is no guarantee that they create an admissible set of functions. To create a deep architecture means to create an admissible set of functions. You cannot say that you're creating a good admissible set of functions; it is just your fantasy. It does not come from math. But it is possible to create an admissible set of functions, because you have your training data. Actually, for mathematicians: when you consider invariants, you need to use the law of large numbers. When you're doing training in existing algorithms, you need the uniform law of large numbers, which is much more difficult; it requires VC dimension and all this stuff. But nevertheless, if you use both the weak and the strong ways of convergence, you can decrease the amount of training data a lot.

You could do the three: the swims like a duck and the quacks like a duck.
So let's step back and think about human intelligence in general. Clearly, that has evolved in a non-mathematical way. It wasn't, as far as we know, that God, or whoever, came up with a model and placed in our brain an admissible set of functions; it kind of evolved. I don't know, maybe you have a view on this. So, Alan Turing, in the fifties, in his paper, asked and rejected the question "Can machines think?" It's not a very useful question, but can you briefly entertain this useless question? Can machines think? So, talk about intelligence and your view of it.
link |
00:31:28.560
I don't know that.
link |
00:31:29.880
I know that Turing described imitation.
link |
00:31:35.560
If computer can imitate human being, let's call it intelligent.
link |
00:31:43.060
And he understands that it is not thinking computer.
link |
00:31:46.720
He completely understands what he's doing.
link |
00:31:49.480
But he set up problem of imitation.
link |
00:31:53.840
So now we understand that the problem is not in imitation.
link |
00:31:58.000
I'm not sure that intelligence is just inside of us.
link |
00:32:04.360
It may be also outside of us.
link |
00:32:06.680
I have several observations.
link |
00:32:09.440
So when I prove some theorem, it's very difficult theorem, in couple of years, in several places,
link |
00:32:20.360
people prove the same theorem, say, Sawyer Lemma, after us was done, then another guys
link |
00:32:27.140
proved the same theorem.
link |
00:32:28.960
In the history of science, it's happened all the time.
link |
00:32:32.280
For example, geometry, it's happened simultaneously, first it did Lobachevsky and then Gauss and
link |
00:32:40.600
Boyai and another guys, and it's approximately in 10 times period, 10 years period of time.
link |
00:32:48.800
And I saw a lot of examples like that.
link |
00:32:51.760
And many mathematicians think that when they develop something, they develop something
link |
00:32:57.800
in general which affects everybody.
link |
00:33:01.600
So maybe our model that intelligence is only inside of us is incorrect.
link |
00:33:07.320
It's our interpretation.
link |
00:33:09.320
It might be there exists some connection with world intelligence.
link |
00:33:15.800
I don't know.
link |
00:33:16.800
You're almost like plugging in into...
link |
00:33:19.040
Yeah, exactly.
link |
00:33:21.240
And contributing to this...
link |
00:33:22.640
Into a big network.
link |
00:33:24.360
Into a big, maybe in your own network.
link |
00:33:28.360
On the flip side of that, maybe you can comment on big O complexity and how you see classifying
link |
00:33:37.400
algorithms by worst-case running time in relation to their input.
link |
00:33:42.240
So that way of thinking about functions, do you think P equals NP? Do you think that's
link |
00:33:47.840
an interesting question?
link |
00:33:49.120
Yeah, it is an interesting question.
link |
00:33:52.000
But let me talk about complexity and about the worst-case scenario.
link |
00:34:00.000
There is a mathematical setting.
link |
00:34:04.320
When I came to the United States in 1990, people did not know statistical
link |
00:34:11.160
learning theory.
link |
00:34:13.040
In Russia, it was published in two monographs, our monographs, but in America they didn't
link |
00:34:19.400
know.
link |
00:34:20.400
Then they learned, and somebody told me that it is a worst-case theory and they would create
link |
00:34:26.640
a real-case theory, but until now they did not.
link |
00:34:30.800
Because it is mathematical, too.
link |
00:34:34.100
You can do only what you can do using mathematics.
link |
00:34:38.680
And only what has a clear understanding and a clear description.
link |
00:34:45.920
And for this reason, we introduced complexity.
link |
00:34:52.640
And we needed this because, actually, it is diversity, I like this term more.
link |
00:35:01.720
Using VC dimension, you can prove some theorems.
link |
00:35:05.220
But we also created a theory for the case when you know the probability measure.
link |
00:35:12.680
And that is the best case which can happen; it is the entropy theory.
link |
00:35:18.080
So from mathematical point of view, you know the best possible case and the worst possible
link |
00:35:24.080
case.
link |
00:35:25.080
You can derive different models in the middle, but it's not so interesting.
link |
00:35:30.480
You think the edges are interesting?
link |
00:35:33.440
The edges are interesting because it is not so easy to get a good bound, an exact bound.
link |
00:35:44.720
There are not many cases where you have an exact bound.
link |
00:35:49.280
But there are interesting principles which you discover in the math.
link |
00:35:54.840
Do you think it's interesting because it's challenging and reveals interesting principles
link |
00:36:00.340
that allow you to get those bounds?
link |
00:36:02.700
Or do you think it's interesting because it's actually very useful for understanding the
link |
00:36:06.700
essence of a function of an algorithm?
link |
00:36:11.080
So it's like me judging your life as a human being by the worst thing you did and the best
link |
00:36:17.680
thing you did versus all the stuff in the middle.
link |
00:36:21.840
It seems not productive.
link |
00:36:24.520
I don't think so, because you cannot describe the situation in the middle.
link |
00:36:31.520
So it will not be general.
link |
00:36:34.600
So you can describe edge cases, and it is clear they have some model, but you cannot describe
link |
00:36:44.120
a model for every new case.
link |
00:36:47.720
So you will never be accurate when you're using a model.
link |
00:36:53.400
But from a statistical point of view, the way you've studied functions and the nature
link |
00:36:59.360
of learning in the world, don't you think that the real world has a very long tail?
link |
00:37:07.760
That the edge cases are very far away from the mean, the stuff in the middle or no?
link |
00:37:19.520
I don't know that.
link |
00:37:21.520
I think that, from my point of view, if you use formal statistics, you need the uniform
link |
00:37:36.920
law of large numbers.
link |
00:37:40.300
If you use this invariance business, you need just the law of large numbers.
link |
00:37:52.240
And there is this huge difference between the uniform law of large numbers and the plain law of large numbers.
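[To make this difference concrete, here is a small toy sketch, my own illustration rather than anything from the conversation: for a single fixed function, the plain law of large numbers makes the empirical mean converge to the true mean, but for a function class that is too rich, the worst-case deviation over the class never shrinks.]

```python
import random

random.seed(0)

def plain_lln(n):
    """Plain law of large numbers for one fixed function:
    the empirical mean of f(x) = 1[x < 0.5] under Uniform(0, 1)
    converges to the true mean 0.5 as n grows."""
    xs = [random.random() for _ in range(n)]
    emp = sum(1 for x in xs if x < 0.5) / n
    return abs(emp - 0.5)

def uniform_lln_failure(n):
    """For a class that is too rich -- indicators of arbitrary
    finite sets -- the *uniform* deviation never shrinks:
    the indicator of the sample itself has empirical mean 1,
    but its true mean under Uniform(0, 1) is 0."""
    xs = [random.random() for _ in range(n)]
    emp_mean = 1.0   # every sample point is inside the chosen set
    true_mean = 0.0  # a finite set has probability measure zero
    return abs(emp_mean - true_mean)

for n in (100, 10000):
    print(n, round(plain_lln(n), 3), uniform_lln_failure(n))
```

[With a VC-type restriction on the class, the uniform deviation would also shrink; the point of the sketch is only that convergence for each single function is far weaker than convergence uniformly over a class.]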
link |
00:37:56.760
Is it useful to describe that a little more or should we just take it to...
link |
00:38:01.880
For example, when I'm talking about the duck, I gave three predicates and that was enough.
link |
00:38:09.800
But if you try to do a formal distinction, you will need a lot of observations.
link |
00:38:19.760
So that means that the predicate "looks like a duck" contains a lot of bits of information,
link |
00:38:27.400
formal bits of information.
link |
00:38:29.860
So we don't know how many bits of information these things from intelligence contain.
link |
00:38:39.880
And that is the subject of analysis.
link |
00:38:42.440
Until now, in all this business, I don't like how people consider artificial intelligence.
link |
00:39:01.240
They consider it as some code which imitates the activity of a human being.
link |
00:39:01.240
It is not science, it is applications.
link |
00:39:03.960
You would like to imitate? Go ahead, it is very useful and a good problem.
link |
00:39:09.760
But you need to learn something more.
link |
00:39:15.960
How do people develop, say, predicates like "seems like a duck" or "plays
link |
00:39:25.960
like a butterfly" or something like that?
link |
00:39:29.960
No teacher tells you how it came into his mind, how he chose this image.
link |
00:39:37.000
So that process...
link |
00:39:38.000
That is problem of intelligence.
link |
00:39:39.960
That is the problem of intelligence and you see that connected to the problem of learning?
link |
00:39:44.720
Absolutely.
link |
00:39:45.720
Because you immediately give this specific predicate, seems like a duck
link |
00:39:52.240
or quacks like a duck.
link |
00:39:54.840
It was chosen somehow.
link |
00:39:57.560
So what is the line of work, would you say?
link |
00:40:01.400
If you were to formulate it as a set of open problems that will take us there, to play
link |
00:40:08.680
like a butterfly.
link |
00:40:09.680
We'll get a system to be able to...
link |
00:40:12.200
Let's separate two stories.
link |
00:40:14.520
One mathematical story: if you have a predicate, you can do something.
link |
00:40:23.840
And another story: how to get the predicate.
link |
00:40:23.840
It is an intelligence problem, and people have not even started to understand intelligence.
link |
00:40:32.280
Because to understand intelligence, first of all, try to understand what teachers do.
link |
00:40:43.960
How does a teacher teach? Why is one teacher better than another one?
link |
00:40:43.960
Yeah.
link |
00:40:44.960
And so you think we really even haven't started on the journey of generating the predicates?
link |
00:40:50.400
No.
link |
00:40:51.400
We don't understand.
link |
00:40:52.400
We don't even understand that this problem exists.
link |
00:40:56.880
Because did you hear...
link |
00:40:57.880
You do.
link |
00:40:58.880
No, I just know the name.
link |
00:41:02.720
I want to understand why one teacher is better than another, and how a teacher affects the student.
link |
00:41:13.440
It is not because he is repeating the problems which are in the textbook.
link |
00:41:18.520
He makes some remarks.
link |
00:41:20.920
He makes some philosophy of reasoning.
link |
00:41:23.040
Yeah, that's a beautiful...
link |
00:41:24.600
So it is a formulation of a question that is the open problem.
link |
00:41:31.400
Why is one teacher better than another?
link |
00:41:34.200
Right.
link |
00:41:35.320
What does he do better?
link |
00:41:37.360
Yeah.
link |
00:41:38.360
What...
link |
00:41:39.360
What...
link |
00:41:40.360
Why in...
link |
00:41:41.360
At every level?
link |
00:41:42.360
How do they get better?
link |
00:41:45.080
What does it mean to be better?
link |
00:41:48.560
The whole...
link |
00:41:49.560
Yeah.
link |
00:41:50.560
Yeah.
link |
00:41:51.560
From whatever model I have, one teacher can give a very good predicate.
link |
00:41:56.800
One teacher can say swims like a dog and another can say jumps like a dog.
link |
00:42:03.880
And jumps like a dog carries zero information.
link |
00:42:09.400
So what is the most exciting problem in statistical learning you've ever worked on or are working
link |
00:42:14.400
on now?
link |
00:42:17.600
I just finished this invariant story and I'm happy that...
link |
00:42:24.560
I believe that it is the ultimate learning story.
link |
00:42:30.600
At least I can show that there is no other mechanism, only two mechanisms.
link |
00:42:38.120
But they separate the statistical part from the intelligent part, and I know nothing about the intelligent
link |
00:42:46.760
part.
link |
00:42:47.760
And if we knew this intelligent part, it would help us a lot in teaching, in learning.
link |
00:42:59.160
In learning.
link |
00:43:00.160
Yeah.
link |
00:43:01.160
Will we know it when we see it?
link |
00:43:02.920
So for example, in my talk, the last slide was a challenge.
link |
00:43:07.100
So you have, say, the MNIST digit recognition problem, and deep learning claims that they did it
link |
00:43:14.680
very well, say, 99.5% correct answers.
link |
00:43:22.100
But they use 60,000 observations.
link |
00:43:25.280
Can you do the same using a hundred times less?
link |
00:43:29.560
By incorporating invariants. What does it mean? You know digits one, two, three.
link |
00:43:35.280
But looking at them, explain to me which invariants I should use so that a hundred examples, or say
link |
00:43:44.040
a hundred times fewer examples, can do the same job.
link |
00:43:47.800
Yeah, that last slide, unfortunately your talk ended quickly, but that last slide was
link |
00:43:56.520
a powerful open challenge and a formulation of the essence here.
link |
00:44:01.960
What is the exact problem of intelligence?
link |
00:44:06.300
Because everybody, when machine learning started and it was developed by mathematicians, they
link |
00:44:15.040
immediately recognized that we use much more training data than humans need.
link |
00:44:22.540
But now again, we come to the same story: we have to decrease it.
link |
00:44:27.640
That is the problem of learning.
link |
00:44:30.660
It is not like in deep learning, where they use zillions of training data, because maybe zillions
link |
00:44:37.320
are not enough if you don't have good invariants.
link |
00:44:44.720
Maybe you will never collect some number of observations.
link |
00:44:49.520
But now it is a question for intelligence: how to do that?
link |
00:44:56.080
Because the statistical part is ready: as soon as you supply us with a predicate, we can do
link |
00:45:03.200
a good job with a small amount of observations.
link |
00:45:06.880
And the very first challenge is the well-known digit recognition.
link |
00:45:11.040
You know the digits, so please tell me the invariants.
link |
00:45:15.560
Thinking about that, I can say for digit three, I would introduce the concept of horizontal symmetry.
link |
00:45:25.760
So the digit three has more horizontal symmetry than, say, digit two, or something
link |
00:45:32.440
like that.
link |
00:45:33.440
But as soon as I get the idea of horizontal symmetry, I can mathematically invent a lot
link |
00:45:40.480
of measures of horizontal symmetry, or then vertical symmetry, or diagonal symmetry, whatever,
link |
00:45:47.360
if I have the idea of symmetry.
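[One way to turn "horizontal symmetry" into a number is sketched below. The measure is hypothetical, my own toy illustration of the kind of statistic the idea admits, not Vapnik's actual predicate: compare a binary digit image with its top-bottom mirror.]

```python
import numpy as np

def horizontal_symmetry(img):
    """Score in [0, 1]: 1 means the image equals its own
    top-bottom mirror, lower means more disagreement.
    A crude hypothetical measure for binary images; many
    other symmetry statistics could be invented."""
    flipped = img[::-1, :]                 # reflect across the horizontal axis
    diff = np.abs(img - flipped).mean()    # mean per-pixel disagreement
    return 1.0 - diff / max(img.max(), 1e-9)

# Crude 5x5 sketches: a "3" is nearly mirror-symmetric, a "2" is not.
three = np.array([[1, 1, 1, 1, 1],
                  [0, 0, 0, 0, 1],
                  [0, 1, 1, 1, 1],
                  [0, 0, 0, 0, 1],
                  [1, 1, 1, 1, 1]], dtype=float)
two   = np.array([[1, 1, 1, 1, 1],
                  [0, 0, 0, 0, 1],
                  [1, 1, 1, 1, 1],
                  [1, 0, 0, 0, 0],
                  [1, 1, 1, 1, 1]], dtype=float)

print(horizontal_symmetry(three) > horizontal_symmetry(two))  # True
```

[The same template gives vertical or diagonal symmetry by changing the flip axis, which is the point of the remark above: one idea of symmetry generates a whole family of measurable predicates.]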
link |
00:45:49.980
But what else?
link |
00:45:52.800
Thinking of digits, I see that it is a meta predicate, which is not shape; it is something like symmetry,
link |
00:46:07.600
like how dark the whole picture is, something like that, which can give rise to a predicate.
link |
00:46:16.240
You think such a predicate could arise out of something that is not general, meaning
link |
00:46:29.800
it feels like for me to be able to understand the difference between two and three, I would
link |
00:46:35.640
need to have had a childhood of 10 to 15 years playing with kids, going to school, being
link |
00:46:48.080
yelled at by parents, all of that, walking, jumping, looking at ducks, and then I would be able
link |
00:46:57.880
to generate the right predicate for telling the difference between two and a three.
link |
00:47:03.120
Or do you think there's a more efficient way?
link |
00:47:05.720
I don't know.
link |
00:47:06.720
I know for sure that you must know something more than digits.
link |
00:47:12.200
Yes.
link |
00:47:13.200
And that's a powerful statement.
link |
00:47:15.000
Yeah.
link |
00:47:16.000
But maybe there are several languages of description for these elements of digits.
link |
00:47:24.600
So I'm talking about symmetry, about some properties of geometry, I'm talking about
link |
00:47:32.000
something abstract.
link |
00:47:33.000
I don't know that.
link |
00:47:34.780
But this is a problem of intelligence.
link |
00:47:38.900
So in one of our articles, it is trivial to show that every example can carry no more
link |
00:47:47.160
than one bit of information in reality.
link |
00:47:50.240
Because when you show an example and you say this is a one, you can remove, say, the functions
link |
00:48:00.660
which do not tell you one; the best strategy,
link |
00:48:05.080
if you can do it perfectly, is to remove half of the functions.
link |
00:48:10.160
But when you use one predicate, like looks like a duck, you can remove many more functions
link |
00:48:17.080
than half.
link |
00:48:18.920
And that means that it contains many bits of information from a formal point of view.
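[The halving argument can be checked on a toy hypothesis class. This is an illustrative sketch, not from the conversation: the class of all boolean functions on four points and the "symmetry" invariant below are my own choices, standing in for a teacher-supplied predicate.]

```python
from itertools import product

# Hypothesis class: all boolean functions on 4 inputs -> 16 functions,
# each represented as its output tuple (f(0), f(1), f(2), f(3)).
H = list(product([0, 1], repeat=4))
print(len(H))  # 16

# One perfectly informative labeled example (x=0, y=1) is consistent
# with exactly half of the class: one bit of information.
after_example = [f for f in H if f[0] == 1]
print(len(after_example))  # 8

# A predicate (an invariant the teacher supplies), e.g. "f is symmetric
# under reversing the input order", removes much more than half at once.
after_predicate = [f for f in H if f == f[::-1]]
print(len(after_predicate))  # 4 -> two bits from a single predicate
```

[Going from 16 to 8 hypotheses is one bit; going from 16 to 4 is two bits, which is the formal sense in which a good predicate carries more information than any single labeled example can.]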
link |
00:48:26.160
But when you have a general picture of what you want to recognize and a general picture
link |
00:48:34.640
of the world, can you invent this predicate?
link |
00:48:40.960
And that predicate carries a lot of information.
link |
00:48:47.560
Beautifully put.
link |
00:48:48.960
Maybe just me, but in all the math you show, in your work, which is some of the most profound
link |
00:48:56.000
mathematical work in the field of learning AI and just math in general, I hear a lot
link |
00:49:02.320
of poetry and philosophy.
link |
00:49:04.400
You really kind of talk about philosophy of science.
link |
00:49:09.920
There's a poetry and music to a lot of the work you're doing and the way you're thinking
link |
00:49:13.320
about it.
link |
00:49:14.320
So do you, where does that come from?
link |
00:49:16.680
Do you escape to poetry?
link |
00:49:18.880
Do you escape to music or not?
link |
00:49:21.360
I think that there exists ground truth.
link |
00:49:23.840
There exists ground truth?
link |
00:49:25.760
Yeah.
link |
00:49:26.760
And that can be seen everywhere.
link |
00:49:30.720
The smart guys, philosophers, sometimes I'm surprised how deep they see.
link |
00:49:39.000
Sometimes I see that some of them are completely off the subject.
link |
00:49:45.560
But the ground truth I see in music.
link |
00:49:50.960
Music is the ground truth?
link |
00:49:51.960
Yeah.
link |
00:49:52.960
And in poetry: many poets believe they take dictation.
link |
00:50:01.880
So what piece of music as a piece of empirical evidence gave you a sense that they are touching
link |
00:50:12.360
something in the ground truth?
link |
00:50:14.560
It is structure.
link |
00:50:16.720
The structure of the math of music.
link |
00:50:17.720
Yeah, because when you're listening to Bach, you see the structure.
link |
00:50:22.360
Very clear, very classic, very simple, and the same in math when you have axioms in geometry,
link |
00:50:31.160
you have the same feeling.
link |
00:50:32.160
And in poetry, sometimes you see the same.
link |
00:50:38.360
And if you look back at your childhood, you grew up in Russia, you maybe were born as
link |
00:50:44.580
a researcher in Russia, you developed as a researcher in Russia, and you came to the United
link |
00:50:48.680
States and a few places.
link |
00:50:51.800
If you look back, what was some of your happiest moments as a researcher, some of the most
link |
00:51:00.000
profound moments, not in terms of their impact on society, but in terms of their impact on
link |
00:51:09.960
how damn good you feel that day and you remember that moment?
link |
00:51:15.400
You know, every time you find something, it is great in life, even the simple things.
link |
00:51:26.600
But my general feeling is that most of the time I was wrong.
link |
00:51:32.160
You should go again and again and again and try to be honest in front of yourself, not
link |
00:51:39.520
to make interpretations, but to try to understand how it is related to ground truth, that it is not
link |
00:51:47.840
my blah, blah, blah interpretation and something like that.
link |
00:51:52.640
But you're allowed to get excited at the possibility of discovery.
link |
00:51:56.720
Oh yeah.
link |
00:51:57.720
You have to double check it.
link |
00:51:59.840
No, but how is it related to other ground truths? Is it just temporary or is it forever?
link |
00:52:10.880
You know, you always have a feeling when you found something, how big is that?
link |
00:52:19.880
So 20 years ago, when we discovered statistical learning theory, nobody believed in it, except for
link |
00:50:26.560
one guy, Dudley from MIT. And then in 20 years it became fashionable, and the same with support
link |
00:50:37.640
vector machines, that is, kernel machines.
link |
00:52:41.480
So with support vector machines and learning theory, when you were working on it, you had
link |
00:52:49.240
a sense, you had a sense of the profundity of it, how this seems to be right, this seems
link |
00:52:59.600
to be powerful.
link |
00:53:00.600
Right.
link |
00:53:01.600
Absolutely.
link |
00:53:02.600
Immediately.
link |
00:53:03.600
I recognized that it would last forever, and now that I have found this invariant story, I have
link |
00:53:18.480
a feeling that it is complete learning, because I have proof that there are no different mechanisms.
link |
00:53:24.720
There are some cosmetic improvements you can do, but in essence, you need
link |
00:53:35.480
both invariants and statistical learning, and they should work together.
link |
00:53:41.660
But I'm also happy that we can formulate what intelligence is from that, and separate it
link |
00:53:52.920
from the technical part, and that is completely different.
link |
00:53:57.240
Absolutely.
link |
00:53:58.240
Well, Vladimir, thank you so much for talking today.
link |
00:54:00.280
Thank you.
link |
00:54:01.280
It's an honor.