
Ian Goodfellow: Generative Adversarial Networks (GANs) | Lex Fridman Podcast #19



link |
00:00:00.000
The following is a conversation with Ian Goodfellow.
link |
00:00:03.760
He's the author of the popular textbook on deep learning
link |
00:00:06.360
simply titled Deep Learning.
link |
00:00:08.880
He coined the term generative adversarial networks,
link |
00:00:12.320
otherwise known as GANs.
link |
00:00:14.560
And with his 2014 paper, he is responsible
link |
00:00:18.160
for launching the incredible growth
link |
00:00:20.440
of research and innovation
link |
00:00:22.120
in this subfield of deep learning.
link |
00:00:24.720
He got his BS and MS at Stanford,
link |
00:00:27.520
his PhD at University of Montreal
link |
00:00:30.120
with Yoshua Bengio and Aaron Courville.
link |
00:00:33.200
He held several research positions,
link |
00:00:35.240
including at OpenAI, Google Brain,
link |
00:00:37.560
and now at Apple as the director of machine learning.
link |
00:00:41.560
This recording happened while Ian was still at Google Brain,
link |
00:00:45.400
but we don't talk about anything specific to Google
link |
00:00:48.520
or any other organization.
link |
00:00:50.760
This conversation is part
link |
00:00:52.480
of the artificial intelligence podcast.
link |
00:00:54.520
If you enjoy it, subscribe on YouTube,
link |
00:00:56.680
iTunes, or simply connect with me on Twitter
link |
00:00:59.600
at Lex Fridman, spelled F R I D.
link |
00:01:03.000
And now here's my conversation with Ian Goodfellow.
link |
00:01:08.240
You open your popular deep learning book
link |
00:01:11.000
with a Russian doll type diagram
link |
00:01:13.600
that shows deep learning is a subset
link |
00:01:15.880
of representation learning,
link |
00:01:17.160
which in turn is a subset of machine learning
link |
00:01:19.960
and finally a subset of AI.
link |
00:01:22.520
So this kind of implies that there may be limits
link |
00:01:25.280
to deep learning in the context of AI.
link |
00:01:27.720
So what do you think are the current limits of deep learning
link |
00:01:31.560
and are those limits something
link |
00:01:33.120
that we can overcome with time?
link |
00:01:35.760
Yeah, I think one of the biggest limitations
link |
00:01:37.720
of deep learning is that right now
link |
00:01:39.320
it requires really a lot of data, especially labeled data.
link |
00:01:43.960
There are some unsupervised
link |
00:01:45.480
and semi-supervised learning algorithms
link |
00:01:47.160
that can reduce the amount of labeled data you need,
link |
00:01:49.480
but they still require a lot of unlabeled data.
link |
00:01:52.200
Reinforcement learning algorithms, they don't need labels,
link |
00:01:54.200
but they need really a lot of experiences.
link |
00:01:57.280
As human beings, we don't learn to play Pong
link |
00:01:58.920
by failing at Pong two million times.
link |
00:02:02.720
So just getting the generalization ability better
link |
00:02:05.880
is one of the most important bottlenecks
link |
00:02:08.040
in the capability of the technology today.
link |
00:02:10.520
And then I guess I'd also say deep learning
link |
00:02:12.360
is like a component of a bigger system.
link |
00:02:16.600
So far, nobody is really proposing to have
link |
00:02:20.600
only what you'd call deep learning
link |
00:02:22.000
as the entire ingredient of intelligence.
link |
00:02:25.520
You use deep learning as submodules of other systems,
link |
00:02:29.860
like AlphaGo has a deep learning model
link |
00:02:32.320
that estimates the value function.
link |
00:02:35.200
Most reinforcement learning algorithms
link |
00:02:36.600
have a deep learning module
link |
00:02:37.880
that estimates which action to take next,
link |
00:02:40.320
but you might have other components.
link |
00:02:42.480
So you're basically building a function estimator.
link |
00:02:46.120
Do you think it's possible?
link |
00:02:48.600
You said nobody's kind of been thinking about this so far,
link |
00:02:51.000
but do you think neural networks could be made to reason
link |
00:02:54.320
in the way symbolic systems did in the 80s and 90s
link |
00:02:57.720
to do more, create more like programs
link |
00:03:00.160
as opposed to functions?
link |
00:03:01.440
Yeah, I think we already see that a little bit.
link |
00:03:04.880
I already kind of think of neural nets as a kind of program.
link |
00:03:08.860
I think of deep learning as basically learning programs
link |
00:03:12.920
that have more than one step.
link |
00:03:15.280
So if you draw a flow chart
link |
00:03:16.960
or if you draw a TensorFlow graph
link |
00:03:19.540
describing your machine learning model,
link |
00:03:21.880
I think of the depth of that graph
link |
00:03:23.520
as describing the number of steps that run in sequence
link |
00:03:25.880
and then the width of that graph
link |
00:03:27.640
as the number of steps that run in parallel.
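
To make the depth versus width distinction concrete, here is a small sketch (in NumPy, with made-up sizes, not from the conversation): a shallow model does all of its learned multiplications in parallel, while a deep model applies learned steps in sequence.

import numpy as np

x = np.random.randn(16)          # 16 input features

# Shallow: one learned step, width 16 (all feature * weight products in parallel).
w = np.random.randn(16)
shallow_output = np.dot(w, x)

# Deep: three learned steps applied in sequence; the depth of the graph is 3.
W1 = np.random.randn(32, 16)
W2 = np.random.randn(32, 32)
W3 = np.random.randn(1, 32)
h1 = np.maximum(0, W1 @ x)       # step 1
h2 = np.maximum(0, W2 @ h1)      # step 2
deep_output = W3 @ h2            # step 3
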
link |
00:03:30.120
Now it's been long enough
link |
00:03:31.680
that we've had deep learning working
link |
00:03:32.880
that it's a little bit silly
link |
00:03:33.880
to even discuss shallow learning anymore,
link |
00:03:35.740
but back when I first got involved in AI,
link |
00:03:38.880
when we used machine learning,
link |
00:03:40.080
we were usually learning things
link |
00:03:41.280
like support vector machines.
link |
00:03:43.680
You could have a lot of input features to the model
link |
00:03:45.640
and you could multiply each feature by a different weight.
link |
00:03:48.120
All those multiplications were done in parallel to each other
link |
00:03:51.240
and there wasn't a lot done in series.
link |
00:03:52.720
I think what we got with deep learning
link |
00:03:54.360
was really the ability to have steps of a program
link |
00:03:58.400
that run in sequence.
link |
00:04:00.320
And I think that we've actually started to see
link |
00:04:03.200
that what's important with deep learning
link |
00:04:05.040
is more the fact that we have a multi step program
link |
00:04:08.000
rather than the fact that we've learned a representation.
link |
00:04:10.800
If you look at things like ResNets, for example,
link |
00:04:15.120
they take one particular kind of representation
link |
00:04:18.660
and they update it several times.
link |
00:04:21.040
Back when deep learning first really took off
link |
00:04:23.560
in the academic world in 2006,
link |
00:04:25.760
when Jeff Hinton showed that you could train
link |
00:04:28.400
deep belief networks,
link |
00:04:30.160
everybody who was interested in the idea
link |
00:04:31.960
thought of it as each layer
link |
00:04:33.560
learns a different level of abstraction,
link |
00:04:35.960
that the first layer trained on images
link |
00:04:37.840
learns something like edges
link |
00:04:38.960
and the second layer learns corners
link |
00:04:40.420
and eventually you get these kind of grandmother cell units
link |
00:04:43.320
that recognize specific objects.
link |
00:04:45.920
Today, I think most people think of it more
link |
00:04:48.560
as a computer program where as you add more layers,
link |
00:04:52.000
you can do more updates before you output your final number.
link |
00:04:55.120
But I don't think anybody believes that
link |
00:04:57.160
layer 150 of the ResNet is a grandmother cell
link |
00:05:02.040
and layer 100 is contours or something like that.
link |
00:05:06.040
Okay, so you're not thinking of it
link |
00:05:08.160
as a singular representation that keeps building.
link |
00:05:11.520
You think of it as a program sort of almost like a state.
link |
00:05:15.960
The representation is a state of understanding.
link |
00:05:18.600
Yeah, I think of it as a program that makes several updates
link |
00:05:21.520
and arrives at better and better understandings,
link |
00:05:23.840
but it's not replacing the representation at each step.
link |
00:05:27.500
It's refining it.
link |
00:05:29.160
And in some sense, that's a little bit like reasoning.
link |
00:05:31.660
It's not reasoning in the form of deduction,
link |
00:05:33.560
but it's reasoning in the form of taking a thought
link |
00:05:36.960
and refining it and refining it carefully
link |
00:05:39.440
until it's good enough to use.
link |
00:05:41.240
So do you think, and I hope you don't mind,
link |
00:05:43.560
we'll jump philosophical every once in a while.
link |
00:05:46.040
Do you think of, you know, cognition, human cognition,
link |
00:05:50.480
or even consciousness as simply a result
link |
00:05:53.520
of this kind of sequential representation learning?
link |
00:05:58.120
Do you think that can emerge?
link |
00:06:00.440
Cognition, yes, I think so.
link |
00:06:02.440
Consciousness, it's really hard to even define
link |
00:06:05.160
what we mean by that.
link |
00:06:07.400
I guess there's, consciousness is often defined
link |
00:06:09.840
as things like having self awareness,
link |
00:06:12.120
and that's relatively easy to turn
link |
00:06:15.200
into something actionable for a computer scientist
link |
00:06:17.200
to reason about.
link |
00:06:18.400
People also define consciousness in terms
link |
00:06:20.080
of having qualitative states of experience, like qualia.
link |
00:06:24.000
There's all these philosophical problems,
link |
00:06:25.280
like could you imagine a zombie
link |
00:06:27.880
who does all the same information processing as a human,
link |
00:06:30.760
but doesn't really have the qualitative experiences
link |
00:06:33.500
that we have?
link |
00:06:34.720
That sort of thing, I have no idea how to formalize
link |
00:06:37.580
or turn it into a scientific question.
link |
00:06:39.960
I don't know how you could run an experiment
link |
00:06:41.600
to tell whether a person is a zombie or not.
link |
00:06:44.880
And similarly, I don't know how you could run
link |
00:06:46.680
an experiment to tell whether an advanced AI system
link |
00:06:49.680
had become conscious in the sense of qualia or not.
link |
00:06:53.080
But in the more practical sense,
link |
00:06:54.600
like almost like self attention,
link |
00:06:56.320
you think consciousness and cognition can,
link |
00:06:58.920
in an impressive way, emerge from current types
link |
00:07:03.240
of architectures that we think of as determining.
link |
00:07:05.600
Or if you think of consciousness
link |
00:07:07.920
in terms of self awareness and just making plans
link |
00:07:12.160
based on the fact that the agent itself
link |
00:07:15.120
exists in the world, reinforcement learning algorithms
link |
00:07:18.000
are already more or less forced to model
link |
00:07:20.840
the agent's effect on the environment.
link |
00:07:23.040
So that more limited version of consciousness
link |
00:07:26.340
is already something that we get limited versions
link |
00:07:30.560
of with reinforcement learning algorithms
link |
00:07:32.960
if they're trained well.
link |
00:07:34.640
But you say limited.
link |
00:07:37.440
So the big question really is how you jump
link |
00:07:39.920
from limited to human level, right?
link |
00:07:42.120
And whether it's possible,
link |
00:07:46.840
even just building common sense reasoning
link |
00:07:49.000
seems to be exceptionally difficult.
link |
00:07:50.520
So if we scale things up,
link |
00:07:52.480
if we get much better on supervised learning,
link |
00:07:55.000
if we get better at labeling,
link |
00:07:56.600
if we get bigger datasets, more compute,
link |
00:08:00.640
do you think we'll start to see really impressive things
link |
00:08:03.880
that go from limited to something that echoes
link |
00:08:08.760
of human level cognition?
link |
00:08:10.320
I think so, yeah.
link |
00:08:11.200
I'm optimistic about what can happen
link |
00:08:13.360
just with more computation and more data.
link |
00:08:16.440
I do think it'll be important to get the right kind of data.
link |
00:08:20.120
Today, most of the machine learning systems we train
link |
00:08:23.160
are mostly trained on one type of data for each model.
link |
00:08:27.560
But the human brain, we get all of our different senses
link |
00:08:31.380
and we have many different experiences
link |
00:08:33.880
like riding a bike, driving a car,
link |
00:08:36.320
talking to people, reading.
link |
00:08:39.160
I think when we get that kind of integrated dataset
link |
00:08:42.440
working with a machine learning model
link |
00:08:44.400
that can actually close the loop and interact,
link |
00:08:47.640
we may find that algorithms not so different
link |
00:08:50.480
from what we have today,
link |
00:08:51.840
learn really interesting things
link |
00:08:53.240
when you scale them up a lot
link |
00:08:54.400
and train them on a large amount of multimodal data.
link |
00:08:58.240
So multimodal is really interesting,
link |
00:08:59.640
but within, like your work on adversarial examples.
link |
00:09:04.000
So selecting within model, within one mode of data,
link |
00:09:11.120
selecting better at what are the difficult cases
link |
00:09:13.800
which are most useful to learn from.
link |
00:09:16.120
Oh, yeah, like could we get a whole lot of mileage
link |
00:09:18.880
out of designing a model that's resistant
link |
00:09:22.280
to adversarial examples or something like that?
link |
00:09:24.080
Right, that's the question.
link |
00:09:26.280
My thinking on that has evolved a lot
link |
00:09:27.760
over the last few years.
link |
00:09:28.920
When I first started to really invest
link |
00:09:31.280
in studying adversarial examples,
link |
00:09:32.760
I was thinking of it mostly as adversarial examples
link |
00:09:36.320
reveal a big problem with machine learning.
link |
00:09:39.000
And we would like to close the gap
link |
00:09:41.160
between how machine learning models respond
link |
00:09:44.160
to adversarial examples and how humans respond.
link |
00:09:47.640
After studying the problem more,
link |
00:09:49.160
I still think that adversarial examples are important.
link |
00:09:51.960
I think of them now more of as a security liability
link |
00:09:55.440
than as an issue that necessarily shows
link |
00:09:57.800
there's something uniquely wrong
link |
00:09:59.880
with machine learning as opposed to humans.
link |
00:10:02.800
Also, do you see them as a tool
link |
00:10:04.600
to improve the performance of the system?
link |
00:10:06.480
Not on the security side, but literally just accuracy.
link |
00:10:10.760
I do see them as a kind of tool on that side,
link |
00:10:13.480
but maybe not quite as much as I used to think.
link |
00:10:16.640
We've started to find that there's a trade off
link |
00:10:18.520
between accuracy on adversarial examples
link |
00:10:21.680
and accuracy on clean examples.
link |
00:10:24.360
Back in 2014, when I did the first adversarially trained
link |
00:10:28.320
classifier that showed resistance
link |
00:10:30.840
to some kinds of adversarial examples,
link |
00:10:33.040
it also got better at the clean data on MNIST.
link |
00:10:36.040
And that's something we've replicated several times
link |
00:10:37.720
on MNIST, that when we train
link |
00:10:39.640
against weak adversarial examples,
link |
00:10:41.480
MNIST classifiers get more accurate.
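
As a rough illustration of the kind of adversarial training being described, here is a hedged PyTorch sketch using the fast gradient sign method as a weak adversary; the model, optimizer, and epsilon value are placeholders rather than the setup from the 2014 paper.

import torch
import torch.nn.functional as F

def fgsm_example(model, x, y, eps=0.25):
    """Create a weak adversarial example with the fast gradient sign method."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).detach()

def adversarial_training_step(model, optimizer, x, y, eps=0.25):
    """One training step on a mix of clean and adversarial examples."""
    x_adv = fgsm_example(model, x, y, eps)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y) + F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
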
link |
00:10:43.880
So far that hasn't really held up on other data sets
link |
00:10:47.080
and hasn't held up when we train
link |
00:10:48.880
against stronger adversaries.
link |
00:10:50.720
It seems like when you confront
link |
00:10:53.160
a really strong adversary,
link |
00:10:55.720
you tend to have to give something up.
link |
00:10:58.080
Interesting, but it's such a compelling idea
link |
00:11:00.520
because it feels like that's how us humans learn
link |
00:11:04.800
to do the difficult cases.
link |
00:11:06.320
We try to think of what would we screw up
link |
00:11:08.760
and then we make sure we fix that.
link |
00:11:11.000
It's also in a lot of branches of engineering,
link |
00:11:13.680
you do a worst case analysis
link |
00:11:15.800
and make sure that your system will work in the worst case.
link |
00:11:18.720
And then that guarantees that it'll work
link |
00:11:20.400
in all of the messy average cases that happen
link |
00:11:24.360
when you go out into a really randomized world.
link |
00:11:27.440
Yeah, with driving with autonomous vehicles,
link |
00:11:29.560
there seems to be a desire to just look
link |
00:11:31.840
for, to think adversarially,
link |
00:11:34.880
try to figure out how to mess up the system.
link |
00:11:36.920
And if you can be robust to all those difficult cases,
link |
00:11:40.640
then you can, it's a hand wavy empirical way
link |
00:11:43.600
to show your system is safe.
link |
00:11:45.800
Yeah, yeah.
link |
00:11:47.000
Today, most adversarial example research
link |
00:11:49.120
isn't really focused on a particular use case,
link |
00:11:51.640
but there are a lot of different use cases
link |
00:11:54.000
where you'd like to make sure
link |
00:11:55.080
that the adversary can't interfere
link |
00:11:57.720
with the operation of your system.
link |
00:12:00.200
Like in finance,
link |
00:12:01.040
if you have an algorithm making trades for you,
link |
00:12:03.320
people go to a lot of effort
link |
00:12:04.640
to obfuscate their algorithm.
link |
00:12:06.680
That's both to protect their IP
link |
00:12:08.080
because you don't want to research
link |
00:12:10.880
and develop a profitable trading algorithm
link |
00:12:13.560
then have somebody else capture the gains.
link |
00:12:16.120
But it's at least partly
link |
00:12:17.160
because you don't want people to make adversarial
link |
00:12:19.000
examples that fool your algorithm
link |
00:12:21.240
into making bad trades.
link |
00:12:24.360
Or I guess one area that's been popular
link |
00:12:26.560
in the academic literature is speech recognition.
link |
00:12:30.160
If you use speech recognition to hear an audio waveform
link |
00:12:34.400
and then turn that into a command
link |
00:12:37.680
that a phone executes for you,
link |
00:12:39.640
you don't want a malicious adversary
link |
00:12:41.840
to be able to produce audio
link |
00:12:43.600
that gets interpreted as malicious commands,
link |
00:12:46.280
especially if a human in the room
link |
00:12:47.800
doesn't realize that something like that is happening.
link |
00:12:50.320
In speech recognition,
link |
00:12:52.000
has there been much success
link |
00:12:53.920
in being able to create adversarial examples
link |
00:12:58.440
that fool the system?
link |
00:12:59.760
Yeah, actually.
link |
00:13:00.880
I guess the first work that I'm aware of
link |
00:13:02.440
is a paper called Hidden Voice Commands
link |
00:13:05.120
that came out in 2016, I believe.
link |
00:13:08.480
And they were able to show
link |
00:13:09.560
that they could make sounds
link |
00:13:11.920
that are not understandable by a human
link |
00:13:14.960
but are recognized as the target phrase
link |
00:13:18.400
that the attacker wants the phone to recognize it as.
link |
00:13:21.360
Since then, things have gotten a little bit better
link |
00:13:24.040
on the attacker side and worse on the defender side.
link |
00:13:28.680
It's become possible to make sounds
link |
00:13:33.360
that sound like normal speech
link |
00:13:35.600
but are actually interpreted as a different sentence
link |
00:13:39.000
than the human hears.
link |
00:13:40.720
The level of perceptibility
link |
00:13:42.720
of the adversarial perturbation is still kind of high.
link |
00:13:46.600
When you listen to the recording,
link |
00:13:48.160
it sounds like there's some noise in the background,
link |
00:13:51.040
just like rustling sounds.
link |
00:13:52.960
But those rustling sounds are actually
link |
00:13:54.360
the adversarial perturbation
link |
00:13:55.560
that makes the phone hear a completely different sentence.
link |
00:13:58.040
Yeah, that's so fascinating.
link |
00:14:00.120
Peter Norvig mentioned that you're writing
link |
00:14:01.640
the deep learning chapter for the fourth edition
link |
00:14:04.280
of the Artificial Intelligence:
link |
00:14:05.840
A Modern Approach book.
link |
00:14:07.320
So how do you even begin summarizing
link |
00:14:10.680
the field of deep learning in a chapter?
link |
00:14:12.960
Well, in my case, I waited like a year
link |
00:14:16.840
before I actually wrote anything.
link |
00:14:19.080
Is it?
link |
00:14:20.280
Even having written a full length textbook before,
link |
00:14:22.600
it's still pretty intimidating
link |
00:14:25.560
to try to start writing just one chapter
link |
00:14:27.800
that covers everything.
link |
00:14:31.080
One thing that helped me make that plan
link |
00:14:33.160
was actually the experience
link |
00:14:34.280
of having written the full book before
link |
00:14:36.680
and then watching how the field changed
link |
00:14:39.080
after the book came out.
link |
00:14:40.920
I realized there's a lot of topics
link |
00:14:42.280
that were maybe extraneous in the first book
link |
00:14:44.960
and just seeing what stood the test
link |
00:14:47.560
of a few years of being published
link |
00:14:49.360
and what seems a little bit less important
link |
00:14:52.160
to have included now helped me pare down
link |
00:14:53.760
the topics I wanted to cover for the book.
link |
00:14:56.840
It's also really nice now that the field
link |
00:14:59.560
is kind of stabilized to the point
link |
00:15:00.920
where some core ideas from the 1980s are still used today.
link |
00:15:04.720
When I first started studying machine learning,
link |
00:15:06.640
almost everything from the 1980s had been rejected
link |
00:15:09.520
and now some of it has come back.
link |
00:15:11.320
So that stuff that's really stood the test of time
link |
00:15:13.440
is what I focused on putting into the book.
link |
00:15:16.880
There's also, I guess, two different philosophies
link |
00:15:21.240
about how you might write a book.
link |
00:15:23.120
One philosophy is you try to write a reference
link |
00:15:24.760
that covers everything.
link |
00:15:26.160
The other philosophy is you try to provide
link |
00:15:27.960
a high level summary that gives people
link |
00:15:30.320
the language to understand a field
link |
00:15:32.360
and tells them what the most important concepts are.
link |
00:15:34.920
The first deep learning book that I wrote
link |
00:15:37.080
with Yoshua and Aaron was somewhere
link |
00:15:39.240
between the two philosophies,
link |
00:15:41.240
that it's trying to be both a reference
link |
00:15:43.640
and an introductory guide.
link |
00:15:45.760
Writing this chapter for Russell and Norvig's book,
link |
00:15:48.920
I was able to focus more on just a concise introduction
link |
00:15:52.800
of the key concepts and the language
link |
00:15:54.240
you need to read about them more.
link |
00:15:56.000
In a lot of cases, I actually just wrote paragraphs
link |
00:15:57.560
that said, here's a rapidly evolving area
link |
00:16:00.080
that you should pay attention to.
link |
00:16:02.400
It's pointless to try to tell you what the latest
link |
00:16:04.760
and best version of a learn to learn model is.
link |
00:16:11.440
I can point you to a paper that's recent right now,
link |
00:16:13.640
but there isn't a whole lot of a reason to delve
link |
00:16:16.880
into exactly what's going on with the latest
link |
00:16:20.440
learning to learn approach or the latest module
link |
00:16:22.960
produced by a learning to learn algorithm.
link |
00:16:24.960
You should know that learning to learn is a thing
link |
00:16:26.760
and that it may very well be the source
link |
00:16:29.480
of the latest and greatest convolutional net
link |
00:16:32.200
or recurrent net module that you would want to use
link |
00:16:34.520
in your latest project.
link |
00:16:36.040
But there isn't a lot of point in trying to summarize
link |
00:16:38.200
exactly which architecture and which learning approach
link |
00:16:42.280
got to which level of performance.
link |
00:16:44.040
So you maybe focus more on the basics of the methodology.
link |
00:16:49.240
So from back propagation to feed forward
link |
00:16:52.480
to recurrent networks, convolutional, that kind of thing.
link |
00:16:55.160
Yeah, yeah.
link |
00:16:56.480
So if I were to ask you, I remember I took an algorithms
link |
00:17:00.320
and data structures course.
link |
00:17:03.720
I remember the professor asked, what is an algorithm?
link |
00:17:09.200
And he yelled at everybody in a good way
link |
00:17:12.200
that nobody was answering it correctly.
link |
00:17:14.040
Everybody knew what an algorithm was, it was a graduate course.
link |
00:17:16.360
Everybody knew what an algorithm was,
link |
00:17:18.120
but they weren't able to answer it well.
link |
00:17:19.800
So let me ask you, in that same spirit,
link |
00:17:22.360
what is deep learning?
link |
00:17:24.520
I would say deep learning is any kind of machine learning
link |
00:17:29.520
that involves learning parameters of more than one
link |
00:17:34.720
consecutive step.
link |
00:17:37.280
So that, I mean, shallow learning is things where
link |
00:17:40.760
you learn a lot of operations that happen in parallel.
link |
00:17:43.760
You might have a system that makes multiple steps,
link |
00:17:46.720
like you might have hand designed feature extractors,
link |
00:17:51.000
but really only one step is learned.
link |
00:17:52.600
Deep learning is anything where you have multiple
link |
00:17:55.440
operations in sequence.
link |
00:17:56.880
And that includes the things that are really popular
link |
00:17:59.400
today, like convolutional networks
link |
00:18:01.280
and recurrent networks, but it also includes some
link |
00:18:04.640
of the things that have died out, like Boltzmann machines,
link |
00:18:08.280
where we weren't using back propagation.
link |
00:18:11.960
Today, I hear a lot of people define deep learning
link |
00:18:14.240
as gradient descent applied to these differentiable
link |
00:18:20.400
functions, and I think that's a legitimate usage
link |
00:18:24.240
of the term, it's just different from the way
link |
00:18:25.920
that I use the term myself.
link |
00:18:27.800
So what's an example of deep learning that is not
link |
00:18:32.360
gradient descent and differentiable functions?
link |
00:18:34.720
In your, I mean, not specifically perhaps,
link |
00:18:37.400
but more even looking into the future.
link |
00:18:39.760
What's your thought about that space of approaches?
link |
00:18:44.300
Yeah, so I tend to think of machine learning algorithms
link |
00:18:46.340
as decomposed into really three different pieces.
link |
00:18:50.200
There's the model, which can be something like a neural net
link |
00:18:53.000
or a Boltzmann machine or a recurrent model.
link |
00:18:56.600
And that basically just describes how do you take data
link |
00:18:59.520
and how do you take parameters and what function do you use
link |
00:19:03.480
to make a prediction given the data and the parameters?
link |
00:19:07.320
Another piece of the learning algorithm is
link |
00:19:09.920
the optimization algorithm, or not every algorithm
link |
00:19:13.880
can be really described in terms of optimization,
link |
00:19:15.920
but what's the algorithm for updating the parameters
link |
00:19:18.880
or updating whatever the state of the network is?
link |
00:19:22.600
And then the last part is the data set,
link |
00:19:26.280
like how do you actually represent the world
link |
00:19:29.200
as it comes into your machine learning system?
link |
00:19:33.160
So I think of deep learning as telling us something
link |
00:19:35.800
about what does the model look like?
link |
00:19:39.040
And basically to qualify as deep,
link |
00:19:41.240
I say that it just has to have multiple layers.
link |
00:19:44.560
That can be multiple steps in a feed forward
link |
00:19:47.360
differentiable computation.
link |
00:19:49.240
That can be multiple layers in a graphical model.
link |
00:19:52.040
There's a lot of ways that you could satisfy me
link |
00:19:53.560
that something has multiple steps
link |
00:19:56.160
that are each parameterized separately.
link |
00:19:58.920
I think of gradient descent as being all about
link |
00:20:00.640
that other piece,
link |
00:20:01.560
the how do you actually update the parameters piece?
link |
00:20:04.240
So you could imagine having a deep model
link |
00:20:05.960
like a convolutional net and training it with something
link |
00:20:08.680
like evolution or a genetic algorithm.
link |
00:20:11.280
And I would say that still qualifies as deep learning.
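
A minimal sketch of that idea, assuming a toy two-layer network and a simple (1+1) evolution strategy in NumPy: the model is deep in the sense described above (multiple learned steps in sequence), but no gradients are used to train it. The data and sizes are made up for illustration.

import numpy as np

def forward(params, x):
    # A two-step (deep) model: two learned layers applied in sequence.
    W1, W2 = params
    h = np.maximum(0, x @ W1)
    return h @ W2

def loss(params, X, y):
    return np.mean((forward(params, X) - y) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))          # hypothetical toy data
y = rng.normal(size=(64, 1))

params = [rng.normal(scale=0.1, size=(8, 16)), rng.normal(scale=0.1, size=(16, 1))]
best = loss(params, X, y)

# Simple (1+1) evolution: mutate the weights, keep the mutation if it helps.
for _ in range(1000):
    candidate = [p + rng.normal(scale=0.02, size=p.shape) for p in params]
    c = loss(candidate, X, y)
    if c < best:
        params, best = candidate, c
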
link |
00:20:14.640
And then in terms of models
link |
00:20:16.040
that aren't necessarily differentiable,
link |
00:20:18.760
I guess Boltzmann machines are probably the main example
link |
00:20:22.480
of something where you can't really take a derivative
link |
00:20:25.560
and use that for the learning process.
link |
00:20:28.000
But you can still argue that the model has many steps
link |
00:20:32.320
of processing that it applies
link |
00:20:33.760
when you run inference in the model.
link |
00:20:35.800
So it's the steps of processing that's key.
link |
00:20:38.960
So Jeff Hinton suggests that we need to throw away
link |
00:20:41.320
back propagation and start all over.
link |
00:20:44.960
What do you think about that?
link |
00:20:46.520
What could an alternative direction
link |
00:20:48.600
of training neural networks look like?
link |
00:20:51.000
I don't know that back propagation
link |
00:20:52.880
is going to go away entirely.
link |
00:20:54.680
Most of the time when we decide
link |
00:20:57.120
that a machine learning algorithm
link |
00:20:59.200
isn't on the critical path to research for improving AI,
link |
00:21:03.440
the algorithm doesn't die,
link |
00:21:04.640
it just becomes used for some specialized set of things.
link |
00:21:08.760
A lot of algorithms like logistic regression
link |
00:21:11.160
don't seem that exciting to AI researchers
link |
00:21:14.000
who are working on things like speech recognition
link |
00:21:16.760
or autonomous cars today,
link |
00:21:18.400
but there's still a lot of use for logistic regression
link |
00:21:21.080
and things like analyzing really noisy data
link |
00:21:23.960
in medicine and finance
link |
00:21:25.640
or making really rapid predictions
link |
00:21:28.720
in really time limited contexts.
link |
00:21:30.680
So I think back propagation and gradient descent
link |
00:21:33.440
are around to stay,
link |
00:21:34.520
but they may not end up being everything
link |
00:21:38.760
that we need to get to real human level
link |
00:21:40.840
or super human AI.
link |
00:21:42.360
Are you optimistic about us discovering?
link |
00:21:44.680
You know, back propagation has been around for a few decades.
link |
00:21:50.240
So are you optimistic about us as a community
link |
00:21:54.080
being able to discover something better?
link |
00:21:56.800
Yeah, I am.
link |
00:21:57.640
I think we likely will find something that works better.
link |
00:22:01.840
You could imagine things like having stacks of models
link |
00:22:05.520
where some of the lower level models predict parameters
link |
00:22:08.720
of the higher level models.
link |
00:22:10.200
And so at the top level,
link |
00:22:12.160
you're not learning in terms of literally
link |
00:22:13.480
calculating gradients, but just predicting
link |
00:22:15.800
how different values will perform.
link |
00:22:17.680
You can kind of see that already in some areas
link |
00:22:19.560
like Bayesian optimization,
link |
00:22:21.400
where you have a Gaussian process
link |
00:22:22.960
that predicts how well different parameter values
link |
00:22:24.800
will perform.
link |
00:22:25.880
We already use those kinds of algorithms
link |
00:22:27.680
for things like hyperparameter optimization.
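
A small sketch of that pattern, assuming scikit-learn's GaussianProcessRegressor and a made-up validation_error function standing in for an expensive training run: the Gaussian process predicts how well untried learning rates will perform, and the next one to try is chosen from those predictions rather than from a gradient.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def validation_error(learning_rate):
    # Stand-in for an expensive training run; purely hypothetical.
    return (np.log10(learning_rate) + 2.5) ** 2 + 0.1 * np.random.rand()

# A few hyperparameter values tried so far, and the errors they produced.
tried = np.array([[1e-4], [1e-2], [1e-1]])
errors = np.array([validation_error(lr[0]) for lr in tried])

gp = GaussianProcessRegressor()
gp.fit(np.log10(tried), errors)

# Predict performance for untried learning rates and pick the most promising
# one (lowest predicted error minus an exploration bonus).
candidates = np.logspace(-5, 0, 50).reshape(-1, 1)
mean, std = gp.predict(np.log10(candidates), return_std=True)
next_lr = candidates[np.argmin(mean - std)][0]
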
link |
00:22:30.240
And in general, we know a lot of things
link |
00:22:31.640
other than back prop that work really well
link |
00:22:33.240
for specific problems.
link |
00:22:35.000
The main thing we haven't found is a way of taking one
link |
00:22:38.240
of these other non back prop based algorithms
link |
00:22:41.160
and having it really advance the state of the art
link |
00:22:43.520
on an AI level problem.
link |
00:22:46.160
Right.
link |
00:22:47.120
But I wouldn't be surprised if eventually we find
link |
00:22:49.600
that some of these algorithms that,
link |
00:22:51.560
even the ones that already exist,
link |
00:22:52.760
not even necessarily a new one,
link |
00:22:54.200
we might find some way of customizing one of these algorithms
link |
00:22:59.200
to do something really interesting
link |
00:23:00.560
at the level of cognition or the level of,
link |
00:23:06.400
I think one system that we really don't have working
link |
00:23:08.680
quite right yet is like short term memory.
link |
00:23:12.920
We have things like LSTMs,
link |
00:23:14.480
they're called long short term memory.
link |
00:23:17.000
They still don't do quite what a human does
link |
00:23:20.000
with short term memory.
link |
00:23:22.840
Like gradient descent to learn a specific fact
link |
00:23:26.920
has to do multiple steps on that fact.
link |
00:23:29.360
Like if I tell you, the meeting today is at 3pm,
link |
00:23:34.120
I don't need to say over and over again.
link |
00:23:35.440
It's at 3pm, it's at 3pm, it's at 3pm, it's at 3pm.
link |
00:23:38.640
For you to do a gradient step on each one,
link |
00:23:40.400
you just hear it once and you remember it.
link |
00:23:43.160
There's been some work on things like self attention
link |
00:23:46.920
and attention like mechanisms like the neural Turing machine
link |
00:23:50.400
that can write to memory cells and update themselves
link |
00:23:53.160
with facts like that right away.
link |
00:23:54.880
But I don't think we've really nailed it yet.
link |
00:23:56.880
And that's one area where I'd imagine that new optimization
link |
00:24:02.080
algorithms or different ways of applying existing
link |
00:24:04.240
optimization algorithms could give us a way
link |
00:24:07.280
of just lightning fast updating the state
link |
00:24:10.120
of a machine learning system to contain
link |
00:24:12.400
a specific fact like that without needing to have it
link |
00:24:14.920
presented over and over and over again.
link |
00:24:17.000
So some of the success of symbolic systems in the 80s
link |
00:24:21.440
is they were able to assemble these kinds of facts better.
link |
00:24:26.200
But there's a lot of expert input required
link |
00:24:29.080
and it's very limited in that sense.
link |
00:24:31.120
Do you ever look back to that as something
link |
00:24:34.720
that we'll have to return to eventually
link |
00:24:36.560
sort of dust off the book from the shelf
link |
00:24:38.440
and think about how we build knowledge, representation,
link |
00:24:42.400
knowledge.
link |
00:24:43.240
Like will we have to use graph searches?
link |
00:24:44.840
Graph searches, right.
link |
00:24:45.800
And like first order logic and entailment
link |
00:24:47.720
and things like that.
link |
00:24:48.560
That kind of thing, yeah, exactly.
link |
00:24:49.560
In my particular line of work,
link |
00:24:51.200
which has mostly been machine learning security
link |
00:24:54.560
and also generative modeling,
link |
00:24:56.720
I haven't usually found myself moving in that direction.
link |
00:25:00.560
For generative models, I could see a little bit of,
link |
00:25:03.520
it could be useful if you had something like
link |
00:25:06.520
a differentiable knowledge base
link |
00:25:09.680
or some other kind of knowledge base
link |
00:25:11.000
where it's possible for some of our fuzzier
link |
00:25:13.840
machine learning algorithms to interact with a knowledge base.
link |
00:25:16.880
I mean, your network is kind of like that.
link |
00:25:19.040
It's a differentiable knowledge base of sorts.
link |
00:25:21.440
Yeah.
link |
00:25:22.280
But if we had a really easy way of giving feedback
link |
00:25:27.600
to machine learning models,
link |
00:25:29.240
that would clearly help a lot with, with generative models.
link |
00:25:32.400
And so you could imagine one way of getting there would be,
link |
00:25:34.680
get a lot better at natural language processing.
link |
00:25:36.720
But another way of getting there would be,
link |
00:25:38.920
take some kind of knowledge base
link |
00:25:40.280
and figure out a way for it to actually interact
link |
00:25:42.800
with a neural network.
link |
00:25:44.080
Being able to have a chat with a neural network.
link |
00:25:46.080
Yeah.
link |
00:25:47.920
So like one thing in generative models we see a lot today is,
link |
00:25:50.920
you'll get things like faces that are not symmetrical.
link |
00:25:54.480
Like people that have two eyes
link |
00:25:56.800
that are different colors.
link |
00:25:58.200
And I mean, there are people with eyes
link |
00:25:59.560
that are different colors in real life,
link |
00:26:00.840
but not nearly as many of them as you tend to see
link |
00:26:03.480
in the machine learning generated data.
link |
00:26:06.120
So if you had either a knowledge base
link |
00:26:08.120
that could contain the fact,
link |
00:26:10.200
people's faces are generally approximately symmetric
link |
00:26:13.360
and eye color is especially likely
link |
00:26:15.920
to be the same on both sides.
link |
00:26:17.920
Being able to just inject that hint
link |
00:26:20.160
into the machine learning model
link |
00:26:22.000
without having to discover that itself
link |
00:26:23.800
after studying a lot of data
link |
00:26:25.760
would be a really useful feature.
link |
00:26:28.360
I could see a lot of ways of getting there
link |
00:26:30.120
without bringing back some of the 1980s technology,
link |
00:26:32.200
but I also see some ways that you could imagine
link |
00:26:35.160
extending the 1980s technology to play nice with neural nets
link |
00:26:38.240
and have it help get there.
link |
00:26:40.040
Awesome.
link |
00:26:40.880
So you talked about the story of you coming up
link |
00:26:44.360
with the idea of GANs at a bar with some friends.
link |
00:26:47.040
You were arguing that this, you know,
link |
00:26:50.400
GANs would work generative adversarial networks
link |
00:26:53.080
and the others didn't think so.
link |
00:26:54.680
Then you went home at midnight, coded up and it worked.
link |
00:26:58.400
So if I was a friend of yours at the bar,
link |
00:27:01.320
I would also have doubts.
link |
00:27:02.720
It's a really nice idea,
link |
00:27:03.880
but I'm very skeptical that it would work.
link |
00:27:06.800
What was the basis of their skepticism?
link |
00:27:09.280
What was the basis of your intuition why it should work?
link |
00:27:14.360
I don't wanna be someone who goes around promoting alcohol
link |
00:27:16.840
for the purposes of science,
link |
00:27:18.280
but in this case, I do actually think
link |
00:27:21.040
that drinking helped a little bit.
link |
00:27:23.080
When your inhibitions are lowered,
link |
00:27:25.360
you're more willing to try out things
link |
00:27:27.400
that you wouldn't try out otherwise.
link |
00:27:29.640
So I have noticed in general
link |
00:27:32.480
that I'm less prone to shooting down some of my own ideas
link |
00:27:34.560
when I have had a little bit to drink.
link |
00:27:37.960
I think if I had had that idea at lunchtime,
link |
00:27:40.800
I probably would have thought,
link |
00:27:42.280
It's hard enough to train one neural net.
link |
00:27:43.720
You can't train a second neural net
link |
00:27:44.880
in the inner loop of the outer neural net.
link |
00:27:48.080
That was basically my friend's objection
link |
00:27:49.800
was that trying to train two neural nets at the same time
link |
00:27:52.720
would be too hard.
link |
00:27:54.280
So it was more about the training process
link |
00:27:56.120
unless, so my skepticism would be, I'm sure you could train it
link |
00:28:01.160
but the thing would converge to
link |
00:28:03.200
would not be able to generate anything reasonable
link |
00:28:05.840
and any kind of reasonable realism.
link |
00:28:08.240
Yeah, so part of what all of us were thinking about
link |
00:28:11.360
when we had this conversation was deep Boltzmann machines,
link |
00:28:15.280
which a lot of us in the lab, including me,
link |
00:28:17.000
were big fans of deep Boltzmann machines at the time.
link |
00:28:20.640
They involved two separate processes running at the same time.
link |
00:28:24.240
One of them is called the positive phase
link |
00:28:27.400
where you load data into the model
link |
00:28:30.440
and tell the model to make the data more likely.
link |
00:28:32.920
The other one is called the negative phase
link |
00:28:34.480
where you draw samples from the model
link |
00:28:36.280
and tell the model to make those samples less likely.
link |
00:28:40.480
In a deep Boltzmann machine, it's not trivial
link |
00:28:42.400
to generate a sample.
link |
00:28:43.320
You have to actually run an iterative process
link |
00:28:46.280
that gets better and better samples
link |
00:28:48.520
coming closer and closer to the distribution
link |
00:28:50.720
the model represents.
link |
00:28:52.120
So during the training process,
link |
00:28:53.240
you're always running these two systems at the same time.
link |
00:28:56.560
One that's updating the parameters of the model
link |
00:28:58.360
and another one that's trying to generate samples
link |
00:28:59.880
from the model.
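
A hedged sketch of the two-phase update being described, using a single-layer restricted Boltzmann machine with a one-step negative phase as a simplified stand-in for a deep Boltzmann machine; the sizes, learning rate, and omission of bias terms are simplifications for illustration.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W = rng.normal(scale=0.01, size=(784, 128))   # visible-to-hidden weights

def update(v_data, W, lr=0.01):
    # Positive phase: push up the probability of the training data.
    h_data = sigmoid(v_data @ W)
    positive = v_data.T @ h_data

    # Negative phase: draw approximate samples from the model and push their
    # probability down. A single Gibbs step is used here for brevity.
    h_sample = (rng.random(h_data.shape) < h_data).astype(float)
    v_model = sigmoid(h_sample @ W.T)
    h_model = sigmoid(v_model @ W)
    negative = v_model.T @ h_model

    return W + lr * (positive - negative) / v_data.shape[0]
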
link |
00:29:01.120
And they worked really well on things like MNIST,
link |
00:29:03.720
but a lot of us in the lab, including me,
link |
00:29:05.200
had tried to get deep Boltzmann machines to scale past MNIST
link |
00:29:08.840
to things like generating color photos,
link |
00:29:11.320
and we just couldn't get the two processes
link |
00:29:13.480
to stay synchronized.
link |
00:29:16.720
So when I had the idea for GANs,
link |
00:29:18.120
a lot of people thought that the discriminator
link |
00:29:19.720
would have more or less the same problem
link |
00:29:21.960
as the negative phase in the Boltzmann machine,
link |
00:29:25.360
that trying to train the discriminator in the inner loop,
link |
00:29:27.840
you just couldn't get it to keep up
link |
00:29:29.960
with the generator in the outer loop.
link |
00:29:31.560
And that would prevent it from
link |
00:29:33.360
converging to anything useful.
link |
00:29:35.240
Yeah, I share that intuition.
link |
00:29:36.880
Yeah.
link |
00:29:39.560
But turns out to not be the case.
link |
00:29:42.000
A lot of the time with machine learning algorithms,
link |
00:29:43.800
it's really hard to predict ahead of time
link |
00:29:45.200
how well they'll actually perform.
link |
00:29:46.960
You have to just run the experiment
link |
00:29:48.160
and see what happens.
link |
00:29:49.200
And I would say I still today don't have like one factor
link |
00:29:53.480
I can put my finger on and say,
link |
00:29:54.840
this is why GANs worked for photo generation
link |
00:29:58.360
and deep Boltzmann machines don't.
link |
00:30:02.000
There are a lot of theory papers showing that
link |
00:30:04.560
under some theoretical settings,
link |
00:30:06.400
the GAN algorithm does actually converge.
link |
00:30:10.720
But those settings are restricted enough
link |
00:30:14.200
that they don't necessarily explain the whole picture
link |
00:30:17.560
in terms of all the results that we see in practice.
link |
00:30:20.760
So taking a step back,
link |
00:30:22.360
can you, in the same way as we talked about deep learning,
link |
00:30:24.880
can you tell me what generative adversarial networks are?
link |
00:30:29.480
Yeah, so generative adversarial networks
link |
00:30:31.400
are a particular kind of generative model.
link |
00:30:34.000
A generative model is a machine learning model
link |
00:30:36.320
that can train on some set of data.
link |
00:30:38.880
Like say you have a collection of photos of cats
link |
00:30:41.280
and you want to generate more photos of cats,
link |
00:30:44.040
or you want to estimate a probability distribution
link |
00:30:47.120
over cats so you can ask how likely it is
link |
00:30:49.840
that some new image is a photo of a cat.
link |
00:30:52.920
GANs are one way of doing this.
link |
00:30:55.840
Some generative models are good at creating new data.
link |
00:30:59.200
Other generative models are good
link |
00:31:00.840
at estimating that density function
link |
00:31:02.600
and telling you how likely particular pieces of data are
link |
00:31:06.600
to come from the same distribution as the training data.
link |
00:31:09.760
GANs are more focused on generating samples
link |
00:31:12.440
rather than estimating the density function.
link |
00:31:15.640
There are some kinds of GANs, like flow GAN,
link |
00:31:17.720
that can do both,
link |
00:31:18.560
but mostly GANs are about generating samples,
link |
00:31:21.680
generating new photos of cats that look realistic.
link |
00:31:25.240
And they do that completely from scratch.
link |
00:31:29.360
It's analogous to human imagination
link |
00:31:32.240
when a GAN creates a new image of a cat.
link |
00:31:34.760
It's using a neural network to produce a cat
link |
00:31:39.320
that has not existed before.
link |
00:31:41.040
It isn't doing something like compositing photos together.
link |
00:31:44.560
You're not literally taking the eye off of one cat
link |
00:31:47.080
and the ear off of another cat.
link |
00:31:49.000
It's more of this digestive process
link |
00:31:51.320
where the neural net trains in a lot of data
link |
00:31:53.920
and comes up with some representation
link |
00:31:55.560
of the probability distribution
link |
00:31:57.360
and generates entirely new cats.
link |
00:31:59.760
There are a lot of different ways
link |
00:32:00.880
of building a generative model.
link |
00:32:01.960
What's specific to GANs is that we have a two player game
link |
00:32:05.640
in the game theoretic sense.
link |
00:32:08.080
And as the players in this game compete,
link |
00:32:10.280
one of them becomes able to generate realistic data.
link |
00:32:13.920
The first player is called the generator.
link |
00:32:16.120
It produces output data, such as just images, for example.
link |
00:32:20.640
And at the start of the learning process,
link |
00:32:22.400
it'll just produce completely random images.
link |
00:32:25.120
The other player is called the discriminator.
link |
00:32:27.360
The discriminator takes images as input
link |
00:32:29.680
and guesses whether they're real or fake.
link |
00:32:32.480
You train it both on real data,
link |
00:32:34.200
so photos that come from your training set,
link |
00:32:36.120
actual photos of cats.
link |
00:32:37.840
And you try to say that those are real.
link |
00:32:39.880
You also train it on images
link |
00:32:41.920
that come from the generator network.
link |
00:32:43.840
And you train it to say that those are fake.
link |
00:32:46.720
As the two players compete in this game,
link |
00:32:49.200
the discriminator tries to become better
link |
00:32:50.920
at recognizing whether images are real or fake.
link |
00:32:53.280
And the generator becomes better
link |
00:32:54.760
at fooling the discriminator into thinking
link |
00:32:56.960
that its outputs are real.
link |
00:33:00.760
And you can analyze this through the language of game theory
link |
00:33:03.560
and find that there's a Nash equilibrium
link |
00:33:06.920
where the generator has captured
link |
00:33:08.600
the correct probability distribution.
link |
00:33:10.800
So in the cat example,
link |
00:33:12.160
it makes perfectly realistic cat photos.
link |
00:33:14.560
And the discriminator is unable to do better
link |
00:33:17.160
than random guessing,
link |
00:33:18.720
because all the samples coming from both the data
link |
00:33:21.800
and the generator look equally likely
link |
00:33:24.000
to have come from either source.
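
A minimal sketch of that two-player game, assuming a PyTorch setup with made-up network sizes and a batch of real training images passed in from elsewhere; it illustrates the alternating discriminator and generator updates rather than any particular published architecture.

import torch
from torch import nn

noise_dim, data_dim = 64, 784   # hypothetical sizes

generator = nn.Sequential(nn.Linear(noise_dim, 256), nn.ReLU(), nn.Linear(256, data_dim))
discriminator = nn.Sequential(nn.Linear(data_dim, 256), nn.ReLU(), nn.Linear(256, 1))

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real):
    # `real` is a (batch, data_dim) tensor of real training examples.
    batch = real.size(0)
    fake = generator(torch.randn(batch, noise_dim))

    # Discriminator step: label real data 1 and generator output 0.
    opt_d.zero_grad()
    d_loss = bce(discriminator(real), torch.ones(batch, 1)) + \
             bce(discriminator(fake.detach()), torch.zeros(batch, 1))
    d_loss.backward()
    opt_d.step()

    # Generator step: try to make the discriminator call the fakes real.
    opt_g.zero_grad()
    g_loss = bce(discriminator(fake), torch.ones(batch, 1))
    g_loss.backward()
    opt_g.step()
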
link |
00:33:25.840
So do you ever sit back
link |
00:33:28.320
and does it just blow your mind that this thing works?
link |
00:33:31.280
So from very, so it's able to estimate the density function
link |
00:33:35.840
enough to generate realistic images.
link |
00:33:38.640
I mean, yeah, do you ever sit back and think,
link |
00:33:43.640
how does this even, this is quite incredible,
link |
00:33:46.760
especially where GANs have gone in terms of realism.
link |
00:33:49.280
Yeah, and not just to flatter my own work,
link |
00:33:51.600
but generative models,
link |
00:33:53.840
all of them have this property
link |
00:33:55.400
that if they really did what we asked them to do,
link |
00:33:58.800
they would do nothing but memorize the training data.
link |
00:34:01.040
Right, exactly.
link |
00:34:02.920
Models that are based on maximizing the likelihood,
link |
00:34:05.720
the way that you obtain the maximum likelihood
link |
00:34:08.200
for a specific training set
link |
00:34:09.720
is you assign all of your probability mass
link |
00:34:12.440
to the training examples and nowhere else.
link |
00:34:15.120
For GANs, the game is played using a training set.
link |
00:34:18.440
So the way that you become unbeatable in the game
link |
00:34:21.160
is you literally memorize training examples.
link |
00:34:25.360
One of my former interns wrote a paper,
link |
00:34:28.880
his name is Vaishnavh Nagarajan,
link |
00:34:31.040
and he showed that it's actually hard
link |
00:34:33.080
for the generator to memorize the training data,
link |
00:34:36.120
hard in a statistical learning theory sense,
link |
00:34:39.160
that you can actually create reasons
link |
00:34:42.200
for why it would require quite a lot of learning steps
link |
00:34:48.400
and a lot of observations of different latent variables
link |
00:34:52.200
before you could memorize the training data.
link |
00:34:54.360
That still doesn't really explain
link |
00:34:55.680
why when you produce samples that are new,
link |
00:34:58.280
why do you get compelling images
link |
00:34:59.880
rather than just garbage that's different
link |
00:35:02.400
from the training set.
link |
00:35:03.800
And I don't think we really have a good answer for that,
link |
00:35:06.960
especially if you think about
link |
00:35:07.920
how many possible images are out there
link |
00:35:10.240
and how few images the generative model sees during training.
link |
00:35:15.440
It seems just unreasonable
link |
00:35:16.920
that generative models create new images
link |
00:35:19.200
as well as they do, especially considering
link |
00:35:22.080
that we're basically training them to memorize
link |
00:35:23.760
rather than generalize.
link |
00:35:26.240
I think part of the answer is there's a paper
link |
00:35:28.920
called Deep Image Prior where they show
link |
00:35:31.480
that you can take a convolutional net
link |
00:35:33.080
and you don't even need to learn the parameters of it at all.
link |
00:35:35.000
You just use the model architecture.
link |
00:35:37.640
And it's already useful for things like in painting images.
link |
00:35:41.080
I think that shows us that the convolutional network
link |
00:35:43.760
architecture captures something really important
link |
00:35:45.880
about the structure of images.
link |
00:35:47.960
And we don't need to actually use learning
link |
00:35:50.960
to capture all the information
link |
00:35:52.200
coming out of the convolutional net.
link |
00:35:55.240
That would imply that it would be much harder
link |
00:35:58.400
to make generative models in other domains.
link |
00:36:01.240
So far, we're able to make reasonable speech models
link |
00:36:03.600
and things like that.
link |
00:36:04.880
But to be honest, we haven't actually explored
link |
00:36:07.440
a whole lot of different data sets all that much.
link |
00:36:09.800
We don't, for example, see a lot of deep learning models
link |
00:36:13.920
of like biology data sets
link |
00:36:18.440
where you have lots of microarrays
link |
00:36:19.880
measuring the amount of different enzymes
link |
00:36:22.240
and things like that.
link |
00:36:23.080
So we may find that some of the progress
link |
00:36:25.240
that we've seen for images and speech turns out
link |
00:36:27.360
to really rely heavily on the model architecture.
link |
00:36:30.120
And we were able to do what we did for vision
link |
00:36:32.960
by trying to reverse engineer the human visual system.
link |
00:36:37.040
And maybe it'll turn out that we can't just
link |
00:36:39.800
use that same trick for arbitrary kinds of data.
link |
00:36:43.480
Right, so there's aspect of the human vision system,
link |
00:36:45.920
the hardware of it that makes it,
link |
00:36:49.280
without learning, without cognition,
link |
00:36:51.120
just makes it really effective at detecting the patterns
link |
00:36:53.640
we see in the visual world.
link |
00:36:54.960
Yeah, that's really interesting.
link |
00:36:57.280
What, in a big quick overview in your view,
link |
00:37:04.640
what types of GANs are there
link |
00:37:06.280
and what other generative models besides GANs are there?
link |
00:37:10.080
Yeah, so it's maybe a little bit easier to start
link |
00:37:13.360
with what kinds of generative models
link |
00:37:14.640
are there other than GANs.
link |
00:37:16.840
So most generative models are likelihood based
link |
00:37:20.840
where to train them, you have a model
link |
00:37:23.920
that tells you how much probability it assigns
link |
00:37:27.320
to a particular example,
link |
00:37:29.080
and you just maximize the probability assigned
link |
00:37:31.480
to all the training examples.
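
As a tiny concrete example of likelihood-based training (assuming a one-dimensional Gaussian model and a made-up five-example training set): the parameters are chosen to maximize the total log probability assigned to the training examples.

import numpy as np

x = np.array([1.9, 2.1, 2.4, 1.7, 2.0])   # toy "training set"

def log_likelihood(mu, sigma):
    # Log probability the Gaussian model assigns to all training examples.
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2))

# Maximum likelihood picks the parameters that assign the training data the
# highest probability; for a Gaussian this has a closed form.
mu_mle, sigma_mle = x.mean(), x.std()
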
link |
00:37:33.680
It turns out that it's hard to design a model
link |
00:37:36.200
that can create really complicated images
link |
00:37:39.200
or really complicated audio waveforms
link |
00:37:42.280
and still have it be possible to estimate
link |
00:37:46.200
the likelihood function from a computational point of view.
link |
00:37:51.200
Most interesting models that you would just write
link |
00:37:53.200
down intuitively, it turns out that it's almost impossible
link |
00:37:56.200
to calculate the amount of probability
link |
00:37:58.200
they assign to a particular point.
link |
00:38:00.200
So there's a few different schools of generative models
link |
00:38:04.200
in the likelihood family.
link |
00:38:06.200
One approach is to very carefully design the model
link |
00:38:09.200
so that it is computationally tractable
link |
00:38:12.200
to measure the density it assigns to a particular point.
link |
00:38:15.200
So there are things like auto regressive models,
link |
00:38:18.200
like PixelCNN, those basically break down
link |
00:38:23.200
the probability distribution into a product
link |
00:38:26.200
over every single feature.
link |
00:38:28.200
So for an image, you estimate the probability of each pixel
link |
00:38:32.200
given all of the pixels that came before it.
link |
00:38:35.200
There's tricks where if you want to measure
link |
00:38:37.200
the density function, you can actually calculate
link |
00:38:40.200
the density for all these pixels more or less in parallel.
link |
00:38:44.200
Generating the image still tends to require you
link |
00:38:46.200
to go one pixel at a time, and that can be very slow.
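
A short sketch of that factorization, with predict_pixel standing in (hypothetically) for a model like PixelCNN: sampling has to proceed one pixel at a time, while evaluating the density of a known image can score every position from the already-observed earlier pixels.

import numpy as np

# Hypothetical model: predict_pixel(image_so_far, i) returns a probability
# distribution over the value of pixel i given all earlier pixels.
def sample_image(predict_pixel, num_pixels, rng=np.random.default_rng()):
    image = np.zeros(num_pixels, dtype=int)
    for i in range(num_pixels):            # one pixel at a time: slow to sample
        probs = predict_pixel(image[:i], i)
        image[i] = rng.choice(len(probs), p=probs)
    return image

def log_density(predict_pixel, image):
    # The density of a known image is the sum of per-pixel log probabilities;
    # since the earlier pixels are already known, implementations can score
    # all positions more or less in parallel.
    return sum(np.log(predict_pixel(image[:i], i)[image[i]])
               for i in range(len(image)))
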
link |
00:38:50.200
But there are, again, tricks for doing this
link |
00:38:52.200
in a hierarchical pattern where you can keep
link |
00:38:54.200
the runtime under control.
link |
00:38:56.200
Is the quality of the images it generates,
link |
00:38:59.200
putting runtime aside, pretty good?
link |
00:39:02.200
They're reasonable, yeah.
link |
00:39:04.200
I would say a lot of the best results
link |
00:39:07.200
are from GANs these days, but it can be hard to tell
link |
00:39:10.200
how much of that is based on who's studying
link |
00:39:14.200
which type of algorithm, if that makes sense.
link |
00:39:17.200
The amount of effort invested in it.
link |
00:39:19.200
Yeah, or the kind of expertise.
link |
00:39:21.200
So a lot of people who've traditionally been excited
link |
00:39:23.200
about graphics or art and things like that
link |
00:39:25.200
have gotten interested in GANs.
link |
00:39:27.200
And to some extent, it's hard to tell,
link |
00:39:29.200
are GANs doing better because they have a lot of
link |
00:39:32.200
graphics and art experts behind them?
link |
00:39:34.200
Or are GANs doing better because
link |
00:39:36.200
they're more computationally efficient?
link |
00:39:38.200
Or are GANs doing better because
link |
00:39:40.200
they prioritize the realism of samples
link |
00:39:43.200
over the accuracy of the density function?
link |
00:39:45.200
I think all of those are potentially
link |
00:39:47.200
valid explanations, and it's hard to tell.
link |
00:39:51.200
So can you give a brief history of GANs
link |
00:39:53.200
from your 2014 paper onwards?
link |
00:39:59.200
Yeah, so a few highlights.
link |
00:40:01.200
In the first paper, we just showed that
link |
00:40:03.200
GANs basically work.
link |
00:40:05.200
If you look back at the samples we had, now
link |
00:40:07.200
they look terrible.
link |
00:40:09.200
On the CIFAR 10 data set, you can't even
link |
00:40:11.200
recognize the objects in them.
link |
00:40:13.200
Your paper, sorry, you used CIFAR 10?
link |
00:40:15.200
We used MNIST, which is little handwritten digits.
link |
00:40:17.200
We used the Toronto Face Database,
link |
00:40:19.200
which is small grayscale photos of faces.
link |
00:40:22.200
We did have recognizable faces.
link |
00:40:24.200
My colleague Bing Xu put together
link |
00:40:26.200
the first GAN face model for that paper.
link |
00:40:29.200
We also had the CIFAR 10 data set,
link |
00:40:32.200
which is things like very small 32x32 pixel images
link |
00:40:35.200
of cars and cats and dogs.
link |
00:40:40.200
For that, we didn't get recognizable objects,
link |
00:40:43.200
but all the deep learning people back then
link |
00:40:46.200
were really used to looking at these failed samples
link |
00:40:48.200
and kind of reading them like tea leaves.
link |
00:40:50.200
And people who are used to reading the tea leaves
link |
00:40:53.200
recognize that our tea leaves at least look different.
link |
00:40:56.200
Maybe not necessarily better,
link |
00:40:58.200
but there was something unusual about them.
link |
00:41:01.200
And that got a lot of us excited.
link |
00:41:03.200
One of the next really big steps was LAPGAN
link |
00:41:06.200
by Emily Denton and Soumith Chintala at Facebook AI Research,
link |
00:41:10.200
where they actually got really good high resolution photos
link |
00:41:14.200
working with GANs for the first time.
link |
00:41:16.200
They had a complicated system
link |
00:41:18.200
where they generated the image starting at low res
link |
00:41:20.200
and then scaling up to high res,
link |
00:41:22.200
but they were able to get it to work.
link |
00:41:24.200
And then in 2015, I believe later that same year,
link |
00:41:30.200
Alec Radford, Soumith Chintala, and Luke Metz
link |
00:41:35.200
published the DC GAN paper,
link |
00:41:38.200
which stands for Deep Convolutional GAN.
link |
00:41:41.200
It's kind of a nonunique name
link |
00:41:43.200
because these days basically all GANs
link |
00:41:46.200
and even some before that were deep and convolutional,
link |
00:41:48.200
but they just kind of picked a name for a really great recipe
link |
00:41:52.200
where they were able, using only one model
link |
00:41:55.200
instead of a multi step process,
link |
00:41:57.200
to actually generate realistic images of faces and things like that.
link |
00:42:01.200
That was sort of like the beginning
link |
00:42:05.200
of the Cambrian explosion of GANs.
link |
00:42:07.200
Once you had animals that had a backbone,
link |
00:42:09.200
you suddenly got lots of different versions of fish
link |
00:42:12.200
and four legged animals and things like that.
link |
00:42:15.200
So DC GAN became kind of the backbone
link |
00:42:17.200
for many different models that came out.
link |
00:42:19.200
Used as a baseline even still.
link |
00:42:21.200
Yeah, yeah.
link |
00:42:23.200
And so from there, I would say some interesting things we've seen
link |
00:42:26.200
are there's a lot you can say about how just
link |
00:42:30.200
the quality of standard image generation GANs has increased,
link |
00:42:33.200
but what's also maybe more interesting on an intellectual level
link |
00:42:36.200
is how the things you can use GANs for have also changed.
link |
00:42:40.200
One thing is that you can use them to learn classifiers
link |
00:42:44.200
without having to have class labels for every example
link |
00:42:47.200
in your training set.
link |
00:42:49.200
So that's called semi supervised learning.
link |
00:42:51.200
My colleague at OpenAI, Tim Salimans, who's at Brain now,
link |
00:42:55.200
wrote a paper called
link |
00:42:57.200
Improved Techniques for Training GANs.
link |
00:42:59.200
I'm a coauthor on this paper,
link |
00:43:01.200
but I can't claim any credit for this particular part.
link |
00:43:03.200
One thing he showed in the paper is that
link |
00:43:05.200
you can take the GAN discriminator and use it as a classifier
link |
00:43:09.200
that actually tells you this image is a cat,
link |
00:43:12.200
this image is a dog, this image is a car,
link |
00:43:14.200
this image is a truck.
link |
00:43:16.200
And so not just to say whether the image is real or fake,
link |
00:43:18.200
but if it is real to say specifically what kind of object it is.
link |
00:43:22.200
And he found that you can train these classifiers
link |
00:43:25.200
with far fewer labeled examples
link |
00:43:28.200
than traditional classifiers.
link |
00:43:30.200
So if you supervise based
link |
00:43:33.200
not just on your discrimination ability,
link |
00:43:35.200
but also on your ability to classify,
link |
00:43:37.200
you're going to converge much faster
link |
00:43:40.200
to being effective at being a discriminator.
link |
00:43:43.200
Yeah.
link |
00:43:44.200
So for example, for the MNIST dataset,
link |
00:43:46.200
you want to look at an image of a handwritten digit
link |
00:43:49.200
and say whether it's a zero, a one, or two, and so on.
link |
00:43:53.200
To get down to less than 1% error,
link |
00:43:57.200
we required around 60,000 examples
link |
00:44:00.200
until maybe about 2014 or so.
link |
00:44:03.200
In 2016, with this semi supervised GAN project,
link |
00:44:07.200
Tim was able to get below 1% error
link |
00:44:10.200
using only 100 labeled examples.
link |
00:44:13.200
So that was about a 600x decrease
link |
00:44:16.200
in the amount of labels that he needed.
link |
00:44:18.200
He's still using more images than that,
link |
00:44:21.200
but he doesn't need to have each of them labeled as,
link |
00:44:23.200
you know, this one's a one, this one's a two,
link |
00:44:25.200
this one's a zero, and so on.
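A minimal sketch of the K+1-class trick behind that result, roughly following the formulation in Improved Techniques for Training GANs: the discriminator outputs K class logits, "fake" is an implicit extra class whose logit is fixed at zero, and only the small labeled subset contributes to the supervised term. Shapes and names here are illustrative, not the paper's code:

```python
import torch
import torch.nn.functional as F

def log_z_with_fake(logits):
    # logsumexp over the K real-class logits plus the implicit fake logit (= 0).
    zero = torch.zeros_like(logits[:, :1])
    return torch.logsumexp(torch.cat([logits, zero], dim=1), dim=1)

def semi_supervised_d_loss(logits_labeled, labels, logits_unlabeled, logits_fake):
    # Supervised term: ordinary cross-entropy on the few labeled examples
    # (e.g. 100 labeled MNIST digits).
    supervised = F.cross_entropy(logits_labeled, labels)

    # Unsupervised GAN term: real unlabeled data should get high p(real | x),
    # generator samples should get high p(fake | x).
    log_p_real = torch.logsumexp(logits_unlabeled, dim=1) - log_z_with_fake(logits_unlabeled)
    log_p_fake = -log_z_with_fake(logits_fake)   # the fake class logit is fixed at 0
    unsupervised = -(log_p_real.mean() + log_p_fake.mean())
    return supervised + unsupervised
```

At test time the same network, restricted to its K real-class logits, is the classifier.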
link |
00:44:27.200
Then for GANs
link |
00:44:29.200
to be able to generate recognizable objects,
link |
00:44:31.200
so objects from a particular class,
link |
00:44:33.200
you still need labeled data,
link |
00:44:36.200
because you need to know
link |
00:44:38.200
what it means to be a particular class, cat or dog.
link |
00:44:41.200
How do you think we can move away from that?
link |
00:44:44.200
Yeah, some researchers at Brain Zurich
link |
00:44:46.200
actually just released a really great paper
link |
00:44:49.200
on semi supervised GANs,
link |
00:44:51.200
where their goal isn't to classify, but
link |
00:44:54.200
to make recognizable objects
link |
00:44:56.200
despite not having a lot of labeled data.
link |
00:44:58.200
They were working off of DeepMind's BigGAN project,
link |
00:45:02.200
and they showed that they can match
link |
00:45:04.200
the performance of BigGAN
link |
00:45:06.200
using only 10%, I believe, of the labels.
link |
00:45:10.200
BigGAN was trained on the ImageNet data set,
link |
00:45:12.200
which is about 1.2 million images,
link |
00:45:14.200
and had all of them labeled.
link |
00:45:17.200
This latest project from Brain Zurich
link |
00:45:19.200
shows that they're able to get away with
link |
00:45:21.200
having about 10% of the images labeled.
link |
00:45:25.200
They do that essentially using a clustering algorithm,
link |
00:45:29.200
where the discriminator learns to assign
link |
00:45:32.200
the objects to groups,
link |
00:45:34.200
and then this understanding that objects can be grouped
link |
00:45:38.200
into similar types,
link |
00:45:40.200
helps it to form more realistic ideas
link |
00:45:43.200
of what should be appearing in the image,
link |
00:45:45.200
because it knows that every image it creates
link |
00:45:47.200
has to come from one of these archetypal groups,
link |
00:45:50.200
rather than just being some arbitrary image.
link |
00:45:53.200
If you train a GAN with no class labels,
link |
00:45:55.200
you tend to get things that look sort of like
link |
00:45:57.200
grass or water or brick or dirt,
link |
00:46:00.200
but without necessarily a lot going on in them.
link |
00:46:04.200
I think that's partly because if you look
link |
00:46:06.200
at a large ImageNet image,
link |
00:46:08.200
the object doesn't necessarily occupy the whole image,
link |
00:46:11.200
and so you learn to create realistic sets of pixels,
link |
00:46:15.200
but you don't necessarily learn
link |
00:46:17.200
that the object is the star of the show,
link |
00:46:19.200
and you want it to be in every image you make.
link |
00:46:22.200
Yeah, I've heard you talk about the horse
link |
00:46:25.200
to zebra CycleGAN mapping,
link |
00:46:27.200
and how it turns out, kind of
link |
00:46:30.200
thought provokingly, horses are usually on grass,
link |
00:46:33.200
and zebras are usually on drier terrain,
link |
00:46:35.200
so when you're doing that kind of generation,
link |
00:46:38.200
you're going to end up generating greener horses or whatever.
link |
00:46:43.200
So those are connected together.
link |
00:46:45.200
It's not just...
link |
00:46:46.200
Yeah, yeah.
link |
00:46:47.200
You're not able to segment,
link |
00:46:49.200
to be able to generate in a segmented way.
link |
00:46:52.200
So are there other types of games you come across
link |
00:46:55.200
in your mind that neural networks can play with each other
link |
00:47:00.200
to be able to solve problems?
link |
00:47:05.200
Yeah, the one that I spend most of my time on is in security.
link |
00:47:09.200
You can model most interactions as a game
link |
00:47:13.200
where there's attackers trying to break your system
link |
00:47:16.200
or the defender trying to build a resilient system.
link |
00:47:19.200
There's also domain adversarial learning,
link |
00:47:22.200
which is an approach to domain adaptation
link |
00:47:25.200
that looks really a lot like GANs.
link |
00:47:27.200
The authors had the idea before the GAN paper came out.
link |
00:47:31.200
Their paper came out a little bit later,
link |
00:47:33.200
and they were very nice and cited the GAN paper,
link |
00:47:38.200
but I know that they actually had the idea before it came out.
link |
00:47:41.200
Domain adaptation is when you want to train a machine learning model
link |
00:47:45.200
in one setting called a domain,
link |
00:47:47.200
and then deploy it in another domain later,
link |
00:47:50.200
and you would like it to perform well in the new domain,
link |
00:47:52.200
even though the new domain is different from how it was trained.
link |
00:47:55.200
So, for example, you might want to train
link |
00:47:58.200
on a really clean image dataset like ImageNet,
link |
00:48:01.200
but then deploy on users' phones,
link |
00:48:03.200
where the user is taking pictures in the dark
link |
00:48:06.200
and pictures while moving quickly
link |
00:48:08.200
and just pictures that aren't really centered
link |
00:48:10.200
or composed all that well.
link |
00:48:13.200
When you take a normal machine learning model,
link |
00:48:16.200
it often degrades really badly when you move to the new domain
link |
00:48:19.200
because it looks so different from what the model was trained on.
link |
00:48:22.200
Domain adaptation algorithms try to smooth out that gap,
link |
00:48:25.200
and the domain adversarial approach is based on
link |
00:48:28.200
training a feature extractor,
link |
00:48:30.200
where the features have the same statistics
link |
00:48:32.200
regardless of which domain you extracted them on.
link |
00:48:35.200
So, in the domain adversarial game,
link |
00:48:37.200
you have one player that's a feature extractor
link |
00:48:39.200
and another player that's a domain recognizer.
link |
00:48:42.200
The domain recognizer wants to look at the output
link |
00:48:44.200
of the feature extractor and guess which of the two domains
link |
00:48:47.200
the features came from.
link |
00:48:49.200
So, it's a lot like the real versus fake discriminator in GANs.
link |
00:48:52.200
And then the feature extractor,
link |
00:48:54.200
you can think of as loosely analogous to the generator in GANs,
link |
00:48:57.200
except what it's trying to do here
link |
00:48:59.200
is both fool the domain recognizer
link |
00:49:02.200
into not knowing which domain the data came from
link |
00:49:05.200
and also extract features that are good for classification.
link |
00:49:08.200
So, at the end of the day, in the cases where it works out,
link |
00:49:13.200
you can actually get features that work about the same
link |
00:49:18.200
in both domains.
link |
00:49:20.200
Sometimes this has a drawback where,
link |
00:49:22.200
in order to make things work the same in both domains,
link |
00:49:24.200
it just gets worse at the first one.
link |
00:49:26.200
But there are a lot of cases where it actually
link |
00:49:28.200
works out well on both.
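A hedged sketch of that two-player setup in the gradient-reversal style of Ganin et al.'s domain-adversarial networks; layer sizes, names, and the single combined loss are assumptions for illustration. The domain recognizer is trained to tell the domains apart, while the reversed gradient pushes the feature extractor to produce features whose statistics don't give the domain away:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    # Identity on the forward pass, negated gradient on the backward pass.
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

feature_extractor = nn.Sequential(nn.Linear(128, 64), nn.ReLU())  # illustrative sizes
label_classifier = nn.Linear(64, 10)    # the actual task head
domain_recognizer = nn.Linear(64, 2)    # guesses which domain the features came from

def domain_adversarial_loss(x, y, domain_id):
    feats = feature_extractor(x)
    # Task loss (in practice, computed only where labels exist, e.g. the source domain).
    task_loss = F.cross_entropy(label_classifier(feats), y)
    # Domain loss through the gradient-reversal layer.
    dom_logits = domain_recognizer(GradReverse.apply(feats, 1.0))
    domain_loss = F.cross_entropy(dom_logits, domain_id)
    return task_loss + domain_loss
```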
link |
00:49:30.200
So, do you think of GANs being useful in the context
link |
00:49:33.200
of data augmentation?
link |
00:49:35.200
Yeah, one thing you could hope for with GANs
link |
00:49:37.200
is you could imagine,
link |
00:49:39.200
I've got a limited training set
link |
00:49:41.200
and I'd like to make more training data
link |
00:49:43.200
to train something else like a classifier.
link |
00:49:46.200
You could train the GAN on the training set
link |
00:49:50.200
and then create more data
link |
00:49:52.200
and then maybe the classifier would perform better
link |
00:49:55.200
on the test set after training on this bigger GAN generated data set.
link |
00:49:58.200
So, that's the simplest version
link |
00:50:00.200
of something you might hope would work.
link |
00:50:02.200
I've never heard of that particular approach working,
link |
00:50:05.200
but I think there's some closely related things
link |
00:50:08.200
that I think could work in the future
link |
00:50:11.200
and some that actually already have worked.
link |
00:50:13.200
So, if we think a little bit about what we'd be hoping for
link |
00:50:15.200
if we use the GAN to make more training data,
link |
00:50:17.200
we're hoping that the GAN will generalize
link |
00:50:20.200
to new examples better than the classifier would have
link |
00:50:23.200
generalized if it was trained on the same data.
link |
00:50:25.200
And I don't know of any reason to believe
link |
00:50:27.200
that the GAN would generalize better than the classifier would.
link |
00:50:30.200
But what we might hope for is that the GAN
link |
00:50:33.200
could generalize differently from a specific classifier.
link |
00:50:37.200
So, one thing I think is worth trying
link |
00:50:39.200
that I haven't personally tried, but someone could try is
link |
00:50:41.200
what if you trained a whole lot of different generative models
link |
00:50:44.200
on the same training set,
link |
00:50:46.200
create samples from all of them
link |
00:50:48.200
and then train a classifier on that.
link |
00:50:50.200
Because each of the generative models
link |
00:50:52.200
might generalize in a slightly different way,
link |
00:50:54.200
they might capture many different axes of variation
link |
00:50:56.200
that one individual model wouldn't.
link |
00:50:58.200
And then the classifier can capture all of those ideas
link |
00:51:01.200
by training on all of their data.
link |
00:51:03.200
So, it'd be a little bit like making an ensemble of classifiers.
link |
00:51:06.200
An ensemble of GANs in a way.
link |
00:51:08.200
I think that could generalize better.
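A purely hypothetical sketch of that untried idea; the trainer callables and the conditional sampling interface are placeholders, not real library functions:

```python
# Pool class-conditional samples from several different generative models and
# use them, together with the real data, to train a downstream classifier.
def augment_with_generative_ensemble(train_x, train_y, trainers, classes, n_per_class):
    synthetic_x, synthetic_y = [], []
    for train_model in trainers:               # e.g. [train_gan, train_vae, train_flow]
        model = train_model(train_x, train_y)  # each may generalize along different axes
        for c in classes:
            synthetic_x.append(model.sample(n_per_class, class_label=c))
            synthetic_y.extend([c] * n_per_class)
    return synthetic_x, synthetic_y
```

The hope, as described, is not that any single model generalizes better than the classifier would, only that the models generalize differently and the classifier can absorb all of those variations.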
link |
00:51:10.200
The other thing that GANs are really good for
link |
00:51:12.200
is not necessarily generating new data
link |
00:51:16.200
that's exactly like what you already have,
link |
00:51:19.200
but generating new data that has different properties
link |
00:51:23.200
from the data you already had.
link |
00:51:25.200
One thing that you can do is you can create
link |
00:51:27.200
differentially private data.
link |
00:51:29.200
So, suppose that you have something like medical records
link |
00:51:31.200
and you don't want to train a classifier on the medical records
link |
00:51:34.200
and then publish the classifier
link |
00:51:36.200
because someone might be able to reverse engineer
link |
00:51:38.200
some of the medical records you trained on.
link |
00:51:40.200
There's a paper from Casey Greene's lab
link |
00:51:42.200
that shows how you can train a GAN using differential privacy.
link |
00:51:46.200
And then the samples from the GAN
link |
00:51:48.200
still have the same differential privacy guarantees
link |
00:51:51.200
as the parameters of the GAN.
link |
00:51:53.200
So, you can make fake patient data
link |
00:51:55.200
for other researchers to use
link |
00:51:57.200
and they can do almost anything they want with that data
link |
00:51:59.200
because it doesn't come from real people.
link |
00:52:02.200
And the differential privacy mechanism
link |
00:52:04.200
gives you clear guarantees on how much
link |
00:52:07.200
the original people's data has been protected.
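A rough sketch of the general DP-SGD mechanism that such differentially private GAN training builds on: the discriminator, which is the only part that touches real records, gets per-example gradient clipping plus Gaussian noise, and the generator only ever learns through that privatized signal. This is not the specific method from that paper, and the names and hyperparameters are illustrative:

```python
import torch

def dp_sgd_step(discriminator, per_example_losses, optimizer, clip_norm=1.0, noise_std=1.0):
    params = [p for p in discriminator.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    for loss in per_example_losses:
        grads = torch.autograd.grad(loss, params, retain_graph=True)
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = torch.clamp(clip_norm / (total_norm + 1e-12), max=1.0)  # clip per example
        for s, g in zip(summed, grads):
            s.add_(g * scale)
    n = len(per_example_losses)
    for p, s in zip(params, summed):
        noise = torch.randn_like(s) * noise_std * clip_norm   # Gaussian noise for privacy
        p.grad = (s + noise) / n
    optimizer.step()
    optimizer.zero_grad()
```

Because the generator's parameters depend only on these noisy, clipped updates, samples drawn from it inherit the same differential privacy guarantee as the GAN's parameters, which is the property being described here.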
link |
00:52:09.200
That's really interesting, actually.
link |
00:52:11.200
I haven't heard you talk about that before.
link |
00:52:13.200
In terms of fairness,
link |
00:52:15.200
I've seen from AAAI your talk,
link |
00:52:19.200
how can adversarial machine learning
link |
00:52:21.200
help models be more fair
link |
00:52:23.200
with respect to sensitive variables?
link |
00:52:25.200
Yeah. So, there's a paper from Amos Storkey's lab
link |
00:52:28.200
about how to learn machine learning models
link |
00:52:31.200
that are incapable of using specific variables.
link |
00:52:34.200
So, say, for example, you wanted to make predictions
link |
00:52:36.200
that are not affected by gender.
link |
00:52:39.200
It isn't enough to just leave gender
link |
00:52:41.200
out of the input to the model.
link |
00:52:43.200
You can often infer gender from a lot of other characteristics.
link |
00:52:45.200
Like, say that you have the person's name,
link |
00:52:47.200
but you're not told their gender.
link |
00:52:49.200
Well, if their name is Ian, they're kind of obviously a man.
link |
00:52:53.200
So, what you'd like to do is make a machine learning model
link |
00:52:55.200
that can still take in a lot of different attributes
link |
00:52:58.200
and make a really accurate informed prediction,
link |
00:53:02.200
but be confident that it isn't reverse engineering gender
link |
00:53:05.200
or another sensitive variable internally.
link |
00:53:08.200
You can do that using something very similar
link |
00:53:10.200
to the domain adversarial approach,
link |
00:53:12.200
where you have one player that's a feature extractor
link |
00:53:15.200
and another player that's a feature analyzer.
link |
00:53:18.200
And you want to make sure that the feature analyzer
link |
00:53:21.200
is not able to guess the value of the sensitive variable
link |
00:53:24.200
that you're trying to keep private.
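Structurally this mirrors the domain-adversarial sketch above; reusing the GradReverse layer and the label_classifier defined there, one hedged variant simply swaps the domain label for the sensitive attribute:

```python
import torch.nn as nn
import torch.nn.functional as F

sensitive_adversary = nn.Linear(64, 2)   # tries to recover the protected attribute

def fair_representation_loss(feats, y, sensitive):
    task_loss = F.cross_entropy(label_classifier(feats), y)
    # Reversed gradient: the adversary learns to guess the sensitive variable,
    # while the features are trained so that it can't.
    adv_logits = sensitive_adversary(GradReverse.apply(feats, 1.0))
    adversary_loss = F.cross_entropy(adv_logits, sensitive)
    return task_loss + adversary_loss
```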
link |
00:53:26.200
Right. Yeah, I love this approach.
link |
00:53:29.200
So, with the features, you're not able to infer
link |
00:53:34.200
the sensitive variables.
link |
00:53:36.200
It's brilliant. It's quite brilliant and simple, actually.
link |
00:53:39.200
Another way I think that GANs in particular
link |
00:53:42.200
could be used for fairness would be
link |
00:53:44.200
to make something like a cycle GAN,
link |
00:53:46.200
where you can take data from one domain
link |
00:53:49.200
and convert it into another.
link |
00:53:51.200
We've seen cycle GAN turning horses into zebras.
link |
00:53:54.200
We've seen other unsupervised GANs made by Ming-Yu Liu
link |
00:53:59.200
doing things like turning day photos into night photos.
link |
00:54:02.200
I think for fairness, you could imagine
link |
00:54:05.200
taking records for people in one group
link |
00:54:08.200
and transforming them into analogous people in another group
link |
00:54:11.200
and testing to see if they're treated equitably
link |
00:54:14.200
across those two groups.
link |
00:54:16.200
There's a lot of things that would be hard to get right
link |
00:54:18.200
and make sure that the conversion process itself is fair.
link |
00:54:21.200
And I don't think it's anywhere near something
link |
00:54:24.200
that we could actually use yet.
link |
00:54:26.200
But if you could design that conversion process very carefully,
link |
00:54:28.200
it might give you a way of doing audits
link |
00:54:30.200
where you say, what if we took people from this group,
link |
00:54:33.200
converted them into equivalent people in another group?
link |
00:54:35.200
Does the system actually treat them how it ought to?
link |
00:54:39.200
That's also really interesting.
link |
00:54:41.200
You know, in popular press
link |
00:54:46.200
and in general, in our imagination,
link |
00:54:48.200
you think, well, GANs are able to generate data
link |
00:54:51.200
and you start to think about deep fakes
link |
00:54:54.200
or being able to sort of maliciously generate data
link |
00:54:57.200
that fakes the identity of other people.
link |
00:55:00.200
Is this something of a concern to you?
link |
00:55:03.200
Is this something, if you look 10, 20 years into the future,
link |
00:55:06.200
is that something that pops up in your work,
link |
00:55:10.200
in the work of the community that's working on generative models?
link |
00:55:13.200
I'm a lot less concerned about 20 years from now
link |
00:55:15.200
than the next few years.
link |
00:55:17.200
I think there will be a kind of bumpy cultural transition
link |
00:55:20.200
as people encounter this idea
link |
00:55:22.200
that there can be very realistic videos and audio that aren't real.
link |
00:55:25.200
I think 20 years from now,
link |
00:55:27.200
people will mostly understand that you shouldn't believe
link |
00:55:30.200
something is real just because you saw a video of it.
link |
00:55:33.200
People will expect to see that it's been cryptographically signed
link |
00:55:37.200
or have some other mechanism to make them believe
link |
00:55:41.200
that the content is real.
link |
00:55:43.200
There's already people working on this,
link |
00:55:45.200
like there's a startup called Truepic
link |
00:55:47.200
that provides a lot of mechanisms for authenticating
link |
00:55:50.200
that an image is real.
link |
00:55:52.200
They're maybe not quite up to having a state actor
link |
00:55:55.200
try to evade their verification techniques,
link |
00:55:59.200
but it's something that people are already working on
link |
00:56:02.200
and I think will get right eventually.
link |
00:56:04.200
So you think authentication will eventually win out?
link |
00:56:08.200
So being able to authenticate that this is real and this is not?
link |
00:56:11.200
Yeah.
link |
00:56:13.200
As opposed to GANs just getting better and better
link |
00:56:15.200
or generative models being able to get better and better
link |
00:56:18.200
to where the nature of what is real is unknowable.
link |
00:56:21.200
I don't think we'll ever be able to look at the pixels of a photo
link |
00:56:25.200
and tell you for sure that it's real or not real,
link |
00:56:28.200
and I think it would actually be somewhat dangerous
link |
00:56:32.200
to rely on that approach too much.
link |
00:56:34.200
If you make a really good fake detector
link |
00:56:36.200
and then someone's able to fool your fake detector
link |
00:56:38.200
and your fake detector says this image is not fake,
link |
00:56:41.200
then it's even more credible
link |
00:56:43.200
than if you've never made a fake detector in the first place.
link |
00:56:46.200
What I do think we'll get to is systems
link |
00:56:50.200
that we can kind of use behind the scenes
link |
00:56:52.200
to make estimates of what's going on
link |
00:56:55.200
and maybe not use them in court for a definitive analysis.
link |
00:56:59.200
I also think we will likely get better authentication systems
link |
00:57:04.200
where, imagine that every phone cryptographically
link |
00:57:08.200
signs everything that comes out of it.
link |
00:57:10.200
You wouldn't be able to conclusively tell
link |
00:57:12.200
that an image was real,
link |
00:57:14.200
but you would be able to tell somebody who knew
link |
00:57:18.200
the appropriate private key for this phone
link |
00:57:21.200
was actually able to sign this image
link |
00:57:24.200
and upload it to this server at this time stamp.
link |
00:57:28.200
You could imagine maybe you make phones
link |
00:57:31.200
that have the private keys hardware embedded in them.
link |
00:57:35.200
If a state security agency
link |
00:57:37.200
really wants to infiltrate the company,
link |
00:57:39.200
they could probably plant a private key of their choice
link |
00:57:42.200
or break open the chip
link |
00:57:44.200
and learn the private key or something like that.
link |
00:57:46.200
But it would make it a lot harder
link |
00:57:48.200
for an adversary with fewer resources to fake things.
link |
00:57:51.200
For most of us, it would be okay.
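A toy sketch of the signing scheme being imagined here, using an Ed25519 key from the Python cryptography library; in a real phone the private key would live in tamper-resistant hardware, and these function names are made up for illustration:

```python
import time
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

device_key = Ed25519PrivateKey.generate()   # in practice, burned into the phone's hardware

def sign_capture(image_bytes: bytes):
    # Sign the image bytes together with a timestamp at capture time.
    timestamp = str(int(time.time())).encode()
    return timestamp, device_key.sign(image_bytes + timestamp)

def verify_capture(public_key, image_bytes: bytes, timestamp: bytes, signature: bytes) -> bool:
    # Anyone holding the device's public key can check that exactly these bytes
    # were signed by that device at that time.
    try:
        public_key.verify(signature, image_bytes + timestamp)
        return True
    except InvalidSignature:
        return False
```

As Ian says, this doesn't prove an image is "real," only that whoever held that private key signed exactly these bytes at that time, which is a much easier claim to check.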
link |
00:57:53.200
You mentioned the beer and the bar and the new ideas.
link |
00:57:58.200
You were able to come up with this new idea
link |
00:58:01.200
pretty quickly and implement it pretty quickly.
link |
00:58:04.200
Do you think there are still many
link |
00:58:06.200
such groundbreaking ideas in deep learning
link |
00:58:08.200
that could be developed so quickly?
link |
00:58:10.200
Yeah, I do think that there are a lot of ideas
link |
00:58:13.200
that can be developed really quickly.
link |
00:58:15.200
GANs were probably a little bit of an outlier
link |
00:58:18.200
on the whole one hour time scale.
link |
00:58:20.200
But just in terms of low resource ideas
link |
00:58:24.200
where you do something really different
link |
00:58:26.200
on a small scale and get a big payback,
link |
00:58:29.200
I think it's not as likely that you'll see that
link |
00:58:32.200
in terms of things like core machine learning technologies
link |
00:58:35.200
like a better classifier
link |
00:58:37.200
or a better reinforcement learning algorithm
link |
00:58:39.200
or a better generative model.
link |
00:58:41.200
If I had the GAN idea today,
link |
00:58:43.200
it would be a lot harder to prove that it was useful
link |
00:58:45.200
than it was back in 2014
link |
00:58:47.200
because I would need to get it running on something
link |
00:58:50.200
like ImageNet or CelebA at high resolution.
link |
00:58:54.200
Those take a while to train.
link |
00:58:56.200
You couldn't train it in an hour
link |
00:58:58.200
and know that it was something really new and exciting.
link |
00:59:01.200
Back in 2014, training on MNIST was enough.
link |
00:59:04.200
But there are other areas of machine learning
link |
00:59:07.200
where I think a new idea could actually be developed
link |
00:59:11.200
really quickly with low resources.
link |
00:59:13.200
What's your intuition about what areas
link |
00:59:15.200
of machine learning are ripe for this?
link |
00:59:18.200
Yeah, so I think fairness and interpretability
link |
00:59:23.200
are areas where we just really don't have any idea
link |
00:59:27.200
how anything should be done yet.
link |
00:59:29.200
Like for interpretability,
link |
00:59:31.200
I don't think we even have the right definitions.
link |
00:59:33.200
And even just defining a really useful concept,
link |
00:59:36.200
where you don't even need to run any experiments,
link |
00:59:38.200
could have a huge impact on the field.
link |
00:59:40.200
We've seen that, for example, in differential privacy
link |
00:59:43.200
that Cynthia Dwork and her collaborators
link |
00:59:45.200
made this technical definition of privacy
link |
00:59:48.200
where before a lot of things were really mushy
link |
00:59:50.200
and with that definition, you could actually design
link |
00:59:53.200
randomized algorithms for accessing databases
link |
00:59:55.200
and guarantee that they preserved individual people's privacy
link |
00:59:59.200
in a mathematical quantitative sense.
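For reference, that definition is usually stated as follows: a randomized mechanism M is (epsilon, delta)-differentially private if, for any two datasets D and D' differing in one person's record and any set S of possible outputs,

```latex
\Pr\left[\mathcal{M}(D) \in S\right] \;\le\; e^{\epsilon}\,\Pr\left[\mathcal{M}(D') \in S\right] + \delta
```

so no single individual's data can change what the algorithm outputs by more than a quantified amount.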
link |
01:00:03.200
Right now, we all talk a lot about
link |
01:00:05.200
how interpretable different machine learning algorithms are,
link |
01:00:07.200
but it's really just people's opinion.
link |
01:00:09.200
And everybody probably has a different idea
link |
01:00:11.200
of what interpretability means in their head.
link |
01:00:13.200
If we could define some concept related to interpretability
link |
01:00:16.200
that's actually measurable,
link |
01:00:18.200
that would be a huge leap forward
link |
01:00:20.200
even without a new algorithm that increases that quantity.
link |
01:00:24.200
And also, once we had the definition of differential privacy,
link |
01:00:28.200
it was fast to get the algorithms that guaranteed it.
link |
01:00:31.200
So you could imagine once we have definitions
link |
01:00:33.200
of good concepts and interpretability,
link |
01:00:35.200
we might be able to provide the algorithms
link |
01:00:37.200
that have the interpretability guarantees quickly, too.
link |
01:00:42.200
What do you think it takes to build a system
link |
01:00:46.200
with human level intelligence
link |
01:00:48.200
as we quickly venture into the philosophical?
link |
01:00:51.200
So artificial general intelligence, what do you think it takes?
link |
01:00:55.200
I think that it definitely takes better environments
link |
01:01:01.200
than we currently have for training agents,
link |
01:01:03.200
that we want them to have a really wide diversity of experiences.
link |
01:01:08.200
I also think it's going to take really a lot of computation.
link |
01:01:11.200
It's hard to imagine exactly how much.
link |
01:01:13.200
So you're optimistic about simulation,
link |
01:01:16.200
simulating a variety of environments as the path forward
link |
01:01:19.200
as opposed to operating in the real world?
link |
01:01:21.200
I think it's a necessary ingredient.
link |
01:01:23.200
I don't think that we're going to get to artificial general intelligence
link |
01:01:27.200
by training on fixed data sets
link |
01:01:29.200
or by thinking really hard about the problem.
link |
01:01:32.200
I think that the agent really needs to interact
link |
01:01:36.200
and have a variety of experiences within the same lifespan.
link |
01:01:41.200
And today we have many different models that can each do one thing,
link |
01:01:45.200
and we tend to train them on one dataset or one RL environment.
link |
01:01:49.200
Sometimes there are actually papers about getting one set of parameters
link |
01:01:53.200
to perform well in many different RL environments,
link |
01:01:56.200
but we don't really have anything like an agent
link |
01:01:59.200
that goes seamlessly from one type of experience to another
link |
01:02:02.200
and really integrates all the different things that it does
link |
01:02:05.200
over the course of its life.
link |
01:02:07.200
When we do see multiagent environments,
link |
01:02:10.200
they tend to be similar environments.
link |
01:02:16.200
All of them are playing an action based video game.
link |
01:02:19.200
We don't really have an agent that goes from playing a video game
link |
01:02:24.200
to reading the Wall Street Journal
link |
01:02:27.200
to predicting how effective a molecule will be as a drug or something like that.
link |
01:02:32.200
What do you think is a good test for intelligence in your view?
link |
01:02:36.200
There have been a lot of benchmarks, starting with Alan Turing,
link |
01:02:41.200
natural conversation being a good benchmark for intelligence.
link |
01:02:46.200
What would make you, Ian Goodfellow, sit back and be really damn impressed
link |
01:02:53.200
if a system was able to accomplish it?
link |
01:02:55.200
Something that doesn't take a lot of glue from human engineers.
link |
01:02:59.200
Imagine that instead of having to go to the CIFAR website and download CIFAR 10
link |
01:03:07.200
and then write a Python script to parse it and all that,
link |
01:03:11.200
you could just point an agent at the CIFAR 10 problem
link |
01:03:16.200
and it downloads and extracts the data and trains a model
link |
01:03:20.200
and starts giving you predictions.
link |
01:03:22.200
I feel like something that doesn't need to have every step of the pipeline assembled for it
link |
01:03:28.200
definitely understands what it's doing.
link |
01:03:30.200
Is AutoML moving in that direction, or are you thinking way bigger?
link |
01:03:34.200
AutoML has mostly been moving toward once we've built all the glue,
link |
01:03:39.200
can the machine learning system design the architecture really well?
link |
01:03:44.200
I'm more saying, if something knows how to preprocess the data
link |
01:03:49.200
so that it successfully accomplishes the task,
link |
01:03:52.200
then it would be very hard to argue that it doesn't truly understand the task
link |
01:03:56.200
in some fundamental sense.
link |
01:03:58.200
I don't necessarily know that that's the philosophical definition of intelligence,
link |
01:04:02.200
but that's something that would be really cool to build that would be really useful
link |
01:04:05.200
and would impress me and would convince me that we've made a step forward in real AI.
link |
01:04:09.200
You give it the URL for Wikipedia
link |
01:04:13.200
and then next day expect it to be able to solve CIFAR 10.
link |
01:04:18.200
Or you type in a paragraph explaining what you want it to do
link |
01:04:22.200
and it figures out what web searches it should run and downloads all the necessary ingredients.
link |
01:04:28.200
So you have a very clear, calm way of speaking, no ums, easy to edit.
link |
01:04:37.200
I've seen comments that both you and I have been identified as potentially being robots.
link |
01:04:44.200
If you have to prove to the world that you are indeed human, how would you do it?
link |
01:04:48.200
I can understand thinking that I'm a robot.
link |
01:04:53.200
It's the flip side of the Turing test, I think.
link |
01:04:57.200
Yeah, the prove you're human test.
link |
01:05:00.200
Intellectually, is there something that's truly unique in your mind,
link |
01:05:08.200
or does it go back to just natural language again, just being able to talk your way out of it?
link |
01:05:13.200
So proving that I'm not a robot with today's technology,
link |
01:05:16.200
that's pretty straightforward.
link |
01:05:18.200
My conversation today hasn't veered off into talking about the stock market or something because it's in my training data.
link |
01:05:25.200
But I guess more generally trying to prove that something is real from the content alone is incredibly hard.
link |
01:05:31.200
That's one of the main things I've gotten out of my GAN research, that you can simulate almost anything
link |
01:05:37.200
and so you have to really step back to a separate channel to prove that something is real.
link |
01:05:42.200
So I guess I should have had myself stamped on a blockchain when I was born or something, but I didn't do that.
link |
01:05:48.200
So according to my own research methodology, there's just no way to know at this point.
link |
01:05:52.200
So, last question: what problem stands out for you that you're really excited about challenging in the near future?
link |
01:05:59.200
I think resistance to adversarial examples, figuring out how to make machine learning secure against an adversary
link |
01:06:06.200
who wants to interfere with it and control it, that is one of the most important things researchers today could solve.
link |
01:06:11.200
In all domains, image, language, driving and everything.
link |
01:06:17.200
I guess I'm most concerned about domains we haven't really encountered yet.
link |
01:06:22.200
Imagine 20 years from now when we're using advanced AIs to do things we haven't even thought of yet.
link |
01:06:28.200
If you ask people what are the important problems in security of phones in 2002,
link |
01:06:37.200
I don't think we would have anticipated that we're using them for nearly as many things as we're using them for today.
link |
01:06:43.200
I think it's going to be like that with AI that you can kind of try to speculate about where it's going,
link |
01:06:47.200
but really the business opportunities that end up taking off would be hard to predict ahead of time.
link |
01:06:53.200
What you can predict ahead of time is that almost anything you can do with machine learning,
link |
01:06:58.200
you would like to make sure that people can't get it to do what they want rather than what you want
link |
01:07:04.200
just by showing it a funny QR code or a funny input pattern.
link |
01:07:08.200
You think that the set of methodologies to do that can be bigger than any one domain?
link |
01:07:12.200
I think so, yeah.
link |
01:07:15.200
One methodology that I think is not a specific methodology,
link |
01:07:20.200
but a category of solutions that I'm excited about today is making dynamic models
link |
01:07:25.200
that change every time they make a prediction.
link |
01:07:28.200
Right now, we tend to train models and then after they're trained, we freeze them.
link |
01:07:32.200
We just use the same rule to classify everything that comes in from then on.
link |
01:07:37.200
That's really a sitting duck from a security point of view.
link |
01:07:40.200
If you always output the same answer for the same input,
link |
01:07:44.200
then people can just run inputs through until they find a mistake that benefits them,
link |
01:07:49.200
and then they use the same mistake over and over and over again.
link |
01:07:53.200
I think having a model that updates its predictions so that it's harder to predict what you're going to get
link |
01:08:00.200
will make it harder for an adversary to really take control of the system
link |
01:08:04.200
and make it do what they want it to do.
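One hedged way to picture that category of defense, not any specific published method: keep several independently trained models and randomize which one answers each query, with a little input noise, so probing the system repeatedly doesn't return the same exploitable answer twice:

```python
import random
import torch

class DynamicClassifier:
    def __init__(self, models, noise_std=0.01):
        self.models = models          # e.g. an ensemble of independently trained nets
        self.noise_std = noise_std

    def predict(self, x):
        model = random.choice(self.models)               # a moving target per query
        noisy_x = x + self.noise_std * torch.randn_like(x)
        with torch.no_grad():
            return model(noisy_x).argmax(dim=-1)
```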
link |
01:08:06.200
Yeah, models that maintain a bit of a sense of mystery about them
link |
01:08:10.200
because they always keep changing.
link |
01:08:12.200
Ian, thanks so much for talking today. It was awesome.
link |
01:08:14.200
Thank you for coming in. It's great to see you.