
Ian Goodfellow: Generative Adversarial Networks (GANs) | Lex Fridman Podcast #19



00:00:00.000
The following is a conversation with Ian Goodfellow. He's the author of the popular textbook on deep learning, simply titled Deep Learning. He coined the term Generative Adversarial Networks, otherwise known as GANs, and with his 2014 paper is responsible for launching the incredible growth of research and innovation in this subfield of deep learning. He got his BS and MS at Stanford and his PhD at the University of Montreal with Yoshua Bengio and Aaron Courville. He has held several research positions, including at OpenAI and Google Brain, and is now at Apple as the Director of Machine Learning. This recording happened while Ian was still at Google Brain, but we don't talk about anything specific to Google or any other organization. This conversation is part of the Artificial Intelligence Podcast. If you enjoy it, subscribe on YouTube, iTunes, or simply connect with me on Twitter at Lex Fridman, spelled F R I D. And now, here's my conversation with Ian Goodfellow.
00:01:08.240
You open your popular deep learning book with a Russian doll type diagram that shows deep learning is a subset of representation learning, which in turn is a subset of machine learning, and finally a subset of AI. So this kind of implies that there may be limits to deep learning in the context of AI. So what do you think are the current limits of deep learning, and are those limits something that we can overcome with time?
00:01:35.760
Yeah, I think one of the biggest limitations of deep learning is that right now it requires really a lot of data, especially labeled data. There are some unsupervised and semi-supervised learning algorithms that can reduce the amount of labeled data you need, but they still require a lot of unlabeled data. Reinforcement learning algorithms don't need labels, but they need really a lot of experience. As human beings, we don't learn to play Pong by failing at Pong two million times. So just getting the generalization ability better is one of the most important bottlenecks in the capability of the technology today. And then I guess I'd also say deep learning is like a component of a bigger system. So far, nobody is really proposing to have only what you'd call deep learning as the entire ingredient of intelligence. You use deep learning as submodules of other systems. Like AlphaGo has a deep learning model that estimates the value function. Most reinforcement learning algorithms have a deep learning module that estimates which action to take next, but you might have other components.
00:02:42.480
So you're basically building a function estimator. Do you think it's possible, you said nobody's kind of been thinking about this so far, but do you think neural networks could be made to reason in the way symbolic systems did in the 80s and 90s, to do more, to create more like programs as opposed to functions?
00:03:01.440
Yeah, I think we already see that a little bit. I already kind of think of neural nets as a kind of program. I think of deep learning as basically learning programs that have more than one step. So if you draw a flow chart, or if you draw a TensorFlow graph describing your machine learning model, I think of the depth of that graph as describing the number of steps that run in sequence, and the width of that graph as the number of steps that run in parallel. Now it's been long enough that we've had deep learning working that it's a little bit silly to even discuss shallow learning anymore. But back when I first got involved in AI, when we used machine learning, we were usually learning things like support vector machines. You could have a lot of input features to the model, and you could multiply each feature by a different weight. All those multiplications were done in parallel to each other; there wasn't a lot done in series. I think what we got with deep learning was really the ability to have steps of a program that run in sequence.
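The depth-versus-width picture can be sketched in a few lines of numpy: a shallow, SVM-style scorer does all of its feature-weight multiplications in parallel in one learned step, while a deep model chains several learned steps in sequence. The shapes and random weights below are arbitrary, chosen only to make the contrast concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)  # 8 input features

# Shallow model: every feature-weight multiplication happens in parallel,
# then one sum -- depth 1, width 8 (an SVM-style linear scorer).
w = rng.normal(size=8)
shallow_score = float(w @ x)

# Deep model: three learned steps that must run one after another --
# the depth of the graph is the number of sequential steps.
W1, W2, W3 = (rng.normal(size=(8, 8)) for _ in range(3))
h = np.maximum(0, W1 @ x)      # step 1
h = np.maximum(0, W2 @ h)      # step 2
deep_score = float(W3[0] @ h)  # step 3
```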
00:04:00.280
And I think that we've actually started to see that what's important with deep learning is more the fact that we have a multistep program rather than the fact that we've learned a representation. If you look at things like ResNets, for example, they take one particular kind of representation and they update it several times. Back when deep learning first really took off in the academic world in 2006, when Geoff Hinton showed that you could train deep belief networks, everybody who was interested in the idea thought of it as each layer learns a different level of abstraction: the first layer trained on images learns something like edges, the second layer learns corners, and eventually you get these kind of grandmother cell units that recognize specific objects. Today I think most people think of it more as a computer program, where as you add more layers you can do more updates before you output your final number. But I don't think anybody believes that layer 150 of the ResNet is a grandmother cell and layer 100 is contours or something like that.
00:05:06.040
Okay, so you're not thinking of it as a singular representation that keeps building. You think of it as a program, sort of almost like a state. The representation is a state of understanding.
00:05:18.720
Yeah, I think of it as a program that makes several updates and arrives at better and better understandings, but it's not replacing the representation at each step, it's refining it. And in some sense, that's a little bit like reasoning. It's not reasoning in the form of deduction, but it's reasoning in the form of taking a thought and refining it and refining it carefully until it's good enough to use.
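That refine-rather-than-replace update is easy to sketch: each step adds a small correction to the current representation, in the spirit of a ResNet block's x + f(x). The dimensions and the untrained weight matrix below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
state = rng.normal(size=4)          # the current "understanding"
W = 0.1 * rng.normal(size=(4, 4))   # one untrained refinement step

# Each layer nudges the representation instead of replacing it,
# echoing a ResNet block: state <- state + f(state).
for _ in range(150):
    state = state + np.tanh(W @ state)
```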
00:05:41.240
So do you think, and I hope you don't mind, we'll jump philosophical every once in a while, do you think of cognition, human cognition, or even consciousness, as simply a result of this kind of sequential representation learning? Do you think that can emerge?
00:06:00.440
Cognition, yes, I think so. Consciousness, it's really hard to even define what we mean by that. Consciousness is often defined as things like having self awareness, and that's relatively easy to turn into something actionable for a computer scientist to reason about. People also define consciousness in terms of having qualitative states of experience, like qualia, and there's all these philosophical problems, like could you imagine a zombie who does all the same information processing as a human but doesn't really have the qualitative experiences that we have? That sort of thing, I have no idea how to formalize or turn into a scientific question. I don't know how you could run an experiment to tell whether a person is a zombie or not. And similarly, I don't know how you could run an experiment to tell whether an advanced AI system had become conscious in the sense of qualia or not.
00:06:53.060
But in the more practical sense, like almost like self attention, you think consciousness and cognition can, in an impressive way, emerge from the current types of architectures that we think of as learning?
00:07:06.200
Or, if you think of consciousness in terms of self awareness and just making plans based on the fact that the agent itself exists in the world, reinforcement learning algorithms are already more or less forced to model the agent's effect on the environment. So that more limited version of consciousness is already something that we get limited versions of with reinforcement learning algorithms, if they're trained well.
00:07:34.640
But you say limited, so the big question really is how you jump from limited to human level, right? And whether it's possible. Even just building common sense reasoning seems to be exceptionally difficult. So if we scale things up, if we get much better at supervised learning, if we get better at labeling, if we get bigger data sets and more compute, do you think we'll start to see really impressive things that go from limited to something like echoes of human level cognition?
00:08:10.320
I think so, yeah. I'm optimistic about what can happen just with more computation and more data. I do think it'll be important to get the right kind of data. Today, most of the machine learning systems we train are mostly trained on one type of data for each model. But the human brain, we get all of our different senses, and we have many different experiences, like riding a bike, driving a car, talking to people, reading. I think when we get that kind of integrated data set working with a machine learning model that can actually close the loop and interact, we may find that algorithms not so different from what we have today learn really interesting things when you scale them up a lot and train them on a large amount of multimodal data.
00:08:58.240
So multimodal is really interesting, but let's stay within one mode of data, since you're working on adversarial examples: getting better at selecting the difficult cases that are most useful to learn from.
00:09:16.120
Oh yeah, like could we get a whole lot of mileage out of designing a model that's resistant to adversarial examples, or something like that?
00:09:24.120
Right, that's the question.
00:09:26.280
My thinking on that has evolved a lot over the last few years. When I first started to really invest in studying adversarial examples, I was thinking of it mostly as adversarial examples reveal a big problem with machine learning, and we would like to close the gap between how machine learning models respond to adversarial examples and how humans respond. After studying the problem more, I still think that adversarial examples are important. I think of them now more as a security liability than as an issue that necessarily shows there's something uniquely wrong with machine learning as opposed to humans.
00:10:02.800
Also, do you see them as a tool to improve the performance of the system? Not on the security side, but literally just accuracy.
00:10:10.760
I do see them as a kind of tool on that side, but maybe not quite as much as I used to think. We've started to find that there's a trade off between accuracy on adversarial examples and accuracy on clean examples. Back in 2014, when I did the first adversarially trained classifier that showed resistance to some kinds of adversarial examples, it also got better at the clean data on MNIST. And that's something we've replicated several times on MNIST: when we train against weak adversarial examples, MNIST classifiers get more accurate. So far that hasn't really held up on other data sets, and it hasn't held up when we train against stronger adversaries. It seems like when you confront a really strong adversary, you tend to have to give something up.
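The construction being discussed here, the fast gradient sign method from Goodfellow's 2014 work, fits in a few lines. This is a minimal numpy sketch on a logistic regression model; the random weights stand in for a trained classifier, and everything here is illustrative rather than a faithful reproduction of the original experiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
w, b = rng.normal(size=16), 0.0   # a (hypothetical) trained linear classifier
x, y = rng.normal(size=16), 1.0   # one input with true label 1

# Gradient of the logistic loss with respect to the *input* x is (p - y) * w.
p = sigmoid(w @ x + b)
grad_x = (p - y) * w

# Fast gradient sign method: a small step that maximally increases the loss
# under an L-infinity budget epsilon.
epsilon = 0.25
x_adv = x + epsilon * np.sign(grad_x)
```

Adversarial training then mixes pairs like (x_adv, y) back into the training batch, which is the weak-adversary setup that improved clean MNIST accuracy in the conversation above.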
00:10:58.040
Interesting. But it's such a compelling idea, because it feels like that's how us humans learn, through the difficult cases. We try to think of what we would screw up, and then we make sure we fix that.
00:11:11.000
It's also, in a lot of branches of engineering, you do a worst case analysis and make sure that your system will work in the worst case. And then that guarantees that it'll work in all of the messy average cases that happen when you go out into a really randomized world.
00:11:27.440
Yeah, with driving, with autonomous vehicles, there seems to be a desire to just look for, think adversarially, try to figure out how to mess up the system. And if you can be robust to all those difficult cases, then it's a hand wavy empirical way to show your system is safe.
00:11:47.040
Today, most adversarial example research isn't really focused on a particular use case, but there are a lot of different use cases where you'd like to make sure that the adversary can't interfere with the operation of your system. Like in finance, if you have an algorithm making trades for you, people go to a lot of effort to obfuscate their algorithm. That's partly to protect their IP, because you don't want to research and develop a profitable trading algorithm and then have somebody else capture the gains. But it's at least partly because you don't want people to make adversarial examples that fool your algorithm into making bad trades. Or, I guess, one area that's been popular in the academic literature is speech recognition. If you use speech recognition to hear an audio waveform and then turn that into a command that a phone executes for you, you don't want a malicious adversary to be able to produce audio that gets interpreted as malicious commands, especially if a human in the room doesn't realize that something like that is happening.
00:12:50.320
And in speech recognition, has there been much success in being able to create adversarial examples that fool the system?
00:12:59.760
Yeah, actually. I guess the first work that I'm aware of is a paper called Hidden Voice Commands that came out in 2016, I believe. And they were able to show that they could make sounds that are not understandable by a human but are recognized as the target phrase that the attacker wants the phone to recognize them as. Since then, things have gotten a little bit better for the attacker and worse for the defender. It's become possible to make sounds that sound like normal speech but are actually interpreted as a different sentence than the human hears. The level of perceptibility of the adversarial perturbation is still kind of high. When you listen to the recording, it sounds like there's some noise in the background, just like rustling sounds, but those rustling sounds are actually the adversarial perturbation that makes the phone hear a completely different sentence.
00:13:58.040
Yeah, that's so fascinating. Peter Norvig mentioned that you're writing the deep learning chapter for the fourth edition of the book Artificial Intelligence: A Modern Approach. So how do you even begin summarizing the field of deep learning in a chapter?
00:14:13.080
Well, in my case, I waited like a year before I actually wrote anything. Even having written a full length textbook before, it's still pretty intimidating to try to start writing just one chapter that covers everything. One thing that helped me make that plan was actually the experience of having written the full book before and then watching how the field changed after the book came out. I realized there's a lot of topics that were maybe extraneous in the first book, and just seeing what stood the test of a few years of being published, and what seems a little bit less important to have included now, helped me pare down the topics I wanted to cover for the book. It's also really nice now that the field has kind of stabilized to the point where some core ideas from the 1980s are still used today. When I first started studying machine learning, almost everything from the 1980s had been rejected, and now some of it has come back. So that stuff that's really stood the test of time is what I focused on putting into the book.
00:15:16.960
There's also, I guess, two different philosophies about how you might write a book. One philosophy is you try to write a reference that covers everything. The other philosophy is you try to provide a high level summary that gives people the language to understand a field and tells them what the most important concepts are. The first deep learning book that I wrote with Yoshua and Aaron was somewhere between the two philosophies, in that it's trying to be both a reference and an introductory guide. Writing this chapter for Russell and Norvig's book, I was able to focus more on just a concise introduction of the key concepts and the language you need to read about them more. In a lot of cases, I actually just wrote paragraphs that said, here's a rapidly evolving area that you should pay attention to. It's pointless to try to tell you what the latest and best version of a learning to learn model is. I can point you to a paper that's recent right now, but there isn't a whole lot of reason to delve into exactly what's going on with the latest learning to learn approach, or the latest module produced by a learning to learn algorithm. You should know that learning to learn is a thing, and that it may very well be the source of the latest and greatest convolutional net or recurrent net module that you would want to use in your latest project. But there isn't a lot of point in trying to summarize exactly which architecture and which learning approach got to which level of performance.
00:16:44.060
So you maybe focus more on the basics of the methodology. So from back propagation to feed forward to recurrent neural networks, convolutional, that kind of thing?
00:16:55.320
Yeah, yeah.
00:16:56.480
So if I were to ask you, I remember I took an algorithms and data structures course, and I remember the professor asked, what is an algorithm? And he yelled at everybody, in a good way, that nobody was answering it correctly. It was a graduate course; everybody knew what an algorithm was, but they weren't able to answer it well. So let me ask you, in that same spirit, what is deep learning?
00:17:24.540
I would say deep learning is any kind of machine learning that involves learning parameters of more than one consecutive step. So, I mean, shallow learning is things where you learn a lot of operations that happen in parallel. You might have a system that makes multiple steps, like you might have hand designed feature extractors, but really only one step is learned. Deep learning is anything where you have multiple operations in sequence, and that includes the things that are really popular today, like convolutional networks and recurrent networks. But it also includes some of the things that have died out, like Boltzmann machines, where we weren't using back propagation. Today I hear a lot of people define deep learning as gradient descent applied to these differentiable functions, and I think that's a legitimate usage of the term. It's just different from the way that I use the term myself.
00:18:27.820
So what's an example of deep learning that is not gradient descent and differentiable functions? Not in your work specifically, perhaps, but, even looking into the future, what's your thought about that space of approaches?
00:18:44.300
Yeah, so I tend to think of machine learning algorithms as decomposed into really three different pieces. There's the model, which can be something like a neural net or a Boltzmann machine or a recurrent model, and that basically just describes how you take data, how you take parameters, and what function you use to make a prediction given the data and the parameters. Another piece of the learning algorithm is the optimization algorithm. Or, not every algorithm can really be described in terms of optimization, but, what's the algorithm for updating the parameters, or updating whatever the state of the network is? And then the last part is the data set: how do you actually represent the world as it comes into your machine learning system?
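That three-piece decomposition can be written down directly. Everything below is a hypothetical sketch, not an API from any library: the model is a prediction function, the optimizer is an update rule, and the dataset is just an iterable of batches.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable

@dataclass
class LearningAlgorithm:
    # The three pieces described above, kept deliberately abstract.
    model: Callable[[Any, Any], Any]      # predict(params, x)
    optimizer: Callable[[Any, Any], Any]  # new_params = update(params, batch)
    dataset: Iterable[Any]                # how the world enters the system

    def fit(self, params: Any) -> Any:
        for batch in self.dataset:
            params = self.optimizer(params, batch)
        return params
```

The point of the decomposition is that the pieces vary independently: swapping the optimizer, say gradient descent for an evolution strategy, changes how you learn without touching what the model computes.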
00:19:33.140
So I think of deep learning as telling us something about what the model looks like. And basically, to qualify as deep, I say that it just has to have multiple layers. That can be multiple steps in a feed forward differentiable computation, or that can be multiple layers in a graphical model. There's a lot of ways that you could satisfy me that something has multiple steps that are each parameterized separately. I think of gradient descent as being all about that other piece, the how-do-you-actually-update-the-parameters piece. So you could imagine having a deep model, like a convolutional net, and training it with something like evolution or a genetic algorithm, and I would say that still qualifies as deep learning.
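A deep model trained without gradients, as just described, might look like the following toy sketch: a two-layer network fit to a toy regression task with a simple (1+lambda) evolution strategy. All sizes, constants, and the task itself are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
y = np.sin(X[:, 0])  # toy regression target

def predict(theta, X):
    # Two learned steps in sequence -- "deep" by the multiple-steps
    # criterion -- but trained below without any gradients.
    W1 = theta[:15].reshape(3, 5)
    w2 = theta[15:]
    return np.maximum(0, X @ W1) @ w2

def loss(theta):
    return float(np.mean((predict(theta, X) - y) ** 2))

theta = 0.1 * rng.normal(size=20)
initial_loss = loss(theta)

# (1+lambda) evolution strategy: mutate the parameters, keep any improvement.
for _ in range(200):
    candidates = [theta + 0.05 * rng.normal(size=20) for _ in range(8)]
    best = min(candidates, key=loss)
    if loss(best) < loss(theta):
        theta = best
```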
00:20:14.780
And then, in terms of models that aren't necessarily differentiable, I guess Boltzmann machines are probably the main example of something where you can't really take a derivative and use that for the learning process. But you can still argue that the model has many steps of processing that it applies when you run inference in the model.
00:20:35.760
So it's the steps of processing that's key. So Geoff Hinton suggests that we need to throw away back propagation and start all over. What do you think about that? What could an alternative direction of training neural networks look like?
00:20:50.940
I don't know that back propagation is gonna go away entirely. Most of the time, when we decide that a machine learning algorithm isn't on the critical path to research for improving AI, the algorithm doesn't die; it just becomes used for some specialized set of things. A lot of algorithms, like logistic regression, don't seem that exciting to AI researchers who are working on things like speech recognition or autonomous cars today, but there's still a lot of use for logistic regression in things like analyzing really noisy data in medicine and finance, or making really rapid predictions in really time limited contexts. So I think back propagation and gradient descent are around to stay, but they may not end up being everything that we need to get to real human level or superhuman AI.
00:21:42.380
Back propagation has been around for a few decades. So are you optimistic about us, as a community, being able to discover something better?
link |
00:21:56.800
Yeah, I am.
link |
00:21:57.640
I think we likely will find something that works better.
link |
00:22:01.820
You could imagine things like having stacks of models
link |
00:22:05.500
where some of the lower level models
link |
00:22:07.580
predict parameters of the higher level models.
link |
00:22:10.200
And so at the top level,
link |
00:22:12.140
you're not learning in terms of literally
link |
00:22:13.500
calculating gradients,
link |
00:22:14.460
but just predicting how different values will perform.
link |
00:22:17.700
You can kind of see that already in some areas
link |
00:22:19.580
like Bayesian optimization,
link |
00:22:21.380
where you have a Gaussian process
link |
00:22:22.940
that predicts how well different parameter values
link |
00:22:24.800
will perform.
link |
00:22:25.880
We already use those kinds of algorithms
link |
00:22:27.700
for things like hyperparameter optimization.
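The idea just described, predicting how well parameter values will perform instead of computing gradients for them, can be sketched in a few lines. This is a deliberately toy version: the nearest-neighbor "surrogate" stands in for the Gaussian process a real Bayesian optimizer would fit, and the objective function is made up for illustration.

```python
import random

def objective(lr):
    # Hypothetical expensive evaluation, e.g. validation loss after training.
    return (lr - 0.1) ** 2

def surrogate_predict(history, lr):
    # Toy surrogate: predict performance from the nearest observed point.
    # (A real Bayesian optimizer would fit a Gaussian process here.)
    nearest = min(history, key=lambda h: abs(h[0] - lr))
    return nearest[1]

random.seed(0)
history = [(lr, objective(lr)) for lr in (0.001, 0.9)]  # initial evaluations
for _ in range(20):
    # Score many cheap candidates with the surrogate, evaluate only the best.
    candidates = [random.uniform(0.0, 1.0) for _ in range(50)]
    proposal = min(candidates, key=lambda lr: surrogate_predict(history, lr))
    history.append((proposal, objective(proposal)))

best_lr, best_loss = min(history, key=lambda h: h[1])
```

The key point is that the top level never touches a gradient of the objective; it only predicts how different values will perform and spends its expensive evaluations on the most promising ones.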
link |
00:22:30.260
And in general, we know a lot of things other than back prop
link |
00:22:32.500
that work really well for specific problems.
link |
00:22:34.980
The main thing we haven't found is
link |
00:22:37.460
a way of taking one of these other
link |
00:22:38.880
non back prop based algorithms
link |
00:22:41.160
and having it really advanced the state of the art
link |
00:22:43.500
on an AI level problem.
link |
00:22:46.160
Right.
link |
00:22:47.100
But I wouldn't be surprised if eventually
link |
00:22:49.180
we find that some of these algorithms
link |
00:22:50.780
that even the ones that already exist,
link |
00:22:52.780
not necessarily even a new one,
link |
00:22:54.220
we might find some way of customizing
link |
00:22:58.180
one of these algorithms to do something really interesting
link |
00:23:00.540
at the level of cognition or the level of,
link |
00:23:06.420
I think one system that we really don't have working
link |
00:23:08.660
quite right yet is like short term memory.
link |
00:23:12.940
We have things like LSTMs,
link |
00:23:14.500
they're called long short term memory.
link |
00:23:16.980
They still don't do quite what a human does
link |
00:23:20.020
with short term memory.
link |
00:23:22.860
Like gradient descent to learn a specific fact
link |
00:23:26.940
has to do multiple steps on that fact.
link |
00:23:29.380
Like if I tell you the meeting today is at 3 p.m.,
link |
00:23:34.140
I don't need to say over and over again,
link |
00:23:35.460
it's at 3 p.m., it's at 3 p.m., it's at 3 p.m.,
link |
00:23:37.780
it's at 3 p.m.
link |
00:23:38.940
for you to do a gradient step on each one.
link |
00:23:40.380
You just hear it once and you remember it.
link |
00:23:43.180
There's been some work on things like self attention
link |
00:23:46.940
and attention like mechanisms,
link |
00:23:48.340
like the neural Turing machine
link |
00:23:50.420
that can write to memory cells
link |
00:23:52.220
and update themselves with facts like that right away.
link |
00:23:54.900
But I don't think we've really nailed it yet.
link |
00:23:56.900
And that's one area where I'd imagine
link |
00:23:59.580
that new optimization algorithms
link |
00:24:02.660
or different ways of applying
link |
00:24:03.780
existing optimization algorithms
link |
00:24:05.980
could give us a way of just lightning fast
link |
00:24:08.800
updating the state of a machine learning system
link |
00:24:11.180
to contain a specific fact like that
link |
00:24:14.100
without needing to have it presented
link |
00:24:15.340
over and over and over again.
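The contrast described here, a single write versus many gradient steps on the same fact, can be sketched with a toy external memory. Everything below is illustrative: the bag-of-characters `embed` stands in for a learned embedding, and the hard nearest-key read is a simplified version of the soft attention used by mechanisms like the neural Turing machine.

```python
import math

memory = []  # list of (key_vector, value) pairs

def embed(text):
    # Hypothetical embedding: a normalized bag-of-characters vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord('a')] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def write(key_text, value):
    # One-shot write: the fact is stored after a single presentation,
    # no repeated gradient steps needed.
    memory.append((embed(key_text), value))

def read(query_text):
    # Retrieve the value whose key is most similar to the query.
    q = embed(query_text)
    scores = [sum(a * b for a, b in zip(q, k)) for k, _ in memory]
    return memory[max(range(len(memory)), key=scores.__getitem__)][1]

write("meeting time today", "3 p.m.")
write("capital of France", "Paris")
```

Hearing "the meeting today is at 3 p.m." once is enough here, which is the behavior that gradient-based short-term memory still struggles to match.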
link |
00:24:16.980
So some of the success of symbolic systems in the 80s
link |
00:24:21.420
is they were able to assemble these kinds of facts better.
link |
00:24:26.220
But there's a lot of expert input required
link |
00:24:29.100
and it's very limited in that sense.
link |
00:24:31.140
Do you ever look back to that
link |
00:24:33.700
as something that we'll have to return to eventually?
link |
00:24:36.560
Sort of dust off the book from the shelf
link |
00:24:38.440
and think about how we build knowledge,
link |
00:24:41.340
representation, knowledge base.
link |
00:24:42.940
Like will we have to use graph searches?
link |
00:24:44.820
Graph searches, right.
link |
00:24:45.780
And like first order logic and entailment
link |
00:24:47.700
and things like that.
link |
00:24:48.540
That kind of thing, yeah, exactly.
link |
00:24:49.540
In my particular line of work,
link |
00:24:51.180
which has mostly been machine learning security
link |
00:24:54.540
and also generative modeling,
link |
00:24:56.740
I haven't usually found myself moving in that direction.
link |
00:25:00.560
For generative models, I could see a little bit of,
link |
00:25:03.500
it could be useful if you had something
link |
00:25:04.920
like a differentiable knowledge base
link |
00:25:09.660
or some other kind of knowledge base
link |
00:25:10.980
where it's possible for some of our
link |
00:25:13.140
fuzzier machine learning algorithms
link |
00:25:14.860
to interact with a knowledge base.
link |
00:25:16.900
I mean, a neural network is kind of like that.
link |
00:25:19.060
It's a differentiable knowledge base of sorts.
link |
00:25:21.480
Yeah.
link |
00:25:22.320
But.
link |
00:25:23.660
If we had a really easy way of giving feedback
link |
00:25:27.660
to machine learning models,
link |
00:25:29.260
that would clearly help a lot with generative models.
link |
00:25:32.420
And so you could imagine one way of getting there
link |
00:25:33.940
would be get a lot better at natural language processing.
link |
00:25:36.760
But another way of getting there would be
link |
00:25:38.960
take some kind of knowledge base
link |
00:25:40.300
and figure out a way for it to actually
link |
00:25:42.340
interact with a neural network.
link |
00:25:44.100
Being able to have a chat with a neural network.
link |
00:25:46.100
Yeah.
link |
00:25:47.900
So like one thing in generative models we see a lot today
link |
00:25:50.020
is you'll get things like faces that are not symmetrical,
link |
00:25:54.780
like people that have two eyes that are different colors.
link |
00:25:58.580
I mean, there are people with eyes
link |
00:25:59.580
that are different colors in real life,
link |
00:26:00.900
but not nearly as many of them as you tend to see
link |
00:26:03.500
in the machine learning generated data.
link |
00:26:06.140
So if you had either a knowledge base
link |
00:26:08.140
that could contain the fact,
link |
00:26:10.220
people's faces are generally approximately symmetric
link |
00:26:13.380
and eye color is especially likely
link |
00:26:15.940
to be the same on both sides.
link |
00:26:17.980
Being able to just inject that hint
link |
00:26:20.200
into the machine learning model
link |
00:26:22.060
without it having to discover that itself
link |
00:26:23.860
after studying a lot of data
link |
00:26:25.820
would be a really useful feature.
link |
00:26:28.380
I could see a lot of ways of getting there
link |
00:26:30.180
without bringing back some of the 1980s technology,
link |
00:26:32.220
but I also see some ways that you could imagine
link |
00:26:35.180
extending the 1980s technology to play nice with neural nets
link |
00:26:38.260
and have it help get there.
link |
00:26:40.080
Awesome.
link |
00:26:40.920
So you talked about the story of you coming up
link |
00:26:44.380
with the idea of GANs at a bar with some friends.
link |
00:26:47.020
You were arguing that this, you know, GANs would work,
link |
00:26:51.380
generative adversarial networks,
link |
00:26:53.060
and the others didn't think so.
link |
00:26:54.660
Then you went home at midnight, coded it up, and it worked.
link |
00:26:58.420
So if I was a friend of yours at the bar,
link |
00:27:01.340
I would also have doubts.
link |
00:27:02.700
It's a really nice idea,
link |
00:27:03.860
but I'm very skeptical that it would work.
link |
00:27:06.820
What was the basis of their skepticism?
link |
00:27:09.300
What was the basis of your intuition why it should work?
link |
00:27:14.340
I don't want to be someone who goes around
link |
00:27:15.980
promoting alcohol for the purposes of science,
link |
00:27:18.280
but in this case,
link |
00:27:20.020
I do actually think that drinking helped a little bit.
link |
00:27:23.060
When your inhibitions are lowered,
link |
00:27:25.360
you're more willing to try out things
link |
00:27:27.380
that you wouldn't try out otherwise.
link |
00:27:29.620
So I have noticed in general
link |
00:27:32.460
that I'm less prone to shooting down some of my own ideas
link |
00:27:34.540
when I have had a little bit to drink.
link |
00:27:37.960
I think if I had had that idea at lunchtime,
link |
00:27:41.020
I probably would have thought,
link |
00:27:42.260
it's hard enough to train one neural net,
link |
00:27:43.720
you can't train a second neural net
link |
00:27:44.880
in the inner loop of the outer neural net.
link |
00:27:48.080
That was basically my friend's objection,
link |
00:27:49.820
was that trying to train two neural nets at the same time
link |
00:27:52.740
would be too hard.
link |
00:27:54.260
So it was more about the training process,
link |
00:27:56.140
whereas my skepticism would be,
link |
00:27:58.300
you know, I'm sure you could train it,
link |
00:28:01.140
but the thing it would converge to
link |
00:28:03.180
would not be able to generate anything reasonable,
link |
00:28:05.820
any kind of reasonable realism.
link |
00:28:08.260
Yeah, so part of what all of us were thinking about
link |
00:28:11.360
when we had this conversation was deep Boltzmann machines,
link |
00:28:15.280
which a lot of us in the lab, including me,
link |
00:28:16.980
were big fans of deep Boltzmann machines at the time.
link |
00:28:20.660
They involved two separate processes
link |
00:28:22.920
running at the same time.
link |
00:28:25.060
One of them is called the positive phase,
link |
00:28:28.140
where you load data into the model
link |
00:28:31.160
and tell the model to make the data more likely.
link |
00:28:33.540
The other one is called the negative phase,
link |
00:28:35.140
where you draw samples from the model
link |
00:28:37.020
and tell the model to make those samples less likely.
link |
00:28:41.180
In a deep Boltzmann machine,
link |
00:28:42.220
it's not trivial to generate a sample.
link |
00:28:43.960
You have to actually run an iterative process
link |
00:28:46.980
that gets better and better samples
link |
00:28:49.140
coming closer and closer to the distribution
link |
00:28:51.380
the model represents.
link |
00:28:52.840
So during the training process,
link |
00:28:53.900
you're always running these two systems at the same time,
link |
00:28:56.940
one that's updating the parameters of the model
link |
00:28:58.940
and another one that's trying to generate samples
link |
00:29:00.500
from the model.
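The two phases just described can be sketched with a tiny restricted Boltzmann machine trained by one step of contrastive divergence (CD-1). This is a minimal illustration, not the full deep Boltzmann machine: a DBM stacks several such layers and keeps both phases running together, which is part of what made it hard to keep synchronized at scale.

```python
import math
import random

random.seed(0)
N_V, N_H = 6, 4  # visible and hidden units
W = [[random.gauss(0.0, 0.1) for _ in range(N_H)] for _ in range(N_V)]

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def hidden_probs(v):
    return [sigmoid(sum(v[i] * W[i][j] for i in range(N_V))) for j in range(N_H)]

def visible_probs(h):
    return [sigmoid(sum(W[i][j] * h[j] for j in range(N_H))) for i in range(N_V)]

def cd1_update(v_data, lr=0.1):
    # Positive phase: load data into the model and raise its probability.
    h_data = hidden_probs(v_data)
    # Negative phase: draw an (approximate) sample and lower its probability.
    h_sample = [1.0 if random.random() < p else 0.0 for p in h_data]
    v_model = visible_probs(h_sample)
    h_model = hidden_probs(v_model)
    for i in range(N_V):
        for j in range(N_H):
            W[i][j] += lr * (v_data[i] * h_data[j] - v_model[i] * h_model[j])

v = [1.0, 1.0, 0.0, 0.0, 1.0, 0.0]
for _ in range(100):
    cd1_update(v)
```

Even in this toy, sampling for the negative phase is an iterative, stochastic process running alongside the parameter updates, which is the synchronization problem discussed here.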
link |
00:29:01.660
And they worked really well in things like MNIST,
link |
00:29:04.340
but a lot of us in the lab, including me,
link |
00:29:05.820
had tried to get deep Boltzmann machines
link |
00:29:07.500
to scale past MNIST to things like generating color photos,
link |
00:29:11.900
and we just couldn't get the two processes
link |
00:29:14.120
to stay synchronized.
link |
00:29:17.380
So when I had the idea for GANs,
link |
00:29:18.740
a lot of people thought that the discriminator
link |
00:29:20.340
would have more or less the same problem
link |
00:29:22.580
as the negative phase in the Boltzmann machine,
link |
00:29:25.320
that trying to train the discriminator in the inner loop,
link |
00:29:27.800
you just couldn't get it to keep up
link |
00:29:29.920
with the generator in the outer loop,
link |
00:29:31.540
and that would prevent it from converging
link |
00:29:33.820
to anything useful.
link |
00:29:35.220
Yeah, I share that intuition.
link |
00:29:36.840
Yeah.
link |
00:29:39.540
But it turns out not to be the case.
link |
00:29:41.940
A lot of the time with machine learning algorithms,
link |
00:29:43.760
it's really hard to predict ahead of time
link |
00:29:45.180
how well they'll actually perform.
link |
00:29:46.900
You have to just run the experiment and see what happens.
link |
00:29:49.140
And I would say I still today don't have
link |
00:29:52.500
like one factor I can put my finger on and say,
link |
00:29:54.780
this is why GANs worked for photo generation
link |
00:29:58.340
and deep Boltzmann machines don't.
link |
00:30:01.980
There are a lot of theory papers
link |
00:30:03.300
showing that under some theoretical settings,
link |
00:30:06.340
the GAN algorithm does actually converge,
link |
00:30:10.680
but those settings are restricted enough
link |
00:30:14.140
that they don't necessarily explain the whole picture
link |
00:30:17.520
in terms of all the results that we see in practice.
link |
00:30:20.740
So taking a step back,
link |
00:30:22.300
can you, in the same way as we talked about deep learning,
link |
00:30:24.860
can you tell me what generative adversarial networks are?
link |
00:30:29.420
Yeah, so generative adversarial networks
link |
00:30:31.380
are a particular kind of generative model.
link |
00:30:33.980
A generative model is a machine learning model
link |
00:30:36.280
that can train on some set of data.
link |
00:30:38.860
Like, so you have a collection of photos of cats
link |
00:30:41.220
and you want to generate more photos of cats,
link |
00:30:43.980
or you want to estimate a probability distribution over cats.
link |
00:30:47.700
So you can ask how likely it is
link |
00:30:49.800
that some new image is a photo of a cat.
link |
00:30:52.860
GANs are one way of doing this.
link |
00:30:55.800
Some generative models are good at creating new data.
link |
00:30:59.180
Other generative models are good at estimating
link |
00:31:01.620
that density function and telling you how likely
link |
00:31:04.140
particular pieces of data are to come
link |
00:31:07.180
from the same distribution as the training data.
link |
00:31:09.700
GANs are more focused on generating samples
link |
00:31:12.420
rather than estimating the density function.
link |
00:31:15.600
There are some kinds of GANs like FlowGAN that can do both,
link |
00:31:18.500
but mostly GANs are about generating samples,
link |
00:31:21.620
generating new photos of cats that look realistic.
link |
00:31:24.220
And they do that completely from scratch.
link |
00:31:29.340
It's analogous to human imagination.
link |
00:31:32.240
When a GAN creates a new image of a cat,
link |
00:31:34.780
it's using a neural network to produce a cat
link |
00:31:39.300
that has not existed before.
link |
00:31:41.040
It isn't doing something like compositing photos together.
link |
00:31:44.540
You're not literally taking the eye off of one cat
link |
00:31:47.100
and the ear off of another cat.
link |
00:31:48.300
It's more of this digestive process
link |
00:31:51.380
where the neural net trains on a lot of data
link |
00:31:53.940
and comes up with some representation
link |
00:31:55.580
of the probability distribution
link |
00:31:57.420
and generates entirely new cats.
link |
00:31:59.820
There are a lot of different ways
link |
00:32:00.900
of building a generative model.
link |
00:32:01.980
What's specific to GANs is that we have a two player game
link |
00:32:05.680
in the game theoretic sense.
link |
00:32:08.100
And as the players in this game compete,
link |
00:32:10.340
one of them becomes able to generate realistic data.
link |
00:32:13.940
The first player is called the generator.
link |
00:32:16.140
It produces output data such as just images, for example.
link |
00:32:20.660
And at the start of the learning process,
link |
00:32:22.460
it'll just produce completely random images.
link |
00:32:25.140
The other player is called the discriminator.
link |
00:32:27.400
The discriminator takes images as input
link |
00:32:29.700
and guesses whether they're real or fake.
link |
00:32:32.540
You train it both on real data,
link |
00:32:34.260
so photos that come from your training set,
link |
00:32:36.140
actual photos of cats,
link |
00:32:37.860
and you train it to say that those are real.
link |
00:32:39.900
You also train it on images
link |
00:32:41.980
that come from the generator network
link |
00:32:43.860
and you train it to say that those are fake.
link |
00:32:46.740
As the two players compete in this game,
link |
00:32:49.220
the discriminator tries to become better
link |
00:32:50.960
at recognizing whether images are real or fake.
link |
00:32:53.340
And the generator becomes better
link |
00:32:54.800
at fooling the discriminator into thinking
link |
00:32:57.020
that its outputs are real.
link |
00:33:00.820
And you can analyze this through the language of game theory
link |
00:33:03.580
and find that there's a Nash equilibrium
link |
00:33:06.940
where the generator has captured
link |
00:33:08.620
the correct probability distribution.
link |
00:33:10.820
So in the cat example,
link |
00:33:12.180
it makes perfectly realistic cat photos.
link |
00:33:14.580
And the discriminator is unable to do better
link |
00:33:17.180
than random guessing
link |
00:33:18.740
because all the samples coming from both the data
link |
00:33:21.860
and the generator look equally likely
link |
00:33:24.060
to have come from either source.
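The two-player game just described can be written out end to end on a toy problem. This is a sketch under heavy simplifying assumptions: both "networks" are single affine/logistic units, the real data is a 1-D Gaussian, and the gradients are derived by hand rather than by backprop through deep models.

```python
import math
import random

random.seed(0)

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

a, b = 1.0, 0.0   # generator: G(z) = a*z + b, tries to mimic N(4, 1)
w, c = 0.0, 0.0   # discriminator: D(x) = sigmoid(w*x + c)
lr = 0.05

for step in range(2000):
    x_real = random.gauss(4.0, 1.0)
    z = random.gauss(0.0, 1.0)
    x_fake = a * z + b

    # Discriminator update: push D(x_real) toward 1 and D(x_fake) toward 0.
    d_real, d_fake = sigmoid(w * x_real + c), sigmoid(w * x_fake + c)
    w += lr * ((1.0 - d_real) * x_real - d_fake * x_fake)
    c += lr * ((1.0 - d_real) - d_fake)

    # Generator update (non-saturating loss): push D(G(z)) toward 1.
    d_fake = sigmoid(w * x_fake + c)
    g = (1.0 - d_fake) * w
    a += lr * g * z
    b += lr * g

fake_mean = sum(a * random.gauss(0.0, 1.0) + b for _ in range(1000)) / 1000.0
```

At the equilibrium described above, the generator's samples match the data distribution and the discriminator's outputs hover around one half, because it can no longer do better than guessing.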
link |
00:33:25.860
So do you ever sit back
link |
00:33:28.380
and does it just blow your mind that this thing works?
link |
00:33:31.300
So from very,
link |
00:33:33.380
so it's able to estimate that density function
link |
00:33:35.860
enough to generate realistic images.
link |
00:33:38.700
I mean, does it, yeah.
link |
00:33:40.860
Do you ever sit back and think how does this even,
link |
00:33:44.700
why, this is quite incredible,
link |
00:33:46.780
especially where GANs have gone in terms of realism.
link |
00:33:49.260
Yeah, and not just to flatter my own work,
link |
00:33:51.620
but generative models,
link |
00:33:53.840
all of them have this property that
link |
00:33:56.500
if they really did what we ask them to do,
link |
00:33:58.800
they would do nothing but memorize the training data.
link |
00:34:01.060
Right, exactly.
link |
00:34:02.920
Models that are based on maximizing the likelihood,
link |
00:34:05.740
the way that you obtain the maximum likelihood
link |
00:34:08.140
for a specific training set
link |
00:34:09.700
is you assign all of your probability mass
link |
00:34:12.380
to the training examples and nowhere else.
link |
00:34:15.100
For GANs, the game is played using a training set.
link |
00:34:18.380
So the way that you become unbeatable in the game
link |
00:34:21.140
is you literally memorize training examples.
link |
00:34:25.340
One of my former interns wrote a paper,
link |
00:34:28.860
his name is Vaishnavh Nagarajan,
link |
00:34:31.020
and he showed that it's actually hard for the generator
link |
00:34:33.860
to memorize the training data,
link |
00:34:36.060
hard in a statistical learning theory sense,
link |
00:34:39.100
that you can actually create reasons
link |
00:34:42.140
for why it would require quite a lot of learning steps
link |
00:34:48.340
and a lot of observations of different latent variables
link |
00:34:52.140
before you could memorize the training data.
link |
00:34:54.300
That still doesn't really explain why
link |
00:34:56.140
when you produce samples that are new,
link |
00:34:58.200
why do you get compelling images
link |
00:34:59.820
rather than just garbage
link |
00:35:01.820
that's different from the training set.
link |
00:35:03.720
And I don't think we really have a good answer for that,
link |
00:35:06.900
especially if you think about
link |
00:35:07.880
how many possible images are out there
link |
00:35:10.180
and how few images the generative model sees
link |
00:35:14.020
during training.
link |
00:35:15.420
It seems just unreasonable
link |
00:35:16.900
that generative models create new images as well as they do,
link |
00:35:20.740
especially considering that we're basically
link |
00:35:22.700
training them to memorize rather than generalize.
link |
00:35:26.180
I think part of the answer is
link |
00:35:28.180
there's a paper called Deep Image Prior
link |
00:35:30.820
where they show that you can take a convolutional net
link |
00:35:33.060
and you don't even need to learn
link |
00:35:34.020
the parameters of it at all,
link |
00:35:34.980
you just use the model architecture.
link |
00:35:36.780
And it's already useful for things like inpainting images.
link |
00:35:40.260
I think that shows us
link |
00:35:41.500
that the convolutional network architecture
link |
00:35:43.580
captures something really important
link |
00:35:45.100
about the structure of images.
link |
00:35:47.180
And we don't need to actually use the learning
link |
00:35:50.180
to capture all the information
link |
00:35:51.460
coming out of the convolutional net.
link |
00:35:54.500
That would imply that it would be much harder
link |
00:35:57.660
to make generative models in other domains.
link |
00:36:00.500
So far, we're able to make reasonable speech models
link |
00:36:02.900
and things like that.
link |
00:36:04.100
But to be honest, we haven't actually explored
link |
00:36:06.780
a whole lot of different data sets all that much.
link |
00:36:09.100
We don't, for example, see a lot of deep learning models
link |
00:36:13.260
of like biology data sets
link |
00:36:17.780
where you have lots of microarrays measuring
link |
00:36:20.260
the amount of different enzymes and things like that.
link |
00:36:22.180
So we may find that some of the progress
link |
00:36:24.620
that we've seen for images and speech
link |
00:36:26.220
turns out to really rely heavily on the model architecture.
link |
00:36:29.460
And we were able to do what we did for vision
link |
00:36:32.300
by trying to reverse engineer the human visual system.
link |
00:36:35.540
And maybe it'll turn out that we can't just use
link |
00:36:39.380
that same trick for arbitrary kinds of data.
link |
00:36:42.860
Right, so there's aspect to the human vision system,
link |
00:36:45.340
the hardware of it, that makes it without learning,
link |
00:36:49.540
without cognition, just makes it really effective
link |
00:36:51.980
at detecting the patterns we see in the visual world.
link |
00:36:54.340
Yeah.
link |
00:36:55.180
Yeah, that's really interesting.
link |
00:36:57.140
What, in a big, quick overview,
link |
00:37:01.660
in your view, what types of GANs are there
link |
00:37:05.740
and what other generative models besides GANs are there?
link |
00:37:09.540
Yeah, so it's maybe a little bit easier to start
link |
00:37:12.820
with what kinds of generative models are there
link |
00:37:14.420
other than GANs.
link |
00:37:16.340
So most generative models are likelihood based
link |
00:37:20.340
where to train them, you have a model that tells you
link |
00:37:24.340
how much probability it assigns to a particular example
link |
00:37:28.580
and you just maximize the probability assigned
link |
00:37:30.980
to all the training examples.
link |
00:37:33.220
It turns out that it's hard to design a model
link |
00:37:35.740
that can create really complicated images
link |
00:37:38.740
or really complicated audio waveforms
link |
00:37:41.820
and still have it be possible to estimate
link |
00:37:45.740
the likelihood function from a computational point of view.
link |
00:37:51.740
Most interesting models that you would just write down
link |
00:37:53.740
intuitively, it turns out that it's almost impossible
link |
00:37:56.580
to calculate the amount of probability they assign
link |
00:37:58.980
to a particular point.
link |
00:38:01.300
So there's a few different schools of generative models
link |
00:38:04.380
in the likelihood family.
link |
00:38:07.060
One approach is to very carefully design the model
link |
00:38:09.860
so that it is computationally tractable
link |
00:38:12.420
to measure the density it assigns to a particular point.
link |
00:38:15.180
So there are things like autoregressive models,
link |
00:38:18.780
like PixelCNN, those basically break down
link |
00:38:23.580
the probability distribution into a product
link |
00:38:26.460
over every single feature.
link |
00:38:28.300
So for an image, you estimate the probability
link |
00:38:31.180
of each pixel given all of the pixels that came before it.
link |
00:38:35.420
There's tricks where if you want to measure
link |
00:38:37.300
the density function, you can actually calculate
link |
00:38:40.260
the density for all these pixels more or less in parallel.
link |
00:38:44.100
Generating the image still tends to require you
link |
00:38:46.500
to go one pixel at a time, and that can be very slow.
link |
00:38:50.460
But there are, again, tricks for doing this
link |
00:38:52.620
in a hierarchical pattern where you can keep
link |
00:38:54.180
the runtime under control.
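The autoregressive factorization behind models like PixelCNN can be shown in miniature with characters instead of pixels: the joint probability breaks down into a product of per-step conditionals, density evaluation uses all the conditionals at once, and generation goes one symbol at a time. The bigram "model" below conditions only on the previous symbol and the corpus is made up; PixelCNN conditions on all previous pixels with a convolutional net.

```python
import random
from collections import Counter, defaultdict

random.seed(0)
corpus = "abab abab abba abab"  # toy training data

# Estimate p(next | prev) by counting bigrams.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def prob(seq):
    # Density evaluation: a product over per-step conditionals.
    p = 1.0
    for prev, nxt in zip(seq, seq[1:]):
        total = sum(counts[prev].values())
        p *= counts[prev][nxt] / total if total else 0.0
    return p

def sample(first, length):
    # Generation is inherently sequential: one symbol at a time.
    out = [first]
    for _ in range(length - 1):
        choices, weights = zip(*counts[out[-1]].items())
        out.append(random.choices(choices, weights=weights)[0])
    return "".join(out)
```

In the image case, `prob` corresponds to the density calculation that can run over all pixels more or less in parallel, while `sample` is the slow pixel-by-pixel generation loop that the hierarchical tricks mentioned here try to speed up.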
link |
00:38:55.780
Are the quality of the images it generates,
link |
00:38:59.340
putting runtime aside, pretty good?
link |
00:39:02.660
They're reasonable, yeah.
link |
00:39:04.420
I would say a lot of the best results
link |
00:39:07.460
are from GANs these days, but it can be hard to tell
link |
00:39:11.060
how much of that is based on who's studying
link |
00:39:14.700
which type of algorithm, if that makes sense.
link |
00:39:17.260
The amount of effort invested in a particular.
link |
00:39:18.900
Yeah, or like the kind of expertise.
link |
00:39:21.420
So a lot of people who've traditionally been excited
link |
00:39:23.140
about graphics or art and things like that
link |
00:39:25.060
have gotten interested in GANs.
link |
00:39:27.020
And to some extent, it's hard to tell
link |
00:39:28.740
are GANs doing better because they have a lot
link |
00:39:31.740
of graphics and art experts behind them,
link |
00:39:34.700
or are GANs doing better because they're more
link |
00:39:37.060
computationally efficient, or are GANs doing better
link |
00:39:40.300
because they prioritize the realism of samples
link |
00:39:43.460
over the accuracy of the density function.
link |
00:39:45.540
I think all of those are potentially valid explanations,
link |
00:39:48.660
and it's hard to tell.
link |
00:39:51.300
So can you give a brief history of GANs from 2014?
link |
00:39:57.620
Was your paper in 2013?
link |
00:39:59.260
Yeah, so a few highlights.
link |
00:40:00.980
In the first paper, we just showed
link |
00:40:03.140
that GANs basically work.
link |
00:40:04.740
If you look back at the samples we had now,
link |
00:40:06.620
they look terrible.
link |
00:40:08.820
On the CIFAR-10 data set,
link |
00:40:10.020
you can't even recognize objects in them.
link |
00:40:12.220
Your paper, sorry, you used CIFAR-10?
link |
00:40:15.020
We used MNIST, which is little handwritten digits.
link |
00:40:18.060
We used the Toronto Face database,
link |
00:40:19.860
which is small grayscale photos of faces.
link |
00:40:22.660
We did have recognizable faces.
link |
00:40:24.180
My colleague Bing Xu put together
link |
00:40:25.660
the first GAN face model for that paper.
link |
00:40:29.660
We also had the CIFAR-10 data set,
link |
00:40:32.940
which is things like very small 32 by 32 pixels
link |
00:40:36.060
of cars and cats and dogs.
link |
00:40:40.660
For that, we didn't get recognizable objects,
link |
00:40:42.980
but all the deep learning people back then
link |
00:40:46.140
were really used to looking at these failed samples
link |
00:40:48.380
and kind of reading them like tea leaves.
link |
00:40:50.420
And people who are used to reading the tea leaves
link |
00:40:53.020
recognized that our tea leaves at least looked different.
link |
00:40:56.500
Maybe not necessarily better,
link |
00:40:57.820
but there was something unusual about them.
link |
00:41:01.220
And that got a lot of us excited.
link |
00:41:03.620
One of the next really big steps was LAPGAN
link |
00:41:06.180
by Emily Denton and Soumith Chintala at Facebook AI Research,
link |
00:41:10.900
where they actually got really good high resolution photos
link |
00:41:14.460
working with GANs for the first time.
link |
00:41:16.580
They had a complicated system
link |
00:41:18.140
where they generated the image starting at low res
link |
00:41:20.100
and then scaling up to high res,
link |
00:41:22.900
but they were able to get it to work.
link |
00:41:24.900
And then in 2015, I believe later that same year,
link |
00:41:31.700
Alec Radford and Soumith Chintala and Luke Metz
link |
00:41:35.940
published the DCGAN paper,
link |
00:41:38.420
which it stands for deep convolutional GAN.
link |
00:41:41.860
It's kind of a non-unique name
link |
00:41:43.740
because these days basically all GANs
link |
00:41:46.420
and even some before that were deep and convolutional,
link |
00:41:48.380
but they just kind of picked a name
link |
00:41:50.220
for a really great recipe
link |
00:41:52.260
where they were able to, using only one model
link |
00:41:55.380
instead of a multi-step process,
link |
00:41:57.300
actually generate realistic images of faces
link |
00:41:59.700
and things like that.
link |
00:42:01.980
That was sort of like the beginning
link |
00:42:05.260
of the Cambrian explosion of GANs.
link |
00:42:07.380
Like once you had animals that had a backbone,
link |
00:42:09.740
you suddenly got lots of different versions of fish
link |
00:42:12.900
and four legged animals and things like that.
link |
00:42:15.340
So DCGAN became kind of the backbone
link |
00:42:17.940
for many different models that came out.
link |
00:42:19.420
It's used as a baseline even still.
link |
00:42:21.620
Yeah, yeah.
link |
00:42:23.140
And so from there,
link |
00:42:24.820
I would say some interesting things we've seen
link |
00:42:26.580
are there's a lot you can say
link |
00:42:29.420
about how just the quality
link |
00:42:30.940
of standard image generation GANs has increased,
link |
00:42:33.580
but what's also maybe more interesting
link |
00:42:35.100
on an intellectual level
link |
00:42:36.020
is how the things you can use GANs for have also changed.
link |
00:42:41.020
One thing is that you can use them to learn classifiers
link |
00:42:44.580
without having to have class labels
link |
00:42:46.660
for every example in your training set.
link |
00:42:48.940
So that's called semi supervised learning.
link |
00:42:51.780
My colleague at OpenAI, Tim Salimans,
link |
00:42:53.820
who's at Brain now,
link |
00:42:55.820
wrote a paper called Improved Techniques for Training GANs.
link |
00:42:59.780
I'm a coauthor on this paper,
link |
00:43:00.900
but I can't claim any credit for this particular part.
link |
00:43:03.700
One thing he showed in the paper
link |
00:43:04.900
is that you can take the GAN discriminator
link |
00:43:07.820
and use it as a classifier that actually tells you,
link |
00:43:11.540
this image is a cat, this image is a dog,
link |
00:43:13.620
this image is a car, this image is a truck, and so on.
link |
00:43:16.420
Not just to say whether the image is real or fake,
link |
00:43:18.820
but if it is real to say specifically
link |
00:43:20.700
what kind of object it is.
link |
00:43:22.620
And he found that you can train these classifiers
link |
00:43:25.340
with far fewer labeled examples
link |
00:43:28.580
than traditional classifiers.
link |
00:43:30.620
So if you supervise based not just on
link |
00:43:33.660
your discrimination ability,
link |
00:43:35.300
but your ability to classify,
link |
00:43:36.820
you're going to do much,
link |
00:43:38.660
you're going to converge much faster
link |
00:43:40.100
to being effective at being a discriminator.
link |
00:43:43.300
Yeah.
link |
00:43:44.260
So for example, for the MNIST dataset,
link |
00:43:46.340
you want to look at an image of a handwritten digit
link |
00:43:48.860
and say whether it's a zero, a one, or a two, and so on.
link |
00:43:54.180
To get down to less than 1% error
link |
00:43:56.980
required around 60,000 examples
link |
00:44:00.260
until maybe about 2014 or so.
link |
00:44:02.780
In 2016 with this semi supervised GAN project,
link |
00:44:07.460
Tim was able to get below 1% error
link |
00:44:11.060
using only 100 labeled examples.
link |
00:44:13.620
So that was about a 600X decrease
link |
00:44:15.980
in the amount of labels that he needed.
link |
00:44:17.980
He's still using more images than that,
link |
00:44:21.060
but he doesn't need to have each of them labeled
link |
00:44:22.740
as this one's a one, this one's a two,
link |
00:44:25.100
this one's a zero, and so on.
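The discriminator-as-classifier idea described here uses K real classes plus one extra "fake" class, with labeled, unlabeled, and generated examples each contributing their own loss term. The sketch below just writes out those three terms; the logits are placeholders that a real model would produce with a neural network, and the exact formulation follows the general shape of the Improved Techniques for Training GANs approach rather than reproducing it in full.

```python
import math

def softmax(logits):
    # Stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

K = 3  # number of real classes; index K is the extra "fake" class

def labeled_loss(logits, label):
    # Supervised term: cross-entropy against the true class.
    return -math.log(softmax(logits)[label])

def unlabeled_loss(logits):
    # Unlabeled real data only has to be "not fake" (any real class).
    return -math.log(1.0 - softmax(logits)[K])

def fake_loss(logits):
    # Generated data should be assigned to the fake class.
    return -math.log(softmax(logits)[K])

example_logits = [2.0, 0.1, -1.0, 0.5]  # placeholder discriminator output
```

The unlabeled term is what lets the many images without labels still shape the classifier, which is why so few labeled examples are needed.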
link |
00:44:27.020
Then to be able to,
link |
00:44:28.460
for GANs to be able to generate recognizable objects,
link |
00:44:31.220
so objects from a particular class,
link |
00:44:33.420
you still need labeled data
link |
00:44:37.020
because you need to know what it means
link |
00:44:38.900
to be a particular class cat, dog.
link |
00:44:41.740
How do you think we can move away from that?
link |
00:44:44.580
Yeah, some researchers at Brain Zurich
link |
00:44:46.620
actually just released a really great paper
link |
00:44:49.020
on semi-supervised GANs
link |
00:44:51.780
where their goal isn't to classify,
link |
00:44:53.940
it's to make recognizable objects
link |
00:44:56.180
despite not having a lot of labeled data.
link |
00:44:58.660
They were working off of DeepMind's BigGAN project
link |
00:45:02.380
and they showed that they can match the performance
link |
00:45:05.180
of BigGAN using only 10%, I believe,
link |
00:45:08.660
of the labels.
link |
00:45:10.540
BigGAN was trained on the ImageNet data set,
link |
00:45:12.300
which is about 1.2 million images
link |
00:45:14.420
and had all of them labeled.
link |
00:45:17.460
This latest project from Brain Zurich
link |
00:45:19.060
shows that they're able to get away
link |
00:45:20.220
with only having about 10% of the images labeled.
link |
00:45:25.500
And they do that essentially using a clustering algorithm
link |
00:45:29.860
where the discriminator learns
link |
00:45:31.140
to assign the objects to groups
link |
00:45:34.580
and then this understanding that objects can be grouped
link |
00:45:38.220
into similar types helps it to form more realistic ideas
link |
00:45:43.340
of what should be appearing in the image
link |
00:45:45.300
because it knows that every image it creates
link |
00:45:47.860
has to come from one of these archetypal groups
link |
00:45:50.060
rather than just being some arbitrary image.
link |
00:45:53.100
If you train a GAN with no class labels,
link |
00:45:54.980
you tend to get things that look sort of like grass
link |
00:45:57.700
or water or brick or dirt,
link |
00:46:00.380
but without necessarily a lot going on in them.
link |
00:46:04.340
And I think that's partly because
link |
00:46:05.700
if you look at a large ImageNet image,
link |
00:46:07.820
the object doesn't necessarily occupy the whole image.
link |
00:46:11.180
And so you learn to create realistic sets of pixels,
link |
00:46:15.580
but you don't necessarily learn
link |
00:46:17.460
that the object is the star of the show
link |
00:46:20.060
and you want it to be in every image you make.
link |
00:46:22.100
Yeah, I've heard you talk about the horse,
link |
00:46:25.380
the zebra CycleGAN mapping
link |
00:46:26.980
and how it turns out, again, thought provoking
link |
00:46:31.900
that horses are usually on grass
link |
00:46:33.580
and zebras are usually on drier terrain.
link |
00:46:35.660
So when you're doing that kind of generation,
link |
00:46:38.140
you're going to end up generating greener horses
link |
00:46:41.740
or whatever, so those are connected together.
link |
00:46:45.340
It's not just, you're not able to segment,
link |
00:46:49.980
to be able to generate in a segmented way.
link |
00:46:52.300
So are there other types of games you come across
link |
00:46:54.980
in your mind that neural networks can play
link |
00:46:59.540
with each other to be able to solve problems?
link |
00:47:04.540
Yeah, the one that I spend most of my time on
link |
00:47:07.660
is in security.
link |
00:47:09.340
You can model most interactions as a game
link |
00:47:13.060
where there's attackers trying to break your system
link |
00:47:15.820
and you're the defender trying to build a resilient system.
link |
00:47:20.140
There's also domain adversarial learning,
link |
00:47:23.060
which is an approach to domain adaptation
link |
00:47:25.500
that looks really a lot like GANs.
link |
00:47:28.100
The authors had the idea before the GAN paper came out,
link |
00:47:31.780
their paper came out a little bit later
link |
00:47:33.740
and they're very nice and cited the GAN paper,
link |
00:47:38.220
but I know that they actually had the idea
link |
00:47:40.180
before it came out.
link |
00:47:42.420
Domain adaptation is when you want to train
link |
00:47:44.300
a machine learning model in one setting called a domain
link |
00:47:47.620
and then deploy it in another domain later.
link |
00:47:50.260
And you would like it to perform well in the new domain,
link |
00:47:52.660
even though the new domain is different
link |
00:47:53.980
from how it was trained.
link |
00:47:55.900
So for example, you might want to train
link |
00:47:58.460
on a really clean image data set like ImageNet,
link |
00:48:01.340
but then deploy on users phones
link |
00:48:03.340
where the user is taking pictures in the dark
link |
00:48:05.980
and pictures while moving quickly
link |
00:48:07.780
and just pictures that aren't really centered
link |
00:48:09.980
or composed all that well.
link |
00:48:13.380
When you take a normal machine learning model,
link |
00:48:15.820
it often degrades really badly
link |
00:48:17.820
when you move to the new domain
link |
00:48:18.980
because it looks so different
link |
00:48:20.020
from what the model was trained on.
link |
00:48:22.100
Domain adaptation algorithms try to smooth out that gap
link |
00:48:25.420
and the domain adversarial approach
link |
00:48:27.300
is based on training a feature extractor
link |
00:48:29.780
where the features have the same statistics
link |
00:48:32.140
regardless of which domain you extracted them on.
link |
00:48:35.140
So in the domain adversarial game,
link |
00:48:36.860
you have one player that's a feature extractor
link |
00:48:39.140
and another player that's a domain recognizer.
link |
00:48:42.060
The domain recognizer wants to look at the output
link |
00:48:44.260
of the feature extractor
link |
00:48:45.700
and guess which of the two domains the features came from.
link |
00:48:49.300
So it's a lot like the real versus fake discriminator
link |
00:48:51.420
in GANs and then the feature extractor,
link |
00:48:54.940
you can think of as loosely analogous
link |
00:48:56.820
to the generator in GANs,
link |
00:48:57.940
except what it's trying to do here
link |
00:48:59.100
is both fool the domain recognizer
link |
00:49:02.460
into not knowing which domain the data came from
link |
00:49:05.340
and also extract features that are good for classification.
link |
00:49:09.060
So at the end of the day,
link |
00:49:12.180
in the cases where it works out,
link |
00:49:13.780
you can actually get features
link |
00:49:16.860
that work about the same in both domains.
link |
00:49:20.620
Sometimes this has a drawback
link |
00:49:21.980
where in order to make things work the same in both domains,
link |
00:49:24.820
it just gets worse at the first one.
link |
00:49:26.780
But there are a lot of cases
link |
00:49:27.820
where it actually works out well on both.
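[The two-player setup described above can be illustrated with a deliberately tiny sketch, with assumed toy data rather than any published implementation: a 1-D "feature" reveals the domain when the extractor passes the input through unchanged, and the adversarial term is what would push the extractor toward features the domain recognizer cannot exploit.]

```python
# Toy illustration of domain-adversarial features.
# Domain A inputs cluster near +1, domain B near -1; the "extractor"
# is a single scalar weight w, and the best simple domain recognizer
# guesses the domain from the sign of the feature w * x.

import random

random.seed(0)

def domain_accuracy(w, xs_a, xs_b):
    # recognizer's rule: feature > 0 -> domain A, else domain B
    correct = sum(1 for x in xs_a if w * x > 0)
    correct += sum(1 for x in xs_b if w * x <= 0)
    return correct / (len(xs_a) + len(xs_b))

xs_a = [1 + random.gauss(0, 0.1) for _ in range(100)]
xs_b = [-1 + random.gauss(0, 0.1) for _ in range(100)]

# With w = 1 the feature betrays the domain almost perfectly; driving
# w toward 0 (what the adversarial term rewards) leaves the recognizer
# at chance, i.e. the feature statistics match across domains.
print(domain_accuracy(1.0, xs_a, xs_b))  # near 1.0
print(domain_accuracy(0.0, xs_a, xs_b))  # 0.5: domains indistinguishable
```

[In the real method the extractor must also keep the features useful for the main task, which is the tension Ian describes where performance on the first domain can degrade.]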
link |
00:49:30.780
So do you think of GANs being useful
link |
00:49:32.980
in the context of data augmentation?
link |
00:49:35.420
Yeah, one thing you could hope for with GANs
link |
00:49:38.100
is you could imagine I've got a limited training set
link |
00:49:41.340
and I'd like to make more training data
link |
00:49:43.860
to train something else like a classifier.
link |
00:49:47.180
You could train the GAN on the training set
link |
00:49:50.500
and then create more data
link |
00:49:52.380
and then maybe the classifier
link |
00:49:54.300
would perform better on the test set
link |
00:49:55.940
after training on this bigger GAN-generated data set.
link |
00:49:58.860
So that's the simplest version
link |
00:50:00.420
of something you might hope would work.
link |
00:50:03.060
I've never heard of that particular approach working,
link |
00:50:05.460
but I think there's some closely related things
link |
00:50:08.940
that I think could work in the future
link |
00:50:11.540
and some that actually already have worked.
link |
00:50:14.100
So if we think a little bit about what we'd be hoping for
link |
00:50:15.820
if we use the GAN to make more training data,
link |
00:50:18.220
we're hoping that the GAN will generalize to new examples
link |
00:50:22.060
better than the classifier would have generalized
link |
00:50:24.140
if it was trained on the same data.
link |
00:50:25.980
And I don't know of any reason to believe
link |
00:50:27.740
that the GAN would generalize better
link |
00:50:28.940
than the classifier would,
link |
00:50:31.460
but what we might hope for
link |
00:50:33.100
is that the GAN could generalize differently
link |
00:50:35.580
from a specific classifier.
link |
00:50:37.500
So one thing I think is worth trying
link |
00:50:39.180
that I haven't personally tried but someone could try is
link |
00:50:41.740
what if you trained a whole lot of different
link |
00:50:44.020
generative models on the same training set,
link |
00:50:46.500
create samples from all of them
link |
00:50:48.380
and then train a classifier on that?
link |
00:50:50.580
Because each of the generative models
link |
00:50:52.380
might generalize in a slightly different way.
link |
00:50:54.460
They might capture many different axes of variation
link |
00:50:56.980
that one individual model wouldn't
link |
00:50:58.860
and then the classifier can capture all of those ideas
link |
00:51:01.900
by training in all of their data.
link |
00:51:03.580
So it'd be a little bit like making
link |
00:51:04.740
an ensemble of classifiers.
link |
00:51:06.340
And I think that...
link |
00:51:07.180
Ensemble of GANs in a way.
link |
00:51:08.860
I think that could generalize better.
link |
00:51:10.100
The other thing that GANs are really good for
link |
00:51:12.700
is not necessarily generating new data
link |
00:51:17.020
that's exactly like what you already have,
link |
00:51:19.380
but by generating new data that has different properties
link |
00:51:23.580
from the data you already had.
link |
00:51:25.340
One thing that you can do is you can create
link |
00:51:27.260
differentially private data.
link |
00:51:29.140
So suppose that you have something like medical records
link |
00:51:31.900
and you don't want to train a classifier
link |
00:51:33.860
on the medical records and then publish the classifier
link |
00:51:36.500
because someone might be able to reverse engineer
link |
00:51:38.180
some of the medical records you trained on.
link |
00:51:40.580
There's a paper from Casey Greene's lab
link |
00:51:42.820
that shows how you can train a GAN
link |
00:51:45.060
using differential privacy.
link |
00:51:47.020
And then the samples from the GAN
link |
00:51:49.020
still have the same differential privacy guarantees
link |
00:51:51.180
as the parameters of the GAN.
link |
00:51:52.740
So you can make fake patient data
link |
00:51:55.700
for other researchers to use.
link |
00:51:57.260
And they can do almost anything they want with that data
link |
00:51:59.220
because it doesn't come from real people.
link |
00:52:02.020
And the differential privacy mechanism
link |
00:52:04.300
gives you clear guarantees
link |
00:52:06.500
on how much the original people's data has been protected.
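[As a hedged illustration of what such a guarantee means, here is the classic Laplace mechanism on a count query; the GAN work mentioned above trains with noisy gradients, which is more involved, but the quantitative protection is of the same flavor.]

```python
# Sketch of the Laplace mechanism: release a statistic with
# epsilon-differential privacy by adding calibrated noise.

import math
import random

def laplace_mechanism(true_count, sensitivity, epsilon, rng):
    """Add Laplace(0, sensitivity / epsilon) noise to a count."""
    scale = sensitivity / epsilon
    u = rng.random() - 0.5  # uniform on (-0.5, 0.5)
    # inverse-CDF sample from a Laplace distribution
    noise = -scale * (1 if u >= 0 else -1) * math.log(1 - 2 * abs(u))
    return true_count + noise

rng = random.Random(0)
# e.g. "how many patients in this cohort have condition X?"
noisy = laplace_mechanism(42, sensitivity=1, epsilon=0.5, rng=rng)
print(noisy)  # the true count, 42, plus calibrated noise
```

[Smaller epsilon means more noise and a stronger guarantee; since one person's presence changes the count by at most the sensitivity, their contribution is mathematically masked.]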
link |
00:52:09.940
That's really interesting, actually.
link |
00:52:11.380
I haven't heard you talk about that before.
link |
00:52:13.780
In terms of fairness, I've seen from AAAI,
link |
00:52:17.780
your talk, how can adversarial machine learning
link |
00:52:21.260
help models be more fair with respect to sensitive variables?
link |
00:52:25.740
Yeah, so there's a paper from Amos Storkey's lab
link |
00:52:28.460
about how to learn machine learning models
link |
00:52:31.420
that are incapable of using specific variables.
link |
00:52:34.820
So say, for example, you wanted to make predictions
link |
00:52:36.700
that are not affected by gender.
link |
00:52:39.580
It isn't enough to just leave gender
link |
00:52:41.220
out of the input to the model.
link |
00:52:42.820
You can often infer gender
link |
00:52:44.020
from a lot of other characteristics.
link |
00:52:45.500
Like say that you have the person's name,
link |
00:52:47.500
but you're not told their gender.
link |
00:52:48.620
Well, if their name is Ian, they're kind of obviously a man.
link |
00:52:53.740
So what you'd like to do is make a machine learning model
link |
00:52:55.660
that can still take in a lot of different attributes
link |
00:52:59.020
and make a really accurate informed prediction,
link |
00:53:02.620
but be confident that it isn't reverse engineering gender
link |
00:53:05.780
or another sensitive variable internally.
link |
00:53:08.420
You can do that using something very similar
link |
00:53:10.300
to the domain adversarial approach,
link |
00:53:12.860
where you have one player that's a feature extractor
link |
00:53:16.140
and another player that's a feature analyzer.
link |
00:53:19.100
And you want to make sure that the feature analyzer
link |
00:53:21.460
is not able to guess the value of the sensitive variable
link |
00:53:24.740
that you're trying to keep private.
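[The name example above, where dropping the sensitive column is not enough, can be shown with a toy proxy table; the names and mapping here are hypothetical and purely illustrative.]

```python
# Toy illustration of a proxy variable: the record has no "gender"
# field, yet a model can reconstruct it from another attribute.

NAME_TO_GENDER = {"ian": "m", "lex": "m", "alice": "f", "maria": "f"}

def infer_gender(record):
    # never sees a "gender" field, but recovers it from a proxy
    return NAME_TO_GENDER.get(record["name"].lower(), "unknown")

record = {"name": "Ian", "age": 34, "zipcode": "94301"}  # no gender field
print(infer_gender(record))  # inferred from the name alone
```

[The adversarial approach instead trains the feature extractor until no analyzer, proxy or otherwise, can recover the sensitive variable from the features.]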
link |
00:53:26.660
Right, that's, yeah, I love this approach.
link |
00:53:29.100
So yeah, with the feature,
link |
00:53:31.660
you're not able to infer the sensitive variables.
link |
00:53:36.340
Brilliant, that's quite brilliant and simple actually.
link |
00:53:39.500
Another way I think that GANs in particular
link |
00:53:42.780
could be used for fairness
link |
00:53:44.260
would be to make something like a CycleGAN,
link |
00:53:46.780
where you can take data from one domain
link |
00:53:49.740
and convert it into another.
link |
00:53:51.180
We've seen CycleGAN turning horses into zebras.
link |
00:53:53.900
We've seen other unsupervised GANs made by Ming-Yu Liu
link |
00:53:59.260
doing things like turning day photos into night photos.
link |
00:54:03.700
I think for fairness,
link |
00:54:04.820
you could imagine taking records for people in one group
link |
00:54:08.460
and transforming them into analogous people in another group
link |
00:54:11.580
and testing to see if they're treated equitably
link |
00:54:14.980
across those two groups.
link |
00:54:16.460
There's a lot of things that'd be hard to get right
link |
00:54:18.100
to make sure that the conversion process itself is fair.
link |
00:54:21.140
And I don't think it's anywhere near
link |
00:54:23.900
something that we could actually use yet,
link |
00:54:25.420
but if you could design that conversion process
link |
00:54:27.140
very carefully, it might give you a way of doing audits
link |
00:54:30.540
where you say, what if we took people from this group,
link |
00:54:33.140
converted them into equivalent people in another group,
link |
00:54:35.460
does the system actually treat them how it ought to?
link |
00:54:38.740
That's also really interesting.
link |
00:54:41.780
You know, in popular press and in general,
link |
00:54:47.500
in our imagination, you think,
link |
00:54:49.500
well, GANs are able to generate data
link |
00:54:51.700
and you start to think about deep fakes
link |
00:54:54.540
or being able to sort of maliciously generate data
link |
00:54:57.940
that fakes the identity of other people.
link |
00:55:01.220
Is this something of a concern to you?
link |
00:55:03.180
Is this something, if you look 10, 20 years into the future,
link |
00:55:06.900
is that something that pops up in your work,
link |
00:55:10.380
in the work of the community
link |
00:55:11.380
that's working on generative models?
link |
00:55:13.540
I'm a lot less concerned about 20 years from now
link |
00:55:15.860
than the next few years.
link |
00:55:17.380
I think there'll be a kind of bumpy cultural transition
link |
00:55:20.820
as people encounter this idea
link |
00:55:23.180
that there can be very realistic videos
link |
00:55:24.700
and audio that aren't real.
link |
00:55:26.260
I think 20 years from now,
link |
00:55:28.700
people will mostly understand
link |
00:55:30.100
that you shouldn't believe something is real
link |
00:55:31.940
just because you saw a video of it.
link |
00:55:34.060
People will expect to see
link |
00:55:35.220
that it's been cryptographically signed
link |
00:55:38.220
or have some other mechanism to make them believe
link |
00:55:41.900
that the content is real.
link |
00:55:44.300
There's already people working on this.
link |
00:55:45.700
Like there's a startup called Truepic
link |
00:55:47.660
that provides a lot of mechanisms
link |
00:55:50.180
for authenticating that an image is real.
link |
00:55:52.780
They're maybe not quite up to having a state actor
link |
00:55:56.100
try to evade their verification techniques,
link |
00:55:59.820
but it's something that people are already working on
link |
00:56:02.380
and I think we'll get right eventually.
link |
00:56:04.100
So you think authentication will eventually win out.
link |
00:56:08.260
So being able to authenticate that this is real
link |
00:56:10.700
and this is not.
link |
00:56:11.860
Yeah.
link |
00:56:13.260
As opposed to GANs just getting better and better
link |
00:56:15.740
or generative models being able to get better and better
link |
00:56:18.180
to where the nature of what is real is unknowable.
link |
00:56:21.460
I don't think we'll ever be able
link |
00:56:22.940
to look at the pixels of a photo
link |
00:56:25.460
and tell you for sure that it's real or not real.
link |
00:56:28.540
And I think it would actually be somewhat dangerous
link |
00:56:32.740
to rely on that approach too much.
link |
00:56:35.140
If you make a really good fake detector
link |
00:56:36.820
and then someone's able to fool your fake detector
link |
00:56:38.900
and your fake detector says this image is not fake,
link |
00:56:42.140
then it's even more credible
link |
00:56:43.500
than if you've never made a fake detector
link |
00:56:45.060
in the first place.
link |
00:56:46.260
What I do think we'll get to is systems
link |
00:56:50.380
that we can kind of use behind the scenes
link |
00:56:53.300
to make estimates of what's going on
link |
00:56:55.580
and maybe not like use them in court
link |
00:56:57.820
for a definitive analysis.
link |
00:56:59.580
I also think we will likely get better authentication systems
link |
00:57:04.180
where, imagine that every phone cryptographically signs
link |
00:57:08.500
everything that comes out of it.
link |
00:57:10.540
You wouldn't be able to conclusively tell
link |
00:57:12.820
that an image was real,
link |
00:57:14.540
but you would be able to tell somebody
link |
00:57:17.700
who knew the appropriate private key for this phone
link |
00:57:21.300
was actually able to sign this image
link |
00:57:24.340
and upload it to this server at this timestamp.
link |
00:57:27.460
Okay, so you could imagine maybe you make phones
link |
00:57:31.340
that have the private keys hardware embedded in them.
link |
00:57:35.540
If like a state security agency
link |
00:57:37.460
really wants to infiltrate the company,
link |
00:57:39.220
they could probably plant a private key of their choice
link |
00:57:42.540
or break open the chip and learn the private key
link |
00:57:45.060
or something like that.
link |
00:57:46.180
But it would make it a lot harder
link |
00:57:47.420
for an adversary with fewer resources to fake things.
link |
00:57:51.460
For most of us it would be okay.
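[A minimal sketch of the signing idea, using a symmetric HMAC as a stand-in; a real deployment would use asymmetric signatures as described above, so that verifiers never hold the device's secret, and the key and function names here are hypothetical.]

```python
# Sketch: a device keeps a secret key and signs each image together
# with a timestamp; a verifier holding the key can later check that
# the bytes were produced by a key holder and were not altered.

import hashlib
import hmac

DEVICE_KEY = b"secret-key-baked-into-hardware"  # hypothetical

def sign_image(image_bytes, timestamp, key=DEVICE_KEY):
    msg = image_bytes + timestamp.encode()
    return hmac.new(key, msg, hashlib.sha256).hexdigest()

def verify_image(image_bytes, timestamp, signature, key=DEVICE_KEY):
    expected = sign_image(image_bytes, timestamp, key)
    # constant-time comparison to avoid timing side channels
    return hmac.compare_digest(expected, signature)

img = b"\x89PNG...raw pixels..."
sig = sign_image(img, "2019-04-01T12:00:00Z")
print(verify_image(img, "2019-04-01T12:00:00Z", sig))               # True
print(verify_image(img + b"x", "2019-04-01T12:00:00Z", sig))        # False
```

[As noted in the conversation, this proves the signer held the key at that timestamp, not that the pixels depict reality, and a well-resourced adversary who extracts the key defeats it.]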
link |
00:57:53.700
So you mentioned the beer and the bar and the new ideas.
link |
00:57:58.300
You were able to implement this
link |
00:57:59.740
or come up with this new idea pretty quickly
link |
00:58:02.860
and implement it pretty quickly.
link |
00:58:04.380
Do you think there's still many such groundbreaking ideas
link |
00:58:07.700
in deep learning that could be developed so quickly?
link |
00:58:10.980
Yeah, I do think that there are a lot of ideas
link |
00:58:12.980
that can be developed really quickly.
link |
00:58:15.940
GANs were probably a little bit of an outlier
link |
00:58:17.820
on the whole like one hour timescale.
link |
00:58:20.180
But just in terms of like low resource ideas
link |
00:58:24.220
where you do something really different
link |
00:58:25.540
on the algorithm scale and get a big payback.
link |
00:58:30.140
I think it's not as likely that you'll see that
link |
00:58:31.900
in terms of things like core machine learning technologies
link |
00:58:34.940
like a better classifier
link |
00:58:36.580
or a better reinforcement learning algorithm
link |
00:58:38.180
or a better generative model.
link |
00:58:41.020
If I had the GAN idea today,
link |
00:58:42.420
it would be a lot harder to prove that it was useful
link |
00:58:45.260
than it was back in 2014
link |
00:58:46.940
because I would need to get it running
link |
00:58:49.540
on something like ImageNet or CelebA at high resolution.
link |
00:58:54.060
You know, those take a while to train.
link |
00:58:55.540
You couldn't train it in an hour
link |
00:58:57.580
and know that it was something really new and exciting.
link |
00:59:01.020
Back in 2014, training on MNIST was enough.
link |
00:59:04.260
But there are other areas of machine learning
link |
00:59:06.780
where I think a new idea
link |
00:59:09.380
could actually be developed really quickly
link |
00:59:11.940
with low resources.
link |
00:59:13.260
What's your intuition about what areas
link |
00:59:15.420
of machine learning are ripe for this?
link |
00:59:17.740
Yeah, so I think fairness and interpretability
link |
00:59:23.140
are areas where we just really don't have any idea
link |
00:59:27.020
how anything should be done yet.
link |
00:59:29.020
Like for interpretability,
link |
00:59:30.340
I don't think we even have the right definitions.
link |
00:59:32.700
And even just defining a really useful concept,
link |
00:59:36.060
you don't even need to run any experiments,
link |
00:59:38.100
could have a huge impact on the field.
link |
00:59:40.100
We've seen that, for example, in differential privacy
link |
00:59:42.540
that Cynthia Dwork and her collaborators
link |
00:59:45.300
made this technical definition of privacy
link |
00:59:48.020
where before a lot of things were really mushy.
link |
00:59:50.020
And then with that definition,
link |
00:59:51.580
you could actually design randomized algorithms
link |
00:59:54.220
for accessing databases and guarantee
link |
00:59:56.180
that they preserved individual people's privacy
link |
00:59:58.820
in like a mathematical quantitative sense.
link |
01:00:03.460
Right now, we all talk a lot about
link |
01:00:05.060
how interpretable different machine learning algorithms are,
link |
01:00:07.540
but it's really just people's opinion.
link |
01:00:09.820
And everybody probably has a different idea
link |
01:00:11.300
of what interpretability means in their head.
link |
01:00:13.820
If we could define some concept related to interpretability
link |
01:00:16.940
that's actually measurable,
link |
01:00:18.700
that would be a huge leap forward
link |
01:00:20.540
even without a new algorithm that increases that quantity.
link |
01:00:24.140
And also once we had the definition of differential privacy,
link |
01:00:28.740
it was fast to get the algorithms that guaranteed it.
link |
01:00:31.340
So you could imagine once we have definitions
link |
01:00:33.500
of good concepts and interpretability,
link |
01:00:35.700
we might be able to provide the algorithms
link |
01:00:37.540
that have the interpretability guarantees quickly too.
link |
01:00:40.500
So what do you think it takes to build a system
link |
01:00:46.900
with human level intelligence
link |
01:00:48.660
as we quickly venture into the philosophical?
link |
01:00:51.980
So artificial general intelligence, what do you think it takes?
link |
01:00:55.660
I think that it definitely takes better environments
link |
01:01:01.820
than we currently have for training agents
link |
01:01:03.780
that we want them to have
link |
01:01:05.260
a really wide diversity of experiences.
link |
01:01:08.740
I also think it's gonna take really a lot of computation.
link |
01:01:11.780
It's hard to imagine exactly how much.
link |
01:01:13.780
So you're optimistic about simulation,
link |
01:01:16.300
simulating a variety of environments as the path forward?
link |
01:01:19.540
I think it's a necessary ingredient.
link |
01:01:21.980
Yeah, I don't think that we're going to get
link |
01:01:24.700
to artificial general intelligence
link |
01:01:27.340
by training on fixed data sets
link |
01:01:29.700
or by thinking really hard about the problem.
link |
01:01:32.100
I think that the agent really needs to interact
link |
01:01:35.860
and have a variety of experiences within the same lifespan.
link |
01:01:41.580
And today we have many different models
link |
01:01:44.100
that can each do one thing.
link |
01:01:45.700
And we tend to train them on one data set
link |
01:01:47.500
or one RL environment.
link |
01:01:50.100
Sometimes there are actually papers
link |
01:01:51.380
about getting one set of parameters to perform well
link |
01:01:54.180
in many different RL environments.
link |
01:01:56.980
But we don't really have anything like an agent
link |
01:01:59.500
that goes seamlessly from one type of experience to another
link |
01:02:02.900
and really integrates all the different things
link |
01:02:05.260
that it does over the course of its life.
link |
01:02:08.020
When we do see multi agent environments,
link |
01:02:10.580
they tend to be,
link |
01:02:12.340
or rather, multi environment agents,
link |
01:02:14.660
they tend to be similar environments.
link |
01:02:16.740
Like all of them are playing like an action based video game.
link |
01:02:20.420
We don't really have an agent that goes from
link |
01:02:23.220
playing a video game to like reading the Wall Street Journal
link |
01:02:27.500
to predicting how effective a molecule will be as a drug
link |
01:02:31.260
or something like that.
link |
01:02:33.260
What do you think is a good test for intelligence
link |
01:02:35.940
in your view?
link |
01:02:37.020
There's been a lot of benchmarks started with the,
link |
01:02:40.300
with Alan Turing,
link |
01:02:41.700
natural conversation being a good benchmark for intelligence.
link |
01:02:46.260
What would Ian Goodfellow sit back
link |
01:02:51.340
and be really damn impressed
link |
01:02:53.380
if a system was able to accomplish?
link |
01:02:56.060
Something that doesn't take a lot of glue
link |
01:02:58.500
from human engineers.
link |
01:02:59.780
So imagine that instead of having to
link |
01:03:03.540
go to the CIFAR website and download CIFAR-10
link |
01:03:07.940
and then write a Python script to parse it and all that,
link |
01:03:11.300
you could just point an agent at the CIFAR-10 problem
link |
01:03:16.460
and it downloads and extracts the data
link |
01:03:19.180
and trains a model and starts giving you predictions.
link |
01:03:22.420
I feel like something that doesn't need to have
link |
01:03:25.980
every step of the pipeline assembled for it,
link |
01:03:28.620
definitely understands what it's doing.
link |
01:03:30.460
Is AutoML moving into that direction
link |
01:03:32.380
or are you thinking way even bigger?
link |
01:03:34.380
AutoML has mostly been moving toward,
link |
01:03:38.180
once we've built all the glue,
link |
01:03:39.900
can the machine learning system
link |
01:03:42.180
design the architecture really well?
link |
01:03:44.340
And so I'm more of saying like,
link |
01:03:47.260
if something knows how to preprocess the data
link |
01:03:49.580
so that it successfully accomplishes the task,
link |
01:03:52.340
then it would be very hard to argue
link |
01:03:53.460
that it doesn't truly understand the task
link |
01:03:56.180
in some fundamental sense.
link |
01:03:58.460
And I don't necessarily know that that's like
link |
01:04:00.020
the philosophical definition of intelligence,
link |
01:04:02.260
but that's something that would be really cool to build
link |
01:04:03.780
that would be really useful and would impress me
link |
01:04:05.580
and would convince me that we've made a step forward
link |
01:04:08.180
in real AI.
link |
01:04:09.420
So you give it like the URL for Wikipedia
link |
01:04:13.380
and then next day expect it to be able to solve CIFAR-10.
link |
01:04:18.700
Or like you type in a paragraph
link |
01:04:20.820
explaining what you want it to do
link |
01:04:22.180
and it figures out what web searches it should run
link |
01:04:24.780
and downloads all the necessary ingredients.
link |
01:04:28.300
So you have a very clear, calm way of speaking,
link |
01:04:34.780
no ums, easy to edit.
link |
01:04:37.580
I've seen comments for both you and I
link |
01:04:40.220
have been identified as both potentially being robots.
link |
01:04:44.180
If you have to prove to the world that you are indeed human,
link |
01:04:47.220
how would you do it?
link |
01:04:48.580
I can understand thinking that I'm a robot.
link |
01:04:55.300
It's the flip side of the Turing test, I think.
link |
01:04:57.780
Yeah, yeah, the prove your human test.
link |
01:05:01.900
Intellectually, so you have to...
link |
01:05:04.460
Is there something that's truly unique in your mind?
link |
01:05:08.620
Does it go back to just natural language again?
link |
01:05:11.620
Just being able to talk the way out of it.
link |
01:05:13.860
Proving that I'm not a robot with today's technology.
link |
01:05:17.060
Yeah, that's pretty straightforward.
link |
01:05:18.340
Like my conversation today hasn't veered off
link |
01:05:20.780
into talking about the stock market or something
link |
01:05:24.380
because of my training data.
link |
01:05:25.940
But I guess more generally trying to prove
link |
01:05:28.060
that something is real from the content alone
link |
01:05:30.500
is incredibly hard.
link |
01:05:31.380
That's one of the main things I've gotten
link |
01:05:32.460
out of my GAN research,
link |
01:05:33.460
that you can simulate almost anything.
link |
01:05:37.660
And so you have to really step back to a separate channel
link |
01:05:41.020
to prove that something is real.
link |
01:05:42.220
So like, I guess I should have had myself
link |
01:05:45.100
stamped on a blockchain when I was born or something,
link |
01:05:47.660
but I didn't do that.
link |
01:05:48.580
So according to my own research methodology,
link |
01:05:50.780
there's just no way to know at this point.
link |
01:05:52.940
So what, last question, problem stands out for you
link |
01:05:56.300
that you're really excited about challenging
link |
01:05:58.340
in the near future?
link |
01:05:59.900
So I think resistance to adversarial examples,
link |
01:06:02.900
figuring out how to make machine learning secure
link |
01:06:05.500
against an adversary who wants to interfere
link |
01:06:07.380
and control it, that is one of the most important things
link |
01:06:10.660
researchers today could solve.
link |
01:06:12.140
In all domains, image, language, driving, and everything.
link |
01:06:17.700
I guess I'm most concerned about domains
link |
01:06:19.780
we haven't really encountered yet.
link |
01:06:21.980
Like imagine 20 years from now,
link |
01:06:24.020
when we're using advanced AIs to do things
link |
01:06:26.820
we haven't even thought of yet.
link |
01:06:28.940
Like if you ask people,
link |
01:06:30.620
what are the important problems in security of phones
link |
01:06:35.100
in like 2002?
link |
01:06:37.620
I don't think we would have anticipated
link |
01:06:38.900
that we're using them for nearly as many things
link |
01:06:42.140
as we're using them for today.
link |
01:06:43.620
I think it's gonna be like that with AI
link |
01:06:44.860
that you can kind of try to speculate
link |
01:06:46.900
about where it's going,
link |
01:06:47.900
but really the business opportunities
link |
01:06:49.580
that end up taking off would be hard
link |
01:06:52.100
to predict ahead of time.
link |
01:06:54.140
What you can predict ahead of time
link |
01:06:55.300
is that almost anything you can do with machine learning,
link |
01:06:58.340
you would like to make sure
link |
01:06:59.420
that people can't get it to do what they want
link |
01:07:03.100
rather than what you want,
link |
01:07:04.580
just by showing it a funny QR code
link |
01:07:06.460
or a funny input pattern.
link |
01:07:08.460
And you think that the set of methodology to do that
link |
01:07:10.980
can be bigger than any one domain?
link |
01:07:13.140
I think so, yeah.
link |
01:07:14.140
Yeah, like one methodology that I think is,
link |
01:07:19.140
not a specific methodology,
link |
01:07:20.620
but like a category of solutions
link |
01:07:22.740
that I'm excited about today is making dynamic models
link |
01:07:25.660
that change every time they make a prediction.
link |
01:07:28.180
So right now we tend to train models
link |
01:07:31.100
and then after they're trained, we freeze them
link |
01:07:33.060
and we just use the same rule
link |
01:07:35.180
to classify everything that comes in from then on.
link |
01:07:38.180
That's really a sitting duck from a security point of view.
link |
01:07:41.500
If you always output the same answer for the same input,
link |
01:07:45.420
then people can just run inputs through
link |
01:07:48.220
until they find a mistake that benefits them.
link |
01:07:50.140
And then they use the same mistake
link |
01:07:51.700
over and over and over again.
link |
01:07:54.020
I think having a model that updates its predictions
link |
01:07:56.460
so that it's harder to predict what you're gonna get
link |
01:08:00.340
will make it harder for an adversary
link |
01:08:02.740
to really take control of the system
link |
01:08:04.820
and make it do what they want it to do.
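[One toy version of the dynamic-model idea described above, an assumed sketch rather than any specific published defense: inject fresh randomness into each prediction, so a probing attacker cannot rely on the same input producing the same mistake every time.]

```python
# Sketch: a frozen model is a sitting duck; a randomized model
# perturbs each query so repeated probing near the decision
# boundary does not expose one fixed, reusable mistake.

import random

def fixed_model(x):
    # frozen rule: same input, same answer, forever
    return 1 if x > 0.0 else 0

def randomized_model(x, rng, noise=0.3):
    # perturb the input each call; answers can vary near the boundary
    return 1 if x + rng.gauss(0, noise) > 0.0 else 0

rng = random.Random(0)
x_near_boundary = 0.05
answers = {randomized_model(x_near_boundary, rng) for _ in range(100)}
print(answers)  # typically {0, 1}: no single rule to pin down

x_far = 5.0
print({randomized_model(x_far, rng) for _ in range(100)})  # stays {1}
```

[The trade-off is some noise on borderline inputs in exchange for an attacker no longer being able to replay one adversarial input over and over with guaranteed effect.]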
link |
01:08:06.100
Yeah, models that maintain a bit of a sense of mystery
link |
01:08:09.740
about them, because they always keep changing.
link |
01:08:12.740
Ian, thanks so much for talking today, it was awesome.
link |
01:08:14.900
Thank you for coming in, it's great to see you.