
Rohit Prasad: Amazon Alexa and Conversational AI | Lex Fridman Podcast #57



00:00:00 Lex Fridman: The following is a conversation with Rohit Prasad. He's the vice president and head scientist of Amazon Alexa and one of its original creators. The Alexa team embodies some of the most challenging, incredible, impactful, and inspiring work being done in AI today. The team has to both solve problems at the cutting edge of natural language processing and provide a trustworthy, secure, and enjoyable experience to millions of people. This is where state-of-the-art methods in computer science meet the challenges of real-world engineering. In many ways, Alexa and the other voice assistants are the voices of artificial intelligence to millions of people, and an introduction to AI for people who have only encountered it in science fiction. This is an important and exciting opportunity, so the work that Rohit and the Alexa team are doing is an inspiration to me and to many researchers and engineers in the AI community.

This is the Artificial Intelligence Podcast. If you enjoy it, subscribe on YouTube, give it five stars on Apple Podcasts, support it on Patreon, or simply connect with me on Twitter at Lex Fridman, spelled F R I D M A N. If you leave a review on Apple Podcasts especially, but also Castbox, or comment on YouTube, consider mentioning topics, people, ideas, questions, or quotes in science, tech, or philosophy that you find interesting, and I'll read them on this podcast. I won't call out names, but I love comments with kindness and thoughtfulness in them, so I thought I'd share them. Someone on YouTube highlighted a quote from the conversation with Ray Dalio, where he said that you have to appreciate all the different ways that people can be A players. This connected with me: on teams of engineers, it's easy to think that raw productivity is the measure of excellence, but there are others. I've worked with people who brought a smile to my face every time I got to work in the morning. Their contribution to the team is immeasurable.

I recently started doing podcast ads at the end of the introduction. I'll do one or two minutes after introducing the episode, and never any ads in the middle that break the flow of the conversation. I hope that works for you and doesn't hurt the listening experience.

This show is presented by Cash App, the number one finance app in the App Store. I personally use Cash App to send money to friends, but you can also use it to buy, sell, and deposit Bitcoin in just seconds. Cash App also has a new investing feature. You can buy fractions of a stock, say $1 worth, no matter what the stock price is. Brokerage services are provided by Cash App Investing, a subsidiary of Square and member SIPC. I'm excited to be working with Cash App to support one of my favorite organizations, called FIRST, best known for their FIRST Robotics and LEGO competitions. They educate and inspire hundreds of thousands of students in over 110 countries, and have a perfect rating on Charity Navigator, which means that donated money is used to maximum effectiveness. When you get Cash App from the App Store or Google Play and use code LEXPODCAST, you'll get $10, and Cash App will also donate $10 to FIRST, which, again, is an organization that I've personally seen inspire girls and boys to dream of engineering a better world.

This podcast is also supported by ZipRecruiter. Hiring great people is hard, and to me, it's one of the most important elements of a successful, mission-driven team. I've been fortunate to be a part of, and to lead, several great engineering teams. The hiring I've done in the past was mostly through tools we built ourselves, but reinventing the wheel was painful. ZipRecruiter is a tool that's already available for you. It seeks to make hiring simple, fast, and smart. For example, Codable cofounder Gretchen Huebner used ZipRecruiter to find a new game artist to join her education tech company. By using ZipRecruiter's screening questions to filter candidates, Gretchen found it easier to focus on the best candidates, and finally hired the perfect person for the role, in less than two weeks from start to finish. ZipRecruiter: the smartest way to hire. See why ZipRecruiter is effective for businesses of all sizes by signing up, as I did, for free, at ziprecruiter.com/lexpod. That's ziprecruiter.com/lexpod.

And now, here's my conversation with Rohit Prasad.
00:04:33 Lex Fridman: In the movie Her (I'm not sure if you've ever seen it), a human falls in love with the voice of an AI system. Let's start at the highest philosophical level, before we get to deep learning and some of the fun things. Do you think what the movie Her shows is within our reach?

00:04:51 Rohit Prasad: I think, not specifically about Her, but what we are seeing is a massive increase in the adoption of AI assistants, or AI, in all parts of our social fabric. What I do believe is that the utility these AIs provide, some of the functionalities that are shown, are absolutely within reach.
00:05:18 Lex Fridman: So some of the functionality in terms of the interactive elements, but what about the deep connection that's purely voice-based? Do you think such a close connection is possible with voice alone?

00:05:30 Rohit Prasad: It's been a while since I saw Her, but I would say that in terms of interactions which are human-like, in these AI systems you also have to value what is superhuman. We as humans can be in only one place; AI assistants can be in multiple places at the same time: one with you on your mobile device, one at your home, one at work. So you have to respect these superhuman capabilities too. Plus, as humans, we have certain attributes we are very good at, like reasoning, where AI assistants are not yet there. But in the realm of AI assistants, what they're great at is computation and memory; it's infinite and pure. These are the attributes you have to start respecting. So I think the comparison of human-like versus the other aspect, which is also superhuman, has to be taken into consideration. We need to elevate the discussion beyond just human-like.
00:06:27 Lex Fridman: So there are certainly elements, as we just mentioned: Alexa is everywhere, computationally speaking. This is a much bigger infrastructure than just the thing that sits there in the room with you. But it certainly feels to us mere humans that there's just another little creature there when you're interacting with it. You're not interacting with the entirety of the infrastructure; you're interacting with the device. Sure, we anthropomorphize things, but that feeling is still there. So in the purity of the interaction with a smart device, with a smart assistant, what do you think we as humans look for in that interaction?
00:07:10 Rohit Prasad: I think certain interactions will very much feel like interacting with a human, because the AI has a persona of its own, and in certain ones it won't. A simple example: if you're walking through the house and you just want to turn your lights on and off, you're issuing a command. That's not very much a human-like interaction, and that's where the AI shouldn't come back and have a conversation with you; it should simply complete the command. So we have to think about this as not human-to-human alone. It is a human-machine interaction: certain aspects call for being human-like, and certain situations demand it be like a machine.
00:07:51 Lex Fridman: So, I told you it's going to be philosophical in parts. What's the difference between human and machine in that interaction, between when we interact with humans, especially those who are friends and loved ones, versus you and a machine that you are also close with?
00:08:10 Rohit Prasad: You have to think about the roles the AI plays, right? It differs from customer to customer and from situation to situation. I can speak from Alexa's perspective: it is a companion, a friend at times, an assistant, an advisor down the line. I think most AIs will have these kinds of attributes, and it will be very situational in nature. So where is the boundary? I think the boundary depends on the exact context in which you're interacting with the AI.
00:08:39 Lex Fridman: The depth and richness of natural language conversation has been used, by Alan Turing, to try to define what it means to be intelligent. There's a lot of criticism of that kind of test, but what do you think is a good test of intelligence in your view? In the context of the Turing test and Alexa, or the Alexa Prize, this whole realm, do you think about human intelligence, what it means to define it, what it means to reach that level?
00:09:10 Rohit Prasad: I do think the ability to converse is a sign of ultimate intelligence; I think there's no question about it. If you think about all aspects of humans, there are the sensors we have, and those are basically a data collection mechanism; based on that, we make decisions with our sensory brains, right? From that perspective, there are elements we have to talk about: how we sense the world, and then how we act based on what we sense. Those elements, machines clearly have. But then there's the other aspect, computation, where machines are way better. I also mentioned memory: it is near infinite, depending on the storage capacity you have, and retrieval can be extremely fast and pure, in the sense that there's no ambiguity about who did I see when, right? Machines can remember that quite well. So again, on a philosophical level, I do subscribe to the view that to be able to converse, and as part of that to be able to reason based on the world knowledge you've acquired and the sensory knowledge that is there, is very much the essence of intelligence. But intelligence can go beyond human-level intelligence, based on what machines are becoming capable of.
00:10:29 Lex Fridman: So, stepping outside of Alexa, broadly as an AI field, what do you think is a good test of intelligence? Put another way, outside of Alexa, because so much of Alexa is a product, an experience for the customer: on the research side, what would impress the heck out of you if you saw it? What is the test where you'd say, wow, this thing is now starting to encroach into the realm of what we loosely think of as human intelligence?
00:11:00 Rohit Prasad: Well, we think of it as AGI and human intelligence altogether, right? In some sense, and I think we are quite far from that. An unbiased view I have is that Alexa's intelligence capability is a great test. There are many other proof points, like self-driving cars, or game playing like Go or chess. Take those two as examples: they clearly require a lot of data-driven learning and intelligence, but neither is as hard a problem as an AI conversing with humans to accomplish certain tasks, or open-domain chat, as you mentioned with the Alexa Prize. In those settings, the key difference is that the end goal is not defined, unlike in game playing. You also do not know exactly what state you are in in a particular goal-completion scenario. In a certain sense, sometimes you can, if it's a simple goal. But take even examples like planning a weekend: you can imagine how many things change along the way. You may change your mind and change the destination, or you want to catch a particular event and then decide, no, there's this other event I want to go to. These dimensions of how many different steps are possible when you're conversing as a human with a machine make it an extremely daunting problem, and I think it is the ultimate test for intelligence.
00:12:32 Lex Fridman: And don't you think that natural language is enough to prove that? Conversation, just pure conversation?

00:12:40 Rohit Prasad: From a scientific standpoint, natural language is a great test, but I would go beyond. I don't want to limit it to natural language as simply understanding an intent or parsing for entities and so forth. We are really talking about dialogue.

00:12:54 Lex Fridman: Dialogue, yeah.

00:12:55 Rohit Prasad: So I would say human-machine dialogue is definitely one of the best tests of intelligence.
00:13:02 Lex Fridman: So can you briefly speak to the Alexa Prize, for people who are not familiar with it, and also where things stand, what you have learned, and what's been surprising? What have you seen that's surprising from this incredible competition?
00:13:18 Rohit Prasad: Absolutely, it's a very exciting competition. The Alexa Prize is essentially a grand challenge in conversational artificial intelligence, where we threw down the gauntlet to the universities who do active research in the field: can you build what we call a social bot that can converse with you coherently and engagingly for 20 minutes? That is an extremely hard challenge. Talking to someone you're meeting for the first time, or even someone you've met quite often, and speaking for 20 minutes on any topic, with an evolving set of topics, is super hard. We have completed two successful years of the competition. The first was won by the University of Washington, the second by the University of California. We are in our third instance: we have an extremely strong cohort of 10 teams, and the third instance of the Alexa Prize is underway now. We are seeing a constant evolution. The first year was definitely a learning experience; a lot of things had to be put together. We had to build a lot of infrastructure to enable these universities to build magical experiences and do high-quality research.
00:14:31 Lex Fridman: Just a few quick questions, sorry for the interruption. What does failure look like in the 20-minute session? What does it mean to fail, not to reach the 20-minute mark?
00:14:39 Rohit Prasad: Oh, awesome question. First of all, I forgot to mention one more detail: it's not just 20 minutes, but the quality of the conversation that matters too. And the beauty of this competition, before I answer what failure means, is that you actually converse with millions and millions of customers as the social bots. During the judging phases, there are multiple phases before we get to the finals, which is a very controlled judging situation where we bring in judges and we have interactors who interact with these social bots; that is a much more controlled setting. But until we get to the finals, all the judging is essentially by the customers of Alexa, and there you basically rate on a simple question: how good was your experience? So in that phase we are not testing for a 20-minute boundary being crossed, because you do want a clear-cut winner to be chosen against an absolute bar. So whether you really broke the 20-minute barrier is why we have to test in a more controlled setting, with interactors, essentially, and see how the conversation goes. This is the subtle difference between how it's tested in the field with real customers versus in the lab to award the prize. In the latter, essentially there are three judges, and two of them have to say that the conversation has stalled.
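The stall rule described here (a conversation ends when two of three judges mark it as stalled) is just a majority vote, which can be sketched in a few lines. This is my own illustration of the rule as stated, not the Alexa Prize's actual scoring code.

```python
# Toy sketch of the stall rule: the conversation ends when a
# majority (2 of 3) of the human judges mark it as stalled.
# Data shapes here are illustrative assumptions, not the real system.

def conversation_stalled(votes: list) -> bool:
    """votes[i] is True if judge i thinks the conversation has stalled."""
    return sum(votes) >= 2  # majority of three judges

print(conversation_stalled([True, False, True]))   # True: conversation ends
print(conversation_stalled([False, False, True]))  # False: conversation continues
```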
00:16:13 Lex Fridman: Got it. And the judges are human experts?

00:16:15 Rohit Prasad: Judges are human experts.

00:16:16 Lex Fridman: Okay, great. So this is the third year. What's been the evolution? In the DARPA Grand Challenge for autonomous vehicles, nobody finished the first year; in the second year, a few more finished in the desert. So how far along in this, I would say, much harder challenge are we?
00:16:36 Rohit Prasad: This challenge has come a long way, but we're definitely not close to crossing the 20-minute barrier with coherent and engaging conversation. I think we are still five to ten years away on that horizon. But the progress is immense. What we're finding is that the accuracy, and the kinds of responses these social bots generate, keep getting better. What's even amazing to see is that now there's humor coming in. The bots are quite...

00:17:04 Lex Fridman: Awesome.

00:17:05 Rohit Prasad: You know, you're talking about the ultimate science of intelligence. I think humor is a very high bar in terms of what it takes to create. And I don't mean just being goofy; a good sense of humor is also a sign of intelligence in my mind, and something very hard to do. So these social bots are now exploring not only what we think of as natural language abilities, but also personality attributes, and aspects like when to inject an appropriate joke, or, when you don't know the domain, how you come back with something more intelligible so that you can continue the conversation. If you and I are talking about AI and we are domain experts, we can speak to it, but if you suddenly switch to a topic I don't know, how do I change the conversation? You're starting to notice these elements as well. That's coming partly from the nature of the 20-minute challenge: people are getting quite clever about how to really converse and essentially mask some of the understanding defects, if they exist.
00:18:09 Lex Fridman: So some of this, this is not Alexa the product; this is somewhat for fun, for research, for innovation, and so on. I have a question: in this modern era, if you look at Twitter and Facebook and so on, there's public discourse going on, and people get blocked for things that are a little bit too edgy, and so on. Just out of curiosity, are people in this context pushing the limits? Is anyone using the F-word? Is anyone pushing back, arguing, I guess I should say, as part of the dialogue, to really draw people in?
00:18:48 Rohit Prasad: First of all, let me just back up a bit in terms of why we are doing this, right? You said it's fun. I think fun is more the engaging part for customers; it is also one of the most used skills in our skill store. But that apart, the real goal was this: with a lot of AI research moving to industry, we felt that academia risks not having the same resources at its disposal that we have, which are lots of data, massive computing power, and clear ways to test these AI advances with real customer benefits. So we brought all three together in the Alexa Prize. That's why it's one of my favorite projects at Amazon. And with that, the secondary effect is, yes, it has become engaging for our customers as well. We're not where we want it to be yet, right? But it's huge progress.

But coming back to your question of how the conversations evolve: yes, there are some natural attributes of what you said, in terms of argument and some amount of swearing. The way we take care of that is a sensitive filter we have built that looks at keywords. It's more than keywords, actually: of course there's a keyword-based component, but these words can be very contextual, as you can see, and the topic itself can be something you don't want a conversation to happen about, because this is a communal device; a lot of people use these devices. So we have put in a lot of guardrails for the conversation to be more useful for advancing AI, and not so much about those other issues you attributed to what's happening in the AI field.
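The two-stage filter Rohit describes (a keyword pass plus a contextual topic check) can be sketched roughly as follows. This is a minimal toy illustration of the general idea only; the blocklist, the topic classifier, and all names here are my assumptions, and Alexa's actual filter is not public.

```python
# Hypothetical sketch of a two-stage sensitive-content filter:
# stage 1 is a fast keyword match, stage 2 checks the topic in context.
# Everything here (lists, classifier, names) is illustrative, not Alexa's code.

BLOCKLIST = {"badword1", "badword2"}               # placeholder keywords
SENSITIVE_TOPICS = {"violence", "medical_advice"}  # placeholder topic labels

def classify_topic(utterance: str) -> str:
    """Stand-in for a learned topic classifier (here: a trivial rule)."""
    if "hurt" in utterance.lower():
        return "violence"
    return "general"

def is_allowed(utterance: str) -> bool:
    words = {w.strip(".,!?").lower() for w in utterance.split()}
    if words & BLOCKLIST:                           # stage 1: keyword match
        return False
    if classify_topic(utterance) in SENSITIVE_TOPICS:  # stage 2: topic in context
        return False
    return True

print(is_allowed("let's chat about music"))  # True: passes both stages
print(is_allowed("I want to hurt someone"))  # False: blocked by topic check
```

The point of the second stage is exactly what Rohit notes: a word list alone misses utterances that are individually innocuous but sensitive in context.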
00:20:32 Lex Fridman: Right, so this is actually a serious opportunity; I didn't use the right word, "fun." I think it's an open opportunity to do some of the best innovation in conversational agents in the world.
00:20:44 Rohit Prasad: Absolutely.

00:20:45 Lex Fridman: Why just universities?

00:20:49 Rohit Prasad: Why just universities? Because, as I said, I really felt...

00:20:51 Lex Fridman: Young minds.

00:20:52 Rohit Prasad: Young minds. It's also that, if you think about the other aspect of where the whole industry is moving with AI, there's a dearth of talent given the demands. So you do want universities to have a clear place where they can invent and research, and not fall so far behind that they can't motivate students. Imagine if all the grad students left for industry like us, or the faculty members, which has happened too. So if you're passionate about a field where you feel industry and academia need to work well together, this is a great example and a great way for universities to participate.
00:21:35 Lex Fridman: So what do you think it takes to build a system that wins the Alexa Prize?
00:21:39 Rohit Prasad: I think you have to start focusing on aspects of reasoning. Right now there are still more lookups of what intent the customer is asking for, and responses to those, rather than real reasoning about the elements of the conversation. For instance, if the conversation is about games and a recent sports event, there's so much context involved, and you have to understand the entities being mentioned so that the conversation stays coherent, rather than suddenly surfacing some fact about a sports entity and just relaying it, without understanding the true context of the game. If the bot just says, "I learned this fun fact about Tom Brady," rather than really speaking to how he played the game the previous night, then the conversation is not really that intelligent. So you have to go to the more reasoning-centric elements of understanding the context of the dialogue and giving more appropriate responses. That tells you we are still quite far, because a lot of the time it's facts being looked up, and something close enough offered as an answer, but not really the answer. That is where the research needs to go: toward actual true understanding and reasoning. And that's why I feel this is a great way to do it, because you have an engaged set of users working to help these AI advances happen.
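The lookup-versus-reasoning distinction Rohit draws can be made concrete with a toy contrast: a lookup-style bot fires on an entity mention and dumps a canned fact, while a context-aware bot also conditions on what the dialogue is actually about. This is my own minimal illustration, not Alexa's architecture; the fact table, topic key, and responses are all made up.

```python
# Toy contrast of intent/fact lookup vs. context-aware response selection.
# All data and names here are illustrative assumptions.

FACTS = {"tom brady": "Tom Brady has won multiple Super Bowls."}

def lookup_bot(utterance: str) -> str:
    # Pure lookup: fires on an entity mention, ignores conversation state.
    for entity, fact in FACTS.items():
        if entity in utterance.lower():
            return fact
    return "Interesting! Tell me more."

def context_bot(utterance: str, context: dict) -> str:
    # Context-aware: tailors the response to the topic under discussion,
    # falling back to plain lookup when no context applies.
    if "tom brady" in utterance.lower() and context.get("topic") == "last night's game":
        return "Right, and how do you think Brady played last night?"
    return lookup_bot(utterance)

ctx = {"topic": "last night's game"}
print(lookup_bot("Did you see Tom Brady?"))       # canned fact, off-topic
print(context_bot("Did you see Tom Brady?", ctx))  # stays on the game
```

The lookup bot's reply is "close enough as an answer, but not really the answer," which is exactly the failure mode described above.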
00:23:18 Lex Fridman: You mentioned customers quite a bit, and there's a skill. What is the experience for the user who's helping? Just to clarify: as far as I understand, this skill is a standalone skill focused on the Alexa Prize. It's not you ordering things on Amazon, or checking the weather, or playing Spotify, right? This is a separate skill. So for the customers focused on helping, how do they think of it? Are they having fun? Are they helping teach the system? What's the experience like?
link |
00:23:53.040
I think it's both actually.
link |
00:23:54.640
And let me tell you how you invoke this skill.
link |
00:23:57.800
So all you have to say is, Alexa, let's chat.
link |
00:24:00.200
And then the first time you say, Alexa, let's chat,
link |
00:24:03.320
it comes back with a clear message
link |
00:24:04.720
that you're interacting with one of those
link |
00:24:06.240
university social bots.
link |
00:24:08.000
And there's a clear,
link |
00:24:09.320
so you know exactly how you interact, right?
link |
00:24:11.800
And that is why it's very transparent.
link |
00:24:14.080
You are being asked to help, right?
link |
00:24:16.240
And we have a lot of mechanisms
link |
00:24:18.800
where, as we are in the first feedback phase,
link |
00:24:23.680
we send a lot of emails to our customers
link |
00:24:26.720
and then they know that the team needs a lot of interactions
link |
00:24:31.760
to improve the accuracy of the system.
link |
00:24:33.920
So we know we have a lot of customers
link |
00:24:35.880
who really want to help these university bots
link |
00:24:38.920
and they're conversing with that.
link |
00:24:40.400
And some are just having fun with just saying,
link |
00:24:42.680
Alexa, let's chat.
link |
00:24:44.000
And also some adversarial behavior to see whether,
link |
00:24:47.320
how much do you understand as a social bot?
link |
00:24:50.240
So I think we have a good,
link |
00:24:51.480
healthy mix of all three situations.
link |
00:24:53.920
So what is the,
link |
00:24:55.280
if we talk about solving the Alexa challenge,
link |
00:24:58.040
the Alexa prize,
link |
00:25:00.720
what does the data set of really engaging,
link |
00:25:05.480
pleasant conversations look like?
link |
00:25:07.520
Because if we think of this
link |
00:25:08.360
as a supervised learning problem,
link |
00:25:10.600
I don't know if it has to be,
link |
00:25:12.200
but if it does, maybe you can comment on that.
link |
00:25:15.400
Do you think there needs to be a data set
link |
00:25:17.480
of what it means to be an engaging, successful,
link |
00:25:21.880
fulfilling conversation?
link |
00:25:22.720
I think that's part of the research question here.
link |
00:25:24.760
I think we at least got the first part right,
link |
00:25:29.200
which is have a way for universities to build
link |
00:25:33.360
and test in a real world setting.
link |
00:25:35.680
Now you're asking in terms of the next phase of questions,
link |
00:25:38.640
which we are still, we're also asking, by the way,
link |
00:25:41.120
what does success look like from an optimization function?
link |
00:25:45.400
That's what you're asking in terms of,
link |
00:25:47.200
we as researchers are used to having a great corpus
link |
00:25:49.560
of annotated data and then
link |
00:25:53.480
sort of tuning our algorithms on those, right?
link |
00:25:57.600
And fortunately and unfortunately,
link |
00:26:00.640
in this world of Alexa prize,
link |
00:26:02.920
that is not the way we are going after it.
link |
00:26:05.400
So you have to focus more on learning
link |
00:26:07.720
based on live feedback.
link |
00:26:10.920
That is another element that's unique.
link |
00:26:12.960
I started with telling you
link |
00:26:15.080
how you ingress
link |
00:26:17.280
and experience this capability as a customer.
link |
00:26:21.520
What happens when you're done?
link |
00:26:23.600
So they ask you a simple question on a scale of one to five,
link |
00:26:27.560
how likely are you to interact with this social bot again?
link |
00:26:31.880
That is good feedback
link |
00:26:33.840
and customers can also leave more open ended feedback.
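The one-to-five rating described here can act as a live reward signal instead of a pre-annotated corpus. As a rough, hypothetical sketch (not Amazon's actual system), a bandit-style learner could track the average rating each response strategy earns and favor the best one:

```python
# Hypothetical sketch: treat each response strategy as a bandit arm and
# update its estimated value from live 1-5 user ratings, with no
# labeled corpus. Strategy names are illustrative assumptions.

class FeedbackLearner:
    def __init__(self, strategies):
        # running [mean rating, count] per strategy
        self.stats = {s: [0.0, 0] for s in strategies}

    def record(self, strategy, rating):
        """Incorporate a 1-5 'would you chat again?' rating."""
        if not 1 <= rating <= 5:
            raise ValueError("rating must be 1-5")
        mean, n = self.stats[strategy]
        n += 1
        mean += (rating - mean) / n  # incremental mean update
        self.stats[strategy] = [mean, n]

    def best(self):
        """Pick the strategy with the highest average rating so far."""
        return max(self.stats, key=lambda s: self.stats[s][0])

learner = FeedbackLearner(["fact_lookup", "contextual_reply"])
learner.record("fact_lookup", 2)
learner.record("contextual_reply", 4)
learner.record("contextual_reply", 5)
print(learner.best())  # contextual_reply
```

A real system would also use the "user quit the conversation" signal mentioned later as an implicit negative reward.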
link |
00:26:37.440
And I think partly that to me
link |
00:26:40.840
is one part of the question you're asking,
link |
00:26:42.640
which I'm saying is a mental model shift
link |
00:26:44.600
that as researchers also,
link |
00:26:47.120
you have to change your mindset
link |
00:26:48.560
that this is not a DARPA evaluation or NSF funded study
link |
00:26:52.680
and you have a nice corpus.
link |
00:26:54.960
This is where it's real world.
link |
00:26:56.960
You have real data.
link |
00:26:58.720
The scale is amazing and that's a beautiful thing.
link |
00:27:01.560
And then the customer,
link |
00:27:02.960
the user can quit the conversation at any time.
link |
00:27:06.160
Exactly, the user can,
link |
00:27:07.200
that is also a signal for how good you were at that point.
link |
00:27:11.720
So, and then on a scale one to five, one to three,
link |
00:27:15.000
do they say how likely are you
link |
00:27:16.360
or is it just a binary?
link |
00:27:18.040
One to five.
link |
00:27:18.880
One to five.
link |
00:27:20.040
Wow, okay, that's such a beautifully constructed challenge.
link |
00:27:22.680
Okay.
link |
00:27:24.720
You said the only way to make a smart assistant really smart
link |
00:27:30.040
is to give it eyes and let it explore the world.
link |
00:27:34.560
I'm not sure it might've been taken out of context,
link |
00:27:36.840
but can you comment on that?
link |
00:27:38.240
Can you elaborate on that idea?
link |
00:27:40.080
I personally also find that idea super exciting
link |
00:27:43.120
from a social robotics, personal robotics perspective.
link |
00:27:46.240
Yeah, a lot of things do get taken out of context.
link |
00:27:48.840
This particular one was just
link |
00:27:50.600
a philosophical discussion we were having
link |
00:27:53.000
in terms of what does intelligence look like?
link |
00:27:55.520
And the context was in terms of learning,
link |
00:27:59.200
I think we said that we as humans are empowered
link |
00:28:03.040
with many different sensory abilities.
link |
00:28:05.480
I do believe that eyes are an important aspect of it
link |
00:28:09.560
in terms of if you think about how we as humans learn,
link |
00:28:14.640
it is quite complex and it's also not unimodal
link |
00:28:18.320
that you are fed a ton of text or audio
link |
00:28:22.040
and you just learn that way.
link |
00:28:23.360
No, you learn by experience, you learn by seeing,
link |
00:28:27.240
you're taught by humans
link |
00:28:30.320
and we are very efficient in how we learn.
link |
00:28:33.240
Machines on the contrary are very inefficient
link |
00:28:35.320
on how they learn, especially these AIs.
link |
00:28:38.480
I think the next wave of research is going to be
link |
00:28:42.640
with less data,
link |
00:28:46.000
not just with less labeled data,
link |
00:28:48.240
but also with a lot of weak supervision
link |
00:28:51.080
and where you can increase the learning rate.
link |
00:28:55.160
I don't mean less data
link |
00:28:56.120
in terms of not having a lot of data to learn from
link |
00:28:58.640
that we are generating so much data,
link |
00:29:00.360
but it is more about the aspect
link |
00:29:02.640
of how fast you can learn.
link |
00:29:04.880
So improving the quality of the data,
link |
00:29:07.880
the quality of data and the learning process.
link |
00:29:09.920
I think more on the learning process.
link |
00:29:11.440
I think we have to, we as humans learn
link |
00:29:13.560
with a lot of noisy data, right?
link |
00:29:15.720
And I think that's the part
link |
00:29:18.480
that I don't think should change.
link |
00:29:21.440
What should change is how we learn, right?
link |
00:29:23.880
So if you look at, you mentioned supervised learning,
link |
00:29:26.080
we are making transformative shifts
link |
00:29:27.960
from moving to more unsupervised, more weak supervision.
link |
00:29:31.160
Those are the key aspects of how to learn.
link |
00:29:34.840
And I think in that setting, I hope you agree with me
link |
00:29:37.760
that having other senses is very crucial
link |
00:29:41.680
in terms of how you learn.
link |
00:29:43.480
So absolutely.
link |
00:29:44.640
And from a machine learning perspective,
link |
00:29:46.680
which I hope we get a chance to talk about, a few aspects
link |
00:29:49.680
that are fascinating there,
link |
00:29:51.080
but to stick on the point of sort of a body,
link |
00:29:55.600
an embodiment.
link |
00:29:56.440
So Alexa has a body.
link |
00:29:57.520
It has a very minimalistic, beautiful interface
link |
00:30:01.600
where there's a ring and so on.
link |
00:30:02.840
I mean, I'm not sure of all the flavors
link |
00:30:04.480
of the devices that Alexa lives on,
link |
00:30:07.560
but there's a minimalistic basic interface.
link |
00:30:13.280
And nevertheless, we humans, so I have a Roomba,
link |
00:30:15.640
I have all kinds of robots all over everywhere.
link |
00:30:18.240
So what do you think the Alexa of the future looks like
link |
00:30:24.680
if it begins to shift what its body looks like?
link |
00:30:29.240
Maybe beyond the Alexa,
link |
00:30:30.640
what do you think are the different devices in the home
link |
00:30:33.720
as they start to embody their intelligence more and more?
link |
00:30:36.880
What do you think that looks like?
link |
00:30:38.080
Philosophically, in the future, what do you think that looks like?
link |
00:30:41.200
I think let's look at what's happening today.
link |
00:30:43.600
You mentioned, I think, our devices, as in Amazon devices,
link |
00:30:46.840
but I also wanted to point out Alexa is already integrated into
link |
00:30:49.840
a lot of third party devices,
link |
00:30:51.360
which also come in lots of forms and shapes,
link |
00:30:54.840
some in robots, some in microwaves,
link |
00:30:58.960
some in appliances that you use in everyday life.
link |
00:31:02.600
So I think it's not just the shape Alexa takes
link |
00:31:07.720
in terms of form factors,
link |
00:31:09.200
but it's also where all it's available.
link |
00:31:13.000
And it's getting in cars,
link |
00:31:14.240
it's getting in different appliances in homes,
link |
00:31:16.760
even toothbrushes, right?
link |
00:31:18.720
So I think you have to think about it
link |
00:31:20.760
not as a physical assistant.
link |
00:31:25.440
It will be in some embodiment, as you said,
link |
00:31:28.480
we already have these nice devices,
link |
00:31:31.120
but I think it's also important to think of it,
link |
00:31:33.800
it is a virtual assistant.
link |
00:31:35.640
It is superhuman in the sense that it is in multiple places
link |
00:31:38.520
at the same time.
link |
00:31:40.280
So I think the actual embodiment in some sense,
link |
00:31:45.200
to me doesn't matter.
link |
00:31:47.600
I think you have to think of it not as human like
link |
00:31:52.800
and more of what its capabilities are
link |
00:31:56.080
that derive a lot of benefit for customers
link |
00:31:58.840
and how there are different ways
link |
00:32:00.680
to delight customers in different experiences.
link |
00:32:03.960
And I think I'm a big fan of it not being just human like,
link |
00:32:09.240
it should be human like in certain situations.
link |
00:32:11.120
The Alexa Prize social bot in terms of conversation
link |
00:32:13.360
is a great way to look at it,
link |
00:32:14.920
but there are other scenarios where human like,
link |
00:32:18.800
I think is underselling the abilities of this AI.
link |
00:32:22.080
So if I could trivialize what we're talking about.
link |
00:32:26.120
So if you look at the way Steve Jobs thought
link |
00:32:29.400
about the interaction with the device that Apple produced,
link |
00:32:33.440
there was an extreme focus on controlling the experience
link |
00:32:36.760
by making sure there were only these Apple produced devices.
link |
00:32:40.200
You see the voice of Alexa taking all kinds of forms
link |
00:32:45.600
depending on what the customers want.
link |
00:32:47.080
And that means it could be anywhere
link |
00:32:49.920
from the microwave to a vacuum cleaner to the home
link |
00:32:53.760
and so on. The voice is the essential element
link |
00:32:56.960
of the interaction.
link |
00:32:57.800
I think voice is an essence, it's not all,
link |
00:33:01.160
but it's a key aspect.
link |
00:33:02.240
I think to your question in terms of,
link |
00:33:05.720
you should be able to recognize Alexa
link |
00:33:08.280
and that's a huge problem.
link |
00:33:10.000
I think in terms of a huge scientific problem,
link |
00:33:12.080
I should say like, what are the traits?
link |
00:33:13.800
What makes it look like Alexa,
link |
00:33:16.200
especially in different settings
link |
00:33:17.600
and especially if it's primarily voice, what it is,
link |
00:33:20.440
but Alexa is not just voice either, right?
link |
00:33:22.320
I mean, we have devices with a screen.
link |
00:33:25.080
Now you're seeing just other behaviors of Alexa.
link |
00:33:28.520
So I think we're in very early stages of what that means
link |
00:33:31.400
and this will be an important topic for the following years.
link |
00:33:34.960
But I do believe that being able to recognize
link |
00:33:38.240
and tell when it's Alexa versus it's not
link |
00:33:40.520
is going to be important from an Alexa perspective.
link |
00:33:43.400
I'm not speaking for the entire AI community,
link |
00:33:46.040
but I think attribution and as we go into more
link |
00:33:51.040
of understanding who did what,
link |
00:33:54.400
that identity of the AI is crucial in the coming world.
link |
00:33:58.000
I think from the broad AI community perspective,
link |
00:34:00.320
that's also a fascinating problem.
link |
00:34:02.120
So basically if I close my eyes and listen to the voice,
link |
00:34:05.480
what would it take for me to recognize that this is Alexa?
link |
00:34:08.040
Exactly.
link |
00:34:08.880
Or at least the Alexa that I've come to know
link |
00:34:10.600
from my personal experience in my home
link |
00:34:13.000
through my interactions that come through.
link |
00:34:14.400
Yeah, and the Alexa here in the US is very different
link |
00:34:16.920
than Alexa in UK and the Alexa in India,
link |
00:34:19.440
even though they are all speaking English
link |
00:34:21.640
or the Australian version.
link |
00:34:23.280
So again, so now think about when you go
link |
00:34:26.680
into a different culture, a different community,
link |
00:34:28.400
when you travel there, how do you recognize Alexa?
link |
00:34:31.800
I think these are super hard questions actually.
link |
00:34:34.160
So there's a team that works on personality.
link |
00:34:36.840
So if we talk about those different flavors
link |
00:34:39.360
of what it means culturally speaking,
link |
00:34:41.040
India, UK, US, what does it mean to add?
link |
00:34:44.680
So the problem that we just stated,
link |
00:34:46.440
it's just fascinating, how do we make it purely recognizable
link |
00:34:51.080
that it's Alexa, assuming that the qualities
link |
00:34:55.000
of the voice are not sufficient?
link |
00:34:58.040
It's also the content of what is being said.
link |
00:35:01.000
How do we do that?
link |
00:35:02.160
How does the personality come into play?
link |
00:35:04.320
What's that research gonna look like?
link |
00:35:06.800
I mean, it's such a fascinating subject.
link |
00:35:08.120
We have some very fascinating folks
link |
00:35:11.080
who from both the UX background and human factors
link |
00:35:13.560
are looking at these aspects and these exact questions.
link |
00:35:16.360
But I'll definitely say it's not just how it sounds,
link |
00:35:21.600
the choice of words, the tone, not just, I mean,
link |
00:35:25.320
the voice identity of it, but the tone matters,
link |
00:35:28.040
the speed matters, how you speak,
link |
00:35:30.720
how you enunciate words, what choice of words
link |
00:35:34.880
are you using, how terse are you,
link |
00:35:37.320
or how lengthy in your explanations you are.
link |
00:35:40.720
All of these are factors.
link |
00:35:42.920
And you also, you mentioned something crucial
link |
00:35:45.440
that you may have personalized Alexa,
link |
00:35:49.160
to some extent in your homes
link |
00:35:51.400
or in the devices you are interacting with.
link |
00:35:53.440
So you, as an individual, how you prefer Alexa to sound
link |
00:35:59.240
can be different than how I prefer.
link |
00:36:01.240
And the amount of customizability you want to give
link |
00:36:04.440
is also a key debate we always have.
link |
00:36:07.640
But I do want to point out it's more than the voice actor
link |
00:36:10.720
that recorded and it sounds like that actor.
link |
00:36:14.000
It is more about the choices of words,
link |
00:36:16.920
the attributes of tonality, the volume
link |
00:36:19.800
in terms of how you raise your pitch and so forth.
link |
00:36:22.600
All of that matters.
link |
00:36:23.880
This is such a fascinating problem
link |
00:36:25.440
from a product perspective.
link |
00:36:27.600
I could see those debates just happening
link |
00:36:29.480
inside of the Alexa team of how much personalization
link |
00:36:32.440
do you do for the specific customer?
link |
00:36:34.440
Because you're taking a risk if you over personalize.
link |
00:36:38.240
Because if you create a personality
link |
00:36:42.080
for a million people, you can test that better.
link |
00:36:46.040
You can create a rich, fulfilling experience
link |
00:36:48.640
that will do well.
link |
00:36:50.040
But the more you personalize it, the less you can test it,
link |
00:36:53.480
the less you can know that it's a great experience.
link |
00:36:56.320
So how much personalization, what's the right balance?
link |
00:36:59.720
I think the right balance depends on the customer.
link |
00:37:01.600
Give them the control.
link |
00:37:02.800
So I'll say, I think the more control you give customers,
link |
00:37:07.400
the better it is for everyone.
link |
00:37:09.600
And I'll give you some key personalization features.
link |
00:37:13.880
I think we have a feature called Remember This,
link |
00:37:15.840
which is where you can tell Alexa to remember something.
link |
00:37:19.440
There you have an explicit sort of control
link |
00:37:23.080
in customer's hand because they have to say,
link |
00:37:24.600
Alexa, remember X, Y, Z.
link |
00:37:26.520
What kind of things would that be used for?
link |
00:37:28.000
So, like, I have stored my tire specs
link |
00:37:32.200
for my car because it's so hard to go and find
link |
00:37:34.800
and see what it is, right?
link |
00:37:36.760
When you're having some issues.
link |
00:37:38.320
I store my mileage plan numbers
link |
00:37:41.440
for all the frequent flyer ones
link |
00:37:43.120
where I'm sometimes just looking at it and it's not handy.
link |
00:37:46.520
So those are my own personal choices I've made
link |
00:37:49.960
for Alexa to remember something on my behalf, right?
link |
00:37:52.320
So again, I think the choice was be explicit
link |
00:37:56.000
about how you provide that to a customer as a control.
link |
00:38:00.000
So I think these are the aspects of what you do.
link |
00:38:03.440
Like think about where we can use speaker recognition
link |
00:38:07.360
capabilities, where, if you taught Alexa
link |
00:38:11.000
that you are Lex and this person in your household
link |
00:38:14.440
is person two, then you can personalize the experiences.
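One plausible, purely hypothetical way to implement the speaker recognition described here is to enroll each household member as a voice embedding and match new utterances by cosine similarity; the vectors and threshold below are toy stand-ins for a real speaker model:

```python
# Hypothetical sketch of speaker recognition for personalization:
# enroll each household member as a voice embedding, then identify the
# closest enrolled speaker for a new utterance by cosine similarity.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy enrollment embeddings (stand-ins for a real speaker model).
enrolled = {
    "lex": [0.9, 0.1, 0.3],
    "person_two": [0.1, 0.8, 0.4],
}

def identify(utterance_embedding, threshold=0.8):
    """Return the best-matching enrolled speaker, or None if unsure."""
    best = max(enrolled, key=lambda s: cosine(enrolled[s], utterance_embedding))
    if cosine(enrolled[best], utterance_embedding) < threshold:
        return None  # fall back to a non-personalized experience
    return best

print(identify([0.85, 0.15, 0.3]))  # lex
```

The threshold matters for the transparency point made next: when confidence is low, the safe design choice is to skip personalization rather than guess.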
link |
00:38:17.920
Again, the CX, the customer experience patterns,
link |
00:38:22.840
are very clear and transparent
link |
00:38:26.520
about when a personalization action is happening.
link |
00:38:30.040
And then you have other ways like you go
link |
00:38:32.240
through explicit control right now through your app
link |
00:38:34.640
where, of your multiple service providers,
link |
00:38:36.920
let's say for music, which one is your preferred one.
link |
00:38:39.480
So when you say, play Sting, depending on
link |
00:38:42.000
whether you have preferred Spotify or Amazon Music
link |
00:38:44.880
or Apple music, that the decision is made
link |
00:38:47.240
where to play it from.
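The provider routing just described can be sketched as a simple preference lookup with a fallback; the provider names and default below are illustrative assumptions, not Amazon's actual configuration:

```python
# Hypothetical sketch: route "play Sting" to a customer's preferred music
# provider chosen via explicit control in the app; names are illustrative.

DEFAULT_PROVIDER = "amazon_music"  # assumed fallback, not a confirmed default

def resolve_provider(preferences, customer_id):
    """Return the provider a music request should be routed to."""
    return preferences.get(customer_id, DEFAULT_PROVIDER)

prefs = {"lex": "spotify"}
print(resolve_provider(prefs, "lex"))       # spotify
print(resolve_provider(prefs, "new_user"))  # amazon_music
```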
link |
00:38:49.480
So what's Alexa's backstory from her perspective?
link |
00:38:52.720
Is there, I remember just asking as probably a lot
link |
00:38:58.120
of us are just the basic questions about love
link |
00:39:00.600
and so on of Alexa, just to see what the answer would be.
link |
00:39:03.800
It feels like there's a little bit of a personality
link |
00:39:10.280
but not too much.
link |
00:39:12.840
Does Alexa have a metaphysical presence
link |
00:39:18.360
in this human universe we live in
link |
00:39:21.880
or is it something more ambiguous?
link |
00:39:23.720
Is there a past?
link |
00:39:25.080
Is there a birth?
link |
00:39:26.240
Is there a family kind of idea
link |
00:39:28.920
even for joking purposes and so on?
link |
00:39:31.120
Well, it does tell you, I think.
link |
00:39:34.800
I should double check this, but if you said,
link |
00:39:36.320
when were you born, I think we do respond.
link |
00:39:39.000
I need to double check that
link |
00:39:40.120
but I'm pretty positive about it.
link |
00:39:41.480
I think you do actually because I think I've tested that.
link |
00:39:44.000
But that's like how I was born in your brand of champagne
link |
00:39:49.120
and whatever the year kind of thing, yeah.
link |
00:39:51.240
So in terms of the metaphysical, I think it's early.
link |
00:39:55.720
Does it have the historic knowledge about herself
link |
00:40:00.360
to be able to do that?
link |
00:40:01.440
Maybe, have we crossed that boundary?
link |
00:40:03.720
Not yet, right?
link |
00:40:04.560
In terms of being, thank you.
link |
00:40:06.520
We have thought about it quite a bit,
link |
00:40:08.600
but I wouldn't say that we have come to a clear decision
link |
00:40:11.480
in terms of what it should look like.
link |
00:40:13.000
But you can imagine though, and I bring this back
link |
00:40:16.920
to the Alexa Prize social bot one,
link |
00:40:19.200
there you will start seeing some of that.
link |
00:40:21.200
Like these bots have their identity
link |
00:40:23.440
and in terms of that, you may find,
link |
00:40:26.800
this is such a great research topic
link |
00:40:28.400
that some academic team may think of these problems
link |
00:40:32.120
and start solving them too.
link |
00:40:35.080
So let me ask a question.
link |
00:40:38.840
It's kind of difficult, I think,
link |
00:40:41.160
but it feels fascinating to me
link |
00:40:43.280
because I'm fascinated with psychology.
link |
00:40:45.320
It feels that the more personality you have,
link |
00:40:48.200
the more dangerous it is
link |
00:40:50.400
in terms of a customer perspective of product.
link |
00:40:54.480
If you want to create a product that's useful.
link |
00:40:57.080
By dangerous, I mean creating an experience that upsets me.
link |
00:41:02.360
And so how do you get that right?
link |
00:41:06.680
Because if you look at the relationships,
link |
00:41:10.040
maybe I'm just a screwed up Russian,
link |
00:41:11.800
but if you look at the human to human relationship,
link |
00:41:15.040
some of our deepest relationships have fights,
link |
00:41:18.120
have tension, have the push and pull,
link |
00:41:21.200
have a little flavor in them.
link |
00:41:22.800
Do you want to have such flavor in an interaction with Alexa?
link |
00:41:26.800
How do you think about that?
link |
00:41:28.200
So there's one other common thing that you didn't say,
link |
00:41:31.280
but we think of it as paramount for any deep relationship.
link |
00:41:35.000
That's trust.
link |
00:41:36.680
Trust, yeah.
link |
00:41:37.520
So I think if you trust every attribute you said,
link |
00:41:40.960
a fight, some tension, is all healthy.
link |
00:41:44.880
But what is sort of non-negotiable in this instance is trust.
link |
00:41:49.880
And I think the bar to earn customer trust for AI
link |
00:41:52.920
is very high, in some sense, more than a human.
link |
00:41:56.880
It's not just about personal information or your data.
link |
00:42:02.040
It's also about your actions on a daily basis.
link |
00:42:05.120
How trustworthy are you in terms of consistency,
link |
00:42:07.920
in terms of how accurate are you in understanding me?
link |
00:42:11.200
Like if you're talking to a person on the phone,
link |
00:42:13.680
if you have a problem with your,
link |
00:42:14.880
let's say your internet or something,
link |
00:42:16.360
if the person's not understanding,
link |
00:42:17.720
you lose trust right away.
link |
00:42:19.040
You don't want to talk to that person.
link |
00:42:20.960
That whole example gets amplified by a factor of 10,
link |
00:42:24.360
because when you're a human interacting with an AI,
link |
00:42:28.200
you have a certain expectation.
link |
00:42:29.720
Either you expect it to be very intelligent
link |
00:42:31.960
and then you get upset, why is it behaving this way?
link |
00:42:34.400
Or you expect it to be not so intelligent
link |
00:42:37.640
and when it surprises you, you're like,
link |
00:42:38.800
really, you're trying to be too smart?
link |
00:42:40.960
So I think we grapple with these hard questions as well.
link |
00:42:43.680
But I think the key is actions need to be trustworthy.
link |
00:42:47.720
From these AIs, not just about data protection,
link |
00:42:50.840
your personal information protection,
link |
00:42:53.400
but also from how accurately it accomplishes
link |
00:42:57.200
all commands or all interactions.
link |
00:42:59.760
Well, it's tough to hear because trust,
link |
00:43:02.200
you're absolutely right,
link |
00:43:03.080
but trust is such a high bar with AI systems
link |
00:43:05.560
because people, and I see this
link |
00:43:07.400
because I work with autonomous vehicles.
link |
00:43:08.880
I mean, the bar that's placed on an AI system
link |
00:43:11.720
is unreasonably high.
link |
00:43:13.440
Yeah, that is going to be, I agree with you.
link |
00:43:16.120
And I think of it as it's a challenge
link |
00:43:19.920
and it's also keeps my job, right?
link |
00:43:23.120
So from that perspective, I totally,
link |
00:43:26.360
I think of it from both sides, as a customer
link |
00:43:28.720
and as a researcher.
link |
00:43:30.240
I think as a researcher, yes, occasionally it will frustrate
link |
00:43:33.400
me that the bar is so high for these AIs.
link |
00:43:36.920
And as a customer, then I say,
link |
00:43:38.640
absolutely, it has to be that high, right?
link |
00:43:40.920
So I think that's the trade off we have to balance,
link |
00:43:44.120
but it doesn't change the fundamentals.
link |
00:43:46.760
That trust has to be earned and the question then becomes
link |
00:43:50.520
is are we holding the AIs to a different bar
link |
00:43:53.520
in accuracy and mistakes than we hold humans?
link |
00:43:56.320
Those are going to be great societal questions
link |
00:43:58.280
for years to come, I think for us.
link |
00:44:00.320
Well, one of the questions that we grapple with as a society now
link |
00:44:04.000
that I think about a lot,
link |
00:44:05.480
I think a lot of people in the AI think about a lot
link |
00:44:07.840
and Alexa is taking on head on, is privacy.
link |
00:44:11.640
The reality is us giving over data to any AI system
link |
00:44:20.760
can be used to enrich our lives in profound ways.
link |
00:44:25.800
So if basically any product that does anything awesome
link |
00:44:28.520
for you, the more data it has,
link |
00:44:31.680
the more awesome things it can do.
link |
00:44:34.040
And yet on the other side,
link |
00:44:37.040
people imagine the worst case possible scenario
link |
00:44:39.400
of what can you possibly do with that data?
link |
00:44:42.240
It goes down to trust, as you said before.
link |
00:44:45.680
There's a fundamental distrust of,
link |
00:44:48.200
in certain groups of governments and so on.
link |
00:44:50.440
And depending on the government,
link |
00:44:51.560
depending on who's in power,
link |
00:44:52.880
depending on all these kinds of factors.
link |
00:44:55.400
And so here's Alexa in the middle of all of it in the home,
link |
00:44:59.600
trying to do good things for the customers.
link |
00:45:02.320
So how do you think about privacy in this context,
link |
00:45:05.040
the smart assistant in the home?
link |
00:45:06.720
How do you maintain, how do you earn trust?
link |
00:45:08.680
Absolutely.
link |
00:45:09.520
So as you said, trust is the key here.
link |
00:45:12.400
So you start with trust
link |
00:45:13.560
and then privacy is a key aspect of it.
link |
00:45:16.760
It has to be designed in from the very beginning.
link |
00:45:20.240
And we believe in two fundamental principles.
link |
00:45:23.920
One is transparency and second is control.
link |
00:45:26.840
So by transparency, I mean,
link |
00:45:28.920
when we build what is now called smart speaker
link |
00:45:32.120
or the first echo,
link |
00:45:34.320
we were quite judicious about making these right trade offs
link |
00:45:38.400
on customer's behalf,
link |
00:45:40.160
that it is pretty clear
link |
00:45:41.920
when the audio is being sent to cloud,
link |
00:45:44.200
the light ring comes on
link |
00:45:45.280
when it has heard you say the wake word,
link |
00:45:48.280
and then the streaming happens, right?
link |
00:45:49.760
So when the light ring comes up,
link |
00:45:51.360
we also put a physical mute button on it,
link |
00:45:55.520
just so if you didn't want it to be listening,
link |
00:45:57.920
even for the wake word,
link |
00:45:58.760
then you turn the power button or the mute button on,
link |
00:46:01.800
and that disables the microphones.
link |
00:46:04.960
That's just the first decision on essentially transparency
link |
00:46:08.040
and control.
link |
00:46:09.720
Oh, then even when we launched,
link |
00:46:11.720
we gave the control in the hands of the customers
link |
00:46:13.840
that you can go and look at any of your individual utterances
link |
00:46:16.400
that is recorded and delete them anytime.
link |
00:46:19.560
And we've stayed true to that promise, right?
link |
00:46:22.520
So, and that is super, again,
link |
00:46:25.000
a great instance of showing how you have the control.
link |
00:46:29.080
Then we made it even easier.
link |
00:46:30.440
You can say, Alexa, delete what I said today.
link |
00:46:33.080
So that is now making it even just more control
link |
00:46:36.880
in your hands with what's most convenient
link |
00:46:39.360
about this technology is voice.
link |
00:46:42.000
You delete it with your voice now.
link |
00:46:44.400
So these are the types of decisions we continually make.
link |
00:46:48.080
We just recently launched this feature,
link |
00:46:51.240
which we think of as,
link |
00:46:52.360
if you wanted humans not to review your data,
link |
00:46:56.680
because you've mentioned supervised learning, right?
link |
00:46:59.160
So in supervised learning,
link |
00:47:01.160
humans have to give some annotation.
link |
00:47:03.760
And that also is now a feature
link |
00:47:06.200
where you can essentially, if you've selected that flag,
link |
00:47:09.320
your data will not be reviewed by a human.
link |
00:47:11.320
So these are the types of controls
link |
00:47:13.640
that we have to constantly offer to customers.
link |
00:47:18.440
So why do you think it bothers people so much that,
link |
00:47:23.840
so everything you just said is really powerful.
link |
00:47:26.920
So the control, the ability to delete,
link |
00:47:28.400
cause we collect, we have studies here running at MIT
link |
00:47:31.120
that collects huge amounts of data
link |
00:47:32.760
and people consent and so on.
link |
00:47:34.880
The ability to delete that data is really empowering
link |
00:47:38.040
and almost nobody ever asked to delete it,
link |
00:47:40.000
but the ability to have that control is really powerful.
link |
00:47:44.200
But still, there's this popular
link |
00:47:47.040
anecdotal evidence that people
link |
00:47:49.280
like to tell, that
link |
00:47:51.000
they and a friend were talking about something,
link |
00:47:53.160
I don't know, sweaters for cats.
link |
00:47:56.120
And all of a sudden they'll have advertisements
link |
00:47:58.200
for cat sweaters on Amazon.
link |
00:48:01.400
That's a popular anecdote
link |
00:48:02.680
as if something is always listening.
link |
00:48:05.040
What, can you explain that anecdote,
link |
00:48:07.800
that experience that people have?
link |
00:48:09.120
What's the psychology of that?
link |
00:48:11.000
What's that experience?
link |
00:48:13.080
And can you, you've answered it,
link |
00:48:15.080
but let me just ask, is Alexa listening?
link |
00:48:18.280
No, Alexa listens only for the wake word on the device.
link |
00:48:22.560
And the wake word is?
link |
00:48:23.920
The words like Alexa, Amazon, Echo,
link |
00:48:28.080
but you only choose one at a time.
link |
00:48:29.640
So you choose one and it listens only
link |
00:48:31.640
for that on our devices.
link |
00:48:34.040
So that's first.
link |
00:48:35.160
From a listening perspective,
link |
00:48:36.480
we have to be very clear that it's just the wake word.
link |
00:48:38.360
So you said, why is there this anxiety, if you will?
link |
00:48:41.280
Yeah, exactly.
link |
00:48:42.120
It's because there's a lot of confusion,
link |
00:48:43.560
about what it really listens to, right?
link |
00:48:45.360
And I think it's partly on us to keep educating
link |
00:48:49.680
our customers and the general media more
link |
00:48:52.240
in terms of like how, what really happens.
link |
00:48:54.080
And we've done a lot of it.
link |
00:48:56.560
And our information pages are clear,
link |
00:49:00.840
but still people have to have more,
link |
00:49:04.040
there's always a hunger for information and clarity.
link |
00:49:06.680
And we'll constantly look at how best to communicate.
link |
00:49:09.120
If you go back and read everything,
link |
00:49:10.560
yes, it states exactly that.
link |
00:49:13.120
And then people could still question it.
link |
00:49:15.360
And I think that's absolutely okay to question.
link |
00:49:17.760
What we have to make sure is that we are,
link |
00:49:21.760
because our fundamental philosophy is customer first,
link |
00:49:24.880
customer obsession is our leadership principle.
link |
00:49:27.280
As researchers, we put ourselves
link |
00:49:31.040
in the shoes of the customer,
link |
00:49:33.200
and all decisions in Amazon are made with that.
link |
00:49:35.880
And trust has to be earned,
link |
00:49:38.040
and we have to keep earning the trust
link |
00:49:39.440
of our customers in this setting.
link |
00:49:41.800
And to your other point on like,
link |
00:49:44.080
is there something showing up
link |
00:49:45.560
based on your conversations?
link |
00:49:46.680
No, I think the answer is like,
link |
00:49:49.640
a lot of times when those experiences happen,
link |
00:49:51.400
you have to also know that, okay,
link |
00:49:52.840
it may be a winter season,
link |
00:49:54.600
people are looking for sweaters, right?
link |
00:49:56.480
And it shows up on your amazon.com because it is popular.
link |
00:49:59.640
So there are many of these,
link |
00:50:02.720
you mentioned that personality or personalization,
link |
00:50:06.320
turns out we are not that unique either, right?
link |
00:50:09.120
So those things we as humans start thinking,
link |
00:50:12.080
oh, must be because something was heard,
link |
00:50:14.120
and that's why this other thing showed up.
link |
00:50:16.720
The answer is no,
link |
00:50:17.760
probably it is just the season for sweaters.
link |
00:50:21.520
I'm not gonna ask you this question
link |
00:50:23.800
because people have so much paranoia.
link |
00:50:27.160
But let me just say from my perspective,
link |
00:50:29.200
I hope there's a day when customer can ask Alexa
link |
00:50:33.160
to listen all the time,
link |
00:50:35.200
to improve the experience,
link |
00:50:36.640
to improve because I personally don't see the negative
link |
00:50:40.760
because if you have the control and if you have the trust,
link |
00:50:43.920
there's no reason why it shouldn't be listening
link |
00:50:45.640
all the time to the conversations to learn more about you.
link |
00:50:48.280
Because ultimately,
link |
00:50:49.640
as long as you have control and trust,
link |
00:50:52.560
every data you provide to the device,
link |
00:50:55.680
that the device wants is going to be useful.
link |
00:51:00.200
And so to me, as a machine learning person,
link |
00:51:03.880
I think it worries me how sensitive people are
link |
00:51:08.200
about their data relative to how empowering it could be
link |
00:51:19.320
for the devices around them,
link |
00:51:21.160
how enriching it could be for their own life
link |
00:51:23.720
to improve the product.
link |
00:51:25.440
So I just, it's something I think about sort of a lot,
link |
00:51:28.320
how do we make devices like that;
link |
00:51:29.520
obviously the Alexa team thinks about it a lot as well.
link |
00:51:32.200
I don't know if you wanna comment on that,
link |
00:51:34.200
sort of, okay, have you seen,
link |
00:51:35.360
let me ask it in the form of a question, okay.
link |
00:51:38.680
Have you seen an evolution in the way people think about
link |
00:51:42.240
their private data in the previous several years?
link |
00:51:46.400
So as we as a society get more and more comfortable
link |
00:51:48.680
with the benefits we get by sharing more data.
link |
00:51:53.520
First, let me answer that part
link |
00:51:55.040
and then I'll wanna go back
link |
00:51:55.960
to the other aspect you were mentioning.
link |
00:51:58.440
So as a society, in general,
link |
00:52:01.160
we are getting more comfortable.
link |
00:52:03.120
Doesn't mean that everyone is,
link |
00:52:05.840
and I think we have to respect that.
link |
00:52:07.600
I don't think one size fits all
link |
00:52:10.320
is always gonna be the answer for all, right?
link |
00:52:13.520
By definition.
link |
00:52:14.360
So I think that's something to keep in mind in these.
link |
00:52:17.160
Going back to your point on what more
link |
00:52:21.400
magical experiences can be launched
link |
00:52:23.640
in these kinds of AI settings.
link |
00:52:26.040
I think again, if you give the control,
link |
00:52:29.200
it's possible to enable certain parts of it.
link |
00:52:32.080
So we have a feature called follow up mode
link |
00:52:33.960
where you, if you turn it on
link |
00:52:37.000
and Alexa, after you've spoken to it,
link |
00:52:40.400
will open the mics again,
link |
00:52:42.000
thinking you will answer something again.
link |
00:52:44.680
Like if you're adding items
link |
00:52:47.880
to your shopping list or to do list,
link |
00:52:50.360
you're not done.
link |
00:52:51.440
You want to keep, so in that setting,
link |
00:52:53.000
it's awesome that it opens the mic
link |
00:52:54.520
for you to say eggs and milk and then bread, right?
link |
00:52:57.160
So these are the kinds of things which you can empower.
link |
00:52:59.920
So, and then another feature we have,
link |
00:53:02.320
which is called Alexa Guard.
link |
00:53:04.960
I said it only listens for the wake word, right?
link |
00:53:07.800
But if you have, let's say you're going to say,
link |
00:53:10.480
like you leave your home and you want Alexa to listen
link |
00:53:13.440
for a couple of sound events like smoke alarm going off
link |
00:53:17.200
or someone breaking your glass, right?
link |
00:53:19.280
So it's like just to keep your peace of mind.
link |
00:53:22.160
So you can say Alexa on guard or I'm away
link |
00:53:26.480
and then it can be listening for these sound events.
link |
00:53:29.200
And when you're home, you come out of that mode, right?
link |
00:53:33.040
So this is another one where you again gave controls
link |
00:53:35.560
in the hands of the user or the customer
link |
00:53:38.040
and to enable some experience that is high utility
link |
00:53:42.440
and maybe even more delightful in the certain settings
link |
00:53:44.600
like follow up mode and so forth.
link |
00:53:46.480
And again, this general principle is the same,
link |
00:53:48.880
control in the hands of the customer.
link |
00:53:52.640
So I know we kind of started with a lot of philosophy
link |
00:53:55.480
and a lot of interesting topics
link |
00:53:56.840
and we're just jumping all over the place,
link |
00:53:58.280
but really some of the fascinating things
link |
00:54:00.280
that the Alexa team and Amazon is doing
link |
00:54:03.040
is in the algorithm side, the data side,
link |
00:54:05.480
the technology, the deep learning, machine learning
link |
00:54:07.520
and so on.
link |
00:54:08.880
So can you give a brief history of Alexa
link |
00:54:13.040
from the perspective of just innovation,
link |
00:54:15.440
the algorithms, the data of how it was born,
link |
00:54:18.640
how it came to be, how it's grown, where it is today?
link |
00:54:22.280
Yeah, it starts with, in Amazon,
link |
00:54:24.360
everything starts with the customer
link |
00:54:27.000
and we have a process called working backwards.
link |
00:54:30.320
Alexa and more specifically than the product Echo,
link |
00:54:35.040
there was a working backwards document essentially
link |
00:54:37.320
that reflected what it would be.
link |
00:54:38.880
It started with a very simple vision statement, for instance,
link |
00:54:44.320
that morphed into a full fledged document
link |
00:54:47.160
that along the way changed into all that it can do, right?
link |
00:54:51.720
But the inspiration was the Star Trek computer.
link |
00:54:54.160
So when you think of it that way,
link |
00:54:56.240
everything is possible, but when you launch a product,
link |
00:54:58.360
you have to start with some place.
link |
00:55:01.040
And when I joined, the product was already in conception
link |
00:55:05.520
and we started working on the far field speech recognition
link |
00:55:08.960
because that was the first thing to solve.
link |
00:55:10.960
By that we mean that you should be able to speak
link |
00:55:12.880
to the device from a distance.
link |
00:55:15.280
And in those days, that wasn't a common practice.
link |
00:55:18.840
And even in the previous research world I was in
link |
00:55:22.360
it was considered an unsolvable problem then
link |
00:55:24.640
in terms of whether you can converse from a distance.
link |
00:55:28.320
And here I'm still talking about the first part
link |
00:55:30.360
of the problem where you say,
link |
00:55:32.440
get the attention of the device
link |
00:55:34.080
as in by saying what we call the wake word,
link |
00:55:37.120
which means the word Alexa has to be detected
link |
00:55:40.400
with a very high accuracy because it is a very common word.
link |
00:55:44.880
It has sound units that map with words like I like you
link |
00:55:48.240
or Alec, Alex, right?
link |
00:55:51.160
So it's an undoubtedly hard problem to detect
link |
00:55:56.160
the right mentions of Alexa addressed to the device
link |
00:56:00.520
versus I like Alexa.
link |
00:56:02.800
So you have to pick up that signal
link |
00:56:04.240
when there's a lot of noise.
link |
00:56:06.040
Not only noise but a lot of conversation in the house,
link |
00:56:09.120
right?
link |
00:56:09.960
You remember on the device,
link |
00:56:10.800
you're simply listening for the wake word, Alexa.
link |
00:56:13.160
And there's a lot of words being spoken in the house.
link |
00:56:15.760
How do you know it's Alexa and directed at Alexa?
link |
00:56:21.720
Because I could say, I love my Alexa, I hate my Alexa.
link |
00:56:25.320
I want Alexa to do this.
link |
00:56:27.000
And in all these three sentences, I said, Alexa,
link |
00:56:29.280
I didn't want it to wake up.
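The wake word logic he describes, firing only on a confident, sustained detection of "Alexa" amid other household speech, can be sketched with a toy decision rule. This is purely illustrative: real detectors run small neural acoustic models on audio frames, and the scores and thresholds below are invented.

```python
# Illustrative sketch only: mimic the decision logic of a wake-word
# detector. Fire only when the per-frame detector score stays above a
# threshold for several consecutive frames, which suppresses brief
# false alarms from similar-sounding words like "Alec" or "I like".

def detect_wake_word(frame_scores, threshold=0.8, min_frames=3):
    """Return frame indices where a wake-word detection fires."""
    detections = []
    run = 0
    for i, score in enumerate(frame_scores):
        run = run + 1 if score >= threshold else 0
        if run == min_frames:
            detections.append(i)  # fire once per sustained run
    return detections

# A brief spike does not fire; a sustained run does.
scores = [0.1, 0.9, 0.2, 0.85, 0.9, 0.95, 0.3]
print(detect_wake_word(scores))  # [5]
```

The consecutive-frame requirement is one simple stand-in for the confidence accumulation a real system would do before waking the device.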
link |
00:56:32.120
Can I just pause on that second?
link |
00:56:33.720
What would be your advice that I should probably
link |
00:56:36.680
in the introduction of this conversation give to people
link |
00:56:39.920
in terms of them turning off their Alexa device
link |
00:56:43.440
if they're listening to this podcast conversation out loud?
link |
00:56:49.240
Like what's the probability that an Alexa device
link |
00:56:51.640
will go off because we mentioned Alexa like a million times.
link |
00:56:55.160
So it will, we have done a lot of different things
link |
00:56:58.120
where we can figure out that there is the device,
link |
00:57:03.720
the speech is coming from a human versus over the air.
link |
00:57:08.200
Also, I mean, think about ads,
link |
00:57:11.720
so we have also launched a technology
link |
00:57:14.240
for watermarking kind of approaches
link |
00:57:16.280
in terms of filtering it out.
link |
00:57:18.800
But yes, if this kind of a podcast is happening,
link |
00:57:21.600
it's possible your device will wake up a few times.
link |
00:57:24.360
It's an unsolved problem,
link |
00:57:25.440
but it is definitely something we care very much about.
link |
00:57:31.040
But the idea is you wanna detect Alexa.
link |
00:57:33.880
Meant for the device.
link |
00:57:36.080
First of all, just even hearing Alexa versus I like something.
link |
00:57:40.040
I mean, that's a fascinating part.
link |
00:57:41.040
So that was the first relief.
link |
00:57:43.040
That's the first.
link |
00:57:43.880
The world's best detector of Alexa.
link |
00:57:45.960
Yeah, the world's best wake word detector
link |
00:57:48.720
in a far field setting,
link |
00:57:49.920
not like something where the phone is sitting on the table.
link |
00:57:53.840
This is like people have devices 40 feet away
link |
00:57:56.680
like in my house or 20 feet away and you still get an answer.
link |
00:58:00.640
So that was the first part.
link |
00:58:02.480
The next is, okay, you're speaking to the device.
link |
00:58:05.880
Of course, you're gonna issue many different requests.
link |
00:58:09.000
Some may be simple, some may be extremely hard,
link |
00:58:11.560
but it's a large vocabulary speech recognition problem
link |
00:58:13.720
essentially, where the audio is now not coming
link |
00:58:17.600
onto your phone or a handheld mic like this
link |
00:58:20.360
or a close talking mic, but it's from 20 feet away
link |
00:58:23.880
where if you're in a busy household,
link |
00:58:26.240
your son may be listening to music,
link |
00:58:28.840
your daughter may be running around with something
link |
00:58:31.600
and asking your mom something and so forth, right?
link |
00:58:33.800
So this is like a common household setting
link |
00:58:36.360
where the words you're speaking to Alexa
link |
00:58:40.160
need to be recognized with very high accuracy, right?
link |
00:58:43.400
Now we are still just in the recognition problem.
link |
00:58:45.800
We haven't yet come to the understanding one, right?
link |
00:58:48.160
And if I may pause, sorry, once again,
link |
00:58:50.160
what year was this?
link |
00:58:51.160
Is this before neural networks began to start
link |
00:58:56.440
to seriously prove themselves in the audio space?
link |
00:59:00.480
Yeah, this is around, so I joined in 2013 in April, right?
link |
00:59:05.480
So the early research and neural networks coming back
link |
00:59:08.800
and showing some promising results
link |
00:59:11.240
in speech recognition space had started happening,
link |
00:59:13.560
but it was very early.
link |
00:59:15.360
But we built on that
link |
00:59:17.800
in the very first thing we did when I joined the team.
link |
00:59:23.240
And remember, it was a very much of a startup environment,
link |
00:59:25.960
which is great about Amazon.
link |
00:59:28.080
And we doubled down on deep learning right away.
link |
00:59:31.240
And we knew we'll have to improve accuracy fast.
link |
00:59:36.600
And because of that, we worked on,
link |
00:59:38.960
and the scale of data, once you have a device like this,
link |
00:59:41.640
if it is successful, will improve big time.
link |
00:59:44.920
Like you'll suddenly have large volumes of data
link |
00:59:48.040
to learn from to make the customer experience better.
link |
00:59:51.080
So how do you scale deep learning?
link |
00:59:52.480
So we did one of the first works
link |
00:59:54.560
in training with distributed GPUs
link |
00:59:57.600
and where the training time was linear
link |
01:00:01.400
in terms of the amount of data.
link |
01:00:03.960
So that was quite important work
link |
01:00:06.200
where it was algorithmic improvements
link |
01:00:07.840
as well as a lot of engineering improvements
link |
01:00:09.920
to be able to train on thousands and thousands of hours of speech.
link |
01:00:14.000
And that was an important factor.
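The linear scaling he mentions rests on synchronous data-parallel training: each GPU computes gradients on its shard of a batch, and averaging the shard gradients reproduces the full-batch gradient, so adding workers cuts wall-clock time roughly linearly. A minimal NumPy sketch of that equivalence on a toy linear model (Amazon's actual distributed setup is not public; this just illustrates the principle):

```python
# Hypothetical sketch of synchronous data-parallel gradient averaging,
# the core idea behind scaling training across many GPUs.
import numpy as np

def gradient(w, X, y):
    # gradient of mean squared error 0.5 * mean((Xw - y)^2)
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = np.zeros(3)

# Full-batch gradient vs. average of equal-sized per-worker shards.
full = gradient(w, X, y)
shards = [(X[i::4], y[i::4]) for i in range(4)]  # 4 simulated "workers"
averaged = np.mean([gradient(w, Xs, ys) for Xs, ys in shards], axis=0)

assert np.allclose(full, averaged)  # synchronous averaging is exact here
```

In practice the engineering work is in making that gradient exchange fast enough that throughput actually grows with worker count.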
link |
01:00:15.600
So if you ask me like back in 2013 and 2014,
link |
01:00:19.320
when we launched Echo,
link |
01:00:22.440
the combination of large scale data,
link |
01:00:25.680
deep learning progress, near infinite GPUs
link |
01:00:29.680
we had available on AWS even then,
link |
01:00:33.120
all came together for us to be able
link |
01:00:35.320
to solve the far field speech recognition
link |
01:00:38.400
to the extent it could be useful to the customers.
link |
01:00:40.640
It's still not solved.
link |
01:00:41.480
Like, I mean, it's not that we are perfect
link |
01:00:43.000
at recognizing speech, but we are great at it
link |
01:00:45.520
in terms of the settings that are in homes, right?
link |
01:00:48.360
So, and that was important even in the early stages.
link |
01:00:50.920
So first of all, just even,
link |
01:00:51.960
I'm trying to look back at that time.
link |
01:00:54.240
If I remember correctly,
link |
01:00:57.120
it was, it seems like the task would be pretty daunting.
link |
01:01:01.160
So like, so we kind of take it for granted
link |
01:01:04.480
that it works now.
link |
01:01:06.400
Yes, you're right.
link |
01:01:07.720
So let me, like how, first of all, you mentioned startup.
link |
01:01:10.880
I wasn't familiar how big the team was.
link |
01:01:12.880
I kind of, cause I know there's a lot
link |
01:01:14.200
of really smart people working on it.
link |
01:01:16.040
So now it's a very, very large team.
link |
01:01:19.120
How big was the team?
link |
01:01:20.840
How likely were you to fail in the eyes of everyone else?
link |
01:01:24.120
And ourselves?
link |
01:01:26.120
And yourself?
link |
01:01:27.760
So like what?
link |
01:01:28.600
I'll give you a very interesting anecdote on that.
link |
01:01:31.600
When I joined the team,
link |
01:01:33.880
the speech recognition team was six people.
link |
01:01:37.680
My first meeting, and we had hired a few more people,
link |
01:01:40.520
it was 10 people.
link |
01:01:42.960
Nine out of 10 people thought it can't be done.
link |
01:01:48.040
Who was the one?
link |
01:01:50.080
The one was me. Actually, I should say,
link |
01:01:52.960
and one was semi optimistic.
link |
01:01:56.000
And eight were trying to convince,
link |
01:01:59.120
let's go to the management and say,
link |
01:02:01.720
let's not work on this problem.
link |
01:02:03.600
Let's work on some other problem,
link |
01:02:05.240
like either telephony speech for customer service calls
link |
01:02:09.000
and so forth.
link |
01:02:10.160
But this was the kind of belief you must have.
link |
01:02:12.040
And I had experience with far field speech recognition
link |
01:02:14.360
and my eyes lit up when I saw a problem like that saying,
link |
01:02:17.720
okay, we have been in speech recognition,
link |
01:02:20.840
always looking for that killer app.
link |
01:02:23.400
And this was a killer use case
link |
01:02:25.840
to bring something delightful in the hands of customers.
link |
01:02:28.840
So you mentioned the way you kind of think of it
link |
01:02:31.200
in the product way in the future,
link |
01:02:32.680
have a press release and an FAQ and you think backwards.
link |
01:02:35.760
Did you have, did the team have the echo in mind?
link |
01:02:41.000
So this far field speech recognition,
link |
01:02:43.040
actually putting a thing in the home that works,
link |
01:02:45.360
that it's able to interact with,
link |
01:02:46.640
was that the press release?
link |
01:02:48.160
What was the?
link |
01:02:49.000
Quite close, I would say, in terms of the,
link |
01:02:51.440
as I said, the vision was the Star Trek computer, right?
link |
01:02:55.520
Or the inspiration.
link |
01:02:56.880
And from there, I can't divulge
link |
01:02:59.120
all the exact specifications,
link |
01:03:00.600
but one of the first things that was magical on Alexa
link |
01:03:07.200
was music.
link |
01:03:08.800
It brought me to back to music
link |
01:03:11.160
because my taste was still from when I was an undergrad.
link |
01:03:14.200
So I still listened to those songs and I,
link |
01:03:17.400
it was too hard for me to be a music fan with a phone, right?
link |
01:03:21.400
So I, and I don't, I hate things in my ears.
link |
01:03:24.200
So from that perspective, it was quite hard
link |
01:03:28.120
and music was part of the,
link |
01:03:32.040
at least the documents I have seen, right?
link |
01:03:33.640
So from that perspective, I think, yes,
link |
01:03:36.120
in terms of how far are we from the original vision?
link |
01:03:40.920
I can't reveal that, but it's,
link |
01:03:42.400
that's why I have a ton of fun at work
link |
01:03:44.520
because every day we go in and thinking like,
link |
01:03:47.200
these are the new set of challenges to solve.
link |
01:03:49.080
Yeah, that's a great way to do great engineering
link |
01:03:51.920
as you think of the press release.
link |
01:03:53.640
I like that idea actually.
link |
01:03:55.040
Maybe we'll talk about it a bit later,
link |
01:03:56.840
but it's just a super nice way to have a focus.
link |
01:03:59.280
I'll tell you this, you're a scientist
link |
01:04:01.400
and a lot of my scientists have adopted that.
link |
01:04:03.760
They have now, they love it as a process
link |
01:04:07.000
because it was very, as scientists,
link |
01:04:09.000
you're trained to write great papers,
link |
01:04:10.960
but they are all after you've done the research
link |
01:04:13.520
or you've proven that and your PhD dissertation proposal
link |
01:04:16.640
is something that comes closest
link |
01:04:18.480
or a DARPA proposal or a NSF proposal
link |
01:04:21.200
is the closest that comes to a press release.
link |
01:04:23.640
But that process is now ingrained in our scientists,
link |
01:04:27.040
which is like delightful for me to see.
link |
01:04:30.960
You write the paper first and then make it happen.
link |
01:04:33.080
That's right.
link |
01:04:33.920
In fact, it's not.
link |
01:04:34.760
State of the art results.
link |
01:04:36.320
Or you leave the results section open
link |
01:04:38.480
where you have a thesis about here's what I expect, right?
link |
01:04:41.680
And here's what it will change, right?
link |
01:04:44.960
So I think it is a great thing.
link |
01:04:46.560
It works for researchers as well.
link |
01:04:48.280
Yeah.
link |
01:04:49.120
So far field recognition.
link |
01:04:50.760
Yeah.
link |
01:04:52.400
What was the big leap?
link |
01:04:53.920
What were the breakthroughs
link |
01:04:55.520
and what was that journey like to today?
link |
01:04:58.440
Yeah, I think the, as you said first,
link |
01:05:00.240
there was a lot of skepticism
link |
01:05:01.640
on whether far field speech recognition
link |
01:05:03.400
will ever work to be good enough, right?
link |
01:05:06.560
And what we first did was got a lot of training data
link |
01:05:10.040
in a far field setting.
link |
01:05:11.520
And that was extremely hard to get
link |
01:05:14.080
because none of it existed.
link |
01:05:16.240
So how do you collect data in far field setup, right?
link |
01:05:20.120
With no customer base at this time.
link |
01:05:21.400
With no customer base, right?
link |
01:05:22.720
So that was first innovation.
link |
01:05:24.840
And once we had that, the next thing was,
link |
01:05:27.040
okay, if you have the data,
link |
01:05:29.760
first of all, we didn't talk about like,
link |
01:05:31.920
what would magical mean in this kind of a setting?
link |
01:05:35.320
What is good enough for customers, right?
link |
01:05:37.520
That's always, since you've never done this before,
link |
01:05:40.480
what would be magical?
link |
01:05:41.680
So it wasn't just a research problem.
link |
01:05:44.280
You had to put some in terms of accuracy
link |
01:05:47.720
and customer experience features,
link |
01:05:49.960
some stakes on the ground saying,
link |
01:05:51.560
here's where I think it should get to.
link |
01:05:55.000
So you established a bar
link |
01:05:56.120
and then how do you measure progress
link |
01:05:57.520
towards it, given you have no customers right now.
link |
01:06:01.800
So from that perspective, we went,
link |
01:06:04.240
so first was the data without customers.
link |
01:06:07.600
Second was doubling down on deep learning
link |
01:06:10.600
as a way to learn.
link |
01:06:11.960
And I can just tell you that the combination of the two
link |
01:06:16.200
brought our error rates down by a factor of five.
link |
01:06:19.240
From where we were when I started
link |
01:06:21.440
to within six months of having that data,
link |
01:06:24.360
we, at that point, I got the conviction
link |
01:06:28.440
that this will work, right?
link |
01:06:29.960
So, because that was magical
link |
01:06:31.680
in terms of when it started working and.
link |
01:06:34.760
That reached the magical bar.
link |
01:06:36.280
That came close to the magical bar.
link |
01:06:38.000
To the bar, right?
link |
01:06:39.560
That we felt would be where people will use it.
link |
01:06:44.280
That was critical.
link |
01:06:45.360
Because you really have one chance at this.
link |
01:06:48.880
If we had launched in November 2014 is when we launched,
link |
01:06:51.920
if it was below the bar,
link |
01:06:53.160
I don't think this category exists
link |
01:06:56.520
if you don't meet the bar.
link |
01:06:58.120
Yeah, and just having looked at voice based interactions
link |
01:07:02.080
like in the car or earlier systems,
link |
01:07:06.120
it's a source of huge frustration for people.
link |
01:07:08.320
In fact, we use voice based interaction
link |
01:07:10.280
for collecting data on subjects to measure frustration.
link |
01:07:14.600
So, as a training set for computer vision,
link |
01:07:16.560
for face data, so we can get a data set
link |
01:07:19.360
of frustrated people.
link |
01:07:20.600
That's the best way to get frustrated people
link |
01:07:22.240
is having them interact with a voice based system
link |
01:07:24.840
in the car.
link |
01:07:25.680
So, that bar I imagine is pretty high.
link |
01:07:28.520
It was very high.
link |
01:07:29.480
And we talked about how also errors are perceived
link |
01:07:32.720
from AIs versus errors by humans.
link |
01:07:35.400
But we are not done with the problems
link |
01:07:38.320
we ended up having to solve to get it to launch.
link |
01:07:39.800
So, do you want the next one?
link |
01:07:41.280
Yeah, the next one.
link |
01:07:42.680
So, the next one was what I think of as
link |
01:07:47.680
multi domain natural language understanding.
link |
01:07:50.960
It's very, I wouldn't say easy,
link |
01:07:53.200
but it is during those days,
link |
01:07:56.160
solving it, understanding in one domain,
link |
01:07:59.720
a narrow domain was doable,
link |
01:08:02.880
but for these multiple domains like music,
link |
01:08:06.880
like information, other kinds of household productivity,
link |
01:08:10.680
alarms, timers, even though it wasn't as big as it is
link |
01:08:14.160
in terms of the number of skills Alexa has
link |
01:08:15.640
and the confusion space has like grown
link |
01:08:17.480
by three orders of magnitude,
link |
01:08:20.680
it was still daunting even those days.
link |
01:08:22.680
And again, no customer base yet.
link |
01:08:24.640
Again, no customer base.
link |
01:08:26.200
So, now you're looking at meaning understanding
link |
01:08:28.200
and intent understanding and taking actions
link |
01:08:30.120
on behalf of customers.
link |
01:08:31.640
Based on their requests.
link |
01:08:33.440
And that is the next hard problem.
link |
01:08:36.440
Even if you have gotten the words recognized,
link |
01:08:39.960
how do you make sense of them?
link |
01:08:42.520
In those days, there was still a lot of emphasis
link |
01:08:47.520
on rule based systems for writing grammar patterns
link |
01:08:50.760
to understand the intent.
link |
01:08:52.360
But we had a statistical first approach even then,
link |
01:08:55.560
where for our language understanding we had,
link |
01:08:58.240
and even those starting days,
link |
01:09:00.200
an entity recognizer and an intent classifier,
link |
01:09:03.520
which was all trained statistically.
link |
01:09:06.080
In fact, we had to build the deterministic matching
link |
01:09:09.400
as a follow up to fix bugs that statistical models have.
link |
01:09:14.400
So, it was just a different mindset
link |
01:09:16.320
where we focused on data driven statistical understanding.
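A toy version of the statistical intent classifier he describes, per-intent word statistics rather than hand-written grammar rules, might look like the sketch below. The intents, training utterances, and smoothing choice are all invented for illustration; the real models and feature sets are far richer.

```python
# Minimal illustrative statistical intent classifier: a tiny Naive
# Bayes over word counts. Pick the intent whose words best explain
# the utterance under add-alpha smoothing.
import math
from collections import Counter, defaultdict

TRAIN = [  # (utterance, intent) pairs, invented for this sketch
    ("play songs by the stones", "PlayMusic"),
    ("play some jazz music", "PlayMusic"),
    ("set a timer for ten minutes", "SetTimer"),
    ("set an alarm for six", "SetTimer"),
    ("what is the weather today", "GetWeather"),
]

counts = defaultdict(Counter)
for text, intent in TRAIN:
    counts[intent].update(text.split())

def classify(utterance, alpha=1.0):
    words = utterance.split()
    vocab = {w for c in counts.values() for w in c}
    best, best_score = None, -math.inf
    for intent, c in counts.items():
        total = sum(c.values())
        # log-likelihood of the words under this intent's word model
        score = sum(math.log((c[w] + alpha) / (total + alpha * len(vocab)))
                    for w in words)
        if score > best_score:
            best, best_score = intent, score
    return best

print(classify("play the rolling stones"))  # PlayMusic
```

The point of the data-driven approach is visible even here: unseen phrasings still land on the right intent as long as their words are statistically associated with it.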
link |
01:09:20.080
It wins in the end if you have a huge data set.
link |
01:09:22.720
Yes, it is contingent on that.
link |
01:09:24.520
And that's why it came back to how do you get the data.
link |
01:09:27.120
Before customers, the fact that this is why data
link |
01:09:30.360
becomes crucial to get to the point
link |
01:09:33.280
that you have the understanding system built up.
link |
01:09:37.840
And notice that for you,
link |
01:09:40.680
we were talking about human machine dialogue,
link |
01:09:42.480
and even those early days,
link |
01:09:44.800
it was very much transactional,
link |
01:09:47.120
do one thing, one shot utterances.
link |
01:09:50.560
There was a lot of debate on how much should Alexa talk back
link |
01:09:52.840
in terms of if you misunderstood it.
link |
01:09:55.680
If it misunderstood you, or you said play songs by the stones,
link |
01:10:01.440
and let's say it doesn't know early days,
link |
01:10:04.760
knowledge can be sparse, who are the stones?
link |
01:10:09.240
It's the Rolling Stones.
link |
01:10:12.760
And you don't want the match to be Stone Temple Pilots
link |
01:10:16.280
or Rolling Stones.
link |
01:10:17.200
So, you don't know which one it is.
link |
01:10:18.840
So, these kind of other signals,
link |
01:10:22.480
now there we had great assets from Amazon in terms of...
link |
01:10:27.040
UX, like what is it, what kind of...
link |
01:10:29.560
Yeah, how do you solve that problem?
link |
01:10:31.200
In terms of what we think of it
link |
01:10:32.280
as an entity resolution problem, right?
link |
01:10:34.000
So, because which one is it, right?
link |
01:10:36.200
I mean, even if you figured out the stones as an entity,
link |
01:10:40.160
you have to resolve it to whether it's the stones
link |
01:10:42.200
or the Stone Temple Pilots or some other stones.
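As a rough illustration of the entity resolution problem described here, a toy sketch follows. The catalog, aliases, popularity values, and the 0.7/0.3 weighting are all invented for illustration; they are not how Alexa actually resolves entities.

```python
def resolve_entity(mention, catalog):
    """Pick the best catalog entry for a spoken mention by combining
    an alias match with a popularity prior."""
    mention = mention.lower()
    best, best_score = None, 0.0
    for entity in catalog:
        alias_score = 1.0 if mention in entity["aliases"] else 0.0
        # Blend alias evidence with popularity so an ambiguous mention
        # like "the stones" favors the more commonly requested artist.
        score = 0.7 * alias_score + 0.3 * entity["popularity"]
        if score > best_score:
            best, best_score = entity["name"], score
    return best

# Invented catalog for the sketch.
catalog = [
    {"name": "The Rolling Stones",
     "aliases": {"the stones", "rolling stones", "the rolling stones"},
     "popularity": 0.9},
    {"name": "Stone Temple Pilots",
     "aliases": {"stp", "stone temple pilots"},
     "popularity": 0.6},
]

print(resolve_entity("The Stones", catalog))  # The Rolling Stones
```

In a real system the alias match would be a learned similarity model and the prior would also use the customer's own listening history, as discussed next.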
link |
01:10:44.840
Maybe I misunderstood, is the resolution
link |
01:10:47.080
the job of the algorithm or is the job of UX
link |
01:10:50.520
communicating with the human to help the resolution?
link |
01:10:52.320
Well, there is both, right?
link |
01:10:54.240
It is, you want 90% or high 90s to be done
link |
01:10:58.760
without any further questioning or UX, right?
link |
01:11:01.200
So, but it's absolutely okay, just like as humans,
link |
01:11:05.560
we ask the question, I didn't understand you, Lex.
link |
01:11:09.000
It's fine for Alexa to occasionally say,
link |
01:11:10.640
I did not understand you, right?
link |
01:11:12.080
And that's an important way to learn.
link |
01:11:14.640
And I'll talk about where we have come
link |
01:11:16.240
with more self learning with these kind of feedback signals.
link |
01:11:20.080
But in those days, just solving the ability
link |
01:11:23.240
of understanding the intent and resolving to an action
link |
01:11:26.480
where action could be play a particular artist
link |
01:11:28.760
or a particular song was super hard.
link |
01:11:31.960
Again, the bar was high as we were talking about, right?
link |
01:11:35.400
So, while we launched it in sort of 13 big domains,
link |
01:11:40.240
I would say in terms of,
link |
01:11:42.360
we think of it as 13, the big skills we had,
link |
01:11:44.760
like music is a massive one when we launched it.
link |
01:11:47.720
And now we have 90,000 plus skills on Alexa.
link |
01:11:51.480
So, what are the big skills?
link |
01:11:52.640
Can you just go over them?
link |
01:11:53.480
Because the only thing I use it for
link |
01:11:55.480
is music, weather and shopping.
link |
01:11:58.840
So, we think of it as music information, right?
link |
01:12:02.520
So, weather is a part of information, right?
link |
01:12:05.360
So, when we launched, we didn't have smart home,
link |
01:12:08.000
but within, by smart home I mean,
link |
01:12:10.360
you connect your smart devices,
link |
01:12:12.040
you control them with voice.
link |
01:12:13.080
If you haven't done it, it's worth,
link |
01:12:15.000
it will change your life.
link |
01:12:15.840
Like turning on the lights and so on.
link |
01:12:16.680
Turning on your light to anything that's connected
link |
01:12:20.200
and has a, it's just that.
link |
01:12:21.480
What's your favorite smart device for you?
link |
01:12:23.160
My light.
link |
01:12:24.000
Light.
link |
01:12:24.840
And now you have the smart plug with,
link |
01:12:26.320
and you don't, we also have this Echo plug, which is.
link |
01:12:29.880
Oh yeah, you can plug in anything.
link |
01:12:30.720
You can plug in anything
link |
01:12:31.560
and now you can turn that one on and off.
link |
01:12:33.560
I'll use this conversation as motivation to get one.
link |
01:12:35.680
Garage door, you can check your status of the garage door
link |
01:12:39.560
and things like, and we have gone,
link |
01:12:41.200
make Alexa more and more proactive,
link |
01:12:43.200
where it even has hunches now,
link |
01:12:45.120
that, oh, looks like you left your light on.
link |
01:12:50.520
Let's say you've gone to your bed
link |
01:12:51.640
and you left the garage light on.
link |
01:12:52.880
So it will help you out in these settings, right?
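A hunch like the garage-light example could, very roughly, come from comparing a device's current state with its usual pattern at that time of day. Everything below, the history format, the 0.1 threshold, the device names, is an assumption for this sketch, not Amazon's implementation.

```python
def hunch(device, hour, history):
    """Given that `device` is currently ON, flag a hunch when it is
    almost never on at this hour on past days."""
    # Fraction of past days the device was on at this hour;
    # default to 0.5 (no evidence either way) for unknown devices.
    usual_on_rate = history.get(device, {}).get(hour, 0.5)
    return usual_on_rate < 0.1

# Invented usage history: garage light almost never on at 11pm.
history = {"garage_light": {23: 0.02}}

print(hunch("garage_light", 23, history))  # True: worth a nudge
```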
link |
01:12:56.600
That's smart devices, information, smart devices.
link |
01:13:00.160
You said music.
link |
01:13:01.120
Yeah, so I don't remember everything we had,
link |
01:13:02.960
but alarms, timers were the big ones.
link |
01:13:05.040
Like that was, you know,
link |
01:13:06.680
the timers were very popular right away.
link |
01:13:09.520
Music also, like you could play song, artist, album,
link |
01:13:13.440
everything, and so that was like a clear win
link |
01:13:17.000
in terms of the customer experience.
link |
01:13:19.440
So that's, again, this is language understanding.
link |
01:13:22.760
Now things have evolved, right?
link |
01:13:24.080
So where we want Alexa definitely to be more accurate,
link |
01:13:28.360
competent, trustworthy,
link |
01:13:29.800
based on how well it does these core things,
link |
01:13:33.080
but we have evolved in many different dimensions.
link |
01:13:35.240
First is what I think of are doing more conversational
link |
01:13:38.360
for high utility, not just for chat, right?
link |
01:13:40.920
And there at Remars this year, which is our AI conference,
link |
01:13:44.920
we launched what is called Alexa Conversations.
link |
01:13:48.560
That is providing the ability for developers
link |
01:13:51.800
to author multi turn experiences on Alexa
link |
01:13:55.040
with no code, essentially,
link |
01:13:57.080
in terms of the dialogue code.
link |
01:13:58.880
Initially it was like, you know, all these IVR systems,
link |
01:14:02.600
you have to fully author if the customer says this,
link |
01:14:06.560
do that, right?
link |
01:14:07.560
So the whole dialogue flow is hand authored.
link |
01:14:11.440
And with Alexa Conversations,
link |
01:14:13.640
the way it is that you just provide
link |
01:14:15.440
a sample interaction data with your service or your API,
link |
01:14:18.040
let's say your Atom tickets that provides a service
link |
01:14:21.400
for buying movie tickets.
link |
01:14:23.400
You provide a few examples of how your customers
link |
01:14:25.840
will interact with your APIs.
link |
01:14:27.840
And then the dialogue flow is automatically constructed
link |
01:14:29.960
using a recurrent neural network trained on that data.
link |
01:14:33.360
So that simplifies the developer experience.
link |
01:14:35.920
We just launched our preview for the developers
link |
01:14:38.440
to try this capability out.
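Alexa Conversations uses recurrent neural models trained on sample interactions; as a much-simplified caricature of predicting the next dialogue action from example dialogues, here is a toy policy that simply memorizes transitions. The movie-ticket slots and action names are invented for the example.

```python
def train_policy(sample_dialogues):
    """Map each observed dialogue state (tuple of filled slots) to the
    next action the sample dialogues took from that state."""
    policy = {}
    for dialogue in sample_dialogues:
        filled = []
        for slot, action in dialogue:
            state = tuple(sorted(filled))
            policy[state] = action  # action taken before `slot` is filled
            filled.append(slot)
    return policy

# One invented sample dialogue: (slot the action fills, action taken).
samples = [
    [("movie", "ask_showtime"),
     ("showtime", "ask_party_size"),
     ("party_size", "confirm_purchase")],
]

policy = train_policy(samples)
print(policy[()])                     # ask_showtime
print(policy[("movie", "showtime")])  # confirm_purchase
```

The point of the real system is that the developer supplies only such samples, and a trained model generalizes to orderings and phrasings the samples never showed, which this lookup table cannot do.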
link |
01:14:40.600
And then the second part of it,
link |
01:14:42.120
which shows even increased utility for customers
link |
01:14:45.680
is you and I, when we interact with Alexa or any customer,
link |
01:14:50.920
as I'm coming back to our initial part of the conversation,
link |
01:14:53.160
the goal is often unclear or unknown to the AI.
link |
01:14:58.960
If I say, Alexa, what movies are playing nearby?
link |
01:15:02.680
Am I trying to just buy movie tickets?
link |
01:15:07.080
Am I actually even,
link |
01:15:09.120
do you think I'm looking for just movies for curiosity,
link |
01:15:12.040
whether the Avengers is still in theater or when is it?
link |
01:15:15.120
Maybe it's gone and maybe I missed it.
link |
01:15:17.640
So I may watch it on Prime, right?
link |
01:15:20.680
Which happened to me.
link |
01:15:21.920
So from that perspective now,
link |
01:15:24.680
you're looking into what is my goal?
link |
01:15:27.680
And let's say I now complete the movie ticket purchase.
link |
01:15:31.480
Maybe I would like to get dinner nearby.
link |
01:15:35.760
So what is really the goal here?
link |
01:15:38.680
Is it night out or is it movies?
link |
01:15:41.920
As in just go watch a movie?
link |
01:15:44.040
The answer is, we don't know.
link |
01:15:46.240
So can Alexa now figuratively have the intelligence
link |
01:15:50.720
that I think this meta goal is really night out
link |
01:15:53.760
or at least say to the customer
link |
01:15:55.800
when you've completed the purchase of movie tickets
link |
01:15:58.200
from Atom tickets or Fandango,
link |
01:16:00.320
or pick your anyone.
link |
01:16:01.840
Then the next thing is,
link |
01:16:02.880
do you want to get an Uber to the theater, right?
link |
01:16:09.360
Or do you want to book a restaurant next to it?
link |
01:16:12.880
And then not ask the same information over and over again,
link |
01:16:17.560
what time, how many people in your party, right?
link |
01:16:22.560
So this is where you shift the cognitive burden
link |
01:16:26.560
from the customer to the AI.
link |
01:16:29.000
Where it's thinking of what is your,
link |
01:16:32.120
it anticipates your goal
link |
01:16:34.200
and takes the next best action to complete it.
link |
01:16:37.480
Now that's the machine learning problem.
link |
01:16:40.760
But essentially the way we solve this first instance,
link |
01:16:43.760
and we have a long way to go to make it scale
link |
01:16:46.800
to everything possible in the world.
link |
01:16:48.720
But at least for this situation,
link |
01:16:50.160
it is from at every instance,
link |
01:16:53.000
Alexa is making the determination,
link |
01:16:54.600
whether it should stick with the experience
link |
01:16:56.240
with Atom tickets or not.
link |
01:16:58.600
Or offer you based on what you say,
link |
01:17:03.800
whether either you have completed the interaction,
link |
01:17:06.280
or you said, no, get me an Uber now.
link |
01:17:07.760
So it will shift context into another experience or skill
link |
01:17:12.080
or another service.
link |
01:17:12.920
So that's a dynamic decision making.
link |
01:17:15.360
That's making Alexa, you can say more conversational
link |
01:17:18.160
for the benefit of the customer,
link |
01:17:20.200
rather than simply complete transactions,
link |
01:17:22.520
which are well thought through.
link |
01:17:24.360
You as a customer have fully specified
link |
01:17:27.840
what you want to be accomplished.
link |
01:17:29.680
It's accomplishing that.
link |
01:17:30.840
So it's kind of as we do this with pedestrians,
link |
01:17:34.080
like intent modeling is predicting
link |
01:17:36.840
what your possible goals are and what's the most likely goal
link |
01:17:40.040
and switching that depending on the things you say.
link |
01:17:42.440
So my question is there,
link |
01:17:44.440
it seems maybe it's a dumb question,
link |
01:17:46.520
but it would help a lot if Alexa remembered me,
link |
01:17:51.400
what I said previously.
link |
01:17:53.040
Right.
link |
01:17:53.880
Is it trying to use some memories for the customer?
link |
01:17:58.360
Yeah, it is using a lot of memory within that.
link |
01:18:00.680
So right now, not so much in terms of,
link |
01:18:02.560
okay, which restaurant do you prefer, right?
link |
01:18:05.280
That is a more longterm memory,
link |
01:18:06.680
but within the short term memory, within the session,
link |
01:18:09.720
it is remembering how many people did you,
link |
01:18:11.720
so if you said buy four tickets,
link |
01:18:13.720
now it has made an implicit assumption
link |
01:18:15.560
that you were gonna have,
link |
01:18:18.200
you need at least four seats at a restaurant, right?
link |
01:18:21.640
So these are the kind of context it's preserving
link |
01:18:24.200
between these skills, but within that session.
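The within-session slot carry-over described here can be pictured with a minimal sketch; the slot names, skills, and default values are illustrative assumptions, not Alexa's actual session store.

```python
class SessionContext:
    """Short-term memory for one session, shared across skills so a
    later skill can reuse slots instead of re-asking the customer."""
    def __init__(self):
        self.slots = {}

    def set(self, name, value):
        self.slots[name] = value

    def get(self, name, default=None):
        return self.slots.get(name, default)

session = SessionContext()
session.set("party_size", 4)  # movie-ticket skill: four tickets bought

# A restaurant skill later reuses the slot instead of asking again.
print(f"Table for {session.get('party_size', 2)}")  # Table for 4
```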
link |
01:18:26.720
But you're asking the right question
link |
01:18:28.000
in terms of for it to be more and more useful,
link |
01:18:32.040
it has to have more longterm memory
link |
01:18:33.680
and that's also an open question
link |
01:18:35.120
and again, these are still early days.
link |
01:18:37.400
So for me, I mean, everybody's different,
link |
01:18:40.240
but yeah, I'm definitely not representative
link |
01:18:43.920
of the general population in the sense
link |
01:18:45.240
that I do the same thing every day.
link |
01:18:47.800
Like I eat the same,
link |
01:18:48.640
I do everything the same, the same thing,
link |
01:18:51.760
wear the same thing clearly, this or the black shirt.
link |
01:18:55.360
So it's frustrating when Alexa doesn't get what I'm saying
link |
01:18:59.000
because I have to correct her every time
link |
01:19:01.920
in the exact same way.
link |
01:19:02.800
This has to do with certain songs,
link |
01:19:05.480
like she doesn't know certain weird songs I like
link |
01:19:08.240
and doesn't know, I've complained to Spotify about this,
link |
01:19:11.240
talked to the head of R&D at Spotify,
link |
01:19:13.840
it's Stairway to Heaven.
link |
01:19:15.040
I have to correct it every time.
link |
01:19:16.280
It doesn't play Led Zeppelin correctly.
link |
01:19:18.720
It plays a cover of Stairway to Heaven.
link |
01:19:22.080
So I'm.
link |
01:19:22.920
You should figure, you should send me your,
link |
01:19:24.920
next time it fails, feel free to send it to me,
link |
01:19:27.480
we'll take care of it.
link |
01:19:28.400
Okay, well.
link |
01:19:29.240
Because Led Zeppelin is one of my favorite bands,
link |
01:19:31.720
it works for me, so I'm like shocked it doesn't work for you.
link |
01:19:34.120
This is an official bug report.
link |
01:19:35.440
I'll put it, I'll make it public,
link |
01:19:37.480
I'll make everybody retweet it.
link |
01:19:39.000
We're gonna fix the Stairway to Heaven problem.
link |
01:19:40.960
Anyway, but the point is,
link |
01:19:43.200
you know, I'm pretty boring and do the same things,
link |
01:19:45.120
but I'm sure most people do the same set of things.
link |
01:19:48.320
Do you see Alexa sort of utilizing that in the future
link |
01:19:51.360
for improving the experience?
link |
01:19:52.760
Yes, and not only utilizing,
link |
01:19:54.680
it's already doing some of it.
link |
01:19:56.200
We call it self learning, where Alexa is becoming more self learning.
link |
01:19:59.520
So, Alexa is now auto correcting millions and millions
link |
01:20:04.360
of utterances in the US
link |
01:20:06.360
without any human supervision involved.
link |
01:20:08.720
The way it does it is,
link |
01:20:10.840
let's take an example of a particular song
link |
01:20:13.320
didn't work for you.
link |
01:20:14.720
What do you do next?
link |
01:20:15.680
You either it played the wrong song
link |
01:20:17.840
and you said, Alexa, no, that's not the song I want.
link |
01:20:20.720
Or you say, Alexa play that, you try it again.
link |
01:20:25.160
And that is a signal to Alexa
link |
01:20:27.440
that she may have done something wrong.
link |
01:20:30.080
And from that perspective,
link |
01:20:31.840
we can learn if there's that failure pattern
link |
01:20:35.200
or that action of song A was played
link |
01:20:38.480
when song B was requested.
link |
01:20:41.000
And it's very common with station names
link |
01:20:43.040
because play NPR, you can have N be confused as an M.
link |
01:20:47.160
And then you, for a certain accent like mine,
link |
01:20:51.840
people confuse my N and M all the time.
link |
01:20:54.720
And because I have an Indian accent,
link |
01:20:57.640
they're confusable to humans.
link |
01:20:59.600
It is for Alexa too.
link |
01:21:01.600
And in that part, but it starts auto correcting
link |
01:21:05.080
and we correct a lot of these automatically
link |
01:21:09.680
without a human looking at the failures.
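The auto-correction idea, learning rewrites from barge-ins and retries without any human labels, could be caricatured like this. The counting scheme, the threshold, and the "play mpr" example are assumptions for the sketch; the production system uses far more sophisticated models.

```python
from collections import Counter

class RewriteLearner:
    """Learn query rewrites from implicit feedback: if a failed request
    is repeatedly followed by a successful retry, rewrite the former
    into the latter."""
    def __init__(self, min_count=3):
        self.pairs = Counter()
        self.min_count = min_count

    def observe(self, failed_request, successful_retry):
        # Called whenever a barge-in or retry is followed by success.
        self.pairs[(failed_request, successful_retry)] += 1

    def rewrite(self, request):
        # Apply the best-supported learned rewrite, if any.
        candidates = [(count, b) for (a, b), count in self.pairs.items()
                      if a == request and count >= self.min_count]
        return max(candidates)[1] if candidates else request

learner = RewriteLearner()
for _ in range(5):  # five customers retried "play mpr" as "play npr"
    learner.observe("play mpr", "play npr")

print(learner.rewrite("play mpr"))   # play npr
print(learner.rewrite("play jazz"))  # play jazz (no rewrite learned)
```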
link |
01:21:12.680
So one of the things that's for me missing in Alexa,
link |
01:21:17.360
I don't know if I'm a representative customer,
link |
01:21:19.720
but every time I correct it,
link |
01:21:22.920
it would be nice to know that that made a difference.
link |
01:21:26.120
Yes.
link |
01:21:26.960
You know what I mean?
link |
01:21:27.800
Like the sort of like, I heard you like a sort of.
link |
01:21:31.880
Some acknowledgement of that.
link |
01:21:33.840
We work a lot with Tesla, we study autopilot and so on.
link |
01:21:37.440
And a large amount of the customers
link |
01:21:39.240
that use Tesla autopilot,
link |
01:21:40.720
they feel like they're always teaching the system.
link |
01:21:43.000
They're almost excited
link |
01:21:43.840
by the possibility that they're teaching.
link |
01:21:45.080
I don't know if Alexa customers generally think of it
link |
01:21:48.440
as they're teaching to improve the system.
link |
01:21:51.160
And that's a really powerful thing.
link |
01:21:52.680
Again, I would say it's a spectrum.
link |
01:21:55.200
Some customers do think that way
link |
01:21:57.320
and some would be annoyed by Alexa acknowledging that.
link |
01:22:02.320
So there's, again, no one,
link |
01:22:04.360
while there are certain patterns,
link |
01:22:05.760
not everyone is the same in this way.
link |
01:22:08.280
But we believe that, again, customers helping Alexa
link |
01:22:13.680
is a tenet for us in terms of improving it.
link |
01:22:15.720
And some of this self learning is, again,
link |
01:22:18.280
this is like fully unsupervised, right?
link |
01:22:20.120
There is no human in the loop and no labeling happening.
link |
01:22:23.600
And based on your actions as a customer,
link |
01:22:27.120
Alexa becomes smarter.
link |
01:22:29.080
Again, it's early days,
link |
01:22:31.160
but I think this whole area of teachable AI
link |
01:22:35.840
is gonna get bigger and bigger in the whole space,
link |
01:22:38.680
especially in the AI assistant space.
link |
01:22:40.760
So that's the second part
link |
01:22:41.920
where I mentioned more conversational.
link |
01:22:44.800
This is more self learning.
link |
01:22:46.520
The third is more natural.
link |
01:22:48.320
And the way I think of more natural
link |
01:22:50.240
is we talked about how Alexa sounds.
link |
01:22:53.240
And we have done a lot of advances in our text to speech
link |
01:22:58.080
by using, again, neural network technology
link |
01:23:00.480
for it to sound very humanlike.
link |
01:23:03.520
From the individual texture of the sound to the timing,
link |
01:23:07.520
the tonality, the tone, everything, the whole thing.
link |
01:23:09.240
I would think in terms of,
link |
01:23:11.000
there's a lot of controls in each of the places
link |
01:23:13.360
for how, I mean, the speed of the voice,
link |
01:23:16.640
the prosodic patterns,
link |
01:23:19.520
the actual smoothness of how it sounds,
link |
01:23:23.360
all of those are factored
link |
01:23:24.360
and we do a ton of listening tests to make sure.
link |
01:23:27.120
But naturalness, how it sounds should be very natural.
link |
01:23:30.720
How it understands requests is also very important.
link |
01:23:33.920
And in terms of, we have 95,000 skills.
link |
01:23:37.120
And if we have, imagine that in many of these skills,
link |
01:23:41.440
you have to remember the skill name
link |
01:23:43.440
and say, Alexa, ask the Tide skill to tell me X.
link |
01:23:51.120
Now, if you have to remember the skill name,
link |
01:23:52.960
that means the discovery and the interaction is unnatural.
link |
01:23:56.640
And we are trying to solve that
link |
01:23:58.120
by what we think of as, again,
link |
01:24:03.960
you don't have to have the app metaphor here.
link |
01:24:05.680
These are not individual apps, right?
link |
01:24:07.400
Even though they're,
link |
01:24:08.360
so you're not sort of opening one at a time and interacting.
link |
01:24:11.400
So it should be seamless because it's voice.
link |
01:24:14.000
And when it's voice,
link |
01:24:15.160
you have to be able to understand these requests
link |
01:24:17.560
independent of the specificity, like a skill name.
link |
01:24:20.600
And to do that,
link |
01:24:21.640
what we have done is again,
link |
01:24:22.840
built a deep learning based capability
link |
01:24:24.440
where we shortlist a bunch of skills
link |
01:24:27.040
when you say, Alexa, get me a car.
link |
01:24:28.880
And then we figure it out, okay,
link |
01:24:30.080
it's meant for an Uber skill versus a Lyft
link |
01:24:33.320
or based on your preferences.
link |
01:24:34.880
And then you can rank the responses from the skill
link |
01:24:38.320
and then choose the best response for the customer.
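The shortlist-then-rank routing just described might be sketched as follows. The skills, keywords, and preference weights are invented, and real systems use learned deep models rather than keyword overlap.

```python
def shortlist(utterance, skills):
    """Stage 1: keep only skills whose keywords overlap the utterance."""
    words = set(utterance.lower().split())
    return [s for s in skills if words & s["keywords"]]

def rank(candidates, preferences):
    """Stage 2: order candidates by the customer's preference weight."""
    return sorted(candidates,
                  key=lambda s: preferences.get(s["name"], 0.0),
                  reverse=True)

# Invented skill catalog and preferences for the sketch.
skills = [
    {"name": "Uber", "keywords": {"car", "ride", "taxi"}},
    {"name": "Lyft", "keywords": {"car", "ride"}},
    {"name": "Pizza", "keywords": {"pizza", "order"}},
]
preferences = {"Lyft": 0.8, "Uber": 0.5}

candidates = shortlist("alexa get me a car", skills)
best = rank(candidates, preferences)[0]["name"]
print(best)  # Lyft
```

The two-stage structure matters: shortlisting keeps the expensive ranking step from having to score all 90,000-plus skills on every utterance.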
link |
01:24:41.280
So that's on the more natural,
link |
01:24:43.240
other examples of more natural is like,
link |
01:24:46.360
we were talking about lists, for instance,
link |
01:24:49.120
and you don't wanna say, Alexa, add milk,
link |
01:24:51.720
Alexa, add eggs, Alexa, add cookies.
link |
01:24:55.160
No, Alexa, add cookies, milk, and eggs
link |
01:24:57.280
and that in one shot, right?
link |
01:24:59.240
So that works, that helps with the naturalness.
link |
01:25:01.760
We talked about memory, like if you said,
link |
01:25:05.400
you can say, Alexa, remember I have to go to mom's house,
link |
01:25:09.040
or you may have entered a calendar event
link |
01:25:11.160
through your calendar that's linked to Alexa.
link |
01:25:13.520
You don't wanna remember whether it's in my calendar
link |
01:25:15.800
or did I tell you to remember something
link |
01:25:18.360
or some other reminder, right?
link |
01:25:20.960
So you have to now, independent of how customers
link |
01:25:25.320
create these events, it should just say,
link |
01:25:28.120
Alexa, when do I have to go to mom's house?
link |
01:25:29.840
And it tells you when you have to go to mom's house.
link |
01:25:32.320
Now that's a fascinating problem.
link |
01:25:33.720
Who's that problem on?
link |
01:25:35.280
So there's people who create skills.
link |
01:25:38.520
Who's tasked with integrating all of that knowledge together
link |
01:25:42.840
so the skills become seamless?
link |
01:25:44.640
Is it the creators of the skills
link |
01:25:46.840
or is it a problem for the infrastructure that Alexa provides?
link |
01:25:51.280
It's both.
link |
01:25:52.120
I think the large problem in terms of making sure
link |
01:25:54.960
your skill quality is high,
link |
01:25:58.560
that has to be done by our tools,
link |
01:26:01.240
because it's just, so these skills,
link |
01:26:03.160
just to put the context,
link |
01:26:04.720
they are built through Alexa Skills Kit,
link |
01:26:06.360
which is a self serve way of building
link |
01:26:09.160
an experience on Alexa.
link |
01:26:11.320
This is like any developer in the world
link |
01:26:13.000
could go to Alexa Skills Kit
link |
01:26:14.880
and build an experience on Alexa.
link |
01:26:16.840
Like if you're a Domino's, you can build a Domino's Skills.
link |
01:26:20.160
For instance, that does pizza ordering.
link |
01:26:22.560
When you have authored that,
link |
01:26:25.320
you do want to now,
link |
01:26:28.280
if people say, Alexa, open Domino's
link |
01:26:30.120
or Alexa, ask Domino's to get a particular type of pizza,
link |
01:26:35.360
that will work, but the discovery is hard.
link |
01:26:37.800
You can't just say, Alexa, get me a pizza.
link |
01:26:39.360
And then Alexa figures out what to do.
link |
01:26:42.440
That latter part is definitely our responsibility
link |
01:26:45.000
in terms of when the request is not fully specific,
link |
01:26:48.960
how do you figure out what's the best skill
link |
01:26:51.560
or a service that can fulfill the customer's request?
link |
01:26:56.120
And it can keep evolving.
link |
01:26:57.280
Imagine going to the situation I said,
link |
01:26:59.280
which was the night out planning,
link |
01:27:00.360
that the goal could be more than that individual request
link |
01:27:03.520
that came up.
link |
01:27:05.600
A pizza ordering could mean a night in,
link |
01:27:08.600
where you're having an event with your kids
link |
01:27:10.520
in their house, and you're, so this is,
link |
01:27:12.920
welcome to the world of conversational AI.
link |
01:27:16.720
This is super exciting because it's not
link |
01:27:18.920
the academic problem of NLP,
link |
01:27:20.760
of natural language processing, understanding, dialogue.
link |
01:27:23.080
This is like real world.
link |
01:27:24.640
And the stakes are high in the sense
link |
01:27:27.120
that customers get frustrated quickly,
link |
01:27:30.000
people get frustrated quickly.
link |
01:27:31.800
So you have to get it right,
link |
01:27:33.120
you have to get that interaction right.
link |
01:27:35.280
So it's, I love it.
link |
01:27:36.880
But so from that perspective,
link |
01:27:39.200
what are the challenges today?
link |
01:27:41.920
What are the problems that really need to be solved
link |
01:27:45.040
in the next few years?
link |
01:27:45.880
What's the focus?
link |
01:27:46.840
First and foremost, as I mentioned,
link |
01:27:48.720
that get the basics right is still true.
link |
01:27:53.080
Basically, even the one shot requests,
link |
01:27:57.000
which we think of as transactional requests,
link |
01:27:58.840
need to work magically, no question about that.
link |
01:28:01.680
If it doesn't turn your light on and off,
link |
01:28:03.600
you'll be super frustrated.
link |
01:28:05.200
Even if I can complete the night out for you
link |
01:28:07.080
and not do that, that is unacceptable as a customer, right?
link |
01:28:10.720
So that you have to get the foundational understanding
link |
01:28:14.120
going very well.
link |
01:28:15.440
The second aspect when I said more conversational
link |
01:28:17.760
is as you imagine is more about reasoning.
link |
01:28:20.120
It is really about figuring out what the latent goal is
link |
01:28:24.360
of the customer based on what I have the information now
link |
01:28:28.520
and the history, what's the next best thing to do.
link |
01:28:31.360
So that's a complete reasoning and decision making problem.
link |
01:28:35.400
Just like your self driving car,
link |
01:28:37.040
but the goal is still more finite.
link |
01:28:38.680
Here it evolves, your environment is super hard
link |
01:28:41.960
in self driving, and the cost of a mistake is huge there,
link |
01:28:46.880
but there are certain similarities.
link |
01:28:48.520
But if you think about how many decisions Alexa is making
link |
01:28:52.640
or evaluating at any given time,
link |
01:28:54.280
it's a huge hypothesis space.
link |
01:28:56.480
And we've only talked so far
link |
01:28:59.760
about what I think of reactive decision
link |
01:29:02.080
in terms of you asked for something
link |
01:29:03.640
and Alexa is reacting to it.
link |
01:29:05.920
If you bring the proactive part,
link |
01:29:07.760
which is Alexa having hunches.
link |
01:29:10.040
So any given instance then it's really a decision
link |
01:29:14.440
at any given point based on the information.
link |
01:29:17.240
Alexa has to determine what's the best thing it needs to do.
link |
01:29:20.120
So this is the ultimate AI problem
link |
01:29:22.520
about decisions based on the information you have.
link |
01:29:25.080
Do you think, just from my perspective,
link |
01:29:27.880
I work a lot with sensing of the human face.
link |
01:29:31.120
Do you think they'll, and we touched this topic
link |
01:29:33.680
a little bit earlier, but do you think it'll be a day soon
link |
01:29:36.560
when Alexa can also look at you to help improve the quality
link |
01:29:41.360
of the hunch it has, or at least detect frustration
link |
01:29:46.360
or detect, improve the quality of its perception
link |
01:29:51.600
of what you're trying to do?
link |
01:29:54.360
I mean, let me again bring back to what it already does.
link |
01:29:57.160
We talked about how based on you barge in over Alexa,
link |
01:30:01.800
clearly it's a very high probability
link |
01:30:04.960
it must have done something wrong.
link |
01:30:06.560
That's why you barged in.
link |
01:30:08.520
The next extension of whether frustration is a signal or not,
link |
01:30:13.240
of course, is a natural thought
link |
01:30:15.320
in terms of how that should be in a signal to it.
link |
01:30:18.200
You can get that from voice.
link |
01:30:19.520
You can get from voice, but it's very hard.
link |
01:30:21.280
Like, I mean, frustration as a signal historically,
link |
01:30:25.920
if you think about emotions of different kinds,
link |
01:30:29.640
there's a whole field of affective computing,
link |
01:30:31.440
something that MIT has also done a lot of research in,
link |
01:30:34.520
is super hard.
link |
01:30:35.600
And you are now talking about a far field device,
link |
01:30:39.040
as in you're talking to a distance noisy environment.
link |
01:30:41.920
And in that environment,
link |
01:30:44.080
it needs to have a good sense for your emotions.
link |
01:30:47.520
This is a very, very hard problem.
link |
01:30:49.440
Very hard problem, but you haven't shied away
link |
01:30:50.960
from hard problems.
link |
01:30:51.800
So, Deep Learning has been at the core
link |
01:30:55.240
of a lot of this technology.
link |
01:30:57.360
Are you optimistic
link |
01:30:58.200
about the current Deep Learning approaches
link |
01:30:59.680
to solving the hardest aspects of what we're talking about?
link |
01:31:03.200
Or do you think there will come a time
link |
01:31:05.320
where new ideas need to further,
link |
01:31:07.960
if we look at reasoning,
link |
01:31:09.320
so OpenAI, DeepMind,
link |
01:31:10.640
a lot of folks are now starting to work in reasoning,
link |
01:31:13.840
trying to see how we can make neural networks reason.
link |
01:31:16.560
Do you see that new approaches need to be invented
link |
01:31:20.480
to take the next big leap?
link |
01:31:23.280
Absolutely, I think there has to be a lot more investment.
link |
01:31:27.160
And I think in many different ways,
link |
01:31:29.360
and there are these, I would say,
link |
01:31:31.160
nuggets of research forming in a good way,
link |
01:31:33.520
like learning with less data
link |
01:31:36.040
or like zero-shot learning, one-shot learning.
link |
01:31:39.640
And the active learning stuff you've talked about
link |
01:31:41.360
is incredible stuff.
link |
01:31:43.200
So, transfer learning is also super critical,
link |
01:31:45.640
especially when you're thinking about applying knowledge
link |
01:31:48.560
from one task to another,
link |
01:31:49.840
or one language to another, right?
link |
01:31:52.000
It's really ripe.
link |
01:31:52.960
So, these are great pieces.
link |
01:31:55.280
Deep learning has been useful too.
link |
01:31:56.760
And now we are sort of marrying deep learning
link |
01:31:58.840
with transfer learning and active learning.
link |
01:32:02.440
Of course, that's more straightforward
link |
01:32:04.480
in terms of applying deep learning
link |
01:32:05.840
and an active learning setup.
link |
01:32:06.960
But I do think in terms of now looking
link |
01:32:12.120
into more reasoning based approaches
link |
01:32:14.200
is going to be key for our next wave of the technology.
link |
01:32:19.440
But there is a good news.
link |
01:32:20.840
The good news is that I think for keeping on
link |
01:32:23.280
to delight customers, that a lot of it
link |
01:32:25.200
can be done by prediction tasks.
link |
01:32:27.880
So, we haven't exhausted that.
link |
01:32:30.640
So, we don't need to give up
link |
01:32:34.440
on the deep learning approaches for that.
link |
01:32:37.280
So, that's just I wanted to sort of point that out.
link |
01:32:39.520
Creating a rich, fulfilling, amazing experience
link |
01:32:42.560
that makes Amazon a lot of money
link |
01:32:44.200
and everybody a lot of money
link |
01:32:46.360
because it does awesome things, deep learning is enough.
link |
01:32:49.840
The point.
link |
01:32:51.080
I don't think, I wouldn't say deep learning is enough.
link |
01:32:54.160
I think for the purposes of Alexa
link |
01:32:56.680
accomplished the task for customers.
link |
01:32:58.400
I'm saying there are still a lot of things we can do
link |
01:33:02.160
with prediction based approaches that do not reason.
link |
01:33:05.280
I'm not saying that and we haven't exhausted those.
link |
01:33:08.600
But for the kind of high utility experiences
link |
01:33:12.440
that I'm personally passionate about
link |
01:33:14.240
of what Alexa needs to do, reasoning has to be solved
link |
01:33:18.760
to the same extent as you can think
link |
01:33:21.000
of natural language understanding and speech recognition
link |
01:33:24.720
to the extent of how accurate
link |
01:33:27.600
understanding intents has become.
link |
01:33:30.120
But reasoning, we have very, very early days.
link |
01:33:32.760
Let me ask it another way.
link |
01:33:34.000
How hard of a problem do you think that is?
link |
01:33:36.760
Hardest of them.
link |
01:33:39.160
I would say hardest of them because again,
link |
01:33:42.560
the hypothesis space is really, really large.
link |
01:33:47.560
And when you go back in time, like you were saying,
link |
01:33:50.000
I want Alexa to remember more things.
link |
01:33:53.000
Once you go beyond a session of interaction,
link |
01:33:56.280
and by session, I mean a time span,
link |
01:33:59.200
which is today, versus remembering which restaurant I like.
link |
01:34:03.120
And then when I'm planning a night out to say,
link |
01:34:05.440
do you wanna go to the same restaurant?
link |
01:34:07.480
Now you've upped the stakes big time.
link |
01:34:09.720
And this is where the reasoning dimension
link |
01:34:12.800
also goes way, way bigger.
link |
01:34:14.680
So you think the space, and we'll elaborate on that
link |
01:34:17.760
a little bit, just philosophically speaking,
link |
01:34:20.480
do you think when you reason about trying to model
link |
01:34:24.480
what the goal of a person is in the context
link |
01:34:28.040
of interacting with Alexa, you think that space is huge?
link |
01:34:31.080
It's huge, absolutely huge.
link |
01:34:32.840
Do you think, so like another sort of devil's advocate
link |
01:34:35.840
would be that we human beings are really simple
link |
01:34:38.520
and we all want like just a small set of things.
link |
01:34:41.360
And so do you think it's possible?
link |
01:34:44.720
Cause we're not talking about
link |
01:34:47.000
a fulfilling general conversation.
link |
01:34:49.240
Perhaps actually the Alexa Prize is a little bit after that.
link |
01:34:53.320
Creating a customer experience, like so many
link |
01:34:56.080
of the interactions, it feels like, are clustered
link |
01:35:01.040
in groups that don't require general reasoning.
link |
01:35:06.520
I think you're right in terms of the head
link |
01:35:09.320
of the distribution of all the possible things
link |
01:35:11.800
customers may wanna accomplish.
link |
01:35:13.720
But the tail is long and it's diverse, right?
link |
01:35:18.200
So from that.
link |
01:35:19.040
There's many, many long tails.
link |
01:35:21.280
So from that perspective, I think you have
link |
01:35:24.880
to solve that problem otherwise,
link |
01:35:27.640
and everyone's very different.
link |
01:35:28.800
Like, I mean, we see this already
link |
01:35:30.440
in terms of the skills, right?
link |
01:35:32.320
I mean, if you're an average surfer, which I am not, right?
link |
01:35:36.960
But somebody is asking Alexa about surfing conditions, right?
link |
01:35:41.640
And there's a skill that is there for them to get to, right?
link |
01:35:45.480
That tells you that the tail is massive.
link |
01:35:47.840
Like in terms of like what kind of skills
link |
01:35:50.720
people have created, it's humongous.
link |
01:35:54.200
And which means there are these diverse needs.
link |
01:35:56.960
And when you start looking at the combinations
link |
01:36:00.040
of these, right?
link |
01:36:00.960
Even if you just had pairs of skills, 90,000 choose two,
link |
01:36:05.400
it's still a big set of combinations.
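[Editor's aside: the "90,000 choose two" figure he cites is easy to verify. A minimal Python sketch, assuming only the 90,000-skill count quoted above, shows pairs of skills alone yield about four billion combinations:]

```python
from math import comb

# Number of Alexa skills quoted in the conversation (an assumption here,
# used only to illustrate the scale of the combination space).
num_skills = 90_000

# "90,000 choose two": unordered pairs of skills, i.e. n * (n - 1) / 2.
pairs = comb(num_skills, 2)

print(f"{pairs:,} possible skill pairings")  # 4,049,955,000 possible skill pairings
```

Even restricted to pairs, the space is on the order of billions, which supports the point about the long, diverse tail of customer needs.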
link |
01:36:07.920
So I'm saying there's a huge to-do here now.
link |
01:36:11.720
And I think customers are, you know,
link |
01:36:14.760
wonderfully frustrated with things.
link |
01:36:18.080
And we have to keep getting better at doing things for them.
link |
01:36:20.880
So.
link |
01:36:21.720
And they're not known to be super patient.
link |
01:36:23.920
So you have to.
link |
01:36:24.760
Do it fast.
link |
01:36:25.600
You have to do it fast.
link |
01:36:26.960
So you've mentioned the idea of a press release,
link |
01:36:29.840
in research and development at Amazon Alexa
link |
01:36:33.880
and Amazon in general, you kind of think of what
link |
01:36:35.960
the future product will look like.
link |
01:36:37.240
And you kind of make it happen.
link |
01:36:38.360
You work backwards.
link |
01:36:40.040
So can you draft for me, you probably already have one,
link |
01:36:43.920
but can you make up one for 10, 20, 30, 40 years out
link |
01:36:48.880
that you see the Alexa team putting out
link |
01:36:52.800
just in broad strokes, something that you dream about?
link |
01:36:56.520
I think let's start with the five years first, right?
link |
01:37:00.920
So, and I'll get to the 40 years too.
link |
01:37:03.600
Cause I'm pretty sure you have a real five year one.
link |
01:37:06.000
That's why I didn't want to, but yeah,
link |
01:37:08.720
in broad strokes, let's start with five years.
link |
01:37:10.120
I think the five year is where, I mean,
link |
01:37:11.800
I think of in these spaces, it's hard,
link |
01:37:14.800
especially if you're in the thick of things
link |
01:37:16.160
to think beyond the five year space,
link |
01:37:17.960
because a lot of things change, right?
link |
01:37:20.280
I mean, if you ask me five years back,
link |
01:37:22.200
would Alexa be here?
link |
01:37:24.200
I wouldn't have, I think it has surpassed
link |
01:37:26.360
my imagination of that time, right?
link |
01:37:29.040
So I think from the next five years perspective,
link |
01:37:33.160
from an AI perspective, what we're gonna see
link |
01:37:37.120
is that notion, which you said goal oriented dialogues
link |
01:37:40.400
and open domain like the Alexa Prize.
link |
01:37:42.400
I think that bridge is gonna get closed.
link |
01:37:45.200
They won't be different.
link |
01:37:46.400
And I'll give you why that's the case.
link |
01:37:48.520
You mentioned shopping.
link |
01:37:50.200
How do you shop?
link |
01:37:52.240
Do you shop in one shot?
link |
01:37:55.680
Sure, your AA batteries, paper towels.
link |
01:37:59.400
Yes, how long does it take for you to buy a camera?
link |
01:38:04.160
You do a ton of research, then you make a decision.
link |
01:38:07.480
So is that a goal oriented dialogue
link |
01:38:11.440
when somebody says, Alexa, find me a camera?
link |
01:38:15.480
Or is it simply inquisitiveness, right?
link |
01:38:18.640
So even in something that you think of as shopping,
link |
01:38:20.880
which you said you yourself use a lot of,
link |
01:38:23.960
if you go beyond where it's reorders
link |
01:38:27.360
or items where you sort of are not brand conscious
link |
01:38:32.440
and so forth.
link |
01:38:33.520
So that was just in shopping.
link |
01:38:35.040
Just to comment quickly,
link |
01:38:36.120
I've never bought anything through Alexa
link |
01:38:38.040
that I haven't bought before on Amazon on the desktop
link |
01:38:41.160
after I clicked around and read a bunch of reviews,
link |
01:38:44.000
that kind of stuff.
link |
01:38:44.840
So it's repurchase.
link |
01:38:45.800
So now you think in,
link |
01:38:47.480
even for something that you felt like is a finite goal,
link |
01:38:51.280
I think the space is huge because even products,
link |
01:38:54.680
the attributes are many,
link |
01:38:56.640
and you wanna look at reviews,
link |
01:38:58.240
some on Amazon, some outside,
link |
01:39:00.000
some you wanna look at what CNET is saying
link |
01:39:01.960
or another consumer forum is saying
link |
01:39:05.200
about a product, for instance, right?
link |
01:39:06.880
So that's just shopping where you could argue
link |
01:39:11.640
the ultimate goal is sort of known.
link |
01:39:13.960
And we haven't talked about Alexa,
link |
01:39:15.680
what's the weather in Cape Cod this weekend, right?
link |
01:39:18.880
So why am I asking that weather question, right?
link |
01:39:22.480
So I think of it as how do you complete goals
link |
01:39:27.480
with minimum steps for our customers, right?
link |
01:39:30.040
And when you think of it that way,
link |
01:39:32.400
the distinction between goal oriented and conversations
link |
01:39:35.960
for open domain, say, goes away.
link |
01:39:38.640
I may wanna know what happened
link |
01:39:41.680
in the presidential debate, right?
link |
01:39:43.520
And is it I'm seeking just information
link |
01:39:45.800
or I'm looking at who's winning the debates, right?
link |
01:39:49.560
So these are all quite hard problems.
link |
01:39:53.360
So even the five year horizon problem,
link |
01:39:55.560
I'm like, I sure hope we'll solve these.
link |
01:39:59.840
And you're optimistic? Because that's a hard problem.
link |
01:40:03.440
Which part?
link |
01:40:04.280
The reasoning enough to be able to help explore
link |
01:40:09.600
complex goals that are beyond something simplistic.
link |
01:40:12.400
That feels like it could be, well, five years is a nice.
link |
01:40:16.560
Is a nice bar for it, right?
link |
01:40:18.280
I think you will, it's a nice ambition
link |
01:40:21.240
and do we have press releases for that?
link |
01:40:23.760
Absolutely, can I tell you what specifically
link |
01:40:25.880
the roadmap will be?
link |
01:40:26.720
No, right?
link |
01:40:28.080
And what, and will we solve all of it
link |
01:40:30.760
in the five year space?
link |
01:40:31.760
No, this is, we'll work on this forever actually.
link |
01:40:35.560
This is the hardest of the AI problems
link |
01:40:37.960
and I don't see that being solved even in a 40 year horizon
link |
01:40:42.240
because even if you limit to the human intelligence,
link |
01:40:45.200
we know we are quite far from that.
link |
01:40:47.640
In fact, every aspect of our sensing, to neural processing,
link |
01:40:52.640
to how the brain stores information and how it processes it,
link |
01:40:56.320
we don't yet know how to represent knowledge, right?
link |
01:40:59.000
So we are still in those early stages.
link |
01:41:02.920
So that's why I wanted to start at the five year,
link |
01:37:06.360
because the five year success would look like
link |
01:37:09.120
solving these complex goals.
link |
01:41:11.240
And the 40 year would be where it's just natural
link |
01:41:14.560
to talk to these agents about more of these complex goals.
link |
01:41:18.720
Right now, we've already come to the point
link |
01:41:20.000
where these transactions you mentioned
link |
01:41:22.840
of asking for weather or reordering something
link |
01:41:25.720
or listening to your favorite tune,
link |
01:41:28.560
it's natural for you to ask Alexa.
link |
01:41:30.840
It's now unnatural to pick up your phone, right?
link |
01:41:33.880
And that I think is the first five year transformation.
link |
01:41:36.600
The next five year transformation would be,
link |
01:41:38.800
okay, I can plan my weekend with Alexa
link |
01:41:40.960
or I can plan my next meal with Alexa
link |
01:41:43.640
or my next night out with seamless effort.
link |
01:41:47.840
So just to pause and look back at the big picture of it all.
link |
01:41:51.200
You're a part of a large team
link |
01:41:55.560
that's creating a system that's in the home
link |
01:41:58.680
that's not human, that gets to interact with human beings.
link |
01:42:02.760
So we human beings, these descendants of apes,
link |
01:42:06.120
have created an artificial intelligence system
link |
01:42:09.000
that's able to have conversations.
link |
01:42:10.960
I mean, that to me, the two most transformative robots
link |
01:42:18.800
of this century, I think will be autonomous vehicles,
link |
01:42:23.200
but they're a little bit transformative
link |
01:42:24.760
in a more boring way.
link |
01:42:26.360
It's like a tool.
link |
01:42:28.120
I think conversational agents in the home
link |
01:42:32.840
is like an experience.
link |
01:42:34.640
How does that make you feel?
link |
01:42:36.120
That you're at the center of creating that?
link |
01:42:38.560
Do you sit back in awe sometimes?
link |
01:42:42.800
What is your feeling about the whole mess of it?
link |
01:42:47.320
Can you even believe that we're able
link |
01:42:49.000
to create something like this?
link |
01:42:50.840
I think it's a privilege.
link |
01:42:52.440
I'm so fortunate like where I ended up, right?
link |
01:42:57.640
And it's been a long journey.
link |
01:43:00.800
Like I've been in this space for a long time in Cambridge,
link |
01:43:03.480
right, and it's so heartwarming to see
link |
01:43:07.080
the kind of adoption conversational agents are having now.
link |
01:43:12.440
Five years back, it was almost like,
link |
01:43:14.480
should I move out of this because we are unable
link |
01:43:17.120
to find this killer application that customers would love
link |
01:43:21.360
that would not simply be a good to have thing
link |
01:43:24.440
in research labs.
link |
01:43:26.080
And it's so fulfilling to see it make a difference
link |
01:43:29.160
to millions and billions of people worldwide.
link |
01:43:32.240
The good thing is that it's still very early.
link |
01:43:34.400
So I have another 20 years of job security
link |
01:43:37.360
doing what I love.
link |
01:43:38.200
Like, so I think from that perspective,
link |
01:43:42.000
I tell every researcher that joins
link |
01:43:44.280
or every member of my team,
link |
01:43:46.240
that this is a unique privilege.
link |
01:43:47.640
Like I think, and we have,
link |
01:43:49.560
and I would say not just launching Alexa in 2014,
link |
01:43:52.760
which was the first of its kind.
link |
01:43:54.360
Along the way, when we launched the Alexa Skills Kit,
link |
01:43:57.360
it became a way of democratizing AI.
link |
01:43:59.680
Before that, there was no good example
link |
01:44:02.440
of an SDK for speech and language.
link |
01:44:04.960
Now we are coming to this where you and I
link |
01:44:06.640
are having this conversation where I'm not saying,
link |
01:44:10.320
oh, Lex, planning a night out with an AI agent, impossible.
link |
01:44:14.560
I'm saying it's in the realm of possibility
link |
01:44:17.120
and not only possibility, we'll be launching this, right?
link |
01:44:19.480
So some elements of that, it will keep getting better.
link |
01:44:23.800
We know that is a universal truth.
link |
01:44:25.640
Once you have these kinds of agents out there being used,
link |
01:44:30.160
they get better for your customers.
link |
01:44:32.080
And I think that's where,
link |
01:44:34.240
I think the amount of research topics
link |
01:44:36.560
we are throwing at our budding researchers
link |
01:44:39.480
is just gonna be exponentially hard.
link |
01:44:41.840
And the great thing is you can now get immense satisfaction
link |
01:44:45.600
by having customers use it,
link |
01:44:47.280
not just a paper in NeurIPS or another conference.
link |
01:44:51.120
I think everyone, myself included,
link |
01:44:53.120
is deeply excited about that future.
link |
01:44:54.840
So I don't think there's a better place to end, Rohit.
link |
01:44:58.040
Thank you so much for talking to us.
link |
01:44:58.880
Thank you so much.
link |
01:44:59.720
This was fun.
link |
01:45:00.560
Thank you, same here.
link |
01:45:02.240
Thanks for listening to this conversation
link |
01:45:04.240
with Rohit Prasad.
link |
01:45:05.760
And thank you to our presenting sponsor, Cash App.
link |
01:45:08.880
Download it, use code LexPodcast,
link |
01:45:11.600
you'll get $10 and $10 will go to FIRST,
link |
01:45:14.720
a STEM education nonprofit
link |
01:45:16.520
that inspires hundreds of thousands of young minds
link |
01:45:19.760
to learn and to dream of engineering our future.
link |
01:45:23.320
If you enjoy this podcast, subscribe on YouTube,
link |
01:45:26.220
give it five stars on Apple Podcast,
link |
01:45:28.200
support it on Patreon, or connect with me on Twitter.
link |
01:45:31.720
And now let me leave you with some words of wisdom
link |
01:45:34.960
from the great Alan Turing.
link |
01:45:37.500
Sometimes it is the people no one can imagine anything of
link |
01:45:41.680
who do the things no one can imagine.
link |
01:45:44.180
Thank you for listening and hope to see you next time.