
Rohit Prasad: Amazon Alexa and Conversational AI | Lex Fridman Podcast #57



link |
00:00:00.000
The following is a conversation with Rohit Prasad.
link |
00:00:02.960
He's the vice president and head scientist of Amazon Alexa
link |
00:00:06.360
and one of its original creators.
link |
00:00:08.880
The Alexa team embodies some of the most challenging,
link |
00:00:12.120
incredible, impactful, and inspiring work
link |
00:00:14.960
that is done in AI today.
link |
00:00:17.040
The team has to both solve problems
link |
00:00:19.120
at the cutting edge of natural language processing
link |
00:00:21.720
and provide a trustworthy, secure,
link |
00:00:24.040
and enjoyable experience to millions of people.
link |
00:00:27.440
This is where state of the art methods
link |
00:00:29.400
in computer science meet the challenges
link |
00:00:31.800
of real world engineering.
link |
00:00:33.680
In many ways, Alexa and the other voice assistants
link |
00:00:37.280
are the voices of artificial intelligence
link |
00:00:39.480
to millions of people and an introduction to AI
link |
00:00:43.120
for people who have only encountered it in science fiction.
link |
00:00:46.920
This is an important and exciting opportunity.
link |
00:00:49.920
And so the work that Rohit and the Alexa team are doing
link |
00:00:52.880
is an inspiration to me and to many researchers
link |
00:00:55.920
and engineers in the AI community.
link |
00:00:58.800
This is the Artificial Intelligence Podcast.
link |
00:01:01.880
If you enjoy it, subscribe on YouTube,
link |
00:01:04.360
give it five stars on Apple Podcasts,
link |
00:01:06.320
support it on Patreon,
link |
00:01:07.680
or simply connect with me on Twitter.
link |
00:01:09.760
Lex Fridman, spelled F R I D M A N.
link |
00:01:13.640
If you leave a review on Apple Podcasts especially,
link |
00:01:16.880
but also Cast Box or comment on YouTube,
link |
00:01:19.960
consider mentioning topics, people, ideas, questions, quotes,
link |
00:01:23.560
and science, tech, or philosophy that you find interesting.
link |
00:01:26.320
And I'll read them on this podcast.
link |
00:01:28.840
I won't call out names, but I love comments
link |
00:01:31.680
with kindness and thoughtfulness in them,
link |
00:01:33.280
so I thought I'd share them.
link |
00:01:35.760
Someone on YouTube highlighted a quote
link |
00:01:37.520
from the conversation with Ray Dalio,
link |
00:01:40.320
where he said that you have to appreciate
link |
00:01:42.000
all the different ways that people can be A players.
link |
00:01:45.320
This connected with me too.
link |
00:01:47.040
On teams of engineers, it's easy to think
link |
00:01:49.280
that raw productivity is the measure of excellence,
link |
00:01:52.000
but there are others.
link |
00:01:53.480
I worked with people who brought a smile to my face
link |
00:01:55.800
every time I got to work in the morning.
link |
00:01:57.960
Their contribution to the team is immeasurable.
link |
00:02:01.280
I recently started doing podcast ads
link |
00:02:03.080
at the end of the introduction.
link |
00:02:04.720
I'll do one or two minutes after introducing the episode,
link |
00:02:07.680
and never any ads in the middle
link |
00:02:09.200
that break the flow of the conversation.
link |
00:02:11.560
I hope that works for you.
link |
00:02:13.040
and doesn't hurt the listening experience.
link |
00:02:15.720
This show is presented by Cash App,
link |
00:02:17.880
the number one finance app in the App Store.
link |
00:02:20.400
I personally use Cash App to send money to friends,
link |
00:02:23.040
but you can also use it to buy, sell,
link |
00:02:24.760
and deposit Bitcoin in just seconds.
link |
00:02:27.200
Cash App also has a new investing feature.
link |
00:02:30.400
You can buy fractions of a stock, say $1 worth,
link |
00:02:33.680
no matter what the stock price is.
link |
00:02:35.840
Brokerage services are provided by Cash App Investing,
link |
00:02:38.720
a subsidiary of Square and Member SIPC.
link |
00:02:42.480
I'm excited to be working with Cash App
link |
00:02:44.480
to support one of my favorite organizations called FIRST,
link |
00:02:47.600
best known for their FIRST Robotics and LEGO competitions.
link |
00:02:50.960
They educate and inspire hundreds of thousands of students
link |
00:02:54.400
in over 110 countries,
link |
00:02:56.280
and have a perfect rating on Charity Navigator,
link |
00:02:58.880
which means the donated money is used
link |
00:03:00.960
to maximum effectiveness.
link |
00:03:03.480
When you get Cash App from the App Store, Google Play,
link |
00:03:06.440
and use code LEXPODCAST, you'll get $10,
link |
00:03:10.280
and Cash App will also donate $10 to FIRST,
link |
00:03:13.280
which again is an organization
link |
00:03:15.160
that I've personally seen inspire girls and boys
link |
00:03:18.200
to dream of engineering a better world.
link |
00:03:20.760
This podcast is also supported by ZipRecruiter.
link |
00:03:24.240
Hiring great people is hard,
link |
00:03:26.240
and to me is one of the most important elements
link |
00:03:28.960
of a successful mission driven team.
link |
00:03:31.440
I've been fortunate to be a part of
link |
00:03:33.480
and lead several great engineering teams.
link |
00:03:36.000
The hiring I've done in the past
link |
00:03:37.680
was mostly through tools we built ourselves,
link |
00:03:40.560
but reinventing the wheel was painful.
link |
00:03:42.800
ZipRecruiter is a tool that's already available for you.
link |
00:03:45.960
It seeks to make hiring simple, fast, and smart.
link |
00:03:49.400
For example, Codable cofounder Gretchen Huebner
link |
00:03:52.760
used ZipRecruiter to find a new game artist
link |
00:03:55.200
to join her education tech company.
link |
00:03:57.440
By using ZipRecruiter's screening questions
link |
00:03:59.600
to filter candidates,
link |
00:04:00.880
Gretchen found it easier to focus on the best candidates
link |
00:04:03.840
and finally hired the perfect person for the role
link |
00:04:06.800
in less than two weeks from start to finish.
link |
00:04:10.040
ZipRecruiter, the smartest way to hire.
link |
00:04:13.040
See why ZipRecruiter is effective for businesses of all sizes
link |
00:04:17.040
by signing up, as I did, for free at ziprecruiter.com
link |
00:04:21.080
slash lexpod, that's ziprecruiter.com slash lexpod.
link |
00:04:27.800
And now, here's my conversation with Rohit Prasad.
link |
00:04:33.040
In the movie Her, I'm not sure if you've ever seen it.
link |
00:04:36.280
A human falls in love with the voice of an AI system.
link |
00:04:39.880
Let's start at the highest philosophical level
link |
00:04:42.120
before we get to deep learning and some of the fun things.
link |
00:04:45.200
Do you think this, what the movie Her shows,
link |
00:04:48.160
is within our reach?
link |
00:04:50.800
I think, not specifically about Her,
link |
00:04:54.240
but I think what we are seeing is a massive increase
link |
00:04:58.760
in adoption of AI assistants or AI
link |
00:05:02.200
in all parts of our social fabric.
link |
00:05:05.200
And what I do believe
link |
00:05:08.560
is that the utility these AIs provide
link |
00:05:12.560
some of the functionalities that are shown
link |
00:05:16.920
are absolutely within reach.
link |
00:05:20.480
So some of the functionality in terms
link |
00:05:22.200
of the interactive elements,
link |
00:05:24.000
but in terms of the deep connection,
link |
00:05:27.040
that's purely voice based.
link |
00:05:29.160
Do you think such a close connection is possible
link |
00:05:31.560
with voice alone?
link |
00:05:33.000
It's been a while since I saw Her,
link |
00:05:34.640
but I would say in terms of the,
link |
00:05:37.680
in terms of interactions which are both human like
link |
00:05:40.480
and in these AI assistants, you have to value
link |
00:05:44.280
what is also superhuman.
link |
00:05:46.720
We as humans can be in only one place.
link |
00:05:49.640
AI assistants can be in multiple places at the same time.
link |
00:05:53.160
One with you on your mobile device,
link |
00:05:55.600
one at your home, one at work.
link |
00:05:58.280
So you have to respect these superhuman capabilities too.
link |
00:06:02.200
Plus as humans, we have certain attributes
link |
00:06:05.000
we're very good at, like reasoning.
link |
00:06:07.000
AI assistants are not yet there,
link |
00:06:09.200
but in the realm of AI assistants,
link |
00:06:11.920
what they're great at is computation, memory.
link |
00:06:14.360
It's infinite and pure.
link |
00:06:16.240
These are the attributes you have to start respecting.
link |
00:06:18.040
So I think the comparison with human like
link |
00:06:19.960
versus the other aspect,
link |
00:06:21.880
which is also superhuman,
link |
00:06:23.080
has to be taken into consideration.
link |
00:06:24.520
So I think we need to elevate the discussion
link |
00:06:27.040
to not just human like.
link |
00:06:28.840
So there's certainly elements we just mentioned.
link |
00:06:32.080
Alexa is everywhere, computationally speaking.
link |
00:06:35.560
So this is a much bigger infrastructure
link |
00:06:37.320
than just the thing that sits there
link |
00:06:38.760
in the room with you.
link |
00:06:40.240
But it certainly feels to us mere humans
link |
00:06:44.840
that there's just another little creature there
link |
00:06:49.280
when you're interacting with it.
link |
00:06:50.240
You're not interacting with the entirety
link |
00:06:51.720
of the infrastructure, you're interacting with the device.
link |
00:06:54.160
The feeling is, okay, sure, we anthropomorphize things,
link |
00:06:58.400
but that feeling is still there.
link |
00:07:00.520
So what do you think we as humans,
link |
00:07:03.920
the purity of the interaction with a smart assistant,
link |
00:07:06.960
what do you think we look for
link |
00:07:08.480
in that interaction?
link |
00:07:10.240
I think in the certain interactions,
link |
00:07:12.280
I think we'll be very much where it does feel like a human
link |
00:07:15.960
because it has a persona of its own.
link |
00:07:19.120
And in certain ones, it wouldn't be.
link |
00:07:20.680
So I think a simple example to think of it is
link |
00:07:23.280
if you're walking through the house
link |
00:07:25.240
and you just wanna turn your lights on and off
link |
00:07:28.000
and you're issuing a command,
link |
00:07:29.880
that's not very much a human-like interaction.
link |
00:07:32.080
And that's where the AI shouldn't come back
link |
00:07:33.880
and have a conversation with you.
link |
00:07:35.280
It should just simply complete that command.
link |
00:07:38.120
So I think the blend of,
link |
00:07:40.080
we have to think about this as not human-human interaction alone.
link |
00:07:43.160
It is a human-machine interaction
link |
00:07:44.960
and certain aspects of humans are needed
link |
00:07:48.040
and certain aspects and situations
link |
00:07:49.800
demand it to be like a machine.
link |
00:07:51.520
So I told you, it's gonna be philosophical in parts.
link |
00:07:54.920
What's the difference between human and machine
link |
00:07:57.320
in that interaction?
link |
00:07:58.520
When we interact with humans,
link |
00:08:00.640
especially those who are friends and loved ones
link |
00:08:03.880
versus you and a machine that you also are close with.
link |
00:08:10.240
I think you have to think about the roles the AI plays, right?
link |
00:08:13.640
And it differs from customer to customer,
link |
00:08:16.120
different situation to situation,
link |
00:08:19.080
especially I can speak from Alexa's perspective.
link |
00:08:21.400
It is a companion, a friend at times, an assistant
link |
00:08:25.880
and an advisor down the line.
link |
00:08:27.360
So I think most AIs will have these kinds of attributes
link |
00:08:31.080
and it will be very situational in nature.
link |
00:08:32.880
So where is the boundary?
link |
00:08:34.480
I think the boundary depends on exact context
link |
00:08:36.920
in which you're interacting with the AI.
link |
00:08:39.120
So the depth and the richness of natural language conversation
link |
00:08:42.760
has, by Alan Turing,
link |
00:08:45.680
been used to try to define what it means to be intelligent.
link |
00:08:50.320
There's a lot of criticism of that kind of test,
link |
00:08:52.120
but what do you think is a good test of intelligence
link |
00:08:55.680
in your view in the context of the Turing test?
link |
00:08:58.200
And Alexa, with the Alexa prize, this whole realm,
link |
00:09:03.040
do you think about this human intelligence,
link |
00:09:07.000
what it means to define it,
link |
00:09:08.000
what it means to reach that level?
link |
00:09:09.920
I do think the ability to converse
link |
00:09:12.320
is a sign of an ultimate intelligence.
link |
00:09:15.000
I think that there's no question about it.
link |
00:09:18.200
So if you think about all aspects of humans,
link |
00:09:20.400
there are sensors we have
link |
00:09:22.680
and those are basically a data collection mechanism.
link |
00:09:26.240
And based on that, we make some decisions
link |
00:09:28.080
with our sensory brains, right?
link |
00:09:30.440
And from that perspective,
link |
00:09:32.600
I think there are elements we have to talk about
link |
00:09:35.080
how we sense the world
link |
00:09:36.920
and then how we act based on what we sense.
link |
00:09:40.200
Those elements clearly machines have.
link |
00:09:43.520
But then there are other aspects, like computation,
link |
00:09:46.640
that machines do way better.
link |
00:09:48.240
I also mentioned memory again
link |
00:09:49.920
in terms of being near infinite,
link |
00:09:51.760
depending on the storage capacity you have.
link |
00:09:54.080
And the retrieval can be extremely fast and pure
link |
00:09:58.080
in terms of like, there's no ambiguity of
link |
00:10:00.080
who did I see when, right?
link |
00:10:02.000
I mean, machines can remember that quite well.
link |
00:10:04.320
So again, on a philosophical level,
link |
00:10:06.720
I do subscribe to the fact that to be able to converse
link |
00:10:10.720
and as part of that to be able to reason
link |
00:10:13.280
based on the world knowledge you've acquired
link |
00:10:15.120
and the sensory knowledge that is there
link |
00:10:18.200
is definitely very much the essence of intelligence.
link |
00:10:21.960
But intelligence can go beyond human level,
link |
00:10:25.160
intelligence based on what machines are getting capable of.
link |
00:10:28.560
So what do you think maybe stepping outside of Alexa
link |
00:10:32.120
broadly as an AI field?
link |
00:10:34.440
What do you think is a good test of intelligence?
link |
00:10:37.480
Put it another way outside of Alexa,
link |
00:10:39.880
because so much of Alexa is a product,
link |
00:10:41.680
is an experience for the customer.
link |
00:10:43.600
On the research side,
link |
00:10:45.120
what would impress the heck out of you if you saw?
link |
00:10:47.920
What is the test where you said, wow,
link |
00:10:50.720
this thing is now starting to encroach
link |
00:10:56.960
into the realm of what we loosely think
link |
00:10:59.000
of as human intelligence?
link |
00:11:00.320
So, well, we think of it as AGI
link |
00:11:02.360
and human intelligence altogether, right?
link |
00:11:04.320
So in some sense, and I think we are quite far from that.
link |
00:11:07.960
I think an unbiased view I have
link |
00:11:11.440
is that Alexa's intelligence capability is a great test.
link |
00:11:17.720
I think of it as, there are many other proof points
link |
00:11:20.560
like self driving cars,
link |
00:11:23.000
game playing like go or chess.
link |
00:11:26.280
Let's take those two as examples.
link |
00:11:28.640
Clearly requires a lot of data driven learning
link |
00:11:31.760
and intelligence, but it's not as hard a problem
link |
00:11:35.080
as conversing as an AI with humans
link |
00:11:39.760
to accomplish certain tasks or open domain chat,
link |
00:11:42.320
as you mentioned, the Alexa Prize.
link |
00:11:44.840
In those settings, the key difference is
link |
00:11:47.760
that the end goal is not defined unlike game playing.
link |
00:11:51.920
You also do not know exactly what state you are in
link |
00:11:55.720
in a particular goal completion scenario.
link |
00:11:58.960
In a certain sense, sometimes you can if it's a simple goal,
link |
00:12:02.080
but even in certain examples
link |
00:12:04.480
like planning a weekend, you can imagine
link |
00:12:07.120
how many things change along the way.
link |
00:12:09.920
You look for weather, you may change your mind
link |
00:12:11.960
and you change the destination
link |
00:12:14.880
or you want to catch a particular event
link |
00:12:17.040
and then you decide, no, I want this other event
link |
00:12:19.440
I want to go to.
link |
00:12:20.560
So these dimensions of how many different steps are possible
link |
00:12:24.800
when you're conversing as a human with a machine
link |
00:12:27.440
makes it an extremely daunting problem.
link |
00:12:29.120
And I think it is the ultimate test for intelligence.
link |
00:12:32.400
And don't you think that natural language
link |
00:12:35.720
is enough to prove that conversation?
link |
00:12:39.040
Just pure conversation.
link |
00:12:40.400
From a scientific standpoint,
link |
00:12:42.320
natural language is a great test,
link |
00:12:45.040
but I would go beyond, I don't want to limit it
link |
00:12:47.840
to natural language as simply understanding an intent
link |
00:12:51.120
or parsing for entities and so forth.
link |
00:12:52.800
We are really talking about dialogue.
link |
00:12:55.680
So I would say human machine dialogue
link |
00:12:58.520
is definitely one of the best tests of intelligence.
link |
00:13:02.960
So can you briefly speak to the Alexa prize
link |
00:13:06.680
for people who are not familiar with it
link |
00:13:08.640
and also just maybe where things stand
link |
00:13:12.640
and what have you learned and what's surprising?
link |
00:13:15.440
What have you seen that's surprising
link |
00:13:16.920
from this incredible competition?
link |
00:13:18.440
Absolutely, it's a very exciting competition.
link |
00:13:20.960
The Alexa Prize is essentially a grand challenge
link |
00:13:24.040
in conversational artificial intelligence
link |
00:13:26.880
where we threw the gauntlet to the universities
link |
00:13:29.440
who do active research in the field to say,
link |
00:13:32.360
can you build what we call a social bot
link |
00:13:35.360
that can converse with you coherently
link |
00:13:37.320
and engagingly for 20 minutes?
link |
00:13:39.800
That is an extremely hard challenge talking to someone
link |
00:13:43.600
who you're meeting for the first time
link |
00:13:46.480
or even if you've met them quite often
link |
00:13:49.640
to speak for 20 minutes on any topic
link |
00:13:53.560
with an evolving nature of topics, is super hard.
link |
00:13:57.720
We have completed two successful years of the competition.
link |
00:14:01.600
The first was won by the University of Washington,
link |
00:14:03.400
the second by the University of California.
link |
00:14:05.560
We are in our third instance.
link |
00:14:06.880
We have an extremely strong cohort of 10 teams,
link |
00:14:09.640
and the third instance of the Alexa prize is underway now.
link |
00:14:14.840
And we are seeing a constant evolution.
link |
00:14:17.480
The first year was definitely a learning experience.
link |
00:14:18.920
There were a lot of things to be put together.
link |
00:14:21.200
We had to build a lot of infrastructure
link |
00:14:23.640
to enable these universities to be able
link |
00:14:26.400
to build magical experiences
link |
00:14:28.280
and do high quality research.
link |
00:14:31.560
Just a few quick questions, sorry for the interruption.
link |
00:14:33.920
What does failure look like in the 20 minute session?
link |
00:14:37.280
So what does it mean to fail, to not reach the 20 minute mark?
link |
00:14:40.120
Awesome question.
link |
00:14:41.240
So first of all,
link |
00:14:43.360
I forgot to mention one more detail.
link |
00:14:45.360
It's not just 20 minutes,
link |
00:14:46.560
but the quality of the conversation too that matters.
link |
00:14:49.320
And the beauty of this competition
link |
00:14:51.480
before I answer that question on what failure means
link |
00:14:53.800
is first that you actually converse
link |
00:14:56.600
with millions and millions of customers
link |
00:14:59.000
as these social bots.
link |
00:15:00.840
So during the judging phases, there are multiple phases.
link |
00:15:05.000
Before we get to the finals,
link |
00:15:06.320
which is a very controlled judging situation
link |
00:15:08.640
where we bring in judges and we have contractors
link |
00:15:11.760
who interact with these social bots,
link |
00:15:14.400
that is a much more controlled setting.
link |
00:15:15.920
But till the point we get to the finals,
link |
00:15:18.960
all the judging is essentially by the customers of Alexa.
link |
00:15:22.720
And there you basically rate on a simple question
link |
00:15:26.200
how good your experience was.
link |
00:15:28.480
So that's where we are not testing
link |
00:15:29.920
for a 20 minute boundary being crossed
link |
00:15:32.800
because you do want a clear cut winner
link |
00:15:37.080
to be chosen, and it's an absolute bar.
link |
00:15:40.080
So whether you really broke that 20 minute barrier
link |
00:15:42.800
is why we have to test it in a more controlled setting
link |
00:15:45.920
with actors, essentially interactors
link |
00:15:48.680
and see how the conversation goes.
link |
00:15:50.840
So there's a subtle difference
link |
00:15:54.200
between how it's being tested in the field
link |
00:15:57.040
with real customers versus in the lab to award the prize.
link |
00:16:00.520
So on the latter one, what it means is that
link |
00:16:03.560
essentially there are three judges
link |
00:16:08.040
and two of them have to say this conversation
link |
00:16:10.320
is stalled, essentially.
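The judging rule described here, where a conversation is considered over once a majority of the judges flag it as stalled, can be sketched in a few lines. This is purely an illustrative sketch of that 2-of-3 majority rule; the function names and session structure are hypothetical, not Amazon's actual judging implementation:

```python
# Illustrative sketch of the stall rule described above: a conversation
# ends when a majority of judges (2 of 3) flag it as stalled.
# All names and structure here are hypothetical, not Amazon's system.

def conversation_stalled(judge_votes, majority=2):
    """Return True if at least `majority` judges voted 'stalled'."""
    return sum(1 for vote in judge_votes if vote == "stalled") >= majority

def score_session(turn_votes):
    """Return the turn at which a majority first declared a stall,
    or the total number of turns if the session never stalled."""
    for turn, votes in enumerate(turn_votes, start=1):
        if conversation_stalled(votes):
            return turn  # conversation ended at this turn
    return len(turn_votes)

# Example: three judges vote each turn; a majority flags a stall on turn 3.
votes_per_turn = [
    ["ok", "ok", "ok"],
    ["stalled", "ok", "ok"],
    ["stalled", "stalled", "ok"],
]
```

Under this sketch, `score_session(votes_per_turn)` would report the session ending at turn 3, the first turn where two of the three votes are "stalled".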
link |
00:16:13.080
Got it.
link |
00:16:13.920
And the judges are human experts.
link |
00:16:15.760
Judges are human experts.
link |
00:16:16.960
Okay, great.
link |
00:16:17.800
So this is in the third year.
link |
00:16:19.080
So what's been the evolution?
link |
00:16:20.840
How far, so the DARPA challenge in the first year,
link |
00:16:24.560
the autonomous vehicles, nobody finished; in the second year,
link |
00:16:27.720
a few more finished in the desert.
link |
00:16:30.600
So how far along in this, I would say,
link |
00:16:34.320
much harder challenge are we?
link |
00:16:36.320
This challenge has come a long way, but
link |
00:16:39.200
we're definitely not close to the 20 minute barrier
link |
00:16:41.800
being met with coherent and engaging conversation.
link |
00:16:44.720
I think we are still five to 10 years away
link |
00:16:46.840
in that horizon to complete that.
link |
00:16:49.480
But the progress is immense.
link |
00:16:51.360
Like what you're finding is the accuracy
link |
00:16:54.080
and what kind of responses these social bots generate
link |
00:16:57.360
is getting better and better.
link |
00:16:59.480
What's even amazing to see is that now there's humor coming in.
link |
00:17:03.320
The bots are quite...
link |
00:17:04.880
Awesome.
link |
00:17:06.200
You're talking about the ultimate sign of intelligence.
link |
00:17:09.440
I think humor is a very high bar
link |
00:17:11.840
in terms of what it takes to create humor.
link |
00:17:14.880
And I don't mean just being goofy.
link |
00:17:16.520
I really mean good sense of humor
link |
00:17:19.440
is also a sign of intelligence in my mind
link |
00:17:21.600
and something very hard to do.
link |
00:17:23.120
So these social bots are now exploring
link |
00:17:25.040
not only what we think of as natural language abilities
link |
00:17:28.560
but also personality attributes
link |
00:17:30.360
and aspects of when to inject an appropriate joke,
link |
00:17:34.080
when you don't know the domain,
link |
00:17:38.400
how you come back with something more intelligible
link |
00:17:41.360
so that you can continue the conversation.
link |
00:17:43.160
If you and I are talking about AI
link |
00:17:45.200
and we are domain experts, we can speak to it.
link |
00:17:47.480
But if you suddenly switch to a topic
link |
00:17:49.280
that I don't know of,
link |
00:17:50.480
how do I change the conversation?
link |
00:17:52.160
So you're starting to notice these elements as well.
link |
00:17:55.240
And that's coming partly from the nature
link |
00:17:58.560
of the 20 minute challenge
link |
00:18:00.120
that people are getting quite clever
link |
00:18:02.520
on how to really converse
link |
00:18:05.600
and essentially mask some of the understanding defects
link |
00:18:08.600
if they exist.
link |
00:18:09.880
So some of this, this is not Alexa the product.
link |
00:18:12.720
This is somewhat for fun, for research, for innovation
link |
00:18:17.000
and so on.
link |
00:18:17.840
I have a question sort of in this modern era,
link |
00:18:20.280
there's a lot of, if you look at Twitter
link |
00:18:23.440
and Facebook and so on, there's discourse,
link |
00:18:25.840
public discourse going on
link |
00:18:27.200
and some things that are a little bit too edgy,
link |
00:18:28.840
people get blocked and so on.
link |
00:18:30.680
I'm just out of curiosity.
link |
00:18:32.280
Are people in this context pushing the limits?
link |
00:18:36.000
Is anyone using the F word?
link |
00:18:37.760
Is anyone sort of pushing back sort of arguing,
link |
00:18:44.760
I guess I should say as part of the dialogue
link |
00:18:46.920
to really draw people in?
link |
00:18:48.320
First of all, let me just back up a bit
link |
00:18:50.360
in terms of why we are doing this, right?
link |
00:18:52.160
So you said it's fun.
link |
00:18:54.320
I think fun is more part of the engaging part for customers.
link |
00:18:59.960
It is one of the most used skills as well in our skill store.
link |
00:19:04.360
But that aside, the real goal was essentially
link |
00:19:07.240
what was happening is
link |
00:19:08.760
with a lot of AI research moving to industry,
link |
00:19:11.920
we felt that academia has the risk
link |
00:19:14.200
of not being able to have the same resources
link |
00:19:16.800
at their disposal that we have, which is lots of data,
link |
00:19:20.480
massive computing power,
link |
00:19:22.720
and clear ways to test these AI advances
link |
00:19:26.320
with real customer benefits.
link |
00:19:28.520
So we brought all these three together in the Alexa prize.
link |
00:19:30.880
That's why it's one of my favorite projects in Amazon.
link |
00:19:33.880
And with that, the secondary effect is,
link |
00:19:37.520
yes, it has become engaging for our customers as well.
link |
00:19:40.960
We're not there in terms of where we want it to be, right?
link |
00:19:43.920
But it's a huge progress.
link |
00:19:45.080
But coming back to your question on
link |
00:19:47.120
how do the conversations evolve?
link |
00:19:48.840
Yes, there are some natural attributes
link |
00:19:51.040
of what you said in terms of argument
link |
00:19:52.800
and some amount of swearing.
link |
00:19:54.200
The way we take care of that
link |
00:19:56.040
is that there is a sensitive filter we have built.
link |
00:19:59.120
That's some keywords and so on.
link |
00:20:00.440
It's more than keywords, a little more in terms of,
link |
00:20:03.520
of course, there's keyword based too,
link |
00:20:04.920
but there's more in terms of,
link |
00:20:06.960
these words can be very contextual, as you can see.
link |
00:20:09.480
And also the topic can be something
link |
00:20:12.640
that you don't want a conversation to happen on
link |
00:20:15.480
because this is a communal device as well.
link |
00:20:17.360
A lot of people use these devices.
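A filter of the kind described, combining keyword matching with topic-level checks for a communal device, might look roughly like the following minimal sketch. The word and topic lists are placeholders, and the whole thing is a hypothetical illustration, far simpler than the contextual filter Alexa actually uses:

```python
# Minimal hypothetical sketch of a sensitivity filter that goes beyond bare
# keywords by also blocking at the topic level, as described above.
# The lists are placeholders, not Alexa's real filter.

BLOCKED_WORDS = {"swearword1", "swearword2"}   # placeholder keyword list
BLOCKED_TOPICS = {"violence", "adult"}          # placeholder topic list

def is_sensitive(utterance, predicted_topic=None):
    """Flag an utterance if it contains a blocked word or a blocked topic."""
    words = {w.strip(".,!?").lower() for w in utterance.split()}
    if words & BLOCKED_WORDS:
        return True
    # The topic check catches what no keyword list would: an innocuous-looking
    # sentence whose predicted topic is unsuitable for a communal device.
    return predicted_topic in BLOCKED_TOPICS

print(is_sensitive("let's talk about the weather"))           # False
print(is_sensitive("tell me more", predicted_topic="adult"))  # True
```

A real system would replace the `predicted_topic` argument with a classifier over the conversation context, which is the "more than keywords" part being described.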
link |
00:20:19.320
So we have put a lot of guardrails for the conversation
link |
00:20:22.680
to be more useful for advancing AI
link |
00:20:26.000
and not so much about these other issues you attributed
link |
00:20:31.160
to what's happening in the AI field as well.
link |
00:20:32.960
Right, so this is actually a serious opportunity.
link |
00:20:35.360
I didn't use the right word, fun.
link |
00:20:36.920
I think it's an open opportunity to do
link |
00:20:40.520
some of the best innovation in conversational agents
link |
00:20:43.960
in the world.
link |
00:20:44.800
Absolutely.
link |
00:20:45.960
Why just universities?
link |
00:20:49.040
Why just universities?
link |
00:20:49.960
Because as I said, I really felt the young minds,
link |
00:20:53.240
it's also, if you think about the other aspect
link |
00:20:57.960
of where the whole industry is moving with AI,
link |
00:21:01.440
there's a dearth of talent given the demands.
link |
00:21:04.920
So you do want universities to have a clear place
link |
00:21:09.920
where they can invent and research and not fall behind
link |
00:21:12.520
such that they can't motivate students.
link |
00:21:13.960
Imagine if all grad students left to industry, like us,
link |
00:21:19.640
or faculty members, which has happened too.
link |
00:21:22.920
So this is a way that if you're so passionate
link |
00:21:25.240
about the field where you feel industry and academia
link |
00:21:28.640
need to work well, this is a great example
link |
00:21:31.400
and a great way for universities to participate.
link |
00:21:35.440
So what do you think it takes to build a system
link |
00:21:37.320
that wins the Alexa Prize?
link |
00:21:39.640
I think you have to start focusing on aspects of reasoning
link |
00:21:46.240
because today there are still more lookups
link |
00:21:50.800
of what intents the customer is asking for
link |
00:21:54.200
and responding to those rather than really reasoning
link |
00:21:58.960
about the elements of the conversation.
link |
00:22:02.520
For instance,
link |
00:22:06.280
if the conversation is about games
link |
00:22:08.120
and it's about a recent sports event,
link |
00:22:11.280
there's so much context involved
link |
00:22:13.320
and you have to understand the entities
link |
00:22:15.840
that are being mentioned so that the conversation
link |
00:22:19.080
is coherent rather than you suddenly just switch
link |
00:22:21.560
to knowing some fact about a sports entity
link |
00:22:25.200
and you're just relaying that rather
link |
00:22:26.680
than understanding the true context of the game.
link |
00:22:28.720
Like if you just said, I learned this fun fact
link |
00:22:32.320
about Tom Brady, rather than really saying
link |
00:22:36.000
how he played the game the previous night,
link |
00:22:39.320
then the conversation is not really that intelligent.
link |
00:22:42.840
So you have to go to more reasoning elements
link |
00:22:46.200
of understanding the context of the dialogue
link |
00:22:49.160
and giving more appropriate responses,
link |
00:22:51.240
which tells you that we are still quite far
link |
00:22:53.720
because a lot of times it's more facts being looked up
link |
00:22:57.440
and something that's close enough as an answer
link |
00:22:59.960
but not really the answer.
link |
00:23:02.080
So that is where the research needs to go more
link |
00:23:05.080
into actual true understanding and reasoning.
link |
00:23:08.400
And that's why I feel it's a great way to do it
link |
00:23:10.480
because you have an engaged set of users working
link |
00:23:14.240
to help make these AI advances happen in this case.
link |
00:23:18.080
You mentioned customers there quite a bit,
link |
00:23:20.360
and there's a skill, what is the experience
link |
00:23:24.120
for the user that's helping?
link |
00:23:26.560
So just to clarify, this isn't, as far as I understand,
link |
00:23:30.120
the Alexa, so this skill is a standalone
link |
00:23:32.560
for the Alexa prize, I mean it's focused
link |
00:23:34.240
on the Alexa prize, it's not you ordering certain things
link |
00:23:37.320
on Amazon.com or checking the weather
link |
00:23:39.280
or playing Spotify, right, it's a separate skill.
link |
00:23:42.080
And so you're focused on helping that,
link |
00:23:45.680
I don't know, how do customers think of it?
link |
00:23:48.560
Are they having fun?
link |
00:23:49.840
Are they helping teach the system?
link |
00:23:52.080
What's the experience like?
link |
00:23:53.080
I think it's both, actually,
link |
00:23:54.680
and let me tell you how you invoke this skill.
link |
00:23:57.840
So all you have to say is, Alexa, let's chat.
link |
00:24:00.240
And then the first time you say, Alexa, let's chat,
link |
00:24:03.360
it comes back with a clear message
link |
00:24:04.720
that you're interacting with one of those
link |
00:24:06.280
university social bots, and there's a clear,
link |
00:24:09.320
so you know exactly how you interact, right?
link |
00:24:11.840
And that is why it's very transparent.
link |
00:24:14.080
You are being asked to help, right?
link |
00:24:16.280
And we have a lot of mechanisms where,
link |
00:24:20.960
when we are in the first phase, the feedback phase,
link |
00:24:23.680
we send a lot of emails to our customers,
link |
00:24:26.720
and then they know that the team needs a lot of interactions
link |
00:24:31.720
to improve the accuracy of the system.
link |
00:24:33.920
So we know we have a lot of customers
link |
00:24:35.880
who really want to help these university bots,
link |
00:24:38.920
and they're conversing with that.
link |
00:24:40.400
And some are just having fun with just saying,
link |
00:24:42.680
Alexa, let's chat.
link |
00:24:44.000
And also some adversarial behavior to see
link |
00:24:47.320
how much do you understand as a social bot?
link |
00:24:50.240
So I think we have a good, healthy mix
link |
00:24:52.280
of all three situations.
link |
00:24:53.920
So what is the, if we talk about solving the Alexa challenge,
link |
00:24:58.040
the Alexa prize, what does the data set
link |
00:25:04.040
of really engaging, pleasant conversations look like?
link |
00:25:07.520
Because if we think of this
link |
00:25:08.360
as a supervised learning problem,
link |
00:25:10.600
I don't know if it has to be,
link |
00:25:12.200
but if it does, maybe you can comment on that.
link |
00:25:15.400
Do you think there needs to be a data set
link |
00:25:17.480
of what it means to be an engaging,
link |
00:25:21.160
successful, fulfilling conversation?
link |
00:25:22.640
I think that's part of the research question here.
link |
00:25:24.800
This was, I think, we at least got the first part right,
link |
00:25:29.240
which is have a way for universities to build and test
link |
00:25:34.280
in a real world setting.
link |
00:25:35.840
Now you're asking in terms of the next phase of questions,
link |
00:25:38.640
which we are still, we're also asking, by the way,
link |
00:25:41.120
what does success look like from an optimization function standpoint?
link |
00:25:45.440
That's what you're asking.
link |
00:25:46.280
In terms of, we as researchers are used
link |
00:25:48.400
to having a great corpus of annotated data
link |
00:25:51.360
and then sort of tuning our algorithms
link |
00:25:56.280
on those, right?
link |
00:25:57.640
And fortunately and unfortunately,
link |
00:26:00.680
in this world of Alexa prize,
link |
00:26:02.960
that is not the way we are going after it.
link |
00:26:05.440
So you have to focus more on learning
link |
00:26:07.760
based on live feedback.
link |
00:26:10.960
That is another element that's unique,
link |
00:26:13.000
where just now I started with giving you how you ingress
link |
00:26:17.320
and experience this capability as a customer.
link |
00:26:21.560
What happens when you're done?
link |
00:26:23.640
So they ask you a simple question on a scale of one to five,
link |
00:26:27.560
how likely are you to interact with this social bot again?
link |
00:26:31.920
That is a good feedback
link |
00:26:33.880
and customers can also leave more open ended feedback.
link |
00:26:37.480
And I think partly that to me is one part of the question
link |
00:26:42.160
you're asking, which I'm saying is a mental model shift
link |
00:26:44.640
that as researchers also, you have to change your mindset
link |
00:26:48.600
that this is not a DARPA evaluation or NSF funded study
link |
00:26:52.720
and you have a nice corpus.
link |
00:26:55.000
This is where it's real world.
link |
00:26:57.000
You have real data.
link |
00:26:58.760
The scale is amazing and that's a beautiful thing.
link |
00:27:01.600
And then the customer, the user can quit the conversation
link |
00:27:05.800
at any time.
link |
00:27:06.640
Exactly, the user can.
link |
00:27:07.480
That is also a signal for how good you were at that point.
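The two live signals he mentions, the explicit one-to-five rating and the implicit quit-the-conversation signal, can be folded into a single per-bot score. This is a minimal sketch under assumptions of my own (the 0-1 scaling, the early-quit penalty, the 20-turn cap are all invented), not Amazon's actual scoring:

```python
# Hypothetical sketch of turning live feedback into a per-bot score:
# explicit 1-5 ratings plus an implicit penalty when users quit early.
# Weights and thresholds are invented for illustration.

def session_reward(rating=None, turns=0, quit_early=False):
    """Map one conversation's signals onto a 0-1 reward."""
    if rating is not None:
        reward = (rating - 1) / 4.0          # scale 1..5 onto 0..1
    else:
        reward = min(turns, 20) / 20.0       # fall back on engagement length
    if quit_early:
        reward *= 0.5                        # early exit halves the credit
    return reward

def bot_score(sessions):
    """Mean reward across a bot's sessions."""
    rewards = [session_reward(**s) for s in sessions]
    return sum(rewards) / len(rewards) if rewards else 0.0
```

The point of the sketch is the mindset shift he describes: instead of tuning against a fixed annotated corpus, the objective is assembled from noisy signals users emit while actually using the system.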
link |
00:27:11.760
So, and then on a scale of one to five, one to three,
link |
00:27:15.000
do they say how likely are you or is it just a binary?
link |
00:27:17.720
Oh, one to five?
link |
00:27:18.720
One to five.
link |
00:27:20.000
Wow, okay.
link |
00:27:20.840
That's such a beautifully constructed challenge.
link |
00:27:22.680
Okay.
link |
00:27:24.800
You said the only way to make a smart assistant really smart
link |
00:27:30.040
is to give it eyes and let it explore the world.
link |
00:27:34.560
I'm not sure you might have been taken out of context,
link |
00:27:36.840
but can you comment on that?
link |
00:27:38.240
Can you elaborate on that idea?
link |
00:27:40.080
Is that I personally also find that idea super exciting
link |
00:27:43.120
from a social robotics, personal robotics perspective?
link |
00:27:46.240
Yeah, a lot of things do get taken out of context.
link |
00:27:48.400
My, this particular one was just as philosophical discussion
link |
00:27:52.040
we were having on terms of what does intelligence look like?
link |
00:27:55.560
And the context was in terms of learning,
link |
00:27:59.200
I think just we said we as humans are empowered
link |
00:28:03.040
with many different sensory abilities.
link |
00:28:05.480
I do believe that eyes are an important aspect of it
link |
00:28:09.560
in terms of if you think about how we as humans learn,
link |
00:28:14.640
it is quite complex and it's also not unimodal
link |
00:28:18.360
that you are fed a ton of text or audio
link |
00:28:22.040
and you just learn that way.
link |
00:28:23.360
No, you learn by experience, you learn by seeing,
link |
00:28:27.240
you're taught by humans and we are very efficient
link |
00:28:31.960
in how we learn.
link |
00:28:33.240
Machines on the contrary are very inefficient
link |
00:28:35.320
on how they learn, especially these AIs.
link |
00:28:38.480
I think the next wave of research
link |
00:28:40.800
is going to be with less data,
link |
00:28:44.360
not just with less human labeled data,
link |
00:28:48.240
but also with a lot of weak supervision
link |
00:28:51.080
and where you can increase the learning rate.
link |
00:28:55.160
I don't mean less data in terms of not having
link |
00:28:57.280
a lot of data to learn from that.
link |
00:28:59.000
We are generating so much data
link |
00:29:00.360
but it is more about the aspect of how fast you can learn.
link |
00:29:04.920
So improving the quality of the data
link |
00:29:07.040
that's the quality of data and the learning process.
link |
00:29:09.960
I think more on the learning process.
link |
00:29:11.480
I think we have to, we as humans learn
link |
00:29:13.600
with a lot of noisy data, right?
link |
00:29:15.760
And I think that's the part that I don't think should change.
link |
00:29:21.480
What should change is how we learn, right?
link |
00:29:23.920
So if you look at, you mentioned supervised learning,
link |
00:29:26.120
we are making transformative shifts
link |
00:29:28.000
toward more unsupervised, more weak supervision.
link |
00:29:31.200
Those are the key aspects of how to learn.
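Weak supervision, which he names as one of those key shifts, replaces per-example human annotation with several noisy heuristic labelers that vote. The heuristics below are invented toy examples for a sentiment-style task; this is a sketch of the general idea, not any specific Alexa pipeline:

```python
# A minimal sketch of weak supervision: several noisy labeling
# heuristics vote on each example instead of a human annotator.
# The heuristics below are invented for illustration.

def lf_exclamation(text):          # excited punctuation -> positive
    return 1 if "!" in text else None

def lf_negative_words(text):       # complaint vocabulary -> negative
    return 0 if any(w in text.lower() for w in ("bad", "boring")) else None

def lf_thanks(text):               # gratitude -> positive
    return 1 if "thank" in text.lower() else None

def weak_label(text, lfs=(lf_exclamation, lf_negative_words, lf_thanks)):
    """Majority vote over labeling functions; None means abstain."""
    votes = [v for lf in lfs if (v := lf(text)) is not None]
    if not votes:
        return None                # no heuristic fired, leave unlabeled
    return 1 if sum(votes) > len(votes) / 2 else 0
```

Each heuristic is individually noisy, but their agreement yields training labels cheaply, which is how "learning with a lot of noisy data" scales without a hand-annotated corpus.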
link |
00:29:34.880
And I think in that setting, I hope you agree with me
link |
00:29:37.800
that having other senses is very crucial
link |
00:29:41.680
in terms of how you learn.
link |
00:29:43.480
So absolutely, and from a machine learning perspective
link |
00:29:46.680
which I hope we get a chance to talk to a few aspects
link |
00:29:49.680
that are fascinating there, but to stick on the point
link |
00:29:52.480
of sort of a body, you know, an embodiment.
link |
00:29:56.280
So Alexa has a body.
link |
00:29:57.520
It has a very minimalistic, beautiful interface
link |
00:30:01.600
where there's a ring and so on.
link |
00:30:02.840
I mean, I'm not sure of all the flavors of the devices
link |
00:30:06.520
that Alexa lives on, but there's a minimalistic,
link |
00:30:09.560
basic interface.
link |
00:30:13.320
And nevertheless, we humans, so I have a room,
link |
00:30:15.560
but I have all kinds of robots all over everywhere.
link |
00:30:18.280
So what do you think the Alexa of the future looks like
link |
00:30:24.720
if it begins to shift what its body looks like?
link |
00:30:28.520
What, maybe beyond the Alexa, what do you think
link |
00:30:31.200
are the different devices in the home
link |
00:30:33.760
as they start to embody their intelligence more and more?
link |
00:30:36.920
What do you think that looks like?
link |
00:30:38.120
Philosophically, a future, what do you think that looks like?
link |
00:30:41.200
I think let's look at what's happening today.
link |
00:30:43.600
You mentioned, I think, other devices as in Amazon devices,
link |
00:30:46.840
but I also wanted to point out Alexa
link |
00:30:48.640
is already integrated in a lot of third party devices
link |
00:30:51.360
which also come in lots of forms and shapes.
link |
00:30:54.840
Some in robots, right?
link |
00:30:56.360
Some in microwaves, some in appliances
link |
00:31:00.280
that you use in everyday life.
link |
00:31:02.600
So I think it's not just the shape Alexa takes
link |
00:31:07.720
in terms of form factors,
link |
00:31:09.200
but it's also where all it's available.
link |
00:31:13.000
And it's getting in cars,
link |
00:31:14.240
it's getting in different appliances in homes,
link |
00:31:16.720
even toothbrushes, right?
link |
00:31:18.720
So I think you have to think about it
link |
00:31:20.760
as not a physical assistant.
link |
00:31:25.440
It will be in some embodiment as you said,
link |
00:31:28.480
we already have these nice devices.
link |
00:31:31.120
But I think it's also important to think of it
link |
00:31:33.760
as a virtual assistant.
link |
00:31:35.640
It is superhuman in the sense
link |
00:31:37.200
that it is in multiple places at the same time.
link |
00:31:40.280
So I think the actual embodiment in some sense
link |
00:31:45.200
to me doesn't matter.
link |
00:31:47.600
I think you have to think of it as not as human like
link |
00:31:52.800
and more of what its capabilities are
link |
00:31:56.080
that derive a lot of benefit for customers
link |
00:31:58.800
and how there are different ways to delight
link |
00:32:00.680
customers with different experiences.
link |
00:32:03.960
And I think I'm a big fan of it
link |
00:32:06.680
not being just human like,
link |
00:32:09.240
it should be human like in certain situations.
link |
00:32:11.120
Alexa Prize social bot, in terms of conversation,
link |
00:32:13.360
is a great way to look at it,
link |
00:32:14.920
but there are other scenarios where human like
link |
00:32:18.800
I think is underselling the abilities of this AI.
link |
00:32:22.080
So if I could trivialize what we're talking about.
link |
00:32:26.120
So if you look at the way Steve Jobs thought
link |
00:32:29.400
about the interaction with the device
link |
00:32:31.400
that Apple produced,
link |
00:32:33.440
there was a extreme focus on controlling the experience
link |
00:32:36.760
by making sure there's only these Apple produced devices.
link |
00:32:40.200
You see the voice of Alexa
link |
00:32:44.040
taking all kinds of forms
link |
00:32:45.600
depending on what the customers want.
link |
00:32:47.080
And that means it could be anywhere
link |
00:32:49.840
from the microwave to vacuum cleaner to the home
link |
00:32:53.760
and so on, with the voice as the essential element
link |
00:32:56.920
of the interaction.
link |
00:32:57.760
I think voice is an essence.
link |
00:32:59.760
It's not all, but it's a key aspect.
link |
00:33:02.160
I think to your question in terms of
link |
00:33:05.640
you should be able to recognize Alexa.
link |
00:33:08.200
And that's a huge problem.
link |
00:33:09.920
I think in terms of a huge scientific problem,
link |
00:33:12.000
I should say like what are the traits?
link |
00:33:13.720
What makes it look like Alexa,
link |
00:33:16.120
especially in different settings
link |
00:33:17.520
and especially if it's primarily voice.
link |
00:33:20.360
But Alexa is not just voice either, right?
link |
00:33:22.200
I mean, we have devices with a screen.
link |
00:33:25.000
Now you're seeing just other behaviors of Alexa.
link |
00:33:28.480
So I think we're in very early stages of what that means.
link |
00:33:31.360
And this will be an important topic for the following years.
link |
00:33:34.920
But I do believe that being able to recognize
link |
00:33:38.200
and tell when it's Alexa versus it's not
link |
00:33:40.480
is going to be important from an Alexa perspective.
link |
00:33:43.360
I'm not speaking for the entire AI community,
link |
00:33:46.000
but I think attribution.
link |
00:33:49.440
And as we go into more of understanding
link |
00:33:53.600
who did what, that identity of the AI
link |
00:33:56.840
is crucial in the coming world.
link |
00:33:58.720
I think from the broad AI community perspective,
link |
00:34:01.040
that's also a fascinating problem.
link |
00:34:02.840
So basically if I close my eyes and listen to the voice,
link |
00:34:06.200
what would it take for me to recognize that this is Alexa?
link |
00:34:08.760
Exactly.
link |
00:34:09.600
Or at least the Alexa that I've come to know
link |
00:34:11.320
from my personal experience in my home
link |
00:34:13.720
through my interactions that come.
link |
00:34:15.040
Yeah.
link |
00:34:15.880
And the Alexa here in the US is very different.
link |
00:34:17.600
The Alexa in UK and the Alexa in India,
link |
00:34:20.160
even though they are all speaking English
link |
00:34:22.200
or the Australian version.
link |
00:34:23.960
So again, now think about
link |
00:34:26.600
when you go into a different culture,
link |
00:34:28.320
a different community when you travel there,
link |
00:34:30.640
how do you recognize Alexa?
link |
00:34:32.440
I think these are super hard questions actually.
link |
00:34:34.840
So there's a team that works on personality.
link |
00:34:37.480
So if we talk about those different flavors
link |
00:34:40.040
of what it means, culturally speaking,
link |
00:34:41.720
India, UK, US, what does it mean to adapt?
link |
00:34:45.280
So the problem that we just stated is just fascinating.
link |
00:34:48.400
How do we make it purely recognizable that it's Alexa?
link |
00:34:53.640
Assuming that the qualities of the voice are not sufficient.
link |
00:34:58.560
It's also the content of what is being said.
link |
00:35:01.560
How do we do that?
link |
00:35:02.680
How does the personality keep on coming to play?
link |
00:35:04.840
What would that research look like?
link |
00:35:07.320
I mean, it's such a fascinating.
link |
00:35:08.640
We have some very fascinating folks who,
link |
00:35:11.840
from both the UX background and human factors,
link |
00:35:14.160
are looking at these aspects and these exact questions.
link |
00:35:17.480
But I'll definitely say it's not just how it sounds,
link |
00:35:21.640
the choice of words, the tone,
link |
00:35:24.440
not just, I mean, the voice identity of it,
link |
00:35:26.840
but the tone matters, the speed matters,
link |
00:35:30.120
how you speak, how you enunciate words,
link |
00:35:32.600
what choice of words are you using?
link |
00:35:36.320
How terse are you or how lengthy in your explanations you are?
link |
00:35:40.720
All of these are factors.
link |
00:35:42.920
And you also mentioned something crucial
link |
00:35:45.440
that you may have personalized Alexa to some extent
link |
00:35:50.240
in your homes or in the devices you are interacting with.
link |
00:35:53.400
So you, as an individual,
link |
00:35:57.560
how you prefer Alexa sounds
link |
00:35:59.200
can be different than how I prefer.
link |
00:36:01.200
And the amount of customizability you want to give
link |
00:36:04.400
is also a key debate we always have.
link |
00:36:07.600
But I do want to point out it's more than the voice actor
link |
00:36:10.680
that recorded and it sounds like that actor.
link |
00:36:13.960
It is more about the choices of words,
link |
00:36:16.880
the attributes of tonality,
link |
00:36:18.960
the volume in terms of how you raise your pitch
link |
00:36:21.400
and so forth, all of that matters.
link |
00:36:23.840
This is such a fascinating problem
link |
00:36:25.400
from a product perspective.
link |
00:36:27.560
I could see those debates just happening
link |
00:36:29.440
inside of the Alexa team
link |
00:36:31.080
of how much personalization do you do
link |
00:36:32.800
for the specific customer?
link |
00:36:34.400
Because you're taking a risk if you over personalize.
link |
00:36:38.200
Because you don't,
link |
00:36:40.320
if you create a personality for a million people,
link |
00:36:44.400
you can test that better.
link |
00:36:46.040
You can create a rich, fulfilling experience
link |
00:36:48.600
that will do well.
link |
00:36:50.040
But the more you personalize it,
link |
00:36:52.280
the less you can test it,
link |
00:36:53.480
the less you can know that it's a great experience.
link |
00:36:56.320
So how much personalization, what's the right balance?
link |
00:36:59.720
I think the right balance depends on the customer.
link |
00:37:01.600
Give them the control.
link |
00:37:02.800
So I'll say, I think the more control you give customers,
link |
00:37:07.400
the better it is for everyone.
link |
00:37:09.600
And I'll give you some key personalization features.
link |
00:37:13.840
I think we have a feature called remember this,
link |
00:37:15.840
which is where you can tell Alexa to remember something.
link |
00:37:19.440
There you have an explicit sort of control
link |
00:37:23.080
in customer's hand
link |
00:37:23.920
because they have to say Alexa, remember X, Y, Z.
link |
00:37:26.520
What kind of things would that be used for?
link |
00:37:28.000
So you can like use it.
link |
00:37:30.360
I have stored my tire specs for my car
link |
00:37:33.240
because it's so hard to go and find and see what it is
link |
00:37:36.560
right when you're having some issues.
link |
00:37:39.040
I store my mileage plan numbers
link |
00:37:41.400
for all the frequent flyer ones
link |
00:37:43.080
where I'm sometimes just looking at it and it's not handy.
link |
00:37:46.480
So those are my own personal choices I've made
link |
00:37:49.920
for Alexa to remember something on my behalf.
link |
00:37:52.280
So again, I think the choice was to be explicit
link |
00:37:56.000
about how you provide that to a customer as a control.
link |
00:38:00.000
So I think these are the aspects of what you do.
link |
00:38:03.440
Like think about where we can use
link |
00:38:06.320
speaker recognition capabilities, where
link |
00:38:08.600
if you taught Alexa that you are Lex
link |
00:38:12.960
and this person in your household is person two,
link |
00:38:16.320
then you can personalize the experiences.
link |
00:38:17.920
Again, the CX, the customer experience patterns,
link |
00:38:22.840
are very clear and transparent about
link |
00:38:26.520
when a personalization action is happening.
link |
00:38:30.040
And then you have other ways
link |
00:38:31.400
like you go through explicit control right now
link |
00:38:34.080
through your app: with multiple service providers,
link |
00:38:36.920
let's say for music, which one is your preferred one?
link |
00:38:39.520
So when you say, play Sting,
link |
00:38:41.360
depending on whether you have preferred Spotify
link |
00:38:43.840
or Amazon music or Apple music
link |
00:38:45.760
that the decision is made where to play it from.
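The preferred-provider decision he describes reduces to an explicit per-user setting consulted at request time. A minimal sketch, assuming invented provider names, a default of my choosing, and a hypothetical routing shape:

```python
# Hypothetical sketch of preferred-provider routing for a "play" request:
# an explicit per-user setting decides which music service handles it.
# Provider names, the default, and the dict shape are assumptions.

DEFAULT_PROVIDER = "amazon_music"
preferences = {"lex": "spotify"}             # set via the companion app

def route_play(user, artist):
    """Pick the user's preferred service, falling back to the default."""
    provider = preferences.get(user, DEFAULT_PROVIDER)
    return {"provider": provider, "action": "play", "artist": artist}
```

The design point matches what he says about control: the ambiguity is resolved by a setting the customer explicitly chose, not by the system guessing.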
link |
00:38:49.520
So what's Alexa's backstory from her perspective?
link |
00:38:52.440
Is there, I remember just asking as probably
link |
00:38:57.880
a lot of us are just the basic questions
link |
00:39:00.000
about love and so on of Alexa,
link |
00:39:02.440
just to see what the answer would be.
link |
00:39:03.880
Just it feels like there's a little bit of a back,
link |
00:39:07.760
like there's a,
link |
00:39:08.600
this feels like there's a little bit of personality
link |
00:39:10.360
but not too much.
link |
00:39:12.880
Does Alexa have a metaphysical presence
link |
00:39:18.400
in this human universe we live in,
link |
00:39:21.920
or is it something more ambiguous?
link |
00:39:23.760
Is there a past?
link |
00:39:25.120
Is there a birth?
link |
00:39:26.280
Is there a family kind of idea
link |
00:39:28.960
even for joking purposes and so on?
link |
00:39:31.160
I think, well, it does tell you,
link |
00:39:33.480
I should double check this,
link |
00:39:35.760
but if you said, when were you born?
link |
00:39:37.160
I think we do respond.
link |
00:39:39.000
I need to double check that,
link |
00:39:40.120
but I'm pretty positive about it.
link |
00:39:41.480
I think it does,
link |
00:39:42.320
because I think I've tested that.
link |
00:39:44.000
But that's like a, that's like how,
link |
00:39:46.760
like I was born in your brand of champagne
link |
00:39:49.120
and whatever the year kind of thing.
link |
00:39:50.960
Yeah.
link |
00:39:51.800
So on terms of the metaphysical,
link |
00:39:53.440
I think it's early,
link |
00:39:55.760
does it have the historic knowledge about herself
link |
00:40:00.400
to be able to do that?
link |
00:40:01.480
Maybe, have we crossed that boundary?
link |
00:40:03.760
Not yet, right?
link |
00:40:04.600
In terms of being, thank you.
link |
00:40:06.560
Have we thought about it quite a bit,
link |
00:40:08.640
but I wouldn't say that we have come to a clear decision
link |
00:40:11.520
in terms of what it should look like.
link |
00:40:13.040
But you can imagine though,
link |
00:40:15.880
and I bring this back to the Alexa Prize social bot one,
link |
00:40:19.240
there you will start seeing some of that.
link |
00:40:21.200
Like you, these bots have their identity.
link |
00:40:23.480
And in terms of that,
link |
00:40:24.680
you may find, you know,
link |
00:40:26.840
this is such a great research topic
link |
00:40:28.440
that some academia team may think of these problems
link |
00:40:32.160
and start solving them too.
link |
00:40:35.120
So let me ask a question.
link |
00:40:38.920
It's kind of difficult, I think,
link |
00:40:41.160
but it feels fascinating to me
link |
00:40:43.320
because I'm fascinated with psychology.
link |
00:40:45.360
It feels that the more personality you have,
link |
00:40:48.200
the more dangerous it is.
link |
00:40:50.400
In terms of a customer perspective, a product,
link |
00:40:54.440
if you want to create a product that's useful.
link |
00:40:57.080
By dangerous, I mean creating an experience that upsets me.
link |
00:41:02.320
And so, how do you get that right?
link |
00:41:06.680
Because if you look at the relationships,
link |
00:41:10.040
maybe I'm just a screwed up Russian,
link |
00:41:11.800
but if you look at the human relationship,
link |
00:41:15.040
some of our deepest relationships have fights,
link |
00:41:18.120
have tension, have the push and pull,
link |
00:41:21.200
have a little flavor in them.
link |
00:41:24.160
Do you want to have such flavor
link |
00:41:26.200
in an interaction with Alexa?
link |
00:41:28.080
How do you think about that?
link |
00:41:29.440
So there's one other common thing that you didn't say,
link |
00:41:32.440
but we think of it as paramount for any deep relationship.
link |
00:41:36.200
That's trust.
link |
00:41:37.760
Trust, yeah.
link |
00:41:38.600
So I think if you trust every attribute you said,
link |
00:41:42.120
a fight, some tension is all healthy.
link |
00:41:46.000
But what is sort of unnegotiable in this instance is trust.
link |
00:41:51.400
And I think the bar to earn customer trust for AI
link |
00:41:54.400
is very high, in some sense, more than a human.
link |
00:41:57.960
It's not just about personal information or your data,
link |
00:42:03.520
it's also about your actions on a daily basis.
link |
00:42:06.560
How trustworthy are you in terms of consistency,
link |
00:42:09.360
in terms of how accurate are you in understanding me?
link |
00:42:12.600
Like if you're talking to a person on the phone,
link |
00:42:15.120
if you have a problem with your,
link |
00:42:16.360
let's say your internet or something,
link |
00:42:17.760
if the person's not understanding,
link |
00:42:19.160
you lose trust right away.
link |
00:42:20.520
You don't want to talk to that person.
link |
00:42:22.560
That whole example gets amplified by a factor of 10
link |
00:42:25.920
because when you're a human interacting with an AI,
link |
00:42:29.760
you have a certain expectation.
link |
00:42:31.240
Either you expect it to be very intelligent
link |
00:42:33.560
and then you get upset, why is it behaving this way?
link |
00:42:35.760
Or you expect it to be not so intelligent
link |
00:42:39.080
and when it surprises you, you're like,
link |
00:42:40.320
really, you're trying to be too smart.
link |
00:42:42.480
So I think we grapple with these hard questions as well,
link |
00:42:45.240
but I think the key is actions need to be trustworthy
link |
00:42:49.120
from these AIs, not just about data protection,
link |
00:42:52.160
your personal information protection,
link |
00:42:54.720
but also from how accurately it accomplishes
link |
00:42:58.560
all commands or all interactions.
link |
00:43:01.080
Well, it's tough to hear because trust,
link |
00:43:03.560
you're absolutely right,
link |
00:43:04.480
but trust is such a high bar with AI systems
link |
00:43:06.920
because people, and I see this
link |
00:43:08.760
because I work with autonomous vehicles,
link |
00:43:10.200
I mean, the bar that's placed on AI systems
link |
00:43:13.000
is unreasonably high.
link |
00:43:14.800
Yeah, that is going to be, I agree with you.
link |
00:43:17.400
And I think of it as, it's a challenge
link |
00:43:21.280
and it also keeps my job, right?
link |
00:43:24.240
So from that perspective, I totally,
link |
00:43:27.520
I think of it at both sides as a customer and as a researcher.
link |
00:43:31.320
I think as a researcher,
link |
00:43:32.840
yes, occasionally it will frustrate me
link |
00:43:34.760
that why is the bar so high for these AIs?
link |
00:43:38.080
And as a customer, then I say absolutely
link |
00:43:40.400
it has to be that high, right?
link |
00:43:42.080
So I think that's the trade off we have to balance,
link |
00:43:45.200
but doesn't change the fundamentals
link |
00:43:47.760
that trust has to be earned.
link |
00:43:49.560
And the question then becomes is,
link |
00:43:52.120
are we holding the AIs to a different bar
link |
00:43:54.200
in accuracy and mistakes than we hold humans?
link |
00:43:57.000
Those are going to be great societal questions
link |
00:43:59.000
for years to come, I think for us.
link |
00:44:01.080
Well, one of the questions that we grapple
link |
00:44:02.960
as a society now that I think about a lot,
link |
00:44:06.200
I think a lot of people in the AI think about a lot
link |
00:44:08.560
and Alexa is taking head on, is privacy.
link |
00:44:12.400
The reality is, us giving over data
link |
00:44:18.040
to any AI system can be used to enrich our lives
link |
00:44:23.360
in profound ways.
link |
00:44:25.840
So if basically any product that does anything awesome
link |
00:44:28.560
for you, the more data it has,
link |
00:44:31.720
the more awesome things it can do.
link |
00:44:34.080
And yet, at the other side,
link |
00:44:37.080
people imagine the worst case possible scenario
link |
00:44:39.480
of what can you possibly do with that data?
link |
00:44:42.240
People, it boils down to trust, as you said before.
link |
00:44:45.680
There's a fundamental distrust
link |
00:44:47.240
of in certain groups of governments and so on,
link |
00:44:50.440
depending on the government, depending on who's empowered,
link |
00:44:52.920
depending on all these kinds of factors.
link |
00:44:55.400
And so here's Alexa in the middle of all of it
link |
00:44:59.040
in the home trying to do good things for the customers.
link |
00:45:02.320
So how do you think about privacy in this context,
link |
00:45:05.000
the smart assistants in the home?
link |
00:45:06.720
How do you maintain, how do you earn trust?
link |
00:45:08.680
Absolutely, so as you said, trust is the key here.
link |
00:45:12.400
So you start with trust and then privacy
link |
00:45:15.360
is a key aspect of it.
link |
00:45:16.760
It has to be designed in from the very beginning.
link |
00:45:20.240
And we believe in two fundamental principles.
link |
00:45:23.920
One is transparency and second is control.
link |
00:45:26.840
So by transparency, I mean when we build
link |
00:45:30.720
what is now called smart speaker or the first echo.
link |
00:45:34.360
We were quite judicious about making these right tradeoffs
link |
00:45:38.400
on customers' behalf, so that it is pretty clear
link |
00:45:41.960
when the audio is being sent to cloud.
link |
00:45:44.200
The light ring comes on when it has heard you say
link |
00:45:46.520
the wake word, and then the streaming happens, right?
link |
00:45:49.760
So when the light ring comes up, we also had,
link |
00:45:52.200
we put a physical mute button on it,
link |
00:45:55.520
just so if you didn't want it to be listening,
link |
00:45:57.880
even for the wake word,
link |
00:45:58.720
then you turn the mute button on
link |
00:46:01.760
and that disables the microphones.
link |
00:46:04.920
That's just the first decision
link |
00:46:06.600
on essentially transparency and control.
link |
00:46:09.720
Beyond that, even when we launched,
link |
00:46:11.720
we gave the control in the hands of the customers
link |
00:46:13.800
that you can go and look at any of your individual utterances
link |
00:46:16.360
that is recorded and delete them anytime.
link |
00:46:19.560
And we have kept that promise, right?
link |
00:46:22.520
So that is, again, a great instance
link |
00:46:26.000
of showing how you have the control.
link |
00:46:29.080
Then we made it even easier.
link |
00:46:30.440
You can say, Alexa, delete what I said today.
link |
00:46:33.080
So that now puts even more control
link |
00:46:36.880
in your hands with what's most convenient
link |
00:46:39.360
about this technology, which is voice.
link |
00:46:42.000
You delete it with your voice now.
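The delete-by-day control he describes amounts to a timestamped utterance store with a purge operation. A minimal sketch, where the store shape and method names are invented for illustration:

```python
# Hypothetical sketch of the "delete what I said today" control:
# utterances are stored with dates and can be purged by day.
# The store and its API shape are invented for illustration.
from datetime import date

class UtteranceStore:
    def __init__(self):
        self._items = []                     # (day, text) pairs

    def record(self, text, day=None):
        self._items.append((day or date.today(), text))

    def delete_day(self, day=None):
        """The voice command 'delete what I said today' maps here."""
        day = day or date.today()
        before = len(self._items)
        self._items = [(d, t) for d, t in self._items if d != day]
        return before - len(self._items)     # how many were removed
```

The per-utterance deletion he mentioned earlier would be the same idea with a finer key; the design point is that the control is always in the customer's hands.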
link |
00:46:44.400
So these are the types of decisions we continually make.
link |
00:46:48.040
We just recently launched this feature called
link |
00:46:51.200
which we think of as: if you wanted humans
link |
00:46:53.680
not to review your data because you mentioned
link |
00:46:57.960
supervised learning, right?
link |
00:46:59.120
So in supervised learning,
link |
00:47:01.120
humans have to give some annotation.
link |
00:47:03.760
And that also is now a feature where you can,
link |
00:47:07.080
essentially, if you've selected that flag,
link |
00:47:09.280
your data will not be reviewed by a human.
link |
00:47:11.280
So these are the types of controls
link |
00:47:13.600
that we have to constantly offer to customers.
link |
00:47:18.400
So why do you think it bothers people so much
link |
00:47:22.760
that, so everything you just said is really powerful.
link |
00:47:26.840
So the control, the ability to delete,
link |
00:47:28.320
because we collect, we have studies here running at MIT
link |
00:47:31.080
that collects huge amounts of data
link |
00:47:32.720
and people consent and so on.
link |
00:47:34.800
The ability to delete that data is really empowering.
link |
00:47:38.000
And almost nobody ever asked to delete it,
link |
00:47:39.960
but the ability to have that control is really powerful.
link |
00:47:44.160
But still, there's these popular anecdotal evidence
link |
00:47:47.920
that people say they like to tell that
link |
00:47:50.920
them and a friend were talking about something,
link |
00:47:53.120
I don't know, sweaters for cats.
link |
00:47:56.080
And all of a sudden they'll have advertisements
link |
00:47:58.160
for cat sweaters on Amazon.
link |
00:48:00.960
There's that, that's a popular anecdote
link |
00:48:02.640
as if something is always listening.
link |
00:48:06.280
Can you explain that anecdote,
link |
00:48:07.760
that experience that people have?
link |
00:48:09.080
What's the psychology of that?
link |
00:48:10.960
What's that experience?
link |
00:48:13.040
And can you, you've answered it,
link |
00:48:15.040
but let me just ask, is Alexa listening?
link |
00:48:18.240
No, Alexa listens only for the wake word on the device, right?
link |
00:48:22.520
And the wake word is?
link |
00:48:23.880
The words like Alexa, Amazon, Echo,
link |
00:48:27.240
and you can only choose one at a time.
link |
00:48:29.600
So you choose one and it listens only
link |
00:48:31.600
for that on our devices.
link |
00:48:34.000
So that's first, from a listening perspective,
link |
00:48:36.440
you have to be very clear that it's just the wake word.
link |
00:48:38.360
So you said, why is there this anxiety, if you may?
link |
00:48:41.240
Yeah, exactly.
link |
00:48:42.080
It's because there's a lot of confusion
link |
00:48:43.560
what it really listens to, right?
link |
00:48:45.320
And I think it's partly on us to keep educating
link |
00:48:48.680
our customers and the general media more
link |
00:48:52.200
in terms of like what really happens
link |
00:48:54.040
and we've done a lot of it.
link |
00:48:56.560
And our information pages are clear,
link |
00:49:00.800
but still people have to have more,
link |
00:49:04.000
there's always a hunger for information and clarity.
link |
00:49:06.640
And we'll constantly look at how best to communicate.
link |
00:49:09.080
If you go back and read everything,
link |
00:49:10.520
yes, it states exactly that.
link |
00:49:12.240
And then people could still question it.
link |
00:49:14.840
And I think that's absolutely okay to question.
link |
00:49:17.440
What we have to make sure is that we are,
link |
00:49:21.160
because our fundamental philosophy is customer first,
link |
00:49:24.320
customer obsession is our leadership principle.
link |
00:49:26.720
If you put, as researchers,
link |
00:49:29.520
I put myself in the shoes of the customer
link |
00:49:32.680
and all decisions in Amazon are made with that in mind.
link |
00:49:36.520
And trust has to be earned
link |
00:49:37.520
and we have to keep earning the trust
link |
00:49:38.920
of our customers in this setting.
link |
00:49:41.320
And to your other point on like,
link |
00:49:44.040
is there something showing up
link |
00:49:45.520
based on your conversations?
link |
00:49:46.640
No, I think the answer is like you,
link |
00:49:49.600
a lot of times when those experiences happen,
link |
00:49:51.360
you have to also know that, okay,
link |
00:49:52.800
it may be a winter season,
link |
00:49:54.560
people are looking for sweaters, right?
link |
00:49:56.440
And it shows up on your Amazon.com
link |
00:49:58.520
because it is popular.
link |
00:49:59.640
So there are many of these,
link |
00:50:02.720
you mentioned that personality or personalization.
link |
00:50:06.320
Turns out we are not that unique either, right?
link |
00:50:09.120
So those things we, as humans, start thinking,
link |
00:50:12.040
oh, must be because something was heard
link |
00:50:14.120
and that's why this other thing showed up.
link |
00:50:16.680
The answer is no.
link |
00:50:17.720
Probably it is just the season for sweaters.
link |
00:50:21.520
I'm not gonna ask you this question
link |
00:50:23.800
because it's just, because you're also,
link |
00:50:25.840
because people have so much paranoia.
link |
00:50:27.160
But for my, let me just say, from my perspective,
link |
00:50:29.200
I hope there's a day when the customer
link |
00:50:31.760
can ask Alexa to listen all the time
link |
00:50:35.200
to improve the experience, to improve,
link |
00:50:37.360
because I personally don't see the negative
link |
00:50:40.800
because if you have the control
link |
00:50:42.160
and if you have the trust,
link |
00:50:43.920
there's no reason why I shouldn't be listening
link |
00:50:45.640
all the time to the conversations
link |
00:50:47.040
to learn more about you.
link |
00:50:48.320
Because ultimately, as long as you have control and trust,
link |
00:50:53.840
every data you provide to the device
link |
00:50:56.920
that the device wants is going to be useful.
link |
00:51:01.280
And so to me, as a machine learning person,
link |
00:51:05.080
I think it worries me how sensitive people are
link |
00:51:09.480
about their data relative to how empowering
link |
00:51:18.560
it could be for the devices around them,
link |
00:51:21.440
enriching it could be for their own life
link |
00:51:23.720
to improve the product.
link |
00:51:25.440
So it's something I think about sort of a lot,
link |
00:51:28.320
how do we make these devices better?
link |
00:51:29.520
Obviously Alexa thinks about it a lot as well.
link |
00:51:32.200
I don't know if you wanna comment on that.
link |
00:51:34.200
So have you seen, let me ask it in the form of a question.
link |
00:51:37.080
Okay, have you seen an evolution
link |
00:51:40.200
in the way people think about their private data
link |
00:51:44.200
in the previous several years?
link |
00:51:46.400
So as we as a society get more and more comfortable
link |
00:51:48.680
to the benefits we get by sharing more data.
link |
00:51:53.520
First, let me answer that part
link |
00:51:55.080
and then I'll wanna go back
link |
00:51:55.960
to the other aspect you were mentioning.
link |
00:51:58.480
So as a society, on a general,
link |
00:52:01.200
we are getting more comfortable as a society.
link |
00:52:03.120
Doesn't mean that everyone is
link |
00:52:05.840
and I think we have to respect that.
link |
00:52:07.400
I don't think one size fits all
link |
00:52:10.320
is always gonna be the answer for all, right?
link |
00:52:13.480
By definition.
link |
00:52:14.320
So I think that's something to keep in mind in these.
link |
00:52:17.160
Going back to your point on what more magical experiences
link |
00:52:22.800
can be launched in these kind of AI settings.
link |
00:52:26.040
I think again, if you give the control,
link |
00:52:29.960
certain parts of it are possible.
link |
00:52:32.080
So we have a feature called followup mode
link |
00:52:33.960
where, if you turn it on, Alexa,
link |
00:52:38.320
after you've spoken to it, will open the mics again,
link |
00:52:42.000
thinking you will answer something again.
link |
00:52:44.680
Like if you're adding items to your
link |
00:52:48.560
shopping list or to do list, you're not done.
link |
00:52:51.440
You want to keep going.
link |
00:52:52.280
So in that setting, it's awesome
link |
00:52:53.600
that it opens the mic for you to say eggs and milk
link |
00:52:56.200
and then bread, right?
link |
00:52:57.160
So these are the kind of things which you can empower.
link |
00:52:59.920
So, and then another feature we have
link |
00:53:02.320
which is called Alexa guard.
link |
00:53:04.960
I said it only listens for the wake word, all right?
link |
00:53:07.800
But let's say
link |
00:53:10.480
you leave your home and you want Alexa
link |
00:53:13.080
to listen for a couple of sound events
link |
00:53:15.040
like smoke alarm going off
link |
00:53:17.200
or someone breaking your glass, right?
link |
00:53:19.280
So it's like just to keep your peace of mind.
link |
00:53:22.160
So you can say Alexa on guard or I'm away
link |
00:53:25.880
or and then it can be listening for these sound events.
link |
00:53:29.240
And when you're home, you come out of that mode, right?
link |
00:53:33.040
So this is another one where you again gave controls
link |
00:53:35.560
in the hands of the user or the customer
link |
00:53:38.040
to enable some experience that is high utility
link |
00:53:42.440
and maybe even more delightful in the certain settings
link |
00:53:44.600
like follow up mode and so forth.
link |
00:53:46.600
And again, this general principle is the same,
link |
00:53:48.880
control in the hands of the customer.
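The mode gating behind a feature like Alexa Guard can be sketched roughly as follows. This is an illustrative toy, not Amazon's implementation: the class, the event names, the scores, and the threshold are all invented stand-ins for a real acoustic event classifier.

```python
# Hypothetical sketch: sound-event detectors (smoke alarm, glass break) only
# run while the user has explicitly enabled away mode -- the "control in the
# hands of the customer" principle. Scores stand in for a real classifier.

class GuardMode:
    def __init__(self, threshold=0.9):
        self.away = False
        self.threshold = threshold

    def set_away(self, away):
        self.away = away

    def check(self, event_scores):
        """event_scores: {event_name: classifier score}. Returns alerts."""
        if not self.away:
            return []  # when you're home, no sound-event detection at all
        return [e for e, s in event_scores.items() if s >= self.threshold]

guard = GuardMode()
scores = {"smoke_alarm": 0.97, "glass_break": 0.2}
assert guard.check(scores) == []             # home: nothing is detected
guard.set_away(True)
assert guard.check(scores) == ["smoke_alarm"]
```

The design choice to expose the mode switch, rather than always listening for events, mirrors the transparency-and-control tradeoffs discussed above.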
link |
00:53:52.640
So I know we kind of started with a lot of philosophy
link |
00:53:55.480
and a lot of interesting topics
link |
00:53:56.840
and we're just jumping all over the place.
link |
00:53:58.280
But really some of the fascinating things
link |
00:54:00.280
that the Alexa team and Amazon is doing
link |
00:54:03.000
is in the algorithm side, the data side,
link |
00:54:05.440
the technology, the deep learning, machine learning
link |
00:54:07.480
and so on.
link |
00:54:08.840
So can you give a brief history of Alexa
link |
00:54:13.000
from the perspective of just innovation,
link |
00:54:15.400
the algorithms, the data of how it was born,
link |
00:54:18.600
how it came to be, how it has grown, where it is today?
link |
00:54:22.240
Yeah, it starts with the, in Amazon,
link |
00:54:24.280
everything starts with the customer.
link |
00:54:26.960
And we have a process called working backwards.
link |
00:54:30.280
Alexa, and more specifically than the product Echo,
link |
00:54:35.000
there was a working backwards document essentially
link |
00:54:37.280
that reflected what it would be,
link |
00:54:38.840
started with a very simple vision statement, for instance,
link |
00:54:44.240
that morphed into a full fledged document
link |
00:54:47.120
along the way it changed into what all it can do, right?
link |
00:54:51.040
You can, but the inspiration was the Star Trek computer.
link |
00:54:54.120
So when you think of it that way,
link |
00:54:56.080
everything is possible, but when you launch a product,
link |
00:54:58.240
you have to start with someplace.
link |
00:55:00.920
And when I joined, the product was already in conception
link |
00:55:05.400
and we started working on the far field speech recognition
link |
00:55:08.800
because that was the first thing to solve.
link |
00:55:10.800
By that we mean that you should be able to speak
link |
00:55:12.760
to the device from a distance.
link |
00:55:15.160
And in those days, that wasn't a common practice.
link |
00:55:18.720
And even in the previous research world I was in,
link |
00:55:22.240
it was considered an unsolvable problem then
link |
00:55:24.520
in terms of whether you can converse from a distance.
link |
00:55:28.280
And here I'm still talking about the first part
link |
00:55:30.280
of the problem where you say,
link |
00:55:32.360
get the attention of the device,
link |
00:55:34.000
as in by saying what we call the wake word,
link |
00:55:37.040
which means the word Alexa has to be detected
link |
00:55:40.320
with a very high accuracy because it is a very common word.
link |
00:55:44.800
It has sound units that map with words like I like you
link |
00:55:48.200
or Alec, Alex, right?
link |
00:55:51.080
So it's an undoubtedly hard problem to detect
link |
00:55:56.240
the right mentions of Alexa addressed to the device
link |
00:56:00.560
versus I like Alexa.
link |
00:56:02.880
So you have to pick up that signal
link |
00:56:04.280
when there's a lot of noise.
link |
00:56:06.120
Not only noise, but a lot of conversation in the house, right?
link |
00:56:09.360
You remember on the device,
link |
00:56:10.360
you're simply listening for the wake word Alexa.
link |
00:56:13.240
And there's a lot of words being spoken in the house.
link |
00:56:15.840
How do you know it's Alexa?
link |
00:56:18.120
And directed at Alexa.
link |
00:56:21.800
Because I could say, I love my Alexa.
link |
00:56:23.680
I hate my Alexa.
link |
00:56:25.400
I want Alexa to do this.
link |
00:56:27.080
And in all these three sentences I said Alexa,
link |
00:56:29.360
I didn't want it to wake up.
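One common shape for this kind of wake-word detector — a sketch of the general keyword-spotting recipe, not necessarily Alexa's — is a per-frame wake-word posterior from a small acoustic model, smoothed over a sliding window, then thresholded, so that a single passing mention does not fire but a sustained, confident detection does. The window size, threshold, and scores below are invented for illustration.

```python
# Hypothetical keyword-spotting sketch: smooth per-frame posteriors over a
# trailing window, then fire only if the smoothed score crosses a threshold.

def smoothed_scores(frame_posteriors, window=30):
    """Average each frame's wake-word posterior over the trailing window."""
    out = []
    for i in range(len(frame_posteriors)):
        lo = max(0, i - window + 1)
        chunk = frame_posteriors[lo:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def detect_wake_word(frame_posteriors, threshold=0.8, window=30):
    """Fire only if the smoothed posterior crosses the threshold."""
    return any(s >= threshold for s in smoothed_scores(frame_posteriors, window))

# A sustained burst of confident frames fires; a single spurious high
# frame (as when "Alexa" flies by mid-sentence) does not.
clean = [0.1] * 50 + [0.95] * 40 + [0.1] * 50
spurious = [0.1] * 60 + [0.95] + [0.1] * 60
assert detect_wake_word(clean)
assert not detect_wake_word(spurious)
```

The smoothing is what buys robustness: a real detector still has to separate "Alexa, play music" from "I like Alexa", which needs more than frame scores, but the threshold-over-window structure is the basic gate.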
link |
00:56:31.400
So can I just pause on that second?
link |
00:56:33.800
What would be your advice, that I should probably
link |
00:56:36.760
in the introduction of this conversation give to people
link |
00:56:40.000
in terms of turning off their Alexa device,
link |
00:56:43.560
if they're listening to this podcast conversation out loud?
link |
00:56:49.360
Like what's the probability that an Alexa device
link |
00:56:51.720
will go off because we mentioned Alexa
link |
00:56:53.440
like a million times.
link |
00:56:55.240
So it will, we have done a lot of different things
link |
00:56:58.200
where we can figure out that there is the device,
link |
00:57:03.800
the speech is coming from a human versus over the air.
link |
00:57:08.280
Also, I mean, think about ads,
link |
00:57:11.800
so we also launched a technology
link |
00:57:14.280
for watermarking kind of approaches
link |
00:57:16.320
in terms of filtering it out.
link |
00:57:18.840
But yes, if this kind of a podcast is happening,
link |
00:57:21.640
it's possible your device will wake up a few times, right?
link |
00:57:24.400
It's an unsolved problem, but it is definitely
link |
00:57:28.840
something we care very much about.
link |
00:57:31.080
But the idea is you want to detect Alexa.
link |
00:57:33.920
Meant for the device.
link |
00:57:35.720
I mean, first of all, just even hearing Alexa
link |
00:57:37.600
versus I like something, I mean, that's a fascinating part.
link |
00:57:41.080
So that was the first real leap.
link |
00:57:43.120
That's the first.
link |
00:57:43.960
The world's best detector of Alexa.
link |
00:57:46.040
Yeah, the world's best wake word detector
link |
00:57:48.760
in a far field setting, not like something
link |
00:57:51.120
where the phone is sitting on the table.
link |
00:57:53.880
This is like people have devices 40 feet away,
link |
00:57:56.720
like in my house or 20 feet away
link |
00:57:58.400
and you still get an answer.
link |
00:58:00.680
So that was the first part.
link |
00:58:02.480
The next is, okay, you're speaking to the device.
link |
00:58:05.880
Of course, you're gonna issue many different requests.
link |
00:58:09.040
Some may be simple, some may be extremely hard,
link |
00:58:11.560
but it's a large vocabulary speech recognition problem,
link |
00:58:13.760
essentially, where the audio is now not coming
link |
00:58:17.640
onto your phone or a handheld mic like this
link |
00:58:20.360
or a close talking mic, but it's from 20 feet away
link |
00:58:23.880
where if you're in a busy household,
link |
00:58:26.280
your son may be listening to music,
link |
00:58:28.880
your daughter may be running around with something
link |
00:58:31.640
and asking your mom something and so forth, right?
link |
00:58:33.840
So this is like a common household setting
link |
00:58:36.400
where the words you're speaking to Alexa
link |
00:58:40.200
need to be recognized with very high accuracy, right?
link |
00:58:43.400
Now, we're still just in the recognition problem.
link |
00:58:45.800
You haven't yet come to the understanding one, right?
link |
00:58:48.160
And if we pause them, sorry, once again,
link |
00:58:50.160
what year was this, is this before neural networks
link |
00:58:53.880
began to start to seriously prove themselves
link |
00:58:58.480
in the audio space?
link |
00:59:00.480
Yeah, this is around, so I joined in 2013 in April, right?
link |
00:59:05.480
So the early research in neural networks coming back
link |
00:59:08.800
and showing some promising results
link |
00:59:11.240
in speech recognition space had started happening,
link |
00:59:13.560
but it was very early.
link |
00:59:15.360
But we built on that. The very first thing we did
link |
00:59:20.000
when I joined the team and remember,
link |
00:59:23.800
it was a very much of a startup environment,
link |
00:59:25.960
which is great about Amazon.
link |
00:59:28.080
And we doubled down on deep learning right away
link |
00:59:31.240
and we knew we'll have to improve accuracy fast.
link |
00:59:36.600
And because of that, we knew the scale of data,
link |
00:59:39.920
once you have a device like this, if it is successful,
link |
00:59:43.240
will improve big time.
link |
00:59:44.960
Like you'll suddenly have large volumes of data
link |
00:59:48.040
to learn from to make the customer experience better.
link |
00:59:51.080
So how do you scale deep learning?
link |
00:59:52.480
So we did one of the first works
link |
00:59:54.560
in training with distributed GPUs
link |
00:59:57.600
and where the training time was linear
link |
01:00:01.480
in the amount of data.
link |
01:00:04.000
So that was quite important work
link |
01:00:06.240
where it was algorithmic improvements
link |
01:00:07.840
as well as a lot of engineering improvements
link |
01:00:09.920
to be able to train on thousands and thousands of hours of speech.
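The core idea behind the distributed-GPU training Rohit mentions is data parallelism: each worker computes gradients on its own shard of the data, the gradients are averaged (an all-reduce), and every worker applies the same update, so throughput scales with worker count. This is a toy simulation of that pattern with a scalar model, not Amazon's training system; the loss and numbers are invented.

```python
# Hypothetical data-parallel sketch: shard the data, compute per-worker
# gradients, average them (the "all-reduce"), apply one shared update.

def shard(data, n_workers):
    """Split the dataset into roughly equal shards, one per worker."""
    k = len(data) // n_workers
    return [data[i * k:(i + 1) * k] for i in range(n_workers)]

def local_gradient(weight, batch):
    # Toy loss: 0.5 * (weight - x)^2 per example, so grad = weight - x.
    return sum(weight - x for x in batch) / len(batch)

def train_step(weight, data, n_workers, lr=0.1):
    grads = [local_gradient(weight, s) for s in shard(data, n_workers)]
    avg = sum(grads) / len(grads)   # all-reduce: average across workers
    return weight - lr * avg        # identical update on every worker

data = [1.0, 2.0, 3.0, 4.0]
w = 0.0
for _ in range(200):
    w = train_step(w, data, n_workers=2)
assert abs(w - 2.5) < 1e-3  # converges to the data mean, same as one worker
```

Because the averaged gradient equals the full-batch gradient here, adding workers changes wall-clock time, not the learned answer — which is why training time can stay roughly linear in the amount of data as hardware is added.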
link |
01:00:14.040
And that was an important factor.
link |
01:00:15.600
So if you ask me like back in 2013 and 2014
link |
01:00:19.360
when we launched Echo,
link |
01:00:22.440
the combination of large scale data,
link |
01:00:25.680
deep learning progress, near infinite GPUs
link |
01:00:29.720
we had available on AWS even then
link |
01:00:33.120
all came together for us to be able to
link |
01:00:36.400
solve the far field speech recognition
link |
01:00:38.400
to the extent it could be useful to the customers.
link |
01:00:40.600
It's still not solved.
link |
01:00:41.440
Like I mean, it's not that we are perfect at recognizing speech
link |
01:00:44.520
but we are great at it in terms of the settings
link |
01:00:46.800
that are in homes, right?
link |
01:00:48.360
So and that was important even in the early stages.
link |
01:00:50.920
So first of all, just even I'm trying to look back
link |
01:00:53.360
at that time, if I remember correctly it was,
link |
01:00:58.760
it seems like the task would be pretty daunting.
link |
01:01:02.080
So like, so we kind of take it for granted that it works now.
link |
01:01:06.320
Yes, so you're right.
link |
01:01:07.640
So, first of all, you mentioned startup,
link |
01:01:10.800
I wasn't familiar how big the team was.
link |
01:01:12.800
I kind of, because I know there's a lot of really smart
link |
01:01:15.200
people working on it.
link |
01:01:16.040
So now it's very, very large team.
link |
01:01:19.240
How big was the team?
link |
01:01:20.760
How likely were you to fail in the eyes of everyone else?
link |
01:01:25.400
And ourselves?
link |
01:01:26.720
And yourself, so like what?
link |
01:01:28.560
I'll give you a very interesting anecdote on that.
link |
01:01:31.560
When I joined the team, the speech recognition team
link |
01:01:35.360
was six people, my first meeting
link |
01:01:38.840
and we had hired a few more people, it was 10 people.
link |
01:01:42.920
Nine out of 10 people thought it can't be done, right?
link |
01:01:47.960
Who was the one?
link |
01:01:48.800
The one was me. Actually, I should say,
link |
01:01:52.960
and one was semi optimistic and eight were trying to convince
link |
01:01:59.120
let's go to the management and say,
link |
01:02:01.720
let's not work on this problem,
link |
01:02:03.600
let's work on some other problem like either telephony speech
link |
01:02:07.720
for customer service calls and so forth.
link |
01:02:10.160
But this is the kind of belief you must have.
link |
01:02:12.040
And I had experience with far field speech recognition
link |
01:02:14.320
and my eyes lit up when I saw a problem like that saying,
link |
01:02:17.720
okay, we have been in speech recognition
link |
01:02:20.840
always looking for that killer app.
link |
01:02:23.400
And this was a killer use case
link |
01:02:25.840
to bring something delightful in the hands of customers.
link |
01:02:28.840
So you mentioned the way you kind of think of it
link |
01:02:31.160
in the product way in the future,
link |
01:02:32.640
have a press release and an FAQ and you think backwards.
link |
01:02:35.760
Did you have, did the team have the echo in mind?
link |
01:02:41.000
So this far field speech recognition,
link |
01:02:43.040
actually putting a thing in the home that works
link |
01:02:45.360
that's able to interact with,
link |
01:02:46.640
was that the press release?
link |
01:02:48.200
What was the...
link |
01:02:49.040
Very close, I would say in terms of the,
link |
01:02:51.480
as I said, the vision was the Star Trek computer, right?
link |
01:02:54.840
So, or the inspiration.
link |
01:02:56.920
And from there, I can't divulge all the exact specifications,
link |
01:03:00.640
but one of the first things that was magical on Alexa
link |
01:03:07.240
was music.
link |
01:03:08.840
It brought me to back to music
link |
01:03:11.200
because my taste is still stuck in when I was an undergrad.
link |
01:03:14.240
So I still listen to those songs
link |
01:03:15.640
and I, it was too hard for me
link |
01:03:18.440
to be a music fan with a phone, right?
link |
01:03:21.440
So I hate things in my ears.
link |
01:03:24.240
So from that perspective, it was quite hard
link |
01:03:28.160
and music was part of the,
link |
01:03:32.040
at least the documents I've seen, right?
link |
01:03:33.680
So from that perspective, I think, yes,
link |
01:03:36.160
in terms of how far are we from the original vision?
link |
01:03:40.680
I can't reveal that,
link |
01:03:42.080
but that's why I have a lot of fun at work
link |
01:03:44.560
because every day we go in
link |
01:03:46.440
and thinking like these are the new set of challenges to solve.
link |
01:03:49.040
Yeah, it's a great way to do great engineering
link |
01:03:51.880
as you think of the product, the press release.
link |
01:03:53.600
I like that idea actually.
link |
01:03:55.000
Maybe we'll talk about it a bit later,
link |
01:03:56.800
which is a super nice way to have a focus.
link |
01:03:59.280
I'll tell you this, you're a scientist
link |
01:04:01.360
and a lot of my scientists have adopted that.
link |
01:04:03.720
They have now, they love it as a process
link |
01:04:06.960
because it was very,
link |
01:04:08.400
as scientists, you're trained to write great papers,
link |
01:04:10.920
but they are all after you've done the research
link |
01:04:13.480
or you've proven like, and your PhD dissertation proposal
link |
01:04:16.600
is something that comes closest
link |
01:04:18.440
or a DARPA proposal or a NSF proposal
link |
01:04:21.160
is the closest that comes to a press release.
link |
01:04:23.600
But that process is now ingrained in our scientists,
link |
01:04:27.000
which is like delightful for me to see.
link |
01:04:30.920
You write the paper first and then make it happen.
link |
01:04:33.040
That's right.
link |
01:04:33.880
I mean, in fact, it's not...
link |
01:04:34.720
State of the art results.
link |
01:04:36.280
Or you leave the results section open,
link |
01:04:38.400
but you have a thesis about here's what I expect, right?
link |
01:04:41.640
And here's what it will change, right?
link |
01:04:44.920
So I think it is a great thing.
link |
01:04:46.480
It works for researchers as well.
link |
01:04:48.160
Yeah.
link |
01:04:49.000
So far field recognition.
link |
01:04:50.680
Yeah.
link |
01:04:52.320
What was the big leap?
link |
01:04:53.840
What were the breakthroughs
link |
01:04:55.400
and what was that journey like to today?
link |
01:04:58.360
Yeah, I think the, as you said first,
link |
01:05:00.160
there was a lot of skepticism
link |
01:05:01.560
on whether far field speech recognition
link |
01:05:03.320
will ever work to be good enough, right?
link |
01:05:06.440
And what we first did was got a lot of training data
link |
01:05:09.960
in a far field setting.
link |
01:05:11.440
And that was extremely hard to get
link |
01:05:13.960
because none of it existed.
link |
01:05:16.120
So how do you collect data in far field setup, right?
link |
01:05:20.040
With no customer base at this time.
link |
01:05:21.360
With no customer base, right?
link |
01:05:22.600
So that was first innovation.
link |
01:05:24.720
And once we had that, the next thing was, okay,
link |
01:05:27.200
if you have the data, first of all,
link |
01:05:30.680
we didn't talk about like,
link |
01:05:31.800
what would magical mean in this kind of a setting?
link |
01:05:35.200
What is good enough for customers, right?
link |
01:05:37.440
That's always, since you've never done this before,
link |
01:05:40.360
what would be magical?
link |
01:05:41.560
So it wasn't just a research problem.
link |
01:05:44.160
You had to put some, in terms of accuracy
link |
01:05:47.600
and customer experience features,
link |
01:05:49.840
some stakes on the ground saying,
link |
01:05:51.400
here's where I think it should get to.
link |
01:05:54.880
So you established a bar
link |
01:05:55.960
and then how do you measure progress towards it?
link |
01:05:57.800
Given you have no customers right now.
link |
01:06:01.640
So from that perspective, we went,
link |
01:06:04.120
so first was the data without customers.
link |
01:06:07.480
Second was doubling down on deep learning as a way to learn.
link |
01:06:11.840
And I can just tell you that the combination of the two
link |
01:06:16.080
got our error rates by a factor of five.
link |
01:06:19.120
From where we were when I started to,
link |
01:06:22.200
within six months of having that data,
link |
01:06:24.240
we, at that point, I got the conviction
link |
01:06:28.320
that this will work, right?
link |
01:06:29.840
So because that was magical
link |
01:06:31.560
in terms of when it started working.
link |
01:06:33.760
And that became close to the magical bar.
link |
01:06:37.640
Close to the bar, right?
link |
01:06:39.440
That we felt would be where people will use it,
link |
01:06:44.200
which was critical.
link |
01:06:45.280
Because you really have one chance at this.
link |
01:06:48.800
If we had launched in November,
link |
01:06:50.480
2014 is when we launched,
link |
01:06:51.840
if it was below the bar,
link |
01:06:53.040
I don't think this category exists if you don't meet the bar.
link |
01:06:58.040
Yeah, and just having looked at voice based interactions,
link |
01:07:02.000
like in the car or earlier systems,
link |
01:07:05.960
it's a source of huge frustration for people.
link |
01:07:08.280
In fact, we use voice based interaction
link |
01:07:10.280
for collecting data on subjects to measure frustration.
link |
01:07:14.600
So as a training set for computer vision, for face data,
link |
01:07:18.240
so we can get a data set of frustrated people.
link |
01:07:20.600
That's the best way to get frustrated people
link |
01:07:22.240
is having them interact with a voice based system in the car.
link |
01:07:25.520
So that bar, I imagine, was pretty high.
link |
01:07:28.520
It was very high.
link |
01:07:29.480
And we talked about how also errors are perceived
link |
01:07:32.720
from AIs versus errors by humans.
link |
01:07:36.920
But we are not done with the problems that ended up,
link |
01:07:39.880
we had to solve to get it to launch.
link |
01:07:41.200
So do you want the next one?
link |
01:07:42.640
Yeah, the next one.
link |
01:07:44.040
No.
link |
01:07:45.640
So the next one was what I think of as
link |
01:07:49.480
multi domain natural language understanding.
link |
01:07:52.480
It's very, I wouldn't say easy,
link |
01:07:54.720
but back in those days,
link |
01:07:57.480
solving understanding in one domain, a narrow domain,
link |
01:08:02.800
was doable, but for these multiple domains,
link |
01:08:07.520
like music, like information,
link |
01:08:10.160
other kinds of household productivity, alarms, timers,
link |
01:08:14.080
even though it wasn't as big as it is,
link |
01:08:15.760
in terms of the number of skills Alexa has
link |
01:08:17.400
the confusion space has grown by three orders of magnitude,
link |
01:08:22.400
it was still daunting even those days.
link |
01:08:24.680
Again, no customer base yet.
link |
01:08:26.320
Again, no customer base.
link |
01:08:27.920
So now you're looking at meaning understanding
link |
01:08:29.880
and intent understanding and taking actions
link |
01:08:31.880
on behalf of customers based on their requests.
link |
01:08:35.080
And that is the next hard problem.
link |
01:08:37.920
Even if you have gotten the words recognized,
link |
01:08:41.440
how do you make sense of them?
link |
01:08:44.080
In those days, there was still a lot of emphasis
link |
01:08:48.920
on rule based systems for writing grammar patterns
link |
01:08:52.360
to understand the intent,
link |
01:08:53.880
but we had a statistical first approach even then,
link |
01:08:57.160
where for our language understanding,
link |
01:08:58.760
we had in even those starting days an entity recognizer
link |
01:09:04.240
and an intent classifier, which was all trained statistically.
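A statistical intent classifier of the kind described — trained from labeled utterances rather than hand-written grammar rules — can be sketched as a bag-of-words naive Bayes model. This is a generic illustration, not Alexa's model; the intents and training utterances below are invented.

```python
# Hypothetical sketch: bag-of-words naive Bayes intent classification with
# add-one (Laplace) smoothing, trained purely from labeled example utterances.
from collections import Counter, defaultdict
import math

def train(examples):
    """examples: list of (utterance, intent). Returns per-intent word counts."""
    counts = defaultdict(Counter)
    for text, intent in examples:
        counts[intent].update(text.lower().split())
    return counts

def classify(counts, utterance, alpha=1.0):
    vocab = {w for c in counts.values() for w in c}
    def score(intent):
        c = counts[intent]
        total = sum(c.values())
        return sum(
            math.log((c[w] + alpha) / (total + alpha * len(vocab)))
            for w in utterance.lower().split()
        )
    return max(counts, key=score)

examples = [
    ("play songs by the stones", "PlayMusic"),
    ("play some jazz", "PlayMusic"),
    ("set an alarm for seven", "SetAlarm"),
    ("wake me up at six", "SetAlarm"),
]
model = train(examples)
assert classify(model, "play the rolling stones") == "PlayMusic"
assert classify(model, "wake me up") == "SetAlarm"
```

The point of the statistical-first mindset is visible even in this toy: "play the rolling stones" was never seen during training, yet the word statistics still route it to the right intent, with no grammar pattern written for it.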
link |
01:09:08.120
In fact, we had to build the deterministic matching
link |
01:09:11.400
as a follow up to fix bugs that statistical models have, right?
link |
01:09:16.200
So it was just a different mindset
link |
01:09:18.240
where we focused on data driven statistical understanding.
link |
01:09:22.040
Which wins in the end if you have a huge data set?
link |
01:09:24.760
Yes, it is contingent on that.
link |
01:09:26.480
And that's why it came back to how do you get the data?
link |
01:09:29.160
Before customers, the fact that this is why data
link |
01:09:32.520
becomes crucial to get to the point
link |
01:09:35.360
that you have the understanding system built in, built up.
link |
01:09:40.160
And notice that we were talking about human machine dialogue
link |
01:09:44.560
and even those early days,
link |
01:09:46.840
even it was very much transactional,
link |
01:09:49.320
do one thing, one shot utterances, in a great way.
link |
01:09:52.600
There was a lot of debate on how much should Alexa talk back
link |
01:09:54.920
in terms of if you misunderstood you
link |
01:09:57.480
or you said play songs by the stones
link |
01:10:01.560
and let's say it doesn't know early days,
link |
01:10:04.880
knowledge can be sparse.
link |
01:10:07.120
Who are the stones, right?
link |
01:10:09.240
It's the rolling stones, right?
link |
01:10:10.840
So, and you don't want the match
link |
01:10:14.360
to be stone temple pilots or rolling stones, right?
link |
01:10:17.320
So you don't know which one it is.
link |
01:10:19.000
So these kind of other signals to,
link |
01:10:22.600
no, there we had great assets, right?
link |
01:10:24.680
From Amazon in terms of.
link |
01:10:27.160
UX, like what is it?
link |
01:10:28.480
What kind of, yeah, how do you solve that problem?
link |
01:10:31.320
In terms of what we think of it
link |
01:10:32.360
as an entity resolution problem, right?
link |
01:10:34.080
So, which one is it, right?
link |
01:10:36.280
I mean, even if you figured out the Stones as an entity,
link |
01:10:40.200
you have to resolve it to whether it's the Stones
link |
01:10:42.280
or the Stone Temple Pilots or some other Stones.
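The entity resolution idea described here can be sketched roughly as follows. This is a toy illustration, not Alexa's actual system: the catalog, popularity scores, and weights are all invented, and the scoring mixes simple lexical similarity with a popularity prior.

```python
# Hypothetical sketch: resolve an ambiguous mention like "the stones" against
# a catalog by combining string similarity with a popularity prior.
# All names, scores, and weights are illustrative assumptions.
from difflib import SequenceMatcher

# Toy catalog with made-up popularity scores in [0, 1].
CATALOG = {
    "The Rolling Stones": 0.95,
    "Stone Temple Pilots": 0.70,
    "Stones Throw Records": 0.40,
}

def resolve_entity(mention, catalog, sim_weight=0.6, pop_weight=0.4):
    """Return catalog entries ranked by similarity-plus-popularity score."""
    def score(name, popularity):
        sim = SequenceMatcher(None, mention.lower(), name.lower()).ratio()
        return sim_weight * sim + pop_weight * popularity
    return sorted(catalog, key=lambda n: score(n, catalog[n]), reverse=True)

ranked = resolve_entity("the stones", CATALOG)
print(ranked[0])  # The Rolling Stones
```

In practice a system like this would also fold in the other signals mentioned, such as the user's listening history and catalog behavior, rather than a fixed popularity number.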
link |
01:10:44.920
Maybe I misunderstood, is the resolution
link |
01:10:47.120
the job of the algorithm or is the job of UX
link |
01:10:50.560
communicating with the human to help the resolution?
link |
01:10:52.480
Well, there is both, right?
link |
01:10:54.320
It is, you want 90% or high 90s to be done
link |
01:10:58.840
without any further questioning or UX, right?
link |
01:11:01.280
So, but it's absolutely okay.
link |
01:11:04.240
Just like as humans, we ask the question,
link |
01:11:06.960
I didn't understand you, Lex.
link |
01:11:09.040
It's fine for Alexa to occasionally say,
link |
01:11:10.680
I did not understand you, right?
link |
01:11:12.160
And that's an important way to learn.
link |
01:11:14.720
And I'll talk about where we have come
link |
01:11:16.280
with more self learning with these kind of feedback signals.
link |
01:11:20.160
But in those days, just solving the ability
link |
01:11:23.320
of understanding the intent and resolving to an action
link |
01:11:26.560
where action could be play a particular artist
link |
01:11:28.800
or a particular song was super hard.
link |
01:11:32.040
Again, the bar was high as we were talking about, right?
link |
01:11:35.480
So, while we launched it in sort of 13 big domains,
link |
01:11:40.320
I would say in terms of our thing,
link |
01:11:42.440
we think of it as 13 of the big skills we had,
link |
01:11:44.840
like music is a massive one when we launched it.
link |
01:11:47.800
And now we have 90,000 plus skills on Alexa.
link |
01:11:51.600
So, what are the big skills?
link |
01:11:52.760
Can you just go over them?
link |
01:11:53.720
Because the only thing I use it for
link |
01:11:55.600
is music, weather, and shopping.
link |
01:11:58.960
So, we think of it as music, information, right?
link |
01:12:02.600
So, weather is a part of information, right?
link |
01:12:05.440
So, when we launched, we didn't have smart home,
link |
01:12:08.080
but within, by smart home, I mean,
link |
01:12:10.440
you connect your smart devices,
link |
01:12:12.120
you control them with voice.
link |
01:12:13.160
If you haven't done it, it's worth it,
link |
01:12:15.040
it will change your life.
link |
01:12:15.880
By turning on the lights and so on.
link |
01:12:16.720
Yeah, turning on your lights, to anything that's connected
link |
01:12:20.200
and has a... it's just that.
link |
01:12:21.520
What's your favorite smart device for you?
link |
01:12:23.240
The light.
link |
01:12:24.080
The light.
link |
01:12:24.920
And now you have the smart plug,
link |
01:12:26.320
and you also have this Echo plug, which is...
link |
01:12:29.920
Oh yeah, you can plug in anything.
link |
01:12:30.760
You can plug in anything,
link |
01:12:31.600
and now you can turn that one on and off.
link |
01:12:33.560
I'll use this conversation as motivation
link |
01:12:35.080
and get one or something.
link |
01:12:35.920
The garage door, you can check your status
link |
01:12:38.720
of the garage door and things like,
link |
01:12:40.280
and we have gone on to make Alexa more and more proactive,
link |
01:12:43.240
where it even has hunches now,
link |
01:12:45.120
hunches like, you left your light on.
link |
01:12:50.520
Let's say you've gone to your bed
link |
01:12:51.640
and you left the garage light on.
link |
01:12:52.840
So, it will help you out in these settings, right?
link |
01:12:56.160
So...
link |
01:12:57.000
That's smart devices.
link |
01:12:58.320
Information, smart devices, you said music.
link |
01:13:01.120
Yeah, so I don't remember everything we had,
link |
01:13:02.920
but alarms and timers were the big ones,
link |
01:13:05.000
like the timers were very popular right away.
link |
01:13:09.480
Music also, like you could play song, artist, album,
link |
01:13:13.440
everything, and so that was like a clear win
link |
01:13:17.000
in terms of the customer experience.
link |
01:13:19.400
So that's, again, this is language understanding.
link |
01:13:22.760
Now things have evolved, right?
link |
01:13:24.080
So where we want Alexa definitely to be
link |
01:13:27.280
more accurate, competent, trustworthy,
link |
01:13:29.800
based on how well it does these core things,
link |
01:13:33.080
but we have evolved in many different dimensions.
link |
01:13:35.200
First is what I think of as being
link |
01:13:37.240
more conversational for high utility,
link |
01:13:39.160
not just for chat, right?
link |
01:13:40.880
And there, at re:MARS this year,
link |
01:13:43.480
which is our AI conference,
link |
01:13:44.880
we launched what is called Alexa Conversations.
link |
01:13:48.520
That is providing the ability for developers to author
link |
01:13:52.920
multi turn experiences on Alexa with no code, essentially,
link |
01:13:56.440
where in terms of the dialogue code,
link |
01:13:58.840
initially it was like, you know, all these IVR systems,
link |
01:14:02.560
you have to fully author,
link |
01:14:05.080
if the customer says this, do that, right?
link |
01:14:07.520
So the whole dialogue flow is hand authored.
link |
01:14:11.440
And with Alexa Conversations,
link |
01:14:13.600
the way it works is that you just provide sample interaction data
link |
01:14:16.720
with your service or an API,
link |
01:14:18.000
let's say you're Atom Tickets,
link |
01:14:19.080
which provides a service for buying movie tickets.
link |
01:14:23.360
You provide a few examples
link |
01:14:24.760
of how your customers will interact with your APIs.
link |
01:14:27.800
And then the dialogue flow is automatically constructed
link |
01:14:29.920
using a neural network trained on that data.
link |
01:14:33.320
So that simplifies the developer experience.
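The idea behind learning a dialogue flow from sample interactions, instead of hand-authoring every branch, can be sketched very roughly as below. A real system like Alexa Conversations uses neural models; a frequency table over developer-provided examples just shows the shape of the idea, and all the intent and action names here are made up, not a real Alexa SDK.

```python
# Hypothetical sketch: learn "which API action comes next" from a few
# developer-provided example dialogues, instead of hand-authoring the flow.
# Intent and action names are illustrative assumptions.
from collections import Counter, defaultdict

# Toy sample dialogues for an imagined movie-ticketing service:
# each turn is (customer intent, next action the service should take).
SAMPLES = [
    [("AskMovies", "list_movies"), ("PickMovie", "ask_showtime"),
     ("GiveTime", "ask_party_size"), ("GiveSize", "book_tickets")],
    [("AskMovies", "list_movies"), ("PickMovie", "ask_showtime"),
     ("GiveTime", "ask_party_size"), ("GiveSize", "book_tickets")],
]

def train(samples):
    """Build a next-action policy: most frequent action per observed intent."""
    counts = defaultdict(Counter)
    for dialogue in samples:
        for intent, action in dialogue:
            counts[intent][action] += 1
    return {intent: c.most_common(1)[0][0] for intent, c in counts.items()}

policy = train(SAMPLES)
print(policy["GiveSize"])  # book_tickets
```

A neural model generalizes where this lookup table cannot, e.g. to turn orderings never seen in the samples, which is the point of training on examples rather than enumerating flows.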
link |
01:14:35.880
We just launched our preview for the developers
link |
01:14:38.400
to try this capability out.
link |
01:14:40.560
And then the second part of it,
link |
01:14:42.080
which shows even increased utility for customers,
link |
01:14:45.680
is you and I, when we interact with Alexa or any customer,
link |
01:14:50.880
as I'm coming back to our initial part of the conversation,
link |
01:14:53.120
the goal is often unclear or unknown to the AI.
link |
01:14:58.920
If I say, Alexa, what movies are playing nearby?
link |
01:15:02.640
Am I trying to just buy movie tickets?
link |
01:15:08.000
Am I actually, do you think, looking
link |
01:15:11.360
for movies just out of curiosity,
link |
01:15:12.840
whether the Avengers is still in theaters or when is it?
link |
01:15:15.920
Maybe it's gone, and maybe I missed it.
link |
01:15:18.440
So I may watch it on Prime, which happened to me.
link |
01:15:22.640
So from that perspective now,
link |
01:15:25.440
you're looking into what is my goal?
link |
01:15:28.480
And let's say I now complete the movie ticket purchase.
link |
01:15:33.160
Maybe I would like to get dinner nearby.
link |
01:15:37.400
So what is really the goal here?
link |
01:15:40.400
Is it night out or is it movies?
link |
01:15:43.800
As in just go watch a movie?
link |
01:15:45.800
The answer is, we don't know.
link |
01:15:48.000
So can Alexa now figure we have the intelligence
link |
01:15:52.560
that I think this meta goal is really night out
link |
01:15:55.440
or at least say to the customer
link |
01:15:57.560
when you've completed the purchase of movie tickets
link |
01:16:00.000
from Atom Tickets or Fandango or pick anyone.
link |
01:16:03.240
Then the next thing is,
link |
01:16:04.320
do you want to get an Uber to the theater?
link |
01:16:10.800
Or do you want to book a restaurant next to it?
link |
01:16:14.400
And then not ask the same information over and over again.
link |
01:16:18.960
What time, how many people in your party?
link |
01:16:23.960
So this is where you shift the cognitive burden
link |
01:16:28.080
from the customer to the AI,
link |
01:16:30.440
where it's thinking of what is your,
link |
01:16:33.600
it anticipates your goal
link |
01:16:35.560
and takes the next best action to complete it.
link |
01:16:38.840
Now that's the machine learning problem.
link |
01:16:42.160
But essentially, that's the way we solved this first instance,
link |
01:16:45.200
and we have a long way to go to make it scale
link |
01:16:48.240
to everything possible in the world.
link |
01:16:50.120
But at least for this situation,
link |
01:16:51.520
it is that at every instance,
link |
01:16:54.400
Alexa is making the determination,
link |
01:16:56.000
whether it should stick with the experience
link |
01:16:57.640
with Atom Tickets or offer you,
link |
01:17:02.560
based on what you say,
link |
01:17:03.760
whether you have completed the interaction
link |
01:17:06.240
or you said, no, get me an Uber now.
link |
01:17:07.760
So it will shift context
link |
01:17:09.120
into another experience or skill or another service.
link |
01:17:12.840
So that's a dynamic decision making.
link |
01:17:15.360
That's making Alexa,
link |
01:17:16.480
you can say more conversational
link |
01:17:18.120
for the benefit of the customer
link |
01:17:20.160
rather than simply complete transactions
link |
01:17:22.480
which are well thought through.
link |
01:17:25.200
You as a customer have fully specified
link |
01:17:27.800
what you want to be accomplished.
link |
01:17:29.640
It's accomplishing that.
link |
01:17:30.800
So it's kind of like what
link |
01:17:32.400
we do with pedestrians,
link |
01:17:34.040
like intent modeling is predicting
link |
01:17:36.800
what your possible goals are
link |
01:17:38.680
and what's the most likely goal
link |
01:17:39.960
and then switching that depending on the things you say.
link |
01:17:42.400
So my question is there,
link |
01:17:44.400
it seems maybe it's a dumb question,
link |
01:17:46.520
but it would help a lot if Alexa remembered
link |
01:17:51.400
what I said previously.
link |
01:17:53.040
Is it trying to use some memory
link |
01:17:57.880
of the customer?
link |
01:17:58.720
It is using a lot of memory within that.
link |
01:18:00.840
So right now, not so much in terms of,
link |
01:18:02.720
okay, which restaurant do you prefer?
link |
01:18:05.400
That is a more long term memory,
link |
01:18:06.840
but within the short term memory,
link |
01:18:08.360
within the session,
link |
01:18:09.880
it is remembering how many people did you,
link |
01:18:11.880
so if you said buy four tickets,
link |
01:18:13.880
now it has made an implicit assumption
link |
01:18:15.720
that you are gonna have,
link |
01:18:18.360
you need at least four seats at a restaurant, right?
link |
01:18:21.800
So these are the kind of context it's preserving
link |
01:18:24.360
between these skills, but within that session,
link |
01:18:26.880
but you're asking the right question
link |
01:18:28.160
in terms of for it to be more and more useful,
link |
01:18:32.200
it has to have more long term memory
link |
01:18:33.840
and that's also an open question.
link |
01:18:35.280
And again, these are still early days.
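The short-term session memory being described, where buying four movie tickets implies at least four restaurant seats later in the same session, can be sketched minimally like this. The slot names and the `Session` class are illustrative assumptions, not Alexa's actual data model.

```python
# Minimal sketch of short-term, within-session context carried across skills.
# Slot names are illustrative; a real system would also handle expiry,
# confidence, and per-skill permissions.
class Session:
    def __init__(self):
        self.context = {}  # short-term memory, discarded when session ends

    def record(self, slot, value):
        self.context[slot] = value

    def default(self, slot, fallback=None):
        # A later skill can use an earlier fact as an implicit default.
        return self.context.get(slot, fallback)

session = Session()
session.record("party_size", 4)  # customer said "buy four tickets"

# Later in the same session, a restaurant skill needs a seat count and can
# default to the remembered party size instead of asking again.
seats = session.default("party_size", fallback=2)
print(seats)  # 4
```

The long-term version, remembering which restaurant you prefer across sessions, is the open question raised here: it needs persistence, and judgments about when an old preference still applies.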
link |
01:18:37.520
So for me, I mean, everybody's different,
link |
01:18:40.400
but yeah, I'm definitely not representative
link |
01:18:44.080
of the general population
link |
01:18:45.000
in the sense that I do the same thing every day.
link |
01:18:47.960
Like I eat the same,
link |
01:18:48.840
like I do everything the same, the same thing.
link |
01:18:51.960
Wear the same thing clearly, this or the black shirt.
link |
01:18:55.600
So it's frustrating when Alexa doesn't get what I'm saying
link |
01:18:59.200
because I have to correct her every time in the exact same way.
link |
01:19:03.040
This has to do with certain songs.
link |
01:19:05.680
Like she doesn't know certain weird songs.
link |
01:19:08.480
And doesn't know, I've complained to Spotify about this,
link |
01:19:11.440
I talked to the head of R&D at Spotify. Stairway to Heaven.
link |
01:19:15.240
I have to correct it every time.
link |
01:19:16.560
It doesn't play Led Zeppelin correctly.
link |
01:19:18.760
It plays covers of Led Zeppelin.
link |
01:19:21.040
So you should figure out,
link |
01:19:23.240
you should send it to me next time it fails.
link |
01:19:26.360
Feel free to send it to me.
link |
01:19:27.480
We will take care of it.
link |
01:19:28.440
Okay.
link |
01:19:29.280
Because Led Zeppelin is one of my favorite bands
link |
01:19:31.680
and it works for me.
link |
01:19:32.520
So I'm like shocked, it doesn't work for you.
link |
01:19:34.160
This is an official bug report.
link |
01:19:35.480
I'll put it, I'll make it public or make everybody retweet it.
link |
01:19:39.080
We're gonna fix the Stairway to Heaven problem.
link |
01:19:41.000
Anyway, but the point is,
link |
01:19:43.240
you know, I'm pretty boring and do the same thing.
link |
01:19:45.160
But I'm sure most people do the same set of things.
link |
01:19:48.360
Do you see Alexa sort of utilizing that in the future
link |
01:19:51.400
for improving the experience?
link |
01:19:52.800
Yes.
link |
01:19:53.640
And not only utilizing, it's already doing some of it.
link |
01:19:56.240
That's what we call Alexa becoming more self learning.
link |
01:19:59.560
So Alexa is now auto correcting millions and millions
link |
01:20:04.400
of utterances in the US without any human supervision involved.
link |
01:20:08.760
The way it does it is, let's take an example
link |
01:20:11.920
of a particular song didn't work for you.
link |
01:20:14.760
What do you do next?
link |
01:20:15.720
You either, it played the wrong song and you said,
link |
01:20:18.440
Alexa, no, that's not the song I want.
link |
01:20:20.760
Or you say Alexa, play that, you try it again.
link |
01:20:25.200
And that is a signal to Alexa
link |
01:20:27.480
that she may have done something wrong.
link |
01:20:30.120
And from that perspective, we can learn
link |
01:20:33.520
if there's that failure pattern or that action
link |
01:20:36.720
of song A was played when song B was requested.
link |
01:20:41.080
And it's very common with station names because play NPR,
link |
01:20:44.360
you can have N be confused as an M.
link |
01:20:47.200
And then you, for a certain accent like mine,
link |
01:20:51.880
people confuse my N and M all the time.
link |
01:20:54.760
And because I have an Indian accent,
link |
01:20:57.680
they're confusable to humans.
link |
01:20:59.640
It is for Alexa too.
link |
01:21:01.640
And in that case, it starts auto correcting.
link |
01:21:05.120
And we correct a lot of these automatically
link |
01:21:09.720
without a human looking at the failures.
link |
01:21:12.720
So one of the things that's missing for me in Alexa,
link |
01:21:17.400
I don't know if I'm a representative customer,
link |
01:21:19.760
but every time I correct it,
link |
01:21:22.960
it would be nice to know that that made a difference.
link |
01:21:26.160
Yes.
link |
01:21:27.000
You know what I mean?
link |
01:21:27.840
Like the sort of like, I heard you like a sort of.
link |
01:21:31.920
Some acknowledgement of that.
link |
01:21:33.840
We work a lot with Tesla, we study autopilot and so on.
link |
01:21:37.480
And a large amount of the customers that use Tesla autopilot,
link |
01:21:40.720
they feel like they're always teaching the system.
link |
01:21:43.000
They're almost excited by the possibility
link |
01:21:44.440
that they're teaching.
link |
01:21:45.280
I don't know if Alexa customers generally think of it
link |
01:21:48.440
as they're teaching to improve the system.
link |
01:21:51.120
And that's a really powerful thing.
link |
01:21:52.720
Again, I would say it's a spectrum.
link |
01:21:55.240
Some customers do think that way.
link |
01:21:57.360
And some would be annoyed by Alexa acknowledging that.
link |
01:22:01.200
Or so there's a, again, no one,
link |
01:22:04.160
you know, while there are certain patterns,
link |
01:22:05.760
not everyone is the same in this way.
link |
01:22:08.280
But we believe that again, customers helping Alexa
link |
01:22:13.640
is a tenet for us in terms of improving it.
link |
01:22:15.680
And more self learning is by, again,
link |
01:22:18.240
this is like fully unsupervised, right?
link |
01:22:20.080
There is no human in the loop and no labeling happening.
link |
01:22:23.560
And based on your actions as a customer,
link |
01:22:27.120
Alexa becomes smarter.
link |
01:22:29.000
Again, it's early days,
link |
01:22:31.120
but I think this whole area of teachable AI
link |
01:22:35.840
is gonna get bigger and bigger in the whole space,
link |
01:22:38.680
especially in the AI assistant space.
link |
01:22:40.760
So that's the second part where I mentioned
link |
01:22:43.440
more conversational, this is more self learning.
link |
01:22:46.520
The third is more natural.
link |
01:22:48.320
And the way I think of more natural
link |
01:22:50.240
is we talked about how Alexa sounds.
link |
01:22:53.280
And we've done a lot of advances in our text to speech
link |
01:22:58.080
by using again, neural network technology
link |
01:23:00.480
for it to sound very human like.
link |
01:23:03.520
From the individual texture of the sound
link |
01:23:05.640
to the timing, the tonality, the tone, everything.
link |
01:23:09.280
I would think in terms of,
link |
01:23:11.040
there's a lot of controls in each of the places
link |
01:23:13.400
for how, I mean, the speed of the voice,
link |
01:23:16.680
the prosodic patterns,
link |
01:23:18.080
the actual smoothness of how it sounds.
link |
01:23:23.400
All of those are factored in, and we do a ton of listening tests
link |
01:23:25.880
to make sure of that,
link |
01:23:27.120
but the naturalness, how it sounds, should be very natural.
link |
01:23:30.760
How it understands requests is also very important.
link |
01:23:33.680
Like, in terms of, we have 95,000 skills,
link |
01:23:37.160
and imagine that in many of these skills,
link |
01:23:41.480
you have to remember the skill name
link |
01:23:43.400
and say, Alexa, ask the Tide skill to tell me X, right?
link |
01:23:49.840
Or now, if you have to remember the skill name,
link |
01:23:53.000
that means the discovery and the interaction is unnatural.
link |
01:23:56.680
And we are trying to solve that by what we think of as,
link |
01:24:00.960
again, you don't have to have the app metaphor here.
link |
01:24:05.760
These are not individual apps, right?
link |
01:24:07.440
Even though they're,
link |
01:24:08.400
so you're not sort of opening one at a time and interacting.
link |
01:24:11.440
So it should be seamless because it's voice.
link |
01:24:14.040
And when it's voice,
link |
01:24:15.200
you have to be able to understand these requests
link |
01:24:17.600
independent of the specificity, like a skill name.
link |
01:24:20.640
And to do that, what we have done is again,
link |
01:24:22.880
built a deep learning based capability
link |
01:24:24.480
where we shortlist a bunch of skills
link |
01:24:27.080
when you say, Alexa, get me a car.
link |
01:24:28.920
And then we figure out, okay,
link |
01:24:30.120
it's meant for an Uber skill versus a Lyft
link |
01:24:33.360
or based on your preferences.
link |
01:24:34.920
And then you can rank the responses from the skill
link |
01:24:38.360
and then choose the best response for the customer.
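The two-stage pattern just described, shortlist candidate skills for a name-free request, then rank their responses with user preferences folded in, can be sketched as below. The skill names, keyword sets, and preference scores are illustrative assumptions; the real system uses learned deep models rather than keyword overlap.

```python
# Toy sketch of name-free skill routing: shortlist skills by keyword overlap
# with the utterance, then rank the shortlist by an (invented) preference
# score. Real systems use learned semantic models, not keyword sets.
SKILLS = {
    "ride_uber":  {"keywords": {"car", "ride", "uber"}, "preference": 0.9},
    "ride_lyft":  {"keywords": {"car", "ride", "lyft"}, "preference": 0.6},
    "pizza_shop": {"keywords": {"pizza", "order"},      "preference": 0.8},
}

def shortlist(utterance, skills, top_k=2):
    """Stage 1: keep the top-k skills with any keyword overlap."""
    words = set(utterance.lower().split())
    overlap = {name: len(words & meta["keywords"]) for name, meta in skills.items()}
    ranked = sorted(overlap, key=overlap.get, reverse=True)
    return [name for name in ranked if overlap[name] > 0][:top_k]

def choose(utterance, skills):
    """Stage 2: rank the shortlist, here by a stand-in preference score."""
    candidates = shortlist(utterance, skills)
    return max(candidates, key=lambda name: skills[name]["preference"])

print(choose("alexa get me a car", SKILLS))  # ride_uber
```

The design point is the same as in the conversation: the customer never names a skill, and the shortlist keeps the expensive ranking stage from having to score all 95,000 skills on every request.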
link |
01:24:41.320
So that's on the more natural,
link |
01:24:43.280
other examples of more natural is like,
link |
01:24:46.400
we were talking about lists, for instance.
link |
01:24:49.160
And you want to, you don't want to say Alexa add milk,
link |
01:24:51.760
Alexa add eggs, Alexa add cookies.
link |
01:24:55.200
No, Alexa add cookies, milk and eggs,
link |
01:24:57.320
and that in one shot, right?
link |
01:24:59.280
So that works, that helps with the naturalness.
link |
01:25:01.800
We talked about memory, like if you said,
link |
01:25:05.440
you can say Alexa, remember, I have to go to mom's house
link |
01:25:09.080
or you may have entered a calendar event
link |
01:25:11.200
through your calendar that's linked to Alexa.
link |
01:25:13.560
You don't want to have to remember whether it's in my calendar
link |
01:25:15.840
or did I tell you to remember something
link |
01:25:18.400
or some other reminder, right?
link |
01:25:21.000
So now, independent of how customers
link |
01:24:25.320
create these events, you should just be able to say, Alexa,
link |
01:25:28.440
when do I have to go to mom's house?
link |
01:25:29.880
And it tells you when you have to go to mom's house.
link |
01:25:32.360
Now that's a fascinating problem.
link |
01:25:33.720
Who's that problem on?
link |
01:25:35.280
So there's people who create skills.
link |
01:25:38.520
Who's tasked with integrating all of that knowledge together?
link |
01:25:42.840
So the skills become seamless.
link |
01:25:44.640
Is it the creators of the skills, or is it the infrastructure
link |
01:25:49.080
that Alexa provides?
link |
01:25:51.200
It's both.
link |
01:25:52.040
I think the large problem in terms
link |
01:25:54.280
of making sure your skill quality is high,
link |
01:25:58.440
that has to be done by our tools.
link |
01:26:02.280
So these skills, just to give the context,
link |
01:26:04.600
they're built through Alexa skills kit,
link |
01:26:06.200
which is a self serve way of building an experience on Alexa.
link |
01:26:11.200
This is like any developer in the world
link |
01:26:12.840
could go to Alexa skills kit
link |
01:26:14.720
and build an experience on Alexa.
link |
01:26:16.760
Like if you're Domino's,
link |
01:26:18.160
you can build a Domino's skill,
link |
01:26:20.040
for instance, that does pizza ordering.
link |
01:26:22.440
When you've authored that,
link |
01:26:25.240
you do want to now, if people say Alexa, open Domino's
link |
01:26:30.040
or Alexa, ask Domino's to get a particular type of pizza,
link |
01:26:35.280
that will work, but the discovery is hard.
link |
01:26:37.720
You can't just say Alexa, get me a pizza
link |
01:26:39.240
and then Alexa figures out what to do.
link |
01:26:42.360
That latter part is definitely our responsibility
link |
01:26:44.920
in terms of when the request is not fully specific.
link |
01:26:48.840
How do you figure out what's the best skill
link |
01:26:51.440
or a service that can fulfill the customer's request?
link |
01:26:56.000
And it can keep evolving.
link |
01:26:57.160
Imagine going to the situation I said,
link |
01:26:59.200
which was the night out planning, where
link |
01:27:00.920
the goal could be more than that individual request
link |
01:27:03.400
that came up; a pizza order could mean a night in,
link |
01:27:08.520
when you're having an event with your kids in the house.
link |
01:27:11.320
So this is, welcome to the world of conversational AI.
link |
01:27:15.200
This is super exciting because it's not the academic problem
link |
01:27:20.040
of NLP, of natural language processing,
link |
01:27:21.760
understanding dialogue.
link |
01:27:23.080
This is like real world.
link |
01:27:24.600
And the stakes are high in the sense
link |
01:27:27.120
that customers get frustrated quickly,
link |
01:27:30.000
people get frustrated quickly.
link |
01:27:31.800
So you have to get it right.
link |
01:27:33.120
You have to get that interaction right.
link |
01:27:35.280
So it's, I love it.
link |
01:27:36.840
But so from that perspective, what are the challenges today?
link |
01:27:41.880
What are the problems that really need to be solved
link |
01:27:45.000
in the next few years?
link |
01:27:45.840
I think first and foremost, as I mentioned,
link |
01:27:50.800
getting the basics right is still true.
link |
01:27:53.240
Basically, even the one shot request,
link |
01:27:56.960
which we think of as transactional request
link |
01:27:58.800
needs to work magically, no question about that.
link |
01:28:01.640
If it doesn't turn your light on and off,
link |
01:28:03.520
you'll be super frustrated.
link |
01:28:05.160
Even if I can complete the night out for you
link |
01:28:07.040
and not do that, that is unacceptable for you as a customer.
link |
01:28:10.680
So you have to get the foundational understanding
link |
01:28:14.080
going very well.
link |
01:28:15.400
The second aspect when I said more conversational
link |
01:28:17.720
is, as you imagine, is more about reasoning.
link |
01:28:20.080
It is really about figuring out what the latent goal is
link |
01:28:24.320
of the customer based on the information I have now
link |
01:28:28.480
and the history and what's the next best thing to do.
link |
01:28:31.320
So that's a complete reasoning
link |
01:28:33.360
and decision making problem.
link |
01:28:35.360
Just like your self driving car,
link |
01:28:37.000
but the goal is still more finite.
link |
01:28:38.640
Here it evolves. In self driving, your environment is super hard
link |
01:28:41.920
and the cost of a mistake is huge.
link |
01:28:46.200
Here, but there are certain similarities,
link |
01:28:48.480
but if you think about how many decisions Alexa is making
link |
01:28:52.600
or evaluating at any given time,
link |
01:28:54.240
it's a huge hypothesis space.
link |
01:28:56.440
And we've only talked so far
link |
01:28:59.720
about what I think of as reactive decisions,
link |
01:29:02.040
in terms of you asked for something
link |
01:29:03.640
and Alexa is reacting to it.
link |
01:29:05.920
If you bring in the proactive part,
link |
01:29:07.760
which is Alexa having hunches.
link |
01:29:10.040
So at any given instance then,
link |
01:29:11.720
it's really a decision at any given point
link |
01:29:15.360
based on the information.
link |
01:29:17.200
Alexa has to determine what's the best thing it needs to do.
link |
01:29:20.120
So this is the ultimate AI problem
link |
01:29:22.520
about decisions based on the information you have.
link |
01:29:25.080
Do you think, just from my perspective,
link |
01:29:27.480
I work a lot with sensing of the human face.
link |
01:29:31.120
Do you think,
link |
01:29:32.320
and we touched on this topic a little bit earlier,
link |
01:29:34.400
but do you think it'll be a day soon
link |
01:29:36.560
when Alexa can also look at you
link |
01:29:38.880
to help improve the quality of the hunch it has
link |
01:29:43.200
or at least detect frustration or detect,
link |
01:29:48.000
improve the quality of its perception
link |
01:29:51.600
of what you're trying to do?
link |
01:29:54.360
I mean, let me again bring back to what it already does.
link |
01:29:57.160
We talked about how, based on your barging in over Alexa,
link |
01:30:01.800
clearly it's a very high probability
link |
01:30:04.960
it must have done something wrong.
link |
01:30:06.600
That's why you barged in.
link |
01:30:08.560
The next extension of whether frustration
link |
01:30:11.520
is a signal or not,
link |
01:30:13.280
of course, is a natural thought
link |
01:30:15.360
in terms of how that should be a signal too.
link |
01:30:18.200
You can get that from voice.
link |
01:30:19.560
You can get from voice, but it's very hard.
link |
01:30:21.320
Like, I mean, frustration as a signal historically,
link |
01:30:25.960
if you think about emotions of different kinds,
link |
01:30:29.720
there's a whole field of affective computing,
link |
01:30:31.480
something that MIT has also done a lot of research in,
link |
01:30:34.560
is super hard.
link |
01:30:35.640
And you're now talking about a far field device
link |
01:30:39.080
as in you're talking to a distance, noisy environment.
link |
01:30:41.960
And in that environment,
link |
01:30:44.120
it needs to have a good sense for your emotions.
link |
01:30:47.600
This is a very, very hard problem.
link |
01:30:49.560
Very hard problem, but you haven't shied away
link |
01:30:51.000
from hard problems.
link |
01:30:51.840
So deep learning has been at the core
link |
01:30:55.280
of a lot of this technology.
link |
01:30:57.440
Are you optimistic about the current deep learning approaches
link |
01:30:59.720
to solving the hardest aspects of what we're talking about?
link |
01:31:03.240
Or do you think there will come a time
link |
01:31:05.360
where new ideas need to,
link |
01:31:07.280
if you look at reasoning,
link |
01:31:09.360
so open AI, deep mind,
link |
01:31:10.720
a lot of folks are now starting to work in reasoning,
link |
01:31:13.880
trying to see how they can make neural networks reason.
link |
01:31:16.600
Do you see that new approaches need to be invented
link |
01:31:20.520
to take the next big leap?
link |
01:31:23.320
Absolutely, I think there has to be a lot more investment
link |
01:31:27.200
and I think in many different ways.
link |
01:31:29.400
And there are these, I would say nuggets of research
link |
01:31:32.400
forming in a good way,
link |
01:31:33.560
like learning with less data
link |
01:31:36.080
or like zero shot learning, one shot learning.
link |
01:31:39.680
And the active learning stuff you've talked about
link |
01:31:41.440
is incredible stuff.
link |
01:31:43.360
So transfer learning is also super critical,
link |
01:31:45.680
especially when you're thinking about applying knowledge
link |
01:31:48.600
from one task to another or one language to another, right?
link |
01:31:52.040
That's really ripe.
link |
01:31:53.000
So these are great pieces.
link |
01:31:55.320
Deep learning has been useful too.
link |
01:31:56.800
And now we are sort of marrying deep learning
link |
01:31:58.880
with transfer learning and active learning,
link |
01:32:02.480
of course, that's more straightforward
link |
01:32:04.640
in terms of applying deep learning
link |
01:32:05.880
in an active learning setup.
link |
01:32:07.000
But I do think in terms of now looking
link |
01:32:12.160
into more reasoning based approaches
link |
01:32:14.240
is going to be key for our next wave of the technology.
link |
01:32:19.440
But there is a good news.
link |
01:32:20.880
The good news is that I think, for continuing
link |
01:32:23.320
to delight customers,
link |
01:32:24.440
a lot of it can be done by prediction tasks.
link |
01:32:27.880
So, and so we haven't exhausted that.
link |
01:32:30.680
So we don't need to give up
link |
01:32:34.480
on the deep learning approaches for that.
link |
01:32:37.320
So let's just say, for
link |
01:32:39.560
creating a rich, fulfilling, amazing experience
link |
01:32:42.600
that makes Amazon a lot of money
link |
01:32:44.240
and everybody a lot of money
link |
01:32:46.400
because it does awesome things, deep learning is enough?
link |
01:32:49.880
To a point, to a point.
link |
01:32:51.120
I don't think, no, I mean,
link |
01:32:52.880
I wouldn't say deep learning is enough.
link |
01:32:54.200
I think for the purposes of Alexa
link |
01:32:56.680
accomplish the task for customers,
link |
01:32:58.440
I'm saying there's still a lot of things we can do
link |
01:33:02.240
with prediction based approaches that do not reason.
link |
01:33:05.200
I'm not saying that, and we haven't exhausted those,
link |
01:33:08.640
but for the kind of high utility experiences
link |
01:33:12.480
that I'm personally passionate about
link |
01:33:14.280
of what Alexa needs to do, reasoning has to be solved.
link |
01:33:18.800
To the same extent as you can think
link |
01:33:21.000
of natural language understanding
link |
01:33:23.560
and speech recognition, to the extent
link |
01:33:25.480
that understanding intents has been solved,
link |
01:33:29.000
how accurate it has become.
link |
01:33:30.120
But for reasoning, we are in very, very early days.
link |
01:33:32.760
Let me ask you another way.
link |
01:33:34.040
How hard of a problem do you think that is?
link |
01:33:36.760
Hardest of them.
link |
01:33:39.160
I would say hardest of them because again,
link |
01:33:42.520
the hypothesis space is really, really large.
link |
01:33:47.520
And when you go back in time, like you were saying,
link |
01:33:50.000
I want Alexa to remember more things.
link |
01:33:53.040
Once you go beyond a session of interaction,
link |
01:33:56.320
and by session, I mean a time span,
link |
01:33:59.200
which is today, versus remembering
link |
01:34:01.880
which restaurant I like.
link |
01:34:03.160
And then when I'm planning a night out to say,
link |
01:34:05.480
do you want to go to the same restaurant?
link |
01:34:07.520
Now you're up the stakes big time.
link |
01:34:09.720
And this is where the reasoning dimension
link |
01:34:12.840
also goes way, way bigger.
link |
01:34:14.720
So you think the space, maybe we can elaborate
link |
01:34:17.720
on that a little bit, just philosophically speaking.
link |
01:34:20.720
Do you think when you reason about trying to model
link |
01:34:24.680
what the goal of a person is in the context
link |
01:34:28.240
of interacting with Alexa, you think that space is huge?
link |
01:34:31.320
It's huge.
link |
01:34:32.280
Absolutely huge.
link |
01:34:33.120
Do you think so like another sort of devil's advocate
link |
01:34:36.040
would be that we human beings are really simple
link |
01:34:38.720
and we all want like just a small set of things.
link |
01:34:41.520
And so you think it's possible because we're not talking
link |
01:34:45.560
about a fulfilling general conversation.
link |
01:34:49.280
Perhaps actually the Alexa Prize
link |
01:34:50.960
is a little bit more about going after that.
link |
01:34:53.360
For a customer, there are so many
link |
01:34:56.080
of the interactions that, it feels like, are clustered
link |
01:35:01.080
in groups that don't require general reasoning.
link |
01:35:06.520
I think yeah, you're right in terms of the head
link |
01:35:09.360
of the distribution of all the possible things
link |
01:35:11.800
customers may want to accomplish.
link |
01:35:13.760
But the tail is long and it's diverse, right?
link |
01:35:18.200
So from that perspective, I think you have
link |
01:35:24.880
to solve that problem otherwise.
link |
01:35:27.680
And everyone's very different.
link |
01:35:28.800
Like I mean, we see this already in terms of the skills, right?
link |
01:35:32.360
I mean, if you're an average surfer, which I am not, right?
link |
01:35:37.000
But somebody is asking Alexa about surfing conditions, right?
link |
01:35:41.680
And there's a skill that is there for them to get to, right?
link |
01:35:45.520
That tells you that the tail is massive.
link |
01:35:47.880
Like in terms of like what kind of skills
link |
01:35:50.760
people have created, it's humongous.
link |
01:35:54.240
And which means there are these diverse needs.
link |
01:35:57.000
And when you start looking at the combinations of these, right?
link |
01:36:01.000
Even if you just take pairs of skills, 90,000 choose two,
link |
01:36:05.440
it's still a big combination.
link |
01:36:07.920
So I'm saying there's a huge amount to do here now.
link |
01:36:11.720
And I think customers are wonderfully frustrated with things
link |
01:36:18.080
and we have to keep doing better things for them.
link |
01:36:20.880
And they're not known to be super patient.
link |
01:36:23.920
So you have to do it fast.
link |
01:36:25.600
You have to do it fast.
link |
01:36:26.960
So you've mentioned the idea of a press release,
link |
01:36:29.800
the research and development at Amazon, Alexa and Amazon in general,
link |
01:36:34.800
you kind of think of what the future product will look like
link |
01:36:36.920
and you kind of make it happen, you work backwards.
link |
01:36:39.680
So can you draft for me?
link |
01:36:42.440
You probably already have one, but can you make up one?
link |
01:36:45.120
For 10, 20, 30, 40 years out
link |
01:36:48.520
that you see the Alexa team putting out
link |
01:36:52.480
just in broad strokes, something that you dream about?
link |
01:36:56.160
I think let's start with the five years first.
link |
01:36:59.840
Right? And I'll get to the 40, the 40 is too hard to take on.
link |
01:37:03.280
Because I'm pretty sure you have a real five year one.
link |
01:37:06.000
Because I didn't want to.
link |
01:37:08.320
But yeah, in broad strokes, let's start with five years.
link |
01:37:10.160
I think the five years is where, I mean, I think of in these spaces,
link |
01:37:13.640
it's hard, especially if you're in the thick of things
link |
01:37:16.160
to think beyond the five years space
link |
01:37:17.960
because a lot of things change, right?
link |
01:37:20.280
I mean, if you ask me five years back,
link |
01:37:22.200
would Alexa be here?
link |
01:37:24.200
I wouldn't have, I think it has surpassed
link |
01:37:26.360
my imagination of that time, right?
link |
01:37:29.040
So I think from the next five years perspective,
link |
01:37:33.160
from an AI perspective, what we're going to see
link |
01:37:37.080
is that notion which you said,
link |
01:37:39.080
goal oriented dialogues and open domain like Alexa Prize.
link |
01:37:42.400
I think that gap is going to get closed.
link |
01:37:45.200
They won't be different.
link |
01:37:46.400
And I'll give you why that's the case.
link |
01:37:48.520
You mentioned shopping. How do you shop?
link |
01:37:52.360
Do you shop in one shot?
link |
01:37:55.680
Sure, your AA batteries, paper towels, yes.
link |
01:38:00.320
How long does it take for you to buy a camera?
link |
01:38:04.120
You do a ton of research.
link |
01:38:05.960
Then you make a decision.
link |
01:38:07.440
So is that a goal oriented dialogue
link |
01:38:11.400
when somebody says, Alexa, find me a camera?
link |
01:38:15.440
Is it simply inquisitiveness, right?
link |
01:38:18.600
So even in something that you think of as shopping,
link |
01:38:20.840
which you said you yourself use a lot.
link |
01:38:23.960
If you go beyond where it's reorders or items
link |
01:38:29.720
where you sort of are not brand conscious and so forth.
link |
01:38:33.160
So that was just one shot.
link |
01:38:34.400
Yeah, just to comment quickly,
link |
01:38:36.080
I've never bought anything through Alexa
link |
01:38:38.000
that I haven't bought before on Amazon on the desktop
link |
01:38:41.120
after I clicked in a bunch of, read a bunch of reviews,
link |
01:38:43.960
that kind of stuff. So it's repurchase.
link |
01:38:45.720
So now you think in even for something that you felt like
link |
01:38:49.360
is a finite goal, I think the space is huge
link |
01:38:52.560
because even for products, the attributes are many,
link |
01:38:56.240
and you want to look at reviews, some on Amazon,
link |
01:38:59.040
some outside, some you want to look at what CNET is saying
link |
01:39:01.920
or another consumer forum is saying
link |
01:39:05.160
about even a product, for instance, right?
link |
01:39:06.840
So that's just shopping where you could argue
link |
01:39:11.600
the ultimate goal is sort of known.
link |
01:39:13.920
And we haven't talked about Alexa,
link |
01:39:15.640
what's the weather in Cape Cod this weekend, right?
link |
01:39:18.840
So why am I asking that weather question, right?
link |
01:39:22.440
So I think of it as how do you complete goals
link |
01:39:27.440
with minimum steps for our customers, right?
link |
01:39:30.000
And when you think of it that way,
link |
01:39:32.360
the distinction between goal oriented and conversations
link |
01:39:35.920
for open domain say goes away.
link |
01:39:38.560
I may want to know what happened in the presidential debate,
link |
01:39:43.160
right? And is it that I'm seeking just information
link |
01:39:45.760
or am I looking at who's winning the debates, right?
link |
01:39:49.480
So these are all quite hard problems.
link |
01:39:53.320
So even the five year horizon problem,
link |
01:39:55.480
I'm like, I sure hope we'll solve these.
link |
01:39:59.760
And you're optimistic because that's a hard problem.
link |
01:40:03.360
Which part?
link |
01:40:05.000
The reasoning enough to be able to help explore
link |
01:40:09.520
complex goals that are beyond something simplistic.
link |
01:40:12.280
That feels like it could be, well, five years is a nice.
link |
01:40:16.520
Is a nice bar for that, right?
link |
01:40:18.200
I think you will, it's a nice ambition.
link |
01:40:21.200
And do we have press releases for that?
link |
01:40:23.680
Absolutely, can I tell you what specifically
link |
01:40:25.800
the roadmap will be now, right?
link |
01:40:28.000
And will we solve all of it in the five year space now?
link |
01:40:32.040
We'll work on this forever, actually.
link |
01:40:35.480
This is the hardest of the AI problems.
link |
01:40:37.880
And I don't see that being solved even in a 40 year horizon
link |
01:40:42.160
because even if you limit to the human intelligence,
link |
01:40:45.160
we know we are quite far from that.
link |
01:40:47.600
In fact, every aspect of our sensing to neural processing
link |
01:40:52.600
to how the brain stores information and how it processes it,
link |
01:40:56.280
we don't yet know how to represent knowledge, right?
link |
01:40:58.960
So we are still in those early stages.
link |
01:41:02.880
So that's why I wanted to start at the five year.
link |
01:41:06.320
Because the five year success would look like
link |
01:41:09.080
solving these complex goals.
link |
01:41:11.200
And the 40 year would be where it's just natural
link |
01:41:14.520
to talk to these agents in terms of these more complex goals.
link |
01:41:18.680
Right now, we've already come to the point where
link |
01:41:21.400
these transactions you mentioned of asking for weather
link |
01:41:24.040
or reordering something or listening to your favorite tune,
link |
01:41:28.520
it's natural for you to ask Alexa.
link |
01:41:30.680
It's now unnatural to pick up your phone, right?
link |
01:41:33.840
And that I think is the first five year transformation.
link |
01:41:36.560
The next five year transformation would be,
link |
01:41:38.760
okay, I can plan my weekend with Alexa
link |
01:41:40.920
or I can plan my next meal with Alexa
link |
01:41:43.640
or my next night out with seamless effort.
link |
01:41:47.800
So just to pause and look back at the big picture of it all,
link |
01:41:51.160
you're a part of a large team
link |
01:41:55.560
that's creating a system that's in the home,
link |
01:41:58.680
that's not human, that gets to interact with human beings.
link |
01:42:02.800
So we human beings, these descendants of apes,
link |
01:42:06.120
have created an artificial intelligence system
link |
01:42:09.000
that's able to have conversations.
link |
01:42:10.960
I mean, that to me, the two most transformative robots
link |
01:42:15.960
of this century, I think, will be autonomous vehicles
link |
01:42:20.520
and conversational agents, but vehicles are transformative in a more boring way.
link |
01:42:23.600
It's like a tool.
link |
01:42:25.400
I think conversational agents in the home is like an experience.
link |
01:42:31.880
How does that make you feel
link |
01:42:33.360
that you're at the center of creating that?
link |
01:42:35.640
Did you sit back in awe sometimes?
link |
01:42:40.320
What is your feeling?
link |
01:42:43.640
What is your feeling about the whole mess of it?
link |
01:42:47.320
Can you even believe that we're able to create something like this?
link |
01:42:50.800
I think it's a privilege.
link |
01:42:52.400
I'm so fortunate, like where I ended up, right?
link |
01:42:57.600
And it's been a long journey,
link |
01:43:00.760
like I've been in this space for a long time in Cambridge, right?
link |
01:43:03.760
And it's so heartwarming to see
link |
01:43:07.040
the kind of adoption conversational agents are having now.
link |
01:43:11.400
Five years back, it was almost like,
link |
01:43:14.480
should I move out of this because we are unable to find
link |
01:43:18.440
this killer application that customers would love
link |
01:43:21.320
that would not simply be a good-to-have thing in research labs.
link |
01:43:26.080
And it's so fulfilling to see it make a difference
link |
01:43:29.160
to millions and billions of people worldwide.
link |
01:43:32.240
The good thing is that it's still very early.
link |
01:43:34.400
So I have another 20 years of job security doing what I love.
link |
01:43:38.240
Like, so I think from that perspective, I feel,
link |
01:43:42.000
I tell every researcher that joins or every member of my team,
link |
01:43:46.240
that this is a unique privilege.
link |
01:43:47.640
Like, I think, and we have,
link |
01:43:49.600
and I would say not just launching Alexa in 2014,
link |
01:43:52.800
which was first of its kind,
link |
01:43:54.400
along the way, when we launched the Alexa Skills Kit,
link |
01:43:57.360
it was democratizing AI,
link |
01:43:59.720
because before that there was no good example
link |
01:44:02.440
of an SDK for speech and language.
link |
01:44:04.960
Now we are coming to this point where you and I
link |
01:44:06.640
are having this conversation, and I'm not saying,
link |
01:44:10.280
oh, Lex, planning a night out with an AI agent, impossible.
link |
01:44:14.560
I'm saying it's in the realm of possibility
link |
01:44:17.080
and not only a possibility, we will be launching this, right?
link |
01:44:19.480
So some elements of that,
link |
01:44:21.480
it will keep getting better.
link |
01:44:23.760
We know that is a universal truth.
link |
01:44:25.600
Once you have these kind of agents out there being used,
link |
01:44:30.120
they get better for your customers.
link |
01:44:32.040
And I think that's where,
link |
01:44:33.880
I think the amount of research topics we are throwing out
link |
01:44:37.640
at our budding researchers
link |
01:44:39.440
is just gonna be exponentially hard.
link |
01:44:41.800
And the great thing is you can now get immense satisfaction
link |
01:44:45.600
by having customers use it,
link |
01:44:47.240
not just a paper at NeurIPS or another conference.
link |
01:44:51.120
I think everyone, myself included,
link |
01:44:53.120
is deeply excited about that future.
link |
01:44:54.800
So I don't think there's a better place to end, Rohit.
link |
01:44:58.040
Thank you so much for talking to us.
link |
01:44:58.880
Thank you so much.
link |
01:44:59.720
This was fun.
link |
01:45:00.560
Thank you, same here.
link |
01:45:02.200
Thanks for listening to this conversation
link |
01:45:04.200
with Rohit Prasad.
link |
01:45:05.720
And thank you to our presenting sponsor, Cash App.
link |
01:45:08.840
Download it, use code LexPodcast.
link |
01:45:11.560
You'll get $10 and $10 will go to FIRST,
link |
01:45:14.680
a STEM education nonprofit
link |
01:45:16.480
that inspires hundreds of thousands of young minds
link |
01:45:19.720
to learn and to dream of engineering our future.
link |
01:45:23.280
If you enjoy this podcast, subscribe on YouTube,
link |
01:45:26.160
give it five stars on Apple Podcast,
link |
01:45:28.160
support it on Patreon,
link |
01:45:29.600
or connect with me on Twitter.
link |
01:45:31.680
And now let me leave you with some words of wisdom
link |
01:45:34.920
from the great Alan Turing.
link |
01:45:37.480
Sometimes it is the people no one can imagine anything of
link |
01:45:41.640
who do the things no one can imagine.
link |
01:45:44.160
Thank you for listening and hope to see you next time.