# Accessibility in Voice Experiences
## Presenters: Jeff Mau, Keith Soljacich, and Diana Deibel - Thursday, April 18, 2019
[Source Recording](https://www.youtube.com/watch?v=29SneVTDHBY&t=2s)
**[Jeff]:** So I'm all mic'd up, so hi, I'm Jeff Mau. I work with Keith at Digitas. And Diana has been an expert helper with me on some of the coursework I teach at the Institute of Design related to conversational interfaces. So we thought this would be a fun way to introduce accessibility in voice experiences, with sort of a miniature panel. We hope that we will first share our thoughts on this topic and then open it up to, you know, questions and answers. And it's a very, you know, tricky area to explore, because it's very new. And, hopefully, that's what makes this fun.
So here we are. I just always get ahead of myself with my slides. So there's our pictures. So the main premise of accessibility with voice experiences or conversational experiences is this is how I like to think about it. I have to make sure I don't block the slide when I move around.
What do we do when an accessibility alternative such as voice becomes the primary interface? Historically with the web, you would have a screen reader or some other method of reading the words, if you have a need to hear them rather than see them. But if the primary way is through speech, it kind of inverts itself. I thought, well, that's odd. That's hard. Like, what's going on here? So I'm going to set this up with some questions that hopefully we will attempt to answer.
So the main design question here is, you know, what are the challenges in creating a voice-based experience that's intentionally accessible? That's our goal. Frankly, right now, I think most of the companies making these systems are just trying to build the system.
So, from a human point of view, I think the first thing is it has to be a natural conversation. And it must be in a natural language. And as we all know, even within English, we have dialects about how we speak. So what natural means can be a lot of things. Much less when we get into other languages; many of these systems will just do something translated based on English, and that may or may not fit what's natural to people from different places in the world.
And I wrote these in the "my" format. So I was thinking, this must match my mental model of how things work and what things mean. Any interaction you might have with a voice or conversational-based system will just need to talk or respond in the way that you would talk about it. Your mental model of how something works will be reflected in how you start talking about it, and the steps you take to get something done.
Also, it must adapt to my changing thoughts and goals. So I won't speak for everybody, but I change my mind a lot. Some days I'm very, you know, attention deficit disorder, and other days, I'm very focused. So how does the system adapt to shifting intents, stops, and parameters? That's the way we interact with each other. And if you're ever in a meeting at work or having a conversation with friends and somebody is bouncing all over the place, it can actually be hard to follow. And, so, these systems don't do well with that yet. And we'll talk about that.
And then from a technology point of view, this has been one of the places where the tech has been getting better. But we've been trying to make these systems truly conversational, which means that they understand context, as well as more than just one inquiry at a time. And there's also some expectations. This is popular now; suddenly, everybody has an Alexa at home because they got it for a holiday present. When our clients come to us and say, we should build a voice experience, can you help us? They find out it's too expensive, or things have to be built manually, and it's still very difficult, though that's changing rapidly. I'm looking at Keith, because we had a conversation about this last week where some of the machine learning that's coming out is making this easier.
My voice is shot, because I had a sinus cold so I apologize for the hoarseness and coughing.
So people ask me, is this a user-experience opportunity? Yes. Is it a technology platform, more than just a random, you know, oddity or cool new thing? Yes, it is a platform. And the marketing people we work with often ask us, how do we use this as a marketing channel along with social media and all the other channels that we deal with? And the answer to all three of these is yes, it's all three of these things. So voice is a whole new paradigm, which is what makes it exciting.
But that also means the expectations for the technology and the human experience are really high. Just like the web or social media. So it's going to take some time to get there as well. So that's the set-up, and with that, I'll turn it over to Keith and we'll do the microphone switch.
>> KEITH SOLJACICH: All right. Thanks. I need that too. Hello to everybody in the room here at the Notary and everybody online. My name is Keith Soljacich. I'm a VP of experiential technology at Digitas. I'm the only person in the world with that title. [Chuckles] We like to make up titles, but the reason why we make up these titles, or the reason I have this title, is that my background at Digitas, and I've been there for almost 12 years, is really in the technology that powers the experiences that brands and companies want to get to us, right?
So my background being in that area, I know it's my job to understand these new mediums like voice, like AI, like AR, VR, or wherever else the technology is going: to understand what it is, what it's capable of doing, and then essentially how we can activate it.
And, so, voice is an area at Digitas that we've been exploring for a few years. I think it's incredibly interesting that the Alexa platform which really launched voice into the stratosphere was a hobby of a few people at Amazon which completely transformed the industry. And it's really supercharged voice coming into the mainstream and being a user interface that's accessible. I think that's incredible. At the same time, because it started as a hobby, we're very much in the early stages and Amazon went ahead and put it out as a consumer product and said here's a few things you're able to do.
And I'm proud of them for doing that. It's turned into some great work, some great experiences. But we're still very much in the early, I would say, the crawl-to-walk phase of voice user interfaces. So because of that, accessibility is incredibly important. And like Jeff mentioned, when we create these voice-based experiences, consideration is incredibly important, because when you have a black cylinder in front of you, and you want to interact with it, for the most part, it's not going to be able to tell you what it's able to do.
So you're left to almost guess what you can say to this black cylinder in front of you and hope that you get a response back that you're looking for. Again, we're still in the infancy of the technology. So companies like Digitas, companies like probably some of the ones you work with, are looking to say, how does our brand, how does our company, you know, what is the message we want to get out there, and how do we convey that through voice? It's proliferating, and Google and Google Home are able to do more and more every day because of the developer community and the content creation community that's out there. So we have a platform that isn't going to change too much from an accessibility standpoint. I speak to it. And it speaks back to me and gives me some information, or runs a command that I'm looking to do.
But the capabilities of the platform, they keep changing every day. And you don't have to change that black cylinder in your kitchen. So that is great for voice. But the technology is evolving, and that's also a good thing. Voice is being embedded, and that's what I want to talk about today: the proliferation of voice and accessibility, what that might look like over the next couple of years, and what's happening right now.
So, first of all, you've now got these multi-modal experiences that are starting to proliferate right now. And I put a couple of examples up there. I'm an Alexa user, so I always put Amazon devices up there. But on the top here, you've got the Echo Show, the second generation, and the Amazon Fire TV. And for anyone in the room not familiar with that term, multi-modal means multiple ways of interfacing with the technology.
So voice and video in this case. And what multi-modal will allow you to do is get that visual feedback to your command. Right? So maybe you requested something. Now you can get that visual confirmation back. Or maybe you're requesting a piece of content, or you want to pull up a camera in your house. That multi-modal experience can let you use your voice to command another display. So that's incredibly powerful. And I do believe that the proliferation of multi-modal experiences will really help grow voice in terms of accessibility.
Now you go beyond just speaking to something, but actually seeing something. So you have the capability of using multiple senses to interface or send commands.
Also, in companion to that, we're adding voice to essentially every piece of technology that's out there, and it started probably back when you were able to speak to your GPS in your car. And you say, take me home. And it will pull up that command to take you home on the add-on GPS. Siri and digital assistants transformed the way we use the phone. Again, same piece of technology, but we added voice to it and made it more useful.
And certainly, from a disability accessibility standpoint, this is a huge advancement. You're able to carry around a small device with you all the time that is, again, getting smarter and smarter, capable of doing more things. So adding voice to current technology, that's something that I think is going to continue.
So then you've got something like the ecosystem. And because we're adding voice to all of this technology that we're surrounded by, the good news is, now we're surrounded by devices that are capable of receiving voice commands. And that network, essentially that mesh of where you can speak to something, is getting deeper and deeper.
So I used to have an Alexa just in my kitchen. And now, just in my home, it proliferated to the bedroom, the basement, and eventually to the car and the office. And in retail. So if we proliferate these devices and really blanket our world in this accessible mesh, you'll be able to use voice, if that's your way of interfacing with something. You'll be able to use voice anywhere you go.
So my point here is that we'll get more and more naturally used to using voice to interface with things around us.
And now that we're adding multi-modal experiences and meshing our world with voice-accessible devices, you get into things like voice routines that can essentially be very efficient with your time and energy. Right? So now I've got that device, and the easiest example to give here is the command that I give in the morning when I wake up to the Alexa next to my bed and I say, "Alexa, good morning." And Alexa knows, because I've programmed it, to turn on the lights low in my bedroom. Do not blow out my eyeballs. It's going to potentially start my coffeemaker downstairs. It's going to put those kitchen lights on. Maybe draw the blinds up.
Now, what used to be a multi-step process, I've essentially cut a bunch of time off my morning routine by using voice and these commands that are empowering it. That's incredibly useful from an accessibility standpoint. I just got back a bunch of time where I don't have to walk around and raise all the blinds and turn on my coffeemaker. So these voice routines are something that will also proliferate smart home technology in the car, at work, where you can just say a simple phrase. And it can be so powerful on the back-end that it runs that routine exactly how you've created it.
Now, those can advance into being location-aware. So if I say, Alexa, good morning in the bedroom, it does one thing, or if I say Alexa, good morning in the basement, maybe it does something different. I'm trying to think of these scenarios. But when you have location-specific commands that are intelligent to where you are, or the time of day, or, you know, what you're looking to do, that's incredibly powerful too.
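As a rough illustration of what a location-aware routine like that might look like behind the scenes, here is a minimal Python sketch. The trigger phrases, room names, and device actions are invented for this example; a real platform such as Alexa Routines defines its own configuration for this.

```python
# Hypothetical sketch: map a (trigger phrase, room) pair to a list of device
# actions, so the same phrase can do different things in different rooms.

ROUTINES = {
    ("good morning", "bedroom"): [
        ("bedroom_lights", "on", {"brightness": 20}),  # lights low, not full blast
        ("coffee_maker", "on", {}),
        ("kitchen_lights", "on", {}),
        ("blinds", "raise", {}),
    ],
    ("good morning", "basement"): [
        ("basement_lights", "on", {"brightness": 60}),
        ("thermostat", "set", {"temperature_f": 70}),
    ],
}

def run_routine(phrase: str, room: str, send_command) -> bool:
    """Look up a routine by trigger phrase and the room that heard it,
    then fan the commands out to whatever device controller you have."""
    actions = ROUTINES.get((phrase.lower().strip(), room))
    if actions is None:
        return False  # no routine defined for this phrase in this room
    for device, action, params in actions:
        send_command(device, action, params)
    return True

# Example: run_routine("Good morning", "bedroom", send_command=print)
```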
So right now, like I've said, we've got sort of this between-crawl-and-walk device. It's still very early; it's like the early days of the Internet. But you can see the technology is advancing. The capabilities are advancing. And, you know, everything we pour into this ecosystem will grow the accessibility of the platform for voice. So, great.
>> DIANA DEIBEL: Sorry, guys. There was a tech break there. I'm Diana Deibel, and I'm a VUI designer going on 9 years now. And, so, I'm going to start by disagreeing a little bit with Keith and Jeff. Yeah, [Laughter]. To me, this is not a new technology.
The interface of voice, obviously, is very old. This is the first interface any of us ever had: speech. But also the technology itself is not super new. We've had things like IVR, which are those really fun phone banks that all of us have probably experienced at one point or another when you call your bank, or when you call your airline. For customer service, press 1. So we've had computers that we've been able to call and talk to since the '70s. But what is new is the smart speaker and the proliferation of voice coming into the commercial market right now.
And people starting to accept this as a common interface they might be able to interact with. So, part of that, like when we start thinking about, well, what does that mean for accessibility? Going back to Jeff's original question of: If voice is used as an accessibility tool, what accessibility issues do we need to think about when we're designing for that to accommodate anything we might be running into?
We can actually pull some of the lessons that we've learned from the IVR stuff so we don't have to start from scratch or ground zero. So there's a couple of things to think about. And the way I think about access is a little bit more broad. And the first one is the traditional access question of: Can the intended parties use this regardless of physical differences?
But because we're also talking about conversation, because there is no screen, there's no visual, and because conversation itself is so human, and it's just imbued with our social mannerisms and culture, we can't just leave it at that. We also have to think about access in terms of: does the conversation provide conversational access, and allow for all the intended parties regardless of their background or their cognitive state? So I'm going to go through cognitive, physical, and inclusion. And we can talk about things to think about and ways to solve those.
So we'll start with the physical access. Jeff went into this a little bit: there are some things to think about just in terms of speech when you're thinking about access. The easiest place to start is probably thinking about regional differences. So when you are designing something, you can start with just English. You don't have to take into consideration other languages. If you're thinking about English and you're thinking about the U.S., you still have to think about: what are the accents I'm going to run into? From the technological standpoint, the ASR, which is the automated speech recognition system, that's like the back-end of it, the brain, has been trained on Midwestern speech. So it's a very specific accent. If you veer outside of that, it becomes difficult for traditional ASR with its usual settings to parse specific accents. So if we're designing something for everybody in the U.S., how is the southeastern U.S. accent going to sound? Well, the "no" we say in the Midwest is clipped, but a southeastern "no" is more of a "noooo," which is a longer sound. So from a technical standpoint, that might get parsed out. And I've had several instances where I designed things and what should be a clear no response, the ASR interprets as a yes response, because it doesn't sound like the clipped sound it's trained to listen for. And all these buzzwords, like voice and AI and machine learning, are interwoven: part of voice is AI, which is trained by machine learning, which is thinking about all of that data that you feed into whatever AI you're using.
A human still figures into it. So you, a person, have to feed that data and have to make a choice about the data you're feeding into it. And whatever choice you make is going to influence the way that that brain then works moving forward. So in this instance, the brain has been taught that a southeastern accent is not an acceptable accent, because that's not the data it was trained on. So we have to reteach it. Or we have to go in, and there are settings you can adjust. And full disclosure, I'm not an engineer, but we've got engineers who can shorten or lengthen that and know what to do to fix it. So that's one method of being able to accommodate that.
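As a concrete (and deliberately simplified) illustration of that kind of accommodation, here is a small Python sketch that collapses drawn-out vowels like "noooo" before matching against yes/no synonyms. This is a hypothetical post-processing step layered on top of whatever the ASR returns, not a fix to the ASR itself.

```python
import re

YES_WORDS = {"yes", "yeah", "yep", "sure", "ok", "okay"}
NO_WORDS = {"no", "nope", "nah"}

def normalize(utterance: str) -> str:
    """Lowercase the text and collapse any letter repeated 3+ times to one."""
    text = utterance.lower().strip()
    return re.sub(r"([a-z])\1{2,}", r"\1", text)

def classify_yes_no(utterance: str) -> str:
    word = normalize(utterance)
    if word in YES_WORDS:
        return "yes"
    if word in NO_WORDS:
        return "no"
    return "unknown"  # re-prompt instead of guessing

# classify_yes_no("Noooo") -> "no"
# classify_yes_no("yeah")  -> "yes"
```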
Sorry. I made myself a bunch of anecdotal notes so I can remember. There are also things like English as a Second Language, which has cultural aspects I'll get to later. But even in a speech moment, those accents are going to be very different from other accents. So that might not get parsed as well. You can think about, well, does this need to be oral only? Is there another channel I can use to support this? Iconography works really well when you're dealing with multiple languages. So if you can supplement it with something that has a screen, that has just purely graphics or icons on it, that helps a lot in terms of getting your message across and allowing somebody to give an input if they can tap a hamburger as opposed to saying hamburger. Tones and different kinds of voices also don't always get parsed by ASR. So if people have a thick, raspy voice, that does not get parsed the same way more common voices are. So it's really hard to ask somebody with a thick rasp to speak to something, because the machinery is just not set up to accommodate them. And there's not a lot technically we can do right now with that. I'm sure there are people working on it, but in terms of the design process, it's usually, okay, how can I find a channel to accommodate this if I know my user population is going to contain people with raspy voices?
There are also things like pace. So, when we think about using voice as an accessibility tool for people who might be visually impaired, people who are visually impaired listen to and parse oral information a lot more quickly than people who are used to having visuals in front of them. That's because if we have sight, we rely on it, so we don't have to parse oral information as quickly. But the flip side of that is, if you're going to have something that you're giving out to everybody, you need to have some sort of pace control on that, because people are going to understand things at different paces. So even a simple command like speed it up or slow it down, please, allows somebody to control that in a way that they can then understand. And they don't have to wait around forever if you're talking like this so everybody understands. That's just way too frustrating and not what a lot of us are signing on for.
Especially if we're used to getting information more quickly. And then there are also things like volume. So if you have any kind of hearing impairment, having something that's really quiet can be a problem. However, if you have a hearing impairment and you have a hearing aid in, having something really loud can be a problem. So you want to have something that allows people to control the volume. And also, by the way, any time you design for accessibility for an "edge" case, as people love to call it, it makes it better for everybody. Because something like a volume control or pace control makes it better for anybody else in any other context: someone who already knows the information and wants to get past it, or someone sitting in an office who maybe doesn't want information blasting across the office at them.
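One way pace and volume control is commonly handled is with standard SSML prosody markup, which the major voice platforms support. The sketch below, with hypothetical preference handling, wraps a response in a prosody tag after the user says something like "slow it down" or "speak up."

```python
# Standard SSML rate/volume keywords; how you store the user's preference is up to you.
VALID_RATES = {"x-slow", "slow", "medium", "fast", "x-fast"}
VALID_VOLUMES = {"x-soft", "soft", "medium", "loud", "x-loud"}

def build_ssml(text: str, rate: str = "medium", volume: str = "medium") -> str:
    """Wrap a response in SSML prosody tags, falling back to defaults on bad values."""
    if rate not in VALID_RATES:
        rate = "medium"
    if volume not in VALID_VOLUMES:
        volume = "medium"
    return f'<speak><prosody rate="{rate}" volume="{volume}">{text}</prosody></speak>'

# After the user says "slow it down, please":
# build_ssml("Go to the kitchen and turn on the coffee pot.", rate="slow")
```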
So all of these things are going to make the product better. There are also impediments. I think, kind of obviously, there are speech impediments like stutters and lisps; those often get parsed differently. Something like a stutter, the ASR might cut off because it thinks, I heard that word, I'm done. And the person doesn't get to continue. And you get stuck in this loop of, no, that's not what I meant. Or, I don't understand, can you please repeat?
And also, there are more temporary things that are still impactful, like having a stroke. It's really hard for people to speak when that's happened. So, again, think about what things were like in the IVR context. We're already on a phone. So we can easily have somebody use the keypad; that's where those key tone presses came from, having that quick, easy access and discoverability, which Keith mentioned as a problem: how do I know what to do or how to get places? That's what those key tones were born out of. Even though it's a pain to everybody right now and we want to speak naturally, there are ways you can use the tool you have in hand to provide quick access for people who have problems speaking.
And thinking about the multi-modal interaction, when you have a screen in front of you or a device that you can use to help somebody through that experience, I would caution you not to do a back and forth where you have to answer something in one channel and receive information from another channel. Because there are even things like knowing when the device is listening. So the Echo is a great example. It has the ring around it where there's a little light that comes on.
Great, if you can see it. If you can't see it, then how do you know when the device is listening to you? And this is true regardless of any sort of visual impairment you might have. And that gets into the issues of transparency and privacy and all kinds of other stuff. So really making sure we're protecting people, and giving them the opportunity to interact in the way they want to interact when they actually want to interact, is really important. Especially as we move into this world where we're seeing people starting to get dinged for unethical behavior. Which I personally think is a good thing. But we don't want to be caught in those same situations. So it's better to think about that now: what can you do to think ahead of where this could go, and how can I provide access? It's not just about, yes, giving somebody physical access to something. It's also going to make the product more trustworthy and give them this sort of gated access they might want.
Also cognitive. So cognitive access doesn't necessarily have to mean, and I know people kind of immediately think of, oh, that means you can't understand it, and maybe you have problems with memory, or you have things like dementia. And that's where we need to think about cognitive access. Because our voice users are all over the map in terms of age, in terms of literacy, in terms of language, and even temporary cognitive states, it's important to take into consideration all of the use cases that you're thinking about and the context of those use cases.
So if you're making something like Keith showed, a remote control and a TV: if you're in a household with kids, kids can't read, but they will grab the remote to talk into it. And thinking about, well, first of all, the timbre of a kid's voice is different than an adult's. So people have been working on that and solving for that pitch and being able to parse that.
But the way that kids phrase things is also very different. They go right into the conversation. With zero context. They assume you are, like, on board with them. My son loves history and he's really into cars, and he likes to pick up the remote and say, mud race. Mud race. And we're like, babe, it doesn't know you're talking about cars. You have to give it a little bit more. But these are things, as we're developing this, that are really cool to play with. It also helps you realize, when people are dropping into a conversation, the kind of cognition you're going to need both on the user side and also your system side. You have to think of your system as a person too. And what sort of cognition level does that meet?
And thinking about it from: what sort of language are you giving back? What sort of word choices are you using? From a literacy perspective, yeah, you're not making somebody read it, but if you're using complicated words, whether that's long words with a lot of, I don't know, pomp and circumstance to them, that maybe not every class has access to, that's going to leave people out. If you're using jargon, that's going to leave people out. If you're using slang, that might leave people out, and we can talk about slang in a minute too.
But just thinking about, like, who is the audience and what's the clearest way I can say this? Because saying it the clearest helps people understand regardless of age, regardless of native language, and regardless of any memory issue. If they're super-drunk or stoned and ordering a pizza, you definitely want to be as clear as possible. Obviously, that's temporary. But if you have a pizza app, that's something you'd better consider. So a good way of thinking about this: if you're going to give directions to something, always give them in sequential order. So it's less of a load on the brain to figure it out. Say you have an app and it's telling somebody how to make coffee. And it says, turn on the coffee pot in the kitchen. The person has to work backwards from that. They're going to turn on the coffee pot, but now you've said, in the kitchen. So they have to retrace their steps mentally and figure out that the kitchen is the first piece of information they need to think about, and then go turn on the coffee pot. Whereas, if you say, go to the kitchen and turn on the coffee pot, that's the order in which they would do it, and that's much easier to understand and much easier to remember. And that works really well for young kids. It works well for people with dementia. Keeping things short, sweet, and clear works for everybody. And it really makes it just a lot more accessible.
Pronouns are another tricky one. Obviously, like, context. We want everything to remember the last thing we were talking about, because that's the most natural way we talk about things. And certainly there are things like, if you ask who was the 16th President and what was his favorite band, it knows you mean the 16th President when you say "his." Great. But maybe the next question is, well, did he ever go see them? Now you have two pronouns. Is it clear who "them" is? Who "he" is? And when we start making things more widely accessible culturally and starting to include more people, and there are other pronouns we need to think about, how does that reflect in the speech that we're accepting and the speech we are returning? So think about: is there a necessity for that pronoun? I know a lot of times we want to make everything as casual as possible and make everything feel friendly. But at the cost of what? It's always important to take a step back from the persona you're creating for that system, the fun moment that you're having, and think about: is this actually clear to people? Because if it's not clear to people, it doesn't matter if it's fun or cool, because people are not going to use it if they don't know what's going on. So clarity is first and foremost.
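To make the pronoun problem concrete, here is a toy Python sketch (not any platform's actual API) of the kind of context tracking a system needs before it can safely resolve "he" or "them." If the context doesn't hold exactly one candidate, the safer move is to ask a clarifying question.

```python
class DialogContext:
    """Remember the last-mentioned person and thing so follow-up pronouns can resolve."""

    def __init__(self):
        self.last_person = None  # e.g. "the 16th President"
        self.last_thing = None   # e.g. "his favorite band"

    def remember(self, person=None, thing=None):
        if person:
            self.last_person = person
        if thing:
            self.last_thing = thing

    def resolve(self, pronoun: str):
        pronoun = pronoun.lower()
        if pronoun in {"he", "she", "him", "her"} and self.last_person:
            return self.last_person
        if pronoun in {"it", "them", "that"} and self.last_thing:
            return self.last_thing
        return None  # ambiguous: ask a clarifying question instead of guessing

# ctx = DialogContext()
# ctx.remember(person="the 16th President")
# ctx.remember(thing="his favorite band")
# ctx.resolve("them")  -> "his favorite band"
```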
We talked about acronyms. So I'm going to skip that. Obviously, don't use them. Nobody knows what you're talking about. [Laughter]
So for inclusive access, we can go back and talk about words and slang. So, when you are, again, thinking about that persona and who that persona is, it is really easy to fall into a trap of, well, this persona for my BOT is like a young skater bro. And he's going to say things like, "Hang 10, dude." I realize it's dated. But. [Laughter] But it's great for that, you know, maybe small or large group of people who know what that is. You're making a lot of assumptions about your users, that they're going to understand that phrase. When you do start to use phrases like that, you're creating an inclusive space for people in the know and exclusion for people who are not. So whenever you're using slang in your system, be aware of who you're including and who you're excluding. You don't have to make everything for everyone. Nobody is ever going to say you have to do that. But you need to be thoughtful about it and have a rationale for why you're excluding people. A lot of voice stuff is good for therapy and for addiction, because that's private. You don't necessarily want to get judged by a human, and robots are great for that.
You're not going to make an addiction app for people who are not addicted to things. It doesn't matter what they say, because they're not your user. So, yes, you can make it for everybody, but that's who you're not gearing this towards, and it doesn't matter about them. You can focus on who you're actually going to be talking to.
But it's important that, for all the people you're talking to, you use the language they use. So you can be sure, again, that you're being clear. And even more important, I would say, is the reverse: what are you programming the system to accept? Because when it starts saying, oh, I don't understand you, and you're like, [Sigh] come on, bro, and it doesn't understand what "bro" is, then you immediately say, okay, this is not for you. This conversation is not for you. This is particularly important when you're asking for names. I've had this happen a lot of times where I've been training a BOT on something, and it will recognize pretty much any Anglo name I put in there. The second that I deviate from that and I start putting in any other name, it doesn't understand it. And I have to do a lot of work to train it to understand names of different origins.
That is a fault of the people who built the stuff. But it is our fault if we let it continue. We have to be responsible for finding those errors and fixing them. And there's nothing worse in a conversation than if somebody walks up to you and you're like, hi, my name is Diana, and they're like, okay. That's what it feels like when the BOT doesn't understand who you are. They're essentially slamming the door in your face, saying, I don't want to talk to you. I don't recognize you as a person. Because names are so personal to people. And that's usually how it starts. It's usually, hi, I'm pizza BOT. What's your name? So not a great introduction if you don't take people's names.
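One rough mitigation, sketched below in Python, is to keep a name list that is deliberately broader than the Anglo defaults and fuzzy-match what the ASR returned against it, re-prompting politely rather than rejecting anything unfamiliar. The names and threshold here are just illustrative.

```python
from difflib import get_close_matches

# Example gazetteer: intentionally includes names from many origins.
KNOWN_NAMES = [
    "Diana", "Keith", "Jeff",
    "Saoirse", "Nguyen", "Ximena", "Oluwaseun", "Priya", "Yusuf",
]

def match_name(asr_text: str, cutoff: float = 0.7):
    """Return the closest known name, or None so the dialog can re-prompt politely."""
    matches = get_close_matches(asr_text.strip().title(), KNOWN_NAMES, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# match_name("Priya") -> "Priya"
# match_name("xzqv")  -> None, so the BOT asks again instead of slamming the door
```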
Also, thinking about humor. I almost want to tell everybody, don't ever use humor. But I hate that, because I love humor. And I love being surprised by it. And that's one of the pleasantries of conversation; it's such a blind experience that you're constantly surprised by what's going to happen. That's why we get into conversations: to learn something, whether it's about somebody else, or to make things easier. And, so, humor is a great surprise in that.
And I don't want to take it away from anybody, but I do want to caution you: your humor may not be everybody's humor. So do a lot of testing on who your users actually are and the humor you're using, if you're going to use it. Probably the only thing I've seen to be safe across the board is dad jokes. And not everybody finds those funny. Those could be just lame. [Laughter] So just be really clear about why you're going to do that. And think about the context of your voice system.
If you have something like a healthcare BOT or a finance BOT, those are pretty serious, high-stakes topics for people. It's their livelihood or their life. And having a joke thrown in there might not be appropriate. So, again, even if you want to use some sort of friendliness in it, think about not only who your user base is but the context of the conversation.
And then cultural references are kind of along those same lines: they're really fun to drop and make it feel like you guys are part of the same club and you all know what's going on. But, again, these are things that are from our own perspective that we think, well, everybody knows this. Everybody watched Seinfeld in the '90s. Well, not people who were 8-year-olds in the '90s, and not people who didn't have a TV, or people who just chose not to watch that show because it wasn't funny for them. So you have to think about what sort of references you're going to make, because they may not be obvious to everybody.
And you might be creating that exclusionary circle sort of unintentionally. This actually happened at work. We had a couple of guys in their 30s talking about Commando and how awesome it was. If you're not familiar with it, I was not, it's an Arnold Schwarzenegger movie from the '80s. Commando. And we were like, what are you talking about? How do you not know this movie? No one else in that office knew what they were talking about, but they were so convinced this was the peak of American culture, and no one knew what it was.
Somebody who does this really well is Pretzel Labs. They make Alexa skills for kids, and they're out of Israel. And the woman who runs it has kind of a crazy path: she used to be a writer, and I think she's in psychology or something. Anyway, she has built, and is working with Mindset right now on, this English as a Second Language skill for kids who are in Israel. And it's solving the problem that I think we've probably all experienced if we've ever been in a foreign language class where the teacher doesn't speak the language they're teaching you really well and they have a terrible accent. So when you learn it, you are learning it through them and their terrible accent.
So they thought about, how can we help with this? We have the speech thing here, and that can help deliver the correct accent. So they have this English as a Second Language teacher Alexa in the classroom with them. But they've taken into consideration who they're talking to, which is 12-year-old Israelis. So when this Alexa answers back to them, and they ask, who's your favorite pop star? she might say Ariana Grande, but she may reference an Israeli pop star instead. And that moment might be cool: she gets me. And there are moments like that where you can really bring joy to people just by knowing who your audience is and allowing them to be part of that conversation and creating that inclusive access for them.
So, we've gone over a lot. I know I probably talked way too long. So what now? We've given you a bunch of information. I always like to leave people with a little bit of empowerment. You guys now have this information. You are now the ambassadors of accessibility in all your jobs. I know that can be tough to bring up. But I think you're excited about it or curious about it, and, so, you can use your curiosity to bring up those conversations when you have this at your job. Because I know it's not easy to be the squeaky wheel in the room. But just say, hey, has anybody considered how this might be affected by accessibility? Has anybody thought about, well, what if we did this? What would the outcome be if we tried that? Just posing the question allows people to start thinking about it, and then you're not, like, the negative Nancy. You're just more of the really creative person who thinks about accessibility. So that's it. I think we have time for Q & A now.
[Applause]
>> AUDIENCE MEMBER: In traditional research, a lot of times we use social listening to understand cultural norms and nuances and how people communicate and whatnot. Do you use something similar where you use social listening? Where you can zero in from the occasion perspective with the subtle nuances. So the additional tools we're using [Indiscernible] to zero in on the vernacular of a language and how people are communicating through voice.
>> DIANA DEIBEL: So I'm repeating. The question was: what social listening or what social tools do you use for gathering vernacular and getting language right? So stuff I do is, yeah, I go and eavesdrop on people. I actually learned this at an improv class and I have taken it with me everywhere. Take a notebook. Take a pen and go sit at a bus stop or coffee shop, wherever your users are hanging out, and just write down the entire conversation. Write down the pauses, write down when, like, a bird flies by. Write down the umms, ands, stutters, and anything else. Because that's going to teach you how conversation works.
And how people communicate with each other. You'll see things like non sequiturs: somebody starts talking about something, like the toddler that starts in the middle of the story, and you're like, what? And you'll see how people correct that if they don't know, which is something your system can do. And you can steal these things from real conversation to learn how language works and also pick up specific vernacular. Another thing I tend to do, for clarity and to make sure I'm catching my own bias, is write things out. I will write out a phrase the way I want to initially. And then I will write it out five more times in five different ways to see: is this the clearest, easiest way that I can communicate this? Or am I doing this because I like it, and not because this is the best way to do it?
What about online tools?
>> AUDIENCE MEMBER: [Away from mic] Maybe YouTube, watching video?
>> DIANA DEIBEL: What about online and going into communities? Yeah, I totally use chat rooms and message boards, and YouTube. All kinds. There's all kinds of videos and stuff. It's not quite as good as seeing two people interact. But certainly, if you are trying to just get some baseline stuff, and you don't have other people to pull from, I mean, the most obvious thing is go talk to people, right? That's the easiest one. But this is assuming you already tried that or that's not a possibility for whatever reason. You guys have anything to add?
>> I think that was great. And one other thing you can do is, if you really want to understand, and I live in the brand world so my lens is across brands, but if you want to understand how consumers could potentially be interacting with your brand, we're lucky enough to have a search strategy team. And, so, you can go dig in. And that's publicly available. We're talking with humans who can help us with that. But you can look at what search terms are popular around that, or how people are asking these questions. Like you said, then go validate some of those things you heard against search terms and see if you're actually getting back the results you're looking for. Because people do treat these conversational interfaces like a Google, but using their voice; they're going to structure the request in the same way they structure a search term. So go use the publicly available search terms and search trends.
>> AUDIENCE MEMBER: Thank you so much. This is amazing. I'm wondering if you have [Indiscernible] prototyping.
>> Sure. I can speak to this one. Strategies for prototyping. What are the strategies for prototyping a voice experience? A couple of months ago, it was very limited. Luckily, there are brand new tools out that are publicly available, and there are free tiers that let you prototype a voice experience. We had PullString, which was really great, and then Apple bought PullString. And, so, now we don't have PullString. That was easy enough to go type in some intents and responses and test out. There are more tools out there that fill the void left by PullString, tools that let you go build a prototype conversation interface. I think it's like voiceapps.com or something like that. But there are more tools coming out soon that I know about from the technical platforms, I guess, that are going to make it easier. One of the things you mentioned was how it's long and difficult and cumbersome to build these voice interfaces. And I do see that trend changing over the next year or so, building more easy-to-use interfaces to create these.
>> And in addition, prior to getting into any technology prototyping tools, the equivalent to paper prototyping would be, we're going to squeeze into the video stream here.
>> JEFF MAU: What I love to do in my class at ID is we actually do the Wizard of Oz exercise: conversationally script what you think it should do, have one person be the user and the other person be the system, and attempt to role play. And it's comical and hilarious and fun, and awkward. But it really catches a lot of the natural tendencies that we've all been trying to highlight here, just in the attempt to impersonate the system. And then from there, that can help develop what those intents are and what those queries might be, and get more technical.
But, yeah, that would be in addition to interviewing people in real life to hear how they speak; you can also parlay that into your script, or how you think the system should behave, rather than function. Both are true, but I like to think about the system behaving like another person. And I think that's the fastest attempt. And then it's also fun to record yourself with video to go back and watch, improving through practice of that Wizard of Oz exercise, to get a sense of maybe unique or detailed situations that you wouldn't see in the moment. Because we don't remember everything while we're focusing on practicing that experience. So that would be where I would start.
>> DIANA DEIBEL: I have one more. I'm just going to give you one more, which is Dialogflow. It's Google's platform. It's a little complicated to use. It's gotten better. But it has the ability where, and it's completely free, you input whatever the conversation is that you want it to have. And then, right next to it, you can just click on the microphone and speak into it, and have that experience without having to click through each thing where it takes you out of the conversation. It doesn't really give you the usability testing that I think we're often looking for, even just at that first level, once you've gotten past Wizard of Ozing. Also, if you put the Google Assistant app on your phone, you can do it on your phone. So pull up the test and you can go through it there. So you can not only do it yourself, and just be in that same user mindset, but you can take it on the road and test it with some people.
>> AUDIENCE MEMBER: So, kind of like in a digital interaction, where it's in front of your face and you're reminded about it continuously, but I found that, like, in verbal interfaces, the approach has been to send an email to say, oh, here's all the new things that your website can do or whatever. And I never had any interest in looking through that and really, like, thinking about it. And it seems like it's a difficult problem. I'm kind of curious what your take is on that. Or the features you chose to use once and it didn't work.
>> DIANA DEIBEL: Discoverability. So the question is: in a visual interface, when you have a new feature, it can show it to you right away, so how do you replicate that in an oral interface? So, discoverability is the biggest issue in voice. It just is. [Laughter]
And the nut has not been cracked. So clearly, a lot of times there is a multi-modal approach, because we know people are not going to sit and listen; like, it is so burdensome to have anybody give you a list about anything. Especially when you're coming into Alexa or Google or any sort of playful smart speaker situation, you usually have something on your mind. You're not just wandering over there; you have a task in mind. So it's going to impede the experience by saying, hang on, wait a second, let me tell you about all the features I've got. So the email is the least burdensome way to do that. With that said, that is something that I think every designer or developer is currently working on: in addition to going to a different channel, how else can we do this?
>> JEFF MAU: I would add to that, if you remember the early days of Google, the expectation was you could search for anything and you would get a result. And then over time, there became certain patterns of more than the result, but an answer for certain kinds of queries. So, they were first trying to shorten up and pull in data to give you an answer instead of sending you along to a website.
So now, if you ask for the Cubs score, it gives you a card or whatever they give you; they change the patterns quarterly. And, so, they're starting to do that with voice as well. But because it's so early, crawl/walk days, we're still learning what the patterns are. Google has been around for 20 years? I'm looking to Keith. 30? 40? 50? No, it's not that old. But it took them a long time to mature into those patterns. And there's no type-ahead autocomplete in voice. But when that first came out, it was like the biggest thing since sliced bread.
So if you can come up with the type ahead for voice, put a patent on it, because you'll make a lot of money. Keith, did you have anything to add?
>> KEITH SOLJACICH: You know, there's a difference with a voice experience that, when you walk into it, can contextually know what you're doing. So speaking to a GPS, I'm only going to ask for directions. But you mentioned the smart speaker is this open-ended, essentially Google clone. And the way that these platforms kind of started off, they said, here's a skill. Here's something we're capable of doing, because another developer created that. And when you invoke the skill, you know, the best practice, as we said, was to say, here's what I'm able to do. Here's how I'm able to help you. The nature of the smart speaker is changing a little bit, where the goal for the Alexa and the Google is not to have you invoke something and then understand what the features are. It's that any topic off the top of your head that you need to ask or request or know something about, they have a skill, or an action, or something, and really quickly they know that skill can answer that question. And they're making that connection on the back-end. So really, what they want that to be is an open-dialogue smart speaker that you should be able to ask anything.
And, again, we're not there, but this idea that maybe there's features that are coming, hopefully, those features kind of fade off and you're just able to ask it anything. And it will find the right thing to match with what you need. So, it's flipping on its head a little bit in that way. And maybe hopefully we don't have to announce features. That will all be there.
>> JEFF MAU: Google is 20 years old.
>> AUDIENCE MEMBER: [Away from mic]
>> DIANA DEIBEL: There were just some guidelines put out by researchers from Microsoft. They basically took a bunch of, sorry, the question is about A.D.A. compliance for voice.
So, there's a Medium article on it, and I can send it to Karen here and she can put it up on the site if you're interested in it. But basically, that's the only research I know of where they spent some time not only combing through existing evidence, but also taking the findings of that existing evidence and testing them out again. So they did pretty comprehensive research. And then made a list of kind of the best practices and guidelines for voice in general, which then includes some accessibility things. It's not purely for accessibility.
>> JEFF MAU: I'll add to that. So my short addition would be, with my setup slides, some of the questions I framed are applicable to any accessibility thought, and I tried to tailor them specifically to voice. But that was from working on attempting to hit AAA compliance on my current web project. Which is hard, I'm sure you guys are aware.
So I think that when you really get into the W3C, WCAG accessibility guidelines, some of those things can be, like, revised specifically for this. And maybe we should do that. I don't think anyone's really done that yet. Unless you guys know more than I do.
>> AUDIENCE MEMBER: [Away from mic] So I know the conversation around [Indiscernible] the Silver accessibility guidelines. So I know people are talking about how to address emerging technologies. And there's a consensus amongst the working group that they want the guidelines to be broad enough that they can [Indiscernible]. [Away from mic]. So I don't think it's specifically about voice user interfaces, but there could be something unearthed through voice user interfaces [Indiscernible]. [Away from mic].
>> JEFF MAU: That's great.
>> AUDIENCE MEMBER: We can take a few more minutes. Unless you guys are wrapping up?
>> JEFF MAU: There's one hand over here.
>> AUDIENCE MEMBER: [Away from mic] [Indiscernible]
>> JEFF MAU: So I'll start with a response and you guys can add. The question is, to summarize, how do you test voice experiences? You can apply usability and user-testing methodology to this, and adapt it, especially for anything task-driven and the outcomes you're trying to make sure that people can achieve with your solution. But instead of in front of a computer or a phone, you're sitting down with Alexa in a room, or the phone in the wild, depending on the conditions you want. But it always goes back to: what does success mean for your skill or your experience, and can people accomplish the task that you're trying to support? Honestly, I don't think it's that different. But considerations for success would be to tie it back to a lot of what Diana covered: physical access, cognitive access, inclusive considerations. Because those could be criteria for success.
>> DIANA DIEBEL: In terms of the actual, like, how do you test the thing? The prototyping stuff that we said earlier is usually what we use. So we'll do like a Wizard of Oz. And then we'll do something using the actual medium that we can. I think with skills, if I remember correctly, you can launch it there in a test version. You don't have to put the full thing out.
Actions on Google are different. They won't launch the thing onto Google Home until it's, like, done. But you have the Google Assistant app where you can open up the test app and do it in the context of that. And you can do that remotely if you need to. Like, have the person, as long as they have a Google Home or they have the app on their phone, you can do it through that. And then just record it like you would anything else.
And then most of the learning, honestly, unfortunately, happens when you launch it and you start to look at the reports you get back. And those are going to tell you exactly where people dropped off. And it won't tell you why, but if everybody drops off at the third question, clearly I've offended somebody with the third question, or something happens after the third question that's not working.
So you need to look at those areas. You might also get a list from a place like Amazon; they will just give you a dump of utterances people are saying. They don't organize it for you, so you don't know where they said it in the conversation, and that's only mildly useful. But at least you have the stuff people are trying to say. So sometimes you can map that back and see where you might need to include other things and iterate on that. And depending on the system, other voice platforms will give you full transcripts, or you can listen in if you record the conversations. That goes into a whole other question of should you record the conversations, should you listen in? But from a testing perspective and improving the product, that's the best way to learn how to make it better. Because you can actually hear what people are doing, when people get mad, or get annoyed, or get distracted in the background, and how the ASR is parsing all that stuff when you're listening in.
>> KEITH SOLJACICH: And I can speak a little bit to the horror stories around voice testing, and something you learn while you're developing on the platform: it's quite different from the traditional web. So, for the traditional web, you can write a QA script for my tester that says: you're going to use this device on this screen and this browser, and you're going to use this, and this, and this to find a gray men's T-shirt, for example. And every QA tester should be able to run that test script exactly the same and, hopefully, give the exact same results. Okay. Great. That's the web.
Now, voice. [Laughter] So for voice, every time you speak to that smart speaker, it's going to hear you differently. It's going to hear you differently for a couple of reasons, and we highlighted a lot of that earlier: slang, dialect, you know, all the things about how your voice comes out. But not only that, you've also got proximity to the smart speaker. So we had an anecdote on a voice skill we were developing, where our QA tester tested everything with the Alexa device right next to him. And it was great. And we got great results. It was always hearing him clearly, and he was in a really controlled environment. And then we had user acceptance testing where we had more people testing in a room. And it was failing, failing, failing, and what we realized was that they were standing further from the device than our QA tester was. And it was picking up wildly different responses, or the natural language it was picking up was wildly different. So we learned from that that you've got to test smart speakers from different distances.
And then you just have to test and test and test. Like we said, collect that data. It's so important, all the testing data you collect when you're testing a voice application, because in the end, you're going to have to train the system on all of the fails. And you can't eliminate all the fails, but you want to eliminate the most common fails and redirect those back into successful outcomes, which you can do in the voice user interface. So it's quite a task, but, yeah.
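The "find the most common fails" step Keith describes can start as simply as tallying the unmatched utterances from whatever dump you get back. A minimal sketch, assuming a plain text file with one failed utterance per line (the file name and format are made up for this example):

```python
from collections import Counter

def top_failures(path: str, n: int = 20):
    """Count the most frequent unmatched utterances so you know what to train next."""
    with open(path, encoding="utf-8") as f:
        utterances = [line.strip().lower() for line in f if line.strip()]
    return Counter(utterances).most_common(n)

# for phrase, count in top_failures("unmatched_utterances.txt"):
#     print(count, phrase)
```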
>> JEFF MAU: Apparently I'm itching to say one more thing. And if you have any more questions, we're happy to answer. Otherwise, maybe this is good. One more example, which is slightly different. So I mentioned I teach a class related to this at the Institute of Design. And some of my students started a start-up called "Pepper," and it's a skill that works in the kitchen that helps you weigh the ingredients to create a meal. And then it uses Alexa to help walk you through the recipe.
So it's a cool story because of what they made, but it also isn't just about weighing things; it's about helping you eat healthy, and deal with medical conditions, or, you know, certain recipes that you might need to make. And, so, they actually built an Alexa on a Raspberry Pi so they could build it into their prototype scale made out of foam core, so they could stick it in their kitchen and make it as realistic as possible. And then over time, they iterated on the physical device, but they actually had to think about embedding Alexa in their product design. So that's a different form of user testing.
And it also helped them figure out how to build the database to make the recipes work. Because guess what? None of the data from the FDA or anywhere else is formatted for these kinds of environments. The APIs are not ready. So they had to build their own custom API, and it was an iterative cycle, which was my point. And it's also a cool product. Check it out. I'm going to do a product plug. But that's an example where it really doesn't work in an office environment at your desk. It really needed to be in someone's kitchen, the final destination of the intended use.
>> All right, that brings us to the end of the presentation. Thank you so much Jeff, Keith, and Diana.
[Applause]