[00:00:10] >> Hello everyone, and welcome to the SCP virtual lecture series. My name is Rich DeMillo; I'm the chair of the School of Cybersecurity and Privacy. I'm really excited today because the director of research of the Georgia Tech spin-off Pindrop is with us. [00:00:36] Elie Khoury is director of research at Pindrop. He has a master's and a Ph.D. from the University of Toulouse, and he has held senior research positions at a number of companies, including Google. He is currently the director of research for a company that started in... Elie, help me out, when was the company founded? >> 2011. [00:01:08] >> 2011. It still has significant Georgia Tech involvement and has done really well in a very interesting space, something that has become much more important than we imagined at the time the company was launched. Elie is going to talk to us today about voice biometrics and emerging security threats in the voice channel. If you have a question, please type it in the Q&A. [00:01:46] We'll take a look at those questions as they come in. For the most part we'll try to arrange things so that the questions are asked towards the end, but it's possible someone will have a clarifying question in the middle. So I'm going to turn things over to Elie right away. [00:02:06] It's going to be an exciting time. >> Thank you, Professor DeMillo. I'm really glad to be here today, and being at Tech really touches my heart. Yes, I'll talk about voice biometrics and the related emerging security threats. [00:02:29] OK, so we'll talk first about voice biometrics.
Then I'll define the problem a bit and describe the security threats that we see nowadays in the voice channel. OK, so let me start with a quick example that I prepared this morning. You've maybe seen the [00:02:54] memes of Senator Bernie Sanders. Most of them are images and videos, so I thought, let me try to do the same with audio. This is how it sounds. [Audio sample plays.] [00:03:23] I hope you can hear the voice. The voice is really very close to Bernie Sanders's; it sounds the same. And if we use a typical voice biometric engine, it will actually say this is [00:03:47] Bernie Sanders. So this is a fun example, but this could be really very harmful. As a matter of fact, this morning I heard on the news that in China there was a big [00:04:11] amount of money stolen using a deepfake app, where the fraudsters were swapping the face and the voice of some customers, and that was costing millions of Chinese yuan. [00:04:35] So this is becoming quite problematic, and it's a really hot topic in the research community: how to detect not only audio, which is one problem, but also video, which is another problem. You can see, for example, this video. [00:04:58] People tried to put a famous face on that account, and there was also a voice, but luckily the voice is not generated; it's more like an imitation.
So that's really a big problem nowadays. I will come back to that, specifically in the audio space, a little later in the presentation. Here is the outline of my talk. I'll first talk about voice biometrics. [00:05:36] I'll explain a bit the definition and the challenges that we face in voice biometrics, specifically channel variability and speaker variability, and the final aspect will be about voice attacks. OK, so what is voice biometrics? This is my definition of voice biometrics: it's the automated recognition of individuals based on their voice characteristics, which are both biological and behavioral. In the past, people thought those characteristics were mainly biological, but I'm convinced that behavioral is also a big part of that equation. For example, [00:06:24] the habits that people sometimes have actually influence the voice. So, to consider speaker recognition as a biometric, it has to have two main fundamental tenets. The first one is distinctiveness, [00:06:55] which means that someone's voice is really separable from others: it can be distinguished from other voices at large scale, not only one among ten people but among millions of people. The second one is persistence, meaning that the biometric will not change over time. When someone ages, we want to make sure that the biometric that we [00:07:29] extract from someone is persistent. OK, so these are the fundamental tenets of biometrics. Until recently, the accuracy of voice biometric systems was not that high, before people started using deep learning.
Deep learning is where we made a really big breakthrough. I remember when I started on this topic in 2015, I saw the benefit of deep learning and how it improved [00:08:03] accuracy by a large margin. OK, and I'm sure you know about deep learning. Basically, deep learning is a subfield of machine learning, and machine learning is a subfield of artificial intelligence. The beauty of deep learning is that it learns very complex representations from huge amounts of data, [00:08:32] compared, for example, with previous approaches, which have more limitations in their mathematical formulation. That leads to a leap in accuracy when it comes to, for example, recognition tasks. I don't know how familiar you are with audio and speech processing, but let's take a simple example, gender classification. This is one of the, I'd say, simple examples. [00:09:06] Let's use it to give you an understanding of how audio processing works and how we could use deep learning in it. To solve gender classification from speech, we could do it in three different ways. The first one is, [00:09:26] say, you compute the fundamental frequency, and if that frequency is high enough, let's say higher than 170 hertz, it is most likely a female voice, while if it is less than that, it is most likely a male voice. This is the typical, I would say, [00:09:49] approach from 20 years back; that's how people looked at it. You could also do more complex things using learned models, for example a GMM, which in this case is going to be a Gaussian [00:10:08] mixture model. To do that, you need training data from female speakers and from male speakers, and then you extract features.
Features like MFCCs, mel-frequency cepstral coefficients, are typical here. You build a GMM for female speakers and a GMM for male speakers, and then at test time you compute how likely your audio sample is under each model, that is, how close it is to the female model or the male model. [00:10:47] The third approach is using a neural network. Again, you need training data for female speakers and for male speakers, and then the goal of the neural network is to classify female and male speakers. All three of these approaches I would call artificial intelligence, only the last two are machine learning, and only the last one is considered deep learning. Why are we typically favoring deep learning now? Because of its higher accuracy. We could actually achieve accuracy close to 99 to 100 percent on the task of gender classification by using deep learning. [00:11:40] OK, so this is a toy example of how we could use deep learning in audio processing. Let's go back to voice biometrics. There are two main use cases of voice biometrics. The first one is verification. It's a one-to-one operation, the one on the left. It tries to answer the question: is this Matthew speaking? [00:12:12] For example, let's say you call your bank. We have many of the largest banks in the US, and these banks are using our technology to authenticate the genuine user. So, for example, when you call, Matthew, we want to make sure that it is you who's calling, and not a fraudster or another speaker accessing your account by mistake.
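The first, rule-based approach to gender classification described above (estimate the fundamental frequency, then compare it with a threshold of about 170 Hz) can be sketched as follows. This is only an illustration, not the speaker's implementation: the autocorrelation pitch estimator, the 60 to 400 Hz search range, and the function names `estimate_f0`, `classify_gender` and `tone` are assumptions introduced here, and a pure sine wave stands in for real speech.

```python
import math


def estimate_f0(samples, sample_rate, f_min=60.0, f_max=400.0):
    """Estimate the fundamental frequency in Hz from the autocorrelation peak.

    Assumption: a simple autocorrelation search over lags that correspond to
    f_min..f_max; real systems use more robust pitch trackers.
    """
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


def classify_gender(f0_hz, threshold_hz=170.0):
    """Rule from the talk: above roughly 170 Hz is most likely a female voice."""
    return "female" if f0_hz > threshold_hz else "male"


def tone(freq_hz, sample_rate=8000, duration_s=0.1):
    """Synthetic stand-in for voiced speech (hypothetical helper)."""
    n = int(sample_rate * duration_s)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


if __name__ == "__main__":
    for freq in (120.0, 220.0):
        f0 = estimate_f0(tone(freq), 8000)
        print(freq, round(f0, 1), classify_gender(f0))
```

A fixed threshold like this is the kind of hand-written rule that the learned GMM and neural-network approaches replace.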
[00:12:46] So this is called verification: is it you or is it not you? The second one is identification. Let's take the same example of the bank use case. In this case we also want to identify [00:13:05] whether a particular call is made by a fraudster, given that we have in the back end a set of voiceprints of fraudsters. So this is another use case. It's actually a one-to-N [00:13:28] problem, so it becomes more difficult, but with the latest technology, using deep learning, we are able to solve that problem, specifically with a reasonable value of N, which could be on the order of 1,000 or 2,000. Just to give another example of identification: if you're using Alexa at home, this is also an example of identification, where, [00:13:57] when you or your partner speak to Alexa, it is able to recognize who speaks. As a matter of fact, it is able to recognize up to five or six different speakers in the household. [00:14:25] So these are the use cases of voice biometrics. Let's now talk about how voice biometrics works. The typical pipeline nowadays consists of two phases. The first one is the extraction phase, which we see on this slide, and the second one, which we'll see on the next slide, is the prediction phase. [00:14:46] Basically, once you call your bank, for example, what we're going to do is extract the speech portion of your audio. There are a lot of non-speech portions, like silence, music in the background, or a door opening; these are all non-voice, non-speech portions.
[00:15:10] The goal is first to discard all the non-useful information. The second step is, on the remaining speech information, to extract relevant features like mel-filterbank cepstral coefficients or perceptual linear predictive features. These features are typically extracted at a rate of 10 milliseconds, so every 10 milliseconds we have an acoustic event that we extract, and these are the units that are used in the machine learning that comes next. For the machine learning, as I said, we are using DNNs, but in the past, before 2015, people were heavily using GMM-based approaches. Actually, Georgia Tech was one of the pioneers of [00:16:12] using GMMs for speaker recognition. Nowadays most systems are using deep learning. Out of this machine learning system we extract the fingerprint of the voice, or what is called a speaker embedding. [00:16:40] In the past we were using i-vectors; now we call them x-vectors, and there are different variants of them. OK. So this is an example of one of the recent DNN architectures for voice biometrics. The yellow and green image is a spectrogram, or a variant of a spectrogram; it could be MFCCs or PLPs. Then there is a bunch of [00:17:18] layers, followed by a pooling layer, and then one or two fully connected layers. And you see where we extract the embedding at test time. At training time, we train with different speaker labels, let's say thousands and thousands of speakers that we can use in our training.
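The first two steps of the extraction phase described above (discard non-speech, then produce one acoustic unit every 10 milliseconds) can be sketched as follows. The 10 ms hop comes from the talk; the 25 ms window, the energy threshold, and the names `frame_signal`, `frame_energy` and `speech_frames` are assumptions made for illustration. Production systems typically use a trained voice activity detector rather than a fixed energy threshold.

```python
import math


def frame_signal(samples, sample_rate, frame_ms=25, hop_ms=10):
    """Cut the signal into overlapping frames, one new frame every hop_ms."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, hop_len)]


def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def speech_frames(frames, energy_threshold=1e-3):
    """Keep frames whose energy exceeds the (assumed) threshold."""
    return [f for f in frames if frame_energy(f) > energy_threshold]


if __name__ == "__main__":
    sample_rate = 8000
    silence = [0.0] * 4000  # 0.5 s of silence
    voiced = [math.sin(2 * math.pi * 200 * i / sample_rate) for i in range(4000)]
    frames = frame_signal(silence + voiced, sample_rate)
    kept = speech_frames(frames)
    print(len(frames), "frames in total,", len(kept), "kept as speech")
```

Features such as MFCCs would then be computed on each kept frame before being passed to the neural network.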
[00:17:48] So this is on the extraction side; let's move to the prediction side. Once we have those embeddings, we need to compute the similarity between the embedding that is extracted at enrollment time, for example when the speaker is registered, and the embedding at test time, for example when the caller is calling again. [00:18:16] We compute a similarity metric. That similarity could be as simple as a cosine similarity in the i-vector space or x-vector space, or it could be more complicated than that, something like probabilistic linear discriminant analysis. [00:18:35] The end of this process is to compute a score, and if that score or likelihood is high enough, we can say that it is the same speaker; if not, we can say this is a different speaker, an impostor or a fraudster. [00:18:59] So this is the prediction phase of a voice biometric pipeline. Now, how do we evaluate a biometric engine in general, and voice biometrics more particularly? We evaluate it in terms of two main measures. One of them is called the false acceptance rate, which is the probability of positively authenticating an impostor, and the other, the false rejection rate, is the probability of rejecting the genuine speaker. [00:19:34] As you can imagine, there is a threshold that decides how the system works, and if you vary that threshold, you'll get what is called a DET curve, or detection error trade-off curve. This is shown on the right of this slide.
In this illustration we compare two different systems, and the closer a curve is to the origin, [00:20:12] the better the system; in this example, one system is better than the other. When we compare systems, specifically when we talk to customers, they typically ask us, what is your equal error rate? The equal error rate is the point where the false acceptance rate is equal to the false rejection rate. For example, in this figure, [00:20:39] one system has an equal error rate of 7 percent and the other has an equal error rate of 11 percent. Our goal as researchers is to reduce that equal error rate as much as possible, and nowadays [00:20:57] the best systems out there have equal error rates that are close to or below one percent. OK, so let's now talk about the challenges of voice biometrics. The challenges, as I mentioned, are channel variability, speaker variability, and voice attacks, and there are also duration variability and machine bias. [00:21:24] Duration variability is, for example, whether you speak for one second or for one minute; it has a big impact on the accuracy. The more speech you have, the more likely the system will be accurate. I'm not going to cover this in this talk, although we are focusing a lot on reducing [00:21:54] the errors of our systems on short utterances, for example when you say "Alexa" or "Hi Bixby"; these are very short utterances, and we have been working recently, and the research community is working hard, on solving this short-utterance problem.
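The scoring and evaluation steps described above (cosine similarity between an enrollment embedding and a test embedding, then false acceptance rate, false rejection rate and equal error rate as the threshold varies) can be sketched as follows. The function names and the toy scores are made up for illustration; they are not results from the talk. Because the score lists are finite, the equal error rate is approximated at the threshold where the two rates are closest.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def far_frr(genuine_scores, impostor_scores, threshold):
    """A trial is accepted when its score is at or above the threshold."""
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    return far, frr


def equal_error_rate(genuine_scores, impostor_scores):
    """Sweep every observed score as a threshold and return (eer, threshold)
    at the point where FAR and FRR are closest to each other."""
    best = None
    for threshold in sorted(set(genuine_scores) | set(impostor_scores)):
        far, frr = far_frr(genuine_scores, impostor_scores, threshold)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, threshold)
    return best[1], best[2]


if __name__ == "__main__":
    genuine = [0.9, 0.8, 0.7, 0.4]   # toy same-speaker scores
    impostor = [0.1, 0.2, 0.3, 0.6]  # toy different-speaker scores
    print(equal_error_rate(genuine, impostor))
```

Plotting the false rejection rate against the false acceptance rate over all thresholds gives the DET curve mentioned in the talk.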
[00:22:17] There is also another topic that I'm not going to cover in this presentation, hopefully in a future one, which is machine bias. We want to make sure that the system works well regardless of the race of the speaker, regardless of the gender of the speaker, regardless of the age of the speaker. This is also a very important aspect: you want to make sure that the system is fair and is not favoring a particular part of the [00:22:50] population. The topics I'll cover in this presentation are channel variability, speaker variability, and voice attacks. Let's start with channel variability. What does it mean? Channel variability is, for example, when you have background noise. Whether it's an ambient noise or a babble noise, all of these are important, and not only the type of noise but also the level of noise. What I mean by babble is, for example, when there is other speech in the background; [00:23:34] this could influence voice biometrics a lot. Ambient noise is more like, for example, what I have now in my household: the typical ambient noise is an AC or a heater. [00:23:54] There are also other kinds of noise. Music is one of the good examples of noise that could degrade the accuracy of voice biometrics. You can listen to a typical ambient noise. [Audio sample plays.] So this is an ambient noise. [00:24:22] In this particular example, the level of noise is the same as the level of the speech signal, that is, a signal-to-noise ratio of 0 dB. Sometimes it could be lower; it could be minus 10 dB, which means the noise level is higher than the speech level.
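To make the noise levels above concrete, the sketch below scales a noise signal so that the speech-to-noise power ratio hits a target value in decibels, then adds it to the speech. At 0 dB the two have equal power; at minus 10 dB the noise has ten times the power of the speech. The names `mix_at_snr` and `snr_db` and the synthetic signals are assumptions for illustration only.

```python
import math


def power(signal):
    """Mean squared amplitude."""
    return sum(v * v for v in signal) / len(signal)


def snr_db(speech, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_speech / P_noise)."""
    return 10.0 * math.log10(power(speech) / power(noise))


def mix_at_snr(speech, noise, target_snr_db):
    """Scale the noise to reach the target SNR, then add it to the speech.

    Returns (mixture, scaled_noise). Assumes both signals have the same
    length and that the noise is not all zeros.
    """
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (target_snr_db / 10)))
    scaled = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled)]
    return mixture, scaled


if __name__ == "__main__":
    speech = [math.sin(2 * math.pi * 200 * i / 8000) for i in range(800)]
    noise = [((i * 37) % 101) / 50.0 - 1.0 for i in range(800)]  # toy noise
    for target in (0.0, -10.0):
        _, scaled = mix_at_snr(speech, noise, target)
        print(target, round(snr_db(speech, scaled), 3))
```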
[00:24:45] Let's listen to this example of babble noise. [Audio sample plays.] You can hear that in the background there is other speech. OK, so noise is one example of channel variability. Another example of channel variability is reverberation, or echo. Let's listen to this. [00:25:11] [Audio samples play.] We can figure out from the audio that the speaker is far from the microphone, or that the speaker is speaking in an unusual room, [00:25:35] in a lobby and not in a small room. Reverberation also has a lot of impact on voice biometrics accuracy. Another aspect of channel variability is the device. For example, whether I'm speaking through my iMac or through a microphone, these have an effect on voice biometric accuracy as well. [00:26:05] Last but not least is the kind of compression or the kind of encoding. For example, if you speak on the phone channel, there are a lot of different codecs that are used to compress the speech so that it can be transmitted in a very cheap way. One good example that is used in the telephone channel is G.711, [00:26:33] and another example is AMR wideband. If you listen to the audio you might not notice the difference, but if you look at the spectrogram, which, as I said, is used as the input to the feature extraction, you can see that in the high frequencies of AMR wideband the information is actually lost. So this is also an important aspect that could influence the accuracy of voice biometrics.
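As a rough illustration of what a telephone codec such as G.711 does to the signal, the sketch below applies mu-law companding with mu = 255 and quantizes each sample to 8 bits. It uses the continuous companding formula; the actual G.711 standard specifies a piecewise-linear approximation on integer samples, and also defines an A-law variant, so this is a simplified stand-in rather than a conforming codec. All function names are introduced here for illustration.

```python
import math

MU = 255.0


def mulaw_compress(x):
    """Map x in [-1, 1] to [-1, 1], giving small amplitudes more resolution."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mulaw_expand(y):
    """Inverse of mulaw_compress."""
    return math.copysign(((1.0 + MU) ** abs(y) - 1.0) / MU, y)


def encode_8bit(x):
    """Compand, then quantize to an integer code in 0..255."""
    y = mulaw_compress(max(-1.0, min(1.0, x)))
    return int(round((y + 1.0) / 2.0 * 255.0))


def decode_8bit(code):
    """Map an 8-bit code back to an amplitude in [-1, 1]."""
    return mulaw_expand(code / 255.0 * 2.0 - 1.0)


if __name__ == "__main__":
    for x in (0.01, 0.1, 0.5, -0.5):
        code = encode_8bit(x)
        print(x, code, round(decode_8bit(code), 4))
```

The round trip is lossy: the decoded value is close to the input but not identical, which is one small source of the channel variability a voice biometric system has to tolerate.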
[00:27:08] There are also other kinds of variability. For example, whether you're using Alexa or using a phone to call your bank, the sampling rate of the audio is different: for Alexa it's typically 16 kHz, while if you're calling a bank it's 8 kHz. [00:27:32] This aspect is also important, and we want to make sure that the system we put in place works across different channels. One more kind of channel variability that we looked at when the pandemic started hitting is wearing a mask. We saw in a study, which you can also look at [00:28:02] on the Pindrop website, that wearing a mask influences the amplitude of some sounds, specifically in the high frequency band. In the figure on the right you can see, for example, that if you're wearing a mask, the curve is at a lower amplitude than when you're not wearing a mask. I'll come back to this example in the next few slides. [00:28:44] But what we found is that, fortunately, existing systems can handle the difference between wearing a mask or not, and there is little to no degradation in terms of accuracy. So that's about channel variability and the challenges. How do we deal with the problem of channel variability? We can solve this problem at different levels. One of them is to create better training data for your model. [00:29:22] Here we have a pipeline for that.
We call it a channel simulator. From a single utterance of high quality, recorded in very clean conditions, we can create thousands of variants of this audio sample that have lower quality, that is, degraded speech. [00:29:50] For this we have our own noise simulator, our own reverberation simulator, our own acquisition device simulator, and finally a transcoding simulator. By mixing all the simulators together we can build a very powerful channel simulator. [00:30:17] So from one single high-quality audio we can generate hundreds of low-quality audios that can be used to train the system to handle all this variability in the channel. OK, and I mentioned the masks. Here is the case study that you can see at the link [00:30:42] below. We did a comparative study between voice recognition and face recognition, voice biometrics and face biometrics, and we found that voice biometrics is actually more robust than face recognition at handling people wearing masks. In particular, we found that, [00:31:11] for the same false acceptance rate of one percent (I talked about the false acceptance rate on earlier slides), the false rejection rate of a face recognition system can increase to 10 percent, which is a really large increase, while the voice biometric engine still has the same false rejection rate, at about one percent. [00:31:43] So, yes.
That is our work on channel variability, and we're also looking at speaker variability. What do I mean by speaker variability? This is a study that I did back in 2017-2018. I gathered all the weekly addresses of President Obama from 2009 to 2016. What I did is that I enrolled [00:32:22] his voice with the first weekly address, and then I compared this enrolled voice with all the subsequent voices coming from the subsequent weekly addresses. I found that over time the score, the similarity or likelihood measure, is actually decreasing. [00:32:49] This means that this was actually the effect of voice aging. I did the same for other presidents and for another big internal data set that we have. So this seems to be something problematic. [00:33:08] When we work on challenges or competitions led by NIST or other parties, this aspect of voice aging is typically not taken into account, while in real-life systems this is really problematic, and you want your system to handle [00:33:36] the long-term change in voice. Let's listen: "...this administration in the midst of an unprecedented crisis that calls for unprecedented..." So this is his first weekly address, and you can hear that his voice is quite deep. Whereas if you listen, for example, to [00:34:00] one of his most recent weekly addresses, from 2016: [audio sample plays.] You can hear that his voice is less deep, and as a matter of fact, that shows when we look at the fundamental frequency.
[00:34:29] There is a shift in his fundamental frequency. This could be less perceptible to humans, but in fact it's problematic for voice biometrics. So this is something we work on, and we actually have a patent on how to solve that problem [00:34:48] of voice aging. OK. This is one kind of speaker variability that we look at, but there are also other kinds of speaker variability. For example, if someone has a cold, or what we saw recently, [00:35:09] if someone has COVID, there is a change in their voice. We can even recognize whether someone has COVID from the way they cough. Ultimately this has an impact on the voice biometric engine, but in fact we find that the impact is quite limited. [00:35:37] OK, so this is about speaker variability. One other kind of speaker variability: between when you speak right after you wake up and when you speak in the middle of the day, there is a change in your voice as well, and luckily that change is also quite limited. [00:36:02] Last but not least is voice attacks. I think this is something that could be quite important for us. In terms of voice attacks, there are different voice attacks that we are aware of. The most important ones are what we call software-based attacks. This could be speech synthesis, similar to the example I gave earlier, [00:36:34] which is text-to-speech: you type something and you send it to the text-to-speech engine. It could also be voice conversion.
Voice conversion is, for example, when I'm calling the bank and using software to change my voice into a target voice, someone else's voice. [00:37:04] This could be really very harmful if it's done in real time. The third kind of software attack, which we see a lot in our business, is voice distortion. Again you're using an app, but you're using the app simply to change your voice, basically to evade a positive identification. [00:37:34] You're not necessarily targeting a particular speaker. Fraudsters actually use voice distortion to change the pitch, for example to a more female voice or a more male voice. OK, so these are all known as software attacks. [00:38:00] Let me talk a bit about the speech synthesis attack, and how a voice can be generated without the consent of the speaker. As an example, I took a YouTube video of Mark Zuckerberg and ran it through a completely automated system that does [00:38:29] speech activity detection and speaker diarization, and then extracts the speaker utterances that belong to Mark. Then I do speech enhancement on top of it to reduce the noise level, and then there is automatic speech recognition to extract the transcripts, the text, out of this. [00:38:54] Then I have a machine learning engine that first generates a general, universal voice, and then I use Mark Zuckerberg's voice to adapt the universal voice to Mark's voice, and then I can create the target voice for Mark Zuckerberg. [00:39:29] Let me show an example. [Audio sample plays.] I'd say it's quite good. I don't know how it sounds on your end, but the quality is really good.
The voice is quite similar to Mark's voice. I have other examples, like for President Trump [00:40:03] or President Obama. [Audio samples play.] I also have an example from US TV. In all these examples there is no human in the loop; it's all automated. We actually have a paper on that. You can simply provide [00:40:28] a YouTube link to the software and it will do all the processing: voice activity detection, diarization, automatic speech recognition, and then generate the voice, in this case for Ellen DeGeneres. So this is about [00:40:55] speech synthesis. Let me show an example of voice conversion. As I said, voice conversion is converting someone's voice into someone else's voice. In this example I have a source speaker, a male speaker, let's call him Bob, and I also have a target speaker, which in this case is going to be a female voice. [00:41:23] Let's listen to these training voices. This is the source speaker: "...for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket..." And this is the target speaker, reading the same sentence. [00:41:45] So this was at training time; these are used to train the voice conversion between the source speaker and the target speaker. Then, once the training is done, the input of the voice conversion would be this sample from Bob, [00:42:14] the one on the left, and the output is going to be generated by the system. [Audio samples play.]
It's really very close to the target speaker; the voice is very close to the target speaker, and it's all automated, generated by this voice conversion. So imagine this is done in real time, for example while the fraudster is calling [00:42:42] the bank. He's able not only to change his voice and evade positive identification by the voice biometrics system, but he is also able to target the voice of the customer. In this case, I guess it's really one of the most sophisticated [00:43:05] attacks. And we have evidence that this is actually used by some of the fraudsters. OK. So now that I've talked about the software attacks, speech synthesis, voice conversion and voice distortion, let's talk a bit about the attack that is [00:43:38] actually widely accessible: anyone can create this attack, you don't need a PhD for that. For example, you simply take a video of someone and you try to present it to a voice biometrics engine. Of course the application is quite limited: you cannot have a real conversation with an agent using such an attack, but it can spoof, let's say, an automated [00:44:11] IVR system. For example, when you call your bank you go to an IVR first, where it asks you a number of questions about your name, your date of birth, but all of these could be prepared in advance and could be used to create the attack. [00:44:35] The quality could be really very high, depending on where you're getting the audio samples, but in some cases it could be really high quality and could [00:44:53] [inaudible]. And maybe one example I'd like to use here, if you watched this old movie
called Sneakers. In this example the actor is trying to do exactly that kind of attack. He is saying, "My voice is my passport, verify me." It's [00:45:33] a typical example of a replay attack. And again, this is quite cheap. I actually encourage you, if you have an Alexa at home, to record yourself saying something to Alexa and then replay that recording to Alexa, to see if it's actually spoofed or not. [00:45:59] I know Amazon is working on this topic, but when I tried this it was very successful, without any issue. Also, sometimes when I have the TV on and someone on the TV says "Alexa" or something, it was turning on the Alexa device. So this is a kind of replay attack. [00:46:33] I think I have [inaudible] minutes, so I'll try to be fast. The last kind of attack is human imitation. This is also quite funny; we see a lot of these in our business. There are two kinds of imitation that you could think of. The first one is impersonation, where I'm trying, for example, to sound like my wife, which is going to be difficult, but that is one kind of imitation. There is also another kind where I'm trying to hide from the system and evade positive identification. [00:47:20] I don't want the engine to detect that this is actually my voice. So this is evasion, or human disguise. Fortunately, most voice biometrics engines are able to handle this kind of human imitation, and among the three attacks, software attack, replay attack and human imitation, human imitation is the least harmful. [00:47:57] OK, so I'll give you a small
idea of how we solve this problem. We call this voice spoofing detection. We solve the voice spoofing problem by building a system that is trained to distinguish a [00:48:24] genuine voice from, for example, replay, speech synthesis or voice conversion. If you plot the t-SNE, you can see that we are able to separate the different classes here: [inaudible], [00:48:51] the class in blue is the speech synthesis and voice conversion attacks, and the class in green is the genuine voice. So a simple linear classifier on top of these embeddings, on top of this t-SNE, could separate these classes. Again, this was the state in [inaudible]; we had one of the best systems in the ASVspoof [00:49:19] challenge. What's problematic is that those attacks are becoming more and more sophisticated. Specifically, deepfake attacks are becoming high quality, and it becomes quite difficult, not only for machines but also for humans, to detect if they are fake or not. That's why we are investing in that problem more. [00:49:52] We actually look at how good our system is at generalizing to detect new kinds of attacks, and we have a paper from last year about that, looking at the generalization of deepfake detection. OK, that's all. Thanks a lot for
[00:50:17] your time, and if you have any questions... Hello, this is [inaudible], and I'll handle the questions. Thank you very much for the very interesting talk. I have many questions posted in the chat and I'll read them to you. Some of them you may have already answered, because they may have been posted before, but you can just say, "I talked about this." [00:50:47] From the earlier questions: you mentioned diarization earlier, can you explain what it does? Yeah. So diarization is the task of trying to answer the question of who is speaking and when. For example, in this conversation, I'm speaking, professor, [00:51:17] and you're speaking. So the goal of diarization is to identify when each speaker is speaking and also to cluster the speakers. It's actually both a segmentation problem and a clustering problem, and the goal is again to answer the question of who is speaking and when. We're using this in many tasks. For example, you saw it in the deepfake generation example; also, as part of the preprocessing to train a deepfake detection system, we're actually using [00:51:53] speaker diarization. Thank you. Next question: could it discriminate between music or TV in the background versus an actual speaker? Yeah, this is a very interesting question. If you're saying discrimination, I think you're looking at it as a classification problem. So basically you train your system, like a [00:52:32] speech detection system, a speech activity detection or voice activity detection system, which is trying to separate between speech and non-speech, and for the non-speech class, in your training data you add samples from, for example, music or TV noise.
[00:52:59] If you look at it as a classification problem, this is how we look at it: we have a class that is speech only and a class that is anything non-speech, from silence, from music, from TV noise, from babble noise, all the different kinds of noise that you could imagine. We have a system that does that; that's the classification. But then [00:53:19] we also have a speech enhancement system. What speech enhancement is trying to do is, for example, if I'm speaking, right, I also want to filter out the non-speech portion while I'm speaking. So there is overlap between speech and non-speech, there's overlap between speech and music, and in this case we also have a system that does noise reduction or music reduction. [00:53:53] So this is called speech enhancement, and combining speech enhancement and voice activity detection leads to a very powerful system that focuses only on the speech portion and discards anything that is not speech. [00:54:20] Good question, thank you. Next question: how do you combat replay attacks, voice played through external speakers versus live in person? You mentioned replay attacks, but maybe you can talk about prevention. Now, this is really an interesting question. [00:54:49] So how do we tackle that problem? We found that replayed audio, if you look at the spectrum, typically has low energy in the high frequencies. So this is a subtle difference that you could use in training the engine. And as a matter of [00:55:22] fact, actually, we
look at how to detect a replay attack. We actually look again at the spectrum. In this case it's not going to be MFCCs; MFCCs are more directed at the human voice, for voice biometrics or for [00:55:53] automatic speech recognition. Instead we take the raw spectrum, and the raw spectrum has enough information, so we use it as input to a neural network, which could be convolutional, could be like a ResNet. We train it to separate between replay attacks and genuine voice. So we solve this problem by using [00:56:24] the spectrum as input to those neural network systems. Thank you. Next question: can the model deal with AI-generated fake voices, for example GAN deepfakes? Yes. It's almost the same way as we saw for the replay attack. We also have systems, and I'm happy to share some references, some of our papers on that topic, where we use [00:57:03] deep learning together with strong low-level features for this discrimination. We also need a lot of data, not only genuine speech but also deepfake or synthesized speech, and we want a variety of [00:57:33] different engines to train the system. For example, one could be using Google's speech synthesis engine, one could be using Amazon Alexa's speech synthesis engine, and we also have our own synthesis engine. So using all these [00:57:53] synthesis engines to train the detector will lead to a very robust deepfake detection system. [inaudible] reverse engineering how the deepfakes are generated, whether they are generated using GANs or using [inaudible], but we're
[00:58:23] solving that problem by using deep learning [inaudible]. OK, and probably the last question. I'm sorry, if you have time you can stay, but some people will have to leave. What factors does the voice biometrics system take into account when considering speaker variability, like age, time of day, the season? [00:58:49] Yeah, this is an interesting question. I talked about, for example, channel variability: I talked about the simulator and how we create good and diverse training data to have a system that is robust to channel variability. We also have the same for [00:59:25] speaker variability. We actually have enough data per speaker to guarantee variability. Maybe the age factor is not really present; you don't have recordings of the same speaker with a large [01:00:01] variation. But what we're doing, and we have a patent on that, is that at prediction time we try to correct for that factor. Basically we are modifying or calibrating [01:00:28] the score that we get at prediction time from the voice biometrics system, calibrating it to take into account the age of the speaker. So that's basically it. As for the other factors that you mentioned, like time of the day or the season, these are typically handled in the training data, I [01:01:04] think. We're running out of time; are you OK with more questions? Yeah, I'm happy, I can stay.
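The age calibration described in the answer above can be sketched roughly as follows. This is a minimal illustration only, not the patented method mentioned in the talk, which is not detailed there. It assumes the correction is a linear adjustment of the raw verification score by the time elapsed since enrollment, followed by a logistic mapping. The function name `calibrate_score` and the parameters `drift_per_year`, `scale` and `offset` are hypothetical and would have to be fitted on held-out data.

```python
import math


def calibrate_score(raw_score, years_since_enrollment,
                    drift_per_year=0.1, scale=1.0, offset=0.0):
    """Map a raw speaker-verification score to a probability-like value.

    Assumption (hypothetical): genuine scores drop by roughly
    `drift_per_year` for each year between enrollment and test, so the
    same amount is added back before the logistic mapping.
    """
    # Compensate for the expected score loss caused by voice aging.
    compensated = raw_score + drift_per_year * years_since_enrollment
    # Standard logistic calibration of the compensated score.
    return 1.0 / (1.0 + math.exp(-(scale * compensated + offset)))


if __name__ == "__main__":
    # Same raw score, enrolled today versus enrolled five years ago.
    print(calibrate_score(1.0, 0.0))
    print(calibrate_score(1.0, 5.0))
```

With these placeholder parameters, the same raw score yields a higher calibrated value when the enrollment is older, which is the direction of correction the answer describes.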
Thank you. How much speech is needed to generate a good-quality deepfake, like the ones you've shown? Yes, this is an important question. There are systems out there that are saying that they can [01:01:36] generate speech, synthesize speech, from as little as one minute of speech. But to get really good quality, similar to, say, Alexa's voice, you need [01:01:59] tens of hours, let's say 10 or 20 or 30 hours of high-quality speech. The examples that I shared in this presentation, Bernie Sanders's voice or President Trump's voice, were generated with typically [01:02:23] 30 minutes to one hour of speech, so typical YouTube videos. Next question: it is more or less assumed that the voice source is close to the target, but is this a problem when attacks are remote, if the voice is stored, sent over the network and then played by a speaker close to the target? How does that affect voice biometrics? [01:02:55] Good question. This is actually an example of channel variability. In my presentation I mentioned this example where there are microphone artifacts and loudspeaker artifacts; there are also transmission artifacts. For example, [01:03:19] I mentioned the codec, and there's also packet loss. All of these can influence the voice biometrics engine, and typically the way we handle this is by simulating data that exhibits this kind of artifacts. [01:03:45] Also, we need a very strong architecture to handle all this variability and all these edge cases in the training data. Next question.
When translating one person's voice into another, Bob to Alice for example, presumably the ambient noise and echo is stripped out in the Bob-to-Alice conversion. Can a system like yours use the fact that there's little to no ambient acoustics to deduce that Alice's voice is being faked? [01:04:31] Could you repeat this question? When translating one person's voice into another, for example Bob to Alice, presumably the ambient noise and echo is stripped out in the Bob-to-Alice conversion. Can a system like yours use the fact that there is little to no ambient acoustics to deduce that Alice's voice is being faked? [01:04:55] Yeah, this is a great question. We are actually using exactly what you're saying, but for a different purpose: to detect if it's live. We call it liveness detection, or replay attack detection, using that information about changes in the background noise to detect if audio is fake or not. In this particular example it's actually more complicated than that, because having background noise [01:05:39] in the source may degrade the quality of the conversion. So what some voice conversion systems are doing is typically some kind of speech enhancement before doing the conversion. [01:06:03] So it could be a future direction to use exactly what you're saying: the existence or the nonexistence of background noise could be used [01:06:26] to detect such artifacts. Good question. And the last question from the audience: can you identify someone that is drunk or lying? We are... I mean,
my team is not working on this topic, but there is some work in the literature that is looking at this aspect of whether someone is drunk or not. [01:06:57] Typically you want this kind of technology, let's say, if you're driving: you want to detect, not only from the face but also from the voice, if someone is driving drunk or not. I think the literature suggests that this is possible and it works [01:07:20] for detecting if someone is drunk or not from speech. But for the second question, which is if someone is lying or not, I'm not aware of any decent results on that aspect, and it's a very difficult task [01:07:49] even for a human to do, so I don't know how good machines would be at that. Thank you. And then I have a quick question, kind of general: what can you say about the security guarantees of voice-based authentication? Because you mentioned so many [01:08:14] types of attacks, from human to synthesis type. What can you say, that the system cannot be broken given that attackers can do all of that? How hard did people try to break the system, and how convinced are you [01:08:42] that the system can withstand all these types of attacks? Because with password-based authentication, all bets are off if the password leaks or is learned, or there is just brute force; but with voice, it sounds like my voice is everywhere and it shouldn't be that hard to come up with something to impersonate me. [01:09:04] Yeah, so this is a good question. From our perspective, as speech researchers, the research community is actually working hard on solving these problems, and
I think the results in the literature suggest that if you really have the state of the art in voice biometrics, [01:09:49] in speaker recognition, and the state of the art in detecting these attacks, you're already in a good spot. Again, you need to keep the security aspect, the detection aspect, up to date. [01:10:13] It's like a virus and antivirus problem: whenever a new virus appears, you need to add it to your antivirus, you need to update your antivirus to detect it. I see that we should be doing the same, and make sure that the [01:10:45] detection aspect, the security aspect of your system, is always up to date. But I would also say voice biometrics, at the end of the day, is only one factor. We should consider it as a multi-factor problem: use voice biometrics in a multi-factor way, not only voice biometrics but [01:11:10] other factors. Other information could be, for example, [inaudible] validation, or, for example, someone has a PIN or a passphrase. We could use this together with voice biometrics to get to a higher level of accuracy. At the end of the day, I think the best recipe would be to use voice biometrics with other [01:11:50] factors in a multi-factor fashion. Makes sense, thank you, and thank you very much for a very, very interesting talk, and the most questions I've seen in this [inaudible]. It was very interesting. Thanks again. Thank you.