[00:00:10]
>> My background is in circuit design, so I am a digital/analog kind of circuit person, and before I came here I used to work at Intel. With some of the projects that we had with Intel, the folks there pointed me to some of the security implications of things we were doing, and that got me interested in security and cybersecurity. In the last few years there has been a decent amount of activity in my group in this area, so I am really thankful that you have asked me to come here and talk to you, and also to learn from you.
[00:00:48]
The title of my talk is machine learning for profiled attacks. I'm not going to go into a lot of mathematical details; I'll try to provide you with more of an intuitive understanding of the kinds of things we are thinking about. In the last part I'm going to talk about some of the countermeasures, particularly for the physical side channel, and mostly about some of the circuit techniques we're working on and the interesting results we have had recently.
[00:01:19]
Before I start, some acknowledgements. This is ongoing work with my students; one student in particular worked with me for the last couple of years, graduated a couple of years back, and is now at Qualcomm, and he did a lot of the initial work. A lot of this work is in collaboration with Professor Sen's group and his students, as well as, as I mentioned, with Intel.
[00:01:42]
So, what is a side-channel attack, for those of you who are not familiar with the concept? The whole idea of such an attack is to compromise the security of an encryption engine from a distance. Imagine you have a computer like this one running some sort of crypto code; it is doing encryption, and you assume there is a secret key stored on the processor. Since it is stored on the processor and, as an end user, I don't have access to it, it is a true secret. Unfortunately, that is not the case while it computes: for fundamental reasons, intrinsically tied to device dynamics, whenever you compute there is going to be some sort of leakage, and that leakage happens through physical means like power and electromagnetic radiation, and sometimes light and so on. The whole idea of a side-channel attack is to try to figure out from a distance what the secret key was, by correlating the external physical signatures with the exact computation going on on the chip. This has been a problem for many of us who started our understanding of circuit security from a purely digital point of view.
[00:03:03]
We were very skeptical of it at the beginning, because there is already so much noise; how can you ever figure out what's going on in a chip just by looking at a faint signature of radiation or power traces? But unfortunately there is this old saying:
[00:03:19]
Statistics wins over subtlety. Even if the signal is very subtle, eventually statistics will catch up and give you whatever information you're looking for. So that, in short, is a physical understanding of side-channel security mechanisms. Computationally, cryptographic algorithms are secure; let's stick with AES-256. It is secure: you should not
[00:03:41]
be able to break it by brute force or any other heuristic method from a computational point of view. But unfortunately, as I said, even if you have a computational engine that is secure, it is implemented, at the end of the day, on a piece of hardware, which means there are transistors and there is switching going on, and each of these leads to some sort of physical side-channel leakage. As I said, it can be through electromagnetic radiation; everything radiates, and when there is switching there is radiation. There are FCC rules on how much that radiation can be, but if you're close enough with an antenna you can actually pick up the radiation and figure out exactly what's going on in the chip.
[00:04:19]
There is of course power consumption, which is a very strong signature: the power coming into the chip has a direct correlation with the amount of switching going on and the exact computation being performed. And then there is timing. If you look at the power coming into the chip, you will find very interesting timing information, because every time there is switching there is a burst of current, so if you statistically analyze the current going into a chip, it carries a lot of information not only about the kind of computation but also about the timing associated with it, when what is happening. People have been using this for other problems as well; very recently there was this whole big discussion where it was shown that if you just listen to the sound of hard disks you can figure out what applications are running. It's a lot like hearing it with your ears, but with a good microphone that knows what it is looking for, just from the sound you can actually figure out what kind of computation is going on or what kind of application is running. So it's not only
[00:05:25]
power on the machine; even acoustic signatures have a very strong correlation with what's going on, with the kind of application you're running. So, with this as background, I'm going to talk mostly about power side-channel attacks today, because that's what I've been mostly working on, and a little bit on EM, but I don't want to go deep into
[00:05:41]
EM today. It's mostly the power side channel, where the leakage and the signature are very strong, so that's what we're interested in. Just as a quick intro: classical crypto security relies on a mathematical abstraction; it is about a mathematically secure process.
[00:05:58]
DES was successful, right, then AES-128, and eventually AES-256, which is kind of the de facto standard today, not only in the commercial market but also in the military markets. In the last few years, many of these, although mathematically secure, have been successfully attacked using physical means, which has become a bigger problem now that we often do not have access to the machines. Say you are doing some secure computation and you trust the cloud, but the cloud machine is in some other country and you don't have access to that particular machine, while there are people who do have access to it, looking at side-channel traces and trying to figure out what you are doing. Because today we're not computing on our desks but on the cloud, many of these problems have been aggravated: the user does not necessarily have access to the machine, but other parties might have physical access to it, and people have tried to exploit that in the last few years. This has started becoming a real security threat from a commercial point of view.
[00:07:06]
So if you do a very simple back-of-the-envelope calculation, you can very easily figure out that the main problem is the way the hardware is implemented. Think of the crypto engine as a black box that consumes power because it is switching. Those of you with a circuits background will understand that when we implement something in complementary CMOS, when there is switching we draw current from the supply, and when we are not doing anything we draw almost no power. So the switching, which is related to the intrinsic computation, is related to the current drawn from the supply, which is intrinsically connected to the side-channel information leaking out of the chip. Now look at the complexity of the attacks. In a brute-force attack on, say, AES-256, the key space is so large that it would take more than the age of the universe to converge. But in a side-channel attack, because of the way the hardware is built, it does not operate on all 256 bits at once; it operates byte by byte, so you are attacking a much smaller space that goes from 2^256 down to 2^8 per byte, and the leaked information is captured in that 2^8 space, which makes the whole thing very easy to break, in some sense.
[00:08:34]
And then you break it with a divide-and-conquer kind of approach: you break each of these bytes one by one. You break the 1st byte, then the 2nd byte, and so on through all 16 bytes, and you're done: you have recovered the key. You can do that because you have access to the plaintext.
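To make the divide-and-conquer arithmetic concrete, here is a quick sketch; the 16-byte key follows the byte-by-byte framing above, and the numbers are purely illustrative:

```python
# Brute force must search the whole key space at once; a byte-wise side-channel
# attack guesses each key byte independently against the leakage, so the search
# collapses from 2**128 candidates (for a 16-byte key) to 16 * 2**8 hypotheses.

KEY_BYTES = 16                             # illustrative 16-byte key, per the talk
brute_force_space = 2 ** (8 * KEY_BYTES)   # ~3.4e38 candidates: infeasible
divide_and_conquer = KEY_BYTES * 2 ** 8    # 16 bytes x 256 guesses each = 4096

print(brute_force_space)
print(divide_and_conquer)
```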
[00:08:50]
In most cases, or sometimes, you also have access to the ciphertext. One of the operands is known, the plaintext, and the key is not known, so you send a known plaintext and use inference to figure out
[00:09:05]
what the stored key is, by looking at the power signature and making a correlation. At the end of the day, you are building a matrix that correlates your power signature with the secret key stored on the processor. And as I said, the main problem here is that you have taken a 2^256 complexity and, in the physical design, it goes down to 2^8, because of the way the key engine has to be implemented.
[00:09:32]
So look at the power signatures. This is actually a measured power trace from a processor, not a big microprocessor but a small one, a microcontroller, running AES-256, and this is just a power trace measured from the PCB, the printed circuit board. If you look at it, initially there is this idle period, then there is I/O when it is bringing in data, then a scheduler phase when it is scheduling all the instructions, and then there are the 14 rounds of AES going on, and you can very clearly see a pretty periodic pattern in the current signature. It may look random, but it is not; if you look closely you'll see that there is a clear periodicity and also cycle-to-cycle differences, so there is a lot of information embedded in this very simple power trace that you capture from the board. Then you can see that we go back to the I/O when the computation is done and you push the data out of the chip, and it goes back to the idle period. So you can very easily see that a very
[00:10:33]
simple set of tools, a very simple oscilloscope or even a reasonably high-performance multimeter, and access to the board can give you all this interesting information about what's going on, supposedly securely, on the chip. Given that, the next question for a side-channel attacker is: how do you correlate this current signature with the secret key that's stored? There are a couple of ways of doing that. People have tried both power and EM; EM has lower signal-to-noise ratio, so an EM attack, where you come close to the chip with an antenna, is a little harder, but power is reasonably easier, and it can crack the key in a
[00:11:22]
few thousand iterations. Many of these attacks are based on the fact that the crypto function is easy to compute but hard to invert; that's the whole idea of most of this crypto: these are not invertible functions. But if you do a physical attack, you don't have to do that inversion anymore. In mathematical terms, you're just looking at the feedforward part of the design: you look at the signature, and you do not have to invert it, so the problem space becomes easy. If you do that, there are 2 basic ways of attacking: one is called a profiled attack and the other a non-profiled attack. Non-profiled attacks have been pretty successful so far, and they work beautifully. A non-profiled attack, in summary, is something like this: you buy a chip, you want to find out what the secret key is, so you run AES on it repeatedly, many many rounds, and depending on the signal-to-noise ratio, how much noise there is in the lab, and how good your measurement equipment is versus how careful the hardware designer was, you go through sometimes 8000, sometimes 20000, sometimes 40000 traces. As you go through multiple traces, you are also creating what is called a power model: a correlation between the power that you're seeing
[00:12:49]
and the switching that's going on on the chip. Once this correlation matrix is constructed, and you have successfully constructed it, then you have a clear method of looking at a power trace and figuring out what the secret key inside the chip is. The way to do it is essentially, as I said:
[00:13:09]
you do the same thing over and over again, and in the process you reject noise. The noise being zero-mean and its statistical properties being well understood, and given that the process is more or less stationary, you can eventually figure out what the key is. So these non-profiled attacks have been widely successful. There is correlation power analysis, CPA, which is one of the most powerful and best documented, and there is also differential power analysis, DPA, which is also pretty powerful, and people have done a lot of work on that.
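The zero-mean-noise averaging argument can be checked numerically; this is a purely synthetic sketch, where the signal value and the noise level are made-up assumptions:

```python
import random
import statistics

random.seed(0)
SIGNAL = 5.0  # hypothetical leakage value at one sample point

def one_measurement():
    # the same computation repeated; each measurement is corrupted by zero-mean noise
    return SIGNAL + random.gauss(0, 1.0)

# spread of single measurements vs. averages of 100 repeats each:
# averaging N zero-mean-noise traces shrinks the noise std by about sqrt(N)
singles = [one_measurement() for _ in range(500)]
averaged = [sum(one_measurement() for _ in range(100)) / 100 for _ in range(500)]

print(round(statistics.stdev(singles), 2), round(statistics.stdev(averaged), 2))
```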
[00:13:41]
The advantage of CPA and DPA and all of these is that you don't need any prior knowledge; that's why each is a non-profiled attack. You take the chip, you run it through the statistical analysis, and you eventually figure out what's going on. But it takes time, right? The hacker needs access to the device, and sometimes, if it's a few 1000 traces, it'll take, let's say,
[00:14:04]
tens of minutes; if you push it to 800,000 traces or a 1,000,000 traces, you may need access to the device for at least a couple of hours, 3 to 4 hours, depending on how many traces you need. So, in terms of pros and cons, you don't need prior knowledge, but you need time. One of the questions that is becoming more interesting, particularly from defense companies and security companies, is:
[00:14:30]
if an attacker has access to a device for only a few minutes, what can he learn from that? That gave rise to this whole idea of profiled attacks, as opposed to the non-profiled attacks I just talked about. In a profiled attack, what you do is you create a
[00:14:49]
profile of your chip: you gather prior knowledge of your chip or your hardware, and you use that prior knowledge to hack into something you have never seen before, like a new chip. That is called a profiled attack method. In terms of methodology, there are kind of 2 ways to do it: one uses the plaintext x, the other one the ciphertext.
[00:15:13]
You control the plaintext, and you look at the correlation between the plaintext and the power trace; you can also do the same thing with the ciphertext, looking at the ciphertext and the power signature and creating a correlation between those 2. Most studies treat the two as roughly equivalent; sometimes one is harder than the other, depending on how much access you have to the device, but both are possible.
[00:15:37]
So, these are typically the steps that a non-profiled attack follows. You identify a point of attack: if you look at the crypto hardware implementation, there is something called an S-box, which has an XOR kind of operation at its input, and the S-box is typically the point of attack because it leaks out the most information when it switches. There are hardware models for the power; the Hamming-weight model is a pretty popular power model, and you can use that model to construct your correlation matrix between the plaintext and the power, or between the ciphertext and the power. Then you do this for each key byte, one after another; as I said, it's a divide-and-conquer kind of approach, so you attack the 1st key byte, then the 2nd key byte, and so on, and by doing that 16 times you eventually recover the entire AES key. So this is the
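A minimal, illustrative sketch of the per-byte correlation attack just described. Everything here is synthetic: the secret byte, the single-sample "traces", and the simplified leakage model (Hamming weight of plaintext XOR key byte, rather than of a real S-box output) are all assumptions for the demo, not a real attack:

```python
import random

def hamming_weight(x):
    # number of 1 bits: the classic power model for a switching register
    return bin(x).count("1")

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

random.seed(0)
SECRET_KEY_BYTE = 0x2B  # hypothetical secret we pretend not to know

# known plaintext bytes, and one noisy leakage sample per encryption
plaintexts = [random.randrange(256) for _ in range(256)]
traces = [hamming_weight(p ^ SECRET_KEY_BYTE) + random.gauss(0, 0.3)
          for p in plaintexts]

# for each of the 256 guesses, correlate predicted leakage with measurements
scores = {g: pearson([hamming_weight(p ^ g) for p in plaintexts], traces)
          for g in range(256)}
recovered = max(scores, key=scores.get)
print(hex(recovered))
```

The guess whose Hamming-weight predictions correlate best with the measured samples is the correct key byte; repeating this per byte is the divide-and-conquer step.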
[00:16:36]
non-profiled version of the attack, where you don't need any prior information. Now, what I want to talk about today is the profiled version of that very thing. As I said, there is a lot of interest and there are interesting approaches, particularly based on learning: how do you learn certain patterns and essentially try to build this correlation
[00:17:02]
before you have access to the device? In the past there have been other kinds of profile-based attack models, and they are extremely powerful if you can create a good profile. If you have enough information, a good prior model of the device that you are going to attack, then you have a very strong profile, and if you have a strong profile, then the attacks you mount are also going to be more successful. These are called template-based attacks, statistical template attacks: you create a template by looking at similar devices and then use that information to attack a new device that you have access to for only a very limited amount of time.
[00:17:44]
So one of the things we have been looking at is to use some of the learning technologies, the advances in machine learning, particularly deep neural networks and so on, and see if you can train a neural network, or a variant of one, to create that functional correlation: a functional mapping from power to the key. You fix the plaintext or the ciphertext, and then you look at the correlation between the power traces that you collect and the secret key, and ask whether a neural network, or any model that can be optimized or learned, can learn that profile. That's the 1st thing we were looking at, and there is some prior work. One of the things that emerged when we really started working on it was that most of the prior work was looking at the same device: you have a device, you collect power traces from it while sending known plaintexts, and from those you train a neural network to learn the functional dependency between the plaintext and the power, and then when you change the secret key on that same device you can very easily crack it. The idea is that now you just need inference on the neural network, a feedforward single pass, and in a single iteration you can potentially know what the secret key is. So, as opposed to doing 10,000 or 100,000 iterations, can I recover the secret key in a few traces, ideally one trace, though maybe even 10 traces is good enough? That's the whole motivation. So now the question is: what kind of model, what kind of neural network, works?
[00:19:29]
The kinds of neural networks that people have used: one was just a fully connected deep neural network. Its structure, for those not familiar with it, comes somewhat from statistics and somewhat from biology, I guess: all these neurons, as we call them, are
[00:19:47]
units with some sort of nonlinear activation function. It's like integrate-and-fire: all the data comes in, you integrate, or sum it up, with some known weights, and once the sum reaches a threshold you fire it through some sort of probabilistic, nonlinear function. The firing of the neuron produces a result that goes to the next layer of neurons, and so on and so forth, until finally you get to the output layer. In this case, the input would be the power traces that you feed in, the raw current numbers, and the output layer gives you the key: which key it is. So these are the labels; there are 256 possible byte values, so there are 256 labels. What you're seeing on the left is a fully connected deep neural network; people have tried to train that, and in some cases it does work. There has also been work on using convolutional filters, particularly in the 1st few layers, using CNNs, which are a very popular deep neural network architecture for image processing. It has been shown that CNNs are somewhat powerful particularly when there is misalignment: when you are training the network with known data and there are small degrees of misalignment between the traces you are collecting, a CNN can still learn even with a little bit of misalignment. Again, without going into the CNN architecture, there is prior work showing that CNNs are also moderately successful for some of these attacks. The key here is that everything is happening on the same device: you have one device, you train a model based on that device, and now you
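The integrate-and-fire description above can be sketched as a single forward pass through a tiny fully connected network with 256 output labels; the layer sizes and random weights here are placeholders, not a trained attack model:

```python
import math
import random

def dense(inp, weights, biases):
    # each neuron "integrates": a weighted sum of its inputs plus a bias
    return [sum(w * x for w, x in zip(ws, inp)) + b
            for ws, b in zip(weights, biases)]

def relu(v):
    # the nonlinear "firing" once the integrated sum crosses the threshold
    return [max(0.0, x) for x in v]

def softmax(v):
    # turn the 256 output scores into probabilities over the key-byte labels
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

random.seed(1)
TRACE_LEN, HIDDEN, CLASSES = 50, 20, 256  # hypothetical sizes

W1 = [[random.gauss(0, 0.1) for _ in range(TRACE_LEN)] for _ in range(HIDDEN)]
b1 = [0.0] * HIDDEN
W2 = [[random.gauss(0, 0.1) for _ in range(HIDDEN)] for _ in range(CLASSES)]
b2 = [0.0] * CLASSES

trace = [random.gauss(0, 1) for _ in range(TRACE_LEN)]  # stand-in power trace
probs = softmax(dense(relu(dense(trace, W1, b1)), W2, b2))
guess = max(range(CLASSES), key=probs.__getitem__)  # most likely key byte
print(len(probs))
```

In a real profiled attack the weights would be learned from labeled traces; here the point is only the shape of the computation: trace in, probability over 256 key-byte labels out.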
[00:21:38]
ask: once the model has been trained, can I attack that device in one or 2 traces, in very few traces? As I said, this has been moderately successful with DNNs and CNNs. But then we started looking at how a practical attacker would attack this. The point is that the attacker will not actually have access to that particular device; he has some devices from the market, and he's trying to build his neural network based on those devices, not the one that he is eventually going to attack. So the question was: can I train my neural network on a different but identical device, identical part number, identical lot number or whatever, build my model on that particular device, and transfer the model to something unknown, a new device? When we did that initially, it turned out to be hugely unsuccessful, and it is very easy to understand why: because of manufacturing variability, because of small changes and so on, if you train a neural network on one device and then try to attack another device without any prior information, the neural-network-based profile is not strong enough. It does not capture only the invariant properties; it captures
[00:22:53]
the invariant properties as well as things that are device specific, and the device-specific components will sometimes flood the information that you actually want to transfer from the profiled device to the unknown device. Our accuracy was initially something like one percent, really poor, and we started digging into it a little further to see whether there is a way to create a neural network model robust enough that you can actually attack an unknown device. When you actually look at the current profiles, these are the same kind of microcontroller part from the market, 30 of them, doing the same operation, and you look at the amplitude, how much the current changes over the entire computation, and you can see that there is some variation, and the
[00:23:44]
mean amplitude also keeps changing. So there is enough variation device to device; although it is nominally the same device, from the same manufacturer, just because there is device-to-device variation you see this systematic variation in amplitude, in the mean of the amplitude, in the variance of the amplitude, and so on, which is bad enough that it will corrupt your neural network model.
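One common first step against this kind of mean/variance shift (my illustration here, not necessarily the speaker's method) is to standardize each device's traces before training, since standardization removes exactly the per-device offset and scale:

```python
def standardize(traces):
    """Remove a device's mean amplitude and divide out its standard deviation."""
    flat = [s for t in traces for s in t]
    mean = sum(flat) / len(flat)
    sd = (sum((s - mean) ** 2 for s in flat) / len(flat)) ** 0.5 or 1.0
    return [[(s - mean) / sd for s in t] for t in traces]

# two toy "devices" with the same trace shape but shifted mean and scale
device_a = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]
device_b = [[0.1, 0.2, 0.3], [0.2, 0.3, 0.4]]  # same shape, 10x smaller amplitude

norm_a = standardize(device_a)
norm_b = standardize(device_b)
print(norm_a[0])
```

After standardization the two devices' traces coincide, because the difference between them was purely an affine amplitude shift, which is the systematic component described above.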
[00:24:10]
So what we did was to quantify how it works: let's say I train on one device and test on another. Here, this is the multilayer perceptron, the fully connected neural network, and this is the one-dimensional convolutional neural network, and these plots are essentially showing the accuracy that you get. The way to interpret the plot is this: take number 3 here; this is the training device. I've trained my neural network on device number 3, and I want to test on a device which is, say, number 30 or 29. The lighter the color, the higher the accuracy. You can see that if I train on device number 3 and test on device number 3, I am very likely to reach 100 percent accuracy: with the same device I can of course train it pretty well and build a neural network model which is robust enough, and it works pretty well. So you see that all these diagonal elements are almost 99 to 100 percent accurate. But once you go to the off-diagonal elements, you see that if I trained on device number 15 and test on, let's say, device number 26, I have very poor accuracy; I am not able to recover the key with enough accuracy. That means that if I train on one device and test on another, we don't capture the invariant properties of the model. And we see the same thing with the CNN, the convolutional network, which
[00:25:29]
is actually even worse; it doesn't really work. But you can see that the diagonals are of course good, and that was essentially the prior work, showing that you can train a neural network model as long as the training and the testing devices are the same.
[00:25:41]
So we wanted to see some of the rationale behind it. Again, without going into the details, we wanted to see in particular why this device, device number 21, is giving us very poor results, and in this case device number 18 is giving us poor results, and we can see that there is a definite
[00:25:58]
correlation between the devices that give very poor results and their own current signatures. You see that the variance of the current signature of the devices that produce bad results in profiled attacks is actually different from the rest, which means something is going on, because of process and manufacturing variations, that makes their current signatures look a lot different from other devices. There is a direct correlation. We don't yet know how to use that information, but we can at least analyze it and see that, yes, we know why this is not working, why the neural network model is failing for this particular device. You see, for example, that this device here, number 18, which gave me very poor performance, is also the device where the variation in the current is a lot higher than in the rest of the devices, so you can't necessarily train for device number 18 using some other device.
[00:26:47]
Just to show you pictorially what we are seeing: if you take this model and project it to a very low dimension, 2 dimensions here for visualization purposes, you see that when all these devices are projected down, they each occupy one of these ellipsoids in the space.
[00:27:11]
This is device number 1 for a particular key byte, this is device number 2 for a particular key byte, and so on. The low-dimensional projections tell us one thing: for each device, when you project down to a low dimension, there is separation between these ellipsoids, which means there is information even at really low dimensions. But from device to device they are not perfectly aligned, so there is some overlap: when you look at the same key byte for 2 devices there is
[00:27:40]
some overlap; they're not exactly on top of each other. So although there is information in this low-dimensional model between 2 key bytes on the same device, once you go from one device to another, the separation degrades, which means there is variation, and that variation is what does not allow you to distinguish between key bytes across devices. What I want is something which is invariant across these devices, and what we started doing was to look at what happens if we train the model with multiple devices. As opposed to training with one device, we started training the neural network with 2 devices, 3 devices, following very standard neural network prescriptions for doing that, batch training and all that. And you can very easily see that once you train on just enough devices (this is training on one device, which you already saw; this is training on 2 devices, 3 devices, and 4 devices), once you train on just 4 devices, that itself is kind of good enough: it already gives you good accuracy. When you train a model on 4 devices, take a new, 5th device, and just run inference, you can actually decipher the key in 2 or 3 traces pretty accurately, with very high accuracy. So it means you don't need to train over lots of devices; even 4 devices is good enough and gives you a pretty robust model. At that point it gives you 70 to 80 percent accuracy; but how do you take it above 90 percent? There is something additional you have to do. And we tested the same thing on CNNs: CNNs actually don't perform that well; the DNN with this multi-device training performed better, and there are some intuitive explanations behind that.
[00:29:17]
CNNs are probably not the best neural network topology to use for this particular problem, but the fact is, if you train on multiple devices, the accuracies do get better, and you don't need to train on a lot of devices, which is a good thing. So then we started looking at this: I have this neural network model, and I really want to capture the invariant features, and we know that
[00:29:40]
the current trace is a very long vector; the captured traces are 10000 points or 30000 points, so the variation is definitely in very high dimensions. So can I take that high-dimensional vector and project it down to a lower dimension while preserving the invariant features? That's the idea, and the first thing you reach for, of course, at the start is the obvious one.
[00:30:03]
The first tool that we used to do that is principal component analysis. You take that vector and do PCA, and PCA gives you the most dominant eigenvalues and the corresponding eigenvectors. So if you take this, say, 30000-point-long current trace, project it down using PCA to just 100 to 1500 data points, which are the principal components, and then train the neural network, we suddenly see a huge improvement in accuracy. Now the confusion matrix is almost white; there are very few places where you have a loss of accuracy. You still train on those four devices, but before you feed the current traces into the neural network you do PCA to reduce the dimension, capture the principal components, and train on the principal components, and now you are very close to 99 percent. What it means is that I can attack a device I have never seen before: I train a neural network on four or five devices at home, use PCA and all that, then get access to a new device for only a few seconds, capture current traces, and I can tell you what the secret key on that device is, which is kind of a scary situation in some sense. That's why there is so much interest these days in understanding how these profiling attacks can actually be deployed and what kind of countermeasures we actually need.
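The PCA preprocessing step described above can be sketched with plain numpy. The trace matrix here is synthetic (smaller than the 10k-30k-point traces the talk mentions, to keep it fast), and the low-rank structure is an assumption standing in for the real leakage.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for measured power traces: 800 traces, 3000 samples each.
# A handful of latent directions carry the "information", the rest is noise.
n_traces, n_samples, k = 800, 3000, 100
latent = rng.standard_normal((n_traces, 5))
basis = rng.standard_normal((5, n_samples))
X = latent @ basis + 0.1 * rng.standard_normal((n_traces, n_samples))

# PCA via SVD of the mean-centred trace matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:k]                # top-k eigenvectors of the covariance
X_reduced = Xc @ components.T      # project 3000-dim traces down to k dims

explained = (S[:k] ** 2).sum() / (S ** 2).sum()
print(X_reduced.shape, round(float(explained), 3))
```

The reduced matrix `X_reduced`, not the raw traces, is what would then be fed to the neural network, which is also the answer the speaker gives to the audience question near the end: PCA is applied to the inputs, not the weights.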
[00:31:16]
One of the things I did not explicitly go into was how you train a neural network by capturing data from the known devices, the ones that you have bought and profiled at home. One of the biggest challenges is that when the AES operation starts and you start collecting current traces, there is often some misalignment: you miss one cycle, or your trigger is not exactly synchronized across all the devices. Whenever you have a little bit of misalignment in the current traces that you capture and use for training, that misalignment essentially corrupts the model; your learned model is not going to be perfect because there is misalignment in the traces. So just as PCA is a very useful tool for dimensionality reduction, what we did was use dynamic time warping, a computational method, to take care of misalignment. If you don't know what dynamic time warping is, it's a very simple tool which allows you to remove the time axis. It's very popular in systems that do things like gesture recognition. Let's say you have a TV,
like the Samsung TVs that have gesture recognition these days, and you wake it up by waving. Someone waves very fast, someone waves slowly, and the TV has to be able to respond to both, because once you remove the time information it is just looking at the sequence of events that is happening, and it can capture that by warping away the time information and keeping just the sequence of things that happen. We are trying to do the same thing. If there is time misalignment, and the individual devices may also run at slightly different frequencies so that the signatures are not exactly aligned, the only way to handle this is to remove the time information and use dynamic time warping to essentially treat the trace as a sequence of switching events. Once you do that, you can take care of a lot of these misalignment problems. For example, on the left you see current signatures that
[00:33:15]
the graduate students tried their best to capture and align, but they are still misaligned; you can see that one trace is not lying on top of another. When you go to the right, after dynamic time warping, the x axis is not time anymore, it is the warped sample index, and you can see that after warping all these waveforms lie one on top of the other. That makes a very powerful initial preprocessing step that you can apply before you feed the data into PCA for dimensionality reduction and then into the neural network for classification.
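As a sketch of what dynamic time warping does for misaligned traces, here is the textbook dynamic-programming formulation, not the exact tool the group used; the "traces" are synthetic sine waves, one of them stretched as if the second device ran slightly slower.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic-programming DTW distance."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# A toy "trace" and a nonlinearly time-warped copy of it: the values are the
# same sequence of events, only the timing differs.
t = np.linspace(0, 4 * np.pi, 200)
trace_a = np.sin(t)
trace_b = np.interp(np.linspace(0, 1, 260) ** 1.2,
                    np.linspace(0, 1, 200), trace_a)

euclid = float(np.abs(trace_a - trace_b[:200]).sum())  # naive point-wise mismatch
warped = float(dtw_distance(trace_a, trace_b))
print(warped < euclid)  # DTW sees through the stretch; point-wise comparison does not
```

The same warping path that minimizes this cost can be used to resample one trace onto the other's index, which is how the misaligned waveforms on the left end up stacked on top of each other on the right.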
[00:33:48]
So PCA and DTW, dynamic time warping, turned out to be very powerful for us in taking care of misalignment problems as well as preserving invariant features while rejecting device-to-device variations to some extent. To summarize all of this, and some of these numbers are kind of complex so I am going to go through them a little carefully: on the chart on top you have the number of training devices. If I have only one device, I train on that one device, get a model, and test on a new device, something I have not seen in the past. You can see that the average accuracy of recovering the right key with a multilayer perceptron, a deep neural network, is about 60 percent; the maximum is about 98 percent and the minimum is around 2 percent. So in cases where the device-to-device variation is large, there is hardly any information that you can learn from a known device and transfer to a new device. But interestingly, when we do PCA, and I think this is the biggest advantage of doing PCA, that 2 percent becomes 50 percent, because you are now rejecting the device-to-device variations as much as you can and trying to preserve the invariant features. That raises the minimum, increases the maximum, and the average also increases as a result. Now if you go from training with one device to two devices, three devices, four devices, all of these results start becoming better.
[00:35:15]
And now you can see that with PCA and the multilayer perceptron, by preserving the invariant features, we are on average at about 99.4 percent, and the minimum is about 89 percent correctness. So with only a very few devices, and with PCA, you can preserve most of the invariant features and carry out the profiling attack on new devices, which is an interesting result that we thought was very practical. And this is all in measurement: nothing is in simulation, everything is measured on components, on boards that we have built, and so on. This is way better than the state of the art with CNNs that people have tried in the past, which gives around 60 percent accuracy. Then, digging further, I think we figured out that in most of the cases where it is 89 percent rather than 99 percent, the cause is misalignment of traces. So with dynamic time warping plus PCA plus the multilayer perceptron, we are now very close to 99 percent. So once you start using time warping to take care of misalignment, and
[00:36:27]
PCA for dimensionality reduction to preserve the invariant features, you can train on four or five devices, bring that model to a new device, and you should be able to break that new device in a few traces. And when I say a few, it is typically less than 10 traces, with around 99 percent accuracy. So that is an interesting and significant result, and we have published it in multiple places. There has been a lot of interest from industry, interestingly, and there is some work going on on standardizing some of these machine-learning-based attack techniques and trying to figure out how to build countermeasures against them. So that's the next thing.
[00:37:15]
This one? This is the minimum. Yeah, that's the minimum. The average and the maximum are all pretty good, because the device-to-device variations are small; it is only the outliers which give you trouble, because you have to find the invariant features, and the outliers are also the devices where the clock frequency is slightly different between devices. Something is a little slower, a little faster, and that creates all these misalignments.
[00:37:46]
So this is kind of an overview of the ML-based profiled side-channel techniques that we have been exploring. Again, this is pretty recent work; I think some of these results have been published and some have not actually been published yet, and we have been doing this for almost
[00:38:01]
two years now, and I think there are some more interesting things that we are observing and trying to understand, so this is an investigation in progress. In the last few minutes I want to talk about some of the countermeasures that we are working on, not necessarily only for profiled attacks; they should be able to block both profiled and non-profiled attacks. Before I do that, if there is any question on this, I can try to answer.
[00:38:31]
If not, then let me spend the next 10 minutes or so on countermeasures. I am going to talk about hardware countermeasures: what can we do as hardware designers to protect the information? I do not want any leakage.
[00:38:54]
In terms of hardware-based countermeasures, there are various categories you can think of for how countermeasures are designed. The most common one is the logical countermeasure. A logical countermeasure is one where you want to decouple the switching activity that is going on on the chip from the power trace that it draws; you do not want the correlated signals to leak out. One of the most popular techniques for doing that is called dual-rail logic, and even if you are not a circuits person, I think it is easy to understand what happens: when something switches from 0 to 1, it pulls some current from the supply, which shows up as a signature. What I do in dual-rail logic is, whenever something switches from 0 to 1, by design something else switches from 1 to 0, so everything is balanced. I do not want to draw current that is specific to a particular switching activity within the chip; I want to make sure that I always draw the same amount of current, so that the current signature is decoupled from the actual switching that is going on. So that is dual-rail logic, and people have tried this; there is
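The balancing argument above can be made concrete with a toy first-order power model. The Hamming-weight model below is a common textbook simplification, not the speaker's measurement setup, and the 8-bit word is an arbitrary choice.

```python
# Toy illustration of why dual-rail logic flattens the power signature.
# Assumption: current drawn in a cycle is proportional to the number of
# 1-bits on the wires (a simple Hamming-weight power model).

def hamming_weight(x: int) -> int:
    return bin(x).count("1")

WIDTH = 8
MASK = (1 << WIDTH) - 1

def single_ended_power(value: int) -> int:
    # Leaks: the drawn current depends directly on the data being processed.
    return hamming_weight(value)

def dual_rail_power(value: int) -> int:
    # Every true rail has a complementary rail, so for each bit exactly one
    # of (bit, not-bit) is 1: the total weight is constant by construction.
    return hamming_weight(value) + hamming_weight(~value & MASK)

traces_se = [single_ended_power(v) for v in range(256)]
traces_dr = [dual_rail_power(v) for v in range(256)]
print(min(traces_se), max(traces_se))  # 0 8 -> data-dependent signature
print(set(traces_dr))                  # {8} -> constant, nothing to correlate
```

In this idealized model the dual-rail "power" is perfectly constant; as the talk explains later, real silicon breaks this perfection through capacitance mismatch between the two rails, which is what eventually leaks.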
[00:40:05]
a good amount of work and there are demonstrations of this as well, and they work substantially well. What it does is reduce the signal-to-noise ratio, which means the time for detecting correlated activity is going to increase. The biggest problem of these countermeasures is
[00:40:26]
the expense, the overhead that you pay, and typically the overhead comes in terms of area and power. With dual-rail logic in particular it is very easy to understand that if, for every switching activity, you also have to do the complementary switch, it essentially means you have doubled the logic; you are computing the complementary version of the function all the time, so it is almost 2 times the power. So it is a fairly big overhead in general. For really security-critical
[00:40:56]
products we do use it, but it comes with a huge overhead, and it is not particularly applicable to low-cost devices, particularly things like sensor nodes and cameras and those kinds of applications. So that is a big overhead that we keep paying for these logical countermeasures. Then there are architectural countermeasures. The first, logical ones are gate-level: every inverter, NAND gate, NOR gate, whatever the small building blocks are, has a differential version. Architectural countermeasures work at a higher level of granularity, at functional units. Take an S-box: you build something like an S-box inverse, and when you are running something through the S-box you do the opposite computation alongside it, so that when the currents sum up and go out of the chip they look identical. You do plus one here and you do minus one there. That is a very common architectural technique. There are masking techniques as well that people have used architecturally, but these also have large overheads typically.
[00:41:59]
The rule of thumb is about 2x overhead, because whenever you compute f you also compute f inverse. Then there are noise-injection-based techniques, and that is, I guess, the most useful one in terms of overhead. It looks something like this: the crypto engine has some switching going on, there is a power trace that goes out, and you inject noise onto that power trace by adding another circuit which is switching randomly. If that circuit is switching randomly, it is injecting noise on the power rail, and that noise is going to be high enough that the signal-to-noise ratio on the power rail will decrease. So you add a separate noise-injection circuit.
[00:42:45]
For noise-injection circuits, there are two things to remember. One is that they also have overhead, though lower than some of the other architectural or logical techniques. But the biggest challenge is that when you do noise injection and your SNR goes down, the problem would be
[00:43:08]
I'm sorry. So the challenge with that would be that it is a statistical technique: if you wait long enough and collect a large enough number of traces, even though you have injected noise, you will eventually be able to recover the key.
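That "wait long enough" argument is just averaging. Here is a minimal numeric sketch, with an invented signal amplitude and noise level, showing why injected noise raises the cost of the attack without capping it: the noise in the mean shrinks as one over the square root of the number of traces while the data-dependent component stays put.

```python
import numpy as np

rng = np.random.default_rng(2)

# A tiny data-dependent "signal" buried under injected noise (both values
# are illustrative assumptions, not measurements from the talk).
signal = 0.1        # key-dependent component of one power sample
noise_sigma = 5.0   # injected noise, 50x larger than the signal

for n_traces in (10, 1_000, 100_000):
    traces = signal + noise_sigma * rng.standard_normal(n_traces)
    est = traces.mean()
    print(n_traces, round(float(est), 4))

# With 100k traces the std of the mean is 5/sqrt(1e5) ~ 0.016, small enough
# to resolve a 0.1 signal: the key-dependent component re-emerges.
```

This is exactly why the mean time to detection for a noise-injection countermeasure is finite, and why the signature-attenuation approach described next tries to shrink the signal itself instead.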
[00:43:27]
And finally there are techniques which have become very popular in the last two or three years, again particularly from the industrial point of view, which involve voltage regulators. When you have a microprocessor or some sort of embedded device, it may run at, say, 1 volt while the battery supplies, say, 5 volts. Between your battery and your chip there is a voltage regulator, which is responsible for taking the 5 volts from the battery, converting it to 1 volt, and maintaining that output voltage at 1 volt. So the voltage regulator is the circuit that bridges the supply and the actual consumer, the chip. What people have done is build versions of these voltage regulators that do not leak information, so the regulator can be used as a means of suppressing any signal that leaks out from the chip. There are switched-capacitor voltage regulators, first shown, I think, about 6 or 7 years back, which are one particular regulator topology.
[00:44:32]
Similarly, people have worked with low-dropout regulators and so on. Again, these are circuit details that I am not going to go into at all, and you don't necessarily have to understand them; just appreciate the fact that voltage regulators are the bridge between the battery, or the main supply, and the chip, and that voltage regulators can be designed in ways that isolate the current signature leaking out of the chip from what can be collected. If the voltage regulator is very close to the actual
[00:44:58]
consuming die, then the voltage regulator is a very good isolation stage between what is going on in the chip and what is leaking out, and one of the techniques that we have been working on with Intel has been along these lines, based on low-dropout regulators,
[00:45:16]
and I am going to briefly talk about that in a couple of slides. Before we go into this, one thing I want to point out: this dual-rail technique is one of the traditional logical countermeasures that has been used, and if you have something like an FPGA, which is field programmable, you can implement it there, look at the power trace, and see that the information in the power trace has been suppressed, so the SNR is low. A typical
[00:45:47]
dual-rail gate would look like this: instead of having a and b producing q, you also have a-bar and b-bar producing q and q-bar, and then you have flip-flops and so on which are also differential, and that is how data moves through the circuit. The idea is that instead of making circuit elements single-ended, you make them differential, and if they are differential, the leakage of information into the power trace can be minimized. So that is the basic philosophy of
[00:46:19]
dual-rail logic. One of the biggest challenges in actually implementing it is this: when you have a, a-bar, b, b-bar producing q and q-bar, in an ideal world everything is balanced, so q and q-bar have the same capacitance, the currents are the same, and what you observe from outside is exactly the constant current that you want. But in reality, when you build something, of course there is mismatch. The q and q-bar rails of the differential circuit you build are not going to be exactly identical. So there will be small mismatches in capacitance between the two output nodes, which reflect into a small difference in the current signature, which will eventually leak out. At the end of the day, no matter how hard you try to balance everything, there will always be small mismatches on the die, which eventually lead to information leakage. So even these logic styles, which are supposed to be extremely robust, eventually leak, either through timing information or amplitude information: you will be able to observe timing or amplitude variations on the power trace and correlate them.
[00:47:21]
Another popular technique is what is called sense-amplifier-based logic. This is also a differential logic style with extremely high overhead compared to the standard CMOS logic we are used to. So again, in theory these styles are perfect and do not leak any information, but in practice, because of mismatches, there is some information leakage, plus the overhead is very high, so it is not something that you would want to put in low-power systems.
[00:47:54]
And as I said, most of these techniques leak either timing information, through timing distortion, or amplitude information, through amplitude distortion. That means that if this current is supposed to be constant, it is not exactly constant, because of mismatches between the differential pair; you will have some amplitude distortion, amplitude modulation, which can be picked up by a sensitive meter, and the attacker will be able to figure out what the secret is. It takes longer, but they will eventually figure it out.
[00:48:24]
So, as opposed to doing that, one of the concepts we have been looking at is something like this: I have some sort of an AES engine which is running; it draws a current, I_AES, and this current is, in the traditional sense, the one that leaks out through the power signature. If you use noise injection, you add some noise current here onto the power pin, so that what you actually see on the power pin outside the die has a very low SNR, and you can mathematically show what the signal-to-noise ratio is and how it relates to the mean time to detection.
[00:49:02]
As opposed to doing that, what we have been working on is something where the signature itself is attenuated, by something we are calling signature attenuation hardware. This is a kind of wrapper circuit wrapped around the AES engine that suppresses the amount of information leaking through the power trace that goes out.
[00:49:20]
So if you look at it, let's say this is where all the signature, all the information, was captured; but now the signature has been attenuated, and what you see on the power pin is decoupled from the actual switching here. Now you can inject much less noise on the power pin and still suppress the leakage. The noise overhead for noise injection alone is pretty high; you need a very large amount of noise to have no correlation. Whereas in the second technique, the signature attenuation technique, if I can reduce the signal strength itself, I can get there with much less noise, so it is supposed to be very powerful.
[00:50:05]
So the way we build it is essentially something like this, and for circuit people it is easy to understand. We build a current source on the top, and the current source brings in a constant current. Then there is this I_AES, the AES engine which is switching and producing a pattern of current, and what I want is to make sure that this pattern of current is independent of the current that is coming in. So this is my current source, and I am going to put a regulator here, a shunt regulator, so that the total current that comes in, I_CC, is always constant.
[00:50:39]
I_AES changes depending on what exactly is going on, and based on that, the shunt regulator, which is in parallel with the AES engine, absorbs the excess current, so the shunt current is I_CC minus I_AES. What you see from the power pin is always constant. This shunt regulator topology is something that we proposed three or four years back as a means of always drawing a constant current by design, using a shunt structure which bypasses the residual current that the engine does not need. So what you see at the power pin on the board is always a constant current. Of course, there is a whole bunch of feedback loops and so on to make it work, but once you do that, in principle, the SNR of the system has changed dramatically.
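The current bookkeeping of the shunt regulator is simple enough to sketch numerically. This is an idealized behavioural model, ignoring the feedback loops, finite regulator bandwidth, and headroom limits the talk alludes to; all numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(3)

# Idealized shunt-regulator behaviour: the supply pin delivers a fixed I_CC,
# the AES engine draws a data-dependent i_aes, and the shunt absorbs the rest.
I_CC = 10.0                              # constant current source (arbitrary units)
i_aes = 2.0 + 1.5 * rng.random(1000)     # fluctuating engine current, always < I_CC
i_shunt = I_CC - i_aes                   # shunt regulator bypasses the excess

i_pin = i_aes + i_shunt                  # what an attacker probes at the power pin
print(float(i_pin.min()), float(i_pin.max()))  # both ~I_CC: no data-dependent signature
```

The design requirement hidden in this sketch is that I_CC must exceed the worst-case engine current; the speaker's answer to a later audience question notes that an on-die capacitor lets you size I_CC for the average case plus margin rather than the worst case.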
[00:51:32]
So what we show, in terms of the mean time to detection, is this attenuation factor: the mean time to detection increases as the square of the attenuation factor. If you can attenuate your signal by a factor of 10x, your mean time to detection goes up by a factor of 100x. There is a square dependence there, which is what we are trying to exploit, so any attenuation I can provide is very useful for the MTD, the mean time to detection, which needs to be high to give me good resiliency. You can see that the first plot shows the AES current, and the next one shows the supply current after it has gone through the signature attenuation circuitry.
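The square law quoted above can be stated as a back-of-the-envelope rule: attenuating the signal amplitude by a factor A cuts the power SNR by A squared, and the number of traces needed for a correlation attack scales roughly inversely with SNR, so the MTD grows as A squared. The numbers below are purely illustrative, not the measured values from the talk.

```python
# Illustrative square-law scaling of mean time to detection (MTD) with
# signal attenuation. "base_traces" is a hypothetical MTD of the
# unprotected design, not a measured figure.

def mtd(base_traces: float, attenuation: float) -> float:
    return base_traces * attenuation ** 2

base = 10_000.0
for a in (1, 10, 100):
    print(a, mtd(base, a))   # 10x attenuation -> 100x more traces needed

print(mtd(base, 10) / mtd(base, 1))  # 100.0
```

This is why modest circuit-level attenuation pays off so strongly: each order of magnitude of attenuation buys two orders of magnitude in attack cost.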
[00:52:15]
And these are some of the results that we have captured. You can see that, compared with a standard noise-injection method and the additional power it burns, we can achieve the same MTD, the same resiliency, with over 10x less power. The whole idea is: how do we enable circuits which are low cost, useful for low-power applications, and still pretty resilient? This is where this approach seems like an interesting way of doing that. Very recently, in work that will be presented in February of this year, we have shown, using the signature attenuation circuitry in, I think, 40-nanometer CMOS, that the MTD can now be about 1 billion traces. A billion is a significantly large number: you start from the raw number of traces needed to break the unprotected design, and now, with all the circuitry and the low-overhead countermeasures, you can extend it to about 1 billion, which makes it almost impossible to break using these
[00:53:20]
techniques. So, in summary, profiled attacks are interesting, and I think there is a lot of scope for profiled attacks on devices where the attacker's access to the device is limited. There has been a good amount of work on countermeasures, but I think we have to be really cognizant of the power and area overhead of most of the solutions, particularly for embedded applications like sensor nodes, small-form-factor mobile devices, and so on, and there are opportunities there. For people like me who are interested in circuit design, there are lots of interesting opportunities in analog and mixed-signal design to isolate switching events from the current traces and signatures that get out of the chip. I just showed you a couple of examples, but there is a very vast area of research that needs to be done before we are actually successful in making these devices secure; they are still not secure enough, I'd say.
[00:54:18]
So that's all I had, and I'll be happy to answer questions and talk to you about anything. Thanks. Yes. No, the PCA is performed on the input: you take the input and do dimensionality reduction on the input, not on the weights, and then you train the network based on the output of the PCA.
[00:55:00]
Yeah, yeah, that's a good question. I think, in principle, that's true, but what happens in reality is that there is also a capacitor in parallel with the AES engine, which provides a good amount of charge storage. So even if you need an instantaneous rush of current for a particular switching event, you can use that capacitor to provide that current, so you do not necessarily have to design for the worst case. You can design for the average case plus some margin, and that internal capacitor provides the instantaneous charge that you need, so you can optimize the design based on that.
[00:55:58]
Two questions: where did you... just power? Yes, yes. No, so far not; this is all power-based. We have some work we just started on that, which we eventually want to look at, but we haven't done it yet. We have done non-profiled attacks, which are easier to mount, but for profiled attacks we have not done any work on that yet.
[00:56:38]
Yeah, OK, so any kind of misalignment that you have between computing f and f-bar here, any timing misalignment in the currents, will show up on the signature. So, yeah, that's one. Yeah. So yeah, this is actually a very good question, I mean,
[00:57:19]
there are two things you can do. One of the things that we have actually shown in hardware is this: when you build an IC, you do metallization across multiple layers. If you build the AES engine and route its signals only in the low-level metal layers, metal 1 and metal 2, and have metal 3 all the way up to metal 9 not touch those signals, then those upper layers themselves act as a good shield. From a physical design point of view, if you just let the Cadence or Synopsys tools do the job for you, they will route it all over the place; they will take it all the way up to metal 9, because they are trying to minimize something else. But if you are minimizing the radiation, it makes more sense to route locally using low-level metal layers and use the higher metal layers as shields, and that actually reduces the leakage significantly. On top of that, from a packaging perspective, people have looked at how to build something like a Faraday cage so that you contain the radiation. There are problems associated with that: if you trap the radiation, it acts like a cavity and then it heats up, and so on, so there are other effects that come into play. But there is various work going on in understanding the minimum amount of metallization you can put, not only on the package but also on the die, so that you shield this as much as you can. And there is interesting work going on in understanding how the metal can act as a diffraction grating, so you can diffract the signals and kind of randomize them; those are things you can do.
[00:58:51]
All right, we'll hold the questions there. Thank you.