[00:00:05] >> It is my pleasure to introduce Dr. Peter Kogge, who is a chaired professor in Notre Dame's Department of Computer Science and Engineering. Peter is an IBM Fellow. He was the winner of the 2012 Seymour Cray Award, and he was also the winner, I believe, of the 2015 Computer Pioneer Award, which I had the pleasure of presenting to him. I know he doesn't list that one, because several people who have received it have said, "Now I'm a pioneer; that means I'm old."

[00:00:38] He spent 26 years with IBM Federal Systems. His undergraduate degree is from Notre Dame and his Ph.D. is from Stanford, in electrical engineering. Peter.

>> I'd like to start off by just making some comments, particularly based on the last really great talk that we heard. I like to build computers, and the wackier the better, but over 50 years of doing this I've found I can't get people to give me money to build computers unless I tell them why, and they're only going to give me money if I can tell them the machine can do something they can't do now.

[00:01:23] So, what I'm going to talk about, if this actually works with the speed of light (OK, good) and in half an hour: I want to talk about the architecture you heard about this morning, the wacky Emu machine, and poke at an application area that is really hard for today's machines, because that is why the wackiness is valuable. Before I do that, however: a lot of you may know Burton Smith. He passed away earlier this year. Burton was one of my real heroes; I knew him for 30 years.

[00:02:18] We used to joke that Burton built the world's third multithreaded machine. It turns out Cray built the first back in the sixties, and I built the second in the early seventies; it flew in the Space Shuttle, but we didn't know it was multithreading at the time. Burton was the one who formalized the whole thing and made it a valid computer architecture primitive, so I learned a lot from him.

[00:02:47] It's a shame to see him go. So, looking forward: I was at the meeting that Jim mentioned last week about post-Moore computing and what we need to do, and here are some takeaways from my view of what post-Moore computing is going to look like, at least until we get to either the quantum machines or the smarter approaches we just heard about.

[00:03:16] It's going to be heavily heterogeneous, and lots of new technologies will show up first as accelerators. Clock rates will continue to stay flat: unless you get to something really exotic like superconducting materials, you're going to get one to three gigahertz and that's it, which means that, from the standpoint of programmers, if you've got really big problems to solve you're going to need billions of threads. So if you don't like threads, go do something else.

[00:03:44] Power efficiency is paramount. I ran the exascale technology study literally ten years ago (and there's a BoF, by the way, advertised for Supercomputing looking back at that result), and the key takeaway from that report is that it's power, power, power, and power that we need to work on, and clearly that's still true today. What that means in terms of architecture is that we're going to have to reduce data transfers.

[00:04:15] And we've got to avoid wasting anything: if you reference memory and touch parts of it that you don't need, you've wasted energy. And the BSP protocol, the model that underlies MPI
and what have you, is no longer a universal solution. We've seen computational changes due to multiscale physics and, more recently, data analytics, where we've gone from wanting to do large, dense, weak-scaling problems to sparse, irregular, strong-scaling problems, and I'm going to give you a little bit of evidence for that as we go on. The other thing is that, with the advent of AI, we've actually found uses for things like 8-bit floating point.

[00:05:05] And that's an entirely different animal to design machines around than, you know, quad-precision floating point, so we need some new thinking to get around this. My topics are a couple of things: I'm going to talk a bit about the problems with today's architectures, particularly from the viewpoint of

[00:05:27] data analytics, talk about the Emu architecture a bit, and a little about benchmarking and some follow-on efforts, followed by a joint project that Vivek and I just got two months ago that looks beyond the current class of migrating-thread architectures. So, today's analytics. To get to it I'm going to go through a little architectural archaeology; when you've been around long enough, you remember the dinosaurs because you were there.

[00:06:02] That becomes relevant. We had prehistory: those of you who remember the late eighties will remember the attack of the killer micros, when we went from Cray's big vector machines to lots of little microprocessors. When we first started out with these things the applications were all 3D mesh simulations; you talked to

[00:06:26] your nearest neighbors in some sort of 3D mesh, and the issue architecturally was how to move beyond one core, how lots of little cores could collaborate. When we figured out how to do that we found that, if you look at these charts (the red is roughly the order of computation, the blue is the order of memory references or memory bandwidth, and the orange is the order of I/O and interconnect between nodes), the computation roughly matched the data references you needed, and

[00:07:06] the I/O was less than that. So the solution to scaling beyond one core was, again, these parallel microprocessors, using 3D topologies tied directly to the problem, and continuing to use Dennard scaling to scale up the clock. Then came, probably about the time the professors here were starting grad school, the memory wall. The problem switched to dense linear algebra, solving Ax = b with A dense, and the issue was memory latency: clock rates were outracing memory access times and we couldn't do anything about it. The solution here was caches; we continued to scale the clock and increase the ILP, but the key thing was caches.

[00:07:59] The next one, when your grad students were probably in grade school, was the power wall. We were solving the same problem, but we found that scaling up the clock meant our chips were now dissipating on the order of 130 watts or so apiece and you can't cool them, so life as we knew it had to end. The solution was to flatten the clock rates, build simpler cores (a back-to-the-future kind of thing), add SIMD instructions, begin to use accelerators (this is about the 2008 timeframe), and add bigger caches.
[00:08:41] OK, so then came, last week practically, the efficiency wall; you were probably in college at this time. The architects of the time were still solving Ax = b for dense computations, and the issue here was power efficiency: you couldn't get enough flops per watt. The solutions were multi-level memories, accelerators, hybrid cores, wider cache lines,

[00:09:11] with the multi-level memories being things like HBM and MCDRAM and what have you. Then, yesterday in effect, we had the sparsity wall. About five years ago the DOE recognized (some people recognized it earlier) that the real codes weren't running anywhere near the efficiency of dense linear algebra; they were down around one to four percent efficiency. So they switched benchmarks, and it's still a linear algebra thing, but a sparse variant of the problem that was easily weakly scaled, and the solution for this hasn't really worked: it was basically do more of the same and maybe better networks.

[00:10:00] Where we are today, I think, is more of a case where, with the rise of data analytics and machine learning and what have you, we've got large persistent data, persistent non-locality in access, massive numbers of remote operations, and irregularity and sparsity all over the place. There's no consensus on what the architecture ought to be, although I hope maybe we can change that, and the net result is that all the current architectures are grossly inefficient on these problems. So let me just look quickly

[00:10:39] at today's server. This is a typical server blade today: you've got lots of CPU cores and GPUs; somebody talked about the POWER9 and Summit, I think that was David. If you look at this and you look at the trends, this is likely what we're going to have over the next five years: more cores per socket, some increase in threads per core (but we don't know how to do that well with conventional architectures), small increases in bandwidth coming out of the memory,

[00:11:08] more and more on-chip high-bandwidth memory as an intermediate level of memory, and no real growth in the number of sockets per node. We've got GPUs, and we'll have maybe some increase in GPUs per node, but Summit is at six, I think, and that's about the limit of what you can think you can deal with.

[00:11:29] There's limited injection bandwidth into the NICs, multiple different kinds of interconnects all over the place (so whenever you talk to somebody you've got to talk different ways to different people), and all kinds of disparate address spaces. The net result, and I'm not going to go through these, is that from the programming viewpoint you've got tons of problems: nothing simple, nothing uniform, it's extraordinarily hard to reason about anything, so you end up having large software stacks just to make the transitions between different levels of things. So there's, again, got to be a different way. That's what we do today; let me talk a little bit about data analytics and give a couple of examples. This is from a variety of places.

[00:12:19] Data analytics is defined as, you know, the discovery and interpretation of data, finding meaningful patterns in it. In traditional data analytics the data is kind of passive: it's siloed and you do periodic batch analytics on it. What's more modern today is that the data is more dynamic: multi-site data, data centers, data clouds,
[00:12:46] IoT, compute at the edges, and what have you, and there are multiple open-source tools, none of which are compatible with each other or even have much in the way of common execution models. I have a couple of charts, which I'm not going to go through in detail, that show why there's a problem. This one is from an IBM web page that talks about the properties of data analytics, and they go through not only the normal three Vs that we talk about (velocity, volume, and veracity, I guess) but a whole bunch of other kinds of properties that we are not used to from the past. And there's another paper

[00:13:29] that is really interesting, which looked at those workloads. They chose eleven of them, which are down here, all of which have very large data sets, 100 gigabytes and up, and which they estimated to comprise about 70 percent of the data analytics workload. They looked at those in particular, did a lot of low-level analysis of how we are using our current cores, the way we design machines today, and came up with some data; the reds are the key takeaways. These are four different aspects of a core; if you're a computer architect, these are four things you look at. The first one, on the upper left: a lot of cores speculate as to which way branches go, and these workloads are terrible for that; anything that spends a lot of time speculating just doesn't do well. If you look in the upper right, CPI, cycles per instruction:

[00:14:28] these are four-way-issue cores, so the best CPI possible is about a quarter, and these machines don't even come close on these problems, so all that superscalar hardware is wasted. Down on the left are all the different execution units you might have in the core to do different things; these cores had eight execution units, so they're pretty modern devices, and the utilization of those execution units is terrible. And then finally, last but certainly not least from my viewpoint, when you look at where the stalls came from, they came from memory. This is all the way back to the beginning of our problems: we are

[00:15:15] now totally bound by memory and not at all by the architecture of the processor; that kind of doesn't matter. So what I have here is the roofline model, which, if you're an architect, you probably know. It defines a term called intensity, which is the ratio between two terms; the intensity I'll use here is performance divided by the bandwidth of memory. If memory is the issue, I'm going to make that the denominator, and performance is the numerator. Normally when we draw these roofline models we put performance on the Y axis; I'm going to change it to efficiency because that's a more realistic term. What I've got are curves for three of the top-10 machines: Summit, TaihuLight, and Piz Daint over in Europe. The bottom axis is the number of units of performance you get out of the machine as a ratio to the number of bytes you had to read from memory, and in these machines the angled lines basically say that until you have enough intensity to saturate your cores, the performance you get is linearly proportional to the memory bandwidth, to how well you can use the bytes coming out of the memory.
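As a rough aside, the roofline relation being described amounts to: attainable performance is the smaller of peak compute and intensity times memory bandwidth, and the efficiency he plots is that divided by peak. Below is a minimal C sketch of that relation with made-up machine numbers; none of the numbers come from the talk's slides.

    /* Minimal sketch of the roofline relation being described (my notation,
     * not the talk's slides): attainable performance is capped either by
     * memory bandwidth times intensity or by peak compute. */
    #include <stdio.h>

    static double roofline_efficiency(double intensity_ops_per_byte,
                                      double peak_ops_per_sec,
                                      double mem_bw_bytes_per_sec) {
        double attainable = intensity_ops_per_byte * mem_bw_bytes_per_sec;
        if (attainable > peak_ops_per_sec) attainable = peak_ops_per_sec;
        return attainable / peak_ops_per_sec;   /* fraction of peak, i.e. efficiency */
    }

    int main(void) {
        /* Illustrative numbers only, not any real machine's specs. */
        double peak = 1.0e12;      /* 1 Tflop/s per node        */
        double bw   = 100.0e9;     /* 100 GB/s from main memory */
        for (double i = 0.125; i <= 64.0; i *= 2)
            printf("intensity %7.3f ops/byte -> %5.1f%% of peak\n",
                   i, 100.0 * roofline_efficiency(i, peak, bw));
        return 0;
    }

With these illustrative numbers the knee falls at an intensity of 10 operations per byte; everything to the left of the knee is purely bandwidth-bound, which is the regime he places the real codes in.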
What's important about all three of these machines, all three of these lines, is where they flatten out, where you've actually saturated your cores: somewhere between an intensity of 4 and 8, and that's where you've managed to use these intermediate memories at 100 percent. If you're using DRAM, these numbers are actually out in the sixties; they're way out there. Now, when you look at real apps, not just High Performance Linpack but real apps, they're down there, and this is in fact exactly why real codes see one to four percent efficiency on these big machines: they're actually down there. Just to show you, this is a chart that goes through a bunch of different benchmarks I've looked at, with the intensity computed in the right-hand column.

[00:17:37] The only one of those that's above one is Linpack, and with a big enough cache I can get any intensity you like; that's what's happening when we put in intermediate memory, and you can boost the Linpack number almost arbitrarily. But the rest of these can't. They're bound by these other things, and you get numbers that are 0.1, 0.01, 0.001, and so on: really terrible.

[00:18:06] The other issue that's important, besides using the cores efficiently, is how we put these together when we want to scale, and today we use distributed systems and MPI to connect them. This is a chart I drew a couple of years ago (I may have even shown it last year, I don't remember) that then started a couple of my grad students down this path. The X axis is the number of nodes and the Y axis is performance, where I've normalized each benchmark so that one is the best performance you can get on a single node; in other words, if you have, say, an OpenMP code for a particular problem that runs on one node, whatever that performance is, the best of those numbers is a one, and then I scaled everything beyond that. When you look at real data, people talk about being out at 100, 1,000, 10,000 nodes and what have you, and when you're far out there on the right, for most of these, with one major exception, they scale well weakly; in other words, you get far enough out and you've got a nice curve, you double the number of nodes and you get double the performance. The interesting thing, however, is what happens down here: you go from one node to two nodes to a few nodes and you throw away a factor of 10 to 100 in performance due to communication and all kinds of things. The benchmarks I have here are breadth-first search from Graph500, which is a graph problem; HPCG, the synthetic benchmark meant to complement Linpack but using sparse matrices; and SpMV, sparse matrix times dense vector product, which I put on there both because it's the core of HPCG and because it's a key operation that shows up in something like GraphBLAS.

[00:20:06] For those of you who like to do graph computing, GraphBLAS is a new paradigm that uses linear algebra to do it, and SpMV is one of the key functions you'd see there. Graphs are really sparse, so looking at sparse matrices, very, very sparse matrices, is something that's going to be of interest.
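For reference, the SpMV kernel he keeps coming back to is, at its simplest, the loop below: a minimal single-node CSR (compressed sparse row) sketch in C, with field names of my own choosing rather than anything from the benchmark codes. The irregular access to x through col_idx is what makes the kernel bandwidth- and latency-bound, and once x is partitioned across nodes that same access becomes communication.

    /* Minimal single-node CSR SpMV sketch (y = A*x); field names are my own. */
    #include <stddef.h>
    #include <stdio.h>

    typedef struct {
        size_t        nrows;
        const size_t *row_ptr;   /* length nrows+1                      */
        const size_t *col_idx;   /* length nnz: column of each non-zero */
        const double *val;       /* length nnz: value of each non-zero  */
    } csr_matrix;

    void spmv(const csr_matrix *A, const double *x, double *y) {
        for (size_t i = 0; i < A->nrows; i++) {
            double sum = 0.0;
            for (size_t k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
                sum += A->val[k] * x[A->col_idx[k]];   /* irregular access to x */
            y[i] = sum;
        }
    }

    int main(void) {
        /* Tiny 3x3 example: A = [[4,0,1],[0,2,0],[3,0,5]], x = [1,1,1]. */
        size_t row_ptr[] = {0, 2, 3, 5};
        size_t col_idx[] = {0, 2, 1, 0, 2};
        double val[]     = {4, 1, 2, 3, 5};
        csr_matrix A = {3, row_ptr, col_idx, val};
        double x[] = {1, 1, 1}, y[3];
        spmv(&A, x, y);
        printf("y = %g %g %g\n", y[0], y[1], y[2]);   /* prints y = 5 2 8 */
        return 0;
    }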
[00:20:39] So we did this, and we found the data here, the purple lines. For the really sparse cases, down around seven non-zeros per row or whatever, not only did you lose well over an order of magnitude in performance, you never got it back; the curve doesn't turn around and show good weak scaling, it just dies. So I had a grad student spend about a year on this, and he optimized the blazes out of it and got curves that look like this (this is a whole lot more matrices than in the data that came from the reference). What he was able to do was a couple of things. For the very sparse cases he minimized the loss of performance, so now it's only a factor of 3 for the sparsest things, and for the dense cases he did really well for a while, up until you get to maybe 16 nodes or something like that, and then the bottom falls out and you die. The funny curve here is because, in the system we had, that's where you hit a switch in the interconnect: to go to a bigger system you've got to go through another level of switch, and it just falls off after that. The problems there are all due to the communication times, which are very often as much as 50 times the compute times (particularly for the sparse cases, where by the time you get to the really sparse matrices a lot of the nodes have almost nothing to do computationally, but they still have to communicate in some sort of pattern), and the reduction time, the time to aggregate the results together.

[00:22:09] So that's not so good. OK, so, the issues here: performance is driven by memory bandwidth. Many of the current apps need strong scaling, not weak scaling; you can't arbitrarily scale the size of the dataset you want to work with, so if you want to add more processing, you're fixed with the dataset you have, and if you look at strong scaling of particularly sparse problems, you get all kinds of issues that kill performance. It's even worse when you want to do something different remotely, where you have to communicate with somebody else and tell them to go do a different function, and worse still when you want to change from a batch to a streaming mode. Those are all issues. And if you look at it in terms of a design space, if you will:

[00:23:00] there are three axes, flops per second, memory bandwidth per second, and injection bandwidth. This is where we are for dense HPC, where we design most of the systems; the sparse HPC problems that are more realistic are here; and the big data, big graph problems are way over here. They're in totally different parts of the design spectrum, and that's where the Emu machine, I think, really has its sweet spot. So this is the Emu. We've got a wall that we need to overcome, and there are two steps to doing this.

[00:23:37] If memory access is the issue, perform the compute near the memory: today in the memory controller, tomorrow perhaps in the bottom of the stack. But something that distinguishes this a bit from the old ideas of processing-in-memory (which, if any of you know my history, I had something to do with back in the day) is that the computation site here is not fixed; that's radically different.

[00:24:02] The other thing is that if scaling is the issue because of the communication, then move the computation and not the data, and eliminate the need to talk to other people. And, as some cartoons paraphrasing something that goes around today put it, the first thing we want to do is make memory great again.
[00:24:26] And then the complement is that we want computations to run where they want to. The next picture is an interesting one, and it's kind of why we liked the name Emu. Thread migration basically says: move the site of execution. We do that today in operating systems (when you suspend a task for an I/O, you may start it up on a different core), but it's not really part of the hardware, and in languages like Chapel you can say "go do this function over there." The idea behind Emu is that this migration becomes automatic; it happens whenever it's necessary. This particular picture is kind of migration on steroids: for those of you who like Australian history, Google the Great Emu War. The

[00:25:17] British army was sent in to clear out emus that were bothering people's farms, and the British army lost. So that's a picture from the Emu War. OK, so the model we have is a nodelet. It has a huge measure of parallelism; it's got some memory, a memory front end, and some number of cores, but these cores are anonymous, and the smart memory controllers also do atomics. We've got a lot of these things, and all the memory, as Jason said earlier, is in a single global address space. And, whoops, my machine just died.

[00:26:03] Let me put it back up. Nope, it's my screen out there; it went into sleep mode. Yes, that one, yes, OK. Let me plug this in. OK, all right; my machine is beginning to flake out on me, and I'm running pretty close on time. So: you have a thread executing in some core, and if it makes a reference to some memory that's not here, the hardware hits it over the head, puts it to sleep, ships it over to the right node, and wakes it up, and the thread never knows that it moved. So all memory references are local, and threads, by the way, can spawn additional threads that go off and keep doing additional work. And, OK, now this is not good.

[00:27:09] Yes, Intel and Microsoft heard me, all right. Well, this is not good; now I wave my hands. We'll see if it comes back, but I'll continue to talk. Shoot, I can't even see what's there. So this is the model we're building. The hardware that's in the CRNCH center now has eight nodes, and each node has eight of these nodelets on it. Where we're going after that (this is all done with FPGAs, so we can make changes to the architecture and what have you as we find issues) is that for the next generation we're going to bump the

[00:28:00] FPGA to a more modern one that we can make run faster, with better interfaces and what have you. And in terms of, well, this isn't going to let me do anything, apparently; maybe, yeah, there we go, OK, all right. The language we've got is a variant of Cilk; there was a talk on triangle finding out there, and the example was being done in Cilk. Cilk has three primitives: spawn, which is a prefix to a function call that says you can go do this asynchronously; a Cilk sync, which causes the thread to wait for all those children; and a Cilk for, which basically does a parallel for. There's a rich set of intrinsics to do atomics and all kinds of other remote things, and we have C++ and Python interfaces that are just about done.
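A minimal sketch of those three primitives, written against the open-source OpenCilk-style dialect (it assumes an OpenCilk-capable compiler); the Emu toolchain's own Cilk variant and its remote and atomic intrinsics may differ in detail.

    /* Sketch of the three Cilk primitives mentioned, in OpenCilk-style syntax;
     * the Emu dialect may differ in detail. */
    #include <cilk/cilk.h>
    #include <stdio.h>

    static long count_below(const long *a, long n, long threshold) {
        if (n == 1) return a[0] < threshold;
        long lo = cilk_spawn count_below(a, n / 2, threshold);   /* child runs asynchronously */
        long hi = count_below(a + n / 2, n - n / 2, threshold);  /* parent keeps working       */
        cilk_sync;                                               /* wait for spawned children  */
        return lo + hi;
    }

    int main(void) {
        enum { N = 1 << 20 };
        static long a[N];
        cilk_for (long i = 0; i < N; i++)       /* parallel for over the iterations */
            a[i] = (i * 2654435761u) % 1000;
        printf("%ld elements below 500\n", count_below(a, N, 500));
        return 0;
    }

On the Emu, the analogous spawned threads are the ones that migrate to wherever the data they touch happens to live.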
[00:28:56] I'm going to skip forward; this is what we hope to do next. In terms of scaling, the benchmark results: there are two benchmarks I'll talk about, and they kind of reinforce where we are in terms of performance. One is pointer chasing with random reads, and this actually came out of Eric's work; yes, this is largely your work, good.

[00:29:24] So we built this code that basically puts billions and billions of nodes, simple two-word linked-list pointers, in a graph that's spread over multiple nodes, both multiple nodes on the Emu and multiple nodes on a conventional cluster, to simulate a big-data problem where the data is in fact randomly scattered all over the place. On a 32-node, dual-socket Xeon cluster you got about 100 in the chart's units of pointers chased per second, and our eight-node box was about 187. You scale that forward and you get some pretty impressive differences between 256 Xeon nodes and the eight-node next-generation system, which is going to give us more performance, and then you add more of these nodes, you go up to 64 nodes in a rack, and you blow the doors off the numbers.

[00:30:29] If you look at the bandwidth consumed, the actual bytes of real information that cross, the difference between moving the threads and moving the data (or accessing the data remotely) to do this chasing is like 30 to 1, and the power difference is an equally huge number, simply because we need less hardware to do it and we're using the hardware more efficiently, to go back to the efficiency thing; and there's a really good price comparison for that as well. The other benchmark is this business of doing remote atomics, which turns out to be important in a lot of these applications. This is just a benchmark called GUPS, which basically has a giant table spread all over memory that you want to do updates to; it's the same kind of phenomenon.

[00:31:17] Even with the simple Arria FPGAs we have now, we blow the doors off of the cluster, where you have to communicate to do those remote things. We've got a bunch of other benchmarking efforts; in fact, Janice, who's also here, told me just before lunch that apparently the BFS we have is just about

[00:31:42] functional now. Is that correct? Good. Excellent. So there's a whole bunch of other things we should have some data on real soon now. For the follow-on systems, as I mentioned, the key enhancements here are going to be: improve the compilation; reduce the number of migrations needed by adding some additional functionality,

[00:32:08] namely migrating threads that don't just do one remote atomic but can do two; reduce the cost of the migrations by introducing even lighter-weight variants of threads; and then reducers, which is something that comes from Cilk Plus and addresses the issue I showed you that we have on conventional machines with aggregating data at the end of things. Reducers are things that

[00:32:40] automate that in a way so the programmer doesn't have to worry about logical hotspots, and we're going to add features to help with that. Software directions: we're on the verge of finishing up C++, integrating the Tapir compiler technology from MIT and doing our thing with it; we have ports of OpenMP and GraphBLAS underway; and we've got a bunch of front ends to Python that allow us to put in code that can be talked to by Python codes. And again, did I skip something of importance? OK, I'm going to skip that. This is the analytics direction we're going to go, and we're also pushing further.
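The two access patterns behind those benchmarks, pointer chasing and GUPS-style random updates, boil down to loops like the plain shared-memory C sketch below. This is not the actual benchmark code; the point is that on a cluster each iteration turns into a message, while on the Emu it becomes a migration or a single remote atomic.

    /* Plain shared-memory sketch of the two access patterns discussed:
     * pointer chasing (serial dependent loads) and GUPS-style random updates
     * (independent remote atomics). Not the actual benchmark code. */
    #include <stdatomic.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    struct node { struct node *next; uint64_t payload; };

    /* Each load depends on the previous one, so latency (or, on the Emu,
     * a migration to wherever the next node lives) dominates. */
    uint64_t chase(const struct node *p, size_t hops) {
        uint64_t sum = 0;
        for (size_t i = 0; i < hops && p; i++) { sum += p->payload; p = p->next; }
        return sum;
    }

    /* Updates land at effectively random table entries; on a cluster each one
     * is a message, on the Emu it can be a single remote atomic. */
    void gups_like(_Atomic uint64_t *table, size_t table_size,
                   const uint64_t *keys, size_t n) {
        for (size_t i = 0; i < n; i++)
            atomic_fetch_add_explicit(&table[keys[i] % table_size], 1,
                                      memory_order_relaxed);
    }

    int main(void) {
        /* Tiny self-test: a four-node chain and an eight-entry table. */
        struct node n[4] = {{&n[1], 1}, {&n[2], 2}, {&n[3], 3}, {NULL, 4}};
        _Atomic uint64_t table[8] = {0};
        uint64_t keys[] = {3, 11, 3, 5};
        gups_like(table, 8, keys, 4);
        printf("chase sum = %llu, table[3] = %llu\n",
               (unsigned long long)chase(&n[0], 4),
               (unsigned long long)table[3]);
        return 0;
    }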
[00:33:30] And this is where I want to be. As I said, I like to build machines, and where I really want to be is in the bottom of the stack, where what you have is a sea of memory with the computing underneath it, and the threads are the glue that binds it all together. We also have, and I'll do this very quickly, a new project that Vivek and I just started about two months ago.

[00:33:55] This is the same chart I showed earlier about post-Moore computing: we want to leverage the migrating threads. Today the issue with scaling big heterogeneous systems is gluing it all together, and it gets even worse when we want to do data analytics. So the focus of this project is on extreme scaling for high-performance data analytics applications, using the threads to glue things together, particularly to glue together heterogeneous activities going on in different kinds of nodes and different kinds of cores buried throughout the system,

[00:34:37] where you can handle the sparse, irregular problems directly, or, when you go to spawn a thread, you spawn it on the local accelerator and no longer have to go through big software stacks to talk to it. Our plan is basically to develop the semantics for this kind of thing, then go on to developing prototype execution models and tools, demonstrate with graph engines, and look at scaling. We've already got some results; we got a paper into Supercomputing, again from Georgia Tech with one of the students, on possible compiler optimizations, and we're starting a couple of other things related to the semantics. And that's it.

[00:35:23] OK. [Audience question, inaudible.]

>> Well, where we wanted to be, and we're partially there, was a system where the data layout is defined separately from the code. What that would mean is that, given this is a shared address space, you can write correct code without worrying about what the data layout is, and then you can go in and change the data layout to boost performance. The mechanism we have now for views of memory has only a limited ability to do that separation; I actually have a couple of patents in the works on richer mechanisms that allow much better kinds of distributions, but that will take a little more logic, and hopefully when we get to the next generation we'll be able to put those into the system. Yeah.

[00:36:53] [Audience question, inaudible.]

>> Correct, and in fact the GUPS code that we have: in the prototype we had before this one, all you had was migration. In the system we have now, and that's down here, you can do remote atomics; in other words, you can say "I want to add something to this location,"

[00:37:25] and the instruction spawns a single-purpose thread that goes out there and does it. In that case you don't get any response back other than an acknowledgement, so you can guarantee you know when everything has quiesced, but you don't necessarily get a value back; the remote atomic itself happens remotely. The next generation is to introduce these things we're calling dyadics, where you can do two operations: you go out, do a remote atomic, take that value and do something else with it, and something can be returned to a memory location in the parent's stack, if you will, and then there's an acknowledgement for that; when the parent sees the acknowledgement, it knows it's done.
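The distinction being drawn, a fire-and-forget remote add versus a dyadic form that also hands the old value back to the parent, can be caricatured with ordinary C11 atomics as below. This is only an analogy; the Emu instructions themselves and their names are not shown here.

    /* The distinction being drawn, expressed with ordinary C11 atomics as an
     * analogy; these are not the Emu instructions themselves. */
    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Today: fire-and-forget remote add. The update happens at the memory,
     * and the caller only ever learns (via an acknowledgement) that it finished. */
    void remote_add(_Atomic uint64_t *remote_loc, uint64_t v) {
        atomic_fetch_add_explicit(remote_loc, v, memory_order_relaxed);
        /* returned old value deliberately discarded */
    }

    /* Dyadic style: the old value comes back, so a second operation can be
     * chained on it (here, storing it into a slot in the parent's frame). */
    void remote_add_fetch(_Atomic uint64_t *remote_loc, uint64_t v,
                          uint64_t *parent_slot) {
        *parent_slot = atomic_fetch_add_explicit(remote_loc, v,
                                                 memory_order_relaxed);
    }

    int main(void) {
        _Atomic uint64_t counter = 10;
        uint64_t old = 0;
        remote_add(&counter, 5);                 /* counter becomes 15      */
        remote_add_fetch(&counter, 5, &old);     /* counter 20, old gets 15 */
        printf("counter = %llu, old = %llu\n",
               (unsigned long long)counter, (unsigned long long)old);
        return 0;
    }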
[00:38:07] So on GUPS we sit there and just spin out gobs and gobs of these remote accesses and wait for them to complete; with these others, with the dyadics, we can still spit out gobs and gobs of them, they can return values, we'll again know when they complete, and then we're free to migrate.

[Audience question, inaudible.]

>> Correct. And we've given up on coherency; yes, we're not even going to try.

[Audience question, inaudible.]

>> Well, that gets to this data placement problem. You get that temporal locality because of the placement, not because of anything else, and going back to what you mentioned, we don't try for coherency and we don't try for locality. There was an IBM machine,

[00:39:33] I think it was the 1401, that was called the CADET machine. Does anybody remember the CADET machine? Everything in it was table-driven.

[Audience comment, inaudible.]

>> Right, right, very good stuff, but it's as old as me, baby.

[00:40:09] [Remaining Q&A inaudible.]

>> Thank you.