[00:00:06] >> So let's get started. We're very happy to have Mohit here today. Mohit is an assistant professor at UNC, where he leads the NLP group; he did his PhD at Berkeley and spent a couple of years at TTI-Chicago. I didn't have a chance to overlap with him there — he started just after Devi and I left — but we collaborate a bunch. [00:00:31] Mohit is the recipient of a 2018 young investigator award and a 2017 young faculty award, and I stopped counting the industrial awards because there were just too many, so I'm going to skip that. Mohit's research spans a lot of interesting work at the intersection of, among other things, language and vision, but also knowledge grounding, common sense, and extracting knowledge from large-scale internet data, and today he's going to talk about multimodal, personable, and knowledgeable language generation. [00:01:09] (Audio setup in the back.) OK, OK, great — thanks for coming. Yeah, so as mentioned, I'll present work our group has been doing on different aspects of language generation; some of this will go all the way to dialogue generation and conversation. [00:01:38] At the end of the day it's all about language generation, and we're going to look at three aspects of it: how to bring in multimodality, how to bring in personality-based style, and lastly, on the knowledge side, how to bring in external common sense and external knowledge. This is a high-level slide that I usually show on what, in my opinion, is a diverse set of requirements a dialogue model needs if it's going to operate in your home in the near future and be useful, or just fun. [00:02:19] OK. Some of these should be pretty obvious. The first thing you need is that if there's a very long conversation history, you want the model to be able to do inference over it — to remember the right pieces from the past. If you use Alexa or Google Home, you're probably aware that, I think two or three years ago, they couldn't even do this — there are nice examples in the Jurafsky and Martin book — where you say, hey, show me the Thai restaurants around me, it shows you a list, and then you say, how is the second one in terms of reviews, and it would have no clue what you're saying, because it doesn't know what "the second one" refers to. I think recently they're able to handle at least two turns of conversation, where they do the coreference and understand what "the second one" means. Handling hundreds of turns of conversation — not just over a day but over a week, a month; we remember conversations from years ago — that's one big challenge for inference in these models, whether they're deep learning or more structured traditional models. Then you want these models to have common sense and external knowledge. When we talk as humans, we have a lot of shared knowledge; we assume a lot of things that we don't explicitly spell out. To use a very simple example, if I use a very non-simple, very complex word from my vocabulary that you may not share,
[00:03:46] I won't stop to define that word in the middle of the conversation — I'll assume a lot of external knowledge, and then more complex things like cultural knowledge and so on. Then there's this whole lifelong-learning piece, which is the third point: you don't want these dialogue models to just be static; you want them to be able to operate in the environment, get feedback from the environment, and improve over time. [00:04:10] The fourth point is that you want them to have personality — you want them to be able to convince the human audience, especially in scenarios like old-age homes, health-care situations, or intelligent tutoring; you want these models to be as convincing as possible. And finally there's this whole [00:04:29] gamut of requirements where these dialogue models need to be grounded in different kinds of other modalities. I'll talk a little more about the video side of things, but there are also things like gaze, gesture, and all kinds of expressions — facial expressions and so on. So there are three aspects to the talk, based on the title: multimodal, then personable or personality-based, and then knowledge-based. On the multimodal side we don't have that much time today, so I'll cover a few aspects and we can go into details and questions in our meetings later. The main idea we're pursuing is that currently, obviously, Alexa and Google Home don't have this capability, but ideally, if you want a virtual personal assistant in your home, it should be able to see the daily activities around it — it should share the same visual context as you, so that it has better knowledge of what you're talking about when you refer to things. That's what we do as humans when we talk to each other. [00:05:34] So basically it should be able to hold a dialogue conditioned on visual context, both for understanding and for generating a response. As a first step on this, we did a lot of work one or two years ago on video captioning — we wanted to bring videos into dialogue, so the first step was to build strong models for converting, on the fly, YouTube and other [00:06:01] videos to text: can we have models that describe the events going on in a video? I won't go into the details today, but these are papers that use multi-task learning with auxiliary tasks like entailment generation or even unsupervised video prediction — some of this looks a bit pixelated, but you basically want the video to predict its own future while the same encoder also generates the caption with another decoder, so you can share the encoder; or you can share the decoder with an entailment task, which means generating a logical subset of the input. We also tried a policy-gradient method here to get logical correctness into the caption, so that it doesn't generate any contradictory or unrelated information, which is a big problem with metric-based rewards.
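(To make the encoder-sharing idea concrete, here is a minimal sketch — my own simplification, not the papers' exact architecture; the layer sizes and LSTM choice are assumptions — of a video encoder shared between a caption decoder and a future-frame-prediction decoder:

    import torch
    import torch.nn as nn

    class SharedVideoEncoder(nn.Module):
        """Encodes a sequence of per-frame feature vectors with an LSTM."""
        def __init__(self, feat_dim=2048, hid_dim=512):
            super().__init__()
            self.rnn = nn.LSTM(feat_dim, hid_dim, batch_first=True)

        def forward(self, frames):            # frames: (B, T, feat_dim)
            outs, state = self.rnn(frames)
            return outs, state

    class MultiTaskVideoModel(nn.Module):
        """One shared encoder, two decoders: captioning + future prediction."""
        def __init__(self, vocab=10000, feat_dim=2048, hid_dim=512):
            super().__init__()
            self.encoder = SharedVideoEncoder(feat_dim, hid_dim)
            self.embed = nn.Embedding(vocab, hid_dim)
            self.cap_decoder = nn.LSTM(hid_dim, hid_dim, batch_first=True)
            self.cap_out = nn.Linear(hid_dim, vocab)
            self.frame_decoder = nn.LSTM(feat_dim, hid_dim, batch_first=True)
            self.frame_out = nn.Linear(hid_dim, feat_dim)

        def caption_logits(self, frames, cap_in):    # cap_in: (B, L) token ids
            _, state = self.encoder(frames)
            dec, _ = self.cap_decoder(self.embed(cap_in), state)
            return self.cap_out(dec)                 # (B, L, vocab)

        def future_frames(self, frames, future_in):  # teacher-forced frames
            _, state = self.encoder(frames)
            dec, _ = self.frame_decoder(future_in, state)
            return self.frame_out(dec)               # regress next features

    # Training would alternate batches between a cross-entropy caption loss
    # and an MSE future-frame loss, so both tasks shape the shared encoder.

The point of the sketch is only the sharing pattern: gradients from the unsupervised video-prediction decoder flow into the same encoder that the captioner uses.)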
[00:06:50] Then we started moving from video-to-caption to video plus dialogue. The first thing we did was look at possible datasets: where do you find datasets that have videos with chat grounded in those videos? There are things like Facebook and YouTube, obviously, which are harder to scrape for various reasons, so we started with something called Twitch, which many of you might be familiar with. This is a live game-streaming platform with very diverse kinds of games — as you can see here, there are strategy games (I'm not good at these, but things like StarCraft or World of Warcraft), and there are also soccer- and basketball-style [00:07:35] games. So we collected a lot of data from Twitch, and the first paper we did was more of a vision task, video summarization: given a long video, can we generate a highlight — a short clip that covers the important events that happened in a one-hour Twitch game? Maybe I'll use the pointer — [00:08:02] no, I'll just use the pointer on the slides. Basically, you want to use the visual features on the game side, but, more interestingly, also the textual features in the chat. This brought up a very interesting chat language, which is somehow quite different from Twitter's. There's been a lot of NLP work on Twitter in the last decade or so — Jacob's group has done a lot of this — but the Twitch side of things is a bit different: it's not just space-constrained but time-constrained. On Twitter you have X number of characters; here you have to chat live — the game's going on, it won't stop for anyone, so you have to type and move on fast with respect to the game. [00:08:49] So it brings its own vocabulary: special symbols, emoticons. There are also multiple users talking to each other, so there's a whole discourse element — when you have ten speakers, how do they interact with each other and in what patterns — and there's multilinguality too. It's a very rich [00:09:05] dataset for all kinds of research. The first thing we did with it was predict, like I said, the highlights, using a multi-channel model where the video frames come in on one end and the chat comes in on the other, and you jointly use these features to predict whether each frame should be part of the output highlight or not. So given each frame of the input video, you classify it as being important enough to be part of the summary or not.
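(A minimal sketch of that kind of multi-channel frame classifier — my own simplification; the fusion-by-concatenation and layer sizes are assumptions, not the paper's exact model:

    import torch
    import torch.nn as nn

    class HighlightClassifier(nn.Module):
        """Joint video+chat model: label each frame as highlight / not."""
        def __init__(self, frame_dim=2048, char_vocab=128, hid=256):
            super().__init__()
            self.frame_enc = nn.LSTM(frame_dim, hid, batch_first=True)
            self.char_embed = nn.Embedding(char_vocab, 64)
            self.chat_enc = nn.LSTM(64, hid, batch_first=True)
            self.classifier = nn.Linear(2 * hid, 1)

        def forward(self, frames, chat_chars):
            # frames: (B, T, frame_dim) CNN features, one vector per frame
            # chat_chars: (B, T, C) character ids of chat aligned per frame
            B, T, C = chat_chars.shape
            video_states, _ = self.frame_enc(frames)           # (B, T, hid)
            chars = self.char_embed(chat_chars.view(B * T, C))
            _, (h, _) = self.chat_enc(chars)                   # (1, B*T, hid)
            chat_states = h.squeeze(0).view(B, T, -1)          # (B, T, hid)
            joint = torch.cat([video_states, chat_states], dim=-1)
            return self.classifier(joint).squeeze(-1)          # (B, T) logits

    # Trained with nn.BCEWithLogitsLoss against per-frame 0/1 highlight
    # labels; the chat channel is read at the character level, as described.

The chat encoder reads characters rather than words, which is the character-level modeling discussed next.)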
[00:09:37] The visual part is obvious; the chat part was interesting in the sense that if there's some very shouty or excited text happening in the chat, the model can learn good features there — it learns that the corresponding frame has something exciting going on, so it should go in the summary. And we did this at the character level. This was two years ago, when character-level models were still new and coming up, and the idea was that when you have datasets like Twitter and Twitch, where the vocabulary is very different from traditional English, character-level models are much better, because they actually build the word character by character and then build the sentence word by word — it's almost hierarchical — so they can learn different kinds of prefixes and suffixes and their meanings automatically, through patterns across data examples. So when you have very compressed words, very fast-typed words, spelling errors, or emoticons, the model can start learning their meanings directly at the character level. [00:10:37] Then this year at EMNLP we extended this to the task of dialogue. We asked: how can we take this dataset as a stream of video context and a stream of chat history, and try to predict the next response? Think of it like a multi-encoder setup, where the video context comes in on one side and the chat context on the other, and you're trying to generate the next response as a chatbot trying to be a player — a chatbot player on Twitch. [00:11:07] This is the paper I'd point you to for details — we're covering a lot of topics today. This is what we call video-context dialogue: the task is, given a set of frames from a video clip and the corresponding chat, you have to predict the third thing, which is the next response. And we have both discriminative models and generative models. Discriminative means you are given possible next responses and your model has to rank the best one at the top — recall@1 or recall@5 — and generative models are more useful in the real world, where you're not just choosing among response options but actually generating the response word by word. So we have several models here, both tri-directional attention flow (TriDAF)-style discriminative models and attention-based generative models. Data and code are available online. Another thing I'd like to point to is TVQA. This is another video-based question answering dataset that we did this year, based on TV shows. It's probably the biggest corpus for video-based question answering at this point — it has around 150,000 question-answer pairs. Instead of dialogue, this is first looking at single-turn question answering, but the innovation is that it's compositional. An example question would be something like "What was Penny drinking when Leonard entered the room?" You first need to figure out which part of the clip is when Leonard entered the room, and then you go to more VQA-style models — what was Penny drinking at that point in the clip — so it needs both video localization and then NLP-/vision-style answer detection.
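(As a toy illustration of that localize-then-answer decomposition — the scoring functions here are entirely hypothetical stand-ins, just to show the two stages:

    def answer_compositional(question, frames, localize, answer_over):
        """TVQA-style two stages: (1) find the sub-span of the clip that
        matches the 'when ...' condition, (2) answer the main question
        inside that span."""
        start, end = localize(question, frames)          # video localization
        return answer_over(question, frames[start:end])  # answer detection

    # Toy per-second annotations standing in for real visual features.
    frames = [{"who": [], "action": None},
              {"who": ["Leonard"], "action": "enters"},
              {"who": ["Penny", "Leonard"], "action": "drinks wine"}]

    def localize(question, frames):
        # Pretend localizer: find when Leonard enters; answer afterwards.
        for i, f in enumerate(frames):
            if "Leonard" in f["who"] and f["action"] == "enters":
                return i, len(frames)
        return 0, len(frames)

    def answer_over(question, span):
        for f in span:
            if f["action"] and "drinks" in f["action"]:
                return f["action"].split()[-1]           # -> "wine"
        return "unknown"

    print(answer_compositional(
        "What was Penny drinking when Leonard entered?",
        frames, localize, answer_over))

A real model learns both stages end to end; the point is only that the compositional question forces an explicit temporal localization step before answer detection.)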
So the novelties here are these compositional questions, which force your model to do both video localization — which part of the clip has the answer — and then finding the answer inside it. It's one of the only large-scale datasets that requires both video and text. For example, MovieQA was a popular dataset, but when the data was created, the Turkers were not really even shown the videos — they were shown the plot synopses and asked questions based on those synopses, not the videos. Ours uses the clips themselves. [00:13:33] The clips are mostly a minute or two minutes long, and the Turkers also give us spans: we created an interface where they could drag the start and end and tell us which part of the clip contains the answer, then write the answer, and they also created tricky negative answers for us. [00:13:51] So it's multiple choice, where obviously the negative answers have to be very tricky to make sense. Yeah, we've released that. That part was a little bit — not noisy, just more relaxed, in the sense that the boundaries weren't very tight, which is expected, so we have a second stage going on right now where we're collecting both tighter video boundaries and object annotations, like bounding boxes for referring expressions. [00:14:18] So yes, this is online now, there's a leaderboard — start playing with this dataset if you're interested. The best part, I think, is that it has very diverse domains. We went into this making sure it would be something we can try transfer learning on, and also real video understanding — checking whether our models are really learning something more story-based and plot-based. It has relationship-style questions that need reasoning over longer clips, but also diverse domains, so now we're trying to see whether we can train on the comedy genre, like Big Bang Theory, and then do better on, say, the medical genre, like House-style shows — we have two or three very different genres in the dataset. [00:15:02] Yes, I guess I've probably covered about half of the slide. This is the pie chart of different kinds of questions — obviously Devi and her group started creating these pie charts [00:15:25] for VQA; this is for TVQA. As you can see, we tried to have a pretty large diversity of question types: object-category "what" questions, action, person, location, but also reasoning "why" and "how" questions, and then abstract questions that start with "what". There are six shows — Big Bang Theory, Friends, How I Met Your Mother on the sitcom/comedy side, and then also medical and crime shows — and we have some initial strong baselines on this. [00:15:49] Human performance is 90 percent, and the best models so far — these are already pretty strong baselines that we released, with all your usual multi-channel cross-attention and so on — are only around maybe 60 to 65 right now, so there's a big 25-to-30-percent gap to fill. [00:16:07] And that's the link: tvqa.cs.unc.edu.
So to conclude this multimodal thread of the talk — we won't have time to go into this in depth — we've also been doing work for the last maybe four years on deep learning for language and robotics. We probably had the first deep learning paper (not that it matters) on translating language instructions to action sequences in a map; this was back when the maps looked very simulated and [00:16:40] funny in some sense, and we've done it both ways. First we take an instruction and convert it to a path in a map — I think there's a lot of pixelation, this is a PDF, so I'm not sure it's visible, but basically there's a path here which you probably can't see. This is a big room with different-colored floors; there are easels, hat racks, chairs, all kinds of objects, different-colored walls, and the instructions talk about navigation in this toy environment, so you convert the instruction to the action sequence in the map. We've also done an HRI paper in 2017 where we went the other way: from a destination in a map, to a path, to generating the instruction back. Basically, we did a sort of inverse reinforcement learning based on human demonstrations to figure out the best path and the best CAS — these are structured, compound-action-specification command pieces — and then you translate the CAS pieces to natural language. In human experiments — I'd encourage you to look at the HRI paper — this was one of the first scenarios where the navigational instructions generated by the model were ranked higher than human-generated instructions in a blind study. Obviously the domain is simpler, but both humans and machines were operating in the same simplified world. [00:18:04] We're also trying to extend this to assembly instructions: you have several blocks on a table and you're trying to set them up into a specific configuration — "move the block closest to the right table edge so that it is to the left of the stack near the front left table corner". That's pretty easy for you, and deep learning models have the hardest time with these kinds of tasks, because for 30 or 40 years of NLP, all the datasets were news-based — Penn Treebank, CNN/
Daily Mail — and news never talks about the leftmost of something or the left of something, so you're basically starting from scratch. Which is great, because it forces you to think about weakly supervised learning, transfer learning, distant supervision, all of that. So this is a very interesting dataset that came out about two years ago; we think we still have the best results on it. You're basically given a source image, and you have to find the source block — "move the block closest to the right table edge", that's the red circle here. So that's the source. Now you have to find the reference and the offset: the reference is the front left table corner — that's this — and the offset is "left of" the reference: "the stack near the front left table corner" is the stack on the left side, and the instruction says left of that. So you have to move the source block to the left of the stack. We created a model that finds the source, the reference, and the offset, and then you add the offset to the reference to get the destination. We've also done the reverse now, where we take a source and a target image and generate the instruction that leads you from that source configuration to the target — eventually to build dialogue models where, if a human and a robot are interacting, the robot is not just following and executing instructions but also giving the next instruction, so that together they can complete a task. [00:19:59] This is also in a simulator, though we've tried it on a real Baxter robot. Right now there's no sequence of actions, just a source and an instruction — no repeated sequence of tasks to finish a whole configuration, because this alone is already challenging enough. The action space is: find the source — you get an accuracy for whether you found the right source block; then another accuracy for whether you found the right reference, which here is the stack near the front left table corner; and the final accuracy is on the destination distance — instead of moving it here, which is the right option, you might have moved it slightly over here, and then you get a little penalty, same as in navigation. So you predict the source, you predict the reference, you predict the offset, and then reference plus offset becomes the destination.
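(As a toy illustration of that source/reference/offset decomposition — purely my own sketch; the coordinates and the one-unit "left of" offset are made-up assumptions:

    import numpy as np

    # Block positions on the table, (x, y) in table units (hypothetical).
    blocks = {"A": np.array([4.0, 1.0]),      # closest to right table edge
              "B": np.array([0.5, 0.5]),
              "stack": np.array([1.0, 4.0])}  # stack near front-left corner

    def destination_of(reference, offset):
        """Destination = reference position + predicted offset vector."""
        return blocks[reference] + offset

    # "Move the block closest to the right table edge to the left of the
    # stack near the front left table corner."
    source = "A"                              # the block that gets moved
    reference = "stack"
    offset = np.array([-1.0, 0.0])            # "left of" = one unit in -x
    destination = destination_of(reference, offset)

    # Evaluation gives separate accuracies for source and reference, and
    # penalizes the destination by Euclidean distance to the gold point.
    gold = np.array([0.0, 4.0])
    print(source, destination, np.linalg.norm(destination - gold))

The model's three predictions map directly onto the three evaluation terms just described.)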
There's some new work on trying to do this over a sequence of actions, [00:21:08] but we're more interested in the dialogue aspect of it: can the model also generate "go from here to here", then "from here to here", and have a whole conversation. Yeah, that's what I was saying — we've tried an initial version of this on Baxter, not for navigation but for Baxter arm moves. [00:21:34] We've tried the assembly version of things, and — I won't talk about this today — we also have another project on adding common sense to instructions on Baxter, which is one of those red Rethink Robotics robots. For navigation [00:21:59] we've not started yet on a real robot; we're trying to do it with a Fetch, one of those small wheeled robots. We're trying to find the right setup — we don't want to simplify too much, but we also don't want to just let it loose in the department — so we're first trying to create the right room, something like a living room. We have a collaboration with the robotics folks at UNC, and we've already started some of this. And then the last piece of this is the Room-to-Room dataset that Peter has, where again we go from instruction to path, but we're also generating instructions. Currently [00:22:36] we have, I think, still the rank-one entry on this dataset — it's anonymous, but it's us. That's the comprehension part, instruction to action sequence; we're also looking at generating instructions — not just speaker-listener-style models, but real instructions that can be validated by humans as better than, hopefully, other human instructions — and then finally building a more interactive dialogue version of it. [00:23:05] Anyway, I'm happy to take some of these discussions offline; I want to jump into the second part. So that was the multimodal language generation, or dialogue generation, thread. The other thread we're looking at is personality-based language generation. What are some requirements here? When you're thinking of Google Home or Alexa-style home assistants and how to make them more human in personality, or more convincing, these are some of the axes that come to mind. First of all, they should have emotions and style, as we use them — things like politeness and rudeness, sadness and sympathy, but also the smaller axes of wit, humor, and sarcasm, which the NLP community has looked at; sarcasm, at least, has been studied for several years. There's also this whole direction of not just imitating: you don't want your model to imitate the human's style; you want it to understand the human's style and then respond with an appropriate style — you don't respond to sadness with sadness; there are many such examples. So you have to figure out the right, appropriate responding emotion and then generate a response with that emotion. And toward the end goal — when we write, say, proposals, we realize how much of this matters when you're collaborating with people in health care or in intelligent-tutoring-style areas, where these chatbots are much more convincing, effective, and trustworthy if they have some of these personality and style axes. I'll only go into one of these in more detail, the politeness axis: we've had work in the last two years on how to first build very strong detectors of politeness versus rudeness in language, and then, recently, how to add those elements into dialogue models when you're generating language. [00:25:05] So the first layer is based on Brown and Levinson 1987 — a thirty-year-old, seminal paper on the psycholinguistics of politeness. This table is from the Danescu-Niculescu-Mizil and Lillian Lee school — I think the longest name in NLP — [00:25:27] at Cornell: Lillian Lee and her then-student Cristian Danescu-Niculescu-Mizil, who is now faculty at Cornell himself.
So, about five years ago, they converted the Brown and Levinson 1987 paper into a set of roughly 20 features, where the different psycholinguistic theories of politeness were turned into features. First of all, politeness is not as simple as just adding "please" to everything — I'll show you that even the position of the word "please" can make things rude. This is in the context of online messages, when you email people or make requests; that's the data this was collected on. The obvious features are things like gratitude, deference, greeting, positive lexicon, negative lexicon, but then things start getting tricky. For example, look at features 7 and 8: [00:26:21] if you start your request with "please", that has a very negative politeness score, around minus 0.3. You can start imagining what emotions you're feeling when you write emails like that — when you really want something done fast: "Please, can we do this?" [00:26:35] And that directly correlates with features 12 and 13, counterfactual modal versus indicative modal: "can you do something" is on the ruder side of things, whereas "could you", "would you" is more polite — this should all start becoming obvious in hindsight. [00:26:55] Then there's the usual stuff about direct questions and indirectness, and features 14 to 18 are all about starting in the first person versus the second person. If you say "you" something, that's very direct and rude, whereas if you say "we" — I do that many times; I say something with "we" even though I'm not really involved in it, and it's clearly for that other person. [00:27:20] So there are a few things to learn here about being more polite when writing emails. And then there are things like hedges and factuality at the end. Anyway, these were features they used on a specific Wikipedia-based online-messaging request dataset and a Stack Exchange-based request dataset. [00:27:39] They had an SVM based on this — these were pre-deep-learning, feature-based classifiers, where you first manually come up with good features, add them to an SVM-style model, and it learns the importance, the weights, of those features. So the first thing we did, about three years ago, was take a pretty standard neural model — a combination of CNN and RNN, where the CNN finds local features and the RNN stitches them together with LSTMs to capture longer-term memory features — and we showed that, first of all, you already beat all the SVM-style results with that, which is expected; most of the paper was then about why it beats the feature-based models.
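(A minimal sketch of that kind of CNN-plus-LSTM politeness classifier — the sizes, kernel width, and pooling are my assumptions, not the paper's exact configuration:

    import torch
    import torch.nn as nn

    class PolitenessClassifier(nn.Module):
        """CNN finds local n-gram features; the LSTM stitches them over
        time; a linear layer predicts polite (1) vs rude (0)."""
        def __init__(self, vocab=20000, emb=128, channels=128, hid=128):
            super().__init__()
            self.embed = nn.Embedding(vocab, emb)
            self.conv = nn.Conv1d(emb, channels, kernel_size=3, padding=1)
            self.rnn = nn.LSTM(channels, hid, batch_first=True)
            self.out = nn.Linear(hid, 1)

        def forward(self, tokens):                  # tokens: (B, L) word ids
            x = self.embed(tokens).transpose(1, 2)  # (B, emb, L) for Conv1d
            x = torch.relu(self.conv(x)).transpose(1, 2)  # (B, L, channels)
            _, (h, _) = self.rnn(x)
            return self.out(h.squeeze(0)).squeeze(-1)     # (B,) logits

    model = PolitenessClassifier()
    toy_batch = torch.randint(0, 20000, (2, 12))    # two 12-token requests
    print(torch.sigmoid(model(toy_batch)))          # politeness probabilities

Trained with a binary cross-entropy loss on polite/rude labels, this is the kind of strong classifier that the dialogue work below reuses as a controller.)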
[00:28:27] This was one of the early interpretability papers, where we showed that if you look at things like activation clustering — where you look for clusters of inputs that fire the same neuron in your neural model — you find clusters of requests that correspond to these 20 features. Are people familiar with activation clustering? It came up more in the vision community first: you try to interpret your model by clustering inputs that fire the same activation. What we found was that most of these clusters were not only rediscovering these features, which is good, but were also discovering some new politeness properties that the 20 features didn't cover. We showed these to psycholinguistics researchers, and they were [00:29:16] obvious to them in hindsight. These four are examples of features that were rediscovered, and the new discoveries at the end were things like a cluster of indefinite pronouns — "someone", "anyone", "something", "anything". You can start imagining why this is rude: it's a good feature with a negative score, because you're saying "can someone do this", "can anyone do this". So indefinite pronouns turn out to be a good rudeness feature that wasn't in Brown and Levinson's list from thirty years ago. The other one was even more obvious in today's world: adding a lot of punctuation to your emails, like three question marks at the end — we've probably all been there. And the interesting one was ellipsis — dot dot dot, the three-dot discourse ellipsis. These were all automatically discovered as new features, with clusters of examples, so this was a little bit of interpretability on why deep learning beats feature-based models, in a very interesting application area. [00:30:16] Then this year — I guess last year, by publication time — we had our TACL paper from a few months ago, where we finally tried to incorporate style into dialogue using classifiers. First of all — the paper title is "Polite Dialogue Generation Without Parallel Data" — the thing to understand is that you don't have parallel data to train the style model on. By that I mean I don't have data where the same response is written in a regular way versus politely, so I can't just train a machine-translation-style model. There are datasets like English-to-Shakespeare, and there have been papers that apply machine translation models to convert normal English to Shakespearean English or vice versa; here there's no such data — we don't have the same answer written politely, rudely, and plainly. We have to do it without parallel data, which means we bring in classifiers as very strong controllers to add style to language generation. [00:31:24] From the previous paper we had a very strong, 88-to-90-percent-accurate politeness classifier, and this paper showed three different levels of control you can have in your dialogue model using classifiers. The simplest one is the fusion model: on the left side is the encoder of the conversational history, [00:31:50] and you're trying to generate the response like a usual sequence-to-sequence model, but you take the decoder — the response generator, the bottom one — and fuse it, which means you mix its parameters with another decoder that's trained only on polite sentences from your polite dataset. It's a shallower-style baseline: you keep your response generator but mix the parameters at different levels — early fusion, late fusion, deep fusion; this is existing work on how to mix the two decoders' parameters — so that the output is not only relevant to the conversation but also mixes in some style, since the second decoder is trained only on the polite sentences.
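(A tiny sketch of the late-fusion flavor of this idea — interpolating the next-token distributions of a dialogue decoder and a politeness-only language model; the interpolation weight lambda is an assumed knob, and the paper also explores fusing at the parameter level:

    import torch

    def fused_next_token_logprobs(dialogue_logits, polite_lm_logits, lam=0.3):
        """Late fusion: interpolate two decoders' next-token distributions.
        Both logits tensors are (B, vocab) for the current decoding step."""
        p_dialogue = torch.log_softmax(dialogue_logits, dim=-1)
        p_polite = torch.log_softmax(polite_lm_logits, dim=-1)
        return torch.logsumexp(
            torch.stack([p_dialogue + torch.log(torch.tensor(1 - lam)),
                         p_polite + torch.log(torch.tensor(lam))]), dim=0)

    # At each decoding step you pick argmax (or sample) from the fused
    # distribution, so responses stay relevant while leaning polite.
    vocab = 50
    fused = fused_next_token_logprobs(torch.randn(2, vocab),
                                      torch.randn(2, vocab))
    print(fused.exp().sum(dim=-1))    # ~1.0: still a proper distribution

Because the style only enters through mixing at decoding time, this is the shallowest of the three levels of control.)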
The next one we used was the label-fine-tuning model, which is actually pretty powerful and, surprisingly, works very well. [00:32:47] It's maybe a little trickier to understand, but the idea is that you still have the sequence-to-sequence model, with the conversation history as input, and you're trying to generate the target response word by word. What you can do during training is take the target response — you already know the real, ground-truth response — run the politeness classifier on it, get the score, and add that score as a label somewhere in the model, somewhere in the input string. Now you do this for all your, say, 100,000 training examples, and the model starts learning that whenever it sees a high-score label, the response usually looks very polite, and whenever it sees a negative or lower-score label, the responses usually look pretty rude — they use different kinds of words and grammar and syntax. You train the model like this, and then, nicely, at test time, when you're given a new conversation history and generating its response, you can add whatever label you want, from plus one to minus one, and get a control knob for how much rudeness or politeness to generate in the response. [00:34:00] So that's label fine-tuning — a pretty powerful model. And the last one is reinforcement-learning based — basically policy gradient, not really full reinforcement learning. The idea is that you have the context history and you're generating its response: you sample the full response — lots of samples — and when you've generated the whole response, you give it to the classifier, get the score back, and use that as the reward. If the politeness classifier gives a high score, that goes back as a positive reward, saying "OK, do more of this," and if the classifier doesn't like it, it goes back as a negative reward and discourages that behavior. [00:34:45] And we're now looking at models where you directly incorporate the classifier into the decoder and decompose the loss — there are tradeoffs between the two; with sampling you can do more exploration — so we're also looking at direct politeness-classifier loss incorporation into the decoder. I'll probably skip the results, because we still have the third part.
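(A minimal sketch of that policy-gradient step — REINFORCE with the classifier score as the reward; the fixed baseline and the toy numbers are my simplifications:

    import torch

    def polite_rl_loss(sample_logprobs, politeness_scores, baseline=0.5):
        """REINFORCE: sampled responses are scored by the politeness
        classifier. sample_logprobs: (B,) sum of token log-probs of each
        sampled response. politeness_scores: (B,) classifier scores in
        [0, 1]. Scores above the baseline are reinforced; below, discouraged."""
        reward = politeness_scores - baseline       # center so rude < 0
        return -(reward.detach() * sample_logprobs).mean()

    # Toy step: two sampled responses, one polite (0.9) and one rude (0.1).
    logprobs = torch.tensor([-12.3, -10.1], requires_grad=True)
    scores = torch.tensor([0.9, 0.1])
    loss = polite_rl_loss(logprobs, scores)
    loss.backward()
    print(loss.item(), logprobs.grad)  # gradient pushes up the polite sample

The classifier stays outside the gradient path (the reward is detached), which is exactly why direct loss incorporation into the decoder, mentioned above, is an interesting alternative with different tradeoffs.)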
[00:35:14] Yes — the left-side table basically shows examples of how well the classifier works. Things like "Well thanks, I appreciate that" and "Wow, amazing, thanks" are examples of the polite side, and these are examples of rude. This one is cherry-picked, in the sense of showing you a tricky example: "you really should pay more attention to what you read". There are no negative words there — they're all regular words — but from the way it's written, the way the words are used, the positions of the words, and the word "really", the classifier learns that this is extremely rude, almost at the far end of the 0-to-1 spectrum. [00:35:55] Even something like "Excuse me, does that flask belong to this man?" — again, there's nothing really negative in it. It's not as simple as looking for a negative word, as in sentiment analysis; it learns that you're accusing someone of something. So that's the classifier. Then these are results on the dialogue task, where we had to do human evaluation, obviously — we tried automatic metrics, but in dialogue they don't really mean much most of the time. So what we did, on MTurk, was this: we had two retrieval-based baselines, which either find the most relevant response even if it's not polite, or the most polite response even if it's maybe not relevant; then our sequence-to-sequence baseline, which has no style in it; and then our three models — fusion, label fine-tuning, and the polite-RL reward model. We asked several Turkers — there's a description in the paper of what we asked them — to score the responses both for quality, the usual measure of how relevant and fluent the response is, and for politeness. The distinction is important, because you can get a lot of politeness with really bad quality, or very [00:37:04] high quality with no politeness, but you want the balance. And the polite-RL model is the best balance: it gets statistically equal politeness levels to some of the strong baselines on the politeness axis while maintaining equal, or even better, quality than the original seq2seq model. [00:37:26] Then there are some interpretation analyses that we can skip; we also have a lot of different example generations in the paper. In this conversation, [00:37:45] X is saying "you're sweet to say so", and the polite models respond with things like "pretty song" — the last two models, you can see, both maintain conversational relevance (there's something about a song, "you sound like a goddess") while adding politeness. But there's still a long way to go. You can see the problem here — maybe you don't, but the subtle problem is that the coreference is wrong: if it's X, Y, and then X again, X is saying "you're sweet to say so", so X is the one being complimented, and then why compliment them again — "pretty song", "you sound like a goddess" — the model is being complimentary in the wrong direction. Obviously there are many other axes in dialogue that have to be fixed; in this thread of work we're focusing on style. So you can look at some examples.
[00:38:30] And then a shout-out for the version that Arjun did as a summer intern — Arjun, one of the students, your colleague sitting here. We did the humor version of this, which I think Facebook is also looking at now — they had some recent work on how to make image captions not just relevant to the image but also humorous. It's the exact same idea: you want the generated language to be both relevant to the input and have some style in it. It's easy to do one or the other; if you force your generation to add style, it's very easy for it to lose track of relevance to the input. [00:39:09] So we had this NAACL paper this year on how to insert witty puns — we called them punny captions: how to generate an image caption that maintains relevance to the image while having a pun inside the caption. OK, so the last part of the talk is some work on how to add knowledge and robustness to dialogue models, or language generation in general. [00:39:36] Again, this is my take on what some of the different axes are that need to be taken into account when you're making language generation or dialogue generation models robust. First of all, they need external common sense — which can have many definitions, but think of it as the human knowledge we don't literally spell out when we converse, because both of us already know it. Then there are things like logical entailment and saliency, which are lower-level skills needed in language generation. By logical entailment I mean that when you're generating, say, a summary of a document, you need to make sure it doesn't generate something that contradicts the input document and doesn't generate something totally unrelated to it. That's what logical entailment enforces: that the output is logically a subset of the input. But this is not exact math — it's language — so it's very nontrivial to ensure that the language you generate is logically contained strictly within the semantics of the input document. There's a whole area called entailment; natural language inference is the newer name for it. Then there are even more obvious things, like robustness to missing words, spelling and grammar errors, and paraphrases — I'll show you some examples — and the opposite: can the model be sensitive to important but very small changes, like adding a negation or an antonym? [00:41:08] And finally, more futuristically, can it actually verify facts when it's having a conversation — with all the fake-news, distraction, and misleading-information issues that are popping up today. I'll talk a little about each of these, probably a slide or two each, [00:41:27] just to give you a breadth of ideas to think about and chat about later. So this is a recent CoNLL paper, which I'll go into in the most detail: adversarial dialogue, how to add robustness to dialogue models. There are two sides to robustness, like I said — over-sensitivity and over-stability. These are
[00:41:52] terms that are well known now in the adversarial ML community. Over-sensitivity means you did something to the input which should not have changed the answer, but it did — in the vision community, you can perturb a few pixels in a face image and the face recognizer totally misclassifies it as someone else. [00:42:14] In my case — language, a dialogue model — these are the five strategies I can give you for that. You can take the dialogue history, the conversation, and randomly swap some words; drop a stop word; do some paraphrasing — say the same thing in a different way, either data-level paraphrasing or generative, where you actually have a model that generates paraphrases; or add certain grammar errors that shouldn't really matter, in the sense that as a human you would still understand the conversation and continue. We showed how all the current state-of-the-art dialogue models break on all of this, and then how to fix them. Over-stability is the opposite problem: you make a change that is very important, but the model doesn't realize it's important — it's too stable, it doesn't change its response. Again, the examples make it obvious: we show that you can take state-of-the-art dialogue models, add even very obvious things like a negation, or antonyms — take a word and replace it with its opposite-meaning word — and the responses don't change. It clearly shows there's a lot of shallow, [00:43:19] bag-of-words-level, phrase-level matching going on, especially with deep learning models. Anyway, that's what the paper is about. I have a nice real example from one of these currently used systems — Alexa or Google Home, I don't really remember which one. We tried saying "I think I'm having a heart attack". Ideally the answer should be some definition and some help: it should say something like "someone having a heart attack may feel chest pain", giving you the description and some information. Then we just perturb it — paraphrase it, not even grammar errors, just paraphrasing — saying "I'm afraid I'm having a heart attack", and it basically can't say anything; it says "My apologies, I don't understand." Once we fix the model, it can hopefully go back to saying the same thing. We didn't train on Alexa, obviously, but we've tried this on the current community research datasets — that's what the paper is about, and I'd encourage you to look at it. So, to the second part: adversarial testing is where we take all five of these over-sensitivity adversaries plus the over-stability adversaries, go into the dialogue context, break things in these weird little ways, and show how the results go wrong for state-of-the-art dialogue models, like the variational hierarchical model and the reinforcement-learning-based models. And the way we fix them is something pretty well known by now, called adversarial training. For over-sensitivity it's very easy: you take the training data and apply these changes to it — take the conversation history, add random swaps, and feed that data back into training, telling the model, "even when you see this random swap, the response is still the same, so you should be robust to it." You feed it back as positive examples.
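(A small sketch of what a couple of those over-sensitivity perturbations, plus one over-stability probe, might look like — a toy version with a hand-picked stop-word and antonym list, not the paper's exact generators:

    import random

    STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of"}
    ANTONYMS = {"good": "bad", "open": "closed", "always": "never"}

    def random_swap(tokens, rng=random):
        """Over-sensitivity: swap two adjacent words; meaning barely changes."""
        t = tokens[:]
        if len(t) > 2:
            i = rng.randrange(len(t) - 1)
            t[i], t[i + 1] = t[i + 1], t[i]
        return t

    def drop_stopword(tokens):
        """Over-sensitivity: remove one stop word; still understandable."""
        for i, tok in enumerate(tokens):
            if tok in STOP_WORDS:
                return tokens[:i] + tokens[i + 1:]
        return tokens

    def antonym_flip(tokens):
        """Over-stability probe: tiny edit that SHOULD change the response."""
        return [ANTONYMS.get(tok, tok) for tok in tokens]

    history = "the restaurant is always good".split()
    print(random_swap(history), drop_stopword(history), antonym_flip(history))

    # Swapped/dropped histories are fed back as positive training examples
    # paired with the original response; antonym-flipped histories instead
    # get "anything but the old response" max-margin training, described next.

The asymmetry between the two kinds of perturbation is exactly why the two training fixes differ.)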
For over-stability it's a little bit trickier, because [00:45:18] you have to tell the model, "when I retrain you, I'm going to show you a negation, so make sure you don't generate the current response — generate something else." But you don't know what that something else is, so we do max-margin-style negative-example training, where we train the model so that whenever it sees the same dialogue history with a negation in it, it can generate anything but the old response: every other response is treated as better than the old ground-truth response (a loss of something like max(0, margin + score(old response) − score(alternative))). Still very simple. The results themselves are not the important thing; what I meant to show here is that when we do adversarial testing, everything breaks; when we adversarially train, everything comes back up again; and when we do combined models — combining all these adversarial strategies together — we get state-of-the-art results on the community datasets, and actually even the normal test results improve, not just the adversarial test results. And the same holds for human evaluation. [00:46:18] These are some examples of how things get fixed after that training. I guess I have 10 or 15 minutes. So that was one way of adding robustness and knowledge to dialogue models; the more obvious one is this next paper, which was also presented at EMNLP two weeks ago, on multi-hop question answering and reasoning. There's this whole trend in NLP now on multi-hop reasoning, which means [00:46:55] you don't just take the document and find the answer in one shot with one attention pass — one shot, not to be confused with one-shot learning. You need multiple hops of reasoning: "Mary went to the market. Mary bought an umbrella. Mary came home. Where is the umbrella now?" That's a very, very toy example of it. Basically, you need multiple hops of reasoning inside a very long document to answer the question: you collect some evidence here, find the next piece of evidence, connect it to a third piece of evidence, and then combine all three to generate the answer. [00:47:33] What we showed here, on the newly popular dataset called NarrativeQA — which needs multiple hops of reasoning but also generates answers, as opposed to just multiple choice or a softmax over a large vocabulary; it actually needs generation of answers — is that you need multiple hops of reasoning not only inside the context: your model sometimes needs to go outside the context, to an external knowledge base, extract the right information from there, and come back and continue the reasoning inside the document. This is a very complex procedure even for humans: you have to find hops inside the document, decide when to go get the external knowledge, come back, continue the reasoning, and then merge all of this to generate the answer. [00:48:17] So that's exactly what we do: we proposed a new reasoning cell, a necessary-and-optional commonsense reasoning cell. The top part is the usual multi-hop reasoning question answering model, where you take the question and the context and there are multiple reasoning cells — they can be something like MAC cells —
[00:48:39] where you have multiple reasoning cells at different levels of attention: maybe you attend on one thing, then make sure you don't go back to the already-attended thing, attend somewhere else, and collect all of that. But instead of the usual reasoning cell attending only on the context, we have a cell that attends both on the context and on the commonsense relations. Basically, there's this extra little process of extracting the right subtree of knowledge given the context and the question, and the attention has a bypass: at each step it can decide whether to also use external knowledge from the commonsense relations, or to bypass it — this arrow — and not use it. So this is the bypass where the cell decides whether common sense is necessary or optional at this step; you add this capability at every step, and the model automatically learns, at each reasoning step, whether it needs to combine internal and external knowledge or only internal knowledge. This is basically a one-slide pointer — the results are in the paper, and the EMNLP talk recording should be released next week. [00:49:45] I think this is the second-to-last thing. This goes into fact verification, which is another, more high-level kind of knowledge. You've probably all heard of fake news, especially now. In more technical terms we call it fact extraction and verification: given a claim — if I just say a sentence, [00:50:08] that's called a claim — how do we verify that claim? How do we first extract the right knowledge and then verify the claim with it? Luckily, our community created a very interesting shared task for this at EMNLP, called the FEVER task — FEVER is Fact Extraction and VERification. [00:50:31] We got rank one here among the 25 teams. It's a very interesting leaderboard — it's now public, so you should go and try to get better results. The idea is that you're given this orange claim, and there are three steps. You're given all of Wikipedia — it's a very large-scale task — and given the single sentence of the claim, you first do document retrieval over all of Wikipedia to figure out the most relevant documents that might help verify this claim; then, inside each of those documents, you do sentence or paragraph selection to figure out which parts might be relevant to verifying the claim; and finally, when you've collected the set of evidence sentences, you verify the claim against this context. The claim verification is a three-label classification task: S means [00:51:25] supported — the information you've collected is sufficient to verify the claim; R means refuted — you were able to strictly say no, this is wrong, it's not true; and NEI means not enough information — the model is saying there isn't enough information for it to strictly say supported or refuted. This is actually very similar to older tasks in NLP known as entailment — RTE, MultiNLI, SNLI, the Stanford Natural Language Inference corpus. It's exactly what this is: given a premise and a hypothesis, that task asks you to classify the pair as entailed, [00:52:06] contradicted, or unrelated — neutral, meaning it has extra information. We have a joint model in the paper, and an extended version is coming out at AAAI.
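(A high-level sketch of that three-stage pipeline — the scoring functions here are word-overlap stand-ins; the actual system uses a neural semantic matching model for all three stages:

    LABELS = ("SUPPORTED", "REFUTED", "NOT ENOUGH INFO")

    def verify_claim(claim, wiki, retrieve, select, classify,
                     k_docs=5, k_sents=5):
        """Three stages: document retrieval -> sentence selection -> 3-way
        claim verification. retrieve/select score relevance; classify maps
        (claim, evidence sentences) to one of LABELS."""
        docs = sorted(wiki, key=lambda d: retrieve(claim, d),
                      reverse=True)[:k_docs]
        sents = [s for d in docs for s in d["sentences"]]
        evidence = sorted(sents, key=lambda s: select(claim, s),
                          reverse=True)[:k_sents]
        return classify(claim, evidence), evidence

    # Toy stand-ins based on word overlap (a trained matcher replaces these):
    def overlap(a, b):
        return len(set(a.lower().split()) & set(b.lower().split()))

    wiki = [{"title": "Rome", "sentences": ["Rome is the capital of Italy."]},
            {"title": "Paris", "sentences": ["Paris is the capital of France."]}]

    label, ev = verify_claim(
        "Rome is the capital of France.", wiki,
        retrieve=lambda c, d: overlap(c, d["title"]),
        select=overlap,
        classify=lambda c, e: "REFUTED")   # pretend classifier for the demo
    print(label, ev)

Each stage narrows the search space for the next, which is why sharing one semantic matching model across all three, as described next, is natural.)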
It shows how we can use the same entailment-style neural semantic matching model for document retrieval, sentence selection, and claim verification, and how each of these modules helps the others, with joint information flowing across them. The reason I'm talking about this is that one of the next steps we're looking at is how to incorporate this into dialogue models: when a machine is having a conversation, it should be able to take the information coming in from the other turns and verify it, as opposed to blindly responding to every utterance — which is harmful, because it can propagate and encourage fake news, or the spread of misleading facts, if the model just continues the conversation instead of saying, "wait, this is unfactual; this is a refuted fact." [00:53:14] And then the last work in the talk is what we've been doing on — like I said, in that slide of bubbles — one of the bubbles: how to make language generation much more robust by adding more low-level semantic skills into it. By that I mean things like entailment generation. The left side and right side are both papers on document summarization, one of the very popular language generation tasks: given a very long document, can we generate, say, a 100-word summary of it? This usually has [00:53:50] three or four issues. One is saliency: you want the summary to have all the salient, important information from the document. Another is redundancy: you want to avoid repeated information, because you're only given 100 words in the summary — that real estate is very expensive, and every word matters. These have been well studied. The thing that hasn't been studied, or has been hard to study, is how to make sure your summary, at the end of the day, doesn't contain any contradictory or unrelated information with respect to the input document — because that's what entailment is. So we've been focusing a lot on that. We added entailment knowledge into summarization models via multi-task learning and reinforcement learning. The center model is SG, summary generation: the summary-generation encoder takes the document and the decoder generates the summary. Now, what we can do is share these encoders and decoders with other tasks, like [00:54:53] question generation and entailment generation. Entailment generation is the task where you're given a long premise and you have to generate a logical subset of it. This was a classification task in the community — given a premise and a hypothesis, you classify whether the premise entails the hypothesis, contradicts it, or is unrelated — and we converted it to a generation dataset: given a long sentence, can we generate a short sentence that is logically a subset of the long sentence? This can be directly shared with the summarization model, because it's the same skill: you want to make sure your summary is logically entailed by the input.
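(A minimal sketch of this kind of sharing — including the higher-layer sharing discussed right after this, and the question-generation task introduced next; the LSTM choice, sizes, and exact sharing split are my assumptions:

    import torch.nn as nn

    class SharedSeq2Seq(nn.Module):
        """Three tasks (summarization, entailment gen., question gen.)
        share the HIGHER encoder layers; lower layers stay task-specific."""
        def __init__(self, vocab=30000, hid=256):
            super().__init__()
            # Task-specific lower layers (closer to input: more lexical).
            self.low_enc = nn.ModuleDict(
                {t: nn.LSTM(hid, hid, batch_first=True)
                 for t in ("summ", "entail", "qgen")})
            # Shared higher layer (closer to attention: more semantic).
            self.high_enc = nn.LSTM(hid, hid, batch_first=True)
            self.embed = nn.Embedding(vocab, hid)
            self.out = nn.ModuleDict(
                {t: nn.Linear(hid, vocab) for t in ("summ", "entail", "qgen")})

        def forward(self, task, tokens):   # tokens: (B, L) word ids
            x = self.embed(tokens)
            x, _ = self.low_enc[task](x)   # task-specific lexical layer
            x, _ = self.high_enc(x)        # shared semantic layer, all tasks
            return self.out[task](x)

    # Training alternates mini-batches across the three tasks, so gradients
    # from entailment and question generation flow into the shared higher
    # layer, teaching the summarizer logical subsetting and saliency.

The decoder side mirrors this pattern in the actual attention-based models; the sketch only shows the lower-specific/higher-shared split.)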
[00:55:35] And then the question-generation model: given a document, can you generate important questions about that document? You can take question answering datasets like SQuAD and convert them into a question-generation dataset, which teaches the summarization model saliency: if the model can learn to ask the right questions, it will be able to generate those salient things in the summary. Same setup — the input can be the document and the output can be the question. And we showed that if you have a multi-layer encoder and a multi-layer decoder, you can share the higher-level layers between these tasks — higher-level meaning away from the input and away from the output, closer to the attention model. If you share these three tasks at the higher-level layers, away from input and output, those layers are more semantic in nature — this has also been shown in the vision community — whereas the lower-level layers, closer to input and output, are more syntactic and lexical; they look more at the words and the syntax, and the higher, inner layers become more about semantics. And there's the other version of this, where you can add these as rewards: you generate a summary, sample the summary, and then fire entailment and saliency classifiers on it as rewards. [00:56:51] And finally, the third version of this was the COLING paper, where we did the task of text simplification. This is a task that's very important for accessibility and disability contexts, too: you want to take a complicated text and automatically simplify it. Here again we have the same setup: sentence simplification can be helped by entailment generation and paraphrase generation — entailment generation again teaches the model to generate a logical subset, and paraphrase generation teaches it how to paraphrase things inside the sentence to make it simpler. But the interesting, novel thing here was this: if you've done multi-task learning, one of the biggest annoying factors is the mixing ratio — you have to learn a curriculum of how to alternately train these tasks, how many epochs to give to each of them. So we used a multi-armed bandit model to do this automatically: it's a dynamic model that automatically learns the best curriculum of which order to train the tasks in. And we're now also extending this to automatically choose which layers of the model to share, and even the auxiliary tasks themselves — given a few hundred candidate tasks, what are the best auxiliary tasks that can help the given task? Yes, all the code is available. [00:58:07] And that's it — this is the group that does all the work, I just talk — and we thank the sponsors. The group's website is nlp.cs.unc.edu; all the information is there — the people, the papers, the software, all of that. We have a postdoc opening, if anyone's interested — there's a flyer on my web page and the group page; it's a very flexible position in terms of funding, but also with a lot of focus on faculty development — and we'll have NLP faculty openings too, including machine learning and related areas. Thanks.