Visually grounded language understanding and generation
The world around us involves multiple modalities -- we see objects, feel textures, hear sounds, smell odors, and so on. For Artificial Intelligence (AI) to make progress in understanding the world around us, it needs to be able to interpret and reason about multiple modalities. In this thesis, I take steps toward studying how inducing appropriate grounding in deep models improves multi-modal AI capabilities, in the context of vision and language. Specifically, I cover four tasks: visual question answering, neural image captioning, visual dialog, and vision-and-language pretraining. In visual question answering, we collected a large-scale visual question answering dataset, and I study various baselines to benchmark the task. To jointly reason about the image and the question, I propose a novel co-attention mechanism that learns fine-grained grounding to answer the question. In image captioning, I address model designs for grounded caption generation from an image. A key focus is extending the model with the ability to know when to look at the image while generating each word. For words with explicit visual correspondence, we further propose a novel approach that reconciles classical slot-filling approaches with modern neural captioning approaches. As a result, our model can produce natural language explicitly grounded in entities that object detectors find in the image. In visual dialog, I study both sides of the visual dialog agents -- the questioner and the answerer. For modeling the answerer, which answers visual questions in a dialog, I introduce a novel discriminant perceptual loss that transfers knowledge from a discriminative model to a generative model. For modeling the questioner, I consider an image guessing game as a test-bed for balancing task performance and language drift.
I propose a Dialog without Dialog task, which requires agents to generalize from single-round visual question generation with full supervision to a multi-round dialog-based image guessing game without direct language supervision. The proposed visually grounded dialog models can adapt to this new task while exhibiting less linguistic drift. In vision-and-language pretraining, I study more general models that can learn visual groundings from massive metadata on the internet. I also explore multi-task vision-and-language representation learning. Our results show not only that a single model can perform all 12 vision-and-language tasks, but also that joint training can lead to improvements in task metrics compared to single-task training with the same architecture. Through this work, I demonstrate that inducing appropriate grounding in deep models improves multi-modal AI capabilities. Finally, I briefly discuss the challenges in this domain and extensions of recent work.