Reasoning about Complex Media from Weak Multi-modal Supervision
In a world of abundant information targeting multiple senses, and increasingly powerful media, we need new mechanisms to model content. Techniques for representing individual channels, such as visual or textual data, have greatly improved, and some techniques exist to model the relationship between channels that are “mirror images” of each other and contain the same semantics. However, multimodal data in the real world contains little redundancy; the visual and textual channels complement each other. We examine the relationship between multiple channels in complex media in two domains: advertisements and political articles.

First, we collect a large dataset of advertisements and public service announcements covering almost forty topics (ranging from automobiles and clothing to health and domestic violence). We pose decoding the ads as automatically answering the questions “What should the viewer do, according to the ad?” (the suggested action) and “Why should the viewer do the suggested action, according to the ad?” (the suggested reason). We train a variety of algorithms to choose the appropriate action-reason statement, given the ad image and potentially a slogan embedded in it. The task is challenging because of the great diversity in how different users annotate an ad, even when they draw similar conclusions. One approach mines information from external knowledge bases, but much of the information that can be retrieved is not relevant. We show how to automatically transform the training data to focus our approach’s attention on relevant facts, without relevance annotations for training. We also present an approach for learning to recognize new concepts given supervision only in the form of noisy captions.

Second, we collect a dataset of multimodal political articles containing lengthy text and a small number of images.
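The ad-decoding task described above can be framed as ranking candidate action-reason statements against the ad image in a shared embedding space. A minimal sketch under that framing, assuming hypothetical precomputed embeddings (the names `image_emb` and `statement_embs` are illustrative, not from the dissertation):

```python
import numpy as np

def choose_action_reason(image_emb, statement_embs):
    """Rank candidate action-reason statements by cosine similarity
    to the ad-image embedding; return the best index and all scores.

    image_emb: (d,) vector for the ad image (hypothetical encoder output).
    statement_embs: (n, d) matrix, one row per candidate statement.
    """
    img = image_emb / np.linalg.norm(image_emb)
    stmts = statement_embs / np.linalg.norm(statement_embs, axis=1, keepdims=True)
    scores = stmts @ img  # cosine similarity per candidate
    return int(np.argmax(scores)), scores
```

In practice the embeddings would come from trained image and text encoders; the ranking step itself is this simple once they exist.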
We learn to predict the political bias of the article, and to perform cross-modal retrieval despite large visual variability within the same topic. To infer political bias, we use generative modeling to show how the face of the same politician appears differently at each end of the political spectrum. To understand how image and text contribute to persuasion and bias, we learn to retrieve sentences for a given image, and vice versa. The task is challenging because, unlike image-text pairs in captioning, the images and text in political articles overlap only in a very abstract sense. We impose a loss requiring that images which correspond to similar text lie close together in a projection space, even if they are visually very diverse. We show that our loss significantly improves performance in conjunction with a variety of recent existing losses. We also propose new weighting mechanisms that prioritize abstract image-text relationships during training.
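The proximity constraint described above can be sketched as a simple penalty: image pairs whose associated text is similar are pushed to lie within a margin of each other in the projection space. A minimal NumPy sketch under that assumption (the function name, margin value, and the use of a precomputed text-similarity matrix are illustrative, not the dissertation's actual loss):

```python
import numpy as np

def text_guided_image_loss(image_proj, text_sim, margin=0.2):
    """Penalize image pairs whose text is similar but whose projections
    are far apart.

    image_proj: (n, d) projected image embeddings (hypothetical).
    text_sim: (n, n) pairwise text similarities in [0, 1] (hypothetical).
    """
    n = image_proj.shape[0]
    # Pairwise Euclidean distances between projected images.
    diffs = image_proj[:, None, :] - image_proj[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1) + 1e-12)
    # Weight the hinged distances by text similarity: pairs with similar
    # text incur a cost when farther apart than the margin.
    mask = ~np.eye(n, dtype=bool)  # ignore self-pairs
    return float((text_sim * np.maximum(dists - margin, 0.0))[mask].mean())
```

The hinge keeps the penalty zero for pairs already within the margin, so visually diverse images are only pulled together when their text says they should be.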