Leveraging mid-level representations for complex activity recognition
MetadataShow full item record
Dynamic scene understanding requires learning representations of the components of the scene including objects, environments, actions and events. Complex activity recognition from images and videos requires annotating large datasets with action labels which is a tedious and expensive task. Thus, there is a need to design a mid-level or intermediate feature representation which does not require millions of labels, yet is able to generalize to semantic-level recognition of activities in visual data. This thesis makes three contributions in this regard. First, we propose an event concept-based intermediate representation which learns concepts via the Web and uses this representation to identify events even with a single labeled example. To demonstrate the strength of the proposed approaches, we contribute two diverse social event datasets to the community. We then present a use case of event concepts as a mid-level representation that generalizes to sentiment recognition in diverse social event images. Second, we propose to train Generative Adversarial Networks (GANs) with video frames (which does not require labels), use the trained discriminator from GANs as an intermediate representation and finetune it on a smaller labeled video activity dataset to recognize actions in videos. This unsupervised pre-training step avoids any manual feature engineering, video frame encoding or searching for the best video frame sampling technique. Our third contribution is a self-supervised learning approach on videos that exploits both spatial and temporal coherency to learn feature representations on video data without any supervision. We demonstrate the transfer learning capability of this model on smaller labeled datasets. We present comprehensive experimental analysis on the self-supervised model to provide insights into the unsupervised pretraining paradigm and how it can help with activity recognition on target datasets which the model has never seen during training.