Understanding the motion of a human state in video classification
Castro, Daniel Alejandro
For the last 50 years, researchers have studied the correspondence between human motion and the action or goal a person is attempting to accomplish. Humans subconsciously learn subtle cues about other individuals that give them insight into motivation and overall sincerity. In contrast, computers require significant guidance to correctly recognize deceptively basic activities. With the recent advent of deep learning, many algorithms no longer make explicit use of motion parameters to categorize these activities. At the same time, widespread video recording and the sheer amount of video data being stored make the ability to study human motion more essential than ever. In this thesis, we propose that our understanding of human motion representations and their context can be leveraged for more effective action classification. We explore two distinct approaches for understanding human motion in video. Our first approach classifies human activities in an egocentric context. Here, frames are captured once per minute, producing a low-frame-rate video that summarizes a person's day. The challenge in this context is that there is no explicit visual representation of a human. To tackle this problem, we leverage contextual information alongside the image data to improve the understanding of daily activities; motion is represented only implicitly in the image data, since we have no visual representation of a human pose. We combine existing neural network models with contextual information using a process we label a late-fusion ensemble. We rely on the convolutional network to encode high-level motion parameters, which we later demonstrate performs comparably to explicitly encoded motion representations such as optical flow. We also demonstrate that our model extends to other participants with only two days of additional training data.
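The abstract does not spell out the fusion arithmetic, but a late-fusion ensemble of this kind is commonly implemented by running each model independently and combining their class-probability outputs afterwards. A minimal sketch in Python follows; the fusion weight `w` and the toy scores are illustrative assumptions, not values from the thesis:

```python
import numpy as np

def softmax(z):
    """Convert raw scores to class probabilities, row-wise."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def late_fusion(image_logits, context_logits, w=0.5):
    """Late fusion: each model is run independently, and only their
    final class-probability outputs are blended (weight w on the
    image model, 1 - w on the context model)."""
    p_img = softmax(image_logits)
    p_ctx = softmax(context_logits)
    return w * p_img + (1 - w) * p_ctx

# Toy example with 3 activity classes (hypothetical scores):
img = np.array([[2.0, 0.5, 0.1]])   # CNN scores from the frame
ctx = np.array([[0.2, 1.5, 0.1]])   # scores from contextual features, e.g. time of day
fused = late_fusion(img, ctx, w=0.6)
pred = int(fused.argmax())          # class with highest fused probability
```

In practice the fusion weight would be tuned on held-out data, and the context model could be as simple as a logistic regression over time-of-day or location features.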
This work enabled us to understand the importance of leveraging context through parameterization for learning human activities. In our second approach, we improve this encoding by learning from three representations that integrate motion parameters into video categorization: (1) regular video frames, (2) optical flow, and (3) human pose representations. Regular video frames are most commonly used in video analysis on a per-frame basis due to the nature of most video categories. We introduce a technique that combines contextual features with a traditional neural network to improve the classification of human actions in egocentric video. Then, we introduce a dataset focused on humans performing various dances, an activity whose recognition inherently requires identifying motion. We discuss the value and relevance of this dataset alongside the most commonly used video datasets and a handful of recently released datasets relevant to human motion. Next, we analyze the performance of existing algorithms under each of the motion parameterizations mentioned above. This helps us assess the intrinsic value of each representation and better understand each algorithm. Following this, we introduce an approach that utilizes all of the motion parameterizations concurrently to build a richer understanding of the video. From there, we introduce a method to represent a human pose over time to improve video categorization. Specifically, we track distances between specific joints over time to generate features that represent the distribution of human poses over time. The performance of each individual metric is computed and analyzed to assess its intrinsic value. The main objective and contribution of our work is to introduce a parameterization of human poses that improves action recognition in video.
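The exact pose parameterization is detailed in the thesis body rather than in this abstract, but the idea of summarizing joint distances over time can be sketched as follows. The joint pairs and the choice of summary statistics below are illustrative assumptions, not the thesis's actual feature set:

```python
import numpy as np

def joint_distance_features(poses, pairs):
    """Summarize how far apart selected joints are across a clip.

    poses: (T, J, 2) array of 2-D joint coordinates over T frames.
    pairs: list of (i, j) joint-index pairs whose distances we track.
    Returns a fixed-length feature vector: for each pair, the mean,
    standard deviation, minimum, and maximum of the inter-joint
    distance over time.
    """
    feats = []
    for i, j in pairs:
        d = np.linalg.norm(poses[:, i] - poses[:, j], axis=1)  # (T,) distances
        feats.extend([d.mean(), d.std(), d.min(), d.max()])
    return np.array(feats)

# Toy clip: 3 joints tracked over 4 frames (hypothetical coordinates).
poses = np.array([
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    [[0.0, 0.0], [2.0, 0.0], [0.0, 1.0]],
    [[0.0, 0.0], [3.0, 0.0], [0.0, 2.0]],
    [[0.0, 0.0], [4.0, 0.0], [0.0, 2.0]],
])
feats = joint_distance_features(poses, pairs=[(0, 1), (0, 2)])
# 4 statistics per pair -> an 8-dimensional feature vector
```

Because the statistics pool over the whole clip, the resulting vector has a fixed length regardless of clip duration, which makes it easy to feed into a standard classifier.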