Deriving Sensor-based Complex Human Activity Recognition Models Using Videos
With the ever-increasing number of ubiquitous and mobile devices, Human Activity Recognition (HAR) using wearables has become a central pillar of ubiquitous and mobile computing. HAR systems commonly adopt machine learning approaches that rely on supervised training with labeled datasets. Recent success in HAR has come with advances in supervised training techniques, namely deep learning models, which have also made dramatic breakthroughs in domains such as computer vision, natural language processing, and speech recognition. Across these domains, the keys to deriving robust recognition models that generalize well across application boundaries have been highly complex analysis models and large-scale labeled datasets that serve the data-hungry nature of deep learning. Although the field of HAR has seen its first substantial successes with deep learning models, the complexity of HAR models remains constrained, mainly because of the typically small-scale datasets. Conventionally, sensor datasets are collected through user studies in a laboratory environment. This process is very labor-intensive: recruiting participants is expensive, and annotation is time-consuming. As a consequence, sensor data collection often yields only a small labeled dataset, and a model derived from such a dataset is unlikely to generalize well.

My research develops a framework, IMUTube, that can alleviate the limitations of large-scale labeled data collection in sensor-based HAR, which is the most pressing issue limiting model performance in HAR systems. I aim to harvest existing video data from large-scale repositories such as YouTube. IMUTube is a system that bridges the modality gap between videos and wearable sensors by tracking human motion captured in videos.
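The weak labels mentioned above come from video metadata rather than manual annotation. As a minimal sketch of the idea, the following assigns a candidate activity label to a video from its title via keyword matching; the keyword map and function name are hypothetical illustrations, not the actual IMUTube labeling procedure.

```python
from typing import Optional

# Hypothetical keyword map; a real label vocabulary depends on the target HAR task.
ACTIVITY_KEYWORDS = {
    "cycling": ["cycling", "bike ride", "bicycle"],
    "running": ["running", "jogging", "marathon"],
    "push-up": ["push up", "push-up", "pushup"],
}

def weak_label(title: str) -> Optional[str]:
    """Assign a weak activity label from a video title, or None if no keyword matches."""
    lowered = title.lower()
    for activity, keywords in ACTIVITY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return activity
    return None
```

Labels obtained this way are noisy (a title may be misleading, or a video may contain several activities), which is why they are called weak labels and why downstream filtering of the video content matters.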
Once motion information is extracted from the videos, it is transformed into virtual Inertial Measurement Unit (IMU) sensor signals for various on-body locations. The virtual IMU data collected from a large number of videos is then used to derive HAR systems that can be deployed in real-world settings. The overarching idea is appealing because of the sheer size of readily accessible video repositories and the availability of weak labels in the form of video titles and descriptions. By integrating techniques from the fields of computer vision, computer graphics, and signal processing, the IMUTube framework automatically extracts motion information from arbitrary human activity videos and is thereby not limited to specific scenes or viewpoints. Tracking 3D motion in unrestricted online video poses multiple challenges, such as fast camera motion, noise, lighting changes, and occlusions. IMUTube automatically identifies artifacts in a video that hinder robust motion tracking and generates high-quality virtual IMU data only from those video segments that exhibit the least noise. Using IMUTube, I show that complex models, which could not have been derived from the typical small-scale datasets of real IMU sensor readings, become trainable with the weakly labeled virtual IMU dataset collected from many videos. The availability of more complex HAR models represents a first step towards designing sophisticated deep learning models that capture sensor data more effectively than the state of the art. Overall, my work opens up research opportunities for the human activity recognition community to generate large-scale labeled datasets in an automated, cost-effective manner. Access to such larger-scale datasets, in turn, makes it possible to derive more robust and more complex activity recognition models that can be employed in entirely new application scenarios.
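The core of the video-to-sensor transformation is differentiating tracked motion into inertial quantities. The sketch below, under simplifying assumptions (a clean 3D joint trajectory in a world frame with y-up gravity, uniform frame rate, no sensor-frame rotation), approximates a virtual accelerometer by twice differentiating position and adding gravity; the real IMUTube pipeline additionally estimates sensor orientation and handles tracking noise, so this is an illustration rather than the actual implementation.

```python
import numpy as np

def virtual_accelerometer(positions, fps=30.0, gravity=(0.0, -9.81, 0.0)):
    """Approximate a virtual accelerometer from a 3D joint trajectory.

    positions : (N, 3) array of positions in metres, assumed world frame.
    Returns an (N, 3) array of acceleration readings in m/s^2.
    """
    dt = 1.0 / fps
    # Second derivative of position approximates linear acceleration.
    accel = np.gradient(np.gradient(positions, dt, axis=0), dt, axis=0)
    # A real accelerometer also measures the reaction to gravity;
    # we assume a world frame with gravity along -y here.
    return accel + np.asarray(gravity)

def virtual_gyroscope(orientations_deg, fps=30.0):
    """Approximate angular velocity (deg/s) from Euler angles per frame (N, 3)."""
    return np.gradient(orientations_deg, 1.0 / fps, axis=0)
```

For a body segment moving at constant velocity, the second derivative vanishes and the virtual accelerometer reports only the gravity component, matching what a real, non-accelerating IMU would sense.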