Weakly Supervised Learning for Musical Instrument Classification
Gururani, Siddharth Kumar
MetadataShow full item record
Automatically recognizing musical instruments in audio recordings is an important task in music information retrieval (MIR). With increasing complexity of modeling techniques, the focus of the Musical Instrument Classification (MIC) task has shifted from single note audio analysis to MIC with real world polytimbral music. Increasingly complex models also increase the need for high quality labeled data. For the MIC task, there do not exist such large-scale fully annotated datasets. Instead researchers tend to utilize multi-track data to obtain fine-grained instrument activity annotation. Such datasets are also known as strongly labeled datasets (SLDs). These datasets are usually small and skewed in terms of genre and instrument distribution. Hence, SLDs are not the ideal choice for training generalizable MIC models. Recently, weakly labeled datasets (WLDs), with only clip-level annotations, have been presented. These are typically larger in scale than SLDs. However, methods popular in MIC literature are designed to be trained and evaluated SLDs. These do not naturally extend to the task of weakly labeled MIC. Additionally, during the labeling process, clips are not necessarily annotated with a class label for each instrument. This leads to missing labels in the dataset making it a partially labeled dataset. In this thesis, three methods are proposed to address challenges posed by weakly labeled and partially labeled data. The first one aims at learning using weak labels. The MIC task is formulated as a multi-instance multi-label classification problem. Under this framework, an attention-based model is proposed that can focus on salient instances in weakly labeled data. The other two methods focus on utilizing any information that may be gained from data with missing labels. These methods fall under the semi-supervised learning (SSL) framework, where models are trained using labeled and unlabeled data. The first semi-supervised method involves deep generative models that extend the unsupervised variational autoencoder to a semi-supervised model. The final method is based on consistency regularization-based SSL. The method proposed uses the mean teacher model, where a teacher model maintains a moving average or low-pass filtered version of a student model. The consistency regularization loss is unsupervised and may thus be applied to both labeled and unlabeled data. Additional experiments on music tagging with a large-scale WLD demonstrates the effectiveness of consistency regularization with limited labeled data. The methods presented in this thesis generally outperform methods developed using SLDs. The findings in this thesis not only impact the MIC task but also impact other music classification tasks where labeled data might be scarce. This thesis hopes to pave the way for future researchers to venture away from purely supervised learning and also consider weakly supervised approaches to solve MIR problems without access to large amounts of data.