Classification of affect using novel voice and visual features
Kim, Jonathan Chongkang
MetadataShow full item record
Emotion adds an important element to the discussion of how information is conveyed and processed by humans; indeed, it plays an important role in the contextual understanding of messages. This research is centered on investigating relevant features for affect classification, along with modeling the multimodal and multitemporal nature of emotion. The use of formant-based features for affect classification is explored. Since linear predictive coding (LPC) based formant estimators often encounter problems with modeling speech elements, such as nasalized phonemes and give inconsistent results for bandwidth estimation, a robust formant-tracking algorithm was introduced to better model the formant and spectral properties of speech. The algorithm utilizes Gaussian mixtures to estimate spectral parameters and refines the estimates using maximum a posteriori (MAP) adaptation. When the method was used for features extraction applied to emotion classification, the results indicate that an improved formant-tracking method will also provide improved emotion classification accuracy. Spectral features contain rich information about expressivity and emotion. However, most of the recent work in affective computing has not progressed beyond analyzing the mel-frequency cepstral coefficients (MFCC’s) and their derivatives. A novel method for characterizing spectral peaks was introduced. The method uses a multi-resolution sinusoidal transform coding (MRSTC). Because of MRSTC’s high precision in representing spectral features, including preservation of high frequency content not present in the MFCC’s, additional resolving power was demonstrated. Facial expressions were analyzed using 53 motion capture (MoCap) markers. Statistical and regression measures of these markers were used for emotion classification along the voice features. Since different modalities use different sampling frequencies and analysis window lengths, a novel classifier fusion algorithm was introduced. This algorithm is intended to integrate classifiers trained at various analysis lengths, as well as those obtained from other modalities. Classification accuracy was statistically significantly improved using a multimodal-multitemporal approach with the introduced classifier fusion method. A practical application of the techniques for emotion classification was explored using social dyadic plays between a child and an adult. The Multimodal Dyadic Behavior (MMDB) dataset was used to automatically predict young children’s levels of engagement using linguistic and non-linguistic vocal cues along with visual cues, such as direction of a child’s gaze or a child’s gestures. Although this and similar research is limited by inconsistent subjective boundaries, and differing theoretical definitions of emotion, a significant step toward successful emotion classification has been demonstrated; key to the progress has been via novel voice and visual features and a newly developed multimodal-multitemporal approach.