Improving mispronunciation detection and enriching diagnostic feedback for non-native learners of Mandarin
Computer-assisted pronunciation training (CAPT) systems are designed to help students improve their speaking skills by providing automatic pronunciation scores and diagnostic feedback. A system's mispronunciation detection performance depends heavily on the quality of the ASR acoustic model trained on a non-native corpus, and on the binary detectors that verify whether a given pronunciation is correctly articulated. Its diagnostic ability, in turn, depends on the choice of modeled units (e.g., phones, articulation manner, and articulation place) and on whether the decisions made by the chosen verifiers/classifiers are interpretable.

In this thesis, we work to improve mispronunciation detection for Mandarin and to enrich diagnostic feedback for second-language learners. The problem is tackled from three perspectives: acoustic modeling, verification, and feedback generation for Mandarin phones and tones.

For acoustic modeling, we propose speech attributes and soft targets to replace the hard-assignment labels of phones and tones, respectively, since hard labels are not well suited to describing irregular non-native pronunciations. The resulting multi-source information and better-trained acoustic models then provide more accurate features for the mispronunciation detectors. Experimental results show that these enhanced features bring consistent improvements in Mandarin phone and tone mispronunciation detection.

For verification, the segmental pronunciation representation, usually computed by frame-level averaging over DNN outputs, is instead learned by the memory components of a BLSTM, which directly exploits sequential context to embed a sequence of pronunciation scores into a pronunciation vector. This improves mispronunciation detection for both phones and tones.
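The soft-target idea above can be illustrated with a minimal sketch. The interpolation scheme and the weight `alpha` are assumptions for illustration, not the thesis's exact recipe: a one-hot hard label is mixed with a posterior distribution (e.g., from a previously trained model), so atypical non-native realizations are no longer forced entirely onto a single canonical class.

```python
def soft_targets(hard_label, posteriors, alpha=0.9):
    """Return a soft label distribution for one frame/segment.

    Interpolates the one-hot hard-assignment label with a posterior
    distribution over the same classes. `alpha` (an assumed
    hyperparameter) controls how much weight the hard label keeps.
    """
    n = len(posteriors)
    one_hot = [1.0 if i == hard_label else 0.0 for i in range(n)]
    # Convex combination: stays a valid probability distribution
    # whenever `posteriors` sums to 1.
    return [alpha * h + (1.0 - alpha) * p for h, p in zip(one_hot, posteriors)]


# Example: canonical class 0, but the model hears some class 1.
soft = soft_targets(0, [0.7, 0.2, 0.1], alpha=0.9)
```

Training against such distributions instead of one-hot labels lets the acoustic model express that a non-native token is only partially like its canonical target.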
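The frame-level-averaging baseline that the BLSTM embedding replaces can be sketched as follows (a simplified illustration; the real systems operate on DNN posterior or score vectors per frame):

```python
def average_pool(frame_scores):
    """Collapse a variable-length sequence of per-frame score vectors
    into one fixed-size segmental vector by element-wise averaging.

    This pooling is order-free: shuffling the frames gives the same
    vector, which is exactly the contextual information a recurrent
    (BLSTM) embedding can recover and the average cannot.
    """
    n_frames = len(frame_scores)
    dim = len(frame_scores[0])
    return [sum(frame[d] for frame in frame_scores) / n_frames
            for d in range(dim)]


# Two frames, two score dimensions -> one 2-dim segmental vector.
segment_vec = average_pool([[1.0, 0.0], [0.0, 1.0]])
```

A learned BLSTM embedding instead reads the frames in order (and in reverse), so the resulting pronunciation vector can encode where within the segment the scores rise or fall.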
For feedback generation, phone-, articulatory-, and tone-level posterior scores combined with interpretable decision trees allow us to visualize non-native mispronunciations and provide comprehensive feedback, including diagnostic information on articulation manner, articulation place, and pitch contour, to help L2 learners. Experimental results confirm that the proposed decision trees provide accurate diagnostic feedback.
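The appeal of decision trees here is that each diagnosis is a readable chain of threshold tests on posterior scores. A tiny hand-built sketch (thresholds, score names, and feedback strings are illustrative assumptions, not the thesis's learned trees):

```python
def diagnose(scores):
    """Map posterior scores for one phone segment to diagnostic feedback.

    `scores` holds posteriors in [0, 1] under keys "phone", "manner",
    and "place" (hypothetical names). Each branch below corresponds to
    one interpretable decision path.
    """
    if scores["phone"] >= 0.6:
        return "Pronunciation acceptable."
    # Phone-level score is low: drill down to articulatory attributes.
    if scores["manner"] < 0.5:
        return "Check articulation manner (e.g., stop vs. fricative)."
    if scores["place"] < 0.5:
        return "Check articulation place (e.g., alveolar vs. retroflex)."
    return "Phone mispronounced despite correct manner/place; check the tone contour."


# A segment with a weak manner posterior triggers manner-level feedback.
msg = diagnose({"phone": 0.3, "manner": 0.2, "place": 0.9})
```

In a real system the tree structure and thresholds would be learned from annotated non-native data, but the learner-facing output remains this kind of traceable rule path.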