Discriminative and adaptive training for robust speech recognition and understanding
MetadataShow full item record
Robust automatic speech recognition (ASR) and understanding (ASU) under various conditions remains to be a challenging problem even with the advances of deep learning. To achieve robust ASU, two discriminative training objectives are proposed for keyword spotting and topic classification: (1) To accurately recognize the semantically important keywords, the non-uniform error cost minimum classification error training of deep neural network (DNN) and bi-directional long short-term memory (BLSTM) acoustic models is proposed to minimize the recognition errors of only the keywords. (2) To compensate for the mismatched objectives of speech recognition and understanding, minimum semantic error cost training of the BLSTM acoustic model is proposed to generate semantically accurate lattices for topic classification. Further, to expand the application of the ASU system to various conditions, four adaptive training approaches are proposed to improve the robustness of the ASR under different conditions: (1) To suppress the effect of inter-speaker variability on speaker-independent DNN acoustic model, speaker-invariant training is proposed to learn a deep representation in the DNN that is both senone-discriminative and speaker-invariant through adversarial multi-task training (2) To achieve condition-robust unsupervised adaptation with parallel data, adversarial teacher-student learning is proposed to suppress multiple factors of condition variability in the procedure of knowledge transfer from a well-trained source domain LSTM acoustic model to the target domain. (3) To further improve the adversarial learning for unsupervised adaptation with unparallel data, domain separation networks are used to enhance the domain-invariance of the senone-discriminative deep representation by explicitly modeling the private component that is unique to each domain. (4) To achieve robust far-field ASR, an LSTM adaptive beamforming network is proposed to estimate the real-time beamforming filter coefficients to cope with non-stationary environmental noise and dynamic nature of source and microphones positions.