Single channel speech enhancement with residual learning and recurrent network
In speech enhancement, non-stationary noise such as babble noise is much harder to suppress than stationary noise. In low-SNR environments, it is even more challenging to remove noise without introducing significant artifacts and distortion. Moreover, many state-of-the-art deep-learning-based algorithms use a regression model that maps multiple time frames to a single time frame. In our work, we propose a speech denoising neural network that instead maps multiple time frames to multiple time frames, aiming to substantially reduce the computational burden in real-world applications while maintaining decent speech quality. We present two such networks, ResSE and ResCRN. ResSE takes the form of a ResNet architecture and is inspired by DuCNN, an image enhancement network. With its rich, deep structure and the help of residual connections, ResSE is very efficient at extracting spatial features and outperforms traditional log-MMSE algorithms. ResCRN, with the addition of LSTM layers, is capable of both spatial and temporal modeling. It utilizes both local and global contextual information and improves speech quality even when faced with unseen speakers and unseen noise types, indicating that ResCRN generalizes well.
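To illustrate why a multiple-frames-to-multiple-frames mapping can reduce computation, the following minimal sketch compares how many network forward passes each framing scheme would need for the same utterance. It is not the authors' implementation: the function names, the 11-frame context, and the 100 x 129 spectrogram size are all assumptions chosen for illustration, and no network is actually run.

```python
import numpy as np

def many_to_one_windows(spec, context):
    """Sliding windows of `context` frames: one window (one forward pass) per output frame."""
    n_frames = spec.shape[0]
    pad = context // 2
    # Repeat edge frames so every output frame has a full context window.
    padded = np.pad(spec, ((pad, pad), (0, 0)), mode="edge")
    return np.stack([padded[t:t + context] for t in range(n_frames)])

def many_to_many_blocks(spec, block):
    """Non-overlapping blocks of `block` frames: one forward pass enhances a whole block."""
    n_frames, n_bins = spec.shape
    n_blocks = -(-n_frames // block)  # ceiling division
    # Pad the tail so the last block is full.
    padded = np.pad(spec, ((0, n_blocks * block - n_frames), (0, 0)), mode="edge")
    return padded.reshape(n_blocks, block, n_bins)

# Hypothetical noisy magnitude spectrogram: 100 time frames x 129 frequency bins.
spec = np.random.default_rng(0).random((100, 129))

windows = many_to_one_windows(spec, context=11)  # one network input per frame
blocks = many_to_many_blocks(spec, block=11)     # one network input per 11 frames

print("many-to-one forward passes: ", windows.shape[0])
print("many-to-many forward passes:", blocks.shape[0])
```

Under these assumed sizes, the many-to-many scheme needs roughly one forward pass per block rather than one per frame, which is the kind of saving the abstract alludes to; the actual saving depends on the block length and any overlap the authors use.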