Automatic speaker verification and diarization on VoxCeleb data collection
Automatic speaker verification (ASV) has attracted growing attention in speech research in recent years. Given the importance of cybersecurity and the security of personal property, ASV could be used in many fields in the future alongside fingerprint and face information. ASV research needs a variety of datasets to train good models; current datasets include NIST SRE and VoxCeleb, among others. In this work, the VoxCeleb data collection pipeline is adopted to collect a non-English dataset of East Asian language-speaking celebrities (EACeleb). To remove noisy segments from the pipeline output and make the dataset cleaner, speaker diarization is applied and the collected data is filtered. Because the collected data lacks ground-truth labels, ASV is used to measure the improvement in the cleanness of the dataset. Measured with a pretrained x-vector model, the equal error rate (EER) is lowered by 25.63% after speaker diarization compared to the original EACeleb. In addition, when a speaker verification model is trained on EACeleb data and evaluated by EER, EACeleb after diarization can outperform VoxCeleb by 36.78%.
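As an illustration of the metric used above, the following is a minimal sketch of how an EER can be estimated from verification scores. It is not the evaluation code used in this work: the function names `equal_error_rate` and `relative_reduction`, the threshold sweep, and the toy scores are all assumptions made for the example, and reading the reported percentages as relative reductions is likewise an assumption.

```python
import numpy as np


def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER by sweeping a threshold over all observed scores.

    Hypothetical helper for illustration; higher scores are assumed to
    mean "more likely the same speaker".
    """
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, nontarget]))
    # False rejection rate: same-speaker trials scored below the threshold.
    frr = np.array([(target < t).mean() for t in thresholds])
    # False acceptance rate: different-speaker trials scored at or above it.
    far = np.array([(nontarget >= t).mean() for t in thresholds])
    # The EER is taken where the two error rates are closest to equal.
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2.0)


def relative_reduction(eer_before, eer_after):
    """Relative EER reduction in percent (assumed interpretation)."""
    return (eer_before - eer_after) / eer_before * 100.0


# Toy scores, made up for the example (not data from this work).
same_speaker = [0.9, 0.8, 0.7, 0.3]
different_speaker = [0.6, 0.4, 0.2, 0.1]
eer = equal_error_rate(same_speaker, different_speaker)
```

A lower EER on the same trial list suggests fewer mislabeled or noisy segments, which is why it can serve as a proxy for data cleanness when ground-truth labels are unavailable.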