AISHELL-3 Corpus:
Baseline System Samples

Arxiv: 2010.11567
Github Repo: sos1sos2Sixteen/aishell-3-baseline-fc
Dataset Download: www.aishelltech.com/aishell_3
For further questions regarding the dataset: tech@aishelldata.com


Authors

Abstract

        In this paper, we present AISHELL-3, a large-scale, high-fidelity multi-speaker Mandarin speech corpus that can be used to train multi-speaker Text-to-Speech (TTS) systems. The corpus contains roughly 85 hours of emotion-neutral recordings spoken by 218 native Mandarin Chinese speakers. Auxiliary speaker attributes such as gender, age group and native accent are explicitly marked and provided in the corpus, and transcripts at both the Chinese-character level and the pinyin level are provided along with the recordings. We present a baseline system that uses AISHELL-3 for multi-speaker Mandarin speech synthesis. The multi-speaker synthesis system is an extension of Tacotron-2 in which a speaker verification model and a corresponding voice-similarity loss are incorporated as a feedback constraint. We aim to use the presented corpus to build a robust synthesis model capable of zero-shot voice cloning; the system trained on this dataset also generalizes well to speakers never seen during training. Objective evaluation results from our experiments show that the proposed multi-speaker synthesis system achieves high voice similarity in terms of both speaker-embedding similarity and equal error rate. The dataset, baseline system code and generated samples are available online.
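The equal error rate (EER) mentioned in the abstract is the operating point where the false-accept rate equals the false-reject rate of a speaker verification trial. A minimal sketch of how it can be computed with scikit-learn is shown below; the scores here are synthetic stand-ins, not the paper's data:

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: the point on the ROC curve where false-accept rate
    (fpr) equals false-reject rate (fnr = 1 - tpr)."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.argmin(np.abs(fpr - fnr))
    return (fpr[idx] + fnr[idx]) / 2.0

# Hypothetical cosine-similarity scores: same-speaker trials cluster
# high, cross-speaker trials cluster low (purely illustrative values).
rng = np.random.default_rng(0)
genuine = rng.normal(0.8, 0.05, 500)
impostor = rng.normal(0.2, 0.05, 500)
labels = np.concatenate([np.ones(500), np.zeros(500)])
scores = np.concatenate([genuine, impostor])
print(f"EER = {equal_error_rate(labels, scores):.3f}")
```

With scores this well separated the EER is near zero; on real verification trials it reflects how often the system confuses speakers.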


Dataset Samples

Sample audios and labels from the AISHELL-3 dataset (in original 44.1kHz format)

Sample Audios
Labels
在教学楼内释放大量烟雾
zai4 jiao4 xue2 lou2 nei4 shi4 fang4 da4 liang4 yan1 wu4
不过英特尔之后不会继续接受如此大的损失
bu2 guo4 ying1 te4 er3 zhi1 hou4 bu2 hui4 ji4 xu4 jie1 shou4 ru2 ci3 da4 de5 sun3 shi1
替我播放相思风雨中
ti4 wo3 bo1 fang4 xiang1 si1 feng1 yu3 zhong1
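The pinyin-level labels above use digit-suffixed tone marks (1-4 for the lexical tones, 5 for the neutral tone). A small helper, illustrative only and not part of the corpus tooling, that splits such a label into (syllable, tone) pairs:

```python
def parse_pinyin(label: str):
    """Split a tone-numbered pinyin label, e.g. 'ti4 wo3', into
    (syllable, tone) pairs such as ('ti', 4)."""
    pairs = []
    for tok in label.split():
        pairs.append((tok.rstrip("012345"), int(tok[-1])))
    return pairs

print(parse_pinyin("ti4 wo3 bo1 fang4"))
# → [('ti', 4), ('wo', 3), ('bo', 1), ('fang', 4)]
```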

Baseline System Samples

The following sections exhibit audio samples generated by the baseline TTS system described in detail in our paper (in down-sampled 16kHz format).

Vocoder Mel-Reconstruction

We use a 16kHz MelGAN trained on the presented dataset as our neural vocoder module. Listed below are pairs of original audio samples and their Mel-reconstructed counterparts, which intuitively demonstrate the performance of the vocoder used to produce the synthesized voices presented on this page.

Original Voice
Mel-Reconstructed Voice
Text
一股脑倒给了现场的观众们
想听一下邓紫棋的歌
在中国官方轰轰烈烈纪念抗日战争胜利七十周年之际
本身是难度比较大的事

Text-dependent Synthesis

Below are pairs of original recordings and samples synthesized from the same text and speaker embeddings (not ground-truth aligned samples); all of the reference audios are held out from the training data:
Original Voice
Parallel Synthesis
Text
张大千国画有什么
目前北京市有旗胜家园、万科、红狮等数个限价房小区
你对我所做的一切
一二六八零三七九六七一
楼市回暖也不可能回到以前的发展高度

Free-form Synthesis

Below are synthesized samples whose textual content is an excerpt from a random customer-service phone call.
Original Voice
Out-of-domain Sentence

Speaker Similarity

Speaker embedding visualization using t-SNE: embeddings extracted from audio recordings are represented with dots, and those from synthesized samples with crosses. From left to right: (A) validation-set recordings; (B) validation-set recordings & synthesized; (C) test-set recordings; (D) test-set recordings & synthesized. Please note that the validation set contains seen speakers while the test set only includes unseen ones.
Validation-set Speakers
Original Voice
Synthesized Voice
Test-set Speakers
Original Voice
Synthesized Voice
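A visualization like the one above can be sketched with scikit-learn's t-SNE: project the d-dimensional speaker embeddings to 2-D, then plot recordings as dots and synthesized samples as crosses. The embeddings below are synthetic clusters standing in for real extracted embeddings, and all sizes are hypothetical:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
n_speakers, per_speaker, dim = 4, 10, 256  # hypothetical sizes

# Each speaker's embeddings (recordings and synthesized alike) scatter
# tightly around that speaker's center in embedding space.
centers = rng.normal(0.0, 1.0, (n_speakers, dim))
emb = np.concatenate([c + 0.05 * rng.normal(0.0, 1.0, (per_speaker, dim))
                      for c in centers])
# Tag half of each speaker's points as "synthesized" for the dot/cross legend.
is_synth = np.tile(np.arange(per_speaker) >= per_speaker // 2, n_speakers)

coords = TSNE(n_components=2, perplexity=10, init="pca",
              random_state=0).fit_transform(emb)
print(coords.shape)  # → (40, 2): one 2-D point per embedding
```

Plotting `coords[~is_synth]` as dots and `coords[is_synth]` as crosses, colored by speaker, reproduces the panel layout described above.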