NVIDIA, Mozilla Open Source Common Voice Dataset to Train Voice-Enabled Apps


NVIDIA and Mozilla have released the latest Common Voice Dataset, which contains over 13,000 hours of crowd-sourced speech data and adds another 16 languages to the corpus. Common Voice is the largest open-data voice dataset. It is designed to democratise voice technology and is already used by developers, researchers and academics worldwide.

As part of the partnership's mission to democratise voice technology, NVIDIA has released multilingual speech recognition models on NGC for free. NeMo is an open-source toolkit for researchers developing state-of-the-art conversational AI models, and researchers can further fine-tune these models on multilingual datasets.

Contributors mobilise their own communities to donate speech data to the Mozilla Common Voice (MCV) public database, which anyone can then use to train voice-enabled technology. As part of NVIDIA’s collaboration with Mozilla Common Voice, the models trained on this and other public datasets are made available for free via an open-source toolkit called NVIDIA NeMo, the company said in a blog post.

The latest Common Voice Dataset now consists of 13,905 hours, an increase of 4,622 hours over the previous release. It introduces 16 new languages to the dataset: Basaa, Slovak, Northern Kurdish, Bulgarian, Kazakh, Bashkir, Galician, Uyghur, Armenian, Belarusian, Urdu, Guarani, Serbian, Uzbek, Azerbaijani and Hausa. The top five languages in the Common Voice Dataset by total hours are English (2,630 hours), Kinyarwanda (2,260), German (1,040), Catalan (920), and Esperanto (840).
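
For readers who want to put these totals in proportion, the short Python sketch below derives the size of the previous release, the percentage increase, and the share of the corpus held by the top five languages. It uses only the figures quoted above. The variable names are illustrative and do not come from NVIDIA or Mozilla.

```python
# Figures as reported in the article (hours of recorded speech).
total_hours = 13_905
added_hours = 4_622
top_five = {
    "English": 2_630,
    "Kinyarwanda": 2_260,
    "German": 1_040,
    "Catalan": 920,
    "Esperanto": 840,
}

# Size of the previous release, implied by the reported increase.
previous_total = total_hours - added_hours

# Growth of the whole corpus relative to the previous release.
pct_increase = 100 * added_hours / previous_total

# Combined hours of the five largest languages and their share of the corpus.
top_five_total = sum(top_five.values())
top_five_share = 100 * top_five_total / total_hours

print(f"Previous release: {previous_total:,} hours")
print(f"Increase: {pct_increase:.1f}%")
print(f"Top five languages: {top_five_total:,} hours ({top_five_share:.1f}% of the corpus)")
```

On these numbers, the previous release held 9,283 hours, the corpus grew by roughly 50 percent, and the top five languages account for a little over half of all recorded hours.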

“Languages that have increased the most by percentage are Thai (almost 20x growth, from 12 hours to 250 hours), Luganda (9x growth, from 8 hours to 80 hours), Esperanto (more than 7x growth, from 100 hours to 840 hours), and Tamil (more than 8x growth, from 24 hours to 220 hours),” the blog post reads.
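
The multiples in the quote can be checked against the hour counts it gives. The sketch below assumes that "growth" means the hours added divided by the starting hours, not the plain ratio of new hours to old. That reading is an inference from the numbers and is not stated in the blog post. The hour counts also appear to be rounded, so the results are approximate.

```python
# (hours before, hours after) for each language, as given in the quote.
hours = {
    "Thai": (12, 250),
    "Luganda": (8, 80),
    "Esperanto": (100, 840),
    "Tamil": (24, 220),
}


def relative_increase(before, after):
    """Growth expressed as added hours divided by the starting hours.

    Assumption: this is how the quoted multiples were derived.
    """
    return (after - before) / before


growth = {lang: relative_increase(b, a) for lang, (b, a) in hours.items()}

# Print the languages from the largest to the smallest relative increase.
for lang, factor in sorted(growth.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang}: {factor:.1f}x")
```

Under this assumption the results line up with the quote: about 19.8x for Thai ("almost 20x"), 9.0x for Luganda, about 8.2x for Tamil ("more than 8x") and 7.4x for Esperanto ("more than 7x"). A plain after-to-before ratio would give values one higher in each case.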

