Nvidia and Mozilla release the latest version of the Common Voice Dataset

The latest iteration of the Common Voice Dataset was released today, bringing with it over 13,000 hours of crowd-sourced speech data, more than 182,000 unique voices, and a total of 76 languages.

The Mozilla Common Voice logo and mascot

Mozilla Common Voice is an initiative to "democratize and transform the voice recognition landscape". Started a few years ago under the umbrella of the Mozilla Foundation and now run in partnership with Nvidia, the initiative lets volunteers contribute to the Common Voice Dataset, the world's largest open voice dataset, for use in speech synthesis and recognition software.

Today, the latest release of the Common Voice Dataset was made public, and it brings several welcome additions. First, the corpus now has over 13,000 hours of crowd-sourced speech data, 4,622 hours more than the previous release. Sixteen new languages have also been added: Basaa, Slovak, Northern Kurdish, Bulgarian, Kazakh, Bashkir, Galician, Uyghur, Armenian, Belarusian, Urdu, Guarani, Serbian, Uzbek, Azerbaijani, and Hausa. This brings the total number of languages in the dataset to 76. Overall, the dataset now has over 182,000 unique voices, a direct result of 25% growth in the contributor community over the last six months.

The Common Voice dataset now totals 13,905 hours, an increase of 4,622 hours over the previous release.

The release introduces 16 new languages to the Common Voice dataset: Basaa, Slovak, Northern Kurdish, Bulgarian, Kazakh, Bashkir, Galician, Uyghur, Armenian, Belarusian, Urdu, Guarani, Serbian, Uzbek, Azerbaijani, and Hausa.

The top five languages by total hours are English (2,630 hours), Kinyarwanda (2,260), German (1,040), Catalan (920), and Esperanto (840).

The languages with the largest percentage growth are Thai (almost 20x growth, from 12 hours to 250 hours), Luganda (9x growth, from 8 hours to 80 hours), Tamil (more than 8x growth, from 24 hours to 220 hours), and Esperanto (more than 7x growth, from 100 hours to 840 hours).

The dataset now features over 182,000 unique voices, reflecting 25% growth in the contributor community in just six months.
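The growth figures quoted above can be checked with a few lines of arithmetic. The sketch below assumes that "growth" means the increase relative to the starting value, (after - before) / before, which is the reading that matches every quoted multiple; the variable and function names are illustrative and do not come from the Common Voice project.

```python
# Hours before and after this release, as quoted in the article.
hours = {
    "Thai": (12, 250),
    "Luganda": (8, 80),
    "Esperanto": (100, 840),
    "Tamil": (24, 220),
}


def growth_multiple(before, after):
    """Increase relative to the starting value (assumed definition of 'growth')."""
    return (after - before) / before


growth = {lang: growth_multiple(b, a) for lang, (b, a) in hours.items()}

# Overall corpus size: current total and hours added in this release.
total_now = 13_905
added = 4_622
previous_total = total_now - added
overall_increase_pct = 100 * added / previous_total

# List languages from largest to smallest growth multiple.
for lang, g in sorted(growth.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang}: {g:.1f}x growth")

print(f"Previous release: {previous_total:,} hours")
print(f"Overall increase: {overall_increase_pct:.1f}%")
```

Under this definition, the previous release works out to 9,283 hours, so the corpus grew by roughly half in a single release.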

If you are interested in contributing to the Common Voice dataset, head over to this link, where samples from the current corpus can also be found. Metadata and instructions for downloading and using the dataset can be found in this GitHub repository. As part of the collaboration between Mozilla and Nvidia, models trained on this and other public datasets are available for free via Nvidia NeMo, Nvidia's toolkit for building speech recognition and conversational models.