Earlier this week, Mozilla revealed that its Common Voice dataset now contains more than 20,000 hours of content, almost double the amount of a year ago. Anyone around the world can use it to improve their speech recognition software. The latest English-language dataset comes in at a huge 71 GB, and more languages are supported than ever with the addition of Tigre, Taiwanese (Minnan), Meadow Mari, Bengali, Toki Pona, and Cantonese.
According to Mozilla, the Common Voice project is important because anyone can contribute their voice, which should help virtual assistants understand more accents. It also ensures that Big Tech firms aren’t the only ones with large datasets, giving smaller developers and companies a chance to build competing products and services.
The main highlights that Mozilla picked out for the latest dataset release are as follows:
- The new release features six new languages: Tigre, Taiwanese (Minnan), Meadow Mari, Bengali, Toki Pona, and Cantonese.
- Twenty-seven languages now have at least 100 hours of speech data. They include Bengali, Thai, Basque, and Frisian.
- Nine languages now have at least 500 hours of speech data. They include Kinyarwanda (2,383 hours), Catalan (2,045 hours), and Swahili (719 hours).
- Nine languages now have at least 45% of their gender tags marked as female. They include Marathi, Dhivehi, and Luganda.
- The Catalan community fueled major growth. Its Project AINA, a collaboration between Barcelona Supercomputing Center and the Catalan Government, mobilized Catalan speakers to contribute to Common Voice.
- Supporting community participation in decision making. The Common Voice language Rep Cohort has contributed feedback and learnings about optimal sentence collection, the inclusion of language variants, and more.
If you’d like to contribute to Common Voice, doing so is very simple. Just head over to the project’s website and choose whether you would like to read some sentences or verify whether other people’s sentences are correct. You can keep track of all of your contributions to the project by signing up and contributing while logged in.