ZeroSpeech challenge: Teach AI to learn speech like a child

WHAT THE CHALLENGE IS:

ZeroSpeech is a collaboration between the Facebook AI Research (FAIR) group and Paris Sciences & Lettres University, with additional sponsorship from Microsoft Research, that challenges other researchers to teach AI systems to learn speech in a way that more closely resembles how young children learn. The ZeroSpeech 2019 challenge (which builds on previous efforts in 2015 and 2017) asks participants to build a speech synthesizer using only audio input, without any text or phonetic labels.

HOW IT WORKS:

The challenge’s central task is to build an AI system that can discover, in an unknown language, the machine equivalent of text or phonetic labels and use them to re-synthesize a sentence in a given voice. Essentially, the system must discover its own discrete “orthographic” notation, which may or may not correspond to linguistically defined subword units like consonants, vowels, and syllables.
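
To make the idea of a discovered discrete notation concrete, here is a toy sketch in Python. It is only an illustration, not the challenge baseline: the function names, the two-dimensional "frames", and the fixed codebook are all hypothetical. A real system would learn its inventory of units from raw audio rather than receive it.

```python
def quantize(frames, codebook):
    """Map each feature frame to the index of its nearest codebook vector.

    Both arguments are hypothetical stand-ins: real frames would be
    acoustic features, and the codebook would be learned, not given.
    """
    labels = []
    for frame in frames:
        # Squared Euclidean distance from this frame to every codebook entry.
        distances = [
            sum((x - c) ** 2 for x, c in zip(frame, center))
            for center in codebook
        ]
        labels.append(distances.index(min(distances)))
    return labels


def collapse_repeats(labels):
    """Merge consecutive identical labels into a single unit."""
    units = []
    for label in labels:
        if not units or units[-1] != label:
            units.append(label)
    return units


# Toy data: five 2-D frames and a two-entry codebook (assumed values).
frames = [(0.1, 0.0), (0.0, 0.2), (0.9, 1.1), (1.0, 1.0), (0.1, 0.1)]
codebook = [(0.0, 0.0), (1.0, 1.0)]

frame_labels = quantize(frames, codebook)
unit_sequence = collapse_repeats(frame_labels)
print(frame_labels)   # [0, 0, 1, 1, 0]
print(unit_sequence)  # [0, 1, 0]
```

The resulting integer sequence plays the role of the system's own "spelling" of the utterance. Nothing forces those integers to line up with consonants, vowels, or syllables.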

Participants are provided with raw audio, as well as a baseline system with one component that performs subword discovery and another for speech synthesis. Participants can either replace the baseline with a new end-to-end system or improve one of the baseline’s components in order to generate a higher-quality waveform. Entries will be evaluated based on the bit rate of the discovered set of labels and the overall waveform quality. Submissions are due March 15. Teams with the top-scoring or most innovative papers will be selected for presentation at the Interspeech conference in September.
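
As a rough illustration of the bit-rate criterion, the sketch below estimates the bits per second carried by a sequence of discovered labels. It assumes the bit rate is the number of symbols times the entropy of the symbol distribution, divided by the audio duration. The function name and the example values are hypothetical, and the official evaluation may differ in its details.

```python
import math
from collections import Counter


def estimate_bitrate(symbols, duration_seconds):
    """Estimate bits per second for a sequence of discrete unit labels.

    Assumed formula: n * H / D, where n is the number of symbols,
    H is the empirical entropy in bits per symbol, and D is the
    duration of the audio in seconds.
    """
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    n = len(symbols)
    if n == 0:
        return 0.0
    counts = Counter(symbols)
    # Empirical entropy: sum over symbols of p * log2(1 / p).
    entropy = sum((c / n) * math.log2(n / c) for c in counts.values())
    return n * entropy / duration_seconds


# Four symbols drawn evenly from two labels over two seconds:
# entropy is 1 bit per symbol, so 4 * 1 / 2 = 2 bits per second.
print(estimate_bitrate(["a", "b", "a", "b"], 2.0))  # 2.0
```

Under this measure, a label set that uses fewer distinct units or emits fewer units per second scores a lower bit rate. That is why the bit rate is weighed against the quality of the re-synthesized waveform.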

WHY IT MATTERS:

Replicating the way in which children learn to speak before they learn to read and write will be useful for improving a wide range of AI tasks related to thousands of “low-resource” languages, for which only limited linguistic or textual resources are available for training AI systems. This challenge will explore unsupervised learning techniques, an important area of pursuit for versatile and scalable AI. It will also help shift research related to automatic translation and natural language understanding away from English-centric work and toward a more global perspective and capability.