Low-resource speech evaluation
Timeline
- 25.02 — training data release
- 10.03 — test data release
- 20.03 — submissions due
- 23.03 — results published
- 30.03 — papers due
Task description
General remarks
Participants may use the Lingvodoc project data provided by the organizers as well as any other available data. The source code of each solution and the data it uses must be published. All files are encoded in UTF-8 (without BOM). Each participant can make up to 3 submissions.
Input data
Datasets are available at the Datasets page.
Each file has the following columns:
- recording id
- text transcription / spelling
- language code according to Ethnologue with slight modifications used in Lingvodoc:
  - alt-tub — Tubalar
  - koi-yzv — Komi-Yazva
  - yrk-for — Forest Nenets
- genus
- family
- the word can probably be repeated several times
- the data can contain words pronounced in Russian
The last two columns were calculated for the whole subcorpus from which the data was extracted. The genus and the family are specified according to Lingvodoc.
The track has three subtasks:
1. Language detection
Participants will detect the language, the genus and the family of each utterance. All genera and families are represented in the training data. However, the test data will also contain surprise languages; participants should output X for utterances in a surprise language. The data may contain repetitions as well as Russian stimuli pronounced within the utterances. We assume that language detection has already been solved under “cleaner” conditions, so it is useful to see how solutions perform on “field” data.
Evaluation
We evaluate the following for the files submitted:
- the percentage of properly detected languages
- the percentage of properly detected genera
- the percentage of properly detected families
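The three percentages above can each be read as a plain accuracy over the test utterances. The following is a minimal sketch under that assumption, not the organizers' scoring script; the function name `accuracy_percent` and the in-memory list format are hypothetical.

```python
def accuracy_percent(gold, predicted):
    """Return the percentage of positions where the predicted label equals the gold label.

    Assumption: both lists are aligned by recording id and have the same length.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty list")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct / len(gold)


# Language labels for four utterances; "X" marks a surprise language.
gold_languages = ["alt-tub", "koi-yzv", "X", "yrk-for"]
predicted_languages = ["alt-tub", "koi-yzv", "yrk-for", "yrk-for"]

print(accuracy_percent(gold_languages, predicted_languages))  # 75.0
```

The same function could be applied separately to the genus column and the family column to obtain the other two scores.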
2. Speech recognition
Participants will transcribe the utterances or spell them. A test dataset without repetitions will be provided. However, we consider repetitions in utterances a challenge for the participants, as handling them will help in working with common field data.
Evaluation
Every file is evaluated with:
- character error rate
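Character error rate is commonly defined as the character-level edit distance (insertions, deletions and substitutions) between the hypothesis and the reference, divided by the reference length. The sketch below follows that common definition; the exact normalization and text preprocessing used by the organizers are not stated here, so treat those choices, and the function names, as assumptions.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, 1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, 1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_char != hyp_char),  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the number of characters in the reference."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


print(character_error_rate("kitten", "sitting"))  # 0.5
```

Under this definition the value can exceed 1.0 when the hypothesis is much longer than the reference.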
3. Automatic detection of Russian stimuli
See the motivation in subtask 2.