Low-resource speech evaluation

Timeline
Task description
Evaluation scripts
Participants

Timeline

25.02 — training data release
10.03 — test data release
20.03 — submissions due
23.03 — results published
30.03 — papers due

Task description

General remarks

Participants may use the Lingvodoc project data provided by the organizers as well as any other available data. The source code of the solutions and the data used must be published. All files are encoded in UTF-8 (without BOM). Every participant can make up to 3 submissions.

Input data

Datasets are available at the Datasets page.

Each file has the following columns:

recording id
text transcription / spelling
language code according to Ethnologue with slight modifications used in Lingvodoc:
- alt-tub — Tubalar
- koi-yzv — Komi-Yazva
- yrk-for — Forest Nenets
genus
family
whether the word is likely to be repeated several times
whether the data can contain words pronounced in Russian

The last two columns were calculated for the whole subcorpus from which the data was extracted. The genus and the family are specified according to Lingvodoc.
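As an illustration of the column layout described above, here is a minimal reading sketch. It assumes tab-separated files with exactly the seven columns in the order listed and no header row; the delimiter, the field names, and the sample values are hypothetical and are not taken from the actual datasets.

```python
import csv
import io
from typing import NamedTuple


class Record(NamedTuple):
    """One row of a data file; field names are illustrative assumptions."""
    recording_id: str
    transcription: str
    language: str              # e.g. "alt-tub", "koi-yzv", "yrk-for"
    genus: str
    family: str
    may_repeat: str            # kept as raw text: the value format is not specified
    may_contain_russian: str   # kept as raw text: the value format is not specified


def read_records(handle):
    """Read tab-separated rows (assumed delimiter) into Record tuples."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Blank lines are skipped; a row with a different column count raises TypeError.
    return [Record(*row) for row in reader if row]


# Hypothetical sample row, for demonstration only.
sample = io.StringIO("rec_001\tsome text\talt-tub\tgenus_a\tfamily_a\t1\t0\n")
records = read_records(sample)
print(records[0].language)
```

For a real file, the handle would be opened with `encoding="utf-8"` and `newline=""`, matching the UTF-8 (without BOM) requirement stated in the general remarks.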

The track has three subtasks:

1. Language detection

Participants will detect the language, the genus, and the family of each utterance. All genera and families are specified in the training data. However, the test data will also contain surprise languages; participants should specify X for utterances in a surprise language. The data can contain repetitions as well as Russian stimuli pronounced within the utterances. We assume that language detection has already been solved in “cleaner” conditions, so it will be useful to see how the solutions perform on “field” data.

Evaluation

We evaluate the following for the files submitted:

the percentage of properly detected languages
the percentage of properly detected genera
the percentage of properly detected families
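The three percentages above can be sketched as a plain accuracy computation applied separately to languages, genera, and families. This is only an illustration under the assumption that gold and predicted labels are aligned lists of strings; the official evaluation script may differ, for example in how it treats the X label. The function name and the sample labels are hypothetical.

```python
def accuracy_percent(gold, predicted):
    """Percentage of positions where the predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct / len(gold)


# Hypothetical labels: "X" marks a surprise-language utterance.
gold_languages = ["alt-tub", "koi-yzv", "X", "yrk-for"]
predicted_languages = ["alt-tub", "koi-yzv", "yrk-for", "yrk-for"]
print(accuracy_percent(gold_languages, predicted_languages))  # 75.0
```

The same function would be called with the genus and family labels to obtain the other two scores.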

2. Speech recognition

Participants will transcribe the utterances or spell them. A test dataset without repetitions will be provided. However, we consider repetitions in utterances a challenge for the participants, as handling them will help in working with common field data.

Evaluation

Every file is evaluated with:

character error rate
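Character error rate is commonly defined as the Levenshtein (edit) distance between the reference and the hypothesis, divided by the number of characters in the reference. The sketch below follows that common definition; it is an assumption that the official script uses the same normalization and applies no extra text preprocessing. The function names are illustrative.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance over characters (insertions, deletions, substitutions)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                             # deletion
                current[j - 1] + 1,                          # insertion
                previous[j - 1] + (ref_char != hyp_char),    # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


print(character_error_rate("abcd", "abxd"))  # 0.25
```

Note that under this definition the value can exceed 1 when the hypothesis is much longer than the reference.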

3. Automatic detection of Russian stimuli

See the motivation in subtask 2.

Evaluation scripts

Codalab link

Sample submission

Evaluation script

Evaluation script description

Participants

Google-group

Telegram channel and chat