This document is a parameterized RMarkdown report that contains current statistics about the Iźva Komi Documentation Project corpus. The original project (IKDP), which aimed to build this corpus, was led by Rogier Blokland, Michael Rießler and Marina Fedina and was funded by Kone Foundation during the period 2014-2016. The current continuation project (IKDP-2), funded by Kone Foundation until 2020 and led by Rogier Blokland, Michael Rießler and Niko Partanen, is working on extending this corpus.

The corpus is currently archived in The Language Archive at the Max Planck Institute for Psycholinguistics in Nijmegen, where it can be found under the name Spoken Komi Corpus: IKDP.

The purpose of this document is to help readers navigate what the corpus actually contains.

Translation status

Corpus in numbers

In the plot below we see the token distribution across the years from which the project has data. The year 2015 is the most represented, because two fieldwork trips were made that year, one to Izhma and one to Naryan-Mar.
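
A per-year tally of this kind could be computed roughly as follows. This is a minimal sketch rather than the report's actual code: it assumes a data frame with one row per token and a `year` column, and the `toy_corpus` rows are invented purely for illustration.

```r
library(dplyr)

# Toy stand-in for the real corpus: one row per token.
# These rows are invented for illustration only.
toy_corpus <- tibble(
  token = c("a", "b", "c", "d", "e"),
  year  = c(2014, 2015, 2015, 2015, 2016)
)

# Count the tokens recorded in each year.
tokens_per_year <- toy_corpus %>%
  count(year, name = "n_tokens")

tokens_per_year
```

The resulting table has one row per year and can be passed directly to a plotting function.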

The corpus currently contains 101 ELAN files with 144 different speakers, for a total of 290689 tokens. The tokens are distributed across speakers of different ages and genders as follows:

sex n
F 184946
M 96885
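
Headline figures and a per-sex breakdown like the ones above could be derived along these lines. This is only a sketch under stated assumptions: it treats `session_name` as a proxy for one ELAN file, assumes a `sex` column (taken from the table header), and uses an invented `toy_corpus` instead of the real data.

```r
library(dplyr)

# Toy stand-in for the real corpus: one row per token.
# All values are invented for illustration only.
toy_corpus <- tibble(
  token        = c("a", "b", "c", "d", "e", "f"),
  session_name = c("s1", "s1", "s1", "s2", "s2", "s2"),
  participant  = c("p1", "p1", "p2", "p2", "p3", "p3"),
  sex          = c("F", "F", "M", "M", "F", "F")  # assumed column
)

# Overall figures: distinct sessions, distinct speakers, total tokens.
overview <- toy_corpus %>%
  summarise(
    n_sessions = n_distinct(session_name),
    n_speakers = n_distinct(participant),
    n_tokens   = n()
  )

# Token counts per speaker sex.
tokens_by_sex <- toy_corpus %>%
  count(sex)

overview
tokens_by_sex
```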

Distribution of tokens

The corpus currently contains 290689 tokens.

Other distributions

The version we are currently archiving doesn’t include written texts, as it is essentially a spoken language documentation corpus. However, there are numerous texts written by authors who speak this dialect, as well as dialectal publications, that could possibly be integrated into the corpus.

We also have recordings of many writers from different decades, so this will certainly be one future research direction.

Media distribution

The raw data in the corpus is stored in three media formats: text, audio and video. Some data may therefore exist only as a transcription, while other data has an audio or video recording. Video here always means audio plus video: a video recording without audio might exist, strictly speaking, but it would make no sense to include it in a speech corpus. The main distinction is thus whether the actual raw data can be consulted in some recorded format, or whether we must rely on transcriptions that were usually made on the spot. There are also cases where multimedia recordings are associated with a transcription but their whereabouts are unknown. In these cases we must treat the transcription as the closest thing to raw data that we have.
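
The three-way distinction described above can be expressed as a simple classification. The following is a hypothetical sketch, not the report's own code: `video_path` is an assumed column name, a missing path is assumed to be stored as `NA`, and the `toy_sessions` rows are invented for illustration.

```r
library(dplyr)

# Toy session table, invented for illustration; NA marks a missing file.
toy_sessions <- tibble(
  session_name = c("s1", "s2", "s3"),
  audio_path   = c("s1.wav", "s2.wav", NA),
  video_path   = c("s1.mp4", NA, NA)  # hypothetical column
)

# Classify each session by the richest medium available.
# Video is treated as implying audio, as described in the text.
media_types <- toy_sessions %>%
  mutate(
    media = case_when(
      !is.na(video_path) ~ "audio + video",
      !is.na(audio_path) ~ "audio only",
      TRUE               ~ "transcription only"
    )
  )

media_types
```

Checking video first means a session with both files is counted once, under the richer format.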

The main thing we see here is that almost everything we have recorded has both video and audio. There are a few individual sessions for which no video was made.

Geographic distribution

So the northern areas are quite well covered, and there are also many speakers born in the southern part of the Komi Republic.

# izva_corpus %>%
#   select(token, session_name, utterance_id, elan_path, audio_path, start_ms, end_ms, participant, duration, year) %>%
#   nest()

Thanks to Kone Foundation

Without the generous and continuous support of Kone Foundation our work would not have been possible.