ELAN course in Syktyvkar

Introduction

I was teaching in Syktyvkar 1.6.-9.6.2017, as we had a practice seminar with the second year students of Komi and English philologies in Syktyvkar State University. This practice has been arranged regulary for many years, and I have also participated to these earlier in different roles. The idea is that the students from Komi department of Syktyvkar State University either go to do fieldwork practice, or they work on already existing recordings at the University. This year the method was latter, and I was invited to teach how to make transcriptions in ELAN. I was very glad to have such an opportunity, and I’m hopeful we will have more possibilities in the future to continue this kind of work.

The students worked either alone or in pairs. The idea was that each group transcribes around 45 minutes, and those who are alone bit less. In practice this resulted in around 2000 transcribed tokens by one student. This was all bit experimental, but even though we still have not totally finished the coursework, it seems we ended up with around 20,000 transcribed tokens, which is a very good result. Of course, it is most crucial that the students get accustomed to work with ELAN, but having something concrete in hand is also important.

The course had few distinct parts, one was the segmentation and annotation part, after which we moved into analysis section. This was relatively short and shallow part, but we explored different ways to make multi-layered searches in ELAN and went through the basics of regular expressions. We also fixed some regularly occurring mistakes automatically with regular expression search and replaces, which maybe helped in part to show what is the point of this. Similarly we used additional tier for notes and encouraged everyone to use it, which also gives one model for later annotating some features on their distinct tiers as part of research practice with ELAN corpus.

The course in practice

We were lucky that most of the students had their own laptops, and I had some headphones and similar equipments with me as well. So every group was well equipped to their task.

In order to make the learning materials easily accessible, I put everything online on a course website. Everything was also shown on projector in class with ELAN open, but I thought having screenshots for all most important tasks could be one way to ensure that everyone can follow.

The transcription system we used is called Scientific Komi Transcription (Коми научнӧй транскрипчия). It is a transcription system used commonly with Permic languages, and it is normally used on this course. So it was logical for us to use this also this time. One of the advantages it has is that the students were able to use the Cyrillic keyboard they commonly used, with just a few additional characters, most of which are already in official Komi keyboard.

So with help of Jeremy Bradley we added those characters into Russian keyboard after discussing with the students which locations they would find most useful. Besides these we still had to add ' as it doesn’t exist in the Russian keyboard.

Before course I spent some time digitalizing the C-cassettes on which we had some older recordings. This was done in somewhat primitive manner, and it is very likely that this work has to be repeated later with proper equipment. However, I think this was anyway worth doing as we were now able to select recordings from bit wider area where Komi is spoken. We had tapes also from the other regions among northern dialects, but when the students were grouped it just went so that none of those was selected for now. Also some recordings were discarded from this practice because the quality was lower than what we wanted.

The course itself started with segmenting the files, which were assigned and matched with groups on a Google Spreadsheet where we generally followed the status of different tasks and files. I would had liked to arrange a system where each student fills the metadata they discover from the recordings on the spreadsheet, but there wasn’t maybe enough time to introduce this properly. It isn’t too difficult to pick the metadata from the tapes now either, but this could had been a good time to go through metadata related practices and conventions.

After segmenting we moved into transcription, with the result that different groups were in different phases in different times, which was not really a problem anyway. The actual course plan for receiving a grade is following:

Transcribing the file up to demanded point
Returning the file
Receiving a list of corrections, partially noted on note-tier, but recurring mistakes would be mentioned only once
Sending back the corrected version

So now we are in the point where everyone has returned the file, and director of Komi department, Rimma Pavlovna Popova, and I will continue to mark the parts which still need some supervision.

Result

One of the most important result is that we have a group of young Komi students who are familiar with transcribing and ELAN, but also the data we received will be very useful. We are still discussing what is the best way to distribute this data, and it will take some time to make the corrections to some of them, but we can already observe the preliminary results.

We can first examine the most common tokens:

Most common words in transcriptions
content	n
да	695
а	418
вӧли	356
и	344
сиjӧ	305
ме	234
вот	226
сес’с’а	178
но	170
оз	151

This looks like the most common words in any other Komi corpus, I would say. All in all there are 19370 transcribed word tokens.

Geographical distribution

Then what is the geographical distibution? We were transcribing files from Užga, Koygorodok, Myrponaib, Obyachevo, Shoshka, Ust’-Nem and Körtkerös.

Map of villages from which data was selected

So we have been covering nicely the Southern Komi-Zyrian dialects, with some varieties missing, but still having quite nice coverage of different dialects.

Most of the recordings are old C-cassettes, so the quality is not as high as one could hope. Exception to this comes from Upper-Sysola recordings from 2012, which are generally of a good quality. This is very good, since Upper-Sysola has several phonological developments which deserve further attention.

What to do differently

I have not had that much teaching experience, so I was positively surprised how well everything eventually turned out. What I would like to change or improve upon are maybe following things:

Use of version control should had been introduced immediately, for example by students just dragging their files to GitHub in the end of the day. We didn’t have reliable internet connection so we didn’t do that, but this is still quite realistically done.
Strickt rule that no file is to be renamed. This created a bit of hassle when students were returning files with totally different filenames, which made it hard to keep track of files, and also created danger that something is messed up or lost.
Next time we have to take more time before the course to exactly choose the files which we want to work with, also when we have more experience it will be easier to say what is the most realistic result for one group and so on. I’m personally very interested about the question what is a normal and non-exhaustive working pace for transcription work by a native speaker who is familiar with the task and software.
Get the students more engaged with the course program. Maybe I should had opened a VKontakte group for the course? I guess this is not allowed in all countries, but I’m rather confident in this context it would had been ok.

Future plans

Having learned very much about this experience myself, I’m always happy to return to Syktyvkar State University for any period. I was studying there for five months in 2011, and I consider it as one of my “home” universities. It is also one of the foremost centers of Komi research, and the native Komi speakers are studying there primarily in Komi. This kind of institutions are very valuable, and globally speaking very rare.

I’m personally amazed every time I visit Syktyvkar when I see the scale of different Komi language related work being done there. Deep ties to institutions and researchers on areas where different Uralic languages are spoken are the backbone also for all research we can conduct in the western universities. I’m very thankful for LATTICE laboratory for supporting my work during the two weeks in Komi.

2016 Niko Partanen or individual writers.