This is a set of course materials Niko Partanen (me) is using while teaching the this course. It is supposed to be used with a Komi-Zyrian Test Corpus and the materials have an accompanying website, which is more thorough but also still more at the draft level. I work currently in LATTICE laboratory in Paris, and I’m also working with IKDP-2 research project.
If you have comments or suggestions, any feedback, my email address is nikotapiopartanen@gmail.com. In principle I’m also in Twitter, @nikopartanen, but I don’t understand very well how to use it.
The lectures assume that you have done something like:
git clone https://github.com/langdoc/elan_lectures
git clone https://github.com/langdoc/testcorpus
git clone https://github.com/langdoc/praat-stuff
So when you are in the main directory what you see is:
elan_lectures
testcorpus
praat-stuff
It doesn’t matter if there are other folders around.
The course is split along following slides:
Before the course starts it can be a good idea to install some of the following tools:
You can do from command line:
git clone https://github.com/langdoc/elan_lectures
git clone https://github.com/langdoc/testcorpus
Course materials, or related materials, are also slowly being collected into accompanying website, but this doesn’t usually reflect the actual state of things I want to discuss. This work in on draft level, but the idea is to have there one method of discussing things further.
You need (at least) following R packages.
pip install pympi-ling
#install.packages("devtools")
#install.packages("tidyverse")
library(devtools)
library(tidyverse)
## ── Attaching packages ──────────────────────────── tidyverse 1.2.0 ──
## ✔ ggplot2 2.2.1.9000 ✔ purrr 0.2.4
## ✔ tibble 1.3.4 ✔ dplyr 0.7.4
## ✔ tidyr 0.7.2 ✔ stringr 1.2.0
## ✔ readr 1.1.1 ✔ forcats 0.2.0
## ── Conflicts ─────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(xml2)
#install_github(repo = "langdoc/FRelan")
#install_github(repo = "achubaty/meow")
library(FRelan)
library(meow)
In RStudio you can easily arrange your work around projects. Those can be accessed in right-upper corner of the program. If you work in a project, your working directory is always the one of the project, and you avoid lots of hassle.
In case you need, this is how you can change the working directory.
getwd()
setwd("~/path/to/some/directory")
Throughout the course we assume that you have following file structure:
higher_level_folder
testcorpus
elan_lectures
praat-stuff
FRelan::read_eaf(eaf_file = '../testcorpus/kpv_udo20120330SazinaJS-encounter.eaf')
## # A tibble: 221 x 11
## token utterance
## * <chr> <chr>
## 1 и И эшшӧ ӧтик тор, мый тэнад, тэныд мам висьталіс интереснӧй юӧр,
## 2 эшшӧ И эшшӧ ӧтик тор, мый тэнад, тэныд мам висьталіс интереснӧй юӧр,
## 3 ӧтик И эшшӧ ӧтик тор, мый тэнад, тэныд мам висьталіс интереснӧй юӧр,
## 4 тор И эшшӧ ӧтик тор, мый тэнад, тэныд мам висьталіс интереснӧй юӧр,
## 5 , И эшшӧ ӧтик тор, мый тэнад, тэныд мам висьталіс интереснӧй юӧр,
## 6 мый И эшшӧ ӧтик тор, мый тэнад, тэныд мам висьталіс интереснӧй юӧр,
## 7 тэнад И эшшӧ ӧтик тор, мый тэнад, тэныд мам висьталіс интереснӧй юӧр,
## 8 , И эшшӧ ӧтик тор, мый тэнад, тэныд мам висьталіс интереснӧй юӧр,
## 9 тэныд И эшшӧ ӧтик тор, мый тэнад, тэныд мам висьталіс интереснӧй юӧр,
## 10 мам И эшшӧ ӧтик тор, мый тэнад, тэныд мам висьталіс интереснӧй юӧр,
## # ... with 211 more rows, and 9 more variables: reference <chr>,
## # participant <chr>, time_start <dbl>, time_end <dbl>,
## # session_name <chr>, filename <chr>, word <chr>, after <chr>,
## # before <chr>
FRelan::read_tier(eaf_file = '../testcorpus/kpv_izva20140330-1-fragment.eaf', linguistic_type = "refT")
## # A tibble: 5 x 8
## content annot_id ref_id participant tier_id
## <chr> <chr> <chr> <chr> <chr>
## 1 kpv_izva20140330-1-b-097 a1 <NA> MVF-F-1984 ref@MVF-F-1984
## 2 kpv_izva20140330-1-b-098 a2 <NA> MVF-F-1984 ref@MVF-F-1984
## 3 kpv_izva20140330-1-b-099 a3 <NA> MVF-F-1984 ref@MVF-F-1984
## 4 kpv_izva20140330-1-b-100 a4 <NA> MVF-F-1984 ref@MVF-F-1984
## 5 kpv_izva20140330-1-b-101 a5 <NA> MVF-F-1984 ref@MVF-F-1984
## # ... with 3 more variables: type <chr>, time_slot_1 <chr>,
## # time_slot_2 <chr>
FRelan::read_tier(eaf_file = '../testcorpus/kpv_izva20140330-1-fragment.eaf', linguistic_type = "orthT")
## # A tibble: 5 x 8
## content
## <chr>
## 1 Ме, кӧнечнэ же, вӧлі кык лун в шоке от Тайланда, потому что, но, менэ стран
## 2 Ааа, но думайта: ветла, - и муні, значит.
## 3 И сэтэн прӧстэ зэй уна ставыс доступнэ, но разврат да быдчемаыс, и сыысь м
## 4 слонъяс вылэ видзеді, экскурсияяс, ок, моресэ аддзылі, сэтэн Южно-Китайскей
## 5 И ставыс сыа отойдитіс, сэсся, ну, кудз бы, иг обращайт на это внимание.
## # ... with 7 more variables: annot_id <chr>, ref_id <chr>,
## # participant <chr>, tier_id <chr>, type <chr>, time_slot_1 <chr>,
## # time_slot_2 <chr>
Things will generally go better if you never have in filenames:
corpus <- FRelan::read_tier(eaf_file = '../testcorpus/kpv_izva20140330-1-fragment.eaf', linguistic_type = "orthT")
corpus %>% select(conte)
## DOES NOT WORK
## This should be customized somehow
FRelan::read_tier(eaf_file = '../testcorpus/kpv_izva20140330-1-fragment.eaf', linguistic_type = "orthT",
participant_location = "attribute|prefix|suffix")
FRelan::read_eaf('pite.eaf', ind_tier = "reference tier type", sa_tier = "transcription tier type", ss_tier = "token level tier type")
In principle you can write something into csv like this.
FRelan::read_tier(eaf_file = '../testcorpus/kpv_izva20140330-1-fragment.eaf', linguistic_type = "orthT") %>%
write_csv('results.csv')
read_csv('results.csv')
## Parsed with column specification:
## cols(
## content = col_character(),
## annot_id = col_character(),
## ref_id = col_character(),
## participant = col_character(),
## tier_id = col_character(),
## type = col_character(),
## time_slot_1 = col_character(),
## time_slot_2 = col_character()
## )
## # A tibble: 5 x 8
## content
## <chr>
## 1 Ме, кӧнечнэ же, вӧлі кык лун в шоке от Тайланда, потому что, но, менэ стран
## 2 Ааа, но думайта: ветла, - и муні, значит.
## 3 И сэтэн прӧстэ зэй уна ставыс доступнэ, но разврат да быдчемаыс, и сыысь м
## 4 слонъяс вылэ видзеді, экскурсияяс, ок, моресэ аддзылі, сэтэн Южно-Китайскей
## 5 И ставыс сыа отойдитіс, сэсся, ну, кудз бы, иг обращайт на это внимание.
## # ... with 7 more variables: annot_id <chr>, ref_id <chr>,
## # participant <chr>, tier_id <chr>, type <chr>, time_slot_1 <chr>,
## # time_slot_2 <chr>
There are also other ways:
This should be adaptable to most of ELAN tier structures, although I admit one maybe has to think a bit what is going on there. In principle one should always be able to recover the logic of ELAN’s internal structure like this.
read_custom_eaf <- function(path_to_file = '../testcorpus/kpv_udo20120330SazinaJS-encounter.eaf'){
ref <- FRelan::read_tier(eaf_file = path_to_file, linguistic_type = "refT") %>%
dplyr::select(content, annot_id, participant, time_slot_1, time_slot_2) %>%
dplyr::rename(ref = content) %>%
dplyr::rename(ref_id = annot_id)
orth <- FRelan::read_tier(eaf_file = path_to_file, linguistic_type = "orthT") %>%
dplyr::select(content, annot_id, ref_id, participant) %>%
dplyr::rename(orth = content) %>%
dplyr::rename(orth_id = annot_id) # %>%
# dplyr::rename(ref_id = ref_id) # This is there just as a note
token <- FRelan::read_tier(eaf_file = path_to_file, linguistic_type = "wordT") %>%
dplyr::select(content, annot_id, ref_id, participant) %>%
dplyr::rename(token = content) %>%
dplyr::rename(token_id = annot_id) %>%
dplyr::rename(orth_id = ref_id)
lemma <- FRelan::read_tier(eaf_file = path_to_file, linguistic_type = "lemmaT") %>%
dplyr::select(content, annot_id, ref_id, participant) %>%
dplyr::rename(lemma = content) %>%
dplyr::rename(lemma_id = annot_id) %>%
dplyr::rename(token_id = ref_id)
pos <- FRelan::read_tier(eaf_file = path_to_file, linguistic_type = "posT") %>%
dplyr::select(content, ref_id, participant) %>%
dplyr::rename(pos = content) %>%
dplyr::rename(lemma_id = ref_id)
elan <- left_join(ref, orth) %>%
left_join(token) %>%
left_join(lemma) %>%
left_join(pos) %>%
select(token, lemma, pos, time_slot_1, time_slot_2, everything(), -ends_with('_id'))
time_slots <- FRelan::read_timeslots(path_to_file)
corpus <- elan %>%
left_join(time_slots %>% rename(time_slot_1 = time_slot_id)) %>%
rename(time_start = time_value) %>%
left_join(time_slots %>% rename(time_slot_2 = time_slot_id)) %>%
rename(time_end = time_value) %>%
select(token, lemma, pos, participant, time_start, time_end, everything(), -starts_with('time_slot_'))
corpus %>% mutate(filename = path_to_file)
}
library(FRelan)
dir("../testcorpus/", "kpv.+eaf$", full.names = TRUE) %>%
map(read_custom_eaf)
## Joining, by = c("ref_id", "participant")
## Joining, by = c("participant", "orth_id")
## Joining, by = c("participant", "token_id")
## Joining, by = c("participant", "lemma_id")
## Joining, by = "time_slot_1"
## Joining, by = "time_slot_2"
## Joining, by = c("ref_id", "participant")
## Joining, by = c("participant", "orth_id")
## Joining, by = c("participant", "token_id")
## Joining, by = c("participant", "lemma_id")
## Joining, by = "time_slot_1"
## Joining, by = "time_slot_2"
## Joining, by = c("ref_id", "participant")
## Joining, by = c("participant", "orth_id")
## Joining, by = c("participant", "token_id")
## Joining, by = c("participant", "lemma_id")
## Joining, by = "time_slot_1"
## Joining, by = "time_slot_2"
## [[1]]
## # A tibble: 95 x 9
## token lemma pos participant time_start time_end
## <chr> <lgl> <lgl> <chr> <dbl> <dbl>
## 1 Ме NA NA MVF-F-1984 0 6086
## 2 , NA NA MVF-F-1984 0 6086
## 3 кӧнечнэ NA NA MVF-F-1984 0 6086
## 4 же NA NA MVF-F-1984 0 6086
## 5 , NA NA MVF-F-1984 0 6086
## 6 вӧлі NA NA MVF-F-1984 0 6086
## 7 кык NA NA MVF-F-1984 0 6086
## 8 лун NA NA MVF-F-1984 0 6086
## 9 в NA NA MVF-F-1984 0 6086
## 10 шоке NA NA MVF-F-1984 0 6086
## # ... with 85 more rows, and 3 more variables: ref <chr>, orth <chr>,
## # filename <chr>
##
## [[2]]
## # A tibble: 279 x 9
## token lemma pos participant time_start time_end
## <chr> <lgl> <lgl> <chr> <dbl> <dbl>
## 1 Значит NA NA JAI-M-1939 0 6196
## 2 , NA NA JAI-M-1939 0 6196
## 3 турун NA NA JAI-M-1939 0 6196
## 4 ми NA NA JAI-M-1939 0 6196
## 5 пуктам NA NA JAI-M-1939 0 6196
## 6 вӧлі NA NA JAI-M-1939 0 6196
## 7 Кытшыль NA NA JAI-M-1939 0 6196
## 8 коськын NA NA JAI-M-1939 0 6196
## 9 , NA NA JAI-M-1939 0 6196
## 10 квайт NA NA JAI-M-1939 0 6196
## # ... with 269 more rows, and 3 more variables: ref <chr>, orth <chr>,
## # filename <chr>
##
## [[3]]
## # A tibble: 240 x 9
## token lemma pos participant time_start time_end
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 И и CC NTP-M-1986 170 3730
## 2 эшшӧ эшшӧ _ NTP-M-1986 170 3730
## 3 ӧтик ӧтик Num NTP-M-1986 170 3730
## 4 тор тор N NTP-M-1986 170 3730
## 5 , , CLB NTP-M-1986 170 3730
## 6 мый мый CS NTP-M-1986 170 3730
## 7 тэнад тэ Pron NTP-M-1986 170 3730
## 8 , , CLB NTP-M-1986 170 3730
## 9 тэныд тэ Pron NTP-M-1986 170 3730
## 10 мам мам N NTP-M-1986 170 3730
## # ... with 230 more rows, and 3 more variables: ref <chr>, orth <chr>,
## # filename <chr>
library(tidyverse)
library(xml2)
elan_file = "sje20150329b.eaf"
read_xml(elan_file) %>%
xml_find_all("//TIER[@LINGUISTIC_TYPE_REF='wordT' and starts-with(@TIER_ID, 'notes')]") %>%
map(~ glue::glue("Fix tier {xml_attr(.x, 'TIER_ID')} in file {elan_file}"))
CC-BY Niko Partanen 2017