How does ELAN work?

Basically it …

  • Saves things into XML file
  • Saves something when you press save
  • Finishes some things when exiting program
  • pfsx files contain
    • View settings
    • Last opening location
  • Does some validation
    • One when opening the file
    • One with Multiple files search
  • Opening file in ELAN and touching something has consequence of XML being pretty printed

Later validation more strickt

How is ELAN file structured?

read_xml('../adv_elan_draft/notebooks/test.eaf') %>% xml_structure()
## <ANNOTATION_DOCUMENT [AUTHOR, DATE, FORMAT, VERSION, noNamespaceSchemaLocation, xmlns:xsi]>
##   <HEADER [MEDIA_FILE, TIME_UNITS]>
##     <MEDIA_DESCRIPTOR [MEDIA_URL, MIME_TYPE, RELATIVE_MEDIA_URL]>
##     <PROPERTY [NAME]>
##       {text}
##     <PROPERTY [NAME]>
##       {text}
##   <TIME_ORDER>
##     <TIME_SLOT [TIME_SLOT_ID, TIME_VALUE]>
##     <TIME_SLOT [TIME_SLOT_ID, TIME_VALUE]>
##     <TIME_SLOT [TIME_SLOT_ID, TIME_VALUE]>
##     <TIME_SLOT [TIME_SLOT_ID, TIME_VALUE]>
##     <TIME_SLOT [TIME_SLOT_ID, TIME_VALUE]>
##     <TIME_SLOT [TIME_SLOT_ID, TIME_VALUE]>
##     <TIME_SLOT [TIME_SLOT_ID, TIME_VALUE]>
##     <TIME_SLOT [TIME_SLOT_ID, TIME_VALUE]>
##   <TIER [LINGUISTIC_TYPE_REF, TIER_ID]>
##     <ANNOTATION>
##       <ALIGNABLE_ANNOTATION [ANNOTATION_ID, TIME_SLOT_REF1, TIME_SLOT_REF2]>
##         <ANNOTATION_VALUE>
##           {text}
##     <ANNOTATION>
##       <ALIGNABLE_ANNOTATION [ANNOTATION_ID, TIME_SLOT_REF1, TIME_SLOT_REF2]>
##         <ANNOTATION_VALUE>
##           {text}
##   <TIER [LINGUISTIC_TYPE_REF, PARENT_REF, PARTICIPANT, TIER_ID]>
##     <ANNOTATION>
##       <REF_ANNOTATION [ANNOTATION_ID, ANNOTATION_REF]>
##         <ANNOTATION_VALUE>
##           {text}
##     <ANNOTATION>
##       <REF_ANNOTATION [ANNOTATION_ID, ANNOTATION_REF]>
##         <ANNOTATION_VALUE>
##           {text}
##   <TIER [LINGUISTIC_TYPE_REF, PARTICIPANT, TIER_ID]>
##     <ANNOTATION>
##       <ALIGNABLE_ANNOTATION [ANNOTATION_ID, TIME_SLOT_REF1, TIME_SLOT_REF2]>
##         <ANNOTATION_VALUE>
##           {text}
##     <ANNOTATION>
##       <ALIGNABLE_ANNOTATION [ANNOTATION_ID, TIME_SLOT_REF1, TIME_SLOT_REF2]>
##         <ANNOTATION_VALUE>
##           {text}
##   <TIER [LINGUISTIC_TYPE_REF, PARENT_REF, PARTICIPANT, TIER_ID]>
##     <ANNOTATION>
##       <REF_ANNOTATION [ANNOTATION_ID, ANNOTATION_REF]>
##         <ANNOTATION_VALUE>
##           {text}
##     <ANNOTATION>
##       <REF_ANNOTATION [ANNOTATION_ID, ANNOTATION_REF]>
##         <ANNOTATION_VALUE>
##           {text}
##   <LINGUISTIC_TYPE [GRAPHIC_REFERENCES, LINGUISTIC_TYPE_ID, TIME_ALIGNABLE]>
##   <LINGUISTIC_TYPE [CONSTRAINTS, GRAPHIC_REFERENCES, LINGUISTIC_TYPE_ID, TIME_ALIGNABLE]>
##   <LINGUISTIC_TYPE [CONSTRAINTS, GRAPHIC_REFERENCES, LINGUISTIC_TYPE_ID, TIME_ALIGNABLE]>
##   <CONSTRAINT [DESCRIPTION, STEREOTYPE]>
##   <CONSTRAINT [DESCRIPTION, STEREOTYPE]>
##   <CONSTRAINT [DESCRIPTION, STEREOTYPE]>
##   <CONSTRAINT [DESCRIPTION, STEREOTYPE]>

What we want from it?

  • Participant, session name, tier content, time codes
  • Logical combination of the above
  • Good test for internal logic of an ELAN tier structure!

Example file

Where are the annotations?

  • On individual tiers, for each speaker on their own structure
  • The relations are stored within tier ID’s
  • But not if you use included in as linguistic type!

Where are the participants?

And time codes?

Notice with time codes…

  • Time codes are just listed, annotations tell what are starts and ends
  • But I think it uses one time slot only once?
  • Times are in milliseconds

What does this mean?

  • Parsing ELAN file content demands walking through the logic in tier structure
  • At times there is little logic :(
  • However …

bad structure

always wins

inconsistent structure

Parsing ELAN file to R

  • In earlier example we used read_eaf() function
  • Nice, but demands quite specific structure
    • Although customizable
  • There are also functions read_tier() and read_timeslot()
library(FRelan)
read_tier(eaf_file = '../testcorpus/kpv_izva20140330-1-fragment.eaf',
linguistic_type = 'wordT')
## # A tibble: 95 x 8
##    content annot_id ref_id participant         tier_id  type time_slot_1
##      <chr>    <chr>  <chr>       <chr>           <chr> <chr>       <chr>
##  1      Ме     a124     a6  MVF-F-1984 word@MVF-F-1984 wordT        <NA>
##  2       ,     a125     a6  MVF-F-1984 word@MVF-F-1984 wordT        <NA>
##  3 кӧнечнэ     a126     a6  MVF-F-1984 word@MVF-F-1984 wordT        <NA>
##  4      же     a127     a6  MVF-F-1984 word@MVF-F-1984 wordT        <NA>
##  5       ,     a128     a6  MVF-F-1984 word@MVF-F-1984 wordT        <NA>
##  6    вӧлі     a129     a6  MVF-F-1984 word@MVF-F-1984 wordT        <NA>
##  7     кык     a130     a6  MVF-F-1984 word@MVF-F-1984 wordT        <NA>
##  8     лун     a131     a6  MVF-F-1984 word@MVF-F-1984 wordT        <NA>
##  9       в     a132     a6  MVF-F-1984 word@MVF-F-1984 wordT        <NA>
## 10    шоке     a133     a6  MVF-F-1984 word@MVF-F-1984 wordT        <NA>
## # ... with 85 more rows, and 1 more variables: time_slot_2 <chr>

  • The output we get is directly from the ELAN XML file
  • Let’s look into some other types of tier
read_tier(eaf_file = '../testcorpus/kpv_izva20140330-1-fragment.eaf',
linguistic_type = 'refT')
## # A tibble: 5 x 8
##                    content annot_id ref_id participant        tier_id
##                      <chr>    <chr>  <chr>       <chr>          <chr>
## 1 kpv_izva20140330-1-b-097       a1   <NA>  MVF-F-1984 ref@MVF-F-1984
## 2 kpv_izva20140330-1-b-098       a2   <NA>  MVF-F-1984 ref@MVF-F-1984
## 3 kpv_izva20140330-1-b-099       a3   <NA>  MVF-F-1984 ref@MVF-F-1984
## 4 kpv_izva20140330-1-b-100       a4   <NA>  MVF-F-1984 ref@MVF-F-1984
## 5 kpv_izva20140330-1-b-101       a5   <NA>  MVF-F-1984 ref@MVF-F-1984
## # ... with 3 more variables: type <chr>, time_slot_1 <chr>,
## #   time_slot_2 <chr>

What’s going on?

  • Whatever the content, it gets returned as “content”
  • If there are time slot values, those are picked
  • Participant and tier id are picked as well
  • We also get annotation id and reference id
    • But as we understand, one tiers id is anothers reference

path_to_file = '../testcorpus/kpv_udo20120330SazinaJS-encounter.eaf'
ref <- FRelan::read_tier(eaf_file = path_to_file, linguistic_type = "refT") %>%
dplyr::select(content, annot_id, participant, time_slot_1, time_slot_2) %>%
dplyr::rename(ref = content) %>%
dplyr::rename(ref_id = annot_id)
orth <- FRelan::read_tier(eaf_file = path_to_file, linguistic_type = "orthT") %>%
dplyr::select(content, annot_id, ref_id, participant) %>%
dplyr::rename(orth = content) %>%
dplyr::rename(orth_id = annot_id) # %>%
# dplyr::rename(ref_id = ref_id) # This is there just as a note
token <- FRelan::read_tier(eaf_file = path_to_file, linguistic_type = "wordT") %>%
dplyr::select(content, annot_id, ref_id, participant) %>%
dplyr::rename(token = content) %>%
dplyr::rename(token_id = annot_id) %>%
dplyr::rename(orth_id = ref_id)
lemma <- FRelan::read_tier(eaf_file = path_to_file, linguistic_type = "lemmaT") %>%
dplyr::select(content, annot_id, ref_id, participant) %>%
dplyr::rename(lemma = content) %>%
dplyr::rename(lemma_id = annot_id) %>%
dplyr::rename(token_id = ref_id)
pos <- FRelan::read_tier(eaf_file = path_to_file, linguistic_type = "posT") %>%
dplyr::select(content, ref_id, participant) %>%
dplyr::rename(pos = content) %>%
dplyr::rename(lemma_id = ref_id)

elan <- left_join(ref, orth) %>%
left_join(token) %>%
left_join(lemma) %>%
left_join(pos) %>%
select(token, lemma, pos, time_slot_1, time_slot_2, everything(), -ends_with('_id'))
time_slots <- FRelan::read_timeslots(path_to_file)
corpus <- elan %>%
left_join(time_slots %>% rename(time_slot_1 = time_slot_id)) %>%
rename(time_start = time_value) %>%
left_join(time_slots %>% rename(time_slot_2 = time_slot_id)) %>%
rename(time_end = time_value) %>%
select(token, lemma, pos, participant, time_start, time_end, everything(), -starts_with('time_slot_'))

corpus
## # A tibble: 240 x 8
##    token lemma   pos participant time_start time_end
##    <chr> <chr> <chr>       <chr>      <dbl>    <dbl>
##  1     И     и    CC  NTP-M-1986        170     3730
##  2  эшшӧ  эшшӧ     _  NTP-M-1986        170     3730
##  3  ӧтик  ӧтик   Num  NTP-M-1986        170     3730
##  4   тор   тор     N  NTP-M-1986        170     3730
##  5     ,     ,   CLB  NTP-M-1986        170     3730
##  6   мый   мый    CS  NTP-M-1986        170     3730
##  7 тэнад    тэ  Pron  NTP-M-1986        170     3730
##  8     ,     ,   CLB  NTP-M-1986        170     3730
##  9 тэныд    тэ  Pron  NTP-M-1986        170     3730
## 10   мам   мам     N  NTP-M-1986        170     3730
## # ... with 230 more rows, and 2 more variables: ref <chr>, orth <chr>
read_custom_eaf <- function(path_to_file){
  all the code from above...
}

FRelan::read_custom_eaf(path_to_file = 'path/to/my_file.eaf')

What can go wrong?

  • Tier doesn’t exist
  • Tier has a different name
  • Tier types are different
  • XML is malformed

Parsing actual corpus

Looping through files

  • In R there are plenty of ways not to write a for-loop
  • Idea is always same: take multiple items of something and do something for every unit
  • All paths to the files in corpus is a good starting point
dir(path = '../testcorpus', pattern = '.eaf$', full.names = TRUE)
## [1] "../testcorpus/kpv_izva20140330-1-fragment.eaf"      
## [2] "../testcorpus/kpv_izva20140404IgusevJA-fragment.eaf"
## [3] "../testcorpus/kpv_udo20120330SazinaJS-encounter.eaf"

Or:

dir(path = '../testcorpus', pattern = '.+izva.+eaf$', full.names = TRUE)
## [1] "../testcorpus/kpv_izva20140330-1-fragment.eaf"      
## [2] "../testcorpus/kpv_izva20140404IgusevJA-fragment.eaf"

elan_files <- dir(path = '../testcorpus', pattern = '.eaf$', full.names = TRUE)
elan_files %>% map(read_custom_eaf)
## [[1]]
## # A tibble: 95 x 9
##      token lemma   pos participant time_start time_end
##      <chr> <lgl> <lgl>       <chr>      <dbl>    <dbl>
##  1      Ме    NA    NA  MVF-F-1984          0     6086
##  2       ,    NA    NA  MVF-F-1984          0     6086
##  3 кӧнечнэ    NA    NA  MVF-F-1984          0     6086
##  4      же    NA    NA  MVF-F-1984          0     6086
##  5       ,    NA    NA  MVF-F-1984          0     6086
##  6    вӧлі    NA    NA  MVF-F-1984          0     6086
##  7     кык    NA    NA  MVF-F-1984          0     6086
##  8     лун    NA    NA  MVF-F-1984          0     6086
##  9       в    NA    NA  MVF-F-1984          0     6086
## 10    шоке    NA    NA  MVF-F-1984          0     6086
## # ... with 85 more rows, and 3 more variables: ref <chr>, orth <chr>,
## #   session_name <chr>
## 
## [[2]]
## # A tibble: 279 x 9
##      token lemma   pos participant time_start time_end
##      <chr> <lgl> <lgl>       <chr>      <dbl>    <dbl>
##  1  Значит    NA    NA  JAI-M-1939          0     6196
##  2       ,    NA    NA  JAI-M-1939          0     6196
##  3   турун    NA    NA  JAI-M-1939          0     6196
##  4      ми    NA    NA  JAI-M-1939          0     6196
##  5  пуктам    NA    NA  JAI-M-1939          0     6196
##  6    вӧлі    NA    NA  JAI-M-1939          0     6196
##  7 Кытшыль    NA    NA  JAI-M-1939          0     6196
##  8 коськын    NA    NA  JAI-M-1939          0     6196
##  9       ,    NA    NA  JAI-M-1939          0     6196
## 10   квайт    NA    NA  JAI-M-1939          0     6196
## # ... with 269 more rows, and 3 more variables: ref <chr>, orth <chr>,
## #   session_name <chr>
## 
## [[3]]
## # A tibble: 240 x 9
##    token lemma   pos participant time_start time_end
##    <chr> <chr> <chr>       <chr>      <dbl>    <dbl>
##  1     И     и    CC  NTP-M-1986        170     3730
##  2  эшшӧ  эшшӧ     _  NTP-M-1986        170     3730
##  3  ӧтик  ӧтик   Num  NTP-M-1986        170     3730
##  4   тор   тор     N  NTP-M-1986        170     3730
##  5     ,     ,   CLB  NTP-M-1986        170     3730
##  6   мый   мый    CS  NTP-M-1986        170     3730
##  7 тэнад    тэ  Pron  NTP-M-1986        170     3730
##  8     ,     ,   CLB  NTP-M-1986        170     3730
##  9 тэныд    тэ  Pron  NTP-M-1986        170     3730
## 10   мам   мам     N  NTP-M-1986        170     3730
## # ... with 230 more rows, and 3 more variables: ref <chr>, orth <chr>,
## #   session_name <chr>
  • This list format is useful for testing!

elan_corpus <- elan_files %>% map(read_custom_eaf) %>% bind_rows()
meta <- dir('../testcorpus/', pattern = 'cmdi$', full.names = TRUE) %>%
map(read_cmdi) %>%
bind_rows()
test_corpus <- left_join(elan_corpus, meta) %>% left_join(read_csv('coordinates.csv'))

  • I can recommend trying View function in RStudio
  • Function is more portable than script!
write_rds(test_corpus, 'test_corpus.rds')
test_corpus <- read_rds('test_corpus.rds')
source('parse_corpus.R')
test_corpus <- monster_function_that_does_everything(folder_to_go = "~/Desktop/corpus")
test_corpus %>% View

test_corpus %>% count(participant)
## # A tibble: 4 x 2
##   participant     n
##         <chr> <int>
## 1  JAI-M-1939   275
## 2  JSS-F-1988   197
## 3  MVF-F-1984    95
## 4  NTP-M-1986    47
test_corpus %>% count(session_location)
## # A tibble: 3 x 2
##    session_location     n
##               <chr> <int>
## 1     Diyur, Russia    95
## 2 Helsinki, Finland   240
## 3 Syktyvkar, Russia   279
test_corpus %>% count(year_birth)
## # A tibble: 4 x 2
##   year_birth     n
##        <chr> <int>
## 1       1939   275
## 2       1984    95
## 3       1986    47
## 4       1988   197

Next: More advanced example or Trying this further?