Getting our test corpus into R

Niko Partanen

Test corpus

What is there?

  • Tiny collection of Komi recordings
  • Can be downloaded from GitHub:
    • git clone https://github.com/langdoc/testcorpus
  • Fragmentary recordings, but quite realistic examples
  • Metadata in CMDI files

Structure

  • Follows the Freiburg standards in tier structure and naming
    • Closely matches the Kola Saami and Pite Saami corpora
    • reference tier
      • transcription tier
        • token tier
          • lemma
            • pos
      • translation tier
  • ELAN corpora are very individualistic
    • Nothing works out of the box, but we will move on to customization after lunch

FRelan

  • An R package that contains many functions usable with the Freiburg standard
  • Some parts probably adaptable elsewhere
library(devtools)
install_github('langdoc/FRelan')
  • Later we will focus on the read_tier() and read_cmdi() functions
  • After that, the individual parsing steps can be combined into a new function (a rough sketch of what such a parsing function does is shown below)
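
To give an idea of what happens under the hood, here is a minimal sketch that pulls one tier out of an EAF file with xml2 directly. The function name read_one_tier and the idea of passing a tier ID are made up for illustration only; FRelan::read_eaf() and read_tier() are more complete than this.

library(xml2)
library(tibble)

read_one_tier <- function(eaf_file, tier_id){
  doc    <- read_xml(eaf_file)                   # parse the ELAN XML file
  tier   <- xml_find_first(doc, paste0("//TIER[@TIER_ID='", tier_id, "']"))
  values <- xml_find_all(tier, './/ANNOTATION_VALUE')  # annotation content of that tier
  tibble(tier = tier_id, content = xml_text(values))
}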

Reading files into R

library(tidyverse)
library(xml2)
corpus <- dir('../testcorpus', pattern = 'eaf$', full.names = TRUE) %>% # list all ELAN files in the test corpus
  map(FRelan::read_eaf) %>%  # parse each .eaf file into a tibble
  bind_rows() %>%            # stack them into one data frame
  select(token, participant, session_name, time_start, time_end, everything())
corpus
## # A tibble: 595 x 11
##      token participant                session_name time_start time_end
##      <chr>       <chr>                       <chr>      <dbl>    <dbl>
##  1      ме  MVF-F-1984 kpv_izva20140330-1-fragment          0     6086
##  2       ,  MVF-F-1984 kpv_izva20140330-1-fragment          0     6086
##  3 кӧнечнэ  MVF-F-1984 kpv_izva20140330-1-fragment          0     6086
##  4      же  MVF-F-1984 kpv_izva20140330-1-fragment          0     6086
##  5       ,  MVF-F-1984 kpv_izva20140330-1-fragment          0     6086
##  6    вӧлі  MVF-F-1984 kpv_izva20140330-1-fragment          0     6086
##  7     кык  MVF-F-1984 kpv_izva20140330-1-fragment          0     6086
##  8     лун  MVF-F-1984 kpv_izva20140330-1-fragment          0     6086
##  9       в  MVF-F-1984 kpv_izva20140330-1-fragment          0     6086
## 10    шоке  MVF-F-1984 kpv_izva20140330-1-fragment          0     6086
## # ... with 585 more rows, and 6 more variables: utterance <chr>,
## #   reference <chr>, filename <chr>, word <chr>, after <chr>, before <chr>
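
As a quick sanity check (an illustration, not part of the original workflow), ordinary dplyr verbs already work on this tibble, for example counting tokens per session and speaker:

corpus %>%
  count(session_name, participant, sort = TRUE)  # token counts per speaker and session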

Working further

  • At this point it is easy to filter and examine the result like any other data frame
  • I have described the basic ‘verbs’ here
corpus %>%
  filter(token == 'вӧлі')
## # A tibble: 15 x 11
##    token participant                      session_name time_start time_end
##    <chr>       <chr>                             <chr>      <dbl>    <dbl>
##  1  вӧлі  MVF-F-1984       kpv_izva20140330-1-fragment          0     6086
##  2  вӧлі  MVF-F-1984       kpv_izva20140330-1-fragment       9313    14870
##  3  вӧлі  JAI-M-1939 kpv_izva20140404IgusevJA-fragment          0     6196
##  4  вӧлі  JAI-M-1939 kpv_izva20140404IgusevJA-fragment       6675     9865
##  5  вӧлі  JAI-M-1939 kpv_izva20140404IgusevJA-fragment      23375    27600
##  6  вӧлі  JAI-M-1939 kpv_izva20140404IgusevJA-fragment      27848    32261
##  7  вӧлі  JAI-M-1939 kpv_izva20140404IgusevJA-fragment      80405    83726
##  8  вӧлі  JAI-M-1939 kpv_izva20140404IgusevJA-fragment      83726    85805
##  9  вӧлі  NTP-M-1986 kpv_udo20120330SazinaJS-encounter       9080     9970
## 10  вӧлі  JSS-F-1988 kpv_udo20120330SazinaJS-encounter      13510    17400
## 11  вӧлі  JSS-F-1988 kpv_udo20120330SazinaJS-encounter      17405    19515
## 12  вӧлі  JSS-F-1988 kpv_udo20120330SazinaJS-encounter      17405    19515
## 13  вӧлі  JSS-F-1988 kpv_udo20120330SazinaJS-encounter      34016    34993
## 14  вӧлі  JSS-F-1988 kpv_udo20120330SazinaJS-encounter      35781    38639
## 15  вӧлі  JSS-F-1988 kpv_udo20120330SazinaJS-encounter      81389    82220
## # ... with 6 more variables: utterance <chr>, reference <chr>,
## #   filename <chr>, word <chr>, after <chr>, before <chr>
corpus %>%
  filter(lag(token) == 'татшӧм' & token == 'вӧлі') %>%
  select(token, utterance, everything())
## # A tibble: 1 x 11
##   token    utterance participant                      session_name
##   <chr>        <chr>       <chr>                             <chr>
## 1  вӧлі Татшӧм вӧлі.  JSS-F-1988 kpv_udo20120330SazinaJS-encounter
## # ... with 7 more variables: time_start <dbl>, time_end <dbl>,
## #   reference <chr>, filename <chr>, word <chr>, after <chr>, before <chr>
  • lag() and lead() give the previous and next values
  • With a POS-tagged corpus, for example, one could easily search in this manner:
corpus %>% filter(lag(pos) == 'Pron' & pos == 'V')
  • This finds all pronoun + verb bigrams; a related bigram count is sketched below
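
As a small illustration (not from the original slides), the same idea can be used to count token bigrams directly; grouping by session and speaker keeps lead() from pairing tokens across different files and speakers:

corpus %>%
  group_by(session_name, participant) %>%  # stay within one speaker and session
  mutate(next_token = lead(token)) %>%     # attach the following token to each row
  ungroup() %>%
  filter(!is.na(next_token)) %>%
  count(token, next_token, sort = TRUE)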

Housekeeping

  • It can also be useful to look for inconsistencies
  • There should only be characters that belong to the Komi writing system
    • One approach is to look for tokens that contain neither punctuation nor Cyrillic characters; anything that turns up is suspect!
corpus %>% filter(! str_detect(token, '[[:punct:]\\p{Cyrillic}]'))
## # A tibble: 1 x 11
##   token participant                      session_name time_start time_end
##   <chr>       <chr>                             <chr>      <dbl>    <dbl>
## 1     a  NTP-M-1986 kpv_udo20120330SazinaJS-encounter      85947    86212
## # ... with 6 more variables: utterance <chr>, reference <chr>,
## #   filename <chr>, word <chr>, after <chr>, before <chr>
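
A slightly stricter variant (an illustration, not in the original slides) flags tokens that contain any character outside punctuation or the Cyrillic script, which would also catch mixed-script tokens such as a single Latin letter inside an otherwise Cyrillic word:

corpus %>%
  filter(str_detect(token, '[^[:punct:]\\p{Cyrillic}]'))  # negated character class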

Opening file

corpus %>%
  filter(! str_detect(token, '[[:punct:]\\p{Cyrillic}]')) %>%
  FRelan::open_eaf(1)

  • To break this down:
    • a character class: [ ]
    • the POSIX punctuation class: [:punct:]
    • the Cyrillic Unicode script: \p{Cyrillic} (written '\\p{Cyrillic}' inside an R string)
  • This could almost be done in ELAN as well
  • A quick check of the pattern on a few hand-made tokens follows below
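
A quick way to convince yourself what the pattern matches is to test it on a few hand-made tokens (illustration only, not corpus data; str_detect() comes with the already loaded tidyverse):

str_detect(c('вӧлі', 'a', 'word,', ','), '[[:punct:]\\p{Cyrillic}]')
## [1]  TRUE FALSE  TRUE  TRUE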

Metadata

What about it?

  • Often discussed in the archiving context
  • Target of intensive standardization
    • … with very inconclusive results
  • Comes in a variety of formats
  • Cannot be used in ELAN searches

CMDI

Parsing CMDI to R

library(glue)
read_cmdi <- function(cmdi_file){     # this defines the function
  read_xml(cmdi_file) %>%             # reads the xml
    xml_find_all('//cmd:Actor') %>%   # finds all Actor nodes
    map(~ tibble(participant      = .x %>% xml_find_first('./cmd:Code') %>% xml_text(),
                 session_name     = .x %>% xml_find_first('../../cmd:Name') %>% xml_text(),
                 year_birth       = .x %>% xml_find_first('./cmd:BirthDate') %>% xml_text(),
                 year_rec         = .x %>% xml_find_first('../../cmd:Date') %>% xml_text(),
                 role             = .x %>% xml_find_first('./cmd:Role') %>% xml_text(),
                 sex              = .x %>% xml_find_first('./cmd:Sex') %>% xml_text(),
                 session_address  = .x %>% xml_find_first('../../cmd:Location/cmd:Address') %>% xml_text(),
                 session_country  = .x %>% xml_find_first('../../cmd:Location/cmd:Country') %>% xml_text(),
                 session_location = paste0(session_address, ', ', session_country),
                 education        = .x %>% xml_find_first('./cmd:Education') %>% xml_text(),
                 name_full        = .x %>% xml_find_first('./cmd:FullName') %>% xml_text())) %>%
    bind_rows() # After everything is collected into one tibble per actor,
                # we can just bind the rows together
}

Applying the function

At this point we can apply the function we just wrote to all the CMDI files we have.

metadata <- dir('../testcorpus', 'cmdi$', full.names = TRUE) %>%
  map(read_cmdi) %>%
  bind_rows()
metadata
## # A tibble: 9 x 11
##   participant                      session_name year_birth   year_rec
##         <chr>                             <chr>      <chr>      <chr>
## 1  MVF-F-1984       kpv_izva20140330-1-fragment       1984 2014-03-30
## 2  VCP-M-1993       kpv_izva20140330-1-fragment       1993 2014-03-30
## 3  NTP-M-1986       kpv_izva20140330-1-fragment       1986 2014-03-30
## 4   MR-M-1971       kpv_izva20140330-1-fragment       1971 2014-03-30
## 5   RB-M-1971       kpv_izva20140330-1-fragment       1971 2014-03-30
## 6  JAI-M-1939 kpv_izva20140404IgusevJA-fragment       1939 2014-04-04
## 7  NTP-M-1986 kpv_izva20140404IgusevJA-fragment       1986 2014-04-04
## 8  NTP-M-1986 kpv_udo20120330SazinaJS-encounter       1986 2012-03-30
## 9  JSS-F-1988 kpv_udo20120330SazinaJS-encounter       1988 2012-03-30
## # ... with 7 more variables: role <chr>, sex <chr>, session_address <chr>,
## #   session_country <chr>, session_location <chr>, education <chr>,
## #   name_full <chr>
corpus_full <- left_join(corpus, metadata)
corpus_full
## # A tibble: 595 x 20
##      token participant                session_name time_start time_end
##      <chr>       <chr>                       <chr>      <dbl>    <dbl>
##  1      ме  MVF-F-1984 kpv_izva20140330-1-fragment          0     6086
##  2       ,  MVF-F-1984 kpv_izva20140330-1-fragment          0     6086
##  3 кӧнечнэ  MVF-F-1984 kpv_izva20140330-1-fragment          0     6086
##  4      же  MVF-F-1984 kpv_izva20140330-1-fragment          0     6086
##  5       ,  MVF-F-1984 kpv_izva20140330-1-fragment          0     6086
##  6    вӧлі  MVF-F-1984 kpv_izva20140330-1-fragment          0     6086
##  7     кык  MVF-F-1984 kpv_izva20140330-1-fragment          0     6086
##  8     лун  MVF-F-1984 kpv_izva20140330-1-fragment          0     6086
##  9       в  MVF-F-1984 kpv_izva20140330-1-fragment          0     6086
## 10    шоке  MVF-F-1984 kpv_izva20140330-1-fragment          0     6086
## # ... with 585 more rows, and 15 more variables: utterance <chr>,
## #   reference <chr>, filename <chr>, word <chr>, after <chr>,
## #   before <chr>, year_birth <chr>, year_rec <chr>, role <chr>, sex <chr>,
## #   session_address <chr>, session_country <chr>, session_location <chr>,
## #   education <chr>, name_full <chr>

  • Let’s observe what we have in R for a second!
  • Let’s change something, e.g. in the metadata (one possible change is sketched below)
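
One possible change (a sketch, not something done in the original slides) is to derive each speaker's approximate age at recording time from the metadata, since year_birth and year_rec are currently plain character strings:

corpus_full %>%
  mutate(year_birth = as.integer(year_birth),            # '1984' -> 1984
         year_rec   = as.integer(substr(year_rec, 1, 4)), # '2014-03-30' -> 2014
         age_rec    = year_rec - year_birth) %>%
  distinct(participant, session_name, age_rec)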

Exploring the corpus

  • What can we do with the values we have?
  • Is something missing or problematic?
  • Is it always clear what we “have”?

# The geocoding step below was run once and its result cached in a csv file:
# coordinates <- corpus_full %>%
#   distinct(session_location) %>%
#   as.data.frame() %>%
#   ggmap::mutate_geocode(session_location) %>%
#   as_tibble()
# write_csv(coordinates, 'coordinates.csv')
coordinates <- read_csv('coordinates.csv', col_types = 'cdd')
corpus_geo <- left_join(corpus_full, coordinates) %>%
  rename(lon_session = lon,
         lat_session = lat)
library(leaflet)
library(htmlwidgets)
library(widgetframe)
map <- leaflet(data = corpus_geo %>% add_count(session_name)) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addCircleMarkers(lng = ~lon_session,
                   lat = ~lat_session, radius = ~log(n),
                   popup = ~glue('Recording place: {session_location}<br/>
                                 Number of tokens: {n}'))
frameWidget(map)
# 'kpv' below refers to a larger corpus data frame with columns such as
# title_eng and token_count; it is not built in these slides.
kpv_map <- leaflet(data = kpv %>% filter(! is.na(lon_session))) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addCircleMarkers(lng = ~jitter(lon_session, 10),
                   lat = ~jitter(lat_session, 10),
                   popup = ~glue('{session_name}<br/>
                                 {title_eng}<br/>
                                 Recording place: {session_location}<br/>
                                 Number of tokens: {token_count}<br/>
                                 <a href="">Link to archive</a>'),
                   clusterOptions = markerClusterOptions())
frameWidget(kpv_map)
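
If the map needs to live outside these slides, the widget can also be written to a standalone HTML page (a usage sketch; the file name here is arbitrary):

htmlwidgets::saveWidget(map, 'corpus_map.html', selfcontained = TRUE)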

What did we just create?

Alright, then just add {fancy feature}!

Not so fast!

Simplicity comes with drawbacks

  • It is trivially easy to add features, if…
    • They are supported
    • Someone has added them to the R package we use
  • Anything can be added…
    • But that demands writing JavaScript
    • And in-depth knowledge of the related libraries

Worth noting

  • This is not an actual application
  • There are limits to the interactivity
  • But this is not a bad deal after all