Getting complicated

More advanced example

  • In this example we are going to use the emuR R package
  • It connects to the BAS web services
  • Starting point:
    • Forced alignment tools are good enough
    • We can get from utterance level to phoneme level for free

The way it works

  • emuR takes specifically formatted files, builds a database out of them, sends the material to the BAS servers in München, the processing happens there, and we get the result back, which can be written into a Praat TextGrid
  • When we have Praat TextGrids, we can apply PraatScript
  • (We can execute the PraatScript safely from Terminal or R)
  • You’ll see!

  • It’s often a good idea to do all modification and variable creation in one spot…
library(tidyverse)

# read_custom_eaf() is our own ELAN reader, not a package function
eafs <- dir(path = '../testcorpus/', pattern = 'eaf$', full.names = TRUE)
corpus <- eafs %>% map(read_custom_eaf) %>% bind_rows()
corpus <- corpus %>%
  mutate(time_duration = time_end - time_start) %>%
  mutate(audio_file = str_replace(filename, 'eaf$', 'wav')) %>%
  mutate(orth_trimmed = str_replace_all(orth, c('[:punct:]' = '',
                                                '\\s+' = ' '))) %>%
  filter(participant != 'NTP-M-1986') %>% # just getting rid of myself
  select(orth_trimmed, time_start, time_end, time_duration, everything())

plot(density(corpus$time_duration))

  • One type of data emuR can handle is an audio file plus a matching text file
  • It is also picky about the audio files
library(exifr)

corpus %>% distinct(audio_file) %>%
  pull(audio_file) %>%
  map(~ exifr::read_exif(.x)) %>% bind_rows() %>%
  rename(audio_file = SourceFile) %>%
  select(BitsPerSample, Duration, FileType, NumChannels, everything())
## # A tibble: 3 x 18
##   BitsPerSample  Duration FileType NumChannels
##           <int>     <dbl>    <chr>       <int>
## 1            16  21.51046      WAV           1
## 2            16 101.82546      WAV           1
## 3            16  90.70592      WAV           1
## # ... with 14 more variables: audio_file <chr>, ExifToolVersion <dbl>,
## #   FileName <chr>, Directory <chr>, FileSize <int>, FileModifyDate <chr>,
## #   FileAccessDate <chr>, FileInodeChangeDate <chr>,
## #   FilePermissions <int>, FileTypeExtension <chr>, MIMEType <chr>,
## #   Encoding <int>, SampleRate <int>, AvgBytesPerSec <int>
  • Actually, we already have the right number of channels!

  • In case you have too many channels…
library(glue)

corpus %>%
  distinct(audio_file) %>%
  pull(audio_file) %>%
  walk(~ {
    seewave::sox(glue("{.x} -c 1 {str_replace(.x, '.wav$', '-mono.wav')}"))
  })
  • This is the same as running sox file.wav -c 1 file-mono.wav on the command line

  • Let’s define a function that cuts the audio into clips, one per ELAN reference
  • The seewave package can call sox from R
cut_elan_ref <- function(audio_file, reference_id, start, duration){
  # create the output directory on the first run
  if (!dir.exists('../testcorpus/reference_clips')) {
    dir.create('../testcorpus/reference_clips')
  }
  # sox trim wants seconds, ELAN times are in milliseconds
  seewave::sox(command = glue("{audio_file} ../testcorpus/reference_clips/{reference_id}.wav trim {start / 1000} {duration / 1000}"))
}

  • walk is the same as map, but it doesn’t output anything; you use it for side effects (like writing a file)
    • Aaaactually, it silently returns what it got
corpus %>% distinct(audio_file, ref, time_start, time_duration, orth_trimmed) %>%
  split(.$ref) %>%
  walk(~ cut_elan_ref(.x$audio_file, .x$ref, .x$time_start, .x$time_duration)) %>%
  walk(~ write_lines(.x$orth_trimmed[1],
                     path = glue::glue('../testcorpus/reference_clips/', .x$ref[1], '.txt')))
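
  • By the way, we can verify the claim that walk returns its input (a minimal illustration, not part of the pipeline):
out <- walk(1:3, print)
identical(out, 1:3) # TRUE: walk gives back what it got, just invisibly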

library(emuR)

convert_txtCollection(dbName = 'testcorpus',
                      sourceDir = '../testcorpus/reference_clips',
                      targetDir = '.',
                      txtExtension = '.txt',
                      mediaFileExtension = 'wav',
                      attributeDefinitionName = 'orth')
dbHandle <- load_emuDB('testcorpus_emuDB', verbose = FALSE)

runBASwebservice_g2pForTokenization(handle = dbHandle,
                                    transcriptionAttributeDefinitionName = 'orth',
                                    language = 'rus-RU',
                                    orthoAttributeDefinitionName = 'ORT',
                                    resume = FALSE,
                                    verbose = TRUE)
runBASwebservice_g2pForPronunciation(handle = dbHandle,
                                     orthoAttributeDefinitionName = 'ORT',
                                     language = 'und',
                                     canoAttributeDefinitionName = 'KAN',
                                     params = list(embed = 'maus',
                                                   imap = RCurl::fileUpload("../testcorpus/kpv-sampa.txt")),
                                     resume = FALSE,
                                     verbose = TRUE)
runBASwebservice_maus(handle = dbHandle,
                      canoAttributeDefinitionName = 'KAN',
                      language = 'rus-RU',
                      mausAttributeDefinitionName = 'MAUS',
                      chunkLevel = NULL,
                      turnChunkLevelIntoItemLevel = TRUE,
                      perspective = 'default',
                      resume = FALSE,
                      verbose = TRUE)
export_TextGridCollection(dbHandle,
                          targetDir = '../testcorpus/praat_freiburg',
                          attributeDefinitionNames = c('ORT', 'KAN', 'MAUS'))

This is where one starts to think:

“Should I have checked all transcriptions once more before doing this?”

Questions?

Up next: Integrating tools, some nice plots

Integrating tools

Shiny

  • Earlier we had an example of JavaScript-based web content
  • Nice, but still limited
  • It is also possible to build small applications in R (see the sketch after this list)
  • They can be hosted online for free, but things get tricky fast
  • Anyway, having something up and running on a server is already a bit more complex
  • Generally very good for prototyping
    • Compact, lots of examples, logic easy to follow
    • Again, it could all probably be done from scratch in JavaScript
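
  • A minimal sketch of what a small Shiny app looks like (self-contained; faithful is a data set that ships with R):
library(shiny)

# UI: one slider and one plot
ui <- fluidPage(
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20),
  plotOutput("hist")
)

# Server: redraw the histogram whenever the slider moves
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(faithful$eruptions, breaks = input$bins,
         main = "", xlab = "Eruption length (min)")
  })
}

shinyApp(ui, server)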

PraatScript

  • A scripting language that allows using Praat without touching Praat
  • A very good tutorial here
  • Works within Praat, and outside Praat
  • Somewhat popular: internet is full of examples
  • Not entirely easy or intuitive, but not that bad
    • It closely follows what happens in the Praat GUI

We can combine the following facts

  • Praat can be opened from the command line
  • A PraatScript can be run from the command line
  • We can save the results into a text file
  • We can read a text file into R (see the sketch below)
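
  • Put together, it looks roughly like this (a sketch: formants.praat is a hypothetical script that is assumed to write its measurements into formants.txt as tab-separated values, and praat is assumed to be on the PATH):
library(readr)

# run the script headlessly, without opening the Praat GUI
system2("praat", args = c("--run", "formants.praat"))

# read the measurements back into R
formants <- read_tsv("formants.txt")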

What those scripts do

  • For vowels, we get the formants
  • For sibilants, we get the centre of gravity

Why do we do this?

  • It helps to locate vowel segments with mistakes
  • It is interesting!
  • At least with Komi there are several things to check
    • Does the CoG of unvoiced sibilants resemble that of Russian?
    • How do the dialectal extra vowels influence the vowel system as a whole? (one way to look at this is sketched after the list)
    • etc.
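
  • Once the formant measurements are in R, plotting the vowel space takes a few lines (a sketch, assuming the hypothetical formants table from above has columns vowel, F1 and F2):
library(ggplot2)

# conventional vowel chart: F2 right to left, F1 top to bottom
ggplot(formants, aes(x = F2, y = F1, label = vowel)) +
  geom_text() +
  scale_x_reverse() +
  scale_y_reverse()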

If you can look it up in Praat,

you can extract it with PraatScript

Nice, but what are those points?

  • Try:
install.packages("shiny")
install.packages("shinydashboard")
install.packages("tidyverse")
install.packages("ggplot2")
install.packages("tuneR")
install.packages("seewave")
install.packages("forcats")
runGitHub("phoneme-viewer", "langdoc")

Also the cat pictures make a point!

Where do the cat pictures come from?

This is the only part that works on everyone’s laptop!

meow::meow
function () 
{
    url <- paste0("http://thecatapi.com/api/images/get?format=src&type=jpg&size=med")
    tmp <- tempfile()
    dl_status <- download.file(url, tmp, quiet = TRUE, mode = "wb")
    pic <- jpeg::readJPEG(tmp)
    plot(1, type = "n", xlim = c(0, 1), ylim = c(0, 1), bty = "n", 
        xaxt = "n", yaxt = "n", xlab = "", ylab = "")
    graphics::rasterImage(pic, 0, 0, 1, 1)
    rm_status <- file.remove(tmp)
    status <- all(!as.logical(dl_status), rm_status)
    return(invisible(status))
}
<bytecode: 0x1272f51b8>
<environment: namespace:meow>

There is a cat picture API!

  • Conceptually the same as the emuR example
  • “make a phoneme level segmentation” = “give me a medium-sized jpg cat pic”
  • Why do the archives have no APIs?
  • Why do so few morphological analysers have one?
    • Bindings would give the same advantage