Getting complicated

More advanced example

  • In this example we are going to use the emuR R package
  • It connects to the BAS web services
  • Starting point:
    • Forced alignment tools are good enough
    • We can get from utterance level to phoneme level for free

The way it works

  • emuR takes specifically formatted files, builds a database out of them, sends the material to the BAS servers in München, the processing happens there, and we get the result back, which can be written into a Praat TextGrid
  • When we have Praat TextGrids, we can apply PraatScript
  • (We can execute the PraatScript safely from Terminal or R)
  • You’ll see!

  • It’s often a good idea to do all modification and variable creation in one spot…
library(tidyverse)

# read_custom_eaf() is our own ELAN reader, not a package function
eafs <- dir(path = '../testcorpus/', pattern = 'eaf$', full.names = TRUE)
corpus <- eafs %>% map(read_custom_eaf) %>% bind_rows()
corpus <- corpus %>%
  mutate(time_duration = time_end - time_start) %>%
  mutate(audio_file = str_replace(filename, 'eaf$', 'wav')) %>%
  mutate(orth_trimmed = str_replace_all(orth, c('[:punct:]' = '',
                                                '\\s+' = ' '))) %>%
  filter(participant != 'NTP-M-1986') %>% # just getting rid of myself
  select(orth_trimmed, time_start, time_end, time_duration, everything())

plot(density(corpus$time_duration))

  • One type of data emuR can handle is an audio file plus a matching text file
  • It is also picky about the audio files
library(exifr)

corpus %>% distinct(audio_file) %>%
  pull(audio_file) %>%
  map(~ exifr::read_exif(.x)) %>% bind_rows() %>%
  rename(audio_file = SourceFile) %>%
  select(BitsPerSample, Duration, FileType, NumChannels, everything())
## # A tibble: 3 x 18
##   BitsPerSample  Duration FileType NumChannels
##           <int>     <dbl>    <chr>       <int>
## 1            16  21.51046      WAV           1
## 2            16 101.82546      WAV           1
## 3            16  90.70592      WAV           1
## # ... with 14 more variables: audio_file <chr>, ExifToolVersion <dbl>,
## #   FileName <chr>, Directory <chr>, FileSize <int>, FileModifyDate <chr>,
## #   FileAccessDate <chr>, FileInodeChangeDate <chr>,
## #   FilePermissions <int>, FileTypeExtension <chr>, MIMEType <chr>,
## #   Encoding <int>, SampleRate <int>, AvgBytesPerSec <int>
  • Actually, we already have the right number of channels!

  • In case you have too many channels…
library(glue)

corpus %>%
  distinct(audio_file) %>%
  pull(audio_file) %>%
  walk(~ {
    seewave::sox(glue("{.x} -c 1 {str_replace(.x, '.wav$', '-mono.wav')}"))
  })
  • This is the same as running sox file.wav -c 1 file-mono.wav on the command line

  • Let’s define a function that cuts the audio into clips, one per ELAN reference
  • The seewave package can call sox from R
cut_elan_ref <- function(audio_file, reference_id, start, duration){
  # create the output directory on the first run
  if (!dir.exists('../testcorpus/reference_clips')) {
    dir.create('../testcorpus/reference_clips')
  }
  # sox trim wants seconds, ELAN times are in milliseconds
  seewave::sox(command = glue("{audio_file} ../testcorpus/reference_clips/{reference_id}.wav trim {start / 1000} {duration / 1000}"))
}

  • walk is the same as map, but it doesn’t output anything; you use it for side effects (like writing a file)
    • Aaaactually, it silently returns what it got
corpus %>% distinct(audio_file, ref, time_start, time_duration, orth_trimmed) %>%
  split(.$ref) %>%
  walk(~ cut_elan_ref(.x$audio_file, .x$ref, .x$time_start, .x$time_duration)) %>%
  walk(~ write_lines(.x$orth_trimmed[1],
                     path = glue::glue('../testcorpus/reference_clips/', .x$ref[1], '.txt')))
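
  • By the way, we can verify the claim that walk returns its input (a minimal illustration, not part of the pipeline):
out <- walk(1:3, print)
identical(out, 1:3) # TRUE: walk gives back what it got, just invisibly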

library(emuR)

convert_txtCollection(dbName = 'testcorpus',
                      sourceDir = '../testcorpus/reference_clips',
                      targetDir = '.',
                      txtExtension = '.txt',
                      mediaFileExtension = 'wav',
                      attributeDefinitionName = 'orth')
dbHandle <- load_emuDB('testcorpus_emuDB', verbose = FALSE)

runBASwebservice_g2pForTokenization(handle = dbHandle,
                                    transcriptionAttributeDefinitionName = 'orth',
                                    language = 'rus-RU',
                                    orthoAttributeDefinitionName = 'ORT',
                                    resume = FALSE,
                                    verbose = TRUE)
runBASwebservice_g2pForPronunciation(handle = dbHandle,
                                     orthoAttributeDefinitionName = 'ORT',
                                     language = 'und',
                                     canoAttributeDefinitionName = 'KAN',
                                     params = list(embed = 'maus',
                                                   imap = RCurl::fileUpload("../testcorpus/kpv-sampa.txt")),
                                     resume = FALSE,
                                     verbose = TRUE)
runBASwebservice_maus(handle = dbHandle,
                      canoAttributeDefinitionName = 'KAN',
                      language = 'rus-RU',
                      mausAttributeDefinitionName = 'MAUS',
                      chunkLevel = NULL,
                      turnChunkLevelIntoItemLevel = TRUE,
                      perspective = 'default',
                      resume = FALSE,
                      verbose = TRUE)
export_TextGridCollection(dbHandle,
                          targetDir = '../testcorpus/praat_freiburg',
                          attributeDefinitionNames = c('ORT', 'KAN', 'MAUS'))

This is where one starts to think:

“Should I have checked all transcriptions once more before doing this?”

Questions?

Up next: Integrating tools, some nice plots

Integrating tools

Shiny

  • Earlier we had an example of JavaScript-based web content
  • Nice, but still limited
  • It is also possible to build small applications in R (see the sketch after this list)
  • They can be hosted online for free, but things get tricky fast
  • Anyway, having something up and running on a server is already a bit more complex
  • Generally very good for prototyping
    • Compact, lots of examples, logic easy to follow
    • Again, it could all probably be done from scratch in JavaScript
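
  • A minimal sketch of what a small Shiny app looks like (self-contained; faithful is a data set that ships with R):
library(shiny)

# UI: one slider and one plot
ui <- fluidPage(
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20),
  plotOutput("hist")
)

# Server: redraw the histogram whenever the slider moves
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(faithful$eruptions, breaks = input$bins,
         main = "", xlab = "Eruption length (min)")
  })
}

shinyApp(ui, server)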

PraatScript

  • A scripting language that allows using Praat without touching Praat
  • A very good tutorial here
  • Works within Praat, and outside Praat
  • Somewhat popular: internet is full of examples
  • Not entirely easy or intuitive, but not that bad
    • It closely follows what happens in the Praat GUI

We can combine the following facts

  • Praat can be opened from the command line
  • A PraatScript can be run from the command line
  • We can save the results into a text file
  • We can read a text file into R (see the sketch below)
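
  • Put together, it looks roughly like this (a sketch: formants.praat is a hypothetical script that is assumed to write its measurements into formants.txt as tab-separated values, and praat is assumed to be on the PATH):
library(readr)

# run the script headlessly, without opening the Praat GUI
system2("praat", args = c("--run", "formants.praat"))

# read the measurements back into R
formants <- read_tsv("formants.txt")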

What those scripts do

  • For vowels, we get the formants
  • For sibilants, we get the centre of gravity

Why do we do this?

  • It helps to locate vowel segments with mistakes
  • It is interesting!
  • At least with Komi there are several things to check
    • Does the CoG of unvoiced sibilants resemble that of Russian?
    • How do the dialectal extra vowels influence the vowel system as a whole? (one way to look at this is sketched after the list)
    • etc.
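
  • Once the formant measurements are in R, plotting the vowel space takes a few lines (a sketch, assuming the hypothetical formants table from above has columns vowel, F1 and F2):
library(ggplot2)

# conventional vowel chart: F2 right to left, F1 top to bottom
ggplot(formants, aes(x = F2, y = F1, label = vowel)) +
  geom_text() +
  scale_x_reverse() +
  scale_y_reverse()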

If you can look it up in Praat,

you can extract it with PraatScript

Nice, but what are those points?

  • Try:
install.packages("shiny")
install.packages("shinydashboard")
install.packages("tidyverse")
install.packages("ggplot2")
install.packages("tuneR")
install.packages("seewave")
install.packages("forcats")
runGitHub("phoneme-viewer", "langdoc")

Also the cat pictures make a point!

Where do the cat pictures come from?

This is the only part that works on everyone’s laptop!

meow::meow
function () 
{
    url <- paste0("http://thecatapi.com/api/images/get?format=src&type=jpg&size=med")
    tmp <- tempfile()
    dl_status <- download.file(url, tmp, quiet = TRUE, mode = "wb")
    pic <- jpeg::readJPEG(tmp)
    plot(1, type = "n", xlim = c(0, 1), ylim = c(0, 1), bty = "n", 
        xaxt = "n", yaxt = "n", xlab = "", ylab = "")
    graphics::rasterImage(pic, 0, 0, 1, 1)
    rm_status <- file.remove(tmp)
    status <- all(!as.logical(dl_status), rm_status)
    return(invisible(status))
}
<bytecode: 0x1272f51b8>
<environment: namespace:meow>

There is a cat picture API!

  • Conceptually the same as the emuR example
  • “make a phoneme level segmentation” = “give me a medium-sized jpg cat pic”
  • Why do the archives have no APIs?
  • Why do so few morphological analysers have one?
    • Bindings would give the same advantage