Command line tools in language documentation

As far as I see, there are few command line tools that are irreplaceable in language documentation work. These are imagemagick for images, SoX for audio files, ffmpeg for video files and pandoc for general file conversions. Each of these is so useful that they could merit complete books to be written about them, but I just share here one useful trick how SoX can be used.

Corrupted WAV files

I don’t know at all how common this problem is. I suspect fairly common, but it is hard to say. With some recorders, at least on Edirol R-09 which I have used for many years, if the battery dies or the recorder falls or for some other reason the recording is interrupted, it is very possible that the WAV file becomes corrupted. It doesn’t open, it can’t be previewed, on Mac Finder it looks as if it had no length. This whole “corruption” means in this case, as far as I understand, that the recorder didn’t finish writing the file properly and left something out from the WAV header. All the recorded content is there in the file, but the file isn’t really finalized. Mysteriously some programs can open these files, for example, I’ve often been able to listen them on VLC player.

Many ways to fix these files involve opening it as RAW audio file in Audacity and saving that into a new file. This works well. However, you can also use following command in SoX:

sox --ignore-length corrupted.wav fixed.wav

So any SoX command that copies the file but does nothing to the content should do quite the same, as long as it ends up rewriting the header. --ignore-length doesn’t really have any real purpose here, but I think otherwise SoX would get stuck to the way how corrupted file seemingly has the length of 0:00.

I’m assuming it would be entirely safe to remove the corrupted files, but actually I have for now kept those as well. I don’t know if this makes sense, but I’m anyway rather paranoid about having backups and deleting files always feels bit hazardous. Anyway I wanted to share this procedure here as it is regularly occurring problem. I think I’ve used also different SoX commands in the past, but I had to Google this again now when I was working with some older recordings, and ended up with this. This leads of course into more complex question of how we actually store in machine readable format all manipulations which we have done in different software, but this is another topic already…

SoX has tons of parameters that can be tweaked and used For example, with command:

soxi file.wav

It is easy to get information about the audio file. Another case is that some tools may need very specific version of the wav file, and this can be acquired, for example, like this:

sox normal.wav -b 16 -r 16000 -c 1 specific.wav

It is common that different software demands something like this. SoX can also be called from R with seewave package, and there seems to be a Python wrapper as well.


Creative Commons Attribution 4.0 2016 Niko Partanen or individual writers.