Audio transcription with Whisper from R

Last week, OpenAI released version 2 of an updated neural net called Whisper that approaches human-level robustness and accuracy on speech recognition. You can now call a C/C++ inference engine directly from R, which allows you to transcribe .wav audio files.


To make this easy to do in R, BNOSAC created an R wrapper around the whisper.cpp code. This R package is available at https://github.com/bnosac/audio.whisper and can be installed as follows.

remotes::install_github("bnosac/audio.whisper")

The following code shows how you can transcribe an example 16-bit WAV file containing a fragment of a speech by JFK. The file ships with the package in its samples folder.

library(audio.whisper)
model <- whisper("tiny")
path  <- system.file(package = "audio.whisper", "samples", "jfk.wav")
trans <- predict(model, newdata = path, language = "en", n_threads = 2)
trans
$n_segments
[1] 1

$data
 segment         from           to                                                                                                       text
       1 00:00:00.000 00:00:11.000  And so my fellow Americans ask not what your country can do for you ask what you can do for your country.

$tokens
 segment      token token_prob
       1        And  0.7476438
       1         so  0.9042299
       1         my  0.6872202
       1     fellow  0.9984470
       1  Americans  0.9589157
       1        ask  0.2573057
       1        not  0.7678108
       1       what  0.6542882
       1       your  0.9386917
       1    country  0.9854987
       1        can  0.9813995
       1         do  0.9937403
       1        for  0.9791515
       1        you  0.9925495
       1        ask  0.3058807
       1       what  0.8303462
       1        you  0.9735528
       1        can  0.9711444
       1         do  0.9616748
       1        for  0.9778513
       1       your  0.9604713
       1    country  0.9923630
       1          .  0.4983074
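
The token probabilities make it easy to spot where the model was unsure. The following is a minimal sketch, not part of the package: it rebuilds a few rows of the token table by hand from the output above and filters on a hypothetical cutoff of 0.5. In practice you would use trans$tokens instead of the hand-built table, assuming it is a regular data.frame as the printed output suggests.

```r
# Token table rebuilt by hand from the first rows of the printed output;
# with a real transcription you would start from trans$tokens instead.
tokens <- data.frame(
  segment    = 1L,
  token      = c("And", "so", "my", "fellow", "Americans", "ask", "not", "what"),
  token_prob = c(0.7476438, 0.9042299, 0.6872202, 0.9984470,
                 0.9589157, 0.2573057, 0.7678108, 0.6542882),
  stringsAsFactors = FALSE
)

# Hypothetical cutoff, chosen only for illustration
threshold <- 0.5

# Keep only the tokens the model was least sure about
low_conf <- tokens[tokens$token_prob < threshold, ]
low_conf
```

With these rows, only the token "ask" falls below the cutoff. Such tokens are candidates for manual review.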

Another example is based on a Micro Machines commercial from the 1980s.

I've always wanted to get the transcription of the performances of Francis E. Dec available on UbuWeb Sound, like this performance: https://www.ubu.com/media/sound/dec_francis/Dec-Francis-E_rant1.mp3. This is how you can now do that from R.

library(av)
download.file(url = "https://www.ubu.com/media/sound/dec_francis/Dec-Francis-E_rant1.mp3", 
              destfile = "rant1.mp3", mode = "wb")
av_audio_convert("rant1.mp3", output = "output.wav", format = "wav", sample_rate = 16000)

trans <- predict(model, newdata = "output.wav", language = "en", 
                 duration = 30 * 1000, offset = 7 * 1000, 
                 token_timestamps = TRUE)
trans
$n_segments
[1] 11

$data
 segment         from           to text
       1 00:00:07.000 00:00:09.000 Look at the picture.
       2 00:00:09.000 00:00:11.000 See the skull.
       3 00:00:11.000 00:00:13.000 The part of bone removed.
       4 00:00:13.000 00:00:16.000 The master race Frankenstein radio controls.
       5 00:00:16.000 00:00:18.000 The brain thoughts broadcasting radio.
       6 00:00:18.000 00:00:21.000 The eyesight television. The Frankenstein earphone radio.
       7 00:00:21.000 00:00:25.000 The threshold brain wash radio. The latest new skull reforming.
       8 00:00:25.000 00:00:28.000 To contain all Frankenstein controls.
       9 00:00:28.000 00:00:31.000 Even in thin skulls of white pedigree males.
      10 00:00:31.000 00:00:34.000 Visible Frankenstein controls.
      11 00:00:34.000 00:00:37.000 The synthetic nerve radio, directional and an alloop.

$tokens
 segment   token token_prob   token_from     token_to
       1    Look  0.4281234 00:00:07.290 00:00:07.420
       1      at  0.9485379 00:00:07.420 00:00:07.620
       1     the  0.9758387 00:00:07.620 00:00:07.940
       1 picture  0.9734664 00:00:08.150 00:00:08.580
       1       .  0.9688568 00:00:08.680 00:00:08.910
       2     See  0.9847929 00:00:09.000 00:00:09.420
       2     the  0.7588121 00:00:09.420 00:00:09.840
       2   skull  0.9989663 00:00:09.840 00:00:10.310
       2       .  0.9548351 00:00:10.550 00:00:11.000
       3     The  0.9914295 00:00:11.000 00:00:11.170
       3    part  0.9789217 00:00:11.560 00:00:11.600
       3      of  0.9958754 00:00:11.600 00:00:11.770
       3    bone  0.9759618 00:00:11.770 00:00:12.030
       3 removed  0.9956936 00:00:12.190 00:00:12.710
       3       .  0.9965582 00:00:12.710 00:00:12.940
...
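
Because each segment comes with a start and end time, the result can be turned into subtitles with a few lines of base R. The sketch below is an illustration only, not a feature of the package: it copies the first three segments from the output by hand and formats them as SubRip (SRT) entries. The helper to_srt_time is a made-up name. With a real transcription you would pass trans$data instead, assuming it is a data.frame with the columns shown.

```r
# First three segments copied by hand from the printed $data table;
# with a real transcription you would start from trans$data instead.
segments <- data.frame(
  segment = 1:3,
  from    = c("00:00:07.000", "00:00:09.000", "00:00:11.000"),
  to      = c("00:00:09.000", "00:00:11.000", "00:00:13.000"),
  text    = c("Look at the picture.", "See the skull.",
              "The part of bone removed."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: SRT writes milliseconds after a comma (00:00:07,000),
# so replace the first literal dot in each timestamp.
to_srt_time <- function(x) sub(".", ",", x, fixed = TRUE)

# One SRT entry per segment: index, time range, text
srt <- paste0(
  segments$segment, "\n",
  to_srt_time(segments$from), " --> ", to_srt_time(segments$to), "\n",
  trimws(segments$text), "\n"
)

# Print the entries separated by blank lines; use writeLines(srt, "rant1.srt")
# to save them to a file instead.
cat(srt, sep = "\n")
```

The offset of 7 seconds passed to predict is already reflected in the timestamps, so the subtitles line up with the original recording rather than with the transcribed excerpt.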

We may put it on CRAN in the near future; currently it is only available at https://github.com/bnosac/audio.whisper.

Get in touch if you are interested in this and let us know what you plan to use it for.