Transcribe videos using OpenAI Whisper for free
Introduction
OpenAI, the company behind GPT-3 and DALL-E 2, has just released a speech recognition model called Whisper that can transcribe audio in multiple languages and translate it into English. The main difference from the other two models is that Whisper is available under an open source license, so we can download it, customize it and run it as much as we want. Let’s try it!
We will set up a local environment to run our Whisper project. In particular, we are interested in transcribing some of our YouTube videos to text.
Setting up the environment for Data Science
Over the last few weeks, I have been exploring different templates to organize my data science projects, so I have decided to use one of them, the Cookiecutter for Data Science, as the basis for this project. Let’s configure it from scratch.
Setting up the virtual environment
Cookiecutter is a Python program that generates project templates based on rules and user prompts. In particular, we will use the Data Science Cookiecutter. Its documentation recommends setting up the virtual environments using virtualenvwrapper:
```
pip install virtualenvwrapper
export WORKON_HOME=~/Envs
mkdir -p $WORKON_HOME
source /home/elsatch/miniconda3/bin/virtualenvwrapper.sh
```
Please note that the path to the virtualenvwrapper.sh file might vary on your system. This path worked for Ubuntu plus miniconda, but on my Mac it was `source /Users/username/opt/anaconda3/bin/virtualenvwrapper.sh`.
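Note that these commands only configure the current shell session. If you want virtualenvwrapper available in every new terminal, a common approach (not required for this walkthrough) is to append the same lines to your shell startup file:

```
# Make virtualenvwrapper load in every new shell (adjust the paths to your system)
echo 'export WORKON_HOME=~/Envs' >> ~/.bashrc
echo 'source /home/elsatch/miniconda3/bin/virtualenvwrapper.sh' >> ~/.bashrc
```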
To create and activate the virtual environment:
```
mkvirtualenv whisper_env
```
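In later sessions, you can reactivate the environment with the workon command that virtualenvwrapper provides:

```
workon whisper_env
```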
Setting up Cookiecutter
To install Cookiecutter:
```
python3 -m pip install cookiecutter
```
To initialize our project, yt_whisper, we just run this command, filling in the different options:
```
cookiecutter https://github.com/drivendata/cookiecutter-data-science
```
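Cookiecutter will prompt for the project name, author, license and so on, and then generate the folder structure. This is an abridged view of what the Data Science template produces (the exact layout may vary with the template version):

```
yt_whisper/
├── data/
│   ├── external/      <- Data from third party sources (our downloaded audio)
│   ├── interim/
│   ├── processed/
│   └── raw/
├── models/
├── notebooks/
└── src/               <- Source code for this project
```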
Setting up Whisper
To install Whisper:
```
pip install git+https://github.com/openai/whisper.git
```
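Whisper relies on the ffmpeg command-line tool to read audio files, so make sure it is available on your system. On Ubuntu, for example:

```
sudo apt update && sudo apt install ffmpeg
```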
Testing Whisper
To test Whisper, we have created a couple of scripts inside the data folder. The first one, get_videos.py, will download the audio from a YouTube video to your data/external folder. YouTube offers several formats (or streams) for each video. To list all of them, you can use the tools youtube-dl or yt-dlp:
```
yt-dlp -F https://www.youtube.com/watch\?v\=9hfqdGO49Po
[youtube] 9hfqdGO49Po: Downloading webpage
[youtube] 9hfqdGO49Po: Downloading android player API JSON
[info] Available formats for 9hfqdGO49Po:
ID EXT RESOLUTION FPS │ FILESIZE TBR PROTO │ VCODEC VBR ACODEC ABR ASR MORE INFO
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
sb2 mhtml 48x27 │ mhtml │ images storyboard
sb1 mhtml 80x45 │ mhtml │ images storyboard
sb0 mhtml 160x90 │ mhtml │ images storyboard
139 m4a │ 12.25MiB 48k https │ audio only mp4a.40.5 48k 22050Hz low, m4a_dash
249 webm │ 12.78MiB 50k https │ audio only opus 50k 48000Hz low, webm_dash
250 webm │ 15.88MiB 63k https │ audio only opus 63k 48000Hz low, webm_dash
140 m4a │ 32.51MiB 129k https │ audio only mp4a.40.2 129k 44100Hz medium, m4a_dash
251 webm │ 28.98MiB 115k https │ audio only opus 115k 48000Hz medium, webm_dash
17 3gp 176x144 6 │ 20.78MiB 82k https │ mp4v.20.3 82k mp4a.40.2 0k 22050Hz 144p
394 mp4 256x144 25 │ 21.94MiB 87k https │ av01.0.00M.08 87k video only 144p, mp4_dash
160 mp4 256x144 25 │ 11.21MiB 44k https │ avc1.4d400c 44k video only 144p, mp4_dash
278 webm 256x144 25 │ 23.34MiB 92k https │ vp9 92k video only 144p, webm_dash
395 mp4 426x240 25 │ 34.87MiB 138k https │ av01.0.00M.08 138k video only 240p, mp4_dash
133 mp4 426x240 25 │ 22.90MiB 91k https │ avc1.4d4015 91k video only 240p, mp4_dash
242 webm 426x240 25 │ 40.59MiB 161k https │ vp9 161k video only 240p, webm_dash
396 mp4 640x360 25 │ 61.12MiB 243k https │ av01.0.01M.08 243k video only 360p, mp4_dash
134 mp4 640x360 25 │ 40.80MiB 162k https │ avc1.4d401e 162k video only 360p, mp4_dash
18 mp4 640x360 25 │ ~ 74.99MiB 291k https │ avc1.42001E 291k mp4a.40.2 0k 44100Hz 360p
243 webm 640x360 25 │ 74.18MiB 295k https │ vp9 295k video only 360p, webm_dash
397 mp4 854x480 25 │ 107.40MiB 427k https │ av01.0.04M.08 427k video only 480p, mp4_dash
135 mp4 854x480 25 │ 68.83MiB 274k https │ avc1.4d401e 274k video only 480p, mp4_dash
244 webm 854x480 25 │ 124.10MiB 494k https │ vp9 494k video only 480p, webm_dash
398 mp4 1280x720 25 │ 205.31MiB 817k https │ av01.0.05M.08 817k video only 720p, mp4_dash
136 mp4 1280x720 25 │ 109.82MiB 437k https │ avc1.4d401f 437k video only 720p, mp4_dash
22 mp4 1280x720 25 │ ~145.66MiB 566k https │ avc1.64001F 566k mp4a.40.2 0k 44100Hz 720p
247 webm 1280x720 25 │ 239.34MiB 953k https │ vp9 953k video only 720p, webm_dash
399 mp4 1920x1080 25 │ 395.44MiB 1575k https │ av01.0.08M.08 1575k video only 1080p, mp4_dash
137 mp4 1920x1080 25 │ 334.54MiB 1332k https │ avc1.640028 1332k video only 1080p, mp4_dash
248 webm 1920x1080 25 │ 432.80MiB 1723k https │ vp9 1723k video only 1080p, webm_dash
```
Out of all the audio formats, the “best quality” in m4a is 140 (129 kbps at 44100 Hz). You might also try the webm format (251). We will pass this format ID as a parameter in our script get_videos.py, which uses the pytube library (pip install pytube):
```python
from pytube import YouTube

# Video: Negocios de Impresión 3D bajo demanda con Diego Trapero
yt = YouTube('https://www.youtube.com/watch?v=9hfqdGO49Po')

# Pick the audio-only m4a stream we identified above (itag 140)
stream = yt.streams.get_by_itag(140)
stream.download(output_path='../../data/external/', filename='negocios3D.m4a')
```
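By the way, yt-dlp can also download the selected stream directly if you prefer to skip the Python step; something along these lines should be equivalent (the output path is just where our script stores the audio):

```
yt-dlp -f 140 -o '../../data/external/negocios3D.m4a' https://www.youtube.com/watch\?v\=9hfqdGO49Po
```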
Now it’s time to transcribe it using Whisper. You can do it with a script like this:
```python
import whisper

# Load the multilingual medium model (~5 GB of VRAM, see the table below)
model = whisper.load_model('medium')

# Whisper detects the language automatically and returns the transcription
result = model.transcribe('../../data/external/negocios3D.m4a')
print(result['text'])
```
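Printing to the console is fine for a quick test, but you will probably want to keep the transcript. As a minimal sketch (the output filename here is just an example), you can save the text to a file, and optionally pass the language explicitly to skip auto-detection:

```python
import whisper

# Transcribe forcing Spanish instead of auto-detection, then save the text
model = whisper.load_model('medium')
result = model.transcribe('../../data/external/negocios3D.m4a', language='es')

# negocios3D.txt is just an example filename for the saved transcript
with open('../../data/external/negocios3D.txt', 'w', encoding='utf-8') as f:
    f.write(result['text'])
```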
This is a sample output using the medium model:
Hola, bienvenidos a un nuevo episodio de Laura Maker, bienvenidos a una nueva entrevista. Hoy, por petición popular, volvemos a estar con… De una persona. Diego Trapero, efectivamente. Alguien vio la entrevista y le gustó y dijo, tenéis que volver a veros. Y aquí estamos. Solo que veréis que esto ha cambiado un poco para los que nos estéis suscitando por el podcast. Tenéis un vídeo con un tour que acabamos de preparar para que veáis cómo son las oficinas de Bitfab, que es el proyecto que está lanzando Diego y en el que hablamos en la última entrevista. Pero claro, no es lo mismo hablarlo en abstracto que estar aquí. Bueno, cuéntanos dónde estamos y qué es lo que tenemos por aquí.
As Spanish speakers will notice, the output is quite precise. It got almost everything right, except for the podcast name.
The larger the model, the better the accuracy, but also the longer the processing time and the more GPU memory you need. As per the OpenAI Whisper documentation:
| Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|---|---|---|---|---|---|
| tiny | 39 M | tiny.en | tiny | ~1 GB | ~32x |
| base | 74 M | base.en | base | ~1 GB | ~16x |
| small | 244 M | small.en | small | ~2 GB | ~6x |
| medium | 769 M | medium.en | medium | ~5 GB | ~2x |
| large | 1550 M | N/A | large | ~10 GB | 1x |
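If you are unsure which model will fit on your card, you can ask PyTorch (installed as a Whisper dependency) how much VRAM is available; a quick check along these lines:

```python
import torch

# Print the total memory of the first CUDA device, in GB
if torch.cuda.is_available():
    vram = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f'GPU VRAM: {vram:.1f} GB')
else:
    print('No CUDA GPU detected')
```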
In my particular case, I could only go up to medium, as my RTX 2060 has 6 GB of VRAM. I tried running large, but ran into an error:
```
OutOfMemoryError: CUDA out of memory. Tried to allocate 26.00 MiB (GPU 0; 5.79 GiB total capacity; 4.49 GiB already allocated; 11.19 MiB free; 4.70 GiB reserved in total by PyTorch)
```
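If you run into this, one workaround is to run the large model on the CPU instead, which is much slower but avoids the VRAM limit. whisper.load_model accepts a device argument:

```python
import whisper

# Fall back to CPU when the model does not fit in GPU memory (much slower)
model = whisper.load_model('large', device='cpu')
result = model.transcribe('../../data/external/negocios3D.m4a')
```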
How much would it cost to transcribe at scale?
I’d like to transcribe my complete channel of YouTube videos. In total, there are around 300 videos with an average duration of 15 minutes, so the total running time would be around 75 hours. Let’s round that up to 100 hours.
To run the transcription, I’d need a cloud instance equipped with a GPU that has at least 10 GB of VRAM. Checking some providers, and assuming no overhead to run this task (setting up the environment, scheduling, etc.), it would cost:
| Provider | Instance | GPU | GPU VRAM | Cost per hour | Total cost |
|---|---|---|---|---|---|
| Lambda Labs | 1xNVIDIA A100 | A100 | 40 GB | 1.10 USD | 110 USD |
| Paperspace | P5000 | P5000 | 16 GB | 0.78 USD | 78 USD |
Then you’d need to add storage, network transfer, etc. But this is a starting point!
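For reference, the totals in the table come from straightforward arithmetic; a tiny sketch like this (with the rates above as its only inputs) makes it easy to compare other providers:

```python
# 300 videos x 15 min = 4500 min = 75 h of audio, rounded up to 100 h
hours = 100

# USD per hour for each candidate instance (from the table above)
rates = {'Lambda Labs 1xA100': 1.10, 'Paperspace P5000': 0.78}

for instance, rate in rates.items():
    print(f'{instance}: {hours * rate:.0f} USD')
```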
Conclusions
Compared to previous attempts at transcription, the quality of OpenAI Whisper is astonishing. It produces punctuation and works in several languages. Next step on the road: learning how to serialize/parallelize the work, so I can process all the videos in a controlled manner.
Photo by Kristina Flour on Unsplash