Transcribe YouTube videos
Transcribe YouTube videos to get transcripts with timestamps using Whisper Large v3 from Open AI
This guide covers how to use, and optimise, Whisper Large v3 to extract text from the audio of YouTube videos. If you want to use this model straight away you can use a live version on our explore page here for free!
We will be covering the following:
- How to setup and install the necessary python dependencies
- Downloading YouTube videos
- Extracting Audio from a Video file
- Creating transcripts from an Audio file using
openai/whisper-large-v3
from Huggingface.
We will be using the pipeline-ai
library to wrap our code and try the model through a premade dashboard included in the library, but the code will work outside of this dependency as well and is easy to adapt.
All source code is available on our Github
https://github.com/mystic-ai/pipeline/tree/main/examples/audio-to-text/youtube-transcript
Setting up your environment
To begin you will need the following dependencies:
pipeline-ai==2.0.15
transformers==4.37.2
torch==2.2.0
accelerate==0.27.2
pytube==15.0.0
moviepy==1.0.3
and you will also need the following system dependencies:
sudo apt-get install -y ffmpeg
FFmpeg is necessary for the audio processing towards the end of this guide.
This can now be wrapped into the pipeline.yaml
for your pipeline. Execute the following command in an empty directory to begin working with a fresh pipeline:
pipeline container init -n youtube-transcript
then in the pipeline.yaml
paste in the following:
runtime:
container_commands:
- apt-get update
- apt-get install -y ffmpeg git
python:
version: "3.10"
requirements:
- pipeline-ai
- transformers==4.37.2
- torch==2.2.0
- accelerate==0.27.2
- pytube==15.0.0
- moviepy==1.0.3
cuda_version: "11.4"
accelerators: ["nvidia_l4"]
pipeline_graph: new_pipeline:my_new_pipeline
pipeline_name: youtube-transcript
description: null
readme: null
extras: {}
cluster: null
Downloading YouTube videos and getting the audio
To download YouTube videos we will be using the very simple pytube
library and the YouTube
object. This library supports downloading Videos in multiple formats, and other options, but we will reduce the complexity of this and only download the audio component of the video:
from pytube import YouTube
video_link = "PUT A LINK IN HERE"
audio_stream = YouTube(video_link).streams.filter(only_audio=True).first()
if audio_stream is None:
raise ValueError("Invalid video link")
audio_stream.download(filename="sample.mp4")
This will download the video locally to sample.mp4
. Note that this is still in the mp4
format (video) so we need to perform one extra step to get a dedicated audio file out. This is where moviepy
comes in:
from moviepy.editor import AudioFileClip
clip = AudioFileClip("sample.mp4")
clip.write_audiofile("sample.mp3")
Now we have our perfect audio file!
Now that we have the basic downloading/processing initial steps done we can put this code into our pipeline inference steps. In the new_pipeline.py
python file in your directory you can update the code to the following:
from pytube import YouTube
from urllib.parse import urlparse
from moviepy.editor import AudioFileClip
from pipeline import Pipeline, Variable, entity, pipe
# Put your model inside of the below entity class
@entity
class MyModelClass:
@pipe(run_once=True, on_startup=True)
def load(self) -> None:
# Perform any operations needed to load your model here
print("Loading model...")
... # Model loading coming later
print("Model loaded!")
@pipe
def predict(self, video_link: str) -> dict:
print("Predicting...")
url = urlparse(video_link)
if url.netloc != "www.youtube.com":
raise ValueError("Invalid video link")
audio_stream = YouTube(video_link).streams.filter(only_audio=True).first()
if audio_stream is None:
raise ValueError("Invalid video link")
audio_stream.download(filename="sample.mp4")
clip = AudioFileClip("sample.mp4")
clip.write_audiofile("sample.mp3")
... # Inference coming later
print("Prediction complete!")
# print(result)
return result
with Pipeline() as builder:
input_var = Variable(
str,
description="Input video link",
title="Input video link",
)
my_model = MyModelClass()
my_model.load()
output_var = my_model.predict(input_var)
builder.output(output_var)
my_new_pipeline = builder.get_pipeline()
Extra validation
In the above snippet we use
urllib
to validate the url and check that we are referencing YouTube correctly. This is highly recommended to avoid potential issues with URLs being incorrectly formatted and strange exceptions being raised.
Getting the transcript
This step is where the ML portion comes into play. Now that we have our audio file we can being to load the Whisper Large v3 model. We'll being doing this using transformers:
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
model_id,
torch_dtype=torch_dtype,
low_cpu_mem_usage=True,
use_safetensors=True,
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
ml_pipe = pipeline(
"automatic-speech-recognition",
model=model,
tokenizer=processor.tokenizer,
feature_extractor=processor.feature_extractor,
max_new_tokens=128,
chunk_length_s=30,
batch_size=16,
return_timestamps=True,
torch_dtype=torch_dtype,
device=device,
)
Performing inference is now very basic:
result = ml_pipe(
"sample.mp3",
return_timestamps=True,
)
'''
result = {
"text" : "this is the text ....",
"chunks" : [] # This is a list of the timestamps and texts
}
'''
We can wrap all of this code into our full pipeline script:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from pytube import YouTube
from urllib.parse import urlparse
from pipeline import Pipeline, Variable, entity, pipe
from moviepy.editor import AudioFileClip
# Put your model inside of the below entity class
@entity
class MyModelClass:
@pipe(run_once=True, on_startup=True)
def load(self) -> None:
# Perform any operations needed to load your model here
print("Loading model...")
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
model_id,
torch_dtype=torch_dtype,
low_cpu_mem_usage=True,
use_safetensors=True,
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
"automatic-speech-recognition",
model=model,
tokenizer=processor.tokenizer,
feature_extractor=processor.feature_extractor,
max_new_tokens=128,
chunk_length_s=30,
batch_size=16,
return_timestamps=True,
torch_dtype=torch_dtype,
device=device,
)
self.pipe = pipe
print("Model loaded!")
@pipe
def predict(self, video_link: str) -> dict:
print("Predicting...")
url = urlparse(video_link)
if url.netloc != "www.youtube.com":
raise ValueError("Invalid video link")
audio_stream = YouTube(video_link).streams.filter(only_audio=True).first()
if audio_stream is None:
raise ValueError("Invalid video link")
audio_stream.download(filename="sample.mp4")
clip = AudioFileClip("sample.mp4")
clip.write_audiofile("sample.mp3")
result = self.pipe(
"sample.mp3",
return_timestamps=True,
)
print("Prediction complete!")
# print(result)
return result
with Pipeline() as builder:
input_var = Variable(
str,
description="Input video link",
title="Input video link",
)
my_model = MyModelClass()
my_model.load()
output_var = my_model.predict(input_var)
builder.output(output_var)
my_new_pipeline = builder.get_pipeline()
Running the following commands will give you an interactive dashboard to try this locally:
pipeline container build
pipeline container up -d
And now navigate to your browser to see the following:
To deploy this privately on Mystic you can then run:
pipeline container push
Create an account to get serverless deployments
Create an account today at mystic.ai
Caching model weights to reduce cold start
To cache the model weights in the container so that they're not downloaded every time you will need to create a new file in the working directory called download_weights.py
, and enter the following:
import torch
from transformers import AutoModelForSpeechSeq2Seq
model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
model_id,
torch_dtype=torch.float16,
low_cpu_mem_usage=True,
use_safetensors=True,
)
model.save_pretrained("./model_weights")
This will pull the weights to the local directory to be built directly into the container to reduce the cold start time. To use the weights you will need to point the model to the directory, and rebuild the container. After you've run the above script you will need to rebuild with pipeline container build
, and then point transformers to that directory by editing the model load code to the following:
model = AutoModelForSpeechSeq2Seq.from_pretrained(
"./model_weights",
torch_dtype=torch_dtype,
low_cpu_mem_usage=True,
use_safetensors=True,
)
This will now look for the model locally instead of downloading it every time!
Updated 6 months ago