
Groq Whisper Instagram Reel Subtitler

By Chris Ho

This guide will walk you through creating an automated subtitle generator for Instagram Reels using Groq Whisper. The script extracts audio from a video, transcribes it using Groq's Whisper API, and overlays word-by-word subtitles onto the video.

Example video output: example_video_output.mp4

Technologies Used

  • Groq Whisper Large V3 Turbo: AI-powered speech-to-text transcription with word-level timestamps.
  • MoviePy: Handles video and subtitle overlaying.
  • Python OS Module: Reads the API key from the environment.

Step 1: Install Dependencies

Ensure you have the necessary Python packages installed:

pip install moviepy groq python-dotenv

Note: MoviePy requires FFmpeg, an open-source program that handles audio and video. You can download it from https://ffmpeg.org/download.html, or if you have Homebrew installed on macOS, run this command in your terminal:

brew install ffmpeg
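
If you're unsure whether FFmpeg is visible to your Python environment, a quick check like the one below can confirm it before you run the main script. This is a minimal sketch using Python's standard shutil module, not part of the cookbook script itself:

import shutil

# Look for the ffmpeg executable on the system PATH
ffmpeg_path = shutil.which("ffmpeg")
if ffmpeg_path is None:
    print("FFmpeg not found; install it from https://ffmpeg.org/download.html")
else:
    print("FFmpeg found at:", ffmpeg_path)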

Step 2: Set Up API Key

Create a GroqCloud account and get your API key:

Sign up at GroqCloud, navigate to API Keys, and click Generate API Key.

Store the key securely in a .env file:

GROQ_API_KEY=your_groq_api_key

Then, in the captioner.py file, import the packages and load the API key from the .env file.

import os

from dotenv import load_dotenv
from groq import Groq
from moviepy import *
from moviepy.video.io.VideoFileClip import VideoFileClip

# Load the API key from the .env file created above
load_dotenv()
GROQ_API_KEY = os.environ["GROQ_API_KEY"]
client = Groq(api_key=GROQ_API_KEY)

Step 3: Convert MP4 to MP3

Before transcribing, we must extract audio from the video.

def convert_mp4_to_mp3(mp4_filepath, mp3_file):
    """
    Converts an MP4 file to MP3.

    Args:
        mp4_filepath: Path to the input MP4 file.
        mp3_file: Path to save the output MP3 file.
    """
    video_clip = VideoFileClip(mp4_filepath)

    # Extract the audio track from the video and save it as an MP3
    video_clip.audio.write_audiofile(mp3_file)
    print("Extracted audio saved to:", mp3_file)
    video_clip.close()
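
As a quick standalone test, you could call the function directly. The file names below are the same placeholders used in Step 6; substitute your own paths:

# Extract the audio track from input.mp4 and save it as output.mp3
convert_mp4_to_mp3("input.mp4", "output.mp3")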

Step 4: Transcribe Audio Using Groq Whisper

Now that we have the MP3 file from the function above, we send the extracted audio to Whisper hosted on Groq for lightning-fast transcription.

We use Whisper's verbose_json response format to get back timestamped word segments, so we know when to place each word on the video.

def transcribe_audio(mp3_file):
    """
    Transcribes an audio file using the Groq Whisper API.

    Args:
        mp3_file (str): Path to the MP3 file.

    Returns:
        list: Transcribed words with timestamps.
    """
    with open(mp3_file, "rb") as file:
        transcription = client.audio.transcriptions.create(
            file=(mp3_file, file.read()),
            model="whisper-large-v3-turbo",  # Alternatively, use "distil-whisper-large-v3-en" for a faster, lower-cost (English-only) option
            timestamp_granularities=["word"],  # Word-level timestamps
            response_format="verbose_json",
            language="en",
            temperature=0.0
        )

    print(transcription.words)
    return transcription.words
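
If you'd like to sanity-check the timestamps before wiring up the video overlay, a small loop like this prints the first few word entries (output.mp3 is the same placeholder path used in Step 6):

# Inspect the first five timestamped words returned by Whisper
words = transcribe_audio("output.mp3")
for w in words[:5]:
    print(w["word"], w["start"], w["end"])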

Step 5: Overlay Subtitle Clips

From the previous function, we'll receive a list of timestamped word segments. We'll loop through these segments and create TextClips to be placed into the video at the correct times.

Example of the JSON you would receive that we'll iterate through:

[
{'word': 'This', 'start': 0.1, 'end': 0.28},
{'word': 'month', 'start': 0.28, 'end': 0.56},
{'word': 'I', 'start': 0.56, 'end': 0.78},
{'word': 'traveled', 'start': 0.78, 'end': 1.12},
{'word': 'to', 'start': 1.12, 'end': 1.38}
...

def add_subtitles(verbose_json, width, fontsize):
    text_clips = []

    for segment in verbose_json:
        text_clips.append(
            TextClip(text=segment["word"],
                     font_size=fontsize,
                     stroke_width=5,
                     stroke_color="black",
                     font="./Roboto-Condensed-Bold.otf",
                     color="white",
                     size=(width, None),  # Wrap text to the video's width
                     method="caption",
                     text_align="center",
                     margin=(30, 0))
            .with_start(segment["start"])  # Show the word when it is spoken...
            .with_end(segment["end"])      # ...and hide it when the next word starts
            .with_position("center")
        )
    return text_clips

Step 6: Call the Functions

Now that we've defined the functions, we need to create the appropriate variables and call the functions in the correct order.

# Change video_file to the path of your input video
video_file = "input.mp4"

# The output video name and path
output_file = "output_with_subtitles.mp4"

# Load the video as a VideoFileClip
original_clip = VideoFileClip(video_file)
width = original_clip.w  # The width of the video, so the subtitles don't overflow

# Where the extracted MP3 audio from the video will be saved
mp3_file = "output.mp3"
convert_mp4_to_mp3(video_file, mp3_file)

# Call Whisper hosted on Groq to get the timestamped word segments
segments = transcribe_audio(mp3_file)

# Create a list of text clips from the segments
text_clip_list = add_subtitles(segments, width, fontsize=40)

# Create a CompositeVideoClip with the original video and the text clips
final_clip = CompositeVideoClip([original_clip] + text_clip_list)

# Generate the final video with subtitles on it
final_clip.write_videofile(output_file, codec="libx264")
print("Subtitled video saved as:", output_file)
Step 7: Run the Python Script

(Replace captioner.py with your Python file's name if it is called something else.)

python3 captioner.py

Troubleshooting:

  • On macOS, VS Code and Finder use different audio decoders on playback. Adding audio_codec="aac" to the output line, i.e. final_clip.write_videofile(output_file, codec="libx264", audio_codec="aac"), will let you hear the audio when opening the video in Finder. Without it, you will only be able to hear the audio from within VS Code and not from Finder.