Chunking Longer Audio Files for Whisper Models on Groq
By Hatice Ozen | View in the Groq Cookbook
By default, our speech endpoints only support audio files up to 25MB via direct file uploads. If you have audio files that are longer than 25MB, you'll need to chunk your audio! In this tutorial, we'll learn how to process long audio files efficiently using chunking methods with Groq API. Breaking down audio files into manageable chunks is essential for reliable transcription of longer recordings while maintaining high accuracy.
Groq is great for processing long audio files thanks to its fast inference speed: even hours of audio, once split into chunks, can be transcribed in a matter of minutes. We'll use Whisper Large V3 powered by Groq and learn how to:
- Preprocess audio files for optimal transcription
- Split audio files into manageable chunks
- Implement a smart overlap for our chunks
- Transcribe our chunks using Whisper Large V3
- Merge our results while properly handling overlaps
- Save our transcriptions in multiple formats for further handling
Sound exciting? Let's get chunking!
Step 1: Install Required Libraries
First, you'll need FFmpeg installed on your system since we'll use it for format conversion and one of our required libraries depends on it. You can install it with the following:
- Windows: Download from https://ffmpeg.org/download.html
- Mac:
brew install ffmpeg
- Linux:
sudo apt-get install ffmpeg
Now, let's install the libraries we'll need for audio processing and transcription. Although there are other libraries for audio manipulation, we'll use PyDub since it provides a high-level interface that can handle all the audio formats supported by Groq API through ffmpeg, which is what we'll use for preprocessing our audio:
pip install groq
pip install pydub
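Optionally, you can quickly confirm that FFmpeg is discoverable before moving on. This is just a sanity-check sketch, assuming ffmpeg is on your PATH:
# Optional sanity check: make sure ffmpeg is available before processing audio
import shutil
import subprocess

if shutil.which("ffmpeg") is None:
    raise RuntimeError("ffmpeg not found on PATH - install it before continuing")

subprocess.run(["ffmpeg", "-version"], check=True)  # prints the installed version info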
Step 2: Import Required Libraries
Now that we've installed the libraries we need, let's import them:
from groq import Groq, RateLimitError # For interacting with Groq API
from pydub import AudioSegment # For audio processing and chunking
import json # For saving our transcription results (optional)
from pathlib import Path # For file path handling (optional)
from datetime import datetime # For timestamping our output files (optional)
import time # For tracking processing duration (optional)
import subprocess # For running FFmpeg commands
import os # For environment variables and file handling
import tempfile # For safe temporary file handling
import re # For regex checking during audio chunk merging
Step 3: Preprocess Audio to Downsample (Optional)
Whisper models require audio to be in 16,000 Hz mono format before transcribing for standardization, and Groq API will re-encode audio files to these settings after receiving them. You may want to preprocess your audio files client-side if your original file is extremely large and you want to make it smaller without a loss in quality (without chunking, Groq API speech endpoints accept up to 25MB). We recommend FLAC for lossless compression.
Let's define a function called preprocess_audio that takes the file path for any audio file as input and returns the path to a converted FLAC file. We'll use Python's subprocess to run ffmpeg with the proper arguments for downsampling our audio to 16kHz and converting it to mono (a single audio channel).
We'll also use a few extra (but optional) parameters for suppressing FFmpeg version info and build details, only showing errors and suppressing warnings, and automatically overwriting our output file if it already exists to ensure we always get a fresh conversion:
def preprocess_audio(input_path: Path) -> Path:
"""
Preprocess audio file to 16kHz mono FLAC using ffmpeg.
FLAC provides lossless compression for faster upload times.
"""
if not input_path.exists():
raise FileNotFoundError(f"Input file not found: {input_path}")
with tempfile.NamedTemporaryFile(suffix='.flac', delete=False) as temp_file:
output_path = Path(temp_file.name)
print("Converting audio to 16kHz mono FLAC...")
try:
        subprocess.run([
'ffmpeg',
'-hide_banner',
'-loglevel', 'error',
'-i', input_path,
'-ar', '16000',
'-ac', '1',
'-c:a', 'flac',
'-y',
output_path
        ], check=True, stderr=subprocess.PIPE, text=True)  # capture stderr so the error message below is informative
return output_path
# We'll raise an error if our FFmpeg conversion fails
except subprocess.CalledProcessError as e:
output_path.unlink(missing_ok=True)
raise RuntimeError(f"FFmpeg conversion failed: {e.stderr}")
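As a quick, optional illustration of the size savings, you could compare file sizes before and after preprocessing. This is just a usage sketch; the path below is a placeholder you'd swap for your own file:
# Hypothetical usage sketch: compare file size before and after preprocessing
original = Path("my_long_recording.mp3")  # placeholder - swap in your own file
converted = preprocess_audio(original)

original_mb = original.stat().st_size / (1024 * 1024)
converted_mb = converted.stat().st_size / (1024 * 1024)
print(f"Original: {original_mb:.1f}MB -> 16kHz mono FLAC: {converted_mb:.1f}MB")

# preprocess_audio writes to a temporary file, so clean it up when you're done
converted.unlink(missing_ok=True)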
Step 4: Create Function for Transcribing a Single Chunk
Now that our audio is downsampled, we can create a dedicated worker function for transcribing individual audio chunks that will be called by our main transcription controller function in Step 8. Our transcribe_single_chunk function uses the Whisper Large V3 model via Groq API to transcribe one chunk at a time. Let's break down how our function handles each chunk:
- Uses Python's tempfile module for safe, automatic cleanup of temporary files
- Uses whisper-large-v3 via Groq API, specifying English as the language and verbose_json as the response format
- Times Groq API calls to monitor performance
- Provides detailed progress tracking (current chunk transcribed vs. total chunks)
- Maintains consistent error handling and resource cleanup
We highly recommend specifying language. Whisper analyzes the first 30 seconds of your audio to determine the language, which could result in Whisper choosing the wrong language, especially if your audio has background noise, music, or silence in that timeframe. Specifying language will also help speed up requests since Whisper can forego the audio analysis for determining language.
Tip: Setting response_format to verbose_json for Groq API transcription and translation endpoints provides timestamps for audio segments! It also provides avg_logprob, compression_ratio, and no_speech_prob! See our official docs for more info.
Once the single chunk is transcribed, the function returns a tuple of the transcription result and the processing time.
def transcribe_single_chunk(client: Groq, chunk: AudioSegment, chunk_num: int, total_chunks: int) -> tuple[dict, float]:
"""
Transcribe a single audio chunk with Groq API.
Args:
client: Groq client instance
chunk: Audio segment to transcribe
chunk_num: Current chunk number
total_chunks: Total number of chunks
Returns:
Tuple of (transcription result, processing time)
Raises:
Exception: If chunk transcription fails after retries
"""
total_api_time = 0
while True:
with tempfile.NamedTemporaryFile(suffix='.flac') as temp_file:
chunk.export(temp_file.name, format='flac')
start_time = time.time()
try:
result = client.audio.transcriptions.create(
file=("chunk.flac", temp_file, "audio/flac"),
model="whisper-large-v3",
language="en", # We highly recommend specifying the language of your audio if you know it
response_format="verbose_json"
)
api_time = time.time() - start_time
total_api_time += api_time
print(f"Chunk {chunk_num}/{total_chunks} processed in {api_time:.2f}s")
return result, total_api_time
except RateLimitError as e:
print(f"\nRate limit hit for chunk {chunk_num} - retrying in 60 seconds...")
time.sleep(60)
continue
except Exception as e:
print(f"Error transcribing chunk {chunk_num}: {str(e)}")
raise
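If you want to peek at the per-segment metadata mentioned in the tip above, here's a minimal sketch, assuming GROQ_API_KEY is set and "sample.flac" stands in for a short audio file of your own. It transcribes the first minute and prints a few verbose_json fields per segment:
# Minimal sketch (not part of the pipeline): inspect segment metadata from verbose_json.
# Assumes GROQ_API_KEY is set and "sample.flac" is a placeholder for your own audio file.
client = Groq(api_key=os.getenv("GROQ_API_KEY"))
audio = AudioSegment.from_file("sample.flac", format="flac")

result, api_time = transcribe_single_chunk(client, audio[:60 * 1000], 1, 1)
data = result.model_dump() if hasattr(result, 'model_dump') else result

for segment in data.get('segments', []) or []:
    print(f"[{segment['start']:.1f}s - {segment['end']:.1f}s] "
          f"avg_logprob={segment['avg_logprob']:.2f}, "
          f"no_speech_prob={segment['no_speech_prob']:.2f}: {segment['text']}")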
Step 5: Handle Chunk Overlaps in Audio Transcription
When dealing with chunked audio transcription, one of the biggest challenges is handling the transitions between chunks smoothly (this challenge is the basis of this entire tutorial, inspired by a conversation with one of the developers in our community, Jan Zheng - thank you for the insightful conversations!). This is because Whisper can sometimes cut words off mid-word at chunk boundaries, transcribe the same word slightly differently in adjacent chunks, and have varying accuracy at the beginning and end of chunks.
To handle these challenges, we'll explore two strategies for handling chunk overlaps:
- The Local Agreement strategy, or longest common prefix (LCP) approach for finding exact matches between chunks
- The longest common sequence algorithm with sliding window alignment for more robust matching
Initially, the LCP approach seemed promising since it can handle varying overlaps, but because of Whisper's quirks and the possibility of mid-word cutoffs, it proved too restrictive: it only looks for exact word matches between chunks. Through testing and feedback from one of my teammates (shoutout to Graden), we landed on the longest common sequence algorithm, which:
- Isn't restricted to just checking chunk boundaries
- Can handle both partial word and character-level matching
- Uses a weighted scoring system that combines number of matching words/characters, position-based weighting (via an epsilon value), and minimum threshold of 2 matches for reliability
- Is more fault-tolerant of Whisper's boundary transcription quirks
Let's look at a practical example of what we're dealing with and consider the following two chunks:
Chunk 1: "Hello my name ich"
Chunk 2: "mine name is Jonathan"
This is where our find_longest_common_sequence function comes in and:
- Tries different alignments by sliding the sequences
- For each position: Count matching elements, calculate a score ((matches/position) + tiny position-based weight), and require at least 2 matches to consider the alignment
- Find the best alignment (in this case, "name")
- Take the left half from Chunk 1 ("Hello my") and the right half from Chunk 2 ("name is Jonathan")
- Combine the sequences while handling variations like "ich/is" and "my/mine" into a clean final result ("Hello my name is Jonathan")
def find_longest_common_sequence(sequences: list[str], match_by_words: bool = True) -> str:
"""
Find the optimal alignment between sequences with longest common sequence and sliding window matching.
Args:
sequences: List of text sequences to align and merge
match_by_words: Whether to match by words (True) or characters (False)
Returns:
str: Merged sequence with optimal alignment
Raises:
RuntimeError: If there's a mismatch in sequence lengths during comparison
"""
if not sequences:
return ""
# Convert input based on matching strategy
if match_by_words:
        sequences = [
            [word for word in re.split(r'(\s+\w+)', seq) if word]
            for seq in sequences
        ]
else:
        sequences = [list(seq) for seq in sequences]
    left_sequence = sequences[0]
left_length = len(left_sequence)
    total_sequence = []
    for right_sequence in sequences[1:]:
max_matching = 0.0
right_length = len(right_sequence)
max_indices = (left_length, left_length, 0, 0)
# Try different alignments
for i in range(1, left_length + right_length + 1):
# Add epsilon to favor longer matches
eps = float(i) / 10000.0
left_start = max(0, left_length - i)
left_stop = min(left_length, left_length + right_length - i)
            left = left_sequence[left_start:left_stop]
right_start = max(0, i - left_length)
right_stop = min(right_length, i)
            right = right_sequence[right_start:right_stop]
if len(left) != len(right):
raise RuntimeError(
"Mismatched subsequences detected during transcript merging."
)
matches = sum(a == b for a, b in zip(left, right))
# Normalize matches by position and add epsilon
matching = matches / float(i) + eps
# Require at least 2 matches
if matches > 1 and matching > max_matching:
max_matching = matching
max_indices = (left_start, left_stop, right_start, right_stop)
# Use the best alignment found
left_start, left_stop, right_start, right_stop = max_indices
# Take left half from left sequence and right half from right sequence
left_mid = (left_stop + left_start) // 2
right_mid = (right_stop + right_start) // 2
        total_sequence.extend(left_sequence[:left_mid])
        left_sequence = right_sequence[right_mid:]
left_length = len(left_sequence)
# Add remaining sequence
total_sequence.extend(left_sequence)
# Join back into text
    # Word tokens keep their leading whitespace from re.split, so joining with '' works for both modes
    return ''.join(total_sequence)
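As a quick sanity check, you can try the function on two made-up overlapping strings and confirm that the duplicated span is collapsed:
# Quick sanity check with made-up overlapping chunks
merged = find_longest_common_sequence([
    "The quick brown fox jumps over the",
    "fox jumps over the lazy dog"
])
print(merged)  # -> The quick brown fox jumps over the lazy dog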
Step 6: Merge Audio Chunk Transcriptions
With our sequence alignment function ready, we can now implement the merge_transcripts function that will combine all our chunks into a single coherent transcript. merge_transcripts takes a list of chunk transcription results and processes them based on the available data:
- Processes both segment-level and word-level timestamps when available:
- Extracts and adjusts all word timestamps based on their chunk's starting position
- Preserves all word-level timing information regardless of segment presence
- Combines words from all chunks into a single coherent list
- For segment-level data, the function:
- Handles overlapping segments by merging them into a single segment with combined text
- Processes the boundaries between chunks using find_longest_common_sequence to create smooth transitions
- Maintains detailed segment metadata including temperature, avg_logprob, compression_ratio, and no_speech_prob
- Creates a comprehensive output that includes:
- The complete transcript text
- All merged and properly timed segments with their metadata
- Word-level timestamps when requested
The function works with timestamp granularities containing only segments, only words, or both!
def merge_transcripts(results: list[tuple[dict, int]]) -> dict:
"""
Merge transcription chunks and handle overlaps.
Works with responses from Groq API regardless of whether segments, words,
or both were requested via timestamp_granularities.
Args:
results: List of (result, start_time) tuples
Returns:
dict: Merged transcription
"""
print("\nMerging results...")
# First, check if we have segments in our results
has_segments = False
for chunk, _ in results:
data = chunk.model_dump() if hasattr(chunk, 'model_dump') else chunk
        if 'segments' in data and data['segments'] is not None and len(data['segments']) > 0:
has_segments = True
break
# Process word-level timestamps regardless of segment presence
has_words = False
    words = []
for chunk, chunk_start_ms in results:
# Convert Pydantic model to dict
data = chunk.model_dump() if hasattr(chunk, 'model_dump') else chunk
# Process word timestamps if available
        if isinstance(data, dict) and 'words' in data and data['words'] is not None and len(data['words']) > 0:
has_words = True
# Adjust word timestamps based on chunk start time
            chunk_words = data['words']
for word in chunk_words:
# Convert chunk_start_ms from milliseconds to seconds for word timestamp adjustment
                word['start'] = word['start'] + (chunk_start_ms / 1000)
                word['end'] = word['end'] + (chunk_start_ms / 1000)
words.extend(chunk_words)
elif hasattr(chunk, 'words') and getattr(chunk, 'words') is not None:
has_words = True
# Handle Pydantic model for words
chunk_words = getattr(chunk, 'words')
            processed_words = []
for word in chunk_words:
if hasattr(word, 'model_dump'):
word_dict = word.model_dump()
else:
# Create a dict from the word object
word_dict = {
'word': getattr(word, 'word', ''),
'start': getattr(word, 'start', 0) + (chunk_start_ms / 1000),
'end': getattr(word, 'end', 0) + (chunk_start_ms / 1000)
}
processed_words.append(word_dict)
words.extend(processed_words)
# If we don't have segments, just merge the full texts
if not has_segments:
print("No segments found in transcription results. Merging full texts only.")
        texts = []
for chunk, _ in results:
# Convert Pydantic model to dict
data = chunk.model_dump() if hasattr(chunk, 'model_dump') else chunk
# Get text - handle both dictionary and object access
if isinstance(data, dict):
text = data.get('text', '')
else:
# For Pydantic models or other objects
text = getattr(chunk, 'text', '')
texts.append(text)
merged_text = " ".join(texts)
result = {"text": merged_text}
# Include word-level timestamps if available
if has_words:
result "words"] = words
# Return an empty segments list since segments weren't requested
resulte"segments"] = ]
return result
# If we do have segments, proceed with the segment merging logic
print("Merging segments across chunks...")
    final_segments = []
    processed_chunks = []
for i, (chunk, chunk_start_ms) in enumerate(results):
data = chunk.model_dump() if hasattr(chunk, 'model_dump') else chunk
# Handle both dictionary and object access for segments
if isinstance(data, dict):
            segments = data.get('segments', [])
else:
            segments = getattr(chunk, 'segments', [])
# Convert segments to list of dicts if needed
if hasattr(segments, 'model_dump'):
segments = segments.model_dump()
elif not isinstance(segments, list):
            segments = []
# If not last chunk, find next chunk start time
if i < len(results) - 1:
            next_start = results[i + 1][1]  # This is in milliseconds
# Split segments into current and overlap based on next chunk's start time
            current_segments = []
            overlap_segments = []
for segment in segments:
# Handle both dict and object access for segment
if isinstance(segment, dict):
                    segment_end = segment['end']
else:
segment_end = getattr(segment, 'end', 0)
# Convert segment end time to ms and compare with next chunk start time
if segment_end * 1000 > next_start:
# Make sure segment is a dict
if not isinstance(segment, dict) and hasattr(segment, 'model_dump'):
segment = segment.model_dump()
elif not isinstance(segment, dict):
# Create a dict from the segment object
segment = {
'text': getattr(segment, 'text', ''),
'start': getattr(segment, 'start', 0),
'end': segment_end
}
overlap_segments.append(segment)
else:
# Make sure segment is a dict
if not isinstance(segment, dict) and hasattr(segment, 'model_dump'):
segment = segment.model_dump()
elif not isinstance(segment, dict):
# Create a dict from the segment object
segment = {
'text': getattr(segment, 'text', ''),
'start': getattr(segment, 'start', 0),
'end': segment_end
}
current_segments.append(segment)
# Merge overlap segments if any exist
if overlap_segments:
                merged_overlap = overlap_segments[0].copy()
merged_overlap.update({
'text': ' '.join(s.get('text', '') if isinstance(s, dict) else getattr(s, 'text', '')
for s in overlap_segments),
                    'end': overlap_segments[-1].get('end', 0) if isinstance(overlap_segments[-1], dict)
                           else getattr(overlap_segments[-1], 'end', 0)
})
current_segments.append(merged_overlap)
processed_chunks.append(current_segments)
else:
# For last chunk, ensure all segments are dicts
            dict_segments = []
for segment in segments:
if not isinstance(segment, dict) and hasattr(segment, 'model_dump'):
dict_segments.append(segment.model_dump())
elif not isinstance(segment, dict):
dict_segments.append({
'text': getattr(segment, 'text', ''),
'start': getattr(segment, 'start', 0),
'end': getattr(segment, 'end', 0)
})
else:
dict_segments.append(segment)
processed_chunks.append(dict_segments)
# Merge boundaries between chunks
for i in range(len(processed_chunks) - 1):
# Skip if either chunk has no segments
        if not processed_chunks[i] or not processed_chunks[i+1]:
continue
# Add all segments except last from current chunk
        if len(processed_chunks[i]) > 1:
            final_segments.extend(processed_chunks[i][:-1])
# Merge boundary segments
        last_segment = processed_chunks[i][-1]
        first_segment = processed_chunks[i+1][0]
        merged_text = find_longest_common_sequence([
last_segment.get('text', '') if isinstance(last_segment, dict) else getattr(last_segment, 'text', ''),
first_segment.get('text', '') if isinstance(first_segment, dict) else getattr(first_segment, 'text', '')
])
merged_segment = last_segment.copy() if isinstance(last_segment, dict) else {
'text': getattr(last_segment, 'text', ''),
'start': getattr(last_segment, 'start', 0),
'end': getattr(last_segment, 'end', 0)
}
merged_segment.update({
'text': merged_text,
'end': first_segment.get('end', 0) if isinstance(first_segment, dict) else getattr(first_segment, 'end', 0)
})
final_segments.append(merged_segment)
# Add all segments from last chunk
    if processed_chunks and processed_chunks[-1]:
        final_segments.extend(processed_chunks[-1])
# Create final transcription
final_text = ' '.join(
segment.get('text', '') if isinstance(segment, dict) else getattr(segment, 'text', '')
for segment in final_segments
)
# Create result with both segments and words (if available)
result = {
"text": final_text,
"segments": final_segments
}
# Include word-level timestamps if available
if has_words:
resultl"words"] = words
return result
Step 7: Save Transcription Outputs
Now let's implement our helper function that handles our transcription outputs. This save_results function creates a dedicated transcriptions directory to keep our outputs organized, uses timestamped filenames to prevent overwrites, and saves our results in multiple formats for different use cases: plain text, full JSON, and segment-level JSON for detailed timestamp information.
def save_results(result: dict, audio_path: Path) -> Path:
"""
Save transcription results to files.
Args:
result: Transcription result dictionary
audio_path: Original audio file path
Returns:
base_path: Base path where files were saved
Raises:
IOError: If saving results fails
"""
try:
output_dir = Path("transcriptions")
output_dir.mkdir(exist_ok=True)
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
base_path = output_dir / f"{Path(audio_path).stem}_{timestamp}"
# Save results in different formats
with open(f"{base_path}.txt", 'w', encoding='utf-8') as f:
            f.write(result["text"])
with open(f"{base_path}_full.json", 'w', encoding='utf-8') as f:
json.dump(result, f, indent=2, ensure_ascii=False)
with open(f"{base_path}_segments.json", 'w', encoding='utf-8') as f:
            json.dump(result["segments"], f, indent=2, ensure_ascii=False)
print(f"\nResults saved to transcriptions folder:")
print(f"- {base_path}.txt")
print(f"- {base_path}_full.json")
print(f"- {base_path}_segments.json")
return base_path
except IOError as e:
print(f"Error saving results: {str(e)}")
raise
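Once files are saved, they're easy to reload for downstream use. Here's a small sketch (the output filename below is hypothetical) that reads the full JSON back in and prints the first few segments:
# Hypothetical follow-up: reload a saved transcription and inspect its first segments
saved_path = Path("transcriptions/my_long_recording_20250101_120000_full.json")  # placeholder
with open(saved_path, encoding='utf-8') as f:
    transcript = json.load(f)

for segment in transcript.get("segments", [])[:3]:
    print(f"{segment.get('start', 0):.1f}s - {segment.get('end', 0):.1f}s: {segment.get('text', '')}")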
Step 8: Create Transcription Engine and Assemble the Pipeline
Now comes the fun part - bringing all the pieces we built together with transcribe_audio_in_chunks
, which is our orchestrator function that takes our audio file, splits it into chunks, coordinates the transcription process, combines the chunked transcription outputs, and saves our results! Think of this function as the conductor of our transcription orchestra that makes sure every function, or part, plays its role at the right time.
While Whisper was trained on 30-second segments and the recommended chunk size is 30-60 seconds, this can vary and longer chunks can actually provide better results when using Groq API. For this tutorial, we're using 600-second (10-minute) chunks with a 10-second overlap for an optimal balance of:
- Reduced calls to Groq API (fewer chunks)
- Better transcription accuracy (longer context)
- Reliable word boundary handling
- Staying safely within the current 25MB per-request limit for Groq API transcriptions and translations
Why an Overlap? Overlapping chunks prevents our model from losing context at chunk boundaries and cutting words in half. Without an overlap, we might split right in the middle of a word or sentence, which would cause missing content, increased hallucinations, and transcription errors. By overlapping (typically 5-10 seconds), we ensure that words and context spanning chunk boundaries are captured completely.
Understanding Chunk Overlap and Overhead
When we use overlapping chunks, we're actually processing some audio multiple times. For example, with our settings for this tutorial, each 600-second chunk has 10 seconds of overlap at the start and end. This means we're processing 620 seconds (600 + 10 + 10) for each 600-second chunk, which creates a 3.33% overhead (20 extra seconds). This is much more efficient than shorter chunks. For example, 60-second chunks with 10-second overlaps would have a 33.3% overhead!
Overhead matters because more overhead means more API calls, higher costs, and more potential for transcription errors at boundaries. More processing time is also a factor, but since we're using Groq API, the impact there is minimal.
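Using the same simple accounting as above (extra overlapped seconds divided by chunk length), a quick sketch makes it easy to compare a few settings; the helper name here is just for illustration:
# Rough overhead comparison using the same accounting as above:
# extra audio processed per chunk (overlap on both sides) divided by chunk length
def chunk_overhead_percent(chunk_length: int, overlap: int) -> float:
    return (2 * overlap) / chunk_length * 100

for chunk_length, overlap in [(600, 10), (300, 10), (60, 10)]:
    print(f"{chunk_length}s chunks with {overlap}s overlap: "
          f"~{chunk_overhead_percent(chunk_length, overlap):.1f}% overhead")
# -> ~3.3%, ~6.7%, ~33.3%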
You may have to pretend to be Goldilocks and find overlapping chunks that are just right for your typical use case. While too small of a chunk size could result in the model missing important context for transcription, too large of a chunk size could lead to potential degradation in accuracy. For a rapid-fire podcast conversation or interview, shorter chunks might work better. For a slow-paced lecture or meeting, longer chunks could be your answer. You need to find the chunk size that's just right!
def transcribe_audio_in_chunks(audio_path: Path, chunk_length: int = 600, overlap: int = 10) -> dict:
"""
Transcribe audio in chunks with overlap with Whisper via Groq API.
Args:
audio_path: Path to audio file
chunk_length: Length of each chunk in seconds
overlap: Overlap between chunks in seconds
Returns:
dict: Containing transcription results
Raises:
ValueError: If Groq API key is not set
RuntimeError: If audio file fails to load
"""
api_key = os.getenv("GROQ_API_KEY")
if not api_key:
raise ValueError("GROQ_API_KEY environment variable not set")
print(f"\nStarting transcription of: {audio_path}")
# Make sure your Groq API key is configured. If you don't have one, you can get one at https://console.groq.com/keys!
client = Groq(api_key=api_key, max_retries=0)
processed_path = None
try:
# Preprocess audio and get basic info
processed_path = preprocess_audio(audio_path)
try:
audio = AudioSegment.from_file(processed_path, format="flac")
except Exception as e:
raise RuntimeError(f"Failed to load audio: {str(e)}")
duration = len(audio)
print(f"Audio duration: {duration/1000:.2f}s")
# Calculate # of chunks
chunk_ms = chunk_length * 1000
overlap_ms = overlap * 1000
total_chunks = (duration // (chunk_ms - overlap_ms)) + 1
print(f"Processing {total_chunks} chunks...")
        results = []
total_transcription_time = 0
# Loop through each chunk, extract current chunk from audio, transcribe
for i in range(total_chunks):
start = i * (chunk_ms - overlap_ms)
end = min(start + chunk_ms, duration)
print(f"\nProcessing chunk {i+1}/{total_chunks}")
print(f"Time range: {start/1000:.1f}s - {end/1000:.1f}s")
            chunk = audio[start:end]
result, chunk_time = transcribe_single_chunk(client, chunk, i+1, total_chunks)
total_transcription_time += chunk_time
results.append((result, start))
final_result = merge_transcripts(results)
save_results(final_result, audio_path)
print(f"\nTotal Groq API transcription time: {total_transcription_time:.2f}s")
return final_result
# Clean up temp files regardless of successful creation
finally:
if processed_path:
Path(processed_path).unlink(missing_ok=True)
Step 9: Run the Pipeline!
After quite the adventure where we've learned about audio chunking, it's time to put our transcription orchestra into action and see how it performs with real audio files. Replace the "path_to_your_audio" below with the path to a long audio file of your choice. Groq API supports flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, and webm audio file formats, but since we are converting to FLAC before sending our request to Groq, you can process any format that FFmpeg can handle.
When you run this code, you'll get several types of output that help you track the transcription process:
- Progress Updates: The code provides real-time feedback about which chunk it's processing as well as the time ranges.
- Transcription Results: The code creates a directory called transcriptions that will have three different output files.
if __name__ == "__main__":
transcribe_audio_in_chunks(Path("path_to_your_audio"))
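If the defaults don't fit your audio, the same function accepts different chunk and overlap sizes; for example, shorter chunks with a smaller overlap for rapid-fire conversation:
# Example: shorter chunks and a smaller overlap for fast-paced audio
if __name__ == "__main__":
    transcribe_audio_in_chunks(Path("path_to_your_audio"), chunk_length=300, overlap=5)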
Conclusion
This wraps up our journey through audio chunking and transcription with the lightning-fast Groq API! Once you do get the transcriptions, make sure to review them and remember that audio chunking and transcribing is both an art and a science! While our pipeline above handles the science part well, you might need to adjust the art part (the chunks and overlaps) based on your specific audio. Don't be afraid to experiment further on your own with different parameters.
The following sections are optional learnings for debugging methods and considerations for production. If you enjoyed this tutorial and have other topics you'd like to learn more about, request one from me on X. Happy building!