WhisperX Processing

WhisperX provides AI-powered speech-to-text with speaker identification (diarization), handled externally via a Hugging Face Space component.

Overview

  • What: AI transcription & speaker identification (diarization).
  • Where: External processing via Hugging Face Space.
  • Requires: Hugging Face Account + API Token.

Integration Architecture

The application offloads heavy AI processing to an external Hugging Face Space. This allows for high-quality transcription without requiring powerful local GPUs.

graph LR
    %% Nodes
    subgraph App ["Application Backend"]
        Worker["Transcription Worker"]
    end

    subgraph External ["Hugging Face Space"]
        HF["WhisperX Component"]
    end

    subgraph Storage ["S3 Storage"]
        S3["Results Bucket"]
    end

    %% Flow
    Worker -->|"1. Send Audio + Token"| HF
    HF -->|"2. AI Processing"| HF
    HF -->|"3. Return JSON/SRT"| Worker
    Worker -->|"4. Save Results"| S3

    %% Styling
    classDef app fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
    classDef ext fill:#f3e5f5,stroke:#6a1b9a,stroke-width:2px
    classDef store fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px

    class Worker app
    class HF ext
    class S3 store
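
Assuming the Space exposes a Gradio API, the worker's call might look like the sketch below, using the gradio_client library. The endpoint name (api_name) and the argument order are placeholders, not the Space's confirmed signature; check the Space's "Use via API" page. The environment variables come from the configuration described below.

# Minimal sketch of the worker-to-Space call, assuming a Gradio API.
# api_name and the argument order are placeholders -- check the Space's
# "Use via API" page for the real endpoint signature.
import os
from gradio_client import Client, handle_file

client = Client(
    os.environ["HF_SPACE_URL"],        # Space URL from config/.env
    hf_token=os.environ["HF_TOKEN"],   # token from config/.env.secret
)

result = client.predict(
    handle_file("meeting.wav"),                 # audio file to process
    os.environ.get("HF_MODEL", "large-v3"),     # Whisper model name
    api_name="/predict",                        # assumed endpoint name
)
print(result)  # JSON/SRT results to be saved to S3 by the worker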

Capabilities

WhisperX combines two AI models to provide a complete transcript:

  • Speaker Detection (pyannote.audio): identifies speaker turns.
  • Transcription (OpenAI Whisper): converts speech to text in many languages, with both transcription and translation modes.
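
To illustrate how the two outputs are combined, here is a hedged sketch (not WhisperX's actual code): each transcript segment is assigned the speaker whose diarization turn overlaps it most.

# Illustrative sketch (not WhisperX's actual implementation): assign each
# transcript segment the speaker whose diarization turn overlaps it most.

def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(segments, turns):
    """segments: [(start, end, text)]; turns: [(start, end, speaker)]."""
    labeled = []
    for seg_start, seg_end, text in segments:
        best = max(turns, key=lambda t: overlap(seg_start, seg_end, t[0], t[1]))
        if overlap(seg_start, seg_end, best[0], best[1]) > 0:
            labeled.append((best[2], text))
        else:
            labeled.append(("UNKNOWN", text))
    return labeled

# Example: one Whisper segment, two pyannote speaker turns.
segments = [(0.0, 2.5, "Hello world")]
turns = [(0.0, 1.0, "SPEAKER_00"), (1.0, 3.0, "SPEAKER_01")]
print(assign_speakers(segments, turns))  # [('SPEAKER_01', 'Hello world')]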

Output Format

The result is a standard .srt subtitle file containing both the text and the speaker labels (e.g., [SPEAKER_01]: Hello world).
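
Assuming each cue's text starts with a "[SPEAKER_XX]:" prefix as in the example above, a short parser sketch for this output could look like:

# Sketch of parsing the speaker-labeled SRT output. Assumes each cue's
# text begins with a "[SPEAKER_XX]:" prefix as shown above.
import re

CUE = re.compile(
    r"(\d+)\s*\n"                                                 # cue index
    r"(\d\d:\d\d:\d\d,\d\d\d) --> (\d\d:\d\d:\d\d,\d\d\d)\s*\n"   # timestamps
    r"\[(SPEAKER_\d+)\]:\s*(.+?)(?:\n\n|\Z)",                     # speaker + text
    re.S,
)

def parse_srt(srt_text):
    for _, start, end, speaker, text in CUE.findall(srt_text):
        yield {"start": start, "end": end, "speaker": speaker, "text": text.strip()}

sample = """1
00:00:00,000 --> 00:00:02,500
[SPEAKER_01]: Hello world
"""
print(list(parse_srt(sample)))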

Configuration

To enable this feature, you must configure your credentials.

1. Accept Model Terms

The diarization pipeline relies on gated pyannote.audio models. Visit their model pages on Hugging Face (the speaker diarization and segmentation models used by the Space) and accept the terms of use, or the API will fail.

2. Get API Token

  1. Go to Hugging Face Settings.
  2. Create a New Token with Read permissions.

3. Update Environment

Add your token to config/.env.secret:

config/.env.secret
HF_TOKEN=hf_your_token_here

The .env file specifies the Space and the model:

config/.env
# Recommended Models: large-v2 or large-v3
HF_MODEL=large-v3
HF_SPACE_URL=https://katospiegel-odtp-pyannote-whisper.hf.space/
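
At runtime, a worker might load these two files with python-dotenv (an assumption; the application may use another mechanism):

# Sketch: read the WhisperX settings from both env files.
# Assumes python-dotenv is installed (pip install python-dotenv).
import os
from dotenv import load_dotenv

load_dotenv("config/.env")         # HF_MODEL, HF_SPACE_URL
load_dotenv("config/.env.secret")  # HF_TOKEN (keep out of version control)

HF_TOKEN = os.environ["HF_TOKEN"]
HF_MODEL = os.environ.get("HF_MODEL", "large-v3")
HF_SPACE_URL = os.environ["HF_SPACE_URL"]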

Model Selection

Use only large-v2 or large-v3. Smaller models produce lower-quality output, which can lead to reindex errors.
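
A hypothetical guard (not part of the application) that rejects unsupported model names early could look like this:

# Hypothetical startup check rejecting unsupported Whisper models.
SUPPORTED_MODELS = {"large-v2", "large-v3"}

def validate_model(name: str) -> str:
    if name not in SUPPORTED_MODELS:
        raise ValueError(
            f"HF_MODEL={name!r} is unsupported; use one of "
            f"{sorted(SUPPORTED_MODELS)} to avoid low-quality output."
        )
    return name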

Advanced Usage

While the application normally runs WhisperX processing via the API, you can also import files that were previously processed through the WhisperX UI.

Manual Upload

You can run WhisperX manually via the Hugging Face UI and then import the resulting files using the Commandline. This is not recommended, as it involves some manual renaming of your files; see Commandline for details.

WhisperX on Hugging Face

You can access WhisperX directly on Hugging Face. To switch processing from CPU to GPU, clone the Space to your own account and pay for a hardware tier with faster processing.

Screenshots: WhisperX interactive interface, WhisperX API interface, and WhisperX SRT output.