Data Storage Architecture

Overview

  • Architecture: A dual-layer system that separates long-term storage from application performance.
  • Primary Store (The Vault): S3/Garage stores the raw media and expensive AI results (permanent).
  • Secondary Store (The Index): MongoDB and Solr store metadata and search indices (fast & flexible).
  • Strategy: Ensures costly data is safe while allowing the application index to be rebuilt ("re-indexed") at any time.

Architecture

The system uses a Dual-Layer storage strategy to balance data safety with application performance.

Primary Storage (The "Vault")

  • Technology: S3-compatible object store (via Garage)
  • Purpose: Long-term preservation of costly data.
  • Content: Raw media files (.mp4, .wav) and the expensive AI outputs (WhisperX JSONs, SRTs).
  • Philosophy: This layer is treated as the "Single Source of Truth." Since generating transcripts takes significant GPU time and money, this data is stored permanently and safely here.
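
The "store once, never recompute" idea can be sketched as a lookup that only runs the expensive transcription when no result exists yet. This is a minimal sketch, not the project's actual code: the vault is modelled as a plain dictionary standing in for an S3 bucket, and `transcript_key`, `get_or_create_transcript`, and the `results/` key layout are hypothetical names chosen for illustration.

```python
import json
from typing import Callable, MutableMapping


def transcript_key(media_key: str) -> str:
    """Hypothetical naming convention: 'media/talk.mp4' -> 'results/talk.whisperx.json'."""
    stem = media_key.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    return f"results/{stem}.whisperx.json"


def get_or_create_transcript(
    vault: MutableMapping[str, bytes],
    media_key: str,
    transcribe: Callable[[bytes], dict],
) -> dict:
    """Return the stored transcript, running the costly `transcribe` step at most once."""
    key = transcript_key(media_key)
    if key in vault:
        # The result already lives in the primary store: no GPU time is spent.
        return json.loads(vault[key])
    result = transcribe(vault[media_key])
    # Persist the costly output next to the raw media before returning it.
    vault[key] = json.dumps(result).encode("utf-8")
    return result
```

With a real object store, the dictionary lookups would become existence checks and uploads against the bucket; the control flow stays the same.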

Secondary Storage (The "Index")

  • Technology: MongoDB & Apache Solr
  • Purpose: Fast access and flexibility for the User Interface.
  • Content:
    • MongoDB: Tracks processing status, user edits, and metadata corrections.
    • Solr: Provides sub-second full-text search across millions of speaker statements.
  • Philosophy: This layer is "derived." It can be wiped and rebuilt (re-indexed) at any time using the data from the Primary Storage. This allows you to change your database schema or search logic without ever losing the original AI transcriptions.
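
A re-index pass over the Primary Storage might look roughly like the sketch below. It is an illustration under stated assumptions rather than the actual implementation: the vault is a plain dictionary standing in for the bucket, the returned structures stand in for a MongoDB collection and a batch of Solr documents, the function name `reindex` is hypothetical, and the transcript JSON is assumed to contain a `segments` list with `start`, `text`, and (when diarization is enabled) `speaker` fields.

```python
import json
from typing import Mapping


def reindex(vault: Mapping[str, bytes]) -> tuple[dict[str, dict], list[dict]]:
    """Rebuild the derived layer from scratch using only data in the vault."""
    metadata: dict[str, dict] = {}   # stand-in for the MongoDB collection
    search_docs: list[dict] = []     # stand-in for documents sent to Solr
    for key in sorted(vault):
        if not key.endswith(".json"):
            continue  # raw media is streamed, not indexed
        transcript = json.loads(vault[key])
        segments = transcript.get("segments", [])
        metadata[key] = {"status": "indexed", "segment_count": len(segments)}
        for i, seg in enumerate(segments):
            # One search document per speaker statement.
            search_docs.append({
                "id": f"{key}#{i}",
                "source": key,
                "speaker": seg.get("speaker", "UNKNOWN"),
                "start": seg.get("start"),
                "text": seg.get("text", "").strip(),
            })
    return metadata, search_docs
```

Because the function reads only from the vault and keeps no state, running it twice yields the same index, which is what makes wiping the secondary layer safe. Anything written only to the index (for example, user edits) would have to be exported to the Primary Storage for a rebuild to restore it; this sketch does not cover that.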

Data Flow Diagram

This diagram illustrates how the costly data is safely isolated, while the secondary storage acts as a flexible cache for the application.

graph TD
    subgraph Layer1 ["Layer 1: Primary Storage (Permanent & Costly)"]
        S3[<b>Garage S3</b><br>Raw Media + AI Results]
    end

    subgraph Layer2 ["Layer 2: Secondary Storage (Fast & Flexible)"]
        Mongo[<b>MongoDB</b><br>Metadata & User Edits]
        Solr[<b>Solr</b><br>Search Index]
    end

    subgraph App ["User Interface"]
        UI[Dashboard & Player]
    end

    %% --- Relationships ---

    %% The Re-index Flow
    S3 ==>|Re-Index / Sync| Mongo
    S3 ==>|Re-Index / Sync| Solr

    %% App Usage
    Mongo <-->|Reads/Writes Metadata| UI
    Solr -.->|Search Queries| UI

    %% Streaming
    S3 -.->|Streams Video| UI

    %% --- Styling ---
    classDef primary fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px;
    classDef secondary fill:#fff3e0,stroke:#e65100,stroke-width:2px,stroke-dasharray: 5 5;
    classDef app fill:#e3f2fd,stroke:#1565c0,stroke-width:2px;

    class S3 primary;
    class Mongo,Solr secondary;
    class UI app;

Strategy

Production Advice: Choose Stable S3

In a production environment, the Primary Storage is your most critical asset. If you lose this data, you lose the expensive compute results.

  • Recommendation: Replace the local Garage container with a managed S3 service (e.g., AWS S3) or a clustered S3-compatible deployment (e.g., a MinIO cluster), or ensure your hosting volume has rigorous daily backups.
  • Contrast: The Secondary Storage (MongoDB/Solr) is less critical. It can be treated as "part of the app": if it fails, you can spin up empty databases and re-index everything from the Primary Storage.
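
If you rely on backups rather than a managed service, it helps to check periodically that the backup actually matches the Primary Storage. The following is a minimal sketch of such a check, assuming both sides can be read as key-to-bytes mappings; `checksum` and `verify_backup` are hypothetical helper names, and a real deployment would more likely compare stored checksums than re-read every object.

```python
import hashlib
from typing import Mapping


def checksum(data: bytes) -> str:
    """SHA-256 hex digest of an object's content."""
    return hashlib.sha256(data).hexdigest()


def verify_backup(
    primary: Mapping[str, bytes],
    backup: Mapping[str, bytes],
) -> dict[str, list[str]]:
    """Report objects that are absent from the backup or differ in content."""
    missing = sorted(k for k in primary if k not in backup)
    corrupted = sorted(
        k for k in primary
        if k in backup and checksum(primary[k]) != checksum(backup[k])
    )
    return {"missing": missing, "corrupted": corrupted}
```

An empty report for both lists indicates that every object in the Primary Storage has an identical copy in the backup.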