# Getting Started

This tutorial walks through the core workflows for the `mava-exchange` library: writing a `.mediapkg` file from DataFrames, reading one back, validating it, and inspecting it from the command line.

## Installation

```bash
pip install mava-exchange
# or with uv:
uv add mava-exchange
```

## Concepts

A `.mediapkg` file is a ZIP archive containing annotation data for one or more videos. Each video has one or more **tracks** — Parquet files containing the actual data.

There are three kinds of tracks:

- **ObservationSeries** — a dense time-series of numeric values sampled at regular intervals. Each row is one point in time with one or more numeric dimensions. Use this for ML model outputs like emotion scores, audio volume, or any score sampled at a fixed rate.
- **AnnotationSeries** — sparse interval annotations. Each row covers a time span (`start_seconds` → `end_seconds`) with a string value. Use this for transcripts, shot boundaries, or any labeled segment.
- **AnnotationListSeries** — sparse interval annotations with multiple labels per segment. Each row covers a time span with a list of string values. Use this for multi-label classifications, keyword tags, or any annotation where multiple values apply simultaneously.

---

## 1. Writing a `.mediapkg`

### 1.1 Define your tracks

First describe what your data means using `ObservationSeries`, `AnnotationSeries`, or `AnnotationListSeries`. This is the semantic layer — it tells consumers what each column measures.

```python
from mava_exchange import ObservationSeries, AnnotationSeries, AnnotationListSeries, DimensionSpec

# A time-series track: one numeric value per dimension per timestep
emotion_track = ObservationSeries(
    name="emotions",
    description="Face emotion probability scores from DeepFace model",
    sampling_interval=0.5,  # seconds between samples
    dimensions=[
        DimensionSpec("angry", "Anger probability", "[0,1]"),
        DimensionSpec("happy", "Happiness probability", "[0,1]"),
        DimensionSpec("neutral", "Neutral expression", "[0,1]"),
    ]
)

# An interval annotation track: start, end, and a string label per row
transcript_track = AnnotationSeries(
    name="transcript",
    description="Speech-to-text segments from Whisper",
)

# A multi-label annotation track: start, end, and a list of labels per row
scene_tags_track = AnnotationListSeries(
    name="scene_tags",
    description="Scene classification tags from Places3 model",
)
```

You can define any dimensions you need — the library is not tied to emotion scores. For example, a different tool might declare:

```python
explosion_track = ObservationSeries(
    name="explosion_detection",
    description="Explosion probability from audio model, sampled every 0.1s",
    sampling_interval=0.1,
    dimensions=[
        DimensionSpec("explosion", "Explosion probability", "[0,1]"),
    ]
)
```
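The dimension names declared here become required DataFrame columns in the next step. If you want to derive that column list programmatically rather than retyping it, a minimal sketch (assuming the track object exposes the `dimensions` attribute it was constructed with, mirroring the reader-side usage shown in section 2):

```python
# Sketch: derive the expected DataFrame columns from a track definition.
# Assumption: the ObservationSeries object exposes `dimensions`, and each
# DimensionSpec exposes `name`, as in the reading example in section 2.
expected_columns = ["start_seconds"] + [d.name for d in emotion_track.dimensions]
print(expected_columns)  # ['start_seconds', 'angry', 'happy', 'neutral']
```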
### 1.2 Prepare your DataFrames

Each track expects a DataFrame with the columns declared in its definition.

For an **ObservationSeries**, the required columns are `start_seconds` plus one column per dimension:

```python
import pandas as pd
import numpy as np

n = 100
emotions_df = pd.DataFrame({
    "start_seconds": np.arange(n) * 0.5,
    "angry": np.random.uniform(0, 0.3, n),
    "happy": np.random.uniform(0, 0.8, n),
    "neutral": np.random.uniform(0, 0.5, n),
})
```

For an **AnnotationSeries**, the required columns are `start_seconds`, `end_seconds`, and `annotations`:

```python
transcript_df = pd.DataFrame({
    "start_seconds": [0.0, 12.5, 30.1],
    "end_seconds": [12.3, 29.8, 45.0],
    "annotations": [
        "Welcome to the conference.",
        "Today we discuss video annotation.",
        "Thank you for joining us.",
    ],
})
```

For an **AnnotationListSeries**, the required columns are `start_seconds`, `end_seconds`, and `annotations` — but `annotations` contains lists of strings:

```python
scene_tags_df = pd.DataFrame({
    "start_seconds": [0.0, 45.2, 78.5],
    "end_seconds": [45.2, 78.5, 120.0],
    "annotations": [
        ["outdoor", "natural"],
        ["indoor"],
        ["outdoor", "man-made"],
    ],
})
```

### 1.3 Write the package

Use `MediaPackageWriter` as a context manager. Call `add_video()` first, then `add_track()` for each track. The file is written when the `with` block exits.

```python
from mava_exchange import MediaPackageWriter

with MediaPackageWriter("corpus.mediapkg", description="My annotation corpus") as writer:
    writer.add_video(
        video_id="video_001",
        src="https://example.org/videos/talk.mp4",
    )
    writer.add_track("video_001", emotion_track, emotions_df)
    writer.add_track("video_001", transcript_track, transcript_df)
```

### 1.4 Multiple videos

Add as many videos as you need before the `with` block exits. Videos can have different track sets — a track name shared across videos must have an identical definition:

```python
rms_track = ObservationSeries(
    name="rms_volume",
    description="RMS audio volume",
    sampling_interval=0.064,
    dimensions=[DimensionSpec("rms", "Root mean square audio volume", ">=0")]
)

rms_df = pd.DataFrame({
    "start_seconds": np.arange(200) * 0.064,
    "rms": np.abs(np.random.normal(0.1, 0.02, 200)),
})

with MediaPackageWriter("corpus.mediapkg", description="Two-video corpus") as writer:
    # video_001: emotions + transcript
    writer.add_video("video_001", "https://example.org/videos/talk_001.mp4")
    writer.add_track("video_001", emotion_track, emotions_df)
    writer.add_track("video_001", transcript_track, transcript_df)

    # video_002: rms volume + transcript (different track set)
    writer.add_video("video_002", "https://example.org/videos/talk_002.mp4")
    writer.add_track("video_002", rms_track, rms_df)
    writer.add_track("video_002", transcript_track, transcript_df)
```

---

## 2. Reading a `.mediapkg`

Use `MediaPackageReader` to read a package. Use it as a context manager to ensure the file is closed properly.

```python
from mava_exchange import MediaPackageReader

with MediaPackageReader("corpus.mediapkg") as reader:
    # What's in this package?
    print(reader.video_ids)    # ["video_001", "video_002"]
    print(reader.track_names)  # ["emotions", "transcript", "rms_volume"]

    # Which tracks does a specific video have?
    print(reader.tracks_for_video("video_001"))  # ["emotions", "transcript"]
    print(reader.tracks_for_video("video_002"))  # ["rms_volume", "transcript"]

    # Read a track into a DataFrame
    df = reader.read_track("video_001", "emotions")
    print(df.head())
    #    start_seconds    angry    happy  neutral
    # 0            0.0  0.12451  0.64231  0.23318
    # 1            0.5  0.08734  0.71204  0.20062

    # Read all tracks for a video at once
    tracks = reader.read_video("video_001")
    # tracks == {"emotions": df, "transcript": df}

    # Get track definition (reconstructed as a typed object)
    track = reader.track_def("emotions")
    print(track.sampling_interval)             # 0.5
    print([d.name for d in track.dimensions])  # ["angry", "happy", "neutral"]

    # Get video metadata
    meta = reader.video_meta("video_001")
    print(meta["src"])  # "https://example.org/videos/talk_001.mp4"
```

### Quick file stats without loading data

```python
with MediaPackageReader("corpus.mediapkg") as reader:
    for stat in reader.file_stats():
        ratio = (1 - stat["compressed_bytes"] / stat["size_bytes"]) * 100
        print(f"{stat['path']:<40} {stat['rows']:>6} rows  {ratio:.0f}% compressed")
```
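The calls above compose naturally if you want to pull an entire package into memory at once. A minimal sketch using only `video_ids` and `read_video`:

```python
# Sketch: load every track of every video into a nested dict,
# {video_id: {track_name: DataFrame}}, using the reader API shown above.
with MediaPackageReader("corpus.mediapkg") as reader:
    corpus = {video_id: reader.read_video(video_id) for video_id in reader.video_ids}

print(corpus["video_002"]["rms_volume"].head())
```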
---

## 3. Validating a `.mediapkg`

### From Python

```python
from mava_exchange.validate import validate_mediapkg

result = validate_mediapkg("corpus.mediapkg")
if result.valid:
    print("Package is valid.")
else:
    print(result.summary())
```

The validator checks:

- manifest structure and required fields
- every file referenced in the manifest exists in the archive
- every referenced track is defined
- `start_seconds` is non-null, non-negative, and ordered
- `end_seconds > start_seconds` for all `AnnotationSeries` rows
- dimension columns are numeric and non-null for `ObservationSeries`

Pass `strict=True` to also warn about recommended but optional fields:

```python
result = validate_mediapkg("corpus.mediapkg", strict=True)
print(result.summary())
```

### From the command line

```bash
mediapkg-validate corpus.mediapkg
mediapkg-validate corpus.mediapkg --strict
```

Exit code is `0` for valid and `1` for invalid — works in CI pipelines:

```bash
mediapkg-validate corpus.mediapkg || exit 1
```

---

## 4. Inspecting from the CLI

The `mediapkg-inspect` command gives a human-readable summary without writing any code.

**Corpus overview:**

```bash
mediapkg-inspect corpus.mediapkg
```

```
════════════════════════════════════════════════════════════
corpus.mediapkg
════════════════════════════════════════════════════════════
Version:     0.1
Created:     2025-08-12T10:00:00+00:00
Ontology:    http://example.org/mava/ontology#
Description: Two-video corpus
Videos:      2

Tracks:
  emotions     mava:ObservationSeries  @0.5s    [angry, happy, neutral]
  transcript   mava:AnnotationSeries
  rms_volume   mava:ObservationSeries  @0.064s  [rms]

Videos:
  video_001
    src:    https://example.org/videos/talk_001.mp4
    tracks: emotions, transcript
  video_002
    src:    https://example.org/videos/talk_002.mp4
    tracks: rms_volume, transcript

Files:
  Path                                           Rows     Raw  Compressed  Saved
  --------------------------------------------  ------  ------  ----------  -----
  video_001/emotions.parquet                       100   8.2KB       3.1KB    62%
  video_001/transcript.parquet                       3   2.1KB       1.4KB    33%
  video_002/rms_volume.parquet                     200   6.4KB       2.8KB    56%
  video_002/transcript.parquet                       3   2.1KB       1.4KB    33%
```

**Drill into a specific track:**

```bash
mediapkg-inspect corpus.mediapkg --track emotions --video video_001 --head 3
```

```
Track: emotions (mava:ObservationSeries)
Video: video_001
Desc:  Face emotion probability scores from DeepFace model
Rows:  100

Columns:
  start_seconds  double[pyarrow]
  angry          double[pyarrow]
  happy          double[pyarrow]
  neutral        double[pyarrow]

First 3 rows:
  start_seconds    angry    happy  neutral
            0.0  0.12451  0.64231  0.23318
            0.5  0.08734  0.71204  0.20062
            1.0  0.21003  0.55891  0.23106

Dimensions:
  angry    Anger probability      [0,1]
  happy    Happiness probability  [0,1]
  neutral  Neutral expression     [0,1]
```

---

## 5. The `.mediapkg` format at a glance

A `.mediapkg` is a ZIP archive. You can always unzip it manually to inspect:

```bash
unzip -l corpus.mediapkg
# or
unzip corpus.mediapkg -d corpus_contents/
cat corpus_contents/manifest.json
```

The `manifest.json` is human-readable JSON containing all metadata, the JSON-LD context mapping column names to the MAVA ontology, and the file inventory. See `spec/SPEC.md` for the full format specification.
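Because the container is plain ZIP, the manifest can also be read with nothing but the standard library. A sketch, assuming `manifest.json` sits at the archive root as the unzip example above suggests:

```python
import json
import zipfile

# Read the manifest straight out of the archive without unpacking it.
# Assumption: manifest.json lives at the archive root, as the unzip listing suggests.
with zipfile.ZipFile("corpus.mediapkg") as zf:
    manifest = json.loads(zf.read("manifest.json"))

print(manifest.keys())
```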
---

## Next steps

- See `examples/tsv_to_mediapkg.py` for a complete example converting real TSV annotation files from two different tools into a corpus package.
- See `spec/SPEC.md` for the full format specification.
- See `spec/mava.ttl` for the MAVA ontology and SHACL validation shapes.