MediaPkg Format Specification

Version: 0.1 (Draft)
Repository: https://github.com/smaennel/mava-exchange
Ontology: http://example.org/mava/ontology#
License: Apache 2.0

Status

This is an early draft specification. It is being developed in the context of the TIB-AV-A and videoscope use cases at ETH Zurich / Swiss Data Science Center. Feedback is welcome via GitHub issues.

The key words MUST, MUST NOT, REQUIRED, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in RFC 2119.

1. Introduction

MediaPkg (.mediapkg) is a lightweight, compressed interchange format for time-based annotations on video files. It is designed for efficient exchange between video annotation and processing tools without the overhead of full RDF serialisation.

A .mediapkg file is a ZIP archive containing:

  • A manifest.json describing the corpus, its videos, annotation tracks, and the semantic meaning of all columns

  • One or more Parquet files — one per annotation track per video — containing the actual annotation data

The format is backed by the MAVA ontology, which provides a shared vocabulary for annotation tools in linguistics and multimodal analysis.


2. Motivation

Tools in the video annotation space produce a variety of outputs — emotion scores, shot boundaries, transcripts, scene labels, and others. These outputs are currently exchanged as plain TSV or CSV files with no shared vocabulary, making interoperability between tools difficult.

A previous approach exported all data to RDF/Turtle. While semantically rich, this proved too slow for large videos — serialisation of a single video’s annotations could take ten minutes or more.

MediaPkg addresses this by separating two concerns:

  • Data transport — handled by Parquet inside ZIP, which is fast, compressed, and columnar

  • Semantic meaning — handled by the MAVA ontology and a JSON-LD @context in the manifest

This mirrors the approach taken by GeoParquet in the geospatial domain: efficient binary data with attached semantic metadata.


3. Relation to Existing Standards

GeoParquet

MediaPkg is directly inspired by GeoParquet. GeoParquet embeds metadata inside Parquet files under a geo key to describe geometry columns. MediaPkg adopts the same idea of attaching declarative metadata to columnar data, but places it in a manifest.json at the root of the archive and extends it with a JSON-LD context for full ontology linkage.

| Concept           | GeoParquet                              | MediaPkg                             |
|-------------------|-----------------------------------------|--------------------------------------|
| Container         | .parquet file                           | .mediapkg ZIP archive                |
| Metadata location | Embedded in Parquet key_value_metadata  | manifest.json in ZIP                 |
| Semantic layer    | None (operational only)                 | JSON-LD @context → MAVA ontology     |
| Coordinate system | WGS84 (spatial)                         | Seconds from video start (temporal)  |
| Multi-file corpus | Not addressed                           | Multiple videos in one archive       |

ELAN / EAF

ELAN (EUDICO Linguistic Annotator) is the dominant annotation tool in linguistics and multimodal research. It stores annotations in the EAF format (ELAN Annotation Format) — an XML file organised around tiers: named layers that group time-aligned annotations. Tiers can be hierarchically connected and linked to linguistic types.

EAF is the primary export format of several tools in the MAVA project context, including TIB-AV-A. MediaPkg is not a replacement for EAF — it serves a different purpose:

| Concept             | EAF                                        | MediaPkg                                |
|---------------------|--------------------------------------------|-----------------------------------------|
| Primary use         | Human annotation and transcription         | Machine-generated analysis output       |
| Data model          | Tiers with typed, hierarchical annotations | Flat observation and annotation series  |
| Numeric time-series | Not supported                              | First-class (ObservationSeries)         |
| File format         | XML                                        | Parquet inside ZIP                      |
| Corpus packaging    | One file per recording                     | Multiple videos in one archive          |
| Performance         | Slow to parse at scale                     | Columnar, fast reads                    |

A future integration goal is to allow tools to import EAF annotations into .mediapkg as AnnotationSeries tracks, enabling combined analysis of human annotations and AI-generated observation data in a single package.

CLARIN / CMDI

CLARIN (Common Language Resources and Technology Infrastructure) is the European research infrastructure for language resources. Its metadata standard is CMDI (Component Metadata Infrastructure) — a flexible, component-based XML metadata framework standardised in ISO 24622.

CMDI operates at a different level than MediaPkg: it describes resources (corpora, recordings, tools) rather than the annotation data inside them. A CMDI record for a video corpus would describe who created it, what language is spoken, what tools were used, and where the data can be found — but not the content of the annotations themselves.

The two are complementary:

| Concept     | CMDI                                  | MediaPkg                      |
|-------------|---------------------------------------|-------------------------------|
| Describes   | The resource (corpus-level metadata)  | The annotation data           |
| Format      | XML                                   | JSON-LD + Parquet             |
| Audience    | Archive infrastructure, discovery     | Analysis tools, data exchange |
| Granularity | Corpus and recording level            | Track and row level           |

For CLARIN-compliant archiving, a .mediapkg corpus would be described by a CMDI metadata record that references the package as a resource. The MAVA ontology provides the semantic vocabulary that bridges the two levels.

Frictionless Data

Frictionless Data is a framework for describing, validating, and packaging tabular data. It supports Parquet files through a generic datapackage.json metadata file that declares table schemas with column types, constraints, and human-readable descriptions.

The fundamental difference is architectural: Frictionless is built around domain-agnostic tooling that works with any tabular data — its value lies in generic validation and packaging infrastructure that users adapt to their domain by writing schemas. MediaPkg is built around a domain-specific ontology (MAVA) that defines the semantics of video annotations — tooling is written to support this ontology, not the other way around.

| Concept           | Frictionless Data                          | MediaPkg                                                                  |
|-------------------|--------------------------------------------|---------------------------------------------------------------------------|
| Domain            | Generic tabular data                       | Video annotations (time-series + intervals)                               |
| Semantic layer    | None (operational metadata only)           | JSON-LD @context → MAVA ontology                                          |
| Schema            | Table Schema (columns, types, constraints) | Track types (ObservationSeries, AnnotationSeries) with ontology mappings  |
| Multi-file corpus | Flat collection of unrelated tables        | Hierarchical structure (videos → tracks)                                  |
| Use case          | Data validation and publishing             | Tool interoperability with semantic precision                             |

W3C Web Annotation

The W3C Web Annotation model is general-purpose but verbose and does not have first-class support for dense numeric time-series (e.g. scores sampled every 0.5s). MediaPkg reuses the MAVA ontology’s annotation vocabulary but does not require full Web Annotation compliance.

WebVTT

WebVTT is well-suited for subtitles but is a plain text format with no support for structured or numeric properties. It cannot represent multi-dimensional observation scores or structured interval annotations.


4. File Format

Extension and MIME type

  • File extension: .mediapkg

  • MIME type: application/vnd.mava.mediapkg+zip (provisional)

Structure

A .mediapkg file MUST be a valid ZIP archive (as defined by the PKWARE .ZIP File Format Specification, APPNOTE.TXT) using DEFLATE compression.

It MUST contain a manifest.json at the root of the archive.

It MUST contain at least one Parquet file referenced by the manifest.

corpus.mediapkg
  manifest.json
  video_001/
    emotions.parquet
    transcript.parquet
    shots.parquet
    face_emotions.parquet
  video_002/
    emotions.parquet
    transcript.parquet
    ...

Each video’s files MUST be contained in a folder whose name is the video’s id as declared in the manifest.
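
A minimal reading sketch under this layout, assuming Python with pyarrow (the file name corpus.mediapkg is illustrative):

import io
import json
import zipfile
import pyarrow.parquet as pq

with zipfile.ZipFile("corpus.mediapkg") as zf:
    # The manifest MUST be at the archive root.
    manifest = json.loads(zf.read("manifest.json"))
    for video in manifest["videos"]:
        for track_name, parquet_path in video["files"].items():
            # Each Parquet file lives in a folder named after the video id.
            table = pq.read_table(io.BytesIO(zf.read(parquet_path)))
            print(video["id"], track_name, table.num_rows, "rows")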


5. manifest.json

The manifest.json file is the entry point to a .mediapkg archive. It MUST be valid JSON and MUST be located at the root of the ZIP archive.

Top-level fields

| Field       | Type         | Required | Description                                              |
|-------------|--------------|----------|----------------------------------------------------------|
| version     | string       | MUST     | Format version. Currently "0.1".                         |
| created     | string       | MUST     | ISO 8601 datetime of package creation.                   |
| ontology    | string (URI) | MUST     | URI of the MAVA ontology.                                |
| context     | object       | MUST     | JSON-LD @context mapping column names to ontology URIs.  |
| tracks      | object       | MUST     | Track definitions describing each Parquet file type.     |
| videos      | array        | MUST     | One entry per video. MUST contain at least one entry.    |
| description | string       | OPTIONAL | Human-readable description of the corpus.                |

Example

{
  "version": "0.1",
  "created": "2025-08-12T10:00:00+00:00",
  "description": "Emotion and transcript annotations for talk recordings",
  "ontology": "http://example.org/mava/ontology#",
  "context": {
    "@context": {
      "xsd": "http://www.w3.org/2001/XMLSchema#",
      "mava": "http://example.org/mava/ontology#",
      "start_seconds": { "@id": "mava:atTime", "@type": "xsd:decimal" },
      "end_seconds": { "@id": "mava:endTime", "@type": "xsd:decimal" },
      "annotations": { "@id": "mava:stringValue", "@type": "xsd:string" }
    }
  },
  "tracks": {},
  "videos": []
}

The context field

The context field MUST contain a JSON-LD @context object that maps every column name used in any Parquet file in the archive to a term in the MAVA ontology. This is the single authoritative semantic mapping for the whole corpus — it is not duplicated per file.

A consumer that wishes to reconstruct RDF triples from the Parquet data MUST use this context.
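
A simplified expansion sketch, resolving prefixed @id values by hand; a production consumer would use a full JSON-LD processor (e.g. pyld) rather than this manual approach:

import json
import zipfile

with zipfile.ZipFile("corpus.mediapkg") as zf:
    ctx = json.loads(zf.read("manifest.json"))["context"]["@context"]

def expand(term: str) -> str:
    """Expand a Parquet column name to a full ontology URI via the manifest context."""
    entry = ctx[term]                       # raises KeyError for unmapped columns
    curie = entry["@id"] if isinstance(entry, dict) else entry
    prefix, _, local = curie.partition(":")
    return ctx[prefix] + local

# e.g. expand("start_seconds") -> "http://example.org/mava/ontology#atTime"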

The tracks field

The tracks field is an object where each key is a track name and each value describes one type of annotation file. Track names MUST be lowercase strings with no spaces (use underscores).

Each track entry MUST contain:

| Field                     | Type             | Required                   | Description                                                                              |
|---------------------------|------------------|----------------------------|------------------------------------------------------------------------------------------|
| type                      | string           | MUST                       | One of "mava:ObservationSeries", "mava:AnnotationSeries", or "mava:AnnotationListSeries". |
| description               | string           | MUST                       | Human-readable description of the track.                                                 |
| columns                   | array of strings | MUST                       | Ordered list of column names in the Parquet file.                                        |
| dimensions                | object           | MUST for ObservationSeries | Maps each value column name to its description and value range.                          |
| sampling_interval_seconds | number           | OPTIONAL                   | For ObservationSeries: the sampling interval in seconds.                                 |

Track example — ObservationSeries

"emotions": {
  "type": "mava:ObservationSeries",
  "description": "Per-frame probability scores from face analysis, sampled every 0.5s.",
  "sampling_interval_seconds": 0.5,
  "columns": ["start_seconds", "angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"],
  "dimensions": {
    "angry":    {"description": "Anger probability score",    "range": "[0,1]"},
    "disgust":  {"description": "Disgust probability score",  "range": "[0,1]"},
    "fear":     {"description": "Fear probability score",     "range": "[0,1]"},
    "happy":    {"description": "Happiness probability score","range": "[0,1]"},
    "sad":      {"description": "Sadness probability score",  "range": "[0,1]"},
    "surprise": {"description": "Surprise probability score", "range": "[0,1]"},
    "neutral":  {"description": "Neutral expression score",   "range": "[0,1]"}
  }
}

Track example — AnnotationSeries

"transcript": {
  "type": "mava:AnnotationSeries",
  "description": "Speech-to-text segments from Whisper transcription model.",
  "columns": ["start_seconds", "end_seconds", "annotations"]
}

Track example — AnnotationListSeries

"scene_tags": {
  "type": "mava:AnnotationListSeries",
  "description": "Scene classification tags from Places3 model (indoor/outdoor + natural/man-made)",
  "columns": ["start_seconds", "end_seconds", "annotations"]
}

The videos field

The videos field MUST be an array with at least one entry. Each entry describes one video in the corpus.

| Field            | Type         | Required | Description                                                             |
|------------------|--------------|----------|-------------------------------------------------------------------------|
| id               | string       | MUST     | Unique identifier for this video. Used as the folder name inside the ZIP. |
| src              | string (URI) | MUST     | URI or filename of the original video file.                             |
| files            | object       | MUST     | Maps track name → path of the Parquet file inside the ZIP.              |
| title            | string       | OPTIONAL | Human-readable title.                                                   |
| duration_seconds | number       | OPTIONAL | Total duration of the video in seconds.                                 |

Video entry example

{
  "id": "video_001",
  "src": "https://example.org/videos/talk.mp4",
  "title": "Example Talk",
  "duration_seconds": 3600.0,
  "files": {
    "emotions": "video_001/emotions.parquet",
    "transcript": "video_001/transcript.parquet",
    "shots": "video_001/shots.parquet",
    "face_emotions": "video_001/face_emotions.parquet"
  }
}

6. Track Types

MediaPkg defines three track types, reflecting the fundamental distinction in the MAVA ontology between dense time-series data and sparse interval annotations.

6.1 ObservationSeries

An ObservationSeries is a dense, regularly sampled time-series of numeric values. Each row is a mava:ObservationPoint.

Required columns:

| Column        | Maps to     | Parquet type | Description                                                                    |
|---------------|-------------|--------------|--------------------------------------------------------------------------------|
| start_seconds | mava:atTime | DOUBLE       | Time of the observation in seconds from video start. MUST be non-negative.     |

Dimension columns: One column per declared dimension. Column names MUST match the keys in the track’s dimensions object in the manifest. Values MUST be numeric (xsd:decimal).

Examples of ObservationSeries tracks: emotion scores, explosion detection scores, interior/exterior classification scores, any dense ML model output.

6.2 AnnotationSeries

An AnnotationSeries is a sparse set of interval annotations. Each row is a mava:AnnotationSegment with a start time, end time, and a string value.

Required columns:

| Column        | Maps to          | Parquet type | Description                                                            |
|---------------|------------------|--------------|-------------------------------------------------------------------------|
| start_seconds | mava:startTime   | DOUBLE       | Start of the interval in seconds. MUST be non-negative.                |
| end_seconds   | mava:endTime     | DOUBLE       | End of the interval in seconds. MUST be greater than start_seconds.    |
| annotations   | mava:stringValue | STRING       | The annotation value for this segment.                                 |

6.3 AnnotationListSeries

An AnnotationListSeries is a sparse set of interval annotations where each segment has multiple simultaneous values. Each row is a mava:AnnotationListSegment with a start time, end time, and a list of strings.

Required columns:

| Column        | Maps to        | Parquet type | Description                                                          |
|---------------|----------------|--------------|-----------------------------------------------------------------------|
| start_seconds | mava:startTime | DOUBLE       | Start of the interval in seconds. MUST be non-negative.              |
| end_seconds   | mava:endTime   | DOUBLE       | End of the interval in seconds. MUST be greater than start_seconds.  |
| annotations   | mava:listValue | LIST<STRING> | A list of string annotations for this segment.                       |

Note on duration: Duration is not stored as a column — it is derivable as end_seconds - start_seconds. This avoids redundancy and potential inconsistency.

Note on timecodes: Human-readable timecode strings (e.g. HH:MM:SS.ms) are not stored in the interchange format. They are derivable from seconds and add no information. Tools that need to display timecodes MUST compute them from start_seconds / end_seconds.
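
For example, a tool might derive display timecodes like this (a minimal sketch; the function name is illustrative):

def to_timecode(seconds: float) -> str:
    """Format seconds from video start as HH:MM:SS.mmm for display only."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

assert to_timecode(12.5) == "00:00:12.500"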

Examples of AnnotationSeries tracks: shot boundaries, speech-to-text transcripts, face emotion labels, scene labels. An AnnotationListSeries suits cases where several labels apply to the same interval at once, as in the scene_tags example above.


7. Parquet Files

Encoding

Each Parquet file in a .mediapkg archive MUST conform to the Apache Parquet format specification.

Column names MUST match the columns array declared for the track in the manifest exactly, in the declared order.

Column types

There are two kinds of columns:

Fixed columns have the same name in every track: start_seconds, end_seconds, and annotations. Their Parquet type is always the same.

Dimension columns have variable names — whatever was declared in the track’s dimensions object in the manifest. Their Parquet type is always DOUBLE regardless of the name.

| Column             | Parquet type                     | Notes                                                      |
|--------------------|----------------------------------|------------------------------------------------------------|
| start_seconds      | DOUBLE                           | Fixed name, always present                                 |
| end_seconds        | DOUBLE                           | Fixed name, AnnotationSeries and AnnotationListSeries only |
| Dimension columns  | DOUBLE                           | Variable names, ObservationSeries only                     |
| annotations        | BYTE_ARRAY (UTF-8 string)        | Fixed name, AnnotationSeries only                          |
| annotations (list) | LIST<BYTE_ARRAY> (UTF-8 strings) | Fixed name, AnnotationListSeries only                      |

Example — ObservationSeries (emotions track)

The dimension columns here are angry, fear, happy, sad, and neutral, a subset of the names declared in the track’s dimensions object (disgust and surprise are omitted for brevity). All are DOUBLE.

emotions.parquet
┌───────────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ start_seconds │  angry  │  fear   │  happy  │   sad   │ neutral │
│    DOUBLE     │ DOUBLE  │ DOUBLE  │ DOUBLE  │ DOUBLE  │ DOUBLE  │
├───────────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│     0.000     │  0.021  │  0.014  │  0.743  │  0.011  │  0.211  │
│     0.500     │  0.034  │  0.009  │  0.698  │  0.028  │  0.231  │
│     1.000     │  0.018  │  0.022  │  0.712  │  0.015  │  0.233  │
└───────────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

A different tool producing explosion scores would have different dimension column names but the same structure:

explosion.parquet
┌───────────────┬───────────┐
│ start_seconds │ explosion │
│    DOUBLE     │  DOUBLE   │
├───────────────┼───────────┤
│     0.000     │   0.003   │
│     0.100     │   0.011   │
│     0.200     │   0.842   │
└───────────────┴───────────┘

Example — AnnotationSeries (transcript track)

No dimension columns. Fixed columns start_seconds, end_seconds, annotations only.

transcript.parquet
┌───────────────┬─────────────┬──────────────────────────────────┐
│ start_seconds │ end_seconds │           annotations            │
│    DOUBLE     │   DOUBLE    │           BYTE_ARRAY             │
├───────────────┼─────────────┼──────────────────────────────────┤
│     0.000     │    12.300   │ "Welcome to the conference."     │
│    12.500     │    29.800   │ "Today we discuss annotation."   │
│    30.100     │    45.000   │ "Thank you for joining us."      │
└───────────────┴─────────────┴──────────────────────────────────┘

Example — AnnotationListSeries (scene_tags track)

The structure matches an AnnotationSeries, except that each annotations value is a variable-length list of strings, such as tags.

scene_tags.parquet
┌───────────────┬─────────────┬───────────────────────────┐
│ start_seconds │ end_seconds │       annotations         │
│    DOUBLE     │   DOUBLE    │     LIST<STRING>          │
├───────────────┼─────────────┼───────────────────────────┤
│     0.000     │    45.200   │ ["outdoor", "natural"]    │
│    45.200     │    78.500   │ ["indoor"]                │
│    78.500     │   120.000   │ ["outdoor", "man-made"]   │
└───────────────┴─────────────┴───────────────────────────┘
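
A minimal writing sketch for the shapes above, assuming pyarrow; the column values are taken from the example rows:

import pyarrow as pa
import pyarrow.parquet as pq

# ObservationSeries: start_seconds plus one DOUBLE column per declared dimension.
emotions = pa.table({
    "start_seconds": pa.array([0.0, 0.5, 1.0], pa.float64()),
    "happy":         pa.array([0.743, 0.698, 0.712], pa.float64()),
    "neutral":       pa.array([0.211, 0.231, 0.233], pa.float64()),
})

# AnnotationListSeries: annotations is a LIST<STRING> column.
scene_tags = pa.table({
    "start_seconds": pa.array([0.0, 45.2], pa.float64()),
    "end_seconds":   pa.array([45.2, 78.5], pa.float64()),
    "annotations":   pa.array([["outdoor", "natural"], ["indoor"]], pa.list_(pa.string())),
})

pq.write_table(emotions, "emotions.parquet", compression="zstd")
pq.write_table(scene_tags, "scene_tags.parquet", compression="zstd")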

Compression

Parquet files SHOULD use Snappy or ZSTD compression for row groups. The ZIP archive provides additional compression on top.

Row ordering

Rows in an ObservationSeries file MUST be ordered by start_seconds ascending.

Rows in an AnnotationSeries or AnnotationListSeries file SHOULD be ordered by start_seconds ascending.


8. Ontology and Semantic Layer

MAVA Ontology

All semantic terms used in MediaPkg are defined in the MAVA ontology at http://example.org/mava/ontology#. The ontology is provided in Turtle (.ttl) format in the spec/ directory of this repository.

The ontology defines:

  • mava:VideoCorpus — a collection of videos

  • mava:Video — a single video resource

  • mava:MediaPackage — a .mediapkg archive

  • mava:ObservationSeries / mava:ObservationPoint — dense time-series analysis

  • mava:AnnotationSeries / mava:AnnotationSegment and mava:AnnotationListSeries / mava:AnnotationListSegment — sparse interval annotations

  • mava:Dimension — a single measured quantity within an ObservationSeries

  • Time properties: mava:atTime, mava:startTime, mava:endTime

  • Value properties: mava:numericValue, mava:stringValue, mava:listValue

JSON-LD Context

The context field in the manifest is a JSON-LD @context that maps Parquet column names to MAVA ontology term URIs. A consumer that reads a .mediapkg file and wishes to export RDF MUST use this context to expand column names to full URIs before serialising.

Reused Vocabularies

| Prefix   | Namespace                            | Used for                  |
|----------|--------------------------------------|---------------------------|
| mava:    | http://example.org/mava/ontology#    | All domain-specific terms |
| xsd:     | http://www.w3.org/2001/XMLSchema#    | Datatype declarations     |
| dcterms: | http://purl.org/dc/terms/            | Ontology metadata         |


9. Validation

Manifest validation

A manifest.json MUST be valid JSON. Implementations SHOULD validate it against the JSON Schema provided at spec/manifest.schema.json in this repository.
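
A minimal validation sketch, assuming the jsonschema Python package and a local checkout of the repository (file paths are illustrative):

import json
import jsonschema

with open("spec/manifest.schema.json") as f:
    schema = json.load(f)
with open("manifest.json") as f:
    manifest = json.load(f)

# Raises jsonschema.ValidationError if the manifest does not conform.
jsonschema.validate(instance=manifest, schema=schema)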

Ontology validation (SHACL)

SHACL shapes for validating RDF data exported from a .mediapkg are included in spec/mava.ttl. The shapes enforce:

  • Every ObservationPoint has exactly one mava:atTime value (non-negative) and belongs to an ObservationSeries

  • Every AnnotationSegment has exactly one mava:startTime and one mava:endTime (both non-negative) and belongs to an AnnotationSeries

  • Every ObservationSeries has a description and at least one declared dimension

  • Every Dimension has exactly one name matching a Parquet column

Package validation

Implementations SHOULD verify that (a validation sketch follows this list):

  • Every file referenced in manifest.videos[*].files exists in the ZIP archive

  • Column names in each Parquet file match the columns array for that track in the manifest

  • end_seconds > start_seconds for all rows in AnnotationSeries files

  • start_seconds >= 0 for all rows
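
The following sketch implements these checks (illustrative, not normative; assumes pyarrow):

import io
import json
import zipfile
import pyarrow.parquet as pq

def validate_package(path: str) -> list[str]:
    """Return a list of human-readable validation errors (empty if the package passes)."""
    errors = []
    with zipfile.ZipFile(path) as zf:
        manifest = json.loads(zf.read("manifest.json"))
        names = set(zf.namelist())
        for video in manifest["videos"]:
            for track_name, parquet_path in video["files"].items():
                if parquet_path not in names:
                    errors.append(f"missing file: {parquet_path}")
                    continue
                table = pq.read_table(io.BytesIO(zf.read(parquet_path)))
                expected = manifest["tracks"][track_name]["columns"]
                if table.column_names != expected:
                    errors.append(f"{parquet_path}: columns {table.column_names} != {expected}")
                    continue
                start = table.column("start_seconds").to_pylist()
                if any(s < 0 for s in start):
                    errors.append(f"{parquet_path}: negative start_seconds")
                if "end_seconds" in expected:
                    end = table.column("end_seconds").to_pylist()
                    if any(e <= s for s, e in zip(start, end)):
                        errors.append(f"{parquet_path}: end_seconds <= start_seconds")
    return errors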


10. Corpus Packages

A single .mediapkg file MAY contain annotations for more than one video. This is the primary mechanism for distributing a corpus.

Combining packages

Two or more single-video .mediapkg files MAY be combined into a corpus package; a merge sketch follows the list below. When combining:

  • The videos arrays from all input packages MUST be merged

  • Each video id MUST be unique within the combined manifest — implementors MUST check for conflicts before combining

  • The context and tracks from the first package are used as the basis; implementations SHOULD warn if input packages have conflicting contexts or track definitions

  • All Parquet files from input packages are included unchanged, preserving their folder structure
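
A minimal merge sketch under these rules, using only the standard library; conflict handling is kept to video ids, whereas a real implementation SHOULD also compare contexts and track definitions:

import json
import zipfile

def combine(inputs: list[str], output: str) -> None:
    """Merge single-video packages into one corpus package."""
    merged = None
    seen_ids = set()
    with zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as out:
        for path in inputs:
            with zipfile.ZipFile(path) as zf:
                manifest = json.loads(zf.read("manifest.json"))
                if merged is None:
                    # context and tracks of the first package form the basis
                    merged = {**manifest, "videos": []}
                for video in manifest["videos"]:
                    if video["id"] in seen_ids:
                        raise ValueError(f"duplicate video id: {video['id']}")
                    seen_ids.add(video["id"])
                    merged["videos"].append(video)
                for name in zf.namelist():
                    if name != "manifest.json":
                        # Parquet files are copied unchanged, folder structure preserved.
                        out.writestr(name, zf.read(name))
        out.writestr("manifest.json", json.dumps(merged, indent=2))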

Splitting packages

A corpus package MAY be split into individual video packages by extracting each video’s folder and writing a manifest containing only that video’s entry.


11. Design Decisions

Why ZIP and not a single Parquet file?

A single Parquet file cannot contain a corpus-level manifest or multiple videos with heterogeneous track schemas. ZIP is universally supported, allows incremental reading, and provides an additional compression layer on top of Parquet’s own compression.

Why Parquet and not Arrow IPC?

Parquet provides better compression (important for transfer) and broader ecosystem support. Arrow IPC is faster to write but produces larger files. For an interchange format where transfer efficiency matters, Parquet is the better choice. Individual tools may use Arrow IPC internally.

Why not RDF/Turtle directly?

Serialising a single video’s annotations as RDF took approximately ten minutes in practice. Parquet serialisation of the same data takes under a second. RDF export remains possible as a derived output — a tool can read a .mediapkg and export Turtle using the JSON-LD context — but it is not the primary interchange mechanism.

Why end_seconds instead of duration_seconds?

End time is more directly useful for querying: “find all annotations overlapping timestamp X” requires start_seconds <= X AND end_seconds >= X. With duration you would need to compute start + duration in every query. Duration is trivially derivable as end - start.
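
For illustration, a hypothetical pandas query for all transcript segments overlapping a timestamp x (path and variable names are illustrative):

import pandas as pd

df = pd.read_parquet("video_001/transcript.parquet")
x = 15.0  # timestamp of interest, in seconds from video start
overlapping = df[(df["start_seconds"] <= x) & (df["end_seconds"] >= x)]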

Why no timecode columns?

Timecode strings (HH:MM:SS.ms) are derivable from seconds and add no information. Including them in every row would increase file size and introduce potential inconsistencies. Tools that need to display timecodes compute them at render time.

Why are dimension names not ontology properties?

The MAVA ontology does not define properties for specific score types (e.g. mava:angryScore). Instead, mava:Dimension allows any measured quantity to be declared per series in the manifest. This means the ontology does not need to change when new ML models with new output types are introduced — only the manifest’s dimensions object needs updating.

Why JSON-LD for the context and not plain URI mappings?

JSON-LD is a W3C standard for embedding linked data semantics in JSON. It is machine-actionable: a JSON-LD processor can expand column names to full URIs and convert rows to RDF triples without any custom code. Plain URI mappings would require a custom parser. Both approaches store the same information, but JSON-LD is more interoperable.


MediaPkg v0.1 Draft — ETH Zurich / Swiss Data Science Center