# MediaPkg Format Specification

- Version: 0.1 (Draft)
- Repository: https://github.com/smaennel/mava-exchange
- Ontology: http://example.org/mava/ontology#
- License: Apache 2.0
## Status

This is an early draft specification. It is being developed in the context of the TIB-AV-A and videoscope use cases at ETH Zurich / Swiss Data Science Center. Feedback is welcome via GitHub issues.
The key words MUST, MUST NOT, REQUIRED, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in RFC 2119.
## 1. Introduction
MediaPkg (.mediapkg) is a lightweight, compressed interchange format for
time-based annotations on video files. It is designed for efficient exchange
between video annotation and processing tools without the overhead of full RDF
serialisation.
A .mediapkg file is a ZIP archive containing:
- A `manifest.json` describing the corpus, its videos, annotation tracks, and the semantic meaning of all columns
- One or more Parquet files — one per annotation track per video — containing the actual annotation data
The format is backed by the MAVA ontology, which provides a shared vocabulary for annotation tools in linguistics and multimodal analysis.
## 2. Motivation
Tools in the video annotation space produce a variety of outputs — emotion scores, shot boundaries, transcripts, scene labels, and others. These outputs are currently exchanged as plain TSV or CSV files with no shared vocabulary, making interoperability between tools difficult.
A previous approach exported all data to RDF/Turtle. While semantically rich, this proved too slow for large videos — serialisation of a single video’s annotations could take ten minutes or more.
MediaPkg addresses this by separating two concerns:

- Data transport — handled by Parquet inside ZIP, which is fast, compressed, and columnar
- Semantic meaning — handled by the MAVA ontology and a JSON-LD `@context` in the manifest
This mirrors the approach taken by GeoParquet in the geospatial domain: efficient binary data with attached semantic metadata.
## 3. Relation to Existing Standards

### GeoParquet
MediaPkg is directly inspired by GeoParquet.
GeoParquet embeds metadata inside Parquet files under a `geo` key to describe geometry columns. MediaPkg uses the same mechanism (an `mj` key in the Parquet metadata) and extends it with a JSON-LD context for full ontology linkage.
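To illustrate the mechanism, a producer using pyarrow could embed and read back such a key as sketched below. The layout of the `mj` value shown here is illustrative only; this draft does not yet define its schema.

```python
import json
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small ObservationSeries-style table (columns are illustrative).
table = pa.table({"start_seconds": [0.0, 0.5, 1.0], "happy": [0.74, 0.70, 0.71]})

# Attach an "mj" key to the file-level metadata, mirroring GeoParquet's "geo" key.
# The value is a placeholder; the spec does not yet define its structure.
mj = {"version": "0.1", "track": "emotions", "type": "mava:ObservationSeries"}
meta = dict(table.schema.metadata or {})
meta[b"mj"] = json.dumps(mj).encode("utf-8")
pq.write_table(table.replace_schema_metadata(meta), "emotions.parquet")

# Consumers can read the key back without loading any row data.
print(json.loads(pq.read_schema("emotions.parquet").metadata[b"mj"]))
```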
| Concept | GeoParquet | MediaPkg |
|---|---|---|
| Container | Single Parquet file | ZIP archive of Parquet files |
| Metadata location | Embedded in Parquet | `manifest.json` at the archive root |
| Semantic layer | None (operational only) | JSON-LD `@context` + MAVA ontology |
| Coordinate system | WGS84 (spatial) | Seconds from video start (temporal) |
| Multi-file corpus | Not addressed | Multiple videos in one archive |
### ELAN / EAF
ELAN (EUDICO Linguistic Annotator) is the dominant annotation tool in linguistics and multimodal research. It stores annotations in the EAF format (ELAN Annotation Format) — an XML file organised around tiers: named layers that group time-aligned annotations. Tiers can be hierarchically connected and linked to linguistic types.
EAF is the primary export format of several tools in the MAVA project context, including TIB-AV-A. MediaPkg is not a replacement for EAF — it serves a different purpose:
| Concept | EAF | MediaPkg |
|---|---|---|
| Primary use | Human annotation and transcription | Machine-generated analysis output |
| Data model | Tiers with typed, hierarchical annotations | Flat observation and annotation series |
| Numeric time-series | Not supported | First-class (ObservationSeries) |
| File format | XML | Parquet inside ZIP |
| Corpus packaging | One file per recording | Multiple videos in one archive |
| Performance | Slow to parse at scale | Columnar, fast reads |
A future integration goal is to allow tools to import EAF annotations into
.mediapkg as AnnotationSeries tracks, enabling combined analysis of human
annotations and AI-generated observation data in a single package.
### CLARIN / CMDI
CLARIN (Common Language Resources and Technology Infrastructure) is the European research infrastructure for language resources. Its metadata standard is CMDI (Component Metadata Infrastructure) — a flexible, component-based XML metadata framework standardised in ISO 24622.
CMDI operates at a different level than MediaPkg: it describes resources (corpora, recordings, tools) rather than the annotation data inside them. A CMDI record for a video corpus would describe who created it, what language is spoken, what tools were used, and where the data can be found — but not the content of the annotations themselves.
The two are complementary:
| Concept | CMDI | MediaPkg |
|---|---|---|
| Describes | The resource (corpus-level metadata) | The annotation data |
| Format | XML | JSON-LD + Parquet |
| Audience | Archive infrastructure, discovery | Analysis tools, data exchange |
| Granularity | Corpus and recording level | Track and row level |
For CLARIN-compliant archiving, a .mediapkg corpus would be described by a
CMDI metadata record that references the package as a resource. The MAVA
ontology provides the semantic vocabulary that bridges the two levels.
### Frictionless Data
Frictionless Data is a framework for describing,
validating, and packaging tabular data. It supports Parquet files through a
generic datapackage.json metadata file that declares table schemas with column
types, constraints, and human-readable descriptions.
The fundamental difference is architectural: Frictionless is built around domain-agnostic tooling that works with any tabular data — its value lies in generic validation and packaging infrastructure that users adapt to their domain by writing schemas. MediaPkg is built around a domain-specific ontology (MAVA) that defines the semantics of video annotations — tooling is written to support this ontology, not the other way around.
| Concept | Frictionless Data | MediaPkg |
|---|---|---|
| Domain | Generic tabular data | Video annotations (time-series + intervals) |
| Semantic layer | None (operational metadata only) | JSON-LD `@context` + MAVA ontology |
| Schema | Table Schema (columns, types, constraints) | Track types (ObservationSeries, AnnotationSeries) with ontology mappings |
| Multi-file corpus | Flat collection of unrelated tables | Hierarchical structure (videos → tracks) |
| Use case | Data validation and publishing | Tool interoperability with semantic precision |
### W3C Web Annotation
The W3C Web Annotation model is general-purpose but verbose and does not have first-class support for dense numeric time-series (e.g. scores sampled every 0.5s). MediaPkg reuses the MAVA ontology’s annotation vocabulary but does not require full Web Annotation compliance.
### WebVTT
WebVTT is well-suited for subtitles but is a plain text format with no support for structured or numeric properties. It cannot represent multi-dimensional observation scores or structured interval annotations.
## 4. File Format

### Extension and MIME type

- File extension: `.mediapkg`
- MIME type: `application/vnd.mava.mediapkg+zip` (provisional)
### Structure

A `.mediapkg` file MUST be a valid ZIP archive (as defined by PKWARE ZIP) using DEFLATE compression. It MUST contain a `manifest.json` at the root of the archive. It MUST contain at least one Parquet file referenced by the manifest.
```
corpus.mediapkg
  manifest.json
  video_001/
    emotions.parquet
    transcript.parquet
    shots.parquet
    face_emotions.parquet
  video_002/
    emotions.parquet
    transcript.parquet
    ...
```
Each video’s files MUST be contained in a folder named by the video’s id as
declared in the manifest.
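A minimal packaging sketch, assuming Python's standard zipfile module, a manifest dict that follows section 5, and Parquet files that already exist on disk under the paths shown above:

```python
import json
import zipfile

# Assemble a package; the manifest and .parquet files are assumed to exist locally.
manifest = json.load(open("manifest.json"))

with zipfile.ZipFile("corpus.mediapkg", "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("manifest.json", json.dumps(manifest, indent=2))
    for video in manifest["videos"]:
        for track, arcname in video["files"].items():
            # arcname is e.g. "video_001/emotions.parquet" (folder name matches the video id)
            zf.write(arcname, arcname=arcname)
```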
## 5. manifest.json
The manifest.json file is the entry point to a .mediapkg archive. It MUST be
valid JSON and MUST be located at the root of the ZIP archive.
### Top-level fields

| Field | Type | Required | Description |
|---|---|---|---|
| `version` | string | MUST | Format version. Currently `"0.1"`. |
| `created` | string | MUST | ISO 8601 datetime of package creation. |
| `ontology` | string (URI) | MUST | URI of the MAVA ontology. |
| `context` | object | MUST | JSON-LD `@context` mapping column names to ontology terms. |
| `tracks` | object | MUST | Track definitions describing each Parquet file type. |
| `videos` | array | MUST | One entry per video. MUST contain at least one entry. |
| `description` | string | OPTIONAL | Human-readable description of the corpus. |
### Example

```json
{
  "version": "0.1",
  "created": "2025-08-12T10:00:00+00:00",
  "description": "Emotion and transcript annotations for talk recordings",
  "ontology": "http://example.org/mava/ontology#",
  "context": {
    "@context": {
      "xsd": "http://www.w3.org/2001/XMLSchema#",
      "mava": "http://example.org/mava/ontology#",
      "start_seconds": { "@id": "mava:atTime", "@type": "xsd:decimal" },
      "end_seconds": { "@id": "mava:endTime", "@type": "xsd:decimal" },
      "annotations": { "@id": "mava:stringValue", "@type": "xsd:string" }
    }
  },
  "tracks": {},
  "videos": []
}
```
### The context field

The `context` field MUST contain a JSON-LD `@context` object that maps every column name used in any Parquet file in the archive to a term in the MAVA ontology. This is the single authoritative semantic mapping for the whole corpus — it is not duplicated per file.
A consumer that wishes to reconstruct RDF triples from the Parquet data MUST use this context.
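As an illustration, a consumer could combine the manifest's context with a single Parquet row and run standard JSON-LD expansion on it. The pyld library used here is an assumption, not something the spec mandates; any conforming JSON-LD processor works.

```python
from pyld import jsonld

context = {
    "mava": "http://example.org/mava/ontology#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "start_seconds": {"@id": "mava:atTime", "@type": "xsd:decimal"},
    "end_seconds": {"@id": "mava:endTime", "@type": "xsd:decimal"},
    "annotations": {"@id": "mava:stringValue", "@type": "xsd:string"},
}

# One row from transcript.parquet, re-expressed as a JSON-LD node.
row = {"@context": context, "start_seconds": "0.0", "end_seconds": "12.3",
       "annotations": "Welcome to the conference."}

# Expansion resolves every column name to its full MAVA ontology URI.
print(jsonld.expand(row))
```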
### The tracks field

The `tracks` field is an object where each key is a track name and each value describes one type of annotation file. Track names MUST be lowercase strings with no spaces (use underscores).

Each track entry MUST contain:

| Field | Type | Required | Description |
|---|---|---|---|
| `type` | string | MUST | Either `mava:ObservationSeries`, `mava:AnnotationSeries`, or `mava:AnnotationListSeries`. |
| `description` | string | MUST | Human-readable description of the track. |
| `columns` | array of strings | MUST | Ordered list of column names in the Parquet file. |
| `dimensions` | object | MUST for `ObservationSeries` | Maps each value column name to its description and value range. |
| `sampling_interval_seconds` | number | OPTIONAL | For `ObservationSeries`: the interval between samples, in seconds. |
### Track example — ObservationSeries

```json
"emotions": {
  "type": "mava:ObservationSeries",
  "description": "Per-frame probability scores from face analysis, sampled every 0.5s.",
  "sampling_interval_seconds": 0.5,
  "columns": ["start_seconds", "angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"],
  "dimensions": {
    "angry": {"description": "Anger probability score", "range": "[0,1]"},
    "disgust": {"description": "Disgust probability score", "range": "[0,1]"},
    "fear": {"description": "Fear probability score", "range": "[0,1]"},
    "happy": {"description": "Happiness probability score", "range": "[0,1]"},
    "sad": {"description": "Sadness probability score", "range": "[0,1]"},
    "surprise": {"description": "Surprise probability score", "range": "[0,1]"},
    "neutral": {"description": "Neutral expression score", "range": "[0,1]"}
  }
}
```
### Track example — AnnotationSeries

```json
"transcript": {
  "type": "mava:AnnotationSeries",
  "description": "Speech-to-text segments from Whisper transcription model.",
  "columns": ["start_seconds", "end_seconds", "annotations"]
}
```
### Track example — AnnotationListSeries

```json
"scene_tags": {
  "type": "mava:AnnotationListSeries",
  "description": "Scene classification tags from Places365 model (indoor/outdoor + natural/man-made)",
  "columns": ["start_seconds", "end_seconds", "annotations"]
}
```
### The videos field

The `videos` field MUST be an array with at least one entry. Each entry describes one video in the corpus.

| Field | Type | Required | Description |
|---|---|---|---|
| `id` | string | MUST | Unique identifier for this video. Used as the folder name inside the ZIP. |
| `src` | string (URI) | MUST | URI or filename of the original video file. |
| `files` | object | MUST | Maps track name → path of the Parquet file inside the ZIP. |
| `title` | string | OPTIONAL | Human-readable title. |
| `duration_seconds` | number | OPTIONAL | Total duration of the video in seconds. |
### Video entry example

```json
{
  "id": "video_001",
  "src": "https://example.org/videos/talk.mp4",
  "title": "Example Talk",
  "duration_seconds": 3600.0,
  "files": {
    "emotions": "video_001/emotions.parquet",
    "transcript": "video_001/transcript.parquet",
    "shots": "video_001/shots.parquet",
    "face_emotions": "video_001/face_emotions.parquet"
  }
}
```
## 6. Track Types

MediaPkg defines three track types, reflecting the fundamental distinction in the MAVA ontology between dense time-series data (ObservationSeries) and sparse interval annotations (AnnotationSeries and AnnotationListSeries).
### 6.1 ObservationSeries
An ObservationSeries is a dense, regularly sampled time-series of numeric
values. Each row is a mava:ObservationPoint.
Required columns:

| Column | Maps to | Parquet type | Description |
|---|---|---|---|
| `start_seconds` | `mava:atTime` | `DOUBLE` | Time of the observation in seconds from video start. MUST be non-negative. |
Dimension columns: One column per declared dimension. Column names MUST
match the keys in the track’s dimensions object in the manifest. Values MUST
be numeric (xsd:decimal).
Examples of ObservationSeries tracks: emotion scores, explosion detection scores, interior/exterior classification scores, any dense ML model output.
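As a sketch of what producing such a track might look like, assuming pandas with the pyarrow engine (column and file names follow the emotions example in this spec):

```python
import pandas as pd

# Dense, regularly sampled scores: one row per mava:ObservationPoint.
# Dimension column names must match the track's "dimensions" object in the manifest.
emotions = pd.DataFrame({
    "start_seconds": [0.0, 0.5, 1.0],
    "happy":         [0.74, 0.70, 0.71],
    "neutral":       [0.21, 0.23, 0.23],
})
emotions.to_parquet("emotions.parquet", engine="pyarrow", index=False)
```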
### 6.2 AnnotationSeries
An AnnotationSeries is a sparse set of interval annotations. Each row is a
mava:AnnotationSegment with a start time, end time, and a string value.
Required columns:

| Column | Maps to | Parquet type | Description |
|---|---|---|---|
| `start_seconds` | `mava:startTime` | `DOUBLE` | Start of the interval in seconds. MUST be non-negative. |
| `end_seconds` | `mava:endTime` | `DOUBLE` | End of the interval in seconds. MUST be greater than `start_seconds`. |
| `annotations` | `mava:stringValue` | `BYTE_ARRAY` (UTF-8 string) | The annotation value for this segment. |
### 6.3 AnnotationListSeries

An AnnotationListSeries is a sparse set of interval annotations where each segment has multiple simultaneous values. Each row is a `mava:AnnotationListSegment` with a start time, end time, and a list of strings.

Required columns:

| Column | Maps to | Parquet type | Description |
|---|---|---|---|
| `start_seconds` | `mava:startTime` | `DOUBLE` | Start of the interval in seconds. MUST be non-negative. |
| `end_seconds` | `mava:endTime` | `DOUBLE` | End of the interval in seconds. MUST be greater than `start_seconds`. |
| `annotations` | `mava:ListValue` | `LIST` of `BYTE_ARRAY` | A list of string annotations for this segment. |
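A sketch of producing such a file with pyarrow, where `annotations` is a list-of-string column; the tag values follow the scene_tags example above and are otherwise arbitrary:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Each row is a mava:AnnotationListSegment with several simultaneous tags.
scene_tags = pa.table({
    "start_seconds": pa.array([0.0, 4.2], type=pa.float64()),
    "end_seconds":   pa.array([4.2, 9.7], type=pa.float64()),
    "annotations":   pa.array([["indoor", "man-made"], ["outdoor", "natural"]],
                              type=pa.list_(pa.string())),
})
pq.write_table(scene_tags, "scene_tags.parquet")
```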
Note on duration: Duration is not stored as a column — it is derivable as
end_seconds - start_seconds. This avoids redundancy and potential
inconsistency.
Note on timecodes: Human-readable timecode strings (e.g. HH:MM:SS.ms) are
not stored in the interchange format. They are derivable from seconds and add no
information. Tools that need to display timecodes MUST compute them from
start_seconds / end_seconds.
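A derivation along these lines is all a display layer needs; the helper below is illustrative, not part of the format:

```python
def seconds_to_timecode(t: float) -> str:
    """Render seconds from video start as HH:MM:SS.mmm for display only."""
    hours, rest = divmod(t, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{int(hours):02d}:{int(minutes):02d}:{seconds:06.3f}"

seconds_to_timecode(3725.25)  # -> '01:02:05.250'
```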
Examples of AnnotationSeries tracks: shot boundaries, speech-to-text transcripts, face emotion labels, scene labels.
## 7. Parquet Files

### Encoding
Each Parquet file in a .mediapkg archive MUST conform to the
Apache Parquet format specification.
Column names MUST match the columns array declared for the track in the
manifest exactly, in the declared order.
### Column types

There are two kinds of columns:

- **Fixed columns** have the same name in every track: `start_seconds`, `end_seconds`, and `annotations`. Their Parquet type is always the same.
- **Dimension columns** have variable names — whatever was declared in the track's `dimensions` object in the manifest. Their Parquet type is always `DOUBLE` regardless of the name.

| Column | Parquet type | Notes |
|---|---|---|
| `start_seconds` | `DOUBLE` | Fixed name, always present |
| `end_seconds` | `DOUBLE` | Fixed name, AnnotationSeries and AnnotationListSeries only |
| Dimension columns | `DOUBLE` | Variable names, ObservationSeries only |
| `annotations` | `BYTE_ARRAY` (UTF-8 string) | Fixed name, AnnotationSeries only |
| `annotations` | `LIST` of `BYTE_ARRAY` | Fixed name, AnnotationListSeries only |
### Example — ObservationSeries (emotions track)

The dimension columns are `angry`, `fear`, `happy`, `sad`, `neutral` — the names declared in the track's dimensions object. All are `DOUBLE`.

```
emotions.parquet
┌───────────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ start_seconds │ angry   │ fear    │ happy   │ sad     │ neutral │
│ DOUBLE        │ DOUBLE  │ DOUBLE  │ DOUBLE  │ DOUBLE  │ DOUBLE  │
├───────────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ 0.000         │ 0.021   │ 0.014   │ 0.743   │ 0.011   │ 0.211   │
│ 0.500         │ 0.034   │ 0.009   │ 0.698   │ 0.028   │ 0.231   │
│ 1.000         │ 0.018   │ 0.022   │ 0.712   │ 0.015   │ 0.233   │
└───────────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
```
A different tool producing explosion scores would have different dimension column names but the same structure:
```
explosion.parquet
┌───────────────┬───────────┐
│ start_seconds │ explosion │
│ DOUBLE        │ DOUBLE    │
├───────────────┼───────────┤
│ 0.000         │ 0.003     │
│ 0.100         │ 0.011     │
│ 0.200         │ 0.842     │
└───────────────┴───────────┘
```
### Example — AnnotationSeries (transcript track)

No dimension columns. Fixed columns `start_seconds`, `end_seconds`, `annotations` only.

```
transcript.parquet
┌───────────────┬─────────────┬──────────────────────────────────┐
│ start_seconds │ end_seconds │ annotations                      │
│ DOUBLE        │ DOUBLE      │ BYTE_ARRAY                       │
├───────────────┼─────────────┼──────────────────────────────────┤
│ 0.000         │ 12.300      │ "Welcome to the conference."     │
│ 12.500        │ 29.800      │ "Today we discuss annotation."   │
│ 30.100        │ 45.000      │ "Thank you for joining us."      │
└───────────────┴─────────────┴──────────────────────────────────┘
```
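Putting the pieces together, a consumer might read one track straight out of the archive. The zipfile and pyarrow usage below is an illustrative sketch, not a required API:

```python
import io
import json
import zipfile

import pyarrow.parquet as pq

with zipfile.ZipFile("corpus.mediapkg") as zf:
    manifest = json.loads(zf.read("manifest.json"))
    video = manifest["videos"][0]
    # Resolve the Parquet path for the "transcript" track and read it in memory.
    transcript = pq.read_table(io.BytesIO(zf.read(video["files"]["transcript"])))

print(transcript.column_names)  # ['start_seconds', 'end_seconds', 'annotations']
```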
### Compression
Parquet files SHOULD use Snappy or ZSTD compression for row groups. The ZIP archive provides additional compression on top.
### Row ordering
Rows in an ObservationSeries file MUST be ordered by start_seconds
ascending.
Rows in an AnnotationSeries file SHOULD be ordered by start_seconds
ascending.
## 8. Ontology and Semantic Layer

### MAVA Ontology
All semantic terms used in MediaPkg are defined in the MAVA ontology at
http://example.org/mava/ontology#. The ontology is provided in Turtle (.ttl)
format in the spec/ directory of this repository.
The ontology defines:

- `mava:VideoCorpus` — a collection of videos
- `mava:Video` — a single video resource
- `mava:MediaPackage` — a `.mediapkg` archive
- `mava:ObservationSeries` / `mava:ObservationPoint` — dense time-series analysis
- `mava:AnnotationSeries` / `mava:AnnotationSegment` / `mava:AnnotationListSegment` — sparse interval annotations
- `mava:Dimension` — a single measured quantity within an ObservationSeries
- Time properties: `mava:atTime`, `mava:startTime`, `mava:endTime`
- Value properties: `mava:numericValue`, `mava:stringValue`, `mava:ListValue`
### JSON-LD Context
The context field in the manifest is a JSON-LD @context that maps Parquet
column names to MAVA ontology term URIs. A consumer that reads a .mediapkg
file and wishes to export RDF MUST use this context to expand column names to
full URIs before serialising.
### Reused Vocabularies

| Prefix | Namespace | Used for |
|---|---|---|
| `mava` | `http://example.org/mava/ontology#` | All domain-specific terms |
| `xsd` | `http://www.w3.org/2001/XMLSchema#` | Datatype declarations |
|  |  | Ontology metadata |
## 9. Validation

### Manifest validation
A manifest.json MUST be valid JSON. Implementations SHOULD validate it against
the JSON Schema provided at spec/manifest.schema.json in this repository.
### Ontology validation (SHACL)

SHACL shapes for validating RDF data exported from a `.mediapkg` are included in `spec/mava.ttl`. The shapes enforce:

- Every `ObservationPoint` has exactly one `mava:atTime` value (non-negative) and belongs to an `ObservationSeries`
- Every `AnnotationSegment` has exactly one `mava:startTime` and one `mava:endTime` (both non-negative) and belongs to an `AnnotationSeries`
- Every `ObservationSeries` has a description and at least one declared dimension
- Every `Dimension` has exactly one name matching a Parquet column
### Package validation

Implementations SHOULD verify that (a minimal sketch follows the list):

- Every file referenced in `manifest.videos[*].files` exists in the ZIP archive
- Column names in each Parquet file match the `columns` array for that track in the manifest
- `end_seconds > start_seconds` for all rows in AnnotationSeries files
- `start_seconds >= 0` for all rows
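A minimal validator covering these checks might look like the following sketch (pyarrow assumed; the function name and error messages are illustrative):

```python
import io
import json
import zipfile

import pyarrow.parquet as pq

def validate_package(path: str) -> list[str]:
    """Return a list of validation errors; an empty list means the checks passed."""
    errors = []
    with zipfile.ZipFile(path) as zf:
        names = set(zf.namelist())
        manifest = json.loads(zf.read("manifest.json"))
        for video in manifest["videos"]:
            for track, pq_path in video["files"].items():
                if pq_path not in names:
                    errors.append(f"missing file: {pq_path}")
                    continue
                table = pq.read_table(io.BytesIO(zf.read(pq_path)))
                if table.column_names != manifest["tracks"][track]["columns"]:
                    errors.append(f"{pq_path}: columns do not match the manifest")
                start = table.column("start_seconds").to_pylist()
                if any(s < 0 for s in start):
                    errors.append(f"{pq_path}: negative start_seconds")
                if "end_seconds" in table.column_names:
                    end = table.column("end_seconds").to_pylist()
                    if any(e <= s for s, e in zip(start, end)):
                        errors.append(f"{pq_path}: end_seconds <= start_seconds")
    return errors
```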
## 10. Corpus Packages
A single .mediapkg file MAY contain annotations for more than one video. This
is the primary mechanism for distributing a corpus.
### Combining packages

Two or more single-video .mediapkg files MAY be combined into a corpus package (a sketch follows the list). When combining:

- The `videos` arrays from all input packages MUST be merged
- Each video `id` MUST be unique within the combined manifest — implementors MUST check for conflicts before combining
- The `context` and `tracks` from the first package are used as the basis; implementations SHOULD warn if input packages have conflicting contexts or track definitions
- All Parquet files from input packages are included unchanged, preserving their folder structure
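A combining sketch under these rules, in which the first package's context and tracks win and conflicting definitions are not reconciled:

```python
import json
import zipfile

def combine(inputs: list[str], output: str) -> None:
    merged = None
    seen_ids: set[str] = set()
    with zipfile.ZipFile(output, "w", compression=zipfile.ZIP_DEFLATED) as out:
        for path in inputs:
            with zipfile.ZipFile(path) as zf:
                manifest = json.loads(zf.read("manifest.json"))
                for video in manifest["videos"]:
                    if video["id"] in seen_ids:
                        raise ValueError(f"duplicate video id: {video['id']}")
                    seen_ids.add(video["id"])
                    if merged is not None:
                        merged["videos"].append(video)
                if merged is None:
                    merged = manifest  # context and tracks of the first package
                # Copy Parquet files unchanged, preserving their folder structure.
                for name in zf.namelist():
                    if name != "manifest.json":
                        out.writestr(name, zf.read(name))
        out.writestr("manifest.json", json.dumps(merged, indent=2))
```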
### Splitting packages
A corpus package MAY be split into individual video packages by extracting each video’s folder and writing a manifest containing only that video’s entry.
## 11. Design Decisions

### Why ZIP and not a single Parquet file?
A single Parquet file cannot contain a corpus-level manifest or multiple videos with heterogeneous track schemas. ZIP is universally supported, allows incremental reading, and provides an additional compression layer on top of Parquet’s own compression.
### Why Parquet and not Arrow IPC?
Parquet provides better compression (important for transfer) and broader ecosystem support. Arrow IPC is faster to write but produces larger files. For an interchange format where transfer efficiency matters, Parquet is the better choice. Individual tools may use Arrow IPC internally.
### Why not RDF/Turtle directly?
Serialising a single video’s annotations as RDF took approximately ten minutes
in practice. Parquet serialisation of the same data takes under a second. RDF
export remains possible as a derived output — a tool can read a .mediapkg and
export Turtle using the JSON-LD context — but it is not the primary interchange
mechanism.
### Why end_seconds instead of duration_seconds?
End time is more directly useful for querying: “find all annotations overlapping
timestamp X” requires start_seconds <= X AND end_seconds >= X. With duration
you would need to compute start + duration in every query. Duration is
trivially derivable as end - start.
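For instance, with an AnnotationSeries track loaded into a pandas DataFrame (an assumption about tooling, not a requirement), the overlap query is a direct filter:

```python
import pandas as pd

# Load one AnnotationSeries track (path follows the examples in section 7).
df = pd.read_parquet("video_001/transcript.parquet")

X = 42.0  # query timestamp in seconds
overlapping = df[(df["start_seconds"] <= X) & (df["end_seconds"] >= X)]
```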
### Why no timecode columns?
Timecode strings (HH:MM:SS.ms) are derivable from seconds and add no
information. Including them in every row would increase file size and introduce
potential inconsistencies. Tools that need to display timecodes compute them at
render time.
### Why are dimension names not ontology properties?
The MAVA ontology does not define properties for specific score types (e.g.
mava:angryScore). Instead, mava:Dimension allows any measured quantity to be
declared per series in the manifest. This means the ontology does not need to
change when new ML models with new output types are introduced — only the
manifest’s dimensions object needs updating.
### Why JSON-LD for the context and not plain URI mappings?
JSON-LD is a W3C standard for embedding linked data semantics in JSON. It is machine-actionable: a JSON-LD processor can expand column names to full URIs and convert rows to RDF triples without any custom code. Plain URI mappings would require a custom parser. Both approaches store the same information, but JSON-LD is more interoperable.
---

*MediaPkg v0.1 Draft — ETH Zurich / Swiss Data Science Center*