Dataloader
Overview
Note
Loading the data into the App assumes that you have the backend running: See Install options for setup options
The processed data is loaded from Pipeline S3 into 3 databases:
- App S3: S3 Database for processed data: this is done manually with only a small modification see below
- App MongoDB Mongo database for all metadata on speakers, segments, transcripts and translations
- App Solr Solr search engine where speaker segments are loaded as documents into Solr
flowchart LR
subgraph WebApp[Web Application]
B[(App S3)]
E{SRT Parser}
C[(App MongoDB)]
D[(App Solr)]
end
A[(Pipeline S3)] -- manual --> B
B --> E
E --> C
E --> D
style WebApp fill:white
Loading into the App S3
All files are from Pipeline S3 loaded into App S3: this is currently done manually.
App S3 needs just one extra file HRC_20220328T0000-metadata.yml: it is derived from HRC_20220328T0000-files.yml.
debates
└── HRC_20220328T0000
├── HRC_20220328T0000-files.yml
├── HRC_20220328T0000-metadata.yml
...
Example for HRC_20220328T0000-metadata.yml:
s3_prefix: HRC_20160622T0000
media:
key: HRC_20160622T0000.mp4
type: video
format: mp4
s3_keys:
- name: HRC_20220328T0000.json
type: json
description: JSON file containing metadata transcription ...
- name: HRC_20220328T0000-files.yml
type: yml
description: YAML file containing metadata of the files ...
- name: HRC_20220328T0000.mp4
type: mp4
description: MP4 video file from the 2020 03 28 00:00 session
- name: HRC_20220328T0000-original.wav
type: wav
description: Original audio file from the 2020 03 28 00:00 session
- name: HRC_20220328T0000-transcription_original.srt
type: srt
description: Transcription file in SRT format ...
- name: HRC_20220328T0000-transcription_original.pdf
type: pdf
description: PDF file containing the transcription ...
- name: HRC_20220328T0000-translation_original_english.srt
type: srt
description: Translation file in SRT format to English ...
- name: HRC_20220328T0000-translation_original_english.pdf
type: pdf
description: PDF file containing the English translation ...
context:
type: "Human Rights Council"
session: "32th session"
public: True
schedule:
date: "2016-06-22"
time: "10:00"
timezone: "Europe/Zurich"
- The metadata in
contextandschedulehave been derived from https://conf.unog.ch/digitalrecordings/en s3_prefix: is the prefix or directory on S3, where the files for the media item are storedmedia: points to the actual media file that is played in themedia player:
media subkeys:
key: is the actual media filetype: can bevideooraudioformat: format is the media file format: for videosmp4is supported and for audio fileswav.
Loading into Mongodb and Solr
Warning
Only do these steps on a fresh set up when you database is empty: otherwise it will mess up your existing data
Once the environment is setup, the commands to load data into the mongodb and Solr are the following per media item:
Then load the data from App S3:
Start API Server
After this step the data should be available. You can now start the backend:
You will find the api documentation at http://localhost:8000/docs or as json file at http://localhost:8000/openapi.json
For your convenience it has also been added to this documentation: api documentation