Dataloader

Overview

Note

Loading the data into the App assumes that you have the backend running: See Install options for setup options

The processed data is loaded from Pipeline S3 into 3 databases:

App S3: S3 Database for processed data: this is done manually with only a small modification see below
App MongoDB Mongo database for all metadata on speakers, segments, transcripts and translations
App Solr Solr search engine where speaker segments are loaded as documents into Solr

flowchart LR
    subgraph WebApp[Web Application]
        B[(App S3)]
        E{SRT Parser}
        C[(App MongoDB)]
        D[(App Solr)]
    end
    A[(Pipeline S3)] -- manual --> B
    B --> E
    E --> C
    E --> D
    style WebApp fill:white

Loading into the App S3

All files are from Pipeline S3 loaded into App S3: this is currently done manually. App S3 needs just one extra file HRC_20220328T0000-metadata.yml: it is derived from HRC_20220328T0000-files.yml.

debates
└── HRC_20220328T0000
    ├── HRC_20220328T0000-files.yml
    ├── HRC_20220328T0000-metadata.yml
    ...

Example for HRC_20220328T0000-metadata.yml:

s3_prefix: HRC_20160622T0000
media:
  key: HRC_20160622T0000.mp4
  type: video
  format: mp4
s3_keys:
  - name: HRC_20220328T0000.json
    type: json
    description: JSON file containing metadata transcription ...
  - name: HRC_20220328T0000-files.yml
    type: yml
    description: YAML file containing metadata of the files ...
  - name: HRC_20220328T0000.mp4
    type: mp4
    description: MP4 video file from the 2020 03 28 00:00 session
  - name: HRC_20220328T0000-original.wav
    type: wav
    description: Original audio file from the 2020 03 28 00:00 session
  - name: HRC_20220328T0000-transcription_original.srt
    type: srt
    description: Transcription file in SRT format ...
  - name: HRC_20220328T0000-transcription_original.pdf
    type: pdf
    description: PDF file containing the transcription ...
  - name: HRC_20220328T0000-translation_original_english.srt
    type: srt
    description: Translation file in SRT format to English ...
  - name: HRC_20220328T0000-translation_original_english.pdf
    type: pdf
    description: PDF file containing the English translation ...
context:
  type: "Human Rights Council"
  session: "32th session"
  public: True
schedule:
  date: "2016-06-22"
  time: "10:00"
  timezone: "Europe/Zurich"

The metadata in context and schedule have been derived from https://conf.unog.ch/digitalrecordings/en
s3_prefix: is the prefix or directory on S3, where the files for the media item are stored
media: points to the actual media file that is played in the media player:

media subkeys:

key: is the actual media file
type: can be videoor audio
format: format is the media file format: for videos mp4 is supported and for audio files wav.

Loading into Mongodb and Solr

Warning

Only do these steps on a fresh set up when you database is empty: otherwise it will mess up your existing data

Once the environment is setup, the commands to load data into the mongodb and Solr are the following per media item:

Then load the data from App S3:

python debates.py s3-to-mongo-solr HRC_20220328T0000

Start API Server

After this step the data should be available. You can now start the backend:

python debates.py serve

You will find the api documentation at http://localhost:8000/docs or as json file at http://localhost:8000/openapi.json For your convenience it has also been added to this documentation: api documentation