Working with genomics data

Genomic data can reach large volumes and is typically stored in domain-specific file formats such as CRAM, BAM or VCF. In MODOs, genomics files are linked to a metadata element and stored directly within the object. To access region-specific information without downloading the entire file, the remote storage is linked to an htsget server that allows secure streaming over the network.
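For orientation, an htsget region request is an ordinary HTTP GET whose query parameters name the reference sequence and, optionally, a coordinate range. The following is a minimal sketch of how such a URL is put together according to the htsget specification; it is not how MODOs builds its requests internally. The function name `htsget_url`, the base URL `http://localhost/htsget` and the record id `ex/demo1` are assumptions made for illustration.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def htsget_url(
    base_url: str,
    record_id: str,
    reference_name: str,
    start: Optional[int] = None,
    end: Optional[int] = None,
    kind: str = "reads",
    fmt: str = "CRAM",
) -> str:
    """Build an htsget request URL for one genomic region.

    In htsget, `start` is 0-based inclusive and `end` is 0-based exclusive.
    `kind` selects the endpoint: "reads" (BAM/CRAM) or "variants" (VCF/BCF).
    """
    if kind not in ("reads", "variants"):
        raise ValueError(f"unknown htsget endpoint: {kind!r}")
    params = {"format": fmt, "referenceName": reference_name}
    if start is not None:
        params["start"] = start
    if end is not None:
        params["end"] = end
    # Hypothetical server layout: <base_url>/<kind>/<record_id>
    return f"{base_url.rstrip('/')}/{kind}/{quote(record_id)}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical server and record id, for illustration only.
    print(htsget_url("http://localhost/htsget", "ex/demo1", "BA000007.3", 0, 1000))
```

The server answers such a request with a JSON "ticket" listing the URLs of the data blocks to fetch, so only the requested region travels over the network.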

Data streaming

MODOs supports streaming data from CRAM, BAM, VCF and BCF files to access specific genomic regions:

from modos.api import MODO

# Load MODO from remote storage
modo = MODO(path="s3://modos-demo/ex", endpoint="http://localhost")

# Stream a specific region
modo.stream_genomics(file_path="demo1.cram", region="BA000007.3")

# Stream chromosome BA000007.3 from modos-demo/ex/demo1.cram
modos --endpoint http://localhost stream --region BA000007.3 s3://modos-demo/ex/demo1.cram

Warning

We highly recommend using the MODOs CLI for streaming, because its output can be passed directly to tools like samtools. Streaming with the MODOs Python API returns a pysam object. pysam cannot read from byte streams, so the streamed region is written to a temporary file before being passed to pysam. For large files or regions this can cause issues.
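The temporary-file step described in this warning can be pictured with the sketch below. It uses only the standard library and is not the MODOs implementation; `spill_to_tempfile` and its parameters are hypothetical names. It copies an iterable of byte chunks to disk so that a path-based parser can open the result, which is why disk usage grows with the size of the streamed region.

```python
import os
import tempfile
from typing import Iterable


def spill_to_tempfile(chunks: Iterable[bytes], suffix: str = ".cram") -> str:
    """Write streamed byte chunks to a temporary file and return its path.

    The caller is responsible for deleting the file when done.
    """
    # delete=False keeps the file on disk after closing, so that a
    # path-based reader can reopen it.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        for chunk in chunks:
            tmp.write(chunk)
        return tmp.name


if __name__ == "__main__":
    path = spill_to_tempfile([b"first ", b"second"], suffix=".bin")
    try:
        print(path, os.path.getsize(path))
    finally:
        os.remove(path)  # clean up the temporary copy
```

Piping the CLI output straight into a downstream tool avoids this intermediate copy altogether.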

Data encryption and decryption

Genomic data is typically sensitive, and sharing it increases the risk to data security. MODOs supports Crypt4GH, an encryption format developed by the Global Alliance for Genomics and Health (GA4GH). Crypt4GH is based on authenticated envelope encryption, which encrypts both the data itself and the key needed to decrypt it.

In MODOs, all genomic files can be encrypted or decrypted with a single command:

from modos.api import MODO

# Load local MODO
modo = MODO(path = "data/ex")

# Show all files
modo.list_files()
# [PosixPath('data/ex/demo1.cram'),
# PosixPath('data/ex/demo1.cram.crai')]

# Encrypt genomic files using the public key stored at "path/to/recipient.pub"
modo.encrypt("path/to/recipient.pub")

# Files were encrypted
modo.list_files()
# [PosixPath('data/ex/demo1.cram.c4gh'),
# PosixPath('data/ex/demo1.cram.crai.c4gh')]

# Decrypt genomic files using the secret key stored at "path/to/recipient.sec"
modo.decrypt("path/to/recipient.sec")

# Files were decrypted
modo.list_files()
# [PosixPath('data/ex/demo1.cram'),
# PosixPath('data/ex/demo1.cram.crai')]

# Encrypt genomic files in data/ex using the public key stored at "path/to/recipient.pub"
modos c4gh encrypt -p path/to/recipient.pub data/ex

# Decrypt encrypted files in data/ex using the secret key stored at "path/to/recipient.sec"
modos c4gh decrypt -s path/to/recipient.sec data/ex

Note

Only local MODOs can be encrypted or decrypted; remote objects cannot.
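To confirm that a local file really is in Crypt4GH format, rather than relying on the `.c4gh` suffix, one can inspect its first bytes. The Crypt4GH specification starts every file with the 8-byte magic string `crypt4gh`, followed by a little-endian 32-bit version number (currently 1) and a 32-bit header packet count. The helper below is a minimal sketch based on that layout; `looks_like_crypt4gh` is a hypothetical name and is not part of the MODOs API.

```python
import struct

C4GH_MAGIC = b"crypt4gh"  # magic bytes at the start of every Crypt4GH file


def looks_like_crypt4gh(path: str) -> bool:
    """Return True if the file begins with a Crypt4GH version 1 header."""
    with open(path, "rb") as fh:
        head = fh.read(16)
    # Need magic (8 bytes) + version (4 bytes) + packet count (4 bytes).
    if len(head) < 16 or head[:8] != C4GH_MAGIC:
        return False
    version, _packet_count = struct.unpack("<II", head[8:16])
    return version == 1


if __name__ == "__main__":
    import sys

    for name in sys.argv[1:]:
        print(name, looks_like_crypt4gh(name))
```

This only checks the file header; it does not verify that the contents can be decrypted with a particular key.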