Working with genomics data

Genomic data can reach large volumes and is typically stored in domain-specific file formats such as CRAM, BAM or VCF. In MODOs, genomics files are linked to a metadata element and stored directly within the object. To give access to region-specific information without downloading the entire file, the remote storage is linked to an htsget server that allows secure streaming over the network.
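To make the region-based access concrete, here is a minimal sketch of how a region request can be expressed as an htsget "reads" ticket URL. The base URL `http://localhost/htsget` and the file identifier layout are assumptions for illustration, not the paths used by a MODOs deployment. The helper `htsget_reads_url` is hypothetical; only the query parameter names (`format`, `referenceName`, `start`, `end`) come from the htsget protocol.

```python
from urllib.parse import quote, urlencode


def htsget_reads_url(base_url, file_id, reference_name, start=None, end=None, fmt="CRAM"):
    """Build an htsget 'reads' ticket URL for one genomic region.

    base_url and file_id are hypothetical; adapt them to the actual server.
    """
    params = {"format": fmt, "referenceName": reference_name}
    # htsget coordinates are 0-based and half-open; both bounds are optional.
    if start is not None:
        params["start"] = start
    if end is not None:
        params["end"] = end
    return f"{base_url.rstrip('/')}/reads/{quote(file_id)}?{urlencode(params)}"


# Assumed server location and identifier, for illustration only.
url = htsget_reads_url("http://localhost/htsget", "ex/demo1.cram", "BA000007.3", start=0, end=5000)
print(url)
```

An htsget server answers such a request with a JSON ticket listing the URLs of the data blocks that cover the region, so a client only fetches the parts of the file it needs.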

Data streaming

MODOs supports streaming data from CRAM, BAM, VCF and BCF files to access specific genomic regions. The examples below show the same request through the Python API and the CLI.

from modos.api import MODO

# Load MODO from remote storage
modo = MODO(path="s3://modos-demo/ex", endpoint="http://localhost")

# Stream a specific region
modo.stream_genomics(file_path="demo1.cram", region="BA000007.3")

# Stream chromosome BA000007.3 from modos-demo/ex/demo1.cram
modos --endpoint http://localhost stream --region BA000007.3 s3://modos-demo/ex/demo1.cram

Warning

We highly recommend using the MODOs CLI for streaming. Its output can be passed directly to tools like samtools. Streaming with the MODOs Python API returns a pysam object. Because pysam cannot read from byte streams, the streamed region is first written to a temporary file and then opened with pysam. For large files or regions, this can be slow and use substantial temporary disk space.
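The temporary-file step described in this warning can be sketched as follows. This is an illustration of the general pattern using only the standard library, not the MODOs implementation. The helper `spool_to_tempfile` and the stand-in byte chunks are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def spool_to_tempfile(chunks: Iterable[bytes], suffix: str = ".cram") -> Path:
    """Write a byte stream to a temporary file and return its path."""
    # delete=False keeps the file after the handle is closed, so a
    # path-based reader (such as pysam) can open it afterwards.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        for chunk in chunks:
            tmp.write(chunk)
    return Path(tmp.name)


# Stand-in for a streamed region; real data would arrive over the network.
fake_stream = [b"first-chunk", b"second-chunk"]
path = spool_to_tempfile(fake_stream)
print(path, path.stat().st_size)  # the whole region now occupies disk space
path.unlink()  # the caller is responsible for removing the file
```

Every streamed byte is written to disk before parsing starts, so the cost grows with the size of the requested region. Piping the CLI output into a downstream tool avoids this intermediate copy.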