Working with genomics data
Genomic data can reach large volumes and is typically stored in domain-specific file formats such as CRAM, BAM or VCF. In MODOs, genomics files are linked to a metadata element and stored directly within the object. To access region-specific information without downloading the entire file, the remote storage is linked to an htsget server that allows secure streaming over the network.
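To give a sense of what a region-specific request looks like, here is a minimal sketch that builds a request URL for the `reads` endpoint defined by the htsget protocol, using its `format`, `referenceName`, `start` and `end` query parameters. The base URL and the record identifier are hypothetical placeholders: how a MODOs deployment exposes its htsget server and names its records is an assumption here, and the MODOs client normally issues these requests for you.

```python
from urllib.parse import quote, urlencode


def htsget_reads_url(base_url, record_id, reference_name, start=None, end=None, fmt="CRAM"):
    """Build an htsget 'reads' request URL for one genomic region.

    base_url and record_id are hypothetical and depend on the deployment.
    """
    params = {"format": fmt, "referenceName": reference_name}
    # htsget coordinates are 0-based, start inclusive, end exclusive; both are optional.
    if start is not None:
        params["start"] = start
    if end is not None:
        params["end"] = end
    return f"{base_url.rstrip('/')}/reads/{quote(record_id, safe='')}?{urlencode(params)}"


# Hypothetical server location and record id, for illustration only.
url = htsget_reads_url("http://localhost/htsget", "demo1", "BA000007.3", start=0, end=1000)
print(url)
```

The server answers such a request with a JSON ticket listing the URLs of the data blocks that cover the region, so only the relevant parts of the file travel over the network.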
Data streaming
MODOs supports streaming of data from CRAM, BAM, VCF and BCF files to access specific genomic regions. In MODOs, streaming is available from both the Python API and the CLI.
from modos.api import MODO

# Load MODO from remote storage
modo = MODO(path="s3://modos-demo/ex", endpoint="http://localhost")
# Stream a specific region
modo.stream_genomics(file_path="demo1.cram", region="BA000007.3")
# Stream chromosome BA000007.3 from modos-demo/ex/demo1.cram
modos --endpoint http://localhost stream --region BA000007.3 s3://modos-demo/ex/demo1.cram
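The CLI output can also be consumed from a script. The following is a minimal sketch that pipes the `modos ... stream` command shown above into `samtools view -c -` to count the records in the streamed region. The helper names `build_stream_command` and `count_streamed_records` are hypothetical, and the sketch assumes that both `modos` and `samtools` are on `PATH` and that samtools can decode the streamed bytes from standard input (for CRAM this may require access to the reference sequence).

```python
import shlex
import subprocess


def build_stream_command(endpoint, region, path):
    """Return the modos CLI invocation from the example above as an argument list."""
    return ["modos", "--endpoint", endpoint, "stream", "--region", region, path]


def count_streamed_records(endpoint, region, path):
    """Pipe the streamed region into `samtools view -c -` and return the record count.

    Assumes `modos` and `samtools` are installed; nothing is written to disk.
    """
    producer = subprocess.Popen(
        build_stream_command(endpoint, region, path), stdout=subprocess.PIPE
    )
    consumer = subprocess.run(
        ["samtools", "view", "-c", "-"],
        stdin=producer.stdout,
        capture_output=True,
        text=True,
        check=True,
    )
    producer.stdout.close()
    if producer.wait() != 0:
        raise RuntimeError("modos stream exited with a non-zero status")
    return int(consumer.stdout.strip())


cmd = build_stream_command("http://localhost", "BA000007.3", "s3://modos-demo/ex/demo1.cram")
print(shlex.join(cmd))
```

Calling `count_streamed_records` with the same three arguments would run the pipeline, provided the demo server and both tools are available.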
Warning
We highly recommend using the MODOs CLI for streaming. Its output can be passed directly to tools like samtools. Streaming with the MODOs Python API returns a pysam object. Because pysam cannot read from byte streams, the streamed region is first written to a temporary file and then parsed by pysam. For large files or regions, this can be slow and consume substantial disk space.
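The cost mentioned in the warning comes from materialising the whole stream on disk before it can be parsed. A minimal sketch of that general pattern follows; it is not the MODOs implementation, `spool_to_tempfile` is a hypothetical helper, and the two small byte chunks stand in for a real streamed region.

```python
import os
import tempfile


def spool_to_tempfile(chunks, suffix=".cram"):
    """Write an iterable of byte chunks to a temporary file and return its path.

    Illustrative only: every streamed byte ends up on local disk.
    """
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        for chunk in chunks:
            tmp.write(chunk)
        return tmp.name


# Stand-in for a streamed region; a real region can be many gigabytes.
fake_stream = [b"abc", b"defg"]
path = spool_to_tempfile(fake_stream)
size = os.path.getsize(path)  # bytes on disk equal the bytes streamed
os.remove(path)  # the caller is responsible for cleaning up
print(size)
```

Disk usage grows linearly with the size of the requested region, which is why piping the CLI output straight into a downstream tool scales better.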