modos.genomics.htsget
htsget client implementation
The htsget protocol [1] streams slices of genomic data from a remote server. The client is implemented as a file-like interface that lazily streams chunks from the server.
In practice, the client sends a request for a file with a specific format and genomic region. The htsget server finds the byte ranges on the data server (e.g. S3) corresponding to the requests and responds with a “ticket”.
The ticket is a JSON document containing a list of blocks, each with headers and a URL pointing to the corresponding byte ranges on the data server.
The client then streams data from these URLs, effectively concatenating the blocks into a single stream.
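The flow described above can be illustrated with a minimal sketch. It assumes a ticket shaped like the JSON an htsget server returns, and it uses inline `data:` URLs in place of byte-range URLs on a real data server. `TICKET` and `iter_block_bytes` are hypothetical names, not part of this module.

```python
import base64
import json
from typing import Iterator

# Hypothetical ticket; inline "data:" URLs stand in for the byte-range
# URLs that a real htsget server would return.
TICKET = json.loads("""
{
  "htsget": {
    "format": "BAM",
    "urls": [
      {"url": "data:;base64,MTIzNDU2Nzg5"},
      {"url": "data:;base64,MTIzNDU2Nzg5Cg=="}
    ]
  }
}
""")


def iter_block_bytes(ticket: dict) -> Iterator[bytes]:
    """Yield the payload of each block in order, one block at a time."""
    for block in ticket["htsget"]["urls"]:
        url = block["url"]
        if not url.startswith("data:"):
            # A real client would send an HTTP GET here, using the block's headers.
            raise NotImplementedError("this sketch only handles data: URLs")
        _, _, payload = url.partition("base64,")
        yield base64.b64decode(payload)


# Concatenating the blocks in order reconstructs the requested slice.
data = b"".join(iter_block_bytes(TICKET))
print(data)
```

Because the generator yields one block at a time, a consumer can process the data without holding the whole slice in memory.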
Notes
This implementation differs from the reference GA4GH implementation [2] in that it allows lazily consuming chunks from a file-like interface without saving to a file. A downside of this approach is that the client cannot seek.
Additionally, this implementation does not support asynchronous fetching of blocks, which means that blocks are fetched sequentially.
References
Classes
Region | Genomic region consisting of a chromosome (aka reference) name and a 0-indexed half-open coordinate interval.
GenomicFileSuffix | Enumeration of all supported genomic file suffixes.
_HtsgetBlockIter | Transparent iterator over blocks of an htsget stream.
HtsgetStream | A file-like handle to a read-only, buffered htsget stream.
HtsgetConnection | Connection to an htsget resource.
Functions
read_pysam | Automatically instantiate a pysam file object from the input path and pass any additional keyword arguments to it.
build_htsget_url | Build an htsget URL from a host, path, and region.
parse_htsget_url | Given a URL to an htsget resource, extract the host, path, and region.
Module Contents
- class modos.genomics.htsget.Region
Genomic region consisting of a chromosome (aka reference) name and a 0-indexed half-open coordinate interval. The end may be omitted, in which case it is set to math.inf.
- to_htsget_query()
Serializes the region into an htsget URL query.
Example
>>> Region(chrom='chr1', start=0, end=100).to_htsget_query()
'referenceName=chr1&start=0&end=100'
- classmethod from_htsget_query(url)
Instantiate from an htsget URL query.
Example
>>> Region.from_htsget_query(
...     "http://localhost/htsget/reads/ex/demo1?format=CRAM&referenceName=chr1&start=0"
... )
Region(chrom='chr1', start=0, end=inf)
- Parameters:
url (str)
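The two query helpers above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the module's implementation: `region_to_query` and `region_from_query` are hypothetical names, a region is represented as a plain tuple, and leaving out `end` when it is infinite is an assumption made for this sketch.

```python
import math
from urllib.parse import parse_qs, urlencode, urlparse


def region_to_query(chrom: str, start: int, end: float = math.inf) -> str:
    """Serialize a region into htsget query parameters."""
    params = {"referenceName": chrom, "start": start}
    if end != math.inf:
        # Assumption: an open-ended region simply omits the end parameter.
        params["end"] = int(end)
    return urlencode(params)


def region_from_query(url: str) -> tuple[str, int, float]:
    """Extract (chrom, start, end) from the query string of an htsget URL."""
    query = parse_qs(urlparse(url).query)
    chrom = query["referenceName"][0]
    start = int(query.get("start", ["0"])[0])
    end = int(query["end"][0]) if "end" in query else math.inf
    return chrom, start, end


print(region_to_query("chr1", 0, 100))
print(region_from_query(
    "http://localhost/htsget/reads/ex/demo1?format=CRAM&referenceName=chr1&start=0"
))
```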
- classmethod from_ucsc(ucsc)
Instantiate from a UCSC-formatted region string.
Example
>>> Region.from_ucsc('chr-1ba:10-320')
Region(chrom='chr-1ba', start=10, end=320)
>>> Region.from_ucsc('chr1:-320')
Region(chrom='chr1', start=0, end=320)
>>> Region.from_ucsc('chr1:10-')
Region(chrom='chr1', start=10, end=inf)
>>> Region.from_ucsc('chr1:10')
Region(chrom='chr1', start=10, end=inf)
Note
For more information about the UCSC coordinate system, see: http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms
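The parsing behaviour shown in the examples above can be reproduced with a short sketch. `parse_ucsc` is a hypothetical helper that returns a plain tuple; it mirrors the documented outputs and performs no coordinate conversion, which the actual classmethod may handle differently.

```python
import math


def parse_ucsc(ucsc: str) -> tuple[str, int, float]:
    """Split 'chrom:start-end' into (chrom, start, end)."""
    # Split on the last colon so that chromosome names containing '-' survive.
    chrom, _, interval = ucsc.rpartition(":")
    if not chrom:
        raise ValueError(f"expected 'chrom:start-end', got {ucsc!r}")
    start_text, _, end_text = interval.partition("-")
    start = int(start_text) if start_text else 0  # missing start defaults to 0
    end = int(end_text) if end_text else math.inf  # missing end is open-ended
    return chrom, start, end


for text in ("chr-1ba:10-320", "chr1:-320", "chr1:10-", "chr1:10"):
    print(text, parse_ucsc(text))
```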
- classmethod from_pysam(record)
- Parameters:
record (pysam.VariantRecord | pysam.AlignedSegment)
- Return type:
- class modos.genomics.htsget.GenomicFileSuffix
Enumeration of all supported genomic file suffixes.
- CRAM = ('.cram',)
- BAM = ('.bam',)
- SAM = ('.sam',)
- VCF = ('.vcf', '.vcf.gz')
- BCF = ('.bcf',)
- FASTA = ('.fasta', '.fa')
- FASTQ = ('.fastq', '.fq')
- classmethod from_path(path)
- Parameters:
path (pathlib.Path)
- Return type:
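One way such an enumeration can resolve a path to a member is sketched below. `SuffixSketch` is a hypothetical stand-in that reuses the member values listed above; matching on the lower-cased end of the file name and raising `ValueError` for unknown suffixes are assumptions of this sketch.

```python
from enum import Enum
from pathlib import Path


class SuffixSketch(Enum):
    """Hypothetical enum whose values are tuples of accepted suffixes."""

    CRAM = (".cram",)
    BAM = (".bam",)
    SAM = (".sam",)
    VCF = (".vcf", ".vcf.gz")
    BCF = (".bcf",)
    FASTA = (".fasta", ".fa")
    FASTQ = (".fastq", ".fq")

    @classmethod
    def from_path(cls, path: Path) -> "SuffixSketch":
        name = path.name.lower()
        for member in cls:
            # str.endswith accepts a tuple, so compound suffixes such as
            # '.vcf.gz' match without inspecting Path.suffixes.
            if name.endswith(member.value):
                return member
        raise ValueError(f"unsupported genomic file suffix: {path.name}")


print(SuffixSketch.from_path(Path("data/sample.vcf.gz")))
```

Matching on the end of the name rather than on `Path.suffix` keeps file names with extra dots, such as `my.sample.bam`, working as expected.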
- modos.genomics.htsget.read_pysam(path, region=None, **kwargs)
Automatically instantiate a pysam file object from the input path and pass any additional keyword arguments to it.
- Parameters:
path (pathlib.Path)
region (Optional[modos.genomics.region.Region])
- Return type:
Iterator[pysam.AlignedSegment | pysam.VariantRecord]
- modos.genomics.htsget.build_htsget_url(host, path, region)
Build an htsget URL from a host, path, and region.
Examples
>>> build_htsget_url(
...     "http://localhost:8000",
...     Path("file.bam"),
...     Region("chr1", 0, 1000)
... )
'http://localhost:8000/reads/file?format=BAM&referenceName=chr1&start=0&end=1000'
- Parameters:
host (pydantic.HttpUrl)
path (pathlib.Path)
region (Optional[modos.genomics.region.Region])
- Return type:
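The URL in the example above can be assembled along the following lines. This is a rough sketch, not the module's code: `sketch_build_url` and `ENDPOINTS` are hypothetical, the region is a plain tuple, and the suffix-to-endpoint mapping is an assumption based on htsget's separate `reads` and `variants` endpoints.

```python
from pathlib import Path
from typing import Optional
from urllib.parse import urlencode

# Assumed mapping from file suffix to (endpoint, format).
ENDPOINTS = {
    ".bam": ("reads", "BAM"),
    ".cram": ("reads", "CRAM"),
    ".vcf.gz": ("variants", "VCF"),
    ".vcf": ("variants", "VCF"),
    ".bcf": ("variants", "BCF"),
}


def sketch_build_url(
    host: str, path: Path, region: Optional[tuple[str, int, int]] = None
) -> str:
    """Combine host, file path, and optional region into an htsget URL."""
    for suffix, (endpoint, fmt) in ENDPOINTS.items():
        if path.name.endswith(suffix):
            # The resource id is the path without its file suffix.
            resource = path.as_posix()[: -len(suffix)]
            break
    else:
        raise ValueError(f"unsupported file type: {path.name}")
    params = {"format": fmt}
    if region is not None:
        chrom, start, end = region
        params.update(referenceName=chrom, start=start, end=end)
    return f"{host.rstrip('/')}/{endpoint}/{resource}?{urlencode(params)}"


print(sketch_build_url("http://localhost:8000", Path("file.bam"), ("chr1", 0, 1000)))
```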
- modos.genomics.htsget.parse_htsget_url(url)
Given a URL to an htsget resource, extract the host, path, and region.
- Parameters:
url (pydantic.HttpUrl)
- Return type:
tuple[str, pathlib.Path, Optional[modos.genomics.region.Region]]
- class modos.genomics.htsget._HtsgetBlockIter(blocks, chunk_size=65536, timeout=60)
Transparent iterator over blocks of an htsget stream.
This is used internally by HtsgetStream to lazily fetch and concatenate blocks.
Examples
>>> next(_HtsgetBlockIter([
...     {"url": "data:;base64,MTIzNDU2Nzg5"},
...     {"url": "data:;base64,MTIzNDU2Nzg5"},
... ]))
b'123456789'
- class modos.genomics.htsget.HtsgetStream(blocks)
Bases: io.RawIOBase
A file-like handle to a read-only, buffered htsget stream.
Examples
>>> stream = HtsgetStream([
...     {"url": "data:;base64,MTIzNDU2Nzg5Cg=="},
...     {"url": "data:;base64,MTIzNDU2Nzg5Cg=="},
... ])
>>> stream.read(4)
b'1234'
- readable()
Return whether object was opened for reading.
If False, read() will raise OSError.
- Return type:
- readinto(b)
Read up to len(b) bytes into the writable bytes-like buffer b and return the number of bytes read.
Notes
See https://docs.python.org/3/library/io.html#io.RawIOBase.readinto
- Return type:
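To illustrate how a `readable()` and `readinto()` pair is enough to obtain a file-like object, here is a minimal sketch of an `io.RawIOBase` subclass over an iterable of byte chunks. `ChunkStream` is a hypothetical class written for this example; the actual `HtsgetStream` fetches its chunks from ticket blocks and may buffer differently.

```python
import io
from typing import Iterable


class ChunkStream(io.RawIOBase):
    """Hypothetical read-only stream over an iterable of byte chunks."""

    def __init__(self, chunks: Iterable[bytes]):
        super().__init__()
        self._chunks = iter(chunks)
        self._leftover = b""

    def readable(self) -> bool:
        return True

    def readinto(self, b) -> int:
        # Pull chunks until one is non-empty or the source is exhausted.
        while not self._leftover:
            try:
                self._leftover = next(self._chunks)
            except StopIteration:
                return 0  # zero bytes read signals end of stream
        n = min(len(b), len(self._leftover))
        b[:n] = self._leftover[:n]
        # Keep whatever did not fit for the next call.
        self._leftover = self._leftover[n:]
        return n


stream = ChunkStream([b"12345", b"6789\n"])
first = stream.read(4)   # read() is provided by RawIOBase via readinto()
rest = stream.readall()  # drains the remaining chunks
print(first, rest)
```

`io.RawIOBase` reports `seekable()` as False unless overridden, which matches the note that the client cannot seek.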
- class modos.genomics.htsget.HtsgetConnection
Connection to an htsget resource. It opens a stream to the resource and lazily fetches data from it.
- path: pathlib.Path
- region: modos.genomics.region.Region | None
- to_file(path)
Save all data from the stream to a file.
- Parameters:
path (pathlib.Path)
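Saving a stream to disk can be sketched with `shutil.copyfileobj`, which copies in fixed-size chunks rather than loading everything into memory. `save_stream` is a hypothetical helper, and an `io.BytesIO` object stands in for the htsget stream; how `to_file` is actually implemented is not shown in this reference.

```python
import io
import shutil
import tempfile
from pathlib import Path
from typing import BinaryIO


def save_stream(stream: BinaryIO, path: Path) -> None:
    """Copy all remaining bytes from a readable binary stream into a file."""
    with open(path, "wb") as output:
        shutil.copyfileobj(stream, output)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "slice.bam"
    # BytesIO is a stand-in for a real htsget stream.
    save_stream(io.BytesIO(b"123456789\n"), target)
    saved = target.read_bytes()

print(saved)
```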