modos.genomics.htsget#

htsget client implementation

The htsget protocol [1] allows streaming slices of genomic data from a remote server. The client is implemented as a file-like interface that lazily streams chunks from the server.

In practice, the client sends a request for a file with a specific format and genomic region. The htsget server finds the byte ranges on the data server (e.g. S3) corresponding to the request and responds with a “ticket”.

The ticket is a JSON document containing a list of blocks, each having headers and a URL pointing to the corresponding byte ranges on the data server.

The client then streams data from these URLs, effectively concatenating the blocks into a single stream.
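The ticket mechanism described above can be sketched with the standard library alone. This is a minimal illustration, not the module's implementation: the `TICKET` document is hypothetical, shaped like the response body in the htsget specification, and data URIs stand in for real byte-range URLs so that no network access is needed. The helper names `fetch_block` and `concatenate` are assumptions made for this example.

```python
import base64
import json

# Hypothetical ticket, shaped like an htsget response body.
# Data URIs replace real byte-range URLs so the sketch runs offline.
TICKET = json.loads("""
{
  "htsget": {
    "format": "BAM",
    "urls": [
      {"url": "data:;base64,MTIzNDU2Nzg5"},
      {"url": "data:;base64,MTIzNDU2Nzg5Cg=="}
    ]
  }
}
""")


def fetch_block(block: dict) -> bytes:
    """Return the payload of one block (only data URIs in this sketch)."""
    url = block["url"]
    if not url.startswith("data:"):
        # A real client would issue an HTTP GET here, passing block.get("headers").
        raise NotImplementedError("only data: URIs are handled in this sketch")
    _, _, payload = url.partition(",")
    return base64.b64decode(payload)


def concatenate(ticket: dict) -> bytes:
    """Fetch blocks in ticket order and join them into one byte string."""
    return b"".join(fetch_block(block) for block in ticket["htsget"]["urls"])


data = concatenate(TICKET)
```

Unlike this sketch, which materializes everything in memory, the client documented here consumes the blocks lazily.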

Figure: htsget mechanism diagram. Illustration of the mechanism through which the htsget server allows streaming and random access on genomic files. See [1] for more details.

Notes

This implementation differs from the reference GA4GH implementation [2] in that it allows lazily consuming chunks from a file-like interface without saving to a file. A downside of this approach is that the client cannot seek.

Additionally, this implementation does not support asynchronous fetching of blocks, which means that blocks are fetched sequentially.

References

Classes#

Region

Genomic region consisting of a chromosome (aka reference) name and a 0-indexed half-open coordinate interval.

GenomicFileSuffix

Enumeration of all supported genomic file suffixes.

_HtsgetBlockIter

Transparent iterator over blocks of an htsget stream.

HtsgetStream

A file-like handle to a read-only, buffered htsget stream.

HtsgetConnection

Connection to an htsget resource.

Functions#

read_pysam(path[, region])

Automatically instantiate a pysam file object from the input path and pass any additional keyword arguments to it.

build_htsget_url(host, path, region)

Build an htsget URL from a host, path, and region.

parse_htsget_url(url)

Given a URL to an htsget resource, extract the host, path, and region.

Module Contents#

class modos.genomics.htsget.Region[source]#

Genomic region consisting of a chromosome (aka reference) name and a 0-indexed half-open coordinate interval. Note that the end may be left unspecified, in which case it is set to math.inf.

chrom: str#
start: int#
end: int | float#
__post_init__()[source]#
to_htsget_query()[source]#

Serializes the region into an htsget URL query.

Example

>>> Region(chrom='chr1', start=0, end=100).to_htsget_query()
'referenceName=chr1&start=0&end=100'
to_tuple()[source]#

Return the region as a simple tuple.

Return type:

tuple[str, Optional[int], Optional[int]]

classmethod from_htsget_query(url)[source]#

Instantiate from an htsget URL query.

Example

>>> Region.from_htsget_query(
...   "http://localhost/htsget/reads/ex/demo1?format=CRAM&referenceName=chr1&start=0"
... )
Region(chrom='chr1', start=0, end=inf)
Parameters:

url (str)
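The round trip between a region and an htsget query string can be sketched with `urllib.parse`. The functions `region_to_query` and `query_to_region` below are hypothetical stand-ins operating on plain tuples, not the methods of `Region`. Omitting `end` from the query when it is infinite, and defaulting `start` to 0, are assumptions of this sketch.

```python
import math
from urllib.parse import parse_qs, urlencode, urlparse


def region_to_query(chrom: str, start: int, end: float = math.inf) -> str:
    """Serialize a region into htsget query parameters."""
    params = {"referenceName": chrom, "start": start}
    if end != math.inf:
        # Assumption: an open-ended region simply leaves out the end parameter.
        params["end"] = int(end)
    return urlencode(params)


def query_to_region(url: str) -> tuple[str, int, float]:
    """Extract (chrom, start, end) from the query part of an htsget URL."""
    query = parse_qs(urlparse(url).query)
    chrom = query["referenceName"][0]
    start = int(query.get("start", ["0"])[0])
    end = int(query["end"][0]) if "end" in query else math.inf
    return chrom, start, end


query = region_to_query("chr1", 0, 100)
region = query_to_region(
    "http://localhost/htsget/reads/ex/demo1?format=CRAM&referenceName=chr1&start=0"
)
```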

classmethod from_ucsc(ucsc)[source]#

Instantiate from a UCSC-formatted region string.

Example

>>> Region.from_ucsc('chr-1ba:10-320')
Region(chrom='chr-1ba', start=10, end=320)
>>> Region.from_ucsc('chr1:-320')
Region(chrom='chr1', start=0, end=320)
>>> Region.from_ucsc('chr1:10-')
Region(chrom='chr1', start=10, end=inf)
>>> Region.from_ucsc('chr1:10')
Region(chrom='chr1', start=10, end=inf)

Note

For more information about the UCSC coordinate system, see: http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms

Parameters:

ucsc (str)

Return type:

Region
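The parsing behavior shown in the examples above can be reproduced with a few string operations. This is a sketch only: `parse_ucsc` is a hypothetical function returning a plain tuple, and its handling of malformed input (raising `ValueError` when no colon is present) is an assumption rather than documented behavior. Splitting on the last colon keeps hyphens in the chromosome name intact.

```python
import math


def parse_ucsc(ucsc: str) -> tuple[str, int, float]:
    """Parse 'chrom:start-end' into (chrom, start, end)."""
    # Split on the last colon; the chromosome name itself may contain hyphens.
    chrom, sep, interval = ucsc.rpartition(":")
    if not sep:
        raise ValueError(f"expected 'chrom:start-end', got {ucsc!r}")
    start_text, _, end_text = interval.partition("-")
    start = int(start_text) if start_text else 0       # missing start means 0
    end = int(end_text) if end_text else math.inf      # missing end means open-ended
    return chrom, start, end


parsed = parse_ucsc("chr-1ba:10-320")
```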

classmethod from_pysam(record)[source]#
Parameters:

record (pysam.VariantRecord | pysam.AlignedSegment)

Return type:

Region

overlaps(other)[source]#

Checks whether other overlaps self, i.e. whether any portion of other overlaps with self.

Parameters:

other (Region)

Return type:

bool

contains(other)[source]#

Checks if other is fully contained in self.

Parameters:

other (Region)

Return type:

bool
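The difference between overlap and containment on half-open intervals can be illustrated with a small dataclass. `Interval` is a hypothetical stand-in for `Region`, and the comparisons follow the usual half-open convention, under which intervals that merely touch do not overlap. Whether the actual methods treat that edge case the same way is an assumption.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Half-open interval [start, end) on a named chromosome."""

    chrom: str
    start: int
    end: float = math.inf

    def overlaps(self, other: "Interval") -> bool:
        # Two half-open intervals share a position iff each starts before the other ends.
        return (
            self.chrom == other.chrom
            and self.start < other.end
            and other.start < self.end
        )

    def contains(self, other: "Interval") -> bool:
        # Every position of other must lie within self.
        return (
            self.chrom == other.chrom
            and self.start <= other.start
            and other.end <= self.end
        )


a = Interval("chr1", 0, 100)
```

With these definitions, `a.overlaps(Interval("chr1", 99, 200))` holds, while `a.overlaps(Interval("chr1", 100, 200))` does not, because position 100 is excluded from `a`.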

class modos.genomics.htsget.GenomicFileSuffix[source]#

Bases: tuple, enum.Enum

Enumeration of all supported genomic file suffixes.

CRAM = ('.cram',)#
BAM = ('.bam',)#
SAM = ('.sam',)#
VCF = ('.vcf', '.vcf.gz')#
BCF = ('.bcf',)#
FASTA = ('.fasta', '.fa')#
FASTQ = ('.fastq', '.fq')#
classmethod from_path(path)[source]#
Parameters:

path (pathlib.Path)

Return type:

GenomicFileSuffix

get_index_suffix()[source]#

Return the supported index suffix for a genomic file type.

Return type:

str

to_htsget_endpoint()[source]#

Return the htsget endpoint for a genomic file type.

Return type:

str
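Matching a path against an enumeration whose values are tuples of suffixes, as `from_path` does, can be sketched as follows. `FileKind` is a hypothetical, reduced enumeration. It uses a plain `Enum` rather than the `tuple` mixin for simplicity, and raising `ValueError` for unknown suffixes is an assumption. Comparing against the file name, rather than `Path.suffix`, lets compound suffixes such as `.vcf.gz` match.

```python
from enum import Enum
from pathlib import Path


class FileKind(Enum):
    """Reduced, illustrative set of genomic file suffixes."""

    CRAM = (".cram",)
    BAM = (".bam",)
    VCF = (".vcf", ".vcf.gz")

    @classmethod
    def from_path(cls, path: Path) -> "FileKind":
        for kind in cls:
            # str.endswith accepts a tuple and matches any of its entries.
            if path.name.endswith(kind.value):
                return kind
        raise ValueError(f"unsupported genomic file suffix: {path.name}")


kind = FileKind.from_path(Path("calls/sample.vcf.gz"))
```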

modos.genomics.htsget.read_pysam(path, region=None, **kwargs)[source]#

Automatically instantiate a pysam file object from the input path and pass any additional keyword arguments to it.

Parameters:
Return type:

Iterator[pysam.AlignedSegment | pysam.VariantRecord]

modos.genomics.htsget.build_htsget_url(host, path, region)[source]#

Build an htsget URL from a host, path, and region.

Examples

>>> build_htsget_url(
...   "http://localhost:8000",
...   Path("file.bam"),
...   Region("chr1", 0, 1000)
... )
'http://localhost:8000/reads/file?format=BAM&referenceName=chr1&start=0&end=1000'
Parameters:
Return type:

str
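The URL layout in the example above (endpoint, file identifier without suffix, then format and region as query parameters) can be reproduced with a short function. `build_url` and the `ENDPOINTS` table are hypothetical. The mapping of BAM/CRAM to `reads` and VCF/BCF to `variants` follows the htsget specification, but how the module itself derives it is an assumption.

```python
from pathlib import Path
from urllib.parse import urlencode

# Assumed suffix-to-endpoint mapping, following the htsget specification.
ENDPOINTS = {".bam": "reads", ".cram": "reads", ".vcf": "variants", ".bcf": "variants"}


def build_url(host: str, path: Path, chrom: str, start: int, end: int) -> str:
    """Assemble an htsget ticket URL for a file and a bounded region."""
    suffix = path.suffix.lower()
    endpoint = ENDPOINTS[suffix]                 # KeyError for unsupported suffixes
    file_id = path.with_suffix("").as_posix()    # identifier is the path minus suffix
    query = urlencode(
        {
            "format": suffix.lstrip(".").upper(),
            "referenceName": chrom,
            "start": start,
            "end": end,
        }
    )
    return f"{host.rstrip('/')}/{endpoint}/{file_id}?{query}"


url = build_url("http://localhost:8000", Path("file.bam"), "chr1", 0, 1000)
```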

modos.genomics.htsget.parse_htsget_url(url)[source]#

Given a URL to an htsget resource, extract the host, path, and region.

Parameters:

url (pydantic.HttpUrl)

Return type:

tuple[str, pathlib.Path, Optional[modos.genomics.region.Region]]

class modos.genomics.htsget._HtsgetBlockIter(blocks, chunk_size=65536, timeout=60)[source]#

Transparent iterator over blocks of an htsget stream.

This is used internally by HtsgetStream to lazily fetch and concatenate blocks.

Examples

>>> next(_HtsgetBlockIter([
...     {"url": "data:;base64,MTIzNDU2Nzg5"},
...     {"url": "data:;base64,MTIzNDU2Nzg5"},
... ]))
b'123456789'
Parameters:

blocks (list[dict])

__iter__()[source]#
_consume_block()[source]#

Get streaming iterator over current block.

Return type:

Iterator[bytes]

__next__()[source]#

Stream next chunk of current block, or first chunk of next block.

Return type:

bytes
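The behavior of `__next__` (next chunk of the current block, or first chunk of the next block) can be approximated with a generator. This sketch is not the class's implementation: `iter_chunks` is a hypothetical function that only understands base64 data URIs and decodes each block fully before slicing it, whereas a real client would stream each URL over HTTP.

```python
import base64
from typing import Iterator


def iter_chunks(blocks: list[dict], chunk_size: int = 65536) -> Iterator[bytes]:
    """Yield chunks block by block; a chunk never spans two blocks."""
    for block in blocks:
        _, _, payload = block["url"].partition(",")
        data = base64.b64decode(payload)
        for offset in range(0, len(data), chunk_size):
            yield data[offset:offset + chunk_size]


blocks = [
    {"url": "data:;base64,MTIzNDU2Nzg5"},
    {"url": "data:;base64,MTIzNDU2Nzg5"},
]
chunks = list(iter_chunks(blocks, chunk_size=4))
```

With a chunk size larger than the block, the first item is the whole first block, which matches the example above.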

class modos.genomics.htsget.HtsgetStream(blocks)[source]#

Bases: io.RawIOBase

A file-like handle to a read-only, buffered htsget stream.

Examples

>>> stream = HtsgetStream([
...   {"url": "data:;base64,MTIzNDU2Nzg5Cg=="},
...   {"url": "data:;base64,MTIzNDU2Nzg5Cg=="},
... ])
>>> stream.read(4)
b'1234'
Parameters:

blocks (list[dict])

readable()[source]#

Return whether object was opened for reading.

If False, read() will raise OSError.

Return type:

bool

readinto(b)[source]#

Read up to len(b) bytes into the writable buffer b and return the number of bytes read.

Notes

See https://docs.python.org/3/library/io.html#io.RawIOBase.readinto

Return type:

int
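A common way to expose an iterator of byte chunks as a file-like object is to subclass `io.RawIOBase`, implement `readinto`, and wrap the result in `io.BufferedReader`. The class below is a minimal sketch of that pattern under the name `IterStream`, which is hypothetical. It is not the source of `HtsgetStream`.

```python
import io
from typing import Iterable


class IterStream(io.RawIOBase):
    """Read-only raw stream backed by an iterable of byte chunks."""

    def __init__(self, chunks: Iterable[bytes]):
        super().__init__()
        self._chunks = iter(chunks)
        self._leftover = b""

    def readable(self) -> bool:
        return True

    def readinto(self, b) -> int:
        # Refill from the iterator until there is data, skipping empty chunks.
        while not self._leftover:
            try:
                self._leftover = next(self._chunks)
            except StopIteration:
                return 0  # end of stream
        n = min(len(b), len(self._leftover))
        b[:n] = self._leftover[:n]
        self._leftover = self._leftover[n:]
        return n


# BufferedReader keeps calling readinto until the requested size is available.
stream = io.BufferedReader(IterStream([b"12", b"", b"3456789\n"]))
head = stream.read(4)
rest = stream.read()
```

Because the underlying iterator can only move forward, such a stream is not seekable, which is consistent with the limitation mentioned in the Notes.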

class modos.genomics.htsget.HtsgetConnection[source]#

Connection to an htsget resource. It allows opening a stream to the resource and lazily fetching data from it.

host: pydantic.HttpUrl[source]#
path: pathlib.Path[source]#
region: modos.genomics.region.Region | None[source]#
property url: str[source]#

URL to fetch the ticket.

Return type:

str

ticket()[source]#

Ticket containing the URLs to fetch the data.

Return type:

dict

open()[source]#

Open a connection to the stream data.

Return type:

io.RawIOBase

to_file(path)[source]#

Save all data from the stream to a file.

Parameters:

path (pathlib.Path)
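Saving a stream to disk without loading it fully into memory is typically done by copying in fixed-size chunks. The sketch below shows that idea with `shutil.copyfileobj`. `save_stream` is a hypothetical helper, and an in-memory `io.BytesIO` stands in for an open htsget stream. How `to_file` is actually implemented is not stated here.

```python
import io
import shutil
import tempfile
from pathlib import Path
from typing import BinaryIO


def save_stream(stream: BinaryIO, path: Path, chunk_size: int = 65536) -> int:
    """Copy a readable binary stream to path in chunks; return bytes written."""
    with open(path, "wb") as handle:
        shutil.copyfileobj(stream, handle, chunk_size)
    return path.stat().st_size


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "slice.bin"
    written = save_stream(io.BytesIO(b"123456789\n"), target)
    saved = target.read_bytes()
```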

classmethod from_url(url)[source]#

Open connection directly from an htsget URL.

Parameters:

url (str)

to_pysam(reference_filename=None)[source]#

Convert the stream to a pysam object.

Parameters:

reference_filename (Optional[str])

Return type:

Iterator[pysam.AlignedSegment | pysam.VariantRecord]