modos.api#

Classes#

LocalStorage

Helper class that provides a standard way to create an ABC using inheritance.

S3Storage

Helper class that provides a standard way to create an ABC using inheritance.

ElementType

Enumeration of all element types.

UserElementType

Enumeration of element types exposed to the user.

GenomicFileSuffix

Enumeration of all supported genomic file suffixes.

HtsgetConnection

Connection to an htsget resource.

Region

Genomic region consisting of a chromosome (aka reference) name and a 0-indexed half-open coordinate interval.

EndpointManager

Handle modos server endpoints.

MODO

Multi-Omics Digital Object

Functions#

attrs_to_graph(meta, uri_prefix)

Convert an attribute dictionary to an RDF graph of metadata.

add_metadata_group(parent_group, metadata)

Add input metadata dictionary to an existing zarr group.

list_zarr_items(group)

Recursively list all zarr groups and arrays

class_from_name(name)

dict_to_instance(element)

set_data_path(element[, source_file])

Set the data_path attribute to the modo root if it is not specified.

set_haspart_relationship(child_class, child_path, ...)

Add element to the hasPart attribute of a parent zarr group

update_haspart_id(element)

Update the id of the has_part property of an element to use the full id, including its type.

read_pysam(path[, region])

Automatically instantiate a pysam file object from the input path and pass any additional keyword arguments to it.

extract_metadata(instance, base_path)

Extract metadata from files associated with a model instance.

parse_attributes(path)

Load model specification from file into a list of dictionaries. Model types must be specified as @type

is_s3_path(path)

Check if a path is an S3 path

Module Contents#

modos.api.attrs_to_graph(meta, uri_prefix)[source]#

Convert an attribute dictionary to an RDF graph of metadata.

Parameters:
  • meta

  • uri_prefix

Return type:

rdflib.Graph

modos.api.add_metadata_group(parent_group, metadata)[source]#

Add input metadata dictionary to an existing zarr group.

Parameters:
  • parent_group

  • metadata

Return type:

None

modos.api.list_zarr_items(group)[source]#

Recursively list all zarr groups and arrays

Parameters:

group (zarr.hierarchy.Group)

Return type:

list[zarr.hierarchy.Group | zarr.core.Array]
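The recursive listing can be illustrated without zarr. The following is a minimal sketch, assuming nested dictionaries stand in for groups and any other value stands in for an array; `list_items` is a hypothetical helper, not the library's implementation.

```python
from typing import Any, Dict, List


def list_items(tree: Dict[str, Any], prefix: str = "") -> List[str]:
    """Depth-first walk returning the path of every group and array."""
    found: List[str] = []
    for name, node in tree.items():
        path = f"{prefix}/{name}" if prefix else name
        found.append(path)
        # Dictionaries play the role of groups, so descend into them.
        if isinstance(node, dict):
            found.extend(list_items(node, path))
    return found


# Hypothetical hierarchy: two groups, one of which holds an array.
tree = {
    "sample": {"sample1": {}},
    "data": {"demo1": {"coverage": [1, 2, 3]}},
}
paths = list_items(tree)
print(paths)
```

Each parent is listed before its children, so the result preserves the hierarchy order.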

class modos.api.LocalStorage(path)[source]#

Bases: Storage

Helper class that provides a standard way to create an ABC using inheritance.

Parameters:

path (pathlib.Path)

property zarr: zarr.hierarchy.Group#
Return type:

zarr.hierarchy.Group

property path: pathlib.Path#
Return type:

pathlib.Path

exists(target)[source]#
Parameters:

target (pathlib.Path)

Return type:

bool

list(target=None)[source]#
Parameters:

target (Optional[pathlib.Path])

remove(target)[source]#
Parameters:

target (pathlib.Path)

put(source, target)[source]#
Parameters:
  • source

  • target

class modos.api.S3Storage(path, s3_endpoint, s3_kwargs=None)[source]#

Bases: Storage

Helper class that provides a standard way to create an ABC using inheritance.

Parameters:
  • path (str)

  • s3_endpoint (pydantic.HttpUrl)

  • s3_kwargs (Optional[dict[str, Any]])

property path: pathlib.Path#
Return type:

pathlib.Path

property zarr: zarr.hierarchy.Group#
Return type:

zarr.hierarchy.Group

exists(target=ZARR_ROOT)[source]#
Parameters:

target (pathlib.Path)

Return type:

bool

list(target=None)[source]#
Parameters:

target (Optional[pathlib.Path])

Return type:

Generator[pathlib.Path, None, None]

remove(target)[source]#
Parameters:

target (pathlib.Path)

put(source, target)[source]#
Parameters:
  • source

  • target

modos.api.class_from_name(name)[source]#
Parameters:

name (str)

modos.api.dict_to_instance(element)[source]#
Parameters:

element (Mapping[str, Any])

Return type:

Any

class modos.api.ElementType[source]#

Bases: str, enum.Enum

Enumeration of all element types.

SAMPLE = 'sample'#
ASSAY = 'assay'#
DATA_ENTITY = 'data'#
REFERENCE_GENOME = 'reference'#
REFERENCE_SEQUENCE = 'sequence'#
get_target_class()[source]#

Return the target class for the element type.

Return type:

type

classmethod from_object(obj)[source]#

Return the element type from an object.

classmethod from_model_name(name)[source]#

Return the element type from an object name.

Parameters:

name (str)
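Because the enumeration derives from both str and enum.Enum, members can be looked up by value and compared directly with plain strings. The sketch below re-declares the documented values in a stand-in class; `ElementKind` and `scoped_id` are hypothetical names used only for illustration.

```python
from enum import Enum


class ElementKind(str, Enum):
    """Stand-in mirroring the documented ElementType values."""

    SAMPLE = "sample"
    ASSAY = "assay"
    DATA_ENTITY = "data"
    REFERENCE_GENOME = "reference"
    REFERENCE_SEQUENCE = "sequence"


def scoped_id(kind: ElementKind, short_id: str) -> str:
    """Build a type-scoped identifier such as 'sample/foo' (hypothetical helper)."""
    return f"{kind.value}/{short_id}"


# Lookup by value works as for any Enum.
kind = ElementKind("data")
print(kind is ElementKind.DATA_ENTITY)

# The str mixin allows comparison with plain strings.
print(kind == "data")
print(scoped_id(ElementKind.SAMPLE, "foo"))
```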

modos.api.set_data_path(element, source_file=None)[source]#

Set the data_path attribute to the modo root if it is not specified.

Parameters:
  • element

  • source_file

Return type:

dict

modos.api.set_haspart_relationship(child_class, child_path, parent_group)[source]#

Add element to the hasPart attribute of a parent zarr group

Parameters:
  • child_class

  • child_path

  • parent_group

class modos.api.UserElementType[source]#

Bases: str, enum.Enum

Enumeration of element types exposed to the user.

SAMPLE = 'sample'#
ASSAY = 'assay'#
DATA_ENTITY = 'data'#
REFERENCE_GENOME = 'reference'#
get_target_class()[source]#

Return the target class for the element type.

Return type:

type

classmethod from_object(obj)[source]#

Return the element type from an object.

modos.api.update_haspart_id(element)[source]#

Update the id of the has_part property of an element to use the full id, including its type.

Parameters:

element (modos_schema.datamodel.DataEntity | modos_schema.datamodel.Sample | modos_schema.datamodel.Assay | modos_schema.datamodel.ReferenceGenome | modos_schema.datamodel.MODO)

class modos.api.GenomicFileSuffix[source]#

Bases: tuple, enum.Enum

Enumeration of all supported genomic file suffixes.

CRAM = ('.cram',)#
BAM = ('.bam',)#
SAM = ('.sam',)#
VCF = ('.vcf', '.vcf.gz')#
BCF = ('.bcf',)#
FASTA = ('.fasta', '.fa')#
FASTQ = ('.fastq', '.fq')#
classmethod from_path(path)[source]#
Parameters:

path (pathlib.Path)

Return type:

GenomicFileSuffix

get_index_suffix()[source]#

Return the supported index suffix related to a genomic filetype

Return type:

str

to_htsget_endpoint()[source]#

Return the htsget endpoint for a genomic file type

Return type:

str
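Matching a path against multi-part suffixes such as .vcf.gz is easiest with str.endswith, which accepts a tuple of candidates. The sketch below shows one way from_path could behave; `SuffixSketch` is a hypothetical stand-in built on a plain Enum, and raising ValueError for unknown suffixes is an assumption rather than documented behaviour.

```python
from enum import Enum
from pathlib import Path


class SuffixSketch(Enum):
    """Stand-in holding the documented suffix tuples as member values."""

    CRAM = (".cram",)
    BAM = (".bam",)
    SAM = (".sam",)
    VCF = (".vcf", ".vcf.gz")
    BCF = (".bcf",)
    FASTA = (".fasta", ".fa")
    FASTQ = (".fastq", ".fq")

    @classmethod
    def from_path(cls, path: Path) -> "SuffixSketch":
        name = Path(path).name
        for member in cls:
            # str.endswith accepts a tuple, so all suffixes are tested at once.
            if name.endswith(member.value):
                return member
        # Assumption: unsupported suffixes are reported as an error.
        raise ValueError(f"Unsupported genomic file suffix: {name}")


print(SuffixSketch.from_path(Path("data/ex/demo1.cram")).name)
print(SuffixSketch.from_path(Path("calls.vcf.gz")).name)
```

Testing the full file name rather than Path.suffix matters for compressed files, since Path("calls.vcf.gz").suffix is only ".gz".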

modos.api.read_pysam(path, region=None, **kwargs)[source]#

Automatically instantiate a pysam file object from the input path and pass any additional keyword arguments to it.

Parameters:
  • path

  • region

  • kwargs

Return type:

Iterator[pysam.AlignedSegment | pysam.VariantRecord]

class modos.api.HtsgetConnection[source]#

Connection to an htsget resource. It allows opening a stream to the resource and lazily fetching data from it.

host: pydantic.HttpUrl#
path: pathlib.Path#
region: modos.genomics.region.Region | None#
property url: str#

URL to fetch the ticket.

Return type:

str

ticket()[source]#

Ticket containing the URLs to fetch the data.

Return type:

dict

open()[source]#

Open a connection to the stream data.

Return type:

io.RawIOBase

to_file(path)[source]#

Save all data from the stream to a file.

Parameters:

path (pathlib.Path)

classmethod from_url(url)[source]#

Open connection directly from an htsget URL.

Parameters:

url (str)

to_pysam(reference_filename=None)[source]#

Convert the stream to a pysam object.

Parameters:

reference_filename (Optional[str])

Return type:

Iterator[pysam.AlignedSegment | pysam.VariantRecord]

class modos.api.Region[source]#

Genomic region consisting of a chromosome (aka reference) name and a 0-indexed half-open coordinate interval. Note that the end may be left unspecified, in which case it is set to math.inf.

chrom: str#
start: int#
end: int | float#
__post_init__()[source]#
to_htsget_query()[source]#

Serializes the region into an htsget URL query.

Example

>>> Region(chrom='chr1', start=0, end=100).to_htsget_query()
'referenceName=chr1&start=0&end=100'
to_tuple()[source]#

Return the region as a simple tuple.

Return type:

tuple[str, Optional[int], Optional[int]]

classmethod from_htsget_query(url)[source]#

Instantiate from an htsget URL query

Example

>>> Region.from_htsget_query(
...   "http://localhost/htsget/reads/ex/demo1?format=CRAM&referenceName=chr1&start=0"
... )
Region(chrom='chr1', start=0, end=inf)
Parameters:

url (str)
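The round trip between a region and an htsget query string can be sketched with the standard library. The functions `region_to_query` and `region_from_query` are hypothetical and work on plain tuples; omitting the end parameter for an unbounded region is an assumption, not confirmed library behaviour.

```python
import math
from typing import Tuple
from urllib.parse import parse_qs, urlencode, urlparse


def region_to_query(chrom: str, start: int, end: float = math.inf) -> str:
    """Serialize a region into htsget query parameters."""
    params = {"referenceName": chrom, "start": start}
    # Assumption: an unbounded end is simply left out of the query.
    if end != math.inf:
        params["end"] = int(end)
    return urlencode(params)


def region_from_query(url: str) -> Tuple[str, int, float]:
    """Read referenceName, start and end back from an htsget URL."""
    query = parse_qs(urlparse(url).query)
    chrom = query["referenceName"][0]
    start = int(query.get("start", ["0"])[0])
    end = int(query["end"][0]) if "end" in query else math.inf
    return chrom, start, end


print(region_to_query("chr1", 0, 100))
print(region_from_query(
    "http://localhost/htsget/reads/ex/demo1?format=CRAM&referenceName=chr1&start=0"
))
```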

classmethod from_ucsc(ucsc)[source]#

Instantiate from a UCSC-formatted region string.

Example

>>> Region.from_ucsc('chr-1ba:10-320')
Region(chrom='chr-1ba', start=10, end=320)
>>> Region.from_ucsc('chr1:-320')
Region(chrom='chr1', start=0, end=320)
>>> Region.from_ucsc('chr1:10-')
Region(chrom='chr1', start=10, end=inf)
>>> Region.from_ucsc('chr1:10')
Region(chrom='chr1', start=10, end=inf)

Note

For more information about the UCSC coordinate system, see: http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms

Parameters:

ucsc (str)

Return type:

Region
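The four documented examples can be reproduced with two string partitions. This is a minimal sketch, assuming a missing start defaults to 0 and a missing end defaults to math.inf as the examples suggest; `parse_ucsc` is a hypothetical function returning a plain tuple.

```python
import math
from typing import Tuple


def parse_ucsc(ucsc: str) -> Tuple[str, int, float]:
    """Split 'chrom:start-end' into its parts, tolerating missing bounds."""
    # Split on the colon first so hyphens in the chromosome name survive.
    chrom, _, interval = ucsc.partition(":")
    start_text, _, end_text = interval.partition("-")
    start = int(start_text) if start_text else 0
    end = int(end_text) if end_text else math.inf
    return chrom, start, end


for text in ("chr-1ba:10-320", "chr1:-320", "chr1:10-", "chr1:10"):
    print(text, parse_ucsc(text))
```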

classmethod from_pysam(record)[source]#
Parameters:

record (pysam.VariantRecord | pysam.AlignedSegment)

Return type:

Region

overlaps(other)[source]#

Checks whether any portion of other overlaps with self.

Parameters:

other (Region)

Return type:

bool

contains(other)[source]#

Checks if other is fully contained in self.

Parameters:

other (Region)

Return type:

bool
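For 0-indexed half-open intervals, overlap and containment reduce to a few comparisons. The sketch below uses the conventional tests on a hypothetical `Interval` dataclass; it illustrates the semantics and is not taken from the library's source.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Hypothetical half-open region [start, end) on one chromosome."""

    chrom: str
    start: int
    end: float = math.inf

    def overlaps(self, other: "Interval") -> bool:
        # Two half-open intervals share a position only if each one
        # starts strictly before the other ends.
        return (
            self.chrom == other.chrom
            and self.start < other.end
            and other.start < self.end
        )

    def contains(self, other: "Interval") -> bool:
        # Containment requires both bounds of other to lie within self.
        return (
            self.chrom == other.chrom
            and self.start <= other.start
            and other.end <= self.end
        )


a = Interval("chr1", 0, 100)
print(a.overlaps(Interval("chr1", 50, 150)))
print(a.overlaps(Interval("chr1", 100, 200)))
print(a.contains(Interval("chr1", 10, 20)))
```

Adjacent intervals such as [0, 100) and [100, 200) do not overlap, because the end coordinate is excluded.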

modos.api.extract_metadata(instance, base_path)[source]#

Extract metadata from files associated with a model instance.

Parameters:

base_path (pathlib.Path)

Return type:

List

modos.api.parse_attributes(path)[source]#

Load a model specification from a file into a list of dictionaries. Model types must be specified with the @type key.

Parameters:

path (pathlib.Path)

Return type:

List[dict]
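The @type requirement can be illustrated with a JSON file. This is a minimal sketch, assuming the file holds either a single record or a top-level list of records; the actual file layout accepted by the library may differ, and `load_elements` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def load_elements(path: Path) -> List[Dict[str, Any]]:
    """Load records from a JSON file and check that each declares @type."""
    records = json.loads(Path(path).read_text())
    # Assumption: a single record is accepted and wrapped into a list.
    if isinstance(records, dict):
        records = [records]
    for record in records:
        if "@type" not in record:
            raise ValueError(f"Missing @type in record: {record}")
    return records


# Write an illustrative specification to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp:
    spec = Path(tmp) / "elements.json"
    spec.write_text(json.dumps([{"@type": "Sample", "id": "sample1"}]))
    elements = load_elements(spec)
print(elements)
```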

class modos.api.EndpointManager[source]#

Handle modos server endpoints. If a modos server URL is provided, it is used to detect the available service URLs. Alternatively, service URLs can be provided explicitly if no modos server is available.

Parameters:
  • modos – URL to the modos server.

  • services – Mapping of services to their urls.

Examples

>>> ex = EndpointManager(modos="http://modos.example.org") 
>>> ex.list() 
{
  's3': Url('http://s3.example.org/'),
  'htsget': Url('http://htsget.example.org/')
}
>>> ex.htsget 
Url('http://htsget.example.org/')
>>> ex = EndpointManager(services={"s3": "http://s3.example.org"})
>>> ex.s3
Url('http://s3.example.org/')
modos: pydantic.HttpUrl | None = None#
services: dict[str, pydantic.HttpUrl]#
list()[source]#

List available endpoints.

Return type:

dict[str, pydantic.HttpUrl]

property s3: pydantic.HttpUrl | None#
Return type:

Optional[pydantic.HttpUrl]

property htsget: pydantic.HttpUrl | None#
Return type:

Optional[pydantic.HttpUrl]
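The choice between server discovery and explicit services can be sketched as follows. `EndpointSketch` and its `discover` hook are hypothetical: the hook stands in for whatever request the library sends to the modos server, and plain strings replace pydantic URLs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class EndpointSketch:
    modos: Optional[str] = None
    services: Dict[str, str] = field(default_factory=dict)
    # Hypothetical hook: maps the server URL to its advertised services.
    discover: Optional[Callable[[str], Dict[str, str]]] = None

    def list(self) -> Dict[str, str]:
        """Prefer server discovery, fall back to the explicit mapping."""
        if self.modos and self.discover:
            return self.discover(self.modos)
        return dict(self.services)

    @property
    def s3(self) -> Optional[str]:
        return self.list().get("s3")

    @property
    def htsget(self) -> Optional[str]:
        return self.list().get("htsget")


explicit = EndpointSketch(services={"s3": "http://s3.example.org"})
print(explicit.s3, explicit.htsget)

remote = EndpointSketch(
    modos="http://modos.example.org",
    discover=lambda url: {"htsget": "http://htsget.example.org"},
)
print(remote.htsget)
```

A service that is not available resolves to None, matching the Optional return types documented for the s3 and htsget properties.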

modos.api.is_s3_path(path)[source]#

Check if a path is an S3 path

Parameters:

path (str)
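One way to perform such a check is to inspect the URL scheme. This is a sketch under the assumption that S3 paths are written as s3:// URLs; `looks_like_s3_path` is a hypothetical name, and the library may use a different test.

```python
from urllib.parse import urlparse


def looks_like_s3_path(path: str) -> bool:
    """Return True when the path uses the s3:// scheme."""
    return urlparse(path).scheme == "s3"


print(looks_like_s3_path("s3://bucket/ex"))
print(looks_like_s3_path("data/ex"))
```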

class modos.api.MODO(path, id=None, name=None, description=None, creation_date=date.today(), last_update_date=date.today(), has_assay=[], source_uri=None, endpoint=None, s3_kwargs=None, services=None)[source]#

Multi-Omics Digital Object. A digital archive containing multi-omics data and records connected by zarr-backed metadata.

Parameters:
  • path (Union[pathlib.Path, str]) – Path to the archive directory.

  • id (Optional[str]) – MODO identifier. Defaults to the directory name.

  • name (Optional[str]) – Human-readable name.

  • description (Optional[str]) – Human-readable description.

  • creation_date (datetime.date) – When the MODO was created.

  • last_update_date (datetime.date) – When the MODO was last updated.

  • has_assay (List) – Existing assay identifiers to attach to MODO.

  • source_uri (Optional[str]) – URI of the source data.

  • endpoint (Optional[pydantic.HttpUrl]) – URL to the modos server.

  • s3_kwargs (Optional[dict[str, Any]]) – Keyword arguments for the S3 storage.

  • services (Optional[dict[str, pydantic.HttpUrl]]) – Optional dictionary of service endpoints.

storage#

Storage backend for the archive.

Type:

Storage

endpoint#

Server endpoint manager.

Type:

EndpointManager

Examples

>>> demo = MODO("data/ex")

# List identifiers of samples in the archive
>>> demo.list_samples()
['sample/sample1']

# List files in the archive
>>> files = sorted(demo.list_files())
>>> assert Path('data/ex/demo1.cram') in files
>>> assert Path('data/ex/reference1.fa') in files

property zarr: zarr.hierarchy.Group[source]#
Return type:

zarr.hierarchy.Group

property path: pathlib.Path[source]#
Return type:

pathlib.Path

property metadata: dict[source]#
Return type:

dict

knowledge_graph(uri_prefix=None)[source]#

Return an RDF graph of the metadata. All identifiers are converted to valid URIs if needed.

Parameters:

uri_prefix (Optional[str])

Return type:

rdflib.Graph

show_contents(element=None)[source]#

Produces a YAML document of the object’s contents.

Parameters:

element (Optional[str]) – Element, or group of elements (e.g. data or data/element_id) to show. If not provided, shows the metadata of the entire MODO.

Return type:

str

list_files()[source]#

Lists files in the archive recursively (except for the zarr file).

Return type:

List[pathlib.Path]

list_arrays(element=None)[source]#

Views arrays in the archive recursively.

Parameters:

element (Optional[str]) – Element, or group of elements (e.g. data or data/element_id) to show. If not provided, shows the arrays of the entire MODO.

Return type:

zarr.hierarchy.TreeViewer

query(query)[source]#

Use SPARQL to query the metadata graph

Parameters:

query (str)

list_samples()[source]#

Lists samples in the archive.

update_date(date=date.today())[source]#

Update the last_update_date attribute.

Parameters:

date (datetime.date)

remove_element(element_id)[source]#

Remove an element from the archive, along with any files directly attached to it and links from other elements to it.

Parameters:

element_id (str)

remove_object()[source]#

Remove the complete modo object

add_element(element, source_file=None, part_of=None)[source]#

Add an element to the archive. If a data file is provided, it will be added to the archive. If the element is part of another element, the parent metadata will be updated.

Parameters:
  • element (modos_schema.datamodel.DataEntity | modos_schema.datamodel.Sample | modos_schema.datamodel.Assay | modos_schema.datamodel.ReferenceGenome) – Element to add to the archive.

  • source_file (Optional[pathlib.Path]) – File to associate with the element.

  • part_of (Optional[str]) – Id of the parent element. It must be scoped to the type. For example “sample/foo”.

_add_any_element(element, source_file=None, part_of=None)[source]#

Add an element of any type to the storage.

Parameters:
  • element (modos_schema.datamodel.DataEntity | modos_schema.datamodel.Sample | modos_schema.datamodel.Assay | modos_schema.datamodel.ReferenceSequence | modos_schema.datamodel.ReferenceGenome)

  • source_file (Optional[pathlib.Path])

  • part_of (Optional[str])

update_element(element_id, new)[source]#

Update element metadata in place by adding new values from model object.

Parameters:
  • element_id (str) – Full id path in the zarr store.

  • new (modos_schema.datamodel.DataEntity | modos_schema.datamodel.Sample | modos_schema.datamodel.Assay | modos_schema.datamodel.MODO) – Element containing the enriched metadata.

enrich_metadata()[source]#

Enrich MODO metadata in place using content from associated data files.

stream_genomics(file_path, region=None, reference_filename=None)[source]#

Slices both local and remote CRAM, VCF (.vcf.gz), and BCF files returning an iterator over records.

Parameters:
  • file_path (str)

  • region (Optional[str])

  • reference_filename (Optional[str])

Return type:

Iterator[pysam.AlignedSegment | pysam.VariantRecord]

classmethod from_file(config_path, object_path, endpoint=None, s3_kwargs=None, services=None, no_remove=False)[source]#

Build a MODO from a YAML or JSON file.

Parameters:
  • config_path (pathlib.Path)

  • object_path (str)

  • endpoint (Optional[pydantic.HttpUrl])

  • s3_kwargs (Optional[dict])

  • services (Optional[dict[str, pydantic.HttpUrl]])

  • no_remove (bool)

Return type:

MODO