Handling data arrays with MODOS#

Any count-like data, e.g protein abundances, RNA counts, metabolomic measurements, etc. can be stored as arrays in the MODO. The underlying zarr supports array creation as well as an interface to NumPy arrays.

Load data#

Using panda DataFrames#

Count-like data can usually be loaded into pandas DataFrame. To keep column names (observations) and row names (variables) both need to be stored in a separate numpy array first:

import pandas as pd
import numpy as np

# Example of RNA-seq count data

rna_count = pd.read_csv('/path/to/rna/counts.csv', index_col="gene")
rna_count
# rna_count
#             time1  time2  time3    ...
# gene
# Xkr4        1891   2410   2159     ...
# Rp1         2      2      0        ...
# ...         ...    ...    ...      ...
# TrnP        334    202    218      ...

obs = rna_count.columns.to_numpy()
var = rna_count.index.to_numpy()
rna_array = rna_count.to_numpy()

obs
# array(['time1', 'time2', 'time3', ...], dtype=object)

var
# array(['Xkr4', 'Rp1', ..., 'TrnP'], dtype=object)

rna_array
# array([[1891, 2410, 2159, ...],
#        [   2,    2,    0, ...],
#        ...,
#        [ 334,  202,  218, ...]])

Warning

to_numpy() automatically removes row and column names from pandas DataFrames. It is important to store them separately, if they contain important information.

Note

Skip this section, if you already have your data in a NumPy array.

Add array element to a MODO#

Next, an element with the metadata describing the array can be added to the MODO:

from modos.api import MODO
import modos_schema.datamodel as model

# load modo - example at "data/ex"
modo= MODO("data/ex")

# Generate an Array element
array_element = model.Array(id="rna1", name= "RNA raw counts", description = "RNA counts from multiple timepoints", has_sample="sample/sample1", data_format = "Zarr", data_path="data/ex/data/rna1")

# Add element to modo
modo.add_element(element = array_element)

# Check the modo structure
modo.list_arrays()
#/
# ├── assay
# ├── data
# │   └── rna1
# ├── reference
# └── sample
#     └── sample1

Note

Skip this step, if you want to add the count data to an already existing element in the MODO. A helper function to facilitate adding the metadata element and numpy array in one step will also be added in future releases.

Add array to a MODO#

Finally all arrays can be added to the modo element:

modo.archive["data/rna1"].create_dataset("data", data=rna_array)
modo.archive["data/rna1"].create_dataset("obs", data=obs)
modo.archive["data/rna1"].create_dataset("var", data=var, object_codec=numcodecs.JSON())

# update zarr metadata
zarr.consolidate_metadata(modo.store)

# check the new structure
modo.list_arrays()
#/
# ├── assay
# ├── data
# │   └── rna1
# │       └── data (1473,3) float64
# │       └── obs (3,) object
# │       └── var (1473,) object
# ├── reference
# └── sample
#     └── sample1

Access Array data#

Load array as pandas DataFrame#

To access array data and analyse them the separated arrays can be loaded into a pandas dataframe:

import pandas as pd

rna_array = modo.archive["data/rna1/data"][:]
obs = modo.archive["data/rna1/obs"][:]
var = modo.archive["data/rna1/var"][:]

rna_counts = pd.DataFrame(rna_array, index=var, columns=obs)