An inputs
group should be present at the root of the HDF5 file, containing all information relating to the data inputs.
This includes the definition of the input files as well as summary statistics like the number of cells and features.
The group itself contains the parameters
and results
subgroups.
In this section, a single "input dataset" should contain counts for a set of cells across a consistent set of features. That is, each cell in an input dataset has a non-missing count for the same set of features, possibly across multiple modalities. An input dataset will usually contain per-feature annotations, and may also contain per-cell annotations. It may be represented by a single file (e.g., HDF5-based formats) or multiple files (e.g., MatrixMarket). Note that, if multiple input datasets are supplied, they need not have the same set of features.
The "loaded dataset" refers to the in-memory representation of the input dataset(s).
- The identities of the rows (features) of the loaded dataset may be a permutation or subset of the rows in the input dataset(s). Indeed for multiple inputs, the loaded dataset only contains the intersection of common features. However, a permutation or subset may also be applied to a single input.
- The identities of the columns (cells) in the loaded dataset is determined from the column order of the input dataset(s). For a single input, the column order in the loaded dataset is the same as that of the input. For multiple inputs, the column order in the loaded dataset is defined as a concatenation of the input datasets in the order supplied.
- If multiple inputs are supplied, each input dataset is assumed to contain cells in a single "block" (e.g., batch or sample). If only single input dataset is present, multiple blocks can be defined by a blocking factor in the per-cell annotation.
Definitions:
embedded
: whether the files for the input dataset(s) were embedded in the*.kana
file. Iffalse
, the input files are assumed to be linked instead.
parameters
should contain:
-
datasets
: a group containing information about the input datasets. Each child of this group is itself a group that corresponds to a single input dataset; each subgroup is named after an index from 0 toN - 1
, whereN
is the total number of datasets. (N
should be a positive integer.) Thedatasets/<dataset_index>
subgroup should itself contain:-
name
: scalar string containing the name of the input dataset. This is primarily used for display purposes, and should be unique across all children ofdatasets
. -
format
: scalar string specifying the file format for a single matrix. This is usually one of (but not limited to)"MatrixMarket"
,"10X"
or"H5AD"
. Other input dataset types may specify other values forformat
according to the application. -
files
: a group containing the file information for this input dataset. Each child of this group is itself a group that corresponds to a single file; each subgrouop is named after an index from 0 toF - 1
, whereF
is the total number of files for this input dataset. Eachfiles/<file_index>
subgroup should contain:-
type
: a scalar string specifying the type of the file. The requirements here depend on the input dataset'sformat
- the most common examples are listed below:- If
format = "MatrixMarket"
, there should be exactly onetype = "matrix"
corresponding to the (possibly Gzipped)*.mtx
file. There may be zero or onetype = "features"
, containing a (possibly Gzipped) TSV file with the Ensembl and gene symbols for each row. There may be zero or onetype = "cells"
, containing a (possibly Gzipped) TSV file with the annotations for each column. - If
format = "10X"
or"H5AD"
, there should be exactly onetype = "h5"
. - If
format = "SummarizedExperiment"
, there should be exactly onetype = "rds"
.
- If
-
name
: a scalar string specifying the file name. This can be any arbitrary string and is typically used to assist disambiguation.
If
embedded = true
, we additionally expect eachfiles/<file_index>
subgroup to contain:offset
: a scalar integer specifying where the file starts as an offset from the start of the remaining bytes section.size
: a non-negative scalar integer specifying the number of bytes in the file. Each file is represented by the byte range from[offset, offset + size)
in the remaining bytes section.
Byte ranges of files should be contiguous in the remaining bytes section, following the order in which they encountered (in
datasets
first, and then withinfiles
per input dataset).Alternatively, if
embedded = false
, we instead expect eachfiles/<file_index>
subgroup to contain:id
: a scalar string containing some unique identifier for this file. The interpretation ofid
is application-specific but usually refers to some cache or database.
-
Optionally, the
datasets/<dataset_index>
subgroup might contain:options
: a scalar string in JSON format. This encodes a JSON object containing arbitrary number of additional options for loading this input dataset.
-
Optionally, parameters
may contain a subset
group.
This may in turn contain a cells
group, specifying the subset of cells to be used in downstream steps.
The subset/cells
group should contain one of:
indices
, a 1-dimensional integer dataset specifying the indices of the cells to retain. Indices refer to columns of the loaded dataset, and should be unique and sorted. The length of this dataset should be equal to the number of cells inresults/num_cells
.field
, a string scalar specifying the field of the column annotations to use for filtering. This field can either be a continuous or categorical variable.- If categorical, the group should also contain
values
, a 1-dimensional string dataset specifying the allowable values for this field. The subset is defined as all cells with afield
value invalues
. - If continuous, the group should also contain
ranges
, a 2-dimensional float dataset specifying the ranges of allowable values. Each row defines a range[a, b]
wherea
is the value in the first column andb
is the value in the second column. Ranges should be non-overlapping (excepting the boundaries) and sorted in order of increasinga
. The subset is defined as all cells that fall in any of the specified ranges.
- If categorical, the group should also contain
For single input datasets, parameters
may also contain:
block_factor
: a string scalar specifying the field in the per-cell annotation that contains the sample blocking factor. If present, it is assumed that the matrix contains data for multiple samples.
results
should contain:
-
num_cells
: an integer scalar specifying the number of cells in the loaded dataset. This should count all cells if multiple input datasets are provided. Value should be positive. Ifsubset/indices
is present, its length should be equal tonum_cells
. -
num_blocks
: an integer scalar specifying the number of blocks in the loaded dataset.- If multiple input datasets are present, the number of input datasets should be equal to
num_blocks
. - If a single input dataset is present and
block_factor
is not provided,num_blocks
should be equal to 1. - If a single input dataset is present and
block_factor
is provided,num_blocks
may be any positive value.
- If multiple input datasets are present, the number of input datasets should be equal to
-
feature_identities
: a group containing 1-dimensional integer datasets, each named after a modality:"RNA"
for RNA data, if available."ADT"
for ADT data, if available."CRISPR"
for CRISPR data, if available.
Each dataset is of length equal to the number of features for its corresponding modality, and contains the identities of the features of that modality in the loaded dataset.
-
For a single input dataset, each feature identity is defined as a row index of the input dataset, for some definition of "row" that depends on the
format
of input dataset. The interpretations of the row indices for the most commonformat
types are listed here:- For
format = "MatrixMarket"
,"H5AD"
and"10X"
, each row is that of the stored count matrix. Thus, feature identities can be directly used to index into the feature annotation of these formats. - For
format = "SummarizedExperiment"
, each row index refers to the index of the main/alternative experiment corresponding to that modality. Thus, feature identities can only be used to indexrowData()
of the appropriate main/alternative experiment. The mapping between modalities and experiments should be specified in theoptions
of the relevantparameters/datasets
.
- For
-
For multiple input datasets, each feature identity is defined as a row index of the first input dataset. Note that not all rows of the first input may be present in the loaded dataset after taking the intersection.
The use of row indices here ensures that features can be disambiguated in the absence of a unique key in the feature annotations. Feature identities are parallel to the per-feature results in subsequent analysis steps, e.g.,
feature_selection
,marker_detection
.
Optionally, results
may contain:
feature_names
: a group containing 1-dimensional string datasets. Each dataset is named after a modality as inidentities
, is of length equal to the corresponding entry ofidentities
, and contains human-readable names for the features of its modality in the loaded dataset. Order of feature identifiers should be consistent with that inidentities
. Note that this field is only provided for ease of manual interpretation, and not all modalities may be represented. As human-readable names may be ambiguous or unavailable, applications should prefer to useidentities
instead.
Updated in version 3.0, with the following changes from the previous version:
- Standardized information for single/multiple inputs into a consistent format. This is done by turning each input dataset into its own group, and nesting the file information inside each input's group.
- Renamed
sample_factor
toblock_factor
, to avoid confusion between different concepts of samples. Similarly renamednum_samples
tonum_blocks
. - Renamed
identities
tofeature_identities
for clarity. - Removed
num_features
as this is made redundant by the lengths offeature_identities
. - Added
feature_names
to make the results easier to interpret during manual inspection, so that users do not have to map thefeature_identities
for each input format.