We have developed a compartmentalized schema for storing all required aspects of a taxonomy. Each field in the AIT schema is associated with a broad category term (described below), and together these categories make up the whole AIT file format.
(Note: A previous version of this standard is available as a Google Doc).
Schema category terms
Described here are the broad categories that all fields are associated with.
Data: Includes anything critical for understanding the cell by gene matrix and to link it with other components. Includes data (raw and processed), gene information, and cell identifiers. Note that for the purposes of this schema, we are excluding raw data (fastq, bam files, etc.) from consideration and are starting from the count matrix.
Assigned metadata: Includes cell-level metadata that is assigned at some point between a cell leaving the donor and becoming a value in the data, and (in theory) can be ENTIRELY captured by values in Allen Institute, BICAN, or related standardized pipelines. It includes fields that describe: donor metadata, experimental protocols, dissection information, RNA QC metrics, and sequencing metadata.
Calculated metadata: Includes any cell-level or cluster-level metadata that can be calculated explicitly from the Data and Assigned metadata without the need for human intervention. It includes fields that describe: # reads detected/cell, # UMI/cell, fraction of cells per cluster derived from each anatomical dissection, expressed neurotransmitter genes (quantitatively defined), and standard quality control metrics (e.g., doublet score) per cluster.
Annotations: Includes fields related to the annotation of clusters or groups of clusters (collectively called "cell sets"). It includes fields that describe cluster levels, cluster relationships, canonical marker genes, links to existing ontologies (e.g., CL, UBERON), expert annotations, and dendrograms.
Analysis: Includes fields required for specific analyses, for example latent spaces (e.g., UMAP), cluster-level gene summaries (e.g., cluster means, proportions), and variable genes.
Tooling: Includes fields required for specific tools (e.g., cellxgene, TDT, CAS, CAP, and cell type annotation) that are not strictly part of the taxonomy but are required to inter-operate between various tools.
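As an illustration of the Calculated metadata category, the fraction of cells per cluster derived from each anatomical dissection can be computed from two cell-level columns alone. The sketch below uses plain Python lists in place of obs columns; the function name and the example values are hypothetical and not part of the schema.

```python
from collections import Counter, defaultdict

def region_fraction_per_cluster(cluster_ids, regions):
    """Return {cluster_id: {region: fraction of that cluster's cells}}."""
    if len(cluster_ids) != len(regions):
        raise ValueError("cluster_ids and regions need one entry per cell")
    counts = defaultdict(Counter)
    for cluster, region in zip(cluster_ids, regions):
        counts[cluster][region] += 1
    return {
        cluster: {
            region: n / sum(by_region.values())
            for region, n in by_region.items()
        }
        for cluster, by_region in counts.items()
    }

# Hypothetical per-cell values standing in for obs["cluster_id"]
# and obs["anatomical_region"].
cluster_ids = ["c1", "c1", "c1", "c2"]
regions = ["MTG", "MTG", "V1", "V1"]
fractions = region_fraction_per_cluster(cluster_ids, regions)
```

Because the result depends only on Data and Assigned metadata, it can be recomputed at any time without human input, which is what separates this category from Annotations.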
Schema
Within each broad categorical term, fields are ordered by their location in the anndata object: X, raw, obsm, obs, var, uns.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14, RFC2119, and RFC8174 when, and only when, they appear in all capitals, as shown here.
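One way a validator might act on these requirement levels is to report a missing MUST field as an error and a missing RECOMMENDED field as a warning. That error/warning split is an assumption of this sketch, not something the schema prescribes. The requirement levels below are copied from a handful of obs fields defined later in this document; the function name is hypothetical.

```python
# Requirement levels copied from a few obs fields in this document.
OBS_REQUIREMENTS = {
    "cluster_id": "MUST",
    "load_id": "MUST",
    "donor_id": "MUST",
    "assay": "MUST",
    "organism": "MUST",
    "organism_ontology_term_id": "RECOMMENDED",
    "cell_type_ontology_term_id": "RECOMMENDED",
}

def check_obs_columns(columns, requirements=OBS_REQUIREMENTS):
    """Split missing fields into errors (MUST) and warnings (RECOMMENDED)."""
    present = set(columns)
    errors = sorted(
        key for key, level in requirements.items()
        if level == "MUST" and key not in present
    )
    warnings = sorted(
        key for key, level in requirements.items()
        if level == "RECOMMENDED" and key not in present
    )
    return errors, warnings

# Hypothetical list of column names, as might be read from obs.
errors, warnings = check_obs_columns(
    ["cluster_id", "load_id", "donor_id", "organism"]
)
```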
X
Key
X
Annotator
Curator
Value
The X component contains normalized expression data (cell x gene) in scipy.sparse.csr_matrix format.
Type
numeric
Required
RECOMMENDED
Tags
Data
raw
The raw component contains the unfiltered anndata object containing a count matrix in raw.X.
obs
The obs component contains cell-level metadata.
index of pandas.DataFrame
Key
index of pandas.DataFrame
Annotator
Curator
Value
Unique identifier corresponding to each individual cell.
Type
Index, str
Required
MUST
Tags
Assigned metadata
cluster_id
Key
cluster_id
Annotator
Curator
Value
Identifier for cell set computed from a clustering algorithm.
Type
str
Required
MUST
Tags
Annotations
[cellannotation_setname]
Key
[cellannotation_setname]
Annotator
Curator
Value
Column name in obs is the string [cellannotation_setname] and the values are the strings describing an annotation level of the taxonomy.
Type
Categorical
Required
RECOMMENDED
Tags
Annotations
Examples: Neuronal, Inhibitory, LHX6 (MGE), PVALB, Inh L5-6 PVALB LGR5 from Hodge et al. 2019
cell_type_ontology_term_id
Key
cell_type_ontology_term_id
Annotator
Curator
Value
This MUST be a CL term. If no appropriate high-level term can be found or the cell type is unknown, then it is STRONGLY RECOMMENDED to use "CL:0000003" for native cell.
Type
Categorical
Required
RECOMMENDED
Tags
Annotations
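A minimal sketch of how the fallback rule for cell_type_ontology_term_id could be applied. The check is purely syntactic: it assumes CL identifiers have the form "CL:" followed by seven digits and does not confirm that a term exists in the Cell Ontology. The function name is hypothetical.

```python
import re

# Assumed identifier shape: "CL:" followed by seven digits.
CL_PATTERN = re.compile(r"^CL:\d{7}$")
NATIVE_CELL = "CL:0000003"  # fallback named in the field description

def normalize_cl_term(term):
    """Return a CL term, falling back to native cell when none is given."""
    if term is None or term == "":
        return NATIVE_CELL
    if not CL_PATTERN.match(term):
        raise ValueError(f"not a CL term: {term!r}")
    return term

resolved = [normalize_cl_term(t) for t in ["CL:0000540", None, ""]]
```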
load_id
Key
load_id
Annotator
Curator
Value
Identifier for the sequencing library for which molecular measurements from a specific set of cells is derived.
Type
str
Required
MUST
Tags
Assigned metadata
donor_id
Key
donor_id
Annotator
Curator
Value
Identifier for the unique individual, ideally from the specimen portal (or other upstream source). This is called donor_label in the BKP; the two should converge on a standard term. More than one identifier may be needed, but ideally only a single one is retained for the analysis and stored here.
Type
str
Required
MUST
Tags
Assigned metadata
assay
Key
assay
Annotator
Curator
Value
Human-readable sequencing modality, which should have a corresponding EFO ontology term (e.g., 'Smart-seq2' corresponds to 'EFO:0008931' and '10x 3' v3' corresponds to 'EFO:0009922').
Type
str
Required
MUST
Tags
Assigned metadata
assay_ontology_term_id
Key
assay_ontology_term_id
Annotator
Curator/Computed
Value
Most appropriate EFO ontology term for assay (e.g., "10x 3' v2" = "EFO:0009899", "10x 3' v3" = "EFO:0009922", "Smart-seq" = "EFO:0008930").
Type
str
Required
MUST
Tags
Assigned metadata
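Because assay_ontology_term_id may be Computed, one plausible approach is a lookup from the human-readable assay label. The pairs below are the ones given as examples in the assay and assay_ontology_term_id descriptions; the function name and the choice to fail on unmapped labels are assumptions of this sketch.

```python
# Pairs taken from the examples in the assay and
# assay_ontology_term_id field descriptions.
ASSAY_TO_EFO = {
    "Smart-seq": "EFO:0008930",
    "Smart-seq2": "EFO:0008931",
    "10x 3' v2": "EFO:0009899",
    "10x 3' v3": "EFO:0009922",
}

def assay_ontology_term_ids(assays):
    """Map each per-cell assay label to an EFO term; fail on unmapped labels."""
    unknown = sorted(set(assays) - set(ASSAY_TO_EFO))
    if unknown:
        raise KeyError(f"no EFO term known for: {unknown}")
    return [ASSAY_TO_EFO[assay] for assay in assays]

term_ids = assay_ontology_term_ids(["10x 3' v3", "Smart-seq2", "10x 3' v3"])
```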
organism
Key
organism
Annotator
Curator
Value
Species from which cells were collected. This MUST be the human-readable name assigned to the value of organism_ontology_term_id.
Type
str
Required
MUST
Tags
Assigned metadata
organism_ontology_term_id
Key
organism_ontology_term_id
Annotator
Computed
Value
NCBITaxon identifier which MUST be a child of NCBITaxon:33208 for Metazoa. Ontology terms are mapped from organism using the GeneOrthology github repo.
Type
str
Required
RECOMMENDED
Tags
Assigned metadata
donor_age
Key
donor_age
Annotator
Curator
Value
Currently a free-text field for defining the age of the donor. In CELLxGENE this is recorded in development_stage_ontology_term_id, which uses HsapDv terms for human and MmusDv terms for mouse. How this value should be handled is still open and should be aligned with BICAN.
Type
Categorical
Required
MUST
Tags
Assigned metadata
anatomical_region
Key
anatomical_region
Annotator
Curator
Value
Human-readable name assigned to the value of anatomical_region_ontology_term_id.
Type
Categorical
Required
MUST
Tags
Assigned metadata
anatomical_region_ontology_term_id
Key
anatomical_region_ontology_term_id
Annotator
Curator/Computed
Value
UBERON term for the value of the anatomical_region field (e.g., 'brain': 'UBERON_0000955').
Type
Categorical
Required
RECOMMENDED
Tags
Assigned metadata
brain_region_ontology_term_id
Key
brain_region_ontology_term_id
Annotator
Curator/Computed
Value
Brain region IDs from one of the brain-bican atlases for the anatomical_region field. Currently includes DHBA, HBA, and MBA, but will expand.
Type
Categorical
Required
RECOMMENDED
Tags
Assigned metadata
self_reported_sex
Key
self_reported_sex
Annotator
Curator
Value
Placeholder for donor-reported sex. This is called sex_ontology_term_id (e.g., PATO:0000384/383 for male/female) in CELLxGENE and "donor_sex" in the BKP. We should align on a single term.
Type
Categorical
Required
MUST
Tags
Assigned metadata
self_reported_sex_ontology_term_id
Key
self_reported_sex_ontology_term_id
Annotator
Curator/Computed
Value
A child of PATO:0001894 for phenotypic sex or "unknown" if unavailable or if sex corresponds to something not included in PATO. Female = PATO_0000383 and Male = PATO_0000384.
Type
Categorical
Required
MUST
Tags
Assigned metadata
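The examples in this document write ontology identifiers both with a colon (PATO:0000384) and with an underscore (PATO_0000383, UBERON_0000955). A small normalizer, sketched below, could make the two forms comparable. The function name, the accepted pattern, and the default colon separator are assumptions; the schema does not state which form is canonical.

```python
import re

# Assumed identifier shape: letters, then ":" or "_", then digits.
TERM_PATTERN = re.compile(r"^([A-Za-z]+)[:_](\d+)$")

def normalize_term_id(term, separator=":"):
    """Rewrite PREFIX_1234 or PREFIX:1234 with one separator; keep 'unknown'."""
    if term == "unknown":
        return term
    match = TERM_PATTERN.match(term)
    if match is None:
        raise ValueError(f"unrecognized ontology term: {term!r}")
    prefix, local_id = match.groups()
    return f"{prefix}{separator}{local_id}"

normalized = [
    normalize_term_id(t)
    for t in ["PATO_0000383", "PATO:0000384", "unknown"]
]
```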
self_reported_ethnicity
Key
self_reported_ethnicity
Annotator
Curator
Value
Human-readable name assigned to the value of self_reported_ethnicity_ontology_term_id, or "unknown" if unavailable.
Type
Categorical
Required
RECOMMENDED
Tags
Assigned metadata
self_reported_ethnicity_ontology_term_id
Key
self_reported_ethnicity_ontology_term_id
Annotator
Curator/Computed
Value
Either the most relevant HANCESTRO term, "multiethnic" if more than one ethnicity is reported, or "unknown" if unavailable.
Type
Categorical
Required
RECOMMENDED
Tags
Assigned metadata
disease
Key
disease
Annotator
Curator
Value
A term corresponding to disease state (or "control" for normal/healthy)
Type
Categorical
Required
RECOMMENDED
Tags
Assigned metadata
disease_ontology_term_id
Key
disease_ontology_term_id
Annotator
Curator/Computed
Value
This MUST be a MONDO term or "PATO:0000461" for normal or healthy.
Type
Categorical
Required
RECOMMENDED
Tags
Assigned metadata
suspension_type
Key
suspension_type
Annotator
Curator
Value
Either "cell", "nucleus", or "na".
Type
Categorical
Required
MUST
Tags
Assigned metadata
is_primary_data
Key
is_primary_data
Annotator
Curator
Value
This MUST be True if this is the canonical instance of this cellular observation and False if not. This is commonly False for meta-analyses reusing data or for secondary views of data.
var
The var component contains gene-level information.
index of pandas.DataFrame
Key
index of pandas.DataFrame
Annotator
Curator
Value
If the feature is a gene, then this MUST be a human-readable SYMBOL term. The index of the pandas.DataFrame MUST contain unique identifiers for features. If present, the index of raw.var MUST be identical to the index of var.
Type
str
Required
MUST
Tags
Assigned metadata
ensembl_id
Key
ensembl_id
Annotator
Curator
Value
If the feature is a gene, then this MUST be a gene ID from Ensembl. Each index value of the pandas.DataFrame MUST map to a unique ensembl_id. If present, raw.var.ensembl_id MUST be identical to var.ensembl_id.
Type
str
Required
RECOMMENDED
Tags
Assigned metadata
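The uniqueness and raw/var consistency rules for the var index and ensembl_id can be checked with a few comparisons. In the sketch below plain lists stand in for the var index, the ensembl_id column, and the raw.var index. The function name is hypothetical, the Ensembl IDs are placeholders, and reading "map to a unique ensembl_id" as a one-to-one mapping is an assumption.

```python
from collections import Counter

def check_var(symbols, ensembl_ids, raw_symbols=None):
    """Return a list of problems found in the var index and ensembl_id."""
    problems = []
    dup_symbols = sorted(s for s, n in Counter(symbols).items() if n > 1)
    if dup_symbols:
        problems.append(f"duplicate index values: {dup_symbols}")
    dup_ids = sorted(e for e, n in Counter(ensembl_ids).items() if n > 1)
    if dup_ids:
        problems.append(f"ensembl_id shared by several features: {dup_ids}")
    if raw_symbols is not None and list(raw_symbols) != list(symbols):
        problems.append("raw.var index differs from var index")
    return problems

# Placeholder identifiers, not real Ensembl IDs.
ok = check_var(["GAD1", "PVALB"], ["ID_A", "ID_B"], ["GAD1", "PVALB"])
bad = check_var(["GAD1", "GAD1"], ["ID_A", "ID_A"], ["GAD1", "PVALB"])
```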
highly_variable_genes[_name]
Key
highly_variable_genes[_name]
Annotator
Curator
Value
A logical vector indicating which genes are highly variable. Multiple highly variable gene sets can be specified.
Type
bool
Required
RECOMMENDED
Tags
Analysis
marker_genes[_name]
Key
marker_genes_[set_name]
Annotator
Curator
Value
A logical vector indicating which genes are markers. Multiple marker gene sets can be specified.
Type
bool
Required
RECOMMENDED
Tags
Analysis
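Since several gene sets can coexist as separate boolean columns, a reader of the file needs to recover each set by name. The sketch below does this for any column prefix, with a dict of lists standing in for var. The function name and the set names "subclass" and "cluster" are hypothetical.

```python
def gene_sets_from_flags(gene_index, var_columns, prefix):
    """Collect {set_name: [genes]} from boolean columns named prefix + set_name."""
    gene_sets = {}
    for column, flags in var_columns.items():
        if not column.startswith(prefix):
            continue
        if len(flags) != len(gene_index):
            raise ValueError(f"{column} must have one flag per gene")
        set_name = column[len(prefix):]
        gene_sets[set_name] = [g for g, keep in zip(gene_index, flags) if keep]
    return gene_sets

genes = ["GAD1", "SLC17A7", "PVALB"]
var_columns = {
    "marker_genes_subclass": [True, True, False],  # hypothetical set name
    "marker_genes_cluster": [False, False, True],  # hypothetical set name
    "ensembl_id": ["ID_A", "ID_B", "ID_C"],        # ignored: other prefix
}
marker_sets = gene_sets_from_flags(genes, var_columns, prefix="marker_genes_")
```

The same function would work for highly variable gene sets by passing a different prefix.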
uns
The uns component contains more general information and fields with formatting incompatible with the above components.
title
Key
title
Annotator
Curator
Value
This text describes and differentiates the dataset from other datasets in the same collection. It is STRONGLY RECOMMENDED that each dataset title in a collection is unique and does not depend on other metadata such as a different assay to disambiguate it from other datasets in the collection.
Type
str
Required
MUST
Tags
Tooling
dataset_purl
Key
dataset_purl
Annotator
Curator
Value
Link to molecular data (cell x gene) if not present in X or raw.X. This can be an AWS S3 bucket or other permanent URL for the taxonomy expression data.
Type
str
Required
RECOMMENDED
Tags
Data
batch_condition
Key
batch_condition
Annotator
Curator
Value
Together, these keys define the batches that a normalization or integration algorithm should be aware of. Values MUST refer to cell metadata keys in obs.
Type
list[str]
Required
RECOMMENDED
Tags
Tooling
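A sketch of how a tool might consume batch_condition: confirm that every listed key is an obs column, then combine those columns into one batch label per cell. A dict of lists stands in for obs. The function name and the "|" joining convention are assumptions of this example.

```python
def batch_labels(obs, batch_condition):
    """Combine the obs columns named in batch_condition into one label per cell."""
    missing = [key for key in batch_condition if key not in obs]
    if missing:
        raise KeyError(f"batch_condition keys not found in obs: {missing}")
    columns = [obs[key] for key in batch_condition]
    return ["|".join(str(v) for v in values) for values in zip(*columns)]

# Hypothetical obs content for three cells.
obs = {
    "donor_id": ["d1", "d1", "d2"],
    "assay": ["10x 3' v3", "Smart-seq2", "10x 3' v3"],
}
labels = batch_labels(obs, ["donor_id", "assay"])
```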
reference_genome
Key
reference_genome
Annotator
Curator
Value
Reference genome used to align molecular measurements.
Type
str
Required
RECOMMENDED
Tags
Assigned metadata
gene_annotation_version
Key
gene_annotation_version
Annotator
Curator
Value
Genome annotation version used during alignment (e.g., the version of the .gtf or .gff file).
Type
str
Required
RECOMMENDED
Tags
Assigned metadata
dend
Key
dend
Annotator
Curator
Value
A JSON-formatted dendrogram encoding the taxonomy hierarchy. Either computed or derived from cluster groupings.
Type
json
Required
RECOMMENDED
Tags
Annotations
hierarchy
Key
hierarchy
Annotator
Curator
Value
An ordering of cluster_id and higher-level groupings from [cellannotation_setname], where smaller numbers are broader types. E.g., {"Class": 0, "Subclass": 1, "cluster_id": 2}
Type
dict{str: int}
Required
MUST
Tags
Annotations
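The hierarchy dictionary can be turned into an ordered list of annotation levels, from broadest to finest, with two basic checks along the way. The sketch assumes that ranks are the consecutive integers 0 to n-1, as in the example above, and that every level names a column in obs. Neither rule is stated explicitly by the schema, and the function name is hypothetical.

```python
def ordered_levels(hierarchy, obs_columns):
    """Return annotation levels from broadest to finest after basic checks."""
    ranks = sorted(hierarchy.values())
    # Assumption: ranks are 0..n-1 with no gaps or repeats.
    if ranks != list(range(len(ranks))):
        raise ValueError("ranks should be the integers 0..n-1 without repeats")
    missing = [level for level in hierarchy if level not in obs_columns]
    if missing:
        raise KeyError(f"hierarchy levels not found in obs: {missing}")
    return sorted(hierarchy, key=hierarchy.get)

hierarchy = {"cluster_id": 2, "Class": 0, "Subclass": 1}
levels = ordered_levels(
    hierarchy, ["Class", "Subclass", "cluster_id", "donor_id"]
)
```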
mode
Key
mode
Annotator
Curator/Computed
Value
Indicator of which child of the parent taxonomy to use. The mode determines which cells to remove (based on filter) and selects the components of uns that hold analysis and tooling information specific to that child taxonomy.
Type
str
Required
MUST
Tags
Tooling
filter
Key
filter
Annotator
Curator/Computed
Value
Indicator of which cells to use for a given child taxonomy, saved as a list of booleans with one value per cell. TRUE indicates a cell should be removed and FALSE indicates the cell should be kept. Each entry in this list is named for the relevant "mode" and holds the TRUE/FALSE calls indicating whether each cell is filtered out; e.g., for the "standard" taxonomy all values are FALSE.
Type
list[[mode]][bool]
Required
MUST
Tags
Tooling
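Because TRUE means "remove", applying a filter keeps the cells whose flag is FALSE. The sketch below shows that inversion with a dict of boolean lists keyed by mode. The function name and the "patchseq" mode name are hypothetical; "standard" is the mode named in the description above.

```python
def cells_for_mode(cell_ids, filters, mode):
    """Keep the cells whose filter flag for the given mode is False."""
    flags = filters[mode]
    if len(flags) != len(cell_ids):
        raise ValueError("filter needs one boolean per cell")
    return [cell for cell, removed in zip(cell_ids, flags) if not removed]

cell_ids = ["cell_1", "cell_2", "cell_3"]
filters = {
    "standard": [False, False, False],  # nothing removed
    "patchseq": [False, True, False],   # hypothetical child taxonomy
}
standard_cells = cells_for_mode(cell_ids, filters, "standard")
child_cells = cells_for_mode(cell_ids, filters, "patchseq")
```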
cluster_algorithm
Key
cluster_algorithm
Annotator
Curator
Value
Full description of clustering parameters as a data.frame.
Type
data.frame
Required
MUST
Tags
Annotations
cluster_info
Key
cluster_info
Annotator
Curator/Computed
Value
A data.frame of cluster information.
Type
data.frame
Required
MUST
Tags
Annotations
cluster_id_median_expr
Key
cluster_id_median_expr
Annotator
Curator/Computed
Value
Marker gene expression in on-target and off-target cell populations, useful for patchseq analysis. Also includes information about KL divergence calculations and associated QC calls. Defined by buildPatchseqTaxonomy.
Type
numpy.ndarray
Required
MUST
Tags
Annotations
default_embedding
Key
default_embedding
Annotator
Curator/Computed
Value
The value MUST match a key to an embedding in obsm for the embedding to display by default.
Type
str
Required
RECOMMENDED
Tags
Tooling
schema_version
Key
schema_version
Annotator
Computed
Value
Allen Institute Taxonomy schema version. e.g. "1.0.0"
Type
str
Required
MUST
Tags
Tooling
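Comparing schema versions as strings gives the wrong order once a component reaches two digits ("1.10.0" sorts before "1.9.0"), so tools may prefer to parse the value first. The sketch assumes the three-part numeric form shown in the example "1.0.0"; the schema does not spell out the version grammar, and the function name is hypothetical.

```python
def parse_schema_version(version):
    """Parse 'MAJOR.MINOR.PATCH' into a tuple of ints for comparison."""
    parts = version.split(".")
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        raise ValueError(f"expected MAJOR.MINOR.PATCH, got {version!r}")
    return tuple(int(part) for part in parts)

current = parse_schema_version("1.0.0")
is_newer = parse_schema_version("1.10.0") > parse_schema_version("1.9.0")
```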
cell_annotation_schema
Key
cell_annotation_schema
Annotator
Computed
Value
A JSON object storing the entire Cell Annotation Schema (CAS) information.
Type
json
Required
RECOMMENDED
Tags
Tooling
cell_annotation_schema: extended calculated metadata about annotations and labelsets is stored in uns, following the CAS - BICAN extension format, under labelsets.
obsm (Embeddings)
The obsm component contains all dimensionality reductions of the taxonomy (cell x dim). To display a dataset, Curators MUST annotate one or more embeddings of at least two dimensions (e.g., tSNE, UMAP, PCA, spatial coordinates) as numpy.ndarrays in obsm.
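These embedding rules, together with the default_embedding rule in uns, can be checked as sketched below. Nested lists stand in for numpy.ndarrays so the example has no dependencies. The function name and the coordinates are hypothetical.

```python
def check_embeddings(obsm, n_cells, default_embedding=None):
    """Return a list of problems with the embeddings and the default key."""
    problems = []
    if not obsm:
        problems.append("at least one embedding is required")
    for key, coords in obsm.items():
        if len(coords) != n_cells:
            problems.append(f"{key}: expected {n_cells} rows, found {len(coords)}")
        elif any(len(row) < 2 for row in coords):
            problems.append(f"{key}: every cell needs at least two dimensions")
    if default_embedding is not None and default_embedding not in obsm:
        problems.append(f"default_embedding {default_embedding!r} is not a key in obsm")
    return problems

# Hypothetical coordinates for three cells.
obsm = {
    "X_umap": [[0.1, 1.2], [0.3, 0.9], [2.0, 2.1]],
    "X_score": [[0.5], [0.7], [0.2]],  # one dimension only
}
problems = check_embeddings(obsm, n_cells=3, default_embedding="X_tsne")
```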
X_[embedding]
Key
X_[embedding]
Annotator
Curator/Computed
Value
An n-dimensional embedding (cell x dim) of the high dimensional expression data.