fix: tissue_ontology_term_id and cell_type_ontology_term_id fields for 3 new species (#1235)
joyceyan authored Jan 31, 2025
1 parent 8112a32 commit 0387a19
Showing 3 changed files with 182 additions and 143 deletions.
@@ -108,85 +108,33 @@ components:
columns:
cell_type_ontology_term_id:
type: curie
dependencies:
- # If organism is Zebrafish
rule:
column: organism_ontology_term_id
match_exact:
terms:
- NCBITaxon:7955
error_message_suffix: >-
When 'organism_ontology_term_id' is 'NCBITaxon:7955' (Danio rerio),
'cell_type_ontology_term_id' MUST be a descendant term id of 'ZFA:0009000' (cell).
type: curie
curie_constraints:
ontologies:
- ZFA
allowed:
ancestors:
ZFA:
- ZFA:0009000
exceptions:
- unknown
- # If organism is fruit fly
rule:
column: organism_ontology_term_id
match_exact:
terms:
- NCBITaxon:7227
error_message_suffix: >-
When 'organism_ontology_term_id' is 'NCBITaxon:7227' (Drosophila melanogaster),
'cell_type_ontology_term_id' MUST be a descendant term id of 'FBbt:00007002' (cell).
type: curie
curie_constraints:
ontologies:
- FBbt
allowed:
ancestors:
FBbt:
- FBbt:00007002
exceptions:
- unknown
- # If organism is c. elegans
rule:
column: organism_ontology_term_id
match_exact:
terms:
- NCBITaxon:6239
error_message_suffix: >-
When 'organism_ontology_term_id' is 'NCBITaxon:6239' (Caenorhabditis elegans),
'cell_type_ontology_term_id' MUST be a descendant term id of 'WBbt:0004017' (cell).
type: curie
curie_constraints:
ontologies:
- WBbt
allowed:
ancestors:
WBbt:
- WBbt:0004017
forbidden:
ancestors:
WBbt:
- WBbt:0006803
terms:
- WBbt:0006803
exceptions:
- unknown
# else if column does not match any of the above
curie_constraints:
ontologies:
- CL
- ZFA
- FBbt
- WBbt
exceptions:
- unknown
allowed:
ancestors:
ZFA:
- ZFA:0009000
FBbt:
- FBbt:00007002
WBbt:
- WBbt:0004017
CL:
- CL:0000000
forbidden:
ancestors:
WBbt:
- WBbt:0006803
terms:
- CL:0000255
- CL:0000257
- CL:0000548
- WBbt:0006803
add_labels:
- type: curie
to_column: cell_type
@@ -276,86 +224,50 @@ components:
tissue_ontology_term_id:
type: curie
dependencies:
- # If organism is Zebrafish
- # If tissue_type is tissue OR organoid
rule:
column: organism_ontology_term_id
column: tissue_type
match_exact:
terms:
- NCBITaxon:7955
- tissue
- organoid
error_message_suffix: >-
When 'organism_ontology_term_id' is 'NCBITaxon:7955' (Danio rerio),
'tissue_ontology_term_id' MUST be the most accurate descendant
of ZFA:0100000 for zebrafish anatomical entity and MUST NOT be ZFA:0009000
for cell or any of its descendants.
When 'tissue_type' is 'tissue' or 'organoid', 'tissue_ontology_term_id' must be a valid UBERON, ZFA, FBbt, or WBbt term.
type: curie
curie_constraints:
ontologies:
- UBERON
- ZFA
- FBbt
- WBbt
allowed:
ancestors:
ZFA:
- ZFA:0100000
forbidden:
terms:
- ZFA:0009000
- ZFA:0001093
ancestors:
ZFA:
- ZFA:0009000
- # If organism is fruit fly
rule:
column: organism_ontology_term_id
match_exact:
terms:
- NCBITaxon:7227
error_message_suffix: >-
When 'organism_ontology_term_id' is 'NCBITaxon:7227' (Drosophila melanogaster),
'tissue_ontology_term_id' MUST be the most accurate descendant
of FBbt:10000000 for fruit fly anatomical entity and MUST NOT be FBbt:00007002
for cell or any of its descendants.
type: curie
curie_constraints:
ontologies:
- FBbt
allowed:
ancestors:
FBbt:
- FBbt:10000000
forbidden:
terms:
- FBbt:00007002
ancestors:
FBbt:
- FBbt:00007002
- # If organism is c. elegans
rule:
column: organism_ontology_term_id
match_exact:
terms:
- NCBITaxon:6239
error_message_suffix: >-
When 'organism_ontology_term_id' is 'NCBITaxon:6239' (Caenorhabditis elegans),
'tissue_ontology_term_id' MUST be the most accurate descendant
of WBbt:0005766 for Anatomy
type: curie
curie_constraints:
ontologies:
- WBbt
allowed:
ancestors:
WBbt:
- WBbt:0005766
UBERON:
- UBERON:0001062
forbidden:
ancestors:
WBbt:
- WBbt:0004017
- WBbt:0006803
terms:
- ZFA:0009000
- ZFA:0001093
- FBbt:00007002
- WBbt:0007849
- WBbt:0007850
- WBbt:0008595
- WBbt:0004017
- WBbt:0006803
ancestors:
ZFA:
- ZFA:0009000
WBbt:
- WBbt:0004017
- WBbt:0006803
FBbt:
- FBbt:00007002
- # If tissue_type is cell culture
rule:
column: tissue_type
@@ -377,14 +289,6 @@ components:
- CL:0000255
- CL:0000257
- CL:0000548
# else if column does not match any of the above
curie_constraints:
ontologies:
- UBERON
allowed:
ancestors:
UBERON:
- UBERON:0001062
add_labels:
- type: curie
to_column: tissue
89 changes: 89 additions & 0 deletions cellxgene_schema_cli/cellxgene_schema/validate.py
@@ -465,6 +465,91 @@ def count_nonzeros(matrix_chunk: Union[np.ndarray, sparse.spmatrix], is_sparse_m
nonzeros = count_nonzeros(matrix.compute(), is_sparse_matrix)[0]
return nonzeros

    def _validate_tissue_ontology_term_id(self):
        """
        For `tissue_ontology_term_id`, the schema_definition.yaml allows all possible terms regardless of what
        the organism is. This block of code does further validation to make sure that if zebrafish, fruit fly,
        or roundworm is specified, only the correct ontologies are used.
        This is quite a bit easier to understand than fully overhauling the schema definition to allow for these
        very specific cases. Note that we only check for prefixes, since validation that these are proper ontology
        terms / descendants is done within the curie constraints
        """
        organism_column = "organism_ontology_term_id"
        tissue_column = "tissue_ontology_term_id"
        tissue_type_column = "tissue_type"

        required_columns = [tissue_column, organism_column, tissue_type_column]
        for column in required_columns:
            if column not in self.adata.obs.columns:
                return

        allowed_prefixes = {
            "NCBITaxon:6239": ("WBbt", "UBERON"),
            "NCBITaxon:7955": ("ZFA", "UBERON"),
            "NCBITaxon:7227": ("FBbt", "UBERON"),
        }

        def is_valid_row(row):
            if row[tissue_type_column] == "cell culture":
                return True
            allowed = allowed_prefixes.get(row[organism_column], ("UBERON",))
            return row[tissue_column].startswith(allowed)

        try:
            invalid_rows = ~self.adata.obs.apply(is_valid_row, axis=1)

            if invalid_rows.any():
                self.errors.append(
                    "When tissue_type is tissue or organoid, tissue_ontology_term_id must be a valid UBERON term. "
                    "If organism is NCBITaxon:6239, it can be a valid UBERON term or a valid WBbt term. "
                    "If organism is NCBITaxon:7955, it can be a valid UBERON term or a valid ZFA term. "
                    "If organism is NCBITaxon:7227, it can be a valid UBERON term or a valid FBbt term."
                )
        except Exception as e:
            self.errors.append(f"Unexpected error validating tissue_ontology_term_id: {e}")

    def _validate_cell_type_ontology_term_id(self):
        """
        For `cell_type_ontology_term_id`, the schema_definition.yaml allows all possible terms regardless of what
        the organism is. This block of code does further validation to make sure that if zebrafish, fruit fly,
        or roundworm is specified, only the correct ontologies are used.
        This is quite a bit easier to understand than fully overhauling the schema definition to allow for these
        very specific cases. Note that we only check for prefixes, since validation that these are proper ontology
        terms / descendants is done within the curie constraints
        """
        organism_column = "organism_ontology_term_id"
        cell_type_column = "cell_type_ontology_term_id"

        required_columns = [cell_type_column, organism_column]
        for column in required_columns:
            if column not in self.adata.obs.columns:
                return

        allowed_prefixes = {
            "NCBITaxon:6239": ("WBbt", "CL"),
            "NCBITaxon:7955": ("ZFA", "CL"),
            "NCBITaxon:7227": ("FBbt", "CL"),
        }

        def is_valid_row(row):
            if row[cell_type_column] == "unknown":
                return True
            allowed = allowed_prefixes.get(row[organism_column], ("CL",))
            return row[cell_type_column].startswith(allowed)

        try:
            invalid_rows = ~self.adata.obs.apply(is_valid_row, axis=1)

            if invalid_rows.any():
                self.errors.append(
                    "cell_type_ontology_term_id must be a valid CL term. "
                    "If organism is NCBITaxon:6239, it can be a valid CL term or a valid WBbt term. "
                    "If organism is NCBITaxon:7955, it can be a valid CL term or a valid ZFA term. "
                    "If organism is NCBITaxon:7227, it can be a valid CL term or a valid FBbt term."
                )
        except Exception as e:
            self.errors.append(f"Unexpected error validating cell_type_ontology_term_id: {e}")

def _validate_genetic_ancestry(self):
"""
Performs row-based validation of the genetic_ancestry_X fields. This ensures that a valid row must be:
@@ -2048,6 +2133,10 @@ def _deep_check(self):
# Validate genetic ancestry
self._validate_genetic_ancestry()

# Organism-specific prefix validation
self._validate_tissue_ontology_term_id()
self._validate_cell_type_ontology_term_id()

# Checks each component
for component_name, component_def in self.schema_def["components"].items():
logger.debug(f"Validating component: {component_name}")
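
The organism-specific prefix check added in validate.py can be tried out without AnnData or pandas. The following is a minimal standalone sketch, not the code from this commit: it uses plain dicts in place of `adata.obs` rows, and the names `TISSUE_PREFIXES`, `DEFAULT_TISSUE_PREFIXES`, `invalid_tissue_rows`, and `EXAMPLE_ROWS` are hypothetical. The prefix table is copied from the diff, and the term IDs in the example rows are taken from the schema hunks above. As in the commit, only prefixes are compared; whether a term is a real ontology term or an allowed descendant is left to the curie constraints.

```python
from typing import Dict, List, Tuple

# Organism -> ontology prefixes accepted for tissue terms (copied from the diff).
TISSUE_PREFIXES: Dict[str, Tuple[str, ...]] = {
    "NCBITaxon:6239": ("WBbt", "UBERON"),
    "NCBITaxon:7955": ("ZFA", "UBERON"),
    "NCBITaxon:7227": ("FBbt", "UBERON"),
}
# Any organism not listed above falls back to UBERON only.
DEFAULT_TISSUE_PREFIXES: Tuple[str, ...] = ("UBERON",)


def invalid_tissue_rows(rows: List[Dict[str, str]]) -> List[int]:
    """Return the indices of rows whose tissue term has a disallowed prefix."""
    bad = []
    for i, row in enumerate(rows):
        if row["tissue_type"] == "cell culture":
            continue  # cell culture rows are skipped by this check, as in the diff
        allowed = TISSUE_PREFIXES.get(
            row["organism_ontology_term_id"], DEFAULT_TISSUE_PREFIXES
        )
        # str.startswith accepts a tuple and is True if any prefix matches.
        if not row["tissue_ontology_term_id"].startswith(allowed):
            bad.append(i)
    return bad


# Hypothetical example rows; NCBITaxon:9606 (human) has no entry in the table.
EXAMPLE_ROWS = [
    {"organism_ontology_term_id": "NCBITaxon:7955", "tissue_type": "tissue",
     "tissue_ontology_term_id": "ZFA:0100000"},
    {"organism_ontology_term_id": "NCBITaxon:7955", "tissue_type": "organoid",
     "tissue_ontology_term_id": "FBbt:10000000"},
    {"organism_ontology_term_id": "NCBITaxon:9606", "tissue_type": "tissue",
     "tissue_ontology_term_id": "UBERON:0001062"},
    {"organism_ontology_term_id": "NCBITaxon:9606", "tissue_type": "tissue",
     "tissue_ontology_term_id": "ZFA:0100000"},
    {"organism_ontology_term_id": "NCBITaxon:7227", "tissue_type": "cell culture",
     "tissue_ontology_term_id": "CL:0000000"},
]

print(invalid_tissue_rows(EXAMPLE_ROWS))  # [1, 3]
```

Because the prefixes carry no trailing colon, a value such as `UBERONx:1` would also pass this check; the sketch mirrors the commit in relying on the curie constraints to reject such values.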