# Overview

We expect a `marker_detection` HDF5 group at the root of the file, containing marker statistics for each cluster.
The group itself contains the `parameters` and `results` subgroups.

**Definitions:**

- `num_clusters`: number of clusters in the analysis.
This is typically determined from the `choose_clustering` step.
- `rna_available`: whether RNA data is available.
This is typically determined from the `inputs` step.
- `rna_num_features`: number of features in the RNA data.
This is typically determined from the `inputs` step.
- `adt_available`: whether ADT data is available.
This is typically determined from the `inputs` step.
- `adt_num_features`: number of features in the ADT data.
This is typically determined from the `inputs` step.
- `crispr_available`: whether CRISPR data is available.
This is typically determined from the `inputs` step.
- `crispr_num_features`: number of features in the CRISPR data.
This is typically determined from the `inputs` step.

# Parameters

`parameters` will contain:

- `lfc_threshold`: a scalar float specifying the log-fold change threshold used to compute effect sizes.
This should be non-negative.
- `compute_auc`: a scalar integer to be interpreted as a boolean, indicating whether AUCs were computed.
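
The constraints above can be checked with a few lines of code. The following is a minimal sketch, assuming the two values have already been read from the `parameters` group into a plain mapping. The function name `check_parameters` and the wording of the messages are hypothetical and not part of this specification.

```python
import math
from numbers import Integral, Real


def check_parameters(params):
    """Return a list of problems found in a `parameters`-like mapping.

    `params` is assumed to map dataset names to scalar values that were
    already read from the HDF5 file (hypothetical helper, for illustration).
    """
    problems = []

    # `lfc_threshold` should be a float scalar; integers and booleans are rejected here.
    lfc = params.get("lfc_threshold")
    if not isinstance(lfc, Real) or isinstance(lfc, Integral):
        problems.append("lfc_threshold should be a scalar float")
    elif math.isnan(lfc) or lfc < 0:
        # NaN is treated as failing the non-negativity requirement (an assumption).
        problems.append("lfc_threshold should be non-negative")

    # `compute_auc` is an integer that readers interpret as a boolean.
    auc = params.get("compute_auc")
    if not isinstance(auc, Integral):
        problems.append("compute_auc should be a scalar integer")

    return problems
```

An empty list indicates that both parameters satisfy the description; for example, `check_parameters({"lfc_threshold": 0.0, "compute_auc": 1})` returns `[]`.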

# Results

`results` should contain `per_cluster`, a group containing the marker results for each cluster.
Each child of `per_cluster` corresponds to a cluster and is itself a group named after its cluster index (0 to `num_clusters - 1`).
Each `per_cluster/<cluster_index>` group contains further subgroups, one named after each modality:

- If `rna_available = true`, there should be an `"RNA"` subgroup.
- If `adt_available = true`, there should be an `"ADT"` subgroup.
- If `crispr_available = true`, there should be a `"CRISPR"` subgroup.

Each modality-specific subgroup contains the statistics for that modality:

- `means`: a float dataset of length equal to the number of features for this modality (as determined from the relevant `*_num_features`), containing the mean expression of each feature in the current cluster.
- `detected`: a float dataset of length equal to the number of features, containing the proportion of cells with detected expression of each feature in the current cluster.
- `lfc`: a group containing statistics for the log-fold changes from all pairwise comparisons involving the current cluster.
This contains:
  - `min`: a float dataset of length equal to the number of features, containing the minimum log-fold change across all pairwise comparisons for each feature.
  - `mean`: a float dataset of length equal to the number of features, containing the mean log-fold change across all pairwise comparisons for each feature.
  - `min_rank`: a float dataset of length equal to the number of features, containing the minimum rank of the log-fold changes across all pairwise comparisons for each feature.
- `delta_detected`: same as `lfc`, but for the delta-detected (i.e., the difference in the proportion of cells with detected expression).
- `cohen`: same as `lfc`, but for Cohen's d.
- `auc`: same as `lfc`, but for the AUCs.
This may be omitted if `compute_auc` is false (i.e., zero).
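
To make the nesting concrete, the sketch below enumerates the dataset paths that a file would be expected to contain under `marker_detection/results`. It is an illustration under stated assumptions rather than a reference implementation: the function name `expected_datasets` and its arguments are hypothetical, and `modalities` is assumed to list only the modalities whose `*_available` flag is true. When `compute_auc` is false, the `auc` group is left out of the list because it is optional, although a file may still include it.

```python
def expected_datasets(num_clusters, modalities, compute_auc):
    """List the dataset paths expected under `marker_detection/results`.

    `modalities` is a list of available modality names, e.g. ["RNA", "ADT"].
    (Hypothetical helper, for illustration only.)
    """
    # Effect-size groups that share the same internal layout as `lfc`.
    effects = ["lfc", "delta_detected", "cohen"]
    if compute_auc:
        effects.append("auc")

    # Summary datasets stored inside each effect-size group.
    summaries = ["min", "mean", "min_rank"]

    paths = []
    for cluster in range(num_clusters):  # cluster indices 0 .. num_clusters - 1
        for modality in modalities:
            base = f"marker_detection/results/per_cluster/{cluster}/{modality}"
            paths.append(f"{base}/means")
            paths.append(f"{base}/detected")
            for effect in effects:
                for summary in summaries:
                    paths.append(f"{base}/{effect}/{summary}")
    return paths
```

For two clusters with only RNA data and no AUCs, this yields 22 paths, starting with `marker_detection/results/per_cluster/0/RNA/means`. Each listed dataset should have length equal to the relevant `*_num_features`.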

# History

Updated in version 3.0, with the following changes from the [previous version](v2_0.md):

- Added the log-FC threshold parameter.
- Allow AUCs to be optional, depending on whether they were computed.