We expect an pca
HDF5 group at the root of the file, describing how the PCA was performed on the RNA expression matrix.
The group itself contains the parameters
and results
subgroups.
Definitions:
num_cells
: number of cells remaining after QC filtering. This is typically determined from thequality_control
step.
parameters
should contain:
num_hvgs
: a scalar integer containing the number of highly variable genes to use to compute the PCA.num_pcs
: a scalar integer containing the number of PCs to compute.block_method
: a scalar string specifying the method to use when dealing with multiple blocks in the dataset. This may be"none"
,"regress"
or"mnn"
.
results
should contain:
-
pcs
: a 2-dimensional float dataset containing the PC coordinates in a row-major layout. Each row corresponds to a cell (after QC filtering) and each column corresponds to a PC. The total number of rows and columns is defined asnum_cells
andnum_pcs
, respectively.If
block_method = "mnn"
, the PCs will be computed using a weighted method that adjusts for differences in the number of cells across blocks. Ifblock_method = "regress"
, the PCs will be computed on the residuals after regressing out the block-wise effects. -
var_exp
: a float dataset of length equal to the number of PCs, containing the percentage of variance explained by each PC.
If block_method = "mnn"
, the results
group will also contain:
corrected
, a float dataset with the same dimensions aspcs
, containing the MNN-corrected PCs for each cell
Updated in version 1.1, with the following changes from the previous version:
- Added support for various forms of batch correction.