This repository contains the code for benchmarking the quality of the representation of single-cell histone post translational modifications (scHPTM).
This is the companion to Raimundo, F., Prompsy, P., Vert, J.P. and Vallot, C., 2022. Best practices for single-cell histone modification analysis. bioRxiv, pp.2022-09..
We evaluate the effect of the following factors:
- Matrix construction.
- Quality control (based on coverage).
- Feature selection effect (using HVG or coverage).
- Role of number of cells per experiment (by downsampling).
- Role of coverage per cell (by downsampling cells based on coverage).
- Role of normalization used (RPM or TF-IDF).
- Role of dimension reduction algorithm (using 7 standard single-cell epigenetics pipelines).
The evaluation relies on having a robust co-assay (either transcriptomic or surface proteins), and measuring how well the scHPTM representation conserves the local geometry that can be observed in the reference co-assay.
The original matrices are accessible in two formats:
.fragments.tsv
: for Marsolier and Zhang. This is a raw format that contains the reads per barcode.matrix.mtx
: for Zhu, the data is already transformed into 5kbps bins.
In this analysis we decided to follow a simple folder structure:
- All the data are stored under
$HOME/data
. - The matrices are stored under
$HOME/data/matrices/$DATASET/$MARK/$CONDITION/$MATRIX_CONSTRUCTION
- The embeddings are stored under
$HOME/data/matrices/$DATASET/$MARK/$CONDITION/$MATRIX_CONSTRUCTION
Where:
$HOME
: is your home directory (/home/username
)$DATASET
: is one ofPairedTag_Zhu_2021
,scChIP_Marsolier_2022
,scCutTagPro_Zhang_2021
$MARK
: is one ofH3K27ac
,H3K27me3
,H3K4me1
,H3K4me3
,H3K9me3
$CONDITION
: is the in-silico experiment we did, unmodified data is theraw
condition.$MATRIX_CONDITION
: is either the binsize (e.g.10k
) or the annotation used (e.g.GeneTSS
).
We downloaded the following H3K27me3 scChIP data from GEO:
- GSM5008855
MM468_DMSO1_day60_H3K27me3 [scChIP-seq]
- GSM5008856
MM468_DMSO3_day77_H3K27me3 [scChIP-seq]
- GSM5008857
MM468_DMSO5_day131_H3K27me3 [scChIP-seq]
- GSM5008858
MM468_5FU1_day33_H3K27me3 [scChIP-seq]
- GSM5008859
MM468_5FU2_day67_H3K27me3 [scChIP-seq]
The DMSO
data are untreated MM468 cells, the 5FU
data are trested cells before resistance to
chemotherapy happens.
Using the web interface of ChromSCape we generated matrices for different binsizes (5kbps up to
1000kbps) in different analysis called MarsollierK27.$BINSIZE
.
We also generated a matrix using the author annotation using the _peaks.tsv.gz
files contained on
GEO and called it AuthorAnnotation.
These R objects are then converted into language agnostic 10x formated data using the
Create Marsolier.ipynb
notebook, in that step we also filter regions known to have CNAs on this
cell line, as well as filter barcodes (more than 3k reads, less than 15k reads, and keep 10k most
covered regions).
You will need to replace gamazeps
with your username in that notebook.
The data is downloaded from GEO
GSE152020_Paired-Tag_H3K27ac_DNA_filtered_matrix.tar.gz
GSE152020_Paired-Tag_H3K27me3_DNA_filtered_matrix.tar.gz
GSE152020_Paired-Tag_H3K4me1_DNA_filtered_matrix.tar.gz
GSE152020_Paired-Tag_H3K4me3_DNA_filtered_matrix.tar.gz
GSE152020_Paired-Tag_H3K9me3_DNA_filtered_matrix.tar.gz
And each is moved to $HOME/data/matrices/PairedTag_Zhu_2021/$MARK/raw/5k
.
Afterwards, the data is rebinned using resize_bins.py
, here is an example command:
time python resize_bins.py \
--input_path=$HOME/data/matrices/PairedTag_Zhu_2021/H3K27me3/raw/5k \
--output_dir=$HOME/data/matrices/PairedTag_Zhu_2021/H3K27me3/raw/1000k \
--binsize=1000
The authors have already filtered the cells, so we keep them all.
The data is downloaded from zenodo
Matrices are built in the same fashion as for scChIP_Marsolier_2022
The cell filtering is done by keeping the same cells as the original authors (see the `` notebook).
Feature selection is done using the filter_sce_features.R
script, here is the usage:
mkdir -p $HOME/data/matrices/PairedTag_Zhu_2021/H3K27me3/filtered_hvg_90/1000k
Rscript filter_sce_features.R \
--input_path=$HOME/data/matrices/PairedTag_Zhu_2021/H3K27me3/raw/1000k \
--output_path=$HOME/data/matrices/PairedTag_Zhu_2021/H3K27me3/filtered_hvg_90/1000k \
--mode=hvg
--fraction_features=0.9 \
fraction_features
: fraction of features to keep (0.9 for 90%)mode
: eitherhvg
for Seurat's HVG selection, orcoverage
for keeping the regions with the highest coverage.
Note: you will need to replace gamazeps
by your username.
Sample selection is done by using the filter_cells_quality.R
script, here is the usage:
Rscript filter_sce_features.R \
--input_path=$HOME/data/matrices/PairedTag_Zhu_2021/H3K27me3/raw/1000k \
It will create the various filtered_cell_q_x_y
(where the cells with coverage between x and y percentiles are kept).
Note: you will need to replace gamazeps
by your username.
Downsampling the cellls at random is done by using the sample_cells.R
script, here is the usage:
Rscript sample_cells.R \
--input_path=$HOME/data/matrices/PairedTag_Zhu_2021/H3K27me3/raw/1000k \
It will create the various sampled_cell_p_x
, where x
is the percentage of cells kept.
All the scripts assume that the data is stored in 10x mtx format.
time Rscript R_analysis.R \
--input_path=$HOME/data/matrices/scCutTagPro_Zhang_2021/H3K4me3/raw/50 \
--output_path=$HOME/data/embeddings/scCutTagPro_Zhang_2021/H3K4me3/raw/50
time python3 nmf_process.py \
--input_path=$HOME/data/matrices/scCutTagPro_Zhang_2021/H3K4me3/raw/50 \
--output_path=$HOME/data/embeddings/scCutTagPro_Zhang_2021/H3K4me3/raw/50
time python3 peakVI_process.py \
--input_path=$HOME/data/matrices/scCutTagPro_Zhang_2021/H3K4me3/raw/50 \
--output_path=$HOME/data/embeddings/scCutTagPro_Zhang_2021/H3K4me3/raw/50
time python3 scale_process.py \
--input_path=$HOME/data/matrices/scCutTagPro_Zhang_2021/H3K4me3/raw/50 \
--output_path=$HOME/data/embeddings/scCutTagPro_Zhang_2021/H3K4me3/raw/50
The embeddings will be stored in the output_path
, each embeddin is stored in a .csv
file named
after the method.
All the metrics will be stored in $HOME/data/scores/DATASET/MARK/MATRIX_CONSTRUCTION
The compute_scores.py
script will attempt to compute the neighborhood score for all the different
matrix construction for a given mark, dataset and preporcessing.
- The
gt_path
is theadt
extracted from the.rds
object for scCUT&Tag, and the mode isADT
. - The
gt_path
is the scRNA-seq matrix downloaded from GEO for the PairedTag, and the mode isRNA
.
python3 compute_scores.py \
--output_path=$HOME/data/scores/scCutTagPro_Zhang_2021/H3K4me3/raw/ \
--embeddings_path=$HOME/data/embeddings/scCutTagPro_Zhang_2021/H3K4me3/raw/ \
--mode=ADT \
--gt_path=$HOME/data/input/scCutTagPro_Zhang_2021/H3K4me3.adt.csv
The supervised metrics require an input 'ground truth' which is:
meta.tsv
from geo for Pairedtag- The author annotation in the
.rds
files for scCut&TagPro - Any matrix of the same mark for scChIP
time python3 compute_supervised_metrics.py '
--output_path=$HOME/data/scores/scCutTagPro_Zhang_2021/H3K4me1/raw/20k '
--embeddings_path=$HOME/data/embeddings/scCutTagPro_Zhang_2021/H3K4me1/raw/20k '
--gt_path=$HOME/data/input/scCutTagPro_Zhang_2021/H3K4me1')
- All the scores are stored under
data/scores
. - All the runtime informations are stored in the
logs
folder.
The analysis colab is scCUTTag_analysis.ipynb
.
This code is based on a fork of the work originally done at Google, whose source can be found here.