This capsule contains code and instructions for assessing the performance of various analysis methods for analyzing CRISPR regulatory screens. It contains the following sections:
Analysis and performance
Advanced flags
Note; the example below is from the GitHub and can be found here. The example has been altered slightly (referencing of file locations) to fit the format of Ocean Code.
After simulating CRISPR screen data, you are ready to analyze it! CRSsim provides a script for analyzing the data using a number of methods and comparing their performance.
Running the analysis will generate the following outputs:
- a per-guide scores file, written to
{output_name}_guideScores.csv
- a per-genome scores file, written to
{output_name}_genomeScores.csv
- a bedGraph file for genome scores, written to
{output_name}_genomeScores.bedgraph
- an AUC plot summarizing the performance of all methods compared, written to
{output_name}_perElement_AUCeval.pdf
- a prAUC plot summarizing the performance of all methods compared, written to
{output_name}_perElement_prAUCeval.pdf
- a .csv file summarizing the AUC and prAUC scores for each method evaluated, written to
{output_name}_perElement_method_eval.csv
To analyze data and evaluate method performance you will need the R packages listed below. This capsule should already have them set up.
-
dplyr
-
ggplot2
-
pROC
-
glmmTMB
-
extraDistr
-
MESS
-
IRanges
-
GenomicRanges
-
edgeR
-
DESeq2
We have provided an example data set from a selection screen. This way you don't have to first simulate your own data; however, the example on GitHub walks through how to do this yourself.
In this example we will evaluate the performance of RELICS and other mehtods. To get access to these methods the corresponding scripts must first be sourced.
# working directory is in 'code'
source('RELICS.r')
source('RELICS_performance.r')
- Flags are set up as an R list object. Run the following command in R to create this object:
analysis.specs <- list()
- Set the output name for the analysis. All output files generated by the analysis will contain this prefix in the filehandle. Be sure to choose a different name from any existing output files. In this case we are specifying the output location (
../results/
) in addition to the file name (Example_performanceEval
).
analysis.specs$dataName <- '../results/Example_performanceEval'
- Specify the paths to the guide information file and the counts file generated in the simulation step. Here, we have provided example files so that the user does not have to run a full simulation prior to this example. For file format details see section 3.1 below or here.
analysis.specs$test1 <- 'string1'
some text
analysis.specs$test2 <- 'string2'
other text
analysis.specs$CountFileLoc <- '../data/Example_simulation_counts.csv'
analysis.specs$sgRNAInfoFileLoc <- '../data/Example_simulation_info.csv'
analysis.specs$sgRNAInfoFileLoc <- '../data/Example_simulation_info.csv'
- Multiple analysis methods can be applied to the data: RELICS, fold change, edgeR, and DESeq2. The user can specify which of these methods they would like to apply to the data:
analysis.specs$Method <- c('RELICS-search', 'FoldChange', 'edgeR', 'DESeq2')
Each pool in a replicate is separated by a comma, each replicate is separated by a semi-colon. Detail for the RELICS analysis instructions can be found here.
analysis.specs$repl_groups <- '1,2;3,4'
analysis.specs$glmm_positiveTraining <- 'exon'
analysis.specs$glmm_negativeTraining <- 'neg'
- For edgeR, DESeq2, and fold change, select the pools to be compared against one another. Pools are referenced by their corresponding column index in the count file. For DESeq2 and edgeR,
Group1
refers to the pools where an enrichment of guides is expected. In this case we're analyzing a selection screen where guides targeting regulatory elements drop out, so relatively speaking they will be enriched in thebefore
pools. For fold change,Group1
is the numerator andGroup2
the denominator. Log10 fold change is calculated.
analysis.specs$Group1 <- c(1,3)
analysis.specs$Group2 <- c(2,4)
- For fold change, you need to specify whether the different pools are paired (from the same replicate with a 1-1 correspondence) or if there is an imbalance between the groups. In the case of the latter, all counts from
Group1
andGroup2
are combined respectively before calculating the fold change.
analysis.specs$foldChangePaired <- TRUE # else FALSE
- Specify that results should be evaluated based on a set of regions known to be true positives and true negatives
analysis.specs$simulated_data <- TRUE # specify that the analysis is based on simulated data where the ground thruth is known
analysis.specs$pos_regions <- '../Example_data/Example_simulation_enhancers.csv' # file location of all known positive regions
analysis.specs$evaluate_perElement_Performance <- TRUE # specify that the performance of different methods is to be evaluated
analysis.specs$positiveLabels <- 'pos' # label for regions which are true positives
analysis.specs$negativeLabels <- c('neg', 'chr') # labels for regions which are true negatives
- Depending on the CRISPR system used, the range of perturbation effect will be different. We recommend setting the range to 20bp for
CRISPRcas9
and to 1000bp forCRISPRi
andCRISPRa
. Note that the effect range is added to the positions specified in the info file. If the effect range is already accounted for in the positions specified in the annotation file, then it should be set to zero here.
In case of a dualCRISPR
system, an arbitrary crisprEffectRange
can be specified as RELICS will automatically use the deletion range between guide 1 and guide 2 as effect range.
analysis.specs$crisprSystem <- 'CRISPRi' # other options: CRISPRcas9, CRISPRa, dualCRISPR
analysis.specs$crisprEffectRange <- 1000 # use $deletionSize if system is dualCRISPR
- Once you have your flags set, create a specification file using the
write_specs_file()
function. The arguments and their names will be written to this file for future reference. The two arguments that this function takes are the list of flags you just set (analysis.specs
) and the desired name of the output file (.txt
will be automatically used as the file extension, so do not include any file extension for this argument).
write_specs_file(analysis.specs, 'Example_performanceEval_specs')
- Once the specification file has been set up, simply use the
analyze_data()
function to start the analysis. The example here should take about 5 minutes, depending on your operating system.
analyze_data('Example_performanceEval_specs.txt')
2.1.1 alphaRRA. To analyze data with MAGeCK as used by Diao et al. 2017, specify use of alphaRRA
and the bins within which to tile the analyzed region. Note, alphaRRA
will be applied to per-guide scores from all methods given by the Method
flag.
analysis.specs$postScoreAlphaRRA <- TRUE
analysis.specs$binSize <- 50
2.1.2 Sliding Window. To combine guide scores using a sliding window as in Fulco et al. 2016 or Simeonov et al. 2017, specify the number of guides to include per sliding window as well as the maximum window size (if the tiled deletion generates a 10KB gap it would not make sense to consider the entire region as one score).
analysis.specs$postScoreSlidingWindow <- TRUE
analysis.specs$guidePerSlidingWindow <- 15
analysis.specs$maxWindowSize <- 8000
The guide count file provided for analysis is a .csv file, where each column is a pool and each row the counts from a guide in a given pool (without guide names). The rows must match the guide information file (see below) such that guide in row 1 is the same across both files. The file should have a header, but the column names do not matter. Top of example guide count file: Example_simulation_counts.csv
repl1_before | repl1_after | repl2_before | repl2_after |
---|---|---|---|
15 | 6 | 11 | 14 |
8 | 1 | 6 | 5 |
8 | 6 | 3 | 11 |
15 | 16 | 8 | 9 |
For the analysis the provided guide information file must contain the following: chromosome, start and end position as well as a guide label. The guide labels are provided as strings, same with the chromosome positions. Top of example guide information file Example_simulation_info.csv.
chrom | start | end | label |
---|---|---|---|
chr8 | chr | 128703371 | 128703391 |
chr8 | chr | 128703511 | 128703531 |
chr8 | chr | 128703521 | 128703541 |
chr8 | chr | 128703539 | 128703559 |