Skip to content

Latest commit

 

History

History
178 lines (133 loc) · 9.75 KB

README (1).md

File metadata and controls

178 lines (133 loc) · 9.75 KB

CRSsim: performance

This capsule contains code and instructions for assessing the performance of various analysis methods for analyzing CRISPR regulatory screens. It contains the following sections:

Analysis and performance

Advanced flags

Note; the example below is from the GitHub and can be found here. The example has been altered slightly (referencing of file locations) to fit the format of Ocean Code.

1. Analyze simulated data and evaluate performance

After simulating CRISPR screen data, you are ready to analyze it! CRSsim provides a script for analyzing the data using a number of methods and comparing their performance.

Running the analysis will generate the following outputs:

  • a per-guide scores file, written to {output_name}_guideScores.csv
  • a per-genome scores file, written to {output_name}_genomeScores.csv
  • a bedGraph file for genome scores, written to {output_name}_genomeScores.bedgraph
  • an AUC plot summarizing the performance of all methods compared, written to {output_name}_perElement_AUCeval.pdf
  • a prAUC plot summarizing the performance of all methods compared, written to {output_name}_perElement_prAUCeval.pdf
  • a .csv file summarizing the AUC and prAUC scores for each method evaluated, written to {output_name}_perElement_method_eval.csv

1.1 Installations and setup for analysis and performance evaluation

To analyze data and evaluate method performance you will need the R packages listed below. This capsule should already have them set up.

1.1.1 R packages

  • dplyr

  • ggplot2

  • pROC

  • glmmTMB

  • extraDistr

  • MESS

1.1.2 Bioconductor packages

  • IRanges

  • GenomicRanges

  • edgeR

  • DESeq2

1.2 Analyze sample selection screen data: Flags used in CRSsim_performance_Rscript.r

We have provided an example data set from a selection screen. This way you don't have to first simulate your own data; however, the example on GitHub walks through how to do this yourself.

1.2.1 Source the script

In this example we will evaluate the performance of RELICS and other mehtods. To get access to these methods the corresponding scripts must first be sourced.

# working directory is in 'code'
source('RELICS.r')
source('RELICS_performance.r')

2.2.2. Setting up analysis flags

  1. Flags are set up as an R list object. Run the following command in R to create this object:
analysis.specs <- list()
  1. Set the output name for the analysis. All output files generated by the analysis will contain this prefix in the filehandle. Be sure to choose a different name from any existing output files. In this case we are specifying the output location (../results/) in addition to the file name (Example_performanceEval).
analysis.specs$dataName <- '../results/Example_performanceEval'
  1. Specify the paths to the guide information file and the counts file generated in the simulation step. Here, we have provided example files so that the user does not have to run a full simulation prior to this example. For file format details see section 3.1 below or here.
analysis.specs$test1 <- 'string1'

some text

analysis.specs$test2 <- 'string2'

other text

analysis.specs$CountFileLoc <- '../data/Example_simulation_counts.csv'
analysis.specs$sgRNAInfoFileLoc <- '../data/Example_simulation_info.csv'

analysis.specs$sgRNAInfoFileLoc <- '../data/Example_simulation_info.csv'

  1. Multiple analysis methods can be applied to the data: RELICS, fold change, edgeR, and DESeq2. The user can specify which of these methods they would like to apply to the data:
analysis.specs$Method <- c('RELICS-search', 'FoldChange', 'edgeR', 'DESeq2')

Each pool in a replicate is separated by a comma, each replicate is separated by a semi-colon. Detail for the RELICS analysis instructions can be found here.

analysis.specs$repl_groups <- '1,2;3,4'
analysis.specs$glmm_positiveTraining <- 'exon'
analysis.specs$glmm_negativeTraining <- 'neg' 
  1. For edgeR, DESeq2, and fold change, select the pools to be compared against one another. Pools are referenced by their corresponding column index in the count file. For DESeq2 and edgeR, Group1 refers to the pools where an enrichment of guides is expected. In this case we're analyzing a selection screen where guides targeting regulatory elements drop out, so relatively speaking they will be enriched in the before pools. For fold change, Group1 is the numerator and Group2 the denominator. Log10 fold change is calculated.
analysis.specs$Group1 <- c(1,3)
analysis.specs$Group2 <- c(2,4)
  1. For fold change, you need to specify whether the different pools are paired (from the same replicate with a 1-1 correspondence) or if there is an imbalance between the groups. In the case of the latter, all counts from Group1 and Group2 are combined respectively before calculating the fold change.
analysis.specs$foldChangePaired <- TRUE # else FALSE
  1. Specify that results should be evaluated based on a set of regions known to be true positives and true negatives
analysis.specs$simulated_data <- TRUE # specify that the analysis is based on simulated data where the ground thruth is known
analysis.specs$pos_regions <- '../Example_data/Example_simulation_enhancers.csv' # file location of all known positive regions
analysis.specs$evaluate_perElement_Performance <- TRUE # specify that the performance of different methods is to be evaluated
analysis.specs$positiveLabels <- 'pos' # label for regions which are true positives
analysis.specs$negativeLabels <- c('neg', 'chr') # labels for regions which are true negatives
  1. Depending on the CRISPR system used, the range of perturbation effect will be different. We recommend setting the range to 20bp for CRISPRcas9 and to 1000bp for CRISPRi and CRISPRa. Note that the effect range is added to the positions specified in the info file. If the effect range is already accounted for in the positions specified in the annotation file, then it should be set to zero here.

In case of a dualCRISPR system, an arbitrary crisprEffectRange can be specified as RELICS will automatically use the deletion range between guide 1 and guide 2 as effect range.

analysis.specs$crisprSystem <- 'CRISPRi' # other options: CRISPRcas9, CRISPRa, dualCRISPR
analysis.specs$crisprEffectRange <- 1000 # use $deletionSize if system is dualCRISPR
  1. Once you have your flags set, create a specification file using the write_specs_file() function. The arguments and their names will be written to this file for future reference. The two arguments that this function takes are the list of flags you just set (analysis.specs) and the desired name of the output file (.txt will be automatically used as the file extension, so do not include any file extension for this argument).
write_specs_file(analysis.specs, 'Example_performanceEval_specs')
  1. Once the specification file has been set up, simply use the analyze_data() function to start the analysis. The example here should take about 5 minutes, depending on your operating system.
analyze_data('Example_performanceEval_specs.txt')

2. Advanced Flags

2.1 Advanced Analysis and performance evaluation

2.1.1 alphaRRA. To analyze data with MAGeCK as used by Diao et al. 2017, specify use of alphaRRA and the bins within which to tile the analyzed region. Note, alphaRRA will be applied to per-guide scores from all methods given by the Method flag.

analysis.specs$postScoreAlphaRRA <- TRUE
analysis.specs$binSize <- 50

2.1.2 Sliding Window. To combine guide scores using a sliding window as in Fulco et al. 2016 or Simeonov et al. 2017, specify the number of guides to include per sliding window as well as the maximum window size (if the tiled deletion generates a 10KB gap it would not make sense to consider the entire region as one score).

analysis.specs$postScoreSlidingWindow <- TRUE
analysis.specs$guidePerSlidingWindow <- 15
analysis.specs$maxWindowSize <- 8000

3. Input table formats

3.1 analysis and performance evaluation

3.1.1 Guide count file

The guide count file provided for analysis is a .csv file, where each column is a pool and each row the counts from a guide in a given pool (without guide names). The rows must match the guide information file (see below) such that guide in row 1 is the same across both files. The file should have a header, but the column names do not matter. Top of example guide count file: Example_simulation_counts.csv

repl1_before repl1_after repl2_before repl2_after
15 6 11 14
8 1 6 5
8 6 3 11
15 16 8 9

3.1.2 Guide information file

For the analysis the provided guide information file must contain the following: chromosome, start and end position as well as a guide label. The guide labels are provided as strings, same with the chromosome positions. Top of example guide information file Example_simulation_info.csv.

chrom start end label
chr8 chr 128703371 128703391
chr8 chr 128703511 128703531
chr8 chr 128703521 128703541
chr8 chr 128703539 128703559