STARR_Seq_to_MPRAnalyze_input is a pipeline that converts STARR-seq data (.fastq.gz) into input files for MPRAnalyze (.csv). This repository contains the workflow and scripts for processing the data; many of the scripts were adapted and curated from ones written by Arpit Misha (post-doc in the Hawkins lab). For questions, bugs, or errors, please contact me at [email protected] or [email protected].
- Confirm that conda is installed.
- Clone this repository into the location you want to run the pipeline.
- Create and activate the provided environments. The commands below clone the repository and create both environments:
git clone https://github.com/hawkins-lab/STARR_Seq_to_MPRAnalyze_input.git \
&& cd STARR_Seq_to_MPRAnalyze_input/env/ \
&& conda env create -f STARRSeq2MPRAnalyze_env \
&& conda env create -f umitools_env
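As a quick sanity check for the first step, something like the following sketch can confirm that conda is on the PATH and show which environments exist. The helper name `have_cmd` is made up for this example and is not part of the repository.

```shell
# Return 0 if the named command is on PATH, non-zero otherwise.
# (have_cmd is a hypothetical helper, not a script shipped with the pipeline.)
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd conda; then
    conda --version
    # List environments; the pipeline environments should appear here
    # once the `conda env create` commands have finished.
    conda env list
else
    echo "conda not found on PATH; install Miniconda or Anaconda first" >&2
fi
```

If the environments are missing from the list, rerun the `conda env create` commands from inside the env/ directory.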
- Navigate to the STARR_Seq_to_MPRAnalyze_input directory.
- Add your STARR-seq files to the pipeline_input directory. These should be .fastq.gz files.
- Submit a job to the compute cluster with the run_pipeline.sh script.
- Wait ~12 hours for the data to be processed.
- Check the pipeline_output/13_final_mpranalyze_input/ directory for the MPRAnalyze input .csv files.
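Because a run takes about 12 hours, it may be worth checking the input files before submitting the job. The sketch below is an optional pre-flight check, not part of the pipeline: `check_inputs` is a hypothetical helper that flags files without a .fastq.gz extension and files that fail a gzip integrity test. It assumes you run it from the repository root.

```shell
# Check that every file in a directory is a readable .fastq.gz file.
# Prints one line per problem and returns non-zero if any file fails.
# (check_inputs is a hypothetical helper, not a script in this repository.)
check_inputs() {
    dir="$1"
    rc=0
    for f in "$dir"/*; do
        # If the glob matched nothing, $f is the literal pattern and does not exist.
        [ -e "$f" ] || { echo "no files found in $dir"; return 1; }
        case "$f" in
            *.fastq.gz) ;;
            *) echo "unexpected extension: $f"; rc=1; continue ;;
        esac
        # gzip -t tests archive integrity without writing any output.
        gzip -t "$f" 2>/dev/null || { echo "corrupt gzip: $f"; rc=1; }
    done
    return "$rc"
}

# Only run the check when the input directory is present.
if [ -d pipeline_input ]; then
    check_inputs pipeline_input || echo "fix the files listed above before submitting"
fi
```

This only verifies the extension and the compression, not the FASTQ content itself.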
Just a few notes for reference:
- We run all heavy computational processes on the University of Washington Genome Sciences computing cluster.
- Arpit has notified me that barcode lengths can differ between datasets, which affects the first steps of processing the fastq.gz files. If that is the case, you might need to modify the trimming values (e.g. the numbers in cutadapt -j 0 -u 50 -u -40 might need to be different from what is written here).
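To make the trimming numbers concrete: in cutadapt, a positive `-u` value removes that many bases from the start of each read and a negative value removes bases from the end, so `-u 50 -u -40` drops the first 50 and last 40 bases. The sketch below emulates this on a single short sequence and then shows how the values could be parameterized. The helper `trim_read`, the variables `HEAD` and `TAIL`, and the file names are assumptions for illustration, not names used by the pipeline scripts.

```shell
# Emulate `-u HEAD -u -TAIL` on one sequence: drop HEAD bases from the start
# and TAIL bases from the end. Illustration only; cutadapt itself also trims
# the quality string and processes whole FASTQ files.
trim_read() {
    read_seq="$1"; head_n="$2"; tail_n="$3"
    keep=$(( ${#read_seq} - head_n - tail_n ))
    if [ "$keep" -le 0 ]; then
        echo ""
    else
        printf '%s\n' "$read_seq" | cut -c "$(( head_n + 1 ))-$(( head_n + keep ))"
    fi
}

trim_read "AAACCCGGGTTT" 3 4   # prints CCCGG

# Hypothetical parameterized call; adjust HEAD and TAIL to your barcode layout.
HEAD=50
TAIL=40
if command -v cutadapt >/dev/null 2>&1 && [ -f sample.fastq.gz ]; then
    cutadapt -j 0 -u "$HEAD" -u "-$TAIL" -o sample.trimmed.fastq.gz sample.fastq.gz
fi
```

Keeping the two lengths in variables makes it easier to see what needs to change when a library uses a different barcode length.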