Update 2023-5-30:
A pipeline for GLORI is available here: https://github.com/jhfoxliu/GLORI_pipeline
Update 2022-10-29:
Congrats the publication of the m6A field game changer GLORI! So happy to be the source of inspiration to their pipeline!!
Update 2021-9-8:
Update scripts in python3 (give up...)
Update 2021-6-30:
It shows that hisat-3N is a very powerful tool for BS-seq alignment. We are developping a new pipeline to integrate it.
Update 2021-6-3:
We are updating a galaxy verion of this pipeline.
Update 2019-11-4:
There should be some bugs in the replicate merging step, since some extreme situations might not be considerated. If anyone got errors while running that script, don't be hesitant to inform the author.
Update 2019-9-16:
#important update# I found that the pileup` function was update after pysam v0.15.0. An important change is that pysam will no longer return non-proper
alignments as default. This can result in underestimated coverages and inaccurate m5C levels. To avoid this, you can add ignore_orphans=False
and ignore_overlaps=False
to Line 29 in the original file. Or please use pileup_genome_multiprocessing_v1.4_pysam_v0.15.0.py
if you are using pysam version >= v0.15.0.
Update 2019-7-26: m5C_caller_multiple.py -- fix a bug in using overall conversion rate
Update 2019-5-19: m5C_pileup_formatter.py -- fix version error
This is a highly modularized pipeline used in our mRNA m5C sequencing project.
System | CentOS Linux release 7.4.1708 (Core) |
CPU | 8 cores or more |
RAM | 70 GB or more |
Disk | 50 GB for human metadata |
200 GB for a 30 million PE150 data | |
500 GB for a 200 million PE125 data |
- Windows is not compatible since some modules cannot be installed.
- Exactly, we don't need so much disk... Add some clean-up steps and compress the results will help. 60+GB is a reasonable size for compressed results from ~100 million raw reads (2.2GB for the pileup results, you can delete the useless files to release the disk). Normally, 20GB is enough for 30 million reads.
- After pileup, it requires ~25GB memory to tidy up the results for human genome. Use a database in disk will decrese the amount of memory usage, but the program runs much slower.
Python 2.7.14/2.7.16 with numpy (1.13.3), scipy (0.19.1), pysam (0.12.0.1), Biopython (1.70). Higher versions of these modules should be compatible.
JAVA | JRE 1.8.0_131 |
Quality control and formatting | Cutadapt 1.14 |
Trimmomatic 0.36 | |
Mapping | HISAT2 2.1.0 |
Bowtie 2.2.9/2.3.4.2 | |
meRanGh/meRanGs (compatible in theory) | |
Bam processing | Samtools 1.6 |
Name | Usage |
---|---|
Metadata generation | |
gtf2anno.py | Transfer GTF to UCSC annotation table (Ensembl format only) |
gtf2genelist.py | Extract gene/isoform information from GTF (Ensembl format only) |
anno_to_base.py | Annotate each base in GTF |
anno_to_base_remove_redundance.py | Remove the redundance |
fasta_c2t.py | Convert Cs in fasta to Ts |
BS_hisat2_index.py | Build HISAT2 indexes |
ref_sizes.py | Extract the lengths of the references |
Alignment and pileup | |
BS_hisat2.py | Map reads to the genome |
BS_bowtie2.py | Map reads to the transcriptome |
Bam_transcriptome_to_genome_v1.0.py | Convert transcriptome alignment to genome alignment (BAM to BAM) |
concat_bam.py | Merge BAM files |
pileup_genome_multiprocessing_v1.4.py | Split the reference and pileup in a multiprocessing manner |
m5C_pileup_formatter.py | Format pileups |
Call sites | |
m5C_caller.py | Call m5C sites from the formatted pileup files |
m5C_caller_multiple.py | Call m5C sites from multiple samples |
m5C_intersection_single_r1.py | Identify m5C sites in each sample |
m5C_intersection_multi_r1.py | Identify m5C sites in replicates |
Most of the scripts can run stand-alone. Make sure python modules are installed. You can use pip
to install all the modules required. Python 3 is not compatible recently, however, one can try using python's 2to3
util.
Please read m5C-BS-seq-step-by-step-computation-protocol-v1.1.pdf
for details.
If you are using task managers like SGE
or SJM
, you can use the script m5C_pipeline_generator_qsub.py
or m5C_pipeline_generator_SJM.py
to generate .sh or .sjm files (for SJM
, you need to install sjm_tools
<https://github.com/sysuliujh/Bioinfo-toolkit/tree/master/sjm_tools> first).
Please contact Jianheng Liu ([email protected])
for questions and bug report.
Please cite Huang, T., Chen, W., Liu, J., Gu, N. & Zhang, R. Genome-wide identification of mRNA 5-methylcytosine in mammals. Nature structural & molecular biology, doi:10.1038/s41594-019-0218-x (2019) (https://www.nature.com/articles/s41594-019-0218-x).