Karect
======
KAUST Assembly Read Error Correction Tool
Installation
============
tar -xzf karect-1.0.tgz
cd karect
make
Instructions
============
To get all instructions, run the program:
./karect
Test Data and Running Example
=============================
Karect accepts any fasta/fastq file of assembly reads as input.
The running example below, used in the paper, corrects Staphylococcus aureus Illumina reads:
1) Download the files frag_1.fastq.gz and frag_2.fastq.gz (and genome.fasta if you need to evaluate results) from:
http://gage.cbcb.umd.edu/data/Staphylococcus_aureus/Data.original/
2) Decompress frag_1.fastq.gz and frag_2.fastq.gz:
gunzip frag_1.fastq.gz
gunzip frag_2.fastq.gz
3) Use Karect to correct the read sequences (modify file paths if needed):
./karect -correct -threads=12 -matchtype=hamming -celltype=haploid -inputfile=./frag_1.fastq -inputfile=./frag_2.fastq
This produces the corrected read files ./karect_frag_1.fastq and ./karect_frag_2.fastq
4) If you need to evaluate correction accuracy using the reference genome (genome.fasta):
4a) First, align original reads to the reference genome, to produce the file ./align.txt
./karect -align -threads=12 -matchtype=hamming -inputfile=./frag_1.fastq -inputfile=./frag_2.fastq -refgenomefile=./genome.fasta -alignfile=./align.txt
4b) Second, evaluate the correction accuracy to produce the file ./eval.txt
./karect -eval -threads=12 -matchtype=hamming -inputfile=./frag_1.fastq -inputfile=./frag_2.fastq -resultfile=./karect_frag_1.fastq -resultfile=./karect_frag_2.fastq -refgenomefile=./genome.fasta -alignfile=./align.txt -evalfile=./eval.txt
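Steps 3, 4a and 4b can be chained in a small shell script. The following is only a sketch, not part of Karect: the variables KARECT, THREADS and DRY_RUN and the helper functions run and karect_pipeline are hypothetical names introduced here for illustration. Every karect option is copied unchanged from the commands above. By default the script only prints the three commands so they can be checked first.

```shell
#!/bin/sh
# Sketch only: chains steps 3, 4a and 4b of the running example.
# KARECT, THREADS, DRY_RUN, run and karect_pipeline are hypothetical names;
# all karect options are taken verbatim from the commands above.
set -eu

KARECT="${KARECT:-./karect}"   # path to the karect binary
THREADS="${THREADS:-12}"       # value passed to -threads
DRY_RUN="${DRY_RUN:-1}"        # 1 = only print commands, 0 = also execute them

run() {
    # Print the command line; execute it only when DRY_RUN=0.
    printf '%s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

karect_pipeline() {
    # Step 3: correct the reads.
    run "$KARECT" -correct -threads="$THREADS" -matchtype=hamming -celltype=haploid \
        -inputfile=./frag_1.fastq -inputfile=./frag_2.fastq

    # Step 4a: align the original reads to the reference genome.
    run "$KARECT" -align -threads="$THREADS" -matchtype=hamming \
        -inputfile=./frag_1.fastq -inputfile=./frag_2.fastq \
        -refgenomefile=./genome.fasta -alignfile=./align.txt

    # Step 4b: evaluate the correction accuracy.
    run "$KARECT" -eval -threads="$THREADS" -matchtype=hamming \
        -inputfile=./frag_1.fastq -inputfile=./frag_2.fastq \
        -resultfile=./karect_frag_1.fastq -resultfile=./karect_frag_2.fastq \
        -refgenomefile=./genome.fasta -alignfile=./align.txt -evalfile=./eval.txt
}

karect_pipeline
```

Saved under a name of your choice, for example run_example.sh, it prints the commands when run as "sh run_example.sh" and executes them when run as "DRY_RUN=0 sh run_example.sh". Because of "set -e", a failing step stops the script before the later steps run.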
Author
======
Amin Allam
Reference
=========
Currently submitted to Bioinformatics, under the title:
Karect: Accurate Correction of Substitution, Insertion and Deletion Errors for Next-generation Sequencing Data