Merge pull request #10 from nterhoeven/develop
Develop
nterhoeven authored Feb 6, 2018
2 parents 303c167 + 5edd001 commit f3138c5
Showing 1 changed file with 10 additions and 10 deletions.
20 changes: 10 additions & 10 deletions paper/paper.md
@@ -32,31 +32,31 @@ bibliography: paper.bib

Repetitive elements constitute a substantial fraction of most eukaryotic genomes.
Still, their actual amount differs strongly between species. For example, the genome of *Saccharomyces cerevisiae*
-contains only about 3 % repeats ([@kim_transposable_1998]), *Arabidopsis* harbours 14 % ([@the_arabidopsis_genome_initiative_analysis_2000]),
-human 50 % ([@lander_initial_2001]) and wheat even 90 % ([@clavijo_improved_2017]).
+contains only about 3 % repeats [@kim_transposable_1998], *Arabidopsis* harbours 14 % [@the_arabidopsis_genome_initiative_analysis_2000],
+human 50 % [@lander_initial_2001] and wheat even 90 % [@clavijo_improved_2017].

Annotation and Classification of these elements is a pivotal step in the annotation of each genome.
Furthermore, tracing their history can give ample insights into the evolution of a genome and thereby,
-of a species. Accordingly, different methods for repeat annotation have been developed ([@smit_repeatmasker_2013], [@benson_tandem_1999], [@gymrek_lobstr:_2012]).
+of a species. Accordingly, different methods for repeat annotation have been developed [@smit_repeatmasker_2013; @benson_tandem_1999; @gymrek_lobstr:_2012].
Still, typically they rely on an assembled genome sequence – a prerequisite which can lead to erroneous results.
As repetitive elements are highly similar assembly algorithms will collapse repeat variants into a single
occurrence or not assemble the repetitive regions at all. Thus, the annotation of repeat regions and thereby the
characterization of their content and diversity solely based on an assembled genome sequence can give misleading results.

To address this challenge, we developed reper, a kmer based method to detect, classify and quantify repeats
in next generation sequencing (NGS) data without the need of a genome assembly.
-Our pipeline samples reads with high kmer coverage directly from the NGS dataset (kmer counts based on jellyfish [@marcais_fast_2011]). This subset is
-assembled using the transcriptome assembler Trinity ([@grabherr_full-length_2011]), allowing reper to recover repeat variants at a high resolution.
-To create exemplar sequences of each repeat in the genome, the assembled repeats ar clustered using cd-hit ([@li_cd-hit:_2006],[@fu_cd-hit:_2012]).
-These are further classified based on homology to known repeats using multiple blast ([@camacho_blast]) searches. Since reper was developed with
-a focus on plant data, the default classification libraries are REdat ([@nussbaumer_mips_2012]) for repeats, and refseq ([@oleary_reference_2016]) for chloroplast and mitochondrial
+Our pipeline samples reads with high kmer coverage directly from the NGS dataset. The kmer counts are acquired using jellyfish [@marcais_fast_2011]. This subset is
+assembled using the transcriptome assembler Trinity [@grabherr_full-length_2011], allowing reper to recover repeat variants at a high resolution.
+To create exemplar sequences of each repeat in the genome, the assembled repeats are clustered using cd-hit [@li_cd-hit:_2006; @fu_cd-hit:_2012].
+These are further classified based on homology to known repeats using multiple blast [@camacho_blast] searches. Since reper was developed with
+a focus on plant data, the default classification libraries are REdat [@nussbaumer_mips_2012] for repeats, and refseq [@oleary_reference_2016] for chloroplast and mitochondrial
sequences. The reference database, however, can easily be customized to the user's needs. A configuration script for
the popular, but proprietary database repbase is provided with the package as well.
-Next, the repeat content is quantified on sequence, cluster and class level using read mappings (bowtie2 and samtools, [@langmead_fast_2012] and [@li_sequence_2009]).
+Next, the repeat content is quantified on sequence, cluster and class level based on read mappings using bowtie2 and samtools [@langmead_fast_2012; @li_sequence_2009].
Finally, the repeat landscape can be analyzed and graphically represented with the R script provided with the pipeline.
Currently, reper is specifically customized to work with paired-end Illumina data, but support of long-read technologies such as PacBio and Nanopore is in development.

-To date, there is only a single software package with a similar functionality to reper, namely dnaPipeTE ([@goubert_novo_2015]).
+To date, there is only a single software package with a similar functionality to reper, namely dnaPipeTE [@goubert_novo_2015].
Still, it relies on dependencies like RepeatMasker, which has to be installed independently as well as the proprietary repeat database repbase by giri.
In contrast, the reper source code is available on [github](https://github.com/nterhoeven/reper) under the MIT license.
To further ease installation and usage, a Docker container with a complete reper installation is also provided.