christianparobek edited this page Sep 25, 2014 · 18 revisions

It is important to filter tandem repeats. I used a local instance of Tandem Repeats Finder (TRF), v4.07b, which turns out to be the same program Derrick used to identify tandem repeats in his P. falciparum genomes. I ran the following command line on my lab workstation:

trf407b.linux64 ref.fasta 2 7 7 80 10 50 500 -h -ngs > trf.txt

Derrick used similar parameters for his genomes: 2 7 7 80 10 50 200.
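
For reference, the seven positional numbers are TRF's Match, Mismatch, Delta, PM, PI, Minscore and MaxPeriod, in that order, so the two parameter sets differ only in MaxPeriod (500 vs. 200). The sketch below spells this out with named variables. The variable names and the `trf_cmd` helper are hypothetical, chosen only for illustration; the helper prints the command line and does not run TRF.

```shell
#!/usr/bin/env bash
# Positional TRF parameters, in the order TRF expects them.
match=2        # matching weight
mismatch=7     # mismatch penalty
delta=7        # indel penalty
pm=80          # match probability
pi=10          # indel probability
minscore=50    # minimum alignment score to report
maxperiod=500  # maximum period size to report (Derrick used 200)

# Hypothetical helper: print the TRF command line for a given reference FASTA.
# It only builds the string; it does not execute TRF.
trf_cmd() {
	local ref=$1
	echo "trf407b.linux64 $ref $match $mismatch $delta $pm $pi $minscore $maxperiod -h -ngs"
}

trf_cmd ref.fasta
```

Keeping the values in named variables makes it easier to see which setting changes when comparing runs.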

The output of this can be converted to different formats. The following converts to GATK's ".intervals" format (chr:start-stop), using bash's regex matching:

# Christian Parobek
# Started September 18 2014
# Script to convert Tandem Repeats Finder output format
# to GATK's ".intervals" format to use with SelectVariants
# Must specify an input and an output filename on the command line

while read -r line; do
	if [[ $line =~ @([^ ]+) ]]; then # match all non-space characters after @
		chromosome="${BASH_REMATCH[1]}:" # add a colon after chr name
	elif [[ $line =~ ([0-9]+ [0-9]+) ]]; then # match the start and stop coordinates
		string=$chromosome${BASH_REMATCH[1]} # put chr name together with coords
		echo "${string// /-}" >> "$2" # change space to dash between coords
	fi
done < "$1"
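
To make the expected input and output concrete, here is a minimal sketch of the same conversion written as a function that reads stdin and writes stdout. The function name `trf_to_intervals` is hypothetical, and the sample input is made up and trimmed to a few fields per repeat line (real `-ngs` lines carry more columns). It assumes each sequence header line starts with `@` and each repeat line starts with the start and stop coordinates.

```shell
#!/usr/bin/env bash
# Sketch: convert TRF -ngs output on stdin to chr:start-stop lines on stdout.
trf_to_intervals() {
	local header_re='^@([^ ]+)'           # sequence header line: @name ...
	local coords_re='^([0-9]+) ([0-9]+) ' # repeat line begins with start and stop
	local chromosome="" line
	while read -r line; do
		if [[ $line =~ $header_re ]]; then
			chromosome=${BASH_REMATCH[1]}   # remember the current sequence name
		elif [[ $line =~ $coords_re ]]; then
			echo "${chromosome}:${BASH_REMATCH[1]}-${BASH_REMATCH[2]}"
		fi
	done
}

# Made-up input for illustration only.
trf_to_intervals <<'EOF'
@chr01 example header
1520 1583 7 9.1 7 90 5 100
20411 20460 2 25.0 2 100 0 100
@chr02
88 131 4 11.0 4 95 0 79
EOF
```

For the made-up input above, this prints `chr01:1520-1583`, `chr01:20411-20460` and `chr02:88-131`, one per line. Holding the regexes in variables avoids quoting problems with spaces inside `[[ ... =~ ... ]]`.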

The following converts to BED format (chr\tstart\tstop), using bash's regex matching:

# Christian Parobek
# Modified September 25 2014
# Script to convert Tandem Repeats Finder output format
# to BED format
# Must specify an input and an output filename on the command line

while read -r line; do
	if [[ $line =~ @([^ ]+) ]]; then # match all characters after @ except for space
		chromosome="${BASH_REMATCH[1]}\t" # add a tab after chr name
	elif [[ $line =~ ([0-9]+ [0-9]+) ]]; then # match the start and stop coordinates
		string=$chromosome${BASH_REMATCH[1]} # put chr name together with coords
		echo -e "${string// /\t}" >> "$2" # change space to tab between coords
	fi
done < "$1"
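
One caveat may be worth checking before using the BED output: BED intervals are conventionally 0-based and half-open, whereas the script above writes the TRF start and stop unchanged. Assuming TRF reports 1-based, inclusive coordinates (an assumption to verify against the TRF documentation), a variant that shifts the start by one could look like the sketch below. The function name `trf_to_bed0` is hypothetical and the sample input is made up.

```shell
#!/usr/bin/env bash
# Sketch: TRF -ngs output on stdin -> BED (chr<TAB>start<TAB>stop) on stdout,
# with the start shifted to 0-based. Assumes TRF coordinates are 1-based, inclusive.
trf_to_bed0() {
	local header_re='^@([^ ]+)'           # sequence header line: @name ...
	local coords_re='^([0-9]+) ([0-9]+) ' # repeat line begins with start and stop
	local chromosome="" line
	while read -r line; do
		if [[ $line =~ $header_re ]]; then
			chromosome=${BASH_REMATCH[1]}
		elif [[ $line =~ $coords_re ]]; then
			# BED start = 1-based start minus 1; BED end = 1-based inclusive end, unchanged
			printf '%s\t%d\t%d\n' "$chromosome" "$(( ${BASH_REMATCH[1]} - 1 ))" "${BASH_REMATCH[2]}"
		fi
	done
}

# Made-up input for illustration only.
trf_to_bed0 <<'EOF'
@chr01 example header
1520 1583 7 9.1 7 90 5 100
EOF
```

For the made-up line above, the repeat at 1520 to 1583 becomes the tab-separated BED record `chr01`, `1519`, `1583`. Whether the shift matters depends on how the downstream tool interprets the file, so the original behavior may be fine for some uses.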