christianparobek edited this page Sep 25, 2014 · 18 revisions

It is important to filter out tandem repeats. I identified them with a local copy of Tandem Repeats Finder (TRF) v4.07b, the same program Derrick used to identify tandem repeats in his P. falciparum genomes. I ran the following command on my lab workstation:

trf407b.linux64 ref.fasta 2 7 7 80 10 50 500 -h -ngs > trf.txt

Derrick used similar parameters for his genomes: 2 7 7 80 10 50 200. The only difference is the last value, the maximum period size (200 instead of 500).
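
For readers unfamiliar with TRF's positional arguments, the sketch below spells out the seven numbers from the command above in the order TRF expects them (Match, Mismatch, Delta, PM, PI, Minscore, MaxPeriod). The variable names are my own labels, not anything TRF defines, and the script only prints the assembled command as a dry run rather than running TRF.

```shell
#!/usr/bin/env bash
# Illustrative only: the variable names are hypothetical labels for TRF's
# positional arguments; the values are the ones used in the command above.
match=2          # matching weight
mismatch=7       # mismatch penalty
delta=7          # indel penalty
pm=80            # match probability (percent)
pi=10            # indel probability (percent)
minscore=50      # minimum alignment score to report a repeat
maxperiod=500    # maximum period size to report

trf_args=("$match" "$mismatch" "$delta" "$pm" "$pi" "$minscore" "$maxperiod")

# -h suppresses the HTML output; -ngs writes compact output to stdout.
trf_cmd="trf407b.linux64 ref.fasta ${trf_args[*]} -h -ngs"

# Dry run: show the command that would be redirected into trf.txt.
echo "$trf_cmd > trf.txt"
```

Naming the values this way makes it easier to see which setting changes between runs, such as the maximum period size.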

The TRF output can be converted to other formats. The following script, trf2gatk.sh, converts it to GATK's ".intervals" format (chr:start-stop) using bash regex matching:

# Christian Parobek
# Started September 18 2014
# Script to convert Tandem Repeat Finder output format
# to GATK's ".intervals" format to use with SelectVariants
# Must specify an input and an output filename at the command line

while read -r line
do
	if [[ $line =~ @([^ ]+) ]] # match all non-space characters after @
	then
		chromosome="${BASH_REMATCH[1]}:" # add a colon after chr name
	elif [[ $line =~ ^([0-9]+)\ ([0-9]+) ]] # start and stop coords open each record
	then
		echo "$chromosome${BASH_REMATCH[1]}-${BASH_REMATCH[2]}" # chr:start-stop
	fi
done < "$1" > "$2" # read the input file; the output file is overwritten, not appended to
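
To see what the conversion does, the sketch below wraps the same two regex tests in a function and feeds it a made-up, abbreviated record (real TRF -ngs lines carry more columns after the coordinates). The function name trf_to_intervals and the sample values are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: same matching logic as trf2gatk.sh, but reading stdin.
trf_to_intervals() {
	local line chromosome=""
	local coords_re='([0-9]+) ([0-9]+)' # first two space-separated integers
	while read -r line
	do
		if [[ $line =~ @([^ ]+) ]]; then
			chromosome="${BASH_REMATCH[1]}" # sequence name from the @ header
		elif [[ $line =~ $coords_re ]]; then
			echo "${chromosome}:${BASH_REMATCH[1]}-${BASH_REMATCH[2]}"
		fi
	done
}

# Made-up input: one sequence header and one truncated repeat record.
sample=$'@chr1 example header\n1001 1050 2 25.0 AT'
result=$(trf_to_intervals <<< "$sample")
echo "$result" # prints chr1:1001-1050
```

Keeping the regex in a variable avoids having to escape the space inside the [[ ... =~ ... ]] test.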

The following script, trf2bed.sh, converts the TRF output to BED format (chr\tstart\tstop), also using bash regex matching:

# Christian Parobek
# Modified September 25 2014
# Script to convert Tandem Repeat Finder output format
# to BED format
# Must specify an input and an output filename at the command line

while read -r line
do
	if [[ $line =~ @([^ ]+) ]] # match all characters after @ except for space
	then
		chromosome="${BASH_REMATCH[1]}" # chr name
	elif [[ $line =~ ^([0-9]+)\ ([0-9]+) ]] # start and stop coords open each record
	then
		# BED starts are 0-based and TRF coords are 1-based, so subtract 1 from start
		printf '%s\t%d\t%d\n' "$chromosome" "$((BASH_REMATCH[1] - 1))" "${BASH_REMATCH[2]}"
	fi
done < "$1" > "$2" # read the input file; the output file is overwritten, not appended to
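
Before using the converted file downstream, it may be worth checking that every line has the expected shape. The sketch below is one way to do that, assuming only the three-column layout described above. The function name check_bed and the sample records are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the number of lines on stdin that are NOT
# three tab-separated fields with integer coordinates and start < stop.
check_bed() {
	awk -F'\t' '
		NF != 3 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 >= $3 + 0 { bad++ }
		END { print bad + 0 }
	'
}

# Made-up records: a clean file, and one with a space-separated line.
good=$(printf 'chr1\t1000\t1050\nchr2\t5\t40\n' | check_bed)
broken=$(printf 'chr1\t1000\t1050\nchr1 2000 2050\n' | check_bed)
echo "malformed lines: $good (clean input), $broken (space-separated line)"
```

A non-zero count suggests the conversion picked up a line it should not have, or that tabs were lost along the way.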