trfMasking
christianparobek edited this page Sep 25, 2014 · 18 revisions
It is important to filter tandem repeats. I used a local instance of Tandem Repeats Finder (TRF), v4.07b. It turns out this is the same program Derrick used to identify tandem repeats in his P. falciparum genomes. I ran the following command line locally on my lab workstation:
trf407b.linux64 ref.fasta 2 7 7 80 10 50 500 -h -ngs > trf.txt
Derrick used similar parameters for his genomes: 2 7 7 80 10 50 200. The only difference is the last value, the maximum period size (200 rather than 500).
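The seven positional numbers are easier to compare once they are named. The following is a minimal sketch, assuming TRF's positional argument order (Match, Mismatch, Delta, PM, PI, Minscore, MaxPeriod). The function name `trf_command` and the variable names are hypothetical, and the function only prints the command line rather than running TRF:

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints (does not run) the TRF command line used above.
# Variable names are illustrative; their order follows TRF's positional arguments.
trf_command() {
    local fasta="$1"
    local max_period="${2:-500}"     # 500 on this page; Derrick's runs used 200
    local match=2 mismatch=7 delta=7 # match weight, mismatch and indel penalties
    local pm=80 pi=10                # match and indel probabilities (percent)
    local min_score=50               # minimum alignment score to report
    # -h suppresses the HTML output; -ngs writes a compact listing to stdout
    printf '%s %s %d %d %d %d %d %d %d -h -ngs\n' \
        trf407b.linux64 "$fasta" \
        "$match" "$mismatch" "$delta" "$pm" "$pi" "$min_score" "$max_period"
}

trf_command ref.fasta        # the command used on this page
trf_command ref.fasta 200    # the same command with Derrick's maximum period
```

Printing the command instead of executing it keeps the sketch runnable on a machine without the TRF binary; redirect the real command to trf.txt as shown above.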
The TRF output can be converted to different formats.
The following trf2gatk.sh converts it to GATK's ".intervals" format (chr:start-stop), using bash's regex matching:
# Christian Parobek
# Started September 18 2014
# Script to convert Tandem Repeats Finder output format
# to GATK's ".intervals" format to use with SelectVariants
# Must specify an input and an output filename at the command line
while read -r line
do
    if [[ $line =~ @([^ ]+) ]] # match all non-space characters after @
    then
        chromosome="${BASH_REMATCH[1]}:" # add a colon after chr name
    elif [[ $line =~ ([0-9]+ [0-9]+) ]] # match the first two numbers (start and stop)
    then
        string=$chromosome${BASH_REMATCH[1]} # put chr name together with coords
        echo "${string// /-}" >> "$2" # change space to dash; appends to the output file
    fi
done < "$1"
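To make the conversion concrete, here is a small sketch of the same logic wrapped in a function that reads from stdin and writes to stdout, run on a made-up input. The function name `trf_to_intervals`, the sequence names, and the coordinates are all hypothetical, and the trailing columns of each repeat line are abbreviated to "...". It assumes the `-ngs` layout in which each sequence starts with an `@name` line followed by repeat lines that begin with the start and stop positions:

```shell
#!/usr/bin/env bash
# Sketch of the trf2gatk.sh logic as a stdin-to-stdout filter.
trf_to_intervals() {
    local line chromosome=""
    local header_re='^@([^ ]+)'           # sequence name after the leading @
    local coords_re='^([0-9]+) ([0-9]+)'  # first two fields: start and stop
    while IFS= read -r line; do
        if [[ $line =~ $header_re ]]; then
            chromosome="${BASH_REMATCH[1]}"
        elif [[ $line =~ $coords_re ]]; then
            printf '%s:%s-%s\n' "$chromosome" "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
        fi
    done
}

# Made-up input for illustration only (not real TRF results).
trf_to_intervals <<'EOF'
@chr01 example sequence
1201 1260 6 10.0 ...
5000 5049 2 25.0 ...
@chr02
88 143 7 8.0 ...
EOF
# prints:
# chr01:1201-1260
# chr01:5000-5049
# chr02:88-143
```

Keeping the regexes in variables and anchoring them at the start of the line is a stylistic choice in this sketch, not a change the original script requires.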
The following trf2bed.sh converts to BED format (chr\tstart\tstop), using bash's regex matching:
# Christian Parobek
# Modified September 25 2014
# Script to convert Tandem Repeats Finder output format
# to BED format
# Must specify an input and an output filename at the command line
while read -r line
do
    if [[ $line =~ @([^ ]+) ]] # match all characters after @ except for space
    then
        chromosome="${BASH_REMATCH[1]}\t" # add a tab after chr name
    elif [[ $line =~ ([0-9]+ [0-9]+) ]] # match the first two numbers (start and stop)
    then
        string=$chromosome${BASH_REMATCH[1]} # put chr name together with coords
        echo -e "${string// /\t}" >> "$2" # change space to tab between coords; appends to the output file
    fi
done < "$1"
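One caveat worth checking before using the BED file downstream: BED intervals are 0-based and half-open, whereas TRF reports 1-based, inclusive positions, and the script above writes the start position unchanged. The following is a hedged sketch of a variant that subtracts one from the start. The function name `trf_to_bed` is hypothetical, and whether the shift is needed depends on how the downstream tool interprets the file:

```shell
#!/usr/bin/env bash
# Sketch: TRF -ngs text on stdin, BED (chr<TAB>start<TAB>stop) on stdout,
# with the start shifted from 1-based to 0-based. This shift is an assumption
# about the desired output, not part of the original trf2bed.sh.
trf_to_bed() {
    local line chromosome=""
    local header_re='^@([^ ]+)'           # sequence name after the leading @
    local coords_re='^([0-9]+) ([0-9]+)'  # first two fields: start and stop
    while IFS= read -r line; do
        if [[ $line =~ $header_re ]]; then
            chromosome="${BASH_REMATCH[1]}"
        elif [[ $line =~ $coords_re ]]; then
            # 10# forces base 10 so a leading zero is never read as octal
            printf '%s\t%s\t%s\n' "$chromosome" \
                "$(( 10#${BASH_REMATCH[1]} - 1 ))" "${BASH_REMATCH[2]}"
        fi
    done
}

# Example usage (file names are placeholders):
#   trf_to_bed < trf.txt > trf.bed
```

The stop position is left as is, because a 1-based inclusive end equals the 0-based exclusive end of the same interval.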