-
Notifications
You must be signed in to change notification settings - Fork 24
tutorial strawberry
This is a tutorial using smudgeplot v0.2.1
Previous versions:
- version v0.1.3 - the corresponding version of this page.
- version v0.2.0 - the corresponding version of this page.
Let's get the data from the NCBI strawberry project PRJDB3320 published originally in Hirakawa et al. 2014. To complete this tutorial you will need 16 cores, about 200GB of RAM memory, 300GB of disk space and several hours of computations.
We will warm up on a diploid species Fragaria iinumae and then we will do the famous octoploid strawberry F. ananassa.
F. iinumae has reads in SRA under accession DRR013884, we can download them from ebi ftp server (it will take a while, it's quite a lot of data):
mkdir -p strawberry_iinumae && cd strawberry_iinumae
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/DRR013/DRR013884/DRR013884_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/DRR013/DRR013884/DRR013884_2.fastq.gz
great, we got 70.1GB of data, 109G base pairs, corresponding to approximately 500x of the ~200M Fragaria genome. It's useful to know a bit idea about the expected genome coverage.
Now we would like to build a database of all 21-mers using kmc
mkdir tmp # create a directory for temporary files
ls DRR013884_1.fastq.gz DRR013884_2.fastq.gz > FILES # create a file with both the raw read files
kmc -k21 -t16 -m64 -ci1 -cs10000 @FILES kmer_counts tmp # run kmc
We expect coverage to be enormous therefore we can just chose a really high cutoff hopefully filtering out all the sequencing errors. Now let's extract all the kmers with coverage >100x and <3000x.
kmc_dump -ci100 -cx3000 kmer_counts kmer_k21.dump
The file kmer_k21.dump
is a plain text file. You can check how it looks like and count how many kmers we have actually extracted
head kmer_k21.dump # just to see how the file looks like
wc -l kmer_k21.dump # count number of kmers are in the dump file
there are ~170M kmers, we can approximate the RAM memory requirement by multiplying the number by 350, so approx 60GB. We can expect the kmer pairs to be extracted in ~two of hours of computation.
smudgeplot.py hetkmers -o kmer_pairs < kmer_k21.dump
now we got a file kmer_pairs_coverages.tsv
that contains kmer pair coverages.
We need to feed the coverage file to the R script that will plot the smudgeplot and estimate ploidy, specify input file (-i
), output name (-o
) and a title for the plot so you will know what are you looking at (argument -t
). We also filter out one percent of kmers with the highest coverage (the quantile filter) since it makes a nicer zoom to relevant area of the figure
smudgeplot.py plot -o f_iinumae -t "Fragaria iinumae" -q 0.99 kmer_pairs_coverages.tsv
And that's it. Now a smugleplot are plotted on both the linear (f_iinumae_smudgeplot.png
) and the log scale (f_iinumae_smudgeplot_log10.png
). All the information is embedded in the figures, but it's also saved in f_iinumae_summary_table.tsv
file. Sometimes it's easier to look at the logscale smudges sometimes non-tranformed. This time log transformation made a clear picture
Oh heck, this Fragaria species seems to be tetraploid OR a species with a lot of closely related paralogs (approximately twice more than heterozygous loci). The reason is that we searched for loci that are exactly one SNP from each other, if there are two homozygous loci that are exactly one nucleotide different, I will pick them up as AABB. If the smudge would be strong around AAAB, the evidence for tetraploidy would be stronger since such paralog structure would be more than peculiar. Alright, is this a tetraploid??
All the literature I found were talking about this species as the basal branching Fragaria lineage that is diploid. However in Potter et al. 2000 they found an octoploid strawberry plant that was determined as Fragaria iinumae and reclassified subsequently. Maybe this individual that was sequenced is also a hybrid, but more likely it's just the duplication story and false inference from the side of smudgeplot?! Just to make sure that I have not done something stupid I run also GenomeScope (a step I have omitted at the beginning).
kmc_tools transform kmer_counts histogram kmer_k21.hist -cx10000
Rscript genomescope.R -i kmer_k21.hist -k 21 -p 2 -o . -n Fiinumae_genomescope
Alright, the haploid genome size estimate is ~200M assuming diploidy, suggesting that it's indeed not tetraploid. Furthermore The heterozygosity is really low (0.17%) which explains why duplication signal is relatively stronger than the heterozygosity signal. We also see that the duplication bump (the third peak in the histogram) is indeed quite high. I am very intrigued that so many of the duplications are so recent that it made the smudgeplot so confusing.
I must admit, this was not the best example. I did not know how it's going to end up when I started.
cd ..
Let's see if we will be able to predict the genome ploidy of the cultivar hybrid species F. ananassa.
Here we want to download 4 libraries, so I made this small for loop that will fetch the data
mkdir -p strawberry_ananassa && cd strawberry_ananassa
for lib in DRR013873 DRR013874 DRR013875 DRR013877; do
libdir=$(echo $lib | cut -b 1-6)
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/$libdir/$lib/"$lib"_{1,2}.fastq.gz
done
Let's count kmers using kmc. This time we are not that sure what we are dealing with, so we want to extract a histogram of kmers as well to determine L and U:
ls *.fastq.gz > FILES
kmc -k21 -m64 -ci1 -cs10000 @FILES kmer_counts tmp
kmc_tools transform kmer_counts histogram kmer_k21.hist
I would like to see the kmer spectra first before deciding about L and U. It also help not to fall into the same trap as with the F. iinumae smudgeplot.
Rscript genomescope.R -i kmer_k21.hist -k 21 -p 2 -o . -n Fananassa_genomescope 10000
The histogram:
The log-log histogram of the same data
It seems that the haploid genome coverage is ~60x (visible on the non-tranformed histogram), then we also see on the log-log histogram that the last big bump is somewhere bellow 1000x. We can either decide just form the histograms (L=30 and U=2000 would be reasonable choice to me) or we can run script kmer_cov_cutoff.R
to estimate it from the histogram using a simple algorithm for detection of local minima. The script return L=26 and U=600. These are perhaps also alright values, but just to be sure that we capture also duploications in the polyploid genome we choose U twice higher. While dumping kmers we can directly pipe them to the scripts for the computation of the kmer pairs. Note that this process might require up to 200GB of RAM, there are many more kmers than in the previous species.
L=$(smudgeplot.py cutoff kmer_k21.hist L) # 26
U=1200
kmc_dump -ci$L -cx$U kmer_counts /dev/stdout | smudgeplot.py hetkmers -o kmer_pairs
Finally we run smudgeplot
smudgeplot.py plot -o f_ananassa kmer_pairs_coverages.tsv
Last time I have shown log transformed smudges, this time I plot the data on a linear scale because smudges are too much overlapping and due to log scale one could not see so clearly the edges of smudges.
Wow, finally something interesting. It looks like we are dealing with an octaploid and the evidence seems robust since it comes mostly from AAAAAABB loci (the legend is not showing smudges with more than 6 letters, to see the smudges check out the summary_table.tsv
file). It also seems that the genome structure is AAAABBCC, where A and B are more closely related than C (because the smudge AAAAAABB is the brightest) and because there is quite bright AAAABB smudge suggesting senario where two haplotypes (let's say CC) do have a fixed indel and therefore we detect them as a hexploid loci. Analogically we could interpret the AABB smudge...
I could discuss this a whole day long, but I guess you got the point, so that's it. You made a really beautiful "smudgeplot alla fragola" and now you are ready for your own data.
Wanna try something without my help? You can do another strawberry species like diploid Fragaria nipponica with accesion DRR013885. Tweet about it (#smudgeplot), let us know how it went.
Or even better, you can try one of the species we discussed that could be interesting and make such tutorial out of it, we will be happy to add it to the wiki.