This is a supplementary material. It allows, nearly perfect, replication of the study and exact code that was used to generate all the results and figures.
You can also check ESEB 2021 talk about preliminary results of this work.
The sequenced materials resequencing data: PRJEB44694
All raw reads will be automatically downloaded from EBI using
snakemake download_all_reads
Downloading the reference genome
mkdir -p data/reference/Afus1/ data/reference/Ocin1/
wget ftp://ftp.ebi.ac.uk/pub/databases/ena/wgs/public/caj/CAJVCH01.fasta.gz -O data/reference/Afus1/genome.fa.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/ena/wgs/public/lji/LJIJ01.fasta.gz -O data/reference/Ocin1/genome.fa.gz
The data
Allacma fusca
data/raw_reads/{individual}/{accesion}_1.fastq.gz
data/reference/Afus1/genome.fa.gz
This section shows
- k-mer based estimate of coverage (k-mer coverage) using genomescope
- mapping of reads and subsequent estimate of mapping coverages
This approach is complementary to to the mapping approach, the advantage is that it can be done directly on the k-mer spectra and it's does not require assembling and mapping reads (and therefore avoids all the problems with these processes). The drawback is that each male k-mer spectrum will have co-founded paternal heterozygous sites with X chromosome and maternal autosomal sites, therefore the interpretation of the first k-mer peak is not as straightforward as of the first peak of the mapping coverage.
Lower k helps with low coverage datasets in cost of resolution. Usually for genomes with relatively modest coverage and short reads a default k = 21 is used. We also use this k for all our genomes with exception of BH3-2, where we (after attempting of fitting a model with k = 21) reduced the k to 17.
With k = 17, a smaller fraction of genome occurs on unique k-mers, but the k-mer coverage is (R - k + 1 / R) * C, hence lower k is, higher coverage we get. The improvement is only marginal, but we needs only a marginal improvement to make the fit easier.
Here is an exmaple for one sample (BH3-2) and one values of k (21) - funny enough, this not a k-mer spectrum we used in the end as we later redid it with k = 17, but the principle remains the same.
We start by creating kmer database and then extracting a k-mer histogram
mkdir data/raw_reads/BH3-2/tmp
ls data/raw_reads/BH3-2/ERR5883998_[1,2].fastq.gz > data/raw_reads/BH3-2/FILES
# kmer 21, 16 threads, 64G of memory, counting kmer coverages between 1 and 10000x
kmc -k21 -t30 -m64 -ci1 -cs100000 @data/raw_reads/BH3-2/FILES data/BH3-2/kmer_db/kmer_counts data/BH3-2/tmp
kmc_tools transform data/BH3-2/kmer_db/kmer_counts histogram data/Ocin1/BH3-2/kmer_k21.hist -cx100000
And do this for all the samples.
These models are largely inspired by GenomsScope source code and GenomeScope is included among the used models. They are adjusted and simplified models that are estimating 1n and 2n peaks separately (i.e. approach similar to the coverage analysis of mapped reads). The simplification of GenomeScope model lays is two aspects - first, we completely ignore anything that occurs in the genome more than once or twice (k-mers in the 1n and 2n peaks), compared to GenomeScope which has an explicit fit of duplications within the genome. Second is that we removed the term that is estimating genome-wise heterozygosity knowing k-mer length. Instead we use term het
, which simply proportion of 1n and 2n k-mers that are in the 1n peak. That is because we don't actually desire to estmate heterozygosity and the model converges better we simplify the term. The trick to model the two peaks separatelly is in adding of a term proportion
that models the coverage ratio between first and the second peak, which GenomeScope has fixed to 1:2.
# model, x - coverages, y - coverage frequencies, kmerEst, lengthEst, hetEst - starting values for the respective parameters
# note the starting value for proportion is 0.5, meaning that out starting values is the GenomeScope-like estimate
nlsLM_2peak_proportional_peaks <- function(x, y, kmerEst, lengthEst, hetEst = 0.6){
two_tissue_model <- nlsLM(y ~ ((het * dnbinom(x, size = (proportion * kmercov2) / bias, mu = (proportion * kmercov2))) +
((1 - het) * dnbinom(x, size = kmercov2 / bias, mu = kmercov2))) * length,
start = list(proportion = 0.5, kmercov2 = 2 * kmerEst, bias = 0.5, length = lengthEst, het = hetEst),
control = list(minFactor=1e-12, maxiter=40))
return(two_tissue_model)
}
All the fits of the two tissue model are calculated in the scripts/kmer_estimates_of_two_tissue_model.R script.
The main reason we formulate the model in this way is that we can calculate asymptotic confidence intervals of all fitted variables, including proportion
. Now, if the CI of proportion
spans 0.5, we conclude there was no significant deviation from 1:2 coverage ratio.
To run the model on all the calculated k-mer spectra, run
Rscript scripts/kmer_estimates_of_two_tissue_model.R
This will generate a bunch of plots and a table with estimates for 2 A. fusca males, 3 A. fusca females and one O. cincta (non-PGE springtial) male (the table is included).
To test how the method works on samples with various sizes of X chromosomes, levels of heterozygosity, fraction of sperm and various sequencing coverages, we performed a Power analysis, check the documentation for more details. The summary table with all the results is here, to plot the analyis for all the scenarios, run
Rscript scripts/plot_power_analysis.R
We have also used classical GenomeScope model to validate our approach. The genomescope was fit as follows
genomescope.R -i data/Ocin1/kmer_db/kmer_k21.hist -o data/Ocin1/genome_profiling -n Ocin1
The reads are mapped on the reference using bowtie2
and sorted using samtools
as follows
# indexing reference
# bowtie2-build data/reference/Afus1/genome.fa.gz $REFERENCE
bowtie2 --very-sensitive-local -p 16 -x $REFERENCE \
-1 $R1 -2 $R2 \
--rg-id "$RG_ID" --rg SM:"$SAMPLE" --rg PL:ILLUMINA --rg LB:LIB_"$SAMPLE" \
| samtools view -h -q -20 \
| samtools sort -@10 -O bam - > $SAMPLE.rg.sorted.bam
for all the samples run
snakemake map_all
Note, when I was mapping reads, I messed up something with the reference sample and run this script scripts/mapping_reference_reads_Afus1.sh
instead. However, I think I fixed the problem afterwards so this note should be (hopefully) outdated.
Allacma
Generate per scaffold coverages
samtools depth data/mapped_reads/BH3-2.rg.sorted.rmdup.bam | scripts/depth2cov_per_contig.py > data/mapped_reads/BH3-2_cov_per_scf.tsv
samtools depth data/mapped_reads/Afus1.rg.sorted.rmdup.bam | scripts/depth2cov_per_contig.py > data/mapped_reads/Afus1_cov_per_scf.tsv
etc. We then pull together all the mean coverages in a table tables/Afus_mean_coverage_table.tsv provided in this repo.
Orchesella
samtools depth data/mapped_reads/Ocin2.rg.sorted.bam | scripts/depth2cov_per_contig.py > data/mapped_reads/Ocin2_cov_per_scf.tsv
We use weighted kernel smoothing to see how many modes there are in the mapping coverage (1 = female, 2 = male). Following script will plot figures/Allacma_cov_estimates.png
and a bunch of thers too.
Rscript scripts/plot_mapping_coverages.R
but most importandly, it generates tables/resequencing_coverage_estimates.tsv, which contains estimates of 2n mapping coverage estimates for all the samples, but also 1n estimates for the two male samples (BH3-2
and Afus1
).
To generate tables/chr_assignments_Afus1.tsv
table, run following R script
Rscript scripts/assign-X-linked-scaffolds.R
Calculation of expectation of allele coverages under different scenarios.
The k-mer based expecation of maternal and paternal coverage is tricky. We could - in theory - use a formula to convert kmer coverage to genomic coverage, that is however in practice very imprecise. Instead, we use the estimated fraction of sperm, mapping coverage estimate and the formula we used for the two tissue model.
The original formula is f_h = 1 - (c_A - c_X) / c_X
, where f_h
is fraction of sperm, and c_
are respective coverages of X chromosomes and autosomes. With a few modifications we can adjust the equation to be in a form c_X = c_A / (2 - f_h)
which is also the expectation for the maternal autosomal alleles. Therefore the paternal expecation is c_A - c_X
.
In the case of BH3-2 f_h = 0.353
and c_A = 28.73
(which I took as mean of all homozygous variants on autosomes so we avoid any circularity in the argument). Then c_m = c_X = 17.44
and therefore c_p = 11.29
.
We ploted the modality of the coverage distributions used for sexing of individuals (unimodal - females; bimodal - males; SM Figure 1) and estimated the 1n/2n coverages from mapped reads to the reference using kernel smoothing. Both operation performed in the plot_mapping_coverages.R script.
As the result, we have the table with coverage estimates.
To test PGE, we called variants and sort them out to chromosomes. The test was based on expectation that the sperm in male A. fusca contains the maternal genome only. If that's the case, there is an expected coverage shift of maternal to paternal alleles which is proportional to the fraction of sperm in the sample (see calculated expectation above).
We call variants using freebayes.
# A. fusca
freebayes -f data/reference/Afus1/genome.fa -b data/mapped_reads/Afus1.rg.sorted.rmdup.bam data/mapped_reads/BH3-2.rg.sorted.rmdup.bam data/mapped_reads/WW2-1.rg.sorted.rmdup.bam data/mapped_reads/WW3-1.rg.sorted.rmdup.bam data/mapped_reads/WW5-4.rg.sorted.rmdup.bam data/mapped_reads/WW1-2.rg.sorted.rmdup.bam data/mapped_reads/WW2-5.rg.sorted.rmdup.bam data/mapped_reads/WW5-1.rg.sorted.rmdup.bam data/mapped_reads/WW5-5.rg.sorted.rmdup.bam data/mapped_reads/WW1-4.rg.sorted.rmdup.bam data/mapped_reads/WW2-6.rg.sorted.rmdup.bam data/mapped_reads/WW5-3.rg.sorted.rmdup.bam data/mapped_reads/WW5-6.rg.sorted.rmdup.bam --populations tables/populations.tsv --hwe-priors-off --standard-filters --min-coverage 5 -p 2 > data/SNP_calls/freebayes_all_samples_raw.vcf
# Ocin2
freebayes -f data/reference/Ocin1/genome.fa -b data/mapped_reads/Ocin2.rg.sorted.bam --standard-filters --min-coverage 5 -p 2 > data/SNP_calls/freebayes_Ocin2_raw.vcf
# Redoing SNP calls for Afus1
freebayes -f data/reference/Afus1/genome.fa -b data/mapped_reads/Afus1.rg.sorted.rmdup.bam --standard-filters --min-coverage 5 -p 2 > data/SNP_calls/freebayes_Afus1_raw.vcf
Allele coverage distributions
# qsub -o logs -e logs -cwd -N SNP_filt -V -pe smp64 1 -b yes '' -> submission on our cluster
scripts/subset_SNPs_to_chromosomes.sh data/SNP_calls/freebayes_all_samples_raw.vcf tables/chr_assignments_Afus1.tsv
scripts/subset_SNPs_to_chromosomes.sh data/SNP_calls/freebayes_Ocin2_raw.vcf tables/chr_assignments_Ocin1.tsv
scripts/subset_SNPs_to_chromosomes.sh data/SNP_calls/freebayes_Afus1_raw.vcf tables/chr_assignments_Afus1.tsv
data/SNP_calls/freebayes_Afus_filt.vcf
data/SNP_calls/freebayes_Afus1_raw_filt.vcf
data/SNP_calls/freebayes_Ocin2_raw_filt.vcf
python3 scripts/sort_variants_with_respect_to_chromosomes.py -o data/SNP_calls/freebayes_Afus_filt_sorted data/SNP_calls/freebayes_Afus_filt.vcf tables/chr_assignments_Afus1.tsv
python3 scripts/sort_variants_with_respect_to_chromosomes.py -o data/SNP_calls/freebayes_Ocin2_filt_sorted data/SNP_calls/freebayes_Ocin2_filt.vcf tables/chr_assignments_Ocin1.tsv
First, I will extract just the needed info about the variants in BH3-2, Afus1 and Ocin2
i=10
for ind in WW5-3 WW5-4 WW5-5 WW3-1 WW2-1 Afus1 BH3-2 WW2-5 WW5-6 WW5-1 WW1-2 WW2-6 WW1-4; do
echo $ind $i;
# grep "^#" data/SNP_calls/freebayes_Afus_filt_sorted_A.vcf | tail -1 | cut -f 1,2,$i
grep -v "#" data/SNP_calls/freebayes_Afus_filt_sorted_A.vcf | cut -f 1,2,$i | tr -s ":" "\t" | cut -f 1,2,3,4,6,8 | cut -f 1 -d ',' | awk '{ if ($3 != "."){ print $0 } }' > data/SNP_calls/freebayes_Afus_filt_sorted_"$ind"_A.tsv
grep -v "#" data/SNP_calls/freebayes_Afus_filt_sorted_X.vcf | cut -f 1,2,$i | tr -s ":" "\t" | cut -f 1,2,3,4,6,8 | cut -f 1 -d ',' | awk '{ if ($3 != "."){ print $0 } }' > data/SNP_calls/freebayes_Afus_filt_sorted_"$ind"_X.tsv
i=$(($i+1));
done
# if to extract just BH3-2
# grep -v "#" data/SNP_calls/freebayes_Afus_filt_sorted_A.vcf | cut -f 1,2,16 | tr -s ":" "\t" | cut -f 1,2,3,4,6,8 | cut -f 1 -d ',' | awk '{ if ($3 != "."){ print $0 } }' > data/SNP_calls/freebayes_Afus_filt_sorted_BH3-2_A.tsv
# grep -v "#" data/SNP_calls/freebayes_Afus_filt_sorted_X.vcf | cut -f 1,2,16 | tr -s ":" "\t" | cut -f 1,2,3,4,6,8 | cut -f 1 -d ',' | awk '{ if ($3 != "."){ print $0 } }' > data/SNP_calls/freebayes_Afus_filt_sorted_BH3-2_X.tsv
# Ocin2
grep -v "^#" data/SNP_calls/freebayes_Ocin2_filt_sorted_A.vcf | cut -f 1,2,10 | tr -s ":" "\t" | cut -f 1,2,3,4,6,8 | cut -f 1 -d ',' | awk '{ if ($3 != "."){ print $0 } }' > data/SNP_calls/freebayes_Ocin2_filt_sorted_A_Ocin2.tsv
grep -v "^#" data/SNP_calls/freebayes_Ocin2_filt_sorted_X.vcf | cut -f 1,2,10 | tr -s ":" "\t" | cut -f 1,2,3,4,6,8 | cut -f 1 -d ',' | awk '{ if ($3 != "."){ print $0 } }' > data/SNP_calls/freebayes_Ocin2_filt_sorted_X_Ocin2.tsv
Now we have a bunch of data/SNP_calls/*_sorted_X.tsv
and data/SNP_calls/*_sorted_A.tsv
tables that contains scf, pos, genotype, total_cov, ref_cov, alt_cov comuns.
Now we have the SNP calls, we need to sort out if one allele have always a higher coverage than the other and that this difference is not simply due to coverage variation.
for ind in WW5-6 WW5-4 BH3-2 WW3-1 WW2-1 WW1-2 WW2-5 WW5-1 WW5-5 WW1-4 WW2-6 WW5-3; do
Rscript scripts/plot_heterozygous_autosomal_allele_coverages.R "$ind" data/SNP_calls/run0/"$ind"_asn_snps.tsv figures/het_autosomal_allele_supports/"$ind";
done
To plot autosomal and X chromosome covera supports (figures/het_autosomal_allele_supports/autosomal_and_X_variant_coverages_<SAMPLE>.png
) run to get the male plots
Rscript scripts/plot_heterozygous_autosomal_allele_coverages_BH3-2.R
Rscript scripts/plot_heterozygous_autosomal_allele_coverages_Afus1.R
Rscript scripts/plot_heterozygous_autosomal_allele_coverages_Ocin2.R
and for females run
All females
for ind in WW5-6 WW5-4 BH3-2 WW3-1 WW2-1 WW1-2 WW2-5 WW5-1 WW5-5 WW1-4 WW2-6 WW5-3; do
Rscript scripts/plot_heterozygous_autosomal_allele_coverages.R "$ind" data/SNP_calls/run0/"$ind"_asn_snps.tsv figures/het_autosomal_allele_supports/"$ind".png;
done
Heterozygous SNP analysis:
data/SNP_calls/freebayes_Afus_filt_sorted_BH3-2_X.tsv
data/SNP_calls/freebayes_Afus_filt_sorted_BH3-2_A.tsv
data/mapped_reads/Afus1_per_scf_cov_medians.tsv
data/SNP_calls/freebayes_Afus_filt_sorted_Afus1_A.tsv
data/SNP_calls/freebayes_Afus_filt_sorted_Afus1_X.tsv
data/SNP_calls/run0/*_asn_snps.tsv
Coverage plots:
data/mapped_reads/Afus1_cov_per_scf.tsv
data/mapped_reads/BH3-2_cov_per_scf.tsv
data/mapped_reads/Ocin2_cov_per_scf.tsv
Using mapped reads of Ocin2
to the reference genome (ID), we estimated 1n and 2n coverage to be 53.43x and 101.12x respectively. According to the two tissue model this would imply that Ocin2
- a species with completely regular meiosis, has ~10% of the body form of cells with the same number of sex chromosomes and
asn_tab <- read.table('tables/chr_assignments_Ocin1.tsv', header = T)
ind = 'Ocin2'
ind_tab <- asn_tab[asn_tab[, 'male_coverage'] < filt_quantile & asn_tab[, 'len'] > 20000, c('scf', 'len', 'male_coverage')]
cov_2n <- 101.1228
cov_1n <- 53.4376
png('figures/Ocin2_male_coverage_vs_scf_length.png')
plot(ind_tab$male_coverage, log10(ind_tab$len), pch = 20, xlab = 'Coverage', ylab = 'log10 Scaffold length', main = 'mapping coverage vs scaffold length in O. cincta male')
lines(c(53.3, 53.3), c(min(log10(ind_tab$len)), 6), lwd = 2, col = 'red')
lines(c(101.15, 101.15), c(min(log10(ind_tab$len)), 6), lwd = 2, lty = 2, col = 'red')
legend('topright', bty = 'n', c('1n coverage est', '2n coverage est'), lty = c(1, 2), lwd = 2, col = 'red')
dev.off()
Looking at this plot, it is very clear that at leas some of the long scaffolds are in part missabled or with large scale structural variations. Scaffolds that have a few hundred thousand bases are expected to have the mean mapping coverage very close to the expected value, however, we do see substantial number of data points that are in between of the two distributions and therefore are likely representing a chimera of long stretches of sequences that are in one or two genomic copies in the male genome. These streches might be either large scale deletions or misassembled pieces of the X chromosome.
Every assembly error or structural variation will cause that the mean scaffold mapping covrerage will be somewhere between true 1n and 2n coverage. Hence, the two peaks will always be closer to each other than they would exected to be in perfect data which implies that these errors will bias and inflate the estimate.