-
Notifications
You must be signed in to change notification settings - Fork 24
1n estimation: subsets
The idea behind this estimate is subset kmer pairs with minor coverage ratio around (+- 0.01) expected ratios. For example kmer pairs with minor coverage ~0.2 are expected to be either AAAAB, AAAAAAAABB, ... etc. The expected ratios used are 0.5, 0.333, 0.25, 0.2, 0.166 corresponding to di, tri, ..., hexaploid smudges. For each subset we individually fir a kernel smoothing esitmates of local maximas; the maximum with lowest coverage is considered the peak corresponding to single B smudge (AB, AAB, etc...) and this peak is then used to calculate the genomic counts of all the other peaks in that subset. The coverage of each peak / their genomic count represent then 1n coverage estimated of individual peaks and the output 1n estimate is their average weighted by the sizes of the individual peaks.
Sometimes it happens that an organism has no AB, or AAB smudge, only AABB or AAAABB respectively. In those cases the 1n estimates will be wrong. However, it's unusual that this happens for all the subsets and the weighted mean is usually very close to expected number. This number is subsequentially adjusted by an 1n estimation: highest peak.
(the following pseudocode applied for dev2
banch)
for column_of_kmers in AB AAB AAAB AAAAB AAAAAB:
expected_minor_allele_cov = sum(column_of_kmers == B) / sum(column_of_kmers = A)
subset_of_kmers = kmer_pairs that have expected_minor_allele_cov +- 0.01
peaks = find_local_maxima_using_kernel_smoothing(coverage_sub_of_subset_kmer_pairs)
1n_estimates_of_peaks = peak_covrages / (length(column_of_kmers) * round(peak_covrages / min(peak_covrages)))
1n_subset_estimate = weighted_mean(all_1n_estimates_of_peaks, all_peaks_sizes)