Refactors

probabilityOfSequence, transitionProbability, and frequency

Pass them a normalized array so frequencies only need to be counted once
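
A minimal sketch of what this refactor could look like, assuming the three functions are rewritten to take a precomputed, normalized profile instead of recounting k-mers on each call. The snake_case signatures, the dict-based profile, and the first-order Markov reading of the transition probability are all assumptions made for illustration, not the library's actual API.

```python
import math
from collections import Counter


def count_kmers(seq, k):
    """Count every overlapping k-mer in seq in a single pass."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def normalize(counts):
    """Turn raw counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def frequency(profile, kmer):
    """Look up a k-mer frequency in the precomputed profile (no recount)."""
    return profile.get(kmer, 0.0)


def transition_probability(profile, kmer_a, kmer_b, alphabet="ACGT"):
    """Assumed definition: P(kmer_b | kmer_a) among k-mers that extend
    the (k-1)-suffix of kmer_a by one base."""
    suffix = kmer_a[1:]
    if kmer_b[:-1] != suffix:
        return 0.0  # kmer_b cannot directly follow kmer_a
    denom = sum(profile.get(suffix + base, 0.0) for base in alphabet)
    return profile.get(kmer_b, 0.0) / denom if denom > 0 else 0.0


def log_probability_of_sequence(profile, seq, k):
    """Log-probability of seq: first k-mer frequency times the chain of
    transition probabilities (assumed Markov model), computed in log space."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    first = frequency(profile, kmers[0])
    if first == 0.0:
        return -math.inf
    logp = math.log(first)
    for a, b in zip(kmers, kmers[1:]):
        p = transition_probability(profile, a, b)
        if p == 0.0:
            return -math.inf
        logp += math.log(p)
    return logp


# Count and normalize once, then reuse the same profile for every query.
genome = "ACGTACGTAC"
profile = normalize(count_kmers(genome, 3))
print(log_probability_of_sequence(profile, "ACGTA", 3))
```

Working in log space avoids underflow on long sequences; a query containing an unseen k-mer returns negative infinity in this sketch, which may or may not be the desired behavior.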

Purpose/Motivation

Provide a simple, intuitive kmer API and CLI for kmer profile generation

Demonstrate the appropriate distance metric to show similarity between two kmer profiles

Demonstrate scalability of the algorithm on different sized genomes

Background

Methodology

Graphics/Hypotheses

References

Questions

Distance metrics

What is the appropriate distance metric between profiles?

Euclidean

Normalized squared Euclidean distance

Correlation distance
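
For comparison, the three candidate metrics above can be sketched in plain Python. This assumes each profile is a list of frequencies over the same ordered set of k-mers, and it takes the normalized squared Euclidean distance to be half the squared distance between the mean-centered vectors divided by the sum of their squared norms; that definition is an assumption, since the notes do not fix one.

```python
import math


def _centered(u):
    """Subtract the mean from every element of u."""
    mean = sum(u) / len(u)
    return [a - mean for a in u]


def euclidean(u, v):
    """Straight-line distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def normalized_squared_euclidean(u, v):
    """Assumed definition: 0.5 * |cu - cv|^2 / (|cu|^2 + |cv|^2),
    where cu and cv are the mean-centered profiles.
    Undefined (division by zero) if both profiles are constant."""
    cu, cv = _centered(u), _centered(v)
    num = sum((a - b) ** 2 for a, b in zip(cu, cv))
    den = sum(a * a for a in cu) + sum(b * b for b in cv)
    return 0.5 * num / den


def correlation_distance(u, v):
    """1 minus the Pearson correlation of the two profiles.
    Undefined (division by zero) if either profile is constant."""
    cu, cv = _centered(u), _centered(v)
    dot = sum(a * b for a, b in zip(cu, cv))
    norm_u = math.sqrt(sum(a * a for a in cu))
    norm_v = math.sqrt(sum(b * b for b in cv))
    return 1.0 - dot / (norm_u * norm_v)


u = [1.0, 2.0, 3.0, 4.0]
v = [2.0, 4.0, 6.0, 8.0]  # same shape as u, twice the scale
print(euclidean(u, v))
print(normalized_squared_euclidean(u, v))
print(correlation_distance(u, v))
```

In this example the correlation distance is essentially zero because the two profiles differ only by scale, while the Euclidean distance is not. That difference is one thing to weigh when profiles come from genomes of very different sizes.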

Is there a fast calculation for these distance metrics?

Should this library be refactored into a Python executable?

Should multiprocessing support be included?

Probability metrics

Can the probability metric demonstrate basic recall of a sequence from the genome?

How does the probability metric adapt to point mutations in the sequence?

How many different locations in a single sequence should be given a single point mutation?

How many different single sequences should be tested to show generalization of the metric?

How does the probability metric handle sequences from related genomes?

How can the
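
The point-mutation questions above could be explored with a small helper like the following. It is only a sketch: the function name, the fixed seed, and the choice to substitute a different base at each sampled position are assumptions for illustration. Each mutant carries exactly one substitution, so it can be scored against the unmutated sequence.

```python
import random


def point_mutants(seq, n_positions, alphabet="ACGT", seed=0):
    """Return (position, mutated_sequence) pairs, one single-base
    substitution per pair, at n_positions distinct random positions."""
    rng = random.Random(seed)  # seeded so experiments are repeatable
    positions = rng.sample(range(len(seq)), n_positions)
    mutants = []
    for pos in sorted(positions):
        # Pick a replacement base that differs from the original one.
        choices = [base for base in alphabet if base != seq[pos]]
        new_base = rng.choice(choices)
        mutants.append((pos, seq[:pos] + new_base + seq[pos + 1:]))
    return mutants


original = "ACGTACGTACGT"
for pos, mutant in point_mutants(original, 3):
    print(pos, mutant)
```

Varying `n_positions` and the number of source sequences gives a direct handle on the two "how many" questions.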

Deliverables

Multiple distance metrics

Distance and normalizations

Normalize by median method?

Distance is median of pairwise counts
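
One possible reading of "median method" is a median-of-ratios normalization, in which each sample is divided by a size factor taken as the median ratio of its counts to the per-k-mer geometric mean across samples. The sketch below assumes that reading and a simple list-of-lists count matrix; the notes do not confirm either.

```python
import math
from statistics import median


def size_factors(count_matrix):
    """count_matrix: one list of counts per sample, all over the same
    ordered k-mers. Only k-mers with nonzero counts in every sample
    contribute to the size factors."""
    n_kmers = len(count_matrix[0])
    log_geo_means = []
    for j in range(n_kmers):
        column = [sample[j] for sample in count_matrix]
        if all(c > 0 for c in column):
            mean_log = sum(math.log(c) for c in column) / len(column)
            log_geo_means.append((j, mean_log))
    if not log_geo_means:
        raise ValueError("no k-mer is present in every sample")
    return [
        math.exp(median(math.log(sample[j]) - lg for j, lg in log_geo_means))
        for sample in count_matrix
    ]


def normalize_counts(count_matrix):
    """Divide each sample's counts by that sample's size factor."""
    factors = size_factors(count_matrix)
    return [
        [c / f for c in sample]
        for sample, f in zip(count_matrix, factors)
    ]


samples = [
    [10, 20, 30],
    [20, 40, 60],  # same composition, sequenced twice as deeply
]
print(normalize_counts(samples))
```

After normalization the two example samples become identical, which is the behavior one would want if the only difference between them is sequencing depth.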

Check distribution of counts on real world data

Run on maybe 10 TCGA, 10 metagenomes, 10 bacterial, 10 mammalian?

Look at the distributions in RStudio

Make a knitr report of the distributions

Normalizations

Make density plots

Compare boxplots before and after each normalization method

Base the distance metrics off of the distribution