- Stronghold
- Textbook Track
Hamming distance
The minimum number of substitutions(Errors) required to change one string into the other.
--> check [BA5] Edit distance (= Levenshtein distance, Wagner–Fischer algorithm)
Brute Force algorithm
KMP algorithm
= while matching sequences, by using Failure Function(Pi array), count the mismatch prefix/suffix region as array(can find location as like an index) and skip that region
C D = mismatch -> Pi array for [0 1 0 1]
AABA ACAA DAAB AABA
AABA
a aBa
aAba
Aab a
aa Ba
a Aba
Aaba
AABA
AABA
GC skew
In some bacterial genomes, there is an enrichment of guanine over cytosine, because of cytosine deamination on okazaki fragement.
GC skew = (G - C)/(G + C)
def min_gc_skew(string):
min_skew_list = [0]
skew = 0
min_skew = 0
for i in range(len(string)):
if string[i] == C면 -1, G면 +1 로 skew 계산
if 현재 누적skew가 min_skew보다 작으면
min_skew = skew
min_skew_list에 해당 위치 추가
return min_skew_list
Greedy search
The problem-solving heuristic algorithm of making the locally optimal choice at each stage.
: greedy choice -> feasibility check -> update solution (local) -> repeat -> optimality check (global)
Gibbs sampler
that randomly starts -> determine and restart from initial as EM algorithm process
-> more efficient than greedy search but slow
Pseudo count (Laplacian smoothing)
Median string
Graph
G=(V,E)
example : 1-2-3
V,v for nodes(vertices(from vertex)) = {1,2,3}
E,u for edges = {(1,2),(2,3)}
- Hamming graph
- De bruijn graph
- Kautz graph
De Bruijn Graph (DBG)
n-dimensional m-symbol directed graph
-
$m^n$ vertices(nodes) -
each nodes has
$m$ income and outcome edges -
all possible length-n sequences allows multiple m-symbols appear
-
each DBG follows Eulerian or Hamiltonian cycle.
strong : speedy weak : indel error, naive DBG spend lot of times
Eulerian/Hamiltonian cycle
= cycle, circuit (start=end) / path, trail(start≠end) / distance(as scored)
Eulerian : finite graph that visits every edge exactly once. (can be found in both "directed/undirected")
= Konigsberg's bridge problem
= euler's theorem
= all nodes have an even(2,4,6...) degree(edge numbers)
Hamiltonian : graph that visits each nodes exactly once. (can be found in both "directed/undirected")
= traceable path
= manhattan tour problem, traveling salesman problem(TSP)
= NP complete problem -> as Brute Force
NP complete(NPC), NP hard problem
nondeterministic polynomial-time complete
(nondeterministic Turing machines(NTM) = Brute force search algorithm)
(polynomial-time -> deterministic algorithm / linear programming)
-> possible (yes/no = P/NP) for n times
->
Fleury/Hierholzer algorithm
= algorithm for finding Eulerian path (E = # of edges)
-
Fleury : O(E^2)
- start node = #E:odd or random(if all #E same)
- no brige for edge
- choose -> erase edge
-
Hierholzer : O(E)
- start node = random
- choose -> erase edge
Depth first search (DFS)
= find path(cycle) of tree, DAG, maze
DFS algorithm : O(|V|+|E|)
-
V = # of nodes
-
E = # of edges
DFS(node,Graph):
if node in Graph:
for all directed edges from v(node) to w(neighbor) that are in G.adjacentEdges(v) do
if vertex w is not labeled as discovered then
recursively call DFS(G, w)
### [Example] ###
def dfs(node):
if node in graph:
for neighbor in graph[node]:
if (node, neighbor) not in visited_edges:
visited_edges.add((node, neighbor))
dfs(neighbor)
path.append(node)
In/Out-degree, Source/Sink
Cyclic peptide from Spectrum
= to predict peptide from mass spectrometer
def score( 'peptide', 'spectrum' ):
return matchinng score
score = ('NEQL', [0, 99, 113, 114, 128, 227, 257, 370, 371, 484]) = 11
L N Q E LN LE NQ EQ QLN LNE EQL NEQ LNEQ
### NEQL = 0 113 114 128 129 227 242 242 257 355 356 370 371 484 (129, 242, 242 = missing masses) Theoritical
### spectrum = 0 99 113 114 128 227 257 299 355 356 370 371 484 (99, 299 = false masses) Experitmental
### overlaps 1 2 3 4 5 6 7 8 9 10 11
### spectrum based cyclic peptide: N-E-Q-L L-Q-E-N
E-Q-L-N Q-E-N-L
Q-L-N-E E-N-L-Q
L-N-E-Q N-L-Q-E
Integer mass
= The mass of an amino acid or another compound expressed as an integer.
{'C': 12.01, 'H': 1.008, 'O': 16.00, 'N': 14.01, 'S': 32.07}
{'G': 'C2H3N1O1', 'A': 'C3H5N1O1', 'S': 'C3H5N1O2', 'P': 'C5H7N1O1', 'V': 'C5H9N1O1',
'T': 'C4H7N1O2', 'C': 'C3H5N1O1S1', 'I': 'C6H11N1O1', 'L': 'C6H11N1O1', 'N': 'C4H6N2O2',
'D': 'C4H5N1O3', 'K': 'C6H12N2O1', 'Q': 'C5H8N2O2', 'E': 'C5H7N1O3', 'M': 'C5H9N1O1S1',
'H': 'C6H7N3O1', 'F': 'C9H9N1O1', 'R': 'C6H12N4O1', 'Y': 'C9H9N1O2', 'W': 'C11H10N2O1'}
{'G': 57.05, 'A': 71.08, 'S': 87.08, 'P': 97.12, 'V': 99.13, -- into -> {'G': 57, 'A': 71, 'S': 87, 'P': 97, 'V': 99,
'T': 101.11, 'C': 103.15, 'I': 113.16, 'L': 113.16, 'N': 114.11, 'T': 101, 'C': 103, 'I': 113, 'L': 113, 'N': 114,
'D': 115.09, 'K': 128.18, 'Q': 128.13, 'E': 129.12, 'M': 131.20, 'D': 115, 'K': 128, 'Q': 128, 'E': 129, 'M': 131,
'H': 137.15, 'F': 147.17, 'R': 156.20, 'Y': 163.17, 'W': 186.21} 'H': 137, 'F': 147, 'R': 156, 'Y': 163, 'W': 186}
Dynamic Programing (DP) problems
= Mathematical optimization method with recursive sub-problems
Source : start (0,0)
Sink : end (m,n)
traceback (optimization) : Sink to Source
- Fibonacci Sequence
- Change making Problem
- Longest Increasing Subsequence, LIS
- Matrix Chain Multiplication
- 0/1 Knapsack Problem
- Shortest Path Problem
- Subset Sum Problem
Wagner–Fischer algorithm
Needleman-Wunch algorithm
= Global alignment
-
$H_{i-1,j-1} + s(a,b) \max_{k\geq 1} (H_{i-k,j} - \sigma_k)$ -
$max_{k\geq 1}({H_{i-k,j}-\sigma_k})$ -
$max_{l\geq 1}({H_{i,j-l}-\sigma_l})$
Smith-Waterman algorithm
= Local alignment
-
$H_{i-1,j-1} + s(a,b) \max_{k\geq 1} (H_{i-k,j} - \sigma_k)$ -
$max_{k\geq 1}({H_{i-k,j}-\sigma_k})$ -
$max_{l\geq 1}({H_{i,j-l}-\sigma_l})$ -
$0$
BLOSUM, PAM
BLOSUM = BLOcks SUbstitution Matrix ( for local align )
The matrix built from blocks with less than r% of similarity
1. conservative sequence alignments without gap = BLOCK
2. in BLOCK database, 20 a.a's 210 possible substitution rate (under the relative abundance of each a.a)
3. get log-odds score
PAM = Point Accepted Mutation ( for global align )
Relating the number of mutated amino acids per 100 A.A
BLOSUM | PAM |
---|---|
BLOSUM90 | PAM100 |
BLOSUM89 | PAM120 |
BLOSUM62 | PAM160 |
BLOSUM52 | PAM200 |
BLOSUM45 | PAM250 |
Modified alignments
-
Semi-Global alignment:
AAAAABBBBBCCCCCD AABB---CC---D
-
Fitting alignment:
AAAAABBBBBCCCCC BB-CC
-
Overlap alignment:
AAABBBCCC BB-CCDD
-
Affine gap alignment:
-
Multiple Sequence alignment (MSA):
Tree, Adjacency matrix
Definition of "Tree"
- graph without cycle (graph = node + edge --> check [BA3] Graph)
- leaf node (degree=1) : without any child nodes = external node = terminal node = outer node
- internal node (degree>1) : with any child nodes = internal node = inner node = inode
- if # of node > 2, each internal node have > 1 edges.
- total (n) nodes have (n-1) edge(s).
-
unrooted tree : consist of leaf nodes(degree=1) and internal nodes(degree>1).
-
rooted tree : have root node(degree=2), so that makes leaf(degree=1), internal(degree>2).
-
simple tree : # of node > 2, 1 pair of leaf in each inner nodes(=parent node).
-
weighted tree : difference between length of each edge (node to another node) is exist.
most of weighted trees like : 0->1:1 1->2:5 2->3:4 (start node)->(end node):distance means : 0 - 1 ----- 2 --- 3
Adjacency matrix
= To indicate graph into matrix, pair of nodes(verticles) are adjacent or not.
= If the graph is undiredted, = adjacency matrix is symmetric
--> check [BA3] DFS
Additive/Fitted/Distance matrix
= To explain the characteristic of Tree as an matrix,
-
Additive matrix = Distance matrix = Additive distance matrix
= distance between leaves only
0 13 21 22 == 0 2 -> distance of : 0 2 == 21 13 0 12 13 \ / \ / 21 12 0 13 ? - ? ? - ? 22 13 13 0 / \ 1 3
-
Adjacency matrix = Fitted matrix
= distance beteween all posible nodes (if not, edge = 0)
0 0 0 0 11 0 == 0 2 -> distance of : 0 == 11 0 0 0 0 2 0 \ / \ 0 0 0 0 0 6 4 - 5 4 0 0 0 0 0 7 / \ 11 2 0 0 0 4 1 3 0 0 6 7 4 0
Limb length
A \ / E
C -- D
B / \ F
in here, AC = limb length
for A and B, is neighbor node
limb length = distance between leaf node and parent node
K-center cluster
- Maximal distance
- Farthest First Traversal
$min\sum(data-center) \ and \ max\sum(center-center)$
- K-means based on "Center of gravity theorem"
- Squared Error Distortion
$Distortion(Data,Centers) = (1/n) min \sum(Euclidian dist(Data, Centers))\times2$
K-means vs KNN cluster
K-means : k refers # of classes (unsupervised)
- clustering
KNN : k refers # of nearest neighbors (supervised, class already selected)
- classifying, regression
Lloyd algorithm
-
E step (estimated):
- Set “k” number of centers.
- Create clusters from the closest points.
-
M step (maximization):
- Find the center of gravity of each cluster in E-step.
- = Backtracking optimal solution
-
Repeat steps E-M.
-
Stop (=convergence): When the center no longer changes. (= the center stabilized)
= minimized squared error distortion
= global minimum
- Disadvantage: Sensitive to initial value and "k" to be classified, clusters may differ from reality (biological interpretation (e.g. BIC))
- Compensation: Random start must be performed multiple times
Hidden path
=
Viterbi algorithm
=
ba3h
- ba3e + ba3g (pattern -> DBG -> DAG)
def overlap(patterns):
### for max length overlapping { prefix : [suffix] } by using count_overlap
return DBG
def count_overlap(prefix,suffix)
return overlap_length
def find_end(DBG):
return start_node
def eulerian(DBG):
def dfs_stack(start_node):
stack.append(start_node)
while stack:
path.append
dfs_stack(start_node)
return path[::-1]
def stringmake(path)
### result += every single suffix[overlap_length:] by using count_overlap
return