Rosalind

  • Stronghold
  • Textbook Track

Algorithm Explanation:

[BA1, SUBS]

Hamming distance

The minimum number of substitutions (errors) required to change one string into the other.

--> check [BA5] Edit distance (= Levenshtein distance, Wagner–Fischer algorithm)

Brute Force algorithm

(= exhaustive search, generate and test) (= nondeterministic Turing machines)

Systematically checking all possible candidates and testing whether each one satisfies the problem's statement.


sliding window : n (text length) - m (pattern length) + 1 window positions

complexity : O(mn)
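A minimal runnable sketch of this scan (the function name is illustrative):

def brute_force_match(text, pattern):
    # try every window position; each comparison costs up to m characters
    n, m = len(text), len(pattern)
    return [i for i in range(n - m + 1) if text[i:i + m] == pattern]

print(brute_force_match("GATATATGCATATACTT", "ATAT"))   # [1, 3, 9] (0-based)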

KMP algorithm
= Knuth-Morris-Pratt algorithm (-> for motif finding : exact matching)

= While matching, a failure function (pi array) stores, for each prefix of the pattern, the length of its longest proper prefix that is also a suffix; on a mismatch, this array tells the algorithm how far to shift (it acts like an index into the pattern), so regions that cannot match are skipped instead of re-checked.

  text = AABAACAADAABAABA,  pattern = AABA  ->  pi array = [0 1 0 1]

  AABAACAADAABAABA
  AABA                    match at index 0
       ^                  mismatch at C -> shift using the pi array
          ^               mismatch at D -> shift using the pi array
           AABA           match at index 9
              AABA        match at index 12
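A compact sketch of the pi array and the O(n + m) search (names are illustrative):

def failure_function(pattern):
    # pi[i] = length of the longest proper prefix of pattern[:i+1]
    # that is also a suffix of it
    pi = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = pi[k - 1]               # fall back along the pi array
        if pattern[i] == pattern[k]:
            k += 1
        pi[i] = k
    return pi

def kmp_search(text, pattern):
    pi, k, hits = failure_function(pattern), 0, []
    for i, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = pi[k - 1]               # skip the region that cannot match
        if c == pattern[k]:
            k += 1
        if k == len(pattern):           # full match found
            hits.append(i - k + 1)
            k = pi[k - 1]
    return hits

print(failure_function("AABA"))                  # [0, 1, 0, 1]
print(kmp_search("AABAACAADAABAABA", "AABA"))    # [0, 9, 12]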
GC skew

In some bacterial genomes, one replication strand is enriched in guanine over cytosine because of cytosine deamination during lagging-strand (Okazaki fragment) synthesis.

GC skew = (G - C)/(G + C)

def min_gc_skew(string):
    min_skew_list = [0]
    skew = 0
    min_skew = 0

    for i in range(len(string)):
        # compute the running skew: -1 for C, +1 for G
        if string[i] == 'C':
            skew -= 1
        elif string[i] == 'G':
            skew += 1

        # record positions where the cumulative skew reaches a new minimum
        if skew < min_skew:
            min_skew = skew
            min_skew_list = [i + 1]
        elif skew == min_skew:
            min_skew_list.append(i + 1)

    return min_skew_list

[BA2]

Greedy search

A problem-solving heuristic that makes the locally optimal choice at each stage.

: greedy choice -> feasibility check -> update solution (local) -> repeat -> optimality check (global)

  • shared with DP : a heuristic strategy that works through subproblems step by step

  • differs from DP : it does not reconsider choices; earlier decisions are never revised by later work

  • limitation : yields a local optimum, NOT necessarily the globally optimal solution


Gibbs sampler
= a Markov Chain Monte Carlo (MCMC) algorithm based on Bayesian inference

that starts from random motif positions, then repeatedly resamples and restarts, much like the EM-algorithm procedure

-> finds better motifs than greedy search, but is slow

Pseudocount (Laplace smoothing)
= Laplace smoothing = Additive smoothing = Lidstone smoothing

A technique used to smooth count data, eliminating issues caused by certain values having 0 occurrences.

("pseudocount" α > 0 is a smoothing parameter. α = 0 corresponds to no smoothing.)


Median string
= The k-mer motif (pattern) with the least total distance to all given strings (DNA sequences), where the distance to a string is the minimum Hamming distance over its k-mers.

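A brute-force sketch over all 4^k patterns (fine for small k; names are illustrative):

from itertools import product

def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

def distance_to_string(pattern, dna):
    # minimum Hamming distance between pattern and any k-mer of dna
    k = len(pattern)
    return min(hamming(pattern, dna[i:i + k]) for i in range(len(dna) - k + 1))

def median_string(dnas, k):
    # the k-mer minimizing the summed distance over all strings
    return min(("".join(p) for p in product("ACGT", repeat=k)),
               key=lambda pat: sum(distance_to_string(pat, d) for d in dnas))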

[BA3]

Graph

G=(V,E)

example : 1-2-3

V = set of nodes (vertices; singular: vertex) = {1,2,3}

E = set of edges = {(1,2),(2,3)}

  • Hamming graph
  • De Bruijn graph
  • Kautz graph
De Bruijn Graph (DBG)

n-dimensional directed graph on m symbols (directed: $(1,2)\neq(2,1)$)

  • $m^n$ vertices(nodes)

  • each node has $m$ incoming and $m$ outgoing edges

  • the vertices cover all possible length-$n$ sequences; the $m$ symbols may appear multiple times

  • every De Bruijn graph contains an Eulerian cycle and a Hamiltonian cycle.


strength : fast. weaknesses : sensitive to indel errors; a naive DBG implementation spends a lot of time
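A minimal construction from k-mers (one edge per k-mer, prefix node to suffix node; names are illustrative):

from collections import defaultdict

def de_bruijn(kmers):
    # each k-mer contributes one edge: prefix (k-1)-mer -> suffix (k-1)-mer
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph

print(dict(de_bruijn(["AAG", "AGA", "GAT", "ATT"])))
# {'AA': ['AG'], 'AG': ['GA'], 'GA': ['AT'], 'AT': ['TT']}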

Eulerian/Hamiltonian cycle

= cycle, circuit (start = end) / path, trail (start ≠ end) / distance (when edges carry weights)


Eulerian : a cycle in a finite graph that visits every edge exactly once. (can be found in both directed and undirected graphs)

= Königsberg bridge problem

= Euler's theorem

= exists iff all nodes have an even (2,4,6...) degree (# of edges); in the directed case, in-degree = out-degree at every node


Hamiltonian : a cycle that visits each node exactly once. (can be found in both directed and undirected graphs)

= traceable path

= Manhattan tourist problem, travelling salesman problem (TSP)

= NP-complete problem -> solved by brute force in general

NP complete(NPC), NP hard problem

nondeterministic polynomial-time complete


(a nondeterministic Turing machine (NTM) corresponds to brute-force search)

(polynomial time -> deterministic algorithms / linear programming)

$2^{O(\log \ n)} = poly(n)$

-> decision problems answer yes/no; solvable in polynomial time = P, verifiable in polynomial time = NP

$Let \ L \ be \ a \ language: \ if \ \forall L' \in NP, \ L' {\leq}_p L, \ then \ L \ is \ NP\text{-}hard$

-> e.g. the halting problem is NP-hard (but not in NP)

Fleury/Hierholzer algorithm

= algorithms for finding an Eulerian path (E = # of edges)

  • Fleury : O(E^2)

    1. start node = a node of odd degree, or random (if all degrees are even)
    2. never cross a bridge unless no other edge remains
    3. choose -> erase edge
  • Hierholzer : O(E)

    1. start node = random
    2. choose -> erase edge, splicing finished sub-cycles into the tour
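A stack-based sketch of Hierholzer's idea, erasing edges as they are used (names are illustrative):

def hierholzer(graph, start):
    # O(E): follow unused edges until stuck, then emit nodes in reverse
    adj = {v: list(ws) for v, ws in graph.items()}   # copy so edges can be erased
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if adj.get(v):
            stack.append(adj[v].pop())   # choose -> erase edge
        else:
            path.append(stack.pop())     # dead end: node joins the tour
    return path[::-1]

print(hierholzer({0: [1], 1: [2], 2: [0]}, 0))   # [0, 1, 2, 0]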
Depth first search (DFS)

= finds paths (or cycles) in trees, DAGs, mazes

DFS algorithm : O(|V|+|E|)

  • V = # of nodes

  • E = # of edges

DFS(G, v):
    label v as discovered
    for all directed edges from v to w that are in G.adjacentEdges(v) do
        if vertex w is not labeled as discovered then
            recursively call DFS(G, w)

### [Example] ###

graph = {0: [1], 1: [2], 2: [0]}   # adjacency list (illustrative)
visited_edges = set()
path = []

def dfs(node):
    # edge-based DFS: each directed edge is followed exactly once
    if node in graph:
        for neighbor in graph[node]:
            if (node, neighbor) not in visited_edges:
                visited_edges.add((node, neighbor))
                dfs(neighbor)
    path.append(node)   # post-order; path[::-1] yields an Eulerian trail
In/Out-degree, Source/Sink
= For directed graphs (paths/cycles, even trees):

$Let \ G=(V,E), v \in V \ (v \ for \ nodes): \ \sum_{v \in V}{\deg^- (v)} = \sum_{v \in V}{\deg^+ (v)} = |E|$

$If \ \deg^-(v) = \deg^+(v) \ for \ all \ v, \ the \ graph \ is \ a \ balanced \ directed \ graph$

${\deg^- (v)}$ = in-degree = # of inward edges. A Source (start point of a path) has inward edges < outward edges.

${\deg^+ (v)}$ = out-degree = # of outward edges. A Sink (end point of a path) has inward edges > outward edges.
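A sketch that computes degrees and locates the unbalanced source/sink nodes of an Eulerian path (names are illustrative):

from collections import defaultdict

def source_and_sink(graph):
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for v, neighbors in graph.items():
        for w in neighbors:
            outdeg[v] += 1
            indeg[w] += 1
    nodes = set(indeg) | set(outdeg)
    source = [v for v in nodes if outdeg[v] - indeg[v] == 1]   # start candidates
    sink = [v for v in nodes if indeg[v] - outdeg[v] == 1]     # end candidates
    return source, sink

print(source_and_sink({0: [1], 1: [2], 2: [3]}))   # ([0], [3])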

[BA4]

Cyclic peptide from Spectrum

= predicting a peptide sequence from its mass spectrum (mass spectrometer output)

  def score(peptide, spectrum):
     return matching_score

  score('NQEL', [0, 99, 113, 114, 128, 227, 257, 299, 355, 356, 370, 371, 484]) = 11

                       L   N   Q   E  LN  EL  NQ  QE      LNQ ELN QEL NQE NQEL
  ### NQEL =      0   113 114 128 129 227 242 242 257     355 356 370 371 484      (129, 242, 242 = missing masses) Theoretical
  ### spectrum =  0 99 113 114 128     227         257 299 355 356 370 371 484     (99, 299 = false masses) Experimental
  ### overlaps    1     2   3   4       5           6       7   8   9   10  11

  ### spectrum-based cyclic peptide:   N-Q-E-L         L-E-Q-N
                                         Q-E-L-N         E-Q-N-L
                                           E-L-N-Q         Q-N-L-E
                                             L-N-Q-E         N-L-E-Q
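A runnable sketch of the cyclic score, using the integer mass table from the next section (names are illustrative):

from collections import Counter

MASS = {'G': 57, 'A': 71, 'S': 87, 'P': 97, 'V': 99, 'T': 101, 'C': 103,
        'I': 113, 'L': 113, 'N': 114, 'D': 115, 'K': 128, 'Q': 128, 'E': 129,
        'M': 131, 'H': 137, 'F': 147, 'R': 156, 'Y': 163, 'W': 186}

def cyclic_spectrum(peptide):
    # masses of all cyclic subpeptides, plus 0 and the full peptide mass
    n, doubled = len(peptide), peptide + peptide
    spectrum = [0, sum(MASS[aa] for aa in peptide)]
    for length in range(1, n):
        for start in range(n):
            sub = doubled[start:start + length]
            spectrum.append(sum(MASS[aa] for aa in sub))
    return sorted(spectrum)

def score(peptide, spectrum):
    # multiset overlap between theoretical and experimental spectra
    theo, exp = Counter(cyclic_spectrum(peptide)), Counter(spectrum)
    return sum(min(theo[m], exp[m]) for m in theo)

print(score('NQEL', [0, 99, 113, 114, 128, 227, 257, 299,
                     355, 356, 370, 371, 484]))   # 11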
Integer mass

= The mass of an amino acid or another compound expressed as an integer.

   {'C': 12.01, 'H': 1.008, 'O': 16.00, 'N': 14.01, 'S': 32.07}
   
   {'G': 'C2H3N1O1', 'A': 'C3H5N1O1', 'S': 'C3H5N1O2', 'P': 'C5H7N1O1', 'V': 'C5H9N1O1',
    'T': 'C4H7N1O2', 'C': 'C3H5N1O1S1', 'I': 'C6H11N1O1', 'L': 'C6H11N1O1', 'N': 'C4H6N2O2',
    'D': 'C4H5N1O3', 'K': 'C6H12N2O1', 'Q': 'C5H8N2O2', 'E': 'C5H7N1O3', 'M': 'C5H9N1O1S1',
    'H': 'C6H7N3O1', 'F': 'C9H9N1O1', 'R': 'C6H12N4O1', 'Y': 'C9H9N1O2', 'W': 'C11H10N2O1'}

  {'G': 57.05, 'A': 71.08, 'S': 87.08, 'P': 97.12, 'V': 99.13,     -- into ->   {'G': 57, 'A': 71, 'S': 87, 'P': 97, 'V': 99,
   'T': 101.11, 'C': 103.15, 'I': 113.16, 'L': 113.16, 'N': 114.11,              'T': 101, 'C': 103, 'I': 113, 'L': 113, 'N': 114,
   'D': 115.09, 'K': 128.18, 'Q': 128.13, 'E': 129.12, 'M': 131.20,              'D': 115, 'K': 128, 'Q': 128, 'E': 129, 'M': 131,
   'H': 137.15, 'F': 147.17, 'R': 156.20, 'Y': 163.17, 'W': 186.21}              'H': 137, 'F': 147, 'R': 156, 'Y': 163, 'W': 186}
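A small sketch deriving an integer residue mass from its formula with the average atomic masses above (names are illustrative):

import re

ATOMIC = {'C': 12.01, 'H': 1.008, 'O': 16.00, 'N': 14.01, 'S': 32.07}

def residue_mass(formula):
    # e.g. glycine 'C2H3N1O1': 2*12.01 + 3*1.008 + 14.01 + 16.00 = 57.054 -> 57
    return round(sum(ATOMIC[el] * int(n)
                     for el, n in re.findall(r'([A-Z])(\d+)', formula)))

print(residue_mass('C2H3N1O1'))   # 57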

[BA5]

Dynamic Programming (DP) problems
by Richard Bellman (1950s)

= a mathematical optimization method that solves recursive sub-problems

Source : start (0,0)

Sink : end (m,n)

traceback (optimization) : Sink to Source

  1. Fibonacci Sequence
  2. Change making Problem
  3. Longest Increasing Subsequence, LIS
  4. Matrix Chain Multiplication
  5. 0/1 Knapsack Problem
  6. Shortest Path Problem
  7. Subset Sum Problem
Wagner–Fischer algorithm


= Edit distance

= Levenshtein distance

The minimum number of single-character edits (insertions, deletions, substitutions) required to change one string into the other.
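A sketch of the Wagner-Fischer DP with unit costs:

def edit_distance(a, b):
    # dp[i][j] = edit distance between a[:i] and b[:j]
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                              # delete i characters
    for j in range(n + 1):
        dp[0][j] = j                              # insert j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[m][n]

print(edit_distance("PLEASANTLY", "MEANLY"))   # 5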

Needleman-Wunsch algorithm

= Global alignment

$H_{0,0} = 0, \quad H_{k,0} = -\sigma_k, \quad H_{0,l} = -\sigma_l$

$(1 < k \leq i < n, \ 1 < l \leq j < m)$

$H_{i,j}=\max$ of:

  • $H_{i-1,j-1} + s(a_i,b_j)$

  • $\max_{k\geq 1}({H_{i-k,j}-\sigma_k})$

  • $\max_{l\geq 1}({H_{i,j-l}-\sigma_l})$

Smith-Waterman algorithm

= Local alignment

$H_{k,0} = H_{0,l} =0$

$(1 < k \leq i < n, \ 1 < l \leq j < m)$

$H_{i,j}=\max$ of:

  • $H_{i-1,j-1} + s(a_i,b_j)$

  • $\max_{k\geq 1}({H_{i-k,j}-\sigma_k})$

  • $\max_{l\geq 1}({H_{i,j-l}-\sigma_l})$

  • $0$
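A minimal local-alignment sketch with a linear gap penalty (the recurrence above allows general σ_k; the scores here are illustrative):

def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    # H[i][j] = best local alignment score ending at a[:i], b[:j]
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]   # first row/column stay 0
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                      # local: restart anywhere
                          H[i - 1][j - 1] + s,    # (mis)match
                          H[i - 1][j] + gap,      # gap in b
                          H[i][j - 1] + gap)      # gap in a
            best = max(best, H[i][j])
    return best

print(smith_waterman("ACGT", "AGT"))   # 2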

BLOSUM, PAM

BLOSUM = BLOcks SUbstitution Matrix ( for local alignment )

     BLOSUMr is built from blocks in which sequences sharing more than r% identity are clustered, so substitution counts come from pairs below r% identity

     1. conserved, gap-free sequence alignments = BLOCKs

     2. in the BLOCKS database, count the 210 possible substitution pairs among the 20 amino acids (weighted by the relative abundance of each amino acid)

     3. compute log-odds scores

PAM = Point Accepted Mutation ( for global alignment )

  One PAM unit corresponds to one accepted point mutation per 100 amino acids
| BLOSUM   | PAM    |
|----------|--------|
| BLOSUM90 | PAM100 |
| BLOSUM89 | PAM120 |
| BLOSUM62 | PAM160 |
| BLOSUM52 | PAM200 |
| BLOSUM45 | PAM250 |
Modified alignments
  • Semi-Global alignment:

    AAAAABBBBBCCCCCD
    
       AABB---CC---D
    
  • Fitting alignment:

    AAAAABBBBBCCCCC
    
           BB-CC
    
  • Overlap alignment:

     AAABBBCCC
    
         BB-CCDD
    
  • Affine gap alignment: gap penalty $\sigma(k) = \sigma_{open} + k \cdot \sigma_{extend}$, so opening a gap costs more than extending one.

  • Multiple Sequence alignment (MSA): aligning three or more sequences simultaneously (a higher-dimensional DP lattice).

[BA7]

Tree, Adjacency matrix

Definition of "Tree"

  1. a graph without cycles (graph = nodes + edges --> check [BA3] Graph)
  2. leaf node (degree=1) : has no child nodes = external node = terminal node = outer node
  3. internal node (degree>1) : has child nodes = inner node = inode
  4. if # of nodes > 2, each internal node has > 1 edge.
  5. a tree with n nodes has (n-1) edges.
  • unrooted tree : consists of leaf nodes (degree=1) and internal nodes (degree>1).

  • rooted tree : has a root node (degree=2), so leaves have degree 1 and internal nodes degree > 2.

  • simple tree : # of nodes > 2, with one pair of leaves under each inner (parent) node.

  • weighted tree : edges have lengths (node-to-node distances) that may differ from one another.

most weighted trees are written like:
       0->1:1
       1->2:5
       2->3:4
       (start node)->(end node):distance
    means :
       0 - 1 ----- 2 --- 3
    

Adjacency matrix

= Represents a graph as a matrix: each entry indicates whether a pair of nodes (vertices) is adjacent or not.

= If the graph is undirected, the adjacency matrix is symmetric.

--> check [BA3] DFS

Additive/Fitted/Distance matrix

= Matrices that describe the characteristics of a tree:

  • Additive matrix = Distance matrix = Additive distance matrix

    = distance between leaves only

    0   13   21   22         ==      0             2         -> distance of : 0           2    == 21
    13   0   12   13                    \       /                               \       /
    21   12   0   13                      ?  - ?                                  ? - ?
    22   13   13   0                    /        \
                                     1              3
    
  • Adjacency matrix = Fitted matrix

    = distance between all possible nodes (edge = 0 where no edge exists)

    0   0   0   0   11   0  ==         0            2       ->  distance of : 0           == 11
    0   0   0   0   2   0                 \       /                             \
    0   0   0   0   0   6                   4 - 5                                  4
    0   0   0   0   0   7                 /       \ 
    11   2   0   0   0   4              1            3
    0   0   6   7   4   0
    
Limb length
  A \       / E
      C -- D  
  B /       \ F

$CE = [(AC+CE)+(BC+CE)-(AC+BC)]/2 = (AE+BE-AB)/2$

$so, \ AC = AE - (AE+BE-AB)/2 = (AE+AB-BE)/2$

here, AC is the limb length of leaf A

A and B are neighboring leaves (they share the parent node C)

limb length = distance between a leaf node and its parent node
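A sketch computing limb length directly from an additive matrix (reusing the 4x4 matrix above):

def limb_length(D, j):
    # LimbLength(j) = min over leaf pairs i, k of (D[i][j] + D[j][k] - D[i][k]) / 2
    n = len(D)
    return min((D[i][j] + D[j][k] - D[i][k]) / 2
               for i in range(n) for k in range(n)
               if i != j and k != j and i != k)

D = [[0, 13, 21, 22], [13, 0, 12, 13], [21, 12, 0, 13], [22, 13, 13, 0]]
print(limb_length(D, 0))   # 11.0, matching edge 0->4 in the fitted matrix above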

[BA8]

K-center cluster
  1. Maximal distance
    • Farthest First Traversal
    • $min\sum(data-center) \ and \ max\sum(center-center)$
  2. K-means based on "Center of gravity theorem"
    • Squared Error Distortion
    • $Distortion(Data,Centers) = (1/n) min \sum(Euclidian dist(Data, Centers))\times2$
K-means vs KNN cluster

K-means : k refers to the # of clusters (unsupervised)

  • clustering

KNN : k refers to the # of nearest neighbors (supervised; classes already labeled)

  • classification, regression
Lloyd algorithm
= Voronoi iteration (relaxation)
  • E step (estimation):

    1. Set “k” number of centers.
    2. Create clusters from the closest points.
  • M step (maximization):

    1. Find the center of gravity of each cluster from the E step.
    2. Move each center to that center of gravity.
  • Repeat steps E-M.

  • Stop (= convergence): when the centers no longer change. (= the centers have stabilized)

    = minimized squared error distortion

    = a local minimum (not guaranteed to be the global minimum)

  • Disadvantage: sensitive to the initial centers and to the chosen "k"; clusters may differ from biological reality (model-selection criteria such as BIC can guide k)
  • Mitigation: perform the random start multiple times
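A compact sketch of the E/M loop with random initial centers (names are illustrative):

import random

def lloyd(points, k, iters=100):
    centers = random.sample(points, k)            # sensitive to this choice!
    for _ in range(iters):
        # E step: assign each point to its closest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # M step: move each center to its cluster's center of gravity
        new = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centers[i]
               for i, cl in enumerate(clusters)]
        if new == centers:                        # convergence: centers stabilized
            break
        centers = new
    return centers

print(lloyd([(1, 1), (1, 2), (9, 9), (10, 8)], 2))   # ~[(1.0, 1.5), (9.5, 8.5)] (order may vary)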

[BA10]

Hidden path

= the sequence of hidden states $\pi = \pi_1 \pi_2 \ldots \pi_n$ an HMM passes through; only the emitted symbols are observed, never the states themselves

Viterbi algorithm

= dynamic programming over the HMM states: $s_{k,i} = \max_{l} \ s_{l,i-1} \cdot transition(l,k) \cdot emission_k(x_i)$; traceback from the best final state gives the most probable hidden path
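A minimal Viterbi sketch over a hypothetical two-state fair/biased coin HMM (all parameters below are made up for illustration):

def viterbi(obs, states, start, trans, emit):
    # V[i][k] = probability of the best path ending in state k after obs[:i+1]
    V = [{k: start[k] * emit[k][obs[0]] for k in states}]
    back = []
    for x in obs[1:]:
        prev, col, ptr = V[-1], {}, {}
        for k in states:
            best = max(states, key=lambda s: prev[s] * trans[s][k])
            col[k] = prev[best] * trans[best][k] * emit[k][x]
            ptr[k] = best
        V.append(col)
        back.append(ptr)
    last = max(states, key=lambda s: V[-1][s])    # traceback from the best end state
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))

states = "FB"                                     # F = fair, B = biased (hypothetical)
start = {'F': 0.5, 'B': 0.5}
trans = {'F': {'F': 0.9, 'B': 0.1}, 'B': {'F': 0.1, 'B': 0.9}}
emit = {'F': {'H': 0.5, 'T': 0.5}, 'B': {'H': 0.75, 'T': 0.25}}
print(viterbi("HHTH", states, start, trans, emit))   # 'BBBB' with these numbers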


Hint

ba2d


ba3e, ba3j


ba3h
  • ba3e + ba3g (pattern -> DBG -> DAG)
def overlap(patterns):
   ### build DBG = { prefix : [suffixes] } keyed by maximum overlap length, using count_overlap
   return DBG

def count_overlap(prefix, suffix):
   return overlap_length

def find_end(DBG):
   return start_node

def eulerian(DBG):
   def dfs_stack(start_node):
      stack.append(start_node)
      while stack:
         path.append(...)
   dfs_stack(start_node)
   return path[::-1]

def stringmake(path):
   ### result += each suffix[overlap_length:], using count_overlap
   return result

ba4d

= a simple memoization exercise in preparation for DP


ba5k

