-
Notifications
You must be signed in to change notification settings - Fork 44
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
1 parent
d1089e1
commit c6c1688
Showing
4 changed files
with
1,251 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,258 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Title: Fuzzy sequence matching with cross-correlation\n", | ||
"\n", | ||
"## Cross-correlation\n", | ||
"\n", | ||
"I am going to generate a \"random file\" - an array with values ranging between -1 and 1, and then randomly choose a \"random slice\" - a subarray portion from the \"random file\". I'm then going to randomly mutate some of the values in teh subarray, and see if I can still find where in the \"random file\" the \"random slice\" came from." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"metadata": { | ||
"collapsed": false | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"import random\n", | ||
"import numpy as np\n", | ||
"\n", | ||
"FILE_SIZE = 100000\n", | ||
"SLICE_SIZE = 100\n", | ||
"\n", | ||
"random_file = np.asarray([random.uniform(-1,1) for i in range(FILE_SIZE)])\n", | ||
"random_slice_index = random.randint(0, FILE_SIZE-SLICE_SIZE)\n", | ||
"random_slice = random_file[random_slice_index:random_slice_index+SLICE_SIZE]\n", | ||
"\n", | ||
"# Randomly change the sample by about 50% - simulating a 50% noise\n", | ||
"random_slice_mutated = np.asarray([i + random.uniform(-0.5,0.5) for i in random_slice])" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"So now, we are going to try to find where in `random_file` the `random_slice_mutated` came from. We know it should be `random_slice_index`.\n", | ||
"\n", | ||
"### First, let's write our own cross-correlation function to find it." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"metadata": { | ||
"collapsed": false | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"# Coding our own cross-correlation algorithm\n", | ||
"\n", | ||
"def find_sample_index(container, sample):\n", | ||
" best_match_index = 0\n", | ||
" largest_dot_product = 0\n", | ||
" sample_size = len(sample)\n", | ||
" for index in range(len(container) - sample_size):\n", | ||
" dot = np.dot(container[index:index+sample_size], sample)\n", | ||
" if dot > largest_dot_product:\n", | ||
" best_match_index = index\n", | ||
" largest_dot_product = dot\n", | ||
" return (best_match_index, largest_dot_product)\n", | ||
" " | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"metadata": { | ||
"collapsed": false | ||
}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"(14108, 33.238798044766)" | ||
] | ||
}, | ||
"execution_count": 4, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"find_sample_index(random_file, random_slice_mutated)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 6, | ||
"metadata": { | ||
"collapsed": false | ||
}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Actual index (should match): 14108\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"print(\"Actual index (should match):\", random_slice_index)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Now, let's use the `scipy.signal.correlate()` function instead of our own one.\n", | ||
"\n", | ||
"Note that the numpy implementation returns a vector of all of the correlation values, not just the largest one, so to find the largest index, we need to find the largest value." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 7, | ||
"metadata": { | ||
"collapsed": false | ||
}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"14108" | ||
] | ||
}, | ||
"execution_count": 7, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"import scipy.signal\n", | ||
"result = scipy.signal.correlate(random_file, random_slice)\n", | ||
"max_correlation_index = np.argmax(result) - (SLICE_SIZE-1)\n", | ||
"max_correlation_index" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Cross-correlation algorithm" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 8, | ||
"metadata": { | ||
"collapsed": false | ||
}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"array([ 3, 8, 14, 20, 26, 14, 5])" | ||
] | ||
}, | ||
"execution_count": 8, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"a = np.array([1,2,3,4,5])\n", | ||
"b = np.array([1,2,3])\n", | ||
"scipy.signal.correlate(a,b)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"For vectors:\n", | ||
"\n", | ||
"A=\n", | ||
"<table>\n", | ||
"<tr><td>1</td><td>2</td><td>3</td><td>4</td><td>5</td></tr>\n", | ||
"</table>\n", | ||
"\n", | ||
"B=\n", | ||
"<table>\n", | ||
"<tr><td>1</td><td>2</td><td>3</td></tr>\n", | ||
"</table>\n", | ||
"\n", | ||
"The cross-correlate algorithm zero-pads each vector as such (where the blank cells represent 0):\n", | ||
"\n", | ||
"#### Step 1\n", | ||
"<table>\n", | ||
"<tr><td></td><td></td><td>1</td><td>2</td><td>3</td><td>4</td><td>5</td></tr>\n", | ||
"<tr><td>1</td><td>2</td><td>3</td><td></td><td></td><td></td><td></td></tr>\n", | ||
"</table>\n", | ||
"\n", | ||
" 0*1 + 0*2 + 1*3 + 2*0 + 3*0 + 4*0 + 5*0 = 3\n", | ||
"\n", | ||
"#### Step 2\n", | ||
"<table>\n", | ||
"<tr><td></td><td>1</td><td>2</td><td>3</td><td>4</td><td>5</td></tr>\n", | ||
"<tr><td>1</td><td>2</td><td>3</td><td></td><td></td><td></td></tr>\n", | ||
"</table>\n", | ||
"\n", | ||
" 0*1 + 1*2 + 2*3 + 3*0 + 4*0 + 5*0 = 8\n", | ||
"\n", | ||
"#### Step 3\n", | ||
"<table>\n", | ||
"<tr><td>1</td><td>2</td><td>3</td><td>4</td><td>5</td></tr>\n", | ||
"<tr><td>1</td><td>2</td><td>3</td><td></td><td></td></tr>\n", | ||
"</table>\n", | ||
"\n", | ||
" 1*1 + 2*2 + 3*3 + 4*0 + 5*0 = 14\n", | ||
"\n", | ||
"## ...\n", | ||
"\n", | ||
"#### Last step\n", | ||
"<table>\n", | ||
"<tr><td>1</td><td>2</td><td>3</td><td>4</td><td>5</td><td></td><td></td></tr>\n", | ||
"<tr><td></td><td></td><td></td><td></td><td>1</td><td>2</td><td>3</td></tr>\n", | ||
"</table>\n", | ||
"\n", | ||
"So each step consists of doing the appropriate zero-padding and taking the dot product of the 2 vectors, and then moving to the right one spot." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"collapsed": false | ||
}, | ||
"outputs": [], | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.5.1" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 0 | ||
} |
Oops, something went wrong.