From 503f960074531eac0b0989b4fcd4560605ee306e Mon Sep 17 00:00:00 2001
From: aneesh-joshi
Date: Sat, 17 Jun 2017 15:33:21 +0530
Subject: [PATCH 1/4] fixed incorrect link for unsup learning in quick start ipynb

---
 .../gensim Quick Start-checkpoint.ipynb | 389 ++++++++++++++++++
 gensim Quick Start.ipynb                |  34 +-
 2 files changed, 400 insertions(+), 23 deletions(-)
 create mode 100644 .ipynb_checkpoints/gensim Quick Start-checkpoint.ipynb

diff --git a/.ipynb_checkpoints/gensim Quick Start-checkpoint.ipynb b/.ipynb_checkpoints/gensim Quick Start-checkpoint.ipynb
new file mode 100644
index 0000000000..91f06656bc
--- /dev/null
+++ b/.ipynb_checkpoints/gensim Quick Start-checkpoint.ipynb
@@ -0,0 +1,389 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    " # Getting Started with `gensim`\n",
+    " \n",
+    " The goal of this tutorial is to get a new user up-and-running with `gensim`. This notebook covers the following objectives.\n",
+    " \n",
+    " ## Objectives\n",
+    " \n",
+    " * Installing `gensim`.\n",
+    " * Accessing the `gensim` Jupyter notebook tutorials.\n",
+    " * Presenting the core concepts behind the library.\n",
+    " \n",
+    " "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Installing `gensim`\n",
+    "\n",
+    "Before we can start using `gensim` for [natural language processing (NLP)](https://en.wikipedia.org/wiki/Natural_language_processing), you will need to install Python along with `gensim` and its dependences. It is suggested that a new user install a prepackaged python distribution and a number of popular distributions are listed below.\n",
+    "\n",
+    "* [Anaconda ](https://www.continuum.io/downloads)\n",
+    "* [EPD ](https://store.enthought.com/downloads)\n",
+    "* [WinPython ](https://winpython.github.io)\n",
+    "\n",
+    "Once Python is installed, we will use `pip` to install the `gensim` library. First, we will make sure that Python is installed and accessible from the command line.
From the command line, execute the following command:\n", + "\n", + " which python\n", + " \n", + "The resulting address should correspond to the Python distribution that you installed above. Now that we have verified that we are using the correct version of Python, we can install `gensim` from the command line as follows:\n", + "\n", + " pip install -U gensim\n", + " \n", + "To verify that `gensim` was installed correctly, you can activate Python from the command line and execute `import gensim`\n", + "\n", + " $ python\n", + " Python 3.5.1 |Anaconda custom (x86_64)| (default, Jun 15 2016, 16:14:02)\n", + " [GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)] on darwin\n", + " Type \"help\", \"copyright\", \"credits\" or \"license\" for more information.\n", + " >>> import gensim\n", + " >>> # No error is a good thing\n", + " >>> exit()\n", + "\n", + "**Note:** Windows users that are following long should either use [Windows subsystem for Linux](https://channel9.msdn.com/events/Windows/Windows-Developer-Day-Creators-Update/Developer-tools-and-updates) or another bash implementation for Windows, such as [Git bash](https://git-for-windows.github.io/)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Accessing the `gensim` Jupyter notebooks\n", + "\n", + "All of the `gensim` tutorials (including this document) are stored in [Jupyter notebooks](http://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html). These notebooks allow the user to run the code locally while working through the material. If you would like to run a tutorial locally, first clone the GitHub repository for the project.\n", + "``` bash\n", + " $ git clone https://github.com/RaRe-Technologies/gensim.git\n", + "``` \n", + "Next, start a Jupyter notebook server. 
This is accomplished using the following bash commands (or starting the notebook server from the GUI application).\n", + "\n", + "``` bash\n", + " $ cd gensim\n", + " $ pwd\n", + " /Users/user1/home/gensim\n", + " $ cd docs/notebooks\n", + " $ jupyter notebook\n", + "``` \n", + "After a few moments, Jupyter will open a web page in your browser and you can access each tutorial by clicking on the corresponding link. \n", + "\n", + "\n", + "\n", + "This will open the corresponding notebook in a separate tab. The Python code in the notebook can be executed by selecting/clicking on a cell and pressing SHIFT + ENTER.\n", + "\n", + "\n", + "\n", + "**Note:** The order of cell execution matters. Be sure to run all of the code cells in order from top to bottom, you you might encounter errors." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Core Concepts and Simple Example\n", + "\n", + "This section introduces the basic concepts and terms needed to understand and use `gensim` and provides a simple usage example. In particular, we will build a model that measures the importance of a particular word.\n", + "\n", + "At a very high-level, `gensim` is a tool for discovering the semantic structure of documents by examining the patterns of words (or higher-level structures such as entire sentences or documents). `gensim` accomplishes this by taking a *corpus*, a collection of text documents, and producing a *vector* representation of the text in the corpus. The vector representation can then be used to train a *model*, which is an algorithms to create different representations of the data, which are usually more semantic. These three concepts are key to understanding how `gensim` works so let's take a moment to explain what each of them means. At the same time, we'll work through a simple example that illustrates each of them.\n", + "\n", + "### Corpus\n", + "\n", + "A *corpus* is a collection of digital documents. 
This collection is the input to `gensim` from which it will infer the structure of the documents, their topics, etc. The latent structure inferred from the corpus can later be used to assign topics to new documents which were not present in the training corpus. For this reason, we also refer to this collection as the *training corpus*. No human intervention (such as tagging the documents by hand) is required - the topic classification is [unsupervised](https://en.wikipedia.org/wiki/Unsupervised_learning).\n", + "\n", + "For our corpus, we'll use a list of 9 strings, each consisting of only a single sentence." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "raw_corpus = [\"Human machine interface for lab abc computer applications\",\n", + " \"A survey of user opinion of computer system response time\",\n", + " \"The EPS user interface management system\",\n", + " \"System and human system engineering testing of EPS\", \n", + " \"Relation of user perceived response time to error measurement\",\n", + " \"The generation of random binary unordered trees\",\n", + " \"The intersection graph of paths in trees\",\n", + " \"Graph minors IV Widths of trees and well quasi ordering\",\n", + " \"Graph minors A survey\"]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This is a particularly small example of a corpus for illustration purposes. Another example could be a list of all the plays written by Shakespeare, list of all wikipedia articles, or all tweets by a particular person of interest.\n", + "\n", + "After collecting our corpus, there are typically a number of preprocessing steps we want to undertake. We'll keep it simple and just remove some commonly used English words (such as 'the') and words that occur only once in the corpus. In the process of doing so, we'll [tokenize](https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization) our data. 
Tokenization breaks up the documents into words (in this case using space as a delimiter)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[['human', 'interface', 'computer'],\n",
+       " ['survey', 'user', 'computer', 'system', 'response', 'time'],\n",
+       " ['eps', 'user', 'interface', 'system'],\n",
+       " ['system', 'human', 'system', 'eps'],\n",
+       " ['user', 'response', 'time'],\n",
+       " ['trees'],\n",
+       " ['graph', 'trees'],\n",
+       " ['graph', 'minors', 'trees'],\n",
+       " ['graph', 'minors', 'survey']]"
+      ]
+     },
+     "execution_count": 2,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# Create a set of frequent words\n",
+    "stoplist = set('for a of the and to in'.split(' '))\n",
+    "# Lowercase each document, split it by white space and filter out stopwords\n",
+    "texts = [[word for word in document.lower().split() if word not in stoplist]\n",
+    "         for document in raw_corpus]\n",
+    "\n",
+    "# Count word frequencies\n",
+    "from collections import defaultdict\n",
+    "frequency = defaultdict(int)\n",
+    "for text in texts:\n",
+    "    for token in text:\n",
+    "        frequency[token] += 1\n",
+    "\n",
+    "# Only keep words that appear more than once\n",
+    "processed_corpus = [[token for token in text if frequency[token] > 1] for text in texts]\n",
+    "processed_corpus"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Before proceeding, we want to associate each word in the corpus with a unique integer ID. We can do this using the `gensim.corpora.Dictionary` class. This dictionary defines the vocabulary of all words that our processing knows about."
+ ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Dictionary(12 unique tokens: [u'minors', u'graph', u'system', u'trees', u'eps']...)\n" + ] + } + ], + "source": [ + "from gensim import corpora\n", + "\n", + "dictionary = corpora.Dictionary(processed_corpus)\n", + "print(dictionary)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Because our corpus is small, there is only 12 different tokens in this `Dictionary`. For larger corpuses, dictionaries that contains hundreds of thousands of tokens are quite common." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Vector\n", + "\n", + "To infer the latent structure in our corpus we need a way to represent documents that we can manipulate mathematically. One approach is to represent each document as a vector. There are various approaches for creating a vector representation of a document but a simple example is the *bag-of-words model*. Under the bag-of-words model each document is represented by a vector containing the frequency counts of each word in the dictionary. For example, given a dictionary containing the words `['coffee', 'milk', 'sugar', 'spoon']` a document consisting of the string `\"coffee milk coffee\"` could be represented by the vector `[2, 1, 0, 0]` where the entries of the vector are (in order) the occurrences of \"coffee\", \"milk\", \"sugar\" and \"spoon\" in the document. The length of the vector is the number of entries in the dictionary. One of the main properties of the bag-of-words model is that it completely ignores the order of the tokens in the document that is encoded, which is where the name bag-of-words comes from.\n", + "\n", + "Our processed corpus has 12 unique words in it, which means that each document will be represented by a 12-dimensional vector under the bag-of-words model. 
We can use the dictionary to turn tokenized documents into these 12-dimensional vectors. We can see what these IDs correspond to:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{u'minors': 11, u'graph': 10, u'system': 6, u'trees': 9, u'eps': 8, u'computer': 1, u'survey': 5, u'user': 7, u'human': 2, u'time': 4, u'interface': 0, u'response': 3}\n" + ] + } + ], + "source": [ + "print(dictionary.token2id)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For example, suppose we wanted to vectorize the phrase \"Human computer interaction\" (note that this phrase was not in our original corpus). We can create the bag-of-word representation for a document using the `doc2bow` method of the dictionary, which returns a sparse representation of the word counts:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[(1, 1), (2, 1)]" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "new_doc = \"Human computer interaction\"\n", + "new_vec = dictionary.doc2bow(new_doc.lower().split())\n", + "new_vec" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The first entry in each tuple corresponds to the ID of the token in the dictionary, the second corresponds to the count of this token." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that \"interaction\" did not occur in the original corpus and so it was not included in the vectorization. Also note that this vector only contains entries for words that actually appeared in the document. 
Because any given document will only contain a few words out of the many words in the dictionary, words that do not appear in the vectorization are represented as implicitly zero as a space saving measure.\n", + "\n", + "We can convert our entire original corpus to a list of vectors:" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[[(0, 1), (1, 1), (2, 1)],\n", + " [(1, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],\n", + " [(0, 1), (6, 1), (7, 1), (8, 1)],\n", + " [(2, 1), (6, 2), (8, 1)],\n", + " [(3, 1), (4, 1), (7, 1)],\n", + " [(9, 1)],\n", + " [(9, 1), (10, 1)],\n", + " [(9, 1), (10, 1), (11, 1)],\n", + " [(5, 1), (10, 1), (11, 1)]]" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]\n", + "bow_corpus" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that while this list lives entirely in memory, in most applications you will want a more scalable solution. Luckily, `gensim` allows you to use any iterator that returns a single document vector at a time. See the documentation for more details.\n", + "\n", + "### Model\n", + "\n", + "Now that we have vectorized our corpus we can begin to transform it using *models*. We use model as an abstract term referring to a transformation from one document representation to another. In `gensim`, documents are represented as vectors so a model can be thought of as a transformation between two [vector spaces](https://en.wikipedia.org/wiki/Vector_space). The details of this transformation are learned from the training corpus.\n", + "\n", + "One simple example of a model is [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf). 
The tf-idf model transforms vectors from the bag-of-words representation to a vector space, where the frequency counts are weighted according to the relative rarity of each word in the corpus.\n", + "\n", + "Here's a simple example. Let's initialize the tf-idf model, training it on our corpus and transforming the string \"system minors\":" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[(6, 0.5898341626740045), (11, 0.8075244024440723)]" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from gensim import models\n", + "# train the model\n", + "tfidf = models.TfidfModel(bow_corpus)\n", + "# transform the \"system minors\" string\n", + "tfidf[dictionary.doc2bow(\"system minors\".lower().split())]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The `tfidf` model again returns a list of tuples, where the first entry is the token ID and the second entry is the tf-idf weighting. Note that the ID corresponding to \"system\" (which occurred 4 times in the original corpus) has been weighted lower than the ID corresponding to \"minors\" (which only occurred twice).\n", + "\n", + "`gensim` offers a number of different models/transformations. See [Transformations and Topics](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Topics_and_Transformations.ipynb) for details." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Next Steps\n", + "\n", + "Interested in learning more about `gensim`? Please read through the following notebooks.\n", + "\n", + "1. [Corpora_and_Vector_Spaces.ipynb](docs/notebooks/Corpora_and_Vector_Spaces.ipynb)\n", + "2. 
[word2vec.ipynb](docs/notebooks/word2vec.ipynb)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 2",
+   "language": "python",
+   "name": "python2"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.1"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}
diff --git a/gensim Quick Start.ipynb b/gensim Quick Start.ipynb
index a1fad3a7a3..91f06656bc 100644
--- a/gensim Quick Start.ipynb
+++ b/gensim Quick Start.ipynb
@@ -92,7 +92,7 @@
     "\n",
     "### Corpus\n",
     "\n",
-    "A *corpus* is a collection of digital documents. This collection is the input to `gensim` from which it will infer the structure of the documents, their topics, etc. The latent structure inferred from the corpus can later be used to assign topics to new documents which were not present in the training corpus. For this reason, we also refer to this collection as the *training corpus*. No human intervention (such as tagging the documents by hand) is required - the topic classification is [unsupervised](https://en.wikipedia.org/wiki/Unsupervised_learning.html).\n",
+    "A *corpus* is a collection of digital documents. This collection is the input to `gensim` from which it will infer the structure of the documents, their topics, etc. The latent structure inferred from the corpus can later be used to assign topics to new documents which were not present in the training corpus. For this reason, we also refer to this collection as the *training corpus*. No human intervention (such as tagging the documents by hand) is required - the topic classification is [unsupervised](https://en.wikipedia.org/wiki/Unsupervised_learning).\n",
     "\n",
     "For our corpus, we'll use a list of 9 strings, each consisting of only a single sentence."
    ]
@@ -128,9 +128,7 @@
   {
    "cell_type": "code",
    "execution_count": 2,
-   "metadata": {
-    "collapsed": false
-   },
+   "metadata": {},
    "outputs": [
     {
      "data": {
@@ -180,9 +178,7 @@
   {
    "cell_type": "code",
    "execution_count": 3,
-   "metadata": {
-    "collapsed": false
-   },
+   "metadata": {},
    "outputs": [
     {
      "name": "stdout",
@@ -220,9 +216,7 @@
   {
    "cell_type": "code",
    "execution_count": 4,
-   "metadata": {
-    "collapsed": false
-   },
+   "metadata": {},
    "outputs": [
     {
      "name": "stdout",
@@ -246,9 +240,7 @@
   {
    "cell_type": "code",
    "execution_count": 5,
-   "metadata": {
-    "collapsed": false
-   },
+   "metadata": {},
    "outputs": [
     {
      "data": {
@@ -286,9 +278,7 @@
   {
    "cell_type": "code",
    "execution_count": 6,
-   "metadata": {
-    "collapsed": false
-   },
+   "metadata": {},
    "outputs": [
     {
      "data": {
@@ -332,9 +322,7 @@
   {
    "cell_type": "code",
    "execution_count": 7,
-   "metadata": {
-    "collapsed": false
-   },
+   "metadata": {},
    "outputs": [
     {
      "data": {
@@ -379,9 +367,9 @@
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": "Python [Root]",
+   "display_name": "Python 2",
    "language": "python",
-   "name": "Python [Root]"
+   "name": "python2"
   },
   "language_info": {
    "codemirror_mode": {
@@ -393,9 +381,9 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.5.1"
+   "version": "3.6.1"
   }
  },
  "nbformat": 4,
- "nbformat_minor": 0
+ "nbformat_minor": 1
 }

From a3104abf9bf1fafe6cb7f3f64905734222a9f4c7 Mon Sep 17 00:00:00 2001
From: aneesh-joshi
Date: Sat, 17 Jun 2017 15:47:42 +0530
Subject: [PATCH 2/4] fixed link to unsup learning in gensim quick start notebook

---
 gensim Quick Start.ipynb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gensim Quick Start.ipynb b/gensim Quick Start.ipynb
index 91f06656bc..60bc121966 100644
--- a/gensim Quick Start.ipynb
+++ b/gensim Quick Start.ipynb
@@ -56,7 +56,7 @@
    "source": [
     "## Accessing the `gensim` Jupyter notebooks\n",
     "\n",
-    "All of the `gensim` tutorials (including this document) are stored in [Jupyter
notebooks](http://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html). These notebooks allow the user to run the code locally while working through the material. If you would like to run a tutorial locally, first clone the GitHub repository for the project.\n",
+    "All of the `gensim` tutorials (including this document) are stored in [Jupyter notebooks](http://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter). These notebooks allow the user to run the code locally while working through the material. If you would like to run a tutorial locally, first clone the GitHub repository for the project.\n",
     "``` bash\n",
     " $ git clone https://github.com/RaRe-Technologies/gensim.git\n",
     "``` \n",

From 42d06c5bc6e79d0a9676330a538ba4192caa780a Mon Sep 17 00:00:00 2001
From: aneesh-joshi
Date: Mon, 19 Jun 2017 13:01:19 +0530
Subject: [PATCH 3/4] fixed self made error on jupyter notebook link

---
 gensim Quick Start.ipynb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gensim Quick Start.ipynb b/gensim Quick Start.ipynb
index 60bc121966..91f06656bc 100644
--- a/gensim Quick Start.ipynb
+++ b/gensim Quick Start.ipynb
@@ -56,7 +56,7 @@
    "source": [
     "## Accessing the `gensim` Jupyter notebooks\n",
     "\n",
-    "All of the `gensim` tutorials (including this document) are stored in [Jupyter notebooks](http://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter). These notebooks allow the user to run the code locally while working through the material. If you would like to run a tutorial locally, first clone the GitHub repository for the project.\n",
+    "All of the `gensim` tutorials (including this document) are stored in [Jupyter notebooks](http://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html). These notebooks allow the user to run the code locally while working through the material.
If you would like to run a tutorial locally, first clone the GitHub repository for the project.\n",
     "``` bash\n",
     " $ git clone https://github.com/RaRe-Technologies/gensim.git\n",
     "``` \n",

From 16998b10e05d2463c1d44102da4902946840db81 Mon Sep 17 00:00:00 2001
From: aneesh-joshi
Date: Wed, 21 Jun 2017 10:49:15 +0530
Subject: [PATCH 4/4] removed .ipynb checkpoint

---
 .../gensim Quick Start-checkpoint.ipynb | 389 ------------------
 1 file changed, 389 deletions(-)
 delete mode 100644 .ipynb_checkpoints/gensim Quick Start-checkpoint.ipynb

diff --git a/.ipynb_checkpoints/gensim Quick Start-checkpoint.ipynb b/.ipynb_checkpoints/gensim Quick Start-checkpoint.ipynb
deleted file mode 100644
index 91f06656bc..0000000000
--- a/.ipynb_checkpoints/gensim Quick Start-checkpoint.ipynb
+++ /dev/null
@@ -1,389 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    " # Getting Started with `gensim`\n",
-    " \n",
-    " The goal of this tutorial is to get a new user up-and-running with `gensim`. This notebook covers the following objectives.\n",
-    " \n",
-    " ## Objectives\n",
-    " \n",
-    " * Installing `gensim`.\n",
-    " * Accessing the `gensim` Jupyter notebook tutorials.\n",
-    " * Presenting the core concepts behind the library.\n",
-    " \n",
-    " "
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Installing `gensim`\n",
-    "\n",
-    "Before we can start using `gensim` for [natural language processing (NLP)](https://en.wikipedia.org/wiki/Natural_language_processing), you will need to install Python along with `gensim` and its dependences. It is suggested that a new user install a prepackaged python distribution and a number of popular distributions are listed below.\n",
-    "\n",
-    "* [Anaconda ](https://www.continuum.io/downloads)\n",
-    "* [EPD ](https://store.enthought.com/downloads)\n",
-    "* [WinPython ](https://winpython.github.io)\n",
-    "\n",
-    "Once Python is installed, we will use `pip` to install the `gensim` library.
First, we will make sure that Python is installed and accessible from the command line. From the command line, execute the following command:\n", - "\n", - " which python\n", - " \n", - "The resulting address should correspond to the Python distribution that you installed above. Now that we have verified that we are using the correct version of Python, we can install `gensim` from the command line as follows:\n", - "\n", - " pip install -U gensim\n", - " \n", - "To verify that `gensim` was installed correctly, you can activate Python from the command line and execute `import gensim`\n", - "\n", - " $ python\n", - " Python 3.5.1 |Anaconda custom (x86_64)| (default, Jun 15 2016, 16:14:02)\n", - " [GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)] on darwin\n", - " Type \"help\", \"copyright\", \"credits\" or \"license\" for more information.\n", - " >>> import gensim\n", - " >>> # No error is a good thing\n", - " >>> exit()\n", - "\n", - "**Note:** Windows users that are following long should either use [Windows subsystem for Linux](https://channel9.msdn.com/events/Windows/Windows-Developer-Day-Creators-Update/Developer-tools-and-updates) or another bash implementation for Windows, such as [Git bash](https://git-for-windows.github.io/)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Accessing the `gensim` Jupyter notebooks\n", - "\n", - "All of the `gensim` tutorials (including this document) are stored in [Jupyter notebooks](http://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html). These notebooks allow the user to run the code locally while working through the material. If you would like to run a tutorial locally, first clone the GitHub repository for the project.\n", - "``` bash\n", - " $ git clone https://github.com/RaRe-Technologies/gensim.git\n", - "``` \n", - "Next, start a Jupyter notebook server. 
This is accomplished using the following bash commands (or starting the notebook server from the GUI application).\n", - "\n", - "``` bash\n", - " $ cd gensim\n", - " $ pwd\n", - " /Users/user1/home/gensim\n", - " $ cd docs/notebooks\n", - " $ jupyter notebook\n", - "``` \n", - "After a few moments, Jupyter will open a web page in your browser and you can access each tutorial by clicking on the corresponding link. \n", - "\n", - "\n", - "\n", - "This will open the corresponding notebook in a separate tab. The Python code in the notebook can be executed by selecting/clicking on a cell and pressing SHIFT + ENTER.\n", - "\n", - "\n", - "\n", - "**Note:** The order of cell execution matters. Be sure to run all of the code cells in order from top to bottom, you you might encounter errors." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Core Concepts and Simple Example\n", - "\n", - "This section introduces the basic concepts and terms needed to understand and use `gensim` and provides a simple usage example. In particular, we will build a model that measures the importance of a particular word.\n", - "\n", - "At a very high-level, `gensim` is a tool for discovering the semantic structure of documents by examining the patterns of words (or higher-level structures such as entire sentences or documents). `gensim` accomplishes this by taking a *corpus*, a collection of text documents, and producing a *vector* representation of the text in the corpus. The vector representation can then be used to train a *model*, which is an algorithms to create different representations of the data, which are usually more semantic. These three concepts are key to understanding how `gensim` works so let's take a moment to explain what each of them means. At the same time, we'll work through a simple example that illustrates each of them.\n", - "\n", - "### Corpus\n", - "\n", - "A *corpus* is a collection of digital documents. 
This collection is the input to `gensim` from which it will infer the structure of the documents, their topics, etc. The latent structure inferred from the corpus can later be used to assign topics to new documents which were not present in the training corpus. For this reason, we also refer to this collection as the *training corpus*. No human intervention (such as tagging the documents by hand) is required - the topic classification is [unsupervised](https://en.wikipedia.org/wiki/Unsupervised_learning).\n", - "\n", - "For our corpus, we'll use a list of 9 strings, each consisting of only a single sentence." - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": { - "collapsed": true - }, - "outputs": [], - "source": [ - "raw_corpus = [\"Human machine interface for lab abc computer applications\",\n", - " \"A survey of user opinion of computer system response time\",\n", - " \"The EPS user interface management system\",\n", - " \"System and human system engineering testing of EPS\", \n", - " \"Relation of user perceived response time to error measurement\",\n", - " \"The generation of random binary unordered trees\",\n", - " \"The intersection graph of paths in trees\",\n", - " \"Graph minors IV Widths of trees and well quasi ordering\",\n", - " \"Graph minors A survey\"]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This is a particularly small example of a corpus for illustration purposes. Another example could be a list of all the plays written by Shakespeare, list of all wikipedia articles, or all tweets by a particular person of interest.\n", - "\n", - "After collecting our corpus, there are typically a number of preprocessing steps we want to undertake. We'll keep it simple and just remove some commonly used English words (such as 'the') and words that occur only once in the corpus. In the process of doing so, we'll [tokenize](https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization) our data. 
Tokenization breaks up the documents into words (in this case using space as a delimiter)."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 2,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "[['human', 'interface', 'computer'],\n",
-       " ['survey', 'user', 'computer', 'system', 'response', 'time'],\n",
-       " ['eps', 'user', 'interface', 'system'],\n",
-       " ['system', 'human', 'system', 'eps'],\n",
-       " ['user', 'response', 'time'],\n",
-       " ['trees'],\n",
-       " ['graph', 'trees'],\n",
-       " ['graph', 'minors', 'trees'],\n",
-       " ['graph', 'minors', 'survey']]"
-      ]
-     },
-     "execution_count": 2,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "# Create a set of frequent words\n",
-    "stoplist = set('for a of the and to in'.split(' '))\n",
-    "# Lowercase each document, split it by white space and filter out stopwords\n",
-    "texts = [[word for word in document.lower().split() if word not in stoplist]\n",
-    "         for document in raw_corpus]\n",
-    "\n",
-    "# Count word frequencies\n",
-    "from collections import defaultdict\n",
-    "frequency = defaultdict(int)\n",
-    "for text in texts:\n",
-    "    for token in text:\n",
-    "        frequency[token] += 1\n",
-    "\n",
-    "# Only keep words that appear more than once\n",
-    "processed_corpus = [[token for token in text if frequency[token] > 1] for text in texts]\n",
-    "processed_corpus"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Before proceeding, we want to associate each word in the corpus with a unique integer ID. We can do this using the `gensim.corpora.Dictionary` class. This dictionary defines the vocabulary of all words that our processing knows about."
- ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Dictionary(12 unique tokens: [u'minors', u'graph', u'system', u'trees', u'eps']...)\n" - ] - } - ], - "source": [ - "from gensim import corpora\n", - "\n", - "dictionary = corpora.Dictionary(processed_corpus)\n", - "print(dictionary)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Because our corpus is small, there are only 12 different tokens in this `Dictionary`. For larger corpora, dictionaries that contain hundreds of thousands of tokens are quite common." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Vector\n", - "\n", - "To infer the latent structure in our corpus we need a way to represent documents that we can manipulate mathematically. One approach is to represent each document as a vector. There are various approaches for creating a vector representation of a document but a simple example is the *bag-of-words model*. Under the bag-of-words model each document is represented by a vector containing the frequency counts of each word in the dictionary. For example, given a dictionary containing the words `['coffee', 'milk', 'sugar', 'spoon']` a document consisting of the string `\"coffee milk coffee\"` could be represented by the vector `[2, 1, 0, 0]` where the entries of the vector are (in order) the occurrences of \"coffee\", \"milk\", \"sugar\" and \"spoon\" in the document. The length of the vector is the number of entries in the dictionary. One of the main properties of the bag-of-words model is that it completely ignores the order of the tokens in the document that is encoded, which is where the name bag-of-words comes from.\n", - "\n", - "Our processed corpus has 12 unique words in it, which means that each document will be represented by a 12-dimensional vector under the bag-of-words model. 
We can use the dictionary to turn tokenized documents into these 12-dimensional vectors. First, we can see which integer ID the dictionary assigned to each token:" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{u'minors': 11, u'graph': 10, u'system': 6, u'trees': 9, u'eps': 8, u'computer': 1, u'survey': 5, u'user': 7, u'human': 2, u'time': 4, u'interface': 0, u'response': 3}\n" - ] - } - ], - "source": [ - "print(dictionary.token2id)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "For example, suppose we wanted to vectorize the phrase \"Human computer interaction\" (note that this phrase was not in our original corpus). We can create the bag-of-words representation for a document using the `doc2bow` method of the dictionary, which returns a sparse representation of the word counts:" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[(1, 1), (2, 1)]" - ] - }, - "execution_count": 5, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "new_doc = \"Human computer interaction\"\n", - "new_vec = dictionary.doc2bow(new_doc.lower().split())\n", - "new_vec" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The first entry in each tuple corresponds to the ID of the token in the dictionary; the second corresponds to the count of this token." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Note that \"interaction\" did not occur in the original corpus and so it was not included in the vectorization. Also note that this vector only contains entries for words that actually appeared in the document. 
Because any given document will only contain a few words out of the many words in the dictionary, words that do not appear in the vectorization are represented as implicitly zero as a space saving measure.\n", - "\n", - "We can convert our entire original corpus to a list of vectors:" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[[(0, 1), (1, 1), (2, 1)],\n", - " [(1, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],\n", - " [(0, 1), (6, 1), (7, 1), (8, 1)],\n", - " [(2, 1), (6, 2), (8, 1)],\n", - " [(3, 1), (4, 1), (7, 1)],\n", - " [(9, 1)],\n", - " [(9, 1), (10, 1)],\n", - " [(9, 1), (10, 1), (11, 1)],\n", - " [(5, 1), (10, 1), (11, 1)]]" - ] - }, - "execution_count": 6, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]\n", - "bow_corpus" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Note that while this list lives entirely in memory, in most applications you will want a more scalable solution. Luckily, `gensim` allows you to use any iterator that returns a single document vector at a time. See the documentation for more details.\n", - "\n", - "### Model\n", - "\n", - "Now that we have vectorized our corpus we can begin to transform it using *models*. We use model as an abstract term referring to a transformation from one document representation to another. In `gensim`, documents are represented as vectors so a model can be thought of as a transformation between two [vector spaces](https://en.wikipedia.org/wiki/Vector_space). The details of this transformation are learned from the training corpus.\n", - "\n", - "One simple example of a model is [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf). 
The tf-idf model transforms vectors from the bag-of-words representation to a vector space, where the frequency counts are weighted according to the relative rarity of each word in the corpus.\n", - "\n", - "Here's a simple example. Let's initialize the tf-idf model, training it on our corpus and transforming the string \"system minors\":" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[(6, 0.5898341626740045), (11, 0.8075244024440723)]" - ] - }, - "execution_count": 7, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from gensim import models\n", - "# train the model\n", - "tfidf = models.TfidfModel(bow_corpus)\n", - "# transform the \"system minors\" string\n", - "tfidf[dictionary.doc2bow(\"system minors\".lower().split())]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The `tfidf` model again returns a list of tuples, where the first entry is the token ID and the second entry is the tf-idf weighting. Note that the ID corresponding to \"system\" (which occurred 4 times in the original corpus) has been weighted lower than the ID corresponding to \"minors\" (which only occurred twice).\n", - "\n", - "`gensim` offers a number of different models/transformations. See [Transformations and Topics](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Topics_and_Transformations.ipynb) for details." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Next Steps\n", - "\n", - "Interested in learning more about `gensim`? Please read through the following notebooks.\n", - "\n", - "1. [Corpora_and_Vector_Spaces.ipynb](docs/notebooks/Corpora_and_Vector_Spaces.ipynb)\n", - "2. 
[word2vec.ipynb](docs/notebooks/word2vec.ipynb)" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 2", - "language": "python", - "name": "python2" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.1" - } - }, - "nbformat": 4, - "nbformat_minor": 1 -}