diff --git a/docs/notebooks/Corpora_and_Vector_Spaces.ipynb b/docs/notebooks/Corpora_and_Vector_Spaces.ipynb index c6f7b6b189..7f62262b0f 100644 --- a/docs/notebooks/Corpora_and_Vector_Spaces.ipynb +++ b/docs/notebooks/Corpora_and_Vector_Spaces.ipynb @@ -44,7 +44,15 @@ "metadata": { "collapsed": false }, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2017-05-07 13:52:33,796 : INFO : 'pattern' package not found; tag filters are not available for English\n" + ] + } + ], "source": [ "from gensim import corpora" ] @@ -127,25 +135,33 @@ "\n", "The ways to process documents are so varied and application- and language-dependent that I decided to not constrain them by any interface. Instead, a document is represented by the features extracted from it, not by its “surface” string form: how you get to the features is up to you. Below I describe one common, general-purpose approach (called bag-of-words), but keep in mind that different application domains call for different features, and, as always, it’s [garbage in, garbage out](https://en.wikipedia.org/wiki/Garbage_in,_garbage_out)...\n", "\n", - "To convert documents to vectors, we’ll use a document representation called [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model). In this representation, each document is represented by one vector where each vector element represents a question-answer pair, in the style of:\n", - "\n", - "\"How many times does the word *system* appear in the document? Once\"\n", + "To convert documents to vectors, we’ll use a document representation called [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model). In this representation, each document is represented by one vector where a vector element `i` represents the number of times the `i`th word appears in the document.\n", "\n", "It is advantageous to represent the questions only by their (integer) ids. 
The mapping between words and their ids is called a dictionary:" ] }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2017-05-07 14:04:55,398 : INFO : adding document #0 to Dictionary(0 unique tokens: [])\n", + "2017-05-07 14:04:55,400 : INFO : built Dictionary(12 unique tokens: ['human', 'interface', 'computer', 'survey', 'user']...) from 9 documents (total 29 corpus positions)\n", + "2017-05-07 14:04:55,402 : INFO : saving Dictionary object under /tmp/deerwester.dict, separately None\n", + "2017-05-07 14:04:55,404 : INFO : saved /tmp/deerwester.dict\n" + ] + }, { "name": "stdout", "output_type": "stream", "text": [ - "Dictionary(12 unique tokens: ['response', 'survey', 'computer', 'user', 'minors']...)\n" + "Dictionary(12 unique tokens: ['human', 'interface', 'computer', 'survey', 'user']...)\n" ] } ], @@ -159,12 +175,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Here we assigned a unique integer id to all words appearing in the corpus with the [gensim.corpora.dictionary.Dictionary](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary) class. This sweeps across the texts, collecting word counts and relevant statistics. In the end, we see there are twelve distinct words in the processed corpus, which means each document will be represented by twelve numbers (ie., by a 12-D vector). To see the mapping between words and their ids:" + "Here we assigned a unique integer ID to all words appearing in the processed corpus with the [gensim.corpora.dictionary.Dictionary](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary) class. This sweeps across the texts, collecting word counts and relevant statistics. 
In the end, we see there are twelve distinct words in the processed corpus, which means each document will be represented by twelve numbers (i.e., by a 12-D vector). To see the mapping between words and their IDs:" ] }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "{'response': 3, 'survey': 4, 'computer': 2, 'user': 5, 'minors': 11, 'time': 6, 'system': 7, 'graph': 10, 'interface': 1, 'human': 0, 'eps': 8, 'trees': 9}\n" + "{'human': 0, 'interface': 1, 'computer': 2, 'survey': 3, 'user': 4, 'system': 5, 'response': 6, 'time': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}\n" ] } ], @@ -190,7 +206,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 9, "metadata": { "collapsed": false }, @@ -218,24 +234,35 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2017-05-07 14:15:59,996 : INFO : storing corpus in Matrix Market format to /tmp/deerwester.mm\n", + "2017-05-07 14:15:59,999 : INFO : saving sparse matrix to /tmp/deerwester.mm\n", + "2017-05-07 14:16:00,001 : INFO : PROGRESS: saving document #0\n", + "2017-05-07 14:16:00,003 : INFO : saved 9x12 matrix, density=25.926% (28/108)\n", + "2017-05-07 14:16:00,005 : INFO : saving MmCorpus index to /tmp/deerwester.mm.index\n" + ] + }, { "name": "stdout", "output_type": "stream", "text": [ "[(0, 1), (1, 1), (2, 1)]\n", "[(2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]\n", - "[(1, 1), (5, 1), (7, 1), (8, 1)]\n", - "[(0, 1), (7, 2), (8, 1)]\n", - "[(3, 1), (5, 1), (6, 1)]\n", + "[(1, 1), (4, 1), (5, 1), (8, 1)]\n", + "[(0, 1), (5, 2), (8, 1)]\n", + "[(4, 1), (6, 1), (7, 1)]\n", "[(9, 1)]\n", "[(9, 1), (10, 1)]\n", "[(9, 1), (10, 1), (11, 1)]\n", - "[(4, 1), (10, 1), (11, 1)]\n" + "[(3, 1), (10, 1), (11, 1)]\n" ] } ], @@ 
-250,16 +277,16 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "By now it should be clear that the vector feature with `id=10 stands` for the question “How many times does the word graph appear in the document?” and that the answer is “zero” for the first six documents and “one” for the remaining three. As a matter of fact, we have arrived at exactly the same corpus of vectors as in the [Quick Example](https://radimrehurek.com/gensim/tutorial.html#first-example). If you're running this notebook by your own, the words id may differ, but you should be able to check the consistency between documents comparing their vectors. \n", + "By now it should be clear that the vector feature with `id=10` represents the number of times the word \"graph\" occurs in the document. This count is “zero” for the first six documents and “one” for the remaining three. As a matter of fact, we have arrived at exactly the same corpus of vectors as in the [Quick Example](https://radimrehurek.com/gensim/tutorial.html#first-example). If you're running this notebook yourself, the word IDs may differ, but you should be able to check the consistency between documents by comparing their vectors. \n", "\n", "## Corpus Streaming – One Document at a Time\n", "\n", - "Note that *corpus* above resides fully in memory, as a plain Python list. In this simple example, it doesn’t matter much, but just to make things clear, let’s assume there are millions of documents in the corpus. Storing all of them in RAM won’t do. Instead, let’s assume the documents are stored in a file on disk, one document per line. Gensim only requires that a corpus must be able to return one document vector at a time:" + "Note that *corpus* above resides fully in memory, as a plain Python list. In this simple example, it doesn’t matter much, but just to make things clear, let’s assume there are millions of documents in the corpus. Storing all of them in RAM won’t do. 
Instead, let’s assume the documents are stored in a file on disk, one document per line. Gensim only requires that a corpus be able to return one document vector at a time:" ] }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 12, "metadata": { "collapsed": true }, @@ -276,12 +303,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The assumption that each document occupies one line in a single file is not important; you can mold the `__iter__` function to fit your input format, whatever it is. Walking directories, parsing XML, accessing network... Just parse your input to retrieve a clean list of tokens in each document, then convert the tokens via a dictionary to their ids and yield the resulting sparse vector inside `__iter__`." + "The assumption that each document occupies one line in a single file is not important; you can design the `__iter__` function to fit your input format, whatever that may be: walking directories, parsing XML, accessing network nodes... Just parse your input to retrieve a clean list of tokens in each document, then convert the tokens via a dictionary to their IDs and yield the resulting sparse vector inside `__iter__`." ] }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 13, "metadata": { "collapsed": false }, @@ -290,7 +317,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "<__main__.MyCorpus object at 0x7f4ad14856a0>\n" + "<__main__.MyCorpus object at 0x112c5acf8>\n" ] } ], @@ -303,12 +330,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Corpus is now an object. We didn’t define any way to print it, so `print` just outputs address of the object in memory. Not very useful. To see the constituent vectors, let’s iterate over the corpus and print each document vector (one at a time):" + "`corpus_memory_friendly` is now an object. We didn’t define any way to print it, so `print` just outputs the address of the object in memory. Not very useful. 
To see the constituent vectors, let’s iterate over the corpus and print each document vector (one at a time):" ] }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 14, "metadata": { "collapsed": false }, @@ -319,13 +346,13 @@ "text": [ "[(0, 1), (1, 1), (2, 1)]\n", "[(2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]\n", - "[(1, 1), (5, 1), (7, 1), (8, 1)]\n", - "[(0, 1), (7, 2), (8, 1)]\n", - "[(3, 1), (5, 1), (6, 1)]\n", + "[(1, 1), (4, 1), (5, 1), (8, 1)]\n", + "[(0, 1), (5, 2), (8, 1)]\n", + "[(4, 1), (6, 1), (7, 1)]\n", "[(9, 1)]\n", "[(9, 1), (10, 1)]\n", "[(9, 1), (10, 1), (11, 1)]\n", - "[(4, 1), (10, 1), (11, 1)]\n" + "[(3, 1), (10, 1), (11, 1)]\n" ] } ], @@ -381,22 +408,34 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "And that is all there is to it! At least as far as bag-of-words representation is concerned. Of course, what we do with such corpus is another question; it is not at all clear how counting the frequency of distinct words could be useful. As it turns out, it isn’t, and we will need to apply a transformation on this simple representation first, before we can use it to compute any meaningful document vs. document similarities. Transformations are covered in the [next tutorial](https://radimrehurek.com/gensim/tut2.html), but before that, let’s briefly turn our attention to *corpus persistency*.\n", + "And that is all there is to it! At least as far as bag-of-words representation is concerned. Of course, what we do with such a corpus is another question; it is not at all clear how counting the frequency of distinct words could be useful. As it turns out, it isn’t, and we will need to apply a transformation on this simple representation first, before we can use it to compute any meaningful document vs. document similarities. 
Transformations are covered in the [next tutorial](https://radimrehurek.com/gensim/tut2.html), but before that, let’s briefly turn our attention to *corpus persistency*.\n", "\n", "## Corpus Formats\n", "\n", - "There exist several file formats for serializing a Vector Space corpus (~sequence of vectors) to disk. *Gensim* implements them via the *streaming corpus interface* mentioned earlier: documents are read from (resp. stored to) disk in a lazy fashion, one document at a time, without the whole corpus being read into main memory at once.\n", + "There exist several file formats for serializing a Vector Space corpus (~sequence of vectors) to disk. *Gensim* implements them via the *streaming corpus interface* mentioned earlier: documents are read from (or stored to) disk in a lazy fashion, one document at a time, without the whole corpus being read into main memory at once.\n", "\n", "One of the more notable file formats is the [Matrix Market format](http://math.nist.gov/MatrixMarket/formats.html). 
To save a corpus in the Matrix Market format:" ] }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 15, "metadata": { - "collapsed": true + "collapsed": false }, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2017-05-07 14:34:16,166 : INFO : storing corpus in Matrix Market format to /tmp/corpus.mm\n", + "2017-05-07 14:34:16,169 : INFO : saving sparse matrix to /tmp/corpus.mm\n", + "2017-05-07 14:34:16,170 : INFO : PROGRESS: saving document #0\n", + "2017-05-07 14:34:16,172 : INFO : saved 2x2 matrix, density=25.000% (1/4)\n", + "2017-05-07 14:34:16,173 : INFO : saving MmCorpus index to /tmp/corpus.mm.index\n" + ] + } + ], "source": [ "# create a toy corpus of 2 documents, as a plain Python list\n", "corpus = [[(1, 0.5)], []] # make one document empty, for the heck of it\n", @@ -413,11 +452,28 @@ }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 16, "metadata": { - "collapsed": true + "collapsed": false }, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2017-05-07 14:34:29,173 : INFO : converting corpus to SVMlight format: /tmp/corpus.svmlight\n", + "2017-05-07 14:34:29,176 : INFO : saving SvmLightCorpus index to /tmp/corpus.svmlight.index\n", + "2017-05-07 14:34:29,178 : INFO : no word id mapping provided; initializing from corpus\n", + "2017-05-07 14:34:29,179 : INFO : storing corpus in Blei's LDA-C format into /tmp/corpus.lda-c\n", + "2017-05-07 14:34:29,181 : INFO : saving vocabulary of 2 words to /tmp/corpus.lda-c.vocab\n", + "2017-05-07 14:34:29,183 : INFO : saving BleiCorpus index to /tmp/corpus.lda-c.index\n", + "2017-05-07 14:34:29,184 : INFO : no word id mapping provided; initializing from corpus\n", + "2017-05-07 14:34:29,186 : INFO : storing corpus in List-Of-Words format into /tmp/corpus.low\n", + "2017-05-07 14:34:29,188 : WARNING : List-of-words format can only save vectors with integer elements; 1 float 
entries were truncated to integer value\n", + "2017-05-07 14:34:29,190 : INFO : saving LowCorpus index to /tmp/corpus.low.index\n" + ] + } + ], "source": [ "corpora.SvmLightCorpus.serialize('/tmp/corpus.svmlight', corpus)\n", "corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)\n", @@ -433,11 +489,21 @@ }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 17, "metadata": { "collapsed": false }, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2017-05-07 14:34:40,151 : INFO : loaded corpus index from /tmp/corpus.mm.index\n", + "2017-05-07 14:34:40,153 : INFO : initializing corpus reader from /tmp/corpus.mm\n", + "2017-05-07 14:34:40,156 : INFO : accepted corpus with 2 documents, 2 features, 1 non-zero entries\n" + ] + } + ], "source": [ "corpus = corpora.MmCorpus('/tmp/corpus.mm')" ] @@ -451,7 +517,7 @@ }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 18, "metadata": { "collapsed": false }, @@ -477,7 +543,7 @@ }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 19, "metadata": { "collapsed": false }, @@ -504,7 +570,7 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 20, "metadata": { "collapsed": false }, @@ -535,11 +601,22 @@ }, { "cell_type": "code", - "execution_count": 20, + "execution_count": 21, "metadata": { - "collapsed": true + "collapsed": false }, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2017-05-07 14:35:00,740 : INFO : no word id mapping provided; initializing from corpus\n", + "2017-05-07 14:35:00,743 : INFO : storing corpus in Blei's LDA-C format into /tmp/corpus.lda-c\n", + "2017-05-07 14:35:00,745 : INFO : saving vocabulary of 2 words to /tmp/corpus.lda-c.vocab\n", + "2017-05-07 14:35:00,747 : INFO : saving BleiCorpus index to /tmp/corpus.lda-c.index\n" + ] + } + ], "source": [ "corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)" ] @@ -557,7 +634,7 
@@ }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 22, "metadata": { "collapsed": false }, @@ -595,9 +672,18 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For a complete reference (Want to prune the dictionary to a smaller size? Optimize converting between corpora and NumPy/SciPy arrays?), see the [API documentation](https://radimrehurek.com/gensim/apiref.html). Or continue to the next tutorial on Topics and Transformations ([notebook](https://github.com/piskvorky/gensim/tree/develop/docs/notebooks/Topics_and_Transformations.ipynb) \n", + "For a complete reference (want to prune the dictionary to a smaller size? optimize converting between corpora and NumPy/SciPy arrays?), see the [API documentation](https://radimrehurek.com/gensim/apiref.html). Or continue to the next tutorial on Topics and Transformations ([notebook](https://github.com/piskvorky/gensim/tree/develop/docs/notebooks/Topics_and_Transformations.ipynb) \n", "or [website](https://radimrehurek.com/gensim/tut2.html))." ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [] + } ], "metadata": {
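A note for reviewers of this patch: the bag-of-words mapping the notebook builds with `corpora.Dictionary` and `doc2bow` can be sketched in a few lines of plain Python. This is a conceptual illustration only, not gensim's implementation; the tiny `texts` list and the `doc2bow` helper below are made up for the example.

```python
# Conceptual sketch of the bag-of-words conversion described in the notebook.
# Each distinct word gets an integer id, and a document becomes a sparse
# list of (word_id, count) pairs -- the shape of gensim's doc2bow output.
from collections import Counter

texts = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
]

# Assign ids in order of first appearance, like a tiny Dictionary.
token2id = {}
for text in texts:
    for token in text:
        token2id.setdefault(token, len(token2id))

def doc2bow(tokens):
    """Convert a token list to a sorted sparse vector of (id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

print(token2id["computer"])                       # -> 2 (id shared across documents)
print(doc2bow(["human", "computer", "computer"]))  # -> [(0, 1), (2, 2)]
```

Unknown words are silently dropped here, mirroring how `doc2bow` ignores tokens absent from the dictionary by default.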