Remove Pattern lib dependency in News Classification tutorial (#1118)

* Import and download NLTK package on News Classification notebook

  Makes sure the user has downloaded the 'stopwords' package from NLTK without having to do it from the shell.

  Signed-off-by: Luiz Carlos Cavalcanti <[email protected]>

* Replace Pattern lib with NLTK on News Classification tutorial

  Since the Pattern lib does not yet support Python 3, NLTK is now also used for lemmatization, removing the dependency on Pattern. It should now be possible to run this tutorial on both Python 2.5+ and 3.x.

  Signed-off-by: Luiz Carlos Cavalcanti <[email protected]>
piskvorky · Jan 29, 2017 · 9cb4910
1 parent 2817aa7
Showing 1 changed file with 10 additions and 3 deletions.
diff --git a/docs/notebooks/gensim_news_classification.ipynb b/docs/notebooks/gensim_news_classification.ipynb
@@ -22,8 +22,7 @@
     "Following are the dependencies for this tutorial:\n",
     "    - Gensim Version >=0.13.1 would be preferred since we will be using topic coherence metrics extensively here.\n",
     "    - matplotlib\n",
-    "    - Patterns library; Gensim uses this for lemmatization. ONLY FOR PYTHON 2.5+ - no support for Python 3 yet.\n",
-    "    - nltk.stopwords\n",
+    "    - nltk.stopwords and nltk.wordnet\n",
     "    - pyLDAVis\n",
     "We will be playing around with 4 different topic models here:\n",
     "    - LSI (Latent Semantic Indexing)\n",
@@ -56,6 +55,10 @@
     "import numpy as np\n",
     "warnings.filterwarnings('ignore')  # Let's not pay heed to them right now\n",
     "\n",
+    "import nltk\n",
+    "nltk.download('stopwords') # Let's make sure the 'stopword' package is downloaded & updated\n",
+    "nltk.download('wordnet') # Let's also download wordnet, which will be used for lemmatization\n",
+    "\n",
     "from gensim.models import CoherenceModel, LdaModel, LsiModel, HdpModel\n",
     "from gensim.models.wrappers import LdaMallet\n",
     "from gensim.corpora import Dictionary\n",
@@ -306,7 +309,11 @@
     "    \"\"\"\n",
     "    texts = [[word for word in line if word not in stops] for line in texts]\n",
     "    texts = [bigram[line] for line in texts]\n",
-    "    texts = [[word.split('/')[0] for word in lemmatize(' '.join(line), allowed_tags=re.compile('(NN)'), min_length=3)] for line in texts]\n",
+    "    \n",
+    "    from nltk.stem import WordNetLemmatizer\n",
+    "    lemmatizer = WordNetLemmatizer()\n",
+    "\n",
+    "    texts = [[word for word in lemmatizer.lemmatize(' '.join(line), pos='v').split()] for line in texts]\n",
     "    return texts"
    ]
   },
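
Note on the new lemmatization line: `WordNetLemmatizer.lemmatize` takes a single word, so passing the space-joined line means the whole string is looked up as one entry and will most likely come back unchanged. A minimal sketch of a per-token variant follows. The helper name `lemmatize_texts` and the toy `strip_plural` stand-in are hypothetical and not part of this commit; the stand-in only keeps the sketch runnable without the WordNet corpus.

```python
def lemmatize_texts(texts, lemmatize_word):
    """Apply a single-word lemmatizer to every token of every document.

    texts: list of token lists, as in the notebook's preprocessing step.
    lemmatize_word: callable mapping one word to its lemma (hypothetical
    parameter; with NLTK one could build it as
        lemmatizer = WordNetLemmatizer()
        lemmatize_word = lambda w: lemmatizer.lemmatize(w, pos='v')
    ).
    """
    return [[lemmatize_word(word) for word in line] for line in texts]


def strip_plural(word):
    # Toy stand-in for a real lemmatizer (assumption, illustration only):
    # drops a trailing "s" from words longer than three characters.
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


docs = [["markets", "rise"], ["bus", "stops"]]
print(lemmatize_texts(docs, strip_plural))
```

With NLTK installed and the 'wordnet' resource downloaded, swapping `strip_plural` for the `WordNetLemmatizer`-based callable described in the docstring should give per-word lemmas while keeping the list-of-token-lists shape that the rest of the notebook expects.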