Remove Pattern lib dependency in News Classification tutorial (#1118)
* Import and download NLTK package on News Classification notebook
Makes sure the user has downloaded the 'stopwords' package from NLTK without
having to do it from shell.

Signed-off-by: Luiz Carlos Cavalcanti <[email protected]>

* Replace Pattern lib with NLTK on News Classification tutorial

Since the Pattern library does not yet support Python 3, NLTK is now also used
for lemmatization, removing the dependency on Pattern. It should now be possible
to run this tutorial on both Python 2.5+ and 3.x.

Signed-off-by: Luiz Carlos Cavalcanti <[email protected]>
luizcavalcanti authored and tmylk committed Jan 29, 2017
1 parent 2817aa7 commit 9cb4910
Showing 1 changed file with 10 additions and 3 deletions.
13 changes: 10 additions & 3 deletions docs/notebooks/gensim_news_classification.ipynb
@@ -22,8 +22,7 @@
 "Following are the dependencies for this tutorial:\n",
 " - Gensim Version >=0.13.1 would be preferred since we will be using topic coherence metrics extensively here.\n",
 " - matplotlib\n",
-" - Patterns library; Gensim uses this for lemmatization. ONLY FOR PYTHON 2.5+ - no support for Python 3 yet.\n",
-" - nltk.stopwords\n",
+" - nltk.stopwords and nltk.wordnet\n",
 " - pyLDAVis\n",
 "We will be playing around with 4 different topic models here:\n",
 " - LSI (Latent Semantic Indexing)\n",
@@ -56,6 +55,10 @@
 "import numpy as np\n",
 "warnings.filterwarnings('ignore') # Let's not pay heed to them right now\n",
 "\n",
+"import nltk\n",
+"nltk.download('stopwords') # Let's make sure the 'stopword' package is downloaded & updated\n",
+"nltk.download('wordnet') # Let's also download wordnet, which will be used for lemmatization\n",
+"\n",
 "from gensim.models import CoherenceModel, LdaModel, LsiModel, HdpModel\n",
 "from gensim.models.wrappers import LdaMallet\n",
 "from gensim.corpora import Dictionary\n",
@@ -306,7 +309,11 @@
 "    \"\"\"\n",
 "    texts = [[word for word in line if word not in stops] for line in texts]\n",
 "    texts = [bigram[line] for line in texts]\n",
-"    texts = [[word.split('/')[0] for word in lemmatize(' '.join(line), allowed_tags=re.compile('(NN)'), min_length=3)] for line in texts]\n",
+"    \n",
+"    from nltk.stem import WordNetLemmatizer\n",
+"    lemmatizer = WordNetLemmatizer()\n",
+"\n",
+"    texts = [[word for word in lemmatizer.lemmatize(' '.join(line), pos='v').split()] for line in texts]\n",
 "    return texts"
 ]
 },
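One caveat on the new lemmatization line: `WordNetLemmatizer.lemmatize` takes a single word, so passing the whole joined line probably returns it largely unchanged before `.split()` restores the tokens. Below is a minimal sketch of a per-token variant, assuming that per-word lemmatization is the intent. `lemmatize_tokens`, `toy_lemmatize` and the lookup table are hypothetical stand-ins so the example runs without WordNet data; in the notebook the callable would be something like `lambda w: lemmatizer.lemmatize(w, pos='v')`.

```python
def lemmatize_tokens(texts, lemmatize_word):
    """Apply a single-word lemmatizer to every token in every document."""
    return [[lemmatize_word(word) for word in line] for line in texts]


# Toy stand-in for a real lemmatizer (hypothetical lookup table).
toy_lemmas = {"running": "run", "ran": "run", "shares": "share"}


def toy_lemmatize(word):
    # Unknown words are returned unchanged.
    return toy_lemmas.get(word, word)


docs = [["running", "fast"], ["ran", "shares"]]
lemmatized = lemmatize_tokens(docs, toy_lemmatize)
print(lemmatized)
```

Keeping the lemmatizer as a parameter leaves the list-of-token-lists shape that the rest of the function expects and makes it easy to swap in a different part-of-speech setting.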
