TopicModel Exercise
For this exercise, we'll compute topics for a data set using Mallet, first as a command-line tool and then as an API. The main points of the exercise are to:
- show how easy it is to get topics for a corpus
- explore some parameterizations and see their effect on the topics that are found
- integrate a tool like Mallet, whose API is somewhat poorly documented, into your own code
You need a data set that either is already in a directory with one text document per file, or that you can process to obtain that form. For example:
- You could modify the Federalist papers extraction code so that it outputs one file per article. (The easiest thing is to turn that into a standalone script.)
- You can download an existing data set that already comes in that format, like 20 Newsgroups or the Enron data.
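If your data starts out as one big text file, the splitting step might look like the following minimal sketch. It assumes that each document begins with a recognizable header line. The `FEDERALIST` prefix, the object name `SplitDocs`, and the `doc_NNN.txt` file naming are illustrative assumptions, so check them against your own raw text before relying on them.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, Paths}
import scala.collection.mutable.ArrayBuffer

object SplitDocs {

  /** Groups lines into documents. A new document starts at every line for
    * which isHeader is true; lines before the first header are dropped. */
  def split(lines: Seq[String], isHeader: String => Boolean): Vector[Vector[String]] = {
    val docs = ArrayBuffer.empty[ArrayBuffer[String]]
    for (line <- lines) {
      if (isHeader(line)) docs += ArrayBuffer(line)
      else if (docs.nonEmpty) docs.last += line
    }
    docs.map(_.toVector).toVector
  }

  /** Writes each document to its own file: doc_001.txt, doc_002.txt, ... */
  def writeDocs(docs: Seq[Seq[String]], outDir: Path): Unit = {
    Files.createDirectories(outDir)
    for ((doc, i) <- docs.zipWithIndex) {
      val file = outDir.resolve(f"doc_${i + 1}%03d.txt")
      Files.write(file, doc.mkString("\n").getBytes(StandardCharsets.UTF_8))
    }
  }

  // Usage: SplitDocs <input-file> <output-dir>
  def main(args: Array[String]): Unit = {
    val source = scala.io.Source.fromFile(args(0), "UTF-8")
    val lines = try source.getLines().toVector finally source.close()
    // Assumption: every article starts with a line beginning "FEDERALIST".
    val docs = split(lines, _.startsWith("FEDERALIST"))
    writeDocs(docs, Paths.get(args(1)))
    println(s"Wrote ${docs.length} files to ${args(1)}")
  }
}
```

Keeping `split` separate from the file I/O makes it easy to try a different header test on a few lines in the REPL before writing anything to disk.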
Compute topics by following the instructions for command-line usage for computing topic models with Mallet. Explore some of the different parameterizations (especially the number of topics) to see how the topics vary.
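To compare several values for the number of topics without retyping commands, one option is to drive the command-line tool from a small Scala script. The sketch below uses Mallet's `import-dir` and `train-topics` commands. The `bin/mallet` path, the output file names, and the particular topic counts are assumptions for illustration, so adjust them to your installation and data.

```scala
import scala.sys.process._

object MalletSweep {
  // Assumed location of the Mallet launcher, relative to the working directory.
  val mallet = "bin/mallet"

  /** Command that converts a directory of text files into a Mallet instance file. */
  def importCmd(inputDir: String, outFile: String): Seq[String] =
    Seq(mallet, "import-dir",
      "--input", inputDir,
      "--output", outFile,
      "--keep-sequence",      // topic models need token sequences, not bags of features
      "--remove-stopwords")

  /** Command that trains a topic model and writes topic keys and document compositions. */
  def trainCmd(instances: String, numTopics: Int, outPrefix: String): Seq[String] =
    Seq(mallet, "train-topics",
      "--input", instances,
      "--num-topics", numTopics.toString,
      "--output-topic-keys", s"${outPrefix}_keys.txt",
      "--output-doc-topics", s"${outPrefix}_composition.txt")

  // Usage: MalletSweep <data-dir>
  def main(args: Array[String]): Unit = {
    val instances = "corpus.mallet" // hypothetical output file name
    val importExit = importCmd(args(0), instances).!
    require(importExit == 0, "import-dir failed")

    // Hypothetical settings to compare; pick values that suit your corpus size.
    for (k <- Seq(10, 20, 50)) {
      val trainExit = trainCmd(instances, k, s"topics_$k").!
      require(trainExit == 0, s"train-topics failed for $k topics")
    }
  }
}
```

After a run, the `topics_10_keys.txt`, `topics_20_keys.txt`, and `topics_50_keys.txt` files can be compared side by side to see how the topics split or merge as the count changes.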
Create a project that uses Mallet as a dependency and compute topics using Mallet as an API rather than from the command line. This means that you can have a topic model as a first-class object in your application, rather than obtaining it indirectly by running Mallet on the command line and then reading its output back in.
This is not hard, but it requires you to do some of the sleuthing that is necessary for working with code like this in the real world.
Also, you can just wimp out and look at Mallet's page on Topic Modeling for Java Developers and convert the code given there to Scala.
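For orientation, a Scala conversion along those lines might look roughly like the sketch below. It is not the only way to do it, and several details are assumptions: the Maven coordinates and version in the first comment, the `TopicsViaApi` name, the hyperparameter values, the iteration count, and the use of `FileIterator` to read a directory with one document per file (the Java example on Mallet's page reads a single file instead). Verify each call against the Mallet version you actually depend on.

```scala
// Assumed sbt dependency: libraryDependencies += "cc.mallet" % "mallet" % "2.0.8"
import java.io.File
import java.util.{ArrayList => JArrayList}
import java.util.regex.Pattern

import cc.mallet.pipe._
import cc.mallet.pipe.iterator.FileIterator
import cc.mallet.topics.ParallelTopicModel
import cc.mallet.types.InstanceList

object TopicsViaApi {

  /** Reads every file under dataDir as one document and trains a topic model. */
  def train(dataDir: File, numTopics: Int, iterations: Int): ParallelTopicModel = {
    // Pipeline: file -> text -> lowercase -> tokens -> no stopwords -> feature ids
    val pipes = new JArrayList[Pipe]()
    pipes.add(new Input2CharSequence("UTF-8"))
    pipes.add(new CharSequenceLowercase())
    pipes.add(new CharSequence2TokenSequence(
      Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")))
    pipes.add(new TokenSequenceRemoveStopwords(false, false)) // default English list
    pipes.add(new TokenSequence2FeatureSequence())

    val instances = new InstanceList(new SerialPipes(pipes))
    instances.addThruPipe(new FileIterator(dataDir))

    // alphaSum = 1.0 and beta = 0.01 are illustrative starting values.
    val model = new ParallelTopicModel(numTopics, 1.0, 0.01)
    model.addInstances(instances)
    model.setNumThreads(2)
    model.setNumIterations(iterations)
    model.estimate()
    model
  }

  // Usage: TopicsViaApi <data-dir> [num-topics]
  def main(args: Array[String]): Unit = {
    val numTopics = if (args.length > 1) args(1).toInt else 20
    // A short run for experimenting; real runs usually need many more iterations.
    val model = train(new File(args(0)), numTopics, 200)

    // Print the ten highest-weighted words of each topic.
    for ((words, topic) <- model.getTopWords(10).zipWithIndex)
      println(s"$topic\t${words.mkString(" ")}")
  }
}
```

Because `train` returns the `ParallelTopicModel` itself, the rest of your application can query it directly instead of parsing Mallet's output files.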