TopicModel Exercise

For this exercise, we'll compute topics for a data set using Mallet, first as a command-line tool and then as an API. The main points of the exercise are to:

show how easy it is to get topics for a corpus
explore some parameterizations and see their effect on the topics that are found
integrate a tool like Mallet, which has a somewhat poorly documented API, into your code

Step one: find and prepare a data set

You need a data set that is either already in a directory with one text document per file, or that you can process into that form. For example:

You could modify the Federalist papers extraction code so that it outputs one file per article. (The easiest thing is to turn that into a standalone script.)
You can download an existing data set that already comes in that format, like 20 Newsgroups or the Enron data.
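If your data arrives as one large text file, the splitting step can be sketched in a few lines of shell. This is only a rough sketch under stated assumptions: the function name split_docs, the doc001.txt naming scheme, and the default delimiter pattern (a line starting with "FEDERALIST") are all hypothetical, so check how documents are actually delimited in your file and pass a matching pattern.

```shell
# split_docs INPUT OUTDIR [PATTERN]
# Writes one file per document into OUTDIR. A new document starts at each
# line matching PATTERN (hypothetical default: lines starting "FEDERALIST").
split_docs() {
  input=$1
  outdir=$2
  pattern=${3:-^FEDERALIST}
  mkdir -p "$outdir"
  awk -v dir="$outdir" -v pat="$pattern" '
    $0 ~ pat {                        # a delimiter line opens a new document
      if (out != "") close(out)
      n++
      out = sprintf("%s/doc%03d.txt", dir, n)
    }
    out != "" { print > out }         # text before the first delimiter is skipped
  ' "$input"
}
```

For example, `split_docs federalist.txt data/federalist` would leave data/federalist/doc001.txt, doc002.txt, and so on, one per matched delimiter (federalist.txt being a placeholder for your own input file).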

Step two: install Mallet and run it from the command line

Compute topics by following the instructions for command-line usage for computing topic models with Mallet. Explore some of the different parameterizations (especially the number of topics) to see how the topics vary.
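One way to compare parameterizations is to import the corpus once and then train a model for each of several topic counts. The sketch below wraps Mallet's import-dir and train-topics commands in two helper functions. The function names, the output file names, and the assumption that the launcher lives at bin/mallet (that is, that you run this from the Mallet install directory) are illustrative, not something Mallet prescribes.

```shell
# Path to the Mallet launcher script (assumption: current directory is the
# Mallet install directory; override by setting MALLET beforehand).
MALLET=${MALLET:-bin/mallet}

# import_corpus DATA_DIR CORPUS_FILE
# Converts a directory of one-document-per-file text into Mallet's format.
import_corpus() {
  "$MALLET" import-dir --input "$1" --output "$2" \
    --keep-sequence --remove-stopwords
}

# sweep_topics CORPUS_FILE K...
# Trains one topic model per requested number of topics K and writes the
# top words per topic and the per-document topic proportions for each run.
sweep_topics() {
  corpus=$1
  shift
  for k in "$@"; do
    "$MALLET" train-topics --input "$corpus" --num-topics "$k" \
      --output-topic-keys "topic-keys-$k.txt" \
      --output-doc-topics "doc-topics-$k.txt"
  done
}
```

A run such as `import_corpus data/federalist corpus.mallet && sweep_topics corpus.mallet 10 20 50` would then let you compare topic-keys-10.txt, topic-keys-20.txt, and topic-keys-50.txt side by side.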

Step three: use Mallet as an API

Create a project that uses Mallet as a dependency and compute topics using Mallet as an API rather than from the command line. This means that you can have a topic model as a first-class object in your application, rather than having to obtain it indirectly by computing it on the command line, reading in the output, and so on.

This is not hard, but it requires you to do some of the sleuthing that is necessary for working with code like this in the real world.

Also, you can just wimp out and look at Mallet's page on Topic Modeling for Java Developers and convert the code given there to Scala.
