MLscan

Machine Learning Mailing List Scanner

This program attempts to predict whether a message sent to a mailing list will be interesting, based on features extracted from similar messages previously sent to the same list. It does this using a Multinomial Naive Bayes classifier.
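For reference, the textbook decision rule for a Multinomial Naive Bayes classifier is sketched below. The symbols are assumptions introduced here for illustration: $n_w$ is the count of feature $w$ in the message, $N_{wc}$ is the count of $w$ across training messages of class $c$, $N_c = \sum_w N_{wc}$, $|V|$ is the vocabulary size, and $\alpha$ is an additive smoothing constant. The README does not say which features or smoothing the classifier actually uses.

```latex
\hat{c} = \arg\max_{c} \Big[ \log P(c) + \sum_{w \in V} n_w \log P(w \mid c) \Big],
\qquad
P(w \mid c) = \frac{N_{wc} + \alpha}{N_c + \alpha \, |V|}
```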

The inspiration for the project was to automatically flag messages to the ovirt-users mailing list that should be answered by specific groups. For example, if someone sends a storage-related query, someone familiar with storage could be notified of the incoming message automatically. This can reduce the amount of mail people need to scan, help them avoid overlooking messages they should answer, and shorten the time between a message reaching the list and a reply being sent.

Status

This is currently a proof of concept. It can train the classifier and test it against a given set of data, but it lacks the pieces needed for real-world deployment: it cannot look for new messages, notify interested parties, or produce a list of the most interesting new mail.

Program Flow

The program does the following:

1. Parses mailing list archive files from a text mbox format using a /^From / delimiter.
2. Parses individual messages using Apache mime4j.
3. Collates email threads into trees using the Message-Id and In-Reply-To headers.
4. Labels training examples using the "Interesting Senders" heuristic (see below).
5. Trains a Multinomial Naive Bayes classifier from the Datumbox Machine Learning Framework.
6. Tests the classifier against new inputs.
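
The mbox-splitting and thread-collation steps can be illustrated with a minimal, standard-library-only sketch. This is not the project's code: the class `MboxThreadSketch` and its methods are hypothetical names, and the naive `header` helper stands in for the real parsing that the project delegates to Apache mime4j (it ignores folded headers, for instance). It also assumes that body lines beginning with "From " have already been escaped by the archive.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not the classes used by MLscan. */
final class MboxThreadSketch {

    /** Splits mbox text into messages; each line starting with "From " opens a new message. */
    static List<String> splitMbox(String mbox) {
        List<String> messages = new ArrayList<>();
        StringBuilder current = null;
        for (String line : mbox.split("\r?\n")) {
            if (line.startsWith("From ")) {
                if (current != null) messages.add(current.toString());
                current = new StringBuilder();
            } else if (current != null) {
                current.append(line).append('\n');
            }
        }
        if (current != null) messages.add(current.toString());
        return messages;
    }

    /** Naive header lookup (case-insensitive, no folded-header support); returns null if absent. */
    static String header(String message, String name) {
        String prefix = name.toLowerCase() + ":";
        for (String line : message.split("\r?\n")) {
            if (line.isEmpty()) break; // a blank line ends the header block
            if (line.toLowerCase().startsWith(prefix)) {
                return line.substring(prefix.length()).trim();
            }
        }
        return null;
    }

    /**
     * Maps each parent Message-Id to the ids of its direct replies.
     * Thread roots are stored under the empty-string key.
     */
    static Map<String, List<String>> buildThreads(List<String> messages) {
        Map<String, String> parentOf = new LinkedHashMap<>();
        for (String m : messages) {
            String id = header(m, "Message-Id");
            if (id != null) parentOf.put(id, header(m, "In-Reply-To"));
        }
        Map<String, List<String>> children = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : parentOf.entrySet()) {
            String parent = e.getValue();
            // A reply whose parent is missing from the archive starts its own tree.
            String key = (parent != null && parentOf.containsKey(parent)) ? parent : "";
            children.computeIfAbsent(key, k -> new ArrayList<>()).add(e.getKey());
        }
        return children;
    }
}
```

Walking the returned map from the empty-string key downward yields one tree per thread. A map of child lists keeps the sketch short; a dedicated node type would be a natural alternative.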

Interesting Senders Heuristic

The class labels for training data are derived using a simple heuristic: we assume that a given set of email addresses corresponds to a group of people with some specific knowledge. If one of these people replies to a thread, the thread likely concerns subject matter relevant to that group; thus, we can classify a thread based on the set of people who replied to it.
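
As a rough sketch of this heuristic, the snippet below labels a thread with every group that has at least one member among the thread's reply senders. The class name `InterestingSendersSketch`, the method `labelThread`, and the choice to allow several labels per thread are assumptions made for illustration; the project's actual labeling code may differ.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration only; not the classes used by MLscan. */
final class InterestingSendersSketch {

    /**
     * Returns the label of every group with a member among the reply senders.
     *
     * @param groups       label -> lower-case email addresses of that group's members
     * @param replySenders sender addresses of the replies in one thread
     */
    static Set<String> labelThread(Map<String, Set<String>> groups, List<String> replySenders) {
        Set<String> labels = new TreeSet<>();
        for (Map.Entry<String, Set<String>> group : groups.entrySet()) {
            for (String sender : replySenders) {
                // Normalize case and whitespace before comparing addresses.
                if (group.getValue().contains(sender.trim().toLowerCase())) {
                    labels.add(group.getKey());
                    break;
                }
            }
        }
        return labels;
    }
}
```

A thread with no replies from any listed address gets an empty label set, which a caller could treat as the "not interesting" class.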

Usage

First, download some mailing list archives (for example, from Mailman) and decompress them.

Next, divide them such that some messages are part of a training set and some are test data. Provide the appropriate filenames to the respective mlscan options.

See the mlscan.sh file for example usage.
