This package currently holds the machine learning code.
Unifed currently utilises two forms of machine learning:
- A spam detection filter.
- A text toxicity classifier.
The spam detection models are created and trained in this package, whereas the text toxicity classifier utilises the pre-trained `@tensorflow-models/toxicity` model.
The majority of the code in this package is for training a spam detection model:
- Training data located in the `data` directory is converted into a common form, using the parsers located in `src/parsers`.
- The training data is then tokenized, using `src/tokenizer.ts`.
- A tensor is created using this data with the code in `src/tensor.ts`.
- The models used in `src/models` are trained with the data.
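The first three steps above can be sketched as follows. This is a minimal illustration, not the code in `src/parsers`, `src/tokenizer.ts` or `src/tensor.ts`: the `Message` shape, the tab-separated input format and every function name here are assumptions made for the example.

```typescript
// Hypothetical common form that every parser would emit.
interface Message {
  text: string;
  spam: boolean;
}

// Hypothetical parser for a dataset with one "label<TAB>text" pair per line.
function parseTsv(raw: string): Message[] {
  return raw
    .split("\n")
    .filter((line) => line.includes("\t"))
    .map((line) => {
      const tab = line.indexOf("\t");
      return {
        spam: line.slice(0, tab).trim().toLowerCase() === "spam",
        text: line.slice(tab + 1).trim(),
      };
    });
}

// Lowercase the text and split it into word tokens.
function tokenize(text: string): string[] {
  const tokens = text.toLowerCase().match(/[a-z0-9']+/g);
  return tokens === null ? [] : Array.from(tokens);
}

// Give each distinct token an integer id.
// Id 0 is reserved for padding and id 1 for tokens not in the vocabulary.
function buildVocabulary(messages: Message[]): Map<string, number> {
  const vocabulary = new Map<string, number>();
  for (const message of messages) {
    for (const token of tokenize(message.text)) {
      if (!vocabulary.has(token)) {
        vocabulary.set(token, vocabulary.size + 2);
      }
    }
  }
  return vocabulary;
}

// Encode one message as a fixed-length row of token ids,
// truncating long messages and right-padding short ones with 0.
function encode(
  text: string,
  vocabulary: Map<string, number>,
  length: number
): number[] {
  const ids = tokenize(text)
    .slice(0, length)
    .map((token) => {
      const id = vocabulary.get(token);
      return id === undefined ? 1 : id;
    });
  while (ids.length < length) {
    ids.push(0);
  }
  return ids;
}
```

Rows produced by `encode` all have the same length, so a batch of them could be handed to a tensor constructor such as `tf.tensor2d` in TensorFlow.js. Whether the package pads and encodes in exactly this way is not stated here.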
`src/train.ts` provides a command line utility for training the models, whereas `src/test-model.ts` provides a command line utility for assessing the performance of models.
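As an illustration of what assessing a binary spam classifier usually involves, the sketch below computes accuracy, precision, recall and F1 from predicted and actual labels. It is an assumption about the kind of metrics that are useful here, not a description of what `src/test-model.ts` reports. The `Metrics` interface and the `evaluate` function are hypothetical names.

```typescript
// Hypothetical summary of a classifier's performance on a labelled test set.
interface Metrics {
  accuracy: number;
  precision: number;
  recall: number;
  f1: number;
}

// Compare predicted labels with actual labels, where true means "spam".
function evaluate(predicted: boolean[], actual: boolean[]): Metrics {
  if (predicted.length !== actual.length) {
    throw new Error("predicted and actual must have the same length");
  }

  let truePositives = 0;
  let falsePositives = 0;
  let trueNegatives = 0;
  let falseNegatives = 0;

  for (let i = 0; i < predicted.length; i++) {
    if (predicted[i] && actual[i]) truePositives++;
    else if (predicted[i] && !actual[i]) falsePositives++;
    else if (!predicted[i] && !actual[i]) trueNegatives++;
    else falseNegatives++;
  }

  // Report 0 instead of NaN when a denominator is zero.
  const ratio = (numerator: number, denominator: number): number =>
    denominator === 0 ? 0 : numerator / denominator;

  const precision = ratio(truePositives, truePositives + falsePositives);
  const recall = ratio(truePositives, truePositives + falseNegatives);

  return {
    accuracy: ratio(truePositives + trueNegatives, predicted.length),
    precision,
    recall,
    f1: ratio(2 * precision * recall, precision + recall),
  };
}
```

Precision and recall are worth reporting alongside accuracy because spam datasets are often imbalanced, and a filter that wrongly flags legitimate messages has a different cost from one that lets spam through.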
An API to utilise the models is exposed in `src/index.ts`, which can be used by other packages.
Training data is located in the `data` directory. The sources for the training data are as follows:
The models used have been taken from the following sources:
- `dense` (trained) - Source
- `dense-pooling` (trained) - Source
- `twilio-dense` (trained) - Source
- `lstm` (not trained) - Source
- `bi-directional-lstm` (not trained) - Source
Some models have not been trained, as we did not have the computing resources to do so in a reasonable amount of time. Training and evaluating these would be an interesting project extension.
The `models` directory contains the trained models, along with all of their configuration information. Because these models take time to train, they are checked into the repository.
The `meta` directory contains statistics about the training data, used in the report. This directory is not committed, as it contains hundreds of thousands of lines.
A detailed report outlining the development and evaluation of the spam detection filter is available in both the 3rd and 4th deliverables.
The text toxicity classifier utilises the pre-trained `@tensorflow-models/toxicity` model.
This package provides a simple API around the model in order to classify single pieces of text.
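A wrapper of this kind might look like the sketch below. It is not this package's actual API: `classifyText`, `matchedLabels` and the two interfaces are hypothetical names. The interfaces mirror the shape that `@tensorflow-models/toxicity` documents for its `classify` results, and the model is passed in as a parameter so the example has no external dependencies.

```typescript
// Shape of one entry returned by the toxicity model's classify method:
// one entry per label, with one result per input sentence.
interface ToxicityPrediction {
  label: string;
  results: Array<{ probabilities: Float32Array; match: boolean | null }>;
}

// The part of the loaded toxicity model that this sketch relies on.
interface ToxicityModel {
  classify(inputs: string[]): Promise<ToxicityPrediction[]>;
}

// Collect the labels the model matched for the first (and only) input.
function matchedLabels(predictions: ToxicityPrediction[]): string[] {
  return predictions
    .filter(
      (prediction) =>
        prediction.results.length > 0 && prediction.results[0].match === true
    )
    .map((prediction) => prediction.label);
}

// Classify a single piece of text and return the labels it matched.
async function classifyText(
  model: ToxicityModel,
  text: string
): Promise<string[]> {
  const predictions = await model.classify([text]);
  return matchedLabels(predictions);
}
```

In practice the `model` argument would come from the library's `load` function, for example `await toxicity.load(0.9)` after `import * as toxicity from "@tensorflow-models/toxicity"`, where `0.9` is an illustrative confidence threshold rather than a value taken from this package.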