STREAM: Simplified Topic Retrieval, Exploration, and Analysis Module

- Topic Modeling Made Easy in Python -

We present STREAM, a Simplified Topic Retrieval, Exploration, and Analysis Module for User-Friendly and Interactive Topic Modeling and Visualization. Our paper can be found here.

🏃 Quick Start

Get started with STREAM in just a few lines of code:

from stream_topic.models import KmeansTM
from stream_topic.utils import TMDataset

dataset = TMDataset()
dataset.fetch_dataset("BBC_News")
dataset.preprocess(model_type="KmeansTM")

model = KmeansTM()
model.fit(dataset, n_topics=20)

topics = model.get_topics()
print(topics)

🚀 Installation

You can install STREAM directly from PyPI or from the GitHub repository:

PyPI (Recommended):
```
pip install stream_topic
```

GitHub:

pip install git+https://github.com/AnFreTh/STREAM.git

Download NLTK Resources: Ensure you have the necessary NLTK resources installed:

import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

📦 Available Models

STREAM offers a variety of neural and non-neural topic models, and we are continually adding new ones. If you wish to incorporate your own model, or want another model incorporated, please raise an issue with the required information. Currently, the following models are implemented:

| Name | Implementation |
| --- | --- |
| LDA | Latent Dirichlet Allocation |
| NMF | Non-negative Matrix Factorization |
| WordCluTM | Tired of topic models? |
| CEDC | Topics in the Haystack |
| DCTE | Human in the Loop |
| KmeansTM | Simple Kmeans followed by c-tfidf |
| SomTM | Self organizing map followed by c-tfidf |
| CBC | Coherence based document clustering |
| TNTM | Transformer-Representation Neural Topic Model |
| ETM | Topic modeling in embedding spaces |
| CTM | Combined Topic Model |
| CTMNeg | Contextualized Topic Models with Negative Sampling |
| ProdLDA | Autoencoding Variational Inference For Topic Models |
| NeuralLDA | Autoencoding Variational Inference For Topic Models |
| NSTM | Neural Topic Model via Optimal Transport |
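
Several of the models above pair a clustering step with c-tfidf to pick topic words. As a rough illustration, the sketch below scores words per cluster with one common class-based TF-IDF formulation (term count times `log(1 + A / f_t)`, where `A` is the average token count per cluster). The function names, the toy clusters, and the exact weighting are assumptions made for this example; STREAM's own implementation may differ.

```python
import math
from collections import Counter


def c_tf_idf(clusters):
    """Score words per cluster with a class-based TF-IDF.

    clusters: dict mapping cluster id -> list of tokenized documents.
    Returns a dict mapping cluster id -> {word: score}.
    """
    # Term counts per cluster (all documents of a cluster are merged).
    tf = {
        cid: Counter(tok for doc in docs for tok in doc)
        for cid, docs in clusters.items()
    }
    # Total count of each word across all clusters.
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    # Average number of tokens per cluster.
    avg_tokens = sum(total.values()) / len(tf)
    return {
        cid: {w: c * math.log(1 + avg_tokens / total[w]) for w, c in counts.items()}
        for cid, counts in tf.items()
    }


def top_words(scores, k=3):
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:k]]


# Toy clusters, invented for illustration.
clusters = {
    0: [["goal", "match", "team"], ["team", "goal", "league"]],
    1: [["stock", "market", "team"], ["market", "stock", "price"]],
}
scores = c_tf_idf(clusters)
print(top_words(scores[0], k=2))
print(top_words(scores[1], k=2))
```

Words that appear in several clusters (here `team`) are down-weighted relative to words concentrated in a single cluster.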

📊 Available Metrics

Since evaluating topic models, especially automatically, is difficult, STREAM implements numerous evaluation metrics. In particular, the intruder-based metrics, while they may take some time to compute, have shown strong correlation with human evaluation.

| Name | Description |
| --- | --- |
| ISIM | Average cosine similarity of top words of a topic to an intruder word. |
| INT | For a given topic and a given intruder word, Intruder Accuracy is the fraction of top words to which the intruder has the least similar embedding among all top words. |
| ISH | Calculates the shift in the centroid of a topic when an intruder word is replaced. |
| Expressivity | Cosine distance of topics to the meaningless (stopword) embedding centroid. |
| Embedding Topic Diversity | Topic diversity in the embedding space. |
| Embedding Coherence | Cosine similarity between the centroid of the embeddings of the stopwords and the centroid of the topic. |
| NPMI | Classical NPMI coherence computed on the source corpus. |
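
To make the ISIM and INT descriptions concrete, here is a minimal sketch that follows the wording in the table. It uses hand-made 2-d embeddings and hypothetical function names (`intruder_similarity`, `intruder_accuracy`); it is not STREAM's implementation, which works on real word embeddings and samples intruder words itself.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def intruder_similarity(top_words, intruder, emb):
    """Average cosine similarity between each top word and the intruder."""
    sims = [cosine(emb[w], emb[intruder]) for w in top_words]
    return sum(sims) / len(sims)


def intruder_accuracy(top_words, intruder, emb):
    """Fraction of top words for which the intruder is the least similar word."""
    hits = 0
    for w in top_words:
        # Compare w against every other top word plus the intruder.
        others = [o for o in top_words if o != w] + [intruder]
        least_similar = min(others, key=lambda o: cosine(emb[w], emb[o]))
        hits += least_similar == intruder
    return hits / len(top_words)


# Toy 2-d embeddings, made up for illustration only.
emb = {
    "goal": [1.0, 0.1],
    "team": [0.9, 0.2],
    "match": [0.8, 0.3],
    "stock": [0.1, 1.0],
}
topic = ["goal", "team", "match"]
isim = intruder_similarity(topic, "stock", emb)
int_acc = intruder_accuracy(topic, "stock", emb)
print(round(isim, 3), int_acc)
```

In this toy setup the intruder is far from every top word, so the intruder similarity is low and the intruder accuracy is at its maximum.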

🗂️ Available Datasets

To integrate custom datasets for modeling with STREAM, please follow the example notebook in the examples folder. For benchmarking new models, STREAM already includes the following datasets:

| Name | # Docs | # Words | # Features | Description |
| --- | --- | --- | --- | --- |
| Spotify_most_popular | 5,860 | 18,193 | 17 | Spotify dataset comprised of popular song lyrics and various tabular features. |
| Spotify_least_popular | 5,124 | 20,168 | 14 | Spotify dataset comprised of less popular song lyrics and various tabular features. |
| Spotify | 11,012 | 25,835 | 14 | General Spotify dataset with song lyrics and various tabular features. |
| Reddit_GME | 21,559 | 11,724 | 6 | Reddit dataset filtered for "Gamestop" (GME) from the Subreddit "r/wallstreetbets". |
| Stocktwits_GME | 300,000 | 14,707 | 3 | Stocktwits dataset filtered for "Gamestop" (GME), covering the GME short squeeze of 2021. |
| Stocktwits_GME_large | 600,000 | 94,925 | 0 | Larger Stocktwits dataset filtered for "Gamestop" (GME), covering the GME short squeeze of 2021. |
| Reuters | 10,788 | 19,696 | - | Preprocessed Reuters dataset. |
| Poliblogs | 13,246 | 47,106 | 2 | Preprocessed Poliblogs dataset suitable for STMs. |
| 20NewsGroups | 18,846 | 70,461 | - | Preprocessed 20NewsGroups dataset. |
| BBC_News | 2,225 | 19,116 | - | Preprocessed BBC News dataset. |

If you wish to include and publish one of your datasets directly in the package, feel free to contact us.

🔧 Usage

To use one of the available models, follow the simple steps below:

Import the necessary modules:

from stream_topic.models import KmeansTM
from stream_topic.utils import TMDataset

🛠️ Preprocessing

Get your dataset and preprocess for your model:

dataset = TMDataset()
dataset.fetch_dataset("20NewsGroup")
dataset.preprocess(model_type="KmeansTM")

The model_type argument is optional, and further arguments can be specified. Default steps are predefined for all included models. Steps such as stopword removal and lemmatization are performed automatically for models such as LDA.
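
As a rough illustration of what a step such as stopword removal involves, the sketch below lowercases a text, tokenizes it, and drops stopwords and very short tokens. It does not use STREAM's preprocessing code: the function name, the tiny stopword list, and the `min_len` parameter are all made up for this example.

```python
import re

# Tiny illustrative stopword list (a real pipeline would use e.g. NLTK's list).
STOPWORDS = {"the", "a", "is", "of", "and", "to", "in"}


def simple_preprocess(text, stopwords=STOPWORDS, min_len=2):
    """Lowercase, keep alphabetic tokens, drop stopwords and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stopwords and len(t) >= min_len]


tokens = simple_preprocess("The price of the stock is rising in 2021!")
print(tokens)
```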

🚀 Model fitting

Fitting a model from STREAM follows a simple, sklearn-like logic and every model can be fit identically.

Choose the model you want to use and train it:

model = KmeansTM()
model.fit(dataset, n_topics=20)

Depending on the model, check the documentation for hyperparameter settings. To get the topics, simply run:
```
topics = model.get_topics()
```

✅ Evaluation

stream-topic implements various evaluation metrics, mostly focused around the intruder word task. The implemented metrics achieve high correlations with human evaluation. See here for the detailed description of the metrics.

To evaluate your model, simply use one of the metrics:

from stream_topic.metrics import ISIM, INT, ISH, Expressivity, NPMI

metric = ISIM()
metric.score(topics)

Scores for each topic are available via:

metric.score_per_topic(topics)
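
To show how a per-topic score relates to an overall score, here is a minimal sketch of NPMI coherence based on document co-occurrence, with the overall score taken as the mean over topics. The function name, the toy corpus, and the handling of the two degenerate cases are assumptions for this example; STREAM's `NPMI` metric may make different choices.

```python
import math
from itertools import combinations


def npmi_topic(top_words, docs):
    """Mean NPMI over all pairs of top words, using document co-occurrence."""
    doc_sets = [set(d) for d in docs]
    n = len(doc_sets)

    def p(*words):
        # Fraction of documents that contain all given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n

    scores = []
    for a, b in combinations(top_words, 2):
        p_ab = p(a, b)
        if p_ab == 0:
            scores.append(-1.0)  # convention: pairs that never co-occur
        elif p_ab == 1:
            scores.append(0.0)  # convention: pair occurs in every document
        else:
            pmi = math.log(p_ab / (p(a) * p(b)))
            scores.append(pmi / -math.log(p_ab))
    return sum(scores) / len(scores)


# Toy corpus, invented for illustration.
docs = [
    ["goal", "team"],
    ["goal", "team"],
    ["stock", "market"],
    ["stock", "market"],
]
topics = [["goal", "team"], ["goal", "stock"]]
per_topic = [npmi_topic(t, docs) for t in topics]
overall = sum(per_topic) / len(per_topic)
print(per_topic, overall)
```

The first topic pairs words that always co-occur, the second pairs words that never do, so the two per-topic scores sit at opposite ends of the NPMI range.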

To use one of the metrics available in OCTIS, simply create a model output that fits the OCTIS framework:

from octis.evaluation_metrics.diversity_metrics import TopicDiversity

model_output = {"topics": model.get_topics(), "topic-word-matrix": model.get_beta(), "topic-document-matrix": model.get_theta()}

metric = TopicDiversity(topk=10) # Initialize metric
topic_diversity_score = metric.score(model_output)
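
For intuition about what a diversity score of this kind measures, the sketch below computes the fraction of unique words among the top-k words of all topics. It is a plain-Python illustration, not the OCTIS implementation, and it assumes that topics are given as a list of word lists.

```python
def topic_diversity(topics, topk=10):
    """Fraction of unique words among the top-k words of all topics."""
    top = [t[:topk] for t in topics]
    unique = {w for t in top for w in t}
    return len(unique) / sum(len(t) for t in top)


# Toy topics, invented for illustration; "team" appears in both.
toy_topics = [["goal", "team", "match"], ["stock", "market", "team"]]
diversity = topic_diversity(toy_topics, topk=3)
print(round(diversity, 3))
```

A value of 1 means no word is shared between topics; repeated words pull the score down.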

Similarly, to use one of STREAM's metrics for any model, use the topics and, where required, the $\beta$ (topic-word matrix) of the model to calculate the score.

🔍 Hyperparameter optimization

If you want to optimize the hyperparameters, simply run:

model.optimize_and_fit(
    dataset,
    min_topics=2,
    max_topics=20,
    criterion="aic",
    n_trials=20,
)
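
The idea behind selecting the number of topics with an information criterion can be sketched as follows. The example computes the AIC (`2k - 2 ln L`) for a few candidate topic counts and keeps the smallest. The log-likelihood values, the `embedding_dim` parameter count, and the helper name are invented for illustration; how `optimize_and_fit` evaluates the criterion internally is not described here.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical (n_topics -> log-likelihood) results of fitted candidates.
# These numbers are made up purely for illustration.
candidates = {2: -1250.0, 5: -1100.0, 10: -1060.0, 20: -1050.0}
embedding_dim = 4  # assumed parameter count per topic

aic_scores = {k: aic(ll, k * embedding_dim) for k, ll in candidates.items()}
best_n_topics = min(aic_scores, key=aic_scores.get)
print(best_n_topics, aic_scores[best_n_topics])
```

More topics improve the likelihood but add parameters, so the criterion settles on an intermediate value rather than the largest candidate.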

You can also optimize with respect to any evaluation metric from stream_topic.

🖼️ Visualization

Visualize the results:

from stream_topic.visuals import visualize_topic_model, visualize_topics

visualize_topic_model(
    model,
    reduce_first=True,
    port=8051,
)

📈 Downstream Tasks

The general formulation of a Neural Additive Model (NAM) can be summarized by the equation:

$$ E(y) = h\left(\beta + \sum_{j=1}^{J} f_j(x_j)\right), $$

where $h(\cdot)$ denotes the activation function in the output layer, such as a linear activation for regression tasks or softmax for classification tasks. $x \in \mathbb{R}^J$ represents the input features, and $\beta$ is the intercept. The function $f_j : \mathbb{R} \to \mathbb{R}$ corresponds to the Multi-Layer Perceptron (MLP) for the $j$-th feature.

Let's consider $x$ as a combination of categorical and numerical features $x_{tab}$ and document features $x_{doc}$. After applying a topic model, STREAM extracts topical prevalences from documents, effectively transforming the input into $z \equiv (x_{tab}, x_{top})$, where $x_{top}$ holds each document's probability vector over topics. Here, $x_{j(tab)}^{(i)}$ indicates the $j$-th tabular feature of the $i$-th observation, and $x_{k(top)}^{(i)}$ represents the $i$-th document's topical prevalence for topic $k$.

For preserving interpretability, the downstream model is defined as:

$$ E(y) = h\left(\beta + \sum_{j=1}^{J} f_j(x_{j(tab)}) + \sum_{k=1}^{K} f_k(x_{k(top)})\right). $$

In this setup, visualizing the shape function $f_k$ reveals the impact of topic $k$ on the target variable $y$. For example, in the context of the Spotify dataset, this could illustrate how a topic influences a song's popularity.
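
The additive structure described above can be sketched in a few lines, with $h$ applied as the output activation as in the first equation. Simple hand-written functions stand in for the per-feature MLPs, and all names and numbers are hypothetical. The point is only that each feature and each topic contributes a separate, inspectable term.

```python
def additive_prediction(x_tab, x_top, f_tab, f_top, beta=0.0, h=lambda s: s):
    """Compute h(beta + sum_j f_j(x_tab[j]) + sum_k f_k(x_top[k])).

    Returns the prediction and the list of individual contributions.
    """
    contributions = [f(x) for f, x in zip(f_tab, x_tab)]
    contributions += [f(x) for f, x in zip(f_top, x_top)]
    return h(beta + sum(contributions)), contributions


# Stand-ins for trained shape functions (in a NAM these are small MLPs).
f_tab = [lambda x: 0.5 * x]                    # one tabular feature
f_top = [lambda p: 2.0 * p, lambda p: -1.0 * p]  # two topics

x_tab = [4.0]         # tabular feature value of one observation
x_top = [0.75, 0.25]  # topical prevalences of its document (sum to 1)

pred, contributions = additive_prediction(x_tab, x_top, f_tab, f_top, beta=1.0)
print(pred, contributions)
```

Because the prediction is a sum of per-input terms, plotting one function such as `f_top[0]` over its input range shows the effect of that topic in isolation.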

Fitting a downstream model with a pre-trained topic model is straightforward using the PyTorch Lightning Trainer class. Subsequently, all shape functions can be visualized similarly to the approach described by Agarwal et al. (2021).

from lightning import Trainer
from stream_topic.NAM import DownstreamModel

# Instantiate the DownstreamModel
downstream_model = DownstreamModel(
    trained_topic_model=topic_model,  # an already fitted STREAM topic model
    target_column='target',  # Target variable
    task='regression',  # or 'classification'
    dataset=dataset,  
    batch_size=128,
    lr=0.0005
)

# Use PyTorch Lightning's Trainer to train and validate the model
trainer = Trainer(max_epochs=10)
trainer.fit(downstream_model)

# Plotting
from stream_topic.visuals import plot_downstream_model
plot_downstream_model(downstream_model)

🤝 Contributing and Testing New Models

We welcome contributions! Before you start, please:

- Check Existing Issues: Look for existing issues or discussions that may cover your idea.
- Fork and Clone: Fork the repository and clone it to your local machine.
- Create a Branch: Work on a new branch to keep your changes organized.
- Develop and Test: Develop your model and validate it using our provided testing script.
- Submit a Pull Request: Once ready, submit a PR with a clear description of your changes.

For detailed guidelines on how to structure your contributions, see the instructions provided below.

Steps for Contributing

Fork the Repository:
- Fork the repository to your GitHub account.
- Clone the forked repository to your local machine.
```
git clone https://github.com/your-username/your-repository.git
cd your-repository
```
Create a New Branch:
- Ensure you are on the develop branch and create a new branch for your model development.
```
git checkout develop
git checkout -b new-model-branch
```
Develop Your Model:
- Navigate to the mypackage/models/ directory.
- Create your model class file, ensuring it follows the expected structure and naming conventions.
- Implement the required methods (get_info, fit, predict) and attributes (topic_dict). Optionally, implement beta, theta, or corresponding methods (get_beta, get_theta).

Example Model Structure

Here is an example of how your model class should be structured:

import numpy as np
from mypackage.models.abstract_helper_models.base import BaseModel, TrainingStatus

class ExampleModel(BaseModel):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._status = TrainingStatus.NOT_STARTED

    def get_info(self):
        return {"model_name": "ExampleModel", "trained": False}

    def any_other_processing_functions(self):
        pass

    def fit(self, dataset, n_topics=3):
        # do what you do during fitting the models
        self._status = TrainingStatus.INITIALIZED
        self._status = TrainingStatus.RUNNING
        self._status = TrainingStatus.SUCCEEDED

    def predict(self, texts):
        return [0] * len(texts)

    # If self.beta or self.theta are not assigned during fitting, please include these two methods
    def get_beta(self):
        return self.beta

    def get_theta(self):
        return self.theta

Testing Your Model

Install Dependencies:
- Ensure all dependencies are installed.
```
pip install -r requirements.txt
```
Validate Your Model:
- To validate your model, use tests/validate_new_model.py to include your new model class.
```
from tests.model_validation import validate_model

validate_model(NewModel)
```

If this validation fails, it will tell you which check did not pass.

Validation Criteria

The following checks are performed during validation:

- Presence of required methods (get_info, fit, predict).
- Presence of required attributes (topic_dict).
- Either presence of optional attributes (beta, theta) or corresponding methods (get_beta, get_theta).
- Correct shape and sum of theta.
- Proper status transitions during model fitting.
- get_info method returns a dictionary with model_name and trained keys.

Refer to the tests/model_validation.py script for detailed validation logic.
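
A few of the checks listed above can be sketched in plain Python. This is not the logic of the validation script: `check_model_interface` is a hypothetical helper, and it assumes that theta is a documents-by-topics matrix whose rows each sum to 1, which the list above does not spell out.

```python
def check_model_interface(model, n_docs, n_topics, tol=1e-6):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    for name in ("get_info", "fit", "predict"):
        if not callable(getattr(model, name, None)):
            problems.append(f"missing method: {name}")
    if not hasattr(model, "topic_dict"):
        problems.append("missing attribute: topic_dict")

    # Accept either a theta attribute or a get_theta method.
    theta = getattr(model, "theta", None)
    if theta is None and callable(getattr(model, "get_theta", None)):
        theta = model.get_theta()
    if theta is None:
        problems.append("no theta attribute or get_theta method")
    elif len(theta) != n_docs or any(len(row) != n_topics for row in theta):
        problems.append("theta has the wrong shape")
    elif any(abs(sum(row) - 1.0) > tol for row in theta):
        # Assumption: each row is a probability vector over topics.
        problems.append("rows of theta do not sum to 1")
    return problems


class ToyModel:
    topic_dict = {0: ["goal", "team"], 1: ["stock", "market"]}
    theta = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]

    def get_info(self):
        return {"model_name": "ToyModel", "trained": True}

    def fit(self, dataset, n_topics=2):
        pass

    def predict(self, texts):
        return [0] * len(texts)


ok_report = check_model_interface(ToyModel(), n_docs=3, n_topics=2)
bad_report = check_model_interface(object(), n_docs=3, n_topics=2)
print(ok_report)
print(bad_report)
```

Collecting all problems in a list, instead of raising on the first one, lets a contributor see every failing check in a single run.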

Submitting Your Contribution

Commit Your Changes:

Commit your changes to your branch.

git add .
git commit -m "Add new model: YourModelName"

Push to GitHub:
- Push your branch to your GitHub repository.
```
git push origin new-model-branch
```
Create a Pull Request:
- Go to the original repository on GitHub.
- Create a pull request from your forked repository and branch.
- Provide a clear description of your changes and request a review.

We appreciate your contributions and strive to make the integration process as smooth as possible. If you encounter any issues or have questions, feel free to open an issue on GitHub. Happy coding!

If you want to include a new model for which these guidelines are not appropriate, please note this in your review request.

📜 Citation

If you use this project in your research, please consider citing:

STREAM

@inproceedings{thielmann-etal-2024-stream,
    title = {STREAM: Simplified Topic Retrieval, Exploration, and Analysis Module},
    author = {Thielmann, Anton  and Reuter, Arik  and Weisser, Christoph  and Kant, Gillian  and Kumar, Manish  and S{\"a}fken, Benjamin},
    booktitle = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)},
    year = {2024},
    publisher = {Association for Computational Linguistics},
    pages = {435--444},
}

Metrics and CEDC

@article{thielmann2024topics,
  title={Topics in the haystack: Enhancing topic quality through corpus expansion},
  author={Thielmann, Anton and Reuter, Arik and Seifert, Quentin and Bergherr, Elisabeth and S{\"a}fken, Benjamin},
  journal={Computational Linguistics},
  pages={1--37},
  year={2024},
  publisher={MIT Press One Broadway, 12th Floor, Cambridge, Massachusetts 02142, USA~…}
}

TNTM

@article{reuter2024probabilistic,
  title={Probabilistic Topic Modelling with Transformer Representations},
  author={Reuter, Arik and Thielmann, Anton and Weisser, Christoph and S{\"a}fken, Benjamin and Kneib, Thomas},
  journal={arXiv preprint arXiv:2403.03737},
  year={2024}
}

DCTE

@inproceedings{thielmann2024human,
  title={Human in the Loop: How to Effectively Create Coherent Topics by Manually Labeling Only a Few Documents per Class},
  author={Thielmann, Anton F and Weisser, Christoph and S{\"a}fken, Benjamin},
  booktitle={Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
  pages={8395--8405},
  year={2024}
}

CBC

@inproceedings{thielmann2023coherence,
  title={Coherence based document clustering},
  author={Thielmann, Anton and Weisser, Christoph and Kneib, Thomas and S{\"a}fken, Benjamin},
  booktitle={2023 IEEE 17th International Conference on Semantic Computing (ICSC)},
  pages={9--16},
  year={2023},
  organization={IEEE}
}

If you use one of the Reddit or GME datasets, consider citing:

@article{kant2024one,
  title={One-way ticket to the moon? An NLP-based insight on the phenomenon of small-scale neo-broker trading},
  author={Kant, Gillian and Zhelyazkov, Ivan and Thielmann, Anton and Weisser, Christoph and Schlee, Michael and Ehrling, Christoph and S{\"a}fken, Benjamin and Kneib, Thomas},
  journal={Social Network Analysis and Mining},
  volume={14},
  number={1},
  pages={121},
  year={2024},
  publisher={Springer}
}
