Topic modelling using RAPIDS and BERT #41

Merged Dec 21, 2021 · 53 commits

Commits
efb7224
added topic modelling code
mayankanand007 Nov 18, 2021
eecc25e
squashing bugs
mayankanand007 Nov 18, 2021
5761738
remove prints
mayankanand007 Nov 18, 2021
43b7522
fix prints
mayankanand007 Nov 18, 2021
afed317
updated readme with relevant information
mayankanand007 Nov 19, 2021
33a81f9
Update README.md
mayankanand007 Nov 19, 2021
f753078
Update README.md
mayankanand007 Nov 19, 2021
6ee8924
added topic modelling example link
mayankanand007 Nov 19, 2021
2f7fe75
fix bugs
mayankanand007 Nov 22, 2021
2f2727d
Merge branch 'topic_modelling' of https://github.com/mayankanand007/r…
mayankanand007 Nov 22, 2021
678da80
addressing reviews part 1
mayankanand007 Nov 22, 2021
a3466f5
updated readme with installation steps
mayankanand007 Nov 22, 2021
747b8cf
clean up commented code
mayankanand007 Nov 22, 2021
aecd7de
updated example notebook
mayankanand007 Nov 22, 2021
158f990
fix issues with matrix manipulation
mayankanand007 Nov 23, 2021
ad652e9
integrated subwordtokenizer and code refactoring
mayankanand007 Nov 25, 2021
e9c090c
style fixes
mayankanand007 Nov 26, 2021
ea8455b
fixed embedding extraction base case
mayankanand007 Nov 26, 2021
ef9668a
fixed style, added docstrings
mayankanand007 Nov 26, 2021
217fe06
dlpack optimizations
mayankanand007 Nov 26, 2021
e3e15e7
addressing reviews part 1
mayankanand007 Nov 30, 2021
fdda302
fixed style issues, mmr work in progress
mayankanand007 Nov 30, 2021
a41d2ba
added pytest lazy fixture
mayankanand007 Nov 30, 2021
e7e4baa
updated readme with instructions
mayankanand007 Nov 30, 2021
d63f8d4
added dev installation instructions
mayankanand007 Nov 30, 2021
2c2beda
updated diagram
mayankanand007 Nov 30, 2021
0d35cf0
updated example notebook
mayankanand007 Nov 30, 2021
794ddad
fix critical bug in MMR
mayankanand007 Nov 30, 2021
53b0527
moved fix_padding into a pytest
mayankanand007 Nov 30, 2021
a313557
clean up commented code
mayankanand007 Nov 30, 2021
104f52f
fixed MMR, added environment.yml file
mayankanand007 Dec 1, 2021
f8b8eb4
replaced pandas conversion part 1
mayankanand007 Dec 1, 2021
8cf03bb
removed topic reduction code
mayankanand007 Dec 2, 2021
55396e9
addressed some code reviews
mayankanand007 Dec 8, 2021
b4b71a5
addressed reviews part 2
mayankanand007 Dec 8, 2021
d7a6700
moved vocabulary files
mayankanand007 Dec 8, 2021
2229aa9
fixed spacing issue in README
mayankanand007 Dec 8, 2021
c5054f2
updated diagram with better clarity
mayankanand007 Dec 9, 2021
6b723b6
updated directory tree
mayankanand007 Dec 9, 2021
10937c0
created conda-env files for dev, prod
mayankanand007 Dec 9, 2021
7428484
added clarification regarding CUDA version
mayankanand007 Dec 9, 2021
b6e21d5
fixed spelling error
mayankanand007 Dec 9, 2021
ff177fb
removed redundant code to sort by frequency
mayankanand007 Dec 9, 2021
7b3e2ee
refactored fit_transform test based on the new logic
mayankanand007 Dec 9, 2021
96e3cd7
some del statements to free up GPU memory
mayankanand007 Dec 13, 2021
89d6f4b
updated example with correct runtimes
mayankanand007 Dec 15, 2021
c24b9a2
updated example notebook with time
mayankanand007 Dec 15, 2021
2b26f14
created scripts to compare and benchmark implementations
mayankanand007 Dec 15, 2021
13fc7e7
resolved CUDA OOM issue in embedding extraction
mayankanand007 Dec 20, 2021
e1f9ae5
updated end to end example with RMM enabled
mayankanand007 Dec 20, 2021
d36a7ef
post updated example
mayankanand007 Dec 20, 2021
bf60566
fixed emptydoc issue in preprocessing and updated test
mayankanand007 Dec 21, 2021
f0ce6a1
updated tokenizer parallelism flag and module imports
mayankanand007 Dec 21, 2021
1 change: 1 addition & 0 deletions README.md
@@ -11,3 +11,4 @@
[pycuda\_cudf\_integration](./pycuda_cudf_integration) | Demonstrates processing python cudf dataframes using `pycuda`
[tfidf-benchmark](./tfidf-benchmark) | Benchmarks NLP text processing pipeline in cuML + Dask vs. Apache Spark
[rapids_triton_example](./rapids_triton_example) | Example of using RAPIDS+Pytorch with Nvidia Triton.
[cuBERT_topic_modelling](./cuBERT_topic_modelling) | Leveraging BERT, TF-IDF and NVIDIA RAPIDS to create easily interpretable topics.
93 changes: 93 additions & 0 deletions cuBERT_topic_modelling/README.md
@@ -0,0 +1,93 @@
# cuBERT-topic-modelling

Leveraging BERT, TF-IDF and NVIDIA RAPIDS to create easily interpretable topics.

## Overview

Currently, the [BERTopic](https://github.com/MaartenGr/BERTopic) library uses the GPU only during the encoding step, through the SentenceTransformer package. We can accelerate the remaining stages of the pipeline, such as dimensionality reduction (UMAP), topic creation (Count/TF-IDF vectorization), clustering (HDBSCAN), and topic reduction, using the RAPIDS ecosystem, which offers APIs similar to pandas and scikit-learn while running the end-to-end pipeline on GPUs.

![Overview of pipeline design](images/cuBERTopic.jpg)
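
For intuition, the sketch below shows how the dimensionality reduction and clustering stages map onto their `cuml` counterparts on the GPU. This is an illustrative sketch, not cuBERTopic's internal code, and the parameter values are assumptions.

```python
# Illustrative sketch (not cuBERTopic's internal code): the dimensionality
# reduction and clustering stages expressed with their RAPIDS counterparts.
# Parameter values below are assumptions, not the library defaults.
import cupy as cp
from cuml.manifold import UMAP
from cuml.cluster import HDBSCAN

def reduce_and_cluster(embeddings: cp.ndarray):
    # Dimensionality reduction on the GPU (in place of umap-learn on CPU)
    reduced = UMAP(n_neighbors=15, n_components=5, min_dist=0.0).fit_transform(embeddings)

    # Density-based clustering on the GPU (in place of the CPU hdbscan package)
    labels = HDBSCAN(min_cluster_size=10, metric="euclidean").fit_predict(reduced)
    return reduced, labels
```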

### Code organization

```
cuBERT_topic_modelling
│ README.md
│ berttopic_example.ipynb
| ctfidf.py
| cuBERTopic.py
| embedding_extraction.py
| mmr.py
| setup.py
└───tests
│ │ test_ctfidf.py
│ │ test_data_preprocess.py
│ │ test_embeddings_extraction.py
| | test_fit_transform.py
| | test_hdbscan_clustering.py
| | test_mmr.py
| | test_subwordtokenizer.py
| | test_umap_dr.py
└───utils
| │ sparse_matrix_utils.py
└───vectorizer
| │ vectorizer.py
|
└───conda
| │ environment.yml
|
└───images
| │ cuBERTopic.jpg
|
└───vocab
| │ voc_has.txt
| | vocab.txt
|
```

## Installation

`cuBERTopic` runs on `cudf` and `cuml`, which can be installed using the instructions [here](https://rapids.ai/start.html). These packages require an NVIDIA GPU and CUDA; to determine which version to install, check the CUDA version on your system (e.g. with `nvidia-smi`). Here we assume the user has `conda` (or `mamba`) installed.

For example, if you have CUDA 11.2 and Ubuntu 20.04, then run:

```bash
conda create -n rapids-21.12 -c rapidsai-nightly -c nvidia -c conda-forge \
rapids=21.12 python=3.8 cudatoolkit=11.2
conda activate rapids-21.12
```

We also provide a conda environment file [here](conda/environment.yml), which you can use to create the environment with `conda env create -f conda/environment.yml`. This creates a conda environment called `topic_p` (the name can be changed in the file). Remember to adjust the file to match the CUDA version on your system.

After installing the dependencies (creating the conda environment), clone the repository and run `pip install -e .` from the root directory.

Now you can do `import cuBERTopic` and you're good to go!
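
As a rough usage sketch, the interface is intended to mirror BERTopic's `fit_transform`; the class name below is an assumption, and the [example notebook](berttopic_example.ipynb) is the authoritative reference.

```python
# Hypothetical usage sketch; the class name is an assumption, see
# berttopic_example.ipynb for the canonical usage.
from sklearn.datasets import fetch_20newsgroups
from cuBERTopic import gpu_BERTopic  # assumed class name

docs = fetch_20newsgroups(subset="all")["data"]

# Assuming a BERTopic-style fit_transform that returns topics and probabilities
topic_model = gpu_BERTopic()
topics, probs = topic_model.fit_transform(docs)
```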

Additionally, if you want to run and compare against `BERTopic`, you can find the instructions [here](https://github.com/MaartenGr/BERTopic). Essentially, just run `pip install bertopic`.

## Quick Start

An [example](berttopic_example.ipynb) notebook is provided, which walks through installation and compares results against [BERTopic](https://github.com/MaartenGr/BERTopic). Make sure to install the dependencies by following the steps above.

## Development

Contributions to this codebase are welcome!

A [conda-dev](conda/conda_dev_env.yml) file is provided which includes all the development dependencies. Using this file, you can create your own dev environment by running:

```bash
conda env create -f conda/conda_dev_env.yml
```

Our tests verify correctness by running `BERTopic` on the same input, so it is a dependency for running the test suite (`pip install bertopic`). Additional dependencies are `pytest` and `pytest-lazy-fixture`.

To run the existing tests, run `pytest -v` from the root directory; more tests can be added under `tests` as well, as sketched below.
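
As a sketch of what a new test might look like, the comparison below checks that the GPU implementation finds roughly the same number of topics as CPU `BERTopic` on the same input; the GPU class name is an assumption and the existing files under `tests/` are the authoritative reference.

```python
# Hypothetical test sketch; the GPU class name is an assumption and the
# existing tests under tests/ are the authoritative reference.
import pytest
from sklearn.datasets import fetch_20newsgroups

@pytest.fixture(scope="module")
def documents():
    # A small slice keeps the test fast while still exercising the pipeline.
    return fetch_20newsgroups(subset="all")["data"][:500]

def test_topic_count_close_to_bertopic(documents):
    from bertopic import BERTopic
    from cuBERTopic import gpu_BERTopic  # assumed class name

    cpu_topics, _ = BERTopic().fit_transform(documents)
    gpu_topics, _ = gpu_BERTopic().fit_transform(documents)

    # Cluster labels need not match exactly; compare the number of topics found.
    assert abs(len(set(cpu_topics)) - len(set(gpu_topics))) <= 2
```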

## Acknowledgement

Our work ports the CPU implementation of the [BERTopic library](https://github.com/MaartenGr/BERTopic) to a Python-based GPU backend using NVIDIA RAPIDS.

Please refer to Maarten Grootendorst's [blog](https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6) on how to use BERT to create your own topic model.
148 changes: 148 additions & 0 deletions cuBERT_topic_modelling/benchmarks/benchmark_berttopic.ipynb
@@ -0,0 +1,148 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "82c186a1-8c70-4446-9b00-ffb7aabdbd6e",
"metadata": {},
"outputs": [],
"source": [
"from bertopic import BERTopic\n",
"from sklearn.datasets import fetch_20newsgroups\n",
"import pandas as pd\n",
"from sentence_transformers import SentenceTransformer\n",
"documents = fetch_20newsgroups(subset='all')['data']"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "da96f945-e7fc-4030-a790-29f4173a4dda",
"metadata": {},
"outputs": [],
"source": [
"topic_model = BERTopic()\n",
"documents = pd.DataFrame({\"Document\": documents,\n",
" \"ID\": range(len(documents)),\n",
" \"Topic\": None})"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "67b375af-347f-412b-99fa-6ea71e1abd83",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 3min 8s, sys: 9.73 s, total: 3min 18s\n",
"Wall time: 35.6 s\n"
]
}
],
"source": [
"%%time\n",
"#Extract embeddings\n",
"model = SentenceTransformer(\"all-MiniLM-L6-v2\")\n",
"embeddings = model.encode(\n",
" documents.Document,\n",
" show_progress_bar=False\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "ef482c47-c013-4909-85c3-67a10b806fef",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
"To disable this warning, you can either:\n",
"\t- Avoid using `tokenizers` before the fork if possible\n",
"\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
"CPU times: user 9min 35s, sys: 8.98 s, total: 9min 44s\n",
"Wall time: 21.9 s\n"
]
}
],
"source": [
"%%time\n",
"# Dimensionality Reduction\n",
"umap_embeddings = topic_model._reduce_dimensionality(embeddings)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "51435da5-2013-4359-8f0d-f82433a4ad00",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 801 ms, sys: 20 ms, total: 821 ms\n",
"Wall time: 886 ms\n"
]
}
],
"source": [
"%%time\n",
"# Cluster UMAP embeddings with HDBSCAN\n",
"documents, probabilities = topic_model._cluster_embeddings(umap_embeddings, documents)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "1810bb03-3f81-44ae-a079-1ca799b4ec87",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 6.89 s, sys: 152 ms, total: 7.04 s\n",
"Wall time: 7.04 s\n"
]
}
],
"source": [
"%%time\n",
"# Sort and Map Topic IDs by their frequency\n",
"if not topic_model.nr_topics:\n",
" documents = topic_model._sort_mappings_by_frequency(documents)\n",
"\n",
"# Extract topics by calculating c-TF-IDF\n",
"topic_model._extract_topics(documents) # does both topic extraction and representation"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}