Topic modelling using RAPIDS and BERT #41

Merged Dec 21, 2021 · 53 commits

Commits
efb7224
added topic modelling code
mayankanand007 Nov 18, 2021
eecc25e
squashing bugs
mayankanand007 Nov 18, 2021
5761738
remove prints
mayankanand007 Nov 18, 2021
43b7522
fix prints
mayankanand007 Nov 18, 2021
afed317
updated readme with relevant information
mayankanand007 Nov 19, 2021
33a81f9
Update README.md
mayankanand007 Nov 19, 2021
f753078
Update README.md
mayankanand007 Nov 19, 2021
6ee8924
added topic modelling example link
mayankanand007 Nov 19, 2021
2f7fe75
fix bugs
mayankanand007 Nov 22, 2021
2f2727d
Merge branch 'topic_modelling' of https://github.com/mayankanand007/r…
mayankanand007 Nov 22, 2021
678da80
addressing reviews part 1
mayankanand007 Nov 22, 2021
a3466f5
updated readme with installation steps
mayankanand007 Nov 22, 2021
747b8cf
clean up commented code
mayankanand007 Nov 22, 2021
aecd7de
updated example notebook
mayankanand007 Nov 22, 2021
158f990
fix issues with matrix manipulation
mayankanand007 Nov 23, 2021
ad652e9
integrated subwordtokenizer and code refactoring
mayankanand007 Nov 25, 2021
e9c090c
style fixes
mayankanand007 Nov 26, 2021
ea8455b
fixed embedding extraction base case
mayankanand007 Nov 26, 2021
ef9668a
fixed style, added docstrings
mayankanand007 Nov 26, 2021
217fe06
dlpack optimizations
mayankanand007 Nov 26, 2021
e3e15e7
addressing reviews part 1
mayankanand007 Nov 30, 2021
fdda302
fixed style issues, mmr work in progress
mayankanand007 Nov 30, 2021
a41d2ba
added pytest lazy fixture
mayankanand007 Nov 30, 2021
e7e4baa
updated readme with instructions
mayankanand007 Nov 30, 2021
d63f8d4
added dev installation instructions
mayankanand007 Nov 30, 2021
2c2beda
updated diagram
mayankanand007 Nov 30, 2021
0d35cf0
updated example notebook
mayankanand007 Nov 30, 2021
794ddad
fix critical bug in MMR
mayankanand007 Nov 30, 2021
53b0527
moved fix_padding into a pytest
mayankanand007 Nov 30, 2021
a313557
clean up commented code
mayankanand007 Nov 30, 2021
104f52f
fixed MMR, added environment.yml file
mayankanand007 Dec 1, 2021
f8b8eb4
replaced pandas conversion part 1
mayankanand007 Dec 1, 2021
8cf03bb
removed topic reduction code
mayankanand007 Dec 2, 2021
55396e9
addressed some code reviews
mayankanand007 Dec 8, 2021
b4b71a5
addressed reviews part 2
mayankanand007 Dec 8, 2021
d7a6700
moved vocabulary files
mayankanand007 Dec 8, 2021
2229aa9
fixed spacing issue in README
mayankanand007 Dec 8, 2021
c5054f2
updated diagram with better clarity
mayankanand007 Dec 9, 2021
6b723b6
updated directory tree
mayankanand007 Dec 9, 2021
10937c0
created conda-env files for dev, prod
mayankanand007 Dec 9, 2021
7428484
added clarification regarding CUDA version
mayankanand007 Dec 9, 2021
b6e21d5
fixed spelling error
mayankanand007 Dec 9, 2021
ff177fb
removed redundant code to sort by frequency
mayankanand007 Dec 9, 2021
7b3e2ee
refactored fit_transform test based on the new logic
mayankanand007 Dec 9, 2021
96e3cd7
some del statements to free up GPU memory
mayankanand007 Dec 13, 2021
89d6f4b
updated example with correct runtimes
mayankanand007 Dec 15, 2021
c24b9a2
updated example notebook with time
mayankanand007 Dec 15, 2021
2b26f14
created scripts to compare and benchmark implementations
mayankanand007 Dec 15, 2021
13fc7e7
resolved CUDA OOM issue in embedding extraction
mayankanand007 Dec 20, 2021
e1f9ae5
updated end to end example with RMM enabled
mayankanand007 Dec 20, 2021
d36a7ef
post updated example
mayankanand007 Dec 20, 2021
bf60566
fixed emptydoc issue in preprocessing and updated test
mayankanand007 Dec 21, 2021
f0ce6a1
updated tokenizer parallelism flag and module imports
mayankanand007 Dec 21, 2021
1 change: 1 addition & 0 deletions README.md
@@ -11,3 +11,4 @@
[pycuda\_cudf\_integration](./pycuda_cudf_integration) | Demonstrates processing python cudf dataframes using `pycuda`
[tfidf-benchmark](./tfidf-benchmark) | Benchmarks NLP text processing pipeline in cuML + Dask vs. Apache Spark
[rapids_triton_example](./rapids_triton_example) | Example of using RAPIDS+Pytorch with Nvidia Triton.
[cuBERT_topic_modelling](./cuBERT_topic_modelling) | Leveraging BERT, TF-IDF and NVIDIA RAPIDS to create easily interpretable topics.
93 changes: 93 additions & 0 deletions cuBERT_topic_modelling/README.md
@@ -0,0 +1,93 @@
# cuBERT-topic-modelling

Leveraging BERT, TF-IDF and NVIDIA RAPIDS to create easily interpretable topics.

## Overview

Currently, the [BERTopic](https://github.com/MaartenGr/BERTopic) library uses the GPU only during the encoding step, through the SentenceTransformer package. We can accelerate the remaining stages of the pipeline, such as dimensionality reduction (UMAP), topic creation (Count/TF-IDF vectorization), clustering (HDBSCAN), and topic reduction, using the RAPIDS ecosystem, which offers APIs similar to pandas and scikit-learn while running the end-to-end pipeline on GPUs.

![Overview of pipeline design](images/cuBERTopic.jpg)
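
For intuition, the sketch below shows how the dimensionality reduction and clustering stages map onto their `cuml` counterparts on the GPU. This is an illustrative sketch, not cuBERTopic's internal code, and the parameter values are assumptions.

```python
# Illustrative sketch (not cuBERTopic's internal code): the dimensionality
# reduction and clustering stages expressed with their RAPIDS counterparts.
# Parameter values below are assumptions, not the library defaults.
import cupy as cp
from cuml.manifold import UMAP
from cuml.cluster import HDBSCAN

def reduce_and_cluster(embeddings: cp.ndarray):
    # Dimensionality reduction on the GPU (in place of umap-learn on CPU)
    reduced = UMAP(n_neighbors=15, n_components=5, min_dist=0.0).fit_transform(embeddings)

    # Density-based clustering on the GPU (in place of the CPU hdbscan package)
    labels = HDBSCAN(min_cluster_size=10, metric="euclidean").fit_predict(reduced)
    return reduced, labels
```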

### Code organization

```
cuBERT_topic_modelling
│ README.md
│ berttopic_example.ipynb
| ctfidf.py
| cuBERTopic.py
| embedding_extraction.py
| mmr.py
| setup.py
└───tests
│ │ test_ctfidf.py
│ │ test_data_preprocess.py
│ │ test_embeddings_extraction.py
| | test_fit_transform.py
| | test_hdbscan_clustering.py
| | test_mmr.py
| | test_subwordtokenizer.py
| | test_umap_dr.py
└───utils
| │ sparse_matrix_utils.py
└───vectorizer
| │ vectorizer.py
|
└───conda
| │ environment.yml
|
└───images
| │ cuBERTopic.jpg
|
└───vocab
| │ voc_has.txt
| | vocab.txt
|
```

## Installation

`cuBERTopic` runs on `cudf` and `cuml`, which can be installed using the instructions [here](https://rapids.ai/start.html). These packages require an NVIDIA GPU and CUDA; to determine which version to install, check the CUDA version on your system (e.g. with `nvidia-smi`). Here we assume the user has `conda` (or `mamba`) installed.

For example, if you have CUDA 11.2 and Ubuntu 20.04, then run:

```bash
conda create -n rapids-21.12 -c rapidsai-nightly -c nvidia -c conda-forge \
rapids=21.12 python=3.8 cudatoolkit=11.2
conda activate rapids-21.12
```

We also provide a conda environment file [here](conda/environment.yml), which you can use to create the environment with `conda env create -f conda/environment.yml`. This creates a conda environment called `topic_p` (the name can be changed in the file). Remember to adjust the file to match the CUDA version on your system.

After installing the dependencies (creating the conda environment), clone the repository and run `pip install -e .` from the root directory.

Now you can do `import cuBERTopic` and you're good to go!
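
As a rough usage sketch, the interface is intended to mirror BERTopic's `fit_transform`; the class name below is an assumption, and the [example notebook](berttopic_example.ipynb) is the authoritative reference.

```python
# Hypothetical usage sketch; the class name is an assumption, see
# berttopic_example.ipynb for the canonical usage.
from sklearn.datasets import fetch_20newsgroups
from cuBERTopic import gpu_BERTopic  # assumed class name

docs = fetch_20newsgroups(subset="all")["data"]

# Assuming a BERTopic-style fit_transform that returns topics and probabilities
topic_model = gpu_BERTopic()
topics, probs = topic_model.fit_transform(docs)
```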

Additionally, if you want to run and compare against `BERTopic`, you can find the instructions [here](https://github.com/MaartenGr/BERTopic). Essentially, just run `pip install bertopic`.

## Quick Start

An [example](berttopic_example.ipynb) notebook is provided, which walks through installation and compares results against [BERTopic](https://github.com/MaartenGr/BERTopic). Make sure to install the dependencies by following the steps above.

## Development

Contributions to this codebase are welcome!

A [conda-dev](conda/conda_dev_env.yml) file is provided which includes all the development dependencies. Using this file, you can create your own dev environment by running:

```bash
conda env create -f conda/conda_dev_env.yml
```

Our tests verify correctness by running `BERTopic` on the same input, so it is a dependency for running the test suite (`pip install bertopic`). Additional dependencies are `pytest` and `pytest-lazy-fixture`.

To run the existing tests, run `pytest -v` from the root directory; more tests can be added under `tests` as well, as sketched below.
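
As a sketch of what a new test might look like, the comparison below checks that the GPU implementation finds roughly the same number of topics as CPU `BERTopic` on the same input; the GPU class name is an assumption and the existing files under `tests/` are the authoritative reference.

```python
# Hypothetical test sketch; the GPU class name is an assumption and the
# existing tests under tests/ are the authoritative reference.
import pytest
from sklearn.datasets import fetch_20newsgroups

@pytest.fixture(scope="module")
def documents():
    # A small slice keeps the test fast while still exercising the pipeline.
    return fetch_20newsgroups(subset="all")["data"][:500]

def test_topic_count_close_to_bertopic(documents):
    from bertopic import BERTopic
    from cuBERTopic import gpu_BERTopic  # assumed class name

    cpu_topics, _ = BERTopic().fit_transform(documents)
    gpu_topics, _ = gpu_BERTopic().fit_transform(documents)

    # Cluster labels need not match exactly; compare the number of topics found.
    assert abs(len(set(cpu_topics)) - len(set(gpu_topics))) <= 2
```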

## Acknowledgement

Our work ports the CPU implementation of the [BERTopic library](https://github.com/MaartenGr/BERTopic) to a Python-based GPU backend using NVIDIA RAPIDS.

Please refer to Maarten Grootendorst's [blog](https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6) on how to use BERT to create your own topic model.
148 changes: 148 additions & 0 deletions cuBERT_topic_modelling/benchmarks/benchmark_berttopic.ipynb
@@ -0,0 +1,148 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "82c186a1-8c70-4446-9b00-ffb7aabdbd6e",
"metadata": {},
"outputs": [],
"source": [
"from bertopic import BERTopic\n",
"from sklearn.datasets import fetch_20newsgroups\n",
"import pandas as pd\n",
"from sentence_transformers import SentenceTransformer\n",
"documents = fetch_20newsgroups(subset='all')['data']"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "da96f945-e7fc-4030-a790-29f4173a4dda",
"metadata": {},
"outputs": [],
"source": [
"topic_model = BERTopic()\n",
"documents = pd.DataFrame({\"Document\": documents,\n",
" \"ID\": range(len(documents)),\n",
" \"Topic\": None})"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "67b375af-347f-412b-99fa-6ea71e1abd83",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 3min 8s, sys: 9.73 s, total: 3min 18s\n",
"Wall time: 35.6 s\n"
]
}
],
"source": [
"%%time\n",
"#Extract embeddings\n",
"model = SentenceTransformer(\"all-MiniLM-L6-v2\")\n",
"embeddings = model.encode(\n",
" documents.Document,\n",
" show_progress_bar=False\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "ef482c47-c013-4909-85c3-67a10b806fef",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
"To disable this warning, you can either:\n",
"\t- Avoid using `tokenizers` before the fork if possible\n",
"\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
"CPU times: user 9min 35s, sys: 8.98 s, total: 9min 44s\n",
"Wall time: 21.9 s\n"
]
}
],
"source": [
"%%time\n",
"# Dimensionality Reduction\n",
"umap_embeddings = topic_model._reduce_dimensionality(embeddings)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "51435da5-2013-4359-8f0d-f82433a4ad00",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 801 ms, sys: 20 ms, total: 821 ms\n",
"Wall time: 886 ms\n"
]
}
],
"source": [
"%%time\n",
"# Cluster UMAP embeddings with HDBSCAN\n",
"documents, probabilities = topic_model._cluster_embeddings(umap_embeddings, documents)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "1810bb03-3f81-44ae-a079-1ca799b4ec87",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 6.89 s, sys: 152 ms, total: 7.04 s\n",
"Wall time: 7.04 s\n"
]
}
],
"source": [
"%%time\n",
"# Sort and Map Topic IDs by their frequency\n",
"if not topic_model.nr_topics:\n",
" documents = topic_model._sort_mappings_by_frequency(documents)\n",
"\n",
"# Extract topics by calculating c-TF-IDF\n",
"topic_model._extract_topics(documents) # does both topic extraction and representation"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}