feat: Add stage_for_weaviate and schema creation function (#672)
* add weaviate docker compose

* added staging brick and tests for weaviate

* initial notebook and requirements file

* add commentary to weaviate notebook

* weaviate readme

* update docs

* version and change log

* install weaviate client

* install weaviate; skip for docker

* linting, linting, linting

* install weaviate client with deps

* comments on weaviate client

* fix module not found error for docker container

* skipped wrong test in docker

* fix typos

* add in local-inference
MthwRobinson authored Jun 1, 2023
1 parent cf70c86 commit c35fff2
Showing 11 changed files with 455 additions and 3 deletions.
3 changes: 3 additions & 0 deletions .github/workflows/ci.yml
@@ -138,6 +138,9 @@ jobs:
sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5
sudo apt-get install -y tesseract-ocr tesseract-ocr-kor
tesseract --version
# NOTE(robinson) - Installing weaviate-client separately here because the requests
# version conflicts with label_studio_sdk
pip install weaviate-client
make test
make check-coverage
6 changes: 4 additions & 2 deletions CHANGELOG.md
@@ -2,10 +2,12 @@

### Enhancements

* Builds from the Unstructured base image, built off of Rocky Linux 8.7, which resolves almost all CVEs in the image.

### Features

* Add `stage_for_weaviate` to stage `unstructured` outputs for upload to Weaviate, along with
a helper function for defining a class to use in Weaviate schemas.
* Builds from the Unstructured base image, built off of Rocky Linux 8.7, which resolves almost all CVEs in the image.

### Fixes

## 0.7.0
5 changes: 4 additions & 1 deletion Makefile
@@ -41,6 +41,9 @@ install-nltk-models:
.PHONY: install-test
install-test:
python3 -m pip install -r requirements/test.txt
# NOTE(robinson) - Installing weaviate-client separately here because the requests
# version conflicts with label_studio_sdk
python3 -m pip install weaviate-client

.PHONY: install-dev
install-dev:
@@ -245,4 +248,4 @@ docker-jupyter-notebook:

.PHONY: run-jupyter
run-jupyter:
PYTHONPATH=$(realpath .) JUPYTER_PATH=$(realpath .) jupyter-notebook --NotebookApp.token='' --NotebookApp.password=''
PYTHONPATH=$(realpath .) JUPYTER_PATH=$(realpath .) jupyter-notebook --NotebookApp.token='' --NotebookApp.password=''
52 changes: 52 additions & 0 deletions docs/source/bricks.rst
@@ -1554,6 +1554,58 @@ See the `LabelStudio docs <https://labelstud.io/tags/labels.html>`_ for a full list of options
for labels and annotations.


``stage_for_weaviate``
-----------------------

The ``stage_for_weaviate`` staging function prepares a list of ``Element`` objects for ingestion into
the `Weaviate <https://weaviate.io/>`_ vector database. You can create a schema in Weaviate
for the `unstructured` outputs using the following workflow:

.. code:: python

  from unstructured.staging.weaviate import create_unstructured_weaviate_class
  import weaviate

  # Change `class_name` if you want the class for unstructured documents in Weaviate
  # to have a different name
  unstructured_class = create_unstructured_weaviate_class(class_name="UnstructuredDocument")
  schema = {"classes": [unstructured_class]}

  client = weaviate.Client("http://localhost:8080")
  client.schema.create(schema)

Once the schema is created, you can batch upload documents to Weaviate using the following workflow.
See the `Weaviate documentation <https://weaviate.io/developers/weaviate>`_ for more details on
options for uploading data and querying data once it has been uploaded.


.. code:: python

  import tqdm
  import weaviate
  from weaviate.util import generate_uuid5

  from unstructured.partition.pdf import partition_pdf
  from unstructured.staging.weaviate import stage_for_weaviate

  filename = "example-docs/layout-parser-paper-fast.pdf"
  elements = partition_pdf(filename=filename, strategy="fast")
  data_objects = stage_for_weaviate(elements)

  # The class name must match the one used when the schema was created above
  unstructured_class_name = "UnstructuredDocument"
  client = weaviate.Client("http://localhost:8080")

  with client.batch(batch_size=10) as batch:
      for data_object in tqdm.tqdm(data_objects):
          batch.add_data_object(
              data_object,
              unstructured_class_name,
              uuid=generate_uuid5(data_object),
          )

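Once the data objects are uploaded, you can query them through the Weaviate client. The snippet below is a minimal sketch adapted from the example notebook, reusing the ``client`` created in the previous block; it assumes a vectorizer module (such as ``text2vec-transformers``) is enabled on the Weaviate instance so that near-text search is available.

.. code:: python

  import json

  near_text = {"concepts": ["document understanding"]}

  result = (
      client.query
      .get("UnstructuredDocument", ["text"])
      .with_near_text(near_text)
      .with_limit(1)
      .do()
  )
  print(json.dumps(result, indent=4))
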
``stage_for_baseplate``
-----------------------

10 changes: 10 additions & 0 deletions docs/source/integrations.rst
@@ -75,3 +75,13 @@ the text from each element and their types such as ``NarrativeText`` or ``Title``
-----------------------------
You can format your JSON or CSV outputs for use with `Prodigy <https://prodi.gy/docs/api-loaders>`_ using the `stage_for_prodigy <https://unstructured-io.github.io/unstructured/bricks.html#stage-for-prodigy>`_ and `stage_csv_for_prodigy <https://unstructured-io.github.io/unstructured/bricks.html#stage-csv-for-prodigy>`_ staging bricks. After running ``stage_for_prodigy`` or
``stage_csv_for_prodigy``, you can write the results to a ``.json``, ``.jsonl``, or ``.csv`` file that is ready to be used with Prodigy. Follow the links for more details on usage.


``Integration with Weaviate``
-----------------------------
`Weaviate <https://weaviate.io/>`_ is an open-source vector database that allows you to store data objects and vector embeddings
from a variety of ML models. Storing text and embeddings in a vector database such as Weaviate is a key component of the
`emerging LLM tech stack <https://medium.com/@unstructured-io/llms-and-the-emerging-ml-tech-stack-bdb189c8be5c>`_.
See the `stage_for_weaviate <https://unstructured-io.github.io/unstructured/bricks.html#stage-for-weaviate>`_ docs for details
on how to upload ``unstructured`` outputs to Weaviate. An example notebook is also available
`here <https://github.com/Unstructured-IO/unstructured/tree/main/examples/weaviate>`_.
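
For a quick sense of the workflow, the sketch below condenses the ``stage_for_weaviate`` example: it partitions a document, creates the schema class, and batch uploads the staged elements. It assumes a Weaviate instance running locally at ``http://localhost:8080`` (for example, via the docker-compose file in the example directory).

.. code:: python

  import weaviate
  from weaviate.util import generate_uuid5

  from unstructured.partition.pdf import partition_pdf
  from unstructured.staging.weaviate import (
      create_unstructured_weaviate_class,
      stage_for_weaviate,
  )

  # Create a class for unstructured elements and register it in the schema
  client = weaviate.Client("http://localhost:8080")
  unstructured_class = create_unstructured_weaviate_class("UnstructuredDocument")
  client.schema.create({"classes": [unstructured_class]})

  # Partition, stage, and batch upload the document elements
  elements = partition_pdf(filename="example-docs/layout-parser-paper-fast.pdf", strategy="fast")
  data_objects = stage_for_weaviate(elements)
  with client.batch(batch_size=10) as batch:
      for data_object in data_objects:
          batch.add_data_object(
              data_object,
              "UnstructuredDocument",
              uuid=generate_uuid5(data_object),
          )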
8 changes: 8 additions & 0 deletions examples/weaviate/README.md
@@ -0,0 +1,8 @@
## Uploading data to Weaviate with `unstructured`

The example notebook in this directory shows how to upload documents to Weaviate using the
`unstructured` library. To get started with the notebook, use the following steps:

- Run `pip install -r requirements.txt` to install the requirements.
- Run `docker-compose up` to run the Weaviate container.
- Run `jupyter-notebook` to start the notebook.
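
Before opening the notebook, you can confirm that the Weaviate container is accepting requests. This is a minimal check, assuming the default `http://localhost:8080` address from the docker-compose file in this directory:

```python
import weaviate

# Connect to the Weaviate instance started by `docker-compose up`
client = weaviate.Client("http://localhost:8080")
print(client.is_ready())  # prints True once the container is ready to accept requests
```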
20 changes: 20 additions & 0 deletions examples/weaviate/docker-compose.yml
@@ -0,0 +1,20 @@
version: '3.4'
services:
  weaviate:
    image: semitechnologies/weaviate:1.19.6
    restart: on-failure:0
    ports:
      - "8080:8080"
    environment:
      QUERY_DEFAULTS_LIMIT: 20
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: "./data"
      DEFAULT_VECTORIZER_MODULE: text2vec-transformers
      ENABLE_MODULES: text2vec-transformers
      TRANSFORMERS_INFERENCE_API: http://t2v-transformers:8080
      CLUSTER_HOSTNAME: 'node1'
  t2v-transformers:
    image: semitechnologies/transformers-inference:sentence-transformers-multi-qa-MiniLM-L6-cos-v1
    environment:
      ENABLE_CUDA: 0 # set to 1 to enable
      # NVIDIA_VISIBLE_DEVICES: all # enable if running with CUDA
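
Once the stack is up, a quick way to confirm that the `text2vec-transformers` module from this compose file is active is to inspect the server metadata. This is a small sketch, assuming the default port mapping above; the exact shape of the metadata response may vary across Weaviate versions:

```python
import weaviate

client = weaviate.Client("http://localhost:8080")

# The server metadata lists the enabled modules; text2vec-transformers should
# appear here if the compose stack started correctly.
meta = client.get_meta()
print(meta.get("version"), list(meta.get("modules", {}).keys()))
```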
4 changes: 4 additions & 0 deletions examples/weaviate/requirements.txt
@@ -0,0 +1,4 @@
jupyter
tqdm
weaviate-client
unstructured[local-inference]
215 changes: 215 additions & 0 deletions examples/weaviate/weaviate.ipynb
@@ -0,0 +1,215 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "a3ce962e",
"metadata": {},
"source": [
"## Loading Data into Weaviate with `unstructured`\n",
"\n",
"This notebook shows a basic workflow for uploading document elements into Weaviate using the `unstructured` library. To get started with this notebook, first install the dependencies with `pip install -r requirements.txt` and start the Weaviate docker container with `docker-compose up`."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "5d9ffc17",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"import tqdm\n",
"from unstructured.partition.pdf import partition_pdf\n",
"from unstructured.staging.weaviate import create_unstructured_weaviate_class, stage_for_weaviate\n",
"import weaviate\n",
"from weaviate.util import generate_uuid5"
]
},
{
"cell_type": "markdown",
"id": "673715e9",
"metadata": {},
"source": [
"The first step is to partition the document using the `unstructured` library. In the following example, we partition a PDF with `partition_pdf`. You can also partition over a dozen document types with the `partition` function."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "f9fc0cf9",
"metadata": {},
"outputs": [],
"source": [
"filename = \"../../example-docs/layout-parser-paper-fast.pdf\"\n",
"elements = partition_pdf(filename=filename, strategy=\"fast\")"
]
},
{
"cell_type": "markdown",
"id": "3ae76364",
"metadata": {},
"source": [
"Next, we'll create a schema for our Weaviate database using the `create_unstructured_weaviate_class` helper function from the `unstructured` library. The helper function generates a schema that includes all of the elements in the `ElementMetadata` object from `unstructured`. This includes information such as the filename and the page number of the document element. After specifying the schema, we create a connection to the database with the Weaviate client library and create the schema. You can change the name of the class by updating the `unstructured_class_name` variable."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "91057cb1",
"metadata": {},
"outputs": [],
"source": [
"unstructured_class_name = \"UnstructuredDocument\""
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "78e804bb",
"metadata": {},
"outputs": [],
"source": [
"unstructured_class = create_unstructured_weaviate_class(unstructured_class_name)\n",
"schema = {\"classes\": [unstructured_class]} "
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "3e317a2d",
"metadata": {},
"outputs": [],
"source": [
"client = weaviate.Client(\"http://localhost:8080\")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "0c508784",
"metadata": {},
"outputs": [],
"source": [
"client.schema.create(schema)"
]
},
{
"cell_type": "markdown",
"id": "024ae133",
"metadata": {},
"source": [
"Next, we stage the elements for Weaviate using the `stage_for_weaviate` function and batch upload the results to Weaviate. `stage_for_weaviate` outputs a dictionary that conforms to the schema we created earlier. Once that data is stage, we can use the Weaviate client library to batch upload the results to Weaviate."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "a7018bb1",
"metadata": {},
"outputs": [],
"source": [
"data_objects = stage_for_weaviate(elements)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "af712d8e",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████████████████████████████████████████████████████████████████| 28/28 [00:46<00:00, 1.66s/it]\n"
]
}
],
"source": [
"with client.batch(batch_size=10) as batch:\n",
" for data_object in tqdm.tqdm(data_objects):\n",
" batch.add_data_object(\n",
" data_object,\n",
" unstructured_class_name,\n",
" uuid=generate_uuid5(data_object),\n",
" )"
]
},
{
"cell_type": "markdown",
"id": "dac10bf5",
"metadata": {},
"source": [
"Now that the documents are in Weaviate, we're able to run queries against Weaviate!"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "14098434",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" \"data\": {\n",
" \"Get\": {\n",
" \"UnstructuredDocument\": [\n",
" {\n",
" \"text\": \"Deep Learning(DL)-based approaches are the state-of-the-art for a wide range of document image analysis (DIA) tasks including document image classi\\ufb01cation [11,\"\n",
" }\n",
" ]\n",
" }\n",
" }\n",
"}\n"
]
}
],
"source": [
"near_text = {\"concepts\": [\"document understanding\"]}\n",
"\n",
"result = (\n",
" client.query\n",
" .get(\"UnstructuredDocument\", [\"text\"])\n",
" .with_near_text(near_text)\n",
" .with_limit(1)\n",
" .do()\n",
")\n",
"\n",
"print(json.dumps(result, indent=4))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c191217c",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
