-
Notifications
You must be signed in to change notification settings - Fork 838
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
feat: Add
stage_for_weaviate
and schema creation function (#672)
* add weaviate docker compose * added staging brick and tests for weaviate * initial notebook and requirements file * add commentary to weaviate notebook * weaviate readme * update docs * version and change log * install weaviate client * install weaviate; skip for docker * linting, linting, linting * install weaviate client with deps * comments on weaviate client * fix module not found error for docker container * skipped wrong test in docker * fix typos * add in local-inference
- Loading branch information
1 parent
cf70c86
commit c35fff2
Showing
11 changed files
with
455 additions
and
3 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
## Uploading data to Weaviate with `unstructured` | ||
|
||
The example notebook in this directory shows how to upload documents to Weaviate using the | ||
`unstructured` library. To get started with the notebook, use the following steps: | ||
|
||
- Run `pip install -r requirements.txt` to install the requirements. | ||
- Run `docker-compose up` to run the Weaviate container. | ||
- Run `jupyter-notebook` to start the notebook. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
version: '3.4' | ||
services: | ||
weaviate: | ||
image: semitechnologies/weaviate:1.19.6 | ||
restart: on-failure:0 | ||
ports: | ||
- "8080:8080" | ||
environment: | ||
QUERY_DEFAULTS_LIMIT: 20 | ||
AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true' | ||
PERSISTENCE_DATA_PATH: "./data" | ||
DEFAULT_VECTORIZER_MODULE: text2vec-transformers | ||
ENABLE_MODULES: text2vec-transformers | ||
TRANSFORMERS_INFERENCE_API: http://t2v-transformers:8080 | ||
CLUSTER_HOSTNAME: 'node1' | ||
t2v-transformers: | ||
image: semitechnologies/transformers-inference:sentence-transformers-multi-qa-MiniLM-L6-cos-v1 | ||
environment: | ||
ENABLE_CUDA: 0 # set to 1 to enable | ||
# NVIDIA_VISIBLE_DEVICES: all # enable if running with CUDA |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
jupyter | ||
tqdm | ||
weaviate-client | ||
unstructured[local-inference] |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,215 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "a3ce962e", | ||
"metadata": {}, | ||
"source": [ | ||
"## Loading Data into Weaviate with `unstructured`\n", | ||
"\n", | ||
"This notebook shows a basic workflow for uploading document elements into Weaviate using the `unstructured` library. To get started with this notebook, first install the dependencies with `pip install -r requirements.txt` and start the Weaviate docker container with `docker-compose up`." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"id": "5d9ffc17", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import json\n", | ||
"\n", | ||
"import tqdm\n", | ||
"from unstructured.partition.pdf import partition_pdf\n", | ||
"from unstructured.staging.weaviate import create_unstructured_weaviate_class, stage_for_weaviate\n", | ||
"import weaviate\n", | ||
"from weaviate.util import generate_uuid5" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "673715e9", | ||
"metadata": {}, | ||
"source": [ | ||
"The first step is to partition the document using the `unstructured` library. In the following example, we partition a PDF with `partition_pdf`. You can also partition over a dozen document types with the `partition` function." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"id": "f9fc0cf9", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"filename = \"../../example-docs/layout-parser-paper-fast.pdf\"\n", | ||
"elements = partition_pdf(filename=filename, strategy=\"fast\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "3ae76364", | ||
"metadata": {}, | ||
"source": [ | ||
"Next, we'll create a schema for our Weaviate database using the `create_unstructured_weaviate_class` helper function from the `unstructured` library. The helper function generates a schema that includes all of the elements in the `ElementMetadata` object from `unstructured`. This includes information such as the filename and the page number of the document element. After specifying the schema, we create a connection to the database with the Weaviate client library and create the schema. You can change the name of the class by updating the `unstructured_class_name` variable." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"id": "91057cb1", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"unstructured_class_name = \"UnstructuredDocument\"" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"id": "78e804bb", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"unstructured_class = create_unstructured_weaviate_class(unstructured_class_name)\n", | ||
"schema = {\"classes\": [unstructured_class]} " | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 5, | ||
"id": "3e317a2d", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"client = weaviate.Client(\"http://localhost:8080\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 6, | ||
"id": "0c508784", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"client.schema.create(schema)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "024ae133", | ||
"metadata": {}, | ||
"source": [ | ||
"Next, we stage the elements for Weaviate using the `stage_for_weaviate` function and batch upload the results to Weaviate. `stage_for_weaviate` outputs a dictionary that conforms to the schema we created earlier. Once that data is stage, we can use the Weaviate client library to batch upload the results to Weaviate." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 7, | ||
"id": "a7018bb1", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"data_objects = stage_for_weaviate(elements)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 8, | ||
"id": "af712d8e", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stderr", | ||
"output_type": "stream", | ||
"text": [ | ||
"100%|██████████████████████████████████████████████████████████████████████| 28/28 [00:46<00:00, 1.66s/it]\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"with client.batch(batch_size=10) as batch:\n", | ||
" for data_object in tqdm.tqdm(data_objects):\n", | ||
" batch.add_data_object(\n", | ||
" data_object,\n", | ||
" unstructured_class_name,\n", | ||
" uuid=generate_uuid5(data_object),\n", | ||
" )" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "dac10bf5", | ||
"metadata": {}, | ||
"source": [ | ||
"Now that the documents are in Weaviate, we're able to run queries against Weaviate!" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 9, | ||
"id": "14098434", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"{\n", | ||
" \"data\": {\n", | ||
" \"Get\": {\n", | ||
" \"UnstructuredDocument\": [\n", | ||
" {\n", | ||
" \"text\": \"Deep Learning(DL)-based approaches are the state-of-the-art for a wide range of document image analysis (DIA) tasks including document image classi\\ufb01cation [11,\"\n", | ||
" }\n", | ||
" ]\n", | ||
" }\n", | ||
" }\n", | ||
"}\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"near_text = {\"concepts\": [\"document understanding\"]}\n", | ||
"\n", | ||
"result = (\n", | ||
" client.query\n", | ||
" .get(\"UnstructuredDocument\", [\"text\"])\n", | ||
" .with_near_text(near_text)\n", | ||
" .with_limit(1)\n", | ||
" .do()\n", | ||
")\n", | ||
"\n", | ||
"print(json.dumps(result, indent=4))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "c191217c", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.8.13" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
Oops, something went wrong.