Commit cba526e (parent f8bfe17)

example: Bielik RAG with Weaviate and Ollama

Showing 7 changed files with 439 additions and 0 deletions.
### .gitignore (+3)

```
.venv
data
weaviate/.data
```
### README.md (+49)

# Bielik + Weaviate + Ollama

🎯 About
-------------

This folder contains a set of examples showing how to run [Bielik](https://bielik.ai/) locally with [Ollama](https://ollama.com/) and the [Weaviate](https://weaviate.io/) vector database.

📦 Requirements
------------

To run the examples you need:

1. Docker
2. Python 3

💡 Getting started
--------------------

Before running the notebooks, prepare the environment:

1. Start the containers with the Weaviate database, the vectorizers, and the Ollama generative module:

```sh
docker compose up
```

2. Pull Bielik locally inside the Ollama container before you start working:

```sh
docker exec -i generative_ollama ollama pull SpeakLeash/bielik-7b-instruct-v0.1-gguf
```

3. (optional) Set up a separate Python virtual environment:

```sh
python3 -m venv .venv
source .venv/bin/activate
```

📖 Examples
------------

1. [0-import.ipynb](./notebooks/0-import.ipynb) - import and vectorize the data
2. [1-rag.ipynb](./notebooks/1-rag.ipynb) - query your data locally using Bielik

🔗 Useful links
----------------------

- [Dataset used in the notebooks](https://huggingface.co/datasets/allegro/summarization-polish-summaries-corpus)
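Before opening the notebooks it can help to confirm that the published services are actually listening. A minimal sketch, assuming the host is `localhost` and using only the ports that `docker-compose.yml` publishes (Weaviate's 8080 and 50051; the Ollama container exposes no host port):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        return s.connect_ex((host, port)) == 0

# Ports published by docker-compose.yml: Weaviate REST (8080) and gRPC (50051).
for name, port in [("Weaviate HTTP", 8080), ("Weaviate gRPC", 50051)]:
    status = "up" if port_open("localhost", port) else "down"
    print(f"{name} ({port}): {status}")
```

If either port reports `down`, re-check `docker compose up` before running the import notebook.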
### docker-compose.yml (+39)
```yaml
---
services:
  weaviate:
    command:
      - --host
      - 0.0.0.0
      - --port
      - '8080'
      - --scheme
      - http
      - --write-timeout=600s
      - --read-timeout=600s
    image: cr.weaviate.io/semitechnologies/weaviate:1.27.1
    ports:
      - 8080:8080
      - 50051:50051
    volumes:
      - ./.data/weaviate:/var/lib/weaviate
    restart: on-failure:0
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      CLUSTER_HOSTNAME: 'weaviate-0'
      DEFAULT_VECTORIZER_MODULE: 'none'
      MODULES_CLIENT_TIMEOUT: '600s'
      ENABLE_MODULES: 'text2vec-transformers,generative-ollama'
      TRANSFORMERS_INFERENCE_API: 'http://t2v-transformers-baai-bge-m3-onnx:8080'
  t2v-transformers-baai-bge-m3-onnx:
    image: cr.weaviate.io/semitechnologies/transformers-inference:baai-bge-m3-onnx
  t2v-transformers-ipipan-silver-retriever-base-v1.1:
    build:
      dockerfile: ./ipipan.Dockerfile
  generative-ollama:
    image: ollama/ollama:0.3.14
    container_name: generative_ollama
    volumes:
      - ./.data/generative_ollama:/root/.ollama
...
```
### ipipan.Dockerfile (+2)
```
FROM semitechnologies/transformers-inference:custom
RUN MODEL_NAME=ipipan/silver-retriever-base-v1.1 USE_SENTENCE_TRANSFORMERS_VECTORIZER=true ONNX_RUNTIME=true ./download.py
```
### notebooks/0-import.ipynb (+235)
```python
!pip install weaviate-client
!pip install requests
!pip install datasets
```
#### Download and prepare the data

Download the first 100 articles and their summaries from the dataset [allegro/summarization-polish-summaries-corpus](https://huggingface.co/datasets/allegro/summarization-polish-summaries-corpus/viewer/default/train?row=21):

```python
from datasets import load_dataset

data = load_dataset("allegro/summarization-polish-summaries-corpus", cache_dir="./data", split="train[:100]")

# Rename the columns: "source" (full text) -> "pelen_tekst", "target" (summary) -> "streszczenie".
data = data.map(lambda x: {
    "pelen_tekst": x["source"],
    "streszczenie": x["target"],
}).remove_columns(["source", "target"])
```
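The `map`/`remove_columns` pair above only renames two columns. The same transformation over plain Python dictionaries, as a toy sketch without the `datasets` API:

```python
def rename_columns(rows):
    """Rename "source" -> "pelen_tekst" and "target" -> "streszczenie" in each record."""
    return [
        {"pelen_tekst": row["source"], "streszczenie": row["target"]}
        for row in rows
    ]

rows = [{"source": "pelny tekst artykulu", "target": "streszczenie artykulu"}]
print(rename_columns(rows))
# → [{'pelen_tekst': 'pelny tekst artykulu', 'streszczenie': 'streszczenie artykulu'}]
```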
#### Import the Weaviate client

```python
import weaviate
from weaviate.classes.config import Property, DataType, Configure
from weaviate.util import generate_uuid5
```

#### Connect to the locally running Weaviate vector database

```python
client = weaviate.connect_to_local()
```

#### Optionally delete all data already in the database

```python
client.collections.delete_all()
```
#### Creating the vector database

First we create a new collection in Weaviate:
- we define 2 properties in the collection: `pelen_tekst` and `streszczenie`
- we define 2 named vectorizers:
  - `silver_retriever`: the [ipipan/silver-retriever-base-v1.1](https://huggingface.co/ipipan/silver-retriever-base-v1.1) model
    - configured with an `HNSW` index and `Scalar Quantization` compression enabled
  - `bge_m3`: the [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) model
    - configured with a `flat` index and `Binary Quantization` compression enabled
- we attach the Ollama generative module with the model pulled earlier: `SpeakLeash/bielik-7b-instruct-v0.1-gguf`

```python
collection = client.collections.create(
    name="Articles",
    properties=[
        Property(name="pelen_tekst", data_type=DataType.TEXT),
        Property(name="streszczenie", data_type=DataType.TEXT),
    ],
    vectorizer_config=[
        Configure.NamedVectors.text2vec_transformers(
            name="silver_retriever",
            source_properties=["streszczenie"],
            inference_url="http://t2v-transformers-ipipan-silver-retriever-base-v1.1:8080",
            vectorize_collection_name=False,
            vector_index_config=Configure.VectorIndex.hnsw(
                quantizer=Configure.VectorIndex.Quantizer.sq(),
            ),
        ),
        Configure.NamedVectors.text2vec_transformers(
            name="bge_m3",
            source_properties=["streszczenie"],
            vectorize_collection_name=False,
            vector_index_config=Configure.VectorIndex.flat(
                quantizer=Configure.VectorIndex.Quantizer.bq(),
            ),
        ),
    ],
    generative_config=Configure.Generative.ollama(
        api_endpoint="http://generative-ollama:11434",
        model="SpeakLeash/bielik-7b-instruct-v0.1-gguf",
    ),
)
```
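To give an intuition for the compression settings: Binary Quantization keeps only the sign of each vector dimension, so a float vector collapses to one bit per dimension and distances can be approximated cheaply with Hamming distance. A toy sketch of the idea (illustrative only, not Weaviate's actual implementation):

```python
def binary_quantize(vector):
    """Keep only the sign bit of each dimension: 1 for positive, 0 otherwise."""
    return [1 if x > 0 else 0 for x in vector]

def hamming_distance(a, b):
    """Count the positions where two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))

v1 = [0.12, -0.40, 0.88, -0.05]
v2 = [0.30, -0.10, 0.95, 0.20]

b1, b2 = binary_quantize(v1), binary_quantize(v2)
print(b1, b2)                    # [1, 0, 1, 0] [1, 0, 1, 1]
print(hamming_distance(b1, b2))  # 1
```

Scalar Quantization is the milder variant: instead of one bit, each dimension is mapped to a small integer range, trading a little more memory for better recall.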
#### Importing the data and creating embeddings

With this configuration Weaviate will automatically vectorize the data in the `data` variable and build 2 indexes using the models:
1. index `silver_retriever`, model: [ipipan/silver-retriever-base-v1.1](https://huggingface.co/ipipan/silver-retriever-base-v1.1)
2. index `bge_m3`, model: [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3)

💡 This operation can take a while, because the vectorizers run locally in Docker.

```python
with collection.batch.dynamic() as batch:
    for obj in data:
        batch.add_object(properties=obj, uuid=generate_uuid5(obj["streszczenie"]))
    batch.flush()
```
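`generate_uuid5` derives a deterministic version-5 UUID from the summary text, so re-running the import updates existing objects instead of creating duplicates. The underlying idea with the standard library's `uuid` module (`generate_uuid5` itself may differ in namespace and serialization details):

```python
import uuid

def deterministic_id(text: str) -> uuid.UUID:
    """Derive a stable UUIDv5 from the input text (DNS namespace assumed here)."""
    return uuid.uuid5(uuid.NAMESPACE_DNS, text)

a = deterministic_id("streszczenie artykulu")
b = deterministic_id("streszczenie artykulu")
print(a == b)  # True: the same input always yields the same UUID
```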
Let's check that all 100 objects were imported:

```python
count = collection.aggregate.over_all()
assert count.total_count == 100
```

Let's fetch the first 3 objects and check that they have embeddings:

```python
result = collection.query.fetch_objects(limit=3, include_vector=True)

for i, obj in enumerate(result.objects):
    print(
        f"Object {i+1}\n"
        f"ID: {obj.uuid}\n"
        f"Summary: {obj.properties['streszczenie']}\n"
        f"Vectors created for the summary:\n"
        f"- ipipan/silver-retriever-base-v1.1, dimension: {len(obj.vector['silver_retriever'])}, value: {obj.vector['silver_retriever']}\n"
        f"- BAAI/bge-m3, dimension: {len(obj.vector['bge_m3'])}, value: {obj.vector['bge_m3']}\n"
    )
```
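The next notebook queries these vectors. Under the hood, nearest-neighbour search compares embeddings with a similarity measure such as cosine similarity, sketched here in plain Python:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```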
```python
client.close()
```

(Notebook kernel: Python 3.12.7, `.venv`.)