diff --git a/.gitmodules b/.gitmodules index e69de29bb..18b5bd727 100644 --- a/.gitmodules +++ b/.gitmodules @@ -0,0 +1,3 @@ +[submodule "examples"] + path = examples + url = https://github.com/SuperDuperDB/superduper-community-apps diff --git a/README.md b/README.md index aef415752..cd4c6584c 100644 --- a/README.md +++ b/README.md @@ -254,58 +254,58 @@ Also find use-cases and apps built by the community in the [superduper-community
- + - + - +
- Text-To-Image Search + Text-To-Image Search - Text-To-Video Search + Text-To-Video Search - Question the Docs + Question the Docs
- + - + - +
- Semantic Search Engine + Semantic Search Engine - Classical Machine Learning + Classical Machine Learning - Cross-Framework Transfer Learning + Cross-Framework Transfer Learning
diff --git a/examples b/examples new file mode 160000 index 000000000..a908a70b4 --- /dev/null +++ b/examples @@ -0,0 +1 @@ +Subproject commit a908a70b422ad15ced59127e8a9d2e973cef995f diff --git a/examples/mnist_torch.ipynb b/examples/mnist_torch.ipynb deleted file mode 100644 index e5a119f18..000000000 --- a/examples/mnist_torch.ipynb +++ /dev/null @@ -1,418 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "4b24af19", - "metadata": {}, - "source": [ - "# Training and Maintaining MNIST Predictions with SuperDuperDB" - ] - }, - { - "cell_type": "markdown", - "id": "8905783f", - "metadata": {}, - "source": [ - "## Introduction\n", - "\n", - "This notebook outlines the process of implementing a classic machine learning classification task - MNIST handwritten digit recognition, using a convolutional neural network. However, we introduce a unique twist by performing the task in a database using SuperDuperDB." - ] - }, - { - "cell_type": "markdown", - "source": [ - "## Prerequisites\n", - "\n", - "Before diving into the implementation, ensure that you have the necessary libraries installed by running the following commands:" - ], - "metadata": { - "collapsed": false - }, - "id": "95f897a45b2a02cc" - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a9897997-dee8-4947-9327-b96fe06a5a2c", - "metadata": {}, - "outputs": [], - "source": [ - "!pip install superduperdb\n", - "!pip install torch torchvision matplotlib" - ] - }, - { - "cell_type": "markdown", - "id": "e3812091", - "metadata": {}, - "source": [ - "## Connect to datastore \n", - "\n", - "First, we need to establish a connection to a MongoDB datastore via SuperDuperDB. You can configure the `MongoDB_URI` based on your specific setup. \n", - "Here are some examples of MongoDB URIs:\n", - "\n", - "* For testing (default connection): `mongomock://test`\n", - "* Local MongoDB instance: `mongodb://localhost:27017`\n", - "* MongoDB with authentication: `mongodb://superduper:superduper@mongodb:27017/documents`\n", - "* MongoDB Atlas: `mongodb+srv://:@/`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a28adbce", - "metadata": {}, - "outputs": [], - "source": [ - "from superduperdb import superduper\n", - "from superduperdb.backends.mongodb import Collection\n", - "import os\n", - "\n", - "mongodb_uri = os.getenv(\"MONGODB_URI\",\"mongomock://test\")\n", - "db = superduper(mongodb_uri)\n", - "\n", - "# Create a collection for MNIST\n", - "mnist_collection = Collection('mnist')" - ] - }, - { - "cell_type": "markdown", - "id": "6233e891", - "metadata": {}, - "source": [ - "\n", - "## Load Dataset\n", - "\n", - "After connecting to MongoDB, we add the MNIST dataset. SuperDuperDB excels at handling \"difficult\" data types, and we achieve this using an `Encoder`, which works in tandem with the `Document` wrappers. Together, they enable Python dictionaries containing non-JSONable or bytes objects to be inserted into the underlying data infrastructure. 
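This `Encoder`-`Document` pairing can be sanity-checked without touching the database. A minimal sketch (assuming `Document.unpack` behaves the same on a locally constructed document as on the query results shown below):

```python
from superduperdb import Document
from superduperdb.ext.pillow import pil_image
import PIL.Image

# Wrap a non-JSONable object (a PIL image) so it can be stored as bytes
doc = Document({'img': pil_image(PIL.Image.new('L', (28, 28))), 'class': 3})

# unpack() recovers the original Python objects from the wrapped record
print(type(doc.unpack()['img']))  # PIL.Image.Image
```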
\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bf0934cc", - "metadata": {}, - "outputs": [], - "source": [ - "import torchvision\n", - "from superduperdb.ext.pillow import pil_image\n", - "from superduperdb import Document\n", - "from superduperdb.backends.mongodb import Collection\n", - "\n", - "import random\n", - "\n", - "# Load MNIST images as Python objects using the Python Imaging Library.\n", - "mnist_data = list(torchvision.datasets.MNIST(root='./data', download=True))\n", - "document_list = [Document({'img': pil_image(x[0]), 'class': x[1]}) for x in mnist_data]\n", - "\n", - "# Shuffle the data and select a subset of 1000 documents\n", - "random.shuffle(document_list)\n", - "data = document_list[:1000]\n", - "\n", - "# Insert the selected data into the mnist_collection\n", - "db.execute(\n", - " mnist_collection.insert_many(data[:-100]), # Insert all but the last 100 documents\n", - " encoders=(pil_image,) # Encode images using the Pillow library.\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "c5341135", - "metadata": {}, - "source": [ - "Now that the images and their classes are inserted into the database, we can query the data in its original format. Particularly, we can use the `PIL.Image` instances to inspect the data." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a36f9c3b", - "metadata": {}, - "outputs": [], - "source": [ - "# Get and display one of the images\n", - "r = db.execute(mnist_collection.find_one())\n", - "r.unpack()['img']" - ] - }, - { - "cell_type": "markdown", - "id": "1413d4c5", - "metadata": {}, - "source": [ - "## Build Model" - ] - }, - { - "cell_type": "markdown", - "id": "68fde8bb", - "metadata": {}, - "source": [ - "Next, we create our machine learning model. SuperDuperDB supports various frameworks out of the box, and in this case, we are using PyTorch, which is well-suited for computer vision tasks. 
In this example, we combine torch with torchvision.\n", - "\n", - "We create `postprocess` and `preprocess` functions to handle the communication with the SuperDuperDB `Datalayer`, and then wrap the model, preprocessing, and postprocessing to create a native SuperDuperDB handler.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "cfb425e1", - "metadata": {}, - "outputs": [], - "source": [ - "import torch\n", - "\n", - "class LeNet5(torch.nn.Module):\n", - " def __init__(self, num_classes):\n", - " super().__init__()\n", - " self.layer1 = torch.nn.Sequential(\n", - " torch.nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=0),\n", - " torch.nn.BatchNorm2d(6),\n", - " torch.nn.ReLU(),\n", - " torch.nn.MaxPool2d(kernel_size=2, stride=2))\n", - " self.layer2 = torch.nn.Sequential(\n", - " torch.nn.Conv2d(6, 16, kernel_size=5, stride=1, padding=0),\n", - " torch.nn.BatchNorm2d(16),\n", - " torch.nn.ReLU(),\n", - " torch.nn.MaxPool2d(kernel_size=2, stride=2))\n", - " self.fc = torch.nn.Linear(400, 120)\n", - " self.relu = torch.nn.ReLU()\n", - " self.fc1 = torch.nn.Linear(120, 84)\n", - " self.relu1 = torch.nn.ReLU()\n", - " self.fc2 = torch.nn.Linear(84, num_classes)\n", - "\n", - " def forward(self, x):\n", - " out = self.layer1(x)\n", - " out = self.layer2(out)\n", - " out = out.reshape(out.size(0), -1)\n", - " out = self.fc(out)\n", - " out = self.relu(out)\n", - " out = self.fc1(out)\n", - " out = self.relu1(out)\n", - " out = self.fc2(out)\n", - " return out\n", - "\n", - " \n", - "def postprocess(x):\n", - " return int(x.topk(1)[1].item())\n", - "\n", - "\n", - "def preprocess(x):\n", - " return torchvision.transforms.Compose([\n", - " torchvision.transforms.Resize((32, 32)),\n", - " torchvision.transforms.ToTensor(),\n", - " torchvision.transforms.Normalize(mean=(0.1307,), std=(0.3081,))]\n", - " )(x)\n", - "\n", - "\n", - "# Create and insert a SuperDuperDB model into the database\n", - "model = superduper(LeNet5(10), preprocess=preprocess, postprocess=postprocess, preferred_devices=('cpu',))\n", - "db.add(model)" - ] - }, - { - "cell_type": "markdown", - "id": "dcf0457e", - "metadata": {}, - "source": [ - "## Train Model\n", - "\n", - "Now we are ready to \"train\" or \"fit\" the model. Trainable models in SuperDuperDB come with an sklearn-like `.fit` method. 
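Before training, the `preprocess`/`postprocess` pair defined above can be checked in isolation. A quick sketch (the blank image and random logits are stand-ins introduced here, not part of the original notebook):

```python
import torch
import PIL.Image

img = PIL.Image.new('L', (28, 28))   # dummy MNIST-sized grayscale image
x = preprocess(img)                  # resize -> tensor -> normalize
assert x.shape == (1, 32, 32)        # one channel, 32x32, as LeNet5 expects

logits = torch.randn(10)             # stand-in for one row of model output
print(postprocess(logits))           # an integer class label in [0, 9]
```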
\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e7c610c1", - "metadata": {}, - "outputs": [], - "source": [ - "from torch.nn.functional import cross_entropy\n", - "\n", - "from superduperdb import Metric\n", - "from superduperdb import Dataset\n", - "from superduperdb.ext.torch.model import TorchTrainerConfiguration\n", - "\n", - "# Fit the model to the training data\n", - "job = model.fit(\n", - " X='img', # Feature matrix used as input data \n", - " y='class', # Target variable for training\n", - " db=db, # Database used for data retrieval\n", - " select=mnist_collection.find(), # Select the dataset\n", - " configuration=TorchTrainerConfiguration(\n", - " identifier='my_configuration',\n", - " objective=cross_entropy,\n", - " loader_kwargs={'batch_size': 10},\n", - " max_iterations=10,\n", - " validation_interval=5,\n", - " ),\n", - " metrics=[Metric(identifier='acc', object=lambda x, y: sum([xx == yy for xx, yy in zip(x, y)]) / len(x))],\n", - " validation_sets=[\n", - " Dataset(\n", - " identifier='my_valid',\n", - " select=Collection('mnist').find({'_fold': 'valid'}),\n", - " )\n", - " ],\n", - " distributed=False,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "source": [ - "## Monitoring Training Efficiency\n", - "You can monitor the training efficiency with visualization tools like Matplotlib:" - ], - "metadata": { - "collapsed": false - }, - "id": "fdf5cccb2fe0b97b" - }, - { - "cell_type": "code", - "execution_count": null, - "id": "200d3be1", - "metadata": {}, - "outputs": [], - "source": [ - "from matplotlib import pyplot as plt\n", - "\n", - "# Load the model from the database\n", - "model = db.load('model', model.identifier)\n", - "\n", - "# Plot the accuracy values\n", - "plt.plot(model.metric_values['my_valid/acc'])\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "id": "0199b952", - "metadata": {}, - "source": [ - "\n", - "## On-the-fly Predictions\n", - "Once the model is trained, you can use it to continuously predict on new data as it arrives. This is set up by enabling a `listener` for the database (without loading all the data client-side). The listen toggle activates the model to make predictions on incoming data changes.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f0e53249", - "metadata": {}, - "outputs": [], - "source": [ - "model.predict(\n", - " X='img', # Input feature \n", - " db=db, # Database used for data retrieval\n", - " select=mnist_collection.find(), # Select the dataset\n", - " listen=True, # Continuous predictions on incoming data \n", - " max_chunk_size=100, # Number of predictions to return at once\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "7daae786", - "metadata": {}, - "source": [ - "We can see that predictions are available in `_outputs.img.lenet5`." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bc71a143", - "metadata": {}, - "outputs": [], - "source": [ - "r = db.execute(mnist_collection.find_one({'_fold': 'valid'}))\n", - "r.unpack()" - ] - }, - { - "cell_type": "markdown", - "id": "7a78a2a1", - "metadata": {}, - "source": [ - "## Verification\n", - "\n", - "The activated models can be seen here:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "db.show('listener')" - ], - "metadata": { - "collapsed": false - }, - "id": "2a5308f4a158c931" - }, - { - "cell_type": "markdown", - "source": [ - "We can verify that the model is activated by inserting the rest of the data:" - ], - "metadata": { - "collapsed": false - }, - "id": "dee36a804224cbb6" - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c1aa56d0", - "metadata": {}, - "outputs": [], - "source": [ - "for r in data[-100:]:\n", - " r['update'] = True\n", - "\n", - "db.execute(mnist_collection.insert_many(data[-100:]))" - ] - }, - { - "cell_type": "markdown", - "id": "9eb48a30", - "metadata": {}, - "source": [ - "You can see that the inserted data are now also populated with predictions:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d8161983", - "metadata": {}, - "outputs": [], - "source": [ - "db.execute(mnist_collection.find_one({'update': True}))['_outputs']" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.6" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/examples/multimodal_image_search_clip.ipynb b/examples/multimodal_image_search_clip.ipynb deleted file mode 100644 index 68d7d6885..000000000 --- a/examples/multimodal_image_search_clip.ipynb +++ /dev/null @@ -1,352 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "238520e0", - "metadata": {}, - "source": [ - "# Multimodal Search Using CLIP" - ] - }, - { - "cell_type": "markdown", - "id": "a3590f0e", - "metadata": {}, - "source": [ - "## Introduction\n", - "\n", - "This notebook showcases the capabilities of SuperDuperDB for performing multimodal searches using the `VectorIndex`. SuperDuperDB's flexibility enables users and developers to integrate various models into the system and use them for vectorizing diverse queries during search and inference. In this demonstration, we leverage the [CLIP multimodal architecture](https://openai.com/research/clip)." - ] - }, - { - "cell_type": "markdown", - "id": "40272d6a2681c8e8", - "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } - }, - "source": [ - "## Prerequisites\n", - "\n", - "Before diving into the implementation, ensure that you have the necessary libraries installed by running the following commands:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5ebe1497", - "metadata": {}, - "outputs": [], - "source": [ - "!pip install superduperdb\n", - "!pip install ipython openai-clip\n", - "!pip install -U datasets" - ] - }, - { - "cell_type": "markdown", - "id": "b2f94ae8", - "metadata": {}, - "source": [ - "## Connect to datastore \n", - "\n", - "First, we need to establish a connection to a MongoDB datastore via SuperDuperDB. 
You can configure the `MongoDB_URI` based on your specific setup. \n", - "Here are some examples of MongoDB URIs:\n", - "\n", - "* For testing (default connection): `mongomock://test`\n", - "* Local MongoDB instance: `mongodb://localhost:27017`\n", - "* MongoDB with authentication: `mongodb://superduper:superduper@mongodb:27017/documents`\n", - "* MongoDB Atlas: `mongodb+srv://:@/`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2b5ef986", - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "from superduperdb import superduper\n", - "from superduperdb.backends.mongodb import Collection\n", - "\n", - "mongodb_uri = os.getenv(\"MONGODB_URI\", \"mongomock://test\")\n", - "\n", - "# Super-Duper your Database!\n", - "db = superduper(mongodb_uri, artifact_store='filesystem://./models/')\n", - "\n", - "collection = Collection('multimodal')" - ] - }, - { - "cell_type": "markdown", - "id": "6cd6d6b0", - "metadata": {}, - "source": [ - "## Load Dataset \n", - "\n", - "To make this notebook easily executable and interactive, we'll work with a small sample of images from the [COCO dataset](https://cocodataset.org/). The processes demonstrated here can be applied to larger datasets with higher-resolution images as well. For such use-cases, however, it's advisable to use a machine with a GPU, otherwise there'll be some significant thumb twiddling to do.\n", - "\n", - "To insert images into the database, we utilize the `Encoder`-`Document` framework, which allows saving Python class instances as blobs in the `Datalayer` and retrieving them as Python objects. To this end, SuperDuperDB contains pre-configured support for `PIL.Image` instances. This simplifies the integration of Python AI models with the datalayer. 
It's also possible to create your own encoders.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5f0f14fb-8e79-4bc6-88af-1a800aecb8db", - "metadata": {}, - "outputs": [], - "source": [ - "!curl -O https://superduperdb-public.s3.eu-west-1.amazonaws.com/coco_sample.zip\n", - "!unzip coco_sample.zip" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e41e6faa-6b83-46d8-ab37-6de6fd346ee7", - "metadata": {}, - "outputs": [], - "source": [ - "from superduperdb import Document\n", - "from superduperdb.ext.pillow import pil_image as i\n", - "import glob\n", - "import random\n", - "\n", - "images = glob.glob('images_small/*.jpg')\n", - "documents = [Document({'image': i(uri=f'file://{img}')}) for img in images][:500]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1b3a63bf-9e1f-4266-823a-7a2208937e01", - "metadata": {}, - "outputs": [], - "source": [ - "documents[1]" - ] - }, - { - "cell_type": "markdown", - "id": "c9c7e282", - "metadata": {}, - "source": [ - "The wrapped Python dictionaries may be inserted directly into the `Datalayer`:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c32a91a5", - "metadata": {}, - "outputs": [], - "source": [ - "db.execute(collection.insert_many(documents), encoders=(i,))" - ] - }, - { - "cell_type": "markdown", - "id": "10d37264", - "metadata": {}, - "source": [ - "You can verify that the images are correctly stored as follows:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7282a0fe", - "metadata": {}, - "outputs": [], - "source": [ - "x = db.execute(collection.find_one()).unpack()['image']\n", - "display(x.resize((300, 300 * int(x.size[1] / x.size[0]))))" - ] - }, - { - "cell_type": "markdown", - "id": "dab27b50", - "metadata": {}, - "source": [ - "## Build Models\n", - "\n", - "Now, let's prepare the CLIP model for multimodal search, which involves two components: `text encoding` and `visual encoding`. After adding both components, you can perform searches using both images and text to find matching items:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "916792d3", - "metadata": {}, - "outputs": [], - "source": [ - "import clip\n", - "from superduperdb import vector\n", - "from superduperdb.ext.torch import TorchModel\n", - "\n", - "# Load the CLIP model\n", - "model, preprocess = clip.load(\"RN50\", device='cpu')\n", - "\n", - "# Define a vector\n", - "e = vector(shape=(1024,))\n", - "\n", - "# Create a TorchModel for text encoding\n", - "text_model = TorchModel(\n", - " identifier='clip_text',\n", - " object=model,\n", - " preprocess=lambda x: clip.tokenize(x)[0],\n", - " postprocess=lambda x: x.tolist(),\n", - " encoder=e,\n", - " forward_method='encode_text', \n", - ")\n", - "\n", - "# Create a TorchModel for visual encoding\n", - "visual_model = TorchModel(\n", - " identifier='clip_image',\n", - " object=model.visual, \n", - " preprocess=preprocess,\n", - " postprocess=lambda x: x.tolist(),\n", - " encoder=e,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "b716bcb2", - "metadata": {}, - "source": [ - "## Create a Vector-Search Index\n", - "\n", - "Let's create the index for vector-based searching. We'll register both models with the index simultaneously, but specify that the `visual_model` will be responsible for creating the vectors in the database (`indexing_listener`). 
The `compatible_listener` specifies how an alternative model can be used to search the vectors, enabling multimodal search with models expecting different types of indexes." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c4e0302c", - "metadata": {}, - "outputs": [], - "source": [ - "from superduperdb import VectorIndex\n", - "from superduperdb import Listener\n", - "\n", - "# Create a VectorIndex and add it to the database\n", - "db.add(\n", - " VectorIndex(\n", - " 'my-index',\n", - " indexing_listener=Listener(\n", - " model=visual_model,\n", - " key='image',\n", - " select=collection.find(),\n", - " predict_kwargs={'batch_size': 10},\n", - " ),\n", - " compatible_listener=Listener(\n", - " model=text_model,\n", - " key='text',\n", - " active=False,\n", - " select=None,\n", - " )\n", - " )\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "18971a6d", - "metadata": {}, - "source": [ - "## Search Images Using Text\n", - "\n", - "Now we can demonstrate searching for images using text queries:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ab994b5e", - "metadata": {}, - "outputs": [], - "source": [ - "from IPython.display import display\n", - "from superduperdb import Document\n", - "\n", - "query_string = 'sports'\n", - "\n", - "out = db.execute(\n", - " collection.like(Document({'text': query_string}), vector_index='my-index', n=3).find({})\n", - ")\n", - "\n", - "# Display the images from the search results\n", - "for r in out:\n", - " x = r['image'].x\n", - " display(x.resize((300, int(300 * x.size[1] / x.size[0]))))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b5e3ac22-044f-4675-976a-68ff9b59efe9", - "metadata": {}, - "outputs": [], - "source": [ - "img = db.execute(collection.find_one({}))['image']\n", - "img.x" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c8569e4f-74f2-4ee5-9674-7829b2fcc62b", - "metadata": {}, - "outputs": [], - "source": [ - "cur = db.execute(\n", - " collection.like(Document({'image': img}), vector_index='my-index', n=3).find({})\n", - ")\n", - "\n", - "for r in cur:\n", - " x = r['image'].x\n", - " display(x.resize((300, int(300 * x.size[1] / x.size[0]))))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "806a445f1dfacd90", - "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } - }, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.5" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/examples/question_the_docs.ipynb b/examples/question_the_docs.ipynb deleted file mode 100644 index 1ad4c65b0..000000000 --- a/examples/question_the_docs.ipynb +++ /dev/null @@ -1,450 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "c042ddbb-c2c9-46ed-b36c-c965c0d7ff5b", - "metadata": {}, - "source": [ - "# Building Q&A Assistant Using Mongo and OpenAI" - ] - }, - { - "cell_type": "markdown", - "id": "7e6fbce6-fec9-47af-8701-99721eedec50", - "metadata": {}, - "source": [ - "## Introduction\n", - "\n", - "This notebook is designed to demonstrate how to implement a document Question-and-Answer (Q&A) task using SuperDuperDB in 
conjunction with OpenAI and MongoDB. It provides a step-by-step guide and explanation of each component involved in the process.\n" - ] - }, - { - "cell_type": "markdown", - "source": [ - "## Prerequisites\n", - "\n", - "Before diving into the implementation, ensure that you have the necessary libraries installed by running the following commands:" - ], - "metadata": { - "collapsed": false - }, - "id": "f98f1c7ae8e02278" - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6858da67-597d-4d98-ae4a-41003bb569f4", - "metadata": {}, - "outputs": [], - "source": [ - "!pip install superduperdb\n", - "!pip install ipython openai==0.27.6" - ] - }, - { - "cell_type": "markdown", - "id": "e3befb73", - "metadata": {}, - "source": [ - "Additionally, ensure that you have set your openai API key as an environment variable. You can uncomment the following code and add your API key:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f5bcdade-f988-4464-bfcf-806245031bb3", - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "\n", - "#os.environ['OPENAI_API_KEY'] = 'sk-...'\n", - "\n", - "if 'OPENAI_API_KEY' not in os.environ:\n", - " raise Exception('Environment variable \"OPENAI_API_KEY\" not set')" - ] - }, - { - "cell_type": "markdown", - "source": [ - "## Connect to datastore \n", - "\n", - "First, we need to establish a connection to a MongoDB datastore via SuperDuperDB. You can configure the `MongoDB_URI` based on your specific setup. \n", - "Here are some examples of MongoDB URIs:\n", - "\n", - "* For testing (default connection): `mongomock://test`\n", - "* Local MongoDB instance: `mongodb://localhost:27017`\n", - "* MongoDB with authentication: `mongodb://superduper:superduper@mongodb:27017/documents`\n", - "* MongoDB Atlas: `mongodb+srv://:@/`" - ], - "metadata": { - "collapsed": false - }, - "id": "85c1a0f7572c43ba" - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f42c42cc-af6a-4712-a993-d9c921693819", - "metadata": {}, - "outputs": [], - "source": [ - "from superduperdb import superduper\n", - "from superduperdb.backends.mongodb import Collection\n", - "import os\n", - "\n", - "mongodb_uri = os.getenv(\"MONGODB_URI\",\"mongomock://test\")\n", - "db = superduper(mongodb_uri)\n", - "\n", - "collection = Collection('questiondocs')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3ce857b0-738c-4d7f-bee0-f709c6fc5ddf", - "metadata": {}, - "outputs": [], - "source": [ - "db.metadata" - ] - }, - { - "cell_type": "markdown", - "source": [ - "## Load Dataset \n", - "\n", - "In this example we use the internal textual data from the `superduperdb` project's API documentation. The goal is to create a chatbot that can provide information about the project. You can either load the data from your local project or use the provided data. 
\n", - "\n", - "If you have the SuperDuperDB project locally and want to load the latest version of the API, uncomment the following cell:" - ], - "metadata": { - "collapsed": false - }, - "id": "737497f7d5032bf" - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d72a2a52-964f-456e-88b6-040965f5ed1e", - "metadata": {}, - "outputs": [], - "source": [ - "# import glob\n", - "\n", - "# ROOT = '../docs/hr/content/docs/'\n", - "\n", - "# STRIDE = 3 # stride in numbers of lines\n", - "# WINDOW = 25 # length of window in numbers of lines\n", - "\n", - "# files = sorted(glob.glob(f'{ROOT}/*.md') + glob.glob(f'{ROOT}/*.mdx'))\n", - "\n", - "# content = sum([open(file).read().split('\\n') for file in files], [])\n", - "# chunks = ['\\n'.join(content[i: i + WINDOW]) for i in range(0, len(content), STRIDE)]" - ] - }, - { - "cell_type": "markdown", - "source": [ - "Otherwise, you can load the data from an external source. The chunks of text contain code snippets and explanations, which will be used to build the document Q&A chatbot. " - ], - "metadata": { - "collapsed": false - }, - "id": "c9803aef243ad58c" - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a20bb184-d45b-4647-b3c3-7043db9a3239", - "metadata": {}, - "outputs": [], - "source": [ - "from IPython.display import *\n", - "\n", - "Markdown(chunks[20])" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e587e284-0876-4464-a977-ac97a9070787", - "metadata": {}, - "outputs": [], - "source": [ - "!curl -O https://superduperdb-public.s3.eu-west-1.amazonaws.com/superduperdb_docs.json\n", - "\n", - "import json\n", - "from IPython.display import Markdown\n", - "\n", - "with open('superduperdb_docs.json') as f:\n", - " chunks = json.load(f)" - ] - }, - { - "cell_type": "markdown", - "id": "4f8c4636-88c6-42a4-b471-41be7c20680f", - "metadata": {}, - "source": [ - "You can see that the chunks of text contain bits of code, and explanations, \n", - "which can become useful in building a document Q&A chatbot." - ] - }, - { - "cell_type": "markdown", - "id": "0370732b-0c55-4672-b6be-0830f9a3a755", - "metadata": {}, - "source": [ - "As usual we insert the data. The `Document` wrapper allows `superduperdb` to handle records with special data types such as images,\n", - "video, and custom data-types." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a7208ef2-c035-43b9-a624-ade42a06ed09", - "metadata": {}, - "outputs": [], - "source": [ - "from superduperdb import Document\n", - "\n", - "db.execute(collection.insert_many([Document({'txt': chunk}) for chunk in chunks]))" - ] - }, - { - "cell_type": "markdown", - "id": "4b299b6f-37ae-46d7-b064-7d368d98d68a", - "metadata": {}, - "source": [ - "## Create a Vector-Search Index\n", - "\n", - "To enable question-answering over your documents, we need to setup a standard `superduperdb` vector-search index using `openai` (although there are many options\n", - "here: `torch`, `sentence_transformers`, `transformers`, ...)" - ] - }, - { - "cell_type": "markdown", - "id": "7930b9a1-1483-4106-873c-d85a3920c64e", - "metadata": {}, - "source": [ - "A `Model` is a wrapper around a self-built or ecosystem model, such as `torch`, `transformers`, `openai`." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "56905f2e-485e-4179-8585-34eac26c0751", - "metadata": {}, - "outputs": [], - "source": [ - "from superduperdb.ext.openai import OpenAIEmbedding\n", - "\n", - "model = OpenAIEmbedding(model='text-embedding-ada-002')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6bb05a78-263e-4e6f-b429-8e51dbb932b8", - "metadata": {}, - "outputs": [], - "source": [ - "model.predict('This is a test', one=True)" - ] - }, - { - "cell_type": "markdown", - "id": "4331b81b-c257-4353-aab4-8f601bef78de", - "metadata": {}, - "source": [ - "A `Listener` \"deploys\" a `Model` to \"listen\" to incoming data, and compute outputs, which are saved in the database, via `db`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c1625dab-6438-494b-b74d-efb58bfc8610", - "metadata": {}, - "outputs": [], - "source": [ - "from superduperdb import Listener\n", - "\n", - "listener = Listener(model=model, key='txt', select=collection.find())" - ] - }, - { - "cell_type": "markdown", - "id": "591dad80-3788-441b-96db-a5bf23a16979", - "metadata": {}, - "source": [ - "A `VectorIndex` wraps a `Listener`, making its outputs searchable." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1aa132d0-e6a2-46f6-9eb8-13fbce90ff11", - "metadata": {}, - "outputs": [], - "source": [ - "from superduperdb import VectorIndex\n", - "\n", - "db.add(\n", - " VectorIndex(identifier='my-index', indexing_listener=listener)\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7fde5b17-9d71-4535-aaf6-85f4fa9910e4", - "metadata": {}, - "outputs": [], - "source": [ - "db.execute(collection.find_one())" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "92948823-0d18-4e1b-b103-f226d6b09e52", - "metadata": {}, - "outputs": [], - "source": [ - "from superduperdb.backends.mongodb import Collection\n", - "from superduperdb import Document as D\n", - "from IPython.display import *\n", - "\n", - "query = 'Code snippet how to create a `VectorIndex` with a torchvision model'\n", - "\n", - "result = db.execute(\n", - " collection\n", - " .like(D({'txt': query}), vector_index='my-index', n=5)\n", - " .find()\n", - ")\n", - "\n", - "display(Markdown('---'))\n", - "\n", - "for r in result:\n", - " display(Markdown(r['txt']))\n", - " display(Markdown('---'))" - ] - }, - { - "cell_type": "markdown", - "source": [ - "## Create a Chat-Completion Component\n", - "\n", - "In this step, a chat-completion component is created and added to the system. 
This component is essential for the Q&A functionality:" - ], - "metadata": { - "collapsed": false - }, - "id": "e0922a0dc623d7bf" - }, - { - "cell_type": "code", - "execution_count": null, - "id": "abfa4df6-73ac-4d46-8047-011648e24958", - "metadata": {}, - "outputs": [], - "source": [ - "from superduperdb.ext.openai import OpenAIChatCompletion\n", - "\n", - "chat = OpenAIChatCompletion(\n", - " model='gpt-3.5-turbo',\n", - " prompt=(\n", - " 'Use the following description and code-snippets about SuperDuperDB to answer this question about SuperDuperDB\\n'\n", - " 'Do not use any other information you might have learned about other Python packages\\n'\n", - " 'Only base your answer on the code-snippets retrieved\\n'\n", - " '{context}\\n\\n'\n", - " 'Here\\'s the question:\\n'\n", - " ),\n", - ")\n", - "\n", - "db.add(chat)\n", - "\n", - "print(db.show('model'))" - ] - }, - { - "cell_type": "markdown", - "id": "696ac7bb-eaaf-4bec-9561-603b3c98a736", - "metadata": {}, - "source": [ - "## Ask Questions to Your Docs\n", - "\n", - "Finally, you can ask questions about the documents. You can target specific queries and use the power of MongoDB for vector-search and filtering rules. Here's an example of asking a question:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fc4a0f6c-9e24-47aa-bc73-7cc4507e94ff", - "metadata": {}, - "outputs": [], - "source": [ - "from superduperdb import Document\n", - "from IPython.display import Markdown\n", - "\n", - "# Define the search parameters\n", - "search_term = 'Can you give me a code-snippet to set up a `VectorIndex`?'\n", - "num_results = 5\n", - "\n", - "output, context = db.predict(\n", - " model_name='gpt-3.5-turbo',\n", - " input=search_term,\n", - " context_select=(\n", - " collection\n", - " .like(Document({'txt': search_term}), vector_index='my-index', n=num_results)\n", - " .find()\n", - " ),\n", - " context_key='txt',\n", - ")\n", - "\n", - "Markdown(output.content)" - ] - }, - { - "cell_type": "markdown", - "id": "b3d1fe16-78d7-4c8d-9991-1086cc9e51bb", - "metadata": {}, - "source": [ - "Reset the demo" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ab688589-5180-4a78-8fc3-8d3ddaf11e37", - "metadata": {}, - "outputs": [], - "source": [ - "db.remove('vector_index', 'my-index', force=True)\n", - "db.remove('listener', 'text-embedding-ada-002/txt', force=True)\n", - "db.remove('model', 'text-embedding-ada-002', force=True)" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.6" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/examples/sql-example.ipynb b/examples/sql-example.ipynb deleted file mode 100644 index c1dad4d70..000000000 --- a/examples/sql-example.ipynb +++ /dev/null @@ -1,339 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "f0e29fef", - "metadata": {}, - "source": [ - "# End-2-end example using SQL databases\n", - "\n", - "SuperDuperDB allows users to connect to a MongoDB database, or any one of a range of SQL databases, i.e. 
from this selection:\n", - "\n", - "- MongoDB\n", - "- PostgreSQL\n", - "- SQLite\n", - "- DuckDB\n", - "- BigQuery\n", - "- ClickHouse\n", - "- DataFusion\n", - "- Druid\n", - "- Impala\n", - "- MSSQL\n", - "- MySQL\n", - "- Oracle\n", - "- pandas\n", - "- Polars\n", - "- PySpark\n", - "- Snowflake\n", - "- Trino\n", - "\n", - "In this example we showcase how to implement multimodal vector-search with DuckDB.\n", - "This is a simple extension of multimodal vector-search with MongoDB, which is \n", - "just slightly easier to set up (see [here](https://docs.superduperdb.com/docs/use_cases/items/multimodal_image_search_clip)).\n", - "Everything we do here applies equally to any of the above supported SQL databases, as well as to tabular data formats on disk, such as `pandas`." - ] - }, - { - "cell_type": "markdown", - "id": "ff1db9c6", - "metadata": {}, - "source": [ - "## Prerequisites\n", - "\n", - "Before working on this use-case, make sure that you've installed the software requirements:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "38d752ab", - "metadata": {}, - "outputs": [], - "source": [ - "!pip install superduperdb[demo]" - ] - }, - { - "cell_type": "markdown", - "id": "dfde8264", - "metadata": {}, - "source": [ - "## Connect to datastore\n", - "\n", - "The first step in any `superduperdb` workflow is to connect to your datastore.\n", - "In order to connect to a different datastore, add a different `URI`, e.g. `postgres://...`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b8e7ef91-9eda-4fbd-b34f-b49b5411fc47", - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "from superduperdb import superduper\n", - "\n", - "os.makedirs('.superduperdb', exist_ok=True)\n", - "db = superduper('duckdb://.superduperdb/test.ddb')" - ] - }, - { - "cell_type": "markdown", - "id": "b8794451", - "metadata": {}, - "source": [ - "## Load dataset\n", - "\n", - "Once connected, add some data to the datastore:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b9d2b073-38b0-4d29-aa65-e568f19e7852", - "metadata": {}, - "outputs": [], - "source": [ - "!curl -O https://superduperdb-public.s3.eu-west-1.amazonaws.com/coco_sample.zip\n", - "!curl -O https://superduperdb-public.s3.eu-west-1.amazonaws.com/captions_tiny.json\n", - "!unzip coco_sample.zip\n", - "!mkdir -p data/coco\n", - "!mv images_small data/coco/images" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ec5d36d3-7e74-4c87-92c2-ed1586330858", - "metadata": {}, - "outputs": [], - "source": [ - "import json\n", - "import pandas\n", - "import PIL.Image\n", - "\n", - "with open('captions_tiny.json') as f:\n", - " data = json.load(f)[:500]\n", - " \n", - "data = pandas.DataFrame([\n", - " {\n", - " 'image': r['image']['_content']['path'], \n", - " 'captions': r['captions']\n", - " } for r in data \n", - "])\n", - "data['id'] = pandas.Series(data.index).apply(str)\n", - "images_df = data[['id', 'image']].copy()\n", - "\n", - "images_df['image'] = images_df['image'].apply(PIL.Image.open)\n", - "captions_df = data[['id', 'captions']].explode('captions')" - ] - }, - { - "cell_type": "markdown", - "id": "b43cd7d2", - "metadata": {}, - "source": [ - "## Define schema\n", - "\n", - "This use-case requires a table with images and a table with text. \n", - "SuperDuperDB extends standard SQL functionality by allowing developers to define\n", - "their own data-types via the `Encoder` abstraction."
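For data-types without pre-configured support, the same mechanism can be extended. A hedged sketch of a custom encoder (assuming the `Encoder` constructor that `pil_image` and `vector` are built on; the exact import path may vary between versions):

```python
import pickle
from superduperdb import Encoder

# A custom data-type: any picklable Python object stored as raw bytes
pickled_object = Encoder(
    identifier='pickled-object',
    encoder=lambda x: pickle.dumps(x),   # object -> bytes on insert
    decoder=lambda b: pickle.loads(b),   # bytes -> object on retrieval
)
```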
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2d9483f3-78c5-47df-9fa2-4cd070282791", - "metadata": {}, - "outputs": [], - "source": [ - "from superduperdb.backends.ibis.query import Table\n", - "from superduperdb.backends.ibis.field_types import dtype\n", - "from superduperdb.ext.pillow import pil_image\n", - "from superduperdb import Schema\n", - "\n", - "captions = Table(\n", - " 'captions', \n", - " primary_id='id',\n", - " schema=Schema(\n", - " 'captions-schema',\n", - " fields={'id': dtype(str), 'captions': dtype(str)},\n", - " )\n", - ")\n", - "\n", - "images = Table(\n", - " 'images', \n", - " primary_id='id',\n", - " schema=Schema(\n", - " 'images-schema',\n", - " fields={'id': dtype(str), 'image': pil_image},\n", - " )\n", - ")\n", - "\n", - "db.add(captions)\n", - "db.add(images)" - ] - }, - { - "cell_type": "markdown", - "id": "115b2c14", - "metadata": {}, - "source": [ - "## Add data to the datastore" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "93cce29b-dd04-47d2-bdfc-fe3780e06ddb", - "metadata": {}, - "outputs": [], - "source": [ - "_ = db.execute(images.insert(images_df))\n", - "_ = db.execute(captions.insert(captions_df))" - ] - }, - { - "cell_type": "markdown", - "id": "def10282", - "metadata": {}, - "source": [ - "## Build SuperDuperDB `Model` instances\n", - "\n", - "This use-case uses the `superduperdb.ext.torch` extension. \n", - "Both models output `torch` tensors, which are encoded with `tensor`:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d87ed43d-6f90-46c1-8851-6050ae21a051", - "metadata": {}, - "outputs": [], - "source": [ - "import clip\n", - "import torch\n", - "from superduperdb.ext.torch import TorchModel, tensor\n", - "\n", - "# Load the CLIP model\n", - "model, preprocess = clip.load(\"RN50\", device='cpu')\n", - "\n", - "# Define a tensor type\n", - "t = tensor(torch.float, shape=(1024,))\n", - "\n", - "# Create a TorchModel for text encoding\n", - "text_model = TorchModel(\n", - " identifier='clip_text',\n", - " object=model,\n", - " preprocess=lambda x: clip.tokenize(x)[0],\n", - " encoder=t,\n", - " forward_method='encode_text', \n", - ")\n", - "\n", - "# Create a TorchModel for visual encoding\n", - "visual_model = TorchModel(\n", - " identifier='clip_image',\n", - " object=model.visual, \n", - " preprocess=preprocess,\n", - " encoder=t,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "89c5c236", - "metadata": {}, - "source": [ - "## Create a Vector-Search Index\n", - "\n", - "Let's define a multi-modal search index on the basis of the models defined above.\n", - "The `visual_model` is applied to the images to make the `images` table searchable."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fb8aef2c-484f-41a9-9956-d80b9c58eaa8", - "metadata": {}, - "outputs": [], - "source": [ - "from superduperdb import VectorIndex, Listener\n", - "\n", - "db.add(\n", - " VectorIndex(\n", - " 'my-index',\n", - " indexing_listener=Listener(\n", - " model=visual_model,\n", - " key='image',\n", - " select=images,\n", - " ),\n", - " compatible_listener=Listener(\n", - " model=text_model,\n", - " key='captions',\n", - " active=False,\n", - " select=None,\n", - " )\n", - " )\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "4d8b9e84", - "metadata": {}, - "source": [ - "## Search Images Using Text\n", - "\n", - "Now we can demonstrate searching for images using text queries:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1731a574-921a-4cba-a65c-26bff9fb9c8c", - "metadata": {}, - "outputs": [], - "source": [ - "from superduperdb import Document\n", - "\n", - "res = db.execute(\n", - " images\n", - " .like(Document({'captions': 'dog catches frisbee'}), vector_index='my-index', n=10)\n", - " .limit(10)\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "72522031-0af8-452a-bbd1-b27dede55154", - "metadata": {}, - "outputs": [], - "source": [ - "res[3]['image'].x" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.6" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/examples/transfer_learning.ipynb b/examples/transfer_learning.ipynb deleted file mode 100644 index 1aeea0dcb..000000000 --- a/examples/transfer_learning.ipynb +++ /dev/null @@ -1,267 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "source": [ - "# Transfer Learning with Sentence Transformers and Scikit-Learn" - ], - "metadata": { - "collapsed": false - }, - "id": "fe6fd0ab0e1ad844" - }, - { - "cell_type": "markdown", - "source": [ - "## Introduction\n", - "\n", - "In this notebook, we will explore the process of transfer learning using SuperDuperDB. We will demonstrate how to connect to a MongoDB datastore, load a dataset, create a SuperDuperDB model based on Sentence Transformers, train a downstream model using Scikit-Learn, and apply the trained model to the database. Transfer learning is a powerful technique that can be used in various applications, such as vector search and downstream learning tasks." 
- ], - "metadata": { - "collapsed": false - }, - "id": "8dcde44d942793ff" - }, - { - "cell_type": "markdown", - "source": [ - "## Prerequisites\n", - "\n", - "Before diving into the implementation, ensure that you have the necessary libraries installed by running the following commands:" - ], - "metadata": { - "collapsed": false - }, - "id": "1809feca8a8dca5a" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "!pip install superduperdb\n", - "!pip install ipython numpy datasets sentence-transformers" - ], - "metadata": { - "collapsed": false - }, - "id": "94f3219ad932a327" - }, - { - "cell_type": "markdown", - "id": "6bc151f6", - "metadata": {}, - "source": [ - "## Connect to datastore " - ] - }, - { - "cell_type": "markdown", - "source": [ - "First, we need to establish a connection to a MongoDB datastore via SuperDuperDB. You can configure the `MongoDB_URI` based on your specific setup. \n", - "Here are some examples of MongoDB URIs:\n", - "\n", - "* For testing (default connection): `mongomock://test`\n", - "* Local MongoDB instance: `mongodb://localhost:27017`\n", - "* MongoDB with authentication: `mongodb://superduper:superduper@mongodb:27017/documents`\n", - "* MongoDB Atlas: `mongodb+srv://:@/`" - ], - "metadata": { - "collapsed": false - }, - "id": "5379007991707d17" - }, - { - "cell_type": "code", - "execution_count": null, - "id": "44f8ef76", - "metadata": {}, - "outputs": [], - "source": [ - "from superduperdb import superduper\n", - "from superduperdb.backends.mongodb import Collection\n", - "import os\n", - "\n", - "mongodb_uri = os.getenv(\"MONGODB_URI\",\"mongomock://test\")\n", - "db = superduper(mongodb_uri)\n", - "\n", - "collection = Collection('transfer')" - ] - }, - { - "cell_type": "markdown", - "id": "97fede97", - "metadata": {}, - "source": [ - "## Load Dataset\n", - "\n", - "Transfer learning can be applied to any data that can be processed with SuperDuperDB models.\n", - "For our example, we will use a labeled textual dataset with sentiment analysis. We'll load a subset of the IMDb dataset." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8bb65106", - "metadata": {}, - "outputs": [], - "source": [ - "import numpy\n", - "from datasets import load_dataset\n", - "from superduperdb import Document as D\n", - "\n", - "data = load_dataset(\"imdb\")\n", - "\n", - "N_DATAPOINTS = 500 # Increase for higher quality\n", - "\n", - "train_data = [\n", - " D({'_fold': 'train', **data['train'][int(i)]}) \n", - " for i in numpy.random.permutation(len(data['train']))\n", - "][:N_DATAPOINTS]\n", - "\n", - "valid_data = [\n", - " D({'_fold': 'valid', **data['test'][int(i)]}) \n", - " for i in numpy.random.permutation(len(data['test']))\n", - "][:N_DATAPOINTS // 10]\n", - "\n", - "db.execute(collection.insert_many(train_data))" - ] - }, - { - "cell_type": "markdown", - "id": "00a92214", - "metadata": {}, - "source": [ - "## Run Model\n", - "\n", - "We'll create a SuperDuperDB model based on the `sentence_transformers` library. This demonstrates that you don't necessarily need a native SuperDuperDB integration with a model library to leverage its power. We configure the `Model wrapper` to work with the `SentenceTransformer class`. After configuration, we can link the model to a collection and daemonize the model with the `listen=True` keyword." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fef91c74", - "metadata": {}, - "outputs": [], - "source": [ - "from superduperdb import Model\n", - "import sentence_transformers\n", - "from superduperdb.ext.numpy import array\n", - "\n", - "m = Model(\n", - " identifier='all-MiniLM-L6-v2',\n", - " object=sentence_transformers.SentenceTransformer('all-MiniLM-L6-v2'),\n", - " encoder=array('float32', shape=(384,)),\n", - " predict_method='encode',\n", - " batch_predict=True,\n", - ")\n", - "\n", - "m.predict(\n", - " X='text',\n", - " db=db,\n", - " select=collection.find(),\n", - " listen=True\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "68fefc17", - "metadata": {}, - "source": [ - "## Train Downstream Model\n", - "Now that we've created and added the model that computes features for the `\"text\"`, we can train a downstream model using Scikit-Learn." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8c2faeeb", - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.svm import SVC\n", - "\n", - "model = superduper(\n", - " SVC(gamma='scale', class_weight='balanced', C=100, verbose=True),\n", - " postprocess=lambda x: int(x)\n", - ")\n", - "\n", - "model.fit(\n", - " X='text',\n", - " y='label',\n", - " db=db,\n", - " select=collection.find().featurize({'text': 'all-MiniLM-L6-v2'}),\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "d1e1f164", - "metadata": {}, - "source": [ - "## Run Downstream Model\n", - "\n", - "With the model trained, we can now apply it to the database. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "eee16436", - "metadata": {}, - "outputs": [], - "source": [ - "model.predict(\n", - " X='text',\n", - " db=db,\n", - " select=collection.find().featurize({'text': 'all-MiniLM-L6-v2'}),\n", - " listen=True,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "67b156c1", - "metadata": {}, - "source": [ - "## Verification\n", - "\n", - "To verify that the process has worked, we can sample a few records to inspect the sanity of the predictions." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "76958a1e", - "metadata": {}, - "outputs": [], - "source": [ - "r = next(db.execute(collection.aggregate([{'$sample': {'size': 1}}])))\n", - "print(r['text'][:100])\n", - "print(r['_outputs']['text']['svc'])" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.6" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/examples/vector_search.ipynb b/examples/vector_search.ipynb deleted file mode 100644 index cf4172a6e..000000000 --- a/examples/vector_search.ipynb +++ /dev/null @@ -1,324 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "0d352545-a8c6-45ad-8359-c9b6edd2b7d2", - "metadata": {}, - "source": [ - "# Vector-search with SuperDuperDB\n", - "\n", - "## Introduction\n", - "This notebook provides a detailed guide on performing vector search using SuperDuperDB. Vector search is a powerful technique for searching and retrieving documents based on their similarity to a query vector. In this guide, we will demonstrate how to set up SuperDuperDB for vector search and use it to search a dataset of documents." 
- ] - }, - { - "cell_type": "markdown", - "id": "f283b5675bea4619", - "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } - }, - "source": [ - "## Prerequisites\n", - "\n", - "Before diving into the implementation, ensure that you have the necessary libraries installed by running the following commands:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c1f9e69e-75f4-42f9-a48d-b1f68f02646d", - "metadata": {}, - "outputs": [], - "source": [ - "!pip install superduperdb\n", - "!pip install ipython" - ] - }, - { - "cell_type": "markdown", - "id": "f79d7ef8-46eb-4210-8d96-a09648314e37", - "metadata": {}, - "source": [ - "Additionally, ensure that you have set your OpenAI API key as an environment variable. You can uncomment the following code and add your API key:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a1c8e68c-045f-44b8-bfbf-4c9dff5cf30c", - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "\n", - "#os.environ['OPENAI_API_KEY'] = 'sk-...'\n", - "\n", - "if 'OPENAI_API_KEY' not in os.environ:\n", - " raise Exception('You need to set an OpenAI key as environment variable: \"export OPENAI_API_KEY=sk-...\"')" - ] - }, - { - "cell_type": "markdown", - "id": "4db1c2b4-e0b3-420f-ba1c-bd49655bff2b", - "metadata": {}, - "source": [ - "## Connect to datastore \n", - "\n", - "First, we need to establish a connection to a MongoDB datastore via SuperDuperDB. You can configure the `MongoDB_URI` based on your specific setup. \n", - "Here are some examples of MongoDB URIs:\n", - "\n", - "* For testing (default connection): `mongomock://test`\n", - "* Local MongoDB instance: `mongodb://localhost:27017`\n", - "* MongoDB with authentication: `mongodb://superduper:superduper@mongodb:27017/documents`\n", - "* MongoDB Atlas: `mongodb+srv://:@/`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8e097557-7c50-4442-9e38-1df8a9d8f211", - "metadata": {}, - "outputs": [], - "source": [ - "from superduperdb import superduper\n", - "from superduperdb.backends.mongodb import Collection\n", - "import os\n", - "\n", - "mongodb_uri = os.getenv(\"MONGODB_URI\",\"mongomock://test\")\n", - "db = superduper(mongodb_uri, artifact_store='filesystem://./data/')\n", - "\n", - "doc_collection = Collection('documents')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f41b3a35-760e-49aa-8387-6a5efb990ea5", - "metadata": {}, - "outputs": [], - "source": [ - "db.metadata" - ] - }, - { - "cell_type": "markdown", - "id": "6bb5ee2b-f0bb-4660-961d-fdf98833f33d", - "metadata": {}, - "source": [ - "## Load Dataset \n", - "\n", - "We have prepared a dataset, which is the inline documentation of the pymongo API. 
Let's load this dataset:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "049b4122-b2c9-4ca5-be3c-df788912ce34", - "metadata": {}, - "outputs": [], - "source": [ - "!curl -O https://superduperdb-public.s3.eu-west-1.amazonaws.com/pymongo.json\n", - "\n", - "import json\n", - "\n", - "with open('pymongo.json') as f:\n", - " data = json.load(f)" - ] - }, - { - "cell_type": "markdown", - "id": "420ef3662c07d91e", - "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } - }, - "source": [ - "As usual, we insert the data:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "468ec3dc-fa1f-4c23-b569-456b8900b72c", - "metadata": {}, - "outputs": [], - "source": [ - "from superduperdb import Document\n", - "\n", - "db.execute(doc_collection.insert_many([Document(r) for r in data]))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "aba4aa66-2aec-4986-b263-c510788bf478", - "metadata": {}, - "outputs": [], - "source": [ - "db.execute(Collection('documents').find_one())" - ] - }, - { - "cell_type": "markdown", - "id": "e7f78f2d-86c0-463f-8eb1-630cd65d48ef", - "metadata": {}, - "source": [ - "## Create Vectors\n", - "\n", - "In the remainder of the notebook, you can choose between using the `openai` or `sentence_transformers` libraries to perform vector search. After instantiating the model wrappers, the rest of the notebook remains identical.\n", - "\n", - "For OpenAI vectors:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a0a30873-b7fc-4ec5-ace9-f3d4ca01bab2", - "metadata": {}, - "outputs": [], - "source": [ - "from superduperdb.ext.openai.model import OpenAIEmbedding\n", - "\n", - "model = OpenAIEmbedding(model='text-embedding-ada-002')" - ] - }, - { - "cell_type": "markdown", - "id": "7e8d1d264dd7ba1b", - "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } - }, - "source": [ - "For Sentence-Transformers vectors, uncomment the following section:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a14c5c5f-c770-4a94-884c-3705f1d0a627", - "metadata": {}, - "outputs": [], - "source": [ - "#import sentence_transformers\n", - "#from superduperdb import Model, vector\n", - "\n", - "#model = Model(\n", - "# identifier='all-MiniLM-L6-v2', \n", - "# object=sentence_transformers.SentenceTransformer('all-MiniLM-L6-v2'),\n", - "# encoder=vector(shape=(384,)),\n", - "# predict_method='encode', # Specify the prediction method\n", - "# postprocess=lambda x: x.tolist(), # Define postprocessing function\n", - "# batch_predict=True, # Generate predictions for a set of observations all at once \n", - "#)" - ] - }, - { - "cell_type": "markdown", - "id": "b546308d-45f2-4605-8778-7aca46fe3c7c", - "metadata": {}, - "source": [ - "## Index Vectors\n", - "\n", - "Now we can configure the Atlas vector-search index. This command saves and sets up a model to `listen` to a particular subfield (or the whole document) for new text, converts it on the fly to vectors, and then indexes these vectors using Atlas vector-search." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "46ce7a59-cdd2-46e5-a218-77ef73df7a95", - "metadata": {}, - "outputs": [], - "source": [ - "from superduperdb import Listener, VectorIndex\n", - "\n", - "db.add(\n", - " VectorIndex(\n", - " identifier=f'pymongo-docs-{model.identifier}',\n", - " indexing_listener=Listener(\n", - " select=doc_collection.find(),\n", - " key='value',\n", - " model=model,\n", - " predict_kwargs={'max_chunk_size': 1000},\n", - " ),\n", - " )\n", - ")\n", - "\n", - "db.show('vector_index')" - ] - }, - { - "cell_type": "markdown", - "id": "8aea59ff7e8ef67c", - "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } - }, - "source": [ - "## Perform Vector Search\n", - "\n", - "Now that the index is set up, we can use it in a query. SuperDuperDB provides some syntactic sugar for the `aggregate` search pipelines, which can be helpful. It also handles all the conversion of inputs to vectors under the hood." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f0184fb1-10ae-4488-9e93-c56b5fcd9ac2", - "metadata": {}, - "outputs": [], - "source": [ - "from superduperdb import Document\n", - "from IPython.display import *\n", - "\n", - "# Define the search parameters\n", - "search_term = 'Query the database'\n", - "num_results = 5\n", - "\n", - "# Execute the query\n", - "result = db.execute(doc_collection\n", - " .like(Document({'value': search_term}), vector_index=f'pymongo-docs-{model.identifier}', n=num_results)\n", - " .find()\n", - ")\n", - "\n", - "# Display a horizontal line\n", - "display(Markdown('---'))\n", - "\n", - "# Iterate through the query results and display them\n", - "for r in result:\n", - " display(Markdown(f'### `{r[\"parent\"] + \".\" if r[\"parent\"] else \"\"}{r[\"res\"]}`'))\n", - " display(Markdown(r['value']))\n", - " display(Markdown('---'))" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.5" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/examples/video_search.ipynb b/examples/video_search.ipynb deleted file mode 100644 index f405d5288..000000000 --- a/examples/video_search.ipynb +++ /dev/null @@ -1,549 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "a58575c6-59c4-4289-869f-f5a1ac7e021c", - "metadata": {}, - "source": [ - "# Search within videos with text\n", - "\n", - "## Introduction\n", - "This notebook outlines the process of searching for specific textual information within videos and retrieving relevant video segments. To accomplish this, we utilize various libraries and techniques, such as:\n", - "* clip: A library for vision and language understanding.\n", - "* PIL: Python Imaging Library for image processing.\n", - "* torch: The PyTorch library for deep learning." 
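As a brief aside, here is a minimal standalone sketch of the CLIP mechanics this notebook builds on: an image and a text prompt are embedded into the same vector space and compared with cosine similarity. The file name `frame.jpg` and the prompt are placeholders, not part of the original notebook.

```python
import clip
import torch
from PIL import Image

# Load the same CLIP variant used later in this notebook
model, preprocess = clip.load("RN50", device="cpu")

image = preprocess(Image.open("frame.jpg")).unsqueeze(0)  # placeholder frame
text = clip.tokenize(["some ducks swimming in a pond"])   # placeholder prompt

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# Higher cosine similarity means the frame matches the text better
print(torch.cosine_similarity(image_features, text_features).item())
```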
- ] - }, - { - "cell_type": "markdown", - "id": "6eec562900dd0cff", - "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } - }, - "source": [ - "## Prerequisites\n", - "\n", - "Before diving into the implementation, ensure that you have the necessary libraries installed by running the following commands:" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "id": "ab56c57e-fa04-43dd-9670-ade9b5c6d4ac", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Requirement already satisfied: ipython in /Users/dodo/SuperDuperDB/superduperdb/.venv/lib/python3.11/site-packages (8.17.2)\n", - "Collecting opencv-python\n", - " Obtaining dependency information for opencv-python from https://files.pythonhosted.org/packages/05/58/7ee92b21cb98689cbe28c69e3cf8ee51f261bfb6bc904ae578736d22d2e7/opencv_python-4.8.1.78-cp37-abi3-macosx_10_16_x86_64.whl.metadata\n", - " Using cached opencv_python-4.8.1.78-cp37-abi3-macosx_10_16_x86_64.whl.metadata (19 kB)\n", - "Requirement already satisfied: pillow in /Users/dodo/SuperDuperDB/superduperdb/.venv/lib/python3.11/site-packages (10.1.0)\n", - "Collecting openai-clip\n", - " Using cached openai_clip-1.0.1-py3-none-any.whl\n", - "Requirement already satisfied: decorator in /Users/dodo/SuperDuperDB/superduperdb/.venv/lib/python3.11/site-packages (from ipython) (5.1.1)\n", - "Requirement already satisfied: jedi>=0.16 in /Users/dodo/SuperDuperDB/superduperdb/.venv/lib/python3.11/site-packages (from ipython) (0.19.1)\n", - "Requirement already satisfied: matplotlib-inline in /Users/dodo/SuperDuperDB/superduperdb/.venv/lib/python3.11/site-packages (from ipython) (0.1.6)\n", - "Requirement already satisfied: prompt-toolkit!=3.0.37,<3.1.0,>=3.0.30 in /Users/dodo/SuperDuperDB/superduperdb/.venv/lib/python3.11/site-packages (from ipython) (3.0.39)\n", - "Requirement already satisfied: pygments>=2.4.0 in /Users/dodo/SuperDuperDB/superduperdb/.venv/lib/python3.11/site-packages (from ipython) (2.16.1)\n", - "Requirement already satisfied: stack-data in /Users/dodo/SuperDuperDB/superduperdb/.venv/lib/python3.11/site-packages (from ipython) (0.6.3)\n", - "Requirement already satisfied: traitlets>=5 in /Users/dodo/SuperDuperDB/superduperdb/.venv/lib/python3.11/site-packages (from ipython) (5.13.0)\n", - "Requirement already satisfied: pexpect>4.3 in /Users/dodo/SuperDuperDB/superduperdb/.venv/lib/python3.11/site-packages (from ipython) (4.8.0)\n", - "Requirement already satisfied: appnope in /Users/dodo/SuperDuperDB/superduperdb/.venv/lib/python3.11/site-packages (from ipython) (0.1.3)\n", - "Requirement already satisfied: numpy>=1.21.2 in /Users/dodo/SuperDuperDB/superduperdb/.venv/lib/python3.11/site-packages (from opencv-python) (1.26.1)\n", - "Requirement already satisfied: ftfy in /Users/dodo/SuperDuperDB/superduperdb/.venv/lib/python3.11/site-packages (from openai-clip) (6.1.1)\n", - "Requirement already satisfied: regex in /Users/dodo/SuperDuperDB/superduperdb/.venv/lib/python3.11/site-packages (from openai-clip) (2023.10.3)\n", - "Requirement already satisfied: tqdm in /Users/dodo/SuperDuperDB/superduperdb/.venv/lib/python3.11/site-packages (from openai-clip) (4.66.1)\n", - "Requirement already satisfied: parso<0.9.0,>=0.8.3 in /Users/dodo/SuperDuperDB/superduperdb/.venv/lib/python3.11/site-packages (from jedi>=0.16->ipython) (0.8.3)\n", - "Requirement already satisfied: ptyprocess>=0.5 in /Users/dodo/SuperDuperDB/superduperdb/.venv/lib/python3.11/site-packages (from pexpect>4.3->ipython) 
(0.7.0)\n", - "Requirement already satisfied: wcwidth in /Users/dodo/SuperDuperDB/superduperdb/.venv/lib/python3.11/site-packages (from prompt-toolkit!=3.0.37,<3.1.0,>=3.0.30->ipython) (0.2.9)\n", - "Requirement already satisfied: executing>=1.2.0 in /Users/dodo/SuperDuperDB/superduperdb/.venv/lib/python3.11/site-packages (from stack-data->ipython) (2.0.1)\n", - "Requirement already satisfied: asttokens>=2.1.0 in /Users/dodo/SuperDuperDB/superduperdb/.venv/lib/python3.11/site-packages (from stack-data->ipython) (2.4.1)\n", - "Requirement already satisfied: pure-eval in /Users/dodo/SuperDuperDB/superduperdb/.venv/lib/python3.11/site-packages (from stack-data->ipython) (0.2.2)\n", - "Requirement already satisfied: six>=1.12.0 in /Users/dodo/SuperDuperDB/superduperdb/.venv/lib/python3.11/site-packages (from asttokens>=2.1.0->stack-data->ipython) (1.16.0)\n", - "Using cached opencv_python-4.8.1.78-cp37-abi3-macosx_10_16_x86_64.whl (54.7 MB)\n", - "Installing collected packages: opencv-python, openai-clip\n", - "Successfully installed openai-clip-1.0.1 opencv-python-4.8.1.78\n", - "\n", - "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.2.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.3.1\u001b[0m\n", - "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n" - ] - } - ], - "source": [ - "# !pip install superduperdb\n", - "!pip install ipython opencv-python pillow openai-clip" - ] - }, - { - "cell_type": "markdown", - "id": "f559fff0-df68-473a-94a2-afe39e4d5577", - "metadata": {}, - "source": [ - "## Connect to datastore \n", - "\n", - "First, we need to establish a connection to a MongoDB datastore via SuperDuperDB. You can configure the `MongoDB_URI` based on your specific setup. \n", - "Here are some examples of MongoDB URIs:\n", - "\n", - "* For testing (default connection): `mongomock://test`\n", - "* Local MongoDB instance: `mongodb://localhost:27017`\n", - "* MongoDB with authentication: `mongodb://superduper:superduper@mongodb:27017/documents`\n", - "* MongoDB Atlas: `mongodb+srv://:@/`" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "99de0e3d-8918-4fc4-a45b-0a58b70793c6", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\u001b[32m 2023-Nov-14 13:53:39.83\u001b[0m| \u001b[32m\u001b[1mSUCCESS \u001b[0m | \u001b[36mDuncans-MacBook-Pro.local\u001b[0m| \u001b[36msuperduperdb.base.build\u001b[0m:\u001b[36m69 \u001b[0m | \u001b[32m\u001b[1mInitializing DataBackend Client: mongomock.MongoClient('localhost', 27017)\u001b[0m\n" - ] - } - ], - "source": [ - "from superduperdb import superduper\n", - "from superduperdb.backends.mongodb import Collection\n", - "from superduperdb import CFG\n", - "import os\n", - "\n", - "CFG.downloads.hybrid = True\n", - "CFG.downloads.root = './'\n", - "\n", - "mongodb_uri = os.getenv(\"MONGODB_URI\",\"mongomock://test\")\n", - "db = superduper(mongodb_uri, artifact_store='filesystem://./data/')\n", - "\n", - "video_collection = Collection('videos')" - ] - }, - { - "cell_type": "markdown", - "id": "1e53ce4113115246", - "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } - }, - "source": [ - "## Load Dataset\n", - "\n", - "We'll begin by configuring a video encoder." 
- ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "ebac4921-5c83-4ba7-b793-67f5f90d42ec", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[]" - ] - }, - "execution_count": 2, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from superduperdb import Encoder\n", - "\n", - "vid_enc = Encoder(\n", - " identifier='video_on_file',\n", - " load_hybrid=False,\n", - ")\n", - "\n", - "db.add(vid_enc)" - ] - }, - { - "cell_type": "markdown", - "id": "bf1cef0e-21ac-4291-b2c8-41065717ee67", - "metadata": {}, - "source": [ - "Now, let's retrieve a sample video from the internet and insert it into our collection." - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "id": "ee6335cb-960d-4239-be6e-501d52b88026", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\u001b[32m 2023-Nov-14 13:53:42.29\u001b[0m| \u001b[1mINFO \u001b[0m | \u001b[36mDuncans-MacBook-Pro.local\u001b[0m| \u001b[36msuperduperdb.misc.download\u001b[0m:\u001b[36m358 \u001b[0m | \u001b[1mfound 1 uris\u001b[0m\n", - "\u001b[32m 2023-Nov-14 13:53:42.54\u001b[0m| \u001b[1mINFO \u001b[0m | \u001b[36mDuncans-MacBook-Pro.local\u001b[0m| \u001b[36msuperduperdb.misc.download\u001b[0m:\u001b[36m125 \u001b[0m | \u001b[1mnumber of workers 0\u001b[0m\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.10it/s]\n" - ] - }, - { - "data": { - "text/plain": [ - "[Document({'video': Encodable(encoder=Encoder(identifier='video_on_file', decoder=, encoder=, shape=None, version=0, load_hybrid=False), x=None, uri='https://superduperdb-public.s3.eu-west-1.amazonaws.com/animals_excerpt.mp4'), '_fold': 'train', '_id': ObjectId('65536dd6b0e451df3a649bc4')})]" - ] - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from superduperdb.base.document import Document\n", - "\n", - "db.execute(video_collection.insert_one(\n", - " Document({'video': vid_enc(uri='https://superduperdb-public.s3.eu-west-1.amazonaws.com/animals_excerpt.mp4')})\n", - " )\n", - ")\n", - "\n", - "# Display the list of videos in the collection\n", - "list(db.execute(Collection('videos').find()))" - ] - }, - { - "cell_type": "markdown", - "id": "441fe6d6a9dee06b", - "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } - }, - "source": [ - "## Register Encoders\n", - "\n", - "Next, we'll create encoders for processing videos and extracting frames. This encoder will help us convert videos into individual frames." 
- ] - }, - { - "cell_type": "code", - "execution_count": 6, - "id": "2af2d178-9ff2-496d-8293-e5aee3f12a19", - "metadata": {}, - "outputs": [], - "source": [ - "import cv2\n", - "import tqdm\n", - "from PIL import Image\n", - "from superduperdb.ext.pillow import pil_image\n", - "from superduperdb import Model, Schema\n", - "\n", - "\n", - "def video2images(video_file):\n", - " sample_freq = 10\n", - " cap = cv2.VideoCapture(video_file)\n", - "\n", - " frame_count = 0\n", - "\n", - " fps = cap.get(cv2.CAP_PROP_FPS)\n", - " print(fps)\n", - " extracted_frames = []\n", - " progress = tqdm.tqdm()\n", - "\n", - " while True:\n", - " ret, frame = cap.read()\n", - " if not ret:\n", - " break\n", - " current_timestamp = frame_count // fps\n", - " \n", - " if frame_count % sample_freq == 0:\n", - " extracted_frames.append({\n", - " 'image': Image.fromarray(frame[:,:,::-1]),\n", - " 'current_timestamp': current_timestamp,\n", - " })\n", - " frame_count += 1 \n", - " progress.update(1)\n", - " \n", - " cap.release()\n", - " cv2.destroyAllWindows()\n", - " return extracted_frames\n", - "\n", - "\n", - "video2images = Model(\n", - " identifier='video2images',\n", - " object=video2images,\n", - " flatten=True,\n", - " model_update_kwargs={'document_embedded': False},\n", - " output_schema=Schema(identifier='myschema', fields={'image': pil_image})\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "19a28dbe-dcec-4c6b-bd2d-72dbd48daf39", - "metadata": {}, - "source": [ - "We'll also set up a listener to continuously download video URLs and save the best frames into another collection." - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "id": "a30d093b-03d3-4bdb-aa8b-46ff974d1995", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\u001b[32m 2023-Nov-14 13:54:27.17\u001b[0m| \u001b[1mINFO \u001b[0m | \u001b[36mDuncans-MacBook-Pro.local\u001b[0m| \u001b[36msuperduperdb.components.model\u001b[0m:\u001b[36m207 \u001b[0m | \u001b[1mAdding model video2images to db\u001b[0m\n", - "\u001b[32m 2023-Nov-14 13:54:27.17\u001b[0m| \u001b[1mINFO \u001b[0m | \u001b[36mDuncans-MacBook-Pro.local\u001b[0m| \u001b[36msuperduperdb.components.model\u001b[0m:\u001b[36m210 \u001b[0m | \u001b[1mDone.\u001b[0m\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "1it [00:00, 1916.08it/s]\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "30.0\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "900it [00:00, 1844.03it/s]\n" - ] - }, - { - "ename": "InvalidDocument", - "evalue": "documents must have only string keys, key was 0", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mInvalidDocument\u001b[0m Traceback (most recent call last)", - "Cell \u001b[0;32mIn[7], line 3\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01msuperduperdb\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m Listener\n\u001b[0;32m----> 3\u001b[0m \u001b[43mdb\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43madd\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 4\u001b[0m \u001b[43m \u001b[49m\u001b[43mListener\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 5\u001b[0m \u001b[43m \u001b[49m\u001b[43mmodel\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mvideo2images\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 6\u001b[0m \u001b[43m 
\u001b[49m\u001b[43mselect\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mvideo_collection\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mfind\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 7\u001b[0m \u001b[43m \u001b[49m\u001b[43mkey\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mvideo\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[1;32m 8\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 9\u001b[0m \u001b[43m)\u001b[49m\n\u001b[1;32m 11\u001b[0m db\u001b[38;5;241m.\u001b[39mexecute(Collection(\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m_outputs.video.video2images\u001b[39m\u001b[38;5;124m'\u001b[39m)\u001b[38;5;241m.\u001b[39mfind_one())\u001b[38;5;241m.\u001b[39munpack()[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m_outputs\u001b[39m\u001b[38;5;124m'\u001b[39m][\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mvideo\u001b[39m\u001b[38;5;124m'\u001b[39m][\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mvideo2images\u001b[39m\u001b[38;5;124m'\u001b[39m][\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mimage\u001b[39m\u001b[38;5;124m'\u001b[39m]\n", - "File \u001b[0;32m~/SuperDuperDB/superduperdb/superduperdb/base/datalayer.py:491\u001b[0m, in \u001b[0;36mDatalayer.add\u001b[0;34m(self, object, dependencies)\u001b[0m\n\u001b[1;32m 483\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mtype\u001b[39m(\u001b[38;5;28mobject\u001b[39m)(\n\u001b[1;32m 484\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_add(\n\u001b[1;32m 485\u001b[0m \u001b[38;5;28mobject\u001b[39m\u001b[38;5;241m=\u001b[39mcomponent,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 488\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m component \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mobject\u001b[39m\n\u001b[1;32m 489\u001b[0m )\n\u001b[1;32m 490\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(\u001b[38;5;28mobject\u001b[39m, Component):\n\u001b[0;32m--> 491\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_add\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mobject\u001b[39;49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43mobject\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdependencies\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdependencies\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 492\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 493\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[1;32m 494\u001b[0m \u001b[38;5;124m'\u001b[39m\u001b[38;5;124mobject should be a sequence of `Component` or `Component`\u001b[39m\u001b[38;5;124m'\u001b[39m\n\u001b[1;32m 495\u001b[0m )\n", - "File \u001b[0;32m~/SuperDuperDB/superduperdb/superduperdb/base/datalayer.py:855\u001b[0m, in \u001b[0;36mDatalayer._add\u001b[0;34m(self, object, dependencies, serialized, parent)\u001b[0m\n\u001b[1;32m 851\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mmetadata\u001b[38;5;241m.\u001b[39mcreate_parent_child(parent, \u001b[38;5;28mobject\u001b[39m\u001b[38;5;241m.\u001b[39munique_id)\n\u001b[1;32m 852\u001b[0m \u001b[38;5;28mobject\u001b[39m\u001b[38;5;241m.\u001b[39mon_load(\n\u001b[1;32m 853\u001b[0m \u001b[38;5;28mself\u001b[39m\n\u001b[1;32m 854\u001b[0m ) \u001b[38;5;66;03m# TODO do I really need to call this here? 
Could be handled by `.on_create`?\u001b[39;00m\n\u001b[0;32m--> 855\u001b[0m jobs \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mobject\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mschedule_jobs\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdependencies\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdependencies\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 856\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m jobs\n", - "File \u001b[0;32m~/SuperDuperDB/superduperdb/superduperdb/components/listener.py:113\u001b[0m, in \u001b[0;36mListener.schedule_jobs\u001b[0;34m(self, database, dependencies, distributed, verbose)\u001b[0m\n\u001b[1;32m 110\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m ()\n\u001b[1;32m 112\u001b[0m \u001b[38;5;28;01massert\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mmodel, \u001b[38;5;28mstr\u001b[39m)\n\u001b[0;32m--> 113\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mmodel\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mpredict\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 114\u001b[0m \u001b[43m \u001b[49m\u001b[43mX\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mkey\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 115\u001b[0m \u001b[43m \u001b[49m\u001b[43mdb\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdatabase\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 116\u001b[0m \u001b[43m \u001b[49m\u001b[43mselect\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mselect\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 117\u001b[0m \u001b[43m \u001b[49m\u001b[43mdistributed\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdistributed\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 118\u001b[0m \u001b[43m \u001b[49m\u001b[43mdependencies\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdependencies\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 119\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mpredict_kwargs\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;129;43;01mor\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43m{\u001b[49m\u001b[43m}\u001b[49m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 120\u001b[0m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n", - "File \u001b[0;32m~/SuperDuperDB/superduperdb/superduperdb/components/model.py:238\u001b[0m, in \u001b[0;36mPredictMixin.predict\u001b[0;34m(self, X, db, select, distributed, ids, max_chunk_size, dependencies, listen, one, context, in_memory, overwrite, **kwargs)\u001b[0m\n\u001b[1;32m 236\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m select \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m \u001b[38;5;129;01mand\u001b[39;00m ids \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[1;32m 237\u001b[0m \u001b[38;5;28;01massert\u001b[39;00m db \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[0;32m--> 238\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m 
\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_predict_with_select\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 239\u001b[0m \u001b[43m \u001b[49m\u001b[43mX\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mX\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 240\u001b[0m \u001b[43m \u001b[49m\u001b[43mselect\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mselect\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 241\u001b[0m \u001b[43m \u001b[49m\u001b[43mdb\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdb\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 242\u001b[0m \u001b[43m \u001b[49m\u001b[43min_memory\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43min_memory\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 243\u001b[0m \u001b[43m \u001b[49m\u001b[43mmax_chunk_size\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mmax_chunk_size\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 244\u001b[0m \u001b[43m \u001b[49m\u001b[43moverwrite\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43moverwrite\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 245\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 246\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 247\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m select \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m \u001b[38;5;129;01mand\u001b[39;00m ids \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[1;32m 248\u001b[0m \u001b[38;5;28;01massert\u001b[39;00m db \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m\n", - "File \u001b[0;32m~/SuperDuperDB/superduperdb/superduperdb/components/model.py:319\u001b[0m, in \u001b[0;36mPredictMixin._predict_with_select\u001b[0;34m(self, X, select, db, max_chunk_size, in_memory, overwrite, **kwargs)\u001b[0m\n\u001b[1;32m 316\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m r \u001b[38;5;129;01min\u001b[39;00m tqdm\u001b[38;5;241m.\u001b[39mtqdm(db\u001b[38;5;241m.\u001b[39mexecute(query)):\n\u001b[1;32m 317\u001b[0m ids\u001b[38;5;241m.\u001b[39mappend(\u001b[38;5;28mstr\u001b[39m(r[db\u001b[38;5;241m.\u001b[39mdatabackend\u001b[38;5;241m.\u001b[39mid_field]))\n\u001b[0;32m--> 319\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_predict_with_select_and_ids\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 320\u001b[0m \u001b[43m \u001b[49m\u001b[43mX\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mX\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 321\u001b[0m \u001b[43m \u001b[49m\u001b[43mdb\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdb\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 322\u001b[0m \u001b[43m \u001b[49m\u001b[43mids\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mids\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 323\u001b[0m \u001b[43m \u001b[49m\u001b[43mselect\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mselect\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 324\u001b[0m \u001b[43m \u001b[49m\u001b[43mmax_chunk_size\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mmax_chunk_size\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 325\u001b[0m \u001b[43m \u001b[49m\u001b[43min_memory\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43min_memory\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 
326\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 327\u001b[0m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n", - "File \u001b[0;32m~/SuperDuperDB/superduperdb/superduperdb/components/model.py:395\u001b[0m, in \u001b[0;36mPredictMixin._predict_with_select_and_ids\u001b[0;34m(self, X, db, select, ids, in_memory, max_chunk_size, **kwargs)\u001b[0m\n\u001b[1;32m 391\u001b[0m outputs \u001b[38;5;241m=\u001b[39m encoded_ouputs \u001b[38;5;28;01mif\u001b[39;00m encoded_ouputs \u001b[38;5;28;01melse\u001b[39;00m outputs\n\u001b[1;32m 393\u001b[0m \u001b[38;5;28;01massert\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mversion, \u001b[38;5;28mint\u001b[39m)\n\u001b[0;32m--> 395\u001b[0m \u001b[43mselect\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mmodel_update\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 396\u001b[0m \u001b[43m \u001b[49m\u001b[43mdb\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdb\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 397\u001b[0m \u001b[43m \u001b[49m\u001b[43mmodel\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43midentifier\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 398\u001b[0m \u001b[43m \u001b[49m\u001b[43moutputs\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43moutputs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 399\u001b[0m \u001b[43m \u001b[49m\u001b[43mkey\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mX\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 400\u001b[0m \u001b[43m \u001b[49m\u001b[43mversion\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mversion\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 401\u001b[0m \u001b[43m \u001b[49m\u001b[43mids\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mids\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 402\u001b[0m \u001b[43m \u001b[49m\u001b[43mflatten\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mflatten\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 403\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mmodel_update_kwargs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 404\u001b[0m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n", - "File \u001b[0;32m~/SuperDuperDB/superduperdb/superduperdb/backends/base/query.py:54\u001b[0m, in \u001b[0;36mSelect.model_update\u001b[0;34m(self, db, ids, key, model, version, outputs, **kwargs)\u001b[0m\n\u001b[1;32m 44\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mmodel_update\u001b[39m(\u001b[38;5;28mself\u001b[39m, db, ids: t\u001b[38;5;241m.\u001b[39mSequence[t\u001b[38;5;241m.\u001b[39mAnnotated], key: \u001b[38;5;28mstr\u001b[39m, model: \u001b[38;5;28mstr\u001b[39m, version: \u001b[38;5;28mint\u001b[39m, outputs: t\u001b[38;5;241m.\u001b[39mSequence[t\u001b[38;5;241m.\u001b[39mAny], \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs):\n\u001b[1;32m 45\u001b[0m \u001b[38;5;250m \u001b[39m\u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[1;32m 46\u001b[0m \u001b[38;5;124;03m Update model outputs for a set of ids.\u001b[39;00m\n\u001b[1;32m 47\u001b[0m \n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 52\u001b[0m \u001b[38;5;124;03m :param outputs: The 
outputs to update\u001b[39;00m\n\u001b[1;32m 53\u001b[0m \u001b[38;5;124;03m \"\"\"\u001b[39;00m\n\u001b[0;32m---> 54\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mtable_or_collection\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mmodel_update\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 55\u001b[0m \u001b[43m \u001b[49m\u001b[43mdb\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdb\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 56\u001b[0m \u001b[43m \u001b[49m\u001b[43mids\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mids\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 57\u001b[0m \u001b[43m \u001b[49m\u001b[43mkey\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mkey\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 58\u001b[0m \u001b[43m \u001b[49m\u001b[43mmodel\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mmodel\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 59\u001b[0m \u001b[43m \u001b[49m\u001b[43moutputs\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43moutputs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 60\u001b[0m \u001b[43m \u001b[49m\u001b[43mversion\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mversion\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 61\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 62\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n", - "File \u001b[0;32m~/SuperDuperDB/superduperdb/superduperdb/backends/mongodb/query.py:732\u001b[0m, in \u001b[0;36mCollection.model_update\u001b[0;34m(self, db, ids, key, model, version, outputs, document_embedded, flatten, **kwargs)\u001b[0m\n\u001b[1;32m 730\u001b[0m collection_name \u001b[38;5;241m=\u001b[39m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m_outputs.\u001b[39m\u001b[38;5;132;01m{\u001b[39;00mkey\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m.\u001b[39m\u001b[38;5;132;01m{\u001b[39;00mmodel\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m'\u001b[39m\n\u001b[1;32m 731\u001b[0m collection \u001b[38;5;241m=\u001b[39m db\u001b[38;5;241m.\u001b[39mdatabackend\u001b[38;5;241m.\u001b[39mget_table_or_collection(collection_name)\n\u001b[0;32m--> 732\u001b[0m \u001b[43mcollection\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mbulk_write\u001b[49m\u001b[43m(\u001b[49m\u001b[43mbulk_docs\u001b[49m\u001b[43m)\u001b[49m\n", - "File \u001b[0;32m~/SuperDuperDB/superduperdb/.venv/lib/python3.11/site-packages/mongomock/collection.py:1823\u001b[0m, in \u001b[0;36mCollection.bulk_write\u001b[0;34m(self, requests, ordered, bypass_document_validation, session)\u001b[0m\n\u001b[1;32m 1821\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m operation \u001b[38;5;129;01min\u001b[39;00m requests:\n\u001b[1;32m 1822\u001b[0m operation\u001b[38;5;241m.\u001b[39m_add_to_bulk(bulk)\n\u001b[0;32m-> 1823\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m BulkWriteResult(\u001b[43mbulk\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mexecute\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m, \u001b[38;5;28;01mTrue\u001b[39;00m)\n", - "File \u001b[0;32m~/SuperDuperDB/superduperdb/.venv/lib/python3.11/site-packages/mongomock/collection.py:314\u001b[0m, in \u001b[0;36mBulkOperationBuilder.execute\u001b[0;34m(self, write_concern)\u001b[0m\n\u001b[1;32m 312\u001b[0m exec_name \u001b[38;5;241m=\u001b[39m execute_func\u001b[38;5;241m.\u001b[39m\u001b[38;5;18m__name__\u001b[39m\n\u001b[1;32m 313\u001b[0m 
\u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m--> 314\u001b[0m op_result \u001b[38;5;241m=\u001b[39m \u001b[43mexecute_func\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 315\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m WriteError \u001b[38;5;28;01mas\u001b[39;00m error:\n\u001b[1;32m 316\u001b[0m result[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mwriteErrors\u001b[39m\u001b[38;5;124m'\u001b[39m]\u001b[38;5;241m.\u001b[39mappend({\n\u001b[1;32m 317\u001b[0m \u001b[38;5;124m'\u001b[39m\u001b[38;5;124mindex\u001b[39m\u001b[38;5;124m'\u001b[39m: index,\n\u001b[1;32m 318\u001b[0m \u001b[38;5;124m'\u001b[39m\u001b[38;5;124mcode\u001b[39m\u001b[38;5;124m'\u001b[39m: error\u001b[38;5;241m.\u001b[39mcode,\n\u001b[1;32m 319\u001b[0m \u001b[38;5;124m'\u001b[39m\u001b[38;5;124merrmsg\u001b[39m\u001b[38;5;124m'\u001b[39m: \u001b[38;5;28mstr\u001b[39m(error),\n\u001b[1;32m 320\u001b[0m })\n", - "File \u001b[0;32m~/SuperDuperDB/superduperdb/.venv/lib/python3.11/site-packages/mongomock/collection.py:273\u001b[0m, in \u001b[0;36mBulkOperationBuilder.insert..exec_insert\u001b[0;34m()\u001b[0m\n\u001b[1;32m 272\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mexec_insert\u001b[39m():\n\u001b[0;32m--> 273\u001b[0m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mcollection\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43minsert_one\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 274\u001b[0m \u001b[43m \u001b[49m\u001b[43mdoc\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mbypass_document_validation\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_bypass_document_validation\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 275\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m {\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mnInserted\u001b[39m\u001b[38;5;124m'\u001b[39m: \u001b[38;5;241m1\u001b[39m}\n", - "File \u001b[0;32m~/SuperDuperDB/superduperdb/.venv/lib/python3.11/site-packages/mongomock/collection.py:454\u001b[0m, in \u001b[0;36mCollection.insert_one\u001b[0;34m(self, document, bypass_document_validation, session)\u001b[0m\n\u001b[1;32m 452\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m bypass_document_validation:\n\u001b[1;32m 453\u001b[0m validate_is_mutable_mapping(\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mdocument\u001b[39m\u001b[38;5;124m'\u001b[39m, document)\n\u001b[0;32m--> 454\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m InsertOneResult(\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_insert\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdocument\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43msession\u001b[49m\u001b[43m)\u001b[49m, acknowledged\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mTrue\u001b[39;00m)\n", - "File \u001b[0;32m~/SuperDuperDB/superduperdb/.venv/lib/python3.11/site-packages/mongomock/collection.py:505\u001b[0m, in \u001b[0;36mCollection._insert\u001b[0;34m(self, data, session, ordered)\u001b[0m\n\u001b[1;32m 501\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mDocument keys must be strings\u001b[39m\u001b[38;5;124m'\u001b[39m)\n\u001b[1;32m 503\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m BSON:\n\u001b[1;32m 504\u001b[0m \u001b[38;5;66;03m# bson validation\u001b[39;00m\n\u001b[0;32m--> 505\u001b[0m 
\u001b[43mBSON\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mencode\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdata\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcheck_keys\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43;01mTrue\u001b[39;49;00m\u001b[43m)\u001b[49m\n\u001b[1;32m 507\u001b[0m \u001b[38;5;66;03m# Like pymongo, we should fill the _id in the inserted dict (odd behavior,\u001b[39;00m\n\u001b[1;32m 508\u001b[0m \u001b[38;5;66;03m# but we need to stick to it), so we must patch in-place the data dict\u001b[39;00m\n\u001b[1;32m 509\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;124m'\u001b[39m\u001b[38;5;124m_id\u001b[39m\u001b[38;5;124m'\u001b[39m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m data:\n", - "File \u001b[0;32m~/SuperDuperDB/superduperdb/.venv/lib/python3.11/site-packages/bson/__init__.py:1428\u001b[0m, in \u001b[0;36mBSON.encode\u001b[0;34m(cls, document, check_keys, codec_options)\u001b[0m\n\u001b[1;32m 1401\u001b[0m \u001b[38;5;129m@classmethod\u001b[39m\n\u001b[1;32m 1402\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mencode\u001b[39m(\n\u001b[1;32m 1403\u001b[0m \u001b[38;5;28mcls\u001b[39m: Type[BSON],\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 1406\u001b[0m codec_options: CodecOptions[Any] \u001b[38;5;241m=\u001b[39m DEFAULT_CODEC_OPTIONS,\n\u001b[1;32m 1407\u001b[0m ) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m BSON:\n\u001b[1;32m 1408\u001b[0m \u001b[38;5;250m \u001b[39m\u001b[38;5;124;03m\"\"\"Encode a document to a new :class:`BSON` instance.\u001b[39;00m\n\u001b[1;32m 1409\u001b[0m \n\u001b[1;32m 1410\u001b[0m \u001b[38;5;124;03m A document can be any mapping type (like :class:`dict`).\u001b[39;00m\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 1426\u001b[0m \u001b[38;5;124;03m Replaced `uuid_subtype` option with `codec_options`.\u001b[39;00m\n\u001b[1;32m 1427\u001b[0m \u001b[38;5;124;03m \"\"\"\u001b[39;00m\n\u001b[0;32m-> 1428\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mcls\u001b[39m(\u001b[43mencode\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdocument\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcheck_keys\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcodec_options\u001b[49m\u001b[43m)\u001b[49m)\n", - "File \u001b[0;32m~/SuperDuperDB/superduperdb/.venv/lib/python3.11/site-packages/bson/__init__.py:1042\u001b[0m, in \u001b[0;36mencode\u001b[0;34m(document, check_keys, codec_options)\u001b[0m\n\u001b[1;32m 1039\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(codec_options, CodecOptions):\n\u001b[1;32m 1040\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m _CODEC_OPTIONS_TYPE_ERROR\n\u001b[0;32m-> 1042\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43m_dict_to_bson\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdocument\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcheck_keys\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcodec_options\u001b[49m\u001b[43m)\u001b[49m\n", - "\u001b[0;31mInvalidDocument\u001b[0m: documents must have only string keys, key was 0" - ] - } - ], - "source": [ - "from superduperdb import Listener\n", - "\n", - "db.add(\n", - " Listener(\n", - " model=video2images,\n", - " select=video_collection.find(),\n", - " key='video',\n", - " )\n", - ")\n", - "\n", - "db.execute(Collection('_outputs.video.video2images').find_one()).unpack()['_outputs']['video']['video2images']['image']" - ] - }, - { - "cell_type": "markdown", 
- "id": "8ef3c353-fcc4-4f23-892b-c8a3796f952c", - "metadata": {}, - "source": [ - "## Create CLIP model\n", - "Now, we'll create a model for the CLIP (Contrastive Language-Image Pre-training) model, which will be used for visual and textual analysis." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bd7329ea-75d1-4275-b754-1a977e76161a", - "metadata": {}, - "outputs": [], - "source": [ - "import clip\n", - "from superduperdb import vector\n", - "from superduperdb.ext.torch import TorchModel\n", - "\n", - "model, preprocess = clip.load(\"RN50\", device='cpu')\n", - "t = vector(shape=(1024,))\n", - "\n", - "visual_model = TorchModel(\n", - " identifier='clip_image',\n", - " preprocess=preprocess,\n", - " object=model.visual,\n", - " encoder=t,\n", - " postprocess=lambda x: x.tolist(),\n", - ")\n", - "\n", - "text_model = TorchModel(\n", - " identifier='clip_text',\n", - " object=model,\n", - " preprocess=lambda x: clip.tokenize(x)[0],\n", - " forward_method='encode_text',\n", - " encoder=t,\n", - " device='cpu',\n", - " preferred_devices=None,\n", - " postprocess=lambda x: x.tolist(),\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "dfa470d9-35d0-4c53-a5d0-afba5456320a", - "metadata": {}, - "source": [ - "## Create VectorIndex\n", - "\n", - "We will set up a VectorIndex to index and search the video frames based on both visual and textual content. This involves creating an indexing listener for visual data and a compatible listener for textual data." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "475d27c7-81e0-47ae-a02b-b1df4332002c", - "metadata": {}, - "outputs": [], - "source": [ - "from superduperdb import Listener, VectorIndex\n", - "from superduperdb.backends.mongodb import Collection\n", - "\n", - "db.add(\n", - " VectorIndex(\n", - " identifier='video_search_index',\n", - " indexing_listener=Listener(\n", - " model=visual_model,\n", - " key='_outputs.video.video2images.image',\n", - " select=Collection('_outputs.video.video2images').find(),\n", - " ),\n", - " compatible_listener=Listener(\n", - " model=text_model,\n", - " key='text',\n", - " select=None,\n", - " active=False\n", - " )\n", - " )\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "aeb7df08-c1ec-45db-8716-79db58ad6502", - "metadata": {}, - "source": [ - "## Query a text against saved frames." - ] - }, - { - "cell_type": "markdown", - "id": "95c48d0c-4f7a-4c32-a2e3-3f8d8985733a", - "metadata": {}, - "source": [ - "Now, let's search for something that happened during the video:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "31ba463f-97ae-4f83-890e-c852f9818e63", - "metadata": {}, - "outputs": [], - "source": [ - "# Define the search parameters\n", - "search_term = 'Some ducks'\n", - "num_results = 1\n", - "\n", - "\n", - "r = next(db.execute(\n", - " Collection('_outputs.video.video2images').like(Document({'text': search_term}), vector_index='video_search_index', n=num_results).find()\n", - "))\n", - "\n", - "search_timestamp = r['_outputs']['video']['video2images']['current_timestamp']\n", - "\n", - "# Get the back reference to the original video\n", - "video = db.execute(Collection('videos').find_one({'_id': r['_source']}))" - ] - }, - { - "cell_type": "markdown", - "id": "78fc11ff-dafc-4525-88a5-327ed547b89e", - "metadata": {}, - "source": [ - "## Start the video from the resultant timestamp:\n", - "\n", - "Finally, we can display and play the video starting from the timestamp where the searched text is found." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "eeda6711-15a4-465e-903d-ed0a1d0db672", - "metadata": {}, - "outputs": [], - "source": [ - "from IPython.display import display, HTML\n", - "\n", - "video_html = f\"\"\"\n", - "\n", - "\n", - "\"\"\"\n", - "\n", - "display(HTML(video_html))" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.6" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/examples/voice_memos.ipynb b/examples/voice_memos.ipynb deleted file mode 100644 index 2b5079174..000000000 --- a/examples/voice_memos.ipynb +++ /dev/null @@ -1,371 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "ae374483", - "metadata": {}, - "source": [ - "# Cataloguing voice-memos for a self-managed personal assistant" - ] - }, - { - "cell_type": "markdown", - "id": "cc4fa500665eccb9", - "metadata": {}, - "source": [ - "## Introduction\n", - "\n", - "Discover the magic of SuperDuperDB as we seamlessly integrate models across different data modalities, such as audio and text. Experience the creation of highly sophisticated data-based applications with minimal boilerplate code.\n", - "\n", - "### Objectives:\n", - "\n", - "1. Maintain a database of audio recordings\n", - "2. Index the content of these audio recordings\n", - "3. Search and interrogate the content of these audio recordings\n", - "\n", - "### Our approach involves:\n", - "\n", - "* Utilizing a transformers model by Facebook's AI team to transcribe audio to text.\n", - "* Employing an OpenAI vectorization model to index the transcribed text.\n", - "* Harnessing the OpenAI ChatGPT model in conjunction with relevant recordings to query the audio database." - ] - }, - { - "cell_type": "markdown", - "id": "ecf9f0ec45cb1f3", - "metadata": {}, - "source": [ - "## Prerequisites\n", - "\n", - "Before diving into the implementation, ensure that you have the necessary libraries installed by running the following commands:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "dce1a857", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "!pip install superduperdb\n", - "!pip install transformers soundfile torchaudio librosa openai\n", - "!pip install -U datasets" - ] - }, - { - "cell_type": "markdown", - "id": "0d02e472-8395-435c-b46d-6a5158ef67fb", - "metadata": { - "tags": [] - }, - "source": [ - "Additionally, ensure that you have set your OpenAI API key as an environment variable. You can uncomment the following code and add your API key:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "94262bf76c630b10", - "metadata": { - "collapsed": false, - "tags": [] - }, - "outputs": [], - "source": [ - "import os\n", - "\n", - "#os.environ['OPENAI_API_KEY'] = 'sk-XXXX'\n", - "\n", - "if 'OPENAI_API_KEY' not in os.environ:\n", - " raise Exception('Environment variable \"OPENAI_API_KEY\" not set')" - ] - }, - { - "cell_type": "markdown", - "id": "32971b8afdf76fe5", - "metadata": {}, - "source": [ - "## Connect to datastore \n", - "\n", - "First, we need to establish a connection to a MongoDB datastore via SuperDuperDB. You can configure the `MongoDB_URI` based on your specific setup. 
\n", - "Here are some examples of MongoDB URIs:\n", - "\n", - "* For testing (default connection): `mongomock://test`\n", - "* Local MongoDB instance: `mongodb://localhost:27017`\n", - "* MongoDB with authentication: `mongodb://superduper:superduper@mongodb:27017/documents`\n", - "* MongoDB Atlas: `mongodb+srv://:@/`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b84da3f2ef58e401", - "metadata": { - "collapsed": false, - "tags": [] - }, - "outputs": [], - "source": [ - "from superduperdb import superduper\n", - "from superduperdb.backends.mongodb import Collection\n", - "import os\n", - "\n", - "mongodb_uri = os.getenv(\"MONGODB_URI\",\"mongomock://test\")\n", - "db = superduper(mongodb_uri)\n", - "\n", - "# Create a collection for Voice memos\n", - "voice_collection = Collection('voice-memos')" - ] - }, - { - "cell_type": "markdown", - "id": "d13d051e8f5f6f", - "metadata": {}, - "source": [ - "\n", - "## Load Dataset\n", - "\n", - "In this example se use `LibriSpeech` as our voice recording dataset. It is a corpus of approximately 1000 hours of read English speech. The same functionality could be accomplised using any audio, in particular audio hosted on the web, or in an `s3` bucket. For instance, if you have a repository of audio of conference calls, or memos, this may be indexed in the same way. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "10ab7114", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "from datasets import load_dataset\n", - "from superduperdb.ext.numpy import array\n", - "from superduperdb import Document\n", - "\n", - "data = load_dataset(\"hf-internal-testing/librispeech_asr_demo\", \"clean\", split=\"validation\")\n", - "\n", - "# Using an `Encoder`, we may add the audio data directly to a MongoDB collection:\n", - "enc = array('float64', shape=(None,))\n", - "\n", - "db.add(enc)\n", - "\n", - "db.execute(voice_collection.insert_many([\n", - " Document({'audio': enc(r['audio']['array'])}) for r in data\n", - "]))" - ] - }, - { - "cell_type": "markdown", - "id": "721f31f4626881e0", - "metadata": {}, - "source": [ - "## Install Pre-Trained Model (LibreSpeech) into Database\n", - "\n", - "Apply a pretrained `transformers` model to the data: " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "222284f7", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration\n", - "from superduperdb.ext.transformers import Pipeline\n", - "\n", - "model = Speech2TextForConditionalGeneration.from_pretrained(\"facebook/s2t-small-librispeech-asr\")\n", - "processor = Speech2TextProcessor.from_pretrained(\"facebook/s2t-small-librispeech-asr\")\n", - "\n", - "SAMPLING_RATE = 16000\n", - "\n", - "transcriber = Pipeline(\n", - " identifier='transcription',\n", - " object=model,\n", - " preprocess=processor,\n", - " preprocess_kwargs={'sampling_rate': SAMPLING_RATE, 'return_tensors': 'pt', 'padding': True},\n", - " postprocess=lambda x: processor.batch_decode(x, skip_special_tokens=True),\n", - " predict_method='generate',\n", - " preprocess_type='other',\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "ed83a8b084844292", - "metadata": {}, - "source": [ - "# Run Predictions on All Recordings in the Collection\n", - "Apply the `Pipeline` to all audio recordings:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "573dccc4", - "metadata": { - "tags": [] - }, - "outputs": [], - 
"source": [ - "transcriber.predict(X='audio', db=db, select=voice_collection.find(), max_chunk_size=10)" - ] - }, - { - "cell_type": "markdown", - "id": "64a6cbd8e3d429d9", - "metadata": {}, - "source": [ - "## Ask Questions to Your Voice Assistant\n", - "\n", - "Ask questions to your voice assistant, targeting specific queries and utilizing the power of MongoDB for vector-search and filtering rules:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3aedc03c", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "from superduperdb import VectorIndex, Listener\n", - "from superduperdb.ext.openai import OpenAIEmbedding\n", - "\n", - "db.add(\n", - " VectorIndex(\n", - " identifier='my-index',\n", - " indexing_listener=Listener(\n", - " model=OpenAIEmbedding(model='text-embedding-ada-002'),\n", - " key='_outputs.audio.transcription',\n", - " select=voice_collection.find(),\n", - " ),\n", - " )\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "2f92b56f", - "metadata": {}, - "source": [ - "Let's confirm this has worked, by searching for the `royal cavern`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7d2e3e56", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "# Define the search parameters\n", - "search_term = 'royal cavern'\n", - "num_results = 2\n", - "\n", - "list(db.execute(\n", - " voice_collection.like(\n", - " {'_outputs.audio.transcription': search_term},\n", - " n=num_results,\n", - " vector_index='my-index',\n", - " ).find({}, {'_outputs.audio.transcription': 1})\n", - "))" - ] - }, - { - "cell_type": "markdown", - "id": "6068514b31268846", - "metadata": {}, - "source": [ - "## Enrich it with Chat-Completion \n", - "\n", - "Connect the previous steps with the gpt-3.5.turbo, a chat-completion model on OpenAI. The plan is to seed the completions with the most relevant audio recordings, as judged by their textual transcriptions. These transcriptions are retrieved using the previously configured `VectorIndex`. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "99e206af", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "from superduperdb.ext.openai import OpenAIChatCompletion\n", - "\n", - "chat = OpenAIChatCompletion(\n", - " model='gpt-3.5-turbo',\n", - " prompt=(\n", - " 'Use the following facts to answer this question\\n'\n", - " '{context}\\n\\n'\n", - " 'Here\\'s the question:\\n'\n", - " ),\n", - ")\n", - "\n", - "db.add(chat)\n", - "\n", - "print(db.show('model'))" - ] - }, - { - "cell_type": "markdown", - "id": "9cb623c4", - "metadata": {}, - "source": [ - "## Full Voice-Assistant Experience\n", - "\n", - "Test the full model by asking a question about a specific fact mentioned in the audio recordings. 
The model will retrieve the most relevant recordings and use them to formulate its answer:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "60d7f0af-6305-4c8c-be65-4b75ec7dbf50", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "from superduperdb import Document\n", - "\n", - "q = 'Is anything really Greek?'\n", - "\n", - "print(db.predict(\n", - " model_name='gpt-3.5-turbo',\n", - " input=q,\n", - " context_select=voice_collection.like(\n", - " Document({'_outputs.audio.transcription': q}), vector_index='my-index'\n", - " ).find(),\n", - " context_key='_outputs.audio.transcription',\n", - ")[0].content)" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.5" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -}
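One loose end from the voice-memos walkthrough above: the text mentions combining vector search with MongoDB filtering rules, but the cells only show a bare `.like(...).find()`. A hedged sketch of such a combination follows, reusing the `_fold` train/valid marker that SuperDuperDB adds to inserted documents (treating it as a filter here is purely illustrative, not part of the original notebooks):

```python
# Illustrative only: restrict the vector-search hits with an ordinary MongoDB
# filter passed to .find(); '_fold' is the split marker SuperDuperDB adds
# to every inserted document.
results = list(db.execute(
    voice_collection
    .like(
        {'_outputs.audio.transcription': 'royal cavern'},
        n=5,
        vector_index='my-index',
    )
    .find({'_fold': 'train'}, {'_outputs.audio.transcription': 1})
))
```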