From 2aef91bae729a2be1c5ed920a8fcf5c5489beafc Mon Sep 17 00:00:00 2001 From: YanivHyper-Space <124336435+YanivHyper-Space@users.noreply.github.com> Date: Tue, 26 Mar 2024 13:02:52 +0200 Subject: [PATCH] Add files via upload --- .../arXiv/CrimesInChicago_ClassicSearch.ipynb | 1158 +++++++++++++++++ 1 file changed, 1158 insertions(+) create mode 100644 DataSets/arXiv/CrimesInChicago_ClassicSearch.ipynb diff --git a/DataSets/arXiv/CrimesInChicago_ClassicSearch.ipynb b/DataSets/arXiv/CrimesInChicago_ClassicSearch.ipynb new file mode 100644 index 0000000..b354e7c --- /dev/null +++ b/DataSets/arXiv/CrimesInChicago_ClassicSearch.ipynb @@ -0,0 +1,1158 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "X3eWQaOFp2lG", + "metadata": { + "id": "X3eWQaOFp2lG" + }, + "source": [ + "![63f78014766fd30436c18a79_Hyperspace - navbar logo.png]()" + ] + }, + { + "cell_type": "markdown", + "id": "1Kone1nET7HS", + "metadata": { + "id": "1Kone1nET7HS" + }, + "source": [ + "# Classic Search With Hyperspace\n", + "This notebook demonstrates the use of Hyperspace engine for classic search, combining keyword and value matching.\n", + "For more info, see the [Hyperspace documentation](https://docs.hyper-space.io/hyperspace-docs/getting-started/overview).\n", + "\n", + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/hyper-space-io/QuickStart/blob/master/DataSets/CrimesInChicago/CrimesInChicago_ClassicSearch.ipynb)\n", + "## The Dataset - Crimes In Chicago Dataset\n", + "From Kaggle:\n", + "\"This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days. Data is extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system. 
In order to protect the privacy of crime victims, addresses are shown at the block level only and specific locations are not identified. This data includes unverified reports supplied to the Police Department. The preliminary crime classifications may be changed at a later date based upon additional investigation and there is always the possibility of mechanical or human error. Therefore, the Chicago Police Department does not guarantee (either expressed or implied) the accuracy, completeness, timeliness, or correct sequencing of the information and the information should not be used for comparison purposes over time.\"\n", + "\n", + "The [dataset](https://www.kaggle.com/datasets/chicago/chicago-crime) can be downloaded from [Hyperspace git](https://github.com/hyper-space-io/QuickStart/blob/main/DataSets/CrimesInChicago/100k-crimes-dataset-processed_data.zip).\n", + "\n", + "## The Dataset Fields\n", + "1. **Case Number {'type': 'keyword'}** - The Chicago Police Department RD Number (Records Division Number), which is unique to the incident.\n", + "2. **Date {'type': 'date', 'format': 'MM/dd/yyyy hh:mm:ss a'}** - Date when the incident occurred. This is sometimes a best estimate.\n", + "3. **Block {'type': 'keyword'}** - The partially redacted address where the incident occurred, placing it on the same block as the actual address.\n", + "4. **IUCR {'type': 'keyword'}** - The Illinois Uniform Crime Reporting code. This is directly linked to the Primary Type and Description. See the list of IUCR codes at https://data.cityofchicago.org/d/c7ck-438e.\n", + "5. **Primary Type {'type': 'keyword'}** - The primary description of the IUCR code.\n", + "6. **Description {'type': 'keyword'}** - The secondary description of the IUCR code, a subcategory of the primary description.\n", + "7. **Location Description {'type': 'keyword'}** - Description of the location where the incident occurred.\n", + "8. **Arrest {'type': 'boolean'}** - Indicates whether an arrest was made.\n", + "9. 
**Domestic {'type': 'boolean'}** - Indicates whether the incident was domestic-related as defined by the Illinois Domestic Violence Act.\n", + "10. **Beat {'type': 'integer'}** - Indicates the beat where the incident occurred. A beat is the smallest police geographic area – each beat has a dedicated police beat car. Three to five beats make up a police sector, and three sectors make up a police district. The Chicago Police Department has 22 police districts. See the beats at https://data.cityofchicago.org/d/aerh-rz74.\n", + "11. **District {'type': 'integer'}** - Indicates the police district where the incident occurred. See the districts at https://data.cityofchicago.org/d/fthy-xz3r.\n", + "12. **Ward {'type': 'integer'}** - The ward (City Council district) where the incident occurred. See the wards at https://data.cityofchicago.org/d/sp34-6z76.\n", + "13. **Community Area {'type': 'integer'}** - Indicates the community area where the incident occurred. Chicago has 77 community areas. See the community areas at https://data.cityofchicago.org/d/cauq-8yn6.\n", + "14. **FBI Code {'type': 'keyword'}** - Indicates the crime classification as outlined in the FBI's National Incident-Based Reporting System (NIBRS). See the Chicago Police Department listing of these classifications at http://gis.chicagopolice.org/clearmap_crime_sums/crime_types.html.\n", + "15. **X Coordinate {'type': 'integer'}** - The x coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection. This location is shifted from the actual location for partial redaction but falls on the same block.\n", + "16. **Y Coordinate {'type': 'integer'}** - The y coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection. This location is shifted from the actual location for partial redaction but falls on the same block.\n", + "17. **Year {'type': 'integer'}** - Year the incident occurred.\n", + "18. 
**Updated On {'type': 'date', 'format': 'MM/dd/yyyy hh:mm:ss a'}** - Date and time the record was last updated.\n", + "19. **Latitude {'type': 'float'}** - The latitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.\n", + "20. **Longitude {'type': 'float'}** - The longitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.\n", + "21. **Location {'type': 'geo_point', 'struct_type': 'list'}** - The location where the incident occurred in a format that allows for creation of maps and other geographic operations on this data portal. This location is shifted from the actual location for partial redaction but falls on the same block." + ] + }, + { + "cell_type": "markdown", + "id": "QHPcXuhkq4uN", + "metadata": { + "id": "QHPcXuhkq4uN" + }, + "source": [ + "\n", + "# Setting up the Hyperspace Environment\n", + "Setting up the environment and running the query includes the following steps:\n", + "1. Download and install the client API\n", + "2. Connect to a server\n", + "3. Create data schema file\n", + "4. Create collection\n", + "5. Ingest data\n", + "6. Run query" + ] + }, + { + "cell_type": "markdown", + "id": "SZwFmp6tVLZ_", + "metadata": { + "id": "SZwFmp6tVLZ_" + }, + "source": [ + "## 1. 
Install the Hyperspace client API\n", + "Hyperspace API can be installed directly from git, using the following command:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "l1DTYoFb1vnF", + "metadata": { + "ExecuteTime": { + "end_time": "2024-01-03T12:59:27.200146500Z", + "start_time": "2024-01-03T12:59:20.166207100Z" + }, + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "l1DTYoFb1vnF", + "outputId": "d828fd9c-4801-4204-b452-5af54af6bdba" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Collecting git+https://github.com/hyper-space-io/hyperspace-py\n", + " Cloning https://github.com/hyper-space-io/hyperspace-py to /tmp/pip-req-build-axxmqr16\n", + " Running command git clone --filter=blob:none --quiet https://github.com/hyper-space-io/hyperspace-py /tmp/pip-req-build-axxmqr16\n", + " Resolved https://github.com/hyper-space-io/hyperspace-py to commit 70d23409dc1b8be4a73845f17f6b8f84a104b4ea\n", + " Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n", + "Requirement already satisfied: certifi>=14.05.14 in /usr/local/lib/python3.10/dist-packages (from hyperspace==1.0.0) (2024.2.2)\n", + "Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.10/dist-packages (from hyperspace==1.0.0) (1.16.0)\n", + "Requirement already satisfied: python_dateutil>=2.5.3 in /usr/local/lib/python3.10/dist-packages (from hyperspace==1.0.0) (2.8.2)\n", + "Requirement already satisfied: setuptools>=21.0.0 in /usr/local/lib/python3.10/dist-packages (from hyperspace==1.0.0) (67.7.2)\n", + "Requirement already satisfied: urllib3>=1.15.1 in /usr/local/lib/python3.10/dist-packages (from hyperspace==1.0.0) (2.0.7)\n", + "Requirement already satisfied: msgpack in /usr/local/lib/python3.10/dist-packages (from hyperspace==1.0.0) (1.0.8)\n", + "Building wheels for collected packages: hyperspace\n", + " Building wheel for hyperspace (setup.py) ... 
\u001b[?25l\u001b[?25hdone\n", + " Created wheel for hyperspace: filename=hyperspace-1.0.0-py3-none-any.whl size=38874 sha256=7d4a0087925d1704fa1f2ea8111b66ae1305f0b0308dbe1f2cedbc68961126a2\n", + " Stored in directory: /tmp/pip-ephem-wheel-cache-d17ut6v3/wheels/c4/96/59/f4b91d653fdbfc819e48a7dacbea1c9f3de59a1bc113aa840d\n", + "Successfully built hyperspace\n", + "Installing collected packages: hyperspace\n", + "Successfully installed hyperspace-1.0.0\n" + ] + } + ], + "source": [ + "pip install git+https://github.com/hyper-space-io/hyperspace-py" + ] + }, + { + "cell_type": "markdown", + "id": "e9053cb5ddb9badb", + "metadata": { + "id": "e9053cb5ddb9badb" + }, + "source": [ + "### Download Dataset" + ] + }, + { + "cell_type": "code", + "source": [ + "from urllib.request import urlretrieve\n", + "import os\n", + "\n", + "def download_data(url, file_name):\n", + " \"\"\"\n", + " url (str): URL of the file to download.\n", + " file_name (str): Local path where the file will be saved.\n", + " \"\"\"\n", + " # Check if the file already exists and is not empty\n", + " if os.path.exists(file_name) and os.path.getsize(file_name) > 0:\n", + " print(f\"The file {file_name} already exists and is not empty.\")\n", + " else:\n", + " try:\n", + " # Attempt to download the file from `url` and save it locally under `file_name`\n", + " urlretrieve(url, file_name)\n", + " # Check if the file was downloaded and is not empty\n", + " if os.path.exists(file_name) and os.path.getsize(file_name) > 0:\n", + " print(f\"Successfully downloaded {file_name}\")\n", + " else:\n", + " print(\"Download failed or file is empty.\")\n", + "\n", + " except Exception as e:\n", + " print(f\"An error occurred: {e}\")\n", + "\n", + "import zipfile\n", + "def unzip_file(path_to_zip_file):\n", + " directory_to_extract_to = './'\n", + " try:\n", + " with zipfile.ZipFile(path_to_zip_file, 'r') as zip_ref:\n", + " zip_ref.extractall(directory_to_extract_to)\n", + " print(f'Success! 
Files have been extracted to {directory_to_extract_to}')\n", + "\n", + " except zipfile.BadZipFile: # Handle a bad zip file\n", + " print(\"Error: The file is a bad zip file. Unable to unzip.\")\n", + " except FileNotFoundError: # Handle the file not being found\n", + " print(\"Error: The file was not found. Please check the path.\")\n", + " except Exception as e: # Handle other exceptions\n", + " print(f\"An error occurred: {e}\")\n" + ], + "metadata": { + "id": "YHMBrZD1WTGW" + }, + "id": "YHMBrZD1WTGW", + "execution_count": 6, + "outputs": [] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "363dc2a1498d9fce", + "metadata": { + "ExecuteTime": { + "end_time": "2024-01-03T13:13:00.876533Z", + "start_time": "2024-01-03T13:12:56.193951Z" + }, + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "363dc2a1498d9fce", + "outputId": "753100f6-c241-4427-b360-2a9e6b4ad65c" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Successfully downloaded ./100k-crimes-dataset-processed_data.zip\n", + "Success! Files have been extracted to ./\n" + ] + } + ], + "source": [ + "data_url = \"https://hyperspace-datasets.s3.eu-central-1.amazonaws.com/100k-crimes-dataset-processed_data.zip\"\n", + "\n", + "download_data(data_url, \"./100k-crimes-dataset-processed_data.zip\")\n", + "unzip_file(\"./100k-crimes-dataset-processed_data.zip\")\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "TCZSwM6DVeDm", + "metadata": { + "id": "TCZSwM6DVeDm" + }, + "source": [ + "## 2. Connect to the server\n", + "\n", + "Once the Hyperspace API is installed, the database can be accessed by creating a local instance of the Hyperspace client. This step requires host address, username and password." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "8NVXuwyozUi7", + "metadata": { + "ExecuteTime": { + "end_time": "2024-01-01T19:47:50.381706500Z", + "start_time": "2024-01-01T19:47:49.201122400Z" + }, + "id": "8NVXuwyozUi7", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "44b52ad7-1f70-4ff7-a09a-8c554cb724e3" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "\n" + ] + } + ], + "source": [ + "import hyperspace\n", + "from getpass import getpass\n", + "\n", + "username = \"USERNAME\"\n", + "host = \"HOST\"\n", + "password = getpass()\n", + "\n", + "\n", + "hyperspace_client = hyperspace.HyperspaceClientApi(host=host, username=username, password=password)\n", + "print(hyperspace_client)\n" + ] + }, + { + "cell_type": "markdown", + "id": "1rxCqQ58kuzL", + "metadata": { + "id": "1rxCqQ58kuzL" + }, + "source": [ + "Before proceeding, we list the existing collections and delete the 'CrimesInChicago' collection if it already exists." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "cH_lWsLLza0n", + "metadata": { + "ExecuteTime": { + "end_time": "2024-01-01T19:47:50.630929900Z", + "start_time": "2024-01-01T19:47:50.383712Z" + }, + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "cH_lWsLLza0n", + "outputId": "34c712f6-2bb9-435e-84c3-5032a9ad8e2c" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'collections': {'DocRetrievalEmbedded': {'creation_time': '2024-03-20T10:50:16Z',\n", + " 'last_query_time': '2024-03-25T09:27:27Z',\n", + " 'size': 39},\n", + " 'GeneratedData': {'creation_time': '2024-03-26T05:27:09Z',\n", + " 'last_query_time': '2024-03-26T06:28:50Z',\n", + " 'size': 100000},\n", + " 'all-MiniLM-L6-v2_arXiv': {'creation_time': '2024-03-26T07:42:43Z',\n", + " 'last_query_time': '2024-03-26T08:03:16Z',\n", + " 'size': 604001}}}" + ] + }, + "metadata": {}, + "execution_count": 10 + } + ], + "source": [ + "collections_info = 
hyperspace_client.collections_info()\n", + "if 'CrimesInChicago' in collections_info['collections']:\n", + "    # Delete the stale collection, then refresh the listing\n", + "    hyperspace_client.delete_collection('CrimesInChicago')\n", + "    collections_info = hyperspace_client.collections_info()\n", + "collections_info\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "HXmdh3YGVfQV", + "metadata": { + "id": "HXmdh3YGVfQV" + }, + "source": [ + "## 3. Create a Data Schema File\n", + "\n", + "Like other search databases, the Hyperspace database requires a configuration file that outlines the data schema. Here, we create a config file that corresponds to the fields of the given dataset.\n", + "\n", + "For vector fields, we also provide the index type to be used and the metric. Current options for the index include \"**brute_force**\", \"**hnsw**\", \"**ivf**\", and \"**bin_ivf**\" for binary vectors. The metric options are \"**IP**\" (inner product) for floating point vectors and \"**Hamming**\" ([hamming distance](https://en.wikipedia.org/wiki/Hamming_distance)) for binary vectors.\n", + "Note that the key 'low_cardinality' enables faster search for low cardinality fields." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "wEPB1q3mno6H", + "metadata": { + "ExecuteTime": { + "end_time": "2024-01-01T19:47:50.673577300Z", + "start_time": "2024-01-01T19:47:50.635904500Z" + }, + "id": "wEPB1q3mno6H" + }, + "outputs": [], + "source": [ + "config = {\n", + " \"configuration\": {\n", + " \"ID\": {\n", + " \"type\": \"keyword\",\n", + " \"id\": True\n", + " },\n", + " \"Case Number\": {\n", + " \"type\": \"keyword\"\n", + " },\n", + " \"Date\": {\n", + " \"type\": \"date\",\n", + " \"format\": \"MM/dd/yyyy hh:mm:ss a\"\n", + " },\n", + " \"Block\": {\n", + " \"type\": \"keyword\"\n", + " },\n", + " \"IUCR\": {\n", + " \"type\": \"keyword\"\n", + " },\n", + " \"Primary Type\": {\n", + " \"type\": \"keyword\"\n", + " },\n", + " \"Description\": {\n", + " \"type\": \"keyword\"\n", + " },\n", + " \"Location Description\": {\n", + " \"type\": \"keyword\"\n", + " },\n", + " \"Arrest\": {\n", + " \"type\": \"boolean\"\n", + " },\n", + " \"Domestic\": {\n", + " \"type\": \"boolean\"\n", + " },\n", + " \"Beat\": {\n", + " \"type\": \"integer\"\n", + " },\n", + " \"District\": {\n", + " \"type\": \"integer\"\n", + " },\n", + " \"Ward\": {\n", + " \"type\": \"integer\"\n", + " },\n", + " \"Community Area\": {\n", + " \"type\": \"integer\"\n", + " },\n", + " \"FBI Code\": {\n", + " \"type\": \"keyword\"\n", + " },\n", + " \"X Coordinate\": {\n", + " \"type\": \"integer\"\n", + " },\n", + " \"Y Coordinate\": {\n", + " \"type\": \"integer\"\n", + " },\n", + " \"Year\": {\n", + " \"type\": \"integer\"\n", + " },\n", + " \"Updated On\": {\n", + " \"type\": \"date\",\n", + " \"format\": \"MM/dd/yyyy hh:mm:ss a\"\n", + " },\n", + " \"Latitude\": {\n", + " \"type\": \"float\"\n", + " },\n", + " \"Longitude\": {\n", + " \"type\": \"float\"\n", + " },\n", + " \"Location\": {\n", + " \"type\": \"geo_point\",\n", + " \"struct_type\": \"list\"\n", + " }\n", + " }\n", + "}\n" + ] + }, + { + "cell_type": "markdown", + "id": "7RTDkUsr3ead", + 
"metadata": { + "id": "7RTDkUsr3ead" + }, + "source": [ + "## 4. Create Collection\n", + "The Hyperspace engine stores data in collections, where each collection typically hosts data of a similar context. Each search is then performed within a collection. We create a collection using the command \"**create_collection**(config, collection_name)\", where config is the schema defined above." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "nAzO0OH7nt6-", + "metadata": { + "ExecuteTime": { + "end_time": "2024-01-01T19:47:50.803106500Z", + "start_time": "2024-01-01T19:47:50.647905300Z" + }, + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "nAzO0OH7nt6-", + "outputId": "5b5983b3-d95e-4f4f-fc81-8b12c3e00073" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'collections': {'CrimesInChicago': {'creation_time': '2024-03-26T10:43:12Z',\n", + " 'size': 0},\n", + " 'DocRetrievalEmbedded': {'creation_time': '2024-03-20T10:50:16Z',\n", + " 'last_query_time': '2024-03-25T09:27:27Z',\n", + " 'size': 39},\n", + " 'GeneratedData': {'creation_time': '2024-03-26T05:27:09Z',\n", + " 'last_query_time': '2024-03-26T06:28:50Z',\n", + " 'size': 100000},\n", + " 'all-MiniLM-L6-v2_arXiv': {'creation_time': '2024-03-26T07:42:43Z',\n", + " 'last_query_time': '2024-03-26T08:03:16Z',\n", + " 'size': 604001}}}" + ] + }, + "metadata": {}, + "execution_count": 12 + } + ], + "source": [ + "collection_name = 'CrimesInChicago'\n", + "\n", + "if collection_name not in hyperspace_client.collections_info()[\"collections\"]:\n", + "    hyperspace_client.create_collection(config, collection_name)\n", + "\n", + "hyperspace_client.collections_info()\n" + ] + }, + { + "cell_type": "markdown", + "id": "lUpgHD2VWFXd", + "metadata": { + "id": "lUpgHD2VWFXd" + }, + "source": [ + "## 5. Ingest data\n", + "\n", + "In the next step we ingest the dataset in batches of 1000 documents. 
This number can be controlled by the user and, in particular, can be increased in order to improve ingestion time. We add batches of data using the command **add_batch**(batch, collection_name)." + ] + }, + { + "cell_type": "code", + "source": [ + "batch" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "G1jgQgojXmpf", + "outputId": "a964fe08-bc2c-4f6a-af3a-b0f59f16726e" + }, + "id": "G1jgQgojXmpf", + "execution_count": 16, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[{'ID': 10224738,\n", + " 'Case Number': 'HY411648',\n", + " 'Date': 1441449000,\n", + " 'Block': '043XX S WOOD ST',\n", + " 'IUCR': '0486',\n", + " 'Primary Type': 'BATTERY',\n", + " 'Description': 'DOMESTIC BATTERY SIMPLE',\n", + " 'Location Description': 'RESIDENCE',\n", + " 'Arrest': False,\n", + " 'Domestic': True,\n", + " 'Beat': 924,\n", + " 'District': 9,\n", + " 'Ward': 12,\n", + " 'Community Area': 61,\n", + " 'FBI Code': '08B',\n", + " 'X Coordinate': 1165074,\n", + " 'Y Coordinate': 1875917,\n", + " 'Year': 2015,\n", + " 'Updated On': 1518270601,\n", + " 'Latitude': 41.815117282,\n", + " 'Longitude': -87.669999562,\n", + " 'Location': [41.815117282, -87.669999562]}]" + ] + }, + "metadata": {}, + "execution_count": 16 + } + ] + }, + { + "cell_type": "code", + "source": [ + "import numpy as np\n", + "import json\n", + "dataset_path = '100k-crimes-dataset-processed_data.json'\n", + "\n", + "BATCH_SIZE = 1000\n", + "\n", + "batch = []\n", + "with open(dataset_path) as data:\n", + "    for i, metadata_row in enumerate(data):\n", + "        # Keep only the fields declared in the schema\n", + "        row = {key: value for key, value in json.loads(metadata_row).items() if key in config[\"configuration\"].keys()}\n", + "        row[\"ID\"] = str(row[\"ID\"])\n", + "        batch.append(row)\n", + "\n", + "        if i % BATCH_SIZE == 0:\n", + "            response = hyperspace_client.add_batch(batch, collection_name)\n", + "            batch.clear()\n", + "            print(i, response)\n", + "\n", + "# Send the rows left over after the last full batch\n", + "if batch:\n", + "    hyperspace_client.add_batch(batch, collection_name)\n", + "hyperspace_client.commit(collection_name)\n" + ], + 
"metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "KljzusFkXDS0", + "outputId": "dea8192f-6440-4833-f11f-f041e252fcee" + }, + "id": "KljzusFkXDS0", + "execution_count": 17, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "0 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "1000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "2000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "3000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "4000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "5000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "6000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "7000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "8000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "9000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "10000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "11000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "12000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "13000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "14000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "15000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "16000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "17000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "18000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "19000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "20000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "21000 
{'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "22000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "23000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "24000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "25000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "26000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "27000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "28000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "29000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "30000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "31000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "32000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "33000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "34000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "35000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "36000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "37000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "38000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "39000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "40000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "41000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "42000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "43000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "44000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "45000 {'code': 200, 'message': 'Batch 
successfully added', 'status': 'OK'}\n", + "46000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "47000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "48000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "49000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "50000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "51000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "52000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "53000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "54000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "55000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "56000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "57000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "58000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "59000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "60000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "61000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "62000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "63000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "64000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "65000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "66000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "67000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "68000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "69000 {'code': 200, 'message': 'Batch successfully added', 'status': 
'OK'}\n", + "70000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "71000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "72000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "73000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "74000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "75000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "76000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "77000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "78000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "79000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "80000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "81000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "82000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "83000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "84000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "85000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "86000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "87000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "88000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "89000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "90000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "91000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "92000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "93000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "94000 {'code': 
200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "95000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "96000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "97000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "98000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "99000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n" + ] + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'code': 200, 'message': 'Dataset committed successfully', 'status': 'OK'}" + ] + }, + "metadata": {}, + "execution_count": 17 + } + ] + }, + { + "cell_type": "markdown", + "id": "8lRgTyKwWJ84", + "metadata": { + "id": "8lRgTyKwWJ84" + }, + "source": [ + "### Check collection size after ingestion" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9edaa4a5", + "metadata": { + "id": "9edaa4a5", + "outputId": "1d9898c4-d167-4ca7-c896-684c73f2e099" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'collections': {'CrimesInChicago': {'creation_time': '2024-01-18T14:09:54Z', 'size': 100000}}}\n" + ] + } + ], + "source": [ + "print(hyperspace_client.collections_info())" + ] + }, + { + "cell_type": "markdown", + "id": "431d2480", + "metadata": { + "id": "431d2480" + }, + "source": [ + "## 6. Define Logic and Run a Query\n", + "We build a search query using Hyper-space. 
We first select one document from the dataset, document \"10224738\", and take a look at it.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "id": "0YdFBOCsOJka", + "metadata": { + "ExecuteTime": { + "end_time": "2024-01-01T19:48:23.799088500Z", + "start_time": "2024-01-01T19:48:23.709531200Z" + }, + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "0YdFBOCsOJka", + "outputId": "e9be3b2c-96f0-458e-c91b-be5836d57a04" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'ID': '10224738',\n", + " 'Case Number': 'HY411648',\n", + " 'Date': 1441449000,\n", + " 'Block': '043XX S WOOD ST',\n", + " 'IUCR': '0486',\n", + " 'Primary Type': 'BATTERY',\n", + " 'Description': 'DOMESTIC BATTERY SIMPLE',\n", + " 'Location Description': 'RESIDENCE',\n", + " 'Arrest': False,\n", + " 'Domestic': True,\n", + " 'Beat': 924,\n", + " 'District': 9,\n", + " 'Ward': 12,\n", + " 'Community Area': 61,\n", + " 'FBI Code': '08B',\n", + " 'X Coordinate': 1165074,\n", + " 'Y Coordinate': 1875917,\n", + " 'Year': 2015,\n", + " 'Updated On': 1518270601,\n", + " 'Latitude': 41.815117282,\n", + " 'Longitude': -87.669999562,\n", + " 'Location': [41.815117282, -87.669999562]}" + ] + }, + "metadata": {}, + "execution_count": 27 + } + ], + "source": [ + "query_doc = hyperspace_client.get_document(collection_name, \"10224738\")\n", + "query_doc" + ] + }, + { + "cell_type": "markdown", + "id": "ozeEfW7b6g9Z", + "metadata": { + "id": "ozeEfW7b6g9Z" + }, + "source": [ + "We use very simple logic that matches the description and location, and makes sure the case number doesn't match so that we don't get back the query document itself.\n", + "\n", + "We use the following logic:\n", + "\n", + "\n", + "* document/case will match if it has the same \"Description\" and not the same \"Case Number\"\n", + "* document/case will also match if it happened in the same \"Block\"\n", + "* we boost the score by 5 points if geo \"Location\" is close (40 
something)\n", + "* we reduce the score by 5 points if the \"District\" is the same and the case date is 40 to 100 days before the query case date\n", + "\n", + "The score function is listed next:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "GxF-npmSzv8w", + "metadata": { + "ExecuteTime": { + "end_time": "2024-01-01T19:48:24.919943600Z", + "start_time": "2024-01-01T19:48:23.804052800Z" + }, + "id": "GxF-npmSzv8w", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "c009e108-1429-45de-9b15-da5494561689" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'code': 200, 'message': 'Function was set successfully', 'status': 'OK'}" + ] + }, + "metadata": {}, + "execution_count": 24 + } + ], + "source": [ + "def similarity_score(params, doc):\n", + " score = 0.0\n", + " if match('Description') and not match('Case Number'):\n", + " score = rarity_max('Description')\n", + " if geo_dist_match('Location',40):\n", + " score += 5\n", + " if params['District'] == doc['District'] and window_match('Date', 100, 40):\n", + " score -= 5\n", + "\n", + " return score\n", + "\n", + "hyperspace_client.set_function(similarity_score, collection_name, \"similarity_score\")\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "9278f9a1", + "metadata": { + "id": "9278f9a1" + }, + "source": [ + "### Next, We Submit The Query" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "id": "5ff98e9d", + "metadata": { + "ExecuteTime": { + "end_time": "2024-01-01T19:48:27.188125100Z", + "start_time": "2024-01-01T19:48:24.923896Z" + }, + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "5ff98e9d", + "outputId": "fea3aa46-c069-4ee8-d698-dc14e0bb7e92" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Query run time = 2.28533ms\n", + 
"------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\n", + "Rank ID Score \n", + "====================================================================================================================================================================================\n", + "1 10224751 8.47 \n", + "2 10224758 8.47 \n", + "3 10224766 8.47 \n", + "4 10224787 8.47 \n", + "5 10224791 8.47 \n", + "6 10224801 8.47 \n", + "7 10224810 8.47 \n", + "8 10224843 8.47 \n", + "9 10224858 8.47 \n", + "10 10224863 8.47 \n", + "11 10224869 8.47 \n", + "12 10224874 8.47 \n", + "13 10224891 8.47 \n", + "14 10224895 8.47 \n", + "15 10224904 8.47 \n", + "16 10224927 8.47 \n", + "17 10224928 8.47 \n", + "18 10224929 8.47 \n", + "19 10224938 8.47 \n", + "20 10224968 8.47 \n", + "21 10224973 8.47 \n", + "22 10224978 8.47 \n", + "23 10224984 8.47 \n", + "24 10224988 8.47 \n", + "25 10225001 8.47 \n", + "26 10225005 8.47 \n", + "27 10225007 8.47 \n", + "28 10225012 8.47 \n", + "29 10225015 8.47 \n", + "30 10225037 8.47 \n" + ] + } + ], + "source": [ + "results = hyperspace_client.search({'params': query_doc},\n", + " size=30,\n", + " function_name='similarity_score',\n", + " collection_name=collection_name)\n", + "\n", + "candidates = results['candidates']\n", + "\n", + "print(f\"Query run time = {results['took_ms']}ms\")\n", + "print(\"-\"*180)\n", + "\n", + "\n", + "print(\"{:<5} {:<100} {:<40} \".format(\"Rank\", \"ID\", \"Score\"))\n", + "print(\"=\"*180)\n", + "\n", + "for i, result in enumerate(results['similarity']):\n", + " print(\"{:<5} {:<100} {:<10}\".format(i + 1, result['document_id'], round(result['score'], 2)))\n" + ] + }, + { + "cell_type": "markdown", + "id": "EOdmFy7WDBGx", + "metadata": { + "id": "EOdmFy7WDBGx" + }, + "source": [ + "We display the top 30 results. 
Note that results with identical scores are ordered arbitrarily, so more complex logic will likely produce a better ranking.\n", + "Let's take a look at the content of the top matching document: we can see a match in \"Description\", no match in \"Block\" or \"District\", and a match in \"Location\"." + ] + }, + { + "cell_type": "markdown", + "id": "97f08c7c", + "metadata": { + "id": "97f08c7c" + }, + "source": [ + "### Score function with some aggregation" + ] + }, + { + "cell_type": "markdown", + "id": "33153809", + "metadata": { + "id": "33153809" + }, + "source": [ + "We add count aggregations to get the number of cases with a \"Description\" match and a separate count of the cases that also have a geo match." + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "id": "T3zQ_oWDiOlX", + "metadata": { + "ExecuteTime": { + "end_time": "2024-01-01T19:48:28.309475700Z", + "start_time": "2024-01-01T19:48:27.190122300Z" + }, + "id": "T3zQ_oWDiOlX", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "a6b5bb2c-8c25-4d90-9b5e-1d66ddea8727" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'code': 200, 'message': 'Function was set successfully', 'status': 'OK'}" + ] + }, + "metadata": {}, + "execution_count": 39 + } + ], + "source": [ + "def similarity_score_aggregations(params, doc):\n", + " score = 0.0\n", + "\n", + " if match('Description') and not match('Case Number'):\n", + " aggregate_count(\"count of cases\")\n", + "\n", + " score = rarity_max('Description')\n", + " if geo_dist_match('Location',40):\n", + " aggregate_count(\"count of cases in nearby districts\")\n", + " score += 5\n", + " if params['District'] == doc['District'] and window_match('Date', 100, 40):\n", + " score -= 5\n", + " if match('Block'):\n", + " score += rarity_max('Block')\n", + " return score\n", + "\n", + "hyperspace_client.set_function(similarity_score_aggregations, collection_name, \"similarity_score_aggregations\")\n", 
+ "\n" + ] + }, + { + "cell_type": "markdown", + "id": "986987d6", + "metadata": { + "id": "986987d6" + }, + "source": [ + "### Next, we fire the query" + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "id": "TwAlkz09iRtm", + "metadata": { + "ExecuteTime": { + "end_time": "2024-01-01T19:48:30.564849400Z", + "start_time": "2024-01-01T19:48:28.313459800Z" + }, + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "TwAlkz09iRtm", + "outputId": "d2bc64f3-debb-418d-b571-a6da243e98f3" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Query run time = 2.43376ms\n", + "------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\n", + "Rank ID Score \n", + "====================================================================================================================================================================================\n", + "1 10192664 21.98 \n", + "2 10325639 21.98 \n", + "3 10224738 13.51 \n", + "4 10263422 13.51 \n", + "5 10269178 13.51 \n", + "6 10318418 13.51 \n", + "7 10327572 13.51 \n", + "8 11609751 13.51 \n", + "9 10224751 8.47 \n", + "10 10224758 8.47 \n", + "11 10224766 8.47 \n", + "12 10224787 8.47 \n", + "13 10224791 8.47 \n", + "14 10224801 8.47 \n", + "15 10224810 8.47 \n", + "16 10224843 8.47 \n", + "17 10224858 8.47 \n", + "18 10224863 8.47 \n", + "19 10224869 8.47 \n", + "20 10224874 8.47 \n", + "21 10224891 8.47 \n", + "22 10224895 8.47 \n", + "23 10224904 8.47 \n", + "24 10224927 8.47 \n", + "25 10224928 8.47 \n", + "26 10224929 8.47 \n", + "27 10224938 8.47 \n", + "28 10224968 8.47 \n", + "29 10224973 8.47 \n", + "30 10224978 8.47 \n", + "====================================================================================================================================================================================\n", + "Count of initial candidates 
8938\n" + ] + } + ], + "source": [ + "results = hyperspace_client.search({'params': query_doc},\n", + " size=30,\n", + " function_name='similarity_score_aggregations',\n", + " collection_name=collection_name)\n", + "\n", + "\n", + "candidates = results['candidates']\n", + "\n", + "print(f\"Query run time = {results['took_ms']}ms\")\n", + "print(\"-\"*180)\n", + "\n", + "\n", + "print(\"{:<5} {:<100} {:<40} \".format(\"Rank\", \"ID\", \"Score\"))\n", + "print(\"=\"*180)\n", + "\n", + "for i, result in enumerate(results['similarity']):\n", + " print(\"{:<5} {:<100} {:<10}\".format(i + 1, result['document_id'], round(result['score'], 2)))\n", + "print(\"=\"*180)\n", + "\n", + "print(\"Count of initial candidates \", results['aggregations'][\"count of cases\"]['count'])\n", + "print(\"Count of cases in nearby districts \", results['aggregations'][\"count of cases in nearby districts\"]['count'])" + ] + }, + { + "cell_type": "markdown", + "id": "ac3f4f25", + "metadata": { + "id": "ac3f4f25" + }, + "source": [ + "We can see the aggregation section in the results, where the two counts are listed.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "mltHy5eKiDmu", + "metadata": { + "id": "mltHy5eKiDmu" + }, + "source": [ + "For more information, visit us at [Hyperspace](https://www.hyper-space.io/)" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.16" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file