Improve: Python evals for exact search

unum-cloud · Mar 25, 2024 · commit 36f6c5e · 1 parent 5b5973c

Showing 8 changed files with 272 additions and 131 deletions.
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -33,11 +33,19 @@ brew install libomp llvm # MacOS
 Using modern syntax, this is how you build and run the test suite:
 
 ```sh
-cmake -DUSEARCH_BUILD_TEST_CPP=1 -B ./build_debug
+cmake -DUSEARCH_BUILD_TEST_CPP=1 -DCMAKE_BUILD_TYPE=Debug -B ./build_debug
 cmake --build ./build_debug --config Debug
 ./build_debug/test_cpp
 ```
 
+If the build mode is not specified, the default is `Release`.
+
+```sh
+cmake -DUSEARCH_BUILD_TEST_CPP=1 -B ./build_release
+cmake --build ./build_release --config Release
+./build_release/test_cpp
+```
+
 The CMakeLists.txt file has a number of options you can pass:
 
 - What to build:
@@ -181,10 +189,12 @@ RUN yum install tar git python3 cmake gcc-c++ -y && yum groupinstall "Developmen
 # Assuming AWS Linux 2 uses old compilers:
 ENV USEARCH_USE_FP16LIB 1
 ENV DUSEARCH_USE_SIMSIMD 1
-ENV SIMSIMD_TARGET_X86_AVX2 1
-ENV SIMSIMD_TARGET_X86_AVX512 0
-ENV SIMSIMD_TARGET_ARM_NEON 1
-ENV SIMSIMD_TARGET_ARM_SVE 0
+ENV SIMSIMD_TARGET_HASWELL 1
+ENV SIMSIMD_TARGET_SKYLAKE 0
+ENV SIMSIMD_TARGET_ICE 0
+ENV SIMSIMD_TARGET_SAPPHIRE 0
+ENV SIMSIMD_TARGET_NEON 1
+ENV SIMSIMD_TARGET_SVE 0
 
 # For specific PR:
 # RUN npm install --build-from-source unum-cloud/usearch#pull/302/head
@@ -212,7 +222,7 @@ USearch provides Rust bindings available on [Crates.io](https://crates.io/crates
 The compilation settings are controlled by the `build.rs` and are independent from CMake used for C/C++ builds.
 
 ```sh
-cargo test -p usearch
+cargo test -p usearch -- --nocapture --test-threads=1
 cargo publish
 ```
 

diff --git a/README.md b/README.md
@@ -192,6 +192,44 @@ When compared to FAISS's `IndexFlatL2` in Google Colab, __[USearch may offer up
 - `faiss.IndexFlatL2`: __55.3 ms__.
 - `usearch.index.search`: __2.54 ms__.
 
+## User-Defined Functions
+
+While most vector search packages concentrate on just a few metrics - "Inner Product distance" and "Euclidean distance," USearch extends this list to include any user-defined metrics.
+This flexibility allows you to customize your search for various applications, from computing geospatial coordinates with the rare [Haversine][haversine] distance to creating custom metrics for composite embeddings from multiple AI models.
+
+![USearch: Vector Search Approaches](https://github.com/unum-cloud/usearch/blob/main/assets/usearch-approaches-white.png?raw=true)
+
+Unlike older approaches indexing high-dimensional spaces, like KD-Trees and Locality Sensitive Hashing, HNSW doesn't require vectors to be identical in length.
+They only have to be comparable.
+So you can apply it in [obscure][obscure] applications, like searching for similar sets or fuzzy text matching, using [GZip][gzip-similarity] as a distance function.
+
+> Read more about [JIT and UDF in USearch Python SDK](https://unum-cloud.github.io/usearch/python#user-defined-metrics-and-jit-in-python).
+
+[haversine]: https://ashvardanian.com/posts/abusing-vector-search#geo-spatial-indexing
+[obscure]: https://ashvardanian.com/posts/abusing-vector-search
+[gzip-similarity]: https://twitter.com/LukeGessler/status/1679211291292889100?s=20
+
+## Memory Efficiency, Downcasting, and Quantization
+
+Training a quantization model and reducing dimensionality are common approaches to accelerating vector search.
+Those, however, are only sometimes reliable, can significantly affect the statistical properties of your data, and require regular adjustments if your distribution shifts.
+Instead, we have focused on high-precision arithmetic over low-precision downcasted vectors.
+The same index and the same `add` and `search` operations will automatically down-cast or up-cast between `f64_t`, `f32_t`, `f16_t`, `i8_t`, and single-bit representations.
+You can use the following command to check whether hardware acceleration is enabled:
+
+```sh
+$ python -c 'from usearch.index import Index; print(Index(ndim=768, metric="cos", dtype="f16").hardware_acceleration)'
+> sapphire
+$ python -c 'from usearch.index import Index; print(Index(ndim=166, metric="tanimoto").hardware_acceleration)'
+> ice
+```
+
+Using smaller numeric types will save you RAM needed to store the vectors, but you can also compress the neighbors lists forming our proximity graphs.
+By default, 32-bit `uint32_t` is used to enumerate those, which is not enough if you need to address over 4 Billion entries.
+For such cases we provide a custom `uint40_t` type, that will still be 37.5% more space-efficient than the commonly used 8-byte integers, and will scale up to 1 Trillion entries.
+
+![USearch uint40_t support](https://github.com/unum-cloud/usearch/blob/main/assets/usearch-neighbor-types.png?raw=true)
+
 ## `Indexes` for Multi-Index Lookups
 
 For larger workloads targeting billions or even trillions of vectors, parallel multi-index lookups become invaluable.
@@ -264,43 +302,6 @@ pairs: dict = men.join(women, max_proposals=0, exact=False)
 
 > Read more in the post: [Combinatorial Stable Marriages for Semantic Search 💍](https://ashvardanian.com/posts/searching-stable-marriages)
 
-## User-Defined Functions
-
-While most vector search packages concentrate on just a few metrics - "Inner Product distance" and "Euclidean distance," USearch extends this list to include any user-defined metrics.
-This flexibility allows you to customize your search for various applications, from computing geospatial coordinates with the rare [Haversine][haversine] distance to creating custom metrics for composite embeddings from multiple AI models.
-
-![USearch: Vector Search Approaches](https://github.com/unum-cloud/usearch/blob/main/assets/usearch-approaches-white.png?raw=true)
-
-Unlike older approaches indexing high-dimensional spaces, like KD-Trees and Locality Sensitive Hashing, HNSW doesn't require vectors to be identical in length.
-They only have to be comparable.
-So you can apply it in [obscure][obscure] applications, like searching for similar sets or fuzzy text matching, using [GZip][gzip-similarity] as a distance function.
-
-> Read more about [JIT and UDF in USearch Python SDK](https://unum-cloud.github.io/usearch/python#user-defined-metrics-and-jit-in-python).
-
-[haversine]: https://ashvardanian.com/posts/abusing-vector-search#geo-spatial-indexing
-[obscure]: https://ashvardanian.com/posts/abusing-vector-search
-[gzip-similarity]: https://twitter.com/LukeGessler/status/1679211291292889100?s=20
-
-## Memory Efficiency, Downcasting, and Quantization
-
-Training a quantization model and dimension-reduction is a common approach to accelerate vector search.
-Those, however, are only sometimes reliable, can significantly affect the statistical properties of your data, and require regular adjustments if your distribution shifts.
-Instead, we have focused on high-precision arithmetic over low-precision downcasted vectors.
-The same index, and `add` and `search` operations will automatically down-cast or up-cast between `f64_t`, `f32_t`, `f16_t`, `i8_t`, and single-bit representations.
-You can use the following command to check, if hardware acceleration is enabled:
-
-```sh
-$ python -c 'from usearch.index import Index; print(Index(ndim=768, metric="cos", dtype="f16").hardware_acceleration)'
-> avx512+f16
-$ python -c 'from usearch.index import Index; print(Index(ndim=166, metric="tanimoto").hardware_acceleration)'
-> avx512+popcnt
-```
-
-Using smaller numeric types will save you RAM needed to store the vectors, but you can also compress the neighbors lists forming our proximity graphs.
-By default, 32-bit `uint32_t` is used to enumerate those, which is not enough if you need to address over 4 Billion entries.
-For such cases we provide a custom `uint40_t` type, that will still be 37.5% more space-efficient than the commonly used 8-byte integers, and will scale up to 1 Trillion entries.
-
-![USearch uint40_t support](https://github.com/unum-cloud/usearch/blob/main/assets/usearch-neighbor-types.png?raw=true)
 
 ## Functionality
 
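The README hunk above names the Haversine distance as an example of a user-defined metric. As a rough illustration of what such a metric computes, here is a minimal pure-Python sketch. It does not use the USearch API; the function name `haversine_km` and the mean Earth radius of 6371 km are assumptions made for this example only.

```python
import math

# Assumed mean Earth radius in kilometers (illustrative constant, not from the commit).
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    # Haversine term; clamp to 1.0 to guard against floating-point overshoot.
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


# Two antipodal points on the equator are half a circumference apart.
half_circumference = haversine_km(0.0, 0.0, 0.0, 180.0)
```

A function of this shape takes two two-dimensional "vectors" and returns a single scalar, which is all a comparison-based index needs from a metric.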

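The `uint40_t` figures quoted in the README hunk above (37.5% saving against 8-byte integers, capacity of about 1 trillion entries) follow from simple arithmetic. The sketch below reproduces them; the helper names `addressable_entries` and `space_saving` are hypothetical and exist only for this illustration.

```python
def addressable_entries(id_bytes: int) -> int:
    """Number of distinct identifiers representable with an unsigned integer of `id_bytes` bytes."""
    return 2 ** (8 * id_bytes)


def space_saving(small_bytes: int, large_bytes: int) -> float:
    """Fraction of space saved by storing `small_bytes` per identifier instead of `large_bytes`."""
    return 1 - small_bytes / large_bytes


print(addressable_entries(4))  # 4294967296, roughly 4 billion for uint32_t
print(addressable_entries(5))  # 1099511627776, roughly 1.1 trillion for a 40-bit type
print(space_saving(5, 8))      # 0.375, i.e. 37.5% smaller than 8-byte integers
```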
diff --git a/cpp/bench.cpp b/cpp/bench.cpp
@@ -42,8 +42,6 @@
 #include <omp.h> // `omp_set_num_threads()`
 #endif
 
-#include <simsimd/simsimd.h>
-
 #include <usearch/index_dense.hpp>
 
 using namespace unum::usearch;

diff --git a/docs/benchmarks.md b/docs/benchmarks.md
@@ -61,23 +61,19 @@ Within this repository you will find two commonly used utilities:
 To achieve the best results, we suggest compiling locally for the target architecture.
 
 ```sh
-cmake -B ./build_release \
-    -DCMAKE_BUILD_TYPE=Release \
-    -DUSEARCH_USE_OPENMP=1 \
-    -DUSEARCH_USE_JEMALLOC=1 && \
-    make -C ./build_release -j
-
-./build_release/bench --help
+cmake -B ./build_release -DUSEARCH_BUILD_BENCH_CPP=1 -DUSEARCH_BUILD_TEST_C=1 -DUSEARCH_USE_OPENMP=1 -DUSEARCH_USE_SIMSIMD=1
+cmake --build ./build_release --config Release -j
+./build_release/bench_cpp --help
 ```
 
 Which would print the following instructions.
 
 ```txt
 SYNOPSIS
-        ./build_release/bench [--vectors <path>] [--queries <path>] [--neighbors <path>] [-b] [-j
-                              <integer>] [-c <integer>] [--expansion-add <integer>]
-                              [--expansion-search <integer>] [--native|--f16quant|--i8quant]
-                              [--ip|--l2sq|--cos|--haversine] [-h]
+        ./build_release/bench_cpp [--vectors <path>] [--queries <path>] [--neighbors <path>] [-b] [-j
+                                  <integer>] [-c <integer>] [--expansion-add <integer>]
+                                  [--expansion-search <integer>] [--native|--f16quant|--i8quant]
+                                  [--ip|--l2sq|--cos|--haversine] [-h]
 
 OPTIONS
         --vectors <path>
@@ -106,7 +102,7 @@ OPTIONS
         --f16quant  Enable `f16_t` quantization
         --i8quant   Enable `int8_t` quantization
         --ip        Choose Inner Product metric
-        --l2sq        Choose L2 Euclidean metric
+        --l2sq      Choose L2 Euclidean metric
         --cos       Choose Angular metric
         --haversine Choose Haversine metric
         -h, --help  Print this help information on this tool and exit
@@ -115,12 +111,12 @@ OPTIONS
 Here is an example of running the C++ benchmark:
 
 ```sh
-./build_release/bench \
+./build_release/bench_cpp \
     --vectors datasets/wiki_1M/base.1M.fbin \
     --queries datasets/wiki_1M/query.public.100K.fbin \
     --neighbors datasets/wiki_1M/groundtruth.public.100K.ibin
 
-./build_release/bench \
+./build_release/bench_cpp \
     --vectors datasets/t2i_1B/base.1B.fbin \
     --queries datasets/t2i_1B/query.public.100K.fbin \
     --neighbors datasets/t2i_1B/groundtruth.public.100K.ibin \