Improve: Python evals for exact search
ashvardanian committed Mar 25, 2024
1 parent 5b5973c commit 36f6c5e
Showing 8 changed files with 272 additions and 131 deletions.
22 changes: 16 additions & 6 deletions CONTRIBUTING.md
Expand Up @@ -33,11 +33,19 @@ brew install libomp llvm # MacOS
Using modern syntax, this is how you build and run the test suite:

```sh
cmake -DUSEARCH_BUILD_TEST_CPP=1 -B ./build_debug
cmake -DUSEARCH_BUILD_TEST_CPP=1 -DCMAKE_BUILD_TYPE=Debug -B ./build_debug
cmake --build ./build_debug --config Debug
./build_debug/test_cpp
```

If the build mode is not specified, the default is `Release`.

```sh
cmake -DUSEARCH_BUILD_TEST_CPP=1 -B ./build_release
cmake --build ./build_release --config Release
./build_release/test_cpp
```

The CMakeLists.txt file has a number of options you can pass:

- What to build:
Expand Down Expand Up @@ -181,10 +189,12 @@ RUN yum install tar git python3 cmake gcc-c++ -y && yum groupinstall "Developmen
# Assuming AWS Linux 2 uses old compilers:
ENV USEARCH_USE_FP16LIB 1
ENV USEARCH_USE_SIMSIMD 1
ENV SIMSIMD_TARGET_X86_AVX2 1
ENV SIMSIMD_TARGET_X86_AVX512 0
ENV SIMSIMD_TARGET_ARM_NEON 1
ENV SIMSIMD_TARGET_ARM_SVE 0
ENV SIMSIMD_TARGET_HASWELL 1
ENV SIMSIMD_TARGET_SKYLAKE 0
ENV SIMSIMD_TARGET_ICE 0
ENV SIMSIMD_TARGET_SAPPHIRE 0
ENV SIMSIMD_TARGET_NEON 1
ENV SIMSIMD_TARGET_SVE 0

# For specific PR:
# RUN npm install --build-from-source unum-cloud/usearch#pull/302/head
Expand Down Expand Up @@ -212,7 +222,7 @@ USearch provides Rust bindings available on [Crates.io](https://crates.io/crates
The compilation settings are controlled by the `build.rs` and are independent from CMake used for C/C++ builds.

```sh
cargo test -p usearch
cargo test -p usearch -- --nocapture --test-threads=1
cargo publish
```

Expand Down
75 changes: 38 additions & 37 deletions README.md
Expand Up @@ -192,6 +192,44 @@ When compared to FAISS's `IndexFlatL2` in Google Colab, __[USearch may offer up
- `faiss.IndexFlatL2`: __55.3 ms__.
- `usearch.index.search`: __2.54 ms__.

## User-Defined Functions

While most vector search packages concentrate on just a few metrics - "Inner Product distance" and "Euclidean distance," USearch extends this list to include any user-defined metrics.
This flexibility allows you to customize your search for various applications, from computing geospatial coordinates with the rare [Haversine][haversine] distance to creating custom metrics for composite embeddings from multiple AI models.

![USearch: Vector Search Approaches](https://github.com/unum-cloud/usearch/blob/main/assets/usearch-approaches-white.png?raw=true)

Unlike older approaches indexing high-dimensional spaces, like KD-Trees and Locality Sensitive Hashing, HNSW doesn't require vectors to be identical in length.
They only have to be comparable.
So you can apply it in [obscure][obscure] applications, like searching for similar sets or fuzzy text matching, using [GZip][gzip-similarity] as a distance function.

> Read more about [JIT and UDF in USearch Python SDK](https://unum-cloud.github.io/usearch/python#user-defined-metrics-and-jit-in-python).
[haversine]: https://ashvardanian.com/posts/abusing-vector-search#geo-spatial-indexing
[obscure]: https://ashvardanian.com/posts/abusing-vector-search
[gzip-similarity]: https://twitter.com/LukeGessler/status/1679211291292889100?s=20
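
To make the idea of a user-defined metric concrete, here is a minimal, library-independent sketch of the Haversine distance mentioned above, paired with a brute-force nearest-neighbor lookup. It does not use the USearch API; the names `haversine_km`, `nearest`, and `EARTH_RADIUS_KM` are hypothetical and exist only for this illustration.

```python
import math

# Mean Earth radius in kilometers; an assumption of this sketch.
EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    # Clamp to 1.0 to guard against rounding errors before taking the arcsine.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


def nearest(query, points):
    """Brute-force nearest neighbor under the custom metric."""
    return min(points, key=lambda p: haversine_km(query, p))


if __name__ == "__main__":
    yerevan = (40.18, 44.51)
    candidates = [(41.72, 44.79), (51.51, -0.13)]  # Tbilisi, London
    print(nearest(yerevan, candidates))
```

Any function with this shape, two inputs and one scalar output, can play the role of a metric; an index only needs the results to be comparable.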

## Memory Efficiency, Downcasting, and Quantization

Training a quantization model and reducing dimensionality are common approaches to accelerating vector search.
Those, however, are only sometimes reliable, can significantly affect the statistical properties of your data, and require regular adjustments if your distribution shifts.
Instead, we have focused on high-precision arithmetic over low-precision downcasted vectors.
The same index and its `add` and `search` operations will automatically down-cast or up-cast between `f64_t`, `f32_t`, `f16_t`, `i8_t`, and single-bit representations.
You can use the following command to check whether hardware acceleration is enabled:

```sh
$ python -c 'from usearch.index import Index; print(Index(ndim=768, metric="cos", dtype="f16").hardware_acceleration)'
> sapphire
$ python -c 'from usearch.index import Index; print(Index(ndim=166, metric="tanimoto").hardware_acceleration)'
> ice
```
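
To put the numeric types above in perspective, the following sketch computes how many bytes a single vector occupies under each representation. It is plain arithmetic rather than a USearch call; the `BITS_PER_SCALAR` table and the `b1` label for single-bit scalars are assumptions made for this illustration.

```python
# Bits per scalar for each representation; "b1" is a label chosen here
# for single-bit scalars.
BITS_PER_SCALAR = {"f64": 64, "f32": 32, "f16": 16, "i8": 8, "b1": 1}


def vector_bytes(ndim, dtype):
    """Bytes needed to store one vector, rounding bit-packed vectors up to whole bytes."""
    bits = ndim * BITS_PER_SCALAR[dtype]
    return (bits + 7) // 8


if __name__ == "__main__":
    for dtype in BITS_PER_SCALAR:
        print(dtype, vector_bytes(768, dtype))
```

For a 768-dimensional embedding, moving from `f32` to `f16` halves the storage per vector, and `i8` halves it again.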

Using smaller numeric types will save you RAM needed to store the vectors, but you can also compress the neighbors lists forming our proximity graphs.
By default, 32-bit `uint32_t` is used to enumerate those, which is not enough if you need to address over 4 billion entries.
For such cases we provide a custom `uint40_t` type that is still 37.5% more space-efficient than the commonly used 8-byte integers and scales up to 1 trillion entries.

![USearch uint40_t support](https://github.com/unum-cloud/usearch/blob/main/assets/usearch-neighbor-types.png?raw=true)
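
The figures quoted for `uint40_t` follow from simple arithmetic, which the sketch below reproduces. The helper names are hypothetical, and the 5-byte packing via `int.to_bytes` only illustrates the idea; it is not how USearch implements the type.

```python
def max_addressable(id_bytes):
    """Number of distinct entries an unsigned integer of the given width can enumerate."""
    return 2 ** (8 * id_bytes)


def savings_vs(id_bytes, baseline_bytes=8):
    """Fraction of space saved relative to a wider baseline integer."""
    return 1 - id_bytes / baseline_bytes


def pack_id(identifier, id_bytes=5):
    """Store an identifier in a fixed number of little-endian bytes."""
    return identifier.to_bytes(id_bytes, "little")


def unpack_id(buffer):
    """Recover the identifier from its little-endian byte form."""
    return int.from_bytes(buffer, "little")


if __name__ == "__main__":
    print(max_addressable(4))  # capacity of a 32-bit identifier
    print(max_addressable(5))  # capacity of a 40-bit identifier
    print(savings_vs(5))       # fraction saved versus 8-byte integers
```

A 40-bit identifier addresses 2^40 entries, slightly over one trillion, while occupying 5 of the 8 bytes a 64-bit integer would need.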

## `Indexes` for Multi-Index Lookups

For larger workloads targeting billions or even trillions of vectors, parallel multi-index lookups become invaluable.
Expand Down Expand Up @@ -264,43 +302,6 @@ pairs: dict = men.join(women, max_proposals=0, exact=False)

> Read more in the post: [Combinatorial Stable Marriages for Semantic Search 💍](https://ashvardanian.com/posts/searching-stable-marriages)
## User-Defined Functions

While most vector search packages concentrate on just a few metrics - "Inner Product distance" and "Euclidean distance," USearch extends this list to include any user-defined metrics.
This flexibility allows you to customize your search for various applications, from computing geospatial coordinates with the rare [Haversine][haversine] distance to creating custom metrics for composite embeddings from multiple AI models.

![USearch: Vector Search Approaches](https://github.com/unum-cloud/usearch/blob/main/assets/usearch-approaches-white.png?raw=true)

Unlike older approaches indexing high-dimensional spaces, like KD-Trees and Locality Sensitive Hashing, HNSW doesn't require vectors to be identical in length.
They only have to be comparable.
So you can apply it in [obscure][obscure] applications, like searching for similar sets or fuzzy text matching, using [GZip][gzip-similarity] as a distance function.

> Read more about [JIT and UDF in USearch Python SDK](https://unum-cloud.github.io/usearch/python#user-defined-metrics-and-jit-in-python).
[haversine]: https://ashvardanian.com/posts/abusing-vector-search#geo-spatial-indexing
[obscure]: https://ashvardanian.com/posts/abusing-vector-search
[gzip-similarity]: https://twitter.com/LukeGessler/status/1679211291292889100?s=20

## Memory Efficiency, Downcasting, and Quantization

Training a quantization model and reducing dimensionality are common approaches to accelerating vector search.
Those, however, are only sometimes reliable, can significantly affect the statistical properties of your data, and require regular adjustments if your distribution shifts.
Instead, we have focused on high-precision arithmetic over low-precision downcasted vectors.
The same index and its `add` and `search` operations will automatically down-cast or up-cast between `f64_t`, `f32_t`, `f16_t`, `i8_t`, and single-bit representations.
You can use the following command to check whether hardware acceleration is enabled:

```sh
$ python -c 'from usearch.index import Index; print(Index(ndim=768, metric="cos", dtype="f16").hardware_acceleration)'
> avx512+f16
$ python -c 'from usearch.index import Index; print(Index(ndim=166, metric="tanimoto").hardware_acceleration)'
> avx512+popcnt
```

Using smaller numeric types will save you RAM needed to store the vectors, but you can also compress the neighbors lists forming our proximity graphs.
By default, 32-bit `uint32_t` is used to enumerate those, which is not enough if you need to address over 4 billion entries.
For such cases we provide a custom `uint40_t` type that is still 37.5% more space-efficient than the commonly used 8-byte integers and scales up to 1 trillion entries.

![USearch uint40_t support](https://github.com/unum-cloud/usearch/blob/main/assets/usearch-neighbor-types.png?raw=true)

## Functionality

Expand Down
2 changes: 0 additions & 2 deletions cpp/bench.cpp
Expand Up @@ -42,8 +42,6 @@
#include <omp.h> // `omp_set_num_threads()`
#endif

#include <simsimd/simsimd.h>

#include <usearch/index_dense.hpp>

using namespace unum::usearch;
Expand Down
24 changes: 10 additions & 14 deletions docs/benchmarks.md
Expand Up @@ -61,23 +61,19 @@ Within this repository you will find two commonly used utilities:
To achieve the best results, we suggest compiling locally for the target architecture.

```sh
cmake -B ./build_release \
-DCMAKE_BUILD_TYPE=Release \
-DUSEARCH_USE_OPENMP=1 \
-DUSEARCH_USE_JEMALLOC=1 && \
make -C ./build_release -j

./build_release/bench --help
cmake -B ./build_release -DUSEARCH_BUILD_BENCH_CPP=1 -DUSEARCH_BUILD_TEST_C=1 -DUSEARCH_USE_OPENMP=1 -DUSEARCH_USE_SIMSIMD=1
cmake --build ./build_release --config Release -j
./build_release/bench_cpp --help
```

Which would print the following instructions.

```txt
SYNOPSIS
./build_release/bench [--vectors <path>] [--queries <path>] [--neighbors <path>] [-b] [-j
<integer>] [-c <integer>] [--expansion-add <integer>]
[--expansion-search <integer>] [--native|--f16quant|--i8quant]
[--ip|--l2sq|--cos|--haversine] [-h]
./build_release/bench_cpp [--vectors <path>] [--queries <path>] [--neighbors <path>] [-b] [-j
<integer>] [-c <integer>] [--expansion-add <integer>]
[--expansion-search <integer>] [--native|--f16quant|--i8quant]
[--ip|--l2sq|--cos|--haversine] [-h]
OPTIONS
--vectors <path>
Expand Down Expand Up @@ -106,7 +102,7 @@ OPTIONS
--f16quant Enable `f16_t` quantization
--i8quant Enable `int8_t` quantization
--ip Choose Inner Product metric
--l2sq Choose L2 Euclidean metric
--l2sq Choose L2 Euclidean metric
--cos Choose Angular metric
--haversine Choose Haversine metric
-h, --help Print this help information on this tool and exit
Expand All @@ -115,12 +111,12 @@ OPTIONS
Here is an example of running the C++ benchmark:

```sh
./build_release/bench \
./build_release/bench_cpp \
--vectors datasets/wiki_1M/base.1M.fbin \
--queries datasets/wiki_1M/query.public.100K.fbin \
--neighbors datasets/wiki_1M/groundtruth.public.100K.ibin

./build_release/bench \
./build_release/bench_cpp \
--vectors datasets/t2i_1B/base.1B.fbin \
--queries datasets/t2i_1B/query.public.100K.fbin \
--neighbors datasets/t2i_1B/groundtruth.public.100K.ibin \
Expand Down