C++ Improvements - API enhancement and increase testing #85

stephen29xie · 2024-08-26T20:06:07Z

Changes Made

C++

Add overloaded Index::addItems and Index::query methods that accept a 2D vector of floats (std::vector<std::vector<float>>) instead of an NDArray.
- When the Java and Python bindings add items or query, they convert the Java and Python objects to a 2D NDArray, so the NDArray is abstracted away from the caller. The C++ interface should offer the same abstraction to stay easy to use and intuitive, especially if Voyager is used as a standalone C++ library (cpp wrappers are using default namespace #52).
- This is used in the unit test when we add items or query the index.
Add C++ tests
- A C++ test framework was added recently in C++ improvements #63 but there is only some basic Index property checking
- These additional tests generate random vectors and add them to an index. Each of these vectors is then used as a target vector in a query. We expect to see that the ANN distance is ~0 (allowing for some precision error based on storage type), and that the ANN is itself.
- Technically we still have only one test. We don't have support for test parameterization yet, so as a workaround we use nested loops to enumerate and test different combinations of parameters.
- By nature of HNSW, the test is nondeterministic. Even with considerations for lower precision data storage types, it is possible that a test run will fail because we are testing that the true NN is equal to the ANN.
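The self-query check and the nested-loop workaround described above can be sketched roughly as follows. This is only an illustration: `nearestRow`, `selfQueryPasses` and `runAllCombinations` are hypothetical names, and the lookup is an exact brute-force search standing in for the index, not Voyager's API. Because brute force is exact, this sketch does not show the HNSW nondeterminism mentioned above.

```cpp
#include <cstddef>
#include <initializer_list>
#include <limits>
#include <random>
#include <utility>
#include <vector>

using Rows = std::vector<std::vector<float>>;

// Brute-force stand-in for an index query (NOT Voyager's API): returns the
// position of the closest row and its squared Euclidean distance.
std::pair<std::size_t, float> nearestRow(const Rows &rows,
                                         const std::vector<float> &target) {
  std::size_t best = 0;
  float bestDistance = std::numeric_limits<float>::max();
  for (std::size_t i = 0; i < rows.size(); ++i) {
    float distance = 0.0f;
    for (std::size_t j = 0; j < target.size(); ++j) {
      const float diff = rows[i][j] - target[j];
      distance += diff * diff;
    }
    if (distance < bestDistance) {
      bestDistance = distance;
      best = i;
    }
  }
  return {best, bestDistance};
}

// True if every row is its own nearest neighbour with distance <= tolerance.
bool selfQueryPasses(const Rows &rows, float tolerance) {
  for (std::size_t i = 0; i < rows.size(); ++i) {
    const std::pair<std::size_t, float> result = nearestRow(rows, rows[i]);
    if (result.first != i || result.second > tolerance) return false;
  }
  return true;
}

// Nested loops enumerate parameter combinations in place of a
// parameterized test. The parameter values here are arbitrary.
bool runAllCombinations() {
  std::mt19937 rng(42);
  std::uniform_real_distribution<float> uniform(-1.0f, 1.0f);
  for (int numVectors : {10, 100}) {
    for (int numDimensions : {4, 32}) {
      Rows rows(numVectors, std::vector<float>(numDimensions));
      for (auto &row : rows)
        for (float &value : row) value = uniform(rng);
      if (!selfQueryPasses(rows, 1e-6f)) return false;
    }
  }
  return true;
}
```

Note that duplicate rows make the "nearest neighbour is itself" assertion ambiguous, which is one reason a check like this can fail even when the search is behaving correctly.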

Testing

Added C++ tests

Related Issues

In these tests, we query with k = 1 (single nearest neighbour). If you change it or add an additional test with k = num_vectors_in_index, you can sporadically reproduce this issue: #38

Checklist

My code follows the code style of this project.
I have added and/or updated appropriate documentation (if applicable).
All new and existing tests pass locally with these changes.
I have run static code analysis (if available) and resolved any issues.
I have considered backward compatibility (if applicable).
I have confirmed that this PR does not introduce any security vulnerabilities.

Additional Comments


stephen29xie · 2024-08-29T16:21:48Z

cpp/test/CMakeLists.txt

+# Add compiler flags
+target_compile_options(VoyagerTests PRIVATE -g)


The -g flag builds the executable with debugging symbols so that it can be used with a debugger.

stephen29xie · 2024-09-04T19:09:44Z

cpp/src/array_utils.h

+  // flatten the 2d array into the NDArray's underlying 1D vector
+  std::vector<float> flatArray;
+  for (const auto &vector : vectors) {
+    flatArray.insert(flatArray.end(), vector.begin(), vector.end());
+  }


Is there a more memory-efficient way of doing this?

Not memory-efficient, but time-efficient, yes: we should pre-allocate the space in flatArray by calculating numVectors * dimensions, then doing std::memcpy into that buffer for each vector.

What we have here will resize (and potentially reallocate) the vector on each .insert, which makes this O(n²) as .insert is O(n) rather than O(1).
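A minimal sketch of the preallocate-then-memcpy approach suggested above, assuming every inner vector has the same length as the first one. `flattenRows` is a hypothetical name, and it returns a plain `std::vector<float>` rather than an NDArray so that the example stays self-contained.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: flattens row-major 2D data into one contiguous buffer.
// Assumes every row has the same length as the first row.
std::vector<float> flattenRows(const std::vector<std::vector<float>> &rows) {
  const std::size_t numRows = rows.size();
  const std::size_t dimensions = numRows > 0 ? rows[0].size() : 0;

  // One allocation up front, so nothing is reallocated while copying.
  std::vector<float> flat(numRows * dimensions);
  if (dimensions == 0) return flat; // nothing to copy

  for (std::size_t i = 0; i < numRows; ++i) {
    // Each row's storage is contiguous, so it can be copied in one call.
    std::memcpy(flat.data() + i * dimensions, rows[i].data(),
                dimensions * sizeof(float));
  }
  return flat;
}
```

The outer vector's rows are separate allocations, so the copy still has to happen once per row; only the elements within a row are guaranteed to be contiguous.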

stephen29xie · 2024-09-04T19:10:21Z

cpp/test/test_main.cpp

+               float precisionTolerance) {
+  // create test data and ids
+  std::vector<std::vector<float>> inputData =
+      randomVectors(numVectors, numDimensions);


From meeting:

Add a conditional statement:
- if storageType is Float32, use randomVectors
- if storageType is Float8 or E4M3, use randomQuantizedVectors
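The branching from the meeting note might look something like the sketch below. Everything here is an assumption made for illustration: `StorageKind` and `makeTestInput` are hypothetical names, and "quantized" is taken to mean rounding each value to the nearest multiple of 1/127. The PR's actual randomQuantizedVectors may quantize differently, and a grid suited to E4M3 would likely not be the same one.

```cpp
#include <cmath>
#include <random>
#include <vector>

using Rows = std::vector<std::vector<float>>;

// Hypothetical stand-in for the library's storage type enum.
enum class StorageKind { Float8, Float32, E4M3 };

// Generates random rows in [-1, 1]. For the 8-bit storage kinds, values are
// snapped to multiples of 1/127 (an assumed grid, see the note above) so that
// storing them loses less precision than storing arbitrary floats would.
Rows makeTestInput(StorageKind kind, int numVectors, int numDimensions,
                   std::mt19937 &rng) {
  const bool quantize = (kind != StorageKind::Float32);
  std::uniform_real_distribution<float> uniform(-1.0f, 1.0f);

  Rows rows(numVectors, std::vector<float>(numDimensions));
  for (auto &row : rows) {
    for (float &value : row) {
      value = uniform(rng);
      if (quantize) value = std::round(value * 127.0f) / 127.0f;
    }
  }
  return rows;
}
```

Keeping the choice in one place means the test body can call a single generator regardless of which storage type the current loop iteration is exercising.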

markkohdev

Looking good! Left some comments but nice work on this so far :)

markkohdev · 2024-09-04T19:12:00Z

cpp/src/array_utils.h

+/**
+ * Convert a 2D vector of float to NDArray<float, 2>
+ */
+NDArray<float, 2> vectorsToNDArray(std::vector<std::vector<float>> vectors) {


Let's add an explicit unit test for this

markkohdev · 2024-09-04T19:12:52Z

cpp/test/test_utils.cpp

+
+#include "array_utils.h"
+
+NDArray<float, 2> randomQuantizedVectorsNDArray(int numVectors,


I think we can remove these functions and instead just use the randomQuantizedVectors and randomVectors methods

markkohdev · 2024-09-04T19:30:44Z

cpp/src/array_utils.h

+  int dimensions = numVectors > 0 ? vectors[0].size() : 0;
+  std::array<int, 2> shape = {numVectors, dimensions};
+
+  // flatten the 2d array into the NDArray's underlying 1D vector


Rather than iterating over the outer and inner vectors, we should try to utilize the underlying data access that std::vector provides. Be careful with this, though, as there may be caveats around vector memory allocation boundaries and actual vector item counts.

I am still iterating over each vector because I added a check to validate that each vector size is identical. But now I am preallocating the space I need in the flattened array to avoid resizing, and using std::memcpy to do the copying
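The size check mentioned in this reply could be sketched as a small standalone helper. `uniformDimensions` is a hypothetical name and the exception type is an assumption; the merged code may report the error differently.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: returns the shared row length, or throws if any row
// has a different length from the first one.
std::size_t uniformDimensions(const std::vector<std::vector<float>> &rows) {
  if (rows.empty()) return 0;

  const std::size_t expected = rows[0].size();
  for (std::size_t i = 1; i < rows.size(); ++i) {
    if (rows[i].size() != expected) {
      throw std::invalid_argument(
          "Row " + std::to_string(i) + " has " +
          std::to_string(rows[i].size()) + " elements, expected " +
          std::to_string(expected));
    }
  }
  return expected;
}
```

Running a check like this before a raw byte copy matters because the copy trusts the row length: a short row would otherwise be read past its end.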


markkohdev

👍 Thanks for fixing this up! Looks good to me
🤠 🐴 🏇

stephen29xie added 5 commits August 26, 2024 16:04

Add C++ tests and overloaded Index methods that accept 2D vector of floats instead of NDArray

94f5fd4

Use most recent version of clang-format

723e189

Undo clang-format bump. Fix formatting

b61ef1e

clean up C++ test, increase number of vectors

8186a1f

Fix comment

12fd327

stephen29xie changed the title ~~[WIP] C++ Improvements~~ C++ Improvements - API enhancement and increase testing Aug 29, 2024

stephen29xie commented Aug 29, 2024


stephen29xie marked this pull request as ready for review August 29, 2024 16:22

stephen29xie requested review from markkohdev and psobot August 29, 2024 16:22

Move code into reusable function

a3d04c8

stephen29xie commented Sep 4, 2024


markkohdev requested changes Sep 4, 2024


psobot reviewed Sep 5, 2024


stephen29xie added 2 commits September 4, 2024 23:33

Use quantized random input vectors for Float8 and E4M3 storage. Remove unused util methods

f57b9ae

Optimize vectorsToNDArray() and add validation for vector sizes, add tests

2264c04

stephen29xie requested review from psobot and markkohdev September 6, 2024 14:49

markkohdev approved these changes Sep 6, 2024


stephen29xie merged commit 88cfc46 into main Sep 10, 2024
57 checks passed

stephen29xie deleted the stephenx/cpp-improvements branch September 10, 2024 17:44
