adding some format docs
benwtrent committed Dec 9, 2024
1 parent 1c44537 commit ef147d8
Showing 1 changed file with 50 additions and 0 deletions.
@@ -33,6 +33,56 @@

/**
* Copied from Lucene; replace with Lucene's implementation sometime after Lucene 10.
* Codec for encoding/decoding binary quantized vectors. The binary quantization format used here
* is a per-vector optimized scalar quantization. Also see {@link
* org.elasticsearch.index.codec.vectors.es818.OptimizedScalarQuantizer}. Some key features are:
*
* <ul>
* <li>Estimating the distance between two vectors using their centroid-normalized distance. This
* requires some additional corrective factors, but allows for centroid normalization to occur.
* <li>Optimized scalar quantization to bit level of centroid-normalized vectors.
* <li>Asymmetric quantization of vectors, where query vectors are quantized to half-byte (4-bit)
* precision (normalized to the centroid) and then compared directly against the single-bit
* quantized vectors in the index.
* <li>Transforming the half-byte quantized query vectors in such a way that the comparison with
* single-bit vectors can be done with bit arithmetic.
* </ul>
*
* The format is stored in two files:
*
* <h2>.veb (vector data) file</h2>
*
* <p>Stores the binary quantized vectors in a flat format. Additionally, it stores each vector's
* corrective factors. At the end of the file, additional information is stored for vector ordinal
* to centroid ordinal mapping and sparse vector information.
*
* <ul>
* <li>For each vector:
* <ul>
* <li><b>[byte]</b> the binary quantized values; each byte packs 8 single-bit components.
* <li><b>[float]</b> the optimized quantiles and an additional similarity-dependent corrective factor.
* <li><b>short</b> the sum of the quantized components
* </ul>
* <li>After the vectors, sparse vector information keeping track of monotonic blocks.
* </ul>
*
* <h2>.vemb (vector metadata) file</h2>
*
* <p>Stores the metadata for the vectors. This includes the number of vectors, the number of
* dimensions, and file offset information.
*
* <ul>
* <li><b>int</b> the field number
* <li><b>int</b> the vector encoding ordinal
* <li><b>int</b> the vector similarity ordinal
* <li><b>vint</b> the vector dimensions
* <li><b>vlong</b> the offset to the vector data in the .veb file
* <li><b>vlong</b> the length of the vector data in the .veb file
* <li><b>vint</b> the number of vectors
* <li><b>[float]</b> the centroid
* <li><b>float</b> the squared magnitude of the centroid
* <li>The sparse vector information, if required, mapping vector ordinal to doc ID
* </ul>
*/
public class ES818BinaryQuantizedVectorsFormat extends FlatVectorsFormat {
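
The Javadoc above says that half-byte query vectors are transformed so that comparison against single-bit index vectors needs only bit arithmetic. One common way to do this, sketched below, is to split the 4-bit query values into four bit planes and combine AND with a population count. This is an illustration, not the code in this commit: the class name `BitPlaneDotProductSketch`, its method names, and the most-significant-bit-first packing order are all assumptions.

```java
// Hypothetical sketch: dot product of a 4-bit query against a 1-bit vector
// using only AND, popcount, and shifts. Names and bit order are assumptions.
class BitPlaneDotProductSketch {

    // Pack one bit per dimension, 8 dimensions per byte (MSB first, assumed).
    static byte[] packBits(int[] bits) {
        byte[] packed = new byte[(bits.length + 7) / 8];
        for (int i = 0; i < bits.length; i++) {
            if (bits[i] != 0) {
                packed[i / 8] |= (byte) (1 << (7 - (i % 8)));
            }
        }
        return packed;
    }

    // Split query values in [0, 15] into four bit planes. Plane p holds
    // bit p of every component, packed the same way as the index vectors.
    static byte[][] transposeHalfByte(int[] query) {
        byte[][] planes = new byte[4][];
        for (int p = 0; p < 4; p++) {
            int[] bits = new int[query.length];
            for (int i = 0; i < query.length; i++) {
                bits[i] = (query[i] >> p) & 1;
            }
            planes[p] = packBits(bits);
        }
        return planes;
    }

    // sum_i q_i * d_i == sum_p 2^p * popcount(plane_p AND d)
    static int dot(byte[][] planes, byte[] doc) {
        int sum = 0;
        for (int p = 0; p < 4; p++) {
            int count = 0;
            for (int i = 0; i < doc.length; i++) {
                count += Integer.bitCount((planes[p][i] & doc[i]) & 0xFF);
            }
            sum += count << p;
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] query = {3, 15, 0, 7, 9, 1, 12, 5, 2, 8};
        int[] doc = {1, 0, 1, 1, 0, 1, 1, 0, 1, 1};
        int naive = 0;
        for (int i = 0; i < query.length; i++) {
            naive += query[i] * doc[i];
        }
        int bitwise = dot(transposeHalfByte(query), packBits(doc));
        System.out.println("naive=" + naive + " bitwise=" + bitwise);
    }
}
```

The result is a raw integer dot product of the quantized values. Turning it into a similarity estimate would still require the corrective factors that the format stores next to each vector.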

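The per-vector layout of the `.veb` file described in the Javadoc (packed bits, then floats, then a short) can be made concrete with a small size calculation. The sketch below assumes three floats per vector (two quantiles plus one corrective factor), no padding of the dimension count, and little-endian byte order. None of these is stated in the Javadoc, and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch of one per-vector entry as described for the .veb file.
class VebEntrySketch {

    // Assumption: lower quantile, upper quantile, one extra corrective factor.
    static final int ASSUMED_FLOATS_PER_VECTOR = 3;

    // Bytes for one entry: ceil(dims / 8) packed bits + floats + one short.
    static int bytesPerVector(int dims) {
        return (dims + 7) / 8 + ASSUMED_FLOATS_PER_VECTOR * Float.BYTES + Short.BYTES;
    }

    // Serialize one entry in the order given in the Javadoc:
    // [byte] quantized bits, [float] corrections, short component sum.
    static ByteBuffer encode(byte[] packedBits, float lower, float upper,
                             float extraCorrection, short componentSum) {
        int size = packedBits.length + ASSUMED_FLOATS_PER_VECTOR * Float.BYTES + Short.BYTES;
        ByteBuffer buf = ByteBuffer.allocate(size).order(ByteOrder.LITTLE_ENDIAN);
        buf.put(packedBits);
        buf.putFloat(lower);
        buf.putFloat(upper);
        buf.putFloat(extraCorrection);
        buf.putShort(componentSum);
        buf.flip(); // make the written bytes readable from position 0
        return buf;
    }
}
```

Under these assumptions a 1024-dimensional vector occupies 128 + 12 + 2 = 142 bytes, compared with 4096 bytes for the same vector stored as raw 32-bit floats.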