Releases: finalfusion/finalfusion-rust
0.18.0
0.17.2
- Add `WriteEmbeddings::write_embeddings_len`. This method returns the serialized length of embeddings in finalfusion format, without performing any serialization.
- Add `WriteChunk::chunk_len`. This method returns the serialized length of a finalfusion chunk, without performing any serialization.
- Switch the license to Apache License 2.0 or MIT.
Add support for Floret embeddings
- Add support for reading, writing, and using Floret embeddings.
- Add a finalfusion chunk type for Floret-like vocabularies.
- Add support for batched embedding lookups (`embedding_batch` and `embedding_batch_into`).
- Improve error handling:
  - Mark wrapped errors using `#[source]` to get better chains of error messages.
  - Split `Error::Io` into `Error::Read` and `Error::Write`.
  - Rename some `Error` variants.
Subword vocabulary conversion
- Add conversion from bucketed subword to explicit subword embeddings.
- Hide `WordSimilarityResult` fields. Use the `cosine_similarity` and `word` methods instead.
Faster lookup of OPQ-quantized embeddings
- Make lookups of unknown words in OPQ-quantized embedding matrices 2.6x faster (resulting in ~1.6x faster all-round lookups).
- Add the `Reconstruct` trait as a counterpart to `Quantize`. This trait can be used to reconstruct quantized embedding matrices. Using this trait is also much faster than reconstructing individual embeddings.
- Add more I/O checks to ensure that the embedding matrix can actually be represented in the native `usize`.
Improved error handling
- Merge the `Error` and `ErrorKind` enums.
- Move the `Error` enum to the `error` module.
- Derive trait implementations using the `thiserror` crate.
- Make the `Error` enum non-exhaustive.
- Replace the `ChunkIdentifier::try_from` method by an implementation of the `TryFrom` trait.
This release also feature-gates the `memmap` dependency (the `memmap` feature is enabled by default).
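A consumer that wants to avoid the memory-mapping code path can opt out of the default features in its own manifest. A hypothetical consumer `Cargo.toml` fragment (the wildcard version is a placeholder; pin a real version in practice):

```toml
[dependencies]
# Disable default features, which include `memmap`.
finalfusion = { version = "*", default-features = false }
```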
Explicit n-gram vocabularies and first API-stable release
- Add `ExplicitVocab`, a subword vocabulary that stores n-grams explicitly.
- Add the `Embedding::into` method. This method realizes an embedding into a user-provided array.
- Support big-endian architectures.
- Add `WordIndex::word` and `WordIndex::subword` methods. These will return an `Option` with the word index or subword indices, as applicable.
- Expose the quantizer in `(Mmap)QuantizedArray` through the `quantizer` method.
- Add benchmarks for array and quantized embeddings.
- Split `WordSimilarity` into `WordSimilarity` and `WordSimilarityBy`; `EmbeddingSimilarity` into `EmbeddingSimilarity` and `EmbeddingSimilarityBy`.
- Rename `FinalfusionSubwordVocab` to `BucketSubwordVocab`.
- Expose fewer types through the prelude.
- Hide the `chunks` module. E.g. `chunks::storage` becomes `storage`.
Reductive 0.3
This small update bumps the reductive dependency to 0.3, which contains a crucial bug fix for training product quantizers in multiple attempts. However, reductive 0.3 also requires rand 0.7, resulting in a changed API; therefore, we have to bump the leading version number from 0.9 to 0.10.
Memory-mapped quantized arrays
- Add the `MmapQuantizedArray` storage type.
- Rename `Vocab::len` to `Vocab::words_len`.
- Add `Vocab::vocab_len` to get the vocabulary size including subword indices.
Token robustness
- Improve reading of embeddings that contain unicode whitespace in tokens.
- Add lossy variants of the text/word2vec/fasttext reading methods. The lossy variants read tokens with invalid UTF-8 byte sequences.