Improved ID hashing on indexing nodes #272

timonv · 2024-09-05T08:25:39Z

Is your feature request related to a problem? Please describe.
Currently, IDs for indexing nodes are generated by hashing the chunk and path with the default hasher (DefaultHasher) from the standard library's hash map, via the Hasher trait. The result is a 64-bit value, so the largest possible ID is theoretically 18446744073709551615 (u64::MAX). In storage, IDs are used to upsert data.
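For readers unfamiliar with the scheme described above, here is a minimal sketch of it. The `Node` struct and its two fields are a pared-down assumption for illustration, not the project's actual type; only the `calculate_hash()` name comes from this issue.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::PathBuf;

/// Hypothetical, pared-down node; the real struct likely carries more fields.
pub struct Node {
    pub path: PathBuf,
    pub chunk: String,
}

impl Node {
    /// Feeds the path and the chunk into the standard library's default
    /// hasher and returns the resulting 64-bit value.
    pub fn calculate_hash(&self) -> u64 {
        let mut hasher = DefaultHasher::new();
        self.path.hash(&mut hasher);
        self.chunk.hash(&mut hasher);
        hasher.finish()
    }
}
```

Note that the standard library does not specify the algorithm behind `DefaultHasher`, so the same path and chunk may hash to a different value after a toolchain upgrade. That makes it a fragile basis for IDs that are persisted in storage.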

Collisions can happen with large amounts of data, and since IDs are used for upserts, a collision would lead to incorrect results. There are better solutions. Additionally, the id field is public, and the method that computes it (calculate_hash()) does not set the field.
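To put a rough number on "large amounts of data", the birthday bound approximates the chance of at least one collision among n IDs as p ≈ 1 - exp(-n(n-1) / (2 · 2^bits)). The helper below is an illustrative sketch, not project code, and it assumes hash outputs are uniformly distributed.

```rust
/// Approximate probability of at least one collision among `n` ids drawn
/// uniformly from a space of 2^bits values (birthday bound).
pub fn collision_probability(n: u64, bits: u32) -> f64 {
    let n = n as f64;
    let space = 2f64.powi(bits as i32);
    // 1 - exp(-x), written with exp_m1 to keep precision when x is tiny.
    -(-n * (n - 1.0) / (2.0 * space)).exp_m1()
}
```

Under that assumption, a billion nodes with 64-bit IDs give a collision chance of a few percent, while the 122 hash-derived bits of a name-based UUID bring it down to a negligible level for the same count.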

Describe the solution you'd like
Qdrant has supported UUIDs for a while now. The idea is to use UUIDv3 (MD5-based) instead. Ideally, users can opt in to the new implementation, with a deprecation warning on the old one. For both implementations, the id should only be retrieved through an id() function that lazily retrieves or sets it. Memory storage uses ordered ids for easier debugging, so some kind of override might still be useful. All implementors of storage and node cache need to be updated.
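One way the lazy id() accessor could look is sketched below, using `std::sync::OnceLock` so the id is computed on first access and cached afterwards. Everything here is an assumption for illustration: `LazyNode`, `new`, and `with_id` are hypothetical names, and the id is a `u64` from the default hasher only to keep the sketch free of dependencies. In the proposed design the closure would derive a UUIDv3 from the same inputs and the cell would hold that UUID instead.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::PathBuf;
use std::sync::OnceLock;

/// Hypothetical node whose id is private and only reachable through `id()`.
pub struct LazyNode {
    pub path: PathBuf,
    pub chunk: String,
    id: OnceLock<u64>,
}

impl LazyNode {
    pub fn new(path: impl Into<PathBuf>, chunk: impl Into<String>) -> Self {
        Self { path: path.into(), chunk: chunk.into(), id: OnceLock::new() }
    }

    /// Assigns an explicit id up front, e.g. an ordered counter that is
    /// easier to read while debugging.
    pub fn with_id(self, id: u64) -> Self {
        Self { id: OnceLock::from(id), ..self }
    }

    /// Returns the cached id, computing it on first access.
    pub fn id(&self) -> u64 {
        *self.id.get_or_init(|| {
            // Stand-in derivation; a UUIDv3 over path and chunk would go here.
            let mut hasher = DefaultHasher::new();
            self.path.hash(&mut hasher);
            self.chunk.hash(&mut hasher);
            hasher.finish()
        })
    }
}
```

Because the field is private, callers cannot read a stale or unset id, which addresses the mismatch between the public field and calculate_hash(). One caveat of this sketch: `path` and `chunk` stay public, so mutating them after the first `id()` call leaves the cached value out of date.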

timonv added the "good first issue" (Good for newcomers) and "help wanted" (Extra attention is needed) labels Sep 5, 2024

This was referenced Sep 6, 2024

fix!: Rust 1.81 support #275

Merged

feat(indexing)!: Use UUIDv3 for indexing node ids #277

Merged

timonv added a commit that referenced this issue Sep 6, 2024

fix!: Rust 1.81 support (#275)

5a724df

Fixing id generation properly as per #272, will be merged in together.

- **Clippy**
- **fix(qdrant)!: Default hasher changed in Rust 1.81**

timonv closed this as completed in #277 Sep 6, 2024

timonv closed this as completed in 57fe4aa Sep 6, 2024
