Property id_hash_keys of documents never set #1920

ArzelaAscoIi · 2021-12-22T11:33:03Z

Describe the bug
Objects of type Document always have id_hash_keys = None, regardless of what is passed to the constructor.

Context
Documents have a property called id_hash_keys, which is used to generate the id. Before this PR was merged, Haystack used strings to generate the id. Now we pass the names of the attributes from which the hash should be generated, e.g. content, meta, ...
If we create two documents with the same content and generate the ids from two different keys, we get different ids, but in both cases the "id_hash_keys" field is "None".

from haystack import Document
doc = Document(content="some text", content_type="text", id_hash_keys=["key1"])
doc2 = Document(content="some text", content_type="text", id_hash_keys=["key2"])

The parameter id_hash_keys is passed to the init function but never stored on the instance: the assignment self.id_hash_keys = id_hash_keys is missing.
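To illustrate the pattern without depending on Haystack internals, here is a minimal stand-in. The class SketchDocument and its hashing scheme (SHA-256 over the selected attributes) are assumptions made for illustration, not Haystack's actual code, and the keys used are real attribute names rather than "key1"/"key2".

```python
import hashlib
from typing import Any, Dict, List, Optional


class SketchDocument:
    """Hypothetical stand-in for Document; not Haystack's implementation."""

    # Class-level default that the constructor never overwrites.
    id_hash_keys: Optional[List[str]] = None

    def __init__(
        self,
        content: str,
        meta: Optional[Dict[str, Any]] = None,
        id_hash_keys: Optional[List[str]] = None,
    ):
        self.content = content
        self.meta = meta or {}
        # The parameter is used to build the id ...
        self.id = self._make_id(id_hash_keys or ["content"])
        # ... but `self.id_hash_keys = id_hash_keys` is missing here.

    def _make_id(self, keys: List[str]) -> str:
        # Assumed scheme: hash the string form of each selected attribute.
        joined = ":".join(str(getattr(self, key)) for key in keys)
        return hashlib.sha256(joined.encode("utf-8")).hexdigest()


doc = SketchDocument("some text", meta={"source": "a"}, id_hash_keys=["content"])
doc2 = SketchDocument("some text", meta={"source": "a"}, id_hash_keys=["content", "meta"])

print(doc.id != doc2.id)                    # the keys influence the id
print(doc.id_hash_keys, doc2.id_hash_keys)  # yet the attribute stays None on both
```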

There are now two options:

1. Remove the parameter from the Document primitive.
   Since this parameter is currently never stored, removing it will not cause any errors as long as the creation of the id works as expected. However, we may need to discuss whether a document needs to store these values.

2. Store the id_hash_keys via self.id_hash_keys = id_hash_keys.
   This will cause some tests to fail:
   - Weaviate will complain, since the list of strings cannot be parsed.
   - Document.to_dict will need some fixes.
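As a rough sketch of what the second option could look like, the keys are kept on the instance and round-tripped through to_dict / from_dict so that a copy regenerates the same id. StoredKeysDocument, its from_dict helper, and the hashing scheme are hypothetical and only meant to show the shape of the change, not Haystack's code.

```python
import hashlib
from typing import Any, Dict, List, Optional


class StoredKeysDocument:
    """Hypothetical sketch of option 2; not Haystack's implementation."""

    def __init__(
        self,
        content: str,
        meta: Optional[Dict[str, Any]] = None,
        id_hash_keys: Optional[List[str]] = None,
    ):
        self.content = content
        self.meta = meta or {}
        # Keep the keys so the id can be regenerated later.
        self.id_hash_keys = list(id_hash_keys) if id_hash_keys else ["content"]
        joined = ":".join(str(getattr(self, key)) for key in self.id_hash_keys)
        self.id = hashlib.sha256(joined.encode("utf-8")).hexdigest()

    def to_dict(self) -> Dict[str, Any]:
        # The keys are serialised alongside the other fields.
        return {
            "content": self.content,
            "meta": dict(self.meta),
            "id_hash_keys": list(self.id_hash_keys),
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "StoredKeysDocument":
        return cls(
            content=data["content"],
            meta=data.get("meta"),
            id_hash_keys=data.get("id_hash_keys"),
        )


original = StoredKeysDocument("some text", meta={"source": "a"}, id_hash_keys=["content", "meta"])
clone = StoredKeysDocument.from_dict(original.to_dict())
print(clone.id == original.id)  # the clone regenerates the same id
```

A backend that cannot parse a list of strings would still need its own conversion for this field; that part is not covered by the sketch.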

General question:
How do we want to define document similarity in general? At the document store level, the id determines whether a document is replaced on insert. Since this id is based on the content, this is basically equivalent to comparing only the content. The __eq__ function of documents additionally compares the metadata and other fields. So it is possible that doc1 == doc2 returns False, yet when both documents are passed to write_documents, only the second one ends up in the document store.
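The mismatch can be reproduced with a small self-contained sketch. EqDocument and InMemoryStore are hypothetical: the id is assumed to be a hash of the content only, equality compares all fields, and the store is assumed to overwrite an existing entry with the same id.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class EqDocument:
    """Hypothetical document: id from content only, equality over all fields."""

    content: str
    meta: Dict[str, Any] = field(default_factory=dict)

    @property
    def id(self) -> str:
        return hashlib.sha256(self.content.encode("utf-8")).hexdigest()


class InMemoryStore:
    """Hypothetical store that keys documents by id and overwrites duplicates."""

    def __init__(self) -> None:
        self.docs: Dict[str, EqDocument] = {}

    def write_documents(self, documents: List[EqDocument]) -> None:
        for document in documents:
            self.docs[document.id] = document


doc1 = EqDocument("some text", meta={"source": "a"})
doc2 = EqDocument("some text", meta={"source": "b"})

store = InMemoryStore()
store.write_documents([doc1, doc2])

print(doc1 == doc2)     # False: the metadata differs
print(len(store.docs))  # 1: both documents share one id, so doc2 replaced doc1
```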


julian-risch · 2022-03-22T10:27:05Z

@bogdankostic @ArzelaAscoIi As of now we don't need Document to store id_hash_keys as an attribute. I can't come up with a potential use case that requires that. In the unlikely event that a Document needs to tell how its id was generated in the future, we can add that functionality later. Until then, I'd say we remove id_hash_keys as an attribute of Document and use id_hash_keys only as a parameter to decide how to generate the id.

ArzelaAscoIi added the type:bug label Dec 22, 2021

bogdankostic mentioned this issue Mar 22, 2022

Change return types of indexing pipeline nodes #2342

Merged

bogdankostic closed this as completed in #2342 Mar 29, 2022

ZanSara mentioned this issue Dec 12, 2022

feat: store id_hash_keys in Document objects to make documents clonable #3697

Merged

