Describe the bug
Objects of type Document always have id_hash_keys = None.
Context
Documents have a property called id_hash_keys, which is used to generate the id. Before this PR was merged, Haystack used strings to generate the id. Now we pass the names of the attributes from which the hash should be generated, e.g. content, meta, ...
If we create two documents with the same content and generate the ids from two different keys, we get different ids, but in both cases the "id_hash_keys" field is "None".
The parameter id_hash_keys is passed to the init function but never set by self.id_hash_keys = id_hash_keys.
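The behaviour can be reproduced with a small sketch. The class below is a simplified stand-in written for illustration, not Haystack's actual Document implementation; the `_get_id` helper and the SHA-256 hashing scheme are assumptions.

```python
import hashlib
from typing import List, Optional


class Document:
    """Simplified stand-in for the Document primitive (not Haystack's code)."""

    def __init__(self, content: str, meta: Optional[dict] = None,
                 id_hash_keys: Optional[List[str]] = None):
        self.content = content
        self.meta = meta or {}
        # The parameter is accepted but never stored on the instance.
        self.id_hash_keys = None
        self.id = self._get_id(id_hash_keys)

    def _get_id(self, id_hash_keys: Optional[List[str]]) -> str:
        # Hypothetical helper: hash the selected attributes, content by default.
        keys = id_hash_keys or ["content"]
        payload = ":".join(str(getattr(self, key)) for key in keys)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


doc_a = Document("hello", meta={"name": "a.txt"}, id_hash_keys=["content"])
doc_b = Document("hello", meta={"name": "a.txt"}, id_hash_keys=["content", "meta"])

print(doc_a.id != doc_b.id)                    # True: the keys affect the id
print(doc_a.id_hash_keys, doc_b.id_hash_keys)  # None None: the keys are lost
```

The ids differ because different attributes were hashed, yet neither document remembers which keys produced its id.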
There are now two options:
1. Remove the parameter from the Document primitive.
   Since this parameter is already not set, it will not cause any errors as long as the creation of the id works as expected. However, we may need to discuss whether a document needs to store these values.
2. Store the id_hash_keys via self.id_hash_keys = id_hash_keys.
   This will cause some tests to fail:
   - Weaviate will complain, since the list of strings cannot be parsed.
   - Document.to_dict will need some fixes.
General question:
How do we want to define document similarity in general? At the document store level, the id is used to determine whether a document is replaced on insert. Since this id is based on the content, this is basically equivalent to comparing only the content. The __eq__ function of documents additionally compares the metadata and other fields. So it is possible that doc1 == doc2 returns False, but when both documents are inserted into a document store, only one of them is kept: the second one passed to write_documents.
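A minimal sketch of this mismatch, assuming a content-only id and a store that replaces documents with the same id on insert. `Doc` and `InMemoryStore` are hypothetical stand-ins written for illustration, not Haystack classes.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Hypothetical document: __eq__ (generated by dataclass) compares content and meta."""
    content: str
    meta: dict = field(default_factory=dict)

    @property
    def id(self) -> str:
        # Assumption: the id is derived from the content only.
        return hashlib.sha256(self.content.encode("utf-8")).hexdigest()


class InMemoryStore:
    """Hypothetical store that replaces documents sharing an id."""

    def __init__(self) -> None:
        self.docs: Dict[str, Doc] = {}

    def write_documents(self, documents: List[Doc]) -> None:
        for doc in documents:
            self.docs[doc.id] = doc  # a later document overwrites an earlier one


doc1 = Doc("same text", meta={"source": "a"})
doc2 = Doc("same text", meta={"source": "b"})

store = InMemoryStore()
store.write_documents([doc1, doc2])

print(doc1 == doc2)     # False: the metadata differs
print(len(store.docs))  # 1: both documents share one id
```

Equality and identity disagree here: the two documents are unequal, but the store treats them as the same entry and keeps only doc2.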
@bogdankostic @ArzelaAscoIi As of now, we don't need Document to store id_hash_keys as an attribute, and I can't come up with a use case that requires it. In the unlikely event that we need a Document to tell how its id was generated, we can add that functionality later. Until then, I'd say we remove id_hash_keys as an attribute of Document and use it only as a parameter that decides how the id is generated.