-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf: enhanced InMemoryDocumentStore
BM25 query efficiency with incremental indexing
#7549
Conversation
… naming consistency; 3. remove unused import
@Guest400123064 Thank you for opening this PR! We really appreciate it. Our team will need a little bit more time to review your PR. |
Thanks for the reply! Yea, theoretically it should completely replicate |
Hi @Guest400123064, thanks for your contribution, this is very good work! I left some initial suggestions. |
…tistics as a dataclass instead of tuple to improve readability
Pull Request Test Coverage Report for Build 8938434225Details
💛 - Coveralls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Related Issues
This proposal was first made as a stand-alone Haystack document store integration, which is linked to issue number 218 in haystack-integrations repo.
Proposed Changes:
Instead of reindexing with every new query, I choose to perform incremental indexing on document changes. This results in modifications primarily to
write_documents
,delet_documents
, andbm25_retrieval
.How did you test it?
As suggested by @julian-risch, the change should be non-breaking. Therefore, the test was performed with test cases implemented in
test/document_stores/test_in_memory.py
. 81 test cases passed and 3 cases failed with explainable causes:TestMemoryDocumentStore::test_from_dict
:self.bm25_algorithm
now points to the string literal of the algorithm name, instead of aBM25
object. So, it does not have the.__name__
attribute.TestMemoryDocumentStore::test_bm25_retrieval_with_non_scaled_BM25Okapi
: this is caused by the pytest fixture initializing a BM25L document store and the test case modified the underlying algorithm not from initializer, making the underlying algorithm being BM25L instead of Okapi BM25. Changing the initialized algorithm will result in a pass.TestMemoryDocumentStore::test_bm25_retrieval_with_text_and_table_content
: the non-matching documents have tied scores. The test case got a "lucky pass" because NumPy quick-sort alters the document orders even when the scores are the same.Notes for the reviewer
Any suggestion is appreciated
:)
Checklist
fix:
,feat:
,build:
,chore:
,ci:
,docs:
,style:
,refactor:
,perf:
,test:
.