perf: enhanced InMemoryDocumentStore BM25 query efficiency with incremental indexing #7549

Merged
merged 25 commits into from
May 3, 2024

Conversation

Guest400123064
Contributor

Related Issues

This proposal was first made as a stand-alone Haystack document store integration, which is linked to issue number 218 in the haystack-integrations repo.

Proposed Changes:

Instead of reindexing on every new query, I chose to index incrementally whenever documents change. The modifications are therefore mainly in write_documents, delete_documents, and bm25_retrieval.
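To illustrate the idea, here is a minimal sketch of incremental BM25 indexing. It is not the code in this PR: the class name `IncrementalBM25Index`, its method names, the whitespace tokenizer, and the non-negative IDF variant are all assumptions made for the example. The point is only that corpus statistics are updated on write and delete, so a query reads them instead of rebuilding them.

```python
import math
from collections import Counter


class IncrementalBM25Index:
    """Hypothetical toy index: statistics are maintained on write/delete."""

    def __init__(self, k1: float = 1.5, b: float = 0.75):
        self.k1 = k1
        self.b = b
        self.term_freqs = {}        # doc id -> Counter of term frequencies
        self.doc_freqs = Counter()  # term -> number of documents containing it
        self.total_tokens = 0       # sum of all document lengths

    @staticmethod
    def tokenize(text: str) -> list:
        # Assumption: plain lowercase whitespace tokenization.
        return text.lower().split()

    def write_document(self, doc_id: str, text: str) -> None:
        if doc_id in self.term_freqs:
            self.delete_document(doc_id)  # overwrite an existing document
        freqs = Counter(self.tokenize(text))
        self.term_freqs[doc_id] = freqs
        self.doc_freqs.update(freqs.keys())  # each distinct term counts once
        self.total_tokens += sum(freqs.values())

    def delete_document(self, doc_id: str) -> None:
        freqs = self.term_freqs.pop(doc_id)
        for term in freqs:
            self.doc_freqs[term] -= 1
            if self.doc_freqs[term] == 0:
                del self.doc_freqs[term]
        self.total_tokens -= sum(freqs.values())

    def retrieve(self, query: str, top_k: int = 3) -> list:
        n_docs = len(self.term_freqs)
        if n_docs == 0:
            return []
        avg_len = self.total_tokens / n_docs
        query_terms = self.tokenize(query)
        scores = {}
        for doc_id, freqs in self.term_freqs.items():
            doc_len = sum(freqs.values())
            score = 0.0
            for term in query_terms:
                tf = freqs.get(term, 0)
                if tf == 0:
                    continue
                df = self.doc_freqs[term]
                # Assumption: the "+ 1" IDF variant, which is never negative.
                idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
                norm = tf + self.k1 * (1 - self.b + self.b * doc_len / avg_len)
                score += idf * tf * (self.k1 + 1) / norm
            scores[doc_id] = score
        ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
        return ranked[:top_k]
```

In this sketch, writing or deleting a document costs time proportional to that document's length, while a query only reads the precomputed counters.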

How did you test it?

As suggested by @julian-risch, the change should be non-breaking. I therefore tested it with the existing test cases in test/document_stores/test_in_memory.py. 81 test cases passed and 3 failed, each with an explainable cause:

  • TestMemoryDocumentStore::test_from_dict: self.bm25_algorithm now holds the algorithm name as a string literal instead of a BM25 object, so it no longer has a .__name__ attribute.
  • TestMemoryDocumentStore::test_bm25_retrieval_with_non_scaled_BM25Okapi: the pytest fixture initializes a BM25L document store, and the test case changes the algorithm after construction rather than through the initializer, so the store still uses BM25L instead of Okapi BM25. Initializing the store with the intended algorithm makes the test pass.
  • TestMemoryDocumentStore::test_bm25_retrieval_with_text_and_table_content: the non-matching documents have tied scores. The test case previously got a "lucky pass" because NumPy's quicksort can reorder documents even when their scores are the same.
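The tied-score case in the last item can be made order-independent in a test. The sketch below is only an illustration, not what this PR or the test suite does: `rank_with_tiebreak` and `matching_ids` are hypothetical helpers, and the document IDs and scores are made up. It shows two options: break ties deterministically by document ID, or assert on the set of matching documents rather than on their order.

```python
def rank_with_tiebreak(scores: dict, top_k: int) -> list:
    """Rank document IDs by descending score, breaking ties by ID."""
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:top_k]


def matching_ids(scores: dict, min_score: float = 0.0) -> set:
    """Return the IDs whose score is strictly above min_score, ignoring order."""
    return {doc_id for doc_id, score in scores.items() if score > min_score}


# Made-up scores: one matching document and two tied non-matching documents.
scores = {"doc_c": 0.0, "doc_a": 1.2, "doc_b": 0.0}

# The ranking no longer depends on insertion order or on the sort algorithm.
ranking = rank_with_tiebreak(scores, top_k=3)

# Alternatively, a test can check only which documents matched.
matched = matching_ids(scores)
```

With an explicit tie-break, the result is the same whichever sorting routine is used underneath.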

Notes for the reviewer

Any suggestion is appreciated :)


@CLAassistant

CLAassistant commented Apr 12, 2024

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added 2.x Related to Haystack v2.0 type:documentation Improvements on the docs labels Apr 12, 2024
… naming consistency; 3. remove unused import
@julian-risch
Member

@Guest400123064 Thank you for opening this PR! We really appreciate it. Our team will need a little bit more time to review your PR.
Having had a first quick look, I think we can remove the rank_bm25 dependency from the project here and also remove the import from the tests here if it is no longer used in this single test.

@Guest400123064
Contributor Author

Guest400123064 commented Apr 18, 2024

Thanks for the reply! Yeah, theoretically it should completely replicate rank_bm25, but I haven't done an extensive exact comparison, e.g., with fake data generated by hypothesis. I am wondering whether I should benchmark the retrieval performance directly instead of trying to match rank_bm25.
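One shape such a randomized comparison could take is sketched below. It is an assumption-laden illustration: it uses the standard-library `random` module as a stand-in for hypothesis, compares only document-frequency counts rather than full BM25 scores, and does not call rank_bm25. The names `VOCAB`, `rebuild_doc_freqs`, and `run_random_trial` are hypothetical. After each random write or delete, the incrementally maintained counts are checked against counts rebuilt from scratch.

```python
import random
from collections import Counter

VOCAB = ["alpha", "beta", "gamma", "delta", "epsilon"]


def rebuild_doc_freqs(docs: dict) -> Counter:
    """Reference implementation: recompute document frequencies from scratch."""
    doc_freqs = Counter()
    for tokens in docs.values():
        doc_freqs.update(set(tokens))
    return doc_freqs


def run_random_trial(seed: int, steps: int = 200) -> bool:
    """Apply random writes and deletes; return True if the counts always agree."""
    rng = random.Random(seed)
    docs = {}                 # doc id -> list of tokens
    incremental = Counter()   # document frequencies maintained incrementally
    next_id = 0
    for _ in range(steps):
        if docs and rng.random() < 0.4:
            # Delete a random existing document and decrement its terms.
            doc_id = rng.choice(sorted(docs))
            for term in set(docs.pop(doc_id)):
                incremental[term] -= 1
                if incremental[term] == 0:
                    del incremental[term]
        else:
            # Write a new random document (possibly empty).
            tokens = [rng.choice(VOCAB) for _ in range(rng.randint(0, 6))]
            docs[next_id] = tokens
            incremental.update(set(tokens))
            next_id += 1
        if incremental != rebuild_doc_freqs(docs):
            return False
    return True


all_trials_agree = all(run_random_trial(seed) for seed in range(20))
```

The same pattern would extend to comparing scores against a reference implementation, with a tolerance for floating-point differences.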

@davidsbatista
Contributor

Hi @Guest400123064, thanks for your contribution, this is very good work! I left some initial suggestions.

@davidsbatista davidsbatista marked this pull request as ready for review April 29, 2024 11:12
@davidsbatista davidsbatista requested review from a team as code owners April 29, 2024 11:12
@davidsbatista davidsbatista requested review from dfokina and julian-risch and removed request for a team April 29, 2024 11:12
@julian-risch julian-risch removed their request for review April 30, 2024 07:02
@coveralls
Collaborator

coveralls commented May 2, 2024

Pull Request Test Coverage Report for Build 8938434225

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 5 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.2%) to 90.333%

Files with Coverage Reduction:
  • document_stores/in_memory/document_store.py: 5 new missed lines, 98.04% covered

Totals:
  • Change from base Build 8937849375: 0.2%
  • Covered Lines: 6513
  • Relevant Lines: 7210

💛 - Coveralls

@davidsbatista davidsbatista enabled auto-merge (squash) May 3, 2024 11:44
Contributor

@davidsbatista davidsbatista left a comment

LGTM

@davidsbatista davidsbatista merged commit cd66a80 into deepset-ai:main May 3, 2024
24 checks passed
Labels
2.x Related to Haystack v2.0 topic:build/distribution topic:tests type:documentation Improvements on the docs
5 participants