perf: enhanced `InMemoryDocumentStore` BM25 query efficiency with incremental indexing #7549

Guest400123064 · 2024-04-12T23:01:43Z

Related Issues

This proposal was first made as a stand-alone Haystack document store integration, which is linked to issue number 218 in the haystack-integrations repo.

Proposed Changes:

Instead of reindexing on every new query, I perform incremental indexing whenever the documents change. The modifications are primarily in write_documents, delete_documents, and bm25_retrieval.
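To illustrate the idea, here is a minimal sketch of incremental BM25 indexing. It is not the code in this PR: the class and attribute names (`IncrementalBM25`, `DocStats`, `doc_freq`), the whitespace tokenizer, and the specific Okapi IDF variant are all assumptions chosen for brevity. The point is only that corpus statistics are updated on write and delete, so a query reuses them instead of rebuilding the index.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class DocStats:
    """Per-document statistics (hypothetical name, for illustration only)."""
    freq: Counter  # token -> term frequency in this document
    length: int    # number of tokens in this document


class IncrementalBM25:
    """Toy Okapi BM25 index that is updated on write/delete, not on query."""

    def __init__(self, k1: float = 1.5, b: float = 0.75) -> None:
        self.k1, self.b = k1, b
        self.docs = {}             # doc_id -> DocStats
        self.doc_freq = Counter()  # token -> number of documents containing it
        self.total_len = 0         # sum of all document lengths

    def write(self, doc_id: str, text: str) -> None:
        if doc_id in self.docs:    # overwrite: retract the old statistics first
            self.delete(doc_id)
        tokens = text.lower().split()  # assumed tokenizer: lowercase + whitespace
        stats = DocStats(Counter(tokens), len(tokens))
        self.docs[doc_id] = stats
        self.doc_freq.update(stats.freq.keys())
        self.total_len += stats.length

    def delete(self, doc_id: str) -> None:
        stats = self.docs.pop(doc_id)
        self.total_len -= stats.length
        for token in stats.freq:
            self.doc_freq[token] -= 1
            if self.doc_freq[token] <= 0:
                del self.doc_freq[token]  # token no longer occurs in the corpus

    def query(self, text: str, top_k: int = 3):
        if not self.docs:
            return []
        n = len(self.docs)
        avgdl = self.total_len / n
        scores = {doc_id: 0.0 for doc_id in self.docs}
        for token in set(text.lower().split()):
            df = self.doc_freq.get(token, 0)
            if df == 0:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            for doc_id, stats in self.docs.items():
                tf = stats.freq.get(token, 0)
                if tf:
                    norm = tf + self.k1 * (1 - self.b + self.b * stats.length / avgdl)
                    scores[doc_id] += idf * tf * (self.k1 + 1) / norm
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:top_k]
```

In this sketch a write or delete costs time proportional to the length of the affected document, while a query only reads the cached statistics.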

How did you test it?

As suggested by @julian-risch, the change should be non-breaking. Therefore, I tested it against the existing test cases in test/document_stores/test_in_memory.py. 81 test cases passed and 3 failed, each with an explainable cause:

TestMemoryDocumentStore::test_from_dict: self.bm25_algorithm now holds the algorithm name as a string literal instead of a BM25 object, so it does not have the .__name__ attribute.
TestMemoryDocumentStore::test_bm25_retrieval_with_non_scaled_BM25Okapi: the pytest fixture initializes a BM25L document store, and the test case changes the algorithm after initialization rather than through the initializer, so the underlying algorithm remains BM25L instead of Okapi BM25. Changing the algorithm used at initialization makes the test pass.
TestMemoryDocumentStore::test_bm25_retrieval_with_text_and_table_content: the non-matching documents have tied scores. The test case previously got a "lucky pass" because NumPy quick-sort alters the document order even when the scores are the same.
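For the tied-score case, one way to make such a test independent of the sort used is to compare groups of equally scored documents rather than exact positions. The helper below is a hypothetical sketch (the name `rank_groups` and the `(doc_id, score)` pair format are assumptions, not part of the Haystack test suite):

```python
from itertools import groupby


def rank_groups(results):
    """Group (doc_id, score) pairs into sets of ids sharing a score, best first.

    Scores are compared for exact equality, which is adequate for ties such as
    several documents scoring 0.0 because they match no query term.
    """
    ordered = sorted(results, key=lambda pair: pair[1], reverse=True)
    return [
        {doc_id for doc_id, _ in group}
        for _, group in groupby(ordered, key=lambda pair: pair[1])
    ]


# Two rankings that differ only in the order of the tied documents.
run_a = [("d1", 2.0), ("d2", 0.0), ("d3", 0.0)]
run_b = [("d1", 2.0), ("d3", 0.0), ("d2", 0.0)]
same_ranking = rank_groups(run_a) == rank_groups(run_b)
```

An assertion on `rank_groups(...)` passes for both orderings, so it does not depend on whether the underlying sort happens to be stable.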

Notes for the reviewer

Any suggestion is appreciated :)

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
I documented my code
I ran pre-commit hooks and fixed any issue

CLAassistant · 2024-04-12T23:01:50Z

All committers have signed the CLA.

julian-risch · 2024-04-18T10:44:40Z

@Guest400123064 Thank you for opening this PR! We really appreciate it. Our team will need a little bit more time to review your PR.
Having had a first quick look, I think we can remove the haystack_bm25 dependency from the project and also remove the import from the tests, if it is no longer used in this single test.

Guest400123064 · 2024-04-18T12:36:00Z

Thanks for the reply! Yea, theoretically it should completely replicate rank_bm25, but I haven't done an extensive exact comparison, e.g., with fake data generated by hypothesis. I am wondering whether I should directly benchmark the retrieval performance instead of trying to match rank_bm25.

haystack/document_stores/in_memory/document_store.py

davidsbatista · 2024-04-24T09:13:24Z

Hi @Guest400123064, thanks for your contribution, this is very good work! I left some initial suggestions.

coveralls · 2024-05-02T09:14:59Z

Pull Request Test Coverage Report for Build 8938434225

Details

0 of 0 changed or added relevant lines in 0 files are covered.
5 unchanged lines in 1 file lost coverage.
Overall coverage increased (+0.2%) to 90.333%

Files with Coverage Reduction	New Missed Lines	%
document_stores/in_memory/document_store.py	5	98.04%

Totals
Change from base Build 8937849375:	0.2%
Covered Lines:	6513
Relevant Lines:	7210

💛 - Coveralls

davidsbatista

LGTM

Guest400123064 added 3 commits April 12, 2024 15:21

incorporating better bm25 impl without breaking interface

90e0216

incorporating better bm25 impl without breaking interface

aaa837f

all three bm25 algos

1ac92e7

github-actions bot added the 2.x (Related to Haystack v2.0) and type:documentation (Improvements on the docs) labels Apr 12, 2024

1. setting algo post-init not allowed; 2. remove extra underscore for naming consistency; 3. remove unused import

7bfdfc1

julian-risch requested a review from davidsbatista April 23, 2024 13:40

davidsbatista reviewed Apr 24, 2024

haystack/document_stores/in_memory/document_store.py (resolved)

davidsbatista reviewed Apr 24, 2024

haystack/document_stores/in_memory/document_store.py (outdated, resolved)

davidsbatista reviewed Apr 24, 2024

haystack/document_stores/in_memory/document_store.py (outdated, resolved)

Guest400123064 and others added 4 commits April 24, 2024 21:43

Merge branch 'deepset-ai:main' into main

e4ccdea

1. rename attribute name for IDF computation 2. organize document statistics as a dataclass instead of tuple to improve readability

687bef3

fix score type initialization (int -> float) to pass mypy check

81e106d

release note included

511d56c

davidsbatista marked this pull request as ready for review April 29, 2024 11:12

davidsbatista requested review from a team as code owners April 29, 2024 11:12

davidsbatista requested review from dfokina and julian-risch and removed request for a team April 29, 2024 11:12

davidsbatista added 5 commits April 29, 2024 13:12

Merge branch 'main' into main

837f471

Merge branch 'main' into main

a219d84

fixing linting issues

89ffc21

fixing linting issues and mypy

db27192

fixing tests

d7f7ff7

github-actions bot added the topic:tests label Apr 29, 2024

julian-risch removed their request for review April 30, 2024 07:02

davidsbatista and others added 4 commits April 30, 2024 20:26

fixing tests

e54924f

Merge branch 'deepset-ai:main' into main

2be3240

removing heapq import and cleaning up logging

58d43f6

changing indexing order

b022975

davidsbatista added 5 commits May 2, 2024 11:30

adding more tests

4d3c4e1

increasing tests

e9fe6ad

Merge branch 'main' into main

9788a87

Merge branch 'main' into main

5303f73

removing rank_bm25 from pyproject.toml

3d133c7

github-actions bot added the topic:build/distribution label May 3, 2024

davidsbatista added 3 commits May 3, 2024 11:55

Merge branch 'main' into main

b12e2c0

Merge branch 'main' into main

3deebbe

Merge branch 'main' into main

046312a

davidsbatista enabled auto-merge (squash) May 3, 2024 11:44

davidsbatista approved these changes May 3, 2024

davidsbatista merged commit cd66a80 into deepset-ai:main May 3, 2024
24 checks passed
