Implement Context Matching #2293

tstadel · 2022-03-09T19:26:35Z

Proposed changes:

Implement methods calculate_context_similarity and match_context

Status (please check what you already did):

First draft (up for discussions & feedback)
Final code
Added tests
Updated documentation

closes #2265

tstadel · 2022-03-15T20:30:11Z

Tests to be added...

ArzelaAscoIi

Awesome :) Thanks for also adding match_contexts !

julian-risch

Looks very good already. Looking forward to some tests. Maybe it's better to rename matching.py to context_matching.py. Otherwise it's too generic. For now utils is fine but I could imagine the code also under modeling/evaluation. Don't forget to add labels to the PR. 😉

julian-risch · 2022-03-16T14:57:42Z

haystack/utils/matching.py

+
+    :param context: The context to match.
+    :param candidate: The candidate to match the context.
+    :param min_words: The minimum number of words context and candidate need to have in order to be scored.


Just wondering whether we want to call this words or tokens. Could also be min_seq_len (minimum number of tokens) in reference to max_seq_len of the reader models.

"words" should generally be easier for most people to understand than "tokens". Also, min_seq_len refers to a sequence of "whatever", because what you're dealing with (words, wordpieces, bytes, etc.) depends entirely on the tokenizer. Here we're actually dealing with words, so I'd leave it like that.

tstadel · 2022-03-17T12:36:12Z

@julian-risch @ArzelaAscoIi
First version should be complete now:

added tests
changed the min_words param into min_length, so we now work on string length, which should be more generic and produce fewer edge cases than counting words
added the boost_split_overlaps param to control whether partial overlaps that result from splitting during preprocessing should be boosted/matched
added non-parallel (num_processes<=1) versions of match_context and match_contexts for easier debugging
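To make the min_length gate and the serial/parallel switch concrete, here is a minimal sketch. It is not the code from this PR: it scores with the standard library's difflib instead of the PR's own similarity measure, it leaves out boost_split_overlaps entirely, and the names normalize, similarity and match_context_sketch as well as the default values are assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher
from functools import partial
from multiprocessing import Pool


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not affect the score."""
    return re.sub(r"\s+", " ", text).strip().lower()


def similarity(context: str, candidate: str, min_length: int = 100) -> float:
    """Return a score in [0, 100]; strings shorter than min_length are not scored."""
    a, b = normalize(context), normalize(candidate)
    if len(a) < min_length or len(b) < min_length:
        return 0.0  # too short to give a meaningful score
    # difflib stands in for the real similarity measure (assumption).
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def match_context_sketch(context, candidates, threshold=65.0, num_processes=1, min_length=100):
    """Score (candidate_id, text) pairs against context; return [(candidate_id, score)], best first."""
    candidates = list(candidates)
    texts = [text for _, text in candidates]
    score = partial(similarity, context, min_length=min_length)
    if num_processes <= 1:
        # Serial path: plain loop, easy to step through in a debugger.
        scores = [score(text) for text in texts]
    else:
        # Parallel path: call this from under an `if __name__ == "__main__":` guard.
        with Pool(processes=num_processes) as pool:
            scores = pool.map(score, texts)
    matches = [(cid, s) for (cid, _), s in zip(candidates, scores) if s > threshold]
    return sorted(matches, key=lambda match: match[1], reverse=True)


if __name__ == "__main__":
    context = "The quick brown fox jumps over the lazy dog near the river bank."
    candidates = [
        ("a", "the quick  brown fox jumps over the lazy dog near the river bank."),
        ("b", "0123456789 0123456789 0123456789 0123456789"),
        ("c", "short"),
    ]
    print(match_context_sketch(context, candidates, num_processes=1, min_length=20))
```

With num_processes=1 everything runs in the calling process, so breakpoints and stack traces behave as usual, which is the debugging benefit mentioned above.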

ArzelaAscoIi

Quite complex! Looks good! 🚀

julian-risch

LGTM! 👍 There is just a small typo to be fixed before merging. Further, let's keep an eye on whether boost_split_overlaps increases the number of false positive matches. In that case, we might want to set boost_split_overlaps=False by default.

julian-risch · 2022-03-21T08:05:46Z

haystack/utils/context_matching.py

+        grouped_matches = groupby(group_sorted_matches, key=lambda candidate: candidate.context_id)
+        for context_id, group in grouped_matches:
+            sorted_group = sorted(group, key=lambda candidate: candidate.score, reverse=True)
+            match_list = list((candiate_score.candidate_id, candiate_score.score) for candiate_score in sorted_group)


typo in candiate_score
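For reference, the grouping step from this snippet can be run standalone as sketched below, with the identifier spelled candidate_score. The ScoredCandidate tuple and the group_matches function are hypothetical stand-ins; only the attribute names context_id, candidate_id and score come from the diff. Note that itertools.groupby only groups consecutive items, so the input has to be sorted by context_id first, which the name group_sorted_matches in the diff suggests is already the case.

```python
from collections import namedtuple
from itertools import groupby

# Hypothetical container; the attribute names mirror those used in the diff.
ScoredCandidate = namedtuple("ScoredCandidate", ["context_id", "candidate_id", "score"])


def group_matches(scored):
    """Map each context_id to its (candidate_id, score) pairs, best score first."""
    # groupby only merges adjacent items, so sort by the grouping key first.
    group_sorted_matches = sorted(scored, key=lambda candidate: candidate.context_id)
    grouped_matches = groupby(group_sorted_matches, key=lambda candidate: candidate.context_id)
    result = {}
    for context_id, group in grouped_matches:
        sorted_group = sorted(group, key=lambda candidate: candidate.score, reverse=True)
        result[context_id] = [
            (candidate_score.candidate_id, candidate_score.score) for candidate_score in sorted_group
        ]
    return result


scored = [
    ScoredCandidate("ctx2", "doc1", 70.0),
    ScoredCandidate("ctx1", "doc1", 80.0),
    ScoredCandidate("ctx1", "doc2", 95.0),
]
print(group_matches(scored))
# {'ctx1': [('doc2', 95.0), ('doc1', 80.0)], 'ctx2': [('doc1', 70.0)]}
```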

tstadel · 2022-03-21T09:07:04Z

I agree that the boost_split_overlaps flag needs further empirical testing. So far it does its job pretty well in the tests, so for now I'll leave it with the default value True.

tstadel and others added 10 commits March 9, 2022 20:25

first context_matching impl

ee7ba23

Update Documentation & Code Style

16419e9

sort matches

4de5fc8

fix matching bugs

d55ea35

Update Documentation & Code Style

3e4bc3e

add match_contexts

698a234

Merge branch 'context_matching' of github.com:deepset-ai/haystack int…

3917abe

…o context_matching

min_words added

2d6afaf

Update Documentation & Code Style

ff315f9

Merge branch 'master' into context_matching

60ddea7

tstadel marked this pull request as ready for review March 15, 2022 20:29

tstadel requested review from ArzelaAscoIi and julian-risch March 15, 2022 20:29

ArzelaAscoIi reviewed Mar 16, 2022

julian-risch reviewed Mar 16, 2022

tstadel and others added 9 commits March 16, 2022 16:35

rename matching.py to context_matching.py

f543ba3

fix mypy

449c225

added tests and heuristic for one-sided overlaps

e53fef2

Update Documentation & Code Style

95005f7

add another noise test

38941a7

Update Documentation & Code Style

31c11ca

improve boosting split overlaps

b7fc41b

add non parallel versions of match_context and match_contexts

68bda35

Update Documentation & Code Style

1c8d6e4

tstadel added 2 commits March 17, 2022 14:21

fix pylint finding

0016bcf

add tests for match_context and match_contexts

97d979b

ArzelaAscoIi approved these changes Mar 17, 2022

julian-risch approved these changes Mar 21, 2022

fix typo

1601deb

tstadel merged commit e13df4b into master Mar 21, 2022

tstadel deleted the context_matching branch March 21, 2022 09:35
