Using FAQ Style retrieval for Document Similarity Use Case #104

Closed
sonnylaskar opened this issue May 9, 2020 · 8 comments
@sonnylaskar

Hi,

I am wondering if FAQ-style retrieval can also be used for a document similarity use case in information retrieval.

Use case:
Say one has lots of articles stored in Elasticsearch and, given an input, wants to find the closest matching article.
This is a common document similarity use case. The user could follow the FAQ-style retrieval approach and create embeddings from the article text field (let's treat it as the question field used in the FAQ tutorial); an incoming input can then be matched against these embeddings to retrieve the best matches. The top match can be considered the most similar document given the input.
The embedding creation process might be slow because the articles can be very long.

Please comment on what you think about this use case and this implementation using Haystack, or suggest another approach if you see a better one.

Thanks

@tholor tholor self-assigned this May 11, 2020
@tholor

tholor commented May 11, 2020

Hey @sonnylaskar,

Absolutely, you could also create embeddings for texts other than questions and use the EmbeddingRetriever to find the most similar documents. One remark though: embeddings usually work well if they are created for similar units of text (e.g. two questions or two passages of text). If you have different units (e.g. one keyword query and one very long document), it's trickier to get meaningful embeddings.
What is the exact use case you have in mind?
a) find most similar document given another "input" document
b) find most similar documents given a "user query"

For a) you could pretty much use the current implementation with one encoder (embedding model) for both texts, while for b) it's usually better to have two separate encoders (e.g. see #63).

Both should be possible in Haystack without big modifications. The bigger task is probably finding models that produce good embeddings for your use case; this usually depends on the domain, the length of the texts, and the style of language.
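
For case a), a rough sketch along the lines of the FAQ tutorial (import paths, argument names and the model name are from the 2020-era API and may differ in your version):

```python
# Rough sketch of case a): one encoder for both indexing and querying.
# Paths/signatures follow the early Haystack FAQ tutorial and may have changed since.
from haystack.database.elasticsearch import ElasticsearchDocumentStore
from haystack.retriever.elasticsearch import EmbeddingRetriever

document_store = ElasticsearchDocumentStore(
    host="localhost", index="articles",
    embedding_field="article_emb", embedding_dim=768,
    excluded_meta_data=["article_emb"],
)
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="deepset/sentence_bert",  # placeholder; pick a model that fits your domain
)

# Indexing: store each article together with its embedding
article_texts = ["First article ...", "Second article ..."]
embeddings = retriever.embed_queries(texts=article_texts)
document_store.write_documents(
    [{"text": t, "article_emb": emb} for t, emb in zip(article_texts, embeddings)]
)

# Inference: embed the incoming article with the SAME encoder and fetch nearest neighbours
most_similar = retriever.retrieve(query="Incoming article text ...", top_k=3)
```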

@predoctech

Hi @tholor ,
This post prompted some further thoughts on use cases, and together with FB's DPR highlighted in #63, the following questions come to mind:

  1. What is the right approach to "combine" results from the two retrievers for task b) above?
  2. How to evaluate the difference in performance between scoring methods like cosine similarity vs. dot product? (A quick note on this follows below.)
  3. Should there be a concept like "weights" that gives more importance (e.g. higher ranking) to one encoder (say the question encoder) than to the other (the passage encoder)?

Appreciate your thoughts.
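
For context on question 2: as far as I understand, cosine similarity is just the dot product of length-normalized vectors, so the comparison is really about whether vector magnitude should influence the score. A toy illustration (plain numpy, made-up vectors):

```python
import numpy as np

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.5, 1.0])

dot = float(np.dot(a, b))
cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# After L2-normalisation, dot product and cosine similarity coincide
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
assert abs(float(np.dot(a_n, b_n)) - cos) < 1e-9
```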

@tholor

tholor commented May 15, 2020

  1. For case b) you will need one model to create the embeddings for your documents when you index them ("doc encoder") and another model later, at inference time, to create the embedding for the incoming user question ("question encoder"). These happen in quite different phases, so there's no need to really combine/concatenate two retrievers.
    In Haystack you could init an EmbeddingRetriever(embedding_model=<doc-encoder>, ...) for indexing your documents and then, at inference time, init a second EmbeddingRetriever(embedding_model=<question-encoder>, ...) with your question encoder (see the sketch after this list).
    (For simplicity, we could also think about adding an option to have two embedding_models in the EmbeddingRetriever.)

  2. We are currently implementing evaluation options for Reader, Retriever and Finder in Provide Evaluation #92. It should be merged soon; you will then be able to measure performance metrics like recall or mean average precision to compare different Retrievers/Readers and understand bottlenecks in your setup.

  3. Maybe you have a different use case in mind, but for case b) from above you take the embedding from the question encoder and calculate the cosine similarity to the document embeddings. Introducing a weight for one of the embeddings doesn't seem meaningful here.
    If you are thinking about having multiple fields that you want to compare (e.g. cosine sim. of question <-> text_1 and cosine sim. of question <-> headline), weights could be helpful, I believe.
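
A sketch of what point 1 could look like (the <doc-encoder>/<question-encoder> names are placeholders; in practice a matched pair would come from a dual-encoder setup like DPR in #63, and method names may differ across early Haystack versions):

```python
# Sketch of case b): two EmbeddingRetrievers sharing one document store.
from haystack.database.elasticsearch import ElasticsearchDocumentStore
from haystack.retriever.elasticsearch import EmbeddingRetriever

document_store = ElasticsearchDocumentStore(
    host="localhost", index="docs", embedding_field="doc_emb", embedding_dim=768
)

# Indexing time: embed documents with the doc encoder
doc_retriever = EmbeddingRetriever(document_store=document_store,
                                   embedding_model="<doc-encoder>")
texts = ["Some long passage ...", "Another passage ..."]
doc_embs = doc_retriever.embed_queries(texts=texts)
document_store.write_documents(
    [{"text": t, "doc_emb": e} for t, e in zip(texts, doc_embs)]
)

# Inference time: embed the user question with the question encoder
question_retriever = EmbeddingRetriever(document_store=document_store,
                                        embedding_model="<question-encoder>")
results = question_retriever.retrieve(query="What is ...?", top_k=5)
```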

@predoctech

Thanks @tholor.
On 3: yes, the use case I mentioned involves comparing the similarity of the incoming question to two fields. What I have in mind is to calculate (cosine sim. of user query <-> FAQ question) and (cosine sim. of user query <-> FAQ answer) and somehow combine the two scores in order to determine which FAQ pair is best given the user query. The reason is that some context is actually contained in the FAQ's answer rather than the question (e.g. keywords, domain jargon), but is likely to be included in the user query. So by looking at (cosine sim. of user query <-> FAQ question) alone it may not look like the best match, but when (cosine sim. of user query <-> FAQ answer) is also taken into account the right answer may become obvious.
Is that something that can be handled in Haystack? Thanks.
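
Roughly what I mean, as a toy example (embeddings and weights are made up):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up embeddings for one FAQ pair and an incoming user query
query_emb    = np.random.rand(768)
question_emb = np.random.rand(768)
answer_emb   = np.random.rand(768)

w_question, w_answer = 0.7, 0.3  # illustrative weights
score = (w_question * cosine(query_emb, question_emb)
         + w_answer * cosine(query_emb, answer_emb))
# Rank FAQ pairs by this combined score instead of the question similarity alone
```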

@tholor

tholor commented May 18, 2020

Ok, got it. This isn't implemented in Haystack yet, but it would fit the scope, and I see two options for the implementation:

a) Concatenation of Retrievers: Make the Finder accept a list of Retrievers and combine their scores (incl. weights) before feeding results to the Reader
Pro: most generic solution also allowing combination of BM25 + EmbeddingRetriever
Con: Two separate retriever queries (suboptimal efficiency in some cases)

b) New MultiEmbeddingRetriever: Extend the EmbeddingRetriever to the case of multiple embeddings
Pro: simple implementation without side effects on the regular user interface
Con: won't allow the combination of other retriever methods (e.g. ElasticsearchRetriever + EmbeddingRetriever)

I am leaning towards a), but could see b) as a simpler, short-term solution. Would you be interested in implementing one of them in a PR, @predoctech? We could help you along the way. Otherwise we'll put it in our backlog and try to implement it in the next few weeks.
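
To make option a) a bit more concrete, the Finder-side merge could look roughly like this (entirely hypothetical, not current API; note that BM25 and cosine scores live on different scales, so real code would need per-retriever score normalisation):

```python
from collections import defaultdict

def retrieve_combined(retrievers_and_weights, query, top_k=10):
    """Hypothetical sketch for option a): merge weighted scores from several retrievers.
    Assumes each retriever returns documents with .id and a comparable .score."""
    combined = defaultdict(float)
    by_id = {}
    for retriever, weight in retrievers_and_weights:
        for doc in retriever.retrieve(query=query, top_k=top_k):
            combined[doc.id] += weight * doc.score
            by_id[doc.id] = doc
    ranked = sorted(combined, key=combined.get, reverse=True)
    return [by_id[doc_id] for doc_id in ranked[:top_k]]
```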

@predoctech

Thanks @tholor. Conceptually I'd think a) is much more powerful. You are right that even though I didn't mention the ElasticsearchRetriever initially, it is something that fits our use case as well. Ideally the choice of Retrievers, and correspondingly their weights, could shift according to the scores coming out of each iteration. For instance, if the user query matches an exact FAQ question with probability 1.0, then only the Retriever on "question_emb" is ever needed. If the probability is low, however, "answer_emb" could kick in, and further down the road BM25 could be utilized on docs other than the FAQ dataset. Only a) can provide this sort of flexibility, I think.

Unfortunately a) will be out of my league given my skill level. I will try to understand Haystack better to see if I stand a chance with b).

@predoctech

@tholor I also wish to follow up on a remark you made above: "embeddings work well if they are created for similar units of text". I suppose by "units" you mean the token sequence length used by the pre-trained model. In that case, what is that number for models like sentence_bert? The more important question may be whether there is any way to alter that length. Ideally, for an EmbeddingRetriever use case, wouldn't the best performance come when the user query and the FAQ query contain a similar number of tokens and their context/semantics are grouped in a similar fashion across the two sequences?

@tholor

tholor commented May 28, 2020

@predoctech sorry, somehow missed your follow-up questions here.

I suppose by "units" you mean the token sequence length used by the pre-trained model.

  1. The texts that you convert at inference time should be as similar as possible to the ones used for training (ideally in length and language style).
  2. Some models can also produce meaningful embeddings for texts that are longer than the ones seen at training time. Even in that case: if you now create one embedding for a short text (e.g. a question) and one for a long text (e.g. a passage), the cosine similarity of the two might not be very meaningful.

In that case, what is that number for models like sentence_bert?

You could check the data that was used for training them. It's mostly NLI datasets working on a "sentence level", so their optimal performance will probably be on this unit of "one sentence" (e.g. a question) and might degrade for longer passages (e.g. a long FAQ answer).

The more important question may be whether there is any way to alter that length. Ideally, for an EmbeddingRetriever use case, wouldn't the best performance come when the user query and the FAQ query contain a similar number of tokens and their context/semantics are grouped in a similar fashion across the two sequences?

I would not worry too much if your user questions and FAQ questions differ by a few tokens. As mentioned above, it's more severe if you really compare a full passage with a single sentence.
The best way to "alter that length" is to use two different encoders (as in #63).
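
For completeness: you can inspect (and raise) the raw token limit in sentence-transformers directly, though that doesn't change the sentence-level data the model was trained on (attribute name per the sentence-transformers library; check your installed version):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bert-base-nli-mean-tokens")  # the classic "sentence_bert" model
print(model.max_seq_length)   # inputs longer than this many tokens get truncated
model.max_seq_length = 256    # can be raised up to the underlying transformer's limit (512 for BERT)
```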

Hope this helps. Closing this for now and creating a new issue for the feature request of "Combining multiple retrievers". Feel free to reopen if there's more on your mind related to this particular discussion here :)

@tholor tholor closed this as completed May 28, 2020