Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add RouteDocuments and JoinAnswers nodes #2256

Merged
merged 19 commits into from
Mar 1, 2022
Merged

Conversation

bogdankostic
Copy link
Contributor

@bogdankostic bogdankostic commented Feb 28, 2022

This PR adds two nodes: a RouteDocuments node and a JoinAnswers node. Both nodes are needed in order to do QA on the combination of text and tables as the source of information.

The RouteDocuments takes as input a list of Documents and splits them by either content_type or a metadata field and routes the resulting splits to different outputs. An alternative to this node would have been to allow both texts and tables for the readers, but make the FARMReader skip Documents of type tables and the TableReader skip Documents of type text. However, having a designated node makes this process more explicit and allows splitting not only by content_type, but also other metadata values. (For example, for routing Documents to Readers that are trained on a specific domain or on a specific language)

The JoinAnswers node takes as input the predicted Answers of two individual Reader nodes and joins them to a single list of Answers.

This graph shows how a Pipeline allowing QA on both text and tables would look like:
pipeline

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@julian-risch
Copy link
Member

A JoinAnswers node would also be helpful for the use case described in this issue #1081 on combining FAQ and ExtractiveQA in a pipeline by @SasikiranJ

Copy link
Member

@julian-risch julian-risch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks quite good to me. I am just missing a test case for the JoinAnswers node, I saw some typing errors in the CI and I would like to briefly talk about the names of the new nodes with you. Feel free to ping me anytime.

@@ -59,6 +59,9 @@
{
"$ref": "#/definitions/ImageToTextConverterComponent"
},
{
"$ref": "#/definitions/JoinAnswersComponent"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the update of the Haystack documentation website isn't completed yet, we haven't upgraded the Haystack version to 1.2.1rc0 yet. Before merging your PR, we need to make sure that json-schemas/haystack-pipeline-1.2.0.schema.json is unchanged and a json-schemas/haystack-pipeline-1.2.1rc0.schema.json is created.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brandenchan we'll merge this PR now and then we'll need to correct the schema files once the website updates for the v1.2.0 are done.

@@ -1041,6 +1043,35 @@ def test_documentsearch_document_store_authentication(retriever_with_docs, docum
assert kwargs["headers"] == auth_headers


def test_split_document_list_content_type(test_docs_xs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add a test case to check both join modes of the JoinAnswers node.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a test case that covers both join modes.

{
"output_type": "stream",
"name": "stdout",
"text": [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it on purpose to include the outputs in the tutorial?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the outputs help the users to understand the output without having to run the tutorial.

@@ -1,2 +1,4 @@
from haystack.nodes.other.docs2answers import Docs2Answers
from haystack.nodes.other.join_docs import JoinDocuments
from haystack.nodes.other.split_documents import SplitDocumentList
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeing the different names of the other nodes, I am wondering whether we could have a more consistent naming scheme. Unfortunately, I don't have an alternative for SplitDocumentList in mind. Maybe we can briefly talk about it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about RouteDocuments? Similar to JoinDocuments and in theory there could later be a RouteAnswers node.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DocumentRouter would be more consistent with the other nodes (TableReader, Summarizer, Retriever) but then I am not immediately convinced by DocumentJoiner and AnswerJoiner.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RouteDocuments it is :)


def __init__(self, split_by: str = "content_type", metadata_values: Optional[List[str]] = None):
"""
:param split_by: Field to split the documents by. Either `"content_type"` or a metadata field name.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"by. Either" should become "by either"

@bogdankostic bogdankostic changed the title Add SplitDocumentList and JoinAnswers nodes Add RouteDocuments and JoinAnswers nodes Mar 1, 2022
@@ -1072,6 +1074,20 @@ def test_split_document_list_content_type(test_docs_xs):
assert result["output_3"][0].meta["meta_field"] == "test5"


@pytest.mark.parametrize("join_mode", ["concatenate", "merge"])
def test_join_answers_concatenate(join_mode):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_join_answers_concatenate is a little bit misleading as you test for "concatenate" and "merge".

@bogdankostic bogdankostic merged commit c5542bd into master Mar 1, 2022
@bogdankostic bogdankostic deleted the split_tables_and_texts branch March 1, 2022 16:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants