Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Classify pipeline's type based on its components #3132

Merged
merged 19 commits into from
Sep 21, 2022
Merged

Classify pipeline's type based on its components #3132

merged 19 commits into from
Sep 21, 2022

Conversation

vblagoje
Copy link
Member

@vblagoje vblagoje commented Sep 1, 2022

Related Issues

The encompassing issue is https://github.com/deepset-ai/haystack-private/issues/9

Proposed Changes:

Add a method that classifies the type of the pipeline based on components found

How did you test it?

Repeated invocations from colab notebook on a branch where the update trigger is 2 min instead of 24 hours
Added a unit test for pipeline classification

Notes for the reviewer

Focus on pipeline_invocation_counter decorator in haystack/utils/reflection.py, how the even is triggered (time) and what details are sent to telemetry server

Checklist

@vblagoje vblagoje requested review from a team as code owners September 1, 2022 14:30
@vblagoje vblagoje requested review from masci and removed request for a team September 1, 2022 14:30
Copy link
Member

@julian-risch julian-risch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great to see this issue addressed so quickly now. 👍 The PR is a good start and as discussed via slack, I would suggest to add a get_type() method to the BaseStandardPipeline class as well, which would internally just call self.pipeline.get_type().
Regarding the pipeline_types it just came to my mind that an enum would be an alternative to the dictionary. We aren't expecting a pipeline to be of multiple types at the same type.
Something that might be interesting in addition is to distinguish whether the retrieval part of the different pipeline types is a sparse retriever or a dense retriever or both.

Copy link
Contributor

@masci masci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a comment/question but overall LGTM

"RetrieverQuestionGenerationPipeline": lambda x: {"Retriever", "Question Generator"} <= set(x.keys()),
"QuestionAnswerGenerationPipeline": lambda x: {"QuestionGenerator", "Reader"} <= set(x.keys()),
"MostSimilarDocumentsPipeline": lambda x: len(x.values()) == 1
and any(comp for comp in x.values() if isinstance(comp, BaseDocumentStore)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I get this: if x.values() is 1, why the comprehension comp for comp in x.values()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point - thanks @masci - will correct it

Copy link
Contributor

@bogdankostic bogdankostic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice PR! Almost good to go, I think only get_type needs to be slightly adapted as it contains cases that do not seem to be reachable.

"GenerativeQAPipeline": lambda x: {"Generator", "Retriever"} <= set(x.keys()),
"FAQPipeline": lambda x: {"Docs2Answers"} <= set(x.keys()),
"ExtractiveQAPipeline": lambda x: {"Reader", "Retriever"} <= set(x.keys()),
"DocumentSearchPipeline": lambda x: {"Retriever"} <= set(x.keys()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the Pipeline types "SearchSummarizationPipeline" and "RetrieverQuestionGenerationPipeline" are not reachable, given that here we define that each Pipeline that contains a "Retriever" is a "DocumentSearchPipeline".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, great catch @bogdankostic

"SearchSummarizationPipeline": lambda x: {"Retriever", "Summarizer"} <= set(x.keys()),
"TranslationWrapperPipeline": lambda x: {"InputTranslator", "OutputTranslator"} <= set(x.keys()),
"QuestionGenerationPipeline": lambda x: {"QuestionGenerator"} <= set(x.keys()),
"RetrieverQuestionGenerationPipeline": lambda x: {"Retriever", "Question Generator"} <= set(x.keys()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Question Generator" should probably be "QuestionGenerator" here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, another one

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and also moved to the back of the list

@@ -654,7 +654,7 @@ class RetrieverQuestionGenerationPipeline(BaseStandardPipeline):
def __init__(self, retriever: BaseRetriever, question_generator: QuestionGenerator):
self.pipeline = Pipeline()
self.pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
self.pipeline.add_node(component=question_generator, name="Question Generator", inputs=["Retriever"])
self.pipeline.add_node(component=question_generator, name="QuestionGenerator", inputs=["Retriever"])
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bogdankostic note this. Suprised this didn't cause any issues

)
pipe.get_type().startswith("TranslationWrapperPipeline")

# pipe = MostSimilarDocumentsPipeline(document_store=MockDocumentStore())
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bogdankostic we have an issue with MostSimilarDocumentsPipeline

@vblagoje
Copy link
Member Author

@bogdankostic added a unit test, and I found one subtle bug - MostSimilarDocumentsPipeline doesn't initialize the pipeline attribute. All pipeline methods that rely on this property will fail (draw, get_node, save_to_yaml, and, of course, get_type). We have to fix that issue! Today, wdyt?

@vblagoje
Copy link
Member Author

@bogdankostic I found instances of "Question Generator" used in tests.

@vblagoje
Copy link
Member Author

@masci would love to consult with you regarding this PR. By adding unit tests, I discovered some subtle bugs in MostSimilarDocumentsPipeline. See the discussion above.

@bogdankostic
Copy link
Contributor

@bogdankostic added a unit test, and I found one subtle bug - MostSimilarDocumentsPipeline doesn't initialize the pipeline attribute. All pipeline methods that rely on this property will fail (draw, get_node, save_to_yaml, and, of course, get_type). We have to fix that issue! Today, wdyt?

I see the problem. How would you propose fixing this?

@vblagoje
Copy link
Member Author

@bogdankostic added a unit test, and I found one subtle bug - MostSimilarDocumentsPipeline doesn't initialize the pipeline attribute. All pipeline methods that rely on this property will fail (draw, get_node, save_to_yaml, and, of course, get_type). We have to fix that issue! Today, wdyt?

I see the problem. How would you propose fixing this?

We'll fix it soon. Here is the issue

Copy link
Contributor

@masci masci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's merge, we will follow up on the bug you found in a separate PR

@masci masci dismissed bogdankostic’s stale review September 21, 2022 12:48

Bogdan is unavailable today and we want to try shipping this with 1.9

@vblagoje vblagoje merged commit 938e6fd into deepset-ai:main Sep 21, 2022
brandenchan pushed a commit that referenced this pull request Sep 21, 2022
* Add pipeline get_type mehod

* Add pipeline uptime

* Add pipeline telemetry event sending

* Send pipeline telemetry once a day (at most)

* Add pipeline invocation counter, change invocation counter logic

* Update allowed telemetry parameters - allow pipeline parameters

* PR review: add unit test
masci pushed a commit that referenced this pull request Sep 22, 2022
* Add pipeline get_type mehod

* Add pipeline uptime

* Add pipeline telemetry event sending

* Send pipeline telemetry once a day (at most)

* Add pipeline invocation counter, change invocation counter logic

* Update allowed telemetry parameters - allow pipeline parameters

* PR review: add unit test
@vblagoje vblagoje deleted the classify_pipe branch October 24, 2022 08:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants