Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: simplify Summarizer, add Document Merger #3452

Merged
merged 30 commits into from
Nov 3, 2022
Merged

refactor: simplify Summarizer, add Document Merger #3452

merged 30 commits into from
Nov 3, 2022

Conversation

anakin87
Copy link
Member

@anakin87 anakin87 commented Oct 21, 2022

Related Issues

Proposed Changes:

As discussed in #3403

  • make the summarizer suitable for indexing pipelines
  • write the summarization results in meta instead of altering document content (current behavior)
  • remove generate_single_summary parameter: currently it transforms several documents into one
    (it is better to design a dedicated node for the purpose of merging documents)

How did you test it?

  • Adapted some tests
  • Other tests to be added?

Notes for the reviewer

Just a first draft to understand what this change breaks

Checklist

@anakin87
Copy link
Member Author

As expected, this refactoring was breaking a few things, which I tried to fix. 😄
I made some comments in correspondence with the tests, which required the most onerous changes.

@ZanSara @brandenchan @bogdankostic please feel free to jump in and help to go in the right direction...

@anakin87 anakin87 marked this pull request as ready for review October 23, 2022 10:01
@anakin87 anakin87 requested a review from a team as a code owner October 23, 2022 10:01
@anakin87 anakin87 requested review from mayankjobanputra and removed request for a team October 23, 2022 10:01
Copy link
Contributor

@ZanSara ZanSara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking great so far! I don't think you need more tests for this node right now, given we're only removing functionality. I found a coupe of leftover prints and one question only.

In addition, let's add the DocumentMerger in this same PR. This way we can re-introduce some of the removed tests with the new setup (docs merger + summarizer) and make sure no functionality is lost.

test/pipelines/test_eval_batch.py Outdated Show resolved Hide resolved
test/pipelines/test_eval_batch.py Outdated Show resolved Hide resolved
haystack/nodes/summarizer/transformers.py Show resolved Hide resolved
haystack/nodes/summarizer/transformers.py Outdated Show resolved Hide resolved
haystack/nodes/summarizer/transformers.py Show resolved Hide resolved
@ZanSara ZanSara added type:feature New feature or request topic:metadata labels Oct 24, 2022
@ZanSara ZanSara removed the request for review from mayankjobanputra October 24, 2022 10:39
@anakin87
Copy link
Member Author

@ZanSara thanks! When I have some time, I'll make the changes you suggested...

Just some questions:
how do you imagine the Document Merger?
How do we deal with the metadata of aggregated documents? Do we remove them?

@ZanSara
Copy link
Contributor

ZanSara commented Oct 24, 2022

How do you imagine the Document Merger?

So, the DocumentMerger should be super simple, something that just merges the content of the documents so that given a list of N, it returns a list of 1 (that's because Pipeline.run() always expects documents to be a List - and Pipeline.run_batch() always expects documents to be a List of List).

We might want to deal with the possibility of receiving input from multiple nodes. That's a messy topic though, so if you see that it becomes intractable and makes the node complex, ignore that. If anyone needs to handle multiple inputs, they can use JoinDocuments on top of the Merger for now.

How do we deal with the metadata of aggregated documents? Do we remove them?

Now that's a great question 😄 My intuition says that we should apply the following heuristic:

  1. If a key is present with the same value in all documents to be merged, add it to the merged doc too.
  2. Any other key is removed.

Mind that keys might contain nested values. I'd apply such heuristic recursively on dictionary entries.

Let's also pay attention to keep "important" keys like name, which is often accessed. I don't know how we could deal with a conflict there, so let's see which (if any) test break when we just treat it as any other, and decide once we have the outcomes. I also can't name any other critical field right now, so let's see if any test breaks and add other "special" keys to this list as necessary.

@vblagoje
Copy link
Member

Hey @anakin87 and @ZanSara, I am always for simplification and refactoring, but I don't think we can remove generate_single_summary feature in a non-major release - even if it was a total mess. There might be some users who are relying on it.

@anakin87
Copy link
Member Author

By introducing Document Merger, I think we would offer the same feature in a different and more structured way.
However, I agree that if someone is using generate_single_summary, we will break down how this feature is currently working...

@vblagoje @ZanSara Before investing time in this PR, I wait for your opinions!

@ZanSara
Copy link
Contributor

ZanSara commented Oct 31, 2022

Hey @anakin87! I'm going to review this PR now. There has been two new test suites introduced a moment ago, they just need another rebase and they'll pass 👍

Copy link
Contributor

@ZanSara ZanSara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great progress! That's precisely how I imagined it.

A couple of things to note;

  • I think I found a heavy simplification of the meta fields merging algorithm and I've suggested it, Test it out if it works as expected or if I forgot something!
  • We need to make sure the documentation for DocumentMerger is generated. To do so, let's add document_merger to this list:
    modules: ['docs2answers', 'join_docs', 'join_answers', 'route_documents']

haystack/nodes/other/document_merger.py Outdated Show resolved Hide resolved
haystack/nodes/other/document_merger.py Outdated Show resolved Hide resolved
haystack/nodes/other/document_merger.py Show resolved Hide resolved
haystack/nodes/other/document_merger.py Outdated Show resolved Hide resolved
haystack/nodes/summarizer/transformers.py Outdated Show resolved Hide resolved
haystack/pipelines/standard_pipelines.py Show resolved Hide resolved
test/nodes/test_document_merger.py Outdated Show resolved Hide resolved
test/nodes/test_summarizer.py Outdated Show resolved Hide resolved
test/pipelines/test_eval_batch.py Outdated Show resolved Hide resolved
test/pipelines/test_eval_batch.py Outdated Show resolved Hide resolved
@anakin87
Copy link
Member Author

anakin87 commented Nov 1, 2022

@ZanSara thanks for the great review!!!

@anakin87
Copy link
Member Author

anakin87 commented Nov 1, 2022

integration-tests-windows (nodes) is failing:

ERROR test/nodes/test_generator.py::test_generator_pipeline[embedding-memory] - RuntimeError: [enforce fail at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\c10\core\impl\alloc_cpu.cpp:81] data. DefaultCPUAllocator: not enough memory: you tried to allocate 4194304 bytes.

Any ideas?

@anakin87 anakin87 requested a review from ZanSara November 1, 2022 11:03
@anakin87 anakin87 marked this pull request as draft November 2, 2022 09:59
@anakin87 anakin87 marked this pull request as ready for review November 2, 2022 09:59
Copy link
Contributor

@ZanSara ZanSara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @anakin87 sorry for this CI issue. We're working on it!

@anakin87
Copy link
Member Author

anakin87 commented Nov 3, 2022

Hi @ZanSara, I see that thanks to your contributions the tests are passing now!

Can you explain to me very quickly what you did?
Maybe next time I will be able to solve the problem on my own...

@ZanSara ZanSara merged commit 1a60e21 into deepset-ai:main Nov 3, 2022
@ZanSara
Copy link
Contributor

ZanSara commented Nov 3, 2022

Sure! This was an OOM (out-of-memory) error on the Windows GH runner. It's not the first time we see them...

What it means is that the machine simply doesn't have enough RAM left for all tests. What I've done was to reduce the memory footprint of the tests in two ways:

Unfortunately the test suites are very heavy and Windows runners tend to collapse rather fast. If that happens again, reducing the model size will probably help, but don't be afraid to ask for help if that's not enough. That's our fault after all! 😄

@anakin87 anakin87 deleted the summarizer_refactoring branch November 3, 2022 15:10
@ZanSara ZanSara requested a review from agnieszka-m November 3, 2022 16:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic:indexing topic:metadata type:documentation Improvements on the docs type:feature New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make Summarizer work in indexing pipeline
4 participants