refactor: simplify Summarizer, add Document Merger #3452

anakin87 · 2022-10-21T18:55:57Z

Related Issues

fixes Make Summarizer work in indexing pipeline #3403

Proposed Changes:

As discussed in #3403

make the summarizer suitable for indexing pipelines
write the summarization results in meta instead of altering document content (current behavior)
remove generate_single_summary parameter: currently it transforms several documents into one
(it is better to design a dedicated node for the purpose of merging documents)
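The second point above can be sketched in plain Python. This is only an illustration of the idea, not the Haystack implementation: the `Doc` dataclass, the `add_summaries` helper, the `first_sentence` stand-in summarizer, and the `"summary"` meta key are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a document: text content plus a meta dict."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def add_summaries(docs: List[Doc], summarize: Callable[[str], str]) -> List[Doc]:
    """Store each summary under meta["summary"] (assumed key); content is left untouched."""
    for doc in docs:
        doc.meta["summary"] = summarize(doc.content)
    return docs


def first_sentence(text: str) -> str:
    """Toy summarizer used only for this sketch: keep the first sentence."""
    return text.split(".")[0].strip() + "."


docs = add_summaries([Doc("Haystack is a framework. It has nodes.")], first_sentence)
print(docs[0].content)          # original text, unchanged
print(docs[0].meta["summary"])  # summary stored alongside it
```

Because the content is preserved, a node like this could sit in an indexing pipeline and the documents written to the store would still carry their full text.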

How did you test it?

Adapted some tests
Other tests to be added?

Notes for the reviewer

Just a first draft to understand what this change breaks

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added tests that demonstrate the correct behavior of the change
I've used the conventional commit convention for my PR title
I documented my code
I ran pre-commit hooks and fixed any issue

test/nodes/test_extractor_translation.py

test/nodes/test_summarizer.py

test/pipelines/test_eval.py

anakin87 · 2022-10-23T10:00:13Z

As expected, this refactoring was breaking a few things, which I tried to fix. 😄
I left some comments on the tests that required the most onerous changes.

@ZanSara @brandenchan @bogdankostic please feel free to jump in and help steer this in the right direction...

ZanSara

Looking great so far! I don't think you need more tests for this node right now, given we're only removing functionality. I found a couple of leftover prints and have only one question.

In addition, let's add the DocumentMerger in this same PR. This way we can re-introduce some of the removed tests with the new setup (docs merger + summarizer) and make sure no functionality is lost.

test/pipelines/test_eval_batch.py

haystack/nodes/summarizer/transformers.py

anakin87 · 2022-10-24T10:55:59Z

@ZanSara thanks! When I have some time, I'll make the changes you suggested...

Just some questions:
How do you imagine the Document Merger?
How do we deal with the metadata of aggregated documents? Do we remove them?

ZanSara · 2022-10-24T13:26:53Z

How do you imagine the Document Merger?

So, the DocumentMerger should be super simple: something that just merges the content of the documents so that, given a list of N, it returns a list of 1 (that's because Pipeline.run() always expects documents to be a List, and Pipeline.run_batch() always expects documents to be a List of Lists).
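A minimal sketch of that contract, assuming a simplified `Doc` type and hypothetical function names (`merge_documents`, `merge_batch`) rather than the real node interface; the `separator` parameter is also an assumption made for illustration:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a document: text content plus a meta dict."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def merge_documents(docs: List[Doc], separator: str = " ") -> List[Doc]:
    """Given a list of N documents, return a list holding one merged document."""
    merged_content = separator.join(doc.content for doc in docs)
    # Meta handling is deliberately left out here; the merged doc starts with empty meta.
    return [Doc(content=merged_content)]


def merge_batch(batches: List[List[Doc]], separator: str = " ") -> List[List[Doc]]:
    """Batch variant: a list of lists in, a list of one-element lists out."""
    return [merge_documents(batch, separator) for batch in batches]


single = merge_documents([Doc("first"), Doc("second"), Doc("third")])
batched = merge_batch([[Doc("a"), Doc("b")], [Doc("c")]])
```

Returning a one-element list rather than a bare document keeps the output shape compatible with whatever node comes next, which is the point made about `run()` and `run_batch()`.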

We might want to deal with the possibility of receiving input from multiple nodes. That's a messy topic though, so if you see that it becomes intractable and makes the node complex, ignore that. If anyone needs to handle multiple inputs, they can use JoinDocuments on top of the Merger for now.

How do we deal with the metadata of aggregated documents? Do we remove them?

Now that's a great question 😄 My intuition says that we should apply the following heuristic:

If a key is present with the same value in all documents to be merged, add it to the merged doc too.
Any other key is removed.

Mind that keys might contain nested values. I'd apply this heuristic recursively on dictionary entries.
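The heuristic described above could look roughly like the following. This is a sketch under assumptions, not the code that was merged: `common_meta` is a hypothetical name, and dropping nested dictionaries that end up empty is a choice made here for brevity.

```python
from typing import Any, Dict, List


def common_meta(metas: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Keep only the keys whose values are identical in every meta dict.

    Nested dictionaries are compared recursively, so a nested key survives
    only if it is present with the same value in all inputs.
    """
    if not metas:
        return {}
    first, rest = metas[0], metas[1:]
    merged: Dict[str, Any] = {}
    for key, value in first.items():
        # A key missing from any document is dropped.
        if not all(key in meta for meta in rest):
            continue
        others = [meta[key] for meta in rest]
        if isinstance(value, dict) and all(isinstance(o, dict) for o in others):
            nested = common_meta([value, *others])
            if nested:  # assumption: nested dicts that end up empty are dropped
                merged[key] = nested
        elif all(o == value for o in others):
            merged[key] = value
    return merged


metas = [
    {"name": "a.txt", "lang": "en", "source": {"url": "x", "page": 1}},
    {"name": "b.txt", "lang": "en", "source": {"url": "x", "page": 2}},
]
result = common_meta(metas)
```

In this example `name` differs between the two documents and is removed, which is exactly the conflict case for "important" keys that still needs a decision.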

Let's also take care to keep "important" keys like name, which is often accessed. I don't know how we could deal with a conflict there, so let's see which tests (if any) break when we just treat it like any other key, and decide once we have the outcomes. I also can't name any other critical field right now, so let's see if any test breaks and add other "special" keys to this list as necessary.

vblagoje · 2022-10-27T11:39:06Z

Hey @anakin87 and @ZanSara, I am always for simplification and refactoring, but I don't think we can remove the generate_single_summary feature in a non-major release, even if it was a total mess. There might be some users who are relying on it.

anakin87 · 2022-10-27T14:09:24Z

By introducing the Document Merger, I think we would offer the same feature in a different and more structured way.
However, I agree that if someone is using generate_single_summary, we will break the way this feature currently works...

@vblagoje @ZanSara Before investing more time in this PR, I'll wait for your opinions!

ZanSara · 2022-10-31T16:23:11Z

Hey @anakin87! I'm going to review this PR now. Two new test suites were introduced a moment ago; they just need another rebase and they'll pass 👍

ZanSara

Great progress! That's precisely how I imagined it.

A couple of things to note:

I think I found a heavy simplification of the meta fields merging algorithm and I've suggested it. Test it out to see whether it works as expected or whether I forgot something!
We need to make sure the documentation for DocumentMerger is generated. To do so, let's add document_merger to this list:

haystack/docs/_src/api/pydoc/other.yml

Line 4 in 8ddeda8

modules: ['docs2answers', 'join_docs', 'join_answers', 'route_documents']

haystack/nodes/other/document_merger.py

haystack/nodes/summarizer/transformers.py

haystack/pipelines/standard_pipelines.py

test/nodes/test_document_merger.py

test/nodes/test_summarizer.py

test/pipelines/test_eval_batch.py

anakin87 · 2022-11-01T10:06:25Z

@ZanSara thanks for the great review!!!

anakin87 · 2022-11-01T11:03:12Z

integration-tests-windows (nodes) is failing:

ERROR test/nodes/test_generator.py::test_generator_pipeline[embedding-memory] - RuntimeError: [enforce fail at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\c10\core\impl\alloc_cpu.cpp:81] data. DefaultCPUAllocator: not enough memory: you tried to allocate 4194304 bytes.

Any ideas?

ZanSara

Hey @anakin87 sorry for this CI issue. We're working on it!

test/nodes/test_extractor_translation.py

test/nodes/test_summarizer.py

anakin87 · 2022-11-03T14:44:51Z

Hi @ZanSara, I see that thanks to your contributions the tests are passing now!

Can you explain to me very quickly what you did?
Maybe next time I will be able to solve the problem on my own...

ZanSara · 2022-11-03T15:09:42Z

Sure! This was an OOM (out-of-memory) error on the Windows GH runner. It's not the first time we've seen them...

What it means is that the machine simply doesn't have enough RAM left for all tests. What I did was reduce the memory footprint of the tests in two ways:

By removing one test that seemed slightly redundant: refactor: simplify Summarizer, add Document Merger #3452 (comment)
By using a much much smaller model for Summarizer in a test that was not designed to produce output: refactor: simplify Summarizer, add Document Merger #3452 (comment)

Unfortunately the test suites are very heavy and Windows runners tend to collapse rather fast. If that happens again, reducing the model size will probably help, but don't be afraid to ask for help if that's not enough. That's our fault after all! 😄

anakin87 added 10 commits October 21, 2022 20:47

remove generate_single_summary

556bfc7

update schemas

8197f48

Merge remote-tracking branch 'upstream/main' into summarizer_refactoring

2ff7c52

remove unused import

9e850c2

fix mypy

b69966c

fix mypy

2fcf55d

test: summarizer doesnt change content

39c7a84

other test correction

5614161

move test_summarizer_translation to test_extractor_translation

5301120

fix test

4e0f142

anakin87 commented Oct 23, 2022

test/nodes/test_extractor_translation.py

anakin87 commented Oct 23, 2022

test/nodes/test_summarizer.py Outdated

anakin87 commented Oct 23, 2022

test/nodes/test_summarizer.py Outdated

anakin87 commented Oct 23, 2022

test/nodes/test_summarizer.py Outdated

anakin87 commented Oct 23, 2022

test/pipelines/test_eval.py

anakin87 marked this pull request as ready for review October 23, 2022 10:01

anakin87 requested a review from a team as a code owner October 23, 2022 10:01

anakin87 requested review from mayankjobanputra and removed request for a team October 23, 2022 10:01

ZanSara reviewed Oct 24, 2022

ZanSara added type:feature New feature or request topic:metadata labels Oct 24, 2022

ZanSara removed the request for review from mayankjobanputra October 24, 2022 10:39

ZanSara added the topic:indexing label Oct 24, 2022

Merge branch 'main' into summarizer_refactoring

5a8bfc5

ZanSara suggested changes Oct 31, 2022

anakin87 added 3 commits November 1, 2022 10:36

Merge branch 'main' into summarizer_refactoring

c772803

adapt to review

9558c9e

merge main

7a66619

extended deprecation docstring

0cc9c5a

anakin87 requested a review from ZanSara November 1, 2022 11:03

anakin87 marked this pull request as draft November 2, 2022 09:59

anakin87 marked this pull request as ready for review November 2, 2022 09:59

ZanSara approved these changes Nov 2, 2022

Merge branch 'main' into summarizer_refactoring

ca79476

ZanSara reviewed Nov 3, 2022

test/nodes/test_extractor_translation.py Outdated

Update test/nodes/test_extractor_translation.py

c9fa988

ZanSara reviewed Nov 3, 2022

test/nodes/test_summarizer.py Outdated

Update test/nodes/test_summarizer.py

435c81e

ZanSara reviewed Nov 3, 2022

test/nodes/test_summarizer.py Outdated

ZanSara added 2 commits November 3, 2022 14:28

Update test/nodes/test_summarizer.py

78f2137

black

4ccf07d

ZanSara reviewed Nov 3, 2022

test/nodes/test_summarizer.py Outdated

documents fixture

ab8e6ba

ZanSara merged commit 1a60e21 into deepset-ai:main Nov 3, 2022

anakin87 deleted the summarizer_refactoring branch November 3, 2022 15:10

ZanSara added action:needs documentation type:documentation Improvements on the docs labels Nov 3, 2022

ZanSara requested a review from agnieszka-m November 3, 2022 16:02

anakin87 mentioned this pull request Dec 21, 2022

refactor: remove deprecated parameters from Summarizer #3740

Merged

6 tasks
