feat: Updated EntityExtractor to handle long texts and added better postprocessing #3154
Conversation
…ndle documents with lengths longer than model_max_length
…by splitting up the document into smaller samples similar to how the reader works
…ist of docs where each doc can be longer than the model max length. Does not work with batch size larger than 1 right now.
Hey, @vblagoje I would appreciate a review of the new code since your last review. I wrote a description of the new changes under the Update: header in the PR description (also reproduced below).
…p all postprocessing functions under one class
…ge that a lot of the postprocessing parts came from the transformers library.
When running the NER node (and the unit tests), this warning from HF appears:

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

This stems from HF being worried that we call the Fast tokenizer before creating a PyTorch DataLoader. As recommended, we store the tokens in a dict and pass that to the DataLoader.
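For context, a minimal sketch of that pattern, assuming an HF fast tokenizer and a plain PyTorch `DataLoader`; the checkpoint and texts are illustrative, not what the node actually uses:

```python
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

# Illustrative checkpoint; any HF token-classification model would do.
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER", use_fast=True)

texts = ["Berlin is the capital of Germany.", "Angela Merkel visited Paris."]

# Tokenize everything up front so the fast (Rust) tokenizer is not called
# again after any process fork ...
encodings = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# ... and hand the DataLoader a list of plain tensor dicts, one per text.
dataset = [{key: tensor[i] for key, tensor in encodings.items()} for i in range(len(texts))]
loader = DataLoader(dataset, batch_size=2)

for batch in loader:
    print(batch["input_ids"].shape)  # e.g. torch.Size([2, seq_len])
```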
@sjrl Why not add a unit test for a document whose length is larger than the tokenizer's max_seq_len, to capture the driving force behind this PR in the unit tests as well? Not sure if we already have such a doc somewhere in the unit tests, but if not we can provide one in the unit test itself.
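For what it's worth, a sketch of the kind of test being suggested here; the `max_seq_len` argument, the checkpoint, and the exact assertions are assumptions for illustration rather than the PR's actual test code:

```python
from haystack.nodes import EntityExtractor

def test_entity_extraction_on_text_longer_than_max_seq_len():
    # Repeat a sentence so the text is far longer than the model's max length.
    long_text = "Angela Merkel was born in Hamburg. " * 500

    # A small max_seq_len (assumed parameter) forces the node to split the text.
    extractor = EntityExtractor(model_name_or_path="dslim/bert-base-NER", max_seq_len=128)
    entities = extractor.extract(long_text)

    # If splitting works, entities should come from the whole text,
    # not only from the first (truncated) window.
    assert any(entity["word"] == "Hamburg" for entity in entities)
    assert len(entities) > 2
```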
Ahh, I did do this by setting
LGTM
Related Issues
Proposed Changes:
This PR updates a number of aspects of the `EntityExtractor` node in haystack.
- Added the `aggregation_strategy` option (set to `first` as default), which mitigates some of the issues identified in issue #1706 (EntityExtractor can't deal well with out-of-vocabulary words). More explanation is provided in issue #1706.
- No longer uses the `TokenClassificationPipeline` provided by HuggingFace. This resulted in the largest changes because the HF pipeline silently truncated all document texts passed to it. Instead, following the inspiration of the `Reader` node, I added functionality to split long texts and feed each split into the model individually (or in batches). Afterward, I recombine the splits grouped by the document they originally came from (a rough sketch of the idea follows this list).
- Added `flatten_entities_in_meta_data` so the entities can be stored in the metadata in a manner that can be used by the OpenSearch document store when using the `EntityExtractor` in an indexing pipeline.
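A rough sketch of that splitting-and-regrouping idea, using the fast tokenizer's overflow support; the model, window size, and stride are placeholders, and this is not the PR's actual implementation:

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER", use_fast=True)
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

docs = ["First very long document ...", "Second very long document ..."]

# Split each document into overlapping windows instead of silently truncating it.
enc = tokenizer(
    docs,
    truncation=True,
    max_length=128,
    stride=16,
    return_overflowing_tokens=True,
    padding=True,
    return_tensors="pt",
)
# overflow_to_sample_mapping[i] says which original document window i came from.
window_to_doc = enc.pop("overflow_to_sample_mapping")

with torch.no_grad():
    logits = model(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"]).logits

# Regroup the per-window predictions by the document they originally came from.
predictions_per_doc = {doc_id: [] for doc_id in range(len(docs))}
for window_idx, doc_id in enumerate(window_to_doc.tolist()):
    predictions_per_doc[doc_id].append(logits[window_idx])
```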
Update:
- Added `pre_split_text` so the user can optionally split all text by whitespace before passing it to the token classification model. This is common practice for NER pipelines, but is not supported out of the box by HuggingFace. As a result, more functionality was added to handle the post-processing of the model predictions when the text is pre-split into words. Namely, we determine word boundaries using the `self.tokenizer.word_ids` method and we update the character spans of the detected entities to correctly map back to the original (unsplit) text (see the sketch after this list).
- Added `ignore_labels` to allow users to specify which labels they would like to ignore.
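As a rough illustration of the word-boundary bookkeeping mentioned in the `pre_split_text` bullet (assumed helper code, not the PR's implementation): the text is split on whitespace, encoded with `is_split_into_words=True`, and `word_ids()` maps each token back to its word so character spans over the original, unsplit text can be recovered:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER", use_fast=True)

text = "Nirvana was formed in Aberdeen"
words = text.split()  # pre-split on whitespace, as pre_split_text would

# Record where each word starts/ends in the original, unsplit text.
word_spans, cursor = [], 0
for word in words:
    start = text.index(word, cursor)
    word_spans.append((start, start + len(word)))
    cursor = start + len(word)

encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")

# word_ids() maps every token position to the index of the word it came from
# (None for special tokens), which lets token-level predictions be turned
# back into character spans over the original text.
for token_pos, word_id in enumerate(encoding.word_ids(batch_index=0)):
    if word_id is not None:
        start, end = word_spans[word_id]
        print(token_pos, words[word_id], (start, end))
```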
How did you test it?
- Existing unit tests for the `EntityExtractor` node still pass.

Notes for the reviewer
- `EntityExtractor` node in an indexing pipeline
- `pre_split_text` option

Checklist