feat: Updated EntityExtractor to handle long texts and added better postprocessing #3154
Conversation
…ndle documents with lengths longer than model_max_length
…by splitting up the document into smaller samples similar to how the reader works
…ist of docs where each doc can be longer than the model max length. Does not work with batch size larger than 1 right now.
Hey, @vblagoje I would appreciate a review of the new code since your last review. I wrote a description of the new changes under the Update: header in the PR description (also reproduced below).
…p all postprocessing functions under one class
…ge that a lot of the postprocessing parts came from the transformers library.
When running the NER node (and the unit tests), this warning from HF appears:

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

This stems from HF being worried that we call the Fast tokenizer before creating a PyTorch DataLoader. As recommended, we store the tokens in a dict and pass that to the DataLoader.
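For context, a minimal sketch of that pattern, assuming an HF fast tokenizer and a plain PyTorch `DataLoader`; the checkpoint and texts are illustrative, not what the node actually uses:

```python
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

# Illustrative checkpoint; any HF token-classification model would do.
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER", use_fast=True)

texts = ["Berlin is the capital of Germany.", "Angela Merkel visited Paris."]

# Tokenize everything up front so the fast (Rust) tokenizer is not called
# again after any process fork ...
encodings = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# ... and hand the DataLoader a list of plain tensor dicts, one per text.
dataset = [{key: tensor[i] for key, tensor in encodings.items()} for i in range(len(texts))]
loader = DataLoader(dataset, batch_size=2)

for batch in loader:
    print(batch["input_ids"].shape)  # e.g. torch.Size([2, seq_len])
```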
@sjrl Why not add a unit test for a document whose length is larger than the tokenizer's max_seq_len, to capture the driving force behind this PR in the unit tests as well? Not sure if we already have such a doc somewhere in the unit tests, but if not we can provide one in the unit test itself.
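For what it's worth, a sketch of the kind of test being suggested here; the `max_seq_len` argument, the checkpoint, and the exact assertions are assumptions for illustration rather than the PR's actual test code:

```python
from haystack.nodes import EntityExtractor

def test_entity_extraction_on_text_longer_than_max_seq_len():
    # Repeat a sentence so the text is far longer than the model's max length.
    long_text = "Angela Merkel was born in Hamburg. " * 500

    # A small max_seq_len (assumed parameter) forces the node to split the text.
    extractor = EntityExtractor(model_name_or_path="dslim/bert-base-NER", max_seq_len=128)
    entities = extractor.extract(long_text)

    # If splitting works, entities should come from the whole text,
    # not only from the first (truncated) window.
    assert any(entity["word"] == "Hamburg" for entity in entities)
    assert len(entities) > 2
```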
Ahh, I did do this by setting
LGTM
Related Issues
Proposed Changes:
This PR updates a number of aspects of the `EntityExtractor` node in haystack.
- Added the `aggregation_strategy` option (set to `first` as default), which mitigates some of the issues identified in issue #1706 (EntityExtractor can't deal well with out-of-vocabulary words). More explanation is provided in issue #1706.
- No longer uses the `TokenClassificationPipeline` provided by HuggingFace. This resulted in the largest changes because the HF pipeline silently truncated all document texts passed to it. Instead, following the inspiration of the `Reader` node, I added functionality to split long texts and feed each split into the model individually (or in batches). Afterward, I recombine the splits grouped by the document they originally came from (a rough sketch of the idea follows this list).
- Added `flatten_entities_in_meta_data` so the entities can be stored in the metadata in a manner that can be used by the OpenSearch document store when using the `EntityExtractor` in an indexing pipeline.
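A rough sketch of that splitting-and-regrouping idea, using the fast tokenizer's overflow support; the model, window size, and stride are placeholders, and this is not the PR's actual implementation:

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER", use_fast=True)
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

docs = ["First very long document ...", "Second very long document ..."]

# Split each document into overlapping windows instead of silently truncating it.
enc = tokenizer(
    docs,
    truncation=True,
    max_length=128,
    stride=16,
    return_overflowing_tokens=True,
    padding=True,
    return_tensors="pt",
)
# overflow_to_sample_mapping[i] says which original document window i came from.
window_to_doc = enc.pop("overflow_to_sample_mapping")

with torch.no_grad():
    logits = model(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"]).logits

# Regroup the per-window predictions by the document they originally came from.
predictions_per_doc = {doc_id: [] for doc_id in range(len(docs))}
for window_idx, doc_id in enumerate(window_to_doc.tolist()):
    predictions_per_doc[doc_id].append(logits[window_idx])
```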
Update:
- Added `pre_split_text` so the user can optionally split all text by whitespace before passing it to the token classification model. This is common practice for NER pipelines, but is not supported out of the box by HuggingFace. As a result, more functionality was added to handle the post-processing of the model predictions when the text is pre-split into words. Namely, we determine word boundaries using the `self.tokenizer.word_ids` method and we update the character spans of the detected entities to correctly map back to the original (unsplit) text (see the sketch after this list).
- Added `ignore_labels` to allow users to specify which labels they would like to ignore.
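As a rough illustration of the word-boundary bookkeeping mentioned in the `pre_split_text` bullet (assumed helper code, not the PR's implementation): the text is split on whitespace, encoded with `is_split_into_words=True`, and `word_ids()` maps each token back to its word so character spans over the original, unsplit text can be recovered:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER", use_fast=True)

text = "Nirvana was formed in Aberdeen"
words = text.split()  # pre-split on whitespace, as pre_split_text would

# Record where each word starts/ends in the original, unsplit text.
word_spans, cursor = [], 0
for word in words:
    start = text.index(word, cursor)
    word_spans.append((start, start + len(word)))
    cursor = start + len(word)

encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")

# word_ids() maps every token position to the index of the word it came from
# (None for special tokens), which lets token-level predictions be turned
# back into character spans over the original text.
for token_pos, word_id in enumerate(encoding.word_ids(batch_index=0)):
    if word_id is not None:
        start, end = word_spans[word_id]
        print(token_pos, words[word_id], (start, end))
```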
How did you test it?
- Existing unit tests for the `EntityExtractor` node still pass.

Notes for the reviewer
- `EntityExtractor` node in an indexing pipeline
- `pre_split_text` option

Checklist