feat: Updated EntityExtractor to handle long texts and added better postprocessing #3154

Merged
91 commits merged on Oct 17, 2022

Changes from all commits

Commits (91)
64e20b2
Initial commit to add training to EntityExtractor
sjrl Aug 25, 2022
8818b80
Merge branch 'main' of github.com:deepset-ai/haystack into issue-2969
sjrl Sep 2, 2022
4b88cde
Incorporated training script from HF
sjrl Sep 2, 2022
4533440
Some refactoring
sjrl Sep 2, 2022
0295c38
Refactored data related tasks into NERDataProcessor
sjrl Sep 2, 2022
6f111e8
Update default model to one that has better metrics
sjrl Sep 5, 2022
a49f43c
Refactoring and update to docs
sjrl Sep 5, 2022
eb59f8b
Adding more docs and types
sjrl Sep 5, 2022
fd8e9c0
Docs and bug fix for device
sjrl Sep 5, 2022
88d7d69
Remove str call
sjrl Sep 5, 2022
6655cfc
Move the device length check
sjrl Sep 5, 2022
1ff4feb
Merge branch 'main' of github.com:deepset-ai/haystack into issue-2969
sjrl Sep 5, 2022
9f2a058
Merge branch 'main' of github.com:deepset-ai/haystack into issue-2969
sjrl Sep 7, 2022
bffbafa
Making progress to use overflow_to_sample_mapping such that we can ha…
sjrl Sep 7, 2022
16df1e6
Added TODO
sjrl Sep 7, 2022
726506e
Updated postprocessing step
sjrl Sep 8, 2022
86e116b
Updating docs
sjrl Sep 8, 2022
ef8c31e
Merge branch 'main' of github.com:deepset-ai/haystack into issue-2969
sjrl Sep 9, 2022
8732167
New extract method can handle texts longer than the max_model_length …
sjrl Sep 9, 2022
be52ebd
API docs
sjrl Sep 9, 2022
87c8217
Add changes to make run_batch work with dicts
sjrl Sep 9, 2022
5e9fdb9
Update to docs
sjrl Sep 9, 2022
6a2ae46
Adding more types
sjrl Sep 9, 2022
775f37c
Works for batch size 1!! Specifically the method extract works on a l…
sjrl Sep 9, 2022
6cda632
It now works with batch sizes greater than 1!!
sjrl Sep 9, 2022
a068483
Merge branch 'main' of github.com:deepset-ai/haystack into issue-2969
sjrl Sep 12, 2022
d0c3d82
Update to API docs and getting tests to pass.
sjrl Sep 12, 2022
3e35c62
Adding NER model to cached models
sjrl Sep 12, 2022
4edd669
Mypy fixes
sjrl Sep 12, 2022
0a1bb57
API docs
sjrl Sep 12, 2022
3e63aaa
More mypy
sjrl Sep 12, 2022
bb000c4
Remove training parts in this branch
sjrl Sep 12, 2022
791e9af
Pylint
sjrl Sep 12, 2022
f7ffff8
Json schema
sjrl Sep 12, 2022
2b32256
Some cleanup
sjrl Sep 12, 2022
533c6ea
Added some unit tests and checked the new extract_batch produces the …
sjrl Sep 12, 2022
4382e1c
Update to api docs
sjrl Sep 12, 2022
17db8f0
Merge branch 'main' of github.com:deepset-ai/haystack into issue-2969
sjrl Sep 12, 2022
a507cb9
Update to schema
sjrl Sep 12, 2022
b3130c6
Pylint
sjrl Sep 12, 2022
0b2021d
Moved function to torch_utils
sjrl Sep 12, 2022
3db5415
Cleanup and adding some docs
sjrl Sep 13, 2022
48e71f0
Merge branch 'main' of github.com:deepset-ai/haystack into issue-2969
sjrl Sep 13, 2022
a3b7a19
Added new option flatten_entities_in_meta_data
sjrl Sep 13, 2022
8f7a582
Update to api docs
sjrl Sep 13, 2022
db268ba
Added test for EntityExtractor in indexing pipeline, fixed bug and up…
sjrl Sep 13, 2022
ea44e55
Mypy
sjrl Sep 13, 2022
41f874e
mypy
sjrl Sep 13, 2022
50357e6
mypy
sjrl Sep 13, 2022
8b157ff
mypy
sjrl Sep 13, 2022
28762e5
Merge branch 'main' of github.com:deepset-ai/haystack into issue-2969
sjrl Sep 13, 2022
5d51028
convert float32 to float in meta
ju-gu Sep 15, 2022
630797b
add_max_seq param
ju-gu Sep 20, 2022
95f0ea6
Changed import type
sjrl Sep 26, 2022
a96bb74
Added doc string
sjrl Sep 26, 2022
cd74fd0
Merge branch 'main' of github.com:deepset-ai/haystack into issue-2969
sjrl Sep 26, 2022
3ecdc46
Updating schema
sjrl Sep 26, 2022
e946bd4
Merge branch 'main' of github.com:deepset-ai/haystack into issue-2969
sjrl Oct 4, 2022
2b3e66a
Started to add pre_split_text option
sjrl Oct 4, 2022
b55f10c
Updated test, added words_offset_mapping, split the postprocess into …
sjrl Oct 5, 2022
ff02a53
Implemented new version of _gather_pre_entities to use word_ids to de…
sjrl Oct 5, 2022
3bb9df9
Use same postprocess in both cases and added docs
sjrl Oct 5, 2022
aa89205
Some cleanup and adding docs
sjrl Oct 5, 2022
6ba9770
Started to add function _update_offset_mapping to handle updating the…
sjrl Oct 5, 2022
a3fe816
Started adding function update offset mapping
sjrl Oct 6, 2022
7b1a962
Changed mind to update character spans of grouped entities. Will need…
sjrl Oct 6, 2022
bbd2552
Fixed bug in gather_pre_entities. Added character span update after g…
sjrl Oct 6, 2022
38d154a
Docstrings and cleanup
sjrl Oct 6, 2022
aa210d3
Docs and Literals
sjrl Oct 6, 2022
25e8f71
Merge branch 'main' of github.com:deepset-ai/haystack into issue-2969
sjrl Oct 6, 2022
56b7939
More cleanup
sjrl Oct 6, 2022
b84ea3e
Fix mypy
sjrl Oct 6, 2022
e466906
Update schema
sjrl Oct 6, 2022
98db9d5
Added test for the pre_split_text option
sjrl Oct 7, 2022
b5ea424
Fixed how the original text is passed around so it can be used to gra…
sjrl Oct 7, 2022
4462505
Update schema
sjrl Oct 7, 2022
ee8826b
Remove dependence on HuggingFace TokenClassificationPipeline and grou…
sjrl Oct 7, 2022
677d89f
Update to json schema
sjrl Oct 7, 2022
29ef38a
Mypy
sjrl Oct 7, 2022
b6f09a8
Added copyright notice for HF and deepset to entity file to acknowled…
sjrl Oct 10, 2022
827de13
Adding docstrings
sjrl Oct 10, 2022
d8ad3e1
Fixed text squishing problem. Added additional unit test for it.
sjrl Oct 10, 2022
29c7a41
Merge branch 'main' of github.com:deepset-ai/haystack into issue-2969
sjrl Oct 17, 2022
868d7f9
PR comments
sjrl Oct 17, 2022
998e039
PR comments
sjrl Oct 17, 2022
684dc0b
PR comments
sjrl Oct 17, 2022
e6b7521
Pylint
sjrl Oct 17, 2022
50b2529
Pylint
sjrl Oct 17, 2022
3ebc570
Merge branch 'main' of github.com:deepset-ai/haystack into issue-2969
sjrl Oct 17, 2022
3bd72f0
json schema
sjrl Oct 17, 2022
d711a9e
Merge branch 'main' of github.com:deepset-ai/haystack into issue-2969
sjrl Oct 17, 2022
2 changes: 1 addition & 1 deletion .github/workflows/tests.yml
@@ -687,7 +687,7 @@ jobs:
- name: Download models
if: steps.cache-hf-models.outputs.cache-hit != 'true'
run: |
python -c "from transformers import AutoModel;[AutoModel.from_pretrained(model_name) for model_name in ['vblagoje/bart_lfqa','yjernite/bart_eli5', 'google/pegasus-xsum', 'vblagoje/dpr-ctx_encoder-single-lfqa-wiki', 'vblagoje/dpr-question_encoder-single-lfqa-wiki', 'facebook/dpr-question_encoder-single-nq-base', 'facebook/dpr-ctx_encoder-single-nq-base']]"
python -c "from transformers import AutoModel;[AutoModel.from_pretrained(model_name) for model_name in ['vblagoje/bart_lfqa','yjernite/bart_eli5', 'google/pegasus-xsum', 'vblagoje/dpr-ctx_encoder-single-lfqa-wiki', 'vblagoje/dpr-question_encoder-single-lfqa-wiki', 'facebook/dpr-question_encoder-single-nq-base', 'facebook/dpr-ctx_encoder-single-nq-base', 'elastic/distilbert-base-cased-finetuned-conll03-english']]"


- name: Run Elasticsearch
111 changes: 107 additions & 4 deletions docs/_src/api/api/extractor.md
@@ -23,8 +23,8 @@ The entities extracted by this Node will populate Document.entities
- `model_name_or_path`: The name of the model to use for entity extraction.
- `model_version`: The version of the model to use for entity extraction.
- `use_gpu`: Whether to use the GPU or not.
- `batch_size`: The batch size to use for entity extraction.
- `progress_bar`: Whether to show a progress bar or not.
- `batch_size`: The batch size to use for entity extraction.
- `use_auth_token`: The API token used to download private models from Huggingface.
If this parameter is set to `True`, then the token generated when running
`transformers-cli login` (stored in ~/.huggingface) will be used.
@@ -34,6 +34,30 @@ https://huggingface.co/transformers/main_classes/model.html#transformers.PreTrai
A list containing torch device objects and/or strings is supported (For example
[torch.device('cuda:0'), "mps", "cuda:1"]). When specifying `use_gpu=False` the devices
parameter is not used and a single cpu device is used for inference.
- `aggregation_strategy`: The strategy used to fuse (or not) tokens based on the model predictions.
"none": Does not aggregate anything and simply returns the raw per-token results from the model.
"simple": Attempts to group entities following the default schema:
(A, B-TAG), (B, I-TAG), (C, I-TAG), (D, B-TAG2), (E, B-TAG2) ends up as
[{"word": "ABC", "entity": "TAG"}, {"word": "D", "entity": "TAG2"}, {"word": "E", "entity": "TAG2"}].
Notice that two consecutive B tags end up as different entities.
In word-based languages, this may split words undesirably: imagine "Microsoft" being tagged
as [{"word": "Micro", "entity": "ENTERPRISE"}, {"word": "soft", "entity": "NAME"}].
See the "first", "max", and "average" options for ways to mitigate this and disambiguate words
(in languages whose words are delimited by spaces).
These mitigations only work on real words; "New york" might still be tagged with two different entities.
"first": (works only on word-based models) Uses the "simple" strategy, except that words cannot end up with
different tags. When there is ambiguity, a word simply takes the tag of its first token.
"average": (works only on word-based models) Uses the "simple" strategy, except that words cannot end up with
different tags. Scores are averaged across tokens, and the label with the maximum score is chosen.
"max": (works only on word-based models) Uses the "simple" strategy, except that words cannot end up with
different tags. The word entity is simply the token with the maximum score.
- `add_prefix_space`: Set this to `True` if you do not want the first word to be treated differently. This is
relevant for model types such as "bloom", "gpt2", and "roberta".
Explained in more detail here:
https://huggingface.co/docs/transformers/model_doc/roberta#transformers.RobertaTokenizer
- `num_workers`: Number of workers to be used in the PyTorch DataLoader.
- `flatten_entities_in_meta_data`: If True, all entities predicted for a document are converted from a list of
dictionaries into a single list for each key in the dictionary.
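
For orientation, here is a minimal constructor sketch using the options documented above. The parameter values are illustrative, not prescribed defaults:

```python
from haystack.nodes import EntityExtractor

# A minimal sketch; parameter names follow the list above, values are only examples.
entity_extractor = EntityExtractor(
    model_name_or_path="elastic/distilbert-base-cased-finetuned-conll03-english",
    aggregation_strategy="first",          # fuse sub-word tokens into whole-word entities
    batch_size=16,
    num_workers=0,
    flatten_entities_in_meta_data=False,   # set to True to store entities as flat lists in Document.meta
)
```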

<a id="entity.EntityExtractor.run"></a>

@@ -47,26 +71,82 @@ def run(

This is the method called when this node is used in a pipeline.

<a id="entity.EntityExtractor.preprocess"></a>

#### EntityExtractor.preprocess

```python
def preprocess(sentence: Union[str, List[str]],
offset_mapping: Optional[torch.Tensor] = None)
```

Preprocessing step to tokenize the provided text.

**Arguments**:

- `sentence`: Text to tokenize. This works with a list of texts or a single text.
- `offset_mapping`: Only needed if a slow tokenizer is used. Will be used in the postprocessing step to
determine the original character positions of the detected entities.

<a id="entity.EntityExtractor.forward"></a>

#### EntityExtractor.forward

```python
def forward(model_inputs: Dict[str, Any]) -> Dict[str, Any]
```

Forward step.

**Arguments**:

- `model_inputs`: Dictionary of inputs to be given to the model.

<a id="entity.EntityExtractor.postprocess"></a>

#### EntityExtractor.postprocess

```python
def postprocess(model_outputs: Dict[str, Any]) -> List[List[Dict]]
```

Aggregate each of the items in `model_outputs` based on which text document they originally came from.

Then we pass the grouped `model_outputs` to `self.extractor_pipeline.postprocess` to take advantage of the
advanced postprocessing features available in the HuggingFace TokenClassificationPipeline object.

**Arguments**:

- `model_outputs`: Dictionary of model outputs.
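
As a rough sketch of the grouping idea (not the actual Haystack implementation), chunks produced by a sliding-window tokenizer can be collected back under their source text via an `overflow_to_sample_mapping`-style index:

```python
from collections import defaultdict
from typing import Any, Dict, List

def group_chunks_by_document(
    chunk_outputs: List[Dict[str, Any]], overflow_to_sample_mapping: List[int]
) -> Dict[int, List[Dict[str, Any]]]:
    """Collect per-chunk outputs under the index of the text they were split from (illustrative helper)."""
    grouped: Dict[int, List[Dict[str, Any]]] = defaultdict(list)
    for chunk_output, sample_idx in zip(chunk_outputs, overflow_to_sample_mapping):
        grouped[sample_idx].append(chunk_output)
    return dict(grouped)
```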

<a id="entity.EntityExtractor.extract"></a>

#### EntityExtractor.extract

```python
def extract(text)
def extract(text: Union[str, List[str]], batch_size: int = 1)
```

This function can be called to perform entity extraction when using the node in isolation.

**Arguments**:

- `text`: Text to extract entities from. Can be a str or a List of str.
- `batch_size`: Number of texts to make predictions on at a time.
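
A hedged usage sketch, assuming the node was constructed as in the earlier example; the exact keys of each entity dictionary depend on the model and aggregation strategy:

```python
# The text can be longer than the model's max sequence length; it is split and re-aggregated internally.
long_text = "Angela Merkel visited the deepset office in Berlin. " * 200
entities = entity_extractor.extract(long_text, batch_size=8)
print(entities[0])  # e.g. {"entity_group": "PER", "word": "Angela Merkel", "start": 0, "end": 13, ...}
```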

<a id="entity.EntityExtractor.extract_batch"></a>

#### EntityExtractor.extract\_batch

```python
def extract_batch(texts: Union[List[str], List[List[str]]],
batch_size: Optional[int] = None)
batch_size: int = 1) -> List[List[Dict]]
```

This function allows to extract entities out of a list of strings or a list of lists of strings.
This function allows the extraction of entities out of a list of strings or a list of lists of strings.

The only difference between this function and `self.extract` is that it has additional logic to handle a
list of lists of strings.

**Arguments**:

@@ -82,6 +162,7 @@ def simplify_ner_for_qa(output)
```

Returns a simplified version of the output dictionary

with the following structure:
[
{
@@ -92,3 +173,25 @@
The entities included are only the ones that overlap with
the answer itself.

**Arguments**:

- `output`: Output from a query pipeline.
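
A hedged sketch of how this helper might be called on the result of a query pipeline; the pipeline object and its output keys below are assumptions for illustration:

```python
from haystack.nodes import simplify_ner_for_qa  # import path assumed

# `query_pipeline` is any extractive QA pipeline whose documents were indexed with EntityExtractor.
output = query_pipeline.run(query="Who is the CEO of deepset?")
simplified = simplify_ner_for_qa(output)
# e.g. [{"answer": "...", "entities": ["..."]}]
```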

<a id="entity.TokenClassificationDataset"></a>

## TokenClassificationDataset

```python
class TokenClassificationDataset(Dataset)
```

Token Classification Dataset

This is a wrapper class to create a PyTorch dataset object from the data attribute of a
`transformers.tokenization_utils_base.BatchEncoding` object.

**Arguments**:

- `model_inputs`: The data attribute of the output from a HuggingFace tokenizer which is needed to evaluate the
forward pass of a token classification model.
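
A minimal sketch of the intended wiring, assuming `TokenClassificationDataset` can be constructed directly from the `.data` attribute of the tokenizer output as described above:

```python
from torch.utils.data import DataLoader
from transformers import AutoTokenizer
from haystack.nodes.extractor.entity import TokenClassificationDataset  # import path assumed

# Illustrative only: wrap tokenizer output so a DataLoader can batch it for the model's forward pass.
tokenizer = AutoTokenizer.from_pretrained("elastic/distilbert-base-cased-finetuned-conll03-english")
encodings = tokenizer(["Berlin is in Germany."], return_tensors="pt", truncation=True)
dataset = TokenClassificationDataset(encodings.data)  # pass the .data attribute of the BatchEncoding
loader = DataLoader(dataset, batch_size=16)
```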

57 changes: 56 additions & 1 deletion haystack/json-schemas/haystack-pipeline-1.10.0rc0.schema.json
@@ -1346,6 +1346,28 @@
"additionalProperties": false,
"description": "Each parameter can reference other components defined in the same YAML file.",
"properties": {
"add_prefix_space": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"title": "Add Prefix Space"
},
"aggregation_strategy": {
"default": "first",
"enum": [
"simple",
"first",
"average",
"max"
],
"title": "Aggregation Strategy",
"type": "string"
},
"batch_size": {
"default": 16,
"title": "Batch Size",
@@ -1372,8 +1394,31 @@
],
"title": "Devices"
},
"flatten_entities_in_meta_data": {
"default": false,
"title": "Flatten Entities In Meta Data",
"type": "boolean"
},
"ignore_labels": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"title": "Ignore Labels"
},
"max_seq_len": {
"title": "Max Seq Len",
"type": "integer"
},
"model_name_or_path": {
"default": "dslim/bert-base-NER",
"default": "elastic/distilbert-base-cased-finetuned-conll03-english",
"title": "Model Name Or Path",
"type": "string"
},
@@ -1388,6 +1433,16 @@
],
"title": "Model Version"
},
"num_workers": {
"default": 0,
"title": "Num Workers",
"type": "integer"
},
"pre_split_text": {
"default": false,
"title": "Pre Split Text",
"type": "boolean"
},
"progress_bar": {
"default": true,
"title": "Progress Bar",
57 changes: 56 additions & 1 deletion haystack/json-schemas/haystack-pipeline-main.schema.json
@@ -1346,6 +1346,28 @@
"additionalProperties": false,
"description": "Each parameter can reference other components defined in the same YAML file.",
"properties": {
"add_prefix_space": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"title": "Add Prefix Space"
},
"aggregation_strategy": {
"default": "first",
"enum": [
"simple",
"first",
"average",
"max"
],
"title": "Aggregation Strategy",
"type": "string"
},
"batch_size": {
"default": 16,
"title": "Batch Size",
@@ -1372,8 +1394,31 @@
],
"title": "Devices"
},
"flatten_entities_in_meta_data": {
"default": false,
"title": "Flatten Entities In Meta Data",
"type": "boolean"
},
"ignore_labels": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"title": "Ignore Labels"
},
"max_seq_len": {
"title": "Max Seq Len",
"type": "integer"
},
"model_name_or_path": {
"default": "dslim/bert-base-NER",
"default": "elastic/distilbert-base-cased-finetuned-conll03-english",
"title": "Model Name Or Path",
"type": "string"
},
@@ -1388,6 +1433,16 @@
],
"title": "Model Version"
},
"num_workers": {
"default": 0,
"title": "Num Workers",
"type": "integer"
},
"pre_split_text": {
"default": false,
"title": "Pre Split Text",
"type": "boolean"
},
"progress_bar": {
"default": true,
"title": "Progress Bar",