Change return types of indexing pipeline nodes #2342

bogdankostic · 2022-03-21T21:40:56Z

This PR changes the return types of the file converters, preprocessor and crawler such that they return a Document or a List[Document], respectively. Previously, these nodes returned Dict / List[Dict]. Furthermore, this PR removes the id_hash_keys property from the Document primitive, as it was never set to a value.

This PR also fixes wrong file paths in some of the tutorials and allows fetch_archive_from_http to fetch .gz files.
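
As a rough illustration of what handling a plain .gz download involves, a gzip-compressed single file can be unpacked with the standard library alone. This is only a sketch and not the code in fetch_archive_from_http; the helper name `extract_gz` and the file names are made up for the example.

```python
import gzip
import shutil
from pathlib import Path


def extract_gz(archive_path: Path, output_dir: Path) -> Path:
    """Decompress a single-file .gz archive into output_dir.

    Hypothetical helper for illustration; not part of Haystack.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    # "corpus.json.gz" -> "corpus.json": drop only the trailing .gz suffix
    target = output_dir / archive_path.with_suffix("").name
    with gzip.open(archive_path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

Unlike .zip or .tar.gz archives, a bare .gz holds exactly one compressed stream, so the output name has to be derived from the archive name.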

Breaking Changes

This PR introduces the following breaking changes:

The Crawler's run method will return a List[Document] instead of List[Dict]
The convert method of all Converters in the file_converter directory will return a List[Document] instead of List[Dict]
The PreProcessor's process, clean and split methods will return a List[Document] instead of List[Dict] or a Document instead of a Dict, respectively.
The convert_files_to_dicts and tika_convert_files_to_dicts methods in utils/preprocessing.py are renamed to convert_files_to_docs and tika_convert_files_to_docs, respectively, and will return a List[Document] instead of a List[Dict]
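
For code that consumed the old dictionaries, the practical change is moving from key lookups to attribute access. The sketch below uses a small stand-in `Document` dataclass (an assumption, so the snippet runs without Haystack installed; the real class has more fields) and a hypothetical helper `to_document` to show the difference.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Union


@dataclass
class Document:
    """Simplified stand-in for Haystack's Document (assumption for this sketch)."""

    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def to_document(item: Union[dict, Document]) -> Document:
    """Hypothetical helper: accept the old dict shape or a Document."""
    if isinstance(item, Document):
        return item
    return Document(content=item["content"], meta=item.get("meta", {}))


# Before this PR: converters produced dicts, read with key lookups.
old_style: List[dict] = [{"content": "Hello", "meta": {"name": "a.txt"}}]
names_before = [d["meta"]["name"] for d in old_style]

# After this PR: the same data arrives as Document objects, read via attributes.
new_style: List[Document] = [to_document(d) for d in old_style]
names_after = [doc.meta["name"] for doc in new_style]
```

Downstream code that indexes into results with `doc["content"]` would need the corresponding switch to `doc.content`.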

Closes #1859, closes #1920

julian-risch

Looks good to me so far. Only some very small changes would be nice before merging. content_type="text" is the default when initializing a Document so I'd say we don't explicitly set it (several occurrences in the code).
Other than that, please go through the tutorials and check whether they run with your code changes. For example, tutorial 1 definitely needs some changes because it uses convert_files_to_dicts, which is now convert_files_to_docs. The corresponding comment needs to be changed accordingly, and the variable dicts should be called docs.
If the tutorials run without errors and the tests are passing feel free to merge! 👍

julian-risch · 2022-03-24T16:31:40Z

.github/workflows/linux_ci.yml

@@ -298,8 +298,6 @@ jobs:
        pip install ui/

    - name: Run tests
-      env:
-        PINECONE_API_KEY: ${{ secrets.PINECONE_API_KEY }}


Could you please explain why we have this change here?

I forgot to remove this in the Pinecone PR. We don't need the API key in these tests, because Pinecone is tested in the test-pinecone job rather than in this one. The API key is already set there, in haystack/.github/workflows/linux_ci.yml (line 392 in a398094):

PINECONE_API_KEY: ${{ secrets.PINECONE_API_KEY }}

test/test_utils.py

haystack/utils/preprocessing.py

haystack/nodes/file_converter/image.py

julian-risch · 2022-03-25T10:41:57Z

haystack/nodes/preprocessor/base.py

@@ -10,7 +11,7 @@ class BasePreProcessor(BaseComponent):
    @abstractmethod
    def process(
        self,
-        documents: Union[dict, List[dict]],
+        documents: Union[dict, Document, List[Union[dict, Document]]],


Maybe in the future we can support only a List here instead of single items? It seems unintuitive to me that documents can be a single dict.

I added a DeprecationWarning in the PreProcessor's process method that is triggered if the user does not supply a list.
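
A deprecation path like the one described can be sketched with the standard `warnings` module. The function below is a hypothetical stand-in, not the PreProcessor's actual process method, and the warning text is invented for the example.

```python
import warnings
from typing import Any, List, Union


def process(documents: Union[Any, List[Any]]) -> List[Any]:
    """Hypothetical stand-in for a process method that prefers list input."""
    if not isinstance(documents, list):
        warnings.warn(
            "Passing a single document is deprecated; pass a list instead.",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not this function
        )
        documents = [documents]
    # Real preprocessing (cleaning, splitting) would happen here.
    return documents
```

Wrapping the single item in a list keeps existing callers working while the warning signals that the list form is the one to rely on.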

haystack/nodes/preprocessor/preprocessor.py

haystack/nodes/file_converter/docx.py

haystack/nodes/file_converter/azure.py

julian-risch · 2022-03-25T10:50:08Z

haystack/document_stores/utils.py

@@ -483,25 +482,25 @@ def elasticsearch_index_to_document_store(
        # Get content and metadata of current record
        content = record["_source"].pop(original_content_field, "")
        if content:
-            record_doc = {"content": content, "meta": {}}
+            record_doc = Document(content=content, meta={})


How are ids handled here? Could it be that we want to set them explicitly just as they were before? For example, what if there were two documents with the same content but different ids in the elasticsearch index? Do we silently drop documents with the same content as duplicates because of no id/id_hash_keys being set here explicitly?

I added the possibility to provide id_hash_keys here as well.
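
The concern about duplicates can be made concrete with a small sketch. It assumes ids are derived by hashing the fields named in id_hash_keys and that content alone is hashed by default; `make_id` and the use of SHA-256 are assumptions for illustration, not Haystack's actual id function.

```python
import hashlib
from typing import Any, Dict, List, Optional


def make_id(doc: Dict[str, Any], id_hash_keys: Optional[List[str]] = None) -> str:
    """Hypothetical id function: hash the selected fields of a record."""
    keys = id_hash_keys or ["content"]
    payload = "|".join(f"{key}={doc[key]!r}" for key in keys)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


records = [
    {"content": "same text", "meta": {"source_id": "es-1"}},
    {"content": "same text", "meta": {"source_id": "es-2"}},
]

# Content-only ids collide, so a store that deduplicates by id keeps one record.
content_only = {make_id(r) for r in records}

# Hashing the meta field as well keeps the two records apart.
with_meta = {make_id(r, ["content", "meta"]) for r in records}
```

Under these assumptions, two records with identical content get one id unless a distinguishing field is included in the hash.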

ZanSara

Amazing, this is such a welcome improvement! All my comments are really minor and negligible.

The only thing missing is updating the tutorials I believe. They make use of convert_files_to_dicts in many places, for example Tutorial 1 does:

haystack/tutorials/Tutorial1_Basic_QA_Pipeline.py

Line 61 in bf71f03

    
           dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)

Once this is done I think it's ready for merge.

ZanSara · 2022-03-25T10:26:13Z

haystack/nodes/file_converter/azure.py

-            )
+            file_text = text.content
+            for table in tables:
+                assert isinstance(table.content, pd.DataFrame)


I'd rather raise a proper HaystackError in here with a description of what's wrong

Good point! Agreed. 👍

@ZanSara @julian-risch These assert statements are needed for the mypy checks. The Document's content field can be either of type str or pd.DataFrame. With these assert statements, we tell mypy that we are certain that table.content is of type pd.DataFrame. Otherwise, we would get a type error, because elements of type str don't have a method iterrows.

The alternative would be to use # type: ignore, but AFAIK we try to avoid these as much as possible.

Do you observe the failure in mypy though? I just checked the following snippet:

    from typing import Union

    a: Union[str, int]
    if True:
        a = 1
    else:
        a = "aaa"

    if not isinstance(a, str):
        raise ValueError()

    a.startswith("a")

and mypy found no issues.

Good point, I changed the related line of code.
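
The pattern discussed in this thread, an isinstance check followed by a raise, narrows a Union for the type checker in the same way an assert does. A minimal sketch follows, with a table modelled as a list of rows instead of a pd.DataFrame so that it runs without pandas; `ConversionError` and `join_table_cells` are hypothetical names, not the ones used in the PR.

```python
from typing import List, Union


class ConversionError(Exception):
    """Hypothetical error type standing in for a library-specific exception."""


def join_table_cells(content: Union[str, List[List[str]]]) -> str:
    """Flatten table rows into text; here a table is modelled as a list of rows."""
    if not isinstance(content, list):
        # An explicit raise narrows the type for the checker just as assert
        # would, and it still runs when Python is started with -O.
        raise ConversionError(f"Expected table rows, got {type(content).__name__}")
    return " ".join(cell for row in content for cell in row)
```

After the guard, a checker such as mypy treats `content` as the list type, so no `# type: ignore` is needed on the line that iterates over rows.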

ZanSara · 2022-03-25T10:28:09Z

haystack/nodes/file_converter/parsr.py

-            file_text = text + " ".join([cell for table in tables for row in table["content"] for cell in row])
+            file_text = text
+            for table in tables:
+                assert isinstance(table.content, pd.DataFrame)


Same as above

Yes, the same holds for the other assert statements that were newly introduced. 👍

haystack/nodes/preprocessor/preprocessor.py

test/test_preprocessor.py

review-notebook-app · 2022-03-25T15:57:09Z

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

bogdankostic added 3 commits March 21, 2022 21:21

Change return types of file converters

7a7c1cd

Change return types of preprocessor

a549ff5

Change return types of crawler

638df72

bogdankostic added the topic:file_converter, breaking change, type:refactor, and topic:preprocessing labels Mar 21, 2022

bogdankostic and others added 17 commits March 22, 2022 12:49

Adapt utils to functions to new return types

d15756f

Adapt __init__.py to new method names

fbf69c2

Prevent circular imports

c07b622

Update Documentation & Code Style

fcfc646

Let DocStores' run method accept Documents

dec74a2

Adapt tests to new return types

6edd010

Update Documentation & Code Style

44e8ce3

Put "# type: ignore" to right place

0c8784d

Remove id_hash_keys property from Document primitive

65fc302

Update Documentation & Code Style

dc44b8d

Adapt tests to new return types and missing id_hash_keys property

1b7d066

Merge remote-tracking branch 'origin/change_return_types' into change…

8d9e923

…_return_types

Fix mypy

61bedc6

Fix mypy

b73cb6a

Adapt PDFToTextOCRConverter

0088d36

Remove id_hash_keys from RestAPI tests

3d2454c

Update Documentation & Code Style

b6e2075

bogdankostic requested a review from julian-risch March 23, 2022 11:20

bogdankostic marked this pull request as ready for review March 23, 2022 11:20

julian-risch approved these changes Mar 25, 2022

ZanSara reviewed Mar 25, 2022

bogdankostic added 2 commits March 25, 2022 15:43

Rename tests

eae97f7

Remove redundant setting of content_type="text"

2d681c7

bogdankostic added 4 commits March 25, 2022 16:18

Add DeprecationWarning

2a67a09

Add id_hash_keys to elasticsearch_index_to_document_store

19ea446

Change document type from dict to Docuemnt in PreProcessor test

a3432d0

Fix file path in Tutorial 5

ec8ccf0

bogdankostic and others added 11 commits March 25, 2022 17:00

Remove added output in Tutorial 5

eb0ad8a

Update Documentation & Code Style

1aa48c9

Fix file_paths in Tutorial 9 + fix gz files in fetch_archive_from_http

54783cf

Adapt tutorials to new return types

a028f2a

Merge remote-tracking branch 'origin/master' into change_return_types

b69d391

# Conflicts: # tutorials/Tutorial8_Preprocessing.ipynb # tutorials/Tutorial8_Preprocessing.py

Adapt tutorial 14 to new return types

d2f389d

Merge remote-tracking branch 'origin/change_return_types' into change…

f8b3630

…_return_types

Update Documentation & Code Style

cf065e8

Change assertions to HaystackErrors

8be88b8

Merge remote-tracking branch 'origin/change_return_types' into change…

2ab2644

…_return_types

Import HaystackError correctly

b895772

bogdankostic merged commit 834f8c4 into master Mar 29, 2022

bogdankostic deleted the change_return_types branch March 29, 2022 11:53

bogdankostic mentioned this pull request Mar 30, 2022

Again! cannot import name 'convert_files_to_dicts' from 'haystack.utils' After the recent Update. #2370

Closed

This was referenced Apr 11, 2022

Make PreProcessor support Document objects #2292

Closed

Typing issue on BaseComponent.run(), documents parameter #1579

Closed
