-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change return types of indexing pipeline nodes #2342
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me so far. Only some very small changes would be nice before merging. content_type="text"
is the default when initializing a Document so I'd say we don't explicitly set it (several occurrences in the code).
Other than that, please go through the tutorials and check whether they are running with your code changes. For example, tutorial 1 definitely needs some changes because convert_files_to_dicts
is used, which is now convert_files_to_docs
, the corresponding comment needs to be changed accordingly and the variable dicts
should be called docs
.
If the tutorials run without errors and the tests are passing feel free to merge! 👍
@@ -298,8 +298,6 @@ jobs: | |||
pip install ui/ | |||
|
|||
- name: Run tests | |||
env: | |||
PINECONE_API_KEY: ${{ secrets.PINECONE_API_KEY }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please explain why we have this change here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forgot to remove this in the Pinecone PR. We don't need the API key here in these tests, as we don't test pinecone inside this job but inside the test-pinecone
job. (The API Key is already used there:
haystack/.github/workflows/linux_ci.yml
Line 392 in a398094
PINECONE_API_KEY: ${{ secrets.PINECONE_API_KEY }} |
@@ -10,7 +11,7 @@ class BasePreProcessor(BaseComponent): | |||
@abstractmethod | |||
def process( | |||
self, | |||
documents: Union[dict, List[dict]], | |||
documents: Union[dict, Document, List[Union[dict, Document]]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can in future only support List instead of single items here? Seems unintuitive to me that documents
can be a single dict
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a DeprecationWarning in the PreProcessor
's process
method that is triggered if the user does not supply a list.
haystack/document_stores/utils.py
Outdated
@@ -483,25 +482,25 @@ def elasticsearch_index_to_document_store( | |||
# Get content and metadata of current record | |||
content = record["_source"].pop(original_content_field, "") | |||
if content: | |||
record_doc = {"content": content, "meta": {}} | |||
record_doc = Document(content=content, meta={}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are ids handled here? Could it be that we want to set them explicitly just as they were before? For example, what if there were two documents with the same content but different ids in the elasticsearch index? Do we silently drop documents with the same content as duplicates because of no id/id_hash_keys being set here explicitly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the possibility to provide id_hash_keys
here as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing, this is such a welcome improvement! All my comments are really minor and negligible.
The only thing missing is updating the tutorials I believe. They make use of convert_files_to_dicts
in many places, for example Tutorial 1 does:
dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True) |
Once this is done I think it's ready for merge.
) | ||
file_text = text.content | ||
for table in tables: | ||
assert isinstance(table.content, pd.DataFrame) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather raise a proper HaystackError
in here with a description of what's wrong
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! Agreed. 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ZanSara @julian-risch These assert statements are needed for the mypy checks. The Document
's content field can be either of type str
or pd.DataFrame
. With these assert statements, we tell mypy that we are certain that table.content
is of type pd.DataFrame
. Otherwise, we would get a type error, because elements of type str
don't have a method iterrows
.
The alternative would be to use # type: ignore
, but AFAIK we try to avoid these as much as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you observe the failure in mypy though? I just checked the following snippet:
from typing import Union
a: Union[str, int]
if True:
a = 1
else:
a = "aaa"
if not isinstance(a, str):
raise ValueError()
a.startswith("a")
and mypy found no issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I changed the related line of code.
file_text = text + " ".join([cell for table in tables for row in table["content"] for cell in row]) | ||
file_text = text | ||
for table in tables: | ||
assert isinstance(table.content, pd.DataFrame) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the same holds for the other assert statements that were newly introduced. 👍
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
# Conflicts: # tutorials/Tutorial8_Preprocessing.ipynb # tutorials/Tutorial8_Preprocessing.py
This PR changes the return types of the file converters, preprocessor and crawler such that they return a
Document
or aList[Document]
, respectively. Previously, these nodes returnedDict
/List[Dict]
. Furthermore, this PR removes theid_hash_keys
property from theDocument
primitive, as it was never set to a value.This PR also fixes wrong file paths in some of the Tutorials and allows to fetch
.gz
files infetch_archive_from_http
.Breaking Changes
This PR introduces the following breaking changes:
Crawler
'srun
method will return aList[Document]
instead ofList[Dict]
convert
method of all Converters in thefile_converter
directory will return aList[Document]
instead ofList[Dict]
PreProcessor
'sprocess
,clean
andsplit
methods will return aList[Document]
instead ofList[Dict]
or aDocument
instead of aDict
, respectively.convert_files_to_dicts
andtika_convert_files_to_dicts
methods inutils/preprocessing.py
are renamed toconvert_files_to_docs
andtika_convert_files_to_docs
, respectively, and will return aList[Document]
instead of aList[Dict]
Closes #1859, closes #1920