feat: Rename FileExtensionRouter to FileTypeRouter, handle ByteStream(s) #5998

Merged (7 commits) on Oct 10, 2023


Member

@vblagoje vblagoje commented Oct 8, 2023

Why:

These commits refine the file routing mechanism by adding support for ByteStream inputs. The aim is to broaden the range of file types handled within the system and make file routing more versatile and robust.

What:

  • Renamed the existing FileExtensionRouter to FileTypeRouter to better reflect its functionality.
  • Incorporated ByteStream handling within the FileTypeRouter run method, thereby enhancing the routing capabilities to cater to byte stream data along with other file types.
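As a rough illustration of the routing described above, here is a minimal, self-contained sketch. It is not the code from this PR: the `ByteStream` dataclass is a hypothetical stand-in for the project's class, `route_by_mime_type` is an illustrative name, and classifying paths with the standard library's `mimetypes.guess_type` is an assumption. The `"content_type"` metadata key and the `"unclassified"` bucket follow the test under review.

```python
import mimetypes
from collections import defaultdict
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, List, Optional, Union


@dataclass
class ByteStream:
    """Hypothetical stand-in for the project's ByteStream: raw bytes plus metadata."""

    data: bytes
    metadata: Dict[str, Any] = field(default_factory=dict)


def route_by_mime_type(
    sources: List[Union[str, Path, ByteStream]], mime_types: List[str]
) -> Dict[str, List[Union[Path, ByteStream]]]:
    """Group sources under their MIME type, or under "unclassified"."""
    routed = defaultdict(list)
    for source in sources:
        if isinstance(source, str):
            source = Path(source)
        if isinstance(source, Path):
            # Assumption: paths are classified by file extension.
            mime_type: Optional[str] = mimetypes.guess_type(source.name)[0]
        elif isinstance(source, ByteStream):
            # Byte streams have no file name, so rely on their metadata.
            mime_type = source.metadata.get("content_type")
        else:
            raise ValueError(f"Unsupported source type: {type(source)}")
        key = mime_type if mime_type in mime_types else "unclassified"
        routed[key].append(source)
    return dict(routed)


# Usage: one path and one byte stream, each routed to its own bucket.
routed = route_by_mime_type(
    ["notes.txt", ByteStream(b"\xff\xd8\xff", {"content_type": "image/jpeg"})],
    mime_types=["text/plain", "image/jpeg"],
)
```

Keeping the path branch and the ByteStream branch separate means a stream is never sniffed for its type in this sketch; it is routed only by what its metadata declares.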

How Did You Test It:

Ran the unit and integration tests to validate the updated routing mechanism, and manually verified that ByteStream data and other file types are routed correctly through the updated FileTypeRouter. All tests passed.

Notes for Reviewer:

  • Please pay special attention to the renaming of the router and the new ByteStream handling logic to ensure they align with the project's coding and naming conventions.
  • The handling of ByteStream is a crucial addition; any feedback on the implementation would be highly appreciated to ensure optimal performance and error handling.
  • Note the change from paths to sources in the run method signature. As paths is kinda specific, I thought that sources better describes this input parameter. We should standardize this name across all components that accept Union[str, Path, ByteStream].

@vblagoje vblagoje requested a review from a team as a code owner October 8, 2023 10:23
@vblagoje vblagoje requested review from ZanSara and removed request for a team October 8, 2023 10:23
@github-actions github-actions bot added the topic:tests and type:documentation labels Oct 8, 2023
@vblagoje vblagoje requested a review from a team as a code owner October 8, 2023 10:27
@vblagoje vblagoje requested review from dfokina and removed request for a team October 8, 2023 10:27
@vblagoje vblagoje added the 2.x label Oct 8, 2023
Member Author

vblagoje commented Oct 9, 2023

cc @masci I named this component FileTypeRouter, as we discussed internally.

Contributor

@masci masci left a comment

nit: can we change the name of the Python module to file_type_router.py to stay consistent with the component name? (I know we have several discrepancies around, this is to avoid adding one more)

Contributor

@ZanSara ZanSara left a comment

Overall very good, I left a couple of comments about improving the test coverage

Comment on lines 52 to 79
    @pytest.mark.unit
    def test_run_with_bytestreams(self, preview_samples_path):
        """
        Test if the component runs correctly with ByteStream inputs.
        """
        file_paths = [
            preview_samples_path / "txt" / "doc_1.txt",
            preview_samples_path / "txt" / "doc_2.txt",
            preview_samples_path / "audio" / "the context for this answer is here.wav",
            preview_samples_path / "images" / "apple.jpg",
        ]
        mime_types = ["text/plain", "text/plain", "audio/x-wav", "image/jpeg"]
        # Convert file paths to ByteStream objects and set metadata
        byte_streams = []
        for path, mime_type in zip(file_paths, mime_types):
            stream = ByteStream(path.read_bytes())

            stream.metadata["content_type"] = mime_type

            byte_streams.append(stream)

        router = FileTypeRouter(mime_types=["text/plain", "audio/x-wav", "image/jpeg"])
        output = router.run(sources=byte_streams)
        assert output
        assert len(output["text/plain"]) == 2
        assert len(output["audio/x-wav"]) == 1
        assert len(output["image/jpeg"]) == 1
        assert not output.get("unclassified", None)
Contributor

Just for completeness, let's also add a ByteStream with no content_type and test that it goes under unclassified
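A test along these lines could look like the sketch below. It is only an illustration, not the test added in the PR: `ByteStream` and `StandInRouter` are hypothetical stand-ins defined here so the example runs on its own, and only the `"content_type"` key and the `"unclassified"` bucket come from the test above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ByteStream:
    """Hypothetical stand-in for the project's ByteStream."""

    data: bytes
    metadata: Dict[str, Any] = field(default_factory=dict)


class StandInRouter:
    """Hypothetical stand-in for FileTypeRouter, limited to ByteStream inputs."""

    def __init__(self, mime_types: List[str]):
        self.mime_types = mime_types

    def run(self, sources: List[ByteStream]) -> Dict[str, List[ByteStream]]:
        output: Dict[str, List[ByteStream]] = {}
        for source in sources:
            content_type = source.metadata.get("content_type")
            key = content_type if content_type in self.mime_types else "unclassified"
            output.setdefault(key, []).append(source)
        return output


def test_run_with_bytestream_without_content_type():
    typed = ByteStream(b"hello", {"content_type": "text/plain"})
    untyped = ByteStream(b"mystery bytes")  # no content_type in metadata
    router = StandInRouter(mime_types=["text/plain"])
    output = router.run(sources=[typed, untyped])
    assert output["text/plain"] == [typed]
    assert output["unclassified"] == [untyped]
```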

        assert len(output["audio/x-wav"]) == 1
        assert len(output["image/jpeg"]) == 1
        assert not output.get("unclassified", None)

Contributor

Let's add a test that mixes Paths and ByteStreams
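One possible shape for such a mixed-input test is sketched below. It is an assumption-laden illustration rather than the test from the PR: `ByteStream` and `StandInRouter` are hypothetical stand-ins, paths are classified with the standard library's `mimetypes.guess_type`, and a temporary file replaces the repository's sample files.

```python
import mimetypes
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, List, Union


@dataclass
class ByteStream:
    """Hypothetical stand-in for the project's ByteStream."""

    data: bytes
    metadata: Dict[str, Any] = field(default_factory=dict)


Source = Union[str, Path, ByteStream]


class StandInRouter:
    """Hypothetical stand-in for FileTypeRouter that accepts paths and byte streams."""

    def __init__(self, mime_types: List[str]):
        self.mime_types = mime_types

    def run(self, sources: List[Source]) -> Dict[str, List[Source]]:
        output: Dict[str, List[Source]] = {}
        for source in sources:
            if isinstance(source, ByteStream):
                mime_type = source.metadata.get("content_type")
            else:
                # Assumption: paths are classified by their file extension.
                mime_type = mimetypes.guess_type(Path(source).name)[0]
            key = mime_type if mime_type in self.mime_types else "unclassified"
            output.setdefault(key, []).append(source)
        return output


def test_run_with_mixed_paths_and_bytestreams():
    with tempfile.TemporaryDirectory() as tmp:
        text_path = Path(tmp) / "doc_1.txt"
        text_path.write_text("some text")
        stream = ByteStream(b"\xff\xd8\xff", {"content_type": "image/jpeg"})
        router = StandInRouter(mime_types=["text/plain", "image/jpeg"])
        output = router.run(sources=[text_path, stream])
    assert output["text/plain"] == [text_path]
    assert output["image/jpeg"] == [stream]
    assert "unclassified" not in output
```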

Member Author

vblagoje commented Oct 9, 2023

@ZanSara see 40ad07f for more details. It should be gtg now. TIA

Contributor

@dfokina dfokina left a comment

Made a small change on my side, all good otherwise

Member Author

vblagoje commented Oct 9, 2023

@masci @ZanSara please lmk if this one is gtg now 🚀

Contributor

@masci masci left a comment

Good on my side

@vblagoje vblagoje merged commit 98215ae into main Oct 10, 2023
@vblagoje vblagoje deleted the file_router_update branch October 10, 2023 07:14
Labels: 2.x, topic:tests, type:documentation
4 participants