fix: truncate ByteStream string representation #8673

tstadel · 2024-12-25T11:36:50Z

Related Issues

Fixes logs getting too big, e.g. when a ByteStream source is formatted into this log message:

haystack/haystack/components/converters/txt.py

Line 90 in 04fc187

"Could not convert file {source}. Skipping it. Error message: {error}", source=source, error=e

Proposed Changes:

Truncate the data in the str representation of ByteStream to 1k.
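
To make the problem and the proposed fix concrete, here is a minimal sketch. `Payload` is a hypothetical stand-in for ByteStream (it is not the Haystack class), the 1,000-byte limit is one reading of "1k", and the template is a shortened version of the log message quoted above, formatted with plain `str.format` as an assumption about how the source ends up in the log text.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

MAX_STR_BYTES = 1000  # assumed reading of "1k"


@dataclass
class Payload:
    """Hypothetical stand-in for ByteStream; not the Haystack implementation."""

    data: bytes
    meta: Dict[str, Any] = field(default_factory=dict)

    def __str__(self) -> str:
        # Show at most MAX_STR_BYTES bytes of data and mark the cut with "...".
        if len(self.data) > MAX_STR_BYTES:
            shown = repr(self.data[:MAX_STR_BYTES]) + "..."
        else:
            shown = repr(self.data)
        return f"Payload(data={shown}, meta={self.meta})"


big = Payload(data=b"x" * 1_000_000, meta={"file_path": "big.txt"})

# The dataclass-generated __repr__ still includes every byte.
full_length = len(repr(big))

# "{source}" with an empty format spec goes through str(), so it stays small.
message = "Could not convert file {source}. Skipping it.".format(source=big)

print(full_length > 1_000_000, len(message))
```

Note that this only covers `str()`; anything that calls `repr()` (containers, interactive sessions) would still emit the full data.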

How did you test it?

Notes for the reviewer

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
I documented my code
I ran pre-commit hooks and fixed any issue

coveralls · 2024-12-25T11:41:53Z

Pull Request Test Coverage Report for Build 12656140645

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

0 of 0 changed or added relevant lines in 0 files are covered.
2 unchanged lines in 1 file lost coverage.
Overall coverage increased (+0.07%) to 91.017%

Files with Coverage Reduction	New Missed Lines	%
components/generators/chat/openai.py	2	96.24%

Totals
Change from base Build 12443798462:	0.07%
Covered Lines:	8552
Relevant Lines:	9396

💛 - Coveralls

anakin87

I totally understand the problem.

At this point, I would propose to be more radical, taking inspiration from Document

haystack/haystack/dataclasses/document.py

Line 83 in 7b4d9ba

    
f"content: '{self.content}'" if len(self.content) < 100 else f"content: '{self.content[:100]}...'"

truncate the data to 100 characters
minor: introduce __repr__ instead of __str__, so that the truncation is also reflected when a user accesses the ByteStream in a notebook

LMK if you see any drawbacks in these proposals.
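
A rough sketch of what these two suggestions could look like together. `Payload` is a hypothetical stand-in for ByteStream, and the exact output format is an assumption, not the merged implementation. It applies a pattern similar to the Document line quoted above to bytes, and relies on the fact that `str()` falls back to `__repr__` when no `__str__` is defined.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Payload:
    """Hypothetical stand-in for ByteStream; not the Haystack implementation."""

    data: bytes
    meta: Dict[str, Any] = field(default_factory=dict)

    # A __repr__ defined in the class body is kept; @dataclass does not replace it.
    def __repr__(self) -> str:
        # Keep the first 100 bytes and mark longer data with "...".
        if len(self.data) > 100:
            shown = f"{self.data[:100]!r}..."
        else:
            shown = f"{self.data!r}"
        return f"{type(self).__name__}(data={shown}, meta={self.meta!r})"


small = Payload(b"hello")
big = Payload(b"x" * 10_000, meta={"id": "42"})

print(repr(small))  # Payload(data=b'hello', meta={})

# No __str__ is defined, so str() (and therefore print and string formatting)
# uses the truncated __repr__ as well; so does the repr of a containing list.
print(str(big) == repr(big))
print(len(repr([big])))
```

Because notebooks and the interactive prompt echo values through `repr()`, overriding `__repr__` covers both the logging case and the notebook case with one method.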

tstadel · 2025-01-07T15:13:42Z

Sounds reasonable. I will also remove the dataclass's default implementation.

Just a bit more on this, as there are many opinions on the web that the resulting repr should be unambiguous:

https://docs.python.org/3/library/functions.html#repr does not explicitly state this requirement.
There are two suggestions, however:
- a string that would yield an object with the same value when passed to eval()
- a string enclosed in angle brackets that contains the name of the type of the object together with additional information often including the name and address of the object

So our approach would be something in between those two. I think this is fine and way better than

- printing the whole content (dataclass default), potentially emitting gigabytes
- only printing the class name and memory address (e.g. like io.BytesIO)

One last thing on being unambiguous: the object's meta is still fully emitted, so most ByteStreams (those containing an id in meta) will be unambiguous anyway.
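
For illustration, the three options can be compared side by side. `FullRepr` and `TruncatedRepr` are hypothetical classes invented for this sketch (they are not Haystack code), the 100-byte limit follows the suggestion above, and the `io.BytesIO` output shown in the comment is what CPython prints.

```python
import io
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class FullRepr:
    """Hypothetical: keeps the dataclass-generated repr, which prints everything."""

    data: bytes
    meta: Dict[str, Any] = field(default_factory=dict)


@dataclass(repr=False)
class TruncatedRepr:
    """Hypothetical middle ground: truncated data, full meta."""

    data: bytes
    meta: Dict[str, Any] = field(default_factory=dict)

    def __repr__(self) -> str:
        shown = f"{self.data[:100]!r}..." if len(self.data) > 100 else f"{self.data!r}"
        return f"{type(self).__name__}(data={shown}, meta={self.meta!r})"


payload = b"x" * 5_000

# Dataclass default: a constructor-style repr whose size grows with the data.
full_length = len(repr(FullRepr(payload, {"id": "a"})))

# Angle-bracket style: type name and address only, e.g. <_io.BytesIO object at 0x...>.
buffer_repr = repr(io.BytesIO(payload))

# Middle ground: bounded size, and objects with different meta stay
# distinguishable even when the truncated data prefix is identical.
first = TruncatedRepr(payload, {"id": "a"})
second = TruncatedRepr(payload, {"id": "b"})

print(full_length, buffer_repr, len(repr(first)), repr(first) != repr(second))
```

Two objects with the same first 100 bytes and the same meta would still produce identical output, which is the ambiguity being accepted here.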

anakin87 · 2025-01-07T15:19:40Z

I agree with your analysis!

anakin87

LGTM!

fix: truncate ByteStream string representation

b8cf3cc

tstadel requested a review from a team as a code owner December 25, 2024 11:36

tstadel requested review from anakin87 and removed request for a team December 25, 2024 11:36

github-actions bot added the type:documentation Improvements on the docs label Dec 25, 2024

add reno

07c1720

tstadel requested a review from a team as a code owner December 25, 2024 11:39

tstadel requested review from dfokina and removed request for a team December 25, 2024 11:39

better reno

d12340c

add test

765525c

github-actions bot added the topic:tests label Dec 28, 2024

Update test_byte_stream.py

ea69744

anakin87 reviewed Jan 2, 2025

tstadel added 2 commits January 7, 2025 18:20

apply feedback

901f17d

update reno

89fbb19

tstadel requested a review from anakin87 January 7, 2025 17:22

anakin87 approved these changes Jan 7, 2025

tstadel merged commit e6059e6 into main Jan 7, 2025
18 checks passed

tstadel deleted the fix/truncate_bytestream_str branch January 7, 2025 18:00
