fix: truncate ByteStream string representation #8673

Merged: 7 commits into main from fix/truncate_bytestream_str on Jan 7, 2025

Conversation

@tstadel tstadel commented Dec 25, 2024

Related Issues

Proposed Changes:

  • truncate data in str representation of ByteStream to 1k

How did you test it?

Notes for the reviewer

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

@tstadel tstadel requested a review from a team as a code owner December 25, 2024 11:36
@tstadel tstadel requested review from anakin87 and removed request for a team December 25, 2024 11:36
@github-actions github-actions bot added the type:documentation Improvements on the docs label Dec 25, 2024
@tstadel tstadel requested a review from a team as a code owner December 25, 2024 11:39
@tstadel tstadel requested review from dfokina and removed request for a team December 25, 2024 11:39

coveralls commented Dec 25, 2024

Pull Request Test Coverage Report for Build 12656140645

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.07%) to 91.017%

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| components/generators/chat/openai.py | 2 | 96.24% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 12443798462 | 0.07% |
| Covered Lines | 8552 |
| Relevant Lines | 9396 |

💛 - Coveralls

@anakin87 anakin87 left a comment

I totally understand the problem.

At this point, I would propose to be more radical, taking inspiration from Document

f"content: '{self.content}'" if len(self.content) < 100 else f"content: '{self.content[:100]}...'"

  • truncate the data to 100 characters
  • minor: introduce __repr__ instead of __str__, so that this is also reflected when a user accesses the ByteStream in a notebook

LMK if you see any drawbacks in these proposals.
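
A minimal sketch of what this proposal could look like, assuming a dataclass with `data`, `meta`, and `mime_type` fields. The class name `ByteStreamSketch` and the constant `MAX_REPR_BYTES` are hypothetical and used only for illustration; this is not the code that was merged.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

MAX_REPR_BYTES = 100  # assumed limit, taken from the proposal above


@dataclass(repr=False)  # skip the generated __repr__, which would embed all of `data`
class ByteStreamSketch:
    """Hypothetical stand-in for ByteStream; not the merged implementation."""

    data: bytes
    meta: Dict[str, Any] = field(default_factory=dict)
    mime_type: Optional[str] = None

    def __repr__(self) -> str:
        # Show at most MAX_REPR_BYTES bytes of the payload and mark the cut with "...".
        if len(self.data) > MAX_REPR_BYTES:
            shown = f"{self.data[:MAX_REPR_BYTES]!r}..."
        else:
            shown = repr(self.data)
        return (
            f"{type(self).__name__}(data={shown}, "
            f"meta={self.meta!r}, mime_type={self.mime_type!r})"
        )


small = ByteStreamSketch(data=b"hello")
large = ByteStreamSketch(data=b"x" * 10_000, meta={"id": "doc-1"})
```

Because `__str__` falls back to `__repr__` when it is not defined, `print(large)`, `str(large)`, and displaying `large` in a notebook cell would all show the same truncated form.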

tstadel commented Jan 7, 2025

> I totally understand the problem.
>
> At this point, I would propose to be more radical, taking inspiration from Document
>
> f"content: '{self.content}'" if len(self.content) < 100 else f"content: '{self.content[:100]}...'"
>
>   • truncate the data to 100 characters
>   • minor: introduce __repr__ instead of __str__, so that this is also reflected when a user accesses the ByteStream in a notebook
>
> LMK if you see any drawbacks in these proposals.

Sounds reasonable. I will also remove dataclass's default impl.

Just a bit more on this, as there are many opinions on the web that the resulting repr should be unambiguous:

  • https://docs.python.org/3/library/functions.html#repr does not explicitly state this requirement
  • It does make two suggestions, however:
    • a string that would yield an object with the same value when passed to eval()
    • a string enclosed in angle brackets that contains the name of the type of the object together with additional information often including the name and address of the object

So our approach would be something in between those two. I think this is fine and way better than

  • printing the whole content (dataclass default) (potentially emitting gigabytes)
  • only printing the class name and memory address (e.g. like io.BytesIO)

One last thing on being unambiguous: the object's meta is still fully emitted, so most ByteStreams (containing an id in meta) will be unambiguous anyway.
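
To illustrate the two extremes mentioned above, here is a small sketch comparing the generated dataclass repr with the repr of `io.BytesIO`. The class `DefaultReprStream` is a hypothetical example and not part of the library.

```python
import io
from dataclasses import dataclass


@dataclass
class DefaultReprStream:
    """Hypothetical dataclass that keeps the generated __repr__."""

    data: bytes


payload = b"x" * 1_000_000

# The generated dataclass repr embeds the full payload, so it grows with the data.
default_repr = repr(DefaultReprStream(data=payload))

# io.BytesIO uses the default object repr: type name and memory address only.
bytesio_repr = repr(io.BytesIO(payload))

print(len(default_repr))  # more than one million characters
print(bytesio_repr)       # short, but says nothing about the content
```

A truncated repr that still prints `meta` sits between these two: it stays short while keeping enough information to tell objects apart in most cases.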

anakin87 commented Jan 7, 2025

I agree with your analysis!

@tstadel tstadel requested a review from anakin87 January 7, 2025 17:22

@anakin87 anakin87 left a comment

LGTM!

@tstadel tstadel merged commit e6059e6 into main Jan 7, 2025
18 checks passed
@tstadel tstadel deleted the fix/truncate_bytestream_str branch January 7, 2025 18:00
Labels
topic:tests type:documentation Improvements on the docs
3 participants