feat: hard document length limit at max_chars_check #5191

Merged: 5 commits, Jun 23, 2023

@ZanSara ZanSara commented Jun 22, 2023

Related Issues

Proposed Changes:

  • Split documents into shorter ones if they are longer than max_chars_check.
  • The split is recursive, i.e. if the split document is still too long, it will be split again and again until all the chunks are below the threshold.
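
The recursive hard cut described above can be sketched as follows. This is a minimal illustration, not the code merged in this PR: the function name `split_to_max_chars` and its `max_chars` parameter are hypothetical, and it works on plain strings rather than on Document objects.

```python
from typing import List


def split_to_max_chars(text: str, max_chars: int) -> List[str]:
    """Recursively cut `text` into pieces no longer than `max_chars`.

    Hypothetical helper for illustration only; it is not the
    PreProcessor implementation.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    # Base case: the text already fits under the threshold.
    if len(text) <= max_chars:
        return [text]
    # Cut off the head, then split the tail again if it is still too long.
    head, tail = text[:max_chars], text[max_chars:]
    return [head] + split_to_max_chars(tail, max_chars)


chunks = split_to_max_chars("a" * 25, max_chars=10)
print([len(chunk) for chunk in chunks])  # [10, 10, 5]
```

Each call recurses once per chunk, so a very long input with a small threshold could hit Python's recursion limit; an iterative loop over slice offsets would avoid that.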

How did you test it?

  • Updated test/nodes/test_preprocessor.py::test_preprocessor_very_long_document

Notes for the reviewer

n/a

Checklist

@ZanSara ZanSara requested a review from a team as a code owner June 22, 2023 09:37
@ZanSara ZanSara requested review from anakin87 and removed request for a team June 22, 2023 09:37
@ZanSara ZanSara changed the title from "implement hard cut at max_chars_check" to "feat: hard document length limit at max_chars_check" Jun 22, 2023
@coveralls
Collaborator

coveralls commented Jun 22, 2023

Pull Request Test Coverage Report for Build 5355137661

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 105 unchanged lines in 6 files lost coverage.
  • Overall coverage increased (+0.1%) to 42.627%

Files with Coverage Reduction          New Missed Lines    %
nodes/prompt/shapers.py                2                   91.3%
utils/openai_utils.py                  5                   90.22%
agents/base.py                         6                   95.81%
nodes/other/shaper.py                  13                  92.4%
nodes/prompt/prompt_template.py        21                  89.55%
nodes/preprocessor/preprocessor.py     58                  82.57%

Totals
Change from base Build 5335925213: 0.1%
Covered Lines: 9592
Relevant Lines: 22502

💛 - Coveralls

@anakin87
Member

Hey @ZanSara!

In this implementation, I can't tell whether we generate a new document id for the tail_document(s) (it seems we don't).
WDYT?

@ZanSara
Contributor Author

ZanSara commented Jun 22, 2023

Uh good catch! You're right, we're likely not. I'm fixing that 👍
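
To illustrate the concern raised here, this is a minimal sketch of giving every chunk of a split document its own id. It is not the fix applied in this PR: the `Chunk` class, `make_id`, and `split_with_fresh_ids` are hypothetical names, and hashing the parent id, chunk position, and content with SHA-256 is an assumption chosen only so that the example is self-contained.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    """Hypothetical stand-in for a document: an id plus its text."""
    id: str
    content: str


def make_id(parent_id: str, index: int, content: str) -> str:
    # Assumption: derive the id from the parent id, the chunk position and
    # the content, so identical chunks of one parent still get distinct ids.
    raw = f"{parent_id}:{index}:{content}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def split_with_fresh_ids(parent_id: str, content: str, max_chars: int) -> List[Chunk]:
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    # Hard cut at fixed offsets, then assign a new id to every piece.
    pieces = [content[i:i + max_chars] for i in range(0, len(content), max_chars)]
    return [Chunk(make_id(parent_id, i, piece), piece) for i, piece in enumerate(pieces)]


chunks = split_with_fresh_ids("doc-1", "abcabcabc", max_chars=3)
print(len({chunk.id for chunk in chunks}))  # 3
```

Without the position in the hash, the three identical "abc" chunks above would share one id, which is the kind of collision a document store would treat as a duplicate.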

Member

@anakin87 anakin87 left a comment

@ZanSara I am excited because this is the first time I am doing a review in a PR by you. 😄

LGTM!

Just a comment: I would change the docstring of _long_documents to reflect the changes...

@github-actions github-actions bot added the type:documentation Improvements on the docs label Jun 23, 2023
@ZanSara ZanSara merged commit 3166462 into main Jun 23, 2023
@ZanSara ZanSara deleted the preprocessor-hard-max-char branch June 23, 2023 10:34
3 participants