feat: add RecursiveSplitter component for Document preprocessing #8605

Open

davidsbatista wants to merge 96 commits into base: main

Conversation

davidsbatista (Contributor)

Related Issues

Proposed Changes:

  • Adding a RecursiveSplitter that uses a set of predefined separators to split text recursively - see the issue for more details

How did you test it?

  • local unit tests and integration tests plus CI tests

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

@github-actions github-actions bot added the topic:tests and type:documentation (Improvements on the docs) labels Dec 4, 2024
coveralls (Collaborator) commented Dec 4, 2024

Pull Request Test Coverage Report for Build 12668989395

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 3 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.2%) to 91.189%

Files with Coverage Reduction | New Missed Lines | %
components/preprocessors/sentence_tokenizer.py | 1 | 94.12%
utils/callable_serialization.py | 2 | 95.0%
Totals Coverage Status
Change from base Build 12598899410: 0.2%
Covered Lines: 8745
Relevant Lines: 9590

💛 - Coveralls

@davidsbatista davidsbatista marked this pull request as ready for review December 4, 2024 17:23
@davidsbatista davidsbatista requested review from a team as code owners December 4, 2024 17:23
@davidsbatista davidsbatista requested review from dfokina and julian-risch and removed request for a team December 4, 2024 17:23
@davidsbatista davidsbatista changed the title from "feat:: add recursive chunking strategy" to "feat: add recursive chunking strategy" Dec 4, 2024
@davidsbatista davidsbatista requested a review from sjrl December 4, 2024 17:24
davidsbatista (Contributor, Author)

@bglearning - mentioning you since I believe you were the one with most interest in this feature

Comment on lines 229 to 232
if re.match(r"\f\s*", text):
    return 1

return len(text.split())
Contributor

Instead of using re, we can pass the explicit separator to text.split(), so:

Suggested change
- if re.match(r"\f\s*", text):
-     return 1
- return len(text.split())
+ return len(text.split(" "))

Passing " " causes text.split to properly count the page break, although it would also count other things like newlines as words, which might be okay. What do you think?
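
For illustration, here is a minimal sketch (not from the PR) of how the two calls differ on page-break whitespace:

text = "\f"
len(text.split())     # 0 -> "\f" is treated as whitespace and dropped
len(text.split(" "))  # 1 -> "\f" survives as its own token

text = "one\ntwo \f three"
text.split()          # ['one', 'two', 'three']
text.split(" ")       # ['one\ntwo', '\x0c', 'three']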

sjrl (Contributor) commented Jan 8, 2025

Yeah, actually this would follow how we split by word in the DocumentSplitter, where we use text.split(" "). I think we do this specifically to avoid having split() remove other whitespace characters like \n and \f. This will keep things consistent between the two and should avoid the need for specific checks for page-break whitespace characters.

davidsbatista (Contributor, Author) commented Jan 8, 2025

I'm not sure about this; I have to check it again. This change breaks more tests than I was expecting.

davidsbatista (Contributor, Author)

Looking into it; I see your concern about keeping things consistent with the other splitter/chunker.

davidsbatista (Contributor, Author) commented Jan 9, 2025

I'm having an issue with this one. The code interprets each separator as a regex that also captures the separator itself; the captured separator is then appended to the previous unit token/string so that it's not lost and we still have it when constructing the chunks.
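
For illustration, a minimal sketch of that split-and-keep-the-separator idea (assumed names, not the PR's actual code):

import re

text = "This is some text. \f This text is on another page."
parts = re.split(r"(\f)", text)  # the capturing group keeps the separator as its own list element
# re-attach each captured separator to the preceding unit so it is not lost
units = [parts[i] + (parts[i + 1] if i + 1 < len(parts) else "") for i in range(0, len(parts), 2)]
# units == ['This is some text. \x0c', ' This text is on another page.']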

This works for all cases, even with this new way of counting (i.e., text.split(" ")).

Now, the issue is when I'm splitting by separator=[" "]; this results in something like this:

text = "This is some text. \f This text is on another page. \f This is the last pag3."

['This ',
 'is ',
 'some ',
 'text. ',
 '\x0c ',
 'This ',
 'text ',
 'is ',
 'on ',
 'another ',
 'page. ',
 '\x0c ',
 'This ',
 'is ',
 'the ',
 'last ',
 'pag3.']

The thing is that now the trailing whitespace is also counted, i.e.:

In [8]: len(splits[0].split(" "))
Out[8]: 2

In [9]: len(splits[0].split())
Out[9]: 1

Then, if we specify that we want chunks of size=4, it builds something like "This is ".

I'm working on a solution to handle this specific type of separator and way of counting, so that it counts like text.split(), meaning that whitespace is not counted, while we can still keep it when constructing the chunk from the original string.

Right now, it involves passing an extra parameter to self._chunk_length() that tells it which separator is currently in use, so it can avoid counting the whitespace.
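
A hypothetical sketch of that idea (the signature and names are assumptions, not the PR's actual implementation):

from typing import Optional

def _chunk_length(self, text: str, current_separator: Optional[str] = None) -> int:
    if current_separator == " ":
        # drop the empty strings produced by leading/trailing separators so the
        # whitespace itself is not counted, while "\f" still counts as one token
        return len([token for token in text.split(" ") if token])
    return len(text.split(" "))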

Contributor

Could we just change from text.split(" ") to using the regex library with the separator " "? That would keep it consistent and also still not remove the newlines and page-break characters, right?

davidsbatista (Contributor, Author)

Unfortunately, this also doesn't seem to work; I tried with the regex library as well as the core re module.

splits = ['This ',
 'is ',
 'some ',
 'text. ',
 '\x0c ',
 'This ',
 'text ',
 'is ',
 'on ',
 'another ',
 'page. ',
 '\x0c ',
 'This ',
 'is ',
 'the ',
 'last ',
 'pag3.']

import regex

len(regex.split(" ", splits[0]))
Out[61]: 2

import re
len(re.split(" ", splits[0]))
Out[64]: 2

Another possible solution would be something like the following, but the issue is having an exhaustive list of special_chars that covers most use cases.

def count_words(text):
    # whitespace-delimited words; str.split() already drops empty strings
    words = [word for word in text.split() if word]
    # whitespace characters that should still count towards the length
    # (note: '\f' and '\x0c' are the same character, so it is counted only once)
    special_chars = text.count('\n') + text.count('\f')
    return len(words) + special_chars
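
As a quick sanity check of that idea on the example text from above (hypothetical, not output from the PR):

text = "This is some text. \f This text is on another page. \f This is the last pag3."
count_words(text)  # 15 words + 2 page breaks = 17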

Labels
topic:tests, type:documentation (Improvements on the docs), type:feature (New feature or request)
Development

Successfully merging this pull request may close these issues.

Add a Recursive Chunking strategy
6 participants