feat: add RecursiveSplitter component for Document preprocessing #8605
Conversation
Pull Request Test Coverage Report for Build 12668989395

Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch, so it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

💛 - Coveralls
@bglearning - mentioning you since I believe you were the one with the most interest in this feature.
Co-authored-by: Sebastian Husch Lee <[email protected]>
```python
if re.match(r"\f\s*", text):
    return 1

return len(text.split())
```
Instead of using `re` we can pass the explicit separator to `text.split()`, so:
```diff
- if re.match(r"\f\s*", text):
-     return 1
- return len(text.split())
+ return len(text.split(" "))
```
Passing `" "` causes `text.split` to properly count the page break, although it would also count other things like newlines as words, which might be okay. What do you think?
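For reference, a quick illustration of the difference (plain Python semantics, not code from this PR):

```python
text = "\f"                  # a page-break-only string
len(text.split())            # 0 -> split() drops the page break entirely
len(text.split(" "))         # 1 -> the page break counts as one "word"

text = "hello\nworld"
len(text.split())            # 2 -> \n is treated as a separator
len(text.split(" "))         # 1 -> \n stays inside a single token
```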
Yeah, actually this would follow how we split by word in the `DocumentSplitter`, where we use `text.split(" ")`. I think we do this specifically to avoid having `split` remove other white space characters like `\n` and `\f`. This will keep things consistent between the two and should avoid the need for specific checks for things like page break white space characters.
I'm not sure about this; I have to check it again. This change breaks more tests than I was expecting.
Looking into it. I see your concern with keeping things consistent with the other splitter/chunker.
Having an issue with this one. The code interprets each separator as a regex that also captures the separator itself; the captured separator is then appended to the previous unit token/string so that it's not lost and we still have it when constructing the chunks. This works for all cases, even with this new way of counting (i.e., `text.split(" ")`).
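Roughly, the behaviour is something like this sketch (illustrative, not the exact component code; the function name is made up):

```python
import re

def split_keep_separator(text: str, separator: str) -> list[str]:
    # Split on the separator but capture it, then glue each captured
    # separator back onto the preceding unit so it isn't lost when the
    # chunks are reconstructed later.
    parts = re.split(f"({re.escape(separator)})", text)
    units = []
    for i in range(0, len(parts), 2):
        sep = parts[i + 1] if i + 1 < len(parts) else ""
        if parts[i] or sep:
            units.append(parts[i] + sep)
    return units

split_keep_separator("This is some text.", " ")
# -> ['This ', 'is ', 'some ', 'text.']
```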
Now, the issue is when I'm splitting by `separator=[" "]`; this results in something like:
```python
text = "This is some text. \f This text is on another page. \f This is the last pag3."

['This ',
 'is ',
 'some ',
 'text. ',
 '\x0c ',
 'This ',
 'text ',
 'is ',
 'on ',
 'another ',
 'page. ',
 '\x0c ',
 'This ',
 'is ',
 'the ',
 'last ',
 'pag3.']
```
The thing is that now the trailing white space is also counted as a word, i.e.:
```python
In [8]: len(splits[0].split(" "))
Out[8]: 2

In [9]: len(splits[0].split())
Out[9]: 1
```
Then, if we specify we want chunks of `size=4`, it builds something like `This is `.
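Concretely (my own worked example), each unit carries a phantom token from its trailing space:

```python
splits = ['This ', 'is ', 'some ', 'text. ']
[len(s.split(" ")) for s in splits]   # [2, 2, 2, 2]
# so two units already "fill" a chunk of size=4, and the chunk ends at "This is "
```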
I'm working on a solution to handle this specific type of separator and way of counting, so that lengths are counted as with `text.split()`, meaning white spaces are not counted, while we can still keep them when constructing the chunk from the original string.
Right now, it involves having an extra parameter for `self._chunk_length()` informing it which separator we are currently using, and then avoiding counting the white spaces.
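Something along these lines (a sketch of the idea written as a standalone function, not the final implementation; the parameter name is illustrative):

```python
def _chunk_length(text: str, current_separator: str | None = None) -> int:
    # When the current separator is a plain space, drop the empty strings
    # produced by str.split(" ") so trailing spaces are not counted as words,
    # while whitespace-only units like "\f " still count as one.
    if current_separator == " ":
        return len([token for token in text.split(" ") if token != ""])
    return len(text.split(" "))

_chunk_length("This ", current_separator=" ")   # 1 instead of 2
_chunk_length("\x0c ", current_separator=" ")   # 1 -> page break still counted
```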
Could we just change from `text.split(" ")` to using the `regex` library with the separator `" "`? That could keep it consistent and also still not remove the newline and page break characters, right?
Unfortunately, this also doesn't seem to work; I tried both the `regex` library and the core `re` module.
```python
splits = ['This ',
          'is ',
          'some ',
          'text. ',
          '\x0c ',
          'This ',
          'text ',
          'is ',
          'on ',
          'another ',
          'page. ',
          '\x0c ',
          'This ',
          'is ',
          'the ',
          'last ',
          'pag3.']
```
```python
import regex
len(regex.split(" ", splits[0]))
Out[61]: 2

import re
len(re.split(" ", splits[0]))
Out[64]: 2
```

Both still produce a trailing empty string for a unit like `'This '`, so the count is off by one.
Another possible solution would be something like the snippet below, but the issue is having an exhaustive list of `special_chars` that covers most use cases.
```python
def count_words(text):
    words = [word for word in text.split() if word]
    # note: '\f' and '\x0c' are the same character, so it is counted only once here
    special_chars = text.count('\n') + text.count('\f')
    return len(words) + special_chars
```
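A more generic alternative (again, just a sketch) would be to count any non-space whitespace character instead of maintaining an explicit list:

```python
import re

def count_words(text: str) -> int:
    # Count regular words plus every whitespace character that is not a
    # plain space (\n, \f, \t, ...), avoiding an exhaustive special_chars list.
    words = text.split()
    non_space_whitespace = re.findall(r"[^\S ]", text)
    return len(words) + len(non_space_whitespace)

count_words("This is some text. \f This text is on another page.")
# -> 10 words + 1 page break = 11
```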
Related Issues

Proposed Changes:

How did you test it?

Checklist
- The PR title uses one of the conventional commit prefixes `fix:`, `feat:`, `build:`, `chore:`, `ci:`, `docs:`, `style:`, `refactor:`, `perf:`, `test:`, with `!` added in case the PR includes breaking changes.