feat: add RecursiveSplitter component for Document preprocessing #8605
Conversation
Pull Request Test Coverage Report for Build 12668989395

Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch, so it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

💛 - Coveralls
@bglearning - mentioning you since I believe you were the one with the most interest in this feature.
Co-authored-by: Sebastian Husch Lee <[email protected]>
```python
if re.match(r"\f\s*", text):
    return 1

return len(text.split())
```
Instead of using `re` we can pass the explicit separator to `text.split()`, so:
```diff
- if re.match(r"\f\s*", text):
-     return 1
- return len(text.split())
+ return len(text.split(" "))
```
Passing `" "` causes `text.split` to properly count the page break, although it would also count other things like newlines as words, which might be okay. What do you think?
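For reference, a quick illustration of the difference (plain Python semantics, not code from this PR):

```python
text = "\f"                  # a page-break-only string
len(text.split())            # 0 -> split() drops the page break entirely
len(text.split(" "))         # 1 -> the page break counts as one "word"

text = "hello\nworld"
len(text.split())            # 2 -> \n is treated as a separator
len(text.split(" "))         # 1 -> \n stays inside a single token
```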
Yeah, actually this would follow how we split by word in the `DocumentSplitter`, where we use `text.split(" ")`. I think we do this specifically to avoid having `split` remove other white space characters like `\n` and `\f`. This will keep things consistent between the two and should avoid the need for specific checks for things like page break white space characters.
I'm not sure about this; I have to check it again. This change breaks more tests than I was expecting.
Looking into it. I see your concern with keeping things consistent with the other splitter/chunker.
Having an issue with this one. The code interprets each separator as a regex that also captures the separator itself; the captured separator is then appended to the previous unit token/string so that it's not lost and we still have it when constructing the chunks. This works for all cases, even with this new way of counting (i.e., `text.split(" ")`).
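Roughly, the behaviour is something like this sketch (illustrative, not the exact component code; the function name is made up):

```python
import re

def split_keep_separator(text: str, separator: str) -> list[str]:
    # Split on the separator but capture it, then glue each captured
    # separator back onto the preceding unit so it isn't lost when the
    # chunks are reconstructed later.
    parts = re.split(f"({re.escape(separator)})", text)
    units = []
    for i in range(0, len(parts), 2):
        sep = parts[i + 1] if i + 1 < len(parts) else ""
        if parts[i] or sep:
            units.append(parts[i] + sep)
    return units

split_keep_separator("This is some text.", " ")
# -> ['This ', 'is ', 'some ', 'text.']
```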
Now, the issue is when I'm splitting by `separator=[" "]`; this results in something like:
```python
text = "This is some text. \f This text is on another page. \f This is the last pag3."

['This ',
 'is ',
 'some ',
 'text. ',
 '\x0c ',
 'This ',
 'text ',
 'is ',
 'on ',
 'another ',
 'page. ',
 '\x0c ',
 'This ',
 'is ',
 'the ',
 'last ',
 'pag3.']
```
The thing is that now the trailing white space is also counted as a word, i.e.:
```python
In [8]: len(splits[0].split(" "))
Out[8]: 2

In [9]: len(splits[0].split())
Out[9]: 1
```
Then, if we specify we want chunks of `size=4`, it builds something like `This is `.
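Concretely (my own worked example), each unit carries a phantom token from its trailing space:

```python
splits = ['This ', 'is ', 'some ', 'text. ']
[len(s.split(" ")) for s in splits]   # [2, 2, 2, 2]
# so two units already "fill" a chunk of size=4, and the chunk ends at "This is "
```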
I'm working on a solution to handle this specific type of separator and way of counting, so that lengths are counted as with `text.split()`, meaning white spaces are not counted, while we can still keep them when constructing the chunk from the original string.
Right now, it involves having an extra parameter for `self._chunk_length()` informing it which separator we are currently using, and then avoiding counting the white spaces.
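Something along these lines (a sketch of the idea written as a standalone function, not the final implementation; the parameter name is illustrative):

```python
def _chunk_length(text: str, current_separator: str | None = None) -> int:
    # When the current separator is a plain space, drop the empty strings
    # produced by str.split(" ") so trailing spaces are not counted as words,
    # while whitespace-only units like "\f " still count as one.
    if current_separator == " ":
        return len([token for token in text.split(" ") if token != ""])
    return len(text.split(" "))

_chunk_length("This ", current_separator=" ")   # 1 instead of 2
_chunk_length("\x0c ", current_separator=" ")   # 1 -> page break still counted
```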
Could we just change from `text.split(" ")` to using the `regex` library with the separator `" "`? That could keep it consistent and also still not remove the newline and page break characters, right?
Unfortunately, this also doesn't seem to work; I tried both the `regex` library and the core `re` module.
```python
splits = ['This ',
          'is ',
          'some ',
          'text. ',
          '\x0c ',
          'This ',
          'text ',
          'is ',
          'on ',
          'another ',
          'page. ',
          '\x0c ',
          'This ',
          'is ',
          'the ',
          'last ',
          'pag3.']
```
```python
import regex
len(regex.split(" ", splits[0]))
Out[61]: 2

import re
len(re.split(" ", splits[0]))
Out[64]: 2
```

Both still produce a trailing empty string for a unit like `'This '`, so the count is off by one.
Another possible solution would be something like the snippet below, but the issue is having an exhaustive list of `special_chars` that covers most use cases.
```python
def count_words(text):
    words = [word for word in text.split() if word]
    # note: '\f' and '\x0c' are the same character, so it is counted only once here
    special_chars = text.count('\n') + text.count('\f')
    return len(words) + special_chars
```
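A more generic alternative (again, just a sketch) would be to count any non-space whitespace character instead of maintaining an explicit list:

```python
import re

def count_words(text: str) -> int:
    # Count regular words plus every whitespace character that is not a
    # plain space (\n, \f, \t, ...), avoiding an exhaustive special_chars list.
    words = text.split()
    non_space_whitespace = re.findall(r"[^\S ]", text)
    return len(words) + len(non_space_whitespace)

count_words("This is some text. \f This text is on another page.")
# -> 10 words + 1 page break = 11
```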
Related Issues

Proposed Changes:

How did you test it?

Checklist
- The PR title uses one of the conventional commit prefixes `fix:`, `feat:`, `build:`, `chore:`, `ci:`, `docs:`, `style:`, `refactor:`, `perf:`, `test:`, with `!` added in case the PR includes breaking changes.