feat: add RecursiveSplitter component for Document preprocessing #8605

Open
wants to merge 96 commits into
base: main
Changes from 83 commits
Commits
a49fc93
initial import
davidsbatista Nov 20, 2024
41f5f64
initial import
davidsbatista Nov 20, 2024
79c669e
wip
davidsbatista Nov 20, 2024
87b8023
adding initial version + tests
davidsbatista Nov 25, 2024
09b25f3
adding more tests
davidsbatista Dec 2, 2024
a39f481
more tests
davidsbatista Dec 2, 2024
4e9b4ea
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 4, 2024
db82194
incorporating SentenceSplitter based on NLTK
davidsbatista Dec 4, 2024
cbfcc66
adding more tests
davidsbatista Dec 4, 2024
74de92c
adding release notes
davidsbatista Dec 4, 2024
4054c47
adding LICENSE header
davidsbatista Dec 4, 2024
6b72a17
removing unused imports
davidsbatista Dec 4, 2024
4c0afb1
fixing example docstring
davidsbatista Dec 4, 2024
24739be
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 4, 2024
8e62968
addding docstrings
davidsbatista Dec 4, 2024
12549bd
fixing tests and returning a dictionary
davidsbatista Dec 4, 2024
20a7f52
updating release notes
davidsbatista Dec 4, 2024
323319b
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 5, 2024
5945e6d
attending PR comments
davidsbatista Dec 6, 2024
01ad974
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 9, 2024
eaf9b77
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 9, 2024
b5391f6
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Dec 10, 2024
adf1b1a
wip: updating tests for split_idx_start and _split_overlap
davidsbatista Dec 10, 2024
d4a2a0b
adding tests for split_idx and split_start and overlaps
davidsbatista Dec 11, 2024
aed28c5
adjusting file for LICENSE checking
davidsbatista Dec 11, 2024
eb5afb5
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 11, 2024
824142f
adding more tests
davidsbatista Dec 11, 2024
e4815d8
adding tests for page numbering
davidsbatista Dec 11, 2024
8f1ae36
adding tests for min split lenghts and falling back to character-leve…
davidsbatista Dec 11, 2024
5a49eab
fixing linting issue
davidsbatista Dec 11, 2024
a5c1f2c
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Dec 12, 2024
2248135
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Dec 12, 2024
4263352
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Dec 12, 2024
6ee5551
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Dec 12, 2024
0325a8b
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Dec 12, 2024
b2b94b5
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Dec 12, 2024
85f2ea2
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Dec 12, 2024
644056f
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Dec 12, 2024
3cb85d9
wip
davidsbatista Dec 12, 2024
459bfa7
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 12, 2024
42faf05
wip
davidsbatista Dec 12, 2024
7d9c4df
updating tests
davidsbatista Dec 12, 2024
d66afd5
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 12, 2024
5bcf709
wip: fixing all tests after changes
davidsbatista Dec 12, 2024
9205ef2
more tests
davidsbatista Dec 12, 2024
437570f
wip: debugging sentence overlap
davidsbatista Dec 12, 2024
97437d8
wip: debugging page number
davidsbatista Dec 13, 2024
13f85e1
wip
davidsbatista Dec 16, 2024
eebe1a0
wip; fixed bug with sentence tokenizer, needs to keep white spaces
davidsbatista Dec 16, 2024
3f00b3b
adding tests for counting pages on different split approaches
davidsbatista Dec 16, 2024
d9addfa
NLTK checks done on SentenceSplitter
davidsbatista Dec 16, 2024
080a529
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 16, 2024
c3f09d0
fixing types
davidsbatista Dec 16, 2024
2df40c3
adding detecting for full overlap with previous chunks
davidsbatista Dec 16, 2024
0492025
fixing types
davidsbatista Dec 16, 2024
09362e4
improving docstring
davidsbatista Dec 16, 2024
eb38a2b
improving docstring
davidsbatista Dec 16, 2024
a418f73
adding custom lenght, 'character' use case
davidsbatista Dec 17, 2024
71ce15b
customising overlap function for word and adding a few tests
davidsbatista Dec 17, 2024
3a9d290
updating docstring
davidsbatista Dec 17, 2024
f35d4e5
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 17, 2024
938b610
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Dec 19, 2024
bc4dfbd
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Dec 19, 2024
371028c
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Dec 19, 2024
79cd8bd
wip: adding more tests for word unit length
davidsbatista Dec 17, 2024
31c8412
fix
davidsbatista Dec 17, 2024
e1fed92
feat: `Tool` dataclass - unified abstraction to represent tools (#8652)
anakin87 Dec 18, 2024
f71a22b
fix: fix deserialization issues in multi-threading environments (#8651)
wochinge Dec 18, 2024
211c4ed
adding 'word' as default length
davidsbatista Dec 19, 2024
0807902
fixing types
davidsbatista Dec 19, 2024
460cc7d
handing both default strategies
davidsbatista Dec 19, 2024
7901af5
wip
davidsbatista Dec 19, 2024
2af6b03
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 19, 2024
8a09157
\f was not being counted properly
davidsbatista Dec 19, 2024
3ad73a5
updating tests
davidsbatista Dec 20, 2024
d292de6
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 20, 2024
b09154e
fixing the overlap bug
davidsbatista Dec 20, 2024
bd67369
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 20, 2024
c1fa6c2
adding more tests
davidsbatista Dec 21, 2024
de5e951
refactoring _apply_overlap
davidsbatista Dec 21, 2024
81c7c89
further refactoring
davidsbatista Dec 21, 2024
e398120
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 21, 2024
6977b2a
Merge branch 'main' into add-recursive-chunking
davidsbatista Jan 3, 2025
50ac7af
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Jan 8, 2025
602ac9b
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Jan 8, 2025
78ebc71
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Jan 8, 2025
a6a2475
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Jan 8, 2025
80b8f2c
adding ticks to close code block
davidsbatista Jan 8, 2025
2040c7c
fixing comments
davidsbatista Jan 8, 2025
977de8e
applying changes: split with space and force keep_white_spaces=True
davidsbatista Jan 8, 2025
c4ada43
Merge branch 'main' into add-recursive-chunking
davidsbatista Jan 8, 2025
d87ffe6
fixing some tests and replacing count words approach in more places
davidsbatista Jan 8, 2025
df214d6
keep_white_spaces = True only if not defined
davidsbatista Jan 9, 2025
25721bb
Merge branch 'main' into add-recursive-chunking
davidsbatista Jan 9, 2025
951956b
cleaning docs
davidsbatista Jan 9, 2025
e1464eb
handling some more edge cases, when split is still too big and all se…
davidsbatista Jan 9, 2025
4 changes: 2 additions & 2 deletions haystack/components/preprocessors/__init__.py
@@ -5,7 +5,7 @@
 from .document_cleaner import DocumentCleaner
 from .document_splitter import DocumentSplitter
 from .nltk_document_splitter import NLTKDocumentSplitter
-from .sentence_tokenizer import SentenceSplitter
+from .recursive_splitter import RecursiveDocumentSplitter
 from .text_cleaner import TextCleaner

-__all__ = ["DocumentSplitter", "DocumentCleaner", "NLTKDocumentSplitter", "SentenceSplitter", "TextCleaner"]
+__all__ = ["DocumentSplitter", "DocumentCleaner", "RecursiveDocumentSplitter", "TextCleaner", "NLTKDocumentSplitter"]