Document splitting #377
Comments
Hi @flozi00, thanks for this first draft!
Goals:
Current options in Haystack:
Further options I can see:
In addition, we need one approach to avoid splits in the middle of an answer (Goal 3):
Happy to discuss this further and implement one of the above together.
Regarding your draft, I have a few questions:
Do you really mean large sentences here? I'd rather think about large texts / documents ...
Why do you append
All in all: I appreciate the direction, but we should align on the concept first before going forward with implementation details.
Hi, sorry for the bugs in the code. I totally agree with the goals; they are the reason we implemented those features. Yes, I meant documents instead of sentences. The 60% comes from another piece of code I forgot to refactor correctly; it's there because we took other numbers from customer settings in our software. In the case of Haystack, there is a list of JSON arrays to index, right?
Hey @flozi00, Just to set the context, we have three main steps in Haystack for preprocessing: File conversion, Splitting and Cleaning.
You can expect for your method:
We could later see if it makes sense to add some split info to
https://gist.github.com/flozi00/ead2cf450f8e5db8cb3891b7a8ee7bd5
I took the time to clean up the code; this is another draft.
@flozi00 thx, this looks better and we are currently implementing something similar to your sketch here: https://github.com/deepset-ai/haystack/pull/473/files#diff-2df6944ad0156c6b18c3f6f47c648281R74 We'll have:
Maybe you can have a look at the PR once it's ready and give us feedback on whether it also covers your use case?
Sorry, I have too much work at the moment.
Implemented a first version in #473.
This is a first draft of how the handling of large sentences could look.
I would also split by words (on spaces, with a percentage of overlap) and by tokens, again with a percentage of overlap.
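To make the word-based variant concrete, here is a minimal sketch of splitting on whitespace with a percentage of overlap between neighbouring chunks. It is not the code from the draft or from Haystack; the function name `split_by_words` and the parameters `chunk_size` and `overlap` are hypothetical and chosen only for illustration.

```python
from typing import List


def split_by_words(text: str, chunk_size: int = 100, overlap: float = 0.2) -> List[str]:
    """Split text into chunks of at most `chunk_size` words.

    `overlap` is the fraction of each chunk that is repeated at the start
    of the next chunk (hypothetical parameter, e.g. 0.2 for 20%).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")

    # Whitespace tokenisation: words are whatever str.split() returns.
    words = text.split()

    # Number of words shared by two consecutive chunks.
    overlap_words = round(chunk_size * overlap)
    # Always advance by at least one word so the loop terminates.
    step = max(1, chunk_size - overlap_words)

    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk has reached the end of the text,
        # so no trailing chunk consists purely of repeated words.
        if start + chunk_size >= len(words):
            break
    return chunks
```

The same windowing would apply to token-based splitting by replacing `text.split()` with the output of a tokenizer; which tokenizer to use is left open here.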
Looking forward to discussing this :)
@tholor