
Document splitting #377

Closed
flozi00 opened this issue Sep 15, 2020 · 7 comments

flozi00 commented Sep 15, 2020

This is a first draft of how the handling of large sentences could look.
I would also do splitting by words (spaces, with a percentage of overlap) and by tokens with a percentage of overlap.
Looking forward to discussing it :)

def splitUpBySentence(context, chars_to_replace=[[",", ""], [".", "\n"], [":", ""], ["!", "\n"], ["?", "\n"]], split_char="\n", qna_seq_len=128):
    # Note: chars_to_replace and split_char are not used yet in this draft.
    corpus = []
    tmp_cont = None

    for x in context:
        if tmp_cont is None:
            tmp_cont = x + " \n"
        elif len((tmp_cont + x + "\n").split(" ")) < int(qna_seq_len * 0.6):
            # keep appending as long as the word count stays below 60% of qna_seq_len
            tmp_cont = tmp_cont + x + " \n"
        else:
            # block is full: store it and start the next block with the last
            # lines of the previous one as overlap
            tmp_corp = tmp_cont.split("\n")
            corpus.append(tmp_cont)
            tmp_cont = ""
            if len(tmp_corp) > 3:
                subtract = 2
            elif len(tmp_corp) > 1:
                subtract = 1
            else:
                subtract = 0
            for xy in range(len(tmp_corp) - subtract):
                try:
                    tmp_corp.pop(xy)
                except IndexError:
                    pass
            for entry in tmp_corp:
                tmp_cont += entry + " \n"

    corpus.append(tmp_cont)

    return corpus
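
A quick usage sketch (assuming context is a list of already segmented sentences; that assumption is mine, the exact input format is discussed further below):

# Usage sketch. Assumes `context` is a list of already segmented sentences
# (my assumption; the exact input format is discussed further below).
sentences = [
    "Haystack answers questions over large document collections.",
    "Long documents should be split into shorter passages before indexing.",
    "Otherwise the reader gets slow and the retriever less accurate.",
]
passages = splitUpBySentence(sentences, qna_seq_len=128)
print(passages)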

@tholor

@flozi00 flozi00 added the type:feature New feature or request label Sep 15, 2020
@tholor tholor self-assigned this Sep 15, 2020

tholor commented Sep 16, 2020

Hi @flozi00 ,

Thanks for this first draft!
As discussed before: I totally see the value of splitting long files into meaningful, shorter docs/passages before ingesting them into the document store. As this has quite big implications further down the pipeline, I want to make sure that we are on the same page here...

Goals:

  1. Increasing speed of Reader (the longer the docs passed from the retriever, the slower the reader)
  2. Improving accuracy of retriever (especially DPR relies on shorter passages <= 512 tokens, but also BM25 can benefit if we don't index a doc with 100 pages)
  3. Don't lose potential answers due to a split across docs (e.g. first 2 tokens of the answer in doc 1, last 2 tokens in doc 2 => impossible for the reader to find it)

Current options in Haystack:
a) Split by paragraph (split string by "\n\n")
b) Split by page (during file conversion from pdf / docx)

Further options I can see:
i) Split by a fixed number of tokens

  • Pros: Efficiently utilizes the available max_seq_len of the reader / retriever (usually 512 or 384)
  • Cons: Depends on model type and vocab; additional complexity in the preprocessing step due to the tokenizer

ii) Split by a fixed number of words that heuristically equals a certain number of tokens

  • Pros: Simple & fast; similar to the approach in the DPR paper; no dependency on a tokenizer
  • Cons: The ratio of words to tokens can vary across domains / languages / models

In addition, we need one approach to avoid splits in the middle of an answer (Goal 3):
a) respecting sentence boundaries
b) sliding window (i.e. overlapping parts between split docs)
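
For illustration, a minimal sketch of option ii) combined with a sliding window could look like this (function name and parameters are just placeholders, not existing Haystack code):

# Minimal sketch of option ii) plus a sliding window: chunks of `split_length`
# words with `split_overlap` words shared between neighbouring chunks.
# Names and defaults are placeholders, not an existing Haystack API.
def split_by_word_count(text, split_length=100, split_overlap=10):
    words = text.split()
    stride = split_length - split_overlap
    assert stride > 0, "overlap must be smaller than split_length"
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break
    return chunks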

Happy to discuss this further and implement one of the above together.
We are also planning bigger refactorings of the whole preprocessing pipeline where we could fit these methods in (see epic #378, and subtask #382)


Regarding your draft I have a few questions:

This is a first draft of how the handling of large sentences could look.

Do you really mean large sentences here? I'd rather think about large texts / documents ...

  1. What is the input that we pass here as "context"? List of lines from one document? List of sentences from previous sentence segmentation?

  2. What is the high-level goal that you want to achieve here? I understand that you append lines of a doc until their words exceed 60% of max_seq_len, but it's not clear what you intend to do with this block:

        else:
            tmp_corp = tmp_cont.split("\n")
            corpus.append(tmp_cont)
            tmp_cont = ""
            if len(tmp_corp) > 3:
                subtract = 2
            elif len(tmp_corp) > 1:
                subtract = 1
            else:
                subtract = 0
            for xy in range(len(tmp_corp) - subtract):
                try:
                    tmp_corp.pop(xy)
                except IndexError:
                    pass
            for entry in tmp_corp:
                tmp_cont += entry + " \n"

    corpus.append(tmp_cont)

Why do you append tmp_cont to corpus at the start and then again at the very end?

All in all: I appreciate the direction, but we should align on the concept first before going forward with implementation details


flozi00 commented Sep 16, 2020

Hi, sorry for the bugs in the code.
I copied and modified an internal piece of code for this thread and didn't check that it really works.

I totally agree with the goals; that's the reason why we implemented those features.

Yeah, I meant documents instead of sentences.

The 60% comes from another piece of code I forgot to refactor correctly; it is used because we take other numbers from customer settings in our software.

In the case of Haystack there is a list of JSON arrays to index, right?
I would iterate over this list and append to a clean list, while linking to the original in the meta data.
So the input would be a list; actually, it would be a string.

@flozi00 flozi00 changed the title from "[Feature] handling large sentences while indexing" to "[Feature] handling large documents while indexing" Sep 20, 2020

tholor commented Sep 22, 2020

Hey @flozi00,

Just to set the context, we have three main steps in Haystack for preprocessing: File conversion, Splitting and Cleaning.
As mentioned above, we will refactor the objects and interactions soon. Your PR currently addresses Splitting and Cleaning.
I suggest focusing just on splitting for now, as we will probably separate both steps in the future pipeline.

In the case of Haystack there is a list of JSON arrays to index, right?
I would iterate over this list and append to a clean list, while linking to the original in the meta data.
So the input would be a list; actually, it would be a string.

For your method you can expect:

  • Input: a list of dicts
[
{"text": "some very long text that we want to split", "meta": {"filename": "some_value", ...}},
...
]
  • Output: a list of dicts where the text is split
[
{"text": "some very long text", "meta": {"filename": "some_value", ...}},
{"text": " that we want to split", "meta": {"filename": "some_value", ...}},
...
]

We could later see if it makes sense to add some split info to meta (e.g. "split" = 1, 2, 3)
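
For illustration, a minimal sketch of a splitter that follows this contract could look like this (the function name and the simple word-based split are placeholders, not the final interface):

# Sketch of the expected contract: take a list of dicts, return a list of
# dicts with split texts and the original meta carried over. Function name
# and the simple word-based split are placeholders, not the final interface.
def split_documents(docs, split_length=100):
    result = []
    for doc in docs:
        words = doc["text"].split()
        for i in range(0, len(words), split_length):
            result.append({
                "text": " ".join(words[i:i + split_length]),
                # copy the meta so the splits don't share one dict;
                # a split counter could be added here later as mentioned above
                "meta": dict(doc["meta"]),
            })
    return result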

@tholor tholor changed the title from "[Feature] handling large documents while indexing" to "WIP Document splitting" Sep 22, 2020

flozi00 commented Sep 24, 2020

https://gist.github.com/flozi00/ead2cf450f8e5db8cb3891b7a8ee7bd5

I took the time to clean up the code; this is another draft.
Better now?

@tholor tholor added this to the #2 milestone Oct 6, 2020

tholor commented Oct 8, 2020

@flozi00 thx, this looks better and we are currently implementing something similar to your sketch here: https://github.com/deepset-ai/haystack/pull/473/files#diff-2df6944ad0156c6b18c3f6f47c648281R74

We'll have:

  • splits based on "words", sentences or passages
  • an option to respect sentence boundaries e.g. when splitting every 100 words
  • an option for a sliding window

Maybe you can have a look at the PR once it's ready and give us feedback if that also covers your use case?
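
For illustration, a rough sketch of the second point (splitting every N words while keeping sentences intact) could look like this; it is a hypothetical helper, not the actual code in #473:

import re

# Rough sketch: greedily group sentences so that each split stays below
# `split_length` words, i.e. word-based splits that respect sentence
# boundaries. Hypothetical helper, not the implementation in #473.
def split_respecting_sentence_boundaries(text, split_length=100):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    splits, current, current_len = [], [], 0
    for sent in sentences:
        n_words = len(sent.split())
        if current and current_len + n_words > split_length:
            splits.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n_words
    if current:
        splits.append(" ".join(current))
    return splits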

@tholor tholor changed the title from "WIP Document splitting" to "Document splitting" Oct 8, 2020

flozi00 commented Oct 8, 2020

Sorry, I have too much work at the moment.
Left a review.


tholor commented Oct 19, 2020

Implemented a first version in #473.
We might add token-based splits in a later version as commented in the PR.

@tholor tholor closed this as completed Oct 19, 2020