
Document splitting #377

Closed
flozi00 opened this issue Sep 15, 2020 · 7 comments

flozi00 commented Sep 15, 2020

This is a first draft of how the handling of large sentences could look.
I would also do splitting by words (spaces, with a percentage of overlap) and by tokens with a percentage of overlap.
Looking forward to discussing it :)

def splitUpBySentence(context, chars_to_replace=[[",", ""], [".", "\n"], [":", ""], ["!", "\n"], ["?", "\n"]], split_char="\n", qna_seq_len=128):
    # Note: chars_to_replace and split_char are not used yet in this draft.
    corpus = []
    tmp_cont = None

    for x in context:
        if tmp_cont is None:
            tmp_cont = x + " \n"
        elif len((tmp_cont + x + "\n").split(" ")) < int(qna_seq_len * 0.6):
            # keep appending as long as the word count stays below 60% of qna_seq_len
            tmp_cont = tmp_cont + x + " \n"
        else:
            # block is full: store it and start the next block with the last
            # lines of the previous one as overlap
            tmp_corp = tmp_cont.split("\n")
            corpus.append(tmp_cont)
            tmp_cont = ""
            if len(tmp_corp) > 3:
                subtract = 2
            elif len(tmp_corp) > 1:
                subtract = 1
            else:
                subtract = 0
            for xy in range(len(tmp_corp) - subtract):
                try:
                    tmp_corp.pop(xy)
                except IndexError:
                    pass
            for entry in tmp_corp:
                tmp_cont += entry + " \n"

    corpus.append(tmp_cont)

    return corpus
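
A quick usage sketch (assuming context is a list of already segmented sentences; that assumption is mine, the exact input format is discussed further below):

# Usage sketch. Assumes `context` is a list of already segmented sentences
# (my assumption; the exact input format is discussed further below).
sentences = [
    "Haystack answers questions over large document collections.",
    "Long documents should be split into shorter passages before indexing.",
    "Otherwise the reader gets slow and the retriever less accurate.",
]
passages = splitUpBySentence(sentences, qna_seq_len=128)
print(passages)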

@tholor

@flozi00 flozi00 added the type:feature New feature or request label Sep 15, 2020
@tholor tholor self-assigned this Sep 15, 2020

tholor commented Sep 16, 2020

Hi @flozi00 ,

Thanks for this first draft!
As discussed before: I totally see the value of splitting long files into meaningful, shorter docs/passages before ingesting them into the document store. As this has quite big implications further down the pipeline, I want to make sure that we are on the same page here...

Goals:

  1. Increasing speed of Reader (the longer the docs passed from the retriever, the slower the reader)
  2. Improving accuracy of retriever (especially DPR relies on shorter passages <= 512 tokens, but also BM25 can benefit if we don't index a doc with 100 pages)
  3. Don't lose potential answers due to a split across docs (e.g. first 2 tokens of the answer in doc 1, last 2 tokens in doc 2 => impossible for the reader to find it)

Current options in Haystack:
a) Split by paragraph (split string by "\n\n")
b) Split by page (during file conversion from pdf / docx)

Further options I can see:
i) Split by a fixed number of tokens

  • Pros: Efficiently utilizes the available max_seq_len of the reader / retriever (usually 512 or 384)
  • Cons: Depends on model type and vocab; additional complexity in the preprocessing step due to the tokenizer

ii) Split by a fixed number of words that heuristically equals a certain number of tokens

  • Pros: Simple & fast; similar to the approach in the DPR paper; no dependency on a tokenizer
  • Cons: The ratio of words to tokens can vary across domains / languages / models

In addition, we need one approach to avoid splits in the middle of an answer (Goal 3):
a) respecting sentence boundaries
b) sliding window (i.e. overlapping parts between split docs)
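
For illustration, a minimal sketch of option ii) combined with a sliding window could look like this (function name and parameters are just placeholders, not existing Haystack code):

# Minimal sketch of option ii) plus a sliding window: chunks of `split_length`
# words with `split_overlap` words shared between neighbouring chunks.
# Names and defaults are placeholders, not an existing Haystack API.
def split_by_word_count(text, split_length=100, split_overlap=10):
    words = text.split()
    stride = split_length - split_overlap
    assert stride > 0, "overlap must be smaller than split_length"
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break
    return chunks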

Happy to discuss this further and implement one of the above together.
We are also planning bigger refactorings of the whole preprocessing pipeline where we could fit these methods in (see epic #378, and subtask #382)


Regarding your draft I have a few questions:

This is a first draft of how the handling of large sentences could look.

Do you really mean large sentences here? I'd rather think about large texts / documents ...

  1. What is the input that we pass here as "context"? List of lines from one document? List of sentences from previous sentence segmentation?

  2. What is the high-level goal that you want to achieve here? I understand that you append lines of a doc until their words exceed 60% of max_seq_len, but it's not clear what you intend to do with this block:

        else:
            tmp_corp = tmp_cont.split("\n")
            corpus.append(tmp_cont)
            tmp_cont = ""
            if len(tmp_corp) > 3:
                subtract = 2
            elif len(tmp_corp) > 1:
                subtract = 1
            else:
                subtract = 0
            for xy in range(len(tmp_corp) - subtract):
                try:
                    tmp_corp.pop(xy)
                except IndexError:
                    pass
            for entry in tmp_corp:
                tmp_cont += entry + " \n"

    corpus.append(tmp_cont)

Why do you append tmp_cont to corpus at the start and then again at the very end?

All in all: I appreciate the direction, but we should align on the concept first before going forward with implementation details


flozi00 commented Sep 16, 2020

Hi, sorry for the bugs in the code.
I copied and modified an internal piece of code for this thread and didn't check that it really works.

I totally agree with the goals; that's the reason why we implemented those features.

Yeah, I meant documents instead of sentences.

The 60% comes from another piece of code I forgot to refactor correctly; it is used because we take other numbers from customer settings in our software.

In the case of Haystack there is a list of JSON arrays to index, right?
I would iterate over this list and append to a clean list, while linking to the original in the meta data.
So the input would be a list; actually, it would be a string.

@flozi00 flozi00 changed the title from "[Feature] handling large sentences while indexing" to "[Feature] handling large documents while indexing" Sep 20, 2020

tholor commented Sep 22, 2020

Hey @flozi00,

Just to set the context, we have three main steps in Haystack for preprocessing: File conversion, Splitting and Cleaning.
As mentioned above, we will refactor the objects and interactions soon. Your PR currently addresses Splitting and Cleaning.
I suggest focusing just on splitting for now, as we will probably separate both steps in the future pipeline.

In the case of Haystack there is a list of JSON arrays to index, right?
I would iterate over this list and append to a clean list, while linking to the original in the meta data.
So the input would be a list; actually, it would be a string.

For your method you can expect:

  • Input: a list of dicts
[
{"text": "some very long text that we want to split", "meta": {"filename": "some_value", ...}},
...
]
  • Output: a list of dicts where the text is split
[
{"text": "some very long text", "meta": {"filename": "some_value", ...}},
{"text": " that we want to split", "meta": {"filename": "some_value", ...}},
...
]

We could later see if it makes sense to add some split info to meta (e.g. "split" = 1, 2, 3)
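
For illustration, a minimal sketch of a splitter that follows this contract could look like this (the function name and the simple word-based split are placeholders, not the final interface):

# Sketch of the expected contract: take a list of dicts, return a list of
# dicts with split texts and the original meta carried over. Function name
# and the simple word-based split are placeholders, not the final interface.
def split_documents(docs, split_length=100):
    result = []
    for doc in docs:
        words = doc["text"].split()
        for i in range(0, len(words), split_length):
            result.append({
                "text": " ".join(words[i:i + split_length]),
                # copy the meta so the splits don't share one dict;
                # a split counter could be added here later as mentioned above
                "meta": dict(doc["meta"]),
            })
    return result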

@tholor tholor changed the title from "[Feature] handling large documents while indexing" to "WIP Document splitting" Sep 22, 2020

flozi00 commented Sep 24, 2020

https://gist.github.com/flozi00/ead2cf450f8e5db8cb3891b7a8ee7bd5

I took the time to clean up the code; this is another draft.
Better now?

@tholor tholor added this to the #2 milestone Oct 6, 2020

tholor commented Oct 8, 2020

@flozi00 thx, this looks better and we are currently implementing something similar to your sketch here: https://github.com/deepset-ai/haystack/pull/473/files#diff-2df6944ad0156c6b18c3f6f47c648281R74

We'll have:

  • splits based on "words", sentences or passages
  • an option to respect sentence boundaries e.g. when splitting every 100 words
  • an option for a sliding window

Maybe you can have a look at the PR once it's ready and give us feedback if that also covers your use case?
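
For illustration, a rough sketch of the second point (splitting every N words while keeping sentences intact) could look like this; it is a hypothetical helper, not the actual code in #473:

import re

# Rough sketch: greedily group sentences so that each split stays below
# `split_length` words, i.e. word-based splits that respect sentence
# boundaries. Hypothetical helper, not the implementation in #473.
def split_respecting_sentence_boundaries(text, split_length=100):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    splits, current, current_len = [], [], 0
    for sent in sentences:
        n_words = len(sent.split())
        if current and current_len + n_words > split_length:
            splits.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n_words
    if current:
        splits.append(" ".join(current))
    return splits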

@tholor tholor changed the title from "WIP Document splitting" to "Document splitting" Oct 8, 2020

flozi00 commented Oct 8, 2020

Sorry, I have too much work at the moment.
Left a review.


tholor commented Oct 19, 2020

Implemented a first version in #473.
We might add token-based splits in a later version as commented in the PR.

@tholor tholor closed this as completed Oct 19, 2020