Inconsistent text inference output with plain text #225

Open
ep0p opened this issue Nov 25, 2024 · 14 comments

ep0p commented Nov 25, 2024

I'm encountering an issue when using GOT for inference on plain text. The output is not consistent: sometimes it detects the text correctly, but other times it introduces spaces between letters, creating nonsense words:


For example:

Correct output: This is a well-detected text.
Incorrect output: Th i s i s a tex t wi thspa c e s.

This inconsistency becomes particularly problematic when processing PDFs with multiple pages. Even if most pages are recognized correctly, a couple of pages may have this spacing issue, which disrupts the results.

I can't figure out why this happens or how to enforce a consistent format so that only the "good" text format is produced.

@paulgekeler

If the model's predictions are a little off, perhaps because your PDFs (format, content, and so on) deviate to some degree from the training material, this is nothing to worry about and not uncommon. The fonts or spacing may be different and therefore harder for the model to parse correctly. I'd suggest post-processing the predictions yourself in this case, using an NLP package to detect word boundaries (an idea from here) and remove the faulty spacing within those boundaries (see the sketch just below). Or you could fine-tune the model on your data, which, if it's just a spacing issue, should resolve it quickly.
I also notice the text is French and looks like a legislative text or conference protocol, which might also contribute to the problem...
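
For reference, a minimal post-processing sketch along those lines, assuming the wordninja package (its default model is built from English word frequencies; for French you would load a custom frequency list via wordninja.LanguageModel). The sample string is illustrative:

    import re

    import wordninja  # pip install wordninja

    def fix_spacing(line: str) -> str:
        """Collapse the spurious spaces, then re-derive likely word boundaries."""
        collapsed = re.sub(r"\s+", "", line)
        # wordninja.split() segments a run of letters into probable words;
        # punctuation is treated as a separator and dropped in the process.
        return " ".join(wordninja.split(collapsed))

    print(fix_spacing("Th i s i s a tex t wi thspa c e s"))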


ep0p commented Nov 27, 2024

Hi @paulgekeler,

Indeed, when using the fine-tuned version this issue no longer occurs.
However, it is replaced by pages being ignored entirely, with not a single word on them recognized.

Do you have any idea whether GOT can handle images that are skewed?
Another guess is that it might be the noise and I need to fine-tune with noisy images; I'll look into that.

PS: all my documents are French legal documents with, sometimes, complicated layouts.

@paulgekeler

@ep0p yes, I've experienced the same thing. When I try to run multi-page inference, I barely get any output, maybe the first couple of lines of text. My suspicion is that the compression of the visual information is too much for dense text over multiple pages. I think their multi-page training consisted of multiple pages of sparse text.

@Ucas-HaoranWei
Owner

Hi, it would help to use a for-loop for multi-page inference; the multi-page setting is only for training. More details can be found in the paper.
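
For reference, a minimal per-page loop along those lines. It assumes the Hugging Face weights at ucaslcl/GOT-OCR2_0 and the chat() API shown in the repo README, plus pdf2image (which needs poppler installed); the path and dpi are illustrative:

    import os
    import tempfile

    from pdf2image import convert_from_path           # pip install pdf2image
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained('ucaslcl/GOT-OCR2_0', trust_remote_code=True)
    model = AutoModel.from_pretrained('ucaslcl/GOT-OCR2_0', trust_remote_code=True,
                                      low_cpu_mem_usage=True, device_map='cuda',
                                      use_safetensors=True,
                                      pad_token_id=tokenizer.eos_token_id).eval().cuda()

    pages = convert_from_path('document.pdf', dpi=200)  # one PIL image per page
    results = []
    for page in pages:
        # chat() takes an image file path, so write each page to a temporary file
        with tempfile.NamedTemporaryFile(suffix='.jpg', delete=False) as tmp:
            page.save(tmp.name, 'JPEG')
        results.append(model.chat(tokenizer, tmp.name, ocr_type='ocr'))
        os.remove(tmp.name)

    full_text = '\n\n'.join(results)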

@paulgekeler

@Ucas-HaoranWei thanks, I read the paper. I will try to fine-tune some more on multi-page data.


ep0p commented Nov 27, 2024

@paulgekeler and @Ucas-HaoranWei
In my case, I split the PDF into images and performed inference in a loop, page by page.
Some pages were ignored, even though they had the same format as the others. However, it seemed to me that they were slightly tilted.
I deskewed them, and this apparently helped because they were properly recognized afterward.

Would fine-tuning with skewed images help in this case?
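
For reference, one lightweight way to deskew pages before inference; a sketch assuming the deskew and scikit-image packages (file names are illustrative):

    import numpy as np
    from deskew import determine_skew        # pip install deskew
    from skimage import io
    from skimage.color import rgb2gray
    from skimage.transform import rotate

    # Estimate the page's skew angle from a grayscale copy, then rotate to correct it.
    image = io.imread('page_tilted.jpg')
    angle = determine_skew(rgb2gray(image))
    if angle is not None:
        corrected = rotate(image, angle, resize=True) * 255
        io.imsave('page_deskewed.jpg', corrected.astype(np.uint8))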

@paulgekeler

@ep0p pretty sure it would. Nougat and Donut, for example, also distort some page images before training to increase robustness.
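
A small augmentation sketch along those lines, assuming the albumentations package (parameter names as in albumentations 1.x; the values are illustrative, not tuned):

    import cv2
    import albumentations as A  # pip install albumentations

    # Randomly skew, warp and degrade page images to harden the model against real scans.
    augment = A.Compose([
        A.Rotate(limit=5, border_mode=cv2.BORDER_REPLICATE, p=0.7),     # small skew
        A.Perspective(scale=(0.01, 0.03), p=0.3),                       # slight warp
        A.GaussNoise(var_limit=(5.0, 30.0), p=0.5),                     # scanner noise
        A.ImageCompression(quality_lower=60, quality_upper=95, p=0.5),  # JPEG artifacts
    ])

    img = cv2.imread('pages_images/page_0001.jpg')
    noisy = augment(image=img)['image']
    cv2.imwrite('pages_images_noise/noisy_0001.jpg', noisy)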


ep0p commented Nov 28, 2024

@paulgekeler thanks a lot. I will add a skewed subset to my dataset as well and attempt fine-tuning.


thhung commented Nov 30, 2024

@ep0p Did you manage to fine-tune on your dataset? If you did so successfully, would you mind sharing the format of your data and your training settings?


ep0p commented Dec 2, 2024

@thhung I did finish the fine-tuning with no errors; however, I don't know if I can call it successful.

My dataset at the moment contains around 6k images (full-page images from documents), and in the JSONL file the records are of this type:

{"query": "\nOCR:", "response": "LE GREFFIER LE PRESIDENT\n6", "images": ["/home/epop/DATASET/GOT/output_GOT_dataset_100k/pages_images_noise/noisy_110.jpg"]}

As for the training parameters, I changed only the batch size, the number of epochs, and fp16, in order to speed up training a bit and use less memory, since I can only work with 2 GPUs at a time.
The rest I kept at default values:

    export CUDA_VISIBLE_DEVICES=0,1
    swift sft \
        --model_type got-ocr2 \
        --model_id_or_path stepfun-ai/GOT-OCR2_0 \
        --sft_type lora \
        --dataset /home/epop/DATASET/GOT/output_GOT_dataset_100k/recognition_data_noise_5000_copy.jsonl \
        --batch_size 4 \
        --output_dir output \
        --save_on_each_node true \
        --dtype fp16 \
        --num_train_epochs 100

PS: also, since I can have full pages of text, I changed max_length: int = -1 (-1 meaning no limit) under the SftArgument class.


thhung commented Dec 2, 2024

@ep0p So the performance is not as you expected?


ep0p commented Dec 2, 2024

@thhung It is more accurate than the original model, but there are still words that are not recognized properly.
I will try to increase my dataset or add more noise to make it more robust.
If that still does not provide satisfactory results, I will switch to training from scratch.

@paulgekeler

@ep0p Did you try to fine-tune with the prompt "OCR with format across multiple pages: "? Because you are fine-tuning on multiple pages, right?


ep0p commented Dec 3, 2024

@paulgekeler Since I had to introduce multiple types of noise, diversify fonts, and skew the images, I kept it simple with one page per entry. So, I opted for straightforward "OCR" training rather than fine-tuning for multiple pages.
