Inconsistent text inference output with plain text #225

Open
ep0p opened this issue Nov 25, 2024 · 14 comments

ep0p commented Nov 25, 2024

I'm encountering an issue when using GOT for inference on plain text. The output is not consistent: sometimes it detects the text correctly, but other times it introduces spaces between letters, creating nonsense words:


For example:

Correct output: This is a well-detected text.
Incorrect output: Th i s i s a tex t wi thspa c e s.

This inconsistency becomes particularly problematic when processing PDFs with multiple pages. Even if most pages are recognized correctly, a couple of pages may have this spacing issue, which disrupts the results.

I can't figure out why this happens or how to enforce a consistent format so that only the "good" text format is produced.

@paulgekeler

If the model's predictions are a little off, perhaps because your PDFs (format, content, and so on) deviate to some degree from the training material, this is nothing to worry about and not uncommon. The fonts or spacing may be different and therefore harder for the model to parse correctly. I'd suggest post-processing the predictions yourself in this case, using an NLP package to detect word boundaries (an idea from here) and remove the faulty spacing within those boundaries (see the sketch just below). Or you could fine-tune the model on your data, which, if it's just a spacing issue, should resolve it quickly.
I also notice the text is French and looks like a legislative text or conference protocol, which might also contribute to the problem...
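
For reference, a minimal post-processing sketch along those lines, assuming the wordninja package (its default model is built from English word frequencies; for French you would load a custom frequency list via wordninja.LanguageModel). The sample string is illustrative:

    import re

    import wordninja  # pip install wordninja

    def fix_spacing(line: str) -> str:
        """Collapse the spurious spaces, then re-derive likely word boundaries."""
        collapsed = re.sub(r"\s+", "", line)
        # wordninja.split() segments a run of letters into probable words;
        # punctuation is treated as a separator and dropped in the process.
        return " ".join(wordninja.split(collapsed))

    print(fix_spacing("Th i s i s a tex t wi thspa c e s"))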


ep0p commented Nov 27, 2024

Hi @paulgekeler,

Indeed, when using the fine-tuned version this issue no longer occurs.
However, it is replaced by pages being ignored entirely, with not a single word on them recognized.

Do you have any idea whether GOT can handle images that are skewed?
Another guess is that it might be the noise and I need to fine-tune with noisy images; I'll look into that.

PS: all my documents are French legal documents with, sometimes, complicated layouts.

@paulgekeler

@ep0p yes, I've experienced the same thing. When I try to run multi-page inference, I barely get any output, maybe the first couple of lines of text. My suspicion is that the compression of the visual information is too much for dense text over multiple pages. I think their multi-page training consisted of multiple pages of sparse text.

@Ucas-HaoranWei
Owner

Hi, it would help to use a for-loop for multi-page inference; the multi-page setting is only for training. More details can be found in the paper.
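
For reference, a minimal per-page loop along those lines. It assumes the Hugging Face weights at ucaslcl/GOT-OCR2_0 and the chat() API shown in the repo README, plus pdf2image (which needs poppler installed); the path and dpi are illustrative:

    import os
    import tempfile

    from pdf2image import convert_from_path           # pip install pdf2image
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained('ucaslcl/GOT-OCR2_0', trust_remote_code=True)
    model = AutoModel.from_pretrained('ucaslcl/GOT-OCR2_0', trust_remote_code=True,
                                      low_cpu_mem_usage=True, device_map='cuda',
                                      use_safetensors=True,
                                      pad_token_id=tokenizer.eos_token_id).eval().cuda()

    pages = convert_from_path('document.pdf', dpi=200)  # one PIL image per page
    results = []
    for page in pages:
        # chat() takes an image file path, so write each page to a temporary file
        with tempfile.NamedTemporaryFile(suffix='.jpg', delete=False) as tmp:
            page.save(tmp.name, 'JPEG')
        results.append(model.chat(tokenizer, tmp.name, ocr_type='ocr'))
        os.remove(tmp.name)

    full_text = '\n\n'.join(results)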

@paulgekeler

@Ucas-HaoranWei thanks, I read the paper. I will try to fine-tune some more on multi-page data.


ep0p commented Nov 27, 2024

@paulgekeler and @Ucas-HaoranWei
In my case, I split the PDF into images and performed inference in a loop, page by page.
Some pages were ignored, even though they had the same format as the others. However, it seemed to me that they were slightly tilted.
I deskewed them, and this apparently helped because they were properly recognized afterward.

Would fine-tuning with skewed images help in this case?
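
For reference, one lightweight way to deskew pages before inference; a sketch assuming the deskew and scikit-image packages (file names are illustrative):

    import numpy as np
    from deskew import determine_skew        # pip install deskew
    from skimage import io
    from skimage.color import rgb2gray
    from skimage.transform import rotate

    # Estimate the page's skew angle from a grayscale copy, then rotate to correct it.
    image = io.imread('page_tilted.jpg')
    angle = determine_skew(rgb2gray(image))
    if angle is not None:
        corrected = rotate(image, angle, resize=True) * 255
        io.imsave('page_deskewed.jpg', corrected.astype(np.uint8))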

@paulgekeler

@ep0p pretty sure it would. Nougat and Donut, for example, also distort some page images before training to increase robustness.
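
A small augmentation sketch along those lines, assuming the albumentations package (parameter names as in albumentations 1.x; the values are illustrative, not tuned):

    import cv2
    import albumentations as A  # pip install albumentations

    # Randomly skew, warp and degrade page images to harden the model against real scans.
    augment = A.Compose([
        A.Rotate(limit=5, border_mode=cv2.BORDER_REPLICATE, p=0.7),     # small skew
        A.Perspective(scale=(0.01, 0.03), p=0.3),                       # slight warp
        A.GaussNoise(var_limit=(5.0, 30.0), p=0.5),                     # scanner noise
        A.ImageCompression(quality_lower=60, quality_upper=95, p=0.5),  # JPEG artifacts
    ])

    img = cv2.imread('pages_images/page_0001.jpg')
    noisy = augment(image=img)['image']
    cv2.imwrite('pages_images_noise/noisy_0001.jpg', noisy)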


ep0p commented Nov 28, 2024

@paulgekeler thanks a lot. I will add a skewed subset to my dataset as well and attempt fine-tuning.


thhung commented Nov 30, 2024

@ep0p Did you manage to fine-tune on your dataset? If you did so successfully, would you mind sharing the format of your data and your training settings?


ep0p commented Dec 2, 2024

@thhung I did finish the fine-tuning with no errors; however, I don't know if I can call it successful.

My dataset at the moment contains around 6k images (full-page images from documents), and in the JSONL file the records are of this type:

{"query": "\nOCR:", "response": "LE GREFFIER LE PRESIDENT\n6", "images": ["/home/epop/DATASET/GOT/output_GOT_dataset_100k/pages_images_noise/noisy_110.jpg"]}

As for the training parameters, I changed only the batch size, the number of epochs, and fp16, in order to speed up training a bit and use less memory, since I can only work with 2 GPUs at a time.
The rest I kept at default values:

    export CUDA_VISIBLE_DEVICES=0,1
    swift sft \
        --model_type got-ocr2 \
        --model_id_or_path stepfun-ai/GOT-OCR2_0 \
        --sft_type lora \
        --dataset /home/epop/DATASET/GOT/output_GOT_dataset_100k/recognition_data_noise_5000_copy.jsonl \
        --batch_size 4 \
        --output_dir output \
        --save_on_each_node true \
        --dtype fp16 \
        --num_train_epochs 100

PS: also, since I can have full pages of text, I changed max_length: int = -1 (-1 meaning no limit) under the SftArgument class.


thhung commented Dec 2, 2024

@ep0p So the performance is not as you expected?


ep0p commented Dec 2, 2024

@thhung It is more accurate than the original model, but there are still words that are not recognized properly.
I will try to increase my dataset or add more noise to make it more robust.
If that still does not provide satisfactory results, I will switch to training from scratch.

@paulgekeler

@ep0p Did you try to fine-tune with the prompt "OCR with format across multiple pages: "? Because you are fine-tuning on multiple pages, right?


ep0p commented Dec 3, 2024

@paulgekeler Since I had to introduce multiple types of noise, diversify fonts, and skew the images, I kept it simple with one page per entry. So, I opted for straightforward "OCR" training rather than fine-tuning for multiple pages.
