Japanese Support #145

Open
dezyh opened this issue Dec 19, 2024 · 2 comments

dezyh commented Dec 19, 2024

I'd like to try to implement OCR support for Japanese (when time permits). I don't expect to finish anything soon, as I'm very inexperienced with OCR. I'm mainly making this issue to track/coordinate my work in case anyone else is interested in contributing or wants to offer any advice.

I'd personally like to focus the OCR on manga content initially. However, I see a general-purpose OCR as being the final goal. If we get to that point, I'm not sure if a separate model optimized for manga would be in scope of this project's direction or not.

Related

1. Challenges

There are a few major differences from Latin scripts which will need to be addressed.

a) Kanji

There are many more kanji than there are Latin characters: probably around 2,000 common kanji, and on the order of 10,000 kanji currently in use.
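
To get a feel for how large the character set of a given corpus actually is, something like the following could be used to count distinct characters per script. This is only a rough sketch: the ranges are the main Unicode blocks for hiragana, katakana and CJK Unified Ideographs (extension blocks are ignored), and the function names are placeholders of mine, not anything from ocrs.

```python
from collections import Counter

# Main Unicode blocks only (inclusive ranges); CJK extension blocks are ignored.
SCRIPT_RANGES = {
    "hiragana": (0x3040, 0x309F),
    "katakana": (0x30A0, 0x30FF),
    "kanji": (0x4E00, 0x9FFF),  # CJK Unified Ideographs
}


def classify_char(ch: str) -> str:
    """Return the script name for a single character, or 'other'."""
    code = ord(ch)
    for script, (lo, hi) in SCRIPT_RANGES.items():
        if lo <= code <= hi:
            return script
    return "other"


def script_counts(text: str) -> Counter:
    """Count the distinct characters of each script that appear in text."""
    return Counter(classify_char(ch) for ch in set(text))


sample = "漫画を読むのが好きです。マンガ"
counts = script_counts(sample)
print(dict(counts))
```

Running this over a real training corpus would show how many of the ~10,000 kanji a model's output alphabet needs to cover in practice.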

b) Layout: Horizontal / Vertical

Text can be written either vertically (縦書き) or horizontally (横書き).

For example:
image
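
One naive way to tell the two layouts apart, assuming character bounding boxes are already available in reading order, is to compare how much consecutive character centers move along each axis. This is just an illustrative heuristic with a made-up box format `(left, top, right, bottom)`; it is not how ocrs or its layout engine works.

```python
from typing import List, Tuple

# Hypothetical box format: (left, top, right, bottom) in pixels.
Box = Tuple[float, float, float, float]


def guess_direction(char_boxes: List[Box]) -> str:
    """Guess 'vertical' or 'horizontal' from boxes given in reading order."""
    if len(char_boxes) < 2:
        return "unknown"
    centers = [((l + r) / 2, (t + b) / 2) for l, t, r, b in char_boxes]
    pairs = list(zip(centers, centers[1:]))
    # Total movement between consecutive characters along each axis.
    dx = sum(abs(x2 - x1) for (x1, _), (x2, _) in pairs)
    dy = sum(abs(y2 - y1) for (_, y1), (_, y2) in pairs)
    return "vertical" if dy > dx else "horizontal"


column = [(100, 0, 120, 20), (100, 22, 120, 42), (100, 44, 120, 64)]
row = [(0, 0, 20, 20), (22, 0, 42, 20), (44, 0, 64, 20)]
print(guess_direction(column), guess_direction(row))
```

A real implementation would need to handle mixed layouts on one page (common in manga), which this sketch does not attempt.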

c) Annotations: Furigana / Ruby text

Text can have annotations either on the right (for vertical text) or above (for horizontal text).

This text usually explains how certain words written in kanji should be read, but authors can also use it to provide synonyms, nuances, etc. It is therefore valuable to extract in the OCR. However, since it only adds information to the base text, it should be possible to separate it from the base text in the OCR's output.

This is definitely going to require the WIP layout engine.

For example:
image
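
To illustrate what "separable" output could look like, here is a minimal sketch of a result structure that keeps each ruby annotation attached to the base text it annotates. The `Segment` type and both helper functions are hypothetical and only meant to show the idea; they are not an existing or proposed ocrs API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """A run of base text with an optional ruby (furigana) annotation."""
    base: str
    ruby: Optional[str] = None


def base_text(segments: List[Segment]) -> str:
    """Join only the base text, dropping all annotations."""
    return "".join(seg.base for seg in segments)


def annotated_text(segments: List[Segment]) -> str:
    """Join base text, placing each annotation in parentheses after its base."""
    return "".join(
        f"{seg.base}({seg.ruby})" if seg.ruby else seg.base for seg in segments
    )


line = [Segment("漢字", "かんじ"), Segment("を"), Segment("読", "よ"), Segment("む")]
print(base_text(line))       # 漢字を読む
print(annotated_text(line))  # 漢字(かんじ)を読(よ)む
```

With something like this, callers who only want the base text can ignore the annotations entirely, while others can still recover the readings.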

d) Fonts / Handwriting

I think various fonts should be supported. However, I think handwriting will be too difficult initially, as there can be quite a big difference between digital characters and handwritten characters. I would propose that a working OCR engine/model for digital text is implemented first, and that handwritten text is trained and optimized for later.

2. Training Data

I will need to conduct more research into this...

a) Datasets

  • Manga109-s
    • Available for commercial use (with some nuances)
    • Cannot be redistributed (should such a dataset even be considered given ocrs's requirements for datasets?)

b) Synthetic data

In the absence of a good dataset, one possibility is to generate synthetic data. This approach was used in robertknight/mana-ocr, via its synthetic data generator. I'm thinking we could start with this until a good dataset is found, made, or becomes available for use.
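
As a very rough sketch of the text-sampling half of such a generator (not the generator from the project mentioned above), random strings can be drawn from a target character set with a seeded RNG so that runs are reproducible. The tiny kanji list, the function names and the length limits are all placeholders of mine. Rendering the strings to images with real fonts is left out, and random strings are of course not natural Japanese.

```python
import random
from typing import List


def build_charset() -> List[str]:
    """Build a small example charset: kana plus a handful of kanji."""
    hiragana = [chr(c) for c in range(0x3041, 0x3097)]  # ぁ through ゖ
    katakana = [chr(c) for c in range(0x30A1, 0x30FB)]  # ァ through ヺ
    kanji = list("日本語漫画文字認識学習")  # placeholder; a real list would be far larger
    return hiragana + katakana + kanji


def sample_lines(n: int, min_len: int = 3, max_len: int = 12, seed: int = 0) -> List[str]:
    """Sample n random text lines from the charset, reproducibly for a given seed."""
    rng = random.Random(seed)
    charset = build_charset()
    lines = []
    for _ in range(n):
        length = rng.randint(min_len, max_len)
        lines.append("".join(rng.choice(charset) for _ in range(length)))
    return lines


for text in sample_lines(5):
    print(text)
```

Each sampled string would then be rendered in a variety of fonts and orientations to produce image/label pairs for training.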

Related Projects

robertknight (Owner) commented

I'm not sure if a separate model optimized for manga would be in scope of this project's direction or not.

Yes. Script, language or task-specific models are welcome.

In general there is a trade-off between model size and capacity, so even though it is possible to create one large model which recognizes "everything", smaller and more limited models can still be useful.

robertknight (Owner) commented

Cannot be redistributed (should such a dataset even be considered given ocrs's requirements for datasets?)

Ocrs's "core" models need to be trained exclusively on openly licensed data, but additional models trained on more restrictive datasets can be created and published, as long as the usage terms are clearly identified.
