Japanese Support #145

dezyh · 2024-12-19T13:10:08Z

I'd like to try to implement OCR support for Japanese (when time permits). I don't expect to finish anything soon, as I'm very inexperienced with OCR. I'm mainly making this issue to track and coordinate my work in case anyone else is interested in contributing or wants to offer advice.

I'd personally like to focus the OCR on manga content initially. However, I see a general-purpose OCR as the final goal. If we get to that point, I'm not sure if a separate model optimized for manga would be in scope of this project's direction or not.

1. Challenges

There are a few major differences from Latin scripts which will need to be addressed.

a) Kanji

There are many more kanji than there are Latin characters: probably around 2,000 common kanji, and on the order of 10,000 kanji in current use.
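
To get a feel for how large the recognition alphabet would actually be, here is a minimal sketch that classifies characters by script and builds an alphabet from a text corpus. The function names (`script_of`, `build_alphabet`) and the `min_count` threshold are made up for illustration and are not part of ocrs. The ranges are the standard Unicode blocks for hiragana, katakana, and the basic CJK Unified Ideographs block; rarer kanji in extension blocks are not covered.

```python
from collections import Counter

# Unicode block ranges (inclusive) for the main Japanese scripts.
HIRAGANA = (0x3040, 0x309F)
KATAKANA = (0x30A0, 0x30FF)
KANJI = (0x4E00, 0x9FFF)  # basic CJK Unified Ideographs block only


def script_of(ch: str) -> str:
    """Return a coarse script label for a single character."""
    cp = ord(ch)
    if HIRAGANA[0] <= cp <= HIRAGANA[1]:
        return "hiragana"
    if KATAKANA[0] <= cp <= KATAKANA[1]:
        return "katakana"
    if KANJI[0] <= cp <= KANJI[1]:
        return "kanji"
    return "other"


def build_alphabet(corpus, min_count=1):
    """Collect every non-space character seen at least `min_count` times.

    `min_count` is a hypothetical knob for dropping very rare characters,
    which keeps the model's output layer smaller.
    """
    counts = Counter(ch for line in corpus for ch in line if not ch.isspace())
    return sorted(ch for ch, n in counts.items() if n >= min_count)
```

Running `build_alphabet` over a representative corpus and counting how many entries `script_of` labels as `"kanji"` would give a concrete number to size the model against.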

b) Layout: Horizontal / Vertical

Text can be written either vertically (縦書き) or horizontally (横書き).
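
One simple way to guess the writing direction of a detected text line is from the aspect ratio of its bounding box. This is only a rough sketch: the `Box` type, the `guess_orientation` function, and the `ratio` threshold of 1.5 are assumptions for illustration and do not reflect how ocrs represents or detects lines.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned bounding box of a detected text line (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float

    @property
    def width(self) -> float:
        return self.right - self.left

    @property
    def height(self) -> float:
        return self.bottom - self.top


def guess_orientation(box: Box, ratio: float = 1.5) -> str:
    """Guess the writing direction from the box shape.

    A line much taller than it is wide is treated as vertical, and one
    much wider than it is tall as horizontal. Anything in between is
    reported as ambiguous.
    """
    if box.height >= ratio * box.width:
        return "vertical"
    if box.width >= ratio * box.height:
        return "horizontal"
    return "ambiguous"
```

Very short lines (one or two characters) come out as `"ambiguous"` with this heuristic, so a real implementation would likely need context from neighbouring lines or from the recognition model itself.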

c) Annotations: Furigana / Ruby text

Text can have annotations either on the right (for vertical text) or above (for horizontal text).

This text usually explains how certain words written in kanji should be read, but can also be used by authors to provide synonyms, nuances, etc. It is therefore valuable to extract in the OCR. However, since it only adds information to the base text, it should be possible to separate it from the base text in the OCR's output.

This is definitely going to require the WIP layout engine.
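
To make the "separate it from the base text" idea concrete, here is a minimal sketch of an output structure that keeps furigana attached to the base text it annotates, so a caller can choose whether to include it. The `Segment` type and both helper functions are hypothetical and are not an existing ocrs API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """A run of base text with an optional ruby (furigana) annotation."""
    base: str
    ruby: Optional[str] = None


def base_text(segments: List[Segment]) -> str:
    """Return only the base text, ignoring all annotations."""
    return "".join(s.base for s in segments)


def annotated_text(segments: List[Segment]) -> str:
    """Return the base text with each annotation inlined in parentheses."""
    return "".join(
        f"{s.base}({s.ruby})" if s.ruby else s.base for s in segments
    )


line = [
    Segment("漢字", "かんじ"),
    Segment("を"),
    Segment("読", "よ"),
    Segment("む"),
]
print(base_text(line))       # 漢字を読む
print(annotated_text(line))  # 漢字(かんじ)を読(よ)む
```

The hard part is deciding which detected small text belongs to which base characters, which is where the layout engine would come in.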

d) Fonts / Handwriting

I think various fonts should be supported. However, I think handwriting will be too difficult initially, as there can be quite a big difference between digital and handwritten characters. I would propose that a working OCR engine/model for digital text is implemented first, and then handwritten text can be trained and optimized later.

2. Training Data

I will need to conduct more research into this...

a) Datasets

Manga109-s
- Available for commercial use (with some nuances)
- Cannot be redistributed (should such a dataset even be considered, given Ocrs's requirements for datasets?)

b) Synthetic data

In the absence of a good dataset, one possibility is to generate synthetic data. This was used in robertknight/mana-ocr, in its synthetic data generator. I'm thinking we could start with this until a good dataset is found, made, or becomes available for use.
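
As a rough idea of what the label side of a synthetic generator could look like, here is a minimal sketch that samples random text strings and a layout for each one. It is not based on the generator mentioned above, whose internals I haven't checked. The `sample_labels` function and its parameters are hypothetical, and rendering each sample to an image with a font is deliberately left out.

```python
import random


def sample_labels(alphabet, n_samples, min_len=1, max_len=12, seed=0):
    """Sample random (text, layout) pairs to be rendered as training images.

    Characters are drawn uniformly from `alphabet`, which is an
    assumption made for simplicity. Sampling from real sentences would
    give more natural character sequences.
    """
    rng = random.Random(seed)  # seeded so that runs are reproducible
    samples = []
    for _ in range(n_samples):
        length = rng.randint(min_len, max_len)
        text = "".join(rng.choice(alphabet) for _ in range(length))
        layout = rng.choice(["horizontal", "vertical"])
        samples.append({"text": text, "layout": layout})
    return samples
```

Each sample would then be drawn onto a background image in the chosen layout, with the sampled text kept as the ground-truth label.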

Related Projects

https://github.com/kha-white/manga-ocr

robertknight · 2024-12-19T14:00:57Z

> I'm not sure if a separate model optimized for manga would be in scope of this projects direction or not.

Yes. Script, language or task-specific models are welcome.

In general there is a trade-off between model size and capacity, so even though it is possible to create one large model which recognizes "everything", smaller and more limited models can still be useful.
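
One place where this trade-off shows up directly is the final classification layer, whose size grows linearly with the number of characters the model can output. The sketch below only counts the parameters of a single linear output layer. The hidden size of 256 and the alphabet sizes are illustrative assumptions, not the actual Ocrs architecture.

```python
def output_layer_params(hidden_dim: int, n_classes: int) -> int:
    """Parameter count of a linear layer: weight matrix plus bias vector."""
    return hidden_dim * n_classes + n_classes


HIDDEN_DIM = 256  # assumed feature size, for illustration only

# Roughly 100 output classes for a Latin-script alphabet.
print(output_layer_params(HIDDEN_DIM, 100))   # 25700

# Roughly 3,000 output classes for kana plus common kanji.
print(output_layer_params(HIDDEN_DIM, 3000))  # 771000
```

Under these assumptions the output layer alone is about 30 times larger for the Japanese alphabet, before accounting for any extra capacity the earlier layers need to tell visually similar kanji apart.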

robertknight · 2024-12-19T14:09:52Z

> Cannot be redistributed (should such a dataset even be considered with ocrs requirements for datasets?)

Ocrs's "core" models need to be trained exclusively on openly licensed data, but additional models trained on more restrictive datasets can be created and published, as long as the usage terms are clearly identified.
