I'd like to try to implement OCR support for Japanese (when time permits). I don't expect to finish anything soon, as I'm very inexperienced with OCR. I'm mainly making this issue to track/coordinate my work in case anyone else is interested in contributing or wants to offer any advice.
I'd personally like to focus the OCR on manga content initially. However, I see general-purpose OCR as the final goal. If we get to that point, I'm not sure if a separate model optimized for manga would be in scope of this project's direction or not.
1. Challenges
There are a few major differences from Latin scripts which will need to be addressed.
a) Kanji
There are many more kanji than there are Latin characters: probably around 2,000 kanji in common use, and on the order of 10,000 in current use overall.
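To get a sense of the scale involved, here is a rough sketch (plain Python, nothing ocrs-specific) of how large a recognition alphabet becomes if it is built naively from the kana ranges plus the full CJK Unified Ideographs block; in practice the kanji set would presumably be restricted to characters that actually occur in the training data:

```python
# Rough size of a Japanese recognition alphabet built from Unicode ranges.
HIRAGANA = range(0x3041, 0x3097)         # ぁ .. ゖ
KATAKANA = range(0x30A1, 0x30FB)         # ァ .. ヺ
CJK_UNIFIED = range(0x4E00, 0xA000)      # the block where common kanji live
ASCII_PRINTABLE = range(0x20, 0x7F)      # keep Latin letters, digits, punctuation

alphabet = sorted(
    chr(cp)
    for block in (ASCII_PRINTABLE, HIRAGANA, KATAKANA, CJK_UNIFIED)
    for cp in block
)
print(len(alphabet))  # ~21,000 symbols, versus ~100 for a typical Latin alphabet
```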
b) Layout: Horizontal / Vertical
Text can be written either vertically (縦書き) or horizontally (横書き).
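One simple heuristic for this at the layout stage (just a sketch of the idea, not how ocrs currently works) is to classify each detected text block as vertical or horizontal from its bounding-box shape, and to order vertical columns right-to-left:

```python
from dataclasses import dataclass

@dataclass
class TextBlock:
    x: float       # left edge of the detected bounding box
    y: float       # top edge
    width: float
    height: float

def is_vertical(block: TextBlock, threshold: float = 1.5) -> bool:
    """Guess 縦書き (vertical) vs 横書き (horizontal) from the block's shape.

    Tall, narrow blocks are treated as vertical columns. Real pages (square
    speech bubbles, single characters) will need extra signals, e.g. character
    spacing or a small learned classifier.
    """
    return block.height > threshold * block.width

def reading_order(blocks: list[TextBlock]) -> list[TextBlock]:
    """Vertical Japanese text is read top-to-bottom, with columns right-to-left."""
    vertical = sorted((b for b in blocks if is_vertical(b)), key=lambda b: -b.x)
    horizontal = sorted((b for b in blocks if not is_vertical(b)), key=lambda b: b.y)
    return vertical + horizontal
```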
c) Annotations: Furigana / Ruby text
Text can have annotations either on the right (for vertical text) or above (for horizontal text).
This text usually indicates how certain words written in kanji should be read, but it can also be used by authors to provide synonyms, nuances, etc. It is therefore valuable to extract during OCR; however, since it only adds information on top of the base text, it should be possible to separate it from the base text in the OCR output.
This is definitely going to require the WIP layout engine.
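Once line/column boxes are available from the layout engine, one possible heuristic for tagging ruby text (again just a sketch; the thresholds and geometry are guesses) is that furigana boxes are much thinner than their base line and sit immediately to its right for vertical text, or immediately above it for horizontal text:

```python
def is_furigana(candidate, base, vertical: bool, max_ratio: float = 0.6) -> bool:
    """Heuristic: is `candidate` a ruby annotation attached to `base`?

    Both arguments are TextBlock-like objects with x/y/width/height fields, as
    in the layout sketch above. Assumptions (guesses, not measured values):
      * furigana glyphs are well under `max_ratio` of the base text size,
      * for vertical text the annotation sits just to the right of the column,
      * for horizontal text it sits just above the line.
    """
    if vertical:
        smaller = candidate.width < max_ratio * base.width
        gap = candidate.x - (base.x + base.width)
        adjacent = 0 <= gap < base.width
        overlaps = candidate.y < base.y + base.height and candidate.y + candidate.height > base.y
    else:
        smaller = candidate.height < max_ratio * base.height
        gap = base.y - (candidate.y + candidate.height)
        adjacent = 0 <= gap < base.height
        overlaps = candidate.x < base.x + base.width and candidate.x + candidate.width > base.x
    return smaller and adjacent and overlaps
```

Blocks flagged this way could then be emitted as annotations attached to the base text rather than concatenated into it.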
d) Fonts / Handwriting
I think various fonts should be supported; however, handwriting may be too difficult initially, as there can be quite a big difference between digital characters and handwritten characters. I would propose implementing a working OCR engine/model for digital text first, and then handwritten text can be trained and optimized later.
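If we go down the synthetic-data route below, one practical prerequisite for supporting various fonts is knowing which fonts actually contain the required glyphs. A small sketch using fontTools (assuming the training tooling stays in Python; the directory and sample string are placeholders):

```python
import glob
from fontTools.ttLib import TTFont  # pip install fonttools

def covers(font_path: str, text: str) -> bool:
    """True if the font at `font_path` has a glyph for every character in `text`."""
    cmap = TTFont(font_path)["cmap"].getBestCmap()  # codepoint -> glyph name
    return all(ord(ch) in cmap for ch in text)

# Filter candidate fonts down to ones that can render kana and kanji before
# using them to render training samples. "fonts/" is a placeholder directory.
sample = "縦書きと横書きの振り仮名"
japanese_fonts = [p for p in glob.glob("fonts/*.otf") if covers(p, sample)]
```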
2. Training Data
I will need to conduct more research into this...
a) Datasets
At least one candidate dataset cannot be redistributed (should such a dataset even be considered with ocrs's requirements for datasets?)
b) Synthetic data
In the absence of a good dataset, one possibility is to generate synthetic data. This was the approach used in robertknight/mana-ocr with its synthetic data generator. I'm thinking we could start with that until a good dataset is found, made, or becomes available for use.
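I haven't studied that generator closely yet, but a minimal sketch of the kind of thing I have in mind is below (Pillow-based; the font path and corpus are placeholders, and vertical layout is faked by stacking characters rather than using proper vertical font metrics):

```python
import random
from PIL import Image, ImageDraw, ImageFont  # pip install pillow

FONT_PATH = "fonts/NotoSansJP-Regular.otf"                  # placeholder font path
CORPUS = ["縦書き", "横書き", "振り仮名", "日本語のテキスト"]  # placeholder strings

def render_sample(text: str, vertical: bool, size: int = 32) -> Image.Image:
    """Render one training image; the ground-truth label is `text` by construction."""
    font = ImageFont.truetype(FONT_PATH, size)
    pad = size // 2
    if vertical:
        # Naive vertical layout: one character per row, top to bottom.
        img = Image.new("L", (size + pad, size * len(text) + pad), color=255)
        draw = ImageDraw.Draw(img)
        for i, ch in enumerate(text):
            draw.text((pad // 2, pad // 2 + i * size), ch, font=font, fill=0)
    else:
        img = Image.new("L", (size * len(text) + pad, size + pad), color=255)
        draw = ImageDraw.Draw(img)
        draw.text((pad // 2, pad // 2), text, font=font, fill=0)
    return img

if __name__ == "__main__":
    for i in range(10):
        text = random.choice(CORPUS)
        render_sample(text, vertical=random.random() < 0.5).save(f"sample_{i}.png")
```

Realistic training data would also need random backgrounds, noise, rotation, furigana, etc., but even this level should be enough to start testing a training pipeline end to end.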
> I'm not sure if a separate model optimized for manga would be in scope of this project's direction or not.
Yes. Script, language or task-specific models are welcome.
In general there is a trade-off between model size and capacity, so even though it is possible to create one large model which recognizes "everything", smaller and more limited models can still be useful.
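As a rough back-of-the-envelope illustration (assuming a recognition head that ends in a linear projection over the alphabet, which is a common text-recognizer design rather than a statement about ocrs's internals), the output layer alone grows linearly with the alphabet size:

```python
hidden_dim = 256          # illustrative feature size
latin_alphabet = 100      # letters, digits, punctuation
japanese_alphabet = 7000  # kana plus a generous kanji set

for name, size in [("latin", latin_alphabet), ("japanese", japanese_alphabet)]:
    params = hidden_dim * size + size   # weights + biases of the final layer
    print(f"{name}: {params:,} output-layer parameters")
# latin: 25,700
# japanese: 1,799,000
```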
> At least one candidate dataset cannot be redistributed (should such a dataset even be considered with ocrs's requirements for datasets?)
Ocrs's "core" models need to be trained exclusively on openly licensed data, but additional models trained on more restrictive datasets can be created and published, as long as the usage terms are clearly identified.