Trainable reading order #492
Fix normalization factors for RO datasets
The model and greedy decoding used in both papers are the same; the only change is the hierarchical decoder, which doesn't really work when lines can be found outside of regions (at least not easily, and assuming I understand their implementation correctly). But I never got their code to run properly (it obviously has only a vague connection to the one used to produce the paper) and substituted my own for everything but the decoder. It shouldn't really matter, though, as the whole thing is really simple (the primary reason I implemented it in the first place).
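For reference, a minimal sketch of what such a greedy decoding step can look like, assuming a matrix `P` where `P[i, j]` is the predicted probability that element `i` precedes element `j` (names and tie-breaking are illustrative, not the actual code in this branch):

```python
import numpy as np

def greedy_order(P):
    """Greedily build a total order from pairwise precedence probabilities.

    P[i, j] is the estimated probability that element i comes before j.
    At each step the remaining element most likely to precede all other
    remaining elements is emitted next.
    """
    remaining = list(range(P.shape[0]))
    order = []
    while remaining:
        scores = [sum(P[i, j] for j in remaining if j != i) for i in remaining]
        best = remaining[int(np.argmax(scores))]
        order.append(best)
        remaining.remove(best)
    return order

# toy example: element 0 should come before 1, which comes before 2
P = np.array([[0.5, 0.9, 0.8],
              [0.1, 0.5, 0.7],
              [0.2, 0.3, 0.5]])
print(greedy_order(P))  # [0, 1, 2]
```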
They also dropped the RDF model in the second paper (although it did perform better than the MLP on the table dataset). Moreover, they introduced a region-level model (with a slightly different feature vector) in the second paper, which is needed for the hierarchical mode. But while they trained on multiple datasets, they did not investigate how well models generalize from one domain to another or whether training would benefit from cross-domain data. To me, all of that is significant enough to warrant additional experimentation. (Especially if you look at the mistakes their model still makes: even if the rank distance is low, that does not per se mean you get a high-quality order. The prediction sometimes alternates wildly between different parts of the page, not just at difficult/ambiguous spots.)
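(To make the rank distance point concrete: a low average rank distance does not rule out an order that keeps jumping between page halves. A toy Spearman footrule computation, not the metric implementation from the papers:)

```python
def footrule_distance(pred, truth):
    """Mean absolute difference of rank positions (Spearman footrule).

    pred and truth are lists of the same element ids, each giving one
    complete reading order.
    """
    pos_pred = {e: i for i, e in enumerate(pred)}
    pos_truth = {e: i for i, e in enumerate(truth)}
    return sum(abs(pos_pred[e] - pos_truth[e]) for e in truth) / len(truth)

# an order alternating between two page halves still scores a modest 1.5
print(footrule_distance([0, 4, 1, 5, 2, 6, 3, 7], list(range(8))))
```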
The hierarchical model simply restricts the topology during training (avoiding arbitrary line pairs); it does not look at coordinates (in particular not whether polygons are properly contained). When decoding, you first apply the region-level model to the regions, then the hierarchical line-level model to the lines within each region.
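In pseudocode terms, the hierarchical decoding described above amounts to something like the following sketch (the `order_regions`/`order_lines` helpers stand in for the region-level and line-level pairwise models; the names are hypothetical):

```python
def hierarchical_order(regions, lines_by_region, order_regions, order_lines):
    """Order regions with the region-level model, then order the lines
    inside each region with the line-level model and concatenate.

    Only line pairs from the same region are ever compared, which is the
    topology restriction mentioned above.
    """
    reading_order = []
    for region in order_regions(regions):
        reading_order.extend(order_lines(lines_by_region[region]))
    return reading_order
```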
It's inconsistent, yes; the last state was likely from somewhere in the middle of the second paper. But I can reproduce their published figures from it. See the issue above for the details.
I agree it's not necessary to re-use their code in the end. But IMO the hierarchical variant (and probably also some transfer learning) should be used. And/or perhaps one could restrict the decoder with some basic geometric rules to avoid obvious silly mistakes.
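Such geometric rules could be as crude as damping pairwise decisions that contradict coarse layout cues, e.g. (purely illustrative, threshold and scaling made up):

```python
def apply_geometric_prior(P, centroids, column_gap=200):
    """Dampen 'i before j' probabilities when i sits far to the right of j,
    a rough proxy for not jumping back across a column boundary in
    left-to-right column layouts.

    P: pairwise precedence matrix, centroids: (x, y) line centre points.
    """
    P = P.copy()
    for i in range(len(centroids)):
        for j in range(len(centroids)):
            if centroids[i][0] - centroids[j][0] > column_gap:
                P[i, j] *= 0.1
    return P
```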
We don't have terribly high hopes for the generalization of this method, as the model is a) deeply tied to the segmentation model's typology and b) unable to deal with mixed-directional inputs.
The issue arises when lines exist outside of regions. Ordering the regions first and then the lines inside each region doesn't work in that case, as there's no way to know where to insert the non-region-affiliated lines: you can't feed mixed line-region feature pairs into either the region-level or the line-level model.
I also have doubts, but perhaps this could be factored in with additional features (e.g. enabling/disabling segment categories with additional input, both during training and inference).
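One possible shape for such an input is a class-activation mask appended to the per-line feature vector, along these lines (hypothetical feature layout, not part of this PR):

```python
import numpy as np

def line_feature(line, class_ids, active_classes):
    """Per-line feature vector with a class-activation mask appended.

    line: dict with 'class', 'center' (x, y) and 'extent' (w, h) entries.
    class_ids: mapping from class name to index.
    active_classes: categories enabled for this dataset/run, so a single
    model could learn to ignore classes that are switched off.
    """
    onehot = np.zeros(len(class_ids))
    onehot[class_ids[line['class']]] = 1.0
    mask = np.zeros(len(class_ids))
    for cls, idx in class_ids.items():
        mask[idx] = 1.0 if cls in active_classes else 0.0
    return np.concatenate([onehot, np.asarray(line['center']),
                           np.asarray(line['extent']), mask])
```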
Sure – generalization to bottom-up or left-to-right text line order systems is unlikely to be possible. But within the same system you still have divergent material (as the paper shows) – with/without columns, marginals, tables, print vs. handwriting, etc. I would hope to at least gain some coverage in that area by curating training data/schemes.
Oh, now I remember. That's why you insisted ALTO v4.3 should have reading order even on the line level. I would argue that either this kind of segment is typical and common – in which case you should not need a hierarchical/region-level model – or it is special and rare – in which case probably the best design would be to add a dedicated hierarchy level (say 'insertion'):
Fix small bugs in the Feature/reading order branch
Would drastically slow down display of help message in CLI drivers
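(Presumably this refers to eager top-level imports of heavy libraries; the usual remedy is to defer them into the command body so that rendering `--help` stays cheap. A generic sketch, not the actual change in this branch:)

```python
import click

@click.command()
def rorder():
    """Hypothetical subcommand: heavy dependencies are imported inside the
    command body instead of at module level."""
    import torch  # deferred: not needed just to display the help text
    ...
```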
@mittagessen are there any pretrained segmentation models with neural reading order available that one can already try? And do you plan to add RO to the builtin blla.mlmodel? Finally, do you have eval results to share (or will there be a paper)?
The code is a fairly straightforward adaptation of the method mentioned above, so any results should translate (it certainly isn't publishable from my point of view). We've run some tests and it generally seems to perform better than the heuristic for specific use cases. The big BUT here, though, is that the net only uses line features (class, position, and extents) without any visual features for determining order. That makes it a poor choice for a default model covering different text directions, as it doesn't know whether it is looking at Latin or Arabic text (in the absence of such line classes) and will order columns incorrectly. I'd say it is mostly useful for people who are training a new segmentation model for some material that isn't well captured by the default model and want better reading order for only slight computational overhead and no manual annotation effort. I've written some basic documentation here.
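For illustration, a pair scorer over exactly those line features (class one-hot, centre position, extents) could look like the sketch below; the layer sizes and names are assumptions, not the actual network in this branch:

```python
import torch
import torch.nn as nn

class PairOrderNet(nn.Module):
    """Small MLP estimating P(line_a precedes line_b) from line features
    only; with no visual features it has no notion of script direction."""
    def __init__(self, feat_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * feat_dim, 64),
                                 nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, feat_a, feat_b):
        return torch.sigmoid(self.net(torch.cat([feat_a, feat_b], dim=-1)))

def line_features(cls_onehot, center, extent):
    """Per-line feature vector: class one-hot + (x, y) centre + (w, h)
    extents, normalised to page size elsewhere."""
    return torch.cat([cls_onehot, center, extent])
```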
This pull request is an implementation of this article, modelling the ordering of baselines/regions as estimated binary order relations.
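In outline, training such a binary order relation model turns each ground-truth reading order into labelled line pairs; a minimal sketch (building on the hypothetical `PairOrderNet` above, not the training loop actually used in the PR):

```python
import torch

def training_step(model, optimizer, feats, gt_order):
    """One step over all ordered line pairs of a page.

    feats: (n, feat_dim) tensor of per-line features.
    gt_order: ground-truth reading order as a list of line indices.
    Every pair (i, j) with i != j is an example labelled 1 if i precedes j.
    """
    rank = {line: r for r, line in enumerate(gt_order)}
    loss_fn = torch.nn.BCELoss()
    loss, pairs = 0.0, 0
    for i in range(len(gt_order)):
        for j in range(len(gt_order)):
            if i == j:
                continue
            label = torch.tensor([1.0 if rank[i] < rank[j] else 0.0])
            loss = loss + loss_fn(model(feats[i], feats[j]), label)
            pairs += 1
    loss = loss / pairs
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```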
ToDo