Trainable reading order #492

Merged
merged 69 commits into from
Nov 14, 2023

mittagessen
Owner

@mittagessen mittagessen commented Apr 16, 2023

This pull request is an implementation of this article modelling the sorting of baselines/regions as estimated binary order relations.
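To make "estimated binary order relations" concrete, here is a minimal sketch of greedy decoding from a pairwise matrix. It assumes a classifier has already produced `P[i][j]`, the estimated probability that element `i` is read before element `j`. The function name `greedy_order` and the scoring rule are illustrative assumptions, not necessarily the exact decoder used in the papers or in this pull request.

```python
import math


def greedy_order(P):
    """Greedily decode a reading order from pairwise probabilities.

    P[i][j] is the estimated probability that element i is read before
    element j (the diagonal is ignored). Hypothetical helper for illustration.
    """
    remaining = set(range(len(P)))
    order = []
    while remaining:
        # Score each candidate by the log-probability that it precedes
        # every other element that has not been placed yet.
        def score(i):
            return sum(math.log(max(P[i][j], 1e-12))
                       for j in remaining if j != i)

        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(remaining), key=score)
        order.append(best)
        remaining.remove(best)
    return order


# Three lines whose pairwise estimates agree on the order 0, 1, 2.
P = [
    [0.0, 0.9, 0.8],
    [0.1, 0.0, 0.7],
    [0.2, 0.3, 0.0],
]
print(greedy_order(P))  # [0, 1, 2]
```

Because each step only looks at the elements still unplaced, an early wrong pick cannot be undone later, which is the usual trade-off of greedy decoding.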

ToDo

  • Training
  • Serialization
  • Inference
  • Correct reading order parsing from XML files

@mittagessen
Owner Author

The model and greedy decoding used in both papers are the same; the only change is the hierarchical decoder, which doesn't really work when lines can be found outside of regions (at least not easily, and if I understand their implementation correctly).

But I never got their code to run properly (it obviously has only a vague connection to the code used to produce the paper) and substituted my own for everything but the decoder. It shouldn't really matter, though, as the whole thing is really simple (the primary reason why I implemented it in the first place).

@bertsky

bertsky commented May 25, 2023

> The model and greedy decoding used in both papers is the same,

They also dropped the RDF model in the second paper (although it did perform better than the MLP on the table dataset).

Moreover, they introduced a region-level model (with a slightly different feature vector) in the second paper (which is needed for the hierarchical mode).

But when they trained on multiple datasets, they did not investigate how well models would generalize from one domain to the other or whether the training would benefit from cross-domain training. To me, all that is significant enough to warrant additional experimentation. (Especially if you look at the mistakes their model still makes. Even if rank distance is low, that does not per se mean you get a high-quality order. The prediction sometimes alternates like crazy between different parts of the page, not just at difficult/ambiguous spots.)
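The point about rank distance can be illustrated with a small sketch. It assumes the metric is a normalised Spearman footrule distance; the papers' exact metric and normalisation are not restated in this thread, so `spearman_footrule` below is only a stand-in.

```python
def spearman_footrule(pred, gold):
    """Normalised Spearman footrule distance between two orderings.

    0.0 means identical orders, 1.0 is reached by the reversed order.
    Assumed metric for illustration only.
    """
    if sorted(pred) != sorted(gold):
        raise ValueError("orderings must contain the same items")
    n = len(gold)
    if n < 2:
        return 0.0
    gold_rank = {item: rank for rank, item in enumerate(gold)}
    total = sum(abs(rank - gold_rank[item]) for rank, item in enumerate(pred))
    max_total = (n * n) // 2  # largest possible footrule for n items
    return total / max_total


gold = list(range(10))
# Every neighbouring pair is swapped: no transition i -> i + 1 is correct,
# yet each item is only one position away from where it belongs.
swapped = [1, 0, 3, 2, 5, 4, 7, 6, 9, 8]
print(spearman_footrule(swapped, gold))
print(spearman_footrule(gold[::-1], gold))
```

In the swapped example the distance stays small even though a reader following that order would never move from one line to its true successor.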

> the only change is the hierarchical decoder which doesn't really work when lines can be found outside of regions (at least not easily and if I understand their implementation correctly).

The hierarchical model simply restricts the topology during training (avoiding arbitrary line pairs); it does not concern coordinates (esp. not whether polygons are properly contained). When decoding, you first apply the region-level model on regions, then the hierarchical line-level model on the lines within each region.

> But I never got their code to run properly (it obviously has only vague connection to the one used to produce the paper)

It's inconsistent, yes; the last state was likely from the middle of the work on the second paper. But I can reproduce their published figures from that. See above issue for the details.

> and substituted my own for anything but the decoder. It shouldn't really matter though as the whole thing is really simple (the primary reason why I implemented it in the first place).

I agree it's not necessary to re-use their code in the end. But IMO the hierarchical variant (and probably also some transfer learning) should be used. And/or perhaps one can restrict the decoder with some basic geometric rules to avoid silly obvious mistakes.

@mittagessen
Owner Author

> But when they trained on multiple datasets, they did not investigate how well models would generalize from one domain to the other or whether the training would benefit from cross-domain training.

We don't have terribly high hopes about the generalization of this method as the model is a) linked deeply to the segmentation model typology and b) it is unable to deal with mixed-directional inputs.

> The hierarchical model simply restricts the topology during training (avoiding arbitrary line pairs), it does not concern coordinates (esp. not whether polygons are properly contained). When decoding, you first apply the region-level model on regions, then the hierarchical line-level model on the lines within each region.

The issue arises when lines exist outside of regions. Ordering the regions first and then the lines inside each region doesn't work in that case: there's no way to know where to insert the non-region-affiliated lines in the order, because you can't feed disparate line-region feature pairs into either the region or the line model.
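A minimal sketch of the two-stage decoding and of where it breaks down. The callables `order_regions` and `order_lines` are hypothetical stand-ins for the region-level and line-level models (plain `sorted` is used in the example); none of the names come from the papers' code or from kraken.

```python
def hierarchical_order(regions, lines, order_regions, order_lines):
    """Two-stage decoding: order regions, then the lines inside each region.

    regions: list of region ids
    lines: list of (line_id, region_id or None) pairs
    order_regions / order_lines: callables returning ids in reading order
        (hypothetical stand-ins for the two models)
    Returns (ordered line ids, line ids that could not be placed).
    """
    by_region = {region: [] for region in regions}
    orphans = []
    for line_id, region_id in lines:
        if region_id in by_region:
            by_region[region_id].append(line_id)
        else:
            # No parent region: neither model can say where this line goes.
            orphans.append(line_id)

    ordered = []
    for region in order_regions(regions):
        ordered.extend(order_lines(by_region[region]))
    return ordered, orphans


regions = ["r2", "r1"]
lines = [("l3", "r2"), ("l1", "r1"), ("l2", "r1"), ("l0", None)]
ordered, orphans = hierarchical_order(regions, lines, sorted, sorted)
print(ordered)  # ['l1', 'l2', 'l3']
print(orphans)  # ['l0']
```

The orphan list is exactly the problem case: the region model only compares regions and the line model only compares lines within one region, so there is no pair either of them could score to position `l0`.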

@bertsky

bertsky commented May 26, 2023

> We don't have terribly high hopes about the generalization of this method as the model is a) linked deeply to the segmentation model typology

I also have doubts, but perhaps this could be factored in with additional features (e.g. enabling/disabling segment categories with additional input, both during training and inference).

> and b) it is unable to deal with mixed-directional inputs.

Sure – generalization to bottom-up or left-right textline order systems is unlikely to be possible. But within the same system, you still have divergent material (as the paper shows) – with/out columns, marginals, tables, print vs handwriting etc. I would hope to at least gain some coverage in that area by curating training data/schemes.

> The issue arises when lines exist outside of regions. Then ordering the regions first and then the lines inside each region doesn't work as there's no way to know where to insert the non-region-affiliated lines in the order as you can't feed disparate line-region feature pairs into either the region/line model.

Oh, now I remember. That's why you insisted ALTO v4.3 should have reading order even on the line level.

I would argue that either this kind of segment is typical and common – in which case you should not need a hierarchical/region-level model – or it is special and rare – in which case probably the best design would be to add a dedicated hierarchy level (say 'insertion'):

  • ignore the region level (both at training and inference)
  • train at line level both with these insertions (but as extra parent type) and without them (i.e. skipped randomly)
  • decode as after-step

@mittagessen mittagessen merged commit 003568b into main Nov 14, 2023
@bertsky

bertsky commented Jan 17, 2024

@mittagessen are there any pretrained segmentation models with neural reading order available which one can already try? And do you have plans on adding RO to the builtin blla.mlmodel? Finally, do you have eval results to share (or will there be a paper)?

@mittagessen
Owner Author

mittagessen commented Jan 17, 2024

The code is a fairly straightforward adaptation of the method mentioned above, so any results should translate (it certainly isn't publishable from my point of view). We've run some tests and it generally seems to perform better than the heuristic for specific use cases.

The big BUT here, though, is that the net only uses line features (class, position, and extents), without any visual features, for determining order. This makes it a poor choice for a default model that can be used for different text directions, as it doesn't know whether it is looking at a Latin or an Arabic text (in the absence of such line classes) and will order columns incorrectly. I'd say it is mostly useful for people who are training a new segmentation model for some material that isn't well captured by the default model and who want better reading order for only slight computational overhead and no manual annotation effort.
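As a rough illustration of what "class, position, and extents" could look like as model input, here is a sketch. The exact feature layout used in the merged code is not given in this thread, so the functions `line_features` and `pair_features`, the normalisation, and the field order are all assumptions.

```python
def line_features(baseline, line_class, classes, page_size):
    """Build a feature vector from class, position and extents only.

    baseline: list of (x, y) points; line_class: a label from `classes`;
    page_size: (width, height) used to normalise coordinates.
    Hypothetical layout: one-hot class + centre + bounding extents.
    """
    width, height = page_size
    xs = [x / width for x, _ in baseline]
    ys = [y / height for _, y in baseline]
    one_hot = [1.0 if c == line_class else 0.0 for c in classes]
    centre = [sum(xs) / len(xs), sum(ys) / len(ys)]
    extents = [min(xs), min(ys), max(xs), max(ys)]
    return one_hot + centre + extents


def pair_features(a, b):
    """Input for a pairwise order classifier: both feature vectors joined."""
    return a + b


classes = ["default", "heading"]
left = line_features([(100, 200), (400, 200)], "default", classes, (1000, 2000))
right = line_features([(600, 200), (900, 200)], "default", classes, (1000, 2000))
# The same vector results whether the page is Latin (left column first) or
# Arabic (right column first): nothing in it encodes the script.
print(pair_features(left, right))
```

With purely geometric input like this, two columns on a Latin page and two columns on an Arabic page are indistinguishable unless the line classes themselves carry the script.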

I've written some basic documentation here.
