Expand schema documentation for PointsType #49
Comments
Interesting. So this is even more liberal here than the specification in PAGE-XML, which uses a unified, polygon-based representation for all its coordinates. There, polygons are restricted via a regexp to contain a comma-separated, non-negative list with at least two point pairs. So the syntax is well standardized. (This was introduced in 2013.) But there is no specification (or even comment/recommendation) regarding semantics. Not even a description of the absolute x-y pixel coordinate system based on the upper left corner (like in ALTO's IMO the following aspects should be addressed in the schema:
Care should be taken to remain as compatible with existing implementations as is consistently possible. |
As discussed in the last meeting, we should keep backward compatibility and not change the schema. Nevertheless, before closing the topic it would be good, as the original post suggested, to give a guideline for the string format. I will add a documentation proposal and then set the topic for voting. |
I just found this issue because of the different handling of
The current kraken code does not understand the ABBYY variant. |
Description was updated for more clarity and some guidelines. Changes are here: 4a301be |
Lines 702 to 712 in 4a301be
So it's going to be laissez-faire all the way then? This will make it hard for implementors, though. (Conversely, users will have to pay the price of that design decision by ending up with incompatibilities.) IMHO adding a few format restrictions retroactively (beginning with 4.4) – still allowing what has been in use overwhelmingly, but precluding new/rare formats – would be a better choice. For example:
<xsd:restriction base="xsd:string">
  <xsd:pattern value="([0-9.]+ [0-9.]+ )+([0-9.]+ [0-9.]+)"/>
  <xsd:pattern value="([0-9.]+,[0-9.]+ )+([0-9.]+,[0-9.]+)"/>
</xsd:restriction>
(One could debate whether the decimal point must be allowed as well, and whether it must be at least 1 point or at least 3, which can be expressed by |
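As a rough illustration of what the two patterns above would accept, here is a minimal Python sketch. It assumes XSD pattern semantics, where the whole value must match, and approximates that with `re.fullmatch`. The helper name `is_valid_points` and the sample strings are made up for illustration and are not part of any proposal in this thread.

```python
import re

# The two patterns proposed above, copied verbatim.
# XSD patterns must match the whole value, so fullmatch is used below.
SPACE_SEPARATED = re.compile(r"([0-9.]+ [0-9.]+ )+([0-9.]+ [0-9.]+)")
COMMA_SEPARATED = re.compile(r"([0-9.]+,[0-9.]+ )+([0-9.]+,[0-9.]+)")


def is_valid_points(value: str) -> bool:
    """Hypothetical helper: True if value matches either proposed pattern."""
    return bool(
        SPACE_SEPARATED.fullmatch(value) or COMMA_SEPARATED.fullmatch(value)
    )


if __name__ == "__main__":
    samples = [
        "10 20 30 40 50 60",   # space-separated values only
        "10,20 30,40 50,60",   # comma within a pair, space between pairs
        "10,20",               # a single point
        "10,20 30 40",         # mixed separators
        "(10,20) (30,40)",     # parenthesised pairs
    ]
    for sample in samples:
        print(f"{sample!r}: {is_valid_points(sample)}")
```

As written, both patterns require at least two points, and neither accepts a mix of the two separator styles within one value.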
@bertsky, maybe it would be a good idea to do this in multiple steps, since the change you proposed will break backward compatibility and, if we decide to implement it, will in any case be part of a major release (5.0). I agree it is better to have something enforced; the only concern at the last meeting was compatibility and what is preferred (keep it even though the solution is not the best, or break it for future advantages). We could release only the usage recommendation in 4.4, so that users have the chance to comply with it until we add the enforced rule in 5.0. Regarding the regular expression itself, it should be a bit different, since [0-9.]+ will also match something like 89...989..97.99, which is wrong. Maybe something like [0-9]+.?[0-9]* for a coordinate - will not match .89 but if we use * instead of + when we create expression for a pair then we could easily match any single space with that pattern. |
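The concern about `[0-9.]+` can be checked with a short Python sketch. It compares that token pattern with the stricter one suggested in the comment above. One assumption is made here: the dot in the suggested expression is taken as a literal decimal point and is therefore escaped as `\.`. The helper name `accepts` is hypothetical.

```python
import re

# Token pattern from the earlier proposal: any run of digits and dots.
LOOSE = re.compile(r"[0-9.]+")

# Stricter token suggested above. Assumption: the dot is meant as a
# literal decimal point, so it is escaped here.
STRICT = re.compile(r"[0-9]+\.?[0-9]*")


def accepts(pattern: re.Pattern, token: str) -> bool:
    """Hypothetical helper: True if the whole token matches the pattern."""
    return pattern.fullmatch(token) is not None


if __name__ == "__main__":
    for token in ["97.99", "97", "97.", ".89", "89...989..97.99"]:
        print(
            f"{token!r}: loose={accepts(LOOSE, token)} "
            f"strict={accepts(STRICT, token)}"
        )
```

Under this reading, the stricter token rejects runs of dots and a leading dot, but still accepts a trailing dot such as `97.`.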
@bertsky Your pattern is insufficient to capture valid floating point values like Personally, I'd prefer limiting it to one valid representation and then bumping up the major version even though it breaks backwards compatibility. |
@cipriandinu @mittagessen I agree with your assessments. |
Set back to discussion for next meeting - new proposal:
|
According to the results of the last meeting, we will split the topic into two parts as proposed before:
|
In order to decide how restrictive the rule for PointsType should be, we should vote on three options:
Voting for this topic would be a comment with a simple text: "Option 1" or "Option 2" or "Option 3" |
Option 2 |
Option 2 |
Option 2 with additional remark: From the documentation: "The upper left corner of the page is defined as coordinate (0,0)". Is it possible to say that |
Option 2. Changed to option 1 with comma separated coordinates on February 16, 2023. |
Option 2, agreeing with the remark of @stweil |
Option 1, IMHO. Has anyone actually seen existing implementations already using parentheses or brackets? (If not, let's not encourage this new paradigm!) |
@stweil - would it be better to have a more neutral comment like: "The upper left corner of the page is defined as x=0 and y=0"? Then we do not give any hint about what is preferable and what is not. I hope I properly understood your comment. Or would you like to have a clearer indication of what is preferred and what is not? |
If option 1 is chosen, I'd suggest to mark the variant with commas as the preferred one. Citing @artunit: "This is arguably clearer as a list of coordinate pairs by using commas". |
My vote would be for Option 1 and also a recommendation on the use of commas to aid with readability actually. |
I would vote for option 2 |
I would vote Option 2 |
Even though I voted for Option 2, this is a good point. I agree we should not encourage this if indeed nobody has used brackets in the past. Maybe we should go with 1 and see if there is any reaction on the ALTO mailing list when we announce the proposal for 4.4 (before officially launching the version) |
I agree with Stweil, Option 1 with the recommendation to use commas for readability. |
option 1 with comma separated coordinates |
Option 1 with comma looks good. |
Based on your votes and the last ALTO Board discussions, Option 1 was selected. Here is the documentation proposal: <xsd:simpleType name="PointsType"> |
|
4.4 released |
PointsType in ALTO v4 has very basic documentation:
It would seem clearer to explicitly surface PointsType as a list of coordinate pairs, particularly for complex shapes and polylines. For example, using the Polygon syntax from issue 22:
This is arguably clearer as a list of coordinate pairs by using commas:
Or perhaps:
The documentation might be a variation of what is used for MeasurementUnitType:
This would seem to reduce the possibility of missing a coordinate and be more friendly to software interpretation without breaking backwards compatibility.
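To illustrate what "friendly to software interpretation" could look like on the consumer side, here is a minimal Python sketch that reads a points string as a list of coordinate pairs. The function name `parse_points` is hypothetical, and the behaviour is an assumption rather than anything taken from the ALTO schema or an existing implementation: it accepts both the comma-separated pair form (`x1,y1 x2,y2`) and the space-only form (`x1 y1 x2 y2`), and rejects an odd number of values.

```python
import re
from typing import List, Tuple


def parse_points(value: str) -> List[Tuple[float, float]]:
    """Hypothetical parser: split a points string into (x, y) pairs."""
    # Treat commas and whitespace alike as separators between numbers.
    tokens = [t for t in re.split(r"[,\s]+", value.strip()) if t]
    if not tokens:
        raise ValueError("points string is empty")
    if len(tokens) % 2 != 0:
        # An odd count means a coordinate is missing somewhere.
        raise ValueError(f"odd number of values in points string: {value!r}")
    # float() raises ValueError for tokens such as "(10"; it is lenient
    # in other ways (e.g. it accepts "1e3"), so this is not a validator.
    numbers = [float(t) for t in tokens]
    return list(zip(numbers[0::2], numbers[1::2]))


if __name__ == "__main__":
    print(parse_points("10,20 30,40 50,60"))
    print(parse_points("10 20 30 40"))
```

Because commas and spaces are treated interchangeably, this sketch only detects a missing coordinate through the odd-count check; a stricter reader could insist on the comma form so that each pair is checked individually.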