Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No UTF32 Characters in the Regex for Strings #362

Closed
sebbader-sap opened this issue Feb 28, 2024 · 21 comments
Closed

No UTF32 Characters in the Regex for Strings #362

sebbader-sap opened this issue Feb 28, 2024 · 21 comments
Labels
aas-core impact on aas-core but not on specification accepted accepted in principle to be fixed bug Something isn't working specification impact on specification and thus on xml, json etc., label "aas-core" not set additinally
Milestone

Comments

@sebbader-sap
Copy link
Contributor

sebbader-sap commented Feb 28, 2024

Describe the bug

The regex pattern in the JSON Schema has only UTF-16 characters, while constraint AASd-130 demands the following:
^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]*$

Where
JSON Schema, e.g., https://github.com/admin-shell-io/aas-specs/blob/2ab08f92bdd1d44edc1cfee52552fe5429d2178e/schemas/json/aas.json#L44C22-L44C36

      "pattern": "^([\\t\\n\\r -\ud7ff\ue000-\ufffd]|\\ud800[\\udc00-\\udfff]|[\\ud801-\\udbfe][\\udc00-\\udfff]|\\udbff[\\udc00-\\udfff])*$"

Additional context
Needs to be adopted in the SwaggerHub Domains for Part 1 and Part 2.

@sebbader-sap sebbader-sap added bug Something isn't working aas-core impact on aas-core but not on specification labels Feb 28, 2024
@mristin
Copy link
Collaborator

mristin commented Feb 28, 2024

@sebbader-sap we transpiled the patterns into UTF-16 since most JSON schema engines we tested operated on UTF-16 and could not handle UTF-32.

It is a trade-off between correctness and practicality -- if we put UTF-32 in JSON schema (e.g., the pattern you mentioned: ^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]*$), many JSON schema validators will fail. I suppose any schema validator written as a library for C# or Java will be in that group.

You can test it online. The first answer on Google for "JSON schema validator" for me is https://www.jsonschemavalidator.net/. This validator does support UTF-32:

{
  "$schema": "https://json-schema.org/draft/2019-09/schema",
  "title": "AssetAdministrationShellEnvironment",
  "type": "string",
  "pattern": "^[\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\U00010000-\\U0010FFFF]*$"
}

The following value passes:

"test"

This is not a big change in aas-core-codegen, so whatever the decision, it shouldn't be hard to fix.

@mristin
Copy link
Collaborator

mristin commented Feb 28, 2024

Please consider also the SHACL -- I think the same issue appears there as well.

@sebbader-sap
Copy link
Contributor Author

The original pattern from the constraint has the same problems with OpenAPI-based validators, as they usually translate the YAML into JSON Schema --> then using the same JSON Schema Validation libraries with the same UTF-32 problems.

I am uncertain how to proceed now.
Requirement 1: I want to transform aas.json into the Part 1 Domain.
Requirement 2: Attributes from Part 2 with the same meaning as Part 1 attributes shall have exactly the same regex pattern.
Requirement 3: Data conforming to Constraint AASd-130 should survive a validation.
Requirement 4: Widely used libraries should accept the schema / OpenAPI files.

#3 and #4 are conflicting with each other. However, for Java-based servers, I need to adjust the pattern from Part 2 anyway, so I then rather want to see it 100% correct in the OpenAPI files and do the implementation-specific adjustments on top of it.

Which then means that my actually used OpenAPI file is not string-equals to the IDTA published one anymore...

@sebbader-sap
Copy link
Contributor Author

@BirgitBoss I think we need a formal decision for all parts. Either way, the Part 2 Domain must go the same way as the Part 1 Domain & the schemas.

@BirgitBoss BirgitBoss added the requires workstream approval strategic decision proposal needs to be prepared in TF spec team label Feb 28, 2024
@BirgitBoss BirgitBoss modified the milestones: V3.1, V3.0.1 Feb 28, 2024
@BirgitBoss BirgitBoss added the specification impact on specification and thus on xml, json etc., label "aas-core" not set additinally label Feb 28, 2024
@g1zzm0
Copy link
Collaborator

g1zzm0 commented Mar 6, 2024

Because there was some clarification needed in the taskforce:

^: Asserts the start of the string.

[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]: Defines a character class that allows various Unicode characters.

\x09: ASCII horizontal tab.
\x0A: ASCII linefeed (newline).
\x0D: ASCII carriage return.
\x20: ASCII space.
-: Represents a range.
\uD7FF: The upper limit of the Basic Multilingual Plane (BMP) in UTF-16.
\uE000-\uFFFD: Represents the range of characters from the start of the supplementary planes up to the last valid Unicode character (excluding surrogate pairs).
\u00010000-\u0010FFFF: Represents the range of valid surrogate pairs used for characters beyond the BMP.
*: Allows for zero or more occurrences of the characters within the character class.

$: Asserts the end of the string.

@g1zzm0
Copy link
Collaborator

g1zzm0 commented Mar 6, 2024

Maybe we should change the Constraint AASd-130 in the following way:

An attribute with data type "string" shall consist of these characters only:
The string contains only valid Unicode characters encoded in UTF-16 format
The string can include common characters like tabs, newlines, carriage returns, and spaces.
It allows a broad range of Unicode characters, including those beyond the Basic Multilingual Plane (BMP) which are represented using surrogate pairs in UTF-16 encoding.
It ensures that the entire string adheres to the rules of UTF-16 encoding, which is a standard way of representing a wide range of characters from different languages.

@mristin
Copy link
Collaborator

mristin commented Mar 6, 2024

I think that the most important bit is missing: this constraint is required, so that the text can be represented in XML.

For example, you can not represent \x00 in XML text, not even with �.

@g1zzm0
Copy link
Collaborator

g1zzm0 commented Mar 7, 2024

@mristin
Copy link
Collaborator

mristin commented Mar 8, 2024

Here is a short example as illustration.

The smiley "😀" is represented as character code 128512. This code is encoded in UTF-32 as "\U0001F600". In UTF-16, this is encoded as "\ud83d\ude00".

Hence, if you want to match this smiley with an regex engine that uses UTF-16, you have to write the pattern "\ud83d\ude00" even though it is a single character. If your regex engine operates on UTF-32, you can simply write "\U0001F600".

@mristin
Copy link
Collaborator

mristin commented Mar 8, 2024

For example, this schema works on an on-line JSON schema tester:

{
  "$schema": "https://json-schema.org/draft/2019-09/schema",
  "title": "AssetAdministrationShellEnvironment",
  "type": "string",
  "pattern": "^([\\t\\n\\r -\ud7ff\ue000-\ufffd]|\\ud800[\\udc00-\\udfff]|[\\ud801-\\udbfe][\\udc00-\\udfff]|\\udbff[\\udc00-\\udfff])*$"
}

This matches "😀" on https://www.jsonschemavalidator.net/.

The smiley does not test with:

{
  "$schema": "https://json-schema.org/draft/2019-09/schema",
  "title": "AssetAdministrationShellEnvironment",
  "type": "string",
  "pattern": "^[\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\U00010000-\\U0010FFFF]*$"
}

@sebbader-sap
Copy link
Contributor Author

Maybe we should change the Constraint AASd-130 in the following way:

Update from the latest state of Part 1 V3.1.0: Description for AASd-130 is already extended:

Constraint AASd-130 ensures that encoding and interoperability between different serializations is possible. It corresponds to the restrictions as defined for the XML Schema 1.0footnote:[https://www.w3.org/TR/xml/#charsets].

@sebbader-sap
Copy link
Contributor Author

sebbader-sap commented Mar 12, 2024

Proposal from a meeting of us (@mristin, @g1zzm0, and myself):

  1. AASd-130 shall be described using ^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]*$ --> no further change needed.
  2. This pattern is mapped to ^([\\t\\n\\r -\ud7ff\ue000-\ufffd]|\\ud800[\\udc00-\\udfff]|[\\ud801-\\udbfe][\\udc00-\\udfff]|\\udbff[\\udc00-\\udfff])*$ for the schemas. In particular, this is the case for the JSON Schema (which is already the case), OpenAPI Part 1 (also already the case, as Part 1 is a translation of the JSON Schema), and OpenAPI Part 2. It's not needed to introduce the pattern to the XML Schema, as XML is the one reason why AASd-130 is needed at first.

Therefore, the following activities are needed:

  • Replace the current pattern in OpenAPI Part 2 V3.0.2 to the one in 2. See Adjust Regex Changes also to the Part2 Classes aas-specs-api#256
  • Add a design decision to the serialisation formats in Part 1 stating that the AASd-130 pattern is expressed by another one in the schemas, and that this is due to the case that the currently available tooling is generally operating on UTF-16 characters, therefore usability is significantly higher. Later, we may change the pattern to the original AASd-130 one.
  • Get formal approval of the AAS Workstream

@sebbader-sap
Copy link
Contributor Author

Side-effect:
Depending on the implementation technology, developers must replace the pattern with the technology-matching regex variant of this pattern.

Example:
^([\\t\\n\\r -\ud7ff\ue000-\ufffd]|\\ud800[\\udc00-\\udfff]|[\\ud801-\\udbfe][\\udc00-\\udfff]|\\udbff[\\udc00-\\udfff])*$ is the pattern in the official JSON schema.

  1. If the server is implemented in Python, which regex engine is already UTF-32 capable, the needed pattern is ^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]*$ (the original one from AASd-130)
  2. If the server is coded with Java, the JSON Schema works out of the box.
  3. If it's in C#, even another pattern is needed: "^[\\u{9}\\u{a}\\u{d}\\u{20}-\\u{d7ff}\\u{e000}-\\u{fffd}\\u{10000}-\\u{10ffff}]*$"
  4. TypeScript: "^[\t\n\r -\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]*$"

@BirgitBoss
Copy link
Collaborator

Decision Proposal TF Metamodel AAS 2024-03-27

Change formulation of Contraint AASd-130 from

Constraint AASd-130: an attribute with data type "string" shall consist of these characters only: ^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]*$.

to
Constraint AASd-130: an attribute with data type "string" shall be restricted to the characters as defined in XML Schema 1.0, i.e. the string shall consist of these characters only: ^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]*$.

Constraint AASd-130 ensures that encoding and interoperability between different serializations is possible. See https://www.w3.org/TR/xml/#charsets for more information on XML Schema 1.0 string handling.

@g1zzm0 : please check

@BirgitBoss BirgitBoss added the ready for approval TF proposes how to resolve the issue. Needs final approval my Workstream label Apr 12, 2024
@BirgitBoss
Copy link
Collaborator

Proposal from a meeting of us (@mristin, @g1zzm0, and myself):
[...]

2. This pattern is mapped to `^([\\t\\n\\r -\ud7ff\ue000-\ufffd]|\\ud800[\\udc00-\\udfff]|[\\ud801-\\udbfe][\\udc00-\\udfff]|\\udbff[\\udc00-\\udfff])*$` 

[...]

This representation makes problems in swagger representation

image

How about this RegEx (see \ instead of \ before first ud7ff and before ue000 and ufffd at the beginning):
"^([\\t\\n\\r-\\ud7ff\\ue000-\\ufffd]|\\ud800[\\udc00-\\udfff]|[\\ud801-\\udbfe][\\udc00-\\udfff]|\\udbff[\\udc00-\\udfff])*$"

@mristin may you please have a look whether the regex we are using is really ok? Thank you!

@mristin
Copy link
Collaborator

mristin commented Apr 17, 2024

@BirgitBoss wrote:

How about this RegEx (see \ instead of \ before first ud7ff and before ue000 and ufffd at the beginning):
"^([\t\n\r-\ud7ff\ue000-\ufffd]|\ud800[\udc00-\udfff]|[\ud801-\udbfe][\udc00-\udfff]|\udbff[\udc00-\udfff])*$"

There is no single standard syntax for regular expressions. It all depends on the engine that you plan to use and support.

Best you fix the engine & test against it, and then also document somewhere why you picked that engine and not another one.

Whatever engine you pick, the particular syntax will be incompatible with some other engine.

@sebbader-sap
Copy link
Contributor Author

sebbader-sap commented Apr 17, 2024

But independent of the engine, having for some unicode characters one backslash (e.g. "\ud7ff") but for the others two (e.g. "\\ud800") in the same pattern seems pretty strange.

@mristin
Copy link
Collaborator

mristin commented Apr 17, 2024

But independent of the engine, having for some unicode characters one backslash (e.g. "\ud7ff") but for the others two (e.g. "\ud800") in the same pattern seems pretty strange.

Ah, I haven't even noticed that -- yes, it should be consistent.

@BirgitBoss
Copy link
Collaborator

Thank you @sebbader-sap and @mristin for reviewing, so we change to
"^([\\t\\n\\r-\\ud7ff\\ue000-\\ufffd]|\\ud800[\\udc00-\\udfff]|[\\ud801-\\udbfe][\\udc00-\\udfff]|\\udbff[\\udc00-\\udfff])*$"

to have a consistent way and it would also be supported by swagger.

s-heppner added a commit that referenced this issue May 23, 2024
We document in the JSON and RDF schema `README`
files, that we deviate from the pattern in the
specification of AASd-130, due to the fact that
most schema engines test UTF-16, instead
of the used UTF-32.

For the full discussion, refer to #362
@BirgitBoss
Copy link
Collaborator

Workstream AAS Specs
accepted

@BirgitBoss BirgitBoss added accepted accepted in principle to be fixed and removed ready for approval TF proposes how to resolve the issue. Needs final approval my Workstream requires workstream approval strategic decision proposal needs to be prepared in TF spec team labels Jun 13, 2024
@BirgitBoss
Copy link
Collaborator

s-heppner added a commit that referenced this issue Nov 15, 2024
We document in the JSON and RDF schema `README`
files, that we deviate from the pattern in the
specification of AASd-130, due to the fact that
most schema engines test UTF-16, instead
of the used UTF-32.

For the full discussion, refer to #362
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
aas-core impact on aas-core but not on specification accepted accepted in principle to be fixed bug Something isn't working specification impact on specification and thus on xml, json etc., label "aas-core" not set additinally
Projects
None yet
Development

When branches are created from issues, their pull requests are automatically linked.

4 participants