Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow provider-specific data types #529

Open
wants to merge 13 commits into
base: develop
Choose a base branch
from

Conversation

merkys
Copy link
Member

@merkys merkys commented Jun 14, 2024

This PR attempts to generalize #436 by allowing to define provider-specific data types. The following conditions for such data types apply:

  • Their values have to be expressed as string literals both in responses and in filters;
  • They have to be named using provider-specific prefixes;
  • Properties can only be compared with values of the same type (as per specification up til now).

The latter condition might be too strict for some uses, but since casting is unclear at the moment, let us leave such extensions for future.

Merging this PR will allow to reassign #392 and #436 (SMILES) to _cheminfo_ namespace where it can become a property with special set of rules for comparison.

@merkys merkys requested review from blokhin, rartino, vaitkus and ml-evs June 14, 2024 14:11
optimade.rst Outdated Show resolved Hide resolved
optimade.rst Outdated Show resolved Hide resolved
merkys and others added 2 commits June 19, 2024 15:45
optimade.rst Outdated Show resolved Hide resolved
@merkys merkys requested a review from vaitkus February 4, 2025 08:47
optimade.rst Outdated Show resolved Hide resolved
optimade.rst Outdated Show resolved Hide resolved
optimade.rst Outdated Show resolved Hide resolved
optimade.rst Outdated Show resolved Hide resolved
@blokhin
Copy link
Member

blokhin commented Feb 4, 2025

I have a general question: how to understand, if a prefix denotes a provider-specific datatype, not a general-purpose common datatype. In other words, compare _cheminfo_ and _mpds_, where is what?

@blokhin
Copy link
Member

blokhin commented Feb 4, 2025

apparently, by underscore as the first symbol, correct?

blokhin
blokhin previously approved these changes Feb 4, 2025
Co-authored-by: Antanas Vaitkus <[email protected]>
@merkys
Copy link
Member Author

merkys commented Feb 4, 2025

I have a general question: how to understand, if a prefix denotes a provider-specific datatype, not a general-purpose common datatype. In other words, compare _cheminfo_ and _mpds_, where is what?

Provider-specific data type names are only going to appear in x-optimade-type in property definitions. There they will be identifiable by an initial underscore and namespacing prefix. General purpose data types should be elevated to the main specification, thus without _<prefix>_.

@merkys merkys requested a review from vaitkus February 4, 2025 14:04
optimade.rst Outdated Show resolved Hide resolved
vaitkus
vaitkus previously approved these changes Feb 4, 2025
Co-authored-by: Antanas Vaitkus <[email protected]>
@merkys merkys requested review from vaitkus and blokhin February 5, 2025 09:33
@blokhin
Copy link
Member

blokhin commented Feb 8, 2025

@merkys should we merge?

@merkys
Copy link
Member Author

merkys commented Feb 10, 2025

@merkys should we merge?

@blokhin I would like to get a review from @rartino as this change affects property definitions and might impact the way query filters are supposed to be handled by the backends.

@rartino
Copy link
Contributor

rartino commented Feb 10, 2025

Ok, I'll take a deeper look at this over the next couple of days.

@ml-evs
Copy link
Member

ml-evs commented Feb 10, 2025

Ok, I'll take a deeper look at this over the next couple of days.

Likewise, as I want to use this pretty much straight away!

Copy link
Member

@ml-evs ml-evs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for writing this up @merkys!

I only have one general comment; should we encourage the custom type to be described in the description of the fields that use it? Otherwise I don't see where this information would go (even if its only human-readable anyway). I've added two suggestions below along these lines.

@@ -2158,7 +2167,7 @@ A Property Definition MUST be composed according to the combination of the requi

- :field:`x-optimade-type`: String.
Specifies the OPTIMADE data type for this level of the defined property.
MUST be one of :val:`"string"`, :val:`"integer"`, :val:`"float"`, :val:`"boolean"`, :val:`"timestamp"`, :val:`"list"`, or :val:`"dictionary"`.
MUST be one of :val:`"string"`, :val:`"integer"`, :val:`"float"`, :val:`"boolean"`, :val:`"timestamp"`, :val:`"list"`, :val:`"dictionary"` or the name of a namespace-specific data type that starts with an underscore followed by the provider-specific prefix.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
MUST be one of :val:`"string"`, :val:`"integer"`, :val:`"float"`, :val:`"boolean"`, :val:`"timestamp"`, :val:`"list"`, :val:`"dictionary"` or the name of a namespace-specific data type that starts with an underscore followed by the provider-specific prefix.
MUST be one of :val:`"string"`, :val:`"integer"`, :val:`"float"`, :val:`"boolean"`, :val:`"timestamp"`, :val:`"list"`, :val:`"dictionary"` or the name of a namespace-specific data type that starts with an underscore followed by the provider-specific prefix.
When a namespace-specific data type is used, the human-readable property description should be used to describe its usage and format.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess in the end we will want this info to go into our "standards" definitions (see https://github.com/Materials-Consortia/OPTIMADE/tree/develop/schemas#standards-definitions ). For the OPTIMADE one see here: https://schemas.optimade.org/defs/v1.2/standards/optimade.html

If forced to handle this today I would put that info into the description field there. Something along the lines of:

Description: The Magnetic materials standard node types and properties.

Apart from the supplied property definitions, the Magnetic materials standard defines a namespace-specific data type for the magnetic susceptibility tensor "_mag_sustensor" that allows comparison using < and > to express filters of the magnetic susceptibility magnitude. In more detail, the following operators are valid:
...

BUT, writing this up; even though the magnetic susceptibility tensor example is quite bogus, I realize we may be overly limiting deciding all custom datatypes must be represented as strings? Why not allow it to be expressed as any basic or container datatype?

But, back to @ml-evs comment: to allow not repeating the definition a lot, maybe this is better?:

When a namespace-specific data type is used, the human-readable property description should be used to describe its usage and format or to give a reference to where this information is available.

(That reference can then be to the "standards definition" if it is included there).

@@ -2158,7 +2167,7 @@ A Property Definition MUST be composed according to the combination of the requi

- :field:`x-optimade-type`: String.
Specifies the OPTIMADE data type for this level of the defined property.
MUST be one of :val:`"string"`, :val:`"integer"`, :val:`"float"`, :val:`"boolean"`, :val:`"timestamp"`, :val:`"list"`, or :val:`"dictionary"`.
MUST be one of :val:`"string"`, :val:`"integer"`, :val:`"float"`, :val:`"boolean"`, :val:`"timestamp"`, :val:`"list"`, :val:`"dictionary"` or the name of a namespace-specific data type that starts with an underscore followed by the provider-specific prefix.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
MUST be one of :val:`"string"`, :val:`"integer"`, :val:`"float"`, :val:`"boolean"`, :val:`"timestamp"`, :val:`"list"`, :val:`"dictionary"` or the name of a namespace-specific data type that starts with an underscore followed by the provider-specific prefix.
MUST be one of :val:`"string"`, :val:`"integer"`, :val:`"float"`, :val:`"boolean"`, :val:`"timestamp"`, :val:`"list"`, :val:`"dictionary"` or the name of a namespace-specific data type that starts with an underscore followed by the provider-specific prefix.
When a namespace-specific data type is used, the property definition MUST list the top-level `type` as `'string'` (or `['string', 'null']` for optional fields).

Copy link
Contributor

@rartino rartino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work with this!

The one blocking issue I see is that the PR misses a corresponding edit under "Property Definition keys from JSON Schema" -> "type", where in the current edition of the PR (see below) should say that when x-optimade-type is set to a string that indicates a provider-specific datatype, "type" MUST be "string". (I see @ml-evs is "fixing" this on line 2171, but whether or not that is added, the change should also be done on in that part of the specification).

(I couldn't add this as an "edit" comment, it is too far away...)

I would be ok with merging this for string custom datatypes only. However, I think it is worthwhile to pause for one minute here to consider what I wrote in a reply to one of @ml-evs suggestions: can we lightly edit the PR to allow custom data types to be represented using any of the basic, list or dict types?

I foresee that basically the first thing that will happen once we implement this is someone coming along and asking "oh, but in our community we have this data thing that is always represented as an NxM matrix, and for this data the whole community obviously want the ">" operator to mean Z." ...

@merkys
Copy link
Member Author

merkys commented Feb 12, 2025

Thanks for reviews, @ml-evs and @rartino!

Data type description in both human-readable text and Property Definitions is noncontroversial, thus I will address it later on.

Now on the extension beyond the string-based data types. Historically, I drafted this PR having SMILES in mind, thus strings were enough for me. Moreover, we already have strings with internal semantics (timestamps). Now I do not see why we cannot extend this to all basic+list+dict data types, but yes, this needs some more thought.

Some issues coming to my mind as I am writing this:

  1. Intuition. A human reader is capable to perceive that a string has internal semantics. Just by looking at lists and dicts it is difficult to say whether they constitute a data type or not. And this risks becoming counterintuitive for number-based data types where, for example, > operator is overloaded to mean the opposite.

  2. Accessing constituents. Should filters on constituents of list (items) and dict (keys and values) be supported, re-using compound identifier syntax as everywhere else in OPTIMADE?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants