Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

- adds decimal format #3190

Merged
merged 5 commits into from
Mar 21, 2023
Merged

Conversation

baywet
Copy link
Contributor

@baywet baywet commented Mar 6, 2023

related #3167
related #889

@handrews
Copy link
Member

handrews commented Mar 6, 2023

@baywet what is the rationale for including floating point in this? #889 is about fixed point, and handling of fixed point and floating point is radically different. Plus, JSON numbers are already floating point, so I'm not sure what the value is of encoding floating point numbers as JSON strings.

@handrews
Copy link
Member

handrews commented Mar 6, 2023

I guess encoding floating point as strings allows retaining leading/trailing zeroes, but if this is for code generation then how does one know which type to generate? Is that made clear in the ISO standard? I can't read ISO standards because I don't have the budget to buy access as a freelancer. Is there a public specification that can be referenced?

@baywet
Copy link
Contributor Author

baywet commented Mar 6, 2023

I'll defer the question to @ioggstream who made the suggestion originally as to which standard to use. maybe @ralfhandl can also help considering his comment

I was merely trying to capture the different discussions in a single PR, and later on splat some of the formats to separate PRs to facilitate the main one's merge as you know.

As you mentioned the main value of this format over float is to retain the trailing zeros which "maintains the level of precision". Something anybody working with money, scientific data (measurements), etc... will appreciate. It will come at a cost in terms of serialization format. (i.e. using string as opposed to numbers, and having to parse things "twice")

@handrews
Copy link
Member

handrews commented Mar 6, 2023

The draft link @ioggstream supplied as a free alternative no longer works. I do notice that he requested fixed point decimal and not floating point, so I feel like floating point is a.) unnecessary scope creep, and b.) an undesirable overload of a single format.

If people want floating point in strings, add a floating-point format.

@baywet baywet force-pushed the feature/decimal-format branch from 2464efc to 02cc280 Compare March 7, 2023 13:28
@handrews
Copy link
Member

handrews commented Mar 9, 2023

Also, how is anyone expected to tell if the value is intended to be fixed vs floating?

type: string
format: decimal

applied to instance:

foo: "00.10"

Is this fixed point or floating point, and why?

@ralfhandl
Copy link
Contributor

ralfhandl commented Mar 10, 2023

@handrews Some background: SQL has two types / type families for numbers that are stored with a decimal mantissa, which is quite useful for financial data:

  • DECIMAL(precision, scale), for example DECIMAL(23,2) for 23 decimal digits of mantissa, two of which are fractional digits
  • DECFLOATprecision, for example DECFLOAT34 for 34 decimal digits of mantissa, see Wikipedia

These numbers can natively be represented in many programming languages, with the notable exception of JavaScript (as of today, one can hope that BigDecimal will arrive some time after BigInt).

They also can be represented as JSON numbers (unless you want to JSON.parse() it with JavaScript), but to be on the safe side it is more robust to allow wrapping them in JSON strings.

Now we need to express that a JSON string will contain such a decimal number:

type: string
format: decimal
x-sap-precision: 23
x-sap-scale: 2

and

type: string
format: decimal
x-sap-precision: 34
# no scale

We are currently using SAP-specific extensions for OpenAPI 3.0 and will contact you for how to define a corresponding JSON Schema vocabulary for OpenAPI 3.1.

@handrews
Copy link
Member

@ralfhandl that's helpful but still does not answer the most important question: Why overload format: decimal with two meanings? Why not just format: decimal and format: decfloat?

If you insist on overloading it, again, without relying on someone else's extension keyword that may or may not be present, how do I tell, given schema type: string, format: decimal and instance "00.10"" whether the instance is a fixed or floating point number?

And for that matter, if we're using int16, int32, etc. why not decimal23-2 and decimal34, etc. in which case you don't need decfloat. Or are the precision and scale to arbitrary to enumerate?

This proposal for format: decimal does not feel right to me. As written, it's not usable without those x- fields.

A format should solve one clearly defined problem, unambiguously. If you want to solve multiple propblems, you need multiple formats or a vocabulary. You should not mix them.

@handrews
Copy link
Member

@ralfhandl I should qualify this assertion:

If you want to solve multiple propblems, you need multiple formats or a vocabulary. You should not mix them.

Realistically, it might be easier to find an implementation with extensible format support than with full vocabulary support (although full vocabulary support exists in implementations in quite a few languages at this point). So I definitely understand and endorse a hybrid approach if needed.

However:

  • the format should be usable on its own when the vocabulary is not present (or not supported but used as an optional, meaning non-validation-impacting, vocabulary)
  • due to the wildly inconsistent support for format, the vocabulary should not depend on format being present or implemented, or if implemented, having support for a custom value. Or any value, really - I'm constantly amazed at how pathological format behavior is in the wild, assuming support for it was configured correctly in the first place, which it often is not

So, if the point of a format is to indicate the data type into which the instance value should be parsed, whether that is through code generation, or through reading the format value as an annotation and handling it dynamically, it MUST be possible to fully determine the behavior from the format value and the instance value alone.

Based on that wikipedia article, it's not possible from just that information to determine whether DECIMAL or DECFLOAT is intended. The format needs to indicate which is intended, or it's useless – different applications will behave differently given the same input.

@baywet
Copy link
Contributor Author

baywet commented Mar 13, 2023

The interesting aspect here is that languages don't seem to agree with one another:

  • dotnet "Represents a decimal floating-point number"
  • java "consists of an arbitrary precision integer unscaled value and a 32-bit integer scale" => I'm understanding floating point.
  • go "represents a fixed-point decimal"
  • Swift the documentation for the constructors would lead me to thing it's floating as well, but it's unclear.
  • python "is based on a floating-point model which was designed with people in mind"
  • php everything that's not int it float, with a big warning for the decimal case and a link to libraries if you care about this issue.
  • rust doesn't have a native type for it, but the main implementation library seems to be doing floating point as well.

@handrews
Copy link
Member

@baywet yeah, wow that's a mess, isn't it? All the more reason that the format specification needs to say exactly what is meant, so that implementers can make the appropriate effort to produce compliant behavior, whatever that means. We don't want the behavior to be "whatever the underlying language uses", because then the schema will behave inconsistently.

There are several problems to solve here:

Number in a string or mathematical guarantees?

If all we are doing is indicating that it's safe to parse the string as a base-10 number of some sort, then we should just have a format: number or format: base10-number (to avoid confusion with encodings). And if that's the case, then that's all that is needed.

Fixed vs floating?

If we want to make mathematical guarantees, fixed-point and floating-point are radically different, and I cannot conceive of a use case where I would both care about the mathematical properties and not care whether those were fixed-point or floating-point.

Given the variations in what "decimal" means, I think we need a different name or names that are less ambiguous (see next section for ideas). Each name needs to correspond to clear behavior. If someone really wants a "fixed or floating, I don't care", that's what "anyOf" is for.

This is a hill I wil die on.

There is no conceivable way in which I will voluntarily approve a format that spans both fixed and floating point. I'm not part of @OAI/tsc so folks can overrule me if they want, and I'll take it in good grace, but nothing anyone has said here has given me any reason to change my mind in the slightest.

Does the format constrain the range and precision, or is that application-interpreted?

If I have a schema of:

type: array
items:
    type: string
    format: fixed-point

and an instance of:

- 1.22
- 111.2

What is happening? We have a number with a precision of 3 and a scale of 2, and one with a precision of 4 and a scale of 1. Neither of which likely matches the actual storage precision and scale of any data type.

This, btw, is why no one has succeeded in defining a fixed-point format despite it having been requested repeatedly for years. I've yet to see anyone propose a workable, interoperable approach grounded in real use cases.

For floating point, there are interchange formats decimal32, decimal64, and decimal128 that already solve this problem. If we want floating-point-in-string formats, we should use these (there are also binary16, binary32, etc. which overlap with the already-defined float and double formats).

For fixed point, the tricky thing is that while precision is probably dictated by storage format, scale can be anything allowed within that precision. Enumerating different scales in the format name is impractical (this is one of many reasons that vocabularies are a better solution).

  • I'd suggest letting the instance define the scale instead of having the format enforce it, so 1.00 would have a scale of 2, while 0.12345 would have a scale of 5
  • If 23 and 34 are standard precisions, we could have fixed23 and fixed34, otherwise I'd suggest just saying that the precision needs to be sufficient to accommodate the instance, including leading zeros, and call the format fixed

@baywet
Copy link
Contributor Author

baywet commented Mar 13, 2023

Interestingly enough the IEE-754 seems to define decimals as floating points as well.

For the exchange of decimal floating-point numbers, interchange formats of any multiple of 32 bits are defined.

Also most languages seem to leverage 128 bits behind the scenes if that helps.

@handrews
Copy link
Member

Yes, I was proposing that if we want floating-point-as-strings that decimal32 etc. would be the correct name for those formats. That is what those terms mean in IEEE-754.

But the original ask was fixed-point-as-strings. As far as I'm concerned, all of this floating point stuff is a distraction from the original ask and should be a separate issue.

Copy link
Member

@handrews handrews left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point I think it is clear that this concept is not ready. I'm going to ask that this PR be closed and two issues be filed: one for fixed-point numbers as strings, and one for floating-point numbers as strings.

Questions that need to be resolved (separately in each of the two issues) include

  • does this concept need a format?
  • do there need to be multiple formats based on standardized storage sizes?
  • how does the application determine the precision and scale to use?
  • does the formatting (leading/trailing zeros) of the instance affect the precision and scale used, and if so, how?
  • what should the format(s) be called?

These questions need to be answered by people who will be using these formats (a group of people that does not include me). A PR that conflates two major use cases and has a complex conversation already is not the place to debate these questions. We should be using issues to hammer out what is really needed, and then move to PRs when the only work remaining is how to word the pages.

I'm marking this review as "request changes" but I'd prefer to mark it "reject in favor of further work in issues." @MikeRalphson @darrelmiller @ralfhandl what do you think?

@baywet baywet mentioned this pull request Mar 14, 2023
@baywet
Copy link
Contributor Author

baywet commented Mar 14, 2023

@handrews my goal here was merely to capture discussions that already happened here #889 #845 and keep the registry with "what's in the wild" up to date. That being said it seems that most descriptions using the decimal format today are actually azure related. I suspect that's a reflection of the dotnet decimal type for the most part, which probably is itself tied to some kind of data storage ultimately.
Interestingly enough, most of those APIs use the number and not string type. So either they don't care about trailing zeros and precision or they haven't though about transparent precision loss.

I do understand the harm that putting something in the registry that's not "specific enough" could make. We've just spent a couple of weeks cleaning that up...

Could we define this format as equivalent to IEE-754 decimal128, floating or fixed, since more client languages seem to lean towards that? And if people need more precision let them contribute to JSON schema so additional attributes about the size/precision can be added to the schema?

@handrews
Copy link
Member

handrews commented Mar 14, 2023

@baywet I remain vehemently against anything that combines fixed point and floating point, as:
1. I cannot see how to implement such a thing meaningfully, and neither you nor @ralfhandl has explained how to do so
2. @ralfhandl has shown that there are two related concepts of DECIMAL and DECFLOAT, but has also that his usage depends on additional keywords which AFAICT proves my point that it is not possible to usefully implement the format on its own
3. The proposal in your last paragraph still completely ignores the questions I have raised about how to determine the scale of a fixed-point interpretation
4. Saying that format: decimal is used in Azure-related descriptions doesn't do anything to answer my concerns - if it is intended to, you will need to be more clear about what it means

This is at least the third time over the years that somoene has tried to add a format of this sort, and the result is the same every time: I raise a bunch of concerns pointing out that it's not clear how to implement the proposal, and the person making the proposal refuses to address them and keeps trying to get me to approve something that i think is too vague to implement. And what's lacking here, as in all past attempts, is someone who actually is using this format themselves who can answer the hard questions in way that is grounded in real usage and results in a definition of the format that anyone can implement without additional keywords.

I don't think it's a good idea to put a format in the registry just because it exists in the wild - the registry is for making people aware of formats that can be used in a reasonably interoperable way. That is not what this format is doing. I still don't even understand why you added floating-point in the first place, and am not at all convinced that it's relevant to the original ask or the existing usage in the wild.

I don't have anything else to say on this proposal, so I am handing this off to @OAI/tsc . I'll go with whatever they decide, but I think my position is clear.

@baywet
Copy link
Contributor Author

baywet commented Mar 14, 2023

@handrews I appreciate your effort to drive precision work here, thanks a lot with your help on this (and other) pull requests.
I'm perceiving a bit of an upset tone in your last answer, this might simply be a perception due to the communication format (async text is lossy sometimes), but regardless I'd like to offer apologies if some of the discussions here have been repetitive/upsetting.

I think the (potential) disconnect here might come from multiple factors:

  • I've been working backwards (looking at the current state of things) when you seem to be working forwards (specifying things for future use).
  • The format registry doesn't state its intent: normative or a simple reference. My understanding of it was that it is meant to be an index of things that are found out there. And that potential problems with what is found could be documented in there, as well as potential alternatives/better practices.

Would it help if we got a formal direction to follow from TSC on that last point? And documented that on the index page of the format registry? We can always come back to the nitty gritty details of this specific format later on.

@ralfhandl
Copy link
Contributor

@handrews I'm fine with splitting into fixed and floating-point decimals.

I like the idea of using decimal64 and decimal128 for the mainstream decimal floating-point values (we are only using those two, usually called DECFLOAT16 and DECFLOAT34 after the number of significant decimal digits).

@ralfhandl
Copy link
Contributor

@handrews Fixed-point decimals DECIMAL(precision,scale) can only be fully described by specifying both precision and scale in addition to the bit "this is a fixed-point decimal".

We could define a family of formats decimal(p,s) with p ranging from 1 to 38 and s ranging from 0 to p.

A validator can extract precision and scale from the format name and use them for validating a numeric value.

This proposal doesn't need additional keywords because all information is encoded in the format names.

I'm open for cosmetic changes on how to encode precision and scale in the format name.

@handrews
Copy link
Member

@baywet my frustration (and I apologize for letting that show a bit too much) doesn't have much to do with you at all. While I do feel like there has been a bit of repetition without much progress, that is equally true of my own comments. Most of my frustration has to do with the fact that I don't think format should exist at all.

The only specifically frustrating thing here is that, unless I've missed it, you have never directly addressed my concern with mixing fixed and floating point. @ralfhandl has said he is OK with splitting them, so I hope you are now on board with that.

If you are not, I have tried to show how you can make a case. For example, if you think this is what is happening in the wild, you need to show that in detail, not just tell me that that's what you are doing and expect me to accept it – I've never see a fixed point / floating point mix so I don't understand your argument. If the usage is inconsistent then I don't see a point in registering it.

My last comment wasn't intended as a storming off in anger, but as a recognition that we've taken this as far as I can take it and still contribute productively. I would like to hand this off to someone before I start showing more frustration, becasue I do not have the authority to reject or fully approve this, and I don't personally care about fixed point and floating point, nor do I have the necessary expertise to fully evaluate proposals.

I feel like I became the person responding because I was responding about the encoding formats, but those are very different beasts. I am begging someone else to take this over for me as I have already spent way more time researching this topic than I intended.

@handrews
Copy link
Member

handrews commented Mar 15, 2023

And if whoever takes it over wants to just approve it as-is, I am truly 100% fine with that. I am just not comfortable with approving it myself, and the push to accept something is increasing my discomfort/frustration. So I am trying to get out of the way.

@MikeRalphson
Copy link
Member

MikeRalphson commented Mar 15, 2023

Thanks @handrews for all your input, and sorry to have left you to it as it were.

My only concerns with registering a new format are:

  • the name should be obvious or relatively intuitive as to meaning
  • the format is well defined to do one thing and one thing only
  • the format is already used in the wild or provides new guidance

My feeling is we could add decimal as a string representing a fixed point number and get some benefit, or drop the proposal until we have more consensus on the original issue.

In one corpus, decimal is the 16th most used format (after many of the 'built-in' ones). Which is both an argument for adopting it, and not blindly assuming what people's intent was when using it.

https://github.com/Mermade/openapi-specification-extensions/blob/main/formats/2021/formats.tsv

I can try and pull up sibling example properties of format: decimal from the data set if that would help? @baywet @ralfhandl

@handrews
Copy link
Member

@MikeRalphson

and sorry to have left you to it as it were.

Eh, I tend to put these things on myself and then get frustrated about it. I'm trying to work on not doing that so much. But thank you for both the sentiment and stepping in! 🙂

@baywet
Copy link
Contributor Author

baywet commented Mar 15, 2023

@handrews no worries, in case I didn't make this clear: I don't "care" about floating & fixed point vs fixed point only. I'm just trying to facilitate the work and get those formats published. While doing so, the only reason I included floating point in the description was because of an earlier comment from @ralfhandl in the previous PR. But I'm happy to remove it.

I think at this point we've all reached a consensus that decimal, which is already in use, most likely represents a fixed point decimal number. And because we don't want to introduce new semantics/keywords at this point, the information of "how it is structured" should come from a convention that relies on a broadly adopted standard (kind of what we did with the different date formats). Is that a fair representation of where things are at?

Assuming that's where we are all at, we'd need to remove the floating point part from that format description. And my suggestion is to align on a 128 bits wide type, as it seems to be what most languages are using today. Now, I don't know enough about that space to properly map it to a standardized type information, and that's where I need help. Once I get that information, I'll update the PR to reflect the above. Does that work for everyone?

@ralfhandl
Copy link
Contributor

@baywet That means you would add a file decimal128.md to this PR with content

---
owner: baywet
issue: 889
description: A 128-bit floating point decimal number as defined by IEEE 754-2008
base_type: string
layout: default
---

# <a href="..">{{ page.collection }}</a>

## {{ page.slug }} - {{ page.description }}

Base type: `{{ page.base_type }}`.

The `{{page.slug}}` format represents a [128-bit floating point decimal number](https://en.wikipedia.org/wiki/Decimal128_floating-point_format) as defined by IEEE 754-2008.

{% if page.issue %}
### GitHub Issue

* [#{{ page.issue }}](https://github.com/OAI/OpenAPI-Specification/issues/{{ page.issue }})
{% endif %}

{% if page.remarks %}
### Remarks

{{ page.remarks }}
{% endif %}

Or should I rather open a separate PR for decimal128?

@baywet
Copy link
Contributor Author

baywet commented Mar 15, 2023

@ralfhandl my suggestion he is to add a "decimal" format that maps to fixed 128 to it captures the existing.
And we can later on add decimal32 or 64 as well if we feel this is something we need. What do you think?

@ralfhandl
Copy link
Contributor

The 128-bit wide IEEE EEE 754 decimal128 representation is for floating-point decimals, so it unfortunately is of no use if format: decimal means fixed-point decimals. We have to live with the fact that format: decimal on its own does not define the exact value range.

What we can do is define a new decimal128 format which maps to the IEEE 754 decimal128 representation for floating-point decimals and exactly expresses its value range, thus my proposal of a second file.

What we could in addition do is define a new "template format" decimal({precision},{scale}) for fixed-point decimals that APIs could use going forward, for example format: decimal(20,2) for a value range of -999999999999999999.99 to 999999999999999999.99 in increments of 0.01.

@baywet
Copy link
Contributor Author

baywet commented Mar 15, 2023

What would you map the existing usage of "decimal" (keep the exact same name) to with this suggestion?

@ralfhandl
Copy link
Contributor

ralfhandl commented Mar 16, 2023

The current "decimal" format doesn't map to anything specific, and API designers are advised to use one of the concrete formats

  • decimal128
  • decimal64
  • decimal({precision},{scale})

instead if they want to map to a specific decimal type.

We only know that "decimal" is widely used, and we don't know its API-specific interpretation.

@baywet
Copy link
Contributor Author

baywet commented Mar 16, 2023

Notes from today's meeting:

  • there's value in documenting the existing use of decimal
  • it should come with warnings about loss of precision (when using type number), and not normative enough (where does the point live? does it support floating point?)

TODO (for me)

  • remove mentions of floating points
  • add warnings about the loss of precision
  • add warnings about the normative aspect
  • (in separate PRs) document decimal128

@baywet baywet force-pushed the feature/decimal-format branch from 02cc280 to 3d2cbf9 Compare March 16, 2023 17:08
@baywet
Copy link
Contributor Author

baywet commented Mar 16, 2023

@handrews @ralfhandl Thanks again for the support through this one. I updated it to reflect our recent discussions.
The plan is once this once gets merged to open a new PR for decimal128 (unless you'd like me to make that change here?) and update the warnings to recommend using that format instead.
I've also removed the mention of the ISO standard since nothing as things are today guarantees that people are following that standard.
Let me know what you think!

@handrews
Copy link
Member

@baywet I'm less concerned about warning about loss of precision (loss compared to what? It does not guarantee precision) than I am about warning of inconsistent usage and implementation.

Registering a format sets the expectation that it will behave according to the registry. This is a reasonable expectation for all other registered formats, which all meet at least one of the following criteria:

  • normatively defined by OAS
  • normatively defined by JSON Schema
  • reference a clear, published specification (RFC, HTML, the CommonMark spec)
  • have an unambiguous definition on which the vast majority of programming environments agree, if they support it at all (char, int16)

The most complex is HTML, because of the nature of that spec, but that is a problem with parsing HTML in general (and is addressed in the spec, at least in theory).

It needs to be very clear from this registry that using format: decimal does not guarantee any particular behavior because the current usage varies.

@handrews
Copy link
Member

"guarantee" might the wrong word b/c the registry doesn't guarantee things, but I hope you get what I mean. If not, please ask, as I am happy to keep discussing this one point.

registries/_format/decimal.md Outdated Show resolved Hide resolved
registries/_format/decimal.md Outdated Show resolved Hide resolved
registries/_format/decimal.md Outdated Show resolved Hide resolved
Signed-off-by: Vincent Biret <[email protected]>
@baywet
Copy link
Contributor Author

baywet commented Mar 17, 2023

@ralfhandl @handrews added changes from the comments. I'm almost tempted to mark it as deprecated now, or should we wait for the PR for decimal128 (or equivalent) to do that?

Copy link
Member

@handrews handrews left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@baywet thanks for working through everything on this- I think we've reached the finish line! I appreciate your perseverance.

Let's not mark it deprecated now, as the way in which decimal is discouraged (lack of interoperability) is very different from the way that binary and byte are discouraged (a better alternative outside of format entirely) are very different. If we need to have a discussion about whether this is also a "deprecation" let's have that in a separate issue or PR so this one can get merged now.

@baywet
Copy link
Contributor Author

baywet commented Mar 17, 2023

Thanks everyone for pulling through this one and making sure we documented the right thing.
@MikeRalphson for a final review
@darrelmiller for merge.

Copy link
Member

@MikeRalphson MikeRalphson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - thank you all!

@darrelmiller darrelmiller merged commit 1b88087 into OAI:gh-pages Mar 21, 2023
@baywet baywet deleted the feature/decimal-format branch March 21, 2023 13:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants