Background for precision loss due to change to scientific notation #798
Replies: 17 comments 21 replies
-
I did the conversion and I kept every digit of precision that was in any conversion multiplier. This was about counting zeros and nothing else.

Jack Hodges, Ph.D., Arbor Studios

On Oct 30, 2023, at 3:13 AM, Florian Kleedorfer wrote:
Hi all,
In the latest release notes, @steveraysteveray writes:
We decided to return to using scientific notation, but only for very large (>10^5) and very small (<10^-5) values. While this issue has come up before, we currently believe the ability to express numbers without many zeroes outweighs the small errors (typically in the 4th decimal place or smaller) introduced in calculations. Users should be aware, of course, that critical applications should always look to authoritative sources for numbers such as conversion factors and constant values, such as ISO or NIST.
I am trying to understand this move and what it means for our use case, in which the 4th decimal place may be very relevant at times. I have reviewed the changes made to the units file in commit d82ee0684858d6035baefd5f176d27135ce4b7a7, and I have two questions:
1. I don't see why there should be a loss of precision. All numbers that I checked visually do seem correct and retain the same level of precision, at least in the units file.
2. However, if there is some systematic loss of precision introduced by this change, I don't understand the reasoning. Should we not strive for the highest possible level of correctness? I understand that there may always be bugs and so users have to be cautious, but systematic errors would be an unexpected design choice. Is it a trade-off between systematic errors and manual entry errors?
Cheers!
Florian
-
@fkleedorfer, I do not profess to be an expert here, but the tradeoffs seem to be detailed here and here.
-
Thanks for explaining. I'll try to sum up what I have understood so far:
If that is correct then I must admit I wish the choice had been the opposite: decimal notation. The way it is, QUDT is saying: here are the numbers, please don't use them. In order to figure out what that means for our library downstream of QUDT I made a quick JUnit test:
Which convinces me that QUDTLib does not need to undergo any changes to accommodate the switch to scientific notation. This immunity is owed to the robust implementation of BigDecimal. Other projects on other platforms might be in a different situation. However, the stability problems will sooner or later bite anyone who follows section 4 or 5 of the How-To, i.e., who uses the multipliers/offsets/constants directly in SPARQL. Having said that, I am possibly unaware of the advantages of using scientific notation that were factored into the decision, and being made aware of them might change my opinion. As I see it now, I'd recommend walking back that decision.
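The JUnit test itself is not shown above, but the gist can be sketched roughly as follows (the multiplier value below is hypothetical, chosen only for illustration, not taken from QUDT):

```java
import java.math.BigDecimal;

public class NotationDemo {
    public static void main(String[] args) {
        // Hypothetical conversion multiplier lexical form (illustration only).
        String lexical = "4.4482216152605";

        // BigDecimal preserves every digit of the lexical form, whether it
        // is written in plain decimal or in scientific notation.
        BigDecimal plain = new BigDecimal(lexical);
        BigDecimal sci = new BigDecimal("4.4482216152605E0");
        System.out.println(plain.compareTo(sci) == 0); // true

        // Routing the same lexical form through double (as an
        // xsd:double-typed literal typically is in most stacks) snaps the
        // value to the nearest binary64, which differs from the exact
        // decimal value.
        BigDecimal viaDouble = new BigDecimal(Double.parseDouble(lexical));
        System.out.println(viaDouble.compareTo(plain) == 0); // false
    }
}
```

This is why a BigDecimal-based library is immune to the notation switch, while anything that interprets the literals as doubles is not.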
-
AFAICT the reason to use scientific notation is only to avoid potential errors counting large numbers of consecutive zeros, either before or after the significant figures; i.e., this is a crutch to help people manually editing, which should not happen often. To me that seems the wrong optimisation. Integrity in the values is more important. So, for the reasons explained by @fkleedorfer, xsd:decimal is a better solution.
-
@dr-shorthair, ...except for the very small and very large numbers that require xsd:double, right?
-
As long as it 'never' needs to be edited manually, by anyone (including casual submissions), then having decimal values is fine. Personally, I would like to be able to see what the value is every once in a while, and counting zeros on either side of the decimal point is simply ridiculous. If we had a SPARQL insert that could convert to any numeric value, that we could execute when a PR is submitted, that might be the way to go.

Jack Hodges, Ph.D., Arbor Studios

On Nov 1, 2023, at 3:49 PM, Simon Cox wrote:
AFAICT the reason to use scientific notation is only to avoid potential errors counting large numbers of consecutive zeros - either before or after the significant figures. i.e. this is a crutch to help people manually editing. Which should not happen often.
To me that seems the wrong optimisation. Integrity in the values is more important. So for the reasons explained by @fkleedorfer xsd:decimal is a better solution.
-
There should be no exceptions. All decimal if we are going this route, with the provisos I mentioned in my earlier response. As I mentioned today, it is almost worth having both a decimal value and a double value.

Jack Hodges, Ph.D., Arbor Studios

On Nov 1, 2023, at 4:08 PM, steveraysteveray wrote:
@dr-shorthair, ...except for the very small and very large numbers that require xsd:double, right?
-
Certainly, but isn't that a presentation issue? Is it worth sacrificing precision and numeric stability?
-
As part of this problem set is about ensuring the correctness of community submissions, how about requiring two unit tests for each numeric value that involves other units/constants, and including those in the upcoming GitHub CI pipeline? By linking to two other (existing) units/constants we would implicitly get tests for them, and in the aggregate, a network of tests covering the whole database (assuming we'd also do this for existing entities). This would ensure correctness beyond a reviewer not making mistakes counting zeros. For example, the pipeline could have a query that calculates conversions between units and a CSV file with inputs and expected outputs. For each unit, there could be two entries in the CSV file, such as:
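The example entries did not survive in this thread; the following is only a hedged sketch of what such a check might look like. The unit names, multiplier values, and CSV rows below are invented for illustration, and a real pipeline would look the multipliers up in the QUDT graph rather than hard-coding them:

```java
import java.math.BigDecimal;
import java.math.MathContext;

public class ConversionCsvCheck {
    // Hypothetical CSV rows: fromUnit,toUnit,input,expectedOutput
    static final String[] ROWS = {
        "unit:KiloM,unit:M,1,1000",
        "unit:M,unit:KiloM,250,0.25",
    };

    // Hypothetical lookup of conversion multipliers relative to the base
    // unit; a real pipeline would query these from the QUDT graph.
    static BigDecimal multiplier(String unit) {
        switch (unit) {
            case "unit:KiloM": return new BigDecimal("1000");
            case "unit:M":     return BigDecimal.ONE;
            default: throw new IllegalArgumentException(unit);
        }
    }

    public static void main(String[] args) {
        for (String row : ROWS) {
            String[] f = row.split(",");
            BigDecimal input = new BigDecimal(f[2]);
            BigDecimal expected = new BigDecimal(f[3]);
            // valueTo = valueFrom * mult(from) / mult(to), in exact
            // decimal arithmetic (DECIMAL128 bounds the division).
            BigDecimal actual = input.multiply(multiplier(f[0]))
                    .divide(multiplier(f[1]), MathContext.DECIMAL128);
            System.out.println(actual.compareTo(expected) == 0);
        }
    }
}
```

Each submitted unit would then be exercised in both directions against two existing units, and failures would surface in CI before a reviewer has to count zeros.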
-
It is unacceptable to compromise precision, but it seems that the stability problem is not in the model but in the system that interprets it. The question I ask, given the stability example provided, is how many significant digits anyone is responsible for. Even saying this, it is more a rhetorical question because precision should not be compromised.

Jack Hodges, Ph.D., Arbor Studios

On Nov 2, 2023, at 1:50 AM, Florian Kleedorfer wrote:
Personally, I would like to be able to see what the value is every once in a while, and counting zeros, on either side of the decimal point is, simply, ridiculous.
Certainly, but isn't that a presentation issue? Is it worth sacrificing precision and numeric stability?
-
Bing (i.e. ChatGPT) does conversions from decimal to scientific notation.
-
It is always worth asking ChatGPT things, but you (we all) always have to check. We should ask for a conversion from xsd:double to xsd:decimal since that is what we currently have. I tried casting xsd:decimal to xsd:double in SPARQL but it didn't work.

Jack Hodges, Ph.D., Arbor Studios

On Nov 28, 2023, at 1:31 PM, Simon Cox wrote:
I just checked and the answer from Bing is wrong, so I just disproved the point I thought I was making
-
We really need to write the function in SPIN or SHACL so that we can apply it to the entire graph anyway.

Jack Hodges, Ph.D., Arbor Studios

On Nov 28, 2023, at 1:26 PM, Simon Cox wrote:
Bing (i.e. ChatGPT) does conversions from decimal to scientific notation.
This makes it easy to count zeroes when checking data manually.
Question: What is 0.0000000000000000000000000000000000000000000000000000346789 in scientific notation
Answer:
[The number 0.0000000000000000000000000000000000000000000000000000346789 in scientific notation is 3.46789 x 10^-53 ](https://www.calculatorsoup.com/calculators/math/scientific-notation-converter.php)[1](https://www.calculatorsoup.com/calculators/math/scientific-notation-converter.php).
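Such a conversion need not rely on a chatbot at all. A minimal sketch in Java (illustrative only; the thread asks for SPIN or SHACL, which would wrap the same arithmetic, and `toScientific` is a hypothetical helper, not a QUDT or QUDTLib API):

```java
import java.math.BigDecimal;

public class SciNotation {
    // Convert a decimal lexical form to normalized scientific notation
    // (exactly one nonzero digit before the point), so nobody has to
    // count zeros by hand.
    static String toScientific(String lexical) {
        BigDecimal x = new BigDecimal(lexical);
        if (x.signum() == 0) return "0.0E0";
        // For a BigDecimal, precision() counts significant digits and
        // scale() counts digits after the point, so the normalized
        // exponent is precision - scale - 1.
        int exponent = x.precision() - x.scale() - 1;
        BigDecimal mantissa = x.movePointLeft(exponent);
        return mantissa.toPlainString() + "E" + exponent;
    }

    public static void main(String[] args) {
        System.out.println(toScientific("0.0346789")); // 3.46789E-2
        System.out.println(toScientific("100000"));    // 1.00000E5
    }
}
```

Because BigDecimal parses the lexical form exactly, this conversion is lossless, unlike anything routed through double.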
-
Maybe the answer is another optional property for the value in scientific notation? The property could be called 'qudt:valueInScientificNotation'. This could be used if preferred by specific software executables.
-
After discussion in the Board, our intention is to create variants of the following relations:
...namely:
These new relations will be used to identify the scientific notation version of each of the respective values. So, each Unit instance will have qudt:conversionMultiplier and qudt:conversionMultiplierSN. The former will be expressed as a decimal number (xsd:decimal) and the latter in scientific notation that is commonly interpreted as an xsd:double. Applications can choose whichever value they like for computation, display, etc. ConstantValue instances will be handled similarly. This change will likely take place in March 2024, time permitting.
-
Good point. We only use it for Celsius and Fahrenheit currently, but we should treat it the same way.
-
Closed with #870, which concludes the implementation of the decisions made in this thread.
-