Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spurious failures with test.67 #2188

Open
hmaarrfk opened this issue Jan 18, 2022 · 14 comments
Open

Spurious failures with test.67 #2188

hmaarrfk opened this issue Jan 18, 2022 · 14 comments
Milestone

Comments

@hmaarrfk
Copy link
Contributor

To report a non-security related issue, please provide:

  • the version of the software with which you are encountering an issue: 4.8.1
  • environmental information (i.e. Operating System, compiler info, java version, python version, etc.): Recent conda-forge environment
  • a description of the issue with the steps needed to reproduce it: Build and run tests.

For some reasons, it seems that test.67 is failing due to rounding errors.

Is there a way to relax the tolerance, or to skip the test in CIs while the fix is implemented?

Thank you very much for your help

2022-01-18T22:49:55.2133446Z 153: Parallel Performance Test for NASA
2022-01-18T22:49:55.3589607Z 153: *** Testing parallel IO for NASA...
2022-01-18T22:49:55.3590559Z 153: num_proc	MPI mode	access		cache (MB)	grid size	chunks	avg. write time(s)	avg. write bandwidth(MB/s)	num_tries
2022-01-18T22:49:56.5404167Z 153: 4		MPI-IO   	independent	1		40x61		contiguous	0.140169		0.664239			2
2022-01-18T22:49:56.6507536Z 187: *** Testing: test.21 ; url=[cache][prefetch]https://remotetest.unidata.ucar.edu/dts/test.21
2022-01-18T22:49:57.7011207Z 187: *** Testing: test.50 ; url=[cache][prefetch]https://remotetest.unidata.ucar.edu/dts/test.50
2022-01-18T22:49:58.8266840Z 187: *** Testing: test.53 ; url=[cache][prefetch]https://remotetest.unidata.ucar.edu/dts/test.53
2022-01-18T22:49:59.5764629Z 153: 4		MPI-IO   	independent	1		40x61		61x40    	0.135389		0.689269			2
2022-01-18T22:50:00.0023322Z 187: *** Testing: test.55 ; url=[cache][prefetch]https://remotetest.unidata.ucar.edu/dts/test.55
2022-01-18T22:50:01.4301522Z 187: *** Testing: test.56 ; url=[cache][prefetch]https://remotetest.unidata.ucar.edu/dts/test.56
2022-01-18T22:50:01.9445081Z 153: 4		MPI-IO   	independent	1		40x61		31x20    	0.096213		1.110586			2
2022-01-18T22:50:02.6089203Z 187: *** Testing: test.57 ; url=[cache][prefetch]https://remotetest.unidata.ucar.edu/dts/test.57
2022-01-18T22:50:03.8811705Z 187: *** Testing: test.66 ; url=[cache][prefetch]https://remotetest.unidata.ucar.edu/dts/test.66
2022-01-18T22:50:04.8067202Z 187: *** Testing: test.67 ; url=[cache][prefetch]https://remotetest.unidata.ucar.edu/dts/test.67
2022-01-18T22:50:05.7020792Z 187: 41c41
2022-01-18T22:50:05.7021613Z 187: <     0.589788025031098, 0.581683089463883, 0.573519986072457, 
2022-01-18T22:50:05.7022702Z 187: ---
2022-01-18T22:50:05.7023192Z 187: >     0.589788025031098, 0.581683089463884, 0.573519986072457, 
2022-01-18T22:50:05.7092990Z 187: *** FAIL:  test.67
2022-01-18T22:50:05.7361715Z 153: 4		MPI-IO   	collective	1		40x61		contiguous	1.390005		0.066968			2
2022-01-18T22:50:05.8327858Z 187: *** Testing: test.68 ; url=[cache][prefetch]https://remotetest.unidata.ucar.edu/dts/test.68
2022-01-18T22:50:06.8210803Z 187: *** Testing: test.69 ; url=[cache][prefetch]https://remotetest.unidata.ucar.edu/dts/test.69
2022-01-18T22:50:08.6198212Z 187: *** Testing: test.01.1 ; url=[cache][prefetch]https://remotetest.unidata.ucar.edu/dts/test.01?f64
2022-01-18T22:50:09.1583404Z 187: *** Testing: test.02.1 ; url=[cache][prefetch]https://remotetest.unidata.ucar.edu/dts/test.02?b[1:2:10]
2022-01-18T22:50:10.2771125Z 187: *** Testing: test.03.1 ; url=[cache][prefetch]https://remotetest.unidata.ucar.edu/dts/test.03?i32[0:1][1:2][0:2]
2022-01-18T22:50:11.4927304Z 153: 4		MPI-IO   	collective	1		40x61		61x40    	1.583846		0.059221			2
2022-01-18T22:50:11.5621661Z 187: *** Testing: test.04.1 ; url=[cache][prefetch]https://remotetest.unidata.ucar.edu/dts/test.04?types.i32
2022-01-18T22:50:12.6384408Z 187: *** Testing: test.05.1 ; url=[cache][prefetch]https://remotetest.unidata.ucar.edu/dts/test.05?types.floats.f32
2022-01-18T22:50:13.6222597Z 187: *** Testing: test.07.1 ; url=[cache][prefetch]https://remotetest.unidata.ucar.edu/dts/test.07?person.age
2022-01-18T22:50:14.7774280Z 187: *** Testing: test.07.3 ; url=[cache][prefetch]https://remotetest.unidata.ucar.edu/dts/test.07?person
2022-01-18T22:50:16.0414786Z 187: *** Testing: test.07.4 ; url=[cache][prefetch]https://remotetest.unidata.ucar.edu/dts/test.07?types.f32

See full logs:
conda-forge/libnetcdf-feedstock#135
test.67_failure_example.txt

@hmaarrfk
Copy link
Contributor Author

Tests are being run with:

ctest -VV --output-on-failure -j${CPU_COUNT} ${SKIP}

@WardF
Copy link
Member

WardF commented Jan 18, 2022

Tagging @DennisHeimbigner since this is DAP related and there might be something related to that in my blindspot.

@WardF WardF added this to the 4.9.0 milestone Jan 18, 2022
@DennisHeimbigner
Copy link
Collaborator

This kind of problem occurs in a number of tests that involve float or double comparisons.
There is no general fix, although sometimes the ncdump -p flag can be used.
As a rule, we have just made the test conditional so that it is not run when this failure is known
to occur.
Is this being run under windows or mingw?

@hmaarrfk
Copy link
Contributor Author

No currently being in linux.

Can you please provide me the syntax to skip the tests with ctest I'm not too familiar with your build/test system.

Thank you!

@WardF
Copy link
Member

WardF commented Jan 18, 2022

I would need to look up the cmake syntax; currently, test.67 appears to be part of the broader test suite, running as part of test number 187. The numbering of individual tests will vary depending on the underlying configuration options specified; can you tell me what the name of test 187 in this configuration is? There is a cmake/ctest flag we should be able to specify to exclude that test by name.

@WardF
Copy link
Member

WardF commented Jan 18, 2022

You can turn off remote DAP testing all together by invoking cmake with the -DENABLE_DAP_REMOTE_TESTS=OFF command-line argument.

@hmaarrfk
Copy link
Contributor Author

Thanks!

@verduijn
Copy link

Just wanted to add another data point. I've seen the same issue while building from source using Docker.
As a base amazonlinux:2 is used with gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-13) and after building netcdf-c v4.6.1 the make check fails on the same error. Previously this image build fine. My "fix" was to add --disable-dap-remote-tests to the ./configure step.

@Dave-Allured
Copy link
Contributor

I have also seen the same test.67 failure in several recent builds, all on Mac OS 10.15. These are the only versions that I tested, so other versions are probably also affected in the same way.

  • 4.3.3.1
  • 4.4.1.1
  • 4.6.3
  • 4.8.1 release
  • recent development snapshots

These failures occurred consistently between 2021 December 23 and 2022 March 6. I have some old testing leftovers and other evidence indicating test.67 was working fine for many years, from inception through 2020 February 26. I have no evidence for the gap from 2020 March through 2021 November. In other words, I can't pin down closely when this started.

From this, I think it is most likely that there was a recent change in behavior in handling test.67 on Unidata's test server, and that there is no particular problem within the local test code in netcdf-c.

DennisHeimbigner added a commit to DennisHeimbigner/netcdf-c that referenced this issue Mar 19, 2022
This is in response to issue Unidata#2188 (comment) although it does not fix that problem.
@ArchangeGabriel
Copy link
Contributor

I can fill the gap a lot then: on Arch, it worked up to the 2021 October 1st builds (including builds in 2021 July for instance). After that I don’t have builds before this February, so you’ve got a closer point, but with the two of us we can already narrow it down a lot.

@Dave-Allured
Copy link
Contributor

@ArchangeGabriel, thank you, this is helpful. So it seems the relevant change in behavior was between 2021 October 1 and 2021 December 23.

I would like to look into this more closely. Can someone at Unidata please show me how to get a direct copy of the original netCDF data file for test.67, NOT through OpenDAP? Also can you please show me how to access the server-side OpenDAP support code, and its recent change history on remotetest.unidata.ucar.edu? Are you using Unidata/tds, or something else?

@DennisHeimbigner
Copy link
Collaborator

DennisHeimbigner commented Mar 24, 2022

The test.67 is produced by the WAR file constructed by the code
in this github repository: https://github.com/Unidata/Tds.
Specifically, the path is tds/opendap/dtswar.
This code is quite old.

The process for accessing e.g. test.67 is a bit complicated.
The .dds file is ./src/main/webapp/WEB-INF/resources/testdatasets/dds/test.67.
However, the server sends out a synthetic .dods file and specifically
the data for 64-bit floats in that .dds are synthesized
by generating a sequence of doubles in the file
tds/opendap/dtswar/src/main/java/opendap/dts/testEngine.java.
and specifically in the function nextFloat64(), which looks like this:

public double nextFloat64() {
    double b = (double) 1000 * Math.cos(tFloat64);

    tFloat64 += 0.01;
    return (b);
}

where the starting value for tFloat64 is 0.0.
So, the sequence of values is

  1. 1000*cos(0.0)
  2. 1000cos(1000cos(# 1))
    ...
    n. 1000cos(1000cos(#n-1))

So perhaps the issue is the cosine function? (seems unlikely)

@Dave-Allured
Copy link
Contributor

Synthesized! I had not really thought about that. Well, I see at least two opportunities for roundoff deviation to creep in. Yes, cosine is one of them. tFloat64 += 0.01 is another because 0.01 is not exactly representable in base 2. Possible causes of deviations are change in hosting CPU type, compiler or runtime math optimizations, and runtime code fixes and improvements. On reflection I think cosine is the more likely culprit. All those Taylor series with their vanishing fractions...

This still does not rule out the possibility of some deviation in the way OpenDAP is communicating the synthesized data to the client.

Rather than diving deeper, I suggest a tentative easy solution that may be good for the long run. Generate a one-time actual netCDF file that will output the expected test.67 pattern through OpenDAP. Install this netCDF file in place of synthesis on remotetest.unidata. Then the future is insulated from possible new math deviations within synthesis. Also, if crafted properly, then all of the old and current distributed netCDF code release versions will start working properly again. As a further benefit, this would exercise more of the active OpenDAP code on the server side.

Attached is one such crafted netCDF file that should regenerate the expected test.67 pattern, for your consideration. Also included is the full-precision CDL file that generated the netCDF file using ncgen.
test.67.regen4.zip

@DennisHeimbigner
Copy link
Collaborator

Changing the dts war to serve up a fixed file for a single test is, I suspect,
a rather large task. It would be easier to do it for all test files, but that too is still a large task.
It is worth remembering that the reason for remotetest is to check out the actual operation
of the OPeNDAP protocol with a real server. The actual data is not particularly important.
So we could modify test.67 to generate all integers instead of double values.

@WardF WardF modified the milestones: 4.9.0, 4.9.1 Jun 15, 2022
@WardF WardF modified the milestones: 4.9.1, 4.9.2 Feb 13, 2023
@WardF WardF modified the milestones: 4.9.2, 4.9.3 May 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants