Fix sporadic h5diff_172 test failure w/ NVHPC #4571

Closed
derobins opened this issue Jun 15, 2024 · 2 comments
Labels: Component - Parallel, Component - Testing, Priority - 0. Blocker ⛔, Type - Bug / Bugfix

@derobins
Member

We are seeing sporadic test failures in the NVHPC CI action. The one I usually see is h5diff_172, though there may be others. These failures appear to be caused by a failing mkdir call, which appears to be a known problem with OpenMPI.

See here:

open-mpi/ompi#8510

We're currently testing with a pretty elderly version of NVHPC (23.9.0) since newer versions have problems with some long double conversions. This version of NVHPC appears to use an older version of OpenMPI (3.1.5 - see the docs: https://docs.nvidia.com/hpc-sdk/archive/23.9/hpc-sdk-release-notes/index.html). They claim that this is fixed in recent versions of OpenMPI and I don't see it on my VMs, where I build with OpenMPI 4.1.5. We don't see this in other parallel test actions since we usually only configure and build for parallel in GitHub CI.

We probably have a few options:

  1. Disable NVHPC in GitHub CI and rely on CDash reporting, like we do for every other compiler w/ parallel HDF5
  2. Add --mca orte_tmpdir_base <dir> to OpenMPI's mpiexec options
  3. Fix the issues w/ long double so that #4171 (Update NVHPC to 24.5) can go in, bumping NVHPC to 24.5, which should give us OpenMPI 4.1.x via HPC-X (I think - this is unclear from casually perusing the docs)
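
Option 2 could be sketched roughly as follows. This is a minimal sketch, not the project's CI configuration: the helper name `build_mpiexec_command` and the `ompi-session-` prefix are hypothetical, and the only Open MPI options used are the two quoted in this issue (`--mca orte_tmpdir_base <dir>` and `--mca opal_warn_on_missing_libcuda 0`). The idea is to give each launch its own session-directory base instead of the default location under /tmp.

```python
import tempfile


def build_mpiexec_command(program_args, tmpdir_base, nprocs=2, mpiexec="mpiexec"):
    """Return an mpiexec argv that points Open MPI's session dir base at tmpdir_base.

    Hypothetical helper for illustration; it only assembles arguments.
    """
    return [
        mpiexec,
        "-n", str(nprocs),
        "--mca", "opal_warn_on_missing_libcuda", "0",
        "--mca", "orte_tmpdir_base", tmpdir_base,
        *program_args,
    ]


if __name__ == "__main__":
    # A fresh, private base directory for this launch (prefix is an assumption).
    with tempfile.TemporaryDirectory(prefix="ompi-session-") as base:
        cmd = build_mpiexec_command(
            ["ph5diff", "-v", "h5diff_basic1.h5", "h5diff_basic1.h5",
             "/g1/fp20", "/g1/fp20_COPY"],
            tmpdir_base=base,
        )
        # Print the command rather than running it, so the sketch works
        # on machines without Open MPI installed.
        print(" ".join(cmd))
```

Whether a per-launch directory actually avoids the mkdir failure on the CI runners is untested here; the sketch only shows where the extra option would go on the command line.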

The test failures look like this:

160: Test command: /usr/local/bin/cmake "-D" "TEST_EMULATOR=" "-D" "TEST_PROGRAM=/home/runner/work/hdf5/build/bin/h5diff" "-D" "TEST_ARGS:STRING=-v;h5diff_basic1.h5;h5diff_basic1.h5;/g1/fp20;/g1/fp20_COPY" "-D" "TEST_FOLDER=/home/runner/work/hdf5/build/tools/test/h5diff/testfiles" "-D" "TEST_OUTPUT=h5diff_172.out" "-D" "TEST_EXPECT=0" "-D" "TEST_REFERENCE=h5diff_172.txt" "-D" "TEST_APPEND=EXIT CODE:" "-P" "/home/runner/work/hdf5/hdf5/config/cmake/runTest.cmake"
160: Working Directory: /home/runner/work/hdf5/build/tools/test/h5diff/testfiles
160: Test timeout computed to be: 1200
160: -- Require TEST_EXPECT to be defined
160: -- COMMAND:  /home/runner/work/hdf5/build/bin/h5diff -v;h5diff_basic1.h5;h5diff_basic1.h5;/g1/fp20;/g1/fp20_COPY
160: -- COMMAND Result: 0
160: -- COMMAND Error: 
160: -- COMPARE Result: 0
160: -- /home/runner/work/hdf5/build/bin/h5diff Passed
1632/2920 Test  #160: H5DIFF-h5diff_172 ..........................................................   Passed    0.02 sec
test 161
          Start  161: MPI_TEST_H5DIFF-h5diff_172

161: Test command: /usr/local/bin/cmake "-D" "TEST_PROGRAM=/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/comm_libs/openmpi4/bin/mpiexec" "-D" "TEST_ARGS:STRING=-n;2;--mca;opal_warn_on_missing_libcuda;0;/home/runner/work/hdf5/build/bin/ph5diff;;-v;h5diff_basic1.h5;h5diff_basic1.h5;/g1/fp20;/g1/fp20_COPY" "-D" "TEST_FOLDER=/home/runner/work/hdf5/build/tools/test/h5diff/PAR/testfiles" "-D" "TEST_OUTPUT=h5diff_172.out" "-D" "TEST_EXPECT=0" "-D" "TEST_REFERENCE=h5diff_172.txt" "-D" "TEST_APPEND=EXIT CODE:" "-D" "TEST_REF_APPEND=EXIT CODE: [0-9]" "-D" "TEST_REF_FILTER=EXIT CODE: 0" "-D" "TEST_SORT_COMPARE=TRUE" "-P" "/home/runner/work/hdf5/hdf5/config/cmake/runTest.cmake"
161: Working Directory: /home/runner/work/hdf5/build/tools/test/h5diff/PAR/testfiles
161: Test timeout computed to be: 1200
161: -- Require TEST_EXPECT to be defined
161: -- COMMAND:  /opt/nvidia/hpc_sdk/Linux_x86_64/23.9/comm_libs/openmpi4/bin/mpiexec -n;2;--mca;opal_warn_on_missing_libcuda;0;/home/runner/work/hdf5/build/bin/ph5diff;;-v;h5diff_basic1.h5;h5diff_basic1.h5;/g1/fp20;/g1/fp20_COPY
161: -- COMMAND Result: 1
161: -- Output :
161: EXIT CODE: 1
161: 
161: -- Error Output :
161: --------------------------------------------------------------------------
161: A call to mkdir was unable to create the desired directory:
161: 
161:   Directory: /tmp/ompi.fv-az651-831.1001/pid.37803
161:   Error:     No such file or directory
161: 
161: Please check to ensure you have adequate permissions to perform
161: the desired operation.
161: --------------------------------------------------------------------------
161: [fv-az651-831:37803] [[33398,0],0] ORTE_ERROR_LOG: Error in file ../../orte/util/session_dir.c at line 107
161: [fv-az651-831:37803] [[33398,0],0] ORTE_ERROR_LOG: Error in file ../../orte/util/session_dir.c at line 346
161: --------------------------------------------------------------------------
161: It looks like orte_init failed for some reason; your parallel process is
161: likely to abort.  There are many reasons that a parallel process can
161: fail during orte_init; some of which are due to configuration or
161: environment problems.  This failure appears to be an internal failure;
161: here's some additional information (which may only be relevant to an
161: Open MPI developer):
161: 
161:   orte_session_dir failed
161:   --> Returned value Error (-1) instead of ORTE_SUCCESS
161: --------------------------------------------------------------------------
161: 
161: CMake Error at /home/runner/work/hdf5/hdf5/config/cmake/runTest.cmake:130 (message):
161:   Failed: Test program
161:   /opt/nvidia/hpc_sdk/Linux_x86_64/23.9/comm_libs/openmpi4/bin/mpiexec exited
161:   != 0.
161: 
161: 
161: 
1633/2920 Test  #161: MPI_TEST_H5DIFF-h5diff_172 .................................................***Failed    0.03 sec
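
For readers unfamiliar with this error shape: `No such file or directory` is what a non-recursive mkdir reports when the parent of the target directory does not exist at the moment of the call. The sketch below reproduces only that behavior with Python's standard library. It is not Open MPI's session-directory code, and the directory names are made up to mirror the path in the log above.

```python
import errno
import os
import tempfile


def try_mkdir(path):
    """Attempt a non-recursive mkdir; return the errno name on failure, else None."""
    try:
        os.mkdir(path)
    except OSError as exc:
        return errno.errorcode[exc.errno]
    return None


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical names mirroring /tmp/ompi.<host>.<uid>/pid.<pid> from the log.
    parent = os.path.join(tmp, "ompi.host.1001")
    child = os.path.join(parent, "pid.12345")

    # The parent is missing (never created, or removed by something else).
    missing_parent_result = try_mkdir(child)

    # Once the parent exists, the same call succeeds.
    os.mkdir(parent)
    existing_parent_result = try_mkdir(child)

print(missing_parent_result)   # ENOENT
print(existing_parent_result)  # None
```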
@derobins derobins added the Priority - 0. Blocker ⛔, Component - Parallel, Component - Testing, Type - Bug / Bugfix labels Jun 15, 2024
@derobins derobins added this to the 1.14.5 milestone Jun 15, 2024
derobins added a commit to hyoklee/hdf5 that referenced this issue Jun 15, 2024
We don't test parallel in other GitHub actions, so this converts the
NVHPC check to configure and build only while we discuss how we'll
test parallel HDF5 in GitHub.

There is a blocking GitHub issue to address the test failures for
HDF5 1.14.5 (HDFGroup#4571).
derobins pushed a commit that referenced this issue Jun 15, 2024
We don't test parallel in other GitHub actions, so this also converts the
NVHPC check to configure and build only while we discuss how we'll
test parallel HDF5 in GitHub.

There is a blocking GitHub issue to address the test failures for
HDF5 1.14.5 (#4571).
byrnHDF pushed a commit to byrnHDF/hdf5 that referenced this issue Jun 26, 2024
lrknox pushed a commit to lrknox/hdf5 that referenced this issue Jul 2, 2024
lrknox added a commit that referenced this issue Jul 3, 2024
* Fix typos in context/property documentation (#4550)

* Fix CI markdown link check http 500 errors (#4556)

Sites like GitLab can have internal problems that return http 500
errors while they fix their problems. Some sites also return http
200 OK, which is fine.

This PR adds a config file to the markdown link check so those
are considered "passing" and don't break the CI.

* Simplify property copying between lists internally (#4551)

* Add Python examples (#4546)

These examples are referred to from the replacement page of https://portal.hdfgroup.org/display/HDF5/Other+Examples.

* Correct property cb signatures in docs (#4554)

* Correct property cb signatures in docs
* Correct delete callback type name in docs
* add missing word to H5P__free_prop doc

* Move C++ and Fortran and examples to HDF5Examples folder (#4552)

* Document 'return-and-read' field in API context (#4560)

* Add compression includes to tests needing zlib support (#4561)

* Allow usage of page buffering for serial file access from parallel HDF5 builds (#4568)

* Remove old version of libaec (#4567)

* Add property names to context field docs (#4563)

* Document property shared name behavior (#4565)

* Clarify H5CX macro documentation (#4569)

* Document H5Punregister modifying default properties (#4570)

* Update NVHPC to 24.5 (#4171)

We don't test parallel in other GitHub actions, so this also converts the
NVHPC check to configure and build only while we discuss how we'll
test parallel HDF5 in GitHub.

There is a blocking GitHub issue to address the test failures for
HDF5 1.14.5 (#4571).

* Clean up comments in H5FDros3.c (#4572)

* Rename INSTALL_Auto.txt to INSTALL_Autotools.txt (#4575)

* Clean up ros3 VFD stats code (#4579)

* Removes printf debugging
* Simplifies and centralizes stats code
* Use #ifdef ROS3_STATS instead of #if
* Other misc tidying

* Turn off ros3 VFD stat collection by default (#4581)

Not a new change - an artifact from a previous check-in.

* Pause recording errors instead of clearing the error stack (#4475)

An internal capability that's similar to the H5E_BEGIN_TRY / H5E_END_TRY
macros in H5Epublic.h, but more efficient since we can avoid pushing errors on
the stack entirely (and those macros use public API routines).

This capability (and other techniques) can be used to remove use of
H5E_clear_stack() and H5E_BEGIN_TRY / H5E_END_TRY within library routines.

We want to remove H5E_clear_stack() because it can trigger calls to the H5I
interface from within the H5E code, which creates a great deal of complexity
for threadsafe code.  And we want to remove H5E_BEGIN_TRY / H5E_END_TRY's
because they make public API calls from within the library code.

Also some other minor tidying in routines related to removing the use of
H5E_clear_stack() and H5E_BEGIN_TRY / H5E_END_TRY from H5Fint.c

* Add page buffer cache command line option to tools (#4562)


Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>

* Clarify documentation for H5CX_get_data_transform (#4580)

* Correct comment for H5CX_get_data_transform

* Document why data transform ctx field doesnt use macro

* Remove public API call from ros3 VFD (#4583)

* Remove printf debugging from H5FDs3comms.c (#4584)

* Cleanup of ros3 test (#4587)

* Removed JS* macro scheme (replaced w/ h5test.h macros)
* Moved curl setup/teardown to main()
* A lot of cleanup and simplification

* Removed unused code from H5FDs3comms.c (#4588)

* H5FD_s3comms_nlowercase()
* H5FD_s3comms_trim()
* H5FD_s3comms_uriencode()

* Remove magic fields from s3comms structs (#4589)

* Remove dead H5FD_s3comms_percent_encode_char() (#4591)

* Rework the TestExpress usage and refactor dead code (#4590)

* Skip examples if running sanitizers (#4592)

* Clean up s3comms test code (#4594)

* Remove JS* macros
* Remove dead code
* Bring in line with other test code

* Add publish to bucket workflow (#4566)

* Update abi report CI workflow for last release (#4596)

* Update abi report workflow to handle 1.14.4.3 release

* Update name of java report

* Document that ctx VOL property isn't drawn from the FAPL (#4597)

* Update macos workflow to 14 (keep 13 as alternate) (#4603)

* Removed unnecessary call to H5E_clear_stack (#4607)

H5FO_opened and H5SL_search don't push errors on the stack

* Bring subfiling VFD code closer to typical library code (#4595)

Remove API calls, use FUNC_ENTER/LEAVE macros, use the library's error macros,
rename functions to have more standardized names, etc.

* Correct documentation for return-and-read fields (#4598)

* These two generators create strings without NUL for testing (#4608)

* Fix Fortran pkconfig to indicate full path of modules (#4593)

* Updated release schedule (#4615)

1.16 and 2.0 information

* Document VOL object wrapping context (#4611)

* Earray.c and farray.c in hdf5_1_14 still need time_t curr_time for HDsrandom.

* Remove line to use future 116_API from CMakeListat.txt files in HDF5
examples directories
@lrknox lrknox closed this as completed Aug 23, 2024
@derobins derobins reopened this Aug 23, 2024
@derobins
Member Author

derobins commented Aug 23, 2024

This isn't fixed. We're still getting h5diff mkdir failures after #4171 was merged.

@derobins derobins modified the milestones: 1.14.5, 2.0.0 Oct 15, 2024
@lrknox
Collaborator

lrknox commented Oct 24, 2024

Not seeing h5diff mkdir failures anymore.

@lrknox lrknox closed this as completed Oct 24, 2024