t_cache_image hanging on some machines #71

Closed
bljhdf opened this issue Nov 2, 2020 · 4 comments
Labels
Component - Parallel Parallel HDF5 (NOT thread-safety) Component - Testing Code in test or testpar directories, GitHub workflows Priority - 1. High 🔼 These are important issues that should be resolved in the next release Type - Bug / Bugfix Please report security issues to [email protected] instead of creating an issue on GitHub

Comments

@bljhdf
Contributor

bljhdf commented Nov 2, 2020

Date: Fri, 23 Oct 2020 07:28:52 -0600
From: Orion Poplawski [email protected]
To: HDF Helpdesk [email protected]
Subject: t_cache_image hanging on some machines

When building hdf5 1.10.6 or 1.10.7 for Fedora Rawhide on the Fedora builders, t_cache_image hangs when run with openmpi
on some architectures (including x86_64). Unfortunately, we cannot reproduce it locally, so our ability to debug
the issue is limited. Here is the output of the test:

============================
Testing: t_cache_image

        Error ignored
        ============================
        Test log for t_cache_image
        ============================
        ===================================
        Parallel metadata cache image tests
        mpi_size = 6
        ===================================
        Constructing test files:
        writing t_cache_image_00 ... done.
        writing t_cache_image_01 ... done.
        Test file construction complete.
        testfile construction complete -- proceeding with tests.
        Testing parallel CI load test -- proc0 md write -- R/O HDF5-DIAG: Error detected in HDF5 (1.10.7) MPI-process 2:

  #000: ../../src/H5D.c line 298 in H5Dopen2(): unable to open dataset
    major: Dataset
    minor: Can't open object
  #001: ../../src/H5Dint.c line 1429 in H5D__open_name(): not found
    major: Dataset
    minor: Object not found
  #002: ../../src/H5Gloc.c line 420 in H5G_loc_find(): can't find object
    major: Symbol table
    minor: Object not found
  #003: ../../src/H5Gtraverse.c line 848 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #004: ../../src/H5Gtraverse.c line 579 in H5G__traverse_real(): can't look up component
    major: Symbol table
    minor: Object not found
  #005: ../../src/H5Gobj.c line 1118 in H5G__obj_lookup(): can't check for link info message
    major: Symbol table
    minor: Can't get value
  #006: ../../src/H5Gobj.c line 324 in H5G__obj_get_linfo(): unable to read object header
    major: Symbol table
    minor: Can't get value
  #007: ../../src/H5Omessage.c line 873 in H5O_msg_exists(): unable to protect object header
    major: Object header
    minor: Unable to protect metadata
  #008: ../../src/H5Oint.c line 1056 in H5O_protect(): unable to load object header
    major: Object header
    minor: Unable to protect metadata
  #009: ../../src/H5AC.c line 1517 in H5AC_protect(): H5C_protect() failed
    major: Object cache
    minor: Unable to protect metadata
  #010: ../../src/H5C.c line 2378 in H5C_protect(): Can't load cache image
    major: Object cache
    minor: Unable to load metadata into cache
  #011: ../../src/H5Cimage.c line 1164 in H5C__load_cache_image(): Can't reconstruct cache contents from image block
    major: Object cache
    minor: Unable to decode value
  #012: ../../src/H5Cimage.c line 3137 in H5C__reconstruct_cache_contents(): reconstruction of cache entry failed
    major: Object cache
    minor: Internal error detected
  #013: ../../src/H5Cimage.c line 3408 in H5C__reconstruct_cache_entry(): invalid entry size
    major: Object cache
    minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.10.7) MPI-process 1:
  #000: ../../src/H5D.c line 298 in H5Dopen2(): unable to open dataset
    major: Dataset
    minor: Can't open object
  #001: ../../src/H5Dint.c line 1429 in H5D__open_name(): not found
    major: Dataset
    minor: Object not found
  #002: ../../src/H5Gloc.c line 420 in H5G_loc_find(): can't find object
    major: Symbol table
    minor: Object not found
  #003: ../../src/H5Gtraverse.c line 848 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #004: ../../src/H5Gtraverse.c line 579 in H5G__traverse_real(): can't look up component
    major: Symbol table
    minor: Object not found
  #005: ../../src/H5Gobj.c line 1118 in H5G__obj_lookup(): can't check for link info message
    major: Symbol table
    minor: Can't get value
  #006: ../../src/H5Gobj.c line 324 in H5G__obj_get_linfo(): unable to read object header
    major: Symbol table
    minor: Can't get value
  #007: ../../src/H5Omessage.c line 873 in H5O_msg_exists(): unable to protect object header
    major: Object header
    minor: Unable to protect metadata
  #008: ../../src/H5Oint.c line 1056 in H5O_protect(): unable to load object header
    major: Object header
    minor: Unable to protect metadata
  #009: ../../src/H5AC.c line 1517 in H5AC_protect(): H5C_protect() failed
    major: Object cache
    minor: Unable to protect metadata
  #010: ../../src/H5C.c line 2378 in H5C_protect(): Can't load cache image
    major: Object cache
    minor: Unable to load metadata into cache
  #011: ../../src/H5Cimage.c line 1164 in H5C__load_cache_image(): Can't reconstruct cache contents from image block
    major: Object cache
    minor: Unable to decode value
  #012: ../../src/H5Cimage.c line 3137 in H5C__reconstruct_cache_contents(): reconstruction of cache entry failed
    major: Object cache
    minor: Internal error detected
  #013: ../../src/H5Cimage.c line 3408 in H5C__reconstruct_cache_entry(): invalid entry size
    major: Object cache
    minor: Bad value

It would be helpful to know what the developers think of this and what we could do to further debug the issue.


Orion Poplawski
Manager of NWRA Technical Systems 720-772-5637
NWRA, Boulder/CoRA Office FAX: 303-415-9702
3380 Mitchell Lane [email protected]
Boulder, CO 80301 https://www.nwra.com/


It's inside a VM, on an XFS filesystem.
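
The two error stacks in the report above are long, but both end in the same frame: `H5C__reconstruct_cache_entry()` reporting "invalid entry size". When comparing logs from several builders, it may help to reduce each stack to its deepest frame per MPI process. The following is only a rough sketch for doing that; the function name `innermost_frames` and the regular expressions are assumptions written for this log layout, not part of HDF5 or its test tooling, and it assumes the stacks from different processes are not interleaved.

```python
import re

# Matches the per-process banner, even when it is appended to a "Testing ..." line.
HEADER = re.compile(r"HDF5-DIAG: Error detected in HDF5 \(([\d.]+)\) MPI-process (\d+):")
# Matches one stack frame, e.g. "#013: ../../src/H5Cimage.c line 3408 in f(): msg".
FRAME = re.compile(r"#(\d+):\s+(\S+) line (\d+) in (\w+)\(\): (.*)")


def innermost_frames(log_text):
    """Return {mpi_rank: (file, line, function, message)} for the deepest frame seen."""
    result = {}
    rank = None
    for line in log_text.splitlines():
        header = HEADER.search(line)
        if header:
            rank = int(header.group(2))
            continue
        frame = FRAME.search(line.strip())
        if frame and rank is not None:
            _, path, lineno, func, msg = frame.groups()
            # Frames are printed outermost first, so the last one seen is the deepest.
            result[rank] = (path, int(lineno), func, msg.strip())
    return result


# Abbreviated excerpt of the log from this report.
sample = """\
HDF5-DIAG: Error detected in HDF5 (1.10.7) MPI-process 2:
#000: ../../src/H5D.c line 298 in H5Dopen2(): unable to open dataset
major: Dataset
minor: Can't open object
#12: ../../src/H5Cimage.c line 3137 in H5C__reconstruct_cache_contents(): reconstruction of cache entry failed
#13: ../../src/H5Cimage.c line 3408 in H5C__reconstruct_cache_entry(): invalid entry size
"""

for mpi_rank, frame_info in innermost_frames(sample).items():
    print(mpi_rank, frame_info)
```

The pattern accepts both `#13:` and `#013:` style frame numbers, since either may appear depending on how the log was captured.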

@derobins
Member

This may be due to a collective metadata issue that has already been fixed, possibly in 1.10.6.

@opoplawski
Contributor

Currently, with hdf5 1.12.2 on Fedora Rawhide, I'm seeing this hang only on s390x with mpich 4.0.2. It appears to be new with the change from 1.12.1 to 1.12.2, but it may instead be due to an mpich update.
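
For a hang that only shows up on particular builders, one option is to bound the test run with a timeout so the build fails with whatever output was produced instead of stalling. The sketch below is a generic POSIX-only wrapper, not something taken from the HDF5 or Fedora build scripts; the helper name `run_with_timeout` and the example `mpiexec` command line are assumptions (the process count of 6 is taken from `mpi_size = 6` in the log above).

```python
import os
import signal
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (timed_out, returncode, combined_output)."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        start_new_session=True,  # new process group, so every rank can be signalled (POSIX only)
    )
    try:
        out, _ = proc.communicate(timeout=timeout_s)
        return False, proc.returncode, out
    except subprocess.TimeoutExpired:
        # Kill the launcher and all processes it started, then collect remaining output.
        os.killpg(proc.pid, signal.SIGKILL)
        out, _ = proc.communicate()
        return True, proc.returncode, out


# Hypothetical invocation on a builder (launcher name and test path are assumptions):
#   timed_out, rc, log = run_with_timeout(["mpiexec", "-n", "6", "./t_cache_image"], 600)
#   if timed_out:
#       sys.stderr.write("t_cache_image timed out; partial log follows\n" + log)
```

This only captures output up to the timeout; it does not show where each rank is stuck. Attaching a debugger to the hung ranks before they are killed would be needed for that.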

fortnern pushed a commit that referenced this issue Mar 23, 2023
@derobins derobins added Priority - 1. High 🔼 These are important issues that should be resolved in the next release Component - Parallel Parallel HDF5 (NOT thread-safety) Component - Testing Code in test or testpar directories, GitHub workflows Type - Bug / Bugfix Please report security issues to [email protected] instead of creating an issue on GitHub labels May 3, 2023
@derobins derobins added this to the 1.14.3 milestone Oct 9, 2023
@derobins
Member

@opoplawski - Is this still an issue with the hdf5_1_14 branch?

@derobins derobins assigned brtnfld and unassigned bmribler Oct 13, 2023
@derobins derobins modified the milestones: 1.14.3, 1.14.4 Oct 28, 2023
@derobins
Member

Closing due to age
