
Example to document a consistency problem in HDF5 file with parallel execution in Unify #588

Open
clmendes opened this issue Dec 9, 2020 · 0 comments

As requested, I am creating this issue to document a problem we have observed in some of the HDF5 tests that create a file in a parallel execution. The effect of this problem is incorrect contents in the file produced under Unify (i.e. the file produced by a Unify execution differs from the file produced by a non-Unify execution). The program is a simplified version of test CCHUNK5 in the HDF5 test suite.

The current workaround for this problem is either of the following:
(a) insert a call to H5Fflush(file_id) at an appropriate location in the source file t_chunk.c
or
(b) run the program with Unify's setting UNIFYFS_CLIENT_WRITE_SYNC=1

The program creates an HDF5 file named ParaTest.h5. Under Unify, when neither of the workarounds above is applied, the resulting file shows the following differences from the file produced without Unify:

[mendes3@catalyst160:UNIFY]$ cmp -b -l ParaTest.h5 ../MPI/ParaTest.h5
949 124 T 220 M-^P
950 63 3 62 2
4865 0 ^@ 1 ^A
4869 0 ^@ 2 ^B
5057 0 ^@ 3 ^C
5061 0 ^@ 4 ^D

The first two differing bytes (offsets 949 and 950) do not matter: they are part of a timestamp in the HDF5 file, so they are expected to differ. However, bytes 4865/4869/5057/5061 are indeed different/wrong.

I am attaching the three source files (testchunk.c, t_chunk.c and t_ds.c), plus a Makefile. The Makefile builds two versions of the program, one without Unify (testchunk) and one with Unify (testchunk-gotcha). These executables are copied to sub-dirs MPI/ and UNIFY/, respectively, so that they can be executed from there.

It must be noted that the Makefile defines two locations:

UNIFYFS=/g/g12/mendes3/UnifyFS-581/UnifyFS/install
HDF5=/g/g12/mendes3/HDF5-1.10.2/hdf5-1.10.2/hdf5

UNIFYFS is where Unify was installed. HDF5 is where the SOURCES of HDF5-1.10.2 are located. The build does require the HDF5 sources, because some HDF5 include files needed by the build are not installed on the system. The h5pcc command used in the Makefile is made available on the Catalyst system by running module load hdf5-parallel:
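For reference, a hedged sketch of how such a Makefile might be wired. The two variable values are the ones quoted above; the include paths, target rules, and link flags are assumptions for illustration, not the issue's actual Makefile (UnifyFS's transparent interception is normally linked with -lunifyfs_gotcha):

```makefile
UNIFYFS=/g/g12/mendes3/UnifyFS-581/UnifyFS/install
HDF5=/g/g12/mendes3/HDF5-1.10.2/hdf5-1.10.2/hdf5

CC=h5pcc
# Assumed include paths: HDF5 source headers plus the Unify install tree.
CFLAGS=-I$(HDF5)/src -I$(UNIFYFS)/include

SRCS=testchunk.c t_chunk.c t_ds.c

# Plain MPI version.
testchunk: $(SRCS)
	$(CC) $(CFLAGS) -o $@ $(SRCS)

# Unify version, intercepting I/O via the gotcha library.
testchunk-gotcha: $(SRCS)
	$(CC) $(CFLAGS) -o $@ $(SRCS) -L$(UNIFYFS)/lib -lunifyfs_gotcha
```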

$ which h5pcc
/usr/tce/packages/hdf5/hdf5-parallel-1.10.2-intel-19.0.4-mvapich2-2.3/bin/h5pcc

The program has been tested with 4 processors on 2 Catalyst nodes (i.e. 2 processors/node).

The three source files and the Makefile are in the gzip file below (chunk5.gz).

chunk5.gz
