Variant_test fails on Fedora Rawhide #3749

junghans · 2020-06-05T17:59:51Z

10/68 Test #10: Variant_test .....................***Failed    0.00 sec
Running 6 test cases...
unknown location(0): �[4;31;49mfatal error: in "pack_pair_test": std::out_of_range: vector::_M_range_check: __n (which is 1) >= this->size() (which is 1)�[0;39;49m
/builddir/build/BUILD/espresso/src/core/unit_tests/Variant_test.cpp(130): �[1;36;49mlast checkpoint�[0;39;49m
unknown location(0): �[4;31;49mfatal error: in "pack_map_test": std::out_of_range: vector::_M_range_check: __n (which is 1) >= this->size() (which is 1)�[0;39;49m
/builddir/build/BUILD/espresso/src/core/unit_tests/Variant_test.cpp(139): �[1;36;49mlast checkpoint�[0;39;49m
�[1;31;49m*** 2 failures are detected in the test module "Variant test"
�[0;39;49m

Build log here

The text was updated successfully, but these errors were encountered:

junghans · 2020-06-05T18:01:19Z

Same on x86_64: build_x86_64.log.txt

jngrad · 2020-06-05T18:22:23Z

Already fixed by #3725 😉

junghans · 2020-06-05T18:34:40Z

Ah and you even told me about it, let me test it quickly.

junghans · 2020-06-05T20:03:19Z

Hmm, now I am getting:

       Start   1: save_checkpoint_lb.cpu-p3m.cpu-lj-therm.lb_1
  1/146 Test   #1: save_checkpoint_lb.cpu-p3m.cpu-lj-therm.lb_1 ......***Failed    1.03 sec
/usr/include/c++/10/bits/stl_vector.h:1123: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::front() [with _Tp = char; _Alloc = boost::mpi::allocator<char>; std::vector<_Tp, _Alloc>::reference = char&]: Assertion '__builtin_expect(!this->empty(), true)' failed.
/usr/include/c++/10/bits/stl_vector.h:1123: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::front() [with _Tp = char; _Alloc = boost::mpi::allocator<char>; std::vector<_Tp, _Alloc>::reference = char&]: Assertion '__builtin_expect(!this->empty(), true)' failed.
/usr/include/c++/10/bits/stl_vector.h:1123: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::front() [with _Tp = char; _Alloc = boost::mpi::allocator<char>; std::vector<_Tp, _Alloc>::reference = char&]: Assertion '__builtin_expect(!this->empty(), true)' failed.
===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 4143224 RUNNING AT buildvm-17.phx2.fedoraproject.org
=   EXIT CODE: 134
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

for nearly all tests.

build_x86_64.log.txt

junghans · 2020-06-05T20:09:38Z

Reason I doing a rebuilt: https://bugzilla.redhat.com/show_bug.cgi?id=1843105

jngrad · 2020-06-05T20:57:59Z

The F33FailsToInstall issue is quite unfortunate. There's a similar packaging issue with CentOS 8, when I tried to create a docker image, it failed due to missing python dependencies...

I've just created a Fedora 33 docker image and can reproduce the new error with OpenMPI. It's only happening for 2 or more MPI ranks. Here's my trace:

/usr/include/c++/10/bits/stl_vector.h:1123: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::front() [with _Tp = char; _Alloc = boost::mpi::allocator<char>; std::vector<_Tp, _Alloc>::reference = char&]: Assertion '__builtin_expect(!this->empty(), true)' failed.
[fc8380a4505c:08126] *** Process received signal ***
[fc8380a4505c:08126] Signal: Aborted (6)
[fc8380a4505c:08126] Signal code:  (-6)
[fc8380a4505c:08126] [ 0] /lib64/libc.so.6(+0x3cb70)[0x7efc20a8bb70]
[fc8380a4505c:08126] [ 1] /lib64/libc.so.6(gsignal+0x145)[0x7efc20a8bae5]
[fc8380a4505c:08126] [ 2] /lib64/libc.so.6(abort+0x116)[0x7efc20a74884]
[fc8380a4505c:08126] [ 3] /usr/lib64/openmpi/lib/libboost_mpi.so.1.73.0(boost::mpi::detail::packed_archive_send(boost::mpi::communicator const&, int, int, boost::mpi::packed_oarchive const&)+0xad)[0x7efc19b683bd]
[fc8380a4505c:08126] [ 4] /home/espresso/espresso/build/src/core/EspressoCore.so(+0x1dca7f)[0x7efc19d54a7f]
[fc8380a4505c:08126] [ 5] /home/espresso/espresso/build/src/core/EspressoCore.so(void Utils::Mpi::gather_buffer<ErrorHandling::RuntimeError>(std::vector<ErrorHandling::RuntimeError, std::allocator<ErrorHandling::RuntimeError> >&, boost::mpi::communicator, int)+0xa9)[0x7efc19d55089]
[fc8380a4505c:08126] [ 6] /home/espresso/espresso/build/src/core/EspressoCore.so(ErrorHandling::RuntimeErrorCollector::gatherSlave()+0x2e)[0x7efc19d50b1e]
[fc8380a4505c:08126] [ 7] /home/espresso/espresso/build/src/core/EspressoCore.so(mpi_loop()+0x110)[0x7efc19ca2220]
[fc8380a4505c:08126] [ 8] /home/espresso/espresso/build/src/python/espressomd/_init.so(+0xdcde)[0x7efc1a293cde]

The issue happens in boost::mpi::detail::packed_archive_send(). When I find the time next week, I'll recompile the boost::mpi library with debug info and attach GDB to both MPI ranks from within a docker container to look at the value of variables.

junghans · 2020-06-08T22:01:14Z

Any news on this?

fweik · 2020-06-09T16:29:33Z

Isn't this again a instance of the well known problem where it tries to get a pointer by taking the address of the first element? (which is UB when die vector is empty). I think the eventual cause
is here: https://github.com/boostorg/mpi/blob/1c09f39948218d094a4fde9c09309580f33c5db5/include/boost/mpi/detail/packed_oprimitive.hpp#L41, not much we can do about this tho...

fweik · 2020-06-09T16:30:36Z

Seems like boost.mpi is still not using UB sanitizer in the CI...

jngrad · 2020-06-09T16:33:05Z

Reproducible in the dev branch.

During initialization in espressomd/_init.pyx L37, mpi_loop() calls mpiCallbacks().loop(). This method iterates over a list of functors (MpiCallbacks.hpp:L595) and calls them with the current boost mpi communicator and a packed archive, in order to initialize mpi callbacks on slaves. This fails on a static callback, ErrorHandling::mpi_gather_runtime_errors_slave(). That's because we reduce a vector of runtime errors that is empty at initialization time. Since a std::vector of size 0 returns a null pointer from std::vector::data(), we end up providing a null pointer as the source destination of boost::mpi::communicator::irecv(). Perhaps such null pointers were ignored in the past. They are now checked and will trigger an assertion in /usr/include/c++/10/bits/stl_vector.h:1123. Not sure if this is an intended change from the boost::mpi library or a regression (see boostorg/mpi#81 for a related issue). This can be fixed from our side with a proper guard in src/utils/include/utils/mpi/gather_buffer.hpp, but this only reduces the number of failing tests to 16.

The same error is triggered in src/utils/include/utils/mpi/gatherv.hpp, e.g. when collecting particles by id: if one MPI rank has no particles matching the ids, it will communicate an empty std::vector. This is more difficult to diagnose, as it depends on the number of MPI ranks and the geometry of domain decomposition. Fixing the remaining 16 failing tests for 3 and 4 MPI ranks will not guarantee that we properly guarded all of our MPI infrastructure against empty vectors.

fweik · 2020-06-09T16:35:48Z

As you might guess I'm also not a huge fan of adding awkward stuff in our code to work around broken libraries...

junghans · 2020-06-09T16:49:32Z

Can you make a PR for boostorg/mpi? Then we can patch it in Fedora

jngrad · 2020-06-09T16:56:29Z

I'm not a huge fan of the espresso MPI infrastructure :) It took me 5 hours to get this far and I still don't know if it's a regression in boost::mpi or if it's a side effect from gcc 10. I get the same failure with MPICH instead of OpenMPI.

fweik · 2020-06-09T16:59:25Z

Don't you agree that the line I pointed out is the root cause?

fweik · 2020-06-09T17:00:52Z

boostorg/mpi#81fixes a different instance of the same error.

junghans · 2020-06-09T18:08:06Z

@jwakely is there anything we can do to test boost-mpi on Fedora better?

mkuron · 2020-06-09T18:16:29Z

grepping for [0] and front in Boost.MPI reveals several dozen lines potentially affected by the same bugs. Some places explicitly check for emptyness, some ensure non-emptyness by surrounding code, but for many the vector comes straight from a function argument and can thus be empty. All these places should probably be converted to calling c_data so that they benefit from the fix in boostorg/mpi#81. Here's a quick patch, entirely untested, that could eventually be submitted upstream and included in Fedora.

jngrad · 2020-06-09T18:18:56Z

Following the GDB backtrace in the boost::mpi frames, we have a call to communicator::send(), which calls communicator::array_send_impl(), which calls detail::packed_archive_send(). In that function there's a call to packed_oarchive::address(), but taking the address of a dereferenced null pointer returns a null pointer (thanks to compiler optimizations). I still don't know how we get to std::vector::front(), though.

Fedora 33 currently has OpenMPI 4.0.4rc1 and boost::mpi 1.73.0. I've recompiled OpenMPI 4.0.4rc1, rc2, and rc3 from sources without UCX and boost::mpi 1.73.0 from sources and in all 3 cases couldn't reproduce the error.

fweik · 2020-06-09T18:36:20Z

The front comes from the binary version of oprimitiv, same story: https://github.com/boostorg/mpi/blob/1c09f39948218d094a4fde9c09309580f33c5db5/include/boost/mpi/detail/binary_buffer_oprimitive.hpp#L43

jwakely · 2020-06-09T22:14:59Z

@jwakely is there anything we can do to boost-mpi on Fedora better?

@junghans is there a word missing from that question? "test" maybe?

If somebody has a fix for Boost.MPI we can certainly apply that to the Fedora boost packages.

junghans · 2020-06-09T23:19:14Z

@jwakely is there anything we can do to boost-mpi on Fedora better?

@junghans is there a word missing from that question? "test" maybe?

Yeah.

If somebody has a fix for Boost.MPI we can certainly apply that to the Fedora boost packages.

Well that is something, so we can at least unbreak Rawhide.

mkuron · 2020-06-10T07:34:58Z

@jngrad, since you have already compiled the libraries from source, could you please test (and probably fix) https://github.com/espressomd/espresso/files/4754022/cdata.txt? That would get rid of the problem once and for all.

jwakely · 2020-06-10T07:51:00Z

@mkuron that very last change looks wrong. You've changed the function name as well as the body.

fweik · 2020-06-10T09:27:51Z

FYI, boostorg/mpi@28a73ea seems to pass CI. ~~I'll open PR upstream...~~ I made a PR (boostorg/mpi#119) upstream

jngrad · 2020-06-10T11:32:13Z

Compiling boost with debug symbols

curl -sL https://dl.bintray.com/boostorg/release/1.73.0/source/boost_1_73_0.tar.bz2 | tar xj
cd boost_1_73_0
echo 'using mpi ;' > tools/build/src/user-config.jam
 ./bootstrap.sh --with-libraries=filesystem,mpi,serialization,test,system
./b2 -j $(nproc) install --prefix=/opt/boost variant=debug debug-symbols=on

won't generate ./stage/lib/cmake/boost_mpi-1.73.0/libboost_mpi-variant-shared.cmake (other submodules like test and serialization have one), which is looked for by CMake:

set(Boost_DEBUG ON)
find_package(Boost 1.65 REQUIRED mpi;serialization;filesystem;system;unit_test_framework)

-- Found Boost 1.73.0 at /opt/boost/lib/cmake/Boost-1.73.0
--   Requested configuration: QUIET REQUIRED COMPONENTS mpi;serialization;filesystem;system;unit_test_framework
-- BoostConfig: find_package(boost_headers 1.73.0 EXACT CONFIG REQUIRED QUIET HINTS /opt/boost/lib/cmake)
-- Found boost_headers 1.73.0 at /opt/boost/lib/cmake/boost_headers-1.73.0
-- BoostConfig: find_package(boost_mpi 1.73.0 EXACT CONFIG REQUIRED QUIET HINTS /opt/boost/lib/cmake)
-- Found boost_mpi 1.73.0 at /opt/boost/lib/cmake/boost_mpi-1.73.0
-- Boost toolset is gcc10 (GNU 10.1.1)
-- Scanning /opt/boost/lib/cmake/boost_mpi-1.73.0/libboost_mpi-variant*.cmake
--   Library has no variants and is considered not found
CMake Error at /opt/boost/lib/cmake/Boost-1.73.0/BoostConfig.cmake:141 (find_package):
  Found package configuration file:

    /opt/boost/lib/cmake/boost_mpi-1.73.0/boost_mpi-config.cmake

  but it set boost_mpi_FOUND to FALSE so package "boost_mpi" is considered to
  be NOT FOUND.  Reason given by package:

  No suitable build variant has been found.

Call Stack (most recent call first):
  /opt/boost/lib/cmake/Boost-1.73.0/BoostConfig.cmake:258 (boost_find_component)
  /usr/share/cmake/Modules/FindBoost.cmake:444 (find_package)
  CMakeLists.txt:265 (find_package)

-- Configuring incomplete, errors occurred!

KaiSzuttor · 2020-06-10T11:36:39Z

the same build protocol works if not building DEBUG?

jngrad · 2020-06-10T11:38:31Z

Without DEBUG I can compile and use boost, but there are no assertions, so I can't test if @fweik's PR actually solves the bug.

jngrad · 2020-06-10T13:14:30Z

Took me two hours, but I finally got boost to compile in debug mode. The trick was to compile OpenMPI from sources first (instead of fetching openmpi-devel from the Fedora repo), then the boost build system generates a libboost_mpi-variant-shared.cmake. But I still can't reproduce the bug with that debug build of boost 1.73.0. I've tried again with

./b2 -j $(nproc) install --prefix=/opt/boost \
  variant=release threading=multi debug-symbols=on pch=off

to get closer to the build environment in the boost-openmpi section of the boost specfile, but it's still not reproducible.

fweik · 2020-06-10T15:37:01Z

I'd expect you need a debug version of the standard library? This is where the assert is triggered...

jwakely · 2020-06-10T16:24:22Z

No, there is no debug version. Just define _GLIBCXX_ASSERTIONS when compiling the code that does the out of bounds accesses. The assert is in templates in headers, not compiled into the standard library.

The entire fedora distro is built with that macro defined.

fweik · 2020-06-10T16:43:58Z

Ah you are right, I got it confused with _GLIBCXX_DEBUG which changes ABI...

jngrad · 2020-06-10T17:43:03Z

Just define _GLIBCXX_ASSERTIONS when compiling the code that does the out of bounds accesses.

Thanks! I'm now able to reproduce the bug with boost compiled from sources. The b2 command is:

./b2 -j $(nproc) install --prefix=/opt/boost variant=release threading=multi \
  debug-symbols=on pch=off define=_GLIBCXX_ASSERTIONS

Applying the patch in boostorg/mpi#119 fixes the bug reported by @junghans for both openmpi-devel (OpenMPI 4.0.4rc1) and mpich-devel (MPICH 3.3.2).

fweik · 2020-06-10T18:11:25Z

Do you think we should turn on _GLIBCXX_ASSERTIONS in the CI on linux builds?

junghans · 2020-06-10T18:31:38Z

@jwakely can you patch boostorg/mpi#119 into rawhide?

jwakely · 2020-06-10T18:37:58Z

Do you think we should turn on _GLIBCXX_ASSERTIONS in the CI on linux builds?

@fweik it can't hurt, and could be useful.

@junghans will do -- nice work, everybody.

jngrad · 2020-06-10T18:56:59Z

Do you think we should turn on _GLIBCXX_ASSERTIONS in the CI on linux builds?

Depends on the extra runtime. If it adds more than 5 min to all builds, we could be more selective, e.g. enabling it only on the fedora and centos images, as well as maxset (which uses the latest Ubuntu packages and isn't a dependency of other jobs). I guess it shouldn't increase runtime by much, because our codebase uses mostly Vector instead of std::vector, and we have bounds checking in Vector.

jngrad · 2020-06-10T20:54:08Z

Running the python tests on maxset in a Fedora 33 image with _GLIBCXX_ASSERTIONS (de1d978) only increases the runtime by 2 seconds, so we could add this define in our CI once boost 1.73.1 becomes available in Ubuntu. The bug is reproducible in our Ubuntu 20.04 image with boost 1.71.0 and gcc 8.4 (Travis-CI logfile) but not on ICP machines with Ubuntu 18.04 and boost 1.65.1 and gcc 7.5. I'm surprised @junghans didn't run into this bug before, _GLIBCXX_ASSERTIONS seems to have been introduced in Fedora 28, and is in the STL since June 2016 (gcc-mirror/gcc@bd2ee798d).

junghans · 2020-06-10T23:45:34Z

@junghans will do -- nice work, everybody.

@jwakely, let me know when and I will attempt an espresso rebuild after that.

mkuron · 2020-06-11T06:11:23Z

Do you think we should turn on _GLIBCXX_ASSERTIONS in the CI on linux builds?

If we do that, we can‘t use the distribution-packaged Boost anymore, but need to build our own.

Don‘t we have Fedora in CI already? Why didn‘t it catch this problem?

jwakely · 2020-06-11T08:29:02Z

If we do that, we can‘t use the distribution-packaged Boost anymore, but need to build our own.

No, that's not true.

Defining _GLIBCXX_ASSERTIONS doesn't change ABI and can be turned on or off per-translation unit. You won't check the entire program if some objects are built without it, but it's still more checking than not turning it on at all.

mkuron · 2020-06-11T09:55:24Z

Did that change recently? I seem to recall that it used to break ABI compatibility. Might have been before they separated _GLIBCXX_DEBUG and _GLIBCXX_ASSERTIONS.

jwakely · 2020-06-11T10:15:48Z

"They" is me. I added _GLIBCXX_ASSERTIONS, and the whole point is that it doesn't alter the ABI the way that _GLIBCXX_DEBUG does.

jwakely · 2020-06-11T10:16:36Z

Anyway, the change in boostorg/mpi#119 seems to have a problem, please see my comment.

jwakely · 2020-06-11T16:26:58Z

@junghans there's a new Boost in rawhide now.

junghans · 2020-06-11T16:31:35Z

Testing: https://koji.fedoraproject.org/koji/taskinfo?taskID=45598515

junghans · 2020-06-11T17:16:20Z

On s390x:

      Start 66: all_gatherv_test
66/68 Test #66: all_gatherv_test .................***Exception: SegFault 36.10 sec
Running 2 test cases...
�[1;32;49m*** No errors detected
�[0;39;49m

junghans · 2020-06-11T18:40:00Z

All other archs pass, so I will exclude s390x for now and have opened an issue for the s390x issue (#3753)

jngrad added this to the Espresso 4.1.3 milestone Jun 5, 2020

jngrad closed this as completed Jun 16, 2020

Variant_test fails on Fedora Rawhide #3749

Variant_test fails on Fedora Rawhide #3749

Comments

junghans commented Jun 5, 2020

junghans commented Jun 5, 2020

jngrad commented Jun 5, 2020

junghans commented Jun 5, 2020

junghans commented Jun 5, 2020

junghans commented Jun 5, 2020

jngrad commented Jun 5, 2020

junghans commented Jun 8, 2020

fweik commented Jun 9, 2020 • edited Loading

fweik commented Jun 9, 2020

jngrad commented Jun 9, 2020

fweik commented Jun 9, 2020

junghans commented Jun 9, 2020

jngrad commented Jun 9, 2020

fweik commented Jun 9, 2020

fweik commented Jun 9, 2020

junghans commented Jun 9, 2020 • edited Loading

mkuron commented Jun 9, 2020

jngrad commented Jun 9, 2020

fweik commented Jun 9, 2020

jwakely commented Jun 9, 2020

junghans commented Jun 9, 2020

mkuron commented Jun 10, 2020

jwakely commented Jun 10, 2020

fweik commented Jun 10, 2020 • edited Loading

jngrad commented Jun 10, 2020

KaiSzuttor commented Jun 10, 2020

jngrad commented Jun 10, 2020

jngrad commented Jun 10, 2020

fweik commented Jun 10, 2020

jwakely commented Jun 10, 2020 • edited Loading

fweik commented Jun 10, 2020

jngrad commented Jun 10, 2020

fweik commented Jun 10, 2020

junghans commented Jun 10, 2020

jwakely commented Jun 10, 2020

jngrad commented Jun 10, 2020

jngrad commented Jun 10, 2020

junghans commented Jun 10, 2020

mkuron commented Jun 11, 2020 • edited Loading

jwakely commented Jun 11, 2020 • edited Loading

mkuron commented Jun 11, 2020

jwakely commented Jun 11, 2020 • edited Loading

jwakely commented Jun 11, 2020

jwakely commented Jun 11, 2020

junghans commented Jun 11, 2020

junghans commented Jun 11, 2020

junghans commented Jun 11, 2020

fweik commented Jun 9, 2020 •

edited

Loading

junghans commented Jun 9, 2020 •

edited

Loading

fweik commented Jun 10, 2020 •

edited

Loading

jwakely commented Jun 10, 2020 •

edited

Loading

mkuron commented Jun 11, 2020 •

edited

Loading

jwakely commented Jun 11, 2020 •

edited

Loading

jwakely commented Jun 11, 2020 •

edited

Loading