

Fix tests with boost 1.86 on macos #43

Open
1 task done
traversaro opened this issue Nov 21, 2024 · 56 comments · Fixed by ompl/ompl#1199
Labels
bug Something isn't working

Comments

@traversaro
Contributor

traversaro commented Nov 21, 2024

Solution to issue cannot be found in the documentation.

  • I checked the documentation.

Issue

There is plenty of discussion on this, but it is spread over multiple PRs that typically go stale. Let's use this issue to keep track of everything.

This is quite important as it is now blocking:

Detailed description

Since the boost 1.84 migration (see #35; the previous pinned version was 1.82), two tests are failing:

13/21 Test #13: test_planner_data ................***Failed   56.65 sec
Running 6 test cases...
Error:   Failed to load PlannerData: input stream error
         at line 140 in $SRC_DIR/src/ompl/base/src/PlannerDataStorage.cpp
$SRC_DIR/tests/base/planner_data.cpp:530: error: in "Serialization": check data2.numVertices() == states.size() has failed [0 != 1000]
$SRC_DIR/tests/base/planner_data.cpp:531: error: in "Serialization": check data2.numEdges() == num_edges_to_add has failed [0 != 10000]
$SRC_DIR/tests/base/planner_data.cpp:534: error: in "Serialization": check data2.numStartVertices() == 3 has failed
$SRC_DIR/tests/base/planner_data.cpp:535: error: in "Serialization": check data2.numGoalVertices() == 2 has failed
$SRC_DIR/tests/base/planner_data.cpp:536: error: in "Serialization": check data2.isStartVertex(0) has failed
$SRC_DIR/tests/base/planner_data.cpp:537: error: in "Serialization": check data2.isStartVertex(states.size()/2) has failed
$SRC_DIR/tests/base/planner_data.cpp:538: error: in "Serialization": check data2.isStartVertex(states.size()-1) has failed
$SRC_DIR/tests/base/planner_data.cpp:539: error: in "Serialization": check data2.isGoalVertex(1) has failed
$SRC_DIR/tests/base/planner_data.cpp:540: error: in "Serialization": check data2.isGoalVertex(states.size()-2) has failed
unknown location:0: fatal error: in "Serialization": memory access violation at address: 0x8: no mapping at fault address
$SRC_DIR/tests/base/planner_data.cpp:544: last checkpoint

*** 10 failures are detected in the test module "PlannerData"

and

21/21 Test #21: test_planner_data_control ........***Failed   48.40 sec
Running 5 test cases...
Debug:   Storing 10000 PlannerDataEdgeControl objects
Error:   Failed to load PlannerData: input stream error
         at line 112 in $SRC_DIR/src/ompl/control/src/PlannerDataStorage.cpp
$SRC_DIR/tests/control/planner_data.cpp:549: error: in "Serialization": check data2.numVertices() == states.size() has failed [0 != 1000]
$SRC_DIR/tests/control/planner_data.cpp:550: error: in "Serialization": check data2.numEdges() == num_edges_to_add has failed [0 != 10000]
unknown location:0: fatal error: in "Serialization": memory access violation at address: 0x8: no mapping at fault address
$SRC_DIR/tests/control/planner_data.cpp:555: last checkpoint

*** 3 failures are detected in the test module "PlannerDataControl"
@traversaro traversaro added the bug Something isn't working label Nov 21, 2024
@traversaro
Contributor Author

traversaro commented Nov 21, 2024

Something I did but forgot about (I can't find any reference to it, but apparently I did it, as I found it on Google) is https://github.com/traversaro/ompl-macos-test/ . It is a comparison of the issue across different versions of compilers and boost:

  • brew seems to work fine, using AppleClang 14
  • conda-forge fails for boost>=1.84, works for boost==1.82, with all clangs from 14 to 17
  • conda-forge with system compiler works, using AppleClang 14

@EzraBrooks

Thanks for writing this up! I was already finding it difficult to juggle all the PRs and issues related to this. I will start looking into this, using your minimal reproduction in that repo.

@traversaro
Contributor Author

So, unfortunately AppleClang and Clang are slightly different compilers, but the fact that conda-forge's Clang 14 fails while AppleClang 14 succeeds suggests to me that there is something else going on. I am not sure which libcxx the system compiler is using, but just in case I also added the libcxx version to the build matrix, to check whether the tests pass with an older libcxx.

@traversaro
Contributor Author

For reference, it seems that the libboost package was built with clang 17 (see https://conda-metadata-app.streamlit.app/?q=conda-forge%2Fosx-64%2Flibboost-1.86.0-hbe88bda_2.conda).

@EzraBrooks

EzraBrooks commented Nov 22, 2024

I notice that the target SDK version in the boost PR is 11 - are we deliberately targeting systems as old as Big Sur? Since the ABI break was in macOS 12, would setting the minimum supported API to 12 either fix the problem or at least narrow the problem space?

Edit: nvm, I'm still learning the layout of all these repos and CI jobs. I see now that that PR is updating packages that depend upon boost, not updating boost itself.

@EzraBrooks

EzraBrooks commented Nov 22, 2024

Interestingly, running the contents of your repro case on my Mac locally (macOS 15) does not provoke the test failure. That makes me circle back to my hunch above that this has something to do with targeting macOS 11 as the lowest SDK version while building on a newer version.

@traversaro
Contributor Author

Interestingly, running the contents of your repro case on my Mac locally (macOS 15) does not provoke the test failure.

Interesting, can you share the environment in which you are running the tests (i.e. the conda list or pixi list output)? Thanks!

That makes me circle back to my hunch above that this has something to do with targeting macOS 11 as the lowest SDK version while building on a newer version.

I am testing this in #44 . 12.3 still fails, I can try to increase the version.

@traversaro
Contributor Author

I am testing this in #44 . 12.3 still fails, I can try to increase the version.

I tested 12.3 and 13.3 and they both fail. At the moment the conda-forge infrastructure does not support anything newer.

@EzraBrooks

Hopefully I copy-pasted the install commands from your CI repro case correctly!

~/D/P/o/build (main|✔) $ micromamba list                                                                                                                          (ompl)
List of packages in environment: "/Users/ezra/micromamba/envs/ompl"

  Name                    Version       Build                 Channel
───────────────────────────────────────────────────────────────────────────
  bzip2                   1.0.8         h99b78c6_7            conda-forge
  c-ares                  1.34.3        h5505292_0            conda-forge
  ca-certificates         2024.8.30     hf0a4a13_0            conda-forge
  cctools                 1010.6        hf67d63f_1            conda-forge
  cctools_osx-arm64       1010.6        h4208deb_1            conda-forge
  clang                   17.0.6        default_h360f5da_7    conda-forge
  clang-17                17.0.6        default_h146c034_7    conda-forge
  clang_impl_osx-arm64    17.0.6        he47c785_23           conda-forge
  clang_osx-arm64         17.0.6        h07b0088_23           conda-forge
  clangxx                 17.0.6        default_h360f5da_7    conda-forge
  clangxx_impl_osx-arm64  17.0.6        h50f59cd_23           conda-forge
  clangxx_osx-arm64       17.0.6        h07b0088_23           conda-forge
  cmake                   3.31.1        h326f17c_0            conda-forge
  compiler-rt             17.0.6        h856b3c1_2            conda-forge
  compiler-rt_osx-arm64   17.0.6        h832e737_2            conda-forge
  eigen                   3.4.0         hc021e02_0            conda-forge
  flann                   1.9.2         hedd063d_2            conda-forge
  hdf5                    1.14.4        nompi_ha698983_103    conda-forge
  icu                     75.1          hfee45f7_0            conda-forge
  krb5                    1.21.3        h237132a_0            conda-forge
  ld64                    951.9         h39a299f_1            conda-forge
  ld64_osx-arm64          951.9         hc81425b_1            conda-forge
  libaec                  1.1.3         hebf3989_0            conda-forge
  libblas                 3.9.0         25_osxarm64_openblas  conda-forge
  libboost                1.86.0        h29978a0_2            conda-forge
  libboost-devel          1.86.0        hf450f58_2            conda-forge
  libboost-headers        1.86.0        hce30654_2            conda-forge
  libcblas                3.9.0         25_osxarm64_openblas  conda-forge
  libccd-double           2.1           h9a09cb3_2            conda-forge
  libclang-cpp17          17.0.6        default_h146c034_7    conda-forge
  libcurl                 8.10.1        h13a7ad3_0            conda-forge
  libcxx                  19.1.4        ha82da77_0            conda-forge
  libcxx-devel            17.0.6        h86353a2_6            conda-forge
  libedit                 3.1.20191231  hc8eb9b7_2            conda-forge
  libev                   4.33          h93a5062_2            conda-forge
  libexpat                2.6.4         h286801f_0            conda-forge
  libffi                  3.4.2         h3422bc3_5            conda-forge
  libgfortran             5.0.0         13_2_0_hd922786_3     conda-forge
  libgfortran5            13.2.0        hf226fd6_3            conda-forge
  libglib                 2.82.2        h07bd6cf_0            conda-forge
  libiconv                1.17          h0d3ecfb_2            conda-forge
  libintl                 0.22.5        h8414b35_3            conda-forge
  liblapack               3.9.0         25_osxarm64_openblas  conda-forge
  libllvm17               17.0.6        h5090b49_2            conda-forge
  libllvm19               19.1.4        hc4b4ae8_0            conda-forge
  libmpdec                4.0.0         h99b78c6_0            conda-forge
  libnghttp2              1.64.0        h6d7220d_0            conda-forge
  libode                  0.16.5        py313hbab1857_0       conda-forge
  libopenblas             0.3.28        openmp_hf332438_1     conda-forge
  libsqlite               3.47.0        hbaaea75_1            conda-forge
  libssh2                 1.11.0        h7a5bd25_0            conda-forge
  libuv                   1.49.2        h7ab814d_0            conda-forge
  libxml2                 2.13.5        hbbdcc80_0            conda-forge
  libzlib                 1.3.1         h8359307_2            conda-forge
  llvm-openmp             19.1.4        hdb05f8b_0            conda-forge
  llvm-tools              17.0.6        h5090b49_2            conda-forge
  llvm-tools-19           19.1.4        h87a4c7e_0            conda-forge
  lz4-c                   1.9.4         hb7217d7_0            conda-forge
  make                    4.4.1         hc9fafa5_2            conda-forge
  ncurses                 6.5           h7bae524_1            conda-forge
  ninja                   1.12.1        h420ef59_0            conda-forge
  numpy                   2.1.3         py313hca4752e_0       conda-forge
  openssl                 3.4.0         h39f12f2_0            conda-forge
  pcre2                   10.44         h297a79d_2            conda-forge
  pip                     24.3.1        pyh145f28c_0          conda-forge
  pkg-config              0.29.2        hde07d2e_1009         conda-forge
  pthread-stubs           0.4           hd74edd7_1002         conda-forge
  python                  3.13.0        h206b6c5_100_cp313    conda-forge
  python_abi              3.13          5_cp313               conda-forge
  readline                8.2           h92ec313_1            conda-forge
  rhash                   1.4.5         h7ab814d_0            conda-forge
  sigtool                 0.1.3         h44b9a77_0            conda-forge
  tapi                    1300.6.5      h03f4b80_0            conda-forge
  tk                      8.6.13        h5083fa2_1            conda-forge
  tzdata                  2024b         hc8b5060_0            conda-forge
  xz                      5.2.6         h57fd34a_0            conda-forge
  zstd                    1.5.6         hb46c0d2_0            conda-forge

@EzraBrooks

That list output is suspicious.. should libcxx and libcxx-devel be such different versions? 🤔

@traversaro
Contributor Author

That list output is suspicious.. should libcxx and libcxx-devel be such different versions? 🤔

Yes, that is common. I am not 100% sure why that is the case, but I saw something similar happening in the past.

@EzraBrooks

I tested 12.3 and 13.3 and they both fail.

I see that only some of the 12.3 builds failed in that PR and some passed. It's been a while since I last used Azure Pipelines and I'm having trouble finding the logs as to why some failed and some didn't.

@traversaro
Contributor Author

I tested 12.3 and 13.3 and they both fail.

I see that only some of the 12.3 builds failed in that PR and some passed. It's been a while since I last used Azure Pipelines and I'm having trouble finding the logs as to why some failed and some didn't.

The corresponding job is https://github.com/conda-forge/ompl-feedstock/runs/33425622198, unfortunately it seems to me that all macos jobs failed.

@EzraBrooks

It looks like the x86-64 jobs failed but the arm64 jobs passed? unless I'm misreading the report.

@traversaro
Contributor Author

It looks like the x86-64 jobs failed but the arm64 jobs passed? unless I'm misreading the report.

Ahh sorry, yes that is confusing. The arm64 builds are cross-compiled, as there are no macos arm64 machines in Azure (differently from GitHub Actions, see conda-forge/conda-forge.github.io#1781). Furthermore, as there is no emulator available to emulate arm64 on macos amd64, the tests on osx-arm64 are not run. See the code in

if [[ "$target_platform" != "linux-aarch64" && "$CONDA_BUILD_CROSS_COMPILATION" != "1" ]]; then
.

@traversaro
Contributor Author

traversaro commented Nov 25, 2024

I was able to reproduce the problem locally on arm64.

This is my setup:

(ompldev) icub@iCubs-Mac-mini build % conda info
     active environment : ompldev
    active env location : /Users/icub/miniforge3/envs/ompldev
            shell level : 1
       user config file : /Users/icub/.condarc
 populated config files : /Users/icub/miniforge3/.condarc
                          /Users/icub/.condarc
          conda version : 24.1.2
    conda-build version : not installed
         python version : 3.10.14.final.0
                 solver : libmamba (default)
       virtual packages : __archspec=1=m1
                          __conda=24.1.2=0
                          __osx=14.6.1=0
                          __unix=0=0
       base environment : /Users/icub/miniforge3  (writable)
      conda av data dir : /Users/icub/miniforge3/etc/conda
  conda av metadata url : None
           channel URLs : https://conda.anaconda.org/conda-forge/osx-arm64
                          https://conda.anaconda.org/conda-forge/noarch
          package cache : /Users/icub/miniforge3/pkgs
                          /Users/icub/.conda/pkgs
       envs directories : /Users/icub/miniforge3/envs
                          /Users/icub/.conda/envs
               platform : osx-arm64
             user-agent : conda/24.1.2 requests/2.31.0 CPython/3.10.14 Darwin/23.6.0 OSX/14.6.1 solver/libmamba conda-libmamba-solver/24.1.0 libmambapy/1.5.7
                UID:GID : 501:20
             netrc file : None
           offline mode : False

and my env:

(ompldev) icub@iCubs-Mac-mini ~ % conda list
# packages in environment at /Users/icub/miniforge3/envs/ompldev:
#
# Name                    Version                   Build  Channel
bzip2                     1.0.8                h99b78c6_7    conda-forge
c-ares                    1.34.3               h5505292_1    conda-forge
c-compiler                1.8.0                hf48404e_1    conda-forge
ca-certificates           2024.8.30            hf0a4a13_0    conda-forge
cctools                   1010.6               hf67d63f_1    conda-forge
cctools_osx-arm64         1010.6               h4208deb_1    conda-forge
clang                     17.0.6          default_h360f5da_7    conda-forge
clang-17                  17.0.6          default_h146c034_7    conda-forge
clang_impl_osx-arm64      17.0.6              he47c785_23    conda-forge
clang_osx-arm64           17.0.6              h07b0088_23    conda-forge
clangxx                   17.0.6          default_h360f5da_7    conda-forge
clangxx_impl_osx-arm64    17.0.6              h50f59cd_23    conda-forge
clangxx_osx-arm64         17.0.6              h07b0088_23    conda-forge
cmake                     3.31.1               h326f17c_0    conda-forge
compiler-rt               17.0.6               h856b3c1_2    conda-forge
compiler-rt_osx-arm64     17.0.6               h832e737_2    conda-forge
compilers                 1.8.0                hce30654_1    conda-forge
cxx-compiler              1.8.0                h18dbf2f_1    conda-forge
eigen                     3.4.0                h1995070_0    conda-forge
fortran-compiler          1.8.0                hc3477c4_1    conda-forge
gfortran                  13.2.0               h1ca8e4b_1    conda-forge
gfortran_impl_osx-arm64   13.2.0               h252ada1_3    conda-forge
gfortran_osx-arm64        13.2.0               h57527a5_1    conda-forge
gmp                       6.3.0                h7bae524_2    conda-forge
icu                       75.1                 hfee45f7_0    conda-forge
isl                       0.26            imath32_h347afa1_101    conda-forge
krb5                      1.21.3               h237132a_0    conda-forge
ld64                      951.9                h39a299f_1    conda-forge
ld64_osx-arm64            951.9                hc81425b_1    conda-forge
libblas                   3.9.0           25_osxarm64_openblas    conda-forge
libboost                  1.86.0               h29978a0_2    conda-forge
libboost-devel            1.86.0               hf450f58_2    conda-forge
libboost-headers          1.86.0               hce30654_2    conda-forge
libcblas                  3.9.0           25_osxarm64_openblas    conda-forge
libclang-cpp17            17.0.6          default_h146c034_7    conda-forge
libcurl                   8.10.1               h13a7ad3_0    conda-forge
libcxx                    19.1.4               ha82da77_0    conda-forge
libcxx-devel              17.0.6               h86353a2_6    conda-forge
libedit                   3.1.20191231         hc8eb9b7_2    conda-forge
libev                     4.33                 h93a5062_2    conda-forge
libexpat                  2.6.4                h286801f_0    conda-forge
libffi                    3.4.2                h3422bc3_5    conda-forge
libgfortran               5.0.0           13_2_0_hd922786_3    conda-forge
libgfortran-devel_osx-arm64 13.2.0               h5d7a38c_3    conda-forge
libgfortran5              13.2.0               hf226fd6_3    conda-forge
libglib                   2.82.2               h07bd6cf_0    conda-forge
libiconv                  1.17                 h0d3ecfb_2    conda-forge
libintl                   0.22.5               h8414b35_3    conda-forge
liblapack                 3.9.0           25_osxarm64_openblas    conda-forge
libllvm17                 17.0.6               h5090b49_2    conda-forge
libmpdec                  4.0.0                h99b78c6_0    conda-forge
libnghttp2                1.64.0               h6d7220d_0    conda-forge
libopenblas               0.3.28          openmp_hf332438_1    conda-forge
libsqlite                 3.47.0               hbaaea75_1    conda-forge
libssh2                   1.11.1               h9cc3647_0    conda-forge
libuv                     1.49.2               h7ab814d_0    conda-forge
libxml2                   2.13.5               hbbdcc80_0    conda-forge
libzlib                   1.3.1                h8359307_2    conda-forge
llvm-openmp               19.1.4               hdb05f8b_0    conda-forge
llvm-tools                17.0.6               h5090b49_2    conda-forge
make                      4.4.1                hc9fafa5_2    conda-forge
mpc                       1.3.1                h8f1351a_1    conda-forge
mpfr                      4.2.1                hb693164_3    conda-forge
ncurses                   6.5                  h7bae524_1    conda-forge
ninja                     1.12.1               h420ef59_0    conda-forge
numpy                     2.1.3           py313hca4752e_0    conda-forge
openssl                   3.4.0                h39f12f2_0    conda-forge
pcre2                     10.44                h297a79d_2    conda-forge
pip                       24.3.1             pyh145f28c_0    conda-forge
pkg-config                0.29.2            hde07d2e_1009    conda-forge
python                    3.13.0          h75c3a9f_100_cp313    conda-forge
python_abi                3.13                    5_cp313    conda-forge
readline                  8.2                  h92ec313_1    conda-forge
rhash                     1.4.5                h7ab814d_0    conda-forge
sigtool                   0.1.3                h44b9a77_0    conda-forge
tapi                      1300.6.5             h03f4b80_0    conda-forge
tk                        8.6.13               h5083fa2_1    conda-forge
tzdata                    2024b                hc8b5060_0    conda-forge
xz                        5.2.6                h57fd34a_0    conda-forge
zlib                      1.3.1                h8359307_2    conda-forge
zstd                      1.5.6                hb46c0d2_0    conda-forge

@traversaro
Contributor Author

By comparing my env with @EzraBrooks's, I noticed that in @EzraBrooks's env the c-compiler and cxx-compiler meta packages are missing. This is indeed aligned with the CI (as there I needed to test multiple different versions of the compiler), but I wonder if that is the reason why @EzraBrooks is not reproducing the problem.

@EzraBrooks

I installed c-compiler and cxx-compiler and did a clean CMake configure and rebuild and still can't reproduce the issue 😵

@EzraBrooks

I'm doing a wdiff on our envs and trying to get them 100% matched, will retry shortly

@traversaro
Contributor Author

I'm doing a wdiff on our envs and trying to get them 100% matched, will retry shortly

I am also setting up a pixi config file to ensure that we use exactly the same lock file.

@traversaro
Contributor Author

traversaro commented Nov 25, 2024

Ok, this fails for me (branch is https://github.com/traversaro/ompl/tree/pixi):

git clone https://github.com/traversaro/ompl
cd ompl
git checkout 052f1b6773175a9a8a08978cd20d257e809bc3d4
pixi run test

This fails for me on this machine:

icub@iCubs-Mac-mini ompl % pixi info
System
------------
      Pixi version: 0.37.0
          Platform: osx-arm64
  Virtual packages: __unix=0=0
                  : __osx=14.6.1=0
                  : __archspec=1=m1
         Cache dir: /Users/icub/Library/Caches/rattler/cache
      Auth storage: /Users/icub/.rattler/credentials.json
  Config locations: No config files found

Global
------------
           Bin dir: /Users/icub/.pixi/bin
   Environment dir: /Users/icub/.pixi/envs
      Manifest dir: /Users/icub/.pixi/manifests/pixi-global.toml

Project
------------
              Name: ompl
           Version: 0.1.0
     Manifest file: /Users/icub/ompl/pixi.toml
      Last updated: 25-11-2024 16:17:58

Environments
------------
       Environment: default
          Features: default
          Channels: conda-forge
  Dependency count: 7
      Dependencies: compilers, cmake, pkg-config, eigen, make, ninja, libboost-devel
  Target platforms: osx-arm64
             Tasks: test, build

@traversaro
Contributor Author

See also #41 (comment) .

@EzraBrooks

That raw pointer does look odd, and I did find some mentions of recurring issues in Boost Serialization w/r/t null pointers with Clang, but they're fairly old issues so I don't think they're related: boostorg/serialization#119

@EzraBrooks

EzraBrooks commented Nov 25, 2024

Thanks for the Pixi example, I am able to reproduce the failure now.

@traversaro
Contributor Author

In https://github.com/traversaro/ompl/tree/simplifytestfailure I tried to simplify the failing test, and I was able to verify that if one does not use a derived class, the failure does not occur (comment out the line #define TRIGGER_CONDA_FORGE_FAILURE). So I guess there is something going on related to the derived class not being registered, or something similar?

@traversaro
Contributor Author

traversaro commented Nov 25, 2024

Something nice is that it does not seem that PlannerDataVertex is used at all in moveit or navigation: https://github.com/search?q=org%3Amoveit+PlannerDataVertex&type=code and https://github.com/search?q=org%3Aros-navigation%20PlannerDataVertex&type=code , so if we can't find the solution, we should perhaps be relatively safe in just skipping the test, at least as far as ROS usage is concerned.

@mamoll

mamoll commented Nov 26, 2024

Can someone summarize the failure condition? Is it limited to a specific Boost version / compiler version / package manager? Does it only happen on Apple silicon? I still have an Intel-based Mac running Sonoma. I have MacPorts installed and can try to reproduce the error if I have more info on which specific things to install.

@EzraBrooks

Hi Mark! If you check out my branch linked in the comment above, @traversaro and I have a reproducible build environment (using Pixi) that provokes the issue on both x86 and arm64 macOS. The issue originally arose when conda-forge attempted to bump from Boost 1.82 to Boost 1.84, and has persisted up til now (Boost 1.86).

You should be able to just check out that branch, install Pixi if you don't have it, and run pixi run test to reproduce the failing test.

I hadn't opened an issue on the OMPL upstream yet because I still don't have a great clue as to what exactly is going wrong - we've only narrowed it down to PlannerDataStorage.load in tests/base/planner_data.cpp and tests/control/planner_data.cpp.

Weirdly, temporarily removing the PlannerDataTestVertex class and using PlannerDataVertex directly fixes the issue in tests/base/planner_data.cpp, but not in tests/control/planner_data.cpp.

The error is annoyingly cryptic - it's an internal boost error that simply reads "input stream error". According to the boost docs, there are a few causes of this:

Aside from the common situations such as a corrupted or truncated input file, there are several less obvious ones that sometimes occur.
This includes an attempt to read past the end of the file. Text files need a terminating new line character at the end of the file which is appended when the archive destructor is invoked. Be sure that an output archive on a stream is destroyed before opening an input archive on that same stream.
Another one is the passing of uninitialized data. In general, the behavior of the serialization library when passed uninitialized data is undefined. If it can be detected, it will invoke an assertion in debug builds. Otherwise, depending on the type of archive, it may pass through without incident or it may result in an archive with unexpected data in it. This, in turn, can result in the throwing of this exception.

@EzraBrooks

More narrowly, the error appears to be thrown during loadVertices within PlannerDataStorage.load.

@traversaro
Contributor Author

Thanks @mamoll for chiming in! To complement what @EzraBrooks wrote: we can't reproduce the error when using the system compiler instead of conda-forge's own (conda-forge compiles its own version of the clang compiler), even when using the conda-forge builds of boost in both cases. That is one of the reasons we did not open an issue upstream, as we could not reproduce the issue outside of conda-forge. However, we tried building across a complex matrix of clang versions, and we always hit the problem even with quite different compiler and C++ standard library versions, which is why the issue does not seem to be a compiler problem (or it is simply a compiler problem that went unnoticed for a long time).

@mamoll

mamoll commented Nov 27, 2024

This sounds vaguely familiar. I have seen these tests fail when you run all the tests in parallel (ctest -j $(NPROC)) rather than sequentially (ctest -j 1). Can you check if the error goes away if you run the unit tests sequentially?

@EzraBrooks

Running ctest -j1 doesn't seem to help - I don't think I was running them in parallel in the first place anyway, afaict. New discovery, though: @dsobek did some other investigation yesterday afternoon and confirmed that it's the save step saving bad data, thus causing the load step to fail. This definitely makes it feel like there are some pointer serialization issues here (which is what @Tobias-Fischer theorized in #41 as well)

@EzraBrooks

It also doesn't seem to have anything to do with archive size (another potential culprit in my research of common pitfalls in boost serialization) since cranking down the number of states in the test from 1000 to 2 still fails. Since loadVertices is where the exception was being thrown, and loading the vertices from a Boost 1.82 built version of the library works fine, this seems to point the finger at storeVertices.

@traversaro
Contributor Author

Thanks for all the support! If it turns out that this feature is not used much in downstream software, a possible way forward (to unblock robostack builds) is to just skip the test and upload the ompl build with boost 1.86, while keeping the issue open to track the problem and its eventual resolution.

@EzraBrooks

EzraBrooks commented Nov 27, 2024

that would be my (selfish) preferred solution for the near term, since I'm not actually trying to use this with baremetal macOS 😆

I don't know enough about the OMPL bindings in MoveIt to say for sure but it seems promising that your earlier search didn't turn up direct invocations of these functions. I'll get someone more hands-on from the MoveIt team to chime in here.

@henningkayser

We do call PlannerDataStorage::store() and PlannerDataStorage::load() directly, but of course only if the persistent planner data usage is configured.

@EzraBrooks

EzraBrooks commented Nov 27, 2024

@henningkayser also mentioned that that feature in MoveIt is not commonly used (and might be already buggy due to disuse and decay)

@EzraBrooks

EzraBrooks commented Nov 27, 2024

More info from @dsobek's investigation - there is definitely a clear difference between the binary archive serialized from Boost 1.82 vs Boost 1.84+ (on Mac):

https://www.diffchecker.com/EjuqlV0T/

screenshot for posterity since that link will expire eventually
Image

@dsobek
Copy link

dsobek commented Nov 27, 2024

This is the state of the test that I produced that test data with:
https://github.com/EzraBrooks/ompl/blob/f7d028a186b4bb7cf841a43e828440125848ff66/tests/base/planner_data.cpp

@EzraBrooks
Copy link

@JafarAbdi appears to have verified that this exact same error message is indeed triggered by test parallelism on Linux, as @mamoll mentioned. Why does it happen every time on macOS but only when run in parallel on Linux?

@traversaro
Copy link
Contributor Author

We do call PlannerDataStorage::store() and PlannerDataStorage::load() directly, but of course only if the persistent planner data usage is configured.

Yes, but from what I understand you are not defining any custom user-defined PlannerDataVertex? That is what is triggering the problem.

@traversaro
Copy link
Contributor Author

@JafarAbdi appears to have verified that this exact same error message is indeed triggered by test parallelism on Linux, as @mamoll mentioned. Why does it happen every time on macOS but only when run in parallel on Linux?

Wow, I am starting to get intrigued. Is this with conda compilers or what?

At this point, something I would be curious to explore is to build Boost serialization (which, if I recall correctly, is a compiled component of Boost) as part of the ompl build via CMake's FetchContent. Once the full dependency tree is managed by the same CMake build, we could just start enabling all possible GCC or Clang sanitizers to understand more about this, but I am not sure how easy it is to build boost via CMake's FetchContent.
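To make the idea concrete, here is a rough, untested sketch of what that could look like. The release tarball URL, the `BOOST_INCLUDE_LIBRARIES` variable, and the target names are assumptions based on Boost's CMake support (available in recent Boost releases) and would likely need adjusting:

```cmake
# Hypothetical sketch, untested: vendor Boost.Serialization into the ompl
# build via FetchContent so the whole dependency tree is compiled with the
# same flags, then enable sanitizers across everything.
include(FetchContent)

# Only build the Boost libraries we actually need (assumed variable name).
set(BOOST_INCLUDE_LIBRARIES serialization)
FetchContent_Declare(
  Boost
  URL https://github.com/boostorg/boost/releases/download/boost-1.86.0/boost-1.86.0-cmake.tar.xz
)
FetchContent_MakeAvailable(Boost)

# Enable ASan/UBSan on everything built in this tree (GCC/Clang only).
add_compile_options(-fsanitize=address,undefined -fno-omit-frame-pointer)
add_link_options(-fsanitize=address,undefined)

# Link ompl against the vendored Boost target (assumed target name).
target_link_libraries(ompl PUBLIC Boost::serialization)
```

Since Boost and ompl would then share one compiler, one standard library, and one set of sanitizer flags, any ODR or memory issue between the two should surface at runtime instead of silently corrupting archives.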

Anyhow, while this is fun, I guess we can decouple this issue from the robostack side: if no major public dependency of ompl is using this problematic functionality, I think we can look into skipping these tests for now and make the ompl + boost 1.86 build available so that we unblock the robostack builds.

@traversaro
Copy link
Contributor Author

By the way, I noticed that tests are also skipped on linux-aarch64:

    # run tests, currently failing on arm64 for some reason

I wonder if the test failures mentioned there are related.

@EzraBrooks
Copy link

"for some reason" - classic!

traversaro added a commit to conda-forge-admin/ompl-feedstock that referenced this issue Nov 27, 2024
@traversaro
Copy link
Contributor Author

Anyhow, while this is fun, I guess we can decouple this issue from the robostack side: if no major public dependency of ompl is using this problematic functionality, I think we can look into skipping these tests for now and make the ompl + boost 1.86 build available so that we unblock the robostack builds.

Done in #47.

@dsobek
Copy link

dsobek commented Nov 27, 2024

I ran a git bisect on the boost serialization library and the culprit commit that first broke the test is boostorg/serialization#287.

@JafarAbdi
Copy link

JafarAbdi commented Nov 27, 2024

Like @EzraBrooks, I was able to reproduce the issue on my Linux machine with parallelism, but that issue was easy to fix (it has to do with two tests saving/loading to the same file)

@johnwason
Copy link
Contributor

We had another strange problem with newer versions of boost serialization in tesseract. The problem was a missing boost serialization export macro somewhere. It worked with older versions of boost serialization but began failing with newer versions.

boostorg/serialization#309

@johnwason
Copy link
Contributor

More thoughts about boostorg/serialization#309: I think the problem is that they introduced a singleton that gets instantiated multiple times if the linking isn't exactly correct. The linking can vary slightly depending on the compiler and compiler version, so it can be very difficult to debug. The macros, if used correctly, should enforce that only one singleton exists, but no error is raised if there is a problem.

@traversaro
Copy link
Contributor Author

Thanks a lot for the great team-up on the issue! The robostack builds should be unblocked by #47, but I think it makes sense to keep this issue open until we actually fix the issue (and thanks @johnwason for the great inputs).
