-
-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix tests with boost 1.86 on macos #43
Comments
Something that I forgot (and I can't find any reference to) I did but apparently I did (I found it on google) is https://github.com/traversaro/ompl-macos-test/ . It is a comparison of the issue against difference version of compilers and boost:
|
Thanks for writing this up! I was already finding it difficult to juggle all the PRs and issues related to this. I will start looking into this, using your minimal reproduction in that repo. |
So, unfortunately AppleClang and Clang are slightly different compilers, but the fact that conda-forge's Clang 14 fails and AppleClang 14 is successful suggest me that there is something else going on. I am not sure what libcxx the system compiler are using, but just in case I also added the libcxx version to the build matrix if just in case test pass with an older libcxx . |
For reference, ti seems that the libboost package was built with clang 17 (see https://conda-metadata-app.streamlit.app/?q=conda-forge%2Fosx-64%2Flibboost-1.86.0-hbe88bda_2.conda). |
I notice that the target SDK version in the boost PR is 11 - are we deliberately targeting systems as old as Big Sur? Since the ABI break was in macOS 12, would setting the minimum supported API to 12 either fix the problem or at least narrow the problem space? Edit: nvm, I'm still learning the layout of all these repos and CI jobs. I see now that that PR is updating packages that depend upon boost, not updating boost itself. |
Interestingly, running the contents of your repro case on my Mac locally (macOS 15) does not provoke the test failure. That makes me circle back to my hunch above that this has something to do with targeting macOS 11 as the lowest SDK version while building on a newer version. |
Interesting, can you share the environment in which you are running the tests (i.e. the
I am testing this in #44 . 12.3 still fails, I can try to increase the version. |
I tested 12.3 and 13.3 and they both fails. At the moment the conda-forge infrastructure doe snot support anything newer. |
Hopefully I copy-pasted the install commands from your CI repro case correctly!
|
That list output is suspicious.. should |
Yes, that is common. I am not 100% sure why that is the case, but I saw something similar happening in the past. |
I see that only some of the 12.3 build failed in that PR and some passed. it's been awhile since I last used Azure Pipelines and I'm having trouble finding the logs as to why some failed and some didn't.. |
The corresponding job is https://github.com/conda-forge/ompl-feedstock/runs/33425622198, unfortunately it seems to me that all |
It looks like the x86-64 jobs failed but the arm64 jobs passed? unless I'm misreading the report. |
Ahh sorry, yes that is confusing. The ompl-feedstock/recipe/build.sh Line 16 in c4a6188
|
I was able to reproduce the problem locally on This is my setup:
and my env:
|
By comparing my env with @EzraBrooks , I noticed that in @EzraBrooks's env the |
I installed |
I'm doing a wdiff on our envs and trying to get them 100% matched, will retry shortly |
I am also setting up a pixi config file to ensure that we use exactly the same lock file. |
Ok, this fails for me (branch is https://github.com/traversaro/ompl/tree/pixi):
This fails for me, on this machine:
|
See also #41 (comment) . |
That raw pointer does look odd, and I did find some mentions of recurring issues in Boost Serialization w/r/t null pointers with Clang, but they're fairly old issues so I don't think they're related: boostorg/serialization#119 |
Thanks for the Pixi example, I am able to reproduce the failure now. |
In https://github.com/traversaro/ompl/tree/simplifytestfailure I tried to simplify the failing test, I was able to verify that if one does not uses a derived class, nothing is happening (comment out the line |
Something nice is that it does not seems that |
Can someone summarize the failure condition? Is it limited to a specific Boost version / compiler version / package manager? Does it only happen on Apple silicon? I still have an Intel-based Mac running Sonoma. I have MacPorts installed and can try to reproduce the error if I have more info on which specific things to install. |
Hi Mark! If you check out my branch linked in the comment above, @traversaro and I have a reproducible build environment (using Pixi) that provokes the issue on both x86 and arm64 macOS. The issue originally arose when conda-forge attempted to bump from Boost 1.82 to Boost 1.84, and has persisted up til now (Boost 1.86). You should be able to just check out that branch, install Pixi if you don't have it, and run I hadn't opened an issue on the OMPL upstream yet because I still don't have a great clue as to what exactly is going wrong - we've only narrowed it down to Weirdly, temporarily removing the The error is annoyingly cryptic - it's an internal boost error that simply reads "input stream error". According to the boost docs, there are a few causes to this:
|
More narrowly, the error appears to being thrown during |
Thanks @mamoll for chiming in! To complement what @EzraBrooks's wrote, we can't reproduce the error when using the system compiler, instead of the conda-forge owns (conda-forge compile its own version of the clang compiler), even by using in both cases the conda-forge builds of boost. That is one of the reason we did not open an issue upstream, as we could not reproduce the issue outside of conda-forge. However, we tried to build for a complex matrix of clang versions, and we always had the problem even with quite different compiler and C++ standard library version, and that is the reason why the issue does not seem to be a compiler problem (or simply it is a compiler problem that went unnoticed for a long time). |
This sounds vaguely familiar. I have seen these tests fail when you run all the test in parallel ( |
Running |
It also doesn't seem to have anything to do with archive size (another potential culprit in my research of common pitfalls in boost serialization) since cranking down the number of states in the test from 1000 to 2 still fails. Since |
Thanks for all the support! If if turns out that this feature is not used so much in downstream software, a possible way forward (to unblock robostack builds) is just to skip the test and upload the ompl build boost with 1.86 , while keeping the issue open to track the problem and eventual resolution. |
that would be my (selfish) preferred solution for the near term, since I'm not actually trying to use this with baremetal macOS 😆 I don't know enough about the OMPL bindings in MoveIt to say for sure but it seems promising that your earlier search didn't turn up direct invocations of these functions. I'll get someone more hands-on from the MoveIt team to chime in here. |
We do call PlannerDataStorage::store() and PlannerDataStorage::load() directly, but of course only if the persistent planner data usage is configured. |
@henningkayser also mentioned that that feature in MoveIt is not commonly used (and might be already buggy due to disuse and decay) |
More info from @dsobek's investigation - there is definitely a clear difference between the binary archive serialized from Boost 1.82 vs Boost 1.84+ (on Mac): https://www.diffchecker.com/EjuqlV0T/ screenshot for posterity since that link will expire eventually |
This the state of the test that I produced that test data with: |
@JafarAbdi appears to have verified that this exact same error message is indeed triggered by test parallelism on Linux as @mamoll mentioned.. why does it happen every time on macOS but only when run in parallel on Linux.. |
Yes, but from what I understand you are not defining any custom user-defined PlannerDataVertex ? As that is what is triggering the problem. |
Wow, I am start getting intrigued. Is this with conda compilers or what? At this point something I would be curious to explore is to build Boost serialization (that if I recall correctly is a compiled component of Boost) as part of the ompl build via CMake's FetchContent, and once the full dep tree is managed by the same CMake build, just start enabling all possible GCC or Clang sanitizers to understand more about this, but I am not sure how easy is to build boost via CMake's FetchContent. Anyhow, while this is fun I guess we can decouple this issue from the robostack side, if no major public dependency of ompl is using this problematic functionality I think we can look into skipping this tests for now and make the ompl + boost 1.86 build available so that we unblock the robostack builds. |
By the way, I noticed that tests are skipped also on linux-aarch64: ompl-feedstock/recipe/build.sh Line 17 in c4a6188
|
"for some reason" - classic! |
As a workaround for conda-forge#43 .
Done in #47 . |
I ran a git bisect on the boost serialization library and the culprit commit that first broke the test is boostorg/serialization#287. |
As @EzraBrooks I was able to reproduce the issue in my linux machine with parallelism, but that issue was easy to fix (It has to do with two tests saving/loading to the same file) |
We had another strange problem with newer versions of boost serialization in tesseract. The problem was a missing boost serialization export macro somewhere. It worked with older versions of boost serialization but began failing with newer versions. |
More thoughts about boostorg/serialization#309 : I think the problem is they introduced a singleton that is being instantiated multiple times if the linking isn't exactly correct. The linking can vary slightly depending on the compiler and compiler version so it can be very difficult to debug. The macros if used correctly should enforce that only one singleton exists but it does not raise an error if there is a problem. |
Thanks a lot for the great team-up on the issue! The robostack builds should be unblocked by #47, but I think it make sense to keep this issue open until we actually fix the issue (and thanks @johnwason for the great inputs). |
Solution to issue cannot be found in the documentation.
Issue
There are plenty of discussion on this, but they are spread over multiple PRs, that typically get outdated. Let's use this to keep track of everything.
This is quite important as it is now blocking:
Detailed description
Since boost 1.84 migration (see #35, the previous pinned version was 1.82) two tests are failing:
and
The text was updated successfully, but these errors were encountered: