-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Drop Centos7 support #2010
Drop Centos7 support #2010
Conversation
To fix: NVIDIA#1991 Drop Centos7 support, switch to build in a Rocky 8 Docker image Update the script to support both amd64 and arm64 CPUs Signed-off-by: Tim Liu <[email protected]>
Signed-off-by: Tim Liu <[email protected]>
Signed-off-by: Tim Liu <[email protected]>
build |
Signed-off-by: Tim Liu <[email protected]>
build/build-in-docker
Outdated
# Set env for arm64 build | ||
if [ "$arch" == "arm64" ]; then | ||
profiles="${profiles},arm64" | ||
USE_GDS="OFF" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious why we don't build GDS on arm64? It used to be separate but now is part of the CUDA toolkit. Is it not part of the arm64 CUDA toolkit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious why we don't build GDS on arm64? It used to be separate but now is part of the CUDA toolkit. Is it not part of the arm64 CUDA toolkit?
Yes, GDS cuFiles RDMA
lib is not in the arm64 CUDA toolkit
[INFO] [exec] Could NOT find cuFile (missing: cuFile_LIBRARY cuFileRDMA_LIBRARY
[INFO] [exec] cuFile_INCLUDE_DIR)
cufaultinj
links to static cupti_static
is not found in arm64 CUDA toolkit
[INFO] [exec] -- Generating done (0.0s)
[INFO] [exec] CMake Error at faultinj/CMakeLists.txt:34 (target_link_libraries):
[INFO] [exec] Target "cufaultinj" links to:
[INFO] [exec]
[INFO] [exec] CUDA::cupti_static
[INFO] [exec]
[INFO] [exec] but the target was not found. Possible reasons include:
[INFO] [exec]
[INFO] [exec] * There is a typo in the target name.
[INFO] [exec] * A find_package call is missing for an IMPORTED target.
[INFO] [exec] * An ALIAS target is missing.
[INFO] [exec]
[INFO] [exec]
[INFO] [exec]
[INFO] [exec] CMake Generate step failed. Build files cannot be regenerated correctly.
rmm OOM
issue reported as below for arm64 test if USE_SANITIZER=OFF
, but I've no idea what the root cause of the issue
[ERROR] There was an error in the forked process
[ERROR] Could not allocate native memory: std::bad_alloc: out_of_memory: RMM failure at:/home/nvidia/timl/spark-rapids-jni/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/pool_memory_resource.hpp:254: Maximum pool size exceeded
[ERROR] org.apache.maven.surefire.booter.SurefireBooterForkException: There was an error in the forked process
[ERROR] Could not allocate native memory: std::bad_alloc: out_of_memory: RMM failure at:/home/nvidia/timl/spark-rapids-jni/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/pool_memory_resource.hpp:254: Maximum pool size exceeded
[ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork(ForkStarter.ja
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to fix the tests on arm64, not hack the build script to run the sanitizer to hide the problem. We will not be running the sanitizer in production, so letting this slip through by hacking the build script is not what we want.
Signed-off-by: Tim Liu <[email protected]>
build |
build/build-in-docker
Outdated
# Set env for arm64 build | ||
if [ "$arch" == "arm64" ]; then | ||
profiles="${profiles},arm64" | ||
USE_GDS="OFF" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to fix the tests on arm64, not hack the build script to run the sanitizer to hide the problem. We will not be running the sanitizer in production, so letting this slip through by hacking the build script is not what we want.
Signed-off-by: Tim Liu <[email protected]>
build |
If tests are failing it's a blocker for this PR. Hacking the script to pass the test is not the right answer. The goal is to produce a good build on rocky8, but the failing tests indicate there may be issues with the build. The tests are there for a reason, to indicate whether the code is working properly. If the tests used to pass when building on centos7 but fail when building on rocky8 then there may be an issue with the code produced by the rocky8 build. We need to investigate the test failures and fix them. |
Signed-off-by: Tim Liu <[email protected]>
build |
Converted this to a draft PR, as we don't want this merged until we get to the bottom of the arm64 test failures after building on rocky8, tracked by #2022. |
Co-authored-by: Jason Lowe <[email protected]>
Signed-off-by: Tim Liu <[email protected]>
Here's the relevant fix for the issue2022 : rapidsai/cudf#15706 |
build |
Unfortunately this breaks my local build:
|
I suspect if you clean out thirdparty/cudf/cpp/build it will fix that issue. The compiler changed locations between the old and new image, so anything caching the compiler location (such as an existing cmake cache file) will break. |
Oh yes, that fixes it. Thanks. |
Also fixed #1958 |
To fix: #1991
Drop Centos7 support, switch to build in a Rocky 8 Docker image
Update the script to support both amd64 and arm64 CPUs