Changes to libcudf's build system not infrequently break spark-rapids-jni. Not every libcudf change that breaks the Spark builds should be considered a showstopper (for example, the plugin still uses detail APIs, which libcudf should continue to feel free to change without warning if needed), but changes involving CMake deserve closer scrutiny because of the very specific dance the Spark plugin's build performs to support its layered builds: libcudf, the libcudfjni interface layer, and finally spark-rapids-jni itself. Currently, such breaks are reported to us after the fact, and we then have to go through rounds of changes in libcudf followed by manual testing in Spark.
To improve this situation, we should add builds of spark-rapids-jni to cudf CI. We do not need to run tests, and we should be able to build for just a single GPU architecture. spark-rapids-jni already has detailed instructions for containerized builds that support using a custom version of cudf, which we should be able to adapt into a GitHub Actions job running in the same container. That should be sufficient to catch the majority of breaking changes that need to be fixed in cudf itself. I would not suggest running the full Spark test suite, since that would be considerably more expensive; for now, a build job alone should suffice. We should also make the build failures non-blocking so that they do not hold up the rest of CI.
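For illustration, a minimal sketch of what such a job could look like; the job name, runner label, and container image are placeholders (the real image would come from spark-rapids-jni's containerized build instructions), the `thirdparty/cudf` overlay is an assumption about where the build consumes libcudf sources, and the build command anticipates the script discussed in the comment below. The key piece is `continue-on-error`, which reports failures without gating CI:

```yaml
# Hypothetical job in cudf's premerge workflow; image and runner are placeholders.
spark-rapids-jni-build:
  runs-on: linux-amd64-cpu16          # placeholder runner label
  continue-on-error: true             # non-blocking: report failures, don't gate CI
  container:
    image: spark-rapids-jni-build:latest   # placeholder for the project's build image
  steps:
    - name: Check out spark-rapids-jni
      uses: actions/checkout@v4
      with:
        repository: NVIDIA/spark-rapids-jni
        submodules: recursive
    - name: Overlay the cudf branch under test
      uses: actions/checkout@v4
      with:
        path: thirdparty/cudf          # assumes libcudf sources come from this submodule path
    - name: Build the C++ layers for a single architecture
      run: |
        # Single-arch build keeps the "does it build" check cheap; buildcpp.sh is
        # the script from NVIDIA/spark-rapids-jni#2677 (see the comment below)
        GPU_ARCHS=89-real LIBCUDF_DEPENDENCY_MODE=latest \
          scl enable gcc-toolset-11 build/buildcpp.sh
```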
NVIDIA/spark-rapids-jni#2677 should make this a bit easier, as it refactors the C++ build portion into a shell script that can be invoked separately. Here's how I see this running as part of cudf precommit:
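Roughly the following; the checkout layout and paths here are an illustrative sketch rather than exact commands, and it assumes the cudf branch under test replaces the contents of the `thirdparty/cudf` submodule before the script runs inside the build container:

```bash
# Sketch only: repository layout and submodule swap are assumptions.

# Check out spark-rapids-jni with its submodules
git clone --recurse-submodules https://github.com/NVIDIA/spark-rapids-jni.git
cd spark-rapids-jni

# Swap the pinned cudf submodule for the cudf branch under test
# (assumes libcudf sources are consumed from thirdparty/cudf)
rm -rf thirdparty/cudf
cp -r /path/to/cudf-under-test thirdparty/cudf

# Inside the project's build container (which provides gcc-toolset-11),
# run the C++ build script added by NVIDIA/spark-rapids-jni#2677
LIBCUDF_DEPENDENCY_MODE=latest scl enable gcc-toolset-11 build/buildcpp.sh
```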
Note that this will build for all GPU architectures supported by RAPIDS, which is overkill for a premerge "does it build" check on spark-rapids-jni. To build for a specific architecture, prepend GPU_ARCHS=arch to that last command line, e.g.:
GPU_ARCHS=89-real LIBCUDF_DEPENDENCY_MODE=latest scl enable gcc-toolset-11 build/buildcpp.sh