[FEA] Add a pre-merge check to validate that a PR has been committed using git signoff #399
Fixed by #439
pxLi pushed a commit to pxLi/spark-rapids that referenced this issue on May 12, 2022.
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue on Nov 30, 2023, with the following commit message:
This PR is the initial version of a CUDA fault injection tool for exploring and testing the correctness of CUDA error handling in fault-tolerant CUDA applications. The tool is designed with both automated and interactive testing use cases in mind. It is a dynamically linked library, `libcufaultinj.so`, that the CUDA process loads via the CUDA Driver API `cuInit` call when it is provided via the `CUDA_INJECTION64_PATH` environment variable. As an example, it can be used to test the RAPIDS Accelerator for Apache Spark.

### Local Mode

```bash
CUDA_INJECTION64_PATH=$PWD/target/cmake-build/faultinj/libcufaultinj.so \
FAULT_INJECTOR_CONFIG_PATH=src/test/cpp/faultinj/test_faultinj.json \
$SPARK_HOME/bin/pyspark \
  --jars $SPARK_RAPIDS_REPO/dist/target/rapids-4-spark_2.12-22.08.0-SNAPSHOT-cuda11.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin
```

### Distributed Mode

```bash
$SPARK_HOME/bin/spark-shell \
  --jars $SPARK_RAPIDS_REPO/dist/target/rapids-4-spark_2.12-22.08.0-SNAPSHOT-cuda11.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --files ./target/cmake-build/faultinj/libcufaultinj.so,./src/test/cpp/faultinj/test_faultinj.json \
  --conf spark.executorEnv.CUDA_INJECTION64_PATH=./libcufaultinj.so \
  --conf spark.executorEnv.FAULT_INJECTOR_CONFIG_PATH=test_faultinj.json \
  --conf spark.rapids.memory.gpu.minAllocFraction=0 \
  --conf spark.rapids.memory.gpu.allocFraction=0.2 \
  --master spark://hostname:7077
```

When configuring the executor environment via `spark.executorEnv.CUDA_INJECTION64_PATH`, the value `./libcufaultinj.so` must contain a path separator (the leading `./`) so that `dlopen` loads the library file submitted with the job. Otherwise `dlopen` assumes a locally installed library accessible to the dynamic linker via `LD_LIBRARY_PATH` and similar mechanisms. See the [dlopen man page](https://man7.org/linux/man-pages/man3/dlopen.3.html).

### Fault injection configuration

The fault injection configuration is provided via the `FAULT_INJECTOR_CONFIG_PATH` environment variable. It is a set of rules that apply fault injection, with a given probability, when a CUDA Driver or Runtime call is matched by function name or callback id. There are currently three types of fault injection:

- launch a kernel with the PTX `trap` instruction
- launch a kernel with a device assert
- replace the return code of the CUDA Runtime call

Example config:

```json
{
  "logLevel": 1,
  "dynamic": true,
  "cudaRuntimeFaults": {
    "cudaLaunchKernel_ptsz": {
      "percent": 0,
      "injectionType": 0,
      "injectionType_comment": "PTX trap = 0, C assert = 1",
      "interceptionCount": 1
    }
  },
  "cudaDriverFaults": {
    "cuMemFreeAsync_ptsz": {
      "percent": 0,
      "injectionType": 2,
      "injectionType_comment": "substitute return code",
      "substituteReturnCode": 999,
      "interceptionCount": 1
    }
  }
}
```

Signed-off-by: Gera Shegalov <[email protected]>
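As a quick way to see the `dlopen` path-resolution behavior described above, here is a minimal sketch. The directory path is an assumption, and Python's `ctypes.CDLL` (which wraps `dlopen`) is used only as a convenient probe; neither is part of the tool itself:

```bash
cd /tmp/faultinj-demo   # assumed directory containing libcufaultinj.so

# A bare name is resolved through the dynamic linker search path
# (LD_LIBRARY_PATH, ldconfig cache), so a file that is merely present
# in the current directory is not found:
python3 -c "import ctypes; ctypes.CDLL('libcufaultinj.so')"    # OSError expected

# A name containing a slash is loaded from exactly that location:
python3 -c "import ctypes; ctypes.CDLL('./libcufaultinj.so')"  # loads the local file
```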
Is your feature request related to a problem? Please describe.
We are introducing a contributor license agreement signoff requirement. We need to ensure that commits are signed off via `git commit -s` (see https://git-scm.com/docs/git-commit#Documentation/git-commit.txt--s). If a PR is not signed off, it should fail a pre-merge check.
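For reference, the `-s`/`--signoff` flag appends a trailer built from the committer's configured `user.name` and `user.email` (the commit message and name below are illustrative):

```bash
# Create a signed-off commit; git appends the trailer automatically:
git commit -s -m "Add pre-merge signoff check"
# The resulting message body ends with a line such as:
#   Signed-off-by: Jane Developer <jane@example.com>

# An existing HEAD commit can gain a signoff after the fact:
git commit --amend --signoff --no-edit
```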
Describe the solution you'd like
Pre-merge checks (builds?) should fail if no `-s` signoff is seen from anyone committing to the PR. If the author adds a single commit with `-s`, that is sufficient for the pre-merge build to pass.
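A minimal sketch of such a check, assuming the CI system exposes the PR's base and head refs as `BASE_REF` and `HEAD_REF` (illustrative names, not an existing script in this repo):

```bash
#!/usr/bin/env bash
# Pass if at least one non-merge commit in the PR carries a signoff,
# matching the requirement above; fail the build otherwise.
if git log --no-merges --format=%B "$BASE_REF..$HEAD_REF" \
    | grep -q '^Signed-off-by:'; then
  echo "Signoff found; pre-merge signoff check passed."
else
  echo "ERROR: no Signed-off-by trailer found in PR commits." >&2
  exit 1
fi
```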
Describe alternatives you've considered
None
Additional context
https://github.com/NVIDIA/spark-rapids/blob/branch-0.2/CONTRIBUTING.md#sign-your-work