
[FEA] Add a pre-merge check to validate that a PR has been committed using git signoff #399

Closed
sameerz opened this issue Jul 22, 2020 · 1 comment
Labels
build Related to CI / CD or cleanly building P0 Must have for release

Comments

@sameerz
Collaborator

sameerz commented Jul 22, 2020

Is your feature request related to a problem? Please describe.

We are introducing a contributor license agreement sign-off requirement. We need to ensure that commits are signed off using `git commit -s` (see https://git-scm.com/docs/git-commit#Documentation/git-commit.txt--s). If a PR is not signed off, it should fail a pre-merge check.
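For reference, the sign-off is a single flag on `git commit`. A minimal sketch of what the check would look for, using a throwaway repository and a placeholder identity:

```shell
# Sketch: make a signed-off commit in a scratch repo and inspect the
# trailer a pre-merge check would search for. Name/email are placeholders.
tmp=$(mktemp -d) && cd "$tmp"
git init -q .
git config user.name "Test User"
git config user.email "test@example.com"
echo hello > file.txt
git add file.txt
git commit -q -s -m "Add file"   # -s appends the Signed-off-by trailer
git log -1 --format=%B | grep '^Signed-off-by:'
# → Signed-off-by: Test User <test@example.com>
```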

Describe the solution you'd like
Pre-merge checks (builds) should fail unless they see a `-s` sign-off from anyone committing to a PR. If the author adds a single commit with a `-s` sign-off, that will be sufficient for the pre-merge build to pass.

Describe alternatives you've considered
None

Additional context
https://github.com/NVIDIA/spark-rapids/blob/branch-0.2/CONTRIBUTING.md#sign-your-work

@sameerz sameerz added the build Related to CI / CD or cleanly building label Jul 22, 2020
@sameerz sameerz added the P0 Must have for release label Jul 22, 2020
@jlowe
Member

jlowe commented Aug 14, 2020

Fixed by #439

@jlowe jlowe closed this as completed Aug 14, 2020
pxLi pushed a commit to pxLi/spark-rapids that referenced this issue May 12, 2022
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
This PR is the initial version of a CUDA fault injection tool to explore and test the correctness of CUDA error handling in fault-tolerant CUDA applications.

The tool is designed with both automated and interactive testing use cases in mind. It is a dynamically linked library, `libcufaultinj.so`, that is loaded by the CUDA process at `cuInit` (via the CUDA Driver API) when its path is provided in the `CUDA_INJECTION64_PATH` environment variable.

As an example, it can be used to test the RAPIDS Accelerator for Apache Spark.

### Local Mode
```bash
CUDA_INJECTION64_PATH=$PWD/target/cmake-build/faultinj/libcufaultinj.so \
FAULT_INJECTOR_CONFIG_PATH=src/test/cpp/faultinj/test_faultinj.json \
$SPARK_HOME/bin/pyspark \
  --jars $SPARK_RAPIDS_REPO/dist/target/rapids-4-spark_2.12-22.08.0-SNAPSHOT-cuda11.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin
```
### Distributed Mode
```bash
$SPARK_HOME/bin/spark-shell \
  --jars $SPARK_RAPIDS_REPO/dist/target/rapids-4-spark_2.12-22.08.0-SNAPSHOT-cuda11.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --files ./target/cmake-build/faultinj/libcufaultinj.so,./src/test/cpp/faultinj/test_faultinj.json \
  --conf spark.executorEnv.CUDA_INJECTION64_PATH=./libcufaultinj.so \
  --conf spark.executorEnv.FAULT_INJECTOR_CONFIG_PATH=test_faultinj.json \
  --conf spark.rapids.memory.gpu.minAllocFraction=0 \
  --conf spark.rapids.memory.gpu.allocFraction=0.2 \
  --master spark://hostname:7077 
```
When configuring the executor environment via `spark.executorEnv.CUDA_INJECTION64_PATH`, the value must contain a path separator, i.e. `./libcufaultinj.so` with the leading `./`, to make sure that `dlopen` loads the library file submitted with the job. Otherwise `dlopen` assumes a locally installed library resolvable by the dynamic linker via `LD_LIBRARY_PATH` and similar mechanisms. See the [dlopen man page](https://man7.org/linux/man-pages/man3/dlopen.3.html).

### Fault injection configuration 

Fault injection configuration is provided via the `FAULT_INJECTOR_CONFIG_PATH` environment variable. It is a set of rules for applying fault injection, with a given probability, when a CUDA Driver or Runtime call is matched by function name or callback id.

There are currently three types of fault injection:
- launch a kernel with the PTX `trap` instruction
- launch a kernel with a device assert
- replace the return code for the CUDA Runtime call

Example config:
```json
{
    "logLevel": 1,
    "dynamic": true,
    "cudaRuntimeFaults": {
        "cudaLaunchKernel_ptsz": {
            "percent": 0,
            "injectionType": 0,
            "injectionType_comment": "PTX trap = 0, C assert = 1",
            "interceptionCount": 1
        }
    },
    "cudaDriverFaults": {
        "cuMemFreeAsync_ptsz": {
            "percent": 0,
            "injectionType": 2,
            "injectionType_comment": "substitute return code",
            "substituteReturnCode": 999,
            "interceptionCount": 1
        }
    }
}
```
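Since the config is read at process start, it may be worth sanity-checking the file beforehand. A hypothetical check, not part of the tool itself; the path below is the sample config shipped in the repo:

```shell
# Hypothetical sanity check: confirm the config file parses as valid JSON
# before pointing FAULT_INJECTOR_CONFIG_PATH at it.
CONFIG=src/test/cpp/faultinj/test_faultinj.json
python3 -m json.tool "$CONFIG" > /dev/null && echo "config OK"
```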

Signed-off-by: Gera Shegalov <[email protected]>