[BUG] Suspected memory corruption with regexp calls #11768

Closed
andygrove opened this issue Sep 26, 2022 · 3 comments · Fixed by #11797
Labels
bug Something isn't working

Comments

andygrove (Contributor) commented Sep 26, 2022

Describe the bug
Some Spark queries involving regexp calls produce different results on each run. We have established that matchesRe produces incorrect results on the GPU non-deterministically. This happens only with large input columns.

Steps/Code to reproduce bug
The tracking issue is NVIDIA/spark-rapids#6431. I am working on a simple repro case that does not involve Spark, but have not succeeded so far.

Expected behavior
Calls to matchesRe should produce the same results each time (given the same input).

Environment overview (please complete the following information)
Bare metal. Workstation with RTX 6000.

Environment details

 ***git***
 commit 204a09cb9de58485df2c931db9a56d19a6eda356 (HEAD)
 Author: Elias Stehle <[email protected]>
 Date:   Thu Sep 22 21:47:36 2022 +0200
 
 Reduces memory requirements in JSON parser and adds bytes/s and peak memory usage to benchmarks (#11732)
 
 This PR reduces memory requirements in the new nested JSON parser and adds `bytes_per_second` and `peak_memory_usage` usage to benchmarks
 
 Authors:
 - Elias Stehle (https://github.com/elstehle)
 
 Approvers:
 - Tobias Ribizel (https://github.com/upsj)
 - Karthikeyan (https://github.com/karthikeyann)
 - Yunsong Wang (https://github.com/PointKernel)
 
 URL: https://github.com/rapidsai/cudf/pull/11732
 ***git submodules***
 
 ***OS Information***
 DISTRIB_ID=Ubuntu
 DISTRIB_RELEASE=20.04
 DISTRIB_CODENAME=focal
 DISTRIB_DESCRIPTION="Ubuntu 20.04.3 LTS"
 NAME="Ubuntu"
 VERSION="20.04.3 LTS (Focal Fossa)"
 ID=ubuntu
 ID_LIKE=debian
 PRETTY_NAME="Ubuntu 20.04.3 LTS"
 VERSION_ID="20.04"
 HOME_URL="https://www.ubuntu.com/"
 SUPPORT_URL="https://help.ubuntu.com/"
 BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
 PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
 VERSION_CODENAME=focal
 UBUNTU_CODENAME=focal
 Linux nvworkstation 5.13.0-27-generic #29~20.04.1-Ubuntu SMP Fri Jan 14 00:32:30 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
 
 ***GPU Information***
 Mon Sep 26 10:14:11 2022
 +-----------------------------------------------------------------------------+
 | NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
 |-------------------------------+----------------------+----------------------+
 | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
 | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
 |                               |                      |               MIG M. |
 |===============================+======================+======================|
 |   0  Quadro RTX 6000     On   | 00000000:17:00.0 Off |                  Off |
 | 43%   67C    P2   111W / 260W |    970MiB / 24220MiB |    100%      Default |
 |                               |                      |                  N/A |
 +-------------------------------+----------------------+----------------------+
 
 +-----------------------------------------------------------------------------+
 | Processes:                                                                  |
 |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
 |        ID   ID                                                   Usage      |
 |=============================================================================|
 |    0   N/A  N/A      1411      G   /usr/lib/xorg/Xorg                  4MiB |
 |    0   N/A  N/A    193705      C   ...penjdk-amd64/jre/bin/java      481MiB |
 |    0   N/A  N/A    202213      C   ...penjdk-amd64/jre/bin/java      481MiB |
 +-----------------------------------------------------------------------------+
 
 ***CPU***
 Architecture:                    x86_64
 CPU op-mode(s):                  32-bit, 64-bit
 Byte Order:                      Little Endian
 Address sizes:                   46 bits physical, 48 bits virtual
 CPU(s):                          12
 On-line CPU(s) list:             0-11
 Thread(s) per core:              2
 Core(s) per socket:              6
 Socket(s):                       1
 NUMA node(s):                    1
 Vendor ID:                       GenuineIntel
 CPU family:                      6
 Model:                           85
 Model name:                      Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz
 Stepping:                        4
 CPU MHz:                         3500.000
 CPU max MHz:                     4000.0000
 CPU min MHz:                     1200.0000
 BogoMIPS:                        6999.82
 Virtualization:                  VT-x
 L1d cache:                       192 KiB
 L1i cache:                       192 KiB
 L2 cache:                        6 MiB
 L3 cache:                        8.3 MiB
 NUMA node0 CPU(s):               0-11
 Vulnerability Itlb multihit:     KVM: Mitigation: VMX disabled
 Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
 Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT vulnerable
 Vulnerability Meltdown:          Mitigation; PTI
 Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
 Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
 Vulnerability Spectre v2:        Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
 Vulnerability Srbds:             Not affected
 Vulnerability Tsx async abort:   Mitigation; Clear CPU buffers; SMT vulnerable
 Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req md_clear flush_l1d
 
 ***CMake***
 /usr/bin/cmake
 cmake version 3.16.3
 
 CMake suite maintained and supported by Kitware (kitware.com/cmake).
 
 ***g++***
 /usr/bin/g++
 g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
 Copyright (C) 2019 Free Software Foundation, Inc.
 This is free software; see the source for copying conditions.  There is NO
 warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 
 
 ***nvcc***
 
 ***Python***
 /home/andygrove/miniconda3/bin/python
 Python 3.9.1
 
 ***Environment Variables***
 PATH                            : /home/andygrove/.local/bin:/home/andygrove/miniconda3/bin:/home/andygrove/miniconda3/condabin:/home/andygrove/.cargo/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin://opt/apache-maven-3.8.1/bin
 LD_LIBRARY_PATH                 :
 NUMBAPRO_NVVM                   :
 NUMBAPRO_LIBDEVICE              :
 CONDA_PREFIX                    : /home/andygrove/miniconda3
 PYTHON_PATH                     :
 
 ***conda packages***
 /home/andygrove/miniconda3/bin/conda
 # packages in environment at /home/andygrove/miniconda3:
 #
 # Name                    Version                   Build  Channel
 _libgcc_mutex             0.1                        main
 attrs                     21.2.0                   pypi_0    pypi
 brotlipy                  0.7.0           py39h27cfd23_1003
 ca-certificates           2020.12.8            h06a4308_0
 certifi                   2020.12.5        py39h06a4308_0
 cffi                      1.14.4           py39h261ae71_0
 chardet                   3.0.4           py39h06a4308_1003
 conda                     4.9.2            py39h06a4308_0
 conda-package-handling    1.7.2            py39h27cfd23_1
 cryptography              3.3.1            py39h3c74f83_0
 idna                      2.10                       py_0
 iniconfig                 1.1.1                    pypi_0    pypi
 ld_impl_linux-64          2.33.1               h53a641e_7
 libedit                   3.1.20191231         h14c3975_1
 libffi                    3.3                  he6710b0_2
 libgcc-ng                 9.1.0                hdf63c60_0
 libstdcxx-ng              9.1.0                hdf63c60_0
 ncurses                   6.2                  he6710b0_1
 numpy                     1.23.3                   pypi_0    pypi
 openssl                   1.1.1i               h27cfd23_0
 packaging                 21.0                     pypi_0    pypi
 pandas                    1.4.4                    pypi_0    pypi
 pip                       20.3.1           py39h06a4308_0
 pluggy                    0.13.1                   pypi_0    pypi
 py                        1.10.0                   pypi_0    pypi
 pyarrow                   9.0.0                    pypi_0    pypi
 pycosat                   0.6.3            py39h27cfd23_0
 pycparser                 2.20                       py_2
 pyopenssl                 20.0.0             pyhd3eb1b0_1
 pyparsing                 2.4.7                    pypi_0    pypi
 pysocks                   1.7.1            py39h06a4308_0
 pytest                    6.2.4                    pypi_0    pypi
 python                    3.9.1                hdb3f193_2
 python-dateutil           2.8.2                    pypi_0    pypi
 pytz                      2022.2.1                 pypi_0    pypi
 readline                  8.0                  h7b6447c_0
 requests                  2.25.0             pyhd3eb1b0_0
 ruamel_yaml               0.15.80          py39h27cfd23_0
 setuptools                51.0.0           py39h06a4308_2
 six                       1.15.0           py39h06a4308_0
 sqlite                    3.33.0               h62c20be_0
 sre-yield                 1.2                      pypi_0    pypi
 tk                        8.6.10               hbc83047_0
 toml                      0.10.2                   pypi_0    pypi
 tqdm                      4.54.1             pyhd3eb1b0_0
 tzdata                    2020d                h14c3975_0
 urllib3                   1.25.11                    py_0
 wheel                     0.36.1             pyhd3eb1b0_0
 xz                        5.2.5                h7b6447c_0
 yaml                      0.2.5                h7b6447c_0
 zlib                      1.2.11               h7b6447c_3

Additional context
None

andygrove added the bug and Needs Triage labels Sep 26, 2022
davidwendt (Contributor) commented

I'd like to request a C++ reproducer if possible.

davidwendt (Contributor) commented

I'm not sure how the query in NVIDIA/spark-rapids#6431 is converted to libcudf calls, but converting 1 billion long values to a strings column would require almost 9GB of character bytes. The cast from longs to strings is therefore perhaps where the corruption occurs.
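The "almost 9GB" estimate can be checked with a small calculation. The sketch below assumes the 1 billion values are the non-negative integers 0 through 999,999,999 printed in decimal with no separators or terminators; the actual data in NVIDIA/spark-rapids#6431 may differ. The function name `total_char_bytes` is hypothetical and is not part of libcudf.

```cpp
#include <cstdint>

// Total character bytes needed to store the values 0, 1, ..., count-1 as
// decimal strings. Works per digit-length band ([0,10), [10,100), ...)
// instead of visiting every value. Assumes count is small enough that the
// band boundary does not overflow 64 bits (true for 1 billion).
constexpr std::uint64_t total_char_bytes(std::uint64_t count) {
    std::uint64_t total      = 0;
    std::uint64_t band_start = 0;   // first value in the current band
    std::uint64_t band_end   = 10;  // one past the last value in the band
    for (int digits = 1; band_start < count; ++digits) {
        std::uint64_t const end = band_end < count ? band_end : count;
        total += (end - band_start) * static_cast<std::uint64_t>(digits);
        band_start = band_end;
        band_end *= 10;
    }
    return total;
}

// Under the stated assumption, 1 billion values need 8,888,888,890 bytes.
static_assert(total_char_bytes(1000000000ULL) == 8888888890ULL,
              "1 billion decimal strings need almost 9 GB of characters");
```

Under that assumption the character data alone is about 8.9 GB, which is consistent with the estimate in the comment above.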

andygrove (Contributor, Author) commented Sep 26, 2022

The data is split across six partitions in this case, with each partition containing ~166 million rows.

The configuration allows two concurrent tasks on the GPU. Setting concurrentGpuTasks=1 makes the query run reliably.

Here is debug output from one run:

GpuRLike returned 166666667 rows (166666667 bytes) nulls = 0 trueCount = 9105453
GpuRLike returned 166666667 rows (166666667 bytes) nulls = 0 trueCount = 13055032
GpuRLike returned 166666667 rows (166666667 bytes) nulls = 0 trueCount = 9105453
GpuRLike returned 166666667 rows (166666667 bytes) nulls = 0 trueCount = 10173771
GpuRLike returned 166666666 rows (166666666 bytes) nulls = 0 trueCount = 10173771
GpuRLike returned 166666666 rows (166666666 bytes) nulls = 0 trueCount = 16799551

Here is another run of the same query showing slightly different counts. The total is the same, so the data was likely just partitioned differently on this run.

GpuRLike returned 166666667 rows (166666667 bytes) nulls = 0 trueCount = 9105453
GpuRLike returned 166666667 rows (166666667 bytes) nulls = 0 trueCount = 10173832
GpuRLike returned 166666667 rows (166666667 bytes) nulls = 0 trueCount = 9105453
GpuRLike returned 166666666 rows (166666666 bytes) nulls = 0 trueCount = 16799551
GpuRLike returned 166666666 rows (166666666 bytes) nulls = 0 trueCount = 10173771
GpuRLike returned 166666667 rows (166666667 bytes) nulls = 0 trueCount = 13054971
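The claim that both runs have the same total can be checked by summing the trueCount values. This is a minimal sketch using only the numbers printed in the two debug outputs above; the names `run1`, `run2`, and `total` are illustrative.

```cpp
#include <array>
#include <cstdint>
#include <numeric>

// trueCount values copied from the two debug outputs, in printed order.
const std::array<std::int64_t, 6> run1{
    9105453, 13055032, 9105453, 10173771, 10173771, 16799551};
const std::array<std::int64_t, 6> run2{
    9105453, 10173832, 9105453, 16799551, 10173771, 13054971};

// Sum of the per-partition trueCount values for one run.
std::int64_t total(const std::array<std::int64_t, 6>& counts) {
    return std::accumulate(counts.begin(), counts.end(), std::int64_t{0});
}
```

Both `total(run1)` and `total(run2)` evaluate to 68,413,031. Two of the per-partition values differ between the runs by 61 in opposite directions (13055032 vs 13054971, and 10173771 vs 10173832), which is why the totals agree.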

We are working on a C++ repro case.

rapids-bot bot pushed a commit that referenced this issue Sep 28, 2022
Fixes an out-of-bounds write error when a large number of strings requires a strided loop to meet an internal memory maximum. For row sizes that do not require strided loops, the row index never exceeds the size of the column preventing any out-of-bounds access. For large row counts, the CUDA `thread index` may be larger than the minimal count used for building the working-memory buffer. Since the kernel is launched with a thread-count with a specific block size, extra threads past the end of the minimal count are necessary to fill out the last block. These threads never contribute to the overall result but will attempt to access past the end of the working memory. Writing to this memory may corrupt memory for another kernel launched in parallel from another CPU thread. This change adds logic to prevent the extra threads from doing any work.

Fixes #11768

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - MithunR (https://github.com/mythrocks)
  - Nghia Truong (https://github.com/ttnghia)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #11797
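
The failure mode described in the commit message can be illustrated with a host-side simulation. This is a sketch under stated assumptions, not the libcudf kernel: it runs on the CPU, and the names `simulate_launch`, `launch_stats`, `max_slots`, and `slot_count` are hypothetical. It assumes one working-memory slot per thread, a buffer sized to the smaller of the row count and a memory-derived maximum, whole-block launches, and a row loop that strides by the slot count.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

struct launch_stats {
    std::size_t in_bounds_accesses     = 0;  // slot index <  slot_count
    std::size_t out_of_bounds_accesses = 0;  // slot index >= slot_count
};

// Simulates every thread of a launch and counts working-memory accesses.
// With guard_extra_threads == false, the threads that only exist to fill
// out the last block still enter the strided loop (the bug scenario).
launch_stats simulate_launch(std::size_t row_count, std::size_t max_slots,
                             std::size_t block_size, bool guard_extra_threads) {
    assert(block_size > 0);
    // "Minimal count": the working buffer has this many per-thread slots.
    std::size_t const slot_count = std::min(row_count, max_slots);
    std::size_t const blocks     = (slot_count + block_size - 1) / block_size;
    std::size_t const threads    = blocks * block_size;  // may exceed slot_count

    launch_stats stats;
    for (std::size_t tid = 0; tid < threads; ++tid) {
        // The guard: threads past the minimal count do no work at all.
        if (guard_extra_threads && tid >= slot_count) { continue; }
        for (std::size_t row = tid; row < row_count; row += slot_count) {
            // Each thread uses the slot at its own thread index.
            if (tid < slot_count) {
                ++stats.in_bounds_accesses;
            } else {
                ++stats.out_of_bounds_accesses;
            }
        }
    }
    return stats;
}
```

With 1000 rows, 300 slots, and a block size of 256, `simulate_launch(1000, 300, 256, false)` reports 524 out-of-bounds accesses, while the guarded call reports none and still visits all 1000 rows. With 100 rows and the same limits no strided loop is needed, and neither variant goes out of bounds, which matches the observation that the problem appears only with large inputs. In a real CUDA kernel the equivalent guard would typically be an early return for thread indices at or beyond the slot count.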
bdice removed the Needs Triage label Mar 4, 2024