Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] contains_re hangs with some regexp patterns #10006

Closed
andygrove opened this issue Jan 10, 2022 · 1 comment · Fixed by #10095
Closed

[BUG] contains_re hangs with some regexp patterns #10006

andygrove opened this issue Jan 10, 2022 · 1 comment · Fixed by #10095
Assignees
Labels
bug Something isn't working

Comments

@andygrove
Copy link
Contributor

andygrove commented Jan 10, 2022

Describe the bug
The contains_re function hangs with some regular expression patterns.

Steps/Code to reproduce bug

I ran the following with rapids-21.10 conda env and it hangs. I am seeing the same behavior with 22.02 via the Spark accelerator.

>>> import cudf
>>> test_string = cudf.Series([None, 'A', 'B'])
>>> test_string.str.contains('(3?)+', regex=True)

Expected behavior
This should not hang.

Environment overview (please complete the following information)
conda rapids-21.10

Environment details

Click here to see environment details
 **git***
 commit d4344571ac624cf9d74b5299d3bd57110dfa1daf (HEAD -> extract-all-research, origin/extract-all-research)
 Author: Andy Grove <[email protected]>
 Date:   Thu Jan 6 14:47:46 2022 -0700
 
 save
 **git submodules***
 
 ***OS Information***
 DISTRIB_ID=Ubuntu
 DISTRIB_RELEASE=20.04
 DISTRIB_CODENAME=focal
 DISTRIB_DESCRIPTION="Ubuntu 20.04.3 LTS"
 NAME="Ubuntu"
 VERSION="20.04.3 LTS (Focal Fossa)"
 ID=ubuntu
 ID_LIKE=debian
 PRETTY_NAME="Ubuntu 20.04.3 LTS"
 VERSION_ID="20.04"
 HOME_URL="https://www.ubuntu.com/"
 SUPPORT_URL="https://help.ubuntu.com/"
 BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
 PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
 VERSION_CODENAME=focal
 UBUNTU_CODENAME=focal
 Linux ripper 5.11.0-44-generic #48~20.04.2-Ubuntu SMP Tue Dec 14 15:36:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
 
 ***GPU Information***
 Mon Jan 10 09:38:16 2022
 +-----------------------------------------------------------------------------+
 | NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
 |-------------------------------+----------------------+----------------------+
 | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
 | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
 |                               |                      |               MIG M. |
 |===============================+======================+======================|
 |   0  NVIDIA GeForce ...  On   | 00000000:42:00.0  On |                  N/A |
 | 40%   68C    P2   134W / 320W |   2185MiB / 10015MiB |     99%      Default |
 |                               |                      |                  N/A |
 +-------------------------------+----------------------+----------------------+
 
 +-----------------------------------------------------------------------------+
 | Processes:                                                                  |
 |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
 |        ID   ID                                                   Usage      |
 |=============================================================================|
 |    0   N/A  N/A      1717      G   /usr/lib/xorg/Xorg                102MiB |
 |    0   N/A  N/A    147030      G   /usr/lib/firefox/firefox            3MiB |
 |    0   N/A  N/A    454523      G   /usr/lib/firefox/firefox            3MiB |
 |    0   N/A  N/A    919688      G   /usr/lib/xorg/Xorg                609MiB |
 |    0   N/A  N/A    919876      G   /usr/bin/gnome-shell              140MiB |
 |    0   N/A  N/A    920362      G   ...AAAAAAAAA= --shared-files       62MiB |
 |    0   N/A  N/A    921961      G   ./jetbrains-toolbox                12MiB |
 |    0   N/A  N/A    928952      G   ...AAAAAAAAA= --shared-files       46MiB |
 |    0   N/A  N/A   1273865      C   ....0-openjdk-amd64/bin/java      489MiB |
 |    0   N/A  N/A   1277257      C   python                            515MiB |
 |    0   N/A  N/A   2722181      G   ...936911.log --shared-files        4MiB |
 |    0   N/A  N/A   4116645      G   /usr/lib/firefox/firefox          169MiB |
 |    0   N/A  N/A   4125089      G   /usr/lib/firefox/firefox            3MiB |
 +-----------------------------------------------------------------------------+
 
 ***CPU***
 Architecture:                    x86_64
 CPU op-mode(s):                  32-bit, 64-bit
 Byte Order:                      Little Endian
 Address sizes:                   43 bits physical, 48 bits virtual
 CPU(s):                          48
 On-line CPU(s) list:             0-47
 Thread(s) per core:              2
 Core(s) per socket:              24
 Socket(s):                       1
 NUMA node(s):                    4
 Vendor ID:                       AuthenticAMD
 CPU family:                      23
 Model:                           8
 Model name:                      AMD Ryzen Threadripper 2970WX 24-Core Processor
 Stepping:                        2
 Frequency boost:                 enabled
 CPU MHz:                         2200.000
 CPU max MHz:                     3000.0000
 CPU min MHz:                     2200.0000
 BogoMIPS:                        5987.89
 Virtualization:                  AMD-V
 L1d cache:                       768 KiB
 L1i cache:                       1.5 MiB
 L2 cache:                        12 MiB
 L3 cache:                        64 MiB
 NUMA node0 CPU(s):               0-5,24-29
 NUMA node1 CPU(s):               12-17,36-41
 NUMA node2 CPU(s):               6-11,30-35
 NUMA node3 CPU(s):               18-23,42-47
 Vulnerability Itlb multihit:     Not affected
 Vulnerability L1tf:              Not affected
 Vulnerability Mds:               Not affected
 Vulnerability Meltdown:          Not affected
 Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
 Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
 Vulnerability Spectre v2:        Mitigation; Full AMD retpoline, IBPB conditional, STIBP disabled, RSB filling
 Vulnerability Srbds:             Not affected
 Vulnerability Tsx async abort:   Not affected
 Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev ibpb vmmcall sev_es fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca
 
 ***CMake***
 /usr/bin/cmake
 cmake version 3.16.3
 
 CMake suite maintained and supported by Kitware (kitware.com/cmake).
 
 ***g++***
 /usr/bin/g++
 g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
 Copyright (C) 2019 Free Software Foundation, Inc.
 This is free software; see the source for copying conditions.  There is NO
 warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 
 
 ***nvcc***
 /usr/local/cuda/bin/nvcc
 nvcc: NVIDIA (R) Cuda compiler driver
 Copyright (c) 2005-2021 NVIDIA Corporation
 Built on Thu_Nov_18_09:45:30_PST_2021
 Cuda compilation tools, release 11.5, V11.5.119
 Build cuda_11.5.r11.5/compiler.30672275_0
 
 ***Python***
 /home/andy/miniconda3/bin/python
 Python 3.9.5
 
 ***Environment Variables***
 PATH                            : /home/andy/miniconda3/bin:/home/andy/miniconda3/condabin:/home/andy/.cargo/bin:/home/andy/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/opt/apache-maven-3.8.2/bin:mvnd-0.5.2-linux-amd64/bin:/usr/local/cuda/bin
 LD_LIBRARY_PATH                 : :/usr/local/cuda/targets/x86_64-linux/lib/:/usr/local/cuda/lib64
 NUMBAPRO_NVVM                   :
 NUMBAPRO_LIBDEVICE              :
 CONDA_PREFIX                    : /home/andy/miniconda3
 PYTHON_PATH                     :
 
 ***conda packages***
 /home/andy/miniconda3/bin/conda
 # packages in environment at /home/andy/miniconda3:
 #
 # Name                    Version                   Build  Channel
 _libgcc_mutex             0.1                        main
 _openmp_mutex             4.5                       1_gnu
 attrs                     21.2.0                   pypi_0    pypi
 awscli                    1.20.44                  pypi_0    pypi
 awscli-plugin-endpoint    0.4                      pypi_0    pypi
 botocore                  1.21.44                  pypi_0    pypi
 bpytop                    1.0.67                   pypi_0    pypi
 brotlipy                  0.7.0           py39h27cfd23_1003
 ca-certificates           2021.7.5             h06a4308_1
 certifi                   2021.5.30        py39h06a4308_0
 cffi                      1.14.6           py39h400218f_0
 cfgv                      3.3.1                    pypi_0    pypi
 chardet                   4.0.0           py39h06a4308_1003
 colorama                  0.4.3                    pypi_0    pypi
 conda                     4.10.3           py39h06a4308_0
 conda-package-handling    1.7.3            py39h27cfd23_1
 conda-standalone          4.9.0                h718eed5_1
 constructor               3.2.1            py39h06a4308_0
 cryptography              3.4.7            py39hd23ed53_0
 datafusion                0.2.0                    pypi_0    pypi
 distlib                   0.3.4                    pypi_0    pypi
 docutils                  0.15.2                   pypi_0    pypi
 execnet                   1.9.0                    pypi_0    pypi
 filelock                  3.4.2                    pypi_0    pypi
 findspark                 1.4.2                    pypi_0    pypi
 identify                  2.4.1                    pypi_0    pypi
 idna                      2.10               pyhd3eb1b0_0
 iniconfig                 1.1.1                    pypi_0    pypi
 jmespath                  0.10.0                   pypi_0    pypi
 ld_impl_linux-64          2.35.1               h7274673_9
 libffi                    3.3                  he6710b0_2
 libgcc-ng                 9.3.0               h5101ec6_17
 libgomp                   9.3.0               h5101ec6_17
 libstdcxx-ng              9.3.0               hd4cf53a_17
 ncurses                   6.2                  he6710b0_1
 nodeenv                   1.6.0                    pypi_0    pypi
 numpy                     1.21.2                   pypi_0    pypi
 openssl                   1.1.1l               h7f8727e_0
 packaging                 21.0                     pypi_0    pypi
 pandas                    1.3.4                    pypi_0    pypi
 pip                       21.1.3           py39h06a4308_0
 platformdirs              2.4.1                    pypi_0    pypi
 pluggy                    1.0.0                    pypi_0    pypi
 pre-commit                2.16.0                   pypi_0    pypi
 psutil                    5.8.0                    pypi_0    pypi
 py                        1.10.0                   pypi_0    pypi
 pyarrow                   5.0.0                    pypi_0    pypi
 pyasn1                    0.4.8                    pypi_0    pypi
 pycosat                   0.6.3            py39h27cfd23_0
 pycparser                 2.20                       py_2
 pyopenssl                 20.0.1             pyhd3eb1b0_1
 pyparsing                 2.4.7                    pypi_0    pypi
 pysocks                   1.7.1            py39h06a4308_0
 pytest                    6.2.5                    pypi_0    pypi
 pytest-forked             1.3.0                    pypi_0    pypi
 pytest-xdist              2.4.0                    pypi_0    pypi
 python                    3.9.5                h12debd9_4
 python-dateutil           2.8.2                    pypi_0    pypi
 pytz                      2021.3                   pypi_0    pypi
 pyyaml                    5.4.1                    pypi_0    pypi
 readline                  8.1                  h27cfd23_0
 requests                  2.25.1             pyhd3eb1b0_0
 rsa                       4.7.2                    pypi_0    pypi
 ruamel_yaml               0.15.100         py39h27cfd23_0
 s3transfer                0.5.0                    pypi_0    pypi
 setuptools                52.0.0           py39h06a4308_0
 six                       1.16.0             pyhd3eb1b0_0
 sqlite                    3.36.0               hc218d9a_0
 sre-yield                 1.2                      pypi_0    pypi
 tk                        8.6.10               hbc83047_0
 toml                      0.10.2                   pypi_0    pypi
 tqdm                      4.61.2             pyhd3eb1b0_1
 tzdata                    2021a                h52ac0ba_0
 urllib3                   1.26.6             pyhd3eb1b0_1
 virtualenv                20.13.0                  pypi_0    pypi
 wheel                     0.36.2             pyhd3eb1b0_0
 xdist                     0.0.1                    pypi_0    pypi
 xz                        5.2.5                h7b6447c_0
 yaml                      0.2.5                h7b6447c_0
 zlib                      1.2.11               h7b6447c_3

Additional context
None

@andygrove andygrove added bug Something isn't working Needs Triage Need team to review and classify labels Jan 10, 2022
@davidwendt davidwendt self-assigned this Jan 11, 2022
@davidwendt
Copy link
Contributor

davidwendt commented Jan 11, 2022

This pattern resolves to the following instructions in the libcudf regex code:

Flags = 0x00000000
Instructions:
  0:    CHAR c='3', next=2
  1:      OR right=0, left=2, next=2
  2:    RBRA id=1, next=4
  3:    LBRA id=1, next=1
  4:      OR right=3, left=5, next=5
  5:     END
startinst_id=3
startinst_ids: [ 3 -1]

Classes 0
Number of capturing groups: 1

This causes in an infinite loop at instruction 4 when the character '3' is not found in the target string.
The instruction path would then follow 4->3->1->2->4 ...

So it appears the implementation unfortunately does not support this pattern. Since this is not an extract call, if you ignore the capture markers ( ) this would give this pattern 3?+ which is an invalid regex that throws an exception. The capture markers somehow make this valid but is not something the current implementation can to handle.

I'm not sure if the code can be fixed right now. At the very least I hope we could detect the infinite-loop well enough to throw an error.

rapids-bot bot pushed a commit that referenced this issue Jan 27, 2022
Closes #10006 

Fixes a use case where the regex pattern creates a set of instructions that can cause the regex evaluation process to go into an infinite loop. For example, the pattern `(x?)+` creates the following instructions:

```
Instructions:
  0:    CHAR c='x', next=2
  1:      OR right=0, left=2, next=2
  2:    RBRA id=1, next=4
  3:    LBRA id=1, next=1
  4:      OR right=3, left=5, next=5
  5:     END
startinst_id=3
startinst_ids: [ 3 -1]
```

This causes in an infinite loop at instruction 4 where the path may go like: 4->3->1->2->4 ... forever.
Supporting this pattern does not look possible. The `+` quantifier is applied to capture group symbol `)` inside of which `x?` means 0 or more repeating the character `x`. This means it could match `x` or nothing and so applying the `+` to nothing would be invalid. That said, the pattern `x?+` currently already throws an error because of the invalid usage of `+` quantifier.

Therefore, the fix here adds a checking step after the instruction set is created to check for a possible infinite-loop case. If one is detected, an exception is thrown indicating the pattern is not supported.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Devavret Makkar (https://github.com/devavret)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #10095
@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants