Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] NVStrings regex Refactorings #3582

Closed
codereport opened this issue Dec 11, 2019 · 4 comments
Closed

[FEA] NVStrings regex Refactorings #3582

codereport opened this issue Dec 11, 2019 · 4 comments
Assignees
Labels
0 - Backlog In queue waiting for assignment strings strings issues (C++ and Python)

Comments

@codereport
Copy link
Contributor

codereport commented Dec 11, 2019

After discussing with David, seeing as the NVStrings regex op is a very large port - any non-local refactorings or changes should be pushed to a follow up clean up issue (this issue). List will be updated below:

  • rename LBRA, RBRA and LBRA_NC to be PAREN
  • regex::print case statement consolidation
  • change reprog_device::find to return result, instead of have output parameters; this:
__device__ inline int32_t reprog_device::find( int32_t idx, string_view const& dstr, int32_t& begin, int32_t& end )

leads to this unnecessarily confusing code:

        string_view d_str = d_strings.element<string_view>(idx);
        int32_t find_count = 0;
        size_type nchars = d_str.length();
        size_type begin = 0;
        while( begin <= nchars )
        {
            auto end = nchars;
            if( prog.find(idx,d_str,begin,end) <=0 )
                break;
            ++find_count;
            begin = end > begin ? end : begin + 1;
        }
        return find_count;
  • Node op1, op2; should be initialized to some invalid state or the design should be changed
  • Change this for loop to the any_of algorithm with strided adaptor or iterators
    for(int i = 0; i < count; i += 2) {
        if( (ch >= literals[i]) && (ch <= literals[i+1]) )
            return true;
    }
  • remove bool nextc(char32_t& c) output parameter, and then we can directly initialise uninitialised variables:
        char32_t c;
        int quoted = nextc(c);

and

        int quoted;
        quoted = nextc(yy);
  • clean this up:
char buff[8];
buff[0] = 0;
for (int i = 0; i < 7 && *exprp != '}' && *exprp != ',' && *exprp != 0; i++, exprp++)
{
    buff[i] = *exprp;
    buff[i + 1] = 0;
}

and this:

buff[0] = 0;
for (int i = 0; i < 7 && *exprp != '}' && *exprp != 0; i++, exprp++)
{
    buff[i] = *exprp;
    buff[i + 1] = 0;
}
  • give Item a non-default explicit destructor and clean this up:
            Item item;
            item.t = token;
            if (token == CCLASS || token == NCCLASS)
                item.d.yyclass_id = yyclass_id;
            else if (token == COUNTED || token == COUNTED_LAZY)
            {
                item.d.yycount.n = yy_min_count;
                item.d.yycount.m = yy_max_count;
                m_has_counted = true;
            }
            else
                item.d.yy = yy;
  • Put const everywhere
  • std::vector<int> stack; - just use a stack
  • come up with better names:
struct reljunk;
struct reinst;
  • clean this up (in tandem with to_char_utf8)
std::vector<char32_t> string_to_char32_vector( std::string const& pattern )
{
    size_type size = static_cast<size_type>(pattern.size());
    size_type count = characters_in_string(pattern.c_str(),size);
    std::vector<char32_t> result(count+1);
    char32_t* output_ptr = result.data();
    const char* input_ptr = pattern.data();
    for( size_type idx=0; idx < size; ++idx )
    {
        char_utf8 output_character = 0;
        size_type ch_width = to_char_utf8(input_ptr,output_character);
        input_ptr += ch_width;
        idx += ch_width - 1;
        *output_ptr++ = output_character;
    }
    result[count] = 0; // last entry set to 0
    return result;
}
  • use std::accumulate:
    auto classes_size = classes_count * sizeof(_classes[0]);
    for( int32_t idx=0; idx < classes_count; ++idx )
        classes_size += static_cast<int32_t>((h_prog.class_at(idx).literals.size())*sizeof(char32_t));
@codereport codereport added Needs Triage Need team to review and classify feature request New feature or request labels Dec 11, 2019
@codereport codereport self-assigned this Dec 11, 2019
@codereport codereport added code quality and removed Needs Triage Need team to review and classify labels Dec 11, 2019
@codereport
Copy link
Contributor Author

from reviewing #3409 definitely refactor reprog_device::find

@davidwendt davidwendt added the strings strings issues (C++ and Python) label Dec 15, 2020
@kkraus14
Copy link
Collaborator

kkraus14 commented Mar 5, 2021

@davidwendt @codereport is this still relevant?

@codereport
Copy link
Contributor Author

Yes, this is just a tech debt job - so not a top priority. Will put in backlog.

@codereport codereport added 0 - Backlog In queue waiting for assignment tech debt and removed feature request New feature or request labels Mar 5, 2021
@davidwendt davidwendt assigned davidwendt and unassigned codereport Oct 22, 2021
rapids-bot bot pushed a commit that referenced this issue May 25, 2022
Cleans up the source for handling fixed quantifiers `{n,m}` used for repeating patterns using a range of values instead of just zero, one, or infinite. Hopefully this will help make this part of the regex parser/compiler easier to follow and maintain. There are many other items to cleanup (reference #3582) and this change concentrates mainly on the fixed quantifier handling.

No function or behavior has changed but new gtests have been added that did not previously cover these quantifier combinations.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #10843
rapids-bot bot pushed a commit that referenced this issue May 27, 2022
Cleans up the `regcomp.cpp` source to fix class names, comments, and simplify logic around processing operators and operands returned by the parser. Several class member variables used for state are moved or eliminated. Some member functions and variables are renamed. Cleanup of the parser logic will be in a follow-on PR.

Reference #3582
Follow on to #10843

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #10879
rapids-bot bot pushed a commit that referenced this issue Jun 15, 2022
Cleans up the `regex_parser` class source to fix member names, comments, and simplify some logic creating parse-items. 
Minimal changes were made to the `regex_compiler` to accommodate some public interface changes.

Also, the `regex_compiler::expand_counted()` function was moved into `regex_parser` since it only need the parser class' `_items` data. It seemed more apt for the member function to part of `regex_parser` than `regex_compiler`.

Reference #3582

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Devavret Makkar (https://github.com/devavret)

URL: #10975
rapids-bot bot pushed a commit that referenced this issue Jun 27, 2022
This cleans up the awkward range literals for supporting the `CCLASS` and `NCCLASS` regex instructions. The range values were always paired (first,last) but arranged consecutively in a flat vector so `[idx] and [idx+1]` were range pairs `idx` was even. This PR introduces a `reclass_range` class that holds the pairs so we can use normal algorithms to manipulate them.

There is some overlap with code changes in PR #10975 

Reference #3582

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Mike Wilson (https://github.com/hyperbolic2346)
  - MithunR (https://github.com/mythrocks)

URL: #11045
@davidwendt
Copy link
Contributor

All or most of this has been addressed. We can open a new issue for any new or remaining concerns.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 - Backlog In queue waiting for assignment strings strings issues (C++ and Python)
Projects
None yet
Development

No branches or pull requests

3 participants