Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a out of bound bug in parse_url query #1746

Merged
merged 8 commits into from
Jan 29, 2024

Conversation

thirtiseven
Copy link
Collaborator

@thirtiseven thirtiseven commented Jan 29, 2024

Fixes #1745

#1740 adds new test cases "https://www.nvidia.com/vote.php?=50" as url and "" (empty string) as key in a parse_url query with a key kernel, which will lead to an out of bound issue detected by compute sanitizer.

In the following code:

__device__ std::pair<string_view, bool> find_query_part(string_view haystack, string_view needle)
{
  auto const n_bytes     = needle.size_bytes();
- auto const find_length = haystack.size_bytes() - n_bytes + 1;
+ auto const find_length = haystack.size_bytes() - n_bytes;

  auto h           = haystack.data();
  auto const end_h = haystack.data() + find_length;
  auto n           = needle.data();
  bool match       = false;
  while (h < end_h) {
    match = false;  // initialize to false to prevent empty query key
    for (size_type jdx = 0; (jdx == 0 || match) && (jdx < n_bytes); ++jdx) {
      match = (h[jdx] == n[jdx]);
    }
    if (match) { match = n_bytes < haystack.size_bytes() && h[n_bytes] == '='; }
    if (match) {
      // we don't care about the matched part, we want the string data after that.
      h += n_bytes;
      break;
    } else {
      // skip to the next param, which is after a &.
      while (h < end_h && *h != '&') {
        h++;
      }
    }
    h++;
  }

  // if not match or no value, return nothing
  if (!match || *h != '=') { return {{}, false}; }

  // skip over the =
  h++;

  // rest of string until end or until '&' is query match
  auto const bytes_left = haystack.size_bytes() - (h - haystack.data());
  int match_len         = 0;
  auto start            = h;
  while (*h != '&' && match_len < bytes_left) {
    ++match_len;
    ++h;
  }

  return {{start, match_len}, true};
}

when pass haystack = "=50" and needle = "" in, the find_length will be 3 - 0 + 1 = 4, so in line:

while (h < end_h && *h != '&') {

h will be out of bound.

We use find_length to stop matching early after it can no longer contain the string we are searching for, and find_length = haystack.size_bytes() - n_bytes is a tighter bound for it.

This PR will unblock CI and submodule-sync, Integration tests and big dataset test passed from plugin side.

(I decide to keep -DUSE_SANITIZER=ON in my build command forever...)

@thirtiseven
Copy link
Collaborator Author

build

@sameerz sameerz added the bug Something isn't working label Jan 29, 2024
@thirtiseven thirtiseven self-assigned this Jan 29, 2024
Signed-off-by: Haoyang Li <[email protected]>
@thirtiseven
Copy link
Collaborator Author

Tested again with current code from plugin, passed.

Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>

// stop matching early after it can no longer contain the string we are searching for
while (h + n_bytes < h_end) {
bool match_needle = false; // initialize to false to prevent empty query key
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems can not handle the corner case:
needle = '' empty string.
Set match_needle = true?

Is this a valid case: find_query_part('=value', '') ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice catch, thanks.
The test code to generate excepted results in java test has a bug, it returns null for '=value', '' this case, while spark return value. So you are right, it should be true here.

@res-life
Copy link
Collaborator

Can current code handle this case?
haystack = "https://www.nvidia.com/vote.php?key=50" and needle = "key"?
Before there is a h++ in the end of while block, now it's removed.

Signed-off-by: Haoyang Li <[email protected]>
@thirtiseven
Copy link
Collaborator Author

Can current code handle this case? haystack = "https://www.nvidia.com/vote.php?key=50" and needle = "key"? Before there is a h++ in the end of while block, now it's removed.

only query part will be passed in this function, so I think we are good for this case.

@thirtiseven
Copy link
Collaborator Author

build

@hyperbolic2346 hyperbolic2346 merged commit c776bd9 into NVIDIA:branch-24.02 Jan 29, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Unit tests failed: ai.rapids.cudf.CudaFatalException: cudaErrorLaunchFailure: unspecified launch failure
4 participants