Add work around for string split with empty input. #11292

Merged: 2 commits merged into NVIDIA:branch-24.08 from split_fix on Aug 5, 2024

revans2 (Collaborator) commented Aug 2, 2024

This fixes #11287

I traced down all of the string split calls mentioned in the original CUDF issue rapidsai/cudf#16453.

  • split
    • ColumnView.stringSplit
      • HiveTableScan for a HiveTextFile
      • GpuStringToMap
  • split_record
    • ColumnView.stringSplitRecord
      • GpuStringSplit with N = 0 and N != 1
  • split_record_re
    • ColumnView.stringSplitRecordRe -> ColumnView.stringSplitRecord
  • split_re
    • ColumnView.stringSplitRe -> ColumnView.stringSplit
  • rsplit
    • Not exposed in JNI
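
For orientation, here is a minimal plain-Python sketch of the empty-input edge case these call paths have to handle. It is not the plugin or cuDF code: `reference_split` is a hypothetical helper, and it only models the common CPU-side expectation that a null input stays null while an empty string splits into a one-element list holding the empty string (which is what Python's `str.split` does with an explicit separator).

```python
from typing import List, Optional


def reference_split(value: Optional[str], sep: str) -> Optional[List[str]]:
    """Hypothetical CPU-side model: null in, null out; otherwise str.split."""
    if value is None:
        return None
    return value.split(sep)


# A small "column" with a normal row, an empty string, and a null.
column = ["a,b", "", None]
result = [reference_split(v, ",") for v in column]

# The empty string yields [""] (one empty element), not an empty list.
print(result)
```

A column that is all empty strings or all nulls is the interesting input here, which is why the tests in this PR parametrize over those shapes.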

I added some tests for GpuStringToMap, but I could not trigger the issue there. I also added a test for array, because I saw similar odd behavior in GpuStringSplit when the number of splits = 1 (a path that doesn't call stringSplit* at all), but I could not trigger any issues there either.

I didn't touch HiveTextFile because an empty file felt invalid, but I can go back and try to test it if we want to.

jlowe (Member) previously approved these changes Aug 2, 2024

jlowe left a comment:

minor nits, lgtm.

Comment on lines 1058 to 1059
    else:
        return StringGen("", nullable=True)

Nit: defensive programming

Suggested change
-    else:
-        return StringGen("", nullable=True)
+    elif (empty_type == EmptyStringType.MIXED):
+        return StringGen("", nullable=True)
+    else:
+        raise AssertionError("unexpected empty type " + str(empty_type))

or alternatively make it a map lookup, e.g.:

empty_string_gens_map = {
    EmptyStringType.ALL_NULL: lambda: NullGen(StringType()),
    EmptyStringType.ALL_EMPTY: lambda: StringGen("", nullable=False),
    EmptyStringType.MIXED: lambda: StringGen("", nullable=True),
}

def mk_empty_str_gen(empty_type):
    return empty_string_gens_map[empty_type]()
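
A self-contained sketch of the map-lookup idea, runnable outside the test suite. The `EmptyStringType`, `StringGen`, and `NullGen` definitions below are simplified stand-ins (assumptions for illustration, not the real data generators from the integration tests), and the `AssertionError` on an unknown key is one possible way to keep the defensive behavior from the first suggestion.

```python
from dataclasses import dataclass
from enum import Enum, auto


class EmptyStringType(Enum):
    """Stand-in for the enum used by the tests (assumed shape)."""
    ALL_NULL = auto()
    ALL_EMPTY = auto()
    MIXED = auto()


@dataclass
class StringGen:
    """Hypothetical stand-in for the real string data generator."""
    pattern: str
    nullable: bool = True


@dataclass
class NullGen:
    """Hypothetical stand-in for the real all-null data generator."""
    data_type: str


# Lambdas defer construction until a generator is actually requested.
empty_string_gens_map = {
    EmptyStringType.ALL_NULL: lambda: NullGen("string"),
    EmptyStringType.ALL_EMPTY: lambda: StringGen("", nullable=False),
    EmptyStringType.MIXED: lambda: StringGen("", nullable=True),
}


def mk_empty_str_gen(empty_type):
    """Look up and build the generator, failing loudly on unknown types."""
    try:
        return empty_string_gens_map[empty_type]()
    except KeyError:
        raise AssertionError("unexpected empty type " + str(empty_type)) from None
```

With a plain dictionary lookup an unknown type would surface as a `KeyError`; translating it keeps the error message identical to the `elif`/`else` variant.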

Comment on lines 443 to 446
@pytest.mark.parametrize('empty_type', [
    EmptyStringType.ALL_NULL,
    EmptyStringType.ALL_EMPTY,
    EmptyStringType.MIXED])

Suggested change
-@pytest.mark.parametrize('empty_type', [
-    EmptyStringType.ALL_NULL,
-    EmptyStringType.ALL_EMPTY,
-    EmptyStringType.MIXED])
+@pytest.mark.parametrize('empty_type', list(EmptyStringType.__members__))
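
One detail worth checking with this suggestion, assuming `EmptyStringType` is a standard `enum.Enum` (its definition is not shown in this thread): iterating `__members__` yields the member names as strings, while iterating the enum class yields the members themselves. The sketch below uses a stand-in enum to show the difference; whether names or members are wanted depends on how the test body compares `empty_type`.

```python
from enum import Enum, auto


class EmptyStringType(Enum):
    """Stand-in enum with the three members named in the review."""
    ALL_NULL = auto()
    ALL_EMPTY = auto()
    MIXED = auto()


# __members__ is a mapping of name -> member, so list() gives the names.
names = list(EmptyStringType.__members__)

# Iterating the class gives the enum members in definition order.
members = list(EmptyStringType)

print(names)
print(members)
```

If the test compares against `EmptyStringType.MIXED` and similar members, `list(EmptyStringType)` would be the drop-in form for the parametrize list.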

Comment on lines 33 to 36
@pytest.mark.parametrize('empty_type', [
    EmptyStringType.ALL_NULL,
    EmptyStringType.ALL_EMPTY,
    EmptyStringType.MIXED])

Suggested change
-@pytest.mark.parametrize('empty_type', [
-    EmptyStringType.ALL_NULL,
-    EmptyStringType.ALL_EMPTY,
-    EmptyStringType.MIXED])
+@pytest.mark.parametrize('empty_type', list(EmptyStringType.__members__))

revans2 (Collaborator, Author) commented Aug 2, 2024

build

revans2 (Collaborator, Author) commented Aug 2, 2024

@jlowe please take another look

@revans2 revans2 merged commit d3dc496 into NVIDIA:branch-24.08 Aug 5, 2024
43 checks passed
@revans2 revans2 deleted the split_fix branch August 5, 2024 14:20
@sameerz sameerz added the bug Something isn't working label Aug 8, 2024
Linked issue (may be closed by this pull request): [BUG] String split APIs on empty string produce incorrect result