multibyte_split #8702

Merged
merged 93 commits into from
Aug 24, 2021

@cwharris (Contributor) commented Jul 9, 2021

Adds the multibyte_split API, part of #8557. It takes one large text input and splits it into a single strings column.

  • Features:
    • split on multi-byte delimiters
    • split on multiple delimiters simultaneously
    • erase delimiters from output (will implement later)
    • replace delimiters with alternate text (will implement later)
  • Supported input types
    • cudf::io::text::data_chunk_source
      • cudf::string_scalar via cudf::device_span
      • std::string via std::istream
      • files via std::istream
  • Supported delimiter type
    • std::string
  • Performance Goals
    • ~2G/s from file, ~4G/s on-device. There is room for improvement, but perf is good enough for now.
  • Additional goals:
    • add reusable block-level pattern-matching utility.
    • add reusable block-level utility to "peek" at "future" scan states (will implement with delimiter erasure).
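
The splitting behavior listed under Features can be illustrated with a small host-only sketch. This is not the libcudf implementation and does not use any cudf API: `split_keep_delimiters` is a hypothetical helper written in plain C++ for illustration. It assumes that delimiters stay at the end of the row they terminate (delimiter erasure is listed above as not yet implemented), that the longest delimiter wins when several match at the same position, and that no empty trailing row is produced when the input ends with a delimiter. The actual GPU implementation may differ on any of these points.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical host-side reference, NOT part of libcudf.
// Splits `text` after each occurrence of any delimiter and keeps the
// delimiter at the end of the row it terminates.
std::vector<std::string> split_keep_delimiters(std::string const& text,
                                               std::vector<std::string> const& delimiters)
{
  std::vector<std::string> rows;
  std::size_t row_begin = 0;  // start of the row currently being built
  std::size_t pos       = 0;  // current scan position

  while (pos < text.size()) {
    // Find the longest delimiter that matches at `pos` (assumed tie-break rule).
    std::size_t matched = 0;
    for (auto const& d : delimiters) {
      if (!d.empty() && d.size() > matched && text.compare(pos, d.size(), d) == 0) {
        matched = d.size();
      }
    }

    if (matched > 0) {
      // Close the row just past the delimiter so the delimiter is retained.
      pos += matched;
      rows.push_back(text.substr(row_begin, pos - row_begin));
      row_begin = pos;
    } else {
      ++pos;
    }
  }

  // Emit whatever follows the last delimiter, if anything.
  if (row_begin < text.size()) { rows.push_back(text.substr(row_begin)); }
  return rows;
}
```

Under these assumptions, splitting `a::b::c` on the two-byte delimiter `::` gives the rows `a::`, `b::`, and `c`, and passing both `::` and `\n` as delimiters splits on whichever occurs next.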

@github-actions bot added the CMake (CMake build issue) and libcudf (Affects libcudf (C++/CUDA) code.) labels Jul 9, 2021
@cwharris added the non-breaking (Non-breaking change) and feature request (New feature or request) labels Jul 9, 2021
@vuule (Contributor) left a comment

Some questions/suggestions related to the new benchmark.

cpp/benchmarks/io/text/multibyte_split_benchmark.cpp (4 review threads, outdated and resolved)
@cwharris cwharris requested a review from vuule August 23, 2021 21:59
@cwharris cwharris requested a review from jrhemstad August 24, 2021 15:49
@cwharris (Contributor, Author) commented:

rerun tests

@cwharris (Contributor, Author) commented:

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 8075199 into rapidsai:branch-21.10 Aug 24, 2021
@cwharris cwharris deleted the multibyte-split branch August 24, 2021 21:26
rapids-bot bot pushed a commit that referenced this pull request Sep 17, 2021
Provides the Python/Cython bindings for #8702 multibyte_split. This PR depends on #8702 being merged first.

Closes #8557

Authors:
  - Jeremy Dyer (https://github.com/jdye64)
  - Christopher Harris (https://github.com/cwharris)

Approvers:
  - https://github.com/nvdbaranec
  - Vyas Ramasubramani (https://github.com/vyasr)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #8998
Labels: CMake, feature request, libcudf, non-breaking
9 participants