Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework specific cudf::strings and nvtext APIs to use make_strings_children utility #12167

Closed
13 tasks done
davidwendt opened this issue Nov 16, 2022 · 0 comments
Closed
13 tasks done
Assignees
Labels
improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code.

Comments

@davidwendt
Copy link
Contributor

davidwendt commented Nov 16, 2022

The fix for #12087 is expected to go into the cudf::strings::detail::make_strings_children utility which will throw an error if it detects the output column will be too large.

There are currently 86 cudf::strings APIs with 55 of those creating new strings though only 30 of those can possibly create larger strings than its input. 18 of those 30 use the make_strings_children utility and will be automatically covered by the fix for #12087. And 2 of the 30 are already have builtin logic to handle any possible overflow. Similarly, nvtext has 3 APIs not covered.

This means there are 13 APIs that should either use the make_strings_children utility or apply the same fix. This issue is to keep track of the work for fixing these APIs.

@davidwendt davidwendt added libcudf Affects libcudf (C++/CUDA) code. tech debt improvement Improvement / enhancement to an existing function labels Nov 16, 2022
@davidwendt davidwendt self-assigned this Nov 16, 2022
rapids-bot bot pushed a commit that referenced this issue Nov 23, 2022
…h to new header (#12185)

Create new `cudf/strings/detail/strings_children.cuh` header by moving strings children functions from `utilities.cuh`. This will help with refactoring needed for #12167 and #12180
No code has changed. The `make_strings_children` and `make_offsets_child_column` have moved requiring updates to include statements in source files that included `cudf/strings/detail/utilities.cuh`.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Robert Maynard (https://github.com/robertmaynard)

URL: #12185
rapids-bot bot pushed a commit that referenced this issue Dec 1, 2022
)

Rework `cudf::strings::pad` and `cudf::strings::zfill` to use the internal `make_strings_children`. This simplifies the code and allows the operation to throw an error if it the output would exceed the size limit of a column.
No function has been added, removed, or changed.

Reference #12167

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - MithunR (https://github.com/mythrocks)

URL: #12238
rapids-bot bot pushed a commit that referenced this issue Dec 7, 2022
Rework `nvtext::detokenize` to use the `cudf::detail::make_strings_children` and the `cudf::detail::indexalator`.
This allows the operation to throw an error if it the output would exceed the size limit of a column.
The indexalator usage removes the need for a type-dispatcher call for the row-indices.
No function has been added, removed, or changed.

Reference #12167

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Bradley Dice (https://github.com/bdice)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #12267
@vuule vuule self-assigned this Dec 12, 2022
rapids-bot bot pushed a commit that referenced this issue Dec 15, 2022
…12317)

Rework `cudf::strings::from_timstamps` to use the internal `make_strings_children`. This simplifies the code and allows the operation to throw an error if it the output would exceed the size limit of a column.
No function has been added, removed, or changed.

Reference #12167

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Bradley Dice (https://github.com/bdice)

URL: #12317
rapids-bot bot pushed a commit that referenced this issue Dec 20, 2022
)

Issue #12167
 
Use `make_strings_children` in `cudf::strings::from_booleans` to simplify code and make the function more robust.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Divye Gala (https://github.com/divyegala)

URL: #12365
rapids-bot bot pushed a commit that referenced this issue Jan 3, 2023
…tility (#12401)

Rework `cudf::strings::integers_to_ipv4` to use the internal `make_strings_children`, `count_digits`, and `integer_to_string` functions. This simplifies the code and allows the operation to throw an error if it the output would exceed the size limit of a column.

Reference #12167

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Nghia Truong (https://github.com/ttnghia)

URL: #12401
rapids-bot bot pushed a commit that referenced this issue Jan 4, 2023
…#12385)

Rework `cudf::strings::url_encode` to use the internal `make_strings_children`. This simplifies the code and allows the operation to throw an error if it the output would exceed the size limit of a column.
Also fixed a bug found by inspection in the capitalize functor.

Reference #12167

Authors:
  - David Wendt (https://github.com/davidwendt)
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Bradley Dice (https://github.com/bdice)

URL: #12385
rapids-bot bot pushed a commit that referenced this issue Jan 6, 2023
…#12480)

Rework `nvtext::generate_character_ngrams` to use the internal `make_strings_children`. This simplifies the code and allows the operation to throw an error if it the output would exceed the size limit of a column.

Reference #12167

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Divye Gala (https://github.com/divyegala)
  - Nghia Truong (https://github.com/ttnghia)

URL: #12480
rapids-bot bot pushed a commit that referenced this issue Jan 7, 2023
…s_children (#12434)

Issue #12167

Use `make_strings_children` in `cudf::strings::from_fixed_point`, `cudf::strings::from_floats` and `cudf::strings::from_integers` to simplify code and make the functions more robust.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - MithunR (https://github.com/mythrocks)
  - Nghia Truong (https://github.com/ttnghia)

URL: #12434
rapids-bot bot pushed a commit that referenced this issue Jan 9, 2023
Adds the `cudf::sizes_to_offsets` utility to the `cudf::strings::detail::make_offsets_child_column` utility. Strings APIs use this when `make_strings_children` is not practical. Also lists APIs use this for building offsets too.

Reference: #12167

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Nghia Truong (https://github.com/ttnghia)

URL: #12345
rapids-bot bot pushed a commit that referenced this issue Jan 10, 2023
Use `make_strings_children` utility in parse_data nested json reader
Addresses part of #12167

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #12382
@GregoryKimball GregoryKimball moved this to Pairing in libcudf Jan 15, 2023
rapids-bot bot pushed a commit that referenced this issue Jan 17, 2023
#12541)

Rework `cudf::lists::sequences` to use the internal `sizes_to_offsets` utility instead of `thrust::exclusive_scan` for computing offsets. This allows the operation to throw an error if it the output would exceed the size limit of a column.

Reference #12167

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #12541
rapids-bot bot pushed a commit that referenced this issue Jan 17, 2023
Rework `nvtext::ngrams_tokenize` to use the internal `sizes_to_offsets` utility. This simplifies the code and allows the operation to throw an error if it the output would exceed the size limit of a column.

Reference #12167

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Nghia Truong (https://github.com/ttnghia)

URL: #12540
@GregoryKimball GregoryKimball removed the status in libcudf May 17, 2023
@GregoryKimball GregoryKimball removed this from libcudf Jun 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

No branches or pull requests

2 participants