Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Replace nvstrdesc with string_view #5682

Closed
4 tasks done
devavret opened this issue Jul 13, 2020 · 5 comments · Fixed by #7841
Closed
4 tasks done

[FEA] Replace nvstrdesc with string_view #5682

devavret opened this issue Jul 13, 2020 · 5 comments · Fixed by #7841
Assignees
Labels
cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@devavret
Copy link
Contributor

devavret commented Jul 13, 2020

Is your feature request related to a problem? Please describe.
The struct nvstrdesc_s used in cuIO is similar to string_view. They both contain a pointer to the char array and the size of the string. (string_view contains only 2 additional members). string_view has the necessary methods to perform operations expected of nvstrdesc_s

struct nvstrdesc_s {
const char *ptr;
size_t count;
};

Describe the solution you'd like
For all purposes, nvstrdesc_s can be replaced with string_view.

Doing this will be a multi-step process.

  • In parquet writer, reduce usage of stats_column_desc members in favor of column_device_view
  • In orc writer, reduce usage of stats_column_desc members in favor of column_device_view
  • Refactor statistics calculation to remove nvstrdesc_s and use column_device_view instead of members of stats_column_desc that replicate the same information
  • Kick out stringdata_to_nvstrdesc() in all the writers.
@devavret devavret added feature request New feature or request Needs Triage Need team to review and classify labels Jul 13, 2020
@devavret devavret self-assigned this Jul 13, 2020
@davidwendt
Copy link
Contributor

I would not recommend using cudf::strings::detail::create_string_vector_from_column for this. The only function that uses this right now is cudf::strings::detail::scatter and appears to be the last remnant of code that tries to impose nullness into a string_view instance:

if (d_column.is_null(idx))
d_strings[idx] = string_view(nullptr, 0);
else
d_strings[idx] = d_column.element<string_view>(idx);

This works but is kind of hacky -- using string_view instance merely as pair container of two values (pointer and length).

This implementation should have been probably removed/fixed as part of #4548 but got missed.

I would recommend using the following API (which seems to be lacking documentation):

// Create a strings-type column from vector of pointer/size pairs
template<typename IndexPairIterator>
std::unique_ptr<column> make_strings_column(
IndexPairIterator begin, IndexPairIterator end,
rmm::mr::device_memory_resource* mr,
cudaStream_t stream )

This takes iterators over types like pair<pointer,integer> and created just for this kind of thing.

@devavret
Copy link
Contributor Author

@davidwendt the API you suggested seems to create a strings column from a pair of pointer an integer. What stringdata_to_nvstrdesc() does is the opposite. It creates a struct that is a pair of pointer and size values. Are you suggesting that we should use a pair instead of nvstrdesc_s?

@davidwendt
Copy link
Contributor

Yes, my mistake. If the intention is to encode nullness into the data vector (instead of in a separate bitmask), then I would recommend using a pair or struct like nvstrdesc_s instead of string_view.

The API I (wrongly) suggested converts a vector of pairs into a strings column.

@devavret
Copy link
Contributor Author

I don't believe nvstrdesc_s exists to encode nullness. I believe it is an artifact of the cuIO's handling of the old libcudf nvstring.

At the very least, we could consolidate the string comparison codes located here:

__device__ inline int string_view::compare(const char* data, size_type bytes) const
{
size_type const len1 = size_bytes();
const unsigned char* ptr1 = reinterpret_cast<const unsigned char*>(this->data());
const unsigned char* ptr2 = reinterpret_cast<const unsigned char*>(data);
size_type idx = 0;
for (; (idx < len1) && (idx < bytes); ++idx) {
if (*ptr1 != *ptr2) return (int)*ptr1 - (int)*ptr2;
++ptr1;
++ptr2;
}
if (idx < len1) return 1;
if (idx < bytes) return -1;
return 0;
}
and here:
template <class T, const T lesser, const T greater, const T equal>
inline __device__ T nvstr_compare(const char *as, uint32_t alen, const char *bs, uint32_t blen)
{
uint32_t len = min(alen, blen);
uint32_t i = 0;
if (len >= 4) {
uint32_t align_a = 3 & reinterpret_cast<uintptr_t>(as);
uint32_t align_b = 3 & reinterpret_cast<uintptr_t>(bs);
const uint32_t *as32 = reinterpret_cast<const uint32_t *>(as - align_a);
const uint32_t *bs32 = reinterpret_cast<const uint32_t *>(bs - align_b);
uint32_t ofsa = align_a * 8;
uint32_t ofsb = align_b * 8;
do {
uint32_t a = *as32++;
uint32_t b = *bs32++;
if (ofsa) a = __funnelshift_r(a, *as32, ofsa);
if (ofsb) b = __funnelshift_r(b, *bs32, ofsb);
if (a != b) {
return (lesser == greater || __byte_perm(a, 0, 0x0123) < __byte_perm(b, 0, 0x0123))
? lesser
: greater;
}
i += 4;
} while (i + 4 <= len);
}
while (i < len) {
uint8_t a = as[i];
uint8_t b = bs[i];
if (a != b) { return (a < b) ? lesser : greater; }
++i;
}
return (alen == blen) ? equal : (alen < blen) ? lesser : greater;
}

@kkraus14 kkraus14 added libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue and removed Needs Triage Need team to review and classify labels Aug 5, 2020
@vuule vuule added the tech debt label Dec 3, 2020
@devavret devavret assigned kaatish and unassigned devavret Dec 11, 2020
rapids-bot bot pushed a commit that referenced this issue Feb 4, 2021
Closes #6893, closes #6894. Contributes to #5682 

Reduce usage of stats_column_desc members in Parquet writer with column_device_view members.

Authors:
  - Kumar Aatish (@kaatish)

Approvers:
  - David (@davidwendt)
  - Vukasin Milovanovic (@vuule)
  - Devavret Makkar (@devavret)

URL: #7097
@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

rapids-bot bot pushed a commit that referenced this issue Mar 25, 2021
This PR adds column_device_view members to EncChunk, DictionaryChunk and StripeDictionary structures which are used in the ORC writer. The idea is to replace members in these structures which replicate the same information. Usage of nvstrdesc_s has also been eliminated in the ORC writer.

Fixes #7347, Addresses #5682, Addresses #7334

Authors:
  - Kumar Aatish (@kaatish)

Approvers:
  - Vukasin Milovanovic (@vuule)
  - Devavret Makkar (@devavret)

URL: #7676
rapids-bot bot pushed a commit that referenced this issue Apr 17, 2021
Fixes #5682.

- Structure `nvstrdesc_s` was replaced with `thrust::pair<const char*, size_type>;`.
- `nvstrdesc_s` related logical functions such as `nvstr_is_lesser`, `nvstr_is_greater` etc. were removed.
- Include directives for headers included by source files residing in the same directory were made relative as per the developer guide.
- `make_column` function related to `column_buffer` was moved from a header file to an implementation file.

Authors:
  - Kumar Aatish (https://github.com/kaatish)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - https://github.com/nvdbaranec
  - Devavret Makkar (https://github.com/devavret)
  - Keith Kraus (https://github.com/kkraus14)

URL: #7841
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants