Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parquet_column_view should use type_dispatcher #5565

Closed
jrhemstad opened this issue Jun 23, 2020 · 9 comments · Fixed by #7461
Closed

parquet_column_view should use type_dispatcher #5565

jrhemstad opened this issue Jun 23, 2020 · 9 comments · Fixed by #7461
Labels
cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code.

Comments

@jrhemstad
Copy link
Contributor

Describe the bug

parquet_column_view has an explicit switch over type_ids:

switch (col.type().id()) {

This kind of code is extremely problematic and is precisely why the type_dispatcher was developed.

This should be replaced with a type dispatched function object.

@jrhemstad jrhemstad added bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue labels Jun 23, 2020
@devavret
Copy link
Contributor

This exact thought crossed my mind when I was working on it, so I tried. But the code for each type is so different that it needed a separate template specialization for each type. There's no code reusability between any two types. The primary goal this achieves is to provide a translation for cud::type_id to parquet::Type, which needs to be defined case by case somewhere a. la. type_to_id

As a negative, this reduced the readability of the existing code.

@harrism harrism changed the title [BUG] parquet_column_view should use type_disptacher [BUG] parquet_column_view should use type_dispatcher Jun 25, 2020
@jrhemstad
Copy link
Contributor Author

This exact thought crossed my mind when I was working on it, so I tried. But the code for each type is so different that it needed a separate template specialization for each type. There's no code reusability between any two types. The primary goal this achieves is to provide a translation for cud::type_id to parquet::Type, which needs to be defined case by case somewhere a. la. type_to_id

As a negative, this reduced the readability of the existing code.

Using the type_dispatcher ensures that any new types added are automatically picked up instead of easy to forget need to manually update the switch statement.

@devavret
Copy link
Contributor

Using the type_dispatcher ensures that any new types added are automatically picked up instead of easy to forget need to manually update the switch statement.

What good would that be if parquet needs special handling for every type anyway?

@jrhemstad
Copy link
Contributor Author

Using the type_dispatcher ensures that any new types added are automatically picked up instead of easy to forget need to manually update the switch statement.

What good would that be if parquet needs special handling for every type anyway?

Because the type_dispatcher is the single source of truth for mapping runtime types to compile time types. There should not be exceptions or deviations.

@cwharris
Copy link
Contributor

The design of type_dispatcher encompasses a few things, some of which are not accounted for in the example code. For one, the type_dispatcher pattern encourages a pure-function approach, so setting instance-specific state is discouraged, or it at least encourages a single assignment of a custom type. In that way, type_dispatcher's design has a cascade effect. Using it incorrectly or in a clunky way reveals "anti-patterns" in the calling code.

Here, for example, we are assigning three separate highly coupled values, and this would require, at the very least, a tie at the call site to accomplish using type_dispatcher, but given that we would have already created a tuple to return from type dispatcher, it's likely we would have noticed the correlation and created a custom type, or simply used the tuple as the source of truth, requiring us to change each value at once, reducing the likelihood of erroneous state changes down the line.

switch (col.type().id()) {
case cudf::type_id::INT8:
_physical_type = Type::INT32;
_converted_type = ConvertedType::INT_8;
_stats_dtype = statistics_dtype::dtype_int8;
break;

Although less of an issue, type_dispatcher also eliminates the possibility of omitting a break and falling through to the next switch, which could save some debugging time.

@cwharris
Copy link
Contributor

cwharris commented Sep 13, 2020

That being said, this code works as is and probably shouldn't be changes just for the sake of removing the switch. Replacing it and refactoring the class in to a pure function at the same time would probably be advantageous, but as it stands there's not much to be gained now that it's implemented and tested. The type_dispatcher pattern should still be used for new code, avoiding type-based switches. It's verbose, but encourages better code upstream.

@cwharris
Copy link
Contributor

One more thought on the subject...

The primary goal this achieves is to provide a translation for cud::type_id to parquet::Type, which needs to be defined case by case somewhere a. la. type_to_id

This is actually an argument in favor of type_dispatcher, as it imposes a compile-time restraint on handling all cases, even if one of those cases is a catch all al a default that calls CUDF_FAIL. If there truly is a mapping per type, there won't CUDF_FAIL case, and any time a new type is added, it will need to be accounted for explicitly, reducing the likelihood of future bugs.

@vuule vuule removed the bug Something isn't working label Feb 3, 2021
@vuule vuule changed the title [BUG] parquet_column_view should use type_dispatcher parquet_column_view should use type_dispatcher Feb 3, 2021
@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@kaatish
Copy link
Contributor

kaatish commented Apr 12, 2021

Fixed by #7461

@kaatish kaatish closed this as completed Apr 12, 2021
@kaatish kaatish linked a pull request Apr 13, 2021 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants