[FEA] Add Apache Arrow schema for parquet writing #6862

Closed
hyperbolic2346 opened this issue Nov 30, 2020 · 3 comments
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@hyperbolic2346
Contributor

Is your feature request related to a problem? Please describe.
As discussed in issue #6816, cudf needs to support some sort of schema for writing parquet files. There are cases where the caller has information about the data being written that cudf currently doesn't track or has no way to know, specifically decimal precision and maps. To avoid cluttering that discussion with specifics, I am creating this request for one particular schema: Arrow's.
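To illustrate the kind of information involved, here is a minimal sketch of per-column write hints that a caller could hand to a writer. Everything in it (`ColumnSchema`, `decimal_precision`, `is_map`, `describe`) is a hypothetical name made up for this example; it is neither Arrow's schema nor an existing cudf API, just a picture of what the data alone does not tell the writer.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ColumnSchema:
    """Hypothetical per-column write hints; not a cudf or Arrow type."""
    name: str
    # Total number of decimal digits; only the caller knows this.
    decimal_precision: Optional[int] = None
    # Whether the caller wants this nested column written as a map.
    is_map: bool = False
    children: List["ColumnSchema"] = field(default_factory=list)


def describe(col: ColumnSchema, depth: int = 0) -> List[str]:
    """Flatten the schema tree into indented lines, one per column."""
    notes = []
    if col.decimal_precision is not None:
        notes.append(f"decimal(precision={col.decimal_precision})")
    if col.is_map:
        notes.append("map")
    line = "  " * depth + col.name + (": " + ", ".join(notes) if notes else "")
    lines = [line]
    for child in col.children:
        lines.extend(describe(child, depth + 1))
    return lines


# Example schema: one decimal column and one map column with key/value children.
schema = ColumnSchema("root", children=[
    ColumnSchema("price", decimal_precision=9),
    ColumnSchema("attributes", is_map=True, children=[
        ColumnSchema("key"),
        ColumnSchema("value"),
    ]),
])
```

Calling `describe(schema)` walks the tree and lists the hints attached to each column, which is roughly what a writer would consult when emitting the parquet schema.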

Describe the solution you'd like
We should import this schema into cudf. We could either build against the Arrow files or copy them into cudf wholesale. I think building against them is better for maintenance, but it adds a dependency on Arrow, which seems undesirable.

Describe alternatives you've considered
We could roll our own schema, but there are several reasons to avoid that:

  1. People who would use this are probably moving to the GPU from some other system, so adopting a commonly used schema is more useful to them.
  2. We would spend a lot of time building and maintaining the schema ourselves.

Additional context
Here is a link to the schema that is used in Arrow: https://github.com/apache/arrow/blob/master/cpp/src/parquet/schema.h

@hyperbolic2346 added the "feature request" and "Needs Triage" labels on Nov 30, 2020
@kkraus14 added the "libcudf" label and removed the "Needs Triage" label on Nov 30, 2020
@hyperbolic2346
Contributor Author

This has been discussed in the past as well: #3225

@github-actions

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

@github-actions bot added the "stale" label on Feb 16, 2021
@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
