Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ORC reader supports struct #2887

Merged
merged 17 commits into from
Jul 28, 2021
Merged

Conversation

firestarman
Copy link
Collaborator

@firestarman firestarman commented Jul 8, 2021

Let ORC reader support struct, since cudf has supported them by the PR rapidsai/cudf#8599 .
Also add tests for pruning nested fields.

closes #2879
closes #1481

Signed-off-by: Firestarman [email protected]

@firestarman firestarman added the cudf_dependency An issue or PR with this label depends on a new feature in cudf label Jul 9, 2021
@firestarman
Copy link
Collaborator Author

firestarman commented Jul 9, 2021

Now it is draft because some tests for nested type are failing due to the different values between CPU and GPU.

Filed the issue rapidsai/cudf#8704 to track the test failures.

Signed-off-by: Firestarman <[email protected]>
@firestarman
Copy link
Collaborator Author

Updated the tests to avoid generating null struct rows, to unblock this PR.

@firestarman firestarman marked this pull request as ready for review July 12, 2021 07:40
@firestarman firestarman changed the title [WIP] orc reader supports struct and list ORC reader supports struct and list Jul 12, 2021
@firestarman
Copy link
Collaborator Author

build

Copy link
Member

@jlowe jlowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general this looks good, but we can't default to turning ORC struct support on if we know it cannot load null struct rows properly.

TimestampGen(start=datetime(1590, 1, 1, tzinfo=timezone.utc))],
TimestampGen(start=datetime(1590, 1, 1, tzinfo=timezone.utc))]

# Set to true after the issue https://github.com/rapidsai/cudf/issues/8704 is fixed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't check this in as-is -- it can silently corrupt data on an ORC read. Either we need to make the struct support qualified by a separate incompatible config that defaults to off or wait until it is fixed.

Copy link
Collaborator Author

@firestarman firestarman Jul 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue rapidsai/cudf#8704 is marked as " v21.08 Release", so I chose to wait for the fix.

@firestarman firestarman marked this pull request as draft July 13, 2021 05:40
Signed-off-by: Firestarman <[email protected]>
@firestarman
Copy link
Collaborator Author

firestarman commented Jul 16, 2021

As suggested by @sameerz , we plan to make two separate PRs for list and struct respectively.
#2946 is for list, and will update this one for struct only.

@firestarman firestarman changed the title ORC reader supports struct and list [WIP] ORC reader supports struct Jul 16, 2021
@firestarman
Copy link
Collaborator Author

Here is the fix rapidsai/cudf#8819 for the issue rapidsai/cudf#8704

Signed-off-by: Firestarman <[email protected]>
@firestarman
Copy link
Collaborator Author

firestarman commented Jul 23, 2021

Found a bug when reading data from orc with different column order in read schema against file schema, and @wbo4958 is working on it.

@firestarman firestarman changed the title [WIP] ORC reader supports struct ORC reader supports struct Jul 26, 2021
@firestarman firestarman marked this pull request as ready for review July 26, 2021 06:13
@firestarman
Copy link
Collaborator Author

build

@firestarman
Copy link
Collaborator Author

build

@firestarman
Copy link
Collaborator Author

build

Signed-off-by: Firestarman <[email protected]>
@firestarman
Copy link
Collaborator Author

build

@firestarman
Copy link
Collaborator Author

build

@sameerz sameerz added the feature request New feature or request label Jul 27, 2021
@firestarman firestarman merged commit b4e56f3 into NVIDIA:branch-21.08 Jul 28, 2021
@firestarman firestarman deleted the orc-read branch July 28, 2021 02:16
firestarman added a commit to firestarman/spark-rapids that referenced this pull request Aug 9, 2021
Also add tests for the nested predicate pushdown, and
the support for nested column pruning.

Relevant PRs:
  NVIDIA#3079
  NVIDIA#2887

Signed-off-by: Firestarman <[email protected]>
firestarman added a commit to firestarman/spark-rapids that referenced this pull request Aug 11, 2021
Also add tests for the nested predicate pushdown, and
the support for nested column pruning.

Relevant PRs:
  NVIDIA#3079
  NVIDIA#2887

Signed-off-by: Firestarman <[email protected]>
firestarman added a commit to firestarman/spark-rapids that referenced this pull request Aug 17, 2021
Also add tests for the nested predicate pushdown, and
the support for nested column pruning.

Relevant PRs:
  NVIDIA#3079
  NVIDIA#2887

Signed-off-by: Firestarman <[email protected]>
firestarman added a commit to firestarman/spark-rapids that referenced this pull request Aug 20, 2021
Also add tests for the nested predicate pushdown, and
the support for nested column pruning.

Relevant PRs:
  NVIDIA#3079
  NVIDIA#2887

Signed-off-by: Firestarman <[email protected]>
firestarman added a commit that referenced this pull request Aug 27, 2021
* Re-enable the struct support for the orc reader.

Also add tests for the nested predicate pushdown, and
the support for nested column pruning.

Relevant PRs:
  #3079
  #2887

Signed-off-by: Firestarman <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cudf_dependency An issue or PR with this label depends on a new feature in cudf feature request New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] ORC reader supports reading Struct columns. [FEA] ORC Predicate pushdown for Nested fields
3 participants