Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Java API change for supporting structs #7730

Merged
merged 19 commits into from
Apr 16, 2021

Conversation

razajafri
Copy link
Contributor

@razajafri razajafri commented Mar 25, 2021

This is a very rough draft PR to tie down the interface change to support Structs for Parquet writer. Once we have the interface down, it's just a matter of coding in the rest of the pieces.

Here is how I envision it to be used by the end-user.

      ParquetWriterOptions options = ParquetWriterOptions.builder()
          .withColumnOptions(ParquetColumnWriterOptions.builder()
              .withColumnName("_c0")
              .withTimestampInt96(false)
              .withDecimalPrecision(4)
              .isNullable(true).build())
          .withStatisticsFrequency(ParquetWriterOptions.StatisticsFrequency.NONE)
          .withCompressionType(CompressionType.AUTO)
          .withMetadata("test", "test")
          .build();

I still don't know a good way to get rid of the old withColumnNames API on ParquetWriterOptions. We can't remove it as ORC is still using it. One option could be to just rip out the WriterOptions from the ParquetWriterOptions hierarchy.

@github-actions github-actions bot added the Java Affects Java cuDF API. label Mar 25, 2021
@razajafri razajafri changed the title API change for supporting structs [skip ci] Java API change for supporting structs [skip ci] Mar 25, 2021
@razajafri razajafri added breaking Breaking change 2 - In Progress Currently a work in progress 4 - Needs cuDF (Java) Reviewer Spark Functionality that helps Spark RAPIDS improvement Improvement / enhancement to an existing function labels Mar 25, 2021
Copy link
Contributor

@jlowe jlowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High-level idea seems OK to me as long as we flatten it when calling into JNI.

I still don't know a good way to get rid of the old withColumnNames API on ParquetWriterOptions.

We'd either need to disconnect it from the old one or have a compatibility mode where it translates the old values to new values if they were set via the old API, and of course the old API won't support structs. Setting options via both APIs would be an error. I'm OK if we want to ditch the old one, but it would be a breaking change.

@revans2
Copy link
Contributor

revans2 commented Mar 25, 2021

A few things about the API.

  1. I don't see a lot of value in requiring a builder for simple child columns.
ParquetWriterOptions options = ParquetWriterOptions.builder()
  .withSimpleColumn("column name 1")
  .withSimpleNullableColumn("column name 2")
  .withComplextColumn(ParquetColumnWriterOptions.builder()....
  1. Should we add some kind of types to make the API more readable and less likely to let users point a gun at their foot?

The current C++ API lets anything go. i.e. Setting precision for non-decimal types and adding children to non-nested columns. Perhaps we could have something like.

ParquetWriterOptions options = ParquetWriterOptions.builder()
  .withColumn("column name 1")
  .withListColumn(ParquetColumnWriterOptions.builder()...)

One of the best ways I have found to do API design is to use the API.

Lets create a schema for a table that has a list of timestamps that should be output as int96.

     ParquetWriterOptions options = ParquetWriterOptions.builder()
         .withListColumn("_c0")
             .withTimestampData()
                 .withInt96()
                 .build()
             .build()
         .build();

Is that better than

      ParquetWriterOptions options = ParquetWriterOptions.builder()
          .withColumnOptions(
              ParquetColumnWriterOptions.builder()
                  .withColumnName("_c0").
                  .withColumnOptions(
                      ParquetColumnWriterOptions.builder()
                      .withTimestampInt96()
                      .build())
              .build())
          .build();

?

The second one is much more verbose, but I think it does a better job of matching up the build() with what it is building... Perhaps we can have a hybrid

    ParquetWriterOptions options = ParquetWriterOptions.builder()
        .withColumn(
             ParquetColumnWriterOptions.listBuilder(
                ParquetColumnWriterOptions.timestampBuilder()
                    .withInt96()
                    .build())
                .withColumnName("_c0")
                .build())
    .build();

@razajafri razajafri force-pushed the parquet-struct-support branch from 68d0d30 to 4907f2b Compare March 30, 2021 05:16
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Mar 30, 2021
@razajafri
Copy link
Contributor Author

@revans2 @jlowe PTAL.

I think we need to get this in ASAP so it makes it to this release

Copy link
Contributor

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we should be rushing this. I know you have deadlines but the current API is not clean and I'm not that excited about pushing one change at the end of a release knowing that we are likely going to have to make breaking changes to it in the next release.

@razajafri
Copy link
Contributor Author

@devavret for cpp change

@razajafri razajafri changed the title Java API change for supporting structs [skip ci] Java API change for supporting structs Mar 30, 2021
@razajafri
Copy link
Contributor Author

@revans2 do you have any more questions?

@razajafri razajafri marked this pull request as ready for review April 8, 2021 20:28
@razajafri razajafri requested review from a team as code owners April 8, 2021 20:28
@razajafri razajafri requested a review from hyperbolic2346 April 8, 2021 20:28
@razajafri razajafri changed the base branch from branch-0.19 to branch-0.20 April 9, 2021 17:49
@razajafri
Copy link
Contributor Author

@revans2 @jlowe I have improved the API. PTAL

@razajafri
Copy link
Contributor Author

@revans2 I have addressed your concerns. PTAL

@razajafri
Copy link
Contributor Author

build

@razajafri
Copy link
Contributor Author

rerun tests

}

/**
* Set column name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to say that it adds a child column with the given name. and that should apply to all of the with methods.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still not done


/**
* Set column name
* @param name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless we describe what the name is marking it as a param does nothing. This applies to all of the other java doc comments for params that are empty.

return new ListBuilder(name, true);
}

public static ListBuilder listBuilder(String name, boolean isNullable) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need java docs here. They really need to describe how a list builder is different from a StructBuilder and that the name of the child is ignored.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I Still don't see it documented anywhere how a list builder works or that the name for it's children is ignored.

@razajafri
Copy link
Contributor Author

razajafri commented Apr 15, 2021

CI is failing due to a known problem. There is a PR that was merged in rmm. This PR fixes this error, so we will have to wait until its merged to kick off the tests again

Copy link
Contributor

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am just done to nits in the comments.

@razajafri
Copy link
Contributor Author

rerun tests

@razajafri razajafri added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 2 - In Progress Currently a work in progress labels Apr 16, 2021
@razajafri
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 3327f7b into rapidsai:branch-0.20 Apr 16, 2021
@razajafri razajafri deleted the parquet-struct-support branch April 16, 2021 15:21
razajafri added a commit to razajafri/cudf that referenced this pull request Apr 16, 2021
rapids-bot bot pushed a commit that referenced this pull request Apr 17, 2021
This reverts commit 3327f7b.

We have to revert this because the dependent project is broken and my system is in a broken state.

Authors:
  - Raza Jafri (https://github.com/razajafri)

Approvers:
  - Rong Ou (https://github.com/rongou)
  - Ram (Ramakrishna Prabhu) (https://github.com/rgsl888prabhu)
  - Robert (Bobby) Evans (https://github.com/revans2)

URL: #7987
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge breaking Breaking change improvement Improvement / enhancement to an existing function Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants