Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor Parquet writer options and builders #15831

Merged
merged 22 commits into from
Jun 10, 2024

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented May 22, 2024

Description

Adding options to the Parquet writer is made somewhat tedious by the duplication of code between the two current sets of options/builder classes; one each for the chunked and non-chunked Parquet writers. This PR pulls common options into a parent options class, and common setters into a parent builder class. The builder parent uses CRTP to allow chaining of options.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@etseidl etseidl requested a review from a team as a code owner May 22, 2024 23:27
Copy link

copy-pr-bot bot commented May 22, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label May 22, 2024
@etseidl
Copy link
Contributor Author

etseidl commented May 22, 2024

cc @vuule @mhaseeb123

@mhaseeb123 mhaseeb123 self-requested a review May 23, 2024 01:30
@vyasr
Copy link
Contributor

vyasr commented May 23, 2024

Seems closely related to #15825, just for a different format, right? Just wondering if it's worth rolling that change in too; no worries if not.

@vyasr
Copy link
Contributor

vyasr commented May 23, 2024

/ok to test

@etseidl
Copy link
Contributor Author

etseidl commented May 23, 2024

Seems closely related to #15825, just for a different format, right? Just wondering if it's worth rolling that change in too; no worries if not.

Yeah, I noticed that issue after submitting this. I can take a look tomorrow.

@ttnghia
Copy link
Contributor

ttnghia commented May 23, 2024

Wow, it seems like there are rooms for Parquet reader and ORC reader/writer too 👍

@mhaseeb123
Copy link
Member

mhaseeb123 commented May 23, 2024

Thanks for the effort, as always @etseidl. I quickly went over the long-overdue changes now but will properly review them tomorrow or Friday. I was wondering if we need to update corresponding python bindings in python.pxd and perhaps python.pyx as well?

@etseidl
Copy link
Contributor Author

etseidl commented May 23, 2024

I was wondering if we need to update corresponding python bindings in python.pxd and perhaps python.pyx as well?

@mhaseeb123 I was worried about that when I started (due to an offline discussion with @vuule), but since the API didn't change, the cython bindings didn't have to change at all 🤯. It all just worked ("worked" == "all tests passed"...just ignore all those CI failures 🤣).

@etseidl
Copy link
Contributor Author

etseidl commented May 23, 2024

Wow, it seems like there are rooms for Parquet reader and ORC reader/writer too 👍

Oh dear 😂

@vyasr
Copy link
Contributor

vyasr commented May 23, 2024

I was wondering if we need to update corresponding python bindings in python.pxd and perhaps python.pyx as well?

@mhaseeb123 I was worried about that when I started (due to an offline discussion with @vuule), but since the API didn't change, the cython bindings didn't have to change at all 🤯. It all just worked ("worked" == "all tests passed"...just ignore all those CI failures 🤣).

While this is true, in the future if you make changes to add to the API of the base class you'll have a similar issue to what this PR is trying to address: you'll have to add those methods to both the chunked and non-chunked versions in Cython because they don't have the concept of the inheritance structure. If it's not too much trouble I would recommend making the changes in this PR to save yourself the trouble. Here's an example of how that inheritance looks in Cython.

To give a bit of context, the way that exposing C++ classes to Cython works is that pxd files are basically a promise to Cython that if it generates C++ code calling a certain function, that code will exist. So if you add something to a pxd file but never call that function in any pyx files, the generated C++ code will remain unchanged.

@etseidl
Copy link
Contributor Author

etseidl commented May 23, 2024

While this is true, in the future if you make changes to add to the API of the base class you'll have a similar issue to what this PR is trying to address: you'll have to add those methods to both the chunked and non-chunked versions in Cython because they don't have the concept of the inheritance structure. If it's not too much trouble I would recommend making the changes in this PR to save yourself the trouble.

Drat, I thought I'd get away with no python changes 😭. Thanks for pointing this out.

@vuule
Copy link
Contributor

vuule commented May 23, 2024

Wow, it seems like there are rooms for Parquet reader and ORC reader/writer too 👍

Oh dear 😂

Yeah! I hope I can count on you to refactor one of these @ttnghia :)

@@ -29,6 +29,7 @@
#include <memory>
#include <optional>
#include <string>
#include <utility>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this header needed or automatically added by clangd?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clangd...part of its insidious IWYU campaign

@etseidl etseidl requested a review from a team as a code owner May 23, 2024 18:42
@etseidl etseidl requested review from vyasr and mroeschke May 23, 2024 18:42
@github-actions github-actions bot added the Python Affects Python cuDF API. label May 23, 2024
@etseidl
Copy link
Contributor Author

etseidl commented May 23, 2024

@vyasr I've made the python changes, please check them out. I hope I did the CRTP in cython correctly. Also, I just guessed at the formatting since pre-commit would only flag problems but not fix them. It seems happy now, but that might be a low bar.

@vuule vuule added the improvement Improvement / enhancement to an existing function label May 23, 2024
@etseidl
Copy link
Contributor Author

etseidl commented May 24, 2024

I've pushed the troublesome methods back into the parent class for now and left a FIXME. The two methods in question are never actually called, so perhaps they can just be removed. It would be good to run to ground why having the extra setters in the child class where they belong won't compile. @vyasr

BTW, I did a test where I modified the code where the builders are used and added the two setters to the chain. Neither of the earlier hacks (ctypedef or using <parent>.BuilderT as the return type) actually worked. The current code does, but if these methods are called from a chunked_parquet_writer_options_builder an error is thrown at compile time, so at least non-functioning code won't make it into a release (assuming users cannot access the builders...maybe a bad assumption?).

Copy link
Contributor Author

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove leftover code

python/cudf/cudf/_lib/pylibcudf/libcudf/io/parquet.pxd Outdated Show resolved Hide resolved
@etseidl
Copy link
Contributor Author

etseidl commented May 24, 2024

I've tried to reproduce the cython error by modifying the CRTP test from the cython repo, and have gotten as far as:

cdef extern from "curiously_recurring_template_pattern_GH1458_suport.h" namespace "foo" nogil:
    cdef cppclass shape_t:
        square_t()
        void set_x(int)

    cdef cppclass square_t(shape_t):
        square_t()

    cdef cppclass cube_t(shape_t):
        cube_t()
        @staticmethod
        Cube builder()
        @staticmethod
        Cube builder(int)

    cdef cppclass Base[T, Derived]:
        Base()
        T& build()
        int calculate()
        Derived& chain1()
        ...
        Derived& chain13()

    cdef cppclass Square(Base[square_t, Square]):
        Square()

    cdef cppclass Cube(
            Base[cube_t,
                 Cube]):
        Cube()
        Cube(int)
        Cube& chain20(int)

def test_derived(int x):
    """
    >>> test_derived(5)
    (8, 125)
    """
    cube_int = cube_t.builder()
    cube_int5 = cube_t.builder(x)
    cube_int.chain1().chain2().chain20(10).build().set_x(2)
    return (cube_int.calculate(), cube_int5.calculate())

and this compiles and passes. chain20() is the analog of the function that fails to compile. Haven't yet tried with a pxd file.

So the builder pattern w/ CRTP should be working. Maybe different eyes will see what I'm missing 😦.

@vyasr
Copy link
Contributor

vyasr commented May 30, 2024

Yikes I'm sorry @etseidl I meant to respond earlier and then completely forgot about it. I appreciate how hard you've tried to get the Cython working! I am (unfortunately) not that surprised that it's proving tricky. If the example from the Cython test suite is working for you and seems that similar, my next guess would be some sort of circular dependency issue in our Cython due to how many things we're importing? It's probably not the best use of your time to try and track this down though, so I'm fine with you pushing the code back up to the base class in order to get this merged. Once this PR is merged I can test out the Cython in a follow-up and see what I come up with. Thank you for getting it this far!

@github-actions github-actions bot added the pylibcudf Issues specific to the pylibcudf package label Jun 4, 2024
@etseidl etseidl requested review from vuule and mhaseeb123 June 4, 2024 15:56
@vuule
Copy link
Contributor

vuule commented Jun 4, 2024

/ok to test

@hyperbolic2346
Copy link
Contributor

/ok to test

@vuule
Copy link
Contributor

vuule commented Jun 10, 2024

/merge

@rapids-bot rapids-bot bot merged commit c02260f into rapidsai:branch-24.08 Jun 10, 2024
71 checks passed
@vyasr
Copy link
Contributor

vyasr commented Jun 11, 2024

Yikes I'm sorry @etseidl I meant to respond earlier and then completely forgot about it. I appreciate how hard you've tried to get the Cython working! I am (unfortunately) not that surprised that it's proving tricky. If the example from the Cython test suite is working for you and seems that similar, my next guess would be some sort of circular dependency issue in our Cython due to how many things we're importing? It's probably not the best use of your time to try and track this down though, so I'm fine with you pushing the code back up to the base class in order to get this merged. Once this PR is merged I can test out the Cython in a follow-up and see what I come up with. Thank you for getting it this far!

See #15978

rapids-bot bot pushed a commit that referenced this pull request Jun 11, 2024
#15831 added new inheritance patterns to the Parquet options classes, but mirroring them perfectly in Cython proved problematic due to what appeared to be issues with Cython parsing of CRTP and inheritance. A deeper investigation revealed that the underlying issue was cython/cython#6238. This PR applies the appropriate fix.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Thomas Li (https://github.com/lithomas1)
  - Bradley Dice (https://github.com/bdice)

URL: #15978
@etseidl etseidl deleted the pq_writer_opts_refactor branch June 12, 2024 21:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change pylibcudf Issues specific to the pylibcudf package Python Affects Python cuDF API.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

6 participants