Refactor Parquet writer options and builders #15831
Seems closely related to #15825, just for a different format, right? Just wondering if it's worth rolling that change in too; no worries if not.
/ok to test
Yeah, I noticed that issue after submitting this. I can take a look tomorrow.
Wow, it seems like there is room to do the same for the Parquet reader and the ORC reader/writer too 👍
Thanks for the effort, as always @etseidl. I quickly went over the long-overdue changes now but will properly review them tomorrow or Friday. I was wondering if we need to update corresponding python bindings in
@mhaseeb123 I was worried about that when I started (due to an offline discussion with @vuule), but since the API didn't change, the cython bindings didn't have to change at all 🤯. It all just worked ("worked" == "all tests passed"...just ignore all those CI failures 🤣).
Oh dear 😂
While this is true, if you later add to the API of the base class you'll hit an issue similar to the one this PR is trying to address: you'll have to add those methods to both the chunked and non-chunked versions in Cython, because the Cython declarations don't have the concept of the inheritance structure. If it's not too much trouble, I would recommend making the Cython changes in this PR to save yourself the trouble. Here's an example of how that inheritance looks in Cython. To give a bit of context, exposing C++ classes to Cython works like this: pxd files are basically a promise to Cython that if it generates C++ code calling a certain function, that code will exist. So if you add something to a pxd file but never call that function in any pyx files, the generated C++ code will remain unchanged.
Drat, I thought I'd get away with no python changes 😭. Thanks for pointing this out.
Yeah! I hope I can count on you to refactor one of these @ttnghia :)
@@ -29,6 +29,7 @@
#include <memory>
#include <optional>
#include <string>
#include <utility>
Is this header needed, or was it automatically added by clangd?
clangd...part of its insidious IWYU campaign
@vyasr I've made the python changes, please check them out. I hope I did the CRTP in cython correctly. Also, I just guessed at the formatting since
match C++ template args exactly and move build() into base class
…o pq_writer_opts_refactor
I've pushed the troublesome methods back into the parent class for now and left a BTW, I did a test where I modified the code where the builders are used and added the two setters to the chain. Neither of the earlier hacks ( |
Remove leftover code
I've tried to reproduce the cython error by modifying the CRTP test from the cython repo, and have gotten as far as:

cdef extern from "curiously_recurring_template_pattern_GH1458_suport.h" namespace "foo" nogil:
    cdef cppclass shape_t:
        square_t()
        void set_x(int)

    cdef cppclass square_t(shape_t):
        square_t()

    cdef cppclass cube_t(shape_t):
        cube_t()
        @staticmethod
        Cube builder()
        @staticmethod
        Cube builder(int)

    cdef cppclass Base[T, Derived]:
        Base()
        T& build()
        int calculate()
        Derived& chain1()
        ...
        Derived& chain13()

    cdef cppclass Square(Base[square_t, Square]):
        Square()

    cdef cppclass Cube(
            Base[cube_t,
                 Cube]):
        Cube()
        Cube(int)
        Cube& chain20(int)

def test_derived(int x):
    """
    >>> test_derived(5)
    (8, 125)
    """
    cube_int = cube_t.builder()
    cube_int5 = cube_t.builder(x)
    cube_int.chain1().chain2().chain20(10).build().set_x(2)
    return (cube_int.calculate(), cube_int5.calculate())

and this compiles and passes. So the builder pattern w/ CRTP should be working. Maybe different eyes will see what I'm missing 😦.
Yikes, I'm sorry @etseidl, I meant to respond earlier and then completely forgot about it. I appreciate how hard you've tried to get the Cython working! I am (unfortunately) not that surprised that it's proving tricky. If the example from the Cython test suite is working for you and seems that similar, my next guess would be some sort of circular dependency issue in our Cython due to how many things we're importing. It's probably not the best use of your time to try and track this down though, so I'm fine with you pushing the code back up to the base class in order to get this merged. Once this PR is merged I can test out the Cython in a follow-up and see what I come up with. Thank you for getting it this far!
/ok to test
/ok to test
/merge
See #15978
#15831 added new inheritance patterns to the Parquet options classes, but mirroring them perfectly in Cython proved problematic due to what appeared to be issues with Cython parsing of CRTP and inheritance. A deeper investigation revealed that the underlying issue was cython/cython#6238. This PR applies the appropriate fix.

Authors:
- Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
- Thomas Li (https://github.com/lithomas1)
- Bradley Dice (https://github.com/bdice)

URL: #15978
Description
Adding options to the Parquet writer is made somewhat tedious by the duplication of code between the two current sets of options/builder classes, one each for the chunked and non-chunked Parquet writers. This PR pulls common options into a parent options class and common setters into a parent builder class. The builder parent uses CRTP so that the shared setters return the concrete builder type, which keeps option chaining working.
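To make the shape of the refactor concrete, here is a minimal sketch of the pattern described above. It is not the actual libcudf code. Every class and method name in it (`writer_options_base`, `writer_options_builder_base`, `num_partitions`, and so on) is a hypothetical stand-in, and the template parameter order is an assumption. The point it illustrates is that a shared setter in the parent builder returns `BuilderT&` rather than a reference to the parent, so a setter that exists only on one concrete builder can still follow it in a chain.

```cpp
#include <cstddef>
#include <utility>

// NOTE: hypothetical, simplified stand-ins; not the real libcudf declarations.

// Options common to both writers live once, in a parent options class.
class writer_options_base {
 public:
  std::size_t row_group_size_rows() const { return _row_group_size_rows; }
  bool int96_timestamps() const { return _int96_timestamps; }
  void set_row_group_size_rows(std::size_t n) { _row_group_size_rows = n; }
  void enable_int96_timestamps(bool enabled) { _int96_timestamps = enabled; }

 private:
  std::size_t _row_group_size_rows = 1000000;
  bool _int96_timestamps = false;
};

// The non-chunked options add one option of their own (made up for this sketch).
class writer_options : public writer_options_base {
 public:
  std::size_t num_partitions() const { return _num_partitions; }
  void set_num_partitions(std::size_t n) { _num_partitions = n; }

 private:
  std::size_t _num_partitions = 1;
};

// The chunked options only need what the parent already provides.
class chunked_writer_options : public writer_options_base {};

// Parent builder. BuilderT is the concrete builder (the CRTP parameter) and
// OptionsT is the options type it produces.
template <typename BuilderT, typename OptionsT>
class writer_options_builder_base {
 public:
  // Shared setters return BuilderT&, not a reference to this base class,
  // so derived-only setters remain reachable later in the chain.
  BuilderT& row_group_size_rows(std::size_t n) {
    _options.set_row_group_size_rows(n);
    return static_cast<BuilderT&>(*this);
  }
  BuilderT& int96_timestamps(bool enabled) {
    _options.enable_int96_timestamps(enabled);
    return static_cast<BuilderT&>(*this);
  }
  // Hand the finished options object to the caller.
  OptionsT&& build() { return std::move(_options); }

 protected:
  OptionsT& options() { return _options; }

 private:
  OptionsT _options;
};

// Concrete builders only declare what is unique to them.
class writer_options_builder
  : public writer_options_builder_base<writer_options_builder, writer_options> {
 public:
  writer_options_builder& num_partitions(std::size_t n) {
    options().set_num_partitions(n);
    return *this;
  }
};

class chunked_writer_options_builder
  : public writer_options_builder_base<chunked_writer_options_builder,
                                       chunked_writer_options> {};
```

With this layout, a chain such as `writer_options_builder{}.row_group_size_rows(5000).num_partitions(4).build()` compiles even though `row_group_size_rows` is defined in the parent. If the parent setter returned a plain reference to the parent instead, the call to `num_partitions` would not be found.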