Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[REVIEW] Enable writing to
s3
storage in chunked parquet writer #10769[REVIEW] Enable writing to
s3
storage in chunked parquet writer #10769Changes from 10 commits
a9ff579
8176093
7569db6
c523d19
6b01d73
9d3f9be
d05c339
a79e6d5
a11b658
aa7e050
23eccb6
867b8de
14a100a
3591b3e
75c5de3
e42796d
2a743a2
a11f507
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does compression accept
None
while statistics accepts'NONE'
? Is this a repeated pattern across cuDF? Can we adoptNone
instead of string'NONE'
values universally?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I'll open a separate PR to standardize this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it would be possible fto wrap these
ParquetWriter
objects in a file-opeing shim? This way, for non-local storaage, we could open the necessarry paths with fsspec before passing the open "file-like" objects down toParquetWriter
. The shim would need to keep track of the open files and close them after the underlyingParquetWriter
object is closed.If I understand correctly (which I may not), we should be able to write directly to s3 without the file-copy workaround.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this was my initial goto-approach(same as
df.to_parquet
) and after implementing it I ran into issues. Upon investigation, the wayParquetWriter
is written right now I realized that writing to non-local storage won't fully create and flush the data..even when we manually flush the data after writing a table, only a file is created with 0 bytes and no data is written...for the formation of a correct parquet file with the same file connection open if I try to write the parquet metadata in the footer during theclose
operation the libraries end up overwriting the existing file ins3
storage and thus losing all the past data and only writing the footer metadata. After repeated tries of different combinations of (write table, flush, write metadata, flush, close) and no one happening successfully I came to a conclusion that the ways3
(or other cloud file systems) are designed seems to be such that they act as object stores rather than true file-systems where they allow append operations to the contents of existing files. At least fors3
this is the limitation whileParquetWriter
is basically built around the concept ofappending
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay - I still have a suspicion that we can make this work, but I do think the approach you are using here is reasonable - So, I can experiment and follow up separately. My question here is definitely not a blocker :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, will be happy to chat about the findings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://docs.python.org/3/library/socket.html#socket.socket.close
""
which maps to listening on all addresses (including public addresses, which is a bad safety practice). As far as I can tell, we only access it from127.0.0.1
. The socket should be bound to127.0.0.1
instead.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did make the above change locally but it seems to error consistently, did it work for you?:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, I've seen it close connections to the ports very consistently with the existing approach too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't test this before suggesting, but I'm not sure what's going wrong. I wonder if the context manager is yielding the port and immediately exiting the context / closing the resource? Might need to refresh myself on how pytest fixtures interact with context managers, or add some print statements to check what's happening. Alternatively, you can try this instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
⬆️ This too gives the same error, but this works:
I've pushed this change, let me know if that looks good to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like we're using
moto
in Stand-alone server mode. Is it possible to use the Python API for this server instead of callingsubprocess
? http://docs.getmoto.org/en/latest/docs/server_mode.html#start-within-pythonI think it'd look like this:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're hard-coding the endpoint URI in a few locations in this file. Can we re-use the host/port/endpoint values instead? (Possibly with a fixture.)