GH-15054: [C++] Change s3 finalization to happen after arrow threads finished, add pyarrow exit hook #33858
switch (options.log_level) {
  LOG_LEVEL_CASE(Fatal)
  LOG_LEVEL_CASE(Error)
  LOG_LEVEL_CASE(Warn)
  LOG_LEVEL_CASE(Info)
  LOG_LEVEL_CASE(Debug)
  LOG_LEVEL_CASE(Trace)
  default:
    aws_log_level = Aws::Utils::Logging::LogLevel::Off;
}
Is the new indentation an expected change?
Yes. The initialize and finalize methods were previously file-scoped functions. They are now the constructor and destructor of the AwsInstance class, and since they live inside a class they pick up one level of indentation.
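As a rough illustration of the refactoring described above, the sketch below moves a pair of init/shutdown calls into a constructor and destructor. It is not the code from this PR: `FakeSdkInit`, `FakeSdkShutdown`, `SdkInstance`, `EventLog`, and `RunScopedDemo` are hypothetical names, and the fake SDK calls only record events so the ordering can be observed.

```cpp
#include <string>
#include <vector>

// Hypothetical event log used to observe call ordering.
std::vector<std::string>& EventLog() {
  static std::vector<std::string> log;
  return log;
}

// Hypothetical stand-ins for an SDK's global init/shutdown calls.
void FakeSdkInit() { EventLog().push_back("init"); }
void FakeSdkShutdown() { EventLog().push_back("shutdown"); }

// The former file-scoped initialize/finalize functions become the
// constructor and destructor, so the object's lifetime brackets SDK use.
class SdkInstance {
 public:
  SdkInstance() { FakeSdkInit(); }
  ~SdkInstance() { FakeSdkShutdown(); }
  SdkInstance(const SdkInstance&) = delete;
  SdkInstance& operator=(const SdkInstance&) = delete;
};

// Runs some "work" inside the instance's lifetime and returns the events.
std::vector<std::string> RunScopedDemo() {
  EventLog().clear();
  {
    SdkInstance instance;
    EventLog().push_back("work");
  }  // destructor runs here, after the work
  return EventLog();
}
```

Tying shutdown to a destructor means it cannot be skipped on an early return, which is one plausible reason to prefer this shape over paired free functions.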
@westonpace do we want to try to get this into 12.0.0? (I don't know the state of the PR, though.)
@jorisvandenbossche Yes, I think we do. I'll rebase this right now, but I think the main concern here was just getting a review / second opinion. I'm going to label it a blocker.
I don't know if I'm the best person to review this, but at the moment I don't see anything glaringly problematic, and it seems like a good idea in general.
It feels like it should be better tested, but I'm not sure what a good way to do that would be. That might just be because it's the end of the day lol.
cpp/src/arrow/filesystem/s3fs.cc
}

bool IsAwsInitialized() { return !!GetAwsInstance({}); }
If this gets called before InitializeS3, the options passed there will be ignored and an instance based on {} options will stick around. I think bringing back the bool flag can avoid these risks.
Thanks @felipecrv, that is a good point regarding IsInitialized. I added the flag back but put it inside the instance object. I put everything inside the instance object and I think the code is simpler now. What do you think of this version?
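To make the concern and the fix concrete, here is a minimal sketch of an instance object that carries its own initialized flag, so that querying the state never creates or configures anything. It is an assumption-laden illustration rather than the actual s3fs.cc code: `SketchOptions`, `SketchInstance`, `GetSketchInstance`, and `IsSketchInitialized` are hypothetical names.

```cpp
#include <mutex>
#include <string>

// Hypothetical options type; the real options carry far more than this.
struct SketchOptions {
  std::string log_level = "off";
};

class SketchInstance {
 public:
  // Reads the flag only; never initializes as a side effect.
  bool IsInitialized() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return is_initialized_;
  }

  // Returns true only for the call that actually performs initialization.
  // Later calls leave the stored options untouched.
  bool EnsureInitialized(const SketchOptions& options) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (is_initialized_) return false;
    options_ = options;
    is_initialized_ = true;
    return true;
  }

  SketchOptions options() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return options_;
  }

 private:
  mutable std::mutex mutex_;
  bool is_initialized_ = false;
  SketchOptions options_;
};

// The object always exists, but it stays uninitialized until asked.
SketchInstance& GetSketchInstance() {
  static SketchInstance instance;
  return instance;
}

bool IsSketchInitialized() { return GetSketchInstance().IsInitialized(); }
```

With this shape, calling `IsSketchInitialized()` before initialization cannot pin down default options, which is the risk raised in the review comment above.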
This LGTM. Thank you!
These R crashes are a little concerning since they seem to be crashes at shutdown. Maybe it is related.
Yes, the R crashes are relevant. With this fix we ensure that
r/R/arrow-package.R
# Registers a callback to run at session exit
.onLoad <- function(libname, pkgname) {
  print(parent.env(environment()))
  reg.finalizer(parent.env(environment()), finalize_s3, onexit=TRUE)
}
I know you asked about this earlier and found .onUnload()... just putting a note here in case this gets forgotten about.
If .onUnload() isn't working you can create your own environment and register a finalizer on that:
s3_finalizer <- new.env(parent = emptyenv())
.onLoad <- function(...) {
# ...
reg.finalizer(s3_finalizer, finalize_s3, onexit=TRUE)
}
You probably shouldn't register a finalizer on the package namespace because somebody else might have done that (maybe even R).
There is also .onLoad() defined twice here as written (you should be able to use the existing .onLoad() to register this hook if you can't in fact use .onUnload()).
Ah, ok. I will change to this pattern instead of parent.env(environment())
Also, I just realized I left the print in >_<
And if .onUnload() really isn't working, make sure to put a comment in explaining that (so I don't forget about this and try to consolidate our namespace unload incorrectly).
Ok, I think I've resolved all your concerns now. Thanks for looking at this.
And if .onUnload() really isn't working make sure to put a comment in explaining that (so I don't forget about this and try to consolidate our namespace unload incorrectly)
Added a comment
The broken C++ tests are unrelated (will be fixed by #35145). It seems the R tests are passing now. Would appreciate a review from @paleolimbot or @thisisnic on the R changes. @kou do you know if cglib / ruby is using S3? If so, is it calling
Yes, I know. GLib/Ruby is using S3.
Thanks, that sounds reasonable. I'll merge when CI passes then.
…finished, add pyarrow exit hook (#33858) CRITICAL FIX: When statically linking error with AWS it was possible to have a crash on shutdown/exit. Now that should no longer be possible. BREAKING CHANGE: S3 can only be initialized and finalized once. BREAKING CHANGE: S3 (the AWS SDK) will not be finalized until after all CPU & I/O threads are finished. * Closes: #15054 Authored-by: Weston Pace <[email protected]> Signed-off-by: Weston Pace <[email protected]>
I have added the appropriate labels for visibility.
Benchmark runs are scheduled for baseline = 9626c7d and contender = 1de159d. 1de159d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
CRITICAL FIX: When statically linking Arrow with AWS it was possible to have a crash on shutdown/exit. That should no longer be possible.
BREAKING CHANGE: S3 can only be initialized and finalized once.
BREAKING CHANGE: S3 (the AWS SDK) will not be finalized until after all CPU & I/O threads are finished.