Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] S3 Finalization functions need to exist even when S3 isn't enabled. #36974

Closed
robertmaynard opened this issue Aug 1, 2023 · 6 comments
Closed

Comments

@robertmaynard
Copy link

Describe the bug, including details regarding any error messages, version, and platform.

Before Arrow 12 you could have c++ consumers of Arrow conditional use S3 without knowing if the Arrow they are linking to has S3 support enabled.

 std::string s3_uri = "s3://.....?region=us-east-1";
 auto const no_s3_support = arrow::fs::FileSystemFromUri(s3_uri).status().IsNotImplemented();
 ....

But with Arrow 12 due to AWS SDK issues ( #33858 ) creation of an S3 filesystem
now requires an explicit call to fs::FinalizeS3() during shutdown of the application. Since the FinalizeS3 function is in the optionally compiled cpp/src/arrow/filesystem/s3fs.cc there is no easy way for a C++ user to determine if they can call FinalizeS3.

I believe that the correct solution is that Arrow needs some generalized shutdown API that is always provided. This API would be aware if S3 support is enabled and call the function when needed. Currently this can't be done by consumers since S3 support can't be safely determined by consumers:

The current solution to work around this, is that every C++ user/consumer needs to use dlopen ( or equivalant on windows ) on either libArrow or the root executable for static builds ( dlopen(NULL) ) and check if the FinalizeS3 functions exists.

Component(s)

C++

@kou
Copy link
Member

kou commented Aug 1, 2023

I agree with a finalization API that can be called with/without S3 makes user codes simpler.

We can use the following code for now:

#include <arrow/util/config.h>
#ifdef ARROW_S3
#  include <arrow/filesystem/s3fs.h>
#endif

// ...
#ifdef ARROW_S3
  arrow::fs::EnsureS3Finalized();
#endif

You don't need to use dlopen().

@robertmaynard
Copy link
Author

A compile time check is insufficient since it binds our application to a given build of Arrow. The FileSystemFromUri allows us to have runtime support checks since that ensure that we fail gracefully when the user / env has an Arrow build with different build options.

E.g. We build on machines without S3, but run on machines with S3 support.

@kou
Copy link
Member

kou commented Aug 3, 2023

Does this mean that you build your module's binary with S3 disabled Apache Arrow C++ and your user uses your built-binary with S3 enabled Apache Arrow C++ (that may be built by your user not you)?

Why do you build your module with S3 disabled Apache Arrow C++?

@robertmaynard
Copy link
Author

robertmaynard commented Aug 3, 2023

Why do you build your module with S3 disabled Apache Arrow C++?

Because it dramatically simplifies the build env requirements.

We should also remember that the ARROW_S3 is only exposed to CMake downstream users. Anyone that is using package config will not know if the Arrow they are consuming has S3 enabled or not.

@kou
Copy link
Member

kou commented Aug 3, 2023

We don't support the build style. (We don't guarantee ABI for different build options binaries.)
If you use the build style, it's "use it at your own risk". (Tricky dlopen() may be needed like you for the case.)

We should also remember that the ARROW_S3 is only exposed to CMake downstream users. Anyone that is using package config will not know if the Arrow they are consuming has S3 enabled or not.

Does the "package config" means pkg-config/pkgconf command? If so, we can add support for it by providing arrow-filesystem-s3.pc like we did for JSON module (arrow-json.pc) or adding a s3 variable to arrow-filesystem.pc like *_version in arrow.pc. If you need it, please open a new issue for it.

Note: In C++ level, we can use ARROW_S3 by #include <arrow/util/config.h> as I mentioned at #36974 (comment) . It doesn't depend on CMake nor pkg-config/pkgconf.

@robertmaynard
Copy link
Author

Does the "package config" means pkg-config/pkgconf command? If so, we can add support for it by providing arrow-filesystem-s3.pc like we did for JSON module (arrow-json.pc) or adding a s3 variable to arrow-filesystem.pc like *_version in arrow.pc. If you need it, please open a new issue for it.

Yes I meant the pkg-config files. But I missed in your first explanation that ARROW_S3 is provided via arrow/util/config.h. Since that is build system independent I don't think anything changes are needed.

The entire situtation around the AWS SDK requiring an explicit end of program shutdown call is rather unforunate. I wish we had a cleaner approach but that is best discussed in another issue.

Thank you for the help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants