[Java] while using s3 FileSystemDatasetFactory getting this exception #36069
Comments
CC @davisusanibar @lidavidm (note that this warning was newly added to the S3 filesystem in the previous release, so it is quite possible the Java implementation has never called finalize) |
Just able to reproduce this warning with:
import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
import org.apache.arrow.dataset.jni.NativeMemoryPool;
import org.apache.arrow.dataset.scanner.ScanOptions;
import org.apache.arrow.dataset.scanner.Scanner;
import org.apache.arrow.dataset.source.Dataset;
import org.apache.arrow.dataset.source.DatasetFactory;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.ipc.ArrowReader;
import org.apache.arrow.vector.types.pojo.Schema;

public class DatasetModule {
    public static void main(String[] args) {
        String uri = "s3://voltrondata-labs-datasets/nyc-taxi-tiny/year=2022/month=2/part-0.parquet"; // AWS S3
        // String uri = "hdfs://{hdfs_host}:{port}/nyc-taxi-tiny/year=2022/month=2/part-0.parquet"; // HDFS
        // String uri = "gs://voltrondata-labs-datasets/nyc-taxi-tiny/year=2022/month=2/part-0.parquet"; // Google Cloud Storage
        ScanOptions options = new ScanOptions(/*batchSize*/ 32768);
        try (
            BufferAllocator allocator = new RootAllocator();
            DatasetFactory datasetFactory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri);
            Dataset dataset = datasetFactory.finish();
            Scanner scanner = dataset.newScan(options);
            ArrowReader reader = scanner.scanBatches()
        ) {
            Schema schema = scanner.schema();
            System.out.println(schema);
            while (reader.loadNextBatch()) {
                System.out.println("RowCount: " + reader.getVectorSchemaRoot().getRowCount());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Output messages:
Next step:
|
While running the S3 file reader from Java, after some time the machine goes into a crash loop due to this error: AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 43, A libcurl function was given a bad argument |
A simple solution to fix this error:
|
This is in Java, not Python. |
Are you not using the pyarrow package for Java? |
I guess the problem was introduced in #33858. The description contains:
A helpful way is calling
However, I guess this patch might have fixed it: #36442. Maybe you can confirm it later? |
AFAIK, there is no pyarrow package for Java. Java's dataset API uses JNI. It means that Java's dataset API calls C++ implementation directly. (It doesn't use Python.) We need to implement a Java binding for |
Java has Runtime#addShutdownHook |
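For readers unfamiliar with `Runtime#addShutdownHook`, here is a minimal sketch of how a cleanup routine can be registered to run when the JVM exits. The class name, the helper `newCleanupHook`, and the placeholder `releaseNativeResources` are hypothetical and are not part of Arrow; in a real binding the placeholder would be replaced by a native (JNI) call.

```java
public class ShutdownHookSketch {
    // Hypothetical stand-in for a native cleanup call; NOT an Arrow API.
    static void releaseNativeResources() {
        System.out.println("cleanup ran");
    }

    // Build the hook thread separately so it can be registered (or removed) by the caller.
    static Thread newCleanupHook(Runnable cleanup) {
        return new Thread(cleanup, "native-cleanup-hook");
    }

    public static void main(String[] args) {
        Thread hook = newCleanupHook(ShutdownHookSketch::releaseNativeResources);
        // The JVM starts registered hook threads during an orderly shutdown.
        Runtime.getRuntime().addShutdownHook(hook);
        System.out.println("main finished");
    }
}
```

Shutdown hooks run on normal exit and on `System.exit`, but not if the process is killed abruptly, so a hook like this is a best-effort safety net rather than a guarantee.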
@davisusanibar @danepitkin any interest in this? |
Yes! We can take this on. Thanks for the ping. |
### Rationale for this change

Java datasets can implicitly create an S3 filesystem, which will initialize S3 APIs. There is currently no explicit call to shut down S3 APIs in Java, which results in a warning message being printed at runtime: `arrow::fs::FinalizeS3 was not called even though S3 was initialized. This could lead to a segmentation fault at exit`

### What changes are included in this PR?

* Add a Java runtime shutdown hook that calls `EnsureS3Finalized()` via JNI. This is a no-op if S3 is uninitialized or already finalized.

### Are these changes tested?

Yes, reproduced with:

```
import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
import org.apache.arrow.dataset.jni.NativeMemoryPool;
import org.apache.arrow.dataset.source.DatasetFactory;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;

public class DatasetModule {
    public static void main(String[] args) {
        String uri = "s3://voltrondata-labs-datasets/nyc-taxi-tiny/year=2022/month=2/part-0.parquet";
        try (
            BufferAllocator allocator = new RootAllocator();
            DatasetFactory datasetFactory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri);
        ) {
            // S3 is initialized
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```

I didn't think a unit test was worth adding. Let me know if you think otherwise. Reasoning:

* We can't test the actual shutdown since that's a JVM thing.
* We could test to see if the hook is registered, but that involves exposing the API and having access to the thread object registered with the hook, or using reflection to obtain it. Not worth it IMO.
* No need to test the functionality inside the hook; it's just a wrapper around a single C++ API with no params/retval.

### Are there any user-facing changes?

No

* Closes: #36069

Authored-by: Dane Pitkin <[email protected]>
Signed-off-by: David Li <[email protected]>
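The "no-op if S3 is uninitialized or already finalized" behavior described above can be modeled in plain Java as a small state machine. This is only an illustrative sketch: `S3LifecycleSketch`, its `State` enum, and its methods are hypothetical names, and nothing here calls into Arrow or the C++ `EnsureS3Finalized()` function.

```java
import java.util.concurrent.atomic.AtomicReference;

public class S3LifecycleSketch {
    enum State { UNINITIALIZED, INITIALIZED, FINALIZED }

    private final AtomicReference<State> state =
        new AtomicReference<>(State.UNINITIALIZED);

    // Marks the (modeled) subsystem as initialized; does nothing if already past that point.
    void initialize() {
        state.compareAndSet(State.UNINITIALIZED, State.INITIALIZED);
    }

    // Finalizes only when currently initialized; otherwise it is a no-op.
    // Returns true only for the single call that actually performed finalization.
    boolean ensureFinalized() {
        // A real binding would invoke the native finalizer here (assumption).
        return state.compareAndSet(State.INITIALIZED, State.FINALIZED);
    }

    State currentState() {
        return state.get();
    }

    public static void main(String[] args) {
        S3LifecycleSketch s3 = new S3LifecycleSketch();
        s3.initialize();
        // Safe to register unconditionally because ensureFinalized() is idempotent.
        Runtime.getRuntime().addShutdownHook(new Thread(s3::ensureFinalized));
    }
}
```

Because the transition is an atomic compare-and-set, the hook can be registered unconditionally and an explicit earlier finalize call will not conflict with it.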
Describe the bug, including details regarding any error messages, version, and platform.
/Users/voltrondata/github-actions-runner/_work/crossbow/crossbow/arrow/cpp/src/arrow/filesystem/s3fs.cc:2598: arrow::fs::FinalizeS3 was not called even though S3 was initialized. This could lead to a segmentation fault at exit
I think the Java client is not closing the S3 client gracefully, and because of that a memory leak is happening.
Component(s)
Java