[Java] while using s3 FileSystemDatasetFactory getting this exception #36069

Closed
febinct opened this issue Jun 14, 2023 · 11 comments · Fixed by #36934

Comments

@febinct

febinct commented Jun 14, 2023

Describe the bug, including details regarding any error messages, version, and platform.

/Users/voltrondata/github-actions-runner/_work/crossbow/crossbow/arrow/cpp/src/arrow/filesystem/s3fs.cc:2598: arrow::fs::FinalizeS3 was not called even though S3 was initialized. This could lead to a segmentation fault at exit

I think the Java client is not closing the S3 client gracefully, and that is causing a memory leak.

Component(s)

Java

@westonpace westonpace changed the title while using s3 FileSystemDatasetFactory getting this exception [Java] while using s3 FileSystemDatasetFactory getting this exception Jun 14, 2023
@westonpace
Member

westonpace commented Jun 14, 2023

CC @davisusanibar @lidavidm (note that this warning was newly added to the S3 filesystem in the previous release so it is very possible the Java implementation has never been calling finalize)

@davisusanibar
Contributor

> CC @davisusanibar @lidavidm (note that this warning was newly added to the S3 filesystem in the previous release so it is very possible the Java implementation has never been calling finalize)

I was able to reproduce this warning with:

```
import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
import org.apache.arrow.dataset.jni.NativeMemoryPool;
import org.apache.arrow.dataset.scanner.ScanOptions;
import org.apache.arrow.dataset.scanner.Scanner;
import org.apache.arrow.dataset.source.Dataset;
import org.apache.arrow.dataset.source.DatasetFactory;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.ipc.ArrowReader;
import org.apache.arrow.vector.types.pojo.Schema;

public class DatasetModule {
    public static void main(String[] args) {
        String uri = "s3://voltrondata-labs-datasets/nyc-taxi-tiny/year=2022/month=2/part-0.parquet"; // AWS S3
        // String uri = "hdfs://{hdfs_host}:{port}/nyc-taxi-tiny/year=2022/month=2/part-0.parquet"; // HDFS
        // String uri = "gs://voltrondata-labs-datasets/nyc-taxi-tiny/year=2022/month=2/part-0.parquet"; // Google Cloud Storage
        ScanOptions options = new ScanOptions(/*batchSize*/ 32768);
        try (
            BufferAllocator allocator = new RootAllocator();
            DatasetFactory datasetFactory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri);
            Dataset dataset = datasetFactory.finish();
            Scanner scanner = dataset.newScan(options);
            ArrowReader reader = scanner.scanBatches()
        ) {
            Schema schema = scanner.schema();
            System.out.println(schema);
            while (reader.loadNextBatch()) {
                System.out.println("RowCount: " + reader.getVectorSchemaRoot().getRowCount());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```

Output messages:

RowCount: 2979
/Users/runner/work/crossbow/crossbow/arrow/cpp/src/arrow/filesystem/s3fs.cc:2598:  arrow::fs::FinalizeS3 was not called even though S3 was initialized.  This could lead to a segmentation fault at exit

Next steps:

  1. Investigate the cause of the warning message
  2. Add an Arrow Java cookbook recipe covering S3 integration

@febinsathar

febinsathar commented Jun 15, 2023

While running an S3 file reader from Java, after some time the machine starts crash looping due to this error: AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 43, A libcurl function was given a bad argument
AFAIK this generally happens when the AWS S3 connection pool is not closed.

@AcidChristLab

A simple solution to fix this error:

pip install --upgrade --force-reinstall pyarrow==11.0.0

@febinct
Author

febinct commented Jul 21, 2023

This is in Java, not Python.

@AcidChristLab

> this is in java not python

Are you not using the pyarrow package for java?

@mapleFU
Member

mapleFU commented Jul 23, 2023

I guess the problem was introduced in #33858. The description contains:

BREAKING CHANGE: S3 can only be initialized and finalized once.
BREAKING CHANGE: S3 (the AWS SDK) will not be finalized until after all CPU & I/O threads are finished.

A workaround is to call FinalizeS3 on exit. That patch (#33858) is in arrow-12.0.

However, I guess this patch might have fixed it: #36442. Maybe you can confirm it later?

@kou
Member

kou commented Jul 24, 2023

AFAIK, there is no pyarrow package for Java.

Java's dataset API uses JNI. It means that Java's dataset API calls C++ implementation directly. (It doesn't use Python.)

We need to implement a Java binding for arrow::fs::FinalizeS3() and users need to call it explicitly.
Or we may be able to do it implicitly if Java provides an atexit() like hook.

@lidavidm
Member

Java has Runtime#addShutdownHook
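
For readers unfamiliar with the mechanism, here is a minimal sketch of how `Runtime#addShutdownHook` can stand in for an `atexit()`-style hook. The class name `ShutdownHookSketch`, the method `registerFinalizer`, and the printed placeholder are all hypothetical; a real binding would invoke the native finalizer through JNI instead of printing. This is not the code that was eventually merged.

```java
public class ShutdownHookSketch {

    /**
     * Registers a JVM shutdown hook that runs the given finalizer.
     * The Runnable is a hypothetical stand-in for a JNI call into the
     * native library (for example a binding around FinalizeS3).
     * The hook thread is returned so callers can deregister it if needed.
     */
    public static Thread registerFinalizer(Runnable finalizer) {
        Thread hook = new Thread(finalizer, "s3-finalizer-sketch");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) {
        // Placeholder finalizer: only prints, it does not touch S3.
        registerFinalizer(() -> System.out.println("finalizing S3 (placeholder)"));
        System.out.println("main finished");
        // When the JVM exits normally, the hook runs after main returns.
    }
}
```

Shutdown hooks run on normal exit and on `System.exit`, but not when the process is killed abruptly (for example with `SIGKILL`), so a hook like this is a best-effort cleanup rather than a guarantee.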

@lidavidm
Member

@davisusanibar @danepitkin any interest in this?

@danepitkin danepitkin added this to the 14.0.0 milestone Jul 24, 2023
@danepitkin
Member

Yes! We can take this on. Thanks for the ping.

danepitkin added a commit to danepitkin/arrow that referenced this issue Jul 28, 2023
lidavidm pushed a commit that referenced this issue Aug 2, 2023
### Rationale for this change

Java datasets can implicitly create an S3 filesystem, which will initialize S3 APIs. There is currently no explicit call to shutdown S3 APIs in Java, which results in a warning message being printed at runtime:

`arrow::fs::FinalizeS3 was not called even though S3 was initialized. This could lead to a segmentation fault at exit`

### What changes are included in this PR?

* Add a Java runtime shutdown hook that calls `EnsureS3Finalized()` via JNI. This is a no-op if S3 is uninitialized or already finalized.
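
The no-op behaviour described above can be illustrated with a small pure-Java model. This is only a sketch of the "finalize at most once, and only if initialized" semantics; the class `FinalizeOnceSketch`, its state enum, and its counter are hypothetical and do not reflect how the C++ `EnsureS3Finalized()` is actually implemented.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

public class FinalizeOnceSketch {

    enum State { UNINITIALIZED, INITIALIZED, FINALIZED }

    private final AtomicReference<State> state =
            new AtomicReference<>(State.UNINITIALIZED);

    // Counts how often the (placeholder) finalization work actually ran.
    private final AtomicInteger finalizeCalls = new AtomicInteger();

    /** Moves UNINITIALIZED to INITIALIZED; has no effect in any other state. */
    public void initialize() {
        state.compareAndSet(State.UNINITIALIZED, State.INITIALIZED);
    }

    /**
     * Finalizes only if currently INITIALIZED.
     * Returns true when this call performed the finalization,
     * false when it was a no-op (never initialized or already finalized).
     */
    public boolean ensureFinalized() {
        if (state.compareAndSet(State.INITIALIZED, State.FINALIZED)) {
            finalizeCalls.incrementAndGet(); // placeholder for the native call
            return true;
        }
        return false;
    }

    public int finalizeCallCount() {
        return finalizeCalls.get();
    }

    public static void main(String[] args) {
        FinalizeOnceSketch s3 = new FinalizeOnceSketch();
        System.out.println(s3.ensureFinalized()); // never initialized: no-op
        s3.initialize();
        System.out.println(s3.ensureFinalized()); // performs finalization
        System.out.println(s3.ensureFinalized()); // already finalized: no-op
    }
}
```

Because the transition uses `compareAndSet`, calling `ensureFinalized()` from a shutdown hook is safe even if user code has already finalized explicitly.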

### Are these changes tested?

Yes, reproduced with:

```
import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
import org.apache.arrow.dataset.jni.NativeMemoryPool;
import org.apache.arrow.dataset.source.DatasetFactory;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;

public class DatasetModule {
    public static void main(String[] args) {
        String uri = "s3://voltrondata-labs-datasets/nyc-taxi-tiny/year=2022/month=2/part-0.parquet";
        try (
            BufferAllocator allocator = new RootAllocator();
            DatasetFactory datasetFactory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri);
        ) {
            // S3 is initialized
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```

I didn't think a unit test was worth adding. Let me know if you think otherwise. Reasoning:
* We can't test the actual shutdown since that's a JVM thing.
* We could test to see if the hook is registered, but that involves exposing the API and having access to the thread object registered with the hook. Or using reflection to obtain it. Not worth it IMO.
* No need to test the functionality inside the hook, it's just a wrapper around a single C++ API with no params/retval.

### Are there any user-facing changes?

No
* Closes: #36069

Authored-by: Dane Pitkin <[email protected]>
Signed-off-by: David Li <[email protected]>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023