How can I use HadoopFileSystem to read from s3a:// ? #28603

Closed
MorganRoff-UnlikelyAI opened this issue Sep 21, 2023 · 1 comment

Comments

@MorganRoff-UnlikelyAI

Hello there,

I have a Beam job running on Spark that reads data from S3. I have tried using the S3FileSystem to read an s3:// path directly, but found this to be much slower than when I use s3a:// via HDFS in a non-Beam job. I believe it should be possible to read s3a:// paths using the HadoopFileSystem, but I can't seem to get this working.

If I include the org.apache.beam:beam-sdks-java-io-hadoop-file-system dependency, I still see this error:

 java.lang.IllegalArgumentException: No filesystem found for scheme s3a
	at org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:515)
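
For context, this first error reads like a lookup keyed on the URI scheme that finds no entry for `s3a`. Below is a minimal sketch of that kind of lookup, assuming a plain map-based registry. The class `SchemeRegistrySketch`, its methods, and the bucket name are hypothetical illustrations, not Beam's actual `FileSystems` code.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of a scheme-to-filesystem registry.
// It only illustrates the failure mode; it is NOT Beam's implementation.
public class SchemeRegistrySketch {
    // Maps a URI scheme (e.g. "s3") to the name of the filesystem handling it.
    private static final Map<String, String> REGISTRY = new HashMap<>();

    public static void register(String scheme, String fileSystemName) {
        REGISTRY.put(scheme, fileSystemName);
    }

    public static String lookup(String path) {
        // Extract the scheme from the path, e.g. "s3a" from "s3a://bucket/key".
        String scheme = URI.create(path).getScheme();
        String fileSystemName = REGISTRY.get(scheme);
        if (fileSystemName == null) {
            // Nothing registered this scheme, so the lookup fails.
            throw new IllegalArgumentException("No filesystem found for scheme " + scheme);
        }
        return fileSystemName;
    }

    public static void main(String[] args) {
        // Only "s3" is registered here; "s3a" is deliberately left out.
        register("s3", "S3FileSystem");
        System.out.println(lookup("s3://example-bucket/data"));
        try {
            lookup("s3a://example-bucket/data");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model, adding a dependency to the classpath changes nothing unless something also registers the `s3a` scheme, which would be consistent with the error persisting after the dependency was added.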

This is in an environment (AWS EMR) where I know the org.apache.hadoop:hadoop-aws dependency is already included, and s3a:// paths work out of the box with a plain Spark job. To be safe, I also tried including org.apache.hadoop:hadoop-aws and org.apache.hadoop:hadoop-client directly, as recommended in the Hadoop docs, but that still gave the same two errors shown in this post.

From looking at the HadoopFileSystemRegistrar, it seems the only way to register a custom scheme is to pass an option like --hdfsConfiguration=[{\"fs.default.name\":\"s3a://{bucket_path}\"}], but this still results in an error:

org.apache.beam.sdk.util.UserCodeException: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3a"
	at org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:39)
	at org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn$DoFnInvoker.invokeProcessElement(Unknown Source)
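
The `--hdfsConfiguration` value quoted above is a JSON list containing one object of Hadoop configuration keys, and the backslashes in it are shell escaping. A small sketch of building that argument in code rather than escaping it by hand, assuming the same `fs.default.name` key as in the option above. The helper class `HdfsConfigurationArg` and the bucket name `example-bucket` are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper that renders Hadoop configuration entries as the
// JSON-list argument shown above. It does not call Beam or Hadoop.
public class HdfsConfigurationArg {
    // Escape backslashes and double quotes so the text is safe inside a JSON string.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Render one configuration (key/value pairs) as a single-element JSON list.
    public static String toArg(Map<String, String> conf) {
        StringJoiner entries = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, String> e : conf.entrySet()) {
            entries.add("\"" + escape(e.getKey()) + "\":\"" + escape(e.getValue()) + "\"");
        }
        return "--hdfsConfiguration=[" + entries + "]";
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the entries in insertion order.
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("fs.default.name", "s3a://example-bucket"); // placeholder bucket
        System.out.println(toArg(conf));
    }
}
```

The resulting string can be passed as one element of the pipeline's argument array, which avoids a layer of shell quoting. It only affects how the option is formatted, so it would not by itself address the `UnsupportedFileSystemException` above.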

Has anyone managed to make this work?

@aromanenko-dev
Contributor

@MorganRoff-UnlikelyAI
Please ask user-related questions on the Beam user mailing list, in the Apache Beam category on Stack Overflow, or in the ASF Slack workspace.
More information is available at https://beam.apache.org/community/contact-us/

@github-actions github-actions bot added this to the 2.52.0 Release milestone Sep 28, 2023