I have a Beam job running on Spark that reads data from S3. I have tried using the S3FileSystem to read an s3:// path directly, but found this to be much slower than when I use s3a:// via HDFS in a non-Beam job. I believe it should be possible to read s3a:// paths using the HadoopFileSystem, but I can't seem to get this working.
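For context, this is roughly the kind of read I am attempting (a minimal sketch; the bucket and path are placeholders):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class ReadFromS3A {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    // This is where it falls over: Beam cannot resolve a FileSystem for the s3a scheme.
    p.apply(TextIO.read().from("s3a://my-bucket/path/to/data/*"));
    p.run().waitUntilFinish();
  }
}
```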
If I include the org.apache.beam:beam-sdks-java-io-hadoop-file-system dependency, I still see this error:
java.lang.IllegalArgumentException: No filesystem found for scheme s3a
at org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:515)
This is in an environment where I know the org.apache.hadoop:hadoop-aws dependency is already included (AWS EMR), and s3a:// paths work out-of-the-box with a plain Spark job. To be safe, I tried also including org.apache.hadoop:hadoop-aws and org.apache.hadoop:hadoop-client directly, as recommended in the Hadoop docs, but that still gave the same errors as above and below.
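As a sanity check, something like the following (a sketch; the bucket name is a placeholder) confirms whether Hadoop itself can resolve the s3a scheme on the cluster, independent of Beam:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class S3ASanityCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // If hadoop-aws is available this returns an S3AFileSystem instance;
    // otherwise it fails because the scheme or the S3AFileSystem class cannot be resolved.
    FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket/"), conf);
    System.out.println(fs.getClass().getName());
  }
}
```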
From looking at the HadoopFileSystemRegistrar, it looks like the only way to register a custom scheme is to pass an option like --hdfsConfiguration=[{\"fs.default.name\":\"s3a://{bucket_path}\"}], but this still results in an error:
org.apache.beam.sdk.util.UserCodeException: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3a"
at org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:39)
at org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn$DoFnInvoker.invokeProcessElement(Unknown Source)
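For reference, I believe the programmatic equivalent of that flag is roughly the following (a sketch; the bucket and the explicit fs.s3a.impl entry are my assumptions):

```java
import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.hadoop.conf.Configuration;

public class S3AViaHadoopFileSystem {
  public static void main(String[] args) {
    HadoopFileSystemOptions options =
        PipelineOptionsFactory.fromArgs(args).as(HadoopFileSystemOptions.class);

    Configuration conf = new Configuration();
    // fs.defaultFS is the non-deprecated name for fs.default.name.
    conf.set("fs.defaultFS", "s3a://my-bucket");
    // Explicitly map the scheme to the hadoop-aws implementation.
    conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
    options.setHdfsConfiguration(Collections.singletonList(conf));

    Pipeline p = Pipeline.create(options);
    // ... rest of the pipeline as before ...
    p.run().waitUntilFinish();
  }
}
```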
Has anyone managed to make this work?