I have a Beam job running on Spark that reads data from S3. I have tried using the S3FileSystem to read an s3:// path directly, but found this to be much slower than when I use s3a:// via HDFS in a non-Beam job. I believe it should be possible to read s3a:// paths using the HadoopFileSystem, but I can't seem to get this working.
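For context, this is roughly the kind of read I am attempting (a minimal sketch; the bucket and path are placeholders):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class ReadFromS3A {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    // This is where it falls over: Beam cannot resolve a FileSystem for the s3a scheme.
    p.apply(TextIO.read().from("s3a://my-bucket/path/to/data/*"));
    p.run().waitUntilFinish();
  }
}
```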
If I include the org.apache.beam:beam-sdks-java-io-hadoop-file-system dependency, I still see this error:
java.lang.IllegalArgumentException: No filesystem found for scheme s3a
at org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:515)
This is in an environment where I know the org.apache.hadoop:hadoop-aws dependency is already included (AWS EMR), and s3a:// paths work out-of-the-box with a plain Spark job. To be safe, I tried also including org.apache.hadoop:hadoop-aws and org.apache.hadoop:hadoop-client directly, as recommended in the Hadoop docs, but that still gave the same errors as above and below.
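As a sanity check, something like the following (a sketch; the bucket name is a placeholder) confirms whether Hadoop itself can resolve the s3a scheme on the cluster, independent of Beam:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class S3ASanityCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // If hadoop-aws is available this returns an S3AFileSystem instance;
    // otherwise it fails because the scheme or the S3AFileSystem class cannot be resolved.
    FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket/"), conf);
    System.out.println(fs.getClass().getName());
  }
}
```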
From looking at the HadoopFileSystemRegistrar, it looks like the only way to register a custom scheme is to pass an option like --hdfsConfiguration=[{\"fs.default.name\":\"s3a://{bucket_path}\"}], but this still results in an error:
org.apache.beam.sdk.util.UserCodeException: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3a"
at org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:39)
at org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn$DoFnInvoker.invokeProcessElement(Unknown Source)
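For reference, I believe the programmatic equivalent of that flag is roughly the following (a sketch; the bucket and the explicit fs.s3a.impl entry are my assumptions):

```java
import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.hadoop.conf.Configuration;

public class S3AViaHadoopFileSystem {
  public static void main(String[] args) {
    HadoopFileSystemOptions options =
        PipelineOptionsFactory.fromArgs(args).as(HadoopFileSystemOptions.class);

    Configuration conf = new Configuration();
    // fs.defaultFS is the non-deprecated name for fs.default.name.
    conf.set("fs.defaultFS", "s3a://my-bucket");
    // Explicitly map the scheme to the hadoop-aws implementation.
    conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
    options.setHdfsConfiguration(Collections.singletonList(conf));

    Pipeline p = Pipeline.create(options);
    // ... rest of the pipeline as before ...
    p.run().waitUntilFinish();
  }
}
```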
Has anyone managed to make this work?