Speed of Reading into ADAM RDDs from S3 #2003
Hi all,

We're using ADAM via the Python API, and we're running into some bottlenecks loading data from S3 using s3a. We're seeing a maximum throughput of about 100 Mbps when reading BAMs and VCFs into ADAM RDDs from S3, while loading the same files into Spark as text files reaches ~1 Gbps. I realize many factors could affect this performance, but are these numbers in the ballpark of what's expected for this use case of ADAM? If not, are there recommended troubleshooting steps?

I can provide more info if needed.

Thanks!
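One rough way to put numbers on the comparison is to time a full count through both read paths. A minimal sketch, assuming a live PySpark session where `spark` and `adamContext` already exist and with a placeholder bucket and file:

```python
import time

def timed(label, count_fn):
    # Force a full read by counting; crude, but fine for ballpark throughput.
    start = time.time()
    n = count_fn()
    print("%s: %d records in %.1f s" % (label, n, time.time() - start))

# Baseline: the raw file read as plain text through Spark.
timed("textFile",
      lambda: spark.sparkContext.textFile("s3a://my_bucket/sample.vcf").count())

# The same file parsed into an ADAM genomic dataset.
timed("loadVariants",
      lambda: adamContext.loadVariants("s3a://my_bucket/sample.vcf").toDF().count())
```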
Hello @nick-phillips, thanks for the question. Reading BAM and VCF from S3 via s3a has been behaving strangely for me and others recently; Parquet works fine, though. See #1951. It might be useful to discuss this further on Gitter; feel free to start a one-on-one if there is anything sensitive about your environment.
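A minimal sketch of the Parquet path that does work, assuming a recent ADAM build where the Python `ADAMContext` wraps a `SparkSession`, and a hypothetical bucket and dataset name:

```python
from pyspark.sql import SparkSession
from bdgenomics.adam.adamContext import ADAMContext

spark = SparkSession.builder.getOrCreate()
ac = ADAMContext(spark)

# Parquet written by ADAM (saveAsParquet) carries the metadata ADAM expects,
# and reading it over s3a avoids the BAM/VCF code path that hits the leaks.
variants = ac.loadVariants("s3a://my_bucket/variants.adam")
print(variants.toDF().count())
```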
@heuermh - would you care to elaborate on "Things have been strange for me and others reading BAM and VCF from S3 via s3a recently"? I am also attempting to load VCFs via Parquet through the Python API, as you suggested for @nick-phillips, and I saved a VCF I had loaded to Parquet via

```python
df = adamContext.loadVariants(path).toDF()
```

When I try to load this, however, I get the following error:

```
Py4JJavaError: An error occurred while calling o72.loadVariants.
```

My Parquet VCF exists as a directory on S3 with many partitioned files, and I did verify the _SUCCESS file was there. Although the documentation says that loadVariants in the Python API supports Parquet, it can't seem to load it with the s3a protocol. Is there something I am missing here?
@pjongeneel It's in the linked issue; there are thread leaks upstream in the Hadoop libraries that cause trouble.

```python
df = adamContext.loadVariants(path).toDF()
df.write.format("parquet").save("s3a://my_bucket/df.parquet")
```

I haven't tried writing the DataFrame out that way. Saving through `df.write` never calls `saveMetadata("s3a://my_bucket/df.parquet")`, so the metadata ADAM expects alongside the Parquet files is missing, which is why loadVariants fails on it. You can write to Parquet via the ADAM API instead:

```python
adamContext.loadVariants(path).saveAsParquet("s3a://my_bucket/df.parquet")
```
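For completeness, a sketch of the round trip the suggestion implies, using only the calls shown above (bucket and paths are placeholders):

```python
# saveAsParquet writes the Parquet partitions plus ADAM's metadata,
# so loadVariants can read the result back.
variants = adamContext.loadVariants("s3a://my_bucket/input.vcf")
variants.saveAsParquet("s3a://my_bucket/variants.adam")

reloaded = adamContext.loadVariants("s3a://my_bucket/variants.adam")
print(reloaded.toDF().count())
```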
@heuermh, thanks for the info. I tried that and it worked fine! Side note: I actually did get the original code to save my DataFrame by following the ADAM Scala API (`override def saveAsParquet(filePath: String, ...)`); however, when I saved it manually like that, the output files did not match the `.gz.parquet` files that `saveAsParquet` writes. Not sure yet if there is an easy way to save the DataFrame directly as the `.gz.parquet` files, but I have a solution that works for now, so thank you!
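Not tried in this thread, but on the side note: plain Spark can emit gzip-compressed Parquet directly from the DataFrame writer. An untested sketch, with a placeholder output path:

```python
# Gzip codec yields part-*.gz.parquet file names, but this still skips
# ADAM's metadata, so the result is readable with spark.read.parquet
# rather than adamContext.loadVariants.
df.write \
    .option("compression", "gzip") \
    .parquet("s3a://my_bucket/df.gz.parquet")
```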
Closing as resolved.