Poor performance for QC filtering on 1KG data #148
Hi, I am trying to run some basic QC on a 1KG dataset with ~25M variants and 629 samples, and I'm having trouble making it work efficiently with Glow. The same operations take 35 seconds with PLINK, about 20 minutes with Hail, and at least 3 hours with Glow (I killed the jobs after that long, so I'm not sure how long they would ultimately have taken). Is there some way I can rewrite this more efficiently?

[…]

I'm using the latest version of Glow, but I'm running on my own snapshot build with the fix for #135 (which is necessary for the above to run). I'm also using Spark 2.4.4 in local mode with a 64G heap (on a 128G, 16-core machine). I know local mode is not a typical use case, but I don't see what the problem is here, since all 16 cores are at nearly 100% utilization for the full 3 hours. It's also not GC pressure: only ~32G of the heap is actually in use, and according to the Spark UI the task-time-to-GC-time ratio is 4.6 h to 1.4 min for one of these jobs after a couple of hours.

Do you have any ideas what might be going on here? Thank you!
Comments
Hi @eric-czech, I observed a significant performance improvement by skipping the […]

On a 1-worker (r4.4xlarge) cluster, running the rewritten analysis on a […]. There are likely more performance gains to be made here, but this was the clearest change to make for me.
Hey @karenfeng, thank you! I can confirm that running a version like yours without maps is much faster, though I had to modify it slightly since my version of Spark doesn't yet have a […]:

```scala
import org.apache.spark.sql.DataFrame

def filterBySampleCallRate(threshold: Double)(df: DataFrame): DataFrame = {
  val qc = df.selectExpr("sample_call_summary_stats(genotypes, referenceAllele, alternateAlleles) as qc")
  df.crossJoin(qc)
    // For each row, filter the genotypes array (which has one element per sampleId)
    // based on whether that sample passed the QC filter. Note that sequence(0, size(genotypes))
    // is inclusive on both ends, so it yields one extra index; that trailing pair gets a null
    // genotype from zip_with and a null qc entry, so the filter drops it.
    .selectExpr("*", s"""
      transform(
        filter(
          zip_with(sequence(0, size(genotypes)), genotypes, (i, g) -> (i, g)),
          e -> qc[e.i].callRate >= $threshold
        ),
        e -> e.g
      ) as filtered_genotypes
    """)
    // Remove intermediate fields
    .drop("qc", "genotypes")
    .withColumnRenamed("filtered_genotypes", "genotypes")
    // Ensure that the original dataset schema is preserved
    .transform(d => { assert(d.schema.equals(df.schema)); d })
}
```
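For completeness, a hypothetical usage of the function above, assuming a DataFrame read through Glow with the standard `genotypes`/`referenceAllele`/`alternateAlleles` columns (the 0.95 threshold is an illustrative value, not from the thread):

```scala
// Apply the sample call-rate filter at a 95% threshold (illustrative value).
val filtered = df.transform(filterBySampleCallRate(0.95))
```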
I found that this runs in about 1 hour and 25 minutes on my setup (comparable to an r4.4xlarge) when reading from Parquet.

One last thought before I close this out: have you all at Databricks/Regeneron thought about how to propagate some bitpacked or otherwise optimized schema through these kinds of pipelines? I think the difference you saw when reading from PLINK files directly is pretty telling, so I was wondering if you had ideas on how to maintain a more efficient data representation as Parquet throughout a whole pipeline. I was also wondering if you had any plans/ideas for how to get methods like VariantQC to benefit from some form of vectorization/data parallelism. I'm finding in some benchmarks with other systems that this also makes a huge difference in performance, so I was curious whether optimizations like either of those are on the roadmap.

Thank you again for taking the time to look at that code!
@eric-czech fwiw, I think that the filter approach in Spark 3.0 will be significantly faster than the […]

Those are great questions! I believe that the Parquet spec actually will enable very efficient encoding for genotype data. Both the […]

Regarding SIMD optimizations, we don't have immediate plans to do this directly in Glow, but Databricks engineering is working on some exciting projects to use these techniques to speed up many workloads, including genomics. Stay tuned ;).
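For illustration, a sketch (not from the thread) of what that Spark 3.0 form could look like, reusing the `qc` cross-join and `threshold` from the snippet above. Spark 3.0's `filter` accepts a two-argument lambda whose second argument is the element index, which removes the need for the `zip_with`/`sequence` workaround:

```scala
// Spark 3.0+: filter() passes the element index to a two-argument lambda,
// so the index can be used to look up the matching per-sample QC entry directly.
df.crossJoin(qc)
  .selectExpr(
    "*",
    s"filter(genotypes, (g, i) -> qc[i].callRate >= $threshold) as filtered_genotypes")
```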
Thanks @henrydavidge. It looks like the integer call array field ends up as RLE-encoded PLAIN_DICTIONARY values in this dataset, if I'm interpreting parquet-tools correctly. That also appears to be the default behavior when the number of unique values is low enough, so all fields have either that or a plain encoding. The entire Parquet dataset with the Glow schema is about 5.3G, while the comparable Hail MatrixTable is 2.3G (about 43% as big), so there must be a good ways to go in shrinking the serialized Parquet data.

I did notice in the row page statistics that the sampleIds really dominate the overall space taken. In one page from one file at least, that field accounts for ~60% of all the bytes stored, while calls are 37% and everything else is marginal. I'm not sure there's a better encoding than RLE + PLAIN_DICTIONARY for the calls, but fwiw I think getting those sample IDs out of the same array as the calls would make a huge difference. Or maybe there's some way to get Parquet to recognize that the sample IDs are essentially the same in every row? Either way, I bet this is a big reason why the Hail/PLINK files are so much smaller.

Also, @karenfeng, do you have a specific EC2 image you often use for tests like this? I'm curious why you saw that take 30 minutes while it was more like 90 for me, so I was hoping to better understand what kind of configuration you were running with. Was it local mode or standalone? Maybe that's a big part of the difference, given that the specs of the machines are so similar.
Will this work be made available as open source? @eric-czech, it appears that Intel is trying to make use of Gandiva to generate code that includes SIMD instructions.
@eric-czech I was looking over old issues and saw this. Sorry I never responded... I looked into encodings further, and actually I don't think there's much we can do. On datasets with many samples, each row can grow large enough that we only have a few rows in each Parquet row group. Dictionaries for encoding are built per row group, so if the sample IDs are not repeated many times, we don't get much compression.

You're definitely right that for GWAS-type use cases it's best to get the sample IDs out of the genotypes array. Most of our customers write out the sample IDs and the index separately and use that for filtering. Of course, that's less than ideal from an API perspective, and we're looking at a few different options, like providing functionality for moving sample IDs to/from the column metadata. If you have thoughts, feel free to chime in.
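A minimal sketch of the pattern described in the comment above, with a hypothetical function name and output path, assuming Glow's standard schema in which every row carries the same samples in the same order:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.expr

def splitOutSampleIds(df: DataFrame, outDir: String): DataFrame = {
  // Persist the sample ID list once; it is identical on every row.
  df.selectExpr("transform(genotypes, g -> g.sampleId) as sampleIds")
    .limit(1)
    .write.mode("overwrite").parquet(s"$outDir/sample_ids")

  // Keep only positional genotype data in the main table. A sample is then
  // addressed by its index into the saved sampleIds array, which also keeps
  // the repeated ID strings out of every Parquet row.
  df.withColumn(
    "genotypes",
    expr("transform(genotypes, g -> named_struct('phased', g.phased, 'calls', g.calls))"))
}
```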
Closing this old issue.