Tensorstore afdist #157
Wow @Will-Tyler, you are a legend! I'll catch up properly with my inboxes next week, but two quick things here.
- Don't worry about minor discrepancies in the output. The Zarr Python version doesn't exactly match bcftools either. It doesn't matter as long as the output values are roughly the same.
- Would it be possible to CPU profile your current code? I would use perf on Linux to get a breakdown of where the time is spent. There's probably something similar on Macs (if that's what you use)?
I'd be surprised if it is a sync vs async thing; it seems more likely that TensorStore is doing multiple reads of the same chunk, for whatever reason.
Good call on using a CPU profiler! I used gperftools, and it showed that much of the processing time was spent in the array indexing. I changed the implementation to copy the TensorStore array into a vector, and it's faster now.
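The change described above (copy once into a contiguous vector, then run the hot loop over the vector) can be sketched roughly as follows. This is not the code from the pull request and it does not use the TensorStore API: `StridedView`, `to_flat_vector`, and `count_alt_alleles` are hypothetical names, and `StridedView` only stands in for a generic N-dimensional array whose element access goes through stride arithmetic.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a generic 3-D array view
// (variants x samples x ploidy); NOT a TensorStore type.
struct StridedView {
    const std::int8_t* data;
    std::array<std::ptrdiff_t, 3> shape;
    std::array<std::ptrdiff_t, 3> strides;  // measured in elements

    // Every element access pays for the index arithmetic.
    std::int8_t operator()(std::ptrdiff_t i, std::ptrdiff_t j,
                           std::ptrdiff_t k) const {
        return data[i * strides[0] + j * strides[1] + k * strides[2]];
    }
};

// Copy the view once into a contiguous, row-major buffer.
std::vector<std::int8_t> to_flat_vector(const StridedView& view) {
    std::vector<std::int8_t> flat;
    flat.reserve(static_cast<std::size_t>(
        view.shape[0] * view.shape[1] * view.shape[2]));
    for (std::ptrdiff_t i = 0; i < view.shape[0]; ++i)
        for (std::ptrdiff_t j = 0; j < view.shape[1]; ++j)
            for (std::ptrdiff_t k = 0; k < view.shape[2]; ++k)
                flat.push_back(view(i, j, k));
    return flat;
}

// Hot loop over the flat buffer: count alternate alleles (value > 0)
// for each variant using plain pointer access.
std::vector<std::size_t> count_alt_alleles(
    const std::vector<std::int8_t>& flat, std::size_t num_variants) {
    std::vector<std::size_t> counts(num_variants, 0);
    if (num_variants == 0) return counts;
    const std::size_t per_variant = flat.size() / num_variants;
    for (std::size_t v = 0; v < num_variants; ++v) {
        const std::int8_t* row = flat.data() + v * per_variant;
        for (std::size_t n = 0; n < per_variant; ++n)
            counts[v] += row[n] > 0;
    }
    return counts;
}
```

The copy costs one pass over the data, so it should only pay off when the buffer is read repeatedly or when the generic indexing path is much slower than a flat scan, which is what the profile suggested here.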
Very cool @Will-Tyler! I wanted to mention to y'all that I recently spoke to a group working with Zarr in C++ who created https://github.com/abcucberkeley/cpp-zarr. In their paper they say that
The "column-major" caught my eye, as I presume that's a better fit for our read use case. If we're going to benchmark speed, it may be worth trying out their library if it's not too much trouble!
This is awesome, thanks @Will-Tyler! I think we'll merge this much and I'll have a play with it when I get a chance. I'll follow up with further issues then.
So yeah, I'm happy to merge this now if you are.
Do you want me to implement finding the bin index? My implementation simply multiplies the frequency value by 10 and floors the result to calculate the bin index, whereas the other programs appear to search for the correct bin index. If not, feel free to merge!
Let's merge this - we can tweak the output later if needs be. Thanks again - I had hit a wall on this!
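The two ways of finding a bin mentioned above can be compared in a small sketch. It assumes 10 uniform bins over [0, 1] with a frequency of exactly 1.0 clamped into the last bin; the actual bin edges used by bcftools and the Python implementation are not shown in this thread, and the function names below are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Build num_bins + 1 uniform edges over [0, 1] (an assumption for this sketch).
std::vector<double> uniform_edges(std::size_t num_bins) {
    std::vector<double> edges(num_bins + 1);
    for (std::size_t i = 0; i <= num_bins; ++i)
        edges[i] = static_cast<double>(i) / static_cast<double>(num_bins);
    return edges;
}

// Direct computation: scale the frequency by the number of bins and floor.
// A frequency of exactly 1.0 is clamped into the last bin.
std::size_t bin_index_floor(double freq, std::size_t num_bins) {
    const auto index = static_cast<std::size_t>(
        std::floor(freq * static_cast<double>(num_bins)));
    return std::min(index, num_bins - 1);
}

// Search-based: find i such that edges[i] <= freq < edges[i + 1].
// `edges` must be sorted and contain at least two entries.
std::size_t bin_index_search(double freq, const std::vector<double>& edges) {
    const auto it = std::upper_bound(edges.begin(), edges.end(), freq);
    const std::size_t index =
        (it == edges.begin())
            ? 0
            : static_cast<std::size_t>(it - edges.begin()) - 1;
    return std::min(index, edges.size() - 2);
}
```

Away from bin boundaries the two approaches agree. For frequencies that sit on or very near an edge, floating-point rounding in the multiplication or in the edge values can push a value into a neighbouring bin, which is one possible source of small differences between implementations.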
Overview
We want to compare the performance of an operation on the entire genotype matrix using C++ to read VCF Zarr data against classical approaches using VCF data. To that end, this pull request implements the bcftools afdist program in C++, using the TensorStore library to read genotype data stored in VCF Zarr format.
This pull request is based on @jeromekelleher's #154.
I believe additional work is needed here (see discussion), but I wanted to open this pull request as a draft for others to examine and/or share ideas.
Example usage
Performance
Using the $10^4$-sample genotype data,
For comparison, bcftools on the same data does
The Python-Zarr afdist implementation does
Discussion
The performance of the TensorStore implementation is surprisingly slow. I suspect that the slowest step is synchronously reading the chunks from the store. I will see if I can implement asynchronous chunk reading.
The numbers in the TensorStore implementation's output are slightly larger than the numbers in bcftools' output. I also need to determine what the discrepancy is here.