Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tensorstore afdist #157

Merged
merged 6 commits into from
Aug 27, 2024
Merged

Conversation

Will-Tyler
Copy link
Contributor

Overview

We want to compare the performance of an operation on the entire genotype matrix using C++ to read VCF Zarr data to classical approaches using VCF data. To that end, this pull request implements the bcftools afdist program in C++ using the TensorStore library to read some genotype data stored in VCF format.

This pull request is based on @jeromekelleher's #154.

I believe additional work is needed here (see discussion), but I wanted to open this pull request as a draft for others to examine and/or share ideas.

Example usage

./software/ts-afdist/build/ts_afdist scaling/data/chr21_10_2.zarr/call_genotype
# PROB_DIST, genotype probability distribution, assumes HWE
PROB_DIST       0       0.1     431747
PROB_DIST       0.1     0.2     475194
PROB_DIST       0.2     0.3     585208
PROB_DIST       0.3     0.4     833974
PROB_DIST       0.4     0.5     1855856
PROB_DIST       0.5     0.6     188051
PROB_DIST       0.6     0.7     211121
PROB_DIST       0.7     0.8     169395
PROB_DIST       0.8     0.9     191841
PROB_DIST       0.9     1       745980

Performance

Using the $10^4$-samples genotype data,

./software/ts-afdist/build/ts_afdist   77.48s user 2.43s system 98% cpu 1:20.74 total

For comparison, bcftools tools on the same data does

BCFTOOLS_PLUGINS=software/bcftools-1.18/plugins ./software/bcftools +af-dist   47.42s user 0.28s system 99% cpu 47.840 total

The Python-Zarr afdist implementation does

scaling/data/chr21_10_4.ts n=10000, m=863998
   num_samples  num_sites  tool  user_time  sys_time  wall_time storage
0        10000     863998  zarr  31.375917   2.02435   30.13042     hdd

Discussion

The performance of the TensorStore implementation is surprisingly slow. I suspect that the slowest step is synchronously reading the chunks from the store. I will see if I can implement asynchronous chunk reading.

The numbers in the TensorStore implementation's output are slightly larger than the numbers in bcftools' output. I also need to determine what the discrepancy is here.

References

@Will-Tyler Will-Tyler marked this pull request as draft August 23, 2024 17:38
Copy link
Contributor

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow @Will-Tyler , you are a legend! I'll catch up properly with my inboxes next week, but teo quick things here.

  1. Don't worry about minor discrepancies in the output. The Zarr python version doesn't exactly match bcftools either. It doesn't matter once the values output are roughly the same
  2. Would it be possible to cpu profile your current code? I would use perf on Linux to get a breakdown of where the time is spent. There's probably something similar on macs (if that's what you use)?

I'd be surprised if it is a sync Vs async thing, seems more likely than tensorstore is doing multiple reads of the same chunk, for whatever reason

@Will-Tyler
Copy link
Contributor Author

Good call on using a CPU profiler! I used gperftools, and it showed that much of the processing was in the array indexing. I changed the implementation to copy the TensorStore array into a vector, and it's faster now.

Performance

On the $10^4$-samples dataset, the implementation now acheives

./software/ts-afdist/build/ts_afdist   35.87s user 1.31s system 97% cpu 38.217 total

@hammer
Copy link

hammer commented Aug 27, 2024

Very cool @Will-Tyler! I wanted to mention to y'all that I recently spoke to a group working with Zarr in C++ who created https://github.com/abcucberkeley/cpp-zarr. In their paper they say that

compared to TensorStore, Cpp-Zarr is 2.2× and 1.5× faster for reading and writing, respectively, for their preferred data orders (row-major in TensorStore and column-major in Cpp-Zarr)

The "column-major" caught my eye, as I presume that's a better fit for our read use case.

If we're going to benchmark speed, it may be worth trying out their library if it's not too much trouble!

@jeromekelleher jeromekelleher mentioned this pull request Aug 27, 2024
@jeromekelleher
Copy link
Contributor

That's super helpful, thanks @hammer! I'll poke around a bit, but I think we'll want the tensorstore version regardless. It's not clear whether the cpp-zarr version supports cloud stores, e.g.

I've opened #158 to track citing cpp-zarr

@jeromekelleher
Copy link
Contributor

This is awesome thanks @Will-Tyler! I think we'll merge this much and I'll have a play with it when I get a chance. I'll follow up with further issues then.

@jeromekelleher
Copy link
Contributor

So yeah, I'm happy to merge this now if you are.

@Will-Tyler
Copy link
Contributor Author

Do you want me to implement finding the bin index? My implementation simply multiplies the frequency value by 10 and floors the result to calculate the bin index, whereas the other programs appear to search for the correct bin index.

If not, feel free to merge!

@jeromekelleher jeromekelleher marked this pull request as ready for review August 27, 2024 14:59
@jeromekelleher
Copy link
Contributor

Let's merge this - we can tweak the output later if needs be.

Thanks again - I had hit a wall on this!

@jeromekelleher jeromekelleher merged commit f5a2db9 into sgkit-dev:main Aug 27, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants