Tensorstore afdist #157

Will-Tyler · 2024-08-23T17:38:38Z

Overview

We want to compare the performance of an operation on the entire genotype matrix using C++ to read VCF Zarr data to classical approaches using VCF data. To that end, this pull request implements the bcftools afdist program in C++ using the TensorStore library to read some genotype data stored in VCF format.

This pull request is based on @jeromekelleher's #154.

I believe additional work is needed here (see discussion), but I wanted to open this pull request as a draft for others to examine and/or share ideas.

Example usage

./software/ts-afdist/build/ts_afdist scaling/data/chr21_10_2.zarr/call_genotype
# PROB_DIST, genotype probability distribution, assumes HWE
PROB_DIST       0       0.1     431747
PROB_DIST       0.1     0.2     475194
PROB_DIST       0.2     0.3     585208
PROB_DIST       0.3     0.4     833974
PROB_DIST       0.4     0.5     1855856
PROB_DIST       0.5     0.6     188051
PROB_DIST       0.6     0.7     211121
PROB_DIST       0.7     0.8     169395
PROB_DIST       0.8     0.9     191841
PROB_DIST       0.9     1       745980

Performance

Using the $10^4$-samples genotype data,

./software/ts-afdist/build/ts_afdist   77.48s user 2.43s system 98% cpu 1:20.74 total

For comparison, bcftools tools on the same data does

BCFTOOLS_PLUGINS=software/bcftools-1.18/plugins ./software/bcftools +af-dist   47.42s user 0.28s system 99% cpu 47.840 total

The Python-Zarr afdist implementation does

scaling/data/chr21_10_4.ts n=10000, m=863998
   num_samples  num_sites  tool  user_time  sys_time  wall_time storage
0        10000     863998  zarr  31.375917   2.02435   30.13042     hdd

Discussion

The performance of the TensorStore implementation is surprisingly slow. I suspect that the slowest step is synchronously reading the chunks from the store. I will see if I can implement asynchronous chunk reading.

The numbers in the TensorStore implementation's output are slightly larger than the numbers in bcftools' output. I also need to determine what the discrepancy is here.

References

jeromekelleher

Wow @Will-Tyler , you are a legend! I'll catch up properly with my inboxes next week, but teo quick things here.

Don't worry about minor discrepancies in the output. The Zarr python version doesn't exactly match bcftools either. It doesn't matter once the values output are roughly the same
Would it be possible to cpu profile your current code? I would use perf on Linux to get a breakdown of where the time is spent. There's probably something similar on macs (if that's what you use)?

I'd be surprised if it is a sync Vs async thing, seems more likely than tensorstore is doing multiple reads of the same chunk, for whatever reason

Will-Tyler · 2024-08-25T02:44:07Z

Good call on using a CPU profiler! I used gperftools, and it showed that much of the processing was in the array indexing. I changed the implementation to copy the TensorStore array into a vector, and it's faster now.

Performance

On the $10^4$-samples dataset, the implementation now acheives

./software/ts-afdist/build/ts_afdist   35.87s user 1.31s system 97% cpu 38.217 total

hammer · 2024-08-27T04:53:09Z

Very cool @Will-Tyler! I wanted to mention to y'all that I recently spoke to a group working with Zarr in C++ who created https://github.com/abcucberkeley/cpp-zarr. In their paper they say that

compared to TensorStore, Cpp-Zarr is 2.2× and 1.5× faster for reading and writing, respectively, for their preferred data orders (row-major in TensorStore and column-major in Cpp-Zarr)

The "column-major" caught my eye, as I presume that's a better fit for our read use case.

If we're going to benchmark speed, it may be worth trying out their library if it's not too much trouble!

jeromekelleher · 2024-08-27T11:10:25Z

That's super helpful, thanks @hammer! I'll poke around a bit, but I think we'll want the tensorstore version regardless. It's not clear whether the cpp-zarr version supports cloud stores, e.g.

I've opened #158 to track citing cpp-zarr

jeromekelleher · 2024-08-27T11:11:51Z

This is awesome thanks @Will-Tyler! I think we'll merge this much and I'll have a play with it when I get a chance. I'll follow up with further issues then.

jeromekelleher · 2024-08-27T11:12:26Z

So yeah, I'm happy to merge this now if you are.

Will-Tyler · 2024-08-27T14:44:52Z

Do you want me to implement finding the bin index? My implementation simply multiplies the frequency value by 10 and floors the result to calculate the bin index, whereas the other programs appear to search for the correct bin index.

If not, feel free to merge!

jeromekelleher · 2024-08-27T14:59:55Z

Let's merge this - we can tweak the output later if needs be.

Thanks again - I had hit a wall on this!

jeromekelleher and others added 5 commits August 6, 2024 16:54

Add outline for tensorstore af-dist

10a284c

Sort-of working C++ tensorstore code

211ce22

Working tensorstore af-dist program

2223782

Iterate over chunks

8feba44

Handle missing genotypes

78e79b6

Will-Tyler marked this pull request as draft August 23, 2024 17:38

jeromekelleher reviewed Aug 23, 2024

View reviewed changes

Make program faster

68885dd

jeromekelleher mentioned this pull request Aug 27, 2024

Cite cpp-zarr #158

Open

jeromekelleher marked this pull request as ready for review August 27, 2024 14:59

jeromekelleher merged commit f5a2db9 into sgkit-dev:main Aug 27, 2024
1 check passed

jeromekelleher mentioned this pull request Aug 28, 2024

Python Tensorstore af-dist implementation #161

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tensorstore afdist #157

Tensorstore afdist #157

Will-Tyler commented Aug 23, 2024

jeromekelleher left a comment

Will-Tyler commented Aug 25, 2024

hammer commented Aug 27, 2024

jeromekelleher commented Aug 27, 2024

jeromekelleher commented Aug 27, 2024

jeromekelleher commented Aug 27, 2024

Will-Tyler commented Aug 27, 2024

jeromekelleher commented Aug 27, 2024

Tensorstore afdist #157

Tensorstore afdist #157

Conversation

Will-Tyler commented Aug 23, 2024

Overview

Example usage

Performance

Discussion

References

jeromekelleher left a comment

Choose a reason for hiding this comment

Will-Tyler commented Aug 25, 2024

Performance

hammer commented Aug 27, 2024

jeromekelleher commented Aug 27, 2024

jeromekelleher commented Aug 27, 2024

jeromekelleher commented Aug 27, 2024

Will-Tyler commented Aug 27, 2024

jeromekelleher commented Aug 27, 2024