Ancestor generation requires entire genotype array to be in-memory. #806
Comments
I'm not sure we can do that @benjeffery - it's the C implementation of AncestorBuilder that's being called, and that's making some hard assumptions about fitting the genotypes into memory IIRC. We'll need to think carefully about this.
Yes, this gets tricky pretty fast. One way would be to give the C code a callback to get genotypes, but the memory management will be fiddly.
If we packed down to 1 bit per genotype, would we manage? We can exclude singletons etc., so not all sites need to be considered.
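For biallelic 0/1 genotypes the 1-bit packing itself is straightforward with numpy; a minimal sketch (illustrative only, not tsinfer's actual code, and assuming genotypes are currently held as one int8 each, so packing gives roughly an eight-fold saving):

```python
import numpy as np

# Hypothetical genotype matrix (sites x samples), biallelic 0/1 stored as int8.
genotypes = np.random.randint(0, 2, size=(1_000, 5_000), dtype=np.int8)

# Pack along the samples axis: 8 genotypes per byte.
packed = np.packbits(genotypes, axis=1)

# Unpack a single site on demand; count= trims the padding bits.
site_42 = np.unpackbits(packed[42], count=genotypes.shape[1])
assert np.array_equal(site_42, genotypes[42])
```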
Maybe simpler to do a numba version of the ancestor builder and keep it all in Python?
Yeah, I was thinking along the same lines, but good to look at the possibilities.
Numba/zarr won't be trivial either because we'll have to explicitly work in chunks |
It is my understanding that the algorithm sweeps left and right from the focal site until half the samples are evicted. If we had a way to determine a rough distance from the focal site where (say) 99% of ancestor finding is contained, we could load that chunk and keep the existing code the same (except for adding an additional stopping criterion when you hit the end of the loaded chunk). Will have a think on my bike ride in.
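A rough sketch of that windowed idea, assuming a genotype array with sites as the first dimension that supports numpy-style slicing (e.g. a zarr array). The function name, the `window` parameter and the "eviction" rule are all stand-ins for illustration, not the real AncestorBuilder logic:

```python
import numpy as np

def sweep_from_focal(genotypes, focal_site, window):
    """Sweep outwards from focal_site using only a window of loaded sites.

    genotypes: anything supporting numpy-style slicing (e.g. a zarr array),
    shaped (num_sites, num_samples). window is a hypothetical parameter chosen
    so that, say, 99% of ancestors end within it.
    """
    num_sites, num_samples = genotypes.shape
    lo = max(0, focal_site - window)
    hi = min(num_sites, focal_site + window + 1)
    chunk = np.asarray(genotypes[lo:hi])  # only this slice is decoded in memory

    active = chunk[focal_site - lo] == 1  # samples carrying the derived allele

    for offset in range(1, window + 1):
        for site in (focal_site - offset, focal_site + offset):
            if site < lo or site >= hi:
                # The extra stopping criterion: we ran off the loaded chunk.
                return lo, hi
            # Toy stand-in for the real eviction rule: evict samples that
            # disagree with the majority of the active set at this site.
            states = chunk[site - lo][active]
            majority = np.bincount(states).argmax() if states.size else 0
            active &= chunk[site - lo] == majority
            if active.sum() <= num_samples // 2:
                # Existing criterion: half the samples have been evicted.
                return focal_site - offset, focal_site + offset + 1
    return lo, hi
```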
@benjeffery and I discussed this and we think a pragmatic approach may be to bit-pack the genotype data to reduce the memory requirements four-fold. |
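A sketch of what the four-fold packing could look like, assuming allele values fit in two bits (0-3, with one value perhaps reserved for missing data) and genotypes currently held as int8; this is illustrative rather than the agreed implementation:

```python
import numpy as np

def pack_2bit(genotypes):
    """Pack a genotype row (values 0-3) into 2 bits per genotype."""
    g = np.asarray(genotypes, dtype=np.uint8)
    pad = (-len(g)) % 4
    g = np.concatenate([g, np.zeros(pad, dtype=np.uint8)]).reshape(-1, 4)
    # Four genotypes per output byte.
    return (g[:, 0] | (g[:, 1] << 2) | (g[:, 2] << 4) | (g[:, 3] << 6)).astype(np.uint8)

def unpack_2bit(packed, n):
    """Recover the first n genotypes from a 2-bit packed row."""
    p = np.asarray(packed, dtype=np.uint8)
    out = np.empty((len(p), 4), dtype=np.uint8)
    for i in range(4):
        out[:, i] = (p >> (2 * i)) & 0b11
    return out.reshape(-1)[:n]

row = np.array([0, 1, 2, 0, 1, 3, 0], dtype=np.int8)
assert np.array_equal(unpack_2bit(pack_2bit(row), len(row)), row)
```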
Can you run it on a large machine in the cloud? On GCP for example, the M2 machine series has up to 12TB of memory. |
It's a good point - we're definitely looking at the "let's throw some money at it" approach, but there are complexities around where these datasets can be processed because of data access agreements.
As currently implemented, AncestorsGenerator.add_sites loads the entire sample data genotype array into memory. As one of our currently intended inference targets has 1.8TB of genotypes, this is not possible.

We can of course read the genotypes from the zarr-backed sample data as needed (in break_ancestor, make_ancestor and compute_ancestral_state), but care will need to be taken to have some kind of decoded-chunk caching mechanism. I thought this might be simple to do with zarr, but zarr-developers/zarr-python#306 has been open for a long time. A simple FIFO or LRU cache might not be too difficult to implement; as long as the cache is larger than the typical ancestor length, chunks shouldn't need to be loaded more than once.
@savitakartik This is what is OOM killing your jobs!
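For the caching piece described above, a minimal sketch of an LRU-cached chunk reader on top of a zarr array. The class name, max_cached_chunks and the chunking assumption (sites along the first dimension) are made up for illustration; the real thing would have to match the sample data's actual layout:

```python
import functools
import zarr

class CachedGenotypes:
    """Read per-site genotypes from a zarr array, keeping decoded chunks in an LRU cache."""

    def __init__(self, zarr_array, max_cached_chunks=32):
        self.arr = zarr_array
        self.chunk_size = zarr_array.chunks[0]  # chunking along the sites dimension
        # lru_cache evicts the least recently used decoded chunk once full.
        self._load_chunk = functools.lru_cache(maxsize=max_cached_chunks)(self._read_chunk)

    def _read_chunk(self, chunk_index):
        start = chunk_index * self.chunk_size
        stop = min(start + self.chunk_size, self.arr.shape[0])
        return self.arr[start:stop]  # decodes one chunk of sites into memory

    def site(self, site_index):
        chunk_index, offset = divmod(site_index, self.chunk_size)
        return self._load_chunk(chunk_index)[offset]

# Usage sketch with an in-memory zarr array standing in for the sample data.
z = zarr.zeros((10_000, 500), chunks=(100, 500), dtype="i1")
cached = CachedGenotypes(z, max_cached_chunks=8)
g = cached.site(4321)  # genotypes for one site; its chunk is now cached
```

As long as max_cached_chunks covers more sites than a typical ancestor spans, each chunk should only be decoded once per sweep, which matches the FIFO/LRU reasoning above.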