Add ability to read from cloud storage #22
Perhaps with the object_store crate?
Very interesting idea. Thanks! Object_store's vectored_read seems especially on-point. Timeline: I should be able to take a more detailed look in a few weeks.
https://github.com/roeap/object-store-python seems to be a Python interface. Thanks for the consideration, Carl!
Also, some details on our use case: we are finally writing up a publication on …
A request for more info to allow eventual performance testing ...
1 million samples by 6.8 million variants. Our use case is to read the entire BED file and write it back out in the Zarr storage format. Currently we don't do anything intelligent to read and write in blocks, unfortunately, but that's something we should probably be able to add in our library, as we do when reading large VCF files.
Oh and for first choice cloud provider I'm currently using Google, but I'd say Amazon is the most common and I'd be happy to switch to them if you prefer. |
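A minimal sketch of the blocked read-and-write approach mentioned above, assuming bed-reader's Python `open_bed` API and the `zarr` package; the file names and the 10,000-variant chunk width are illustrative:

```python
# Hedged sketch: copy a PLINK .bed file into Zarr one block of variants at a
# time instead of one giant read. Paths and chunk width are illustrative.
import numpy as np
import zarr
from bed_reader import open_bed

CHUNK = 10_000  # variants per block (illustrative)

with open_bed("big_file.bed") as bed:  # hypothetical input file
    out = zarr.open(
        "big_file.zarr",  # hypothetical output store
        mode="w",
        shape=(bed.iid_count, bed.sid_count),
        chunks=(bed.iid_count, CHUNK),
        dtype="int8",
    )
    for start in range(0, bed.sid_count, CHUNK):
        stop = min(start + CHUNK, bed.sid_count)
        # Read only this slice of variants, then write it to the Zarr array.
        out[:, start:stop] = bed.read(index=np.s_[:, start:stop], dtype="int8")
```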
Thanks for the additional info.
Back-of-envelope (actually Excel): 1M samples × 6.8M variants × 8% (chromosome 1) × 1 byte (i8 values) / (1000×1000×1000) = 544 GB.
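The same check in a few lines of Python, for anyone who wants to adjust the assumptions (the 8% chromosome-1 fraction is the estimate used above):

```python
# Sanity-check the size estimate: i8 genotype values are 1 byte each.
samples, variants = 1_000_000, 6_800_000
chr1_fraction = 0.08  # chromosome 1 taken as ~8% of variants, per the estimate above
size_gb = samples * variants * chr1_fraction * 1 / 1e9  # 1 byte per value
print(size_gb)  # 544.0 -> about 544 GB
```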
Can you really handle a numpy array that big?
AWS (or Azure) would be better for me. I've used it before and have Amazon connections if needed.
I'm making progress. Below is some (working) example code showing the current API:

```rust
// Imports assumed from the bed_reader and object_store crates; this code
// runs inside an async fn that returns a Result.
use std::sync::Arc;
use bed_reader::{sample_bed_file, BedCloud, BedErrorPlus};
use object_store::local::LocalFileSystem;
use object_store::path::Path as StorePath;

// Open a local file through the object_store abstraction.
let object_store = Arc::new(LocalFileSystem::new());
let file_path = sample_bed_file("plink_sim_10s_100v_10pmiss.bed")?;
let store_path = StorePath::from_filesystem_path(file_path).map_err(BedErrorPlus::from)?;
let mut bed_cloud = BedCloud::new((object_store, store_path)).await?;
let val = bed_cloud.read::<i8>().await?;
println!("{val:?}");
```

Another example:

```rust
use bed_reader::{sample_bed_object_path, ReadOptions};

// Read with options: count allele 2 and return i8 values.
let object_path = sample_bed_object_path("plink_sim_10s_100v_10pmiss.bed")?;
let mut bed_cloud = BedCloud::new(object_path).await?;
let val = ReadOptions::builder()
    .count_a2()
    .i8()
    .read_cloud(&mut bed_cloud)
    .await?;
let mean = val.mapv(|elem| elem as f64).mean().unwrap();
println!("{mean:?}");
assert!(mean == -13.274); // really shouldn't do mean on data where -127 represents missing
```

@hammer Let me know if this looks like it would work well with your use case.
Thanks Carl! Let me cc @tomwhite as I believe he wrote the BED reader code in …
Yeah, I kind of forgot that everyone (including my other projects) uses the library from Python, not from Rust. :-) My next steps: …
@hammer or @tomwhite, if you have or know of any large, non-private PLINK .bed files on AWS (or Google, if necessary) that I can use for performance testing, please let me know.
@CarlKCarlK Happy new year! I've been using the files from https://www.ebi.ac.uk/biostudies/studies/S-BSST936 for testing. I have all of the smaller files (600 samples) and …
Please see the end of this note for something for you to try. Meanwhile, some observations. I'm finding this feature somewhat frustrating. Let me share some of those frustrations and perhaps you can make suggestions (or just offer sympathy 😊).
None of this means the feature isn't worth adding. I just think the state of cloud/async/etc. is a bit more primitive than I would have thought.
============
The documentation is here: https://fastlmm.github.io/bed-reader/1.0.1beta/
Here is sample usage:
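A minimal sketch of such usage, assuming the cloud-URL support described in those docs; the bucket name, region option, and file path are illustrative:

```python
# Hedged sketch: open a .bed file from cloud storage via a URL, assuming
# open_bed accepts a URL plus a cloud_options mapping as in the 1.0.1beta
# docs. Bucket name, region, and file path are illustrative.
import numpy as np
from bed_reader import open_bed

url = "s3://my-bucket/plink_sim_10s_100v_10pmiss.bed"  # hypothetical bucket
with open_bed(url, cloud_options={"aws_region": "us-west-2"}) as bed:
    val = bed.read(index=np.s_[:, :10], dtype="int8")  # first 10 variants only
    print(val.shape)
```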
Thanks, Carl, I'll check it out! To your point about the state of cloud/async being more primitive than expected, we've found the same thing. As one example, the primary AWS Python library does not support asyncio: boto/botocore#458.
I'm very excited because bed-reader can now efficiently read from regular web servers. This lets us, for example, read a SNP (almost instantly) directly from the S-BSST936 website!
More examples here: https://fastlmm.github.io/bed-reader/1.0.1beta/cloud_urls.html. Please install the 1.0.1beta release. The full documentation is here: https://fastlmm.github.io/bed-reader/1.0.1beta/
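A minimal sketch of that kind of read, assuming the URL support shown in the cloud_urls examples; the exact file URL on the S-BSST936 site is illustrative:

```python
# Hedged sketch: read one SNP (one variant column) straight from a web
# server, fetching only the bytes needed. The file URL is illustrative.
import numpy as np
from bed_reader import open_bed

url = "https://www.ebi.ac.uk/biostudies/files/S-BSST936/example.bed"  # hypothetical path
with open_bed(url) as bed:
    one_snp = bed.read(index=np.s_[:, 100], dtype="int8")  # variant at index 100
    print(one_snp.shape)  # (samples, 1)
```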
Thanks Carl! I'm on vacation this week but am excited to try this out next week when I'm back.
Support for cloud files is now released as version 1.0.2 on PyPI. Thanks for your suggestion! (I also used this as an excuse to write an article about adding this feature to the Rust side of the code. It will hopefully be published soon on https://medium.com/@carlmkadie.)